PDF to plain text
This workflow extracts the plain text content of the PDF files supplied to its input port. You can connect the Load PDF from directory workflow to this workflow's input. We recommend sending the output of this workflow to the Clean plain text workflow, because the PDF-to-text process can introduce characters that are invalid in XML, and text containing them cannot be sent to most services as plain text. Another way around this problem is to encode the text as Base64 using the local service ("Encode Byte Array to Base 64") included with Taverna, although this requires a downstream service that knows to decode the Base64 back to text, which is not common. The PDF to text service uses the "pdftotext" executable from Xpdf.
This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.
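The extraction and clean-up steps described above can be approximated outside Taverna. The following Python sketch is only an illustration and is not the workflow's actual implementation: the function names (pdf_to_text, strip_xml_invalid, encode_base64) are hypothetical, it assumes the "pdftotext" executable is installed and on the PATH, and it assumes "XML-invalid" means characters outside the XML 1.0 Char range. One common example is the form feed that pdftotext writes between pages.

```python
import base64
import re
import subprocess

# Matches any character NOT allowed by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_INVALID = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def pdf_to_text(pdf_path):
    """Extract text from a PDF by calling pdftotext (assumed to be installed).

    The trailing "-" tells pdftotext to write the text to standard output.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def strip_xml_invalid(text, replacement=""):
    """Remove (or replace) characters that cannot appear in an XML 1.0 document."""
    return _XML_INVALID.sub(replacement, text)


def encode_base64(text):
    """Alternative to cleaning: wrap the raw text as Base64 for transport.

    The receiving service must decode it again, as noted in the description.
    """
    return base64.b64encode(text.encode("utf-8")).decode("ascii")
```

A typical use would be strip_xml_invalid(pdf_to_text("paper.pdf")), where "paper.pdf" is a placeholder file name. Cleaning is usually the more practical choice, since Base64 only helps when the receiving service decodes it.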
Run
To run this workflow in the Taverna Workbench, copy and paste this link into File > 'Open workflow location...':
http://www.myexperiment.org/workflows/1058/download?version=1
Other workflows that use similar services (2)
Terms from collection of PDF files (2)
Created: 2010-02-19 | Last updated: 2011-12-13
Credits: James Eales
From PDF to lemmatized text (1)
Created: 2010-09-16 | Last updated: 2012-01-18
Credits: Netr James Eales
Attributions: PDF to plain text, Clean plain text