wf4ever_PDF2TXT2Solr_Database
Created: 2013-07-24 13:12:54
This workflow extracts the text of a .pdf file and stores it in a .txt file. Then it stores the .txt file in a Solr database.
Workflow Components
Descriptions (1)
This workflow extracts the text of a PDF file and saves it locally as a .txt file (in the same directory as the PDF file).
It then stores the newly created .txt file in a Solr database.
Before running this workflow, make sure that Solr is running and that the pathToPostJar variable points to the correct location of post.jar in your Solr installation.
Also make sure that pdftotext is installed and can be called from the command line or terminal. To test this, run:
$ pdftotext -h
If Solr is running locally, you can check whether the files have been stored by browsing to the following location: http://localhost:8983/solr/#/
Dependencies:
- pdftotext
- Solr
Developed and tested on Fedora |
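The prerequisites above can also be checked from a script before the workflow is started. The following is a minimal Python sketch and is not part of the workflow itself. The function name missing_prerequisites and its parameters are hypothetical. It only checks that pdftotext is on the PATH and that a file exists at the post.jar path; it does not check whether Solr is running.

```python
import os
import shutil


def missing_prerequisites(post_jar_path, which=shutil.which, is_file=os.path.isfile):
    """Return the names of the prerequisites that could not be found.

    `which` and `is_file` are injectable so the check can be exercised
    without pdftotext or Solr being installed (hypothetical helper).
    """
    missing = []
    # pdftotext must be callable from the command line, i.e. be on the PATH.
    if which("pdftotext") is None:
        missing.append("pdftotext")
    # pathToPostJar must point at an existing post.jar file.
    if not is_file(post_jar_path):
        missing.append("post.jar")
    return missing
```

Calling missing_prerequisites with the value of pathToPostJar returns an empty list when both checks pass.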
Dependencies (0)
Inputs (1)
Name | Description
userSuppliedPDFCorpus | A list of the PDF files that should have their text extracted and stored in the Solr database.
Processors (4)
Name |
Type |
Description |
pdftotext |
externaltool |
pdftotext
Input:
- Path to the PDF file (input)
- Path to the text file (output)
Output:
- A text file containing the text of the PDF file
This processor calls the pdftotext program from the terminal or command line. pdftotext extracts the text of a PDF file and stores it in a .txt file.
Make sure that pdftotext is installed on the system and can be called from the terminal or command line. To test whether it is installed, run the following command:
$ pdftotext -h
Copyright of pdftotext:
1996-2004 Glyph & Cog, LLC. |
SolrImport |
externaltool |
SolrImport takes the path of the .txt file and stores the file in a Solr database.
Make sure that Solr is running and that the pathToPostJar variable contains the correct path to post.jar.
If Solr is running locally, you can check whether the files have been stored by browsing to the following location: http://localhost:8983/solr/#/
Solr can be downloaded at:
http://lucene.apache.org/solr/ |
createFileLocation |
beanshell |
createFileLocation uses the path of the PDF file and adds the string ".txt" to create the output location for pdftotext.
Script:
// We use the path of the PDF file and add the extension .txt to build the output file path.
// Afterwards pdftotext can convert the file and write the text to the new path.
TXTLocation = PDFLocation + ".txt" |
pathToPostJar |
stringconstant |
This variable points to the post.jar that Solr uses to add files to the database.
Value: /home/sander/Downloads/solr-4.3.1/example/exampledocs/post.jar |
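Taken together, the processors above amount to deriving an output path and building two command lines. The following is a minimal Python sketch of that idea, written outside the workflow engine; all function names are hypothetical. The pdftotext call uses the tool's usual form (input PDF first, output text file second). The exact command line that SolrImport runs is not shown on this page, so `java -jar post.jar <file>` is an assumption, and depending on the Solr setup additional -D system properties may be needed; the hypothetical java_properties parameter is a place to pass them.

```python
def txt_location(pdf_location):
    # Same rule as createFileLocation: append ".txt" to the full PDF path,
    # so "paper.pdf" becomes "paper.pdf.txt" (the ".pdf" part is kept).
    return pdf_location + ".txt"


def pdftotext_command(pdf_location, txt_file_location):
    # pdftotext takes the input PDF first and the output text file second.
    return ["pdftotext", pdf_location, txt_file_location]


def solr_import_command(path_to_post_jar, input_file, java_properties=None):
    # ASSUMPTION: the import is done with "java -jar post.jar <file>".
    # java_properties is a hypothetical hook for -Dname=value options.
    command = ["java"]
    for name, value in (java_properties or {}).items():
        command.append("-D{}={}".format(name, value))
    command += ["-jar", path_to_post_jar, input_file]
    return command
```

Each function returns an argument list in the form accepted by Python's subprocess.run, so nothing is executed until the caller decides to run it.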
Beanshells (1)
Name | Description | Inputs | Outputs
createFileLocation | createFileLocation uses the path of the PDF file and adds the string ".txt" to create the output location for pdftotext. | PDFLocation | TXTLocation
Outputs (4)
Name | Description
pdftotext_STDOUT | The standard output of the pdftotext process.
pdftotext_STDERR | The standard error of the pdftotext process.
SOLRInport_STDOUT | The standard output of the SolrImport process.
SolrInport_STDERR | The standard error of the SolrImport process.
Datalinks (9)
Source | Sink
userSuppliedPDFCorpus | pdftotext:PDFFileLocation
createFileLocation:TXTLocation | pdftotext:TXTFileLocation
createFileLocation:TXTLocation | SolrImport:inputFile
pathToPostJar:value | SolrImport:pathToPostJar
userSuppliedPDFCorpus | createFileLocation:PDFLocation
pdftotext:STDOUT | pdftotext_STDOUT
pdftotext:STDERR | pdftotext_STDERR
SolrImport:STDOUT | SOLRInport_STDOUT
SolrImport:STDERR | SolrInport_STDERR
Coordinations (1)
Controller | Target
pdftotext | SolrImport
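The datalinks and the coordination above describe the order of execution: for each PDF path the .txt path is derived, pdftotext is run, and only then is SolrImport run on the resulting file. The following minimal Python sketch illustrates that order. The function run_workflow and its run parameter are hypothetical, the `java -jar post.jar <file>` command line is an assumption (this page does not show the actual command), and skipping the import when pdftotext exits with a non-zero status is a choice made for this sketch rather than documented behaviour. The output keys are spelled exactly like the workflow's output ports.

```python
import subprocess


def run_workflow(pdf_paths, path_to_post_jar, run=subprocess.run):
    """Run pdftotext, then the Solr import, for every PDF path in the list."""
    outputs = {
        "pdftotext_STDOUT": [],
        "pdftotext_STDERR": [],
        "SOLRInport_STDOUT": [],
        "SolrInport_STDERR": [],
    }
    for pdf_path in pdf_paths:
        # createFileLocation: the output path is the PDF path plus ".txt".
        txt_path = pdf_path + ".txt"

        # pdftotext is the controller, so it has to finish first.
        extract = run(["pdftotext", pdf_path, txt_path],
                      capture_output=True, text=True)
        outputs["pdftotext_STDOUT"].append(extract.stdout)
        outputs["pdftotext_STDERR"].append(extract.stderr)
        if extract.returncode != 0:
            # Sketch-only choice: do not import if the extraction failed.
            continue

        # SolrImport is the target; ASSUMED command line for post.jar.
        post = run(["java", "-jar", path_to_post_jar, txt_path],
                   capture_output=True, text=True)
        outputs["SOLRInport_STDOUT"].append(post.stdout)
        outputs["SolrInport_STDERR"].append(post.stderr)
    return outputs
```

Passing a replacement for run makes it possible to inspect the commands without pdftotext, Java, or Solr being installed.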
Version 1 (of 1)