Chemical structure extractor - img2structure
Extracts images of chemical structures from PDFs and converts them to usable structure data using the OSRA binaries.
An in-progress (incomplete) workflow.
This workflow makes use of the External Tool node to access the OSRA structure recognition binaries.
You must therefore have a functioning installation of OSRA and its dependencies. Building it may require advanced compiler knowledge on your platform, and may not be a trivial task.
OSRA 1.4 is free for both the source & binary distribution. OSRA 2.0 source is free, but the binaries require a small fee.
Requirements:
KNIME 2.7.4
Community Nodes - RDKit (to view structures); see http://tech.knime.org/community
OSRA 1.4 http://osra.sourceforge.net/
Start by creating a working directory where you can put the test PDF and some other files.
Then try this from the command line:
$ cd workingdir
$ osra -w KNIME.smi -r 150 -o KNIME -f can test.pdf
That is: write the output as SMILES, one structure per line (-w KNIME.smi); set the resolution of the intermediate images to 150 dpi (-r 150); keep those intermediate images so we can check for problems, naming them all KNIMEnnn.png (-o KNIME); generate canonical SMILES (-f can); and use test.pdf as the input file. Start your testing with a single-page PDF of good quality (not a scan of an old photocopy!).
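If you want to script this test run outside KNIME, the same command can be assembled as an argument list. The following is a minimal sketch, not part of the workflow: the function name build_osra_command and its parameters are hypothetical, and it uses only the switches already shown in the command above.

```python
def build_osra_command(pdf, prefix="KNIME", dpi=150, executable="osra"):
    """Return the argument list for the test command shown above.

    The function and parameter names are illustrative assumptions;
    only the switches from the example command are used.
    """
    return [
        executable,
        "-w", f"{prefix}.smi",  # SMILES output file, one structure per line
        "-r", str(dpi),         # resolution of the intermediate images
        "-o", prefix,           # keep intermediate images, named with this prefix
        "-f", "can",            # canonical SMILES
        pdf,                    # input file
    ]


if __name__ == "__main__":
    # Print the command so it can be compared with the one typed by hand.
    print(" ".join(build_osra_command("test.pdf")))
```

The resulting list can be handed to subprocess.run(cmd, cwd=workingdir, check=True) once osra is known to be on the PATH.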
If you get any errors about missing preference files, you can add them to the command line with the -l and -a switches, but it is easier to copy or link them into the workingdir. Copy spelling.txt and superatom.txt from /opt/local/osra/1.4.0, and edit as needed. The first helps osra correct misreadings such as "Pl" where "Ph" was obviously meant. The second tells osra that "Ph" means "c1ccccc1".
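Copying the two preference files can be scripted so that copies you have already edited are never overwritten. This is a sketch under assumptions: the helper name copy_missing_preferences is hypothetical, and the OSRA installation directory is whatever applies on your system (the path above is just one example).

```python
import shutil
from pathlib import Path

# The two preference files named in the text.
PREFERENCE_FILES = ("spelling.txt", "superatom.txt")


def copy_missing_preferences(osra_dir, workingdir):
    """Copy preference files that are not yet in workingdir.

    Returns the names of the files that were copied. Files already
    present in workingdir (for example, edited copies) are left alone.
    """
    copied = []
    for name in PREFERENCE_FILES:
        source = Path(osra_dir) / name
        target = Path(workingdir) / name
        if source.exists() and not target.exists():
            shutil.copy(source, target)
            copied.append(name)
    return copied
```

A call such as copy_missing_preferences("/opt/local/osra/1.4.0", "workingdir") would then be safe to repeat between runs.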
If all that works, onwards to the workflow.
The External Tool node writes the incoming table to a file. This file is supposed to be used as input for the actual tool, but its format is of no use to osra. So the workflow creates a dummy table instead, and writes a blank file to the workingdir.
The actual osra command line switches are passed in via flow variables from the other branch of the workflow. The workflow expects all of your PDF files to be in the same directory as the intermediate files and output files.
This version of the workflow will only process the first PDF found.
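To see which PDF a run is likely to pick up, you can list the working directory yourself. The sketch below assumes that "first" means first in alphabetical order by file name; the workflow's actual ordering is not stated here, and the function name first_pdf is hypothetical.

```python
from pathlib import Path


def first_pdf(workingdir):
    """Return the first PDF in workingdir, or None if there is none.

    Assumption: files are ordered alphabetically by name. The extension
    check ignores case, so both .pdf and .PDF are accepted.
    """
    pdfs = sorted(
        (p for p in Path(workingdir).iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name,
    )
    return pdfs[0] if pdfs else None
```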
Configure the External Tool node next.
1. The "Input Data File path" needs to point to a blank dummy file.
2. "Path to Executable" needs to be preset (the default is /usr/local/bin/osra).
3. "Execute in Directory" should be fed in via the flow variable from the second branch of the workflow. It needs to be a full path, so paths like ~/osra/test/here will not work.
4. "Output Data file path" needs to match exactly the output file specified in the "Command Line Arguments" field, or KNIME won't recover the output. The default is workingdir/KNIME.smi.
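Two of the settings above are easy to get wrong: the execution directory must be a full path, and the output path must agree with the file named after -w. The following sketch checks both before you run the node. It is not part of KNIME; the function name is hypothetical, and it assumes that a relative -w file name is resolved against the execution directory.

```python
import os
import shlex


def check_external_tool_settings(execute_in_dir, output_path, command_line_args):
    """Return a list of problems; an empty list means the settings agree."""
    problems = []

    # A path starting with ~ is not a full path and is rejected here.
    if not os.path.isabs(execute_in_dir):
        problems.append(f"not a full path: {execute_in_dir}")

    args = shlex.split(command_line_args)
    if "-w" not in args or args.index("-w") + 1 >= len(args):
        problems.append("no -w output file in the command line arguments")
    else:
        written = args[args.index("-w") + 1]
        # Assumption: a relative -w file lands in the execution directory.
        expected = os.path.normpath(os.path.join(execute_in_dir, written))
        if os.path.normpath(output_path) != expected:
            problems.append(f"output path {output_path} does not match {expected}")

    return problems
```

For example, an execution directory of /home/me/workingdir with arguments "-w KNIME.smi -r 150 -o KNIME -f can test.pdf" is consistent with an output path of /home/me/workingdir/KNIME.smi (the directory name here is only an illustration).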
Once the External Tool node has executed, you can execute the third branch of the workflow. This branch will read in all the intermediate image files generated by osra, and line them up in a table with the interpreted structures so that you can look for errors.
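The same side-by-side comparison can be sketched outside KNIME. The example below makes two assumptions that you should verify against your own output: that the intermediate images are named with the prefix followed by a number and .png, and that the n-th image in numeric order corresponds to the n-th line of the SMILES file. The function name is hypothetical.

```python
import re
from pathlib import Path


def pair_images_with_smiles(workingdir, prefix="KNIME"):
    """Pair intermediate images with SMILES lines, in numeric image order.

    Assumption: image n (by its trailing number) belongs to SMILES line n.
    If the counts differ, the extra images or lines are dropped by zip().
    """
    workingdir = Path(workingdir)
    text = (workingdir / f"{prefix}.smi").read_text()
    smiles = [line.strip() for line in text.splitlines() if line.strip()]

    pattern = re.compile(rf"{re.escape(prefix)}(\d+)\.png")
    numbered = []
    for path in workingdir.iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            numbered.append((int(match.group(1)), path.name))

    # Sort by the number, so KNIME2.png comes before KNIME10.png.
    images = [name for _, name in sorted(numbered)]
    return list(zip(images, smiles))
```

A mismatch between the number of images and the number of SMILES lines is itself a useful signal that a structure was missed or split.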
Adjust the two preference files, spelling.txt and superatom.txt, and rerun the workflow until all the errors are corrected.
To do:
1. Edit-in-place structure correction. In the table view, click on a single structure and correct it without rerunning the workflow.
2. Loop through all the PDFs found in the workingdir, not just the first one.
Version 1 (of 1)