Chemical structure extractor - img2structure

Created: 2013-05-14 04:14:26 Last updated: 2013-05-14 04:20:59

Extracts images of chemical structures from PDFs and converts them to usable structure data using the OSRA binaries.

An in-progress (incomplete) workflow.

This workflow makes use of the External Tool node to access the OSRA structure recognition binaries.

So you must have a functioning installation of OSRA and its dependencies. Building it may require familiarity with the compilers on your platform, and may not be a trivial task.

OSRA 1.4 is free in both source and binary form. The OSRA 2.0 source is free, but the binaries require a small fee.

Requirements:
KNIME 2.7.4
Community Nodes - RDKit (to view structures); see http://tech.knime.org/community
OSRA 1.4 http://osra.sourceforge.net/

Start by creating a working directory where you can put the test PDF and some other files.

Then try this from the command line:

$ cd workingdir
$ osra -w KNIME.smi -r 150 -o KNIME -f can test.pdf

This writes the output as SMILES, one structure per line (-w KNIME.smi); sets the resolution of the intermediate images to 150 dpi (-r 150); keeps those intermediate images so we can check for problems, naming them all KNIMEnnn.png (-o KNIME); generates canonical SMILES (-f can); and uses test.pdf as the input file. Start your testing with a single-page PDF of good quality (not a scan of an old photocopy!).
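For scripted testing outside KNIME, the same call can be assembled in Python. This is only a sketch: the helper names build_osra_args and run_osra are made up for illustration, and it assumes osra is on your PATH. The switches are exactly the ones from the command line above.

```python
import subprocess
from pathlib import Path


def build_osra_args(pdf_name, prefix="KNIME", resolution=150, executable="osra"):
    """Return the argument list for the osra call described in the text."""
    return [
        executable,
        "-w", f"{prefix}.smi",  # write SMILES, one structure per line
        "-r", str(resolution),  # resolution of the intermediate images, in dpi
        "-o", prefix,           # keep intermediate images, named with this prefix
        "-f", "can",            # canonical SMILES
        pdf_name,               # input file
    ]


def run_osra(workingdir, pdf_name):
    """Run osra inside workingdir; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(
        build_osra_args(pdf_name),
        cwd=Path(workingdir).resolve(),  # same effect as "cd workingdir" first
        check=True,
        capture_output=True,
        text=True,
    )
```

Calling run_osra("workingdir", "test.pdf") should behave like the two shell commands above, with any error text from osra available on the raised exception.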

If you get any errors about missing preference files, you can add them to the command line with the -l and -a switches, but it is easier to just copy or link them into the workingdir. Copy spelling.txt and superatom.txt from /opt/local/osra/1.4.0, and edit them as needed. The first helps osra correct misreadings, such as "Pl" where the label is obviously meant to be "Ph". The second tells osra that "Ph" means "c1ccccc1".
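If you set up working directories often, the copy step can be scripted. A minimal sketch, assuming the two files live directly in the osra install directory (for example /opt/local/osra/1.4.0 as above); the function name copy_preferences is hypothetical. It leaves an existing copy alone by default so that local edits are not lost.

```python
import shutil
from pathlib import Path

PREFERENCE_FILES = ("spelling.txt", "superatom.txt")


def copy_preferences(osra_dir, workingdir, overwrite=False):
    """Copy the osra preference files into workingdir; return the names copied."""
    osra_dir, workingdir = Path(osra_dir), Path(workingdir)
    copied = []
    for name in PREFERENCE_FILES:
        source, target = osra_dir / name, workingdir / name
        if not source.is_file():
            raise FileNotFoundError(f"{source} not found; check the osra install path")
        if target.exists() and not overwrite:
            continue  # keep a copy that may already have been edited
        shutil.copy2(source, target)
        copied.append(name)
    return copied
```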

If all that works, onwards to the workflow.

The External Tool node writes the incoming table to a file. This file is supposed to be used as input for the actual tool, but it is not in a format that osra can use. So the workflow creates a dummy table instead, and writes a blank file to the workingdir.

The actual osra command line switches are passed in via flow variables from the other branch of the workflow. The workflow expects all of your PDF files to be in the same directory as the intermediate files and output files.

This version of the workflow will only process the first PDF found.

Configure the External Tool node next.
1. The "Input Data File path" needs to point to a blank dummy file.
2. The "Path to Executable" needs to be pre-set (default is /usr/local/bin/osra).
3. "Execute in Directory" should be fed in via the flow variable from the second branch of the workflow. It needs to be a full path; paths such as ~/osra/test/here will not work.
4. The "Output Data file path" needs to match exactly the path specified in the "Command Line Arguments" field, or KNIME won't recover the output. The default is workingdir/KNIME.smi.
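The four settings above are easy to get subtly wrong, so a quick check outside KNIME may save a failed run. The sketch below is not part of the workflow: check_tool_settings is a hypothetical helper, and it assumes the file named after -w is written relative to the execute directory.

```python
import os
from pathlib import Path


def check_tool_settings(input_file, executable, execute_dir, output_file, arguments):
    """Return a list of problems found in the External Tool settings (empty if none)."""
    problems = []

    # 1. The input data file must exist, even though osra never reads it.
    if not Path(input_file).is_file():
        problems.append(f"input data file not found: {input_file}")

    # 2. The executable must exist and be runnable.
    if not (Path(executable).is_file() and os.access(executable, os.X_OK)):
        problems.append(f"executable not found or not runnable: {executable}")

    # 3. The execute directory must be a full path; "~" counts as relative here.
    if not Path(execute_dir).is_absolute():
        problems.append(f"execute directory is not an absolute path: {execute_dir}")

    # 4. The file named after -w must be the file KNIME reads back.
    if "-w" not in arguments or arguments.index("-w") + 1 >= len(arguments):
        problems.append("no output file given with -w")
    else:
        written = Path(execute_dir) / arguments[arguments.index("-w") + 1]
        if written != Path(output_file):
            problems.append(f"osra writes {written} but KNIME expects {output_file}")
    return problems
```

An empty list means the settings are at least consistent with each other; it does not prove that osra itself will run cleanly.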

Once the External Tool node has executed, you can execute the third branch of the workflow. This branch will read in all the intermediate image files generated by osra, and line them up in a table with the interpreted structures so that you can look for errors.
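The same side-by-side check can be roughed out without KNIME. The sketch below rests on a loud assumption: that the numbered KNIMEnnn.png images appear in the same order as the lines of KNIME.smi. Verify this against your own osra output before trusting the pairing. The function names are hypothetical.

```python
from itertools import zip_longest
from pathlib import Path


def image_number(path, prefix):
    """Return the numeric suffix of e.g. KNIME012.png, or -1 if there is none."""
    digits = path.stem[len(prefix):]
    return int(digits) if digits.isdigit() else -1


def load_review_rows(workingdir, prefix="KNIME"):
    """Pair each intermediate image with the SMILES on the matching line."""
    workingdir = Path(workingdir)
    lines = (workingdir / f"{prefix}.smi").read_text().splitlines()
    smiles = [line.split()[0] for line in lines if line.strip()]
    images = sorted(
        workingdir.glob(f"{prefix}*.png"),
        key=lambda p: (image_number(p, prefix), p.name),  # numeric, not alphabetic
    )
    # zip_longest keeps unmatched images or structures visible as None
    return list(zip_longest((p.name for p in images), smiles))
```

A row containing None means the image count and the structure count disagree, which is itself a hint that something was misread.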

Adjust the two preference files spelling.txt and superatom.txt, and run the workflow again until all the errors are corrected.

To do:
1. Edit-in-place structure correction. In the table view, correct a single structure by clicking on it and correcting the structure without rerunning the workflow.
2. Loop through all the PDFs found in the workingdir, not just the first one.
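Until item 2 is done inside the workflow, the loop can be approximated from a script. This is a sketch only, not how the workflow itself will do it: plan_osra_runs is a hypothetical helper, and it gives every PDF its own output prefix (the file name without .pdf) so that one run does not overwrite the results of another.

```python
from pathlib import Path


def plan_osra_runs(workingdir, resolution=150):
    """Return one osra argument list per PDF found in workingdir."""
    runs = []
    for pdf in sorted(Path(workingdir).glob("*.pdf")):
        prefix = pdf.stem  # e.g. "paper1" for paper1.pdf
        runs.append([
            "osra",
            "-w", f"{prefix}.smi",  # one SMILES file per PDF
            "-r", str(resolution),
            "-o", prefix,           # intermediate images named after the PDF
            "-f", "can",
            pdf.name,
        ])
    return runs
```

Each list can be passed to subprocess.run with cwd set to the working directory.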

Workflow Type

KNIME

Uploader

sauberns

License

All versions of this Workflow are licensed under:

Version 1 (of 1)

Credits (1)

(People/Groups)

sauberns

Attributions (1)

(Workflows/Files)

Chemical term extractor - text2structure

Shared with Groups (1)

Cheminformatics

Attributed By (1)

(Workflows/Files)

Chemical term extractor - text2structure

Version History

In chronological order:

Chemical structure extractor - img2structure

Created by sauberns on Tuesday 14 May 2013 04:14:26 (UTC)

Last edited by sauberns on Tuesday 14 May 2013 04:20:59 (UTC)
