PubMed Search and Solr storage
Created: 2013-07-25 14:38:20
Last updated: 2013-08-26 14:13:59
Based on the work of Fisher: this workflow takes in a search term, which is passed to the eSearch function and searched for in PubMed. I extended it by removing the outputs and text extraction and added an automatic Solr storage process using a post.jar specified by the user.
Before running this workflow, make sure that a Solr server is up and running and that the variable attached to the SolrImport process contains the correct path (a quick connectivity check is sketched after the dependency list below).
Dependencies:
- Solr
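As a quick way to confirm that Solr is actually reachable before launching the workflow, something along the following lines can be used. This is a minimal sketch, assuming the default Solr 4.x example instance at http://localhost:8983/solr with the standard ping handler on the collection1 core; adjust the URL for other setups.

import java.net.HttpURLConnection;
import java.net.URL;

public class SolrPingCheck {
    public static void main(String[] args) throws Exception {
        // Ping handler of the default example core (assumed URL).
        URL url = new URL("http://localhost:8983/solr/collection1/admin/ping");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        // HTTP 200 means Solr is up; anything else (or an exception) means start Solr first.
        System.out.println("Solr ping status: " + conn.getResponseCode());
        conn.disconnect();
    }
}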
Workflow Components
Authors (1)
Sander van Boom, Paul Fisher |
Titles (1)
PubMed Search and Solr storage |
Descriptions (1)
Based on the work of Fisher: this workflow takes in a search term, which is passed to the eSearch function and searched for in PubMed. I extended it by removing the outputs and text extraction and added an automatic Solr storage process using a post.jar specified by the user.
Before running this workflow, make sure that a Solr server is up and running and that the variable attached to the SolrImport process contains the correct path.
Dependencies:
- Solr |
Dependencies (0)
Inputs (5)
Name | Description
search_term |
I want to find all abstracts that contain the words:
|
end_date |
The found articles should be older than ____.
|
start_date |
The found articles should not be older than ____.
|
maximum_articles |
I want a maximum of ____ articles.
10 is set as the default for testing. More than 100 is good for a normal run.
|
ResearcherID |
The ID of the researcher that runs this workflow. If you do not have an ID, please register at www.researcherid.com.
|
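The inputs above are wired into the parametersXML_eFecth splitter and reach eSearch as term, mindate, maxdate and RetMax (see the datalinks below). For orientation, here is a minimal sketch of an equivalent plain E-utilities request with made-up input values; the workflow itself goes through the SOAP service instead:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class ESearchSketch {
    public static void main(String[] args) throws Exception {
        // Example values standing in for the workflow inputs.
        String searchTerm = URLEncoder.encode("apoptosis", "UTF-8");
        String url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
                + "?db=pubmed"
                + "&term=" + searchTerm
                + "&datetype=pdat&mindate=2012/01/01&maxdate=2013/07/01"
                + "&retmax=10";
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line); // XML containing an <IdList> of PMIDs
        }
        in.close();
    }
}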
Processors (16)
Name | Type | Description
pubmed_database |
stringconstant |
Which database is being used. Value: pubmed |
extractPMID |
localworker |
This process extracts the PubMed IDs from the eSearch result. Script:
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;

// Parse the eSearch response without resolving DTD declarations.
SAXReader reader = new SAXReader(false);
reader.setIncludeInternalDTDDeclarations(false);
reader.setIncludeExternalDTDDeclarations(false);
Document document = reader.read(new StringReader(xmltext));

// Select every node matching the xpath input (here: the <Id> elements of the <IdList>).
List matched = document.selectNodes(xpath);

// Collect the text value and the raw XML of every non-empty match.
ArrayList outputList = new ArrayList();
ArrayList outputXmlList = new ArrayList();
String val = null;
String xmlVal = null;
for (Iterator iter = matched.iterator(); iter.hasNext();) {
    Node element = (Node) iter.next();
    xmlVal = element.asXML();
    val = element.getStringValue();
    if (val != null && !val.equals("")) {
        outputList.add(val);
        outputXmlList.add(xmlVal);
    }
}
List nodelist = outputList;
List nodelistAsXML = outputXmlList; |
xpath |
stringconstant |
Value: /*[local-name(.)='eSearchResult']/*[local-name(.)='IdList']/*[local-name(.)='Id'] |
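Taken together, extractPMID and this XPath pull every <Id> element out of the eSearch result, ignoring namespaces. A self-contained sketch of the same extraction, using dom4j as the localworker does and hypothetical PMIDs:

import java.io.StringReader;
import java.util.List;
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;

public class ExtractPmidDemo {
    public static void main(String[] args) throws Exception {
        // Trimmed-down eSearch response with made-up IDs.
        String xml = "<eSearchResult><IdList>"
                + "<Id>23456789</Id><Id>23456790</Id>"
                + "</IdList></eSearchResult>";
        String xpath = "/*[local-name(.)='eSearchResult']"
                + "/*[local-name(.)='IdList']/*[local-name(.)='Id']";
        Document doc = new SAXReader().read(new StringReader(xml));
        List nodes = doc.selectNodes(xpath);
        for (Object o : nodes) {
            System.out.println(((Node) o).getStringValue()); // prints each PMID
        }
    }
}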
run_eSearch |
wsdl |
Wsdl: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/eutils.wsdl
Wsdl Operation: run_eSearch |
parametersXML_eFecth |
xmlsplitter |
|
Retrive_abstracts |
workflow |
This nested workflow was part of Fisher's workflow, but has been reduced in size. It only stores the XML files from eFetch and no longer needs to extract the plain-text abstracts as the original workflow did. |
SolrImport |
externaltool |
SolrImport takes the path of the text file and stores it in a Solr database.
Make sure that the Solr database is running and that the correct path is inside the variable.
If Solr is running locally, you can check whether the files have been stored by browsing to: http://localhost:8983/solr/#/
Solr can be downloaded at:
http://lucene.apache.org/solr/ |
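The external tool boils down to running post.jar on the file written by Write_Text_File. Below is a minimal sketch of an equivalent call from Java; the jar path and file name are placeholders (the workflow takes the real jar path from the pathToPostJar input), and Solr 4.x's post.jar targets http://localhost:8983/solr/update by default:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class PostToSolr {
    public static void main(String[] args) throws Exception {
        String postJar = "/path/to/solr/example/exampledocs/post.jar"; // hypothetical path
        String inputFile = "/home/sander/Downloads/23456789.xml";      // file written by Write_Text_File
        // Equivalent of: java -jar post.jar <file>
        Process p = new ProcessBuilder("java", "-jar", postJar, inputFile)
                .redirectErrorStream(true)
                .start();
        BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = out.readLine()) != null) {
            System.out.println(line); // corresponds to the SolrImport_STDOUT/STDERR outputs
        }
        p.waitFor();
    }
}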
pathToPostJar |
stringconstant |
This is the path to the post.jar that Solr uses to import its documents. Value: /run/media/sander/Second Space/Downloads/solr-4.3.1/example/exampledocs/post.jar |
LookAtWatch |
beanshell |
Script:
// "Look at the watch": capture the current date and time as a string,
// used as the extraction date in the provenance header.
Date date = new Date();
stringy = "" + date;
out1 = stringy; |
AddProvinance |
beanshell |
Script:
// Prepend the researcher ID and extraction date to the fetched abstract XML.
AbstractWithProv = "\n" + ResearcherID + "\n" + ExtractionDate + "\n\n" + ExtractedAbstract; |
Write_Text_File |
localworker |
Script:
// Write the provenance-stamped abstract to the file location built by CreateFileLocation_2.
BufferedWriter out;
if (encoding == void) {
    out = new BufferedWriter(new FileWriter(outputFile));
}
else {
    out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), encoding));
}
out.write(filecontents);
out.flush();
out.close();
// The localworker echoes the written contents on its output port.
outputFile = filecontents; |
CreateFileLocation |
beanshell |
Script:
out1 = "/home/sander/Downloads/" + in1 + ".xml"; |
CreateListOfArticlesThatNeedExtracting |
beanshell |
Script:
// Distribute lists: route each ID either to the "already in database" output
// or to the list of articles that still need extracting.
import java.util.*;
List IDsInDatabaseOut = new ArrayList();
List ListOfArticlesThatNeedExtracting = new ArrayList();
// CheckIfArticleIsInDatabase prints "File exists" on STDOUT when the article
// has already been stored.
String fortest = IDsInDatabase.toString();
Test1 = fortest.equals("File exists");
if (fortest.equals("File exists")) {
    IDsInDatabaseOut.add(ExtractableIDs + "IDsInDatabase");
}
else {
    ListOfArticlesThatNeedExtracting.add(ExtractableIDs);
} |
CheckIfArticleIsInDatabase |
externaltool |
|
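No description was recorded for this external tool. From the datalinks and the string comparison in CreateListOfArticlesThatNeedExtracting, it evidently tests whether the file path produced by CreateFileLocation already exists and writes "File exists" to STDOUT when it does. A hypothetical sketch of that contract (the real tool's command line is not captured on this page):

import java.io.File;

public class CheckIfArticleIsInDatabase {
    public static void main(String[] args) {
        String extractableArticles = args[0]; // e.g. /home/sander/Downloads/23456789.xml
        if (new File(extractableArticles).exists()) {
            // The downstream beanshell compares STDOUT against exactly this string.
            System.out.print("File exists");
        }
    }
}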
Flatten_List |
localworker |
Script:
// Recursively flatten nested collections down to the given depth.
flatten(inputs, outputs, depth) {
    for (i = inputs.iterator(); i.hasNext();) {
        element = i.next();
        if (element instanceof Collection && depth > 0) {
            flatten(element, outputs, depth - 1);
        } else {
            outputs.add(element);
        }
    }
}
outputlist = new ArrayList();
flatten(inputlist, outputlist, 1); |
CreateFileLocation_2 |
beanshell |
Script:
out1 = "/home/sander/Downloads/" + in1 + ".xml"; |
Beanshells (5)
Name | Description | Inputs | Outputs
LookAtWatch | | (none) | out1
AddProvinance | | ResearcherID, ExtractionDate, ExtractedAbstract | AbstractWithProv
CreateFileLocation | | in1 | out1
CreateListOfArticlesThatNeedExtracting | | ExtractableIDs, IDsInDatabase | ListOfArticlesThatNeedExtracting, IDsInDatabaseOut, Test1
CreateFileLocation_2 | | in1 | out1
Outputs (5)
Name | Description
SolrImport_STDERR |
This is the standard error stream (STDERR) from Solr. If there were any errors while running Solr, the values will turn red. Red is generally considered to be a bad colour.
|
SolrImport_STDOUT |
This is the standard output stream (STDOUT) from Solr.
|
IDsInDatabaseOut |
|
Test1 |
|
Flatten_List_outputlist |
|
Datalinks (27)
Source | Sink
xpath:value | extractPMID:xpath
run_eSearch:parameters | extractPMID:xml-text
parametersXML_eFecth:output | run_eSearch:parameters
pubmed_database:value | parametersXML_eFecth:db
search_term | parametersXML_eFecth:term
end_date | parametersXML_eFecth:maxdate
start_date | parametersXML_eFecth:mindate
maximum_articles | parametersXML_eFecth:RetMax
Flatten_List:outputlist | Retrive_abstracts:pubmed_ids
pathToPostJar:value | SolrImport:pathToPostJar
CreateFileLocation_2:out1 | SolrImport:inputFile
LookAtWatch:out1 | AddProvinance:ExtractionDate
ResearcherID | AddProvinance:ResearcherID
Retrive_abstracts:AbstractXML | AddProvinance:ExtractedAbstract
AddProvinance:AbstractWithProv | Write_Text_File:filecontents
CreateFileLocation_2:out1 | Write_Text_File:outputFile
extractPMID:nodelist | CreateFileLocation:in1
CheckIfArticleIsInDatabase:STDOUT | CreateListOfArticlesThatNeedExtracting:IDsInDatabase
extractPMID:nodelist | CreateListOfArticlesThatNeedExtracting:ExtractableIDs
CreateFileLocation:out1 | CheckIfArticleIsInDatabase:ExtractableArticles
CreateListOfArticlesThatNeedExtracting:ListOfArticlesThatNeedExtracting | Flatten_List:inputlist
Flatten_List:outputlist | CreateFileLocation_2:in1
SolrImport:STDERR | SolrImport_STDERR
SolrImport:STDOUT | SolrImport_STDOUT
CreateListOfArticlesThatNeedExtracting:IDsInDatabaseOut | IDsInDatabaseOut
CreateListOfArticlesThatNeedExtracting:Test1 | Test1
Flatten_List:outputlist | Flatten_List_outputlist
Coordinations (2)
Controller | Target
CheckIfArticleIsInDatabase | CreateListOfArticlesThatNeedExtracting
Write_Text_File | SolrImport
Version 2 (of 5)
Reviews (1)
Title: Thanks!
Rating: 4 out of 5
Created: 2013-08-26 14:13:59 | Updated: 2013-08-26 14:13:59
A nice workflow for gathering our in-house document corpus and metadata.
Other workflows that use similar services
(31)
Only the first 2 workflows that use similar services are shown.
Gene to Pubmed
(3)
This workflow takes in a list of gene names and searches the PubMed database for corresponding articles. Any matches to the genes are then retrieved (abstracts only). These abstracts are then returned to the user.
Created: 2010-07-05
| Last updated: 2011-01-26
Credits:
Paul Fisher
Comments (0)
No comments yet