Name | Type | Description
pubmed_database |
stringconstant |
Which database is being used. Value: pubmed |
extractPMID |
localworker |
This process extracts the PubMed IDs based on the eSearch run. Script:
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
SAXReader reader = new SAXReader(false);
reader.setIncludeInternalDTDDeclarations(false);
reader.setIncludeExternalDTDDeclarations(false);
Document document = reader.read(new StringReader(xmltext));
List nodelist = document.selectNodes(xpath);
// Process the elements in the nodelist
ArrayList outputList = new ArrayList();
ArrayList outputXmlList = new ArrayList();
String val = null;
String xmlVal = null;
for (Iterator iter = nodelist.iterator(); iter.hasNext();) {
Node element = (Node) iter.next();
xmlVal = element.asXML();
val = element.getStringValue();
if (val != null && !val.equals("")) {
outputList.add(val);
outputXmlList.add(xmlVal);
}
}
nodelist = outputList;
nodelistAsXML = outputXmlList; |
xpath |
stringconstant |
Value: /*[local-name(.)='eSearchResult']/*[local-name(.)='IdList']/*[local-name(.)='Id'] |
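The `local-name()` predicates make this expression namespace-agnostic, so it matches the `Id` elements whether or not eSearch returns them in a namespace. A minimal standalone sketch using the JDK's built-in XPath API (the workflow itself uses dom4j; the sample XML below is a trimmed-down, hypothetical eSearchResult):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

public class LocalNameXPath {
    // The same expression the xpath string constant holds
    static final String XPATH =
        "/*[local-name(.)='eSearchResult']/*[local-name(.)='IdList']/*[local-name(.)='Id']";

    static List<String> extractIds(String xml) throws Exception {
        org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(XPATH, doc, XPathConstants.NODESET);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(nodes.item(i).getTextContent());
        }
        return ids;
    }

    public static void main(String[] args) throws Exception {
        // Trimmed-down sample of what eSearch returns
        String sample = "<eSearchResult><IdList><Id>19008416</Id><Id>18927361</Id></IdList></eSearchResult>";
        System.out.println(extractIds(sample)); // [19008416, 18927361]
    }
}
```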
run_eSearch |
wsdl |
This process runs eSearch, which extracts the IDs of the articles that match the query. Wsdl: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/eutils.wsdl Wsdl Operation: run_eSearch |
parametersXML_eFecth |
xmlsplitter |
This process will create the parameters that can then be used by eSearch and eFetch. |
Retrive_abstracts |
workflow |
This nested workflow was part of Fisher's workflow, but has been reduced in size. This workflow stores the XML files from eFetch and doesn't need to extract the plain-text abstract like the original workflow did. |
LookAtWatch |
beanshell |
Time flies like an arrow; fruit flies like a banana.
This process receives the abstract XML, which triggers a time lookup. The result is then passed to the next process. Script:
import java.util.Date;
Date date = new Date();
stringy = "" + date;
CurrentTime = stringy |
CreateProvenance |
beanshell |
Provenance is important when you want to trace back your data. For this reason I added a process to the workflow that adds some basic provenance based on the work of the W3C (www.w3.org).
The process adds the following types of provenance:
ResearcherID - Use www.researcherid.org to get a ResearcherID. This can then be linked to your research.
ExtractionDate - The date and time the article was extracted.
MyExperimentID - The myExperiment ID of the used workflow.
WorkflowVersion - The version of the used workflow.
WorkflowDevelopers - A list of the developers of the workflow.
StartDate - The starting date of the article search; see the input port for more information.
EndDate - The ending date of the article search; see the input port for more information.
SearchTerm - The original search query; see the input port for more information.
MaximumArticles - The maximum number of articles that have been searched; see the input port for more information. Script:
Prov = "\n" + ResearcherID + "\n" + ExtractionDate + "\n" + MyExperimentID + "\n" + WorkflowVersion + "\n" + WorkflowDevelopers + "\n" + "\n" + StartDate + "\n" + EndDate + "\n" + SearchTerm + "\n" + MaximumArticles + "\n\n" + "\n" |
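The Script joins each provenance field on its own line. A simplified standalone sketch of the same idea (the field values below are placeholders, and the Beanshell script additionally inserts a leading newline and an extra blank separator line):

```java
import java.util.Arrays;

public class ProvenanceBlock {
    // Join each provenance field on its own line, as the CreateProvenance script does
    static String build(String researcherId, String extractionDate, String myExperimentId,
                        String workflowVersion, String workflowDevelopers,
                        String startDate, String endDate, String searchTerm, String maxArticles) {
        return String.join("\n", Arrays.asList(
            researcherId, extractionDate, myExperimentId, workflowVersion,
            workflowDevelopers, startDate, endDate, searchTerm, maxArticles));
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only
        System.out.println(build("A-1234-2008", "Mon Jan 01 12:00:00 CET 2024", "3659",
                "5", "Sander van Boom and Paul Fisher",
                "2008/01/01", "2024/01/01", "malaria", "500"));
    }
}
```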
Write_Text_File |
localworker |
This process writes the content of the workflow to a file. The location of the file is created in the CreateFileLocation process.
NOTE: You might want to change the working directory. This can be done by changing the CreateFileLocation process. Script:
import java.io.*;
BufferedWriter out;
if (encoding == void) {
out = new BufferedWriter(new FileWriter(outputFile));
}
else {
out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), encoding));
}
out.write(filecontents);
out.flush();
out.close();
outputFile = filecontents;
|
CreateListOfArticlesThatNeedExtracting |
beanshell |
This conditional branch splits the workflow in two directions:
IDsInDatabase - A list of IDs of articles that are already in the working directory. They should not be extracted.
ListOfArticlesThatNeedExtracting - The list of PubMed IDs that should be extracted and added to the database. Script:
//Distribute Lists
import java.util.*;
List IDsInDatabaseOut = new ArrayList();
List ListOfArticlesThatNeedExtracting = new ArrayList();
String booleanStatement = IDsInDatabase.toString();
if (booleanStatement.equals("File exists")) {
IDsInDatabaseOut.add(ExtractableIDs + "IDsInDatabase");
}
else{
ListOfArticlesThatNeedExtracting.add(ExtractableIDs);
}
|
CheckIfArticleIsInDatabase |
externaltool |
This process calls the command line and checks if the file at the file location exists. If this is not the case, the process returns the string false. |
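The downstream conditional branch compares the result against the string "File exists", so the external tool's check can be sketched in plain Java as follows (the exact "File exists"/"false" return strings are taken from the process descriptions; treating them as literals is an assumption about the external tool's output):

```java
import java.io.File;

public class CheckIfArticleIsInDatabase {
    // Returns "File exists" when the file is present, the string "false" otherwise,
    // matching what CreateListOfArticlesThatNeedExtracting tests for
    static String check(String fileLocation) {
        return new File(fileLocation).exists() ? "File exists" : "false";
    }

    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("pubmed", ".xml");
        System.out.println(check(tmp.getAbsolutePath())); // File exists
        tmp.delete();
        System.out.println(check(tmp.getAbsolutePath())); // false
    }
}
```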
Flatten_List |
localworker |
We need to decrease the depth of the list by one level; otherwise we will get errors in the validation report. Script:
flatten(inputs, outputs, depth) {
for (i = inputs.iterator(); i.hasNext();) {
element = i.next();
if (element instanceof Collection && depth > 0) {
flatten(element, outputs, depth - 1);
} else {
outputs.add(element);
}
}
}
outputlist = new ArrayList();
flatten(inputlist, outputlist, 1); |
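The same recursive unwrapping can be written as typed Java; this is a standalone sketch of the local worker above, not the worker itself:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class FlattenList {
    // Recursively unwraps nested collections until 'depth' levels have been removed,
    // mirroring the Flatten List local worker
    static void flatten(Collection<?> inputs, List<Object> outputs, int depth) {
        for (Object element : inputs) {
            if (element instanceof Collection && depth > 0) {
                flatten((Collection<?>) element, outputs, depth - 1);
            } else {
                outputs.add(element);
            }
        }
    }

    public static void main(String[] args) {
        List<Object> nested = Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
        List<Object> flat = new ArrayList<>();
        flatten(nested, flat, 1); // remove one level of nesting
        System.out.println(flat); // [a, b, c]
    }
}
```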
CreateFileLocation_2 |
beanshell |
This process creates the location of the file. This can then be used to check whether the file exists in the next process (CheckIfArticleIsInDatabase).
NOTE: if you want to change the working directory, please change this process so it links to the correct directory. Script:
//You can change the working directory (default: "/home/") to another working directory of your liking.
//Make sure it exists.
FileLocation = Workspace + PubmedID + ".xml" |
MyExperimentID_value |
stringconstant |
This value stores the myExperiment ID of the workflow. If you re-upload this workflow with improvements, feel free to change this value. Value: 3659 |
WorkflowDevelopers_value |
stringconstant |
Sander van Boom and Paul Fisher created this workflow. If you've changed this workflow and uploaded it on myExperiment, feel free to add your name to this variable as well. Value: Sander van Boom and Paul Fisher |
WorkflowVersion_value |
stringconstant |
This value stores the current version of the workflow. Value: 5 |
AddProvenance |
beanshell |
This process fuses the provenance, the found abstracts and the header of the file.
The output is also sent to an output port for checking the values. Script:
AbstractWithProvenance = "\n" + Provenance + Abstract + "\n" + "" |
XPath_Service |
xpath |
This XPath service removes the header from the file, because we want to add provenance to the file later in the workflow.
After we've added the provenance, we add the header back to the file. XPath Expression: /default:eFetchResult/default:PubmedArticleSet |
Flatten_List_2 |
localworker |
We need to decrease the depth of the list by one level; otherwise we will get errors in the validation report. Script:
flatten(inputs, outputs, depth) {
for (i = inputs.iterator(); i.hasNext();) {
element = i.next();
if (element instanceof Collection && depth > 0) {
flatten(element, outputs, depth - 1);
} else {
outputs.add(element);
}
}
}
outputlist = new ArrayList();
flatten(inputlist, outputlist, 1); |
CreateFileLocation_2_2 |
beanshell |
This process creates the location of the file. This can then be used to check whether the file exists in the next process (CheckIfArticleIsInDatabase).
NOTE: if you want to change the working directory, please change this process so it links to the correct directory. Script:
//You can change the working directory (default: "/home/") to another working directory of your liking.
//Make sure it exists.
FileLocation = Workspace + PubmedID + ".xml" |
InformationExtractionAndSolrImport |
workflow |
Read a file, extract the content, extract the PubMed ID from the abstract, and write the file back to a new workspace. Then import it into Solr.
NOTE: Make sure that Solr is installed and the variable pathToPostJar points to the correct path of post.jar.
BONUS NOTE: If you want Solr to detect more than just the title and the ID, you should add extra XPaths and update the Solr schema accordingly. |
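The Solr import boils down to invoking post.jar on each written document. A sketch of how that command line could be assembled from the workflow's pathToPostJar constant (the document path "/home/19008416.xml" is a hypothetical example based on the default "/home/" workspace; actually running it requires a Solr instance listening on its default port):

```java
import java.util.Arrays;
import java.util.List;

public class SolrImport {
    // Builds the command line that ships a document to Solr via post.jar;
    // pathToPostJar and the document path come from the workflow's inputs
    static List<String> buildCommand(String pathToPostJar, String documentPath) {
        return Arrays.asList("java", "-jar", pathToPostJar, documentPath);
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand(
            "/run/media/sander/Second Space/Downloads/Solaria/solr-4.4.0/example/exampledocs/post.jar",
            "/home/19008416.xml");
        System.out.println(cmd);
        // To actually execute it (requires a running Solr):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```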
pathToPostJar |
stringconstant |
This is the path to the post.jar that Solr uses to import its documents.
NOTE: Please change this variable to your Solr directory. Value: /run/media/sander/Second Space/Downloads/Solaria/solr-4.4.0/example/exampledocs/post.jar |