PubMed Search and Solr storage
Created: 2013-07-25 14:38:20
Last updated: 2013-08-26 14:13:59
Based on the work of Fisher: this workflow takes in a search term, which is passed to the eSearch function and searched for in PubMed. I extended it by removing the outputs and text extraction and added an automatic Solr storage process using a post.jar specified by the user.
Before running this workflow, make sure that a Solr server is up and running and that the variable attached to the SolrImport process contains the correct path (a quick connectivity check is sketched after the dependency list below).
Dependencies:
- Solr
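As a quick way to confirm that Solr is actually reachable before launching the workflow, something along the following lines can be used. This is a minimal sketch, assuming the default Solr 4.x example instance at http://localhost:8983/solr with the standard ping handler on the collection1 core; adjust the URL for other setups.

import java.net.HttpURLConnection;
import java.net.URL;

public class SolrPingCheck {
    public static void main(String[] args) throws Exception {
        // Ping handler of the default example core (assumed URL).
        URL url = new URL("http://localhost:8983/solr/collection1/admin/ping");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        // HTTP 200 means Solr is up; anything else (or an exception) means start Solr first.
        System.out.println("Solr ping status: " + conn.getResponseCode());
        conn.disconnect();
    }
}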
Workflow Components
Authors (1)
Sander van Boom, Paul Fisher |
Titles (1)
PubMed Search and Solr storage |
Descriptions (1)
Based on the work of Fisher: this workflow takes in a search term, which is passed to the eSearch function and searched for in PubMed. I extended it by removing the outputs and text extraction and added an automatic Solr storage process using a post.jar specified by the user.
Before running this workflow, make sure that a Solr server is up and running and that the variable attached to the SolrImport process contains the correct path.
Dependencies:
- Solr |
Dependencies (0)
Inputs (5)
Name | Description
search_term |
I want to find all abstracts that contain the words:
|
end_date |
The found articles should be older than ____.
|
start_date |
The found articles should not be older than ____.
|
maximum_articles |
I want a maximum of ____ articles.
10 is set as the default for testing. More than 100 is good for a normal run.
|
ResearcherID |
The ID of the researcher that runs this workflow. If you do not have an ID, please register at www.researcherid.com.
|
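The inputs above are wired into the parametersXML_eFecth splitter and reach eSearch as term, mindate, maxdate and RetMax (see the datalinks below). For orientation, here is a minimal sketch of an equivalent plain E-utilities request with made-up input values; the workflow itself goes through the SOAP service instead:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class ESearchSketch {
    public static void main(String[] args) throws Exception {
        // Example values standing in for the workflow inputs.
        String searchTerm = URLEncoder.encode("apoptosis", "UTF-8");
        String url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
                + "?db=pubmed"
                + "&term=" + searchTerm
                + "&datetype=pdat&mindate=2012/01/01&maxdate=2013/07/01"
                + "&retmax=10";
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line); // XML containing an <IdList> of PMIDs
        }
        in.close();
    }
}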
Processors (16)
Name | Type | Description
pubmed_database |
stringconstant |
Which database is being used. Value: pubmed |
extractPMID |
localworker |
This process extracts the PubMed IDs from the eSearch result. Script:
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;

// Parse the eSearch response without resolving DTD declarations.
SAXReader reader = new SAXReader(false);
reader.setIncludeInternalDTDDeclarations(false);
reader.setIncludeExternalDTDDeclarations(false);
Document document = reader.read(new StringReader(xmltext));

// Select every node matching the xpath input (here: the <Id> elements of the <IdList>).
List matched = document.selectNodes(xpath);

// Collect the text value and the raw XML of every non-empty match.
ArrayList outputList = new ArrayList();
ArrayList outputXmlList = new ArrayList();
String val = null;
String xmlVal = null;
for (Iterator iter = matched.iterator(); iter.hasNext();) {
    Node element = (Node) iter.next();
    xmlVal = element.asXML();
    val = element.getStringValue();
    if (val != null && !val.equals("")) {
        outputList.add(val);
        outputXmlList.add(xmlVal);
    }
}
List nodelist = outputList;
List nodelistAsXML = outputXmlList; |
xpath |
stringconstant |
Value: /*[local-name(.)='eSearchResult']/*[local-name(.)='IdList']/*[local-name(.)='Id'] |
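Taken together, extractPMID and this XPath pull every <Id> element out of the eSearch result, ignoring namespaces. A self-contained sketch of the same extraction, using dom4j as the localworker does and hypothetical PMIDs:

import java.io.StringReader;
import java.util.List;
import org.dom4j.Document;
import org.dom4j.Node;
import org.dom4j.io.SAXReader;

public class ExtractPmidDemo {
    public static void main(String[] args) throws Exception {
        // Trimmed-down eSearch response with made-up IDs.
        String xml = "<eSearchResult><IdList>"
                + "<Id>23456789</Id><Id>23456790</Id>"
                + "</IdList></eSearchResult>";
        String xpath = "/*[local-name(.)='eSearchResult']"
                + "/*[local-name(.)='IdList']/*[local-name(.)='Id']";
        Document doc = new SAXReader().read(new StringReader(xml));
        List nodes = doc.selectNodes(xpath);
        for (Object o : nodes) {
            System.out.println(((Node) o).getStringValue()); // prints each PMID
        }
    }
}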
run_eSearch |
wsdl |
Wsdl: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/soap/eutils.wsdl
Wsdl Operation: run_eSearch |
parametersXML_eFecth |
xmlsplitter |
|
Retrive_abstracts |
workflow |
This nested workflow was part of Fisher's workflow, but has been reduced in size. It only stores the XML files from eFetch and no longer needs to extract the plain-text abstracts as the original workflow did. |
SolrImport |
externaltool |
SolrImport takes the path of the text file and stores it in a Solr database.
Make sure that the Solr database is running and that the correct path is inside the variable.
If Solr is running locally, you can check whether the files have been stored by browsing to: http://localhost:8983/solr/#/
Solr can be downloaded at:
http://lucene.apache.org/solr/ |
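The external tool boils down to running post.jar on the file written by Write_Text_File. Below is a minimal sketch of an equivalent call from Java; the jar path and file name are placeholders (the workflow takes the real jar path from the pathToPostJar input), and Solr 4.x's post.jar targets http://localhost:8983/solr/update by default:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class PostToSolr {
    public static void main(String[] args) throws Exception {
        String postJar = "/path/to/solr/example/exampledocs/post.jar"; // hypothetical path
        String inputFile = "/home/sander/Downloads/23456789.xml";      // file written by Write_Text_File
        // Equivalent of: java -jar post.jar <file>
        Process p = new ProcessBuilder("java", "-jar", postJar, inputFile)
                .redirectErrorStream(true)
                .start();
        BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = out.readLine()) != null) {
            System.out.println(line); // corresponds to the SolrImport_STDOUT/STDERR outputs
        }
        p.waitFor();
    }
}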
pathToPostJar |
stringconstant |
This is the path to the post.jar that Solr uses to import its documents. Value: /run/media/sander/Second Space/Downloads/solr-4.3.1/example/exampledocs/post.jar |
LookAtWatch |
beanshell |
Script:
// "Look at the watch": capture the current date and time as a string,
// used as the extraction date in the provenance header.
Date date = new Date();
stringy = "" + date;
out1 = stringy; |
AddProvinance |
beanshell |
Script:
// Prepend the researcher ID and extraction date to the fetched abstract XML.
AbstractWithProv = "\n" + ResearcherID + "\n" + ExtractionDate + "\n\n" + ExtractedAbstract; |
Write_Text_File |
localworker |
Script:
// Write the provenance-stamped abstract to the file location built by CreateFileLocation_2.
BufferedWriter out;
if (encoding == void) {
    out = new BufferedWriter(new FileWriter(outputFile));
}
else {
    out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), encoding));
}
out.write(filecontents);
out.flush();
out.close();
// The localworker echoes the written contents on its output port.
outputFile = filecontents; |
CreateFileLocation |
beanshell |
Script:
out1 = "/home/sander/Downloads/" + in1 + ".xml"; |
CreateListOfArticlesThatNeedExtracting |
beanshell |
Script:
// Distribute lists: route each ID either to the "already in database" output
// or to the list of articles that still need extracting.
import java.util.*;
List IDsInDatabaseOut = new ArrayList();
List ListOfArticlesThatNeedExtracting = new ArrayList();
// CheckIfArticleIsInDatabase prints "File exists" on STDOUT when the article
// has already been stored.
String fortest = IDsInDatabase.toString();
Test1 = fortest.equals("File exists");
if (fortest.equals("File exists")) {
    IDsInDatabaseOut.add(ExtractableIDs + "IDsInDatabase");
}
else {
    ListOfArticlesThatNeedExtracting.add(ExtractableIDs);
} |
CheckIfArticleIsInDatabase |
externaltool |
|
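No description was recorded for this external tool. From the datalinks and the string comparison in CreateListOfArticlesThatNeedExtracting, it evidently tests whether the file path produced by CreateFileLocation already exists and writes "File exists" to STDOUT when it does. A hypothetical sketch of that contract (the real tool's command line is not captured on this page):

import java.io.File;

public class CheckIfArticleIsInDatabase {
    public static void main(String[] args) {
        String extractableArticles = args[0]; // e.g. /home/sander/Downloads/23456789.xml
        if (new File(extractableArticles).exists()) {
            // The downstream beanshell compares STDOUT against exactly this string.
            System.out.print("File exists");
        }
    }
}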
Flatten_List |
localworker |
Script:
// Recursively flatten nested collections down to the given depth.
flatten(inputs, outputs, depth) {
    for (i = inputs.iterator(); i.hasNext();) {
        element = i.next();
        if (element instanceof Collection && depth > 0) {
            flatten(element, outputs, depth - 1);
        } else {
            outputs.add(element);
        }
    }
}
outputlist = new ArrayList();
flatten(inputlist, outputlist, 1); |
CreateFileLocation_2 |
beanshell |
Script:
out1 = "/home/sander/Downloads/" + in1 + ".xml"; |
Beanshells (5)
Name | Description | Inputs | Outputs
LookAtWatch | | (none) | out1
AddProvinance | | ResearcherID, ExtractionDate, ExtractedAbstract | AbstractWithProv
CreateFileLocation | | in1 | out1
CreateListOfArticlesThatNeedExtracting | | ExtractableIDs, IDsInDatabase | ListOfArticlesThatNeedExtracting, IDsInDatabaseOut, Test1
CreateFileLocation_2 | | in1 | out1
Outputs (5)
Name | Description
SolrImport_STDERR |
This is the standard error stream (STDERR) from Solr. If there were any errors while running Solr, the values will turn red. Red is generally considered to be a bad colour.
|
SolrImport_STDOUT |
This is the standard output stream (STDOUT) from Solr.
|
IDsInDatabaseOut |
|
Test1 |
|
Flatten_List_outputlist |
|
Datalinks (27)
Source | Sink
xpath:value | extractPMID:xpath
run_eSearch:parameters | extractPMID:xml-text
parametersXML_eFecth:output | run_eSearch:parameters
pubmed_database:value | parametersXML_eFecth:db
search_term | parametersXML_eFecth:term
end_date | parametersXML_eFecth:maxdate
start_date | parametersXML_eFecth:mindate
maximum_articles | parametersXML_eFecth:RetMax
Flatten_List:outputlist | Retrive_abstracts:pubmed_ids
pathToPostJar:value | SolrImport:pathToPostJar
CreateFileLocation_2:out1 | SolrImport:inputFile
LookAtWatch:out1 | AddProvinance:ExtractionDate
ResearcherID | AddProvinance:ResearcherID
Retrive_abstracts:AbstractXML | AddProvinance:ExtractedAbstract
AddProvinance:AbstractWithProv | Write_Text_File:filecontents
CreateFileLocation_2:out1 | Write_Text_File:outputFile
extractPMID:nodelist | CreateFileLocation:in1
CheckIfArticleIsInDatabase:STDOUT | CreateListOfArticlesThatNeedExtracting:IDsInDatabase
extractPMID:nodelist | CreateListOfArticlesThatNeedExtracting:ExtractableIDs
CreateFileLocation:out1 | CheckIfArticleIsInDatabase:ExtractableArticles
CreateListOfArticlesThatNeedExtracting:ListOfArticlesThatNeedExtracting | Flatten_List:inputlist
Flatten_List:outputlist | CreateFileLocation_2:in1
SolrImport:STDERR | SolrImport_STDERR
SolrImport:STDOUT | SolrImport_STDOUT
CreateListOfArticlesThatNeedExtracting:IDsInDatabaseOut | IDsInDatabaseOut
CreateListOfArticlesThatNeedExtracting:Test1 | Test1
Flatten_List:outputlist | Flatten_List_outputlist
Coordinations (2)
Controller | Target
CheckIfArticleIsInDatabase | CreateListOfArticlesThatNeedExtracting
Write_Text_File | SolrImport
Version 2 (of 5)
Reviews (1)
Title: Thanks!
Rating: 4 out of 5
Created: 2013-08-26 14:13:59 | Updated: 2013-08-26 14:13:59
A nice workflow for gathering our in-house document corpus and metadata.
Other workflows that use similar services
(31)
Only the first 2 workflows that use similar services are shown.
Gene to Pubmed
(3)
This workflow takes in a list of gene names and searches the PubMed database for corresponding articles. Any matches to the genes are then retrieved (abstracts only). These abstracts are then returned to the user.
Created: 2010-07-05
| Last updated: 2011-01-26
Credits:
Paul Fisher
Comments (0)
No comments yet