BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created: 2009-05-28 12:21:05


This workflow finds proteins relevant to the query string via the following steps:

1. A user query: a single gene/protein name, e.g. (EZH2 OR "Enhancer of Zeste").
2. Retrieve documents: finds 'maximumNumberOfHits' relevant documents (abstract + title) based on the query (the AIDA service inside is based on Apache Lucene).
3. Discover proteins: extracts proteins from the set of relevant abstracts with a 'named entity recognizer' trained on genomic terms using a Bayesian approach; the AIDA service inside is based on LingPipe. This subworkflow also filters false positives from the discovered proteins by requiring that each discovery has a valid UniProt ID. Martijn Schuemie's service for this lookup contains only human UniProt IDs, which is why this workflow only works for human proteins.
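The UniProt-based filtering in the last step can be pictured with a short sketch. This is not the workflow's own code: the dictionary is a made-up stand-in for Martijn Schuemie's human-only lookup service, the function name is hypothetical, and the two entries are illustrative only.

```python
# Hypothetical stand-in for the human-only UniProt lookup service.
# The entries are illustrative; the real service is a remote web service.
HUMAN_UNIPROT = {"EZH2": "Q15910", "EED": "O75530"}


def filter_by_uniprot(candidates, lookup):
    """Keep only candidate names that resolve to a UniProt ID."""
    kept = []
    for name in candidates:
        uniprot_id = lookup.get(name.upper())
        if uniprot_id is not None:
            kept.append((name, uniprot_id))
    return kept


# Terms tagged by the recognizer that have no ID are treated as false positives.
candidates = ["EZH2", "chromatin", "EED", "patients"]
print(filter_by_uniprot(candidates, HUMAN_UNIPROT))
```

Any name the lookup does not know is dropped, which is also how a true protein can be lost when the service has no entry for it.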

Workflow by Marco Roos (AID = Adaptive Information Disclosure, University of Amsterdam; http://adaptivedisclosure.org)

Text mining services by Sophia Katrenko and Edgar Meij (AID), and Martijn Schuemie (BioSemantics, Erasmus University Rotterdam).

Changes to our original BioAID_DiseaseDiscovery workflow:

* Stops at protein discovery.
* Use of Martijn Schuemie's synsets service to
  * add synonyms to the query.
  * provide uniprot ids to discovered proteins.
  * filter false positive discoveries: only proteins with a uniprot id go through; this introduces some false negatives (e.g. discovered proteins with a name shorter than 3 characters).
* Counting of results in various ways, but no outputs defined in this simplified workflow.
* Output into simple html table.
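As a rough illustration of adding synonyms to the query, the sketch below joins the query term and its synonyms into one OR-ed Lucene query of quoted phrases. How the workflow really builds the string is not documented here, so the joining rule and the function body are assumptions; only the name echoes the 'Concat_synonyms' beanshell.

```python
def concat_synonyms(query_term, synonyms):
    """Join a term and its synonyms into one OR-ed query of quoted phrases."""
    terms = []
    for term in [query_term, *synonyms]:
        term = term.strip()
        # Skip empty strings and duplicates, keeping first-seen order.
        if term and term not in terms:
            terms.append(term)
    return "(" + " OR ".join('"%s"' % t for t in terms) + ")"


print(concat_synonyms("EZH2", ["Enhancer of Zeste", "EZH2"]))
# prints: ("EZH2" OR "Enhancer of Zeste")
```

The sketch does not escape quotes inside a synonym; a real implementation would need to handle that.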


Run

Run this Workflow in the Taverna Workbench...

Option 1:

Copy and paste this link into File > 'Open workflow location...'
http://myexperiment.org/workflows/154/download?version=6

Workflow Components

Inputs (1)

Name	Description
query_protein	A protein name to query. A single gene/protein name is expected, because the 'ProteinSynonymsToQuery' workflow is used on the query.

Processors (22)

Name	Type	Description
default_max_hits	stringconstant	Default maximum number of documents to retrieve from MEDLINE for the query; proteins are extracted from these documents.
ExampleOutput	stringconstant
Document_index	stringconstant
search_field	stringconstant
PubMedURLstub	stringconstant
Flatten_list2	local
Flatten_list1	local
DummyRankScore	stringconstant
Flatten_list	local
Concatenate_URLstub_ID	local
CountProteins	beanshell
CountDocuments	beanshell
SliceOutListLevel_doc_ids	beanshell
Clone	beanshell
Retrieve_documents	workflow	This workflow retrieves relevant documents, based on a query optimized by adding a string to the original query that will rank the search output according to the most recent years. The added string adds years with priorities (most recent is highest); it starts at 2007.
SynonymsToQuery	workflow	This workflow creates a query string from the query term using Martijn Schuemie's synonym service. The service is limited to proteins, enzymes and genes. An input query that is a boolean string will be split and processed, but the boolean logic of the input query will be lost.
StructureLists	arbitrarywsdl
DiscoveredProteinsToHtmlTable	arbitrarywsdl
Discover_HumanUniProt_proteins	workflow	This workflow applies the discovery workflow built around the AIDA 'Named Entity Recognize' web service by Sophia Katrenko. It uses the pre-learned genomics model, named 'MedLine', to find genomics concepts in a set of documents in lucene output format.
file_html_doc	arbitrarywsdl
file_html_doc_init	arbitrarywsdl
Results_pending_html_doc	stringconstant
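The year-based ranking described for 'Retrieve_documents' could look roughly like the sketch below. It relies on two features of the classic Lucene query syntax: `^` boosts a clause and `+` marks a clause as required. The field name `year`, the boost values, the number of years, and the reading that 2007 is the most recent (highest-priority) year are all assumptions; the function names merely echo the 'Prioritise_lucene_query' beanshell and its ports.

```python
def build_priority_string(newest_year=2007, n_years=3, field="year"):
    """Build boosted year clauses; the newest year gets the highest boost."""
    clauses = []
    for offset in range(n_years):
        year = newest_year - offset
        boost = n_years - offset
        clauses.append(f"{field}:{year}^{boost}")
    return " ".join(clauses)


def prioritise_lucene_query(query_string, priority_string):
    """Require the original query and add the year clauses as optional boosts."""
    return f"+({query_string}) ({priority_string})"


print(prioritise_lucene_query('EZH2 OR "Enhancer of Zeste"', build_priority_string()))
# prints: +(EZH2 OR "Enhancer of Zeste") (year:2007^3 year:2006^2 year:2005^1)
```

Because the year clauses are optional, older documents still match; they only score lower than recent ones.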

Beanshells (9)

Name	Inputs	Outputs
Prioritise_lucene_query	query_string priority_string	lucene_query
Concat_synonyms	synonymlist query_term	new_query
UniProtOrNot	uniprotIDlist	uniprotID_or_False
SimpleFindAndReplace	input findstring replacestring	output
FilterTrueProteinByUniProtID	protein uniprot	true_protein true_uniprot
CountProteins	list	count
CountDocuments	list	count
SliceOutListLevel_doc_ids	inListOfLists	outlist
Clone	copy_number input	clones
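The beanshell sources are not shown on this page, so the following is only a guess at what 'CountProteins' and 'Clone' do, based on their port names and on the links listed further down (the protein count feeds 'copy_number', and the document IDs feed 'input'). The sketch assumes the count travels as a string, which is why it is converted with `int()`; "DOC_1" is a placeholder, not a real document ID.

```python
def count_items(items):
    """Return the list length as a string (assumed port type)."""
    return str(len(items))


def clone(copy_number, value):
    """Return a list holding `copy_number` copies of `value`."""
    return [value for _ in range(int(copy_number))]


# One document ID repeated once per discovered protein, so that the
# protein list and the document ID list have the same length.
proteins = ["EZH2", "EED"]
print(clone(count_items(proteins), "DOC_1"))
# prints: ['DOC_1', 'DOC_1']
```

Under this reading, cloning lets each discovered protein be paired with the document it came from when the result lists are combined.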

Outputs (2)

Name	Description
discovery_html_table
discovery_table_url
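To give an idea of what 'discovery_html_table' might contain, here is a minimal sketch that renders result rows as a plain HTML table. The column names are taken from the input ports of 'StructureLists' in the links table; the actual layout produced by 'DiscoveredProteinsToHtmlTable' is not documented here, and the row values are placeholders.

```python
from html import escape

# Column names borrowed from the StructureLists input ports (an assumption
# about what the final table shows).
COLUMNS = [
    "query_protein",
    "discovered_protein",
    "discovered_uniprot_id",
    "pubmed_id",
    "ranking_score",
]


def to_html_table(rows):
    """Render a list of row dictionaries as a simple HTML table."""
    header = "".join(f"<th>{escape(col)}</th>" for col in COLUMNS)
    lines = ["<table>", f"<tr>{header}</tr>"]
    for row in rows:
        cells = "".join(f"<td>{escape(str(row[col]))}</td>" for col in COLUMNS)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


rows = [
    {
        "query_protein": "EZH2",
        "discovered_protein": "EED",
        "discovered_uniprot_id": "O75530",
        "pubmed_id": "DOC_1",  # placeholder, not a real PubMed ID
        "ranking_score": "1",  # stands in for the DummyRankScore constant
    }
]
print(to_html_table(rows))
```

Escaping every cell keeps stray `<` or `&` characters in protein names from breaking the markup.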

Links (26)

Source	Sink
Clone:clones	SliceOutListLevel_doc_ids:inListOfLists
CountProteins:count	Clone:copy_number
Discover_HumanUniProt_proteins:discovered_proteins	CountProteins:list
Discover_HumanUniProt_proteins:doc_ids	Clone:input
Discover_HumanUniProt_proteins:doc_ids	Concatenate_URLstub_ID:string2
Discover_HumanUniProt_proteins:doc_ids	CountDocuments:list
Document_index:value	Retrieve_documents:document_index
PubMedURLstub:value	Concatenate_URLstub_ID:string1
Retrieve_documents:relevant_documents	Discover_HumanUniProt_proteins:documents_from_lucene
default_max_hits:value	Retrieve_documents:maxHits
query_protein	SynonymsToQuery:query_term
Discover_HumanUniProt_proteins:discovered_proteins	StructureLists:discovered_protein
Discover_HumanUniProt_proteins:discovered_uniprot_ids	StructureLists:discovered_uniprot_id
DiscoveredProteinsToHtmlTable:DiscoveredProteinsToHtmlTableReturn	file_html_doc:content
DummyRankScore:value	StructureLists:ranking_score
Flatten_list:outputlist	Flatten_list1:inputlist
Results_pending_html_doc:value	file_html_doc_init:content
DiscoveredProteinsToHtmlTable:DiscoveredProteinsToHtmlTableReturn	discovery_html_table
SliceOutListLevel_doc_ids:outlist	StructureLists:pubmed_id
StructureLists:StructureListsReturn	Flatten_list:inputlist
SynonymsToQuery:new_query	Retrieve_documents:query_string
query_protein	StructureLists:query_protein
Flatten_list1:outputlist	Flatten_list2:inputlist
Flatten_list2:outputlist	DiscoveredProteinsToHtmlTable:structuredList
file_html_doc:file_html_docReturn	discovery_table_url
search_field:value	Retrieve_documents:search_field
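Two parts of this wiring lend themselves to a small sketch: the chain 'Flatten_list' → 'Flatten_list1' → 'Flatten_list2', which presumably removes three levels of list nesting before the HTML table is built, and 'Concatenate_URLstub_ID', which joins a URL stub with a document ID. The functions below are illustrative rather than the Taverna local workers themselves, and the stub URL is an assumption because the value of 'PubMedURLstub' is not shown on this page.

```python
def flatten_once(nested):
    """Remove one level of nesting from a list."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    return flat


def concatenate(string1, string2):
    """Join the URL stub (string1) and the document ID (string2)."""
    return string1 + string2


# Three passes mirror the three flatten processors in the links table.
nested = [[[["a", "b"]], [["c"]]]]
flat = flatten_once(flatten_once(flatten_once(nested)))
print(flat)
# prints: ['a', 'b', 'c']

ASSUMED_STUB = "https://pubmed.ncbi.nlm.nih.gov/"  # assumption, not from the workflow
print(concatenate(ASSUMED_STUB, "DOC_1"))
```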

Coordinations (2)

Controller	Target
file_html_doc_init	file_html_doc
file_html_doc_init	SynonymsToQuery

Workflow Type

Taverna 1

Uploader

Marco Roos

License

All versions of this Workflow are licensed under:

Version 6 (of 11)

Credits (4)

(People/Groups)

Attributions (1)

(Workflows/Files)

BioAID_DiseaseDiscovery_RatHumanMouseUniprotFilter

Tags (9)

Uploader tags

Shared with Groups (2)

Featured In Packs (1)

AIDA demo pack

Attributed By (0)

(Workflows/Files)

None

Favourited By (0)

No one

Statistics

11847 viewings

5789 downloads

Citations (0)

None

Version History

In chronological order:

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created by Marco Roos on Friday 29 February 2008 01:34:46 (UTC)

Last edited by Marco Roos on Friday 29 February 2008 01:34:47 (UTC)

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created by Marco Roos on Friday 29 February 2008 01:34:46 (UTC)

Last edited by Marco Roos on Wednesday 05 March 2008 08:12:04 (UTC)

Revision comment:

Demo

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created by Marco Roos on Friday 29 February 2008 01:34:46 (UTC)

Last edited by Marco Roos on Thursday 15 May 2008 11:41:50 (UTC)

Revision comment:

Balanced list levels for I/O of all beanshells.
Temporarily switched to development service for document search service due to problems with index files.

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created by Marco Roos on Friday 29 February 2008 01:34:46 (UTC)

Last edited by Marco Roos on Thursday 15 May 2008 17:37:31 (UTC)

Revision comment:

Added new simple web service that provides the html document on a publicly accessible URL.

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created by Marco Roos on Friday 29 February 2008 01:34:46 (UTC)

Last edited by Marco Roos on Thursday 15 May 2008 22:26:46 (UTC)

Revision comment:

Added initial 'results pending' html doc.

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created by Marco Roos on Friday 29 February 2008 01:34:46 (UTC)

Last edited by Marco Roos on Thursday 15 May 2008 23:22:52 (UTC)

Revision comment:

Updated mime type of url output.

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created by Marco Roos on Friday 29 February 2008 01:34:46 (UTC)

Last edited by Marco Roos on Monday 28 July 2008 20:48:45 (UTC)

Revision comment:

Repaired this workflow. Creating the html is done by a beanshell again.

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created by Marco Roos on Friday 29 February 2008 01:34:46 (UTC)

Last edited by Marco Roos on Wednesday 29 October 2008 09:29:36 (UTC)

Revision comment:

Repaired this workflow. Creating the html is done by a beanshell again.

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created by Marco Roos on Sunday 14 December 2008 21:42:40 (UTC)

Last edited by Marco Roos on Sunday 14 December 2008 21:44:19 (UTC)

Revision comment:

Workflow running from production servers

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created by Marco Roos on Thursday 26 March 2009 20:18:55 (UTC)

Last edited by Marco Roos on Thursday 26 March 2009 20:22:03 (UTC)

Revision comment:

Minor changes to compensate for the changes caused by a migration to a new server. In some cases the changes are temporary until everything is migrated. The functionality of the workflow did not change.

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html

Created by Marco Roos on Thursday 28 May 2009 12:21:05 (UTC)

Revision comment:

Synsets service moved.

Reviews (0)

No reviews yet


Comments (1)


Giovanni Dall'Olio

I am not sure I understand what this workflow does.

Can you please add a use case or an example of how to use it?
What exactly do you mean by 'proteins relevant to the query string'? Proteins that interact with the query gene? Or that are involved in the same metabolism?

With which data have you tested this workflow? Which queries have you tried?

Other workflows that use similar services (10)

Only the first 2 workflows that use similar services are shown.

Taverna 1

Uploader

Marco Roos

BioAID_DiseaseDiscovery_RatHumanMouseUnipr... (4)


This workflow finds disease relevant to the query string via the following steps: 1. A user query: a list of terms or boolean query - look at the Apache Lucene project for all details. E.g.: (EZH2 OR "Enhancer of Zeste" +(mutation chromatin) -clinical); consider adding 'ProteinSynonymsToQuery' in front of the input if your query is a protein. 2. Retrieve documents: finds 'maximumNumberOfHits' relevant documents (abstract+title) based on query (the AIDA service inside is based on Apa...

Created: 2008-12-15 | Last updated: 2011-08-11

Credits: Marco Roos AID

Taverna 1

Uploader

Marco Roos

BioAID_ProteinToDiseases (1)


This workflow was based on BioAID_DiseaseDiscovery (changes: expects only one protein name, adds protein synonyms). This workflow finds diseases relevant to the query string via the following steps: A user query: a single protein name Add synonyms (service courtesy of Martijn Schuemie, Erasmus University Rotterdam) Retrieve documents: finds relevant documents (abstract+title) based on query Discover proteins: extract proteins discovered in the set of relevant abstracts 5. Link proteins ...

Created: 2007-11-14 | Last updated: 2007-11-15

Credits: Marco Roos Martijn Schuemie AID

Attributions: BioAID_DiseaseDiscovery_RatHumanMouseUniprotFilter