BioAID_EnirchBioModelWithProteinsFromText
This workflow extracts proteins and protein relations from Medline. Extracted protein names (symbols of at least 3 characters) are validated against mouse, rat, and human UniProt symbols, so the results are limited to these species.
The workflow follows these basic steps:
- it retrieves documents relevant to the query string
- it discovers proteins in those documents, which are considered relevant to the query string (co-location, in text mining terms)
- it extracts protein-protein relations (slightly stronger than co-location)
In addition, the results are added to a biological model to support hypothesis formation and to a procedural model to log trails to evidence. The models are based on description logic (RDF/OWL format).
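The validation rule mentioned in the description (symbols of at least 3 characters, checked against known UniProt symbols) can be illustrated with a minimal Python sketch. The function name, the case-insensitive comparison, and the tiny symbol list are assumptions made for illustration; the workflow itself performs this step through the UniProt services credited below.

```python
def validate_protein_names(candidates, known_symbols, min_length=3):
    """Keep candidates that are long enough and match a known symbol.

    Case-insensitive matching is an assumption of this sketch.
    """
    known = {symbol.upper() for symbol in known_symbols}
    return [
        name for name in candidates
        if len(name) >= min_length and name.upper() in known
    ]


# Hypothetical stand-in for the mouse, rat, and human UniProt symbol list.
known_symbols = ["TP53", "BRCA1", "EGFR"]

candidates = ["TP53", "p53", "AR", "brca1", "kinase"]
print(validate_protein_names(candidates, known_symbols))
```

Names shorter than three characters are dropped before the lookup, so a two-letter symbol is never reported even if it is a genuine protein symbol.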
Acknowledgements:
Synonyms and UniProt services: Martijn Schuemie, BioSemantics Group, University of Rotterdam, The Netherlands (BioRange project)
Run
Run this Workflow in the Taverna Workbench...
Option 1:
Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/379/download?version=3
Taverna is available from http://taverna.sourceforge.net/
If you are having problems downloading it in Taverna, you may need to provide your username and password in the URL so that Taverna can access the Workflow:
Replace http:// in the link above with http://yourusername:yourpassword@
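As a small illustration of the URL rewrite described above, the following Python sketch inserts credentials after the `http://` prefix. The helper name `add_credentials` is hypothetical, and percent-encoding special characters in the username and password is an assumption about what Taverna accepts, not something confirmed here.

```python
from urllib.parse import quote


def add_credentials(url, username, password):
    """Insert 'username:password@' directly after the http:// prefix."""
    prefix = "http://"
    if not url.startswith(prefix):
        raise ValueError("expected an http:// URL")
    # Percent-encode characters such as '@' or ':' that would otherwise
    # be confused with the URL's own delimiters (an assumption of this sketch).
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    return prefix + userinfo + url[len(prefix):]


link = "http://www.myexperiment.org/workflows/379/download?version=3"
print(add_credentials(link, "yourusername", "yourpassword"))
```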
Workflow Components
Name | Description |
---|---|
query | Biological query, e.g. a protein of interest. See the Lucene documentation for advanced queries (http://lucene.apache.org/). For terms within double quotes, synonyms for protein names are looked up and added to the query. |
max_document_nr | Limits the maximum number of hits the search will produce. In Taverna 1, '100' works well, while 1000 and above is likely to halt Taverna 1 due to memory problems. This also depends on the memory setting of the Java virtual machine on the client (usually your local Taverna). |
aida_magic_word | A magic word is required to make use of the AIDA semantic repository for BioAID workflows. Please ask Scott Marshall (marshall@science.uva.nl) or Marco Roos (M.Roos1@uva.nl) for the magic word. NB: this semantic repository is for temporary data only. You should expect the repository to be cleared often and without warning. |
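The synonym handling described for the `query` input can be sketched as follows. This is only an illustration in Python: the function name, the synonym dictionary, and the choice to join alternatives with `OR` are assumptions, and the actual workflow obtains synonyms from a web service rather than a local table.

```python
import re


def expand_quoted_terms(query, synonyms):
    """Replace each double-quoted term by an OR group of the term and its synonyms."""
    def replace(match):
        term = match.group(1)
        alternatives = [term] + [s for s in synonyms.get(term, []) if s != term]
        quoted = " OR ".join(f'"{alt}"' for alt in alternatives)
        # Only wrap in parentheses when something was actually added.
        return f"({quoted})" if len(alternatives) > 1 else quoted

    return re.sub(r'"([^"]+)"', replace, query)


# Hypothetical synonym table standing in for the synonym service.
synonyms = {"EZH2": ["ENX-1", "KMT6"]}

print(expand_quoted_terms('"EZH2" AND chromatin', synonyms))
```

Terms outside double quotes are left untouched, which mirrors the description that only quoted terms are expanded.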
Name | Type | Description |
---|---|---|
false | stringconstant | |
Timestamp | beanshell | |
true | stringconstant | |
s07_AddScoreToSemanticModel | workflow | Add RDF cf: @prefix mybio: <http://rdf.adaptivedisclosure.org/owl/BioAID/myModel/Proto-ontology/BiologicalModel.owl#> . @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix pub: <http://rdf.adaptivedisclosure.org/owl/BioAID/myModel/Proto-ontology/Publication.owl#> . @prefix dsc: <http://rdf.adaptivedisclosure.org/owl/BioAID/myModel/Enriched-ontology/DiscoveredEntities.owl#> . @prefix owl: <http://www.w3.org/2002/07/owl#> . @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . <http://rdf.adaptivedisclosure.org/owl/BioAID/myModel/Enriched-ontology/BiologicalDiscoveries.owl#ExampleInstance_DiscoveredEnzyme> a <http://rdf.adaptivedisclosure.org/owl/BioAID/myModel/Enriched-ontology/BiologicalDiscoveries.owl#DiscoveredEnzyme> ; dsc:hasLikelihoodScore "1.0"^^<http://www.w3.org/2001/XMLSchema#float> . |
s06_AddProteinRelationToSemanticModel | workflow | |
06_UniProtXrefURLs | workflow | Adds URL cross references to various protein information resources. |
s01_AddBiologicalModelToSemanticModel | workflow | Add Query to Semantic model with Sesame service cf example Biological Query |
s03_AddExpandedQueryToSemanticModel_Obsolete | workflow | Add Query to Semantic model with Sesame service cf example Biological Query |
s05_AddProteinToSemanticModel | workflow | Add Protein to Semantic model with Sesame service cf example Discovered Proteins |
02_RetrieveDocumentsFromMedline | workflow | This workflow applies the search web service from the AIDA toolbox. Comments: this search service is based on Lucene defaults; it may be necessary to optimize the query string to adapt the behaviour to what is most relevant in a particular domain (e.g. for Medline, prioritizing based on publication date is useful). Lucene favours shorter sentences, which may be bad for subsequent information extraction. |
s03_AddExpandedQueryToSemanticModel | workflow | Add automatically expanded query to Semantic model. |
s02_AddOriginalQueryToSemanticModel | workflow | Add Query to Semantic model with Sesame service cf example Biological Query |
05_ScoreExtractedProteins | workflow | This workflow calculates a min log likelihood score for the combination of a discovered protein and a protein of interest (the query protein). Note that at the moment the total count of Medline papers, which is part of the formula, is hard-coded and not exact. Given its size, this should not matter much, and certainly not when comparing likelihoods calculated using the same value. |
04_ExtractProteinRelations_HomoSapiens | workflow | Workflow to extract protein-protein interactions from text, followed by filtering for protein names known as human protein names. The protein-protein interaction service takes the output in 'IOB' format from applyCRF, which annotates proteins as such in text. |
01_ProcessQuery | workflow | Workflow to optimize a Lucene document retrieval query to 1. increase the priority of recent years (in decreasing order from 2009 down to 2002) 2. limit a subsequent search to a specific organism using a MeSH organism tag |
s04_AddDocToSemanticModel | workflow | Add Document to Semantic model with Sesame service cf example discovered document |
03_ExtractProteins_HomoSapiens | workflow | Workflow to extract proteins from text, followed by filtering protein names known as human protein names. |
s00_InitializeSemanticStorage | workflow |
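The query optimization described for 01_ProcessQuery (boosting recent years from 2009 down to 2002 and restricting the search to one organism) might look roughly like the Python sketch below. The index field names `year` and `mesh`, the boost values, and the function signature are assumptions for illustration only; the syntax used (`+` for a required clause, `^` for a boost) is standard Lucene query syntax.

```python
def prioritise_lucene_query(query_string, organism=None, newest=2009, oldest=2002,
                            year_field="year", organism_field="mesh"):
    """Build a Lucene query that requires the original query and boosts recent years."""
    years = range(newest, oldest - 1, -1)
    count = len(years)
    # The newest year gets the highest boost, decreasing by one per year.
    boosts = [f"{year_field}:{year}^{count - i}" for i, year in enumerate(years)]
    clauses = [f"+({query_string})"]
    if organism:
        # Required clause limiting hits to one organism tag (field name is hypothetical).
        clauses.append(f'+{organism_field}:"{organism}"')
    return " ".join(clauses + boosts)


print(prioritise_lucene_query("EZH2", organism="Humans"))
```

The year clauses are optional, so older documents can still be returned; they only rank lower than recent ones.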
Name | Description | Inputs | Outputs |
---|---|---|---|
InstantiateSemanticType | | instance_ontology_url, instance_name, type_uri, label_string, comment_string | NTriple_InstanceOf_statement, instance_uri |
ReplaceCharsForID | | input | output |
NTriple_AnnotationsOfInstance_statement | | instance_uri, label_string, comment_string | NTriple_LabelCommentOfInstance_statement |
NTriple_QueryRelation_stmnt | | domain_instance_uri, range_instance_uri, relation_ontology_url, relation_name | NTriple_Relation_statement |
SlashDoubleQuotes | | input, findstring, replacestring | output |
NTriple_QueryPropertyOfInstance_statement | | instance_uri, property_ontology_url, property_name, property_string | NTriple_PropertyOfInstance_statement, property_uri |
ConcatenateRDFdoc | | NTripleInstanceStatement, NTripleAnnotationsStatement, NTriplePropertyStatement, NTripleRelationStatement | RDFdoc |
ReplaceCharsForQueryID | | input | output |
NTriple_InstanceOf_statement | | instance_ontology_url, type_ontology_url, instance_name, type_name | instance_uri, NTriple_InstanceOf_statement |
DefineSemanticRelation_hasParticipant | | domain_instance_uri, relation_uri, range_instance_uri | NTriple_Relation_statement |
DefineSemanticRelation_discovered_by | | domain_instance_uri, relation_uri, range_instance_uri | NTriple_Relation_statement |
DefineSemanticRelation_references | | domain_instance_uri, relation_uri, range_instance_uri | NTriple_Relation_statement |
InstantiateSemanticType_Protein | | instance_ontology_url, instance_name, type_uri, label_string, comment_string, datetime | NTriple_InstanceOf_statement, instance_uri |
InstantiateSemanticType_ProteinTerm | | instance_ontology_url, instance_name, type_uri, label_string, comment_string, datetime | NTriple_InstanceOf_statement, instance_uri |
ProteinTermAnnotations | | protein_name, uniprot_id | protein_label, protein_comment, protein_term_label, protein_term_comment |
ConcatenateStringList | | stringlist, delimiter | output |
InstantiateQueryInstance | | instance_ontology_url, instance_name, type_uri, label_string, comment_string, datetime | NTriple_InstanceOf_statement, instance_uri |
ReplaceCharsForQueryID | | input | output |
DefineSemanticRelation_expansion_of | | domain_instance_uri, relation_uri, range_instance_uri | NTriple_Relation_statement |
DefineBooleanPropertyOfInstance | | instance_uri, property_uri, property_string | NTriple_PropertyOfInstance_statement |
ConcatenateRDFstatements | | stringlist, delimiter | output |
DefinePropertyOfInstance | | property_uri, property_string, instance_uri | NTriple_PropertyOfInstance_statement |
DefineSemanticRelation_references | | domain_instance_uri, relation_uri, range_instance_uri | NTriple_Relation_statement |
SlashDoubleQuotes | | input, findstring, replacestring | output |
DefineSemanticRelation | | domain_instance_uri, relation_uri, range_instance_uri | NTriple_Relation_statement |
DefinePropertyOfInstance | | property_uri, property_string, instance_uri | NTriple_PropertyOfInstance_statement |
ConcatenateRDFstatements | | stringlist, delimiter | output |
ReplaceCharsForQueryID | | input | output |
InstantiateQueryInstance | | instance_ontology_url, instance_name, type_uri, label_string, comment_string, datetime | NTriple_InstanceOf_statement, instance_uri |
SlashDoubleQuotes | | input, findstring, replacestring | output |
DefineBooleanPropertyOfInstance | | instance_uri, property_uri, property_string | NTriple_PropertyOfInstance_statement |
RelativeFrequencyPoiInCorpus | | corpus_total, poi_count_in_corpus | relative_frequency |
RelativeFrequencyPoiInCorpus | | corpus_total, poi_count_in_corpus | relative_frequency |
PoiAndDpQuery | | poi_query, dp_query | poi_and_dp_query |
CountListElements | | list | count |
CloneQueries | | copy_number, input | clones |
MinLogLikelihood | | query_frequency, discovered_frequency, query_discovered_frequency, total_frequency | minloglikelihood |
CloneFrequencies | | copy_number, input | clones |
UniProtOrNot2 | | uniprotIDlist | uniprotID_or_False |
UniProtOrNot1 | | uniprotIDlist | uniprotID_or_False |
FilterTrueProteinPairsByUniProtID | | protein1, uniprot1, protein2, uniprot2 | true_protein1, true_uniprot1, true_protein2, true_uniprot2 |
example_interaction_doc | | | interaction_doc |
ConcatenateRelation | | protein_name1, protein_name2, interaction_term, uniprot_id1, uniprot_id2 | relation, id_relation |
FilterTrueProteinByUniProtID | | protein, uniprot | true_protein, true_uniprot |
UniProtOrNot | | uniprotIDlist | uniprotID_or_False |
ReplaceLuceneTokenByOriginalTerm | | query_string, tokenlist | output_list |
Concat_synonyms | | synonymlist, query_term | new_query |
ListFindAndReplace | | input, findstringlist, replacestringlist | output |
Prioritise_lucene_query | | query_string, priority_string | lucene_query |
InstantiateSemanticType | | instance_ontology_url, instance_name, type_uri, label_string, comment_string, datetime | NTriple_InstanceOf_statement, instance_uri |
DefineSemanticRelation_has_output | | domain_instance_uri, relation_uri, range_instance_uri | NTriple_Relation_statement |
DefineSemanticRelation_discovered_by | | domain_instance_uri, relation_uri, range_instance_uri | NTriple_Relation_statement |
NTriple_WorkflowRelation_stmnt | | domain_instance_uri, relation_uri, range_instance_uri | NTriple_Relation_statement |
ConcatenateStringList | | stringlist, delimiter | output |
UniProtOrNot | | uniprotIDlist | uniprotID_or_False |
FilterTrueProteinByUniProtID | | protein, uniprot | true_protein, true_uniprot |
Strip_xml | | tagged_term | term |
RepositoryRef | | sesame_url, repository | bioaid_repository_url |
Timestamp | | | now_RFC822, now_short, now_ISO8601 |
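Several of the components listed above take an ontology URL, an instance name, a type URI, and a label, and return N-Triples statements plus an instance URI. A minimal Python sketch of that idea follows. It is not the workflow's Beanshell code: joining the ontology URL and the instance name with `#`, the function names, and the use of `rdfs:label` for the label are assumptions. The example URIs are taken from the RDF example in the s07_AddScoreToSemanticModel description.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text):
    """Escape backslashes and double quotes for use in an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def instantiate_semantic_type(instance_ontology_url, instance_name, type_uri, label_string):
    """Return the instance URI and N-Triples statements typing and labelling it."""
    # Assumption: instances are addressed as <ontology URL>#<instance name>.
    instance_uri = f"{instance_ontology_url}#{instance_name}"
    type_statement = f"<{instance_uri}> <{RDF_TYPE}> <{type_uri}> ."
    label_statement = f'<{instance_uri}> <{RDFS_LABEL}> "{escape_literal(label_string)}" .'
    return instance_uri, [type_statement, label_statement]


base = "http://rdf.adaptivedisclosure.org/owl/BioAID/myModel/Enriched-ontology/BiologicalDiscoveries.owl"
uri, statements = instantiate_semantic_type(
    base, "ExampleInstance_DiscoveredEnzyme", base + "#DiscoveredEnzyme", 'An "example" enzyme'
)
print(uri)
print("\n".join(statements))
```

Escaping double quotes before embedding a label in a literal corresponds to the role the SlashDoubleQuotes component appears to play, judging by its name and ports.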
Name | Description |
---|---|
ProteinURL | |
PubMed_URL | |
protein_discovery_score | |
Protein_name | |
Relation | |
RDFtriples_doc_url | |
BioAID_RDFrepository_URL |
Source | Sink |
---|---|
aida_magic_word | s00_InitializeSemanticStorage:aida_magic_word |
max_document_nr | 02_RetrieveDocumentsFromMedline:maxHits |
query | 01_ProcessQuery:query_string |
01_ProcessQuery:extended_lucene_query | 02_RetrieveDocumentsFromMedline:queryString |
01_ProcessQuery:extended_lucene_query | 05_ScoreExtractedProteins:query |
01_ProcessQuery:extended_lucene_query | s03_AddExpandedQueryToSemanticModel:query |
01_ProcessQuery:extended_lucene_query | s03_AddExpandedQueryToSemanticModel_Obsolete:expanded_query |
02_RetrieveDocumentsFromMedline:abstract | 03_ExtractProteins_HomoSapiens:input_text |
02_RetrieveDocumentsFromMedline:abstract | 04_ExtractProteinRelations_HomoSapiens:input_text |
02_RetrieveDocumentsFromMedline:pubmed_URL | s04_AddDocToSemanticModel:pubmed_URL |
02_RetrieveDocumentsFromMedline:pubmed_id | s04_AddDocToSemanticModel:pubmed_id |
03_ExtractProteins_HomoSapiens:protein_name | 05_ScoreExtractedProteins:discovered_protein |
03_ExtractProteins_HomoSapiens:protein_name | s05_AddProteinToSemanticModel:protein_name |
03_ExtractProteins_HomoSapiens:uniprotID | 06_UniProtXrefURLs:UniProtID |
03_ExtractProteins_HomoSapiens:uniprotID | s05_AddProteinToSemanticModel:uniprot_id |
04_ExtractProteinRelations_HomoSapiens:relation_term | s06_AddProteinRelationToSemanticModel:interaction_term |
04_ExtractProteinRelations_HomoSapiens:uniprot_id1 | s06_AddProteinRelationToSemanticModel:uniprot_id1 |
04_ExtractProteinRelations_HomoSapiens:uniprot_id2 | s06_AddProteinRelationToSemanticModel:uniprot_id2 |
05_ScoreExtractedProteins:min_log_likelihood | s07_AddScoreToSemanticModel:score |
06_UniProtXrefURLs:EntrezUniProtURL | s05_AddProteinToSemanticModel:entrez_pubmed_URL |
06_UniProtXrefURLs:ExpasyUniProtURL | s05_AddProteinToSemanticModel:expasy_URL |
06_UniProtXrefURLs:iHopSearchURL | s05_AddProteinToSemanticModel:iHop_search_URL |
06_UniProtXrefURLs:iHopSentencesURL | s05_AddProteinToSemanticModel:iHop_sentence_URL |
Timestamp:now_ISO8601 | s03_AddExpandedQueryToSemanticModel:datetime |
Timestamp:now_ISO8601 | s04_AddDocToSemanticModel:datetime |
Timestamp:now_ISO8601 | s05_AddProteinToSemanticModel:datetime |
false:value | s00_InitializeSemanticStorage:do_not_add_to_repository |
false:value | s00_InitializeSemanticStorage:do_not_clear_repository |
false:value | s00_InitializeSemanticStorage:do_not_clear_tmp_rdf_file |
query | s01_AddBiologicalModelToSemanticModel:BioModelComment |
query | s01_AddBiologicalModelToSemanticModel:ModelIdentifyingName |
query | s02_AddOriginalQueryToSemanticModel:query |
04_ExtractProteinRelations_HomoSapiens:protein1 | s06_AddProteinRelationToSemanticModel:protein_name1 |
s00_InitializeSemanticStorage:BioAIDinstances_ontology_url | s01_AddBiologicalModelToSemanticModel:InstanceOntologyURL |
s00_InitializeSemanticStorage:BioAIDinstances_ontology_url | s02_AddOriginalQueryToSemanticModel:InstanceOntologyURL |
s00_InitializeSemanticStorage:BioAIDinstances_ontology_url | s03_AddExpandedQueryToSemanticModel_Obsolete:InstanceOntologyURL |
s00_InitializeSemanticStorage:BioAIDinstances_ontology_url | s04_AddDocToSemanticModel:instance_ontology_url |
s00_InitializeSemanticStorage:RDFoutput_doc_url | s02_AddOriginalQueryToSemanticModel:RDF_doc_filename |
s00_InitializeSemanticStorage:RDFoutput_doc_url | s03_AddExpandedQueryToSemanticModel_Obsolete:RDF_doc_filename |
s00_InitializeSemanticStorage:RDFoutput_doc_url | s04_AddDocToSemanticModel:rdf_output_doc_url |
s00_InitializeSemanticStorage:TextMining_ontology_url | s03_AddExpandedQueryToSemanticModel_Obsolete:TextMiningOntologyURL |
s02_AddOriginalQueryToSemanticModel:query_instance | s03_AddExpandedQueryToSemanticModel_Obsolete:original_query_instance_uri |
02_RetrieveDocumentsFromMedline:pubmed_URL | PubMed_URL |
03_ExtractProteins_HomoSapiens:protein_name | Protein_name |
04_ExtractProteinRelations_HomoSapiens:protein2 | s06_AddProteinRelationToSemanticModel:protein_name2 |
04_ExtractProteinRelations_HomoSapiens:relation | Relation |
05_ScoreExtractedProteins:min_log_likelihood | protein_discovery_score |
06_UniProtXrefURLs:EntrezUniProtURL | ProteinURL |
Timestamp:now_ISO8601 | s02_AddOriginalQueryToSemanticModel:datetime |
s00_InitializeSemanticStorage:BioAID_Repository_url | BioAID_RDFrepository_URL |
s00_InitializeSemanticStorage:BioAIDinstances_ontology_url | s03_AddExpandedQueryToSemanticModel:InstanceOntologyURL |
s00_InitializeSemanticStorage:BioAIDinstances_ontology_url | s05_AddProteinToSemanticModel:instance_ontology_url |
s00_InitializeSemanticStorage:RDFoutput_doc_url | s01_AddBiologicalModelToSemanticModel:RDF_doc_filename |
s00_InitializeSemanticStorage:RDFoutput_doc_url | s03_AddExpandedQueryToSemanticModel:RDF_doc_filename |
s00_InitializeSemanticStorage:RDFoutput_doc_url | s05_AddProteinToSemanticModel:tmp_rdf_output_fileref |
s01_AddBiologicalModelToSemanticModel:biomodel_instance_uri | s02_AddOriginalQueryToSemanticModel:model_instance_uri |
s01_AddBiologicalModelToSemanticModel:biomodel_instance_uri | s03_AddExpandedQueryToSemanticModel:model_instance_uri |
s01_AddBiologicalModelToSemanticModel:biomodel_instance_uri | s05_AddProteinToSemanticModel:model_instance_uri |
s02_AddOriginalQueryToSemanticModel:query_instance | s03_AddExpandedQueryToSemanticModel:original_query_instance_uri |
s03_AddExpandedQueryToSemanticModel_Obsolete:QueryRDFdoc_url | RDFtriples_doc_url |
s04_AddDocToSemanticModel:doc_instance_uri | s05_AddProteinToSemanticModel:doc_instance_uri |
s05_AddProteinToSemanticModel:protein_instance | s07_AddScoreToSemanticModel:protein_instance |
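The `min_log_likelihood` value passed along the links above is computed from four frequencies (see the MinLogLikelihood component's inputs), but the exact formula is not given on this page. Purely as an illustration, the Python sketch below assumes one plausible form: the negative logarithm of the ratio between the observed co-occurrence count and the count expected if the query protein and the discovered protein were independent. Under this assumed form, lower (more negative) scores indicate stronger association. The example counts are hypothetical.

```python
import math


def min_log_likelihood(query_frequency, discovered_frequency,
                       query_discovered_frequency, total_frequency):
    """Negative log of observed versus expected co-occurrence (assumed formula)."""
    counts = (query_frequency, discovered_frequency,
              query_discovered_frequency, total_frequency)
    if min(counts) <= 0:
        raise ValueError("all frequencies must be positive")
    # Expected number of documents mentioning both, assuming independence.
    expected = query_frequency * discovered_frequency / total_frequency
    return -math.log(query_discovered_frequency / expected)


# Hypothetical document counts; the corpus total is a rough placeholder,
# in the spirit of the hard-coded, inexact total mentioned in the description.
print(min_log_likelihood(1000, 500, 50, 18_000_000))
```

Because the corpus total enters every score in the same way, an inexact total shifts all scores by a constant and leaves their ranking unchanged, which is consistent with the remark that the hard-coded value should not matter much for comparisons.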
Workflow Type
Version 3 (of 7)
- AIDA
- biorange_nl
- e-science
- hcls2009
- owl
- protein
- rdf
- semantic_web
- text_mining
- VL-e
Shared with Groups (11)
Statistics
In chronological order:
- Created by Marco Roos on Friday 22 August 2008 11:00:29 (UTC)
  Last edited by Marco Roos on Tuesday 26 August 2008 15:56:58 (UTC)
- Created by Marco Roos on Friday 22 August 2008 11:00:29 (UTC)
  Last edited by Marco Roos on Tuesday 26 August 2008 16:05:11 (UTC)
  Revision comment: Protein discovery workflow that stores instances in a semantic model that separates biology (intensional) knowledge from procedural (extensional) knowledge. Semantic types in the semantic sub-workflows are obtained provisionally using strings produced by a workflow (GetFromSesame.xml) that gets types from a Sesame repository containing the template ontologies.
- Created by Marco Roos on Friday 22 August 2008 11:00:29 (UTC)
  Last edited by Marco Roos on Wednesday 27 August 2008 23:15:01 (UTC)
  Revision comment: Minor update. Changed the description.
- Created by Marco Roos on Wednesday 29 October 2008 09:14:51 (UTC)
  Revision comment: Adjustments for semantic model updates.
- Created by Marco Roos on Wednesday 29 October 2008 09:27:18 (UTC)
  Last edited by Marco Roos on Friday 15 May 2009 16:38:18 (UTC)
  Revision comment: Minor updates to get the syncing of document instances and proteins right.
- Created by Marco Roos on Saturday 16 May 2009 00:57:14 (UTC)
  Revision comment: Temporary switch to development server because of minor unresolved issues with production server. Changed location of SynSets service.
- Created by Marco Roos on Saturday 16 May 2009 01:06:26 (UTC)
  Last edited by Marco Roos on Saturday 16 May 2009 01:13:19 (UTC)
  Revision comment: Temporary return to AIDA development server due to minor issues with production server. Changed URL of SynSets service (it moved).
Reviews (0)
Other workflows that use similar services (1)
BioAID_ProteinDiscovery_filterOnHumanUnipr... (11)
Created: 2009-05-28
Credits: Marco Roos Martijn Schuemie AID AID_myGrid_collaboration
Attributions: BioAID_DiseaseDiscovery_RatHumanMouseUniprotFilter
Comments (1)
This workflow may need some work because of a recent server migration... Our apologies.