This workflow retrieves relevant documents using an optimized query: a string is appended to the original query so that the search output is ranked by year, most recent first. The appended string lists years with priorities (the most recent year has the highest priority); the list starts at 2007 and counts backwards.
This workflow applies the search web service from the AIDA toolbox.
Comments:
This search service is based on Lucene defaults; it may be necessary to optimize the query string to adapt the behaviour to what is most relevant in a particular domain (e.g. for Medline, prioritizing by publication date is useful). Lucene favours shorter sentences, which may be bad for subsequent information extraction.
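The year priorities described above can also be generated instead of typed by hand, which may help when adapting the query string to another starting year or domain. The following is a minimal sketch, not part of the workflow itself: the class name YearPriority, the method build and its parameters are hypothetical, and only the field name "year" and the term^boost notation are taken from the workflow's priority string.

```java
import java.util.StringJoiner;

class YearPriority {
    // Builds a Lucene clause such as year:(2007^3 2006^2 2005^1).
    // newestYear and count are illustrative parameters, not workflow ports.
    static String build(int newestYear, int count) {
        StringJoiner years = new StringJoiner(" ", "year:(", ")");
        for (int i = 0; i < count; i++) {
            int boost = count - i; // the newest year gets the highest boost
            years.add((newestYear - i) + "^" + boost);
        }
        return years.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(2007, 3)); // prints year:(2007^3 2006^2 2005^1)
    }
}
```

Generating the clause makes it harder to repeat a year or skip a boost value by accident.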
http://ws.adaptivedisclosure.org/axis/services/SearcherWS?wsdl
search
text/xml
This workflow does four things:
1. it retrieves documents relevant to the query string
2. it discovers entities in those documents; these are considered relevant entities
3. it filters proteins from those entities (on the tag protein_molecule)
4. it removes the query terms from the list produced by step 3 (the query terms are temporarily tagged as proteins)
ToDo
* Replace step 4 by the following procedure:
  1. remove the query terms from the output of NER (probably by a regexp matching on what is inside the tag, possibly case-insensitive)
  2. remove tag_as_protein_molecule (obsolete)
* Add synonym service/workflow
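The proposed replacement for step 4 (removing query terms from the NER output with a case-insensitive regexp) could look roughly like the sketch below. It assumes the entities have already been extracted as the plain text found inside each tag; the real NER output format is not shown here, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class QueryTermFilter {
    // Returns the entities that do not equal any query term (ignoring case).
    // Assumption: each entity is the text content of one NER tag.
    static List<String> removeQueryTerms(List<String> entities, List<String> queryTerms) {
        List<Pattern> patterns = new ArrayList<>();
        for (String term : queryTerms) {
            // Pattern.quote keeps characters such as '(' or '+' literal.
            patterns.add(Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE));
        }
        List<String> kept = new ArrayList<>();
        for (String entity : entities) {
            boolean isQueryTerm = false;
            for (Pattern p : patterns) {
                if (p.matcher(entity.trim()).matches()) { // whole-string match only
                    isQueryTerm = true;
                    break;
                }
            }
            if (!isQueryTerm) {
                kept.add(entity);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> entities = Arrays.asList("EZH2", "ezh2", "HDAC1", "EZH2 complex");
        System.out.println(removeQueryTerms(entities, Arrays.asList("EZH2")));
        // prints [HDAC1, EZH2 complex]
    }
}
```

Matching the whole entity rather than a substring is a design choice of this sketch: it keeps longer names that merely contain a query term, such as "EZH2 complex".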
Note that Remove_inputquery uses an alternative iteration strategy (dot product instead of the default cross product). The same applies to 'Join' in 'SplitQuery'.
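To illustrate the difference between the two iteration strategies: a cross product combines every element of one input list with every element of the other, whereas a dot product combines elements position by position. The sketch below only demonstrates that pairing on two lists of strings; it is not Taverna code, all names are hypothetical, and truncating to the shorter list in dot is a simplification chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IterationStrategies {
    // Cross product: every element of a is paired with every element of b.
    static List<String> cross(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        for (String x : a) {
            for (String y : b) {
                out.add(x + "/" + y);
            }
        }
        return out;
    }

    // Dot product: elements are paired by index (a[0] with b[0], a[1] with b[1], ...).
    static List<String> dot(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int n = Math.min(a.size(), b.size()); // simplification for unequal lengths
        for (int i = 0; i < n; i++) {
            out.add(a.get(i) + "/" + b.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("a", "b");
        List<String> b = Arrays.asList("1", "2");
        System.out.println(cross(a, b)); // prints [a/1, a/2, b/1, b/2]
        System.out.println(dot(a, b));   // prints [a/1, b/2]
    }
}
```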
year:(2007^10 2006^9 2005^8 2004^7 2003^5 2002^4 2001^3 2000^2 1999^1)
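// Beanshell script: query_string and priority_string are input ports, lucene_query is the output port.
// Both clauses are prefixed with '+', so a document must match the query and one of the listed years.
StringBuffer temp = new StringBuffer();
temp.append("+(");
temp.append(query_string);
temp.append(") +");
temp.append(priority_string);
String lucene_query = temp.toString();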
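For readers who want to try the string construction outside the workflow engine, the script above can be wrapped in a small standalone class. This is a sketch only: the class and method names are hypothetical, and the priority string is shortened for the example.

```java
class LuceneQueryBuilder {
    // Same concatenation as the Beanshell script: "+(" query ") +" priority.
    // In Lucene's standard query syntax a leading '+' marks a clause as required.
    static String combine(String queryString, String priorityString) {
        return "+(" + queryString + ") +" + priorityString;
    }

    public static void main(String[] args) {
        String priority = "year:(2007^10 2006^9 2005^8)"; // shortened example value
        System.out.println(combine("EZH2", priority));
        // prints +(EZH2) +year:(2007^10 2006^9 2005^8)
    }
}
```

Because the year clause is itself required, documents whose year is not in the list would presumably not be returned at all; dropping the second '+' would turn the years into a pure ranking hint.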
query_string
priority_string
lucene_query
Lucene query string
e.g. EZH2
e.g. MedLine_new
e.g. content
e.g. 100
text/xml