Termine with c-value threshold
Created: 2010-02-19 09:57:15
Last updated: 2011-12-13 15:52:56
This workflow accepts a list of sentences from a single document and returns the terms found by the TerMine web service. It also lets you set a c-value threshold so that only terms scoring at or above it are returned as an output (a higher c-value indicates a higher probability of being a real term).
To get sentences to supply to this workflow you can use the sentence splitting workflow. The TerMine service (used in this workflow) only accepts text in ASCII encoding, so you should also use the Clean plain text (ASCII) workflow before splitting sentences.
This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.
Unfortunately there are some restrictions on IP access to the TerMine web service at NaCTeM. These can be viewed here. If you are at a UK higher education institution then there should be no problems; others have to request access through this page.
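As an illustration of the ASCII requirement mentioned above, here is a minimal sketch of one way to reduce text to ASCII before sentence splitting. It is not the Clean plain text (ASCII) workflow itself; the class name `AsciiCleaner` and the method `toAscii` are hypothetical, and replacing leftover non-ASCII characters with a space is an assumption rather than documented behaviour.

```java
import java.text.Normalizer;

class AsciiCleaner {
    // Hypothetical helper, not part of the workflow described on this page.
    static String toAscii(String text) {
        // Decompose accented characters into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop the combining marks, then replace any remaining
        // non-ASCII character with a space (an assumed policy).
        return decomposed
                .replaceAll("\\p{M}+", "")
                .replaceAll("[^\\x00-\\x7F]", " ");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("caf\u00e9 na\u00efve"));
    }
}
```

Stripping accents keeps words recognisable, whereas simply deleting every non-ASCII character would turn accented words into different strings.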
Workflow Components
Authors (0)
Titles (0)
Descriptions (0)
Dependencies (0)
Inputs (2)
Name | Description
sentencesList |
cValueThreshold | c-Value threshold value (a higher score indicates a higher probability of being a real term). Raise this value if you want good quality terms; lower it if you want all terms.
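The effect of cValueThreshold can be sketched in standalone Java. The workflow builds its XPath by concatenating `//termCandidate[@cValueScore>=`, the threshold, and `]/text()`; the sketch below does the same and evaluates the result. The class `ThresholdFilter`, its methods, and the sample XML (including the `terms` root element) are hypothetical and only mimic the `termCandidate` element and `cValueScore` attribute named in the workflow; the real TerMine output may be structured differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ThresholdFilter {
    // Hypothetical sample; not real TerMine output.
    static final String SAMPLE_XML =
            "<terms>"
            + "<termCandidate cValueScore=\"4.5\">gene expression</termCandidate>"
            + "<termCandidate cValueScore=\"1.2\">cell</termCandidate>"
            + "</terms>";

    // Same concatenation the workflow performs with its string constants.
    static String buildXPath(String threshold) {
        return "//termCandidate[@cValueScore>=" + threshold + "]/text()";
    }

    // Returns the text of every term candidate scoring at or above the threshold.
    static List<String> termsAbove(String xml, String threshold) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(buildXPath(threshold), doc, XPathConstants.NODESET);
            List<String> terms = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                terms.add(nodes.item(i).getNodeValue());
            }
            return terms;
        } catch (Exception e) {
            // Surface parse and XPath errors instead of returning an empty list.
            throw new IllegalStateException("Could not filter terms", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(termsAbove(SAMPLE_XML, "3.0"));
        System.out.println(termsAbove(SAMPLE_XML, "1.0"));
    }
}
```

With the sample above, a threshold of 3.0 keeps only the higher-scoring candidate, while 1.0 keeps both.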
Processors (13)
Name | Type | Description
sentenceListNormaliser | beanshell | Script:
//this beanshell flattens a list of strings into a single
//new-line delimited string, it also removes any newline
//characters from each string in the list
StringBuilder sb = new StringBuilder(50000);
String newLine = System.getProperty("line.separator");
for(String sentence : sentences) {
sentence = sentence.replaceAll("[\n\r\f]"," ");
sb.append(sentence + newLine);
}
sentenceListString = sb.toString();
|
analyze | wsdl | WSDL: http://www.nactem.ac.uk/software/termine/webservice/termine.wsdl; WSDL operation: analyze
filter_value | stringconstant | Value:
input_format_value | stringconstant | Value:
output_format_value | stringconstant | Value: xml
stoplist_value | stringconstant | Value:
jamesXPath | beanshell | Script:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
XPathFactory fact;
private XPathExpression genXP(String exp) {
if (fact == null) {
fact = XPathFactory.newInstance();
}
XPathExpression r = null;
try {
r = fact.newXPath().compile(exp);
} catch (XPathExpressionException ex) {
//throw new Exception("XPath beanshell could not generate XPAthExpression: " + exp);
}
return r;
}
private NodeList getNLFromXP(Node n, XPathExpression exp) {
NodeList nl = null;
try {
nl = (NodeList) exp.evaluate(n, XPathConstants.NODESET);
} catch (XPathExpressionException ex) {
//throw new Exception("XPath beanshell could not evaluate XPAthExpression: " + exp.toString());
}
return nl;
}
//This beanshell is necessary because the default XPath beanshell doesn't seem to work with
//xml output that is described by an external schema.
//For example the Termine web service operation "analyze" can provide xml output described
//by the schema here
//http://www.nactem.ac.uk/software/termine/webservice/termine.xsd
//but an XPath for //termCandidate[@cValueScore>=3.0]/text() does not return any results
//when there are many terms with c-Value scores above or equal to 3.0.
ArrayList vals = new ArrayList();
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
XPathExpression xp = genXP(xPathString);
if (doc != null && xp != null) {
NodeList nl = getNLFromXP(doc, xp);
if (nl != null) {
for (int i = 0; i < nl.getLength(); i++) {
Node n = nl.item(i);
String val = n.getNodeValue();
if (val != null) {
vals.add(val);
}
}
}
}
resultValues = vals;
|
Concatenate_two_strings | localworker | Script: output = string1 + string2;
xpath_first_bit | stringconstant | Value: //termCandidate[@cValueScore>=
Concatenate_two_strings_2 | localworker | Script: output = string1 + string2;
xpath_last_bit | stringconstant | Value: ]/text()
jamesXPath_2 | beanshell | Script:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
XPathFactory fact;
private XPathExpression genXP(String exp) {
if (fact == null) {
fact = XPathFactory.newInstance();
}
XPathExpression r = null;
try {
r = fact.newXPath().compile(exp);
} catch (XPathExpressionException ex) {
//throw new Exception("XPath beanshell could not generate XPAthExpression: " + exp);
}
return r;
}
private NodeList getNLFromXP(Node n, XPathExpression exp) {
NodeList nl = null;
try {
nl = (NodeList) exp.evaluate(n, XPathConstants.NODESET);
} catch (XPathExpressionException ex) {
//throw new Exception("XPath beanshell could not evaluate XPAthExpression: " + exp.toString());
}
return nl;
}
//This beanshell is necessary because the default XPath beanshell doesn't seem to work with
//xml output that is described by an external schema.
//For example the Termine web service operation "analyze" can provide xml output described
//by the schema here
//http://www.nactem.ac.uk/software/termine/webservice/termine.xsd
//but an XPath for //termCandidate[@cValueScore>=3.0]/text() does not return any results
//when there are many terms with c-Value scores above or equal to 3.0.
ArrayList vals = new ArrayList();
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
XPathExpression xp = genXP(xPathString);
if (doc != null && xp != null) {
NodeList nl = getNLFromXP(doc, xp);
if (nl != null) {
for (int i = 0; i < nl.getLength(); i++) {
Node n = nl.item(i);
String val = n.getNodeValue();
if (val != null) {
vals.add(val);
}
}
}
}
resultValues = vals;
|
xPathString_value | stringconstant | Value: //termCandidate/text()
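For readers who want to try the sentenceListNormaliser logic outside Taverna, the following is a minimal standalone sketch. In the workflow, `sentences` and `sentenceListString` are BeanShell port variables; here they become a method parameter and a return value. The class name `SentenceNormaliser` and the method name `flatten` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class SentenceNormaliser {
    // Flattens a list of sentences into one newline-delimited string,
    // replacing newline, carriage-return and form-feed characters
    // inside each sentence with a space.
    static String flatten(List<String> sentences) {
        StringBuilder sb = new StringBuilder();
        String newLine = System.lineSeparator();
        for (String sentence : sentences) {
            sb.append(sentence.replaceAll("[\\n\\r\\f]", " ")).append(newLine);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "Terms are extracted\nfrom text.",
                "Each sentence ends up on its own line.");
        System.out.print(flatten(sentences));
    }
}
```

Each sentence ends with the platform line separator, so the string passed to the analyze operation has one sentence per line.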
Beanshells (3)
Name | Description | Inputs | Outputs
sentenceListNormaliser | | sentences | sentenceListString
jamesXPath | | xml, xPathString | resultValues
jamesXPath_2 | | xml, xPathString | resultValues
Outputs (3)
Name | Description
termCandidatesAboveThreshold |
xpathOutput |
allTermCandidates |
Datalinks (17)
Source | Sink
sentencesList | sentenceListNormaliser:sentences
filter_value:value | analyze:filter
input_format_value:value | analyze:input_format
output_format_value:value | analyze:output_format
stoplist_value:value | analyze:stoplist
sentenceListNormaliser:sentenceListString | analyze:src
analyze:result | jamesXPath:xml
Concatenate_two_strings_2:output | jamesXPath:xPathString
cValueThreshold | Concatenate_two_strings:string2
xpath_first_bit:value | Concatenate_two_strings:string1
Concatenate_two_strings:output | Concatenate_two_strings_2:string1
xpath_last_bit:value | Concatenate_two_strings_2:string2
analyze:result | jamesXPath_2:xml
xPathString_value:value | jamesXPath_2:xPathString
jamesXPath:resultValues | termCandidatesAboveThreshold
Concatenate_two_strings_2:output | xpathOutput
jamesXPath_2:resultValues | allTermCandidates
Coordinations (0)
Version 1
(of 1)
Credits (1)
(People/Groups)
Attributions (0)
(Workflows/Files)
None
Shared with Groups (1)
Featured In Packs (1)
Attributed By (0)
(Workflows/Files)
None
Other workflows that use similar services
(4)
Terms from collection of PDF files
(2)
This workflow will give you a set of candidate terms for each PDF document in a user-specified directory. You can also specify a c-value threshold that will restrict the terms to those with higher scores.
This workflow was created using only nested workflows. These workflow components work on their own and can be linked together to form more complex workflows such as this. You can view the text mining workflow components in this pack.
If you receive errors when running this workflow t...
Created: 2010-02-19
| Last updated: 2011-12-13
Credits:
James Eales