Termine with c-value threshold

Created: 2010-02-19 09:57:15 Last updated: 2011-12-13 15:52:56

This workflow accepts a list of sentences from a single document and returns the terms found by the TerMine web service. It also lets you set a c-value threshold so that only terms scoring at or above that threshold are returned; a higher c-value indicates a higher probability of being a real term.

To get sentences to supply to this workflow you can use the sentence splitting workflow. The TerMine service (used in this workflow) only accepts text in ASCII encoding, so you should also use the Clean plain text (ASCII) workflow before splitting sentences.
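
As a rough illustration of what ASCII cleaning involves, the sketch below strips accents and replaces any remaining non-ASCII characters with spaces. It is not the code of the Clean plain text (ASCII) workflow, whose implementation is not shown here; the class name `AsciiCleanSketch` and the method `toAscii` are hypothetical.

```java
import java.text.Normalizer;

class AsciiCleanSketch {
    // Hypothetical helper, not taken from the Clean plain text (ASCII) workflow.
    static String toAscii(String text) {
        // Decompose accented characters into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop the combining marks so that only the base letters remain.
        String stripped = decomposed.replaceAll("\\p{M}", "");
        // Replace every character still outside the ASCII range with a space.
        return stripped.replaceAll("[^\\x00-\\x7F]", " ");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("na\u00EFve caf\u00E9"));
    }
}
```

Replacing unmappable characters with a space rather than deleting them keeps word boundaries intact, which matters when the text is later split into sentences.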

This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.

Unfortunately, IP access to the TerMine web service at NaCTeM is restricted. The restrictions can be viewed here. If you are at a UK higher education institution there should be no problems; others have to request access through this page.

Run

Run this Workflow in the Taverna Workbench...

Option 1:

Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/1060/download?version=1

Workflow Components

Authors (0)

Titles (0)

Descriptions (0)

Dependencies (0)

Inputs (2)

Name	Description
sentencesList
cValueThreshold	c-value threshold (a higher score indicates a higher probability of being a real term). Raise this value if you want only good-quality terms; lower it if you want all terms.

Processors (13)

Name	Type	Description
sentenceListNormaliser	beanshell	Script:
    // This beanshell flattens a list of strings into a single
    // newline-delimited string. It also removes any newline
    // characters from each string in the list.
    StringBuilder sb = new StringBuilder(50000);
    String newLine = System.getProperty("line.separator");
    for (String sentence : sentences) {
        sentence = sentence.replaceAll("[\n\r\f]", " ");
        sb.append(sentence).append(newLine);
    }
    sentenceListString = sb.toString();
analyze	wsdl	Wsdl http://www.nactem.ac.uk/software/termine/webservice/termine.wsdl Wsdl Operation analyze
filter_value	stringconstant	Value
input_format_value	stringconstant	Value
output_format_value	stringconstant	Value xml
stoplist_value	stringconstant	Value
jamesXPath	beanshell	Script:
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathExpressionException;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;

    XPathFactory fact;

    // Compiles the XPath string; returns null if it cannot be compiled.
    private XPathExpression genXP(String exp) {
        if (fact == null) {
            fact = XPathFactory.newInstance();
        }
        XPathExpression r = null;
        try {
            r = fact.newXPath().compile(exp);
        } catch (XPathExpressionException ex) {
            //throw new Exception("XPath beanshell could not generate XPAthExpression: " + exp);
        }
        return r;
    }

    // Evaluates the expression against a node; returns null if evaluation fails.
    private NodeList getNLFromXP(Node n, XPathExpression exp) {
        NodeList nl = null;
        try {
            nl = (NodeList) exp.evaluate(n, XPathConstants.NODESET);
        } catch (XPathExpressionException ex) {
            //throw new Exception("XPath beanshell could not evaluate XPAthExpression: " + exp.toString());
        }
        return nl;
    }

    // This beanshell is necessary because the default XPath beanshell doesn't seem to work with
    // xml output that is described by an external schema.
    // For example the Termine web service operation "analyze" can provide xml output described
    // by the schema here
    // http://www.nactem.ac.uk/software/termine/webservice/termine.xsd
    // but an XPath for
    // termCandidate[@cValueScore>=3.0]/text() does not return any results
    // when there are many terms with c-Value scores above or equal to 3.0.

    ArrayList vals = new ArrayList();
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    XPathExpression xp = genXP(xPathString);
    if (doc != null && xp != null) {
        NodeList nl = getNLFromXP(doc, xp);
        if (nl != null) {
            for (int i = 0; i < nl.getLength(); i++) {
                Node n = nl.item(i);
                String val = n.getNodeValue();
                if (val != null) {
                    vals.add(val);
                }
            }
        }
    }
    resultValues = vals;
Concatenate_two_strings	localworker	Script output = string1 + string2;
xpath_first_bit	stringconstant	Value //termCandidate[@cValueScore>=
Concatenate_two_strings_2	localworker	Script output = string1 + string2;
xpath_last_bit	stringconstant	Value ]/text()
jamesXPath_2	beanshell	Script identical to the jamesXPath script above.
xPathString_value	stringconstant	Value //termCandidate/text()
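
To make the two XPath expressions concrete, here is a minimal standalone sketch of the same selection logic as the jamesXPath beanshell, run against a small made-up document. Only the `termCandidate` element name and the `cValueScore` attribute come from the workflow; the `<terms>` wrapper, the sample terms and scores, and the class name `TermXPathSketch` are assumptions for illustration and may not match real TerMine output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TermXPathSketch {
    // Hypothetical sample; the real service response may be structured differently.
    static final String SAMPLE_XML =
        "<terms>"
        + "<termCandidate cValueScore=\"4.5\">text mining</termCandidate>"
        + "<termCandidate cValueScore=\"2.0\">web service</termCandidate>"
        + "</terms>";

    // Parses the XML and returns the string value of every node the XPath selects.
    static List<String> select(String xml, String xPathString) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xPathString, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String value = nodes.item(i).getNodeValue();
                if (value != null) {
                    values.add(value);
                }
            }
            return values;
        } catch (Exception ex) {
            // Unlike the beanshell, surface parse and XPath errors instead of swallowing them.
            throw new IllegalStateException("Could not evaluate " + xPathString, ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE_XML, "//termCandidate[@cValueScore>=3.0]/text()"));
        System.out.println(select(SAMPLE_XML, "//termCandidate/text()"));
    }
}
```

With this sample, the thresholded expression selects only the candidate scored 4.5, while `//termCandidate/text()` selects both, which corresponds to the termCandidatesAboveThreshold and allTermCandidates outputs.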

Beanshells (3)

Name	Inputs	Outputs
sentenceListNormaliser	sentences	sentenceListString
jamesXPath	xml, xPathString	resultValues
jamesXPath_2	xml, xPathString	resultValues
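
The sentenceListNormaliser beanshell maps its `sentences` input to its `sentenceListString` output. The sketch below wraps the same logic in a plain Java method so it can be tried outside Taverna; the class name `SentenceNormaliserSketch` and the method name `flatten` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class SentenceNormaliserSketch {
    // Joins the sentences into one string with one sentence per line,
    // after replacing any line breaks or form feeds inside a sentence with a space.
    static String flatten(List<String> sentences) {
        String newLine = System.getProperty("line.separator");
        StringBuilder sb = new StringBuilder();
        for (String sentence : sentences) {
            sb.append(sentence.replaceAll("[\n\r\f]", " ")).append(newLine);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("A sentence\nsplit over lines.", "A second sentence.");
        System.out.print(flatten(sentences));
    }
}
```

Because embedded line breaks are replaced first, each line of the result corresponds to exactly one input sentence.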

Outputs (3)

Name	Description
termCandidatesAboveThreshold
xpathOutput
allTermCandidates

Datalinks (17)

Source	Sink
sentencesList	sentenceListNormaliser:sentences
filter_value:value	analyze:filter
input_format_value:value	analyze:input_format
output_format_value:value	analyze:output_format
stoplist_value:value	analyze:stoplist
sentenceListNormaliser:sentenceListString	analyze:src
analyze:result	jamesXPath:xml
Concatenate_two_strings_2:output	jamesXPath:xPathString
cValueThreshold	Concatenate_two_strings:string2
xpath_first_bit:value	Concatenate_two_strings:string1
Concatenate_two_strings:output	Concatenate_two_strings_2:string1
xpath_last_bit:value	Concatenate_two_strings_2:string2
analyze:result	jamesXPath_2:xml
xPathString_value:value	jamesXPath_2:xPathString
jamesXPath:resultValues	termCandidatesAboveThreshold
Concatenate_two_strings_2:output	xpathOutput
jamesXPath_2:resultValues	allTermCandidates
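
The datalinks above show that the thresholded XPath is assembled in two concatenation steps before it reaches jamesXPath and the xpathOutput port. A minimal sketch of that wiring, using the constant values listed under Processors, might look like this; the class name `XPathAssemblySketch` and the method names are hypothetical.

```java
class XPathAssemblySketch {
    // Values of the xpath_first_bit and xpath_last_bit string constants.
    static final String XPATH_FIRST_BIT = "//termCandidate[@cValueScore>=";
    static final String XPATH_LAST_BIT = "]/text()";

    // Mirrors the Concatenate_two_strings local worker: output = string1 + string2.
    static String concatenate(String string1, String string2) {
        return string1 + string2;
    }

    // Follows the datalinks: first bit + threshold, then that result + last bit.
    static String assemble(String cValueThreshold) {
        String partial = concatenate(XPATH_FIRST_BIT, cValueThreshold);
        return concatenate(partial, XPATH_LAST_BIT);
    }

    public static void main(String[] args) {
        System.out.println(assemble("3.0"));
    }
}
```

The threshold is spliced into the expression as text, so the cValueThreshold input needs to be a plain number such as 3.0; anything else would produce an XPath that fails to compile or selects nothing.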

Coordinations (0)

Workflow Type

Taverna 2

Uploader

James Eales

License

All versions of this Workflow are licensed under:

Version 1 (of 1)

Credits (1)

(People/Groups)

James Eales

Attributions (0)

(Workflows/Files)

None

Tags (7)

Uploader tags

Shared with Groups (1)

e-LICO

Featured In Packs (1)

Core text mining workflows

Attributed By (0)

(Workflows/Files)

None

Favourited By (0)

No one

Statistics

3692 viewings

2226 downloads

Citations (0)

None

Version History

In chronological order:

Termine with c-value threshold

Created by James Eales on Friday 19 February 2010 09:57:15 (UTC)

Last edited by James Eales on Saturday 20 March 2010 15:40:09 (UTC)
