Matchbox Evaluation
Created: 2012-10-02 12:37:08
Last updated: 2012-10-02 12:40:04
Matchbox evaluation against ground truth. The evaluation process first
creates the matchbox output and ground truth lists. It then counts each page
tuple from the matchbox output that is also in the ground truth as a correctly
identified tuple (true positive). Tuples that are not in the ground truth are
counted as incorrectly identified tuples (false positives). Finally, tuples
that are in the ground truth but not in the matchbox output are counted as
missed tuples (false negatives).
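The counting step described above can be sketched with plain set operations. This is a minimal illustration and not the code of the tb-lsdr-matchboxeval library: the class name `TupleCounts` is hypothetical, and it assumes each page pair has already been encoded as one string in a canonical order, so that the same pair always produces the same string.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of tb-lsdr-matchboxeval-lib.
final class TupleCounts {
    final int truePositives;
    final int falsePositives;
    final int falseNegatives;

    // Both sets hold page pairs encoded as strings (an assumption of this sketch).
    TupleCounts(Set<String> matchboxPairs, Set<String> groundTruthPairs) {
        // Pairs reported by matchbox that also occur in the ground truth.
        Set<String> tp = new HashSet<>(matchboxPairs);
        tp.retainAll(groundTruthPairs);
        // Pairs reported by matchbox that the ground truth does not contain.
        Set<String> fp = new HashSet<>(matchboxPairs);
        fp.removeAll(groundTruthPairs);
        // Ground truth pairs that matchbox did not report.
        Set<String> fn = new HashSet<>(groundTruthPairs);
        fn.removeAll(matchboxPairs);

        truePositives = tp.size();
        falsePositives = fp.size();
        falseNegatives = fn.size();
    }

    public static void main(String[] args) {
        Set<String> matchbox = Set.of("p1|p2", "p3|p4", "p5|p6");
        Set<String> groundTruth = Set.of("p1|p2", "p3|p4", "p7|p8");
        TupleCounts c = new TupleCounts(matchbox, groundTruth);
        System.out.println("TP=" + c.truePositives
                + " FP=" + c.falsePositives
                + " FN=" + c.falseNegatives);
    }
}
```

With the sample data in `main`, two pairs are shared, one is reported only by matchbox, and one appears only in the ground truth.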
Precision is then calculated as the number of true positives (i.e. the
number of items correctly labeled as duplicate page pairs) divided by the
total number of items labeled as duplicate page pairs (i.e. the sum of
true positives and false positives, the latter being items incorrectly
labeled as duplicate page pairs). Recall is defined as the number of
true positives divided by the total number of actual duplicate page
pairs (i.e. the sum of true positives and false negatives, the latter being
items that were not labeled as duplicate page pairs but should have
been).
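The two ratios above can be written down directly from the counts. The sketch below is an assumption-laden illustration, not the library's implementation: `PairMetrics` is a hypothetical name, the values are returned as fractions between 0 and 1 (the workflow's report appends a percent sign, so the library may scale differently), a zero denominator is mapped to 0.0 by convention, and the F-measure is assumed to be the balanced harmonic mean of precision and recall, which the description does not state.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of tb-lsdr-matchboxeval-lib.
final class PairMetrics {

    // precision = TP / (TP + FP); 0.0 when nothing was labeled as a duplicate.
    static double precision(int tp, int fp) {
        return (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // recall = TP / (TP + FN); 0.0 when the ground truth has no duplicates.
    static double recall(int tp, int fn) {
        return (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // Balanced F-measure (assumed here): 2 * P * R / (P + R).
    static double fMeasure(double p, double r) {
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        int tp = 8, fp = 2, fn = 8;
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        System.out.println(String.format(Locale.ROOT,
                "precision=%.3f recall=%.3f f=%.3f", p, r, fMeasure(p, r)));
    }
}
```

For 8 true positives, 2 false positives, and 8 false negatives, this gives a precision of 0.8, a recall of 0.5, and an F-measure of roughly 0.615.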
The ground truth contains single page instances without duplicates and
n-tuples (duplicates, triples, quadruples, etc.). n-tuples with n>2 are
expanded into a list of 2-tuples, which is used to determine the
number of missed duplicates (false negatives).
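Expanding an n-tuple into 2-tuples amounts to listing every unordered pair of its members, which yields n(n-1)/2 pairs. The following sketch shows one way to do this with two nested loops. It is an illustration under assumptions: `TupleExpansion` and `expandToPairs` are hypothetical names, and the workflow itself may rely on combinatoricslib for this step instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of tb-lsdr-matchboxeval-lib.
final class TupleExpansion {

    // Returns every unordered pair of pages in the tuple, keeping input order.
    // A single page yields no pairs; a 2-tuple yields exactly itself.
    static List<String[]> expandToPairs(List<String> tuple) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < tuple.size(); i++) {
            for (int j = i + 1; j < tuple.size(); j++) {
                pairs.add(new String[] { tuple.get(i), tuple.get(j) });
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // A triple expands to three pairs: (p1,p2), (p1,p3), (p2,p3).
        for (String[] pair : expandToPairs(List.of("p1", "p2", "p3"))) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```

A quadruple expands to six pairs in the same way, so a single missed quadruple can account for up to six false negatives.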
Workflow Components
Authors (1)
Titles (1)
Descriptions (1)
Dependencies (2)
tb-lsdr-matchboxeval-lib-1.0-SNAPSHOT.jar |
combinatoricslib-0.2.jar |
Inputs (2)
Name | Description
gt_filelist_path | Path to a text file containing paths to ground truth files.
bc_dirlist_file_path | Path to a text file containing paths to barcode directories.
Processors (10)
Name | Type | Description
matchbox_evaluate | beanshell |
Script:
import eu.scape_project.tb.lsdr.MatchboxEval;

// Compare the parsed matchbox output with the ground truth.
MatchboxEval me = new MatchboxEval(matchbox, groundtruth);
me.evaluate();
cid = me.getTruePositives();   // correctly identified duplicates
iid = me.getFalsePositives();  // incorrectly identified duplicates
md = me.getFalseNegatives();   // missed duplicates
prec = me.getPrecision();
rec = me.getRecall();
fm = me.getFmeasure();
log = me.getLog(); |
matchbox | externaltool |
read_bc_dirlist_file | localworker |
Script:
// Open fileUrl as a local file; if no such file exists, treat it as a URL.
BufferedReader getReader(String fileUrl) throws IOException {
    InputStreamReader reader;
    try {
        reader = new FileReader(fileUrl);
    } catch (FileNotFoundException e) {
        // try a real URL instead
        URL url = new URL(fileUrl);
        reader = new InputStreamReader(url.openStream());
    }
    return new BufferedReader(reader);
}

// Read the input line by line into filecontents, ending each line
// with the platform line separator.
StringBuffer sb = new StringBuffer(4000);
BufferedReader in = getReader(fileurl);
String str;
String lineEnding = System.getProperty("line.separator");
while ((str = in.readLine()) != null) {
    sb.append(str);
    sb.append(lineEnding);
}
in.close();
filecontents = sb.toString();
|
split_bc_dirlist | localworker |
Script:
List split = new ArrayList();
if (!string.equals("")) {
    // Split on a comma unless the regex input is connected.
    String regexString = ",";
    if (regex != void) {
        regexString = regex;
    }
    String[] result = string.split(regexString);
    for (int i = 0; i < result.length; i++) {
        split.add(result[i]);
    }
}
|
newline | stringconstant | Value: \n |
parse_matchbox_stdout | beanshell |
Script:
#!Pairtree pt = new Pairtree();
#!String id = pt.mapToId("/mnt/abonas/linktree/", barcode_path.substring(0,barcode_path.lastIndexOf("/")));
// duplicates_result: the barcode path followed by the detected duplicate lines.
// duplicates_matches: the barcode path, a tab, and 1 if duplicates were found, else 0.
String duplicates_result = "";
String duplicates_matches = "";
duplicates_result += barcode_path + ":\n";
duplicates_matches += barcode_path + "\t";
StringTokenizer st = new StringTokenizer(matchbox_stdout, "\n");
boolean startDuplicates = false;
boolean hasDuplicates = false;
while (st.hasMoreTokens()) {
    String token = st.nextToken();
    // Collect only "=>" lines that follow the duplicates header.
    if (startDuplicates) {
        if (token.contains("=>")) {
            duplicates_result += token + "\n";
            hasDuplicates = true;
        }
    }
    if (token.contains("=== List of detected duplicates ===")) {
        startDuplicates = true;
    }
}
if (hasDuplicates) {
    duplicates_matches += "1";
} else {
    duplicates_matches += "0";
} |
read_gt_filelist_file | localworker |
Script:
// Open fileUrl as a local file; if no such file exists, treat it as a URL.
BufferedReader getReader(String fileUrl) throws IOException {
    InputStreamReader reader;
    try {
        reader = new FileReader(fileUrl);
    } catch (FileNotFoundException e) {
        // try a real URL instead
        URL url = new URL(fileUrl);
        reader = new InputStreamReader(url.openStream());
    }
    return new BufferedReader(reader);
}

// Read the input line by line into filecontents, ending each line
// with the platform line separator.
StringBuffer sb = new StringBuffer(4000);
BufferedReader in = getReader(fileurl);
String str;
String lineEnding = System.getProperty("line.separator");
while ((str = in.readLine()) != null) {
    sb.append(str);
    sb.append(lineEnding);
}
in.close();
filecontents = sb.toString();
|
split_gt_filelist | localworker |
Script:
List split = new ArrayList();
if (!string.equals("")) {
    // Split on a comma unless the regex input is connected.
    String regexString = ",";
    if (regex != void) {
        regexString = regex;
    }
    String[] result = string.split(regexString);
    for (int i = 0; i < result.length; i++) {
        split.add(result[i]);
    }
}
|
read_gt_file | localworker |
Script:
// Open fileUrl as a local file; if no such file exists, treat it as a URL.
BufferedReader getReader(String fileUrl) throws IOException {
    InputStreamReader reader;
    try {
        reader = new FileReader(fileUrl);
    } catch (FileNotFoundException e) {
        // try a real URL instead
        URL url = new URL(fileUrl);
        reader = new InputStreamReader(url.openStream());
    }
    return new BufferedReader(reader);
}

// Read the input line by line into filecontents, ending each line
// with the platform line separator.
StringBuffer sb = new StringBuffer(4000);
BufferedReader in = getReader(fileurl);
String str;
String lineEnding = System.getProperty("line.separator");
while ((str = in.readLine()) != null) {
    sb.append(str);
    sb.append(lineEnding);
}
in.close();
filecontents = sb.toString();
|
report | beanshell |
Script:
report = "Correctly identified duplicates (true positives): " + cid + "\n" +
    "Incorrectly identified duplicates (false positives): " + iid + "\n" +
    "Missed duplicates (false negatives): " + md + "\n" +
    "Precision: " + prec + "%\n" +
    "Recall: " + rec + "%\n" +
    "F-Measure: " + fm + "%\n"; |
Beanshells (3)
Name | Description | Inputs | Outputs
matchbox_evaluate | | matchbox, groundtruth | cid, md, iid, prec, rec, fm, log
parse_matchbox_stdout | | barcode_path, matchbox_stdout | duplicates_result, duplicates_matches
report | | prec, md, iid, fm, cid, rec | report
Outputs (6)
Name | Description
results |
stderr |
stdout |
matches |
report |
evlog |
Datalinks (24)
Source | Sink
read_gt_file:filecontents | matchbox_evaluate:groundtruth
parse_matchbox_stdout:duplicates_result | matchbox_evaluate:matchbox
split_bc_dirlist:split | matchbox:barcode_path
bc_dirlist_file_path | read_bc_dirlist_file:fileurl
newline:value | split_bc_dirlist:regex
read_bc_dirlist_file:filecontents | split_bc_dirlist:string
matchbox:STDOUT | parse_matchbox_stdout:matchbox_stdout
split_bc_dirlist:split | parse_matchbox_stdout:barcode_path
gt_filelist_path | read_gt_filelist_file:fileurl
read_gt_filelist_file:filecontents | split_gt_filelist:string
newline:value | split_gt_filelist:regex
split_gt_filelist:split | read_gt_file:fileurl
matchbox_evaluate:cid | report:cid
matchbox_evaluate:fm | report:fm
matchbox_evaluate:iid | report:iid
matchbox_evaluate:md | report:md
matchbox_evaluate:prec | report:prec
matchbox_evaluate:rec | report:rec
parse_matchbox_stdout:duplicates_result | results
matchbox:STDERR | stderr
matchbox:STDOUT | stdout
parse_matchbox_stdout:duplicates_matches | matches
report:report | report
matchbox_evaluate:log | evlog
Version 1
(of 1)