Matchbox Evaluation

Created: 2012-10-02 12:37:08
Last updated: 2012-10-02 12:40:04

Matchbox evaluation against ground truth. The evaluation process first creates the matchbox output and ground truth lists. It then counts each page tuple from the matchbox output that is also in the ground truth as a correctly identified tuple (true positive). Tuples from the matchbox output that are not in the ground truth are counted as incorrectly identified tuples (false positives), and tuples that are in the ground truth but not in the matchbox output are counted as missed tuples (false negatives).

Precision is calculated as the number of true positives (i.e. the number of items correctly labeled as duplicate page pairs) divided by the total number of items labeled as duplicate page pairs (i.e. the sum of true positives and false positives, the latter being items incorrectly labeled as duplicate page pairs). Recall is defined as the number of true positives divided by the total number of actual duplicate page pairs (i.e. the sum of true positives and false negatives, the latter being items that have not been labeled as duplicate page pairs but should have been).

The ground truth contains single page instances without duplicates and n-tuples (pairs, triples, quadruples, etc.). n-tuples with n>2 are expanded into a list of 2-tuples, which is used to determine the number of missed duplicates (false negatives).
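The counting scheme described above can be sketched in plain Java. This is only an illustration, not the code of eu.scape_project.tb.lsdr.MatchboxEval: the class name PairEvaluationSketch, its method names, the page identifiers and the "|" separator used as a pair key are all assumptions made for the example. The F-measure is assumed here to be the harmonic mean of precision and recall, and values are returned as fractions rather than percentages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the workflow or of MatchboxEval.
final class PairEvaluationSketch {

    // Expands one n-tuple of page ids into all unordered 2-tuples.
    // Ids are sorted first so that (a, b) and (b, a) give the same key.
    // Assumes page ids never contain the '|' character.
    static Set<String> expandToPairs(List<String> tuple) {
        List<String> sorted = new ArrayList<>(new TreeSet<>(tuple));
        Set<String> pairs = new HashSet<>();
        for (int i = 0; i < sorted.size(); i++) {
            for (int j = i + 1; j < sorted.size(); j++) {
                pairs.add(sorted.get(i) + "|" + sorted.get(j));
            }
        }
        return pairs;
    }

    // Returns {true positives, false positives, false negatives}.
    static int[] count(Set<String> detected, Set<String> groundTruth) {
        Set<String> common = new HashSet<>(detected);
        common.retainAll(groundTruth);
        int tp = common.size();
        return new int[] {tp, detected.size() - tp, groundTruth.size() - tp};
    }

    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // Assumed definition: harmonic mean of precision and recall.
    static double fMeasure(double p, double r) {
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Invented ground truth: one triple (expands to 3 pairs) and one pair.
        Set<String> truth = new HashSet<>();
        truth.addAll(expandToPairs(Arrays.asList("p3", "p7", "p9")));
        truth.addAll(expandToPairs(Arrays.asList("p12", "p13")));

        // Invented matchbox output: two correct pairs and one wrong pair.
        Set<String> detected = new HashSet<>();
        detected.addAll(expandToPairs(Arrays.asList("p7", "p3")));
        detected.addAll(expandToPairs(Arrays.asList("p12", "p13")));
        detected.addAll(expandToPairs(Arrays.asList("p20", "p21")));

        int[] c = count(detected, truth);
        double p = precision(c[0], c[1]);
        double r = recall(c[0], c[2]);
        System.out.printf("TP=%d FP=%d FN=%d%n", c[0], c[1], c[2]);
        System.out.printf("precision=%.3f recall=%.3f f=%.3f%n", p, r, fMeasure(p, r));
    }
}
```

Expanding the triple before counting means a detector that finds only one of its three pairs is charged with two missed duplicates, which matches the way false negatives are described above.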

Run

Run this Workflow in the Taverna Workbench...

Option 1:

Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/3212/download?version=1

Workflow Components

Authors (1)

Titles (1)

Descriptions (1)

Dependencies (2)

Inputs (2)

Name	Description
gt_filelist_path	Path to textfile containing paths to groundtruth files.
bc_dirlist_file_path	Path to textfile containing paths to barcode directories.

Processors (10)

Name	Type	Description
matchbox_evaluate	beanshell	Script import eu.scape_project.tb.lsdr.MatchboxEval; MatchboxEval me = new MatchboxEval(matchbox,groundtruth); me.evaluate(); cid = me.getTruePositives(); iid = me.getFalsePositives(); md = me.getFalseNegatives(); prec = me.getPrecision(); rec = me.getRecall(); fm = me.getFmeasure(); log = me.getLog();
matchbox	externaltool
read_bc_dirlist_file	localworker	Script BufferedReader getReader (String fileUrl) throws IOException { InputStreamReader reader; try { reader = new FileReader(fileUrl); } catch (FileNotFoundException e) { /* try a real URL instead */ URL url = new URL(fileUrl); reader = new InputStreamReader (url.openStream()); } return new BufferedReader(reader); } StringBuffer sb = new StringBuffer(4000); BufferedReader in = getReader(fileurl); String str; String lineEnding = System.getProperty("line.separator"); while ((str = in.readLine()) != null) { sb.append(str); sb.append(lineEnding); } in.close(); filecontents = sb.toString();
split_bc_dirlist	localworker	Script List split = new ArrayList(); if (!string.equals("")) { String regexString = ","; if (regex != void) { regexString = regex; } String[] result = string.split(regexString); for (int i = 0; i < result.length; i++) { split.add(result[i]); } }
newline	stringconstant	Value \n
parse_matchbox_stdout	beanshell	Script #!Pairtree pt = new Pairtree(); #!String id = pt.mapToId("/mnt/abonas/linktree/", barcode_path.substring(0,barcode_path.lastIndexOf("/"))); String duplicates_result = ""; String duplicates_matches = ""; duplicates_result += barcode_path+ ":\n"; duplicates_matches += barcode_path + "\t"; StringTokenizer st = new StringTokenizer(matchbox_stdout, "\n"); boolean startDuplicates = false; boolean hasDuplicates = false; while (st.hasMoreTokens()) { String token = st.nextToken(); if (startDuplicates) { if (token.contains("=>")) { duplicates_result += token + "\n"; hasDuplicates = true; } } if (token.contains("=== List of detected duplicates ===")) { startDuplicates = true; } } if(hasDuplicates) duplicates_matches += "1"; else duplicates_matches += "0";
read_gt_filelist_file	localworker	Script BufferedReader getReader (String fileUrl) throws IOException { InputStreamReader reader; try { reader = new FileReader(fileUrl); } catch (FileNotFoundException e) { /* try a real URL instead */ URL url = new URL(fileUrl); reader = new InputStreamReader (url.openStream()); } return new BufferedReader(reader); } StringBuffer sb = new StringBuffer(4000); BufferedReader in = getReader(fileurl); String str; String lineEnding = System.getProperty("line.separator"); while ((str = in.readLine()) != null) { sb.append(str); sb.append(lineEnding); } in.close(); filecontents = sb.toString();
split_gt_filelist	localworker	Script List split = new ArrayList(); if (!string.equals("")) { String regexString = ","; if (regex != void) { regexString = regex; } String[] result = string.split(regexString); for (int i = 0; i < result.length; i++) { split.add(result[i]); } }
read_gt_file	localworker	Script BufferedReader getReader (String fileUrl) throws IOException { InputStreamReader reader; try { reader = new FileReader(fileUrl); } catch (FileNotFoundException e) { /* try a real URL instead */ URL url = new URL(fileUrl); reader = new InputStreamReader (url.openStream()); } return new BufferedReader(reader); } StringBuffer sb = new StringBuffer(4000); BufferedReader in = getReader(fileurl); String str; String lineEnding = System.getProperty("line.separator"); while ((str = in.readLine()) != null) { sb.append(str); sb.append(lineEnding); } in.close(); filecontents = sb.toString();
report	beanshell	Script report = "Correctly identified duplicates (true positives): "+cid+"\n"+ "Incorrectly identified duplicates (false positives): "+iid+"\n"+ "Missed duplicates (false negatives): "+md+"\n"+ "Precision: "+prec+"%\n"+ "Recall: "+rec+"%\n"+ "F-Measure: "+fm+"%\n";
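
The parse_matchbox_stdout script listed in the table above can be read as the following standalone Java sketch. It mirrors the logic of the BeanShell script (collect every line containing "=>" that appears after the "=== List of detected duplicates ===" marker, and emit a tab-separated 1/0 flag per barcode path), but the class name MatchboxStdoutSketch, the method names and the sample stdout text are assumptions made for illustration. The real output format of matchbox beyond the marker line and the "=>" token is not documented here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone version of the parse_matchbox_stdout logic.
final class MatchboxStdoutSketch {

    static final String MARKER = "=== List of detected duplicates ===";

    // Returns the lines after the marker that contain "=>".
    static List<String> duplicateLines(String stdout) {
        List<String> result = new ArrayList<>();
        boolean inList = false;
        for (String line : stdout.split("\n")) {
            // Lines are only collected once the marker has been seen.
            if (inList && line.contains("=>")) {
                result.add(line);
            }
            if (line.contains(MARKER)) {
                inList = true;
            }
        }
        return result;
    }

    // Builds the "path<TAB>1" or "path<TAB>0" entry used for the matches output.
    static String matchFlag(String barcodePath, String stdout) {
        return barcodePath + "\t" + (duplicateLines(stdout).isEmpty() ? "0" : "1");
    }

    public static void main(String[] args) {
        // Invented sample output; only the marker and "=>" come from the script.
        String stdout = "some log line\n"
                + MARKER + "\n"
                + "page_0007 => page_0009\n"
                + "page_0012 => page_0013\n";
        for (String line : duplicateLines(stdout)) {
            System.out.println(line);
        }
        System.out.println(matchFlag("/path/to/barcode", stdout));
    }
}
```

As in the original script, a "=>" line that appears before the marker is ignored, so unrelated log output cannot be mistaken for a detected duplicate.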

Beanshells (3)

Name	Inputs	Outputs
matchbox_evaluate	matchbox groundtruth	cid md iid prec rec fm log
parse_matchbox_stdout	barcode_path matchbox_stdout	duplicates_result duplicates_matches
report	prec md iid fm cid rec	report

Outputs (6)

Name	Description
results
stderr
stdout
matches
report
evlog

Datalinks (24)

Source	Sink
read_gt_file:filecontents	matchbox_evaluate:groundtruth
parse_matchbox_stdout:duplicates_result	matchbox_evaluate:matchbox
split_bc_dirlist:split	matchbox:barcode_path
bc_dirlist_file_path	read_bc_dirlist_file:fileurl
newline:value	split_bc_dirlist:regex
read_bc_dirlist_file:filecontents	split_bc_dirlist:string
matchbox:STDOUT	parse_matchbox_stdout:matchbox_stdout
split_bc_dirlist:split	parse_matchbox_stdout:barcode_path
gt_filelist_path	read_gt_filelist_file:fileurl
read_gt_filelist_file:filecontents	split_gt_filelist:string
newline:value	split_gt_filelist:regex
split_gt_filelist:split	read_gt_file:fileurl
matchbox_evaluate:cid	report:cid
matchbox_evaluate:fm	report:fm
matchbox_evaluate:iid	report:iid
matchbox_evaluate:md	report:md
matchbox_evaluate:prec	report:prec
matchbox_evaluate:rec	report:rec
parse_matchbox_stdout:duplicates_result	results
matchbox:STDERR	stderr
matchbox:STDOUT	stdout
parse_matchbox_stdout:duplicates_matches	matches
report:report	report
matchbox_evaluate:log	evlog