AIT Matchbox Scenario Professional
Created: 2012-11-24 22:47:17
Last updated: 2012-11-24 22:56:24
In this scenario, matchbox finds duplicates in a given digital collection. All matchbox workflow steps are executed automatically in a single run, and the user receives a list of duplicates as the result. Matchbox is installed on a remote Linux VM; the digital collection is stored on a Windows machine. The workflow starts the duplicate-finding process using the FindDuplicates Python script of the matchbox tool. Matchbox supports Python 2.7. Execution starts from the directory that contains the Python scripts; if you use the source code from GitHub, this is the scape/pc-qa-matchbox/Python/ directory. The Python script supports several parameters. Experienced users can pass extract, train, bowhist or compare to execute only the associated step of the matchbox duplicate-search workflow. The order of the steps must not be changed, because each step requires the output of the previous one. For example, to repeat the comparison step you must already have the BOWHistogram files produced by the bowhist step.
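The step dependency described above can be illustrated with a small sketch. It assumes that the order in which the description lists the parameters (extract, train, bowhist, compare) is also the execution order; the function name `required_previous_steps` is hypothetical and not part of matchbox.

```python
# Step names come from the description; the list order is ASSUMED to be the
# execution order. "all" runs every step in one go.
WORKFLOW_STEPS = ["extract", "train", "bowhist", "compare"]


def required_previous_steps(step):
    """Return the steps whose output must already exist before `step` runs.

    Hypothetical helper for illustration only; not part of matchbox.
    """
    if step == "all":
        return []  # "all" starts from scratch and runs every step itself
    if step not in WORKFLOW_STEPS:
        raise ValueError("unknown matchbox step: %r" % (step,))
    return WORKFLOW_STEPS[:WORKFLOW_STEPS.index(step)]


# Repeating the comparison needs the output of all three earlier steps,
# including the BOWHistogram files from bowhist.
print(required_previous_steps("compare"))
```

A check like this could be run before re-executing a single step, to confirm that the earlier steps have already been completed.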
Workflow Components
Dependencies (0)
Inputs (2)
Name |
Description |
orig_dirlist_file_path |
Path to the directory on the server that holds the digital collection to be analysed. The path must end with '/'.
|
matchbox_parameter |
One of the main matchbox parameters used for duplicate finding: all, extract, train, bowhist or compare.
|
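The two inputs above could be validated before they are handed to the matchbox processor. The following is a minimal sketch, not the workflow's actual command: the interpreter name `python2.7`, the file name `FindDuplicates.py` and the argument order are assumptions, and `build_command` and `normalise_collection_path` are hypothetical helpers.

```python
# Parameter values as listed in the matchbox_parameter description.
VALID_PARAMETERS = ("all", "extract", "train", "bowhist", "compare")


def normalise_collection_path(path):
    """Append the trailing '/' that the workflow expects if it is missing."""
    return path if path.endswith("/") else path + "/"


def build_command(collection_path, parameter):
    """Build an argument list for the duplicate-finding script.

    ASSUMPTIONS: the interpreter name, the script file name and the
    argument order are illustrative and not confirmed by the workflow page.
    """
    if parameter not in VALID_PARAMETERS:
        raise ValueError("unsupported matchbox parameter: %r" % (parameter,))
    return [
        "python2.7",
        "FindDuplicates.py",
        normalise_collection_path(collection_path),
        parameter,
    ]


# "/data/collection" is a made-up example path.
print(build_command("/data/collection", "all"))
```

Returning an argument list rather than a single string avoids shell-quoting problems if the collection path contains spaces.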
Processors (2)
Name |
Type |
Description |
matchbox |
externaltool |
This command starts the duplicate-finding process using the FindDuplicates Python script of the matchbox tool. Matchbox supports Python 2.7. Execution starts from the directory that contains the Python scripts; if you use the source code from GitHub, this is the scape/pc-qa-matchbox/Python/ directory. The Python script supports several parameters. Experienced users can pass extract, train, bowhist or compare to execute only the associated step of the matchbox duplicate-search workflow. The order of the steps must not be changed, because each step requires the output of the previous one. For example, to repeat the comparison step you must already have the BOWHistogram files produced by the bowhist step. |
parse_matchbox_stdout |
beanshell |
Script
// Disabled Pairtree lookup, kept from the original script:
// Pairtree pt = new Pairtree();
// String id = pt.mapToId("/mnt/abonas/linktree/", barcode_path.substring(0,barcode_path.lastIndexOf("/")));

// Report: the collection path followed by every duplicate line found.
String duplicates_result = target_collection_path + ":\n";
// Flag line: the collection path, a tab, then 1 if any duplicate was found, else 0.
String duplicates_matches = target_collection_path + "\t";
StringTokenizer st = new StringTokenizer(matchbox_stdout, "\n");
boolean hasDuplicates = false;
while (st.hasMoreTokens()) {
    String token = st.nextToken();
    // Lines containing "=>" are treated as duplicate pairs.
    if (token.contains("=>")) {
        duplicates_result += token + "\n";
        hasDuplicates = true;
    }
}
duplicates_matches += hasDuplicates ? "1" : "0"; |
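For testing the parsing logic outside Taverna, the parse_matchbox_stdout script can be approximated in Python. This is a sketch of the same filtering rule (keep lines containing "=>"), not the workflow's own code. The sample stdout and file names below are invented for illustration, since the page does not show real matchbox output.

```python
def parse_matchbox_stdout(target_collection_path, matchbox_stdout):
    """Python approximation of the BeanShell script's two outputs.

    Returns (duplicates_result, duplicates_matches):
    - duplicates_result: the collection path, a colon, then one line per
      stdout line that contains "=>"
    - duplicates_matches: the collection path, a tab, then "1" if any such
      line exists, otherwise "0"
    """
    duplicate_lines = [
        line for line in matchbox_stdout.split("\n") if "=>" in line
    ]
    duplicates_result = target_collection_path + ":\n" + "".join(
        line + "\n" for line in duplicate_lines
    )
    duplicates_matches = target_collection_path + "\t" + (
        "1" if duplicate_lines else "0"
    )
    return duplicates_result, duplicates_matches


# HYPOTHETICAL sample output; real matchbox output may look different.
sample_stdout = "processing collection\n0001.png => 0007.png\ndone\n"
result, matches = parse_matchbox_stdout("/data/collection/", sample_stdout)
print(result)
print(matches)
```

The flag in `duplicates_matches` only records whether any duplicate line was seen; the individual pairs are available in `duplicates_result`.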
Beanshells (1)
Name |
Description |
Inputs |
Outputs |
parse_matchbox_stdout |
|
target_collection_path
matchbox_stdout
|
duplicates_result
duplicates_matches
|
Outputs (4)
Name |
Description |
results |
|
stderr |
|
stdout |
|
matches |
|
Datalinks (8)
Source |
Sink |
orig_dirlist_file_path |
matchbox:target_collection_path |
matchbox_parameter |
matchbox:parameter |
orig_dirlist_file_path |
parse_matchbox_stdout:target_collection_path |
matchbox:STDOUT |
parse_matchbox_stdout:matchbox_stdout |
parse_matchbox_stdout:duplicates_result |
results |
matchbox:STDERR |
stderr |
matchbox:STDOUT |
stdout |
parse_matchbox_stdout:duplicates_matches |
matches |
Version 2 (of 3)