Find Duplicates using Matchbox command line tool
Created: 2012-07-31 11:00:21
Last updated: 2012-07-31 11:35:12
The workflow takes a list of digital documents as input, extracts SIFT features using image processing algorithms, creates a dictionary of visual words, generates BoW (Bag of Words) histograms and finds duplicates. The number of parallel threads can be passed as a parameter. Finally, the search results are stored in a text file that lists possible duplicates with their associated similarity scores. Scores range from 0 (low similarity) to 1 (high similarity). Image comparison is performed by the Matchbox command line tool and its associated Python scripts. The sources of the Matchbox tool are located at https://github.com/openplanets/scape/tree/master/pc-qa-matchbox.
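To make the histogram and scoring steps concrete, here is a minimal sketch in plain Python. It is not the Matchbox implementation: the two-dimensional toy descriptors, the three-word vocabulary, and the use of histogram intersection as the similarity measure are all assumptions chosen for illustration (real SIFT descriptors have 128 dimensions, and the description does not say which measure Matchbox uses).

```python
import math

def bow_histogram(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word and return
    the normalized word counts (a Bag of Words histogram)."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        # Nearest visual word by Euclidean distance.
        nearest = min(range(len(vocabulary)),
                      key=lambda i: math.dist(descriptor, vocabulary[i]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)

def similarity(h1, h2):
    """Histogram intersection (an assumed measure): for normalized
    histograms the result lies between 0 and 1."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Hypothetical toy data: a vocabulary of 3 visual words in 2 dimensions.
vocabulary = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
doc_a = [(0.1, 0.2), (9.8, 10.1), (9.9, 9.7), (0.3, 9.9)]
doc_b = [(0.2, 0.1), (10.2, 9.9), (9.6, 10.3), (0.1, 10.2)]  # near copy of doc_a
doc_c = [(0.2, 0.1), (0.1, 0.3), (0.0, 0.2), (0.2, 9.8)]     # different content

hist_a = bow_histogram(doc_a, vocabulary)
hist_b = bow_histogram(doc_b, vocabulary)
hist_c = bow_histogram(doc_c, vocabulary)

print(similarity(hist_a, hist_b))
print(similarity(hist_a, hist_c))
```

The near copy maps to the same visual words as the original and therefore scores at the top of the range, while the unrelated document scores lower.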
Workflow Components
Authors (0)
Titles (0)
Descriptions (0)
Dependencies (0)
Processors (9)
Name |
Type |
Description |
ExtractFeatures |
externaltool |
This operation extracts SIFT features from the documents in the given directory and stores them as xml.gz archives in the same directory. The thread count sets how many tasks run in parallel. |
Train |
externaltool |
This operation analyzes the visual words of each document and builds a BoW (Bag of Words) vocabulary that comprises about 1000 distinctive visual words. These key points are the most characteristic features of the analyzed collection. The thread count sets how many tasks run in parallel. |
Threadscount |
stringconstant |
This parameter defines how many threads should be used. Value: 1 |
Workingdir |
stringconstant |
This is the working directory where temporary files are stored. These files comprise the SIFT features and BoW histograms for each document. Value: /tmp/TestCollection |
BoW_histograms |
externaltool |
This operation counts the visual words in each document that match one of the key visual words in the BoW dictionary and creates a BoW histogram file for that document. The thread count sets how many tasks run in parallel. |
Compare |
externaltool |
This operation performs a nearest neighbour search for each document in the collection. The main goal of the search is to find duplicates in the collection. The method compares the BoW histogram of a particular document with all BoW histograms stored for the collection. The result of each comparison is a structural similarity score between 0 (low similarity) and 1 (high similarity). The thread count sets how many tasks run in parallel. |
Write_Text_File |
localworker |
Script
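BufferedWriter out;
// In Beanshell, "void" means the variable is undefined,
// i.e. no value was supplied for the optional encoding input.
if (encoding == void) {
out = new BufferedWriter(new FileWriter(outputFile));
}
else {
out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), encoding));
}
out.write(filecontents);
out.flush();
out.close();
// outputFile is reassigned, so the output port of the same name
// carries the written contents rather than the file name.
outputFile = filecontents;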
|
Duplicates_list |
stringconstant |
This is the name of the resulting file that contains the list of duplicates. Value: duplicates.txt |
Executable_path |
stringconstant |
This is the path to the folder where the executables are stored. Value: /usr/local/bin/ |
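The Compare and Threadscount entries above can be illustrated with a small sketch. It assumes the BoW histograms are already available in memory and uses histogram intersection as the score. The function names, the `threshold` parameter and the sample file names are hypothetical and are not part of Matchbox. The thread pool only mirrors the role of the thread count; in pure Python it gives no real speed-up for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def intersection(h1, h2):
    """Assumed similarity measure: histogram intersection in [0, 1]."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def nearest_neighbour(name, histograms):
    """Return (name, best_match, score) for the most similar other document.
    Requires at least two documents in the collection."""
    best = max(
        (other for other in histograms if other != name),
        key=lambda other: intersection(histograms[name], histograms[other]),
    )
    return name, best, intersection(histograms[name], histograms[best])

def find_duplicates(histograms, threads_count="1", threshold=0.9):
    """Run the nearest neighbour search for every document in parallel.
    threads_count is a string because the workflow passes a string constant;
    threshold is a hypothetical cut-off for reporting a pair."""
    with ThreadPoolExecutor(max_workers=int(threads_count)) as pool:
        results = list(pool.map(
            lambda name: nearest_neighbour(name, histograms),
            sorted(histograms),
        ))
    return [r for r in results if r[2] >= threshold]

# Hypothetical precomputed histograms for three documents.
histograms = {
    "a.png": [0.25, 0.5, 0.25],
    "b.png": [0.25, 0.5, 0.25],
    "c.png": [0.75, 0.0, 0.25],
}
duplicates = find_duplicates(histograms)
for name, match, score in duplicates:
    print(name, match, score)
```

Each document is compared against every other one, so the cost grows quadratically with the collection size.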
Outputs (1)
Name |
Description |
Write_Text_File_outputFile |
|
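The workflow output is fed by the Write_Text_File script, whose last line assigns `filecontents` to `outputFile`. The output therefore appears to carry the written text rather than the file path. A rough Python equivalent of that script, offered as a sketch rather than as part of the workflow, looks like this (the temporary directory and sample text are placeholders):

```python
import os
import tempfile

def write_text_file(output_file, filecontents, encoding=None):
    """Write filecontents to output_file and return the contents.
    encoding=None stands in for Beanshell's "void" check and selects
    the platform default encoding."""
    # newline="" writes the text verbatim, without newline translation.
    with open(output_file, "w", encoding=encoding, newline="") as out:
        out.write(filecontents)
    # Mirrors "outputFile = filecontents" in the original script.
    return filecontents

# Usage with a placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "duplicates.txt")
    returned = write_text_file(path, "example line\n", encoding="utf-8")
    with open(path, encoding="utf-8", newline="") as fh:
        written = fh.read()

print(returned == written)
```

The `with` block closes the file even if the write fails, which the original script does not guarantee.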
Datalinks (15)
Source |
Sink |
Workingdir:value |
ExtractFeatures:workingdir |
Threadscount:value |
ExtractFeatures:threadscount |
Executable_path:value |
ExtractFeatures:execpath |
Threadscount:value |
Train:threadscount |
Workingdir:value |
Train:workingdir |
Executable_path:value |
Train:execpath |
Workingdir:value |
BoW_histograms:workingdir |
Threadscount:value |
BoW_histograms:threadscount |
Executable_path:value |
BoW_histograms:execpath |
Workingdir:value |
Compare:workingdir |
Executable_path:value |
Compare:execpath |
Threadscount:value |
Compare:threadscount |
Duplicates_list:value |
Write_Text_File:outputFile |
Compare:STDOUT |
Write_Text_File:filecontents |
Write_Text_File:outputFile |
Write_Text_File_outputFile |
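The datalinks above can be read as a simple wiring table. The sketch below stores the 15 source and sink pairs as data and groups them by receiving processor, which makes it easy to check that every external tool receives the same three constants. The helper name `inputs_by_processor` is made up for this example.

```python
from collections import defaultdict

# (source, sink) pairs copied from the Datalinks table.
DATALINKS = [
    ("Workingdir:value", "ExtractFeatures:workingdir"),
    ("Threadscount:value", "ExtractFeatures:threadscount"),
    ("Executable_path:value", "ExtractFeatures:execpath"),
    ("Threadscount:value", "Train:threadscount"),
    ("Workingdir:value", "Train:workingdir"),
    ("Executable_path:value", "Train:execpath"),
    ("Workingdir:value", "BoW_histograms:workingdir"),
    ("Threadscount:value", "BoW_histograms:threadscount"),
    ("Executable_path:value", "BoW_histograms:execpath"),
    ("Workingdir:value", "Compare:workingdir"),
    ("Executable_path:value", "Compare:execpath"),
    ("Threadscount:value", "Compare:threadscount"),
    ("Duplicates_list:value", "Write_Text_File:outputFile"),
    ("Compare:STDOUT", "Write_Text_File:filecontents"),
    ("Write_Text_File:outputFile", "Write_Text_File_outputFile"),
]

def inputs_by_processor(datalinks):
    """Map each processor to a dict of {input port: source}."""
    wiring = defaultdict(dict)
    for source, sink in datalinks:
        if ":" not in sink:
            continue  # a workflow output port, not a processor input
        processor, port = sink.split(":", 1)
        wiring[processor][port] = source
    return dict(wiring)

wiring = inputs_by_processor(DATALINKS)
for processor, ports in wiring.items():
    print(processor, ports)
```

Only Write_Text_File takes data produced by another processor (the standard output of Compare). The four external tools are fed solely by constants.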
Coordinations (3)
Controller |
Target |
BoW_histograms |
Compare |
ExtractFeatures |
Train |
Train |
BoW_histograms |
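Because the four external tools exchange no data through datalinks, the coordinations above are what fix their execution order: a target runs only after its controller has finished. As a sketch, the order can be derived with a topological sort from the Python standard library (`graphlib`, available from Python 3.9):

```python
from graphlib import TopologicalSorter

# (controller, target) pairs copied from the Coordinations table.
COORDINATIONS = [
    ("BoW_histograms", "Compare"),
    ("ExtractFeatures", "Train"),
    ("Train", "BoW_histograms"),
]

# TopologicalSorter expects a mapping of node -> set of predecessors.
graph = {}
for controller, target in COORDINATIONS:
    graph.setdefault(target, set()).add(controller)

order = list(TopologicalSorter(graph).static_order())
print(" -> ".join(order))
```

The three links form a single chain, so the resulting order is unique.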
Version 1 (of 1)