A heuristic measure for detecting undesired influence of lossy JP2 compression on OCR in the absence of ground truth
Created: 2012-02-06 12:27:23
Last updated: 2012-03-09 14:33:19
Analyses the impact of lossy JPEG2000 compression on OCR results.
Requires the following tools to be installed:
- ImageMagick (command: convert) for removing TIF compression, if need be
- Kakadu (commands: kdu_compress, kdu_expand) for encoding TIF images to JP2 and decoding them back
- Tesseract for OCR
- Apache Commons Lang (commons-lang-2.4.jar) in <taverna-home>/lib, a dependency of the CalculateLevenshteinDistance Beanshell
- Gnuplot for creating a diagram demonstrating the result
The workflow has been designed to run on a Linux system; some variables, such as the temporary directory "/tmp/", would have to be changed for other operating systems.
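As a rough sketch of what the workflow drives per compression rate, the Kakadu and Tesseract commands can be assembled as below. The file names, temporary-directory layout and option sets are illustrative assumptions, not the workflow's exact invocations; only the tool commands themselves come from the requirements list above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-rate tool chain driven by the kakadu_encode nested
// workflow. Paths and output names are assumptions for illustration.
public class Jp2OcrPipeline {

    // Compose the commands run for one input image at one compression rate.
    public static List<String> commandsFor(String tifPath, String rate, String tmpDir) {
        String jp2 = tmpDir + "/encoded.jp2";
        String decoded = tmpDir + "/decoded.tif";
        List<String> cmds = new ArrayList<String>();
        // 1. Encode the uncompressed TIF to JP2 at the given rate (Kakadu)
        cmds.add("kdu_compress -i " + tifPath + " -o " + jp2 + " -rate " + rate);
        // 2. Decode the JP2 back to TIF for OCR (Kakadu)
        cmds.add("kdu_expand -i " + jp2 + " -o " + decoded);
        // 3. OCR the original and the round-tripped image (Tesseract);
        //    the two texts are then compared via Levenshtein distance
        cmds.add("tesseract " + tifPath + " " + tmpDir + "/ocr_orig");
        cmds.add("tesseract " + decoded + " " + tmpDir + "/ocr_jp2");
        return cmds;
    }
}
```

For each rate the two Tesseract outputs are then fed to the CalculateLevenshteinDistance Beanshell; a rising distance at lower rates signals compression damage to the OCR.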
Workflow Components
Title: Kakadu encode uncompressed TIF image files (tool service)
Inputs (4)
Name            | Description
rates           | Compression rates
inFilesAbsPaths | List of absolute paths to book image files
tess_langmod    | Tesseract language module
tesscmd         |
Processors (6)
Name |
Type |
Description |
kakadu_encode |
workflow |
|
createSessionID |
beanshell |
Script//Create a random name ("sessionID") and a directory in /tmp
sessionID = new Random().nextInt(10000000); |
split_rates | localworker
Script:
List split = new ArrayList();
if (!string.equals("")) {
    String regexString = ",";
    if (regex != void) {
        regexString = regex;
    }
    String[] result = string.split(regexString);
    for (int i = 0; i < result.length; i++) {
        split.add(result[i]);
    }
}
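Outside Taverna, the same splitting can be reproduced in plain Java (a sketch; Beanshell's `regex != void` test for an unconnected port is modelled as a null check here). In the workflow the regex port is wired to the newline string constant, so a rates input such as "0.5\n1\n2" yields one rate per line:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java equivalent of the split_rates local worker.
public class SplitRates {
    public static List<String> split(String string, String regex) {
        List<String> split = new ArrayList<String>();
        if (!string.equals("")) {
            // "," is the local worker's default separator when no regex is wired in
            String regexString = (regex != null) ? regex : ",";
            for (String part : string.split(regexString)) {
                split.add(part);
            }
        }
        return split;
    }
}
```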
newline | stringconstant
Value: "\n"
AggregateResults | localworker
Script:
String outputFile = "/tmp/" + sessionID + "/data";
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "utf-8"));
int num = resultList.size();
List sublist = (List) resultList.get(0);
int slnum = sublist.size();
for (int i = 0; i < slnum; i++) {
    int linesum = 0;
    for (int j = 0; j < num; j++) {
        linesum += Integer.parseInt((String) ((List) resultList.get(j)).get(i));
    }
    double lineAvg = (double) linesum / (double) num;
    out.write(Double.toString(lineAvg) + "\n");
}
out.flush();
out.close();
File outFile = new File(outputFile);
if (outFile.exists())
    datafile = "file://" + outFile.getAbsolutePath();
else
    datafile = "file://null";
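Stripped of the file I/O, the aggregation reduces to a per-rate average of the Levenshtein distances across all input images. A self-contained sketch, with `resultList` typed as the nested string lists the script receives:

```java
import java.util.List;

// Core of the AggregateResults step: resultList holds one list of
// Levenshtein distances per input image (outer index = image, inner
// index = compression rate); the result is the mean distance per rate,
// i.e. one line of the gnuplot data file each.
public class RateAverages {
    public static double[] averages(List<List<String>> resultList) {
        int num = resultList.size();          // number of images
        int slnum = resultList.get(0).size(); // number of rates
        double[] avgs = new double[slnum];
        for (int i = 0; i < slnum; i++) {
            int linesum = 0;
            for (int j = 0; j < num; j++) {
                linesum += Integer.parseInt(resultList.get(j).get(i));
            }
            avgs[i] = (double) linesum / (double) num;
        }
        return avgs;
    }
}
```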
Read_Text_File | localworker
Script:
BufferedReader getReader(String fileUrl, String encoding) throws IOException {
    InputStreamReader reader;
    try {
        if (encoding == null) {
            reader = new FileReader(fileUrl);
        } else {
            reader = new InputStreamReader(new FileInputStream(fileUrl), encoding);
        }
    } catch (FileNotFoundException e) {
        // try a real URL instead
        URL url = new URL(fileUrl);
        if (encoding == null) {
            reader = new InputStreamReader(url.openStream());
        } else {
            reader = new InputStreamReader(url.openStream(), encoding);
        }
    }
    return new BufferedReader(reader);
}

StringBuffer sb = new StringBuffer(4000);
if (encoding == void) {
    encoding = null;
}
BufferedReader in = getReader(fileurl, encoding);
String str;
String lineEnding = System.getProperty("line.separator");
while ((str = in.readLine()) != null) {
    sb.append(str);
    sb.append(lineEnding);
}
in.close();
filecontents = sb.toString();
Beanshells (3)
Name                         | Inputs                         | Outputs
createSessionID              | (none)                         | sessionID
create_tmp_environment       | inFileAbsPath, rate, sessionID | tmpDirAbsPath, tmpFileTrunk
CalculateLevenshteinDistance | text1, text2                   | levenshtein_distance
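The CalculateLevenshteinDistance Beanshell relies on commons-lang 2.4's StringUtils.getLevenshteinDistance. For reference, an equivalent dependency-free computation, a standard dynamic-programming sketch rather than the workflow's actual code, looks like this:

```java
// Levenshtein distance: minimum number of single-character insertions,
// deletions and substitutions turning s into t, computed with the
// classic two-row dynamic-programming table.
public class Levenshtein {
    public static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // cost of building t's prefix from an empty string
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // cost of deleting s's prefix entirely
            for (int j = 1; j <= t.length(); j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1,  // insertion
                                            prev[j] + 1),     // deletion
                                   prev[j - 1] + cost);       // substitution
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.length()];
    }
}
```

In this workflow the distance between the OCR text of the original image and that of the JP2 round-trip serves as the (ground-truth-free) damage heuristic.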
Outputs (2)
Name | Description
tess_result |
datafile |
Datalinks (12)
Source |
Sink |
createSessionID:sessionID |
kakadu_encode:sessionID |
inFilesAbsPaths |
kakadu_encode:inFileAbsPath |
tess_langmod |
kakadu_encode:tess_langmod |
split_rates:split |
kakadu_encode:rate |
tesscmd |
kakadu_encode:tesscmd |
rates |
split_rates:string |
newline:value |
split_rates:regex |
kakadu_encode:levenshtein_distance |
AggregateResults:resultList |
createSessionID:sessionID |
AggregateResults:sessionID |
AggregateResults:datafile |
Read_Text_File:fileurl |
kakadu_encode:tess_result |
tess_result |
Read_Text_File:filecontents |
datafile |
Coordinations (1)
Controller      | Target
createSessionID | kakadu_encode
Version 1 (earliest of 2)