ARC to WARC Migration and CDX Index Comparison
Created: 2014-04-23 13:31:02
Workflow for migrating ARC to WARC and comparing the CDX index files (Linux).
The workflow has two input ports: “input_directory”, a local path to the directory containing the ARC files, and “output_directory”, the directory where the workflow outputs are created. The “arc2warc_migration_cli” tool service component migrates the files in the input directory. The “cdx_creator_arc” and “cdx_creator_warc” tool service components then create CDX index files for the original ARC files and the migrated WARC files, respectively. Finally, the “diff_cdx” tool service component compares the two index files using the CSV file comparison tool csvdiff (http://csvdiff.sourceforge.net), which compares defined columns of the two CSV files.
Command line applications used by the tool service components:
arc2warc_migration_cli: https://github.com/openplanets/hawarp/tree/master/arc2warc-migration-cli
cdx_creator_(w)arc: https://github.com/internetarchive/webarchive-commons
diff_cdx: http://csvdiff.sourceforge.net
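The column-wise comparison described above can be illustrated with a minimal Java sketch. It is not how csvdiff works internally; it only shows the idea of comparing selected columns of two index files line by line. The class name CdxColumnDiff, the sample lines, and the chosen column positions are hypothetical, and the sketch assumes whitespace-separated fields with records in the same order in both files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: compares selected columns of two index files
// line by line. Names and column positions are assumptions, not csvdiff's API.
final class CdxColumnDiff {

    /** Returns one message per line whose selected columns differ. */
    static List<String> diff(List<String> arcCdx, List<String> warcCdx, int[] columns) {
        List<String> differences = new ArrayList<>();
        int lines = Math.max(arcCdx.size(), warcCdx.size());
        for (int i = 0; i < lines; i++) {
            String left = i < arcCdx.size() ? project(arcCdx.get(i), columns) : "<missing>";
            String right = i < warcCdx.size() ? project(warcCdx.get(i), columns) : "<missing>";
            if (!left.equals(right)) {
                differences.add("line " + (i + 1) + ": " + left + " != " + right);
            }
        }
        return differences;
    }

    /** Keeps only the requested columns of a whitespace-separated line. */
    static String project(String line, int[] columns) {
        String[] fields = line.trim().split("\\s+");
        StringBuilder sb = new StringBuilder();
        for (int c : columns) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c < fields.length ? fields[c] : "");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical index lines; the last two fields stand in for an offset and a file name.
        List<String> arc = Arrays.asList(
                "org,example)/ 20140423133102 http://example.org/ text/html 200 1024 sample.arc.gz");
        List<String> warc = Arrays.asList(
                "org,example)/ 20140423133102 http://example.org/ text/html 200 2048 sample.warc");
        // Compare only the first five columns and ignore offset and file name.
        System.out.println(diff(arc, warc, new int[] {0, 1, 2, 3, 4}));
    }
}
```

Restricting the comparison to defined columns matters because fields such as the record offset and the container file name would be expected to differ between the original ARC file and the migrated WARC file even when the migration is correct.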
Workflow Components
Dependencies (0)
Inputs (2)
Name | Description
input_directory | Local input directory containing ARC files
output_directory | Local output directory where the migrated WARC files will be stored
Processors (11)
Name | Type | Description
arc2warc_migration_cli | externaltool |
arc2warc_jar | stringconstant | Value: /home/onbscs/Development/scape-dev/hawarp/arc2warc-migration-cli/target/arc2warc-migration-cli-1.0-jar-with-dependencies.jar
cdx_creator_arc | externaltool |
List_Files_by_Extension | localworker | Script:
    // Accepts only files whose names end with the given extension.
    class FileExtFilter implements FileFilter {
        public FileExtFilter(String ext) {
            this.ext = ext;
        }
        public boolean accept(File file) {
            return file.getName().endsWith(ext);
        }
        String ext = null;
    }
    // Validate the input ports ("void" is the BeanShell test for an unset variable).
    if (extension == void || extension.equals("")) {
        throw new RuntimeException(
            "The 'extension' parameter cannot be null or empty. Please enter a valid file extension.");
    }
    if (directory == void || directory.equals("")) {
        throw new RuntimeException(
            "The 'directory' parameter cannot be null or empty. Please enter a valid file directory.");
    }
    File dirObj = new File(directory);
    if (!dirObj.exists()) {
        throw new RuntimeException("The 'directory' parameter specified: " + directory
            + " does not exist. Please enter a valid file directory.");
    }
    // Collect the absolute paths of all matching files for the 'filelist' output port.
    File[] fileObjList = dirObj.listFiles(new FileExtFilter(extension));
    List filelist = new ArrayList();
    for (int i = 0; i < fileObjList.length; i++) {
        filelist.add(fileObjList[i].getAbsolutePath());
    }
arc.gz | stringconstant | Value: arc.gz
List_Files_by_Extension_2 | localworker | Script:
    // Accepts only files whose names end with the given extension.
    class FileExtFilter implements FileFilter {
        public FileExtFilter(String ext) {
            this.ext = ext;
        }
        public boolean accept(File file) {
            return file.getName().endsWith(ext);
        }
        String ext = null;
    }
    // Validate the input ports ("void" is the BeanShell test for an unset variable).
    if (extension == void || extension.equals("")) {
        throw new RuntimeException(
            "The 'extension' parameter cannot be null or empty. Please enter a valid file extension.");
    }
    if (directory == void || directory.equals("")) {
        throw new RuntimeException(
            "The 'directory' parameter cannot be null or empty. Please enter a valid file directory.");
    }
    File dirObj = new File(directory);
    if (!dirObj.exists()) {
        throw new RuntimeException("The 'directory' parameter specified: " + directory
            + " does not exist. Please enter a valid file directory.");
    }
    // Collect the absolute paths of all matching files for the 'filelist' output port.
    File[] fileObjList = dirObj.listFiles(new FileExtFilter(extension));
    List filelist = new ArrayList();
    for (int i = 0; i < fileObjList.length; i++) {
        filelist.add(fileObjList[i].getAbsolutePath());
    }
warc | stringconstant | Value: warc
cdx_creator_warc | externaltool |
webarchive_commons_jar | stringconstant | Value: /home/onbscs/Development/scape-dev/webarchive-commons/target/webarchive-commons-jar-with-dependencies.jar
create_temp_dir | externaltool |
diff_cdx | externaltool |
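For experimenting with the file-listing step outside Taverna, the List_Files_by_Extension script can be approximated in plain Java. This is a sketch under assumptions, not part of the workflow: the class name ExtensionLister and its method names are hypothetical, and unlike the original script it reports a path that is not a readable directory and sorts the result.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone counterpart of the List_Files_by_Extension script.
final class ExtensionLister {

    /** Same matching rule as the workflow script: a plain suffix test. */
    static boolean hasExtension(String fileName, String extension) {
        return fileName.endsWith(extension);
    }

    /** Returns the sorted absolute paths of all entries in the directory matching the extension. */
    static List<String> listByExtension(File directory, String extension) {
        if (extension == null || extension.isEmpty()) {
            throw new IllegalArgumentException("The extension must not be null or empty.");
        }
        File[] matches = directory.listFiles((dir, name) -> hasExtension(name, extension));
        if (matches == null) {
            // listFiles returns null when the path is not a directory or cannot be read.
            throw new IllegalArgumentException("Not a readable directory: " + directory);
        }
        Arrays.sort(matches);
        List<String> paths = new ArrayList<>();
        for (File match : matches) {
            paths.add(match.getAbsolutePath());
        }
        return paths;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: ExtensionLister <directory> <extension>");
            return;
        }
        for (String path : listByExtension(new File(args[0]), args[1])) {
            System.out.println(path);
        }
    }
}
```

Because the rule is a suffix test, the extension value arc.gz also matches a name such as sample.warc.gz. Passing the extension with a leading dot (for example .arc.gz) would be one way to avoid that if both kinds of file could end up in the same directory.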
Outputs (5)
Name | Description
arc2warc_STDOUT |
arc2warc_STDERR |
arc_cdx |
warc_cdx |
diff_out |
Datalinks (19)
Source | Sink
arc2warc_jar:value | arc2warc_migration_cli:arc2warc_jar_path
input_directory | arc2warc_migration_cli:input_directory
output_directory | arc2warc_migration_cli:output_directory
List_Files_by_Extension:filelist | cdx_creator_arc:inputfile
webarchive_commons_jar:value | cdx_creator_arc:archmdextr_jar_path
create_temp_dir:STDOUT | cdx_creator_arc:temp_dir
input_directory | List_Files_by_Extension:directory
arc.gz:value | List_Files_by_Extension:extension
output_directory | List_Files_by_Extension_2:directory
warc:value | List_Files_by_Extension_2:extension
List_Files_by_Extension_2:filelist | cdx_creator_warc:inputfile
webarchive_commons_jar:value | cdx_creator_warc:archmdextr_jar_path
create_temp_dir:STDOUT | cdx_creator_warc:temp_dir
create_temp_dir:STDOUT | diff_cdx:temp_dir
arc2warc_migration_cli:STDOUT | arc2warc_STDOUT
arc2warc_migration_cli:STDERR | arc2warc_STDERR
cdx_creator_arc:STDOUT | arc_cdx
cdx_creator_warc:STDOUT | warc_cdx
diff_cdx:STDOUT | diff_out
Coordinations (5)
Controller | Target
cdx_creator_arc | diff_cdx
arc2warc_migration_cli | cdx_creator_warc
cdx_creator_warc | diff_cdx
arc2warc_migration_cli | List_Files_by_Extension_2
arc2warc_migration_cli | cdx_creator_arc
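Taken together, the control links above and the processor-to-processor data links constrain the order in which the processors can run. The following Java sketch derives one valid order with a topological sort (Kahn's algorithm). It is an illustration only and does not reflect how Taverna schedules processors; the class name ExecutionOrder and the LINKS array are hypothetical, with the pairs copied from the two tables.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: derives one valid processor order from the listed links.
final class ExecutionOrder {

    // {upstream, downstream} pairs: the five control links, then the
    // data links that connect one processor to another.
    static final String[][] LINKS = {
        {"cdx_creator_arc", "diff_cdx"},
        {"arc2warc_migration_cli", "cdx_creator_warc"},
        {"cdx_creator_warc", "diff_cdx"},
        {"arc2warc_migration_cli", "List_Files_by_Extension_2"},
        {"arc2warc_migration_cli", "cdx_creator_arc"},
        {"arc2warc_jar", "arc2warc_migration_cli"},
        {"List_Files_by_Extension", "cdx_creator_arc"},
        {"webarchive_commons_jar", "cdx_creator_arc"},
        {"create_temp_dir", "cdx_creator_arc"},
        {"arc.gz", "List_Files_by_Extension"},
        {"warc", "List_Files_by_Extension_2"},
        {"List_Files_by_Extension_2", "cdx_creator_warc"},
        {"webarchive_commons_jar", "cdx_creator_warc"},
        {"create_temp_dir", "cdx_creator_warc"},
        {"create_temp_dir", "diff_cdx"},
    };

    /** Kahn's algorithm: repeatedly emit a processor with no unfinished upstream processors. */
    static List<String> order(String[][] links) {
        Map<String, List<String>> downstream = new LinkedHashMap<>();
        Map<String, Integer> pending = new LinkedHashMap<>();
        for (String[] link : links) {
            downstream.computeIfAbsent(link[0], k -> new ArrayList<>()).add(link[1]);
            pending.putIfAbsent(link[0], 0);
            pending.merge(link[1], 1, Integer::sum);
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : pending.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String current = ready.remove();
            result.add(current);
            List<String> targets = downstream.getOrDefault(current, Collections.<String>emptyList());
            for (String target : targets) {
                if (pending.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (result.size() != pending.size()) {
            throw new IllegalStateException("The links contain a cycle.");
        }
        return result;
    }

    public static void main(String[] args) {
        for (String processor : order(LINKS)) {
            System.out.println(processor);
        }
    }
}
```

The control links are what force List_Files_by_Extension_2 and cdx_creator_arc to wait for arc2warc_migration_cli, since no data link connects the migration step to either of them. Without the first of these, the output directory could be listed before any WARC files had been written to it.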
Version 1 (of 1)