ARC to WARC Migration and CDX Index Comparison
Workflow for migrating ARC to WARC and comparing the CDX index files (Linux).
The workflow has two input ports: “input_directory”, a local path to the directory containing the ARC files, and “output_directory”, the directory where the workflow outputs are created. The files in the input directory are migrated by the “arc2warc_migration_cli” tool service component. The “cdx_creator_arc” and “cdx_creator_warc” tool service components then create CDX index files for the original ARC file and the migrated WARC file, respectively. Finally, the “cdx_comparison” tool service component compares the two index files using the CSV file comparison tool csvdiff (http://csvdiff.sourceforge.net), which compares defined columns of the two CSV files.
Command line applications used by the tool service components:
arc2warc_migration_cli: https://github.com/openplanets/hawarp/tree/master/arc2warc-migration-cli
cdx_creator_(w)arc: https://github.com/internetarchive/webarchive-commons
diff_cdx: http://csvdiff.sourceforge.net
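The comparison step can be illustrated with a minimal Python sketch. It is not the workflow's implementation and does not call csvdiff. It only shows the idea of comparing defined columns of two CDX index files. The function names (parse_cdx, compare_columns), the assumption that each index is space-delimited with a header line starting with "CDX" followed by one-letter field codes, and the sample records are all hypothetical.

```python
def parse_cdx(text):
    """Parse CDX text into a list of dicts keyed by the header's field codes.

    Assumption: fields are space-delimited and the first non-empty line is a
    header of the form " CDX N b a ..." naming the fields.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split() if lines else []
    if not header or header[0] != "CDX":
        raise ValueError("missing CDX header line")
    fields = header[1:]
    return [dict(zip(fields, ln.split())) for ln in lines[1:]]


def compare_columns(arc_text, warc_text, key_fields, compare_fields):
    """Return a list of differences between two CDX indexes.

    Records are matched on key_fields; only compare_fields are checked, so
    columns expected to change during migration (offset, file name) can be
    left out.
    """
    def index(rows):
        return {tuple(r.get(f, "") for f in key_fields): r for r in rows}

    arc_idx = index(parse_cdx(arc_text))
    warc_idx = index(parse_cdx(warc_text))
    diffs = []
    for key in sorted(set(arc_idx) | set(warc_idx)):
        arc_row, warc_row = arc_idx.get(key), warc_idx.get(key)
        if arc_row is None or warc_row is None:
            side = "ARC" if arc_row is None else "WARC"
            diffs.append((key, "missing in " + side, None, None))
            continue
        for field in compare_fields:
            if arc_row.get(field) != warc_row.get(field):
                diffs.append((key, field, arc_row.get(field), warc_row.get(field)))
    return diffs


if __name__ == "__main__":
    # Hypothetical sample records; real index files come from the CDX creators.
    arc_cdx = (
        " CDX N b a m s V g\n"
        "example.org/ 20130101000000 http://example.org/ text/html 200 0 a.arc\n"
    )
    warc_cdx = (
        " CDX N b a m s V g\n"
        "example.org/ 20130101000000 http://example.org/ text/html 200 512 a.warc\n"
    )
    # Match records on URL key and timestamp; compare URL, MIME type, status.
    print(compare_columns(arc_cdx, warc_cdx, ["N", "b"], ["a", "m", "s"]))
```

Limiting the comparison to selected columns matters here because fields such as the record offset and the container file name are expected to differ between the ARC and WARC indexes even when the migration is correct.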
Run
Run this Workflow in the Taverna Workbench...
Option 1:
Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/4267/download?version=1
Workflow Components
Inputs (2)
Processors (11)
Beanshells (0)
Outputs (5)
Datalinks (19)
Coordinations (5)