ARC to WARC Migration with CDX Index and wayback rendering screenshot comparison
Workflow for migrating ARC to WARC and comparing the CDX index files (Linux).
The workflow has two input ports: “input_directory”, a local path to the directory containing the ARC files, and “output_directory”, the directory where the workflow outputs are created. The “arc2warc_migration_cli” tool service component migrates the files in the input directory. The “cdx_creator_arc” and “cdx_creator_warc” tool service components create CDX index files for the original ARC file and the migrated WARC file, respectively. The “cdx_comparison” tool service component then compares the two index files using the CSV file comparison tool csvdiff (http://csvdiff.sourceforge.net), which compares defined columns of the two CSV files. In addition, the container files are deployed to a wayback web application running in an Apache Tomcat servlet container, and screenshots are taken of the rendered HTML pages retrieved from the ARC and WARC container file records, respectively. The screenshots are then compared using Exiftool (geometric properties) and ImageMagick (measure: AE).
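The column-wise CDX comparison described above can be illustrated with a minimal Python sketch. It is not the workflow's actual implementation, which delegates to csvdiff. The sketch assumes space-separated CDX lines, a header line starting with "CDX", and hypothetical column positions (columns 0 and 1 as the record key, columns 3 and 4 as the values to compare); adjust these to the CDX format your cdx-creator build produces.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def parse_cdx(lines: Iterable[str], key_cols: Sequence[int]) -> Dict[Tuple[str, ...], List[str]]:
    """Map each CDX record to its fields, keyed by the chosen key columns."""
    records = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and the header line (assumed to start with "CDX").
        if not line or line.startswith("CDX"):
            continue
        fields = line.split()
        records[tuple(fields[i] for i in key_cols)] = fields
    return records


def field(fields: List[str], col: int) -> str:
    """Return a column value, or an empty string if the line is too short."""
    return fields[col] if col < len(fields) else ""


def compare_cdx(arc_lines, warc_lines, key_cols=(0, 1), compare_cols=(3, 4)):
    """Return (key, column, arc_value, warc_value) for every difference.

    Records present in only one index are reported with column None.
    The default column positions are assumptions, not taken from the workflow.
    """
    arc = parse_cdx(arc_lines, key_cols)
    warc = parse_cdx(warc_lines, key_cols)
    diffs = []
    for key in sorted(arc.keys() | warc.keys()):
        if key not in arc or key not in warc:
            diffs.append((key, None,
                          "present" if key in arc else "missing",
                          "present" if key in warc else "missing"))
            continue
        for col in compare_cols:
            a, w = field(arc[key], col), field(warc[key], col)
            if a != w:
                diffs.append((key, col, a, w))
    return diffs
```

An empty result means the selected columns agree for every record. Columns such as file offsets or container file names are expected to differ between ARC and WARC, which is why only defined columns are compared.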
The following requirements must be fulfilled to be able to execute the Taverna workflow:
- Linux operating system
- Java version >= 1.7
- Maven2 and Git
- Download and build hawarp (git clone https://github.com/openplanets/hawarp.git), then build an executable jar with dependencies from the modules arc2warc-migration-cli and cdx-creator using “mvn assembly:assembly”.
- Open source wayback deployed to an Apache Tomcat servlet container. See this setup and configuration guide for installing the wayback software.
- Wayback configured with a CDX collection. Configuration is done in the configuration files available in the WEB-INF folder of the deployed web application. See this example wayback.xml and CDXcollection.xml configuration files to see how to set up the CDX collection.
- Perl script CSVDIFF installed
- PhantomJS installed, and a PhantomJS script for taking snapshots from URLs available.
- Exiftool and ImageMagick installed
- Configure the constant values in the Taverna workflow to match your system's environment paths.
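Before running the workflow, it can help to check that the command-line tools listed above are on the PATH. The following is a minimal sketch, not part of the workflow itself. The executable names are assumptions based on common installations ("compare" is ImageMagick's comparison tool). The csvdiff script is not checked because its install location varies.

```python
import shutil

# Assumed executable names; adjust to match your installation.
REQUIRED_TOOLS = ["java", "mvn", "git", "perl", "phantomjs", "exiftool", "compare"]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools were found on the PATH.")
```

The check only confirms that an executable with the given name exists. It does not verify versions, such as Java >= 1.7, or that the wayback application is deployed and configured.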
Run this Workflow in the Taverna Workbench...
Option 1:
Copy and paste this link into File > 'Open workflow location...'
http://myexperiment.org/workflows/4333/download?version=1