Hadoop Large Document Collection Data Preparation
Workflow for preparing large document collections for data analysis. Different types of hadoop jobs (Hadoop-Streaming-API, Hadoop Map/Reduce, and Hive) are used for specific purposes.
The *PathCreator components create text files with absolute file paths using the unix command 'find'. The workflow then uses 1) a Hadoop Streaming API component (HadoopStreamingExiftoolRead) based on a bash script for reading image metadata using Exiftool, 2) the Map/Reduce component (HadoopHocrAvBlockWidthMapReduce) presented above, and 3) Hive components for creating data tables (HiveLoad*Data) and performing queries on the result files (HiveSelect).
The code for the two hadoop jobs is available on Github: tb-lsdr-seqfilecreator and tb-lsdr-hocrparser.
Preview
Run
Run this Workflow in the Taverna Workbench...
Option 1:
Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/3105/download?version=1
[ More Info ]
Workflow Components
![header=[] body=[This is the author information extracted from the workflow version] cssheader=[boxoverTooltipHeader] cssbody=[boxoverTooltipBody] delay=[200] Information](/images/famfamfam_silk/information.png?1680607579)
![header=[] body=[These are the descriptive titles embedded within the workflow version] cssheader=[boxoverTooltipHeader] cssbody=[boxoverTooltipBody] delay=[200] Information](/images/famfamfam_silk/information.png?1680607579)
![header=[] body=[These are the descriptions embedded within the workflow version] cssheader=[boxoverTooltipHeader] cssbody=[boxoverTooltipBody] delay=[200] Information](/images/famfamfam_silk/information.png?1680607579)
![header=[] body=[These are the listed dependencies of the workflow] cssheader=[boxoverTooltipHeader] cssbody=[boxoverTooltipBody] delay=[200] Information](/images/famfamfam_silk/information.png?1680607579)
Reviews
(0)
Other workflows that use similar services
(0)
There are no workflows in myExperiment that use similar services to this Workflow.
No comments yet
Log in to make a comment