Hadoop Large Document Collection Data Preparation
Created: 2012-08-17 12:19:39
Last updated: 2012-08-18 18:39:26
Workflow for preparing large document collections for data analysis. Different types of Hadoop jobs (Hadoop Streaming API, Hadoop Map/Reduce, and Hive) are used for specific purposes.
The *PathCreator components create text files with absolute file paths using the Unix command 'find'. The workflow then uses 1) a Hadoop Streaming API component (HadoopStreamingExiftoolRead), based on a bash script that reads image metadata with Exiftool, 2) the Map/Reduce component (HadoopHocrAvBlockWidthMapReduce) presented above, and 3) Hive components for creating data tables (HiveLoad*Data) and querying the result files (HiveSelect); see the sketches below.
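As a minimal sketch of the path creation and the Exiftool streaming step, assuming illustrative local and HDFS paths and a hypothetical mapper script name (exiftool_read.sh) -- the actual component definitions live in the workflow itself:

# Build text files of absolute file paths, filtered by extension
# (this is what the *PathCreator components do with 'find'; paths are illustrative)
find /data/collection -type f -name '*.html' > html_paths.txt
find /data/collection -type f -name '*.jp2'  > jp2_paths.txt
hadoop fs -put jp2_paths.txt /user/demo/input/jp2_paths.txt

# Hypothetical mapper: read one JP2 path per line, emit "path<TAB>image width"
cat > exiftool_read.sh <<'EOF'
#!/usr/bin/env bash
while read -r f; do
  printf '%s\t%s\n' "$f" "$(exiftool -s3 -ImageWidth "$f")"
done
EOF
chmod +x exiftool_read.sh

# Hadoop Streaming job over the path list; the job name prefix is assumed to be
# passed in roughly like this from the hadoop_job_name_prefix input port
hadoop jar "$HADOOP_HOME"/contrib/streaming/hadoop-streaming-*.jar \
  -D mapred.job.name="${hadoop_job_name_prefix}_exiftool_read" \
  -input  /user/demo/input/jp2_paths.txt \
  -output /user/demo/output/exif \
  -mapper exiftool_read.sh \
  -file   exiftool_read.sh

Because the streaming input is a list of paths (produced by Jp2PathCreator) rather than the images themselves, the mapper presumably invokes Exiftool on each referenced file.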
The code for the two Hadoop jobs is available on GitHub: tb-lsdr-seqfilecreator and tb-lsdr-hocrparser.
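The Hive side could then look roughly like the sketch below; the table layouts, column names, and the final join are assumptions for illustration, since the actual schemas and query live in the HiveLoadExifData, HiveLoadHocrData, and HiveSelect components:

# Load the Hadoop job outputs into Hive and run an example query
# (illustrative schemas; '/user/demo/output/exif' matches the sketch above and
# '/user/demo/output/hocr' stands for the Map/Reduce job's output directory)
hive -e "
CREATE TABLE IF NOT EXISTS exif (fpath STRING, width INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
CREATE TABLE IF NOT EXISTS hocr (fpath STRING, avg_block_width INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/demo/output/exif' INTO TABLE exif;
LOAD DATA INPATH '/user/demo/output/hocr' INTO TABLE hocr;

-- join image width (Exiftool) with average hOCR block width per file
SELECT e.fpath, e.width, h.avg_block_width
FROM exif e JOIN hocr h ON e.fpath = h.fpath;
"

In the workflow these calls are wrapped as external tool invocations, and the coordinations listed below ensure that each Hive load runs only after its Hadoop job has finished and that HiveSelect runs only after both loads.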
Workflow Components
Inputs (2)
Name                   | Description
hadoop_job_name_prefix | Hadoop job name prefix for
rootpath               |
Processors (10)
Name                            | Type           | Description
HadoopHocrAvBlockWidthMapReduce | externaltool   |
HadoopSequenceFileCreator       | externaltool   |
HtmlPathCreator                 | externaltool   |
Jp2PathCreator                  | externaltool   |
html_extension                  | stringconstant | Value: html
jp2_extension                   | stringconstant | Value: jp2
HadoopStreamingExiftoolRead     | externaltool   |
HiveLoadExifData                | externaltool   |
HiveLoadHocrData                | externaltool   |
HiveSelect                      | externaltool   |
Datalinks (13)
Source                                 | Sink
HadoopSequenceFileCreator:STDOUT       | HadoopHocrAvBlockWidthMapReduce:hdfs_input_dir
hadoop_job_name_prefix                 | HadoopHocrAvBlockWidthMapReduce:hadoop_job_name_prefix
hadoop_job_name_prefix                 | HadoopSequenceFileCreator:hadoop_job_name_prefix
HtmlPathCreator:STDOUT                 | HadoopSequenceFileCreator:hdfs_input_path
rootpath                               | HtmlPathCreator:rootpath
html_extension:value                   | HtmlPathCreator:extfilter
rootpath                               | Jp2PathCreator:rootpath
jp2_extension:value                    | Jp2PathCreator:extfilter
Jp2PathCreator:STDOUT                  | HadoopStreamingExiftoolRead:hdfs_input_dir
hadoop_job_name_prefix                 | HadoopStreamingExiftoolRead:hadoop_job_name_prefix
HadoopStreamingExiftoolRead:STDOUT     | HiveLoadExifData:hdfs_result_file
HadoopHocrAvBlockWidthMapReduce:STDOUT | HiveLoadHocrData:hdfs_result_file
HiveSelect:STDOUT                      | Out
Coordinations (7)
Controller                      | Target
HadoopSequenceFileCreator       | HadoopHocrAvBlockWidthMapReduce
HtmlPathCreator                 | HadoopSequenceFileCreator
HiveLoadExifData                | HiveSelect
HadoopStreamingExiftoolRead     | HiveLoadExifData
HiveLoadHocrData                | HiveSelect
Jp2PathCreator                  | HadoopStreamingExiftoolRead
HadoopHocrAvBlockWidthMapReduce | HiveLoadHocrData