For Workflow: Hadoop Large Document Collection Data Preparation
Workflow for preparing large document collections for data analysis. Different types of Hadoop jobs (Hadoop Streaming API, Hadoop MapReduce, and Hive) are used for specific purposes.
The *PathCreator components create text files with absolute file paths using the Unix command 'find'. The workflow then uses 1) a Hadoop Streaming API component (HadoopStreamingExiftoolRead), based on a bash script, to read image metadata with Exiftool, and 2) the MapReduce component (HadoopHocrAvBlockWidthMapR...
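The path-list step described above can be sketched in plain shell. The directory names, output file, streaming-jar path, and mapper script name below are illustrative assumptions, not the workflow's actual values:

```shell
# Build a text file of absolute file paths, as the *PathCreator
# components do with the Unix 'find' command.
# (hypothetical input directory and file names)
mkdir -p /tmp/docs/sub
touch /tmp/docs/a.tif /tmp/docs/sub/b.tif
find /tmp/docs -type f -name '*.tif' > /tmp/doc-paths.txt
cat /tmp/doc-paths.txt

# A Hadoop Streaming job over that path list, with a bash mapper
# that calls Exiftool, might then be launched like this
# (hypothetical jar path and script name; not executed here):
# hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
#   -input /tmp/doc-paths.txt \
#   -output exiftool-metadata \
#   -mapper read-exiftool.sh \
#   -file read-exiftool.sh
```

Feeding the job a list of paths rather than the files themselves is a common pattern for binary inputs: each mapper line is a path, and the mapper script fetches and processes the file it names.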
Created: 2012-08-17 | Last updated: 2012-08-18
Credits: Sven