ToMaR HDFS Input Directory Processing
Created: 2014-03-04 12:47:29
Last updated: 2014-03-11 09:45:37
This workflow allows processing an HDFS input directory using ToMaR.
The "hdfs_working_dir" input port is the HDFS input directory which containes the data to be processed by ToMaR.
The "toolspec" input port contains the toolspec XML describing operations that can be used (see "operation" input port).
The "operation" input port defines the operation to be used in the current ToMaR job execution (see "toolspec" input port, an operation port used here must be defined in the tool specification).
The "hdfs_working_dir" input port defines the directory where the outputs will be stored in a date/time-subdirectory.
For example:
tomarworkingdir/20140304130007/dataout
tomarworkingdir/20140304130007/joboutput
tomarworkingdir/20140304130007/tomar-controlfile.txt
tomarworkingdir/20140304130007/toolspec
The "dataout" directory contains the output data of the ToMaR process. Depending on the operation used, this can be the result of a file format identification or a data migration process.
The "joboutput" directory contains the Hadoop job output of the ToMaR Hadoop job.
The "tomar-controlfile.txt" file is the input file for the ToMaR Hadoop job execution.
The "toolspec" directory contains the tool specification file given by the "toolspec" input port.
Workflow Components
Dependencies (0)
Inputs (5)

hdfs_input_dir
    HDFS input directory which contains the data to be processed by ToMaR (a flat directory listing, no sub-directories).

operation
    Operation to execute (must match an operation defined in the toolspec XML, see port "toolspec").

toolspec
    Toolspec XML describing the operations which can be executed using ToMaR.

hdfs_working_dir
    Working directory (without trailing slash). A date/time subdirectory in this working directory will contain all outputs of the workflow. For example:
        tomarworkingdir/20140304130007/dataout
        tomarworkingdir/20140304130007/joboutput
        tomarworkingdir/20140304130007/tomar-controlfile.txt
        tomarworkingdir/20140304130007/toolspec
    The "dataout" directory contains the output data of the ToMaR process.
    The "joboutput" directory contains the Hadoop job output of the ToMaR Hadoop job.
    The "tomar-controlfile.txt" file is the input file for the ToMaR Hadoop job execution.
    The "toolspec" directory contains the tool specification file given by the "toolspec" input port.

num_lines_per_task
    Number of lines that ToMaR should process per task. One line is a processing instruction in the ToMaR control file.
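Since the "operation" port must name an operation defined in the toolspec XML, a small validation sketch can make that contract explicit. The element layout (an <operations> element containing <operation name="..."> entries) follows the SCAPE toolspec convention but should be treated as an assumption; adapt it to your schema.

```python
# Sketch of the consistency check implied by the "operation" input port:
# the operation name must appear in the toolspec XML. Element names are
# assumptions based on the SCAPE toolspec convention.
import xml.etree.ElementTree as ET

def operation_defined(toolspec_xml: str, operation: str) -> bool:
    root = ET.fromstring(toolspec_xml)
    # Match on the local element name so namespaced toolspecs also work.
    names = {
        el.get("name")
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "operation"
    }
    return operation in names

toolspec = """<tool name="imagemagick">
  <operations>
    <operation name="convert"/>
    <operation name="identify"/>
  </operations>
</tool>"""
assert operation_defined(toolspec, "convert")
```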
Processors (4)

tomar_prepare_hadoopjob (externaltool)
tomar_run_hadoopjob (externaltool)
ls_result (externaltool)
tomar_jar_path (stringconstant)
    Value: /home/onbfue/pt-mapred-0.0.1-SNAPSHOT-jar-with-dependencies.jar
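The tomar_run_hadoopjob processor essentially launches the ToMaR jar as a Hadoop job, using the jar path supplied by the tomar_jar_path constant. Below is a hedged Python sketch of such an invocation; the flag names (-i, -o, -n) are assumptions made for illustration, so check the pt-mapred usage message for the real options.

```python
# A hedged sketch of what the tomar_run_hadoopjob external tool roughly
# executes: a "hadoop jar" call against the ToMaR jar. The flag names
# below are assumptions, not the documented pt-mapred interface.
import subprocess

def run_tomar(jar_path: str, run_dir: str, num_lines_per_task: int):
    cmd = [
        "hadoop", "jar", jar_path,
        "-i", f"{run_dir}/tomar-controlfile.txt",  # processing instructions
        "-o", f"{run_dir}/joboutput",              # Hadoop job output
        "-n", str(num_lines_per_task),             # lines per map task
    ]
    # Returns the completed process; STDOUT is wired onward in the workflow.
    return subprocess.run(cmd, capture_output=True, text=True, check=True)

# Example call using the workflow's jar-path constant:
# run_tomar("/home/onbfue/pt-mapred-0.0.1-SNAPSHOT-jar-with-dependencies.jar",
#           "tomarworkingdir/20140304130007", 10)
```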
Outputs (1)

tomar_run_hadoopjob_STDOUT
Datalinks (10)

hdfs_input_dir -> tomar_prepare_hadoopjob:hdfs_input_dir
operation -> tomar_prepare_hadoopjob:operation
hdfs_working_dir -> tomar_prepare_hadoopjob:hdfs_working_dir
toolspec -> tomar_prepare_hadoopjob:toolspec
hdfs_working_dir -> tomar_run_hadoopjob:hdfs_working_dir
tomar_prepare_hadoopjob:STDOUT -> tomar_run_hadoopjob:datetime_token
tomar_jar_path:value -> tomar_run_hadoopjob:tomar_jar_path
num_lines_per_task -> tomar_run_hadoopjob:num_lines_per_task
tomar_run_hadoopjob:STDOUT -> ls_result:hdfs_data_dir
ls_result:STDOUT -> tomar_run_hadoopjob_STDOUT
Coordinations (1)

tomar_run_hadoopjob (controller) -> ls_result (target)

This coordination ensures that ls_result runs only after tomar_run_hadoopjob has completed.
Version 2 (latest of 2)