Data Set Metadata Generator
Created: 2008-08-19 21:17:14
Last updated: 2008-08-19 21:25:53
This workflow generates ePrints XML import files containing data set metadata for the FLOSSmole project. It reads an input file produced by an SQL query against the Notre Dame SourceForge data dump and uses regular expressions to parse each filename for the data set's source repository, download URL, and basic description. It also converts the epoch date into an SQL date format suitable for import, and converts the file size from bytes into larger units (GB, MB, or KB) where appropriate. These data are inserted into an XML eprint record template (specific to the FLOSSmole ePrints repository configuration at wp.floss.syr.edu), and the individual eprints are aggregated into a single XML import file.
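The date and file size conversions described above can be sketched as follows. This is a minimal Python illustration, not the workflow's Beanshell code, which is not shown on this page. The function names echo the processor names; the UTC time zone, the 'YYYY-MM-DD HH:MM:SS' layout, the 1024-based units, and the one-decimal rounding are all assumptions.

```python
from datetime import datetime, timezone


def epoch_to_sql(epoch_seconds):
    """Convert Unix epoch seconds to 'YYYY-MM-DD HH:MM:SS' (UTC is an assumption)."""
    moment = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def format_filesize(size_bytes):
    """Render a byte count in GB, MB, KB, or B (1024-based units are an assumption)."""
    size = int(size_bytes)
    # Try the largest unit first and fall through to plain bytes.
    for unit, factor in (("GB", 1024 ** 3), ("MB", 1024 ** 2), ("KB", 1024)):
        if size >= factor:
            return f"{size / factor:.1f} {unit}"
    return f"{size} B"
```

Both functions accept either integers or numeric strings, since values read from a text input file arrive as strings.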
Unfortunately, I'm not sure that I can provide the input file due to license restrictions. I can provide the SQL query, however, so that anyone who has signed a license agreement for access to the ND SourceForge data can retrieve the same input:
SELECT f.filename, f.file_id, f.file_size, f.post_date
FROM sf0508.frs_file AS f, sf0508.groups AS g
WHERE g.unix_group_name = 'ossmole'
  AND f.group_id = g.group_id
ORDER BY f.post_date
Workflow Components
Processors (13)
Name | Type | Description
read_input | local | Shim to read in the file, location provided by a string constant.
file_location | stringconstant | Edit to use your local path to the input file location.
split_rows | local | Takes a flat CSV input file and splits it into a list.
split_more | local | Takes the list input and creates a 2-deep list.
parse_filename_for_description | beanshell | Creates a general description of the data set contents based on regex matching on filenames.
parse_filename_for_source | beanshell | Extracts the repository data source from each filename.
aggregate_eprints | beanshell | Aggregates the individual eprint records into a depositable XML file, configured specifically for the wp.floss.syr.edu ePrints repository.
change_date_format | beanshell | Changes the date format from epoch to sql.
format_filesize | beanshell | Formats the filesize in bytes into a more human-readable format, conditionally displaying results in GB, MB, KB, or B.
build_URL_from_filename | beanshell | Constructs the SourceForge file download URL for FLOSSmole data sets, given the name of the files.
parse_filename_for_filetype | beanshell | Uses pattern matching to identify file types in file names.
split_fields | beanshell | Reads the 2-deep input list and splits out the values into separate variables. Not all fields were used.
generate_xml_record | beanshell | Generates an XML ePrint record based on a template specifically configured for data set deposit in the wp.floss.syr.edu ePrints repository.
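The three splitting steps (split_rows, split_more, split_fields) can be sketched as below. This is a Python illustration under stated assumptions, not the workflow's own code: it assumes a comma-delimited file with no header row and no quoted fields, with columns in the order of the SQL query (filename, file_id, file_size, post_date). The sample rows in the usage note are made up.

```python
def split_rows(file_contents):
    """Split the raw file text into one string per non-empty line."""
    return [line for line in file_contents.splitlines() if line.strip()]


def split_more(rows, delimiter=","):
    """Turn the list of row strings into a 2-deep list of field values."""
    return [row.split(delimiter) for row in rows]


def split_fields(record):
    """Name the fields of one row; file_id is read but not passed on."""
    filename, _file_id, file_size, post_date = (field.strip() for field in record)
    return {"file_name": filename, "filesize": file_size, "post_date": post_date}


def read_records(file_contents):
    """Chain the three steps for a whole input file."""
    return [split_fields(record) for record in split_more(split_rows(file_contents))]
```

For example, `read_records("exampleA.txt.bz2,101,2048,1200000000\n")` yields a single dictionary with the keys file_name, filesize, and post_date. A file with quoted or embedded commas would need Python's csv module instead of a plain split.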
Beanshells (9)
Name | Description | Inputs | Outputs
parse_filename_for_description | Creates a general description of the data set contents based on regex matching on filenames. | filename | description
parse_filename_for_source | Extracts the repository data source from each filename. | filename | source
aggregate_eprints | Aggregates the individual eprint records into a depositable XML file, configured specifically for the wp.floss.syr.edu ePrints repository. | eprint | import_file
change_date_format | Changes the date format from epoch to sql. | post_date | date_posted
format_filesize | Formats the filesize in bytes into a more human-readable format, conditionally displaying results in GB, MB, KB, or B. | filesize | formatted_filesize
build_URL_from_filename | Constructs the SourceForge file download URL for FLOSSmole data sets, given the name of the files. | filename | url
parse_filename_for_filetype | Uses pattern matching to identify file types in file names. | filename | filetype
split_fields | Reads the 2-deep input list and splits out the values into separate variables. Not all fields were used. | file | file_name, post_date, filesize
generate_xml_record | Generates an XML ePrint record based on a template specifically configured for data set deposit in the wp.floss.syr.edu ePrints repository. | filename, source, filesize, post_date, filetype, url, description | eprint_record
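The filename-driven Beanshells all follow the same idea: match a pattern against the filename and map the match to a value. The Python sketch below illustrates that idea only. The actual regular expressions and the real download URL layout are not shown on this page, so the prefix table, the file extensions, and the base URL argument are hypothetical placeholders.

```python
import re
from urllib.parse import quote

# Hypothetical prefix-to-repository table; the workflow's real patterns are not shown.
SOURCE_PATTERNS = [
    (re.compile(r"^sf", re.IGNORECASE), "SourceForge"),
    (re.compile(r"^fm", re.IGNORECASE), "Freshmeat"),
]

# Hypothetical extension pattern: a base type plus an optional compression suffix.
FILETYPE_PATTERN = re.compile(r"\.(txt|sql|csv)(?:\.(bz2|gz|zip))?$", re.IGNORECASE)


def parse_source(filename):
    """Return the repository implied by the filename prefix, or 'unknown'."""
    for pattern, source in SOURCE_PATTERNS:
        if pattern.search(filename):
            return source
    return "unknown"


def parse_filetype(filename):
    """Return the file type implied by the filename extension, or 'unknown'."""
    match = FILETYPE_PATTERN.search(filename)
    if match is None:
        return "unknown"
    base, compression = match.group(1).lower(), match.group(2)
    if compression:
        return f"{base} ({compression.lower()} compressed)"
    return base


def build_url(filename, base_url):
    """Join a caller-supplied base URL and a percent-encoded filename."""
    return f"{base_url.rstrip('/')}/{quote(filename)}"
```

Returning "unknown" rather than raising keeps one unrecognized filename from stopping the whole batch; a stricter variant could log or reject such rows instead.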
Outputs (1)
Name | Description
XMLoutput | Text output of the XML import file for ePrints metadata records.
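Building one record per data set and then wrapping all records in a single document can be sketched with Python's standard XML library. The element names below are placeholders taken from the generate_xml_record input names; the real template is specific to the wp.floss.syr.edu repository configuration and is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Placeholder field names, borrowed from the generate_xml_record inputs.
FIELD_NAMES = ("filename", "source", "filesize", "post_date",
               "filetype", "url", "description")


def generate_xml_record(fields):
    """Build one <eprint> element with a child element per metadata field."""
    record = ET.Element("eprint")
    for name in FIELD_NAMES:
        ET.SubElement(record, name).text = fields[name]
    return record


def aggregate_eprints(records):
    """Wrap the individual records in one root element and serialize to text."""
    root = ET.Element("eprints")
    root.extend(records)
    return ET.tostring(root, encoding="unicode")
```

Using an XML library instead of string concatenation means characters such as & and < in descriptions are escaped automatically.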
Links (19)
Source | Sink
aggregate_eprints:import_file | XMLoutput
build_URL_from_filename:url | generate_xml_record:url
change_date_format:date_posted | generate_xml_record:post_date
file_location:value | read_input:fileurl
format_filesize:formatted_filesize | generate_xml_record:filesize
generate_xml_record:eprint_record | aggregate_eprints:eprint
parse_filename_for_description:description | generate_xml_record:description
parse_filename_for_filetype:filetype | generate_xml_record:filetype
parse_filename_for_source:source | generate_xml_record:source
read_input:filecontents | split_rows:string
split_fields:file_name | build_URL_from_filename:filename
split_fields:file_name | generate_xml_record:filename
split_fields:file_name | parse_filename_for_description:filename
split_fields:file_name | parse_filename_for_filetype:filename
split_fields:file_name | parse_filename_for_source:filename
split_fields:filesize | format_filesize:filesize
split_fields:post_date | change_date_format:post_date
split_more:split | split_fields:file
split_rows:split | split_more:string
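The links form a directed acyclic graph, so a valid execution order can be derived from them. The Python sketch below (requires Python 3.9 or later for graphlib) copies the 19 source and sink pairs from the table and sorts the processors topologically. It illustrates the dependency structure only; it is not how Taverna itself schedules processors.

```python
from graphlib import TopologicalSorter

# The 19 links from the table, as (source, sink) pairs.
LINKS = [
    ("aggregate_eprints:import_file", "XMLoutput"),
    ("build_URL_from_filename:url", "generate_xml_record:url"),
    ("change_date_format:date_posted", "generate_xml_record:post_date"),
    ("file_location:value", "read_input:fileurl"),
    ("format_filesize:formatted_filesize", "generate_xml_record:filesize"),
    ("generate_xml_record:eprint_record", "aggregate_eprints:eprint"),
    ("parse_filename_for_description:description", "generate_xml_record:description"),
    ("parse_filename_for_filetype:filetype", "generate_xml_record:filetype"),
    ("parse_filename_for_source:source", "generate_xml_record:source"),
    ("read_input:filecontents", "split_rows:string"),
    ("split_fields:file_name", "build_URL_from_filename:filename"),
    ("split_fields:file_name", "generate_xml_record:filename"),
    ("split_fields:file_name", "parse_filename_for_description:filename"),
    ("split_fields:file_name", "parse_filename_for_filetype:filename"),
    ("split_fields:file_name", "parse_filename_for_source:filename"),
    ("split_fields:filesize", "format_filesize:filesize"),
    ("split_fields:post_date", "change_date_format:post_date"),
    ("split_more:split", "split_fields:file"),
    ("split_rows:split", "split_more:string"),
]


def processor(endpoint):
    """Strip the port name from a 'processor:port' endpoint."""
    return endpoint.split(":", 1)[0]


def execution_order(links):
    """Return the processors in an order where every source precedes its sinks."""
    sorter = TopologicalSorter()
    for source, sink in links:
        # add(node, predecessor): the sink depends on the source.
        sorter.add(processor(sink), processor(source))
    return list(sorter.static_order())
```

The order always begins with file_location, read_input, split_rows, split_more, and split_fields, and ends with generate_xml_record, aggregate_eprints, and XMLoutput. The seven processors in between do not depend on each other and may appear in any order.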
Version 1 (of 1)