Data Set Metadata Generator

Created: 2008-08-19 21:17:14
Last updated: 2008-08-19 21:25:53

This workflow generates ePrints XML import files containing data set metadata for the FLOSSmole project. It reads an input file produced by a SQL query against the Notre Dame SourceForge data dump and uses regular expressions on each filename to derive the data set's source repository, download URL, and basic description. It also converts the epoch date into a SQL date format suitable for import, and converts the file size from bytes into larger units (GB, MB, or KB) where appropriate. These data are inserted into an XML eprint record template (specific to the FLOSSmole ePrints repository configuration at wp.floss.syr.edu), and the individual eprint records are aggregated into a single XML import file.
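
The two conversions described above (epoch date to SQL format, bytes to larger units) might look roughly like the following Python sketch. The workflow itself uses Beanshell scripts that are not reproduced here, so the function names, the UTC time zone, the 1024-based units, and the two-decimal rounding are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def epoch_to_sql(epoch_seconds):
    """Convert Unix epoch seconds to 'YYYY-MM-DD HH:MM:SS' (UTC is assumed)."""
    moment = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def format_filesize(num_bytes):
    """Render a byte count in GB, MB, KB, or B (1024-based units are assumed)."""
    size = int(num_bytes)
    for unit, factor in (("GB", 1024 ** 3), ("MB", 1024 ** 2), ("KB", 1024)):
        if size >= factor:
            return f"{size / factor:.2f} {unit}"
    return f"{size} B"
```

Both functions accept strings as well as integers, since values read from a CSV file arrive as text.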

Unfortunately, I'm not sure that I can provide the input file due to license restrictions. I can provide the SQL query, however, so that anyone who has signed a license agreement for access to the ND SourceForge data can retrieve the same input:

SELECT f.filename, f.file_id, f.file_size, f.post_date
FROM sf0508.frs_file AS f, sf0508.groups AS g
WHERE g.unix_group_name = 'ossmole'
  AND f.group_id = g.group_id
ORDER BY f.post_date

Run

To run this workflow in the Taverna Workbench, copy and paste this link into File > 'Open workflow location...':
http://www.myexperiment.org/workflows/376/download?version=1

Workflow Components

Inputs (0)

Processors (13)

Name	Type	Description
read_input	local	Shim to read in the file, location provided by a string constant.
file_location	stringconstant	Edit to use your local path to the input file location.
split_rows	local	Takes a flat CSV input file and splits it into a list.
split_more	local	Takes the list input and creates a 2-deep list.
parse_filename_for_description	beanshell	Creates a general description of the data set contents based on regex matching on filenames.
parse_filename_for_source	beanshell	Extracts the repository data source from each filename.
aggregate_eprints	beanshell	Aggregates the individual eprint records into a depositable XML file, configured specifically for the wp.floss.syr.edu ePrints repository.
change_date_format	beanshell	Converts the date from Unix epoch time to SQL date format.
format_filesize	beanshell	Formats the filesize in bytes into a more human-readable format, conditionally displaying results in GB, MB, KB, or B.
build_URL_from_filename	beanshell	Constructs the SourceForge file download URL for FLOSSmole data sets, given the name of the files.
parse_filename_for_filetype	beanshell	Uses pattern matching to identify file types in file names.
split_fields	beanshell	Reads the 2-deep input list and splits out the values into separate variables. Not all fields were used.
generate_xml_record	beanshell	Generates an XML ePrint record based on a template specifically configured for data set deposit in the wp.floss.syr.edu ePrints repository.
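
The input-handling processors (split_rows, split_more, split_fields) could be approximated as in the sketch below. It assumes a comma-delimited file with no header row and no quoted fields, and it assumes the columns follow the order of the SQL query (filename, file_id, file_size, post_date). The sample data in the usage note is made up.

```python
def split_rows(contents):
    """Split the raw file contents into a list of non-empty lines."""
    return [line for line in contents.splitlines() if line.strip()]


def split_more(rows, delimiter=","):
    """Turn each line into a list of fields, giving a 2-deep list."""
    return [row.split(delimiter) for row in rows]


def split_fields(record):
    """Pick out the fields used downstream; file_id is read but not used."""
    filename, _file_id, filesize, post_date = (field.strip() for field in record)
    return {"file_name": filename, "filesize": filesize, "post_date": post_date}
```

For example, split_more(split_rows("a.txt,1,2048,0\n")) yields one record, and split_fields applied to that record returns its file name, size, and post date as strings.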

Beanshells (9)

Name	Description	Inputs	Outputs
parse_filename_for_description	Creates a general description of the data set contents based on regex matching on filenames.	filename	description
parse_filename_for_source	Extracts the repository data source from each filename.	filename	source
aggregate_eprints	Aggregates the individual eprint records into a depositable XML file, configured specifically for the wp.floss.syr.edu ePrints repository.	eprint	import_file
change_date_format	Converts the date from Unix epoch time to SQL date format.	post_date	date_posted
format_filesize	Formats the filesize in bytes into a more human-readable format, conditionally displaying results in GB, MB, KB, or B.	filesize	formatted_filesize
build_URL_from_filename	Constructs the SourceForge file download URL for FLOSSmole data sets, given the name of the files.	filename	url
parse_filename_for_filetype	Uses pattern matching to identify file types in file names.	filename	filetype
split_fields	Reads the 2-deep input list and splits out the values into separate variables. Not all fields were used.	file	file_name post_date filesize
generate_xml_record	Generates an XML ePrint record based on a template specifically configured for data set deposit in the wp.floss.syr.edu ePrints repository.	filename source filesize post_date filetype url description	eprint_record
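
The filename-based Beanshells could be sketched in Python as follows. The actual regular expressions and URL template used by the workflow are not given here, so the filename prefixes, the list of file extensions, and the base URL below are hypothetical placeholders; only the project name 'ossmole' comes from the SQL query.

```python
import re
from urllib.parse import quote

# Hypothetical filename prefixes; the workflow's real patterns are not shown.
SOURCE_PATTERNS = [
    (re.compile(r"^sf", re.IGNORECASE), "SourceForge"),
    (re.compile(r"^fm", re.IGNORECASE), "Freshmeat"),
]

# Hypothetical extension list, with an optional compression suffix.
FILETYPE_PATTERN = re.compile(r"\.(sql|txt|csv|xml)(?:\.(?:gz|bz2|zip))?$", re.IGNORECASE)

# Assumed URL layout for illustration only.
BASE_URL = "http://downloads.sourceforge.net/ossmole/"


def parse_source(filename):
    """Return the repository name whose prefix pattern matches the filename."""
    for pattern, name in SOURCE_PATTERNS:
        if pattern.search(filename):
            return name
    return "unknown"


def parse_filetype(filename):
    """Return the data file extension, ignoring any compression suffix."""
    match = FILETYPE_PATTERN.search(filename)
    return match.group(1).lower() if match else "unknown"


def build_url(filename, base=BASE_URL):
    """Append the percent-encoded filename to the assumed base URL."""
    return base + quote(filename)
```

Keeping the patterns in module-level tables makes it easy to add a repository or file type without touching the functions.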

Outputs (1)

Name	Description
XMLoutput	Text output of the XML import file containing the ePrints metadata records.
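
As a rough illustration of how individual records might be built and then aggregated into one import file, here is a Python sketch using the standard library's ElementTree. The element names ("eprints", "eprint", and the field names) are placeholders; the real template is specific to the wp.floss.syr.edu repository configuration and is not reproduced here.

```python
import xml.etree.ElementTree as ET


def generate_xml_record(fields):
    """Build one <eprint> element with a child element per metadata field."""
    record = ET.Element("eprint")
    for name, value in fields.items():
        ET.SubElement(record, name).text = str(value)
    return record


def aggregate_eprints(records):
    """Wrap all record elements in a single root and serialize to text."""
    root = ET.Element("eprints")
    root.extend(records)
    return ET.tostring(root, encoding="unicode")
```

Building the records as elements rather than concatenating strings means special characters in filenames or descriptions are escaped automatically on output.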

Links (19)

Source	Sink
aggregate_eprints:import_file	XMLoutput
build_URL_from_filename:url	generate_xml_record:url
change_date_format:date_posted	generate_xml_record:post_date
file_location:value	read_input:fileurl
format_filesize:formatted_filesize	generate_xml_record:filesize
generate_xml_record:eprint_record	aggregate_eprints:eprint
parse_filename_for_description:description	generate_xml_record:description
parse_filename_for_filetype:filetype	generate_xml_record:filetype
parse_filename_for_source:source	generate_xml_record:source
read_input:filecontents	split_rows:string
split_fields:file_name	build_URL_from_filename:filename
split_fields:file_name	generate_xml_record:filename
split_fields:file_name	parse_filename_for_description:filename
split_fields:file_name	parse_filename_for_filetype:filename
split_fields:file_name	parse_filename_for_source:filename
split_fields:filesize	format_filesize:filesize
split_fields:post_date	change_date_format:post_date
split_more:split	split_fields:file
split_rows:split	split_more:string
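
The links above define a dataflow graph. One way to see a valid execution order is to reduce the links to processor level (dropping port names) and sort them topologically, as in this Python sketch. It uses graphlib from the standard library (Python 3.9 or later); the function name is an illustrative choice, and Taverna's own scheduling is not modeled here.

```python
from graphlib import TopologicalSorter

# The 19 links listed above, as (source, sink) pairs without port names.
LINKS = [
    ("aggregate_eprints", "XMLoutput"),
    ("build_URL_from_filename", "generate_xml_record"),
    ("change_date_format", "generate_xml_record"),
    ("file_location", "read_input"),
    ("format_filesize", "generate_xml_record"),
    ("generate_xml_record", "aggregate_eprints"),
    ("parse_filename_for_description", "generate_xml_record"),
    ("parse_filename_for_filetype", "generate_xml_record"),
    ("parse_filename_for_source", "generate_xml_record"),
    ("read_input", "split_rows"),
    ("split_fields", "build_URL_from_filename"),
    ("split_fields", "generate_xml_record"),
    ("split_fields", "parse_filename_for_description"),
    ("split_fields", "parse_filename_for_filetype"),
    ("split_fields", "parse_filename_for_source"),
    ("split_fields", "format_filesize"),
    ("split_fields", "change_date_format"),
    ("split_more", "split_fields"),
    ("split_rows", "split_more"),
]


def execution_order(links):
    """Return the nodes in an order where every source precedes its sinks."""
    predecessors = {}
    for source, sink in links:
        predecessors.setdefault(sink, set()).add(source)
        predecessors.setdefault(source, set())
    return list(TopologicalSorter(predecessors).static_order())
```

The order of the six processors between split_fields and generate_xml_record is not fixed by the links, so they may appear in any order relative to one another.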

Coordinations (0)

Workflow Type

Taverna 1

Uploader

Andrea Wiggins

License

All versions of this Workflow are licensed under:

Version 1 (of 1)

Credits (1)

(People/Groups)

Andrea Wiggins

Attributions (0)

(Workflows/Files)

None

Tags (6)

Uploader tags

Shared with Groups (1)

Free/Libre Open Source Software

Featured In Packs (1)

Metadata Management

Attributed By (1)

(Workflows/Files)

DOI Record Generator

Favourited By (0)

No one

Statistics

2882 viewings

2300 downloads

Citations (0)

None

Version History

In chronological order:

Data Set Metadata Generator

Created by Andrea Wiggins on Tuesday 19 August 2008 21:17:14 (UTC)

Last edited by Andrea Wiggins on Tuesday 19 August 2008 21:25:54 (UTC)
