Success-Abandonment-Classification
Retrieves data from FLOSSmole and from the Notre Dame SourceForge repository to compute project statistics based on releases, downloads and project lifespan. Project statistics are then used to classify projects according to the criteria set up in English & Schweik, but comparison criteria are parameterized so that a different set of criterion thresholds can be used to evaluate the project characteristics.
Run this Workflow in the Taverna Workbench...
Option 1:
Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/140/download?version=3
Taverna is available from http://taverna.sourceforge.net/
If you are having problems downloading it in Taverna, you may need to provide your username and password in the URL so that Taverna can access the Workflow:
Replace http:// in the link above with http://yourusername:yourpassword@
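The substitution described above can be sketched in a few lines of Python. This is only an illustration: the helper name `with_credentials` is hypothetical, and percent-encoding the credentials is an added precaution for usernames or passwords that contain characters such as `@` or `:`.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_credentials(url, username, password):
    """Return url with username:password@ inserted before the host name."""
    parts = urlsplit(url)
    # Percent-encode so reserved characters in the credentials do not break the URL.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


workflow_url = "http://www.myexperiment.org/workflows/140/download?version=3"
print(with_credentials(workflow_url, "yourusername", "yourpassword"))
```

The printed URL is the value to paste into File > 'Open workflow location...'.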
Workflow Components
Inputs
Name | Description |
---|---|
sf_unixname | List of SourceForge unixnames for the projects of interest; accesses data for time periods through mid-2005. |
Processors
Name | Type | Description |
---|---|---|
mortality_threshold | stringconstant | Length of time for a project to be considered abandoned, if no releases have been made by this time. Unit: days, integer values only. |
release_lag_threshold | stringconstant | Desired time between releases so that the releases are not made "too fast" for a sustainable rate of growth. Unit: days, integer values only. |
download_threshold | stringconstant | Minimum number of aggregate downloads for a project to be considered useful (see downloads_test). |
initiation_age_threshold | stringconstant | The threshold for how long a project may remain in the "initiation" stage without having produced a release, and still be considered not abandoned. Unit: days, integer values only. |
release_rate_type | stringconstant | Selects one of three methods for deriving the release rate value that is compared to a threshold to determine whether releases are too frequent for sustainable growth. Integer values only: 1 for first_last, 2 for recent_density, or 3 for average_rate. |
release_count_threshold | stringconstant | Minimum number of releases to be considered a success. |
delist_classtypes | local | Converts the classtype output of the iterated procedure from list format to CSV. |
delist_stages | local | Converts the stage output of the iterated procedure from list format to CSV. |
Stages_Analysis | rshell | Author: Andrea Wiggins. Provides simple proportions for the stages of projects as output from the classification. |
classifier | beanshell | Author: Andrea Wiggins. Uses the output of several criterion tests to determine the classification for a given SourceForge project. The if/else statements form a truth table of possible values based on the classification scheme in the English & Schweik article. |
classification_csv | beanshell | Author: Andrea Wiggins. Assembles the outputs of the analysis procedures with the raw criterion data into a single CSV file that records all of the relevant variables. |
Class_Analysis | rshell | Author: Andrea Wiggins. Provides simple proportions for the classes of projects as output from the classification. |
release_count_test | beanshell | Author: Andrea Wiggins. For each project, determines whether the number of releases meets the threshold value for minimum number of releases. |
growth_stage_test | beanshell | Author: Andrea Wiggins. For each project, determines whether the lifespan of the project (aggregate data from FLOSSmole: data collection date minus founding date) meets the threshold between initiation phase and growth phase. |
release_lag_test | beanshell | Author: Andrea Wiggins. For each project, determines whether the amount of time over which a given number of releases has occurred exceeds a threshold, which is intended to indicate an appropriate amount of time between releases for sustainable project activity, i.e. not too fast. There are three methods for comparing the release rate to the lag threshold: 1) "first_last", the time elapsed between the first and most recent release; 2) "recent_density", the time elapsed between the last X releases (where X is the workflow variable that sets the minimum number of releases for success); and 3) "average_rate", the average time between releases since the first one. Note that method 3 calls for a significantly different release_lag_threshold value, as it is based on the average time between individual releases rather than the aggregate time across several releases. |
downloads_test | beanshell | Author: Andrea Wiggins. For each project, determines whether the number of aggregate downloads for the project exceeds a minimum threshold for usefulness. This would be an interesting place to substitute a scaling function option for the download_threshold value, perhaps adjusting the threshold according to the project's lifespan or number of releases. |
mortality_test | beanshell | Author: Andrea Wiggins. For each project, determines whether the time between the last release and the date of data collection is within a threshold limit that indicates whether the project is active or inactive. |
GetData | workflow | Author: James Howison. Procedure to fetch data from FLOSSmole and the Notre Dame SourceForge dumps based on the SourceForge unixname for a given project. Inputs are a list of project SourceForge unixnames and a threshold value for the "recent release density" value. Not currently suited for running large batches of projects. |
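The criterion tests described above amount to simple threshold comparisons. The sketch below is not the workflow's Beanshell code; it is a minimal Python rendering under stated assumptions: all durations are already expressed in days, thresholds arrive as strings (as the string constants do), comparison boundaries (`>=` versus `>`) are guesses from the wording, and the return values (booleans and the labels "growth" and "initiation") are illustrative.

```python
def growth_stage_test(lifespan, initiation_age_threshold):
    """Label a project "growth" once its lifespan meets the threshold."""
    return "growth" if lifespan >= int(initiation_age_threshold) else "initiation"


def release_count_test(num_releases, release_count_threshold):
    """True when the project has at least the minimum number of releases."""
    return num_releases >= int(release_count_threshold)


def downloads_test(downloads, download_threshold):
    """True when aggregate downloads exceed the minimum threshold."""
    return downloads > int(download_threshold)


def mortality_test(time_since_last_release, mortality_threshold):
    """True when the last release is recent enough to count as active."""
    return time_since_last_release <= int(mortality_threshold)


def release_lag_test(release_rate_type, first_last_release,
                     recent_release_density, num_releases,
                     release_lag_threshold):
    """True when releases are spaced widely enough, i.e. not "too fast"."""
    rate_type = int(release_rate_type)
    if rate_type == 1:        # first_last: time between first and latest release
        elapsed = first_last_release
    elif rate_type == 2:      # recent_density: time spanned by the last X releases
        elapsed = recent_release_density
    elif rate_type == 3:      # average_rate: mean gap between consecutive releases
        if num_releases < 2:
            return False      # assumption: no gap exists to average
        elapsed = first_last_release / (num_releases - 1)
    else:
        raise ValueError("release_rate_type must be 1, 2, or 3")
    return elapsed > int(release_lag_threshold)
```

Method 3 divides the same span by the number of gaps, which is why it needs a much smaller release_lag_threshold than methods 1 and 2.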
Beanshells
Name | Description | Inputs | Outputs |
---|---|---|---|
ConvertEpochToXSD | | epoch | xsdDateTime |
MergeReleasesToYamlArray | | datetimes | releases_xml |
buildQueryWhere | | sf_unixname | where_clause |
buildFLOSSmoleStatisticsQueryString | | sf_unixname | queryString |
split_SQL_results | | result_row | aggregate_downloads, lifespan_days, data_for_date |
ConvertSQLDateToXSDDateTime | | sql_date | xsd_datetime |
BuildFLOSSmoleURLQueryString | | sf_unixname | url_query_string |
MatchSFURL | | result_row, sf_unixname | has_sf_url |
count_releases | | releases | release_count |
GetFirstRelease | | datetimes | chosen_datetime |
CalcDensityLength | | datetime_1, datetime_2 | seconds_between |
GetLastRelease | | index_wanted, datetimes | chosen_datetime |
TruncateReleasesList | | release_datetimes, cutoff_date | trunc_release_datetimes |
TimeBetweenLastAndCutoff | | datetime_1, datetime_2 | seconds_between |
GetReleaseForDensityCalc | | index_wanted, datetimes | chosen_datetime |
CalcTimeBetweenFirstAndLast | | datetime_1, datetime_2 | seconds_between |
classifier | Author: Andrea Wiggins. Uses the output of several criterion tests to determine the classification for a given SourceForge project. The if/else statements form a truth table of possible values based on the classification scheme in the English & Schweik article. | stage, usage, releases, release_mortality, release_rate, web_site, downloads | classtype |
classification_csv | Author: Andrea Wiggins. Assembles the outputs of the analysis procedures with the raw criterion data into a single CSV file that records all of the relevant variables. | downloads_list, lifespan_list, sf_unixname_list, release_count_list, time_last_first_list, release_density_list, time_last_current_list, has_sf_url_list, stages_list, classification_list, release_lag_test_list, release_count_test_list, downloads_test_list, mortality_test_list | out_csv |
release_count_test | Author: Andrea Wiggins. For each project, determines whether the number of releases meets the threshold value for minimum number of releases. | release_count_threshold, num_releases | releases |
growth_stage_test | Author: Andrea Wiggins. For each project, determines whether the lifespan of the project (aggregate data from FLOSSmole: data collection date minus founding date) meets the threshold between initiation phase and growth phase. | lifespan, initiation_threshold | stage |
release_lag_test | Author: Andrea Wiggins. For each project, determines whether the amount of time over which a given number of releases has occurred exceeds a threshold, which is intended to indicate an appropriate amount of time between releases for sustainable project activity, i.e. not too fast. There are three methods for comparing the release rate to the lag threshold: 1) "first_last", the time elapsed between the first and most recent release; 2) "recent_density", the time elapsed between the last X releases (where X is the workflow variable that sets the minimum number of releases for success); and 3) "average_rate", the average time between releases since the first one. Note that method 3 calls for a significantly different release_lag_threshold value, as it is based on the average time between individual releases rather than the aggregate time across several releases. | release_rate_type, first_last_release, recent_release_density, num_releases, release_lag_threshold | release_rate |
downloads_test | Author: Andrea Wiggins. For each project, determines whether the number of aggregate downloads for the project exceeds a minimum threshold for usefulness. This would be an interesting place to substitute a scaling function option for the download_threshold value, perhaps adjusting the threshold according to the project's lifespan or number of releases. | downloads, download_threshold | usage |
mortality_test | Author: Andrea Wiggins. For each project, determines whether the time between the last release and the date of data collection is within a threshold limit that indicates whether the project is active or inactive. | mortality_threshold, time_since_last_release | release_mortality |
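The date-handling components listed above (truncating the release list at a cutoff, picking the first, last, and X-th most recent release, and measuring the seconds between two datetimes) can be approximated as follows. This is a sketch under assumptions, not the nested workflow's code: the function `release_metrics` and its dictionary keys are hypothetical names chosen to mirror the GetData outputs, and `density_releases` stands in for the X used by the recent_density method.

```python
from datetime import datetime


def seconds_between(datetime_1, datetime_2):
    """Seconds elapsed from datetime_1 to datetime_2."""
    return (datetime_2 - datetime_1).total_seconds()


def release_metrics(release_datetimes, cutoff_date, density_releases):
    """Derive release-timing values from a list of release datetimes."""
    # Keep only releases made on or before the data collection cutoff.
    releases = sorted(d for d in release_datetimes if d <= cutoff_date)
    metrics = {
        "num_releases": len(releases),
        "first_last_release": None,
        "recent_release_density": None,
        "time_since_last_release": None,
    }
    if not releases:
        return metrics
    first, last = releases[0], releases[-1]
    metrics["first_last_release"] = seconds_between(first, last)
    metrics["time_since_last_release"] = seconds_between(last, cutoff_date)
    # Time spanned by the last X releases, when at least X (and 2) exist.
    if len(releases) >= density_releases >= 2:
        start = releases[-density_releases]
        metrics["recent_release_density"] = seconds_between(start, last)
    return metrics


example = release_metrics(
    [datetime(2005, 1, 1), datetime(2005, 2, 1),
     datetime(2005, 3, 1), datetime(2005, 6, 1)],
    cutoff_date=datetime(2005, 5, 1),
    density_releases=2,
)
print(example)
```

The values come out in seconds, so they would need dividing by 86400 before being compared to thresholds stated in days.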
Outputs
Name | Description |
---|---|
Class_Output | Proportions of each class of project in the sample. Potential values include: SG (success-growth), TG (tragedy-growth), IG (indeterminate-growth), TI (tragedy-initiation), II (indeterminate-initiation), unclassifiable (has 0 downloads, 0 releases, and a non-SourceForge web site), and other (cannot be classified). Any project producing an "other" value should be closely scrutinized to determine why it does not fit into the classification scheme; the most likely reason is the presence of null values due to missing data from one of the repositories. |
Stages_Output | Proportions of projects classified as being in the growth or initiation phase. |
Classification_Data | CSV output of all the data used for classification, the classification criterion values, and the final class assigned to each project. Suitable for use with R or Excel for later ad-hoc analysis. |
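As a rough illustration of how the Class_Output proportions could be reproduced from the class labels, the sketch below counts labels in a comma-separated string. The workflow itself does this in an RShell component; the function name `class_proportions` and the comma-separated input format are assumptions.

```python
from collections import Counter


def class_proportions(classtypes_csv):
    """Return the share of each class label in a comma-separated string."""
    labels = [label.strip() for label in classtypes_csv.split(",") if label.strip()]
    if not labels:
        return {}
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


proportions = class_proportions("SG,TG,SG,II")
print(proportions)
```

A non-zero share for "other" in such a summary would be the cue to inspect the corresponding rows of Classification_Data for null values.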
Datalinks
Source | Sink |
---|---|
GetData:downloads | classification_csv:downloads_list |
GetData:downloads | downloads_test:downloads |
GetData:first_last_release | classification_csv:time_last_first_list |
GetData:lifespan_days | classification_csv:lifespan_list |
GetData:lifespan_days | growth_stage_test:lifespan |
GetData:num_releases | classification_csv:release_count_list |
GetData:num_releases | release_count_test:num_releases |
GetData:recent_release_density | classification_csv:release_density_list |
GetData:recent_release_density | release_lag_test:recent_release_density |
GetData:time_since_last_release | classification_csv:time_last_current_list |
GetData:time_since_last_release | mortality_test:time_since_last_release |
GetData:web_site | classification_csv:has_sf_url_list |
GetData:web_site | classifier:web_site |
classifier:classtype | classification_csv:classification_list |
delist_stages:concatenated | Stages_Analysis:stages |
download_threshold:value | downloads_test:download_threshold |
downloads_test:usage | classification_csv:downloads_test_list |
downloads_test:usage | classifier:usage |
growth_stage_test:stage | classification_csv:stages_list |
growth_stage_test:stage | classifier:stage |
growth_stage_test:stage | delist_stages:stringlist |
initiation_age_threshold:value | growth_stage_test:initiation_threshold |
mortality_test:release_mortality | classification_csv:mortality_test_list |
mortality_test:release_mortality | classifier:release_mortality |
mortality_threshold:value | mortality_test:mortality_threshold |
release_count_test:releases | classifier:releases |
release_lag_test:release_rate | classifier:release_rate |
release_lag_threshold:value | release_lag_test:release_lag_threshold |
release_rate_type:value | release_lag_test:release_rate_type |
sf_unixname | GetData:sf_unixname |
GetData:downloads | classifier:downloads |
GetData:first_last_release | release_lag_test:first_last_release |
GetData:num_releases | release_lag_test:num_releases |
classifier:classtype | delist_classtypes:stringlist |
delist_classtypes:concatenated | Class_Analysis:classtypes |
sf_unixname | classification_csv:sf_unixname_list |
release_count_test:releases | classification_csv:release_count_test_list |
release_count_threshold:value | GetData:release_density_threshold |
release_count_threshold:value | release_count_test:release_count_threshold |
release_lag_test:release_rate | classification_csv:release_lag_test_list |
Class_Analysis:analysis_output | Class_Output |
Stages_Analysis:analysis_output | Stages_Output |
classification_csv:out_csv | Classification_Data |
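The links above define a dataflow graph, and one way to read it is as an execution order in which every component runs after the components that feed it. The sketch below derives such an order from a subset of the processor-to-processor links using Python's standard `graphlib` module (Python 3.9 or later). It only illustrates the dependency structure and is not how Taverna schedules processors; the threshold constants and workflow input and output ports are left out for brevity.

```python
from graphlib import TopologicalSorter

# Subset of the links above, in the same "source:port | sink:port" form.
LINKS = """
GetData:downloads | downloads_test:downloads
GetData:lifespan_days | growth_stage_test:lifespan
GetData:num_releases | release_count_test:num_releases
GetData:recent_release_density | release_lag_test:recent_release_density
GetData:time_since_last_release | mortality_test:time_since_last_release
GetData:web_site | classifier:web_site
downloads_test:usage | classifier:usage
growth_stage_test:stage | classifier:stage
growth_stage_test:stage | delist_stages:stringlist
mortality_test:release_mortality | classifier:release_mortality
release_count_test:releases | classifier:releases
release_lag_test:release_rate | classifier:release_rate
classifier:classtype | classification_csv:classification_list
classifier:classtype | delist_classtypes:stringlist
delist_stages:concatenated | Stages_Analysis:stages
delist_classtypes:concatenated | Class_Analysis:classtypes
"""

sorter = TopologicalSorter()
for line in LINKS.strip().splitlines():
    # Drop the port names; only the component names matter for ordering.
    source, sink = (part.strip().split(":")[0] for part in line.split("|"))
    sorter.add(sink, source)  # sink depends on source

order = list(sorter.static_order())
print(order)
```

In any valid order GetData comes first, the five criterion tests follow, and classifier precedes both classification_csv and the class analysis.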
Workflow Type
Version 3 (latest) (of 3)
Tags
- classification
- draft
- floss
- open source software
- oss
- social sciences
- taverna 1
Version history, in chronological order:
- Version 1
  Created by Andrea Wiggins on Wednesday 06 February 2008 14:35:41 (UTC)
  Last edited by Andrea Wiggins on Monday 11 February 2008 16:50:39 (UTC)
- Version 2
  Created by Andrea Wiggins on Wednesday 06 February 2008 14:35:41 (UTC)
  Last edited by Andrea Wiggins on Thursday 19 June 2008 02:14:31 (UTC)
  Revision comment: This version of the classifier is fully functional, drawing on data from FLOSSmole and the Notre Dame SourceForge repositories. The thresholds for classification decisions are set as string constants, which you can change as you like.
  In addition, this workflow offers three versions of the recent release rate classification: the original version (first_last), based on the time between the first and last releases; an averaged version (average_rate), which uses the average time between releases; and a version suggested by English & Schweik as future work (recent_density), which uses the time between the last X releases, where X is the threshold number of releases to be considered successful.
- Version 3
  Created by Andrea Wiggins on Wednesday 06 February 2008 14:35:41 (UTC)
  Last edited by Andrea Wiggins on Wednesday 02 July 2008 17:15:25 (UTC)
  Revision comment: This revision replaces the large RShell classifier component with a series of Beanshell components that execute the pre-classification criterion comparisons, do the project classification, and assemble a nice CSV output for ad-hoc analysis.
Comments (1)
I quite like the way this workflow is designed. The author used nested workflows to favor modularity, and the data links seem to be carefully designed. On the downside, I couldn't execute the workflow, which may be because some constituent services are no longer available. It would be good if the author (or someone knowledgeable) could repair it and create a new version of this workflow :-)
khalid