Extract unique proteins from blast results

Created: 2011-03-24 19:49:43 Last updated: 2011-04-01 12:26:27

The workflow outputs a list of proteins encoded by the target genomes that have no sequence similarity to those encoded by the source genome.

This workflow allows you to configure a BioMart query to fetch the sequences you want from Ensembl. These sequences are retrieved and a BLAST database is created from them (by default, in the directory you ran Taverna from).

Warning: This workflow assumes that blastall and formatdb are installed on the machine and that, by default, both are found or linked in /usr/local/bin. It also assumes that you have write permission to the directory you ran Taverna from. If the default locations are not appropriate for you, edit the beanshells "create_blastall_cmdArgs" and "create_formatdb_cmdArgs".
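A minimal sketch of how the tool location could be made configurable instead of fixed to /usr/local/bin. The class, method, and file names are hypothetical, and the flags shown (-i and -p T for formatdb; -p blastp, -d, -i, -o and -m 8 for blastall) are the usual legacy BLAST options, assumed here rather than copied from the workflow's beanshells.

```java
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds argument lists for the legacy BLAST tools.
final class BlastCommandSketch {

    // formatdb arguments for a protein FASTA file (-p T marks protein input).
    static List<String> formatdbArgs(String binDir, String fastaFile) {
        String exe = Paths.get(binDir, "formatdb").toString();
        return Arrays.asList(exe, "-i", fastaFile, "-p", "T");
    }

    // blastall arguments for a blastp search with tab-delimited output (-m 8).
    static List<String> blastallArgs(String binDir, String db, String query, String out) {
        String exe = Paths.get(binDir, "blastall").toString();
        return Arrays.asList(exe, "-p", "blastp", "-d", db,
                "-i", query, "-o", out, "-m", "8");
    }

    public static void main(String[] args) {
        // Fall back to the workflow's documented default directory.
        String binDir = args.length > 0 ? args[0] : "/usr/local/bin";
        System.out.println(String.join(" ", formatdbArgs(binDir, "source.fasta")));
        System.out.println(String.join(" ",
                blastallArgs(binDir, "source.fasta", "target.fasta", "results.tsv")));
    }
}
```

Passing the directory as a parameter keeps the default in one place, so a user with BLAST installed elsewhere would only change a single value.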

Shortcomings:

The names of all the files created and used are hard-coded in this workflow. This means that if you run the workflow more than once without editing anything, you will overwrite files you created previously.

Files created in the working directory are not yet deleted by the workflow. Ideally, there would be an option letting the user choose whether the files are kept or deleted after use.
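Both shortcomings could be addressed along the following lines. This is only a sketch under stated assumptions: the class name, the timestamp pattern, and the keepFiles flag are hypothetical and are not part of the published workflow.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

// Hypothetical helper for per-run file names and optional cleanup.
final class RunFiles {
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    // Prefix a base file name with the run start time so that
    // repeated runs do not overwrite each other's files.
    static String nameForRun(String baseName, LocalDateTime runStart) {
        return runStart.format(STAMP) + "_" + baseName;
    }

    // Delete the listed files unless the user asked to keep them.
    // Returns the number of files actually removed.
    static int cleanUp(List<Path> files, boolean keepFiles) throws IOException {
        if (keepFiles) {
            return 0;
        }
        int deleted = 0;
        for (Path file : files) {
            if (Files.deleteIfExists(file)) {
                deleted++;
            }
        }
        return deleted;
    }
}
```

Note that two runs started within the same second would still produce the same name; a run counter or a random suffix would be needed to rule that out.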

The workflow parses the BLAST results to determine the unique proteins in the target genome that have no similarity to the source genome. Using these unique protein IDs and the original target protein FASTA file, a FASTA file of unique proteins is created.
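The last step, building a FASTA file from the unique protein IDs, might look like the sketch below. It assumes that a record's identifier is the whole header line after the '>' character, and the class and method names are hypothetical.

```java
import java.util.Set;

// Hypothetical helper: keeps only the FASTA records named in uniqueIds.
final class UniqueFastaWriter {

    static String filterFasta(String fasta, Set<String> uniqueIds) {
        StringBuilder out = new StringBuilder();
        boolean keep = false;
        for (String line : fasta.split("\\r?\\n")) {
            if (line.startsWith(">")) {
                // A header line starts a new record; decide whether to keep it.
                keep = uniqueIds.contains(line.substring(1).trim());
            }
            if (keep) {
                // Copy the header and every sequence line of a kept record.
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }
}
```

Carrying the keep flag across lines means multi-line sequences are copied whole, not just the first line after the header.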

Run

Run this Workflow in the Taverna Workbench...

Option 1:

Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/1981/download?version=3

Workflow Components

Authors (2)

Titles (2)

Descriptions (4)

Dependencies (0)

Inputs (2)

Name	Description
blastFile	The URL or file path of the BLAST results in tab-delimited format. Type the path as a string (not a file location).
tfasta	FASTA file of the target proteins from which to extract the sequences. Add as a file location.

Processors (3)

Name	Type	Description
Read_Text_File	localworker	Script BufferedReader getReader (String fileUrl) throws IOException { InputStreamReader reader; try { reader = new FileReader(fileUrl); } catch (FileNotFoundException e) { /* try a real URL instead */ URL url = new URL(fileUrl); reader = new InputStreamReader (url.openStream()); } return new BufferedReader(reader); } StringBuffer sb = new StringBuffer(4000); BufferedReader in = getReader(fileurl); String str; String lineEnding = System.getProperty("line.separator"); while ((str = in.readLine()) != null) { sb.append(str); sb.append(lineEnding); } in.close(); filecontents = sb.toString();
extract_blast_ids	beanshell	Script //import tab-delimited blast results //split the input on new lines String[] input = xml_result.split(System.getProperty("line.separator")); ArrayList gi_lines = new ArrayList(); //for each line in the BLAST file for (int i=0; i
find_unique_proteins	beanshell	Script //import the target fasta String[] gi_array = gi_val.split("\n"); String [] target_array = tfasta_in.split("\n"); List targets = new ArrayList(); List blast_gis = new ArrayList(); String fastas = ""; String value = ""; for (int i =0; i")) { targets.add(line.substring(1,line.length())); } } for (int i=0; i
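The extract_blast_ids script is cut off in the listing above, so the following is an independent sketch of the same idea, not the author's code. It assumes the identifier of interest is the query ID in the first tab-separated column, which is consistent with the version 3 revision comment; the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: collects distinct query IDs from tab-delimited BLAST output.
final class BlastIdExtractor {

    static Set<String> queryIds(String blastTable) {
        // LinkedHashSet drops duplicates while keeping first-seen order.
        Set<String> ids = new LinkedHashSet<>();
        for (String line : blastTable.split("\\r?\\n")) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comment lines
            }
            ids.add(line.split("\t")[0].trim());
        }
        return ids;
    }
}
```

Splitting on "\\r?\\n" rather than the platform line separator lets the same code read files produced on either Unix or Windows.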

Beanshells (2)

Name	Description	Inputs	Outputs
extract_blast_ids		xml_result	gi_lines_val
find_unique_proteins		gi_val tfasta_in	cfasta_out

Outputs (2)

Name	Description
unique_identifiers	Unique identifiers that appear in the FASTA file but not in the BLAST file. Identifiers are separated by new lines.
blasted_identifiers	Identifiers of the target genome from the BLAST results. These identifiers are those that are similar to the source proteome.
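The relationship between the two outputs is a set difference: identifiers present in the FASTA file minus those seen in the BLAST results. A minimal sketch, assuming the identifier is the full header text after '>' and using hypothetical names:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: FASTA header IDs that do not appear among the BLAST IDs.
final class UniqueIdentifiers {

    static String uniqueIds(String fasta, Set<String> blastedIds) {
        Set<String> unique = new LinkedHashSet<>();
        for (String line : fasta.split("\\r?\\n")) {
            if (line.startsWith(">")) {
                String id = line.substring(1).trim();
                if (!blastedIds.contains(id)) {
                    unique.add(id);
                }
            }
        }
        // One identifier per line, in the order they occur in the FASTA file.
        return String.join("\n", unique);
    }
}
```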

Datalinks (6)

Source	Sink
blastFile	Read_Text_File:fileurl
Read_Text_File:filecontents	extract_blast_ids:xml_result
extract_blast_ids:gi_lines_val	find_unique_proteins:gi_val
tfasta	find_unique_proteins:tfasta_in
find_unique_proteins:cfasta_out	unique_identifiers
extract_blast_ids:gi_lines_val	blasted_identifiers

Coordinations (1)

Controller	Target
Read_Text_File	extract_blast_ids

Workflow Type

Taverna 2

Uploader

Morgan Taschuk

License

All versions of this Workflow are licensed under:

Version 3 (of 4)

Credits (2)

(People/Groups)

Attributions (1)

(Workflows/Files)

Parse unique proteins from Blast file

Shared with Groups (1)

A Team

Statistics

2889 viewings

2001 downloads

Version History

In chronological order:

Extract unique proteins from blast results
Created by Morgan Taschuk on Thursday 24 March 2011 19:49:43 (UTC)

Find Unique Proteins from BLAST and FASTA
Created by Morgan Taschuk on Friday 25 March 2011 20:03:43 (UTC)
Last edited by Morgan Taschuk on Tuesday 29 March 2011 16:44:51 (UTC)
Revision comment: This one actually does what it says on the box!

Extract unique proteins from blast results
Created by Morgan Taschuk on Tuesday 29 March 2011 18:07:05 (UTC)
Revision comment: Works on query instead of subject! Splits lines properly!

Extract unique proteins from blast results
Created by Morgan Taschuk on Friday 01 April 2011 12:26:23 (UTC)
Revision comment: Final version