Extract unique proteins from blast results
Created: 2011-03-24 19:49:43
Last updated: 2011-04-01 12:26:27
The workflow parses the tab-delimited BLAST results to determine the unique proteins found in the target genome that have no similarity to the source genome. Using these unique protein IDs and the original target protein FASTA file, a FASTA file of unique proteins is created.
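As an illustration of the last step described above, here is a minimal sketch of filtering a FASTA file by a set of BLAST hit IDs. It is not the workflow's own script. The class and method names are hypothetical, and it assumes that a record's ID is the header text after ">" up to the first whitespace.

```java
import java.util.Set;

// Hypothetical helper, not part of the workflow.
final class UniqueFastaSketch {
    // Keeps only the FASTA records whose header ID is NOT in blastIds.
    // Assumption: the ID is the header text after ">" up to the first whitespace.
    static String filterFasta(String fasta, Set<String> blastIds) {
        StringBuilder out = new StringBuilder();
        boolean keep = false; // lines before the first header are dropped
        for (String line : fasta.split("\\R")) {
            if (line.startsWith(">")) {
                String id = line.substring(1).trim().split("\\s+")[0];
                keep = !blastIds.contains(id);
            }
            if (keep && !line.isEmpty()) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }
}
```

Calling filterFasta with the target FASTA text and the set of IDs that had BLAST hits returns only the records without a hit.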
This workflow allows you to configure a BioMart query to fetch the sequences you want from Ensembl. These sequences are retrieved and a BLAST database of them is created (by default, in the directory you ran Taverna from).
Warning: This workflow assumes that you have blastall and formatdb installed on the machine, and that by default both are found or linked in /usr/local/bin. It also assumes that you have write permission to the directory you ran Taverna from. The beanshells "create_blastall_cmdArgs" and "create_formatdb_cmdArgs" are what you need to edit if the default locations are not appropriate for you.
Shortcomings:
The names of all the files created and used are hard-coded in this workflow. This means that if you run this workflow more than once without editing anything, you will overwrite files you have previously created.
The workflow does not yet delete the files it creates in the working directory. Ideally there would be an option that lets the user choose whether the files are kept or deleted after use.
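One possible way to address both shortcomings is sketched below, under stated assumptions. It is not part of the workflow: the class, the method names, and the keepFiles flag are hypothetical. It relies on the standard java.nio.file.Files.createTempFile call to produce a uniquely named file in a chosen directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of the workflow.
final class OutputFileSketch {
    // Creates an empty, uniquely named file in dir so that earlier runs are not overwritten.
    static Path newOutputFile(Path dir, String prefix, String suffix) {
        try {
            return Files.createTempFile(dir, prefix, suffix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes the given files unless the caller asked to keep them.
    static void cleanUp(List<Path> files, boolean keepFiles) {
        if (keepFiles) {
            return;
        }
        for (Path file : files) {
            try {
                Files.deleteIfExists(file);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

A run would request each output file through newOutputFile, collect the returned paths, and call cleanUp at the end with the user's choice.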
The workflow outputs a list of proteins encoded by the target genomes that do not have sequence similarity to those encoded by the source genome.
Workflow Components
Authors (3)
A-Team |
nclteamc |
Bela Tiwari |
Titles (2)
Extract unique proteins from blast results |
fetchEnsemblSeqsAndBlast |
Dependencies (0)
Inputs (2)
Name |
Description |
blastFile |
The URL or file path location of the tab-delimited format of the BLAST results. Type the path as a string (not a file location).
Inputs the XML format of the BLAST results.
|
tfasta |
FASTA file of the target proteins from which to extract the sequences. Add as a file location.
|
Processors (3)
Name |
Type |
Description |
Read_Text_File |
localworker |
Script
BufferedReader getReader (String fileUrl) throws IOException {
    InputStreamReader reader;
    try {
        reader = new FileReader(fileUrl);
    }
    catch (FileNotFoundException e) {
        // try a real URL instead
        URL url = new URL(fileUrl);
        reader = new InputStreamReader (url.openStream());
    }
    return new BufferedReader(reader);
}
StringBuffer sb = new StringBuffer(4000);
BufferedReader in = getReader(fileurl);
String str;
String lineEnding = System.getProperty("line.separator");
while ((str = in.readLine()) != null) {
    sb.append(str);
    sb.append(lineEnding);
}
in.close();
filecontents = sb.toString();
|
extract_blast_ids |
beanshell |
Script
//import tab-delimited blast results
//split the input on new lines
String[] input = xml_result.split(System.getProperty("line.separator"));
ArrayList gi_lines = new ArrayList();
//for each line in the BLAST file
for (int i=0; i |
find_unique_proteins |
beanshell |
Script
//import the blast target ids
String[] gi_array = gi_val.split("\n");
//import the target fasta
String [] target_array = tfasta_in.split("\n");
//make necessary variables
List targets = new ArrayList();
List blast_gis = new ArrayList();
String fastas = "";
String value = "";
//iterate through the entries in the target fasta and add all the ids
//to a List without the starting ">"
for (int i =0; i"))
{
targets.add(line.substring(1,line.length()));
}
}
//iterate through the blast target ids and add them to a List
for (int i=0; i |
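The extract_blast_ids step reads identifiers out of tab-delimited BLAST lines. The following is a minimal sketch of that idea, not a reconstruction of the script above. The class and method names are hypothetical, and the column that holds the target protein ID is passed in as a parameter because it depends on how BLAST was run.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of the workflow.
final class BlastIdSketch {
    // Collects the distinct values of one column from tab-delimited BLAST output,
    // in order of first appearance.
    // Assumption: idColumn is the zero-based index of the column holding the target protein ID.
    static Set<String> extractIds(String blastTable, int idColumn) {
        Set<String> ids = new LinkedHashSet<>();
        for (String line : blastTable.split("\\R")) {
            // Skip blank lines and comment lines.
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split("\t");
            if (idColumn < fields.length) {
                ids.add(fields[idColumn].trim());
            }
        }
        return ids;
    }
}
```

Joining the returned set with new lines gives one identifier per line, the form in which find_unique_proteins splits its gi_val input.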
Beanshells (2)
Name |
Description |
Inputs |
Outputs |
extract_blast_ids |
|
xml_result
|
gi_lines_val
|
find_unique_proteins |
|
gi_val
tfasta_in
|
cfasta_out
|
Outputs (2)
Name |
Description |
unique_identifiers |
Unique identifiers that appear in the FASTA file but not in the BLAST file. Identifiers are separated by new lines.
|
blasted_identifiers |
Identifiers of the target genome from the BLAST results. These identifiers are those that are similar to the source proteome.
|
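The relationship between the two outputs amounts to a set difference: the unique identifiers are the FASTA identifiers minus the blasted identifiers. A minimal sketch follows, assuming both inputs have already been reduced to lists of ID strings. The class and method names are hypothetical and do not come from the workflow.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the workflow.
final class UniqueIdSketch {
    // Returns the IDs present in fastaIds but absent from blastedIds,
    // separated by new lines and kept in FASTA order.
    static String uniqueIdentifiers(List<String> fastaIds, Collection<String> blastedIds) {
        Set<String> unique = new LinkedHashSet<>(fastaIds);
        unique.removeAll(new HashSet<>(blastedIds));
        return String.join("\n", unique);
    }
}
```

A LinkedHashSet is used so that repeated identifiers are collapsed while the original order is preserved.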
Datalinks (6)
Source |
Sink |
blastFile |
Read_Text_File:fileurl |
Read_Text_File:filecontents |
extract_blast_ids:xml_result |
extract_blast_ids:gi_lines_val |
find_unique_proteins:gi_val |
tfasta |
find_unique_proteins:tfasta_in |
find_unique_proteins:cfasta_out |
unique_identifiers |
extract_blast_ids:gi_lines_val |
blasted_identifiers |
Coordinations (1)
Controller |
Target |
Read_Text_File |
extract_blast_ids |
Uploader
License
All versions of this Workflow are
licensed under:
Version 4 (latest)
(of 4)
Credits (2)
(People/Groups)
Attributions (1)
(Workflows/Files)
Shared with Groups (1)