get promoter region + operate on genomic intervals

Created: 2013-11-04 19:28:07

Finds the overlap between two datasets that contain genomic information (e.g. gene ID, chromosome name, gene start, gene end) and computes some statistics. Returns the rows of the first file that overlap with the second file. A Kolmogorov-Smirnov test is then applied to the adjusted p-values of the genes that overlap versus those that do not. NOTE: The R package GenomicRanges, loaded with library(GenomicRanges), is a prerequisite for this workflow.
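
The overlap step described above can be sketched in base R without GenomicRanges. This is only an illustration of the idea, not the workflow's own code: the data frames, the column names (gene, chr, start, end) and the helper has_overlap are hypothetical, and coordinates are assumed to be closed (inclusive) intervals.

```r
# Toy promoter regions and epigenetic features (hypothetical data).
promoters <- data.frame(gene  = c("g1", "g2", "g3"),
                        chr   = c("1", "1", "2"),
                        start = c(100, 500, 100),
                        end   = c(300, 700, 300),
                        stringsAsFactors = FALSE)
features <- data.frame(chr   = c("1", "2"),
                       start = c(250, 1000),
                       end   = c(400, 1200),
                       stringsAsFactors = FALSE)

# For each region, TRUE if it shares at least `min_overlap` bases with
# any feature on the same chromosome (hypothetical helper).
has_overlap <- function(regions, features, min_overlap = 1) {
  vapply(seq_len(nrow(regions)), function(i) {
    same_chr <- features$chr == regions$chr[i]
    # Number of shared positions; zero or negative means no overlap.
    shared <- pmin(regions$end[i], features$end[same_chr]) -
              pmax(regions$start[i], features$start[same_chr]) + 1
    any(shared >= min_overlap)
  }, logical(1))
}

hit <- has_overlap(promoters, features, min_overlap = 10)
overlapping_genes     <- promoters$gene[hit]
non_overlapping_genes <- promoters$gene[!hit]
```

The min_overlap argument plays the role of the workflow's overlap input. For real data sets, findOverlaps from GenomicRanges is the better choice because it scales to many intervals.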

Run

Run this Workflow in the Taverna Workbench...

Option 1:

Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/3886/download?version=1

Workflow Components

Dependencies (0)

Inputs (11)

Name	Description
workingdir	working directory
biomartfile	file with genomic information for the genes
epigen_file	file with epigenetic information
epigen_file_end	column holding the end coordinate of each row in the epigenetic file
epigen_file_start	column holding the start coordinate of each row in the epigenetic file
epigen_file_chr	column holding the chromosome name in the epigenetic file
gene_chr	column holding the chromosome name in the gene file
upstream	number of base pairs upstream of the transcription start site (TSS) to include in the promoter region
downstream	number of base pairs downstream of the transcription start site (TSS) to include in the promoter region
overlap	minimum overlap (in base pairs) required for two regions to count as overlapping
gene_expr_file	file with the gene expression values

Processors (6)

Name	Type	Description
read_files	rshell	Reads the epigenetic file and the gene file that was mapped onto the genome.
Script:
    setwd(dir)
    epigen_file <- read.table(epigenetic_file, header=TRUE, sep="\t")
    # drop the "chr" prefix from column 2 (assumed to hold the chromosome name)
    epigen_file[,2] <- gsub("chr", "", epigen_file[,2])
    genes <- read.table(biomart, header=TRUE, sep="\t")
    #genes_pvalues<-read.table(gene_pvals,header=TRUE) #sep="\t"
    print("filesread")
R Server: localhost:6311
transform_files_to_gen_ranges	rshell	Transforms both files into genomic ranges using the R package GenomicRanges, so that overlapping regions can be computed.
Script:
    library(GenomicRanges)
    # transform each file into GRanges format
    #t1 <- GRanges(seqnames = genes[,gene_chr],
    #  ranges = IRanges(genes[,gene_start], genes[,gene_end] , names = genes$entrezgene))
    t1 <- GRanges(seqnames = genes$chromosome_name,
                  ranges = IRanges(genes$promoter_start, genes$promoter_end, names = genes$entrezgene))
    t2 <- GRanges(seqnames = epigen_file[,inter_chr],
                  ranges = IRanges(epigen_file[,inter_start], epigen_file[,inter_end]))
    print("t1_t2_done")
    #t2 <- GRanges(seqnames = cpg$chrom,
    #  ranges = IRanges(cpg$chromStart, cpg$chromEnd))
R Server: localhost:6311
calculate_overlaps	rshell	Computes the overlaps between the two genomic files and writes the expression values of the overlapping and non-overlapping genes to two files.
Script:
    ov1 <- findOverlaps(t1, t2, minoverlap=overlap)
    print("ovdone")
    genes_pvalues <- read.table(gene_expr_file, header=TRUE, sep="\t")
    # row indices of the genes (t1) that overlap at least one region of t2
    hits <- as.matrix(ov1)[,1]
    idx <- match(genes$entrezgene[hits], genes_pvalues$gene_id)
    overlapping_genes <- genes$entrezgene[hits]
    # logical indexing: unlike genes$entrezgene[-hits], this still returns all genes when there are no hits
    non_overlapping_genes <- genes$entrezgene[!(seq_along(genes$entrezgene) %in% hits)]
    list_non_overlapping <- unique(non_overlapping_genes)
    list_overlapping <- unique(overlapping_genes)
    # drop genes that also appear in the overlapping list
    idx_com <- match(list_non_overlapping, list_overlapping)
    idx_non_overlapping_without_overlapping_idx <- which(is.na(idx_com))
    list_non_overlapping_new <- list_non_overlapping[idx_non_overlapping_without_overlapping_idx]
    idx_overlapping <- match(list_overlapping, genes_pvalues$gene_id)
    overlapping_genes_pvals <- genes_pvalues[idx_overlapping,]
    overlapping_genes_pvals_out <- paste("overlapping_genes", Sys.time(), ".txt", sep="")
    idx_non_overlapping <- match(list_non_overlapping_new, genes_pvalues$gene_id)
    non_overlapping_genes_pvals <- genes_pvalues[idx_non_overlapping,]
    non_overlapping_genes_pvals_out <- paste("non_overlapping_genes", Sys.time(), ".txt", sep="")
    write.table(overlapping_genes_pvals, overlapping_genes_pvals_out, col.names=TRUE, row.names=FALSE, sep="\t")
    write.table(non_overlapping_genes_pvals, non_overlapping_genes_pvals_out, col.names=TRUE, row.names=FALSE, sep="\t")
R Server: localhost:6311
calculate_promoter_region	rshell	For each gene, computes a promoter region around the transcription start site (TSS) according to the prespecified values of the variables upstream and downstream. The direction in which a gene is transcribed is taken into account through the strand column (column 3 of the input file).
Script:
    #a<-unique(biomart$entrezgene)
    setwd(workdir)
    biomart <- read.table(genes, header=TRUE, sep="\t")
    mat <- matrix(NA, nrow=dim(biomart)[1], ncol=10)
    colnames(mat) <- c(colnames(biomart), "promoter_start", "promoter_end")
    for (i in 1:dim(biomart)[1]) {
      # forward strand: the TSS is the transcript start
      if (biomart[i,3] > 0) {
        promoter_start <- as.numeric(biomart$transcript_start[i]) - upstream
        promoter_end <- as.numeric(biomart$transcript_start[i]) + downstream
        mat[i,] <- c(biomart[i,1], as.character(biomart[i,2]), biomart[i,3], as.character(biomart[i,4]),
                     biomart[i,5], biomart[i,6], biomart[i,7], biomart[i,8], promoter_start, promoter_end)
      }
      # reverse strand: the TSS is the transcript end; the two values are swapped when stored so that start < end
      if (biomart[i,3] < 0) {
        promoter_start <- as.numeric(biomart$transcript_end[i]) + upstream
        promoter_end <- as.numeric(biomart$transcript_end[i]) - downstream
        mat[i,] <- c(biomart[i,1], as.character(biomart[i,2]), biomart[i,3], as.character(biomart[i,4]),
                     biomart[i,5], biomart[i,6], biomart[i,7], biomart[i,8], promoter_end, promoter_start)
      }
    }
    output_file <- "mapped_file_promoter_region.txt"
    write.table(mat, output_file, col.names=TRUE, row.names=FALSE, sep="\t", quote=FALSE)
R Server: localhost:6311
ecdf_plot	rshell	Plots the empirical cumulative distribution function of the adjusted p-values of the two groups of genes (those that overlap and those that do not). Overlapping genes are drawn in black and non-overlapping genes in red.
Script:
    png(filename=cn, height=400, width=400, bg="white")
    plot(ecdf(as.numeric(overlapping_genes_pvals[,4])), xlab="adjusted p-value", main="",
         ylab="proportion of genes <= x", col.main="black")
    lines(ecdf(as.numeric(non_overlapping_genes_pvals[,4])), col="red")
    # one line type per legend entry
    legend("topleft", inset=.05, c("overlapping genes","nonOverlapping genes"),
           lwd=2, lty=c(1, 1), col=c("black","red"))
    dev.off()
R Server: localhost:6311
ks.test	rshell	Using the adjusted p-values of the two groups of genes again (overlapping and non-overlapping), performs a Kolmogorov-Smirnov test for differences between the two distributions. The p-value of the test and the distance D between the two distributions are exported.
Script:
    #ks.test(cpg_genes_pvals[,4],no_cpg_genes_pvals[,4])
    ks_test <- ks.test(overlapping_genes_pvals[,4], non_overlapping_genes_pvals[,4])
    #cn_ks_p<-ks.test(as.numeric(brain_cpgs_cn[,4]), as.numeric(brain_no_cpgs_cn[,4]))$D
    # after unlist(), the first element is the statistic D and the second is the p-value
    ks_list <- unlist(ks_test)
    ks_D <- ks_list[[1]]
    ks_p <- ks_list[[2]]
R Server: localhost:6311
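
The strand-aware promoter calculation in calculate_promoter_region can also be written without a loop. The following is a minimal sketch, not the workflow's own script: the function name promoter_region and its arguments are hypothetical, and strand is assumed to be coded as 1 (forward) or -1 (reverse), with no zero values.

```r
# Vectorised promoter regions around the TSS (hypothetical helper).
# Forward strand: the TSS is the transcript start.
# Reverse strand: the TSS is the transcript end, and "upstream" points
# toward larger coordinates.
promoter_region <- function(tx_start, tx_end, strand, upstream, downstream) {
  forward <- strand > 0
  tss <- ifelse(forward, tx_start, tx_end)
  data.frame(
    promoter_start = ifelse(forward, tss - upstream, tss - downstream),
    promoter_end   = ifelse(forward, tss + downstream, tss + upstream)
  )
}

# Two toy transcripts: one on each strand.
regions <- promoter_region(tx_start = c(1000, 5000),
                           tx_end   = c(2000, 6000),
                           strand   = c(1, -1),
                           upstream = 500, downstream = 100)
print(regions)
```

As in the original script, promoter_start is always the smaller coordinate, which is what IRanges expects when the regions are later converted to genomic ranges.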

Beanshells (0)

Outputs (6)

Name	Description
genes_overlap	file with the genes that overlap with the epigenetic file
genes_non_overlap	file with the genes that do not overlap with the epigenetic file
ecdf_plot	empirical cumulative distribution plot of the adjusted p-values for the genes that overlap with the epigenetic file under study and for those that do not
distance	the Kolmogorov-Smirnov distance D between the two distributions (genes that overlap with the epigenetic file and genes that do not). The greater the distance, the bigger the effect of the epigenetic file on the list of genes
pvalue	the p-value of the Kolmogorov-Smirnov test between the two distributions (genes that overlap with the epigenetic file and genes that do not)
promoter_file	the input file with genomic information, extended with a promoter region for each gene
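
To make the distance and pvalue outputs concrete, here is a small sketch on simulated values. The two vectors are made-up stand-ins for the adjusted p-values of the two gene groups. The sketch reads the results from the named components of the ks.test result rather than through unlist(), which is an alternative to the workflow's approach, not a copy of it.

```r
set.seed(1)
# Simulated adjusted p-values (hypothetical data, not from the workflow).
overlapping_pvals     <- runif(50)^2   # skewed toward small values
non_overlapping_pvals <- runif(50)

ks <- ks.test(overlapping_pvals, non_overlapping_pvals)
ks_D <- unname(ks$statistic)   # distance D
ks_p <- ks$p.value             # p-value of the test

# D is the largest vertical gap between the two empirical CDFs,
# i.e. between the two curves drawn by the ecdf_plot component.
pooled <- c(overlapping_pvals, non_overlapping_pvals)
D_manual <- max(abs(ecdf(overlapping_pvals)(pooled) -
                    ecdf(non_overlapping_pvals)(pooled)))
```

Accessing ks$statistic and ks$p.value keeps both values numeric, whereas unlist() on the test result converts every element to character.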

Datalinks (19)

Source	Sink
workingdir	read_files:dir
epigen_file	read_files:epigenetic_file
calculate_promoter_region:output_file	read_files:biomart
gene_chr	transform_files_to_gen_ranges:gene_chr
epigen_file_end	transform_files_to_gen_ranges:inter_end
epigen_file_start	transform_files_to_gen_ranges:inter_start
epigen_file_chr	transform_files_to_gen_ranges:inter_chr
overlap	calculate_overlaps:overlap
gene_expr_file	calculate_overlaps:gene_expr_file
upstream	calculate_promoter_region:upstream
downstream	calculate_promoter_region:downstream
biomartfile	calculate_promoter_region:genes
workingdir	calculate_promoter_region:workdir
calculate_overlaps:overlapping_genes_pvals_out	genes_overlap
calculate_overlaps:non_overlapping_genes_pvals_out	genes_non_overlap
ecdf_plot:cn	ecdf_plot
ks.test:ks_D	distance
ks.test:ks_p	pvalue
calculate_promoter_region:output_file	promoter_file