cutadapt+BWA-MEM-mis1-i2,2-gapex2,2+filter_c5.0_c4.15+idxStats

Created: 2020-06-01 22:02:22 Last updated: 2020-06-01 22:19:49

This workflow genotypes trinucleotide repeats from sequencing reads that span the repeat and its flanks.

A detailed description follows:

1. Input FASTQ sequencing read files (Galaxy tool: Input dataset)

Input files for the workflow are single-end MiSeq reads (R1 produced using the protocol from Ciosi et al. 2018) or PacBio reads of insert (ROI), in FASTQ format.

2. Removal of the Illumina sequencing adaptor sequence from the 3’ end of the MiSeq reads (Galaxy tool: Cutadapt 1.16)

Cutadapt removes the sequencing adaptor sequence (custom 3' adapter sequence used: GATCGGAAGAGCACACGTCTGAACTCCAGTCAC) from the 3’ end of the forward reads (R1 produced using the protocol from Ciosi et al. 2018), allowing a maximum error rate of 0.39.
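For readers who want to reproduce this step outside Galaxy, the trimming can be sketched as a command line. This is a minimal sketch that assumes a standalone cutadapt installation; the function name and file names are hypothetical, and only the `-a` (3' adapter), `-e` (maximum error rate) and `-o` (output file) options are used.

```python
import shlex

# 3' adapter sequence quoted in the step description
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def cutadapt_command(fastq_in, fastq_out, adapter=ADAPTER, max_error_rate=0.39):
    """Build a cutadapt argument list for 3' adapter trimming of single-end reads.

    The function and file names are illustrative, not part of the workflow.
    """
    return [
        "cutadapt",
        "-a", adapter,               # adapter expected at the 3' end of R1
        "-e", str(max_error_rate),   # maximum allowed error rate
        "-o", fastq_out,             # trimmed reads are written here
        fastq_in,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(cutadapt_command("reads_R1.fastq", "reads_R1.trimmed.fastq")))
```

The argument list can be passed to `subprocess.run` once the paths point at real files.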

2’. QC of sequencing adaptor removal (Galaxy tool: FastQC version 0.71)

FastQC is run on the trimmed reads to check that the sequencing adaptor was removed as expected.

3. Input the fasta file containing the synthetic reference sequence(s) for BWA-MEM alignment

Multiple synthetic reference sequences containing different numbers of repeats are typically used (for example, variable numbers of CAG and/or CCG repeats to genotype the HTT exon 1 trinucleotide repeat).
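One way to build such a reference file is to place a varying number of repeat units between fixed flanking sequences. The sketch below is an illustration only: the function names, the naming scheme and the flank strings are hypothetical placeholders, and real flanks would come from the locus being genotyped.

```python
def synthetic_references(flank5, flank3, repeat_unit="CAG", min_repeats=1, max_repeats=5):
    """Return {name: sequence}, one reference per repeat count (illustrative helper)."""
    refs = {}
    for n in range(min_repeats, max_repeats + 1):
        # Name each reference after its repeat unit and repeat count, e.g. "CAG20"
        refs[f"{repeat_unit}{n}"] = flank5 + repeat_unit * n + flank3
    return refs


def to_fasta(refs, width=60):
    """Format a {name: sequence} mapping as FASTA text with wrapped sequence lines."""
    lines = []
    for name, seq in refs.items():
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder flanks, NOT the real HTT flanking sequences.
    refs = synthetic_references("ACGTACGT", "TGCATGCA", min_repeats=18, max_repeats=20)
    print(to_fasta(refs), end="")
```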

3’. BWA-MEM aligns reads to one or more reference sequences (Galaxy tool: Map with BWA-MEM version 0.7.17.1)

For MiSeq reads, BWA-MEM parameters are kept at their defaults except for the following, which are set so that gap-related costs are higher than mismatch-related costs:

Penalty for mismatch: 1

Gap open penalties for deletions and insertions: 2,2

Gap extension penalties: 2,2
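Outside Galaxy, these three settings correspond to the `-B`, `-O` and `-E` options of `bwa mem`. The sketch below only assembles the command line; the function name and file names are hypothetical, and it assumes the reference FASTA has already been indexed with `bwa index`.

```python
import shlex


def bwa_mem_command(reference_fasta, fastq, mismatch=1, gap_open=(2, 2), gap_ext=(2, 2)):
    """Build a `bwa mem` argument list with the non-default scoring listed above."""
    return [
        "bwa", "mem",
        "-B", str(mismatch),                 # mismatch penalty
        "-O", ",".join(map(str, gap_open)),  # gap open penalties: deletions,insertions
        "-E", ",".join(map(str, gap_ext)),   # gap extension penalties: deletions,insertions
        reference_fasta,
        fastq,
    ]


if __name__ == "__main__":
    # Print the command; SAM output would normally be redirected to a file.
    print(shlex.join(bwa_mem_command("references.fasta", "reads_R1.trimmed.fastq")))
```

With these values, opening a gap of any length costs more than a single mismatch, which matches the stated intent of penalising gaps more heavily than mismatches.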

4. BAM converted to SAM (Galaxy tool: BAM-to-SAM version 2.0.1)

Alignment output BAM files are converted to SAM format using BAM-to-SAM.

5. Reads are filtered out from the SAM file (Galaxy tool: Filter version 1.1.0)

Two classes of reads are filtered out of the SAM file: reads with a MAPQ score of 0 (obtained for reads that align equally well to more than one reference sequence) and reads whose alignment does not start in the sequence flanking the CAG repeat on the 5’ side. The second criterion is probably only relevant if the reads were generated by PCR and the 5’ end of the reference sequence corresponds to the PCR primer used to amplify the sequenced locus. The filtering criteria used were c5>0 and c4<15 for HTT repeat reads generated with the forward PCR primer 31329 (5’-ATGAAGGCCTTCGAGTCCCTCAAGTCCTTC-3’), and c5>0 and c4<130 for HTT repeat reads generated with the forward PCR primer MS1F (5’-GCCCAGAGCCCCATTCATTG-3’), where c5 is the MAPQ column and c4 is the start position of the alignment relative to the reference sequence. Number of header lines to skip = (number of reference sequences considered for the alignment) + 2.
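The filtering rule can be sketched in plain Python as follows. This is an illustration of the c5>0 and c4<15 condition rather than the Galaxy Filter tool itself; the function name is hypothetical, and the skipped header lines are assumed to be passed through unchanged.

```python
def filter_sam(lines, n_references, max_start=15):
    """Yield header lines plus alignments with MAPQ > 0 and POS < max_start.

    Illustrative helper: columns are 1-based in the workflow text, so
    c4 (POS) is fields[3] and c5 (MAPQ) is fields[4].
    """
    n_header = n_references + 2  # header lines to skip, as stated in the description
    for i, line in enumerate(lines):
        if i < n_header:
            yield line  # assumption: header lines are kept as they are
            continue
        fields = line.rstrip("\n").split("\t")
        pos, mapq = int(fields[3]), int(fields[4])
        if mapq > 0 and pos < max_start:
            yield line


if __name__ == "__main__":
    # Tiny made-up SAM example with two references (so 2 + 2 = 4 header lines).
    sam = [
        "@HD\tVN:1.6\n",
        "@SQ\tSN:CAG20\tLN:200\n",
        "@SQ\tSN:CAG21\tLN:203\n",
        "@PG\tID:bwa\n",
        "r1\t0\tCAG20\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n",   # kept
        "r2\t0\tCAG20\t1\t0\t4M\t*\t0\t0\tACGT\tIIII\n",    # dropped: MAPQ is 0
        "r3\t0\tCAG21\t40\t60\t4M\t*\t0\t0\tACGT\tIIII\n",  # dropped: starts too late
    ]
    for kept in filter_sam(sam, n_references=2):
        print(kept, end="")
```

For reads amplified with primer MS1F, the same function would be called with `max_start=130`.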

5’. QC of read filtering from SAM (Galaxy tool: FastQC version 0.72)

FastQC is run on the filtered SAM file to check that the read filtering worked as expected.

6. Generate an alignment report table from the SAM file (Galaxy tool: IdxStats version 2.0.1)

IdxStats is run on each alignment SAM file to generate an alignment report table.

7. The IdxStats alignment report table is reduced to the columns that contain the reference sequence identifier and the number of reads aligned to each of these reference sequences (Galaxy tool: Cut columns from a table version 1.0.2).

Keeps columns 1 and 3 of the initial IdxStats output and discards the others.
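For reference, samtools idxstats output is tab-delimited with four columns: reference name, sequence length, number of mapped reads and number of unmapped reads. A minimal sketch of reducing that table to the reference name and mapped-read count (columns 1 and 3) is shown below; the function name and the example counts are made up for illustration.

```python
def reads_per_reference(idxstats_text):
    """Return [(reference_name, mapped_reads)] from idxstats-style text.

    Illustrative helper that keeps columns 1 and 3 of each row.
    """
    rows = []
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, _length, mapped, _unmapped = line.split("\t")
        rows.append((name, int(mapped)))
    return rows


if __name__ == "__main__":
    # Made-up counts; the final "*" row is how idxstats reports unplaced reads.
    example = "CAG20\t200\t1500\t0\nCAG21\t203\t120\t0\n*\t0\t0\t35\n"
    for name, mapped in reads_per_reference(example):
        print(f"{name}\t{mapped}")
```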

Visible output files produced by the workflow:

two FastQC read quality reports (txt and html) post-cutadapt (output of step 2’)

a SAM file of aligned reads for visualising repeat genotypes (output of step 4)

one FastQC read quality report (txt) post-filtering of SAM file (output of step 5’)

two tab-delimited files with the number of reads aligned to each synthetic reference sequence considered for the alignment (outputs of steps 6 and 7)

Workflow Components

Inputs (15)

Name	Description
Reference file
library	runtime parameter for tool Cutadapt
reference_source	runtime parameter for tool Map with BWA-MEM
fastq_input	runtime parameter for tool Map with BWA-MEM
contaminants	runtime parameter for tool FastQC
limits	runtime parameter for tool FastQC
input_file	runtime parameter for tool FastQC
input1	runtime parameter for tool BAM-to-SAM
input	runtime parameter for tool Filter
limits	runtime parameter for tool FastQC
input_file	runtime parameter for tool FastQC
contaminants	runtime parameter for tool FastQC
adapters	runtime parameter for tool FastQC
input	runtime parameter for tool IdxStats
input	runtime parameter for tool Cut

Steps (10)

Name	Tool	Description
Input dataset	None
Input dataset	None
Cutadapt	toolshed.g2.bx.psu.edu/repos/lparsons/cutadapt/cutadapt/1.16
Map with BWA-MEM	toolshed.g2.bx.psu.edu/repos/devteam/bwa/bwa_mem/0.7.17.1
FastQC	toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.71
BAM-to-SAM	toolshed.g2.bx.psu.edu/repos/devteam/bam_to_sam/bam_to_sam/2.0.1
Filter	Filter1
FastQC	toolshed.g2.bx.psu.edu/repos/devteam/fastqc/fastqc/0.72
IdxStats	toolshed.g2.bx.psu.edu/repos/devteam/samtools_idxstats/samtools_idxstats/2.0.1
Cut	Cut1

Outputs (21)

Name	Type
out1	input
out2	input
report	txt
info_file	input
rest_output	input
wild_output	input
untrimmed_output	input
untrimmed_paired_output	input
too_short_output	input
too_short_paired_output	input
too_long_output	input
too_long_paired_output	input
bam_output	bam
html_file	html
text_file	txt
output1	sam
out_file1	input
html_file	html
text_file	txt
output	tabular
out_file1	tabular