cutadapt+BWA-MEM-mis1-i2,2-gapex2,2+filter_c5.0_c4.15+idxStats
This workflow is designed to genotype trinucleotide repeats from sequencing reads that span the repeat and its flanks.
Detailed description as follows:
1. Input FASTQ sequencing read files (Galaxy tool: Input dataset)
Input files for the workflow are single-end MiSeq reads (R1, produced using the protocol from Ciosi et al. 2018) or PacBio reads of insert (ROI), in FASTQ format.
2. Removing the Illumina sequencing adaptor sequence from the 3’ end of the MiSeq reads (Galaxy tool: Cutadapt 1.16)
Cutadapt removes the sequencing adaptor sequence from the 3’ end of the forward reads (R1, produced using the protocol from Ciosi et al. 2018), allowing a maximum error rate of 0.39. The custom 3’ adaptor sequence used is GATCGGAAGAGCACACGTCTGAACTCCAGTCAC.
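Outside Galaxy, the trimming step could be approximated with the cutadapt command line. The sketch below only assembles such a command; the file names are hypothetical, and it assumes that the Galaxy settings correspond to cutadapt's `-a` (3' adaptor) and `-e` (maximum error rate) options.

```python
import shlex

ADAPTER_3P = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"  # 3' adaptor quoted in step 2
MAX_ERROR_RATE = 0.39                             # maximum error rate quoted in step 2


def cutadapt_command(fastq_in, fastq_out):
    """Build a cutadapt command that trims the 3' adaptor from single-end reads."""
    return [
        "cutadapt",
        "-a", ADAPTER_3P,           # adaptor ligated to the 3' end of the read
        "-e", str(MAX_ERROR_RATE),  # maximum allowed error rate in the adaptor match
        "-o", fastq_out,            # trimmed reads are written here
        fastq_in,
    ]


# Hypothetical file names, for illustration only.
cmd = cutadapt_command("sample_R1.fastq", "sample_R1.trimmed.fastq")
print(shlex.join(cmd))
```

The command is built as a list so that it can be passed to `subprocess.run` without shell quoting issues.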
2’. QC of sequencing adaptor removal (Galaxy tool: FastQC version 0.71)
FastQC is run after adaptor removal to check that the trimming worked as expected.
3. Input the fasta file containing the synthetic reference sequence(s) for BWA-MEM alignment
Multiple synthetic reference sequences containing different numbers of repeats are typically used (for example, a variable number of CAG and/or CCG repeats to genotype the HTT exon one trinucleotide repeat).
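As an illustration of what such a reference set looks like, the sketch below writes one FASTA record per combination of CAG and CCG repeat counts. The flanking sequences, the record names and the direct juxtaposition of the two repeat tracts are all placeholders invented for this example; they are not the actual HTT reference structure used by the workflow.

```python
# Placeholder flanks: NOT the real HTT flanking sequences.
FLANK_5 = "ACGTACGTAC"
FLANK_3 = "TGCATGCATG"


def make_references(cag_counts, ccg_counts):
    """Return (name, sequence) pairs, one per CAG/CCG repeat-count combination."""
    records = []
    for n_cag in cag_counts:
        for n_ccg in ccg_counts:
            name = f"ref_{n_cag}CAG_{n_ccg}CCG"  # hypothetical naming scheme
            seq = FLANK_5 + "CAG" * n_cag + "CCG" * n_ccg + FLANK_3
            records.append((name, seq))
    return records


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


references = make_references(cag_counts=range(1, 4), ccg_counts=[7, 10])
print(to_fasta(references))
```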
3’. BWA-MEM aligns reads to one or more reference sequences (Galaxy tool: Map with BWA-MEM version 0.7.17.1)
For MiSeq reads, BWA-MEM parameters are kept at their defaults except for the following, which are changed to make gap-related costs higher than mismatch-related costs:
Penalty for mismatch: 1
Gap open penalties for deletions and insertions: 2,2
Gap extension penalties: 2,2
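The effect of these settings can be checked with a little arithmetic. In BWA-MEM a gap of length k costs (gap open) + k × (gap extension), so the sketch below compares gap and mismatch costs under the values listed above. The command list at the end assumes that the Galaxy settings map onto the `bwa mem` options `-B`, `-O` and `-E`; the file names are hypothetical.

```python
MISMATCH_PENALTY = 1  # penalty for mismatch
GAP_OPEN = 2          # gap open penalty (same for deletions and insertions)
GAP_EXTEND = 2        # gap extension penalty (same for deletions and insertions)


def gap_cost(length, gap_open=GAP_OPEN, gap_extend=GAP_EXTEND):
    """Cost of one gap of the given length: open + length * extend."""
    return gap_open + length * gap_extend


def mismatch_cost(n_mismatches, penalty=MISMATCH_PENALTY):
    """Cost of n independent mismatches."""
    return n_mismatches * penalty


print(gap_cost(1))       # single-base gap
print(gap_cost(3))       # gap spanning one trinucleotide repeat unit
print(mismatch_cost(3))  # three mismatches over the same three bases

# Assumed command-line equivalent; file names are hypothetical.
bwa_cmd = ["bwa", "mem", "-B", "1", "-O", "2,2", "-E", "2,2",
           "references.fasta", "sample_R1.trimmed.fastq"]
```

With these values a single-base gap (cost 4) is already more expensive than three mismatches (cost 3).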
4. BAM converted to SAM (Galaxy tool: BAM-to-SAM version 2.0.1)
Alignment output BAM files are converted to SAM format using BAM-to-SAM.
5. Reads are filtered out from the SAM file (Galaxy tool: Filter version 1.1.0)
Two kinds of reads are filtered out of the SAM file: reads with a MAPQ score of 0 (obtained when a read aligns equally well to more than one reference sequence), and reads whose alignment does not start in the sequence flanking the CAG repeat on the 5’ side. The second criterion is probably only relevant if the reads were generated by PCR and the 5’ end of the reference sequence corresponds to the PCR primer used to amplify the sequenced locus.
Filtering criteria used (c5 is the MAPQ column; c4 is the start position of the alignment relative to the reference sequence):
c5>0 and c4<15 for HTT repeat reads generated with the forward PCR primer 31329 (5’-ATGAAGGCCTTCGAGTCCCTCAAGTCCTTC-3’)
c5>0 and c4<130 for HTT repeat reads generated with the forward PCR primer MS1F (5’-GCCCAGAGCCCCATTCATTG-3’)
Number of header lines to skip = (number of reference sequences considered for the alignment) + 2
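The filtering rule can be sketched in a few lines of Python. This is an illustration of the c5>0 and c4<15 condition, not the Galaxy Filter tool itself; it assumes that the skipped header lines are passed through to the output unchanged, and the reference names and reads in the example are invented.

```python
def filter_sam_lines(lines, max_start, n_header_lines):
    """Keep header lines plus alignments with MAPQ > 0 and start position < max_start."""
    kept = []
    for i, line in enumerate(lines):
        if i < n_header_lines:
            kept.append(line)  # assumption: header lines are passed through
            continue
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[3])   # c4: 1-based leftmost position of the alignment
        mapq = int(fields[4])  # c5: mapping quality
        if mapq > 0 and pos < max_start:
            kept.append(line)
    return kept


def sam_line(name, ref, pos, mapq):
    """Build a minimal 11-field SAM alignment line (made-up read content)."""
    return "\t".join([name, "0", ref, str(pos), str(mapq),
                      "10M", "*", "0", "0", "ACGTACGTAC", "IIIIIIIIII"])


refs = ["ref_A", "ref_B"]  # hypothetical reference names
header = (["@HD\tVN:1.3\tSO:coordinate"]
          + [f"@SQ\tSN:{r}\tLN:200" for r in refs]
          + ["@PG\tID:bwa\tPN:bwa"])
reads = [
    sam_line("read_keep", "ref_A", 1, 60),       # passes both criteria
    sam_line("read_ambiguous", "ref_A", 1, 0),   # MAPQ 0: dropped
    sam_line("read_late_start", "ref_B", 40, 60) # starts past position 15: dropped
]

kept = filter_sam_lines(header + reads, max_start=15,
                        n_header_lines=len(refs) + 2)
kept_names = [l.split("\t")[0] for l in kept[len(refs) + 2:]]
print(kept_names)
```

For reads generated with the MS1F primer, the same function would be called with `max_start=130`.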
5’. QC of read filtering from SAM (Galaxy tool: FastQC version 0.71)
FastQC is run after reads have been filtered out of the SAM file to check that the filtering worked as expected.
6. Generate an alignment report table from the SAM file (Galaxy tool: IdxStats version 2.0.1)
IdxStats is run on each alignment SAM file to generate an alignment report table.
7. Columns of the IdxStats alignment report table are removed to only keep the columns that contain the reference sequence identifier and the number of reads aligned to each of these reference sequences (Galaxy tool: Cut columns from a table version 1.0.2).
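Columns 1 and 3 of the initial IdxStats output (reference sequence name and number of mapped reads) are kept; the other columns (sequence length and number of unmapped reads) are discarded.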
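A minimal sketch of step 7 follows, assuming the standard four-column idxstats layout (reference name, sequence length, mapped reads, unmapped reads). It keeps the reference identifier and the mapped-read count for each row; the reference names and counts in the example are invented.

```python
def cut_idxstats(table_text):
    """Return (reference name, mapped read count) pairs from idxstats text."""
    rows = []
    for line in table_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        cols = line.split("\t")
        rows.append((cols[0], int(cols[2])))  # column 1: name, column 3: mapped reads
    return rows


# Invented example table; "*" is the row idxstats uses for unplaced reads.
example = (
    "ref_17CAG\t150\t120\t0\n"
    "ref_42CAG\t225\t95\t0\n"
    "*\t0\t0\t3\n"
)

counts = cut_idxstats(example)
for name, n_reads in counts:
    print(f"{name}\t{n_reads}")
```

The reference with the highest read count is then a natural candidate for the repeat genotype of the sample.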
Visible output files produced by the workflow:
two FastQC read quality reports (txt and html) post-cutadapt (output of step 2’)
a sam file of aligned reads for visualising repeat genotypes (output of step 4)
one FastQC read quality report (txt) post-filtering of SAM file (output of step 5’)
two tab-delimited files with the number of reads aligned to each synthetic reference sequence considered for the alignment (output of steps 7 and 8)