Primary Analysis

The "primary analysis" workflow is the standard iCLIP analysis pipeline. This document describes each step in the pipeline (a step is called a module in Nextflow), along with the inputs it takes and the outputs you can expect from it.

1. FASTQC

Input: demultiplexed reads, Output: FastQC report.

This step runs the program FastQC on your demultiplexed reads file. It's useful to check the HTML report to see how many reads you have in your file and their general sequencing quality.

Read more about FastQC and how to interpret the report in this helpful article from Michigan State University.

2. TRIMGALORE

Input: demultiplexed reads, Output: trimmed reads, Trim Galore! report

This step runs the program Trim Galore! The purpose is to remove any remaining Illumina sequencing adapter and any low-quality bases from the 3' end of reads. The default settings are mostly used, meaning that bases at the 3' end of reads are trimmed if they have a Phred score < 20. The first 13 bp of the Illumina adapter, 'AGATCGGAAGAGC', are searched for at the 3' end of reads and are trimmed even if only 1 base overlaps. An error rate of 0.1 is allowed, so a sequence that matches the adapter with up to 10% errors is still trimmed. Reads that are shorter than 10 bases after trimming are removed.
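The three trimming rules described above can be sketched in a few lines of Python. This is a simplified illustration, not the code Trim Galore! runs: the function names are hypothetical, quality trimming here simply strips trailing bases below the cutoff, and adapter matching counts mismatches only (no insertions or deletions), whereas the real tool delegates to Cutadapt, which uses more refined algorithms.

```python
ADAPTER = "AGATCGGAAGAGC"  # first 13 bp of the Illumina adapter quoted above


def quality_trim(seq, quals, cutoff=20):
    """Simplified: drop trailing bases while their Phred score is below cutoff."""
    end = len(seq)
    while end > 0 and quals[end - 1] < cutoff:
        end -= 1
    return seq[:end], quals[:end]


def adapter_trim(seq, adapter=ADAPTER, min_overlap=1, max_error_rate=0.1):
    """Remove the leftmost adapter match (mismatches only) and everything after it."""
    for start in range(len(seq)):
        overlap = min(len(adapter), len(seq) - start)
        if overlap < min_overlap:
            break
        mismatches = sum(a != b for a, b in zip(seq[start:start + overlap], adapter))
        # Allowed errors scale with the overlap length (0.1 * 13 bp -> 1 error).
        if mismatches <= int(max_error_rate * overlap):
            return seq[:start]
    return seq


def trim_read(seq, quals, min_length=10):
    """Quality-trim, adapter-trim, then discard reads shorter than min_length."""
    seq, quals = quality_trim(seq, quals)
    seq = adapter_trim(seq)
    return seq if len(seq) >= min_length else None


# A read ending in a partial adapter keeps only the bases before it.
print(trim_read("C" * 15 + "AGATCGGAAG", [30] * 25))  # CCCCCCCCCCCCCCC
# A read that is too short after trimming is dropped.
print(trim_read("ACGT" + "AGATCGG", [30] * 11))  # None
```

Note that with a minimum overlap of 1 and no errors allowed at that length, a read that simply ends in 'A' loses that final base, which is the price of trimming aggressively.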

3. BOWTIE_ALIGN

Your demultiplexed reads are aligned against an index of rRNA and mature tRNA using the Bowtie aligner. The purpose of this mapping is twofold: 1) It prevents rRNA- and tRNA-derived reads from contaminating your genomic alignment. tRNA genes, for example, can reside within protein-coding genes, and they are not annotated in most primary genomic annotations, so if we do not filter them out we may mistakenly attribute these reads to mRNA. 2) You might discover that a large proportion of your reads map to tRNA and/or rRNA, which could lead you to reanalyse your data with a more specialised pipeline to quantify this in more detail.

We use the parameters "-v 2 -m 100 --norc --best --strata". Let's break down these parameters:

| Parameter | Meaning |
| --- | --- |
| -v 2 | Allow a maximum of 2 mismatches in valid alignments. |
| -m 100 | Only report alignments for a read if it has at most 100 possible valid alignments; reads with more than 100 are suppressed. |
| --norc | Do not attempt to map to the reverse-complement strand of the index sequences (this is because we are providing transcripts). |
| --best --strata | In the alignment file (SAM/BAM), only report the best possible alignments, e.g. if there is 1 alignment with 2 mismatches and 3 alignments with 1 mismatch, the alignment with 2 mismatches will not be reported. |
| --un {sample}_unaligned.fq | Store the unaligned reads; we want to use these to map to the genome. |
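To make the interaction between -v, --best --strata and -m concrete, here is a minimal Python sketch of the filtering logic as described in this section. It is an illustration under assumptions, not Bowtie's implementation: the function name is hypothetical, each candidate alignment is represented only by its mismatch count, and the -m limit is assumed to apply to the alignments left after the best stratum has been selected.

```python
def reportable_alignments(mismatch_counts, max_mismatches=2, max_multi=100):
    """Return the mismatch counts of alignments that would be eligible for reporting.

    mismatch_counts: one integer per candidate alignment of a single read.
    An empty list means nothing is reported for this read.
    """
    # -v 2: an alignment is valid only if it has at most 2 mismatches.
    valid = [n for n in mismatch_counts if n <= max_mismatches]
    if not valid:
        return []

    # --best --strata: keep only the stratum with the fewest mismatches.
    best = min(valid)
    stratum = [n for n in valid if n == best]

    # -m 100: suppress the read entirely if too many alignments remain.
    if len(stratum) > max_multi:
        return []
    return stratum


# The example from the table: one 2-mismatch and three 1-mismatch alignments.
print(reportable_alignments([2, 1, 1, 1]))  # [1, 1, 1]
```

How many of the eligible alignments Bowtie actually writes for each read depends on further reporting options that are not covered here.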

4. STAR_ALIGN

| Parameter | Meaning |
| --- | --- |

