nadia-reads

Overview

Perform quality control on sequencing reads and trim adapter sequences.

This step checks the quality of sequencing reads before alignment. Reads from single-cell experiments can contain adapter sequences, as well as polyA/polyT sequences if they are derived from short RNA molecules. These sequences need to be trimmed from the reads to increase the mapping rate to the reference genome.

How it works

1. Concatenate reads across lanes

Sequencing reads for one sample may be delivered as multiple fastq files, one per lane. These files are first concatenated into a single fastq file.
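As a rough illustration of what lane concatenation amounts to, the sketch below merges two gzip-compressed lane files with cat. It is not necessarily how nadia-reads does it internally; the file names and the tiny demo reads are made up for the example. It relies on the fact that concatenated gzip files form a valid gzip stream.

```shell
set -eu
workdir=$(mktemp -d)

# Create two tiny demo lane files (hypothetical names, one read each).
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/lane1_R1.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/lane2_R1.fastq.gz"

# Concatenate the lanes in a fixed order; the same order must be used for R2.
cat "$workdir/lane1_R1.fastq.gz" "$workdir/lane2_R1.fastq.gz" \
    > "$workdir/sample_R1.fastq.gz"

# Count the lines in the merged file (4 lines per read).
merged_lines=$(gzip -dc "$workdir/sample_R1.fastq.gz" | wc -l | tr -d ' ')
echo "merged file has $merged_lines lines"
```

Because read pairs are matched by position, R1 and R2 files have to be concatenated in the same lane order, which is why the argument order matters.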

Tip

Multiple fastq files can be passed to the -r1, --read1 and -r2, --read2 arguments, but they must be listed in the same order. For example:

nadia-reads \
    -r1 lane1_R1.fastq.gz lane2_R1.fastq.gz \
    -r2 lane1_R2.fastq.gz lane2_R2.fastq.gz

Note

When multiple fastq files are given as input, a sample name must be provided with the -n, --name argument.

2. Trim sequences by Cutadapt

To trim sequences, use the --trim flag.

The sequences to trim can be specified with the -a, --adapter argument, which accepts a FASTA file containing the sequences to trim.

If no adapter file is specified, the following sequences will be used by default:

>Illumina_Universal
AGATCGGAAGAG
>PrefixNX/1
AGATGTGTATAAGAGACAG
>Trans1
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
>Trans1_rc
CTGTCTCTTATACACATCTGACGCTGCCGACGA
>Trans2
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
>Trans2_rc
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
>polyA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>polyT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>polyC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>polyG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>drop-seq
GTACTCTGCGTTGATACCACTGCTTCCGCGGACAGGC
>Nextera
CTGTCTCTTATACACATCT

In addition, Cutadapt is run with the following default parameters. See the Cutadapt documentation for more detail.

cutadapt \
    --max-n=0 \
    --minimum-length=20 \
    -q 20,20 \
    --overlap=8
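To make the defaults above concrete, here is a sketch of what a full paired-end Cutadapt call with these parameters might look like. The file names are hypothetical, and treating the adapters as 3' adapters on both reads (-a and -A with Cutadapt's file: syntax) is an assumption for illustration, not a statement of how nadia-reads invokes Cutadapt.

```shell
# Hypothetical file names, chosen only for illustration.
adapters="adapters.fa"
in_r1="sample_R1.fastq.gz"
in_r2="sample_R2.fastq.gz"
out_r1="sample_R1.trimmed.fastq.gz"
out_r2="sample_R2.trimmed.fastq.gz"

# Assumption: adapters are treated as 3' adapters on both reads (-a / -A).
# The four filtering options are the defaults quoted above.
set -- cutadapt \
    -a "file:$adapters" -A "file:$adapters" \
    --max-n=0 \
    --minimum-length=20 \
    -q 20,20 \
    --overlap=8 \
    -o "$out_r1" -p "$out_r2" \
    "$in_r1" "$in_r2"

# Show the assembled command for review.
cutadapt_cmd="$*"
echo "$cutadapt_cmd"

# Run it only if Cutadapt is installed and all input files exist.
if command -v cutadapt >/dev/null 2>&1 \
    && [ -f "$in_r1" ] && [ -f "$in_r2" ] && [ -f "$adapters" ]; then
    "$@"
fi
```

In Cutadapt, --max-n=0 discards reads containing any N base, --minimum-length=20 discards reads shorter than 20 bases after trimming, -q 20,20 trims low-quality bases from both the 5' and 3' ends with a cutoff of 20, and --overlap=8 requires at least 8 bases of overlap between a read and an adapter before the adapter is removed.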

3. Quality control by FastQC

Read quality is reported by FastQC, which is run on both the raw reads and the trimmed reads. Adapter contamination is checked against the same sequences used by Cutadapt (specified with the -a, --adapter argument).
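FastQC's own --adapters option expects a tab-delimited list of name and sequence rather than FASTA, so reusing a FASTA adapter file for a manual FastQC run requires a conversion. The sketch below shows one way to do it with awk, assuming each FASTA record has a single sequence line; whether nadia-reads performs the conversion this way is not stated here, and the output file name is made up.

```shell
set -eu
workdir=$(mktemp -d)

# Tiny demo adapter file using two of the default sequences listed above.
cat > "$workdir/adapters.fa" <<'EOF'
>Illumina_Universal
AGATCGGAAGAG
>Nextera
CTGTCTCTTATACACATCT
EOF

# Turn each FASTA record (header + one sequence line) into "name<TAB>sequence".
awk '/^>/ { name = substr($0, 2); next }
     NF   { print name "\t" $0 }' "$workdir/adapters.fa" > "$workdir/adapters.tsv"

first_line=$(head -n 1 "$workdir/adapters.tsv")
n_adapters=$(wc -l < "$workdir/adapters.tsv" | tr -d ' ')
echo "converted $n_adapters adapters"

# With FastQC installed, the list could then be passed as:
#   fastqc --adapters "$workdir/adapters.tsv" --outdir qc_out sample_R1.fastq.gz
```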

Input

  • Sequencing reads in FASTQ format

  • Adapter sequences in FASTA format

Output

  • A MultiQC report summarizing the FastQC and Cutadapt results. You can download an example report.

  • Trimmed reads in FASTQ format (ready to be aligned)

Usage examples

nadia-reads \
    -r1 tests/testdata/L1_R1.fastq.gz tests/testdata/L2_R1.fastq.gz \
    -r2 tests/testdata/L1_R2.fastq.gz tests/testdata/L2_R2.fastq.gz \
    -n test_sample \
    -o tests/testresult/reads \
    --trim -a tests/testdata/adapters.fa

Argument details

-r1, --read1

Required

Read 1 fastq files. If multiple files are given, they must be in the same order as --read2.

-r2, --read2

Required

Read 2 fastq files. If multiple files are given, they must be in the same order as --read1.

-n, --name

Sample name, used for naming output files.

Required if there are multiple input files.

If a single fastq file is given and --name is not specified, the filename of read 2 is used as the sample name.

-o, --outdir

Required

Output directory.

--trim

If this flag is set, Cutadapt is run to trim sequences.

-a, --adapter

Path to adapter file in FASTA format. See 2. Trim sequences by Cutadapt