nadia-ref

Overview

Generate a reference index for read alignment.

To obtain a gene expression matrix, reads must be aligned to a reference genome or transcriptome. Both the STAR aligner and Salmon (Alevin-fry) require an index to align reads, so the first step of the analysis is to prepare that index.

An index is needed for each species/genome, but a single index can be reused across many samples, so it only needs to be generated once. STAR and Salmon use different indexes; nadia-ref can create either one, selected with the -i option.

How it works

Filter GTF file

If a read maps to multiple genes (multi-mapped), it is not counted in the gene expression matrix. GTF files can contain non-protein-coding transcripts that overlap protein-coding transcripts. Reads falling in those overlaps can be flagged as multi-mapped and therefore dropped from the counts, so such transcripts should be filtered out of the GTF file before the index is created.

This is done with the -f, --filter and --gene-biotype options.
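The filtering idea can be sketched in a few lines of Python. This is only an illustration, not nadia-ref's actual implementation: the function name filter_gtf_lines is hypothetical, and it assumes the biotype is stored in the attribute column as gene_biotype (Ensembl style) or gene_type (GENCODE style).

```python
import re

# Assumption: the biotype lives in column 9 as gene_biotype "..." (Ensembl)
# or gene_type "..." (GENCODE). Other annotation sources may differ.
BIOTYPE_RE = re.compile(r'\bgene_(?:biotype|type) "([^"]+)"')


def filter_gtf_lines(lines, keep_biotypes=("protein_coding",)):
    """Yield GTF lines whose gene biotype is in keep_biotypes.

    Comment/header lines (starting with '#') are passed through unchanged.
    Records without a recognizable biotype attribute are dropped.
    """
    keep = set(keep_biotypes)
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GTF record
        match = BIOTYPE_RE.search(fields[8])
        if match and match.group(1) in keep:
            yield line


# Tiny made-up example: one header line, one protein-coding gene, one lncRNA.
example = [
    "#!genome-build GRCh38\n",
    'chr1\thavana\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n',
    'chr1\thavana\tgene\t200\t800\t.\t-\t.\tgene_id "G2"; gene_biotype "lncRNA";\n',
]
kept = list(filter_gtf_lines(example))  # header and G1 are kept, G2 is dropped
```

Passing several biotypes, for example keep_biotypes=("protein_coding", "lncRNA"), mirrors the fact that --gene-biotype accepts multiple values.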

Create STAR index

The STAR index is generated with the STAR --runMode genomeGenerate command.

To create a STAR index, pass -i star. The length of read 2 must also be given with the -l argument (default: 91 bp).
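As a rough sketch of what such a call can look like, the helper below assembles a STAR --runMode genomeGenerate command line. The helper name and the threads parameter are hypothetical, and mapping -l to --sjdbOverhang as read length minus 1 follows the STAR manual's general recommendation; it is an assumption that nadia-ref uses the read length this way. The sketch also assumes an uncompressed FASTA, since STAR reads plain-text genome files.

```python
def star_genome_generate_cmd(fasta, gtf, index_dir, read_length=91, threads=4):
    """Build an argv list for `STAR --runMode genomeGenerate`.

    Assumption: --sjdbOverhang is set to read_length - 1, as the STAR
    manual suggests; nadia-ref's real invocation may differ.
    """
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", str(index_dir),
        "--genomeFastaFiles", str(fasta),
        "--sjdbGTFfile", str(gtf),
        "--sjdbOverhang", str(read_length - 1),
    ]


# Hypothetical paths, for illustration only.
star_cmd = star_genome_generate_cmd("genome.fa", "annotation.gtf", "star_index")

# To actually run it (requires STAR on PATH):
#   import subprocess
#   subprocess.run(star_cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.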

Create Salmon index

For Salmon, a splici (spliced + intron) index is created. The splici index, used with Alevin-fry, is intended to reduce spurious expression arising from mapping errors while keeping the speed advantages, at the cost of somewhat higher memory usage.

To create a Salmon index, pass -i salmon.

There are two steps: (1) prepare the splici reference and (2) build the Salmon index. These are done with the pyroe.make_splici_txome function and the salmon index command, respectively. This mode also requires the length of read 2, given with the -l argument.
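The two steps can be sketched as follows. This is a planning helper only, not nadia-ref's code: the function name, the ref_dir/index_dir layout and the threads parameter are hypothetical. The keyword names passed to pyroe.make_splici_txome and the splici_fl<N>.fa output naming (N = read length minus flank trim length, with a trim of 5) reflect pyroe's documented defaults as best recalled here and should be checked against the installed pyroe version.

```python
from pathlib import Path


def plan_splici_index(fasta, gtf, ref_dir, index_dir,
                      read_length=91, flank_trim_length=5, threads=4):
    """Return (kwargs for pyroe.make_splici_txome, argv for `salmon index`).

    Assumption: pyroe writes <prefix>_fl<flank>.fa into output_dir, where
    flank = read_length - flank_trim_length. Verify against your pyroe.
    """
    ref_dir = Path(ref_dir)
    flank = read_length - flank_trim_length
    splici_fa = ref_dir / f"splici_fl{flank}.fa"

    # Step 1: arguments for building the spliced + intron reference.
    splici_kwargs = dict(
        gtf_path=str(gtf),
        genome_path=str(fasta),
        read_length=read_length,
        output_dir=str(ref_dir),
        flank_trim_length=flank_trim_length,
        filename_prefix="splici",
    )

    # Step 2: index the splici FASTA with salmon.
    salmon_cmd = [
        "salmon", "index",
        "-t", str(splici_fa),
        "-i", str(index_dir),
        "-p", str(threads),
    ]
    return splici_kwargs, salmon_cmd


# Hypothetical paths, for illustration only.
splici_kwargs, salmon_cmd = plan_splici_index(
    "genome.fa", "annotation.gtf", "splici_ref", "salmon_index"
)

# To actually run both steps (requires pyroe and salmon to be installed):
#   import subprocess, pyroe
#   pyroe.make_splici_txome(**splici_kwargs)
#   subprocess.run(salmon_cmd, check=True)
```

Keeping the two steps as plain data makes it easy to log or inspect exactly what will be run before committing to a long index build.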

Input

  • Reference genome in FASTA format

  • Transcript annotation in GTF format

Output

An index folder (star_index or salmon_index) is created inside the output directory.

If gene biotype filtering is enabled, a filtered GTF file is also written.

Note

..._t2g_3col.tsv and gene_id_to_name.tsv are also written to the salmon index folder. These files are needed by the nadia-quant command.

Usage examples

Filter GTF (only keep “protein_coding” gene biotype) and create STAR index

nadia-ref \
    -g tests/testdata/ref/genome.fa.gz \
    -a tests/testdata/ref/annotation.gtf \
    -i star \
    -l 91 \
    -o tests/testresult/human \
    -f \
    --gene-biotype protein_coding

Create Salmon index for Alevin-fry (no filter)

nadia-ref \
    -g tests/testdata/ref/genome.fa.gz \
    -a tests/testdata/ref/annotation.gtf \
    -i salmon \
    -l 91 \
    -o tests/testresult/human

Argument details

-g, --fasta

Required

Reference genome in FASTA format (gzip-compressed files are supported).

-a, --gtf

Required

Transcript annotation in GTF format (gzip-compressed files are supported).

-o, --outdir

Required

Output directory.

-i, --index

Required, Options: star, salmon

Which index to create.

-l, --read-length

Default: 91

The length of read 2, in bp.

-f, --filter

Default: False

If this flag is set (True), the GTF file is filtered.

--gene-biotype

Default: protein_coding

Accepts multiple values.

Gene biotype(s) to keep (only used when -f is set).