nadia-ref
Overview
Generate a reference index for read alignment.
To obtain a gene expression matrix, reads must be aligned to a reference genome or transcriptome. Both the STAR aligner and Salmon (Alevin-fry) require an index to align reads, so the first step of the analysis is preparing that index.
An index must be created for each species/genome, but a single index can be
reused for any number of samples, so it only needs to be generated once. STAR
and Salmon use different index formats, and nadia-ref can create an index for
either aligner (selected with the -i option).
How it works
Filter GTF file
If a read maps to multiple genes (multi-mapped), it is not counted in the gene expression matrix. GTF files can contain non-protein-coding transcripts that overlap protein-coding transcripts. Such transcripts can cause reads to be flagged as multi-mapped and left uncounted, so they should be filtered out of the GTF file before the index is created.
This is done with the -f, --filter and --gene-biotype options.
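The filtering step can be sketched as follows. This is a minimal illustration, not nadia-ref's actual implementation: the helper name filter_gtf_by_biotype is hypothetical, and it assumes Ensembl-style GTF records whose ninth column carries a gene_biotype "..." attribute (GENCODE files name this attribute gene_type instead).

```shell
# Hypothetical helper (not part of nadia-ref): read a GTF on stdin and keep
# comment lines plus records whose gene_biotype is one of the given values.
filter_gtf_by_biotype() {
  # Join the requested biotypes with "," so awk can split them again.
  biotypes=$(IFS=','; printf '%s' "$*")
  awk -F'\t' -v list="$biotypes" '
    BEGIN { n = split(list, keep, ",") }
    /^#/ { print; next }   # keep header/comment lines
    {
      # Column 9 holds the attributes; print the record on the first match.
      for (i = 1; i <= n; i++)
        if (index($9, "gene_biotype \"" keep[i] "\";")) { print; next }
    }
  '
}
```

For example, `filter_gtf_by_biotype protein_coding < annotation.gtf > annotation.filtered.gtf` would keep only protein-coding records. A gzip-compressed annotation can be piped in with `gzip -dc`. Records without a gene_biotype attribute are dropped by this sketch.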
Create STAR index
The STAR index is generated with the STAR --runMode genomeGenerate command.
Select the STAR index with the -i star argument. The length of read 2 must
also be given with the -l argument (default: 91 bp).
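A sketch of the underlying STAR call is shown below. It assumes, following the recommendation in the STAR manual, that the read length is passed on as --sjdbOverhang set to the read length minus 1. Whether nadia-ref does exactly this is an assumption, and the star_index_cmd helper and the file names are illustrative only. The function prints the command instead of running it, so it can be inspected first.

```shell
# Hypothetical helper: print (do not run) a STAR genomeGenerate command.
# Usage: star_index_cmd GENOME_FA GTF OUTDIR [READ_LENGTH]
star_index_cmd() {
  genome="$1"; gtf="$2"; outdir="$3"; read_length="${4:-91}"
  # The STAR manual suggests sjdbOverhang = read length - 1 (assumed here).
  overhang=$((read_length - 1))
  printf 'STAR --runMode genomeGenerate --genomeDir %s --genomeFastaFiles %s --sjdbGTFfile %s --sjdbOverhang %s\n' \
    "$outdir/star_index" "$genome" "$gtf" "$overhang"
}
```

Calling `star_index_cmd genome.fa annotation.gtf out` with the default read length of 91 prints a command ending in `--sjdbOverhang 90`. STAR itself reads plain-text FASTA, so a gzip-compressed genome has to be decompressed before this command is run.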
Create Salmon index
For Salmon, a splici (spliced + intron) index is created. The splici index and Alevin-fry aim to reduce spurious expression arising from mapping errors, while maintaining the speed advantages and only somewhat compromising on memory usage.
Select the Salmon index with -i salmon.
There are two steps: (1) prepare the splici reference and (2) build the Salmon
index. These are done by the pyroe.make_splici_txome function and the salmon
index command, respectively.
This also requires the length of read 2, given with the -l argument.
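The two steps can be sketched with command-line equivalents. This is an illustration under assumptions, not nadia-ref's actual code: nadia-ref calls the Python function pyroe.make_splici_txome, whereas the sketch swaps in the pyroe make-splici command, and the splici_ref directory name and the salmon_index_cmds helper are hypothetical. The function prints the commands rather than running them.

```shell
# Hypothetical helper: print (do not run) the two commands for a splici index.
# Usage: salmon_index_cmds GENOME_FA GTF OUTDIR [READ_LENGTH]
salmon_index_cmds() {
  genome="$1"; gtf="$2"; outdir="$3"; read_length="${4:-91}"
  # Step 1: build the splici (spliced + intron) reference with pyroe.
  printf 'pyroe make-splici %s %s %s %s\n' \
    "$genome" "$gtf" "$read_length" "$outdir/splici_ref"
  # Step 2: index the splici FASTA written by step 1 with salmon.
  # A glob stands in for the FASTA name, which depends on pyroe's naming.
  printf 'salmon index -t %s -i %s\n' \
    "$outdir/splici_ref/*.fa" "$outdir/salmon_index"
}
```

Calling `salmon_index_cmds genome.fa annotation.gtf out` prints one line per step. The read length matters in step 1 because it determines how much flanking sequence is attached to each intron in the splici reference.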
Input
Reference genome in FASTA format
Transcript annotation in GTF format
Output
An index folder (star_index or salmon_index) is created inside the output directory.
If gene biotype filtering is enabled, a filtered GTF file is also written.
Note
..._t2g_3col.tsv and gene_id_to_name.tsv are also written to the salmon
index folder. These files are needed by the nadia-quant command.
Usage examples
Filter the GTF (keep only the “protein_coding” gene biotype) and create a STAR index
nadia-ref \
-g tests/testdata/ref/genome.fa.gz \
-a tests/testdata/ref/annotation.gtf \
-i star \
-l 91 \
-o tests/testresult/human \
-f \
--gene-biotype protein_coding
Create a Salmon index for Alevin-fry (no filtering)
nadia-ref \
-g tests/testdata/ref/genome.fa.gz \
-a tests/testdata/ref/annotation.gtf \
-i salmon \
-l 91 \
-o tests/testresult/human
Argument details
-g, --fasta
Required
Reference genome in FASTA format (gzip-compressed files are supported).
-a, --gtf
Required
Transcript annotation in GTF format (gzip-compressed files are supported).
-o, --outdir
Required
Output directory.
-i, --index
Required, Options: star, salmon
Which index to create.
-l, --read-length
Default: 91
The length of read 2, in bp.
-f, --filter
Default: False
If this flag is set (True), the GTF file is filtered.
--gene-biotype
Default: protein_coding
Accepts multiple strings
Gene biotype(s) to keep (if -f is used)