DRAGEN RNA Pipeline

DRAGEN includes an RNA-seq (splicing-aware) aligner, as well as RNA-specific analysis components for gene expression quantification, gene fusion detection, splice variant calling, and small variant calling.

Most of the functionality and options described in Host Software Options and DNA Mapping also apply to RNA applications. Additional RNA-specific aspects are described in this section.

Input Files

Gene Annotation File

In addition to the standard input files (reads from FASTQ or BAM, reference genome, etc.), DRAGEN can also take a gene annotation file as input. A gene annotation file aids in the alignment of reads to known splice junctions and is required for gene expression quantification and gene fusion calling.

To specify a gene annotation file, use the -a (--annotation-file) command line option. The input file must conform to the GTF/GFF specification (http://uswest.ensembl.org/info/website/upload/gff.html). The file must contain features of type exon, and each exon record must contain the gene_id and transcript_id attributes. An example of a valid GTF file is shown below.

chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; ...
chr1    HAVANA  exon        11869   12227   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; ...
chr1    HAVANA  exon        12613   12721   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; ...
chr1    HAVANA  exon        13221   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; ...
chr1    ENSEMBL transcript  11872   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; ...
chr1    ENSEMBL exon        11872   12227   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; ...
chr1    ENSEMBL exon        12613   12721   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; ...
chr1    ENSEMBL exon        13225   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; ...
...
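
The GTF requirements above (exon features that carry both gene_id and transcript_id) can be checked before a run. The following Python sketch is illustrative only and is not part of DRAGEN: the helper names parse_gtf_attributes and find_invalid_exons are hypothetical, and the code assumes a tab-delimited file with nine columns, as in standard GTF.

```python
def parse_gtf_attributes(field):
    """Parse a GTF attribute column into a dict (hypothetical helper)."""
    attrs = {}
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue
        # Each item looks like: gene_id "ENSG00000223972.4"
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def find_invalid_exons(lines):
    """Return (line_number, missing_attributes) for exon records
    that lack gene_id or transcript_id."""
    required = ("gene_id", "transcript_id")
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "exon":
            continue  # only exon features are checked here
        attrs = parse_gtf_attributes(columns[8])
        missing = [name for name in required if not attrs.get(name)]
        if missing:
            problems.append((number, missing))
    return problems
```

Because find_invalid_exons accepts any iterable of lines, an open file handle can be passed directly; an empty result means every exon record carries both attributes.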

Similarly, a GFF file can be used. Each exon feature must have a Parent attribute that identifies its transcript; this identifier is used to group exons. An example of a valid GFF file is shown below.

1   ensembl_havana  processed_transcript    11869   14409       .   +   .   ID=transcript:ENST00000456328;
1   havana          exon                    11869   12227       .   +   .   Parent=transcript:ENST00000456328; ...
1   havana          exon                    12613   12721       .   +   .   Parent=transcript:ENST00000456328; ...
1   havana          exon                    13221   14409       .   +   .   Parent=transcript:ENST00000456328; ...
...

NB. To handle genes in the pseudoautosomal regions (PAR) of chromosomes X and Y correctly, the gene_id attribute of a gene's exons must differ between the two chromosomes. Annotation files often achieve this by using, for example, gene_id=geneA for the exons on chromosome X and gene_id=geneA_PAR_Y for the exons on chromosome Y. This allows the GTF/GFF parser and downstream components to distinguish data associated with PAR genes on chromosome X from data associated with the same genes on chromosome Y.
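
As a rough way to spot annotation files that violate the PAR requirement, the sketch below lists gene_id values that occur on both sex chromosomes. It is an illustration only, not DRAGEN's parser: the function name shared_xy_gene_ids is hypothetical, the contig names chrX and chrY are assumptions that must match the reference in use, and the (contig, gene_id) pairs are expected to have been extracted from exon records beforehand.

```python
from collections import defaultdict


def shared_xy_gene_ids(records, x_name="chrX", y_name="chrY"):
    """Return gene_ids seen on both the X and Y contigs.

    records: iterable of (contig, gene_id) pairs taken from exon features.
    """
    contigs_by_gene = defaultdict(set)
    for contig, gene_id in records:
        contigs_by_gene[gene_id].add(contig)
    # A gene_id present on both contigs cannot be told apart downstream.
    return sorted(
        gene_id
        for gene_id, contigs in contigs_by_gene.items()
        if {x_name, y_name} <= contigs
    )
```

Any gene_id returned by this check is shared between the two chromosomes and would need a distinct identifier on one of them (for example, a _PAR_Y suffix).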

The DRAGEN host software parses the file for exons within the transcripts and produces splice junctions. The following output displays the number of splice junctions detected.

==================================================================
Generating annotated splice junctions
==================================================================
Input annotations file: ./gencode.v19.annotation.gtf
Splice junctions database file: output/rna.sjdb.annotations.out.tab

Number of genes: 27459

Number of transcripts: 196520
Number of exons: 1196293
Number of splice junctions: 343856

The splice junctions that are detected from the annotation file are also written to *.sjdb.annotations.out.tab. Splice junctions below a minimum length are excluded, which helps filter annotation artifacts. This minimum annotation splice junction length is controlled by the --rna-ann-sj-min-len option, which has a default value of 6.
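
The idea of deriving splice junctions from annotated exons can be sketched in Python as follows. This is a simplified illustration, not DRAGEN's implementation: the function name annotated_junctions is hypothetical, a junction is assumed to span the intron between two consecutive exons of the same transcript (1-based, inclusive coordinates), and its length is assumed to be the intron length. How DRAGEN measures junction length internally is not stated in this section.

```python
def annotated_junctions(exons, min_len=6):
    """Derive unique splice junctions from exon records.

    exons: iterable of (contig, transcript_id, start, end) tuples,
           with 1-based inclusive coordinates.
    min_len: junctions shorter than this are dropped (assumed to mirror
             the role of --rna-ann-sj-min-len).
    """
    by_transcript = {}
    for contig, transcript_id, start, end in exons:
        by_transcript.setdefault((contig, transcript_id), []).append((start, end))

    junctions = set()
    for (contig, _), spans in by_transcript.items():
        spans.sort()
        # Pair each exon with the next exon of the same transcript.
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            intron_start, intron_end = prev_end + 1, next_start - 1
            if intron_end - intron_start + 1 >= min_len:
                junctions.add((contig, intron_start, intron_end))
    return sorted(junctions)
```

Applied to the two example transcripts shown earlier in this section, the first junction (12228 to 12612) is shared by both transcripts and is reported once, so the set-based approach yields three unique junctions rather than four.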

GFF3 Support

Note that GFF3 is a different file format from GFF. GFF3 files are not officially supported due to inconsistent contig naming conventions between GENCODE and Ensembl.

For the same reference, GENCODE provides all the attributes necessary for DRAGEN to build a hierarchical structure:

#description: evidence-based annotation of the human genome (GRCh38), version 32 (Ensembl 98)
...
chr1    HAVANA  exon    11869   12227   .       +       .       ID=exon:ENST00000456328.2:1;Parent=ENST00000456328.2;gene_id=ENSG00000223972.5;transcript_id=ENST00000456328.2;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;transcript_type=lncRNA;transcript_name=DDX11L1-202;exon_number=1;exon_id=ENSE00002234944.1;level=2;transcript_support_level=1;hgnc_id=HGNC:37102;tag=basic;havana_gene=OTTHUMG00000000961.2;havana_transcript=OTTHUMT00000362751.1

Ensembl has a different notation:

#!genome-build Genome Reference Consortium GRCh38.p14
#!genome-version GRCh38
...
1       havana  exon    11869   12227   .       +       .       Parent=transcript:ENST00000456328;Name=ENSE00002234944;constitutive=1;ensembl_end_phase=-1;ensembl_phase=-1;exon_id=ENSE00002234944;rank=1;version=1

Ensembl uses different notation for contigs (for GRCh38) than GENCODE. Ensembl contigs do not have the "chr" prefix. The contig identifiers in the annotation file must match the DRAGEN reference in use, and by most conventions GRCh38/hg38 contigs are prefixed with "chr".
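
If an Ensembl-style annotation has to be matched to a reference whose contigs carry the "chr" prefix, the first column can be rewritten before the file is passed to DRAGEN. The Python sketch below is one possible approach under stated assumptions, not a DRAGEN feature: the function name add_chr_prefix is hypothetical, only the primary chromosomes (1 to 22, X, Y) are renamed, and the mapping of MT to chrM is an assumption that should be checked against the contig names of the reference in use. Other contigs, such as unplaced scaffolds, are left unchanged because their naming differs in ways a simple prefix does not cover.

```python
# Primary chromosome names as they appear without the "chr" prefix.
PRIMARY_CONTIGS = {str(n) for n in range(1, 23)} | {"X", "Y"}


def add_chr_prefix(line):
    """Return an annotation line with a 'chr'-prefixed contig name."""
    if line.startswith("#") or not line.strip():
        return line  # keep header, comment, and blank lines as they are
    contig, sep, rest = line.partition("\t")
    if contig in PRIMARY_CONTIGS:
        contig = "chr" + contig
    elif contig == "MT":
        # Assumption: the reference names the mitochondrial contig chrM.
        contig = "chrM"
    return contig + sep + rest
```

After conversion, it is worth confirming that every contig name in the annotation also exists in the reference, since records on unmatched contigs cannot contribute splice junctions.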

If necessary, DRAGEN may support GFF3 files that are GENCODE-compatible with the following annotations present in the attributes of each exon record:

  • For gene: "gene_name" or "name" or "gene" or "gene_id"

  • For transcript: "transcript_id" or "Parent"

Due to the flexibility of the GFF3 file format, issues may arise as it continues to evolve.
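
To check whether the exon records of a GFF3 file carry the attributes listed above, a lookup along the following lines can be used. This Python sketch is illustrative only: the helper names are hypothetical, and both the lookup order and the case-sensitive key matching are assumptions, since this section does not state how DRAGEN prioritizes or matches the attribute names.

```python
GENE_KEYS = ("gene_name", "name", "gene", "gene_id")
TRANSCRIPT_KEYS = ("transcript_id", "Parent")


def parse_gff3_attributes(field):
    """Parse a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        key, sep, value = item.strip().partition("=")
        if sep and key:
            attrs[key] = value
    return attrs


def first_present(attrs, keys):
    """Return the value of the first key found in attrs, else None."""
    for key in keys:
        if key in attrs:
            return attrs[key]
    return None


def exon_identifiers(field):
    """Return (gene, transcript) identifiers for one exon record."""
    attrs = parse_gff3_attributes(field)
    return first_present(attrs, GENE_KEYS), first_present(attrs, TRANSCRIPT_KEYS)
```

With the GENCODE exon record shown earlier, this lookup finds both a gene and a transcript identifier. With the Ensembl record, only the Parent attribute matches under these assumptions, which illustrates why that notation does not provide the gene-level information directly on the exon.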

Two-Pass Splice-junction Alignment

Instead of using a GTF file for annotated splice junctions, DRAGEN can also read an SJ.out.tab file (see SJ.out.tab). This file enables DRAGEN to run in a two-pass mode, where the splice junctions discovered in the first pass (output as an SJ.out.tab file) are used to guide the mapping and alignment of reads during a second run through DRAGEN. This mode is useful for increasing sensitivity for spliced alignments when a gene annotation file is not readily available for the target genome. If a well-curated GTF is already available for your target genome, there is no need to run a second pass with the SJ.out.tab file.

Depending on the characteristics of the input data (for example, read depth and distribution), the second pass using the SJ.out.tab file from the first pass may take longer than the first pass.

NOTE: Components downstream of the aligner, such as gene expression quantification, gene fusion detection, and RNA variant calling, require a GTF file as the input annotation file and are NOT compatible with two-pass splice-junction alignment mode.
