RNA Alignment

RNA Alignment

The DRAGEN RNA pipeline uses the DRAGEN RNA-Seq spliced aligner. Mapping of short seed sequences from RNA-Seq reads is performed similarly to mapping DNA reads. In addition, splice junctions (the joining of noncontiguous exons in RNA transcripts) near the mapped seeds are detected and incorporated into the full read alignments.

Alignment Output

The output files generated when running DRAGEN in RNA mode are similar to those generated in DNA mode. RNA mode also produces extra information related to spliced alignments. Details regarding the splice junctions are present both in the SAM alignment record and an additional file, the SJ.out.tab file.

BAM

The output BAM file meets the SAM specification and is compatible with downstream RNA-Seq analysis tools.

RNA-Seq BAM Tags

The following BAM tags are emitted alongside spliced alignments.

TagDescription

XS:A

The XS tag denotes the strand orientation of an intron. See [Compatibility with Cufflinks]{.underline}.

NH:i

A standard SAM tag indicating the number of reported alignments that contains the query in the current record. This tag may be used for downstream tools such as featureCounts.

HI:i

A standard SAM tag denoting the query hit index, with its value indicating that this alignment is the i-th one stored in the SAM. Its value ranges from 1 ... NH. This tag may be used for downstream tools such as featureCounts.

jM:B

The jM tag lists the intron motifs for all junctions in the alignments. It has the following definitions

jM:B

Definition

0

non-canonical

1

GT/AG

2

CT/AC

3

GC/AG

4

CT/GC

5

AT/AC

6

GT/AT

If a gene annotations file is used during the map/align stage, and the splice junction is detected as an annotated junction, then 20 is added to its motif value.

Cufflinks might require spliced alignments to emit the XS:A strand tag. This tag is present in the SAM record if the alignment contains a splice junction. The possible values for XS:A strand tag are as follows:

'.' (undefined), '+' (forward strand), '-' (reverse strand), or '*' (ambiguous).

By default, if the spliced alignment has an undefined strand or an ambiguous (conflicting) strand, then the alignment output is suppressed. These alignments can be output into the output alignment file by setting the --no-ambig-strand option to 1.

Cufflinks also expects that the MAPQ for a uniquely mapped read is a single value. This value is specified by the --rna‑mapq-unique option. To force all uniquely mapped reads to have a MAPQ equal to this value, set ‑‑rna‑mapq‑unique to a nonzero value.

SJ.out.tab

Along with the alignments emitted in the SAM/BAM file, an additional SJ.out.tab file summarizes the high confidence splice junctions in a tab-delimited file. The columns for this file are as follows:

  1. contig name

  2. first base of the splice junction (1-based)

  3. last base of the splice junction (1-based)strand (0: undefined, 1: +, 2:-)

  4. strand (0: undefined, 1: +, 2: -)

  5. intron motif: 0: noncanonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT

  6. 0: unannotated, 1: annotated, only if an input gene annotations file was used

  7. number of uniquely mapping reads spanning the splice junction

  8. number of multimapping reads spanning the splice junction

  9. maximum spliced alignment overhang

The maximum spliced alignment overhang (column 9) field in the SJ.out.tab file is the anchoring alignment overhang. For example, if a read is spliced as ACGTACGT------------ACGT, then the overhang is 4. For the same splice junction, across all reads that span this junction, the maximum overhang is reported. The maximum overhang is a confidence indicator that the splice junction is correct based on anchoring alignments.

There are two SJ.out.tab files generated by the DRAGEN host software, an unfiltered version and a filtered version. The records in the unfiltered file are a consolidation of all spliced alignment records from the output SAM/BAM. However, the filtered version has a much higher confidence for being correct due to the use of the following filters.

A splice junction entry in the SJ.out.tab file is filtered out if any of these conditions are met:

  • SJ is a noncanonical motif and is only supported by < 3 unique mappings.

  • SJ of length > 50000 and is only supported by < 2 unique mappings.

  • SJ of length > 100000 and is only supported by < 3 unique mappings.

  • SJ of length > 200000 and is only supported by < 4 unique mappings.

  • SJ is a noncanonical motif and the maximum spliced alignment overhang is < 30.

  • SJ is a canonical motif and the maximum spliced alignment overhang is < 12.

The filtered SJ.out.tab is recommended for use with any downstream analysis or post processing tools. Alternatively, you can use the unfiltered SJ.out.tab and apply your own filters (for example, with basic awk commands).

Note that the filter does not apply to the alignments present in the BAM or SAM file.

Chimeric.out.junction File

If there are chimeric alignments present in the sample, then a supplementary Chimeric.out.junction file is also output. This file contains information about split-reads that can be used to perform downstream gene fusion detection. Each line contains one chimerically aligned read. The columns of the file are as follows:

  1. Chromosome of the donor.

  2. First base of the intron of the donor (1-based).

  3. Strand of the donor.

  4. Chromosome of the acceptor.

  5. First base of the intron of the acceptor (1-based).

  6. Strand of the acceptor.

  7. N/A---not used, but is present to be compatible with other tools. It will always be 1.

  8. N/A---not used, but is present to be compatible with other tools. It will always be *.

  9. N/A---not used, but is present to be compatible with other tools. It will always be *.

  10. Read name.

  11. First base of the first segment, on the + strand.

  12. CIGAR of the first segment.

  13. First base of the second segment.

  14. CIGAR of the second segment.

CIGARs in this file follow the standard CIGAR operations as found in the SAM specification, with the addition of a gap length L that is encoded with the operation p. For paired end reads, the sequence of the second mate is always reverse complemented before determining strandedness.

The following is an example entry that shows two chimerically aligned read pairs, in which one of the mates is split, mapping segments of chr19 to chr12. Also shown are the corresponding SAM records associated with these entries.

chr19 580462 + chr12 120876182 + 1 * * R_15448 571532 49M8799N26M8p49M26S 120876183 49H26M
chr19 580462 + chr12 120876182 + 1 * * R_15459 571552 29M8799N46M8p29M46S 120876183 29H46M

R_15448:1   99    chr19   571531      60  49M8799N26M  =       580413
R_15448:2   147   chr19   580413      60  49M26S       =       571531
R_15448:2   2193  chr12   120876182   15  49H26M       chr19   571531

R_15459:1   99    chr19   571551      60  29M8799N46M  =       580433
R_15459:2   147   chr19   580433      4   29M46S       =       571551
R_15459:2   2193  chr12   120876182   15  29H46M       chr19   571551

Mapping Metrics

The RNA Pipeline reports summary and per read group statistics pertaining to read mapping in the mapping_metrics.csv file. The majority of the matrics are as described in the DNA mapping metrics section, but the metrics that are specific to RNA-seq are listed below.

  • Filtered rRNA reads---Total number of ribosomal RNA reads that are filtered out with the --rrna-filter-enable option.

  • Mitochondrial reads excluded---Total number of reads detected to be in ChrM if the --rna-mapping-metrics-exclude-chrm option is enabled.

  • Mapped reads adjusted for filtered mapping---Adjusted count of mapped reads by adding in the filtered rRNA reads.

  • Mapped reads adjusted for excluded mapping---Adjusted count of mapped reads by adding in the excluded mitocondrial reads.

  • Mapped reads adjusted for filtered and excluded mapping---Adjusted count of mapped reads by adding in both the filtered rRNA and excluded mitocondrial reads.

  • Unmapped reads adjusted for filtered mapping---Adjusted count of unmapped reads by subtracting out the filtered rRNA reads.

  • Unmapped reads adjusted for excluded mapping---Adjusted count of unmapped reads by subtracting out the excluded mitocondrial reads.

  • Unmapped reads adjusted for filtered and excluded mapping---Adjusted count of unmapped reads by subtracting out both the filtered rRNA and the excluded mitocondrial reads.

  • Reads with splice junction---Total number of reads that included a spliced alignment that crosses an intron

RNA Alignment Options

The aligner stage of the RNA spliced aligner uses Smith-Waterman Alignment Scoring options and Splicing Scoring Options.

Smith-Waterman Alignment Scoring Options

Refer to [Smith-Waterman Alignment Scoring Settings]{.underline} for more details about the alignment algorithm used within DRAGEN. The following scoring options are specific to the processing of canonical and noncanonical motifs within introns.

  • --Aligner.intron-motif12-pen The --Aligner.intron-motif12-pen option controls the penalty for canonical motifs 1/2 (GT/AG, CT/AC). The default value calculated by the host software is 1 * (match-score + mismatch-pen).

  • --Aligner.intron-motif34-pen The --Aligner.intron-motif34-pen option controls the penalty for canonical motifs 3/4 (GC/AG, CT/GC). The default value calculated by the host software is 3 * (match-score + mismatch-pen).

  • --Aligner.intron-motif56-pen The --Aligner.intron-motif56-pen option controls the penalty for canonical motifs 5/6 (AT/AC, GT/AT). The default value calculated by the host software is 4 * (match-score + mismatch-pen).

  • --Aligner.intron-motif0-pen The --Aligner.intron-motif0-pen option controls the penalty for noncanonical motifs. The default value calculated by the host software is 6 * (match-score + mismatch-pen).

Splicing Scoring Options

  • --Mapper.min-intron-bases For RNA-Seq mapping, a reference alignment gap can be interpreted as a deletion or an intron. In the absence of an annotated splice junction, the min-intron-bases option is a threshold gap length separating this distinction. Reference gaps at least this long are interpreted and scored as introns, and shorter reference gaps are interpreted and scored as deletions. However, alignments can be returned with annotated splice junctions shorter than this threshold.

  • --Mapper.max-intron-bases The max-intron-bases option controls the largest possible intron that is reported, which useful for preventing false splice junctions that would otherwise be reported. Set this option to a value that is suitable to the species you are mapping against.

  • --Mapper.ann-sj-max-indel For RNA-seq, seed mapping can discover a reference gap in the position of an annotated intron, but with slightly different length. If the length difference does not exceed this option, the mapper investigates the possibility that the intron is present exactly as annotated, but an indel on one side or the other near the splice junction explains the length difference. Indels longer than this option and very near annotated splice junctions are not likely to be detected. Higher values may increase mapping time and false detections.

Duplicate Marking

DRAGEN RNA can detect duplicate reads, which are defined as fragments that have both ends mapping to the same (clipping-adjusted) position during alignment. In RNA-Seq data, the reads can represent PCR duplicates during library prep or as a result from deep coverage of highly expressed regions.

If --enable-duplicate-marking is set to true, duplicate fragments are marked in the BAM file and the total number of duplicate reads is reported as a mapping metric. Marking of duplicates does not affect gene expression quantification and gene fusion calling.

Downsampling

DRAGEN RNA also supports internal downsampling, which is a process by which a random sub-sample of reads is selected from the dataset after trimming and alignment for downstream analysis. In RNA-Seq, this can be useful in two ways - it can speed up analysis of samples with excessively high coverage, and it can allow for more accurate cross-comparisons between different samples.

If --enable-down-sampler is set to true and a value specified for --down-sampler-reads, DRAGEN will use only that many RNA fragments (including both Read 1 and Read 2) for quantification, fusion, variant calling and output to BAM. Please note the the entire input dataset is still used for the generation of trimming, fastqc, and mapping metrics.

Ribosomal RNA filtering

Ribosomal RNA (rRNA) sequences can contribute a large fraction of reads in some RNA-Seq datasets, depending on the sample type and library prep method. You can use the DRAGEN RNA pipeline to filter rRNA reads during alignment, because the reads are not relevant for downstream analysis. By filtering rRNA, you can reduce run time and file size and avoid deep read alignment pile ups at rRNA repeat loci on the genome to make downstream analysis of RNA BAM files easier.

rRNA filtering relies on a decoy contig with the rRNA sequence included in the reference hash table. Any read that maps to the decoy contig, including multimappers, is tagged with rRNA and is not mapped in the output.

NOTE: The rrna filter option only accepts a single contig by default. In the event multiple contigs need to be provided, they can be concatenated using a 1kb N mask between them, and added to the reference FASTA while creating the hash table.

NOTE: rRNA filtering is not supported with chm13-based references and it will be automatically disabled.

Command-Line Options

The following are the required command-line options for rRNA filtering.

  • --rrna-filter-enable=true--Enables rRNA filtering. Set to true to enable rRNA filtering. The default value is false.

  • --rrna-filter-contig--Specify the name of the rRNA sequences to use for filtering. If you do not specify a value, the default gl000220 is provided for human genome alignments by using the reference autodetect feature. gl000220 is an unplaced contig included in hg19 and hg38 genomes, which include a full copy of the rRNA repeat. For other genomes, you must include a rRNA decoy contig when creating a hash table.

Output Files

All rRNA filtering reads are left unaligned in the BAM files and tagged ZS:Z:FLT. The number and percentage of filtered rRNA reads is reported as a mapper metric Adjustment of reads matching filter contigs.

PolyA trimming

PolyA tails may be trimmed by including the settings --read-trimmers polya or --soft-read-trimmers polyg,polya (Note: polyg soft trimming is enabled by default). The minimum number of poly-A/poly-T bases required for trimming may be set using --trim-polya-min-trim. The default threshold is 20 poly-A/poly-T bases. Refer to DNA pipeline read trimmers section for usage of read trimmers options.

PolyA trimming by read orientation

The PolyA trimmer determines which end of the reads to trim for poly-A and poly-T sequences based on the library type. For example, for Illumina forward stranded paired reads the trimmer will trim poly-A sequences at 3' end of read 1 and poly-T sequences at 5' end of read 2. If --rna-library-type is not provided or set to autodetect (A), the trimmer assumes the library is unstranded and trims poly-A sequences from 3' end of each read and poly-T sequences from 5' end of each read. The option --rna-library-type is described in the Gene Expression section.

MAPQ Scoring

By default, the MAPQ calculation for RNA-Seq is identical to DNA-Seq. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores. Therefore, adjusting the alignment scoring parameters impacts the MAPQ estimate. These adjustments are outlined in Smith-Waterman Alignment Scoring Settings.

The --mapq-strict-sjs option is specific to RNA, and applies where at least one exon segment is aligned confidently, but there is ambiguity regarding possible splice junctions. When this option is set to 0, a higher MAPQ value is returned, expressing confidence that the alignment is at least partially correct. When this option is set to 1, a lower MAPQ value is returned, reflecting the splice junction ambiguity.

Some downstream tools, such as Cufflinks, expect the MAPQ value to be a unique value for all uniquely mapped reads. This value is specified with the --rna-mapq-unique option. Setting this option to a nonzero value overrides all MAPQ estimates based on alignment score. Instead, all uniquely mapped reads have a MAPQ set to the value of --rna-mapq-unique. All multimapped reads have a MAPQ value of int(-10*log10(1 ‑ 1/NH), where the NH value is the number of hits (primary and secondary alignments) for that read.

Last updated