Splice Variant Caller

The identification of alternatively spliced isoforms (using their constitutive splice variants) and their functional effects is of high importance in the study of genetic variation and diseases, including cancer and neurological disorders. The main types of alternative splicing events resulting in splice variants are:

  • Exon skipping

  • Intron retention

  • Mutually exclusive exons

  • Alternative 5' splice site

  • Alternative 3' splice site

Running Splice Variant Caller

When enabled with the --enable-rna-splice-variant=true option added to an RNA Map/Align job, DRAGEN runs a Splice Variant caller by taking advantage of its fast and highly accurate splice-aware read mapper/aligner that aligns to the whole genome to identify novel alternative Splice Junction (SJ) candidates. These candidates can be filtered by additional information provided such as a "knowns list", "normals list", and a "target regions list".

Next during the read sorting phase, evidence for these alternative splices candidates (alts) vs. reference splicing are accumulated. Finally, each of the candidates are scored based on the accumulated read evidence and the results are written to TSV and VCF files for downstream tertiary analysis.

Following is an example command line.

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
-a <GTF_FILE> \
--output-dir <OUT_DIRECTORY> \
--output-file-prefix <PREFIX> \
--RGID <READ_GROUP_ID> \
--RGSM <Sample_NAME> \
--enable-rna true \
--enable-rna-splice-variant true

Spice Variant Optional Input Files

In addition to the required inputs listed in the above example (i.e. paired fastq reads, reference hashtable, and annotation), the following 3 optional input resource files can be provided to help provide better precision by reducing FP count.

Normals List

A list of Normal splice variants that will be filtered out of the final output (i.e. operating as a blacklist), as long as they are not in the "knowns" list, using the "--rna-splice-variant-normals" option.

The format of this file should be a tab separated file in the same format as the SJ.out.tab, except only the first 4 columns are used, i.e.

  1. contig name

  2. first base of the splice junction (1-based)

  3. last base of the splice junction (1-based)

  4. strand (0: undefined, 1: +, 2: -)

To create a Normals list file, a collection of DRAGEN RNA mapper output SJ.out.tab files for at least 30 samples can be used along with a simple script to process all the SJs in these files. The pseudo code block below describes the function of this script:

Generate_Normals(SJ_out_tab_files)
{
Typedef tuple(int,int,int,int) = SJ_key     // for contig #, start, end, strand
Typedef dict(SJ_key, int) = SJ_count
Const MIN_UNIQUE_READS = 3
Const MIN_OCCURRENCE = 2

SJ_count All_SJ = {}
list Normal_SJ = []

// create list of all candidate SJ
for sj_file in SJ_out_tab_files
    open(sj_file,'r')
    for each sj in sj_file
        if sj.unique_reads >= MIN_UNIQUE_READS
            if exists sj in All_SJ
                All_SJ[sj] += 1
            else
                All_SJ[sj] = 1
    close(sj_file)

// save any SJ that occur in enough samples
for each (sj, count) in All_SJ
    if count >= MIN_OCCURRENCE
        Normal_SJ.append(sj)

// Write out the Normals.txt
Normal_SJ.sort();
normals_file = open("Normals.txt",'w')
for each sj in Normal_SJ
    write(normals_file,sj[0..3],0,0,0,0,0) // pad sj tuple's 4 vals with 5 unused field 0's
close(normals_file)
}

Knowns List

A list of known splice variants that are exempt from being filtered out of the final output (i.e. operating as a whitelist), using the "--rna-splice-variant-knowns" option.

The format of the file should be a tab separated file in the same format as the SJ.out.tab, except only the first 4 columns are used, i.e.

  1. contig name

  2. first base of the splice junction (1-based)

  3. last base of the splice junction (1-based)

  4. strand (0: undefined, 1: +, 2: -)

Target Regions BED

A list of regions that called splice variants must fall within using the "--rna-splice-variant-regions" option. Any splice variant candidates will be excluded if they are not within these regions.

This file should be in BED file format with the following info, except that the regions are 1-based

  1. chromosome id

  2. start position (1-based)

  3. end position (1-based)

  4. region (i.e. gene) name

Spice Variant Output Files

The detected splice variants are output as two separate TSV files for the intragenic and intergenic candidates, and as a VCF for the intragenic candidates.

The following categories are used when accumulating read counts for each alt SJ candidate:

  • DedupUniqueSupportingReads - Non-duplicate marked reads that are unique and precisely align with the SJ

  • DupUniqueSupportingReads - Duplicate marked reads that are unique and precisely align with the SJ

  • DedupUniqueNonsupportingReads - Non-duplicate marked reads that are unique but don't support the splice variant

  • DupUniqueNonsupportingReads - Duplicate marked reads that are unique but don't support the splice variant

To be counted, a paired read alignment:

  1. must be primary and properly paired

  2. must contain a splice junction (i.e. an alignment gap in the CIGAR containing skip ops)

  3. must have overhangs on either side of the skip that are at least 6 bases

  4. considered to be "unique" only if NH=1 and the MAPQ > 35

Splice Variant TSV Files

These two output files are named:

  • .splice_variants.tsv which contains the intragenic alt splice junctions that result in transcript variants

  • .splice_variant_fusions.tsv which contains the intergenic alt splice junctions that result fusions across genes

Each detected splice junction contains the following columns:

  1. gene_start - Gene name(s) at the start of the SJ. Multiple genes are separated by a semicolon

  2. gene_end - Gene name(s) at the end of the SJ. Multiple genes are separated by a semicolon

  3. chromosome - Chromosome containing the SJ

  4. start - SJ's start position (1-based genomic coordinate)

  5. end - SJ's start position (1-based genomic coordinate)

  6. strand - Detected strand for the SJ (1: +, 2: -)

  7. motif - intron motif

  8. annotated - True if annotated, otherwise False

  9. split_unique_reads_ref - DedupUniqueNonsupportingReads count that support reference

  10. split_total_reads_ref - DupUniqueNonsupportingReads + DedupUniqueNonsupportingReads count that support reference

  11. split_unique_reads_alt DedupUniqueSupportingReads count that support variant

  12. split_total_reads_alt - DupUniqueSupportingReads + DedupUniqueSupportingReads count that support variant

  13. max_spliced_alignment_overhang - maximum spliced alignment overhang from all supporting reads

  14. score - The splice junction variant score (ranging from 0.0 to 1.0). Currently, this is just a linear function of the number of split_unique_reads divided by 10, i.e. equals MIN(1.0, split_unique_reads_alt/10)

Note:

  • In the intragenic output file containing transcript variant splice junctions, the gene_start and gene_end columns must match.

  • In the intergenic output file containing fusions from splice junctions, the gene_start and gene_end columns must be different.

Splice Variant VCF File

This file contains the detected intra-genic splice junction variants that are not filtered out, and are written into a zipped VCF file titled .splice_variants.vcf.gz, where each splice variant candidate is written as a one-line VCF record containing the fields below:

  • CHROM - Chromosome of the splice

  • POS - SJ start position (1-based) i.e. first base of intron

  • ID - "." (unused)

  • REF - Base from the reference genome FASTA at the SJ start position

  • ALT - "<DEL>"

  • QUAL - The junction score from 0.0 - 1.0

  • FILTER - Semicolon separated list of filters: LowQ and LowUniqueAlignment

  • INFO - See the possible "Info fields below"

  • FORMAT - AD:DP

  • SAMPLE - Counts for {DedupUniqueSupportingReads}:{DedupUniqueNonsupportingReads}

The following lines of the VCF header describe columns 5 to 10 (last 6 columns)

##ALT=<ID=DEL,Description="Deletion">
##QUAL=<Description="QUAL score correlates support for the read count of splice junctions, not Phred-scaled">
##FILTER=<ID=LowQ,Description="Indicates the variant has a quality score below the passing threshold.">
##FILTER=<ID=LowUniqueAlignments,Description="Indicates the variant has a number of supporting reads below the passing threshold.">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=ALTDEDUP,Number=1,Type=Integer,Description="Reads split across deletion. Uniquely mapping, no duplicate reads">
##INFO=<ID=ALTDUP,Number=1,Type=Integer,Description="Reads split across deletion. Uniquely mapping including duplicate reads">
##INFO=<ID=REFDEDUP,Number=1,Type=Integer,Description="Reads across deletion region which do not support deletion.Uniquely mapping, no duplicate reads.">
##INFO=<ID=REFDUP,Number=1,Type=Integer,Description="Reads across deletion region which do not support deletion. Uniquely mapping including duplicate reads.">
##INFO=<ID=INTERGENIC,Number=0,Type=Flag,Description="Indicates that this splice variant may be an intergenic fusion.">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="SpliceSupportingReads: Reads across splice varaint region which do not support deletion.Uniquely mapping, no duplicate reads.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="ReferenceReads: reads split across splice variant. Uniquely mapping, no duplicate reads.">

Note on Filter Thresholds

The passing thresholds for the LowQ and LowUniqueAlignments filters are fixed to the settings below.

Filter

Description

Value

LowQ

Below splice variant score threshold

less than 1.0

LowUniqueAlignments

Below unique supporting read count threshold

less than 2

Output Merged with Fusion Caller

When the splice variant caller and gene fusion caller are both enabled, the passing and failed intergenic fusion SJ's will also be merged into the relevant fusion output TSV files.

The passing calls get added to the fusion caller's .fusion_candidates.final file. The tab separated fields are described below.

Field Names

Description

FusionGene

Left and Right gene names (separated by '--')

Score

Value between 0 and 1

LeftBreakpoint, RightBreakpoint

The location for left and right sides of the splice with three colon separated fields: chromosome:coordinate:strand(+/-)

Gene1Location, Gene2Location

Splice Variant caller always outputs "SpliceVar" here instead of Exon/Intron location

Gene1Sense, Gene2Sense

Always TRUE for by design

Gene1Id, Gene2Id

Long form ID (i.e. for Gencode it is usually "ENSG.version")

NumSplitReads

Taken from the dedupUniqueSupportingReads count (i.e. split_unique_reads_alt column value)

NumSoftClippedReads, NumPairedReads

These values are not used by RSV caller and are set to '0'

ReadNames

Not provided by this caller and set to 'N/A'

The failing calls get added into the fusion caller's .fusion_candidates.filter_info output file. The output fields are the same as described above for the "final" output file, with the addition of the FILTER_INFO field in the first column. The value in this field will be "RSV_FILTER:" followed by the specific filters that are not passing, as described in the table below.

Filter Names

Description

LOW_QUAL

Below the "low quality score" threshold

LOW_UNIQUE_ALIGNS

Number of unique anchors either on left or right side are below the MIN_UNIQUE_ALIGNS=2 threshold

LOW_EVIDENCE_OR_OVERHANG

Not meeting the SJ.out read count vs. splice length and overhang requirements

READTHROUGH

Gene partner is the next downstream annotated gene

Last updated