Splice Variant Caller

DRAGEN calls splice variants by taking advantage of its fast and highly accurate splice-aware read mapper/aligner that aligns reads to the whole genome to identify novel alternative Splice Junction (SJ) candidates. These candidates can be filtered by additional information provided such as a "normals list" and a "target regions list", or whitelisted with a "knowns list".

During the read sorting phase, evidence for these alternative splice variant candidates versus reference splicing is accumulated. Each candidate is then scored based on the accumulated read evidence, and the results are written to TSV and VCF files for downstream tertiary analysis.

To use the RNA Splice Variant caller, set the option --enable-rna-splice-variant=true. The following is an example command line.

dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
-a <GTF_FILE> \
--output-dir <OUT_DIRECTORY> \
--output-file-prefix <PREFIX> \
--RGID <READ_GROUP_ID> \
--RGSM <Sample_NAME> \
--enable-rna true \
--enable-rna-splice-variant true \
--enable-duplicate-marking true

Splice Variant Optional Input Files

In addition to the required inputs shown in the example above (paired FASTQ reads, reference hashtable, and annotation file), the following three optional resource files can be provided to improve precision by reducing the false positive (FP) count.

Normals List

A list of normal splice variants that are filtered out of the final output (that is, the list operates as a blacklist), unless they also appear in the "knowns" list. Provide the file with the --rna-splice-variant-normals option.

The file must be tab-separated and in the same format as SJ.out.tab, except that only the first 4 columns are used:

contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)
strand (0: undefined, 1: +, 2: -)

To create a Normals list file, a collection of DRAGEN RNA mapper output SJ.out.tab files for at least 30 samples can be used along with a simple script to process all the SJs in these files. The pseudo code block below describes the function of this script:

Generate_Normals(SJ_out_tab_files)
{
Typedef tuple(int,int,int,int) = SJ_key     // for contig #, start, end, strand
Typedef dict(SJ_key, int) = SJ_count
Const MIN_UNIQUE_READS = 3
Const MIN_OCCURRENCE = 2

SJ_count All_SJ = {}
list Normal_SJ = []

// create list of all candidate SJ
for sj_file in SJ_out_tab_files
    open(sj_file,'r')
    for each sj in sj_file
        if sj.unique_reads >= MIN_UNIQUE_READS
            if exists sj in All_SJ
                All_SJ[sj] += 1
            else
                All_SJ[sj] = 1
    close(sj_file)

// save any SJ that occur in enough samples
for each (sj, count) in All_SJ
    if count >= MIN_OCCURRENCE
        Normal_SJ.append(sj)

// Write out the Normals.txt
Normal_SJ.sort();
normals_file = open("Normals.txt",'w')
for each sj in Normal_SJ
    write(normals_file,sj[0..3],0,0,0,0,0) // pad sj tuple's 4 vals with 5 unused field 0's
close(normals_file)
}
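
The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a script shipped with DRAGEN. It assumes that column 7 of SJ.out.tab holds the uniquely mapping read count (as in the STAR-style SJ.out.tab layout) and that each file represents one sample. The function names collect_normals and write_normals are hypothetical.

```python
import csv
from collections import Counter

MIN_UNIQUE_READS = 3  # junction needs this many uniquely mapping reads in a sample
MIN_OCCURRENCE = 2    # junction must be seen in at least this many samples


def collect_normals(sj_paths):
    """Return sorted (contig, start, end, strand) keys seen in enough samples."""
    counts = Counter()
    for path in sj_paths:
        seen = set()  # count each junction at most once per sample file
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if len(row) < 7:
                    continue  # skip blank or truncated lines
                # Assumption: column 7 (index 6) is the unique read count.
                if int(row[6]) >= MIN_UNIQUE_READS:
                    seen.add((row[0], int(row[1]), int(row[2]), int(row[3])))
        counts.update(seen)
    return sorted(key for key, n in counts.items() if n >= MIN_OCCURRENCE)


def write_normals(keys, out_path):
    """Write the 4 key columns padded with 5 unused zero fields (9 columns)."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for key in keys:
            writer.writerow(list(key) + [0] * 5)
```

A typical call would be write_normals(collect_normals(paths), "Normals.txt"), where paths lists the SJ.out.tab files from the normal samples.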

Knowns List

A list of known splice variants that are exempt from being filtered out of the final output (that is, the list operates as a whitelist). Provide the file with the --rna-splice-variant-knowns option. The file must be tab-separated and in the same format as SJ.out.tab with all 9 columns present, although only the first 4 columns are evaluated:

contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)
strand (0: undefined, 1: +, 2: -)

By default, the caller does not consider splice variant candidates that are found in the input annotation file, because it is looking for de novo variants. A candidate that is included in the knowns list is exempt from this rule and is not discarded. Note that some newer gene annotation models have added alternative transcripts that contain clinically relevant splice variants, which causes DRAGEN to skip reporting them.

To ensure these variants are reported when they are present in the annotation being used, pass them in a knowns file. The example below uses hg38 coordinates to specify the MET exon 14 skip, EGFRv3, and ARv7 alternative splicing events, respectively.

chr7	116771655	116774880	1	0	0	0	0	0
chr7	55019366	55155829	1	0	0	0	0	0
chrX	67686127	67694672	1	0	0	0	0	0
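
The interaction between the annotation, the normals list, and the knowns list can be modeled with a short Python sketch. It only restates the rules described above and is not DRAGEN's implementation. The function names load_junction_keys and keep_candidate are hypothetical.

```python
def load_junction_keys(path):
    """Read the first four columns of an SJ.out.tab-style file into a set."""
    keys = set()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            keys.add((fields[0], int(fields[1]), int(fields[2]), int(fields[3])))
    return keys


def keep_candidate(key, annotated, normals, knowns):
    """Return True if a candidate junction survives list-based filtering.

    A junction in the knowns list is always kept. Otherwise it is dropped
    when it is annotated in the GTF or when it appears in the normals list.
    """
    if key in knowns:
        return True
    return not annotated and key not in normals
```

For example, with the MET exon 14 entry above in the knowns set, keep_candidate returns True for that junction even when it is annotated or present in the normals set.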

Target Regions BED

A list of regions that called splice variants must fall within. Provide the file with the --rna-splice-variant-regions option. Splice variant candidates that are not within these regions are excluded. The file uses the BED file format with the following columns, except that the coordinates are 1-based.

chromosome id
start position (1-based)
end position (1-based)
region (i.e. gene) name
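
As an illustration of the region check, the Python sketch below parses such a file and tests whether a junction lies inside any region. It assumes that "within" means the junction is fully contained in a region and that both coordinates are 1-based and inclusive. The exact rule DRAGEN applies (containment versus overlap) is not stated here, so treat this as a model only. The function names are hypothetical.

```python
def parse_regions(lines):
    """Parse tab-separated region lines into (chrom, start, end, name) tuples."""
    regions = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        regions.append((chrom, int(start), int(end), name))
    return regions


def within_any_region(chrom, sj_start, sj_end, regions):
    """True if the junction is fully contained in at least one region.

    Assumption: 1-based inclusive coordinates on both ends.
    """
    return any(
        c == chrom and start <= sj_start and sj_end <= end
        for c, start, end, _ in regions
    )
```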

Splice Variant Output Files

The detected splice variants are output as two separate TSV files for the intragenic and intergenic candidates, and as a VCF for the intragenic candidates. The number of reads supporting the reference vs. the variant SJ are reported and used to score the candidate.

For a read to be considered as support for a SJ candidate it must meet the following criteria:

Must contain a splice junction (i.e. an alignment gap in the CIGAR containing skip ops).
Must have overhangs on either side of the skip that are at least 6 base pairs.
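
The two criteria above can be checked from a read's CIGAR string, as in the following Python sketch. It assumes that the overhang is the number of aligned bases (M, = and X operations) between the skip and the next skip or the end of the read, and that soft-clipped bases do not count. DRAGEN's exact overhang definition may differ. The function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MIN_OVERHANG = 6  # minimum aligned bases required on each side of the skip


def junction_overhangs(cigar):
    """Return a (left, right) aligned-base count for each N op in the CIGAR."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    overhangs = []
    for i, (_, op) in enumerate(ops):
        if op != "N":
            continue
        left = 0
        for length, other in reversed(ops[:i]):
            if other == "N":
                break  # stop at the previous junction
            if other in "M=X":
                left += length
        right = 0
        for length, other in ops[i + 1:]:
            if other == "N":
                break  # stop at the next junction
            if other in "M=X":
                right += length
        overhangs.append((left, right))
    return overhangs


def supports_junction(cigar):
    """True if any junction in the read has enough overhang on both sides."""
    return any(
        left >= MIN_OVERHANG and right >= MIN_OVERHANG
        for left, right in junction_overhangs(cigar)
    )
```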

Reads are classified by whether they are marked as PCR duplicates or not, and whether they are uniquely mapping (NH=1) or multi-mapping (NH>1). (See RNA-Seq BAM Tags for more information.) For a splice variant to be reported, at least one deduplicated uniquely mapping read supporting it must be found.

Splice Variant TSV Files

The two TSV output files are named:

<output-file-prefix>.splice_variants.tsv which contains the intragenic alt splice junctions that result in transcript variants
<output-file-prefix>.splice_variant_fusions.tsv which contains the intergenic alt splice junctions that cause fusions across genes

Each detected splice junction contains the following columns:

gene_start - Gene name(s) at the start of the SJ. Multiple genes are separated by a semicolon
gene_end - Gene name(s) at the end of the SJ. Multiple genes are separated by a semicolon
chromosome - Chromosome containing the SJ
start - SJ's start position (1-based genomic coordinate)
end - SJ's end position (1-based genomic coordinate)
filter - A flag determining whether the splice variant passes the minimum score threshold. The minimum score threshold can be set using rna-splice-variant-min-score. The default value is 0.5.
strand - Detected strand of the SJ (+ or -)
motif - Intron motif, 0: noncanonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
annotated - "TRUE" if annotated in the reference GTF, otherwise "FALSE"
unique_dedup_ref_reads - The number of deduplicated uniquely mapping reads that support the reference SJ. A read is considered to be "uniquely mapping" only if NH=1.
unique_total_ref_reads - Total number of uniquely mapping reads that support the reference SJ, both duplicate and deduplicated.
multi_dedup_ref_reads - The number of deduplicated multi-mapping reads that support the reference SJ. Reads are considered multi-mapping if NH>1.
multi_total_ref_reads - Total number of multi-mapping reads that support the reference SJ, both duplicate and deduplicated.
unique_dedup_alt_reads - The number of deduplicated uniquely mapping reads that support the candidate SJ.
unique_total_alt_reads - Total number of uniquely mapping reads that support the candidate SJ, both duplicate and deduplicated.
multi_dedup_alt_reads - The number of deduplicated multi-mapping reads that support the candidate SJ.
multi_total_alt_reads - Total number of multi-mapping reads that support the candidate SJ, both duplicate and deduplicated.
high_qual_unique_dedup_alt_reads - Number of uniquely mapping deduplicated reads that support the candidate SJ and have MAPQ higher than a threshold determined by the option rna-splice-variant-min-mapq. The default value for the MAPQ threshold is 35.
max_mapQ_ref - Maximum MAPQ of deduplicated reads uniquely mapping to the reference SJ. If no reads, the value will be zero.
max_mapQ_alt - Maximum MAPQ of deduplicated reads uniquely mapping to the candidate SJ. If no reads, the value will be zero.
avg_mapQ_ref - Average MAPQ of deduplicated reads uniquely mapping to the reference SJ. If no reads, the value will be zero.
avg_mapQ_alt - Average MAPQ of deduplicated reads uniquely mapping to the candidate SJ. If no reads, the value will be zero.
max_spliced_alignment_overhang - Maximum spliced alignment overhang from all uniquely mapping reads supporting the candidate SJ.
normalized_overhang - max_spliced_alignment_overhang normalized by maximum read length.
score - The candidate SJ score (ranging from 0 to 1). This score is calculated from a pre-trained ML model.
read_through - Intergenic output only. The value is "1" if the splice variant is a read-through (between adjacent genes), otherwise "0".

Note:

In the intragenic output file containing transcript variant splice junctions, the gene_start and gene_end columns must match.
In the intergenic output file containing fusions from splice junctions, the gene_start and gene_end columns must be different.
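
For downstream analysis, the TSV files can be loaded with a few lines of Python. The sketch below assumes that each file starts with a header row containing the column names listed above, which should be verified against real output. It selects rows by the score column rather than the filter column, because the filter column's exact values are not listed here. The function names are hypothetical.

```python
import csv


def load_passing(path, min_score=0.5):
    """Return rows with score >= min_score, highest score first.

    Assumption: the file has a header row with the documented column names.
    """
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    passing = [row for row in rows if float(row["score"]) >= min_score]
    return sorted(passing, key=lambda row: float(row["score"]), reverse=True)


def is_intragenic(row):
    """True if the junction starts and ends in the same gene(s)."""
    return row["gene_start"] == row["gene_end"]
```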

Splice Variant VCF File

This file contains the detected intragenic splice junction variants that are not filtered out. It is written as a compressed VCF file named <output-file-prefix>.splice_variants.vcf.gz, in which each splice variant candidate is a one-line VCF record containing the fields below:

CHROM - Chromosome of the splice
POS - SJ start position (1-based) i.e. first base of intron
ID - "." (unused)
REF - Base from the reference genome FASTA at the SJ start position
ALT - "<DEL>"
QUAL - The junction score from 0.0 - 1.0
FILTER - Semicolon-separated list of filters: LowQ and LowUniqueAlignments
INFO - See the possible Info fields below
FORMAT - AD:DP
SAMPLE - Counts for {unique_dedup_alt_reads}:{unique_dedup_ref_reads}

The following lines of the VCF header describe columns 5 to 10 (the last 6 columns):

##ALT=<ID=DEL,Description="Deletion">
##QUAL=<Description="QUAL score correlates support for the read count of splice junctions, not Phred-scaled">
##FILTER=<ID=LowQ,Description="Indicates the variant has a quality score below the passing threshold.">
##FILTER=<ID=LowUniqueAlignments,Description="Indicates the variant has a number of supporting reads below the passing threshold.">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=ALTDEDUP,Number=1,Type=Integer,Description="Reads split across deletion. Uniquely mapping, no duplicate reads">
##INFO=<ID=ALTDUP,Number=1,Type=Integer,Description="Reads split across deletion. Uniquely mapping including duplicate reads">
##INFO=<ID=REFDEDUP,Number=1,Type=Integer,Description="Reads across deletion region which do not support deletion. Uniquely mapping, no duplicate reads.">
##INFO=<ID=REFDUP,Number=1,Type=Integer,Description="Reads across deletion region which do not support deletion. Uniquely mapping including duplicate reads.">
##INFO=<ID=INTERGENIC,Number=0,Type=Flag,Description="Indicates that this splice variant may be an intergenic fusion.">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="SpliceSupportingReads: Reads across splice varaint region which do not support deletion. Uniquely mapping, no duplicate reads.">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="ReferenceReads: reads split across splice variant. Uniquely mapping, no duplicate reads.">

All splice variants are reported as SV DEL events. For example:

#CHROM	POS	    ID    REF	ALT    QUAL    FILTER    INFO    FORMAT    SAMPLE
chr7    55087058     .     G    <DEL>    0.96    PASS    SVTYPE=DEL;END=55223522;ALTDEDUP=22;ALTDUP=1644;REFDEDUP=82;REFDUP=5369     SR    82,22  sample1
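
The INFO column of such a record can be unpacked with a small Python helper. This is a minimal sketch that relies only on the INFO keys declared in the header lines above. The name parse_info is hypothetical, and a full VCF library may be preferable in production use.

```python
# INFO keys declared as Type=Integer in the header lines above.
INTEGER_KEYS = {"END", "ALTDEDUP", "ALTDUP", "REFDEDUP", "REFDUP"}


def parse_info(info):
    """Split a VCF INFO string into a dict; flags map to True."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            parsed[key] = True  # flag such as INTERGENIC
        elif key in INTEGER_KEYS:
            parsed[key] = int(value)
        else:
            parsed[key] = value
    return parsed
```

Applied to the INFO string of the example record, the helper returns END as the integer 55223522 and ALTDEDUP as 22.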

Note on Filter Thresholds

The passing thresholds for the LowQ and LowUniqueAlignments filters are fixed to the settings below.

| Filter | Description |
| --- | --- |
| LowQ | Score < 0.5 |
| LowUniqueAlignments | Unique supporting read count < 2 for bulk RNA or < 10 for panel* |

* Panel is determined by setting the rna-splice-variant-regions option.
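
The threshold rules can be expressed as a short Python function. This sketch assumes that the "unique supporting read count" is the number of deduplicated uniquely mapping reads supporting the candidate, and that a record with no failing filter is marked PASS as in the example record. The function name is hypothetical.

```python
def splice_variant_filters(score, unique_alt_reads, is_panel, min_score=0.5):
    """Return the FILTER string for a candidate under the documented thresholds.

    is_panel models whether rna-splice-variant-regions was set.
    """
    min_reads = 10 if is_panel else 2
    filters = []
    if score < min_score:
        filters.append("LowQ")
    if unique_alt_reads < min_reads:
        filters.append("LowUniqueAlignments")
    return ";".join(filters) or "PASS"
```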

Merging Splice Variants with the Gene Fusion Caller

When the splice variant caller and gene fusion caller are both enabled, the passing and failed intergenic splice variants will be passed to the gene fusion caller to be scored and merged into the relevant fusion output VCF and TSV files.

The passing calls are added to the fusion caller's <output-file-prefix>.fusion_candidates.final and <output-file-prefix>.fusion_candidates.vcf.gz files. In fusion_candidates.vcf.gz, the variant has the "SPLICE_VARIANT" flag in the INFO field. The tab-separated fields of the fusion_candidates.final file are described below.

| Field Names | Description |
| --- | --- |
| FusionGene | Left and right gene names (separated by "--") |
| Score | Value between 0 and 1 |
| LeftBreakpoint, RightBreakpoint | The location of the left and right sides of the splice, as three colon-separated fields: chromosome:coordinate:strand(+/-) |
| Gene1Location, Gene2Location | The splice variant caller always outputs "SpliceVar" here instead of the Exon/Intron location |
| Gene1Sense, Gene2Sense | Always TRUE by design |
| Gene1Id, Gene2Id | Long form ID (i.e. for Gencode it is usually "ENSG.version") |
| NumSplitReads | Taken from the dedupUniqueSupportingReads count (i.e. split_unique_reads_alt column value) |
| NumSoftClippedReads, NumPairedReads | Not used by the RNA splice variant caller and set to '0' |
| ReadNames | Not provided by this caller and set to 'N/A' |
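
The FusionGene and breakpoint fields described above are simple delimited strings, so they can be unpacked as in the following Python sketch. The names Breakpoint, parse_breakpoint, and split_fusion_gene are hypothetical, and the example values in the usage note are illustrative only.

```python
from typing import NamedTuple


class Breakpoint(NamedTuple):
    chrom: str
    pos: int
    strand: str


def parse_breakpoint(text):
    """Parse 'chromosome:coordinate:strand' into a Breakpoint.

    Splitting from the right keeps contig names that contain ':' intact.
    """
    chrom, pos, strand = text.rsplit(":", 2)
    return Breakpoint(chrom, int(pos), strand)


def split_fusion_gene(name):
    """Split a FusionGene value such as 'AJM1--PHPT1' into (left, right)."""
    left, right = name.split("--", 1)
    return left, right
```

For instance, parse_breakpoint("chr7:55019365:+") yields a Breakpoint with chrom "chr7", pos 55019365, and strand "+".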

By default, intergenic splice variants on adjacent genes are not passed to the gene fusion caller. To enable read-through splice variant fusions, use rna-splice-variant-enable-readthrough=true.

A list of known splice variant fusions can be given to the splice variant caller to be passed to the gene fusion caller. For an intergenic fusion, enter the two gene names on one line. For an intragenic splice variant, enter one gene name on one line. Pass the file with the option rna-splice-variant-fusion-genes. Below is an example of a file with intragenic splice variant fusions on EGFR and an intergenic splice variant fusion between AJM1 and PHPT1. Note that the gene names must be present in the annotation file. Lines starting with the # symbol are ignored.

# Force the below splice variant genes (intra and inter genic) to fusion caller 
EGFR
AJM1 PHPT1
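
A file in this format can be validated before a run with a small Python sketch. It assumes that gene names on a line are separated by whitespace, as in the example above. The function name parse_fusion_genes is hypothetical.

```python
def parse_fusion_genes(lines):
    """Split fusion-gene lines into intragenic genes and intergenic pairs."""
    intragenic, intergenic = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # blank lines and comments are ignored
        genes = text.split()
        if len(genes) == 1:
            intragenic.append(genes[0])
        elif len(genes) == 2:
            intergenic.append((genes[0], genes[1]))
        else:
            raise ValueError(f"line {number}: expected 1 or 2 gene names")
    return intragenic, intergenic
```

Checking the returned gene names against the gene names in the GTF is a useful extra step, because the names must be present in the annotation file.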

List of RNA Splice Variant Options

| Option | Description | Type | Default Value |
| --- | --- | --- | --- |
| enable-rna-splice-variant | Enable the RNA splice variant caller | true/false | false |
| rna-splice-variant-knowns | Candidate (expected) splice junctions | string (path to file) | None |
| rna-splice-variant-normals | Normal (non-variant) splice junctions | string (path to file) | None |
| rna-splice-variant-fusion-genes | List of hotspot genes that may contain spliced fusions | string (path to file) | None |
| rna-splice-variant-regions | List of regions that splice junctions must overlap (used for panels) | string (path to file) | None |
| rna-splice-variant-min-score | Score threshold for filtering RNA splice variant candidates | number between 0 and 1 | 0.5 |
| rna-splice-variant-enable-readthrough | Enable the calling of splice variants resulting in fusion of adjacent genes on the same strand | true/false | false |
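
When runs are launched from a script, the options above can be assembled into command-line arguments programmatically. The Python sketch below uses only the option names listed in this section and the "--option value" form shown in the example command line. The helper name splice_variant_args and the file paths in the usage note are hypothetical.

```python
# Option names taken from the list of RNA splice variant options.
KNOWN_OPTIONS = {
    "enable-rna-splice-variant",
    "rna-splice-variant-knowns",
    "rna-splice-variant-normals",
    "rna-splice-variant-fusion-genes",
    "rna-splice-variant-regions",
    "rna-splice-variant-min-score",
    "rna-splice-variant-enable-readthrough",
}


def splice_variant_args(options):
    """Turn an {option: value} dict into a flat list of CLI arguments."""
    args = []
    for name, value in options.items():
        if name not in KNOWN_OPTIONS:
            raise ValueError(f"unknown splice variant option: {name}")
        if isinstance(value, bool):
            value = "true" if value else "false"
        args += [f"--{name}", str(value)]
    return args
```

The returned list can be appended to the rest of the dragen arguments, for example splice_variant_args({"enable-rna-splice-variant": True, "rna-splice-variant-normals": "Normals.txt"}).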
