# Splice Variant Caller

DRAGEN calls splice variants by taking advantage of its fast and highly accurate splice-aware read mapper/aligner that aligns reads to the whole genome to identify novel alternative Splice Junction (SJ) candidates. These candidates can be filtered by additional information provided such as a "normals list" and a "target regions list", or whitelisted with a "knowns list". The "splice fusion genes" file allows intergenic or intragenic splice variants to be reported in the RNA Gene Fusion output.

During the read sorting phase, evidence for these alternative splice variant candidates vs. reference splicing is accumulated. Then, the candidates are scored based on the accumulated read evidence and the results are written to TSV and VCF files for downstream tertiary analysis.

To enable the RNA Splice Variant caller, set `--enable-rna-splice-variant=true`. The following is an example command line for a whole transcriptome sequencing (WTS) dataset.

```
dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
-a <GTF_FILE> \
--output-dir <OUT_DIRECTORY> \
--output-file-prefix <PREFIX> \
--RGID <READ_GROUP_ID> \
--RGSM <SAMPLE_NAME> \
--enable-rna true \
--enable-rna-splice-variant true \
--enable-duplicate-marking true 
```

For RNA panels, use `--rna-enriched-regions` (BED format) or `--rna-enriched-genes` (text file with one gene per line) to provide the targeted gene information.

### Splice Variant Optional Input Files

In addition to the required inputs listed in the example above (paired FASTQ reads, reference hashtable, and annotation), the following optional input resource files can be provided. The normals, knowns, and target regions lists improve precision by reducing the false positive (FP) count.

#### Normals List

A list of normal splice variants, supplied with the `--rna-splice-variant-normals` option. These variants are filtered out of the final output (that is, the list operates as a blacklist) unless they also appear in the "knowns" list.

The file must be tab-separated and follow the same format as SJ.out.tab, except that only the first 4 columns are used:

1. contig name
2. first base of the splice junction (1-based)
3. last base of the splice junction (1-based)
4. strand (0: undefined, 1: +, 2: -)

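To create a Normals list file, a collection of DRAGEN RNA mapper output **SJ.out.tab** files for at least 30 samples can be used along with a simple script to process all the SJs in these files. The Python sketch below describes the function of such a script. It assumes that column 7 of SJ.out.tab holds the uniquely mapping read count.

```python
import csv
import sys
from collections import Counter

MIN_UNIQUE_READS = 3  # minimum uniquely mapping reads for an SJ to count in a sample
MIN_OCCURRENCE = 2    # minimum number of samples in which the SJ must be seen


def generate_normals(sj_out_tab_files, normals_path="Normals.txt"):
    """Build a normals list from a collection of SJ.out.tab files."""
    sample_count = Counter()  # (contig, start, end, strand) -> number of samples

    # Count, for each SJ, the samples that support it with enough unique reads.
    for path in sj_out_tab_files:
        seen = set()  # an SJ is counted at most once per sample
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if not row:
                    continue  # skip blank lines
                key = (row[0], int(row[1]), int(row[2]), int(row[3]))
                # Assumption: column 7 is the uniquely mapping read count.
                if int(row[6]) >= MIN_UNIQUE_READS:
                    seen.add(key)
        sample_count.update(seen)

    # Keep any SJ that occurs in enough samples.
    normal_sj = sorted(sj for sj, n in sample_count.items() if n >= MIN_OCCURRENCE)

    # Write the 4 key columns, padded with 5 unused zero fields (9 columns).
    with open(normals_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for sj in normal_sj:
            writer.writerow([*sj, 0, 0, 0, 0, 0])


if __name__ == "__main__":
    generate_normals(sys.argv[1:])
```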

#### Knowns List

A list of known splice variants, supplied with the `--rna-splice-variant-knowns` option. These variants are exempt from being filtered out of the final output (that is, the list operates as a whitelist). The file must be tab-separated and follow the same format as SJ.out.tab, with all 9 columns present, although only the first 4 columns are evaluated:

1. contig name
2. first base of the splice junction (1-based)
3. last base of the splice junction (1-based)
4. strand (0: undefined, 1: +, 2: -)

By default, the caller does not consider splice variant candidates that are present in the input annotation file, because it is looking for de novo variants. A candidate is exempt from this rule if it is included in the *knowns* list, which directs the caller not to discard it. Note that some newer gene annotation models have added alternative transcripts that contain clinically relevant splice variants, which causes DRAGEN to skip reporting them.

To ensure that these variants are reported, pass a *knowns* file containing them whenever they are present in the annotation being used. The example below uses hg38 coordinates to specify the *MET exon 14 skip*, *EGFRv3*, and *ARv7* alternative splicing events, respectively.

```
chr7	116771655	116774880	1	0	0	0	0	0
chr7	55019366	55155829	1	0	0	0	0	0
chrX	67686127	67694672	1	0	0	0	0	0
```
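
A file in this layout can be generated with a few lines of Python. The following is a minimal sketch, not part of DRAGEN: the helper name `format_knowns_line` is hypothetical, and the strand codes follow the column description above.

```python
# Minimal sketch: format junctions as 9-column, tab-separated knowns lines.
# The helper name is hypothetical. Only the first 4 columns are evaluated,
# so the remaining 5 are padded with zeros, as in the example above.
STRAND_CODES = {".": 0, "+": 1, "-": 2}


def format_knowns_line(contig, first_base, last_base, strand):
    """Return one knowns-list line for a junction given in 1-based coordinates."""
    if first_base > last_base:
        raise ValueError("first_base must not exceed last_base")
    fields = [contig, first_base, last_base, STRAND_CODES[strand], 0, 0, 0, 0, 0]
    return "\t".join(str(f) for f in fields)


# The three hg38 events listed above.
knowns = [
    ("chr7", 116771655, 116774880, "+"),  # MET exon 14 skip
    ("chr7", 55019366, 55155829, "+"),    # EGFRv3
    ("chrX", 67686127, 67694672, "+"),    # ARv7
]
lines = [format_knowns_line(*sj) for sj in knowns]
print("\n".join(lines))
```

Writing `lines` to a text file gives an input that can be passed with `--rna-splice-variant-knowns`.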

#### Splice Fusion Genes

A list of genes or gene pairs can be defined with the option `--rna-splice-variant-fusion-genes`. Splice variants detected in these genes or gene pairs by the splice variant caller are passed to the RNA Gene Fusion output. Note that the gene names must be present in the annotation file. Lines starting with the `#` symbol are ignored. The default list of splice fusions is:

```
EGFR
MET
BC039389	GATM
BCL2L2	PABPN1
CHFR	GOLGA3
CTSC	RAB38
CTSD	IFITM10
D2HGDH	GAL3ST2
DUS4L	BCAP2
DUS4L	BCAP29
EIF3K	ACTN4
INS	IGF2
JMJD7	PLA2G4B
KLK4	KRSP1
LHX6	NDUFA8
NFATC3	PLA2G15
PPP1R1B	STARD3
RRM2	C2orf48
SCNN1A	TNFRSF1A
SLC2A11	MIF
SLC45A3	ELK4
STX16	NPEPL1
SYT8	TNNI2
TMED6	COG8
TSNAX	DISC1
```

The file above is based on clinically recurrent splice fusions and indicates that:

* any intragenic splice variant in the EGFR gene, or any intergenic splice fusion in which one of the genes is EGFR, is passed to the RNA Gene Fusion component;
* any intragenic splice variant in the MET gene, or any intergenic splice fusion in which one of the genes is MET, is passed to the RNA Gene Fusion component;
* all intergenic splice fusions between BC039389 and GATM are passed to the RNA Gene Fusion component; and so on.

For more details see Section [Merging Splice Variants with the Gene Fusion Caller](#merging-splice-variants-with-the-gene-fusion-caller).
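
The matching rules described above can be sketched in Python as follows. This is an illustration rather than DRAGEN's implementation. It assumes that fields are separated by whitespace, that blank lines can be skipped, and that the order of the two genes in a pair does not matter. The function names are hypothetical.

```python
# Sketch of the single-gene and gene-pair matching rules described above.
# Assumptions: whitespace-separated fields, unordered gene pairs.

def parse_fusion_genes(text):
    """Split the file content into single-gene entries and gene-pair entries."""
    singles, pairs = set(), set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment lines are ignored; blank lines skipped (assumed)
        genes = line.split()
        if len(genes) == 1:
            singles.add(genes[0])
        elif len(genes) == 2:
            pairs.add(frozenset(genes))
        else:
            raise ValueError(f"unexpected line: {line!r}")
    return singles, pairs


def is_listed(gene_start, gene_end, singles, pairs):
    """Return True if a splice variant joining the two genes matches the list."""
    if gene_start in singles or gene_end in singles:
        return True  # intragenic or intergenic event involving a single-gene entry
    # Pair entries only describe intergenic events between the two named genes.
    return gene_start != gene_end and frozenset((gene_start, gene_end)) in pairs


singles, pairs = parse_fusion_genes("# hotspot genes\nEGFR\nMET\nINS\tIGF2\n")
print(is_listed("EGFR", "EGFR", singles, pairs))
print(is_listed("INS", "IGF2", singles, pairs))
```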

#### Target Regions BED or text file

If the RNA dataset consists of a gene panel, the gene names or amplified regions can be passed using `--rna-enriched-genes` or `--rna-enriched-regions`. The two options cannot be used together.

To pass a list of genes (one per line), use `--rna-enriched-genes`. Only splice variants within these genes are reported. For an intergenic splice variant to be reported, at least one of its genes must be enriched.

To pass a list of regions, use the `--rna-enriched-regions` option. Splice variant candidates that are not within these regions are excluded. The file uses the BED layout with the following columns, except that the coordinates are 1-based:

1. chromosome id
2. start position (1-based)
3. end position (1-based)
4. region (i.e. gene) name
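
Because standard BED files use 0-based, end-exclusive coordinates, an existing panel BED file may need its start positions shifted to match the 1-based layout above. The following is a minimal sketch under that assumption. The function name and the example coordinates are hypothetical.

```python
# Sketch: convert a standard BED line (0-based start, end-exclusive) to the
# 1-based, inclusive layout described above. Only the first 4 columns are kept.

def bed_to_one_based(line):
    """Return 'chrom<TAB>start<TAB>end<TAB>name' with a 1-based start."""
    chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
    # A 0-based, end-exclusive interval [start, end) equals the 1-based,
    # inclusive interval [start + 1, end], so only the start changes.
    return "\t".join([chrom, str(int(start) + 1), end, name])


# Hypothetical region, for illustration only.
print(bed_to_one_based("chr7\t55019016\t55211628\tEGFR\n"))
```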

### Splice Variant Output Files

The detected splice variants are output as two separate TSV files for the intragenic (within one gene) and intergenic (between two genes) candidates, and as a VCF for the intragenic candidates. The numbers of reads supporting the reference SJ and the variant SJ are reported and used to score the candidate.

For a read to be considered as support for a SJ candidate it must meet the following criteria:

1. Must contain a splice junction (i.e. an alignment gap represented by a skip (`N`) operation in the CIGAR).
2. Must have overhangs on either side of the skip that are at least 6 base pairs.
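
The two criteria above can be illustrated with a short CIGAR check in Python. This is a sketch only. It assumes that the overhang of a junction is the number of aligned (`M`, `=`, `X`) bases between that skip and the neighboring skip or read end, which may differ from how DRAGEN measures it. The function names are hypothetical.

```python
import re

MIN_OVERHANG = 6  # minimum overhang in bases, per the criteria above


def junction_overhangs(cigar):
    """Return (left, right) aligned-base counts flanking each N op in the CIGAR."""
    segments = [0]  # aligned bases in each segment between skips
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op == "N":
            segments.append(0)  # a skip starts a new segment
        elif op in "M=X":
            segments[-1] += int(length)
    return list(zip(segments, segments[1:]))


def supported_junctions(cigar, min_overhang=MIN_OVERHANG):
    """Flag, for each junction in the read, whether both overhangs are long enough."""
    return [min(left, right) >= min_overhang
            for left, right in junction_overhangs(cigar)]


print(supported_junctions("50M1000N51M"))
print(supported_junctions("5M1000N96M"))
```

A read without any `N` operation yields an empty list and therefore supports no junction.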

Reads are classified by whether they are marked as PCR duplicates or not, and whether they are uniquely mapping (NH=1) or multi-mapping (NH>1). (See [RNA-Seq BAM Tags](https://help.dragen.illumina.com/product-guides/dragen-v4.5/rna-alignment#rna-seq-bam-tags) for more information.) For a splice variant to be reported, at least one deduplicated uniquely mapping read supporting it must be found.

#### Splice Variant TSV Files

The two TSV output files are named:

* **output-file-prefix.splice\_variants.tsv** which contains the *intragenic* alt splice junctions that result in transcript variants
* **output-file-prefix.splice\_variant\_fusions.tsv** which contains the *intergenic* alt splice junctions that cause fusions across genes

Each row describes one detected splice junction and contains the following columns:

1. **gene\_start** - Gene name(s) at the start of the SJ. Multiple genes are separated by a semicolon
2. **gene\_end** - Gene name(s) at the end of the SJ. Multiple genes are separated by a semicolon
3. **chromosome** - Chromosome containing the SJ
4. **start** - SJ's start position (1-based genomic coordinate)
5. **end** - SJ's end position (1-based genomic coordinate)
6. **filter** - A list of filters separated by semicolon. See Section [Splice Variant Filters](#splice-variant-filters).
7. **strand** - Detected strand of the SJ (+ or -)
8. **motif** - Intron motif, 0: noncanonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
9. **annotated** - "TRUE" if annotated in the reference GTF, otherwise "FALSE"
10. **unique\_dedup\_ref\_reads** - The number of deduplicated *uniquely* mapping reads that support the reference SJ. A read is considered to be *"uniquely mapping"* only if NH=1.
11. **unique\_total\_ref\_reads** - Total number of *uniquely* mapping reads that support the reference SJ, both duplicate and deduplicated.
12. **multi\_dedup\_ref\_reads** - The number of deduplicated multi-mapping reads that support the reference SJ. Reads are considered multi-mapping if NH>1.
13. **multi\_total\_ref\_reads** - Total number of multi-mapping reads that support the reference SJ, both duplicate and deduplicated.
14. **unique\_dedup\_alt\_reads** - The number of deduplicated *uniquely* mapping reads that support the candidate SJ.
15. **unique\_total\_alt\_reads** - Total number of *uniquely* mapping reads that support the candidate SJ, both duplicate and deduplicated.
16. **multi\_dedup\_alt\_reads** - The number of deduplicated multi-mapping reads that support the candidate SJ.
17. **multi\_total\_alt\_reads** - Total number of multi-mapping reads that support the candidate SJ, both duplicate and deduplicated.
18. **high\_qual\_unique\_dedup\_alt\_reads** - Number of uniquely mapping deduplicated reads that support the candidate SJ and have MAPQ higher than a threshold determined by the option `rna-splice-variant-min-mapq`. The default value for the MAPQ threshold is 35.
19. **max\_mapQ\_ref** - Maximum MAPQ of deduplicated reads uniquely mapping to the reference SJ. If no reads, the value will be zero.
20. **max\_mapQ\_alt** - Maximum MAPQ of deduplicated reads uniquely mapping to the candidate SJ. If no reads, the value will be zero.
21. **avg\_mapQ\_ref** - Average MAPQ of deduplicated reads uniquely mapping to the reference SJ. If no reads, the value will be zero.
22. **avg\_mapQ\_alt** - Average MAPQ of deduplicated reads uniquely mapping to the candidate SJ. If no reads, the value will be zero.
23. **max\_spliced\_alignment\_overhang** - Maximum spliced alignment overhang from all uniquely mapping reads supporting the candidate SJ.
24. **normalized\_overhang** - `max_spliced_alignment_overhang` normalized by maximum read length.
25. **score** - The candidate SJ score (ranging from 0 to 1). This score is calculated from a pre-trained ML model.
26. **read\_through** - *Only for intergenic output* - This column will have value "1", if the splice variant is read through (adjacent genes on the same strand) or "0", otherwise.

Note:

* In the *intragenic* output file containing transcript variant splice junctions, the **gene\_start** and **gene\_end** columns must match.
* In the *intergenic* output file containing fusions from splice junctions, the **gene\_start** and **gene\_end** columns must be different.
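
The TSV files can be loaded with standard tooling. The following Python sketch assumes that the file starts with a header row using the column names listed above. The helper names are hypothetical, and the inline input contains only a subset of the columns for illustration.

```python
import csv
import io


def read_splice_variants(handle):
    """Yield one dict per splice junction from a tab-separated handle."""
    # Assumption: the first row is a header with the column names listed above.
    yield from csv.DictReader(handle, delimiter="\t")


def alt_fraction(row):
    """Fraction of deduplicated uniquely mapping reads supporting the candidate SJ."""
    alt = int(row["unique_dedup_alt_reads"])
    ref = int(row["unique_dedup_ref_reads"])
    total = alt + ref
    return alt / total if total else 0.0


# Illustrative input with a subset of the columns only.
example = io.StringIO(
    "gene_start\tgene_end\tchromosome\tstart\tend\t"
    "unique_dedup_ref_reads\tunique_dedup_alt_reads\n"
    "EGFR\tEGFR\tchr7\t55019366\t55155829\t77\t96\n"
)
rows = list(read_splice_variants(example))
for row in rows:
    print(row["gene_start"], row["start"], row["end"], round(alt_fraction(row), 3))
```

For a real run, replace the `io.StringIO` object with an open handle to `<output-file-prefix>.splice_variants.tsv`.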

#### Splice Variant VCF File

The candidate intragenic splice variants are reported in a compressed VCF file named `<output-file-prefix>.splice_variants.vcf.gz`, in which each splice variant candidate is written as a one-line VCF record describing an SV DEL event. The Splice Variant VCF output contains the following fields:

* CHROM - Chromosome of the splice
* POS - SJ start position (1-based) i.e. first base of intron
* ID - "." (unused)
* REF - Base from the reference genome FASTA at the SJ start position
* ALT - Always `<DEL>`
* QUAL - The junction score on the Phred scale
* FILTER - Semicolon separated list of filters
* INFO - See the possible Info fields below
* FORMAT - SR (supporting reads for ref and alt)
* SAMPLE - Counts for {unique\_dedup\_ref\_reads},{unique\_dedup\_alt\_reads}

The VCF header is given below:

```
##ALT=<ID=DEL,Description="Deletion">
##QUAL=<Description="QUAL score correlates support for the read count of splice junctions (Phred-scaled)">
##FILTER=<ID=LOW_SCORE,Description="Indicates the variant has a quality score below the passing threshold.">
##FILTER=<ID=LOW_UNIQUE_ALIGNMENTS,Description="Number of uniquely mapping dedup reads supporting the variant is below the passing threshold.">
##FILTER=<ID=MIN_SUPPORT,Description="Number of reads supporting variant is below the passing threshold.">
##FILTER=<ID=ANCHOR_SUPPORT,Description="Number of bases covered by supporting reads around breakpoint of variant is below the passing threshold">
##FILTER=<ID=READ_THROUGH,Description="Splice is variant is between adjacent genes.">
##FILTER=<ID=NON_PROTEIN_CODING,Description="All genes in splice variant were non protein-coding.">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=ALTDEDUP,Number=1,Type=Integer,Description="Reads split across deletion. Uniquely mapping, no duplicate reads">
##INFO=<ID=ALTDUP,Number=1,Type=Integer,Description="Reads split across deletion. Uniquely mapping including duplicate reads">
##INFO=<ID=REFDEDUP,Number=1,Type=Integer,Description="Reads across deletion region which do not support deletion.Uniquely mapping, no duplicate reads.">
##INFO=<ID=REFDUP,Number=1,Type=Integer,Description="Reads across deletion region which do not support deletion. Uniquely mapping including duplicate reads.">
##INFO=<ID=GENE_NAMES,Number=1,Type=String,Description="Overlapping gene names, separated by '|'.">
##INFO=<ID=GENE_IDS,Number=1,Type=String,Description="Overlapping gene ids, separated by '|'.">
##FORMAT=<ID=SR,Number=2,Type=Integer,Description="Split reads across splice variant region for the REF and ALT alleles in the order listed. Uniquely mapping, no duplicate reads.">
```

For example:

```
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
chr7	55019365	.	G	<DEL>	7.8929	PASS	SVTYPE=DEL;END=55155829;ALTDEDUP=96;ALTDUP=167;REFDEDUP=77;REFDUP=353;GENE_NAMES=EGFR;GENE_IDS=ENSG00000146648.21	SR	77,96
```
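
A record such as the one above can be split into its parts with plain string handling. The following Python sketch assumes a single-sample VCF whose only FORMAT key is `SR`. The function name is hypothetical.

```python
def parse_splice_vcf_record(line):
    """Parse one tab-separated splice variant VCF record into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, _ref, _alt, qual, filt, info, _fmt, sample = fields[:10]
    # INFO is a semicolon-separated list of KEY=VALUE items.
    info_map = {key: value
                for key, _, value in (item.partition("=") for item in info.split(";"))}
    # SR lists the REF count first and the ALT count second.
    sr_ref, sr_alt = (int(x) for x in sample.split(","))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "end": int(info_map["END"]),
        "qual": float(qual),
        "filters": filt.split(";"),
        "genes": info_map.get("GENE_NAMES", "").split("|"),
        "sr_ref": sr_ref,
        "sr_alt": sr_alt,
        "info": info_map,
    }


# The example record shown above.
record = (
    "chr7\t55019365\t.\tG\t<DEL>\t7.8929\tPASS\t"
    "SVTYPE=DEL;END=55155829;ALTDEDUP=96;ALTDUP=167;REFDEDUP=77;REFDUP=353;"
    "GENE_NAMES=EGFR;GENE_IDS=ENSG00000146648.21\tSR\t77,96"
)
parsed = parse_splice_vcf_record(record)
print(parsed["chrom"], parsed["pos"], parsed["end"], parsed["sr_ref"], parsed["sr_alt"])
```

In this record the `SR` values match `REFDEDUP` and `ALTDEDUP`, in that order.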

### Splice Variant Filters

The following filters are applied to splice variant candidates, either to flag low-confidence calls or for informational purposes only.

| **Filter**              | **Type**                  | **Description**                                                                                                                                                                        | **Option to adjust**                                                                                                                                     |
| ----------------------- | ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| LOW\_SCORE              | intragenic and intergenic | Score < minimum threshold (by default 0.5)                                                                                                                                             | `--rna-splice-variant-min-score`                                                                                                                         |
| MIN\_SUPPORT            | intragenic and intergenic | Supporting reads (unique\_dedup\_alt\_reads) is <3 if motif=0, or <2 if dist>50000, or <3 if dist>100000, or <4 if dist>200000 where distance is the distance between the breakpoints. |                                                                                                                                                          |
| LOW\_UNIQUE\_ALIGNMENTS | intragenic and intergenic | Unique supporting read count <2                                                                                                                                                        | `--rna-splice-variant-min-support-known` for known and annotated splice variants and `--rna-splice-variant-min-support-novel` for novel splice variants. |
| ANCHOR\_SUPPORT         | intragenic and intergenic | Requires an overhang of at least 30 bp if motif=0; otherwise at least 12 bp.                                                                                                           | -                                                                                                                                                        |
| NON\_PROTEIN\_CODING    | intragenic and intergenic | All genes in the splice variant are non-protein-coding.                                                                                                                                | To enforce this filter for intergenic splice variants, set `--rna-gf-restrict-genes=true`                                                                |
| LOW\_ALT\_TO\_REF       | Intergenic only           | REF support is at least 200 times the ALT support (by default)                                                                                                                         | `--rna-splice-fusion-max-ref-alt-ratio`                                                                                                                  |
| READ\_THROUGH           | Intergenic only           | Whether the 5' gene is downstream of the 3' gene                                                                                                                                       | `--rna-gf-report-read-through`                                                                                                                           |
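
The threshold-based rows of the table can be read as simple comparisons. The Python sketch below is one possible interpretation, not DRAGEN's implementation. It assumes that when several MIN\_SUPPORT conditions apply, the strictest one wins, and that the ANCHOR\_SUPPORT check uses the maximum spliced alignment overhang. The function names are hypothetical.

```python
def min_support_required(motif, distance):
    """Minimum unique_dedup_alt_reads implied by the MIN_SUPPORT row."""
    required = 3 if motif == 0 else 0  # noncanonical motifs need 3 reads
    for limit, reads in ((50_000, 2), (100_000, 3), (200_000, 4)):
        if distance > limit:
            required = max(required, reads)  # assumption: strictest rule wins
    return required


def confidence_filters(score, motif, distance, alt_reads, overhang, min_score=0.5):
    """Return the threshold-based filters that a candidate would trigger."""
    filters = []
    if score < min_score:
        filters.append("LOW_SCORE")
    if alt_reads < min_support_required(motif, distance):
        filters.append("MIN_SUPPORT")
    if overhang < (30 if motif == 0 else 12):
        filters.append("ANCHOR_SUPPORT")
    return filters


print(confidence_filters(score=0.4, motif=0, distance=60_000, alt_reads=2, overhang=20))
```

An empty list means that none of these three filters applies. The remaining filters depend on gene annotation and are not modeled here.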

### Merging Splice Variants with the Gene Fusion Caller

When the splice variant caller and gene fusion caller are both enabled, the passing intergenic splice variants are passed to the gene fusion caller and merged into the relevant fusion output VCF and TSV files.

![DRAGEN RNA splice fusion pipeline](https://25033470-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FG9szlFZupV6Q2DasL98y%2Fuploads%2Fgit-blob-5d17c7b41aeceb92a8ed1afd62b458c6abe6cf22%2Fdragen-rna-pipeline.splice_fusion_pipeline.png?alt=media)

The following splice variant calls will be passed to the gene fusion caller:

* All passing intergenic splice variants. By default, intergenic splice variants between adjacent genes on the same strand (read-through events) are not passed to the gene fusion caller. To enable read-through splice fusions, use `--rna-gf-report-read-through=true`. Many read-through fusions are present in mRNA, and only some of them are clinically relevant. The [Splice Fusion Genes input file](#splice-fusion-genes) lists the clinically relevant read-through fusions to be reported.
* All intergenic and intragenic splice variants matching the gene patterns in the Splice Fusion Genes input.

The **passing** calls are reported in the fusion caller's `<output-file-prefix>.fusion_candidates.final` and `<output-file-prefix>.fusion_candidates.vcf.gz` files. In `<output-file-prefix>.fusion_candidates.vcf.gz`, the variant has the "SPLICE\_VARIANT" flag in the INFO field. The tab-separated fields of the `fusion_candidates.final` file are described below.

| **Field Names**                     | **Description**                                                                                                          |
| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| FusionGene                          | Left and Right gene names (separated by "--")                                                                            |
| Score                               | Value between 0 and 1                                                                                                    |
| LeftBreakpoint, RightBreakpoint     | The location for left and right sides of the splice with three colon separated fields: chromosome:coordinate:strand(+/-) |
| Gene1Location, Gene2Location        | Splice Variant caller always outputs "**SpliceVar**" here instead of Exon/Intron location                                |
| Gene1Sense, Gene2Sense              | Always TRUE by design                                                                                                    |
| Gene1Id, Gene2Id                    | Long form ID (i.e. for Gencode it is usually "ENSG.version")                                                             |
| NumSplitReads                       | Taken from the *dedupUniqueSupportingReads* count (i.e. *unique\_dedup\_alt\_reads* column value)                        |
| NumSoftClippedReads, NumPairedReads | These values are not used by the splice variant caller and are set to '0'                                                |
| ReadNames                           | Not provided by this caller and set to 'N/A'                                                                             |
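
The composite fields in this table can be unpacked with a few lines of Python. The following sketch relies only on the separators described above ("--" for `FusionGene`, ":" for the breakpoints). The type and function names are hypothetical.

```python
from typing import NamedTuple


class Breakpoint(NamedTuple):
    chromosome: str
    coordinate: int
    strand: str


def parse_breakpoint(text):
    """Parse 'chromosome:coordinate:strand' into a Breakpoint."""
    # rsplit keeps contig names that themselves contain ':' intact.
    chromosome, coordinate, strand = text.rsplit(":", 2)
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand in {text!r}")
    return Breakpoint(chromosome, int(coordinate), strand)


def split_fusion_gene(name):
    """Split a FusionGene value such as 'INS--IGF2' into (left, right)."""
    left, right = name.split("--", 1)
    return left, right


# Illustrative values only.
print(split_fusion_gene("INS--IGF2"))
print(parse_breakpoint("chr7:55019365:+"))
```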

### List of All RNA Splice Variant Options

| Option                               | Description                                                                                    | Type                   | Default Value |
| ------------------------------------ | ---------------------------------------------------------------------------------------------- | ---------------------- | ------------- |
| enable-rna-splice-variant            | Enable the RNA splice variant caller                                                           | true/false             | false         |
| rna-splice-variant-knowns            | Expected splice junctions, call even if they are annotated                                     | string (path to file)  | None          |
| rna-splice-variant-normals           | Normal (non-variant) splice junctions, do not report if found                                  | string (path to file)  | None          |
| rna-splice-variant-fusion-genes      | List of hotspot genes that may contain splice fusions                                          | string (path to file)  | None          |
| rna-enriched-regions                 | List of regions that splice junctions must overlap (used for panels)                           | string (path to file)  | None          |
| rna-enriched-genes                   | List of genes that splice junctions must overlap (used for panels)                             | string (path to file)  | None          |
| rna-splice-variant-min-score         | Score threshold for filtering RNA splice variant candidates.                                   | number between 0 and 1 | 0.5           |
| rna-splice-variant-min-support-known | Minimum number of unique reads supporting known and annotated splice variant required to PASS. | Integer >= 0           | 2             |
| rna-splice-variant-min-support-novel | Minimum number of unique reads supporting novel splice variant required to PASS.               | Integer >= 0           | 2             |
| rna-splice-fusion-max-ref-alt-ratio  | Maximum ref to alt ratio for the spliced fusion required to PASS.                              | number >= 0            | 200           |
