# Gene Fusion Detection

The DRAGEN Gene Fusion module uses the DRAGEN RNA splice-aware aligner to detect supplementary (chimeric) alignments. These chimeric alignments, which are written to the `<output-file-prefix>.Chimeric.out.junction` file, are used to find potential gene fusion breakpoints, and read evidence is accumulated for the resulting fusion candidates. An ML model then scores the candidates. The ML scoring model is currently available for human samples only and does not support non-human reference genomes.

### Running DRAGEN Gene Fusion

You can run the DRAGEN Gene Fusion module together with a regular RNA-Seq map/align job. To enable the DRAGEN Gene Fusion module, set `--enable-rna-gene-fusion` to "true". The DRAGEN Gene Fusion module requires a gene annotations file in GTF or GFF format.

The following is an example command line for running an end-to-end WTS RNA-Seq experiment with RNA fusion detection.

```
dragen \
  -r <HASHTABLE> \
  -1 <FASTQ1> \
  -2 <FASTQ2> \
  -a <ANNOTATION_FILE> \
  --output-dir <OUT_DIRECTORY> \
  --output-file-prefix <OUTPUT_PREFIX> \
  --RGID <READ_GROUP_ID> \
  --RGSM <SAMPLE_NAME> \
  --enable-rna true \
  --enable-rna-gene-fusion true \
  --enable-duplicate-marking true 
```

To run an end-to-end targeted panel RNA-Seq experiment with RNA fusion detection, add the option `--rna-enriched-regions` or `--rna-enriched-genes`. See [Gene Fusion Options](#gene-fusion-options) for more details.
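For pipeline automation, the command line above, with the optional enrichment flag, can be assembled programmatically. The following is a minimal sketch and not part of DRAGEN: the helper name `build_fusion_command`, its parameters, and the file paths are hypothetical, and only flags shown in this section are used.

```python
import shlex


def build_fusion_command(hashtable, fastq1, fastq2, annotation, out_dir, prefix,
                         rgid, rgsm, enriched_genes=None, enriched_regions=None):
    """Return the dragen argument list for an RNA-Seq run with fusion detection.

    Hypothetical helper: it only strings together flags documented in this section.
    """
    # The two enrichment options cannot be provided together.
    if enriched_genes and enriched_regions:
        raise ValueError("use --rna-enriched-genes or --rna-enriched-regions, not both")
    cmd = [
        "dragen",
        "-r", hashtable,
        "-1", fastq1,
        "-2", fastq2,
        "-a", annotation,
        "--output-dir", out_dir,
        "--output-file-prefix", prefix,
        "--RGID", rgid,
        "--RGSM", rgsm,
        "--enable-rna", "true",
        "--enable-rna-gene-fusion", "true",
        "--enable-duplicate-marking", "true",
    ]
    # Targeted panels add exactly one of the enrichment options.
    if enriched_genes:
        cmd += ["--rna-enriched-genes", enriched_genes]
    if enriched_regions:
        cmd += ["--rna-enriched-regions", enriched_regions]
    return cmd


# Placeholder paths and names, for illustration only.
cmd = build_fusion_command(
    "/ref/hashtable", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "genes.gtf",
    "/out", "sample1", "rg1", "sample1", enriched_genes="panel_genes.txt",
)
print(shlex.join(cmd))
```

Returning a list rather than a string lets the result be passed directly to `subprocess.run` without shell quoting concerns.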

At the end of a run, DRAGEN outputs a summary of the detected gene fusion events, similar to the following example.

```
==================================================================
Completed DRAGEN Gene Fusion Detection
==================================================================
Chimeric alignments: 3072
Total fusion candidates: 259
Final fusion candidates: 223
```
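If the summary needs to be captured by a wrapper script, the counts can be pulled out with a small parser. This is a sketch that assumes the log lines look exactly like the example above (a label, a colon, and an integer); real log output may carry additional prefixes. The function name `parse_fusion_summary` is hypothetical.

```python
import re

SUMMARY = """\
==================================================================
Completed DRAGEN Gene Fusion Detection
==================================================================
Chimeric alignments: 3072
Total fusion candidates: 259
Final fusion candidates: 223
"""


def parse_fusion_summary(text):
    """Collect 'label: integer' lines into a dict, skipping banner lines."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^:=]+):\s*(\d+)\s*$", line)
        if match:
            counts[match.group(1).strip()] = int(match.group(2))
    return counts


counts = parse_fusion_summary(SUMMARY)
# Candidates removed between the total and final counts.
filtered_out = counts["Total fusion candidates"] - counts["Final fusion candidates"]
print(counts, filtered_out)
```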

### Gene Fusion Modes

| Option                         | Description                                                                                                  | Default |
| ------------------------------ | ------------------------------------------------------------------------------------------------------------ | ------- |
| `--rna-gf-report-intronic`     | Report fusion calls with breakpoints inside introns                                                          | True    |
| `--rna-gf-report-antisense`    | Report fusion calls with antisense and intronic breakpoints                                                  | False   |
| `--rna-gf-report-intergenic`   | Report fusion calls with breakpoints outside of genes, antisense and intronic breakpoints                    | False   |
| `--rna-gf-report-read-through` | Report read-through fusion calls (genes within the minimum cis distance set by `--rna-gf-min-cis-distance`)  | False   |
| `--rna-gf-ptd-genes`           | List of gene names where we allow PTD/ITD associated self fusions, separated by space                        | ""      |

#### ITD/PTD Event Calling

By default, the fusion caller filters out intragenic events to suppress false positives. Reporting of Partial Tandem Duplication (PTD) or Internal Tandem Duplication (ITD) events can be enabled for specific genes by passing a space-separated list of gene names to the `--rna-gf-ptd-genes` option, which is empty by default. For example, with the command-line option `--rna-gf-ptd-genes="KMT2A FGFR1"`, a detected event appears in the `<output-file-prefix>.fusion_candidates.final` file as follows (only the first four columns are shown):

```
KMT2A--KMT2A  0.99873  chr11:118482495:+  chr11:118468775:+
```
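To pick such self-fusions out of the TSV output, the first four columns can be parsed as sketched below. The function names are hypothetical, and the sketch assumes tab-separated fields in the column order FusionGene, Score, LeftBreakpoint, RightBreakpoint. Breakpoints are split from the right so that contig names containing colons remain intact.

```python
def parse_breakpoint(text):
    """Split '<Chromosome>:<Position>:<Strand>' into its three parts."""
    chrom, pos, strand = text.rsplit(":", 2)
    return chrom, int(pos), strand


def parse_candidate_row(line):
    """Parse the first four tab-separated columns of a fusion candidate row."""
    genes, score, left, right = line.rstrip("\n").split("\t")[:4]
    gene1, _, gene2 = genes.partition("--")
    return {
        "gene1": gene1,
        "gene2": gene2,
        "score": float(score),
        "left": parse_breakpoint(left),
        "right": parse_breakpoint(right),
        # PTD/ITD events are reported as a gene fused with itself.
        "is_self_fusion": gene1 == gene2,
    }


row = parse_candidate_row("KMT2A--KMT2A\t0.99873\tchr11:118482495:+\tchr11:118468775:+")
print(row)
```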

#### Reporting read-through fusions

Read-through gene fusions occur when neighboring genes are spliced together. These fusions are detected by the Splice Variant Caller as intergenic splice variants on adjacent genes and by default are not passed to the gene fusion caller. To detect them, enable the gene fusion and splice caller together with the following options:

```
 --enable-rna=true \
 --enable-rna-gene-fusion=true \
 --enable-rna-splice-variant=true \
 --rna-splice-variant-enable-readthrough=true
```

### Gene Fusion Output

The Gene Fusion module produces the following output files:

| Output filename                                       | Format     | Description                                                                    |
| ----------------------------------------------------- | ---------- | ------------------------------------------------------------------------------ |
| `<output-file-prefix>.fusion_metrics.csv`             | CSV format | Metrics summarizing the total number of fusions detected                       |
| `<output-file-prefix>.fusion_candidates.vcf.gz`       | VCF format | All candidate fusions in VCF format                                            |
| `<output-file-prefix>.final.fusion_candidates.vcf.gz` | VCF format | Candidate fusions with *PASS* filter in VCF format                             |
| `<output-file-prefix>.fusion_candidates.final`        | TSV format | Summary of candidate fusions with *PASS* filter in TSV format                  |
| `<output-file-prefix>.fusion_candidates.preliminary`  | TSV format | Summary of candidate fusions with *FAIL* filter in TSV format                  |
| `<output-file-prefix>.fusion_candidates.features.csv` | CSV format | All candidate fusions in CSV format with detail features, useful for debugging |
| `<output-file-prefix>.fusion_candidates.filter_info`  | TSV format | List of all filters applied to the fusion candidates                           |

The columns and details of the files are described below.

#### `<output-file-prefix>.fusion_candidates.features.csv`

This file lists all detected gene fusion events together with the features computed for each candidate. The specific features and column values are subject to change in future DRAGEN versions as more RNA data is analyzed. The current columns in the features.csv output are:

* *FusionGene*: Parent gene names (in 5' to 3' order of transcript) participating in the fusion; hereafter referred to as Gene 1 and Gene 2.
* *Score*: Fusion call confidence score predicted by the ML model. If the ML model is used, the score can be 0 (low confidence) to 1 (high-confidence call). Currently the ML model only supports human references. In the case an ML model is not available, the number of supporting reads will be reported as the score.
* *LeftBreakpoint*: Gene 1 breakpoint formatted as `<Chromosome>:<Position>:<Strand>`.
* *RightBreakpoint*: Gene 2 breakpoint formatted as `<Chromosome>:<Position>:<Strand>`.
* *Filter*: Semicolon separated list of filter flags. The filters are described in the Section [Gene Fusion Filters](#gene-fusion-filters).
* *SplitScore*: Combined count of fusion-supporting read pairs reported as split reads and soft-clipped reads
* *NumSplitReads*: Number of fusion-supporting read pairs with at least one split read alignment.
* *NumSoftClippedReads*: Number of fusion-supporting read pairs with no split read alignment, but at least one soft clipped alignment. Includes soft-clipped reads for both Gene1 and Gene2
* *NumSoftClippedReadsGene1*: Number of fusion-supporting read pairs with no split read alignment, but at least one soft clipped alignment to Gene 1
* *NumSoftClippedReadsGene2*: See above (`NumSoftClippedReadsGene1`) for Gene 2
* *NumPairedReads*: Number of fusion-supporting read pairs such that one of the reads maps to Gene1 and the other maps to Gene2, without any breakpoint overlap
* *NumRefSplitReadsGene1*: Number of read pairs that map fully within Gene 1 such that at least one of the reads aligns across the breakpoint. These reads support the reference transcript and do not support the fusion.
* *NumRefSplitReadsGene2*: See above (`NumRefSplitReadsGene1`) for Gene 2
* *NumRefPairedReadsGene1*: Number of read pairs such that one of the reads maps on the left side of the Gene 1 breakpoint and the other maps on the right side of the Gene 1 breakpoint, without overlapping the break. These reads support the reference transcript and do not support the fusion.
* *NumRefPairedReadsGene2*: See above (`NumRefPairedReadsGene1`) for Gene 2
* *RefToAlt*: Log2 of the ratio max(`NumRefSplitReadsGene1`, `NumRefSplitReadsGene2`) / (fusion split reads + soft-clipped reads); used for the `LOW_ALT_TO_REF` filter
* *UniqueAlignmentsGene1*: Unique (start-end) positions of fusion-supporting read alignments to Gene 1 (after dedup); used for the `LOW_UNIQUE_ALIGNMENTS` filter
* *UniqueAlignmentsGene2*: Unique (start-end) positions of fusion-supporting read alignments to Gene 2 (after dedup); used for the `LOW_UNIQUE_ALIGNMENTS` filter
* *MaxMapqGene1*: Maximum MAPQ for fusion-supporting reads in Gene 1
* *AvgMapqGene1*: Average MAPQ for fusion-supporting reads in Gene 1
* *MaxMapqGene2*: Maximum MAPQ for fusion-supporting reads in Gene 2
* *AvgMapqGene2*: Average MAPQ for fusion-supporting reads in Gene 2
* *CoverageBasesGene1*: Bases in Gene 1 with read coverage across the exon of the breakpoint in the direction of the breakpoint strand which is part of the fusion transcript. If the fusion is intronic, the coverage will be calculated over an average exon size (200 for human genomes).
* *CoverageBasesGene2*: See above (`CoverageBasesGene1`) for Gene 2
* *DeltaExonBoundaryGene1*: Distance from the Gene 1 breakpoint for the closest fusion-supporting alignment (higher distance to boundary lowers score)
* *DeltaExonBoundaryGene2*: See above (`DeltaExonBoundaryGene1`) for Gene 2
* *IsRestrictedGene1*: Indicator variable of whether Gene 1 is tagged as protein coding in the annotation file
* *IsRestrictedGene2*: Indicator variable of whether Gene 2 is tagged as protein coding in the annotation file
* *IsEnrichedGene1*: If enrichment or amplicon assay, then indicates whether Gene 1 is enriched. If whole transcriptome sequencing, then set to 1
* *IsEnrichedGene2*: See above (`IsEnrichedGene1`) for Gene 2
* *CisDistance*: Distance between breakpoints if they are adjacent to each other and on the same strand. Large value (3.2G) if not a CIS break; used for the `READ_THROUGH` filter.
* *BreakpointDistance*: Distance between breakpoints if they are adjacent. Large value (3.2G) if not within same chromosome
* *GenePairHomologyEval*: E-value of pairwise BLAST alignment of the parent genes
* *AnchorLength1*: Longest alignment of a fusion-supporting read to Gene 1
* *AnchorLength2*: Longest alignment of a fusion-supporting read to Gene 2
* *NormalizedAnchorLength1*: Normalized value of `AnchorLength1` by the maximum read length.
* *NormalizedAnchorLength2*: Normalized value of `AnchorLength2` by the maximum read length.
* *FusionLengthGene1*: Distance from breakpoint to the end of Gene 1
* *FusionLengthGene2*: Distance from breakpoint to the end of Gene 2
* *NonFusionLengthGene1*: Breakpoint distance to the end of transcript not part of the fusion for Gene 1
* *NonFusionLengthGene2*: Breakpoint distance to the end of transcript not part of the fusion for Gene 2
* *Gene1Id*: Gene ID reported in the annotation file for Gene 1
* *Gene2Id*: Gene ID reported in the annotation file for Gene 2
* *Gene1Location*: Location of the Gene 1 breakpoint, one of:
  * *IntactExon*: Breakpoint matches an exon boundary.
  * *BrokenExon*: Breakpoint is within an exon but does not match the exon boundary.
  * *Intron*: Breakpoint is within an intron.
  * *Intergenic*: Breakpoint does not overlap any gene.
* *Gene2Location*: See above (`Gene1Location`) for Gene 2
* *Gene1Sense*: "TRUE" if the Gene 1 5' to 3' direction matches the breakpoint order, indicating that the gene is the upstream gene in the fusion transcript
* *Gene2Sense*: See above (`Gene1Sense`) for Gene 2
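As an illustration of how these columns relate, the sketch below reads a features table with the standard `csv` module and recomputes the *RefToAlt* ratio from its definition. It assumes the CSV header uses the column names listed above and that "fusion split + soft clipped reads" corresponds to `NumSplitReads + NumSoftClippedReads`. The sample values are made up, and returning `None` when either count is zero is a choice of this sketch, not documented DRAGEN behavior.

```python
import csv
import io
import math

# Made-up values for illustration; only the columns needed here are included.
FEATURES_CSV = """\
FusionGene,Score,NumSplitReads,NumSoftClippedReads,NumRefSplitReadsGene1,NumRefSplitReadsGene2
GENE_A--GENE_B,0.97,3,1,8,4
GENE_C--GENE_D,0.12,0,0,5,2
"""


def ref_to_alt(row):
    """log2(max reference split reads / fusion split + soft-clipped reads)."""
    ref = max(float(row["NumRefSplitReadsGene1"]), float(row["NumRefSplitReadsGene2"]))
    alt = float(row["NumSplitReads"]) + float(row["NumSoftClippedReads"])
    if ref <= 0 or alt <= 0:
        return None  # log2 is undefined here; handling is an assumption of this sketch
    return math.log2(ref / alt)


rows = list(csv.DictReader(io.StringIO(FEATURES_CSV)))
values = {row["FusionGene"]: ref_to_alt(row) for row in rows}
print(values)
```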

#### `<output-file-prefix>.fusion_candidates.final`

This output file lists each passing fusion in tab-separated format with the following columns from the features.csv file: FusionGene, Score, LeftBreakpoint, RightBreakpoint, Gene1Location, Gene2Location, Gene1Sense, Gene2Sense, Gene1Id, Gene2Id, NumSplitReads, NumSoftClippedReads, NumPairedReads; and the following unique columns:

* *FusionSequence*: If reporting fusion sequence is enabled (`--rna-gf-output-fusion-sequence`), the fusion sequence at the breakpoint will be reported if assembly of the reads succeeds. NA if the assembly failed.
* *BreakpointLeeway*: If fusion assembly succeeds and the fusion breakpoint sequence can be aligned to the reference breakpoint in multiple ways (e.g. by shifting the assembly), a leeway value will be reported in the format `-XX|+YY` where XX is the number of bases the reported breakpoint can be shifted left (relative to the left side of the fusion) while maintaining the maximal alignment score and YY is similarly the number of bases it can be shifted to the right. NA if assembly failed, or was aligned only to one breakpoint and failed on the other breakpoint.
* *ReadNames*: The names of all split, spanning, and soft-clipped reads supporting the fusion, separated by semicolons. These reads can be extracted from the output BAM file and used to visualize the fusion (e.g. in IGV).
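The `-XX|+YY` leeway format and the semicolon-separated read names are straightforward to decode. A minimal sketch follows, with hypothetical function names. Any value that does not match the leeway pattern, including `NA`, is returned as `None`. The read names in the example are invented.

```python
import re

LEEWAY_RE = re.compile(r"^-(\d+)\|\+(\d+)$")


def parse_leeway(value):
    """Return (bases_left, bases_right) for '-XX|+YY', or None if not available."""
    match = LEEWAY_RE.match(value.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def split_read_names(value):
    """Split the ReadNames column into a list, dropping empty entries."""
    return [name for name in value.split(";") if name]


print(parse_leeway("-2|+1"))
print(split_read_names("read_0001;read_0002;read_0003"))
```

The resulting list of names can be written one per line to a file and used with any tool that filters a BAM file by read name.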

#### `<output-file-prefix>.fusion_candidates.preliminary`

Contains the same information as the `<output-file-prefix>.fusion_candidates.final` output, but for the fusion candidates that did *not* pass the filters. These entries do not include FusionSequence and BreakpointLeeway.

#### `<output-file-prefix>.fusion_candidates.filter_info`

Lists every non-passing fusion candidate from `<output-file-prefix>.fusion_candidates.preliminary`, with the filter that caused the candidate to fail in the first column.

#### `<output-file-prefix>.fusion_candidates.vcf.gz`

This output file provides the VCF representation for all of the breakpoints for the candidate fusions using structural variant-style BND notation. The VCF header is annotated with `##source=DRAGEN_RNA_GF` to indicate the file is generated by the DRAGEN RNA Gene Fusion pipeline. All fusion candidates (passing and failing) are represented in the VCF output with one entry for each side of the fusion breakpoint (Gene 1 and Gene 2). The QUAL field is the Phred value of the score, calculated as `-10log_{10}(probability score)`. The probability score can be found in the `fusion_candidates.final` and `fusion_candidates.features.csv` files. The Gene Fusion VCF header is as follows:

```
##ALT=<ID=BND,Description="Gene fusion represented using structural variant breakend notation">
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=LOW_SCORE,Description="The fusion candidate has low probabilistic score as determined by the features of the candidate.">
##FILTER=<ID=MIN_SUPPORT,Description="The fusion candidate has too few fusion supporting read pairs.">
##FILTER=<ID=LOW_UNIQUE_ALIGNMENTS,Description="All fusion supporting read alignments near at least one of the two breakpoints map to the very few reference intervals (start, end).">
##FILTER=<ID=LOW_MAPQ,Description="All fusion supporting read alignments at either of the breakpoints have low MAPQ.">
##FILTER=<ID=UNENRICHED_GENES,Description="If enrichment panel, then both parent genes are not enriched.">
##FILTER=<ID=HOMOLOGOUS,Description="The candidate is likely to be a false candidate generated because the two genes involved have high gene homology.">
##FILTER=<ID=LOW_GENE_COVERAGE,Description="Either of the two breakpoints have very few bases with nonzero read coverage.">
##FILTER=<ID=LOW_ALT_TO_REF,Description="The number of fusion supporting reads is a small fraction of the number of reads supporting the reference transcript at either of the two breakpoints.">
##FILTER=<ID=READ_THROUGH,Description="The breakpoints are cis neighbors on the reference genome.">
##FILTER=<ID=NO_COMPLETE_SPLIT_READS,Description="All split read alignments only cover a small fraction of the total bases in the read.">
##FILTER=<ID=MITOCHONDRIAL_GENES,Description="The fusion candidate involves mitochondrial genes.">
##FILTER=<ID=ANCHOR_SUPPORT,Description="Read alignments of fusion supporting reads are not long enough at either of the two breakpoints.">
##FILTER=<ID=DOUBLE_BROKEN_EXON,Description="If both breakpoints are distant from annotated exon boundaries, then the number of supporting reads do not satisfy a high threshold requirement.">
##FILTER=<ID=INCONSISTENT_MOTIF,Description="Fusion breakpoints do not have canonical motifs.">
##FILTER=<ID=MIN_SCORE_RATIO,Description="Fusion has low supporting reads as compared to the most-supported fusion of a parent gene.">
##FILTER=<ID=NEIGHBOR_MERGE,Description="Fusion merged with another fusion with a neighboring gene.">
##FILTER=<ID=PARALOG,Description="Higher scoring fusion call with a paralogous gene exists.">
##FILTER=<ID=ADJACENT_BREAKPOINTS,Description="Breakpoints within close proximity on reference genome.">
##FILTER=<ID=MAX_PARTNERS,Description="A parent gene has too many fusion partners.">
##FILTER=<ID=INTRONIC_BREAKPOINT,Description="Breakpoint is in gene intronic region.">
##FILTER=<ID=DUPLICATE_GENE_PAIR,Description="The fusion between gene pairs is detected at another breakpoint with higher read support.">
##FILTER=<ID=BLOCKLIST,Description="The fusion is in the blocklist.">
##INFO=<ID=GENE_NAMES,Number=2,Type=String,Description="Pair of genes names involved in fusion (left,right).">
##INFO=<ID=GENE_IDS,Number=2,Type=String,Description="Pair of genes IDs involved in fusion (left,right).">
##INFO=<ID=GENE_SENSE,Number=2,Type=String,Description="Direction of gene1 and gene2, TRUE if gene direction matches the fusion direction (left, right).">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END">
##INFO=<ID=MATEID,Number=1,Type=String,Description="ID of mate breakend">
##INFO=<ID=RNA_STRANDED,Number=0,Type=Flag,Description="For RNA fusions, the direction of transcription is known">
##INFO=<ID=RNA_FIRST,Number=0,Type=Flag,Description="For RNA fusions, this break-end is 5' in the fusion transcript">
##INFO=<ID=SPLICE_VARIANT,Number=0,Type=Flag,Description ="This fusion is an intergenic splice variant.">
##INFO=<ID=FUSION_SEQ,Number=1,Type=String,Description="The assembled sequence at the fusion breakends, only for passing fusions and if assembly was successful.">
##INFO=<ID=FUSION_SEQ_LEN,Number=1,Type=Integer,Description="The length of the assembled sequence at the breakends, only for passing fusions and if assembly was successful.">
##INFO=<ID=FUSION_LEEWAY,Number=1,Type=String,Description="Allowable shifting of breakpoint positions with identical alignment score.">
##FORMAT=<ID=PR,Number=3,Type=Integer,Description="Number of spanning paired-read support for the ref and alt alleles in the order listed (gene1 ref, gene2 ref, gene fusion)">
##FORMAT=<ID=SR,Number=3,Type=Integer,Description="Number of split read pairs supporting the ref and alt alleles in the order listed (gene1 ref, gene2 ref, gene fusion)">
##FORMAT=<ID=CR,Number=1,Type=Integer,Description="Number of soft-clip reads supporting the alt allele">
```

Here is an example of one fusion candidate in the VCF output. Note that there is one record per breakpoint.

```
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE
chr1	9972018	chr1:9972018:-:chr1:10337203:-	A	]chr1:10337203]A	30.650926	PASS	GENE_NAMES=KIF1B,NMNAT1;GENE_IDS=ENSG00000054523.20,ENSG00000173614.14;GENE_SENSE=True,True;SVTYPE=BND;CIPOS=0,0;CIEND=0,0;MATEID=chr1:10337203:+:chr1:9972018:+;RNA_STRANDED;FUSION_SEQ=AGAAAGGTGAAGTGCGGGGATTTCTGCGTGTGGCTGTACAGGCCATCGCAGCGGATGAAGAAGCTCCTGATTATGGCTCTGGAATTCGACAGTCAGGAACAGCTAAAATATCTTTTGATAATGAATACTTTAATCAGAGTGACTTTTCGTCTGTTGCAATGACTCGTTCTGGTCTGTCCTTGGAGGAGTTGAGGATTGTGGAAGGACAGGGTCAGAGTTCTGAGGTCATCACTCCTCCAGAAGAAATCAGTCGAATTAATGACTTGGACAACAAGGGAGGTGTCACAGTTTTCCATTTAGATCAACAACTTCAAGTTCTTACCATGGAAAATTCCGAGAAGACTGAAGTGGTTCTCCTTGCTTGTGGTTCATTCAATCCCATCACCAACATGCACCTCAGGTTGTTTGAGCTGGCCAAGGACTACATGAATGGAACAGGAAGGTACACAGTTGTTAAAGG;FUSION_SEQ_LEN=460;FUSION_LEEWAY=-1|+0	PR:SR:CR	15,35,3:261,63,13:10
chr1	10337203	chr1:10337203:+:chr1:9972018:+	G	G[chr1:9972018[	30.650926	PASS	GENE_NAMES=KIF1B,NMNAT1;GENE_IDS=ENSG00000054523.20,ENSG00000173614.14;GENE_SENSE=True,True;SVTYPE=BND;CIPOS=0,0;CIEND=0,0;MATEID=chr1:9972018:-:chr1:10337203:-;RNA_STRANDED;RNA_FIRST;FUSION_SEQ=AGAAAGGTGAAGTGCGGGGATTTCTGCGTGTGGCTGTACAGGCCATCGCAGCGGATGAAGAAGCTCCTGATTATGGCTCTGGAATTCGACAGTCAGGAACAGCTAAAATATCTTTTGATAATGAATACTTTAATCAGAGTGACTTTTCGTCTGTTGCAATGACTCGTTCTGGTCTGTCCTTGGAGGAGTTGAGGATTGTGGAAGGACAGGGTCAGAGTTCTGAGGTCATCACTCCTCCAGAAGAAATCAGTCGAATTAATGACTTGGACAACAAGGGAGGTGTCACAGTTTTCCATTTAGATCAACAACTTCAAGTTCTTACCATGGAAAATTCCGAGAAGACTGAAGTGGTTCTCCTTGCTTGTGGTTCATTCAATCCCATCACCAACATGCACCTCAGGTTGTTTGAGCTGGCCAAGGACTACATGAATGGAACAGGAAGGTACACAGTTGTTAAAGG;FUSION_SEQ_LEN=460;FUSION_LEEWAY=-1|+0	PR:SR:CR	15,35,3:261,63,13:10
```
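A record like the ones above can be unpacked with plain string handling, as in the following sketch. The function names are hypothetical, and the record is an abbreviated copy of the second example line with several INFO keys, including `FUSION_SEQ`, left out. The `qual_to_score` helper rests on an assumption: it treats QUAL as the Phred-scaled error probability, that is `-10 * log10(1 - score)`. This is inferred from the example, where a QUAL of about 30.65 accompanies a PASS call, and is not a documented formula.

```python
def parse_fusion_vcf_record(line):
    """Parse one tab-separated fusion BND record into a dict."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, rec_id, ref, alt, qual, filt, info, fmt, sample = cols[:10]
    info_map = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_map[key] = value if sep else True  # flags such as RNA_FIRST carry no value
    # PR and SR hold (gene1 ref, gene2 ref, gene fusion); CR holds a single count.
    sample_map = {
        key: [int(x) for x in value.split(",")]
        for key, value in zip(fmt.split(":"), sample.split(":"))
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": rec_id,
        "alt": alt,
        "qual": float(qual),
        "filters": filt.split(";"),
        "genes": info_map.get("GENE_NAMES", "").split(","),
        "info": info_map,
        "sample": sample_map,
    }


def qual_to_score(qual):
    """Invert QUAL under the assumption QUAL = -10 * log10(1 - score)."""
    return 1.0 - 10.0 ** (-qual / 10.0)


# Abbreviated from the example above; some INFO keys are omitted.
RECORD = "\t".join([
    "chr1", "10337203", "chr1:10337203:+:chr1:9972018:+", "G", "G[chr1:9972018[",
    "30.650926", "PASS",
    "GENE_NAMES=KIF1B,NMNAT1;SVTYPE=BND;MATEID=chr1:9972018:-:chr1:10337203:-;RNA_STRANDED;RNA_FIRST",
    "PR:SR:CR", "15,35,3:261,63,13:10",
])

rec = parse_fusion_vcf_record(RECORD)
score = qual_to_score(rec["qual"])
print(rec["genes"], rec["sample"], score)
```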

#### `<output-file-prefix>.final.fusion_candidates.vcf.gz`

This output file provides the filtered VCF for all passing breakpoints for the candidate fusions.

#### `<output-file-prefix>.fusion_metrics.csv`

This metrics output file provides a simple count of the total number of fusion candidates, the total number of passing fusion candidates, total number of passing splice fusions, and the number of unique left-right gene combinations that are found. Here is an example of this file output:

```
RNA GENE FUSION STATISTICS,,All fusion candidates (unfiltered),2290
RNA GENE FUSION STATISTICS,,Final fusion candidates (passing filter),31
RNA GENE FUSION STATISTICS,,Final splice fusions passing filter,0
RNA GENE FUSION STATISTICS,,Unique passing gene fusions,24
```
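Because each metrics row has the same four-field layout, the file can be loaded into a dictionary with the standard `csv` module. The sketch below assumes the layout shown in the example (section name, empty field, metric label, integer value); the function name is hypothetical.

```python
import csv
import io

METRICS = """\
RNA GENE FUSION STATISTICS,,All fusion candidates (unfiltered),2290
RNA GENE FUSION STATISTICS,,Final fusion candidates (passing filter),31
RNA GENE FUSION STATISTICS,,Final splice fusions passing filter,0
RNA GENE FUSION STATISTICS,,Unique passing gene fusions,24
"""


def parse_fusion_metrics(text):
    """Map each metric label to its integer value."""
    metrics = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) >= 4 and row[0] == "RNA GENE FUSION STATISTICS":
            metrics[row[2]] = int(row[3])
    return metrics


metrics = parse_fusion_metrics(METRICS)
# Fraction of all candidates that passed the filters.
pass_fraction = (metrics["Final fusion candidates (passing filter)"]
                 / metrics["All fusion candidates (unfiltered)"])
print(metrics, pass_fraction)
```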

### Gene Fusion Filters

Multiple filters are applied to the fusion candidates. Some filters are informational only, while confidence filters cause a candidate to fail even if its score is high. The option `--rna-gf-enable-high-precision-filters` is on by default, which causes `MIN_SUPPORT` and `LOW_UNIQUE_ALIGNMENTS` to be applied as confidence filters in addition to `LOW_SCORE`. Turning on the option `--rna-gf-enable-post-filters`, which is off by default, forces all filters to be applied. The following table summarizes the filters and the options that adjust them.

| Filter                  | Type                        | Description                                                                                                                                                                                                                                                                                   | Option to set threshold                                             |
| ----------------------- | --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- |
| `LOW_SCORE`             | Confidence (always applied) | The fusion candidate has low score (< 0.5) as determined by the ML model.                                                                                                                                                                                                                     | `--rna-gf-min-score`                                                |
| `MIN_SUPPORT`           | Confidence (optional)       | The fusion has < 2 supporting read pairs.                                                                                                                                                                                                                                                     | `--rna-gf-min-split-support`                                        |
| `LOW_UNIQUE_ALIGNMENTS` | Confidence (optional)       | The minimum number of unique supporting read alignments required at each breakpoint is not met. Unique alignments have unique start and end positions and are not PCR duplicates.                                                                                                             | `--rna-gf-min-unique-alignments`                                    |
| `LOW_MAPQ`              | Information only            | All fusion-supporting read alignments at either breakpoint have MAPQ < 20.                                                                                                                                                                                                                    | `--rna-gf-min-breakpoint-mapq`                                      |
| `DOUBLE_BROKEN_EXON`    | Information only            | If both breakpoints are >50 bp away from annotated exon boundaries, then the number of supporting reads do not satisfy a high threshold requirement (≥10 supporting reads). The distance indicates an intronic fusion.                                                                        | `--rna-gf-exon-snap`, `--rna-gf-min-support-be`                     |
| `UNENRICHED_GENES`      | Information only            | If enrichment list provided, then neither parent gene is enriched. If Amplicon mode is enabled, then at least one parent gene is not enriched (See [DRAGEN Amplicon Pipeline](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-amplicon-pipeline) for further information). | `--rna-gf-enriched-only`                                            |
| `MITOCHONDRIAL_GENES`   | Information only            | The fusion candidate involves mitochondrial genes. Set `--rna-gf-filter-chrm=false` to disable this filter. Default value is "true".                                                                                                                                                          |                                                                     |
| `READ_THROUGH`          | Information only            | The breakpoints are cis neighbors (< 200,000 bp) on the reference genome.                                                                                                                                                                                                                     | `--rna-gf-report-read-through=true` and `--rna-gf-min-cis-distance` |
| `ANCHOR_SUPPORT`        | Information only            | Read alignments of fusion-supporting reads are not long enough (less than 12 bp) at either breakpoint.                                                                                                                                                                                        | `--rna-gf-min-anchor`                                               |
| `HOMOLOGOUS`            | Information only            | The candidate is likely to be a false candidate generated because the two genes involved have high gene homology. Default threshold is 1e-100.                                                                                                                                                | `--rna-gf-min-blast-pairs-eval`                                     |
| `LOW_ALT_TO_REF`        | Information only            | The number of reads supporting the fusion is < 1% of the number of reads supporting the reference transcript at either breakpoint.                                                                                                                                                            | `--rna-gf-min-alt-to-ref`                                           |
| `LOW_GENE_COVERAGE`     | Information only            | Either breakpoint has less than 50 bp with nonzero read coverage.                                                                                                                                                                                                                             | `--rna-gf-min-covered-bases`                                        |
| `ADJACENT_BREAKPOINTS`  | Information only            | Breakpoints are too close to each other                                                                                                                                                                                                                                                       | `--rna-gf-min-breakpoint-distance`                                  |

### Gene Fusion Options

The following options may be used to configure the fusion caller:

* `--rna-gf-blast-pairs` A tab-separated file listing gene pairs that have a high level of similarity. The first and second columns are the gene names, and the third column is the e-score. This list of gene pairs is used as a homology filter to reduce false positives. For runs on human genome assemblies GRCh38 and hg19, DRAGEN automatically applies a default file generated using [Gencode Human Release 32](https://www.gencodegenes.org/human/release_32.html) annotations for primary chromosomes if no other file is specified on the command line.
* `--rna-enriched-genes` For RNA enrichment assays, a list of targeted genes specified as one gene name per line. Only fusion calls involving at least one gene on the list are reported. The enriched genes list should only contain genes listed in the input annotation file. This option cannot be provided together with `--rna-enriched-regions`. If RNA amplicon mode is enabled and the amplicon BED file already includes the gene name, then you do not need to set this option; DRAGEN reads the enriched gene names from the amplicon BED file (fifth column). See [DRAGEN Amplicon Pipeline](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-amplicon-pipeline) for further information.
* `--rna-enriched-regions` Alternative to `--rna-enriched-genes`, but the input is provided as a BED file of region coordinates instead of a gene list. All genes in the provided annotation file that overlap these regions are included. Genes extracted in this way are listed in the `*.fusion.enriched_genes.txt` output file. This option cannot be provided together with `--rna-enriched-genes`.
* `--rna-repeat-genes` Text file that contains the names or IDs (from the annotation file) of targeted repetitive genes for sensitive fusion detection. Mutually exclusive with `--rna-repeat-intervals`. This option overrides the default BED file. The repeat genes list should only contain genes listed in the input annotation file.
* `--rna-repeat-intervals` BED file that contains a target list of repeat intervals for sensitive fusion detection. Mutually exclusive with `--rna-repeat-genes`. This option overrides the default files.
* `--enable-variant-annotation=true`, `--variant-annotation-assembly`, and `--variant-annotation-data` Enable Illumina Annotation Engine (IAE) to report fusion annotations in JSON format. `--enable-variant-annotation` must be set to "true". For more information, see [Illumina Annotation Engine](https://help.dragen.illumina.com/product-guides/dragen-v4.5/nirvana).
* `--rna-gf-restrict-genes` When parsing the gene annotations file for use in the DRAGEN Gene Fusion module, you can use this option to restrict the entries of interest to only protein-coding regions. Restricting the annotation to only the protein-coding genes reduces false positive rates in currently studied fusion events. To report non-coding gene fusions such as pseudo genes and lincRNAs, turn off this option. The default value is "true".
* `--rna-gf-merge-calls` If multiple genes overlap a fusion breakpoint, DRAGEN generates and scores a separate fusion candidate for each gene pair overlapping the breakpoint. The default value is "false" so that each reported fusion event only has one left and right gene in the fusion, and overlapping genes are output as separate events.
* `--rna-gf-allow-overlapping-genes` Allows for fusion calls between overlapping genes. The default value is "false".
* `--rna-gf-enable-high-precision-filters` Enable high-precision filters of gene fusion candidates. Applies LOW\_SCORE (always on), LOW\_UNIQUE\_ALIGNMENTS, and MIN\_SUPPORT. The default value is "true".
* `--rna-gf-enable-post-filters` Enable post-filtering of gene fusion candidates by confidence flags. The filter flags are listed in the table above. The default value is "false".
* `--rna-gf-output-fusion-sequence` Add a "FusionSequence" column for all passing fusions in the `<output-file-prefix>.fusion_candidates.final` file based on the contig assembly of all supporting reads. If no assembly was generated, then "NoAssembly" is reported. Setting this option to "true" also updates the fusion breakpoints based on the alignment of the assembled contig to the reference for the passing fusions in the `<output-file-prefix>.fusion_candidates.final` file and the output VCF files. The left and right breakpoint positions are chosen such that the alignment score between the assembled contig and the reference sequences is maximized. If there are multiple maximal alignments, the positions minimizing the distance from the original fusion breakpoints are reported. An additional "BreakpointLeeway" column is also added to the `<output-file-prefix>.fusion_candidates.final` file. This column has the form "-XX|+YY", where XX is the number of bases the reported breakpoint can be shifted left (relative to the left side of the fusion) while maintaining the maximal alignment score, and YY is similarly the number of bases it can be shifted to the right. For example, "-2|+1" indicates that the breakpoint could be shifted 2 bases to the left or 1 base to the right and still have a maximal alignment score (due to identical bases occurring adjacent to the breakpoint). "NA" is output if no assembly is generated. The default value for this option is "true".
* `--enable-rna-amplicon` A separate fusion filtering model is trained for RNA amplicon mode. Duplicate removal for fusion-supporting reads is disabled for RNA amplicon mode and both genes are required to be in the list of enriched genes. By default, the DRAGEN fusion caller filters candidates if a breakpoint overlaps both transcripts (e.g. fusions such as FIP1L1--PDGFRA and GOPC--ROS1). In RNA amplicon mode, such candidates are not filtered. See [DRAGEN Amplicon Pipeline](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-amplicon-pipeline) for further information. The default is "false".
* `--rna-gf-sv-vcf` Structural Variant VCF file output from DRAGEN DNA structural variant caller run in somatic mode. See below for more information.
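
The "BreakpointLeeway" format described for `--rna-gf-output-fusion-sequence` above can be read programmatically. The following is a minimal Python sketch, not part of DRAGEN. The function name `parse_breakpoint_leeway` is hypothetical, and the sketch assumes the column contains either the literal "NA" or a value of the documented "-XX|+YY" form.

```python
import re
from typing import Optional, Tuple

# Matches the documented "-XX|+YY" form, for example "-2|+1".
_LEEWAY_RE = re.compile(r"^-(\d+)\|\+(\d+)$")


def parse_breakpoint_leeway(value: str) -> Optional[Tuple[int, int]]:
    """Return (left_shift, right_shift) in bases, or None when the value is "NA".

    Hypothetical helper: it only assumes the two value forms described in the
    documentation and raises ValueError for anything else.
    """
    value = value.strip()
    if value == "NA":
        # "NA" is reported when no assembly was generated.
        return None
    match = _LEEWAY_RE.match(value)
    if match is None:
        raise ValueError(f"unexpected BreakpointLeeway value: {value!r}")
    return int(match.group(1)), int(match.group(2))
```

With the documented example value, `parse_breakpoint_leeway("-2|+1")` returns `(2, 1)`: the breakpoint could move 2 bases to the left or 1 base to the right without lowering the alignment score.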

### Merging Fusion Caller with the Splice Variant Caller

When the splice variant caller and gene fusion caller are both enabled, the passing and failed intergenic splice variants are passed to the gene fusion caller to be reported as candidate fusion events. This merging occurs only for genomes supported by the ML model for gene fusions (currently only human genomes are supported). The **passing** calls are output to the fusion caller's `<output-file-prefix>.fusion_candidates.final` file with the value `SpliceVar` for GeneLocation. The tab-separated fields for splice fusions in the `<output-file-prefix>.fusion_candidates.final` output are described below.

| **Field Names**                     | **Description**                                                                                                                                                                                                                                                |
| ----------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| FusionGene                          | Left and Right gene names (separated by "--")                                                                                                                                                                                                                  |
| Score                               | Value between 0 and 1, from the splice variant caller                                                                                                                                                                                                          |
| LeftBreakpoint, RightBreakpoint     | The location for left and right sides of the splice with three colon separated fields: chromosome:coordinate:strand(+/-)                                                                                                                                       |
| Gene1Location, Gene2Location        | Splice Variant caller always outputs "**SpliceVar**" here instead of Exon/Intron location                                                                                                                                                                      |
| Gene1Sense, Gene2Sense              | Always TRUE by design                                                                                                                                                                                                                                          |
| Gene1Id, Gene2Id                    | Long form ID (i.e. for Gencode it is usually "ENSG.version")                                                                                                                                                                                                   |
| NumSplitReads                       | Taken from the *split\_unique\_reads\_alt* column value of the splice\_variant\_fusions.tsv file. See [RNA Splice Variant caller](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-rna-pipeline/splice-variant-caller) for more information. |
| NumSoftClippedReads, NumPairedReads | These values are not used by RSV caller and are set to "0"                                                                                                                                                                                                     |
| ReadNames                           | Not provided by this caller and set to "N/A"                                                                                                                                                                                                                   |

The passing splice variant calls are also written to the VCF outputs `<output-file-prefix>.final.fusion_candidates.vcf` and `<output-file-prefix>.fusion_candidates.vcf`, with the "SPLICE\_VARIANT" flag in the INFO field.
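
The `FusionGene` and breakpoint field formats in the table above are simple enough to split in a few lines. The following Python sketch is illustrative only. The names `Breakpoint`, `parse_breakpoint`, and `split_fusion_gene` are hypothetical, and the example values in the usage note are made up, not taken from a real run.

```python
from typing import NamedTuple, Tuple


class Breakpoint(NamedTuple):
    chromosome: str
    coordinate: int
    strand: str  # "+" or "-"


def parse_breakpoint(field: str) -> Breakpoint:
    """Parse a "chromosome:coordinate:strand" breakpoint field."""
    # Split from the right so that a contig name that itself contains ":"
    # is kept intact in the chromosome part.
    chromosome, coordinate, strand = field.strip().rsplit(":", 2)
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand in breakpoint: {field!r}")
    return Breakpoint(chromosome, int(coordinate), strand)


def split_fusion_gene(name: str) -> Tuple[str, str]:
    """Split a FusionGene value into (left_gene, right_gene) on "--"."""
    left, separator, right = name.strip().partition("--")
    if not separator:
        raise ValueError(f"FusionGene value has no '--' separator: {name!r}")
    return left, right
```

For example, `parse_breakpoint("chr1:12345:+")` yields `Breakpoint(chromosome="chr1", coordinate=12345, strand="+")`, and `split_fusion_gene("GENEA--GENEB")` yields `("GENEA", "GENEB")`.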

### Running RNA fusion detection with somatic SV evidence

You can run the DRAGEN Gene Fusion module with a VCF file containing somatic Structural Variant (SV) calls. DRAGEN reports SV events matching each fusion candidate in the `<output-file-prefix>.fusion_candidates.features.csv` output file for informational purposes, but does not use this data in the scoring or filtering of the fusion candidates. The SV calls must be generated in somatic mode (for more information, see the [DRAGEN Structural Variant Calling](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-dna-pipeline/sv-calling) pipeline). The following is an example command line for running an end-to-end RNA-Seq experiment with a somatic SV VCF file.

```
dragen \
  -r <HASHTABLE> \
  -1 <FASTQ1> \
  -2 <FASTQ2> \
  -a <ANNOTATION_FILE> \
  --output-dir <OUT_DIRECTORY> \
  --output-file-prefix <OUTPUT_PREFIX> \
  --RGID <READ_GROUP_ID> \
  --RGSM <SAMPLE_NAME> \
  --enable-rna true \
  --enable-rna-gene-fusion true \
  --rna-gf-sv-vcf <SV_VCF_PATH>
```
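
If the command is launched from a script, the argument list can be assembled and sanity-checked before DRAGEN starts. The following Python sketch is a hypothetical wrapper, not an Illumina tool. The function `build_fusion_command` and its parameter names are assumptions. It uses only options that appear on this page, and it enforces the documented rule that `--rna-enriched-genes` and `--rna-enriched-regions` cannot be provided together.

```python
from typing import List, Optional


def build_fusion_command(
    hashtable: str,
    fastq1: str,
    fastq2: str,
    annotation: str,
    out_dir: str,
    prefix: str,
    rgid: str,
    rgsm: str,
    sv_vcf: Optional[str] = None,
    enriched_genes: Optional[str] = None,
    enriched_regions: Optional[str] = None,
) -> List[str]:
    """Assemble a dragen argument list for RNA fusion detection (sketch)."""
    if enriched_genes and enriched_regions:
        # Documented constraint: these two options are mutually exclusive.
        raise ValueError(
            "--rna-enriched-genes and --rna-enriched-regions cannot be combined"
        )
    cmd = [
        "dragen",
        "-r", hashtable,
        "-1", fastq1,
        "-2", fastq2,
        "-a", annotation,
        "--output-dir", out_dir,
        "--output-file-prefix", prefix,
        "--RGID", rgid,
        "--RGSM", rgsm,
        "--enable-rna", "true",
        "--enable-rna-gene-fusion", "true",
    ]
    # Optional inputs are appended only when a path was supplied.
    if sv_vcf:
        cmd += ["--rna-gf-sv-vcf", sv_vcf]
    if enriched_genes:
        cmd += ["--rna-enriched-genes", enriched_genes]
    if enriched_regions:
        cmd += ["--rna-enriched-regions", enriched_regions]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.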

When the SV VCF input is provided to the RNA fusion caller, the following additional features will be reported in the `features.csv` output file:

* `SvEvent`: A semicolon-separated string representation of the SV events matching the fusion candidate.
* `SvType`: A semicolon-separated list of the types of the matching SV events.
* `SomaticScore`: The highest SomaticScore value of the matching SV events.
* `SvDistance`: The maximum distance between an SV breakpoint and a fusion breakpoint (if there are multiple matching SV events, the minimum of these maximum distances over all SV events).
* `LeftSvDistance`: The distance between the left fusion breakpoint and the corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV events).
* `RightSvDistance`: The distance between the right fusion breakpoint and the corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV events).
* `SvPresent`: Set to 1 if matching SV event is present, otherwise 0.
* `SvAbsent`: Set to 1 if no matching SV event is present, otherwise 0.
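
The way the three distance features aggregate over multiple matching SV events can be illustrated with a short Python sketch. This is one reading of the descriptions above, not DRAGEN's implementation. It assumes that each matching SV event has already been paired with the fusion so that one SV breakpoint corresponds to the left fusion breakpoint and the other to the right, and that each pair lies on the same chromosome. The names `SvDistances` and `summarize_sv_distances` are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class SvDistances(NamedTuple):
    sv_distance: int
    left_sv_distance: int
    right_sv_distance: int


def summarize_sv_distances(
    fusion_left: int,
    fusion_right: int,
    sv_events: Iterable[Tuple[int, int]],
) -> Optional[SvDistances]:
    """Aggregate breakpoint distances over matching SV events.

    sv_events holds (sv_left, sv_right) coordinates, one tuple per event.
    Returns None when there is no matching event.
    """
    per_event = [
        (abs(fusion_left - sv_left), abs(fusion_right - sv_right))
        for sv_left, sv_right in sv_events
    ]
    if not per_event:
        return None
    return SvDistances(
        # Per event: the larger of the two distances; overall: the smallest of those.
        sv_distance=min(max(left, right) for left, right in per_event),
        # Left and right distances are each minimized independently over events.
        left_sv_distance=min(left for left, _ in per_event),
        right_sv_distance=min(right for _, right in per_event),
    )
```

Under this reading, `LeftSvDistance` and `RightSvDistance` may come from different SV events, so `SvDistance` is not necessarily the larger of the two reported values.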

