# Structural Variant Calling

The DRAGEN Structural Variant (SV) Caller integrates and extends Manta structural variant calling methods to provide SV and indel calls of at least SV\_MIN\_SCORED\_VARIANT\_SIZE bases ([default values](#default-values)). SVs and indels are called from mapped paired-end sequencing reads. The SV caller is optimized for analysis of diploid germline variation in small sets of individuals and somatic variation in tumor-normal sample pairs.

The SV caller performs the following actions:

* Discovers, assembles, and scores large-scale SVs, medium-sized indels, and large insertions within a single efficient workflow.
* Combines paired and split-read evidence during SV discovery and scoring to improve accuracy, but does not require split-reads or successful breakpoint assemblies to report a variant in cases where there is strong evidence otherwise.
* Provides scoring models for 1) germline variants in small sets of diploid samples, 2) somatic variants in matched tumor-normal sample pairs, and 3) somatic and germline variants in tumor-only samples.

All SV and indel inferences are output in VCF 4.2 format.

## DRAGEN SV Caller Overview

The DRAGEN SV caller divides the SV and indel discovery process into the following steps.

1. Reads input files to estimate alignment statistics, including fragment size distribution and chromosome level depth. For more information on the SV caller input options, see [Command Line Options](#command-line-options).
2. Scans the genome or a subset of the genome (specified by the call regions) to build various genome-wide data structures, including a breakend association graph of all SV-associated regions from single reads. DRAGEN SV then merges and filters these regions in the graph to reduce noise, which can improve precision and runtime. The graph contains edges that connect all regions of the genome that have a possible breakend association. An edge can connect two different regions of the genome to represent evidence of a long-range association, or it can connect a region to itself to capture a local indel/small SV association. These associations are more general than a specific SV hypothesis, and multiple breakend candidates might be found on one edge, although typically only one or two candidates are found per edge. Instead of passing an inclusion region BED file, you can pass an exclusion region BED file to DRAGEN so that any SV breakend that overlaps these regions is removed from downstream analyses. The excluded regions are removed from the graph building process, but active regions close to the boundaries of excluded regions can be extended into them during the refinement step. Hence, the final SV calls may still extend into the excluded regions.
3. Analyzes the breakend association graph to discover candidate SVs, then scores discovered candidate SVs. Analysis and scoring are performed as follows.
   1. Infers SV candidates that are associated with the given graph edge.
   2. Assembles the SV breakends.
   3. Scores/genotypes and filters each SV candidate under various biological models (currently germline, tumor-normal, and tumor-only).
   4. Outputs scored SVs to VCF.
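
The breakend association graph from step 2 can be pictured with a toy model. The following is a minimal sketch, not DRAGEN's data structure: the fixed-size binning (`BIN_SIZE`), the `min_support` threshold, and the function names are all assumptions made for illustration. It shows edges between two regions (long-range association), self-edges (local indel/small SV association), and a simple noise filter.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical region size in bases, for illustration only


def region_of(chrom, pos):
    """Map a position to a coarse region key (chrom, bin index)."""
    return (chrom, pos // BIN_SIZE)


def build_graph(observations, min_support=2):
    """Count evidence per edge and drop weakly supported edges.

    Each observation is ((chrom1, pos1), (chrom2, pos2)): one read or read
    pair suggesting an association between two positions.
    """
    counts = defaultdict(int)
    for (c1, p1), (c2, p2) in observations:
        a, b = sorted([region_of(c1, p1), region_of(c2, p2)])
        counts[(a, b)] += 1  # a == b is a self-edge (local indel / small SV)
    return {edge: n for edge, n in counts.items() if n >= min_support}


observations = [
    (("chr1", 10_100), ("chr5", 52_300)),  # long-range evidence
    (("chr5", 52_900), ("chr1", 10_800)),  # same pair of regions, reversed
    (("chr2", 7_050), ("chr2", 7_400)),    # local evidence
    (("chr2", 7_600), ("chr2", 7_900)),    # local evidence, same region
    (("chr3", 100), ("chr9", 200)),        # single noisy observation
]
graph = build_graph(observations)
```

In this toy input, the two long-range observations collapse onto one edge, the two local observations form a self-edge, and the single unsupported observation is filtered out.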

### DRAGEN SV Caller Capabilities

The DRAGEN SV caller can discover all identifiable structural variant types in the absence of copy number analysis and large-scale *de novo* assembly. For more information on detectable types, see [Detected Variant Classes](#detected-variant-classes).

For each structural variant and indel, the SV caller attempts to assemble the breakends by gathering nearby evidential reads, and to call SV events to base-pair resolution by aligning assemblies against the reference genome. The SV caller then reports the left-shifted breakend coordinate (per the VCF 4.2 SV reporting guidelines), together with any breakend homology sequence and/or inserted sequence between the breakends. As a result, the reported coordinates of an SV event may not match exactly what the read alignments show in IGV. Often the assembly fails to provide a confident explanation of the data, especially in repeat regions. In such cases, the SV caller does not provide single-base resolution breakpoints or the associated split-read support. Instead, it approximates the event breakpoints, scores the event under the same unified likelihood model used for other cases, and reports the variant as IMPRECISE.

The sequencing reads provided as input to the SV caller are expected to be from a paired-end sequencing assay that results in an "innie" orientation between the two reads of each sequence fragment, each presenting a read from the outer edge of the fragment insert inward.

The SV caller is primarily tested for whole-genome and whole-exome (or other targeted enrichment) sequencing assays on DNA. For these assays the following applications are supported:

* Joint analysis of 5 or fewer diploid individuals
* Subtractive analysis of a matched tumor-normal sample pair
* Analysis of an individual tumor sample

For joint analysis, there is no specific restriction against larger cohorts, but there might be stability or call quality issues.

When performing somatic calling, the matched normal sample might be contaminated with tumor cells. The contamination can substantially reduce somatic variant recall. To account for Tumor-in-Normal (TiN) contamination, you can enable liquid tumor mode. For more information, see [Liquid Tumor Calling](#liquid-tumor-calling-and-tumor-in-normal-contamination).

Tumor samples can be analyzed without a matched normal sample. In this case, both germline and somatic variants are scored and reported in the output.

### Detected Variant Classes

The SV caller can discover all variation classes that can be explained as novel DNA adjacencies in the genome. Novel DNA adjacencies are classified into the following categories based on the breakend pattern:

* Deletions
* Insertions
  * Insertions in the result can be divided into the following two subclasses depending on if the inserted sequence can be fully assembled. 1) Fully-assembled insertions; 2) Partially-assembled/incomplete (inferred) insertions. See VCF record [example](#insertions-with-incomplete-insert-sequence-assembly).
  * Partially assembled insertion sequences can be reported in the single-breakend format (described in [VCF v4.2 Section 5.4.9](https://samtools.github.io/hts-specs/VCFv4.2.pdf)) with `--sv-report-incomplete-ins-as-bnd=true`.
    * As a heuristic, DRAGEN attempts to pair partially-assembled insertions into a single INS SV call if they are proximal and have consistent orientations, e.g. two partially-assembled insertions that derive from a single large insertion. If no such pairing can be found for a partially-assembled insertion sequence and it does not match any MEI sequence (via the workflow described in [Mobile Element Insertions Detection](#mobile-element-insertions-detection)), no SV record is created.
* Tandem Duplications
* Inversions
* Unclassified breakpoints corresponding to intra and inter-chromosomal translocations, or complex structural variants. These are reported as a matching pair of VCF `BND` records as per [VCF v4.2 Section 5.4](https://samtools.github.io/hts-specs/VCFv4.2.pdf).

### Mobile Element Insertions Detection

The general purpose SV routine can detect Mobile Element Insertions (MEIs) with assembled inserted sequences like other regular insertions. If missed by the general purpose SV routine, MEIs will be rescued by the MEI specific routine based on similarity between assembled contigs and known sequences in the MEI catalog.

* The MEI catalog based rescuing functionality is enabled by default via `--sv-enable-mobile-element-sequences=true`.
* By default, the MEI catalog is read from the file `<INSTALL_PATH>/config/sv_mobile_element_sequences.fa`, which is the default value of `--sv-mobile-element-sequences-file`. The catalog contains sequences of common mobile elements from the [Dfam](https://www.dfam.org/home) database, including `Alu`, `LINE1`, and `SVA` subfamilies. You can customize the catalog with additional sequences or replace it with a different set of sequences.
* The rescued records are placed in the same VCF as the calls from the general purpose SV routine and are presented as regular INS events.

An example of such a rescued record:

```
chr10 2394759 DRAGEN:INS:10001:0:0:0:3:0 T TAATGAGGACATTATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTNNNNNGGGTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTCGCTCTGTCGCCCAGGCCGGACTGTGGACTGCAGTGGCGCAATCTCGGCTCACTGCAAGCTCCGCTTCCCGGGTTCACGCCATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCGCGCCCGGCTAATTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCTTGTTAGCCAGGATGGTCTCGATCTCCTGACCTCATGATCCACCCGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCGCGCCCGGCC 999 PASS END=2394759;SVTYPE=INS;CONTIG=ATATGTGTCTTTGCTAGGTATTGCCAAATTTATCTCCAGAAATCTTGCACAAATCTGTACTCCTGTTAGCAATGTGTGCGTATACCTGCTTCCACATGACCTCAGTAAAAGAATGTGTTGTCATATTGGTATTGAAATTTTAGCACTGTAAGCAACAGGTCATTTTGGAAAACCTGAGCTTTCGCCAAATTCAGCTATTTTGATTTGCTTTTATTATTAGCATATACCAAAATAAATAGGCATATTAGAGTTTCCTTTCTTGCATCTTAAAATTCATCTAACACATCTATAATAACATTCTTTTCTTTTTTTTCCATTCTAGGACTTGCCCCTTTCGTCTATTTGTCAGACGAATGTTACAATTTACTGGCAATAAAGTTTTGGATAGACCTTAATGAGGACATTATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGGTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTCGCTCTGTCGCCCAGGCCGGACTGTGGACTGCAGTGGCGCAATCTCGGCTCACTGCAAGCTCCGCTTCCCGGGTTCACGCCATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCGCGCCCGGCTAATTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCTTGTTAGCCAGGATGGTCTCGATCTCCTGACCTCATGATCCACCCGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCGCGCCCGGCCAATGAGGACATTATTAAGCCTCATATGTTAATTGCTGCAAGCAACCTCCAGTGGCGACCAGAATCCAAATCAGGCCTTCTTACTTTATTTGCTGGAGATTTTTCTGTGTTTTCTGCTAGTCCAAAAGAGGGCCACTTTCAAGAGACATTCAACAAAATGAAAAATACTGTTGAGGTAAGGTTACTTTTCAGCATCACCACACATTTTGGTATTTTTCTATTTTGACAGTCCAGTATCAAGGAAATAGCTTTTATACAAATTGGATAGTTGAGGTAGTATGTGAGGTAAAGTTTAATCATATATTAATTGCCCATGAACCTCAGGAGATGGGGGAATGGGGAAATGACAGCAACTAGAAAGAGAAGAATGACTTGAAGGGAAATGAGTTAGGAGAAATTGTGAGAAGGATGTTCAGAAATGCAGACTTTGTAAGCAAACTGGAAATTGGTTACAAGAATAATATGAGTTATCTGTGGTTTGCAGCAGTCAGCAGTGTGATTGAGGATCACAAGGTCAGGGGTTCAAGACCAGCCTGGGCAAGATGAGTTTTCAGT;CIPOS=0,15;CIEND=0,15;HOMLEN=15;HOMSEQ=AATGAGGACATTATT;LEFT_SVINSSEQ=AATGAGGACATTATTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT;RIGHT_SVINSSEQ=GGGTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTCGCTCTGTCGCCCAGGCCGGACTGTGGACTGCAGTGGCGCAATCTCGGCTCACTGCAAGCTCCGCTTCCCGGGTTCACGCCATTCTCCTGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCGCGCCCGGCTAATTTTTTGTATTTTTAGTAGAGACGGGGTTTCACCTTGTTAGCCAGGATGGTCTCGATCTCCTGACCTCATGATCCACCCGCCTCGGCCTCCCAAAGTGCTGGGATTACAGGCGTGAGCCACCGCGCCCGGCC GT:GQ:PL:PR:SR:SB:FS:VF 0/1:105:999,0,102:9,9:10,23:5,5,21,2:17.936:13,32
```
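
To inspect records like the one above programmatically, the INFO column can be split into key/value pairs. The following is a minimal sketch using only the Python standard library; the `parse_info` helper and the shortened INFO string are hypothetical and only modeled on the record shown. The check on `LEFT_SVINSSEQ` and `RIGHT_SVINSSEQ` assumes the Manta-style convention in which these keys carry the known left and right flanks of an insertion whose full sequence could not be assembled.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# Shortened, hypothetical INFO string modeled on the record above.
info = (
    "END=2394759;SVTYPE=INS;CIPOS=0,15;HOMLEN=15;HOMSEQ=AATGAGGACATTATT;"
    "LEFT_SVINSSEQ=AATGAGG;RIGHT_SVINSSEQ=GGGTGTT"
)
fields = parse_info(info)

# True when both flanks of a partially assembled insertion are reported.
has_both_flanks = (
    fields["SVTYPE"] == "INS"
    and "LEFT_SVINSSEQ" in fields
    and "RIGHT_SVINSSEQ" in fields
)
```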

### Known Limitations

The SV caller cannot directly discover the following variant types:

* Dispersed duplications.
  * Dispersed duplications may be indirectly called as insertions or unclassified breakends.
* Most expansion/contraction variants of a reference tandem repeat.
* Breakends corresponding to small inversions.
  * The limiting size is not tested, but in theory, detection falls off below \~200 bases. Micro-inversions might be detected indirectly as combined insertion/deletion variants.
* Fully-assembled large insertions.
  * The maximum fully-assembled insertion size should correspond to approximately twice the read-pair fragment size, but power to fully assemble the insertion should fall off to impractical levels before this size.
  * The SV caller does detect and report very large insertions when the breakend signature of such an event is found, even though the inserted sequence cannot be fully assembled.
* Large germline deletions and duplications
  * The SV caller does not report germline deletions and duplications larger than 1 Mb because it relies on split-read and read-pair evidence at breakpoint loci, which is insufficient to report the variants spanning such large regions with high confidence. The size limits can be adjusted using the options, `--sv-max-del-scored-variant-size` (deletions) and `--sv-max-dup-scored-variant-size` (duplications).
* Fold-back inversions
  * Fold-back inversions in which the start and end positions are less than 1 kbp apart are not reliably called.
* Increased runtime due to poor quality reads.
  * High levels of discordant alignments can burden the SV caller by generating an excessive number of structural variant candidates, leading to increased runtime. The DRAGEN aligner provides three key metrics to gauge potential sample quality issues: soft-clipped bases, supplementary alignments (indicating chimeric reads) and improperly paired reads (indicating discordant reads). For well-behaved samples, these metrics typically remain below 10% of the total aligned bases or reads. If these metrics are excessively high, it is recommended to investigate potential sources of error upstream. This may include library preparation protocols, input material quantity, and sequencing run quality. We also recommend loading the sample into IGV and comparing to more well-behaved samples.

More general repeat-based limitations exist for all variant types:

* Power to assemble variants to breakend resolution falls to zero as breakend repeat length approaches the read size.
* Power to detect any breakend falls to (nearly) zero as the breakend repeat length approaches the fragment size.

While the SV caller classifies certain novel DNA-adjacencies into variant classes, it has a limited ability to infer high-level events resulting from complex rearrangements, so certain calls summarized as deletions, duplications, and insertions might be better described by looking at the full system of breakends and copy number changes associated with a given event. For example, a set of overlapping deletions, duplications and inversion-like breakpoints could form a chromothriptic rearrangement, or the two sides of an insertion of unknown length may not actually be connected and could instead form a balanced interchromosomal translocation. Care should be taken when interpreting somatic SVs as complex rearrangements are common in cancer and the SV caller classifications are only valid for simple isolated SVs.

### Systematic Noise Filtering

When DRAGEN-SV is used in somatic mode (tumor-only or tumor-normal), a set of paired regions in BEDPE file format can be specified to filter out sequencing/systematic noise as well as recurrent germline calls. Any variant that overlaps one of the systematic noise paired regions (with a population count of at least 2) and has the same orientation is marked as `SystematicNoise` in the final VCF file. The BEDPE file is passed via the command line option `--sv-systematic-noise`.
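
The matching rule can be illustrated with a small sketch. This is not DRAGEN's implementation: the `PairedRegion` type, the half-open interval convention, and the requirement that both sides of the variant overlap the corresponding sides of the noise entry are assumptions made for illustration. Only the population count threshold of 2 and the same-orientation requirement come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedRegion:
    contig1: str
    start1: int
    end1: int
    contig2: str
    start2: int
    end2: int
    orientation1: str
    orientation2: str
    count: int = 1  # population count; only meaningful for noise entries


def overlaps(contig_a, start_a, end_a, contig_b, start_b, end_b):
    """True if two half-open intervals on the same contig share a base."""
    return contig_a == contig_b and start_a < end_b and start_b < end_a


def is_systematic_noise(variant, noise_regions, min_count=2):
    """Assumed rule: both sides overlap, orientations match, count >= min_count."""
    for n in noise_regions:
        same_orientation = (n.orientation1, n.orientation2) == (
            variant.orientation1,
            variant.orientation2,
        )
        if (
            n.count >= min_count
            and same_orientation
            and overlaps(variant.contig1, variant.start1, variant.end1,
                         n.contig1, n.start1, n.end1)
            and overlaps(variant.contig2, variant.start2, variant.end2,
                         n.contig2, n.start2, n.end2)
        ):
            return True
    return False


# Hypothetical noise entry and variant calls.
noise = [PairedRegion("chr1", 1000, 1050, "chr5", 2000, 2040, "+", "-", count=3)]
call = PairedRegion("chr1", 1040, 1041, "chr5", 2010, 2011, "+", "-")
flipped = PairedRegion("chr1", 1040, 1041, "chr5", 2010, 2011, "-", "+")
```

Here `is_systematic_noise(call, noise)` is true, while the call with flipped orientations is not matched.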

The systematic noise BEDPE file is built using VCFs that were generated by the DRAGEN-SV tumor-only pipeline when run on normal samples that do not necessarily match to the subject the tumor sample was taken from. The file might contain several dozen samples.

#### Generating systematic noise BEDPE file

You can generate systematic noise BEDPE files from normal samples collected using the same library prep, sequencing system, and panels as the samples to be analyzed.

To generate a BEDPE file, do as follows.

1. Run DRAGEN somatic tumor-only on normal samples with `--sv-detect-systematic-noise` set to `true` to generate one VCF output per normal sample.
2. Build the BEDPE file from the VCFs with `--sv-build-systematic-noise-vcfs-list`, which takes a file listing the input VCFs from the previous step, one VCF per line. An example command line is provided below.

```
dragen \
-r <HASHTABLE> \
--sv-build-systematic-noise-vcfs-list <LIST OF VCF FILES> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE>
```
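
The list file for `--sv-build-systematic-noise-vcfs-list` contains one VCF path per line. A minimal sketch of assembling that content follows; the `format_vcf_list` helper and the example paths are hypothetical, and sorting and de-duplicating the paths is a convenience choice rather than a DRAGEN requirement.

```python
def format_vcf_list(vcf_paths):
    """Return list-file content: one VCF path per line, sorted, no blanks."""
    cleaned = sorted({p.strip() for p in vcf_paths if p.strip()})
    return "\n".join(cleaned) + "\n"


# Hypothetical per-sample VCF outputs from step 1.
paths = [
    "/data/normal_B/normal_B.sv.vcf.gz",
    "/data/normal_A/normal_A.sv.vcf.gz",
    "",
]
list_text = format_vcf_list(paths)
# Write list_text to a file and pass that file's path to
# --sv-build-systematic-noise-vcfs-list.
```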

You can also build systematic noise BEDPE files in the cloud using the [DRAGEN Baseline Builder App on BaseSpace](https://www.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/dragen-baseline-builder.html) or the DRAGEN Systematic Noise File Builder Pipeline on [ICA](https://www.illumina.com/products/by-type/informatics-products/connected-analytics.html).

#### Pre-built SV systematic noise BEDPE files

The following prebuilt systematic noise files for WGS are available for download on the [DRAGEN Software Support Site page](https://support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform/product_files.html). It is recommended to select the one that best matches your library prep and application and ideally to generate it from your own set of samples. Each systematic noise file contains a version string that DRAGEN uses to check the compatibility by default and exits early if a wrong systematic noise file is provided. More details are provided in the README within the downloadable package.

#### SV systematic noise BEDPE file format

The systematic noise BEDPE is formatted as follows:

| ID              | Description                                                                                                                                                                          |
| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| contig1         | chromosome of the first region (string)                                                                                                                                              |
| start1          | start position of the first region (0-based left-most position of the first breakpoint containing genomic interval, integer)                                                         |
| end1            | end position of the first region (0-based right-most position of the first breakpoint containing genomic interval, integer)                                                          |
| contig2         | chromosome of the second region (string)                                                                                                                                             |
| start2          | start position of the second region (0-based left-most position of the second breakpoint containing genomic interval, integer)                                                       |
| end2            | end position of the second region (0-based right-most position of the second breakpoint containing genomic interval, integer)                                                        |
| event\_id       | The paired region unique ID (string)                                                                                                                                                 |
| score           | The number of occurrences in the cohort                                                                                                                                              |
| orientation1    | direction of breakpoint1 relative to the reference; "+" indicates to the right, "-" to the left (string, "+", or "-")                                                                |
| orientation2    | direction of breakpoint2 relative to the reference; "+" indicates to the right, "-" to the left (string, "+", or "-")                                                                |
| assembly-status | If all variants used to generate the noise candidate have end-to-end local assemblies, noise candidate is "precise", otherwise it is "imprecise" (string, "precise", or "imprecise") |
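
A minimal sketch of reading one record in this layout follows. It assumes the file is tab-separated with the eleven columns in the order listed in the table, which is the usual BEDPE convention but is not stated explicitly here; the `NoiseRecord` type, the `parse_noise_line` helper, and the example values are hypothetical. Header or version lines are not handled.

```python
from typing import NamedTuple


class NoiseRecord(NamedTuple):
    contig1: str
    start1: int
    end1: int
    contig2: str
    start2: int
    end2: int
    event_id: str
    score: int
    orientation1: str
    orientation2: str
    assembly_status: str


def parse_noise_line(line):
    """Parse one tab-separated data line into a NoiseRecord."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError(f"expected 11 columns, got {len(f)}")
    return NoiseRecord(
        f[0], int(f[1]), int(f[2]),
        f[3], int(f[4]), int(f[5]),
        f[6], int(f[7]), f[8], f[9], f[10],
    )


# Hypothetical line; the values are made up for illustration.
line = "chr1\t1000\t1050\tchr5\t2000\t2040\tevt_1\t3\t+\t-\timprecise\n"
record = parse_noise_line(line)
```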

### SV Scoring

The SV caller applies a diploid scoring model for one or more diploid samples (treated as unrelated), as well as a somatic scoring model when a tumor and matched normal sample pair are given.

#### Germline scoring model

The germline scoring model produces diploid genotype probabilities for each candidate structural variant. Most candidates are approximated as independent for scoring purposes and modeled under a bi-allelic, diploid genotype likelihood setting. DRAGEN solves for the posterior probability over possible genotypes given the sequencing data by approximating it as proportional to the product of the prior probability of a genotype and the conditional probability of observing a set of read fragments (post-filtering) given the underlying genotype. DRAGEN treats each read fragment independently and represents the conditional probability of the set of read fragments as the product of the conditional probabilities of the individual read fragments. For each read fragment's conditional probability, DRAGEN combines paired-read and split-read evidence components and approximates their contributions as independent by taking the product of these components. The paired-read component is weighted by a linear ramp from one to zero depending on the candidate event type and size, because a tiny event does not significantly affect paired-read mapping status.

* The paired-read component is modeled as a function measuring the deviation of the inferred fragment length from the overall distribution.
* The split-read component is modeled as a function measuring the correctness of a read alignment to a breakend by multiplying, across all non-gap bases, the probability of observing a certain base call given the corresponding base of the evaluated allele.

Each read fragment may contribute only paired-read support, only split-read support, or both. Where a fragment contributes split-read support, this support may come from either or both reads in the read pair.
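
The structure of this model can be sketched in a few lines. The following is an illustration under stated assumptions, not DRAGEN's scoring code: the flat prior, the ramp bounds in `pair_weight`, applying the ramp weight as an exponent on the paired-read component, and modeling a heterozygous genotype as an equal mixture of the two alleles are all choices made for this example.

```python
import math

# Hypothetical flat prior; the actual prior is not described in this section.
PRIOR = {"0/0": 1 / 3, "0/1": 1 / 3, "1/1": 1 / 3}
ALT_FRACTION = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}


def pair_weight(event_size, ramp_start=50, ramp_end=500):
    """Linear ramp: 0 for tiny events, 1 for large ones (bounds are hypothetical)."""
    if event_size <= ramp_start:
        return 0.0
    if event_size >= ramp_end:
        return 1.0
    return (event_size - ramp_start) / (ramp_end - ramp_start)


def fragment_likelihood(pair_lik, split_lik, weight):
    """Product of the two components; exponent weighting is an assumption."""
    return (pair_lik ** weight) * split_lik


def genotype_posteriors(fragments):
    """fragments: list of (likelihood given ref allele, likelihood given alt allele)."""
    log_post = {}
    for gt, prior in PRIOR.items():
        f = ALT_FRACTION[gt]
        total = math.log(prior)
        for ref_lik, alt_lik in fragments:
            # Floor avoids log(0) when a fragment is impossible under a genotype.
            total += math.log(max((1 - f) * ref_lik + f * alt_lik, 1e-300))
        log_post[gt] = total
    m = max(log_post.values())
    norm = sum(math.exp(v - m) for v in log_post.values())
    return {gt: math.exp(v - m) / norm for gt, v in log_post.items()}


# Five fragments favoring the reference allele and five favoring the variant.
fragments = [(0.9, 0.01)] * 5 + [(0.01, 0.9)] * 5
posteriors = genotype_posteriors(fragments)
best = max(posteriors, key=posteriors.get)
```

With evenly split support, the heterozygous genotype receives the highest posterior in this toy example.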

#### Somatic scoring model

The somatic scoring model is a Bayesian probabilistic model using a tumor-normal joint genotyping approach. It aims to call somatic structural variants in tumors while avoiding germline variants and noisy variants. In this model, the tumor and normal allele frequencies are treated as nonindependent random variables. DRAGEN calculates posterior probabilities for a range of genotype hypotheses, under the assumption that the normal sample conforms to the diploid germline genotype considering homozygous reference, heterozygous, and homozygous states. The tumor sample is a mixture of the germline genotype and, if present, the somatic allele. For the somatic genotype, we consider only two states referring to the absence and presence of the somatic variant in the tumor sample. In cases where the somatic variant is not present, we account for unsystematic independent noise, while assuming an error-free scenario when the somatic variant is considered. To calculate the genotype likelihoods, the model integrates allele frequency likelihoods over the joint tumor and normal allele frequencies and applies modifications to address liquid tumors with Tumor-in-Normal (TiN) contamination. The integration is approximated with a discrete summation. In these calculations, the likelihood for each read to support a given allele is shared with the germline scoring model. The tumor-only somatic scoring model is seen as a special case of the somatic scoring model in the absence of normal data (zero coverage). The posterior probability is converted into a Phred quality score and reported in the VCF output INFO/SOMATICSCORE field. In somatic mode, a genotype state (SAMPLE/GT) and genotype posterior probabilities (SAMPLE/GQ) are not reported out, as the diploid assumption may not be valid under a tumor analysis.
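
The conversion of a posterior probability into a Phred-scaled score follows the standard definition, Q = -10 * log10(1 - P). The sketch below shows that arithmetic only; the `phred_score` helper, the rounding to an integer, and the cap value are assumptions for illustration and may differ from how DRAGEN populates INFO/SOMATICSCORE.

```python
import math


def phred_score(posterior, cap=999):
    """Convert P(somatic) into a Phred-scaled score: -10 * log10(1 - P).

    The cap is a hypothetical upper bound that avoids infinity when the
    posterior is numerically 1.
    """
    error = 1.0 - posterior
    if error <= 0.0:
        return cap
    return min(cap, int(round(-10.0 * math.log10(error))))


score_90 = phred_score(0.9)     # 10% chance the call is wrong
score_999 = phred_score(0.999)  # 0.1% chance the call is wrong
```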

### Large Contig Filter

The large contig filter improves SV calling precision by filtering SVs whose assembled contigs fail to corroborate the underlying breakends of the reported variant. It is enabled by default in both germline and somatic mode, but can be disabled with `--sv-enable-large-contig-filter=false`. For germline variant calling, this filter is only applied to inter-chromosomal breakends (translocations).

#### Filter Methodology

When an SV is called and an assembled contig is available, DRAGEN realigns the contig back to the reference genome. For each breakend of a true SV, there should be high quality alignment of part of the SV contig to the region near the breakend with the alignment orientation consistent with the breakend. If either of the underlying breakends of an SV does not have such an alignment, the `LargeContigFilter` filter is applied. Regardless of whether or not the record passes the filter, the `LCF` tag is appended to the `INFO` field of the SV to indicate that it was processed by the large contig filter.

#### Filtering Criteria

The large contig filter will process all variants that meet both of the following criteria:

* The SV is an inter-chromosomal BND, or the SV is for a somatic sample and the underlying breakends of the SV are at least 1kbp apart.
* The SV's assembled contig length is at least 100bp.

For variants meeting these criteria, the filter evaluates whether the assembled contig successfully realigns to the regions ±200bp of each breakend with:

* Mapping quality (MAPQ) ≥ 40
* Alignment identity ≥ 90%
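
The criteria above can be expressed as a short decision sketch. This is an illustration, not DRAGEN's implementation: the `ContigAlignment` type and the function names are hypothetical, the ±200bp window is interpreted as any overlap between the alignment and that window, and the orientation consistency check described in the methodology is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class ContigAlignment:
    chrom: str
    start: int
    end: int
    mapq: int
    identity: float  # fraction of matching bases, 0..1


def is_processed(is_interchrom_bnd, is_somatic, breakend_distance, contig_length):
    """Does the large contig filter evaluate this SV at all?"""
    in_scope = is_interchrom_bnd or (
        is_somatic and breakend_distance is not None and breakend_distance >= 1000
    )
    return bool(in_scope and contig_length >= 100)


def breakend_supported(chrom, pos, alignments, window=200,
                       min_mapq=40, min_identity=0.90):
    """True if some realigned contig segment lands near the breakend."""
    for a in alignments:
        near = a.chrom == chrom and a.start <= pos + window and a.end >= pos - window
        if near and a.mapq >= min_mapq and a.identity >= min_identity:
            return True
    return False


def fails_large_contig_filter(breakends, alignments):
    """The filter is applied if either breakend lacks a supporting alignment."""
    return not all(breakend_supported(c, p, alignments) for c, p in breakends)


# Hypothetical translocation: the chr7 side realigns with low MAPQ.
alignments = [
    ContigAlignment("chr1", 900, 1100, mapq=60, identity=0.98),
    ContigAlignment("chr7", 5000, 5150, mapq=20, identity=0.99),
]
breakends = [("chr1", 1000), ("chr7", 5100)]
filtered = fails_large_contig_filter(breakends, alignments)
```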

### Input Requirements

When running the SV caller, the input sequencing reads must be from a standard Illumina paired-end sequencing assay with an FR read pair orientation, where for each sequence fragment, a read proceeds from each end of the fragment inwards. For more information, see [DRAGEN SV caller Capabilities](#dragen-sv-caller-capabilities).

The SV caller is optimized for paired-end libraries where the fragment size is typically larger than the size of both reads. Overlapping read pairs can be used to discover SVs, but might not always be handled optimally. For libraries where the typical fragment size is less than the read length, the SV caller attempts to differentiate reads sequencing into adapter sequence from the variant signal. In such cases, the SV caller's input quality checks may fail and cause SV analysis to be skipped.

If using standalone mode, your BAM/CRAM inputs must already be mapped and aligned. If you have not mapped and aligned your data yet, you can generate an alignment file as described in [Generate an Alignment File](#generate-an-alignment-file).

#### Alignment Contig Checks

If running from a mapped and aligned BAM, then the contigs specified in the header must strictly match those in the DRAGEN hashtable specified in the current analysis. Missing or extra contigs will lead to a "Reference genome mismatch" configuration error and the analysis will not proceed. If such an error is observed, it is recommended to regenerate the alignment file with the intended DRAGEN hashtable, or to run with the DRAGEN map/align module enabled.

#### Input Quality Checks

The SV caller runs quality checks on the input sequencing reads for each sample to make sure that the input corresponds to a paired-read assay with the expected FR orientation, before estimating the fragment size distribution. To check consensus read pair orientation, a subset of high-quality read pairs is sampled. At least 90% of these must have the expected FR orientation for SV analysis to continue. Otherwise, the SV caller issues a warning, skips any further analysis, and writes empty results to its output files.

The SV caller can tolerate non-paired reads in the input, if sufficient paired-end reads exist to estimate the fragment size distribution. To estimate the fragment size distribution, the SV caller requires at least 100 read pairs that meet the quality requirements of the estimation routine: both reads of the pair must map to the same chromosome with non-zero mapping quality, must not be filtered or part of a split-read mapping, and must not contain indels or soft-clipping. If a sample does not contain a sufficient number of such read pairs, the SV caller issues a warning, skips any further analysis, and writes empty results to its output files.
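
The two checks can be summarized in a short sketch. It is illustrative only: the function names and the dictionary keys describing a read pair are hypothetical, and how DRAGEN samples "high-quality" pairs is not modeled. Only the 90% FR threshold, the 100-pair minimum, and the listed pair requirements come from the text above.

```python
def passes_orientation_check(orientations, min_fraction=0.90):
    """orientations: labels such as "FR", "RF", "FF", "RR" for sampled pairs."""
    orientations = list(orientations)
    if not orientations:
        return False
    fr = sum(1 for o in orientations if o == "FR")
    return fr / len(orientations) >= min_fraction


def usable_for_fragment_size(pair):
    """pair: dict with hypothetical keys describing one read pair."""
    return (
        pair["mapq1"] > 0
        and pair["mapq2"] > 0
        and pair["chrom1"] == pair["chrom2"]
        and not pair["filtered"]
        and not pair["split"]
        and not pair["has_indel_or_softclip"]
    )


def can_estimate_fragment_size(pairs, min_pairs=100):
    return sum(1 for p in pairs if usable_for_fragment_size(p)) >= min_pairs


good_pair = {
    "mapq1": 60, "mapq2": 60, "chrom1": "chr1", "chrom2": "chr1",
    "filtered": False, "split": False, "has_indel_or_softclip": False,
}
clipped_pair = dict(good_pair, has_indel_or_softclip=True)

orientation_ok = passes_orientation_check(["FR"] * 95 + ["RF"] * 5)
enough_pairs = can_estimate_fragment_size([good_pair] * 100 + [clipped_pair] * 50)
```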

#### Read Groups

The SV caller disregards any read group labels applied to the input sequences. Each input sample is treated as a separate library with a single fragment size distribution.

#### File Format

In standalone mode, input sequencing reads must be mapped and provided as input in either BAM or CRAM format. Each input file must be coordinate sorted and indexed to produce an htslib-style index in a file named to match the input BAM or CRAM file with an additional `.bai`, `.crai`, or `.csi` file name extension. For more information on standalone mode, see [Modes of Operation](#modes-of-operation).

At least one BAM or CRAM file must be provided for the normal or tumor sample. A matched tumor-normal sample pair can be provided as well. If multiple input files are provided for the normal sample, each file is treated as a separate sample as part of a joint diploid sample analysis.

In standalone mode, input BAM or CRAM files are subject to the following limitations:

* Alignments cannot have an unknown read sequence (SEQ="\*")
* Alignments cannot contain the "=" character in the SEQ field.
* Alignments cannot use the sequence match/mismatch ("="/"X") CIGAR notation.
* RG (read group) tags in the alignment records are ignored. Each alignment file is treated as representing one sample.
* Alignments with base call quality values greater than 70 are rejected. These are not supported on the assumption that this indicates an offset error.
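
These limitations can be pre-checked on SAM text records before running the SV caller. The following is a minimal sketch operating on the raw SEQ, CIGAR, and QUAL fields; the `alignment_problems` helper is hypothetical, and it assumes the standard SAM encoding of base qualities as ASCII characters with an offset of 33.

```python
def alignment_problems(seq, cigar, qual, phred_offset=33):
    """Return a list of reasons why a record conflicts with the limitations."""
    problems = []
    if seq == "*":
        problems.append("unknown read sequence")
    if "=" in seq:
        problems.append("'=' in SEQ")
    if "=" in cigar or "X" in cigar:
        problems.append("match/mismatch CIGAR operators")
    # QUAL "*" means qualities are not stored, so there is nothing to check.
    if qual != "*" and any(ord(c) - phred_offset > 70 for c in qual):
        problems.append("base quality above 70")
    return problems


ok = alignment_problems("ACGT", "4M", "IIII")          # 'I' encodes Q40
bad_cigar = alignment_problems("ACGT", "2=1X1M", "IIII")
bad_qual = alignment_problems("ACGT", "4M", "III~")    # '~' encodes Q93
```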

#### Generate an Alignment File

The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the pipeline.

You need to generate alignment files for all samples that have not already been mapped and aligned. Each sample must have a unique sample identifier. Use the `--RGSM` option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the `--RGSM` option is not required.

The following example command maps and aligns a FASTQ file:

```
dragen \
-r <HASHTABLE> \
-1 <FASTQ1> \
-2 <FASTQ2> \
--RGSM <SAMPLE> \
--RGID <RGID> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true
```

The following example command maps and aligns an existing BAM file:

```
dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true
```

The following example command maps and aligns an existing CRAM file:

```
dragen \
-r <HASHTABLE> \
--cram-input <CRAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align true
```

### Exome/Targeted Calling

The SV caller can be configured for targeted sequencing inputs, which disables high-depth filters. Exome mode can be directly set to true or false with the command line option `--sv-exome`. If not directly set, exome mode defaults to false.

### Targeted Somatic Panel Calling

To enable targeted calling for somatic panels, set the options `--sv-enable-liquid true` or `--sv-enable-solid true` for liquid and solid biopsies, respectively. Additionally, use the option `--sv-call-regions-bed ${BEDFILE}` to specify the target regions in a BED file. For details on command line options, see [DRAGEN recipes for somatic pipelines](https://help.dragen.illumina.com/product-guides/dragen-recipes#somatic-pipelines).

> Note: The `sv-enable-liquid` option only applies to targeted panels of liquid biopsies (e.g. ctDNA), and is different from `sv-enable-liquid-tumor-mode` which applies a specialized scoring model for liquid tumors (e.g. leukemia). See [Liquid Tumor Calling](#liquid-tumor-calling-and-tumor-in-normal-contamination) for details on liquid tumor mode.

### Internal Tandem Duplications Calling

You can use the `--sv-somatic-ins-tandup-hotspot-regions-bed ${BEDFILE}` option to specify internal tandem duplication (ITD) hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from `<INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed`. The file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions padded by 300 bp on both sides. To disable this feature, enter `--sv-enable-somatic-ins-tandup-hotspot-regions false`.

### Liquid Tumor Calling and Tumor-in-normal Contamination

Liquid tumors in liquid tumor mode refer to hematological cancers, such as leukemia. In tumor-normal analysis, DRAGEN accounts for Tumor-in-Normal (TiN) contamination by running liquid tumor mode. TiN contamination is accounted for by allowing a non-zero variant allele frequency for the matched normal when calculating the posterior probability of the somatic state. If the contamination is not accounted for, it can severely impact sensitivity by suppressing true somatic variants.

> Note: Liquid tumors are not equivalent to liquid biopsies. For targeted panels used for liquid biopsies, such as ctDNA assays, refer to [Targeted Somatic Panel Calling](#targeted-somatic-panel-calling).

Use the following two options to control liquid tumor mode behavior.

* `--sv-enable-liquid-tumor-mode` ---Enable liquid tumor mode. Liquid tumor mode is disabled by default.
* `--sv-tin-contam-tolerance` ---Set the TiN contamination tolerance level. DRAGEN calls variants in the presence of TiN contamination up to a specified maximum tolerance level. You can enter any value between 0–1. The default maximum TiN contamination tolerance is SV\_TIN\_CONTAM\_TOLERANCE ([default values](#default-values)). If using the default value, somatic variants are expected to be observed in the normal sample with allele frequencies up to (SV\_TIN\_CONTAM\_TOLERANCE \* 100)% of the corresponding allele in the tumor sample.
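
The relationship between the tolerance and the allele frequency expected in the normal sample can be illustrated with a small calculation. This is a minimal sketch of the arithmetic described above, not DRAGEN's scoring model. The function name is hypothetical, and the tolerance of 0.15 is an illustrative value only, not the documented default.

```python
def max_expected_normal_af(tumor_af: float, tin_contam_tolerance: float) -> float:
    """Upper bound on the allele frequency tolerated in the matched normal.

    Mirrors the statement that somatic variants may be observed in the normal
    sample at up to (tolerance * 100)% of the tumor allele frequency.
    """
    if not 0.0 <= tin_contam_tolerance <= 1.0:
        raise ValueError("tin_contam_tolerance must be between 0 and 1")
    if not 0.0 <= tumor_af <= 1.0:
        raise ValueError("tumor_af must be between 0 and 1")
    return tumor_af * tin_contam_tolerance


# Illustrative values only: tumor allele frequency 0.40, tolerance 0.15.
example_bound = max_expected_normal_af(tumor_af=0.40, tin_contam_tolerance=0.15)
```

With these illustrative inputs, a variant at 40% allele frequency in the tumor would be tolerated in the normal sample at up to about 6%.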

## Command Line Options

The following command line options are supported for the Structural Variant Caller.

### Input and Output Options

The following are the top-level options that are shared with the DRAGEN Host Software to control the SV pipeline. You can use BAM and CRAM files as input. Alternatively, if using read mapping with the SV calling in a single run, you can use all of the DRAGEN input options, including FASTQ, BAM, and CRAM files.

* `--cram-input`---The CRAM file to be processed.
* `--tumor-cram-input`---If performing tumor-normal or tumor-only analysis, the tumor CRAM file to be processed.
* `--fastq-file1`, `--fastq-file2`, `--fastq-list`---Input FASTQ files or a list of files to be processed.
* `--tumor-fastq1`, `--tumor-fastq2`, `--tumor-fastq-list`---Input tumor FASTQ file or list of files to be processed.
* `--enable-map-align`---Enables DRAGEN map/align. The default is true, so all input reads are remapped and aligned unless the option is set to false.
* `--output-directory`---Output directory where all results are stored.
* `--output-file-prefix`---Output file prefix that will be prepended to all result file names.
* `--ref-dir`---The DRAGEN reference genome hashtable directory. For more information about the reference genome hashtable, see [Prepare a Reference Genome](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-reference-support/prepare-a-reference-genome).
* `--bam-input`---The BAM file to be processed.
* `--tumor-bam-input`---If performing tumor-normal or tumor-only analysis, the tumor BAM file to be processed.

### Structural Variant Caller Pipeline Options

* `--enable-sv` ---Enable or disable the structural variant caller. The default is false.
* `--sv-target-bed` ---Specifies a BED file containing the set of regions to call. Optionally, you can compress the file in gzip or bgzip format. SVs with **both** breakends within the specified regions will be called.
* `--sv-locus-node-target-file` --- Specifies a BED file containing a set of target regions for locus nodes. Each locus node roughly corresponds to one SV breakend. SVs with **at least one** breakend within the specified regions will be called. This option makes the SV caller more sensitive than using `sv-target-bed`.
* `--sv-exclusion-bed` --- Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.
* `--sv-region` --- Limit the analysis to a specified region of the genome for debugging purposes. This option can be specified multiple times to build a list of regions. The value must be in the format "chr:startPos-endPos".
* `--sv-exome` --- Set to true to configure the variant caller for targeted sequencing inputs, which includes disabling high depth filters. The default is false.
* `--sv-output-contigs` --- Set to true to have assembled contig sequences output in a VCF file. The default is true.
* `--sv-discovery` --- Enable SV discovery. The default is true.
* `--sv-report-small-dup-as-ins` --- Set to true to report small duplications (<1000 bps) as insertions. The default is true.
* `--sv-use-overlap-pair-evidence` --- Allow overlapping read pairs to be considered as evidence. The default is false.
* `--sv-somatic-ins-tandup-hotspot-regions-bed` --- Specify a BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from `<INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed`.
* `--sv-enable-somatic-ins-tandup-hotspot-regions` --- Enable or disable the ITD hotspot region input. The default is true in somatic variant analysis.
* `--sv-enable-solid` --- Enable SV mode for solid panels. See [Targeted Somatic Panel Calling](#targeted-somatic-panel-calling).
* `--sv-enable-liquid` --- Enable SV mode for liquid panels. See [Targeted Somatic Panel Calling](#targeted-somatic-panel-calling). This option applies only for liquid biopsies, and is different from `sv-enable-liquid-tumor-mode` which applies to hematological cancer that accounts for tumor-in-normal contamination.
* `--sv-enable-liquid-tumor-mode` --- Enable liquid tumor mode. See [Liquid Tumor Calling](#liquid-tumor-calling-and-tumor-in-normal-contamination).
* `--sv-tin-contam-tolerance`--- Set the Tumor-in-Normal (TiN) contamination tolerance level. See [Liquid Tumor Calling](#liquid-tumor-calling-and-tumor-in-normal-contamination) for more information.
* `--sv-systematic-noise`--- Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bgzip compressed). For more information, see [Systematic Noise Filtering](#systematic-noise-filtering).
* `--sv-detect-systematic-noise`--- Set to true to generate VCF output per normal sample. For more information, see [Systematic Noise Filtering](#systematic-noise-filtering).
* `--sv-build-systematic-noise-vcfs-list` --- List of input VCFs from the previous step. Enter one VCF per line. For more information, see [Systematic Noise Filtering](#systematic-noise-filtering).
* `--sv-min-edge-observations`--- Remove all edges from the graph with fewer than this many observations. The default value is set to SV\_MIN\_EDGE\_OBSERVATIONS ([default values](#default-values)).
* `--sv-min-candidate-spanning-count`--- Run SV caller and report all large SVs with at least this many spanning support observations. The default value is set to SV\_MIN\_CANDIDATE\_SPANNING\_COUNT ([default values](#default-values)).
* `--sv-min-scored-variant-size`--- After candidate identification, only score and report SVs/indels at or above size of SV\_MIN\_SCORED\_VARIANT\_SIZE ([default values](#default-values)). This parameter doesn't affect the somatic hotspot region.
* `--sv-hotspot-min-scored-variant-size`--- After candidate identification, only score and report SVs/indels at or above this size inside the SV hotspot region, which includes FLT3, ARHGEF7, and KMT2A genes by default. The default value is set to SV\_HOTSPOT\_MIN\_SCORED\_VARIANT\_SIZE ([default values](#default-values)).
* `--sv-skip-parsing-ga-tag`--- By default SV caller will make use of the graph alignment tag (ga:Z) to improve SV calling sensitivity whenever a ga tag is present in the alignment record. This option provides a way to disable ga tag related functionalities in SV calling. The default value is set to false.
* `--sv-enable-methylation`--- Enable methylation-aware SV calling mode (Default=false).
* `--sv-ml-model` --- SV ML trained model location.
* `--sv-ml-metafile` --- Meta file for SV ML with versioned information.
* `--sv-enable-ml` --- If true, use SV ML filtering. (Default=true).
* `--sv-ml-enable-logging` --- If true, enable SV ML debugging mode (Default=false).
* `--sv-ml-enable-feature-extraction` --- Enable feature extraction for training the SV ML model (Default=false).
* `--sv-ml-min-pass-del-prob` --- Minimum pass probability in SV ML for deletions. For the default, see [default values](#default-values).
* `--sv-ml-min-pass-ins-prob` --- Minimum pass probability in SV ML for insertions. For the default, see [default values](#default-values).
* `--sv-ml-max-del-svlen` --- Maximum deletion size that SV ML model can be applied to (Default=DoubleMax).
* `--sv-ml-key` --- Key used for ML decryption.
* `--sv-skip-artifact-early-exit` --- Override early exit upon the detection of excessive artefact sequences to continue SV calling (Default=false).
* `--sv-enable-large-contig-filter` --- Enable the [large contig filter](#large-contig-filter) (Default=true).
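
The difference between `--sv-target-bed` (both breakends inside the regions) and `--sv-locus-node-target-file` (at least one breakend inside the regions) can be sketched as two predicates. This is an illustration of the stated rule only. The function names and tuple layouts are hypothetical, and the sketch assumes 0-based positions with BED-style half-open intervals.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chrom, start, end), half-open as in BED
Breakend = Tuple[str, int]     # (chrom, 0-based position)


def in_regions(breakend: Breakend, regions: List[Region]) -> bool:
    """Return True if the breakend falls inside any of the regions."""
    chrom, pos = breakend
    return any(c == chrom and start <= pos < end for c, start, end in regions)


def kept_by_target_bed(bnd1: Breakend, bnd2: Breakend, regions: List[Region]) -> bool:
    """Both breakends must lie inside the specified regions."""
    return in_regions(bnd1, regions) and in_regions(bnd2, regions)


def kept_by_locus_node_target(bnd1: Breakend, bnd2: Breakend, regions: List[Region]) -> bool:
    """At least one breakend must lie inside the specified regions."""
    return in_regions(bnd1, regions) or in_regions(bnd2, regions)
```

A translocation with one breakend in a target region and the other on a different chromosome fails the first predicate but passes the second, which is why the locus node option is described as more sensitive.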

## Modes of Operation

Structural Variant calling can run in the following modes:

* Standalone --- Uses mapped BAM/CRAM input files. If you have not mapped and aligned your data yet, see [Input Requirements](#input-requirements). This mode requires the following options:
  * `--enable-map-align false`
  * `--enable-sv true`
* Integrated --- Automatically runs on the output of the DRAGEN mapper/aligner. This mode requires the following options:
  * `--enable-map-align true`
  * `--enable-sv true`
  * `--enable-map-align-output true`
  * `--output-format bam`
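
For scripted pipelines, the two option sets above can be kept in one place. The helper below is a minimal sketch that only returns the mode-specific options listed above as an argument list. The function name is hypothetical, and inputs, reference, and output options still need to be added by the caller.

```python
from typing import List


def sv_mode_args(mode: str) -> List[str]:
    """Return the options listed above for the given SV calling mode."""
    if mode == "standalone":
        # Mapped BAM/CRAM input; map/align is switched off.
        return ["--enable-map-align", "false", "--enable-sv", "true"]
    if mode == "integrated":
        # SV calling runs on the output of the DRAGEN mapper/aligner.
        return [
            "--enable-map-align", "true",
            "--enable-sv", "true",
            "--enable-map-align-output", "true",
            "--output-format", "bam",
        ]
    raise ValueError(f"unknown mode: {mode!r}")
```

The returned list can be appended to a base command such as `["dragen", "--ref-dir", ref_dir, ...]` before it is passed to a process launcher.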

You can also enable Structural Variant calling with any other caller.

The following is an example command line for Integrated mode:

```
dragen -f \
--ref-dir=<HASH_TABLE> \
--enable-map-align true \
--enable-map-align-output true \
--enable-sv true \
--output-directory <OUT_DIR> \
--output-file-prefix <PREFIX> \
--RGID Illumina_RGID \
--RGSM <sample name> \
-1 <FASTQ1> \
-2 <FASTQ2>
```

The following is an example command line for joint diploid calling in standalone mode:

```
dragen -f \
--ref-dir <HASH_TABLE> \
--bam-input <BAM1> \
--bam-input <BAM2> \
--bam-input <BAM3> \
--enable-map-align false \
--enable-sv true \
--output-directory <OUT_DIR> \
--output-file-prefix <PREFIX>
```

## Structural Variant VCF Output

The structural variants VCF output file is available in the output directory. The file is named `<output-file-prefix>.sv.vcf.gz`. The contents of the file depend on the type of analysis.

For each major analysis category (germline, tumor-normal, and tumor-only), DRAGEN writes the VCF file with the variant calls made under the variant calling mode for that analysis type.

### VCF Output

VCF output follows the VCF 4.2 specification for describing structural variants. It uses standard field names wherever possible. All custom fields are described in the VCF header. The following sections provide information on the variant representation details and the primary VCF field values.

#### VCF Sample Names

Sample names in the VCF output are extracted from the first read group (@RG) record found in the header of each input alignment file. If no sample name is found, a default label (SAMPLE1, SAMPLE2, etc.) is used instead.
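
The naming rule can be sketched on plain SAM header text. This is an illustration, not DRAGEN's implementation. The function names are hypothetical, and the sketch assumes that the number in the default label is the position of the input file.

```python
from typing import List, Optional


def first_rg_sample(header_lines: List[str]) -> Optional[str]:
    """Return the SM value of the first @RG record, or None if absent."""
    for line in header_lines:
        if line.startswith("@RG"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SM:"):
                    return field[3:] or None
            return None  # first @RG record has no SM tag
    return None  # header has no @RG record


def vcf_sample_names(headers: List[List[str]]) -> List[str]:
    """One name per input file, falling back to SAMPLE<n> (assumed numbering)."""
    names = []
    for index, header in enumerate(headers, start=1):
        name = first_rg_sample(header)
        names.append(name if name else f"SAMPLE{index}")
    return names
```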

#### Small Indel Classification and Representation

A variant is classified as a small indel if all of the following criteria are met:

* The variant can be entirely expressed as a combination of inserted and deleted sequences.
* The deletion or insertion length is less than 1000 bases.
* The variant breakends and/or the inserted sequence are not imprecise.
* The variant has not been converted from a deletion to intra-chromosomal breakends by the depth-based SV classification routine.

All small indels are reported using full sequences in the VCF REF and ALT allele fields. Additionally, their VCF records include the CIGAR INFO tag describing the combined insertion and deletion event.
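
The four criteria can be read as a single conjunction. The sketch below is illustrative only. The class and field names are hypothetical, and it assumes that the length limit applies to both the deleted and the inserted sequence.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of one variant candidate."""
    is_pure_indel: bool                    # expressible as inserted plus deleted sequence
    deletion_length: int                   # bases deleted
    insertion_length: int                  # bases inserted
    is_imprecise: bool                     # breakends or inserted sequence are imprecise
    converted_to_breakends_by_depth: bool  # changed by depth-based classification


def is_small_indel(c: Candidate, size_limit: int = 1000) -> bool:
    """Return True only if every small indel criterion holds."""
    return (
        c.is_pure_indel
        and c.deletion_length < size_limit
        and c.insertion_length < size_limit
        and not c.is_imprecise
        and not c.converted_to_breakends_by_depth
    )
```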

#### Large Variant Representation

In somatic mode, variants that do not meet the "small indel" criteria described above are reported using **symbolic alleles**.

An example of a 1505-base deletion reported using a symbolic allele:

```
chr1	188818137	DRAGEN:DEL:26492:0:1:0:0:0	T	<DEL>	.	PASS	END=188819642;SVTYPE=DEL;SVLEN=-1505;CONTIG=TACATCTGGCTGATATTATGATGACAACAGATCTGTTAAGGCAAATGCTTTCCTGACAAGCTATTTAATTAGAATTACAATCATCTTCCCAGAAATGATCCTTAAAATAAATTGTCTACTGTAAATGGATGAAAGTAAATAGTTTGTGTTACAGATTTATTTACATCATACATTAAATAAAGAACTGAAAAAATCCCCATGAAAAACAAAATGTTATATATAACTATAGTATTAAAGTAATTAAATGATGTAAGTCTACACTATATATAACCTCAAAATTCATAAAATAAATATTTATAAAGAAAAATGACTAATAAAGTATTGTCTTTAAACTTGAAGACATTTTTAAAATTAGCCCTCTTTTTATTATATTATAATGTGAAAACCCAAAGGTTGAATTTCTAGGAATTTATTAATGA;CIPOS=0,2;CIEND=0,2;HOMLEN=2;HOMSEQ=TG;SOMATIC;SOMATICSCORE=102.26	PR:SR:VF	103,0:71,0:130,0	189,37:163,25:272,53
```

In germline mode, all variants are reported in the VCF using **full variant sequences** in the REF and ALT fields unless their sizes exceed the thresholds defined below:

* The variant is an insertion/deletion and its length is larger than or equal to 1000000 (1 million) bases.
* The variant is a duplication and its length is larger than or equal to 1000 bases.
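
The choice between symbolic alleles and full sequences can be summarized as a small decision function. This is a sketch of the rules stated in this section, restricted to insertions, deletions, and duplications. The function and parameter names are hypothetical.

```python
def uses_symbolic_allele(mode: str, svtype: str, length: int, small_indel: bool = False) -> bool:
    """Return True if the record would use a symbolic ALT allele such as <DEL>."""
    if svtype not in ("INS", "DEL", "DUP"):
        raise ValueError("this sketch covers INS, DEL, and DUP only")
    if mode == "somatic":
        # Anything that is not a small indel is reported symbolically.
        return not small_indel
    if mode == "germline":
        if svtype == "DUP":
            return length >= 1000
        return length >= 1_000_000  # INS or DEL
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, the 1505-base deletion shown above is symbolic in somatic mode, while a germline deletion of the same size would be written with full REF and ALT sequences.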

#### Insertions with Incomplete Insert Sequence Assembly

Large insertions are reported in some cases even when the insert sequence cannot be fully assembled. These can be identified by the `INCOMPLETEINS` `INFO` field. In addition, the incomplete insertion records will have `INFO` fields `LEFT_SVINSSEQ` and/or `RIGHT_SVINSSEQ` that describe the assembled left and right ends of the insert sequence. If the record was rescued due to a match to an MEI sequence, `INTEGRATION_TYPE=MEI` will be added to the `INFO` field as well.

In germline mode, the inserted sequence of an incomplete insertion is represented in the `ALT` field as a concatenation of `LEFT_SVINSSEQ`, 5 "N"s, and `RIGHT_SVINSSEQ`. In the `CONTIG` field, the left and right sequences are concatenated with 100 "N"s in between. The following is an example of such a record from HG002 mapped to hg38:

```
chr1	187497595	DRAGEN:INS:18526:0:0:0:4:0	TTA	TAATGTAATACTCAATATTGTGACAATGTCAGCTCTCTGTAAATTAAGCTATAAATTAAATGAAATGATAATTAACATCCTAATTGAGGCCTGGCACAATGGTTCATGCCTGTAATCCCAATACNNNNNTTCCCCACTGAAATCGATCTCACTAAGTCACTGATGTTCCCAAGTTTTTAACACAAGTTTCTGTCTTTGGTTTTCAGCCTAATTTTTTAGATATTACTACCAATAGAATGGTCTCCTGGATTT	967	PASS	END=187497597;SVTYPE=INS;CONTIG=CTAGGGAGGCCTCAGGAAGCTTACAGTCATGGCTGAATACAAAGAGAGAGCAAGCATGTCATATGGCAAAAGCAGGAACAAGTGAGAAAGAGAGGGGAGGCGGGGGAGATGCCACACCCTTTAAAACAACCAGATCTCTCCAGAACTCAGTCACTATCCTGAAGAAAGCACCAAGCCATGAGGGATCCGCCCCCATGATCCAAACATCTCCCATGGGGCCATGCCTCCCACATTGGGAATTACAATTCAACATGAGATTTGAGCAGGGACACATATCCAAACTGTTAGGGCCATTCTTTAGTCTATTTGTCTTTTGTGTCTAGATGAATGACTAACAGAGTAGCCATTTAATCAATGTGTTTTTTAATGAATGGAAAGATAAGAAATAGCAAAGATATTGAGGAAATCCAGGAGACCATTCTATTGGTAGTAATATCTAAAAAATTAGGCTGAAAACCAAAGACAGAAACTTGTGTTAAAAACTTGGGAACATCAGTGACTTAGTGAGATCGATTTCAGTGGGGAAGGGATAGATATGAATGCTAGAATGTAATGGATGAGAAATGGATGACAAGGGATGAATTAAAGTAAGACGGAAAAATGTATTGTTTCAAAAAAATTGAGAGTGACACCCAAAAACAGAAAGCATATTAATGTAATACTCAATATTGTGACAATGTCAGCTCTCTGTAAATTAAGCTATAAATTAAATGAAATGATAATTAACATCCTAATTGAGGCCTGGCACAATGGTTCATGCCTGTAATCCCAATACNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTCCCCACTGAAATCGATCTCACTAAGTCACTGATGTTCCCAAGTTTTTAACACAAGTTTCTGTCTTTGGTTTTCAGCCTAATTTTTTAGATATTACTACCAATAGAATGGTCTCCTGGATTTCCTTGTGAGAGAGATAGTCTTAATATTACTAACTTCAAGAATGAGAGAAATATTAAGTTTTGGGTTACAGAAGAAGGGCTCTAGAAGGAATTTAAATTCAATGGTGTTTATCTTCTGTTTGAAGGAGGGATAATTGAATTTCATGT;LEFT_SVINSSEQ=AATGTAATACTCAATATTGTGACAATGTCAGCTCTCTGTAAATTAAGCTATAAATTAAATGAAATGATAATTAACATCCTAATTGAGGCCTGGCACAATGGTTCATGCCTGTAATCCCAATAC;RIGHT_SVINSSEQ=TTCCCCACTGAAATCGATCTCACTAAGTCACTGATGTTCCCAAGTTTTTAACACAAGTTTCTGTCTTTGGTTTTCAGCCTAATTTTTTAGATATTACTACCAATAGAATGGTCTCCTGGATTT	GT:GQ:PL:PR:SR:SB:FS:VF	0/1:426:999,0,423:34,0:21,25:13,8,11,14:5.972:44,2
```
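
The two padding conventions can be sketched as follows. This is a minimal illustration with hypothetical helper names. Note that in the record above the `ALT` value also starts with a leading reference base, so the consistency check compares only the end of the `ALT` string.

```python
ALT_N_COUNT = 5       # "N"s between the two ends in the ALT field
CONTIG_N_COUNT = 100  # "N"s between the two sides in the CONTIG field


def joined_insert_sequence(left_svinsseq: str, right_svinsseq: str) -> str:
    """Inserted sequence as written in ALT for a germline incomplete insertion."""
    return left_svinsseq + "N" * ALT_N_COUNT + right_svinsseq


def alt_is_consistent(alt: str, left_svinsseq: str, right_svinsseq: str) -> bool:
    """Check that ALT ends with LEFT_SVINSSEQ, 5 Ns, and RIGHT_SVINSSEQ."""
    return alt.endswith(joined_insert_sequence(left_svinsseq, right_svinsseq))


def contig_gap() -> str:
    """Run of Ns that separates the left and right sequences in CONTIG."""
    return "N" * CONTIG_N_COUNT
```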

In somatic mode, incomplete insertions are reported with `<INS>` in the `ALT` field, as shown in the example below:

```
chr1	147944009	DRAGEN:INS:19626:2:2:4:12:0	G	<INS>	.	PASS	END=147944009;SVTYPE=INS;CONTIG=GAATGAATTGGCACAAAATAAATATTGAATGAATGATTAGGTTATTTGTTCTGCCACAGCTTTATTTTTCTACCTATTAATAATGTCTCCGTTAGTTCATCCCACACCTTAAATTCAGCTAAATTTCTCTTACGATGGTAAGGATGCTTTCACTAAAAATTATGCTGTTCATCAGTCATTTGCCAAGTGAATTAGCAGCCAAATATTAGTTAGCATTATCTTATATTTTTAAATAGAGCAAGCTCTTTAACCTATATCTGTTTTTCCTAACAGTTTTGTCTCCTCTGTCTCAGTGCATAAACAACGACTGCTTTTCCTTATACCCTTTTACTTAAGTACCCTTTAAGTATATTCAGTAAATGAAGGTCAAAAATTCCTACTCTAGAATGATAGAAGTGAGGTCCAAAACCAGCATTAGTTATCTCTTAGAATGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGCCTAGAGGCAGTTTCCTGCACTGCAGAGAGGGGAGCCGCAATAGAGTCAGCAGTCTCATTGAGTGGAGAAAACAGAGATTAGGGTTGGACATTACTTATACTGCTTTTTTCCTGTCTCTTATGGCTGACCTTTATGTATTTCTTCTGGAAACTAGGAGATCCCTTTCCCATTCTCAGCCCAAAACATGGTGAGTATATATAGGAGGGCAGAGGAAGTGGGGGGAGAAGATTTGTTTAATTTTCTGATTTCAAAAACAATTAATTAAATTTCTTGGGGAAAAAATGTATTTGGGGAAATAATTTATTAATTCACAGGTTCTTGGGATTTGGAAAAAAATTTTTCTGATTGAGGAGGATAGCAAGCTATTTCTCTGAAAAATGTCTGATTTTGAACCACATTTTCTCCTTCATTCTAAGAGATAACTAATGCTGGTTTTGGACCTCACTTCTATCATTCTAGAGTAGGAATTTTTGACCTTCATTTACTGAATATACTTAAAGGGTATAAGGAAAGAGAACGGTGTCTAACTCATATTGTATATGGAACTACAGGAATATATTTGTATGTGTATTTGGAGGTAAGGTATCAATAA;CIPOS=0,9;CIEND=0,9;HOMLEN=9;HOMSEQ=ACTGCTTTT;LEFT_SVINSSEQ=ACTGCTTTTCCTTATACCCTTTTACTTAAGTACCCTTTAAGTATATTCAGTAAATGAAGGTCAAAAATTCCTACTCTAGAATGATAGAAGTGAGGTCCAAAACCAGCATTAGTTATCTCTTAGAATG;RIGHT_SVINSSEQ=TGCCTAGAGGCAGTTTCCTGCACTGCAGAGAGGGGAGCCGCAATAGAGTCAGCAGTCTCATTGAGTGGAGAAAACAGAGATTAGGGTTGGACATTACTTAT;SOMATIC;SOMATICSCORE=37.50	PR:SR:VF	17,0:33,0:49,0	58,2:118,17:172,19
```

**Single Breakend Format for Incomplete Insertions**

In addition to the `<INS>` record, DRAGEN can also output single-breakend formatted records (described in [VCF v4.2 Section 5.4.9](https://samtools.github.io/hts-specs/VCFv4.2.pdf)) corresponding to the partially-assembled sides of the incomplete insertion record. This behavior is enabled by default if viral integration detection is enabled (i.e. `--enable-sv=true` and `--enable-oncovirus-detection=true`). Otherwise, it can be enabled with `--sv-report-incomplete-ins-as-bnd=true`. These single-breakend records will have the `Duplicate` filter applied unless they represent a viral integration site.

If the incomplete insertion sequence aligns to an MEI sequence (by default, the sequences in `<INSTALL_PATH>/config/sv_mobile_element_sequences.fa`), details about this match will be added to the `INFO` field:

| ID                 | Value                                                                                                                    |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------ |
| INTEGRATION\_RNAME | The reporting name of the sequence the incomplete insertion matched to (e.g. AluY).                                      |
| INTEGRATION\_ALT   | The mated breakpoint ALT notation for an incomplete insertion matching an external sequence (e.g. `[DF0000002.4:100[C`). |
| INTEGRATION\_CIGAR | The CIGAR of the alignment between the inserted sequence and the external sequence.                                      |

For example, the following set of records corresponds to a single MEI event. Note that if `--sv-report-incomplete-ins-as-bnd=true` is not provided and viral integration detection is not enabled, only the second record (the `<INS>` record) will be output.

```
chr20   3878210 DRAGEN:BND:848:0:0:0:5:0:0      T       TTACTCAGGAGGCTGAGGCAGGAGAATCGCTTGAACCCAGAAGGCGGAGGTTGCAGTGAGCCGAGATCGCACCACTGCACTCCAGCCTGGGCAACAGAGTGAGACTCTGTCACACACACACACACACACCCACACAAATTCCAGTTAACCAACCAT.   .       Duplicate       SVTYPE=BND;INCOMPLETEINS;INTEGRATION_TYPE=MEI;INTEGRATION_RNAME=AluSq;INTEGRATION_ALT=T[DF0000043.4:171[;INTEGRATION_CIGAR=95M60S;SOMATIC PR:SR:VF:VF1:VAF1   106,4:10,7:109,5:109,5:0.043860
chr20   3878210 DRAGEN:INS:848:0:0:0:5:0        T       <INS>   .       PASS END=3878210;SVTYPE=INS;CONTIG=GAATGAATTGGCACAAAATAAATATTGAATGAATGATTAGGTTATTTGTTCTGCCACAGCTTTATTTTTCTACCTATTAATAATGTCTCCGTTAGTTCATCCCACACCTTAAATTCAGCTAAATTTCTCTTACGATGGTAAGGATGCTTTCACTAAAAATTATGCTGTTCATCAGTCATTTGCCAAGTGAATTAGCAGCCAAATATTAGTTAGCATTATCTTATATTTTTAAATAGAGCAAGCTCTTTAACCTATATCTGTTTTTCCTAACAGTTTTGTCTCTTACTCAGGAGGCTGAGGCAGGAGAATCGCTTGAACCCAGAAGGCGGAGGTTGCAGTGAGCCGAGATCGCACCACTGCACTCCAGCCTGGGCAACAGAGTGAGACTCTGTCACACACACACACACACACCCACACAAATTCCAGTTAACCAACCATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGCTTGCAGTGAGCCGACATAGCGTCACTGCACTCCAGCCTGGGCGACAGAGACTCTGCCTCAAAAAAACAAAACAAAACAAAACAAAAATAAGCTAGACGTGGTGGCACGCGCCTGTAGTCCCAGCTAAGGAAAGAGAACGGTGTCTAACTCATATTGTATATGGAACTACAGGAATATATTTGTATGTGTATTTGGAGGTAAGGTATCAATAA;INCOMPLETEINS;LEFT_SVINSSEQ=TACTCAGGAGGCTGAGGCAGGAGAATCGCTTGAACCCAGAAGGCGGAGGTTGCAGTGAGCCGAGATCGCACCACTGCACTCCAGCCTGGGCAACAGAGTGAGACTCTGTCACACACACACACACACACCCACACAAATTCCAGTTAACCAACCAT;RIGHT_SVINSSEQ=AGCTTGCAGTGAGCCGACATAGCGTCACTGCACTCCAGCCTGGGCGACAGAGACTCTGCCTCAAAAAAACAAAACAAAACAAAACAAAAATAAGCTAGACGTGGTGGCACGCGCCTGTAGTCCCAGCT;INTEGRATION_TYPE=MEI;SOMATIC PR:SR:VF:VF1:VAF1:VF2:VAF2  106,4:10,7:109,9:109,5:0.043860:109,8:0.068376
chr20   3878211 DRAGEN:BND:848:0:0:0:5:0:1      A       .AGCTTGCAGTGAGCCGACATAGCGTCACTGCACTCCAGCCTGGGCGACAGAGACTCTGCCTCAAAAAAACAAAACAAAACAAAACAAAAATAAGCTAGACGTGGTGGCACGCGCCTGTAGTCCCAGCTA       .       Duplicate       SVTYPE=BND;INCOMPLETEINS;INTEGRATION_TYPE=MEI;INTEGRATION_RNAME=AluYk3;INTEGRATION_ALT=]DF0001145.2:267]ACTCTGCCTCAAAAAAACAAAACAAAACAAAACAAAAATAAGCTAGACGTGGTGGCACGCGCCTGTAGTCCCAGCT;INTEGRATION_CIGAR=52M75S;SOMATIC     PR:SR:VF:VF1:VAF1   106,4:10,7:109,8:109,8:0.068376
```
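
To inspect the `INTEGRATION_*` details in records like these, the `INFO` column can be split into key-value pairs and flags. The sketch below uses only standard VCF `INFO` syntax. The function names are hypothetical.

```python
from typing import Dict, Union

InfoValue = Union[str, bool]


def parse_info(info: str) -> Dict[str, InfoValue]:
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    fields: Dict[str, InfoValue] = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def integration_details(info: str) -> Dict[str, InfoValue]:
    """Keep only the INTEGRATION_* entries of an INFO string."""
    return {k: v for k, v in parse_info(info).items() if k.startswith("INTEGRATION_")}
```

Applied to the first record above, `integration_details` returns the integration type, the matched sequence name, the mated breakpoint notation, and the alignment CIGAR.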

**Viral Integration Site Detection**

When [oncovirus detection](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-dna-pipeline/oncovirus-detection) is enabled (`--enable-oncovirus-detection=true`), the SV caller can identify sites where oncoviral sequences have integrated into the human genome. This is done by aligning the partially assembled insertion sequences for incomplete insertion records to the oncoviral reference sequences identified by the oncovirus component of DRAGEN. Partially assembled insertion sequences with high scoring alignments are then reported as integration events in the SV VCF output.

To enable detection of integration sites, the SV caller must be enabled with `--enable-sv=true`, oncovirus detection enabled with `--enable-oncovirus-detection=true`, and the database specified with `--oncovirus-detection-db`.

An example command with viral integration enabled is given below:

```shell
dragen \
  --ref-dir $ref \
  --tumor-fastq-list $tumorFastqList \
  --output-file-prefix $prefix \
  --output-directory $out \
  --enable-sv true \
  --enable-oncovirus-detection true \
  --oncovirus-detection-db $db
```

Viral integration events are reported as single breakends by default and have `INTEGRATION_TYPE=VIRAL` in the `INFO` field. As with single breakend MEI integration records, details about the viral integration are added to the `INFO` field:

| ID                 | Value                                                                                                                   |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------- |
| INTEGRATION\_RNAME | The reporting name of the sequence the incomplete insertion matched to (e.g. HBV\_Occult\_HK514).                       |
| INTEGRATION\_ALT   | The mated breakpoint ALT notation for an incomplete insertion matching an external sequence (e.g. `[KJ410519.4:100[C`). |
| INTEGRATION\_CIGAR | The CIGAR of the alignment between the inserted sequence and the external sequence.                                     |

It is worth noting that unlike single breakend MEI records, single breakend viral integration records are **not** `Duplicate` filtered and do **not** have a corresponding INS record. For example, below we see three viral integration records output by DRAGEN. The first is a single, isolated site while the latter two are in close proximity with orientations compatible with a single large insertion. In both cases, only the single breakend records will be output.

```
chr8    39230335        DRAGEN:BND:69733:0:0:0:1:0:1    T       .TGAAAGCGGGAGGAGTGCGAATCCACACTCCAAAAGACACCAAATACTCAAGAACAGTTTCTCTTCCAAAAGTAAGACAGGAAATGTGAAACCACAATAGTTGTCTGATTTTTAGGCCCATATTAACATTGACATAGCTTACT        .       PASS    SVTYPE=BND;INCOMPLETEINS;INTEGRATION_TYPE=VIRAL;INTEGRATION_RNAME=HBV_Occult_HK514;INTEGRATION_ALT=[KJ410519.1:2159[TACT;INTEGRATION_CIGAR=3S136M3S;DUPLICATES=DRAGEN:INS:69733:0:0:0:1:0;CIPOS=0,1;HOMLEN=1;HOMSEQ=A;SOMATIC;SOMATICSCORE=40.30;SOMATIC_EVENT=0   PR:SR:VF:VF1:VAF1       28,27:9,35:32,51:32,51:0.614458
chr19   35721821        DRAGEN:BND:124270:0:0:0:2:0:0   G       GAGACCACCGTGAACGCCCGCCAGGTCTTGCCCAAGGTCTTACATAAGAGGACTCTTGGACTCTCAGCAATGTCAACGACCGACCTTGAGGCATACTTCAAAGACTGTGTATTTAAGGACTGGGAGGAGTTGGGGGAGGAGAT.        .       PASS    SVTYPE=BND;INCOMPLETEINS;INTEGRATION_TYPE=VIRAL;INTEGRATION_RNAME=HBV_Occult_HK514;INTEGRATION_ALT=G[KJ410519.1:1612[;INTEGRATION_CIGAR=142M;DUPLICATES=DRAGEN:INS:124270:0:0:0:2:0;SOMATIC;SOMATICSCORE=43.57;SOMATIC_EVENT=0     PR:SR:VF:VF1:VAF1       9,6:10,22:8,7:8,7:0.466667
chr19   35721836        DRAGEN:BND:124270:0:0:0:2:0:1   C       .CAGCGCTGAATCCCGCGGACGACCCGTCTCGGGGCCGTTTGGGACTCTACCGTCCCCTTCTTCGTCTGCCGTTCCGACCGACCACGGGGCGCACCTCTCTTTACGCGGTCTCCCCGTCTGTGCCTC .       PASS    SVTYPE=BND;INCOMPLETEINS;INTEGRATION_TYPE=VIRAL;INTEGRATION_RNAME=HBV_Occult_HK514;INTEGRATION_ALT=]KJ410519.1:1560]C;INTEGRATION_CIGAR=125M;DUPLICATES=DRAGEN:INS:124270:0:0:0:2:0;SOMATIC;SOMATICSCORE=43.57;SOMATIC_EVENT=0     PR:SR:VF:VF1:VAF1       9,6:10,22:6,4:6,4:0.400000
```
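
The `INTEGRATION_ALT` values in these records use the mated breakend notation from the VCF specification, so the viral contig and position can be pulled out with a regular expression. This is a minimal sketch with hypothetical names. The `mate_extends` value follows the VCF convention that `[p[` joins the sequence extending to the right of position p and `]p]` joins the sequence extending to the left.

```python
import re
from typing import Dict, Union

# Optional bases, a bracket, contig:position, the same bracket, optional bases.
_BND_ALT = re.compile(
    r"^(?P<before>[A-Za-z.]*)"
    r"(?P<bracket>[\[\]])"
    r"(?P<contig>.+):(?P<pos>\d+)"
    r"(?P=bracket)"
    r"(?P<after>[A-Za-z.]*)$"
)


def parse_integration_alt(alt: str) -> Dict[str, Union[str, int]]:
    """Split a mated breakend ALT string into its parts."""
    match = _BND_ALT.match(alt)
    if match is None:
        raise ValueError(f"not a mated breakend ALT: {alt!r}")
    return {
        "contig": match.group("contig"),
        "position": int(match.group("pos")),
        "sequence_before": match.group("before"),
        "sequence_after": match.group("after"),
        "mate_extends": "right" if match.group("bracket") == "[" else "left",
    }
```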

#### Normalizing Small Tandem Duplications

The SV caller can also represent tandem duplications as insertions. This representation creates ambiguity in how the variants are presented in the VCF output, especially for small tandem duplications. The representation can lead to complications, such as unrecognized call duplication.

To better normalize the SV caller output, so that the same variant type is not represented in two different VCF formats, small tandem duplications (< 1000 bases) are converted to insertions in the VCF output. Insertions converted from such tandem duplications have a formatting similar to incomplete insertions that use the symbolic allele `<INS>` for the `ALT` field in somatic mode, and report the full sequence in `ALT` field in germline mode. The following example shows an insertion, which was converted from a tandem duplication during this normalization process.

```
chr1	34839286	DRAGEN:DUP:TANDEM:5042:0:0:1:0:0	T	TAGAGAAGAGAAGAGAAAAGAAGAGAAGAGAAGAGAAATGAAAAGAAGAAAAGAAAAGAGAAGAGAAGAGA	204	PASS	END=34839286;SVTYPE=INS;SVLEN=70;DUPSVLEN=70;CONTIG=TTGAACCCAGGAGGCAGAGATTGCAGTGAGCCAAGATCGCACCACTTGCACTCCAGCCTGGGCAACAGAGGGAGATTCCGAAAGAAAGAAAGAAAAGAAAGGAGAGAGAGAGAGAGAGAGAGAGGGAGAGACAGAGACAGAGAGAGAGAGGGAGGGAGGGAGGAAGGGAGGGAGAGAGAGAGAGAGAGAGAGAGAGAAAGAAAGGAGGGAGGGAGGGAGGAAGAAAGAAAAGAGAAGAGGAGAGAAGATTAGAGAAGAGAAAAGAAGAGAAGAGAAGAGAAATGAAAAGAAGAAAAGAAAAGAGAAGAGAAGAGAAGAGAAGAAGAGAACAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAAGAGAGGAAAAGAAAAGAAAAGAAATCACCAAAGCAGAGAAGGCCGGGACCATCCCTGCAAACCCCTGCTCTGCTCAAGAGGAGTTGTCACACCTGTTCCTTCTTGAAACGTCATCTGAGGAACCATGTGCCTTCTCTGGGAACTGTGACCCCATCCACAGGCCACTGCAGAAGCTGCCTCGTCTTACCTAACACGTCTTATTCCCA	GT:GQ:PL:PR:SR:SB:FS:VF	0/1:204:206,0,255:18,10:24,6:13,11,3,3:0.000:21,8
```

An example of an insertion converted from duplication in somatic mode:

```
chr2	98302863	DRAGEN:DUP:TANDEM:50986:0:1:0:0:0	A	<INS>	.	PASS	END=98302863;SVTYPE=INS;SVLEN=153;DUPSVLEN=151;CONTIG=AAACCCCTTGGAGAAAATGGAGTTTCCTGAAGCTGTCTGGATAGCAGTATGGCGCAAAGGAGAAAGACCATCAGAGCCAGCGGCCACTGGAGCACTGTATCCAACCCAACAGGGCAAGCGGAGAAGGAGTTGTCTGCTCATCACCCCAGAAAAAAACCCCTTGGAGAAAATGGAGTTTCCTGAAGCTGTCTGGATAGCAGTATGGCGCAAAGGAGAAAGACCATCAGAGCCAGCGGCCACTGGAGCACTGTATCCAACCCAACAGGGCAAG;DUPSVINSLEN=2;DUPSVINSSEQ=TG;SOMATIC;SOMATICSCORE=34.72	PR:SR:VF	40,1:55,0:65,1	115,18:187,20:206,35
```

Converted insertions include copies of certain output fields. The fields appear the same as in a tandem duplication record. For example, `INFO/DUPSVINSSEQ` provides a copy of the breakpoint insertion value computed for the duplication. In the context of a duplication, the breakpoint insertion value would normally be written to `INFO/SVINSSEQ`. The example above also shows a converted insertion with a breakpoint insertion value (`DUPSVINSSEQ=TG`).

To prevent small tandem duplications from being reported as insertions, use the option `--sv-report-small-dup-as-ins false`. For more information about copied `INFO` fields, see VCF INFO Fields. All `INFO` fields use the same `DUP` prefix.
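
When post-processing converted insertions, the copied fields can be mapped back to the names they would have had on a duplication record. The sketch below handles only the three `DUP`-prefixed fields that appear in the examples in this section. It takes an already parsed `INFO` dictionary, and the function names are hypothetical.

```python
from typing import Dict

# Copied fields shown in the examples above; other DUP-prefixed fields may exist.
COPIED_DUP_FIELDS = ("DUPSVLEN", "DUPSVINSLEN", "DUPSVINSSEQ")


def is_converted_insertion(info: Dict[str, str]) -> bool:
    """True for an INS record that carries at least one copied duplication field."""
    return info.get("SVTYPE") == "INS" and any(key in info for key in COPIED_DUP_FIELDS)


def original_dup_fields(info: Dict[str, str]) -> Dict[str, str]:
    """Map copied fields back to duplication names, e.g. DUPSVINSSEQ to SVINSSEQ."""
    return {key[len("DUP"):]: info[key] for key in COPIED_DUP_FIELDS if key in info}
```

An explicit field list is used instead of a plain prefix match because other keys that start with the same letters, such as `DUPLICATES`, are not copied duplication fields.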

#### Multi-record Deduplication

In addition to `<INS>` and `<DUP>` deduplication, the SV caller applies the `Duplicate` filter to large structural variants that are already represented in the VCF in a more precise notation. For example, if an A-B-C rearrangement exists in a sample, the SV caller calls A-B and B-C as expected, but if the B segment is short, it may also call an `IMPRECISE` A-C SV or an A-C SV with an `SVINSSEQ` of the B sequence. This deduplication step identifies these A-C SVs and applies the `Duplicate` filter to them.

Candidates for deduplication marking are `IMPRECISE` SVs and SVs with a non-empty `SVINSSEQ`. Candidates are considered duplicates when a path traversal through one or more SVs can be found such that:

* The length of the traversed sequence is within 10bp of the expected length.
  * For `IMPRECISE` this is determined by `CIPOS` and `CIEND` and for precise variants, the expected length is the `SVINSSEQ` length.
* The start or end of at least one SV is at least 1kbp away from both the candidate start and end.
* All traversed segments are at least 50bp in length.
* The start and end SVs are within 5bp of the range of acceptable candidate start and end positions respectively.
* The traversed sequence is at most 600bp.
* The path traverses at most 3 SVs.

When a candidate `PASS`es filtering, only `PASS` SVs are considered for deduplication matching. If a candidate has any `FILTER` applied, all SVs in the VCF are considered.
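
The thresholds above can be collected into one check. The sketch below is not DRAGEN's path search: it assumes the path has already been found and summarized, and it only tests the listed limits. The `PathSummary` structure, its field names, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PathSummary:
    """Pre-computed facts about one path through other SVs (hypothetical)."""
    traversed_length: int       # bp of sequence traversed by the path
    expected_length: int        # SVINSSEQ length, or a CIPOS/CIEND-derived length
    segment_lengths: List[int]  # length of each traversed segment, in bp
    sv_count: int               # number of SVs on the path
    start_offset: int           # bp outside the acceptable candidate start range
    end_offset: int             # bp outside the acceptable candidate end range
    has_distant_endpoint: bool  # an SV start or end is >= 1 kbp from both candidate ends


def path_marks_duplicate(path: PathSummary) -> bool:
    """Return True if the path satisfies every listed criterion."""
    return (
        abs(path.traversed_length - path.expected_length) <= 10
        and path.has_distant_endpoint
        and all(length >= 50 for length in path.segment_lengths)
        and path.start_offset <= 5
        and path.end_offset <= 5
        and path.traversed_length <= 600
        and 1 <= path.sv_count <= 3
    )


def matching_pool(candidate_filter: str, records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """PASS candidates are matched against PASS SVs only; others against all SVs."""
    if candidate_filter == "PASS":
        return [record for record in records if record["FILTER"] == "PASS"]
    return list(records)
```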

#### Inversions

Inversions are usually reported as a set of breakends. For example, given a simple reciprocal inversion, four breakends are reported, sharing the same `INFO/EVENT` tag. The following are example breakend records representing a simple reciprocal inversion:

```
chr6	130527040	DRAGEN:BND:126718:0:1:0:0:0:0	C	C]chr6:130531150]	999	PASS	SVTYPE=BND;MATEID=DRAGEN:BND:126718:0:1:0:0:0:1;CONTIG=GTGCCTTGTGCCAAATCCTAGGAGCATTTTTCTTAGAACTCACCTTCATATAGGTTGAAGGTGGAAGGTGGAAGCTTTCCTTTCGTTTCCCTAGGGAAAAGCCACTGTGTCCCTCTTCATAGACTGACATCTGATATTTCCCATTTTGCTTCTAATTGAGCTTTTTTTTTTTTTTGGATCTTCTCTCTTCTTGGTTAATCTCACTGGTGGTCTATCAATGTTATTTATCTTTTCAAAGAAACAGCTTGTTGTTTTTGTATTTTTTTGTCTCAAT;CIPOS=0,1;HOMLEN=1;HOMSEQ=T;EVENT=DRAGEN:BND:126718:0:1:0:0:0:0;JUNCTION_QUAL=999;BND_DEPTH=57;MATE_BND_DEPTH=65	GT:GQ:PL:PR:SR:SB:FS:VF    0/1:999:999,0,999:64,29:48,16:24,24,0,16:37.390:88,45
chr6	130527041	DRAGEN:BND:126718:0:1:1:0:0:0	T	[chr6:130531151[T	999	PASS	SVTYPE=BND;MATEID=DRAGEN:BND:126718:0:1:1:0:0:1;CONTIG=GTAAGTACCCAGGAGGAACAGGTCATAACCTGTTAAATAGCAGCAAAACCAGGCCAGCTACCACGAGCCTGGTCCTTGATTTAAACAGAAGTCTGGAGATACTTGAAATGAGTCATTTGAAAAAAAACAACCAATACCACAGAAATACAAAAAAAATCATTCAAGGCTACAATGAACATCTTTACTTGCATAAACTAGAAAACCTAGAGGAGATGAATAAATTCCTGGAAATATACAAGCCTCCTAGATTAAACTAGGAAGAAAT;CIPOS=0,1;HOMLEN=1;HOMSEQ=T;EVENT=DRAGEN:BND:126718:0:1:0:0:0:0;JUNCTION_QUAL=999;BND_DEPTH=57;MATE_BND_DEPTH=65	GT:GQ:PL:PR:SR:SB:FS:VF    0/1:999:999,0,999:65,25:48,19:24,24,18,1:32.535:88,43
chr6	130531149	DRAGEN:BND:126718:0:1:0:0:0:1	C	C]chr6:130527041]	999	PASS	SVTYPE=BND;MATEID=DRAGEN:BND:126718:0:1:0:0:0:0;CONTIG=GTGCCTTGTGCCAAATCCTAGGAGCATTTTTCTTAGAACTCACCTTCATATAGGTTGAAGGTGGAAGGTGGAAGCTTTCCTTTCGTTTCCCTAGGGAAAAGCCACTGTGTCCCTCTTCATAGACTGACATCTGATATTTCCCATTTTGCTTCTAATTGAGCTTTTTTTTTTTTTTGGATCTTCTCTCTTCTTGGTTAATCTCACTGGTGGTCTATCAATGTTATTTATCTTTTCAAAGAAACAGCTTGTTGTTTTTGTATTTTTTTGTCTCAAT;CIPOS=0,1;HOMLEN=1;HOMSEQ=A;EVENT=DRAGEN:BND:126718:0:1:0:0:0:0;JUNCTION_QUAL=999;BND_DEPTH=65;MATE_BND_DEPTH=57	GT:GQ:PL:PR:SR:SB:FS:VF    0/1:999:999,0,999:64,29:48,16:24,24,0,16:37.390:88,45
chr6	130531150	DRAGEN:BND:126718:0:1:1:0:0:1	A	[chr6:130527042[A	999	PASS	SVTYPE=BND;MATEID=DRAGEN:BND:126718:0:1:1:0:0:0;CONTIG=GTAAGTACCCAGGAGGAACAGGTCATAACCTGTTAAATAGCAGCAAAACCAGGCCAGCTACCACGAGCCTGGTCCTTGATTTAAACAGAAGTCTGGAGATACTTGAAATGAGTCATTTGAAAAAAAACAACCAATACCACAGAAATACAAAAAAAATCATTCAAGGCTACAATGAACATCTTTACTTGCATAAACTAGAAAACCTAGAGGAGATGAATAAATTCCTGGAAATATACAAGCCTCCTAGATTAAACTAGGAAGAAAT;CIPOS=0,1;HOMLEN=1;HOMSEQ=C;EVENT=DRAGEN:BND:126718:0:1:0:0:0:0;JUNCTION_QUAL=999;BND_DEPTH=65;MATE_BND_DEPTH=57	GT:GQ:PL:PR:SR:SB:FS:VF    0/1:999:999,0,999:65,25:48,19:24,24,18,1:32.535:88,43
```
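The `ALT` column of each breakend record above encodes the mate position and the join orientation. The following is a minimal Python sketch of how that notation could be parsed, based on the four breakend `ALT` forms in the VCF specification. The names `parse_bnd_alt` and `BreakendAlt` are illustrative and are not part of DRAGEN.

```python
import re
from typing import NamedTuple

# The four breakend ALT forms in the VCF specification are
#   t[p[   t]p]   ]p]t   [p[t
# where t is the local sequence and p is the mate position (chrom:pos).
_BND_ALT = re.compile(
    r"^(?P<lead>[^\[\]]*)"
    r"(?P<open>[\[\]])(?P<chrom>[^\[\]]+):(?P<pos>\d+)(?P<close>[\[\]])"
    r"(?P<trail>[^\[\]]*)$"
)


class BreakendAlt(NamedTuple):
    mate_chrom: str
    mate_pos: int
    bracket: str           # "[" or "]"
    local_seq_first: bool  # True for t[p[ and t]p], False for ]p]t and [p[t


def parse_bnd_alt(alt: str) -> BreakendAlt:
    """Split a paired-breakend ALT string into mate locus and orientation."""
    m = _BND_ALT.match(alt)
    # Exactly one of the leading/trailing sequences must be present,
    # and both brackets must point the same way.
    if m is None or m["open"] != m["close"] or bool(m["lead"]) == bool(m["trail"]):
        raise ValueError(f"not a paired breakend ALT: {alt!r}")
    return BreakendAlt(m["chrom"], int(m["pos"]), m["open"], bool(m["lead"]))


# ALT values taken from the first and last records of the inversion example.
for alt in ("C]chr6:130531150]", "[chr6:130527042[A"):
    print(parse_bnd_alt(alt))
```

In the VCF specification, the `t]p]` and `[p[t` forms both describe a join to reverse-complemented sequence, which is why these two forms appear in the inversion example.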

For a micro inversion that is entirely contained within the alignment and represented by adjacent 'I' and 'D' CIGAR operations of comparable lengths, the SV caller can also output a single VCF record in a format similar to that of small INS/DEL records. For example:

```
chr4    144359319   DRAGEN:INV:1:0:0:0  TTGACTCCAGCCTTTATAACCTGCCCAGGAGTTATTCCCATGATGTTTTACCAAAGGTGTCTCCATCAAGT <INV>   82  PASS    END=144359389;SVTYPE=INV;SVLEN=-70;CIGAR=1M70I70D;CONTIG=TACTTGATGGAGACACCTTTGGTAAAACATCATGGGAATAACTCCTGGGCAGGTTATAAAGGCTGGAGTCACA;CIPOS=0,2;HOMLEN=2;HOMSEQ=TG;   GT:FT:GQ:PL:PR:SR:SB:FS:VF   0/1:PASS:82:132,0,999:48,0:130,12:50,80,11,1:0.000:131,12
```

#### Depth-Based SV Type Classification

In the germline calling model, when SV candidates are discovered from the sample data and have sufficient paired and split read evidence to be reported in the output, the SV caller applies additional depth-based tests to more accurately classify certain SV candidate types. Candidate breakpoints that are consistent with a deletion are tested for the lower read depth that is expected inside the deleted region. Candidate breakpoints consistent with a tandem duplication are tested for the higher read depth expected in the duplicated region. Candidate SV calls that fail the depth-based tests are still reported in the output, but changed to intra-chromosomal breakends. Candidate SV calls that pass continue to be reported in the standard deletion and tandem duplication output formats.

#### SV Breakpoint Insertions

SVs frequently include a small sequence insertion at the breakpoint. Breakpoint insertions are represented differently depending on the SV type. The `INFO/SVINSSEQ` field in the VCF output provides the most general description of breakpoint insertions by describing the insertion sequence itself. The corresponding `INFO/SVINSLEN` field describes the length of the insertion sequence. For example, the following VCF record describes a large (\~8.8 kb) deletion, which includes a single base insertion (C) between the left and right deletion breakends.

```
chr22   17770350        DRAGEN:DEL:101:0:1:0:0:0  C       <DEL>   687     PASS    END=17779108;SVTYPE=DEL;SVLEN=-8758;SVINSLEN=1;SVINSSEQ=C       GT:FT:GQ:PL:PR:SR       0/1:PASS:687:737,0,858:39,20:32,8
```

The `INFO/SVINSSEQ` field is also used to describe breakpoint insertions for tandem duplication and breakend records. The field can also be used to describe the insertion sequence of a large SV insertion.
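As a quick consistency check on records such as the deletion example above, the following Python sketch parses an `INFO` string and compares `SVINSLEN` with the length of `SVINSSEQ`, and `SVLEN` with the span from `POS` to `END`. The helper names are hypothetical, and the check assumes a simple deletion record where `SVLEN` is negative.

```python
def parse_info(info: str) -> dict:
    """Parse a VCF INFO string into a dict; flags map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def summarize_breakpoint_insertion(pos: int, info: str) -> dict:
    """Report the reference span and whether SVINSLEN matches SVINSSEQ."""
    f = parse_info(info)
    ins_seq = f.get("SVINSSEQ", "")
    return {
        "span": int(f["END"]) - pos,
        "svlen": int(f["SVLEN"]),
        "ins_seq": ins_seq,
        "ins_len_matches": int(f.get("SVINSLEN", 0)) == len(ins_seq),
    }


# POS and INFO copied from the chr22 deletion example.
summary = summarize_breakpoint_insertion(
    17770350, "END=17779108;SVTYPE=DEL;SVLEN=-8758;SVINSLEN=1;SVINSSEQ=C"
)
print(summary)
```

For this record, the span between `POS` and `END` is 8758 bases, which matches the magnitude of `SVLEN`.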

Breakpoint insertions are represented differently if the variant is classified as a small indel. Any breakpoint insertion that occurs in a small deletion is represented in the CIGAR string. See [Small Indel Classification and Representation](#small-indel-classification-and-representation) for the conditions under which this format is used for SVs.

In the following small indel example, the VCF record describes a 57 base deletion that includes a single base insertion (A) between the left and right deletion breakends:

```
chr22   32981929        DRAGEN:DEL:1136:0:0:0:0:0 TGTATACATATATGTGTATATACGTATATATGTATATATGTATGTATACGTATATATG      TA      537     PASS    END=32981986;SVTYPE=DEL;SVLEN=-57;CIGAR=1M1I57D GT:FT:GQ:PL:PR:SR       0/1:PASS:308:587,0,305:8,0:23,15
```
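The `INFO/CIGAR` value in the record above can be decoded with a few lines of Python. This is a minimal sketch with hypothetical helper names. It treats `M`, `=`, and `X` operations as bases shared by the REF and ALT alleles.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list:
    """Return the CIGAR as a list of (length, operation) tuples."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Reject strings that contain anything other than valid operations.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR: {cigar!r}")
    return ops


def indel_lengths(cigar: str) -> dict:
    """Summarize inserted/deleted bases and the implied REF/ALT lengths."""
    ops = parse_cigar(cigar)
    matched = sum(n for n, op in ops if op in "M=X")
    inserted = sum(n for n, op in ops if op == "I")
    deleted = sum(n for n, op in ops if op == "D")
    return {
        "inserted": inserted,
        "deleted": deleted,
        "ref_len": matched + deleted,
        "alt_len": matched + inserted,
    }


print(indel_lengths("1M1I57D"))
```

For `1M1I57D`, the sketch reports one inserted base and 57 deleted bases, which agrees with the two-base `ALT` allele `TA` (one anchor base plus the inserted `A`).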

Breakend (BND) records include an additional encoding of breakpoint insertion sequence, as described in the VCF specification for the breakend `ALT` field. The SV caller also provides the information to the `INFO/SVINSSEQ` field for consistency with other SV record types.

The following example shows a breakend connecting a region of chromosomes 1 and 12 in the sample with a breakend insertion sequence of `CA` between the two breakends. The insertion sequence is described in both the `ALT` and `INFO/SVINSSEQ` fields.

```
1       39604587        DRAGEN:BND:31780:1:3:0:0:0:1      T       TCA[12:6472102[ 774     PASS    SVTYPE=BND;MATEID=DRAGEN:BND:31780:1:3:0:0:0:0;SVINSLEN=2;SVINSSEQ=CA;BND_DEPTH=67;MATE_BND_DEPTH=55      GT:FT:GQ:PL:PR:SR       0/1:PASS:774:824,0,999:63,3:36,33
12      6472102 DRAGEN:BND:31780:1:3:0:0:0:0      G       ]1:39604587]CAG 774     PASS    SVTYPE=BND;MATEID=DRAGEN:BND:31780:1:3:0:0:0:1;SVINSLEN=2;SVINSSEQ=CA;BND_DEPTH=55;MATE_BND_DEPTH=67      GT:FT:GQ:PL:PR:SR       0/1:PASS:774:824,0,999:63,3:36,33
```
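To illustrate how the two encodings relate, the following Python sketch recovers the inserted sequence from a breakend `ALT` string by removing the reference base from the local sequence. The function name `bnd_insertion` is hypothetical, and the sketch assumes a paired breakend record with a single-allele `REF`.

```python
import re

# Local sequence appears either before or after the bracketed mate position.
_BND_PARTS = re.compile(r"^([^\[\]]*)[\[\]][^\[\]]+[\[\]]([^\[\]]*)$")


def bnd_insertion(ref: str, alt: str) -> str:
    """Return the breakpoint insertion sequence encoded in a BND ALT string."""
    m = _BND_PARTS.match(alt)
    if m is None:
        raise ValueError(f"not a paired breakend ALT: {alt!r}")
    lead, trail = m.groups()
    if lead and not trail and lead.startswith(ref):
        # t[p[ or t]p]: REF comes first, inserted bases follow it.
        return lead[len(ref):]
    if trail and not lead and trail.endswith(ref):
        # ]p]t or [p[t: inserted bases come first, REF is last.
        return trail[: len(trail) - len(ref)]
    raise ValueError(f"ALT {alt!r} is not anchored on REF {ref!r}")


# REF and ALT values copied from the two records above.
print(bnd_insertion("T", "TCA[12:6472102["))
print(bnd_insertion("G", "]1:39604587]CAG"))
```

Both calls yield `CA`, which matches the `SVINSSEQ=CA` value reported in each record.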

**SV Breakpoint Insertion Orientation**

The breakpoint insertion sequence is always provided with respect to the strand of the current SV record. Some breakend records have inverted orientation. For inverted orientations, each record in the breakend pair reports an insertion sequence that is the reverse complement of the sequence reported in its mate record.

The following breakend pair example demonstrates an inverted orientation.

```
1       210891730       DRAGEN:BND:43882:0:2:0:2:0:1      A       AATG]19:45732595]       999     PASS    SVTYPE=BND;MATEID=DRAGEN:BND:43882:0:2:0:2:0:0;SVINSLEN=3;SVINSSEQ=ATG;BND_DEPTH=76;MATE_BND_DEPTH=106    GT:FT:GQ:PL:PR:SR       0/1:PASS:999:999,0,999:69,16:43,55
19      45732595        DRAGEN:BND:43882:0:2:0:2:0:0      G       GCAT]1:210891730]       999     PASS    SVTYPE=BND;MATEID=DRAGEN:BND:43882:0:2:0:2:0:1;SVINSLEN=3;SVINSSEQ=CAT;BND_DEPTH=106;MATE_BND_DEPTH=76    GT:FT:GQ:PL:PR:SR       0/1:PASS:999:999,0,999:69,16:43,55
```
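The relationship between the two `SVINSSEQ` values above can be verified with a short Python sketch. The function name `reverse_complement` is illustrative.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# SVINSSEQ values from the chromosome 1 and chromosome 19 records.
local_ins = "ATG"
mate_ins = "CAT"
print(reverse_complement(local_ins) == mate_ins)
```

The comparison holds for this pair because `ATG` reverse complemented is `CAT`.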

#### SV Breakpoint Homology

Each VCF record output by the SV caller is shifted to the left-most position of the exact homology range of the breakpoint. The exact homology range of the breakpoint is the continuous range of positions over which the SV could be represented while still describing the same SV haplotype. The exact homology range is described in the VCF output with the `INFO/HOMSEQ` field, which describes the sequence of the exact homology range and the corresponding `INFO/HOMLEN` field, which describes the length of the range.

The following example shows a 62 base deletion with an 11 base breakend homology region. Without left-shifting, the SV has an equivalent representation anywhere from position 39497639 to 39497650.

```
chr22   39497639        DRAGEN:DEL:34:85:85:1:0:0 GGGGGGTGGGGGCGGGTTGGAGGAGGTTGGCGGGGGGCGGGGGCGGGTTGGAGGAGGTTGGCA G       187     PASS    END=39497701;SVTYPE=DEL;SVLEN=-62;CIGAR=1M62D;CIPOS=0,11;HOMLEN=11;HOMSEQ=GGGGGTGGGGG   GT:FT:GQ:PL:PR:SR     0/1:PASS:12:237,0,8:4,0:2,8
```

The following simplified examples illustrate exact breakend homology for one three-base deletion and one three-base insertion. In both cases, the variant is left-shifted so that the corresponding VCF record position is 2.

Deletion

Reference: GT**C**AGCGA

Variant: GT---**C**GA

Insertion

Reference: GT---**C**AG

Variant: GT**C**GGCAA

In both the insertion and deletion, there is a single base of exact breakend homology `C`, so that the same variant can be represented one base to the right.
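The left-shifting described above can be sketched in Python for the deletion case. This is an illustration of the general technique under simple assumptions (a plain reference string and a deletion with no inserted bases), not the SV caller's implementation. The function name `deletion_homology` is hypothetical.

```python
def deletion_homology(ref: str, start: int, length: int) -> tuple:
    """Left-shift a deletion and report its exact homology.

    ref    : reference sequence
    start  : 0-based index of the first deleted base
    length : number of deleted bases
    Returns (vcf_pos, homlen, homseq), where vcf_pos is the 1-based
    position of the base immediately before the left-shifted deletion.
    """
    # Shift left while the base entering the deletion on the left equals
    # the base leaving it on the right (the haplotype is unchanged).
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1

    # Count how far the deletion could slide right from the leftmost
    # position without changing the haplotype.
    homlen = 0
    while (
        start + homlen + length < len(ref)
        and ref[start + homlen] == ref[start + homlen + length]
    ):
        homlen += 1

    # The 0-based index of the first deleted base equals the 1-based
    # position of the preceding (anchor) base.
    return start, homlen, ref[start:start + homlen]


# Deletion example from the text: reference GTCAGCGA, three bases removed.
print(deletion_homology("GTCAGCGA", 2, 3))
print(deletion_homology("GTCAGCGA", 3, 3))
```

Both calls describe the same haplotype (`GTCGA`) and therefore produce the same left-shifted position, homology length, and homology sequence.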

#### VCF INFO Fields

| ID                      | Description                                                                                                                                               |
| ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| IMPRECISE               | Flag indicating that the structural variation is imprecise, ie, the exact breakpoint location is not found                                                |
| SVTYPE                  | Type of structural variant                                                                                                                                |
| SVLEN                   | Difference in length between REF and ALT alleles                                                                                                          |
| END                     | End position of the variant described in this record                                                                                                      |
| CIPOS                   | Confidence interval around POS                                                                                                                            |
| CIEND                   | Confidence interval around END                                                                                                                            |
| CIGAR                   | CIGAR alignment for each alternate indel allele                                                                                                           |
| MATEID                  | ID of mate breakend                                                                                                                                       |
| EVENT                   | ID of event associated to breakend                                                                                                                        |
| HOMLEN                  | Length of base pair identical homology at event breakpoints                                                                                               |
| HOMSEQ                  | Sequence of base pair identical homology at event breakpoints                                                                                             |
| SVINSLEN                | Length of insertion                                                                                                                                       |
| SVINSSEQ                | Sequence of insertion                                                                                                                                     |
| INCOMPLETEINS           | Variant corresponds to an incompletely assembled insertion sequence                                                                                       |
| LEFT\_SVINSSEQ          | Known left side of insertion for an insertion of unknown length                                                                                           |
| RIGHT\_SVINSSEQ         | Known right side of insertion for an insertion of unknown length                                                                                          |
| PAIR\_COUNT             | Read pairs supporting this variant where both reads are confidently mapped                                                                                |
| BND\_PAIR\_COUNT        | Confidently mapped reads supporting this variant at this breakend (mapping may not be confident at remote breakend)                                       |
| UPSTREAM\_PAIR\_COUNT   | Confidently mapped reads supporting this variant at the upstream breakend (mapping may not be confident at downstream breakend)                           |
| DOWNSTREAM\_PAIR\_COUNT | Confidently mapped reads supporting this variant at the downstream breakend (mapping may not be confident at upstream breakend)                           |
| BND\_DEPTH              | Read depth at local translocation breakend                                                                                                                |
| MATE\_BND\_DEPTH        | Read depth at remote translocation mate breakend                                                                                                          |
| JUNCTION\_QUAL          | If the SV junction is part of an EVENT (ie, a multi-adjacency variant), this field provides the QUAL value for the adjacency in question only             |
| SOMATIC                 | Flag indicating a somatic variant                                                                                                                         |
| SOMATICSCORE            | Somatic variant quality score                                                                                                                             |
| SOMATIC\_EVENT          | If the probability of the SV being a germline variant is greater than the probability of the SV being a somatic variant, this is 0. Otherwise, this is 1. |
| JUNCTION\_SOMATICSCORE  | If the SV junction is part of an EVENT (ie, a multi-adjacency variant), this field provides the SOMATICSCORE value for the adjacency in question only     |
| CONTIG                  | Assembled contig sequence, if the variant is not imprecise (with `--sv-output-contigs`)                                                                   |
| DUPSVLEN                | Length of duplicated reference sequence                                                                                                                   |
| DUPHOMLEN               | Length of base pair identical homology at event breakpoints excluding duplicated reference sequence                                                       |
| DUPHOMSEQ               | Sequence of base pair identical homology at event breakpoints excluding duplicated reference sequence                                                     |
| DUPSVINSLEN             | Length of inserted sequence after duplicated reference sequence                                                                                           |
| DUPSVINSSEQ             | Inserted sequence after duplicated reference sequence                                                                                                     |
| LCF                     | Flag indicating that the large contig filter processed this record                                                                                        |
| INTEGRATION\_TYPE       | Type of integration event ("VIRAL" for oncovirus integration, "MEI" for mobile element insertion)                                                         |
| INTEGRATION\_RNAME      | The reporting name of the sequence the incomplete insertion matched to                                                                                    |
| INTEGRATION\_ALT        | The mated breakpoint ALT notation for an incomplete insertion matching an external sequence                                                               |
| INTEGRATION\_CIGAR      | The CIGAR of the alignment between the inserted sequence and the external sequence                                                                        |

The meanings of the IMPRECISE, SVTYPE, SVLEN, END, CIPOS, CIEND, MATEID, EVENT, HOMLEN, and HOMSEQ fields match their [VCF v4.2](https://samtools.github.io/hts-specs/VCFv4.2.pdf) definitions.

#### VCF FORMAT Fields

| ID   | Description                                                                                 |
| ---- | ------------------------------------------------------------------------------------------- |
| GT   | Genotype                                                                                    |
| FT   | Sample filter, 'PASS' indicates that all filters have passed for this sample                |
| GQ   | Genotype Quality                                                                            |
| PL   | Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification      |
| PR   | Number of spanning read pairs which strongly support the REF or ALT alleles                 |
| SR   | Number of split-reads which strongly support the REF or ALT alleles                         |
| VF   | Number of fragments which strongly support the REF or ALT alleles at any position           |
| VF1  | Number of fragments which strongly support the REF or ALT allele at the first breakend      |
| VF2  | Number of fragments which strongly support the REF or ALT allele at the second breakend     |
| VAF1 | Variant allele fraction for the first breakend, calculated as VF1\_ALT/(VF1\_ALT+VF1\_REF)  |
| VAF2 | Variant allele fraction for the second breakend, calculated as VF2\_ALT/(VF2\_ALT+VF2\_REF) |
| PSL  | Phase set list                                                                              |

The meanings of the GT and GQ fields match their [VCF v4.2](https://samtools.github.io/hts-specs/VCFv4.2.pdf) definitions. The meaning of the PSL field matches its [VCF v4.4](https://samtools.github.io/hts-specs/VCFv4.4.pdf) definition.

**SV Variant Allele Fraction (VAF)**

Some evidential sequence fragments (or read pairs) can provide both PR and SR support. To avoid double counting, the VF field reports the number of evidential sequence fragments (or read pairs) that strongly support the REF or ALT alleles, in the listed order. In this context, "strongly support" means that a sequence fragment can be easily distinguished and assigned to one of the alleles. For example, read alignments that fit both the REF and ALT allele sequences in a repeat region are discarded (similar to the definition of [UninformativeReads](https://help.dragen.illumina.com/product-guides/dragen-v4.5/small-variant-calling#read-filtering-and-reporting-of-vcf-dp-fields) in the small variant caller).

Unlike SNV callers where the Variant Allele Fraction (VAF) can be calculated as VF\_ALT/(VF\_ALT+VF\_REF), multiple VAFs can be calculated for SVs. VF, VF1, and VF2 refer to the number of strongly supporting fragments for the entire allele, the first breakend, and the second breakend respectively.

For `<INS>` variants, ALT VF is the number of fragments that strongly support the ALT allele anywhere in the insertion. ALT VF1 is the number of fragments that strongly support the start of the insertion (that is, the read or read pair supports the ALT allele at a position that overlaps the start of the insertion), and ALT VF2 is the support at the end of the insertion.

For `<DEL>` variants, ALT VF, VF1, and VF2 are the same, but REF VF1 and VF2 refer to the reference support at the start and end of the deletion, respectively. These are not necessarily the same. For example, if there are two overlapping simple heterozygous deletions, then the outer SV VAFs will be 0.5, but the inner SV VAFs for these deletions will be 1.

In general, SVs with non-zero `SVINSSEQ` will have different ALT VF1 and VF2, and SVs of non-zero reference size will have different REF VF1 and VF2.

For `<INS>`, `<DEL>`, and `<DUP>` variants, VF1/VAF1 refer to the VAF at the start of the SV, and VF2/VAF2 refer to the VAF at the end of the SV. For `BND` variants in VCF breakend notation, VAF1 refers to the local breakend VAF, and VAF2 refers to the remote breakend VAF. For single breakend variants, only VF1 and VAF1 are defined.

**Physical Phasing**

Many somatic samples contain cis phased structural variants in extreme proximity to one another. Such variants can be phased if the distance between them is less than or comparable to the library fragment size (or read length for single-ended sequencing). The SV caller performs physical phasing of structural variants by identifying reads or read pairs that unambiguously support the ALT alleles of two nearby structural variants. Physically phased SVs are aggregated into phase sets based on the transitive closure of the pair-wise cis phased SVs. Since the phase set aggregation can result in phase switch errors when a trio of 0|1, 1|1 and 0|1 variants are phased, physical phasing is limited to somatic variants larger than 1kbp.

The value of the PSL field for each phase set is the ID of the first record belonging to the phase set in the VCF. Note that since the VCF PS field does not support local copy number changes or inter-chromosomal phasing, the PSL field introduced in [VCF v4.4](https://samtools.github.io/hts-specs/VCFv4.4.pdf) is used.
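The aggregation of pair-wise phased SVs into phase sets can be pictured as a union-find (disjoint-set) computation. The following Python sketch is one possible way to compute a transitive closure and label each set with the ID of its first record in VCF order. It is an illustration only, not the SV caller's implementation, and the function name and record IDs are hypothetical.

```python
from collections import Counter


def assign_phase_sets(record_ids: list, phased_pairs: list) -> dict:
    """Map each phased record ID to the ID of the first record in its set.

    record_ids   : record IDs in VCF order
    phased_pairs : (id_a, id_b) pairs found to be cis phased
    Records that are not phased with any other record are omitted.
    """
    order = {rid: i for i, rid in enumerate(record_ids)}
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the set representative, compressing paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in phased_pairs:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue
        # Keep the earliest record as representative, so the representative
        # is always the first record of the merged set in VCF order.
        if order[root_a] > order[root_b]:
            root_a, root_b = root_b, root_a
        parent[root_b] = root_a

    sizes = Counter(find(rid) for rid in record_ids)
    return {rid: find(rid) for rid in record_ids if sizes[find(rid)] > 1}


# Hypothetical IDs: sv2-sv3 and sv1-sv2 are phased pair-wise, sv4 is not.
print(assign_phase_sets(["sv1", "sv2", "sv3", "sv4"],
                        [("sv3", "sv2"), ("sv2", "sv1")]))
```

In this example, `sv1`, `sv2`, and `sv3` end up in one phase set labeled `sv1`, even though `sv1` and `sv3` were never directly observed together.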

**Heteroplasmy Calculation for Mitochondrial Variants**

For mitochondrial variants, the `VAF1` and `VAF2` fields can be used to calculate the heteroplasmy level of a mitochondrial variant: Heteroplasmy = (`VAF1` + `VAF2`) / 2.

An example of a mitochondrial variant with `VAF` values is shown below:

```
chrM	6455	DRAGEN:DEL:12682:33:39:0:0:0	CGTCTGATCCGTCCTAATCACAGCAGTCCTACTTCTCCTATCTCTCCCAGTCCTAGCTGCTGGCATCACTATACTACTAACAGACCGCAACCTCAACACCACCTTCTTCGACCCCGCCGGAGGAGGAGACCCCATTCTATACCAACACCTATTCTGATTTTTCGGTCACCCTGAAGTTTATATTCTTATCCTACCAGGCTTCGGAATAATCTCCCATATTGTAACTTACTACTCCGGAAAAAAAGAACCATTTGGATACATAGGTATGGTCTGAGCTATGATATCAATTGGCTTCCTAGGGTTTATCGTGTGAGCACACCATATATTTACAGTAGGAATAGACGTAGACACACGAGCATATTTCACCTCCGCTACCATAATCATCGCTATCCCCACCGGCGTCAAAGTATTTAGCTGACTCGCCACACTCCACGGAAGCAATATGAAATGATCTGCTGCAGTGCTCTGAGCCCTAGGATTCATCTTTCTTTTCACCGTAGGTGGCCTGACTGGCATTGTATTAGCAAACTCATCACTAGACATCGTACTACACGACACGTACTACGTTGTAGCCCACTTCCACTATGTCCTATCAATAGGAGCTGTATTTGCCATCATAGGAGGCTTCATTCACTGATTTCCCCTATTCTCAGGCTACACCCTAGACCAAACCTACGCCAAAATCCATTTCACTATCATATTCATCGGCGTAAATCTAACTTTCTTCCCACAACACTTTCTCGGCCTATCCGGAATGCCCCGACGTTACTCGGACTACCCCGATGCATACACCACATGAAACATCCTATCATCTGTAGGCTCATTCATTTCTCTAACAGCAGTAATATTAATAATTTTCATGATTTGAGAAGCCTTCGCTTCGAAGCGAAAAGTCCTAATAGTAGAAGAACCCTCCATAAACCTGGAGTGACTATATGGATGCCCCCCACCCTACCACACATTCGAAGAACCCGTATACATAAAATCTAGACAAAAAAGGAAGGAATCGAACCCCCCAAAGCTGGTTTCAAGCCAACCCCATGGCCTCCATGACTTTTTCAAAAAGGTATTAGAAAAACCATTTCATAACTTTGTCAAAGTTAAATTATAGGCTAAATCCTATATATCTTAATGGCACATGCAGCGCAAGTAGGTCTACAAGACGCTACTTCCCCTATCATAGAAGAGCTTATCACCTTTCATGATCACGCCCTCATAATCATTTTCCTTATCTGCTTCCTAGTCCTGTATGCCCTTTTCCTAACACTCACAACAAAACTAACTAATACTAACATCTCAGACGCTCAGGAAATAGAAACCGTCTGAACTATCCTGCCCGCCATCATCCTAGTCCTCATCGCCCTCCCATCCCTACGCATCCTTTACATAACAGACGAGGTCAACGATCCCTCCCTTACCATCAAATCAATTGGCCACCAATGGTACTGAACCTACGAGTACACCGACTACGGCGGACTAATCTTCAACTCCTACATACTTCCCCCATTATTCCTAGAACCAGGCGACCTGCGACTCCTTGACGTTGACAATCGAGTAGTACTCCCGATTGAAGCCCCCATTCGTATAATAATTACATCACAAGACGTCTTGCACTCATGAGCTGTCCCCACATTAGGCTTAAAAACAGATGCAATTCCCGGACGTCTAAACCAAACCACTTTCACCGCTACACGACCGGGGGTATACTACGGTCAATGCTCTGAAATCTGTGGAGCAAACCACAGTTTCATGCCCATCGTCCTAGAATTAATTCCCCTAAAAATCTTTGAAATAGGGCCCGTATTTACCCTATAGCACCCCCTCTACCCCCTCTAGAGCCCACTGTAAAGCTAACTTAGCATTAACCTTTTAAGTTAAAGATTAAGAGAACCAACACCTCTTTACAGTGAAATGCCCCAACTAAATACTACCGTATGGCCCACCATAATTACCCCCATACT
CCTTACACTATTCCTCATCACCCAACTAAAAATATTAAACACAAACTACCACCTACCTCCCTCACCAAAGCCCATAAAAATAAAAAATTATAACAAACCCTGAGAACCAAAATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAGGCCTACCCGCCGCAGTACTGATCATTCTATTTCCCCCTCTATTGATCCCCACCTCCAAATATCTCATCAACAACCGACTAATCACCACCCAACAATGACTAATCAAACTAACCTCAAAACAAATGATAACCATACACAACACTAAAGGACGAACCTGATCTCTTATACTAGTATCCTTAATCATTTTTATTGCCACAACTAACCTCCTCGGACTCCTGCCTCACTCATTTACACCAACCACCCAACTATCTATAAACCTAGCCATGGCCATCCCCTTATGAGCGGGCACAGTGATTATAGGCTTTCGCTCTAAGATTAAAAATGCCCTAGCCCACTTCTTACCACAAGGCACACCTACACCCCTTATCCCCATACTAGTTATTATCGAAACCATCAGCCTACTCATTCAACCAATAGCCCTGGCCGTACGCCTAACCGCTAACATTACTGCAGGCCACCTACTCATGCACCTAATTGGAAGCGCCACCCTAGCAATATCAACCATTAACCTTCCCTCTACACTTATCATCTTCACAATTCTAATTCTACTGACTATCCTAGAAATCGCTGTCGCCTTAATCCAAGCCTACGTTTTCACACTTCTAGTAAGCCTCTACCTGCACGACAACACATAATGACCCACCAATCACATGCCTATCATATAGTAAAACCCAGCCCATGACCCCTAACAGGGGCCCTCTCAGCCCTCCTAATGACCTCCGGCCTAGCCATGTGATTTCACTTCCACTCCATAACGCTCCTCATACTAGGCCTACTAACCAACACACTAACCATATACCAATGATGGCGCGATGTAACACGAGAAAGCACATACCAAGGCCACCACACACCACCTGTCCAAAAAGGCCTTCGATACGGGATAATCCTATTTATTACCTCAGAAGTTTTTTTCTTCGCAGGATTTTTCTGAGCCTTTTACCACTCCAGCCTAGCCCCTACCCCCCAATTAGGAGGGCACTGGCCCCCAACAGGCATCACCCCGCTAAATCCCCTAGAAGTCCCACTCCTAAACACATCCGTATTACTCGCATCAGGAGTATCAATCACCTGAGCTCACCATAGTCTAATAGAAAACAACCGAAACCAAATAATTCAAGCACTGCTTATTACAATTTTACTGGGTCTCTATTTTACCCTCCTACAAGCCTCAGAGTACTTCGAGTCTCCCTTCACCATTTCCGACGGCATCTACGGCTCAACATTTTTTGTAGCCACAGGCTTCCACGGACTTCACGTCATTATTGGCTCAACTTTCCTCACTATCTGCTTCATCCGCCAACTAATATTTCACTTTACATCCAAACATCACTTTGGCTTCGAAGCCGCCGCCTGATACTGGCATTTTGTAGATGTGGTTTGACTATTTCTGTATGTCTCCATCTATTGATGAGGGTCTTACTCTTTTAGTATAAATAGTACCGTTAACTTCCAATTAACTAGTTTTGACAACATTCAAAAAAGAGTAATAAACTTCGCCTTAATTTTAATAATCAACACCCTCCTAGCCTTACTACTAATAATTATTACATTTTGACTACCACAACTCAACGGCTACATAGAAAAATCCACCCCTTACGAGTGCGGCTTCGACCCTATATCCCCCGCCCGCGTCCCTTTCTCCATAAAATTCTTCTTAGTAGCTATTACCTTCTTATTATTTGATCTAGAAATTGCCCTCCTTTTACCCCTACCATGAGCCCTACAAACAACTAACCTGCCACTAATAGTTATGTCATCCCTCTTATTAATCATCATCCTAGCCCTAAGTCTGGCCTATGAGTGACTACAAAAAGGATTAGACTGAACCGAATTGGTATATAGT
TTAAACAAAACGAATGATTTCGACTCATTAAATTATGATAATCATATTTACCAAATGCCCCTCATTTACATAAATATTATACTAGCATTTACCATCTCACTTCTAGGAATACTAGTATATCGCTCACACCTCATATCCTCCCTACTATGCCTAGAAGGAATAATACTATCGCTGTTCATTATAGCTACTCTCATAACCCTCAACACCCACTCCCTCTTAGCCAATATTGTGCCTATTGCCATACTAGTCTTTGCCGCCTGCGAAGCAGCGGTGGGCCTAGCCCTACTAGTCTCAATCTCCAACACATATGGCCTAGACTACGTACATAACCTAAACCTACTCCAATGCTAAAACTAATCGTCCCAACAATTATATTACTACCACTGACATGACTTTCCAAAAAACACATAATTTGAATCAACACAACCACCCACAGCCTAATTATTAGCATCATCCCTCTACTATTTTTTAACCAAATCAACAACAACCTATTTAGCTGTTCCCCAACCTTTTCCTCCGACCCCCTAACAACCCCCCTCCTAATACTAACTACCTGACTCCTACCCCTCACAATCATGGCAAGCCAACGCCACTTATCCAGTGAACCACTATCACGAAAAAAACTCTACCTCTCTATACTAATCTCCCTACAAATCTCCTTAATTATAACATTCACAGCCACAGAACTAATCATATTTTATATCTTCTTCGAAACCACACTTATCCCCACCTTGGCTATCATCACCCGATGAGGCAACCAGCCAGAACGCCTGAACGCAGGCACATACTTCCTATTCTACACCCTAGTAGGCTCCCTTCCCCTACTCATCGCACTAATTTACACTCACAACACCCTAGGCTCACTAAACATTCTACTACTCACTCTCACTGCCCAAGAACTATCAAACTCCTGAGCCAACAACTTAATATGACTAGCTTACACAATAGCTTTTATAGTAAAGATACCTCTTTACGGACTCCACTTA	C	999	PASS	END=11401;SVTYPE=DEL;SVLEN=-4946;CIGAR=1M4946D;CONTIG=GGCAGGTTTGAAGCTGCTTCTTCGAATTTGCAATTCAATATGAAAATCACCTCGGAGCTGGTAAAAAGAGGCCTAACCCCTGTCTTTAGATTTACAGTCCAATGCTTCACTCAGCCATTTTACCTCACCCCCACTGATGTTCGCCGACCGTTGACTATTCTCTACAAACCACAAAGACATTGGAACACTATACCTATTATTCGGCGCATGAGCTGGAGTCCTAGGCACAGCTCTAAGCCTCCTTATTCGAGCCGAGCTGGGCCAGCCAGGCAACCTTCTAGGTAACGACCACATCTACAACGTTATCGTCACAGCCCATGCATTTGTAATAATCTTCTTCATAGTAATACCCATCATAATCGGAGGCTTTGGCAACTGACTAGTTCCCCTAATAATCGGTGCCCCCGATATGGCGTTTCCCCGCATAAACAACATAAGCTTCTGACTCTTACCTCCCTCTCTCCTACTCCTGCTCGCATCTGCTATAGTGGAGGCCGGAGCAGGAACAGGTTGAACAGTCTACCCTCCCTTAGCAGGGAACTACTCCCACCCTGGAGCCTCCGTAGACCTAACCATCTTCTCCTTACACCTAGCAGGTGTCTCCTCTATCTTAGGGGCCATCAATTTCATCACAACAATTATCAATATAAAACCCCCTGCCATAACCCAATACCAAACGCCCCTCTTCTGACTCCCTAAAGCCCATGTCGAAGCCCCCATCGCTGGGTCAATAGTACTTGCCGCAGTACTCTTAAAACTAGGCGGCTATGGTATAATACGCCTCACACTCATTCTCAACCCCCTGACAAAACACATAGCCTACCCCTTCCTTGTACTATCCCTATGAGGCATAATTATAACAAGCTCCATCTGCCTACGACAAACAGACCTAAAATCGCTCATTGCATACTCTTCAATCAGCCACATAGCCCTCGTAGTAACAGCCAT
TCTCATCCAAACCCCCTGAAGCTTCACCGGCGCAGTCATTCTCATAATCGCCCACGGACTTACATCCTCATTACTATTCTGCCTAGCAAACTCAAACTACGAACGCACTCACAGTCGCATCATAATCCTCTCTCAAGGACTTCAAACTCTACTCCCACTAATAGCTTTTTGATGACTTCTAGCAAGCCTCGCTAACCTCGCCTTACCCCCCACTATTAACCTACTGGGAGAACTCTCTGTGCTAGTAACCACGTTCTCCTGATCAAATATCACTCTCCTACTCACAGGACTCAACATACTAGTCACAGCCCTATACTCCCTCTACATATTTACCACAACACAATGGGGCTCACTCACCCACCACATTAACAACATAAAACCC	GT:GQ:PL:PR:SR:SB:FS:VF:VF1:VAF1:VF2:VAF2:MLQS	0/1:999:999,0,999:4718,1010:6486,1373:3634,2852,713,660:22.239:7947,1920:3957,1920:0.326697:3990,1920:0.324873:.
```

Then the estimated heteroplasmy level of this variant can be calculated as `(0.326697 + 0.324873) / 2 = 0.325785`.
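The same calculation can be scripted. The following Python sketch reads the `FORMAT` and sample columns of a record, averages the reported `VAF1` and `VAF2` values, and recomputes each VAF from the `VF1` and `VF2` counts using the formula in the FORMAT table. The helper names are hypothetical.

```python
def sample_fields(format_col: str, sample_col: str) -> dict:
    """Pair FORMAT keys with the values of one sample column."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))


def vaf_from_vf(vf: str) -> float:
    """Compute ALT / (ALT + REF) from a 'REF,ALT' fragment count pair."""
    ref, alt = (int(x) for x in vf.split(","))
    total = ref + alt
    return alt / total if total else float("nan")


def heteroplasmy(format_col: str, sample_col: str) -> float:
    """Average the reported VAF1 and VAF2 values of one sample."""
    f = sample_fields(format_col, sample_col)
    return (float(f["VAF1"]) + float(f["VAF2"])) / 2


# FORMAT and sample columns copied from the chrM example record.
fmt = "GT:GQ:PL:PR:SR:SB:FS:VF:VF1:VAF1:VF2:VAF2:MLQS"
smp = ("0/1:999:999,0,999:4718,1010:6486,1373:3634,2852,713,660:22.239:"
       "7947,1920:3957,1920:0.326697:3990,1920:0.324873:.")

fields = sample_fields(fmt, smp)
print(round(heteroplasmy(fmt, smp), 6))
print(round(vaf_from_vf(fields["VF1"]), 6), round(vaf_from_vf(fields["VF2"]), 6))
```

Recomputing from the fragment counts is a useful sanity check, because `VAF1` and `VAF2` are derived directly from `VF1` and `VF2`.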

#### VCF FILTER Fields

The following table lists the VCF FILTER fields applied to all VCF output.

| ID        | Level  | Description                                               |
| --------- | ------ | --------------------------------------------------------- |
| Duplicate | Record | Variant is present in the VCF using a different notation. |

**Germline**

The following table lists the VCF FILTER fields applied to all germline VCF output.

| ID                | Level  | Description                                                                                                                                                      |
| ----------------- | ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| MinQUAL           | Record | QUAL score is less than a threshold.                                                                                                                             |
| Ploidy            | Record | For DEL and DUP variants, the genotypes of overlapping variants with similar size are inconsistent with diploid expectation.                                     |
| MaxDepth          | Record | Depth is greater than 3x the median chromosome depth near one or both variant breakends. Split read (`SR`) support for MaxDepth filtered variants is subsampled. |
| MaxMQ0Frac        | Record | For a small variant (<1000 bases), the fraction of reads in all samples with MAPQ0 around either breakend exceeds 0.4.                                           |
| LowSupport        | Record | For variants significantly larger than the paired read fragment size, there is low paired-read support for the alternate allele in any sample.                   |
| LargeContigFilter | Record | Assembled contig at SV locus does not successfully realign to the regions +- 200bp of both breakends with mapq >= 40 and identity >= 90%.                        |

**Germline Multi-sample**

The following table lists the VCF FILTER fields unique to germline multi-sample VCF output.

| ID       | Level  | Description                                    |
| -------- | ------ | ---------------------------------------------- |
| SampleFT | Record | No sample passes all the sample-level filters. |

**Tumor-Normal Somatic**

The following table lists the VCF FILTER fields applied to tumor-normal somatic VCF output.

| ID                | Level  | Description                                                                                                                                                                         |
| ----------------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| MinSomaticScore   | Record | SOMATICSCORE is less than a threshold.                                                                                                                                              |
| MaxDepth          | Record | Normal sample site depth is greater than 3x the median chromosome depth near one or both variant breakends. Split read (`SR`) support for MaxDepth filtered variants is subsampled. |
| MaxMQ0Frac        | Record | For a small variant (< 1000 bases) in the normal sample, the fraction of reads with MAPQ0 around either breakend exceeds 0.4.                                                       |
| SystematicNoise   | Record | Variant overlaps with one of the paired regions in the systematic noise BEDPE file with matched orientation.                                                                        |
| LargeContigFilter | Record | Assembled contig at SV locus does not successfully realign to the regions +- 200bp of both breakends with mapq >= 40 and identity >= 90%.                                           |

**Tumor-Only**

The following table lists the VCF FILTER fields applied to tumor-only VCF output.

| ID                | Level  | Description                                                                                                                                                                        |
| ----------------- | ------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| MinSomaticScore   | Record | SOMATICSCORE is less than a threshold.                                                                                                                                             |
| SystematicNoise   | Record | Variant overlaps with one of the paired regions in the systematic noise BEDPE file with matched orientation.                                                                       |
| MaxDepth          | Record | Tumor sample site depth is greater than 3x the median chromosome depth near one or both variant breakends. Split read (`SR`) support for MaxDepth filtered variants is subsampled. |
| MaxMQ0Frac        | Record | For a small variant (<1000 bases), the fraction of reads with MAPQ0 around either breakend exceeds 0.4.                                                                            |
| LargeContigFilter | Record | Assembled contig at SV locus does not successfully realign to the regions within ±200 bp of both breakends with MAPQ >= 40 and identity >= 90%.                                   |

Note that while the MaxDepth VCF FILTER header is always present, tumor-only MaxDepth filtering is not enabled unless `--sv-apply-somatic-max-depth true` is provided on the command-line.

#### Interpretation of VCF Filters

There are two levels of VCF filters: record level (`FILTER`) and sample level (`FORMAT/FT`). Most record-level filters are independent of those at the sample-level. However, in a germline analysis, if none of the samples pass all sample-level filters, the `SampleFT` record-level filter is applied.

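The two filter levels can be inspected together with standard text tools. The following is a minimal sketch, not part of DRAGEN: it builds a tiny made-up two-sample VCF and lists the (variant, sample) pairs where both the record-level `FILTER` and the sample-level `FORMAT/FT` are `PASS`. The record contents, sample names, and the `MinGQ` and `HomRef` values in `FT` are illustrative assumptions.

```sh
# Build a tiny two-sample VCF with made-up records (illustration only).
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf 'chr1\t1000\tsv1\tN\t<DEL>\t50\tPASS\tSVTYPE=DEL;END=1500\tGT:FT:GQ\t0/1:PASS:40\t0/0:HomRef:30\n'
  printf 'chr1\t9000\tsv2\tN\t<DEL>\t30\tSampleFT\tSVTYPE=DEL;END=9400\tGT:FT:GQ\t0/1:MinGQ:3\t0/0:HomRef:20\n'
} > "$vcf"

# List (variant ID, sample) pairs that pass at both levels:
# record-level FILTER is PASS and the sample FORMAT/FT is PASS.
passing=$(awk -F'\t' '
  /^##/     { next }                                   # skip meta lines
  /^#CHROM/ { for (s = 10; s <= NF; s++) name[s] = $s; next }
  {
    # Find the position of the FT key in the FORMAT column.
    ft = 0
    n = split($9, keys, ":")
    for (i = 1; i <= n; i++) if (keys[i] == "FT") ft = i
    if ($7 != "PASS" || ft == 0) next
    for (s = 10; s <= NF; s++) {
      split($s, vals, ":")
      if (vals[ft] == "PASS") print $3 "\t" name[s]
    }
  }' "$vcf")

printf '%s\n' "$passing"
rm -f "$vcf"
```

In this sketch, `sv2` carries `SampleFT` at the record level because neither sample passes its sample-level filters, so only `sv1` in sample `S1` is reported.
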
#### Interpretation of INFO/EVENT Field

Some structural variants reported in the VCF, such as translocations, represent a single novel sequence junction in the sample. The `INFO/EVENT` field indicates that two or more such junctions are hypothesized to occur together as part of a single variant event. All individual variant records belonging to the same event share the same `INFO/EVENT` string. Note that although such an inference could be applied after SV calling by analyzing the relative distance and orientation of the called variant breakpoints, the SV caller incorporates this event mechanism into the calling process to increase sensitivity towards such larger-scale events. Given that at least one junction in the event has already passed standard variant candidacy thresholds, sensitivity is improved by lowering the evidence thresholds for additional junctions which occur in a pattern consistent with a multijunction event (such as a reciprocal translocation pair).

Although this mechanism could generalize to events including an arbitrary number of junctions, it is currently limited to two. Thus, at present it is most useful for identifying and improving sensitivity towards reciprocal translocation pairs.

#### VCF ID Field

The VCF `ID` (identifier) field can be used for annotation. For `BND` (breakend) records, such as those for translocations, the `ID` value is also used to link breakend mates or partners. The following is an example of a VCF `ID` field from the SV caller:

```
DRAGEN:INS:1577:0:0:0:3:0
```

The value provided in the `ID` field reflects the SV association graph edge(s) from which the SV or indel was discovered. The value is guaranteed to be unique within any single VCF output file produced by the SV caller. These `ID` values are therefore used to link associated breakend records using the standard VCF `MATEID` key.

Use the entire `ID` value only as an opaque unique key, because parsing it could lead to incompatibility across DRAGEN versions. The integer values within the `ID` value are internal indices of objects within SV pipeline stages. Their exact structure may change, and they are intended for debugging purposes only. Therefore, it is recommended to associate BNDs only through `INFO/MATEID` (or `INFO/EVENT` for multi-junction events).

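The recommendation above can be sketched as follows: breakend records are paired by comparing the whole `ID` string against `INFO/MATEID`, without splitting the `ID` on `:`. The VCF content and the `exampleBND`/`exampleDEL` identifiers are hypothetical and do not reflect the real DRAGEN `ID` layout.

```sh
# Tiny made-up VCF with one breakend pair and one deletion.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\texampleBND:1:0\tA\tA[chr5:200[\t60\tPASS\tSVTYPE=BND;MATEID=exampleBND:1:1\n'
  printf 'chr2\t500\texampleDEL:2:0\tN\t<DEL>\t80\tPASS\tSVTYPE=DEL;END=900\n'
  printf 'chr5\t200\texampleBND:1:1\tT\t]chr1:100]T\t60\tPASS\tSVTYPE=BND;MATEID=exampleBND:1:0\n'
} > "$vcf"

# Pair breakends by treating ID and MATEID as opaque strings.
pairs=$(awk -F'\t' '
  /^#/ { next }
  {
    mate = ""
    n = split($8, info, ";")
    for (i = 1; i <= n; i++)
      if (info[i] ~ /^MATEID=/) mate = substr(info[i], 8)   # text after "MATEID="
    if (mate == "") next                   # record has no mate
    if (mate in seen) print mate "\t" $3   # partner already seen: emit the pair
    seen[$3] = 1
  }' "$vcf")

printf '%s\n' "$pairs"
rm -f "$vcf"
```

Because the `ID` is never parsed, a sketch like this keeps working if the internal index layout changes between versions.
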
See the DRAGEN Software Support Site for information on the latest version of DRAGEN.

#### Convert SV VCF to BEDPE Format

It can sometimes be convenient to express structural variants in BEDPE format. For such applications, DRAGEN recommends the script vcfToBedpe available on GitHub. The repository is forked from @hall-lab with modifications to support VCF 4.2 SV format.

BEDPE format greatly reduces structural variant information compared to the SV caller VCF output. In particular, breakend orientation, breakend homology, and insertion sequence are lost, in addition to the ability to define fields for locus and sample specific information. For this reason, Illumina only recommends BEDPE as a temporary output for applications that require it.

### Output Filtering Options

In addition to filtering labels, the SV caller provides options to control filtering of output variants. The following table lists the options that control filtering behavior.

| Option                                        | Description                                                                                                                                   |
| --------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| --sv-enable-high-precision-filters            | Enable high-precision germline filters based on split and spanning read counts (Default: false).                                              |
| --sv-enable-somatic-high-precision-filters    | Enable high-precision somatic (tumor-only or tumor-normal) filters based on split and spanning read counts (Default: false).                  |
| --sv-min-required-unique-read-count           | Minimum total unique reads supporting a variant when high-precision filters are enabled ([default values](#default-values)).                  |
| --sv-min-required-spanning-read-count         | Minimum spanning reads supporting a variant when high-precision filters are enabled ([default values](#default-values)).                      |
| --sv-min-required-split-read-count            | Minimum split reads supporting a variant when high-precision filters are enabled ([default values](#default-values)).                         |
| --sv-hotspot-min-required-unique-read-count   | Minimum total unique reads in hotspot regions when high-precision filters are enabled ([default values](#default-values)).                    |
| --sv-hotspot-min-required-spanning-read-count | Minimum spanning reads in hotspot regions when high-precision filters are enabled ([default values](#default-values)).                        |
| --sv-hotspot-min-required-split-read-count    | Minimum split reads in hotspot regions when high-precision filters are enabled ([default values](#default-values)).                           |
| --sv-min-diploid-variant-score                | Minimum VCF 'QUAL' score for a variant to be included in the diploid VCF ([default values](#default-values)).                                 |
| --sv-min-somatic-score                        | Minimum somatic quality score for a variant to be included in the somatic VCF ([default values](#default-values)).                            |
| --sv-min-pass-diploid-variant-score           | VCF 'QUAL' score below which a variant is marked as filtered in the diploid VCF ([default values](#default-values)).                          |
| --sv-min-pass-somatic-score                   | Somatic quality score below which a variant is marked as filtered in the somatic VCF ([default values](#default-values)).                     |
| --sv-min-scored-variant-size                  | Minimum size of variant to be scored and included in the VCF output ([default values](#default-values)).                                      |
| --sv-hotspot-min-scored-variant-size          | Minimum size of variant in hotspot regions to be scored and included in the VCF output ([default values](#default-values)).                   |
| --sv-diploid-max-mq0-frac                     | Control filtration based on MQ0 fraction ([default values](#default-values)).                                                                 |
| --sv-min-pass-diploid-gt-score                | Minimum genotype quality score below which single samples are filtered for a variant in the diploid VCF ([default values](#default-values)).  |

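The minimum and minimum-pass score options work as a pair: a variant scoring below the minimum is left out of the VCF, and a variant scoring at or above the minimum but below the pass threshold is written with a filter label. The sketch below illustrates that reading of the option descriptions. The function name `classify_score` is hypothetical, integer scores are assumed, and the thresholds 10 and 20 are taken from the [default values](#default-values) for the diploid and somatic scores.

```sh
# Classify a score against the two thresholds.
# Arguments: score [min_score] [min_pass_score]; defaults assumed to be 10 and 20.
classify_score() {
  score=$1
  min_score=${2:-10}
  min_pass=${3:-20}
  if [ "$score" -lt "$min_score" ]; then
    echo "omitted"        # below the minimum: not written to the VCF
  elif [ "$score" -lt "$min_pass" ]; then
    echo "filtered"       # written, but marked as filtered
  else
    echo "score-pass"     # the score alone does not trigger a filter
  fi
}

for s in 5 10 19 20; do
  printf '%s\t%s\n' "$s" "$(classify_score "$s")"
done
```

A `score-pass` result only means the score thresholds are met. Other filters, such as the read-count based high-precision filters, can still apply.
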
## Default Values

| Item                                              | Value |
| ------------------------------------------------- | ----- |
| SV\_MIN\_SCORED\_VARIANT\_SIZE                    | 35    |
| SV\_TIN\_CONTAM\_TOLERANCE                        | 0.15  |
| SV\_MIN\_EDGE\_OBSERVATIONS                       | 3     |
| SV\_MIN\_CANDIDATE\_SPANNING\_COUNT               | 3     |
| SV\_HOTSPOT\_MIN\_SCORED\_VARIANT\_SIZE           | 25    |
| SV\_MIN\_GQ                                       | 5     |
| SV\_MIN\_REQUIRED\_UNIQUE\_READ\_COUNT            | 3     |
| SV\_MIN\_REQUIRED\_SPANNING\_READ\_COUNT          | 1     |
| SV\_MIN\_REQUIRED\_SPLIT\_READ\_COUNT             | 1     |
| SV\_HOTSPOT\_MIN\_REQUIRED\_UNIQUE\_READ\_COUNT   | 3     |
| SV\_HOTSPOT\_MIN\_REQUIRED\_SPANNING\_READ\_COUNT | 0     |
| SV\_HOTSPOT\_MIN\_REQUIRED\_SPLIT\_READ\_COUNT    | 1     |
| SV\_ML\_MIN\_PASS\_DEL\_PROB                      | 0.5   |
| SV\_ML\_MIN\_PASS\_INS\_PROB                      | 0.5   |
| SV\_MIN\_DIPLOID\_VARIANT\_SCORE                  | 10    |
| SV\_MIN\_SOMATIC\_SCORE                           | 10    |
| SV\_MIN\_PASS\_DIPLOID\_VARIANT\_SCORE            | 20    |
| SV\_MIN\_PASS\_SOMATIC\_SCORE                     | 20    |
| SV\_DIPLOID\_MAX\_MQ0\_FRAC                       | 0.4   |
| SV\_MIN\_PASS\_DIPLOID\_GT\_SCORE                 | 15    |

## Benchmarking DRAGEN SV VCF against NIST T2T Q100 truthset

DRAGEN's SV calling on HG002 can be evaluated using a recent draft SV truth set from NIST based on T2T Q100 assemblies (e.g., [v0.019](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.019-20241113/)). This section shows an example of benchmarking using truvari (<https://github.com/ACEnglish/truvari>).

Running truvari is a two-step process: `truvari bench` followed by `truvari refine`. For details, see the [README](https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.019-20241113/NIST_HG002_DraftBenchmark_defrabbV0.019-20241113_README.md) from NIST.

NIST recommends post-processing the truth VCF to remove records with "\*" in the ALT field. This can be done with `bcftools view -e 'ALT="*"' -Oz -o benchmark_noAltAst.vcf.gz benchmark.vcf.gz`.

To run `truvari refine`, records with "\<DUP:TANDEM>" in the ALT field need to be removed from the DRAGEN SV VCF output. DRAGEN reports tandem duplications with size larger than 1000 with a symbolic "\<DUP:TANDEM>" allele instead of the actual sequence in the final VCF (see [Large Variant Representation](#large-variant-representation)). Such records can be removed with `bcftools view -e 'ALT="<DUP:TANDEM>"' -Oz -o dragen.sv.tandup_removed.vcf.gz dragen.sv.vcf.gz`.

Taken together, to benchmark a DRAGEN SV VCF against the NIST T2T Q100 truth set, run the following script:

```sh
base_vcf="/path/to/GRCh38_HG2-T2TQ100-V1.1_stvar.vcf.gz"
truth_bed="/path/to/GRCh38_HG2-T2TQ100-V1.1_stvar.benchmark.bed"

query_vcf="/path/to/query.sv.vcf.gz"
outdir="/path/to/output"
ref="/path/to/hg38.fa"

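# Filtered copies of the truth and query VCFs used by truvari.
filtered_base="/path/to/GRCh38_HG2-T2TQ100-V1.1_stvar.filt.vcf.gz"
filtered_query="/path/to/query.sv.filt.vcf.gz"

# Remove truth records with "*" in the ALT field, as recommended by NIST.
bcftools view -e 'ALT="*"' -Oz -o "${filtered_base}" "${base_vcf}"
# Remove symbolic <DUP:TANDEM> records from the DRAGEN output before truvari refine.
bcftools view -e 'ALT="<DUP:TANDEM>"' -Oz -o "${filtered_query}" "${query_vcf}"

# truvari expects bgzipped, tabix-indexed VCF input.
bcftools index -t "${filtered_base}"
bcftools index -t "${filtered_query}"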

truvari bench \
 --base=${filtered_base} \
 --comp=${filtered_query} \
 --output=${outdir} \
 --includebed=${truth_bed} \
 --refdist=2000 \
 --pctseq=0.7 \
 --pctsize=0.7 \
 --pctovl=0.0 \
 --passonly \
 --minhaplen=50 \
 --sizemin=50 \
 --sizefilt=30 \
 --sizemax=50000 \
 --pick=ac \
 --extend=0 \
 --chunksize=5000

truvari refine \
  --use-original-vcfs \
  --use-region-coords \
  --recount \
  --align=mafft \
  --threads=4 \
  --regions=${outdir}/candidate.refine.bed \
  --reference=${ref} \
  ${outdir} 
```
