Structural Variant Calling
The DRAGEN Structural Variant (SV) Caller integrates and extends Manta structural variant calling methods to provide SV and indel calls of at least SV_MIN_SCORED_VARIANT_SIZE (default values) bases. SVs and indels are called from mapped paired-end sequencing reads. The SV caller is optimized for analysis of diploid germline variation in small sets of individuals and somatic variation in tumor-normal sample pairs.
The SV caller performs the following actions:
Discovers, assembles, and scores large-scale SVs, medium-sized indels, and large insertions within a single efficient workflow.
Combines paired and split-read evidence during SV discovery and scoring to improve accuracy, but does not require split-reads or successful breakpoint assemblies to report a variant in cases where there is strong evidence otherwise.
Provides scoring models for 1) germline variants in small sets of diploid samples, 2) somatic variants in matched tumor-normal sample pairs, and 3) somatic and germline variants in tumor-only samples.
All SV and indel inferences are output in VCF 4.2 format.
DRAGEN SV Caller Overview
The DRAGEN SV caller divides the SV and indel discovery process into the following steps.
Reads input files to estimate alignment statistics, including fragment size distribution and chromosome level depth. For more information on the SV caller input options, see Command Line Options.
Scans the genome or a subset of the genome (specified by the call regions) to build various genome-wide data structures, including a breakend association graph of all SV associated regions from single reads. DRAGEN SV then merges and filters these regions in the graph to reduce noise, which can improve precision and runtime. The graph contains edges that connect all regions of the genome that have a possible breakend association. An edge can connect two different regions of the genome to represent evidence of a long-range association, or it can connect a region to itself to capture a local indel/small SV association. These associations are more general than a specific SV hypothesis, and multiple breakend candidates might be found on one edge, although typically only one or two candidates are found per edge. Instead of passing an inclusion region BED file, you can pass an exclusion region BED file to DRAGEN so that any SV breakend that overlaps these regions is removed from downstream analyses. The excluded regions are removed from the graph building process, but active regions that are close to the boundaries of the excluded regions can be extended into the excluded regions during the refinement step. Hence, the final SV calls may still extend into these regions.
Analyzes the breakend association graph to discover candidate SVs, then scores discovered candidate SVs. Analysis and scoring are performed as follows.
Infers SV candidates that are associated with the given graph edge.
Assembles the SV breakends.
Scores/genotypes and filters each SV candidate under various biological models (currently germline, tumor-normal, and tumor-only).
Outputs scored SVs to VCF.
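The graph-building step above can be pictured with a small sketch. This is an illustration only, not the DRAGEN implementation: the fixed-size region binning, the `BIN_SIZE` constant, and the function names are hypothetical, and the threshold argument merely mirrors the role that `--sv-min-edge-observations` plays in the option list.

```python
from collections import Counter

BIN_SIZE = 1000  # hypothetical region granularity, not a DRAGEN parameter


def region_of(chrom, pos):
    """Map a genomic position to a coarse region key."""
    return (chrom, pos // BIN_SIZE)


def build_graph(observations, min_edge_observations):
    """Count evidence per edge and drop edges below the threshold.

    Each observation is ((chrom_a, pos_a), (chrom_b, pos_b)): the two loci
    that one piece of read evidence associates. An edge whose two ends fall
    in the same region is a self-edge (local indel / small SV association).
    """
    edges = Counter()
    for locus_a, locus_b in observations:
        a, b = sorted((region_of(*locus_a), region_of(*locus_b)))
        edges[(a, b)] += 1
    # Noise reduction: keep only edges with enough supporting observations.
    return {edge: n for edge, n in edges.items() if n >= min_edge_observations}


if __name__ == "__main__":
    obs = (
        [(("chr1", 1500), ("chr1", 1600))] * 3      # local association
        + [(("chr1", 5000), ("chr2", 9000))] * 3    # long-range association
        + [(("chr3", 10), ("chr4", 20))]            # single noisy observation
    )
    for edge, count in build_graph(obs, 3).items():
        print(edge, count)
```

Each surviving edge would then be handed to candidate inference, assembly, and scoring, as listed in the sub-steps above.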
DRAGEN SV Caller Capabilities
The DRAGEN SV caller can discover all identifiable structural variant types in the absence of copy number analysis and large-scale de novo assembly. For more information on detectable types, see Detected Variant Classes.
For each structural variant and indel, the SV caller attempts to assemble the breakends by gathering nearby evidential reads, and to call SV events to base pair resolution by aligning assemblies against the reference genome. The SV caller then reports the left-shifted breakend coordinate (per the VCF 4.2 SV reporting guidelines), together with any breakend homology sequence and/or inserted sequence between the breakends. As a result, the reported coordinates of an SV event may not match what the read alignments show in IGV. Often the assembly fails to provide a confident explanation of the data, especially in repeat regions. In that case, the SV caller does not provide single-base resolution breakpoints or the associated split-read support. Instead, it approximates the event breakpoints, scores the event under the same unified likelihood model used in regular cases, and reports the variant as IMPRECISE.
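To see why a reported coordinate can differ from where the reads appear to break in IGV, consider left-shifting a deletion inside a short repeat. The sketch below is a generic illustration of left-shifting and breakend homology, not DRAGEN code; the function names and the toy reference string are made up for the example.

```python
def left_shift_deletion(ref, start, length):
    """Return the left-most equivalent 0-based start of a deletion.

    Deleting ref[start:start+length] and deleting the window one base to
    the left give the same sequence whenever the base entering the window
    equals the base leaving it.
    """
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start


def breakend_homology(ref, start, length):
    """Count how many bases a left-shifted deletion could slide rightward."""
    hom = 0
    while (start + length + hom < len(ref)
           and ref[start + hom] == ref[start + length + hom]):
        hom += 1
    return hom


if __name__ == "__main__":
    ref = "GGACACACTT"           # toy reference with an AC repeat
    observed_start = 6           # where an aligner happened to place the 2-base deletion
    shifted = left_shift_deletion(ref, observed_start, 2)
    print(shifted, breakend_homology(ref, shifted, 2))
```

In this toy case the deletion observed at offset 6 is reported at offset 2 with four bases of breakend homology, and both placements produce the same alternate sequence.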
The sequencing reads provided as input to the SV caller are expected to be from a paired-end sequencing assay that results in an "innie" orientation between the two reads of each sequence fragment, each presenting a read from the outer edge of the fragment insert inward.
The SV caller is primarily tested for whole-genome and whole-exome (or other targeted enrichment) sequencing assays on DNA. For these assays the following applications are supported:
Joint analysis of 5 or fewer diploid individuals
Subtractive analysis of a matched tumor-normal sample pair
Analysis of an individual tumor sample
For joint analysis, there is no specific restriction against larger cohorts, but there might be stability or call quality issues.
When performing somatic calling, the matched normal sample might be contaminated with tumor cells. The contamination can substantially reduce somatic variant recall. To account for Tumor-in-Normal (TiN) contamination, you can enable liquid tumor mode. For more information, see Liquid Tumor Calling.
Tumor samples can be analyzed without a matched normal sample. In this case, both germline and somatic variants are scored and reported in the output.
Detected Variant Classes
The SV caller can discover all variation classes that can be explained as novel DNA adjacencies in the genome. Novel DNA adjacencies are classified into the following categories based on the breakend pattern:
Deletions
Insertions
Insertions in the result can be divided into the following two subclasses depending on if the inserted sequence can be fully assembled. 1) Fully-assembled insertions; 2) Partially-assembled/incomplete (inferred) insertions. See VCF record example.
Partially assembled insertion sequences can be reported in the single-breakend format (described in VCF v4.2 Section 5.4.9) with --sv-report-incomplete-ins-as-bnd=true. As a heuristic, DRAGEN attempts to pair partially-assembled insertions into a single INS SV call if they are proximal and have consistent orientations, e.g. two partially-assembled insertions that derive from a single large insertion. If no such pairing can be found for a partially-assembled insertion sequence and the sequence does not match any MEI sequence (via the workflow described in Mobile Element Insertions Detection), no SV record is created.
Tandem Duplications
Inversions
Unclassified breakpoints corresponding to intra and inter-chromosomal translocations, or complex structural variants. These are reported as a matching pair of VCF BND records as per VCF v4.2 Section 5.4.
Mobile Element Insertions Detection
The general purpose SV routine can detect Mobile Element Insertions (MEIs) with assembled inserted sequences like other regular insertions. If missed by the general purpose SV routine, MEIs will be rescued by the MEI specific routine based on similarity between assembled contigs and known sequences in the MEI catalog.
The MEI catalog based rescuing functionality is enabled by default via --sv-enable-mobile-element-sequences=true. The MEI catalog is described in the file <INSTALL_PATH>/config/sv_mobile_element_sequences.fa and accepted via --sv-mobile-element-sequences-file by default. The catalog contains sequences of common mobile elements from the Dfam database, including Alu, LINE1, and SVA subfamilies. The catalog can be customized by the user with additional sequences or with a different set of sequences. The rescued records will be placed in the same VCF as the general purpose SV routine and presented as regular INS events.
An example of such rescued record:
Known Limitations
The SV caller cannot directly discover the following variant types:
Dispersed duplications.
Dispersed duplications may be indirectly called as insertions or unclassified breakends.
Most expansion/contraction variants of a reference tandem repeat.
Breakends corresponding to small inversions.
The limiting size is not tested, but in theory, detection falls off below ~200 bases. Micro-inversions might be detected indirectly as combined insertion/deletion variants.
Fully-assembled large insertions.
The maximum fully-assembled insertion size should correspond to approximately twice the read-pair fragment size, but power to fully assemble the insertion should fall off to impractical levels before this size.
The SV caller does detect and report very large insertions when the breakend signature of such an event is found, even though the inserted sequence cannot be fully assembled.
Large germline deletions and duplications
The SV caller does not report germline deletions and duplications larger than 1 Mb because it relies on split-read and read-pair evidence at breakpoint loci, which is insufficient to report the variants spanning such large regions with high confidence. The size limits can be adjusted using the options --sv-max-del-scored-variant-size (deletions) and --sv-max-dup-scored-variant-size (duplications).
Fold-back inversions
Fold-back inversions in which the start and end positions are less than 1 kbp apart are not reliably called.
Increased runtime due to poor quality reads.
High levels of discordant alignments can burden the SV caller by generating an excessive number of structural variant candidates, leading to increased runtime. The DRAGEN aligner provides three key metrics to gauge potential sample quality issues: soft-clipped bases, supplementary alignments (indicating chimeric reads) and improperly paired reads (indicating discordant reads). For well-behaved samples, these metrics typically remain below 10% of the total aligned bases or reads. If these metrics are excessively high, it is recommended to investigate potential sources of error upstream. This may include library preparation protocols, input material quantity, and sequencing run quality. We also recommend loading the sample into IGV and comparing to more well-behaved samples.
More general repeat-based limitations exist for all variant types:
Power to assemble variants to breakend resolution falls to zero as breakend repeat length approaches the read size.
Power to detect any breakend falls to (nearly) zero as the breakend repeat length approaches the fragment size.
While the SV caller classifies certain novel DNA-adjacencies into variant classes, it has a limited ability to infer high-level events resulting from complex rearrangements, so certain calls summarized as deletions, duplications, and insertions might be better described by looking at the full system of breakends and copy number changes associated with a given event. For example, a set of overlapping deletions, duplications and inversion-like breakpoints could form a chromothriptic rearrangement, or the two sides of an insertion of unknown length may not actually be connected and could instead form a balanced interchromosomal translocation. Care should be taken when interpreting somatic SVs as complex rearrangements are common in cancer and the SV caller classifications are only valid for simple isolated SVs.
Systematic Noise Filtering
When DRAGEN-SV is used in somatic mode (tumor-only or tumor-normal), a BEDPE file containing a set of paired regions can be specified to filter out sequencing or systematic noise as well as recurrent germline calls. Any variant that overlaps one of the systematic noise paired regions (with a population count of at least 2) and has the same orientation is marked as SystematicNoise in the final VCF file. The BEDPE file is passed via the command line option --sv-systematic-noise.
The systematic noise BEDPE file is built using VCFs that were generated by the DRAGEN-SV tumor-only pipeline when run on normal samples that do not necessarily match the subject the tumor sample was taken from. The file might contain several dozen samples.
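The matching rule described above (both breakends overlap a paired noise region, orientations agree, population count of at least 2) can be sketched as follows. This is a simplified illustration rather than the DRAGEN logic: the `NoisePair` and `Breakend` types are hypothetical, and treating "overlap" as `start <= pos <= end` on a single breakend position is an assumption.

```python
from typing import NamedTuple


class NoisePair(NamedTuple):
    contig1: str
    start1: int
    end1: int
    contig2: str
    start2: int
    end2: int
    score: int          # number of occurrences in the cohort
    orientation1: str   # "+" or "-"
    orientation2: str   # "+" or "-"


class Breakend(NamedTuple):
    contig: str
    pos: int            # 0-based position
    orientation: str    # "+" or "-"


def _hits(b, contig, start, end, orientation):
    # Assumption: a breakend overlaps a region when start <= pos <= end.
    return (b.contig == contig and start <= b.pos <= end
            and b.orientation == orientation)


def is_systematic_noise(b1, b2, noise_pairs, min_count=2):
    """Return True if both breakends match one sufficiently recurrent pair."""
    for n in noise_pairs:
        if n.score < min_count:
            continue  # too rare in the cohort to count as systematic
        if (_hits(b1, n.contig1, n.start1, n.end1, n.orientation1)
                and _hits(b2, n.contig2, n.start2, n.end2, n.orientation2)):
            return True
    return False


if __name__ == "__main__":
    noise = [NoisePair("chr1", 100, 200, "chr5", 5000, 5100, 3, "+", "-")]
    print(is_systematic_noise(Breakend("chr1", 150, "+"),
                              Breakend("chr5", 5050, "-"), noise))
```

A production matcher would also need to handle breakend pairs given in the opposite order and would use an interval index rather than a linear scan.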
Generating systematic noise BEDPE file
You can generate systematic noise BEDPE files from normal samples collected using your own library prep, sequencing system, and panels.
To generate a BEDPE file, do as follows.
Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.
Build the BEDPE file from those VCFs by passing --sv-build-systematic-noise-vcfs-list, a file that lists the input VCFs from the previous step with one VCF per line. An example command line is provided below.
You can also build systematic noise BEDPE files in the cloud using the DRAGEN Baseline Builder App on BaseSpace or the DRAGEN Systematic Noise File Builder Pipeline on ICA.
Pre-built SV systematic noise BEDPE files
The following prebuilt systematic noise files for WGS are available for download on the DRAGEN Software Support Site page. It is recommended to select the one that best matches your library prep and application and ideally to generate it from your own set of samples. Each systematic noise file contains a version string that DRAGEN uses to check the compatibility by default and exits early if a wrong systematic noise file is provided. More details are provided in the README within the downloadable package.
SV systematic noise BEDPE file format
The systematic noise BEDPE is formatted as follows:
contig1 --- chromosome of the first region (string)
start1 --- start position of the first region (0-based left-most position of the first breakpoint containing genomic interval, integer)
end1 --- end position of the first region (0-based left-most position of the first breakpoint containing genomic interval, integer)
contig2 --- chromosome of the second region (string)
start2 --- start position of the second region (0-based left-most position of the second breakpoint containing genomic interval, integer)
end2 --- end position of the second region (0-based left-most position of the second breakpoint containing genomic interval, integer)
event_id --- The paired region unique ID (string)
score --- The number of occurrences in the cohort
orientation1 --- direction of breakpoint1 relative to the reference; "+" indicates to the right, "-" to the left (string, "+", or "-")
orientation2 --- direction of breakpoint2 relative to the reference; "+" indicates to the right, "-" to the left (string, "+", or "-")
assembly-status --- If all variants used to generate the noise candidate have end-to-end local assemblies, noise candidate is "precise", otherwise it is "imprecise" (string, "precise", or "imprecise")
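As a reading aid for the column layout above, here is a minimal parser sketch. It assumes the eleven fields appear tab-separated in the listed order with no extra columns, which the text does not state explicitly; the `NoiseRecord` class, the function name, and the sample line are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseRecord:
    contig1: str
    start1: int
    end1: int
    contig2: str
    start2: int
    end2: int
    event_id: str
    score: int
    orientation1: str
    orientation2: str
    assembly_status: str


def parse_noise_line(line):
    """Parse one data line; assumes 11 tab-separated fields in documented order."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 11:
        raise ValueError(f"expected 11 fields, got {len(f)}")
    if f[8] not in ("+", "-") or f[9] not in ("+", "-"):
        raise ValueError("orientation must be '+' or '-'")
    if f[10] not in ("precise", "imprecise"):
        raise ValueError("assembly-status must be 'precise' or 'imprecise'")
    return NoiseRecord(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), int(f[5]),
                       f[6], int(f[7]), f[8], f[9], f[10])


if __name__ == "__main__":
    # Made-up record, for illustration only.
    example = "chr1\t100\t200\tchr5\t5000\t5100\tevt1\t3\t+\t-\tprecise"
    print(parse_noise_line(example))
```

Any header or version lines in a real file would need to be skipped before calling the parser.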
SV Scoring
The SV caller applies a diploid scoring model for one or more diploid samples (treated as unrelated), as well as a somatic scoring model when a tumor and matched normal sample pair is given.
Germline scoring model
The germline scoring model produces diploid genotype probabilities for each candidate structural variant. Most candidates are approximated as independent for scoring purposes and modeled under a bi-allelic, diploid genotype likelihood setting. DRAGEN solves for the posterior probability over the possible genotypes given the sequencing data by approximating it as proportional to the product of the prior probability of a genotype and the conditional probability of observing the set of read fragments (post-filtering) given that genotype. DRAGEN treats each read fragment independently, so the conditional probability of the set of read fragments is the product of the conditional probabilities of the individual read fragments. For each read fragment, DRAGEN combines paired-read and split-read evidence components and approximates their contributions as independent by representing the fragment's conditional probability as the product of these components. The paired-read component is weighted by a linear ramp from one to zero that depends on the candidate event type and size, because a tiny event does not significantly affect the paired-read mapping status.
The paired-read component is modeled as a function measuring the deviation of the inferred fragment length from the overall distribution.
The split-read component is modeled as a function measuring the correctness of a read alignment to a breakend by multiplying, across all non-gap bases, the probability of observing a certain base call given the corresponding base of the evaluated allele.
Each read fragment may contribute only paired-read support, only split-read support, or both. Where a fragment contributes split-read support, this support may come from either or both reads in the read pair.
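The structure of the germline model (prior times a product of per-fragment likelihoods, normalized over genotypes) can be sketched numerically. Several details below are assumptions, not statements about DRAGEN: the per-fragment likelihood is taken as a mixture of the reference and alternate allele likelihoods weighted by the genotype's expected alternate fraction, the paired-read weight is applied as an exponent, and all function names and input values are hypothetical.

```python
import math

# Expected alternate-allele fraction under each diploid genotype.
GENOTYPES = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}


def combine_evidence(pair_lik, split_lik, pair_weight):
    """Combine independent evidence components for one fragment and allele.

    pair_weight in [0, 1] stands in for the linear ramp: 0 ignores the
    paired-read component (tiny events), 1 uses it fully. Applying the
    weight as an exponent is an assumption of this sketch.
    """
    return (pair_lik ** pair_weight) * split_lik


def genotype_posteriors(fragments, priors):
    """fragments: list of (lik_given_ref, lik_given_alt), all values > 0."""
    log_post = {}
    for gt, alt_frac in GENOTYPES.items():
        total = math.log(priors[gt])
        for lik_ref, lik_alt in fragments:
            # Assumed mixture form for P(fragment | genotype).
            total += math.log((1.0 - alt_frac) * lik_ref + alt_frac * lik_alt)
        log_post[gt] = total
    peak = max(log_post.values())  # subtract the peak for numerical stability
    weights = {gt: math.exp(v - peak) for gt, v in log_post.items()}
    norm = sum(weights.values())
    return {gt: w / norm for gt, w in weights.items()}


if __name__ == "__main__":
    frags = [(0.9, 0.1)] * 5 + [(0.1, 0.9)] * 5   # half support ref, half alt
    flat = {"0/0": 1 / 3, "0/1": 1 / 3, "1/1": 1 / 3}
    print(genotype_posteriors(frags, flat))
```

With evidence split evenly between the two alleles, the heterozygous genotype receives almost all of the posterior mass, which is the behaviour the diploid model is meant to capture.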
Somatic scoring model
The somatic scoring model is a Bayesian probabilistic model using a tumor-normal joint genotyping approach. It aims to call somatic structural variants in tumors while avoiding germline variants and noisy variants. In this model, the tumor and normal allele frequencies are treated as nonindependent random variables. DRAGEN calculates posterior probabilities for a range of genotype hypotheses, under the assumption that the normal sample conforms to the diploid germline genotype considering homozygous reference, heterozygous, and homozygous states. The tumor sample is a mixture of the germline genotype and, if present, the somatic allele. For the somatic genotype, we consider only two states referring to the absence and presence of the somatic variant in the tumor sample. In cases where the somatic variant is not present, we account for unsystematic independent noise, while assuming an error-free scenario when the somatic variant is considered. To calculate the genotype likelihoods, the model integrates allele frequency likelihoods over the joint tumor and normal allele frequencies and applies modifications to address liquid tumors with Tumor-in-Normal (TiN) contamination. The integration is approximated with a discrete summation. In these calculations, the likelihood for each read to support a given allele is shared with the germline scoring model. The tumor-only somatic scoring model is seen as a special case of the somatic scoring model in the absence of normal data (zero coverage). The posterior probability is converted into a Phred quality score and reported in the VCF output INFO/SOMATICSCORE field. In somatic mode, a genotype state (SAMPLE/GT) and genotype posterior probabilities (SAMPLE/GQ) are not reported out, as the diploid assumption may not be valid under a tumor analysis.
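For readers unfamiliar with Phred scaling, the conversion mentioned above can be illustrated with the standard formula Q = -10 * log10(1 - p), where p is the posterior probability of the somatic state. The sketch below is generic; how DRAGEN rounds or caps INFO/SOMATICSCORE is not stated here, so the `max_score` cap is purely hypothetical.

```python
import math


def phred_score(posterior, max_score=999.0):
    """Convert a posterior probability into a Phred-scaled quality.

    max_score is a hypothetical cap so that posterior == 1.0 stays finite.
    """
    error = 1.0 - posterior
    if error <= 0.0:
        return max_score
    return min(max_score, -10.0 * math.log10(error))


if __name__ == "__main__":
    for p in (0.9, 0.99, 0.999):
        print(p, round(phred_score(p), 2))
```

Each additional factor of ten in confidence adds ten points to the score, so posteriors of 0.99 and 0.999 correspond to roughly 20 and 30.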
Large Contig Filter
The large contig filter improves SV calling precision by filtering SVs whose assembled contigs fail to corroborate the underlying breakends of the reported variant. It is enabled by default in both germline and somatic mode, but can be disabled with --sv-enable-large-contig-filter=false. For germline variant calling, this filter is only applied to inter-chromosomal breakends (translocations).
Filter Methodology
When an SV is called and an assembled contig is available, DRAGEN realigns the contig back to the reference genome. For each breakend of a true SV, there should be high quality alignment of part of the SV contig to the region near the breakend with the alignment orientation consistent with the breakend. If either of the underlying breakends of an SV does not have such an alignment, the LargeContigFilter filter is applied. Regardless of whether or not the record passes the filter, the LCF tag is appended to the INFO field of the SV to indicate that it was processed by the large contig filter.
Filtering Criteria
The large contig filter will process all variants that meet both of the following criteria:
The SV is an inter-chromosomal BND, or the SV is for a somatic sample and the underlying breakends of the SV are at least 1 kbp apart.
The SV's assembled contig length is at least 100 bp.
For variants meeting these criteria, the filter evaluates whether the assembled contig successfully realigns to the regions ±200bp of each breakend with:
Mapping quality (MAPQ) ≥ 40
Alignment identity ≥ 90%
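The eligibility criteria and alignment thresholds above translate naturally into two predicates. This is a sketch of the stated rules only: the `ContigAlignment` type is hypothetical, and reducing "realigns to the regions ±200bp of each breakend" to a single alignment position within 200 bp of the breakend is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class ContigAlignment:
    chrom: str
    pos: int                      # representative reference position
    mapq: int
    identity: float               # fraction in [0, 1]
    orientation_consistent: bool  # agrees with the breakend orientation


def is_eligible(is_interchrom_bnd, is_somatic, breakend_distance, contig_length):
    """Both documented criteria must hold for the filter to process an SV."""
    far_apart_somatic = (is_somatic and breakend_distance is not None
                         and breakend_distance >= 1000)
    return (is_interchrom_bnd or far_apart_somatic) and contig_length >= 100


def breakend_supported(chrom, pos, alignments):
    """True if some contig alignment near the breakend meets the thresholds."""
    return any(
        a.chrom == chrom and abs(a.pos - pos) <= 200
        and a.mapq >= 40 and a.identity >= 0.90 and a.orientation_consistent
        for a in alignments
    )


def passes_large_contig_filter(breakends, alignments):
    """The filter is applied if either breakend lacks a supporting alignment."""
    return all(breakend_supported(c, p, alignments) for c, p in breakends)


if __name__ == "__main__":
    alns = [ContigAlignment("chr1", 10_050, 60, 0.98, True),
            ContigAlignment("chr7", 500_100, 12, 0.99, True)]
    print(passes_large_contig_filter([("chr1", 10_000), ("chr7", 500_000)], alns))
```

In the usage example the second breakend has only a low-MAPQ alignment, so the record would receive the LargeContigFilter filter.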
Input Requirements
When running the SV caller, the input sequencing reads must be from a standard Illumina paired-end sequencing assay with an FR read pair orientation, where for each sequence fragment, a read proceeds from each end of the fragment inwards. For more information, see DRAGEN SV caller Capabilities.
The SV caller is optimized for paired-end libraries where the fragment size is typically larger than the size of both reads. Overlapping read pairs can be used to discover SVs, but might not always be handled optimally. For libraries where the typical fragment size is less than the read length, the SV caller attempts to differentiate reads sequencing into adapter sequence from the variant signal. In such cases, the SV caller's input quality checks may fail and cause SV analysis to be skipped.
If using the standalone mode, your BAM/CRAM inputs must first be mapped. If you have not mapped and aligned your data yet, you can generate an alignment file.
Alignment Contig Checks
If running from a mapped and aligned BAM, then the contigs specified in the header must strictly match those in the DRAGEN hashtable specified in the current analysis. Missing or extra contigs will lead to a "Reference genome mismatch" configuration error and the analysis will not proceed. If such an error is observed, it is recommended to regenerate the alignment file with the intended DRAGEN hashtable, or to run with the DRAGEN map/align module enabled.
Input Quality Checks
The SV caller runs quality checks on the input sequencing reads for each sample to make sure that the input corresponds to a paired read assay with the expected FR orientation, before estimating the fragment size distribution. To check consensus read pair orientation, a subset of high-quality read pairs is sampled. At least 90% of these must have the expected FR orientation for SV analysis to continue. Otherwise, the SV caller issues a warning, skips any further analysis, and writes empty results to its output files.
The SV caller can tolerate non-paired reads in the input, if sufficient paired-end reads exist to estimate the fragment size distribution. To estimate the fragment size distribution, the SV caller requires at least 100 read pairs that meet the quality requirements of the estimation routine. Both reads of a pair must map to the same chromosome with non-zero mapping quality, must not be filtered or part of a split-read mapping, and must not contain indels or soft-clipping. If a sample does not contain a sufficient number of such read pairs, the SV caller issues a warning, skips any further analysis, and writes empty results to its output files.
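The two gating checks described above can be sketched with plain Python. The thresholds (90% FR orientation, 100 usable pairs) come from the text; everything else, including the dictionary keys used to describe a read pair and the function names, is a hypothetical representation chosen for the example.

```python
MIN_FR_FRACTION = 0.90   # from the text: at least 90% of sampled pairs are FR
MIN_USABLE_PAIRS = 100   # from the text: pairs needed to estimate fragment size


def orientation_ok(orientations):
    """orientations: labels such as "FR" or "RF" for sampled high-quality pairs."""
    if not orientations:
        return False
    fr = sum(1 for o in orientations if o == "FR")
    return fr / len(orientations) >= MIN_FR_FRACTION


def pair_is_usable(pair):
    """Apply the per-pair requirements of the fragment size estimation."""
    return (pair["mapq1"] > 0 and pair["mapq2"] > 0
            and pair["same_chrom"]
            and not pair["filtered"]
            and not pair["split"]
            and not pair["has_indel"]
            and not pair["has_softclip"])


def can_estimate_fragment_size(pairs):
    return sum(1 for p in pairs if pair_is_usable(p)) >= MIN_USABLE_PAIRS


if __name__ == "__main__":
    clean = dict(mapq1=60, mapq2=60, same_chrom=True, filtered=False,
                 split=False, has_indel=False, has_softclip=False)
    print(orientation_ok(["FR"] * 95 + ["RF"] * 5),
          can_estimate_fragment_size([clean] * 120))
```

If either check fails, the behaviour described above applies: a warning is issued and empty results are written.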
Read Groups
The SV caller disregards any read group labels applied to the input sequences. Each input sample is treated as a separate library with a single fragment size distribution.
File Format
In standalone mode, input sequencing reads must be mapped and provided as input in either BAM or CRAM format. Each input file must be coordinate sorted and indexed to produce an htslib-style index in a file named to match the input BAM or CRAM file with an additional .bai, .crai, or .csi file name extension. For more information on standalone mode, see Modes of Operation.
At least one BAM or CRAM file must be provided for the normal or tumor sample. A matched tumor-normal sample pair can be provided as well. If multiple input files are provided for the normal sample, each file is treated as a separate sample as part of a joint diploid sample analysis.
In standalone mode, input BAM or CRAM files have the following limitations:
Alignments cannot have an unknown read sequence (SEQ="*").
Alignments cannot contain the "=" character in the SEQ field.
Alignments cannot use the sequence match/mismatch ("="/"X") CIGAR notation. RG (read group) tags in the alignment records are ignored. Each alignment file is treated as representing one sample.
Alignments with base call quality values greater than 70 are rejected. These are not supported on the assumption that this indicates an offset error.
Generate an Alignment File
The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the pipeline.
You need to generate alignment files for all samples that have not already been mapped and aligned. Each sample must have a unique sample identifier. Use the --RGSM option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM option is not required.
The following example command maps and aligns a FASTQ file:
The following example command maps and aligns an existing BAM file:
The following example command maps and aligns an existing CRAM file:
Exome/Targeted Calling
The SV caller can be configured for targeted sequencing inputs, which disables high-depth filters. Exome mode can be directly set to true or false with the command line option --sv-exome. If not directly set, exome mode defaults to false.
Targeted Somatic Panel Calling
To enable targeted calling for somatic panels, set the options --sv-enable-liquid true or --sv-enable-solid true for liquid and solid biopsies, respectively. Additionally, use the option --sv-call-regions-bed ${BEDFILE} to specify the target regions in a BED file. For details on command line options, see DRAGEN recipes for somatic pipelines.
Note: The sv-enable-liquid option only applies to targeted panels of liquid biopsies (e.g. ctDNA), and is different from sv-enable-liquid-tumor-mode, which applies a specialized scoring model for liquid tumors (e.g. leukemia). See Liquid Tumor Calling for details on liquid tumor mode.
Internal Tandem Duplications Calling
You can use the --sv-somatic-ins-tandup-hotspot-regions-bed ${BEDFILE} option to specify ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from <INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed. The file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps). To disable this feature, enter --sv-enable-somatic-ins-tandup-hotspot-regions false.
Liquid Tumor Calling and Tumor-in-normal Contamination
Liquid tumors in liquid tumor mode refer to hematological cancers, such as leukemia. In tumor-normal analysis, DRAGEN accounts for Tumor-in-Normal (TiN) contamination by running liquid tumor mode. TiN contamination is accounted for by allowing a non-zero variant allele frequency for the matched normal when calculating the posterior probability of the somatic state. If the contamination is not accounted for, it can severely impact sensitivity by suppressing true somatic variants.
Note: Liquid tumors are not equivalent to liquid biopsies. For targeted panels used for liquid biopsies such as ctDNA assays, refer to the section Targeted Somatic Panel Calling.
Use the following two options to control liquid tumor mode behavior.
--sv-enable-liquid-tumor-mode --- Enable liquid tumor mode. Liquid tumor mode is disabled by default.
--sv-tin-contam-tolerance --- Set the TiN contamination tolerance level. DRAGEN calls variants in the presence of TiN contamination up to a specified maximum tolerance level. You can enter any value between 0–1. The default maximum TiN contamination tolerance is SV_TIN_CONTAM_TOLERANCE (default values). If using the default value, somatic variants are expected to be observed in the normal sample with allele frequencies up to (SV_TIN_CONTAM_TOLERANCE * 100)% of the corresponding allele in the tumor sample.
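The meaning of the tolerance can be shown with a small arithmetic sketch: the tolerance scales the tumor allele frequency to give the largest allele frequency expected in the contaminated normal. The function name and the numeric values used in the example are illustrative only and are not DRAGEN defaults.

```python
def max_expected_normal_af(tumor_af, tin_tolerance):
    """Upper bound on the normal-sample allele frequency under TiN contamination."""
    if not 0.0 <= tin_tolerance <= 1.0:
        raise ValueError("tolerance must be between 0 and 1")
    return tumor_af * tin_tolerance


if __name__ == "__main__":
    # Illustrative numbers: a variant at 40% in the tumor with a 0.15 tolerance.
    print(max_expected_normal_af(0.40, 0.15))
```

Under these example numbers, a variant seen at up to about 6% allele frequency in the normal sample would still be compatible with a somatic call.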
Command Line Options
The following command line options are supported for the Structural Variant Caller.
Input and Output Options
The following are the top-level options that are shared with the DRAGEN Host Software to control the SV pipeline. You can use BAM and CRAM files as input. Alternatively, if using read mapping with the SV calling in a single run, you can use all of the DRAGEN input options, including FASTQ, BAM, and CRAM files.
--cram-input --- The CRAM file to be processed.
--tumor-cram-input --- If performing tumor-normal or tumor-only analysis, the tumor CRAM file to be processed.
--fastq-file1, --fastq-file2, --fastq-list --- Input FASTQ files or a list of files to be processed.
--tumor-fastq1, --tumor-fastq2, --tumor-fastq-list --- Input tumor FASTQ file or list of files to be processed.
--enable-map-align --- Enables DRAGEN map/align. The default is true, so all input reads are remapped and aligned unless the option is set to false.
--output-directory --- Output directory where all results are stored.
--output-file-prefix --- Output file prefix that will be prepended to all result file names.
--ref-dir --- The DRAGEN reference genome hashtable directory. For more information about the reference genome hashtable, see Prepare a Reference Genome.
--bam-input --- The BAM file to be processed.
--tumor-bam-input --- If performing tumor-normal or tumor-only analysis, the tumor BAM file to be processed.
Structural Variant Caller Pipeline Options
--enable-sv --- Enable or disable the structural variant caller. The default is false.
--sv-target-bed --- Specifies a BED file containing the set of regions to call. Optionally, you can compress the file in gzip or bgzip format. SVs with both breakends within the specified regions will be called.
--sv-locus-node-target-file --- Specifies a BED file containing a set of target regions for locus nodes. Each locus node roughly corresponds to one SV breakend. SVs with at least one breakend within the specified regions will be called. This option makes the SV caller more sensitive than using sv-target-bed.
--sv-exclusion-bed --- Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.
--sv-region --- Limit the analysis to a specified region of the genome for debugging purposes. This option can be specified multiple times to build a list of regions. The value must be in the format "chr:startPos-endPos".
--sv-exome --- Set to true to configure the variant caller for targeted sequencing inputs, which includes disabling high depth filters. The default is false.
--sv-output-contigs --- Set to true to have assembled contig sequences output in a VCF file. The default is true.
--sv-discovery --- Enable SV discovery. The default is true.
--sv-report-small-dup-as-ins --- Set to true to convert small duplications (<1000 bps) as insertions. The default is true.
--sv-use-overlap-pair-evidence --- Allow overlapping read pairs to be considered as evidence. The default is false.
--sv-somatic-ins-tandup-hotspot-regions-bed --- Specify a BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from <INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed.
--sv-enable-somatic-ins-tandup-hotspot-regions --- Enable or disable the ITD hotspot region input. The default is true in somatic variant analysis.
--sv-enable-solid --- Enable SV mode for solid panels. See Targeted Somatic Panel Calling.
--sv-enable-liquid --- Enable SV mode for liquid panels. See Targeted Somatic Panel Calling. This option applies only for liquid biopsies, and is different from sv-enable-liquid-tumor-mode which applies to hematological cancer that accounts for tumor-in-normal contamination.
--sv-enable-liquid-tumor-mode --- Enable liquid tumor mode. See Liquid Tumor Calling.
--sv-tin-contam-tolerance --- Set the Tumor-in-Normal (TiN) contamination tolerance level. See Liquid Tumor Calling for more information.
--sv-systematic-noise --- Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). For more information see Systematic Noise Filtering.
--sv-detect-systematic-noise --- Set to true to generate VCF output per normal sample. For more information see Systematic Noise Filtering.
--sv-build-systematic-noise-vcfs-list --- List of input VCFs from previous step. Enter one VCF per line. For more information see Systematic Noise Filtering.
--sv-min-edge-observations --- Remove all edges from the graph with less than this many observations. The default value is set to SV_MIN_EDGE_OBSERVATIONS (default values).
--sv-min-candidate-spanning-count --- Run SV caller and report all large SVs with at least this many spanning support observations. The default value is set to SV_MIN_CANDIDATE_SPANNING_COUNT (default values).
--sv-min-scored-variant-size --- After candidate identification, only score and report SVs/indels at or above size of SV_MIN_SCORED_VARIANT_SIZE (default values). This parameter doesn't affect the somatic hotspot region.
--sv-hotspot-min-scored-variant-size --- After candidate identification, only score and report SVs/indels at or above this size inside the SV hotspot region, which includes FLT3, ARHGEF7, and KMT2A genes by default. The default value is set to SV_HOTSPOT_MIN_SCORED_VARIANT_SIZE (default values).
--sv-skip-parsing-ga-tag --- By default SV caller will make use of the graph alignment tag (ga:Z) to improve SV calling sensitivity whenever a ga tag is present in the alignment record. This option provides a way to disable ga tag related functionalities in SV calling. The default value is set to false.
--sv-enable-methylation --- Enable methylation-aware SV calling mode (Default=false).
--sv-ml-model --- SV ML trained model location.
--sv-ml-metafile --- Meta file for SV ML with versioned information.
--sv-enable-ml --- If true, use SV ML filtering. (Default=true).
--sv-ml-enable-logging --- If true, enable SV ML debugging mode (Default=false).
--sv-ml-enable-feature-extraction --- Enable feature extraction for training the SV ML model (Default=false).
--sv-ml-min-pass-del-prob --- Minimum pass probability in SV ML for deletions. The default is (default values).
--sv-ml-min-pass-ins-prob --- Minimum pass probability in SV ML for insertions. The default is (default values).
--sv-ml-max-del-svlen --- Maximum deletion size that SV ML model can be applied to (Default=DoubleMax).
--sv-ml-key --- Key used for ML decryption.
--sv-skip-artifact-early-exit --- Override early exit upon the detection of excessive artefact sequences to continue SV calling (Default=false).
--sv-enable-large-contig-filter --- Enable the large contig filter (Default=true).
Modes of Operation
Structural Variant calling can run in the following modes:
Standalone --- Uses mapped BAM/CRAM input files. If you have not mapped and aligned your data yet, see Input Requirements. This mode requires the following options:
--enable-map-align false
--enable-sv true
Integrated --- Automatically runs on the output of the DRAGEN mapper/aligner. This mode requires the following options:
--enable-map-align true
--enable-sv true
--enable-map-align-output true
--output-format bam
You can also enable Structural Variant calling with any other caller.
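As a rough sketch of the two modes above, the following Python helper assembles only the mode-specific options listed in this section. The function name sv_mode_options is hypothetical, and a real command line also needs reference, input, and output options that are not shown here.

```python
def sv_mode_options(mode):
    """Return the mode-specific options listed in this section.

    Only flags named in the text above are used; reference, input, and
    output options are deliberately left out of this sketch.
    """
    if mode == "standalone":
        # Mapped BAM/CRAM input: skip map/align and run the SV caller.
        return ["--enable-map-align", "false", "--enable-sv", "true"]
    if mode == "integrated":
        # SV calling runs on the output of the DRAGEN mapper/aligner.
        return [
            "--enable-map-align", "true",
            "--enable-sv", "true",
            "--enable-map-align-output", "true",
            "--output-format", "bam",
        ]
    raise ValueError(f"unknown mode: {mode!r}")


print(" ".join(sv_mode_options("integrated")))
```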
The following is an example command line for Integrated mode:
The following is an example command line for joint diploid calling in standalone mode:
Structural Variant VCF Output
The structural variants VCF output file is available in the output directory. The file is named <output-file-prefix>.sv.vcf.gz. The contents of the file depend on the type of analysis.
For each major analysis category (germline, tumor-normal, and tumor-only), the VCF output reflects the variant calls made under the variant calling mode that corresponds to that analysis type.
VCF Output
VCF output follows the VCF 4.2 specification for describing structural variants. It uses standard field names wherever possible. All custom fields are described in the VCF header. The following sections provide information on the variant representation details and the primary VCF field values.
VCF Sample Names
Sample names in the VCF output are taken from the first read group (@RG) record found in the header of each input alignment file. If no sample name is found, a default label (SAMPLE1, SAMPLE2, etc.) is used instead.
Small Indel Classification and Representation
A variant is classified as a small indel if all of the following criteria are met:
The variant can be entirely expressed as a combination of inserted and deleted sequences.
The deletion or insertion length is less than 1000 bases.
The variant breakends and/or the inserted sequence are not imprecise.
The variant has not been converted from a deletion to intra-chromosomal breakends by the depth-based SV classification routine.
All small indels are reported using full sequences in the VCF REF and ALT allele fields. Additionally, their VCF records include the CIGAR INFO tag describing the combined insertion and deletion event.
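The four criteria above can be read as a single predicate. The following Python sketch is illustrative only: the Candidate fields are hypothetical stand-ins for caller-internal state, and it assumes that both the deleted and the inserted length must be under 1000 bases.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # All field names are hypothetical; they mirror the criteria in the text.
    pure_indel: bool              # expressible as inserted plus deleted sequence only
    del_len: int                  # number of deleted bases
    ins_len: int                  # number of inserted bases
    imprecise: bool               # breakends and/or inserted sequence are imprecise
    converted_to_breakends: bool  # changed by the depth-based classification


def is_small_indel(c, max_len=1000):
    """Apply the four small-indel criteria listed above."""
    return (
        c.pure_indel
        and c.del_len < max_len
        and c.ins_len < max_len
        and not c.imprecise
        and not c.converted_to_breakends
    )


# A precise 57-base deletion with a 1-base insertion qualifies.
print(is_small_indel(Candidate(True, 57, 1, False, False)))  # True
```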
Large Variant Representation
In somatic mode, variants that do not meet the "small indel" criteria described above are reported using symbolic alleles.
An example of a 1505-base deletion reported using symbolic allele:
In germline mode, all variants are reported in the VCF using full variant sequences in the REF and ALT fields unless one of the following size thresholds is met:
The variant is an insertion or deletion and its length is larger than or equal to 1000000 (1 million) bases.
The variant is a duplication and its length is larger than or equal to 1000 bases.
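A minimal Python sketch of the germline thresholds above follows. The function name is hypothetical, and the sketch only covers the insertion, deletion, and duplication rules stated in this section.

```python
def germline_reports_full_sequence(svtype, length):
    """Return True if full REF/ALT sequences are expected in germline mode."""
    if svtype in ("INS", "DEL"):
        # Full sequences are used below 1,000,000 bases.
        return length < 1_000_000
    if svtype == "DUP":
        # Full sequences are used below 1000 bases.
        return length < 1000
    # Other record types are not covered by the two rules above.
    raise ValueError(f"no size rule for SV type {svtype!r}")


print(germline_reports_full_sequence("DEL", 1505))  # True
print(germline_reports_full_sequence("DUP", 1000))  # False
```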
Insertions with Incomplete Insert Sequence Assembly
Large insertions are reported in some cases even when the insert sequence cannot be fully assembled. These can be identified by the INCOMPLETEINS INFO field. In addition, the incomplete insertion records will have INFO fields LEFT_SVINSSEQ and/or RIGHT_SVINSSEQ that describe the assembled left and right ends of the insert sequence. If the record was rescued due to a match to an MEI sequence, INTEGRATION_TYPE=MEI will be added to the INFO field as well.
In germline mode, the inserted sequence of an incomplete insertion is represented in the ALT field as a concatenation of LEFT_SVINSSEQ, 5 "N"s, and RIGHT_SVINSSEQ. In the CONTIG field, the left and right sequences are concatenated with 100 "N"s in between. The following is an example of such a record from HG002 mapped to hg38:
In somatic mode, incomplete insertions are reported with <INS> in ALT field, as shown in the example below:
Single Breakend Format for Incomplete Insertions
In addition to the <INS> record, DRAGEN can also output single-breakend formatted records (described in VCF v4.2 Section 5.4.9) corresponding to the partially-assembled sides of the incomplete insertion record. This behavior is enabled by default if viral integration detection is enabled (i.e. --enable-sv=true and --enable-oncovirus-detection=true). Otherwise, it can be enabled with --sv-report-incomplete-ins-as-bnd=true. These single-breakend records will have the Duplicate filter applied unless they represent a viral integration site.
If the incomplete insertion sequence aligns to an MEI sequence (by default, the sequences in <INSTALL_PATH>/config/sv_mobile_element_sequences.fa), details about this match will be added to the INFO field:
INTEGRATION_RNAME
The reporting name of the sequence the incomplete insertion matched to (e.g. AluY).
INTEGRATION_ALT
The mated breakpoint ALT notation for an incomplete insertion matching an external sequence (e.g. [DF0000002.4:100[C).
INTEGRATION_CIGAR
The CIGAR of the alignment between the inserted sequence and the external sequence.
For example, the following set of records corresponds to a single MEI event. Note that if --sv-report-incomplete-ins-as-bnd=true is not provided and viral integration detection is not enabled, only the second record (the <INS> record) will be output.
Viral Integration Site Detection
When oncovirus detection is enabled (--enable-oncovirus-detection=true), the SV caller can identify sites where oncoviral sequences have integrated into the human genome. This is done by aligning the partially assembled insertion sequences for incomplete insertion records to the oncoviral reference sequences identified by the oncovirus component of DRAGEN. Partially assembled insertion sequences with high scoring alignments are then reported as integration events in the SV VCF output.
To enable detection of integration sites, the SV caller must be enabled with --enable-sv=true, oncovirus detection enabled with --enable-oncovirus-detection=true, and the database specified with --oncovirus-detection-db.
An example command with viral integration enabled is given below:
Viral integration events are reported as single breakends by default and will have INTEGRATION_TYPE=VIRAL in the INFO field, and like single breakend MEI integration records, details about the viral integration will be added to the INFO field:
INTEGRATION_RNAME
The reporting name of the sequence the incomplete insertion matched to (e.g. HBV_Occult_HK514).
INTEGRATION_ALT
The mated breakpoint ALT notation for an incomplete insertion matching an external sequence (e.g. [KJ410519.4:100[C).
INTEGRATION_CIGAR
The CIGAR of the alignment between the inserted sequence and the external sequence.
Unlike single breakend MEI records, single breakend viral integration records are not Duplicate filtered and do not have a corresponding INS record. For example, the three viral integration records below were output by DRAGEN. The first is a single, isolated site, while the latter two are in close proximity with orientations compatible with a single large insertion. In both cases, only the single breakend records are output.
Normalizing Small Tandem Duplications
The SV caller can also represent tandem duplications as insertions. This representation creates ambiguity in how the variants are presented in the VCF output, especially for small tandem duplications. The representation can lead to complications, such as unrecognized call duplication.
To better normalize the SV caller output, so that the same variant type is not represented in two different VCF formats, small tandem duplications (< 1000 bases) are converted to insertions in the VCF output. Insertions converted from such tandem duplications have a formatting similar to incomplete insertions that use the symbolic allele <INS> for the ALT field in somatic mode, and report the full sequence in ALT field in germline mode. The following example shows an insertion, which was converted from a tandem duplication during this normalization process.
An example of an insertion converted from duplication in somatic mode:
Converted insertions include copies of certain output fields. The fields appear the same as in a tandem duplication record. For example, INFO/DUPSVINSSEQ provides a copy of the breakpoint insertion value computed for the duplication. In the context of a duplication, the breakpoint insertion value would normally be written to INFO/SVINSSEQ. The example above also shows a converted insertion with a breakpoint insertion value (DUPSVINSSEQ=TG).
To prevent small tandem duplications from being reported as insertions, use the option --sv-report-small-dup-as-ins false. For more information about copied INFO fields, see VCF INFO Fields. All INFO fields use the same DUP prefix.
Multi-record Deduplication
In addition to <INS> and <DUP> deduplication, the SV caller applies the Duplicate filter to large structural variants that are also represented in the VCF in a more precise notation. For example, if an A-B-C rearrangement exists in a sample, the SV caller calls A-B and B-C as expected, but if the B segment is short, it may also call an IMPRECISE A-C SV or an A-C SV with an SVINSSEQ of the B sequence. This deduplication step identifies these A-C SVs and applies the Duplicate filter to them.
Candidates for deduplication marking are IMPRECISE SVs and SVs with a non-empty SVINSSEQ. Candidates are considered duplicates when a path traversal through one or more SVs can be found such that:
The length of the traversed sequence is within 10bp of the expected length.
For IMPRECISE variants, this is determined by CIPOS and CIEND; for precise variants, the expected length is the SVINSSEQ length.
The start or end of at least one SV is at least 1kbp away from both the candidate start and end.
All traversed segments are at least 50bp in length.
The start and end SVs are within 5bp of the range of acceptable candidate start and end positions respectively.
The traversed sequence is at most 600bp.
The path traverses at most 3 SVs.
When a candidate PASSes filtering, only PASS SVs are considered for deduplication matching. If a candidate has any FILTER applied, all SVs in the VCF are considered.
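The path criteria above can be summarized as one check. The following Python sketch is a simplification under stated assumptions: the path search itself is not shown, every parameter name is hypothetical, and "within" and "at most" are read as inclusive bounds.

```python
def path_matches_candidate(segment_lengths, n_svs, expected_length,
                           start_offset, end_offset, has_distant_breakend):
    """Check one traversed path against the duplicate criteria listed above.

    segment_lengths      -- lengths (bp) of the segments traversed between SVs
    n_svs                -- number of SVs on the path
    expected_length      -- from CIPOS/CIEND (IMPRECISE) or len(SVINSSEQ)
    start_offset         -- distance (bp) from the acceptable candidate start range
    end_offset           -- distance (bp) from the acceptable candidate end range
    has_distant_breakend -- True if some SV start or end lies at least 1 kbp
                            from both the candidate start and end
    """
    traversed = sum(segment_lengths)
    return (
        abs(traversed - expected_length) <= 10
        and has_distant_breakend
        and all(length >= 50 for length in segment_lengths)
        and start_offset <= 5
        and end_offset <= 5
        and traversed <= 600
        and n_svs <= 3
    )


# A-B and B-C calls with a 120 bp B segment versus an A-C candidate
# whose SVINSSEQ is 118 bases long (made-up numbers).
print(path_matches_candidate([120], 2, 118, 2, 0, True))  # True
```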
Inversions
Inversions are usually reported as a set of breakends. For example, given a simple reciprocal inversion, four breakends are reported, sharing the same INFO/EVENT tag. The following are example breakend records representing a simple reciprocal inversion:
For a micro inversion that is entirely contained within the alignment and represented by adjacent 'I' and 'D' CIGAR operations of comparable lengths, the SV caller can also output a single VCF record in a format similar to a small INS/DEL, for example:
Depth-Based SV Type Classification
In the germline calling model, when SV candidates are discovered from the sample data and have sufficient paired and split read evidence to be reported in the output, the SV caller applies additional depth-based tests to more accurately classify certain SV candidate types. Candidate breakpoints that are consistent with a deletion are tested for the lower read depth that is expected inside the deleted region. Candidate breakpoints consistent with a tandem duplication are tested for the higher read depth expected in the duplicated region. Candidate SV calls that fail the depth-based tests are still reported in the output, but changed to intra-chromosomal breakends. Candidate SV calls that pass continue to be reported in the standard deletion and tandem duplication output formats.
SV Breakpoint Insertions
SVs frequently include a small sequence insertion at the breakpoint. Breakpoint insertions are represented differently depending on the SV type. The INFO/SVINSSEQ field in the VCF output provides the most general description of breakpoint insertions by describing the insertion sequence itself. The corresponding INFO/SVINSLEN field describes the length of the insertion sequence. For example, the following VCF record describes a large (~8.8 kb) deletion, which includes a single base insertion (C) between the left and right deletion breakends.
The INFO/SVINSSEQ field is also used to describe breakpoint insertions for tandem duplication and breakend records. The field can also be used to describe the insertion sequence of a large SV insertion.
Breakpoint insertions are represented differently if the variant is classified as a small indel. Any breakpoint insertion that happens in a small deletion is represented in the CIGAR string. See Small Indel Classification and Representation for the conditions under which this format is used.
In the following small indel example, the VCF record describes a 57 base deletion that includes a single base insertion (A) between the left and right deletion breakends:
Breakend (BND) records include an additional encoding of breakpoint insertion sequence, as described in the VCF specification for the breakend ALT field. The SV caller also provides the information to the INFO/SVINSSEQ field for consistency with other SV record types.
The following example shows a breakend connecting a region of chromosomes 1 and 12 in the sample with a breakend insertion sequence of CA between the two breakends. The insertion sequence is described in both the ALT and INFO/SVINSSEQ fields.
SV Breakpoint Insertion Orientation
The breakpoint insertion sequence is always provided with respect to the strand of the current SV record. Some breakend records have inverted orientation. For inverted orientations, the pair of breakend records contains an insertion sequence that is reverse complemented compared to the mated record.
The following breakend pair example demonstrates an inverted orientation.
SV Breakpoint Homology
Each VCF record output by the SV caller is shifted to the left-most position of the exact homology range of the breakpoint. The exact homology range of the breakpoint is the continuous range of positions over which the SV could be represented while still describing the same SV haplotype. The exact homology range is described in the VCF output with the INFO/HOMSEQ field, which describes the sequence of the exact homology range and the corresponding INFO/HOMLEN field, which describes the length of the range.
The following example shows a 62 base deletion with an 11 base breakend homology region. Without left-shifting, the SV has an equivalent representation anywhere from position 39497639 to 39497650.
The following simplified examples illustrate exact breakend homology for a three-base deletion and a three-base insertion. In both cases, the variant is left-shifted, so that the corresponding VCF record position is 2.
Deletion
Reference: GTCAGCGA
Variant: GT---CGA
Insertion
Reference: GT---CAG
Variant: GTCGGCAG
In both the insertion and deletion, there is a single base of exact breakend homology C, so that the same variant can be represented one base to the right.
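The left-shifting and homology range for the deletion example above can be reproduced with a short Python sketch. This is not DRAGEN's implementation; the function name and its 0-based input convention are assumptions made for illustration.

```python
def deletion_homology(ref, start, length):
    """Left-shift a deletion and report its exact breakend homology.

    ref    -- reference sequence
    start  -- 0-based index of the first deleted base
    length -- number of deleted bases
    Returns (vcf_pos, homlen, homseq), where vcf_pos is the 1-based position
    of the base preceding the left-most equivalent deletion.
    """
    left = start
    # Shift left while the base entering the deleted window equals the base
    # leaving it; keep at least one anchor base before the deletion.
    while left > 1 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    # Count how far the left-most form can slide right without changing
    # the resulting haplotype.
    homlen = 0
    while (left + length + homlen < len(ref)
           and ref[left + homlen] == ref[left + length + homlen]):
        homlen += 1
    return left, homlen, ref[left:left + homlen]


# Reference GTCAGCGA with CAG deleted (variant GTCGA).
print(deletion_homology("GTCAGCGA", 2, 3))  # (2, 1, 'C')
```

Passing the right-shifted form of the same deletion (start index 3, deleting AGC) returns the same result, which is the point of reporting the left-most position together with HOMLEN and HOMSEQ.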
VCF INFO Fields
IMPRECISE
Flag indicating that the structural variation is imprecise, i.e., the exact breakpoint location is not found
SVTYPE
Type of structural variant
SVLEN
Difference in length between REF and ALT alleles
END
End position of the variant described in this record
CIPOS
Confidence interval around POS
CIEND
Confidence interval around END
CIGAR
CIGAR alignment for each alternate indel allele
MATEID
ID of mate breakend
EVENT
ID of event associated to breakend
HOMLEN
Length of base pair identical homology at event breakpoints
HOMSEQ
Sequence of base pair identical homology at event breakpoints
SVINSLEN
Length of insertion
SVINSSEQ
Sequence of insertion
INCOMPLETEINS
Variant corresponds to an incompletely assembled insertion sequence
LEFT_SVINSSEQ
Known left side of insertion for an insertion of unknown length
RIGHT_SVINSSEQ
Known right side of insertion for an insertion of unknown length
PAIR_COUNT
Read pairs supporting this variant where both reads are confidently mapped
BND_PAIR_COUNT
Confidently mapped reads supporting this variant at this breakend (mapping may not be confident at remote breakend)
UPSTREAM_PAIR_COUNT
Confidently mapped reads supporting this variant at the upstream breakend (mapping may not be confident at downstream breakend)
DOWNSTREAM_PAIR_COUNT
Confidently mapped reads supporting this variant at the downstream breakend (mapping may not be confident at upstream breakend)
BND_DEPTH
Read depth at local translocation breakend
MATE_BND_DEPTH
Read depth at remote translocation mate breakend
JUNCTION_QUAL
If the SV junction is part of an EVENT (i.e., a multi-adjacency variant), this field provides the QUAL value for the adjacency in question only
SOMATIC
Flag indicating a somatic variant
SOMATICSCORE
Somatic variant quality score
SOMATIC_EVENT
If the probability of the SV being a germline variant is greater than the probability of the SV being a somatic variant, this is 0. Otherwise, this is 1.
JUNCTION_SOMATICSCORE
If the SV junction is part of an EVENT (i.e., a multi-adjacency variant), this field provides the SOMATICSCORE value for the adjacency in question only
CONTIG
Assembled contig sequence, if the variant is not imprecise (with --sv-output-contigs)
DUPSVLEN
Length of duplicated reference sequence
DUPHOMLEN
Length of base pair identical homology at event breakpoints excluding duplicated reference sequence
DUPHOMSEQ
Sequence of base pair identical homology at event breakpoints excluding duplicated reference sequence
DUPSVINSLEN
Length of inserted sequence after duplicated reference sequence
DUPSVINSSEQ
Inserted sequence after duplicated reference sequence
LCF
Flag indicating that the large contig filter processed this record
INTEGRATION_TYPE
Type of integration event ("VIRAL" for oncovirus integration, "MEI" for mobile element insertion)
INTEGRATION_RNAME
The reporting name of the sequence the incomplete insertion matched to
INTEGRATION_ALT
The mated breakpoint ALT notation for an incomplete insertion matching an external sequence
INTEGRATION_CIGAR
The CIGAR of the alignment between the inserted sequence and the external sequence
The meanings of the IMPRECISE, SVTYPE, SVLEN, END, CIPOS, CIEND, MATEID, EVENT, HOMLEN, and HOMSEQ fields match their VCF v4.2 definitions.
VCF FORMAT Fields
GT
Genotype
FT
Sample filter, 'PASS' indicates that all filters have passed for this sample
GQ
Genotype Quality
PL
Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification
PR
Number of spanning read pairs which strongly support the REF or ALT alleles
SR
Number of split-reads which strongly support the REF or ALT alleles
VF
Number of fragments which strongly support the REF or ALT alleles at any position
VF1
Number of fragments which strongly support the REF or ALT allele at the first breakend
VF2
Number of fragments which strongly support the REF or ALT allele at the second breakend
VAF1
Variant allele fraction for the first breakend, calculated as VF1_ALT/(VF1_ALT+VF1_REF)
VAF2
Variant allele fraction for the second breakend, calculated as VF2_ALT/(VF2_ALT+VF2_REF)
PSL
Phase set list
The meanings of the GT and GQ fields match their VCF v4.2 definitions. The meaning of the PSL field matches its VCF v4.4 definition.
SV Variant Allele Fraction (VAF)
Some evidential sequence fragments (or read pairs) can provide both PR and SR support. To avoid double counting, the VF field reports the number of evidential sequence fragments (or read pairs) that strongly support the REF or ALT alleles, in the listed order. In this context, "strongly support" means that a sequence fragment can be easily distinguished and assigned to one of the alleles. For example, read alignments that fit both the REF and ALT allele sequences in a repeat region are discarded (similar to the definition of UninformativeReads in the small variant caller).
Unlike SNV callers where the Variant Allele Fraction (VAF) can be calculated as VF_ALT/(VF_ALT+VF_REF), multiple VAFs can be calculated for SVs. VF, VF1, and VF2 refer to the number of strongly supporting fragments for the entire allele, the first breakend, and the second breakend respectively.
For <INS> variants, ALT VF counts all fragments strongly supporting the ALT allele anywhere in the insertion. ALT VF1 counts the fragments strongly supporting the start of the insertion (that is, the read and/or read pair supports the ALT allele at a position that overlaps the start of the insertion), and ALT VF2 counts the support at the end of the insertion.
For <DEL> variants, ALT VF, VF1, and VF2 will be the same, but REF VF1 and VF2 refer to the reference support at the start and end of the deletion, respectively. These are not necessarily the same. For example, if there are two overlapping simple heterozygous deletions, then the outer SV VAFs will be 0.5, but the inner SV VAFs for these deletions will be 1.
In general, SVs with non-zero SVINSSEQ will have different ALT VF1 and VF2, and SVs of non-zero reference size will have different REF VF1 and VF2.
For <INS>, <DEL>, and <DUP> variants, VF1/VAF1 refer to the VAF at the start of the SV, and VF2/VAF2 refers to the VAF at the end of the SV. For BND variants in VCF breakend notation, VAF1 refers to the local breakend VAF, and VAF2 the remote breakend VAF. For single breakend variants, only VF1 and VAF1 are defined.
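The per-breakend fractions follow directly from the VAF1 and VAF2 formulas given in the FORMAT field list. The Python sketch below uses a hypothetical helper name and made-up counts; each VF value is assumed to be a (REF, ALT) pair in the listed order.

```python
def breakend_vafs(vf1, vf2=None):
    """Compute (VAF1, VAF2) from (REF, ALT) fragment counts.

    vf2 may be None for single breakend variants, where only VF1 and
    VAF1 are defined.
    """
    def vaf(counts):
        if counts is None:
            return None
        ref, alt = counts
        total = ref + alt
        # With no informative fragments the fraction is undefined.
        return alt / total if total else None

    return vaf(vf1), vaf(vf2)


# Made-up counts: 10 REF and 10 ALT fragments at the first breakend,
# 0 REF and 10 ALT fragments at the second breakend.
print(breakend_vafs((10, 10), (0, 10)))  # (0.5, 1.0)
```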
Physical Phasing
Many somatic samples contain cis-phased structural variants in extreme proximity to one another. Such variants can be phased if the distance between them is less than or comparable to the library fragment size (or read length for single-ended sequencing). The SV caller performs physical phasing of structural variants by identifying reads or read pairs that unambiguously support the ALT alleles of two nearby structural variants. Physically phased SVs are aggregated into phase sets based on the transitive closure of the pairwise cis-phased SVs. Since the phase set aggregation can result in phase switch errors when a trio of 0|1, 1|1 and 0|1 variants are phased, physical phasing is limited to somatic variants larger than 1kbp.
The value of the PSL field for each phase set is the ID of the first record belonging to the phase set in the VCF. Note that because the VCF PS field does not support local copy number changes or inter-chromosomal phasing, the PSL field introduced in VCF v4.4 is used.
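The aggregation by transitive closure can be illustrated with a small union-find sketch in Python. This is a simplified illustration rather than the caller's implementation: record IDs and the helper name are hypothetical, and each record receives a single label instead of a full PSL list.

```python
from collections import Counter


def phase_set_labels(record_ids, phased_pairs):
    """Group pairwise cis-phased SVs into phase sets.

    record_ids   -- record IDs in VCF order
    phased_pairs -- iterable of (id_a, id_b) pairs found to be in cis
    Returns {record_id: label}, where the label is the ID of the first
    record of the phase set in VCF order. Unphased records are omitted.
    """
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]  # path halving
            rid = parent[rid]
        return rid

    # Merging every phased pair yields the transitive closure.
    for a, b in phased_pairs:
        parent[find(a)] = find(b)

    first_in_set, labels = {}, {}
    for rid in record_ids:  # VCF order, so the first member names the set
        root = find(rid)
        first_in_set.setdefault(root, rid)
        labels[rid] = first_in_set[root]

    sizes = Counter(labels.values())
    return {rid: label for rid, label in labels.items() if sizes[label] > 1}


ids = ["sv1", "sv2", "sv3", "sv4"]
print(phase_set_labels(ids, [("sv2", "sv3"), ("sv3", "sv4")]))
# {'sv2': 'sv2', 'sv3': 'sv2', 'sv4': 'sv2'}
```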
Heteroplasmy Calculation for Mitochondrial Variants
For mitochondrial variants, the VAF1 and VAF2 fields can be used to calculate the heteroplasmy level of a mitochondrial variant: Heteroplasmy = (VAF1 + VAF2) / 2.
An example of a mitochondrial variant with VAF values is shown below:
Then the estimated heteroplasmy level of this variant can be calculated as (0.326697 + 0.324873) / 2 = 0.325785.
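The same arithmetic as a one-line Python helper (the function name is hypothetical), using the two VAF values quoted above:

```python
def heteroplasmy(vaf1, vaf2):
    """Average the two breakend VAFs, as in the formula above."""
    return (vaf1 + vaf2) / 2


print(round(heteroplasmy(0.326697, 0.324873), 6))  # 0.325785
```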
VCF FILTER Fields
The following table lists the VCF FILTER fields applied to all VCF output.
Duplicate
Record
Variant is present in the VCF using a different notation.
Germline
The following table lists the VCF FILTER fields applied to all germline VCF output.
MinQUAL
Record
QUAL score is less than a threshold.
Ploidy
Record
For DEL and DUP variants, the genotypes of overlapping variants with similar size are inconsistent with diploid expectation.
MaxDepth
Record
Depth is greater than 3x the median chromosome depth near one or both variant breakends. Split read (SR) support for MaxDepth filtered variants is subsampled.
MaxMQ0Frac
Record
For a small variant (<1000 bases), the fraction of reads in all samples with MAPQ0 around either breakend exceeds 0.4.
LowSupport
Record
For variants significantly larger than the paired read fragment size, low paired reads support the alternate allele in any sample.
LargeContigFilter
Record
Assembled contig at SV locus does not successfully realign to the regions +- 200bp of both breakends with mapq >= 40 and identity >= 90%.
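To make the MaxDepth and MaxMQ0Frac conditions concrete, the Python sketch below applies the two thresholds from the table above. How depth and MAPQ0 fraction are measured near a breakend is not specified here, so the inputs are hypothetical pre-computed values.

```python
def depth_and_mq0_filters(breakend_depths, median_chrom_depth,
                          breakend_mq0_fracs, variant_size,
                          max_mq0_frac=0.4):
    """Return the MaxDepth/MaxMQ0Frac labels that apply to one record."""
    labels = []
    # MaxDepth: depth near one or both breakends above 3x the chromosome median.
    if any(depth > 3 * median_chrom_depth for depth in breakend_depths):
        labels.append("MaxDepth")
    # MaxMQ0Frac: small variants only, MAPQ0 fraction around either breakend.
    if variant_size < 1000 and any(f > max_mq0_frac for f in breakend_mq0_fracs):
        labels.append("MaxMQ0Frac")
    return labels


print(depth_and_mq0_filters([95, 30], 30, [0.1, 0.5], 200))
# ['MaxDepth', 'MaxMQ0Frac']
```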
Germline Multi-sample
The following table lists the VCF FILTER fields unique to germline multi-sample VCF output.
SampleFT
Record
No sample passes all the sample-level filters.
Tumor-Normal Somatic
The following table lists the VCF FILTER fields applied to tumor-normal somatic VCF output.
MinSomaticScore
Record
SOMATICSCORE is less than a threshold.
MaxDepth
Record
Normal sample site depth is greater than 3x the median chromosome depth near one or both variant breakends. Split read (SR) support for MaxDepth filtered variants is subsampled.
MaxMQ0Frac
Record
For a small variant (< 1000 bases) in the normal sample, the fraction of reads with MAPQ0 around either breakend exceeds 0.4.
SystematicNoise
Record
Variant overlaps with one of the paired regions in the systematic noise BEDPE file with matched orientation.
LargeContigFilter
Record
Assembled contig at SV locus does not successfully realign to the regions +- 200bp of both breakends with mapq >= 40 and identity >= 90%.
Tumor-Only
The following table lists the VCF FILTER fields applied to tumor-only VCF output.
MinSomaticScore
Record
SOMATICSCORE is less than a threshold.
SystematicNoise
Record
Variant overlaps with one of the paired regions in the systematic noise BEDPE file with matched orientation.
MaxDepth
Record
Tumor sample site depth is greater than 3x the median chromosome depth near one or both variant breakends. Split read (SR) support for MaxDepth filtered variants is subsampled.
MaxMQ0Frac
Record
For a small variant (<1000 bases), the fraction of reads with MAPQ0 around either breakend exceeds 0.4.
LargeContigFilter
Record
Assembled contig at SV locus does not successfully realign to the regions +- 200bp of both breakends with mapq >= 40 and identity >= 90%.
Note that while the MaxDepth VCF FILTER header is always present, tumor-only MaxDepth filtering is not enabled unless --sv-apply-somatic-max-depth true is provided on the command-line.
Interpretation of VCF Filters
There are two levels of VCF filters: record level (FILTER) and sample level (FORMAT/FT). Most record-level filters are independent of those at the sample-level. However, in a germline analysis, if none of the samples pass all sample-level filters, the SampleFT record-level filter is applied.
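The relationship between the two filter levels in a germline analysis can be sketched as follows. The function and its inputs are hypothetical; the sketch only encodes the SampleFT rule stated above.

```python
def germline_record_filters(record_level, sample_ft):
    """Combine record-level labels with the SampleFT rule.

    record_level -- list of record-level filter labels already assigned
    sample_ft    -- dict mapping sample name to its FORMAT/FT value
    """
    filters = list(record_level)
    # SampleFT applies only when no sample passes all sample-level filters.
    if sample_ft and all(ft != "PASS" for ft in sample_ft.values()):
        filters.append("SampleFT")
    return filters or ["PASS"]


print(germline_record_filters([], {"S1": "PASS", "S2": "MinGQ"}))  # ['PASS']
```

The sample-level label MinGQ in the usage line is a placeholder, not a label taken from this section.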
Interpretation of INFO/EVENT Field
Some structural variants reported in the VCF, such as translocations, represent a single novel sequence junction in the sample. The INFO/EVENT field indicates that two or more such junctions are hypothesized to occur together as part of a single variant event. All individual variant records belonging to the same event share the same INFO/EVENT string. Note that although such an inference could be applied after SV calling by analyzing the relative distance and orientation of the called variant breakpoints, the SV caller incorporates this event mechanism into the calling process to increase sensitivity towards such larger-scale events. Given that at least one junction in the event has already passed standard variant candidacy thresholds, sensitivity is improved by lowering the evidence thresholds for additional junctions which occur in a pattern consistent with a multijunction event (such as a reciprocal translocation pair).
Although this mechanism could generalize to events including an arbitrary number of junctions, it is currently limited to two. Thus, at present it is most useful for identifying and improving sensitivity towards reciprocal translocation pairs.
VCF ID Field
The VCF ID, or identifier, field can be used for annotation, or in the case of BND (breakend) records for translocations, the ID value is used to link breakend mates or partners. The following is an example of a VCF ID field from the SV caller:
The value provided in the ID field reflects the SV association graph edge(s) from which the SV or indel was discovered. The value is guaranteed to be unique within any single VCF output file produced by the SV caller. These ID values are therefore used to link associated breakend records using the standard VCF MATEID key.
Use only the entire ID value as a unique key, because parsing the key could lead to incompatibility with different DRAGEN versions. The integer values within the ID value are internal indices of objects within SV pipeline stages; their exact structure may change and they are intended for debugging purposes only. It is therefore recommended to associate BNDs only through INFO/MATEID (or INFO/EVENT for multi-junction events).
See the DRAGEN Software Support Site for information on the latest version of DRAGEN.
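Following the recommendation above, breakend mates can be linked by treating the ID as an opaque key and reading INFO/MATEID. The Python sketch below assumes tab-separated VCF lines; the two records and their IDs are made up for illustration and do not reflect the real ID layout.

```python
def link_breakend_mates(vcf_lines):
    """Map each record ID to its mate ID using INFO/MATEID only."""
    mates = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        record_id, info = fields[2], fields[7]
        for entry in info.split(";"):
            if entry.startswith("MATEID="):
                # The whole ID string is the key; it is never parsed.
                mates[record_id] = entry[len("MATEID="):]
    return mates


# Made-up records with placeholder IDs.
records = [
    "\t".join(["chr1", "100", "bnd_A", "A", "A[chr12:200[", ".", "PASS",
               "SVTYPE=BND;MATEID=bnd_B"]),
    "\t".join(["chr12", "200", "bnd_B", "T", "]chr1:100]T", ".", "PASS",
               "SVTYPE=BND;MATEID=bnd_A"]),
]
print(link_breakend_mates(records))  # {'bnd_A': 'bnd_B', 'bnd_B': 'bnd_A'}
```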
Convert SV VCF to BEDPE Format
It can sometimes be convenient to express structural variants in BEDPE format. For such applications, DRAGEN recommends the script vcfToBedpe available on GitHub. The repository is forked from @hall-lab with modifications to support VCF 4.2 SV format.
BEDPE format greatly reduces structural variant information compared to the SV caller VCF output. In particular, breakend orientation, breakend homology, and insertion sequence are lost, in addition to the ability to define fields for locus and sample specific information. For this reason, Illumina only recommends BEDPE as a temporary output for applications that require it.
Output Filtering Options
In addition to filtering labels, the SV caller provides options to control filtering of output variants. The following table lists the options that control filtering behavior.
--sv-enable-high-precision-filters: Enable high-precision germline filters based on split and spanning read counts (default: false).
--sv-enable-somatic-high-precision-filters: Enable high-precision somatic (tumor-only or tumor-normal) filters based on split and spanning read counts (default: false).
--sv-min-required-unique-read-count: Minimum total unique reads supporting a variant when high-precision filters are enabled (see Default Values).
--sv-min-required-spanning-read-count: Minimum spanning reads supporting a variant when high-precision filters are enabled (see Default Values).
--sv-min-required-split-read-count: Minimum split reads supporting a variant when high-precision filters are enabled (see Default Values).
--sv-hotspot-min-required-unique-read-count: Minimum total unique reads in hotspot regions when high-precision filters are enabled (see Default Values).
--sv-hotspot-min-required-spanning-read-count: Minimum spanning reads in hotspot regions when high-precision filters are enabled (see Default Values).
--sv-hotspot-min-required-split-read-count: Minimum split reads in hotspot regions when high-precision filters are enabled (see Default Values).
--sv-min-diploid-variant-score: Minimum VCF 'QUAL' score for a variant to be included in the diploid VCF (see Default Values).
--sv-min-somatic-score: Minimum somatic quality score for a variant to be included in the somatic VCF (see Default Values).
--sv-min-pass-diploid-variant-score: VCF 'QUAL' score below which a variant is marked as filtered in the diploid VCF (see Default Values).
--sv-min-pass-somatic-score: Somatic quality score below which a variant is marked as filtered in the somatic VCF (see Default Values).
--sv-min-scored-variant-size: Minimum size of variant to be scored and included in the VCF output (see Default Values).
--sv-hotspot-min-scored-variant-size: Minimum size of variant in hotspot regions to be scored and included in the VCF output (see Default Values).
--sv-diploid-max-mq0-frac: Maximum fraction of MQ0 (mapping quality 0) reads allowed before a variant is filtered in the diploid VCF (see Default Values).
--sv-min-pass-diploid-gt-score: Genotype quality score below which an individual sample is filtered for a variant in the diploid VCF (see Default Values).
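As a rough illustration of how the options above are passed, the fragment below collects the germline high-precision filter options into one string that can be appended to an existing DRAGEN command line. The variable name sv_filter_args is hypothetical, the rest of the DRAGEN command is left as a placeholder, and the numeric values simply repeat the documented defaults, so the read-count options are shown only to make the syntax explicit.

```shell
# Sketch only: germline high-precision filter options, using the documented
# default thresholds. sv_filter_args is a hypothetical variable name.
sv_filter_args="--sv-enable-high-precision-filters true \
--sv-min-required-unique-read-count 3 \
--sv-min-required-spanning-read-count 1 \
--sv-min-required-split-read-count 1"

# Append to an existing DRAGEN command line (placeholder, not executed here):
#   dragen <your existing options> $sv_filter_args
echo "$sv_filter_args"
```

The read-count thresholds only take effect when the corresponding high-precision filter is enabled, which is why the enable option is listed first.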
Default Values
SV_MIN_SCORED_VARIANT_SIZE: 35
SV_TIN_CONTAM_TOLERANCE: 0.15
SV_MIN_EDGE_OBSERVATIONS: 3
SV_MIN_CANDIDATE_SPANNING_COUNT: 3
SV_HOTSPOT_MIN_SCORED_VARIANT_SIZE: 25
SV_MIN_GQ: 5
SV_MIN_REQUIRED_UNIQUE_READ_COUNT: 3
SV_MIN_REQUIRED_SPANNING_READ_COUNT: 1
SV_MIN_REQUIRED_SPLIT_READ_COUNT: 1
SV_HOTSPOT_MIN_REQUIRED_UNIQUE_READ_COUNT: 3
SV_HOTSPOT_MIN_REQUIRED_SPANNING_READ_COUNT: 0
SV_HOTSPOT_MIN_REQUIRED_SPLIT_READ_COUNT: 1
SV_ML_MIN_PASS_DEL_PROB: 0.5
SV_ML_MIN_PASS_INS_PROB: 0.5
SV_MIN_DIPLOID_VARIANT_SCORE: 10
SV_MIN_SOMATIC_SCORE: 10
SV_MIN_PASS_DIPLOID_VARIANT_SCORE: 20
SV_MIN_PASS_SOMATIC_SCORE: 20
SV_DIPLOID_MAX_MQ0_FRAC: 0.4
SV_MIN_PASS_DIPLOID_GT_SCORE: 15
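The two diploid score thresholds define three outcomes for a variant: not written to the VCF, written but marked as filtered, or written without a score-based filter. The sketch below illustrates this reading of the option descriptions using the default values (SV_MIN_DIPLOID_VARIANT_SCORE = 10, SV_MIN_PASS_DIPLOID_VARIANT_SCORE = 20). It is an illustration only, not DRAGEN's implementation; the function name classify_qual and the output labels are hypothetical, and a variant that passes the score threshold can still be filtered for other reasons.

```shell
# classify_qual QUAL [MIN_INCLUDED] [MIN_PASS]
# Hypothetical helper: shows which tier a QUAL value falls into.
# Defaults follow the documented diploid values (10 and 20).
classify_qual() {
  awk -v q="$1" -v min_included="${2:-10}" -v min_pass="${3:-20}" 'BEGIN {
    q += 0; min_included += 0; min_pass += 0   # force numeric comparison
    if (q < min_included)   print "excluded"             # below minimum score: not in the VCF
    else if (q < min_pass)  print "filtered"             # in the VCF, marked as filtered on score
    else                    print "passes score filter"  # in the VCF, no score-based filter
  }'
}
```

For example, classify_qual 15 falls between the two defaults, so the variant would be written to the diploid VCF but marked as filtered. The same pattern applies to the somatic thresholds, which share the default values 10 and 20.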
Benchmarking DRAGEN SV VCF against NIST T2T Q100 truthset
DRAGEN SV calling on HG002 can be evaluated using a recent SV truth set from NIST based on T2T Q100 assemblies (e.g., v0.019). This section shows an example of benchmarking with truvari (https://github.com/ACEnglish/truvari).
Running truvari is a two-step process: truvari bench followed by truvari refine (see the README from NIST for details).
NIST recommends post-processing the truth VCF to remove records with "*" in the ALT field. This can be done with bcftools view -e 'ALT="*"' -Oz -o benchmark_noAltAst.vcf.gz benchmark.vcf.gz.
To run truvari refine, records with "<DUP:TANDEM>" in the ALT field must be removed from the DRAGEN SV VCF output. DRAGEN reports tandem duplications larger than 1000 bases with a symbolic "<DUP:TANDEM>" allele instead of the actual sequence in the final VCF (see Large Variant Representation). Such records can be removed with bcftools view -e 'ALT="<DUP:TANDEM>"' -Oz -o dragen.sv.tandup_removed.vcf.gz dragen.sv.vcf.gz.
Taken together, to benchmark a DRAGEN SV VCF against the NIST T2T Q100 truth set, run the following script:
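A minimal sketch of such a script is given here, under stated assumptions. The function names run and benchmark_sv, the DRY_RUN switch, and all input file names are hypothetical. The truvari options shown (-b, -c, -o, --includebed, --passonly for bench; --reference and --regions for refine) are those of truvari 4.x and may differ in other versions, so check truvari bench --help and truvari refine --help, and follow the NIST README for any additional recommended parameters.

```shell
# Sketch only: benchmark a DRAGEN SV VCF against the NIST T2T Q100 truth set.

# Print the command when DRY_RUN=1 (shell quoting is not shown), else run it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# benchmark_sv TRUTH_VCF TRUTH_BED DRAGEN_VCF REF_FASTA OUT_DIR
# OUT_DIR must not exist yet; truvari bench creates it.
benchmark_sv() {
  truth_vcf=$1; truth_bed=$2; dragen_vcf=$3; ref_fa=$4; out_dir=$5

  # 1. Remove truth records with "*" in ALT, then index the result.
  run bcftools view -e 'ALT="*"' -Oz -o benchmark_noAltAst.vcf.gz "$truth_vcf" &&
  run bcftools index -t benchmark_noAltAst.vcf.gz &&
  # 2. Remove symbolic <DUP:TANDEM> records from the DRAGEN calls, then index.
  run bcftools view -e 'ALT="<DUP:TANDEM>"' -Oz -o dragen.sv.tandup_removed.vcf.gz "$dragen_vcf" &&
  run bcftools index -t dragen.sv.tandup_removed.vcf.gz &&
  # 3. First pass: truvari bench restricted to the benchmark regions.
  run truvari bench -b benchmark_noAltAst.vcf.gz -c dragen.sv.tandup_removed.vcf.gz \
      --includebed "$truth_bed" --passonly -o "$out_dir" &&
  # 4. Second pass: truvari refine on the candidate regions written by bench.
  run truvari refine --reference "$ref_fa" --regions "$out_dir/candidate.refine.bed" "$out_dir"
}
```

Calling DRY_RUN=1 benchmark_sv benchmark.vcf.gz benchmark.bed dragen.sv.vcf.gz ref.fa bench_out prints the six commands without executing them, which is a convenient way to review the pipeline before running it with bcftools and truvari installed.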