Structural Variant Calling
The DRAGEN Structural Variant (SV) Caller integrates and extends Manta structural variant calling methods to provide SV and indel calls 50 bases or larger. SVs and indels are called from mapped paired-end sequencing reads. The SV caller is optimized for analysis of diploid germline variation in small sets of individuals and somatic variation in tumor-normal sample pairs.
The SV caller performs the following actions:
Discovers, assembles, and scores large-scale SVs, medium-sized indels, and large insertions within a single efficient workflow.
Combines paired and split-read evidence during SV discovery and scoring to improve accuracy, but does not require split-reads or successful breakpoint assemblies to report a variant in cases where there is strong evidence otherwise.
Scores known SV deletions and insertions from an input VCF file against one or more input samples, either as a standalone procedure or together with standard SV discovery.
Provides scoring models for 1) germline variants in small sets of diploid samples, 2) somatic variants in matched tumor-normal sample pairs, and 3) somatic and germline variants in tumor-only samples.
All SV and indel inferences are output in VCF 4.2 format.
DRAGEN SV Caller Overview
The DRAGEN SV Caller divides the SV and indel discovery process into the following steps.
Reads input files to estimate alignment statistics, including fragment size distribution and chromosome level depth. For more information on the SV Caller input options, see Command Line Options.
Scans the genome, or a subset of the genome specified by the call regions, to build genome-wide data structures, including a breakend association graph of all SV-associated regions. The graph contains edges that connect all regions of the genome that have a possible breakend association. An edge can connect two different regions of the genome to represent evidence of a long-range association, or it can connect a region to itself to capture a local indel or small SV association. These associations are more general than a specific SV hypothesis, so multiple breakend candidates might be found on one edge, although typically only one or two candidates are found per edge. Instead of an inclusion region BED file, you can pass an exclusion region BED file to DRAGEN so that any SV breakend that overlaps these regions is removed from downstream analyses. The excluded regions are removed from the graph building process, but active regions close to the boundaries of the excluded regions can be extended into them during the refinement step. As a result, the final SV calls might still extend into the excluded regions.
Analyzes the breakend association graph to discover candidate SVs, then scores discovered candidate SVs and any known SVs from the input. Analysis and scoring are performed as follows.
Infers SV candidates that are associated with the given graph edge.
Assembles the SV breakends.
Merges discovered SV candidates with any known SV candidates included in the input data.
Scores/genotypes and filters each SV candidate under various biological models (currently germline, tumor-normal, and tumor-only).
Outputs scored SVs to VCF.
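As a rough illustration of the breakend association graph described in the steps above, the following Python sketch stores edges between genomic regions, lets a region connect to itself for local indel evidence, and drops weakly supported edges. The BreakendGraph class, the region tuples, and the observation counts are hypothetical and do not reflect DRAGEN internals. The threshold of 3 mirrors the documented default of --sv-min-edge-observations.

```python
from collections import Counter


class BreakendGraph:
    """Toy breakend association graph; not the DRAGEN data structure."""

    def __init__(self):
        # frozenset of one or two regions -> number of supporting observations
        self.edges = Counter()

    def add_observation(self, region_a, region_b):
        # A frozenset holding a single region represents a self-edge,
        # that is, a local indel or small SV association.
        self.edges[frozenset((region_a, region_b))] += 1

    def prune(self, min_edge_observations=3):
        # Remove edges with fewer observations than the threshold.
        self.edges = Counter(
            {edge: n for edge, n in self.edges.items() if n >= min_edge_observations}
        )


graph = BreakendGraph()
local = ("chr1", 1000, 1200)     # hypothetical (contig, start, end) regions
distal = ("chr5", 50000, 50300)
for _ in range(4):
    graph.add_observation(local, distal)  # long-range association
for _ in range(2):
    graph.add_observation(local, local)   # local association, weakly supported
graph.prune()
```

After pruning, only the long-range edge remains, because the self-edge has two observations.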
DRAGEN SV Caller Capabilities
The DRAGEN SV Caller can discover all identifiable structural variant types in the absence of copy number analysis and large-scale de novo assembly. For more information on detectable types, see Detected Variant Classes.
For each structural variant and indel, the SV Caller attempts to assemble the breakends by gathering nearby evidential reads, and to call SV events to base pair resolution by aligning the assemblies against the reference genome. The SV Caller then reports the left-shifted breakend coordinate (per the VCF 4.2 SV reporting guidelines), together with any breakend homology sequence and/or inserted sequence between the breakends. As a result, the reported coordinates of an SV event might not match what the read alignments show directly in a viewer such as IGV. The assembly often fails to provide a confident explanation of the data, especially in repeat regions. In such cases, the SV Caller does not provide single-base resolution breakpoints or the associated split-read support. Instead, it approximates the event breakpoints, scores the event under the same unified likelihood model used for other cases, and reports the variant as IMPRECISE.
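To illustrate why a left-shifted coordinate can differ from where an aligner places the same event, the following minimal sketch left-shifts a simple deletion within a reference string and counts the breakend homology. It uses 0-based string offsets rather than VCF coordinates, and the function name left_shift_deletion is hypothetical. It is a textbook normalization, not the DRAGEN implementation.

```python
def left_shift_deletion(ref, start, length):
    """Return (leftmost_start, homology_length) for a deletion of ref[start:start+length]."""
    # Shift left while the base entering the deleted span equals the base leaving it,
    # so the resulting haplotype is unchanged.
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    # Count how far the same deletion could slide right from the leftmost position.
    homology = 0
    while (start + homology + length < len(ref)
           and ref[start + homology] == ref[start + homology + length]):
        homology += 1
    return start, homology


# Deleting one "AC" unit from an (AC)3 repeat: every placement from offset 2 to 6
# yields the same sequence, so the event is reported at the leftmost offset.
shifted = left_shift_deletion("GGACACACTT", 6, 2)
```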
You can provide known SVs as input for forced genotyping. This known SV input can be scored either standalone or together with the standard SV discovery workflow, in which case the known and discovered SVs are merged.
The sequencing reads provided as input to the SV Caller are expected to be from a paired-end sequencing assay that results in an "innie" orientation between the two reads of each sequence fragment, each presenting a read from the outer edge of the fragment insert inward.
The SV Caller is primarily tested for whole-genome and whole-exome (or other targeted enrichment) sequencing assays on DNA. For these assays the following applications are supported:
Joint analysis of 5 or fewer diploid individuals
Subtractive analysis of a matched tumor-normal sample pair
Analysis of an individual tumor sample
For joint analysis, there is no specific restriction against larger cohorts, but there might be stability or call quality issues.
When performing somatic calling on liquid tumor samples, the matched normal sample might be contaminated with tumor cells. The contamination can substantially reduce somatic variant recall. To account for Tumor-in-Normal (TiN) contamination, you can enable liquid tumor mode. For more information, see Liquid Tumor Calling.
Tumor samples can be analyzed without a matched normal sample. In this case, both germline and somatic variants are scored and reported in the output.
Detected Variant Classes
The SV Caller can discover all variation classes that can be explained as novel DNA adjacencies in the genome. Novel DNA adjacencies are classified into the following categories based on the breakend pattern:
Deletions
Insertions
Insertions in the result are divided into two subclasses depending on whether the inserted sequence can be fully assembled: 1) fully-assembled insertions and 2) partially-assembled (inferred) insertions.
Mobile Element Insertions (MEIs) that are not called by the general-purpose SV routine are rescued by the MEI-specific routine, based on similarity between the assembled contigs and the known sequences in the MEI catalog file <INSTALL_PATH>/config/sv_mobile_element_sequences.fa.
Tandem Duplications
Inversions
Unclassified breakend pairs corresponding to intra- and inter-chromosomal translocations, or complex structural variants.
Known Limitations
The SV Caller cannot directly discover the following variant types:
Dispersed duplications.
Dispersed duplications may be indirectly called as insertions or unclassified breakends.
Most expansion/contraction variants of a reference tandem repeat.
Breakends corresponding to small inversions.
The limiting size is not tested, but in theory, detection falls off below ~200 bases. Micro-inversions might be detected indirectly as combined insertion/deletion variants.
Fully-assembled large insertions.
The maximum fully-assembled insertion size should correspond to approximately twice the read-pair fragment size, but power to fully assemble the insertion should fall off to impractical levels before this size.
The SV Caller does detect and report very large insertions when the breakend signature of such an event is found, even though the inserted sequence cannot be fully assembled.
More general repeat-based limitations exist for all variant types:
Power to assemble variants to breakend resolution falls to zero as breakend repeat length approaches the read size.
Power to detect any breakend falls to (nearly) zero as the breakend repeat length approaches the fragment size.
While the SV Caller classifies certain novel DNA-adjacencies into variant classes, it has a limited ability to infer high-level events resulting from complex rearrangements, so certain calls summarized as deletions, duplications, and insertions might be better described by looking at the full system of breakends and copy number changes associated with a given event.
Forced Genotyping Capability
The DRAGEN SV Caller can force genotype a set of SVs provided in an input VCF file. Forced genotyping means that the input SVs are scored and emitted in the output of the SV Caller even if the variant is not supported in the sample data. For example, in a germline analysis, the input variants are processed and written to the output VCF even if the variant quality falls below the threshold normally required for an SV to be emitted.
Forced genotyping typically enables known SVs to be detected at higher recall than standard SV discovery (particularly for SV discovery on a lower-depth sample). Forced genotyping can also be useful to assert against the presence of an SV allele. For example, you can use forced genotyping to distinguish a confident homozygous reference genotype from a lack of sequencing coverage over the SV locus.
Forced genotyping SVs are processed according to the current SV analysis being run. For example, if a germline analysis is configured by providing one or more normal samples as input, then the input SVs are scored under a germline model.
Forced genotyping alleles are always emitted in the output and might have modified scoring and filtering rules applied compared to SVs only discovered from the sample data.
Forced Genotyping Modes
Forced Genotyping can be run in two modes.
Standalone --- Only the SVs described in an input VCF are scored and emitted.
Integrated --- The standard SV discovery analysis is run and the results are merged with SVs scored from the forced genotyping input. The workflow outputs the union of SVs discovered from the sample data and any additional forced genotyping alleles. The workflow is run whenever the --sv-discovery option is true.
Forced Genotyping Inputs
You can specify forced genotyping input using the --sv-forcegt-vcf option. The input must be a VCF of SV alleles. The SV allele types are restricted to insertions, deletions, tandem duplications, and breakends that are not labeled with the INFO/IMPRECISE flag. A VCF record must meet all of the following criteria to be processed as an input SV allele. If any of the criteria are not met, the VCF record is removed from the set of input SVs for forced genotyping. When a forced genotyping VCF is specified on the command line, the SV Caller reports the total number of SV records used as input SVs and the total number of records filtered (if any) due to the following criteria.
Describes an insertion, deletion, tandem duplication, or breakend record.
Cannot contain the INFO/IMPRECISE flag.
Cannot contain multiple ALT alleles.
Has a FILTER value of PASS or unknown (.).
All indels are at least the minimum scored variant size (default is 50).
Cannot repeat an SV allele previously described in the same file.
The REF field cannot be empty or unknown (.).
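The criteria above can be sketched as a single check per record. The SvRecord class, its field names, and the rejection_reason function are hypothetical simplifications for illustration: the SV type and indel size are assumed to be precomputed, the size rule is assumed to apply to insertions and deletions only, and duplicates are detected with a simple key. This is not how DRAGEN parses the VCF.

```python
from dataclasses import dataclass, field

ALLOWED_TYPES = {"INS", "DEL", "DUP:TANDEM", "BND"}


@dataclass
class SvRecord:
    chrom: str
    pos: int
    ref: str
    alts: list
    filt: str = "."
    svtype: str = "DEL"
    size: int = 0  # absolute indel size; ignored for breakends
    info_flags: set = field(default_factory=set)


def rejection_reason(rec, seen, min_scored_size=50):
    """Return None if the record passes every listed criterion, else a short reason."""
    if rec.svtype not in ALLOWED_TYPES:
        return "unsupported SV type"
    if "IMPRECISE" in rec.info_flags:
        return "IMPRECISE flag"
    if len(rec.alts) != 1:
        return "not exactly one ALT allele"
    if rec.filt not in ("PASS", "."):
        return "FILTER not PASS or unknown"
    if rec.svtype in ("INS", "DEL") and rec.size < min_scored_size:
        return "below minimum scored size"
    if rec.ref in ("", "."):
        return "REF empty or unknown"
    key = (rec.chrom, rec.pos, rec.ref, rec.alts[0], rec.size)
    if key in seen:
        return "duplicate allele"
    seen.add(key)
    return None
```

Passing the same set as seen for every record in a file lets the function count both accepted and filtered records, similar to the totals that the SV Caller reports.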
You must describe insertions using the VCF small indel format, including an ALT entry that describes the complete insertion sequence. Using <INS> as a symbolic alt allele is not accepted. You can describe deletions using either the VCF small indel format or the <DEL> symbolic alt allele. For any variant described using a symbolic alt allele, you must also provide a value for INFO/END.
Inversions represented in a single VCF record using the <INV> alt allele are not accepted, but the inversion can be genotyped if converted to a set of breakend records. Each breakpoint is described by a pair of breakend VCF records. If the forced genotyping input contains just one record of the pair and the input conditions above are met, the input is still accepted for forced genotyping, and the distal breakend is inferred from the local record.
You can describe breakpoint insertions for non-insertion SV alleles using one of the following two methods. Both methods correspond to the format used to describe breakpoint insertions in the SV VCF output.
For SVs described using the symbolic ALT format, such as <DEL>, the INFO/SVINSSEQ field is parsed to read the breakpoint insertion sequence.
For smaller indels described directly in the REF and ALT fields, the contents of the ALT field describe the breakend sequence.
Forced Genotyping Output
Forced genotyping SVs are always output to the standard VCF output of the SV Caller, regardless of whether the forced genotyping is standalone or integrated with SV calling. When the same SV allele is independently discovered from the sample data, only the discovered SV appears in the final output. The discovered SV allele is annotated to indicate the match to a forced genotyping input SV, and the scoring and filtration rules are changed to match.
VCF output records influenced by forced genotyping have the following associated fields.
The flag INFO/NotDiscovered is set for any VCF record that was not independently discovered from the sample data. When forced genotyping is run standalone, all output records contain the flag. When integrated with SV calling, the flag can distinguish the SV alleles that would not have been discovered in a standard SV analysis. For these variants only, the usual SV Caller ID field generated from the SV locus graph is not available. Instead, the ID is taken from the corresponding user input VCF, and the suffix UserInput${InputVCFRecordNumber} is appended to the ID, separated by an underscore. If your input VCF contains only one of the two VCF records that comprise a breakend variant, then the ID is taken from the mate breakend record and the _Mate suffix is added.
Any output VCF record that corresponds to a forced genotyping input VCF record has the value INFO/UserInputId=${ID} set to reflect the VCF ID value of the input VCF record. The corresponding record might have also been discovered independently from the sample data and might not have the INFO/NotDiscovered flag set.
Any output VCF record that corresponds to a forced genotyping input VCF record and contains alleles that exactly match an input SV has the flag INFO/KnownSVScoring. VCF records with this flag are always emitted in the output of the SV Caller. Several filters, such as MaxDepth, are not applied.
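When post-processing the output, the three INFO fields described above can be read with a few lines of Python. This is a minimal sketch that parses the INFO column as plain text; the function names and the example INFO string are hypothetical, and a real pipeline would more likely use a VCF parsing library.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag fields map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def forcegt_status(info):
    """Summarize how forced genotyping influenced one output record."""
    fields = parse_info(info)
    return {
        "from_user_input": "UserInputId" in fields,
        "discovered": "NotDiscovered" not in fields,
        "known_sv_scoring": "KnownSVScoring" in fields,
        "user_input_id": fields.get("UserInputId"),
    }


status = forcegt_status(
    "SVTYPE=DEL;END=2000;NotDiscovered;UserInputId=sv42;KnownSVScoring"
)
```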
Systematic Noise Filtering
When DRAGEN-SV is used in somatic mode (tumor-only or tumor-normal), you can specify a BEDPE file containing a set of paired regions to filter out sequencing or systematic noise as well as recurrent germline calls. Any variant that overlaps one of the systematic noise paired regions (with a population count of at least 2) and has the same orientation is marked as SystematicNoise in the final VCF file. Pass the BEDPE file using the --sv-systematic-noise command line option.
The systematic noise BEDPE file is built using VCFs that were generated by the DRAGEN-SV tumor-only pipeline when run on normal samples, which do not need to match the subject the tumor sample was taken from. The file might be built from several dozen samples.
Generating systematic noise BEDPE file
You can generate systematic noise BEDPE files from normal samples collected using the same library prep, sequencing system, and panels as the samples to be analyzed.
To generate a BEDPE file, do as follows.
Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.
Build the BEDPE file from these VCFs using the --sv-build-systematic-noise-vcfs-list option, which takes a list of the input VCFs from the previous step. Enter one VCF per line.
You can also build systematic noise BEDPE files in the cloud using the DRAGEN Baseline Builder App on BaseSpace.
Pre-built SV systematic noise BEDPE files
The following prebuilt systematic noise files for WGS are available for download on the DRAGEN Software Support Site page. To generate these noise files, we used 100 unrelated normal samples from the 1000 Genomes Project. Each systematic noise file contains a version string that DRAGEN uses to check the compatibility by default and exits early if a wrong systematic noise file is provided.
WGS_hg19_v3.0.0_systematic_noise.sv.bedpe.gz --- >30x coverage using the Illumina NovaSeq 6000 system with 2x150bp reads for the HG19 reference. Systematic noise file version: 3.0.0. DRAGEN version: 4.3.*
WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz --- >30x coverage using the Illumina NovaSeq 6000 system with 2x150bp reads for the HG38 reference. Systematic noise file version: 3.0.0. DRAGEN version: 4.3.*
WGS_hs37d5_v3.0.0_systematic_noise.sv.bedpe.gz --- >30x coverage using the Illumina NovaSeq 6000 system with 2x150bp reads for the HS37D5 reference. Systematic noise file version: 3.0.0. DRAGEN version: 4.3.*
The systematic noise BEDPE file must contain the following columns:
contig1 --- Chromosome of the first region (string)
start1 --- Start position of the first region (0-based left-most position of the first breakpoint containing genomic interval, integer)
end1 --- End position of the first region (0-based left-most position of the first breakpoint containing genomic interval, integer)
contig2 --- Chromosome of the second region (string)
start2 --- Start position of the second region (0-based left-most position of the second breakpoint containing genomic interval, integer)
end2 --- End position of the second region (0-based left-most position of the second breakpoint containing genomic interval, integer)
event_id --- The paired region unique ID (string)
score --- The number of occurrences in the cohort
orientation1 --- Direction of breakpoint1 relative to the reference; "+" indicates to the right, "-" to the left (string, "+" or "-")
orientation2 --- Direction of breakpoint2 relative to the reference; "+" indicates to the right, "-" to the left (string, "+" or "-")
assembly-status --- If all variants used to generate the noise candidate have end-to-end local assemblies, the noise candidate is "precise", otherwise it is "imprecise" (string, "precise" or "imprecise")
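As a minimal sketch of how the columns above relate to the SystematicNoise rule (both regions overlap, same orientation, population count of at least 2), the following Python code parses one BEDPE line and tests a variant against it. It assumes tab-separated fields in the listed order and half-open intervals, neither of which is stated in this section, and the function names and variant tuple layout are hypothetical. It is not the DRAGEN matching logic.

```python
from collections import namedtuple

NoiseRegion = namedtuple(
    "NoiseRegion",
    "contig1 start1 end1 contig2 start2 end2 event_id score "
    "orientation1 orientation2 assembly_status",
)


def parse_bedpe_line(line):
    # Assumes tab-separated fields in the documented column order.
    f = line.rstrip("\n").split("\t")
    return NoiseRegion(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), int(f[5]),
                       f[6], int(f[7]), f[8], f[9], f[10])


def overlaps(contig_a, start_a, end_a, contig_b, start_b, end_b):
    # Half-open interval overlap; the interval convention is an assumption.
    return contig_a == contig_b and start_a < end_b and start_b < end_a


def is_systematic_noise(variant, region, min_count=2):
    """variant: (contig1, start1, end1, orient1, contig2, start2, end2, orient2)."""
    c1, s1, e1, o1, c2, s2, e2, o2 = variant
    return (
        region.score >= min_count
        and overlaps(c1, s1, e1, region.contig1, region.start1, region.end1)
        and overlaps(c2, s2, e2, region.contig2, region.start2, region.end2)
        and (o1, o2) == (region.orientation1, region.orientation2)
    )
```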
SV Scoring
The SV Caller applies a diploid scoring model for one or more diploid samples (treated as unrelated), as well as a somatic scoring model when a tumor and matched normal sample pair are given.
Germline scoring model
The germline scoring model produces diploid genotype probabilities for each candidate structural variant. Most candidates are approximated as independent for scoring purposes and are modeled under a bi-allelic, diploid genotype likelihood setting. DRAGEN solves for the posterior probability over possible genotypes given the sequencing data by approximating it as proportional to the product of the prior probability of a genotype and the conditional probability of observing the set of read fragments (post-filtering) given the underlying genotype. DRAGEN treats each read fragment independently and represents the conditional probability of the set of read fragments as the product of the conditional probabilities of the individual read fragments. For each read fragment, DRAGEN combines paired-read and split-read evidence components and approximates their contributions as independent by taking the product of the two components. The paired-read component is weighted by a linear ramp from one to zero that depends on the candidate event type and size, because a tiny event does not significantly affect paired-read mapping status.
The paired-read component is modeled as a function measuring the deviation of the inferred fragment length from the overall distribution.
The split-read component is modeled as a function measuring the correctness of a read alignment to a breakend, computed by multiplying, across all non-gap bases, the probability of observing each base call given the corresponding base of the evaluated allele.
Each read fragment may contribute only paired-read support, only split-read support, or both. Where a fragment contributes split-read support, this support may come from either or both reads in the read pair.
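The structure of this model can be sketched in a few lines: a prior per genotype, a product over fragments, and a per-fragment likelihood that multiplies a down-weighted paired-read term by a split-read term. Everything numeric below is an assumption made for illustration, including the ramp bounds, the use of the weight as an exponent, the 0, 0.5, and 1 allele fractions per genotype, and the example likelihoods. The sketch shows the shape of the computation only and is not the DRAGEN scoring code.

```python
import math

ALT_FRACTION = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}


def pair_weight(sv_size, full_at=1000, zero_at=500):
    """Linear ramp from 0 (tiny events) to 1 (large events); bounds are made up."""
    return min(1.0, max(0.0, (sv_size - zero_at) / (full_at - zero_at)))


def fragment_likelihood(pair_lik, split_lik, weight):
    # Treat the two evidence types as independent, and down-weight the paired-read
    # term by raising it to the ramp weight (one possible weighting, assumed here).
    return (pair_lik ** weight) * split_lik


def genotype_posteriors(fragments, priors, sv_size):
    """fragments: list of ((pair_ref, split_ref), (pair_alt, split_alt)) likelihoods."""
    w = pair_weight(sv_size)
    log_post = {}
    for gt, alt_frac in ALT_FRACTION.items():
        total = math.log(priors[gt])
        for ref_parts, alt_parts in fragments:
            lik_ref = fragment_likelihood(*ref_parts, w)
            lik_alt = fragment_likelihood(*alt_parts, w)
            total += math.log((1 - alt_frac) * lik_ref + alt_frac * lik_alt)
        log_post[gt] = total
    # Normalize in log space to avoid underflow.
    peak = max(log_post.values())
    unnorm = {gt: math.exp(v - peak) for gt, v in log_post.items()}
    z = sum(unnorm.values())
    return {gt: v / z for gt, v in unnorm.items()}


ref_frag = ((0.9, 0.9), (0.01, 0.01))   # fragment that fits the reference allele
alt_frag = ((0.01, 0.01), (0.9, 0.9))   # fragment that fits the variant allele
priors = {"0/0": 0.98, "0/1": 0.015, "1/1": 0.005}
post = genotype_posteriors([ref_frag] * 6 + [alt_frag] * 6, priors, sv_size=2000)
```

With six fragments supporting each allele, the heterozygous genotype receives nearly all of the posterior mass despite its small prior.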
Somatic scoring model
The somatic scoring model is a Bayesian probabilistic model using a tumor-normal joint genotyping approach. It aims to call somatic structural variants in tumors while avoiding germline variants and noisy variants. In this model, the tumor and normal allele frequencies are treated as nonindependent random variables. DRAGEN calculates posterior probabilities for a range of genotype hypotheses, under the assumption that the normal sample conforms to the diploid germline genotype considering homozygous reference, heterozygous, and homozygous states. The tumor sample is a mixture of the germline genotype and, if present, the somatic allele. For the somatic genotype, we consider only two states referring to the absence and presence of the somatic variant in the tumor sample. In cases where the somatic variant is not present, we account for unsystematic independent noise, while assuming an error-free scenario when the somatic variant is considered. To calculate the genotype likelihoods, the model integrates allele frequency likelihoods over the joint tumor and normal allele frequencies and applies modifications to address liquid tumors with Tumor-in-Normal (TiN) contamination. The integration is approximated with a discrete summation. In these calculations, the likelihood for each read to support a given allele is shared with the germline scoring model. The tumor-only somatic scoring model is seen as a special case of the somatic scoring model in the absence of normal data (zero coverage). The posterior probability is converted into a Phred quality score and reported in the VCF output INFO/SOMATICSCORE field.
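For reference, the standard Phred transform of a posterior probability P is Q = -10 log10(1 - P). The sketch below applies that formula. The rounding to an integer and the cap are assumptions, because this section does not state how DRAGEN rounds or bounds INFO/SOMATICSCORE, and the function name is hypothetical.

```python
import math


def phred_score(posterior, cap=999):
    """Convert a posterior probability into a Phred-scaled integer score."""
    error = 1.0 - posterior
    if error <= 0.0:
        return cap  # avoid log10(0) when the posterior rounds to exactly 1
    return min(cap, int(round(-10.0 * math.log10(error))))
```

For example, a posterior of 0.999 for the somatic state corresponds to a score of 30.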
Input Requirements
When running the SV Caller, the input sequencing reads must be from a standard Illumina paired-end sequencing assay with an FR read pair orientation, where for each sequence fragment, a read proceeds from each end of the fragment inwards. For more information, see DRAGEN SV Caller Capabilities.
The SV Caller is optimized for paired-end libraries where the fragment size is typically larger than the size of both reads. Overlapping read pairs can be used to discover SVs, but might not always be handled optimally. For libraries where the typical fragment size is less than the read length, the SV caller attempts to differentiate reads sequencing into adapter sequence from the variant signal. In such cases, the SV Caller's input quality checks may fail and cause SV analysis to be skipped.
If using the standalone mode, your BAM/CRAM inputs must first be mapped. If you have not mapped and aligned your data yet, you can generate an alignment file.
Alignment Contig Checks
If running from a mapped and aligned BAM, then the contigs specified in the header must strictly match those in the DRAGEN hashtable specified in the current analysis. Missing or extra contigs will lead to a "Reference genome mismatch" configuration error and the analysis will not proceed. If such an error is observed, it is recommended to regenerate the alignment file with the intended DRAGEN hashtable, or to run with the DRAGEN map/align module enabled.
Input Quality Checks
The SV Caller runs quality checks on the input sequencing reads for each sample to make sure that the input corresponds to a paired read assay with the expected FR orientation, before estimating the fragment size distribution. To check consensus read pair orientation, a subset of high-quality read pairs is sampled. At least 90% of these must have the expected FR orientation for SV analysis to continue. Otherwise, the SV Caller issues a warning, skips any further analysis, and writes empty results to its output files.
The SV Caller can tolerate nonpaired reads in the input, if sufficient paired-end reads exist to estimate the fragment size distribution. To estimate the fragment size distribution, the SV Caller requires at least 100 read pairs that meet the quality requirements of the estimation routine. Both reads of the pair must have a non-zero mapping quality to the same chromosome, must not be filtered or part of a split read mapping, and must not contain indels or soft-clipping. If a sample does not contain a sufficient number of such read pairs, the SV Caller issues a warning, skips any further analysis, and writes empty results to its output files.
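The two thresholds above (90% FR orientation and 100 usable read pairs) can be expressed as a small pre-flight check. This sketch assumes the orientation labels and the count of usable pairs have already been derived from the alignments. The function name and the orientation strings are hypothetical, and the actual sampling and quality criteria used by DRAGEN are not reproduced here.

```python
def check_input_quality(orientations, n_usable_pairs,
                        min_fr_fraction=0.9, min_pairs=100):
    """Return a list of problems; an empty list means SV analysis could proceed.

    orientations: 'FR', 'RF', 'FF', or 'RR' for each sampled high-quality pair.
    n_usable_pairs: number of pairs suitable for fragment size estimation.
    """
    problems = []
    fr_fraction = orientations.count("FR") / len(orientations) if orientations else 0.0
    if fr_fraction < min_fr_fraction:
        problems.append(
            f"only {fr_fraction:.0%} of sampled pairs are FR (need {min_fr_fraction:.0%})"
        )
    if n_usable_pairs < min_pairs:
        problems.append(
            f"only {n_usable_pairs} usable read pairs (need {min_pairs})"
        )
    return problems
```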
Read Groups
The SV Caller disregards any read group labels applied to the input sequences. Each input sample is treated as a separate library with a single fragment size distribution.
File Format
In standalone mode, input sequencing reads must be mapped and provided as input in either BAM or CRAM format. Each input file must be coordinate sorted and indexed to produce an htslib-style index in a file named to match the input BAM or CRAM file with an additional .bai, .crai, or .csi file name extension. For more information on standalone mode, see Modes of Operation.
At least one BAM or CRAM file must be provided for the normal or tumor sample. A matched tumor-normal sample pair can be provided as well. If multiple input files are provided for the normal sample, each file is treated as a separate sample as part of a joint diploid sample analysis.
In standalone mode, input BAM or CRAM files have the following limitations:
Alignments cannot have an unknown read sequence (SEQ="*").
Alignments cannot contain the "=" character in the SEQ field.
Alignments cannot use the sequence match/mismatch ("="/"X") CIGAR notation.
RG (read group) tags in the alignment records are ignored. Each alignment file is treated as representing one sample.
Alignments with base call quality values greater than 70 are rejected. These are not supported on the assumption that this indicates an offset error.
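The per-alignment limitations above can be checked before submitting a file. The following sketch works on the text form of the SAM SEQ and CIGAR fields and a list of decoded integer base qualities; the function name and the problem strings are hypothetical.

```python
import re


def alignment_problems(seq, cigar, quals):
    """Return the listed limitations that one alignment record violates.

    seq and cigar are SAM text fields; quals is a list of integer base qualities.
    """
    problems = []
    if seq == "*":
        problems.append("unknown read sequence")
    elif "=" in seq:
        problems.append("'=' in SEQ")
    if re.search(r"[=X]", cigar):
        problems.append("'='/'X' CIGAR operations")
    if any(q > 70 for q in quals):
        problems.append("base quality above 70")
    return problems
```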
Generate an Alignment File
The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the pipeline.
You need to generate alignment files for all samples that have not already been mapped and aligned. Each sample must have a unique sample identifier. Use the --RGSM option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM option is not required.
The following example command maps and aligns a FASTQ file:
The following example command maps and aligns an existing BAM file:
The following example command maps and aligns an existing CRAM file:
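As a rough sketch of the three cases above, the following Python helper assembles a map/align argument list for FASTQ, BAM, or CRAM input. It uses only options named in this document (--ref-dir, --output-directory, --output-file-prefix, --enable-map-align, --fastq-file1, --fastq-file2, --RGSM, --bam-input, --cram-input). The helper name and the paths are hypothetical, and a real run is likely to need further options that this section does not cover, so treat the result as a starting point and not as a complete command.

```python
def map_align_args(ref_dir, out_dir, prefix, *, fastq1=None, fastq2=None,
                   sample_id=None, bam=None, cram=None):
    """Assemble a DRAGEN map/align argument list for exactly one input type."""
    args = ["dragen", "--ref-dir", ref_dir,
            "--output-directory", out_dir, "--output-file-prefix", prefix,
            "--enable-map-align", "true"]
    if fastq1:
        if not sample_id:
            # FASTQ files carry no sample name, so --RGSM must supply it.
            raise ValueError("FASTQ input needs a sample identifier (--RGSM)")
        args += ["--fastq-file1", fastq1]
        if fastq2:
            args += ["--fastq-file2", fastq2]
        args += ["--RGSM", sample_id]
    elif bam:
        args += ["--bam-input", bam]    # sample identifier is taken from the file
    elif cram:
        args += ["--cram-input", cram]  # sample identifier is taken from the file
    else:
        raise ValueError("provide FASTQ, BAM, or CRAM input")
    return args


fastq_cmd = map_align_args("/ref/hg38", "/out", "sample1",
                           fastq1="s1_R1.fastq.gz", fastq2="s1_R2.fastq.gz",
                           sample_id="sample1")
```

The list can be joined with spaces for display or passed to a process launcher.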
Exome/Targeted Calling
The SV Caller can be configured for targeted sequencing inputs, which disables high-depth filters. Exome mode can be set directly to true or false with the --sv-exome command line option. If not set directly, exome mode defaults to false unless you run the SV Caller in integrated mode and there is no more than 50 Gb of sequencing input.
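The default described above can be written as a small decision function. The function and parameter names are hypothetical, and treating "50 Gb" as 50 gigabases of sequencing input is an assumption; the sketch only restates the rule in this section and does not reproduce how DRAGEN measures its input.

```python
def resolve_exome_mode(sv_exome=None, integrated=False, input_gigabases=None):
    """Return the effective exome mode setting.

    sv_exome: explicit --sv-exome value, or None when the option is not given.
    integrated: True when SV calling runs together with DRAGEN map/align.
    input_gigabases: amount of sequencing input (unit is an assumption).
    """
    if sv_exome is not None:
        return sv_exome  # an explicit setting always wins
    return bool(integrated and input_gigabases is not None and input_gigabases <= 50)
```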
Internal Tandem Duplications Calling
You can use the --sv-somatic-ins-tandup-hotspot-regions-bed ${BEDFILE} option to specify ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from <INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed. The file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with 300 bp of padding on both sides. To disable this feature, enter --sv-enable-somatic-ins-tandup-hotspot-regions false.
Liquid Tumor Calling
In liquid tumor mode, liquid tumors refer to hematological cancers, such as leukemia. In tumor-normal analysis, liquid tumor mode accounts for Tumor-in-Normal (TiN) contamination by allowing a nonzero variant allele frequency for the matched normal sample when calculating the posterior probability of the somatic state. If the contamination is not accounted for, it can severely reduce sensitivity by suppressing true somatic variants.
Use the following two options to control liquid tumor mode behavior.
--sv-enable-liquid-tumor-mode --- Enable liquid tumor mode. Liquid tumor mode is disabled by default.
--sv-tin-contam-tolerance --- Set the TiN contamination tolerance level. DRAGEN calls variants in the presence of TiN contamination up to a specified maximum tolerance level. You can enter any value between 0 and 1. The default maximum TiN contamination tolerance is 0.15. If using the default value, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the corresponding allele in the tumor sample.
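The arithmetic behind the tolerance can be illustrated as follows: with a tolerance of 0.15, a somatic allele at 40% frequency in the tumor is expected at up to 0.15 x 0.40 = 6% in the contaminated normal. The function names below are hypothetical, and the hard comparison is a simplification, because DRAGEN applies the tolerance inside its probabilistic scoring model and not as a cutoff.

```python
def max_expected_normal_af(tumor_af, tin_contam_tolerance=0.15):
    """Highest normal-sample allele frequency explained by TiN contamination."""
    if not 0.0 <= tin_contam_tolerance <= 1.0:
        raise ValueError("tolerance must be between 0 and 1")
    return tumor_af * tin_contam_tolerance


def consistent_with_tin(tumor_af, normal_af, tin_contam_tolerance=0.15):
    # True when the normal-sample signal is small enough to be contamination.
    return normal_af <= max_expected_normal_af(tumor_af, tin_contam_tolerance)
```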
Command Line Options
The following command line options are supported for the Structural Variant Caller.
Input and Output Options
The following are the top-level options that are shared with the DRAGEN Host Software to control the SV pipeline. You can use BAM and CRAM files as input. Alternatively, if using read mapping with the SV calling in a single run, you can use all of the DRAGEN input options, including FASTQ, BAM, and CRAM files.
--cram-input --- The CRAM file to be processed.
--tumor-cram-input --- If performing tumor-normal or tumor-only analysis, the tumor CRAM file to be processed.
--fastq-file1, --fastq-file2, --fastq-list --- Input FASTQ files or a list of files to be processed.
--tumor-fastq1, --tumor-fastq2, --tumor-fastq-list --- Input tumor FASTQ file or list of files to be processed.
--enable-map-align --- Enables DRAGEN map/align. The default is true, so all input reads are remapped and aligned unless the option is set to false.
--output-directory --- Output directory where all results are stored.
--output-file-prefix --- Output file prefix that will be prepended to all result file names.
--ref-dir --- The DRAGEN reference genome hashtable directory. For more information about the reference genome hashtable, see Prepare a Reference Genome.
--bam-input --- The BAM file to be processed.
--tumor-bam-input --- If performing tumor-normal or tumor-only analysis, the tumor BAM file to be processed.
Structural Variant Caller Pipeline Options
--enable-sv --- Enable or disable the structural variant caller. The default is false.
--sv-call-regions-bed --- Specifies a BED file containing the set of regions to call. Optionally, you can compress the file in gzip or bgzip format.
--sv-exclusion-bed --- Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.
--sv-region --- Limit the analysis to a specified region of the genome for debugging purposes. This option can be specified multiple times to build a list of regions. The value must be in the format "chr:startPos-endPos".
--sv-exome --- Set to true to configure the variant caller for targeted sequencing inputs, which includes disabling high depth filters. In integrated mode, the default is to autodetect targeted sequencing input, and in standalone mode the default is false.
--sv-output-contigs --- Set to true to have assembled contig sequences output in a VCF file. The default is false.
--sv-forcegt-vcf --- Specify a VCF of structural variants for forced genotyping. The variants are scored and emitted in the output VCF even if not found in the sample data. The variants are merged with any additional variants discovered directly from the sample data.
--sv-discovery --- Enable SV discovery. This flag can be set to false only when --sv-forcegt-vcf is used. When set to false, SV discovery is disabled and only the forced genotyping input variants are processed. The default is true.
--sv-use-overlap-pair-evidence --- Allow overlapping read pairs to be considered as evidence. The default is false.
--sv-somatic-ins-tandup-hotspot-regions-bed --- Specify a BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from <INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed.
--sv-enable-somatic-ins-tandup-hotspot-regions --- Enable or disable the ITD hotspot region input. The default is true in somatic variant analysis.
--sv-enable-liquid-tumor-mode --- Enable liquid tumor mode. See Liquid Tumor Calling.
--sv-tin-contam-tolerance --- Set the Tumor-in-Normal (TiN) contamination tolerance level. See Liquid Tumor Calling for more information.
--sv-systematic-noise --- Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). For more information see Systematic Noise Filtering.
--sv-detect-systematic-noise --- Set to true to generate VCF output per normal sample. For more information see Systematic Noise Filtering.
--sv-build-systematic-noise-vcfs-list --- List of input VCFs from previous step. Enter one VCF per line. For more information see Systematic Noise Filtering.
--sv-min-edge-observations --- Remove all edges from the graph with less than this many observations. The default value is set to 3.
--sv-min-candidate-spanning-count --- Run SV caller and report all large SVs with at least this many spanning support observations. The default value is set to 3.
--sv-min-candidate-variant-size --- Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.
--sv-min-scored-variant-size --- After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.
--sv-hotspot-min-scored-variant-size --- After candidate identification, only score and report SVs/indels at or above this size inside the SV hotspot region, which includes FLT3, ARHGEF7, and KMT2A genes by default. The default value is set to 25.
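The three size options interact, and a short sketch can make the combined effect easier to see. The following Python function is only an illustration of how the documented defaults (10, 50, and 25) could combine. The function name and its parameters are hypothetical, and the actual DRAGEN logic may differ.

```python
def is_scored(size, in_hotspot=False,
              min_candidate=10, min_scored=50, hotspot_min_scored=25):
    """Return True if a candidate of this size would be scored and reported.

    Hypothetical helper: the defaults mirror the documented defaults of
    --sv-min-candidate-variant-size, --sv-min-scored-variant-size, and
    --sv-hotspot-min-scored-variant-size.
    """
    # Candidates below the candidate size are never considered.
    if size < min_candidate:
        return False
    # Inside the hotspot region the hotspot threshold replaces the general one.
    threshold = hotspot_min_scored if in_hotspot else min_scored
    return size >= threshold
```

Under these assumptions, a 30 base candidate is scored only when it lies inside the hotspot region.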
Modes of Operation
Structural Variant calling can run in the following modes:
Standalone --- Uses mapped BAM/CRAM input files. If you have not mapped and aligned your data yet, see Input Requirements. This mode requires the following options:
--enable-map-align false
--enable-sv true
Integrated --- Automatically runs on the output of the DRAGEN mapper/aligner. This mode requires the following options:
--enable-map-align true
--enable-sv true
--enable-map-align-output true
--output-format bam
You can also enable Structural Variant calling with any other caller.
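As a rough aid, the options each mode requires can be collected in a small lookup. The Python sketch below uses only the option names listed above. The helper name sv_mode_args and the extra_args parameter are hypothetical. Input, reference, and output options are intentionally left to the caller.

```python
# Options each mode requires, copied from the mode descriptions above.
MODE_OPTIONS = {
    "standalone": [
        ("--enable-map-align", "false"),
        ("--enable-sv", "true"),
    ],
    "integrated": [
        ("--enable-map-align", "true"),
        ("--enable-sv", "true"),
        ("--enable-map-align-output", "true"),
        ("--output-format", "bam"),
    ],
}


def sv_mode_args(mode, extra_args=()):
    """Return the mode's required options followed by caller-supplied arguments.

    Hypothetical helper: it only assembles an argument list and does not
    run DRAGEN or validate the extra arguments.
    """
    if mode not in MODE_OPTIONS:
        raise ValueError(f"unknown mode: {mode}")
    args = []
    for name, value in MODE_OPTIONS[mode]:
        args += [name, value]
    return args + list(extra_args)
```

For example, sv_mode_args("integrated", ["--sv-exome", "true"]) yields the four integrated-mode options followed by the exome setting.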
The following is an example command line for Integrated mode:
The following is an example command line for joint diploid calling in standalone mode:
Structural Variant VCF Output
The structural variants VCF output file is written to the output directory and is named <output-file-prefix>.sv.vcf.gz. The contents of the file depend on the type of analysis.
For each major analysis category (germline, tumor-normal, and tumor-only), the VCF output reflects the variant calls made under the variant calling mode that corresponds to that analysis type.
VCF Output
VCF output follows the VCF 4.2 specification for describing structural variants. It uses standard field names wherever possible. All custom fields are described in the VCF header. The following sections provide information on the variant representation details and the primary VCF field values.
VCF Sample Names
Sample names in the VCF output are extracted from the first read group (@RG) record found in the header of each input alignment file. If no sample name is found, a default label (SAMPLE1, SAMPLE2, etc.) is used instead.
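The naming rule can be approximated with a few lines of Python. This sketch assumes that the sample name is the SM tag of the @RG line, as defined by the SAM specification, and that the default label index follows the order of the input files. Both function names are hypothetical.

```python
def sample_name_from_header(header_text):
    """Return the SM value of the first @RG line, or None if it has none."""
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            for field in line.split("\t")[1:]:
                if field.startswith("SM:"):
                    return field[3:]
            return None  # only the first @RG record is inspected
    return None


def vcf_sample_names(headers):
    """Map each alignment header to a sample name or a SAMPLE<n> default."""
    names = []
    for index, header in enumerate(headers, start=1):
        name = sample_name_from_header(header)
        names.append(name if name else f"SAMPLE{index}")
    return names
```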
Small Indel Representation
All variants are reported in the VCF using symbolic alleles unless they are classified as a small indel, in which case full sequences are provided for the VCF REF and ALT allele fields. A variant is classified as a small indel if all of the following criteria are met:
The variant can be entirely expressed as a combination of inserted and deleted sequences.
The deletion or insertion length is less than 1000 bases.
The variant breakends and/or the inserted sequence are not imprecise.
The variant has not been converted from a deletion to intra-chromosomal breakends by the depth-based SV classification routine.
When VCF records are output in the small indel format, they also include the CIGAR INFO tag describing the combined insertion and deletion event.
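The four criteria above can be read as a single conjunction. The following Python sketch shows one way to express them. The Candidate class and its field names are hypothetical, and the length test assumes that both the deleted and the inserted length must stay below 1000.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    deleted_len: int
    inserted_len: int
    is_simple_indel: bool         # expressible purely as inserted + deleted sequence
    is_imprecise: bool            # breakends or inserted sequence are imprecise
    converted_to_breakends: bool  # changed by depth-based SV classification


def uses_small_indel_format(c, max_len=1000):
    """Return True if all small indel criteria hold for the candidate."""
    return (
        c.is_simple_indel
        and c.deleted_len < max_len
        and c.inserted_len < max_len
        and not c.is_imprecise
        and not c.converted_to_breakends
    )
```

Any candidate that fails one of the tests would be reported with a symbolic allele instead.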
Insertions with Incomplete Insert Sequence Assembly
Large insertions are reported in some cases even when the insert sequence cannot be fully assembled. In this case, the SV Caller reports the insertion using the <INS> symbolic allele and includes the special INFO fields LEFT_SVINSSEQ and RIGHT_SVINSSEQ to describe the assembled left and right ends of the insert sequence. The following is an example of such a record from the joint diploid analysis of NA12878, NA12891 and NA12892 mapped to hg19:
Normalizing Small Tandem Duplications
The SV caller can also represent tandem duplications as insertions. This representation creates ambiguity in how the variants are presented in the VCF output, especially for small tandem duplications. The representation can lead to complications, such as unrecognized call duplication.
To better normalize the SV caller output, so that the same variant type is not represented in two different VCF formats, small tandem duplications (< 1000 bases) are converted to insertions in the VCF output. Insertions converted from such tandem duplications have a formatting similar to incomplete insertions, using the symbolic allele <INS> for the ALT field. The following example shows an insertion, which was converted from a tandem duplication during this normalization process.
Converted insertions include copies of certain output fields. The fields appear the same as in a tandem duplication record. For example, INFO/DUPSVINSSEQ provides a copy of the breakpoint insertion value computed for the duplication. In the context of a duplication, the breakpoint insertion value would normally be written to INFO/SVINSSEQ. The following example shows a converted insertion with a breakpoint insertion value:
For more information about copied INFO fields, see VCF INFO Fields. All copied INFO fields use the same DUP prefix.
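The copying convention can be pictured as a simple key mapping. The Python sketch below assumes that each DUP-prefixed field mirrors the same-named field of the tandem duplication record. The text states this only for SVINSSEQ, so the remaining pairs and the helper name are assumptions.

```python
# Assumed source field for each DUP-prefixed copy (only the SVINSSEQ pair
# is stated explicitly in the text above).
DUP_FIELD_SOURCES = {
    "DUPSVLEN": "SVLEN",
    "DUPHOMLEN": "HOMLEN",
    "DUPHOMSEQ": "HOMSEQ",
    "DUPSVINSLEN": "SVINSLEN",
    "DUPSVINSSEQ": "SVINSSEQ",
}


def copied_dup_fields(dup_info):
    """Return the DUP-prefixed copies for the fields present in dup_info."""
    return {
        dup_key: dup_info[source]
        for dup_key, source in DUP_FIELD_SOURCES.items()
        if source in dup_info
    }
```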
Inversions
Inversions are reported as a set of breakends. For example, given a simple reciprocal inversion, four breakends are reported, sharing the same EVENT INFO tag. The following are example breakend records representing a simple reciprocal inversion:
Depth-Based SV Type Classification
In the germline calling model, when SV candidates are discovered from the sample data and have sufficient paired and split read evidence to be reported in the output, the SV caller applies additional depth-based tests to more accurately classify certain SV candidate types. Candidate breakpoints that are consistent with a deletion are tested for the lower read depth that is expected inside the deleted region. Candidate breakpoints consistent with a tandem duplication are tested for the higher read depth expected in the duplicated region. Candidate SV calls that fail the depth-based tests are still reported in the output, but changed to intrachromosomal breakends. Candidate SV calls that pass continue to be reported in the standard deletion and tandem duplication output formats.
SV Breakpoint Insertions
SVs frequently include a small sequence insertion at the breakpoint. Breakpoint insertions are represented differently depending on the SV type. The INFO/SVINSSEQ field in the VCF output provides the most general description of breakpoint insertions by describing the insertion sequence itself. The corresponding INFO/SVINSLEN field describes the length of the insertion sequence. For example, the following VCF record describes a large (~8.8 kb) deletion, which includes a single base insertion (C) between the left and right deletion breakends.
The INFO/SVINSSEQ field is also used to describe breakpoint insertions for tandem duplication and breakend records. The field can also be used to describe the insertion sequence of a large SV insertion.
Breakpoint insertions are represented differently in the VCF small indel format. The SV caller represents small deletions and insertions using the VCF small indel format instead of symbolic ALT alleles. Any breakpoint insertion that occurs in the VCF small indel format is represented as part of the VCF ALT field. See Small Indel Representation for the conditions under which this format is used for SVs.
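To illustrate how a breakpoint insertion ends up in the ALT field, the following Python sketch builds REF and ALT alleles for a deletion using the standard VCF convention of an anchor base that precedes the event. The function name is hypothetical and the sequence in the usage note is a toy example, not DRAGEN output.

```python
def small_indel_alleles(reference, pos, deleted_len, inserted_seq=""):
    """Build (REF, ALT) for a deletion with an optional breakpoint insertion.

    pos is the 1-based VCF POS, i.e. the base immediately before the
    deleted sequence.
    """
    anchor = reference[pos - 1]
    # REF spans the anchor base plus every deleted base.
    ref_allele = reference[pos - 1 : pos + deleted_len]
    # ALT keeps the anchor base and appends the breakpoint insertion, if any.
    alt_allele = anchor + inserted_seq
    return ref_allele, alt_allele
```

With the toy reference GTCAGCGA, POS 2, a three base deletion, and an inserted A, the function returns REF TCAG and ALT TA.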
In the following small indel format example, the VCF record describes a 57 base deletion that includes a single base insertion (A) between the left and right deletion breakends:
Breakend records include an additional encoding of breakpoint insertion sequence, as described in the VCF specification for the breakend ALT field. The SV caller also provides the information in the INFO/SVINSSEQ field for consistency with other SV record types.
The following example shows a breakend connecting regions of chromosomes 1 and 12 in the sample with a breakend insertion sequence of CA between the two breakends. The insertion sequence is described in both the ALT and INFO/SVINSSEQ fields.
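The ALT encoding of a breakpoint insertion can be recovered with a small parser. The Python sketch below follows the breakend ALT grammar of the VCF specification, in which the replacement string contains the REF base plus any inserted bases on the side facing the mate. The function name and the coordinates in the usage note are made up for illustration.

```python
import re

# Breakend ALT: bases, then a bracketed mate position, or the reverse order.
_BND_ALT = re.compile(
    r"^(?P<lead>[ACGTNacgtn]*)(?P<bracket>[\[\]])(?P<mate>[^\[\]]+)"
    r"(?P=bracket)(?P<trail>[ACGTNacgtn]*)$"
)


def breakend_insertion(ref, alt):
    """Return (mate_chrom, mate_pos, inserted_sequence) for a breakend ALT."""
    match = _BND_ALT.match(alt)
    if match is None:
        raise ValueError(f"not a breakend ALT: {alt}")
    lead, trail = match.group("lead"), match.group("trail")
    if lead and not trail and lead.startswith(ref):
        inserted = lead[len(ref):]                # REF base, then insertion
    elif trail and not lead and trail.endswith(ref):
        inserted = trail[: len(trail) - len(ref)]  # insertion, then REF base
    else:
        raise ValueError(f"unexpected breakend ALT layout: {alt}")
    chrom, _, pos = match.group("mate").rpartition(":")
    return chrom, int(pos), inserted
```

For a record with REF G and ALT GCA[chr12:5000[, the sketch reports an inserted sequence of CA, which is the value expected in INFO/SVINSSEQ.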
SV Breakpoint Insertion Orientation
The breakpoint insertion sequence is always provided with respect to the strand of the current SV record. Some breakend records have inverted orientation. For inverted orientations, each record in the breakend pair reports an insertion sequence that is the reverse complement of the sequence reported in its mate.
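The relationship between the two insertion sequences of an inverted pair can be checked with a reverse complement. The following Python sketch is a minimal illustration. Both function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def mate_insertion_sequence(svinsseq, inverted):
    """Insertion sequence expected in the mate record of a breakend pair."""
    return reverse_complement(svinsseq) if inverted else svinsseq
```

For an inverted pair in which one record reports CA, the mate would be expected to report TG.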
The following breakend pair example demonstrates an inverted orientation.
SV Breakpoint Homology
Each VCF record output by the SV caller is shifted to the left-most position of the exact homology range of the breakpoint. The exact homology range of the breakpoint is the continuous range of positions over which the SV could be represented while still describing the same SV haplotype. The exact homology range is described in the VCF output with the INFO/HOMSEQ field, which describes the sequence of the exact homology range, and the corresponding INFO/HOMLEN field, which describes the length of the range.
The following example shows a 62 base deletion with an 11 base breakend homology region. Without left-shifting, the SV has an equivalent representation anywhere from position 39497639 to 39497650.
The following simplified examples illustrate exact breakend homology for one three base deletion and one three base insertion. In both cases, the variant is left-shifted, so that the corresponding VCF record position is 2.
Deletion
Reference: GTCAGCGA
Variant: GT---CGA
Insertion
Reference: GT---CAG
Variant: GTCGGCAG
In both the insertion and deletion, there is a single base of exact breakend homology (C), so that the same variant can be represented one base to the right.
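The left-shifting and homology values for toy cases like these can be reproduced with a short Python sketch. This is an illustration of the general technique on plain strings, not the SV caller's implementation. The function names are hypothetical, and the returned position is the 1-based position of the base before the event.

```python
def left_shift_deletion(reference, start, length):
    """start is the 0-based index of the first deleted base.

    Returns (vcf_pos, homlen, homseq) for the left-most representation.
    """
    # Slide left while the base entering the deletion equals the base leaving it.
    while start > 0 and reference[start - 1] == reference[start + length - 1]:
        start -= 1
    # Count how far the left-most deletion could slide right unchanged.
    homlen = 0
    while (start + length + homlen < len(reference)
           and reference[start + homlen] == reference[start + length + homlen]):
        homlen += 1
    return start, homlen, reference[start:start + homlen]


def left_shift_insertion(reference, index, inserted):
    """index is the 0-based reference index before which bases are inserted."""
    # Slide left by rotating the inserted sequence one base at a time.
    while index > 0 and reference[index - 1] == inserted[-1]:
        inserted = inserted[-1] + inserted[:-1]
        index -= 1
    # Compare the sequence after the breakpoint with and without the insertion.
    shifted = inserted + reference[index:]
    homlen = 0
    while (index + homlen < len(reference)
           and shifted[homlen] == reference[index + homlen]):
        homlen += 1
    return index, homlen, reference[index:index + homlen]
```

Starting from the right-shifted forms, left_shift_deletion("GTCAGCGA", 3, 3) and left_shift_insertion("GTCAG", 3, "GGC") both give position 2 with one base of homology, C.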
VCF INFO Fields
IMPRECISE
Flag indicating that the structural variation is imprecise, i.e., the exact breakpoint location is not found
SVTYPE
Type of structural variant
SVLEN
Difference in length between REF and ALT alleles
END
End position of the variant described in this record
CIPOS
Confidence interval around POS
CIEND
Confidence interval around END
CIGAR
CIGAR alignment for each alternate indel allele
MATEID
ID of mate breakend
EVENT
ID of event associated to breakend
HOMLEN
Length of base pair identical homology at event breakpoints
HOMSEQ
Sequence of base pair identical homology at event breakpoints
SVINSLEN
Length of insertion
SVINSSEQ
Sequence of insertion
LEFT_SVINSSEQ
Known left side of insertion for an insertion of unknown length
RIGHT_SVINSSEQ
Known right side of insertion for an insertion of unknown length
PAIR_COUNT
Read pairs supporting this variant where both reads are confidently mapped
BND_PAIR_COUNT
Confidently mapped reads supporting this variant at this breakend (mapping may not be confident at remote breakend)
UPSTREAM_PAIR_COUNT
Confidently mapped reads supporting this variant at the upstream breakend (mapping may not be confident at downstream breakend)
DOWNSTREAM_PAIR_COUNT
Confidently mapped reads supporting this variant at this downstream breakend (mapping may not be confident at upstream breakend)
BND_DEPTH
Read depth at local translocation breakend
MATE_BND_DEPTH
Read depth at remote translocation mate breakend
JUNCTION_QUAL
If the SV junction is part of an EVENT (i.e., a multi-adjacency variant), this field provides the QUAL value for the adjacency in question only
SOMATIC
Flag indicating a somatic variant
SOMATICSCORE
Somatic variant quality score
SOMATIC_EVENT
If the probability of the SV being a germline variant is greater than the probability of the SV being a somatic variant, this is 0. Otherwise, this is 1.
JUNCTION_SOMATICSCORE
If the SV junction is part of an EVENT (i.e., a multi-adjacency variant), this field provides the SOMATICSCORE value for the adjacency in question only
CONTIG
Assembled contig sequence, if the variant is not imprecise (with --outputContig)
DUPSVLEN
Length of duplicated reference sequence
DUPHOMLEN
Length of base pair identical homology at event breakpoints excluding duplicated reference sequence
DUPHOMSEQ
Sequence of base pair identical homology at event breakpoints excluding duplicated reference sequence
DUPSVINSLEN
Length of inserted sequence after duplicated reference sequence
DUPSVINSSEQ
Inserted sequence after duplicated reference sequence
NotDiscovered
Variant candidate specified by the user and not discovered from input sequencing data
UserInputId
Variant ID from user input VCF
KnownSVScoring
Variant is associated with a user specified input variant, therefore scoring and filtration criteria are relaxed under a stronger prior assumption of truth
VCF FORMAT Fields
GT
Genotype
FT
Sample filter, 'PASS' indicates that all filters have passed for this sample
GQ
Genotype Quality
PL
Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification
PR
Number of spanning read pairs which strongly support the REF or ALT alleles
SR
Number of split-reads which strongly support the REF or ALT alleles
VF
Number of fragments which strongly support the REF or ALT alleles
VCF FILTER Fields
Germline
The following table lists the VCF FILTER fields applied to germline VCF output.
MinQUAL
Record
QUAL score is less than a threshold. The filter is not applied to records with the KnownSVScoring flag.
Ploidy
Record
For DEL and DUP variants, the genotypes of overlapping variants with similar size are inconsistent with diploid expectation. The filter is not applied to records with the KnownSVScoring flag.
MaxDepth
Record
Depth is greater than 3x the median chromosome depth near one or both variant breakends. The filter is not applied to records with the KnownSVScoring flag.
MaxMQ0Frac
Record
For a small variant (< 1000 bases), the fraction of reads in all samples with MAPQ0 around either breakend exceeds 0.4. The filter is not applied to records with the KnownSVScoring flag.
NoPairSupport
Record
For variants significantly larger than the paired read fragment size, no paired reads support the alternate allele in any sample. The filter is not applied to records with the KnownSVScoring flag.
SampleFT
Record
No sample passes all the sample-level filters.
MinGQ
Sample
GQ score is less than 15. The filter is applied at the sample level and is not applied to records with the KnownSVScoring flag.
HomRef
Sample
Homozygous reference call. The filter is applied at the sample level.
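The two record-level filters with explicit numeric thresholds can be sketched in Python as follows. This is an illustration of the stated thresholds only. The function name and its inputs are hypothetical, and how DRAGEN measures depth and MAPQ0 fraction near a breakend is not described here.

```python
def germline_record_filters(size, breakend_depths, median_chrom_depth,
                            mq0_fractions, known_sv_scoring=False):
    """Return the subset of MaxDepth and MaxMQ0Frac triggered by a record.

    breakend_depths and mq0_fractions hold one value per variant breakend.
    """
    # Both filters are skipped for records carrying the KnownSVScoring flag.
    if known_sv_scoring:
        return []
    filters = []
    if any(depth > 3 * median_chrom_depth for depth in breakend_depths):
        filters.append("MaxDepth")
    if size < 1000 and any(fraction > 0.4 for fraction in mq0_fractions):
        filters.append("MaxMQ0Frac")
    return filters
```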
Tumor-Normal Somatic
The following table lists the VCF FILTER fields applied to tumor-normal somatic VCF output.
MinSomaticScore
Record
SOMATICSCORE is less than a threshold.
MaxDepth
Record
Normal sample site depth is greater than 3x the median chromosome depth near one or both variant breakends. The filter is not applied to records with the KnownSVScoring flag.
MaxMQ0Frac
Record
For a small variant (< 1000 bases) in the normal sample, the fraction of reads with MAPQ0 around either breakend exceeds 0.4. The filter is not applied to records with the KnownSVScoring flag.
SystematicNoise
Record
Variant overlaps with one of the paired regions in the systematic noise BEDPE file with matched orientation. The filter is not applied to records with the KnownSVScoring flag.
Tumor-Only
The following table lists the VCF FILTER fields applied to tumor-only VCF output.
MinSomaticScore
Record
SOMATICSCORE is less than a threshold.
SystematicNoise
Record
Variant overlaps with one of the paired regions in the systematic noise BEDPE file with matched orientation. The filter is not applied to records with the KnownSVScoring flag.
MaxDepth
Record
Normal sample site depth is greater than 3x the median chromosome depth near one or both variant breakends. The filter is not applied to records with the KnownSVScoring flag.
MaxMQ0Frac
Record
For a small variant (< 1000 bases), the fraction of reads with MAPQ0 around either breakend exceeds 0.4. The filter is not applied to records with the KnownSVScoring flag.
Interpretation of VCF Filters
There are two levels of VCF filters: record level (FILTER) and sample level (FORMAT/FT). Most record-level filters are independent of those at the sample level. However, in a germline analysis, if none of the samples pass all sample-level filters, the SampleFT record-level filter is applied.
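The dependency between the two levels can be sketched in Python. The example below derives sample-level filters from GT and GQ using the documented germline MinGQ threshold of 15, then decides whether SampleFT applies. The function names are hypothetical, and only unphased 0/0 is treated as homozygous reference in this sketch.

```python
def sample_filters(gt, gq, min_gq=15, known_sv_scoring=False):
    """Return the sample-level filters triggered by one sample."""
    filters = []
    # MinGQ is skipped for records carrying the KnownSVScoring flag.
    if gq < min_gq and not known_sv_scoring:
        filters.append("MinGQ")
    if gt == "0/0":
        filters.append("HomRef")
    return filters


def needs_sample_ft(per_sample_filters):
    """Return True if no sample passes all sample-level filters."""
    return bool(per_sample_filters) and all(
        len(filters) > 0 for filters in per_sample_filters
    )
```

A record with one heterozygous sample at GQ 30 and one homozygous reference sample would not receive SampleFT, because the first sample passes.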
Interpretation of INFO/EVENT Field
Some structural variants reported in the VCF, such as translocations, represent a single novel sequence junction in the sample. The INFO/EVENT field indicates that two or more such junctions are hypothesized to occur together as part of a single variant event. All individual variant records belonging to the same event share the same INFO/EVENT string. Note that although such an inference could be applied after SV calling by analyzing the relative distance and orientation of the called variant breakpoints, the SV Caller incorporates this event mechanism into the calling process to increase sensitivity towards such larger-scale events. Given that at least one junction in the event has already passed standard variant candidacy thresholds, sensitivity is improved by lowering the evidence thresholds for additional junctions which occur in a pattern consistent with a multijunction event (such as a reciprocal translocation pair).
Although this mechanism could generalize to events including an arbitrary number of junctions, it is currently limited to two. Thus, at present it is most useful for identifying and improving sensitivity towards reciprocal translocation pairs.
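Downstream tools can collect the records of one event by grouping on the shared INFO/EVENT string. The following Python sketch works on tab-delimited VCF data lines. The function names are hypothetical and the sketch performs no validation beyond splitting the columns.

```python
from collections import defaultdict


def parse_info(info_field):
    """Parse a VCF INFO column into a dict; flags map to True."""
    info = {}
    for item in info_field.split(";"):
        key, separator, value = item.partition("=")
        info[key] = value if separator else True
    return info


def group_by_event(vcf_lines):
    """Map each INFO/EVENT value to the IDs of the records that share it."""
    events = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        event = parse_info(fields[7]).get("EVENT")
        if event is not None:
            events[event].append(fields[2])
    return dict(events)
```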
SV Variant Allele Fraction (VAF) Calculation
Because some supporting read pairs provide both PR and SR evidence, VF is defined as an additional field that counts evidence per sequence fragment (read pair). VF lists the number of fragments that strongly support the REF and ALT alleles, in that order, which allows an unbiased calculation of the Variant Allele Fraction (VAF), where VAF = VF_ALT/(VF_ALT+VF_REF).
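The VAF formula can be applied directly to the FORMAT and sample columns of a record. The Python sketch below assumes that VF holds two comma-separated integers in REF,ALT order, as described above. The function name and the sample values in the usage note are made up.

```python
def variant_allele_fraction(format_keys, sample_column):
    """Compute VAF = VF_ALT / (VF_ALT + VF_REF) for one sample.

    Returns None when no fragment supports either allele.
    """
    values = dict(zip(format_keys.split(":"), sample_column.split(":")))
    vf_ref, vf_alt = (int(count) for count in values["VF"].split(","))
    total = vf_ref + vf_alt
    return vf_alt / total if total else None
```

For a sample with VF of 15,5, the sketch returns 0.25.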
VCF ID Field
The VCF ID, or identifier, field can be used for annotation, or in the case of BND (breakend) records for translocations, the ID value is used to link breakend mates or partners. The following is an example of a VCF ID field from the SV caller:
The value provided in the ID field reflects the SV association graph edge(s) from which the SV or indel was discovered. The value is guaranteed to be unique within any single VCF output file produced by the SV Caller. These values are therefore used to link associated breakend records using the standard VCF MATEID key. The exact structure of this identifier may change in the future. You can use the entire value as a unique key, but parsing the key could lead to incompatibility with future DRAGEN versions. See the DRAGEN Software Support Site for information on the latest version of DRAGEN.
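Because the ID is best treated as an opaque key, mate linking can be done with a plain dictionary lookup. The following Python sketch pairs breakend records through MATEID without inspecting the structure of the identifier. The function name and the IDs in the test data are invented for illustration.

```python
def pair_breakends(records):
    """Pair breakend records whose MATEID values point at each other.

    records is a list of (record_id, info_dict) tuples. Each pair is
    returned once, in order of first appearance.
    """
    by_id = {record_id: info for record_id, info in records}
    pairs = []
    seen = set()
    for record_id, info in records:
        mate_id = info.get("MATEID")
        if mate_id is None or record_id in seen:
            continue
        # The whole ID is used as the key; it is never parsed.
        if mate_id in by_id and by_id[mate_id].get("MATEID") == record_id:
            pairs.append((record_id, mate_id))
            seen.update((record_id, mate_id))
    return pairs
```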
Convert SV VCF to BEDPE Format
It can sometimes be convenient to express structural variants in BEDPE format. For such applications, DRAGEN recommends the script vcfToBedpe available on GitHub. The repository is forked from @hall-lab with modifications to support VCF 4.2 SV format.
BEDPE format greatly reduces structural variant information compared to the SV Caller VCF output. In particular, breakend orientation, breakend homology, and insertion sequence are lost, in addition to the ability to define fields for locus and sample specific information. For this reason, Illumina only recommends BEDPE as a temporary output for applications that require it.