Small Variant Calling
Last updated
Was this helpful?
Last updated
Was this helpful?
The DRAGEN Small Variant Caller is a high-speed haplotype caller implemented with a hybrid of hardware and software. The caller performs localized de novo assembly in regions of interest to generate candidate haplotypes, and then performs read likelihood calculations using a hidden Markov model (HMM).
Variant calling is disabled by default. To enable variant calling, set the --enable-variant-caller
option to true. The VCF header is annotated with ##source=DRAGEN_SNV
to indicate the file is generated by the DRAGEN SNV pipeline.
The DRAGEN Small Variant Caller performs the following steps:
Active Region Identification---Identifies areas where multiple reads disagree with the reference are identified, and selects windows around them (active regions) for processing.
Localized Haplotype Assembly--- Assembles all overlapping reads in each active region into a de Bruijn graph (DBG). A DBG is a directed graph based on overlapping K-mers (length K subsequences) in each read or multiple reads. When all reads are identical, the DBG is linear. Where there are differences, the graph forms bubbles of multiple paths that diverge and rejoin. If the local sequence is too repetitive and K is too small, cycles can form, which invalidate the graph. DRAGEN uses K=10 and 25 as the default values. If those values produce an invalid graph, then additional values of K=35, 45, 55, 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, DRAGEN extracts every possible path to produce a complete list of candidate haplotypes, ie, hypotheses for what the true DNA sequence might be on at least one strand. In addition to graph assembly, haplotypes are also generated via columnwise detection, with candidate variant events identified directly from BAM alignments. Columnwise detection is enabled by default in all small variant calling pipelines and is supplementary to the DBG, but is especially useful in highly repetitive regions where DBG assembly of reads is more likely to fail.
Haplotype Alignment---Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome. The alignments determine what variations from the reference are present.
Read Likelihood Calculation---Tests each read against each haplotype to estimate a probability of observing the read assuming the haplotype was the true original DNA sampled. This calculation is performed by evaluating a pair hidden Markov model (HMM), which accounts for the various possible ways the haplotype might have been modified by PCR or sequencing errors into the read observed. The HMM evaluation uses a dynamic programming method to calculate the total probability of any series of Markov state transitions arriving at the observed read.
Genotyping---Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. Calculations use the constituent probabilities of observing each read, given each haplotype from the pair HMM evaluation. These calculations feed into the Bayesian formula to calculate the likelihood that each genotype is the genotype of the sample being analyzed, given the entire read pileup observed. Genotypes with maximum likelihood are reported.
In most pipelines, DRAGEN reports two types of depth counts, both of which may differ from the information in the BAM pileup due to various filtering steps that are applied throughout variant calling. Briefly:
Unfiltered depth is the number of reads (fragment-based) covering the position, downstream of any read collapsing, deduplication, downsampling and read disqualification, but upstream of informative read determination. Overlapping mate pairs are counted as single reads. When overlapping mate pairs are present, this may cause an apparent discrepancy between the reported depth and the pileup as viewed in a browser such as IGV. To resolve this, use the "View as pairs" option in IGV. Unfiltered depth is reported as INFO/DP, except in the case of gVCF homref calls, where it is reported as FORMAT/DP.
Informative depth is the number of reads (fragment-based) actually used to make the calling decision, where badly mated reads and uninformative reads (reads that could not be assigned to a specific allele) have been excluded. Informative depth is reported as FORMAT/DP, except in the case of gvcf homref calls, where it is not reported. The FORMAT/AD and FORMAT/AF fields are based on informative depth.
The following figure summarizes the different filtering steps in more detail.
Filter 1 acts on the reads present in the BAM input (in UMI pipelines, these are the collapsed reads produced by the read collapsing step, not the raw reads) and filters out the following reads:
Duplicate reads.
Soft-clipped bases. DRAGEN filters out soft-clipped bases only when calculating coverage reports.
[Somatic] Reads with MAPQ=0.
[Somatic] Reads with MAPQ < vc-min-tumor-read-qual, where vc-min-tumor-read-qual >1.
By default, germline runs with machine learning (ML) enabled consider all reads, even those with MAPQ 0, resulting in increased sensitivity. MAPQ read filtering is controlled by --vc-min-read-qual
for germline and tumor/normal (T/N) runs, but it does not affect tumor-only (T/O) runs. In contrast, --vc-min-tumor-read-qual
controls filtering for tumor samples in T/N and T/O runs and has no effect on normal-only samples.
Filter 2 trims bases with BQ < 10 and filters out the following reads:
Unmapped reads.
Secondary reads.
Reads with bad cigars.
Filter 3 occurs after downsampling and HMM. Filter 3 filters out the following reads:
Disqualified reads. Reads are disqualified if their HMM score is below a threshold.
Filter 4 occurs after the genotyper runs. The genotyper adds annotation information to the FORMAT field. Filter 4 filters out the following reads:
Badly mated reads. A badly mated read is a read where the pair is mapped to two different reference contigs.
Non-informative reads. For example, if the HMM scores of the read against two different haplotypes are almost equal, the read is filtered out because it does not provide enough information to distinguish which of the two haplotypes are more likely.
Since DRAGEN 4.3 the mosaic small variant caller runs downstream of the germline small variant caller. Non-cancer post-zygotic mosaic variants with typical AF lower than 50% detected by the mosaic caller are reported in the output VCF file with a MOSAIC
INFO flag. As default, MOSAIC
tagged variants with AF
smaller than 20% are filtered with the MosaicLowAF
filter. To further enhance sensitivity, if the median depth of the sample detected by the ploidy estimator exceeds 100x
, a default 10% threshold will be applied. This is likely to have a greater impact on exome data, which typically has higher coverage. Exome users looking to control the number of low allele frequency (AF) mosaic variants can set the option --vc-mosaic-af-filter-threshold
to 0.2 to override the dynamic coverage-based thresholding.
The following options control the variant caller stage of the DRAGEN host software.
--enable-variant-caller
Set --enable-variant-caller
to true to enable the variant caller stage for the DRAGEN pipeline.
--vc-target-bed
[Optional] Restricts processing of the small variant caller, target BED related coverage, and callability metrics to regions specified in a BED file. The BED file is a text file containing at least three tab-delimited columns. The first three columns are chromosome, start position, and end position. The positions are zero-based. For example:
If the reference span of the variant overlaps with any of the regions in the target BED, then the variant is output. If the reference span does not overlap, the variant is not output. For SNPs and Insertions, the reference span is 1 bp. For deletions, the reference span is the length of the deletion.
--vc-target-bed-padding
[Optional] Pads all target BED regions with the specified value. For example, if a BED region is 1:1000–2000 and the specified padding value is 100, the result is equivalent to using a BED region of 1:900–2100 and a padding value of 0. Any padding added to --vc-target-bed-padding is used by the small variant caller and by the target bed coverage/callability reports. The default padding is 0.
--vc-target-coverage
Specifies the target coverage for downsampling. The default value is 500 for germline mode and 50 for somatic mode.
--vc-remove-all-soft-clips
Set to true to ignore soft-clipped bases during the haploytype assembly step.
--vc-decoy-contigs
Specifies a comma-separated list of contigs to skip during variant calling. This option can be set in the configuration file.
--vc-enable-decoy-contigs
Set to true to enable variant calls on the decoy contigs. The default value is false.
--vc-enable-phasing
Enable variants to be phased when possible. The default value is true.
--vc-combine-phased-variants-distance
Set the maximum distance in base pairs between phased variants to be combined. The default value is 0, which disables the option. If the user wants to enable the combining of phased variants the recommended value of the distance is 15 base pairs. The valid range is [0; 15].
--vc-enable-mosaic-detection
Set to true to enable DRAGEN mosaic detection. Set to false to disable DRAGEN mosaic detection.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF
filter to mosaic calls. All MOSAIC
tagged variants with AF
smaller than the AF
threshold are filtered with the MosaicLowAF
filter. The default mosaic AF
filter threshold is set to 0.2
if the median depth of the sample detected by the ploidy caller is <= 100x
and 0.1
if the detected depth is > 100x
.
You can use the following options for downsampling reads in the small variant calling pipeline.
--vc-target-coverage
Specifies the maximum number of reads covering any given position.
--vc-max-reads-per-active-region
Specifies the maximum number of reads covering a given active region.
--vc-max-reads-per-raw-region
Specifies the maximum number of reads covering a given raw region.
--vc-min-reads-per-start-pos
Specifies the minimum number of reads with a start position overlapping any given position.
--high-coverage-support-mode
Applies the high coverage mode down-sample options if set to true. Enabling this option is recommended for targeted panels with coverage over 1000x, but will slow down run time.
For mitochondrial small variant calling, the downsampling options can be set separately because the mitochondrial contig contains a higher depth than the rest of the contigs in a WGS data set. The following are the downsampling options for the mitochondrial contig.
--vc-target-coverage-mito
--vc-max-reads-per-active-region-mito
--vc-max-reads-per-raw-region-mito
The target coverage and max/min reads in raw/active region options are not directly related and could be triggered independently.
The following are the default downsampling values for each small variant calling mode.
Germline
--vc-target-coverage
500
Germline
--vc-max-reads-per-active-region
10000
Germline
--vc-max-reads-per-raw-region
30000
Somatic
--vc-target-coverage
1000
Somatic
--vc-max-reads-per-active-region
10000
Somatic
--vc-max-reads-per-raw-region
30000
High Coverage
--vc-target-coverage
100000
High Coverage
--vc-max-reads-per-active-region
200000
High Coverage
--vc-max-reads-per-raw-region
200000
Mitochondrial
--vc-target-coverage-mito
40000
Mitochondrial
--vc-max-reads-per-active-region-mito
200000
Mitochondrial
--vc-max-reads-per-raw-region-mito
200000
The target coverage downsampling step runs first and is meant to limit the the total coverage at a given position. This step is approximate and the coverage after downsampling at a given position could be a bit higher than the threshold due to the --vc-min-reads-per-start-pos
setting.
If the number of reads at any position with same start position is equal to or lower than the --vc-min-reads-per-start-pos
, that position is skipped for downsampling to make sure that there is always at least a minimum number of reads (set to --vc-min-reads-per-start-pos
, default value is 10) at any start position.
The next downsampling step is to apply the --vc-max-reads-per-raw-region
and --vc-max-reads-per-active-region
limits. These options are used to limit the total number of reads in an entire region using a leveling downsampling method.
This downsampling mechanism scans each start position from the start boundary of the region and discards one read from that position, then moves on to the next position, until the total number of reads falls below the threshold. It can potentially take several passes across the entire region for the total number of reads in the entire region to fall below the threshold. After the threshold is met, the downsampling step is stopped regardless of which position was considered last in the region.
When downsampling occurs, the choice of which reads to keep or remove is random. However, the random number generator is seeded to a default value to make sure that the generator produces the same set of values in each run. This ensures reproducible results, which means there is no run to run variation when using the same input data.
By default, the DRAGEN small variant caller outputs a VCF file (<output-file-prefix>.hard-filtered.vcf.gz
) in VCF 4.2 format containing both filtered and PASSing variant records.
Parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0.
Left aligning a variant means shifting the start position of that variant to the left till it is no longer possible to do so.
A variant is normalized if and only if it is parsimonious and left aligned
Additional notes on variant representation in the DRAGEN VCF:
Reference-trimming of alleles: A single padding reference base is used to represent insertions and deletions (i.e. the reference base preceding the insertion or deletion is included).
A multiallelic site is a specific locus in a genome that contains three or more observed alleles, counting the reference as one, and therefore allowing for two or more variant alleles. Multi-allelic calls are output in a single variant record in the VCF as follows:
Two indels are considered as multi-allelic if they share the same reference base preceding the indel. For example:
DRAGEN employs joint detection of overlapping variants, a feature designed to detect overlapping SNP and INDEL variants and output them in a single VCF record represented as a multi-allelic genotype. However, if a SNP overlaps an INDEL but the SNP does not align with the reference base preceding the indel, the SNP and INDEL are represented as two different variant records, as shown in the example below.
In single sample VCF and gVCF, the QUAL follows the definition of the VCF specification. For more information on the VCF specification, see the most current VCF documentation available on samtools/hts-specs GitHub repository.
QUAL is the Phred-scaled probability that the site has no variant and is computed as:
That is, QUAL = GP (GT=0/0), where GP = posterior genotype probability in Phred scale. QUAL = 20 means there is 99% probability that there is a variant at the site. The GP values are also given in Phred-scaled in the VCF file.
GQ for non-homref calls is the Phred-scaled probability that the call is incorrect. GQ=-10*log10(p), where p is the probability that the call is incorrect. GQ=-10*log10(sum(10.^(-GP(i)/10))) where the sum is taken over the GT that did not win. GQ of 3 indicates a 50 percent chance that the call is incorrect, and GQ of 20 indicates a 1 percent chance that the call is incorrect.
In gvcf mode, the evidence in favor of homozygous reference calls is also assessed. However, the posterior probability is not of interest in this case (with zero evidence, e.g. due to zero coverage, the strong prior in favor of homref would yield a strong posterior in favor of homref), so the value of GQ for homref calls reflects the evidence directly, defined using the likelihood ratio between the likelihoods for the homref hypothesis and the strongest competing variant hypothesis: 10*log10[P(D|homref)/P(D|variant)] where D represents the pileup data.
QD is the QUAL normalized by the read depth, DP.
Description
Probability that the site has no variant
Probability that the call is incorrect
Evidence supporting homref call
Qual normalized by depth
Formulation
QUAL = GP(GT=0/0)
GQ =-10*log10(p)
GQ = 10*log10[P(D|homref)/P(D|variant)]
QUAL/DP
Scale
Unsigned Phred
Unsigned Phred
Signed Phred
Unsigned Phred
Numerical example
QUAL=20: 1 % chance that there is no variant at the site. Qual=50: 1 in 1e5 chance that there is no variant at the site.
GQ=3, 50% that the call is incorrect. GQ=20, 1% change that the call is incorrect.
GQ=0: no evidence. GQ>0: evidence favors homref.
The QUAL scores generated by DRAGEN differ significantly from those of GATK, as DRAGEN's algorithms for small variant detection provide more realistic scores. This improvement stems from two key factors:
Correlated Errors: DRAGEN accounts for real-world correlated errors, unlike GATK, which assumes errors are uncorrelated, leading to inflated QUAL scores in GATK.
Machine Learning (ML): DRAGEN-ML further recalibrates QUAL scores, making them more accurate than DRAGEN without ML. With ML enabled, QUAL scores tend to not exceed 75, compared to GATK, where they can exceed 1000. Consequently, DRAGEN-ML uses a lower QUAL filtering threshold (3.0103) compared to DRAGEN without ML (10.41 for SNP and 7.83 for Indel).
Our recommendation is to use the default filtering thresholds in DRAGEN: QUAL threshold of 3.0103 with ML enabled.
A genomic VCF (gVCF) file contains information on variants and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. The gVCF file includes an artificial <NON_REF>
allele. Reads that do not support the reference or any variants are assigned the <NON_REF>
allele. DRAGEN uses these reads to determine if the position can be called as a homozygous reference, as opposed to remaining uncalled. The resulting score represents the Phred-scaled level of confidence in a homozygous reference call. In germline mode, the score is FORMAT/GQ
and in somatic mode the score is FORMAT/SQ
.
The following options are available to enable and control gVCF output.
--vc-emit-ref-confidence
To enable gVCF output, set to GVCF
. By default, contiguous runs of homozygous reference calls with similar scores are collapsed into blocks (hom-ref blocks). Hom-ref blocks save disk space and processing time of downstream analysis tools. DRAGEN recommends using the default mode.
To produce unbanded output, set --vc-emit-ref-confidence
to BP_RESOLUTION
.
--vc-enable-vcf-output
To enable VCF file output during a gVCF run, set to true. The default value is false.
--vc-gvcf-bands
If using the default --vc-emit-ref-confidence gvcf
(banded mode), DRAGEN collapses gVCF records with a similar GQ or SQ score. By default, the cutoffs are 1 10 20 30 40 60 80
for germline and 1 3 10 20 50 80
for somatic. For example, to define the bands [0, 10), [10, 50), and ≥ 50 use --vc-gvcf-bands 10 50
.
--vc-compact-gvcf
This option, when used for germline in conjunction with --vc-emit-ref-confidence gvcf
, produces a much smaller gVCF output file than the default. It can be used when the gVCF is destined for ingestion into gVCF Genotyper, offering further savings on disk space and gVCF Genotyper runtime compared to the default. This option implies --vc-gvcf-bands 0 1 10 20 30
and additionally omits certain metrics that are not used by gVCF Genotyper. Note that files generated using this option will be rejected by the Pedigree Caller.
Not all entries in the gVCF are contiguous. The file might contain gaps that are not covered by either a variant line or a hom-ref block. The gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
In germline mode, the thresholds for calling are lower for gVCFs than for VCFs. The gVCF output could show a different number of variants than a VCF run for the same sample. There is likely a different number of biallelic and multiallelic calls because gVCF mode includes all possible alleles at a locus, rather than only the two most likely alleles. This means that a biallelic call in the VCF can be output as a multiallelic call in the gVCF. The genotype in the gVCF still points to the two most likely alleles, so the variant call remains the same.
The following are example gVCF records that include a hom-ref block call and a variant call.
In single sample gVCF, FORMAT/DP reported at a HomRef position is the median DP in the band and AD is the corresponding value, so sum of AD will be DP even in a homref band. The minimum is also computed and printed as MIN_DP for the band.
DRAGEN supports output of phased variant records in both the germline and the somatic VCF and gVCF files. When two or more variants are phased together, the phasing information is encoded in a sample-level annotation, FORMAT/PS. FORMAT/PS identifies which set the phased variant is in. The value in the field in an integer representing the position of the first phased variant in the set. All records in the same contig with matching PS values belong to the same phase set.
The following is an example of a DRAGEN single sample gVCF, where two SNPs are phased together.
During the genotyping step, all haplotypes and all variants are considered over an active region. For each pair of variants, if both variants occur on all of the same haplotypes or if either is a homozygous variant, they are phased together. If the variants only occur on different haplotypes, they are phased opposite to each other. If any heterozygous variants are present on some of the same haplotypes but not others, phasing is aborted and no phasing information is output for the active region.
The DRAGEN small variant caller supports combining multiple nearby phased variant records into a single VCF record. When the functionality is enabled, the caller will output both multi-nucleotide variants (MNVs; multiple phased SNVs combined into a single VCF record) and complex indels (multiple phased insertions, deletions, and/or SNVs combined into a single VCF record) in the VCF.
For example, assuming reference at position chr2 115035
is A
, the following two phased SNVs can be combined into an MNV.
The phased SNVs are combined as follows.
The following two phased indels can be combined into a complex indel.
The phased indels are combined as follows.
Individual variant records existing on the same haplotypes are deemed to be in phase and will be merged if they are within a configurable distance threshold of one another. For each consecutive pair of phased variants in a phase set, the variants will be combined if the difference between their POS fields does not exceed the threshold. For deletions, the number of deleted bases is taken into account and subtracted from the POS difference between the deletion and downstream phased variant when calculating the distance between calls. Please note that variant records without a PS tag may be merged into MNVs and complex indels together with calls having PS tags due to algorithmic differences between variant phasing and variant merging.
In the somatic pipeline, combining phased variants is enabled by default, consistent with HGVS guidelines. In the germline pipeline, the functionality can be enabled using the command line options detailed below.
--vc-combine-phased-variants-distance-snvs-only
Specifies the maximum distance over which phased SNVs will be combined into an MNV. This option applies only to phased variant groups consisting of only SNVs. The default is 2
for somatic and 0
for germline (disabled). For phased variant groups that include both SNVs and indels, the analogous vc-combine-phased-variants-distance
option applies.
--vc-combine-phased-variants-distance
Specifies the maximum distance over which phased insertions, deletions, and SNVs will be combined into a complex indel. This distance threshold will be applied to any group of phased variants that includes at least one indel. The default is 2
for somatic and 0
for germline (disabled).
For both options, a value of 0 disables merging. When either option is enabled with a value [1, 15], all phased variants in the group that are within the provided distance value of one another are merged into an MNV (for vc-combine-phased-variants-distance-snvs-only
) or complex indel (for vc-combine-phased-variants-distance
).
--vc-mnv-emit-component-calls
Specifies whether or not to emit the individual component variant records along with the merged variant records. When set to true
, all component calls making up an MNV or complex indel will be emitted in the VCF along with the merged variant record. The default is true
for somatic and false
(disabled) for germline.
--vc-combine-phased-variants-max-vaf-delta
Specifies the threshold for filtering MNV component variant calls when the events comprising to the MNV have different allele frequencies. The default value is 0.1, which means that any SNV or INDEL with an AF that is more than 0.1 greater than the MNV AF shall be emitted as a PASSing call, while the remaining components shall be emitted with the 'mnv_component' FILTER flag. Only applicable when vc-combine-phased-variants-distance
is greater than 0 and vc-mnv-emit-component-calls
is true. (Default=0.1)
DRAGEN can output all component SNVs and/or INDELs that make up a merged MNV or complex indel along with the merged call itself. Merged calls and their component calls can be identified and linked to one another by a common value in the INFO.MNVTAG field. This behavior is default in somatic mode and can be enabled in germline mode by setting --vc-mnv-emit-component-calls=true
. When vc-mnv-emit-component-calls
is enabled and DRAGEN reports an MNV or complex indel call, the component calls that make up the merged call are filtered with the mnv_component
filter flag unless the difference in VAF between the component call and merged call is greater than the value of vc-combine-phased-variants-max-vaf-delta
(default: 0.1). This avoids component calls being doubly represented in the VCF output. In the case where VAF difference between a given component call and merged call exceeds the threshold value of vc-combine-phased-variants-max-vaf-delta
, that is considered evidence for the component call existing both as a standalone variant and as part of the MNV or complex indel and the component call will be emitted as a PASSing VCF record. For example, in the following MNV group, there are two component SNVs making up the MNV. The MNV call is emitted as a PASSing call while one component SNV with AF equal to that of the MNV is filtered with the mnv_component
FILTER flag and the other component SNV with VAF greater than that of the MNV by more than 0.1 is emitted as a PASSing call.
DRAGEN supports phasing of the genotypes listed in the below table. Only the first row in the table is relevant to somatic, since the somatic pipeline only emits 0/1 and 0|1 genotypes. MNV calls can still be phased with other variant calls that fell outside the phased variants distance.
0|1
0|1
0/1
Germline and Somatic
Yes in 4.0
0/1
1/1
1/2
Germline
No
0/1
1/2
1/2
Germline
No
1/1
1/1
1/1
Germline
Yes in 4.2
Examples of diploid haplotypes where phasing is supported:
Examples of diploid haplotypes where phasing is not supported:
The small variant caller currently only supports either ploidy 1 or 2 on all contigs within the reference except for the mitochondrial contig, which uses a continuous allele frequency approach (see Mitochondrial Calling). The selection of ploidy 1 or 2 for all other contigs is determined as follows.
If --sample-sex
is not specified on the command line, the Ploidy Estimator determines the sex. If the Ploidy Estimator cannot determine the sex karyotype or detects sex chromosome aneuploidy, all contigs are processed with ploidy 2.
If --sample-sex
is specified on the command line, contigs are processed as follows.
For female samples, DRAGEN processes all contigs with ploidy 2, and marks variant calls on chrY with a filter PloidyConflict.
For male samples, DRAGEN processes all contigs with ploidy 2, except for the sex chromosomes. DRAGEN processes chrX with ploidy 1, except in the PAR regions, where it is processed with ploidy 2. chrY is processed with ploidy 1 throughout.
For male samples in germline calling mode, DRAGEN calls potential mosaic variants in non-PAR regions of sex chromosomes. A variant is called as mosaic when the allele frequency (FORMAT/AF) is below 75% or if multiple alt alleles are called, suggesting incompatibility with the haploid assumption. The GT field for bi-allelic mosaic variants is "0/1", denoting a mixture of reference and alt alleles, as opposed to the regular GT of "1" for haploid variants. The GT field for multi-allelic mosaic variants is "1/2" in VCF. You can disable the calling of mosaic variants by setting --vc-enable-sex-chr-diploid
to false.
An example germline VCF record of a mosaic variant in a haploid region: chrX 18622368 . C T 48.84 PASS AC=1;AF=0.500;AN=2;DP=22;FS=4.154;MQ=248.02;MQRankSum=3.272;QD=2.27;ReadPosRankSum=2.671;SOR=1.546;FractionInformativeReads=1.000;MOSAIC
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:9,13:0.5909
:22:1,8:8,5:48:84,0,51:4.8837e+01,7.4031e-05,5.4007e+01:0.00,34.77,37.77:5,4,4,9:3,6,5,8
DRAGEN detects sex chromosomes by the naming convention, either X/Y or chrX/chrY. No other naming convention is supported.
male
Not relevant
Male
female
Not relevant
Female
none
Not relevant
None
auto (default)
XY
Male
auto (default)
XX
Female
auto (default)
Everything else
None
Instead of treating overlapping mates as independent evidence for a given event, DRAGEN handles overlapping mates in both the germline and somatic pipelines as follows.
When the two overlapping mates agree with each other on the allele with the highest HMM score, the genotyper uses the mate with the greatest difference between the highest and the second highest HMM score. The HMM score of the other mate becomes zero.
When the two overlapping mates disagree, the genotyper sums the HMM score between the two mates, assigns the combined score to the mate that agrees with the combined result, and changes the HMM score of the other mate to zero.
The base qualities of overlapping mates are no longer adjusted.
Typically, there are approximately 100 mitochondria in each mammalian cell. Each mitochondrion harbors 2–10 copies of mitochondrial DNA (mtDNA). For example, if 20% of the chrM copies have a variant, then the allele frequency (AF) is 20%. This is also referred to as continuous allele frequency. The expectation is that the AF of variants on chrM is anywhere between 0% and 100%.
DRAGEN processes chrM through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. In this case, a single ALT allele is considered and the AF is estimated. The estimated AF can be anywhere between 0% and 100%. Default variant AF thresholds are applied to mitochondrial variant calling.
--vc-enable-af-filter-mito
Whether to enable the allele frequency for mitochondrial variant calling. The default is true.
--vc-af-call-threshold-mito
Set the threshold for emitting calls in the VCF. The default is 0.01.
--vc-af-filter-threshold-mito
Set the threshold to mark emitted vcf call as filtered. The default is 0.02.
QUAL and GQ are not output in the chrM variant records. Instead, the confidence score is FORMAT/SQ, which gives the Phred-scaled confidence that a variant is present at a given locus. A call is made if FORMAT/SQ> vc-sq-call-threshold (default = 3.0).
The following filters can be applied to mitochondrial variant calls.
--vc-sq-call-threshold
Set the SQ threshold for emitting calls in the VCF. The default is 0.1.
--vc-sq-filter-threshold
Set the SQ threshold to mark emitted VCF calls as filtered. The default is 3.0
--vc-enable-triallelic-filter
Enables the multiallelic filter. The default value is false.
If FORMAT/SQ < vc-sq-call-threshold, the variant is not emitted in the VCF. If FORMAT/SQ > vc-sq-call-threshold but FORMAT/SQ < vc-sq-filter-threshold, the variant is emitted in the VCF but FILTER=weak_evidence.
If FORMAT/SQ> vc-sq-call-threshold, FORMAT/SQ > vc-sq-filter-threshold, and no other filters are triggered, the variant is output in the VCF and FILTER=PASS.
The following are example VCF records on the chrM. The examples show one call with very high AF and another with low AF. In both cases FORMAT/SQ > vc-sq-call-threshold. FORMAT/SQ is also > vc-sq-filter-threshold, so the FILTER annotation is PASS.
For homref calls (e.g. in NON_REF regions of gVCF output) the FORMAT/GT is hard-coded to 0/0. The FORMAT/AF yields an estimate on the variant allele frequency, which ranges anywhere within [0,1]. For variant calls with FORMAT/AF < 95%, the FORMAT/GT is set to 0/1. For variants with very high allele frequencies (FORMAT/AF ≥ 95%), the FORMAT/GT is set to 1/1.
The following is an example of a variant record on chrM in a trio joint VCF. The variant was detected in the second sample with a confidence score that passed the filter threshold. In the first and third samples GT=0/0, which indicates a tentative hom-ref call (ie, that position for the sample is in a NON_REF region over which no variant was detected with sufficient confidence), but the weak_evidence filter tag indicates that this call is made with low confidence.
We leverage the new pangenome reference and multi-genome mapper output to compute a personalized 2-haplotype reference for the input sample.
The computed 2-haplotype reference is used to impute variants, adjust priors probabilities for genotypes in the variant caller, create a new personalized machine learning model and significantly boosts accuracy of variant calling. False negatives are reduced by adjusting genotype priors based on imputed phased variants in the computed haplotypes. False positives are reduced by limiting the impact of noise from other population haplotypes.
To enable personalized variant calling, including the personalized machine learning model, set --enable-personalization
to true (default: false). This outputs two files in the output directory: .personal_haplotypes.tsv.gz
and .personal.vcf.gz
.
.personal_haplotypes.tsv.gz
describes the personalized 2-haplotype reference. Each line contains the following fields: #CHROM START END HAPS
. By default each line represents a 4kbp bin of the reference genome (indicated by the CHROM
, START
and END
fields). For each 4kbp bin, the HAPS
field denotes the pair of ancestral haplotypes (from the pangenome reference panel) that are inferred for the sample.
.personal.vcf.gz
describes the variants imputed for the sample. Each variant is annotated along with genotype (GT
), posterior probabilities (PGP
, Personalized Genotype Posterior) and the inferred best haplotype pair (HAPS
). The variant quality score (QUAL
) is computed as -10 * log10(probability that the imputed genotype is incorrect)
and is capped at 999.
The .personal.vcf.gz
is useful when running in split mode and is beneficial to save along with the BAM/CRAM output. To enable personalized variant calling and machine learning in split-run scenarios, simply provide the personal variant VCF (.personal.vcf.gz
) along with the BAM/CRAM input (--enable-personalization true --vc-pg-variants <$OUT_PREFIX.personal.vcf.gz>
).
Note that personalization is only available for the germline small variant caller (WGS and WES) when used with a pangenome reference.
When variants at multiple loci in a single active region are detected jointly, genotyping can benefit. DRAGEN combines loci into a joint detection region if the following conditions are met:
Loci have alleles that overlap each other.
Loci are in the STR region or less than 10 bases apart of the STR region.
Loci are less than 10 bases apart of each other.
Joint detection generates a haplotype list where all possible combinations of the alleles in the joint detection regions are represented. This calculation leads to a larger number of haplotypes. During genotyping, joint detection calculates the likelihoods that each haplotype pair is the truth, given the observed read pileup. Genotype likelihoods are calculated as the sum of the likelihoods of haplotype pairs that support the alleles in the genotypes. Genotypes with maximum likelihood are reported.
Joint detection is enabled by default. To disable joint detection, set --vc-enable-joint-detection
to false.
DRAGEN has two algorithms that model correlated errors across reads in a given pileup.
Foreign read detection (FRD) detects mismapped reads. FRD modifies the probability calculation to account for the possibility that a subset of the reads were mismapped. Instead of assuming that mapping errors occur independently per read, FRD estimates the probability that a burst of reads is mismapped, by incorporating such evidence as MAPQ and skewed AF.
Mapping errors typically occur in bursts, but treating mapping errors as independent error events per read can result in high confidence scores in spite of low MAPQ and/or skewed AF. One possible strategy to mitigate overestimation of confidence scores is to include a threshold on the minimum MAPQ used in the calculation. However, this strategy can discard evidence and result in false positives.
FRD extends the legacy genotyping algorithm by incorporating an additional hypothesis that reads in the pileup might be foreign reads (ie, their true location is elsewhere in the reference genome). The algorithm exploits multiple properties (skewed allele frequency and low MAPQ) and incorporates this evidence into the probability calculation.
Sensitivity is improved by rescuing FN, correcting genotypes, and enabling lowering of the MAPQ threshold for incoming reads into the variant caller. Specificity is improved by removing FP and correcting genotypes.
The base quality drop off (BQD) algorithm detects systematic and correlated base call errors caused by the sequencing system. BQD exploits certain properties of those errors (strand bias, position of the error in the read, base quality) to estimate the probability that the alleles are the result of a systematic error event rather than a true variant.
Bursts of errors that occur at a specific locus have distinct characteristics differentiating them from true variants. The base quality drop off (BQD) algorithm is a detection mechanism that exploits certain properties of those errors (strand bias, position of the error in the read, low mean base quality over said subset of reads at the locus of interest) and incorporates them into the probability calculation.
See for further details on the mosaic small variant caller and the mosaic detection mode and a comparison with DRAGEN 4.2 and DRAGEN 4.3 features.
DRAGEN outputs variants in the VCF file following variant normalization conventions described here . The normalization of a variant representation in VCF consists of two parts: parsimony and left alignment pertaining to the nature of a variant's length and position respectively.
Allele decomposition: By default, phased variants are represented as contiguous individual variant records in the VCF. When phasing can be determined, the FORMAT/GT is phased and the FORMAT/PS contains the coordinate position of the first variant in the set of phased variants; this information indicates which variants have occurred on the same haplotype. DRAGEN offers functionality to merge phased variant records into a single VCF record; please see the section for details.
In some cases, such as complex variants in repetitive regions, some variants cannot be normalized (i.e. converted into a standard representation) or represented uniquely. To counteract this problem, when comparing two VCFs (e.g. a DRAGEN VCF against a truth set VCF), it is recommended to use the RTG vcfeval tool which performs variant comparisons using a haplotype-aware approach. RTG vcfeval has been adopted as the standard VCF comparison tool by GA4GH and PrecisionFDA, as described in .