Analysis Methods
The software is a DNA-only analysis pipeline based on the . Even though it includes some of the default settings from the , it uses a distinct recipe with different options. Users can override specific parameters via a .
The software performs germline variant calling on the normal sample, and reports the following variants:
SNV (annotated)
CNV (annotated)
SV (annotated)
Targeted callers (cyp2b6, cyp2d6, cyp21a2, gba, hba, lpa, rh, and smn)
ExpansionHunter
VNTR
The software performs somatic variant calling on the tumor sample and reports the following variants:
SNV (annotated)
MNV
CNV (annotated, requires germline SNV and CNV VCF)
SV (annotated, with variant deduplication)
TMB
MSI
HRD
ASCN
LOH
DUX4
HLA
The pipeline supports two reference genomes for the DRAGEN Map/Aligner: hg38 and hs37d5_chr.
The hs37d5_chr genome is the hg19 reference genome with the Chromosome Y PAR masked. It includes the NC_012920 mitochondrial genome. The contigs carry the chr prefix but not the native alternate loci names.
DRAGEN continues to use these final alignments as input for various variant calling steps such as gene amplification (copy number) calling, small variant calling (SNV, indel, MNV, delins), and DNA library quality control.
DRAGEN supports calling SNVs, indels, MNVs, and delins in tumor-only samples by using mapped and aligned DNA reads from a tumor sample as input. Variants are detected via both column-wise pileup analysis and local de novo assembly of haplotypes. The de novo haplotypes allow the detection of much larger insertions and deletions than is possible through column-wise pileup analysis alone. Insertions and deletions have been validated at lengths of 0–25 bp, and lengths greater than 25 bp can also be supported. DRAGEN also uses the de novo assembly to detect SNVs, insertions, and deletions that are co-phased and part of the same haplotype. Any such co-phased variants that fall within a window of 15 bp can then be reassembled into complex variants (MNVs and delins). The tumor-only pipeline produces a VCF file containing both germline and somatic variants that can be further analyzed to identify tumor mutations. The pipeline makes no ploidy assumptions, enabling detection of low-frequency alleles.
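The grouping of co-phased variants into candidates for complex variants can be pictured with a short sketch. It is illustrative only: the `Variant` record, its `haplotype` label, and the rule that the 15 bp window is measured from the first variant in a group are assumptions made for this example, not a description of DRAGEN internals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    pos: int        # 1-based reference position
    ref: str
    alt: str
    haplotype: int  # hypothetical label of the assembled haplotype


def group_cophased(variants: List[Variant], window: int = 15) -> List[List[Variant]]:
    """Group variants on the same haplotype that start within `window` bp
    of the first variant in the group."""
    groups: List[List[Variant]] = []
    for v in sorted(variants, key=lambda x: (x.haplotype, x.pos)):
        last = groups[-1] if groups else None
        if (
            last is not None
            and last[0].haplotype == v.haplotype
            and v.pos - last[0].pos < window
        ):
            last.append(v)
        else:
            groups.append([v])
    return groups


calls = [
    Variant(100, "A", "G", haplotype=1),
    Variant(105, "C", "T", haplotype=1),
    Variant(130, "G", "A", haplotype=1),
    Variant(102, "T", "C", haplotype=2),
]
groups = group_cophased(calls)
# Groups with more than one member are candidates for an MNV or delins.
candidates = [g for g in groups if len(g) > 1]
```

In this example only the variants at positions 100 and 105 share a haplotype and fall inside the window, so they form the single candidate group.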
DRAGEN small variant calling includes the following steps:
Detects regions with sufficient read coverage (callable regions).
Detects regions where the reads deviate from the reference and there is a possibility of a germline or somatic call (active regions).
Assembles de novo graph haplotypes from reads (haplotype assembly).
Extracts possible somatic or germline calls (events) from column-wise pileup analysis.
Calibrates read base qualities to account for background noise.
Computes read likelihoods for each read/haplotype pair.
Performs mutation calling by summing the genotype probabilities across all read/haplotype pairs.
Performs additional filtering to improve variant calling accuracy, including using a systematic noise file. The systematic noise file indicates the statistical probability of noise at specific positions in the genome. This noise file is constructed using clean (normal) samples. Regions where noise is common (for example, difficult-to-map regions) have higher noise values. The small variant caller penalizes those regions to reduce the probability of making false positive calls.
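The idea behind the systematic noise file in the last step can be sketched as follows. This is a simplified illustration, not the DRAGEN noise model: it assumes the noise value at a site is the mean non-reference allele fraction across the normal samples, and that a call is kept only when its allele frequency exceeds a multiple of that value. The function names and the `margin` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Site = Tuple[str, int]  # (contig, position)


def build_noise_table(normals: List[Dict[Site, float]]) -> Dict[Site, float]:
    """Average the non-reference allele fraction at each site across normal samples.
    Sites missing from a sample contribute a fraction of zero."""
    totals: Dict[Site, float] = defaultdict(float)
    for sample in normals:
        for site, alt_fraction in sample.items():
            totals[site] += alt_fraction
    return {site: total / len(normals) for site, total in totals.items()}


def passes_noise_filter(site: Site, vaf: float, noise: Dict[Site, float],
                        margin: float = 3.0) -> bool:
    """Keep a call only if its VAF clearly exceeds the background noise at the site."""
    return vaf > margin * noise.get(site, 0.0)


normals = [{("chr1", 100): 0.02}, {("chr1", 100): 0.04}, {}]
noise = build_noise_table(normals)
keep_strong = passes_noise_filter(("chr1", 100), 0.10, noise)
keep_weak = passes_noise_filter(("chr1", 100), 0.05, noise)
```

A site that is noisy in the normals raises the bar a tumor call has to clear, which is the penalizing effect described in the step above.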
The DRAGEN copy number variant caller performs amplification, reference, and deletion calling for CNV targets within the assay. It counts the coverage of each target interval on the panel, normalizes the target counts against a preprocessed panel of normal samples, corrects for GC coverage bias, scores each potential CNV event from the observed coverage, and makes copy number calls.
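The normalization against a panel of normals can be illustrated with a minimal sketch. It assumes a simple scheme (library-size scaling, median baseline across normals, log2 ratio, fixed cutoffs) and omits GC bias correction entirely. The function names and the `gain` and `loss` thresholds are hypothetical and are not DRAGEN parameters.

```python
import math
from statistics import median
from typing import Dict, List


def log2_ratios(sample: Dict[str, float],
                normals: List[Dict[str, float]],
                eps: float = 1e-9) -> Dict[str, float]:
    """Log2 ratio of each target's share of coverage versus the normal baseline."""
    def scaled(counts: Dict[str, float]) -> Dict[str, float]:
        total = sum(counts.values())
        return {target: count / total for target, count in counts.items()}

    scaled_sample = scaled(sample)
    scaled_normals = [scaled(n) for n in normals]
    ratios = {}
    for target, value in scaled_sample.items():
        baseline = median(n[target] for n in scaled_normals)
        # eps avoids log2(0) for targets without coverage
        ratios[target] = math.log2((value + eps) / (baseline + eps))
    return ratios


def call_state(log2_ratio: float, gain: float = 0.4, loss: float = -0.4) -> str:
    """Map a log2 ratio to one of the three call states using illustrative cutoffs."""
    if log2_ratio >= gain:
        return "amplification"
    if log2_ratio <= loss:
        return "deletion"
    return "reference"


targets = [f"t{i}" for i in range(10)]
normals = [{t: 100.0 for t in targets} for _ in range(3)]
sample = {t: 100.0 for t in targets}
sample["t0"] = 200.0  # doubled coverage on one target
ratios = log2_ratios(sample, normals)
states = {t: call_state(r) for t, r in ratios.items()}
```

Scaling by total coverage means a gain on one target slightly lowers the ratios of all others, which is one reason production callers use more careful normalization than this sketch.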
The contamination analysis step detects foreign human DNA contamination using the SNP error file and pileup file that are generated during small variant calling, together with the TMB trace file. The software determines whether a sample has foreign DNA using the contamination score. In contaminated samples, the variant allele frequencies of SNPs shift away from the expected values of 0%, 50%, or 100%. The algorithm collects all positions that overlap with common SNPs and have variant allele frequencies of < 25% or > 75%. Then, the algorithm computes the likelihood that each position is an error or a real mutation. The contamination score is the sum of the log likelihood scores across the predefined SNP positions that have a minor allele frequency < 25% in the sample and are not likely due to CNV events.
The larger the contamination score, the more likely it is that foreign DNA contamination is present. A sample is considered to be contaminated if the contamination score is above a predefined quality threshold. The contamination score was found to be high in samples with highly rearranged genomes and in HRD samples. 1% of HRD samples were found to be above the threshold with no evidence of actual contamination.
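A toy version of such a score is sketched below. It assumes a binomial model in which each informative SNP contributes a log likelihood ratio between the observed minor allele fraction and a fixed sequencing error rate. The model, the `error_rate` value, and the function names are assumptions for illustration, and the exclusion of CNV-affected positions is left out.

```python
import math
from typing import Iterable, Tuple


def binom_loglik(k: int, n: int, p: float) -> float:
    """Binomial log likelihood without the constant combinatorial term."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)


def contamination_score(pileup: Iterable[Tuple[int, int]],
                        error_rate: float = 0.001) -> float:
    """Sum log likelihood ratios over SNPs with VAF < 25% or > 75%.

    `pileup` holds (alt_read_count, depth) pairs at common SNP positions.
    """
    score = 0.0
    for alt, depth in pileup:
        if depth == 0:
            continue
        vaf = alt / depth
        if 0.25 <= vaf <= 0.75:
            continue  # looks heterozygous, not informative in this sketch
        minor = min(alt, depth - alt)  # fold VAF > 75% onto the minor allele
        minor_fraction = minor / depth
        if minor_fraction <= error_rate:
            continue  # consistent with sequencing error alone
        score += (binom_loglik(minor, depth, minor_fraction)
                  - binom_loglik(minor, depth, error_rate))
    return score


clean = [(0, 1000), (1000, 1000), (500, 1000)]
shifted = [(30, 1000), (970, 1000), (500, 1000)]
clean_score = contamination_score(clean)
shifted_score = contamination_score(shifted)
```

Homozygous sites that carry a few percent of unexpected reads push the score up, while clean homozygous and heterozygous sites contribute nothing.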
The Illumina Annotation Engine performs annotation of small variants and CNVs. The inputs are gVCF files and the outputs are annotated JSON files.
The Illumina Annotation Engine processes each variant entry and annotates with available information from databases such as dbSNP, gnomAD genome and exome, 1000 genomes, ClinVar, COSMIC, RefSeq, and Ensembl. The header includes version information and general details. Each annotated variant is included as a nested dictionary structure in separate lines following the header.
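Because each annotated variant sits on its own line after the header, the output can be streamed without loading the whole file. The sketch below assumes that every position record starts with `{"chromosome"` and may end with a trailing comma. The sample lines and field values are made up for illustration, so check the layout against a real output file before relying on it.

```python
import json
from typing import Iterable, Iterator


def iter_positions(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dictionary per annotated position, skipping header and footer lines."""
    for line in lines:
        text = line.strip().rstrip(",")
        if not text.startswith('{"chromosome"'):
            continue  # header, section delimiters, or closing brackets
        yield json.loads(text)


# Hypothetical excerpt of an annotated JSON file, one record per line.
example_lines = [
    '{"header":{"annotator":"example"},"positions":[',
    '{"chromosome":"chr1","position":100,"variants":[{"vid":"1-100-A-G"}]},',
    '{"chromosome":"chr2","position":200,"variants":[]}',
    ']}',
]
positions = list(iter_positions(example_lines))
```

For a file on disk, the same function can be applied to an open file handle (for example one returned by `gzip.open(path, "rt")` if the output is compressed).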
DRAGEN is used to compute tumor mutational burden (TMB) in coding regions where there is sufficient coverage.
The following variants are excluded from the TMB calculation:
Non-PASS variants.
Mitochondrial variants.
MNVs.
Variants that do not meet a minimum depth threshold.
Variants that do not meet the minimum variant allele threshold.
Variants that fall outside the eligible regions.
Tumor driver mutations. Variants with a population allele count ≥ 50 are treated as tumor driver mutations.
Germline variants are not counted towards TMB. Variants are determined as germline based on a database and a proxy filter.
Variants with a population allele count ≥ 10 that are observed in either the 1000 Genomes or gnomAD databases are marked as germline. MNVs, which do not count towards TMB, may be marked as germline when all their component small variants are marked as germline. The proxy filter scans the variants surrounding a specific variant and identifies those variants with similar variant allele frequencies (VAF). If the majority of surrounding variants of similar VAF are germline, then the variant is also marked as germline.
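The two germline checks can be sketched as follows. The database check uses the allele count threshold of 10 stated above. The proxy filter's `flank` distance and `vaf_tol` tolerance are hypothetical values chosen for the example, as are the `Call` record and function names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    pos: int
    vaf: float
    population_count: int  # allele count observed in the population databases


def is_db_germline(call: Call, min_count: int = 10) -> bool:
    """Database filter: population allele count at or above the threshold."""
    return call.population_count >= min_count


def is_proxy_germline(target: Call, neighbours: List[Call],
                      flank: int = 10_000, vaf_tol: float = 0.05) -> bool:
    """Proxy filter: a majority of nearby variants with similar VAF are germline."""
    similar = [
        n for n in neighbours
        if n is not target
        and abs(n.pos - target.pos) <= flank
        and abs(n.vaf - target.vaf) <= vaf_tol
    ]
    if not similar:
        return False
    germline = sum(is_db_germline(n) for n in similar)
    return germline > len(similar) / 2


neighbours = [
    Call(900, 0.50, 100),
    Call(1100, 0.47, 50),
    Call(1200, 0.46, 0),
    Call(1300, 0.10, 0),
]
het_like = Call(1000, 0.48, 0)   # sits among germline-looking heterozygous calls
low_vaf = Call(1000, 0.10, 0)    # only resembles another non-database variant
het_like_is_germline = is_proxy_germline(het_like, neighbours)
low_vaf_is_germline = is_proxy_germline(low_vaf, neighbours)
```

The first call is absent from the databases but is still marked germline because two of its three similar-VAF neighbours are database germline. The second call is not.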
The formula for TMB calculation is:
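A minimal sketch, assuming TMB is expressed in the usual way as the number of eligible variants per megabase of eligible region, after the exclusions listed above. The `SmallVariant` record, the function names, and the threshold arguments are illustrative, and the actual thresholds are configuration details not shown here.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class SmallVariant:
    filter_status: str        # VCF FILTER value, for example "PASS"
    contig: str
    is_mnv: bool
    depth: int
    vaf: float
    in_eligible_region: bool
    is_driver: bool
    is_germline: bool


def counts_towards_tmb(v: SmallVariant, min_depth: int, min_vaf: float,
                       mito_contigs: Tuple[str, ...] = ("chrM", "MT")) -> bool:
    """Apply the exclusion rules: a variant counts only if none of them applies."""
    return (
        v.filter_status == "PASS"
        and v.contig not in mito_contigs
        and not v.is_mnv
        and v.depth >= min_depth
        and v.vaf >= min_vaf
        and v.in_eligible_region
        and not v.is_driver
        and not v.is_germline
    )


def tmb_per_megabase(variants: Iterable[SmallVariant], eligible_region_bp: int,
                     min_depth: int, min_vaf: float) -> float:
    """Eligible variant count divided by the eligible region size in Mb."""
    if eligible_region_bp <= 0:
        raise ValueError("eligible region size must be positive")
    eligible = sum(counts_towards_tmb(v, min_depth, min_vaf) for v in variants)
    return eligible / (eligible_region_bp / 1_000_000)


good = SmallVariant("PASS", "chr1", False, 200, 0.20, True, False, False)
failed = SmallVariant("LowQual", "chr1", False, 200, 0.20, True, False, False)
germline = SmallVariant("PASS", "chr1", False, 200, 0.50, True, False, True)
score = tmb_per_megabase([good, good, good, failed, germline],
                         eligible_region_bp=1_500_000, min_depth=50, min_vaf=0.05)
```

Here three variants survive the filters across 1.5 Mb of eligible region, giving 2.0 mutations per megabase.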
Outputs are captured in a .tmb.trace.tsv file that contains information on variants used in the TMB calculation and a .tmb.metrics.json file that contains the TMB score calculation and configuration details.
DRAGEN can determine the MSI status of a sample. It uses a normal reference file, which was created from a set of normal samples. During sequencing, normal reference files are generated by tabulating read counts for each microsatellite site. The normal file contains the read count distribution for each microsatellite.
MSI calling for a tumor-only sample is performed by first tabulating tumor counts from the read alignments for each microsatellite site. Then, the Jensen-Shannon distance (JSD) is calculated between each pair of tumor and normal baseline samples. DRAGEN determines unstable sites by performing Chi-square testing of the tumor JSD and normal JSD distributions. A site is called unstable if the mean distance difference of the two JSD distributions is ≥ the distance threshold and the Chi-square p-value is ≤ the p-value threshold. Lastly, DRAGEN produces an MSI status from the assessed site count, the unstable site count, the percentage of unstable sites among all assessed sites, and the sum of the Jensen-Shannon distances of all the unstable sites.
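The distance measure itself is standard and can be sketched in a few lines. The example computes the Jensen-Shannon distance between two read count distributions for one microsatellite site and applies the two-threshold rule for unstable sites. The function names are illustrative, the thresholds are passed in rather than assumed, and the Chi-square test is not reproduced.

```python
import math
from typing import List, Sequence


def normalise(counts: Sequence[float]) -> List[float]:
    """Turn read counts per repeat length into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def js_distance(p_counts: Sequence[float], q_counts: Sequence[float]) -> float:
    """Jensen-Shannon distance (base 2), which lies between 0 and 1."""
    p, q = normalise(p_counts), normalise(q_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: List[float], y: List[float]) -> float:
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against rounding below zero


def is_unstable(mean_distance_difference: float, p_value: float,
                distance_threshold: float, p_value_threshold: float) -> bool:
    """Apply the two-threshold rule for calling a site unstable."""
    return (mean_distance_difference >= distance_threshold
            and p_value <= p_value_threshold)


normal_counts = [0, 5, 90, 5, 0]     # reads per repeat length in a normal sample
tumor_counts = [10, 30, 40, 15, 5]   # broader distribution in the tumor
same = js_distance(normal_counts, normal_counts)
shifted = js_distance(tumor_counts, normal_counts)
```

Identical distributions give a distance of 0, and a tumor distribution that spreads across more repeat lengths gives a larger distance.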
Genomic instability score (GIS) is a whole genome signature for homologous recombination deficiency. The GIS is composed of the sum of three components: loss of heterozygosity, telomeric allele imbalance, and large-scale state transition. These components are estimated using the GIS algorithm contracted from Myriad Genetics, which uses an input of the b-allele frequency and coverage across a genome-wide single nucleotide panel. A panel of normal samples is used for both bias reduction and normalization prior to GIS estimation. Final GIS results can be found in the *.gis.json file.
An example command is provided that highlights the inputs and outputs used in the DragenCaller step of the software; it can be found in the DRAGEN run log file. Any parameter not displayed on the command line uses the default value for the DRAGEN variant caller module. The detailed parameters and default arguments for the individual modules within the DragenCaller step can be found in the replay.json output. See for detailed explanations of the parameters.
involves aligning sequencing reads derived from DNA libraries to a reference genome prior to variant calling.
The software currently supports both tumor and normal samples with UMI. Please use the to get details on the options.
Additional information is available at .
The supports both matched tumor-normal pairs and tumor only samples. The germline mode of the small variant caller is used to analyze the normal sample in the matched pair.
Additional information is available at .
Absolute copy numbers are calculated by the CNV ASCN Caller. See .
See more information available at .
The DRAGEN Structural Variant (SV) Caller is described .
The DUX4 rearrangement caller is described .
The Variant Deduplication is described
The database content included with Nirvana database is available at the .
The pipeline currently does not support annotation of gVCF files. Please use the to perform tertiary analysis.
Please see the for details about the TMB biomarker analysis.
Please see the for details about the MSI biomarker analysis.