Somatic Mode

The DRAGEN Somatic Pipeline allows ultrarapid analysis of Next-Generation Sequencing (NGS) data to identify cancer-associated mutations in somatic chromosomes. DRAGEN calls SNVs and indels from both matched tumor-normal pairs and tumor-only samples using a probability model that considers the possibility of somatic variants, germline variants, and various systematic noise artifacts. The model is informed by sample-specific nucleotide and indel noise patterns that are estimated from the data at runtime. When considering somatic variants, DRAGEN does not make any ploidy assumptions, which enables detection of low-frequency alleles. For loci with coverage up to 100x in the tumor sample, DRAGEN can detect variant allele frequencies down to approximately 5%. This limit scales with increasing depth on a per-locus basis. It is recommended to provide DRAGEN with a systematic noise file that contains position- and allele-specific noise frequencies as estimated from a panel of normal samples (see below); DRAGEN uses this noise file to filter calls that can be explained as resulting from position- and allele-specific noise. After multiple filtering steps, the output is generated as a VCF file. Variants that fail the filtering steps are kept in the output VCF. The variants include a FILTER annotation that indicates which filtering steps have failed.

For the tumor-normal pipeline, both samples are analyzed jointly. DRAGEN assumes that germline variants and systematic noise artifacts are shared by both samples, whereas somatic variants are present only in the tumor sample. Only somatic variants are reported. To detect systematic noise artifacts, DRAGEN recommends that the coverage in the normal sample be at least half of the coverage in the tumor sample.

The tumor-only pipeline produces output that contains both germline and somatic variants and can be further analyzed to identify tumor mutations. The caller does not attempt to distinguish between them: filtering out common germline variants as reported in databases is currently the most reliable way to remove germline variants. The tumor-only pipeline provides a germline tagging feature and requires this feature to be explicitly enabled or disabled. When germline tagging is enabled, a variant annotation data directory must be passed in via the command line; DRAGEN then tags variants that are common in the gnomAD database as germline so they can be filtered out if desired (see details below in Germline Tagging in the Tumor-Only Pipeline). The tumor-only pipeline also requires a systematic noise file by default. To run without germline tagging and/or a systematic noise file, these options must be disabled explicitly.

Variant Scoring

DRAGEN uses a Bayesian approach to compute the posterior probability that a somatic variant is present and reports this as a phred-scale quantity, "somatic quality" (SQ):

##FORMAT=<ID=SQ,Number=1,Type=Float,Description="Somatic quality">

DRAGEN scores variants by computing likelihoods for several hypotheses and noise processes, taking into account many factors such as: the numbers of alt-supporting and ref-supporting reads in the tumor and normal samples (and hence the alt allele frequencies in both samples); mapping qualities and how these are distributed across the reads in the tumor and normal pileups; basecall qualities; forward vs reverse strand support; sample-wide estimates of insertion and deletion error probabilities as functions of repeat period, repeat length, and indel length; sample-wide estimates of nucleotide error biases; whether there are nearby co-phased events; and whether the positions and alleles in question are known somatic hotspots or associated with sequence-specific error patterns. You can use SQ as the primary metric to describe the confidence with which the caller made a somatic call. SQ is reported as a format field for the tumor sample (exception: for homozygous reference calls in gvcf mode it is instead a likelihood ratio, analogous to homref GQ as described in the germline section). Variants with SQ score below the SQ filter threshold are filtered out using the weak_evidence tag. To trade off sensitivity against specificity, adjust the SQ filter threshold. Lower thresholds produce a more sensitive caller and higher thresholds produce a more conservative caller. If performing tumor-normal analysis, the SQ field for the normal sample contains the Phred-scaled posterior probability that a putative call is a germline variant. The somatic caller does not test for diploid genotype candidates and does not output GQ or QUAL values.
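To make the SQ thresholds easier to interpret, the following minimal sketch converts between a phred-scale score and a probability. It assumes the standard Phred convention, Q = -10 * log10(1 - p), applied to the posterior probability p; the text above only states that SQ is "phred-scale", so the exact formula and the function names are assumptions for illustration.

```python
import math


def sq_to_error_probability(sq: float) -> float:
    """Probability that the call is wrong, under the standard Phred convention."""
    return 10 ** (-sq / 10)


def posterior_to_sq(posterior: float) -> float:
    """Phred-scale the complement of a posterior probability (assumed convention)."""
    return -10 * math.log10(1 - posterior)


# Under this convention, a score of 17.5 corresponds to roughly a 1.8% chance
# of error, and a score of 3.0 to roughly a 50% chance of error.
for sq in (3.0, 17.5):
    print(sq, round(sq_to_error_probability(sq), 4))
```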

If tumor SQ > vc-sq-call-threshold (default is 3 for tumor-normal and 0.1 for tumor-only), then FORMAT/GT is hard-coded to 0/1 for the tumor sample and 0/0 for the normal sample (if present), and the tumor-sample FORMAT/AF yields an estimate of the somatic variant allele frequency, which ranges anywhere within [0,1].

If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.
If tumor SQ < vc-sq-call-threshold, the variant is not emitted in the VCF.
If tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, the variant is emitted in the VCF, but FILTER=weak_evidence.
If tumor SQ > vc-sq-call-threshold and tumor SQ > vc-sq-filter-threshold, the variant is emitted in the VCF and FILTER=PASS (unless the variant is filtered by a different filter).
The default vc-sq-filter-threshold is 17.5 for tumor-normal and 3.0 for tumor-only analysis. The following is an example somatic T/N VCF record. Tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, so the FILTER is marked as weak_evidence.

chr2 593701 . G A . weak_evidence
DP=97;MQ=48.74;SQ=3.86;NLOD=9.83;FractionInformativeReads=1.000
GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB 0/0:9.83:33,0:0.000:14,0:19,0:33
0/1:3.86:61,3:0.047:29,2:32,1:64:35,26,0,3:39,22,1,2
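The emit and filter rules above can be sketched as a small decision function. This is an illustration only, not DRAGEN's implementation: the function name and return labels are hypothetical, the handling of a score exactly equal to a threshold is an assumption (the text specifies only strict inequalities), and other filters and the clustered-events penalty are ignored.

```python
def classify_tumor_sq(sq: float,
                      call_threshold: float = 3.0,
                      filter_threshold: float = 17.5) -> str:
    """Return 'not_emitted', 'weak_evidence', or 'PASS' for a tumor SQ score.

    Defaults are the tumor-normal values; tumor-only uses 0.1 and 3.0.
    """
    # If the filter threshold is lower than the call threshold,
    # the filter threshold is used in place of the call threshold.
    effective_call = min(call_threshold, filter_threshold)
    if sq < effective_call:
        return "not_emitted"
    if sq < filter_threshold:
        return "weak_evidence"
    return "PASS"  # may still be filtered by a different filter


# The example record has tumor SQ = 3.86 in a tumor-normal run.
print(classify_tumor_sq(3.86))
```

With the tumor-normal defaults, the example record's SQ of 3.86 lies between the two thresholds, which matches its weak_evidence FILTER value.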

The clustered-events penalty is an exception to the above rule for emitting variants. By default, the clustered-events penalty replaces the (obsolete) clustered-events filter. Instead of applying a hard filter when too many events are clustered together, DRAGEN applies a penalty to the SQ scores of co-phased clustered events. Clustered events with weak evidence are no longer called, but clustered events with strong evidence can still be called. This is equivalent to lowering the prior probability of observing clustered co-phased variants. The penalty is applied after the decision to emit variants, so that penalized variants still appear in the VCF if their unpenalized score is high enough. Variants that are combined into an MNV via the --vc-combine-phased-variants-distance option are treated as a single variant for the purposes of the penalty. The penalty is not applied to somatic hotspot variants. To disable the clustered-events penalty, set --vc-clustered-event-penalty=0.

Somatic Mode Options

To run DRAGEN somatic small variant calling, enable the variant caller with --enable-variant-caller=true and pass in tumor, and optionally, matched normal inputs via the command line. FASTQ (both gzipped and Ora-compressed), FASTQ list, BAM and CRAM inputs are all supported input types. For all input types, reads will be aligned by the DRAGEN map/align module and resulting alignments fed into the caller by default. For BAM and CRAM inputs, you can bypass map/align and use existing alignments as variant caller input by setting --enable-map-align=false.

Please see the DRAGEN Recipe sections for recommended command lines in typical workflows. The following command line options are typically used for somatic small-variant calling:

--tumor-fastq1 and --tumor-fastq2

Inputs a pair of FASTQ files into the mapper aligner and somatic variant caller. You can use these options with other FASTQ options to run in tumor-normal mode. For example:

dragen -f -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
--tumor-fastq1 <TUMOR_FASTQ1> \
--tumor-fastq2 <TUMOR_FASTQ2> \
--RGID-tumor <RG0-tumor> --RGSM-tumor <SM0-tumor> \
-1 <NORMAL_FASTQ1> \
-2 <NORMAL_FASTQ2> \
--RGID <RG0> --RGSM <SM0> \
--enable-variant-caller true \
--output-directory /staging/examples/ \
--output-file-prefix SRA056922_30x_e10_50M

--tumor-fastq-list

Inputs a list of FASTQ files into the mapper aligner and somatic variant caller. You can use this option with other FASTQ options to run in tumor-normal mode. For example:

dragen -f \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
--tumor-fastq-list <TUMOR_FASTQ_LIST> \
--fastq-list <NORMAL_FASTQ_LIST> \
--enable-variant-caller true \
--output-directory /staging/examples/ \
--output-file-prefix SRA056922_30x_e10_50M

--tumor-bam-input and --tumor-cram-input

Inputs a mapped BAM or CRAM file into the somatic variant caller. You can use these options with other BAM/CRAM options to run in tumor-normal mode. When the mapper is enabled (default), reads from the input BAM/CRAM files are re-mapped and updated alignments are sent to the caller (supported for both tumor-normal and tumor-only BAM/CRAM input). When the mapper is disabled (--enable-map-align=false), the existing BAM/CRAM alignments are used in the caller.

--vc-sq-call-threshold and --vc-sq-filter-threshold

These options control the thresholds for emitting calls in the VCF and applying the weak_evidence filter tag (see above).

--vc-target-vaf

This option adjusts the allele frequencies of haplotypes that the caller considers as potentially appearing in the sample. It is not a hard threshold, but the variant caller aims to detect variants with allele frequencies larger than this setting. In tumor-normal runs, the frequency is measured with respect to the full set of reads (tumor and normal combined). The default threshold of 0.03 was selected to be as low as possible without incurring an excessive false positive cost. A lower setting may increase sensitivity for low-frequency variants, but may also increase false positives and runtime; a higher setting may reduce false positives. Setting --vc-target-vaf to 0 results in all haplotypes with at least two supporting reads being taken into consideration.

--vc-somatic-hotspots, --vc-use-somatic-hotspots, and --vc-hotspot-log10-prior-boost

DRAGEN uses a hotspot VCF to indicate somatic mutations that are expected with increased frequency. The default hotspot file (automatically selected from <INSTALL_PATH>/resources/hotspots/somatic_hotspots_* based on the reference) is mostly based on the Memorial Sloan Kettering Cancer Center (MSKCC) published hotspots and positions in COSMIC with population allele counts (AC) >= 50. It is somewhat conservative and boosts only a few thousand positions. You can specify a custom hotspot file via the --vc-somatic-hotspots option (note: input VCF records must be sorted in the same order as contigs in the selected reference) or disable the hotspots feature with --vc-use-somatic-hotspots=false. The effect of the hotspot file is that the prior probability for hotspot variants is boosted by a factor, up to a maximum prior of 0.5. An SNV is considered to match a hotspot variant only if the allele in question is identical, whereas insertions or deletions are considered to match any insertion or deletion allele, respectively. You can use --vc-hotspot-log10-prior-boost to control the size of the adjustment. The default value is 4 (log10 scale), corresponding to an increase of 40 phred; reducing this value results in a smaller adjustment.

--vc-systematic-noise

Specifies the systematic noise file. To run without a systematic noise file (not recommended), specify --vc-systematic-noise=NONE.

--vc-combine-phased-variants-distance

This option is the same as in the germline variant caller (see "Combine Phased Variants" in the germline small-variant caller section).

--vc-skip-germline-tagging=true

Disables the germline tagging feature in the tumor-only pipeline (not recommended).

--vc-callability-tumor-thresh

Specifies the callability threshold for tumor samples. The somatic callable regions report includes all regions with tumor coverage above the tumor threshold. The default value is 50. For more information on the somatic callable regions report, see Somatic Callable Regions Report.

--vc-callability-normal-thresh

Specifies the callability threshold for normal samples, if present. If applicable, the somatic callable regions report includes all regions with normal coverage above the normal threshold. The default value is 5. For more information on the somatic callable regions report, see Somatic Callable Regions Report.

--vc-excluded-regions-bed

Optional excluded regions BED file specifying where variants are hard-filtered. Useful, for example, to exclude ALU regions that tend to be especially noisy in FFPE samples.

--vc-call-hotspots-in-excluded-regions

Do not apply the excluded regions filter to hotspot variants (Default=false).
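The hotspot prior boost described above can be illustrated with a short calculation. This is a sketch of the stated arithmetic only (a multiplicative boost on the log10 scale, capped at a prior of 0.5); the function names and the example prior values are hypothetical and do not reflect DRAGEN's internal priors.

```python
MAX_HOTSPOT_PRIOR = 0.5  # stated cap on the boosted prior


def boosted_hotspot_prior(prior: float, log10_boost: float = 4.0) -> float:
    """Multiply the prior by 10**log10_boost, capped at the maximum prior."""
    return min(prior * 10 ** log10_boost, MAX_HOTSPOT_PRIOR)


def boost_in_phred(log10_boost: float = 4.0) -> float:
    """A factor of 10**x corresponds to 10*x on the phred scale."""
    return 10 * log10_boost


# Hypothetical priors, chosen only to show the boost and the cap.
print(boosted_hotspot_prior(1e-6))  # boosted by a factor of 10,000
print(boosted_hotspot_prior(1e-3))  # capped at 0.5
print(boost_in_phred())             # default boost on the phred scale
```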

Tumor-in-normal contamination and liquid tumor mode

In a tumor-normal analysis, DRAGEN accounts for tumor-in-normal (TiN) contamination by running liquid tumor mode. Liquid tumor mode is disabled by default, but we recommend enabling it with --vc-enable-liquid-tumor-mode=true if TiN contamination is expected. When liquid tumor mode is enabled, DRAGEN is able to call variants in the presence of TiN contamination up to a specified maximum tolerance level (default: 0.15). If using the default maximum contamination TiN tolerance, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the corresponding allele in the tumor sample. vc-tin-contam-tolerance enables liquid tumor mode and allows you to set the maximum contamination TiN tolerance.

Liquid tumor mode is not equivalent to liquid biopsy. Liquid tumors in liquid tumor mode refer to hematological cancers, such as leukemia. For liquid tumors, it is not feasible to use blood as a normal control because the tumor is present in the blood. Skin or saliva is typically used as the normal sample. However, skin and saliva samples can still contain blood cells, so that the matched normal control sample contains some traces of the tumor sample and somatic variants are observed at low frequencies in the normal sample. If the contamination is not accounted for, it can severely impact sensitivity by suppressing true somatic variants.

Liquid tumor mode typically uses a WGS or WES library with medium depth (for example, 100x tumor / 40x normal), and the lowest VAF detected at these depths is ~5%. Liquid biopsy typically uses a targeted gene panel (e.g., 500 genes) with very high raw depth, and uses UMI indexing (collapsing down to a depth of >2000x) to enable sensitivity at VAF down to 0.1% in some cases (the limit of detection varies depending on coverage and data quality).

Mixing tumor and normal samples from different sequencing protocols

If using different sequencing systems or different library preparation methods for tumor and normal samples, we recommend setting --vc-override-tumor-pcr-params-with-normal=false. In tumor-normal mode, DRAGEN estimates a set of PCR error parameters separately for each of the tumor and normal samples. By default, DRAGEN ignores the tumor-sample parameters and uses normal-sample parameters for analysis of both samples. This default prevents overestimation of tumor-sample error rates that can occur if the somatic variant rate is high.

Allele frequency and related settings

There is no hard limit on the allele frequencies at which DRAGEN can report calls, but there are a number of points in the pipeline where low allele frequency can affect calling. The vc-target-vaf setting affects the threshold used to detect candidate haplotypes during localized haplotype assembly, but does not affect variant scoring. Once a candidate haplotype is detected, all putative variants appearing in the haplotype are scored and calls scoring above the SQ call threshold are emitted regardless of the allele frequency or the number of supporting reads.

The probability calculation in the somatic caller assesses variant and noise hypotheses at fixed allele frequencies defined by a discrete grid (by default at coverages <200: 0, 0.05, 0.1, ... 1.0). This means that the calculation will assess variants with allele frequencies below 0.05 as if the true frequency is equal to 0.05; this strategy does not preclude such variants from being called but may result in lower scores compared to if the true frequency had been considered. At positions with higher coverage, DRAGEN adds extra grid points as in the table below in order to consider hypotheses involving lower allele frequencies and effectively achieve a lower limit of detection (LOD), with the lowest VAF halving every time the coverage doubles:

| Coverage | Lowest AF |
| --- | --- |
| 0-199 | 0.05 |
| 200-399 | 0.025 |
| 400-799 | 0.0125 |
| ... | ... |

If calls below a certain VAF are not of interest, you can use --vc-enable-af-filter (see Post Somatic Calling Filtering below) to apply a hard filter on VAF.
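The coverage-dependent grid described above can be expressed as a small function. This sketch reproduces the three listed rows and extrapolates beyond them using the stated rule that the lowest AF halves each time the coverage doubles; the function name is hypothetical, and any cap that DRAGEN may apply at very high coverage is not modeled.

```python
def lowest_grid_af(coverage: int) -> float:
    """Lowest non-zero allele frequency on the grid at a given coverage."""
    lowest_af = 0.05   # grid floor for coverage below 200
    upper_bound = 200  # first coverage at which the floor halves
    while coverage >= upper_bound:
        lowest_af /= 2
        upper_bound *= 2
    return lowest_af


for cov in (100, 200, 400, 800):
    print(cov, lowest_grid_af(cov))
```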

Sample-specific NTD Error Bias Estimation

DRAGEN can compensate for oxidation and deamination artifacts that might exist upstream of the sequencing system, and are common in FFPE samples. DRAGEN does this by estimating nucleotide mutation biases on a per sample basis, taking account of read orientation. During variant calling, DRAGEN then corrects for nucleotide substitution biases by combining the estimated parameters with the basecall quality scores, thus modifying the nucleotide error rates used by the hidden Markov model.

Nucleotide (NTD) Error Bias Estimation is on by default and recommended as a replacement for the orientation bias filter. Both methods take account of strand-specific biases (systematic differences between F1R2 and F2R1 reads). In addition, NTD error estimation accounts for non-strand-specific biases such as sample-wide elevation of a certain SNV type, e.g. C->T or any other transition or transversion. This is done by collecting counts (sampled from across the genome, and counted per read orientation) of reads supporting each specific nucleotide substitution (C->T, G->A, etc.). The estimated rate of each substitution is written to a metrics file named "*.allele-transition-noise-metrics.csv". NTD error estimation can also capture these biases in a trinucleotide context, e.g. in the case of C->T it breaks down the counts as ACA->ATA, CCA->CTA, GCA->GTA, TCA->TTA, etc.

This feature can be disabled by specifying --vc-enable-unequal-ntd-errors=false or set to auto-detect by specifying --vc-enable-unequal-ntd-errors=auto. In auto-detect mode, DRAGEN will run the estimation but then disable the use of the estimated parameters if it determines that the sample does not exhibit nucleotide error bias. When the feature is enabled, DRAGEN will by default estimate a smaller set of parameters in a monomer context. To estimate a larger set of parameters in a trimer context (recommended on sufficiently large panels when coverage is above 1000X), specify --vc-enable-trimer-context=true.

To specify the regions from which to estimate nucleotide substitution biases, use --vc-snp-error-cal-bed. Alternatively, if --vc-target-bed is used to specify the target regions for variant calling, and the total bed regions are sufficiently small (maximum 4 megabases), --vc-snp-error-cal-bed can be omitted and DRAGEN will use the target bed file for bias estimation. Otherwise, DRAGEN will use a default bed file selected to match the reference, and covering a mixture of coding and non-coding regions.

DRAGEN requires a panel size of at least 150kbp to correctly estimate nucleotide mutation biases when using trimer context, or at least 10kbp when using monomer context. If this requirement is not met for trimer context, DRAGEN falls back on the monomer model, and if it is not met for monomer context, DRAGEN turns the bias estimation feature off.
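The fallback behavior described above can be summarized as follows. This is a sketch of the stated rules only; the function name and return labels are hypothetical, and it assumes 1 kbp = 1,000 bp and that the stated minimum sizes are inclusive.

```python
TRIMER_MIN_BP = 150_000   # minimum panel size for trimer context
MONOMER_MIN_BP = 10_000   # minimum panel size for monomer context


def effective_ntd_context(panel_size_bp: int, trimer_requested: bool = False) -> str:
    """Return the context actually used for bias estimation: 'trimer', 'monomer', or 'off'."""
    if trimer_requested and panel_size_bp >= TRIMER_MIN_BP:
        return "trimer"
    if panel_size_bp >= MONOMER_MIN_BP:
        return "monomer"  # default context, or fallback from trimer
    return "off"          # panel too small for any estimation


print(effective_ntd_context(500_000, trimer_requested=True))
print(effective_ntd_context(50_000, trimer_requested=True))
print(effective_ntd_context(5_000))
```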

Unique Molecular Identifier (UMI) Support

DRAGEN provides two specialized UMI-aware variant calling pipelines for running from UMI-collapsed reads. These pipelines are optimized to take account of the increased read and basecall qualities that are typical in simplex- and duplex-collapsed reads. Both pipelines are disabled by default; when running with UMI collapsing enabled (--enable-umi true) or when running from UMI-collapsed bams, you can enable UMI-aware variant calling by setting one of the following options to true:

--vc-enable-umi-solid The VC UMI solid mode is optimized for solid tumors with post-collapsed coverage rates of ~200-300X and target allele frequencies of 5% and higher.
--vc-enable-umi-liquid The liquid biopsy pipeline is not equivalent to liquid tumor mode (see above). The liquid biopsy pipeline starts from a regular blood sample and looks for low VAF somatic variants from tumor cell free DNA floating in the blood. This type of test enables tumor profiling (diagnosis/biomarker identification) from plasma rather than from tissue, which requires an invasive biopsy. The VC UMI liquid mode is optimized for a liquid biopsy pipeline with post collapsed coverage rates of >2000X and target allele frequencies of 0.1% and higher.
If your UMI-collapsed reads do not meet the recommended post-collapsed coverage depths for the options listed above, we recommend you run with default settings.

If a third-party tool is used to produce the collapsed reads, then configure the tool so that the base call quality scores quantify the error produced by the sequencing system only. DRAGEN uses Sample-specific NTD Error Bias Estimation (see above) to account for errors upstream of the sequencing system, so such errors should not be included in base call quality scores.

gVCF Output

You can output a gVCF file for tumor-only data sets. A gVCF file reports information on every position of the input genome, including homozygous reference (homref) positions, i.e. positions where no alt allele (either germline or somatic) is present. DRAGEN creates a new <NON_REF> allele, to which reads that do not support the reference allele or any reported variant allele are assigned. In tumors, variants could exist at arbitrarily low allele frequencies and be undetectable. Thus, a somatic homref call cannot guarantee that no somatic variant at any allele frequency exists at the position. Instead, DRAGEN considers a position to be a homozygous reference if there are no somatic variants with an allele frequency at or above the limit of detection (LOD). Whereas the SQ score for an ordinary alt allele is a phred-scale posterior probability, the SQ score for the <NON_REF> allele is a phred-scale ratio between the likelihood of a homref call and the likelihood of a variant call with allele frequency at the LOD (if an alt allele is also reported, the <NON_REF> SQ score is capped at the complement of the posterior probability for the alt allele). If the LOD value is lowered, fewer homref calls are made. If the LOD value is increased, more homref calls are made.

By default the LOD is set to 5%, but you can enter a different value using the --vc-gvcf-homref-lod option.
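To illustrate why the LOD setting changes the number of homref calls, here is a toy calculation. It is not DRAGEN's model: it swaps in a plain binomial read-count likelihood with a hypothetical fixed error rate, and compares a "no variant" hypothesis against a "variant at the LOD" hypothesis as a phred-scale likelihood ratio. All names and parameter values other than the 5% default LOD are assumptions.

```python
import math


def toy_homref_score(depth: int, alt_reads: int,
                     lod: float = 0.05, error_rate: float = 0.001) -> float:
    """Phred-scale ratio of binomial likelihoods: homref versus variant at the LOD."""

    def log10_likelihood(alt_fraction: float) -> float:
        # The binomial coefficient is identical for both hypotheses and cancels.
        return (alt_reads * math.log10(alt_fraction)
                + (depth - alt_reads) * math.log10(1 - alt_fraction))

    return 10 * (log10_likelihood(error_rate) - log10_likelihood(lod))


# 100 reads, none supporting an alt allele.
print(round(toy_homref_score(100, 0, lod=0.05), 2))
print(round(toy_homref_score(100, 0, lod=0.01), 2))
```

In this toy model, lowering the LOD lowers the score at the same depth, which is consistent with the statement that a lower LOD yields fewer homref calls.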

Post Somatic Calling Filtering

DRAGEN can add a number of filters by populating the FILTER column in the VCF. The output is provided in the <output-file-prefix>.hard-filtered.vcf.gz output file.

Options

The following options are available for post somatic calling filtering:

--vc-sq-call-threshold
Emits calls in the VCF. The default is 3.0 for tumor-normal and 0.1 for tumor-only. If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.
--vc-sq-filter-threshold
Marks emitted VCF calls as filtered. The default is 17.5 for tumor-normal and 3.0 for tumor-only.
--vc-enable-triallelic-filter
Enables the multiallelic filter. The default is true. This filter will not be applied to somatic hotspot variants.
--vc-enable-non-primary-allelic-filter
Similar to the triallelic filter, but filters less aggressively. Keep the allele per multiallelic position with highest alt AD, and only filter the rest (Default=false). This filter will not be applied to somatic hotspot variants. Cannot be enabled when the triallelic filter is also on.
--vc-enable-af-filter
Enables the allele frequency filter for nuclear chromosomes. The default value is false. When set to true, variants with allele frequencies below the AF call threshold are excluded from the VCF, and variants with allele frequencies below the AF filter threshold are tagged with the low_af filter tag. The default AF call threshold is 1% and the default AF filter threshold is 5%. To change the threshold values, use the --vc-af-call-threshold and --vc-af-filter-threshold command-line options. For mitochondrial allele frequency filtering, use --vc-enable-af-filter-mito and the corresponding threshold options.
--vc-enable-non-homref-normal-filter
Enables the non-homref normal filter. The default value is true. When set to true, the VCF filters out variants if the normal sample genotype is not a homozygous reference.
--vc-enable-vaf-ratio-filter
Adds one condition to be filtered out by the alt_allele_in_normal filter. The default value is false. When set to true, the VCF filters out variants if the normal sample AF is greater than 20% of tumor sample AF.
--vc-depth-filter-threshold
Filters all somatic variants (alt or homref) with a depth below this threshold. The default value is 0 (no filtering).
--vc-homref-depth-filter-threshold
In gVCF mode, filters all somatic homref variants with a depth below this threshold. The default value is 3.
--vc-depth-annotation-threshold
Filters all non-PASS somatic alt variants with a depth below this threshold. The default value is 0 (no filtering).
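As an illustration of how the two allele frequency thresholds interact when --vc-enable-af-filter is set to true, consider the following sketch. The function name and return labels are hypothetical, and the treatment of values exactly equal to a threshold is an assumption.

```python
def apply_af_filter(af: float,
                    call_threshold: float = 0.01,
                    filter_threshold: float = 0.05) -> str:
    """Return 'excluded', 'low_af', or 'unchanged' for a variant allele frequency."""
    if af < call_threshold:
        return "excluded"   # not written to the VCF
    if af < filter_threshold:
        return "low_af"     # written to the VCF with the low_af filter tag
    return "unchanged"      # AF filter does not affect this variant


for af in (0.005, 0.03, 0.20):
    print(af, apply_af_filter(af))
```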

Filters

| Somatic Mode | Filter ID | Description |
| --- | --- | --- |
| Tumor-Only & Tumor-Normal | weak_evidence | Variant does not meet likelihood threshold. The likelihood ratio for SQ tumor-normal is < 17.5 or < 3.0 for SQ tumor-only. |
| Tumor-Only & Tumor-Normal | multiallelic | Site filtered if there are two or more ALT alleles at this location in the tumor. Not applied to somatic hotspot variants. |
| Tumor-Only & Tumor-Normal | base_quality | Median base quality of ALT reads at this locus is < 20. |
| Tumor-Only & Tumor-Normal | mapping_quality | Median mapping quality of ALT reads at this locus is < 20 (tumor-normal) or < 30 (tumor-only). |
| Tumor-Only & Tumor-Normal | fragment_length | Absolute difference between the median fragment length of alt reads and median fragment length of ref reads at a given locus > 10000. |
| Tumor-Only & Tumor-Normal | read_position | Median of distances between the start and end of read and a given locus < 5 (the variant is too close to edge of all the reads). To output variant read position to the INFO field, use --vc-output-variant-read-position=true. |
| Tumor-Only & Tumor-Normal | low_af | Allele frequency is below the threshold specified with --vc-af-filter-threshold (default is 5%). Enabled only when using --vc-enable-af-filter=true. |
| Tumor-Only & Tumor-Normal | systematic_noise | If AQ score is < 10 (default) for tumor-normal or < 60 (default) for tumor-only, the site is filtered. |
| Tumor-Only & Tumor-Normal | low_frac_info_reads | The fraction of informative reads (denominator excludes filtered_out reads) is below the threshold. The default threshold value is 0.5. |
| Tumor-Only & Tumor-Normal | filtered_reads | More than 50% of reads have been filtered out. |
| Tumor-Only & Tumor-Normal | long_indel | Indel length is more than 100bp. |
| Tumor-Only & Tumor-Normal | low_depth | The site was filtered because the number of reads is too low. The filter is off by default. |
| Tumor-Only & Tumor-Normal | low_tlen | The site was filtered because the fraction of low TLEN ALT supporting reads is above a threshold. The default threshold is 0.4. Reads with TLEN smaller than -2.25 (default) standard deviations from the mean are considered to be low TLEN. This filter is not applied for reads sampled from tight insert distributions i.e., stddev / mean < 0.1 (default). |
| Tumor-Only & Tumor-Normal | no_reliable_supporting_read | No reliable supporting read was found in the tumor sample. A reliable supporting read is a read supporting the alt allele with mapping quality ≥ 40, fragment length ≤ 10,000, base call quality ≥ 25, and distance from start/end of read ≥ 5. |
| Tumor-Only & Tumor-Normal | too_few_supporting_reads | Variant is supported by < 3 reads in the tumor sample. This filter is not applied in UMI-aware pipelines. |
| Tumor-Normal | noisy_normal | More than three alleles are observed in the normal sample at allele frequency above 9.9%. |
| Tumor-Normal | alt_allele_in_normal | ALT allele frequency in the normal sample is above 0.2 plus the maximum contamination tolerance. For solid tumor mode, the value is 0. For liquid tumor mode, the default value is 0.15. See vc-enable-vaf-ratio-filter for optional conditions. |
| Tumor-Normal | non_homref_normal | Normal sample genotype is not a homozygous reference. |
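The alt_allele_in_normal condition, together with the optional VAF ratio condition from --vc-enable-vaf-ratio-filter, can be sketched as follows. This is an illustration of the stated thresholds only; the function and parameter names are hypothetical, and DRAGEN's actual filter may take additional evidence into account.

```python
def alt_allele_in_normal(normal_af: float,
                         tumor_af: float,
                         tin_tolerance: float = 0.0,
                         use_vaf_ratio: bool = False) -> bool:
    """Return True if the variant would be tagged alt_allele_in_normal in this sketch.

    tin_tolerance is 0 in solid tumor mode and 0.15 by default in liquid tumor mode.
    """
    # Base condition: normal AF above 0.2 plus the contamination tolerance.
    if normal_af > 0.2 + tin_tolerance:
        return True
    # Optional condition: normal AF greater than 20% of the tumor AF.
    if use_vaf_ratio and normal_af > 0.2 * tumor_af:
        return True
    return False


print(alt_allele_in_normal(0.25, 0.50))                      # solid tumor mode
print(alt_allele_in_normal(0.25, 0.50, tin_tolerance=0.15))  # liquid tumor mode
print(alt_allele_in_normal(0.05, 0.10, use_vaf_ratio=True))  # ratio condition
```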

Systematic Noise Filtering

The DRAGEN systematic noise filter significantly improves somatic variant calling precision, especially in tumor-only mode. DRAGEN enforces its use in the tumor-only pipeline by refusing to start a run without a noise file (this option can explicitly be disabled). This filter removes noise that consistently appears at specific locations in the reference genome. This noise can arise from:

Mis-mapping in low-complexity regions: Repetitive sequences with low information content can lead to reads mapping to incorrect locations.
PCR noise in homopolymer regions: Regions with long stretches of the same nucleotide (e.g., AAAAA) can introduce errors during PCR amplification.

To determine whether a variant should be filtered, the systematic noise filter compares the observed variant's allele frequency (AF) to the noise level at the matching locus in the systematic noise file. Variants are filtered if their AF is not statistically sufficiently higher than the recorded noise.

Note that the systematic noise filter specifically aims to remove noise, not germline variants; however, it may inadvertently filter some germline variants. For this reason, it is not ideal to evaluate the systematic noise file on germline admixture datasets.

Newer versions of the systematic noise file include allele-specific information along with two columns for noise frequency: one for the "mean" noise and one for the "max" noise. During a VC run, DRAGEN automatically detects the input sample type as either WGS or WES/panel and applies the appropriate noise values based on sample type and run context. For WGS data, the "max" noise is used by default; for WES/panel data or whenever UMI is enabled, the "mean" noise is used.
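The selection between the "mean" and "max" noise columns described above amounts to a simple rule, sketched here. The function name and the sample-type labels are hypothetical; DRAGEN performs this detection internally.

```python
def noise_column(sample_type: str, umi_enabled: bool = False) -> str:
    """Return which noise column ('mean' or 'max') applies in this sketch.

    sample_type is 'WGS' or 'WES/panel' (labels chosen for illustration).
    """
    if umi_enabled:
        return "mean"  # UMI runs use the mean noise regardless of sample type
    if sample_type == "WGS":
        return "max"
    return "mean"      # WES and panel data


print(noise_column("WGS"))
print(noise_column("WES/panel"))
print(noise_column("WGS", umi_enabled=True))
```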

WES and WGS prebuilt systematic noise files are available for download (see below).

Custom panels will require custom noise files. It is recommended to use normal samples sequenced on the same instrument type and using the same library prep. Building your own noise file is especially helpful for clean UMI samples that tend to have less noise than WGS/WES samples. To generate a noise file it is recommended to use approximately 30-70 normal samples, although fewer normal samples (1-10) can still be used to generate useful noise files.

The systematic noise filter is used in the DRAGEN tumor-only or tumor-normal pipeline by adding --vc-systematic-noise NOISE_FILE_PATH.

| Option | Description |
| --- | --- |
| --vc-systematic-noise | Specifies a systematic noise BED file. If a somatic variant does not pass the AQ threshold, the variant is marked as 'systematic_noise' in the FILTER column of the output VCF. |
| --vc-systematic-noise-filter-threshold | Sets the AQ threshold. Higher values filter more aggressively. The default threshold is 10 for tumor-normal and 60 for tumor-only. The valid range is 0-100. For tumor-normal runs, the threshold may be set higher (e.g. to 60) to improve specificity at the possible cost of some sensitivity. |
| --vc-systematic-noise-filter-threshold-in-hotspot | Sets the AQ threshold to use in hotspot regions, where one may want to filter less aggressively than in the rest of the genome. The default threshold is 10 for tumor-normal and 20 for tumor-only. |
| --vc-allele-specific-systematic-noise | Applies systematic noise in an allele-specific manner when allele information is available. This setting is ignored for v1.x.x noise files. (Default=true) |
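To make the interaction of the default thresholds concrete, the following sketch looks up the default AQ threshold for a pipeline mode and region and decides whether a call would be marked 'systematic_noise'. It is an illustration of the documented defaults only. The function names are hypothetical, and the strict "less than" comparison and the range check on overrides are assumptions, since the text does not say how a value equal to the threshold is treated.

```python
# Default AQ thresholds as documented: (pipeline mode, in hotspot) -> threshold.
DEFAULT_AQ_THRESHOLDS = {
    ("tumor-normal", False): 10,
    ("tumor-only", False): 60,
    ("tumor-normal", True): 10,
    ("tumor-only", True): 20,
}


def default_aq_threshold(mode: str, in_hotspot: bool = False) -> int:
    """Look up the documented default AQ threshold."""
    try:
        return DEFAULT_AQ_THRESHOLDS[(mode, in_hotspot)]
    except KeyError:
        raise ValueError(f"unknown pipeline mode: {mode!r}") from None


def is_systematic_noise(aq, mode, in_hotspot=False, threshold=None):
    """Return True if a call with score `aq` would fail the AQ threshold.

    Assumption: a call fails when aq is strictly below the threshold.
    """
    if threshold is None:
        threshold = default_aq_threshold(mode, in_hotspot)
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be within 0-100")
    return aq < threshold


# The same AQ of 30 fails the tumor-only default but passes in a hotspot.
print(is_systematic_noise(30.0, "tumor-only"))
print(is_systematic_noise(30.0, "tumor-only", in_hotspot=True))
```

Raising the tumor-normal threshold to 60, as the option description suggests, corresponds to calling `is_systematic_noise(aq, "tumor-normal", threshold=60)`.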

Prebuilt Systematic Noise BED Files

Prebuilt systematic noise files can be downloaded from the DRAGEN Software Support Site page.

Somatic Systematic Noise Baseline Collection v2.0.0 noise files include allele-specific information to better preserve sensitivity when systematic noise filtering is enabled. Each v2.0.0 noise file includes both "mean" and "max" noise in separate columns, with the appropriate noise applied automatically based on the auto-detected input type and run context.

The latest noise files (v2.0.0) contain more columns than earlier noise files and are therefore incompatible with versions of DRAGEN prior to v4.3. Older noise files are still supported in the current version of DRAGEN; however, they lack allele-specific information, so noise filtering is applied by position only, as was the default in DRAGEN v4.2 and earlier.

The default WES and WGS noise files were generated using a combination of Nextera and TruSeq samples (with and without PCR). There are also HEME-specific and FFPE-specific hg38 WGS noise files. For details, refer to SNV Systematic Noise Files.

Custom Systematic Noise Files

The BaseSpace Sequence Hub DRAGEN Baseline Builder App or the DRAGEN Systematic Noise File Builder Pipeline on ICA can be used to build systematic noise files in the cloud.

For example command lines that build a custom noise file, refer to the respective DRAGEN recipes: DRAGEN Recipes.

| Option | Description |
| --- | --- |
| --build-sys-noise-vcfs-list | Text file containing the paths of normal VCFs. Specify the full VCF file paths. List one file per line. |
| --build-sys-noise-germline-vaf-threshold | Variant calls with a VAF higher than this threshold are considered germline and do not contribute to the noise estimate. The default threshold of 1 effectively disables this option. (Default 1) |
| --build-sys-noise-use-germline-tag | Ensures that variants tagged by --vc-enable-germline-tagging=true are not counted as noise. (Default true) |
| --build-sys-noise-min-sample-cov | Minimum coverage at a site for a sample to be used towards noise estimation. At low coverage, estimated allele frequencies become less reliable. Accurate AF estimation is important for germline variant detection, and also for noise detection when using MAX noise. (Default 5) |
| --build-sys-noise-min-supporting-samples | Minimum number of samples with noise at a position for the position to be considered systematic noise. (Default 1) |
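The list file expected by --build-sys-noise-vcfs-list is a plain text file with one full VCF path per line. A minimal sketch for generating it from a directory of normal VCFs is shown here; the helper name, the `*.vcf.gz` file pattern, and the file names in the demonstration are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def write_vcf_list(vcf_dir, list_path, pattern="*.vcf.gz"):
    """Write the absolute path of every matching VCF to list_path, one per line."""
    vcfs = sorted(p.resolve() for p in Path(vcf_dir).glob(pattern))
    if not vcfs:
        raise FileNotFoundError(f"no files matching {pattern!r} in {vcf_dir}")
    Path(list_path).write_text("".join(f"{p}\n" for p in vcfs))
    return vcfs


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("normal_b.vcf.gz", "normal_a.vcf.gz", "notes.txt"):
        (Path(tmp) / name).touch()
    list_file = Path(tmp) / "normal_vcfs.txt"
    written = write_vcf_list(tmp, list_file)
    lines = list_file.read_text().splitlines()

print(len(lines), "VCF paths written")
```

Resolving each path to an absolute path follows the requirement to specify full VCF file paths; sorting only makes the output reproducible.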

Germline Tagging in the Tumor-Only Pipeline

When running DRAGEN for tumor-only somatic calling, potential germline variants can be tagged in the INFO field with 'GermlineStatus' using population databases. Current databases include 1KG and both the exome and genome sequencing data from gnomAD. The following options are available for this feature:

--vc-enable-germline-tagging Enable germline tagging. The default is 'false'. In a tumor-only analysis, this option must either be set to 'true' (recommended), or germline tagging must be explicitly disabled with --vc-skip-germline-tagging=true (not recommended). When --vc-enable-germline-tagging is set to 'true', a variant annotation data directory must also be passed in as follows:
- --variant-annotation-data Nirvana annotation database (downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)

The following additional options control how germline variants are defined:

--germline-tagging-db-threshold The minimum alternative allele count across population databases for a variant to be defined as germline (default=50).
--germline-tagging-pop-af-threshold The minimum population allele frequency for a variant to be defined as germline. Once specified, this will override the input from --germline-tagging-db-threshold.
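The precedence between the two thresholds can be sketched as follows. This is an illustration of the override rule only, not DRAGEN's implementation: the function and parameter names are hypothetical, and how allele counts and frequencies are aggregated across the individual databases is not specified here, so the sketch takes a single count and a single frequency as inputs.

```python
def is_tagged_germline(alt_allele_count, pop_af,
                       db_threshold=50, pop_af_threshold=None):
    """Decide whether a variant would be defined as germline.

    If a population AF threshold is supplied, it overrides the allele-count
    threshold, as described for --germline-tagging-pop-af-threshold.
    Both thresholds are treated as minimums (inclusive).
    """
    if pop_af_threshold is not None:
        return pop_af >= pop_af_threshold
    return alt_allele_count >= db_threshold


# Count-based rule with the documented default of 50.
print(is_tagged_germline(alt_allele_count=120, pop_af=0.0004))
# AF-based rule takes over once a population AF threshold is given.
print(is_tagged_germline(alt_allele_count=120, pop_af=0.0004,
                         pop_af_threshold=0.01))
```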

1    11301714    .    A    G    .    PASS    DP=3626;MQ=249.61;FractionInformativeReads=0.974;AQ=100.00;GermlineStatus=Germline_DB    GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB    0/1:64.73:1772,1758:0.498:872,901:900,857:3530:846,926,843,915:894,878,874,884

The total numbers of variants tagged as germline and as somatic in the VCF are written to a metrics file named "*.vc_germline_tagging_metrics.csv".
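Downstream filtering on the tag only requires reading the INFO column. The sketch below parses the example record shown above (assuming tab-separated columns, as in a standard VCF) and checks for GermlineStatus=Germline_DB. That is the only tag value shown in this section, so other possible values are deliberately not handled.

```python
# The example record from this section, written with tab-separated columns.
record = "\t".join([
    "1", "11301714", ".", "A", "G", ".", "PASS",
    "DP=3626;MQ=249.61;FractionInformativeReads=0.974;AQ=100.00;"
    "GermlineStatus=Germline_DB",
    "GT:SQ:AD:AF:F1R2:F2R1:DP:SB:MB",
    "0/1:64.73:1772,1758:0.498:872,901:900,857:3530:846,926,843,915:"
    "894,878,874,884",
])


def parse_info(info_field):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    info = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True
    return info


fields = record.split("\t")
info = parse_info(fields[7])  # INFO is the 8th VCF column
tagged = info.get("GermlineStatus") == "Germline_DB"

print(fields[0], fields[1], info.get("GermlineStatus"), tagged)
```

For production use, an established VCF parser is preferable to manual string splitting; the manual version is used here only to keep the example dependency-free.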

Mutation Annotation Format (MAF) Conversion in Tumor-Only and Tumor-Normal Pipelines

When enabling DRAGEN for tumor-only or tumor-normal pipelines with Nirvana Annotation, the Nirvana JSON output can be converted into a Mutation Annotation Format (MAF) file. The MAF file is a tab-separated values file containing aggregated mutation information and will be saved to the output directory that you specify. You can enable MAF conversion directly as part of the somatic small variant calling workflow (integrated mode) or separately by providing a path to a VCF file or annotated JSON file (standalone mode).

When running MAF conversion as part of the somatic small variant calling workflow, the following options are required for this feature:

Annotation options:

--enable-variant-annotation=true Enable variant annotation
--variant-annotation-data Nirvana annotation database (see Nirvana).

MAF conversion options:

--enable-maf-output=true Enable MAF output
--maf-transcript-source Desired transcript source, RefSeq or Ensembl

Additional standalone options (when running without the variant caller):

--maf-input-vcf Input VCF with the following form: <path>/<file_name>.hard-filtered.vcf.gz
--maf-input-json Input JSON with the following form: <path>/<file_name>.hard-filtered.annotated.json.gz

Please note that when specifying standalone mode with VCF input, you must also enable annotation options to generate the JSON file. Conversely, annotation options should not be specified when running standalone mode with an input annotated JSON file.

Optional settings:

--maf-include-non-pass-variants Include all variants, including non-PASS variants, in the MAF output file.

By default, the MAF output contains only variants that have the PASS filter in the hard-filtered VCF file.
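The option requirements for the integrated and standalone modes can be checked before launching a run. The following sketch encodes only the rules stated in this section; the function name and the dictionary representation of the command line are hypothetical, and options that appear only in the example command lines (such as --variant-annotation-assembly) are not checked.

```python
ANNOTATION_OPTIONS = ("--enable-variant-annotation", "--variant-annotation-data")


def check_maf_options(opts):
    """Return a list of problems found in a dict of option name -> value."""
    problems = []
    if opts.get("--enable-maf-output") != "true":
        problems.append("--enable-maf-output=true is required")
    if opts.get("--maf-transcript-source") not in ("RefSeq", "Ensembl"):
        problems.append("--maf-transcript-source must be RefSeq or Ensembl")

    annotation_given = any(name in opts for name in ANNOTATION_OPTIONS)
    annotation_complete = (
        opts.get("--enable-variant-annotation") == "true"
        and "--variant-annotation-data" in opts
    )
    if "--maf-input-json" in opts:
        # Standalone mode with an annotated JSON: annotation already exists.
        if annotation_given:
            problems.append(
                "annotation options should not be specified with --maf-input-json"
            )
    elif not annotation_complete:
        # Integrated mode and standalone VCF mode both need annotation.
        problems.append(
            "--enable-variant-annotation=true and --variant-annotation-data "
            "are required"
        )
    return problems


standalone_vcf = {
    "--enable-maf-output": "true",
    "--maf-transcript-source": "RefSeq",
    "--maf-input-vcf": "/path/to/vcf/file",
}
print(check_maf_options(standalone_vcf))
```

In this example the standalone VCF run is flagged because the annotation options are missing, which matches the note that VCF input needs annotation to produce the JSON file.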

Example command lines:

MAF output from BAM input and variant caller:

bin/dragen --output-dir=/path/to/output/dir --output-file-prefix=prefix_name --ref-dir=/path/to/ref/dir --enable-map-align=false --enable-sort=false --enable-variant-caller=true -b /path/to/normal/bam --tumor-bam-input /path/to/tumor/bam --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

MAF output from output directory and output file prefix, where the output directory contains a VCF file prefixed by the output file prefix:

bin/dragen --output-dir=/path/to/output/dir/with/vcf --output-file-prefix=prefix_of_vcf --ref-dir=/path/to/ref-dir --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

MAF output from source VCF file:

bin/dragen --ref-dir=/path/to/ref/dir --enable-variant-annotation=true --variant-annotation-assembly <GRCh37/GRCh38> --variant-annotation-data /path/to/annotation/data --enable-maf-output=true --maf-input-vcf=/path/to/vcf/file --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

Note: This command line will output the MAF file in the same location as the input VCF file. To specify a directory for output, add --output-dir and --output-file-prefix options.

MAF output from source annotated JSON file:

bin/dragen --ref-dir=/path/to/ref/dir --enable-maf-output=true --maf-input-json=/path/to/annotated/json/file --maf-transcript-source <Ensembl/RefSeq> --maf-include-non-pass-variants <true/false>

Note: This command line will output the MAF file in the same location as the input annotated JSON file. To specify a directory for output, add the --output-dir and --output-file-prefix options.
