Contamination Detection

The DRAGEN cross-sample contamination module estimates the fraction of sequencing reads originating from another human sample using a probabilistic mixture model.

DRAGEN provides two contamination detection modes. The appropriate mode depends on sample type, coverage, and expected contamination level.


Quick Decision Guide

What are you running? General germline or somatic (default)

  • Sample characteristics: ≥ 20× coverage; FFPE/CNV/LOH allowed

  • Setting to use: --qc-detect-contamination=true

  • What DRAGEN does: Runs GATK-based model; automatically falls back to legacy VerifyBamID-like model if GATK fails (e.g. high contamination)

What are you running? RNA-seq

  • Sample characteristics: Variable expression and coverage

  • Setting to use: --qc-detect-contamination=true

  • What DRAGEN does: Runs GATK-based model in experimental mode; results are best-effort and qualitative

What are you running? Low coverage germline

  • Sample characteristics: Low coverage (~10×), no FFPE/CNV/LOH

  • Setting to use: --qc-cross-cont-vcf

  • What DRAGEN does: Runs legacy VerifyBamID-like model directly; robust at low coverage
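
The decision guide can be sketched as a small helper. This is an illustration only: the Sample class, the contamination_option function, the "dna"/"rna" labels, and the 20× cutoff used as a hard boundary are assumptions made for this example, not part of DRAGEN. The helper returns only the option name taken from the guide.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    kind: str              # "dna" or "rna" (hypothetical labels for this sketch)
    mean_coverage: float   # mean depth, in x
    clean_germline: bool   # True if no FFPE damage, CNV, or LOH is expected


def contamination_option(sample: Sample, low_cov_cutoff: float = 20.0) -> str:
    """Return the option name suggested by the decision guide.

    The hard cutoff is an assumption; the guide only gives ">= 20x" for the
    default mode and "~10x" for the legacy mode.
    """
    if sample.kind == "rna":
        # RNA-seq runs the GATK-based model in experimental mode.
        return "--qc-detect-contamination=true"
    if sample.clean_germline and sample.mean_coverage < low_cov_cutoff:
        # Low-coverage clean germline: run the legacy model directly.
        return "--qc-cross-cont-vcf"
    # Everything else: GATK-based model with automatic fallback.
    return "--qc-detect-contamination=true"
```

For example, contamination_option(Sample("dna", 10.0, True)) selects the legacy option, while an FFPE sample at the same depth still maps to the default mode because the legacy model is not robust to FFPE.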


Fallback Mechanism

When --qc-detect-contamination=true is specified, DRAGEN:

  1. First attempts contamination estimation using the GATK-based model

  2. Automatically falls back to the legacy VerifyBamID-like model if the GATK-based model fails to converge, most commonly at high contamination levels

No additional settings are required to enable fallback behavior.
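
The two-step behavior above follows a common try-then-fall-back pattern, sketched below. This does not show DRAGEN internals: the estimator callables, the use of None to signal a failed estimate, and the model labels are hypothetical stand-ins.

```python
from typing import Callable, Optional, Tuple

# A hypothetical estimator returns a contamination fraction, or None on failure.
Estimator = Callable[[], Optional[float]]


def estimate_with_fallback(primary: Estimator,
                           legacy: Estimator) -> Tuple[Optional[float], str]:
    """Try the primary model first; use the legacy model only if it fails."""
    estimate = primary()
    if estimate is not None:
        return estimate, "gatk-based"
    # Primary model produced no estimate (for example, it did not converge).
    return legacy(), "legacy"
```

Returning the model label alongside the estimate lets a caller record which model produced the value, which matters because the two models have different strengths.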


GATK-Based Contamination Detection (Default)

Use for: Germline, tumor-only, and tumor-normal workflows. This is the recommended default.

Enable

--qc-detect-contamination=true

Population Marker Resources

Reference builds: hg19, hg38, hs37d5

Markers can also be supplied explicitly:

Behavior

  • Accounts for FFPE damage, copy number variation (CNV), and loss of heterozygosity (LOH)

  • Empirically adjusts base qualities to reduce FFPE deamination and oxidation noise

  • Optimized for low-to-moderate contamination levels


RNA-seq Support (Experimental)

--qc-detect-contamination=true can be run on RNA-seq data.

Limitations

  • Less stable than DNA due to expression and coverage variability

  • Results are qualitative indicators only

  • Feature is experimental


Legacy Contamination Detection (VerifyBamID-like)

Use for: Clean germline samples, especially at low coverage (~10×), or when fallback occurs.

Enable

--qc-cross-cont-vcf

Population Marker Resources

Reference builds: hg19, hg38, hs37d5

Behavior

  • Models the sample as a mixture of individuals

  • Performs well on clean germline data

  • Robust at low coverage

  • Can remain informative at high contamination

  • Not robust to FFPE, CNVs, or extended runs of homozygosity (ROH)


Output and Interpretation

The contamination estimate is reported as a fraction. For example, a reported value of 0.011 corresponds to 1.1% contamination.

Interpretation Guidance

  • Contamination should be well below the minimum allele frequency of interest

  • Example: at 1% contamination, variants below ~5% AF may be unreliable

  • The metric saturates at ~30% contamination
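
The guidance above can be turned into a quick check, sketched below. The function name, the returned keys, and in particular the 5× safety factor are assumptions: the factor is extrapolated from the single 1% / ~5% AF example and is not a documented rule. The 0.30 saturation threshold is likewise an approximation of the "~30%" figure.

```python
def interpret_contamination(fraction: float,
                            min_af_of_interest: float,
                            safety_factor: float = 5.0,
                            saturation: float = 0.30) -> dict:
    """Summarize a contamination estimate given as a fraction (e.g. 0.011)."""
    # Assumed rule of thumb: AFs below fraction * safety_factor may be unreliable.
    af_floor = fraction * safety_factor
    return {
        "percent": fraction * 100.0,
        "af_floor": af_floor,
        "min_af_reliable": min_af_of_interest >= af_floor,
        # At or above the saturation level the estimate stops being informative.
        "near_saturation": fraction >= saturation,
    }
```

With a fraction of 0.01 and a minimum allele frequency of interest of 0.03, min_af_reliable is False, matching the example in the list.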


Coverage and Validity Requirements

Contamination estimation requires ≥100 valid pileups.

A pileup is valid if:

  • Coverage ≥ 10×

  • At least 95% of its reads are valid

Soft-clipped reads are excluded. Excessive soft clipping is often caused by untrimmed adapters. If contamination is reported as NA, inspect marker loci in IGV and correct adapter issues upstream.
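
The validity rules above can be sketched as follows. The Pileup class and both functions are hypothetical, and the sketch assumes that coverage is counted over all reads at the locus and that the valid-read ratio is valid reads divided by that depth. The source does not spell out either detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pileup:
    depth: int        # reads covering the marker locus
    valid_reads: int  # reads passing the validity filters (e.g. not soft-clipped)


def is_valid_pileup(p: Pileup, min_cov: int = 10,
                    min_valid_ratio: float = 0.95) -> bool:
    """A pileup is valid if it is deep enough and mostly made of valid reads."""
    if p.depth == 0 or p.depth < min_cov:
        return False
    return p.valid_reads / p.depth >= min_valid_ratio


def can_estimate(pileups: List[Pileup], min_valid_pileups: int = 100) -> bool:
    """Estimation needs at least min_valid_pileups valid pileups; otherwise NA."""
    return sum(is_valid_pileup(p) for p in pileups) >= min_valid_pileups
```

A locus with 20 reads of which 15 are valid fails the default 0.95 ratio, which is the typical picture when untrimmed adapters cause heavy soft clipping.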


Legacy Model–Specific Settings

  • --qc-contam-min-cov: Minimum coverage per pileup (default: 10).

  • --qc-contam-min-valid-read-ratio: Minimum fraction of valid reads (default: 0.95). Can be lowered to ~0.75, but adapter trimming issues should be fixed instead.


Key Takeaways

  • Use GATK-based contamination detection for most workflows

  • Use the legacy model for low-coverage clean germline samples

  • High contamination triggers automatic fallback when using --qc-detect-contamination=true

  • RNA-seq support is experimental
