CNV Preprocessing



Target Counts

The target counts stage is the first processing stage of the DRAGEN CNV pipeline. This stage bins the alignments into intervals. The primary analysis format for CNV processing is the target counts file, which contains the feature signals extracted from the alignments for use in downstream processing. The binning strategy, the interval sizes, and the interval boundaries are controlled by the target counts generation options and by the normalization technique used.

When working with whole genome sequence data, the intervals are autogenerated from the reference hashtable. Only the primary contigs from the reference hashtable are considered for binning. You can specify additional contigs to bypass with the --cnv-skip-contig-list option.

With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed option to determine the intervals for analysis. The target BED file should contain intervals that match those in the panel of normals file. If the intervals in the target BED file and the panel of normals file do not match, DRAGEN will use the target intervals from the panel of normals file.

The target counts stage generates a *.target.counts.gz file. You can use the file later in place of any BAM or CRAM by specifying the file with the --cnv-input or --cnv-tumor-input option for the normalization stage. The *.target.counts.gz file is an intermediate file for the DRAGEN CNV pipeline and should not be modified.

Further details are available in the section.

Whole Genome

If the samples are whole genome, the effective target interval width is specified with the --cnv-interval-width option. The higher the coverage of a sample, the finer the resolution that can be detected. This option is important when running with a panel of normals because all samples must have matching intervals. For self-normalization, the actual width of a given target interval might be larger than the specified value.

The default value for WGS is 1000 bp, which is suited to a sample coverage of ≥ 30x.

WGS Coverage per Sample    Recommended Resolution* (bp)
5                          10000
10                         5000
>= 30                      1000

Using a cnv-interval-width of less than 250 bp for WGS analysis can drastically increase runtime.
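The coverage table above can be turned into a small lookup. The following is a minimal sketch, not part of DRAGEN: the function name is hypothetical, and the handling of coverage values between the listed tiers (falling back to the coarser width of the tier below) is an assumption, since the table only lists 5x, 10x, and ≥ 30x.

```python
def recommended_interval_width(coverage):
    """Return a --cnv-interval-width value (bp) for a given WGS coverage.

    Tiers come from the table: 5x -> 10000, 10x -> 5000, >= 30x -> 1000.
    Assumption: coverage between two tiers uses the coarser width of the
    lower tier. Coverage below 5x has no recommendation in the table.
    """
    if coverage >= 30:
        return 1000
    if coverage >= 10:
        return 5000
    if coverage >= 5:
        return 10000
    raise ValueError("no recommended resolution below 5x coverage")


if __name__ == "__main__":
    for cov in (5, 10, 30):
        print(cov, recommended_interval_width(cov))
```

The returned value would be passed as --cnv-interval-width; remember that all samples in a panel of normals must use the same width.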

The intervals are autogenerated for every primary contig in the reference. Only references that have the UCSC or GRC convention are supported. For example, chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y. You can specify a list of contigs to skip by using the --cnv-skip-contig-list option. This option takes a comma-separated list of contig identifiers. The contig identifiers must match the reference hashtable that you are using. By default, only the mitochondrial chromosomes are skipped. Non-primary contigs are never processed.

For example, to skip chromosomes M, X, and Y, use the following option:

--cnv-skip-contig-list "chrM,chrX,chrY"

Whole Exome

If the samples are whole exome samples, supply a target BED file with the --cnv-target-bed $TARGET_BED option. The intervals in the target BED file indicate regions where alignments are expected based on the target capture kit. The BED file intervals are further split into intervals of smaller size, depending on the value of cnv-interval-width.

To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.
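Before passing a BED file to --cnv-target-bed, it can help to confirm that it has no header and at least three columns. The sketch below is a hypothetical pre-check, not a DRAGEN utility. It assumes tab-separated fields and treats any line whose second and third columns are not integers (for example a "track" or "browser" line) as a header.

```python
def read_target_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples.

    Only the first three columns are used; later columns are ignored.
    Raises ValueError on header-like or malformed lines.
    Assumption: fields are tab-separated.
    """
    intervals = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            raise ValueError(f"line {number}: not a BED record (headers are not allowed)")
        start, end = int(fields[1]), int(fields[2])
        if end <= start:
            raise ValueError(f"line {number}: end must be greater than start")
        intervals.append((fields[0], start, end))
    return intervals


if __name__ == "__main__":
    print(read_target_bed(["chr1\t100\t200\tgeneA\t0\t+", "chr2\t50\t80"]))
```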

Target Counts Options

The following options control the generation of target counts.

  • --cnv-counts-method --- Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means if an alignment overlaps any part of the target bin, the alignment is counted for that bin. In the self-normalization mode, the default counting method is start.

  • --cnv-min-mapq --- Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted.

  • --cnv-target-bed --- Specifies a properly formatted BED file that indicates the target intervals to sample coverage over. For use in WES analysis.

  • --cnv-interval-width --- Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis.

  • --cnv-skip-contig-list --- Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. The default contigs that are skipped, if not specified, are chrM,MT,m,chrm.

  • --cnv-filter-duplicate-alignments --- Filters duplicate-marked alignments during target counts generation when set to true. The default setting is true unless map/align is enabled and duplicate marking is disabled.

Target counts options are recorded in the header of each counts file to facilitate review and validation of a panel of normals. If the PON counts were generated with different count options than the case sample, DRAGEN returns an option validation error.
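To make the three --cnv-counts-method values concrete, here is a minimal sketch of how an alignment could be assigned to bins under each method. It is illustrative only: the function name is hypothetical, and the coordinate convention (0-based, half-open intervals) and the integer midpoint are assumptions, not DRAGEN's documented behavior.

```python
def bins_for_alignment(start, end, bins, method="overlap"):
    """Return the indices of the bins in which an alignment is counted.

    start, end: alignment span, 0-based half-open (assumption).
    bins: list of (bin_start, bin_end) tuples, 0-based half-open.
    method: "overlap", "start", or "midpoint".
    """
    if method == "overlap":
        # Counted in every bin the alignment overlaps by at least one base.
        return [i for i, (b0, b1) in enumerate(bins) if start < b1 and end > b0]
    if method == "start":
        position = start
    elif method == "midpoint":
        position = (start + end) // 2
    else:
        raise ValueError(f"unknown counting method: {method}")
    # Counted only in the bin that contains the chosen position.
    return [i for i, (b0, b1) in enumerate(bins) if b0 <= position < b1]


if __name__ == "__main__":
    bins = [(0, 1000), (1000, 2000), (2000, 3000)]
    for method in ("overlap", "start", "midpoint"):
        print(method, bins_for_alignment(950, 1100, bins, method))
```

An alignment that straddles a bin boundary is counted twice under overlap but once under start or midpoint, which is one reason case and PON counts must use the same method.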

Filter Duplicate Alignments

PCR duplicates are often considered noise in coverage depth information. The --cnv-filter-duplicate-alignments option controls whether DRAGEN CNV includes or excludes duplicate-marked alignments when counting alignments. Filtering relies on the duplicate bit (0x400) being set correctly in the SAM flag of each alignment.

If --enable-map-align=false, then duplicate marking should be present in the input file (pre-aligned BAM/CRAM). If --enable-map-align=true, then --enable-duplicate-marking=true should be set.

Note that CNV will wait for duplicate marking from the Map/Aligner which may increase overall run time.
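As a quick illustration of the 0x400 bit, the sketch below tests SAM flag values for the duplicate mark. The helper names are hypothetical; only the bit value comes from the SAM flag definition.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit for PCR or optical duplicates (decimal 1024)


def is_duplicate(flag):
    """Return True if the SAM flag has the duplicate bit set."""
    return bool(flag & DUPLICATE_FLAG)


def countable_flags(flags, filter_duplicates=True):
    """Keep the flags of alignments that would be counted.

    With filter_duplicates=True, duplicate-marked alignments are dropped.
    """
    if not filter_duplicates:
        return list(flags)
    return [flag for flag in flags if not is_duplicate(flag)]


if __name__ == "__main__":
    print(countable_flags([99, 1123, 147, 1171]))
```

If a pre-aligned BAM or CRAM was never duplicate-marked, no flag carries this bit and the filter silently removes nothing.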

Input format    enable-map-align    Required option
FASTQ           TRUE                --enable-map-align=true, --enable-duplicate-marking=true
BAM             TRUE                --enable-map-align=true, --enable-duplicate-marking=true
BAM             FALSE               --enable-map-align=false

Target Counts Dropout Regions

In the WGS case, where a BED file is not specified, the same intervals are generated each time for a given reference. The intervals take into account the mappability of the reference genome, using a k-mer uniqueness map created during hashtable generation.

Due to ambiguity that may arise from non-unique genomic loci, only regions corresponding to unique k-mers are considered. A position in the reference genome is marked as a unique k-mer if the k-mer starting at that position does not show up anywhere else in the reference genome (or non-unique, otherwise). Furthermore, if the k-mer contains any bases other than A, C, T or G, it is marked as non-unique.

For WGS samples, and in the absence of a cnv-target-bed file, the target intervals are autogenerated from the precomputed k-mer uniqueness map for the input reference hashtable and from the cnv-interval-width option, which defaults to 1000 bp. The cnv-interval-width option determines the minimum number of unique k-mer positions required in an interval. The interval length has an upper bound: if the interval grows to more than double cnv-interval-width without reaching the required count of unique k-mer positions, the interval is discarded and the process starts again at the next genomic position. Discarded regions are called "dropout" regions and are reported with the exclusion reason NON_KMER_UNIQUE in the *.cnv.excluded_intervals.bed.gz file.

A dropout region is a complex region that does not count alignments and results in an interval missing from the analysis. Dropout regions include centromeres, telomeres, and low complexity regions. If there is sufficient signal in the flanking regions, an event can still span these dropout regions, even if alignment counting does not occur in the regions. The event is handled by the segmentation stage.
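The interval generation described above can be sketched as a single pass over a per-position uniqueness map. This is an illustrative approximation, not DRAGEN's implementation. It assumes that after a discard the scan resumes just past the discarded span, and that a trailing span that never reaches the required count is also treated as a dropout; the source does not specify either detail.

```python
def generate_intervals(unique, width):
    """Split a contig into target intervals and dropout regions.

    unique: sequence of 0/1 (or bool), 1 where the k-mer starting at that
            position is unique in the reference.
    width:  required number of unique k-mer positions per interval
            (the role played by cnv-interval-width).
    Returns (intervals, dropouts) as lists of half-open (start, end) tuples.
    """
    intervals, dropouts = [], []
    n = len(unique)
    start = 0
    while start < n:
        count = 0
        end = start
        # Extend until enough unique positions are collected, or the
        # interval exceeds twice the requested width.
        while end < n and count < width and (end - start) <= 2 * width:
            count += unique[end]
            end += 1
        if count >= width:
            intervals.append((start, end))
        else:
            dropouts.append((start, end))  # NON_KMER_UNIQUE-style exclusion
        start = end  # assumption: resume after the processed span
    return intervals, dropouts


if __name__ == "__main__":
    uniqueness = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
    print(generate_intervals(uniqueness, width=2))
```

In the example, the last interval spans three positions for a requested width of two, which mirrors the note that an actual interval can be wider than the specified value.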

Rescue of target counts in Segmental Duplications

The germline CNV workflow can be extended to call copy number alterations in a curated subset of segmentally duplicated regions. Segmental duplications are large blocks of DNA (≥ 1 kb) characterized by a high degree of sequence identity (> 90%) at the nucleotide level. This poses a challenge for traditional approaches, and such regions are usually excluded.

This extension complements the original germline CNV workflow by using a tailored algorithm to compute the normalized coverage in such regions, which is then injected before the segmentation step and becomes part of the main CNV workflow in downstream steps. We currently recommend WGS data aligned to a supported human reference genome (currently only hg38) with at least 30x coverage. See below for additional requirements.

Supported duplications

The following pairs of genes defining Segmental Duplications are included:

CYP2A6      CYP2A7
FCGR3A      FCGR3B
RHD         RHCE
STRC        STRCP1
ACSM2A      ACSM2B
ACTR3B      ACTR3C
AQP12A      AQP12B
ASAH2       ASAH2B
CCDC74A     CCDC74B
CD177       CD177p1
CD8B        CD8B2
CFH1        CFHR1
CYP4A11     CYP4A22
DHX40       DHX40P1
EIF5AL1     EIF5AP4
FCGR2A      FCGR2C
FFAR3       GPR42
FOLH1       FOLH1B
FRMPD2      FRMPD2B
GPAT2       GPAT2P1
GSTT2B      GSTT2
DDT         DDTL
HCAR2       HCAR3
HSPA1A      HSPA1B
KRT81       KRT86
LGALS7      LGALS7B
MRPL45      MRPL45P2
MSTO1       MSTO2p
MUC20       MUC20P1
MZT2A       MZT2B
OTOA        OTOAp1
PDPR        PDPR2P
PIEZ02      ENST00000591853.1
ZP3         POMZP3
PRAMEF7     PRAMEF8
PROS1       PROS2P
RMND5A      ANAPC1P2
ROCK1       ROCK1p1
SERPINB3    SERPINB4
SYT3        ZNF473CR
TBC1D26     TBC1D28
TOP3B       TOP3BP1
TUBA3D      TUBA3E
ZNF443      ZNF799

Extension requirements

This extension is enabled by default in the germline CNV workflow (ASCN workflow currently unsupported). However, it requires:

  • Normalization set to self-normalization (--cnv-enable-self-normalization=true).

  • GC bias correction enabled (--cnv-enable-gcbias-correction=true).

  • Counts method set to start (--cnv-counts-method=start).

  • Interval width not greater than 10kb. However, we recommend using the cnv-interval-width default (1kb) for best performance.

  • A supported reference genome build as input (currently hg38-based references only).

If necessary, the extension can be disabled by setting --cnv-enable-segdups-extension to false.

Algorithm

  • For each duplicated region, the extension collects all reads that fall on the two homologous intervals of the pair and computes the normalized joint coverage (output to *.cnv.segdups.joint_coverage.tsv.gz).

  • Using sites that differ between the two homologous intervals, the extension computes the proportion of coverage to assign to the first and to the second interval (output to *.cnv.segdups.site_ratios.tsv.gz).

  • This proportion is used to redistribute the joint normalized coverage between the two homologous intervals.

  • The rescued intervals are output to the *.cnv.segdups.rescued_intervals.tsv.gz file for inspection and are automatically injected before the segmentation step.

    • During integration with the original intervals from the CNV caller, the rescued intervals have higher priority and replace all original intervals that they overlap.
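The redistribution and injection steps can be illustrated with a short sketch. It is not DRAGEN's implementation: the function names are hypothetical, the ratio is assumed to be the pooled fraction of reads supporting the first interval across all differentiating sites, and intervals are assumed to be half-open (start, end, value) tuples.

```python
def site_ratio(site_counts):
    """Fraction of coverage attributed to the first homologous interval.

    site_counts: list of (reads_for_first, reads_for_second), one entry per
    differentiating site. Assumption: counts are pooled across sites.
    """
    first = sum(a for a, _ in site_counts)
    second = sum(b for _, b in site_counts)
    total = first + second
    if total == 0:
        raise ValueError("no reads at differentiating sites")
    return first / total


def redistribute(joint_coverage, ratio):
    """Split the joint normalized coverage between the two intervals."""
    return joint_coverage * ratio, joint_coverage * (1.0 - ratio)


def merge_rescued(original, rescued):
    """Replace every original interval that overlaps a rescued interval."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    kept = [o for o in original if not any(overlaps(o, r) for r in rescued)]
    return sorted(kept + list(rescued))


if __name__ == "__main__":
    ratio = site_ratio([(30, 10), (30, 10)])
    print(redistribute(4.0, ratio))
    print(merge_rescued([(0, 1000, 1.0), (1000, 2000, 0.2)], [(1000, 2000, 1.5)]))
```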

B-Allele Counts (ASCN callers)

The ASCN callers require a source of heterozygous SNP sites to measure B-allele counts of the input sample. The following are the available modes, of which some are only available in somatic workflows.

  • cnv-population-b-allele-vcf --- Specify a population SNP VCF. This option is available for both the germline and the somatic workflows. In somatic, it can be used when a matched normal sample is not available and analysis must be performed in tumor-only mode.

  • cnv-normal-b-allele-vcf --- (Somatic-specific) Specify a matched normal SNV VCF. Use when a matched normal sample and the matched normal SNV VCF are available. To use this option, you must run the matched normal sample through the DRAGEN Germline workflow.

  • cnv-use-somatic-vc-baf --- (Somatic-specific) Set to true to enable DRAGEN to identify germline variants during a tumor/matched-normal run, rather than requiring a separate run on the normal sample. Use if and only if tumor and matched normal input are available. Also enable the Somatic SNV Caller via enable-variant-caller to use this option.

To specify a population SNP VCF, use the --cnv-population-b-allele-vcf option. To obtain a population SNP VCF, process an appropriate catalog of population variation, such as dbSNP, the 1000 Genomes Project, or other large cohort discovery efforts. A suitable example file for this parameter is "1000G_phase1.snps.high_confidence.vcf.gz" from the GATK resource bundle. Only high-frequency SNPs should be included. For example, include SNPs with minor allele population frequency ≥ 10% to limit run time impact and reduce artifacts. Specify the ALT allele frequency by adding AF=<alt frequency> to the INFO section of each record. Additional INFO fields might be present, but DRAGEN only parses and uses the AF field. Sites specified with --cnv-population-b-allele-vcf can be either heterozygous or homozygous in the germline genome from which the tumor genome derives.

The following is an example valid population SNP record (note: it needs to be tab-delimited):

chr1  51479  .  T  A  1000  PASS  AF=0.3253

DRAGEN considers the following requirements when parsing records from the b-allele VCF:

  • Only simple SNV sites are used.

  • Records must be marked PASS in the FILTER field.

  • If there are records with the same CHROM and POS values in the VCF, then DRAGEN uses the first record that occurs.
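The parsing rules above can be mirrored in a small pre-filter for a population SNP VCF. This is a sketch under stated assumptions, not DRAGEN's parser: the function name is hypothetical, records without an AF entry are skipped, and the first record seen at a CHROM and POS is the only one considered at that position, even if it later fails the other checks.

```python
def select_b_allele_records(lines):
    """Select usable records from tab-delimited VCF lines.

    Returns a list of (chrom, pos, ref, alt, af) tuples for records that are
    simple SNVs, marked PASS, and carry an AF value in INFO.
    """
    bases = {"A", "C", "G", "T"}
    seen = set()
    selected = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # malformed record
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        if (chrom, pos) in seen:
            continue  # only the first record at a position is considered
        seen.add((chrom, pos))
        if ref not in bases or alt not in bases:
            continue  # not a simple SNV
        if fields[6] != "PASS":
            continue
        af = None
        for entry in fields[7].split(";"):
            if entry.startswith("AF="):
                af = float(entry[3:])
        if af is None:
            continue  # assumption: records without AF are unusable
        selected.append((chrom, pos, ref, alt, af))
    return selected


if __name__ == "__main__":
    print(select_b_allele_records(["chr1\t51479\t.\tT\tA\t1000\tPASS\tAF=0.3253"]))
```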

Somatic-specific options

To specify a matched normal sample SNV VCF, use the --cnv-normal-b-allele-vcf option. The VCF file should come from processing the matched normal sample through the DRAGEN germline small variant caller with filters applied. Typically, this file name has a *.hard-filtered.vcf.gz extension. All records marked as PASS that are determined to be heterozygous in the normal sample are used to measure the b-allele counts of the tumor sample. You can also use the equivalent gVCF file (*.hard-filtered.gvcf.gz), but the processing time is significantly longer because of the number of records, most of which are not heterozygous sites.

If a tumor sample and matched normal input are available, you can avoid having to separately process the matched normal with the DRAGEN germline pipeline by specifying --cnv-use-somatic-vc-baf true. The Somatic SNV Caller must also be enabled. With this option, DRAGEN determines the germline heterozygous sites from the matched normal input and measures the b-allele counts of the tumor sample. The information is passed to the Somatic WGS CNV Caller, which simplifies the overall somatic workflow.

To enable --cnv-use-somatic-vc-baf, enter the following command line options.

  • --tumor-bam-input <TUMOR_BAM> --- Specify the tumor input

  • --bam-input <NORMAL_BAM> --- Specify the matched normal input

  • --enable-variant-caller true --- Enable the somatic SNV variant caller

  • --cnv-use-somatic-vc-baf true --- Enable somatic VC BAF

GC Bias Correction

GC bias describes the relationship between GC content and read coverage across a genome. Biases can arise from library prep, capture kits, sequencing system differences, and mapping, and they can make CNV events difficult to call. The DRAGEN GC bias correction module attempts to correct these biases.

Typical whole-exome capture kits have over 200,000 targets spanning the regions of interest. If your BED file has fewer than 200,000 targets, or if the target regions are localized to a specific region in the genome (such that GC bias statistics might be skewed), then GC bias correction should be disabled.

The following options control the GC bias correction module.

  • --cnv-enable-gcbias-correction --- Enable or disable GC bias correction when generating target counts. The default is true.

  • --cnv-enable-gcbias-smoothing --- Enable or disable smoothing the GC bias correction across adjacent GC bins with an exponential kernel. The default is true.

  • --cnv-num-gc-bins --- Specifies the number of bins for GC bias correction. Each bin represents the GC content percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.
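To show what binning by GC content means in practice, the sketch below applies a generic median-per-GC-bin correction. It is a common textbook approach and is not claimed to be the DRAGEN algorithm: the function name is hypothetical, smoothing across adjacent bins is not implemented, and only the allowed bin counts are taken from the option description.

```python
from statistics import median

ALLOWED_GC_BINS = {10, 20, 25, 50, 100}  # values accepted by --cnv-num-gc-bins


def gc_correct(counts, gc_fractions, num_bins=25):
    """Scale each interval count by overall median / median of its GC bin.

    counts:       per-interval read counts.
    gc_fractions: per-interval GC content in [0, 1].
    """
    if num_bins not in ALLOWED_GC_BINS:
        raise ValueError(f"num_bins must be one of {sorted(ALLOWED_GC_BINS)}")

    def bin_of(gc):
        # A GC fraction of exactly 1.0 falls into the last bin.
        return min(int(gc * num_bins), num_bins - 1)

    by_bin = {}
    for count, gc in zip(counts, gc_fractions):
        by_bin.setdefault(bin_of(gc), []).append(count)

    overall = median(counts)
    bin_median = {b: median(values) for b, values in by_bin.items()}

    corrected = []
    for count, gc in zip(counts, gc_fractions):
        m = bin_median[bin_of(gc)]
        # Leave the count unchanged when the bin median is zero.
        corrected.append(count * overall / m if m > 0 else count)
    return corrected


if __name__ == "__main__":
    print(gc_correct([100, 100, 200, 200], [0.30, 0.31, 0.60, 0.61], num_bins=10))
```

With few targets, or targets concentrated in one region, the per-bin medians rest on very little data, which is consistent with the advice to disable correction for small BED files.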

Normalization

The DRAGEN CNV pipeline supports two normalization algorithms:

  • Self-Normalization --- Estimates the autosomal diploid level from the sample under analysis to determine the baseline level to normalize by. Sex chromosomes and PAR regions are handled based on the sample sex.

  • Panel of Normals --- A reference-based normalization algorithm that uses additional matched normal samples to determine a baseline level from which to call CNV events. Here, matched means that the normal samples have undergone the same library prep and sequencing workflow as the case sample.

Which algorithm to use depends on the available data and the application. Use the following guidelines to select the mode of normalization.

Self-Normalization

  • Whole genome sequencing

  • Single sample analysis

  • Additional matched samples are not readily available

  • Simpler workflow via a single invocation

  • Only references with chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y naming conventions are supported.

Panel of Normals

  • Whole genome sequencing (excluding Germline WGS ASCN)

  • Whole exome sequencing

  • Targeted panels, including somatic panels

  • Additional matched samples are available

  • Nonhuman samples

The table below shows supported normalization methods for CNV workflow:

       Germline    Germline         Somatic          Somatic     Somatic
       non ASCN    ASCN             non ASCN         ASCN T/N    ASCN T/O
WGS    Self/PoN    Self             Not available    Self/PoN    Self/PoN
WES    PoN         Not available    PoN              PoN         PoN

Not available indicates the workflow is not supported.

Self Normalization

The DRAGEN CNV pipeline provides the self-normalization mode that does not require a reference sample or a panel of normals. To enable this mode, set --cnv-enable-self-normalization to true. Self-normalization mode bypasses the need to run two stages and can save time. It uses the statistics within the case sample to determine the baseline from which to make a call.

Because self normalization uses the statistics within the case sample, this mode is not recommended for WES or targeted sequencing analysis due to the potential for insufficient data.

The self-normalization mode is the recommended approach for whole-genome sequencing single sample processing. The pipeline continues through to the segmentation and calling stage to produce the final called events.

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-enable-self-normalization true

If you are running from a FASTQ sample, then the default mode of operation is self-normalization.

When operating in self-normalization mode, the --cnv-interval-width option used during the target counts stage becomes the effective interval width based on the number of unique k-mer positions. You typically do not have to modify this option.

Self-normalization autogenerates the target intervals to use during the analysis based on the reference genome and is only compatible with standard human references or similar mammalian references (chr1, chr2, chr3, ..., chrX, chrY).

To attempt self-normalization mode on non-standard human references, set the override --cnv-bypass-contig-check=true. Under this setting, the CNV caller performs a naive median normalization across all of the contigs within the reference genome. This feature is experimental and for research use only; no claims are made and no validation has been performed for its use.
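As a rough picture of normalizing against a baseline taken from the sample itself, the sketch below divides every interval count by the median autosomal count. It is a simplified illustration, not the DRAGEN method: the function names are hypothetical, sex chromosomes and PAR regions receive no special handling, and contigs are treated as autosomes when their name is a number with an optional "chr" prefix.

```python
from statistics import median


def is_autosome(contig):
    """Return True for names such as chr1 or 1 (assumption: numeric = autosome)."""
    name = contig[3:] if contig.startswith("chr") else contig
    return name.isdigit()


def self_normalize(counts_by_contig):
    """Divide all counts by the median autosomal count of the same sample."""
    autosomal = [
        count
        for contig, counts in counts_by_contig.items()
        if is_autosome(contig)
        for count in counts
    ]
    baseline = median(autosomal)
    if baseline <= 0:
        raise ValueError("autosomal baseline must be positive")
    return {
        contig: [count / baseline for count in counts]
        for contig, counts in counts_by_contig.items()
    }


if __name__ == "__main__":
    sample = {"chr1": [100, 100, 100], "chr2": [100, 50, 100], "chrX": [50, 50]}
    print(self_normalize(sample))
```

A ratio near 1.0 corresponds to the baseline level, so the chr2 interval at 0.5 and the chrX intervals stand out in the example.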

Panel of Normals

The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. Proper sample selection and preparation are critical for constructing an accurate and reliable CNV PON. High-quality germline samples—meeting stringent sequencing quality criteria such as a high percentage of bases over Q30, sufficient total read depth (yield), appropriate GC content, and minimal adapter contamination—must be used. Additionally, all samples should originate from the same sample type (e.g., FFPE, fresh-frozen) and be processed under identical experimental conditions, including the same library preparation kit, sequencing platform, and capture panel version. Even minor variations in hybridization efficiency or read depth distribution can introduce systematic artifacts, leading to inaccurate CNV calls.

Below are the key recommendations for preparing a high-quality PON:

  • Sample Selection: Normal samples should be sourced from individuals without known chromosomal abnormalities to establish a clean and representative reference baseline. Additionally, normal samples should not be drawn from a cohort that is likely to be enriched for particular CNVs, or enriched for individuals affected by a particular disease or syndrome with a genetic component. Normal samples should ideally be unrelated to each other and to the case samples to be processed.

  • Balanced sample sex: The normal sample set should include both male and female samples in similar numbers to ensure a well-represented reference baseline.

  • Exclude Low-Quality Samples: Samples with unusually uneven target coverage, low sequencing depth, or high technical noise should be removed to minimize variability and ensure consistency in the PON.

  • Standardized Library Preparation: All samples must be processed using the same library preparation protocol. Any deviations such as differences in hybridization efficiency, incubation time, or temperature can lead to inconsistent coverage patterns, increasing the likelihood of false positive CNV calls.

  • Adequate Number of Reference Samples: A sufficient number (a minimum of 50 samples is recommended, though not mandatory) of high-quality reference samples is essential for reliable coverage estimation and robust CNV detection.

By following these guidelines, the PON can effectively minimize technical biases, improving the accuracy and reliability of CNV detection.

In PON mode, the DRAGEN CNV Pipeline is broken down into two distinct stages. The target counts stage is performed on each sample (case and normals), to bin the alignments. The normalization and call detection stage is then performed with the case sample against the panel of normals to determine the events.

Target Counts Stage

Target counts should be generated for all normal samples used as a panel of normals. The case samples and all panel of normals samples must have identical intervals, so they should be generated with identical settings, including reference version, target BED, counting methods, duplicate marking/filtering, and filtering method/cutoff. The target counts stage also performs GC bias correction, if enabled. GC bias correction is enabled by default, but can be disabled if desired.

The following examples are for WES processing, where a panel of normals is required.

The following is an example command for processing a BAM file.

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-target-bed <BED>

The following is an example command for processing a CRAM file.

dragen \
-r <HASHTABLE> \
--cram-input <CRAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-target-bed <BED>
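
Because the case sample and every normal sample must be counted with identical settings, the target counts stage is usually repeated in a loop. The following is a minimal sketch of such a loop, not an Illumina-provided script. The helper name run_target_counts and the DRAGEN_BIN variable are illustrative names introduced here, and the sketch assumes each BAM file is named <SAMPLE>.bam. Only the dragen options shown in the examples above are used.

```shell
#!/usr/bin/env bash
# Sketch: run the WES target counts stage for each BAM with identical settings.
# run_target_counts and DRAGEN_BIN are illustrative names, not part of DRAGEN.

run_target_counts() {
    local hashtable="$1" bed="$2" outdir="$3"
    shift 3
    local bam sample
    for bam in "$@"; do
        # Assumption: the sample name is the BAM file name without ".bam".
        sample="$(basename "$bam" .bam)"
        "${DRAGEN_BIN:-dragen}" \
            -r "$hashtable" \
            --bam-input "$bam" \
            --output-directory "$outdir" \
            --output-file-prefix "$sample" \
            --enable-map-align false \
            --enable-cnv true \
            --cnv-target-bed "$bed" || return 1
    done
}

# Dry run that prints the arguments instead of launching DRAGEN:
# DRAGEN_BIN=echo run_target_counts /path/to/hashtable targets.bed /path/to/output normal1.bam normal2.bam
```

Keeping the options in one function makes it harder for the case sample and the normals to drift apart in their settings.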

The following example is for WGS processing, where a panel of normals is optional.

dragen \
-r <HASHTABLE> \
--bam-input <BAM> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-enable-self-normalization true

Generating Panel of Normals (Combined Counts)

When an analysis is run with a panel of normals (a set of target counts files), a column-wise concatenated version of the panel is output as a *.combined.counts.txt.gz file. To generate this file without running the calling step, add the --cnv-generate-combined-counts=true option to the command line. The individual panel of normals target counts files must be specified either via --cnv-normals-file (specified once per file) or --cnv-normals-list (a single text file with a path to each sample's file).

The following is an example command line using a normals list:

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--cnv-normals-list <NORMALS_LIST> \
--enable-cnv true \
--cnv-generate-combined-counts true

Normalization and Call Detection Stage

The next step in the CNV pipeline when using a panel of normals is to perform the normalization and to make the calls. This involves a separate execution of DRAGEN during which the normalization is performed and calls are made. This step requires the specification of a set of target counts files to be used for reference-based median normalization.

Ideally the panel of normals samples follow library prep and sequencing workflows that are identical to the workflows of the case sample under analysis. In order to be applicable to both male and female case samples, the panel of normals should include a balanced set of both male and female samples. DRAGEN automatically handles calling on sex chromosomes based on the predicted sex of each sample in the panel.

The presence of CNVs in the panel can result in artifactual calls in the test sample at locations where at least some of the panel samples have copy number changes. This leads to two considerations regarding construction of a panel.

First, while it is not generally possible to select samples with no CNVs, panel samples should not be clearly aneuploid or contain large-scale somatic CNVs. Further, if there is a region of particular interest, samples should be selected to be normal in that region.

Second, for optimal bias correction, a minimum of 50 samples is recommended as a panel. DRAGEN can run with smaller numbers of samples in the panel, down to even just a single sample, but smaller panels increase the likelihood of artifactual calls. Larger panels do not entirely prevent such issues, but they limit them to regions where non-reference copy numbers are common.

The following is an example of a list of PON files, which uses a subset of the GC-corrected files from the target counts stage.

/data/output/sample1.target.counts.gc-corrected.gz
/data/output/sample2.target.counts.gc-corrected.gz
/data/output/sample4.target.counts.gc-corrected.gz
/data/output/sample5.target.counts.gc-corrected.gz
/data/output/sample7.target.counts.gc-corrected.gz
/data/output/sample8.target.counts.gc-corrected.gz
...
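
A list like the one above can be assembled from the output directory of the target counts stage. The following is a minimal sketch, assuming all GC-corrected counts files sit directly in one directory. The helper name make_normals_list is illustrative and not part of DRAGEN.

```shell
#!/usr/bin/env bash
# Sketch: write a normals list with one absolute path per line.
# make_normals_list is an illustrative helper name, not part of DRAGEN.

make_normals_list() {
    local counts_dir="$1" list_file="$2"
    local abs_dir
    # Resolve the directory to an absolute path so the list stays valid
    # when DRAGEN is launched from a different working directory.
    abs_dir="$(cd "$counts_dir" && pwd)" || return 1
    find "$abs_dir" -maxdepth 1 -type f -name '*.target.counts.gc-corrected.gz' \
        | sort > "$list_file"
    # Fail if no GC-corrected target counts files were found.
    [ -s "$list_file" ]
}

# Example:
# make_normals_list /data/output /data/output/normals.txt
```

The sketch collects every matching file, so remove the lines for any samples that should not be part of the panel before using the list.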

DRAGEN accepts a panel of normals (PON) through three options.

  • --cnv-normals-file --- Individual normal file. This option takes a single file name and can be specified multiple times.

  • --cnv-normals-list --- List of normal files. A plain text file in which each line contains a path to a *.target.counts.gz or *.target.counts.gc-corrected.gz file generated by the target counts stage. Relative paths are supported if they are relative to the current working directory. Absolute paths are recommended in case the workflow is reused later or shared with other users.

  • --cnv-combined-counts --- PON file that combines all normal files into a single file. The combined counts file (*.combined.counts.txt.gz) can be found in the output folder of a prior DRAGEN run with the same panel of normals. Some prepackaged PON files downloaded directly from the Illumina support site must be passed with this option.

The CNV caller can also be started from the *.target.counts.gz (raw counts) or *.target.counts.gc-corrected.gz (GC corrected counts) files of the case sample, by specifying the selected file with the --cnv-input or --cnv-tumor-input option and the PON options described above. When selecting GC corrected counts the option --cnv-enable-gcbias-correction should be set to false to disable the GC-correction stage; GC-corrected inputs are not supported for somatic WGS analysis.

For example, the following command normalizes the case sample against the panel of normals.

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-map-align false \
--enable-cnv true \
--cnv-input <CASE_COUNTS> \
--cnv-normals-list <NORMALS> \
--cnv-enable-gcbias-correction false
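
Whether to disable GC bias correction depends on which counts file is passed as the case input. The following is a minimal sketch that derives the setting from the file name, following the rule described above. The helper name gcbias_flag_for is illustrative and not part of DRAGEN.

```shell
#!/usr/bin/env bash
# Sketch: pick the GC bias correction setting from the case counts file name.
# gcbias_flag_for is an illustrative helper name, not part of DRAGEN.

gcbias_flag_for() {
    case "$1" in
        *.target.counts.gc-corrected.gz)
            # Counts are already GC corrected, so disable the GC-correction stage.
            echo "--cnv-enable-gcbias-correction false" ;;
        *.target.counts.gz)
            # Raw counts: print nothing and leave the default setting in place.
            ;;
        *)
            echo "unrecognized target counts file: $1" >&2
            return 1 ;;
    esac
}

# Example (the substitution is left unquoted on purpose so that it splits
# into an option and its value):
# dragen ... --cnv-input "$counts" $(gcbias_flag_for "$counts")
```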

Normalization Options

These options control the preconditioning of the panel of normals and the normalization of the case sample.

  • --cnv-enable-self-normalization --- Enable/disable self normalization mode, which does not require a panel of normals.

  • --cnv-extreme-percentile --- Specifies the extreme median percentile value at which to filter out samples. The default is 2.5.

  • --cnv-input --- Specifies a target counts file for the case sample under analysis when using a panel of normals, for germline analysis (see --cnv-tumor-input for somatic analysis).

  • --cnv-normals-file --- Specifies a target.counts.gz file to be used in the panel of normals. You can use this option multiple times, one time for each file.

  • --cnv-normals-list --- Specifies a text file that contains paths to the list of reference target counts files to be used as a panel of normals. Absolute paths are recommended in case the workflow is used later or shared with other users. Relative paths are supported if the paths are relative to the current working directory.

  • --cnv-max-percent-zero-samples --- Specifies the percentage of zero coverage samples allowed for a target. If a target exceeds the specified threshold, the target is filtered out. The default value is 5%. The option is sensitive to the number of normal samples being used, so adjust the threshold accordingly. If the panel of normals is small and the threshold is not adjusted, the option could filter out targets that were not intended to be filtered.

  • --cnv-max-percent-zero-targets --- Specifies the percentage of zero coverage targets allowed for a sample. If a sample exceeds the specified threshold, the sample is filtered out. The default value is 2.5%. The option is sensitive to the total number of target intervals, so adjust the threshold accordingly. If the capture kit has a small number of probes and the threshold is not adjusted, the option could filter out samples that were not intended to be filtered.

  • --cnv-target-factor-threshold --- Specifies the bottom percentile of panel of normals medians to filter out useable targets. The default is 1% for whole genome processing and 5% for targeted sequencing processing.

  • --cnv-tumor-input --- Specifies a target counts file for the case sample under analysis when using a panel of normals, for somatic analysis (see --cnv-input for germline analysis).

  • --cnv-truncate-threshold --- Specifies a percentage threshold for truncating extreme outliers. The default is 0.1%.

  • --cnv-enable-gender-matched-pon --- Enable/disable gender-matched PON normalization. If enabled, DRAGEN uses the gender-matched PON samples for sex chromosome normalization. Sex chromosome intervals are filtered if the PON has no sample of matching gender. The default value is true.

  • --cnv-enable-cross-gender-adjustments-chrX --- Enable normalization on chrX by adjusting coverage of PON samples according to the expected number of copies of chrX in male and female samples. If the case sample is male, coverage of female PON samples is scaled down by a factor of 2 on chrX. If the case sample is female, coverage of male PON samples is scaled up by a factor of 2 on chrX. If no male PON samples are available, chrY intervals will be filtered. This feature is only supported for germline enrichment runs. The default value is false; if set to true, then --cnv-enable-gender-matched-pon must also be true.
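
The sensitivity of --cnv-max-percent-zero-samples to panel size can be seen with simple arithmetic. The following sketch assumes that a target is filtered when the percentage of zero coverage samples is strictly greater than the threshold; the exact comparison DRAGEN applies is not stated here, so treat the numbers as illustrative. The helper name max_zero_samples_allowed is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: largest number of zero coverage samples a target can have without
# exceeding a percentage threshold, for a given panel size.
# Assumption: a target is filtered when (zero samples / panel size) * 100
# is strictly greater than the threshold.
# max_zero_samples_allowed is an illustrative helper name, not part of DRAGEN.

max_zero_samples_allowed() {
    local panel_size="$1" percent="${2:-5}"
    awk -v n="$panel_size" -v p="$percent" 'BEGIN { print int(n * p / 100) }'
}

# Example:
# max_zero_samples_allowed 10      # default threshold of 5%
# max_zero_samples_allowed 50 5
```

Under this assumption, a panel of 10 samples at the default 5% tolerates no zero coverage samples, so a single such sample removes the target, while a panel of 50 tolerates two.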

Exclude BED Filtering

You can input an exclude BED to the CNV caller to filter out regions from analysis. An exclude BED is useful if certain regions in the genome are known to be problematic due to library prep, sequencing, or mapping issues. You can also exclude intervals that specify common CNVs to aid in downstream analysis. Specify an exclude BED file using --cnv-exclude-bed. DRAGEN does not provide an exclude BED. The intervals to exclude should be formatted in standard three-column BED format.
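
Before passing an exclude BED to DRAGEN, it can help to confirm that the file is well formed. The following is a minimal sketch of such a check, assuming tab-separated columns, and treating start >= end as invalid (a choice made for this sketch, since an empty interval excludes nothing). The helper name check_exclude_bed is illustrative, and the check does not reflect any validation DRAGEN itself performs.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a three-column, tab-separated BED file.
# check_exclude_bed is an illustrative helper name, not part of DRAGEN.

check_exclude_bed() {
    awk -F'\t' '
        # Skip header-style lines sometimes found in BED files.
        /^(#|track|browser)/ { next }
        # Require chrom, start, end with integer coordinates and start < end.
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
            printf "line %d: invalid BED interval\n", NR > "/dev/stderr"
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example:
# check_exclude_bed exclude.bed && echo "exclude BED looks well formed"
```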

The intervals in the exclude BED are compared with the original target counts intervals. If the overlap is greater than --cnv-exclude-bed-min-overlap, the target counts interval is excluded from analysis. The *.target.counts.gz file still includes the interval, so you can inspect the original read counts. The normalization stage removes the excluded intervals, and the *.tn.tsv.gz file omits the intervals removed during normalization.

Some of the excluded intervals can be rescued through the segmental duplication extension to the germline CNV workflow. See the Segmental Duplication Extension section below for more details.

See Output Files for a description of the extension output files.

A suitable population B-allele VCF is provided for selected references at this page.

The GC bias correction module immediately follows the target counts stage and operates on the *.target.counts.gz file. GC bias correction generates a GC bias corrected version of the file, which has a *.target.counts.gc-corrected.gz extension in the file name. The GC bias corrected versions are recommended for any downstream processing when working with WGS data. For WES, if there are enough target regions, the GC bias corrected counts can also be used. See Output Files for further details on GC-corrected target counts files.

See Output Files for a description of the target counts files.

Excluding an interval does not guarantee that no CNV call spans it. If there is sufficient data flanking the region, the segmentation stage along with any merging might still generate a call spanning the excluded interval. However, the call would not take read counts from excluded intervals into account. You can view explanations for excluded intervals in the *.excluded_intervals.bed.gz file. See Output Files for further details.
