Germline

Overview

DRAGEN provides germline copy number variant (CNV) calling workflows that detect copy number aberrations and regions with absence of heterozygosity (AOH) in whole genome sequencing (WGS) and whole exome sequencing (WES) data. The CNV workflows leverage both depth of coverage and B-allele frequencies (BAFs) to provide comprehensive detection of:

  • Copy number gains (duplications) and losses (deletions)

  • Copy-neutral loss of heterozygosity (CNLOH)

  • Whole-arm and whole-chromosome aneuploidies (via Cytogenetics modality)

  • Mosaic alterations (WGS only, enabled by default)

  • Minor allele copy number estimation

For applications that do not require allele-specific information, mosaic alterations, or whole-arm/-chromosome aneuploidies, the legacy depth-only workflow is also available. See Depth-Only Workflow for details.

Workflow

The germline CNV pipeline consists of the following modules:

  1. Target Counts — Binning of read counts and other signals from alignments

  2. B-Allele Counts — Extraction of allelic read counts

  3. Bias Correction — Correction of GC bias and other systematic biases

  4. Normalization — Detection of normal ploidy levels and normalization

  5. Segmentation — Breakpoint detection via segmentation of normalized depth and BAF signals

  6. ASCN Calling — Integration of depth and BAF segments to determine copy number states and allele-specific information

Example Command Lines

WGS

Note: add --cnv-stop-after-intrinsic-corrections=true if you are interested only in target counts generation and bias correction.

WES

Alternatively, you can use a pre-combined panel of normals file:

Required Options

Option
Description

--enable-cnv

Enable CNV processing (set to true)

Input

Option
Description

--fastq-file1, --fastq-file2

FASTQ input files (requires --enable-map-align true)

--bam-input

BAM input file

--cram-input

CRAM input file

--ref-dir

DRAGEN reference genome hashtable directory

--enable-map-align

Enable mapper and aligner module

--cnv-population-b-allele-vcf

Population SNP catalog for BAF estimation

--cnv-target-bed

BED file defining exome capture regions (only for WES)

--sample-sex

Sample sex (e.g., male, female). If not specified, sex is estimated from data

You can download a suitable population SNP catalog (Resource file "CNV Population SNP VCF") for your associated reference at this page.

Segmentation

The default segmentation mode depends on the sample type. Germline WGS samples use SLM by default. Germline WES samples use HSLM by default. The segmentation mode can be set explicitly with --cnv-segmentation-mode (SLM|HSLM). See Segmentation in the CNV reference section for a description of SLM and its HSLM variant.

Option
Description

--cnv-slm-eta

Probability that the segmenter changes to any other state than the current state going from the current target to the next target. This could also be expressed as the probability that the true depth for adjacent targets is different for reasons that simple counting noise does not adequately explain. Likewise, the stay-in-state probability is (1.0 - eta). The default value is 4e-5, the range is (0.0, 1.0) excluding endpoints. Decreasing this value results in longer segments and reduced fragmentation; increasing produces shorter segments with more fragmentation.

--cnv-slm-omega

Scaling parameter modulating the relative weight between experimental and biological variance. The default is 0.3, the range is (0.0, 1.0) excluding endpoints. In general, decreasing this value produces longer segments with less fragmentation; increasing produces shorter segments with more fragmentation.

The following options apply to the HSLM segmentation method, which is only used in germline WES.

Option
Description

--cnv-slm-stepeta

Distance normalization parameter. The default is 10000. Modifies the effective eta based on the genomic distance between consecutive target intervals. This can progressively relax stay-in-state, or "stickiness", of the segmenter as adjacent targets become farther apart, making the method adaptive to unequal spacing. Decreasing produces shorter segments with more fragmentation; increasing produces longer segments with less fragmentation.

--cnv-slm-fw

Minimum number of depth bins or targets required for a segment to be retained. This is an internal hard filter at the segmentation stage. The default is 0, which disables it. This is largely vestigial; use of this option is not recommended.

The following options are documented here in proximity to segmentation options because of their direct relevance to each other. Once provisional calls for copy number (CN) and minor copy number (MCN) have been made on the resulting segments from the segmentation stage, adjacent segments with the same CN and MCN are joined together to form one single segment. This is continued until no two adjacent segments satisfy the merging criteria. Segment merging is a critical step which compensates for over-segmentation or over-fragmentation happening at the segmentation stage. However, segment merging cannot split segments apart, so it cannot compensate in the other direction. Thus, segmentation can afford to produce a degree of over-segmentation, but there is no compensatory mechanism for under-segmentation. The following options control segment merging in germline analyses and do not depend on segmentation method or the segmentation options in use.

Option
Description

--cnv-merge-distance

Maximum gap in base pairs between two adjacent segments that still allows them to be merged. The default is 1000 for germline WGS. For WES the default is effectively unlimited, since target intervals are inherently non-contiguous.

--cnv-merge-threshold

Maximum difference in segment mean (linear copy ratio) between two adjacent segments that still allows them to be merged. The default is 0.2 for germline WGS and 0.4 for germline WES.

Setting --cnv-merge-threshold to zero disables segment merging entirely. This is not recommended.
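The merging rules described above can be illustrated with a minimal Python sketch. It is not DRAGEN's implementation: the Segment record, the function names, and the bin-weighted average used for the merged segment mean are assumptions made for illustration, and every segment is assumed to contain at least one bin. The sketch combines the three documented criteria (same CN and MCN, gap within the merge distance, segment mean difference within the merge threshold) and repeats until no adjacent pair qualifies.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int    # start position (bp)
    end: int      # end position (bp)
    mean: float   # segment mean (linear copy ratio)
    cn: int       # provisional total copy number
    mcn: int      # provisional minor copy number
    bins: int     # number of depth bins (assumed >= 1)


def can_merge(a, b, merge_distance, merge_threshold):
    """Return True if adjacent segments a and b meet all merging criteria."""
    same_state = a.cn == b.cn and a.mcn == b.mcn
    close_enough = (b.start - a.end) <= merge_distance
    similar_mean = abs(a.mean - b.mean) <= merge_threshold
    return same_state and close_enough and similar_mean


def join(a, b):
    """Join two segments. Assumption: the new mean is a bin-weighted average."""
    bins = a.bins + b.bins
    mean = (a.mean * a.bins + b.mean * b.bins) / bins
    return Segment(a.start, b.end, mean, a.cn, a.mcn, bins)


def merge_segments(segments, merge_distance=1000, merge_threshold=0.2):
    """Merge adjacent segments until no adjacent pair satisfies the criteria."""
    segs = sorted(segments, key=lambda s: s.start)
    if merge_threshold == 0:
        # Mirrors the documented behavior: a zero threshold disables merging.
        return segs
    changed = True
    while changed:
        changed = False
        out = []
        for seg in segs:
            if out and can_merge(out[-1], seg, merge_distance, merge_threshold):
                out[-1] = join(out[-1], seg)
                changed = True
            else:
                out.append(seg)
        segs = out
    return segs
```

Because merging only joins segments, the sketch also shows why over-segmentation is recoverable while under-segmentation is not: nothing in the loop can split a segment.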

Normalization

The following options are mutually exclusive:

Option
Description

--cnv-normals-list

Text file containing paths to reference target counts files (one per line)

--cnv-normals-file

Individual normal counts file (use multiple times for multiple files)

--cnv-combined-counts

Combined panel of normals file (.combined.counts.txt.gz)

--cnv-enable-self-normalization

Use self-normalization for sample normalization (only available for WGS)
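Since these options are mutually exclusive, a wrapper script may want to validate an argument list before launching a run. The following Python sketch is one way to do that. The helper name and the error message are hypothetical, and the check only inspects option names listed in the table above; it does not reproduce DRAGEN's own argument validation.

```python
NORMALIZATION_OPTIONS = (
    "--cnv-normals-list",
    "--cnv-normals-file",
    "--cnv-combined-counts",
    "--cnv-enable-self-normalization",
)


def check_normalization_options(argv):
    """Return the normalization options found in argv.

    Raises ValueError if more than one kind of normalization option is used.
    Repeating the same option (e.g. several --cnv-normals-file) is allowed.
    """
    used = set()
    for arg in argv:
        name = arg.split("=", 1)[0]  # accept both "--opt value" and "--opt=value"
        if name in NORMALIZATION_OPTIONS:
            used.add(name)
    if len(used) > 1:
        raise ValueError(
            "mutually exclusive normalization options: " + ", ".join(sorted(used))
        )
    return used
```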

Output

Option
Description

--output-directory

Output directory for all results

--output-file-prefix

Prefix prepended to all output file names

Workflow Configuration

Option
Description
Default
WGS
WES

--cnv-enable-mosaic-calling

Enable detection of mosaic alterations

true

--cnv-enable-cyto-output

Enable cytogenetics-compatible output VCF

true

--cnv-enable-legacy-vcf-format

Use VCF v4.2 format instead of VCF v4.4

false

--cnv-stop-after-intrinsic-corrections

Stop processing after generating target counts and GC-corrected counts

false

Note: Mosaic calling is available for WES but not recommended (disabled by default) due to lack of extensive validation.

Output Filtering

Option
Description
Default

--cnv-enable-ref-calls

Emit copy-neutral (REF) calls in output VCF

true for WGS

--cnv-filter-length

Minimum event length (bp) for PASS calls

10000

--cnv-exclude-bed

BED file specifying intervals to exclude from analysis

Not set

--cnv-exclude-bed-min-overlap

Minimum overlap fraction for exclusion

0.5

--cnv-post-vcf-target-bed

BED file used to only emit calls overlapping BED intervals

Not set

Output Files

The germline CNV workflow generates the following output files:

File
Description
Format

.target.counts.gz

Raw target counts before bias correction

gzipped TSV

.target.counts.gc-corrected.gz

GC-bias corrected target counts

gzipped TSV

.tn.tsv.gz

Tangent-normalized coverage signal

gzipped TSV

.ballele.counts.gz

B-allele counts at population SNP sites

gzipped TSV

.baf.bedgraph.gz

B-allele frequency in bedgraph format

gzipped bedGraph

.seg

Segmentation results (depth and BAF)

TSV

.cnv.vcf.gz

Primary CNV calls (VCF v4.4 by default)

gzipped VCF

.cyto.vcf.gz

Cytogenetics-compatible calls (if enabled)

gzipped VCF

.cnv_metrics.csv

Summary metrics including predicted sex

CSV

.cnv.gff3

Variant calls in GFF format

GFF

.tn.bw

Tangent-normalized signal track

BigWig

Target Counts Output

<prefix>.target.counts.gz

Compressed tab-delimited file containing the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid file, which is normalized to the normal ploidy level of 2 instead of raw counts.

Columns:

  1. Contig identifier

  2. Start position

  3. End position

  4. Target interval name

  5. Count of alignments in this interval

  6. Count of improperly paired alignments in this interval

Header lines starting with # contain the DRAGEN version, command line, and other meta information.

Example:

<prefix>.target.counts.gc-corrected.gz

Contains GC-corrected read counts per target interval. The format is equivalent to the *.target.counts.gz file:

  1. Contig identifier

  2. Start position

  3. End position

  4. Target interval name

  5. GC-corrected read counts in this interval

  6. Count of improperly paired alignments in this interval

Example:

For more information, see Target Counts File and GC Bias Correction.
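Both target counts files share the six-column layout listed above, so one reader can handle either. The following Python sketch is illustrative only: the function and type names are hypothetical, counts are read as floats because GC-corrected values need not be integers, and the parser tolerates an optional column-name row (an assumption, since the exact header layout is not shown here).

```python
import gzip
from typing import NamedTuple


class TargetInterval(NamedTuple):
    contig: str
    start: int
    end: int
    name: str
    count: float           # raw or GC-corrected read count
    improper_pairs: float  # improperly paired alignments


def parse_target_counts(lines):
    """Split an iterable of text lines into (# header lines, interval records)."""
    meta, intervals = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            meta.append(line)
            continue
        fields = line.split("\t")
        if len(fields) < 6 or not fields[1].isdigit():
            # Assumption: skip a column-name row or malformed line.
            continue
        contig, start, end, name, count, improper = fields[:6]
        intervals.append(
            TargetInterval(contig, int(start), int(end), name,
                           float(count), float(improper))
        )
    return meta, intervals


def read_target_counts(path):
    """Read a gzipped target counts file from disk."""
    with gzip.open(path, "rt") as handle:
        return parse_target_counts(handle)
```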

Normalized Coverage Output

<prefix>.tn.tsv.gz

Contains the normalized signal of the case sample per target interval, i.e., the log2-transformed copy ratio signal. A strong signal deviation from 0.0 indicates a potential for a CNV event. The format is equivalent to the *.target.counts.gz file:

  1. Contig identifier

  2. Start position

  3. End position

  4. Target interval name

  5. Log2-transformed copy ratio in this interval

  6. Count of improperly paired alignments in this interval

Header lines are also included that start with #. In some cases, the normalization counts could be patched internally with intervals from other processes, such as the SegDups extension. In such cases, patches are indicated (sorted in order of application) with header lines starting with #patch:

and the original (unpatched) *.tn.tsv.gz is renamed as *.tn.unpatched.tsv.gz. Note: this file is reported in output for inspection, but most use cases will use the (patched) *.tn.tsv.gz file downstream of normalization.

An example of a *.tn.tsv.gz file is shown below.

For more information, see Normalization.
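To interpret the log2-transformed values, it can help to convert them back to a linear copy ratio. The Python sketch below shows the arithmetic. The function names are hypothetical, and the implied copy number assumes a diploid baseline and no mosaicism, so it is a rough reading aid rather than a substitute for the caller's CN estimate.

```python
def log2_ratio_to_linear(log2_ratio):
    """Convert a log2-transformed copy ratio to a linear copy ratio."""
    return 2.0 ** log2_ratio


def implied_copy_number(log2_ratio, baseline_ploidy=2):
    """Copy number implied by a log2 ratio, assuming the given baseline ploidy."""
    return baseline_ploidy * log2_ratio_to_linear(log2_ratio)
```

Under these assumptions, a log2 ratio of 0.0 corresponds to a linear ratio of 1.0 (two copies), and a log2 ratio of -1.0 corresponds to a linear ratio of 0.5 (one copy).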

B-Allele Counts

In germline ASCN runs, B-allele counts are calculated at bi-allelic sites taken from a collection of high-frequency SNVs in the population. Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the sample supporting each of these alleles is counted.

B-allele counts are written to both a gzipped TSV file (*.ballele.counts.gz) and a gzipped bedGraph file (*.baf.bedgraph.gz).

<prefix>.ballele.counts.gz

Columns:

  1. Contig identifier

  2. Start, BED-style (zero-based inclusive) start position of the reference allele

  3. Stop, BED-style (one-based inclusive) stop position of the reference allele

  4. Base sequence for the reference allele

  5. Base sequence for the first allele being counted

  6. Base sequence for the second allele being counted

  7. The number of qualified reads containing a sequence matching the first allele

  8. The number of qualified reads containing a sequence matching the second allele

  9. Population frequency for the first allele

  10. Population frequency for the second allele
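The two read-count columns (7 and 8) are enough to derive an allele fraction per site. The Python sketch below is a simple illustration with a hypothetical function name; it does not reproduce the allele ordering or any filtering DRAGEN applies when writing the bedGraph file.

```python
def allele_fractions(count_first, count_second):
    """Return (first-allele fraction, minor-allele fraction) for one site.

    Returns None when no qualified reads cover the site.
    """
    total = count_first + count_second
    if total == 0:
        return None
    first = count_first / total
    return first, min(first, 1.0 - first)
```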

Example:

<prefix>.baf.bedgraph.gz

B-allele frequency in bedgraph format. Allele count ratios are calculated by sorting alleles according to base priority {A, T, G, C} (descending), producing frequencies deterministically distributed above and below 0.5. This provides easy visualization in IGV of significant BAF changes between neighboring segments.

Example:

Segmentation Results

<prefix>.seg

Contains the segments produced by the segmentation algorithm. The Segment_Mean value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential for a CNV event.

The file has the following columns:

  1. Sample name

  2. Contig identifier

  3. Start position

  4. End position

  5. Number of intervals in the segment

  6. Linear copy-ratio of the segment

An example of a *.seg file is shown below.

<prefix>.baf.seg

In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has the suffix *.baf.seg and the same format as the *.seg file, with two modifications. First, the Segment_Mean value is the mean over B-allele loci of the smaller observed allele fraction. Second, there is an additional column:

  1. BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or . when the BAF data are too variable to estimate a minor-allele fraction

An example of BAF segmentation output file is shown below:

VCF Output

<prefix>.cnv.vcf.gz

The CNV VCF file follows the standard VCF format v4.4. The VCF header is annotated with ##source=<DRAGEN_SOURCE>, where <DRAGEN_SOURCE> identifies the caller which produced the VCF, e.g.:

Due to the nature of how CNV events are represented, not all fields are applicable. In general, if more information is available about an event, then the information is annotated. To include copy neutral (REF) calls, set --cnv-enable-ref-calls to true. AOH/LOH events are not available in the legacy depth-only caller.

Example Records

The following is an example of some of the header lines that are specific to CNV:

The following header lines are specific to the germline WGS ASCN caller:

ID
Description

ModelSource

The primary basis on which the final model was chosen. Value: DEPTH+BAF.

DiploidCoverage

Expected read count for a target bin in a diploid region.

OverallPloidy

Length-weighted average of copy number for PASS events.

OutlierBafFraction

A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. Range: [0, 1].

HomozygosityIndex

Autosomal AOH/LOH percentage, considering only PASS AOH/LOH ≥ 2 Mb (default). Used as a proxy for consanguinity. A custom minimum size can be set through --cnv-min-length-homozygosity-index. The Cyto VCF (*.cyto.vcf.gz) also provides resolution-specific homozygosity indexes.
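As an illustration of the OverallPloidy definition (a length-weighted average of copy number over PASS events), consider the following Python sketch. The function name and the input tuples are hypothetical, and which record types DRAGEN includes in the average (for example REF calls) is not specified here, so this shows only the arithmetic.

```python
def overall_ploidy(events):
    """Length-weighted average copy number over PASS events.

    events: iterable of (length_bp, copy_number, filter_value) tuples.
    Returns None if there are no PASS events.
    """
    total_length = 0
    weighted_sum = 0.0
    for length, copy_number, filter_value in events:
        if filter_value != "PASS":
            continue
        total_length += length
        weighted_sum += length * copy_number
    if total_length == 0:
        return None
    return weighted_sum / total_length
```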

Records

All coordinates in the VCF are 1-based.

ID
Description

CHROM

The chromosome (or contig) on which the copy number variant occurs.

POS

Start position of the variant. If any of the ALT alleles is a symbolic allele (e.g., <DEL>), POS denotes the coordinate of the base preceding the polymorphism.

ID

Encodes the event type and coordinates of the event (1-based, inclusive). Event types include GAIN, LOSS, REF, CNLOH, and GAINLOH.

REF

Contains N for all CNV events.

ALT

Specifies the type of CNV event: <DEL>, <DUP>, or <LOH>. REF calls have "." in the ALT field. With --cnv-enable-legacy-vcf-format (VCF v4.2), the ALT field contains <DEL>,<DUP> in place of <LOH> for AOH/LOH events.

QUAL

Estimated quality score used in hard filtering. Note: different workflows provide different QUAL score distributions. Compare QUAL scores only within results from the same workflow (e.g., it is incorrect to compare QUAL scores between the CNV caller and the legacy (depth-only) CNV caller).

FILTER

The FILTER column contains PASS if the CNV event passes all filters, otherwise the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.

ID
Description

binCount

CNV events with a bin count lower than a threshold.

chromArmBinCount

A whole-arm alteration call is based on only a minimal portion of the entire arm (fewer than the threshold, default 500 intervals). This occurs, for example, in acrocentric chromosomes, where the short arm consists mainly of poor-mappability regions that are ignored during copy-number calling.

cnvLength

The length of the CNV is lower than a threshold.

cnvMosaicLength

A MOSAIC call below a certain length has been filtered as a candidate false positive.

cnvQual

The QUAL of the CNV is lower than a threshold.

mosaicFraction

The mosaic fraction of a CNV is below a defined threshold (--cnv-filter-mosaic-fraction). This filter is applied only to small CNVs with lengths shorter than the specified size threshold (--cnv-filter-mosaic-fraction-max-length, default: 200000).

INFO

The INFO column contains information representing the event.

ID
Description

REFLEN

Length of the event.

SVLEN

Length of the event. Only present for non-REF records. Note: in VCF v4.2 format (enabled with --cnv-enable-legacy-vcf-format), SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion).

SVTYPE

Always CNV. Only present for non-REF records.

END

End position of the event (1-based, inclusive).

LOHTYPE

Type of loss of heterozygosity. Possible values: AOH (Absence of Heterozygosity).

MOSAIC

Tag identifying mosaic calls (if mosaic calling is enabled).

CIPOS

Confidence interval around the nominal POS.

CIEND

Confidence interval around the nominal END.

If using a segment BED file, then the segment identifier is carried over from the input to SEGID field.

When matching CNV with SV output, additional INFO annotations are added.

FORMAT

The common FORMAT fields are described in the header:

ID
Description

GT

Genotype

SM

Linear copy ratio of the segment mean

CN

Estimated total copy number of sample

BC

Number of read count bins

PE

Number of improperly paired end reads at start and stop breakpoints

AS

Number of allelic read count sites

CNF

Floating point estimate of copy number

CNQ

Exact total copy number Q-score

MAF

Estimate for the minor allele frequency

MCN

Estimated minor-haplotype copy number

MCNF

Floating point estimate of minor-haplotype copy number

MCNQ

Minor copy number Q-score

MF

Mosaic fraction estimate (for MOSAIC calls)

OBF

Per-segment Outlier BAF Fraction. Percentage of BAF counts which are considered "outlier" with respect to the chosen segment call. Higher values might indicate segments where BAF counts are problematic.

SD

Best estimate of segment's bias-corrected read count

For more information, see CNV VCF.
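The record layout described above can be read with a few lines of Python. The sketch below is a minimal single-sample parser with hypothetical names; a dedicated VCF library is preferable for production use. The record in the usage note is invented for illustration (including its ID value) and is not actual DRAGEN output.

```python
def parse_cnv_record(line):
    """Parse one single-sample VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rec_id, ref, alt, qual, flt, info, fmt, sample = fields[:10]

    info_dict = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        # Flag-style entries such as MOSAIC carry no value.
        info_dict[key] = value if sep else True

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": rec_id,
        "ref": ref,
        "alt": alt,
        "qual": qual,
        "filter": flt,
        "info": info_dict,
        "sample": dict(zip(fmt.split(":"), sample.split(":"))),
    }
```

For example, a made-up deletion record with POS=1000 and END=21000 has POS on the base preceding the event and END as the last affected base, so the event length is END - POS = 20000, which should match REFLEN:

```python
line = ("chr1\t1000\tHYPOTHETICAL_ID\tN\t<DEL>\t50\tPASS\t"
        "SVTYPE=CNV;END=21000;REFLEN=20000;SVLEN=20000;MOSAIC\t"
        "GT:SM:CN:BC\t0/1:0.5:1:20")
record = parse_cnv_record(line)
length = int(record["info"]["END"]) - record["pos"]
```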

Cytogenetics Output

<prefix>.cyto.vcf.gz

The Cytogenetics modality output has a similar format to the standard CNV VCF (*.cnv.vcf.gz). A list of differences is indicated below:

  • Records can have the INFO/RES field, which indicates the resolution(s) associated with the record.

  • Records can have the INFO/SEGID field. This field indicates either custom predefined segments supplied by the user as input (as in the standard CNV VCF), or Cytogenetics-specific predefined segments, which are typically whole-arm/-chromosome segments automatically injected during caller execution. In the latter case, the field contains the ID or name of the arm or chromosome.

  • The VCF header is annotated with ##source=DRAGEN_CYTO to indicate the file is generated by the Cytogenetics modality.

Note: The Cyto VCF also provides resolution-specific homozygosity indexes (i.e., computed on each specific resolution's callset). The default minimum size considered is the same as the main HomozygosityIndex, and for each resolution in output, there will be an additional header line on the Cyto VCF indicating the resulting metric, e.g., ##HomozygosityIndex(25k)=0.001015.

CNV Metrics Output

<prefix>.cnv_metrics.csv

DRAGEN CNV outputs metrics in CSV format. The following metrics are reported:

Sex Genotyper

Metric
Description

Estimated sex

Estimated sex of the case sample (and panel of normals samples if applicable).

Confidence score

Range: [0.0, 1.0]. If the sample sex is specified via --sample-sex, this value is 0.0.

DRAGEN Sex Genotyper requires a minimum of 300 target intervals to confidently determine sex genotype; if the panel covers fewer intervals on the sex chromosomes, genotyping will fail and an undetermined genotype is returned. Users may lower this requirement by setting --cnv-sex-genotyper-num-interval-requirement to a smaller value, at the risk of increased false genotype calls.

CNV Summary

  • Bases in reference genome in use

  • Average alignment coverage over genome - The average alignment coverage over the genome is calculated by dividing the total number of bases from processed alignment records (excluding those filtered by the Target Counts stage in DRAGEN CNV) by the genome length. Alignment records are filtered taking into consideration duplicate marking status (if available), MAPQ, and mapping status.

  • Number of alignment records processed

    • Number of filtered records (total)

    • Number of filtered records (due to duplicates)

    • Number of filtered records (due to MAPQ)

    • Number of filtered records (due to being unmapped)

  • PMAD - Pairwise Median Absolute Deviation measures the variation in read coverage between adjacent bins. It measures variability due to various factors, such as DNA degradation, extraction, amplification, or library preparation. Higher values indicate noisier sample data. PMAD is calculated as follows:

    • Define a vector v[i] as normalized counts of i-th interval in log scale, and d[i] as pairwise differences of consecutive normalized counts between i and i+1 intervals, i.e. d[i] = (v[i] - v[i+1])

    • PMAD is median absolute deviation of d, i.e. PMAD = Median(|d[i]-Median(d)|)

  • Coverage MAD - Median absolute deviation of normalized case counts. Higher values indicate noisier sample data.

  • Median Bin Count - Median of raw counts normalized by interval size.

  • Number of target intervals

  • Number of normal samples

  • Number of segments

  • Number of amplifications - Note: GAINLOH events (ALT=LOH and CN > 2) are also included here

  • Number of deletions

  • Number of CNLOHs (Copy-Neutral LOHs)

  • Number of PASS amplifications - Note: GAINLOH events (ALT=LOH and CN > 2) are also included here

  • Number of PASS deletions

  • Number of PASS CNLOHs (Copy-Neutral LOHs)

  • Post-Normalization Bin Count Sigma - Standard deviation of post-PoN-normalization median-normalized coverage values.

Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Post-Normalization Bin Count Sigma is only printed when PoN normalization has been applied.
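The PMAD definition given in the list above translates directly into code. The following Python sketch takes a sequence of log-scale normalized counts ordered by interval; the function name is hypothetical and the sketch ignores details such as contig boundaries, which the definition above does not address.

```python
from statistics import median


def pmad(log_counts):
    """Pairwise Median Absolute Deviation of consecutive log-scale counts.

    d[i] = v[i] - v[i+1]; PMAD = Median(|d[i] - Median(d)|).
    Returns None when fewer than two intervals are given.
    """
    diffs = [log_counts[i] - log_counts[i + 1] for i in range(len(log_counts) - 1)]
    if not diffs:
        return None
    center = median(diffs)
    return median([abs(d - center) for d in diffs])
```

A perfectly flat signal gives a PMAD of 0.0, and noisier bin-to-bin variation gives larger values.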

Example (not all metrics are shown):

For more information, see CNV Metrics.

Track Files (IGV)

To generate additional equivalent bigWig and GFF files, set the --cnv-enable-tracks option to true. These files can be loaded into IGV along with other available tracks, such as RefSeq genes. Using these tracks alongside publicly available tracks allows for easier interpretation of calls. DRAGEN autogenerates an IGV session XML file if tracks are generated by DRAGEN CNV. The *.cnv.igv_session.xml file can be loaded directly into IGV for analysis.

The following IGV tracks are automatically populated in the output IGV session file:

Track File
Description
Recommended View

*.target.counts.bw

BigWig representation of target counts bins. Values are GC-corrected if GC correction was performed.

Barchart or points

*.improper_pairs.bw

BigWig representation of improper pairs counts.

Barchart

*.tn.bw

BigWig representation of the tangent normalized signal.

Points

*.seg.bw

BigWig representation of the segments.

Points

*.baf.seg.bw

BigWig representation of BAF segments (if available).

Points

*.baf.bedgraph.gz

BED graph representation of B-allele frequency (if available).

Points

*.cnv.gff3

GFF3 representation of CNV events: DEL=blue, DUP=red, filtered=light gray, REF=green (if enabled), AOH/LOH=magenta. An example is shown below (different workflows may output different attributes on the 9th column).

Example GFF3 output:

IGV Session

File extension: *.igv_session.xml

The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, by comparing the name of the --ref-dir specified on the command-line. Standard UCSC human reference genomes are autodetected, but any variations from the standard reference genomes might not be autodetected. To edit the genome detection, alter the genome attribute in the Session element to the reference genome you would like for analysis before loading into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example edited session file.

Note that depending on the IGV version installed, it may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed so a user may have to edit the genome field in the XML file directly. For example, IGV has traditionally packaged a b37 reference genome, but may also include a 1kg_v37 or a 1kg_b37+decoy, which will appear on the IGV user interface as "1kg, b37" or "1kg, b37+decoy" respectively.

You can determine the correct encoding of a reference genome by going to File > Save Session... in IGV and then inspecting the generated igv_session.xml file.
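Editing the genome attribute can also be scripted. The Python sketch below uses the standard library XML parser to change the genome attribute on the Session element. The function name and the XML snippet in the usage note are hypothetical and much shorter than a real session file.

```python
import xml.etree.ElementTree as ET


def set_session_genome(xml_text, genome_id):
    """Return the session XML with the Session element's genome attribute replaced."""
    root = ET.fromstring(xml_text)
    if root.tag != "Session":
        raise ValueError("expected a <Session> root element")
    root.set("genome", genome_id)
    return ET.tostring(root, encoding="unicode")
```

For example, a minimal made-up session can be switched from one genome identifier to another:

```python
session = '<Session genome="hg19" version="8"><Resources/></Session>'
updated = set_session_genome(session, "hg38")
```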

When the Cytogenetics Modality is enabled, DRAGEN CNV produces an additional IGV session xml *.cyto.igv_session.xml shown below.

Advanced Topics

Cytogenetics Modality

Conventional cytogenetics methodologies typically focus on larger alterations than the ones provided by NGS analyses. The Cytogenetics modality for the CNV caller allows the user to visualize CNAs at different resolutions, aiming at providing a more flexible workspace for different use cases.

It is enabled with --cnv-enable-cyto-output (default true for germline workflows). Not available for somatic WES workflows.

From the same sample, and during the same run, the Cytogenetics modality starts from the high resolution results (before smoothing) provided in the standard output CNV VCF. The output callset then undergoes multiple rounds of smoothing, going progressively from finer resolution to coarser resolution calls (larger alterations). Each round of smoothing produces a smoothed callset which is set aside and becomes the starting point for callsets with higher degree of smoothing.

At the end of the smoothing procedure, the Cytogenetics modality produces several outputs, e.g.:

  • Multiple GFF3 files, one for each round of smoothing (extension *cyto.<resolution_ID>.gff3).

  • A single VCF file, with extension *.cyto.vcf.gz. This file contains all callsets identified through the smoothing iterations, where the iteration identifier is stored on the INFO/RES field. Identical alterations across resolutions are deduplicated. In such case, the INFO/RES field will contain a comma-separated list of resolution identifiers.

    • Some resolutions will be based on depth of coverage only (no BAF). Their INFO/RES value will reflect the original callset used as a starting point, with added suffix _depth. E.g., for depth-only calls derived from resolution 1M, the new callset will have resolution ID 1M_depth. Note: calls made at different resolutions or with different information (depth+BAF versus depth-only) may occasionally conflict. For instance, in a region that is AOH that also has a mosaic DEL, the region may be reported as AOH for the depth+BAF calling but may be reported as (mosaic) DEL for the depth-only track. The event type with the strongest evidence will be output for each resolution.

    • An additional callset which does not conform to the ones above (no INFO/RES field) is the one containing whole-arm/-chromosome aneuploidies. For this callset, all reported records have the chromosome name or arm name in the INFO/SEGID field. Entries for this callset will not be present on any GFF3 file. For more details see the section on whole-chromosome aneuploidies below.

  • A single IGV session file, with extension *.cyto.igv_session.xml, which provides a convenient way to load the multiple GFF3 files and other typical tracks found on the standard *.cnv.igv_session.xml. Below an example screenshot of one of such IGV sessions:

    • The first 5 tracks provide the DRAGEN CNV calls (Blue/DEL, Green/REF, Magenta/AOH, Red/DUP) at decreasing degree of resolution (from high to low, top to bottom).

    • The remaining tracks are similar to the standard *cnv.igv_session.xml run, e.g.: poor mappability regions, target counts coverage, improper pairs, B-allele frequency, etc.

Below, an example set of calls from the *.cyto.vcf.gz output file (note additional INFO/RES annotation with respect to *.cnv.vcf.gz output file):

Selection of appropriate resolution

Since the most-informative resolution may vary depending on circumstances (event sizes, distance between calls, presence of smaller calls causing fragmentation, etc), no one-size-fits-all recommendation can work for all cases. However, some practical recommendations to consider are the following:

  • Each resolution INFO/RES ID identifies the minimum size for alterations to be considered PASS.

  • If only minimal call smoothing is necessary, resolution 25k can provide a good balance and provide calls in size ranges compatible with Chromosomal Microarray (CMA).

  • When comparing against technologies such as karyotyping, resolution 1M may be the more appropriate to reduce call fragmentation.

Note: if the use case under consideration is not impacted by call fragmentation, it is typically recommended to use the *.cnv.vcf.gz or *.cnv_sv.vcf.gz output results (instead of the ones in *.cyto.vcf.gz), to take full advantage of the superior detail of NGS.
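Once a resolution has been chosen, records can be selected by inspecting the comma-separated INFO/RES value. The Python sketch below shows one way to do this on raw INFO strings. The function names are hypothetical, and the INFO strings in the usage note are invented for illustration.

```python
def record_resolutions(info):
    """Return the list of resolution IDs in an INFO string (empty if no RES)."""
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if key == "RES" and sep:
            return value.split(",")
    return []


def has_resolution(info, resolution):
    """Return True if the record belongs to the requested resolution callset."""
    return resolution in record_resolutions(info)
```

Records from the whole-arm/-chromosome callset carry no INFO/RES field, so they return an empty list and are never selected by resolution:

```python
infos = ["SVTYPE=CNV;END=500000;RES=25k,50k", "END=46000000;SEGID=hypothetical_arm"]
selected = [info for info in infos if has_resolution(info, "25k")]
```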

Additional options

Option
Description

--cnv-cyto-keep-resolutions=<resolution_list>

Comma-separated list of resolutions to output (currently supported: 25k,50k,500k,1M,1M_depth)

Whole-chromosome Aneuploidy Detection

Some use cases require inspecting a sample at the arm or whole-chromosome level. Typically this would require an additional caller alongside the standard CNV caller with automated segment detection. In the same run, the Cytogenetics modality provides this set of calls within the same VCF file (with extension *.cyto.vcf.gz).

In the example above, two calls are derived from this callset. The segment ID annotation (INFO/SEGID) provides the name of the segment under consideration (in this example, the q-arm of chromosome 21 and the entire chromosome X). REF calls are not displayed unless explicitly requested with --cnv-enable-ref-calls true. Note: this enables REF calls in both the CNV and Cyto VCF files.

Note: acrocentric chromosomes (13, 14, 15, 21, and 22) have short arms characterized by repetitive regions. These regions create mappability issues and are typically excluded from analysis. Calling short-arm alterations for these chromosomes is therefore challenging, because the call is based on a small percentage of the arm's total length. To avoid false positive calls (in this case, an alteration reported for the full short arm with evidence coming from only a minimal portion of it), the algorithm applies a hard threshold (default 500 intervals) on the minimum number of intervals required to call a whole-arm alteration. When a chromosome arm call does not satisfy this threshold, the call is filtered with FILTER chromArmBinCount. The default can be changed with the option --cnv-filter-chrom-arm-bin-count.

MOSAIC fraction estimation

For MOSAIC alterations, DRAGEN attempts inference of the mosaic fraction (MF), that is, the percentage of cells showing the alteration.

After copy number calling, the call's mosaic fraction is preliminarily estimated from the total and minor-allele copy-number (CN, MCN) and floating point estimates (CNF, MCNF). For example: in the case of CN=4, CNF=4.48, the CN of the population without the alteration is considered $CN'=5$, and then the mosaic fraction preliminary estimate is $MF=1-0.48=0.52$.
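The worked example above can be reproduced with a short sketch. It assumes that the population without the alteration sits at the integer copy number adjacent to CN on the same side as CNF, and solves $CNF = q \cdot CN + (1-q) \cdot CN'$ for $q$. The function name and this generalization are illustrative assumptions; only the CN=4, CNF=4.48 case comes from the text.

```python
def preliminary_mosaic_fraction(cn: int, cnf: float) -> float:
    """Solve CNF = q*CN + (1-q)*CN' for q.

    Assumption (not stated by the source in general form): CN' is the
    integer copy number adjacent to CN on the same side as CNF.
    """
    if cnf == cn:
        # No fractional deviation: every cell carries the called state.
        return 1.0
    cn_other = cn + 1 if cnf > cn else cn - 1
    return (cn_other - cnf) / (cn_other - cn)


# Example from the text: CN=4, CNF=4.48 gives CN'=5 and MF of about 0.52.
print(round(preliminary_mosaic_fraction(cn=4, cnf=4.48), 2))
```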

The call observed MAF is then cross-checked with the expected MAF:

$$MAF=\frac{M_1 n_1 (1-q)+M_2 n_2 q}{n_1 (1-q)+n_2 q}$$

Note: this algorithm assumes only two cell populations (population 2: cells with the MOSAIC alteration called with CN and MCN; population 1: all remaining cells).

where:

  • $MAF$ is the expected MAF of the mixture

  • $n_1$ and $n_2$ represent the expected CN of the 2 cell populations

  • $M_1$ and $M_2$ represent the expected MAF of the 2 cell populations

  • $q$ denotes the mosaic fraction (aka MF, fraction of the 2nd cell population)

If the observed MAF is consistent with the expected MAF (considering a 5% tolerance on the MF value), the MF value is returned. Otherwise, the algorithm investigates alternative ($n_1$, $q$) configurations that are compatible with $n_2$ and the copy-number floating point estimate (CNF). If at least one alternative passes the expected MAF compatibility check, the updated $q$ is returned in the MF field. In all other cases, MF is reported as missing (MF=.).
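The expected-MAF formula can be evaluated directly, as in the sketch below. The consistency check is only one possible reading of the "5% tolerance on the MF value": it assumes the observed MAF is accepted when it lies between the expected MAF values at MF - 0.05 and MF + 0.05 (the expected MAF is monotonic in $q$ on [0, 1]). Function names and this interpretation are assumptions, not the DRAGEN implementation.

```python
def expected_maf(n1: float, m1: float, n2: float, m2: float, q: float) -> float:
    """Expected MAF of a two-population mixture.

    n1, n2: copy numbers of populations 1 and 2
    m1, m2: MAFs of populations 1 and 2
    q:      mosaic fraction (fraction of population 2)
    """
    numerator = m1 * n1 * (1 - q) + m2 * n2 * q
    denominator = n1 * (1 - q) + n2 * q
    if denominator == 0:
        raise ValueError("mixture has zero total copy number")
    return numerator / denominator


def maf_is_consistent(observed_maf, n1, m1, n2, m2, mf, tolerance=0.05):
    """Hypothetical check: accept if the observed MAF falls between the
    expected MAFs computed at mf - tolerance and mf + tolerance."""
    at_low = expected_maf(n1, m1, n2, m2, max(0.0, mf - tolerance))
    at_high = expected_maf(n1, m1, n2, m2, min(1.0, mf + tolerance))
    return min(at_low, at_high) <= observed_maf <= max(at_low, at_high)


# Diploid heterozygous background (n1=2, M1=0.5) mixed with a
# one-copy loss (n2=1, M2=0) present in half of the cells.
print(expected_maf(n1=2, m1=0.5, n2=1, m2=0.0, q=0.5))
print(maf_is_consistent(0.33, n1=2, m1=0.5, n2=1, m2=0.0, mf=0.5))
```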

Low-pass WGS support

The germline WGS caller supports reliable detection of CNVs from low-pass WGS data. Low-pass WGS is a highly cost-effective approach for CNV detection, providing genome-wide resolution at substantially lower cost than standard WGS or WES:

  • Cost-effective CNV detection at low sequencing depth (1× to 10×)

  • Comparable performance to WGS for cytogenetic-scale events (>1 Mb)

  • Detects CNVs down to a few hundred kilobases

  • Supports whole-chromosome aneuploidy and mosaic events

CNV Detection Capabilities

  • Variant types:

    • Deletions

    • Duplications

  • Resolution tiers:

    • Cytogenetic (coarse): ≥ 1 Mb

    • CNV (fine): 200 kb – 1 Mb

  • Minimum event size:

    • 200 kb hard filter

  • B-allele frequency (BAF):

    • Not estimated in low-pass mode

Output Files

| Output File | Resolution | Size Range |
| --- | --- | --- |
| cyto.vcf.gz | Coarse (cytogenetic) | ≥ 1 Mb |
| cnv.vcf.gz | Fine | 200 kb – 1 Mb |

Command-Line Usage

Enable low-pass CNV calling using the --cnv-enable-lowpass=true option:

Example records

CNV

Cytogenetics

Mosaic events

Hard Filter Options

Low-pass CNV calling applies filters based on CNV length and bin count to reduce noise associated with low sequencing coverage.

| Option | Default | Description |
| --- | --- | --- |
| --cnv-filter-length | 200 kb | Minimum CNV length for a PASS call. |
| --cnv-filter-bin-count | 4 | Minimum bin count for a PASS call. |

CNV with SV Support

The DRAGEN CNV caller leverages depth/BAF as its primary signal for calling copy number variants. Depth/BAF signal alone makes it challenging to call events shorter than 10 kbp. Sensitivity for CNVs shorter than 10 kbp can be improved by leveraging junction signals from the DRAGEN structural variant (SV) caller.

When both the DRAGEN CNV and SV callers are executed in a single invocation, an additional integration step runs at the end of the DRAGEN run to improve the CNV calls. This feature is enabled automatically when DRAGEN detects a germline WGS analysis.

The SV/CNV Integration module takes in DEL and DUP calls from the output data structures of the germline CNV and SV callers, identifies putative matches, updates annotations, filters, scores, and outputs the refined records in CNV VCF. By leveraging junction signals from the SV caller and depth/BAF signals from the CNV caller, this approach allows for sensitive CNV detection down to 1kbp while also improving recall and precision across length scales. This is achieved by rescuing previously low quality calls if evidence is found from both callers, and also by adjusting CNV breakends to the more accurate SV breakends. The matching algorithm takes into account the proximity of the events as well as the transition states at the breakends, among other things.

Example command lines

The following is an example command line for running a germline WGS analysis for both CNV and SV.

Other optional CNV or SV parameters can also be added.

Note: There is a high sensitivity mode that can be enabled with --sv-cnv-enable-high-sensitivity-mode=true. This option is experimental and will disable many filters in the processing chain to allow for more SV+CNV calls to pass. It is recommended that users apply their own training and downstream filters when using this option.

VCF Output

CNV calls with SV support are output in the CNV VCF (*.cnv.vcf.gz). The VCF header includes all header information from the individual CNV and SV callers, with some header lines deduplicated and additional header lines added from SV/CNV integration. For details on the individual caller header lines, please refer to the CNV and SV sections of the user guide. In cases where users want to obtain a separate CNV/SV VCF file while keeping the original CNV and SV VCF outputs, they can specify --sv-cnv-output-as-cnv-vcf=false. CNV calls with SV support are then output in a separate CNV/SV VCF with the *.cnv_sv.vcf.gz extension. In this case, the original CNV and SV VCF files prior to integration are also available in the DRAGEN output directory, as described elsewhere.

Newly added header lines from SV/CNV integration are described in the following table.

| Header Field | Number | Type | Description |
| --- | --- | --- | --- |
| END_LEFT_BND_OF | 1 | String | ID of CNV whose left end is matched to the end of SV |
| END_RIGHT_BND_OF | 1 | String | ID of CNV whose right end is matched to the end of SV |
| LEFT_BND | 1 | String | ID of SV that matches the left end of CNV record |
| LEFT_BND_OF | 1 | String | ID of CNV whose left end is matched to SV |
| MatchSv | 1 | Integer | ID of original SV that was merged with CNV record |
| OrigCnvEnd | 1 | Integer | Coordinate of original CNV end |
| OrigCnvPos | 1 | Integer | Coordinate of original CNV pos |
| RIGHT_BND | 1 | String | ID of SV that matches the right end of CNV record |
| RIGHT_BND_OF | 1 | String | ID of CNV whose right end is matched to SV |
| SVCLAIM | A | String | Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively |

Records that can be matched or rescued have annotations indicating the breakpoint linkage between a CNV and an SV record. If a complete match is found, the MatchSv annotation is present in the record and gives the ID field of the SV record matched to this CNV record. In this case, BND notations refer to the merged record ID itself rather than to the SV before merging. In addition, the SVCLAIM field indicates whether the record has evidence from the depth/BAF signal (D), from junction signals (J), or from both (DJ).

Because of the mixing of standalone SV records and CNV records, the FORMAT field may have different annotations. For details on the CNV or SV specific annotations, please refer to the individual CNV and SV user guide sections.

Records that can be matched or rescued will have FILTER set to PASS. The original FILTERs are retained for records that were not matched or rescued. For example, the cnvLength FILTER will still be applied to standalone CNV records (those with SVCLAIM=D).

Example records are shown below.

Coverage Uniformity

The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity metric is present in the VCF header. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.

A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise and could be considered poor quality. The CoverageUniformity metric also depends on factors other than sample quality, such as the cnv-interval-width setting and the sample mean coverage. It is therefore recommended to use this score only to compare samples with similar mean coverage that were processed with the same command line options. For the same reason, DRAGEN CNV only reports the metric and does not take any action based on it.

Call Smoothing

The segmentation stage might produce adjacent or nearby segments that are assigned the same copy number and have similar depth and BAF data. This segmentation can result in a region with consistent true copy number being fragmented into several pieces. The fragmentation might be undesirable for downstream use of copy number estimates. Also, for some uses, it can be preferable to smooth short segments that would be assigned different copy numbers whether due to a true copy number change or an artifact. To reduce undesirable fragmentation, initial segments can be merged during a postcalling segment smoothing step.

After initial calling, segments shorter than the specified value of --cnv-filter-length are deemed negligible. Among the remaining nonnegligible segments, successive pairs are evaluated for merging. The caller combines two successive segments that are within --cnv-merge-distance of one another and have the same CN and MCN assignments, along with any intervening negligible segments, into a single segment that is recalled and rescored. If the merged segment receives the same CN and MCN as its constituent nonnegligible pieces with a sufficiently high quality score, the original segments are replaced with the merged segment. The merged segment might be further merged with other initial or merged segments on either side. Merging proceeds until all segment pairs that meet the criteria are merged.
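The merging rule can be illustrated with a simplified sketch. It assumes half-open coordinates on a single chromosome, takes `filter_length` and `merge_distance` as stand-ins for the two options above, and accepts every candidate merge because the recall and rescoring step is not modeled. Negligible segments are left out of the returned list for brevity. The `Segment` type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: int  # inclusive
    end: int    # exclusive
    cn: int
    mcn: int


def smooth_segments(segments, filter_length, merge_distance):
    """Greedy left-to-right merge of successive non-negligible segments
    that share CN/MCN and lie within merge_distance of each other.

    Simplification: the real caller recalls and rescores each merged
    segment before accepting it; here every candidate merge is accepted.
    """
    merged = []
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.end - seg.start < filter_length:
            continue  # negligible segment: skipped in this sketch
        prev = merged[-1] if merged else None
        same_state = prev is not None and (prev.cn, prev.mcn) == (seg.cn, seg.mcn)
        if same_state and seg.start - prev.end <= merge_distance:
            # Extend the previous segment; it may merge again later.
            merged[-1] = Segment(prev.start, seg.end, prev.cn, prev.mcn)
        else:
            merged.append(seg)
    return merged


calls = [
    Segment(0, 1000, 3, 1),
    Segment(1000, 1050, 2, 1),   # negligible, bridged by the merge
    Segment(1050, 2000, 3, 1),
    Segment(5000, 6000, 3, 1),   # too far from the previous segment
]
for s in smooth_segments(calls, filter_length=100, merge_distance=500):
    print(s)
```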

QUAL Model

QUAL estimation is based on a model associated with the most likely diploid coverage estimated from depth of coverage and B-allele frequency.

Given such diploid coverage, for each segment, the algorithm calls the most likely copy number state (complete with total copy number CN, and minor allele copy number MCN).

The probability of the REF state is used as input to the scoring algorithm, which outputs the QUAL value (a PHRED score capped at 1000). The QUAL value is the PHRED score where the probability of error is the probability of REF when an alteration is called, or the probability of having a non-REF call when the segment should be called REF.

Note: this is different from how QUAL is computed in the legacy depth-only caller.
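As a rough illustration of the PHRED conversion described above, the sketch below maps a REF-state probability to a capped score. It assumes the error probability is used directly in the standard PHRED formula; the actual scoring algorithm may apply further modeling that is not described here. Function names are hypothetical.

```python
import math

QUAL_CAP = 1000.0  # cap stated in the text


def phred(p_error: float, cap: float = QUAL_CAP) -> float:
    """Standard PHRED scaling, -10 * log10(p_error), capped at `cap`."""
    if p_error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))


def qual_from_ref_probability(p_ref: float, called_ref: bool) -> float:
    """Error is P(REF) for an alteration call and P(non-REF) for a REF call."""
    p_error = (1.0 - p_ref) if called_ref else p_ref
    return phred(p_error)


print(qual_from_ref_probability(p_ref=0.001, called_ref=False))
print(qual_from_ref_probability(p_ref=0.0, called_ref=False))
```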

Comparison with ROH caller

Both the ROH caller and the germline CNV caller can detect runs-of-homozygosity (ROH) regions.

The two algorithms underlying the two different approaches might occasionally disagree. The differences are due to the following:

  • The ROH caller requires minor-allele frequency to be ~0. In contrast, the germline CNV caller will assign to each segment its most likely copy-number state. This can include MOSAIC alterations, not available in the ROH caller.

  • The ROH caller is dependent on the small variant caller, and only uses the SNPs that it calls. In contrast, the germline CNV caller works with a catalog of SNPs from population variation studies, such as 1000 Genomes.

  • The ROH caller uses a blacklist BED file to filter certain sites and reduce call fragmentation. In contrast, the germline CNV caller does not need to filter any site but provides an alternative smoothing algorithm to reduce call fragmentation, which is agnostic to the sample under consideration.

  • The ROH caller identifies ROH regions but does not provide the total copy number of the region under consideration. In contrast, the germline CNV caller also reports the copy number for the region (which could be different from reference ploidy).

Limitations

The following features (available in the depth-only workflow) are not yet supported:

  • Multisample/Pedigree mode

Multisample Germline CNV Calling

Multisample Germline CNV calling is possible starting from tangent normalized counts files (*.tn.tsv.gz) specified with the --cnv-input option (one per sample). Multisample CNV analysis benefits from using joint segmentation to increase the sensitivity of detection of copy number variable segments. For each copy number variable segment identified, the copy number genotype of each sample is emitted in a single VCF entry to facilitate annotation and interpretation.

Multisample Germline CNV analysis is supported for legacy (depth-only) WGS and WES workflows.

Example command lines

The following is an example command line for running a trio analysis:

De Novo CNV Calling Options

Make sure all input samples have gone through the same single sample workflow and have identical intervals. If the samples are WES inputs, then you must generate the samples using the same panel of normals, and the autosomal intervals for all samples must match.

The following options are used in de novo CNV calling:

| Option | Description |
| --- | --- |
| --cnv-input | Input tangent-normalized signal files (*.tn.tsv.gz) from single sample runs. Can be specified multiple times, once per sample. |
| --cnv-filter-de-novo-qual | Phred-scaled threshold for calling an event as de novo in the proband. Default: 0.125. |
| --pedigree-file | Pedigree file specifying the relationship between input samples. |

Joint Segmentation

First, CNV calling is performed on each sample independently. Joint segmentation then uses the copy number variable segments from each single sample analysis to derive a set of joint copy number variable segments. This set of joint segments is determined simply by taking the union of all breakpoints from the copy number variable segments of all samples. This results in the splitting of any partially overlapping segments across different samples. For example:
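The breakpoint union can be sketched as follows. The sketch assumes segments are `(start, end)` tuples on a single chromosome and keeps only the intervals covered by at least one sample's copy number variable segment; the function name and data layout are illustrative assumptions.

```python
def joint_segments(per_sample_segments):
    """Split at the union of all breakpoints across samples.

    per_sample_segments: one list of (start, end) tuples per sample,
    all on the same chromosome.
    """
    breakpoints = sorted(
        {pos for segs in per_sample_segments for seg in segs for pos in seg}
    )
    joint = []
    for start, end in zip(breakpoints, breakpoints[1:]):
        covered = any(
            s <= start and end <= e
            for segs in per_sample_segments
            for (s, e) in segs
        )
        if covered:
            joint.append((start, end))
    return joint


# Two partially overlapping segments from two samples are split
# into three joint segments.
print(joint_segments([[(100, 500)], [(300, 800)]]))
```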

Following joint segmentation, copy number calling is again performed independently on each sample using the joint segments. Segments can be merged as in the single sample analysis, but each joint segment is emitted in the multisample VCF as a single entry. The quality score (QS in the VCF) from the sample's merged segment, if applicable, is used for filtering the call. Sample calls are filtered using the sample's FT field in the multisample VCF. The QUAL column of the multisample VCF is always missing (i.e., "."). The FILTER column of the multisample VCF is SampleFT if none of the samples' FT fields are PASS, and PASS if any sample's FT field is PASS.

Note, however, that when a single segment in one sample overlaps multiple segments in another sample, the larger segment annotation is replicated across multiple records, e.g. (only relevant VCF fields are printed below):

The previous can be visualized as:

De Novo Calling Stage

A de novo event is defined as the existence of a genotype at a particular locus in a proband's genome that did not result from standard Mendelian inheritance from the parents. The de novo calling stage identifies putative de novo events in the proband of each trio of a multisample analysis. In some cases, these putative de novo events may be real, but they can also arise from sequencing or analysis artifacts. Consequently, a de novo quality score is assigned to each putative de novo event and used to filter out low-quality de novo events. Trios are specified by providing a .ped file with the --pedigree-file option. Multiple trios can be specified (e.g., quad analysis), and all valid trios will be processed.

For each joint segment in a trio, the de novo caller determines if there is a Mendelian inheritance conflict for the called copy number genotypes. The CNV caller does not identify the copy number for each allele of a given diploid segment, which means assumptions are made about the possible allelic composition of the parent genotypes.

The assumption is that the copy number 0 allele is not present for diploid regions of a parent's genome (sex dependent) when the assigned copy number is 2 or greater. This results in simplifications, as follows:

| Parent Copy Number Genotype | Possible Copy Number Alleles | Assumed Possible Copy Number Alleles |
| --- | --- | --- |
| 2 | 0/2, 1/1 | 1/1 |
| 3 | 0/3, 1/2 | 1/2 |
| 4 | 0/4, 1/3, 2/2 | 1/3, 2/2 |
| N | x/(N-x) for x <= N/2 | x/(N-x) for 1 <= x <= N/2 |

The following are examples of consistent and inconsistent copy number genotypes for diploid regions using these assumptions:

| Mother Copy Number | Father Copy Number | Proband Copy Number | Mendelian Consistent? |
| --- | --- | --- | --- |
| 2 | 2 | 2 | Yes |
| 2 | 2 | 1 | No |
| 3 | 2 | 4 | No |
| 3 | 2 | 2 | Yes |
| 2 | 0 | 2 | No |
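The consistency rule for diploid regions can be checked with a small sketch that enumerates the assumed parental alleles and tests whether one allele from each parent sums to the proband copy number. The handling of parent copy numbers 0 and 1 (alleles 0/0 and 0/1) is an assumption, since the text only constrains copy numbers of 2 or greater. Sex chromosomes are not handled, and the function names are hypothetical.

```python
def assumed_alleles(cn: int):
    """Assumed allele pairs (a, b) with a + b == cn in a diploid region."""
    if cn < 2:
        # Assumption: CN 0 is 0/0 and CN 1 is 0/1.
        return [(0, cn)]
    # For CN >= 2 the copy number 0 allele is assumed absent.
    return [(x, cn - x) for x in range(1, cn // 2 + 1)]


def is_mendelian_consistent(mother_cn: int, father_cn: int, proband_cn: int) -> bool:
    """True if the proband CN can be formed from one assumed allele per parent."""
    maternal = {a for pair in assumed_alleles(mother_cn) for a in pair}
    paternal = {a for pair in assumed_alleles(father_cn) for a in pair}
    return any(m + f == proband_cn for m in maternal for f in paternal)


# Rows of the table above: (mother, father, proband).
for trio in [(2, 2, 2), (2, 2, 1), (3, 2, 4), (3, 2, 2), (2, 0, 2)]:
    print(trio, is_mendelian_consistent(*trio))
```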

If a joint segment has a Mendelian inheritance conflict, a Phred-scaled de novo quality score (DQ field in the VCF) is calculated using the likelihoods for each copy number state (see Quality Scoring section) of each sample in the trio, combined with a prior for the trio genotypes:

$$DQ = -10\log \left( \frac{1-\sum_C{p(CN_m|data) \cdot p(CN_f|data) \cdot p(CN_p|data) \cdot p(CN_m,CN_f,CN_p)}}{\sum_G{p(CN_m|data) \cdot p(CN_f|data) \cdot p(CN_p|data) \cdot p(CN_m,CN_f,CN_p)}} \right)$$

Where:

  • $G$ is the set of all genotypes

  • $C$ is the set of conflicting genotypes

  • $CN_m$ is the Mother copy number

  • $CN_f$ is the Father copy number

  • $CN_p$ is the Proband copy number

  • $p(CN_m,CN_f,CN_p)$ is the prior for the trio genotype

The DN field in the VCF is used to indicate the de novo status for each segment. Possible values are:

  • Inherited - the called trio genotype is consistent with Mendelian inheritance

  • LowDQ - the called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold (default 0.125)

  • DeNovo - the called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold (default 0.125)
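The three DN values follow directly from the Mendelian consistency result and the DQ threshold. A minimal sketch, with a hypothetical function name and the default threshold taken from the text:

```python
def de_novo_status(mendelian_consistent: bool, dq: float, threshold: float = 0.125) -> str:
    """Map a trio call to the DN value described above."""
    if mendelian_consistent:
        return "Inherited"
    # Inconsistent genotype: DQ decides between DeNovo and LowDQ.
    return "DeNovo" if dq >= threshold else "LowDQ"


print(de_novo_status(True, dq=0.0))
print(de_novo_status(False, dq=0.05))
print(de_novo_status(False, dq=12.0))
```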

Multisample CNV VCF Output

The records in a multisample CNV VCF differ slightly from the single sample case. The major differences are as follows:

The per-record entries are broken down into the segments among the union of all the input samples breakpoints, which means there are more entries in the overall VCF.

The QUAL column is not used and its value is ".". The per-sample quality is carried over into the SAMPLE columns with the QS tag.

The FILTER column indicates PASS if any of the individual SAMPLE columns PASS. Otherwise, it indicates SampleFT.

The per-sample annotations are carried over from their originating calls. The single sample filters are applied at the sample level and are emitted in the FT annotation.
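The record-level QUAL and FILTER rules above can be summarized in a few lines. This is only a sketch of the stated rule; the function name and the dictionary layout are assumptions, and cnvLength is used as an example of a per-sample filter value.

```python
def multisample_record_columns(sample_ft):
    """Derive record-level QUAL and FILTER from the per-sample FT values."""
    any_pass = any(ft == "PASS" for ft in sample_ft)
    return {
        "QUAL": ".",  # always missing in the multisample VCF
        "FILTER": "PASS" if any_pass else "SampleFT",
    }


print(multisample_record_columns(["cnvLength", "PASS", "cnvLength"]))
print(multisample_record_columns(["cnvLength", "cnvLength", "cnvLength"]))
```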

Additionally, if a valid pedigree is used, then de novo calling is performed, which adds two annotations to the proband sample: DQ (the de novo quality score) and DN (the de novo status).

While the VCF contains many entries, due to the joint segmentation stage, the number of de novo events can be found by extracting entries that have a DN and DQ annotation. These records are also extracted and are converted to GFF3 in the de novo calling case.

Chromosome X and Y behavior

The sample sex from the single sample analysis (either estimated or overridden using the --sample-sex option) can be overridden by specifying the sex in the input pedigree file (i.e., 1 for male and 2 for female). To use the sample sex from the single sample analysis, specify an unknown sex in the pedigree file using the value 0.

Note that when all samples in the pedigree are female, then no calls on chrY will be emitted for any sample. When the pedigree includes at least one male sample, only the male samples will have genotype info reported in the VCF for chrY and any VCF entries on chrY will have a "missing" Genotype column (i.e. ".") for all corresponding female samples in the pedigree.
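The sex resolution and chrY reporting rules can be sketched as below. The function names and the sample-to-sex dictionary are hypothetical; only the pedigree codes (0, 1, 2) and the reporting behavior come from the text.

```python
PEDIGREE_SEX = {0: None, 1: "male", 2: "female"}  # 0 means unknown


def resolve_sex(pedigree_code: int, single_sample_sex: str) -> str:
    """Pedigree sex overrides the single sample sex unless it is unknown (0)."""
    pedigree_sex = PEDIGREE_SEX[pedigree_code]
    return single_sample_sex if pedigree_sex is None else pedigree_sex


def chry_reporting(sexes):
    """Return None if no chrY calls are emitted (all-female pedigree).

    Otherwise map each sample to True (genotype info reported) or
    False (genotype column is missing, i.e. ".").
    """
    if all(sex == "female" for sex in sexes.values()):
        return None
    return {sample: sex == "male" for sample, sex in sexes.items()}


sexes = {
    "proband": resolve_sex(0, "male"),    # unknown in pedigree: keep estimate
    "mother": resolve_sex(2, "female"),
    "father": resolve_sex(1, "female"),   # pedigree overrides the estimate
}
print(sexes)
print(chry_reporting(sexes))
```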
