CNV Output

DRAGEN emits the calls in the standard VCF format. The VCF file includes only copy number gain and loss events. To include copy neutral (REF) calls in the output VCF, set --cnv-enable-ref-calls to true. AOH/LOH events are available in workflows where allele-specific copy number is available.

CNV VCF File

File extension: *.cnv.vcf.gz

The CNV VCF file follows the standard VCF format. Due to the nature of how CNV events are represented versus how structural variants are represented, not all fields are applicable. In general, if more information is available about an event, then the information is annotated. Some fields in the DRAGEN CNV VCF are unique to CNVs. The VCF header is annotated with ##source=DRAGEN_CNV to indicate the file is generated by the DRAGEN CNV pipeline.

VCF format differences between different callers

In the DRAGEN CNV component, two versions of the VCF specification are used for the *.cnv.vcf.gz file:

For non-ASCN workflows, the format used is VCF v4.2
For ASCN workflows, the format used is VCF v4.4

The differences between the two formats in output from DRAGEN are the following:

General

Field        VCF v4.2               VCF v4.4
INFO/SVLEN   Positive or negative   Always positive

Absence/Loss of Heterozygosity (AOH/LOH)

Field       VCF v4.2      VCF v4.4
ALT         <DEL>,<DUP>   <LOH>
FORMAT/GT   1/2           1/1
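Tools that read both formats may need to reconcile the SVLEN sign convention. The following Python sketch shows one way to do this by taking the absolute value. The helper name unsigned_svlen and the sample INFO strings are hypothetical and are not part of DRAGEN.

```python
from typing import Optional


def unsigned_svlen(info: str) -> Optional[int]:
    """Return the absolute SVLEN from a VCF INFO string, or None if absent.

    Taking the absolute value makes the signed (v4.2) and always-positive
    (v4.4) conventions comparable. Illustrative helper only.
    """
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "SVLEN" and value:
            # SVLEN is declared with Number=., so use the first value.
            return abs(int(value.split(",")[0]))
    return None  # for example, REF records carry no SVLEN


# Hypothetical INFO strings for the same deletion in each format.
legacy = unsigned_svlen("SVTYPE=CNV;END=2000;REFLEN=1000;SVLEN=-1000")
current = unsigned_svlen("SVTYPE=CNV;END=2000;REFLEN=1000;SVLEN=1000")
print(legacy == current)
```

The event direction is still available from the ALT column, so no information is lost by dropping the sign.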

Header

The following is an example of some of the header lines that are specific to CNV:

##fileformat=VCFv4.2
##CoverageUniformity=0.402517
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
...
##reference=file:///reference_genomes/Hsapiens/hs37d5/DRAGEN
##ALT=<ID=CNV,Description="Copy number variant region">
##ALT=<ID=DEL,Description="Deletion relative to the reference">
##ALT=<ID=DUP,Description="Region of elevated copy number relative to the reference">
##INFO=<ID=REFLEN,Number=1,Type=Integer,Description="Number of REF positions included in this record">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END">
##FILTER=<ID=cnvQual,Description="CNV with quality below <WORKFLOW-SPECIFIC DEFAULT VALUES>">
##FILTER=<ID=cnvCopyRatio,Description="CNV with copy ratio within +/- 0.2 of 1.0">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=SM,Number=1,Type=Float,Description="Linear copy ratio of the segment mean">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Estimated copy number">
##FORMAT=<ID=BC,Number=1,Type=Integer,Description="Number of bins in the region">
##FORMAT=<ID=PE,Number=2,Type=Integer,Description="Number of improperly paired end reads at start and stop breakpoints">
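Header lines of this kind can be read with a short script. The following Python sketch collects simple key=value lines (such as CoverageUniformity) and the FILTER definitions. The function name parse_cnv_header is an assumption made for illustration, and the regular expression only covers the FILTER layout shown above.

```python
import re

# Matches lines of the form <ID=name,Description="text">
FILTER_RE = re.compile(r'<ID=([^,>]+),Description="(.*)">$')


def parse_cnv_header(lines):
    """Return (simple, filters) parsed from VCF '##' header lines."""
    simple = {}   # e.g. {"fileformat": "VCFv4.2"}
    filters = {}  # e.g. {"cnvQual": "CNV with quality below ..."}
    for line in lines:
        if not line.startswith("##"):
            break  # the meta-information block ends at the first non-## line
        key, _, value = line[2:].rstrip("\n").partition("=")
        if key == "FILTER":
            match = FILTER_RE.match(value)
            if match:
                filters[match.group(1)] = match.group(2)
        elif not value.startswith("<"):
            simple[key] = value
    return simple, filters


header = [
    "##fileformat=VCFv4.2",
    "##CoverageUniformity=0.402517",
    '##FILTER=<ID=cnvCopyRatio,Description="CNV with copy ratio within +/- 0.2 of 1.0">',
]
simple, filters = parse_cnv_header(header)
print(simple["CoverageUniformity"], sorted(filters))
```

For production use, an established VCF library is likely a better choice than hand-written parsing.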

The following header lines are specific to the somatic ASCN callers (WGS/WES) and the germline WGS ASCN caller:

ModelSource: The primary basis on which the final model was chosen. The following values can be included:
- DEPTH+BAF: Depth+BAF signal is used to determine model.
DiploidCoverage: Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy: Length-weighted average of copy number for PASS events (for the tumor fraction in somatic runs). The numeric value is unlimited.
OutlierBafFraction: A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].
HomozygosityIndex: Autosomal AOH/LOH percentage, considering only PASS AOH/LOH events with a size greater than or equal to a certain threshold. This metric can be used as a proxy for consanguinity in the germline WGS (ASCN) CNV caller. The default minimum size for PASS AOH/LOH to be considered is 2 Mb, since it is often found that shorter ROHs "do not arise from inbreeding in recent generations and are common in all of the populations represented in the HGDP" (Kirin et al., 2010). However, a custom minimum size can be set through the option cnv-min-length-homozygosity-index. Note: The Cyto VCF (*.cyto.vcf.gz) also provides resolution-specific homozygosity indexes (i.e., computed on each specific resolution's callset). The default minimum size considered is the same as for the main HomozygosityIndex, and for each resolution in the output, there is an additional header line in the Cyto VCF indicating the resulting metric, e.g., ##HomozygosityIndex(25k)=0.001015.
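To make the "length-weighted average" wording of OverallPloidy concrete, the following Python sketch computes such an average from (length, copy number) pairs. It is one reading of the definition, not DRAGEN's implementation. The function name and input layout are assumptions, and the text does not state how regions without PASS events are treated.

```python
def length_weighted_mean_cn(events):
    """Length-weighted mean copy number.

    events: iterable of (length_in_bases, copy_number) pairs, assumed to
    describe PASS events only. Returns None when there is nothing to average.
    """
    events = list(events)
    total_length = sum(length for length, _ in events)
    if total_length == 0:
        return None
    return sum(length * cn for length, cn in events) / total_length


# Hypothetical input: 300 bases at CN 2 and 100 bases at CN 4.
print(length_weighted_mean_cn([(300, 2), (100, 4)]))  # 2.5
```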

The following header lines are specific to the somatic ASCN callers (WGS/WES):

ModelSource can also have the following values:
- DEPTH+BAF_DOUBLED: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
- DEPTH+BAF_DEDUPLICATED: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
- DEPTH+BAF_WEAK: Depth+BAF signal is used to determine lower-confidence tumor model.
- VAF: VAF signal is used to determine tumor model due to insufficient depth+BAF signal.
- SAMPLE_MEDIAN: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF in somatic. Diploid coverage set to sample median.
- DEGENERATE_DIPLOID: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to lowest value observed in a substantial number of bases in segments with BAF=50%.
EstimatedTumorPurity: Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1] or NA if a confident model could not be determined.
AlternativeModelDedup: An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup: An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.

Records

All coordinates in the VCF are 1-based.

CHROM

The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.

POS

The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.

ID

The ID column is used to represent the event. The ID field encodes the event type and coordinates of the event (1-based, inclusive). In addition to representing GAIN, LOSS and REF events, in Somatic ASCN (WGS/WES) and Germline ASCN (WGS) CNV, the ID could include the Copy Neutral Loss/Absence of Heterozygosity (CNLOH) or Copy Number Gain with LOH (GAINLOH) events.

REF

The REF column contains an N for all CNV events.

ALT

The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the <DEL>, <DUP>, or <LOH> entries are used. If REF calls are emitted, their ALT is always a period (.). In workflows where allele-specific copy number (ASCN) is available, if the legacy DRAGEN VCF format (VCF v4.2) has been enabled with --cnv-enable-legacy-vcf-format, the ALT field contains two alleles, <DEL>,<DUP>, in place of <LOH> for AOH/LOH events.

QUAL

The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header. Note: different workflows (e.g., germline WGS vs germline WGS ASCN) do not share the same underlying model and provide different QUAL score distributions. It is recommended to compare QUAL scores only within results from the same workflow. More details are available on germline CNV calling and ASCN callers.

FILTER

The FILTER column contains PASS if the CNV event passes all filters, otherwise the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.

FILTER

Germline WGS

Germline WGS (ASCN)

Germline WES

Somatic WGS (ASCN)

Somatic WES

Somatic WES (ASCN)

binCount

✓

cnvBinSupportRatio

✓

cnvCopyRatio

✓

cnvHetLength

✓

cnvLength

✓

cnvLikelihoodRatio

✓

cnvMosaicLength

✓

cnvQual

✓

dinucQual

✓

highCN

✓

lengthDegenerate

✓

segmentMean

✓

SqQual

✓

FILTER description

Available FILTERs:

binCount - Filters CNV events with a bin count lower than a threshold.
cnvBinSupportRatio - For CNVs greater than 80 kb, the percent span of supporting target intervals is lower than a threshold.
cnvCopyRatio - The segment mean of the CNV is not far enough from copy neutral.
cnvHetLength - A HET call below a certain length has been filtered as a candidate false positive (FP).
cnvLength - The length of the CNV is lower than a threshold.
cnvLikelihoodRatio - The log10 likelihood ratio of ALT to REF is less than a threshold.
cnvMosaicLength - A MOSAIC call below a certain length has been filtered as a candidate FP.
cnvQual - The QUAL of the CNV is lower than a threshold.
chromArmBinCount - A whole-arm alteration call is based on a minimal portion (default 500 intervals) of the entire arm (e.g., in acrocentric chromosomes, where the short arm consists mainly of poor-mappability regions that are ignored during copy-number calling).
dinucQual - Applied based on the percentage of bases in a segment that belong to a two-base set (GC, CT, or AC), determined by individual occurrences. A CNV call is filtered out if any of these percentages falls outside typical ranges, indicating a likely false positive.
highCN - A CNV call with implausible copy number (>6).
lengthDegenerate - Marks records as non-PASSing based on each record's length (REFLEN) when the caller returns the default model. Segments shorter than 1 Mb are assigned this filter when returning the default model.
segmentMean - Marks records as non-PASSing based on each record's segment mean (SM) when the caller returns the default model. Segments having insufficient SM in DELs or DUPs are assigned this filter when returning the default model.
SqQual - Marks records as non-PASSing based on each record's somatic quality (SQ) when the caller returns the default model. Segments having insufficient SQ are assigned this filter when returning the default model. SQ is the somatic quality value, a Phred-scaled score of the p-value from a 2-sample t-test comparing normalized counts of CASE vs PON.
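The SQ value is described as a Phred-scaled p-value. Under the standard Phred definition, Q = -10 * log10(p), so a p-value of 0.001 corresponds to a score of 30. The following Python sketch shows that conversion only. Whether DRAGEN rounds or caps the score is not stated here, and the function name is hypothetical.

```python
import math


def phred_from_p_value(p: float) -> float:
    """Standard Phred scaling: Q = -10 * log10(p), for 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    return -10.0 * math.log10(p)


for p in (0.1, 0.01, 0.001):
    print(p, round(phred_from_p_value(p), 3))
```

Smaller p-values (stronger evidence that case and panel-of-normals counts differ) give larger scores.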

INFO

The INFO column contains information representing the event.

REFLEN indicates the length of the event.
SVLEN indicates the length of the event and it is only present for non-REF records. Note: if the legacy DRAGEN VCF format (VCF v4.2) has been enabled with --cnv-enable-legacy-vcf-format, SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion).
SVTYPE is always CNV and only present for non-REF records.
END indicates the end position of the event (1-based, inclusive).
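Because POS holds the padding base that precedes the event and END is inclusive, the event can be taken to span POS+1 through END, which gives a length of END minus POS. The following Python sketch applies that arithmetic. It assumes REFLEN follows this convention, which the text implies but does not state outright, and the function name and sample values are hypothetical.

```python
def event_interval(pos: int, info: str):
    """Return (start, end, length) of a CNV event, 1-based inclusive.

    Assumes POS is the padding base before the event and INFO carries END.
    """
    fields = dict(
        entry.split("=", 1) for entry in info.split(";") if "=" in entry
    )
    end = int(fields["END"])
    start = pos + 1  # first base of the event itself
    return start, end, end - pos


# Hypothetical record with POS=1000 and END=2000.
print(event_interval(1000, "SVTYPE=CNV;END=2000;REFLEN=1000"))
```

Comparing the computed length against the REFLEN value in the same record is a cheap sanity check when writing a custom parser.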

Germline CNV includes the following INFO fields:

ID    Description
GCP   Percentage of bases that are G or C
CTP   Percentage of bases that are C or T
ACP   Percentage of bases that are A or C
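The three percentages can be illustrated on a plain sequence string. The following Python sketch counts bases per two-base set, as in the field descriptions. How DRAGEN handles ambiguous bases such as N is not described, so this sketch counts them in the denominator, which is an assumption. The function name is hypothetical.

```python
def dinuc_percentages(seq: str) -> dict:
    """Percentage of bases falling in each two-base set (GC, CT, AC)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")

    def pct(bases: str) -> float:
        # Sum the individual occurrences of each base in the set.
        return 100.0 * sum(seq.count(base) for base in bases) / len(seq)

    return {"GCP": pct("GC"), "CTP": pct("CT"), "ACP": pct("AC")}


print(dinuc_percentages("ACGT"))  # each set covers half of the bases
```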

If a segment BED file is used, the segment identifier is carried over from the input to the SEGID field.

In Germline WGS (ASCN) CNV the MOSAIC tag identifies mosaic calls. In Somatic CNV the HET tag identifies subclonal calls. See Subclonal/Mosaic-Calling Mode for more details.

When matching CNV with SV output, additional INFO annotations are added. See CNV With SV Support.

FORMAT

The common FORMAT fields are described in the header:

ID   Description
GT   Genotype
SM   Linear copy ratio of the segment mean
CN   Estimated copy number
BC   Number of bins in the region
PE   Number of improperly paired end reads at start and stop breakpoints

Germline WES CNV includes the following FORMAT fields:

Description

Log10 likelihood ratio of ALT to REF

Allele-Specific CN callers (e.g., Germline WGS ASCN and Somatic WGS/WES ASCN) include the following FORMAT fields:

Description

Number of allelic read count sites

Number of read count bins

Estimated total copy number of sample (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.

CNF

Floating point estimate of copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.

CNQ

Exact total copy number Q-score. This field is not present if the model cannot be estimated with high confidence.

MAF

Estimate for the minor allele frequency

MCN

Estimated minor-haplotype copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.

MCNF

Floating point estimate of minor-haplotype copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.

MCNQ

Minor copy number Q-score. This field is not present if the model cannot be estimated with high confidence.

OBF

Per-segment Outlier BAF Fraction. Percentage of BAF counts which are considered "outlier" with respect to the chosen segment call. Higher values might indicate segments where BAF counts are problematic.

Best estimate of segment's bias-corrected read count

Somatic ASCN (WGS/WES) CNV also includes the following FORMAT fields:

ID     Description
NCN    Normal-sample copy number. The field is only present in germline-aware mode.
SCND   Difference between CN and NCN. The field is only present in germline-aware mode.

Somatic WES CNV without ASCN support provides only the common FORMAT fields and does not include the CN entry, since it does not estimate the tumor purity fraction and cannot make an estimate of the copy number.

Note on genotype annotation in germline copy number calling (non-ASCN)

Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype (GT) field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:

Diploid or Haploid?

ALT

FORMAT:CN

FORMAT:GT

Diploid

./.

Diploid

<DUP>

./1

Diploid

<DEL>

0/1

Diploid

<DEL>

1/1

Haploid

<DUP>

Haploid

<DEL>

Coverage Uniformity

The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity metric is present in the VCF header. The metric is only available for germline samples. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.

A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise and could be considered poor quality. The CoverageUniformity metric depends on factors other than sample quality, such as the cnv-interval-width setting and sample mean coverage. DRAGEN recommends using this score to compare the quality of samples with similar mean coverage that were run with the same command-line options. Because of this, DRAGEN CNV only provides the metric and does not take any action based on it.

Cytogenetics modality output

File extension: *.cyto.vcf.gz

The Cytogenetics modality output has a format similar to the standard CNV VCF (*.cnv.vcf.gz), with the following differences:

Records can have the INFO/RES field, which indicates the resolution(s) associated with the record.
Records can have the INFO/SEGID field. This field indicates either custom predefined segments supplied as input by the user (as in the standard CNV VCF) or Cytogenetics-specific predefined segments, which are typically whole-arm or whole-chromosome segments automatically injected during caller execution. In the latter case, the field gives the ID or name of the arm or chromosome.
The VCF header is annotated with ##source=DRAGEN_CYTO to indicate the file is generated by the Cytogenetics modality.

See Cytogenetics Modality for more details.

CNV Metrics File

DRAGEN CNV outputs metrics in CSV format. The output follows the general convention for QC metrics reporting in DRAGEN. The CNV metrics are output to a file with a *.cnv_metrics.csv file extension. The following list summarizes the metrics that are output from a CNV run.

Sex Genotyper:

The estimated sex of the case sample and of all panel of normals samples is reported. For WGS workflows, the estimated sex karyotype is reported; for non-WGS workflows, the estimated sex is reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.

CNV Summary:

Bases in reference genome in use
Average alignment coverage over genome - The average alignment coverage over the genome is calculated by dividing the total number of bases from processed alignment records (excluding those filtered by the Target Counts stage in DRAGEN CNV) by the genome length. Alignment records are filtered taking into consideration duplicate marking status (if available), MAPQ, and mapping status.
Number of alignment records processed
- Number of filtered records (total)
- Number of filtered records (due to duplicates)
- Number of filtered records (due to MAPQ)
- Number of filtered records (due to being unmapped)
PMAD - Pairwise Median Absolute Deviation measures the variation in read coverage between adjacent bins. It measures variability due to various factors, such as DNA degradation, extraction, amplification, or library preparation. Higher values indicate noisier sample data. PMAD is calculated as follows:
- Define a vector v[i] as normalized counts of i-th interval in log scale, and d[i] as pairwise differences of consecutive normalized counts between i and i+1 intervals, i.e. d[i] = (v[i] - v[i+1])
- PMAD is median absolute deviation of d, i.e. PMAD = Median(|d[i]-Median(d)|)
Coverage MAD - Median absolute deviation of normalized case counts. Higher values indicate noisier sample data.
Median Bin Count - Median of raw counts normalized by interval size.
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications
Number of deletions
Number of PASS amplifications
Number of PASS deletions
Post-Normalization Bin Count Sigma - Standard deviation of post-PoN-normalization median-normalized coverage values.

Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Post-Normalization Bin Count Sigma is only printed when PoN normalization has been applied.
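The PMAD definition given in the list above can be written out directly. The following Python sketch follows the two steps as stated: it forms the consecutive differences d[i] = v[i] - v[i+1], then takes the median absolute deviation of d. The function name and the toy input are illustrative, and the input is assumed to already be normalized counts in log scale.

```python
from statistics import median


def pmad(v):
    """Pairwise median absolute deviation of log-scale normalized counts.

    d[i] = v[i] - v[i+1];  PMAD = median(|d[i] - median(d)|)
    """
    if len(v) < 2:
        raise ValueError("need at least two intervals")
    d = [v[i] - v[i + 1] for i in range(len(v) - 1)]
    center = median(d)
    return median(abs(x - center) for x in d)


# Toy values: d = [-1, -2, -3], median(d) = -2, deviations = [1, 0, 1]
print(pmad([0.0, 1.0, 3.0, 6.0]))  # 1.0
```

Because the metric is built from differences of neighboring bins, a constant offset in coverage does not change it.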

Intermediate and Visualization Files

Intermediate stages of the pipeline produce various intermediate output files. These files can be useful for visualization of the evidence or results from each stage, and may aid in fine-tuning options.

All files have a structure similar to a BED file with optional header line(s).

Target Counts

The file *.target.counts.gz is a compressed tab-delimited text file that contains the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid file, which is normalized to the normal ploidy level of 2 instead of raw counts.

It has the following columns:

Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gz file is shown below.

#TARGET COUNTS FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start  stop   name                <SampleName> improper_pairs
1       565480 565959 target-wgs-1-565480 7          6
1       566837 567182 target-wgs-1-566837 9          0
1       713984 714455 target-wgs-1-713984 34         4
1       721116 721593 target-wgs-1-721116 47         1
1       724219 724547 target-wgs-1-724219 24         21
1       725166 725544 target-wgs-1-725166 43         12
1       726381 726817 target-wgs-1-726381 47         14
1       753243 753655 target-wgs-1-753243 31         2
1       754322 754594 target-wgs-1-754322 27         0
1       754594 755052 target-wgs-1-754594 41         0
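A file with this layout can be read with a small parser. The following Python sketch skips the # header lines, treats the first remaining line as the column header, and converts the numeric columns. The function name and the dictionary keys are assumptions made for illustration. It also assumes interval names contain no whitespace.

```python
def read_target_counts(lines):
    """Parse target-counts style lines into a list of dictionaries."""
    rows = []
    header_seen = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # meta-information such as ##DRAGENVersion=...
        if not header_seen:
            header_seen = True  # contig, start, stop, name, <sample>, ...
            continue
        contig, start, stop, name, count, improper = line.split()
        rows.append({
            "contig": contig,
            "start": int(start),
            "stop": int(stop),
            "name": name,
            "count": float(count),  # float also covers GC-corrected files
            "improper_pairs": int(improper),
        })
    return rows


example = [
    "#TARGET COUNTS FILE",
    "contig\tstart\tstop\tname\t<SampleName>\timproper_pairs",
    "1\t565480\t565959\ttarget-wgs-1-565480\t7\t6",
    "1\t566837\t567182\ttarget-wgs-1-566837\t9\t0",
]
print(read_target_counts(example)[0])
```

For a real file, the lines could come from a handle opened with gzip.open(path, "rt").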

B-Allele counts

In germline ASCN runs, B-allele counts are calculated at bi-allelic sites taken from a collection of high-frequency SNVs in the population. In somatic ASCN runs, B-allele counts are calculated at sites in the tumor sample where the normal sample is likely to be heterozygous. When analyzed in conjunction with a matched normal sample, the sites are those that are called as heterozygous SNVs in the normal sample. When analyzed in tumor-only mode, sites are selected from a population collection (similar to germline ASCN runs). Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the sample supporting each of these alleles is counted.

B-allele counts are written to both a gzipped TSV file (*.ballele.counts.gz) and a gzipped bedgraph file (*.baf.bedgraph.gz).

B-allele tsv

The tsv file format is the following:

Contig identifier
Start, BED-style (zero-based inclusive) start position of the reference allele
Stop, BED-style (one-based inclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele

Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:

Population frequency for the first allele
Population frequency for the second allele

An example of B-allele counts file is provided below:

contig  start   stop    refAllele       allele1 allele2 allele1Count    allele2Count
chr1    11021   11022   G       G       A       4       2
chr1    14463   14464   A       A       T       111     36
chr1    16494   16495   G       G       C       122     262
chr1    38741   38742   C       C       T       9       9
chr1    39014   39015   A       A       C       38      48
chr1    39260   39261   T       T       C       199     143
chr1    48447   48448   C       C       T       8       15
chr1    48517   48518   A       A       G       13      15
chr1    91485   91486   G       G       C       1       4
chr1    91489   91490   A       A       G       1       3
chr1    98944   98945   C       C       T       46      114

B-allele bedgraph

The bedgraph file format is similar to the BED format and it has the following columns:

Contig identifier
Start
Stop
Ratio of allele counts

The numerator and denominator of the ratio are determined by sorting the allele counts according to the priority of the corresponding bases. The order of the bases in descending priority is {A, T, G, C}.

When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:

allele1Count / (allele1Count + allele2Count)

When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:

allele2Count / (allele1Count + allele2Count)

By prioritizing the bases in this way, the output frequencies will be deterministically distributed in a roughly equal proportion above and below 0.5. When plotting these B-allele frequencies (e.g., in IGV), this gives an easy way to visually determine significant changes in b-allele frequency between neighboring segments of the genome. It also provides a similar visualization to that typically used for array data.

An example of the bedgraph file is shown below:

chr1    11021   11022   0.333333
chr1    14463   14464   0.755102
chr1    16494   16495   0.317708
chr1    38741   38742   0.5
chr1    39014   39015   0.44186
chr1    39260   39261   0.581871
chr1    48447   48448   0.652174
chr1    48517   48518   0.464286
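The priority rule can be checked against the example files. The following Python sketch applies the {A, T, G, C} ordering to one row of the B-allele counts file and returns the ratio written to the bedgraph. The function name is hypothetical, and the sketch only handles single-base alleles, which is an assumption.

```python
# Lower rank means higher priority: A > T > G > C.
PRIORITY = {"A": 0, "T": 1, "G": 2, "C": 3}


def baf_ratio(allele1, allele2, count1, count2):
    """Ratio of the higher-priority allele's count to the total count."""
    total = count1 + count2
    if total == 0:
        return None  # no qualified reads at this site
    if PRIORITY[allele1] < PRIORITY[allele2]:
        numerator = count1
    else:
        numerator = count2
    return numerator / total


# First row of the counts example: G=4, A=2. A outranks G, so 2 / 6.
print(round(baf_ratio("G", "A", 4, 2), 6))  # 0.333333
```

The same call reproduces the other bedgraph rows; for example, allele counts A=111 and T=36 give 111 / 147, or about 0.755102.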

Bias correction

The file *.target.counts.gc-corrected.gz contains the number of GC-corrected read counts per target interval. The format is equivalent to that of the *.target.counts.gz file:

Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.

An example of a *.target.counts.gc-corrected.gz file is shown below.

#GC CORRECTED FILE
##DRAGENVersion=<VERSION_INFO>
##DRAGENCommandLine=<CommandLineOptions>
...
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   1071.353133     6
chr1    819840  821337  target-wgs-chr1-819840:821337   1051.014997     19
chr1    821337  822485  target-wgs-chr1-821337:822485   1098.6502       10
chr1    822485  824431  target-wgs-chr1-822485:824431   1117.28308      7
chr1    830446  832304  target-wgs-chr1-830446:832304   1102.211816     1
chr1    832304  834311  target-wgs-chr1-832304:834311   1004.822683     5
chr1    836677  838659  target-wgs-chr1-836677:838659   1015.973037     7
chr1    841054  843056  target-wgs-chr1-841054:843056   1014.921403     3

Combined counts

The file *.combined.counts.txt.gz is a column-wise concatenation of individual *.target.counts.gz and *.target.counts.gc-corrected.gz used to form the panel of normals.

Normalization

The file *.tn.tsv.gz contains the normalized signal of the case sample, per target interval, i.e., the log2-normalized copy ratio signal. A strong signal deviation from 0.0 indicates a potential for a CNV event. The format is equivalent to that of the *.target.counts.gz file:

Contig identifier
Start position
End position
Target interval name
Log2-normalized read counts in this interval
Count of improperly paired alignments in this interval

Header lines are also included that start with #. In some cases, the normalization counts could be patched internally with intervals from other processes, such as the SegDups extension. In such cases, patches are indicated (sorted in order of application) with header lines starting with #patch:

#patch 1 = <normalized_counts_patch_1_filename>
#patch 2 = <normalized_counts_patch_2_filename>
...

and the original (unpatched) *.tn.tsv.gz is renamed as *.tn.unpatched.tsv.gz. Note: this file is reported in output for inspection, but most use cases will use the (patched) *.tn.tsv.gz file downstream of normalization.

An example of a *.tn.tsv.gz file is shown below.

#title = Normalized coverage profile
#sex = UNDETERMINED
contig  start   stop    name    <SampleName> improper_pairs
chr1    818022  819840  target-wgs-chr1-818022:819840   -0.18479358083014644    6
chr1    819840  821337  target-wgs-chr1-819840:821337   -0.21244441644669046    19
chr1    821337  822485  target-wgs-chr1-821337:822485   -0.14849555308041734    10
chr1    822485  824431  target-wgs-chr1-822485:824431   -0.12423291178926463    7
chr1    830446  832304  target-wgs-chr1-830446:832304   -0.1438261733656668     1
chr1    832304  834311  target-wgs-chr1-832304:834311   -0.27728673450293895    5
chr1    836677  838659  target-wgs-chr1-836677:838659   -0.26136555699676262    7
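Because the values are on a log2 scale, a value of 0.0 corresponds to a linear copy ratio of 1.0. The following Python sketch converts a log2 value back to the linear scale, which can make individual bins easier to compare with the linear segment means reported elsewhere. The function name is hypothetical.

```python
def log2_to_linear(log2_ratio: float) -> float:
    """Convert a log2-normalized copy ratio to a linear copy ratio."""
    return 2.0 ** log2_ratio


# First value from the example above; the result is a ratio slightly below 1.
print(log2_to_linear(-0.18479358083014644))
print(log2_to_linear(0.0))   # 1.0, no deviation
print(log2_to_linear(-1.0))  # 0.5, half the expected signal
```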

Segmentation

File extension: *.seg, *.seg.called, *.seg.called.merged

Files containing the segments produced by the segmentation algorithm. The Segment_Mean value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential for a CNV event.

The *.seg file has the following columns:

Sample name
Contig identifier
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment

An example of a *.seg file is shown below.

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean
<SampleName> chr1    818022  1117426 224     0.82500341336435279
<SampleName> chr1    1117426 4063702 2438    0.91726081432236528
<SampleName> chr1    4063702 4067591 3       0.38861386123247205
<SampleName> chr1    4067591 7705829 3302    0.93021316913709917
<SampleName> chr1    7705829 9357003 1405    0.98147825043799442
<SampleName> chr1    9357003 9377365 19      0.50269670724395654
<SampleName> chr1    9377365 12859821        2905    1.0684818476332989
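As a quick way to scan a *.seg file, the following Python sketch lists segments whose linear copy ratio is far from 1.0. The tolerance of 0.2 is borrowed from the cnvCopyRatio header description and is used here purely as an illustrative cutoff, not as the caller's decision rule. The function name is hypothetical, and sample names are assumed to contain no whitespace.

```python
def flag_segments(lines, tolerance=0.2):
    """Return segments with |Segment_Mean - 1.0| greater than tolerance."""
    flagged = []
    header_seen = False
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if not header_seen:
            header_seen = True  # Sample, Chromosome, Start, End, ...
            continue
        _sample, contig, start, end, _num_probes, seg_mean = fields
        ratio = float(seg_mean)
        if abs(ratio - 1.0) > tolerance:
            flagged.append((contig, int(start), int(end), ratio))
    return flagged


example = [
    "Sample\tChromosome\tStart\tEnd\tNum_Probes\tSegment_Mean",
    "<SampleName>\tchr1\t818022\t1117426\t224\t0.82500341336435279",
    "<SampleName>\tchr1\t4063702\t4067591\t3\t0.38861386123247205",
    "<SampleName>\tchr1\t9357003\t9377365\t19\t0.50269670724395654",
]
for segment in flag_segments(example):
    print(segment)
```

With these rows, the two short segments near ratios 0.39 and 0.50 are listed, while the first segment (about 0.83) falls inside the tolerance.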

Germline-Specific (Depth-Only) Segmentation Output Files

The *.seg.called file is identical to the *.seg file, with an additional column indicating the initial call for whether the segment is a duplication (+) or a deletion (-).

The *.seg.called.merged file is identical to the *.seg.called file, but with segments potentially merged when they meet internal merging criteria. In addition to the columns described above, this file also has the following columns:

QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count

B-allele segmentation (ASCN callers)

In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has the suffix *.baf.seg and has the same format as the *.seg file, with two modifications. First, the Segment_Mean value is the mean over B-allele loci of the smaller observed allele fraction. Second, there is an additional column:

BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or . when the BAF data are too variable to estimate a minor-allele fraction

An example of segmentation output file is shown below:

Sample  Chromosome      Start   End     Num_Probes      Segment_Mean    BAF_SLM_STATE
<SampleName> chr1    820348  1104646 194     0.29301737166888697     6
<SampleName> chr1    1105091 1533754 444     0.26185904799069076     5
<SampleName> chr1    1533810 1534166 9       0.41958837071702065     8
<SampleName> chr1    1534217 9356793 6689    0.26034515815016335     5
<SampleName> chr1    9358304 9376529 27      0.46450553586280602     10
<SampleName> chr1    9378480 12859495        1651    0.24172965924359388     5

Model identification (ASCN callers)

In somatic ASCN callers the file *.cnv.purity.coverage.models.tsv describes the different tested models and their log-likelihood. It has columns:

Model purity (Cellularity)
Model diploid coverage
Model log-likelihood

An example is shown below:

#Purity Coverage        logL
1       384     -23441740.5209
0.99    566     -22926572.4287
0.99    726     -23281869.1423
0.99    1206    -24075475.1481
0.99    1836    -24334376.579
0.99    2256    -24380290.0335
0.99    2696    -24380616.8655
0.98    449     -23988016.7101

In the germline WGS ASCN caller, the file *.cnv.coverage.models.tsv serves the same purpose. However, because germline analysis has no concept of tumor purity, the first column is set to the default value of 1.
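The models file can be inspected with a few lines of Python. The sketch below assumes tab-separated columns in the order shown above and a header line starting with #. The function names are hypothetical. It only ranks the candidates by log-likelihood, and the top-ranked row is not necessarily the model that DRAGEN ultimately reports.

```python
import csv
import io


def read_models(lines):
    """Return (purity, coverage, logL) tuples from a models TSV."""
    models = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the '#Purity ...' header
        models.append((float(row[0]), float(row[1]), float(row[2])))
    return models


def rank_models(models):
    """Sort models from highest to lowest log-likelihood."""
    return sorted(models, key=lambda m: m[2], reverse=True)


# A few rows copied from the example above.
example = io.StringIO(
    "#Purity\tCoverage\tlogL\n"
    "1\t384\t-23441740.5209\n"
    "0.99\t566\t-22926572.4287\n"
    "0.99\t726\t-23281869.1423\n"
    "0.98\t449\t-23988016.7101\n"
)
ranked = rank_models(read_models(example))
best_purity, best_coverage, best_logl = ranked[0]
```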

Visualization

To generate additional equivalent bigWig and GFF files, set the --cnv-enable-tracks option to true. These files can be loaded into IGV along with other available tracks, such as RefSeq genes. Using these tracks alongside publicly available tracks makes calls easier to interpret. DRAGEN automatically generates an IGV session XML file if tracks are generated by DRAGEN CNV. The *.cnv.igv_session.xml file can be loaded directly into IGV for analysis.

The following IGV tracks are automatically populated in the output IGV session file:

*.target.counts.bw --- BigWig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are GC-corrected if GC correction was performed.
*.improper_pairs.bw --- BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw --- BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw --- BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.seg.bw --- BigWig representation of the BAF segments (if available). Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz --- BED graph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3 --- GFF3 representation of the CNV events. DEL events are shown in blue and DUP events in red. Filtered events are shown in light gray. If REF events are enabled, they are shown in green. When the caller can call AOH/LOH events, they are shown in magenta. An example of a DRAGEN CNV GFF3 file is shown below (different CNV workflows might output different attributes in the ninth column):

##gff-version 3
chr1    DRAGEN  LOSS    12779193        12859821        30      .       .       Alt=DEL;LinearCopyRatio=0.576;CopyNumber=1;Genotype=0/1;Qual=30;Filter=PASS;Start=12779192;Stop=12859821;Length=80629;BinCount=24;ImproperPairsCount=16,7;color=#0000FF;
chr1    DRAGEN  REF     13106280        13122338        19      .       .       Alt=REF;LinearCopyRatio=1.05981;CopyNumber=2;Genotype=./.;Qual=19;Filter=PASS;Start=13106279;Stop=13122338;Length=16059;BinCount=8;ImproperPairsCount=3,1;color=#00FF00;
chr1    DRAGEN  GAIN    13225213        13247040        66      .       .       Alt=DUP;LinearCopyRatio=2.016;CopyNumber=4;Genotype=./1;Qual=66;Filter=PASS;Start=13225212;Stop=13247040;Length=21828;BinCount=9;ImproperPairsCount=7,5;color=#FF0000;

For somatic WGS analyses, the following additional file is included in the IGV session XML:

*.tumor.baf.bedgraph.gz --- Bedgraph representation of the B-allele frequencies. Setting the track view in IGV to points and windowing function to none is recommended.

IGV Session

File extension: *.igv_session.xml

The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, by comparing the name of the --ref-dir specified on the command line. Standard UCSC human reference genomes are autodetected, but variations from the standard reference genomes might not be. To override the detected genome, change the genome attribute of the Session element to the reference genome you want before loading the file into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example of an edited session file.

<?xml version="1.0" encoding="utf-8"?>
<Session genome="b37" hasGeneTrack="false" hasSequenceTrack="true" version="8">
    <Resources>
        <Resource path="example.cnv.gff3"/>
        <Resource path="example.cnv.excluded_intervals.bed.gz"/>
        <Resource path="example.target.counts.bw"/>
        <Resource path="example.improper.pairs.bw"/>
        <Resource path="example.tn.bw"/>
        <Resource path="example.seg.bw"/>
    </Resources>
    <Panel height="500" width="1200" name="DataPanel">
        ...
    </Panel>
</Session>

Note that depending on the IGV version installed, it may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed so a user may have to edit the genome field in the XML file directly. For example, IGV has traditionally packaged a b37 reference genome, but may also include a 1kg_v37 or a 1kg_b37+decoy, which will appear on the IGV user interface as "1kg, b37" or "1kg, b37+decoy" respectively.

You can determine the correct encoding of a reference genome by going to File > Save Session... and then inspecting the generated igv_session.xml file.
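Editing the genome attribute can also be scripted. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name set_session_genome and the starting value hg19 are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def set_session_genome(xml_text, genome):
    """Return the session XML with the Session element's genome attribute replaced."""
    root = ET.fromstring(xml_text)
    if root.tag != "Session":
        raise ValueError(f"expected a <Session> root element, found <{root.tag}>")
    root.set("genome", genome)
    # Note: the serialized string does not include an XML declaration.
    return ET.tostring(root, encoding="unicode")


# Hypothetical session content; "hg19" stands in for an autodetected value.
session_xml = """<Session genome="hg19" hasGeneTrack="false" hasSequenceTrack="true" version="8">
    <Resources>
        <Resource path="example.cnv.gff3"/>
    </Resources>
</Session>"""

edited = set_session_genome(session_xml, "b37")
```

When working with a file on disk, ET.parse(path) followed by tree.write(path, encoding="utf-8", xml_declaration=True) keeps the XML declaration in place.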

When the Cytogenetics Modality is enabled, DRAGEN CNV produces an additional IGV session XML file, *.cyto.igv_session.xml. See the related section for a description of the tracks in this file.

Creating CNV coverage and BAF plots with third-party tools

DRAGEN CNV outputs can be read with third-party libraries in commonly used languages such as Python and R. The files typically used are:

*.target.counts.gz or *.target.counts.gc-corrected.gz, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.

All of these files use a BED-like format, so they can be loaded like any other tab-separated file.

Using R, a good starting point is the karyoploteR package. The main workflow involves reading the *.target.counts.gz file as an R dataframe, converting it to a GRanges object, and then plotting the target intervals as points with the karyoploteR package. The same workflow can be used to plot the GC-corrected counts, the log2-normalized copy ratios, and the BAF profiles.

Using Python, the workflow is similar: use a library such as pandas to load the DRAGEN output files into dataframes, and matplotlib to plot coverage and BAF profiles across the genome.
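The Python workflow could be sketched as follows. To keep dependencies small, the loader uses only the standard library (pandas.read_csv with sep="\t" and comment="#" would be an alternative), and matplotlib is imported only when plotting. The exact layout of each file is an assumption here: tab-separated fields, comment lines starting with #, an optional column-name line starting with contig, and a value column whose 0-based index is passed by the caller. Both function names are hypothetical.

```python
import csv
import gzip
from collections import defaultdict


def load_interval_values(path, value_column=4):
    """Load (midpoint, value) pairs per contig from a gzipped BED-like file.

    Assumptions: tab-separated fields, '#' comment lines, an optional
    column-name line starting with 'contig', and `value_column` as the
    0-based index of the value to plot.
    """
    per_contig = defaultdict(list)
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#") or row[0] == "contig":
                continue
            start, stop = int(row[1]), int(row[2])
            per_contig[row[0]].append(((start + stop) / 2, float(row[value_column])))
    return per_contig


def plot_contig(per_contig, contig, out_png, ylabel="value"):
    """Scatter-plot one contig's values against genomic position."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    positions, values = zip(*per_contig[contig])
    fig, ax = plt.subplots(figsize=(12, 3))
    ax.scatter(positions, values, s=2)
    ax.set_xlabel(f"{contig} position (bp)")
    ax.set_ylabel(ylabel)
    fig.savefig(out_png, dpi=150)
    plt.close(fig)
```

For a four-column bedgraph file such as *.baf.bedgraph.gz, value_column=3 would select the value column under the same assumptions.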

A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3 output file. Some examples of DRAGEN output GFF3 are shown below:

Germline WGS

chr1    DRAGEN  LOSS    12779193        12859821        30      .       .       Alt=DEL;LinearCopyRatio=0.576;CopyNumber=1;Genotype=0/1;Qual=30;Filter=PASS;Start=12779192;Stop=12859821;Length=80629;BinCount=24;ImproperPairsCount=16,7;color=#0000FF;
chr1    DRAGEN  REF     13106280        13122338        19      .       .       Alt=REF;LinearCopyRatio=1.05981;CopyNumber=2;Genotype=./.;Qual=19;Filter=PASS;Start=13106279;Stop=13122338;Length=16059;BinCount=8;ImproperPairsCount=3,1;color=#00FF00;
chr1    DRAGEN  GAIN    13225213        13247040        66      .       .       Alt=DUP;LinearCopyRatio=2.016;CopyNumber=4;Genotype=./1;Qual=66;Filter=PASS;Start=13225212;Stop=13247040;Length=21828;BinCount=9;ImproperPairsCount=7,5;color=#FF0000;

Somatic WGS

chr1    DRAGEN  GAIN    16605768        16949283        237     .       .       Start=16605769;Stop=16949283;Length=343515;Alt=<DUP>;Qual=237;Filter=PASS;Genotype=1/1;CopyNumber=4;MinorCopyNumber=2;CopyNumberQual=1;MinorCopyNumberQual=1;CopyNumberFloat=4.371887;MinorCopyNumberFloat=2.000000;BiasCorrectedReadCount=1182.6;MinorAlleleFrequency=0.5;BinCount=74;ImproperPairsCount=15,17;NumAllelicSites=223;color=#FF0000;
chr1    DRAGEN  CNLOH   16949283        23272950        1000    .       .       Start=16949284;Stop=23272950;Length=6323667;Alt=<LOH>;Qual=1000;Filter=PASS;Genotype=1/1;CopyNumber=2;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=2.090573;MinorCopyNumberFloat=0.000000;BiasCorrectedReadCount=565.5;MinorAlleleFrequency=0;BinCount=5572;ImproperPairsCount=17,84;NumAllelicSites=2517;color=#FF00FF;
chr1    DRAGEN  LOSS    23272950        25394644        1000    .       .       Start=23272951;Stop=25394644;Length=2121694;Alt=<DEL>;Qual=1000;Filter=PASS;Genotype=0/1;CopyNumber=1;MinorCopyNumber=0;CopyNumberQual=1000;MinorCopyNumberQual=1000;CopyNumberFloat=1.069501;MinorCopyNumberFloat=0.000000;BiasCorrectedReadCount=289.3;MinorAlleleFrequency=0;BinCount=1718;ImproperPairsCount=84,5;NumAllelicSites=872;color=#0000FF;

From the output GFF3, the typical steps to follow are to parse each segment coordinates and the CopyNumber annotation (or any other annotation the user might want to plot), and to plot them using the libraries listed previously for coverage/BAF profiles (or any other library and language of user's choice).
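The parsing step can be sketched in Python as below, using only the standard library. It assumes standard nine-column, tab-separated GFF3 lines with key=value attributes separated by semicolons, as in the examples above. The function names are hypothetical.

```python
def parse_gff3_line(line):
    """Split one GFF3 feature line into its coordinates and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated GFF3 columns, found {len(fields)}")
    attributes = {}
    for item in fields[8].split(";"):
        if item:  # the trailing ';' leaves an empty last item
            key, _, value = item.partition("=")
            attributes[key] = value
    return {
        "chrom": fields[0],
        "type": fields[2],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "attributes": attributes,
    }


def copy_number_segments(lines, attribute="CopyNumber"):
    """Yield (chrom, start, end, value) for features carrying `attribute`."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '##gff-version' style directives
        record = parse_gff3_line(line)
        if attribute in record["attributes"]:
            yield (record["chrom"], record["start"], record["end"],
                   float(record["attributes"][attribute]))


# The germline LOSS record from the example above.
loss_line = (
    "chr1\tDRAGEN\tLOSS\t12779193\t12859821\t30\t.\t.\t"
    "Alt=DEL;LinearCopyRatio=0.576;CopyNumber=1;Genotype=0/1;Qual=30;Filter=PASS;"
    "Start=12779192;Stop=12859821;Length=80629;BinCount=24;"
    "ImproperPairsCount=16,7;color=#0000FF;\n"
)
segments = list(copy_number_segments(["##gff-version 3\n", loss_line]))
```

Each tuple can then be drawn as a horizontal line, for example with matplotlib's ax.hlines(value, start, end). Passing attribute="MinorCopyNumber" selects the minor copy number where it is present.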

Excluded Intervals File

To improve accuracy, the DRAGEN CNV Pipeline excludes genomic intervals if one or more of the target intervals failed at least one quality requirement. The excluded intervals are reported in the *.cnv.excluded_intervals.bed.gz file. The file is in BED format. It identifies the regions of the genome that are not callable for CNV analysis and gives the reason each interval was excluded in the fourth column. The following are the possible reasons for exclusion.

| Exclusion Reason | Description | Related DRAGEN Option |
| --- | --- | --- |
| NON_KMER_UNIQUE | Non-unique Kmer bases are larger than 50% of interval. | Not applicable. This reason only applies to self-normalization mode. |
| EXCLUDE_BED | Interval overlaps with exclude BED larger than threshold. | --cnv-exclude-bed-min-overlap |
| PON_MAX_PERCENT_ZERO_SAMPLES | Number of PON samples with 0 coverage is larger than threshold. | --cnv-max-percent-zero-samples |
| PON_TARGET_FACTOR_THRESHOLD | Median coverage of interval is lower than threshold of overall median coverage. | --cnv-target-factor-threshold |
| PON_MISSING_INTERVAL | Target interval not found in PON. | Not applicable |

An example of a *.cnv.excluded_intervals.bed.gz file is shown below:

chr1    0       818022  NON_KMER_UNIQUE
chr1    824431  830446  NON_KMER_UNIQUE
chr1    834311  836677  NON_KMER_UNIQUE
chr1    838659  841054  NON_KMER_UNIQUE
chr1    850451  853257  NON_KMER_UNIQUE
chr1    855442  860261  NON_KMER_UNIQUE
chr1    866189  868833  NON_KMER_UNIQUE
chr1    881779  884116  NON_KMER_UNIQUE
chr1    1016667 1018959 NON_KMER_UNIQUE
chr1    1075880 1079718 NON_KMER_UNIQUE
chr1    1137942 1140725 NON_KMER_UNIQUE
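As a quick check of how much of the genome was excluded and why, the file can be summarized per exclusion reason. This is a minimal sketch assuming four tab-separated BED columns as in the example above. The function names are hypothetical.

```python
import gzip
from collections import Counter


def excluded_bases_by_reason(lines):
    """Sum excluded bases per exclusion reason from BED rows.

    BED intervals are 0-based and half-open, so the length is end - start.
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        start, end, reason = int(fields[1]), int(fields[2]), fields[3]
        totals[reason] += end - start
    return totals


def excluded_bases_from_file(path):
    """Same summary, read from a gzipped *.cnv.excluded_intervals.bed.gz file."""
    with gzip.open(path, "rt") as handle:
        return excluded_bases_by_reason(handle)


# The first two rows of the example above.
rows = [
    "chr1\t0\t818022\tNON_KMER_UNIQUE\n",
    "chr1\t824431\t830446\tNON_KMER_UNIQUE\n",
]
totals = excluded_bases_by_reason(rows)
```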

Excluded Samples File

To improve accuracy, the DRAGEN CNV Pipeline excludes panel of normals samples that fail at least one quality requirement. The excluded samples are reported in the *.cnv.excluded_samples.txt.gz file. The file is in TSV (tab-separated) format. It identifies the excluded panel of normals samples and gives the reason for each exclusion. The following are the possible reasons for exclusion.

| Exclusion Reason | Description | Related DRAGEN Option |
| --- | --- | --- |
| PON_SAMPLE_NAME_EQUAL_TO_CASE | PON sample name is equal to case sample name | |
| PON_SAMPLE_CORRELATION_EQUAL_TO_CASE | PON sample counts are equal to case sample counts | |
| PON_MAX_PERCENT_NAN_SAMPLES | number of nan values in sample is higher than threshold | --cnv-max-percent-nan-samples (default=50) |
| MAX_PERCENT_ZERO_TARGETS | number of 0 target counts in sample is higher than threshold | --cnv-max-percent-zero-targets (default=5) |
| EXTREME_PERCENTILE:UPPER | median coverage of sample is higher than threshold | --cnv-extreme-percentile (default=2.5) |
| EXTREME_PERCENTILE:LOWER | median coverage of sample is lower than threshold | --cnv-extreme-percentile (default=2.5) |

An example of a *.cnv.excluded_samples.txt.gz file is shown below:

#name        reason                                 value        threshold
Sample1      MAX_PERCENT_ZERO_TARGETS               4776         418
Sample2      EXTREME_PERCENTILE:LOWER               0.000812534  0.20065
Sample3      EXTREME_PERCENTILE:UPPER               1.0003       1.00025
Sample4      PON_SAMPLE_NAME_EQUAL_TO_CASE          NA           NA
Sample5      PON_SAMPLE_CORRELATION_EQUAL_TO_CASE   NA           NA

The excluded samples output file may not exist if there are no excluded samples.
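Because the file may be absent, a reader should treat a missing file as "no samples excluded". The sketch below does that. It assumes four whitespace-delimited fields per row and that no field (including the sample name) contains whitespace. The function name is hypothetical.

```python
import gzip
import os


def read_excluded_samples(path):
    """Return excluded PON samples as dicts; an absent file means none were excluded."""
    if not os.path.exists(path):
        return []
    records = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and the '#name ...' header
            # Assumes no field contains whitespace, so tabs or padded spaces both work.
            name, reason, value, threshold = line.split()
            records.append({
                "name": name,
                "reason": reason,
                "value": None if value == "NA" else float(value),
                "threshold": None if threshold == "NA" else float(threshold),
            })
    return records
```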

Panel of Normals Files

PON Metrics File

The DRAGEN CNV Pipeline generates the PON Metrics File (.cnv.pon_metrics.tsv.gz) if a Panel of Normals is provided and --cnv-generate-pon-metric-file is set to true. If the PON contains fewer than two samples, an empty file is generated.

The PON Metrics File includes basic statistics of the coverage profile for each interval. To remove sample coverage bias, DRAGEN applies sample median normalization, and then computes the following metrics:

| Column contents (in file order) | Description |
| --- | --- |
| contig | chromosome name |
| start | genomic locus of interval start |
| stop | genomic locus of interval stop |
| name | interval name |
| mean | average coverage depth |
| std | standard deviation |
| normalizedStd | normalized standard deviation (std/mean) |
| min | minimum |
| 25% | 25th percentile |
| 50% | median |
| 75% | 75th percentile |
| max | maximum |
| intervalSize | interval size (stop-start) |
| gcContents | percent GC |

Example:

contig  start   stop    name    mean    std     normalizedStd   min     25%     50%     75%     max     intervalSize    gcContents
1       12098   12178   target-wes-1-12098:12178/1      3.6259044560802365      0.46661435469856077      0.1286890927079175     2.7961783439490446      3.2573018790849675      3.7105263157894739      4.0162683823529415      4.3298969072164946      80      0.49382716049382713
1       12178   12258   target-wes-1-12178:12258/2      5.0685579775753595      0.70638315915955963      0.13936570564740217     3.9044585987261144      4.5225944682508761      5.067708333333333       5.5778115844038769      6.3277777777777775      80      0.46913580246913578
1       12553   12637   target-wes-1-12553:12637/1      4.6990858287992054      0.62537786269786677      0.13308500535681309     3.7417218543046356      4.0305632538350444      5.0382165605095546      5.2151580459770113      5.5773195876288657      84      0.6705882352941176
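One possible use of these metrics is to flag intervals whose coverage varies widely across the PON. The sketch below recomputes std/mean (the documented definition of normalizedStd) and reports intervals above a cutoff. The function name and the 0.135 cutoff are purely illustrative assumptions, and only the columns needed here are included in the sample data.

```python
import csv
import io


def noisy_intervals(lines, max_normalized_std):
    """Return (name, std/mean) for intervals whose ratio exceeds the cutoff."""
    flagged = []
    for row in csv.DictReader(lines, delimiter="\t"):
        mean, std = float(row["mean"]), float(row["std"])
        if mean == 0:
            continue  # ratio is undefined for intervals without coverage
        ratio = std / mean  # documented definition of normalizedStd
        if ratio > max_normalized_std:
            flagged.append((row["name"], ratio))
    return flagged


# First two rows of the example above, restricted to the columns used here.
header = ["contig", "start", "stop", "name", "mean", "std", "normalizedStd"]
rows = [
    ["1", "12098", "12178", "target-wes-1-12098:12178/1",
     "3.6259044560802365", "0.46661435469856077", "0.1286890927079175"],
    ["1", "12178", "12258", "target-wes-1-12178:12258/2",
     "5.0685579775753595", "0.70638315915955963", "0.13936570564740217"],
]
text = "\n".join("\t".join(r) for r in [header] + rows) + "\n"
flagged = noisy_intervals(io.StringIO(text), max_normalized_std=0.135)
```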

PON Correlation File

The DRAGEN CNV Pipeline generates the PON Correlation File (.cnv.pon_correlation.txt.gz) if a Panel of Normals is provided. The PON Correlation File reports the correlation between the case sample and each PON sample.

Example:

Correlation of case sample CASE_SAMPLE_NAME
  PON1: 0.9786
  PON2: 0.9868
  PON3: 0.9912
  ...
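The correlation file is plain text rather than a table, so a small parser helps when looking for PON samples that correlate poorly with the case sample. This is a sketch that assumes each PON entry is a "name: value" line as in the example above. The function name is hypothetical.

```python
import re

# One "name: value" entry per line; the title line and other text do not match.
_PON_LINE = re.compile(r"^\s*(\S+):\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$")


def read_pon_correlations(lines):
    """Map each PON sample name to its correlation with the case sample."""
    correlations = {}
    for line in lines:
        match = _PON_LINE.match(line)
        if match:
            correlations[match.group(1)] = float(match.group(2))
    return correlations


example = [
    "Correlation of case sample CASE_SAMPLE_NAME\n",
    "  PON1: 0.9786\n",
    "  PON2: 0.9868\n",
    "  PON3: 0.9912\n",
]
correlations = read_pon_correlations(example)
least_correlated = min(correlations, key=correlations.get)
```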

SegDups Extension Files

The SegDups extension provides intermediate and final outputs. All intervals follow the BED convention (0-based, start inclusive, end exclusive) and are written to gzip-compressed, tab-delimited text files.

The final output has extension .cnv.segdups.rescued_intervals.tsv.gz, and contains the rescued target intervals which can then be injected before segmentation. It has columns:

Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)

Intermediate files

The joint normalized coverage profile (log2-scale) for each region is written to the file .cnv.segdups.joint_coverage.tsv.gz, with columns:

Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)

The data for the differentiating sites is written to the file .cnv.segdups.site_ratios.tsv.gz, with columns:

Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site
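The target ratio in the last column follows directly from the two count columns. As a small illustration in Python (the function name and the example counts are hypothetical, and returning None for a site with no counts is a choice made for this sketch, not documented DRAGEN behavior):

```python
def target_ratio(target_counts, non_target_counts):
    """Gene A counts over total (gene A + gene B) counts; None when the site has no counts."""
    total = target_counts + non_target_counts
    return None if total == 0 else target_counts / total


# Hypothetical counts: 30 reads support gene A and 10 support gene B at a site.
ratio = target_ratio(30, 10)
```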
