CNV Output
DRAGEN emits the calls in the standard VCF format. The VCF file includes only copy number gain and loss events. To include copy neutral (REF) calls in the output VCF, set --cnv-enable-ref-calls to true. AOH/LOH events are available in workflows where allele-specific copy number is available.
File extension: *.cnv.vcf.gz
The CNV VCF file follows the standard VCF format. Because CNV events are represented differently from structural variants, not all fields are applicable. In general, when more information is available about an event, that information is annotated. Some fields in the DRAGEN CNV VCF are unique to CNVs. The VCF header is annotated with ##source=DRAGEN_CNV to indicate that the file was generated by the DRAGEN CNV pipeline.
In the DRAGEN CNV component, two versions of the VCF specification are used for the *.cnv.vcf.gz file:
For non-ASCN workflows, the format used is
For ASCN workflows, the format used is
The differences between the two formats in output from DRAGEN are the following:
Field | VCF v4.2 (legacy) format | Default ASCN format
General
INFO/SVLEN | Positive or negative | Always positive
INFO/SVTYPE | CNV | Removed
Absence/Loss of Heterozygosity (AOH/LOH)
ALT | <DEL>,<DUP> | <LOH>
FORMAT/GT | 1/2 | 1/1
The following is an example of some of the header lines that are specific to CNV:
The following header lines are specific to the somatic ASCN callers (WGS/WES) and the germline WGS ASCN caller:
ModelSource
The primary basis on which the final model was chosen. The following values can be included:
DEPTH+BAF: Depth+BAF signal is used to determine the model.
DiploidCoverage
Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy
Length weighted average of copy number for PASS events (for the tumor fraction in somatic runs). The numeric value is unlimited.
OutlierBafFraction
A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].
HomozygosityIndex
Autosomal AOH/LOH percentage, considering only PASS AOH/LOH events at or above a minimum size. This metric can be used as a proxy for consanguinity in the germline WGS (ASCN) CNV caller. The default minimum size for a PASS AOH/LOH event to be considered is 2 Mb, because shorter ROHs often "do not arise from inbreeding in recent generations and are common in all of the populations represented in the HGDP" (Kirin et al., 2010). A custom minimum size can be set through the option cnv-min-length-homozygosity-index.
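The OverallPloidy description above (a length-weighted average of copy number) can be illustrated with a short Python sketch. The function name and the example events are hypothetical, and exactly which records DRAGEN includes beyond "PASS events" is not stated here, so the caller decides what to pass in.

```python
from typing import Iterable, Tuple


def length_weighted_ploidy(events: Iterable[Tuple[int, int]]) -> float:
    """Return the length-weighted mean copy number.

    Each event is a (length_in_bases, copy_number) pair. This only
    illustrates the arithmetic of a length-weighted average; it is not
    DRAGEN's implementation.
    """
    total_length = 0
    weighted_sum = 0
    for length, copy_number in events:
        total_length += length
        weighted_sum += length * copy_number
    if total_length == 0:
        raise ValueError("no events with non-zero length")
    return weighted_sum / total_length


# Hypothetical PASS events: 3 Mb at CN=2 and 1 Mb at CN=4.
example_events = [(3_000_000, 2), (1_000_000, 4)]
overall_ploidy = length_weighted_ploidy(example_events)
```

Longer events dominate the result, which is why a single large gain can shift the value noticeably while many short events barely move it.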
The following header lines are specific to the somatic ASCN callers (WGS/WES):
ModelSource can also have the following values:
DEPTH+BAF_DOUBLED: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
DEPTH+BAF_DEDUPLICATED: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
DEPTH+BAF_WEAK: Depth+BAF signal is used to determine a lower-confidence tumor model.
VAF: VAF signal is used to determine the tumor model due to insufficient depth+BAF signal.
SAMPLE_MEDIAN: The sample is treated as high-purity diploid in the absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to the sample median.
DEGENERATE_DIPLOID: The sample is treated as high-purity diploid in the absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to the lowest value observed in a substantial number of bases in segments with BAF=50%.
EstimatedTumorPurity
Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1], or NA if a confident model could not be determined.
AlternativeModelDedup
An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup
An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.
All coordinates in the VCF are 1-based.
The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.
The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.
The ID column is used to represent the event. The ID field encodes the event type and coordinates of the event (1-based, inclusive). In addition to representing GAIN, LOSS, and REF events, in Somatic ASCN (WGS/WES) and Germline ASCN (WGS) CNV, the ID can include the Copy Neutral Loss/Absence of Heterozygosity (CNLOH) or Copy Number Gain with LOH (GAINLOH) events.
The REF column contains an N for all CNV events.
The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the <DEL>, <DUP>, or <LOH> entries are used. If REF calls are emitted, their ALT is always ".". In workflows where allele-specific copy number (ASCN) is available, if the legacy DRAGEN VCF format (VCF v4.2) has been enabled with --cnv-enable-legacy-vcf-format, the ALT field contains two alleles, <DEL>,<DUP>, in place of <LOH>, for AOH/LOH events.
The FILTER column contains PASS if the CNV event passes all filters; otherwise, the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.
binCount
✓
✓
✓
cnvBinSupportRatio
✓
✓
✓
cnvCopyRatio
✓
✓
✓
cnvHetLength
✓
✓
cnvLength
✓
✓
✓
✓
✓
✓
cnvLikelihoodRatio
✓
✓
cnvMosaicLength
✓
cnvQual
✓
✓
✓
✓
✓
✓
dinucQual
✓
✓
highCN
✓
lengthDegenerate
✓
✓
segmentMean
✓
✓
SqQual
✓
FILTER description
Available FILTERs:
binCount
Filters CNV events with a bin count lower than a threshold.
cnvBinSupportRatio
Indicates that, for CNVs greater than 80 kb, the percent span of supporting target intervals is lower than a threshold.
cnvCopyRatio
Indicates that the segment mean of the CNV is not far enough from copy neutral.
cnvHetLength
Indicates that a HET call below a certain length has been filtered as a candidate false positive.
cnvLength
Indicates that the length of the CNV is lower than a threshold.
cnvLikelihoodRatio
Indicates that the log10 likelihood ratio of ALT to REF is less than a threshold.
cnvMosaicLength
Indicates that a MOSAIC call below a certain length has been filtered as a candidate false positive.
cnvQual
Indicates that the QUAL of the CNV is lower than a threshold.
dinucQual
Applied based on the percentage of bases in a segment that belong to a two-base set (GC, CT, or AC), determined by individual occurrences. A CNV call is filtered out if any of these percentages falls outside typical ranges, indicating a likely false positive.
highCN
Indicates a CNV call with an implausible copy number (>6).
lengthDegenerate
Marks records as non-PASS based on each record's length (REFLEN) when the caller returns the default model. Segments shorter than 1 Mb are assigned this filter when returning the default model.
segmentMean
Marks records as non-PASS based on each record's segment mean (SM) when the caller returns the default model. DEL or DUP segments with insufficient SM are assigned this filter when returning the default model.
SqQual
Marks records as non-PASS based on each record's somatic quality (SQ) when the caller returns the default model. Segments with insufficient SQ are assigned this filter when returning the default model. SQ is a Phred-scaled score of the p-value from a two-sample t-test comparing normalized counts of CASE vs PON.
The INFO column contains information representing the event.
REFLEN indicates the length of the event.
SVLEN indicates the length of the event and is only present for non-REF records. Note: if the legacy DRAGEN VCF format (VCF v4.2) has been enabled with --cnv-enable-legacy-vcf-format, SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion).
SVTYPE is always CNV and is only present for non-REF records.
END indicates the end position of the event (1-based, inclusive).
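As a reading aid for the POS and END conventions described above, the following Python sketch derives the span of an event from a record's POS and INFO columns. It assumes that, because POS is the padding base and END is inclusive, the event covers POS+1 through END; whether REFLEN always equals END - POS in DRAGEN output is an assumption, and the record values are hypothetical.

```python
def cnv_span(pos: int, info: str) -> tuple:
    """Return (start, end, length) for a symbolic-allele CNV record.

    pos  -- VCF POS column (1-based coordinate of the padding base)
    info -- raw INFO column, for example "END=2000;REFLEN=1000"
    """
    fields = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        fields[key] = value
    end = int(fields["END"])
    start = pos + 1       # first base of the event, after the padding base
    length = end - pos    # equal to end - start + 1
    return start, end, length


# Hypothetical record values, not taken from real DRAGEN output.
start, end, length = cnv_span(1000, "END=2000;REFLEN=1000;SVTYPE=CNV")
```

Comparing the computed length against the record's REFLEN value is a quick consistency check when writing a custom parser.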
Germline CNV includes the following INFO fields:
GCP
Percentage of bases that are G or C
CTP
Percentage of bases that are C or T
ACP
Percentage of bases that are A or C
If using a segment BED file, the segment identifier is carried over from the input to the SEGID field.
The common FORMAT fields are described in the header:
GT
Genotype
SM
Linear copy ratio of the segment mean
CN
Estimated copy number
BC
Number of bins in the region
PE
Number of improperly paired end reads at start and stop breakpoints
Germline WES CNV includes the following FORMAT fields:
LR
Log10 likelihood ratio of ALT to REF
Allele-Specific CN callers (e.g., Germline WGS ASCN and Somatic WGS/WES ASCN) include the following FORMAT fields:
AS
Number of allelic read count sites
BC
Number of read count bins
CN
Estimated total copy number of sample (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.
CNF
Floating point estimate of copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.
CNQ
Exact total copy number Q-score. This field is not present if the model cannot be estimated with high confidence.
MAF
Estimate for the minor allele frequency
MCN
Estimated minor-haplotype copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.
MCNF
Floating point estimate of minor-haplotype copy number (for the tumor fraction in Somatic callers). This field is not present if the model cannot be estimated with high confidence.
MCNQ
Minor copy number Q-score. This field is not present if the model cannot be estimated with high confidence.
OBF
Per-segment Outlier BAF Fraction. Percentage of BAF counts which are considered "outlier" with respect to the chosen segment call. Higher values might indicate segments where BAF counts are problematic.
SD
Best estimate of segment's bias-corrected read count
Somatic ASCN (WGS/WES) CNV also includes the following FORMAT fields:
NCN
Normal-sample copy number. The field is only present in germline-aware mode.
SCND
Difference between CN and GCN. The field is only present in germline-aware mode.
Somatic WES CNV without ASCN support provides only the common FORMAT fields and does not include the CN entry, because it does not estimate the tumor purity fraction and therefore cannot estimate the copy number.
Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype (GT) field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:
Region ploidy | ALT | CN | GT
Diploid | . | 2 | ./.
Diploid | <DUP> | >2 | ./1
Diploid | <DEL> | 1 | 0/1
Diploid | <DEL> | 0 | 1/1
Haploid | . | 1 | 0
Haploid | <DUP> | >1 | 1
Haploid | <DEL> | 0 | 1
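The GT examples above can be restated as a small lookup, which may help when validating parsed records. This Python sketch only encodes the rows listed above; the function name is hypothetical and it does not cover ASCN-specific genotypes such as those used for AOH/LOH events.

```python
def expected_gt(region_ploidy: int, copy_number: int) -> str:
    """Return the GT string listed above for a given ploidy and CN."""
    if copy_number < 0:
        raise ValueError("copy number cannot be negative")
    if region_ploidy == 2:
        if copy_number == 0:
            return "1/1"   # both copies deleted
        if copy_number == 1:
            return "0/1"   # one copy deleted
        if copy_number == 2:
            return "./."   # copy neutral, haplotypes unknown
        return "./1"       # gain, unknown which haplotype carries it
    if region_ploidy == 1:
        if copy_number == 1:
            return "0"     # copy neutral on a haploid region
        return "1"         # loss (CN=0) or gain (CN>1)
    raise ValueError("only haploid and diploid regions are listed")


# Example: a diploid duplication with CN=3.
gt_for_gain = expected_gt(2, 3)
```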
The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity metric is present in the VCF header. The metric is only available for germline samples. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.
A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise and could be considered poor quality. The CoverageUniformity metric also depends on factors other than sample quality, such as the cnv-interval-width setting and sample mean coverage. DRAGEN recommends using this score only to compare the quality of samples with similar mean coverage that were run with the same command-line options. Because of this, DRAGEN CNV only provides the metric and does not take any action based on it.
File extension: *.cyto.vcf.gz
The Cytogenetics modality output has a similar format to the standard CNV VCF (*cnv.vcf.gz). The differences are listed below:
Records can have the INFO/RES field, which indicates the resolution(s) associated with the record.
Records can have the INFO/SEGID field. This field indicates either custom predefined segments supplied as input by the user (similar to the standard CNV VCF), or Cytogenetics-specific predefined segments, which are typically whole-arm or whole-chromosome segments automatically injected during caller execution. In the latter case, the annotation field indicates the ID or name of the arm or chromosome.
The VCF header is annotated with ##source=DRAGEN_CYTO to indicate that the file was generated by the Cytogenetics modality.
DRAGEN CNV outputs metrics in CSV format. The output follows the general convention for QC metrics reporting in DRAGEN. The CNV metrics are output to a file with a *.cnv_metrics.csv file extension. The following list summarizes the metrics that are output from a CNV run.
Sex Genotyper:
The estimated sex of the case sample and of all panel of normals samples is reported. For WGS workflows, the estimated sex karyotype is reported; for non-WGS workflows, the gender is reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.
CNV Summary:
Bases in reference genome in use
Average alignment coverage over genome - The average alignment coverage over the genome is calculated by dividing the total number of bases from processed alignment records (excluding those filtered by the Target Counts stage in DRAGEN CNV) by the genome length. Alignment records are filtered taking into consideration duplicate marking status (if available), MAPQ, and mapping status.
Number of alignment records processed
Number of filtered records (total)
Number of filtered records (due to duplicates)
Number of filtered records (due to MAPQ)
Number of filtered records (due to being unmapped)
PMAD - Pairwise Median Absolute Deviation measures the variation in read coverage between adjacent bins. It measures variability due to various factors, such as DNA degradation, extraction, amplification, or library preparation. Higher values indicate noisier sample data. PMAD is calculated as follows:
Define a vector v[i] as the normalized counts of the i-th interval in log scale, and d[i] as the pairwise difference of consecutive normalized counts between intervals i and i+1, i.e., d[i] = v[i] - v[i+1].
PMAD is the median absolute deviation of d, i.e., PMAD = Median(|d[i] - Median(d)|).
Coverage MAD - Median absolute deviation of normalized case counts. Higher values indicate noisier sample data.
Median Bin Count - Median of raw counts normalized by interval size.
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications
Number of deletions
Number of PASS amplifications
Number of PASS deletions
Post-Normalization Bin Count Sigma - Standard deviation of post-PoN-normalization median-normalized coverage values.
Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Post-Normalization Bin Count Sigma is only printed when PoN normalization has been applied.
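The PMAD definition in the list above translates directly into a few lines of Python. This is a sketch of the stated formula only; the function name is hypothetical, and details such as the log base or how DRAGEN handles contig boundaries are not covered.

```python
from statistics import median
from typing import Sequence


def pmad(log_counts: Sequence[float]) -> float:
    """Pairwise median absolute deviation of log-scale normalized counts.

    d[i] = v[i] - v[i+1], then PMAD = Median(|d[i] - Median(d)|).
    """
    if len(log_counts) < 2:
        raise ValueError("need at least two intervals")
    diffs = [log_counts[i] - log_counts[i + 1]
             for i in range(len(log_counts) - 1)]
    center = median(diffs)
    return median([abs(d - center) for d in diffs])


# Toy values chosen so the arithmetic is easy to follow by hand:
# d = [-1, 2, -3, 2], Median(d) = 0.5, |d - 0.5| = [1.5, 1.5, 3.5, 1.5]
example_pmad = pmad([0, 1, -1, 2, 0])
```

Because the statistic is built from differences of adjacent bins, a long CNV event shifts only the two differences at its breakpoints, so PMAD mostly reflects bin-to-bin noise.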
Intermediate stages of the pipeline produce various intermediate output files. These files can be useful for visualizing the evidence or results from each stage, and may aid in fine-tuning options.
All files have a structure similar to a BED file with optional header line(s).
The file *.target.counts.gz is a compressed tab-delimited text file that contains the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid file, which is normalized to the normal ploidy level of 2 instead of raw counts.
It has the following columns:
Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval
Header lines that start with # are also included. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
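A minimal Python sketch for reading a file with the six columns listed above might look like the following. The names are hypothetical, the sample lines are invented for illustration, and skipping a non-numeric column-title row is an assumption about the file layout rather than documented behavior.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class TargetCount(NamedTuple):
    contig: str
    start: int
    end: int
    name: str
    count: float
    improper_pairs: float


def parse_target_counts(lines: Iterable[str]) -> Iterator[TargetCount]:
    """Yield one record per data line, skipping '#' header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 6 or not cols[1].isdigit():
            continue  # assumed column-title row or malformed line
        yield TargetCount(cols[0], int(cols[1]), int(cols[2]), cols[3],
                          float(cols[4]), float(cols[5]))


def read_target_counts(path: str) -> list:
    """Read a gzip-compressed counts file from disk."""
    with gzip.open(path, "rt") as handle:
        return list(parse_target_counts(handle))


# Invented lines for illustration only, not real DRAGEN output.
example_lines = [
    "#header line with version and command line",
    "contig\tstart\tstop\tname\tsample\timproper_pairs",
    "chr1\t0\t1000\tinterval-1\t120\t3",
    "chr1\t1000\t2000\tinterval-2\t98.5\t0",
]
example_records = list(parse_target_counts(example_lines))
```

The same parser should apply to any file described as having an equivalent format, since only the meaning of the fifth column changes.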
An example of a *.target.counts.gz
file is shown below.
In germline ASCN runs, B-allele counts are calculated at bi-allelic sites taken from a collection of high-frequency SNVs in the population. In somatic ASCN runs, B-allele counts are calculated at sites in the tumor sample where the normal sample is likely to be heterozygous. When analyzed in conjunction with a matched normal sample, the sites are those that are called as heterozygous SNVs in the normal sample. When analyzed in tumor-only mode, sites are selected from a population collection (similar to germline ASCN runs). Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the sample supporting each of these alleles is counted.
B-allele counts are written both to the gzipped tsv file *.ballele.counts.gz and to the gzipped bedgraph file *.baf.bedgraph.gz.
The tsv file format is the following:
Contig identifier
Start, BED-style (zero-based inclusive) start position of the reference allele
Stop, BED-style (one-based inclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele
Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:
Population frequency for the first allele
Population frequency for the second allele
An example of B-allele counts file is provided below:
The bedgraph file format is similar to the BED format and it has the following columns:
Contig identifier
Start
Stop
Ratio of allele counts
The numerator and denominator of the ratio are determined by sorting the allele counts according to the priority of the corresponding bases. The order of the bases in descending priority is {A, T, G, C}.
When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:
When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:
By prioritizing the bases in this way, the output frequencies will be deterministically distributed in a roughly equal proportion above and below 0.5. When plotting these B-allele frequencies (e.g., in IGV), this gives an easy way to visually determine significant changes in b-allele frequency between neighboring segments of the genome. It also provides a similar visualization to that typically used for array data.
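The base-priority rule can be sketched in Python as follows. The exact formulas are not reproduced in this text, so the sketch assumes the output frequency is the count of the higher-priority allele divided by the sum of both counts; treat that as an illustrative assumption, not the documented formula. The function and constant names are hypothetical.

```python
from typing import Optional

# Descending priority {A, T, G, C}: a lower rank means a higher priority.
BASE_PRIORITY = {"A": 0, "T": 1, "G": 2, "C": 3}


def bedgraph_frequency(allele1: str, count1: int,
                       allele2: str, count2: int) -> Optional[float]:
    """Return the allele-count ratio for one single-base site.

    ASSUMPTION: the numerator is the count of the higher-priority base and
    the denominator is the total of both counts.
    """
    total = count1 + count2
    if total == 0:
        return None  # no reads, so no ratio can be formed
    if BASE_PRIORITY[allele1] < BASE_PRIORITY[allele2]:
        return count1 / total
    return count2 / total


# The same read counts land above or below 0.5 depending on the bases.
freq_a_first = bedgraph_frequency("A", 30, "C", 10)
freq_c_first = bedgraph_frequency("C", 30, "A", 10)
```

Under this assumption, which side of 0.5 a site falls on depends only on the bases involved, which is what makes the plotted band symmetric and reproducible.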
An example of the bedgraph file is shown below:
The file *.target.counts.gc-corrected.gz contains the number of GC-corrected read counts per target interval. The format is equivalent to the *target.counts.gz file:
Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval
Header lines that start with # are also included. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
An example of a *.target.counts.gc-corrected.gz file is shown below.
The file *.combined.counts.txt.gz is a column-wise concatenation of the individual *.target.counts.gz and *.target.counts.gc-corrected.gz files used to form the panel of normals.
The file *.tn.tsv.gz contains the normalized signal of the case sample, per target interval, i.e., the log2-normalized copy ratio signal. A strong signal deviation from 0.0 indicates a potential CNV event. The format is equivalent to the *target.counts.gz file:
Contig identifier
Start position
End position
Target interval name
Log2-normalized read counts in this interval
Count of improperly paired alignments in this interval
Header lines that start with # are also included. In some cases, the normalization counts could be patched internally with intervals from other processes, such as the SegDups extension. In such cases, patches are indicated (sorted in order of application) with header lines starting with #patch, and the original (unpatched) *.tn.tsv.gz is renamed to *.tn.unpatched.tsv.gz. Note: this file is reported in output for inspection, but most use cases will use the (patched) *.tn.tsv.gz file downstream of normalization.
An example of a *.tn.tsv.gz file is shown below.
File extension: *.seg, *.seg.called, *.seg.called.merged
Files containing the segments produced by the segmentation algorithm. The Segment_Mean value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential CNV event.
The *.seg file has the following columns:
Sample name
Contig identifier
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment
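The six columns above map naturally onto a small record type. The Python sketch below parses one line and converts the linear copy ratio to log2 scale, so segments can be compared with the log2-normalized per-interval signal. The names are hypothetical, and treating a non-numeric start column as a title row is an assumption about the file layout.

```python
import math
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    sample: str
    contig: str
    start: int
    end: int
    num_intervals: int
    segment_mean: float  # linear copy ratio; 1.0 means copy neutral


def parse_seg_line(line: str) -> Optional[Segment]:
    """Parse one tab-delimited segment line, or return None for non-data rows."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 6 or not cols[2].isdigit():
        return None  # assumed title row, comment, or malformed line
    return Segment(cols[0], cols[1], int(cols[2]), int(cols[3]),
                   int(cols[4]), float(cols[5]))


def log2_ratio(segment: Segment) -> float:
    """Convert the linear copy ratio to log2 scale (0.0 means copy neutral)."""
    if segment.segment_mean <= 0:
        return float("-inf")
    return math.log2(segment.segment_mean)


# Invented line for illustration: a segment at half the median coverage.
example_segment = parse_seg_line("sampleA\tchr1\t1000\t5000\t4\t0.5")
example_log2 = log2_ratio(example_segment)
```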
An example of a *.seg file is shown below.
The *.seg.called file is identical to the *.seg file, with an additional column indicating the initial call for whether the segment is a duplication (+) or a deletion (-).
The *.seg.called.merged file is identical to the *.seg.called file, but with segments potentially merged when they meet internal merging criteria. In addition to the columns described above, this file also has the following columns:
QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count
In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has the suffix *.baf.seg and has the same format as the *.seg file, with two modifications. First, the Segment_Mean value is the mean over B-allele loci of the smaller observed allele fraction. Second, there is an additional column:
BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or "." when the BAF data are too variable to estimate a minor-allele fraction.
An example of segmentation output file is shown below:
In somatic ASCN callers, the file *.cnv.purity.coverage.models.tsv describes the different tested models and their log-likelihoods. It has the following columns:
Model purity (Cellularity)
Model diploid coverage
Model log-likelihood
An example is shown below:
In the germline WGS ASCN caller, the file *.cnv.coverage.models.tsv serves the same purpose. However, because germline analysis has no concept of tumor purity, the first column is set to the default value of 1.
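For a first look at a models file, the rows can be ranked by log-likelihood. The Python sketch below assumes three tab-separated numeric columns in the order listed above and skips any non-numeric rows; the function name and the values are hypothetical. The caller's final model choice may involve criteria beyond the highest log-likelihood (see ModelSource), so this is an inspection aid only.

```python
import csv
import io
from typing import Optional, Tuple


def best_model(tsv_text: str) -> Optional[Tuple[float, float, float]]:
    """Return (purity, diploid_coverage, log_likelihood) with the highest
    log-likelihood, or None if no numeric rows are found."""
    best = None
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 3:
            continue
        try:
            purity, coverage, loglik = (float(value) for value in row[:3])
        except ValueError:
            continue  # title row or other non-numeric line
        if best is None or loglik > best[2]:
            best = (purity, coverage, loglik)
    return best


# Invented values for illustration only.
example_models = (
    "purity\tdiploid_coverage\tlog_likelihood\n"
    "0.40\t95.0\t-1520.5\n"
    "0.65\t101.0\t-1480.25\n"
    "0.90\t110.0\t-1700.0\n"
)
top_model = best_model(example_models)
```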
To generate additional equivalent bigWig and gff files, set the --cnv-enable-tracks option to true. These files can be loaded into IGV along with other tracks that are available, such as RefSeq genes. Using these tracks alongside publicly available tracks allows for easier interpretation of calls. DRAGEN autogenerates an IGV session XML file if tracks are generated by DRAGEN CNV. The *.cnv.igv_session.xml file can be loaded directly into IGV for analysis.
The following IGV tracks are automatically populated in the output IGV session file:
*.target.counts.bw: BigWig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are GC-corrected if GC correction was performed.
*.improper_pairs.bw: BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw: BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw: BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.seg.bw: BigWig representation of the BAF segments (if available). Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz: BED graph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3: GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events are light gray. If REF events are enabled, they show as green. When the caller can call AOH/LOH events, they show as magenta. An example of DRAGEN CNV gff3 is shown below (different CNV workflows might output different attributes in the 9th column):
For somatic WGS analyses, the following additional files are included in the IGV session XML:
*.tumor.baf.bedgraph.gz: Bedgraph representation of the B-allele frequencies. Setting the track view in IGV to points and the windowing function to none is recommended.
File extension: *.igv_session.xml
The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, by comparing the name of the --ref-dir specified on the command line. Standard UCSC human reference genomes are autodetected, but any variations from the standard reference genomes might not be autodetected. To edit the genome detection, alter the genome attribute in the Session element to the reference genome you would like for analysis before loading into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example edited session file.
Note that depending on the IGV version installed, it may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed, so a user may have to edit the genome field in the XML file directly. For example, IGV has traditionally packaged a b37 reference genome, but may also include a 1kg_v37 or a 1kg_b37+decoy, which appear on the IGV user interface as "1kg, b37" or "1kg, b37+decoy" respectively.
You can determine the correct encoding of a reference genome by going to File > Save Session... and then inspecting the generated igv_session.xml file.
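Editing the genome attribute can also be scripted. The Python sketch below uses the standard library XML parser and assumes, as described above, that the root element is Session and carries a genome attribute. The function name is hypothetical, and the example XML is a stripped-down stand-in, not a real DRAGEN session file.

```python
import xml.etree.ElementTree as ET


def set_session_genome(xml_text: str, genome: str) -> str:
    """Return the session XML with the Session element's genome replaced."""
    root = ET.fromstring(xml_text)
    if root.tag != "Session":
        raise ValueError("expected a Session root element, got " + root.tag)
    root.set("genome", genome)
    return ET.tostring(root, encoding="unicode")


# Stand-in session content for illustration only.
example_session = '<Session genome="hg19"><Resources /></Session>'
updated_session = set_session_genome(example_session, "hg38")
```

For a file on disk, the same change can be made with ET.parse(path), followed by getroot().set("genome", ...) and write(path). The value to use is whatever identifier the installed IGV writes when saving a session.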
DRAGEN CNV outputs can be ingested using third-party libraries in commonly used languages such as Python and R. The typically used files are:
*.target.counts.gz or *.target.counts.gc-corrected.gz, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.
In all previously specified files, the format is similar to BED, allowing them to be loaded as any other tab-separated files.
A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3 output file. Some examples of DRAGEN output GFF3 are shown below:
Germline WGS
Somatic WGS
From the output GFF3, the typical steps are to parse the coordinates of each segment and its CopyNumber annotation (or any other annotation the user might want to plot), and to plot them using the libraries listed previously for coverage/BAF profiles (or any other library and language of the user's choice).
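The parsing step can be sketched in Python using only the generic GFF3 layout (nine tab-separated columns, with key=value attributes separated by semicolons in the ninth). The function name is hypothetical, and the example line is invented, including its source and type columns; only the CopyNumber attribute name comes from the text above.

```python
from typing import Optional
from urllib.parse import unquote


def parse_gff3_line(line: str) -> Optional[dict]:
    """Parse one GFF3 feature line into coordinates and attributes."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return None
    attributes = {}
    for entry in cols[8].split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            attributes[key] = unquote(value)  # GFF3 percent-encodes values
    return {
        "contig": cols[0],
        "start": int(cols[3]),  # GFF3 coordinates are 1-based, inclusive
        "end": int(cols[4]),
        "attributes": attributes,
    }


# Invented feature line for illustration only.
example_feature = parse_gff3_line(
    "chr1\tDRAGEN\tDUP\t1001\t2000\t.\t.\t.\tID=example;CopyNumber=3"
)
example_copy_number = int(example_feature["attributes"]["CopyNumber"])
```

Each parsed segment can then be drawn as a horizontal line from start to end at the height given by its copy number.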
To improve accuracy, the DRAGEN CNV Pipeline excludes genomic intervals if one or more of the target intervals failed at least one quality requirement. The excluded intervals are reported in the *.cnv.excluded_intervals.bed.gz file. The file is in BED format, identifies the regions of the genome that are not callable for CNV analysis, and describes the reason each interval was excluded in the fourth column. The following are the possible reasons for exclusion.
Exclusion reason | Description | Related option
NON_KMER_UNIQUE | Non-unique k-mer bases are larger than 50% of interval. | Not applicable. This reason only applies to self-normalization mode.
EXCLUDE_BED | Interval overlaps with exclude BED larger than threshold. | --cnv-exclude-bed-min-overlap
PON_MAX_PERCENT_ZERO_SAMPLES | Number of PON samples with 0 coverage is larger than threshold. | --cnv-max-percent-zero-samples
PON_TARGET_FACTOR_THRESHOLD | Median coverage of interval is lower than threshold of overall median coverage. | --cnv-target-factor-threshold
PON_MISSING_INTERVAL | Target interval not found in PON. | Not applicable
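To see how much of the genome each reason removes, the fourth column can be tallied by interval length. The Python sketch below assumes plain BED coordinates (0-based start, exclusive end) and a single reason string per line; the function name is hypothetical and the example lines are invented.

```python
from collections import Counter
from typing import Iterable


def excluded_bases_by_reason(lines: Iterable[str]) -> Counter:
    """Sum excluded bases per reason from BED-formatted lines."""
    totals = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 4 or not cols[1].isdigit():
            continue  # title row or malformed line
        totals[cols[3]] += int(cols[2]) - int(cols[1])  # BED length
    return totals


# Invented lines for illustration only.
example_bed = [
    "chr1\t0\t1000\tEXCLUDE_BED",
    "chr1\t5000\t5500\tPON_MISSING_INTERVAL",
    "chr2\t100\t600\tEXCLUDE_BED",
]
example_totals = excluded_bases_by_reason(example_bed)
```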
An example of a *.cnv.excluded_intervals.bed.gz file is shown below:
To improve accuracy, the DRAGEN CNV Pipeline excludes panel of normals samples that failed at least one quality requirement. The excluded samples are reported in the *.cnv.excluded_samples.txt.gz file. The file is in TSV (tab-separated) format, identifies the excluded panel of normals samples, and describes the reason. The following are the possible reasons for exclusion.
Exclusion reason | Description | Related option
PON_SAMPLE_NAME_EQUAL_TO_CASE | PON sample name is equal to case sample name | NA
PON_SAMPLE_CORRELATION_EQUAL_TO_CASE | PON sample counts are equal to case sample counts | NA
PON_MAX_PERCENT_NAN_SAMPLES | Number of NaN values in sample is higher than threshold | --cnv-max-percent-nan-samples (default=50)
MAX_PERCENT_ZERO_TARGETS | Number of 0 target counts in sample is higher than threshold | --cnv-max-percent-zero-targets (default=5)
EXTREME_PERCENTILE:UPPER | Median coverage of sample is higher than threshold | --cnv-extreme-percentile (default=2.5)
EXTREME_PERCENTILE:LOWER | Median coverage of sample is lower than threshold | --cnv-extreme-percentile (default=2.5)
An example of a *.cnv.excluded_samples.txt.gz file is shown below:
The excluded samples output file may not exist if there are no excluded samples.
The DRAGEN CNV Pipeline generates the PON Metrics File (.cnv.pon_metrics.tsv.gz) if a Panel of Normals is provided and --cnv-generate-pon-metric-file is set to true. If the PON size is less than 2, an empty file is generated.
The PON Metric File includes basic statistics of the coverage profile for each interval. To remove sample coverage bias, DRAGEN applies sample median normalization, and then computes the following metrics:
Column | Name | Description
1 | contig | chromosome name
2 | start | genomic locus of interval start
3 | stop | genomic locus of interval stop
4 | name | interval name
5 | mean | average coverage depth
6 | std | standard deviation
7 | normalizedStd | normalized standard deviation (std/mean)
8 | min | minimum
9 | 25% | 25th percentile
10 | 50% | median
11 | 75% | 75th percentile
12 | max | maximum
13 | intervalSize | interval size (stop-start)
14 | gcContents | percent GC
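The per-interval statistics can be illustrated with a short Python sketch: each sample is divided by its own median, then the statistics are taken across samples for one interval. The function name and the counts are hypothetical. Using the population standard deviation is an assumption, since the text does not say which estimator DRAGEN uses, and the 25% and 75% columns are omitted because percentile conventions differ between libraries.

```python
from statistics import mean, median, pstdev
from typing import Dict, List


def interval_stats(pon_counts: List[List[float]], index: int) -> Dict[str, float]:
    """Compute statistics for one interval across panel samples.

    pon_counts holds one list of per-interval counts per sample.
    """
    normalized = []
    for sample in pon_counts:
        sample_median = median(sample)
        if sample_median == 0:
            continue  # cannot normalize a sample with zero median coverage
        normalized.append(sample[index] / sample_median)
    avg = mean(normalized)
    std = pstdev(normalized)  # ASSUMPTION: population standard deviation
    return {
        "mean": avg,
        "std": std,
        "normalizedStd": std / avg if avg else float("nan"),
        "min": min(normalized),
        "50%": median(normalized),
        "max": max(normalized),
    }


# Invented counts: three samples, three intervals each.
example_pon = [[10, 20, 30], [20, 40, 80], [30, 60, 90]]
stats_first = interval_stats(example_pon, 0)
stats_last = interval_stats(example_pon, 2)
```

In this toy panel the first interval is identical across samples after normalization (std of 0), while the last one varies, which is the kind of difference normalizedStd is meant to expose.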
Example:
The DRAGEN CNV Pipeline generates the PON Correlation File (.cnv.pon_correlation.txt.gz) if a Panel of Normals is provided. The PON Correlation File includes the correlation between the CASE sample and each PON sample.
Example:
The SegDups extension provides intermediate and final outputs. All intervals follow the bed format (0-based, start inclusive, end exclusive) and they are in tab-delimited text files (gzip compressed).
The final output has the extension .cnv.segdups.rescued_intervals.tsv.gz and contains the rescued target intervals, which can then be injected before segmentation. It has the following columns:
Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)
The joint normalized coverage profile (log2-scale) for each region is written to the file .cnv.segdups.joint_coverage.tsv.gz with the following columns:
Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)
The differentiating sites' data is written to the file .cnv.segdups.site_ratios.tsv.gz with the following columns:
Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site
The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header. Note: different workflows (e.g., germline WGS vs germline WGS ASCN) do not share the same underlying model and provide different QUAL score distributions. It is recommended to compare QUAL scores only within results from the same workflow. More details are available on and .
In Germline WGS (ASCN) CNV the MOSAIC
tag identifies mosaic calls. In Somatic CNV the HET
tag identifies subclonal calls. See for more details.
When matching CNV with SV output, additional INFO annotations are added. See .
See for more details.
When the Cytogenetics Modality is enabled, DRAGEN CNV produces an additional IGV session xml *.cyto.igv_session.xml
shown below. Please see for a description of the different tracks on this file.
Using R, a good starting point is the karyoploteR package. The main workflow involves reading the *.target.counts.gz file as an R dataframe, converting it to a GRanges object, and then plotting the target intervals as points with karyoploteR. The same workflow can be used to plot the GC-corrected counts, the log2-normalized copy ratios, and the BAF profiles.
Using Python, the workflow is similar to R's: one library converts the DRAGEN output files to dataframes, and another plots coverage and BAF profiles across the genome.