CNV ASCN module
Selecting a diploid coverage level is a key component of an allele-specific copy number (ASCN) caller. In the somatic case, the caller also needs to identify the most likely tumor purity. DRAGEN CNV ASCN callers use a grid-search approach that evaluates many candidate models to attempt to fit the observed read and b-allele counts across all segments in the input sample. A log likelihood score is emitted for each candidate, and all scores are output (in *.cnv.coverage.models.tsv
or *.cnv.purity.coverage.models.tsv
, respectively for germline or somatic workflows). The caller chooses the model with the highest log likelihood and then computes several measures of model confidence based on the relative likelihood of the chosen model compared to alternative models.
Note: if BAF data is not sufficient it might be discarded during model estimation, leading to a model based on coverage depth only. In such case, the model will not be able to detect alterations that cannot be easily identified without BAF (e.g., whole-genome trisomy).
Somatic-specific extensions
Default purity/ploidy model
If the confidence in the chosen model is low, the caller returns the default model with estimated tumor purity set to NA
. The default model provides an alternative methodology to identify large somatic alterations (length of at least 1 Mb): records are filtered by this model based on their segment mean value (SM
) or, in the case of copy-neutral LOHs, by their minor allele frequency value (MAF
). The threshold values for SM
used by the caller are estimated automatically considering the variance of the sample, with larger SM
thresholds for DUPs when the variance is higher. For MAF
values, PASSing copy-neutral LOHs are called when the MAF
is below a certain threshold. The user can use alternative threshold values through the --cnv-filter-del-mean
, --cnv-filter-dup-mean
and --cnv-filter-cnloh-maf
parameters.
Finally, when the caller returns the default model, the fields regarding copy number states based on model estimation (i.e., CN
, CNF
, CNQ
, MCN
, MCNF
, MCNQ
) are omitted from the final VCF output.
Grid search optimization informed by essential regions
In order to improve accuracy on the tumor ploidy model estimation, the somatic WGS CNV caller estimates whether the chosen model calls homozygous deletions on regions that are likely to reduce the overall fitness of cells, which are therefore deemed to be "essential" and under negative selection. In the current literature, recent efforts tried to map such cell-essential genes¹.
The check on essential regions is controlled with --cnv-somatic-enable-lower-ploidy-limit
(default true). Default bedfiles describing the essential regions are provided for hg19, GRCh37, hs37d5, GRCh38, but a custom bedfile can also be provided in input through the --cnv-somatic-essential-genes-bed=<BEDFILE_PATH>
parameter. In such case, the feature is automatically enabled. A custom essential regions bedfile needs to have the following format: 4-column, tab-separated, where the first 3 columns identify the coordinates of the essential region (chromosome, 0-based start, excluded end). The fourth column is the region id (string type). For the purpose of the algorithm, currently only the first 3 columns are used. However, the fourth might be helpful to investigate manually which regions drove the decisions on model plausibility made by the caller.
If the somatic WGS CNV caller does not find any overlap between any of the homozygous deletions and any of the essential regions, the model is considered plausible and the model optimization ends. Otherwise, when at least an overlap is found, the model is declared invalid and the model search is repeated on the subset of models that support at least one copy (CN = 1) for the essential region with the lowest coverage among the regions overlapping homozygous deletions.
¹E.g., in 2015 - https://www.science.org/doi/10.1126/science.aac7041
Rejection of models calling large portions of chromosome as CN0 (homozygous deletion)
Large chromosomal events are likely to negatively impact genome stability and cell viability. The option --cnv-somatic-homdel-max-fraction
is the maximum allowed fraction for any chromosome that can be called as CN0 (default value: 0.7). If the number of bases on a chromosome are more than this fraction (over the total number of called bases), the weighted average coverage across all HOMDEL segments is taken as the coverage that needs to be at least CN1 for a model to be considered. Model fitting then restarts from the beginning with new constraints (and thus a reduced set of alternative models). This feature can be disabled by setting the parameter to --cnv-somatic-homdel-max-fraction=1
, effectively allowing the total number of called bases on each chromosome to be CN0 without rejecting the model.
Subclonal/Mosaic Calling Mode
DRAGEN uses a subclonal/mosaic calling mode for segments with a copy number that is estimated to be heterogeneous among different cells in the sample. Based on a statistical model, a segment is considered to be heterogeneous when the depths or BAF values in a segment are too far away from what is expected for the closest integer-copy number.
When a segment is considered as heterogeneous, the output for the segment is changed as follows.
The MOSAIC (germline) or HET (somatic) tag is added to the INFO field for the segment.
At least one of the CN and MCN values is given as a non-REF value. Specifically, the values are given as the integer values closest to CNF and MCNF. If the integer values would result in a REF call, then at least one of the CN and MCN values is adjusted to the closest non-REF value.
The ID, ALT, and GT fields are set appropriately for the chosen CN and MCN.
The QUAL score reflects confidence that the segment has nonreference copy number in at least a fraction of the sample.
The CNQ and MCNQ values reflect confidence that the assigned CN and MCN values are true in all of the tumor cells, so at least one of the CNQ and MCNQ values is typically less than five.
To turn on this feature, specify either one of these options:
--cnv-enable-mosaic-calling=true
(for the germline workflow)--cnv-somatic-enable-het-calling=true
(for the somatic workflow)
Allele Specific Copy Number Examples
In addition to assigning total copy number based on depth, ASCN Callers make use of BAFs to call allele specific copy numbers. The following table provides examples for a DUP in a reference-diploid region:
4
2
2+2
4
1
3+1
*4
0
4+0
*The entry represents a Absence or Loss of Heterozygosity (AOH/LOH) case. The total copy number is still considered a DUP, so the entry is annotated as GAINLOH
to distinguish the value from Copy Neutral AOH/LOH (CNLOH
), which would be annotated as 2+0.
WGS CNV Smoothing
The segmentation stage might produce adjacent or nearby segments that are assigned the same copy number and have similar depth and BAF data. This segmentation can result in a region with consistent true copy number being fragmented into several pieces. The fragmentation might be undesirable for downstream use of copy number estimates. Also, for some uses, it can be preferable to smooth short segments that would be assigned different copy numbers whether due to a true copy number change or an artifact. To reduce undesirable fragmentation, initial segments can be merged during a postcalling segment smoothing step.
After initial calling, segments shorter than the specified value of --cnv-filter-length
are deemed negligible. Among the remaining nonnegligible segments, successive pairs are evaluated for merging. On a trial basis, the ASCN Caller combines two successive segments that are within --cnv-merge-distance
(default value of 10000 for WGS Somatic CNV) of one another and have the same CN and MCN assignments, along with any intervening negligible segments into a single segment that is recalled and rescored. If the merged segment receives the same CN and MCN as its constituent nonneglible pieces with a sufficiently high-quality score, the original segments are replaced with the merged segment. The merged segment might be further merged with other initial or merged segments to either side. Merging proceeds until all segment pairs that meet the criteria are merged. Note: in somatic workflows, when the germline CN information is available, and two segments have different germline CN, they will not be merged.
QUAL Model
The ASCN caller uses a model based on diploid coverage (and purity in somatic workflows) from depth of coverage and B-allele frequency.
Given the most likely diploid coverage (and purity in somatic workflows), for each segment, the algorithm calls the most likely copy number state (complete with total copy number CN, and minor allele copy number MCN).
The probability of the REF state is used in input to the scoring algorithm which outputs the QUAL value (a PHRED score capped at 1000). The QUAL value is the PHRED score where the probability of error is the probability of REF when an alteration is called, or the probability of having a non-REF call when the segment should be called REF.
Comparison with ROH caller
The two algorithms underlying the two different approaches might occasionally disagree. The differences are due to the following:
The ROH caller requires minor-allele frequency to be ~0. In contrast, the Germline WGS ASCN caller will assign to each segment its most likely copy-number state. This includes MOSAIC alterations, not available in the ROH caller.
The ROH caller is dependent on the small variant caller, and only uses the SNPs that it calls. In contrast, the Germline WGS ASCN caller works with a catalog of SNPs from population variation studies, such as 1000 Genomes.
The ROH caller uses a blacklist bed file to filter certain sites and reduce call fragmentation. In contrast, the Germline ASCN caller does not need to filter any site but provides an alternative smoothing algorithm to reduce call fragmentation, which is agnostic on the sample under consideration.
The ROH caller identifies ROH regions but does not provide the total copy number of the region under consideration. In contrast, the Germline ASCN CNV caller also reports the copy number for the region (which could be different from reference ploidy).
Last updated
Was this helpful?