CNV ASCN module

Selecting a diploid coverage level is a key component of an allele-specific copy number (ASCN) caller. In the somatic case, the caller also needs to identify the most likely tumor purity. DRAGEN CNV ASCN callers use a grid-search approach that evaluates many candidate models to attempt to fit the observed read and b-allele counts across all segments in the input sample. A log likelihood score is emitted for each candidate, and all scores are output (in *.cnv.coverage.models.tsv or *.cnv.purity.coverage.models.tsv, respectively for germline or somatic workflows). The caller chooses the model with the highest log likelihood and then computes several measures of model confidence based on the relative likelihood of the chosen model compared to alternative models.

Note: if BAF data is not sufficient it might be discarded during model estimation, leading to a model based on coverage depth only. In such case, the model will not be able to detect alterations that cannot be easily identified without BAF (e.g., whole-genome trisomy).

Somatic-specific extensions

Default purity/ploidy model

If the confidence in the chosen model is low, the caller returns the default model with estimated tumor purity set to NA. This can be identified on the output VCF header lines:

##fileformat=VCFv4.4
##ModelSource=DEGENERATE_DIPLOID
##EstimatedTumorPurity=NA
##DiploidCoverage=337.000000
##OverallPloidy=2.008310

The default model provides an alternative methodology to identify large somatic alterations (length of at least 1 Mb): records are filtered by this model based on their segment mean value (SM) or, in the case of copy-neutral LOHs, by their minor allele frequency value (MAF). The threshold values for SM used by the caller are estimated automatically considering the variance of the sample, with larger SM thresholds for DUPs when the variance is higher. For MAF values, PASSing copy-neutral LOHs are called when the MAF is below a certain threshold. The user can use alternative threshold values through the --cnv-filter-del-mean, --cnv-filter-dup-mean and --cnv-filter-cnloh-maf parameters.

Finally, when the caller returns the default model, the fields regarding copy number states based on model estimation (i.e., CN, CNF, CNQ, MCN, MCNF, MCNQ) are omitted from the final VCF output. The following is a set of example calls from the final VCF output:

# LOSS - not PASSing since it is a short alteration (< 1Mb)
chr11   125205167       DRAGEN:LOSS:chr11:125205168-125215362   N       <DEL>   322     lengthDegenerate
  END=125215362;REFLEN=10195;SVLEN=10195;SVTYPE=CNV
  GT:SM:SD:MAF:BC:AS:PE:OBF
  0/1:0.506231:170.6:0.337:10:1:143,143:0

# REF - not PASSing (only GAIN/LOSS/LOH are output by the default model)
chr12   11415425        DRAGEN:REF:chr12:11415425-33369045      N       .       1000    segmentMean
  END=33369045;REFLEN=21953621
  GT:SM:SD:MAF:BC:AS:PE:OBF
  0/0:1.005638:338.9:0.5:19315:16563:13,86:0.00398647

# GAIN - PASSing
chr12   33369045        DRAGEN:GAIN:chr12:33369046-34655426     N       <DUP>   1000    PASS
  END=34655426;REFLEN=1286381;SVLEN=1286381;SVTYPE=CNV
  GT:SM:SD:MAF:BC:AS:PE:OBF
  0/1:1.487834:501.4:0.355:1045:723:86,89:0.00414938

# LOH - PASSing
chr13   18785987        DRAGEN:CNLOH:chr13:18785988-114351403   N       <LOH>   1000    PASS
  END=114351403;REFLEN=95565416;SVLEN=95565416;LOHTYPE=CNLOH;SVTYPE=CNV
  GT:SM:SD:MAF:BC:AS:PE:OBF
  1/1:0.996893:866.3:0.139:85604:167539:28,12:0.00247715

Grid search optimization informed by essential regions

In order to improve accuracy on the tumor ploidy model estimation, the somatic WGS CNV caller estimates whether the chosen model calls homozygous deletions on regions that are likely to reduce the overall fitness of cells, which are therefore deemed to be "essential" and under negative selection. In the current literature, recent efforts tried to map such cell-essential genes¹.

The check on essential regions is controlled with --cnv-somatic-enable-lower-ploidy-limit(default true). Default bedfiles describing the essential regions are provided for hg19, GRCh37, hs37d5, GRCh38, but a custom bedfile can also be provided in input through the --cnv-somatic-essential-genes-bed=<BEDFILE_PATH> parameter. In such case, the feature is automatically enabled. A custom essential regions bedfile needs to have the following format: 4-column, tab-separated, where the first 3 columns identify the coordinates of the essential region (chromosome, 0-based start, excluded end). The fourth column is the region id (string type). For the purpose of the algorithm, currently only the first 3 columns are used. However, the fourth might be helpful to investigate manually which regions drove the decisions on model plausibility made by the caller.

If the somatic WGS CNV caller does not find any overlap between any of the homozygous deletions and any of the essential regions, the model is considered plausible and the model optimization ends. Otherwise, when at least an overlap is found, the model is declared invalid and the model search is repeated on the subset of models that support at least one copy (CN = 1) for the essential region with the lowest coverage among the regions overlapping homozygous deletions.

¹E.g., in 2015 - https://www.science.org/doi/10.1126/science.aac7041

The following is an example taken from the output log where this feature is triggered, leading to an additional iteration of model fitting:

==================================================================
Checking for model plausibility
==================================================================
Call overlap found: chr18 65121873 80261473
- Call segment counts: 272.4
- Call length: 15139600
Model considered not plausible

==================================================================
Performing model fitting
==================================================================
Constraint on lowest non plausible HOMDEL coverage: 272.4
Performing grid search on 23867 models

and, an example where this feature is not triggered:

==================================================================
Checking for model plausibility
==================================================================
Plausible model identified

Rejection of models calling large portions of chromosome as CN0 (homozygous deletion)

Large chromosomal events are likely to negatively impact genome stability and cell viability. The option --cnv-somatic-homdel-max-fraction is the maximum allowed fraction for any chromosome that can be called as CN0 (default value: 0.7). If the number of bases on a chromosome are more than this fraction (over the total number of called bases), the weighted average coverage across all HOMDEL segments is taken as the coverage that needs to be at least CN1 for a model to be considered. Model fitting then restarts from the beginning with new constraints (and thus a reduced set of alternative models). This feature can be disabled by setting the parameter to --cnv-somatic-homdel-max-fraction=1, effectively allowing the total number of called bases on each chromosome to be CN0 without rejecting the model.

The following is an example taken from the output log where this feature is triggered:

==================================================================
Checking for model plausibility
==================================================================
Model considered not plausible

Subclonal/Mosaic Calling Mode

DRAGEN uses a subclonal/mosaic calling mode for segments with a copy number that is estimated to be heterogeneous among different cells in the sample. Based on a statistical model, a segment is considered to be heterogeneous when the depths or BAF values in a segment are too far away from what is expected for the closest integer-copy number.

Note, in somatic this setting will only be honored when DRAGEN is able to identify a confident model. When a confident model cannot be identified, the caller will return a default model and this feature will always be disabled (see the Default purity/ploidy model section for more details and nuances of this approach).

When a segment is considered as heterogeneous, the output for the segment is changed as follows.

The MOSAIC (germline) or HET (somatic) tag is added to the INFO field for the segment.
At least one of the CN and MCN values is given as a non-REF value. Specifically, the values are given as the integer values closest to CNF and MCNF. If the integer values would result in a REF call, then at least one of the CN and MCN values is adjusted to the closest non-REF value.
The ID, ALT, and GT fields are set appropriately for the chosen CN and MCN.
The QUAL score reflects confidence that the segment has nonreference copy number in at least a fraction of the sample.
The CNQ and MCNQ values reflect confidence that the assigned CN and MCN values are true in all of the tumor cells, so at least one of the CNQ and MCNQ values is typically less than five.

To turn on this feature, specify either one of these options:

--cnv-enable-mosaic-calling=true (for the germline ASCN workflow, default true)
--cnv-somatic-enable-het-calling=true (for the somatic workflow, default false)

Note: this calling mode can be disabled for alterations smaller than N bases. It is recommended not to change default thresholds. If necessary, however, they can be changed with:

--cnv-filter-mosaic-length=N (for the germline ASCN workflow, default 100000)
--cnv-somatic-filter-het-length=N (for the somatic workflow, default 0)

The following is an example of MOSAIC GAIN call from the germline ASCN model (from *.cnv.vcf.gz output file):

chr2    89673674        DRAGEN:GAIN:chr2:89673675-89851643      N       <DUP>   1000    PASS
  END=89851643;REFLEN=177969;MOSAIC;SVLEN=177969;SVTYPE=CNV
  GT:CN:MCN:CNQ:MCNQ:CNF:MCNF:SM:SD:MAF:BC:AS:PE
  ./1:4:.:0:.:4.32403:.:2.162016:278.9:.:70:0:13,2

The assigned total CN is 4. However, inspecting the CNF annotation (CNF ~ 4.32), we can see that the segment above has a larger deviation from the diploid state with respect to the assigned integer CN state. This can support various hypotheses on the fraction of cells bearing the different CN states, for example:

32% of cells having CN5, 68% of cells having CN4

Allele Specific Copy Number Examples

In addition to assigning total copy number based on depth, ASCN Callers make use of BAFs to call allele specific copy numbers. The following table provides examples for a DUP in a reference-diploid region:

Total Copy Number (CN)

Minor Copy Number (MCN)

ASCN Scenario

2+2

3+1

4+0

*The entry represents a Absence or Loss of Heterozygosity (AOH/LOH) case. The total copy number is still considered a DUP, so the entry is annotated as GAINLOH to distinguish the value from Copy Neutral AOH/LOH (CNLOH), which would be annotated as 2+0.

WGS CNV Smoothing

The segmentation stage might produce adjacent or nearby segments that are assigned the same copy number and have similar depth and BAF data. This segmentation can result in a region with consistent true copy number being fragmented into several pieces. The fragmentation might be undesirable for downstream use of copy number estimates. Also, for some uses, it can be preferable to smooth short segments that would be assigned different copy numbers whether due to a true copy number change or an artifact. To reduce undesirable fragmentation, initial segments can be merged during a postcalling segment smoothing step.

After initial calling, segments shorter than the specified value of --cnv-filter-length are deemed negligible. Among the remaining nonnegligible segments, successive pairs are evaluated for merging. On a trial basis, the ASCN Caller combines two successive segments that are within --cnv-merge-distance (default value of 10000 for WGS Somatic CNV) of one another and have the same CN and MCN assignments, along with any intervening negligible segments into a single segment that is recalled and rescored. If the merged segment receives the same CN and MCN as its constituent nonneglible pieces with a sufficiently high-quality score, the original segments are replaced with the merged segment. The merged segment might be further merged with other initial or merged segments to either side. Merging proceeds until all segment pairs that meet the criteria are merged. Note: in somatic workflows, when the germline CN information is available, and two segments have different germline CN, they will not be merged.

QUAL Model

The ASCN caller uses a model based on diploid coverage (and purity in somatic workflows) from depth of coverage and B-allele frequency.

Given the most likely diploid coverage (and purity in somatic workflows), for each segment, the algorithm calls the most likely copy number state (complete with total copy number CN, and minor allele copy number MCN).

The probability of the REF state is used in input to the scoring algorithm which outputs the QUAL value (a PHRED score capped at 1000). The QUAL value is the PHRED score where the probability of error is the probability of REF when an alteration is called, or the probability of having a non-REF call when the segment should be called REF.

Note: this is different from how QUAL is computed in depth-only callers, e.g., the germline CNV caller.

Comparison with ROH caller

Both the ROH caller and the ASCN module, as used in the Germline WGS ASCN caller, can detect runs-of-homozygosity (ROH) regions.

The two algorithms underlying the two different approaches might occasionally disagree. The differences are due to the following:

The ROH caller requires minor-allele frequency to be ~0. In contrast, the Germline WGS ASCN caller will assign to each segment its most likely copy-number state. This includes MOSAIC alterations, not available in the ROH caller.
The ROH caller is dependent on the small variant caller, and only uses the SNPs that it calls. In contrast, the Germline WGS ASCN caller works with a catalog of SNPs from population variation studies, such as 1000 Genomes.
The ROH caller uses a blacklist bed file to filter certain sites and reduce call fragmentation. In contrast, the Germline ASCN caller does not need to filter any site but provides an alternative smoothing algorithm to reduce call fragmentation, which is agnostic on the sample under consideration.
The ROH caller identifies ROH regions but does not provide the total copy number of the region under consideration. In contrast, the Germline ASCN CNV caller also reports the copy number for the region (which could be different from reference ploidy).

PreviousCNV Output NextCNV with SV Support

Last updated 25 days ago

Was this helpful?