Multisample CNV Calling

Multisample CNV calling starts from tangent-normalized counts files (*.tn.tsv.gz), each specified with its own --cnv-input option (one per sample). Multisample CNV analysis uses joint segmentation to increase the sensitivity of detection of copy number variable segments. For each copy number variable segment identified, the copy number genotype of each sample is emitted in a single VCF entry to facilitate annotation and interpretation.

Multisample CNV analysis is supported for WGS and WES workflows.

The following is an example command line for running a trio analysis:

dragen \
-r <HASHTABLE> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE> \
--enable-cnv true \
--cnv-input <FATHER_TN_TSV> \
--cnv-input <MOTHER_TN_TSV> \
--cnv-input <PROBAND_TN_TSV> \
--pedigree-file <PEDIGREE_FILE>

De Novo CNV Calling Options

Make sure all input samples have gone through the same single sample workflow and have identical intervals. If the samples are WES inputs, then you must generate the samples using the same panel of normals, and the autosomal intervals for all samples must match.

The following options are used in de novo CNV calling:

--cnv-input For de novo CNV calling, this specifies the input tangent-normalized signal files (*.tn.tsv.gz) from the single sample runs. This option can be specified multiple times, once for each input sample.

--cnv-filter-de-novo-qual Phred-scaled threshold at which a putative event in the proband sample is marked as DeNovo. Default value is 0.125.

--pedigree-file Pedigree file specifying the relationship between the input samples.

Joint Segmentation

First, CNV calling is performed on each sample independently. Joint segmentation then uses the copy number variable segments from each single sample analysis to derive a set of joint copy number variable segments. This set of joint segments is determined simply by taking the union of all breakpoints from the copy number variable segments of all samples. This results in the splitting of any partially overlapping segments across different samples. For example:
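The union-of-breakpoints step can be sketched in a few lines of Python. This is a minimal illustration, not DRAGEN's implementation: the function name `joint_segments`, the use of inclusive `(start, end)` tuples on a single contig, and the coordinates are all assumptions made for the example.

```python
def joint_segments(per_sample_segments):
    """Split segments at the union of all samples' breakpoints.

    per_sample_segments: one list per sample of inclusive (start, end)
    tuples on the same contig (hypothetical representation).
    """
    all_segments = [seg for segs in per_sample_segments for seg in segs]
    # Each segment contributes two cut points: its start and the position
    # just past its end.
    cuts = sorted({c for s, e in all_segments for c in (s, e + 1)})
    joint = []
    for lo, hi in zip(cuts, cuts[1:]):
        # Keep only intervals covered by at least one original segment.
        if any(s <= lo and hi - 1 <= e for s, e in all_segments):
            joint.append((lo, hi - 1))
    return joint


# Two partially overlapping segments from two samples are split in three.
sample_a = [(100, 500)]
sample_b = [(300, 700)]
print(joint_segments([sample_a, sample_b]))
# [(100, 299), (300, 500), (501, 700)]
```

Regions that are copy number variable in no sample (gaps between segments) are dropped by the coverage check, which is a choice made for this sketch.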

Following joint segmentation, copy number calling is again performed independently on each sample using the joint segments. Segments can be merged as in the single sample analysis, but each joint segment is emitted in the multisample VCF as a single entry. The quality score (QS in the VCF) from the sample's merged segment, if applicable, is used for filtering the call. Sample calls are filtered using the sample's FT field in the multisample VCF. The QUAL column of the multisample VCF is always missing (ie, "."). The FILTER column of the multisample VCF is SampleFT if none of the samples' FT fields are PASS, and PASS if any of the samples' FT fields are PASS.

Note, however, that when a single segment in one sample overlaps multiple segments in another sample, the larger segment annotation is replicated across multiple records, e.g. (only relevant VCF fields are printed below):

DRAGEN:REF:chr22:21917617-22385563	GT:SM:CN:BC:PE:QS:FT	./.:1.01773:2:867:0,0:62:PASS	./.:1.00693:2:379:0,0:61:PASS
DRAGEN:LOSS:chr22:22385564-22549952	GT:SM:CN:BC:PE:QS:FT:GC:CT:AC	./.:1.01773:2:867:0,0:62:PASS	0/1:0.695867:1:135:0,0:7:cnvQual:0.427961:0.493883:0.506859
DRAGEN:LOSS:chr22:22549953-23041393	GT:SM:CN:BC:PE:QS:FT:GC:CT:AC	./.:1.01773:2:867:0,0:62:PASS	0/1:0.614398:1:341:0,0:40:PASS:0.457178:0.499478:0.500493
DRAGEN:LOSS:chr22:23041394-23055519	GT:SM:CN:BC:PE:QS:FT:GC:CT:AC	./.:1.01773:2:867:0,0:62:PASS	0/1:0.31226:1:141:0,0:52:PASS:0.452074:0.492297:0.513473
DRAGEN:LOSS:chr22:23055520-23198595	GT:SM:CN:BC:GC:CT:AC:PE:QS:FT	0/1:0.57652:1:168:0.452278:0.489735:0.514792:0,0:41:PASS	0/1:0.31226:1:141:0.452074:0.492297:0.513473:0,0:52:PASS
DRAGEN:LOSS:chr22:23198596-23241095	GT:SM:CN:BC:GC:CT:AC:PE:QS:FT	0/1:0.57652:1:168:0.452278:0.489735:0.514792:0,0:41:PASS	1/1:0.128:0:39:0.466259:0.483365:0.516541:0,0:42:PASS

The previous can be visualized as:

De Novo Calling Stage

A de novo event is defined as the existence of a genotype at a particular locus in a proband's genome that did not result from standard Mendelian inheritance from the parents. The de novo calling stage identifies putative de novo events in the proband of each trio of a multisample analysis. In some cases, these putative de novo events may be real, but they can also arise from sequencing or analysis artifacts. Consequently, a de novo quality score is assigned to each putative de novo event and used to filter out low-quality de novo events. Trios are defined in a .ped file supplied with the --pedigree-file option. Multiple trios can be specified (eg, quad analysis), and all valid trios are processed.

For each joint segment in a trio, the de novo caller determines if there is a Mendelian inheritance conflict for the called copy number genotypes. The CNV caller does not identify the copy number for each allele of a given diploid segment, which means assumptions are made about the possible allelic composition of the parent genotypes.

The assumption is that the copy number 0 allele is not present for diploid regions of a parent's genome (sex dependent) when the assigned copy number is 2 or greater. This results in simplifications, as follows:

| Parent Copy Number Genotype | Possible Copy Number Alleles | Assumed Possible Copy Number Alleles |
| --- | --- | --- |
| 2 | 0/2, 1/1 | 1/1 |
| 3 | 0/3, 1/2 | 1/2 |
| 4 | 0/4, 1/3, 2/2 | 1/3, 2/2 |
| N | x/(N-x) for x <= N/2 | x/(N-x) for 1 <= x <= N/2 |
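The allele assumption can be written as a small Python helper. This is a sketch only: the name `assumed_allele_pairs` is hypothetical, and the handling of copy numbers 0 and 1 (where the copy number 0 allele is still allowed) is inferred from the stated assumption rather than taken from the documented behavior.

```python
def assumed_allele_pairs(copy_number):
    """Return the assumed allelic compositions x/(N-x) for a diploid region.

    For N >= 2 the copy number 0 allele is excluded, so x starts at 1.
    For N < 2 (an assumption of this sketch) x starts at 0.
    """
    if copy_number < 0:
        raise ValueError("copy number must be non-negative")
    lowest = 1 if copy_number >= 2 else 0
    return [(x, copy_number - x) for x in range(lowest, copy_number // 2 + 1)]


for n in (2, 3, 4):
    print(n, assumed_allele_pairs(n))
# 2 [(1, 1)]
# 3 [(1, 2)]
# 4 [(1, 3), (2, 2)]
```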

The following are examples of consistent and inconsistent copy number genotypes for diploid regions using these assumptions:

| Mother Copy Number | Father Copy Number | Proband Copy Number | Mendelian Consistent? |
| --- | --- | --- | --- |
| | | | Yes |
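One way to reason about consistency under these assumptions is to check whether the proband copy number can be formed from one allele transmitted by each parent. The Python sketch below does that for diploid regions. The function names are hypothetical, and the rule is derived from the stated assumptions; the caller's actual logic (for example on sex chromosomes) may differ.

```python
def transmissible_alleles(parent_cn):
    """Alleles a parent may transmit, given the assumed compositions.

    The copy number 0 allele is excluded when the parent copy number is
    2 or greater; allowing it below 2 is an assumption of this sketch.
    """
    lowest = 1 if parent_cn >= 2 else 0
    return {a for x in range(lowest, parent_cn // 2 + 1)
            for a in (x, parent_cn - x)}


def is_mendelian_consistent(mother_cn, father_cn, proband_cn):
    """True if one maternal plus one paternal allele can sum to proband_cn."""
    return any(m + f == proband_cn
               for m in transmissible_alleles(mother_cn)
               for f in transmissible_alleles(father_cn))


print(is_mendelian_consistent(2, 2, 2))  # True: 1 + 1
print(is_mendelian_consistent(2, 2, 1))  # False: both parents assumed 1/1
print(is_mendelian_consistent(3, 2, 3))  # True: 2 + 1
```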

If a joint segment has a Mendelian inheritance conflict, a Phred-scaled de novo quality score (DQ field in the VCF) is calculated using the likelihoods for each copy number state (see Quality Scoring section) of each sample in the trio, combined with a prior for the trio genotypes:

$DQ = -10 \log_{10} \left( \frac{1-\sum_{C}{p(CN_m \mid data) \cdot p(CN_f \mid data) \cdot p(CN_p \mid data) \cdot p(CN_m,CN_f,CN_p)}}{\sum_{G}{p(CN_m \mid data) \cdot p(CN_f \mid data) \cdot p(CN_p \mid data) \cdot p(CN_m,CN_f,CN_p)}} \right)$

Where

$G$ is the set of all genotypes
$C$ is the set of conflicting genotypes
$CN_m$ is the Mother copy number
$CN_f$ is the Father copy number
$CN_p$ is the Proband copy number
$p(CN_m,CN_f,CN_p)$ is the prior for the trio genotype
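The DQ expression can be evaluated numerically as in the Python sketch below, which follows the formula term by term. Everything around it is an assumption for illustration: the per-sample probabilities are passed as plain dictionaries, the prior and the conflict test are caller-supplied functions, and the flat prior and toy conflict rule in the example are placeholders rather than the values the caller uses.

```python
import math


def de_novo_quality(post_m, post_f, post_p, prior, is_conflict):
    """Evaluate DQ = -10 * log10((1 - sum over C) / (sum over G)).

    post_m, post_f, post_p: dicts mapping copy number -> p(CN | data).
    prior(cn_m, cn_f, cn_p): prior for the trio genotype.
    is_conflict(cn_m, cn_f, cn_p): True for genotypes in the conflict set C.
    """
    sum_all = 0.0
    sum_conflict = 0.0
    for cn_m, p_m in post_m.items():
        for cn_f, p_f in post_f.items():
            for cn_p, p_p in post_p.items():
                weight = p_m * p_f * p_p * prior(cn_m, cn_f, cn_p)
                sum_all += weight
                if is_conflict(cn_m, cn_f, cn_p):
                    sum_conflict += weight
    if sum_all == 0.0:
        raise ValueError("all trio genotypes have zero weight")
    ratio = (1.0 - sum_conflict) / sum_all
    if ratio <= 0.0:
        return math.inf  # no probability mass left outside the conflict set
    return -10.0 * math.log10(ratio)


# Toy inputs: both parents are confidently CN 2, the proband is likely CN 1.
post_m = {2: 1.0}
post_f = {2: 1.0}
post_p = {1: 0.9, 2: 0.1}
flat_prior = lambda m, f, p: 1.0    # placeholder prior
toy_conflict = lambda m, f, p: p != 2  # placeholder conflict rule
dq = de_novo_quality(post_m, post_f, post_p, flat_prior, toy_conflict)
print(round(dq, 3))  # approximately 10
```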

The DN field in the VCF is used to indicate the de novo status for each segment. Possible values are:

Inherited - the called trio genotype is consistent with Mendelian inheritance
LowDQ - the called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold (default 0.125)
DeNovo - the called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold (default 0.125)
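The three DN values map directly onto a small decision function. The Python sketch below restates the rules listed above; the function name and signature are hypothetical, and the default threshold mirrors the documented default of 0.125.

```python
def de_novo_status(is_consistent, dq, threshold=0.125):
    """Return the DN value for a segment of a trio.

    is_consistent: True if the called trio genotype follows Mendelian
    inheritance; dq is only consulted when it does not.
    """
    if is_consistent:
        return "Inherited"
    return "DeNovo" if dq >= threshold else "LowDQ"


print(de_novo_status(True, 0.0))    # Inherited
print(de_novo_status(False, 0.05))  # LowDQ
print(de_novo_status(False, 12.0))  # DeNovo
```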

Multisample CNV VCF Output

The records in a multisample CNV VCF differ slightly from the single sample case. The major differences are as follows:

The per-record entries are broken down into segments defined by the union of the breakpoints of all input samples, which means there are more entries in the overall VCF.

The QUAL column is not used and its value is ".". The per-sample quality is carried over into the SAMPLE columns with the QS tag.

The FILTER column indicates PASS if any of the individual SAMPLE columns PASS. Otherwise, it indicates SampleFT.

The per-sample annotations are carried over from their originating calls. The single sample filters are applied at the sample level and are emitted in the FT annotation.
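The record-level FILTER rule described above reduces to a single check over the per-sample FT values. A minimal Python sketch, with a hypothetical function name:

```python
def record_filter(sample_ft_values):
    """Return the FILTER value for a multisample record.

    PASS if any sample's FT is PASS, otherwise SampleFT.
    """
    return "PASS" if any(ft == "PASS" for ft in sample_ft_values) else "SampleFT"


print(record_filter(["PASS", "cnvQual"]))     # PASS
print(record_filter(["cnvQual", "cnvQual"]))  # SampleFT
```

A consequence of this rule is that a record with FILTER set to PASS can still contain individual samples whose FT is not PASS, so per-sample filtering should read FT rather than FILTER.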

Additionally, if a valid pedigree is used, then de novo calling is performed, which adds the following two annotations to the proband sample.

##FORMAT=<ID=DQ,Number=1,Type=Float,Description="De novo quality">
##FORMAT=<ID=DN,Number=1,Type=String,Description="Possible values are `Inherited', 'DeNovo' or 'LowDQ'. Threshold for a passing de novo call is DQ > 0.125000">

While the VCF contains many entries because of the joint segmentation stage, the de novo events can be found by extracting entries that have DN and DQ annotations. In the de novo calling case, these records are also extracted and converted to GFF3.
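Extracting the de novo entries can be done with a few lines of Python that read the proband's sample column by FORMAT key. This is a sketch using only string handling: the function names, the sample names, and the example records (including the position of DQ and DN within FORMAT) are hypothetical. Pairing keys with values by position also copes with records whose sample column carries fewer values than FORMAT has keys.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's values; extra keys are dropped."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def de_novo_records(vcf_lines, proband):
    """Yield (chrom, record_id, sample_fields) where the proband DN is DeNovo."""
    column = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            column = fields.index(proband)  # locate the proband column
            continue
        if column is None:
            raise ValueError("missing #CHROM header line")
        sample = parse_sample(fields[8], fields[column])
        if sample.get("DN") == "DeNovo":
            yield fields[0], fields[2], sample


# Hypothetical records for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tfather\tmother\tproband",
    "chr1\t1000\tseg1\tN\t<DEL>\t.\tPASS\t.\tGT:CN:FT:DQ:DN\t./.:2:PASS\t./.:2:PASS\t0/1:1:PASS:30:DeNovo",
    "chr1\t5000\tseg2\tN\t<DEL>\t.\tPASS\t.\tGT:CN:FT:DQ:DN\t0/1:1:PASS\t./.:2:PASS\t0/1:1:PASS:.:Inherited",
]
hits = list(de_novo_records(example, "proband"))
print([record_id for _, record_id, _ in hits])  # ['seg1']
```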
