Copy Number Variant Calling
The DRAGEN Copy Number Variant (CNV) Pipeline can call CNV events using next-generation sequencing (NGS) data. This pipeline supports multiple applications in a single interface via the DRAGEN Host Software, including processing of whole-genome sequencing (WGS) data and whole-exome sequencing (WES) data for germline analysis.
The DRAGEN CNV pipeline supports two normalization modes of operation. The two modes apply different normalization techniques to handle biases that differ based on the application, for example, WGS versus WES. While the default option settings attempt to provide the best trade-off in terms of speed and accuracy, a specific workflow may require more finely tuned option settings.
The DRAGEN CNV pipeline follows the workflow shown in the following figure.
DRAGEN CNV Pipeline Workflow
The DRAGEN CNV Pipeline uses many aspects of the DRAGEN secondary analysis available in other pipelines, such as hardware acceleration and efficient I/O processing. To enable CNV processing in the DRAGEN Host Software, set the --enable-cnv command line option to true.
The CNV pipeline has the following processing modules:
Target Counts --- Binning of the read counts and other signals from alignments.
Bias Correction --- Correction of intrinsic system biases.
Normalization --- Detection of normal ploidy levels and normalization of the case sample.
Segmentation --- Breakpoint detection via segmentation of the normalized signal.
Calling / Genotyping --- Thresholding, scoring, qualifying, and filtering of putative events as copy number variants.
The normalization module can optionally take in a panel of normals (PoN), which is useful when cohort or population samples are readily available. Note that PoN normalization is not available for somatic WGS analysis. All other modules are shared between the different CNV algorithms.
The following figures show a high-level overview of the steps in the DRAGEN CNV Pipeline as the signal traverses through the various stages. These figures are examples and are not identical to the plots that are generated from the DRAGEN CNV Pipeline.
The first step in the DRAGEN CNV Pipeline is the target counts stage. The target counts stage extracts signals such as read count and improper pairs and puts them into target intervals.
Read Count Signal
Improper Pairs Signal
Next, the case sample is normalized against the panel of normals or against the estimated normal ploidy level. Any other biases are subtracted out of the signal to amplify any event level signals.
Normalization
The normalized signal is then segmented using one of the available segmentation algorithms. Events are then called from the segments.
Segments
Called Events
The events are then scored and emitted in the output VCF.
The following are the top-level options that are shared with the DRAGEN Host Software to control the CNV pipeline. You can input a BAM or CRAM file into the CNV pipeline. If you are using the DRAGEN mapper and aligner, you can use FASTQ files.
--bam-input
--- The BAM file to be processed.
--cram-input
--- The CRAM file to be processed.
--enable-cnv
--- Enable or disable CNV processing. Set to true to enable CNV processing.
--enable-map-align
--- Enables the mapper and aligner module. The default is true, so all input reads are remapped and aligned unless this option is set to false.
--fastq-file1, --fastq-file2
--- FASTQ file or files to be processed.
--output-directory
--- Output directory where all results are stored.
--output-file-prefix
--- Output file prefix that will be prepended to all result file names.
--ref-dir
--- The DRAGEN reference genome hashtable directory.
The output and filtering options control the CNV output files.
--cnv-exclude-bed
--- Specifies a BED file that indicates the intervals to exclude from the CNV analysis. If a target interval overlaps the regions specified in the exclude BED file by more than cnv-exclude-bed-min-overlap, the target interval is suppressed.
--cnv-exclude-bed-min-overlap
--- Specifies the overlap fraction between a target interval and the excluded regions above which the target interval is filtered. The default is 0.5.
--cnv-enable-ref-calls
--- Emit copy neutral (REF) calls in the output VCF file. The default is true for single WGS CNV analysis.
--cnv-enable-tracks
--- Generate track files that can be imported into IGV for viewing. When this option is enabled, a *.gff file for the output variant calls is generated, as well as *.bw files for the tangent normalized signal. The default is true.
--cnv-filter-bin-support-ratio
--- Filters out a candidate event if the span of supporting bins is less than the specified ratio with respect to the overall event length. This filter only applies to records with a length greater than cnv-filter-bin-support-ratio-min-len. The default ratio is 0.2 (20% support). For example, if a called event has a length of 100,000 bp but the target interval bins that support the call span a total of only 15,000 bp (15,000/100,000 = 0.15), then the event is filtered out. If applied, the record has cnvBinSupportRatio as a filter.
--cnv-filter-bin-support-ratio-min-len
--- Minimum length of a candidate event at which to apply cnv-filter-bin-support-ratio. Currently applied only to germline WGS workflows, with a default value of 80,000 bp.
--cnv-filter-copy-ratio
--- Specifies the minimum copy ratio (CR) threshold value centered about 1.0 at which a reported event is marked as PASS in the output VCF file. The default value is 0.2, which leads to calls with CR between 0.8 and 1.2 being filtered. If applied, the record has cnvCopyRatio as a filter.
--cnv-filter-length
--- Specifies the minimum event length in bases at which a reported event is marked as PASS in the output VCF file. The default is 10000. If applied, the record has cnvLength as a filter.
--cnv-filter-qual
--- Specifies the QUAL value at which a reported event is marked as PASS in the output VCF file. You should adjust the parameter value according to your own application data. If applied, the record has cnvQual as a filter.
--cnv-min-qual
--- Specifies the minimum reported QUAL. The default is 3.
--cnv-max-qual
--- Specifies the maximum reported QUAL. The default is 200.
--cnv-qual-length-scale
--- Specifies the bias weighting factor to adjust QUAL estimates for segments with longer lengths. This is an advanced option and should not need to be modified. The default is 0.9303 (2^-0.1).
--cnv-qual-noise-scale
--- Specifies the bias weighting factor to adjust QUAL estimates based on sample variance. This is an advanced option and should not need to be modified. The default is 1.0.
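The length, copy ratio, and bin support filters described above can be illustrated with a minimal Python sketch. It is not DRAGEN code: the function name `cnv_filters` and its arguments are hypothetical, the defaults are the ones listed above, and the handling of values that fall exactly on a threshold is an assumption.

```python
def cnv_filters(length_bp, copy_ratio, supporting_span_bp,
                min_length=10000, cr_threshold=0.2,
                bin_support_ratio=0.2, bin_support_min_len=80000):
    """Return the filter names that would apply to one event, or ["PASS"].

    Hypothetical helper mirroring the option descriptions; boundary
    behavior (< versus <=) is an assumption.
    """
    filters = []
    # cnv-filter-length: events shorter than the minimum length are filtered.
    if length_bp < min_length:
        filters.append("cnvLength")
    # cnv-filter-copy-ratio: copy ratios too close to 1.0 are filtered.
    if abs(copy_ratio - 1.0) < cr_threshold:
        filters.append("cnvCopyRatio")
    # cnv-filter-bin-support-ratio: only checked for long events.
    if (length_bp > bin_support_min_len
            and supporting_span_bp / length_bp < bin_support_ratio):
        filters.append("cnvBinSupportRatio")
    return filters or ["PASS"]


# The worked example from the text: 15,000 / 100,000 = 0.15 < 0.2.
print(cnv_filters(100000, 0.5, 15000))
```

The QUAL filter is left out because no default threshold is documented for cnv-filter-qual.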
The DRAGEN CNV pipeline supports multiple input formats. To run the DRAGEN CNV pipeline directly with FASTQ input without generating a BAM or CRAM file, see Streaming Alignments for instructions on streaming alignment records directly from the DRAGEN map/align stage.
DRAGEN CNV also supports running from an already mapped and aligned BAM or CRAM file. If you have data that has not yet been mapped and aligned, see Generate an Alignment File.
For the DRAGEN CNV pipeline, the hashtable must be generated with the --enable-cnv option set to true, in addition to any other options required by other pipelines. When --enable-cnv is true, DRAGEN generates an additional k-mer uniqueness map that the CNV algorithm uses to counteract mappability biases. You only need to generate the k-mer uniqueness map file one time per reference hashtable. The generation takes about 1.5 hours per whole human genome.
The reference hashtable is a pregenerated binary representation of the reference genome. For information on generating a hashtable, see Prepare a Reference Genome.
The following example command generates a hashtable.
The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the DRAGEN CNV Pipeline.
You need to generate alignment files for all samples that have not already been mapped and aligned, including any samples to be used as references for normalization. Each sample must have a unique sample identifier. Use the --RGSM option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM option is not required.
The following example command maps and aligns a FASTQ file:
The following example command maps and aligns an existing BAM file:
The following example command maps and aligns an existing CRAM file:
DRAGEN can map and align FASTQ samples, and then directly stream them to downstream callers, such as the CNV Caller and the Haplotype Variant Caller. You can use this process to skip generation of a BAM or CRAM file, which bypasses the need to store additional files.
To stream alignments directly to the DRAGEN CNV pipeline, run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable CNV. The following example command line maps and aligns a FASTQ file, and then sends the file to the Germline CNV WGS pipeline.
For information on running CNV concurrently with the Haplotype Variant Caller, see Concurrent CNV and Small Variant Calling.
The target counts stage is the first processing stage for the DRAGEN CNV pipeline. This stage bins the alignments into intervals. The primary analysis format for CNV processing is the target counts file, which contains the feature signals that are extracted from the alignments to be used in downstream processing. The binning strategy, interval sizes, and interval boundaries are controlled by the target counts generation options and by the normalization technique used.
When working with whole genome sequence data, the intervals are autogenerated from the reference hashtable. Only the primary contigs from the reference hashtable are considered for binning. You can specify additional contigs to bypass with the --cnv-skip-contig-list option.
With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed option to determine the intervals for analysis.
The target counts stage generates a .target.counts.gz file. You can use the file later in place of any BAM or CRAM by specifying the file with the --cnv-input option for the normalization stage. The .target.counts.gz file is an intermediate file for the DRAGEN CNV pipeline and should not be modified.
Further details are available in the Output Files section.
If the samples are whole genome, then the effective target interval width is specified with the --cnv-interval-width option. The higher the coverage of a sample, the higher the resolution that can be detected. This option is important when running with a panel of normals because all samples must have matching intervals. For self-normalization, the actual width of a given target interval might be larger than the specified value.
The default value for WGS is 1000 bp with a sample coverage of ≥ 30x.
Using a cnv-interval-width of less than 250 bp for WGS analysis can drastically increase runtime.
The intervals are autogenerated for every primary contig in the reference. Only references that follow the UCSC or GRC naming convention are supported, for example chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y. You can specify a list of contigs to skip by using the --cnv-skip-contig-list option. This option takes a comma-separated list of contig identifiers. The contig identifiers must match the reference hashtable that you are using. By default, only the mitochondrial chromosomes are skipped. Non-primary contigs are never processed.
For example, to skip chromosomes M, X, and Y on a reference that uses chr-prefixed contig names, use the following option: --cnv-skip-contig-list chrM,chrX,chrY
If the samples are whole exome samples, supply a target BED file with the --cnv-target-bed $TARGET_BED option. The intervals in the target BED file indicate regions where alignments are expected based on the target capture kit. The BED file intervals are further split into intervals of smaller size, depending on the value of cnv-interval-width.
To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.
The following options control the generation of target counts.
--cnv-counts-method
--- Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means if an alignment overlaps any part of the target bin, the alignment is counted for that bin. In the self-normalization mode, the default counting method is start.
--cnv-min-mapq
--- Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted.
--cnv-target-bed
--- Specifies a properly formatted BED file that indicates the target intervals to sample coverage over. For use in WES analysis.
--cnv-interval-width
--- Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis.
--cnv-skip-contig-list
--- Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. The default contigs that are skipped, if not specified, are chrM,MT,m,chrm.
--cnv-filter-duplicate-alignments
--- Filters duplicate-marked alignments during target counts generation if set to true. The default setting is false.
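The three values of --cnv-counts-method can be sketched as follows. This is an illustrative Python sketch, not DRAGEN code: the function name `is_counted`, the half-open coordinate convention, and the integer midpoint are assumptions.

```python
def is_counted(aln_start, aln_end, bin_start, bin_end, method="overlap"):
    """Decide whether one alignment is counted for one target bin.

    Coordinates are assumed to be 0-based and half-open: [start, end).
    """
    if method == "start":
        # Counted only in the bin that contains the alignment start.
        return bin_start <= aln_start < bin_end
    if method == "midpoint":
        # Counted only in the bin that contains the alignment midpoint.
        midpoint = (aln_start + aln_end) // 2
        return bin_start <= midpoint < bin_end
    if method == "overlap":
        # Counted in every bin that the alignment touches.
        return aln_start < bin_end and aln_end > bin_start
    raise ValueError(f"unknown counting method: {method}")


# An alignment spanning 950-1100 against bins [0, 1000) and [1000, 2000).
for method in ("start", "midpoint", "overlap"):
    print(method,
          is_counted(950, 1100, 0, 1000, method),
          is_counted(950, 1100, 1000, 2000, method))
```

With overlap, an alignment that straddles a bin boundary contributes to both bins, whereas start and midpoint assign it to exactly one bin.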
Target counts options are recorded in the header of each counts file to facilitate review and validation of a panel of normals. If the PON counts were generated with different count options than the case sample, DRAGEN returns an option validation error.
PCR duplicates are often considered noise in coverage depth information. The --cnv-filter-duplicate-alignments option controls whether duplicate-marked alignments are included or excluded when counting alignments. The option relies on the duplicate-marked bit (0x400) being set correctly in the SAM flag of each alignment.
If --enable-map-align=false, then duplicate marking should already be present in the input file (pre-aligned BAM/CRAM). If --enable-map-align=true, then --enable-duplicate-marking=true should be set.
Note that CNV waits for duplicate marking from the map/aligner, which may increase overall run time.
In the WGS case, where a BED file is not specified, the same intervals should be generated each time for a given reference. The intervals take into account the mappability of the reference genome by using the k-mer uniqueness map created during hashtable generation.
Due to ambiguity that may arise from non-unique genomic loci, only regions corresponding to unique k-mers are considered. A position in the reference genome is marked as a unique k-mer if the k-mer starting at that position does not show up anywhere else in the reference genome (or non-unique, otherwise). Furthermore, if the k-mer contains any bases other than A, C, T or G, it is marked as non-unique.
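The uniqueness rule can be sketched in a few lines of Python. This is only an illustration of the definition above: the function name `kmer_uniqueness` is hypothetical, the value of k is a free parameter, and only the forward strand is considered, which is an assumption because strand handling is not described here.

```python
from collections import Counter


def kmer_uniqueness(reference, k):
    """Return one boolean per k-mer start position: True if unique.

    A position is unique if its k-mer occurs exactly once in the
    sequence and contains only A, C, G, or T.
    """
    reference = reference.upper()
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    occurrences = Counter(kmers)
    valid_bases = set("ACGT")
    return [occurrences[kmer] == 1 and set(kmer) <= valid_bases
            for kmer in kmers]


# ACGT occurs twice (positions 0 and 4) and GTTN contains an N,
# so those three positions are marked non-unique.
print(kmer_uniqueness("ACGTACGTTN", 4))
```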
For WGS samples, and in the absence of a cnv-target-bed file, the target intervals are autogenerated based on the precomputed k-mer uniqueness map for the input reference hashtable and the cnv-interval-width option, which defaults to 1000 bp. The cnv-interval-width option determines the minimum number of unique k-mer positions required in the interval. There is an upper bound to the length of the interval: when the length of the interval is greater than double the value of cnv-interval-width without reaching the required count of unique k-mer positions, the interval is discarded and the process starts again at the next genomic position. Discarded regions are called "dropout" regions and are listed with the exclusion reason NON_KMER_UNIQUE in the *.cnv.excluded_intervals.bed.gz file.
A dropout region is a complex region that does not count alignments and results in an interval missing from the analysis. Dropout regions include centromeres, telomeres, and low complexity regions. If there is sufficient signal in the flanking regions, an event can still span these dropout regions, even if alignment counting does not occur in the regions. The event is handled by the segmentation stage.
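The interval generation procedure described above can be sketched as a single pass over a per-position uniqueness map. This Python sketch is an approximation under stated assumptions: the function name `generate_intervals` is hypothetical, an interval is capped at exactly twice the width, the scan resumes immediately after a discarded interval, and a trailing partial interval is treated as a dropout.

```python
def generate_intervals(unique, width):
    """Split one contig into target intervals and dropout regions.

    unique: list of booleans, one per position (True = unique k-mer).
    width:  required number of unique positions per interval.
    Returns (intervals, dropouts) as lists of half-open (start, end).
    """
    intervals, dropouts = [], []
    start, n = 0, len(unique)
    while start < n:
        count, end = 0, start
        # Extend until enough unique positions are found or the
        # interval reaches twice the requested width.
        while end < n and count < width and (end - start) < 2 * width:
            count += unique[end]
            end += 1
        if count >= width:
            intervals.append((start, end))
        else:
            dropouts.append((start, end))  # NON_KMER_UNIQUE region
        start = end
    return intervals, dropouts


unique_map = [True] * 4 + [False] * 10 + [True] * 4
print(generate_intervals(unique_map, 2))
```

In this toy input the interval (12, 16) is four positions long even though the requested width is 2, which illustrates why the actual interval width can be larger than cnv-interval-width.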
Some of the excluded intervals can be rescued through the segmental duplication extension to the germline CNV workflow. See Segmental Duplication Extension for more details.
GC bias describes the relationship between GC content and read coverage across a genome. Biases can be introduced by library prep, capture kits, sequencing system differences, and mapping, and they can make CNV events difficult to call. The DRAGEN GC bias correction module attempts to correct these biases.
The GC bias correction module immediately follows the target counts stage and operates on the *.target.counts.gz file. GC bias correction generates a GC bias corrected version of the file, which has a *.target.counts.gc-corrected.gz extension in the file name. The GC bias corrected versions are recommended for any downstream processing when working with WGS data. For WES, if there are enough target regions, then the GC bias corrected counts can also be used. See Output Files for further details on GC-corrected target counts files.
Typical whole-exome capture kits have over 200,000 targets spanning the regions of interest. If your BED file has fewer than 200,000 targets, or if the target regions are localized to a specific region in the genome (such that GC bias statistics might be skewed), then GC bias correction should be disabled.
The following options control the GC bias correction module.
--cnv-enable-gcbias-correction
--- Enable or disable GC bias correction when generating target counts. The default is true.
--cnv-enable-gcbias-smoothing
--- Enable or disable smoothing the GC bias correction across adjacent GC bins with an exponential kernel. The default is true.
--cnv-num-gc-bins
--- Specifies the number of bins for GC bias correction. Each bin represents the GC content percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.
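To make the role of GC bins concrete, the following Python sketch applies a generic median-based correction: targets are grouped by GC content, and each count is rescaled so that every GC bin has the same median. This is a common textbook approach and is shown only as an illustration. The exact DRAGEN correction, including its exponential-kernel smoothing, is not described here, and the function name `gc_correct` is hypothetical.

```python
from statistics import median


def gc_correct(counts, gc_fractions, num_bins=25):
    """Rescale counts so that each GC bin has the same median count.

    counts:       read count per target interval.
    gc_fractions: GC content per target interval, between 0.0 and 1.0.
    """
    # Assign each target to a GC bin; a GC fraction of 1.0 goes in the last bin.
    bins = [min(int(gc * num_bins), num_bins - 1) for gc in gc_fractions]
    global_median = median(counts)
    counts_per_bin = {}
    for gc_bin, count in zip(bins, counts):
        counts_per_bin.setdefault(gc_bin, []).append(count)
    bin_median = {b: median(values) for b, values in counts_per_bin.items()}
    # Leave counts untouched when a bin median is zero.
    return [count * global_median / bin_median[b] if bin_median[b] > 0 else count
            for b, count in zip(bins, counts)]


# Two low-GC targets with depressed counts and two high-GC targets.
print(gc_correct([100, 100, 200, 200], [0.32, 0.35, 0.62, 0.65], num_bins=10))
```

A correction like this needs enough targets per bin to estimate a stable median, which is consistent with the recommendation to disable GC bias correction for small or localized target sets.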
The DRAGEN CNV pipeline supports two normalization algorithms:
Self-Normalization --- Estimates the autosomal diploid level from the sample under analysis to determine the baseline level to normalize by. Sex chromosomes and PAR regions are handled based on the sample sex.
Panel of Normals --- A reference-based normalization algorithm that uses additional matched normal samples to determine a baseline level from which to call CNV events. Here, matched normal samples are samples that have undergone the same library prep and sequencing workflow as the case sample.
Which algorithm to use depends on the available data and the application. Use the following guidelines to select the mode of normalization.
Self-Normalization:
Whole genome sequencing
Single sample analysis
Additional matched samples are not readily available
Simpler workflow via a single invocation
Only references with chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y naming conventions are supported.
Panel of Normals:
Whole genome sequencing (non-somatic)
Whole exome sequencing
Targeted panels, including somatic panels
Additional matched samples are available
Nonhuman samples
The DRAGEN CNV pipeline provides a self-normalization mode that does not require a reference sample or a panel of normals. To enable this mode, set --cnv-enable-self-normalization to true. Self-normalization mode bypasses the need to run two stages and can save time. It uses the statistics within the case sample to determine the baseline from which to make a call.
Because self-normalization uses the statistics within the case sample, this mode is not recommended for WES or targeted sequencing analysis due to the potential for insufficient data.
The self-normalization mode is the recommended approach for whole-genome sequencing single sample processing. The pipeline continues through to the segmentation and calling stage to produce the final called events.
If you are running from a FASTQ sample, then the default mode of operation is self-normalization.
When operating in self-normalization mode, the --cnv-interval-width option used during the target counts stage becomes the effective interval width based on the number of unique k-mer positions. You typically do not have to modify this option.
Self-normalization autogenerates the target intervals to use during the analysis based on the reference genome and is only compatible with standard human references.
To attempt self-normalization mode on nonstandard human references, set the override --cnv-bypass-contig-check=true. Under this setting, the CNV caller performs a naive median normalization across all of the contigs within the reference genome. This feature is experimental and for research use only, and no claims or validation are made for its use.
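The idea of normalizing a sample against its own baseline can be sketched as a median normalization. This Python sketch is a simplification for illustration and not the DRAGEN estimator: the function name `median_normalize`, the `autosomes_only` switch, and the set of non-autosomal contig names are assumptions.

```python
from statistics import median

# Assumed names for sex and mitochondrial contigs in UCSC and GRC style.
NON_AUTOSOMAL = {"chrX", "chrY", "chrM", "X", "Y", "MT"}


def median_normalize(bins, autosomes_only=True):
    """Convert per-bin counts to copy ratios relative to a baseline.

    bins: list of (contig, count) tuples for one sample.
    With autosomes_only=True the baseline is the autosomal median;
    with False it is a naive median across all contigs.
    """
    baseline_counts = [count for contig, count in bins
                       if not autosomes_only or contig not in NON_AUTOSOMAL]
    baseline = median(baseline_counts)
    return [count / baseline for _, count in bins]


sample = [("chr1", 100), ("chr1", 100), ("chr2", 50), ("chr2", 100), ("chrX", 50)]
print(median_normalize(sample))
```

A ratio near 0.5 on an autosome suggests a single-copy loss, while the same ratio on chrX is expected for a male sample, which is why sex chromosomes need sex-aware handling.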
The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. This allows the algorithm to subtract system level biases that are not sample specific. The target counts for these normal samples should also be generated with the same command line options as the case sample under analysis.
In this mode, the DRAGEN CNV Pipeline is broken down into two distinct stages. The target counts stage is performed on each sample, both case and normals, to bin the alignments. The normalization and call detection stage is then performed with the case sample against the panel of normals to determine the events.
Target counts should be generated for all samples, whether the samples are to be used as references or are the case samples under analysis. The case samples and all samples to be used in the panel of normals must have identical intervals and therefore should be generated with identical settings. The target counts stage also performs GC bias correction, which is enabled by default.
The following examples are for WES processing, a case in which a panel of normals is required.
The following is an example command for processing a BAM file.
The following is an example command for processing a CRAM file.
When you run an analysis with a panel of normals (a set of target counts files), a column-wise concatenated version of the panel is output as a *.combined.counts.txt.gz file. To generate this file without running the actual calling step, add the --cnv-generate-combined-counts=true option to the command line. The individual panel of normals target counts files must be specified either via --cnv-normals-file (one file per use of the option) or --cnv-normals-list (a single text file with paths to each sample).
The following is an example command line using a normals list:
The next step in the CNV pipeline when using a panel of normals is to perform the normalization and to make the calls. This involves a separate execution of DRAGEN during which the normalization is performed and calls are made. This step requires the specification of a set of target counts files to be used for reference-based median normalization.
Ideally the panel of normals samples follow library prep and sequencing workflows that are identical to the workflows of the case sample under analysis. In order to be applicable to both male and female case samples, the panel of normals should include a balanced set of both male and female samples. DRAGEN automatically handles calling on sex chromosomes based on the predicted sex of each sample in the panel.
For optimal bias correction, a minimum of 50 samples is recommended as a panel. DRAGEN can run with smaller numbers of samples in the panel, down to even just a single sample, but smaller panels can result in artifactual calls in the test sample where at least some of the panel samples have copy number changes. Larger panels do not entirely prevent such issues, but they limit them to regions where non-reference copy numbers are common.
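The effect of panel size can be seen in a simplified reference-based median normalization. The Python sketch below is a stand-in for illustration only and does not reproduce the DRAGEN normalization: the function name `pon_normalize` is hypothetical, and scaling each sample by its own median before taking the per-target panel median is an assumption.

```python
from statistics import median


def pon_normalize(case_counts, panel_counts):
    """Return per-target copy ratios of the case against a panel.

    case_counts:  counts per target for the case sample.
    panel_counts: list of count lists, one per normal sample, with
                  targets in the same order as case_counts.
    """
    def scale(counts):
        # Remove depth differences by dividing by the sample median.
        sample_median = median(counts)
        return [count / sample_median for count in counts]

    case = scale(case_counts)
    panel = [scale(sample) for sample in panel_counts]
    # Per-target baseline: median across the panel samples.
    baseline = [median(column) for column in zip(*panel)]
    return [c / b if b > 0 else float("nan") for c, b in zip(case, baseline)]


panel = [[100, 200, 100], [50, 100, 50], [200, 400, 200]]
print(pon_normalize([100, 100, 100], panel))
```

Because the baseline is a per-target median, a copy number change in one panel sample shifts the baseline less as the panel grows. With a single-sample panel, any event in that sample appears inverted in the case sample.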
The following is an example of PON files, which uses a subset of the GC corrected files from the target counts stage.
DRAGEN accepts 3 different file formats for a Panel of Normals (PON).
The CNV caller can also be started from the *.target.counts.gz (raw counts) or *.target.counts.gc-corrected.gz (GC corrected counts) files of the case sample, by specifying the selected file with the --cnv-input option and the PON options described above. When selecting GC corrected counts, the option --cnv-enable-gcbias-correction should be set to false to disable the GC-correction stage.
For example, the following command normalizes the case sample against the panel of normals.
See Output Files for a description of the target counts files.
These options control the preconditioning of the panel of normals and the normalization of the case sample.
--cnv-enable-self-normalization
--- Enable/disable self normalization mode, which does not require a panel of normals.
--cnv-extreme-percentile
--- Specifies the extreme median percentile value at which to filter out samples. The default is 2.5.
--cnv-input
--- Specifies a target counts file for the case sample under analysis when using a panel of normals.
--cnv-normals-file
--- Specifies a target.counts.gz file to be used in the panel of normals. You can use this option multiple times, one time for each file.
--cnv-normals-list
--- Specifies a text file that contains paths to the list of reference target counts files to be used as a panel of normals. Absolute paths are recommended in case the workflow is used later or shared with other users. Relative paths are supported if the paths are relative to the current working directory.
--cnv-max-percent-zero-samples
--- Specifies the maximum percentage of zero coverage samples allowed for a target. If the target exceeds the specified threshold, then the target is filtered out. The default value is 5%. The option is sensitive to the number of normal samples being used, so adjust the threshold accordingly. If your panel of normals is small and the threshold is not adjusted, the option could filter out targets unintentionally.
--cnv-max-percent-zero-targets
--- Specifies the maximum percentage of zero coverage targets allowed for a sample. If the sample exceeds the specified threshold, then the sample is filtered out. The default value is 2.5%. The option is sensitive to the total number of target intervals, so adjust the threshold accordingly. If the capture kit has a small number of probes and the threshold is not adjusted, the option could filter out samples unintentionally.
--cnv-target-factor-threshold
--- Specifies the bottom percentile of panel of normals medians to filter out useable targets. The default is 1% for whole genome processing and 5% for targeted sequencing processing.
--cnv-truncate-threshold
--- Specifies a percentage threshold for truncating extreme outliers. The default is 0.1%.
--cnv-enable-gender-matched-pon
--- Enable/disable gender matched PON normalization. If enabled, DRAGEN uses matched gender PON for sex chromosome normalization. Sex chromosome intervals are filtered if PON has no matched gender sample. The default value is true.
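The sensitivity of the zero-coverage thresholds to panel size can be illustrated with a short Python sketch. The function names `filter_targets` and `filter_samples` are hypothetical, the two filters are shown independently because the order in which they are applied is not described here, and treating a value exactly at the threshold as passing is an assumption.

```python
def filter_targets(panel, max_percent_zero_samples=5.0):
    """Return indices of targets kept under cnv-max-percent-zero-samples.

    panel: list of count lists, one per normal sample.
    """
    kept = []
    for index, column in enumerate(zip(*panel)):
        percent_zero = 100.0 * sum(c == 0 for c in column) / len(panel)
        if percent_zero <= max_percent_zero_samples:
            kept.append(index)
    return kept


def filter_samples(panel, max_percent_zero_targets=2.5):
    """Return indices of samples kept under cnv-max-percent-zero-targets."""
    kept = []
    for index, counts in enumerate(panel):
        percent_zero = 100.0 * sum(c == 0 for c in counts) / len(counts)
        if percent_zero <= max_percent_zero_targets:
            kept.append(index)
    return kept


# Ten normals, three targets; one sample has zero coverage at target 1.
panel = [[10, 0, 10]] + [[10, 10, 10] for _ in range(9)]
print(filter_targets(panel))                               # 10% > 5%: target 1 dropped
print(filter_targets(panel, max_percent_zero_samples=10))  # all targets kept
```

With only ten normals, a single zero-coverage sample already amounts to 10% and exceeds the 5% default, which is why the threshold should be adjusted for small panels.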
After a case sample has been normalized, the sample goes through a segmentation stage. DRAGEN implements multiple segmentation algorithms, including the following algorithms:
Circular Binary Segmentation (CBS)
Shifting Level Models (SLM)
The SLM algorithm has three variants: SLM, Heterogeneous SLM (HSLM), and Adaptive SLM (ASLM). HSLM is for use in exome analysis and handles target capture kits that are not equally spaced. ASLM includes additional sample-specific estimation of technical variability of depth of coverage, as opposed to changes in copy number. The estimations are based on the median variance within fixed windows or a preliminary set of segments based on b-allele ratios. The ASLM algorithm mitigates oversegmentation due to noisy or wavy samples; this is the default mode for somatic WGS analysis.
By default, SLM is the segmentation algorithm for germline whole genome processing, ASLM is the algorithm for somatic whole genome processing, and HSLM is the algorithm for whole exome processing.
For targeted sequencing workflows, you can also specify a --cnv-segmentation-bed file. The option predefines the segments for which copy numbers are estimated and skips the segmentation step of the workflow. See Targeted Segmentation (Segment BED) for more information.
--cnv-segmentation-mode
--- Specifies the segmentation algorithm to perform. The following values are available.
bed
cbs
slm
--- The default for germline WGS analysis.
aslm
--- The default for somatic WGS analysis.
hslm
--- The default for targeted/WES analysis.
--cnv-merge-distance
--- Specifies the maximum number of base pairs between two segments that would allow them to be merged. The default value is 0 for germline WGS, which means the segments must be directly adjacent. For WES analysis, this parameter is disabled by default due to the spacing of targeted intervals.
--cnv-merge-threshold
--- Specifies the maximum segment mean difference at which two adjacent segments should be merged. The segment mean is represented as a linear copy ratio value. The default is 0.2 for WGS and 0.4 for WES. To disable merging, set the value to 0.
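The interaction of --cnv-merge-distance and --cnv-merge-threshold can be sketched as follows. This is a minimal Python illustration, not DRAGEN code: the function name `merge_segments` is hypothetical, segments are assumed to lie on one contig in sorted half-open coordinates, and the length-weighted mean of the merged segment is an assumption.

```python
def merge_segments(segments, merge_distance=0, merge_threshold=0.2):
    """Merge neighboring segments with similar mean copy ratios.

    segments: sorted list of (start, end, mean_copy_ratio) tuples.
    A merge_threshold of 0 disables merging.
    """
    if merge_threshold == 0 or not segments:
        return list(segments)
    merged = [segments[0]]
    for start, end, mean in segments[1:]:
        prev_start, prev_end, prev_mean = merged[-1]
        close_enough = (start - prev_end) <= merge_distance
        similar = abs(mean - prev_mean) <= merge_threshold
        if close_enough and similar:
            prev_len, cur_len = prev_end - prev_start, end - start
            # Assumption: the merged mean is weighted by segment length.
            new_mean = (prev_mean * prev_len + mean * cur_len) / (prev_len + cur_len)
            merged[-1] = (prev_start, end, new_mean)
        else:
            merged.append((start, end, mean))
    return merged


segments = [(0, 1000, 1.0), (1000, 2000, 1.1), (2000, 3000, 1.6)]
print(merge_segments(segments))
```

The first two segments differ by 0.1 and are directly adjacent, so they merge. The third segment differs from the merged mean by more than 0.2 and stays separate.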
Circular Binary Segmentation is implemented directly in DRAGEN and is based on the method described in A faster circular binary segmentation algorithm for the analysis of array CGH data¹, with enhancements to improve sensitivity for NGS data. The following options control Circular Binary Segmentation.
--cnv-cbs-alpha
--- Specifies the significance level for the test to accept change points. The default is 0.01.
--cnv-cbs-eta
--- Specifies the Type I error rate of the sequential boundary for early stopping when using the permutation method. The default is 0.05.
--cnv-cbs-kmax
--- Specifies maximum width of smaller segment for permutation. The default is 25.
--cnv-cbs-min-width
--- Specifies the minimum number of markers for a changed segment. The default is 2.
--cnv-cbs-nmin
--- Specifies the minimum length of data for maximum statistic approximation. The default is 200.
--cnv-cbs-nperm
--- Specifies the number of permutations used for p-value computation. The default is 10000.
--cnv-cbs-trim
--- Specifies the proportion of data to be trimmed for variance calculations. The default is 0.025.
¹Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657-663. doi:10.1093/bioinformatics/btl646
The Shifting Level Models (SLM) segmentation mode follows from the R implementation of SLMSuite: a suite of algorithms for segmenting genomic profiles².
--cnv-slm-eta
--- Baseline probability that the mean process changes its value. The default is 4e-5.
--cnv-slm-fw
--- Minimum number of data points for a CNV to be emitted. The default is 0, which means segments with one design probe could in effect be emitted.
--cnv-slm-omega
--- Scaling parameter that modulates relative weight between experimental or biological variance. The default is 0.3.
--cnv-slm-stepeta
--- Distance normalization parameter. The default is 10000. This option is only valid for HSLM.
Regardless of segmentation method, initial segments are split across large gaps where depth data is unavailable, such as across centromeres.
²Orlandini V, Provenzano A, Giglio S, Magi A. SLMSuite: a suite of algorithms for segmenting genomic profiles. BMC Bioinformatics. 2017;18(1). doi:10.1186/s12859-017-1734-5
In applications for targeted panels, you can limit the segmentation and calling performed on intervals by specifying a --cnv-segmentation-bed file. For example, the specified intervals might correspond to gene boundaries matched to the targeted assay. This segmentation mode is only supported with the panel of normals and requires an accompanying --cnv-target-bed file. Also specify the --cnv-segmentation-bed file during the panel of normals generation step, so that all interval boundaries during analysis are matched. For more information on panel of normals generation, see Panel of Normals.
The recommended format for the BED file includes four columns and a header. The four columns are contig, start, stop, and name. The name column represents the name of the gene and must be unique within the BED file. The name is used in the output VCF and annotated as a segment identifier in the INFO/SEGID field. The following example file is in the recommended format:
If using a three-column BED file, do not include a header or name values. Three-column BED files should include only the contig, start, and stop values. In this case, the segment identifier is autogenerated from the coordinate fields.
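The two accepted layouts can be illustrated with a small parsing sketch. This is not DRAGEN's parser. It assumes tab-separated fields and a header whose first column is literally `contig`, and the `contig:start-stop` identifier format for three-column files is hypothetical, since the exact autogenerated form is not documented here.

```python
from typing import List, Tuple

Record = Tuple[str, int, int, str]  # (contig, start, stop, segment_id)


def parse_segmentation_bed(text: str) -> List[Record]:
    """Parse a four-column BED with header, or a three-column BED without one."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # Assumption: a header row starts with the literal column name "contig".
    has_header = lines[0].split("\t")[0] == "contig"
    records: List[Record] = []
    seen = set()
    for ln in (lines[1:] if has_header else lines):
        fields = ln.split("\t")
        if len(fields) == 4:
            contig, start, stop, name = fields
        elif len(fields) == 3:
            contig, start, stop = fields
            name = f"{contig}:{start}-{stop}"  # hypothetical autogenerated ID
        else:
            raise ValueError(f"expected 3 or 4 columns, got {len(fields)}: {ln!r}")
        if name in seen:
            raise ValueError(f"segment name is not unique: {name}")
        seen.add(name)
        records.append((contig, int(start), int(stop), name))
    return records


# Placeholder gene names and coordinates, for illustration only.
four_col = "contig\tstart\tstop\tname\nchr1\t100\t200\tGENE_A\nchr2\t300\t400\tGENE_B\n"
three_col = "chr1\t100\t200\n"
print(parse_segmentation_bed(four_col))
print(parse_segmentation_bed(three_col))
```

The uniqueness check mirrors the requirement that each name appears only once in the file.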
Quality scores are computed using a probabilistic model that uses a mixture of heavy tailed probability distributions (one per integer copy number) with a weighting for event length. Noise variance is estimated. The output VCF contains a Phred-scaled metric that measures confidence in called amplification (CN > 2 for diploid locus), deletion (CN < 2 for diploid locus), or copy neutral (CN=2 for diploid locus) events.
The scoring algorithm also calculates exact copy-number quality scores that are inputs to the DeNovo CNV detection pipeline.
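As a reminder of what a Phred-scaled value means, the sketch below converts an error probability into a Phred score using the standard definition Q = -10 log10(P_error). It does not reproduce the DRAGEN mixture model. The posterior value and the cap are hypothetical.

```python
import math


def phred(p_error: float, cap: float = 1000.0) -> float:
    """Standard Phred scaling: Q = -10 * log10(P_error), capped for P_error = 0."""
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be a probability in [0, 1]")
    if p_error == 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))


# Hypothetical posterior probability that a called deletion is correct.
posterior_called_state = 0.999
quality = phred(1.0 - posterior_called_state)
print(round(quality))
```

Under this scaling, each increase of 10 in the score corresponds to a tenfold lower probability that the called state is wrong.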
You can input an exclude BED to the CNV caller to filter out regions from analysis. Inputting an exclude BED is useful if certain regions in the genome are known to be problematic due to library prep, sequencing, or mapping issues. You can also exclude intervals that specify common CNVs to aid in downstream analysis. You can specify an exclude BED file using --cnv-exclude-bed. DRAGEN does not provide an exclude BED. The intervals to exclude should be formatted in standard three-column BED format.
The intervals in the exclude BED are compared with the original target counts intervals. If the overlap is greater than --cnv-exclude-bed-min-overlap, the target counts interval is excluded from analysis. The *.target.counts.gz file still includes the interval, so you can inspect the original read counts. The excluded intervals are removed during the normalization stage, so the *.tn.tsv.gz file does not include them.
An excluded interval does not guarantee that a CNV call does not span the interval. If there is sufficient data flanking the region, the segmentation stage along with any merging might still generate a call spanning the excluded interval. However, the call would not take read counts from excluded intervals into account. You can view explanations for excluded intervals in the *.excluded_intervals.bed.gz file. See Output Files for further details.
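The overlap test described above can be sketched as follows. This is only an illustration: it assumes the overlap is measured as the fraction of the target interval covered by exclude intervals, and the threshold values are hypothetical rather than DRAGEN defaults.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, stop), half-open like BED


def overlap_fraction(target: Interval, excludes: List[Interval]) -> float:
    """Fraction of the target interval covered by the union of exclude intervals."""
    contig, start, stop = target
    if stop <= start:
        raise ValueError("target interval must have positive length")
    # Clip each overlapping exclude interval to the target, then merge them
    # so that overlapping exclude intervals are not counted twice.
    clipped = sorted(
        (max(start, s), min(stop, e))
        for c, s, e in excludes
        if c == contig and s < stop and e > start
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (stop - start)


def keep_target(target: Interval, excludes: List[Interval], min_overlap: float) -> bool:
    """A target is excluded only when its overlap is greater than the threshold."""
    return overlap_fraction(target, excludes) <= min_overlap


target = ("chr1", 1000, 2000)
excludes = [("chr1", 900, 1300), ("chr1", 1200, 1500), ("chr2", 1000, 2000)]
print(overlap_fraction(target, excludes))
print(keep_target(target, excludes, min_overlap=0.4))
```

Here the two chr1 exclude intervals merge into a single 500 bp span inside the 1000 bp target, so the target is dropped at a threshold of 0.4 and kept at 0.5.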
DRAGEN can map and align FASTQ samples and then stream the data directly to downstream callers. If the input is a FASTQ sample, a single sample can run through both the CNV caller and the small variant caller (VC). This triggers self-normalization by default.
Run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable the CNV, VC, or both. The options that apply to CNV in the standalone workflows are also applicable here.
The following examples show different commands.
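One way such a command line could be assembled is sketched below. The CNV options (--enable-cnv, --enable-map-align, --cnv-enable-self-normalization) come from this page. The remaining options (-r, -1, -2, --RGID, --RGSM, --output-directory, --output-file-prefix, --enable-variant-caller) are assumed from general DRAGEN usage and should be checked against the DRAGEN version in use. All paths and sample names are placeholders.

```python
import shlex
from typing import List


def build_fastq_cnv_command(
    ref_dir: str,
    fastq1: str,
    fastq2: str,
    sample_id: str,
    out_dir: str,
    enable_small_vc: bool = False,
) -> List[str]:
    """Assemble an argument list for a FASTQ-to-CNV run with self-normalization."""
    cmd = [
        "dragen",
        "-r", ref_dir,                       # reference hash table directory (assumed flag)
        "-1", fastq1,                        # first FASTQ (assumed flag)
        "-2", fastq2,                        # second FASTQ (assumed flag)
        "--RGID", sample_id,                 # read group ID (assumed flag)
        "--RGSM", sample_id,                 # read group sample name (assumed flag)
        "--output-directory", out_dir,
        "--output-file-prefix", sample_id,
        "--enable-map-align=true",
        "--enable-cnv=true",
        "--cnv-enable-self-normalization=true",
    ]
    if enable_small_vc:
        # Small variant caller switch (assumed flag, not named on this page).
        cmd.append("--enable-variant-caller=true")
    return cmd


cmd = build_fastq_cnv_command(
    "/path/to/reference",
    "sample_R1.fastq.gz",
    "sample_R2.fastq.gz",
    "sample1",
    "/path/to/output",
    enable_small_vc=True,
)
print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting problems when paths contain spaces.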
When running the target counts stage or the normalization stage, the DRAGEN CNV pipeline also provides the following information about the samples in the run.
A correlation metric of the read count profile between the case sample and any panel of normals samples. A correlation metric greater than 0.90 is recommended for confident analysis, but there is no hard restriction enforced by the software.
The predicted sex of each sample in the run. The sex is predicted based on the read count information in the sex chromosomes and the autosomal chromosomes. The median value for the counts is printed to the screen for the autosomal chromosomes, the X chromosome, and the Y chromosome. This estimation requires a minimum of 300 target intervals on the sex chromosomes to proceed.
The results are printed to the screen when running the pipeline. For example:
The predicted sexes for samples in use are also printed to the *.cnv_metrics.csv output file. For a panel of normals, the predicted sexes are used to determine which panel samples are leveraged for normalization on sex chromosomes. If the estimated sex of the sample is UNDETERMINED, the sex of the sample is set to FEMALE.
You can override the predicted sex of the sample with the --sample-sex option.
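The correlation check between the case sample and each panel of normals sample can be sketched as follows. The type of correlation that DRAGEN computes is not specified on this page; the sketch assumes a Pearson correlation over per-interval read counts, and the sample names and counts are made up.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length count profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def low_correlation_normals(
    case: Sequence[float],
    normals: Dict[str, Sequence[float]],
    min_corr: float = 0.90,
) -> List[str]:
    """Return the names of normal samples whose correlation falls below min_corr."""
    return [name for name, counts in normals.items() if pearson(case, counts) < min_corr]


# Made-up per-interval read counts.
case_counts = [100, 120, 80, 110, 95]
panel = {
    "normal_a": [200, 240, 160, 220, 190],  # same profile at twice the depth
    "normal_b": [190, 160, 240, 200, 220],  # dissimilar profile
}
print(low_correlation_normals(case_counts, panel))
```

Because the software does not enforce the 0.90 recommendation, a check like this only flags samples for review.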
The germline CNV workflow can be extended to call copy number alterations in a curated subset of segmentally duplicated regions. Segmental duplications are large blocks of DNA ≥ 1kb, characterized by a high degree of sequence identity at nucleotide level (> 90%). This poses a challenge for traditional approaches, and such regions are usually excluded.
This extension complements the original germline CNV workflow by using a tailored algorithm to compute the normalized coverage in such regions. The result is injected before the segmentation step and becomes part of the main CNV workflow in downstream steps. We recommend WGS data with at least 30x coverage, aligned to a supported human reference genome (currently only hg38). See below for additional requirements.
The following pairs of genes defining Segmental Duplications are included:
This extension is enabled by default in the germline CNV workflow. However, it requires:
Normalization set to self-normalization (--cnv-enable-self-normalization=true).
GC bias correction enabled (--cnv-enable-gcbias-correction=true).
Counts method set to start (--cnv-counts-method=start).
Interval width not greater than 10 kb. However, we recommend using the --cnv-interval-width default (1 kb) for best performance.
A supported reference genome build as input (currently only hg38).
If necessary, the extension can be disabled by setting --cnv-enable-segdups-extension to false.
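The requirements above can be checked before launching a run. The following is a minimal sketch, assuming the run settings are held in a plain dictionary keyed by option name; the `reference-build` key is hypothetical and not a DRAGEN option.

```python
from typing import Dict, List


def segdups_requirement_problems(opts: Dict[str, object]) -> List[str]:
    """Return a list of unmet requirements for the segmental duplication extension."""
    problems: List[str] = []
    if opts.get("cnv-enable-self-normalization") is not True:
        problems.append("self-normalization must be enabled")
    if opts.get("cnv-enable-gcbias-correction") is not True:
        problems.append("GC bias correction must be enabled")
    if opts.get("cnv-counts-method") != "start":
        problems.append("counts method must be 'start'")
    # 1000 bp is the documented default interval width; 10 kb is the upper limit.
    if int(opts.get("cnv-interval-width", 1000)) > 10_000:
        problems.append("interval width must not exceed 10 kb")
    # Hypothetical key describing the reference build in use.
    if opts.get("reference-build") != "hg38":
        problems.append("reference build must be hg38")
    return problems


run_options = {
    "cnv-enable-self-normalization": True,
    "cnv-enable-gcbias-correction": True,
    "cnv-counts-method": "start",
    "cnv-interval-width": 1000,
    "reference-build": "hg38",
}
print(segdups_requirement_problems(run_options))
```

An empty list means all listed requirements are met under these assumptions.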
For each duplicated region, the extension collects all reads that fall on the two homologous intervals of the pair and computes the normalized joint coverage (output to *.cnv.segdups.joint_coverage.tsv.gz).
Using sites that differentiate the two homologous intervals, the extension computes the proportion of coverage to assign to the first and to the second interval (output to *.cnv.segdups.site_ratios.tsv.gz).
This proportion is used to redistribute the joint normalized coverage between the two homologous intervals.
The rescued intervals are output to the *.cnv.segdups.rescued_intervals.tsv.gz file for inspection and are automatically injected before the segmentation step.
During integration with the original intervals from the CNV caller, the rescued intervals take priority and replace all original intervals that they overlap.
See Output Files for a description of the extension output files.
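The redistribution and injection steps described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the DRAGEN algorithm: a single ratio splits the joint coverage, intervals are plain tuples, and all values are made up.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int, float]  # (contig, start, stop, normalized coverage)


def split_joint_coverage(joint: float, ratio_first: float) -> Tuple[float, float]:
    """Split the joint normalized coverage between the two homologous intervals."""
    if not 0.0 <= ratio_first <= 1.0:
        raise ValueError("ratio_first must be in [0, 1]")
    return joint * ratio_first, joint * (1.0 - ratio_first)


def inject_rescued(original: List[Interval], rescued: List[Interval]) -> List[Interval]:
    """Drop original intervals that overlap a rescued interval, then add the rescued ones."""

    def overlaps(a: Interval, b: Interval) -> bool:
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

    kept = [o for o in original if not any(overlaps(o, r) for r in rescued)]
    return sorted(kept + list(rescued))


# Made-up values: a joint coverage of 4.0 with a quarter assigned to the first interval.
first, second = split_joint_coverage(4.0, 0.25)
print(first, second)

original = [("chr1", 0, 1000, 1.0), ("chr1", 1000, 2000, 0.9), ("chr1", 2000, 3000, 1.1)]
rescued = [("chr1", 1000, 2000, 0.5)]
merged = inject_rescued(original, rescued)
print(merged)
```

Only the middle interval is replaced in this example, because it is the only one that overlaps the rescued interval.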
WGS Coverage per Sample | Recommended Resolution* (bp) |
---|---|
5 | 10000 |
10 | 5000 |
>= 30 | 1000 |

Input format | enable-map-align | Required option |
---|---|---|
FASTQ | TRUE | --enable-map-align=true, --enable-duplicate-marking=true |
BAM | TRUE | --enable-map-align=true, --enable-duplicate-marking=true |
BAM | FALSE | --enable-map-align=false |

Option | Description |
---|---|
--cnv-normals-file | Individual normal file. This option takes a single file name and can be specified multiple times. |
--cnv-normals-list | List of normal files. A plain text file in which each line contains a path to a *.target.counts.gz or *.target.counts.gc-corrected.gz file generated by the target counts stage. Relative paths are supported if they are relative to the current working directory. Absolute paths are recommended in case the workflow is used later or shared with other users. |
--cnv-combined-counts | PoN file that combines all normal files into a single file. The combined counts file (*.combined.counts.txt.gz) can be found in the output folder of a prior DRAGEN run that used the same panel of normals. Some prepackaged PoN files downloaded directly from the Illumina support site must be specified with this option. |
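A normals list for --cnv-normals-list can be generated with a short script. The sketch below collects the target counts files in one directory and writes their absolute paths, one per line, as recommended above. The directory layout, file names, and function name are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def write_normals_list(
    counts_dir: str, list_path: str, suffix: str = ".target.counts.gz"
) -> List[Path]:
    """Write the absolute paths of all files ending in `suffix` to list_path."""
    files = sorted(
        p.resolve() for p in Path(counts_dir).iterdir() if p.name.endswith(suffix)
    )
    if not files:
        raise FileNotFoundError(f"no *{suffix} files found in {counts_dir}")
    Path(list_path).write_text("".join(f"{p}\n" for p in files))
    return files


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("normal1", "normal2"):
        (Path(tmp) / f"{name}.target.counts.gz").touch()
    (Path(tmp) / "notes.txt").touch()  # ignored: wrong suffix
    list_file = Path(tmp) / "normals.txt"
    write_normals_list(tmp, str(list_file))
    listed = list_file.read_text().splitlines()

print(len(listed))
```

Pass `suffix=".target.counts.gc-corrected.gz"` to build the list from GC-corrected counts instead.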
Gene 1 | Gene 2 |
---|---|
CYP2A6 | CYP2A7 |
FCGR3A | FCGR3B |
RHD | RHCE |
STRC | STRCP1 |
ACSM2A | ACSM2B |
ACTR3B | ACTR3C |
AQP12A | AQP12B |
ASAH2 | ASAH2B |
CCDC74A | CCDC74B |
CD177 | CD177p1 |
CD8B | CD8B2 |
CFH1 | CFHR1 |
CYP4A11 | CYP4A22 |
DHX40 | DHX40P1 |
EIF5AL1 | EIF5AP4 |
FCGR2A | FCGR2C |
FFAR3 | GPR42 |
FOLH1 | FOLH1B |
FRMPD2 | FRMPD2B |
GPAT2 | GPAT2P1 |
GSTT2B | GSTT2 |
DDT | DDTL |
HCAR2 | HCAR3 |
HSPA1A | HSPA1B |
KRT81 | KRT86 |
LGALS7 | LGALS7B |
MRPL45 | MRPL45P2 |
MSTO1 | MSTO2p |
MUC20 | MUC20P1 |
MZT2A | MZT2B |
OTOA | OTOAp1 |
PDPR | PDPR2P |
PIEZO2 | ENST00000591853.1 |
ZP3 | POMZP3 |
PRAMEF7 | PRAMEF8 |
PROS1 | PROS2P |
RMND5A | ANAPC1P2 |
ROCK1 | ROCK1p1 |
SERPINB3 | SERPINB4 |
SYT3 | ZNF473CR |
TBC1D26 | TBC1D28 |
TOP3B | TOP3BP1 |
TUBA3D | TUBA3E |
ZNF443 | ZNF799 |