CNV Preprocessing
The target counts stage is the first processing stage for the DRAGEN CNV pipeline. This stage bins the alignments into intervals. The primary analysis format for CNV processing is the target counts file, which contains the feature signals that are extracted from the alignments to be used in downstream processing. The binning strategy, interval sizes, and interval boundaries are controlled by the target counts generation options and by the normalization technique used.
When working with whole genome sequence data, the intervals are autogenerated from the reference hashtable. Only the primary contigs from the reference hashtable are considered for binning. You can specify additional contigs to bypass with the --cnv-skip-contig-list option.
With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed option to determine the intervals for analysis. The target BED file should contain intervals that match those in the panel of normals file. If the intervals in the target BED file and the panel of normals file do not match, DRAGEN uses the target intervals from the panel of normals file.
The target counts stage generates a *.target.counts.gz file. You can use this file later in place of a BAM or CRAM by specifying it with the --cnv-input or --cnv-tumor-input option for the normalization stage. The *.target.counts.gz file is an intermediate file for the DRAGEN CNV pipeline and should not be modified.
Further details are available in the section.
If the samples are whole genome, the effective target interval width is specified with the --cnv-interval-width option. The higher the coverage of a sample, the higher the resolution that can be detected. This option is important when running with a panel of normals because all samples must have matching intervals. For self-normalization, the actual width of a given target interval might be larger than the specified value.
The default value for WGS is 1000 bp with a sample coverage of ≥ 30x.
Coverage per sample (x)    Recommended cnv-interval-width (bp)
5                          10000
10                         5000
>= 30                      1000
Using a cnv-interval-width of less than 250 bp for WGS analysis can drastically increase runtime.
The intervals are autogenerated for every primary contig in the reference. Only references that follow the UCSC or GRC naming convention are supported, for example chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y. You can specify a list of contigs to skip by using the --cnv-skip-contig-list option. This option takes a comma-separated list of contig identifiers. The contig identifiers must match the reference hashtable that you are using. By default, only the mitochondrial chromosomes are skipped. Non-primary contigs are never processed.
For example, to skip chromosomes M, X, and Y on a reference that uses chr-prefixed contig names, use the following option: --cnv-skip-contig-list chrM,chrX,chrY
If the samples are whole exome samples, supply a target BED file with the --cnv-target-bed $TARGET_BED option. The intervals in the target BED file indicate regions where alignments are expected based on the target capture kit. The BED file intervals are further split into intervals of smaller size, depending on the value of cnv-interval-width.
To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.
The following options control the generation of target counts.
--cnv-counts-method --- Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means that an alignment is counted for a bin if it overlaps any part of that bin. In self-normalization mode, the default counting method is start.
--cnv-min-mapq --- Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted.
--cnv-target-bed --- Specifies a properly formatted BED file that indicates the target intervals to sample coverage over. For use in WES analysis.
--cnv-interval-width --- Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis.
--cnv-skip-contig-list --- Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. If not specified, the contigs skipped by default are chrM,MT,m,chrm.
--cnv-filter-duplicate-alignments --- Filters duplicate-marked alignments during target counts generation if set to true. The default setting is true unless map/align is enabled and duplicate marking is disabled.
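The three values of --cnv-counts-method can be illustrated with a small Python sketch. This is not DRAGEN's implementation: the function name, the half-open coordinate convention, and the exact midpoint rounding are assumptions made for illustration only.

```python
def counted_in_bin(aln_start, aln_end, bin_start, bin_end, method="overlap"):
    """Decide whether an alignment counts toward a bin.

    Coordinates are half-open [start, end); this convention and the
    midpoint rounding are illustrative assumptions.
    """
    if method == "overlap":
        # Any shared base between alignment and bin counts.
        return aln_start < bin_end and aln_end > bin_start
    if method == "start":
        pos = aln_start
    elif method == "midpoint":
        pos = (aln_start + aln_end - 1) // 2
    else:
        raise ValueError(f"unknown counting method: {method}")
    return bin_start <= pos < bin_end


# One alignment straddling the boundary between two 1000 bp bins.
aln = (950, 1100)
bins = [(0, 1000), (1000, 2000)]
hits = {
    m: [counted_in_bin(*aln, *b, method=m) for b in bins]
    for m in ("overlap", "start", "midpoint")
}
```

With overlap, the straddling alignment is counted in both bins, whereas start and midpoint each assign it to exactly one bin.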
Target counts options are recorded in the header of each counts file to facilitate review and validation of the panel of normals. If the PON counts were generated with different counting options than the case sample, DRAGEN returns an option validation error.
PCR duplicates are often considered noise in coverage depth information. DRAGEN CNV provides the --cnv-filter-duplicate-alignments option to include or exclude duplicate-marked alignments when counting alignments. This relies on the alignments having the duplicate-marked bit (0x400) in the SAM flag set correctly.
If --enable-map-align=false, then duplicate marking should already be present in the input file (pre-aligned BAM/CRAM). If --enable-map-align=true, then --enable-duplicate-marking=true should be set.
Note that CNV waits for duplicate marking from the map/aligner, which may increase overall run time.
Input    Map/align enabled    Options
FASTQ    TRUE                 --enable-map-align=true, --enable-duplicate-marking=true
BAM      TRUE                 --enable-map-align=true, --enable-duplicate-marking=true
BAM      FALSE                --enable-map-align=false
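To check whether an input carries usable duplicate marking, the 0x400 bit of the SAM flag can be tested directly. The following Python sketch shows the bit test; the helper names are hypothetical and the filtering logic is a simplified illustration, not DRAGEN code.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit for PCR or optical duplicates


def is_duplicate(sam_flag):
    """True if the duplicate bit is set in the SAM flag."""
    return bool(sam_flag & DUPLICATE_FLAG)


def keep_alignment(sam_flag, filter_duplicates=True):
    """Mimics the effect of duplicate filtering on a single alignment."""
    return not (filter_duplicates and is_duplicate(sam_flag))


flags = [99, 147, 1123, 1171]  # the last two have the 0x400 bit set
kept = [f for f in flags if keep_alignment(f)]
```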
In the WGS case where a BED file is not specified for a given reference, the same intervals should be generated each time. The intervals created take into account the mappability of the reference genome using a k-mer uniqueness map created during hashtable generation.
Due to ambiguity that may arise from non-unique genomic loci, only regions corresponding to unique k-mers are considered. A position in the reference genome is marked as a unique k-mer if the k-mer starting at that position does not show up anywhere else in the reference genome (or non-unique, otherwise). Furthermore, if the k-mer contains any bases other than A, C, T or G, it is marked as non-unique.
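The uniqueness rule above can be sketched in Python for a toy sequence. This is only an illustration of the definition: the function name is hypothetical, strand handling is ignored, and counting every k-mer in memory would not scale to a real genome.

```python
from collections import Counter


def unique_kmer_positions(reference, k):
    """Return one boolean per k-mer start position.

    A position is unique if its k-mer occurs exactly once in the
    sequence and contains only A, C, G, or T.
    """
    reference = reference.upper()
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    counts = Counter(kmers)
    valid = set("ACGT")
    return [counts[kmer] == 1 and set(kmer) <= valid for kmer in kmers]


# ACGT occurs twice; the last three k-mers contain an N.
uniqueness = unique_kmer_positions("ACGTACGTNAA", k=4)
```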
For WGS samples, and in the absence of a cnv-target-bed file, the target intervals are autogenerated from the precomputed k-mer uniqueness map for the input reference hashtable and the cnv-interval-width option, which defaults to 1000 bp. The cnv-interval-width option determines the minimum number of unique k-mer positions required in the interval. There is an upper bound on the length of the interval: when the interval grows to more than double cnv-interval-width without reaching the required count of unique k-mer positions, the interval is discarded and the process starts again at the next genomic position. Discarded regions are called "dropout" regions and are reported with exclusion reason NON_KMER_UNIQUE in the *.cnv.excluded_intervals.bed.gz file.
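The interval generation rule can be sketched as follows. This is a simplified reading of the description, not DRAGEN's algorithm. In particular, it assumes that after a discarded window the scan resumes at the position following that window, and that a trailing window that never reaches the required count is also treated as a dropout. The function name is hypothetical.

```python
def generate_intervals(unique, width):
    """Split a contig into intervals holding `width` unique positions each.

    `unique` holds one boolean per position. Returns (intervals, dropouts)
    as lists of half-open (start, end) tuples.
    """
    intervals, dropouts = [], []
    n = len(unique)
    start = 0
    while start < n:
        count = 0
        end = start
        # Extend until enough unique positions are collected or the
        # window becomes longer than twice the requested width.
        while end < n and count < width and (end - start) <= 2 * width:
            count += unique[end]
            end += 1
        if count == width:
            intervals.append((start, end))
        else:
            dropouts.append((start, end))
        start = end  # assumption: resume right after the window
    return intervals, dropouts


toy_map = [True] * 4 + [False] * 10 + [True] * 4
toy_intervals, toy_dropouts = generate_intervals(toy_map, width=2)
```

In this toy map, the run of non-unique positions in the middle produces dropout windows, while the unique flanks each produce two intervals.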
A dropout region is a complex region that does not count alignments and results in an interval missing from the analysis. Dropout regions include centromeres, telomeres, and low complexity regions. If there is sufficient signal in the flanking regions, an event can still span these dropout regions, even if alignment counting does not occur in the regions. The event is handled by the segmentation stage.
The germline CNV workflow can be extended to call copy number alterations in a curated subset of segmentally duplicated regions. Segmental duplications are large blocks of DNA ≥ 1kb, characterized by a high degree of sequence identity at nucleotide level (> 90%). This poses a challenge for traditional approaches, and such regions are usually excluded.
This extension complements the original germline CNV workflow by using a tailored algorithm to compute the normalized coverage in such regions, which is then injected before the segmentation step and becomes part of the main CNV workflow in downstream steps. We currently recommend WGS data aligned to a supported human reference genome (currently only hg38) with at least 30x coverage. See below for additional requirements.
The following pairs of genes defining Segmental Duplications are included:
CYP2A6 / CYP2A7
FCGR3A / FCGR3B
RHD / RHCE
STRC / STRCP1
ACSM2A / ACSM2B
ACTR3B / ACTR3C
AQP12A / AQP12B
ASAH2 / ASAH2B
CCDC74A / CCDC74B
CD177 / CD177p1
CD8B / CD8B2
CFH1 / CFHR1
CYP4A11 / CYP4A22
DHX40 / DHX40P1
EIF5AL1 / EIF5AP4
FCGR2A / FCGR2C
FFAR3 / GPR42
FOLH1 / FOLH1B
FRMPD2 / FRMPD2B
GPAT2 / GPAT2P1
GSTT2B / GSTT2
DDT / DDTL
HCAR2 / HCAR3
HSPA1A / HSPA1B
KRT81 / KRT86
LGALS7 / LGALS7B
MRPL45 / MRPL45P2
MSTO1 / MSTO2p
MUC20 / MUC20P1
MZT2A / MZT2B
OTOA / OTOAp1
PDPR / PDPR2P
PIEZ02 / ENST00000591853.1
ZP3 / POMZP3
PRAMEF7 / PRAMEF8
PROS1 / PROS2P
RMND5A / ANAPC1P2
ROCK1 / ROCK1p1
SERPINB3 / SERPINB4
SYT3 / ZNF473CR
TBC1D26 / TBC1D28
TOP3B / TOP3BP1
TUBA3D / TUBA3E
ZNF443 / ZNF799
This extension is enabled by default in the germline CNV workflow (the ASCN workflow is currently unsupported). However, it requires:
Normalization set to self-normalization (--cnv-enable-self-normalization=true).
GC bias correction enabled (--cnv-enable-gcbias-correction=true).
Counts method set to start (--cnv-counts-method=start).
Interval width not greater than 10 kb. However, we recommend using the cnv-interval-width default (1 kb) for best performance.
A supported reference genome build as input (currently only hg38).
If necessary, the extension can be disabled by setting --cnv-enable-segdups-extension to false.
For each duplicated region, the extension collects all reads that fall on the two homologous intervals of the pair and computes the normalized joint coverage (output to *.cnv.segdups.joint_coverage.tsv.gz).
Using the sites that differ between the two homologous intervals, the extension computes the proportion of coverage to assign to the first and to the second interval (output to *.cnv.segdups.site_ratios.tsv.gz).
This proportion is used to redistribute the joint normalized coverage between the two homologous intervals.
The rescued intervals are output to the *.cnv.segdups.rescued_intervals.tsv.gz file for inspection and are automatically injected before the segmentation step.
During integration with the original intervals from the CNV caller, the rescued intervals are given higher priority and replace all original intervals that they overlap.
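The redistribution step can be pictured with a short Python sketch. The details here are assumptions for illustration: each differentiating site is taken to yield the fraction of reads supporting the first interval, the sites are summarized with a median, and the function name is hypothetical. The actual DRAGEN computation may differ.

```python
from statistics import median


def redistribute_joint_coverage(joint_coverage, site_ratios):
    """Split joint normalized coverage between two homologous intervals.

    `site_ratios` holds, per differentiating site, the assumed fraction
    of reads supporting the first interval.
    """
    proportion_first = median(site_ratios)
    first = joint_coverage * proportion_first
    second = joint_coverage * (1 - proportion_first)
    return first, second


# A balanced pair versus a pair skewed toward the first interval.
balanced = redistribute_joint_coverage(4.0, [0.5, 0.5, 0.5])
skewed = redistribute_joint_coverage(4.0, [0.7, 0.75, 0.8])
```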
The ASCN callers require a source of heterozygous SNP sites to measure B-allele counts of the input sample. The following are the available modes, of which some are only available in somatic workflows.
cnv-population-b-allele-vcf
Specify a population SNP VCF. This option is available for both the germline and the somatic workflows. In somatic, it can be used when a matched normal sample is not available and analysis must be performed in tumor-only mode.
cnv-normal-b-allele-vcf
(Somatic-specific) Specify a matched normal SNV VCF. Use when a matched normal sample and the matched normal SNV VCF are available. To use this option, you must run the matched normal sample through the DRAGEN Germline workflow.
cnv-use-somatic-vc-baf
(Somatic-specific) Set to true to enable DRAGEN to identify germline variants during a tumor/matched-normal run, rather than requiring a separate run on the normal sample. Use only when both tumor and matched normal inputs are available. The Somatic SNV Caller must also be enabled via enable-variant-caller to use this option.
To specify a population SNP VCF, use the --cnv-population-b-allele-vcf option. To obtain a population SNP VCF, process an appropriate catalog of population variation, such as dbSNP, the 1000 Genomes Project, or other large cohort discovery efforts. A suitable example file for this parameter is "1000G_phase1.snps.high_confidence.vcf.gz" from the GATK resource bundle. Only high-frequency SNPs should be included. For example, include SNPs with minor allele population frequency ≥ 10% to limit run time impact and reduce artifacts. Specify the ALT allele frequency by adding AF=<alt frequency> to the INFO section of each record. Additional INFO fields might be present, but DRAGEN only parses and uses the AF field. Sites specified with --cnv-population-b-allele-vcf can be either heterozygous or homozygous in the germline genome from which the tumor genome derives.
The following is an example of a valid population SNP record (note: it needs to be tab-delimited):
DRAGEN applies the following requirements when parsing records from the b-allele VCF:
Only simple SNV sites are used.
Records must be marked PASS in the FILTER field.
If there are records with the same CHROM and POS values in the VCF, then DRAGEN uses the first record that occurs.
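The parsing requirements above can be approximated with a short Python sketch, which may help when preparing a population VCF. It is not DRAGEN's parser. The records are hypothetical, the function name is invented, and the order in which the rules are applied (filters first, then first-occurrence deduplication) is an assumption.

```python
def parse_b_allele_records(lines):
    """Keep PASS, simple-SNV records with an AF value, first per site."""
    seen = set()
    sites = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, flt, info = fields[:8]
        if flt != "PASS":
            continue
        # Simple SNV: single-base REF and single-base ALT.
        if len(ref) != 1 or len(alt) != 1:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        key = (chrom, int(pos))
        if key in seen:
            continue
        info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        if "AF" not in info_map:
            continue
        seen.add(key)
        sites.append((chrom, int(pos), ref, alt, float(info_map["AF"])))
    return sites


# Hypothetical tab-delimited records, for illustration only.
example_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t.\tPASS\tAF=0.25",
    "chr1\t1000\t.\tA\tT\t.\tPASS\tAF=0.40",      # same site, ignored
    "chr1\t2000\t.\tAT\tA\t.\tPASS\tAF=0.30",     # indel, ignored
    "chr1\t3000\t.\tC\tT\t.\tLowQual\tAF=0.50",   # not PASS, ignored
]
b_allele_sites = parse_b_allele_records(example_lines)
```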
To specify a matched normal sample SNV VCF, use the --cnv-normal-b-allele-vcf option. The VCF file should come from processing the matched normal sample through the DRAGEN germline small variant caller with filters applied. Typically, this file name has a *.hard-filtered.vcf.gz extension. All records marked as PASS that are determined to be heterozygous in the normal sample are used to measure the b-allele counts of the tumor sample. You can also use the equivalent gVCF file (*.hard-filtered.gvcf.gz), but the processing time is significantly longer due to the number of records, most of which are not heterozygous sites.
If a tumor sample and matched normal input are available, you can avoid having to separately process the matched normal with the DRAGEN germline pipeline by specifying --cnv-use-somatic-vc-baf true. You must also enable the Somatic SNV Caller. If using this option, DRAGEN determines the germline heterozygous sites from the matched normal input and measures the b-allele counts of the tumor sample. The information is passed to the Somatic WGS CNV Caller to simplify the overall somatic workflow.
To enable --cnv-use-somatic-vc-baf, enter the following command line options.
--tumor-bam-input <TUMOR_BAM> --- Specify the tumor input
--bam-input <NORMAL_BAM> --- Specify the matched normal input
--enable-variant-caller true --- Enable the somatic SNV variant caller
--cnv-use-somatic-vc-baf true --- Enable somatic VC BAF
GC bias describes the relationship between GC content and read coverage across a genome. It can arise from library prep, capture kits, sequencing system differences, and mapping, and it can make CNV events difficult to call. The DRAGEN GC bias correction module attempts to correct these biases.
Typical whole-exome capture kits have over 200,000 targets spanning the regions of interest. If your BED file has fewer than 200,000 targets, or if the target regions are localized to a specific region in the genome (such that GC bias statistics might be skewed), then GC bias correction should be disabled.
The following options control the GC bias correction module.
--cnv-enable-gcbias-correction --- Enable or disable GC bias correction when generating target counts. The default is true.
--cnv-enable-gcbias-smoothing --- Enable or disable smoothing the GC bias correction across adjacent GC bins with an exponential kernel. The default is true.
--cnv-num-gc-bins --- Specifies the number of bins for GC bias correction. Each bin represents the GC content percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.
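To make the role of the GC bins concrete, the following Python sketch assigns each interval to a GC bin and rescales its count by the ratio of the global median to the bin median. This is a generic binned-median correction shown only for illustration. It is not DRAGEN's algorithm, it omits the smoothing step, and all names are hypothetical.

```python
from statistics import median


def gc_bin(gc_fraction, num_bins=25):
    """Map a GC fraction in [0, 1] to a bin index in [0, num_bins - 1]."""
    return min(int(gc_fraction * num_bins), num_bins - 1)


def gc_correct(counts, gc_fractions, num_bins=25):
    """Rescale counts so every GC bin has the same median coverage."""
    bins = [gc_bin(g, num_bins) for g in gc_fractions]
    global_median = median(counts)
    per_bin = {}
    for count, b in zip(counts, bins):
        per_bin.setdefault(b, []).append(count)
    bin_median = {b: median(values) for b, values in per_bin.items()}
    return [
        count * global_median / bin_median[b] if bin_median[b] > 0 else count
        for count, b in zip(counts, bins)
    ]


# GC-rich intervals show inflated counts before correction.
raw_counts = [100, 100, 200, 200]
gc_content = [0.30, 0.31, 0.62, 0.63]
corrected = gc_correct(raw_counts, gc_content)
```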
The DRAGEN CNV pipeline supports two normalization algorithms:
Self-Normalization --- Estimates the autosomal diploid level from the sample under analysis to determine the baseline level to normalize by. Sex chromosomes and PAR regions are handled based on the sample sex.
Panel of Normals --- A reference-based normalization algorithm that uses additional matched normal samples to determine a baseline level from which to call CNV events. Here, a matched normal sample is one that has undergone the same library prep and sequencing workflow as the case sample.
Which algorithm to use depends on the available data and the application. Use the following guidelines to select the mode of normalization.
Self-Normalization
Whole genome sequencing
Single sample analysis
Additional matched samples are not readily available
Simpler workflow via a single invocation
Only references with chr1, chr2, chr3, ..., chrX, chrY or 1, 2, 3, ..., X, Y naming conventions are supported.
Panel of Normals
Whole genome sequencing (excluding Germline WGS ASCN)
Whole exome sequencing
Targeted panels, including somatic panels
Additional matched samples are available
Nonhuman samples
The table below shows supported normalization methods for CNV workflow:
Data type    Germline non ASCN    Germline ASCN    Somatic non ASCN    Somatic ASCN T/N    Somatic ASCN T/O
WGS          Self/PoN             Self             Not available       Self/PoN            Self PoN
WES          PoN                  Not available    PoN                 PoN                 PoN
Not available indicates the workflow is not supported.
The DRAGEN CNV pipeline provides a self-normalization mode that does not require a reference sample or a panel of normals. To enable this mode, set --cnv-enable-self-normalization to true. Self-normalization mode bypasses the need to run two stages and can save time. It uses the statistics within the case sample to determine the baseline from which to make a call.
Because self normalization uses the statistics within the case sample, this mode is not recommended for WES or targeted sequencing analysis due to the potential for insufficient data.
The self-normalization mode is the recommended approach for whole-genome sequencing single sample processing. The pipeline continues through to the segmentation and calling stage to produce the final called events.
If you are running from a FASTQ sample, then the default mode of operation is self-normalization.
When operating in self-normalization mode, the --cnv-interval-width option used during the target counts stage becomes the effective interval width based on the number of unique k-mer positions. You typically do not have to modify this option.
Self-normalization autogenerates the target intervals to use during the analysis based on the reference genome and is only compatible with standard human references or similar mammalian references (chr1, chr2, chr3, ..., chrX, chrY).
If you want to attempt self-normalization on non-standard human references, you can set an override with --cnv-bypass-contig-check=true. Under this setting, the CNV caller performs a naive median normalization across all of the contigs within the reference genome. This feature is purely experimental and for research use only; no claims or validation are made for its use.
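The idea of a median baseline can be sketched in Python. This is a deliberately naive illustration: it is not DRAGEN's self-normalization algorithm, the autosome test is a simple name check, and the function name and the all_contigs switch are hypothetical stand-ins for normalizing over autosomes versus over every contig.

```python
from statistics import median


def median_normalize(counts_by_contig, all_contigs=False):
    """Divide interval counts by a median baseline.

    By default the baseline uses numbered (autosomal) contigs only;
    with all_contigs=True every contig contributes.
    """
    def is_autosome(name):
        stem = name[3:] if name.startswith("chr") else name
        return stem.isdigit()

    pool = [
        count
        for contig, counts in counts_by_contig.items()
        if all_contigs or is_autosome(contig)
        for count in counts
    ]
    baseline = median(pool)
    normalized = {
        contig: [count / baseline for count in counts]
        for contig, counts in counts_by_contig.items()
    }
    return baseline, normalized


toy_counts = {
    "chr1": [100, 110, 90],
    "chr2": [100, 50, 100],   # the 50 mimics a one-copy loss
    "chrX": [50, 50],         # single X in a male sample
}
auto_baseline, auto_norm = median_normalize(toy_counts)
naive_baseline, _ = median_normalize(toy_counts, all_contigs=True)
```

In this toy example the sex chromosome pulls the all-contig baseline below the autosomal one, which is one reason a naive median over every contig is less reliable.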
The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. Proper sample selection and preparation are critical for constructing an accurate and reliable CNV PON. High-quality germline samples—meeting stringent sequencing quality criteria such as a high percentage of bases over Q30, sufficient total read depth (yield), appropriate GC content, and minimal adapter contamination—must be used. Additionally, all samples should originate from the same sample type (e.g., FFPE, fresh-frozen) and be processed under identical experimental conditions, including the same library preparation kit, sequencing platform, and capture panel version. Even minor variations in hybridization efficiency or read depth distribution can introduce systematic artifacts, leading to inaccurate CNV calls.
Below are the key recommendations for preparing a high-quality PON:
Sample Selection: Normal samples should be sourced from individuals without known chromosomal abnormalities to establish a clean and representative reference baseline. Additionally, normal samples should not be drawn from a cohort that is likely to be enriched for particular CNVs, or enriched for individuals affected by a particular disease or syndrome with a genetic component. Normal samples should ideally be unrelated to each other and to the case samples to be processed.
Balanced sample sex: The normal sample set should include both male and female samples in similar numbers to ensure a well-represented reference baseline.
Exclude Low-Quality Samples: Samples with unusually uneven target coverage, low sequencing depth, or high technical noise should be removed to minimize variability and ensure consistency in the PON.
Standardized Library Preparation: All samples must be processed using the same library preparation protocol. Any deviations such as differences in hybridization efficiency, incubation time, or temperature can lead to inconsistent coverage patterns, increasing the likelihood of false positive CNV calls.
Adequate Number of Reference Samples: A sufficient number (a minimum of 50 samples is recommended, though not mandatory) of high-quality reference samples is essential for reliable coverage estimation and robust CNV detection.
By following these guidelines, the PON can effectively minimize technical biases, improving the accuracy and reliability of CNV detection.
In PON mode, the DRAGEN CNV Pipeline is broken down into two distinct stages. The target counts stage is performed on each sample (case and normals), to bin the alignments. The normalization and call detection stage is then performed with the case sample against the panel of normals to determine the events.
Target counts should be generated for all normal samples used in the panel of normals. The case samples and all panel of normals samples must have identical intervals and therefore should be generated with identical settings, including reference version, target BED, counting method, duplicate marking/filtering, and filtering method/cutoff. The target counts stage also performs GC bias correction, if enabled. GC bias correction is enabled by default but can be disabled if desired.
The following examples are for WES processing, where a panel of normals is required.
The following is an example command for processing a BAM file.
The following is an example command for processing a CRAM file.
The following example is for WGS processing, where a panel of normals is optional.
When running an analysis with a panel of normals (a set of target counts files), a column-wise concatenated version of the panel is output as a *.combined.counts.txt.gz file. To generate this file without running the actual calling step, add the --cnv-generate-combined-counts=true option to the command line. The individual panel of normals target counts files must be specified either via --cnv-normals-file (one option per file) or --cnv-normals-list (a single text file with paths to each sample).
The following is an example command line using a normals list:
The next step in the CNV pipeline when using a panel of normals is to perform the normalization and to make the calls. This involves a separate execution of DRAGEN during which the normalization is performed and calls are made. This step requires the specification of a set of target counts files to be used for reference-based median normalization.
Ideally the panel of normals samples follow library prep and sequencing workflows that are identical to the workflows of the case sample under analysis. In order to be applicable to both male and female case samples, the panel of normals should include a balanced set of both male and female samples. DRAGEN automatically handles calling on sex chromosomes based on the predicted sex of each sample in the panel.
The presence of CNVs in the panel can result in artifactual calls in the test sample at locations where at least some of the panel samples have copy number changes. This leads to two considerations regarding construction of a panel.
Firstly, while it is not generally possible to select samples with no CNVs, panel samples should not be clearly aneuploid or contain large-scale somatic CNVs; further, if there is a region of particular interest, samples should be selected to be normal in that region.
Secondly, for optimal bias correction, a minimum of 50 samples is recommended as a panel. DRAGEN can run with smaller numbers of samples in the panel, down to even just a single sample, but smaller panels increase the likelihood of artifactual calls. Larger panels do not entirely prevent such issues, but they limit them to regions where non-reference copy numbers are common.
The following is an example of PON files, which uses a subset of the GC corrected files from the target counts stage.
DRAGEN accepts three different file formats for a panel of normals (PON).
--cnv-normals-file
Individual normal file. This option takes a single file name and can be specified multiple times.
--cnv-normals-list
List of normal files. A plain text file in which each line contains a path pointing to a *.target.counts.gz or *.target.counts.gc-corrected.gz file generated from the target counts stage. Relative paths are supported if the paths are relative to the current working directory. Absolute paths are recommended in case the workflow is used later or shared with other users.
--cnv-combined-counts
PON file that combines all normal files in a single file. The combined counts file can be found in the output folder of a prior DRAGEN run with the same panel of normals (*.combined.counts.txt.gz file). Some prepackaged PON files downloaded directly from the Illumina support site must be supplied with this option.
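A normals list with absolute paths, as recommended for --cnv-normals-list, can be generated with a few lines of Python. This is a convenience sketch: the function name, the directory layout, and the demo file names are hypothetical, and the demo uses empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def write_normals_list(counts_dir, list_path, gc_corrected=True):
    """Write one absolute path per line for matching target counts files."""
    suffix = (
        ".target.counts.gc-corrected.gz" if gc_corrected else ".target.counts.gz"
    )
    files = sorted(
        p.resolve() for p in Path(counts_dir).iterdir() if p.name.endswith(suffix)
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in files))
    return files


# Demo with placeholder files; real files come from the target counts stage.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "n1.target.counts.gz",
        "n1.target.counts.gc-corrected.gz",
        "n2.target.counts.gc-corrected.gz",
    ):
        (Path(tmp) / name).touch()
    listed = write_normals_list(tmp, Path(tmp) / "normals.txt")
    list_lines = (Path(tmp) / "normals.txt").read_text().splitlines()
```

Selecting only one kind of counts file keeps raw and GC-corrected counts from being mixed in the same panel.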
The CNV caller can also be started from the *.target.counts.gz (raw counts) or *.target.counts.gc-corrected.gz (GC corrected counts) files of the case sample, by specifying the selected file with the --cnv-input or --cnv-tumor-input option and the PON options described above. When selecting GC corrected counts, the option --cnv-enable-gcbias-correction should be set to false to disable the GC-correction stage; GC-corrected inputs are not supported for somatic WGS analysis.
For example, the following command normalizes the case sample against the panel of normals.
These options control the preconditioning of the panel of normals and the normalization of the case sample.
--cnv-enable-self-normalization --- Enable/disable self-normalization mode, which does not require a panel of normals.
--cnv-extreme-percentile --- Specifies the extreme median percentile value at which to filter out samples. The default is 2.5.
--cnv-input --- Specifies a target counts file for the case sample under analysis when using a panel of normals, for germline analysis (see --cnv-tumor-input for somatic analysis).
--cnv-normals-file --- Specifies a target.counts.gz file to be used in the panel of normals. You can use this option multiple times, one time for each file.
--cnv-normals-list --- Specifies a text file that contains paths to the list of reference target counts files to be used as a panel of normals. Absolute paths are recommended in case the workflow is used later or shared with other users. Relative paths are supported if the paths are relative to the current working directory.
--cnv-max-percent-zero-samples --- Specifies the percentage of zero coverage samples allowed for a target. If the target exceeds the specified threshold, the target is filtered out. The default value is 5%. The option is sensitive to the number of normal samples being used, so adjust the threshold accordingly. If your panel of normals is small and the threshold is not adjusted, the option could filter out targets unintentionally.
--cnv-max-percent-zero-targets --- Specifies the percentage of zero coverage targets allowed for a sample. If the sample exceeds the specified threshold, the sample is filtered out. The default value is 2.5%. The option is sensitive to the total number of target intervals, so adjust the threshold accordingly. If the capture kit has a small number of probes and the threshold is not adjusted, the option could filter out samples unintentionally.
--cnv-target-factor-threshold --- Specifies the bottom percentile of panel of normals medians to filter out useable targets. The default is 1% for whole genome processing and 5% for targeted sequencing processing.
--cnv-tumor-input --- Specifies a target counts file for the case sample under analysis when using a panel of normals, for somatic analysis (see --cnv-input for germline analysis).
--cnv-truncate-threshold --- Specifies a percentage threshold for truncating extreme outliers. The default is 0.1%.
--cnv-enable-gender-matched-pon --- Enable/disable gender matched PON normalization. If enabled, DRAGEN uses matched gender PON for sex chromosome normalization. Sex chromosome intervals are filtered if PON has no matched gender sample. The default value is true.
--cnv-enable-cross-gender-adjustments-chrX --- Enable normalization on chrX by adjusting coverage of PON samples according to the expected number of copies of chrX in male and female samples. If the case sample is male, coverage of female PON samples is scaled down by a factor of 2 on chrX. If the case sample is female, coverage of male PON samples is scaled up by a factor of 2 on chrX. If no male PON samples are available, chrY intervals will be filtered. This feature is only supported for germline enrichment runs. The default value is false; if set to true, then --cnv-enable-gender-matched-pon must also be true.
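The sensitivity of the two zero-coverage thresholds to panel size can be seen in a small Python sketch. It is an illustration under assumptions, not DRAGEN's preconditioning code: both filters are evaluated independently on the raw count matrix, and the function name is hypothetical.

```python
def filter_panel(counts, max_pct_zero_samples=5.0, max_pct_zero_targets=2.5):
    """Return indices of kept samples and kept targets.

    counts[s][t] is the count for sample s at target t.
    """
    n_samples = len(counts)
    n_targets = len(counts[0])
    kept_samples = [
        s for s in range(n_samples)
        if 100 * sum(c == 0 for c in counts[s]) / n_targets <= max_pct_zero_targets
    ]
    kept_targets = [
        t for t in range(n_targets)
        if 100 * sum(counts[s][t] == 0 for s in range(n_samples)) / n_samples
        <= max_pct_zero_samples
    ]
    return kept_samples, kept_targets


# One sample with a single zero-coverage target, in panels of two sizes.
zero_sample = [10, 10, 0, 10]
full_sample = [10, 10, 10, 10]
small_panel = [zero_sample] + [full_sample] * 9    # 1 of 10 samples = 10%
large_panel = [zero_sample] + [full_sample] * 39   # 1 of 40 samples = 2.5%
small_kept = filter_panel(small_panel)
large_kept = filter_panel(large_panel)
```

With ten samples, a single zero-coverage sample already exceeds the 5% default and removes the target, whereas the same sample in a panel of forty does not.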
You can input an exclude BED to the CNV caller to filter out regions from analysis. An exclude BED is useful if certain regions in the genome are known to be problematic due to library prep, sequencing, or mapping issues. You can also exclude intervals that specify common CNVs to aid in downstream analysis. You can specify an exclude BED file using cnv-exclude-bed. DRAGEN does not provide an exclude BED. The intervals to exclude should be formatted in standard three-column BED format.
The intervals in the exclude BED are compared with the original target counts intervals. If the overlap is greater than cnv-exclude-bed-min-overlap, the target counts interval is excluded from analysis. The *.target.counts.gz file still includes the interval, so you can inspect the original read counts. The normalization stage removes the intervals. The *.tn.tsv.gz file excludes the intervals removed during normalization.
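The overlap test against the exclude BED can be sketched in Python. Several details are assumptions for illustration: the overlap is expressed as a fraction of the target interval length, each exclude interval is tested on its own, and the 0.5 threshold is a placeholder rather than a documented default. The function names are hypothetical.

```python
def overlap_fraction(target, exclude):
    """Fraction of the half-open target interval covered by `exclude`."""
    t_start, t_end = target
    e_start, e_end = exclude
    overlap = max(0, min(t_end, e_end) - max(t_start, e_start))
    return overlap / (t_end - t_start)


def is_excluded(target, exclude_intervals, min_overlap=0.5):
    """True if any exclude interval overlaps more than `min_overlap`."""
    return any(
        overlap_fraction(target, exclude) > min_overlap
        for exclude in exclude_intervals
    )


target_interval = (1000, 2000)
partly_covered = is_excluded(target_interval, [(1600, 3000)])   # 40% overlap
mostly_covered = is_excluded(target_interval, [(1400, 3000)])   # 60% overlap
```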
Some of the excluded intervals can be rescued through the segmental duplication extension to the germline CNV workflow. See below on Section for more details.
See for a description of the extension output files.
A suitable population B-allele VCF is provided for selected references at .
The GC bias correction module immediately follows the target counts stage and operates on the *.target.counts.gz file. GC bias correction generates a GC bias corrected version of the file, which has a *.target.counts.gc-corrected.gz extension in the file name. The GC bias corrected versions are recommended for any downstream processing when working with WGS data. For WES, if there are enough target regions, then the GC bias corrected counts can also be used. See for further details on GC-corrected target counts files.
See for a description of the target counts files.
An excluded interval does not guarantee that a CNV call does not span the interval. If there is sufficient data flanking the region, the segmentation stage along with any merging might still generate a call spanning the excluded interval. However, the call would not take read counts from excluded intervals into account. You can view explanations for excluded intervals in the *.excluded_intervals.bed.gz file. See for further details.