Population Genotyping

Analyzing a cohort of unrelated samples using the iterative gVCF Genotyper

DRAGEN provides a population-based analysis option to jointly analyze samples from unrelated individuals.

The tool for population-based analysis is the iterative gVCF Genotyper. Its input is a set of single or multisample gVCFs. The output is a multisample VCF that contains one entry for any variant seen in any of the input gVCFs. The variants are genotyped across all input samples using information from the hom-ref blocks as necessary. The iterative gVCF Genotyper does not adjust genotypes based on population information but it provides means to filter variant sites based on information leveraged from the population. See Iterative gVCF Genotyper analysis for information on the available command line options.

A population-based analysis requires a gVCF file for each sample. To create a gVCF file, run the germline small variant caller with the --vc-emit-ref-confidence gVCF option. There is also the option to write a germline gVCF with reduced size using the option --vc-compact-gvcf. This results in a significant speed up for a downstream analysis using gVCF Genotyper. Please note that this compact format is not compatible with a pedigree analysis. See additional description in Pedigree analysis section.

To compare multiple pedigrees, you can run gVCF Genotyper on the output of a pedigree analysis and merge multiple joint-called pedigrees into a single multisample VCF. To enable, run the pedigree analysis using the --enable-multi-sample-gvcf=true option to write a multisample gVCF.

This tool requires a new paid license, 'GvcfGenotyper'. For those using DRAGEN OnPrem Servers - once purchased it must be installed with a specific command, please see the OnPrem Licensing Installation Reference Section for more information.

Note: All DRAGEN OnPrem Servers have been assigned a complimentary 1Tb GvcfGenotyper License which expires on November 15th 2025, or 6 months after the date your server was manufactured - whichever is furthest out. If your DRAGEN OnPrem Server is connected to the internet, you can install this license yourself using the section referenced above. If your DRAGEN OnPrem Server is not connected to the internet, please contact us to retrieve this license.

Combined phased variants in the gVCF input

The DRAGEN germline variant caller has an option --vc-combine-phased-variants-distance to combine phased variants in the gVCF output.

The option to combine phased variants is switched off by default, for details please refer to the section on germline small variant calling in this user guide.

Force-genotyped and targeted variants in the gVCF input

If force genotyping was enabled for any input file, any ForceGT calls that are not also called by the variant caller will be ignored.

Similarly, targeted variant calls (option --targeted-merge-vc) in any gVCF file that are not also called by the variant caller will be ignored as well.

Iterative gVCF Genotyper analysis

gVCF Genotyper offers an iterative workflow to aggregate new samples into an existing cohort. The iterative workflow allows users to incrementally aggregate new batches of samples with existing batches, without having to redo the analysis from scratch across all samples, every time when new samples are available. The workflow takes single sample gVCF files as input, and can be performed in a "step-by-step mode" if multiple batches of samples are available, or "end-to-end mode", if only a single batch of samples is available. Multi-sample gVCF files output from the Pedigree Caller (described above) are also accepted as input. gVCF Genotyper can accept input gVCF files generated using DRAGEN version 3.2.6 or later.

Step 1 (gVCF aggregation): the user can use iterative gVCF Genotyper to aggregate a batch of gVCF files into a cohort file and a census file. The cohort file is a condensed data format to store gVCF data in multiple samples, similar to a multi-sample gVCF. The census file stores summary statistics of all the variants and hom-ref blocks among samples in the cohort. As part of this step, adjacent hom-ref blocks with matching FILTER columns are further merged to reduce the disk footprint of the intermediate files, FORMAT field values being base-pair weight averaged in the process.

When a large number of samples are available, the user can divide samples into multiple batches each with similar sample size (e.g. 1000 samples), and repeat Step 1 for every batch.

Step 2 (census aggregation): after all per batch census files are generated, the user can aggregate them into a single global census file. This step scales to aggregate thousands of batches, in a much more efficient way than aggregating gVCFs from all batches. When a new batch of samples becomes available, the user only needs to perform Step 1 on that batch, then aggregate the census file from the new batch with the global census file from all previous batches in order to generate an updated global census file.

Step 3 (msVCF generation): every time a global census file is updated, with new variant sites discovered and/or variant statistics updated at existing variant sites, the user can take a per-batch cohort file, per-batch census file and the global census file as input, and generate a multi-sample VCF for one batch of samples. The output multi-sample VCF contains the variants and alleles discovered in all samples from all batches, and also includes global statistics such as allele frequencies, the number of samples with or without genotypes, and the number of samples without coverage. Similar statistics among samples in the batch are also included. This step can be repeated for every batch of samples, and the number of records in each output multi-sample VCF is the same across all batches.

To facilitate parallel processing on distributed compute nodes, for every step above, the user can choose to split the genome into shards of equal size, and process each shard using one instance of iterative gVCF Genotyper on each compute node. See option --shard below.

There is a special treatment of alternative or unaligned contigs when the --shard option is enabled: all contigs that are not autosomes, X, Y or chrM are included in the last shard. No other contigs will be assigned to the last shard. The mitochondrial contig will always be on its own in the second to last shard.

If a combined msVCF of all batches is required, an additional step can be separately run to merge all of the batch msVCF files into a single msVCF containing all samples.

Commandline arguments common to all steps:

--enable-gvcf-genotyper-iterative: set to true to run the iterative gVCF Genotyper (always required).
--ht-reference: The file containing the reference sequence in FASTA format (always required).
--output-directory: The output directory (always required).
--output-file-prefix: The prefix used to label all output files (optional, default value dragen).
--shard: Use this option to process only a portion ('shard') of the genome, when distributing the work across multiple compute nodes in a production workflow. Provide the index (1-based) of the shard to process and the total number of shards, in the format of n/N (e.g. 1/50 means shard 1 of total 50 shards). To facilitate concurrent processing within each job, the shard will by default be split into 10x the number of available threads. This option assumes a Human reference genome and might not work for non-Human reference genomes.
--gg-regions: Use this option to test iterative gVCF Genotyper only for a subset of regions in the genome. The value is a list of regions (chr:start-end) delimited by comma. Contig names must match those in the reference and no region may overlap another. If a single region larger than 1Mb is selected, multiple threads are enabled. Otherwise, one thread is launched per region. This assumes that the --shard option is not given. It is important that the same regions are chosen for each step 1,2 and 3.
--gg-regions-bed: If a path to a BED file is provided as value, this option, like the one above, will limit the iterative gVCF Genotyper processing to the genome regions specified therein, which must be non-overlapping. This option is intended for exome input data. It results in faster processing times and is compatible with sharding. This option will only take effect in step 1 or end-to-end mode. It differs from the option above in that, if the number of regions exceeds 10 times the number of available threads, they will not necessarily be processed by independent threads.
--gg-allow-dupe-samples: if true, any duplicate sample names found in the input will be automatically renamed by appending ":{N+1}". Otherwise, the execution of the analysis will be aborted. Default is false.
--gg-sample-rename-mapfile [Optional] Path to a file providing a tab-separated mapping of sample names that should be renamed on the merged output. Format: Original Target[ Filename] OR /regex pattern/ substitution. If a filename is provided, both the sample name and filename need to match for the rename to be applicable. Target sample names must be unique, unless --gg-allow-dupe-samples is true, in which case an automatic suffix will be applied on top of any duplicates. For regex substitutions, any capturing groups have to be referred to using a $1, $2, etc. syntax.
--gg-discard-ac-zero If set to true, the gVCF Genotyper does not print variant alleles that are not called (hom-ref genotype) in any sample. The default value is true.
--gg-remove-nonref If set to true, the <NON_REF> symbolic allele is removed in the process of reading in input gVCF files. The default value is true.
--gg-vc-filter Discard input variants that failed filters in the upstream caller. The default is false. Affected records will have their genotype set to hom-ref and the filter string "ggf" added to FORMAT/FT. The exception are calls with TargetedConflict filter status which are not affected by this setting.
--gg-info-to-import Optional list of gVCF INFO fields to import. See gVCF metric import customization below.
--gg-format-to-import Set to override or extend the default set of genotype fields imported from the input gVCF files. See gVCF metric import customization below.
--gg-hard-filter Specifies a filtering expression to be applied to the output msVCF records. See msVCF hard filtering below. The default is to apply no filters.
--gg-skip-filtered-sites Omits msVCF records that fail the given hard filter. The default is false.
--gg-msvcf-format-fields Can be used to override the default set of sample genotype fields in the output msVCF. See msVCF metric customization below.
--gg-msvcf-info-fields Can be used to override the default set of site-wise INFO fields in the output msVCF. See msVCF metric customization below.
--gg-squeeze-msvcf Set to omit genotype fields other than GT from the output msVCF for confidently called hom-ref sample records.
--gg-gq-squeezing-threshold Use in conjunction with the previous option to adjust the threshold on GQ (default 30) that signifies a confident hom-ref call.
--gg-enable-indexing: set to true (the default) to generate a tabix index for an output msVCF (single batch or merged).
--gg-output-type Set to spvcf to write the output in spVCF format rather then the default msVCF. See File size optimizations below for details.
--gg-diploidify In the output msVCF file, convert haploid calls to diploid. The diploidified genotype is homozygous in the haploid call e.g. 1 becomes 1/1. The LPL field is also diploidified for these samples. Site metrics, such as allele counts, are calculated before diploidification. Diploidifying genotypes may ease the ingestion of msVCF files into downstream analysis tool, such as Hail and Plink. When this option is enabled, it is possible to include the DF FORMAT field (included by default) that signifies whether or not a genotype has been diploidified, see msVCF metric customization below.

Commandline arguments for Step1 (step-by-step mode)

--gvcfs-to-cohort-census: set to true to aggregate gVCF files from one batch of samples into a cohort file and a census file.
--variant-list: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant: if --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.

Commandline arguments for Step2 (step-by-step mode)

--aggregate-censuses: set to true to aggregate a list of per batch census files into a global census file.
--input-census-list: the path to a file containing a list of input per batch census files (from Step1), with the absolute path to each file on a separate line.

Commandline arguments for Step3 (step-by-step mode)

--generate-msvcf: set to true to generate a multi-sample VCF for one batch of samples.
--input-cohort-file: the path to the per batch cohort file (from Step1).
--input-census-file: the path to the per batch census file (from Step1).
--input-global-census-file: the path to the global census file (from Step2).

Commandline arguments for running Step1 + Step2 + Step3 (end-to-end mode for a single batch)

--gvcfs-to-msvcf : set to true to enable the end-to-end mode. This is the default is none of the steps 1,2 or 3 above is selected.
--variant-list: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant: if --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.

Commandline arguments for merging per-batch msVCF files

--merge-batches: set to true to merge msVCF files for a set of batches.
--input-batch-list: the path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file, with the same set of options, and by default all batches pertaining to that global census must be included in the merge.
--gg-merge-subset: set to override the restriction that all batches must be included in the merge.

Enabling use of mimalloc library for enhanced performance

Mimalloc is a custom memory allocation library that can yield a significant speed-up in the iterative gVCF Genotyper workflow. In some deployments, e.g. cloud, it is automatically and seamlessly used, but in other contexts it requires special user intervention to be activated, as at present it cannot be included in standard DRAGEN by default.

For this purpose, the convenience script mi_dragen.sh is provided, which loads the bundled library and can be transparently used in the same way as the DRAGEN executable. Please note that its use is only intended and supported for use with the iterative gVCF Genotyper component, although it can in principle be applied for any other DRAGEN workflow too. Its use for other purposes is known to possibly lead to undesirable memory overuse and thus should be undertaken at the user's own risk.

The recommended version of mimalloc for gVCF Genotyper is v2.0.7.

gVCF metric import customization

The set of INFO and FORMAT metrics imported into gVCF Genotyper from the input gVCF files can be customized via the --gg-info-to-import and --gg-format-to-import options respectively. This allows metrics beyond the default set to be ingested and subsequently output in the final msVCF. The cost of this is in an increased size of the intermediate cohort file, which can be significant. Conversely, reducing the metrics imported to the minimum desired set could lead to space savings on intermediate file storage.

The sample genotype (GT) is unconditionally imported, as are the values in the QUAL and FILTER columns (FORMAT/FT for msVCF input). The values of the AD, GQ, PL and, on chrM only, AF gVCF FORMAT fields are imported by default. These defaults can be overwritten in the argument to --gg-format-to-import, as well as the following additional fields: DP, F1R2, F2R1, GP, MB, PRI, SB, SQ (chrM only), e.g. --gg-format-to-import=AD,GQ,PL,DP,GP,SB.

No gVCF INFO fields are ingested by default; the following are available for optional import via a comma-separated list to --gg-info-to-import: MQ, MQRankSum, ReadPosRankSum, FractionInformativeReads, LOD, R2_5P_bias, MOSAIC. Imported values are then available for output as per-sample FORMAT values when generating the final msVCF.

For details on how to write any of these metrics to the output msVCF file, see msVCF metric customization below. It should be noted that any metric not selected for import will show as missing if requested in the output msVCF and in particular that no Allelic Balance calculations will be available if AD has not been imported at the outset.

The multi-sample VCF output of the iterative gVCF Genotyper

The output of gVCF Genotyper is a multi-sample VCF (msVCF) that contains metrics computed for all samples in the cohort.

The msVCF can become a very large file with increasing cohort size. In some cases, the file might need more storage than can be allocated by VCF parsers. This is caused by VCF entries such as FORMAT/PL which store a value for each combination of alleles. We therefore decided to replace FORMAT/PL with a tag FORMAT/LPL which stores a value only for the alleles that actually occur in the sample. Similarly, the msVCF also contains FORMAT/LAD which stores the allelic depth only for the alleles occurring in the sample.

We also added a new FORMAT/LAA field which lists 1-based indices of the alternate alleles that occur in the current sample. The allele order of other local fields is the same as that of LAA.

This approach is also referred to as local alleles and is also used by open source software such as bcftools and Hail.

Iterative gVCF Genotyper with Mitochondrial Variant Calls

When processing mitochondrial variant calls, which may contain separate records for each allele, iterative gVCF Genotyper processing differs in the following ways:

Only the record with the highest FORMAT/AF sum is kept.
The FORMAT/AF field will be additionally collected, and used to generate the FORMAT/LAF field in the output msVCF

QUAL column in msVCF

The value displayed in the QUAL column of the msVCF is the maximum of the input QUAL values for the site across the global cohort. The QUAL value will be missing if any of the batch census files used to create the global census were generated with a version of DRAGEN earlier than v4.2.

Measures of Hardy-Weinberg Equilibrium in the msVCF output

The Hardy-Weinberg Equilibrium (HWE) states that, given certain conditions, genotype and allele frequencies should remain constant between generations. Deviations from HWE can results from violations of the underlying HWE assumptions in the population, non-random sampling or may be artifacts of variant calling. Adherance to HWE can be assessed by comparing the observed frequencies of genotypes to those expected under HWE given the observed allele counts.

Iterative gVCF Genotyper offers several metrics for assessing adherence to HWE. It calculates both allele-wise and site-wise HWE P-values¹, an allele-wise excess heterozygosity (ExcHet) P-value¹ and the site-wise inbreeding coefficient (IC). These metrics are calculated only for diploid sites and missing values are excluded from the calculations. These values are included as fields in the INFO column of the output msVCF file. Both batch-wise and global values are included, where the field names for the global values are prefixed with G.

Metric

Description

Scope

Number of values

HWE

Hardy-Weinberg Equilibrium P-value

Allele-wise

One for each alt allele

ExcHet

Excess Heterozygosity P-value

Allele-wise

One for each alt allele

HWEc2

Hardy-Weinberg Equilibrium P-value

Site-wise

Inbreeding Coefficient

Site-wise

Care should be taken when interpreting these metrics for small cohorts and/or low frequency alleles, as small changes in inputs can lead to large changes in their values. Further, violations of the underlying HWE assumptions (such as inbreeding), and non-random sampling (such as the presence of consanguineous samples), can adversely affect results, making identification of poorly called variants more difficult.

Where it is not possible to calculate the metric, they are represented as missing (i.e., ".") in the msVCF file. This can vary between the metrics, but may occur if non-diploid genotypes are encountered, if there is only one allele present at a site, or if no samples are genotyped at a site.

Allele-wise vs site-wise Hardy-Weinberg metrics

Iterative gVCF Genotyper offers both allele-wise and site-wise HWE P-values. The allele-wise P-values are based on the exact-conditional method Am J Hum Genet. 2005 May; 76(5): 887–893 the site-wise P-values are based on Pearson's chi-squared method. For bi-allelic sites, although both are measuring the same property, their values may differ. The differences between the methods are explored in Am J Hum Genet. 2005 May; 76(5): 887–893. Care should be taken when deciding which to use.

Allele-Wise Hardy-Weinberg Equilibrium and Excess Heterozygosity.

Iterative gVCF Genotyper calculates allele-wise HWE and the ExcHet P-values. The values are calculated using the exact-conditional method described in Am J Hum Genet. 2005 May; 76(5): 887–893. The implementation does not use a mid P-value correction.

For HWE a P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it. For ExcHet a P-value of ≈ 0.5 suggests that the number of heterozygotes is close to the number expected under HWE, while a value ≈ 1 suggests that there are more heterozygotes than expected and a value ≈ 0 suggests that there are fewer heterozygotes than expected.

For a bi-allelic site the HWE P-values is based on the numbers of homozygotes and heterozygotes comparing the observed to expected. For a multi-allelic site, P-values are calculated per ALT allele as if it were bi-allelic. Genotypes composed of only the ALT allele being considered are counted as alternative homozygous, any other genotype containing a copy of the ALT allele being considered are counted as a heterozygous, and any genotype with no copies of the ALT allele being considered are counted as reference homozygous (this may include genotypes containing other ALT alleles).

Site-Wise Hardy-Weinberg Equilibrium.

Iterative gVCF Genotyper calculates a site-wise HWE P-value. The value is calculated using the Pearson's chi-squared method, comparing the genotype counts expected under HWE to those observed. The chi-squared test statistic is calculated as

𝜒² = ∑_gt (E_gt - O_gt)² / E_gt

where the summation is over gt is over all genotypes possible at the site given the alleles present, and E_gt and O_gt are the expected and observed counts for genotype gt, respectively. From the chi-squared test statistic the P-value is then calculated from a chi-squared distribution where the number of degrees of freedom is the number of possible genotypes minus the number of alleles, which is

dof = n(n-1)/2

where n is the number of alleles.

The batch-wise value uses only the alleles present in the batch. Alleles with AC=0 are not included in the calculation.

A P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it.

The Inbreeding Coefficient

Iterative gVCF Genotyper calculates the inbreeding coefficient (IC) (sometimes called the Fixation index and denoted by F). It is defined as the proportion of the population that is inbred. The value of IC can be estimated by looking at the observed number of heterozygotes in comparison to the number expected under HWE:

IC = 1 - O(het) / E(het),

where O(het) and E(het) are the observed and expected number of heterozygotes in the cohort, respectively. Although initially conceived for studying inbreeding and defined as a non-negative value, it is also commonly used to look for deviations from HWE and can take values in the range [-1, 1].

Values of IC ≈ 0 suggest that the cohort is in HWE. Negative values suggest an excess of heterozygosity and a deviation from HWE, which can be symptomatic of poor variant calling. Positive values suggest a deficit of heterozygotes and the possible presence of inbreeding.

Using the above definition, IC should be a property of the population, and so would be expected to be drawn from the same distribution for all sites and for all variants at a site. Deviations from this distribution can suggest issues in calling a site correctly. Violations of HWE assumptions and/or non-random sampling may adversely affect the distribution of IC, causing it to be shifted. However, outliers can still be identified, although thresholds may need to be adjusted accordingly.

Allelic Balance

Allelic balance (AB) describes the proportion of reads that support each allele within a called genotype and can be calculated from the allelic depth (FORMAT/AD or FORMAT/LAD).

For homozygotes this is taken as

AB = AD_i / ∑_j AD_j

where i is the index of the called allele and j runs over all alleles.

For heterozygotes this is taken as

AB_i = AD_i / (AD_a + AD_b)

where a and b are the indices of the called alleles and i can have values a or b.

For homozygous genotypes AB is expected to be ≈ 1 and for heterozygous genotypes it is expected to be ≈ 0.5 for each allele. Deviations from the expected values can be indicative of an error.

DRAGEN's iterative gVCF Genotyper calculates site-wise AB values for each allele based on the read depths among all samples. Only diploid genotypes are included in the calculations. Values are calculated separately among homozygous (ABHom) and heterozygous (ABHet) genotypes. ABHet is calculated using the counts among all heterozygous calls that contains the allele under consideration. P-values¹ for ABHet are also calculated (ABHetP) based on a binomial test with an expected probability of 0.5. A P-value of ≈ 1 signifies that results are in line with expectation while ≈ 0 signifies a deviation from expectation. Values are written to the INFO fields ABHom, ABHet and ABHetP, with one value for each allele (including the reference allele). Values should be in the range [0, 1]. Missing values are coded by -1, for example where there are no homozygous calls for an allele. If AD is not present in any input gVCF file, or if AD is not selected for import, the values are not calculated and the fields will be omitted from the output msVCF file.

It is also possible to filter based on the maximum ABHetP value, see msVCF hard filtering.

msVCF hard filtering

Sites in the output msVCF can be filtered on the following global metrics:

QUAL
Number of samples with called genotypes (GNS_GT)
Inbreeding coefficient (GIC)
𝜒² Hardy-Weinberg Equilibrium P-value (GHWEc2)
The maximum P-value for heterozygous allelic balance (GABHetP)

The syntax of a filtering expression is the same as that used by the small variant caller (see Germline Small Variant Hard Filtering). Filters are always applied to the globally-computed metrics, not the values for the current batch. Records failing filter will have the specified filter ID(s) written to the FILTER column of the msVCF, or will be omitted entirely if the --gg-skip-filtered-sites option is specified. Since filtering is on a per-site basis, filters cannot be applied separately to SNPs or indels as they can in the variant caller.

msVCF metric customization

The per-sample genotype metrics in the output msVCF can be customized by providing a colon-separated list of metrics, analogous to that of the VCF FORMAT column, to the --gg-msvcf-format-fields option, e.g. --gg-msvcf-format-fields=GT:LAD:LPL:LAA:QL. The following metrics form the default set: GT, GQ, LAD, FT (original gVCF FILTER), LPL, LAA, LAF and DF (N.B. LAF will only appear on the MT contig and DF will only appear if the --gg-diploidify option is enabled). The following additional localized metrics are available (if imported): LF1R2, LF2R1, LGP, LPRI, LSQ (MT contig only), LGT and LA. Also potentially available for output are AD (non-localized), QL (original gVCF QUAL), DP, MB, SB, MQ, MQR (gVCF MQRankSum), RPRS (gVCF ReadPosRankSum), FIR (gVCF FractionInformativeReads), LOD, R25PB (gVCF R2_5P_bias), MSC (gVCF MOSAIC flag). Sample genotype (GT) is always present and always shown first, regardless of whether it is included in the option string or not. Alternatively, an msVCF containing only site statistics and no per-sample genotype fields can be generated using the option --gg-msvcf-format-fields=None.

The per-site INFO metrics in the output msVCF can be customized by providing a semicolon-separated list of metrics, analogous to that of the VCF INFO column, to the --gg-msvcf-info-fields option, e.g. --gg-msvcf-info-fields=AC;AN;NS;NS_GT;NS_NOGT;NS_NODATA;AF. Supported metrics are AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet, HWEc2, AF, ABHom, ABHet, ABHetP. The default set of metrics is AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet and HWEc2. All INFO fields can be included using the option --gg-msvcf-info-fields=All. All INFO fields can be dropped using the option --gg-msvcf-info-fields=None, in which case the INFO column will contain the missing symbol, .. For each specified metric, the value for the current batch and the global value are written. For global values, the metric names are prepended by G. INFO fields that have a missing value, ., at a site are omitted from the msVCF for that site, so sites may contain different sets of fields.

File size optimizations

For sizable cohorts, the file outputs from gVCF Genotyper can become extremely large. However, there are a number of options within the component which can mitigate this. As well as reduced footprint on disk, these options can lead to faster runtimes owing to the diminished I/O demands.

The following options have applicability to this:

The small variant caller's --vc-compact-gvcf, described previously. This doesn't reduce output file sizes, but the smaller input gVCFs reduce gVCF Genotyper runtime and could reduce data storage costs.
The removal of the NON_REF symbolic allele when ingesting the input gVCF files, which is the default behaviour. Doing this reduces the size not only of the final msVCF output, but also the intermediate cohort and census files.
Several options exist that reduce the volume of data written to the final msVCF file:
- Outputting local allele values, as described above.
- Use of the msVCF metric customization options to output only those metrics required for the downstream analysis.
- Omitting records that fail filters (--gg-skip-filtered-sites option).
- Dropping trailing genotype fields for hom-ref records (--gg-squeeze-msvcf option). This behaviour is explicitly permitted by the VCF specification.

The option that can have the biggest impact on the final output file size is that to write it directly in spVCF format. This is a lossless encoding and the space saving can be dramatic: file size reductions of multiple tens of times have been observed for large cohorts with sparsely distributed variants. Files output as spVCF at step 3 (--generate-msvcf) can be directly merged via the --merge-batches subcommand to produce a single spVCF file. spVCF-encoded files are likely to require decoding back to full msVCF for use with downstream tools, and a binary for this is available for download. The decoding will take time, but this is offset by the reduced time required within gVCF Genotyper to initially write the smaller spVCF files. Users are recommended to, if possible, directly pipe the decoded data into the downstream tool rather than first writing the full msVCF file to disk.

¹ P-values in DRAGEN's iterative gVCF Genotyper are calculated as 32-bit floating-point variables so are subject to their limitations. As P-values can be very small, they sometimes fall below the numerical minimum of 32-bit floating-point variables where they may not be correctly represented.

PreviousVNTR Calling NextFilter Duplicate Variants

Last updated 1 month ago

Was this helpful?