VCF Imputation
The VCF imputation tool can infer multi-allelic SNP and INDEL variants from low-coverage sequencing samples by packaging the GLIMPSE software (2020, Olivier Delaneau & Simone Rubinacci). The DRAGEN implementation of the GLIMPSE software allows for scalability of variant imputation:
with an end-to-end pipeline where the 3 phases of the GLIMPSE software (Chunk, Phase and Ligate) get executed in a single command, on one chromosome or on multiple chromosomes
with accceleration supported with Advanced Vector Extensions (AVX)
The DRAGEN VCF imputation tool infers variants on autosomes and chromosome X of haploid and diploid species.
Upon completion, the tool generates imputed variants based on a reference panel, a genetic map, and input samples provided. The DRAGEN secondary analysis software supports VCF imputation on human data and provides a reference panel and a genetic map for the hg38 reference build accessible on the DRAGEN Software Support site page.
For data other than human data (reference build hg38) the user needs to provide its own reference panel and genetic map. A custom reference panel can be built with the DRAGEN Population Haplotyping tool.
Notes:
The output is in biallelic format, one line per ALT.
The VCF imputation tool only supports input sample data generated with the DRAGEN secondary analysis software.
The following is an example of commands to impute vcf on a single chromosome:
The following is an example of commands to impute vcf on chromosome X:
Inputs
Sample Input
The imputation tool infers multi-allelic SNP and INDEL variants from low-coverage sequencing samples that are provided by the user. To maximize the accuracy of the imputed variant per sample, the tool leverages the information from all provided samples.
The sample(s) to be imputed must have the following format:
VCF, multi-sample VCF, BCF or multi-sample BCF (zip or unzipped). gVCF is not supported
Must contain GL (Genotype Likelihoods) or PL (phred-scaled genotype likelihoods) information
To achieve more accurate results, it is recommended to use input VCF generated with the force genotyping capability of the DRAGEN secondary analysis software so that it contains all the positions that are present in the reference panel. A file to be used as input of the force genotyping run of the DRAGEN variant caller, with all sites present in the IRP reference panel (built from human reference genome hg38) is provided in the Imputation files accessible in the DRAGEN Software Support Site page. When running the force genotype option (of DRAGEN variant caller) for imputation, it is recommended to disable the machine learning tool (--vc-ml-enable-recalibration=false).
Recommendation for imputing INDELs
To impute INDELs and get the best accuracy on INDELs, it is recommended:
to force genotype the input VCF with a SNPs-only sites.vcf file using DRAGEN argument --vc-forcegt-vcf. This SNPs-only sites.vcf file contains only the SNPs sites present in the reference panel. A SNPs-only VCF file is also available in the IRP reference panel (built from human reference genome hg38) in the Imputation files accessible in the DRAGEN Software Support Site page.
and to set the command
--imputation-phase-impute-reference-only-variants
to true.
Reference Panel
A per-chromosome reference panel in BCF format that lists all the imputation positions in the targeted regions along with the corresponding haplotypes must be provided. A reference panel (with prefix IRPv{x}) is available in the Imputation files accessible in the DRAGEN Software Support Site page. IRPv2.0 is a multi-allelic SNP, INDELs reference panel containing 3202 samples from the 1000 Genomes Project, which have been variant called using DRAGEN 4.0 against hg38.
Notes: IRPv1.x does nor support chrX, IRPv2.x supports chrX, chrY and chrM are not supported
A custom reference panel can be built with the DRAGEN Population Haplotyping tool. When providing a custom reference panel ensure the chromosome of mixed ploidy chromosome is divided into the PAR and non-PAR regions that exist, and the basename matches the subregions names defined in the JSON config file. The format should be <PREFIX>.basename
. Examples: IRPv2.0.chrX_par1, IRPv2.0.chrX_par2, and IRPv2.0.chrX_nonpar.
Genetic Map
A genetic map per chromosome is required to obtain the imputed variants. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use a prebuilt genetic map corresponding to the human hg38 reference genome. A prebuilt map is available as part of the Imputation files, accessible at the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files. The genetic map should follow the format:
<chromosome name>.gmap.gz
3 columns: position, chromosome number, distance (cM)
compliant with the reference genome used to generate the sample input
JSON config file
This config file allows the proper handling of haploid/diploid chromosomes. This file is present in the same directory of the input reference panel with PREFIX and is available in the DRAGEN Software Support Site page. It must follow the naming convention: {$DIR}/{$PREFIX}.config.json. When the config file is not present in the directory, the tool assumes that the imputation is done on all diploid chromosomes.
In the IRP reference panel folder available on DRAGEN support page, the JSON config file corresponds to human data. The user can edit this file if imputation is done on another species.
Example of JSON config file
For imputing VCF on human data with typename “M” for Male and “F” for Female (“M” and “F” are the values used in the sample type file):
Instructions to make a custom JSON configuration file:
The JSON config file is made of two fields as defined in the table below
regions
Required only when a chromosome of mixed ploidy is present in the Reference Panel folder
Define contig name and subregion name of mixed ploidy chromosome
Dictionary in the form: contigname_of_mixed_ploidy :[contigname_of_mixed_ploidy"_par1", contigname_of_mixed_ploidy"_par2", contigname_of_mixed_ploidy"_nonpar1", contig_name_of_mixed_ploidy"_nonpar2"...]
ploidy
“default” is a required name
contigname_of_mixed_ploidy_"nonpar" is required only when a chromosome of mixed ploidy is present in the Reference Panel folder
Define:
ploidy behavior when different from “default”
default ploidy behavior
Dictionary in the form: contigname_of_mixed_ploidy_"nonpar": { typename1 : 1, typename2 : 2} "default" : { "typename1": 2, "typename 2": 2} typename is used in the Sample Type file input
Note: ensure the subregion names match the genetic map name. Example: if "chrX_nonpar" is defined in the "region" field of the JSON config file, then the genetic map corresponding to chromosome X non PAR region in the Reference Panel folder must be named "chrX_nonpar".gmap.gz.
Sample type file
The sample type file is required when haplotyping is performed on non-PAR regions of mixed ploidy chromosomes to define the typename of each sample.
The sample type file is a txt file with the following format
2 columns, tabs or space delimited
First column: list of all sample names present in the input sample
Second column: typename value for each sample. This typename value should match the typenames used in the JSON config file.
Outputs
The VCF imputation tool generates several outputs:
The imputed variant file with concatenated imputed variants: one single VCF or msVCF file for all specified regions/chromosomes with name
<prefix>.impute.vcf.gz
The intermediate files:
chunk regions to be passed along to the internal Phase step with name
<prefix>.impute.chunk.out.txt
imputed variants per chunks identified: VCF or msVCF depending on the input sample format with name
<prefix>_chr_start-end.impute.phase.vcf.gz
text file with path to all the
<prefix>_chr_start-end.impute.phase.vcf.gz
generated with name<prefix>.impute.phase.out.txt
Note: while the imputation tool can impute multi-allelic positions, the output is in biallelic format, one line per ALT. The bcftools tool can be used to post-collapse all ALT in one line with the command: bcftools norm -m +snps
Command Line Options
--enable-imputation
NA
Yes
Set to true
to enable vcf imputation pipeline
--imputation-ref-panel-dir
STRING
Yes
Directory containing per-chromosome reference panel VCF and optionally the JSON config file
--imputation-ref-panel-prefix
STRING
Yes
Prefix for reference panel files and the JSON config file
--imputation-genome-map-dir
STRING
Yes
Directory containing per-chromosome genome map files
--imputation-chunk-input-region
STRING
Yes for single region
Target region, usually a full chromosome (e.g. chr20:1000000-2000000 or chr20).
--imputation-chunk-input-region-list
STRING
Yes for list of regions
Text file listing chromosomes or regions to be processed, one chromosome/region per line.
--imputation-phase-input
STRING
Yes for single VCF file
Sample input file with VCF/BCF format. Single VCF or multi-sample VCF
--imputation-phase-input-list
STRING
Yes for multiple VCF files
Text file listing sample input in VCF/BCF format, one input file per line
--imputation-phase-sample-type
STRING
Yes when imputing on a non PAR region of mixed ploidy chromosome AND a single VCF file
Define typename of the VCF file imputed. The typename must match one of the two typenames defined in the JSON config file
--imputation-phase-sample-type-list
STRING
Yes when imputing on a non PAR region of mixed ploidy chromosome AND a list of VCF files
Path to the Sample Type file
--output-directory
STRING
Yes
Output directory
--output-file-prefix
STRING
Yes
Output files prefix
--imputation-phase-threads
INT
No
Specify the number of threads to use. Default is the number of system threads
--imputation-phase-filter-input-sample-in-ref
NA
No
Default is true
: if sample ID matches between reference panel and sample input, then the corresponding samples are ignored from the reference panel to avoid imputation against itself. To be turned to false
if all samples from the reference panel should be kept regardless of their presence in the sample input.
--imputation-phase-impute-reference-only-variants
STRING
No
Default is false
. If set to true
, allows imputation at variants only present in the reference panel. The use of this option is intended only to allow imputation at sporadic missing variants. If the number of missing variants is non-sporadic, please re-run the genotype likelihood computation at all reference variants and avoid using this option, since data from the reads should be used.
When the input sample variant calling was performed using --vc-forcegt-vcf
with SNPs-only sites.vcf file, it is recommended to set this option to true to also impute INDELs positions from the reference panel.
--imputation-phase-input-independently
STRING
No
Default is false
. If set to true
, allows to treat each sample input independently without using them in the reference panel calculation
Note: with this end-to-end implementation of the GLIMPSE software, the parameters window_size and buffer_size are respectively set to 2 Mb and 200 kb.
Last updated