VCF Imputation

The VCF imputation tool can infer multi-allelic SNP and INDEL variants from low-coverage sequencing samples by packaging the GLIMPSE software (2020, Olivier Delaneau & Simone Rubinacci). The DRAGEN implementation of the GLIMPSE software allows for scalability of variant imputation:

  • with an end-to-end pipeline in which the three phases of the GLIMPSE software (Chunk, Phase, and Ligate) are executed in a single command, on one or multiple chromosomes

  • with acceleration supported by Advanced Vector Extensions (AVX)

The DRAGEN VCF imputation tool infers variants on autosomes and chromosome X of haploid and diploid species.

Upon completion, the tool generates imputed variants based on a reference panel, a genetic map, and input samples provided. The DRAGEN secondary analysis software supports VCF imputation on human data and provides a reference panel and a genetic map for the hg38 reference build accessible on the DRAGEN Software Support site page.

For data other than human data on reference build hg38, you must provide your own reference panel and genetic map. A custom reference panel can be built with the DRAGEN Population Haplotyping tool.

Notes:

  • The output is in biallelic format, one line per ALT.

  • The VCF imputation tool only supports input sample data generated with the DRAGEN secondary analysis software.

The following is an example command to impute a VCF on a single chromosome:

dragen \
--enable-imputation true \
--imputation-ref-panel-dir <REF_PANEL_DIR> \
--imputation-ref-panel-prefix <IRPv2.0> \
--imputation-chunk-input-region <chr22> \
--imputation-phase-input-list <VCF_to_be_imputed.txt> \
--imputation-genome-map-dir <MAP_DIR> \
--output-directory <OUT_DIR> \
--output-file-prefix <OUT_PREFIX>

The following is an example command to impute a VCF on chromosome X:

dragen \
--enable-imputation true \
--imputation-ref-panel-dir <REF_PANEL_DIR> \
--imputation-ref-panel-prefix <IRPv2.0> \
--imputation-chunk-input-region <chrX> \
--imputation-phase-input-list <VCF_to_be_imputed.txt> \
--imputation-genome-map-dir <MAP_DIR> \
--imputation-phase-sample-type-list <path to sample type file> \
--output-directory <OUT_DIR> \
--output-file-prefix <OUT_PREFIX>
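
For scripted runs, the commands above can be assembled programmatically. The following is a minimal Python sketch, not part of DRAGEN: the helper name build_imputation_command, its parameters, and the paths are hypothetical, and only options documented on this page are used. It returns the argument list without executing it.

```python
import shlex
from typing import List, Optional


def build_imputation_command(
    ref_panel_dir: str,
    ref_panel_prefix: str,
    region: str,
    input_list: str,
    genome_map_dir: str,
    output_dir: str,
    output_prefix: str,
    sample_type_list: Optional[str] = None,
) -> List[str]:
    """Assemble the dragen imputation arguments (hypothetical helper)."""
    cmd = [
        "dragen",
        "--enable-imputation", "true",
        "--imputation-ref-panel-dir", ref_panel_dir,
        "--imputation-ref-panel-prefix", ref_panel_prefix,
        "--imputation-chunk-input-region", region,
        "--imputation-phase-input-list", input_list,
        "--imputation-genome-map-dir", genome_map_dir,
    ]
    # The sample type file is only added when provided, as in the chrX example.
    if sample_type_list is not None:
        cmd += ["--imputation-phase-sample-type-list", sample_type_list]
    cmd += ["--output-directory", output_dir, "--output-file-prefix", output_prefix]
    return cmd


# Placeholder paths, for illustration only.
chr22_cmd = build_imputation_command(
    "/data/ref_panel", "IRPv2.0", "chr22", "vcfs.txt", "/data/maps", "/data/out", "run1"
)
chrx_cmd = build_imputation_command(
    "/data/ref_panel", "IRPv2.0", "chrX", "vcfs.txt", "/data/maps", "/data/out", "run1",
    sample_type_list="sample_types.txt",
)
print(shlex.join(chrx_cmd))
```

The resulting list can be passed to subprocess.run on a system where dragen is installed.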

Inputs

Sample Input

The imputation tool infers multi-allelic SNP and INDEL variants from low-coverage sequencing samples that are provided by the user. To maximize the accuracy of the imputed variant per sample, the tool leverages the information from all provided samples.

The sample(s) to be imputed must have the following format:

  • VCF, multi-sample VCF, BCF, or multi-sample BCF (zipped or unzipped). gVCF is not supported.

  • Must contain GL (genotype likelihoods) or PL (Phred-scaled genotype likelihoods) information
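
Before launching a run, the GL/PL requirement above can be checked on an uncompressed text VCF. The following is a minimal Python sketch under stated assumptions: the function name has_genotype_likelihoods is hypothetical, the check only inspects the FORMAT column of each record, and BCF or compressed input would need to be converted to text first.

```python
from typing import Iterable


def has_genotype_likelihoods(vcf_lines: Iterable[str]) -> bool:
    """Return True if every data record lists GL or PL in its FORMAT column."""
    saw_record = False
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            return False  # no FORMAT and sample columns
        format_keys = fields[8].split(":")
        if "GL" not in format_keys and "PL" not in format_keys:
            return False
        saw_record = True
    return saw_record


# Made-up records, for illustration only.
HEADER = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
]
with_pl = HEADER + ["chr22\t100\t.\tA\tG\t.\t.\t.\tGT:PL\t0/1:10,0,20\n"]
without_pl = HEADER + ["chr22\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\n"]
print(has_genotype_likelihoods(with_pl), has_genotype_likelihoods(without_pl))
```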

To achieve more accurate results, it is recommended to use an input VCF generated with the force genotyping capability of the DRAGEN secondary analysis software, so that it contains all the positions present in the reference panel. A sites file for the force genotyping run of the DRAGEN variant caller, containing all sites present in the IRP reference panel (built from human reference genome hg38), is provided in the Imputation files accessible on the DRAGEN Software Support Site page. When running the DRAGEN variant caller with force genotyping for imputation, it is recommended to disable the machine learning recalibration (--vc-ml-enable-recalibration=false).

Recommendation for imputing INDELs

To impute INDELs with the best accuracy, it is recommended to:

  • force genotype the input VCF with a SNPs-only sites.vcf file using the DRAGEN argument --vc-forcegt-vcf. This SNPs-only sites.vcf file contains only the SNP sites present in the reference panel. A SNPs-only VCF file for the IRP reference panel (built from human reference genome hg38) is also available in the Imputation files accessible on the DRAGEN Software Support Site page.

  • set the option --imputation-phase-impute-reference-only-variants to true.
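
To illustrate what "SNPs-only" means for a sites file, the following minimal Python sketch keeps header lines and records whose REF and ALT alleles are all single bases. It is only an illustration: the function names are hypothetical, the input is assumed to be uncompressed VCF text, and this page does not describe how the provided SNPs-only file was actually produced.

```python
from typing import Iterable, Iterator

DNA_BASES = {"A", "C", "G", "T"}


def is_snp_site(ref: str, alt: str) -> bool:
    """True when REF and every comma-separated ALT allele is a single base."""
    alleles = [ref] + alt.split(",")
    return all(allele.upper() in DNA_BASES for allele in alleles)


def snp_only_lines(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and SNP records; drop INDELs and other records."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5 and is_snp_site(fields[3], fields[4]):
            yield line


# Made-up sites, for illustration only.
sites = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr22\t100\t.\tA\tG\t.\t.\t.\n",
    "chr22\t200\t.\tAT\tA\t.\t.\t.\n",
    "chr22\t300\t.\tC\tG,T\t.\t.\t.\n",
]
kept = list(snp_only_lines(sites))
```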

Reference Panel

A per-chromosome reference panel in BCF format must be provided. It lists all the imputation positions in the targeted regions along with the corresponding haplotypes. A reference panel (with prefix IRPv{x}) is available in the Imputation files accessible on the DRAGEN Software Support Site page. IRPv2.0 is a multi-allelic SNP and INDEL reference panel containing 3202 samples from the 1000 Genomes Project, variant called using DRAGEN 4.0 against hg38.

Notes: IRPv1.x does not support chrX. IRPv2.x supports chrX. chrY and chrM are not supported.

A custom reference panel can be built with the DRAGEN Population Haplotyping tool. When providing a custom reference panel, ensure that each chromosome of mixed ploidy is divided into its PAR and non-PAR regions, and that each basename matches a subregion name defined in the JSON config file. The format should be <PREFIX>.basename. Examples: IRPv2.0.chrX_par1, IRPv2.0.chrX_par2, and IRPv2.0.chrX_nonpar.
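
The naming rule above can be sketched as a small Python helper that lists the names a custom panel would be expected to use. This is an illustration only: the function name panel_basenames is hypothetical, and using the plain chromosome name as the basename for chromosomes without subregions is an assumption. File extensions are deliberately left out because they are not specified here.

```python
from typing import Dict, List


def panel_basenames(
    prefix: str, chromosomes: List[str], regions: Dict[str, List[str]]
) -> List[str]:
    """Expand each chromosome into <PREFIX>.basename entries.

    Mixed-ploidy chromosomes are replaced by the subregions listed in the
    "regions" field of the JSON config file.
    """
    names = []
    for chromosome in chromosomes:
        # Assumption: chromosomes without subregions use their own name.
        for basename in regions.get(chromosome, [chromosome]):
            names.append(f"{prefix}.{basename}")
    return names


regions = {"chrX": ["chrX_par1", "chrX_nonpar", "chrX_par2"]}
names = panel_basenames("IRPv2.0", ["chr22", "chrX"], regions)
print(names)
```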

Genetic Map

A genetic map per chromosome is required to obtain the imputed variants. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use a prebuilt genetic map corresponding to the human hg38 reference genome. A prebuilt map is available as part of the Imputation files, accessible at the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files. The genetic map should follow the format:

  • <chromosome name>.gmap.gz

  • 3 columns: position, chromosome number, distance (cM)

  • compliant with the reference genome used to generate the sample input
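
A custom genetic map in the format above can be sanity-checked before a run. The following is a minimal Python sketch under stated assumptions: columns are taken to be whitespace-delimited, a non-numeric first row is treated as an optional header, and the function name and demo values are hypothetical.

```python
import gzip
import os
import tempfile
from typing import List, Tuple


def read_genetic_map(path: str) -> List[Tuple[int, str, float]]:
    """Read a <chromosome name>.gmap.gz file into (position, chromosome, cM) rows."""
    rows: List[Tuple[int, str, float]] = []
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle, start=1):
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if len(parts) != 3:
                raise ValueError(
                    f"{path}:{line_number}: expected 3 columns, got {len(parts)}"
                )
            position, chromosome, distance_cm = parts
            if line_number == 1 and not position.isdigit():
                continue  # assumed optional header row
            rows.append((int(position), chromosome, float(distance_cm)))
    return rows


# Write a tiny made-up map and read it back.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "chr22.gmap.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("pos\tchr\tcM\n")
    handle.write("16050000\t22\t0.0\n")
    handle.write("16060000\t22\t0.0125\n")
demo_rows = read_genetic_map(demo_path)
```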

JSON config file

This config file enables proper handling of haploid and diploid chromosomes. It must be located in the same directory as the input reference panel, share the panel's PREFIX, and follow the naming convention {$DIR}/{$PREFIX}.config.json. A config file is available on the DRAGEN Software Support Site page. When the config file is not present in the directory, the tool assumes that all imputed chromosomes are diploid.

In the IRP reference panel folder available on the DRAGEN Software Support Site page, the JSON config file corresponds to human data. You can edit this file if imputation is done on another species.

Example of JSON config file

For imputing VCF on human data with typename “M” for Male and “F” for Female (“M” and “F” are the values used in the sample type file):

{
  "regions": {
    "chrX": [ "chrX_par1", "chrX_nonpar", "chrX_par2" ]
  },
  "ploidy": {
    "chrX_nonpar": { "M": 1, "F": 2 },
    "default":     { "M": 2, "F": 2 }
  }
}
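
One way to read this config is as a lookup with a fallback: a subregion listed under "ploidy" uses its own entry, and every other region uses "default". The following minimal Python sketch illustrates that reading. The function name ploidy_for is hypothetical, and the fallback behavior is an interpretation of the "default" entry rather than a description of DRAGEN internals.

```python
import json

CONFIG_TEXT = """
{
  "regions": {
    "chrX": ["chrX_par1", "chrX_nonpar", "chrX_par2"]
  },
  "ploidy": {
    "chrX_nonpar": {"M": 1, "F": 2},
    "default": {"M": 2, "F": 2}
  }
}
"""


def ploidy_for(config: dict, region: str, typename: str) -> int:
    """Return the ploidy of a sample type in a region, falling back to "default"."""
    ploidy_table = config["ploidy"]
    return ploidy_table.get(region, ploidy_table["default"])[typename]


config = json.loads(CONFIG_TEXT)
for region in ("chrX_par1", "chrX_nonpar", "chr22"):
    print(region, ploidy_for(config, region, "M"), ploidy_for(config, region, "F"))
```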

Instructions to make a custom JSON configuration file:

The JSON config file is made of two fields, as defined in the table below.

| Field | Required/Optional | Purpose | Type |
| --- | --- | --- | --- |
| regions | Required only when a chromosome of mixed ploidy is present in the Reference Panel folder | Defines the contig name and subregion names of the mixed ploidy chromosome | Dictionary in the form: contigname_of_mixed_ploidy : [contigname_of_mixed_ploidy"_par1", contigname_of_mixed_ploidy"_par2", contigname_of_mixed_ploidy"_nonpar1", contigname_of_mixed_ploidy"_nonpar2", ...] |
| ploidy | “default” is a required name. contigname_of_mixed_ploidy_"nonpar" is required only when a chromosome of mixed ploidy is present in the Reference Panel folder | Defines the ploidy behavior when different from “default”, and the default ploidy behavior | Dictionary in the form: contigname_of_mixed_ploidy_"nonpar" : { "typename1": 1, "typename2": 2 }, "default" : { "typename1": 2, "typename2": 2 }. The typenames are those used in the Sample Type file input. |

Note: ensure that the subregion names match the genetic map file names. For example, if "chrX_nonpar" is defined in the "regions" field of the JSON config file, then the genetic map corresponding to the chromosome X non-PAR region in the Reference Panel folder must be named chrX_nonpar.gmap.gz.
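
The naming check in the note above can be automated. The following is a minimal Python sketch with hypothetical function names: it derives the expected genetic map file names from the "regions" field and reports which ones are absent from a directory listing. Treating chromosomes without subregions as <chromosome name>.gmap.gz follows the genetic map format described earlier.

```python
from typing import Dict, Iterable, List


def expected_gmap_files(
    chromosomes: List[str], regions: Dict[str, List[str]]
) -> List[str]:
    """List the genetic map file names implied by the config "regions" field."""
    files = []
    for chromosome in chromosomes:
        for name in regions.get(chromosome, [chromosome]):
            files.append(f"{name}.gmap.gz")
    return files


def missing_gmap_files(expected: List[str], present: Iterable[str]) -> List[str]:
    """Return the expected file names that are not in the directory listing."""
    present_set = set(present)
    return [name for name in expected if name not in present_set]


regions = {"chrX": ["chrX_par1", "chrX_nonpar", "chrX_par2"]}
expected = expected_gmap_files(["chr22", "chrX"], regions)
# Made-up directory listing; in practice this could come from os.listdir(MAP_DIR).
missing = missing_gmap_files(expected, ["chr22.gmap.gz", "chrX_par1.gmap.gz", "chrX.gmap.gz"])
print(missing)
```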

Sample type file

The sample type file is required when imputation is performed on non-PAR regions of mixed ploidy chromosomes. It defines the typename of each sample.

The sample type file is a text file with the following format:

  • 2 columns, tab- or space-delimited

  • First column: all sample names present in the input samples

  • Second column: typename value for each sample. This value must match one of the typenames used in the JSON config file.
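
The format above can be validated against the JSON config file before a run. The following is a minimal Python sketch: the function names and sample names are hypothetical, and the set of allowed typenames is assumed to be the keys of the "default" entry in the "ploidy" field.

```python
from typing import Dict, List


def parse_sample_types(text: str) -> Dict[str, str]:
    """Parse a two-column, tab- or space-delimited sample type file."""
    sample_types: Dict[str, str] = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, got {len(parts)}")
        sample_name, typename = parts
        sample_types[sample_name] = typename
    return sample_types


def samples_with_unknown_type(sample_types: Dict[str, str], config: dict) -> List[str]:
    """Return sample names whose typename is absent from the config "default" entry."""
    allowed = set(config["ploidy"]["default"])
    return sorted(name for name, typename in sample_types.items() if typename not in allowed)


config = {"ploidy": {"chrX_nonpar": {"M": 1, "F": 2}, "default": {"M": 2, "F": 2}}}
# Made-up sample names, for illustration only.
sample_types = parse_sample_types("sampleA\tF\nsampleB M\nsampleC\tX\n")
print(samples_with_unknown_type(sample_types, config))
```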

Outputs

The VCF imputation tool generates several outputs:

  • The imputed variant file with concatenated imputed variants: one single VCF or msVCF file for all specified regions/chromosomes with name <prefix>.impute.vcf.gz

  • The intermediate files:

    • chunk regions to be passed along to the internal Phase step with name <prefix>.impute.chunk.out.txt

    • imputed variants for each identified chunk: VCF or msVCF, depending on the input sample format, with name <prefix>_chr_start-end.impute.phase.vcf.gz

    • text file with path to all the <prefix>_chr_start-end.impute.phase.vcf.gz generated with name <prefix>.impute.phase.out.txt
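
For downstream scripts, the output names listed above can be derived from the output prefix. The following is a minimal Python sketch with a hypothetical function name. How chr, start, and end are substituted into the per-chunk file name is an assumption based on the pattern shown, and the chunk coordinates are made up.

```python
from typing import Dict, List, Tuple


def expected_output_names(
    prefix: str, chunks: List[Tuple[str, int, int]]
) -> Dict[str, object]:
    """Build the output file names described in this section."""
    return {
        "imputed": f"{prefix}.impute.vcf.gz",
        "chunk_list": f"{prefix}.impute.chunk.out.txt",
        "phase_list": f"{prefix}.impute.phase.out.txt",
        # Assumption: chr, start, and end are substituted literally.
        "phase_chunks": [
            f"{prefix}_{chromosome}_{start}-{end}.impute.phase.vcf.gz"
            for chromosome, start, end in chunks
        ],
    }


outputs = expected_output_names("run1", [("chr22", 1, 2000000)])
print(outputs["imputed"])
```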

Note: while the imputation tool can impute multi-allelic positions, the output is in biallelic format, with one line per ALT. The bcftools tool can be used afterwards to collapse all ALT alleles onto one line with the command: bcftools norm -m +snps

Command Line Options

| Option | Type | Required | Description |
| --- | --- | --- | --- |
| --enable-imputation | NA | Yes | Set to true to enable the VCF imputation pipeline. |
| --imputation-ref-panel-dir | STRING | Yes | Directory containing the per-chromosome reference panel files and, optionally, the JSON config file. |
| --imputation-ref-panel-prefix | STRING | Yes | Prefix for the reference panel files and the JSON config file. |
| --imputation-genome-map-dir | STRING | Yes | Directory containing the per-chromosome genome map files. |
| --imputation-chunk-input-region | STRING | Yes for single region | Target region, usually a full chromosome (e.g. chr20:1000000-2000000 or chr20). |
| --imputation-chunk-input-region-list | STRING | Yes for list of regions | Text file listing the chromosomes or regions to be processed, one chromosome/region per line. |
| --imputation-phase-input | STRING | Yes for single VCF file | Sample input file in VCF/BCF format. Single-sample or multi-sample VCF. |
| --imputation-phase-input-list | STRING | Yes for multiple VCF files | Text file listing the sample input files in VCF/BCF format, one input file per line. |
| --imputation-phase-sample-type | STRING | Yes when imputing on a non-PAR region of a mixed ploidy chromosome AND a single VCF file | Defines the typename of the imputed VCF file. The typename must match one of the two typenames defined in the JSON config file. |
| --imputation-phase-sample-type-list | STRING | Yes when imputing on a non-PAR region of a mixed ploidy chromosome AND a list of VCF files | Path to the Sample Type file. |
| --output-directory | STRING | Yes | Output directory. |
| --output-file-prefix | STRING | Yes | Output files prefix. |
| --imputation-phase-threads | INT | No | Number of threads to use. Default is the number of system threads. |
| --imputation-phase-filter-input-sample-in-ref | NA | No | Default is true: if a sample ID matches between the reference panel and the sample input, the corresponding sample is ignored in the reference panel to avoid imputing a sample against itself. Set to false to keep all samples from the reference panel regardless of their presence in the sample input. |
| --imputation-phase-impute-reference-only-variants | STRING | No | Default is false. If set to true, allows imputation at variants only present in the reference panel. This option is intended only to allow imputation at sporadic missing variants. If the number of missing variants is non-sporadic, re-run the genotype likelihood computation at all reference variants and avoid using this option, since data from the reads should be used. When the input sample variant calling was performed using --vc-forcegt-vcf with a SNPs-only sites.vcf file, it is recommended to set this option to true to also impute INDEL positions from the reference panel. |
| --imputation-phase-input-independently | STRING | No | Default is false. If set to true, treats each sample input independently, without using the sample inputs in the reference panel calculation. |
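
The Required column above can be turned into a pre-flight check. The following is a minimal Python sketch with hypothetical names. It assumes that exactly one option of each single/list pair is given, which is an interpretation of the table rather than documented behavior.

```python
from typing import Dict, List

ALWAYS_REQUIRED = [
    "--enable-imputation",
    "--imputation-ref-panel-dir",
    "--imputation-ref-panel-prefix",
    "--imputation-genome-map-dir",
    "--output-directory",
    "--output-file-prefix",
]

# Assumption: each pair is mutually exclusive and one member must be present.
SINGLE_OR_LIST_PAIRS = [
    ("--imputation-chunk-input-region", "--imputation-chunk-input-region-list"),
    ("--imputation-phase-input", "--imputation-phase-input-list"),
]


def check_options(opts: Dict[str, str], non_par_mixed_ploidy: bool = False) -> List[str]:
    """Return a list of problems found in a set of imputation options."""
    problems = [f"missing {name}" for name in ALWAYS_REQUIRED if name not in opts]
    for single, listed in SINGLE_OR_LIST_PAIRS:
        if (single in opts) == (listed in opts):
            problems.append(f"provide exactly one of {single} or {listed}")
    if non_par_mixed_ploidy:
        if "--imputation-phase-input" in opts and "--imputation-phase-sample-type" not in opts:
            problems.append("missing --imputation-phase-sample-type")
        if "--imputation-phase-input-list" in opts and "--imputation-phase-sample-type-list" not in opts:
            problems.append("missing --imputation-phase-sample-type-list")
    return problems


# Placeholder values, for illustration only.
base_opts = {name: "value" for name in ALWAYS_REQUIRED}
base_opts["--imputation-chunk-input-region"] = "chrX"
base_opts["--imputation-phase-input-list"] = "vcfs.txt"
print(check_options(base_opts, non_par_mixed_ploidy=True))
```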

Note: with this end-to-end implementation of the GLIMPSE software, the parameters window_size and buffer_size are respectively set to 2 Mb and 200 kb.
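
To give a feel for these two values, the following Python sketch tiles a region into 2 Mb windows and extends each by a 200 kb buffer on both sides, clipped to the region. This is a deliberately simplified illustration with hypothetical names: it uses fixed-size windows only, and the chunks actually produced by the tool may differ.

```python
from typing import List, Tuple

WINDOW_SIZE = 2_000_000  # 2 Mb, as stated in the note above
BUFFER_SIZE = 200_000    # 200 kb, as stated in the note above

Interval = Tuple[int, int]


def fixed_windows(
    start: int,
    end: int,
    window_size: int = WINDOW_SIZE,
    buffer_size: int = BUFFER_SIZE,
) -> List[Tuple[Interval, Interval]]:
    """Return (window, buffered window) pairs covering [start, end]."""
    chunks = []
    window_start = start
    while window_start <= end:
        window_end = min(window_start + window_size - 1, end)
        buffered = (
            max(start, window_start - buffer_size),
            min(end, window_end + buffer_size),
        )
        chunks.append(((window_start, window_end), buffered))
        window_start = window_end + 1
    return chunks


# A made-up 5 Mb region.
for window, buffered in fixed_windows(1, 5_000_000):
    print(window, buffered)
```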
