VCF Imputation


Last updated 2 days ago


The DRAGEN VCF imputation software infers multi-allelic SNP and INDEL variants from low-coverage sequencing samples. It packages the GLIMPSE software (2020, Olivier Delaneau & Simone Rubinacci), and the DRAGEN implementation makes variant imputation scalable:

  • with an end-to-end pipeline in which the three phases of the GLIMPSE software (Chunk, Phase, and Ligate) are executed in a single command, on one chromosome or on multiple chromosomes

  • with acceleration supported by Advanced Vector Extensions (AVX)

The DRAGEN VCF imputation software infers variants on autosomes and chromosome X of haploid and diploid species.

Upon completion, the software generates imputed variants based on the reference panel, the genetic map, and the input samples provided. The DRAGEN secondary analysis software supports VCF imputation on human data and provides a reference panel and a genetic map for the hg38 reference build, accessible on the DRAGEN Software Support Site page.

For data other than human data (reference build hg38), you must provide your own reference panel and genetic map. A custom reference panel can be built with the DRAGEN Population Haplotyping software.

Notes:

  • The output is in biallelic format, one line per ALT.

  • The VCF imputation software only supports input sample data generated with the DRAGEN secondary analysis software.

The following is an example command to impute SNPs on a single autosome:

dragen \
--enable-imputation true \
--imputation-ref-panel-dir <REF_PANEL_DIR> \
--imputation-ref-panel-prefix <IRPv2.1> \
--imputation-chunk-input-region <chr22> \
--imputation-phase-input-list <VCF_to_be_imputed.txt> \
--imputation-genome-map-dir <MAP_DIR> \
--output-directory <OUT_DIR> \
--output-file-prefix <OUT_PREFIX>

The following is an example command to impute SNPs and INDELs on all supported human chromosomes (autosomes and chromosome X):

dragen \
--enable-imputation true \
--imputation-ref-panel-dir <REF_PANEL_DIR> \
--imputation-ref-panel-prefix <IRPv2.1> \
--imputation-chunk-input-region-list <chr_list.txt> \
--imputation-phase-input-list <VCF_to_be_imputed.txt> \
--imputation-genome-map-dir <MAP_DIR> \
--imputation-phase-sample-type-list <path to sample type file> \
--imputation-phase-impute-reference-only-variants true \
--output-directory <OUT_DIR> \
--output-file-prefix <OUT_PREFIX>
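The two list files passed in the command above are plain text files with one entry per line. The following is a minimal Python sketch for producing them. The file names chr_list.txt and VCF_to_be_imputed.txt come from the example command; the helper names and the sample VCF paths in the usage comments are hypothetical.

```python
from pathlib import Path


def human_regions():
    """Return chr1..chr22 plus chrX (chrY and chrM are not supported)."""
    return [f"chr{i}" for i in range(1, 23)] + ["chrX"]


def write_list_file(path, entries):
    """Write one entry per line, the layout used by both list options."""
    Path(path).write_text("".join(f"{entry}\n" for entry in entries))


# Example usage (the VCF paths are placeholders, replace with your own):
#   write_list_file("chr_list.txt", human_regions())
#   write_list_file("VCF_to_be_imputed.txt",
#                   ["/data/sample1.vcf.gz", "/data/sample2.vcf.gz"])
```

The contig names assume an hg38-style "chr" prefix, as in the chr22 example; adjust them if your reference panel uses different contig names.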

Inputs

Sample Input

The imputation software infers multi-allelic SNP and INDEL variants from low-coverage sequencing samples that are provided by the user. To maximize the accuracy of the imputed variant per sample, the software leverages the information from all provided samples.

The sample(s) to be imputed must have the following format:

  • VCF, multi-sample VCF, BCF, or multi-sample BCF (compressed or uncompressed). gVCF is not supported.

  • Must contain GL (genotype likelihoods) or PL (Phred-scaled genotype likelihoods) information.
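A quick pre-flight check of the GL/PL requirement can be sketched in Python as follows. This is an illustration, not part of DRAGEN: the function names are hypothetical, it reads only text VCF files (plain or gzip/bgzip-compressed, not BCF), and it inspects the header declarations only, so it does not prove that every record carries the field.

```python
import gzip
import re

FORMAT_ID = re.compile(r"^##FORMAT=<ID=([^,>]+)")


def open_text(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def declared_format_ids(path):
    """Collect the FORMAT IDs declared in the VCF meta-information lines."""
    ids = set()
    with open_text(path) as vcf:
        for line in vcf:
            if not line.startswith("##"):
                break  # reached the #CHROM line or the first record
            match = FORMAT_ID.match(line)
            if match:
                ids.add(match.group(1))
    return ids


def has_genotype_likelihoods(path):
    """True if the header declares GL or PL."""
    return bool(declared_format_ids(path) & {"GL", "PL"})
```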

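Recommendation when imputing SNPs and INDELs

To impute SNPs and INDELs and get the best accuracy on INDELs, it is recommended:

  • to force genotype the input VCF with a SNPs-only sites.vcf file using the DRAGEN argument --vc-forcegt-vcf

  • and to set the option --imputation-phase-impute-reference-only-variants to true.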

Reference Panel

Note: IRPv1.x does not support chrX. IRPv2.x supports chrX. chrY and chrM are not supported.

A custom reference panel can be built with the DRAGEN Population Haplotyping software. When providing a custom reference panel, ensure that each chromosome of mixed ploidy is divided into its PAR and non-PAR regions, and that each file basename matches a subregion name defined in the JSON config file. The format should be <PREFIX>.basename. Examples: IRPv2.0.chrX_par1, IRPv2.0.chrX_par2, and IRPv2.0.chrX_nonpar.
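The naming rule above can be sketched as a small Python helper that lists the basenames a custom panel is expected to contain. The function name is hypothetical, and the assumption that whole chromosomes follow the same <PREFIX>.<contig> pattern as the subregions is mine rather than stated in this guide.

```python
def panel_basenames(prefix, chromosomes, regions):
    """Expand each chromosome into <PREFIX>.<basename> entries.

    `regions` maps a mixed-ploidy chromosome to its subregion names,
    mirroring the "regions" field of the JSON config file. Chromosomes
    without an entry are assumed to keep their own name as basename.
    """
    names = []
    for chrom in chromosomes:
        for basename in regions.get(chrom, [chrom]):
            names.append(f"{prefix}.{basename}")
    return names


# Example usage:
#   panel_basenames("IRPv2.0", ["chrX"],
#                   {"chrX": ["chrX_par1", "chrX_nonpar", "chrX_par2"]})
```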

Genetic Map

  • <chromosome name>.gmap.gz

  • 3 columns: position, chromosome number, distance (cM)

  • compliant with the reference genome used to generate the sample input
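A custom genetic map can be sanity-checked before a run with a short Python sketch such as the one below. The function names are hypothetical. The sketch assumes whitespace-delimited columns, tolerates an optional header line, and treats "positions strictly increase and cM never decreases" as a reasonable consistency check; none of these details are specified in this guide.

```python
import gzip


def read_genetic_map(path):
    """Parse a <chromosome name>.gmap.gz file into (position, chromosome, cM) rows."""
    rows = []
    with gzip.open(path, "rt") as handle:
        for index, line in enumerate(handle):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            if len(fields) != 3:
                raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
            if index == 0 and not fields[0].isdigit():
                continue  # assumed header line, for example "pos chr cM"
            rows.append((int(fields[0]), fields[1], float(fields[2])))
    return rows


def is_sorted_map(rows):
    """True if positions strictly increase and cM never decreases."""
    return all(a[0] < b[0] and a[2] <= b[2] for a, b in zip(rows, rows[1:]))
```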

JSON config file

In the IRP reference panel folder available on the DRAGEN Software Support Site page, the JSON config file corresponds to human data. You can edit this file if imputation is done on another species.

Example of JSON config file

For imputing VCF on human data with typename “M” for Male and “F” for Female (“M” and “F” are the values used in the sample type file):

{
  "regions": {
    "chrX": ["chrX_par1", "chrX_nonpar", "chrX_par2"]
  },
  "ploidy": {
    "chrX_nonpar": { "M": 1, "F": 2 },
    "default": { "M": 2, "F": 2 }
  }
}

Instructions to make a custom JSON configuration file:

The JSON config file is made of two fields, as defined in the table below.

| Field | Required/Optional | Purpose | Type |
| --- | --- | --- | --- |
| regions | Required only when a chromosome of mixed ploidy is present in the Reference Panel folder | Defines the contig name and the subregion names of the mixed-ploidy chromosome | Dictionary in the form: contigname_of_mixed_ploidy : [contigname_of_mixed_ploidy"_par1", contigname_of_mixed_ploidy"_par2", contigname_of_mixed_ploidy"_nonpar1", contigname_of_mixed_ploidy"_nonpar2"...] |
| ploidy | "default" is a required name. contigname_of_mixed_ploidy_"nonpar" is required only when a chromosome of mixed ploidy is present in the Reference Panel folder | Defines the default ploidy behavior and the ploidy behavior where it differs from "default" | Dictionary in the form: contigname_of_mixed_ploidy_"nonpar" : { "typename1": 1, "typename2": 2 }, "default" : { "typename1": 2, "typename2": 2 }. The typename values are the ones used in the Sample Type file input. |

Note: ensure the subregion names match the genetic map file names. Example: if "chrX_nonpar" is defined in the "regions" field of the JSON config file, then the genetic map corresponding to the chromosome X non-PAR region in the Reference Panel folder must be named chrX_nonpar.gmap.gz.
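The rules for a custom JSON config file can be checked with a small Python sketch like the one below. The function name is hypothetical, and two of the checks are assumptions on my part rather than documented requirements: that every "ploidy" entry uses the same typenames as "default", and that every subregion whose name contains "nonpar" needs its own "ploidy" entry.

```python
import json


def check_config(config):
    """Return a list of problems found in a parsed JSON config dictionary."""
    problems = []
    regions = config.get("regions", {})
    ploidy = config.get("ploidy", {})
    if "default" not in ploidy:
        problems.append('"ploidy" must define "default"')
    typenames = set(ploidy.get("default", {}))
    for name, values in ploidy.items():
        if set(values) != typenames:
            problems.append(f'"{name}" typenames differ from "default"')
    subregions = {sub for subs in regions.values() for sub in subs}
    for name in ploidy:
        if name != "default" and name not in subregions:
            problems.append(f'"{name}" is not listed under "regions"')
    for sub in sorted(subregions):
        if "nonpar" in sub and sub not in ploidy:
            problems.append(f'"{sub}" has no entry under "ploidy"')
    return problems


def check_config_file(path):
    """Load {$DIR}/{$PREFIX}.config.json from disk and check it."""
    with open(path) as handle:
        return check_config(json.load(handle))
```

An empty list means that none of these checks failed; it does not guarantee that DRAGEN will accept the file.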

Sample type file

The sample type file is required when imputation is performed on non-PAR regions of mixed-ploidy chromosomes. It defines the typename of each sample.

The sample type file is a txt file with the following format:

  • 2 columns, tab- or space-delimited

  • First column: list of all sample names present in the input samples

  • Second column: typename value for each sample. This typename value must match one of the typenames used in the JSON config file.
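The two-column layout above can be produced and read back with a few lines of Python. This is a sketch with hypothetical function and sample names; the set of allowed typenames is expected to come from the JSON config file (for example "M" and "F" in the human example).

```python
def write_sample_types(path, sample_types, allowed_typenames):
    """Write 'sample<TAB>typename' lines after validating the typenames."""
    unknown = {t for t in sample_types.values() if t not in allowed_typenames}
    if unknown:
        raise ValueError(f"typenames not in JSON config: {sorted(unknown)}")
    with open(path, "w") as handle:
        for sample, typename in sample_types.items():
            handle.write(f"{sample}\t{typename}\n")


def read_sample_types(path):
    """Read the two-column file back into a dict (tab or space delimited)."""
    with open(path) as handle:
        return dict(line.split() for line in handle if line.strip())


# Example usage (sample names are placeholders):
#   write_sample_types("sample_types.txt",
#                      {"sampleA": "M", "sampleB": "F"}, {"M", "F"})
```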

Outputs

The VCF imputation software generates several outputs:

  • The imputed variant file with concatenated imputed variants: one single VCF or msVCF file for all specified regions/chromosomes with name <prefix>.impute.vcf.gz

  • The intermediate files:

    • chunk regions to be passed along to the internal Phase step, with name <prefix>.impute.chunk.out.txt

    • imputed variants for each identified chunk: VCF or msVCF depending on the input sample format, with name <prefix>_chr_start-end.impute.phase.vcf.gz

    • text file with the paths to all the generated <prefix>_chr_start-end.impute.phase.vcf.gz files, with name <prefix>.impute.phase.out.txt

Note: although the imputation software can impute multi-allelic positions, the output is in biallelic format, one line per ALT. The bcftools software can be used afterward to collapse the ALT alleles of a position into one line with the command: bcftools norm -m +snps (this setting joins SNP records; see the bcftools norm documentation for other variant types).
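After a run, the output names listed above can be used to verify that all files are in place. The Python sketch below is an illustration with hypothetical function names; it assumes the files are written directly into the output directory and that <prefix>.impute.phase.out.txt holds one path per line.

```python
import os


def expected_outputs(out_dir, prefix):
    """Paths of the fixed-name outputs for a given directory and prefix."""
    return {
        "imputed": os.path.join(out_dir, f"{prefix}.impute.vcf.gz"),
        "chunks": os.path.join(out_dir, f"{prefix}.impute.chunk.out.txt"),
        "phase_list": os.path.join(out_dir, f"{prefix}.impute.phase.out.txt"),
    }


def missing_phase_files(phase_list_path):
    """Return the per-chunk VCF paths listed in the file that do not exist."""
    with open(phase_list_path) as handle:
        paths = [line.strip() for line in handle if line.strip()]
    return [path for path in paths if not os.path.exists(path)]
```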

Command Line Options

| Option | Type | Required | Description |
| --- | --- | --- | --- |
| --enable-imputation | NA | Yes | Set to true to enable the VCF imputation pipeline. |
| --imputation-ref-panel-dir | STRING | Yes | Directory containing the per-chromosome reference panel in VCF/BCF format and, optionally, the JSON config file. |
| --imputation-ref-panel-prefix | STRING | Yes | Prefix of the reference panel files and of the JSON config file. |
| --imputation-genome-map-dir | STRING | Yes | Directory containing the per-chromosome genome map files. |
| --imputation-chunk-input-region | STRING | Yes for single region | Target region, usually a full chromosome (e.g. chr20:1000000-2000000 or chr20). |
| --imputation-chunk-input-region-list | STRING | Yes for list of regions | Text file listing the chromosomes or regions to be processed, one chromosome/region per line. |
| --imputation-phase-input | STRING | Yes for single VCF file | Sample input file in VCF/BCF format: a single-sample VCF or a multi-sample VCF. |
| --imputation-phase-input-list | STRING | Yes for multiple VCF files | Text file listing the sample input files in VCF/BCF format, one input file per line. |
| --imputation-phase-sample-type | STRING | Yes when imputing on a non-PAR region of a mixed-ploidy chromosome AND a single VCF file | Defines the typename of the imputed VCF file. The typename must match one of the two typenames defined in the JSON config file. |
| --imputation-phase-sample-type-list | STRING | Yes when imputing on a non-PAR region of a mixed-ploidy chromosome AND a list of VCF files | Path to the Sample Type file. |
| --output-directory | STRING | Yes | Output directory. |
| --output-file-prefix | STRING | Yes | Prefix of the output files. |
| --imputation-phase-threads | INT | No | Number of threads to use. Default is the number of system threads. |
| --imputation-phase-filter-input-sample-in-ref | NA | No | Default is true: if a sample ID is present in both the reference panel and the sample input, the corresponding sample is ignored in the reference panel to avoid imputing a sample against itself. Set to false to keep all samples from the reference panel regardless of their presence in the sample input. |
| --imputation-phase-impute-reference-only-variants | STRING | No | Default is false. If set to true, allows imputation at variants that are present only in the reference panel. This option is intended only to allow imputation at sporadic missing variants. If the missing variants are not sporadic, re-run the genotype likelihood computation at all reference variants and avoid this option, since data from the reads should be used. When imputing INDEL and SNP positions, it is recommended to input samples that have been variant called using --vc-forcegt-vcf with the SNPs-only sites.vcf file AND to set this option to true. |
| --imputation-phase-input-independently | STRING | No | Default is false. If set to true, each input sample is treated independently, without being used in the reference panel calculation. |

Note: with this end-to-end implementation of the GLIMPSE software, the parameters window_size and buffer_size are respectively set to 2 Mb and 200 kb.
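The required options can be assembled programmatically, for example when launching one run per cohort. The Python sketch below uses only option names documented in this section; the function names are hypothetical, and treating region/region-list and input/input-list as mutually exclusive pairs is my reading of the "Required" column rather than a documented rule.

```python
import shlex


def build_imputation_command(ref_panel_dir, ref_panel_prefix, genome_map_dir,
                             output_dir, output_prefix, *, region=None,
                             region_list=None, phase_input=None,
                             phase_input_list=None):
    """Assemble a dragen imputation command as a list of arguments."""
    if (region is None) == (region_list is None):
        raise ValueError("pass exactly one of region or region_list")
    if (phase_input is None) == (phase_input_list is None):
        raise ValueError("pass exactly one of phase_input or phase_input_list")
    command = [
        "dragen",
        "--enable-imputation", "true",
        "--imputation-ref-panel-dir", ref_panel_dir,
        "--imputation-ref-panel-prefix", ref_panel_prefix,
        "--imputation-genome-map-dir", genome_map_dir,
        "--output-directory", output_dir,
        "--output-file-prefix", output_prefix,
    ]
    if region is not None:
        command += ["--imputation-chunk-input-region", region]
    else:
        command += ["--imputation-chunk-input-region-list", region_list]
    if phase_input is not None:
        command += ["--imputation-phase-input", phase_input]
    else:
        command += ["--imputation-phase-input-list", phase_input_list]
    return command


def format_command(command):
    """Render the argument list as a shell-quoted string for logging."""
    return shlex.join(command)
```

Returning a list rather than a string lets the command be passed to subprocess.run without shell quoting issues; format_command is only for display.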

To achieve more accurate results, it is recommended to use an input VCF generated with the force genotyping capability of the DRAGEN secondary analysis software, so that it contains all the positions present in the reference panel. A file with all sites present in the IRP reference panel (built from human reference genome hg38), to be used as input of the force genotyping run of the DRAGEN variant caller, is provided in the Imputation files accessible on the DRAGEN Software Support Site page. When running the force genotyping option of the DRAGEN variant caller for imputation, it is recommended to disable the machine learning software (--vc-ml-enable-recalibration=false).

When imputing SNPs and INDELs, the input VCF should be force genotyped with a SNPs-only sites.vcf file using the DRAGEN argument --vc-forcegt-vcf. This SNPs-only sites.vcf file contains only the SNP sites present in the reference panel. A SNPs-only VCF file is also available for the IRP reference panel (built from human reference genome hg38) in the Imputation files accessible on the DRAGEN Software Support Site page.

A per-chromosome reference panel in BCF or VCF format, listing all the imputation positions in the targeted regions along with the corresponding haplotypes, must be provided. A reference panel (with prefix IRPv{x}) is available in the Imputation files accessible on the DRAGEN Software Support Site page. IRPv2.1 is a multi-allelic SNP and INDEL reference panel containing 3202 samples from the 1000 Genomes Project, which have been variant called using DRAGEN 4.0 against hg38.

A genetic map per chromosome is required to obtain the imputed variants. You can use your own genetic map, computed from the recombination rate of the species and its reference genome, or use a prebuilt genetic map corresponding to the human hg38 reference genome. A prebuilt map is available as part of the Imputation files accessible on the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files. The genetic map must follow the format listed in the Genetic Map section.

The JSON config file allows the proper handling of haploid/diploid chromosomes. This file is located in the same directory as the input reference panel, uses the same PREFIX, and is available on the DRAGEN Software Support Site page. It must follow the naming convention: {$DIR}/{$PREFIX}.config.json. When the config file is not present in the directory, the software assumes that all chromosomes are diploid.