Population Haplotyping (Beta)

DRAGEN implements a beta version of the Population Haplotyping tool. The tool estimates haplotypes from a population-scale dataset by packaging the SHAPEIT5 software (Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O, 2022). It phases common variants and then rare variants in a step-by-step mode. Repeat the following workflow for each chromosome of the studied genome.

  • Step 1: Phase Common step to estimate the haplotypes of common variants (variants with allele frequency above a given allele frequency threshold) on defined regions.

  • Step 2: Common Ligate step to ligate the phased common variants from step 1 into a single chromosome.

  • Step 3: Phase Rare step to add the haplotypes of rare variants (variants with allele frequency below a given allele frequency threshold) on defined regions to the common variant scaffold obtained in step 2.

  • Step 4: Concat All step to concatenate the haplotype regions obtained in step 3 into a single chromosome.

The tool is most accurate on population-scale datasets with thousands of samples. Running it on multiple nodes to parallelize processing is recommended. A common use case is generating a custom reference panel for the VCF Imputation pipeline.

The tool supports autosomes and mixed-ploidy chromosomes, for diploid species only. It does not use FPGA acceleration and can run on a generic, software-only compute node.

Note: the Population Haplotyping tool only supports input msVCF produced with the DRAGEN gVCF Genotyper tool.

Command-Line Examples

The following commands are an example of what is required to generate haplotypes for common and rare variants (with the default allele frequency threshold) on a population-scale dataset:

Step 1: Phase Common

dragen \
  --enable-population-haplotyping true \
  --enable-phase-common true \
  --ph-phase-common-input-list <path_to_txt_file> \
  --ph-phase-common-input-region <string> \
  --ph-phase-common-map <path_genetic_map> \
  --ph-phase-common-config <path_config_txt_file> \
  --ph-phase-common-sample-type <path_sample_type_txt_file> \
  --output-directory <DIR> \
  [options]
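
Step 1 is run once per region, so a chromosome is usually cut into regions before the jobs are launched. The sketch below is one hypothetical way to produce the region strings. The helper name make_regions, the chr:start-end region format (borrowed from the config file example in this document), and the use of overlapping windows are assumptions, not documented DRAGEN behavior. Overlap is included because SHAPEIT5-style ligation joins neighboring chunks through the variants they share.

```shell
# make_regions CHROM LENGTH SIZE OVERLAP
# Prints one "CHROM:start-end" window per line (1-based, inclusive).
# Hypothetical helper: window size and overlap are illustrative values only.
make_regions() {
  awk -v c="$1" -v len="$2" -v size="$3" -v ov="$4" 'BEGIN {
    step = size - ov
    if (step <= 0) {
      print "overlap must be smaller than window size" > "/dev/stderr"
      exit 1
    }
    for (start = 1; start <= len; start += step) {
      end = start + size - 1
      if (end > len) end = len          # clip the last window to the chromosome end
      printf "%s:%d-%d\n", c, start, end
      if (end == len) break             # stop once the chromosome end is covered
    }
  }'
}
```

Each printed line could then be passed to --ph-phase-common-input-region in its own job, for example with windows of 6 Mb that overlap by 1 Mb: make_regions chr22 50818468 6000000 1000000.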

Step 2: Ligate Common

dragen \
  --enable-population-haplotyping true \
  --enable-ligate-common true \
  --ph-ligate-common-input-list <path_to_txt_file> \
  --output-directory <DIR> \
  [options]

Step 3: Phase Rare

dragen \
  --enable-population-haplotyping true \
  --enable-phase-rare true \
  --ph-phase-rare-input <path_to_preprocessed_file_output_of_step_1> \
  --ph-phase-rare-input-region <string> \
  --ph-phase-rare-scaffold <path_to_scaffold_file_output_of_step_2> \
  --ph-phase-rare-scaffold-region <string> \
  --ph-phase-rare-map <path_genetic_map> \
  --ph-phase-rare-config <path_config_txt_file> \
  --ph-phase-rare-sample-type <path_sample_type_txt_file> \
  --output-directory <DIR> \
  [options]

Step 4: Concat All

To generate per chromosome haplotypes:

dragen \
  --enable-population-haplotyping true \
  --enable-concat-all true \
  --ph-concat-all-input-list <path_to_txt_file> \
  --output-directory <DIR> \
  [options]

To generate per genome haplotyped sites:

dragen \
  --enable-population-haplotyping true \
  --enable-concat-all true \
  --ph-concat-all-input-list-sites-only <path_to_txt_file> \
  --output-directory <DIR> \
  [options]

Input Files

msVCF Input (step 1 and step 3)

msVCF input list for the Phase Common step (step 1)

For the Phase Common step (step 1), it is recommended to provide msVCF files generated with the DRAGEN gVCF Genotyper tool. This step takes as input a .txt file containing the path to a single msVCF or a list of msVCF paths, one path per line. Each msVCF must meet the following requirements:

  • per chromosome msVCF, or positionally sorted msVCF shards that span a whole chromosome without overlap (shards are defined below)

  • generated from the same reference build

  • compressed and indexed

  • with unphased GT calls

  • with no duplicates

  • with ##contig header lines that include the "ID" and "length" fields for all contigs present in the studied genome
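
Some of the requirements above can be checked before a run is launched. The following is a minimal sketch, not part of DRAGEN: the helper name check_input_list is hypothetical, and it assumes that "compressed and indexed" means a .vcf.gz file with a .tbi or .csi index next to it. It does not inspect the VCF content, so reference build, phasing state, duplicates, and ##contig headers still need to be verified separately.

```shell
# check_input_list LIST
# LIST is the .txt file given to step 1 (one msVCF path per line).
# Prints one message per problem and returns non-zero if any was found.
check_input_list() {
  rc=0
  while IFS= read -r vcf || [ -n "$vcf" ]; do
    [ -z "$vcf" ] && continue                  # skip empty lines
    if [ ! -f "$vcf" ]; then
      echo "missing: $vcf"; rc=1; continue
    fi
    case $vcf in
      *.vcf.gz) ;;                             # assumed naming for compressed msVCF
      *) echo "not .vcf.gz: $vcf"; rc=1 ;;
    esac
    # Assumption: the index sits next to the file as .tbi or .csi
    if [ ! -f "$vcf.tbi" ] && [ ! -f "$vcf.csi" ]; then
      echo "no index: $vcf"; rc=1
    fi
  done < "$1"
  return $rc
}
```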

Note: for mixed-ploidy chromosomes, each PAR and non-PAR region of the chromosome must be treated as a chromosome in its own right. For example, on human data, the sample input msVCF for chrX must be divided into chrX_par1, chrX_par2, and chrX_nonpar.

msVCF input for the Phase Rare step (step 3)

The msVCF input list provided at step 1 is preprocessed into a formatted msVCF called <prefix>.preprocess.vcf.gz. This formatted msVCF is written to the output directory and must be used as the input of the Phase Rare step (step 3).

To facilitate parallel processing on distributed compute nodes, and to avoid downloading and uploading a chromosome-level multisample VCF for each sub-chromosome job, chromosome portions of equal size (shards) can be used as input. The gVCF Genotyper tool can generate these equal-size shards when run with the appropriate option. Note: streaming from the cloud is not supported. Instead, download the input in advance and process it locally for maximum I/O efficiency and stability.

Genetic map (step 1 and step 3)

A per chromosome genetic map corresponding to the studied species and to the reference build used for the msVCF input is required. You can use your own genetic map, computed from the recombination rate of the species and its reference genome, or download the genetic map for the human hg38 reference genome from the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files.

The genetic map should follow the format:

  • 3 columns: position, chromosome number, distance (cM), in this order and tab separated

  • The genetic map for a mixed-ploidy chromosome must be separated into one map per PAR and non-PAR region (e.g. for human, chromosome X is split into PAR1 chrX_par1, PAR2 chrX_par2, and non-PAR chrX_nonpar regions)

  • A genetic map is not needed for a region in which all samples are haploid (e.g. for human, chromosome Y chrY)

The user must ensure that the genetic maps provided are from the same reference build as the reference used to generate the msVCF input.

Config file (step 1 and step 3)

The configuration file is a required text file. It enables proper handling of haploid and diploid chromosomes and verification of concordance between the genetic maps, the msVCF input, and the sample type file. The current configuration supports binary gender (male or female) and a ploidy of 1 or 2. When a region has different ploidies in male and female samples, it is considered a mixed-ploidy region (e.g. for human, the non-PAR region of chromosome X, chrX_nonpar).

You can provide your own configuration file or download one from the DRAGEN Software Support Site page.

Example of Config file

##version=1.0
##ref_build=hg38
#filename    region    male_ploidy    female_ploidy
chr1.gmap.gz    chr1:1-248956422    2    2
chr2.gmap.gz    chr2:1-242193529    2    2
chr3.gmap.gz    chr3:1-198295559    2    2
chr4.gmap.gz    chr4:1-190214555    2    2
chr5.gmap.gz    chr5:1-181538259    2    2
chr6.gmap.gz    chr6:1-170805979    2    2
chr7.gmap.gz    chr7:1-159345973    2    2
chr8.gmap.gz    chr8:1-145138636    2    2
chr9.gmap.gz    chr9:1-138394717    2    2
chr10.gmap.gz    chr10:1-133797422    2    2
chr11.gmap.gz    chr11:1-135086622    2    2
chr12.gmap.gz    chr12:1-133275309    2    2
chr13.gmap.gz    chr13:1-114364328    2    2
chr14.gmap.gz    chr14:1-107043718    2    2
chr15.gmap.gz    chr15:1-101991189    2    2
chr16.gmap.gz    chr16:1-90338345    2    2
chr17.gmap.gz    chr17:1-83257441    2    2
chr18.gmap.gz    chr18:1-80373285    2    2
chr19.gmap.gz    chr19:1-58617616    2    2
chr20.gmap.gz    chr20:1-64444167    2    2
chr21.gmap.gz    chr21:1-46709983    2    2
chr22.gmap.gz    chr22:1-50818468    2    2
chrX_par1.gmap.gz    chrX:1-2781479    2    2
chrX_nonpar.gmap.gz    chrX:2781480-155701382    1    2
chrX_par2.gmap.gz    chrX:155701383-156040895    2    2

Instructions to make a custom configuration file:

The config file is a text file that starts with the following header lines:

  • ##version

  • ##ref_build indicating the reference build used for the study.

The body of the config file contains 4 tab-delimited columns (filename, region, male_ploidy, and female_ploidy, as in the example above). Every column must be populated.

Note: for a mixed-ploidy chromosome, ensure the genetic map is separated into one file per PAR and non-PAR region, with no overlap between regions. Example: for human data, the file prefixes should be chrX_par1, chrX_nonpar, and chrX_par2.
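
A custom config file can be sanity-checked against the rules above before it is passed to step 1 or step 3. This is a minimal sketch under stated assumptions: the helper name check_config is hypothetical, and the chr:start-end pattern for the region column is inferred from the example file rather than from a formal specification.

```shell
# check_config FILE
# Verifies the two ## headers, the 4 tab-delimited columns, the region syntax,
# and that both ploidy columns are 1 or 2. Returns non-zero on any problem.
check_config() {
  awk -F '\t' '
    /^##version=/   { have_version = 1; next }
    /^##ref_build=/ { have_build = 1; next }
    /^#/            { next }                   # column header line
    NF == 0         { next }                   # empty line
    NF != 4 {
      printf "line %d: expected 4 tab-separated columns, found %d\n", NR, NF
      bad = 1; next
    }
    $2 !~ /^[^:]+:[0-9]+-[0-9]+$/ { printf "line %d: bad region %s\n", NR, $2; bad = 1 }
    ($3 != 1 && $3 != 2) || ($4 != 1 && $4 != 2) {
      printf "line %d: ploidy must be 1 or 2\n", NR; bad = 1
    }
    END {
      if (!have_version) { print "missing ##version header"; bad = 1 }
      if (!have_build)   { print "missing ##ref_build header"; bad = 1 }
      exit bad + 0
    }
  ' "$1"
}
```

The check does not detect overlapping regions or confirm that the listed genetic map files exist; those remain manual checks in this sketch.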

Sample type file (step 1 and step 3)

The sample type file is required. The number and names of the samples in the sample type file must match those in the input multisample VCF.

The sample type file is a txt file with the following format:

  • 2 columns, tab or space delimited

  • First column: list of all sample names present in the input msVCF

  • Second column: 1 or 2. Use 1 for subjects with mixed ploidy chromosomes and 2 for subjects with all diploid chromosomes.
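
Because the sample names must match the msVCF exactly, it can help to compare the two lists before running. The sketch below is not part of DRAGEN and the helper name check_sample_types is hypothetical. It expects a plain file of sample names, one per line, which could be produced from the msVCF with a command such as bcftools query -l.

```shell
# check_sample_types NAMES TYPES
# NAMES: one sample name per line (taken from the msVCF header).
# TYPES: the sample type file (name, then 1 or 2; tab or space delimited).
# Prints one message per problem and returns non-zero if any was found.
check_sample_types() {
  awk '
    NR == FNR { if (NF) want[$1] = 1; next }   # first file: expected names
    NF == 0   { next }
    NF != 2 || ($2 != 1 && $2 != 2) {
      printf "bad line %d: %s\n", FNR, $0; bad = 1; next
    }
    !($1 in want) { printf "not in msVCF: %s\n", $1; bad = 1; next }
    ($1 in seen)  { printf "duplicate sample: %s\n", $1; bad = 1; next }
    { seen[$1] = 1 }
    END {
      for (s in want)
        if (!(s in seen)) { printf "missing from sample type file: %s\n", s; bad = 1 }
      exit bad + 0
    }
  ' "$1" "$2"
}
```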

Output Files

Phase Common step

The Phase Common step (step 1) is run on a defined region, and outputs:

  • a single scaffold msVCF and related msVCF index with phased common variants for that region. The default name is dragen.ph_phase_common.vcf.gz.

  • a single formatted msVCF called <prefix>.preprocess.vcf.gz and related index. This formatted msVCF is written to the output directory and must be used as the input of the Phase Rare step (step 3).

Ligate Common step

The Ligate Common step (step 2) ligates the regions phased in step 1 and outputs a single scaffold msVCF and related msVCF index with phased common variants for a single chromosome. The default name is “dragen.ph_ligate_common.vcf.gz”.
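
Since every step 1 run writes the same default file name, one simple arrangement is to give each region its own output directory and then list the per-region scaffolds for step 2. The sketch below shows that idea only. The helper name write_ligate_list and the directory layout are hypothetical, and it assumes the ligate input list holds one path per line in positional order, as the step 1 input list does; this document does not state the list format for step 2.

```shell
# write_ligate_list OUT_ROOT
# Reads region strings (e.g. chr22:1-6000000) on stdin, in positional order,
# and prints the expected step 1 scaffold path for each region.
# Assumed layout: OUT_ROOT/<region with ":" and "-" replaced by "_">/
write_ligate_list() {
  out_root=$1
  while IFS= read -r region || [ -n "$region" ]; do
    [ -z "$region" ] && continue
    dir=$(printf '%s' "$region" | tr ':-' '__')
    printf '%s/%s/dragen.ph_phase_common.vcf.gz\n' "$out_root" "$dir"
  done
}
```

The output can be redirected to a text file and passed to --ph-ligate-common-input-list, provided step 1 was run with --output-directory set to the matching per-region directory.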

Phase Rare step

The Phase Rare step (step 3) is run on a defined region of a chromosome, using the preprocessed unphased msVCF from step 1 and the phased scaffold msVCF from step 2, and outputs:

  • a single phased msVCF and related msVCF index with phased common and rare variants for that region. The default name is “dragen.ph_rare_common.vcf.gz”.

  • a single 8-column VCF and related index listing all sites that have been phased for that region. The default name is “dragen.ph_rare_common.sites.vcf.gz”. This output is used at the Concat All step to generate a VCF file with all phased sites across the genome.

Concat All step

The Concat All processing is used to generate 2 types of output:

  1. Phased common and rare variants for a chromosome

The Concat All step (step 4) concatenates the regions phased in step 3 and outputs an msVCF and related index with phased common and rare variants for a single chromosome. The default name is “dragen.ph_concat_all.vcf.gz”.

  2. List of phased sites

This output is useful as input to the ForceGT tool. The Concat All step lists all phased sites in an 8-column VCF format and outputs a VCF and related index. This output can be generated either directly from the list of phased site VCFs across the genome produced by step 3, or in a second pass, once the per chromosome site lists have been generated. The default name is “dragen.ph_concat_all.sites.vcf.gz”.

Command-Line Options for step 1: Phase Common

Command-Line Options for step 2: Ligate Common

Command-Line Options for step 3: Phase Rare

Command-Line Options for step 4: Concat All

Population Haplotyping Accuracy

An additional module of the Population Haplotyping tool evaluates the quality of the produced haplotypes against a phased truth set provided as input.

Command-Line example

dragen \
  --enable-population-haplotyping true \
  --enable-phase-qc true  \
  --ph-phase-qc-validation <path_to_phased_truth_set> \
  --ph-phase-qc-estimation <path_to_phased_msVCF> \
  --ph-phase-qc-input-region <string> \
  --output-directory <DIR> \
  [options]

Command-Line Options
