Population Haplotyping (Beta)

DRAGEN implements a beta version of the Population Haplotyping tool. The tool estimates haplotypes from a population-scale dataset by packaging the SHAPEIT5 software (2022, Hofmeister RJ, Ribeiro DM, Rubinacci S, Delaneau O). It phases common variants and then rare variants in a step-by-step mode. The following workflow must be repeated for each chromosome of the studied genome.

  • Step 1: Phase Common step to estimate the haplotypes of common variants (variants with allele frequency above a given allele frequency threshold) on defined regions.

  • Step 2: Ligate Common step to ligate the phased common variants from step 1 into a single chromosome.

  • Step 3: Phase Rare step to add the haplotypes of rare variants (variants with allele frequency below a given allele frequency threshold) on defined regions to the common variant scaffold obtained in step 2.

  • Step 4: Concat All step to concatenate the haplotype regions obtained in step 3 into a single chromosome.

This tool is most accurate on population-scale datasets with thousands of samples. Running it on multiple nodes is recommended so that regions can be processed in parallel. A common use case of the Population Haplotyping tool is the generation of a custom reference panel to be used for the VCF Imputation pipeline.

The tool supports autosomes and mixed ploidy chromosomes for diploid species only. It does not use FPGA acceleration and can run on a generic, software-only compute node.

Note: the Population Haplotyping tool only supports input msVCF produced with the DRAGEN gVCF Genotyper tool.

Command-Line Examples

The following example shows the commands required to generate haplotypes for common and rare variants (with the default allele frequency threshold) on a population-scale dataset:

Step 1: Phase Common

dragen \
  --enable-population-haplotyping true \
  --enable-phase-common true \
  --ph-phase-common-input-list <path_to_txt_file> \
  --ph-phase-common-input-region <string> \
  --ph-phase-common-map <path_genetic_map> \
  --ph-phase-common-config <path_config_txt_file> \
  --ph-phase-common-sample-type <path_sample_type_txt_file> \
  --output-directory <DIR> \
  [options]

Step 2: Ligate Common

dragen \
  --enable-population-haplotyping true \
  --enable-ligate-common true \
  --ph-ligate-common-input-list <path_to_txt_file> \
  --output-directory <DIR> \
  [options]

Step 3: Phase Rare

dragen \
  --enable-population-haplotyping true \
  --enable-phase-rare true \
  --ph-phase-rare-input <path_to_preprocessed_file_output_of_step_1> \
  --ph-phase-rare-input-region <string> \
  --ph-phase-rare-scaffold <path_to_scaffold_file_output_of_step_2> \
  --ph-phase-rare-scaffold-region <string> \
  --ph-phase-rare-map <path_genetic_map> \
  --ph-phase-rare-config <path_config_txt_file> \
  --ph-phase-rare-sample-type <path_sample_type_txt_file> \
  --output-directory <DIR> \
  [options]

Step 4: Concat All

To generate per chromosome haplotypes:

dragen \
  --enable-population-haplotyping true \
  --enable-concat-all true \
  --ph-concat-all-input-list <path_to_txt_file> \
  --output-directory <DIR> \
  [options]

To generate per genome haplotyped sites:

dragen \
  --enable-population-haplotyping true \
  --enable-concat-all true \
  --ph-concat-all-input-list-sites-only <path_to_txt_file> \
  --output-directory <DIR> \
  [options]
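
Because Phase Common runs once per region, the command is usually assembled programmatically. The following is a minimal sketch, assuming hypothetical file names, regions, and output directories; it uses only the options shown in the Step 1 example above. Each region gets its own output directory here because the default output file name is the same for every run.

```python
from typing import Dict, List


def phase_common_command(region: str, paths: Dict[str, str], output_dir: str) -> List[str]:
    """Return the argument list for one Phase Common run on a single region."""
    return [
        "dragen",
        "--enable-population-haplotyping", "true",
        "--enable-phase-common", "true",
        "--ph-phase-common-input-list", paths["input_list"],
        "--ph-phase-common-input-region", region,
        "--ph-phase-common-map", paths["genetic_map"],
        "--ph-phase-common-config", paths["config"],
        "--ph-phase-common-sample-type", paths["sample_type"],
        "--output-directory", output_dir,
    ]


# Hypothetical file names and regions, for illustration only.
paths = {
    "input_list": "chr20.input_list.txt",
    "genetic_map": "chr20.gmap.gz",
    "config": "config.txt",
    "sample_type": "sample_type.txt",
}
regions = ["chr20:1-10000000", "chr20:9000001-19000000"]

commands = [
    phase_common_command(region, paths, f"out/phase_common_{index}")
    for index, region in enumerate(regions)
]
for command in commands:
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` on a node where DRAGEN is installed, or submitted as one job per region on a cluster.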

Input Files

msVCF Input (step 1 and step 3)

msVCF input list for the Phase Common step (step 1)

For the Phase Common step (step 1), it is recommended to provide msVCF files generated with the DRAGEN gVCF Genotyper tool. This first step takes as input a .txt file containing the path to a single msVCF or a list of msVCF paths, one path per line. The msVCF files must comply with the following requirements:

  • one msVCF per chromosome, OR positionally sorted msVCF shards spanning a whole chromosome without overlap (shards are defined below)

  • generated from the same reference build

  • compressed and indexed

  • with unphased GT calls

  • with no duplicates

  • with ##contig header lines that include the "ID" and "length" fields for all contigs present in the studied genome

Note: for mixed ploidy chromosomes, each PAR and non-PAR region of the chromosome must be treated as a separate chromosome. For example, on human data, the sample input msVCF for chrX must be divided into chrX_par1, chrX_par2, and chrX_nonpar.
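
The ##contig requirement can be checked before launching a run. The sketch below is illustrative only: it assumes the header lines have already been read into a list of strings, and it uses a simplified parser that does not handle commas inside quoted field values. The function names are hypothetical.

```python
import re
from typing import Dict, List, Optional

CONTIG_RE = re.compile(r"^##contig=<(.*)>\s*$")


def parse_contig_line(line: str) -> Optional[Dict[str, str]]:
    """Return the key/value fields of a ##contig header line, or None for other lines."""
    match = CONTIG_RE.match(line)
    if match is None:
        return None
    fields = {}
    for item in match.group(1).split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def contigs_without_length(header_lines: List[str]) -> List[str]:
    """List the contigs whose ##contig line lacks an ID or a numeric length."""
    incomplete = []
    for line in header_lines:
        fields = parse_contig_line(line)
        if fields is None:
            continue
        if "ID" not in fields or not fields.get("length", "").isdigit():
            incomplete.append(fields.get("ID", "<no ID>"))
    return incomplete


header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
    "##contig=<ID=chr2>",
]
print(contigs_without_length(header))
```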

msVCF input for the Phase Rare step (step 3)

The msVCF input list provided at step 1 is pre-processed to generate a formatted msVCF called <prefix>.preprocess.vcf.gz. This formatted msVCF is written to the output directory and must be used as input of the Phase Rare step (step 3).

To facilitate parallel processing on distributed compute nodes, and to avoid downloading and uploading a chromosome-level multisample VCF for each sub-chromosome job, chromosome portions of equal size (shards) can be used as input. The gVCF Genotyper tool can generate these equal-size shards when run with the appropriate option. Note: streaming from the cloud is not supported. Instead, download the input in advance and process it locally to achieve maximum I/O efficiency and stability.

Genetic map (step 1 and step 3)

A per chromosome genetic map corresponding to the studied species and to the reference build used for the msVCF input is required. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use the genetic map corresponding to the human hg38 reference genome, which is available to download from the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files.

The genetic map should follow the format:

  • 3 columns: position, chromosome number, distance (cM), in this order and tab separated

  • The genetic map for a mixed ploidy chromosome must be separated into one map per PAR and non-PAR region (e.g. for human, chromosome X is split into PAR1 chrX_par1, PAR2 chrX_par2 and non-PAR chrX_nonpar regions)

  • A genetic map is not needed for a region in which all samples are haploid (e.g. for human, chromosome Y chrY)

The user must ensure that the genetic maps provided are from the same reference build as the reference used to generate the msVCF input.
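
A quick format check on a genetic map can catch column or delimiter mistakes early. The sketch below is not part of DRAGEN. It assumes the map has been decompressed and read into a list of lines, that an optional header line may be present, and that positions increase while the cumulative distance (cM) never decreases; the last two points are assumptions about typical genetic maps rather than rules stated above.

```python
from typing import List


def check_genetic_map(lines: List[str]) -> List[str]:
    """Return a list of format problems found in a 3-column genetic map."""
    problems = []
    prev_pos = None
    prev_cm = None
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue
        columns = line.split("\t")
        if len(columns) != 3:
            problems.append(f"line {number}: expected 3 tab-separated columns, found {len(columns)}")
            continue
        pos_text, _chromosome, cm_text = columns
        if number == 1 and not pos_text.isdigit():
            continue  # assumed to be a header line
        try:
            pos = int(pos_text)
            cm = float(cm_text)
        except ValueError:
            problems.append(f"line {number}: position or distance is not numeric")
            continue
        if prev_pos is not None and pos <= prev_pos:
            problems.append(f"line {number}: position {pos} is not increasing")
        if prev_cm is not None and cm < prev_cm:
            problems.append(f"line {number}: distance {cm} decreases")
        prev_pos, prev_cm = pos, cm
    return problems


good_map = ["pos\tchr\tcM", "100\tchr1\t0.0", "200\tchr1\t0.5"]
print(check_genetic_map(good_map))
```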

Config file (step 1 and step 3)

The configuration file is a required text file. It allows for proper handling of haploid/diploid chromosomes and for verification of the concordance between the genetic maps, the msVCF input and the sample type file. The current configuration supports binary gender (male or female) and ploidy 2 or 1. When a region has different ploidies in male and female samples, the region is considered a mixed ploidy region (e.g. for human, the non-PAR region on chromosome X, chrX_nonpar).

Users can provide their own config file or use the one available to download from the DRAGEN Software Support Site page.

Example of Config file

##version=1.0
##ref_build=hg38
#filename    region    male_ploidy    female_ploidy
chr1.gmap.gz    chr1:1-248956422    2    2
chr2.gmap.gz    chr2:1-242193529    2    2
chr3.gmap.gz    chr3:1-198295559    2    2
chr4.gmap.gz    chr4:1-190214555    2    2
chr5.gmap.gz    chr5:1-181538259    2    2
chr6.gmap.gz    chr6:1-170805979    2    2
chr7.gmap.gz    chr7:1-159345973    2    2
chr8.gmap.gz    chr8:1-145138636    2    2
chr9.gmap.gz    chr9:1-138394717    2    2
chr10.gmap.gz    chr10:1-133797422    2    2
chr11.gmap.gz    chr11:1-135086622    2    2
chr12.gmap.gz    chr12:1-133275309    2    2
chr13.gmap.gz    chr13:1-114364328    2    2
chr14.gmap.gz    chr14:1-107043718    2    2
chr15.gmap.gz    chr15:1-101991189    2    2
chr16.gmap.gz    chr16:1-90338345    2    2
chr17.gmap.gz    chr17:1-83257441    2    2
chr18.gmap.gz    chr18:1-80373285    2    2
chr19.gmap.gz    chr19:1-58617616    2    2
chr20.gmap.gz    chr20:1-64444167    2    2
chr21.gmap.gz    chr21:1-46709983    2    2
chr22.gmap.gz    chr22:1-50818468    2    2
chrX_par1.gmap.gz    chrX:1-2781479    2    2
chrX_nonpar.gmap.gz    chrX:2781480-155701382    1    2
chrX_par2.gmap.gz    chrX:155701383-156040895    2    2

Instructions to make a custom configuration file:

The config file is a tab-delimited text file. It starts with two header lines:

  • ##version

  • ##ref_build indicating the reference build used for the study.

The header lines are followed by 4 columns, all of which must be populated.

| Column | Description |
| --- | --- |
| First column: filename | Specifies the genetic map basename, 1 name per line. Mixed ploidy chromosomes must be separated into PAR and non-PAR regions. Basenames must match the genetic map basenames. |
| Second column: region | Specifies the start and end positions of the chromosome or sub-chromosome region with format <contig_name>:<start_position>-<end_position>. For chromosomes without mixed ploidy regions, the start position is 1 and the end position is the length of the chromosome (1-based, inclusive). For chromosomes with mixed ploidy regions, the start and end positions are those of each region (1-based, inclusive). |
| Third column: mixed ploidy subject (male_ploidy in the example above) | Specifies 2 for diploid chromosomes and PAR regions, and 1 for non-PAR regions. |
| Fourth column: diploid subject (female_ploidy in the example above) | Specifies 2 for all chromosomes. |

Note: for a mixed ploidy chromosome, ensure the genetic map is separated into one map per PAR and non-PAR region, with no overlap. Example: for human data, the prefixes should be chrX_par1, chrX_nonpar, and chrX_par2.
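
A custom config file can be generated from a list of regions. The following is a sketch under stated assumptions: the function name and the tuple layout are hypothetical, the "1.0" version string is copied from the example config above, and the chrX coordinates are the hg38 values from that example. It also rejects overlapping regions on the same contig.

```python
from typing import List, Tuple

# (genetic map basename, contig, start, end, ploidy column 3, ploidy column 4)
Region = Tuple[str, str, int, int, int, int]


def build_config(ref_build: str, regions: List[Region], version: str = "1.0") -> str:
    """Return the text of a tab-delimited config file for the given regions."""
    lines = [
        f"##version={version}",
        f"##ref_build={ref_build}",
        "#filename\tregion\tmale_ploidy\tfemale_ploidy",
    ]
    last_end = {}
    for map_name, contig, start, end, male, female in regions:
        if start < 1 or end < start:
            raise ValueError(f"invalid coordinates for {map_name}")
        if start <= last_end.get(contig, 0):
            raise ValueError(f"{map_name} overlaps the previous region on {contig}")
        if male not in (1, 2) or female not in (1, 2):
            raise ValueError(f"ploidy must be 1 or 2 for {map_name}")
        last_end[contig] = end
        lines.append(f"{map_name}\t{contig}:{start}-{end}\t{male}\t{female}")
    return "\n".join(lines) + "\n"


chrx_regions = [
    ("chrX_par1.gmap.gz", "chrX", 1, 2781479, 2, 2),
    ("chrX_nonpar.gmap.gz", "chrX", 2781480, 155701382, 1, 2),
    ("chrX_par2.gmap.gz", "chrX", 155701383, 156040895, 2, 2),
]
config_text = build_config("hg38", chrx_regions)
print(config_text)
```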

Sample type file (step 1 and step 3)

The sample type file is a required file. The number and names of the samples in the sample type file must match those in the input multisample VCF.

The sample type file is a txt file with the following format:

  • 2 columns, tab or space delimited

  • First column: list of all sample names present in the input msVCF

  • Second column: 1 or 2. 1 for a subject with mixed ploidy chromosomes, 2 for a subject with all diploid chromosomes.
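
The match between the sample type file and the msVCF can be verified before a run. The sketch below is illustrative and its function names are hypothetical. It relies on the standard VCF layout, in which sample names follow the nine fixed columns of the #CHROM header line, and it assumes both inputs have already been read into memory.

```python
from typing import List


def samples_from_chrom_line(chrom_line: str) -> List[str]:
    """Return the sample names from a VCF #CHROM header line (columns 10 onward)."""
    return chrom_line.rstrip("\n").split("\t")[9:]


def check_sample_types(lines: List[str], vcf_samples: List[str]) -> List[str]:
    """Return a list of problems found in a sample type file."""
    problems = []
    seen = []
    for number, line in enumerate(lines, start=1):
        columns = line.split()  # accepts tab or space delimiters
        if not columns:
            continue
        if len(columns) != 2:
            problems.append(f"line {number}: expected 2 columns, found {len(columns)}")
            continue
        name, code = columns
        seen.append(name)
        if code not in ("1", "2"):
            problems.append(f"line {number}: sample type for {name} must be 1 or 2")
    missing = sorted(set(vcf_samples) - set(seen))
    extra = sorted(set(seen) - set(vcf_samples))
    if missing:
        problems.append("in msVCF but not in sample type file: " + ", ".join(missing))
    if extra:
        problems.append("in sample type file but not in msVCF: " + ", ".join(extra))
    return problems


# Hypothetical sample names, for illustration only.
chrom_line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA1\tNA2"
vcf_samples = samples_from_chrom_line(chrom_line)
print(check_sample_types(["NA1 1", "NA2\t2"], vcf_samples))
```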

Output Files

Phase Common step

The Phase Common step (step 1) is run on a defined region, and outputs:

  • a single scaffold msVCF and related msVCF index with phased common variants for that region. The default name is dragen.ph_phase_common.vcf.gz.

  • a single formatted msVCF called <prefix>.preprocess.vcf.gz and related index. This formatted msVCF is written to the output directory and must be used as input of the Phase Rare step (step 3).

Ligate Common step

The Ligate Common step (step 2) ligates the regions phased in step 1 and outputs a single scaffold msVCF and related msVCF index with phased common variants for a single chromosome. The default name is “dragen.ph_ligate_common.vcf.gz”.

Phase Rare step

The Phase Rare step (step 3) is run on a defined region on a chromosome with preprocessed unphased msVCF from step 1 and phased scaffold msVCF from step 2, and outputs:

  • a single phased msVCF and related msVCF index with phased common and rare variants for that region. The default name is “dragen.ph_rare_common.vcf.gz”.

  • a single 8-column VCF and related index listing all sites that have been phased for that region. The default name is “dragen.ph_rare_common.sites.vcf.gz”. This output is used at the Concat All step to generate a VCF file with all phased sites across the genome.

Concat All step

The Concat All step is used to generate 2 types of output:

  1. Phased common and rare variants for a chromosome

The Concat All step (step 4) concatenates the regions phased in step 3 and outputs an msVCF and related index with phased common and rare variants for a single chromosome. The default name is “dragen.ph_concat_all.vcf.gz”.

  2. List of phased sites

This output is useful as input to the ForceGT tool. The Concat All step lists all phased sites in an 8-column VCF format and outputs a VCF and related index. This output can be generated either directly from the list of phased-site VCFs across the genome from step 3, or in a second pass from the per chromosome site lists once these have been generated. The default name is “dragen.ph_concat_all.sites.vcf.gz”.

Command-Line Options for step 1: Phase Common

| Option | Required | Description |
| --- | --- | --- |
| --enable-population-haplotyping | Yes | Set to true to enable the Population Haplotyping tool. |
| --enable-phase-common | Yes | Set to true to enable the Phase Common step. |
| --ph-phase-common-input-list | Yes | Provides a .txt file listing the sample input pertaining to one chromosome, with the path to a single msVCF or a list of msVCF paths, one path per line. Note: for a mixed ploidy chromosome, each PAR and non-PAR region must be treated as a separate chromosome. |
| --ph-phase-common-input-region | Yes | Specifies the target region to be phased, as a string in the format contigname:startposition-endposition. Adjacent regions must overlap so that the downstream Ligate Common step can join them. Example input region length for human data: 10 Mbp. Note: for a chromosome with both mixed ploidy and diploid regions, the command should be run with one region at a time (e.g. three runs with three regions, chrX_par1, chrX_nonpar and chrX_par2, instead of one run with region chrX). |
| --ph-phase-common-map | Yes | Provides the path to the chromosome genetic map. Note: for a mixed ploidy chromosome, the genetic map must be divided into PAR and non-PAR regions. |
| --ph-phase-common-config | Yes | Provides the path to the txt config file. |
| --ph-phase-common-reference | No | Provides the path to a reference panel of haplotypes in msVCF format. Useful for iterative haplotyping to accelerate the process. |
| --ph-phase-common-scaffold | No | Provides the path to a scaffold of haplotypes in msVCF format. Useful for iterative haplotyping to accelerate the process. |
| --ph-phase-common-sample-type | Yes | Provides the path to the sample type file. |
| --ph-phase-common-filter-maf | No | Default 0.001. Sets the minimum allele frequency (MAF) threshold. All variants with allele frequency equal to or above this MAF are phased during the Phase Common step. |
| --ph-phase-common-max-miss-gt-rate | No | Default 0.1. Sets the missing GT rate threshold. Variants with a missing GT rate higher than this value are skipped. |
| --output-directory | Yes | Specifies the output directory. |
| --output-file-prefix | No | Specifies the prefix of the output file names generated by the pipeline. |
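
Because Phase Common regions must overlap for the later ligation, the region strings are typically generated rather than typed by hand. The following is a minimal sketch: the 10 Mbp chunk length comes from the example above, while the 1 Mbp overlap is an assumed value chosen for illustration and not a documented recommendation. The chr20 length is the hg38 value from the example config file.

```python
from typing import List


def overlapping_regions(contig: str, start: int, end: int,
                        chunk_size: int, overlap: int) -> List[str]:
    """Split [start, end] into chunks in which each chunk overlaps the next."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    regions = []
    chunk_start = start
    while True:
        chunk_end = min(chunk_start + chunk_size - 1, end)
        regions.append(f"{contig}:{chunk_start}-{chunk_end}")
        if chunk_end == end:
            break
        chunk_start += step
    return regions


# 10 Mbp chunks with an assumed 1 Mbp overlap on chr20 (hg38 length 64444167).
chr20_regions = overlapping_regions("chr20", 1, 64444167, 10_000_000, 1_000_000)
for region in chr20_regions:
    print(region)
```

Each printed string can be passed to --ph-phase-common-input-region in a separate Phase Common run.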

Command-Line Options for step 2: Ligate Common

| Option | Required | Description |
| --- | --- | --- |
| --enable-population-haplotyping | Yes | Set to true to enable the Population Haplotyping tool. |
| --enable-ligate-common | Yes | Set to true to enable the Ligate Common step. |
| --ph-ligate-common-input-list | Yes | Provides a .txt file with the list of phased msVCF files pertaining to a single chromosome. The msVCF files provided are the output files of the Phase Common step, in ascending position order. Note: for a mixed ploidy chromosome, each PAR and non-PAR region must be treated as a separate chromosome. |
| --output-directory | Yes | Specifies the output directory. |
| --output-file-prefix | No | Specifies the prefix of the output file names generated by the pipeline. |
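
The Ligate Common input list must name the Phase Common outputs in ascending position order. The sketch below assumes the caller keeps a mapping from each phased region string to its output file (the paths shown are hypothetical). It orders the files by start position and also checks that neighboring regions overlap, since ligation relies on that overlap.

```python
from typing import Dict, List, Tuple


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split 'contig:start-end' into its three parts."""
    contig, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    return contig, int(start_text), int(end_text)


def ligate_order(chunks: Dict[str, str]) -> List[str]:
    """Return the output paths sorted by region start, after basic sanity checks."""
    parsed = sorted(
        (parse_region(region) + (path,) for region, path in chunks.items()),
        key=lambda item: item[1],
    )
    if len({item[0] for item in parsed}) != 1:
        raise ValueError("all regions must belong to exactly one chromosome")
    for previous, current in zip(parsed, parsed[1:]):
        if current[1] > previous[2]:
            raise ValueError(f"regions ending at {previous[2]} and starting at {current[1]} do not overlap")
    return [item[3] for item in parsed]


# Hypothetical output locations, for illustration only.
chunks = {
    "chr20:9-18": "out/chunk_1/dragen.ph_phase_common.vcf.gz",
    "chr20:1-10": "out/chunk_0/dragen.ph_phase_common.vcf.gz",
    "chr20:17-25": "out/chunk_2/dragen.ph_phase_common.vcf.gz",
}
input_list_text = "\n".join(ligate_order(chunks)) + "\n"
print(input_list_text)
```

Writing `input_list_text` to a .txt file gives a list suitable for --ph-ligate-common-input-list.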

Command-Line Options for step 3: Phase Rare

| Option | Required | Description |
| --- | --- | --- |
| --enable-population-haplotyping | Yes | Set to true to enable the Population Haplotyping tool. |
| --enable-phase-rare | Yes | Set to true to enable the Phase Rare step. |
| --ph-phase-rare-input | Yes | Provides the path to the preprocessed unphased msVCF generated by the Phase Common step and covering the Phase Rare region. |
| --ph-phase-rare-input-region | Yes | Specifies the target region to be phased, as a string in the format contigname:startposition-endposition. Regions must not overlap or have gaps between them. Note: for a chromosome with both mixed ploidy and diploid regions, the command should be run with one region at a time (e.g. three runs with three regions, chrX_par1, chrX_nonpar and chrX_par2, instead of one run with region chrX). |
| --ph-phase-rare-map | Yes | Provides the path to the genetic map of the chromosome. Note: for a mixed ploidy chromosome, the genetic map must be divided into PAR and non-PAR regions. |
| --ph-phase-rare-config | Yes | Provides the path to the txt config file. |
| --ph-phase-rare-scaffold | Yes | Provides the path to the scaffold of haplotypes in msVCF format generated by the Ligate Common step. |
| --ph-phase-rare-scaffold-region | Yes | Specifies the scaffold region to be used, as a string in the format contigname:startposition-endposition. The scaffold region must cover the input region and leave a buffer on each side. The buffer length affects the accuracy and speed of the process: a longer buffer is slower but improves accuracy. |
| --ph-phase-rare-sample-type | Yes | Provides the path to the sample type file. |
| --ph-phase-rare-filter-maf | No | Default 0.001. Sets the maximum allele frequency (MAF) threshold. All variants with allele frequency below this MAF are phased during the Phase Rare step. This value must be the same as the one provided to --ph-phase-common-filter-maf. If the values differ, not all variants are phased. |
| --output-directory | Yes | Specifies the output directory. |
| --output-file-prefix | No | Specifies the prefix of the output file names generated by the pipeline. |
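
Phase Rare needs two region strings per run: a contiguous, non-overlapping input region and a wider scaffold region that covers it with a buffer. The following sketch pairs them up. The chunk size and buffer length are left as parameters because no specific values are given here, and the function name is hypothetical. Scaffold regions are clipped to the bounds of the chromosome or sub-chromosome region.

```python
from typing import List, Tuple


def rare_regions(contig: str, start: int, end: int,
                 chunk_size: int, buffer: int) -> List[Tuple[str, str]]:
    """Return (input_region, scaffold_region) pairs covering [start, end] without gaps."""
    if chunk_size < 1 or buffer < 0:
        raise ValueError("chunk_size must be positive and buffer non-negative")
    pairs = []
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(chunk_start + chunk_size - 1, end)
        # Pad the input region on both sides, without leaving the contig bounds.
        scaffold_start = max(start, chunk_start - buffer)
        scaffold_end = min(end, chunk_end + buffer)
        pairs.append((
            f"{contig}:{chunk_start}-{chunk_end}",
            f"{contig}:{scaffold_start}-{scaffold_end}",
        ))
        chunk_start = chunk_end + 1
    return pairs


# Small illustrative numbers; real runs would use lengths in the Mbp range.
for input_region, scaffold_region in rare_regions("chr20", 1, 25, 10, 3):
    print(input_region, scaffold_region)
```

The first string of each pair goes to --ph-phase-rare-input-region and the second to --ph-phase-rare-scaffold-region.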

Command-Line Options for step 4: Concat All

| Option | Required | Description |
| --- | --- | --- |
| --enable-population-haplotyping | Yes | Set to true to enable the Population Haplotyping tool. |
| --enable-concat-all | Yes | Set to true to enable the Concat All step. |
| --ph-concat-all-input-list | Yes when --ph-concat-all-input-list-sites-only is not provided | Provides a .txt file with the list of phased msVCF files pertaining to a single chromosome. The msVCF files provided are the output files of the Phase Rare step, in ascending position order. Note: for a mixed ploidy chromosome, each PAR and non-PAR region must be treated as a separate chromosome. |
| --ph-concat-all-input-list-sites-only | Yes when --ph-concat-all-input-list is not provided | Provides a .txt file with the list of VCF files containing all the haplotyped sites. The VCF files provided are the output files of the Phase Rare step, in ascending position order, with sex chromosomes at the end. |
| --output-directory | Yes | Specifies the output directory. |
| --output-file-prefix | No | Specifies the prefix of the output file names generated by the pipeline. |
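
For the sites-only list, files must appear in ascending position order with sex chromosomes last. The sketch below shows one way to produce that order. It assumes human-style contig names (chr1 to chr22, chrX, chrY) and a caller-maintained mapping from region string to sites VCF path; both the mapping and the paths are hypothetical.

```python
from typing import Dict, List, Tuple


def region_sort_key(region: str) -> Tuple[int, int, str, int]:
    """Sort numbered contigs first (numerically), then named contigs such as chrX."""
    contig, _, span = region.rpartition(":")
    start = int(span.split("-")[0])
    name = contig[3:] if contig.startswith("chr") else contig
    if name.isdigit():
        return (0, int(name), "", start)
    return (1, 0, name, start)


def sites_list_order(site_files: Dict[str, str]) -> List[str]:
    """Return sites VCF paths ordered by chromosome and start position."""
    return [site_files[region] for region in sorted(site_files, key=region_sort_key)]


# Hypothetical paths, for illustration only.
site_files = {
    "chrX:1-2781479": "sites/chrX_par1.sites.vcf.gz",
    "chr2:101-200": "sites/chr2_b.sites.vcf.gz",
    "chr10:1-50": "sites/chr10_a.sites.vcf.gz",
    "chr2:1-100": "sites/chr2_a.sites.vcf.gz",
}
for path in sites_list_order(site_files):
    print(path)
```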

Population Haplotyping Accuracy

An additional module of the Population Haplotyping tool checks for the quality of the haplotypes produced based on a phased truth set provided as input.

Command-Line example

dragen \
  --enable-population-haplotyping true \
  --enable-phase-qc true \
  --ph-phase-qc-validation <path_to_phased_truth_set> \
  --ph-phase-qc-estimation <path_to_phased_msVCF> \
  --ph-phase-qc-input-region <string> \
  --output-directory <DIR> \
  [options]

Command-Line Options

| Option | Required | Description |
| --- | --- | --- |
| --enable-population-haplotyping | Yes | Set to true to enable the Population Haplotyping tool. |
| --enable-phase-qc | Yes | Set to true to enable the quality control module. |
| --ph-phase-qc-validation | Yes | Provides the path to the phased truth set msVCF. Note: the validation msVCF must contain the same samples as the estimation msVCF for which the phasing accuracy is to be estimated. |
| --ph-phase-qc-estimation | Yes | Provides the path to the phased msVCF to be validated (the output of Concat All). |
| --ph-phase-qc-input-region | Yes | Specifies the target region to be evaluated, as a string in the format contigname:startposition-endposition (startposition-endposition is optional). Regions must not overlap or have gaps between them. |
| --output-directory | Yes | Specifies the output directory. |
| --output-file-prefix | No | Specifies the prefix of the output file names generated by the pipeline. |
