Pedigree Analysis
DRAGEN supports pedigree-based and population-based germline variant joint analysis for multiple samples. A pedigree-based analysis handles samples from the same species that are related to each other. A population-based analysis compares samples from the same species that are unrelated to each other. You can find more information about the population-based analysis in the section.
Joint analysis requires a gVCF file for each sample. To create a gVCF file, run the germline small variant caller with the --vc-emit-ref-confidence gVCF option. Since personalization is not supported for joint analysis, set --enable-personalization to false when generating gVCF files.
The gVCF file contains information on the variant positions and on positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. Contiguous homozygous runs of bases with similar levels of confidence are grouped into blocks, referred to as hom-ref blocks. Not all entries in the gVCF are contiguous. A reference might contain gaps that are not covered by either a variant line or a hom-ref block. Gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
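The relationship between variant lines, hom-ref blocks, and gaps can be sketched in a few lines of Python. The `(start, end)` tuples and the `find_gaps` function are hypothetical stand-ins for parsed gVCF entries on a single contig; the sketch only illustrates how uncovered positions fall out of the covered intervals and is not part of DRAGEN.

```python
def find_gaps(records, contig_length):
    """Return 1-based inclusive (start, end) intervals not covered by any record.

    records: iterable of (start, end) tuples, one per gVCF line. A hom-ref
    block spans its start through its END position; a variant line spans
    the bases of its REF allele.
    """
    gaps = []
    next_uncovered = 1
    for start, end in sorted(records):
        if start > next_uncovered:
            # Nothing covers next_uncovered..start-1, so it is a gap.
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= contig_length:
        gaps.append((next_uncovered, contig_length))
    return gaps


# Hypothetical contig of 100 bases: a hom-ref block, a SNV, another hom-ref block.
example_records = [(1, 20), (21, 21), (40, 90)]
print(find_gaps(example_records, 100))  # [(22, 39), (91, 100)]
```

In this hypothetical input, positions 22 to 39 and 91 to 100 are covered by neither a variant line nor a hom-ref block, so they would be treated as not callable.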
There are two available joint analysis output files:
Multisample VCF--A VCF file containing a column with genotype information for each of the input samples according to the input variants.
Multisample gVCF--A gVCF file augmenting the content of a multisample VCF file, similar to how a gVCF file augments a VCF file for a single sample. In between variant sites, the multisample gVCF contains statistics that describe the level of confidence that each sample is homozygous to the reference genome. Multisample gVCF is a convenient format for combining the results from a pedigree or small cohort into a single file. If using a large number of samples, fluctuation in coverage or variation in any of the input samples creates a new hom-ref block, which causes a highly fragmented block structure and a large output file that can be slow to create.
The multisample gVCF output is only available in the pedigree-based analysis.
The following example shows a single line from a multi-sample VCF where one sample has a variant, and the other two samples are in a gVCF gap. Gaps are represented by "./.:.:".
In hom-ref blocks, the following FORMAT fields are calculated uniquely.
FORMAT/DP--In a single sample gVCF, the FORMAT/DP reported at a hom-ref position is the median DP in that band. In a multisample gVCF, the FORMAT/DP reported at a hom-ref position is the MIN_DP from hom-ref calls.
FORMAT/AD--In a single sample gVCF, values represent the position in the band where DP=median DP. In the multisample gVCF, AD values at hom-ref positions are copied from the single sample gVCF.
FORMAT/AF--Values are based on FORMAT/AD.
FORMAT/PL--Values represent the Phred likelihoods per genotype hypothesis. For hom-ref blocks, each value in FORMAT/PL represents the minimum value across all positions within the band.
FORMAT/SPL and FORMAT/ICNT--Parameters reported in the gVCF records, including both hom-ref blocks and variant records. The parameters are used to compute the confidence score of a variant being de novo in the proband of a trio. For SNPs, FORMAT/PL and FORMAT/SPL are both used as input to the DeNovo Caller. FORMAT/PL represents Phred likelihoods obtained from the genotyper, if the genotyper is called. FORMAT/SPL represents Phred likelihoods obtained from column-wise estimation, pregraph. Each value in FORMAT/SPL represents the minimum across all positions within the band. For indels, the PL value is computed in the joint pedigree calling step based on the FORMAT/ICNT reported in the gVCF file. FORMAT/ICNT consists of two values. The first value is the number of reads with no indels at the position, and the second value is the number of reads with indels at the position. Each value in FORMAT/ICNT represents the maximum of the value across all positions within the band.
In the following example hom-ref block, ICNT provides information on whether each sample contains an indel at the position of interest. If the proband contains an indel at the position and the ICNT of the parents does not indicate any read supporting an indel, then the confidence score for a de novo indel call in the proband at that position is high.
SPL and ICNT values are specific to DRAGEN. The GATK variant caller does not output SPL and ICNT values.
In a single sample gVCF, FORMAT/DP reported at a hom-ref position is the median DP in the band. The minimum is also computed and printed as MIN_DP for the band.
In the multisample gVCF, MIN_DP from hom-ref calls is printed as FORMAT/DP, and AD is copied from the single sample gVCF. Therefore, at a hom-ref position in the multisample gVCF output, the DP is not necessarily going to be the sum of AD.
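The following Python sketch illustrates why DP and the sum of AD can disagree at a hom-ref position in the multisample gVCF. The per-position depths, the AD value, and the `summarize_band` helper are hypothetical and only mimic the rules described above.

```python
from statistics import median


def summarize_band(depths):
    """Return (median DP, MIN_DP) for the per-position depths of one hom-ref band."""
    return median(depths), min(depths)


# Hypothetical per-position depths inside a single hom-ref band.
band_depths = [16, 18, 18, 19, 20]
median_dp, min_dp = summarize_band(band_depths)

# Single sample gVCF: DP is the band median, and AD is taken at a position
# where DP equals that median (assumed here to be 18 reference reads, 0 alternate).
single_sample = {"DP": median_dp, "MIN_DP": min_dp, "AD": (18, 0)}

# Multisample gVCF: DP is replaced by MIN_DP, while AD is copied unchanged.
multisample = {"DP": single_sample["MIN_DP"], "AD": single_sample["AD"]}

print(multisample["DP"], sum(multisample["AD"]))  # 16 18
```

Under these assumed values, the multisample record reports DP=16 with AD=18,0, so the AD values sum to more than DP.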
Use pedigree mode to jointly analyze samples from related individuals and to perform de novo calling.
The following parameters are available.
--enable-joint-genotyping
To run the Joint Genotyper, set to true.
--output-directory
The output directory. --output-directory is required.
--output-file-prefix
The prefix used to label all output files. --output-file-prefix is required.
-r
The directory where the hash table resides.
--variant
Specifies the path to a single gVCF file. You can specify multiple gVCF files using multiple --variant options. The joint genotyper output depends on the order of the input gVCF files passed by the --variant command line parameter. It is recommended to use the same input order when re-analyzing gVCF files to ensure the output is consistent with previous runs.
--pedigree-file
Specify the path to a pedigree file that describes the relationship between samples.
It is possible to run JointGenotyper without a pedigree file on unrelated samples on versions prior to DRAGEN v3.10. It is not recommended for gVCF variant calls on DRAGEN v3.10 or later.
To invoke pedigree mode, set the --enable-joint-genotyping option to true. Use the --pedigree-file option to specify the path to a pedigree file that describes the relationship between samples.
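As a rough illustration, the Python sketch below assembles a pedigree mode command line from the options listed above. The executable name `dragen`, the space-separated option and value syntax, and every path and prefix are assumptions made for the example; only the option names come from this section.

```python
import shlex


def build_joint_genotyping_command(hash_table_dir, gvcf_paths, pedigree_file,
                                   output_dir, output_prefix):
    """Build an argument list for a pedigree mode run (illustrative only)."""
    args = [
        "dragen",  # assumed executable name
        "-r", hash_table_dir,
        "--enable-joint-genotyping", "true",
        "--output-directory", output_dir,
        "--output-file-prefix", output_prefix,
        "--pedigree-file", pedigree_file,
    ]
    # One --variant option per gVCF. The caller's order is preserved because
    # the joint genotyper output depends on the input order.
    for path in gvcf_paths:
        args += ["--variant", path]
    return args


# Hypothetical paths for a trio.
command = build_joint_genotyping_command(
    hash_table_dir="/path/to/hash_table",
    gvcf_paths=["proband.gvcf.gz", "father.gvcf.gz", "mother.gvcf.gz"],
    pedigree_file="trio.ped",
    output_dir="/path/to/output",
    output_prefix="trio_joint",
)
print(shlex.join(command))
```

Keeping the gVCF list in a fixed, recorded order makes it easier to reproduce an earlier run.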
The pedigree file must be a tab-delimited text file with the file name ending in the .ped extension. The following information is required.
Family_ID
The pedigree identifier.
Individual_ID
The ID of the individual.
Paternal_ID
The ID of the individual's father. If the individual is a founder, the value is 0.
Maternal_ID
The ID of the individual's mother. If the individual is a founder, the value is 0.
Sex
The sex of the sample. If male, the value is 1. If female, the value is 2.
Phenotype
The genetic data of the sample. If unknown, the value is 0. If unaffected, the value is 1. If affected, the value is 2.
The following is an example of an input pedigree file.
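As a minimal sketch, the snippet below holds a hypothetical trio in this six-column layout and parses it into records. The family and individual IDs are invented for illustration, and the absence of a header line is an assumption.

```python
import csv
import io

PED_COLUMNS = ["Family_ID", "Individual_ID", "Paternal_ID",
               "Maternal_ID", "Sex", "Phenotype"]

# Hypothetical trio: two unaffected founders and one affected male child.
EXAMPLE_PED = (
    "FAM1\tfather\t0\t0\t1\t1\n"
    "FAM1\tmother\t0\t0\t2\t1\n"
    "FAM1\tproband\tfather\tmother\t1\t2\n"
)


def parse_ped(text):
    """Parse tab-delimited pedigree text into a list of dicts keyed by column name."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(PED_COLUMNS):
            raise ValueError(f"expected {len(PED_COLUMNS)} columns, got {len(fields)}")
        rows.append(dict(zip(PED_COLUMNS, fields)))
    return rows


for row in parse_ped(EXAMPLE_PED):
    print(row["Individual_ID"], row["Paternal_ID"], row["Maternal_ID"])
```

Written to disk, this content would need a file name ending in .ped, for example trio.ped.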
The De Novo Caller identifies all the trios within the pedigree and generates a de novo score for each child. The De Novo Caller supports multiple trios within a single pedigree. Pedigree Mode supports de novo calling for small, structural, and copy number variants.
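One way to picture trio identification is the Python sketch below. It assumes that a trio is any individual whose Paternal_ID and Maternal_ID both refer to other individuals in the same pedigree; this is an illustrative reading, not a description of how DRAGEN implements the step, and all IDs are hypothetical.

```python
def find_trios(individuals):
    """Return (child, father, mother) tuples for every child with both parents listed.

    individuals: dict mapping Individual_ID to (Paternal_ID, Maternal_ID),
    in pedigree file order. Founders use "0" for both parents.
    """
    trios = []
    for child, (father, mother) in individuals.items():
        if father in individuals and mother in individuals:
            trios.append((child, father, mother))
    return trios


# Hypothetical three-generation pedigree containing two trios.
pedigree = {
    "grandfather": ("0", "0"),
    "grandmother": ("0", "0"),
    "father": ("grandfather", "grandmother"),
    "mother": ("0", "0"),
    "proband": ("father", "mother"),
}
print(find_trios(pedigree))
```

In this hypothetical pedigree, the individual named father appears both as a child in one trio and as a parent in another.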
Pedigree Mode is run in multiple steps. The following is an example workflow for a trio using FASTQ input.
Run single sample alignment and variant calling to generate the following per sample outputs, which are the inputs for Pedigree Mode.
gVCF files for the Small Variant Caller.
*.tn.tsv files for the Copy Number Caller.
BAM files for the Structural Variant Caller.
The Small Variant De Novo Caller considers a trio of samples at a time. The samples are related via a pedigree file. The Small Variant De Novo Caller determines all positions that have a Mendelian conflict based on the genotype from the individual sample gVCFs. Sex chromosomes in males are treated as haploid apart from the PAR regions, which are treated as diploid.
Each of those positions is then processed through the Pedigree Caller to compute a joint posterior probability matrix for the possible genotypes. The probabilities are used to determine whether the proband has a de novo variant with a DQ confidence score. All three subjects are assumed to have an independent error probability.
At positions where the original genotype from the gVCFs shows a double Mendelian conflict (e.g., 0/0+0/0->1/1 or 1/1+1/1->0/0), the genotypes of the trio samples can be adjusted to the combination with the highest joint posterior probability that has at least one Mendelian conflict.
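The notion of single and double Mendelian conflicts can be made concrete with a small Python sketch. The counting rule used here (the smallest number of child alleles that cannot be traced to the father and the mother) is an assumption chosen to match the examples above, and it covers diploid genotypes only, not haploid calls on male sex chromosomes.

```python
def count_mendelian_conflicts(father, mother, child):
    """Count child alleles that cannot be explained by the parents' genotypes.

    Each genotype is a 2-tuple of allele indexes, e.g. (0, 1) for 0/1.
    Returns 0 (consistent), 1 (single conflict), or 2 (double conflict).
    """
    first, second = child

    def unexplained(from_father, from_mother):
        # One conflict for each allele missing from its assumed source parent.
        return (from_father not in father) + (from_mother not in mother)

    # Try both ways of assigning the child's alleles to the two parents.
    return min(unexplained(first, second), unexplained(second, first))


print(count_mendelian_conflicts((0, 0), (0, 0), (1, 1)))  # 2
print(count_mendelian_conflicts((0, 0), (0, 0), (0, 1)))  # 1
print(count_mendelian_conflicts((0, 1), (0, 0), (0, 1)))  # 0
```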
The DQ formula is DQ = -10 * log10(1 - Pdenovo).
Pdenovo is the sum of all entries in the joint posterior probability matrix that have one or more Mendelian conflicts.
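The formula can be worked through numerically with the Python sketch below. The posterior entries are hypothetical, and representing each matrix entry as a (probability, number of Mendelian conflicts) pair is an assumption made to keep the example short.

```python
import math


def p_denovo(posterior_entries):
    """Sum the probabilities of all entries with at least one Mendelian conflict.

    posterior_entries: iterable of (probability, conflict_count) pairs.
    """
    return sum(prob for prob, conflicts in posterior_entries if conflicts >= 1)


def de_novo_quality(p):
    """DQ = -10 * log10(1 - Pdenovo); unbounded when Pdenovo reaches 1."""
    if p >= 1.0:
        return math.inf
    return -10.0 * math.log10(1.0 - p)


# Hypothetical joint posterior entries for one trio position.
entries = [(0.85, 0), (0.14, 1), (0.01, 2)]
p = p_denovo(entries)
print(round(p, 2), round(de_novo_quality(p), 3))

# A Pdenovo of 0.9 corresponds to a DQ of approximately 10.
print(round(de_novo_quality(0.9), 3))
```

Because DQ grows with Pdenovo, a small DQ threshold corresponds to a small minimum probability of a Mendelian conflict.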
In the GT overwrite step, it is possible for the GT of the parents to be overwritten. In the case of multiple trios, the GT of the parents is based on the last trio processed. The trios are processed in the order they are listed in the pedigree file. DRAGEN currently does not add an annotation in the VCF in cases where the GT was overwritten.
The output multisample VCF file is annotated with the FORMAT/DQ and FORMAT/DN fields, which represent a de novo quality score and an associated de novo call. The DN field in the VCF is used to indicate the de novo status for each segment.
The following are the possible values:
Inherited--The called trio genotype is consistent with Mendelian inheritance.
LowDQ--The called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold.
DeNovo--The called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold.
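The three DN values follow a simple decision rule, sketched below in Python. The function name and arguments are hypothetical. The threshold passed in the usage lines is 0.05, the documented SNP default when ML recalibration is off.

```python
def classify_dn(is_mendelian_consistent, dq, threshold):
    """Return the DN value for a called trio genotype."""
    if is_mendelian_consistent:
        return "Inherited"
    # Inconsistent with Mendelian inheritance: compare DQ to the threshold.
    return "DeNovo" if dq >= threshold else "LowDQ"


print(classify_dn(True, 0.0, 0.05))       # Inherited
print(classify_dn(False, 0.01, 0.05))     # LowDQ
print(classify_dn(False, 0.67375, 0.05))  # DeNovo
```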
The following is an example VCF line for a trio:
1 16355525 . G A 34.46 PASS AC=1;AF=0.167;AN=6;DP=45;FS=6.69;MQ=108.04;MQRankSum=-0.156;QD=2.46;ReadPosRankSum=0;SOR=0.016 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ 0/1:11,3:0.214:14:39:PASS:8,2:3,1:74,0,47:39.454,0.00053613,49.99:0,1,104:74,0,47:DeNovo:0.67375 0/0:18,0:0:16:48:PASS:.:.:0,48,605:.:0,12,224:0,48,255:.:. 0/0:14,0:0:14:42:PASS:.:.:0,42,490:.:0,5,223:0,42,255:.:.
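To read the per-sample DN and DQ values out of a line like this one, the FORMAT keys can be paired with each sample column. The Python sketch below does this for the example line above. The helper name is hypothetical, and splitting on any whitespace is a convenience for this example, since VCF files are normally tab-delimited.

```python
VCF_LINE = (
    "1 16355525 . G A 34.46 PASS "
    "AC=1;AF=0.167;AN=6;DP=45;FS=6.69;MQ=108.04;MQRankSum=-0.156;"
    "QD=2.46;ReadPosRankSum=0;SOR=0.016 "
    "GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ "
    "0/1:11,3:0.214:14:39:PASS:8,2:3,1:74,0,47:39.454,0.00053613,49.99:"
    "0,1,104:74,0,47:DeNovo:0.67375 "
    "0/0:18,0:0:16:48:PASS:.:.:0,48,605:.:0,12,224:0,48,255:.:. "
    "0/0:14,0:0:14:42:PASS:.:.:0,42,490:.:0,5,223:0,42,255:.:."
)


def parse_sample_fields(line):
    """Return one dict per sample column, keyed by the FORMAT field names."""
    fields = line.split()
    format_keys = fields[8].split(":")  # column 9 is FORMAT
    return [dict(zip(format_keys, sample.split(":"))) for sample in fields[9:]]


for sample in parse_sample_fields(VCF_LINE):
    print(sample["GT"], sample["AD"], sample["DP"], sample["DN"], sample["DQ"])
```

The first sample carries DN=DeNovo with DQ=0.67375, while the other two samples have missing values (.) for DN and DQ. The second sample also shows AD=18,0 alongside DP=16, a case where DP is not the sum of AD.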
The following command line options are available for de novo small variant calling.
--enable-joint-genotyping
Run the joint genotyping caller.
--pedigree-file
Specify the path to a pedigree file that describes the relationship between samples. It is possible to run JointGenotyper without a pedigree file on unrelated samples, but this is no longer recommended for gVCF variant calls from DRAGEN v3.10 or later.
--variant or --variant-list
Specify the gVCF input to the workflow. The pedigree caller can read input gVCF files from an AWS S3 bucket, Azure storage BLOB, or pre-signed URL.
--qc-snp-denovo-quality-threshold
Specify the minimum DQ value for a SNP to be considered de novo. The default is 0.05 if ML recalibration is off and 0.0017 if ML recalibration is on.
--qc-indel-denovo-quality-threshold
Specify the minimum DQ value for an indel to be considered de novo. The default is 0.4 if ML recalibration is off and 0.04 if ML recalibration is on.
--output-directory
The output directory. --output-directory is required.
--output-file-prefix
The prefix used to label all output files. --output-file-prefix is required.
-r
The directory where the hash table resides.
The output of the joint genotyper depends on the order of input gVCF files passed on the command line using --variant or --variant-list. It is recommended to use the same input order when re-analyzing gVCFs to ensure the output is the same as an earlier run.
Run Pedigree Mode for Small Variant Caller. For more information, see .
Run Pedigree Mode for Copy Number Caller. For more information, see .
Run Pedigree Mode for Structural Variant Caller. For more information, see .
Run DeNovo Variant Small Variant Filtering. For more information, see .