Multi-Region Joint Detection

DRAGEN Multi-Region Joint Detection (MRJD) is a de novo germline small variant caller for paralogous regions. In DRAGEN v4.3, MRJD covers regions that include six clinically relevant genes: NEB, TTN, SMN1/2, PMS2, STRC, and IKBKG. MRJD is compatible with the hg38, hg19, and GRCh37 reference genomes. The table below lists the hg38 coordinates of the regions covered by MRJD.

Chromosome  Start      End        Description
chr2        151578759  151588523  NEB exon 98-105
chr2        151589318  151599076  NEB exon 90-97
chr2        151599871  151609628  NEB exon 82-89
chr2        178653238  178654995  TTN exon 172-180
chr2        178657498  178659255  TTN exon 181-189
chr2        178661759  178663516  TTN exon 190-198
chr5        70049522   70077596   SMN2
chr5        70924940   70953013   SMN1
chr7        5970924    5980896    PMS2 exon 13-15
chr7        5980968    5987689    PMS2 exon 11-12
chr7        6737007    6743712    PMS2CL exon 2-3
chr7        6743880    6753867    PMS2CL exon 4-6
chr15       43599563   43602630   STRC exon 24-29
chr15       43602982   43611000   STRC exon 14-23
chr15       43611040   43618800   STRC exon 1-13
chr15       43699379   43702452   STRCP1 exon 23-28
chr15       43702488   43710472   STRCP1 exon 13-22
chr15       43710502   43718262   STRCP1 exon 1-12
chrX        154555884  154565047  IKBKG exon 3-10
chrX        154639390  154648553  IKBKGP1
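To check whether a position of interest falls inside one of the hg38 regions listed above, a small lookup can help. The following is a minimal sketch, not part of DRAGEN: the function name `find_mrjd_region` is hypothetical, and the code assumes that both the start and end coordinates are inclusive, which the table does not state.

```python
from typing import Optional

# hg38 regions copied from the table above: (chromosome, start, end, description).
MRJD_HG38_REGIONS = [
    ("chr2", 151578759, 151588523, "NEB exon 98-105"),
    ("chr2", 151589318, 151599076, "NEB exon 90-97"),
    ("chr2", 151599871, 151609628, "NEB exon 82-89"),
    ("chr2", 178653238, 178654995, "TTN exon 172-180"),
    ("chr2", 178657498, 178659255, "TTN exon 181-189"),
    ("chr2", 178661759, 178663516, "TTN exon 190-198"),
    ("chr5", 70049522, 70077596, "SMN2"),
    ("chr5", 70924940, 70953013, "SMN1"),
    ("chr7", 5970924, 5980896, "PMS2 exon 13-15"),
    ("chr7", 5980968, 5987689, "PMS2 exon 11-12"),
    ("chr7", 6737007, 6743712, "PMS2CL exon 2-3"),
    ("chr7", 6743880, 6753867, "PMS2CL exon 4-6"),
    ("chr15", 43599563, 43602630, "STRC exon 24-29"),
    ("chr15", 43602982, 43611000, "STRC exon 14-23"),
    ("chr15", 43611040, 43618800, "STRC exon 1-13"),
    ("chr15", 43699379, 43702452, "STRCP1 exon 23-28"),
    ("chr15", 43702488, 43710472, "STRCP1 exon 13-22"),
    ("chr15", 43710502, 43718262, "STRCP1 exon 1-12"),
    ("chrX", 154555884, 154565047, "IKBKG exon 3-10"),
    ("chrX", 154639390, 154648553, "IKBKGP1"),
]


def find_mrjd_region(chrom: str, pos: int) -> Optional[str]:
    """Return the description of the region containing pos, or None."""
    for region_chrom, start, end, description in MRJD_HG38_REGIONS:
        # Assumption: both endpoints are treated as inclusive.
        if chrom == region_chrom and start <= pos <= end:
            return description
    return None


print(find_mrjd_region("chr7", 5975000))  # inside the PMS2 exon 13-15 region
print(find_mrjd_region("chr1", 1000000))  # not covered by MRJD
```

These coordinates apply to hg38 only; hg19 and GRCh37 analyses would need the corresponding coordinates, which are not listed here.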

MRJD method

MRJD is a variant calling method that is designed to detect de novo germline small variants in paralogous regions of the genome. A conventional variant caller relies on the read aligner to determine which reads likely originated from a given location. This method works well when the region of interest does not resemble any other region of the genome over the span of a single read (or a pair of reads for paired-end sequencing). However, a significant fraction of the human genome does not meet this criterion. At least 5% of the human genome consists of segmental duplications. Many regions of the genome have near-identical copies elsewhere, and as a result, the true source location of a read might be subject to considerable uncertainty. If a group of reads is mapped with low confidence, a conventional variant caller might ignore the reads, even though they contain useful information. If a read is mismapped (i.e., the primary alignment is not the true source of the read), it can result in variant detection errors.

MRJD is designed to address the complexities raised by segmental duplications. Instead of considering each region in isolation, MRJD considers all locations from which a group of reads might have originated and attempts to detect the underlying sequences jointly across all paralogous regions in the sample of interest.
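The value of pooling reads across paralogous regions can be illustrated with a toy model. The sketch below is not DRAGEN's algorithm; it is a simple binomial likelihood chosen for illustration. It assumes two diploid paralogous regions (four copies in total), a symmetric per-read error rate, and reads pooled from both regions at one aligned position. The function name and all parameter values are hypothetical.

```python
from math import comb


def aggregate_alt_copy_likelihoods(alt_reads, total_reads,
                                   total_copies=4, error_rate=0.01):
    """Binomial likelihood of the pooled read counts for each ALT copy count.

    Toy model only: index k of the result is the likelihood that k of the
    total_copies sequence copies carry the ALT allele.
    """
    likelihoods = []
    for alt_copies in range(total_copies + 1):
        fraction = alt_copies / total_copies
        # Expected fraction of ALT-supporting reads, allowing for errors.
        p_alt = fraction * (1 - error_rate) + (1 - fraction) * error_rate
        likelihood = (comb(total_reads, alt_reads)
                      * p_alt ** alt_reads
                      * (1 - p_alt) ** (total_reads - alt_reads))
        likelihoods.append(likelihood)
    return likelihoods


# Hypothetical pooled counts: 15 ALT-supporting reads out of 60.
likelihoods = aggregate_alt_copy_likelihoods(alt_reads=15, total_reads=60)
best = max(range(len(likelihoods)), key=lambda k: likelihoods[k])
print(best)  # most likely number of ALT copies among the four
```

In this toy example, the pooled evidence supports one ALT copy out of four, yet nothing in the counts says which region carries it. Resolving that placement requires additional information, such as haplotypes and prior knowledge.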

Below is a diagram showing the general workflow of MRJD in PMS2 and PMS2CL regions. MRJD takes primary alignments in all paralogous regions, regardless of mapping quality, builds and places haplotypes based on reads and prior knowledge, and computes joint genotypes to call small variants.

Figure 1. MRJD Caller workflow.

Two modes of the MRJD Caller

As shown in the diagram above, the DRAGEN MRJD Caller has two modes: default mode and high sensitivity mode. The differences between the two modes are described below.

Default mode

With --enable-mrjd=true, the MRJD Caller will report the following two types of variants:

  1. Uniquely placed variants, which means the variant is found and placed in one of the paralogous regions without ambiguity. See variants labeled with “type 1” in Figure 2.

  2. Region-ambiguous variants. In this case, the aggregated genotype contains a variant allele with high confidence, but MRJD Caller is unable to place the variant allele in one of the paralogous regions with high confidence. The MRJD Caller will report the variant allele in all paralogous regions. See variants labeled with “type 2” in Figure 2.

High Sensitivity mode

With both --enable-mrjd=true and --mrjd-enable-high-sensitivity-mode=true, the MRJD Caller reports the same variants as in default mode, plus two additional types of variants:

  1. Positions where the reference alleles differ between the paralogous regions. It is well established that gene conversion, including reciprocal crossover, is a common event between paralogous regions (such as PMS2 and PMS2CL). When a reciprocal crossover event occurs, the prior model, lacking nearby phasing information, might place the converted haplotype in the source region instead of the destination region, resulting in no variant call. High sensitivity mode compensates for this by reporting the variant at the corresponding positions in all paralogous regions. See variants labeled with “type 3” in Figure 2.

  2. Variants that have been placed uniquely in one of the paralogous regions, with no variant at the corresponding position in the other region. High sensitivity mode also reports the variant in the remaining paralogous regions. This compensates for cases where the prior knowledge used to help place the variant is insufficient or is estimated incorrectly. In those cases, the variant allele still exists but is placed in the wrong paralogous region. Reporting the variant in the other paralogous regions therefore helps maximize sensitivity even when the prior is lacking. See variants labeled with “type 4” in Figure 2.

Figure 2. Different variant types reported by MRJD Caller default mode and high sensitivity mode.
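The relationship between the two options and the four variant types can be summarized in a few lines of code. This is only a restatement of the descriptions above; the function name and the dictionary are hypothetical and are not part of DRAGEN.

```python
# Short labels for the variant types described above (wording paraphrased).
VARIANT_TYPES = {
    1: "uniquely placed variant",
    2: "region-ambiguous variant, reported in all paralogous regions",
    3: "position where reference alleles differ between paralogous regions",
    4: "uniquely placed variant also reported in the other paralogous regions",
}


def reported_variant_types(enable_mrjd: bool,
                           high_sensitivity: bool = False) -> set:
    """Return the variant type numbers reported for a given option setting."""
    if not enable_mrjd:
        return set()
    types = {1, 2}          # default mode
    if high_sensitivity:
        types |= {3, 4}     # added by high sensitivity mode
    return types


for type_number in sorted(reported_variant_types(True, high_sensitivity=True)):
    print(type_number, VARIANT_TYPES[type_number])
```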

Running DRAGEN MRJD

The MRJD Caller is disabled by default and requires WGS data aligned to the hg38, hg19, or GRCh37 human reference genome.

The following options are related to MRJD:

  • --enable-mrjd If set to true, MRJD is enabled for the DRAGEN pipeline. Note that MRJD cannot run together with the SNV caller in the current version of DRAGEN (default = ‘false’).

  • --mrjd-enable-high-sensitivity-mode If set to true, MRJD high sensitivity mode is enabled for the DRAGEN pipeline. See the previous section for the variant types reported in MRJD default mode and high sensitivity mode (default = ‘false’).

The following command-line example uses FASTQ input and runs MRJD Caller with high sensitivity mode:

dragen \
  -r <REF> \
  -1 <FQ1> \
  -2 <FQ2> \
  --RGID <RG> --RGSM <SM> \
  --output-dir <OUTPUT> \
  --output-file-prefix <PREFIX> \
  --enable-map-align=true \
  --enable-map-align-output=true \
  --enable-sort=true \
  --enable-duplicate-marking=true \
  --enable-mrjd true \
  --mrjd-enable-high-sensitivity-mode true

The following command-line example uses BAM input that has already been aligned and runs MRJD Caller with high sensitivity mode:

dragen \
  -r <REF> \
  -b <BAM> \
  --output-dir <OUTPUT> \
  --output-file-prefix <PREFIX> \
  --enable-map-align=false \
  --enable-mrjd true \
  --mrjd-enable-high-sensitivity-mode true
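When MRJD runs are launched from a script, the BAM-input command above can be assembled programmatically. The following is a minimal sketch that uses only the options shown in this section. The function name and the example paths are hypothetical, and the sketch only builds and prints the command rather than executing it.

```python
import shlex


def build_mrjd_command(ref, bam, output_dir, prefix, high_sensitivity=False):
    """Build the argument list for an MRJD run on an already aligned BAM."""
    cmd = [
        "dragen",
        "-r", ref,
        "-b", bam,
        "--output-dir", output_dir,
        "--output-file-prefix", prefix,
        "--enable-map-align", "false",
        "--enable-mrjd", "true",
    ]
    if high_sensitivity:
        cmd += ["--mrjd-enable-high-sensitivity-mode", "true"]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_mrjd_command("/path/to/ref", "sample.bam", "/path/to/out",
                         "sample", high_sensitivity=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where DRAGEN is installed.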

MRJD cannot run together with the DRAGEN Small Variant Caller in this DRAGEN version. We recommend running the DNA Mapping and Small Variant Calling workflow first, and then running MRJD using the aligned BAM file generated by the DNA Mapping workflow as input. This workflow creates two VCF files: .hard-filtered.vcf.gz from the DRAGEN Small Variant Caller and .mrjd.hard-filtered.vcf.gz from DRAGEN MRJD. To produce a single VCF file for downstream analysis, a utility tool is available that replaces the DRAGEN Small Variant Caller output in the homologous regions of the six medically relevant and challenging genes with the MRJD Caller output. The tool also annotates the calls made by MRJD with an "MRJD" tag in the INFO column. Refer to the DRAGEN Software Support Site page to download the utility tool.

The following example command lines first run the DNA Mapping and Small Variant Calling workflow using FASTQ files as input, and then run MRJD using the BAM file generated by the DNA Mapping workflow as input.

# run DNA Mapping and Small Variant Calling workflow
dragen \
  -r <HASH_TABLE> \
  -1 <FQ1> \
  -2 <FQ2> \
  --RGID <RG> --RGSM <SM> \
  --output-dir <OUTPUT_DIRECTORY> \
  --output-file-prefix <PREFIX> \
  --enable-map-align true \
  --enable-map-align-output true \
  --enable-sort true \
  --enable-duplicate-marking true \
  --enable-variant-caller true

# run MRJD
dragen \
  -r <HASH_TABLE> \
  -b <BAM> \
  --output-dir <OUTPUT_DIRECTORY> \
  --output-file-prefix <PREFIX> \
  --enable-map-align false \
  --enable-mrjd true \
  --mrjd-enable-high-sensitivity-mode true
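The merge performed by the utility tool can be pictured with a simplified sketch. The code below is not the utility tool and does not reflect its implementation. It assumes that records are already parsed into dictionaries with hypothetical `chrom`, `pos`, and `info` keys, that region endpoints are inclusive, and that tagging means appending "MRJD" to the INFO value. All records in the example are made up.

```python
def in_regions(chrom, pos, regions):
    """Return True if (chrom, pos) lies inside any (chrom, start, end) region."""
    return any(chrom == c and start <= pos <= end for c, start, end in regions)


def merge_calls(small_variant_records, mrjd_records, regions):
    """Replace small variant calls inside MRJD regions with tagged MRJD calls."""
    # Keep small variant caller records that fall outside the MRJD regions.
    kept = [r for r in small_variant_records
            if not in_regions(r["chrom"], r["pos"], regions)]
    tagged = []
    for r in mrjd_records:
        info = r["info"]
        # Add the MRJD tag; "." is the VCF placeholder for an empty INFO field.
        new_info = "MRJD" if info in ("", ".") else info + ";MRJD"
        tagged.append({**r, "info": new_info})
    merged = kept + tagged
    # Simplification: sort by chromosome name and position. A real VCF
    # follows the contig order declared in its header.
    merged.sort(key=lambda r: (r["chrom"], r["pos"]))
    return merged


# Made-up records, for illustration only.
regions = [("chr7", 5970924, 5980896)]
small_variant_records = [
    {"chrom": "chr7", "pos": 5000000, "info": "DP=30"},
    {"chrom": "chr7", "pos": 5975000, "info": "DP=12"},
]
mrjd_records = [{"chrom": "chr7", "pos": 5975100, "info": "."}]

for record in merge_calls(small_variant_records, mrjd_records, regions):
    print(record["chrom"], record["pos"], record["info"])
```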

Output format

The MRJD Caller generates a .mrjd.hard-filtered.vcf.gz file in the output directory. The output file is a compressed VCFv4.2 formatted file that contains the VCF representation of the small variants from the identified genotype.

Uniquely placed call

The following is an example of the output format for a uniquely placed variant. The DRAGENHardQual filter is applied to a record if the variant has a QUAL < 3.00.

Figure 3. VCF output format example for uniquely placed call.
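The hard filter rule stated above reduces to a single comparison. The following sketch assumes that records not failing the filter are marked PASS, as is conventional in VCF. The function name is hypothetical.

```python
HARD_QUAL_THRESHOLD = 3.00  # threshold stated in the text above


def filter_field(qual: float) -> str:
    """Return the FILTER value implied by the QUAL threshold."""
    # Records with QUAL strictly below the threshold get DRAGENHardQual.
    return "DRAGENHardQual" if qual < HARD_QUAL_THRESHOLD else "PASS"


print(filter_field(2.5))
print(filter_field(3.0))
```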

Non-uniquely-placed call

For variants that are not uniquely placed (type 2-4 variants in Figure 2), the MRJD Caller also reports variants in diploid genotype format, which can be interpreted in the same way as for a uniquely placed variant (the genotype is region-specific rather than an aggregate across all regions). In this format, QUAL is the Phred-scaled quality score for the assertion made in ALT (i.e., −10 log10 P(GT = 0/0)). Note that the QUAL score is equal to or less than 3 (if QUAL > 3, the call should be uniquely placed).

QUAL, GT, GQ, and PL are reported in a similar way to the DRAGEN germline variant caller. To avoid losing information about the aggregated genotype across paralogous regions, the MRJD Caller also reports the genotype, the Phred-scaled quality score, and the Phred-scaled genotype likelihoods of the aggregated genotype using JGT, JQL, and JPL in the FORMAT column.
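The Phred scaling used for QUAL can be made concrete with a short conversion, which also shows what the threshold of 3 means in probability terms. This is a minimal sketch of the standard Phred formula; the function names are hypothetical.

```python
import math


def phred(prob: float) -> float:
    """Phred-scale a probability: -10 * log10(prob)."""
    return -10.0 * math.log10(prob)


def prob_from_phred(qual: float) -> float:
    """Invert the Phred scale: 10 ** (-qual / 10)."""
    return 10.0 ** (-qual / 10.0)


# QUAL = -10 * log10(P(GT == 0/0)), so QUAL = 3 corresponds to a
# probability of roughly 0.5 that the region-specific genotype is 0/0.
print(round(prob_from_phred(3.0), 3))
print(round(phred(0.5), 2))
```

A QUAL of 3 or less therefore means that the region-specific genotype is at least about as likely to be homozygous reference as not. This matches the description of these calls as not confidently placed.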

Figure 4. VCF output format example for non-uniquely-placed call.
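To read the region-specific and aggregated fields from a record, the FORMAT keys can be paired with the sample values. The sketch below uses the standard colon-separated VCF layout. The FORMAT string, field order, and all values are made up for illustration and are not actual DRAGEN output. In particular, the shapes of the JGT, JQL, and JPL values shown here are assumptions.

```python
def parse_sample_fields(format_column: str, sample_column: str) -> dict:
    """Pair colon-separated FORMAT keys with the matching sample values."""
    keys = format_column.split(":")
    values = sample_column.split(":")
    return dict(zip(keys, values))


# Hypothetical FORMAT and sample columns, for illustration only.
format_column = "GT:GQ:PL:JGT:JQL:JPL"
sample_column = "0/1:2:3,0,20:0/0/0/1:45:45,0,30,60,90"

fields = parse_sample_fields(format_column, sample_column)
print("region-specific genotype:", fields["GT"])
print("aggregated genotype:", fields["JGT"])
```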
