# Illumina TruPath Genome Prep

DRAGEN’s Germline pipeline integrates proximity‑mapped reads from the Illumina TruPath Genome Prep to enhance genomic analysis using long-range information encoded on the flowcell. This proximity-aware workflow supports highly accurate read mapping, phasing, and variant detection, including structural variants, paralog‑resolved small variants, short tandem repeat (STR) genotyping, and colocation analysis. By modeling and applying read‑to‑read linkage probabilities, the pipeline enables more confident interpretation of complex and low‑mappability genomic regions using standard short‑read data.

## Summary

* **Integrated TruPath proximity mapping**: Setting `--enable-proximity=true` activates proximity-aware modeling and analysis across the DRAGEN Germline pipeline, allowing reads that are spatially close on the flowcell to be probabilistically linked as originating from the same DNA template.
* **Proximity model-driven mapping and alignment**: DRAGEN performs a preliminary mapping pass to collect high‑confidence alignments and fits a non‑linear proximity linking model that relates flowcell spatial distance and genomic distance to read‑to‑read linkage probability. The resulting Phred‑scaled linkage probability lookup table is applied during map/align to resolve ambiguous mappings and improve read placement accuracy in repetitive and complex genomic regions.
* **Enhanced phasing support**: Proximity information strengthens read phasing by associating reads from the same original template molecule, enabling longer and more reliable phasing blocks that propagate into variant calling and assembly‑based analyses.
* **Structural variant calling**: The Germline SV caller leverages proximity‑derived phasing to support phased assemblies, haplotype‑aware machine‑learning features, and haplotype‑resolved genotyping for single‑sample TruPath whole‑genome analyses.
* **Haplotype‑resolved small variant detection in paralogs**: For clinically relevant paralogous regions, Multi‑Region Joint Detection (MRJD) estimates total copy number from read depth, reconstructs individual paralog copies using read sequences and proximity information, assigns each copy to a genomic region or haplotype, and calls small variants from the reconstructed copies.
* **STR genotyping with IRR recovery**: Proximity linking enables recovery and placement of in‑repeat reads (IRRs) that would otherwise be unmapped, improving detection and sizing of large STR expansions and supporting phasing‑aware genotyping.
* **Colocation analysis and filtering**: Colocation maps summarize long-range genomic interactions using proximity‑linked reads and are used to visualize structural features and filter SV breakends lacking proximity support.
* **Specialized outputs and reporting**: The pipeline generates proximity‑aware BAM/CRAM files, VCFs, JSON summaries, cooler files, and TruPath‑specific DRAGEN Reports with dedicated QC metrics and visualizations.

## Overview

Short‑read DNA sequencing typically captures genomic variation at high accuracy but lacks long-range context needed to confidently resolve complex regions such as repeats, paralogs, and structural variants. The **Illumina TruPath Genome Prep** encodes long-range molecular information directly on the flowcell by preserving spatial proximity between reads derived from the same original DNA molecule. When combined with DRAGEN’s proximity‑aware algorithms, this information enables long-range analysis that extends the power of standard short‑read data.

The **DRAGEN Germline pipeline for Illumina TruPath Genome** leverages this flowcell‑encoded proximity information through a probabilistic proximity linking model that assigns read‑to‑read linkage probabilities based on spatial and genomic distance. When proximity mode is enabled, DRAGEN automatically fits this model, generates Phred‑scaled proximity link probability distributions, and applies them across mapping, phasing, and variant calling workflows. These proximity linkage probabilities serve as a foundational signal reused throughout the pipeline—informing alignment scoring, phasing blocks, candidate assemblies, machine‑learning features, and variant filtering—to improve accuracy and confidence in repetitive and structurally complex genomic regions while remaining compatible with standard short‑read sequencing workflows and formats.

![Illumina TruPath Genome Analysis Workflow](https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-57ffca5a19e2d2bb1fa3cdc2c02cbabb5361ab5c%2Fdragen_user_guide_workflow.png?alt=media)

## Proximity Mode Analysis in DRAGEN

When proximity mode is enabled, DRAGEN automatically performs additional modeling and downstream analyses that integrate proximity information throughout the Germline pipeline. TruPath‑specific proximity analysis is activated by setting `--enable-proximity=true` for a DRAGEN Germline run. This proximity‑aware processing supports the following features:

* High‑accuracy read mapping using linkage‑informed alignment scoring
* Enhanced phasing via read‑to‑template association
* Structural variant calling using phased assemblies and haplotype‑aware algorithms
* Paralog‑resolved small variant detection with Multi‑Region Joint Detection (MRJD)
* Improved STR genotyping through in‑repeat read (IRR) recovery
* Long-range genomic interaction analysis and SV filtering using colocation maps

![DRAGEN Analysis Workflow](https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-36a5a2c50e234e696ba553b2917ecf76d9f36f41%2Foutput_files_graphic.png?alt=media)

## Key Benefits of TruPath Genome vs Standard Illumina SBS

When the DRAGEN Germline pipeline is run with proximity mode enabled on TruPath Genome data, improvements are observed relative to standard Illumina SBS inputs across multiple performance dimensions.

These include improved small variant calling accuracy, longer phasing blocks, a higher proportion of fully phased genes, and improved structural variant recall. The table below summarizes key performance metrics across TruPath and standard Illumina SBS datasets.

| Benefit                                             | TruPath, high molecular weight input DNA | TruPath, standard molecular weight input DNA | Standard Illumina SBS on DRAGEN 4.4 |
| --------------------------------------------------- | ---------------------------------------- | -------------------------------------------- | ----------------------------------- |
| **Best-in-class small variant calling performance** | 36,717 FP+FN                             | 40,267 FP+FN                                 | 61,288 FP+FN                        |
| **Multi-megabase phasing blocks**                   | 8.1 Mbp                                  | 649 kbp                                      | NA                                  |
| **Fully phased genes**                              | 98.4%                                    | 87.6%                                        | 0%                                  |
| **Improved SV recall**                              | 94.0%                                    | 93.7%                                        | 80.7%                               |

### Phased, High-Quality Small Variant Calls in Clinically Relevant Gene Families

TruPath proximity-aware analysis enables haplotype‑resolved, copy‑number‑aware small variant calling in ten clinically relevant paralogous gene families using [Multi-Region Joint Detection (MRJD)](#multi-region-joint-detection), as shown in the table below. With TruPath data, MRJD produces phased variant calls across these supported paralogous regions without reliance on population haplotypes.

**Supported Genes**

| Paralogous Gene             | Disease Relevance                       |
| --------------------------- | --------------------------------------- |
| **PMS2**                    | Lynch Syndrome                          |
| **SMN1–SMN2**               | Spinal Muscular Atrophy                 |
| **NCF1**                    | Chronic Granulomatous Disease           |
| **CYP21A2**                 | Congenital Adrenal Hyperplasia          |
| **TNXB**                    | Ehlers–Danlos syndrome                  |
| **STRC**                    | Recessive Nonsyndromic Hearing Loss     |
| **CYP2D6**                  | Pharmacogenetics                        |
| **CYP11B1–CYP11B2**         | Glucocorticoid-remediable Aldosteronism |
| **CFHR1–CFHR2–CFHR3–CFHR4** | Atypical Hemolytic Uremic Syndrome      |
| **USP18**                   | Type I Interferonopathy                 |

The figure below illustrates haplotype‑resolved variant calls generated by MRJD for *PMS2* and *PMS2CL*, reported as separate copies for each locus, with long‑read data shown for comparison.

<div align="center"><img src="https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-fc5e4dda6b362a04402da330984028cb2012ff75%2FTruPath_PMS2.png?alt=media" alt="Haplotype-Resolved Variant Calls in *PMS2* and *PMS2CL*" width="600"></div>

### Improved STR Expansion Length and Classification Accuracy

TruPath analysis improves short tandem repeat (STR) expansion length estimation by recovering fragments composed entirely of STR sequence and by applying sequencing efficiency correction to account for locus‑specific coverage bias.

These improvements result in STR length estimates that more closely track expected repeat sizes and support more accurate expansion classification. The figure below compares STR expansion length estimates generated using standard Illumina sequencing and TruPath analysis across multiple loci.

![STR Length Estimation for Standard Illumina Sequencing and TruPath](https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-4790283f59163b99023fe3d3ae7dfff058a700c2%2Fstandard_trupath_wgs_str.png?alt=media)

### Improved BND Filtering

TruPath proximity information enables more selective filtering of large (>200 kbp) inter‑ and intra‑chromosomal breakend (BND) calls produced by DRAGEN Structural-Variant (SV) Calling.

Incorporating colocation evidence reduces the number of reported large BND events while maintaining recall. This effect is observed for both intra‑chromosomal and inter‑chromosomal BNDs across evaluated samples.

The figure below summarizes BND recall and the reduction in reported intra‑ and inter‑chromosomal BND calls for TruPath Coriell samples (n=45), with and without colocation filtering.

<div align="center"><img src="https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-9b900570a175d12b9fc66c84dac85e01a3d2e459%2Fcolocation_bnd_filter.png?alt=media" alt="DRAGEN-SV BND call reduction with TruPath" width="600"></div>

## Proximity Linking Model

In Illumina TruPath Genome data, read pairs that are proximal on the flowcell have an increased likelihood of originating from the same original template molecule. To quantify this likelihood, DRAGEN uses a probabilistic proximity linking model that relates genomic distance and flowcell proximity to calculate the probability that two reads originate from the same input DNA molecule.

When DRAGEN is run with `--enable-proximity=true`, the mapper estimates the parameters of this proximity linking model and generates a link probability distribution for each TruPath FASTQ input. This process consists of three stages: sample collection, proximity analysis, and model fitting, followed by generation of a link probability lookup table.

### Sample Collection

To fit the proximity linking model, DRAGEN first collects a representative subset of preliminary alignments from the input data. During an initial mapping pass, alignments are generated in flowcell‑tile-sized batches and reads meeting suitability requirements are retained for proximity analysis.

Eligible preliminary alignments must satisfy the following criteria:

* Mapped with MAPQ ≥ 60
* Primary alignments
* Non‑duplicate reads
* For paired‑end data, first‑in‑pair with a mapped mate and proper pairing

DRAGEN continues sampling until one million qualifying preliminary alignments have been collected or until the entire FASTQ input has been processed. If fewer than one million alignments are collected, processing continues with a warning indicating a potentially insufficient sample. If no suitable alignments are found, DRAGEN exits with an error.
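The eligibility criteria and stopping rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not DRAGEN's implementation: the `Alignment` record, its field names, and the `collect_sample` helper are hypothetical, and the sketch assumes that single‑end reads only need to satisfy the first three criteria.

```python
from dataclasses import dataclass

# Target sample size stated in the text above.
TARGET_SAMPLE_SIZE = 1_000_000


@dataclass
class Alignment:
    """Hypothetical preliminary-alignment record (field names are assumptions)."""
    mapq: int
    is_primary: bool
    is_duplicate: bool
    is_paired: bool = False
    is_first_in_pair: bool = False
    mate_mapped: bool = False
    is_proper_pair: bool = False


def is_eligible(aln: Alignment) -> bool:
    """Apply the four eligibility criteria listed above."""
    if aln.mapq < 60 or not aln.is_primary or aln.is_duplicate:
        return False
    if aln.is_paired:
        # Paired-end data: first-in-pair, mapped mate, proper pairing.
        return aln.is_first_in_pair and aln.mate_mapped and aln.is_proper_pair
    return True


def collect_sample(alignments, target=TARGET_SAMPLE_SIZE):
    """Collect eligible alignments until `target` is reached or input ends.

    Returns (sample, is_short). `is_short` mirrors the warning case in the
    text; an empty sample mirrors the error case.
    """
    sample = []
    for aln in alignments:
        if is_eligible(aln):
            sample.append(aln)
            if len(sample) >= target:
                break
    if not sample:
        raise RuntimeError("no suitable alignments found")
    return sample, len(sample) < target
```

Passing a small `target` makes it easy to exercise both the normal path and the short‑sample path on toy data.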

### Proximity Analysis

Once a sufficient set of preliminary alignments has been collected, DRAGEN analyzes read pairs that are both spatially proximal on the flowcell and genomically proximal on the reference genome. Read pairs meeting both criteria have a high likelihood of originating from the same template molecule.

Each alignment is associated with a mapped genomic position and a flowcell coordinate (X, Y). For candidate read pairs, DRAGEN computes:

* Spatial displacement on the flowcell, represented as (`XD`, `YD`) in nanometers
* Genomic displacement, represented as `GDIST` in base pairs and rounded to the nearest 1,000 bp

Read pairs whose spatial and genomic displacements fall within configured proximity thresholds are considered likely linked. For these pairs, counts are aggregated across combinations of `XD`, `YD`, and `GDIST`. These aggregated counts form the empirical input to the model fitting stage.

A second set of counts is also collected using read pairs that are spatially proximal but genomically distant. These pairs are assumed to represent chance colocation and are used to model background noise.

Before proceeding, DRAGEN evaluates both sets of counts to ensure the observed trends are consistent with TruPath data. If the data fails validation, DRAGEN exits with an error.
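As a rough illustration of the count aggregation described in this section, the sketch below computes `XD`, `YD`, and `GDIST` for candidate read pairs and tallies linked and background counts. Several details are assumptions made for the example and are not stated in this guide: displacements are taken as absolute values, the proximity thresholds are passed in as hypothetical parameters (`max_spatial_nm`, `max_gdist_bp`), reads on different contigs are treated as genomically distant, and the all‑pairs loop is used only for clarity.

```python
from collections import Counter
from itertools import combinations


def genomic_bin(gdist_bp: int) -> int:
    """Round a genomic displacement to the nearest 1,000 bp (halves round up)."""
    return (gdist_bp + 500) // 1000 * 1000


def aggregate_counts(reads, max_spatial_nm, max_gdist_bp):
    """Tally pair counts for model fitting.

    reads: iterable of (contig, position_bp, x_nm, y_nm) tuples.
    Returns (linked, background):
      linked     -- Counter keyed by (XD, YD, GDIST) for pairs that are both
                    spatially and genomically proximal
      background -- Counter keyed by (XD, YD) for pairs that are spatially
                    proximal but genomically distant (chance colocation)
    """
    linked, background = Counter(), Counter()
    for a, b in combinations(reads, 2):
        xd, yd = abs(a[2] - b[2]), abs(a[3] - b[3])
        if xd > max_spatial_nm or yd > max_spatial_nm:
            continue  # not spatially proximal on the flowcell
        same_contig = a[0] == b[0]
        gdist = abs(a[1] - b[1]) if same_contig else None
        if same_contig and gdist <= max_gdist_bp:
            linked[(xd, yd, genomic_bin(gdist))] += 1
        else:
            background[(xd, yd)] += 1
    return linked, background
```

The two counters correspond to the two sets of counts described above: the first feeds the linking model, and the second feeds the background (chance) model.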

### Model Fitting

The proximity linking model is non‑linear and includes approximately 20 parameters that predict the expected number of linked read pairs (`N`) as a function of `XD`, `YD`, and `GDIST`. The aggregated counts from proximity analysis are submitted to a non‑linear least‑squares solver to estimate these parameters.

If the solver fails to converge, DRAGEN exits with an error. When fitting succeeds, the model enables calculation of the expected number of linked read pairs, $$\mu(\text{XD}, \text{YD}, \text{GDIST})$$, which provides a smoothed estimate relative to the empirical counts.

A separate background model estimates the expected number of proximal read pairs due to chance, $$\mu\_\text{chance}(\text{XD}, \text{YD}, \text{GDIST})$$. The link probability is then computed as:

$$1 - \frac{\mu\_\text{chance}(\text{XD}, \text{YD}, \text{GDIST})}{\mu(\text{XD}, \text{YD}, \text{GDIST})}$$

This probability is typically expressed on a Phred scale, computed from the complementary probability that the pair is colocated by chance:

$$-10 \log\_{10} \left(\frac{\mu\_\text{chance}(\text{XD}, \text{YD}, \text{GDIST})}{\mu(\text{XD}, \text{YD}, \text{GDIST})}\right)$$

Higher values indicate a stronger likelihood that two reads originated from the same template molecule.
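The two formulas above translate directly into code. The sketch below is a minimal worked example; the function names are hypothetical, and the inputs stand in for the fitted values of $$\mu$$ and $$\mu\_\text{chance}$$ at a single (`XD`, `YD`, `GDIST`) combination.

```python
import math


def link_probability(mu: float, mu_chance: float) -> float:
    """Link probability: 1 - mu_chance / mu."""
    if mu <= 0:
        raise ValueError("mu must be positive")
    return 1.0 - mu_chance / mu


def link_phred(mu: float, mu_chance: float) -> float:
    """Phred-scaled link quality: -10 * log10(mu_chance / mu)."""
    if mu <= 0 or mu_chance <= 0:
        raise ValueError("mu and mu_chance must be positive")
    return -10.0 * math.log10(mu_chance / mu)


# Example: 100 expected pairs in total, 1 of them expected by chance.
p = link_probability(100.0, 1.0)   # about 0.99
q = link_phred(100.0, 1.0)         # about 20
```

With these inputs, a chance fraction of 1 in 100 corresponds to a link probability of about 0.99 and a Phred‑scaled value of about 20. Each additional factor of 10 between $$\mu$$ and $$\mu\_\text{chance}$$ adds 10 to the Phred‑scaled value.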

### Link Probability Distribution Generation

After successful model fitting, DRAGEN evaluates the fitted model across the practical range of spatial and genomic displacements and stores the resulting link probabilities in a lookup table. Table entries are generated until the link probabilities fall below a minimum threshold.

This lookup table represents the primary output of the TruPath proximity linking model and is used downstream by the DRAGEN Germline pipeline to incorporate proximity information during mapping, template tagging, and variant calling.

In rare cases where the fitted model fails to produce meaningful link probabilities above the minimum threshold, an empty lookup table is generated and DRAGEN exits with an error.

## Map/Align

The proximity linking model is used during mapping to improve read alignment accuracy for TruPath samples. In regions of high sequence homology, standard Illumina sequencing reads may align equally well, or nearly so, to multiple genomic locations, resulting in ambiguous mappings. With TruPath data, proximity‑linked read pairs can provide additional context that enables both reads in a pair to be mapped uniquely.

Read pairs originating from a region of interest on the flowcell are processed through the standard mapping workflow. Multiple candidate alignments are generated and scored, and key attributes—including alignment score, genomic position, and flowcell position—are stored in an indexed data structure.

For each read pair `X` that may benefit from proximity information, the mapper revisits the candidate alignments and searches the data structure for other read pairs `Y` whose alignment and flowcell positions suggest a shared template of origin. The proximity linking model quantifies the likelihood that `X` and `Y` originated from the same original DNA molecule. A Phred‑scaled score derived from this likelihood is incorporated into the corresponding joint alignment hypothesis.

## Template Tagging

During alignment, the mapper assigns each read a set of link probability scores that estimate the likelihood of links between the read and other nearby reads on the flowcell. Template tagging uses these scores to reconstruct the original template DNA molecules from which paired reads originated.

Template tagging begins by grouping reads into fragments, where each fragment consists of a paired‑end read pair. For each fragment, outgoing link probability scores are collected from the constituent reads. Links with Phred‑scaled quality below the threshold specified by `--proximity-min-linkq-threshold` (default: 10) are discarded.

The remaining high‑quality links are used to connect fragments into templates. Each connected set of fragments represents a reconstructed template molecule. All reads assigned to the same template are annotated with a shared template identifier in the BAM file (`BX:Z`), allowing reads originating from the same original DNA molecule to be identified downstream.
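Connecting fragments into templates through high‑quality links is a connected‑components problem. The sketch below uses a union‑find structure as one possible way to express it; it is not a description of DRAGEN's internal algorithm. The fragment indices, the `(fragment_a, fragment_b, linkq)` link tuples, and the use of the smallest member index as the template identifier are all assumptions made for the example. The actual `BX:Z` value format is not specified here.

```python
def tag_templates(num_fragments, links, min_linkq=10):
    """Group fragments into templates using links with linkq >= min_linkq.

    links: iterable of (fragment_a, fragment_b, linkq) tuples, where linkq is
    the Phred-scaled link quality. `min_linkq` plays the role of
    --proximity-min-linkq-threshold (default: 10).
    Returns a list mapping each fragment index to a template identifier.
    """
    parent = list(range(num_fragments))

    def find(i):
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, linkq in links:
        if linkq < min_linkq:
            continue  # links below the threshold are discarded
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(num_fragments)]


# Fragments 0-2 are joined by two accepted links; the 3-4 link is too weak.
template_ids = tag_templates(5, [(0, 1, 30), (1, 2, 12), (3, 4, 5)])
```

In this toy input, fragments 0, 1, and 2 end up in one template, while fragments 3 and 4 each remain in their own single‑fragment template because their only link falls below the threshold.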

### Outputs

Template tagging generates a set of metrics reports that describe characteristics of all discovered templates and links identified during the DRAGEN run. Reports are produced for whole‑genome data and for any specified QC regions.

A template or link is included in QC region metrics if any portion of its genomic span overlaps the QC region.

#### Template Metrics

**Template Subpair Count Report**

The template subpair count report, `<prefix>.<qc-region>_template_subpairs.csv`, summarizes the distribution of discovered templates by the number of fragments (subpairs) they contain. A *subpair* refers to a read‑pair fragment within a template.

Each record in the report describes the number of templates observed with a given fragment count and the corresponding percentage of all templates. Summary statistics are also reported, including the mean subpair count and the 25th, 50th, 75th, and 95th percentile subpair counts across all templates.

**Template Genomic Distance Report**

The template genomic distance report, `<prefix>.<qc-region>_template_gdist.csv`, describes the distribution of template genomic lengths from the 0th to the 100th percentile.

Template genomic length is defined as the genomic distance between the smallest and largest mapped genomic positions represented in the template, corresponding to the span from the start of the first fragment to the end of the last fragment.

Percentile values are interpolated from the distribution of all discovered template lengths and may therefore be non‑integer base‑pair values.
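To illustrate why interpolated percentiles can be non‑integer, the sketch below computes a template's genomic length from its fragments and then evaluates percentiles by linear interpolation between neighboring observations. The interpolation scheme used by DRAGEN is not specified in this guide, so linear interpolation is an assumption, and the function names are hypothetical.

```python
def template_genomic_length(fragments):
    """Span from the smallest start to the largest end across fragments.

    fragments: iterable of (start_bp, end_bp) tuples for one template.
    """
    fragments = list(fragments)
    return max(end for _, end in fragments) - min(start for start, _ in fragments)


def percentile(sorted_values, p):
    """Percentile p (0-100) by linear interpolation between order statistics."""
    if not sorted_values:
        raise ValueError("no values")
    rank = (len(sorted_values) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


lengths = sorted([1000, 2000, 4000, 10000])
median = percentile(lengths, 50)  # halfway between 2000 and 4000
```

Here the 50th percentile falls halfway between the two middle observations, which is how a percentile can land on a value that no individual template has.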

**Template Spatial Distance Reports**

Template spatial distance reports describe the distribution of template spatial extents in flowcell units (FCU) from the 0th to the 100th percentile. Two reports are generated:

* `<prefix>.<qc-region>_template_xdist.csv`, describing spatial extent along the flowcell X axis
* `<prefix>.<qc-region>_template_ydist.csv`, describing spatial extent along the flowcell Y axis

Template spatial length is defined as the distance between the smallest and largest flowcell coordinates represented in the template along the corresponding axis. As with genomic distances, percentile values are interpolated from the observed distribution and may be non‑integer FCU values.

**Template Length Thresholds Report**

The template length thresholds report, `<prefix>.<qc-region>_template_thresholds.csv`, summarizes the count and proportion of discovered templates whose genomic lengths exceed specified thresholds.

Template genomic length is defined as the span between the smallest and largest mapped genomic positions within a template.

Thresholds reported in this file are defined using the `--template-gdist-thresholds` option (default: 10000, 20000, 60000). Each record reports the threshold value, the number of templates meeting or exceeding that threshold, and the corresponding proportion of all discovered templates.
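The contents of each record can be reproduced with a short calculation. The sketch below is illustrative only: the function name and the tuple layout are hypothetical and do not reflect the actual CSV column names. It uses the default thresholds stated above.

```python
def threshold_report(template_lengths, thresholds=(10000, 20000, 60000)):
    """Count templates whose genomic length meets or exceeds each threshold.

    Returns a list of (threshold, count, proportion) tuples, where proportion
    is relative to all discovered templates.
    """
    total = len(template_lengths)
    rows = []
    for threshold in thresholds:
        count = sum(1 for length in template_lengths if length >= threshold)
        proportion = count / total if total else 0.0
        rows.append((threshold, count, proportion))
    return rows


rows = threshold_report([5000, 10000, 25000, 80000])
```

For the four example lengths, three templates reach 10,000 bp, two reach 20,000 bp, and one reaches 60,000 bp, giving proportions of 0.75, 0.5, and 0.25.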

#### Link Metrics

Link metrics are generated for each Phred‑scaled link quality threshold specified at runtime. These thresholds control which links are considered when computing proximity‑based metrics.

The following options determine link metric generation:

* `--proximity-min-linkq-threshold`
  * Specifies the primary link quality threshold used to accept or reject link hypotheses during template tagging (default: 10).
* `--proximity-additional-linkq-thresholds`
  * Specifies up to two additional link quality thresholds at which link metrics are computed (default: 25).

**Link Genomic Distance Report**

The link genomic distance report, `<prefix>.<qc-region>_proximity_gdist.csv`, describes the distribution of genomic distances for links that meet or exceed a specified link quality threshold.

Link genomic length is defined as the genomic distance between the two fragments connected by the link. Distances are reported from the 0th to the 100th percentile.

Percentile values are interpolated from the distribution of all discovered link lengths and may therefore be non‑integer base‑pair values.

**Link Spatial Distance Reports**

Link spatial distance reports describe the spatial extent of links in flowcell units (FCU) from the 0th to the 100th percentile. Two reports are generated for each link quality threshold:

* `<prefix>.<qc-region>_proximity_xdist.csv`, reporting spatial extent along the flowcell X axis
* `<prefix>.<qc-region>_proximity_ydist.csv`, reporting spatial extent along the flowcell Y axis

Link spatial length is defined as the distance between the flowcell coordinates of the two fragments connected by the link along the corresponding axis.

As with genomic distance metrics, percentile values are interpolated from the observed distribution and may be non‑integer flowcell unit values.

## Phasing

When TruPath data is used, DRAGEN performs read phasing upstream of variant calling and uses haplotype‑phased reads to generate phased variant calls. Phasing is informed by both long‑range proximity linking information provided by the TruPath library preparation and inference of the sample's ancestral haplotypes, which enables robust phasing across long genomic distances.

DRAGEN personalization provides the ancestral component of phasing information by inferring the sample’s ancestral haplotypes, such that phasing is typically inferred to be consistent with that observed in the ancestral haplotypes. As in the standard personalization workflow, DRAGEN also uses variants imputed from the haplotype database to inform prior probabilities for variants in the sample, providing a boost to variant calling performance.

### Phasing Model Overview

DRAGEN performs phasing at the level of small, contiguous genomic bins, typically 4,096 bp in length. Within each bin, haplotypes are inferred using the haplotype database in the reference hash table, and reads are assigned accordingly. Proximity linking information is used to propagate phasing information across bins.

Bins are grouped into larger, non‑overlapping phase blocks when there is sufficient evidence of co‑phasing. Each bin is phased in the context of ancestral haplotypes inferred from neighboring bins and from linked reads elsewhere in the genome.

### Phasing Options

Phasing is enabled automatically when proximity mode is enabled using `--enable-proximity=true`. No additional arguments are required. Default settings are recommended, but phasing behavior can be adjusted using the following options:

* `--personalization-phase-block-threshold`
  * Controls the amount of evidence required to group adjacent bins into a single phase block (default: 20).
* `--read-phasing-gene-list`
  * Specifies an optional GTF file used to compute gene‑based phasing metrics for genes fully contained within phase blocks.

Lowering the phase-block threshold parameter will reduce the amount of co-phasing evidence required to group adjacent personalization bins into a single phase block, and vice versa.

### Output Files

#### BAM/CRAM Output

The phased reads in the map/align output file are annotated with the following tags:

| Tag  | Description                                                                          | Values           |
| ---- | ------------------------------------------------------------------------------------ | ---------------- |
| `pp` | Phasing probability in Phred-scale log odds: $$10 \* \log\_{10}(P(H\_1) / P(H\_2))$$ | $$\[-127, 127]$$ |
| `HP` | Haplotype tag for all reads where $$\|{pp}\| \geq 10$$                               | $$1,2$$          |
| `PS` | Phase block tag                                                                      | $$\[0,2^{32})$$  |
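The relationship between `pp` and `HP` in the table can be sketched as follows. This is a hedged illustration, not DRAGEN's implementation: it assumes that `pp` is rounded to an integer and clamped to the documented range, and that positive log odds (favoring $$H\_1$$) map to `HP` 1 while negative log odds map to `HP` 2. The function name is hypothetical.

```python
import math


def phasing_tags(p_h1: float, p_h2: float):
    """Return (pp, hp) for a read given its two haplotype probabilities.

    pp is 10 * log10(P(H1) / P(H2)), rounded and clamped to [-127, 127].
    hp is 1 or 2 when |pp| >= 10, otherwise None (no HP tag).
    """
    if p_h1 <= 0 and p_h2 <= 0:
        raise ValueError("at least one probability must be positive")
    if p_h2 <= 0:
        pp = 127
    elif p_h1 <= 0:
        pp = -127
    else:
        pp = max(-127, min(127, round(10 * math.log10(p_h1 / p_h2))))

    if pp >= 10:
        hp = 1   # assumption: positive log odds favor haplotype 1
    elif pp <= -10:
        hp = 2   # assumption: negative log odds favor haplotype 2
    else:
        hp = None
    return pp, hp
```

Under these assumptions, a read with odds of 10:1 or better for one haplotype receives an `HP` tag, and a read with weaker evidence keeps its `pp` value but has no `HP` tag.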

#### Personalized Haplotypes

Personalized haplotypes for each phased bin are output in tab-delimited format (TSV). A summary of the phase blocks defined in the TSV file is also written in GTF format.

#### TSV (`<sample_id>.personal_haplotypes.tsv.gz`)

The personalized haplotypes TSV file contains the following columns:

| Column               | Description                                                                                                  |
| -------------------- | ------------------------------------------------------------------------------------------------------------ |
| `CHROM`              | Chromosome name                                                                                              |
| `START`              | Start position of the phased bin (0-based)                                                                   |
| `END`                | End position of the phased bin (1-based)                                                                     |
| `PHASE_BLOCK`        | Phase block ID for the bins. Bins with the same IDs are confidently co-phased.                               |
| `PHASING_CONFIDENCE` | Phasing confidence for the bin. Lower confidence values indicate a higher likelihood of haplotype switching. |

#### GTF (`<sample_id>.phase_blocks.gtf.gz`)

Regions covered by the phase blocks, as defined in the personalized TSV file's `PHASE_BLOCK` column, are also output in a GTF file with the following fields:

| Column      | Description                                 |
| ----------- | ------------------------------------------- |
| `seqname`   | Chromosome name                             |
| `source`    | Always 'dragen'                             |
| `feature`   | Always 'phaseblock'                         |
| `start`     | Start position of the phase block (1-based) |
| `end`       | End position of the phase block (1-based)   |
| `score`     | Unused ('.')                                |
| `strand`    | Unused ('.')                                |
| `frame`     | Unused ('.')                                |
| `attribute` | Always 'phase\_block n'                     |
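A minimal sketch for reading this file is shown below, assuming standard tab‑delimited GTF with nine fields per record and 1‑based inclusive coordinates, as in the table above. The function names are hypothetical, and the sketch keeps the `attribute` column as a raw string rather than interpreting it.

```python
import gzip


def parse_phase_blocks(lines):
    """Parse phase block records from GTF text lines.

    Returns a list of (seqname, start, end, length, attribute) tuples.
    Both coordinates are 1-based and inclusive, so length = end - start + 1.
    """
    blocks = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "phaseblock":
            continue
        start, end = int(fields[3]), int(fields[4])
        blocks.append((fields[0], start, end, end - start + 1, fields[8]))
    return blocks


def read_phase_blocks(path):
    """Read a gzip-compressed phase block GTF from `path`."""
    with gzip.open(path, "rt") as handle:
        return parse_phase_blocks(handle)
```

The resulting block lengths can be used to recompute summary statistics such as the phase block `N50`.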

#### Imputed Variants

Imputed variants for each phased bin are output in a VCF file. This VCF contains only variants imputed from the haplotype database in the reference hash table. It does not include novel variants observed in the sample, and multi‑allelic variants are split into separate records.

#### VCF (`<sample_id>.personal.vcf.gz`)

The VCF follows the VCF 4.2 standard. The relevant fields are described below:

| Tag         | Description                                                                                                                                            |
| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `QUAL`      | Phred-scale score for the marginal probability of ALT. For example, for a diploid variant: $$-10\*log\_{10}(P(\text{GT='0}\vert\text{0'}))$$           |
| `INFO:HAPS` | Two best haplotype pairs for the bin the variant belongs to                                                                                            |
| `INFO:PGP`  | Marginal probability for $$P(\text{GT='0}\vert\text{0'}),P(\text{GT='1}\vert\text{0'}) + P(\text{GT='0}\vert\text{1'}),P(\text{GT='1}\vert\text{1'})$$ |
| `FORMAT:PS` | Phase block ID for the bin the variant belongs to                                                                                                      |
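The relationship between `QUAL` and the first `INFO:PGP` value for a diploid variant can be checked with a short calculation. The sketch below assumes that the three `PGP` values are available as plain probabilities in the order shown in the table (homozygous reference, heterozygous, homozygous alternate). How the values are encoded in the file is not described here, and the function name is hypothetical.

```python
import math


def qual_from_pgp(pgp):
    """Phred-scaled QUAL from diploid marginal genotype probabilities.

    pgp: (P(0|0), P(1|0) + P(0|1), P(1|1)).
    QUAL = -10 * log10(P(GT = 0|0)); returns infinity when P(0|0) is zero.
    """
    p_hom_ref = pgp[0]
    if p_hom_ref <= 0:
        return math.inf
    return -10.0 * math.log10(p_hom_ref)


qual = qual_from_pgp((0.001, 0.9, 0.099))  # about 30
```

A homozygous‑reference probability of 0.001 corresponds to a `QUAL` of about 30, so higher `QUAL` values indicate stronger support for at least one ALT allele.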

#### Phasing Metrics

DRAGEN reports a set of phasing metrics for each TruPath analysis and writes them to a summary CSV file. Reported metrics include phase block length statistics (`N50`, `L50`, `NG50`, `LG50`), cumulative phase block lengths, counts of fully phased genomic windows, and counts of fully phased genes. Gene‑based metrics are reported only when a gene list is provided using `--read-phasing-gene-list`.

#### CSV (`<sample_id>.phasing_summary_stats.csv`)

| Metric                                   | Description                                                                                                                                                    |
| ---------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Phasing chromosomes`                    | A list of the chromosomes used to calculate the metrics. Only autosomes with phased reads are considered.                                                      |
| `N50`                                    | The length of the shortest phase block where all phase blocks of at least that length account for ≥50% of the cumulative phase block length.                   |
| `L50`                                    | The smallest number of phase blocks that account for 50% of the cumulative phase block length.                                                                 |
| `NG50`                                   | The length of the shortest phase block where all phase blocks of at least that length account for ≥50% of the cumulative length of the phasing chromosome set. |
| `LG50`                                   | The smallest number of phase blocks that account for 50% of the cumulative length of the phasing chromosome set.                                               |
| `Total phase block length for L50/N50`   | The cumulative length of the phase-block assembly.                                                                                                             |
| `Total phase block length for LG50/NG50` | The cumulative length of the chromosome set.                                                                                                                   |
| `Number of fully phased 300 kbp windows` | After partitioning each chromosome into 300 kbp windows, the number of such windows that are each fully contained within a single phase block.                 |
| `Number of fully phased genes`           | The number of genes that are each fully contained within a single phase block.                                                                                 |
| `Gene list`                              | The filename of the gene list used to calculate the number of fully phased genes.                                                                              |
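
The contiguity metrics in the table can be illustrated with a short sketch. This is a minimal Python example written from the definitions above, not DRAGEN's implementation; the function name `n50_l50` and the block lengths are hypothetical.

```python
from typing import List, Optional, Tuple


def n50_l50(block_lengths: List[int],
            total: Optional[int] = None) -> Tuple[Optional[int], Optional[int]]:
    """Return (N50, L50) for a list of phase block lengths.

    If `total` is given (for example, the summed length of the phasing
    chromosome set), the same walk yields (NG50, LG50) instead.
    Returns (None, None) if the blocks never reach 50% of the denominator.
    """
    denominator = sum(block_lengths) if total is None else total
    cumulative = 0
    # Walk the blocks from longest to shortest until half the denominator is covered.
    for count, length in enumerate(sorted(block_lengths, reverse=True), start=1):
        cumulative += length
        if 2 * cumulative >= denominator:
            return length, count
    return None, None


blocks = [50, 40, 30, 20, 10]        # hypothetical phase block lengths
print(n50_l50(blocks))               # N50, L50 (denominator is the block total, 150)
print(n50_l50(blocks, total=200))    # NG50, LG50 for a chromosome set of length 200
```

Because NG50 and LG50 use the chromosome set length as the denominator, they can be undefined when phase blocks cover less than half of that length; the sketch returns `None` in that case.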

## Structural Variant Calling

TruPath‑specific structural variant (SV) calling is supported only in single‑sample whole‑genome germline SV discovery mode. DRAGEN‑SV leverages proximity information indirectly through phasing information encoded in the reads, rather than using proximity links directly during SV detection.

This approach provides several key advantages. Candidate regions are assembled separately by haplotype, which reduces assembly graph complexity and produces higher‑quality contigs. Features used by the machine‑learning (ML) model are also segregated by haplotype, enabling improved training and inference. As a result, heterozygous SVs can be distinguished and assigned to specific local haplotypes.

### Leveraging TruPath Proximity-Linked Features

DRAGEN‑SV currently incorporates proximity information indirectly by using phasing information during candidate assembly and ML‑based filtering. For best accuracy, ML filtering should remain enabled.

#### Phased Assembly

Reads collected for candidate assembly are partitioned into two haplotypes based on available phasing information. Each haplotype is assembled independently, resulting in at most one contig per haplotype. Up to two contigs per candidate are propagated through downstream stages of the pipeline.

#### ML Processing

When run with TruPath data, DRAGEN‑SV uses an ML model trained on TruPath‑derived features that depend on read‑level phasing, in addition to features used with standard Illumina sequencing data. Enabling ML processing is critical for achieving optimal SV calling accuracy.

#### Collapsing, Deduplication, Regenotyping

Structural variants of certain types, including insertions and deletions, may be produced from multiple phased assembly rounds. These SVs are collapsed and deduplicated when they are inferred to represent the same event before being written to the VCF output. SV type, length, genomic location, genotype scores, and haplotype of origin are used to determine equivalence.

During this process, genotypes may be updated. For example, if a heterozygous SV is produced only from reads phased to the first haplotype, the genotype `GT` field is set to `1/0`. If two SVs originating from different haplotypes are collapsed into a single event, the resulting SV is re‑genotyped as `1/1`.

### SV VCF Outputs

The following VCF fields are added for TruPath:

INFO Fields

| ID           | Description                                                                                                 |
| ------------ | ----------------------------------------------------------------------------------------------------------- |
| `PHASEDASM`  | Haplotype of the reads used for the assembly yielding the SV (only with `--enable-proximity=true`)          |
| `ML_UPDATED` | The FILTER status changed from PASS to non-PASS, or from non-PASS to PASS, after QUAL was recalibrated by ML |

FORMAT Fields

| ID     | Description                     |
| ------ | ------------------------------- |
| `MLQS` | ML-recalibrated QUAL for indels |

FILTER Fields

| ID       | Level  | Description                                                                                                                                                 |
| -------- | ------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `MLFail` | Record | Prob(TP) is less than SV\_ML\_MIN\_PASS\_DEL\_PROB for deletions, or less than SV\_ML\_MIN\_PASS\_INS\_PROB for insertions ([default values](#default-values)). |

## Multi-Region Joint Detection

DRAGEN Multi-Region Joint Detection (MRJD) is a germline small variant caller for paralogous regions. When used with TruPath data, MRJD produces haplotype‑resolved variant calls by leveraging proximity linking information enabled by TruPath. This approach does not rely on known population haplotypes.

With TruPath data, MRJD currently supports nine sets of paralogous regions encompassing 15 clinically relevant genes. Table 1 lists the hg38 genomic coordinates covered by MRJD. MRJD is compatible only with the **hg38** reference genome.

| Chromosome | Start     | End       | Region name   | Paralog set name        | Paralog type |
| ---------- | --------- | --------- | ------------- | ----------------------- | ------------ |
| chr1       | 196786972 | 196827189 | CFHR3-CFHR1   | CFHR3-CFHR1-CFHR4-CFHR2 | Non-tandem   |
| chr1       | 196911497 | 196951222 | CFHR4-CFHR2   | CFHR3-CFHR1-CFHR4-CFHR2 | Non-tandem   |
| chr5       | 70924941  | 70966375  | SMN1          | SMN1-SMN2               | Non-tandem   |
| chr5       | 70049523  | 70090528  | SMN2          | SMN1-SMN2               | Non-tandem   |
| chr6       | 32037415  | 32045473  | CYP21A2-TNXB  | CYP21A2                 | Tandem       |
| chr6       | 32004679  | 32012619  | CYP21A1P-TNXA | CYP21A2                 | Tandem       |
| chr7       | 5969485   | 5987844   | PMS2          | PMS2-PMS2CL             | Non-tandem   |
| chr7       | 6736851   | 6755308   | PMS2CL        | PMS2-PMS2CL             | Non-tandem   |
| chr7       | 74771000  | 74791999  | NCF1          | NCF1-NCF1B-NCF1C        | Non-tandem   |
| chr7       | 73217606  | 73238630  | NCF1B         | NCF1-NCF1B-NCF1C        | Non-tandem   |
| chr7       | 75153934  | 75174978  | NCF1C         | NCF1-NCF1B-NCF1C        | Non-tandem   |
| chr8       | 142873164 | 142879856 | CYP11B1       | CYP11B1-CYP11B2         | Tandem       |
| chr8       | 142910764 | 142917883 | CYP11B2       | CYP11B1-CYP11B2         | Tandem       |
| chr15      | 43599563  | 43618800  | STRC          | STRC-STRCP1             | Tandem       |
| chr15      | 43699418  | 43718260  | STRCP1        | STRC-STRCP1             | Tandem       |
| chr22      | 18159724  | 18174315  | USP18         | USP18-USP41P            | Non-tandem   |
| chr22      | 20362649  | 20377695  | USP41P        | USP18-USP41P            | Non-tandem   |
| chr22      | 42123192  | 42132193  | CYP2D6        | CYP2D6-CYP2D7           | Tandem       |
| chr22      | 42135344  | 42145873  | CYP2D7        | CYP2D6-CYP2D7           | Tandem       |

Table 1. Paralogous regions covered by MRJD.

### Method

MRJD begins by collecting all primary alignments within the paralogous regions of interest, regardless of mapping quality. For each paralogous region set (for example, *SMN1–SMN2*), MRJD estimates the total copy number by leveraging read depth across the regions of interest and a set of pre‑selected stable regions elsewhere in the genome.

Using the estimated total copy number, read sequences, and proximity linking information, MRJD constructs the corresponding number of copies for each paralogous region set. For non‑tandem paralogous regions, proximity information is used to assign each constructed copy to the genomic region from which it most likely originated (for example, *PMS2* versus *PMS2CL*). For tandem paralogous regions, proximity information is instead used to assign each copy to the maternal or paternal haplotype.

Finally, MRJD calls small variants based on the constructed copies and reports variant calls together with their assigned genomic regions or haplotypes.
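
The depth-based copy number step can be pictured with a simplified calculation. The sketch below is an illustration only and is not MRJD's actual estimator: it assumes that the summed average depth over the paralogous regions scales linearly with total copy number and that the stable regions are diploid. The function name and depth values are hypothetical.

```python
def estimate_joint_copy_number(region_depths, stable_depth, stable_ploidy=2):
    """Return (float estimate, rounded estimate) of the total copy number.

    region_depths: average read depth in each paralogous region of the set.
    stable_depth:  average read depth across the stable reference regions,
                   assumed here to be present in `stable_ploidy` copies.
    """
    if stable_depth <= 0:
        raise ValueError("stable_depth must be positive")
    depth_per_copy = stable_depth / stable_ploidy
    estimate = sum(region_depths) / depth_per_copy
    return estimate, round(estimate)


# Hypothetical depths for a two-region set such as SMN1-SMN2.
cn_float, cn_int = estimate_joint_copy_number([30.0, 29.6], stable_depth=30.0)
print(f"{cn_float:.3f} -> {cn_int}")
```

Summing depth across all regions in a set avoids depending on which paralog an ambiguous read happened to be aligned to, which is consistent with collecting alignments regardless of mapping quality.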

The figure below provides an overview of the MRJD workflow using TruPath data.

<div align="center"><img src="https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-659ae9976e6cad4d7b6f8cec18828358a2a1de41%2Fmrjd_constellation_workflow.png?alt=media" alt="" width="700"></div>

### Outputs

Upon analysis completion, DRAGEN produces the following MRJD output files in the directory specified by `--output-directory`, using the prefix defined by `--output-file-prefix`:

* `<prefix>.mrjd.hard-filtered.vcf.gz`
  * VCF file containing small variants called by MRJD in paralogous regions.
* `<prefix>.mrjd.json`
  * JSON file containing MRJD results, including copy number estimates, region or haplotype assignments for each copy, and run status for each paralogous region.
* `<prefix>.mrjd.phased.bam`
  * BAM file containing phased read alignments within paralogous regions.
* `mrjd_supporting_files/`
  * A directory containing additional files that support MRJD visualization, including:
    * `<prefix>.mrjd.<paralog_name>.vcf.gz`
      * Multi‑column VCF file containing MRJD variant calls for each paralogous region (one column per copy). One file is generated for each paralogous region set.
    * `<prefix>.mrjd.reference_region_alignments.sam`
      * SAM file containing reference region alignments used by MRJD.

#### MRJD VCF Output

The MRJD caller generates a gzip‑compressed VCFv4.2 file, `<prefix>.mrjd.hard-filtered.vcf.gz`, containing small variants derived from the inferred genotypes.

For a given set of paralogous regions, all copies are reported under each region. Each copy is annotated with its assigned genomic region or haplotype in the FORMAT fields, depending on the paralog structure.

For non‑tandem paralogous regions, the `REGION_PLACEMENT` field in the `FORMAT` column indicates the genomic region assignment for each copy, following the order of entries in the genotype field. Values indicate assignment to the current region, assignment to an alternate region, or an unplaced copy.

| #CHROM | POS      | ID | REF | ALT    | QUAL | FILTER | INFO                                      | FORMAT                                      | \<prefix>                                          |
| ------ | -------- | -- | --- | ------ | ---- | ------ | ----------------------------------------- | ------------------------------------------- | -------------------------------------------------- |
| chr5   | 70052190 | .  | C   | CA     | 500  | .      | regionGroupName=SMN1-SMN2;REF\_DIFF\_SITE | GT:REGION\_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS | 1\|0\|0\|0:A,A,I,I:.:500:90,30:0.250:120:70052190  |
| chr5   | 70052613 | .  | T   | C      | 500  | .      | regionGroupName=SMN1-SMN2                 | GT:REGION\_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS | 1\|0\|0\|0:A,A,I,I:.:500:86,35:0.289:121:70052190  |
| chr5   | 70052881 | .  | C   | CAAAAA | 500  | .      | regionGroupName=SMN1-SMN2;REF\_DIFF\_SITE | GT:REGION\_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS | 1\|0\|0\|0:A,A,I,I:.:500:93,28:0.231:121:70052190  |
| chr5   | 70053733 | .  | TC  | T      | 500  | .      | regionGroupName=SMN1-SMN2                 | GT:REGION\_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS | 0\|1\|0\|0:A,A,I,I:.:500:85,32:0.274:117:70052190  |
| chr5   | 70053985 | .  | CT  | C      | 500  | .      | regionGroupName=SMN1-SMN2                 | GT:REGION\_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS | 0\|1\|0\|1:A,A,I,I:.:500:67,65:0.492:132:70052190  |
| chr5   | 70054456 | .  | TA  | T      | 500  | .      | regionGroupName=SMN1-SMN2                 | GT:REGION\_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS | 0\|1\|1\|1:A,A,I,I:.:500:22,105:0.827:127:70052190 |

For tandem paralogous regions, the `PSL` field in the `FORMAT` column indicates haplotype assignment for each copy, again following the order of entries in the genotype field. `hap1` and `hap2` correspond to assignment to the first and second haplotypes, respectively. Because tandem copies cannot be assigned to specific genomic regions, the `REGION_PLACEMENT` field is not applicable and is populated with `U` (unplaced) for all copies.

| #CHROM | POS      | ID | REF | ALT | QUAL  | FILTER | INFO                                    | FORMAT                                          | \<prefix>                                                                                           |
| ------ | -------- | -- | --- | --- | ----- | ------ | --------------------------------------- | ----------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| chr6   | 32004754 | .  | T   | C   | 63.01 | .      | regionGroupName=CYP21A2;REF\_DIFF\_SITE | GT:PSL:REGION\_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS | 1\|0\|1\|0:copy1\_hap1,copy2\_hap1,copy3\_hap2,copy4\_hap2:U,U,U,U:0.78:1:57,54:0.486:111:32004754  |
| chr6   | 32004791 | .  | G   | A   | 63.01 | .      | regionGroupName=CYP21A2;REF\_DIFF\_SITE | GT:PSL:REGION\_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS | 1\|0\|1\|0:copy1\_hap1,copy2\_hap1,copy3\_hap2,copy4\_hap2:U,U,U,U:0.78:1:62,56:0.475:118:32004754  |
| chr6   | 32004857 | .  | C   | T   | 63.01 | .      | regionGroupName=CYP21A2;REF\_DIFF\_SITE | GT:PSL:REGION\_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS | 1\|0\|1\|0:copy1\_hap1,copy2\_hap1,copy3\_hap2,copy4\_hap2:U,U,U,U:0.78:1:51,53:0.510:104:32004754  |
| chr6   | 32004862 | .  | C   | T   | 63.01 | .      | regionGroupName=CYP21A2;REF\_DIFF\_SITE | GT:PSL:REGION\_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS | 1\|0\|1\|0:copy1\_hap1,copy2\_hap1,copy3\_hap2,copy4\_hap2:U,U,U,U:0.78:1:48,55:0.534:103:32004754  |
| chr6   | 32004868 | .  | G   | A   | 63.01 | .      | regionGroupName=CYP21A2;REF\_DIFF\_SITE | GT:PSL:REGION\_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS | 1\|0\|1\|0:copy1\_hap1,copy2\_hap1,copy3\_hap2,copy4\_hap2:U,U,U,U:0.78:1:49,55:0.529:104:32004754  |
| chr6   | 32005002 | .  | G   | A   | 63.01 | .      | regionGroupName=CYP21A2                 | GT:PSL:REGION\_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS | 1\|0\|0\|0:copy1\_hap1,copy2\_hap1,copy3\_hap2,copy4\_hap2:U,U,U,U:0.78:1:102,30:0.227:132:32004754 |
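
The per-copy annotations in the sample column can be unpacked with a few lines of Python. The sketch below is a minimal example based only on the rows shown above; the function name is hypothetical, and it assumes that the `GT`, `REGION_PLACEMENT`, and (when present) `PSL` entries list the copies in the same order.

```python
import re


def parse_mrjd_sample(format_field, sample_field):
    """Split an MRJD sample column into per-copy records.

    Returns (copies, values): `copies` has one dict per copy, and `values`
    maps every FORMAT key to its raw string value.
    """
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    alleles = re.split(r"[|/]", values["GT"])
    placements = values["REGION_PLACEMENT"].split(",")
    # PSL appears only in the tandem example; use None where it is absent.
    if "PSL" in values:
        labels = values["PSL"].split(",")
    else:
        labels = [None] * len(alleles)
    copies = [
        {"copy": index + 1, "allele": allele, "region_placement": place, "psl": label}
        for index, (allele, place, label) in enumerate(zip(alleles, placements, labels))
    ]
    return copies, values


# First SMN1-SMN2 row from the non-tandem example above.
copies, values = parse_mrjd_sample(
    "GT:REGION_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS",
    "1|0|0|0:A,A,I,I:.:500:90,30:0.250:120:70052190",
)
for record in copies:
    print(record)
```

Note that the backslashes in the tables above are Markdown escapes; the VCF file itself contains plain `|` and `_` characters.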

#### MRJD JSON Output

The MRJD caller generates a `<prefix>.mrjd.json` file in the output directory. This JSON‑formatted file contains detailed information for each paralogous region analyzed, including total copy number estimates, genomic region assignment for each copy, and haplotype assignment where applicable.

For each paralogous region, the total copy number is reported under `jointCopyNumber`. The `mrjdRunStatus` field indicates whether MRJD completed successfully for the region, with `Success` indicating a successful run and `Aborted` indicating a failure.

For non‑tandem paralogous regions, the JSON output includes copy‑to‑region assignments. For each copy reported in the corresponding VCF file (following the order of entries in the genotype field), the `regionPlacement` field indicates which genomic region the copy is assigned to.

For tandem paralogous regions, the JSON output reports haplotype assignments rather than genomic region placement. For each copy reported in the VCF file, the `locusStructure` field indicates the haplotype to which the copy is assigned. Because tandem copies cannot be uniquely mapped to specific genomic locations, all copies are listed as `unplaced` under `regionPlacement`. The example JSON outputs below illustrate these differences for non-tandem and tandem paralogous regions.

Below is an example of the JSON output for a non-tandem paralogous region:

```json
{
    "regionGroupName": "SMN1-SMN2",
    "region1Coord": "chr5:70924941-70965975",
    "region1Name": "SMN1",
    "region2Coord": "chr5:70049523-70090528",
    "region2Name": "SMN2",
    "jointCopyNumber": "4",
    "jointCopyNumberFloat": "3.972865",
    "regionPlacement": {
        "SMN1": [
            "copy1",
            "copy2"
        ],
        "SMN2": [
            "copy3",
            "copy4"
        ]
    },
    "mrjdRunStatus": "Success"
}
```

Below is an example of the JSON output for a tandem paralogous region:

```json
{
    "regionGroupName": "CYP21A2",
    "region1Coord": "chr6:32037415-32045473",
    "region1Name": "CYP21A2-TNXB",
    "region2Coord": "chr6:32004679-32012619",
    "region2Name": "CYP21A1P-TNXA",
    "jointCopyNumber": "4",
    "jointCopyNumberFloat": "3.892923",
    "locusStructure": {
        "hap1": [
            [
                "copy1"
            ],
            [
                "copy2"
            ]
        ],
        "hap2": [
            [
                "copy3"
            ],
            [
                "copy4"
            ]
        ]
    },
    "regionPlacement": {
        "unplaced": [
            [
                "copy1",
                "copy2",
                "copy3",
                "copy4"
            ]
        ]
    },
    "mrjdRunStatus": "Success"
}
```
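
The two layouts can be handled with a single helper. The sketch below is a minimal example that operates on one region object shaped like the examples above; the function name is hypothetical, and how region objects are arranged at the top level of `<prefix>.mrjd.json` is not shown here, so loading the file is left out.

```python
import json


def copy_assignments(region):
    """Map each copy label to its region (non-tandem) or haplotype (tandem)."""
    assignments = {}
    if "locusStructure" in region:
        # Tandem: each haplotype holds a list of lists of copy labels.
        for haplotype, groups in region["locusStructure"].items():
            for group in groups:
                for copy_label in group:
                    assignments[copy_label] = haplotype
    else:
        # Non-tandem: each region name holds a flat list of copy labels.
        for name, copy_labels in region.get("regionPlacement", {}).items():
            for copy_label in copy_labels:
                assignments[copy_label] = name
    return assignments


# Abbreviated versions of the two example objects above.
non_tandem = json.loads("""{
    "regionGroupName": "SMN1-SMN2", "jointCopyNumber": "4",
    "regionPlacement": {"SMN1": ["copy1", "copy2"], "SMN2": ["copy3", "copy4"]},
    "mrjdRunStatus": "Success"}""")
tandem = json.loads("""{
    "regionGroupName": "CYP21A2", "jointCopyNumber": "4",
    "locusStructure": {"hap1": [["copy1"], ["copy2"]], "hap2": [["copy3"], ["copy4"]]},
    "regionPlacement": {"unplaced": [["copy1", "copy2", "copy3", "copy4"]]},
    "mrjdRunStatus": "Success"}""")

print(copy_assignments(non_tandem))
print(copy_assignments(tandem))
print(int(tandem["jointCopyNumber"]))  # copy numbers are stored as strings
```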

#### MRJD Phased BAM Output

The MRJD caller generates a phased alignment file, `<prefix>.mrjd.phased.bam`, in the output directory. This file contains phased read alignments within paralogous regions.

As with the MRJD VCF output, all copies for a given set of paralogous regions are reported under each corresponding region. The phased BAM file enables inspection of read‑to‑copy assignments and phasing relationships within paralogous loci.

The following tags are added to the BAM records in the phased BAM file:

* `HP` - Copy label assigned to the read. For non-tandem paralogs, copy labels correspond to genomic regions (for example, `copy1_SMN1`, `copy2_SMN2`). For tandem paralogs, copy labels correspond to haplotypes (for example, `copy1_hap1`, `copy2_hap1`).
* `PC` - Phred-scaled confidence score for the read-to-copy assignment.
* `PS` - Phasing set identifier.
* `BX` - Template identifier based on proximity linking information. Fragments with the same `BX` tag are likely to originate from the same original DNA molecule.

The output format may be BAM, CRAM, or SAM, depending on the value specified for the `--output-format` option in the DRAGEN run.

#### MRJD Supporting Files

The MRJD caller generates an `mrjd_supporting_files/` directory in the output directory. This directory contains files that support MRJD variant interpretation and visualization.

The following files are produced:

* `<prefix>.mrjd.<paralog_name>.vcf.gz`
  * A multi‑column VCF file containing small variants called by MRJD for each paralogous region. Each copy is represented as a separate column. This file is suitable for visualizing haplotype‑resolved variants in genome browsers, such as IGV, that support multi‑column VCF format.
* `<prefix>.mrjd.reference_region_alignments.sam`
  * A SAM file containing reference region alignments used by MRJD. This file provides context for reference sequence differences between paralogous regions and can aid in interpreting variant calls, including the identification of gene conversion events.

### Visualize MRJD Results in IGV

MRJD results can be inspected in IGV by loading the multi‑column VCF file, the phased BAM file, and the reference region alignments SAM file generated by the pipeline:

* `mrjd_supporting_files/<prefix>.mrjd.SMN1-SMN2.vcf.gz`
* `<prefix>.mrjd.phased.bam`
* `mrjd_supporting_files/<prefix>.mrjd.reference_region_alignments.sam`

In the multi‑column VCF file, all *SMN1* and *SMN2* copies are reported under the *SMN1* region and are also listed under the *SMN2* region. Copy‑to‑region assignments are indicated in the sample column. In the example shown below, copies 1, 2, and 3 are assigned to the *SMN1* region, while copy 4 is assigned to the *SMN2* region.

The phased BAM file displays reads assigned to each copy. In IGV, this can be visualized by loading the BAM file and grouping alignments by phase.

The reference region alignments SAM file highlights sequence differences between the *SMN1* and *SMN2* reference regions, providing context for interpreting copy‑specific variant assignments.

<div align="center"><img src="https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-2d12effc1d0d1aab860bdeb2ff594cec4ec8eaa4%2Fmrjd_constellation_igv_example.png?alt=media" alt="" width="900"></div>

### Visualize MRJD Results in DRAGEN Reports

MRJD results are integrated into DRAGEN Reports. For sample‑level reports, MRJD results are available under the **Paralogs** tab.

The **Paralog Sets** table provides an overview of each paralogous region analyzed, including the estimated total copy number. Selecting a region opens the **Paralogous regions** view, which displays haplotype‑resolved variant calls within each paralogous region.

The example shown below illustrates MRJD phased variant calls for **PMS2–PMS2CL**. In this visualization, dark orange indicates the alternative allele at a reference difference site between paralogous regions, light orange indicates the reference allele at a reference difference site, and gray indicates a non‑reference difference site variant.

![MRJD DRAGEN reports Example](https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-05cdbb8bfa0e06da6d9d6183985a4208143891dc%2Fmrjd_dragen_reports_example.png?alt=media)

### Notes

* MRJD supports paralogous region calling only when the estimated total copy number is less than eight. Regions with higher copy numbers are skipped, and no variants are called; however, total copy number estimates are still reported in the JSON output.
* MRJD supports only the hg38 reference genome.
* Variant calling is supported only when the sample average linked coverage (excluding duplicates) is ≥16×.
* MRJD currently supports small variant calling only.

## STR Calling

TruPath data improves mapping accuracy for long short tandem repeats (STRs) by leveraging proximity linking information to place repetitive read pairs, including in‑repeat reads (IRRs), at their correct genomic locations. This enables more accurate sizing of STR expansions, particularly for large repeats that exceed the fragment length.

DRAGEN also uses phasing information to improve STR genotyping accuracy, which is especially important for large heterozygous expansions. IRR recovery, proximity linking, and phasing‑aware genotyping are combined and applied automatically when running the DRAGEN Germline pipeline.

All required resource files are automatically detected for supported reference genomes.

### In-Repeat Read (IRR) Recovery

IRR recovery is supported for repeat motifs with lengths between 2 and 6 bases. Motifs outside this range are not evaluated by IRR recovery, even if they are present in the catalog.

DRAGEN uses proximity information to recover in‑repeat reads (IRRs) that would otherwise remain unmapped or misaligned. This capability is particularly important for detecting large repeat expansions that exceed the fragment length. Although the mapper accounts for proximity information to improve alignment, IRRs require additional handling due to their low‑complexity sequence content.

IRR recovery is enabled by default when DRAGEN is run in proximity mode. DRAGEN‑STR automatically adjusts its parameters accordingly, and disabling IRR recovery is not recommended when analyzing samples for repeat expansions.

IRR recovery relies on a BED catalog that defines candidate STR regions and their associated repeat motifs. The catalog may include multiple entries for the same genomic region, allowing different motifs to be specified for a single STR locus.

For example, the *RFC1* locus can be represented in the catalog as follows:

| Chromosome | Start    | End      | Sequence | Name |
| ---------- | -------- | -------- | -------- | ---- |
| 4          | 39348424 | 39348479 | AAAAG    | RFC1 |
| 4          | 39348424 | 39348479 | AAAGG    | RFC1 |
| 4          | 39348424 | 39348479 | AAGGG    | RFC1 |
| 4          | 39348424 | 39348479 | AAGAG    | RFC1 |
| 4          | 39348424 | 39348479 | AACGG    | RFC1 |
| 4          | 39348424 | 39348479 | ACGGG    | RFC1 |
| 4          | 39348424 | 39348479 | ACAGG    | RFC1 |
| 4          | 39348424 | 39348479 | AAAGGG   | RFC1 |

DRAGEN provides BED catalogs for IRR recovery that cover all loci in the default DRAGEN-STR catalogs. The default BED catalogs are located in the `<INSTALL_PATH>/resources/irr_recovery/` directory.

When using a supported reference genome and the default catalogs, IRR recovery is enabled automatically and does not require additional command‑line arguments.

#### Custom Catalogs

DRAGEN supports custom BED catalogs for in-repeat read (IRR) recovery through the `--irr-recovery-str-bed` command‑line option. Custom catalogs must follow the same format as the default catalogs provided by DRAGEN.

When a custom catalog is supplied, DRAGEN uses it in place of the default catalog for the selected reference genome. It is important to ensure that the custom catalog includes all loci of interest for repeat expansion detection. If a locus is missing from the catalog, IRR recovery is not performed for that locus, which may reduce sensitivity.

DRAGEN-provided built‑in catalogs are available for download from the [DRAGEN Product Files Site](https://support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform/product_files.html) and can serve as a starting point for generating custom catalogs.
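
Before supplying a custom catalog, it can help to check that every locus of interest is present and that each motif falls within the supported length range. The sketch below is a minimal example that assumes a tab-separated BED file with the five columns shown in the *RFC1* table (chromosome, start, end, motif, name); the function name and the 7-base entry are hypothetical.

```python
from collections import defaultdict


def load_irr_catalog(lines, min_len=2, max_len=6):
    """Group catalog motifs by locus and report motifs outside the supported range.

    Returns (loci, skipped): `loci` maps (chrom, start, end, name) to the list
    of motifs that IRR recovery would evaluate; `skipped` lists motifs whose
    length falls outside min_len..max_len.
    """
    loci = defaultdict(list)
    skipped = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, start, end, motif, name = line.split("\t")[:5]
        if not min_len <= len(motif) <= max_len:
            skipped.append(motif)
            continue
        loci[(chrom, int(start), int(end), name)].append(motif.upper())
    return dict(loci), skipped


catalog_lines = [
    "4\t39348424\t39348479\tAAAAG\tRFC1",
    "4\t39348424\t39348479\tAAGGG\tRFC1",
    "4\t39348424\t39348479\tAAAGGG\tRFC1",
    "4\t39348424\t39348479\tAAAAGGG\tRFC1",  # hypothetical 7-base motif
]
loci, skipped = load_irr_catalog(catalog_lines)
print(loci)
print(skipped)
```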

#### IRR Recovery BAM Tags

Remapped IRRs are annotated in the output BAM file using the `tr` tag. The `tr` tag encodes the repeat motif and motif length in a 16‑bit packed representation:

* The lower 12 bits encode the motif bases using 2‑bit encoding `[A=00,C=01,G=10,T=11]`
* The upper 4 bits encode the motif length.
* Bases are ordered from least significant to most significant bit.

For example, for the motif *AAGGG*, the upper 4 bits hold the motif length (5) and the lower 12 bits hold the 2‑bit codes of its five bases.

To avoid redundant motif representations, the packed form always corresponds to the shortest motif pattern and the lexicographically smallest rotation across the forward motif and its reverse complement. For example, the motif *CACA* is represented as *AC*.
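
The packing and canonicalization rules can be sketched in Python. This is an illustration under stated assumptions rather than DRAGEN's implementation: it reads "ordered from least significant to most significant bit" as placing the first motif base in the two lowest bits, and the function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Return the shortest repeating unit in its lexicographically smallest
    rotation, considering both the unit and its reverse complement."""
    motif = motif.upper()
    unit = motif
    for size in range(1, len(motif) + 1):
        if len(motif) % size == 0 and motif[:size] * (len(motif) // size) == motif:
            unit = motif[:size]
            break
    revcomp = unit.translate(COMPLEMENT)[::-1]
    rotations = [s[i:] + s[:i] for s in (unit, revcomp) for i in range(len(s))]
    return min(rotations)


def pack_tr(motif):
    """Pack a motif into 16 bits: length in bits 12-15, bases in bits 0-11."""
    unit = canonical_motif(motif)
    if len(unit) > 6:
        raise ValueError("motif does not fit in 12 bits")
    value = len(unit) << 12
    for index, base in enumerate(unit):
        value |= BASE_CODE[base] << (2 * index)  # assumed: first base in lowest bits
    return value


def unpack_tr(value):
    """Recover the motif string from a packed value."""
    length = value >> 12
    return "".join(CODE_BASE[(value >> (2 * index)) & 0b11] for index in range(length))


print(canonical_motif("CACA"))
print(hex(pack_tr("AAGGG")), unpack_tr(pack_tr("AAGGG")))
```

If the actual bit order within the lower 12 bits differs from this assumption, only the shift in `pack_tr` and `unpack_tr` changes; the canonicalization step is unaffected.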

The `tr` tag is applied to all IRRs recovered using proximity information. Remapped IRRs are assigned a single alignment position corresponding to the first base of the associated STR region in the reference genome and are marked as unmapped with MAPQ 0.

### Phasing

When proximity mode is enabled, DRAGEN uses available phasing information to improve the accuracy of repeat expansion genotyping. Phasing helps resolve ambiguities in assigning reads to haplotypes in diploid regions, which is particularly important for accurately estimating repeat sizes in large heterozygous expansions.

Output calls remain unphased and are reported using the standard VCF format for short tandem repeat (STR) variants. However, the underlying genotyping model incorporates phasing information to improve repeat size estimates.

### Sequencing Efficiency Correction

Some loci are affected by sequencing biases that result in uneven coverage across alleles. These biases can reduce the accuracy of repeat expansion genotyping.

When proximity mode is enabled, DRAGEN applies a sequencing efficiency correction to adjust expected coverage at each locus based on empirical data. This correction improves repeat size estimates by compensating for systematic sequencing bias. To minimize confounding effects from mapping bias, sequencing efficiency correction is enabled only for TruPath samples.

Sequencing efficiency correction can be applied on a per-locus basis by adding the `SequencingEfficiencyCorrection` field to the respective catalog entry. For example:

```yaml
{
     "LocusId": "DMPK",
     "LocusStructure": "(CAG)*",
     "ReferenceRegion": "chr19:45770204-45770264",
     "VariantType": "Repeat",
     "SequencingEfficiencyCorrection": 1.2345 # example correction factor
}
```

Correction factors should be determined empirically from a set of control samples whose repeat sizes are known from orthogonal methods. The default catalogs include precomputed correction factors calibrated for the following loci:

* *FMR1*
* *DMPK*
* *FXN*

## Colocation Maps

Colocation maps capture proximity information to characterize long‑range interactions within a sample. The output of the colocation module is a matrix of interaction counts, where each cell represents the number of observed interactions between two genomic regions.

Colocation maps are typically visualized as heatmaps. The example below shows a small region on chromosome 5. Darker pixels indicate a higher number of interactions between the corresponding genomic regions.

<div align="center"><img src="https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-946f6cac6c169ef73e2044924d91c8733c2c2359%2Fpretty-colo.png?alt=media" alt="Example of a Colocation Plot" width="500"></div>

Several common features can be observed in colocation heatmaps:

* The main diagonal reflects interactions among fragments originating from the same long template molecules and landing in nearby genomic bins.
* Triangular or off‑diagonal structures may indicate structural variants, such as large deletions or breakends.
* Most off‑diagonal pixels are either empty (white) or represent low‑level background signal (green).

### Colocation Map Generation

Colocation map generation is a three-step process:

1. Collect relevant alignments.
2. Compute the colocation matrix.
3. Generate output files.

#### Alignment Collection

During alignment collection, DRAGEN gathers all reads eligible for analysis. Alignments are excluded if they map to decoy contigs, fall below the mapping quality threshold, or are marked as duplicates. The remaining reads are assigned to genomic bins, with each bin representing approximately 2,000 bp of the genome.

#### Matrix Construction

The colocation matrix is then constructed by evaluating spatial relationships between reads. For each read (`read1`), DRAGEN identifies nearby reads (`read2`) and increments the matrix entry corresponding to their respective bins. A read is considered nearby if it falls within a rectangular region centered on `read1`. The size of this region is determined by the proximity linkage characteristics of the sample and is selected to balance sensitivity and performance.
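
The collection and counting steps can be sketched as follows. This is a simplified, brute-force illustration, not DRAGEN's implementation: the `Read` fields, the rectangle half-sizes, and the MAPQ threshold are hypothetical, all reads are assumed to lie on one contig, and each unordered read pair is counted once in the upper triangle of the matrix.

```python
from collections import Counter
from typing import NamedTuple


class Read(NamedTuple):
    pos: int          # genomic start position in bp
    x: float          # flowcell coordinates, arbitrary units
    y: float
    mapq: int
    duplicate: bool


def colocation_counts(reads, bin_size, min_mapq, half_width, half_height):
    """Count read pairs that are close on the flowcell, keyed by genomic bin pair."""
    # Alignment collection: drop low-MAPQ and duplicate reads.
    kept = [r for r in reads if r.mapq >= min_mapq and not r.duplicate]
    counts = Counter()
    for index, read1 in enumerate(kept):
        for read2 in kept[index + 1:]:
            # read2 is "nearby" if it lies in a rectangle centered on read1.
            if abs(read1.x - read2.x) <= half_width and abs(read1.y - read2.y) <= half_height:
                bin1, bin2 = sorted((read1.pos // bin_size, read2.pos // bin_size))
                counts[(bin1, bin2)] += 1
    return counts


reads = [
    Read(100, 0.0, 0.0, 60, False),
    Read(2500, 0.5, 0.5, 60, False),
    Read(5000, 10.0, 10.0, 60, False),   # far away on the flowcell
    Read(150, 0.1, 0.1, 60, True),       # duplicate, excluded
    Read(300, 0.2, 0.2, 0, False),       # below the MAPQ threshold, excluded
]
print(colocation_counts(reads, bin_size=2000, min_mapq=10, half_width=1.0, half_height=1.0))
```

A production implementation would index reads spatially instead of comparing all pairs; the quadratic loop is used here only to keep the example short.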

#### Additional Options

Several options are available to control colocation matrix generation:

* The genome is partitioned into fixed‑size bins of equal length, and alignments are assigned to bins based on their starting position. Bin size can be adjusted using the `--colocation-bin-size` option.
* Alignments with specific BAM flags can be excluded using `--colocation-alignment-filter-flags`, which accepts an integer bitmask specifying flags to ignore.
* A minimum mapping quality can be enforced using `--colocation-alignment-min-mapq`.
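
An integer bitmask for `--colocation-alignment-filter-flags` can be composed from the flag bits defined in the SAM specification. The sketch below shows the arithmetic; the chosen combination is only an example and not a DRAGEN default, and the exclusion rule (an alignment is ignored if any masked bit is set) is an assumption modeled on common SAM tooling.

```python
# Flag bits as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_QC_FAIL = 0x200
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800


def build_filter_mask(*flags):
    """Combine individual flag bits into one integer mask."""
    mask = 0
    for flag in flags:
        mask |= flag
    return mask


def is_excluded(alignment_flag, mask):
    """Assumed rule: exclude an alignment if any masked bit is set."""
    return (alignment_flag & mask) != 0


mask = build_filter_mask(FLAG_SECONDARY, FLAG_DUPLICATE, FLAG_SUPPLEMENTARY)
print(mask)                     # 256 + 1024 + 2048 = 3328
print(is_excluded(1107, mask))  # 1107 includes the duplicate bit (1024)
print(is_excluded(99, mask))    # 99 has none of the masked bits
```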

### Cooler File

Colocation output is written as a cooler file containing a sparse representation of the colocation matrix.

The file conforms to schema 3 of the [official cooler specification](https://cooler.readthedocs.io/en/latest/schema.html). DRAGEN produces a single‑resolution cooler file. The colocation matrix is stored in square mode and is symmetric, with each pixel containing a single integer `count` field of type `int32`.

The resulting cooler file can be processed using the cooler CLI or Python API.

### Colocation Filter

The colocation filter uses colocation map data to assess proximity support for structural variant (SV) breakends and to flag events that are not supported by proximity evidence.

For each candidate breakend defined by coordinates `chrom1:pos1` and `chrom2:pos2`, the filter evaluates a localized region of the colocation map. A bounding box centered on these coordinates is applied, with a default size of 200 kbp, and the values of all bins within this region are summed to quantify local interaction support.

To account for variation in sequencing depth and data quality, the regional sum is normalized using the median non‑zero diagonal value of the colocation map. If the normalized value is below the configured threshold (default: 1.0), the `ColocationSum` filter is applied to the breakend in the VCF output.

Filter application follows paired-event semantics:

* If the `ColocationSum` filter is applied to one breakend of a paired SV event, it is also applied to the corresponding mate breakend record.
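
The region sum, normalization, and threshold test can be sketched as follows. This is an illustration under stated assumptions rather than the DRAGEN implementation: the matrix is a hypothetical dictionary keyed by `(bin1, bin2)`, the region is expressed as a half-width in bins, and the lookup uses the orientation in which the pixel keys are stored.

```python
from statistics import median


def region_sum(pixels, bin1, bin2, half_bins):
    """Sum all stored counts inside the square centered on (bin1, bin2)."""
    total = 0
    for (i, j), count in pixels.items():
        if abs(i - bin1) <= half_bins and abs(j - bin2) <= half_bins:
            total += count
    return total


def median_nonzero_diagonal(pixels):
    """Median of the non-zero diagonal entries, used as the normalizer."""
    return median(c for (i, j), c in pixels.items() if i == j and c > 0)


def colocation_filter(pixels, bin1, bin2, half_bins, threshold=1.0):
    """Return (passes, normalized_sum) for one breakend pair."""
    normalized = region_sum(pixels, bin1, bin2, half_bins) / median_nonzero_diagonal(pixels)
    return normalized >= threshold, normalized


# Hypothetical matrix: three diagonal bins plus weak off-diagonal signal.
example_pixels = {(0, 0): 10, (1, 1): 20, (2, 2): 30, (0, 5): 4, (1, 5): 6}
passes, normalized = colocation_filter(example_pixels, bin1=0, bin2=5, half_bins=1)
```

Because the decision is computed once per breakend pair in this sketch, the same result would be written to both mate records.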

<div align="center"><img src="https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-d648db5f1548b498746013a362a5d355417acf3d%2Fcolocation_artifact.png?alt=media" alt="DRAGEN-SV BND Call Reduction using Colocation Filtering" width="800"></div>

#### Running DRAGEN SV with Colocation Filter

Colocation filtering is enabled by default when both `--enable-colocation` and `--enable-sv` are set to `true`. To disable the filter, set `--sv-enable-colocation-filter` to `false` when starting the DRAGEN analysis with TruPath enabled.

Additional Options:

* `sv-colocation-filter-normalize-by-median`: If true, colocation filter will normalize the region sum by the median diagonal value of the colocation matrix (default: true)
* `sv-colocation-filter-threshold`: Minimum (normalized) sum of region in colocation matrix to pass filter (default: 1.0)
* `sv-colocation-filter-region-width`: Width (in bp) of square region in colocation matrix to compute sum of (default: 200kbp)
* `sv-colocation-filter-min-svlen`: The colocation filter will not run on intra-chromosomal breakend pairs that are within this distance of each other (default: 200kbp)
* `sv-colocation-filter-inter-bnd`: If true, colocation filter will be applied to inter-chromosomal breakends (default: true)
* `sv-colocation-filter-intra-bnd`: If true, colocation filter will be applied to intra-chromosomal breakends (default: true)

#### Output

When the colocation filter is enabled, the SV VCF file contains the following additional header lines:

```
##INFO=<ID=NORMALIZED_COLOC_SUM,Number=1,Type=Float,Description="The sum of the square region in the colocation matrix centered on variant coordinates with width 200000 and normalized by the median diagonal count">
##FILTER=<ID=ColocationSum,Description="The sum of the square region in the colocation matrix centered on variant coordinates with width 200000 and normalized by the median diagonal count does not meet the threshold of 1">
```

Example VCF records are shown below. The first breakend pair has the `ColocationSum` filter applied because no colocation signal was detected (`NORMALIZED_COLOC_SUM=0.0000`). The second pair passes with `NORMALIZED_COLOC_SUM=40.1980`.

```
chr1    94900000        DRAGEN:BND:12587:0:1:0:0:0:0    A       A[chr2:39900000[       280     ColocationSum   SVTYPE=BND;MATEID=DRAGEN:BND:12587:0:1:0:0:0:1;BND_DEPTH=52;MATE_BND_DEPTH=54;NORMALIZED_COLOC_SUM=0.0000  GT:GQ:PL:PR:MLQS:VF:VF1:VAF1:VF2:VAF2   0/1:280:330,0,637:38,3:.:38,3:23,3:0.115385:15,3:0.166667
chr2   39900000        DRAGEN:BND:12587:0:1:0:0:0:1    C       ]chr1:94900000]C        280     ColocationSum   SVTYPE=BND;MATEID=DRAGEN:BND:12587:0:1:0:0:0:0;BND_DEPTH=54;MATE_BND_DEPTH=52;NORMALIZED_COLOC_SUM=0.0000      GT:GQ:PL:PR:MLQS:VF:VF1:VAF1:VF2:VAF2   0/1:280:330,0,637:38,3:.:38,3:15,3:0.166667:23,3:0.115385
chr3    52000000        DRAGEN:BND:65926:0:1:0:0:0:1    C       C]chr3:72000000]        955     PASS    SVTYPE=BND;MATEID=DRAGEN:BND:65926:0:1:0:0:0:0;BND_DEPTH=53;MATE_BND_DEPTH=54;NORMALIZED_COLOC_SUM=40.1980       GT:GQ:PL:PR:SR:SB:FS:MLQS:VF:VF1:VAF1:VF2:VAF2  0/1:715:999,0,712:29,8:38,23:21,17,1,22:44.774:.:48,31:21,19:0.475000:27,20:0.425532
chr3    72000000        DRAGEN:BND:65926:0:1:0:0:0:0    A       A]chr3:52000000]        955     PASS    SVTYPE=BND;MATEID=DRAGEN:BND:65926:0:1:0:0:0:1;BND_DEPTH=54;MATE_BND_DEPTH=53;NORMALIZED_COLOC_SUM=40.1980       GT:GQ:PL:PR:SR:SB:FS:MLQS:VF:VF1:VAF1:VF2:VAF2  0/1:715:999,0,712:29,8:38,23:21,17,1,22:44.774:.:48,31:27,20:0.425532:21,19:0.475000
```
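
For downstream scripting, the filter status and normalized sum can be read from a record with a few lines of Python. This is a minimal sketch that assumes a tab-delimited VCF data line; the function name and returned keys are hypothetical, and a dedicated VCF library is preferable for production use.

```python
def parse_sv_record(line):
    """Extract colocation-related fields from one tab-delimited VCF record."""
    fields = line.rstrip("\n").split("\t")
    info = {}
    for item in fields[7].split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True  # flag-style INFO keys have no value
    coloc = info.get("NORMALIZED_COLOC_SUM")
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "id": fields[2],
        "filters": fields[6].split(";"),
        "mate_id": info.get("MATEID"),
        "normalized_coloc_sum": float(coloc) if coloc is not None else None,
    }


# Record rebuilt from the first example above, truncated to the first eight columns.
example_line = "\t".join([
    "chr1", "94900000", "DRAGEN:BND:12587:0:1:0:0:0:0", "A", "A[chr2:39900000[",
    "280", "ColocationSum",
    "SVTYPE=BND;MATEID=DRAGEN:BND:12587:0:1:0:0:0:1;NORMALIZED_COLOC_SUM=0.0000",
])
record = parse_sv_record(example_line)
```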

## Targeted Calling from TruPath Data

For WGS TruPath data, only `lpa`, `hba`, and `smn` will run when the Targeted Caller is enabled. A custom list of supported targets can be enabled via the command line.

## Proximity Coverage Reports

When proximity mapping is enabled, DRAGEN generates a parallel set of coverage reports filtered to include only linked reads.

During template reconstruction, each read‑pair fragment is assigned a link‑quality score equal to the highest‑quality link connecting it to other fragments. Only reads from fragments with link‑quality scores meeting or exceeding a specified threshold are included in proximity coverage reports.
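
The scoring and thresholding rule can be sketched as follows. The link representation (pairs of fragment identifiers with a quality value) and the function names are assumptions made for illustration; they do not reflect DRAGEN's internal data structures.

```python
def fragment_link_quality(links):
    """Map each fragment to the quality of its best link.

    links is an iterable of (fragment_a, fragment_b, link_quality) tuples.
    """
    best = {}
    for frag_a, frag_b, quality in links:
        for frag in (frag_a, frag_b):
            best[frag] = max(best.get(frag, quality), quality)
    return best


def fragments_for_report(best, threshold):
    """Fragments whose link-quality score meets or exceeds the threshold."""
    return {frag for frag, quality in best.items() if quality >= threshold}


# Hypothetical links between four fragments.
example_links = [("f1", "f2", 30), ("f2", "f3", 12), ("f3", "f4", 8)]
scores = fragment_link_quality(example_links)
q10_fragments = fragments_for_report(scores, 10)
q25_fragments = fragments_for_report(scores, 25)
```

In this example, a fragment such as `f3` is included at a threshold of 10 but excluded at 25, so it would contribute only to the lower-threshold report.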

Proximity coverage reports are generated for each link‑quality threshold specified using `--proximity-min-linkq-threshold` (default: 10) and `--proximity-additional-linkq-thresholds` (default: 25; maximum of two values). These reports are available for WGS and all defined QC coverage regions.

| Report Name                        | Output File                                                     | Notes                                        |
| ---------------------------------- | --------------------------------------------------------------- | -------------------------------------------- |
| Proximity coverage metrics         | \_proximity\_linkqual\<linkq-threshold>\_coverage\_metrics.csv  | Coverage statistics for linked reads         |
| Proximity fine histogram coverage  | \_proximity\_linkqual\<linkq-threshold>\_fine\_hist.csv         | Detailed coverage histogram for linked reads |
| Proximity histogram coverage       | \_proximity\_linkqual\<linkq-threshold>\_hist.csv               | Binned coverage histogram for linked reads   |
| Proximity overall mean coverage    | \_proximity\_linkqual\<linkq-threshold>\_overall\_mean\_cov.csv | Overall mean coverage for linked reads       |
| Proximity per contig mean coverage | \_proximity\_linkqual\<linkq-threshold>\_contig\_mean\_cov.csv  | Per-contig mean coverage for linked reads    |

These reports use the same format and metrics as standard coverage reports but reflect statistics computed exclusively from linked reads meeting the specified threshold.

## Reports

DRAGEN‑Reports includes a TruPath‑specific manifest to generate reports for TruPath WGS analysis. The manifest file, `trupath/germline_wgs.json`, is located in the `/opt/dragen-reports/manifests` directory. In addition to the standard QC metrics and visualizations provided in DRAGEN WGS reports, the TruPath report includes an additional `Proximity` tab highlighting metrics and visualizations specific to TruPath proximity‑enabled analysis, including:

* `Model Fit` – Root mean square error indicating how well the proximity model fits the data.
* `Q25 Proximity Rate` – Percentage of read pairs with at least one neighbor above Q25.
* `Q25 Proximity Coverage` – Average autosomal coverage of read pairs with link quality above Q25.
* `P75 Template Size` – Size of linked template molecules at the 75th percentile.
* `Phase Block NG50` – Size of the smallest phasing block required to cover at least 50% of the genome.

The `Proximity` tab also includes several visualizations summarizing proximity‑specific characteristics, including:

The distribution of template genomic lengths from `<prefix>.wgs_template_gdist.csv`

![Proximity 1](https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-9965cc9795da7f149e9c1f7d107ea1b951fa1724%2Fproximity_genomic_span.png?alt=media)

The genomic coverage of variant phasing blocks by minimum block size, from `<prefix>.phase_blocks.gtf`

![Proximity 2](https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-7e6960737415a5ec2f38c11f1e3156364002ca48%2Fproximity_phase_blocks.png?alt=media)

The distribution of templates by sub-read count from `<prefix>.<qc-region>_template_gdist.csv`

![Proximity 3](https://3156241411-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FWueVEHA9PEBUPz0J4TCJ%2Fuploads%2Fgit-blob-06897294fdd297eb87752e0a73ec05c2c7b4ed49%2Fproximity_subpair_counts.png?alt=media)
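
The `Phase Block NG50` metric listed above follows the usual NG50 convention, which can be sketched as below. The sketch assumes blocks are accumulated from largest to smallest and that the genome size is supplied by the caller; the function name and example values are hypothetical, and the exact genome size used by DRAGEN‑Reports is not specified here.

```python
def phase_block_ng50(block_lengths, genome_size):
    """Length of the block at which the cumulative length of blocks,
    taken from largest to smallest, first reaches half the genome size.

    Returns 0 if all blocks together cover less than half the genome.
    """
    target = 0.5 * genome_size
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


# Hypothetical block lengths (bp) against a hypothetical 200 bp "genome".
example_ng50 = phase_block_ng50([10, 50, 20, 30], genome_size=200)
```

Normalizing by genome size rather than by total block length keeps the value comparable across samples with different amounts of phased sequence.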

## Limitations

Illumina TruPath proximity enabled analysis has the following limitations:

* Illumina TruPath proximity mode is currently supported for the DRAGEN Germline pipeline. The Somatic, RNA, UMI, MRD, and Methylation pipelines are not supported.
* DRAGEN downsampling is not supported. In order to maintain the proximity property of the TruPath assay, FASTQs should not be randomly downsampled.
* Only human samples using hg38 have been verified.
* Only TruPath data inputs from the Illumina TruPath Genome prep are supported at this time. Running `--enable-proximity=true` with non-TruPath data inputs will halt analysis.
* Phasing requires a pangenome reference hash table with personalization enabled. Analysis halts if coverage is too low to support personalization.
* For on-premises analyses, TruPath analysis requires a v4 DRAGEN server due to FPGA memory limitations. For reference, v4 servers have a server serial number which begins with the letters "AC".
* MRJD requires at least 16x coverage to make calls; the caller will abort any attempt to call genes with insufficient aligned read coverage.

## TruPath Genome Licensing

Illumina TruPath proximity‑enabled analysis can be run in the cloud or on supported on‑premises systems.

* Cloud analysis is supported via Illumina Connected Analytics (ICA), BaseSpace Sequence Hub (BSSH) Run Planning with AutoLaunch, and DRAGEN FPGA Cloud BYOL on AWS EC2 f2.6xlarge instances.
* Local analysis is supported on Phase 4 DRAGEN On‑Prem servers.
* For DRAGEN On‑Prem servers and DRAGEN FPGA Cloud BYOL customers, the pipeline requires a Proximity license.
* The Proximity license is included with the purchase of the Illumina TruPath Genome prep kit and is automatically assigned.
* Due to FPGA memory constraints, the Proximity license for on‑premises use is supported only on Phase 4 servers. Phase 4 servers can be identified by a server serial number beginning with the letters “AC.”
