Illumina TruPath Genome Pipeline
DRAGEN’s Germline pipeline integrates proximity mapped reads from the Illumina TruPath Genome prep to enhance genomic analysis using long-range information encoded on the flowcell. This proximity-aware workflow supports highly accurate read mapping, phasing, and variant detection, including structural variants, paralog‑resolved small variants, short tandem repeat (STR) genotyping, and colocation analysis. By modeling and applying read‑to‑read linkage probabilities, the pipeline enables more confident interpretation of complex and low‑mappability genomic regions using standard short‑read data.
Summary
Integrated TruPath proximity mapping: Enabling --enable-proximity=true activates proximity-aware modeling and analysis across the DRAGEN Germline pipeline, allowing reads that are spatially close on the flowcell to be probabilistically linked as originating from the same DNA template.
Proximity model-driven mapping and alignment: DRAGEN performs a preliminary mapping pass to collect high‑confidence alignments and fits a non‑linear proximity linking model that relates flowcell spatial distance and genomic distance to read‑to‑read linkage probability. The resulting Phred‑scaled linkage probability lookup table is applied during map/align to resolve ambiguous mappings and improve read placement accuracy in repetitive and complex genomic regions.
Enhanced phasing support: Proximity information strengthens read phasing by associating reads from the same original template molecule, enabling longer and more reliable phasing blocks that propagate into variant calling and assembly‑based analyses.
Structural variant calling: The Germline SV caller leverages proximity‑derived phasing to support phased assemblies, haplotype‑aware machine‑learning features, and haplotype‑resolved genotyping for single‑sample TruPath whole‑genome analyses.
Haplotype‑resolved small variant detection in paralogs: For clinically relevant paralogous regions, Multi‑Region Joint Detection (MRJD) estimates total copy number from read depth, reconstructs individual paralog copies using read sequences and proximity information, assigns each copy to a genomic region or haplotype, and calls small variants from the reconstructed copies.
STR genotyping with IRR recovery: Proximity linking enables recovery and placement of in‑repeat reads (IRRs) that would otherwise be unmapped, improving detection and sizing of large STR expansions and supporting phasing‑aware genotyping.
Colocation analysis and filtering: Colocation maps summarize long-range genomic interactions using proximity‑linked reads and are used to visualize structural features and filter SV breakends lacking proximity support.
Specialized outputs and reporting: The pipeline generates proximity‑aware BAM/CRAM files, VCFs, JSON summaries, cooler files, and TruPath‑specific DRAGEN Reports with dedicated QC metrics and visualizations.
Overview
Short‑read DNA sequencing typically captures genomic variation at high accuracy but lacks long-range context needed to confidently resolve complex regions such as repeats, paralogs, and structural variants. The Illumina TruPath Genome Prep encodes long-range molecular information directly on the flowcell by preserving spatial proximity between reads derived from the same original DNA molecule. When combined with DRAGEN’s proximity‑aware algorithms, this information enables long-range analysis that extends the power of standard short‑read data.
The DRAGEN Germline pipeline for Illumina TruPath Genome leverages this flowcell‑encoded proximity information through a probabilistic proximity linking model that assigns read‑to‑read linkage probabilities based on spatial and genomic distance. When proximity mode is enabled, DRAGEN automatically fits this model, generates Phred‑scaled proximity link probability distributions, and applies them across mapping, phasing, and variant calling workflows. These proximity linkage probabilities serve as a foundational signal reused throughout the pipeline—informing alignment scoring, phasing blocks, candidate assemblies, machine‑learning features, and variant filtering—to improve accuracy and confidence in repetitive and structurally complex genomic regions while remaining compatible with standard short‑read sequencing workflows and formats.

Proximity Mode Analysis in DRAGEN
When proximity mode is enabled, DRAGEN automatically performs additional modeling and downstream analyses that integrate proximity information throughout the Germline pipeline. To activate TruPath‑specific proximity analysis, set --enable-proximity=true in a DRAGEN Germline run. This proximity‑aware processing supports the following workflow and features:
High‑accuracy read mapping using linkage‑informed alignment scoring
Enhanced phasing via read‑to‑template association
Structural variant calling using phased assemblies and haplotype‑aware algorithms
Paralog‑resolved small variant detection with Multi‑Region Joint Detection (MRJD)
Improved STR genotyping through in‑repeat read (IRR) recovery
Long-range genomic interaction analysis and SV filtering using colocation maps

Key Benefits of TruPath Genome vs Standard Illumina SBS
When DRAGEN Germline with proximity mode is enabled for TruPath Genome data, improvements are observed relative to standard Illumina SBS inputs across multiple performance dimensions.
These include improved small variant calling accuracy, longer phasing blocks, a higher proportion of fully phased genes, and improved structural variant recall. The table below summarizes key performance metrics across TruPath and standard Illumina SBS datasets.
Best-in-class small variant calling performance | 36,717 FP+FN | 40,267 FP+FN | 61,288
Multi-megabase phasing blocks | 8.1 Mbp | 649 kbp | NA
Fully phased genes | 98.4% | 87.6% | 0%
Improved SV recall | 94.0% | 93.7% | 80.7%
Phased, High-Quality Small Variant Calls in Clinically Relevant Gene Families
TruPath proximity-aware analysis enables haplotype‑resolved, copy‑number‑aware small variant calling in ten clinically relevant paralogous gene families using Multi-Region Joint Detection (MRJD), as shown in the table below. With TruPath data, MRJD produces phased variant calls across these supported paralogous regions without reliance on population haplotypes.
Supported Genes
PMS2 | Lynch Syndrome
SMN1–SMN2 | Spinal Muscular Atrophy
NCF1 | Chronic Granulomatous Disease
CYP21A2 | Congenital Adrenal Hyperplasia
TNXB | Ehlers–Danlos syndrome
STRC | Recessive Nonsyndromic Hearing Loss
CYP2D6 | Pharmacogenetics
CYP11B1–CYP11B2 | Glucocorticoid-remediable Aldosteronism
CFHR1–CFHR2–CFHR3–CFHR4 | Atypical Hemolytic Uremic Syndrome
USP18 | Type I Interferonopathy
The figure below illustrates haplotype‑resolved variant calls generated by MRJD for PMS2 and PMS2CL, reported as separate copies for each locus, with long‑read data shown for comparison.

Improved STR Expansion Length and Classification Accuracy
TruPath analysis improves short tandem repeat (STR) expansion length estimation by recovering fragments composed entirely of STR sequence and by applying sequencing efficiency correction to account for locus‑specific coverage bias.
These improvements result in STR length estimates that more closely track expected repeat sizes and support more accurate expansion classification. The figure below compares STR expansion length estimates generated using standard Illumina sequencing and TruPath analysis across multiple loci.

Improved BND Filtering
TruPath proximity information enables more selective filtering of large (>200 kbp) inter‑ and intra‑chromosomal breakend (BND) calls produced by DRAGEN Structural-Variant (SV) Calling.
Incorporating colocation evidence reduces the number of reported large BND events while maintaining recall. This effect is observed for both intra‑chromosomal and inter‑chromosomal BNDs across evaluated samples.
Summarized below are BND recall and the reduction in reported intra‑ and inter‑chromosomal BND calls for TruPath Coriell samples (n=45), with and without colocation filtering.

Proximity Linking Model
In Illumina TruPath Genome data, read pairs that are proximal on the flowcell have an increased likelihood of originating from the same original template molecule. To quantify this likelihood, DRAGEN uses a probabilistic proximity linking model that relates genomic distance and flowcell proximity to calculate the probability that two reads originate from the same input DNA molecule.
When DRAGEN is run with --enable-proximity=true, the mapper estimates the parameters of this proximity linking model and generates a link probability distribution for each TruPath FASTQ input. This process consists of three stages: sample collection, proximity analysis, and model fitting, followed by generation of a link probability lookup table.
Sample Collection
To fit the proximity linking model, DRAGEN first collects a representative subset of preliminary alignments from the input data. During an initial mapping pass, alignments are generated in flowcell‑tile-sized batches and reads meeting suitability requirements are retained for proximity analysis.
Eligible preliminary alignments must satisfy the following criteria:
Mapped with MAPQ ≥ 60
Primary alignments
Non‑duplicate reads
For paired‑end data, first‑in‑pair with a mapped mate and proper pairing
DRAGEN continues sampling until one million qualifying preliminary alignments have been collected or until the entire FASTQ input has been processed. If fewer than one million alignments are collected, processing continues with a warning indicating a potentially insufficient sample. If no suitable alignments are found, DRAGEN exits with an error.
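The eligibility criteria above map naturally onto standard SAM flag bits. The following Python sketch is illustrative only: the Alignment record and the is_eligible and collect_sample helpers are hypothetical names rather than DRAGEN internals, and the sketch assumes that "primary" excludes both secondary and supplementary alignments.

```python
from dataclasses import dataclass

# Standard SAM flag bits (from the SAM specification).
PAIRED, PROPER_PAIR, UNMAPPED, MATE_UNMAPPED = 0x1, 0x2, 0x4, 0x8
FIRST_IN_PAIR, SECONDARY, DUPLICATE, SUPPLEMENTARY = 0x40, 0x100, 0x400, 0x800


@dataclass
class Alignment:  # hypothetical record, for illustration only
    flag: int
    mapq: int


def is_eligible(aln: Alignment) -> bool:
    """Return True if a preliminary alignment meets the sampling criteria."""
    f = aln.flag
    if f & UNMAPPED or aln.mapq < 60:
        return False
    if f & (SECONDARY | SUPPLEMENTARY | DUPLICATE):
        return False
    if f & PAIRED:
        # Paired-end data: first-in-pair, mate mapped, properly paired.
        return bool(f & FIRST_IN_PAIR) and bool(f & PROPER_PAIR) and not f & MATE_UNMAPPED
    return True


def collect_sample(alignments, target=1_000_000):
    """Collect eligible alignments until the target count or the input ends."""
    sample = []
    for aln in alignments:
        if is_eligible(aln):
            sample.append(aln)
            if len(sample) >= target:
                break
    return sample
```

A caller could compare len(sample) with the target to decide whether to emit the insufficient-sample warning or, for an empty sample, the error described above.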
Proximity Analysis
Once a sufficient set of preliminary alignments has been collected, DRAGEN analyzes read pairs that are both spatially proximal on the flowcell and genomically proximal on the reference genome. Read pairs meeting both criteria have a high likelihood of originating from the same template molecule.
Each alignment is associated with a mapped genomic position and a flowcell coordinate (X, Y). For candidate read pairs, DRAGEN computes:
Spatial displacement on the flowcell, represented as (XD, YD) in nanometers
Genomic displacement, represented as GDIST in base pairs and rounded to the nearest 1,000 bp
Read pairs whose spatial and genomic displacements fall within configured proximity thresholds are considered likely linked. For these pairs, counts are aggregated across combinations of XD, YD, and GDIST. These aggregated counts form the empirical input to the model fitting stage.
A second set of counts is also collected using read pairs that are spatially proximal but genomically distant. These pairs are assumed to represent chance colocation and are used to model background noise.
Before proceeding, DRAGEN evaluates both sets of counts to ensure the observed trends are consistent with TruPath data. If the data fails validation, DRAGEN exits with an error.
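The aggregation step can be pictured as two counters, one for likely linked pairs and one for chance colocation. The Python sketch below is a simplified illustration under several assumptions: the function names and the two threshold parameters are hypothetical, the spatial displacements are taken to be already binned, and the background counts are keyed by (XD, YD) only. DRAGEN's actual thresholds and binning are not described here.

```python
from collections import Counter


def round_to_kb(gdist_bp: int) -> int:
    """Round a genomic displacement to the nearest 1,000 bp (halves round up)."""
    return (gdist_bp + 500) // 1000 * 1000


def aggregate_counts(pairs, max_spatial_nm, max_gdist_bp):
    """Count likely-linked and background read pairs.

    pairs: iterable of (xd_nm, yd_nm, gdist_bp) displacements.
    Returns (linked, background). linked is keyed by (XD, YD, GDIST);
    background holds spatially proximal but genomically distant pairs,
    keyed by (XD, YD).
    """
    linked, background = Counter(), Counter()
    for xd, yd, gdist in pairs:
        if abs(xd) > max_spatial_nm or abs(yd) > max_spatial_nm:
            continue  # not spatially proximal on the flowcell
        if abs(gdist) <= max_gdist_bp:
            linked[(xd, yd, round_to_kb(abs(gdist)))] += 1
        else:
            background[(xd, yd)] += 1
    return linked, background
```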
Model Fitting
The proximity linking model is non‑linear and includes approximately 20 parameters that predict the expected number of linked read pairs (N) as a function of XD, YD, and GDIST. The aggregated counts from proximity analysis are submitted to a non‑linear least‑squares solver to estimate these parameters.
If the solver fails to converge, DRAGEN exits with an error. When fitting succeeds, the model enables calculation of the expected number of linked read pairs, μ(XD,YD,GDIST), which provides a smoothed estimate relative to the empirical counts.
A separate background model estimates the expected number of proximal read pairs due to chance, μchance(XD, YD, GDIST). The link probability is then computed as:
$$1 - \frac{\mu_{\mathrm{chance}}(XD, YD, GDIST)}{\mu(XD, YD, GDIST)}$$
This probability is typically expressed on a Phred scale as:
$$-10 \log_{10}\left(\frac{\mu_{\mathrm{chance}}(XD, YD, GDIST)}{\mu(XD, YD, GDIST)}\right)$$
Higher values indicate a stronger likelihood that two reads originated from the same template molecule.
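As a worked illustration, the Python sketch below evaluates the link probability as 1 − μchance/μ and its Phred-scaled form as −10·log10(μchance/μ). This reading of the two expressions, the function names, and the cap of 60 used to avoid taking the logarithm of zero are assumptions made for the example, not documented DRAGEN behavior.

```python
import math


def link_probability(mu: float, mu_chance: float) -> float:
    """P(link) = 1 - mu_chance / mu, clamped to [0, 1]."""
    if mu <= 0:
        return 0.0
    return min(1.0, max(0.0, 1.0 - mu_chance / mu))


def phred_link_quality(mu: float, mu_chance: float, cap: float = 60.0) -> float:
    """Phred-scaled link quality: -10 * log10(mu_chance / mu)."""
    if mu <= 0 or mu_chance >= mu:
        return 0.0  # no evidence beyond chance colocation
    if mu_chance <= 0:
        return cap  # hypothetical cap, avoids log10(0)
    return min(cap, -10.0 * math.log10(mu_chance / mu))
```

For example, if the fitted model expects 100 proximal pairs at a given displacement and the background model expects 1 of them by chance, the link probability is 0.99 and the Phred-scaled quality is 20.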
Link Probability Distribution Generation
After successful model fitting, DRAGEN evaluates the fitted model across the practical range of spatial and genomic displacements and stores the resulting link probabilities in a lookup table. The table is generated continuously until link probabilities fall below a minimum threshold.
This lookup table represents the primary output of the TruPath proximity linking model and is used downstream by the DRAGEN Germline pipeline to incorporate proximity information during mapping, template tagging, and variant calling.
In rare cases where the fitted model fails to produce meaningful link probabilities above the minimum threshold, an empty lookup table is generated and DRAGEN exits with an error.
Map/Align
The proximity linking model is used during mapping to improve read alignment accuracy for TruPath samples. In regions of high sequence homology, standard Illumina sequencing reads may align equally well, or nearly so, to multiple genomic locations, resulting in ambiguous mappings. With TruPath data, proximity‑linked read pairs can provide additional context that enables both reads in a pair to be mapped uniquely.
Read pairs originating from a region of interest on the flowcell are processed through the standard mapping workflow. Multiple candidate alignments are generated and scored, and key attributes—including alignment score, genomic position, and flowcell position—are stored in an indexed data structure.
For each read pair X that may benefit from proximity information, the mapper revisits the candidate alignments and searches the data structure for other read pairs Y whose alignment and flowcell positions suggest a shared template of origin. The proximity linking model quantifies the likelihood that X and Y originated from the same original DNA molecule. A Phred‑scaled score derived from this likelihood is incorporated into the corresponding joint alignment hypothesis.
Template Tagging
During alignment, the mapper assigns each read a set of link probability scores that estimate the likelihood of links between the read and other nearby reads on the flowcell. Template tagging uses these scores to reconstruct the original template DNA molecules from which paired reads originated.
Template tagging begins by grouping reads into fragments, where each fragment consists of a paired‑end read pair. For each fragment, outgoing link probability scores are collected from the constituent reads. Links with Phred‑scaled quality below the threshold specified by --proximity-min-linkq-threshold (default: 10) are discarded.
The remaining high‑quality links are used to connect fragments into templates. Each connected set of fragments represents a reconstructed template molecule. All reads assigned to the same template are annotated with a shared template identifier in the BAM file (BX:Z), allowing reads originating from the same original DNA molecule to be identified downstream.
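Connecting fragments through accepted links is a connected-components problem. The Python sketch below uses a union-find structure as one possible way to illustrate it; the function name, the input layout, and the integer template IDs are hypothetical, and the actual format of the BX:Z value is not specified here.

```python
def tag_templates(fragment_ids, links, min_linkq=10):
    """Group fragments into templates via links with quality >= min_linkq.

    fragment_ids: list of fragment identifiers.
    links: iterable of (fragment_a, fragment_b, phred_linkq).
    Returns {fragment_id: template_id}; template IDs are arbitrary integers.
    """
    parent = {f: f for f in fragment_ids}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path halving
            f = parent[f]
        return f

    for a, b, linkq in links:
        if linkq >= min_linkq:  # links below the threshold are discarded
            parent[find(a)] = find(b)

    roots = {}
    return {f: roots.setdefault(find(f), len(roots)) for f in fragment_ids}
```

Fragments that share a returned template ID would carry the same BX:Z value; a fragment with no accepted links forms a single-fragment template.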
Outputs
Template tagging generates a set of metrics reports that describe characteristics of all discovered templates and links identified during the DRAGEN run. Reports are produced for whole‑genome data and for any specified QC regions.
A template or link is included in QC region metrics if any portion of its genomic span overlaps the QC region.
Template Metrics
Template Subpair Count Report
The template subpair count report, <prefix>.<qc-region>_template_subpairs.csv, summarizes the distribution of discovered templates by the number of fragments (subpairs) they contain. A subpair refers to a read‑pair fragment within a template.
Each record in the report describes the number of templates observed with a given fragment count and the corresponding percentage of all templates. Summary statistics, including the mean and selected percentiles of subpair counts, are also reported. Example summary statistics include the mean subpair count and the 25th, 50th, 75th, and 95th percentile subpair counts across all templates.
Template Genomic Distance Report
The template genomic distance report, <prefix>.<qc-region>_template_gdist.csv, describes the distribution of template genomic lengths from the 0th to the 100th percentile.
Template genomic length is defined as the genomic distance between the smallest and largest mapped genomic positions represented in the template, corresponding to the span from the start of the first fragment to the end of the last fragment.
Percentile values are interpolated from the distribution of all discovered template lengths and may therefore be non‑integer base‑pair values.
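To see why interpolated percentiles can be non-integer, consider the Python sketch below. It assumes simple linear interpolation between neighboring order statistics; the interpolation method DRAGEN uses is not stated here, and both function names are hypothetical.

```python
def percentile(values, pct):
    """Linearly interpolated percentile (0 <= pct <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def template_length(positions):
    """Genomic span between the smallest and largest mapped positions."""
    return max(positions) - min(positions)
```

With template lengths of 100, 200, 300, and 401 bp, the 50th percentile under this scheme is 250 bp and the 75th percentile is 325.25 bp, a value that no single template has.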
Template Spatial Distance Reports
Template spatial distance reports describe the distribution of template spatial extents in flowcell units (FCU) from the 0th to the 100th percentile. Two reports are generated:
<prefix>.<qc-region>_template_xdist.csv, describing spatial extent along the flowcell X axis
<prefix>.<qc-region>_template_ydist.csv, describing spatial extent along the flowcell Y axis
Template spatial length is defined as the distance between the smallest and largest flowcell coordinates represented in the template along the corresponding axis. As with genomic distances, percentile values are interpolated from the observed distribution and may be non‑integer FCU values.
Template Length Thresholds Report
The template length thresholds report, <prefix>.<qc-region>_template_thresholds.csv, summarizes the count and proportion of discovered templates whose genomic lengths exceed specified thresholds.
Template genomic length is defined as the span between the smallest and largest mapped genomic positions within a template.
Thresholds reported in this file are defined using the --template-gdist-thresholds option (default: 10000, 20000, 60000). Each record reports the threshold value, the number of templates meeting or exceeding that threshold, and the corresponding proportion of all discovered templates.
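The content of each record can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; the function name and the tuple layout are hypothetical and do not reflect the CSV column names.

```python
def threshold_report(template_lengths, thresholds=(10000, 20000, 60000)):
    """Return (threshold, count, proportion) rows for templates >= threshold."""
    total = len(template_lengths)
    rows = []
    for t in thresholds:
        count = sum(1 for length in template_lengths if length >= t)
        rows.append((t, count, count / total if total else 0.0))
    return rows
```

For four templates of 5, 15, 25, and 70 kbp, the default thresholds give counts of 3, 2, and 1, corresponding to proportions of 0.75, 0.5, and 0.25.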
Link Metrics
Link metrics are generated for each Phred‑scaled link quality threshold specified at runtime. These thresholds control which links are considered when computing proximity‑based metrics.
The following options determine link metric generation:
--proximity-min-linkq-threshold: Specifies the primary link quality threshold used to accept or reject link hypotheses during template tagging (default: 10).
--proximity-additional-linkq-thresholds: Specifies up to two additional link quality thresholds at which link metrics are computed (default: 25).
Link Genomic Distance Report
The link genomic distance report, <prefix>.<qc-region>_proximity_gdist.csv, describes the distribution of genomic distances for links that meet or exceed a specified link quality threshold.
Link genomic length is defined as the genomic distance between the two fragments connected by the link. Distances are reported from the 0th to the 100th percentile.
Percentile values are interpolated from the distribution of all discovered link lengths and may therefore be non‑integer base‑pair values.
Link Spatial Distance Reports
Link spatial distance reports describe the spatial extent of links in flowcell units (FCU) from the 0th to the 100th percentile. Two reports are generated for each link quality threshold:
<prefix>.<qc-region>_proximity_xdist.csv, reporting spatial extent along the flowcell X axis
<prefix>.<qc-region>_proximity_ydist.csv, reporting spatial extent along the flowcell Y axis
Link spatial length is defined as the distance between the flowcell coordinates of the two fragments connected by the link along the corresponding axis.
As with genomic distance metrics, percentile values are interpolated from the observed distribution and may be non‑integer flowcell unit values.
Phasing
When TruPath data is used, DRAGEN performs read phasing upstream of variant calling and uses haplotype‑phased reads to generate phased variant calls. Phasing is informed by both long‑range proximity linking information provided by the TruPath library preparation and inference of the sample's ancestral haplotypes, which enables robust phasing across long genomic distances.
DRAGEN personalization provides the ancestral component of phasing information by inferring the sample’s ancestral haplotypes, such that phasing is typically inferred to be consistent with that observed in the ancestral haplotypes. As in the standard personalization workflow, DRAGEN also uses variants imputed from the haplotype database to inform prior probabilities for variants in the sample, providing a boost to variant calling performance.
Phasing Model Overview
DRAGEN performs phasing at the level of small, contiguous genomic bins, typically 4,096 bp in length. Within each bin, haplotypes are inferred using the haplotype database in the reference hash table, and reads are assigned accordingly. Proximity linking information is used to propagate phasing information across bins.
Bins are grouped into larger, non‑overlapping phase blocks when there is sufficient evidence of co‑phasing. Each bin is phased in the context of ancestral haplotypes inferred from neighboring bins and from linked reads elsewhere in the genome.
Phasing Options
Phasing is enabled automatically when proximity mode is enabled using --enable-proximity=true. No additional arguments are required. Default settings are recommended, but phasing behavior can be adjusted using the following options:
--personalization-phase-block-threshold: Controls the amount of evidence required to group adjacent bins into a single phase block (default: 20).
--read-phasing-gene-list: Specifies an optional GTF file used to compute gene‑based phasing metrics for genes fully contained within phase blocks.
Lowering the phase-block threshold parameter will reduce the amount of co-phasing evidence required to group adjacent personalization bins into a single phase block, and vice versa.
Output Files
BAM/CRAM Output
The phased reads in the map/align output file are annotated with the following tags:
pp | Phasing probability in Phred-scale log odds: 10∗log10(P(H1)/P(H2)) | [−127, 127]
HP | Haplotype tag for all reads where ∥pp∥ ≥ 10 | 1, 2
PS | Phase block tag | [0, 2^32)
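The relationship between the pp and HP tags can be illustrated as follows. This Python sketch assumes that pp is stored as a rounded integer, that positive values (favoring H1) correspond to HP 1 and negative values to HP 2, and that reads below the magnitude cutoff carry no HP tag. The function name is hypothetical.

```python
import math


def phasing_tags(p_h1: float, p_h2: float, min_abs_pp: int = 10):
    """Return (pp, HP) for a read; HP is None when |pp| < min_abs_pp."""
    if p_h1 <= 0:
        pp = -127
    elif p_h2 <= 0:
        pp = 127
    else:
        pp = round(10 * math.log10(p_h1 / p_h2))
        pp = max(-127, min(127, pp))  # clamp to the documented range
    if abs(pp) < min_abs_pp:
        return pp, None
    return pp, 1 if pp > 0 else 2
```

A read with P(H1) = 0.99 and P(H2) = 0.01 gets pp = 20 and HP = 1 under these assumptions, while a read with equal support for both haplotypes gets pp = 0 and no HP tag.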
Personalized Haplotypes
Personalized haplotypes for each phased bin are output in tab-delimited format (TSV). A summary of the phase blocks defined in the TSV file is also written in GTF format.
TSV (<sample_id>.personal_haplotypes.tsv.gz)
The personalized haplotypes TSV file contains the following columns:
CHROM | Chromosome name
START | Start position of the phased bin (0-based)
END | End position of the phased bin (1-based)
PHASE_BLOCK | Phase block ID for the bins. Bins with the same IDs are confidently co-phased.
PHASING_CONFIDENCE | Phasing confidence for the bin. Lower confidence values indicate a higher likelihood of haplotype switching.
GTF (<sample_id>.phase_blocks.gtf.gz)
Regions covered by the phase blocks, as defined in the personalized TSV file's PHASE_BLOCK column, are also output in a GTF file with the following fields:
seqname | Chromosome name
source | Always 'dragen'
feature | Always 'phaseblock'
start | Start position of the phase block (1-based)
end | End position of the phase block (1-based)
score | Unused ('.')
strand | Unused ('.')
frame | Unused ('.')
attribute | Always 'phase_block n'
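The correspondence between the TSV bins and the GTF phase blocks can be sketched in Python. The example assumes the TSV has a header row with the column names listed above and that each phase block spans from its first bin to its last; the function name is hypothetical, and the real files are gzip-compressed, so they would be opened with gzip.open in text mode first.

```python
import csv
import io


def phase_block_gtf_lines(tsv_text: str):
    """Merge bins sharing a PHASE_BLOCK into one GTF record per block."""
    blocks = {}  # (chrom, block_id) -> [min 0-based start, max end]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        key = (row["CHROM"], row["PHASE_BLOCK"])
        start, end = int(row["START"]), int(row["END"])
        span = blocks.setdefault(key, [start, end])
        span[0], span[1] = min(span[0], start), max(span[1], end)
    lines = []
    for (chrom, block_id), (start, end) in blocks.items():
        # GTF start is 1-based, so add 1 to the 0-based bin start.
        fields = [chrom, "dragen", "phaseblock", str(start + 1), str(end),
                  ".", ".", ".", f"phase_block {block_id}"]
        lines.append("\t".join(fields))
    return lines
```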
Imputed Variants
Imputed variants for each phased bin are output in a VCF file. This VCF contains only variants imputed from the haplotype database in the reference hash table. It does not include novel variants observed in the sample, and multi‑allelic variants are split into separate records.
VCF (<sample_id>.personal.vcf.gz)
The VCF follows the 4.2 standard. The relevant fields are described below:
QUAL | Phred-scale score for the marginal probability of ALT. For example, for a diploid variant: −10∗log10(P(GT=’0∣0’))
INFO:HAPS | Two best haplotype pairs for the bin the variant belongs to
INFO:PGP | Marginal probability for P(GT=’0∣0’), P(GT=’1∣0’)+P(GT=’0∣1’), P(GT=’1∣1’)
FORMAT:PS | Phase block ID for the bin the variant belongs to
Phasing Metrics
DRAGEN reports a set of phasing metrics for each TruPath analysis and writes them to a summary CSV file. Reported metrics include phase block length statistics (N50, L50, NG50, LG50), cumulative phase block lengths, counts of fully phased genomic windows, and counts of fully phased genes. Gene‑based metrics are reported only when a gene list is provided using --read-phasing-gene-list.
CSV (<sample_id>.phasing_summary_stats.csv)
Phasing chromosomes | A list of the chromosomes used to calculate the metrics. Only autosomes with phased reads are considered.
N50 | The length of the shortest phase block where all phase blocks of at least that length account for ≥50% of the cumulative phase block length.
L50 | The smallest number of phase blocks that account for 50% of the cumulative phase block length.
NG50 | The length of the shortest phase block where all phase blocks of at least that length account for ≥50% of the cumulative length of the phasing chromosome set.
LG50 | The smallest number of phase blocks that account for 50% of the cumulative length of the phasing chromosome set.
Total phase block length for L50/N50 | The cumulative length of the phase-block assembly.
Total phase block length for LG50/NG50 | The cumulative length of the chromosome set.
Number of fully phased 300 kbp windows | After partitioning each chromosome into 300 kbp windows, the number of such windows that are each fully contained within a single phase block.
Number of fully phased genes | The number of genes that are each fully contained within a single phase block.
Gene list | The filename of the gene list used to calculate the number of fully phased genes
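The N50/L50 and NG50/LG50 definitions differ only in the total length they are measured against. The Python sketch below follows the definitions given above; the function name and return convention are hypothetical.

```python
def n50_l50(block_lengths, total_length=None):
    """Return (N50, L50) relative to total_length.

    With total_length=None the sum of block lengths is used (N50/L50);
    passing the chromosome-set length gives NG50/LG50. Returns (None, None)
    if the blocks never reach half of total_length.
    """
    if total_length is None:
        total_length = sum(block_lengths)
    running = 0
    for count, length in enumerate(sorted(block_lengths, reverse=True), start=1):
        running += length
        if 2 * running >= total_length:
            return length, count
    return None, None
```

For phase blocks of 5, 4, 3, 2, and 1 Mbp, the two largest blocks cover 9 of 15 Mbp, so N50 is 4 Mbp and L50 is 2. Measured against a 30 Mbp chromosome set, all five blocks are needed, so NG50 is 1 Mbp and LG50 is 5.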
Structural Variant Calling
TruPath‑specific structural variant (SV) calling is supported only in single‑sample whole‑genome germline SV discovery mode. DRAGEN‑SV leverages proximity information indirectly through phasing information encoded in the reads, rather than using proximity links directly during SV detection.
This approach provides several key advantages. Candidate regions are assembled separately by haplotype, which reduces assembly graph complexity and produces higher‑quality contigs. Features used by the machine‑learning (ML) model are also segregated by haplotype, enabling improved training and inference. As a result, heterozygous SVs can be distinguished and assigned to specific local haplotypes.
Leveraging TruPath Proximity-Linked Features
DRAGEN‑SV currently incorporates proximity information indirectly by using phasing information during candidate assembly and ML‑based filtering. For best accuracy, ML filtering should remain enabled.
Phased Assembly
Reads collected for candidate assembly are partitioned into two haplotypes based on available phasing information. Each haplotype is assembled independently, resulting in at most one contig per haplotype. Up to two contigs per candidate are propagated through downstream stages of the pipeline.
ML Processing
When run with TruPath data, DRAGEN‑SV uses an ML model trained on TruPath‑derived features that depend on read‑level phasing, in addition to features used with standard Illumina sequencing data. Enabling ML processing is critical for achieving optimal SV calling accuracy.
Collapsing, Deduplication, Regenotyping
Structural variants of certain types, including insertions and deletions, may be produced from multiple phased assembly rounds. These SVs are collapsed and deduplicated when they are inferred to represent the same event before being written to the VCF output. SV type, length, genomic location, genotype scores, and haplotype of origin are used to determine equivalence.
During this process, genotypes may be updated. For example, if a heterozygous SV is produced only from reads phased to the first haplotype, the genotype GT field is set to 1/0. If two SVs originating from different haplotypes are collapsed into a single event, the resulting SV is re‑genotyped as 1/1.
SV VCF Outputs
The following VCF fields are added for TruPath:
INFO Fields
PHASEDASM | Haplotype of the reads used for the assembly yielding the SV (only with --enable-proximity=true)
ML_UPDATED | The FILTER status has changed from PASS to non-PASS or non-PASS to PASS after QUAL being recalibrated by ML
FORMAT Fields
MLQS | ML recalibrated QUAL for indels
FILTER Fields
MLFail | Record | Prob(TP) is less than SV_ML_MIN_PASS_DEL_PROB for deletions or less than SV_ML_MIN_PASS_INS_PROB for insertions (default values).
Multi-Region Joint Detection
DRAGEN Multi-Region Joint Detection (MRJD) is a germline small variant caller for paralogous regions. When used with TruPath data, MRJD produces haplotype‑resolved variant calls by leveraging proximity linking information enabled by TruPath. This approach does not rely on known population haplotypes.
With TruPath data, MRJD currently supports nine sets of paralogous regions encompassing 15 clinically relevant genes. Table 1 lists the hg38 genomic coordinates covered by MRJD. MRJD is compatible only with the hg38 reference genome.
chr1 | 196786972 | 196827189 | CFHR3-CFHR1 | CFHR3-CFHR1-CFHR4-CFHR2 | Non-tandem
chr1 | 196911497 | 196951222 | CFHR4-CFHR2 | CFHR3-CFHR1-CFHR4-CFHR2 | Non-tandem
chr5 | 70924941 | 70966375 | SMN1 | SMN1-SMN2 | Non-tandem
chr5 | 70049523 | 70090528 | SMN2 | SMN1-SMN2 | Non-tandem
chr6 | 32037415 | 32045473 | CYP21A2-TNXB | CYP21A2 | Tandem
chr6 | 32004679 | 32012619 | CYP21A1P-TNXA | CYP21A2 | Tandem
chr7 | 5969485 | 5987844 | PMS2 | PMS2-PMS2CL | Non-tandem
chr7 | 6736851 | 6755308 | PMS2CL | PMS2-PMS2CL | Non-tandem
chr7 | 74771000 | 74791999 | NCF1 | NCF1-NCF1B-NCF1C | Non-tandem
chr7 | 73217606 | 73238630 | NCF1B | NCF1-NCF1B-NCF1C | Non-tandem
chr7 | 75153934 | 75174978 | NCF1C | NCF1-NCF1B-NCF1C | Non-tandem
chr8 | 142873164 | 142879856 | CYP11B1 | CYP11B1-CYP11B2 | Tandem
chr8 | 142910764 | 142917883 | CYP11B2 | CYP11B1-CYP11B2 | Tandem
chr15 | 43599563 | 43618800 | STRC | STRC-STRCP1 | Tandem
chr15 | 43699418 | 43718260 | STRCP1 | STRC-STRCP1 | Tandem
chr22 | 18159724 | 18174315 | USP18 | USP18-USP41P | Non-tandem
chr22 | 20362649 | 20377695 | USP41P | USP18-USP41P | Non-tandem
chr22 | 42123192 | 42132193 | CYP2D6 | CYP2D6-CYP2D7 | Tandem
chr22 | 42135344 | 42145873 | CYP2D7 | CYP2D6-CYP2D7 | Tandem
Table 1. Paralogous regions covered by MRJD.
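When post-processing results, it can be handy to check whether a position falls inside one of the regions in Table 1. The Python sketch below encodes a subset of the table and performs a simple lookup. The names MRJD_REGIONS and find_region are hypothetical, and the sketch assumes the listed start and end coordinates are both inclusive, which the table does not state.

```python
# Subset of Table 1: (chromosome, start, end, region, paralog set, layout)
MRJD_REGIONS = [
    ("chr5", 70924941, 70966375, "SMN1", "SMN1-SMN2", "Non-tandem"),
    ("chr5", 70049523, 70090528, "SMN2", "SMN1-SMN2", "Non-tandem"),
    ("chr7", 5969485, 5987844, "PMS2", "PMS2-PMS2CL", "Non-tandem"),
    ("chr7", 6736851, 6755308, "PMS2CL", "PMS2-PMS2CL", "Non-tandem"),
    ("chr22", 42123192, 42132193, "CYP2D6", "CYP2D6-CYP2D7", "Tandem"),
    ("chr22", 42135344, 42145873, "CYP2D7", "CYP2D6-CYP2D7", "Tandem"),
]


def find_region(chrom, pos):
    """Return (region, paralog set, layout) covering pos, or None."""
    for c, start, end, region, group, layout in MRJD_REGIONS:
        if c == chrom and start <= pos <= end:  # assumes inclusive bounds
            return region, group, layout
    return None
```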
Method
MRJD begins by collecting all primary alignments within the paralogous regions of interest, regardless of mapping quality. For each paralogous region set (for example, SMN1–SMN2), MRJD estimates the total copy number by leveraging read depth across the regions of interest and a set of pre‑selected stable regions elsewhere in the genome.
Using the estimated total copy number, read sequences, and proximity linking information, MRJD constructs the corresponding number of copies for each paralogous region set. For non‑tandem paralogous regions, proximity information is used to assign each constructed copy to the genomic region from which it most likely originated (for example, PMS2 versus PMS2CL). For tandem paralogous regions, proximity information is instead used to assign each copy to the maternal or paternal haplotype.
Finally, MRJD calls small variants based on the constructed copies and reports variant calls together with their assigned genomic regions or haplotypes.
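The depth-based copy number step can be illustrated with simple arithmetic. The Python sketch below is a deliberately simplified assumption, not MRJD's actual estimator: it treats the stable regions as diploid, assumes the paralogous regions have comparable length, and sums their mean depths before normalizing by the per-copy depth. The function name and parameters are hypothetical.

```python
def estimate_total_copy_number(region_depths, stable_depth, stable_ploidy=2):
    """Estimate the total copy number across one paralogous region set.

    region_depths: mean read depth of each region in the set (for example,
    SMN1 and SMN2), counting all primary alignments regardless of MAPQ.
    stable_depth: mean depth over stable regions of known copy number.
    """
    if stable_depth <= 0:
        raise ValueError("stable_depth must be positive")
    depth_per_copy = stable_depth / stable_ploidy
    return round(sum(region_depths) / depth_per_copy)
```

With a stable-region depth of 30x, mean depths of 30.5x and 29x over two paralogous regions would suggest four copies in total under these assumptions, while depths of 15x and 31x would suggest three.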
The figure below provides an overview of the MRJD Workflow using TruPath data.

Outputs
Upon analysis completion, DRAGEN produces the following MRJD output files in the directory specified by --output-directory, using the prefix defined by --output-file-prefix:
<prefix>.mrjd.hard-filtered.vcf.gz: VCF file containing small variants called by MRJD in paralogous regions.
<prefix>.mrjd.json: JSON file containing MRJD results, including copy number estimates, region or haplotype assignments for each copy, and run status for each paralogous region.
<prefix>.mrjd.phased.bam: BAM file containing phased read alignments within paralogous regions.
mrjd_supporting_files/: A directory containing additional files that support MRJD visualization, including:
<prefix>.mrjd.<paralog_name>.vcf.gz: Multi‑column VCF file containing MRJD variant calls for each paralogous region (one column per copy). One file is generated for each paralogous region set.
<prefix>.mrjd.reference_region_alignments.sam: SAM file containing reference region alignments used by MRJD.
MRJD VCF Output
The MRJD caller generates a gzip‑compressed VCFv4.2 file, <prefix>.mrjd.hard-filtered.vcf.gz, containing small variants derived from the inferred genotypes.
For a given set of paralogous regions, all copies are reported under each region. Each copy is annotated with its assigned genomic region or haplotype in the FORMAT fields, depending on the paralog structure.
For non‑tandem paralogous regions, the REGION_PLACEMENT field in the FORMAT column indicates the genomic region assignment for each copy, following the order of entries in the genotype field. Values indicate assignment to the current region, assignment to an alternate region, or an unplaced copy.
chr5	70052190	.	C	CA	500	.	regionGroupName=SMN1-SMN2;REF_DIFF_SITE	GT:REGION_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS	1|0|0|0:A,A,I,I:.:500:90,30:0.250:120:70052190
chr5	70052613	.	T	C	500	.	regionGroupName=SMN1-SMN2	GT:REGION_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS	1|0|0|0:A,A,I,I:.:500:86,35:0.289:121:70052190
chr5	70052881	.	C	CAAAAA	500	.	regionGroupName=SMN1-SMN2;REF_DIFF_SITE	GT:REGION_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS	1|0|0|0:A,A,I,I:.:500:93,28:0.231:121:70052190
chr5	70053733	.	TC	T	500	.	regionGroupName=SMN1-SMN2	GT:REGION_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS	0|1|0|0:A,A,I,I:.:500:85,32:0.274:117:70052190
chr5	70053985	.	CT	C	500	.	regionGroupName=SMN1-SMN2	GT:REGION_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS	0|1|0|1:A,A,I,I:.:500:67,65:0.492:132:70052190
chr5	70054456	.	TA	T	500	.	regionGroupName=SMN1-SMN2	GT:REGION_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS	0|1|1|1:A,A,I,I:.:500:22,105:0.827:127:70052190
For tandem paralogous regions, the PSL field in the FORMAT column indicates haplotype assignment for each copy, again following the order of entries in the genotype field. hap1 and hap2 correspond to assignment to the first and second haplotypes, respectively. Because tandem copies cannot be assigned to specific genomic regions, the REGION_PLACEMENT field is not applicable and is populated with U (unplaced) for all copies.
chr6	32004754	.	T	C	63.01	.	regionGroupName=CYP21A2;REF_DIFF_SITE	GT:PSL:REGION_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS	1|0|1|0:copy1_hap1,copy2_hap1,copy3_hap2,copy4_hap2:U,U,U,U:0.78:1:57,54:0.486:111:32004754
chr6	32004791	.	G	A	63.01	.	regionGroupName=CYP21A2;REF_DIFF_SITE	GT:PSL:REGION_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS	1|0|1|0:copy1_hap1,copy2_hap1,copy3_hap2,copy4_hap2:U,U,U,U:0.78:1:62,56:0.475:118:32004754
chr6	32004857	.	C	T	63.01	.	regionGroupName=CYP21A2;REF_DIFF_SITE	GT:PSL:REGION_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS	1|0|1|0:copy1_hap1,copy2_hap1,copy3_hap2,copy4_hap2:U,U,U,U:0.78:1:51,53:0.510:104:32004754
chr6	32004862	.	C	T	63.01	.	regionGroupName=CYP21A2;REF_DIFF_SITE	GT:PSL:REGION_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS	1|0|1|0:copy1_hap1,copy2_hap1,copy3_hap2,copy4_hap2:U,U,U,U:0.78:1:48,55:0.534:103:32004754
chr6	32004868	.	G	A	63.01	.	regionGroupName=CYP21A2;REF_DIFF_SITE	GT:PSL:REGION_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS	1|0|1|0:copy1_hap1,copy2_hap1,copy3_hap2,copy4_hap2:U,U,U,U:0.78:1:49,55:0.529:104:32004754
chr6	32005002	.	G	A	63.01	.	regionGroupName=CYP21A2	GT:PSL:REGION_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS	1|0|0|0:copy1_hap1,copy2_hap1,copy3_hap2,copy4_hap2:U,U,U,U:0.78:1:102,30:0.227:132:32004754
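Because the per-copy annotations follow the order of entries in the genotype field, the FORMAT and sample columns can be zipped together to get one record per copy. The following is a minimal sketch, not part of DRAGEN; the function name parse_mrjd_sample and the returned dictionary layout are hypothetical, and it assumes the GT, REGION_PLACEMENT, and PSL fields are laid out as in the examples above.

```python
import re

def parse_mrjd_sample(format_col, sample_col):
    """Pair each copy's allele with its placement/haplotype label.

    Hypothetical helper: assumes per-copy values in REGION_PLACEMENT and PSL
    are comma-separated and ordered like the alleles in GT.
    """
    fields = dict(zip(format_col.split(":"), sample_col.split(":")))
    alleles = re.split(r"[|/]", fields["GT"])
    placements = fields["REGION_PLACEMENT"].split(",")
    # PSL is only present for tandem paralogous regions in the examples.
    if "PSL" in fields:
        haplotypes = fields["PSL"].split(",")
    else:
        haplotypes = [None] * len(alleles)
    if not (len(alleles) == len(placements) == len(haplotypes)):
        raise ValueError("per-copy fields do not match the number of GT entries")
    return [
        {"copy": i + 1, "allele": a, "placement": p, "haplotype": h}
        for i, (a, p, h) in enumerate(zip(alleles, placements, haplotypes))
    ]

non_tandem = parse_mrjd_sample(
    "GT:REGION_PLACEMENT:RPQL:PQ:JAD:JAF:JDP:PS",
    "0|1|0|1:A,A,I,I:.:500:67,65:0.492:132:70052190",
)
tandem = parse_mrjd_sample(
    "GT:PSL:REGION_PLACEMENT:AGQL:PQ:JAD:JAF:JDP:PS",
    "1|0|1|0:copy1_hap1,copy2_hap1,copy3_hap2,copy4_hap2:U,U,U,U:0.78:1:57,54:0.486:111:32004754",
)
```

For the non-tandem record, the second copy carries the alternate allele with placement A; for the tandem record, every copy has placement U and a haplotype label taken from PSL.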
MRJD JSON Output
The MRJD caller generates a <prefix>.mrjd.json file in the output directory. This JSON‑formatted file contains detailed information for each paralogous region analyzed, including total copy number estimates, genomic region assignment for each copy, and haplotype assignment where applicable.
For each paralogous region, the total copy number is reported under jointCopyNumber. The mrjdRunStatus field indicates whether MRJD completed successfully for the region, with Success indicating a successful run and Aborted indicating a failure.
For non‑tandem paralogous regions, the JSON output includes copy‑to‑region assignments. For each copy reported in the corresponding VCF file (following the order of entries in the genotype field), the regionPlacement field indicates which genomic region the copy is assigned to.
For tandem paralogous regions, the JSON output reports haplotype assignments rather than genomic region placement. For each copy reported in the VCF file, the locusStructure field indicates the haplotype to which the copy is assigned. Because tandem copies cannot be uniquely mapped to specific genomic locations, all copies are listed as unplaced under regionPlacement. The example JSON outputs illustrate these differences for non-tandem and tandem paralogous regions.
Below is an example of the JSON output for a non-tandem paralogous region:
Below is an example of the JSON output for a tandem paralogous region:
MRJD Phased BAM Output
The MRJD caller generates a phased alignment file, <prefix>.mrjd.phased.bam, in the output directory. This file contains phased read alignments within paralogous regions.
As with the MRJD VCF output, all copies for a given set of paralogous regions are reported under each corresponding region. The phased BAM file enables inspection of read‑to‑copy assignments and phasing relationships within paralogous loci.
The following tags are added to the BAM records in the phased BAM file:
HP - Copy label assigned to the read. For non-tandem paralogs, copy labels correspond to genomic regions (for example, copy1_SMN1, copy2_SMN2). For tandem paralogs, copy labels correspond to haplotypes (for example, copy1_hap1, copy2_hap1).
PC - Phred-scaled confidence score for the read-to-copy assignment.
PS - Phasing set identifier.
BX - Template identifier based on proximity linking information. Fragments with the same BX tag are likely to originate from the same original DNA molecule.
The output format may be BAM, CRAM, or SAM, depending on the value specified for the --output-format option in the DRAGEN run.
MRJD Supporting Files
The MRJD caller generates an mrjd_supporting_files/ directory in the output directory. This directory contains files that support MRJD variant interpretation and visualization.
The following files are produced:
<prefix>.mrjd.<paralog_name>.vcf.gz: A multi-column VCF file containing small variants called by MRJD for each paralogous region. Each copy is represented as a separate column. This file is suitable for visualizing haplotype-resolved variants in genome browsers, such as IGV, that support multi-column VCF format.
<prefix>.mrjd.reference_region_alignments.sam: A SAM file containing reference region alignments used by MRJD. This file provides context for reference sequence differences between paralogous regions and can aid in interpreting variant calls, including the identification of gene conversion events.
Visualize MRJD Results in IGV
MRJD results can be inspected in IGV by loading the multi‑column VCF file, the phased BAM file, and the reference region alignments SAM file generated by the pipeline:
mrjd_supporting_files/<prefix>.mrjd.SMN1-SMN2.vcf.gz
<prefix>.mrjd.phased.bam
mrjd_supporting_files/<prefix>.mrjd.reference_region_alignments.sam
In the multi‑column VCF file, all SMN1 and SMN2 copies are reported under the SMN1 region and are also listed under the SMN2 region. Copy‑to‑region assignments are indicated in the sample column. In the example shown below, copies 1, 2, and 3 are assigned to the SMN1 region, while copy 4 is assigned to the SMN2 region.
The phased BAM file displays reads assigned to each copy. In IGV, this can be visualized by loading the BAM file and grouping alignments by phase.
The reference region alignments SAM file highlights sequence differences between the SMN1 and SMN2 reference regions, providing context for interpreting copy‑specific variant assignments.

Visualize MRJD Results in DRAGEN Reports
MRJD results are integrated into DRAGEN Reports. For sample‑level reports, MRJD results are available under the Paralogs tab.
The Paralog Sets table provides an overview of each paralogous region analyzed, including the estimated total copy number. Selecting a region opens the Paralogous regions view, which displays haplotype‑resolved variant calls within each paralogous region.
The example shown below illustrates MRJD phased variant calls for PMS2–PMS2CL. In this visualization, dark orange indicates the alternative allele at a reference difference site between paralogous regions, light orange indicates the reference allele at a reference difference site, and gray indicates a non‑reference difference site variant.

Notes
MRJD supports paralogous region calling only when the estimated total copy number is less than 8. Regions with higher copy numbers are skipped, and no variants are called; however, total copy number estimates are still reported in the JSON output.
MRJD supports only the hg38 reference genome.
Variant calling is supported only when the sample average linked coverage (excluding duplicates) is ≥16×.
MRJD currently supports small variant calling only.
STR Calling
TruPath data improves mapping accuracy for long short tandem repeats (STRs) by leveraging proximity linking information to place repetitive read pairs, including in‑repeat reads (IRRs), at their correct genomic locations. This enables more accurate sizing of STR expansions, particularly for large repeats that exceed the fragment length.
DRAGEN also uses phasing information to improve STR genotyping accuracy, which is especially important for large heterozygous expansions. Together, IRR recovery, proximity linking, and phasing-aware genotyping improve STR calling, and these improvements are applied automatically when running the DRAGEN Germline pipeline.
All required resource files are automatically detected for supported reference genomes.
In-Repeat Read (IRR) Recovery
IRR recovery is supported for repeat motifs with lengths between 2 and 6 bases. Motifs outside this range are not evaluated by IRR recovery, even if they are present in the catalog.
DRAGEN uses proximity information to recover in‑repeat reads (IRRs) that would otherwise remain unmapped or misaligned. This capability is particularly important for detecting large repeat expansions that exceed the fragment length. Although the mapper accounts for proximity information to improve alignment, IRRs require additional handling due to their low‑complexity sequence content.
IRR recovery is enabled by default when DRAGEN is run in proximity mode. DRAGEN‑STR automatically adjusts its parameters accordingly, and disabling IRR recovery is not recommended when analyzing samples for repeat expansions.
IRR recovery relies on a BED catalog that defines candidate STR regions and their associated repeat motifs. The catalog may include multiple entries for the same genomic region, allowing different motifs to be specified for a single STR locus.
For example, the RFC1 locus can be represented in the catalog as follows:
4	39348424	39348479	AAAAG	RFC1
4	39348424	39348479	AAAGG	RFC1
4	39348424	39348479	AAGGG	RFC1
4	39348424	39348479	AAGAG	RFC1
4	39348424	39348479	AACGG	RFC1
4	39348424	39348479	ACGGG	RFC1
4	39348424	39348479	ACAGG	RFC1
4	39348424	39348479	AAAGGG	RFC1
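A catalog in this layout can be grouped by locus to see which motifs are defined for each STR region and which of them fall inside the 2 to 6 base range that IRR recovery evaluates. This is a minimal sketch, assuming the five whitespace-separated columns are contig, start, end, motif, and locus name as in the RFC1 example; the function names are hypothetical and not part of DRAGEN.

```python
from collections import defaultdict

RFC1_CATALOG = """\
4\t39348424\t39348479\tAAAAG\tRFC1
4\t39348424\t39348479\tAAAGG\tRFC1
4\t39348424\t39348479\tAAGGG\tRFC1
4\t39348424\t39348479\tAAGAG\tRFC1
4\t39348424\t39348479\tAACGG\tRFC1
4\t39348424\t39348479\tACGGG\tRFC1
4\t39348424\t39348479\tACAGG\tRFC1
4\t39348424\t39348479\tAAAGGG\tRFC1
"""

def load_irr_catalog(lines):
    """Group catalog motifs by (locus name, contig, start, end)."""
    loci = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        contig, start, end, motif, name = line.split()[:5]
        loci[(name, contig, int(start), int(end))].append(motif)
    return dict(loci)

def motifs_in_irr_range(motifs, min_len=2, max_len=6):
    """Keep only motifs whose length IRR recovery evaluates (2 to 6 bases)."""
    return [m for m in motifs if min_len <= len(m) <= max_len]

catalog = load_irr_catalog(RFC1_CATALOG.splitlines())
rfc1_motifs = catalog[("RFC1", "4", 39348424, 39348479)]
```

Here all eight RFC1 motifs are 5 or 6 bases long, so every entry is within the evaluated range.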
DRAGEN provides BED catalogs for IRR recovery that cover all loci in the default DRAGEN-STR catalogs. The default BED catalogs are located in the <INSTALL_PATH>/resources/irr_recovery/ directory.
When using a supported reference genome and the default catalogs, IRR recovery is enabled automatically and does not require additional command‑line arguments.
Custom Catalogs
DRAGEN supports custom BED catalogs for in-repeat read (IRR) recovery through the --irr-recovery-str-bed command‑line option. Custom catalogs must follow the same format as the default catalogs provided by DRAGEN.
When a custom catalog is supplied, DRAGEN uses it in place of the default catalog for the selected reference genome. It is important to ensure that the custom catalog includes all loci of interest for repeat expansion detection. If a locus is missing from the catalog, IRR recovery is not performed for that locus, which may reduce sensitivity.
DRAGEN-provided built-in catalogs are available for download from the DRAGEN Product Files Site and can serve as a starting point for generating custom catalogs.
IRR Recovery BAM Tags
Remapped IRRs are annotated in the output BAM file using the tr tag. The tr tag encodes the repeat motif and motif length in a 16-bit packed representation:
The lower 12 bits encode the motif bases using 2-bit encoding [A=00, C=01, G=10, T=11].
The upper 4 bits encode the motif length.
Bases are ordered from least significant to most significant bit.
For example, the motif AAGGG is stored with the length 5 in the upper 4 bits and its five bases in the lowest 10 of the 12 motif bits.
To avoid redundant motif representations, the packed form always corresponds to the shortest motif pattern and the lexicographically smallest rotation across the forward motif and its reverse complement. For example, the motif CACA is represented as AC.
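The packing and canonicalization rules can be sketched in a few lines. This is an illustration under stated assumptions, not DRAGEN code: it assumes the first base of the canonical motif occupies the two least significant bits, and the function names (shortest_unit, canonical_motif, pack_tr) are hypothetical.

```python
_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
_COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def shortest_unit(motif):
    """Return the shortest pattern whose repetition yields the motif."""
    n = len(motif)
    for k in range(1, n + 1):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return motif[:k]
    return motif

def canonical_motif(motif):
    """Smallest rotation across the forward unit and its reverse complement."""
    unit = shortest_unit(motif.upper())
    revcomp = "".join(_COMP[b] for b in reversed(unit))
    rotations = [s[i:] + s[:i] for s in (unit, revcomp) for i in range(len(s))]
    return min(rotations)

def pack_tr(motif):
    """Pack a motif into 16 bits: length in bits 12-15, bases in bits 0-11.

    Assumption: base i of the canonical motif is stored at bits 2*i and 2*i+1.
    """
    unit = canonical_motif(motif)
    if not 1 <= len(unit) <= 6:
        raise ValueError("motif must have 1 to 6 bases to fit in 12 bits")
    bits = 0
    for i, base in enumerate(unit):
        bits |= _CODE[base] << (2 * i)
    return (len(unit) << 12) | bits

aagggg_packed = pack_tr("AAGGG")
caca_packed = pack_tr("CACA")
```

Under this bit-order assumption, AAGGG packs to 0x52A0 (length 5 in the top nibble, G=10 in bit pairs 2 to 4), and CACA is first reduced to AC before packing.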
The tr tag is applied to all IRRs recovered using proximity information. Remapped IRRs are assigned a single alignment position corresponding to the first base of the associated STR region in the reference genome and are marked as unmapped with MAPQ 0.
Phasing
When proximity mode is enabled, DRAGEN uses available phasing information to improve the accuracy of repeat expansion genotyping. Phasing helps resolve ambiguities in assigning reads to haplotypes in diploid regions, which is particularly important for accurately estimating repeat sizes in large heterozygous expansions.
Output calls remain unphased and are reported using the standard VCF format for short tandem repeat (STR) variants. However, the underlying genotyping model incorporates phasing information to improve repeat size estimates.
Sequencing Efficiency Correction
Some loci are affected by sequencing biases that result in uneven coverage across alleles. These biases can reduce the accuracy of repeat expansion genotyping.
When proximity mode is enabled, DRAGEN applies a sequencing efficiency correction to adjust expected coverage at each locus based on empirical data. This correction improves repeat size estimates by compensating for systematic sequencing bias. To minimize confounding effects from mapping bias, sequencing efficiency correction is enabled only for TruPath samples.
Sequencing efficiency correction can be applied on a per-locus basis by adding the SequencingEfficiencyCorrection field to the respective catalog entry. For example:
Correction factors should be determined empirically based on a set of control samples with known repeat sizes through orthogonal methods. DRAGEN provides precomputed correction factors in the default catalogs that were calibrated for the following loci:
FMR1
DMPK
FXN
Colocation Maps
Colocation maps capture proximity information to characterize long‑range interactions within a sample. The output of the colocation module is a matrix of interaction counts, where each cell represents the number of observed interactions between two genomic regions.
Colocation maps are typically visualized as heatmaps. The example shown illustrates a small region on chromosome 5. Darker pixels indicate a higher number of interactions between the corresponding genomic regions.

Several common features can be observed in colocation heatmaps:
The main diagonal reflects interactions among fragments originating from the same long template molecules and landing in nearby genomic bins.
Triangular or off‑diagonal structures may indicate structural variants, such as large deletions or breakends.
Most off‑diagonal pixels are either empty (white) or represent low‑level background signal (green).
Colocation Map Generation
Colocation map generation is a three-step process.
Collect relevant alignments
Compute the colocation matrix
Generate output files
Alignment Collection
During alignment collection, DRAGEN gathers all reads eligible for analysis. Alignments are excluded if they map to decoy contigs, fall below the mapping quality threshold, or are marked as duplicates. The remaining reads are assigned to genomic bins, with each bin representing approximately 2,000 bp of the genome.
Matrix Construction
The colocation matrix is then constructed by evaluating spatial relationships between reads. For each read (read1), DRAGEN identifies nearby reads (read2) and increments the matrix entry corresponding to their respective bins. A read is considered nearby if it falls within a rectangular region centered on read1. The size of this region is determined by the proximity linkage characteristics of the sample and is selected to balance sensitivity and performance.
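The counting step can be illustrated with a small brute-force sketch. It is not the DRAGEN implementation: the read fields (x and y flowcell coordinates, chrom, pos), the window half-widths, and the choice to count each unordered read pair once are all assumptions made for illustration, and a real implementation would use a spatial index rather than comparing every pair.

```python
from collections import Counter

BIN_SIZE = 2000  # approximate bin size described above, in bp

def colocation_counts(reads, half_width_x, half_width_y, bin_size=BIN_SIZE):
    """Count read pairs that are close on the flowcell, keyed by genomic bins.

    Each read is a dict with hypothetical keys: x, y (flowcell position),
    chrom, and pos (alignment start). Each unordered pair is counted once.
    """
    counts = Counter()
    for i, read1 in enumerate(reads):
        for read2 in reads[i + 1:]:
            # read2 is "nearby" if it lies inside the rectangle centered on read1.
            if (abs(read1["x"] - read2["x"]) <= half_width_x
                    and abs(read1["y"] - read2["y"]) <= half_width_y):
                bin1 = (read1["chrom"], read1["pos"] // bin_size)
                bin2 = (read2["chrom"], read2["pos"] // bin_size)
                counts[tuple(sorted((bin1, bin2)))] += 1
    return counts

example_reads = [
    {"x": 0, "y": 0, "chrom": "chr5", "pos": 1000},
    {"x": 5, "y": 5, "chrom": "chr5", "pos": 5000},
    {"x": 500, "y": 500, "chrom": "chr5", "pos": 1500},
]
example_counts = colocation_counts(example_reads, half_width_x=10, half_width_y=10)
```

In this toy input only the first two reads fall within the same window, so the single non-zero entry links bin 0 and bin 2 on chr5.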
Additional Options
Several options are available to control colocation matrix generation:
The genome is partitioned into fixed-size bins of equal length, and alignments are assigned to bins based on their starting position. Bin size can be adjusted using the --colocation-bin-size option.
Alignments with specific BAM flags can be excluded using --colocation-alignment-filter-flags, which accepts an integer bitmask specifying flags to ignore.
A minimum mapping quality can be enforced using --colocation-alignment-min-mapq.
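An integer bitmask for the flag filter can be built from the standard SAM flag bits. The sketch below uses the flag values defined in the SAM specification; the helper names are hypothetical, and the exclusion rule (drop an alignment if any masked bit is set) is an assumption about how the option is interpreted.

```python
# Standard SAM/BAM flag bits (from the SAM specification).
SAM_FLAGS = {
    "paired": 0x1,
    "proper_pair": 0x2,
    "unmapped": 0x4,
    "mate_unmapped": 0x8,
    "reverse": 0x10,
    "mate_reverse": 0x20,
    "read1": 0x40,
    "read2": 0x80,
    "secondary": 0x100,
    "qc_fail": 0x200,
    "duplicate": 0x400,
    "supplementary": 0x800,
}

def build_filter_mask(*names):
    """Combine named SAM flags into a single integer bitmask."""
    mask = 0
    for name in names:
        mask |= SAM_FLAGS[name]
    return mask

def is_excluded(alignment_flag, mask):
    """Assumed semantics: exclude if any masked flag bit is set."""
    return (alignment_flag & mask) != 0

mask = build_filter_mask("secondary", "qc_fail", "duplicate", "supplementary")
```

With these four flags the mask is 3840 (0xF00), which is the integer that would be passed on the command line under the assumed semantics.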
Cooler File
Colocation output is written as a cooler file containing a sparse representation of the colocation matrix.
The file conforms to schema 3 of the official cooler specification. DRAGEN produces a single‑resolution cooler file. The colocation matrix is stored in square mode and is symmetric, with each pixel containing a single integer count field of type int32.
The resulting cooler file can be processed using the cooler CLI or Python API.
Colocation Filter
The colocation filter uses colocation map data to assess proximity support for structural variant (SV) breakends and to flag events that are not supported by proximity evidence.
For each candidate breakend defined by coordinates chrom1:pos1 and chrom2:pos2, the filter evaluates a localized region of the colocation map. A bounding box centered on these coordinates is applied, with a default size of 200 kb, and the values of all bins within this region are summed to quantify local interaction support.
To account for variation in sequencing depth and data quality, the regional sum is normalized using the median non-zero diagonal value of the colocation map. If the normalized value is below the configured threshold (default: 1.0), the ColocationSum filter is applied to the breakend in the VCF output.
Filter application follows paired-event semantics:
If the ColocationSum filter is applied to one breakend of a paired SV event, it is also applied to the corresponding mate breakend record.
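The region sum, normalization, and threshold test described above can be sketched as follows. This is an illustration only: it assumes the colocation matrix is available as a dictionary of upper-triangle pixels keyed by hypothetical genome-wide bin indices, and the function names are not DRAGEN APIs. With 2 kb bins, a 200 kb box corresponds to roughly 50 bins on either side of the breakend.

```python
from statistics import median

def region_sum(pixels, bin1, bin2, half_width_bins):
    """Sum counts in a square box centered on (bin1, bin2) of a symmetric matrix.

    pixels maps (i, j) with i <= j to a count; each pixel also stands for (j, i).
    """
    total = 0
    for (i, j), count in pixels.items():
        for a, b in {(i, j), (j, i)}:  # set avoids double-counting the diagonal
            if abs(a - bin1) <= half_width_bins and abs(b - bin2) <= half_width_bins:
                total += count
    return total

def colocation_filter(pixels, bin1, bin2, half_width_bins=50, threshold=1.0):
    """Return (normalized_sum, filtered) for one candidate breakend."""
    diagonal = [c for (i, j), c in pixels.items() if i == j and c > 0]
    normalizer = median(diagonal)  # median non-zero diagonal value
    normalized = region_sum(pixels, bin1, bin2, half_width_bins) / normalizer
    return normalized, normalized < threshold

toy_pixels = {(0, 0): 10, (1, 1): 20, (2, 2): 30, (100, 300): 5}
weak = colocation_filter(toy_pixels, 100, 300, half_width_bins=2)
strong = colocation_filter({**toy_pixels, (100, 300): 40}, 100, 300, half_width_bins=2)
```

In the toy matrix the median diagonal value is 20, so a box sum of 5 normalizes to 0.25 and the breakend would be flagged, while a box sum of 40 normalizes to 2.0 and passes.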

Running DRAGEN SV with Colocation Filter
Colocation filtering is enabled by default if enable-colocation and enable-sv are both set to true. To disable the filter manually, set --sv-enable-colocation-filter to false when starting the DRAGEN analysis with TruPath enabled.
Additional Options:
sv-colocation-filter-normalize-by-median: If true, the colocation filter normalizes the region sum by the median diagonal value of the colocation matrix (default: true).
sv-colocation-filter-threshold: Minimum (normalized) region sum in the colocation matrix required to pass the filter (default: 1.0).
sv-colocation-filter-region-width: Width (in bp) of the square region in the colocation matrix over which the sum is computed (default: 200 kbp).
sv-colocation-filter-min-svlen: The colocation filter does not run on intra-chromosomal breakend pairs that are within this distance of each other (default: 200 kbp).
sv-colocation-filter-inter-bnd: If true, the colocation filter is applied to inter-chromosomal breakends (default: true).
sv-colocation-filter-intra-bnd: If true, the colocation filter is applied to intra-chromosomal breakends (default: true).
Output
The SV VCF file will have the additional headers if the colocation filter is enabled:
Examples of VCF records can be seen below. The first breakend pair has the ColocationSum filter applied, as there was no colocation signal at all (NORMALIZED_COLOC_SUM=0.0000).
Targeted Calling from TruPath Data
For WGS TruPath data, only lpa, hba, and smn will run when the Targeted Caller is enabled. A custom list of supported targets can be enabled via the command line.
Proximity Coverage Reports
When proximity mapping is enabled, DRAGEN generates a parallel set of coverage reports filtered to include only linked reads.
During template reconstruction, each read‑pair fragment is assigned a link‑quality score equal to the highest‑quality link connecting it to other fragments. Only reads from fragments with link‑quality scores meeting or exceeding a specified threshold are included in proximity coverage reports.
Proximity coverage reports are generated for each link-quality threshold specified using --proximity-min-linkq-threshold (default: 10) and --proximity-additional-linkq-thresholds (default: 25; maximum of two values). These reports are available for WGS and all defined QC coverage regions.
Report | File suffix | Description
Proximity coverage metrics | _proximity_linkqual<linkq-threshold>_coverage_metrics.csv | Coverage statistics for linked reads
Proximity fine histogram coverage | _proximity_linkqual<linkq-threshold>_fine_hist.csv | Detailed coverage histogram for linked reads
Proximity histogram coverage | _proximity_linkqual<linkq-threshold>_hist.csv | Binned coverage histogram for linked reads
Proximity overall mean coverage | _proximity_linkqual<linkq-threshold>_overall_mean_cov.csv | Overall mean coverage for linked reads
Proximity per contig mean coverage | _proximity_linkqual<linkq-threshold>_contig_mean_cov.csv | Per-contig mean coverage for linked reads
These reports use the same format and metrics as standard coverage reports but reflect statistics computed exclusively from linked reads meeting the specified threshold.
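The link-quality rule used to select reads for these reports can be sketched as a per-fragment maximum followed by a threshold test. This is a simplified illustration, assuming links are available as (fragment_a, fragment_b, quality) tuples; that representation and the function names are hypothetical.

```python
def fragment_link_quality(links):
    """Assign each fragment the quality of its best link to any other fragment."""
    best = {}
    for frag_a, frag_b, quality in links:
        for frag in (frag_a, frag_b):
            best[frag] = max(best.get(frag, 0), quality)
    return best

def fragments_passing(links, threshold):
    """Fragments whose link-quality score meets or exceeds the threshold."""
    scores = fragment_link_quality(links)
    return {frag for frag, score in scores.items() if score >= threshold}

example_links = [("f1", "f2", 30), ("f2", "f3", 12), ("f4", "f5", 8)]
passing_q10 = fragments_passing(example_links, 10)  # default minimum threshold
passing_q25 = fragments_passing(example_links, 25)  # default additional threshold
```

Fragment f3 is only linked at quality 12, so it contributes to the threshold-10 report but not to the threshold-25 report, while f4 and f5 are excluded from both.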
Reports
DRAGEN‑Reports includes a TruPath‑specific manifest to generate reports for TruPath WGS analysis. The manifest file, trupath/germline_wgs.json, is located in the /opt/dragen-reports/manifests directory. In addition to the standard QC metrics and visualizations provided in DRAGEN WGS reports, the TruPath report includes an additional Proximity tab highlighting metrics and visualizations specific to TruPath proximity‑enabled analysis, including:
Model Fit – Root mean square error indicating how well the proximity model fits the data.
Q25 Proximity Rate – Percentage of read pairs with at least one neighbor above Q25.
Q25 Proximity Coverage – Average autosomal coverage of read pairs with link quality above Q25.
P75 Template Size – Size of linked template molecules at the 75th percentile.
Phase Block NG50 – Size of the smallest phasing block required to cover at least 50% of the genome.
The Proximity tab also includes several visualizations summarizing proximity-specific characteristics, including:
The distribution of template genomic lengths from <prefix>.wgs_template_gdist.csv

The genomic coverage of variant phasing blocks by minimum block size, from <prefix>.phase_blocks.gtf

The distribution of templates by sub-read count from <prefix>.<qc-region>_template_gdist.csv
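The Phase Block NG50 metric listed above follows the usual NG50 construction: sort blocks from largest to smallest and report the length of the block at which the cumulative length first reaches half of the genome size. The sketch below is a generic illustration with a hypothetical function name; how DRAGEN defines the genome size used in the denominator is not stated here and is left as an input.

```python
def phase_block_ng50(block_lengths, genome_size):
    """Length of the block at which cumulative coverage reaches 50% of the genome.

    Returns None if the blocks together cover less than half of the genome.
    """
    target = genome_size / 2
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return None

ng50_a = phase_block_ng50([40, 30, 20, 10], genome_size=100)
ng50_b = phase_block_ng50([50, 30, 20], genome_size=100)
ng50_none = phase_block_ng50([10, 10], genome_size=100)
```

In the first case the two largest blocks (40 and 30) are needed to reach 50, so the NG50 is 30; in the second a single block of 50 already suffices.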

Limitations
Illumina TruPath proximity enabled analysis has the following limitations:
Illumina TruPath proximity mode is currently supported for the DRAGEN Germline pipeline. The Somatic, RNA, UMI, MRD, and Methylation pipelines are not supported.
DRAGEN downsampling is not supported. In order to maintain the proximity property of the TruPath assay, FASTQs should not be randomly downsampled.
Only human samples using hg38 have been verified.
Only TruPath data inputs from the Illumina TruPath Genome prep are supported at this time. Running --enable-proximity=true with non-TruPath data inputs will halt analysis.
Phasing requires the use of a pangenome reference hash table with personalization enabled. Analysis halts if coverage is too low to support personalization.
For on-premises analyses, TruPath analysis requires a v4 DRAGEN server due to FPGA memory limitations. For reference, v4 servers have a server serial number which begins with the letters "AC".
For cloud analysis, TruPath analysis must be run on AWS f2 instance types. Azure NP-series and GCP FPGA platforms are not supported at this time.
MRJD requires at least 16x coverage to make calls; the caller will abort any attempt to call genes with insufficient aligned read coverage.
TruPath Genome Licensing
Illumina TruPath proximity‑enabled analysis can be run in the cloud or on supported on‑premises systems.
Cloud analysis is supported via Illumina Connected Analytics (ICA), BaseSpace Sequence Hub (BSSH) Run Planning with AutoLaunch, and DRAGEN FPGA Cloud BYOL on AWS EC2 f2.6xlarge instances.
Local analysis is supported on Phase 4 DRAGEN On‑Prem servers.
For DRAGEN On‑Prem servers and DRAGEN FPGA Cloud BYOL customers, the pipeline requires a Proximity license.
The Proximity license is included with the purchase of the Illumina TruPath Genome prep kit and is automatically assigned.
Due to FPGA memory constraints, the Proximity license for on‑premises use is supported only on Phase 4 servers. Phase 4 servers can be identified by a server serial number beginning with the letters “AC.”