Unique Molecular Identifiers

DRAGEN can process data from whole genome and hybrid-capture assays with unique molecular identifiers (UMIs). UMIs are molecular tags added to DNA fragments before amplification so that each amplified fragment can be traced to its original input DNA molecule. UMIs help reduce errors and biases introduced by DNA damage (such as deamination before library prep), PCR errors, and sequencing errors.

To use the UMI Pipeline, the input read files must come from a paired-end run. Input can be pairs of FASTQ files or an aligned or unaligned BAM file. DRAGEN supports the following UMI types:

  • Dual, nonrandom UMIs, such as TruSight Oncology (TSO) UMI Reagents or IDT xGen Prism.

  • Dual, random UMIs, such as Agilent SureSelect XT HS2 molecular barcodes (MBC) or IDT xGen Duplex Seq Adapters.

  • Single-ended, random UMIs, such as Agilent SureSelect XT HS molecular barcodes (MBC) or IDT xGen dual index UMI Adapters.

DRAGEN uses the UMI sequence to group the read pairs by their original input fragment and generates a consensus read pair for each such group, or family. The consensus reduces error rates to detect rare and low frequency somatic variants in DNA samples with high accuracy. DRAGEN generates a consensus as follows.

  1. Aligns reads.

  2. Groups reads with matching UMIs and pair alignments. These groups are referred to as families.

  3. Generates a single consensus read pair for each read family.

These generated reads have higher quality scores than the input reads and reflect the increased confidence gained by combining multiple observations into each base call. The UMI workflow is compatible only with small variant calling and structural variant (SV) calling in DRAGEN.
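As a rough illustration of steps 2 and 3, the following Python sketch groups already-aligned reads by exact UMI and position, then takes a per-base majority vote. It is a toy model, not DRAGEN's implementation: the tuple layout, the function names, and the majority-vote rule are assumptions made for this example, and base qualities, fuzzy position matching, and UMI correction are ignored.

```python
from collections import Counter, defaultdict

def group_into_families(reads):
    """Group (umi, position, sequence) tuples by exact (umi, position)."""
    families = defaultdict(list)
    for umi, pos, seq in reads:
        families[(umi, pos)].append(seq)
    return dict(families)

def majority_consensus(seqs):
    """Per-column majority vote over equal-length sequences (hypothetical rule)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))

# Hypothetical input: three reads from one molecule, one read from another.
reads = [
    ("AAGGATG+TCGGAGA", 1000, "ACGTACGT"),
    ("AAGGATG+TCGGAGA", 1000, "ACGTACGT"),
    ("AAGGATG+TCGGAGA", 1000, "ACGAACGT"),  # simulated PCR or sequencing error
    ("CCTTAGC+GATTACA", 1000, "ACGTACGA"),
]

families = group_into_families(reads)
consensus = {key: majority_consensus(seqs) for key, seqs in families.items()}
```

In this sketch the single erroneous base in the third read is outvoted by the other two reads in its family, which is the intuition behind the higher quality of consensus reads.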

UMI Input

Enter UMIs in one of the following formats:

  • Read name—The UMI sequence is located in the eighth colon-delimited field of the read name (QNAME). For example, NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA

  • BAM tag—The UMI is present as an RX tag in a prealigned or aligned BAM file (standard SAM format).

  • FASTQ file—The UMI is located in a third FASTQ file using the same read order as the read pairs.

To create FASTQ files with the UMI appended to the read name, specify the appropriate OverrideCycles setting in the BCL conversion tool (see Illumina BCL Data Conversion). DRAGEN supports UMIs with two parts, each with a maximum of 8 bp and separated by +, or a single UMI with a maximum of 15 bp.
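The read-name format and the length limits above can be checked with a short Python sketch. The function names are hypothetical, and accepting N as a valid UMI base is an assumption of this example rather than a documented DRAGEN rule.

```python
import re

# Two parts of at most 8 bp joined by "+", or a single part of at most 15 bp.
# Allowing "N" is an assumption made for this sketch.
DUAL_UMI = re.compile(r"^[ACGTN]{1,8}\+[ACGTN]{1,8}$")
SINGLE_UMI = re.compile(r"^[ACGTN]{1,15}$")

def umi_from_qname(qname):
    """Return the eighth colon-delimited field of a read name."""
    fields = qname.split(":")
    if len(fields) < 8:
        raise ValueError(f"read name has no eighth field: {qname!r}")
    return fields[7]

def is_supported_umi(umi):
    """Check the UMI against the length limits described in the text."""
    return bool(DUAL_UMI.match(umi) or SINGLE_UMI.match(umi))

umi = umi_from_qname("NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA")
```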

The UMI workflow must be executed using a set of reads that correspond to a unique set of RGSM/RGLB. DRAGEN supports multiple lanes if all lanes correspond to the same RGSM/RGLB set.

DRAGEN UMI does not support a tumor-normal analysis, because a tumor-normal run corresponds to two different RGSM. In a tumor-normal run, one sample name is used for tumor and one sample name is used for normal. DRAGEN UMI supports one sample in a run.

If using a BAM file or a list of FASTQ files as the input, the input might contain multiple samples. DRAGEN checks if only one sample is included in the run and if the sample uses only a single, unique RGLB library. DRAGEN also accepts a library that was spread across multiple lanes. If there is a single sample and single library, DRAGEN processes all included reads. If there are multiple samples or multiple libraries, DRAGEN aborts analysis with an error.
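A comparable pre-flight check can be sketched in Python by inspecting the @RG header lines of a SAM/BAM header before starting a run. This is only an illustration of the single-sample, single-library rule. The function name and the example header lines are hypothetical, and DRAGEN performs its own check internally.

```python
def check_single_sample_library(header_lines):
    """Return (sample, library) if the @RG lines name exactly one of each."""
    samples, libraries = set(), set()
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:VALUE pairs (SAM specification).
        tags = dict(field.split(":", 1) for field in line.rstrip("\n").split("\t")[1:])
        if "SM" in tags:
            samples.add(tags["SM"])
        if "LB" in tags:
            libraries.add(tags["LB"])
    if len(samples) != 1 or len(libraries) != 1:
        raise ValueError(f"expected one sample and one library, found {samples} and {libraries}")
    return samples.pop(), libraries.pop()

# Hypothetical header: one library sequenced on two lanes.
two_lanes = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:lane1\tSM:S1\tLB:L1",
    "@RG\tID:lane2\tSM:S1\tLB:L1",
]
result = check_single_sample_library(two_lanes)
```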

UMI Input Correction Table

For dual, nonrandom UMIs, you can provide a predefined UMI correction table or a list of valid UMI sequences as input. To create the UMI correction table, use a tab-delimited file, include a header, and add the following fields.

If a customized correction table is not specified, DRAGEN uses the default table for TruSight Oncology (TSO) UMI Reagents, which is located at <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz. Alternatively, you can provide a whitelist file of nonrandom UMIs with one valid UMI sequence per line. DRAGEN then autogenerates a UMI correction table with a Hamming distance of one.
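To make the idea of a Hamming-distance-one correction table concrete, the Python sketch below maps every sequence that is one mismatch away from a whitelisted UMI back to that UMI. It shows the general technique only: the in-memory dictionary, the function names, and the choice to drop ambiguous sequences are assumptions, and the sketch does not reproduce the file format of the DRAGEN correction table.

```python
def hamming1_neighbors(seq, alphabet="ACGT"):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, base in enumerate(seq):
        for alt in alphabet:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]

def build_correction_map(whitelist):
    """Map observed UMI -> valid UMI for exact matches and unambiguous 1-mismatch neighbors."""
    valid = set(whitelist)
    table = {umi: umi for umi in valid}
    ambiguous = set()
    for umi in sorted(valid):
        for neighbor in hamming1_neighbors(umi):
            if neighbor in valid or neighbor in ambiguous:
                continue
            if neighbor in table and table[neighbor] != umi:
                # Equally close to two valid UMIs: refuse to correct (assumption).
                del table[neighbor]
                ambiguous.add(neighbor)
            else:
                table[neighbor] = umi
    return table

# Hypothetical two-entry whitelist.
correction = build_correction_map(["AAAA", "TTTT"])
```

A sparse whitelist, where valid UMIs are far apart, leaves few ambiguous neighbors, so more observed errors can be corrected.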

UMI Options

  • --umi-library-type—Set the batch option for UMI correction. Three batch modes are available that optimize collapsing configurations for different UMI types. Use one of the following modes:

    • random-duplex—Dual, random UMIs.

    • random-simplex—Single-ended, random UMIs.

    • nonrandom-duplex—Dual, nonrandom UMIs. To use this option, provide either --umi-nonrandom-whitelist or --umi-correction-table.

  • --umi-min-supporting-reads—Specify the number of input reads with matching UMIs required to generate a consensus read. Any family with insufficient supporting reads is discarded. The following are the recommended settings for FFPE and ctDNA.

    • [FFPE] If the variant frequency is greater than 1%, use --umi-min-supporting-reads=1 with the --vc-enable-umi-solid variant caller parameter. For more information on variant caller options, see Variant Caller Options.

    • [ctDNA] If the variant frequency is less than 1%, use --umi-min-supporting-reads=2 with the --vc-enable-umi-liquid variant caller parameter. For more information on variant caller options, see Variant Caller Options.

  • --umi-enable—To enable read collapsing, set the --umi-enable option to true. This option is not compatible with --enable-duplicate-marking because the UMI pipeline generates a consensus read from a set of candidate input reads, rather than choosing the best nonduplicate read. If using the --umi-library-type option, --umi-enable is not required.

  • --umi-emit-multiplicity—Set the consensus sequence type to output. DRAGEN UMI allows you to collapse duplex sequences from the two strands of the original molecules. Duplex sequence is typically ~20–60% of total library, depending on library kit, input material, and sequencing depth. Enter one of the following consensus sequence types:

    • both—Output both simplex and duplex sequences. This option is the default.

    • simplex—Output only simplex sequences.

    • duplex—Output only duplex sequences.

  • --umi-source—Specify the input type for the UMI sequence. The following are valid values: qname, bamtag, fastq. If using --umi-source=fastq, provide the FASTQ file that contains the UMI sequences using --umi-fastq.

  • --umi-correction-table—Enter the path to a customized correction table. By default, DRAGEN uses lookup correction with a built-in table for the Illumina TruSight Oncology and Illumina for IDT UMI Index Anchor kits.

  • --umi-nonrandom-whitelist—Enter the path to a file of customized, valid UMI sequences.

  • --umi-metrics-interval-file—Enter the path to a target region file in BED format.

  • --umi-output-uncollapsed-bam—Output the map/align results for uncollapsed (raw) reads to a separate BAM file named <output_prefix>.uncollapsed.bam.
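When DRAGEN is launched from a script, some of the constraints above can be checked before the command is assembled. The Python sketch below is one possible wrapper, not part of DRAGEN: the function name and its keyword arguments are hypothetical, and it only emits flags that are listed above.

```python
import shlex

LIBRARY_TYPES = {"random-duplex", "random-simplex", "nonrandom-duplex"}
MULTIPLICITY = {"both", "simplex", "duplex"}

def build_umi_args(library_type, min_supporting_reads=None,
                   emit_multiplicity=None, correction_table=None,
                   duplicate_marking=False):
    """Return a list of UMI command-line arguments after basic validation."""
    if library_type not in LIBRARY_TYPES:
        raise ValueError(f"unknown --umi-library-type: {library_type}")
    if emit_multiplicity is not None and emit_multiplicity not in MULTIPLICITY:
        raise ValueError(f"unknown --umi-emit-multiplicity: {emit_multiplicity}")
    if duplicate_marking:
        # Read collapsing replaces duplicate marking, so the two cannot be combined.
        raise ValueError("UMI collapsing is not compatible with --enable-duplicate-marking")
    args = ["--umi-library-type", library_type]
    if min_supporting_reads is not None:
        args += ["--umi-min-supporting-reads", str(min_supporting_reads)]
    if emit_multiplicity is not None:
        args += ["--umi-emit-multiplicity", emit_multiplicity]
    if correction_table is not None:
        args += ["--umi-correction-table", correction_table]
    return args

# Example: arguments following the ctDNA recommendation above.
ctdna_args = build_umi_args("random-duplex", min_supporting_reads=2)
ctdna_command_fragment = shlex.join(ctdna_args)
```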

Nonrandom and Random UMI Correction

DRAGEN processes UMIs by grouping reads by UMI and alignment position. If there are sequencing errors in the UMIs, DRAGEN can detect and correct small sequencing errors by using a lookup table or by using sequence similarity and read counts. Specify the type of correction with the --umi-library-type option, or with the --umi-correction-scheme option using the values lookup, random, or none.

For sparse sets of nonrandom UMIs, it is possible to create a lookup table that specifies which sequences can be corrected and how to correct them. This correction scheme works best on UMI sets where sequences have a minimum Hamming/edit distance between them. By default, DRAGEN uses lookup correction with a built-in correction table for the Illumina TruSight Oncology and Illumina for IDT UMI Index Anchor kits. Specify the path for your correction file using the --umi-correction-table option. If you are using a different set of nonrandom UMIs, contact Illumina Technical Support for information on generating the corresponding correction file.

In the random UMI correction scheme, DRAGEN must infer which UMIs at a given position are likely to be errors relative to other UMIs observed at the same position. The error modes include small UMI errors, such as a single mismatch, and UMI jumping or hopping artifacts from library prep. DRAGEN accomplishes this as follows.

  • Groups reads by fragment alignment position.

  • Within a small fuzzy window at each position, groups the reads first by exact UMI sequence, which forms a family.

  • Estimates UMI jumping or hopping probability from the insert size distribution and the number of distinct UMIs at certain positions.

  • Within a fuzzy window, calculates pair-wise likelihood ratio to assess if two families with different UMI sequences and genomic positions are derived from same original molecule.

  • Merges families with a likelihood lower than the threshold. The default threshold is 1.
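The general shape of random UMI correction can be sketched in Python with a much simpler rule than the one DRAGEN uses. In the sketch below, a smaller family is merged into a larger one when their positions fall within a fuzzy window and their UMIs differ by at most one mismatch. This heuristic is a stand-in for the likelihood ratio test described above, and it does not model UMI jumping. The function names, the dictionary layout, and the mismatch limit are assumptions. The window of 3 mirrors the default umi-fuzzy-window-size mentioned later in this document.

```python
def hamming(a, b):
    """Number of mismatching positions; unequal lengths are treated as far apart."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))

def merge_families(counts, window=3, max_mismatch=1):
    """Map each (umi, position) family to the family it is merged into.

    counts: dict of {(umi, position): number_of_reads}.
    Larger families are visited first, so small families merge into large ones.
    """
    ordered = sorted(counts, key=lambda key: (-counts[key], key))
    representative = {}
    for index, key in enumerate(ordered):
        representative[key] = key
        umi, pos = key
        for other in ordered[:index]:
            if representative[other] != other:
                continue  # only merge into families that were not merged themselves
            if abs(other[1] - pos) <= window and hamming(other[0], umi) <= max_mismatch:
                representative[key] = other
                break
    return representative

# Hypothetical family sizes at nearby positions.
counts = {
    ("ACGT", 100): 10,
    ("ACGA", 101): 1,   # one mismatch, one base away: likely a UMI error
    ("TTTT", 100): 5,   # different UMI at the same position: separate molecule
    ("ACGT", 200): 2,   # same UMI, distant position: separate molecule
}
merged = merge_families(counts)
```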

Merge Duplex UMIs

Duplex UMI adapters simultaneously tag both strands of double-stranded DNA fragments. It is then possible to identify reads resulting from amplification of each strand of the original fragment.

DRAGEN considers two collapsed read pairs to be the sequences of the two strands of the same original DNA fragment if they have the same alignment position (within a fuzzy window), complementary orientations, and UMIs that are swapped between Read 1 and Read 2. If only a single-ended UMI is present, DRAGEN compares the start and end positions of families from the two strands and computes a pairwise likelihood to determine whether they likely originated from two distinct families or should be merged as a duplex sequence. By default, DRAGEN outputs both simplex and duplex consensus sequences. To change the consensus sequence output type, use --umi-emit-multiplicity.
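For dual UMIs, the three duplex conditions can be expressed as a small Python predicate. This is a sketch under stated assumptions: the dictionary keys, the strand labels F1R2 and F2R1, and the reuse of a window of 3 are all hypothetical, and the likelihood-based handling of single-ended UMIs is not covered.

```python
def swapped(umi):
    """Swap the two halves of a dual UMI written as FIRST+SECOND."""
    first, second = umi.split("+")
    return f"{second}+{first}"

def looks_like_duplex(fam_a, fam_b, window=3):
    """Return True if two consensus families could be opposite strands of one fragment."""
    same_position = (
        abs(fam_a["start"] - fam_b["start"]) <= window
        and abs(fam_a["end"] - fam_b["end"]) <= window
    )
    # Complementary orientations: one family is F1R2 and the other is F2R1.
    complementary = {fam_a["strand"], fam_b["strand"]} == {"F1R2", "F2R1"}
    return same_position and complementary and fam_a["umi"] == swapped(fam_b["umi"])

# Hypothetical consensus families.
top = {"umi": "AAGGATG+TCGGAGA", "start": 1000, "end": 1250, "strand": "F1R2"}
bottom = {"umi": "TCGGAGA+AAGGATG", "start": 1001, "end": 1250, "strand": "F2R1"}
is_duplex = looks_like_duplex(top, bottom)
```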

Example UMI Commands

Generate consensus BAM from FASTQ

The following is an example DRAGEN command for generating a consensus BAM file from input reads with Illumina UMIs:

dragen \
-r <REF> \
-1 <FQ1> \
-2 <FQ2> \
--output-dir <OUTPUT> \
--output-file-prefix <PREFIX> \
--enable-map-align true \
--enable-sort true \
--umi-library-type nonrandom-duplex \
--umi-metrics-interval-file <valid target BED file>

Use FASTQ UMI Input

To run with a random UMI library type, change --umi-library-type to random-simplex or random-duplex.

dragen \
-r <REF> \
-1 <FQ1> \
-2 <FQ3> \
--umi-source=fastq \
--umi-fastq <FQ2> \
--output-dir <OUTPUT> \
--output-file-prefix <PREFIX> \
--enable-map-align true \
--enable-sort true \
--umi-library-type nonrandom-duplex \
--umi-metrics-interval-file <valid target BED file>

Use Customized Correction Table

dragen \
-r <REF> \
-1 <FQ1> \
-2 <FQ2> \
--umi-correction-table <valid umi correction table> \
--output-dir <OUTPUT> \
--output-file-prefix <PREFIX> \
--enable-map-align true \
--enable-sort true \
--umi-library-type nonrandom-duplex \
--umi-metrics-interval-file <valid target BED file>

UMI Outputs

Collapsed BAM

If you enable BAM output, DRAGEN generates a <output_prefix>.bam that includes all UMI consensus reads. The QNAMEs for the reads are generated based on the following convention.

consensus_read_refID1_pos1_refID2_pos2_orientation
  • refID1—The reference ID of Read 1.

  • pos1—The genomic position of Read 1.

  • refID2—The reference ID of Read 2.

  • pos2—The genomic position of Read 2.

  • orientation—The orientation of Read 1 and Read 2. Orientation can be one of the following values. Position refers to the outermost aligned position of the read and is adjusted for soft clips.

    • 1—Read 1 is forward and Read 2 is reverse. The starting position for Read 1 is less than or equal to the Read 2 end position.

    • 2—Read 1 is reverse and Read 2 is forward. The starting position for Read 2 is less than or equal to the Read 1 end position.

    • 3—Read 1 is forward and Read 2 is reverse. The starting position for Read 1 is greater than the Read 2 end position.

    • 4—Read 1 is reverse and Read 2 is forward. The starting position for Read 2 is greater than the Read 1 end position.

    • 5—Read 1 and Read 2 are forward.

    • 6—Read 1 and Read 2 are reverse.

XV and XW tags are added to consensus reads to specify the number of supporting reads in a collapsed family. The XV tag indicates the number of fragments, and the XW tag indicates the number of duplex fragments.
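Downstream scripts sometimes need to recover these fields from a consensus read name. The Python sketch below parses the convention described above. It assumes that consensus_read is a literal prefix and that both reference IDs are non-negative integers. The example read name is invented for illustration.

```python
import re

# Assumed layout: consensus_read_<refID1>_<pos1>_<refID2>_<pos2>_<orientation>
QNAME_PATTERN = re.compile(r"^consensus_read_(\d+)_(\d+)_(\d+)_(\d+)_([1-6])$")

# Strand of (Read 1, Read 2) for each orientation code listed above.
STRANDS = {
    1: ("forward", "reverse"),
    2: ("reverse", "forward"),
    3: ("forward", "reverse"),
    4: ("reverse", "forward"),
    5: ("forward", "forward"),
    6: ("reverse", "reverse"),
}

def parse_consensus_qname(qname):
    """Split a consensus read name into its fields."""
    match = QNAME_PATTERN.match(qname)
    if match is None:
        raise ValueError(f"not a consensus read name: {qname!r}")
    ref_id1, pos1, ref_id2, pos2, orientation = map(int, match.groups())
    read1_strand, read2_strand = STRANDS[orientation]
    return {
        "ref_id1": ref_id1, "pos1": pos1,
        "ref_id2": ref_id2, "pos2": pos2,
        "orientation": orientation,
        "read1_strand": read1_strand, "read2_strand": read2_strand,
    }

parsed = parse_consensus_qname("consensus_read_0_1000_0_1250_1")
```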

UMI Metrics

DRAGEN outputs an <output_prefix>.umi_metrics.csv file that describes the statistics for UMI collapsing. This file summarizes statistics on input reads, how they were grouped into families, how UMIs were corrected, and how families generated consensus reads. The following metrics can be useful when tuning the pipeline for your application:

  • Discarded families---Any families with fewer than --umi-min-supporting-reads input reads, or with a different duplex/simplex status than specified by --umi-emit-multiplicity, are discarded. The reads are logged as Reads filtered out. The families are logged as Families discarded.

  • UMI correction---Families may be combined in various ways. The number of such corrections is reported as follows.

    • Families shifted---Families whose fragment alignment coordinates differ by up to the distance specified by the umi-fuzzy-window-size parameter are merged. The default umi-fuzzy-window-size value is 3.

    • Families contextually corrected---Families with exactly the same fragment alignment coordinates and compatible UMIs are merged.

    • Duplex families---Families with close alignment coordinates and complementary UMIs are merged.

When you specify a valid path for --umi-metrics-interval-file, DRAGEN outputs a separate set of on target UMI statistics that contains only families within the specified BED file.

If you need to analyze the extent to which the observed UMIs cover the full space of possible UMI sequences, the histogram of unique UMIs per fragment position metric may be helpful. It is a zero-based histogram, where the index indicates a count of unique UMIs at a particular fragment position and the value represents the number of positions with that count.
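The histogram layout described above can be reproduced with a few lines of Python, which may help when interpreting the metric. The input format (a list of UMI and fragment position pairs) and the function name are assumptions made for this example. The sketch does not read the DRAGEN metrics file.

```python
from collections import defaultdict

def unique_umi_histogram(families):
    """Return a zero-based list where hist[n] is the number of positions with n unique UMIs."""
    umis_at_position = defaultdict(set)
    for umi, position in families:
        umis_at_position[position].add(umi)
    counts = [len(umis) for umis in umis_at_position.values()]
    hist = [0] * (max(counts) + 1 if counts else 1)
    for count in counts:
        hist[count] += 1
    return hist

# Hypothetical families: 2 UMIs at position 100, 1 at 200, 3 at 300.
families = [
    ("AAAA", 100), ("CCCC", 100),
    ("AAAA", 200),
    ("GGGG", 300), ("TTTT", 300), ("ACGT", 300),
]
hist = unique_umi_histogram(families)
```

Index 0 is always zero for observed positions, because a position only appears in the data if at least one UMI was seen there.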

The following figures and table describe available UMI metrics.

Fig1) Read pairs with duplex UMI

Fig2) UMI error correction

Fig3) UMI collapsible regions
