Illumina 5-base Prep

DRAGEN’s 5-Base DNA pipeline integrates genetic and epigenetic analysis, enabling simultaneous genome and methylome profiling through specialized processing options. This comprehensive workflow supports mapping, unique molecular identifiers, methylation calling, variant calling, and copy number variation analysis tailored for 5-base data.

Summary

  • Integrated 5-base processing: Activating --methylation-conversion=illumina automatically enables multiple DRAGEN options for methylation mapping, calling, and UMI processing specific to 5-base data (if UMI processing is enabled). 

  • 5-base hash table building: Setting --ht-methylated-cg=true builds a 5-base reference hash table stored under the methyl_cg sub-directory, which DRAGEN mapper uses automatically. 

  • Mapping and alignment: The pipeline adapts mapping algorithms to account for C>T conversions due to methylation, supporting local alignment, soft-clipping, and graph genomes by default. 

  • Unique molecular identifiers: UMI processing is extended to handle methylation-induced asymmetric base pairing, with duplex consensus reads annotated with methylation status on both strands. Only nonrandom-duplex UMI libraries are compatible. 

  • Methylation calling: Methylation is identified by C>T or G>A mismatches, with variant calling deconvoluting methylation from true variants. Directional methylation protocol is required. 

  • Methylation reports: DRAGEN generates BAM files with methylation tags, genome-wide cytosine methylation reports, and optionally integrates methylation data into VCF/gVCF files, balancing completeness and file size.  

  • Quality metrics: Mapping and methylation-specific metrics are produced, including base quality and methylation rates, to assess run quality.  

  • Small variant calling support: The pipeline supports germline and somatic variant calling on 5-base data with enhanced algorithms to distinguish methylation-induced changes from variants. Some features like pedigree analysis are not currently supported but planned for future releases.  

  • Copy number variant calling: Supported for whole genome sequencing in germline and somatic contexts. Allele-specific CNV calling is supported. 

  • Structural variant calling: Supported for whole genome sequencing in germline and somatic contexts.

Overview

DNA is inherently multiomic, holding both genetic and epigenetic molecular information. Beyond the sequence of adenine (A), thymine (T), guanine (G), and cytosine (C), there are modified bases such as 5-methylcytosine (5mC) that help direct gene expression. The Illumina 5-Base DNA Prep is a single workflow that, when combined with DRAGEN algorithms, provides an integrated readout of both genome and methylome:

image

The following processing is available for 5-base data in DRAGEN and is activated by setting --methylation-conversion=illumina:

image

The --methylation-conversion=illumina batch option sets the following DRAGEN options automatically to facilitate downstream processing. These specific options will be discussed in the following sections, but it is not necessary to set them independently.

5-base Licensing

For DRAGEN OnPrem Servers and DRAGEN FPGA Cloud BYOL customers, this pipeline requires the '5base' license. At this time, the cost of the 5base license is included with the associated Assay, and the license has been automatically assigned. For those using DRAGEN OnPrem Servers, please see the OnPrem Licensing Installation Reference Section for more information. If your pre-existing DRAGEN OnPrem Server is not connected to the internet, please contact us to retrieve your 5base License.

Build a 5-Base Hash Table

When --ht-methylated-cg=true is set, the DRAGEN reference builder saves the 5-base reference information in a sub-directory of the output directory called methyl_cg. When running the DRAGEN mapper, provide the top-level directory (the parent of methyl_cg); DRAGEN auto-detects which reference sub-folder to use depending on which DRAGEN workflow the user specifies (DNA, 5-base, RNA, etc.).
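As a quick sanity check before launching a run, the directory layout described above can be verified with a few lines of Python. This is only a minimal sketch: the helper name find_5base_hash_table is hypothetical, and the only detail taken from the text is the methyl_cg sub-directory name.

```python
from pathlib import Path
from typing import Optional


def find_5base_hash_table(reference_dir: str) -> Optional[Path]:
    """Return the methyl_cg sub-directory if it exists, else None.

    reference_dir is the top-level hash table directory (the parent of
    methyl_cg), which is the path that should be passed to DRAGEN.
    """
    candidate = Path(reference_dir) / "methyl_cg"
    return candidate if candidate.is_dir() else None
```

A return value of None suggests the reference was built without --ht-methylated-cg=true, or that the methyl_cg directory itself was passed instead of its parent.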

BCL Convert

Due to the use of adapters during Illumina 5-base DNA Prep, additional trimming during BCL Convert is necessary for 5-base data to specify sequencing cycles that are not associated with sample data (first 8 cycles of non-index reads). Failure to include these settings during BCL Convert can result in reduced methylation calling accuracy. The following settings are required in either the BCLConvert_Settings or BCLConvert_Data section of SampleSheet.csv for 5-base data:

Setting            Value
TrimUMI            1
AdapterRead1       CTGTCTCTTATACACATCT
AdapterRead2       CTGTCTCTTATACACATCT
OverrideCycles*    U7N1Y#;I10;I10;U7N1Y#

*Where # is the remaining number of cycles in each read after the required trimming of non-index reads. For example, for a sequencing run that has cycles 151-10-10-151 (Read1-Index1-Index2-Read2), the correct OverrideCycles value is U7N1Y143;I10;I10;U7N1Y143.
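The arithmetic behind the # placeholder can be sketched in Python as follows. The function name override_cycles is hypothetical. The index segments are fixed at I10;I10 exactly as in the table above, because other index lengths are not covered here.

```python
def override_cycles(read1_cycles: int, read2_cycles: int) -> str:
    """Build the OverrideCycles value for a 5-base run.

    The first 8 cycles of each non-index read are not sample data
    (U7 + N1), so the Y segment is the read length minus 8.
    """
    trimmed = 8  # 7 U cycles + 1 N cycle
    for name, cycles in (("Read1", read1_cycles), ("Read2", read2_cycles)):
        if cycles <= trimmed:
            raise ValueError(
                f"{name} needs more than {trimmed} cycles, got {cycles}"
            )
    read1 = f"U7N1Y{read1_cycles - trimmed}"
    read2 = f"U7N1Y{read2_cycles - trimmed}"
    return f"{read1};I10;I10;{read2}"


print(override_cycles(151, 151))  # U7N1Y143;I10;I10;U7N1Y143
```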

5-Base Map/Align

Map/align of 5-Base data follows the approach detailed under DRAGEN DNA. 5-base data-specific algorithms have been updated to account for the possibility of C>T conversions due to methylation at both seed mapping and alignment scoring stages. Unlike with DRAGEN Methylation, local alignment, soft-clipping, and graph genomes are all supported and used by default. There is integrated support for the control genomes (unmethylated lambda, CpG methylated pUC19) included with the 5-base assay, meaning no custom reference or command line activation is required to obtain their QC readout.

Unique Molecular Identifiers

UMI processing follows the logic and options detailed under unique-molecular-identifiers. The logic used to collapse duplex UMI adapters has been extended with 5-base data-specific algorithms to allow for accurate collapsing of asymmetric base pairing due to mC>T conversion. For duplex consensus reads in the BAM, XM tags report the methylation status of cytosines on both + and – strands. 5-base data is only compatible with --umi-library-type=nonrandom-duplex. Please note that the 5-base methylation pipeline does not enable UMI processing by default. UMI processing can be enabled using the command line option --umi-enable=true.

Methylation Calling

Methylation is primarily identified by reference C>T mismatches on the + strand, or G>A mismatches on the – strand. Additionally, variant calling provides confident deconvolution of methylation and variant status, whether mixed or unique, at non-reference cytosines, extending the completeness and accuracy of the methylome for individual subjects. If any bases on read2 overlap with those on read1, they are reported in the BAM but otherwise excluded from downstream quantifications. 5-base data is only compatible with --methylation-protocol=directional.

Methylation Calling Outputs

BAM

The DRAGEN BAM file includes methylation related tags for all MAPQ>0, proper-pair mapped reads. The added tags follow Bismark conventions:

image

Control reads are marked as unmapped, and can be found in the uncollapsed .bam with the ca:Z tag. The ca:Z tag contains the following information: ca:Z:contig_name,position,strand_direction,number_of_mismatches,peOverhangClip;.
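The ca:Z layout above can be unpacked with a small parser. This is a sketch under several assumptions that the text does not confirm: that position and number_of_mismatches are integers, that a tag may hold several ";"-terminated entries, and that peOverhangClip is best kept as an opaque string. The names ControlAlignment and parse_ca_tag are hypothetical, and the tag value used in the usage note below is made up for illustration.

```python
from typing import List, NamedTuple


class ControlAlignment(NamedTuple):
    contig_name: str
    position: int              # assumed to be an integer coordinate
    strand_direction: str
    number_of_mismatches: int  # assumed to be an integer count
    pe_overhang_clip: str      # kept as text; its encoding is not specified here


def parse_ca_tag(tag: str) -> List[ControlAlignment]:
    """Split a ca:Z tag into its comma-separated fields."""
    value = tag[len("ca:Z:"):] if tag.startswith("ca:Z:") else tag
    records = []
    for entry in value.split(";"):
        if not entry:
            continue  # skip the empty piece after the trailing ';'
        contig, pos, strand, mismatches, clip = entry.split(",")
        records.append(
            ControlAlignment(contig, int(pos), strand, int(mismatches), clip)
        )
    return records
```

For example, parse_ca_tag("ca:Z:lambda,1042,+,1,0;") would yield a single record whose contig_name is "lambda".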

Cytosine Report

DRAGEN can generate a genome-wide cytosine methylation report (CX_report.txt.gz) containing the methylation status of every reference cytosine in the genome by setting --methylation-generate-cytosine-report=true.

  • If processing 5-base data without enabling the variant caller, this option is set to true automatically.

  • The option will default to false when --enable-variant-caller=true is set, as cytosine methylation is already output in the (g)VCF. See VCF methylation reporting.

To keep all cytosines from your reference in the CX_report, even if they are not included in the input sequences, set --methylation-keep-ref-cytosine=true (default: false).

  • Setting this option to true increases run time and the CX_report file size.

The cytosine report is compressed by default. To write an uncompressed report, set --methylation-compress-cx-report=false (default: true).

  • By default, DRAGEN outputs a compressed *.CX_report.txt.gz instead of a *.CX_report.txt.

Report Fields

  • The position and strand of each C in genome are given in the first three fields of the report.

  • A record with a - in the strand field is used for a G in the reference FASTA.

  • The counts of methylated and unmethylated Cs covering the positions are given in the fourth and fifth fields.

  • The C context in the reference (CG, CHG, or CHH) is given in the sixth field.

  • The trinucleotide sequence context is given in the last field (for example, CCC, CGT, CGA, and so on).

  • The cytosine report only includes records for positions that have one or more spanning alignments.

The following is an example cytosine report record:

chr2 24442367 + 18 0 CG CGC
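The seven fields can be read into a typed record with a short Python sketch like the one below. The names CytosineRecord and parse_cx_line are hypothetical. Splitting on any whitespace is an assumption made so that the example record parses as printed here; the actual delimiter in the file is not stated.

```python
from typing import NamedTuple, Optional


class CytosineRecord(NamedTuple):
    chrom: str
    position: int
    strand: str               # "-" marks a G in the reference FASTA
    count_methylated: int
    count_unmethylated: int
    context: str              # CG, CHG, or CHH
    trinucleotide: str

    @property
    def percent_methylated(self) -> Optional[float]:
        """Methylated share of all covering calls, or None without coverage."""
        total = self.count_methylated + self.count_unmethylated
        return 100.0 * self.count_methylated / total if total else None


def parse_cx_line(line: str) -> CytosineRecord:
    chrom, pos, strand, meth, unmeth, context, tri = line.split()
    return CytosineRecord(
        chrom, int(pos), strand, int(meth), int(unmeth), context, tri
    )


record = parse_cx_line("chr2 24442367 + 18 0 CG CGC")
print(record.context, record.percent_methylated)  # CG 100.0
```

For the default compressed report, the lines can be read with gzip.open(path, "rt") from the Python standard library and passed to the same function.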

VCF methylation reporting

5-base small variant calling is enabled by setting --enable-variant-caller=true and --methylation-conversion=illumina.

  • DRAGEN can integrate methylation reporting into the VCF and gVCF output files as well.

  • The 5-base pipeline supports generating both VCF and gVCF files in a single run for germline, somatic tumor-only, and somatic tumor-normal data.

  • In contrast to the CX_report, methylation reporting is provided not only for the reference allele but also at alternative alleles and produces more accurate %methylation estimates even in the presence of confounding T or A variant alleles.

For reporting in the (g)VCF files, the --methylation-report-to-vcf and --methylation-report-to-gvcf options can be set to none, cg, or c.

  • none will exclude methylation reporting.

  • cg will report methylation of cytosines in a CpG context.

  • c will report methylation at all cytosines.

When analysis is configured as described above, the default values will be set to --methylation-report-to-vcf=c and --methylation-report-to-gvcf=cg to balance considerations of complete methylome reporting with file size.
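The interaction between the variant caller and the reporting defaults can be summarized in a short Python sketch. It only models the defaults described in this section and does not invoke DRAGEN. The function name five_base_reporting_options is hypothetical, while the option names and values are the ones listed above.

```python
from typing import Dict, Optional

VALID_REPORT_MODES = ("none", "cg", "c")


def five_base_reporting_options(
    enable_variant_caller: bool,
    report_to_vcf: Optional[str] = None,
    report_to_gvcf: Optional[str] = None,
) -> Dict[str, str]:
    """Return the reporting options in effect for a 5-base run.

    Arguments left as None fall back to the documented defaults.
    """
    for name, value in (("vcf", report_to_vcf), ("gvcf", report_to_gvcf)):
        if value is not None and value not in VALID_REPORT_MODES:
            raise ValueError(
                f"--methylation-report-to-{name} must be one of {VALID_REPORT_MODES}"
            )

    options = {"--methylation-conversion": "illumina"}
    if enable_variant_caller:
        options["--enable-variant-caller"] = "true"
        # Methylation is already reported in the (g)VCF, so the CX_report
        # defaults to off when the variant caller is enabled.
        options["--methylation-generate-cytosine-report"] = "false"
        options["--methylation-report-to-vcf"] = report_to_vcf or "c"
        options["--methylation-report-to-gvcf"] = report_to_gvcf or "cg"
    else:
        # Without the variant caller, the CX_report is generated by default.
        options["--methylation-generate-cytosine-report"] = "true"
    return options
```

In this sketch the (g)VCF reporting arguments are ignored when the variant caller is disabled, since no VCF is produced in that case.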

Below are VCF header definitions of the 5-base methylation fields.

  • INFO:M5mC Marks nucleotides for which 5mC levels are reportable. The letters z, x and h indicate CG, CHG and CHH contexts, respectively. The lowercase letters z, x and h are used to report methylation of individual cytosines (C), whereas the uppercase Z marks CpG dinucleotides for which methylation reporting is aggregated across the two CpG cytosines on opposite strands. The missing value (.) is used for unreported or not applicable (A/T) nucleotides.

  • FORMAT:M5mC 5mC methylation levels of individual cytosines/CpG dinucleotides. Encoded as a VCFv4.5 Number=M field but with cardinality defined by the INFO M5mC field.

  • FORMAT:DPM5mC Total informative read depth of potentially 5mC modified bases. Encoded as a VCFv4.5 Number=M field but with cardinality defined by the INFO M5mC field.
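When inspecting these fields by hand, a small lookup for the INFO:M5mC letters can be convenient. The sketch below only restates the definitions above for a single code. It does not assume how codes are delimited within the INFO value, and the names M5MC_CODES and describe_m5mc_code are hypothetical.

```python
M5MC_CODES = {
    "z": "individual cytosine in CG context",
    "x": "individual cytosine in CHG context",
    "h": "individual cytosine in CHH context",
    "Z": "CpG dinucleotide, aggregated across both strands",
    ".": "unreported or not applicable (A/T)",
}


def describe_m5mc_code(code: str) -> str:
    """Translate one INFO:M5mC code into a human-readable description."""
    try:
        return M5MC_CODES[code]
    except KeyError:
        raise ValueError(f"unexpected M5mC code: {code!r}") from None
```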

Metric files

The quality of each methylation run can be summarized in the following metric files.

*.mapping_metrics.csv—Contains mapping-specific metrics that are generated for the alignment phase, including benchmarks like number of total reads, aligned reads, deduped reads, base quality, etc. For additional information on *.mapping_metrics.csv, please see the mapping_metrics.csv documentation.

*.m-bias.txt—Contains per-read-position methylation bias information. This file is generated when --methylation-generate-mbias-report=true is run. For cfDNA samples, it is recommended to include the --mbias-report-include-overlaps=true option due to significant overlap of R1 and R2 coordinates.

The *.m-bias.txt file contains the read position, count methylated (number of methylated cytosines at this position), count unmethylated (number of unmethylated cytosines at this position), % methylation (percent methylation at this read position), and coverage (total methylated positions + unmethylated positions). These values are provided for CpG, CHG, and CHH contexts for R1 and R2. The file can be used to diagnose 3' or 5' end biases, and to identify whether R1 and R2 are behaving differently.
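One way to look for end biases is to recompute % methylation per read position and flag positions that stray from the median. The Python sketch below is an illustrative heuristic, not part of DRAGEN: the function name mbias_outliers, the 5 percentage point threshold, and the 1-based position numbering are all assumptions, and the counts are expected to have been extracted from the report already for one context and one read.

```python
from statistics import median
from typing import List, Optional, Sequence, Tuple


def mbias_outliers(
    counts: Sequence[Tuple[int, int]],
    max_deviation: float = 5.0,
) -> List[int]:
    """Return read positions whose % methylation deviates from the median.

    counts holds (count_methylated, count_unmethylated) per read position.
    max_deviation is in percentage points (an arbitrary example value).
    Positions are numbered from 1, which is an assumption.
    """
    percents: List[Optional[float]] = []
    for methylated, unmethylated in counts:
        coverage = methylated + unmethylated
        percents.append(100.0 * methylated / coverage if coverage else None)

    covered = [p for p in percents if p is not None]
    if not covered:
        return []
    center = median(covered)
    return [
        position
        for position, p in enumerate(percents, start=1)
        if p is not None and abs(p - center) > max_deviation
    ]
```

Running the function separately on R1 and R2 counts gives a simple way to compare how the two reads behave.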

*.methyl_metrics.csv—Contains methylation-specific metrics that are generated for the methylation calling phase, including benchmarks like the total number of cytosines analyzed, count and rate of methylation in each cytosine context, strand of the best alignment, etc.

Within *.methyl_metrics.csv, metrics are provided for:

  • The sample

  • Lambda unmethylated control

  • pUC19 methylated control

The sample metrics are identical to those generated by DRAGEN methylation; the control metrics are 5-base specific.

If the sample is split across multiple lanes, the *.methyl_metrics.csv file provides metrics for the sample as a whole in the Methyl Calling and Methyl QC sections, and per read group metrics in the Methyl Calling Per RG and Methyl QC Per RG sections.

There are two columns that provide metrics. The first column provides the absolute metric; if the first column isn't a percentage, but could be described as a percentage, the second column provides the percentage.

Definitions:

  • Total Pairs: Total number of read pairs. Sum of Number of alignments with unique best hit, Pairs with no alignments under any condition, and Pairs that did not map uniquely.

  • Number of alignments with unique best hit: Number of reads that have a unique best hit to the reference genome. The percentage is these reads over total pairs.

  • Mapping efficiency (%): Percentage of pairs with a unique best alignment, or Number of alignments with unique best hit over Total Pairs.

  • Pairs with no alignments under any condition: Number of read pairs that did not align to the reference genome (MAPQ = 0). The percentage is these read pairs over total pairs.

  • Pairs that did not map uniquely: Number of read pairs that did not map uniquely to the reference genome (MAPQ = 0). The percentage is these read pairs over total pairs.

  • Pairs with unique best alignment from OT strand: Number of read pairs that aligned to the reference genome with a unique best alignment score from the original top strand. The percentage is these reads over total pairs.

  • Pairs with unique best alignment from CTOT strand: Number of read pairs that aligned to the reference genome with a unique best alignment score from the complementary to the original top strand. The percentage is these reads over total pairs. The 5-base library prep kit is directional, so 0 pairs are expected for samples.

  • Pairs with unique best alignment from CTOB strand: Number of read pairs that aligned to the reference genome with a unique best alignment score from the complementary to the original bottom strand. The percentage is these reads over total pairs. The 5-base library prep kit is directional, so 0 pairs are expected for samples; pairs are expected for the controls.

  • Pairs with unique best alignment from OB strand: Number of read pairs that aligned to the reference genome with a unique best alignment score from the original bottom strand. The percentage is these reads over total pairs.

  • Total number of C's analyzed: Total number of sequenced cytosines analyzed.

  • Methylated C's in CpG context: Total number of sequenced methylated cytosines in CpG context. The percentage is methylated cytosines in CpG context over all sequenced cytosines.

  • Methylated C's in CHG context: Total number of sequenced methylated cytosines in CHG context. The percentage is methylated cytosines in CHG context over all sequenced cytosines.

  • Methylated C's in CHH context: Total number of sequenced methylated cytosines in CHH context. The percentage is methylated cytosines in CHH context over all sequenced cytosines.

  • Methylated C's in unknown context: Total number of sequenced methylated cytosines in unknown context. The percentage is methylated cytosines in unknown context over all sequenced cytosines.

  • Unmethylated C's in CpG context: Total number of sequenced unmethylated cytosines in CpG context.

  • Unmethylated C's in CHG context: Total number of sequenced unmethylated cytosines in CHG context.

  • Unmethylated C's in CHH context: Total number of sequenced unmethylated cytosines in CHH context.

  • Unmethylated C's in unknown context: Total number of sequenced unmethylated cytosines in unknown context.

  • % of C's methylated in CpG context: Percentage of methylated cytosines in CpG context over all (methylated and unmethylated) CpG cytosines.

  • % of C's methylated in CHG context: Percentage of methylated cytosines in CHG context over all (methylated and unmethylated) CHG cytosines.

  • % of cytosines methylated in CHH context: Percentage of methylated cytosines in CHH context over all (methylated and unmethylated) CHH cytosines.

  • % of C's methylated in unknown context: Percentage of methylated cytosines in unknown context over all (methylated and unmethylated) unknown context cytosines.
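Several of the definitions above are ratios with different denominators, which is easy to overlook. The Python sketch below restates three of them. The function names are hypothetical, and the numbers in the usage note are made up for illustration rather than taken from a real run.

```python
def mapping_efficiency(unique_best_hit: int, no_alignment: int,
                       not_unique: int) -> float:
    """Mapping efficiency (%): unique best hits over Total Pairs."""
    total_pairs = unique_best_hit + no_alignment + not_unique
    if total_pairs == 0:
        raise ValueError("Total Pairs is zero")
    return 100.0 * unique_best_hit / total_pairs


def share_of_all_cytosines(methylated_in_context: int,
                           total_cs_analyzed: int) -> float:
    """Percentage shown next to 'Methylated C's in <context> context'.

    The denominator is every sequenced cytosine, regardless of context.
    """
    if total_cs_analyzed == 0:
        raise ValueError("no cytosines analyzed")
    return 100.0 * methylated_in_context / total_cs_analyzed


def percent_methylated_in_context(methylated: int, unmethylated: int) -> float:
    """'% of C's methylated in <context> context'.

    The denominator is restricted to cytosines in that same context.
    """
    total = methylated + unmethylated
    if total == 0:
        raise ValueError("no cytosines in this context")
    return 100.0 * methylated / total
```

For instance, with 900 uniquely aligned pairs, 40 unaligned pairs, and 60 non-unique pairs, mapping_efficiency(900, 40, 60) gives 90.0.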

Small Variant Calling

For a comprehensive overview of small variant calling, please see small-variant-calling. 5-base data is a supported input for:

  • Germline

  • Somatic Tumor Only WGS

  • Somatic Tumor/Normal WGS

  • Somatic Tumor Only Enrichment (including both solid and ctDNA modes)

To support accurate variant calling on 5-base data, updates were made throughout the variant caller algorithm and statistical models to account for methylation-specific C>T conversions. The methylation-induced substitutions appear on only a single DNA strand, while variant calls have evidence on both strands. This information allows estimation of methylation levels at putative variant positions, and deconvolution of this signal from DNA variant evidence. This approach provides confident and accurate variant calls without sacrificing excessive information by masking all ambiguous base changes.

Small variant calling is aided by DRAGEN ML and Personalization, which infers the likely population haplotypes per region in the input sample to improve variant calling.

CNV Calling

Copy number variant calling follows the logic outlined in cnv-calling. Not all pipelines or modules are compatible with 5-base data:

  • Germline CNV Calling (depth-based): Supported for WGS; not supported for WES

  • Germline CNV Calling ASCN: Not supported

  • Multisample Germline CNV Calling: Not supported

  • Somatic CNV Calling ASCN: Supported for WGS; not supported for WES

  • Somatic CNV Calling WES: Not supported

  • Cytogenetics Modality: Not supported

  • CNV with SV Support: Supported

SV Calling

For an overview of structural variant calling, please see structural variant calling. 5-base data is a supported input for:

  • Germline

  • Somatic Tumor Only WGS

  • Somatic Tumor/Normal WGS

  • Somatic Tumor Only Enrichment (including both solid and ctDNA modes)

Q-Score Methylation Reporting (BETA FEATURE)

Through the command line, DRAGEN supports setting a minimum Q-score required for calling methylation at a base. When this is set, read bases with a Q-score below the threshold are excluded from contributing to the XM tag, CX_report, (g)VCF, and methyl_metrics.csv. The default is 0.
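The effect of such a threshold can be illustrated with a toy Python example that masks low-quality calls in a Bismark-style XM string. This is not DRAGEN's implementation. It assumes Phred+33 encoded base qualities and the Bismark convention that "." marks a position without a methylation call, and the function names are hypothetical.

```python
from collections import Counter


def mask_low_quality_calls(xm: str, qualities: str, min_qscore: int = 0,
                           phred_offset: int = 33) -> str:
    """Replace methylation calls at bases below min_qscore with '.'."""
    if len(xm) != len(qualities):
        raise ValueError("XM string and quality string must have equal length")
    return "".join(
        call if call == "." or (ord(q) - phred_offset) >= min_qscore else "."
        for call, q in zip(xm, qualities)
    )


def count_calls(xm: str) -> Counter:
    """Count the remaining methylation calls by XM letter."""
    return Counter(c for c in xm if c != ".")


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
masked = mask_low_quality_calls("Zz.H", "I#II", min_qscore=20)
print(masked)  # Z..H
```

With the default threshold of 0, every call is kept, which matches the default behavior described above.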

Unsupported and Beta Features

Not all features within DRAGEN currently support 5-base, including the following:

Within SNV:

  • Pedigree Analysis

  • VCF Imputation

  • Multi-Region Joint Detection

  • STR Profiling

  • Targeted Callers

  • VNTR

  • CNV cytogenetics

In general:

  • MNV

  • LOH

  • DRAGEN Fragmentomics (beta feature). Specifically, DRAGEN Fragmentomics's end motif analysis is not currently methyl-aware.

  • Minimum Q-score threshold required for methylation calling (beta feature)

Processing in common with other pipelines
