Prepare a Reference Genome

Before a reference genome can be used with DRAGEN, it must be converted from FASTA format into a custom binary format for use with the DRAGEN hardware. The options used in this preprocessing step offer tradeoffs between performance and mapping quality.

Pre-built DRAGEN reference genomes are available for download in the Illumina customer portal. If you find that performance and mapping quality with these are adequate, there is a good chance that you can simply work with these supplied reference genomes. Depending on your read lengths and other particular aspects of your application, you may be able to improve mapping quality and/or performance by tuning the reference preprocessing options.

Hash Table Background

The DRAGEN mapper extracts many overlapping seeds (subsequences or K-mers) from each read, and looks up those seeds in a hash table residing in memory on its PCIe card, to identify locations in the reference genome where the seeds match. Hash tables are ideal for extremely fast lookups of exact matches. The DRAGEN hash table must be constructed from a chosen reference genome using the --build-hash-table option, which extracts many overlapping seeds from the reference genome, populates them into records in the hash table, and saves the hash table as a binary file.

Automatic Reference Detection

DRAGEN will attempt to detect the provided reference in order to automatically apply recommended resources and settings. There are four human references that DRAGEN can detect: hg38, hg19, hs37d5, and chm13v2. DRAGEN is able to detect references that contain a subset of the primary contigs from one of these references, as long as the names and lengths of the detected contigs are consistent with the names and lengths from the standarad assemblies of these references.

In detail, automatic reference detection operates as follows:

We define a primary contig of a human genome to be an autosome (1-22) or sex chromosome (X,Y). Let F be the input fasta. For each reference genome R in hg38, hg19, hs37d5, and chm13v2, DRAGEN checks if there are any contigs in F that have the same name and length as a primary contig in R, and that there are no contigs in F that have the same name as a contig in R, but with different length. If these conditions hold for exactly one of hg38, hg19, hs37d5, and chm13v2, then that reference is detected and resources may be applied automatically.

The DRAGEN hash table builder will automatically apply decoy contigs and mask bed files to detected reference. Other pipelines may also apply automatic resources. For example variant callers may apply machine learning models and target bed files.

Naming Conventions

In order for DRAGEN to correctly detect the provided reference, it is important to use the standard naming conventions for each of the four human assemblies that DRAGEN detects:

AssemblyAutosome and Sex Chromosome Names

hg38, hg19, chm13v2

chr1-chr22, chrX, chrY

hs37d5

1-22, X, Y

Reference Seed Interval

The size of the DRAGEN hash table is proportionate to the number of seeds populated from the reference genome. The default is to populate a seed starting at every position in the reference genome, ie, roughly 3 billion seeds from a human genome. This default requires at least 32 GB of memory on the DRAGEN PCIe board.

To operate on larger, nonhuman genomes or to reduce hash table congestion, it is possible to populate less than all reference seeds using the --ht-ref-seed-interval option to specify an average reference interval. The default interval for 100% population is --ht-ref-seed-interval 1, and 50% population is specified with --ht-ref-seed-interval 2. The population interval does not need to be an integer. For example, --ht-ref-seed-interval 1.2 indicates 83.3% population, with mostly 1-base and some 2-base intervals to achieve a 1.2 base interval on average.

Hash Table Occupancy

It is characteristic of hash tables that they are allocated a certain size, but always retain some empty records, so they are less than 100% occupied. A healthy amount of empty space is important for quick access to the DRAGEN hash table. Approximately 90% occupancy is a good upper bound. Empty space is important because records are pseudo-randomly placed in the hash table, resulting in an abnormally high number of records in some places. These congested regions can get quite large as the percentage of empty space approaches zero, and queries by the DRAGEN mapper for some seeds can become increasingly slow.

Hash Table / Seed Length

The hash table is populated with reference seeds of a single common length. This primary seed length is controlled with the --ht-seed-len option, which defaults to 21.

The longest primary seed supported is 27 bases when the table is 8 GB to 31.5 GB in size. Generally, longer seeds are better for run time performance, and shorter seeds are better for mapping quality (success rate and accuracy). A longer seed is more likely to be unique in the reference genome, facilitating fast mapping without needing to check many alternative locations. But a longer seed is also more likely to overlap a deviation from the reference (variant or sequencing error), which prevents successful mapping by an exact match of that seed (although another seed from the read may still map), and there are fewer long seed positions available in each read.

Longer seeds are more appropriate for longer reads, because there are more seed positions available to avoid deviations.

Seed Length Recommendations

Value for --ht-seed-len

Read Length

21

100 bp to 150 bp

17 to 19

shorter reads (36 bp)

27

250+ bp

Hash Table / Seed Extensions

Due to repetitive sequences, some seeds of any given length match many locations in the reference genome. DRAGEN uses a unique mechanism called seed extension to successfully map such high-frequency seeds. When the software determines that a primary seed occurs at many reference locations, it extends the seed by some number of bases at both ends, to some greater length that is more unique in the reference.

For example, a 21-base primary seed may be extended by 7 bases at each end to a 35-base extended seed. A 21-base primary seed may match 100 places in the reference. But 35-base extensions of these 100 seed positions may divide into 40 groups of 1-3 identical 35-base seeds. Iterative seed extensions are also supported, and are automatically generated when a large set of identical primary seeds contains various subsets that are best resolved by different extension lengths.

The maximum extended seed length, by default equal to the primary seed length plus 128, can be controlled with the --ht-max-ext-seed-len option. For example, for short reads, it is advisable to set the maximum extended seed shorter than the read length, because extensions longer than the whole read can never match.

It is also possible to tune how aggressively seeds are extended using the following options (advanced usage):

--ht-cost-coeff-seed-len

--ht-cost-coeff-seed-freq

--ht-cost-penalty

--ht-cost-penalty-incr

There is a tradeoff between extension length and hit frequency. Faster mapping can be achieved using longer seed extensions to reduce seed hit frequencies, or more accurate mapping can be achieved by avoiding seed extensions or keeping extensions short, while tolerating the higher hit frequencies that result. Shorter extensions can benefit mapping quality both by fitting seeds better between SNPs, and by finding more candidate mapping locations at which to score alignments. The default extension settings along with default seed frequency settings, lean aggressively toward mapping accuracy, with relatively short seed extensions and high hit frequencies.

The defaults for the seed frequency options are as follows:

OptionDefault

--ht-cost-coeff-seed-len

1

--ht-cost-coeff-seed-freq

0.5

--ht-cost-penalty

0

--ht-cost-penalty-incr

0.7

--ht-max-seed-freq

16

--ht-target-seed-freq

4

Seed Frequency Limit and Target

One primary or extended seed can match multiple places in the reference genome. All such matches are populated into the hash table, and retrieved when the DRAGEN mapper looks up a corresponding seed extracted from a read. The multiple reference positions are then considered and compared to generate aligned mapper output. However, the DRAGEN software enforces a limit on the number of matches, or frequency, of each seed, which is controlled with the --ht-max-seed-freq option. By default, the frequency limit is 16. In practice, when the software encounters a seed with higher frequency, it extends it to a sufficiently long secondary seed that the frequency of any particular extended seed pattern falls within the limit. However, if a maximum seed extension would still exceed the limit, the seed is rejected, and not populated into the hash table. Instead, a single High Frequency record is populated.

This seed frequency limit does not tend to impact DRAGEN mapping quality notably, for two reasons. First, because seeds are rejected only when extension fails, only extremely high-frequency primary seeds, typically with many thousands of matches are rejected. Such seeds are not very useful for mapping. Second, there are other seed positions to check in a given read. If another seed position is unique enough to return one or more matches, the read can still be properly mapped. However, if all seed positions were rejected as high frequency, often this means that the entire read matches similarly well in many reference positions, so even if the read were mapped it would be an arbitrary choice, with very low or zero MAPQ.

Thus, the default frequency limit of 16 for --ht-max-seed-freq works well. However, it may be decreased or increased, up to a maximum of 256. A higher frequency limit tends to marginally increase the number of reads mapped (especially for short reads), but commonly the additional mapped reads have very low or zero MAPQ. This also tends to slow down DRAGEN mapping, because correspondingly large numbers of possible mappings are occasionally considered.

In addition to a frequency limit, a target seed frequency can be specified with --ht-target-seed-freq option. This target frequency is used when extensions are generated for high frequency primary seeds. Extension lengths are chosen with a preference toward extended seed frequencies near the target. The default of 4 for --ht-target-seed-freq means that the software is biased toward generating shorter seed extensions than necessary to map seeds uniquely.

References with ALT contigs

When building a reference hash table from a fasta with ALT contigs, it may be desired to mask certain regions of high similarity, or to establish a liftover realtionships between primary and alternate contigs. The recommended approach is masking, as described in the Map-Align section. When hg19 or hg38 alt contigs are detected, the hash table builder will require a liftover file or a bed file to mask the alt contigs. If non are provided, a mask bed file from <INSTALL_PATH>/fasta_mask/ will be used automaticaly.

Masked References

DRAGEN has adopted a masked approach to handle native reference ALT contigs, where strategic regions are masked to increased accuracy. The hash table builder will build the mapper hash table as if the regions that were specified in the argument for ht-mask-bed were masked with N's. The hash table builder will only allow setting one of ht-mask-bed or ht-alt-liftover. Each line in the bed file is expected to contain a contig name, start position (0-based), and end position (1-based), seperated by a single tab or space. Lines that start with # are ignored by the hash table builder to allow commenting. Any line with a contig name that is not found in the input fasta is skipped and logged to the DRAGEN log file. Likewise, lines that describe empty intervals are skipped. If all lines are skipped this way, the hash table builder will issue an error and abort, unless the mask bed file was automatically applied (see Automatic masking). The hash table builder will always issue an error and abort if an interval described in the BED file is outside of the range of the corresponding contig in the fasta. Lines that are not skipped are written to a file called mask.bed that will be present in the hash table output directory, and whose digest will appear in hash_table.cfg. This file is used when a reference is loaded to the FPGA card to dynamically mask reference.bin.

Automatic masking

When running from a fasta for which hg38 or hg19 is detected (See Automatic Reference Detection), and no argument for ht-mask-bed or ht-alt-liftover was provided, the hash table builder will automatically apply the corresponding bed file for the detected reference from <INSTALL_PATH>/fasta_mask/. Note that the hash table builder will identify alt contigs by name. So when running from an input fasta that contains alt contig with standard names but modified base content, it is recommended to suppress automatic masking by setting ht-suppress-mask=true or by passing a custom mask bed file to ht-mask-bed.

Handling Decoy Contigs

The behavior of DRAGEN with respect to the handling of decoy contigs in the reference has changed since version 2.6.

Starting with DRAGEN 3.x, DRAGEN's hash table builder automatically detects the absence of the decoy contigs from the reference and adds it to the FASTA file, prior to building the hash table. The decoys file is found at <INSTALL_PATH>/liftover/hs\_decoys.fa. If the reference is missing the decoy contigs, then the reads which map to the decoy contigs are artificially marked as unmapped in the output BAM (because the original reference does not have the decoy contig). This results in an artificially lower mapping rate, however, the accuracy of variant calling is improved thanks to removing false positive caused by decoy reads.

Illumina recommends using this feature by default. However, you can to set the --ht-suppress-decoys option to true to suppress adding these decoys to the hash table.

The table below describes the difference in behavior between older DRAGEN versions (2.6 and earlier) and DRAGEN 3.x versions with respect to the handling of decoy contigs in the hash table builder:

DRAGEN BehaviorDRAGEN 2.6 and earlier versionsDRAGEN 3.0 and later versions

Reference does not include the decoy contigs (eg, hg19)

Decoy reads mismap elsewhere in the genome due to the lack of contigs in the reference. Artificially higher mapping rate. False positive calls in noisy regions to which the decoy contigs are mismapped.

DRAGEN automatically detects the absence of the decoy contig from the reference and adds it to the FASTA file. Artificially lower mapping rate because decoy reads which map to the decoy contigs are artificially marked as unmapped in the output BAM (because the original reference does not have the decoy contig). False positive calls are avoided thanks to adding the decoy contigs under the hood. Therefore this helps variant calling.

Reference includes the decoy contigs (eg, hs37d5)

Decoy reads map to the decoy contigs. High mapping rate. No false positive calls caused by decoy reads because decoy reads map to the right place

Decoy reads map to the decoy contigs. High mapping rate. No false positive calls caused by decoy reads because decoy reads map to the right place

Prepare a linear Reference

Usage

Use the --build-hash-table option to transform a reference FASTA into the hash table for DRAGEN mapping. It takes as input a FASTA file (multiple reference sequences being concatenated) and a preexisting output directory. Build command usage is as follows:

dragen --build-hash-table true [options] --ht-reference
<reference.fasta> --output-directory <outdir>

Input

The --ht-reference and --output-directory options are required for building a hash table. The --ht‑reference option specifies the path to the reference FASTA file, while --output-directory specifies a preexisting directory where the hash table output files are written. Illumina recommends organizing various hash table builds into different folders. As a best practice, folder names should include any nondefault parameter settings used to generate the contained hash table. The sequence names in the reference FASTA file must be unique.

Command line options

OptionRequiredDescription

--build-hash-table

Yes

Set to true

--ht-reference

Yes

Path to the reference genome FASTA file.

--ht-mask-bed

No (but recommended)

Path to the mask bed file. If not provided, the DRAGEN software automatically applies BED files for hg38 and hg19 from /opt/edico/fasta_mask.

--output-directory

Yes

Specify the directory where all related hash table files will be written

Liftover Based ALT-Aware Hash Tables

While masking is the recommended approach to dealing with ALT contigs, DRAGEN also supports a liftover based method. To enable liftover based ALT-aware mapping in DRAGEN, build the hash table with a liftover file by using the --ht-alt-liftover option. The hash table builder classifies each reference sequence as primary or alternate based on the liftover file, and packs primaries before alternates in reference.bin. SAM liftover files for hg38DH and hg19 are in the <INSTALL_PATH>/liftover folder.

Custom Liftover Files

Custom liftover files can be used in place of those provided with DRAGEN. Liftover files must be SAM format, but no SAM header is required. SEQ and QUAL fields can be omitted ('*'). Each alignment record should have an alternate haplotype reference sequence name as QNAME, indicating the RNAME and POS of its liftover alignment in a destination (normally primary assembly) reference sequence.

Reverse-complemented alignments are indicated by bit 0x10 in FLAG. Records flagged unmapped (0x4) or secondary (0x100) are ignored. The CIGAR may include hard or soft clipping, leaving parts of the ALT contig unaligned.

A single reference sequence cannot serve as both an ALT contig (appearing in QNAME) and a liftover destination (appearing in RNAME). Multiple ALT contigs can align to the same primary assembly location. Multiple alignments can also be provided for a single ALT contig (extras optionally be flagged 0x800 supplementary), such as to align one portion forward and another portion reverse-complemented. However, each base of the ALT contig only receives one liftover image, according to the first alignment record with an M CIGAR operation covering that base.

SAM records with QNAME missing from the reference genome are ignored, so that the same liftover file may be used for various reference subsets, but an error occurs if any alignment has its QNAME present but its RNAME absent.

Prepare a Multigenome Reference

DRAGEN analysis is capable of mapping on a multigenome (graph) hash table. The multigenome hash table introduces alternate graph paths to the linear reference hash table to represent more broadly the allelic diversity of the population over the whole genome or in specific regions defined in a bed file. Gain on accuracy from this methodology has been described in scientific blogs available on the Illumina Genomics Research Hub site. Mutigenome hash tables for CHM13_v2, hg38, hg19 and hs37d5 assemblies are available on the DRAGEN Software Support Site page.

See DRAGEN Graph Mapper for information on the graph mapping method.

It is possible to build a custom multigenome reference in order to:

  • customize the released multigenome hash table with custom bed files or hash table builder options. A set of bed files are available in the resource files on the DRAGEN Software Support Site page.

  • generate a population-specific-multigenome hash table from pangenome msVCF generated from the BSSH app.

  • generate a human or non-human multigenome hash table from customer-provided msVCF.

The input files required are a single multi-sample VCF file containing the set of population variants, and optionally bed files restricting graph to some region. The generated files, including hash_table.cmp and associated files in the specified output directory, can then be used as the reference hash table for the DRAGEN mapper. DRAGEN software supports the tool on human reference with files available on the DRAGEN Software Support Site page. For non-human, the user provides the required resource files.

Usage

To enable the multigenome hash table builder, example command usage is :

dragen --build-hash-table true (required) --ht-graph-msvcf-file <path to a multi-sampple VCF file (required for multigenome reference) --ht-reference <reference.fasta> (required) --ht-graph-extra-kmer-bed < graph.bed> (optional) --ht-mask-bed <mask.bed> (optional) --ht-graph-exclusion-bed <exclusion bed> (optional) --output-directory <DIR> (required) [options]

Inputs

Set of population variants, in a multi-sample VCF (msVCF)

The custom multigenome hash table builder tool uses a set of population variants provided by the user to generate a multigenome hash table. The variants must be specified in VCF format, in a single multi-sample VCF (msVCF) file containing the variants for a set of individuals. This multi-sample VCF file must have specific formatting described below.

Specific msVCF input formatting

The custom multigenome hash table builder tool only supports msVCF file input respecting the format described below:

  • msVCF compliant with 4.2 VCF format specification

  • with variants positionally sorted in the same contig order as the main FASTA reference genome provided in --ht-reference

  • records shall include diploid or haploid GT calls

  • supports multi-allelic variants merged in multi-line or separated in multiple lines

  • with the following FILTER codes, non-PASS records are ignored:

    • ##FILTER=<ID=PASS,Description="All filters passed">

  • with the following FORMAT field :

    • ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

  • for better results, we recommend variants to be left-aligned.

  • maximum number of recommended samples in the msVCF is 256. Higher number may lead to very high memory usage at hash table creation.

Note: INFO/FORMAT subfields must be defined in the header. Events with undefined subfields are ignored.

To build a high-performance custom genome it is highly recommended to use long read sequencing data. We recommend using external tools such as Whatshap (https://github.com/whatshap/whatshap) to generate phased input. DRAGEN analysis leverages the phasing information to reconstruct population haplotypes.

Reference genome

A reference genome in FASTA format must be provided. Reference genomes are available to download from the DRAGEN Software Support Site page.

Note: the reference genome provided as input must be the same as the one used to generate the input phased msVCF. If the msVCF contains variants from regions not present in the fasta file, the multigenome reference builder will stop with an error.

Exclusion bed file (optional)

This bed file is used to filter out regions of the msVCF file. Variants that fall within intervals defined in the "Graph exclusion bed" file will be ignored and not used in any part of the multigenome reference builder. The result will be the same as if the input msVCF did not contain any variants in the regions defined in the exclusion bed. The file is optional, by default every variants in the msVCF file will be used. Exclusion bed files are available to download from DRAGEN Software Support Site page.

A custom exclusion bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the exclusion bed file provided must be from the same build as the reference genome used to build the multigenome reference.

Extra kmer bed file (optional)

This file is used to define regions in the genome where extra seeds will be indexed in the hash table. By default, only seed extracted from the primary reference will be extracted and saved in the reference hash table for mapping. This option will additionally generate seeds from population variants in the defined regions. It is recommended to include the expected difficult regions in this bed file. Extra-kmer-bed files are available to download from DRAGEN Software Support Site page for the human hg38, hg19, hs37d5, and chm13 references.

An Extra-kmer-bed bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the Extra-kmer-bed file provided must be from the same build as the reference genome used to build the graph reference.

A mask bed file must be provided in order to mask certain regions of high similarity between primary and alternate contigs present in the main genome FASTA. Mask bed files are available to download from the DRAGEN Software Support Site page.

A custom mask bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the mask bed file provided must be from the same build as the reference genome used to build the graph reference.

Command line options

OptionRequiredDescription

--build-hash-table

Yes

Set to true

--ht-graph-msvcf-file

Yes

Path to the multi-sample VCF file containing population variants

--ht-reference

Yes

Path to the reference genome FASTA file.

--ht-graph-extra-kmer-bed

No

Path to the extra kmer bed file

--ht-mask-bed

No (but recommended)

Path to the mask bed file

--ht-graph-exclusion-bed

No

Path to the exclusion bed file

--output-directory

Yes

Specify the directory where all related hash table files will be written

Note: The custom graph reference hash table end to end pipeline will return an error if options --ht-alt-liftover or --ht-allow-mask-and-liftover are specified.

Output

The hash table builder generates the following outputs:

FileDescription

reference.bin

The reference sequences, encoded in 4 bits per base. Four-bit codes are used, so the size in bytes is roughly half the reference genome size. In between reference sequences, N are trimmed and padding is automatically inserted. For example, hg19 has 3,137,161,264 bases in 93 sequences. This is encoded in 1,526,285,312 bytes = 1.46 GB, where 1 GB means 1 GiB or 2^30^ bytes.

hash_table.cmp

Compressed hash table. The hash table is decompressed and used by the DRAGEN mapper to look up primary seeds with length specified by the --ht-seed-len option and extended seeds of various lengths.

hash_table.cfg

A list of parameters and attributes for the generated hash table, in a text format. This file provides key information about the reference genome and hash table.

hash_table.cfg.bin

A binary version of hash_table.cfg used to configure the DRAGEN hardware.

hash_table_stats.txt

A text file listing extensive internal statistics on the constructed hash including the hash table occupancy percentages. This table is for information purposes. It is not used by other tools.

mask.bed

Present only for masked hash tables. A tab delimeted bed file that describes the masked regions. Contains all lines from the input bed file that are not comment lines, lines that describe empty intervals, or lines with contig names that were not found in the input fasta.

Options for advanced users

Primary Seed Length

The --ht-seed-len option specifies the initial length in nucleotides of seeds from the reference genome to populate into the hash table. At run time, the mapper extracts seeds of this same length from each read, and looks for exact matches (unless seed editing is enabled) in the hash table.

The maximum primary seed length is a function of hash table size. The limit is k=27 for table sizes from 16 GB to 64 GB, covering typical sizes for whole human genome, or k=26 for sizes from 4 GB to 16 GB.

The minimum primary seed length depends mainly on the reference genome size and complexity. It needs to be long enough to resolve most reference positions uniquely. For whole human genome references, hash table construction typically fails with k < 16. The lower bound may be smaller for shorter genomes, or higher for less complex (more repetitive) genomes. The uniqueness threshold of --ht-seed-len 16 for the 3.1Gbp human genome can be understood intuitively because log4(3.1 G) ≈ 16, so it requires at least 16 choices from 4 nucleotides to distinguish 3.1 G reference positions.

Accuracy Considerations

For read mapping to succeed, at least one primary seed must match exactly (or with a single SNP when edited seeds are used). Shorter seeds are more likely to map successfully to the reference, because they are less likely to overlap variants or sequencing errors, and because more of them fit in each read. So for mapping accuracy, shorter seeds are mainly better.

However, very short seeds can sometimes reduce mapping accuracy. Very short seeds often map to multiple reference positions, and lead the mapper to consider more false mapping locations. Due to imperfect modeling of mutations and errors by Smith-Waterman alignment scoring and other heuristics, occasionally these noise matches may be reported. Run time quality filters such as --Aligner.aln_min_score can control the accuracy issues with very short seeds.

Speed Considerations

Shorter seeds tend to slow down mapping, because they map to more reference locations, resulting in more work such as Smith-Waterman alignments to determine the best result. This effect is most pronounced when primary seed length approaches the reference genome's uniqueness threshold, eg, K=16 for whole human genome.

Application Considerations

Read Length---Generally, shorter seeds are appropriate for shorter reads, and longer seeds for longer reads. Within a short read, a few mismatch positions (variants or sequencing errors) can chop the read into only short segments matching the reference, so that only a short seed can fit between the differences and match the reference exactly. For example, in a 36 bp read, just one SNP in the middle can block seeds longer than 18 bp from matching the reference. By contrast, in a 250 bp read, it takes 15 SNPs to exceed a 0.01% chance of blocking even 27 bp seeds.

Paired Ends---The use of paired end reads can make longer seeds yield good mapping accuracy. DRAGEN uses paired end information to improve mapping accuracy, including with rescue scans that search the expected reference window when only one mate has seeds mapping to a given reference region. Thus, paired end reads have essentially twice the opportunity for an exact matching seed to find their correct alignments.

Variant or Error Rate---When read differences from the reference are more frequent, shorter seeds may be required to fit between the difference positions in a given read and match the reference exactly.

Mapping Percentage Requirement---If the application requires a high percentage of reads to be mapped somewhere (even at low MAPQ), short seeds may be helpful. Some reads that do not match the reference well anywhere are more likely to map using short seeds to find partial matches to the reference.

Maximum Seed Length

The --ht-max-ext-seed-len option limits the length of extended seeds populated into the hash table. Primary seeds (length specified by --ht-seed-len) that match many reference positions can be extended to achieve more unique matching, which may be required to map seeds within the maximum hit frequency (--ht-max-seed-freq).

Given a primary seed length k, the maximum seed length can be configured between k and k+128. The default is the upper bound, k+128.

When to Limit Seed Extension

The --ht-max-ext-seed-len option is recommended for short reads, eg, less than 50 bp. In such cases, it is helpful to limit seed extension to the read length minus a small margin, such as 1-4 bp. For example, with 36 bp reads, setting --ht-max-ext-seed-len to 35 might be appropriate. This ensures that the hash table builder does not plan a seed extension longer than the read causing seed extension and mapping to fail at run time, for seeds that could have fit within the read with shorter extensions.

While seed extension can be similarly limited for longer reads, eg, setting --ht-max-ext-seed-len to 99 for 100 bp reads, there is little utility in this because seeds are extended conservatively in any event. Even with the default k+128 limit, individual seeds are only extended to the lengths required to fit under the maximum hit frequency (--ht-max-seed-freq), and at most a few bases longer to approach the target hit frequency (‑‑ht‑target-seed-freq), or to avoid taking too many incremental extension steps.

Maximum Hit Frequency

The --ht-max-seed-freq option sets a firm limit on the number of seed hits (reference genome locations) that can be populated for any primary or extended seed. If a given primary seed maps to more reference positions than this limit, it must be extended long enough that the extended seeds subdivide into smaller groups of identical seeds under the limit. If, even at the maximum extended seed length (--ht-max-ext-seed-len), a group of identical reference seeds is larger than this limit, their reference positions are not populated into the hash table. Instead, a single High Frequency record is populated.

The maximum hit frequency can be configured from 1 to 256. However, if this value is too low, hash table construction can fail because too many seed extensions are needed. The practical minimum for a whole human genome reference, other options being default, is 8.

Accuracy Considerations

Generally, a higher maximum hit frequency leads to more successful mapping. There are two reasons for this. First, a higher limit rejects fewer reference positions that cannot map under it. Second, a higher limit allows seed extensions to be shorter, improving the odds of exact seed matching without overlapping variants or sequencing errors.

However, as with very short seeds, allowing high hit counts can sometimes hurt mapping accuracy. Most of the seed hits in a large group are not to the true mapping location, and occasionally one of these noise hits may be reported due to imperfect scoring models. Also, the mapper limits the total number of reference positions it considers, and allowing very high hit counts can potentially crowd out the actual best match from consideration.

Speed Considerations

Higher maximum hit frequencies slow down read mapping, because seed mapping finds more reference locations, resulting in more work, such as Smith-Waterman alignments, to determine the best result.

Graph Reference

The DRAGEN Software enables the user to build a custom multigenome hash table from a set of population variants. The population variants are specified in a single multi-sample VCF file.

  • --ht-graph-msvcf-file: Input file containing list of population variants, in multi-sample VCF format.

This replaces the previous options that were previously used to build a graph Reference that are now deprecated.

List of deprecated options :

  • --ht-pop-alt-contigs: Population based alternate contigs FASTA.

  • --ht-pop-alt-liftover: Liftover SAM file of population alternate contigs.

  • --ht-pop-snps: Population based SNPs VCF

ALT-Contigs

The following options control building hash tables from references with ALT-contigs. See References with ALT contigs for more information.

  • --ht-mask-bed: Set a custom BED file that defines which regions to mask. If not provided, the DRAGEN software automatically applies BED files for hg38 and hg19 from <INSTALL_PATH>/fasta\_mask.

  • --ht-alt-liftover: Set a liftover file to build a liftover based ALT-aware hash table. SAM liftover files for hg38DH and hg19 are provided in <INSTALL_PATH>/liftover.

  • --ht-allow-mask-and-liftover: Allow the use of both --ht-mask-bed and --ht-alt-liftover together.

  • --ht-suppress-mask: Suppress automatic detection of the default mask bed files when building the hash table.

Decoy Contigs

  • --ht-decoys The DRAGEN software automatically detects the use of hg19 and hg38 references and adds decoys to the hash table when they are not found in the FASTA file. Use the --ht-decoys option to specify the path to a decoys file. The default is <INSTALL_PATH>/liftover/hs\_decoys.fa.

  • --ht-suppress-decoys: Suppress automatic detection of the default decoys file when building the hash table.

Processing Options

  • --ht-num-threads The --ht-num-threads option determines the maximum number of worker CPU threads that are used to speed up hash table construction. The default for this option is 8, with a maximum of 32 threads allowed. If your server supports execution of more threads, it is recommended that you use the maximum. For example, the DRAGEN servers contain 24 cores that have hyperthreading enabled, so a value of 32 should be used. When using a higher value, adjust --ht-max-table-chunks needs to be adjusted as well. The servers have 128 GB of memory available.

  • --ht-max-table-chunks The --ht-max-table-chunks option controls the memory footprint during hash table construction by limiting the number of ~1 GB hash table chunks that reside in memory simultaneously. Each additional chunk consumes roughly twice its size (~2 GB) in system memory during construction. The hash table is divided into power-of-two independent chunks, of a fixed chunk size, X, which depends on the hash table size, in the range 0.5 GB < X ≤ 1 GB. For example, a 24 GB hash table contains 32 independent 0.75 GB chunks that can be constructed by parallel threads with enough memory and a 16 GB hash table contains 16 independent 1 GB chunks. The default is --ht-max-table-chunks equal to --ht-num-threads, but with a minimum default --ht-max-table-chunks of 8. It makes sense to have these two options match, because building one hash table chunk requires one chunk space in memory and one thread to work on it. Nevertheless, there are build-speed advantages to raising --ht-max-table-chunks higher than --ht-num-threads, or to raising --ht-num-threads higher than --ht-max-table-chunks.

Size Options

  • --ht-mem-limit Memory Limit. The --ht-mem-limit option controls the generated hash table size by specifying the DRAGEN card memory available for both the hash table and the encoded reference genome. The ‑‑ht‑mem-limit option defaults to 32 GB when the reference genome approaches WHG size, or to a generous size for smaller references. Normally there is little reason to override these defaults.

  • --ht-size Hash Table Size. This option specifies the hash table size to generate, rather than calculating an appropriate table size from the reference genome size and the available memory (option --ht-mem-limit). Using default table sizing is recommended and using --ht-mem-limit is the next best choice.

Seed Population Options

  • --ht-ref-seed-interval Seed Interval. The --ht-ref-seed-interval option defines the step size between positions of seeds in the reference genome populated into the hash table. An interval of 1 (default) means that every seed position is populated, 2 means 50% of positions are populated, etc. Noninteger values are supported, eg, 2.5 yields 40% populated. Seeds from a whole human reference are easily 100% populated with 32 GB memory on DRAGEN boards. If a substantially larger reference genome is used, change this option.

  • --ht-soft-seed-freq-cap and --ht-max-dec-factor Soft Frequency Cap and Maximum Decimation Factor for Seed Thinning. Seed thinning is an experimental technique to improve mapping performance in high-frequency regions. When primary seeds have higher frequency than the cap indicated by the --ht-soft-seed-freq-cap option, only a fraction of seed positions are populated to stay under the cap. The --ht-max-dec-factor option specifies a maximum factor by which seeds can be thinned. For example, --ht-max-dec-factor 3 retains at least 1/3 of the original seeds. --ht-max-dec-factor 1 disables any thinning. Seeds are decimated in careful patterns to prevent leaving any long gaps unpopulated. The idea is that seed thinning can achieve mapped seed coverage in high frequency reference regions where the maximum hit frequency would otherwise have been exceeded. Seed thinning can also keep seed extensions shorter, which is also good for successful mapping. Based on testing to date, seed thinning has not proven to be superior to other accuracy optimization methods.

  • --ht-rand-hit-hifreq and --ht-rand-hit-extend Random Sample Hit with HIFREQ Record and EXTEND Record. Whenever a HIFREQ or EXTEND record is populated into the hash table, it stands in place of a large set of reference hits for a certain seed. Optionally, the hash table builder can choose a random representative of that set, and populate that HIT record alongside the HIFREQ or EXTEND record. Random sample hits provide alternative alignments that are very useful in estimating MAPQ accurately for the alignments that are reported. They are never used outside of this context for reporting alignment positions, because that would result in biased coverage of locations that happened to be selected during hash table construction. To include a sample hit, set --ht-rand-hit-hifreq to 1. The --ht-rand-hit-extend option is a minimum pre-extension hit count to include a sample hit, or zero to disable. Modifying these options is not recommended.

Seed Extension Control

DRAGEN seed extension is dynamic, applied as needed for particular K-mers that map to too many reference locations. Seeds are incrementally extended in steps of 2--14 bases (always even) from a primary seed length to a fully extended length. The bases are appended symmetrically in each extension step, determining the next extension increment if any.

There is a potentially complex seed extension tree associated with each high frequency primary seed. Each full tree is generated during hash table construction and a path from the root is traced by iterative extension steps during seed mapping. The hash table builder employs a dynamic programming algorithm to search the space of all possible seed extension trees for an optimal one, using a cost function that balances mapping accuracy and speed. The following options define that cost function:

  • --ht-target-seed-freq Target Hit Frequency. The --ht-target-seed-freq option defines the ideal number of hits per seed for which seed extension should aim. Higher values lead to fewer and shorter final seed extensions, because shorter seeds tend to match more reference positions.

  • --ht-cost-coeff-seed-len Cost Coefficient for Seed Length The --ht-cost-coeff-seed-len option assigns the cost component for each base by which a seed is extended. Additional bases are considered a cost because longer seeds risk overlapping variants or sequencing errors and losing their correct mappings. Higher values lead to shorter final seed extensions.

  • --ht-cost-coeff-seed-freq Cost Coefficient for Hit Frequency. The --ht-cost-coeff-seed-freq option assigns the cost component for the difference between the target hit frequency and the number of hits populated for a single seed. Higher values result primarily in high-frequency seeds being extended further to bring their frequencies down toward the target.

  • --ht-cost-penalty Cost Penalty for Seed Extension. The --ht-cost-penalty option assigns a flat cost for extending beyond the primary seed length. A higher value results in fewer seeds being extended at all. Current testing shows that zero (0) is appropriate for this parameter.

  • --ht-cost-penalty-incr Cost Increment for Extension Step. The --ht-cost-penalty-incr option assigns a recurring cost for each incremental seed extension step taken from primary to final extended seed length. More steps are considered a higher cost because extending in many small steps requires more hash table space for intermediate EXTEND records, and takes substantially more run time to execute the extensions. A higher value results in seed extension trees with fewer nodes, reaching from the root primary seed length to leaf extended seed lengths in fewer, larger steps.

Pipeline Specific Hash Tables

RNA-Seq

When building a hash table, DRAGEN configures the options for DNA analysis by default. To run RNA-Seq data, you must build an RNA-Seq hash table by setting --ht-build-rna-hashtable to true. If running RNA-Seq alignment, use the original --output-directory instead of the automatically generated subdirectory.

CNV

If using the CNV pipeline, set --enable-cnv to true. The command generates an additional Kmer hash map that is used in the CNV algorithm. Illumina recommends to always use the --enable-cnv option, so you can perform CNV calling with the same hash table used for mapping and aligning.

Methylation

To run the methylation pipeline, you must build a methylation-specific hash table. DRAGEN can build a single-pass or legacy multi-pass methylation hash table. Methylation runs using a single-pass hash table are completed faster than the legacy multipass hash tables. Single-pass hash tables are recommended for building methylation tables and running analyses.

Hash Table TypeHash Table Commands

single-pass

--ht-methylated-combined=true --ht-seed-len 27

multi-pass

--ht-methylated=true --ht-seed-len 27 --ht-max-seed-freq 16

Single-pass

The following is an example of a single-pass hash table build. The example generates a combined hash table in your reference index folder under the methyl_converted subdirectory.

dragen --build-hash-table true \ --output-directory $REFDIR \ --ht-reference $FASTA \ --ht-num-threads 40 \ --ht-methylated-combined=true \ --ht-seed-len 27

Multipass

Multi-pass methylation mapping requires building two special hash tables with reference bases converted from C to T in one table and G to A in the other table. The conversions are performed automatically when using the --ht-methylated command line option. The converted hash tables are generated in two subdirectories under the folder specified using the --output-directory command line option. The subdirectories are named CT_converted and GA_converted, corresponding with the base conversions. When using the hash tables for methylated alignment runs, make sure to refer to the --output-directory folder, not the subdirectories.

The base conversions remove a significant amount of information from the hash tables. You might need to use different hash table parameters than in a conventional hash table build. The following options are recommended for building hash tables for mammalian species.

dragen --build-hash-table=true --output-directory $REFDIR --ht-reference $FASTA --ht-max-seed-freq 16 --ht-seed-len 27 --ht-num-threads 40 --ht-methylated=true

HLA

To run the HLA caller, an HLA-specific anchored reference hash table must be built. Set --ht-build-hla-hashtable to true. The command will create a anchored_hla subdirectory inside the --output-directory. The HLA-specific reference subdirectory can be built at the same time as the primary reference construction.

An HLA resource file is packaged with DRAGEN and located at the following path after installation: <INSTALL_PATH>/resources/hla/HLA_resource.v1.fasta.gz. This file is used by default when building the HLA-specific anchored hash table. A custom file can be specified with --ht-hla-reference. See the HLA section for more information Using Custom HLA Reference Files

Last updated