Prebuilt K-mer Databases

Overview

Several prebuilt k-mer databases are available for download and use with the k-mer classifier. They fall into two categories: (1) taxonomy-based, in which each k-mer is assigned a taxonomic identifier (taxid), i.e. a node in a taxonomic tree; and (2) category-based, in which each k-mer is assigned to one or more categories, e.g. "bacterial". The k-mer classifier auto-detects the database type, so it does not need to be specified in the DRAGEN command.

All databases described below are taxonomy-based. The human and microbial database is category-based and is described on its own page.

For each database, there are two files to download. The index file contains the k-mer mapping and is specified with the --kmer-classifier-db-file option. The name map JSON file maps internal identifiers to a taxid or category and a name, and is specified with the --kmer-classifier-db-to-taxid-json option.
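As a minimal sketch of how the two files pair up with the two options, the helper below checks that both files are readable and then prints the corresponding option string. The function name kmer_db_args is hypothetical and not part of DRAGEN; only the two option names come from this page, and every other DRAGEN option a real command needs is omitted.

```shell
# Hypothetical helper (not part of DRAGEN): print the two database options
# for an index file and its name map, after checking both are readable.
kmer_db_args() {
    index_file=$1
    name_map=$2
    for f in "$index_file" "$name_map"; do
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            return 1
        fi
    done
    # The format string holds no dashes, so the option names pass through as plain arguments.
    printf '%s %s %s %s\n' \
        --kmer-classifier-db-file "$index_file" \
        --kmer-classifier-db-to-taxid-json "$name_map"
}

# Example (paths are placeholders):
# kmer_db_args /path/to/index.t6db /path/to/name_map.json
```

The printed string is meant to be appended to an existing DRAGEN command line. Paths containing spaces would need separate quoting when doing so.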

Genome Database

The Genome database includes NCBI RefSeq genomes for human, bacteria, archaea, viruses, and fungi. The December 3, 2023 NCBI taxonomy was used to build the database, and the sequences were collected in December 2023.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.refseq_genomes.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.refseq_genomes.name_map.json
File                                                 Size  md5sum
dragen-kmer-classifier.refseq_genomes.v6dh.t6db      266G  e1fb74ffe669c6001522520f016e73e4
dragen-kmer-classifier.refseq_genomes.name_map.json  11M   e164a1c3859062f10f0dab5272b90092
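
The md5sum values listed for each file can be used to confirm that a download completed intact. The following is a minimal sketch, assuming GNU coreutils md5sum is installed. The function name verify_md5 is hypothetical and only illustrates the comparison.

```shell
# Hypothetical helper: compare a file's MD5 digest with an expected value.
verify_md5() {
    file=$1
    expected=$2
    # md5sum prints "<digest>  <filename>"; keep only the digest.
    actual=$(md5sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (expected $expected, got ${actual:-nothing})" >&2
        return 1
    fi
}

# Example, using the name map checksum listed for the Genome database:
# verify_md5 dragen-kmer-classifier.refseq_genomes.name_map.json e164a1c3859062f10f0dab5272b90092
```

Hashing the larger index files reads hundreds of gigabytes from disk, so the check can take a while.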

Genome and NT Database

This database includes the contents of the Genome database plus the entire NCBI nucleotide (nt) database. The sequences from the NCBI nucleotide database were collected in July 2023, and the December 3, 2023 NCBI taxonomy was used to build the database. Two versions are available for download: a full version that requires a machine with >= 550 GB RAM, and a compressed version that trades approximately 5-10% accuracy for a smaller memory footprint and requires a machine with >= 225 GB RAM.
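
The choice between the two versions can be sketched as a simple threshold check on installed memory. The function names choose_nt_index and total_ram_gb are hypothetical. The sketch assumes Linux (/proc/meminfo) and treats the reported kB value divided by 1024 twice as an approximation of GB, so a machine near either threshold may need a manual decision.

```shell
# Hypothetical helper: pick the Genome and NT index that fits the given RAM (in GB).
choose_nt_index() {
    ram_gb=$1
    if [ "$ram_gb" -ge 550 ]; then
        echo "dragen-kmer-classifier.genomes_plus_nt.v6dh.t6db"
    elif [ "$ram_gb" -ge 225 ]; then
        echo "dragen-kmer-classifier.genomes_plus_nt.compressed.v6dh.t6db"
    else
        echo "insufficient RAM: ${ram_gb} GB (need at least 225 GB)" >&2
        return 1
    fi
}

# Total installed RAM in whole GB, read from /proc/meminfo (Linux only).
total_ram_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Example:
# choose_nt_index "$(total_ram_gb)"
```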

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.name_map.json

To download the compressed reference index file and the taxid mapping JSON:

File                                                         Size  md5sum
dragen-kmer-classifier.genomes_plus_nt.v6dh.t6db             496G  fcf59213c4cbd3193171eb4e58470feb
dragen-kmer-classifier.genomes_plus_nt.compressed.v6dh.t6db  183G  97dd8bca18f1cb0a97c1a55ef49d7640
dragen-kmer-classifier.genomes_plus_nt.name_map.json         171M  5f5a2f7ea5d20b1c5c739b23b09735d9

UniRef90 Database

This database includes all protein sequences of the UniRef90 database. The sequences were collected in March 2024, and the March 28, 2024 NCBI taxonomy was used to build the database.

To download the reference index file and the taxid mapping JSON:

File                                          Size  md5sum
dragen-kmer-classifier.u90_all.v6dh.t6db      81G   78bba8b3635241ac9adc35f101df7f46
dragen-kmer-classifier.u90_all.name_map.json  27M   8ebe7b070aa85212f8f37a2f8b901cff

16S Database

This database includes full-length bacterial 16S sequences from NCBI. The sequences were collected in April 2024, and the March 28, 2024 NCBI taxonomy was used to build the database.

To download the reference index file and the taxid mapping JSON:

File                                      Size  md5sum
dragen-kmer-classifier.16S.v6dh.t6db      59M   ed3c4cd4f19ae7e570d603e86ffb2668
dragen-kmer-classifier.16S.name_map.json  4.4M  50388f2152bb8849c1ffcf83b14e9e69
