# Prebuilt K-mer Databases

## Overview

Several prebuilt k-mer databases are available for download and use with the k-mer classifier. They fall into two categories: (1) taxonomy-based, in which each k-mer is assigned to a taxonomic identifier (taxid), i.e. a node in a taxonomic tree; and (2) category-based, in which each k-mer is assigned to one or more categories, e.g. "bacterial". The k-mer classifier auto-detects the database type, so it does not need to be specified in the DRAGEN command.

The databases detailed below are all taxonomy-based. The human and microbial database is category-based and is [described on its own page](https://help.dragen.illumina.com/product-guides/dragen-v4.5/kmer-classifier/microbial-binner).

For each database, there are two files to download. The index file contains the k-mer mapping and is specified with the `--kmer-classifier-db-file` option. The name map JSON file maps internal identifiers to a taxid or category and a name, and is specified with the `--kmer-classifier-db-to-taxid-json` option.
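As a minimal sketch of how the two files pair up, the helper below prints both options for one database. The function name `kmer_db_options` and the directory `/staging/kmer-db` are hypothetical. The file name pattern is taken from the uncompressed index files listed on this page. The rest of the DRAGEN command line is not shown here.

```shell
# Hypothetical helper: print the two database options for a given
# download directory ($1) and database name ($2), e.g. refseq_genomes.
# Assumes the uncompressed file naming used on this page.
kmer_db_options() {
  db_dir=$1
  db_name=$2
  printf '%s %s\n' \
    "--kmer-classifier-db-file ${db_dir}/dragen-kmer-classifier.${db_name}.v6dh.t6db" \
    "--kmer-classifier-db-to-taxid-json ${db_dir}/dragen-kmer-classifier.${db_name}.name_map.json"
}

# Example: options for the genome database stored in a placeholder directory.
kmer_db_options /staging/kmer-db refseq_genomes
```

The printed option pair can then be appended to the remainder of a DRAGEN command.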

## Genome Database

The genome database includes NCBI RefSeq genomes for human, bacteria, archaea, viruses, and fungi. The sequences were collected in December 2023, and the December 3, 2023 NCBI taxonomy was used to build the database.

To download the reference index file and the taxid mapping JSON:

```
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.refseq_genomes.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.refseq_genomes.name_map.json
```

| File                                                  | Size | md5sum                           |
| ----------------------------------------------------- | ---- | -------------------------------- |
| dragen-kmer-classifier.refseq\_genomes.v6dh.t6db      | 266G | e1fb74ffe669c6001522520f016e73e4 |
| dragen-kmer-classifier.refseq\_genomes.name\_map.json | 11M  | e164a1c3859062f10f0dab5272b90092 |

## Genome and NT Database

This database includes the contents of the Genome database and all of the NCBI nucleotide (nt) database. The sequences from the NCBI nucleotide database were collected in July 2023, and the December 3, 2023 NCBI taxonomy was used to build the database. Two versions of this database are available for download: one that requires a machine with at least 550 GB of RAM, and a compressed version that trades approximately 5-10% accuracy for a smaller memory footprint and requires a machine with at least 225 GB of RAM.

To download the reference index file and the taxid mapping JSON:

```
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.name_map.json
```

To download the compressed reference index file and the taxid mapping JSON (the same name map file used with the uncompressed index):

```
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.compressed.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.name_map.json
```

| File                                                          | Size | md5sum                           |
| ------------------------------------------------------------- | ---- | -------------------------------- |
| dragen-kmer-classifier.genomes\_plus\_nt.v6dh.t6db            | 496G | fcf59213c4cbd3193171eb4e58470feb |
| dragen-kmer-classifier.genomes\_plus\_nt.compressed.v6dh.t6db | 183G | 97dd8bca18f1cb0a97c1a55ef49d7640 |
| dragen-kmer-classifier.genomes\_plus\_nt.name\_map.json       | 171M | 5f5a2f7ea5d20b1c5c739b23b09735d9 |
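The RAM requirements stated for this database can be turned into a quick pre-download check. The sketch below suggests which index variant fits a machine, based on `MemTotal` from `/proc/meminfo` (Linux only). The function name `choose_index` is hypothetical, and treating 1 GB as 1024 × 1024 kB is an assumption. `MemTotal` is usually slightly below the installed RAM, so a machine close to a threshold may need a manual decision.

```shell
# Hypothetical helper: suggest an index variant from total memory in kB.
# Thresholds follow the 550 GB and 225 GB requirements stated above,
# assuming 1 GB = 1024 * 1024 kB.
choose_index() {
  mem_kb=$1
  if [ "$mem_kb" -ge $((550 * 1024 * 1024)) ]; then
    echo "dragen-kmer-classifier.genomes_plus_nt.v6dh.t6db"
  elif [ "$mem_kb" -ge $((225 * 1024 * 1024)) ]; then
    echo "dragen-kmer-classifier.genomes_plus_nt.compressed.v6dh.t6db"
  else
    echo "insufficient RAM for either genomes_plus_nt index" >&2
    return 1
  fi
}

# Linux only: read MemTotal (reported in kB) when /proc/meminfo is readable.
if [ -r /proc/meminfo ]; then
  total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  choose_index "$total_kb" || true
fi
```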

## UniRef90 Database

This database includes all protein sequences of the UniRef90 database. The sequences were collected in March 2024, and the March 28, 2024 NCBI taxonomy was used to build the database.

To download the reference index file and the taxid mapping JSON:

```
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.u90_all.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.u90_all.name_map.json
```

| File                                           | Size | md5sum                           |
| ---------------------------------------------- | ---- | -------------------------------- |
| dragen-kmer-classifier.u90\_all.v6dh.t6db      | 81G  | 78bba8b3635241ac9adc35f101df7f46 |
| dragen-kmer-classifier.u90\_all.name\_map.json | 27M  | 8ebe7b070aa85212f8f37a2f8b901cff |

## 16S Database

This database includes full-length bacterial 16S sequences from NCBI. The sequences were collected in April 2024, and the March 28, 2024 NCBI taxonomy was used to build the database.

To download the reference index file and the taxid mapping JSON:

```
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.16S.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.16S.name_map.json
```

| File                                      | Size | md5sum                           |
| ----------------------------------------- | ---- | -------------------------------- |
| dragen-kmer-classifier.16S.v6dh.t6db      | 59M  | ed3c4cd4f19ae7e570d603e86ffb2668 |
| dragen-kmer-classifier.16S.name\_map.json | 4.4M | 50388f2152bb8849c1ffcf83b14e9e69 |
