# K-mer Classifier Database Builder

## Description

This guide explains how to generate your own indexed, searchable database of reference sequences to be used by the k-mer classifier.

## Command Line Settings

| Option                                         | Description                                                                                                                                                                                                                                                                  |
| ---------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Required Inputs                                |                                                                                                                                                                                                                                                                              |
| `--enable-kmer-class-db-builder`               | Enables the K-mer Classifier Database Builder. (Default=false).                                                                                                                                                                                                              |
| `--kmer-class-db-builder-input-file`           | Headerless, tab-delimited file in which each line contains (1) the path to a reference FASTA file and (2) the associated taxid. When `--kmer-class-db-builder-taxids-as-seq-name` is set, the second column must still be present but is ignored. |
| `--output-file-prefix`                         | Prefix for all output files.                                                                                                                                                                                                                                                 |
| `--output-directory`                           | Directory for all output files.                                                                                                                                                                                                                                              |
| `--kmer-class-db-builder-kmer-length`          | K-mer length (Range: \[4, 31]).                                                                                                                                                                                                                                              |
| `--kmer-class-db-builder-gmer-length`          | G-mer length. Must be >= the k-mer length. (Range: \[4, 64]). |
| Optional Inputs                                |                                                                                                                                                                                                                                                                              |
| `--kmer-class-db-builder-tax-tree-file`        | .tri file with the nodes of the taxonomic tree for a classifier database (not required if building a binner database). Headerless, tab-delimited file in which each line contains (1) the child node taxid and (2) the parent node taxid. The root of the tree must have taxid 1 and a parent of 0. |
| `--kmer-class-db-builder-protein`              | Set to indicate input sequences are protein sequences. (Default=false).                                                                                                                                                                                                      |
| `--kmer-class-db-builder-taxids-to-keep`       | File with taxids to keep. If set, any k-mers with taxids not in this file will be excluded from the database. |
| `--kmer-class-db-builder-num-categories`       | Set to build a binner database with this number of categories. The maximum is 25 categories. Categories are assumed to run sequentially from 2^0 to 2^n. The categories take the place of taxids in the input file. |
| `--kmer-class-db-builder-save-weights`         | Set to build a classifier database that saves all k-mers, taxids, and weights. |
| `--kmer-class-db-builder-kmer-cutoff`          | When building a database with `--kmer-class-db-builder-save-weights`, excludes k-mers found in more than this number of taxids. Helps speed up classification. (Default=1000). |
| `--kmer-class-db-builder-num-cpus`             | Number of CPUs available for processing. |
| `--kmer-class-db-builder-num-kmers-per-bucket` | Set to output the number of k-mers in each minimizer bucket. (Default=false). |
| `--kmer-class-db-builder-include-lowercase`    | Set to include k-mers with lowercase bases (usually repeat-masked). (Default=false). |
| `--kmer-class-db-builder-taxids-as-seq-name`   | Set to indicate that the reference FASTA files listed in the input file use taxids as sequence names. In this case, the second column of the input file is ignored. (Default=false). |

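As a minimal sketch of the two file formats described in the table, the following creates an input file and a taxonomic tree file and checks them with `awk`. The FASTA paths, the taxids 10 and 11, and the file names `builder_input.txt` and `taxonomy.tri` are hypothetical placeholders, and the checks are only a convenience; they are not part of DRAGEN.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# Input file: <path to reference FASTA><TAB><taxid>, one reference per line, no header.
# Paths and taxids below are hypothetical placeholders.
printf '%s\t%s\n' \
  /refs/genome_a.fasta 10 \
  /refs/genome_b.fasta 11 > "$workdir/builder_input.txt"

# Tree file: <child taxid><TAB><parent taxid>, no header.
# The root must be taxid 1 with parent 0.
printf '%s\t%s\n' \
  1 0 \
  10 1 \
  11 1 > "$workdir/taxonomy.tri"

# Every line of the input file must have exactly two tab-separated columns.
input_ok=$(awk -F '\t' 'NF != 2 { bad = 1 } END { print (bad ? "no" : "yes") }' "$workdir/builder_input.txt")

# The tree must contain the root entry "1<TAB>0".
root_ok=$(awk -F '\t' '$1 == 1 && $2 == 0 { found = 1 } END { print (found ? "yes" : "no") }' "$workdir/taxonomy.tri")

echo "input file well-formed: $input_ok"
echo "tree has root 1 -> 0: $root_ok"
```
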
### Example Command Line

```
dragen \
  --enable-kmer-class-db-builder=true \
  --kmer-class-db-builder-input-file <builder_input.txt> \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-class-db-builder-kmer-length 31 \
  --kmer-class-db-builder-gmer-length 64 \
  --kmer-class-db-builder-num-categories 3
```

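Before launching a build, the length constraints from the table (k-mer length in \[4, 31], g-mer length in \[4, 64], and g-mer length >= k-mer length) can be checked up front. This is a hedged sketch: `check_lengths` is a hypothetical helper, not a DRAGEN command.

```shell
# check_lengths K G: hypothetical pre-flight check, not part of DRAGEN.
# Prints "ok" and returns 0 when the documented constraints hold.
check_lengths() {
  k=$1
  g=$2
  if [ "$k" -lt 4 ] || [ "$k" -gt 31 ]; then
    echo "k-mer length $k is outside [4, 31]"
    return 1
  fi
  if [ "$g" -lt 4 ] || [ "$g" -gt 64 ]; then
    echo "g-mer length $g is outside [4, 64]"
    return 1
  fi
  if [ "$g" -lt "$k" ]; then
    echo "g-mer length $g is smaller than k-mer length $k"
    return 1
  fi
  echo "ok"
}

check_lengths 31 64           # the values used in the example command
check_lengths 31 20 || true   # rejected: g-mer length below k-mer length
```
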
## Usage

K-mer/G-mer length considerations:

* G-mer length refers to the size of the window from which to pick a minimizer. The larger the window, the fewer minimizers will be chosen overall, resulting in a smaller database. However, this can cause a loss of sensitivity since fewer k-mers overall will be saved.
* K-mer length refers to the size of the minimizer saved from each window, where the window size is specified by the g-mer length. In general, longer k-mers give greater specificity and shorter k-mers give greater sensitivity. This rule does not hold for every application, so we recommend trying several g-mer and k-mer lengths to determine what works best for a given sequence type and application.
* As a general rule, we recommend starting with a g-mer length of 35 and k-mer length of 31.
* Pre-built database k-mer/g-mer length settings for reference:
  * Very large collection of NCBI RefSeq genomes and the entirety of the NCBI nucleotide (nt) database (more than 2 terabases of sequence): g-mer length of 41 and k-mer length of 31. A compressed version is built with g-mer length of 64 and k-mer length of 31, which results in a database with less than half the storage requirements.
  * Subset of reference genomes from RefSeq with a focus on viral detection: g-mer length of 35 and k-mer length of 31
  * Collection of 16S sequences for bacterial identification / profiling: g-mer length of 31 and k-mer length of 31
  * UniRef90 protein sequences: g-mer length of 15, k-mer length of 12 (these are k-mers of amino acids, not nucleotides)

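To illustrate the window-size trade-off described above, the sketch below slides a window of g bases over a toy sequence and keeps the smallest k-mer in each window as its minimizer. It assumes plain lexicographic ordering and ignores reverse complements; DRAGEN's actual minimizer ordering is not described in this guide, so treat this as an illustration of the general technique only. The function name `count_minimizers` and the toy sequence are hypothetical. For this toy sequence, the larger window keeps fewer minimizers than the smaller one.

```shell
# count_minimizers SEQ K G: count the distinct minimizer positions chosen
# over all windows of length G. Illustration only, not DRAGEN's algorithm.
count_minimizers() {
  printf '%s\n' "$1" | awk -v k="$2" -v g="$3" '{
    n = length($0)
    count = 0
    for (i = 1; i + g - 1 <= n; i++) {            # window covers bases [i, i+g-1]
      best = ""; best_pos = 0
      for (j = i; j + k - 1 <= i + g - 1; j++) {  # every k-mer inside the window
        kmer = substr($0, j, k)
        if (best == "" || kmer < best) { best = kmer; best_pos = j }
      }
      # Count each chosen position once, even if several windows pick it.
      if (!(best_pos in seen)) { seen[best_pos] = 1; count++ }
    }
    print count
  }'
}

toy_seq=ACGTTGCATGCAAGTCCGTA
count_minimizers "$toy_seq" 3 5    # small window
count_minimizers "$toy_seq" 3 10   # large window
```
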
Three types of databases can be built with this tool:

* Binner: each k-mer is assigned to a category/bin.
  * Must use `--kmer-class-db-builder-num-categories`.
  * Do not use `--kmer-class-db-builder-tax-tree-file`, `--kmer-class-db-builder-save-weights`, or `--kmer-class-db-builder-kmer-cutoff`.
* Classifier: each k-mer is assigned to one taxid.
  * Must define a taxonomic tree with `--kmer-class-db-builder-tax-tree-file`.
  * Do not use `--kmer-class-db-builder-num-categories`, `--kmer-class-db-builder-save-weights`, or `--kmer-class-db-builder-kmer-cutoff`.
* Classifier with weights: each k-mer is assigned to one or more taxids, and the associated weights (the frequency of the k-mer's association with each taxid) are also stored. This type uses much more memory but is more accurate.
  * Must use `--kmer-class-db-builder-save-weights` and define a taxonomic tree with `--kmer-class-db-builder-tax-tree-file`.
  * Can use `--kmer-class-db-builder-kmer-cutoff`.
  * Do not use `--kmer-class-db-builder-num-categories`.

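The option rules for the three database types can be expressed as a small pre-flight check. This is a sketch under stated assumptions: `db_type` is a hypothetical helper, not part of DRAGEN, and it only looks at which options are present, so an option explicitly set to `false` would still count as present. The file name `taxonomy.tri`, the cutoff value 500, and the `=true` form for `--kmer-class-db-builder-save-weights` (borrowed from the enable option in the example command) are also assumptions.

```shell
# db_type ARGS...: report which database type a set of builder options selects,
# following the rules listed above. Hypothetical helper, not part of DRAGEN.
db_type() {
  cats=0; tree=0; weights=0; cutoff=0
  for arg in "$@"; do
    case "$arg" in
      --kmer-class-db-builder-num-categories*) cats=1 ;;
      --kmer-class-db-builder-tax-tree-file*)  tree=1 ;;
      --kmer-class-db-builder-save-weights*)   weights=1 ;;
      --kmer-class-db-builder-kmer-cutoff*)    cutoff=1 ;;
    esac
  done

  if [ "$cats" -eq 1 ]; then
    # Binner: none of the classifier-only options may be present.
    if [ $((tree + weights + cutoff)) -gt 0 ]; then
      echo "invalid: binner options cannot be combined with classifier options"
      return 1
    fi
    echo "binner"
  elif [ "$tree" -eq 1 ] && [ "$weights" -eq 1 ]; then
    echo "classifier with weights"   # the k-mer cutoff is optional here
  elif [ "$tree" -eq 1 ] && [ "$cutoff" -eq 0 ]; then
    echo "classifier"
  else
    echo "invalid: check the tax tree, save-weights, and kmer-cutoff options"
    return 1
  fi
}

db_type --kmer-class-db-builder-num-categories 3
db_type --kmer-class-db-builder-tax-tree-file taxonomy.tri \
        --kmer-class-db-builder-save-weights=true \
        --kmer-class-db-builder-kmer-cutoff 500
```
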
