# K-mer Classifier Database Builder

## Description

This guide explains how to generate your own indexed, searchable database of reference sequences to be used by the k-mer classifier.

## Command Line Settings

| Option                                         | Description                                                                                                                                                                                                                                                                  |
| ---------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Required Inputs                                |                                                                                                                                                                                                                                                                              |
| `--enable-kmer-class-db-builder`               | Enables the K-mer Classifier Database Builder. (Default=false).                                                                                                                                                                                                              |
| `--kmer-class-db-builder-input-file`           | Headerless, tab-delimited file where each line is (1) the path to a reference fasta file and (2) the associated taxid. When using --kmer-class-db-builder-taxids-as-seq-name, the second column is required but ignored.                                                     |
| `--output-file-prefix`                         | Prefix for all output files.                                                                                                                                                                                                                                                 |
| `--output-directory`                           | Directory for all output files.                                                                                                                                                                                                                                              |
| `--kmer-class-db-builder-kmer-length`          | K-mer length (Range: \[4, 31]).                                                                                                                                                                                                                                              |
| `--kmer-class-db-builder-gmer-length`          | G-mer length (must be >= kmer length. Range: \[4, 64]).                                                                                                                                                                                                                      |
| Optional Inputs                                |                                                                                                                                                                                                                                                                              |
| `--kmer-class-db-builder-tax-tree-file`        | .tri file with nodes in the taxonomic tree for a classifier database (not required if building a binner database). Headerless, tab-delimited file where each line has (1) the child node taxid and (2) the parent node taxid. Root of tree must be 1 and have a parent of 0. |
| `--kmer-class-db-builder-protein`              | Set to indicate input sequences are protein sequences. (Default=false).                                                                                                                                                                                                      |
| `--kmer-class-db-builder-taxids-to-keep`       | File with taxids to keep. If set, any k-mers with taxids not in this file will be excluded from database.                                                                                                                                                                    |
| `--kmer-class-db-builder-num-categories`       | Set to build a binner database with this number of categories. The maximum is 25 categories, and categories are assumed to run sequentially from 2^0 to 2^n. The categories take the place of taxids in the input file.                                                      |
| `--kmer-class-db-builder-save-weights`         | Set to build a classification database that saves all k-mers, taxids, and weights.                                                                                                                                                                                           |
| `--kmer-class-db-builder-kmer-cutoff`          | Excludes k-mers found in more than this number of taxids when building a database with `--kmer-class-db-builder-save-weights`. Helps speed up classification. (Default=1000).                                                                                                |
| `--kmer-class-db-builder-num-cpus`             | Option to set the number of CPUs available for processing.                                                                                                                                                                                                                   |
| `--kmer-class-db-builder-num-kmers-per-bucket` | Set to output number of k-mers in each minimizer bucket. (Default=false).                                                                                                                                                                                                    |
| `--kmer-class-db-builder-include-lowercase`    | Set to include k-mers with lowercase bases (usually repeatmasked). (Default=false).                                                                                                                                                                                          |
| `--kmer-class-db-builder-taxids-as-seq-name`   | Set to indicate that the reference fastas listed in the input file have taxids as sequence name. In this case, the second column of the input file is ignored. (Default=false).                                                                                              |
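
The builder input file and the taxonomic tree file can be sanity-checked before starting a build. The following is a minimal Python sketch and is not part of the tool: the helper names (`read_tax_tree`, `check_tax_tree`, `taxids_missing_from_tree`), the toy taxids, and the file paths are all hypothetical. Only the root rule (taxid 1 with a parent of 0) comes from the option descriptions above. The checks that every parent appears as a node, and that every taxid in the input file appears in the tree, are assumptions about what a consistent tree looks like.

```python
import csv
import io


def read_tax_tree(handle):
    """Parse headerless, tab-delimited (child taxid, parent taxid) rows."""
    parents = {}
    for row in csv.reader(handle, delimiter="\t"):
        if row:  # skip blank lines
            parents[int(row[0])] = int(row[1])
    return parents


def check_tax_tree(parents):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    # Documented rule: the root must be taxid 1 and have a parent of 0.
    if parents.get(1) != 0:
        problems.append("root taxid 1 must have parent 0")
    # Assumed rule: every non-root parent is itself a node in the file.
    for child, parent in parents.items():
        if child != 1 and parent not in parents:
            problems.append(f"taxid {child} has unknown parent {parent}")
    return problems


def taxids_missing_from_tree(handle, parents):
    """Return input-file taxids that the tree lacks (an assumed check)."""
    missing = []
    for row in csv.reader(handle, delimiter="\t"):
        if row and int(row[1]) not in parents:
            missing.append(int(row[1]))
    return missing


# Toy data for illustration only; these are not real NCBI lineages.
TREE_TEXT = "1\t0\n10\t1\n11\t10\n"
INPUT_TEXT = "/refs/genome_a.fa\t11\n/refs/genome_b.fa\t99\n"

tree = read_tax_tree(io.StringIO(TREE_TEXT))
print(check_tax_tree(tree))
print(taxids_missing_from_tree(io.StringIO(INPUT_TEXT), tree))
```

In real use, the `io.StringIO` objects would be replaced with open file handles for the two files.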

### Example Command Line

```
dragen \
  --enable-kmer-class-db-builder=true \
  --kmer-class-db-builder-input-file <builder_input.txt> \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-class-db-builder-kmer-length 31 \
  --kmer-class-db-builder-gmer-length 64 \
  --kmer-class-db-builder-num-categories 3
```
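
For a binner build like the one above, the second column of the input file holds a category instead of a taxid. The sketch below writes such a file under one reading of the option description, namely that category values are the sequential powers of two 2^0, 2^1, 2^2, and so on. That reading, the helper names, and the file paths are assumptions made for illustration, so confirm the expected category values before relying on this.

```python
def category_values(num_categories):
    """Return sequential powers of two, one per category (assumed encoding)."""
    if not 1 <= num_categories <= 25:
        raise ValueError("the builder documents a maximum of 25 categories")
    return [2 ** i for i in range(num_categories)]


def builder_input_lines(fasta_to_bin, num_categories):
    """Build headerless, tab-delimited lines: fasta path, then category value."""
    values = category_values(num_categories)
    return [f"{path}\t{values[bin_index]}" for path, bin_index in fasta_to_bin]


# Hypothetical reference files, each assigned to bin 0, 1, or 2.
refs = [
    ("/refs/host.fa", 0),
    ("/refs/bacteria.fa", 1),
    ("/refs/virus.fa", 2),
]

print("\n".join(builder_input_lines(refs, 3)))
```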

## Usage

K-mer/G-mer length considerations:

* G-mer length is the size of the window from which a minimizer is picked. The larger the window, the fewer minimizers are chosen overall, resulting in a smaller database. However, this can reduce sensitivity because fewer k-mers are saved.
* K-mer length is the size of the minimizer saved from each window, where the window size is set by the g-mer length. In general, longer k-mers give greater specificity and shorter k-mers give greater sensitivity. This rule does not hold for every application, so we recommend trying a few different g-mer and k-mer lengths to determine what works best for a given sequence type and application.
* As a general rule, we recommend starting with a g-mer length of 35 and k-mer length of 31.
* Pre-built database k-mer/g-mer length settings for reference:
  * Very large collection of NCBI RefSeq genomes and the entirety of the NCBI nucleotide (nt) database (more than 2 terabases of sequence): g-mer length of 41 and k-mer length of 31. A compressed version is built with a g-mer length of 64 and k-mer length of 31, which needs less than half the storage.
  * Subset of reference genomes from RefSeq with a focus on viral detection: g-mer length of 35 and k-mer length of 31
  * Collection of 16S sequences for bacterial identification / profiling: g-mer length of 31 and k-mer length of 31
  * UniRef90 protein sequences: g-mer length of 15 and k-mer length of 12 (these are k-mers of amino acids, not nucleotides)
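
The trade-off between g-mer length and database size can be seen in a small minimizer sketch. The code below assumes that each window of `g` bases contributes its lexicographically smallest k-mer, with the leftmost one winning ties. That ordering is an assumption chosen for illustration, as is the function name `minimizer_positions` and the toy sequence; the ordering the builder actually uses is not described here.

```python
def minimizer_positions(seq, k, g):
    """Return sorted start positions of the k-mers picked as minimizers.

    Every window of g bases contributes its smallest k-mer. "Smallest" means
    lexicographic order with leftmost tie-breaking (an assumed ordering).
    """
    if not k <= g <= len(seq):
        raise ValueError("need k <= g <= len(seq)")
    picked = set()
    for start in range(len(seq) - g + 1):
        # All k-mers that fit inside the window [start, start + g).
        window = [(seq[i:i + k], i) for i in range(start, start + g - k + 1)]
        # min() compares the k-mer first, then the position on ties.
        picked.add(min(window)[1])
    return sorted(picked)


# Toy sequence; real builds use k-mer lengths in the documented range.
SEQ = "ACGTTGCATGCCGATAGCTAAC"

for g in (4, 5, 8):
    positions = minimizer_positions(SEQ, k=4, g=g)
    print(f"g={g}: {len(positions)} k-mers saved")
```

With `g` equal to `k`, every k-mer is its own window and all of them are saved. Under this ordering, widening the window never increases the number of saved k-mers, which matches the smaller database and reduced sensitivity described above.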

Three types of databases can be built with this tool:

* Binner: each k-mer is assigned to a category/bin.
  * Must use `--kmer-class-db-builder-num-categories`.
  * Do not use `--kmer-class-db-builder-tax-tree-file`, `--kmer-class-db-builder-save-weights`, or `--kmer-class-db-builder-kmer-cutoff`.
* Classifier: each k-mer is assigned to one taxid.
  * Must define a taxonomic tree with `--kmer-class-db-builder-tax-tree-file`.
  * Do not use `--kmer-class-db-builder-num-categories`, `--kmer-class-db-builder-save-weights`, or `--kmer-class-db-builder-kmer-cutoff`.
* Classifier with weights: each k-mer is assigned to one or more taxids, and an associated weight is stored for each (the frequency of the k-mer's association with that taxid). This type uses much more memory but is more accurate.
  * Must use `--kmer-class-db-builder-save-weights` and define a taxonomic tree with `--kmer-class-db-builder-tax-tree-file`.
  * Can use `--kmer-class-db-builder-kmer-cutoff`.
  * Do not use `--kmer-class-db-builder-num-categories`.
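
The option rules for the three database types can be summarized as a small check. This is a hypothetical helper written for illustration: the function name `database_type` and its parameters are assumptions that mirror the options listed above, and it does not reflect how the tool itself validates its command line.

```python
def database_type(num_categories=None, tax_tree_file=None,
                  save_weights=False, kmer_cutoff=None):
    """Return the database type implied by the options, or raise ValueError."""
    if num_categories is not None:
        # Binner: no tree, no weights, no cutoff.
        if tax_tree_file or save_weights or kmer_cutoff is not None:
            raise ValueError(
                "binner builds must not set the tax tree file, "
                "save-weights, or kmer-cutoff options"
            )
        return "binner"
    # Both classifier types need a taxonomic tree.
    if not tax_tree_file:
        raise ValueError("classifier builds need a tax tree file")
    if save_weights:
        # The cutoff is optional here.
        return "classifier with weights"
    if kmer_cutoff is not None:
        raise ValueError("kmer-cutoff only applies together with save-weights")
    return "classifier"


print(database_type(num_categories=3))
print(database_type(tax_tree_file="tree.tri"))
print(database_type(tax_tree_file="tree.tri", save_weights=True, kmer_cutoff=500))
```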
