# DRAGEN K-mer Classifier

## Description

The metagenomics classifier uses a k-mer-based classification algorithm to classify each query sequence (usually a read) against a collection of reference sequences. The process has two logical steps: (1) reference sequences are indexed into a searchable database; (2) the reference sequence database is searched and the query sequences are classified.

The k-mer classifier supports two types of classification. In taxonomy-based classification, the query sequences are classified to a taxonomic identifier (taxid) associated with a node in a taxonomic tree. In category-based classification, the query sequences are classified to a broader category (e.g. "bacterial"). The type of classification performed depends on whether a taxonomy-based or category-based database is used. The type of database is auto-detected and does not need to be specified in the DRAGEN command.

This guide explains how to run query sequences against a pre-existing reference sequence database; [several are available for download](https://help.dragen.illumina.com/product-guides/dragen-v4.5/kmer-classifier/prebuilt-kmer-dbs). Users can also build their own [custom reference sequence database](https://help.dragen.illumina.com/product-guides/dragen-v4.5/kmer-classifier/kmer-class-db-builder).

## Command Line Settings

| Option                               | Description                                                                                                                                                                                                          |
| ------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Required Inputs                      |                                                                                                                                                                                                                      |
| `--enable-kmer-classifier`           | Enables the K-mer Classifier. (Default=false).                                                                                                                                                                       |
| `--output-file-prefix`               | Prefix for all output files.                                                                                                                                                                                         |
| `--output-directory`                 | Directory for all output files.                                                                                                                                                                                      |
| `--kmer-classifier-input-read-file`  | Input sequence file (zipped or unzipped) to the K-mer Classifier.                                                                                                                                                    |
| `--kmer-classifier-db-file`          | Database of sequences to classify against.                                                                                                                                                                           |
| Optional Inputs                      |                                                                                                                                                                                                                      |
| `--intermediate-results-dir`         | Directory for temporary files. Its size must exceed twice the total size of all FASTQ files.                                                                                                                         |
| `--kmer-classifier-load-db-ram`      | Load the database onto RAM. Do not use if database is on ramdisk. (Default=false).                                                                                                                                   |
| `--kmer-classifier-multiple-inputs`  | Set to true to run with multiple inputs. In this case, `--kmer-classifier-input-read-file` should point to a .tsv file with up to three columns: Sample ID, Read 1 file, (optional) Read 2 file. (Default=false).    |
| `--kmer-classifier-split-fastq`      | Set to true to create a set of FASTQ files (depending on the input) with category-specific reads. Compatible only with a category binner database.                                                                   |
| `--kmer-classifier-min-window`       | The minimum number of consecutive k-mers to classify a read to a taxid or category. (Default=1).                                                                                                                     |
| `--kmer-classifier-output-read-seq`  | Set to true to add a column in the read-level output file with the read sequence. (Default=false).                                                                                                                   |
| `--kmer-classifier-output-taxid-seq` | Set to true to add a column in the read-level output file with the taxid or category assignments for each k-mer. (Default=false).                                                                                    |
| `--kmer-classifier-db-to-taxid-json` | Path to JSON file that maps database IDs to external taxids, names, and ranks.                                                                                                                                       |
| `--kmer-classifier-no-read-output`   | Set to true to not create individual read output. (Default=false).                                                                                                                                                   |
| `--kmer-classifier-no-taxid-counts`  | Set to true to not write taxid count output file. (Default=false).                                                                                                                                                   |
| `--kmer-classifier-protein-input`    | Set to true to indicate protein query sequences. To use this option, the reference sequence database MUST be of protein sequences. (Default=false).                                                                  |
| `--kmer-classifier-remove-dups`      | Deduplicate reads so that each unique sequence is analyzed once. Read counts in the output still reflect the non-deduplicated read count. Not supported for paired-end reads. (Default=false).                       |
| `--kmer-classifier-ncpus`            | Number of CPUs available for processing.                                                                                                                                                                             |

### Example Command Line

```
dragen \
  --enable-kmer-classifier=true \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-classifier-input-read-file /path/to/fastq.gz \
  --kmer-classifier-db-file /path/to/database \
  --kmer-classifier-min-window 1 \
  --kmer-classifier-ncpus=2 \
  --kmer-classifier-output-read-seq=false \
  --kmer-classifier-output-taxid-seq=false
```

## Input Files and Options

### Input Reads

Applies to: `--kmer-classifier-input-read-file`, `--kmer-classifier-multiple-inputs`

If the analysis is for a single FASTQ read file, pass that file to `--kmer-classifier-input-read-file` and leave `--kmer-classifier-multiple-inputs=false`. Multiple read files can also be submitted to the k-mer classifier in one run, which minimizes the load time for a large reference sequence database. In this case, the input must be a tab-separated `.tsv` file with two or three columns and no header line: the first column is a unique ID, the second is the path to the read file, and the optional third is the path to the second read file for paired-end reads. The ID is used to distinguish the output files. Pass this `.tsv` file to `--kmer-classifier-input-read-file` and set `--kmer-classifier-multiple-inputs=true`.
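As an illustration of the multiple-input format described above, the following Python sketch writes a headerless, tab-separated sample sheet and assembles the matching DRAGEN arguments without running them. The sample IDs, file paths, and helper names (`write_sample_sheet`, `build_command`) are hypothetical; only the flags are taken from the option table. Mixing two-column and three-column rows in one sheet is an assumption of this sketch.

```python
import csv
import os
import tempfile


def write_sample_sheet(path, samples):
    """Write a headerless TSV: sample ID, Read 1 path, optional Read 2 path."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample in samples:
            row = [sample["id"], sample["read1"]]
            if sample.get("read2"):  # third column only for paired-end samples
                row.append(sample["read2"])
            writer.writerow(row)


def build_command(sheet_path, db_path, output_dir, prefix):
    """Assemble the argument list for a multiple-input run (not executed here)."""
    return [
        "dragen",
        "--enable-kmer-classifier=true",
        "--output-file-prefix", prefix,
        "--output-directory", output_dir,
        "--kmer-classifier-input-read-file", sheet_path,
        "--kmer-classifier-multiple-inputs=true",
        "--kmer-classifier-db-file", db_path,
    ]


# Hypothetical samples: one single-end, one paired-end.
samples = [
    {"id": "sampleA", "read1": "/data/sampleA_R1.fastq.gz"},
    {"id": "sampleB", "read1": "/data/sampleB_R1.fastq.gz",
     "read2": "/data/sampleB_R2.fastq.gz"},
]
sheet_path = os.path.join(tempfile.mkdtemp(), "samples.tsv")
write_sample_sheet(sheet_path, samples)
command = build_command(sheet_path, "/path/to/database", "/path/to/output", "run1")
```

The resulting list can be passed to a process launcher such as `subprocess.run` on a system where DRAGEN is installed.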

When paired-end samples are analyzed, each read pair is counted as one read in the output files. The k-mer content of both R1 and R2 is considered in order to classify the read.

Read deduplication can be enabled for single-end samples with `--kmer-classifier-remove-dups=true`. Each unique sequence is then classified only once, which may speed up classification. The read counts in the output files still reflect the non-deduplicated read count. Read deduplication is not supported for paired-end reads: if `--kmer-classifier-remove-dups` is set to `true`, deduplication is automatically disabled for any paired-end samples.
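The bookkeeping behind deduplication can be pictured with a small Python sketch: identical sequences are collapsed, each unique sequence is classified once, and its copy number is carried into the read counts. This is a conceptual illustration only, not DRAGEN's implementation; `classify` is a hypothetical stand-in for the classifier and its labels are made up.

```python
from collections import Counter


def classify(sequence):
    """Hypothetical stand-in for the classifier: returns a made-up label."""
    gc = sequence.count("G") + sequence.count("C")
    return "high_gc" if gc > len(sequence) / 2 else "low_gc"


def classify_deduplicated(reads):
    """Classify each unique sequence once; weight results by its copy number."""
    multiplicity = Counter(reads)  # unique sequence -> number of copies
    read_counts = Counter()
    for sequence, copies in multiplicity.items():
        # Adding the copy number keeps the counts non-deduplicated.
        read_counts[classify(sequence)] += copies
    return len(multiplicity), read_counts


reads = ["ACGT", "ACGT", "GGCC", "ACGT", "AATT"]
n_classified, read_counts = classify_deduplicated(reads)
```

Here three unique sequences are classified, while the per-label counts still add up to the five input reads.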

### Reference Sequences

Applies to: `--kmer-classifier-db-file`, `--kmer-classifier-db-to-taxid-json`, `--kmer-classifier-load-db-ram`

A file of reference sequences (the "database") can be quite large. If the database file is stored on a normal file system, it is recommended to set `--kmer-classifier-load-db-ram=true`, which tells the k-mer classifier to load the database file into memory for faster analysis. The database file can also be stored on a RAM disk, which reduces load time across many k-mer classifier runs. In that case, it is recommended to set `--kmer-classifier-load-db-ram=false`.

### Taxid JSON Mapping File

Applies to: `--kmer-classifier-db-to-taxid-json`

This input file is downloaded alongside the reference sequence database. It maps each internal identifier in the database to an external source, such as the NCBI taxonomy. The JSON file is a dictionary in which each key is an internal identifier and each value gives the external taxid, name, and rank. Example:

```
 {
   "2": {"taxid": 2, "name": "bacteria", "rank": "kingdom"},
   "3": {"taxid": 2697049, "name": "SARS-CoV-2", "rank": "subspecies"},
   "4": {"taxid": 5052, "name": "Aspergillus", "rank": "genus"}
 }
```

The internal identifiers are used in the output files. This JSON file can be used to map the results to taxids from the NCBI taxonomy.
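A minimal Python sketch of that lookup, using the example mapping above. The helper name `resolve` is hypothetical, and the mapping is inlined here for self-containment; in practice the text would be read from the file passed to `--kmer-classifier-db-to-taxid-json`.

```python
import json

# Inline copy of the example mapping shown above.
mapping_text = """
{
  "2": {"taxid": 2, "name": "bacteria", "rank": "kingdom"},
  "3": {"taxid": 2697049, "name": "SARS-CoV-2", "rank": "subspecies"},
  "4": {"taxid": 5052, "name": "Aspergillus", "rank": "genus"}
}
"""
db_to_taxid = json.loads(mapping_text)


def resolve(internal_id, mapping):
    """Return (taxid, name, rank) for an internal ID, or None if unmapped."""
    entry = mapping.get(str(internal_id))  # JSON object keys are always strings
    if entry is None:
        return None
    return entry["taxid"], entry["name"], entry["rank"]


sars_cov_2 = resolve(3, db_to_taxid)
unknown = resolve(99, db_to_taxid)
```

Because JSON object keys are strings, integer identifiers read from the output files are converted with `str` before the lookup.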

## Output Files

### Read-level Output

Applies to: `--kmer-classifier-output-taxid-seq`, `--kmer-classifier-output-read-seq`

The main output file is a `.tsv` file with the extension `.read_classifications.tsv`. It has no header line, has tab-separated columns, and can vary in the number of columns depending on command line options. It details the results for each read.

| Column | Description                                                                                                             | Data Type                            |
| ------ | ----------------------------------------------------------------------------------------------------------------------- | ------------------------------------ |
| 1      | Read index                                                                                                              | integer                              |
| 2      | Read name                                                                                                               | string                               |
| 3      | Internal identifier (taxid or category) the read classified to                                                          | integer                              |
| 4      | Maximum number of contiguous k-mers that classified to this taxid                                                       | integer                              |
| 5      | Score assigned to the classification                                                                                    | integer                              |
| 6      | Number of k-mers that classified to this taxid                                                                          | integer                              |
| 7      | Read duplication count                                                                                                  | integer                              |
| 8      | Name associated with taxid or category, if given with `--kmer-classifier-db-to-taxid-json`                              | string                               |
| 9      | Taxonomic rank associated with taxid, if given with `--kmer-classifier-db-to-taxid-json`                                | string                               |
| 10     | Internal identifier that each k-mer classified to (is output when the `--kmer-classifier-output-taxid-seq` flag is set) | list of integers separated by commas |
| 11     | Read sequence (is output when the `--kmer-classifier-output-read-seq` flag is set)                                      | string                               |

### Taxid/Category-Level Output

The second output file is a `.tsv` file with the extension `.classifier.taxid_kmer_counts.tsv`. It has a header line and tab-separated columns. It summarizes the results for each detected taxid. For a category-based database, it summarizes the results for each detected category or combination of categories.

| Header                       | Description                                                                                                                                                     | Data Type |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------- |
| db\_taxid                    | Identifier for this taxid/category used internally in the database                                                                                              | integer   |
| duplicity                    | Ratio of total number of k-mers from reads assigned to this taxid/category compared to the number of distinct k-mers from reads assigned to this taxid/category | float     |
| distinct\_coverage           | Percent of k-mers in the database assigned to this taxid/category that are covered by k-mers in the reads assigned to this taxid/category                       | integer   |
| read\_count                  | Number of reads that classified to this taxid/category                                                                                                          | integer   |
| total\_kmer\_count           | Number of k-mers that classified to this taxid/category                                                                                                         | integer   |
| distinct\_kmer\_count        | Number of distinct k-mers that classified to this taxid/category                                                                                                | integer   |
| cumulative\_read\_count      | Cumulative number of reads assigned to this taxid and its taxonomic descendants                                                                                 | integer   |
| taxid                        | Taxid                                                                                                                                                           | integer   |
| name                         | Name associated with the taxid/category, if given with `--kmer-classifier-db-to-taxid-json`                                                                     | string    |
| rank                         | Taxonomic rank of the taxid, if given with `--kmer-classifier-db-to-taxid-json`                                                                                 | string    |
| taxid\_distinct\_kmer\_count | Number of distinct k-mers assigned to this taxid/category from the reference sequences                                                                          | integer   |
| probability\_present         | Not in use                                                                                                                                                      | float     |

### Category Summary Output

If a category binner database is used, an output file will be generated to summarize the composition of the sample. [See more details here.](https://help.dragen.illumina.com/product-guides/dragen-v4.5/microbial-binner#category-summary-output)

### Category-Specific FASTQs

If a category binner database is used, and `--kmer-classifier-split-fastq` is `true`, a set of category-specific FASTQs will be generated. [See more details here.](https://help.dragen.illumina.com/product-guides/dragen-v4.5/microbial-binner#fastqs)


