# DRAGEN K-mer Classifier

## Description

The metagenomics classifier uses a k-mer based classification algorithm to classify each query sequence (usually a read) against a collection of reference sequences. There are two logical steps to this process: 1) reference sequences are indexed into a searchable database; 2) the reference sequence database is searched and query sequences are classified.

The k-mer classifier supports two types of classification. In taxonomy-based classification, the query sequences are classified to a taxonomic identifier (taxid) associated with a node in a taxonomic tree. In category-based classification, the query sequences are classified to a broader category (e.g. "bacterial"). The type of classification performed depends on whether a taxonomy-based or category-based database is used. The type of database is auto-detected and does not need to be specified in the DRAGEN command.

This guide explains how to run query sequences against a pre-existing reference sequence database; [several are available for download](https://help.dragen.illumina.com/product-guides/dragen-v4.5/kmer-classifier/prebuilt-kmer-dbs). Users can also build their own [custom reference sequence database](https://help.dragen.illumina.com/product-guides/dragen-v4.5/kmer-classifier/kmer-class-db-builder).

## Command Line Settings

| Option                               | Description                                                                                                                                                                                                          |
| ------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Required Inputs                      |                                                                                                                                                                                                                      |
| `--enable-kmer-classifier`           | Enables the K-mer Classifier. (Default=false).                                                                                                                                                                       |
| `--output-file-prefix`               | Prefix for all output files.                                                                                                                                                                                         |
| `--output-directory`                 | Directory for all output files.                                                                                                                                                                                      |
| `--kmer-classifier-input-read-file`  | Input sequence file (zipped or unzipped) to the K-mer Classifier.                                                                                                                                                    |
| `--kmer-classifier-db-file`          | Database of sequences to classify against.                                                                                                                                                                           |
| Optional Inputs                      |                                                                                                                                                                                                                      |
| `--intermediate-results-dir`         | Area for temporary files. Size must be greater than size of all FASTQ files multiplied by 2.                                                                                                                         |
| `--kmer-classifier-load-db-ram`      | Load the database onto RAM. Do not use if database is on ramdisk. (Default=false).                                                                                                                                   |
| `--kmer-classifier-multiple-inputs`  | Set to true to run with multiple inputs. In this case, `--kmer-classifier-input-read-file` should point to a .tsv file that has up to three columns: Sample ID, Read1 file, (optional) Read 2 file. (Default=false). |
| `--kmer-classifier-split-fastq`      | Set to true to create a set of FASTQ files (depending on the input) with category-specific reads. Compatible only with a category binner database.                                                                   |
| `--kmer-classifier-min-window`       | The minimum number of consecutive k-mers to classify a read to a taxid or category. (Default=1).                                                                                                                     |
| `--kmer-classifier-output-read-seq`  | Set to true to add a column in the read-level output file with the read sequence. (Default=false).                                                                                                                   |
| `--kmer-classifier-output-taxid-seq` | Set to true to add a column in the read-level output file with the taxid or category assignments for each k-mer. (Default=false).                                                                                    |
| `--kmer-classifier-db-to-taxid-json` | Path to JSON file that maps database IDs to external taxids, names, and ranks.                                                                                                                                       |
| `--kmer-classifier-no-read-output`   | Set to true to not create individual read output. (Default=false).                                                                                                                                                   |
| `--kmer-classifier-no-taxid-counts`  | Set to true to not write taxid count output file. (Default=false).                                                                                                                                                   |
| `--kmer-classifier-protein-input`    | Set to true to indicate protein query sequences. To use this option, the reference sequence database MUST be of protein sequences. (Default=false).                                                                  |
| `--kmer-classifier-remove-dups`      | Deduplicate reads so that each unique sequence is analyzed once. Read counts in the output still reflect the non-deduplicated read count. Not supported for paired-end reads. (Default=false).                       |
| `--kmer-classifier-ncpus`            | Number of CPUs available for processing.                                                                                                                                                                             |

### Example Command Line

```
dragen \
  --enable-kmer-classifier=true \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-classifier-input-read-file /path/to/fastq.gz \
  --kmer-classifier-db-file /path/to/database \
  --kmer-classifier-min-window 1 \
  --kmer-classifier-ncpus=2 \
  --kmer-classifier-output-read-seq=false \
  --kmer-classifier-output-taxid-seq=false
```
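When the classifier is launched from a script, the same options can be assembled programmatically. The following is a minimal Python sketch rather than an official wrapper: `build_kmer_classifier_command` is a hypothetical helper, the paths are placeholders, and only flags listed in the table above are used. It builds the argument list without running it.

```python
import shlex


def build_kmer_classifier_command(prefix, output_dir, reads, db,
                                  multiple_inputs=False, load_db_ram=False):
    """Assemble a dragen argument list from the documented options.

    Hypothetical helper: nothing is executed here.
    """
    def flag(value):
        # Boolean options are written as true/false, as in the example command.
        return "true" if value else "false"

    return [
        "dragen",
        "--enable-kmer-classifier=true",
        "--output-file-prefix", prefix,
        "--output-directory", output_dir,
        "--kmer-classifier-input-read-file", reads,
        "--kmer-classifier-db-file", db,
        f"--kmer-classifier-multiple-inputs={flag(multiple_inputs)}",
        f"--kmer-classifier-load-db-ram={flag(load_db_ram)}",
    ]


cmd = build_kmer_classifier_command(
    prefix="sample1",
    output_dir="/path/to/output",
    reads="/path/to/fastq.gz",
    db="/path/to/database",
    load_db_ram=True,
)
# Print a shell-quoted form for inspection; pass `cmd` to subprocess.run to execute.
print(shlex.join(cmd))
```

Keeping the command as a list (instead of one string) avoids quoting problems when paths contain spaces.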

## Input Files and Options

### Input Reads

Applies to: `--kmer-classifier-input-read-file`, `--kmer-classifier-multiple-inputs`

If the analysis is for a single FASTQ read file, pass that file to `--kmer-classifier-input-read-file` and leave `--kmer-classifier-multiple-inputs=false`. Many read files can also be submitted to the k-mer classifier in one run, which minimizes the load time for a large reference sequence database. In this case, the input must be a tab-separated `.tsv` file with two or three columns and no header line: the first column is a unique ID, the second column is the path to the read file, and the optional third column is the path to the second read file for paired-end reads. The ID is used to distinguish the output files. Pass this `.tsv` file to `--kmer-classifier-input-read-file` and set `--kmer-classifier-multiple-inputs=true`.
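The multiple-input `.tsv` file can be generated with a few lines of Python. This is a minimal sketch under the format described above (no header, tab-separated, optional third column); `write_input_manifest`, the sample IDs, and the paths are all hypothetical.

```python
import csv
import os
import tempfile


def write_input_manifest(samples, manifest_path):
    """Write a headerless, tab-separated manifest.

    `samples` is a list of (sample_id, read1_path, read2_path_or_None).
    """
    seen = set()
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample_id, read1, read2 in samples:
            if sample_id in seen:
                # IDs distinguish the output files, so they must be unique.
                raise ValueError(f"duplicate sample ID: {sample_id}")
            seen.add(sample_id)
            row = [sample_id, read1]
            if read2:  # third column only for paired-end samples
                row.append(read2)
            writer.writerow(row)


samples = [
    ("sampleA", "/path/to/sampleA_R1.fastq.gz", "/path/to/sampleA_R2.fastq.gz"),
    ("sampleB", "/path/to/sampleB.fastq.gz", None),
]
manifest_path = os.path.join(tempfile.mkdtemp(), "samples.tsv")
write_input_manifest(samples, manifest_path)
```

The resulting file is then passed to `--kmer-classifier-input-read-file` together with `--kmer-classifier-multiple-inputs=true`.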

When paired-end samples are analyzed, each read pair is counted as one read in the output files. The k-mer content of both R1 and R2 is considered in order to classify the read.

Read deduplication can be enabled for single-end samples with `--kmer-classifier-remove-dups=true`. Each unique sequence is then classified just once, which may increase the speed of classification. The read counts in the output files still reflect the non-deduplicated read count. Read deduplication is not supported for paired-end reads: if `--kmer-classifier-remove-dups` is set to `true`, deduplication is automatically skipped for any paired-end samples.
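The effect of deduplication on counts can be illustrated with a small conceptual sketch. This is not DRAGEN's implementation: `classify_with_dedup` and `toy_classifier` are hypothetical stand-ins that only show how classifying each distinct sequence once can still yield counts over all original reads.

```python
from collections import Counter


def classify_with_dedup(read_sequences, classify):
    """Classify each distinct sequence once and report full read counts.

    `classify` is any function mapping a sequence to a label; it stands in
    for the real classifier, whose internals are not shown here.
    """
    multiplicity = Counter(read_sequences)  # distinct sequence -> number of reads
    counts = Counter()
    for sequence, n_reads in multiplicity.items():
        label = classify(sequence)  # called once per distinct sequence
        counts[label] += n_reads    # but every original read is counted
    return counts


calls = []


def toy_classifier(sequence):
    # Toy stand-in: label by first base, and record each invocation.
    calls.append(sequence)
    return "A-start" if sequence.startswith("A") else "other"


reads = ["ACGT", "ACGT", "ACGT", "TTGA", "AAAA"]
counts = classify_with_dedup(reads, toy_classifier)
```

Here the toy classifier runs three times for five reads, while the counts still sum to five.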

### Reference Sequences

Applies to: `--kmer-classifier-db-file`, `--kmer-classifier-db-to-taxid-json`, `--kmer-classifier-load-db-ram`

The file of reference sequences (the "database") can be quite large. If the database file is stored on a normal file system, setting `--kmer-classifier-load-db-ram=true` is recommended. This tells the k-mer classifier to load the database file into memory for faster analysis. The database file can also be stored on a RAM disk, which reduces load time across many k-mer classifier runs. In this case, set `--kmer-classifier-load-db-ram=false`.

### Taxid JSON Mapping File

Applies to: `--kmer-classifier-db-to-taxid-json`

This input file is downloaded alongside the reference sequence database. It maps each internal identifier in the database to an external source, such as the NCBI taxonomy. The JSON file is a dictionary whose keys are internal identifiers and whose values give the external taxid, name, and rank. Example:

```
 {
   "2": {"taxid": 2, "name": "bacteria", "rank": "kingdom"},
   "3": {"taxid": 2697049, "name": "SARS-CoV-2", "rank": "subspecies"},
   "4": {"taxid": 5052, "name": "Aspergillus", "rank": "genus"}
 }
```

The internal identifiers are used in the output files. This JSON file can be used to map the results to taxids from the NCBI taxonomy.
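As an illustration, the mapping can be loaded and queried in Python. This is a minimal sketch using the example entries above; `lookup` is a hypothetical helper, and a real mapping file would be read from the path given to `--kmer-classifier-db-to-taxid-json`.

```python
import json

# Example mapping copied from the JSON shown above.
mapping_text = """
{
  "2": {"taxid": 2, "name": "bacteria", "rank": "kingdom"},
  "3": {"taxid": 2697049, "name": "SARS-CoV-2", "rank": "subspecies"},
  "4": {"taxid": 5052, "name": "Aspergillus", "rank": "genus"}
}
"""


def lookup(mapping, internal_id):
    """Return (taxid, name, rank) for an internal ID, or None if unmapped."""
    entry = mapping.get(str(internal_id))  # JSON object keys are always strings
    if entry is None:
        return None
    return entry["taxid"], entry["name"], entry["rank"]


mapping = json.loads(mapping_text)  # for a file: json.load(handle)
print(lookup(mapping, 3))  # (2697049, 'SARS-CoV-2', 'subspecies')
```

Because the output files store internal identifiers as integers while JSON keys are strings, the lookup converts the identifier with `str()` first.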

## Output Files

### Read-level Output

Applies to: `--kmer-classifier-output-taxid-seq`, `--kmer-classifier-output-read-seq`

The main output file is a `.tsv` file with the extension `.read_classifications.tsv`. It has no header line, has tab-separated columns, and can vary in the number of columns depending on command line options. It details the results for each read.

| Column | Description                                                                                                             | Data Type                            |
| ------ | ----------------------------------------------------------------------------------------------------------------------- | ------------------------------------ |
| 1      | Read index                                                                                                              | integer                              |
| 2      | Read name                                                                                                               | string                               |
| 3      | Internal identifier (taxid or category) the read classified to                                                          | integer                              |
| 4      | Maximum number of contiguous k-mers that classified to this taxid                                                       | integer                              |
| 5      | Score assigned to the classification                                                                                    | integer                              |
| 6      | Number of k-mers that classified to this taxid                                                                          | integer                              |
| 7      | Read duplication count                                                                                                  | integer                              |
| 8      | Name associated with taxid or category, if given with `--kmer-classifier-db-to-taxid-json`                              | string                               |
| 9      | Taxonomic rank associated with taxid, if given with `--kmer-classifier-db-to-taxid-json`                                | string                               |
| 10     | Internal identifier that each k-mer classified to (is output when the `--kmer-classifier-output-taxid-seq` flag is set) | list of integers separated by commas |
| 11     | Read sequence (is output when the `--kmer-classifier-output-read-seq` flag is set)                                      | string                               |
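The read-level file can be parsed with a short script. The sketch below is a hedged example: it converts only the first seven columns and keeps any remaining columns as raw strings, because which later columns are present depends on the options used. The sample line is made up for illustration, including the use of `0` for a k-mer assigned elsewhere. `ReadResult`, `parse_read_classifications`, and `longest_run` are hypothetical names.

```python
from itertools import groupby
from typing import NamedTuple


class ReadResult(NamedTuple):
    read_index: int
    read_name: str
    internal_id: int
    max_contiguous_kmers: int
    score: int
    kmer_count: int
    duplication_count: int
    extra: list  # columns 8 onward, left as strings


def parse_read_classifications(lines):
    """Yield one ReadResult per line of a headerless, tab-separated file."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        index, name, internal_id, max_run, score, n_kmers, dups = fields[:7]
        yield ReadResult(int(index), name, int(internal_id), int(max_run),
                         int(score), int(n_kmers), int(dups), fields[7:])


def longest_run(kmer_ids, target):
    """Length of the longest run of consecutive k-mers assigned to `target`."""
    runs = [len(list(group)) for key, group in groupby(kmer_ids.split(","))
            if key == str(target)]
    return max(runs, default=0)


# Made-up line: assumes the name, rank, and per-k-mer columns are all present.
sample_lines = ["0\tread_1\t3\t3\t5\t5\t1\tSARS-CoV-2\tsubspecies\t3,3,0,3,3,3,4\n"]
result = next(parse_read_classifications(sample_lines))
kmer_ids = result.extra[-1]  # last column here is the per-k-mer assignment list
run = longest_run(kmer_ids, result.internal_id)
```

In this sample, the longest run of k-mers assigned to identifier 3 has length 3, which matches the made-up value in column 4. A real file would be read by passing an open file handle to `parse_read_classifications`.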

### Taxid/Category-Level Output

The second output file is a `.tsv` file with the extension `.classifier.taxid_kmer_counts.tsv`. It has a header line and tab-separated columns. It summarizes the results for each detected taxid. For a category-based database, it summarizes the results for each detected category or combination of categories.

| Header                       | Description                                                                                                                                                     | Data Type |
| ---------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------- |
| db\_taxid                    | Identifier for this taxid/category used internally in the database                                                                                              | integer   |
| duplicity                    | Ratio of total number of k-mers from reads assigned to this taxid/category compared to the number of distinct k-mers from reads assigned to this taxid/category | float     |
| distinct\_coverage           | Percent of k-mers in the database assigned to this taxid/category that are covered by k-mers in the reads assigned to this taxid/category                       | integer   |
| read\_count                  | Number of reads that classified to this taxid/category                                                                                                          | integer   |
| total\_kmer\_count           | Number of k-mers that classified to this taxid/category                                                                                                         | integer   |
| distinct\_kmer\_count        | Number of distinct k-mers that classified to this taxid/category                                                                                                | integer   |
| cumulative\_read\_count      | Cumulative number of reads assigned to this taxid and its taxonomic descendants                                                                                 | integer   |
| taxid                        | Taxid                                                                                                                                                           | integer   |
| name                         | Name associated with the taxid/category, if given with `--kmer-classifier-db-to-taxid-json`                                                                     | string    |
| rank                         | Taxonomic rank of the taxid, if given with `--kmer-classifier-db-to-taxid-json`                                                                                 | string    |
| taxid\_distinct\_kmer\_count | Number of distinct k-mers assigned to this taxid/category from the reference sequences                                                                          | string    |
| probability\_present         | Not in use                                                                                                                                                      | float     |

### Category Summary Output

If a category binner database is used, an output file will be generated to summarize the composition of the sample. [See more details here.](https://help.dragen.illumina.com/product-guides/dragen-v4.5/microbial-binner#category-summary-output)

### Category-Specific FASTQs

If a category binner database is used, and `--kmer-classifier-split-fastq` is `true`, a set of category-specific FASTQs will be generated. [See more details here.](https://help.dragen.illumina.com/product-guides/dragen-v4.5/microbial-binner#fastqs)
