Kmer Classifier

Description

The metagenomics classifier uses a k-mer based algorithm to classify each query sequence (usually a read) against a collection of reference sequences. The process has two logical steps: 1) reference sequences are indexed into a searchable database, and 2) the database is searched with the query sequences, and each query sequence is classified to the taxid(s) associated with the matching reference sequences. This guide explains how to run query sequences against a pre-existing reference sequence database. In DRAGEN 4.3 and later, users can also build their own custom reference sequence database.

Command Line Settings

Required Inputs

| Option | Description |
| --- | --- |
| --enable-kmer-classifier | Enables the Kmer Classifier. (Default=false). |
| --output-file-prefix | Prefix for all output files. |
| --output-directory | Directory for all output files. |
| --kmer-classifier-input-read-file | Input sequence file (zipped or unzipped) to the Kmer Classifier. |
| --kmer-classifier-db-file | Database of sequences to classify against. |

Optional Inputs

| Option | Description |
| --- | --- |
| --intermediate-results-dir | Location for temporary files. Available space must be greater than twice the combined size of all FASTQ files. |
| --kmer-classifier-load-db-ram | Load the database into RAM. Do not use if the database is on a RAM disk. (Default=false). |
| --kmer-classifier-multiple-inputs | Set to true to run with multiple inputs. The input read file is then a .tsv file with three columns: Sample ID, Read 1 file, and (optional) Read 2 file. (Default=false). |
| --kmer-classifier-min-window | The minimum number of consecutive kmers required to assign a read to a taxid. (Default=1). |
| --kmer-classifier-output-read-seq | Option to enable the read sequence column in the output file. (Default=false). |
| --kmer-classifier-output-taxid-seq | Option to enable a taxid string column in the output file. (Default=false). |
| --kmer-classifier-db-to-taxid-json | Path to JSON file that maps database IDs to external taxids, names, and ranks. |
| --kmer-classifier-no-read-output | Option to not create individual read output. (Default=false). |
| --kmer-classifier-no-taxid-counts | Option to not write the taxid count output file. (Default=false). |
| --kmer-classifier-protein-input | Option to indicate protein query sequences. To use this option, the reference sequence database MUST be of protein sequences. (Default=false). |
| --kmer-classifier-remove-dups | Deduplicate reads so that each unique sequence is analyzed once. Read counts in the output still reflect the non-deduplicated read count. Not supported for paired-end reads. (Default=false). |
| --kmer-classifier-ncpus | Option to set the number of CPUs available for processing. |

Example Command Line

dragen \
  --enable-kmer-classifier=true \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-classifier-input-read-file /path/to/fastq.gz \
  --kmer-classifier-db-file /path/to/database \
  --kmer-classifier-min-window 1 \
  --kmer-classifier-ncpus=2 \
  --kmer-classifier-output-read-seq=false \
  --kmer-classifier-output-taxid-seq=false
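
For scripted runs, the same command can be assembled programmatically. The following is a minimal sketch only: build_classifier_command is a hypothetical helper (not part of DRAGEN), the sample values are placeholders, and it uses only options documented in the table above.

```python
import shlex


def build_classifier_command(prefix, output_dir, read_file, db_file,
                             min_window=1, ncpus=2):
    """Assemble a dragen argument list from the documented options.

    Hypothetical helper for illustration; it only builds the list and
    does not run DRAGEN.
    """
    return [
        "dragen",
        "--enable-kmer-classifier=true",
        "--output-file-prefix", prefix,
        "--output-directory", output_dir,
        "--kmer-classifier-input-read-file", read_file,
        "--kmer-classifier-db-file", db_file,
        "--kmer-classifier-min-window", str(min_window),
        f"--kmer-classifier-ncpus={ncpus}",
    ]


# Placeholder values; replace with real paths before use.
cmd = build_classifier_command(
    "mySample", "output", "/path/to/fastq.gz", "/path/to/database"
)
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) on a machine where dragen is installed.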

Input Details

Input Reads

Applies to: --kmer-classifier-input-read-file, --kmer-classifier-multiple-inputs

If the analysis is for a single FASTA/FASTQ read file, then that filename is input to --kmer-classifier-input-read-file and --kmer-classifier-multiple-inputs=false. However, many read files can be submitted to the Kmer Classifier at one time, minimizing the load time for a large reference sequence database. In this case, the input file must be a .tsv (tab-separated) file with two columns (optionally 3 columns). The first column is a unique ID, the second column is the path to the read file, and the optional third column is the path to the second read file in the case of paired-end reads. The ID is used to distinguish the output files. There is no header line. This .tsv file is the input file to --kmer-classifier-input-read-file and --kmer-classifier-multiple-inputs=true.
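
The multi-sample .tsv described above can be written with a few lines of Python. This is a sketch under stated assumptions: write_sample_sheet is a hypothetical helper name, and mixing single-end and paired-end rows in one file is assumed to be acceptable based on the description above.

```python
import csv


def write_sample_sheet(path, samples):
    """Write the headerless, tab-separated sample sheet.

    samples: iterable of (sample_id, read1_path, read2_path_or_None).
    Hypothetical helper for illustration only.
    """
    seen = set()
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample_id, read1, read2 in samples:
            # The first column must be a unique ID.
            if sample_id in seen:
                raise ValueError(f"duplicate sample ID: {sample_id}")
            seen.add(sample_id)
            row = [sample_id, read1]
            if read2:  # third column only for paired-end samples
                row.append(read2)
            writer.writerow(row)
```

The resulting file would then be given to --kmer-classifier-input-read-file together with --kmer-classifier-multiple-inputs=true.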

When paired-end samples are analyzed, each read pair is counted as one read in the output files. The k-mer content of both R1 and R2 is considered in order to classify the read.

Read deduplication can be enabled for single-ended samples with --kmer-classifier-remove-dups=true. This will cause each unique sequence to be classified just once, which may increase the speed of classification. The read counts in the output files will still reflect the non-deduplicated read count. Read deduplication is not supported for paired-end reads. If --kmer-classifier-remove-dups is set to true, it will be automatically suppressed for any paired-end samples.
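
The effect of deduplication on counts can be illustrated conceptually. The snippet below is not DRAGEN's implementation; it only shows the idea that each unique sequence is processed once while the reported read count still reflects every input read.

```python
from collections import Counter


def deduplicate(sequences):
    """Return each unique sequence with its number of occurrences."""
    return Counter(sequences)


reads = ["ACGT", "ACGT", "TTGA", "ACGT"]
unique = deduplicate(reads)

# Two unique sequences would be classified,
# but the reported read count is still four.
print(len(unique), sum(unique.values()))  # prints "2 4"
```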

Reference Sequences

Applies to: --kmer-classifier-db-file, --kmer-classifier-db-to-taxid-json, --kmer-classifier-load-db-ram

A file of reference sequences (the "database") can be quite large. If the database file is stored on a normal file system, it is recommended that you set --kmer-classifier-load-db-ram=true. This will tell the Kmer Classifier to load the database file into memory for faster analysis. It is also allowable to store the database file on a RAM disk, which reduces load time over many Kmer Classifier runs. In this case, it is recommended to set --kmer-classifier-load-db-ram=false.

DB TaxID JSON Mapping File

Applies to: --kmer-classifier-db-to-taxid-json

This input file is downloaded alongside the reference sequence database. It associates each taxid internal to the classifier database with an external source, such as the NCBI taxonomy. The JSON file is a dictionary whose keys are internal taxids, each mapped to an external taxid, name, and rank. Example:

 {
   "2": {"taxid": 2, "name": "bacteria", "rank": "kingdom"},
   "3": {"taxid": 2697049, "name": "SARS-CoV-2", "rank": "subspecies"},
   "4": {"taxid": 5052, "name": "Aspergillus", "rank": "genus"}
 }

The internal taxids are used in the output files. This JSON file can be used to map the results to taxids from the NCBI taxonomy.
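
As a minimal sketch of that mapping, the JSON can be loaded and queried as follows. The function names load_taxid_map and external_taxid are hypothetical; the only assumption about the file is the structure shown in the example above.

```python
import json


def load_taxid_map(path):
    """Load the mapping JSON and key it by integer internal taxid."""
    with open(path) as handle:
        raw = json.load(handle)
    # JSON keys are strings; output files report taxids as integers.
    return {int(db_taxid): entry for db_taxid, entry in raw.items()}


def external_taxid(taxid_map, db_taxid):
    """Return the external taxid for an internal one, or None if unmapped."""
    entry = taxid_map.get(db_taxid)
    return entry["taxid"] if entry else None
```

With the example file above, external_taxid(taxid_map, 3) would return 2697049.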

Downloading Reference Sequence Databases and Mapping Files

There are several databases that are available for download for use with the k-mer classifier. The human and microbial binning database will classify reads to categories (e.g. viral). In contrast, the rest of the listed databases will classify each read to a node in the taxonomy.

Human and Microbial Binning Database

The binning database can be used to classify (or "bin") each read to a category. The categories include human, viral, bacterial, fungal, and parasite. It is possible for a read to classify equally well to more than one category, or for a read to remain unclassified.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.human_microbial_binner.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.human_microbial_binner.name_map.json

| File | Size | md5sum |
| --- | --- | --- |
| dragen-kmer-classifier.human_microbial_binner.v6dh.t6db | 54G | 1cdf5176bf03d9fcc3b86fd5a12fe99a |
| dragen-kmer-classifier.human_microbial_binner.name_map.json | 0.3K | eda0826ef3d079b73c52609b8cadd048 |

The output files are detailed below, but a Python snippet is provided here for ease of use. It creates a small table that summarizes the results for each category. Reads that classify equally well to more than one category are counted toward the "Ambiguous" category:

import pandas

# Initialize read count dictionary
counts = {
  "Human": 0,
  "Viral": 0,
  "Bacterial": 0,
  "Fungal": 0,
  "Parasite": 0,
  "Unclassified": 0,
  "Ambiguous": 0
}

# Load the taxid-level tsv file
tsv = pandas.read_csv(
  "/path/to/mySample.classifier.taxid_kmer_counts.tsv",
  usecols=["db_taxid", "name", "read_count"],
  sep="\t"
)

# Count reads
for row in tsv.itertuples(index=False):
  if row.name == ".":
    if row.db_taxid == 0:
      counts["Unclassified"] += row.read_count
    else:
      counts["Ambiguous"] += row.read_count
  else:
    counts[row.name] += row.read_count

# Create summary table
summary = pandas.DataFrame([{"Category": k, "Reads": v} for k, v in counts.items()])
total = summary["Reads"].sum()
summary["Percent"] = summary["Reads"] / total * 100
summary.to_csv("/path/to/output.tsv", sep="\t", index=False)

This database is also useful for splitting an input FASTQ file by category into separate FASTQs to isolate human and microbial reads.

Genome Database

The genome database includes NCBI RefSeq genomes for human, bacteria, archaea, viruses, and fungi. The December 3 2023 NCBI taxonomy was used to build the database, and the sequences were collected in December 2023.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.refseq_genomes.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.refseq_genomes.name_map.json

| File | Size | md5sum |
| --- | --- | --- |
| dragen-kmer-classifier.refseq_genomes.v6dh.t6db | 266G | e1fb74ffe669c6001522520f016e73e4 |
| dragen-kmer-classifier.refseq_genomes.name_map.json | 11M | e164a1c3859062f10f0dab5272b90092 |

Genome and NT Database

This database includes the contents of the Genome database and all of the NCBI nucleotide (nt) database. The sequences from the NCBI nucleotide database were collected in July 2023, and the December 3 2023 NCBI taxonomy was used to build the database. Two versions of this database are available for download: one that requires a machine with >= 550GB RAM, and a compressed version that trades approximately 5-10% accuracy for a smaller RAM footprint and requires a machine with >= 225GB RAM.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.name_map.json

To download the compressed reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.compressed.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.genomes_plus_nt.name_map.json

| File | Size | md5sum |
| --- | --- | --- |
| dragen-kmer-classifier.genomes_plus_nt.v6dh.t6db | 496G | fcf59213c4cbd3193171eb4e58470feb |
| dragen-kmer-classifier.genomes_plus_nt.compressed.v6dh.t6db | 183G | 97dd8bca18f1cb0a97c1a55ef49d7640 |
| dragen-kmer-classifier.genomes_plus_nt.name_map.json | 171M | 5f5a2f7ea5d20b1c5c739b23b09735d9 |

UniRef90 Database

This database includes all protein sequences of the UniRef90 database. The sequences were collected in March 2024 and the March 28 2024 NCBI taxonomy was used to build the database.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.u90_all.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.u90_all.name_map.json

| File | Size | md5sum |
| --- | --- | --- |
| dragen-kmer-classifier.u90_all.v6dh.t6db | 81G | 78bba8b3635241ac9adc35f101df7f46 |
| dragen-kmer-classifier.u90_all.name_map.json | 27M | 8ebe7b070aa85212f8f37a2f8b901cff |

16S Database

This database includes full length bacterial 16S sequences from the NCBI. The sequences were collected in April 2024 and the March 28 2024 NCBI taxonomy was used to build the database.

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.16S.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.16S.name_map.json

| File | Size | md5sum |
| --- | --- | --- |
| dragen-kmer-classifier.16S.v6dh.t6db | 59M | ed3c4cd4f19ae7e570d603e86ffb2668 |
| dragen-kmer-classifier.16S.name_map.json | 4.4M | 50388f2152bb8849c1ffcf83b14e9e69 |
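
Because the database files are large, it may be worth checking each download against the md5sum values listed in the tables above before running the classifier. The following is a sketch using only the Python standard library; md5_of_file and verify_download are hypothetical helper names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use small even for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's MD5 matches the published value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, verify_download("dragen-kmer-classifier.16S.v6dh.t6db", "ed3c4cd4f19ae7e570d603e86ffb2668") should return True for an intact download of the 16S database.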

Output Details

There are two output files, one organized around the reads, and the other organized around the taxids.

Read-level Output

Applies to: --kmer-classifier-output-taxid-seq, --kmer-classifier-output-read-seq

The main output file is a .tsv file with the extension .read_classifications.tsv. It has no header line, has tab-separated columns, and the number of columns varies depending on command line options. It details the results for each read.

| Column | Description | Data Type |
| --- | --- | --- |
| 1 | Read index | integer |
| 2 | Read name | string |
| 3 | Taxid the read classified to | integer |
| 4 | Maximum number of contiguous kmers that classified to this taxid | integer |
| 5 | Score assigned to the classification | integer |
| 6 | Number of kmers that classified to this taxid | integer |
| 7 | Read duplication count | integer |
| 8 | Name associated with taxid, if given with --kmer-classifier-db-to-taxid-json | string |
| 9 | Taxonomic rank associated with taxid, if given with --kmer-classifier-db-to-taxid-json | string |
| 10 | Taxid that each kmer classified to (is output when the --kmer-classifier-output-taxid-seq flag is set) | list of integers separated by commas |
| 11 | Read sequence (is output when the --kmer-classifier-output-read-seq flag is set) | string |
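
Since the read-level file has no header line, it can help to attach names to the columns when parsing. The sketch below makes several assumptions: the field names in FIXED_COLUMNS are hypothetical labels for columns 1-7, every line is assumed to contain at least those seven columns, and columns 8 and 9 are treated as possibly absent. Any columns after the ninth are kept as raw strings because their presence depends on the command line options.

```python
FIXED_COLUMNS = [
    "read_index", "read_name", "db_taxid", "max_contiguous_kmers",
    "score", "kmer_count", "duplication_count",
]


def parse_read_line(line):
    """Parse one line of a headerless .read_classifications.tsv file."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:7]))
    # All fixed columns except the read name are documented as integers.
    for key in FIXED_COLUMNS:
        if key != "read_name":
            record[key] = int(record[key])
    record["name"] = fields[7] if len(fields) > 7 else None
    record["rank"] = fields[8] if len(fields) > 8 else None
    record["optional_columns"] = fields[9:]
    return record


def parse_read_classifications(path):
    """Yield one record per non-empty line of the read-level output."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                yield parse_read_line(line)
```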

TaxID-level Output

The second output file is a .tsv file with the extension .classifier.taxid_kmer_counts.tsv. It has a header line and has tab-separated columns. It summarizes the results for each taxid.

| Header | Description | Data Type |
| --- | --- | --- |
| db_taxid | Identifier for this taxid used internally in the database | integer |
| duplicity | Ratio of total number of kmers from reads assigned to this taxid compared to the number of distinct kmers from reads assigned to this taxid | float |
| distinct_coverage | Percent of kmers in the database assigned to this taxid that are covered by kmers in the reads assigned to this taxid | integer |
| read_count | Number of reads that classified to this taxid | integer |
| total_kmer_count | Number of kmers that classified to this taxid | integer |
| distinct_kmer_count | Number of distinct kmers that classified to this taxid | integer |
| cumulative_read_count | Cumulative number of reads assigned to this taxid and its taxonomic descendants | integer |
| taxid | Taxid | integer |
| name | Name associated with the taxid, if given with --kmer-classifier-db-to-taxid-json | string |
| rank | Taxonomic rank of the taxid, if given with --kmer-classifier-db-to-taxid-json | string |
| taxid_distinct_kmer_count | Number of distinct kmers assigned to this taxid from the reference sequences | string |
| probability_present | Not in use | float |
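
The duplicity and distinct_coverage columns can be related to the count columns as follows. These formulas are one reading of the column descriptions above, not a confirmed statement of how DRAGEN computes the values, so results may differ from the file (for example through rounding).

```python
def duplicity(total_kmer_count, distinct_kmer_count):
    """Ratio of total kmers to distinct kmers for a taxid (assumed formula)."""
    if distinct_kmer_count == 0:
        return 0.0
    return total_kmer_count / distinct_kmer_count


def distinct_coverage(distinct_kmer_count, taxid_distinct_kmer_count):
    """Percent of a taxid's reference kmers seen in reads (assumed formula)."""
    if taxid_distinct_kmer_count == 0:
        return 0.0
    return 100.0 * distinct_kmer_count / taxid_distinct_kmer_count


# 300 kmers in total, 100 of them distinct, out of 400 reference kmers
print(duplicity(300, 100))          # prints 3.0
print(distinct_coverage(100, 400))  # prints 25.0
```

Under this reading, a high duplicity combined with a low distinct_coverage would indicate that many reads hit the same small set of kmers.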

Using the Kmer Classifier to Split FASTQs by Category

The Kmer Classifier can be used to divide reads in FASTQs by category in order to isolate human and/or microbial reads. This may be done to produce a FASTQ that contains only human reads in preparation for a DRAGEN analysis. Or, a FASTQ with human reads excluded could be produced as input to the DRAGEN Microbial Enrichment (Explify) analysis pipeline.

The human and microbial binning database is ideal for this task and contains human, viral, bacterial, fungal, and parasite sequences. The following instructions will detail how to use this database with the Kmer Classifier to split FASTQs.

  1. Run the Kmer Classifier with the read-level output file enabled following the example command below.

dragen \
  --enable-kmer-classifier true \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-classifier-input-read-file /path/to/fastq.gz \
  --kmer-classifier-db-file /path/to/database/dragen-kmer-classifier.human_microbial_binner.v6dh.t6db \
  --kmer-classifier-db-to-taxid-json /path/to/database/dragen-kmer-classifier.human_microbial_binner.name_map.json \
  --kmer-classifier-no-read-output false \
  --kmer-classifier-min-window 1 \
  --kmer-classifier-ncpus 20 \
  --kmer-classifier-remove-dups false

Two important dragen options from the command above are explained below:

| DRAGEN option | Default value | Required value | Explanation |
| --- | --- | --- | --- |
| --kmer-classifier-no-read-output | false | false | Must be set to false (default value) to create the read-level output file detailed above. |
| --kmer-classifier-remove-dups | false | false | Must be set to false (default value) to prevent deduplication of the input reads, so that every read in the input FASTQ can be placed into a category-specific FASTQ when the reads are split below. |

  2. Use the read-level output file to split the reads into separate FASTQs.

The following Python snippet is an example of how to accomplish this. Please note that this approach may be slow for large FASTQs.

import pysam

# Value passed to --kmer-classifier-input-read-file in the classifier run above
fastqPath = "/path/to/mySample.fastq.gz"

# Read-level output file produced by the classifier run above
readLevelPath = "output/mySample.read_classifications.tsv"

# Set up one output FASTQ file per category
categories = ["Human", "Viral", "Bacterial", "Fungal", "Parasite", "Unclassified", "Ambiguous"]
outputFilePaths = {cat: readLevelPath.replace(".read_classifications.tsv", f".{cat.lower()}.fastq") for cat in categories}
outputFileHandles = {cat: open(path, "w") for cat, path in outputFilePaths.items()}

# Populate dictionary of read name -> category
readCategories = {}
with open(readLevelPath) as readLevelFile:
    for line in readLevelFile:
        # Strip only the newline so that empty trailing columns are preserved
        rowValues = line.rstrip("\n").split("\t")
        readName = rowValues[1]
        dbTaxid = int(rowValues[2])
        category = rowValues[7]

        if category == ".":
            # No name: taxid 0 means unclassified, anything else is ambiguous
            if dbTaxid == 0:
                readCategories[readName] = "Unclassified"
            else:
                readCategories[readName] = "Ambiguous"
        else:
            readCategories[readName] = category

# Iterate over input FASTQ, writing records to category FASTQs
with pysam.FastxFile(fastqPath) as fastq:
    for read in fastq:
        category = readCategories[read.name]
        fh = outputFileHandles[category]
        fh.write(str(read) + "\n")

for fh in outputFileHandles.values():
    fh.close()
