Microbial Binner Database

The human and microbial binning database can be used to classify (or "bin") each read into a category. The categories are human, viral, bacterial, fungal, and parasite. A read can classify equally well to more than one category, or it can remain unclassified.

This database is useful for analyzing the composition of a sample or splitting an input FASTQ file into category-specific FASTQ files.

Download the database

To download the reference index file and the taxid mapping JSON:

wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.human_microbial_binner.v6dh.t6db
wget https://illumina-explify-databases.s3.us-east-1.amazonaws.com/kmer-classifier/dragen-kmer-classifier.human_microbial_binner.name_map.json

File                                                          Size   md5sum
dragen-kmer-classifier.human_microbial_binner.v6dh.t6db       54G    1cdf5176bf03d9fcc3b86fd5a12fe99a
dragen-kmer-classifier.human_microbial_binner.name_map.json   0.3K   eda0826ef3d079b73c52609b8cadd048
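
Because the reference index is large, it can be worth confirming that both downloads match the published checksums before running DRAGEN. The following is a minimal sketch that assumes the GNU md5sum tool is available. The helper name verify_md5 is hypothetical and is not part of DRAGEN.

```shell
# verify_md5 FILE EXPECTED_MD5
# Hypothetical helper: succeeds only if the MD5 digest of FILE matches
# EXPECTED_MD5. It builds a one-line checksum list ("<digest>  <file>")
# and lets md5sum check it from standard input.
verify_md5() {
  printf '%s  %s\n' "$2" "$1" | md5sum -c -
}

# Example calls, run from the download directory:
#   verify_md5 dragen-kmer-classifier.human_microbial_binner.v6dh.t6db 1cdf5176bf03d9fcc3b86fd5a12fe99a
#   verify_md5 dragen-kmer-classifier.human_microbial_binner.name_map.json eda0826ef3d079b73c52609b8cadd048
```

A non-zero exit status indicates a missing or corrupted file, in which case the download should be repeated.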

Run the K-mer Classifier

Point to the downloaded resource files with --kmer-classifier-db-file and --kmer-classifier-db-to-taxid-json in the DRAGEN command. For example:

dragen \
  --enable-kmer-classifier true \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-classifier-input-read-file /path/to/fastq.gz \
  --kmer-classifier-db-file /path/to/database/dragen-kmer-classifier.human_microbial_binner.v6dh.t6db \
  --kmer-classifier-db-to-taxid-json /path/to/database/dragen-kmer-classifier.human_microbial_binner.name_map.json \
  --kmer-classifier-min-window 1 \
  --kmer-classifier-ncpus 20 \
  --kmer-classifier-split-fastq true \
  --kmer-classifier-remove-dups false

If you want to generate category-specific FASTQ files, ensure that --kmer-classifier-split-fastq is set to true. If you only wish to analyze the sample composition, and do not wish to generate category-specific FASTQ files, setting it to false may reduce the run time.

When generating category-specific FASTQs, it is important that --kmer-classifier-remove-dups is set to false (its default value) so that all reads from the input file are represented in the output files.

There is no specific argument to enable the sample composition analysis. The category-based database will be automatically detected and a category summary output will be generated.

Output Files

Category Summary Output

The high-level results of the composition analysis can be found in an output file with the extension .categories.tsv. It contains the number and percent of reads that classified to each category in the database. If a read classifies equally well to multiple categories, it is counted towards "Ambiguous". If it does not classify to any category, it is counted towards "Unclassified".

Here is an example of the output file:

category        reads     percent
Ambiguous       1817      0.18
Bacterial       13036     1.3
Fungal          1454      0.15
Human           983385    98.34
Parasite        297       0.03
Unclassified    3         0
Viral           8         0
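
The category summary can be post-processed with standard tools. As a rough sketch, the helper below sums the reads in the four microbial categories and reports them as a share of all reads. It assumes the file is tab-separated with a single header row (category, reads, percent) and that category names are spelled as in the example. The function name microbial_reads is hypothetical.

```shell
# microbial_reads FILE
# Hypothetical helper: prints "<microbial read count><TAB><percent of all reads>"
# for a .categories.tsv file. Prints nothing if the file contains no reads.
microbial_reads() {
  awk -F '\t' '
    NR == 1 { next }          # skip the header row
    { total += $2 }           # every category counts toward the total
    $1 == "Bacterial" || $1 == "Fungal" || $1 == "Parasite" || $1 == "Viral" {
      microbial += $2
    }
    END {
      if (total > 0) printf "%d\t%.4f\n", microbial, 100 * microbial / total
    }
  ' "$1"
}

# Example call (the path is an assumption about where the summary was written):
#   microbial_reads /path/to/output/sample.categories.tsv
```

For the example summary shown in this section, the microbial categories total 14795 of 1000000 reads, so the helper reports 1.4795 percent.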

FASTQs

If --kmer-classifier-split-fastq is set to true, category-specific FASTQ files are generated. Each file is named after its category and is created only if at least one read classifies to that category. The possible FASTQ files are:

  • *.ambiguous.fastq.gz

  • *.bacterial.fastq.gz

  • *.fungal.fastq.gz

  • *.human.fastq.gz

  • *.parasite.fastq.gz

  • *.unclassified.fastq.gz

  • *.viral.fastq.gz

If a read classifies equally well to multiple categories, it is written to the "ambiguous" FASTQ. If it does not classify to any category, it is written to the "unclassified" FASTQ.
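
To see which category files were produced and how many reads each contains, the output directory can be scanned with a short script. This is a sketch only: it assumes the split files sit directly in the output directory, follow the *.<category>.fastq.gz naming listed above, and use standard four-line FASTQ records. Both function names are hypothetical.

```shell
# count_fastq_reads FILE.fastq.gz
# Prints the number of reads, assuming four lines per FASTQ record.
count_fastq_reads() {
  gzip -dc "$1" | awk 'END { print int(NR / 4) }'
}

# summarize_split_fastqs DIR
# Prints "<file><TAB><reads>" for every category FASTQ found in DIR.
summarize_split_fastqs() {
  for category in ambiguous bacterial fungal human parasite unclassified viral; do
    for fastq in "$1"/*."$category".fastq.gz; do
      # An unmatched glob stays literal; skip it, since a category
      # with no reads has no file.
      [ -f "$fastq" ] || continue
      printf '%s\t%s\n' "$fastq" "$(count_fastq_reads "$fastq")"
    done
  done
}

# Example call:
#   summarize_split_fastqs /path/to/output
```

The per-file counts can then be compared with the reads column of the .categories.tsv file as a sanity check.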
