Kmer Classifier Database Builder

Description

The metagenomics classifier uses a k-mer based algorithm to classify each query sequence (usually a read) against a collection of reference sequences. The process has two logical steps: (1) the reference sequences are indexed into a searchable database, and (2) the database is searched with the query sequences, and each query sequence is classified to the taxid(s) associated with the reference sequences. This guide explains how to generate your own indexed, searchable database of reference sequences for use by the k-mer classifier.

Command Line Settings

Option
Description

Required Inputs

--enable-kmer-class-db-builder

Enables the Kmer Classifier Database Builder. (Default=false).

--kmer-class-db-builder-input-file

Headerless, tab-delimited file where each line is (1) the path to a reference fasta file and (2) the associated taxid. When using --kmer-class-db-builder-taxids-as-seq-name, the second column is required but ignored.

--output-file-prefix

Prefix for all output files.

--output-directory

Directory for all output files.

--kmer-class-db-builder-kmer-length

Kmer length (Range: [4, 31]).

--kmer-class-db-builder-gmer-length

Gmer length. Must be >= kmer length (Range: [4, 64]).

Optional Inputs

--kmer-class-db-builder-tax-tree-file

.tri file with nodes in the taxonomic tree for a classifier database (not required if building a binner database). Headerless, tab-delimited file where each line has (1) the child node taxid and (2) the parent node taxid. Root of tree must be 1 and have a parent of 0.

--kmer-class-db-builder-protein

Set to indicate input sequences are protein sequences. (Default=false).

--kmer-class-db-builder-taxids-to-keep

File with taxids to keep. If set, any kmers with taxids not in this file are excluded from the database.

--kmer-class-db-builder-num-categories

Set to build a binner database with this number of categories. The maximum is 25 categories. Categories are assumed to run sequentially from 2^0..2^n. The categories take the place of taxids in the input file.

--kmer-class-db-builder-save-weights

Set to build a classification database that saves all kmers / taxids / weights.

--kmer-class-db-builder-kmer-cutoff

Cutoff that excludes k-mers that are found in more than cutoff number of taxids when building a database using --kmer-class-db-builder-save-weights. Helps speed up classification. (Default=1000).

--kmer-class-db-builder-mask-bits

Number of bits to mask in kmer before building / searching. (Default=7).

--kmer-class-db-builder-num-cpus

Option to set the number of CPUs available for processing.

--kmer-class-db-builder-num-kmers-per-bucket

Set to output number of kmers in each minimizer bucket. (Default=false).

--kmer-class-db-builder-include-lowercase

Set to include kmers with lowercase bases (usually repeatmasked). (Default=false).

--kmer-class-db-builder-taxids-as-seq-name

Set to indicate that the reference fastas listed in the input file have taxids as sequence name. In this case, the second column of the input file is ignored. (Default=false).
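The two file formats described above (the builder input file and the .tri taxonomic tree file) can be sketched as follows. This is a minimal illustration, not an Illumina-provided tool: the fasta paths, the taxids (100, 101, 10), and the helper names check_input and check_tree are hypothetical, and the checks cover only the rules stated in the option descriptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"

# Builder input file: <fasta path><TAB><taxid>, no header line.
# Paths and taxids are placeholders, not real references.
printf '%s\t%s\n' \
  "/refs/genome_a.fasta" 100 \
  "/refs/genome_b.fasta" 101 > "$workdir/builder_input.txt"

# Taxonomic tree (.tri): <child taxid><TAB><parent taxid>, no header line.
# The root must be taxid 1 with parent 0.
printf '%s\t%s\n' \
  1 0 \
  10 1 \
  100 10 \
  101 10 > "$workdir/tax_tree.tri"

# Input file check: exactly two tab-separated columns, numeric taxid.
check_input() {
  awk -F'\t' '
    NF != 2 || $2 !~ /^[0-9]+$/ { bad = 1 }
    END { if (bad) exit 1 }
  ' "$1"
}

# Tree check: exactly two columns per line and a "1<TAB>0" root entry.
check_tree() {
  awk -F'\t' '
    NF != 2            { bad = 1 }
    $1 == 1 && $2 == 0 { root = 1 }
    END { if (bad || !root) exit 1 }
  ' "$1"
}

check_input "$workdir/builder_input.txt" && echo "input file OK"
check_tree "$workdir/tax_tree.tri" && echo "tree file OK"
```

Running a check like this before launching the builder catches formatting mistakes (a stray header line, spaces instead of tabs, a missing root entry) early.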

Example Command Line

dragen \
  --enable-kmer-class-db-builder=true \
  --kmer-class-db-builder-input-file <builder_input.txt> \
  --output-file-prefix <PREFIX> \
  --output-directory <OUTPUT_DIR> \
  --kmer-class-db-builder-kmer-length 31 \
  --kmer-class-db-builder-gmer-length 64 \
  --kmer-class-db-builder-num-categories 3
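The example above builds a binner database with three categories. A possible input file for it is sketched below, under the assumption that "categories are from 2^0..2^n sequentially" means three categories take the values 1, 2, and 4 in the second column. That reading is an interpretation of the option description, and the fasta paths are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

num_categories=3
# Hypothetical reference fastas, one per category.
fastas=(/refs/host.fasta /refs/bacteria.fasta /refs/virus.fasta)
input_file="$(mktemp)"

# Second column holds the category value (assumed to be 2^i) in place of a taxid.
for ((i = 0; i < num_categories; i++)); do
  printf '%s\t%d\n' "${fastas[i]}" "$((1 << i))"
done > "$input_file"

cat "$input_file"
```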

Usage

K-mer/G-mer length considerations:

  • G-mer length refers to the size of the window from which to pick a minimizer. The larger the window, the fewer minimizers will be chosen overall, resulting in a smaller database. However, this can cause a loss of sensitivity since fewer k-mers overall will be saved.

  • K-mer length refers to the size of the minimizer to be saved in a window size specified by the G-mer length. In general, larger k-mers result in greater specificity while shorter k-mers result in greater sensitivity. This rule of thumb does not hold for every application, so we recommend trying a few different g-mer and k-mer lengths to determine what works best for a given sequence type and application.

  • As a general rule, we recommend starting with a g-mer length of 35 and k-mer length of 31.

  • Pre-built Explify Reference Database k-mer/g-mer length settings for reference:

    • Very large collection of NCBI RefSeq genomes and the entirety of the NCBI nucleotide (nt) database (more than 2 terabases of sequence): G-mer length of 41 and k-mer length of 31. A compressed version is built with a g-mer length of 64 and k-mer length of 31, which results in a database with less than half the storage requirements.

    • Subset of reference genomes from RefSeq with a focus on viral detection: G-mer length of 35 and k-mer length of 31

    • Collection of 16S sequences for bacterial identification / profiling: G-mer length of 31 and k-mer length of 31

    • UniRef90 protein sequences: G-mer length of 15 and k-mer length of 12 (these are k-mers of amino acids, not nucleotides)
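To make the g-mer/k-mer trade-off concrete, here is a toy minimizer selection. It is only a sketch: it picks the lexicographically smallest k-mer in each window, whereas the ordering the builder actually uses is not described in this guide, and the function name minimizer_positions and the test sequence are invented for the illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail
export LC_ALL=C   # byte-wise string comparison for the k-mer ordering

# Count distinct minimizer positions for k-mer length k and window (g-mer) length g.
minimizer_positions() {
  local seq="$1" k="$2" g="$3"
  local n=${#seq} per_window=$((g - k + 1))
  local w j best best_pos kmer
  local -A picked=()
  for ((w = 0; w + g <= n; w++)); do          # slide the g-mer window
    best=""; best_pos=-1
    for ((j = 0; j < per_window; j++)); do    # every k-mer inside the window
      kmer="${seq:w+j:k}"
      if [[ -z "$best" || "$kmer" < "$best" ]]; then
        best="$kmer"; best_pos=$((w + j))     # leftmost smallest k-mer wins
      fi
    done
    picked[$best_pos]=1                       # the same position is stored once
  done
  echo "${#picked[@]}"
}

seq="ACGTTGCATGGACCTAGGTCAAGT"   # made-up 24-base sequence
for g in 4 6 8 12; do
  echo "k=4 g=$g minimizers: $(minimizer_positions "$seq" 4 "$g")"
done
```

With g equal to k, every k-mer is kept. As g grows, each window still contributes only one k-mer, so the number of stored k-mers can only stay the same or shrink, which is the database size versus sensitivity trade-off described above.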

Three types of databases can be built with this tool:

  • Binner: each k-mer is assigned to a category/bin.

    • Must use --kmer-class-db-builder-num-categories.

    • Do not use --kmer-class-db-builder-tax-tree-file, --kmer-class-db-builder-save-weights, or --kmer-class-db-builder-kmer-cutoff.

  • Classifier: each k-mer is assigned to one taxid.

    • Must define a taxonomic tree with --kmer-class-db-builder-tax-tree-file.

    • Do not use --kmer-class-db-builder-num-categories, --kmer-class-db-builder-save-weights, or --kmer-class-db-builder-kmer-cutoff.

  • Classifier with weights: each k-mer is assigned to one or more taxids; associated weights are also stored (frequency of k-mer's association with a taxid). Uses much more memory, but is more accurate.

    • Must use --kmer-class-db-builder-save-weights and define a taxonomic tree with --kmer-class-db-builder-tax-tree-file.

    • Can use --kmer-class-db-builder-kmer-cutoff.

    • Do not use --kmer-class-db-builder-num-categories.
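The option combinations for the three database types can be summarized in a small helper that prints a candidate command line for each type without running it. This is a sketch under stated assumptions: the helper name mode_flags, the file names, and the output prefix are placeholders, and writing boolean options as =true follows the style of --enable-kmer-class-db-builder=true in the example command rather than a documented requirement.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the type-specific options for one database type.
mode_flags() {
  local mode="$1" tree="${2:-}"
  case "$mode" in
    binner)
      echo "--kmer-class-db-builder-num-categories 3" ;;
    classifier)
      echo "--kmer-class-db-builder-tax-tree-file $tree" ;;
    weighted)
      # The k-mer cutoff is optional; 1000 is the documented default.
      echo "--kmer-class-db-builder-tax-tree-file $tree" \
           "--kmer-class-db-builder-save-weights=true" \
           "--kmer-class-db-builder-kmer-cutoff 1000" ;;
    *)
      echo "unknown database type: $mode" >&2
      return 1 ;;
  esac
}

# Print (do not execute) one command per database type.
for mode in binner classifier weighted; do
  echo "dragen --enable-kmer-class-db-builder=true" \
       "--kmer-class-db-builder-input-file builder_input.txt" \
       "--output-file-prefix my_db --output-directory out" \
       "--kmer-class-db-builder-kmer-length 31" \
       "--kmer-class-db-builder-gmer-length 35" \
       "$(mode_flags "$mode" tax_tree.tri)"
done
```

Keeping the type-specific options in one place makes it harder to mix options that should not be combined, such as --kmer-class-db-builder-num-categories with a taxonomic tree.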
