# DRAGEN Microbial Enrichment Plus

## Description

DRAGEN Microbial Enrichment Plus (DME+), formerly known as the Explify Analysis Pipeline, offers a dedicated informatics solution with flexible analysis options for the following Illumina Infectious Disease and Microbiology target-capture enrichment panel kits: the Illumina Respiratory Pathogen ID/AMR Enrichment Panel Kit (RPIP), Illumina Urinary Pathogen ID/AMR Enrichment Panel Kit (UPIP), and Illumina Viral Surveillance Panel V2 Kit (VSP V2). The application delivers easy-to-use, powerful secondary analysis of Illumina sequencing data, with workflows for sample QC, viral WGS (whole-genome sequencing), pathogen detection and quantification, and antimicrobial resistance (AMR) marker profiling. It also supports custom reference sequence analysis.

* RPIP: Target-capture enrichment of >280 RNA and DNA respiratory pathogens, including SARS-CoV-2, Influenza viruses, Respiratory syncytial virus, Mycobacterium and Legionella species, and >4000 AMR markers.
* UPIP: Target-capture enrichment of >170 genitourinary pathogens, including fastidious, slow-growing, and anaerobic uropathogens, sexually transmitted microorganisms, and >4000 bacterial AMR markers.
* VSP V2: Target-capture enrichment for whole-genome sequencing (WGS) of 200 RNA and DNA viruses prioritized as high-risk to public health, zoonotic surveillance, and biotech, and >200 viral AMR markers.
* Custom: Analyze FASTQ/FASTA read files with a custom reference sequence database.

Note that samples enriched using the Illumina Respiratory Virus Oligo Panel/Respiratory Virus Enrichment Kit (RVOP/RVEK) and Viral Surveillance Panel Kit (VSP) can also be analyzed using DME+ and the VSP V2 database.

## Pipeline Steps

The following table describes the different steps performed by the pipeline, which steps apply to each panel, and whether the step is run when using a set of custom references.

|              Step             |                                                                               Description                                                                              |    Panels    | Custom References |
| :---------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :----------: | :---------------: |
|            Read QC            | Can be disabled. Low-quality bases are trimmed. Short and low-quality reads are discarded. It is assumed that appropriate adapter trimming has already been performed. |      All     |        Yes        |
|    Post-QC FASTQ Generation   |                           Optionally creates a single FASTQ of the trimmed reads, or a set of kingdom-specific FASTQs. Disabled by default.                            |      All     |        Yes        |
|           Dehosting           |                                                                          Removes human reads.                                                                          |      All     |        Yes        |
|           Sample QC           |                                   Sample composition analysis and enrichment factor calculation (which requires an internal control).                                  |      All     |         No        |
|  Microorganism Classification |                                                           K-mer-based analysis with configurable sensitivity.                                                          |    VSP V2    |         No        |
|    Microorganism Detection    |                                                           Alignment-based analysis and consensus generation.                                                           |      All     |        Yes        |
|  Microorganism Quantification |                                                                      Requires an internal control.                                                                     |      All     |         No        |
| Bacterial AMR Marker Analysis |                                         Nucleotide and protein alignment, consensus generation, variant calling and annotation.                                        |  RPIP, UPIP  |         No        |
|     Viral Variant Calling     |                                                                Detects variants from alignment results.                                                                | RPIP, VSP V2 |         No        |
|   Viral AMR Marker Analysis   |                                                                     Variant calling and annotation.                                                                    | RPIP, VSP V2 |         No        |
|       Report Generation       |                                                                          Creates the AP JSON.                                                                          |      All     |        Yes        |

## Command Line Settings

| Option                                      | Description                                                                                                                         |
| ------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| Required Inputs                             |                                                                                                                                     |
| `--enable-explify`                          | Enables the DME+ pipeline. (Default=false).                                                                                         |
| `--output-file-prefix`                      | Prefix for all output files.                                                                                                        |
| `--output-directory`                        | Directory for all output files.                                                                                                     |
| `--explify-sample-list`                     | Input sample list .tsv file with sample IDs, FASTQs, etc.                                                                           |
| `--explify-test-panel-name`                 | Test panel name: "RPIP", "UPIP", "VSP V2", or "Custom".                                                                             |
| `--explify-test-panel-version`              | Set to test panel version (e.g. "1.0.0").                                                                                           |
| `--explify-ref-db-dir`                      | Path to root directory for database files.                                                                                          |
| Optional Inputs                             |                                                                                                                                     |
| `--intermediate-results-dir`                | Directory for temporary files. Free space must exceed three times the total size of all FASTQ files.                                |
| `--explify-load-db-ram`                     | Option to load database into RAM if not on ramdisk. (Default=false).                                                                |
| `--explify-no-read-qc`                      | Option to turn off read QC on FASTQs before analysis. (Default=false).                                                              |
| `--explify-internal-control`                | Option to set internal control from an accepted list. (Default="Enterobacteria phage T7").                                          |
| `--explify-internal-control-concentration`  | Option to set internal control concentration. (Default=12100000).                                                                   |
| `--explify-ncpus`                           | Option to set the number of CPUs available for processing.                                                                          |
| `--explify-sensitivity-threshold`           | Option to set sensitivity threshold for considering a virus present. Range: 0 < Integer < 1000. Only valid for VSP V2. (Default=5). |
| `--explify-custom-ref-fasta`                | Reference FASTA file. Required for Custom reference DBs.                                                                            |
| `--explify-custom-ref-bed`                  | Reference BED file. Optional for Custom reference DBs.                                                                              |
| `--explify-viral-consensus-depth-threshold` | Minimum depth at position to include base in viral consensus sequence. Only relevant for RPIP and VSP V2. (Default=1).              |
| `--explify-viral-vc-depth-threshold`        | Minimum total depth at position to report viral variant. Only relevant for RPIP and VSP V2. (Default=5).                            |
| `--explify-viral-vc-af-threshold`           | Minimum allele frequency to report viral variant. Only relevant for RPIP and VSP V2. (Default=0.2).                                 |
| `--explify-post-qc-fastq-mode`              | Create a single post-QC FASTQ file or files split by kingdom. Choices='off', 'single', 'split'. (Default=off).                      |

### Example Command Line

```shell
dragen \
  --enable-explify=true \
  --output-file-prefix <PREFIX> \
  --explify-sample-list /path/to/sample/list/tsv \
  --explify-test-panel-name <"RPIP"/"UPIP"/"VSP V2"/"Custom"> \
  --explify-test-panel-version <VERSION> \
  --explify-ref-db-dir /path/to/root/db/dir \
  --explify-load-db-ram=true \
  --output-directory <OUTPUT_DIR> \
  --intermediate-results-dir <OUTPUT_DIR> \
  --explify-ncpus=20
```
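
The free-space requirement for `--intermediate-results-dir` (more than three times the total size of all FASTQ files) can be checked before launching a run. The following is a minimal sketch and not part of DRAGEN: the helper names `total_bytes`, `required_scratch_bytes`, and `has_free_space` are hypothetical, the two demo files are throwaway stand-ins for real FASTQs, and it assumes that the on-disk file size is the size the requirement refers to.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the combined size in bytes of the given files.
total_bytes() {
  total=0
  for f in "$@"; do
    size=$(wc -c < "$f" | tr -d ' ')
    total=$((total + size))
  done
  echo "$total"
}

# Hypothetical helper: print three times the combined size of the given files.
required_scratch_bytes() {
  echo $(( $(total_bytes "$@") * 3 ))
}

# Hypothetical helper: succeed if directory $1 has more than $2 bytes free.
has_free_space() {
  avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ $((avail_kb * 1024)) -gt "$2" ]
}

# Demo with two tiny stand-in files; substitute real FASTQ paths in practice.
demo_dir=$(mktemp -d)
printf 'ACGT\n' > "$demo_dir/a.fastq"
printf 'ACGTACGT\n' > "$demo_dir/b.fastq"

need=$(required_scratch_bytes "$demo_dir/a.fastq" "$demo_dir/b.fastq")
if has_free_space "$demo_dir" "$need"; then
  echo "OK: more than $need bytes free"
else
  echo "Not enough space: need more than $need bytes" >&2
fi
```

In practice, `has_free_space` would be pointed at the directory passed to `--intermediate-results-dir`, and `required_scratch_bytes` at every FASTQ named in the sample list.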

## Input Details

### Sample Input List

Applies to: `--explify-sample-list`

The sample input list is a column-formatted file with *tab* separations between the columns (i.e., a `.tsv` file).

```
SampleID     BatchID     RunID     ControlFlag     FastQs
MySample     MyBatch     MyRun     POS             /path/to/fastq1.gz     /path/to/fastq2.gz
```

Notes:

* The **SampleID** values *must* be unique.
* **BatchID** and **RunID** help users track and manage sample analyses. **BatchID** is often used to track libraries that were prepared together, and **RunID** to track sequencing runs. Both can be left blank.
* The **ControlFlag** value can be *POS*, *NEG*, *BLANK*, or left empty.
  * *POS* is used to indicate a positive control sample.
  * *NEG* is used to indicate a negative control sample.
  * *BLANK* is used to indicate a blank control sample (e.g. buffer only).
* If a sample has multiple FASTQ files, separate them with tabs.
* Take care when editing `.tsv` files: some editors silently replace tabs with spaces.
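
One way to avoid editor-induced tab problems is to generate the list from a script. The sketch below is an illustration only: `sample_row` and `sample_ids_unique` are hypothetical helper names, the `MyBlank` row and all paths are placeholders, and the column layout is copied from the example above.

```shell
#!/bin/sh
set -eu

TAB=$(printf '\t')

# Hypothetical helper: print its arguments as one row joined by real tabs.
sample_row() {
  row=$1
  shift
  for field in "$@"; do
    row="$row$TAB$field"
  done
  printf '%s\n' "$row"
}

# Hypothetical helper: succeed only if no SampleID (column 1) is repeated.
# The first line is treated as the header and skipped.
sample_ids_unique() {
  awk -F'\t' 'BEGIN { dup = 0 } NR > 1 && seen[$1]++ { dup = 1 } END { exit dup }' "$1"
}

list=$(mktemp)
{
  sample_row SampleID BatchID RunID ControlFlag FastQs
  sample_row MySample MyBatch MyRun POS /path/to/fastq1.gz /path/to/fastq2.gz
  sample_row MyBlank MyBatch MyRun BLANK /path/to/blank.fastq.gz
} > "$list"

if sample_ids_unique "$list"; then
  echo "Sample list written to $list"
else
  echo "Duplicate SampleID found in $list" >&2
fi
```

To leave **BatchID**, **RunID**, or **ControlFlag** blank, pass an empty string (`""`) for that argument so that the column position is preserved.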

### Internal Control

Applies to: `--explify-internal-control`, `--explify-internal-control-concentration`

Specify one of the internal controls listed below. If `NONE` is specified, the internal control concentration is ignored. The names are case-sensitive and must be entered exactly as they appear:

* `Allobacillus halotolerans`
* `Armored RNA Quant Internal Process Control`
* `Enterobacteria phage T7` (default)
* `Escherichia virus MS2`
* `Escherichia virus Qbeta`
* `Escherichia virus T4`
* `Imtechella halotolerans`
* `Phocid alphaherpesvirus 1`
* `Phocine morbillivirus`
* `Truepera radiovictrix`
* `NONE`

The internal control concentration is an integer representing the number of *copies/mL of sample* for the internal control.
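
Because the names are case-sensitive and most contain spaces, it can help to validate the values in a wrapper script before they reach the command line. The following is a rough sketch under stated assumptions: `is_accepted_control` and `is_integer_concentration` are hypothetical helpers, not DRAGEN features, and the accepted names are copied from the list above.

```shell
#!/bin/sh
set -eu

# Accepted names copied from the list above, one per line.
ACCEPTED_CONTROLS='Allobacillus halotolerans
Armored RNA Quant Internal Process Control
Enterobacteria phage T7
Escherichia virus MS2
Escherichia virus Qbeta
Escherichia virus T4
Imtechella halotolerans
Phocid alphaherpesvirus 1
Phocine morbillivirus
Truepera radiovictrix
NONE'

# Hypothetical helper: succeed only on an exact, case-sensitive, whole-name match.
is_accepted_control() {
  printf '%s\n' "$ACCEPTED_CONTROLS" | grep -Fxq -- "$1"
}

# Hypothetical helper: succeed only if the value consists of digits alone.
is_integer_concentration() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

control='Enterobacteria phage T7'   # the documented default
concentration=12100000              # the documented default, in copies/mL of sample

if is_accepted_control "$control" && is_integer_concentration "$concentration"; then
  # Quote the name: it contains spaces and must be passed as a single argument.
  echo "--explify-internal-control \"$control\" --explify-internal-control-concentration $concentration"
else
  echo "Invalid internal control settings" >&2
fi
```

The exact-match check uses `grep -Fx`, so a partial name such as `Escherichia virus` or a differently cased name is rejected.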
