# Oncovirus Detection

## Overview

The DRAGEN oncovirus detection analysis can detect the presence of oncoviruses, whether they have integrated into the human genome, and at what locations. By default, the analysis takes unmapped reads as input, uses the DRAGEN *k*-mer classifier to identify whether each read is from an oncovirus, and determines which reference sequence the read best matches. It then generates a TSV file describing which oncoviruses were detected.

An oncovirus is considered detected if it passes a read count threshold and has at least one reference that passes its *k*-mer fraction threshold (described in more detail below).

Any oncovirus that is determined to be present is further analyzed by the DRAGEN SV caller. Assembled SV breakends are aligned to oncoviral references identified by *k*-mer classification. Integration sites discovered by this process are included in the SV VCF file.

Oncovirus detection can be enabled with WGS, WES, and panels, but it is expected to perform best with WGS and panels with oncoviral probes. Integration site detection has not been evaluated outside of WGS.

## Database

Oncovirus detection requires resource files that can be downloaded from the [DRAGEN Secondary Analysis Product Files page](https://support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform/product_files.html). This set of resource files is referred to below as the oncovirus database.

Unpack the downloaded tar.gz file:

```shell
tar xzvf oncovirus-detection-files.tar.gz
```

The unpacked `md5sum` file can be used to check the integrity of the other unpacked files.

Unpacking also creates a subdirectory named after the version of the database (e.g. `1.0.0`). Provide the path to this subdirectory with the `--oncovirus-detection-db` command line argument.
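
As a rough sketch of the integrity check, the checksum list can be verified with the standard `md5sum -c` mode before DRAGEN is pointed at the versioned subdirectory. The helper name `verify_oncovirus_db` and the paths are hypothetical, and the example assumes that the unpacked `md5sum` file uses the format written by the `md5sum` tool, with file paths relative to the unpack directory.

```shell
# Hypothetical helper, not shipped with DRAGEN.
# Assumes the unpacked "md5sum" file is a standard md5sum checksum list
# whose paths are relative to the unpack directory.
verify_oncovirus_db() {
  # Run in a subshell so the caller's working directory is unchanged.
  (cd "$1" && md5sum -c md5sum)
}

# Example (placeholder paths):
#   verify_oncovirus_db /path/to/unpacked && db=/path/to/unpacked/1.0.0
```

The function returns a non-zero status if any listed file is missing or does not match its checksum.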

## Oncovirus Presence

Oncovirus detection is enabled with `--enable-oncovirus-detection=true` and requires the database path, provided with `--oncovirus-detection-db=/path/to/directory/`. The example command below analyzes tumor and normal sample reads for the presence of oncoviral sequences:

```shell
dragen \
  --enable-oncovirus-detection true \
  --oncovirus-detection-db "$db" \
  --tumor-fastq-list "$tumorFastqList" \
  --fastq-list "$normalFastqList" \
  --ref-dir "$ref" \
  --output-file-prefix "$prefix" \
  --output-directory "$out"
```

Enabling oncovirus detection creates an output TSV file at `$out/$prefix.oncovirus_detections.tsv` with the fields described below. Empty values are denoted in the TSV with a hyphen (`-`).

| Field                              | Description                                                                                       |
| ---------------------------------- | ------------------------------------------------------------------------------------------------- |
| oncovirus                          | Virus name                                                                                        |
| sample                             | Name of the sample                                                                                |
| detected                           | Value is "detected" if virus metrics are above thresholds                                         |
| oncovirus\_read\_count             | Number of reads that classified to the virus and its references                                   |
| best\_match\_ref\_accession        | Accession of the reference with the highest *k*-mer fraction                                      |
| best\_match\_ref\_read\_count      | Number of reads that classified to the best-match reference                                       |
| best\_match\_ref\_kmer\_fraction   | Fraction of *k*-mers detected for the best-match reference                                        |
| best\_match\_ref\_length           | Length of the best-match reference                                                                |
| best\_match\_ref\_completeness     | Ratio of best-match reference length to RefSeq reference length for this virus; capped at 1.0     |
| best\_primary\_ref\_accession      | Accession of the primary (e.g. RefSeq) reference with the highest *k*-mer fraction                |
| best\_primary\_ref\_read\_count    | Number of reads that classified to the best-match primary reference                               |
| best\_primary\_ref\_kmer\_fraction | Fraction of *k*-mers detected for the best-match primary reference                                |
| best\_primary\_ref\_length         | Length of the best-match primary reference                                                        |

In order to be considered detected, an oncovirus must pass a read count threshold and have at least one reference that passes its *k*-mer fraction threshold.
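
For downstream scripting, the detections TSV can be filtered on the `detected` field. The following is a minimal sketch rather than a supported tool: the function name `list_detected_oncoviruses` is hypothetical, and the code assumes that the TSV starts with a header row containing the field names listed in the table above.

```shell
# Hypothetical helper: print virus name, read count, and best-match accession
# for every row whose "detected" field has the value "detected".
# Assumes the first line of the TSV is a header with the documented field names.
list_detected_oncoviruses() {
  awk -F '\t' '
    NR == 1 {
      # Map header names to column numbers so column order does not matter.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("detected" in col)) exit 2   # header not as assumed
      next
    }
    $(col["detected"]) == "detected" {
      print $(col["oncovirus"]) "\t" $(col["oncovirus_read_count"]) "\t" $(col["best_match_ref_accession"])
    }
  ' "$1"
}

# Example: list_detected_oncoviruses "$out/$prefix.oncovirus_detections.tsv"
```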

The *k*-mer fraction quantifies how much of a reference sequence is supported by the sequencing data. First, all canonical *k*-mers are enumerated from the reference sequence. The *k*-mer fraction is then calculated as the proportion of these reference *k*-mers that are observed at least once in the reads. A value close to 1 indicates broad coverage across the reference, whereas lower values indicate partial or sparse support.
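
The calculation can be illustrated on toy sequences. The sketch below is not the DRAGEN implementation: the function name `kmer_fraction`, the value of *k*, and the input sequences are hypothetical, and it assumes uppercase A/C/G/T input and that reference *k*-mers are counted as a distinct set. A canonical *k*-mer is taken here to be the lexicographically smaller of a *k*-mer and its reverse complement.

```shell
# Toy illustration only: fraction of the distinct canonical k-mers of a
# reference that appear at least once in a set of reads.
# Usage: kmer_fraction K REFERENCE READ [READ...]
kmer_fraction() {
  kmer_k="$1"; kmer_ref="$2"; shift 2
  # Line 1 of the awk input is the reference; the remaining lines are reads.
  printf '%s\n' "$kmer_ref" "$@" | awk -v k="$kmer_k" '
    function revcomp(s,    i, c, out) {
      out = ""
      for (i = length(s); i >= 1; i--) {
        c = substr(s, i, 1)
        if (c == "A") out = out "T"
        else if (c == "C") out = out "G"
        else if (c == "G") out = out "C"
        else if (c == "T") out = out "A"
        else out = out "N"
      }
      return out
    }
    function canonical(s,    r) { r = revcomp(s); return (s < r) ? s : r }
    NR == 1 {
      # Enumerate the canonical k-mers of the reference.
      for (i = 1; i + k - 1 <= length($0); i++) ref[canonical(substr($0, i, k))] = 1
      next
    }
    {
      # Mark reference k-mers that are observed in this read.
      for (i = 1; i + k - 1 <= length($0); i++) {
        km = canonical(substr($0, i, k))
        if (km in ref) seen[km] = 1
      }
    }
    END {
      for (km in ref) total++
      for (km in seen) hit++
      printf "%.2f\n", (total ? hit / total : 0)
    }'
}

kmer_fraction 3 AAACCC GGTT   # prints 0.50
```

In this example the reference `AAACCC` has four canonical 3-mers (`AAA`, `AAC`, `ACC`, `CCC`). The read `GGTT` comes from the opposite strand and supports `ACC` and `AAC`, so two of the four reference *k*-mers are observed.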

## Included Oncoviruses and Thresholds

|                   Virus Name                   | Read Count Threshold | K-mer Fraction Threshold | Database Reference Count |
| :--------------------------------------------: | :------------------: | :----------------------: | :----------------------: |
|            Epstein-Barr virus (EBV)            |           5          |           0.05           |            196           |
|             Hepatitis B virus (HBV)            |           5          |           0.05           |           5493           |
|             Hepatitis C virus (HCV)            |           5          |           0.05           |           3293           |
|       Human papillomavirus (25+ types)\*       |           5          |           0.25           |            310           |
|      Human T-lymphotropic virus 1 (HTLV-1)     |           5          |           0.05           |            11            |
| Kaposi's sarcoma-associated herpesvirus (KSHV) |           5          |           0.05           |            54            |
|        Merkel cell polyomavirus (MCPyV)        |           5          |           0.05           |            13            |

\*Classifications are HPV6, HPV11, HPV16, HPV18, HPV26, HPV31, HPV33, HPV35, HPV39, HPV40, HPV42, HPV43, HPV44, HPV45, HPV51, HPV52, HPV53, HPV54, HPV56, HPV58, HPV59, HPV61, HPV66, HPV68, HPV69, HPV70, HPV73, HPV82, Other HPV
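
The two-part detection rule can be sketched as a small check against the thresholds in the table. The helper `passes_thresholds` is hypothetical, the metric values in the usage comments are made up, and the sketch assumes that "passes" means greater than or equal to the threshold, which this page does not state explicitly. Because the best-match reference is the one with the highest *k*-mer fraction, checking `best_match_ref_kmer_fraction` is enough to know whether any reference passes.

```shell
# Hypothetical helper, not part of DRAGEN.
# Assumption: a metric passes when it is >= its threshold.
# Usage: passes_thresholds READ_COUNT KMER_FRACTION MIN_READS MIN_FRACTION
passes_thresholds() {
  awk -v reads="$1" -v frac="$2" -v min_reads="$3" -v min_frac="$4" \
    'BEGIN { exit !((reads + 0 >= min_reads + 0) && (frac + 0 >= min_frac + 0)) }'
}

# Made-up metrics checked against the HPV thresholds (5 reads, 0.25 fraction):
#   passes_thresholds 12 0.31 5 0.25   # exit status 0 (both thresholds met)
#   passes_thresholds 12 0.10 5 0.25   # exit status 1 (k-mer fraction too low)
```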

## Integration Site Detection

When the SV caller is enabled alongside oncovirus detection, DRAGEN can call sites where oncoviral sequences have integrated into the human genome and report them in the SV VCF output. For details on enabling and interpreting viral integration site detection, see [Viral Integration Site Detection](https://help.dragen.illumina.com/product-guides/dragen-v4.5/sv-calling#viral-integration-site-detection) in the SV Calling documentation.

## Command Line Arguments

| Argument                                 | Type   | Description                                                                                                                                | Default      |
| ---------------------------------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------ |
| enable-oncovirus-detection               | bool   | Enables detection of oncoviruses                                                                                                           | false        |
| oncovirus-detection-db                   | string | Path to directory containing resource files                                                                                                | empty string |
| oncovirus-detection-all-reads            | bool   | Enable to use all reads instead of just unmapped reads                                                                                     | false        |
| oncovirus-detection-softclipped-reads    | bool   | Enable to keep softclipped reads in addition to unmapped reads                                                                             | false        |
| oncovirus-detection-below-threshold      | bool   | Enable to include below-threshold viruses in detections TSV                                                                                | false        |
| oncovirus-detection-enable-read-output\* | bool   | Enable to create an [output file with per-read results](https://help.dragen.illumina.com/product-guides/kmer-classifier#read-level-output) | false        |
| oncovirus-detection-num-threads          | int    | Number of threads to use for processing reads                                                                                              | 8            |

**\*Note:** When `--oncovirus-detection-enable-read-output=true`, `--oncovirus-detection-num-threads` must be set to `1` to ensure that the per-read output file is properly formed.
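
One way to keep the read-output constraint from being forgotten is to assemble the oncovirus arguments in a wrapper script. The sketch below is only an illustration: `build_oncovirus_args` is a hypothetical helper, it uses only the arguments listed in the table above, and it assumes that the database path contains no whitespace.

```shell
# Hypothetical helper: print the oncovirus-related arguments for a dragen
# command line. Requesting per-read output also forces a single thread.
# Usage: build_oncovirus_args DB_PATH READ_OUTPUT(true|false)
build_oncovirus_args() {
  args="--enable-oncovirus-detection true --oncovirus-detection-db $1"
  if [ "$2" = "true" ]; then
    args="$args --oncovirus-detection-enable-read-output true --oncovirus-detection-num-threads 1"
  fi
  printf '%s\n' "$args"
}

# Example: build_oncovirus_args /path/to/unpacked/1.0.0 true
```

The printed string can be appended to the remaining `dragen` arguments through unquoted command substitution.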
