Oncovirus Detection
Overview
The DRAGEN oncovirus detection analysis can detect the presence of oncoviruses, whether they have integrated into the human genome, and at what locations. The oncovirus analysis takes in unmapped reads, uses the DRAGEN k-mer classifier to identify whether a read is from an oncovirus, and determines to which reference sequence it best matches. A TSV file describing which oncoviruses were detected is generated.
An oncovirus is considered detected if it passes a read count threshold and has at least one reference that passes its k-mer fraction threshold (described in more detail below).
Any oncovirus that is determined to be present is further analyzed by the DRAGEN SV caller. Assembled SV breakends are aligned to oncoviral references identified by k-mer classification. Integration sites discovered by this process are included in the SV VCF file.
Oncovirus detection can be enabled with WGS, WES, and panels, but it is expected to perform best with WGS and panels with oncoviral probes. Integration site detection has not been evaluated outside of WGS.
Database
Oncovirus detection requires resource files that can be downloaded from the DRAGEN Secondary Analysis Product Files page. This set of resource files is referred to below as the oncovirus database.
The downloaded tar.gz file will need to be unpacked:
tar xzvf oncovirus-detection-files.tar.gz
The unpacked md5sum file can be used to check the integrity of the other unpacked files.
A subdirectory is also unpacked and is named after the version of the database (e.g. "1.0.0"). This subdirectory is used with the --oncovirus-detection-db command line argument.
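The integrity check can also be scripted. The following is a minimal sketch, assuming the unpacked checksum file uses the standard md5sum line format (a hex digest, whitespace, then a file name) and that file names are relative to the directory containing the checksum file. The function name and the manifest path argument are hypothetical and are not part of DRAGEN.

```python
import hashlib
from pathlib import Path


def verify_md5_manifest(manifest_path):
    """Return {file name: True/False} for every entry in an md5sum-style file.

    Assumption: each non-empty line is "<hex digest>  <file name>", with the
    name relative to the directory that holds the manifest.
    """
    manifest = Path(manifest_path)
    results = {}
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading "*" on the file name
        name = name.strip().lstrip("*")
        target = manifest.parent / name
        if not target.is_file():
            results[name] = False
            continue
        digest = hashlib.md5()
        with target.open("rb") as handle:
            # Hash in 1 MiB chunks so large resource files are not loaded at once
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        results[name] = digest.hexdigest() == expected.lower()
    return results
```

Any entry that maps to False indicates a missing or corrupted file, in which case the archive should be downloaded and unpacked again.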
Oncovirus Presence
The detection of oncoviruses in a sample is enabled with --enable-oncovirus-detection=true and by providing the database path with --oncovirus-detection-db=/path/to/directory/. An example command is given below where tumor and normal sample reads are analyzed for the presence of oncoviral sequences:
dragen \
--enable-oncovirus-detection true \
--oncovirus-detection-db $db \
--tumor-fastq-list $tumorFastqList \
--fastq-list $normalFastqList \
--ref-dir $ref \
--output-file-prefix $prefix \
--output-directory $out
Enabling oncovirus detection will create an output TSV file at $out/$prefix.oncovirus_detections.tsv with the fields described below. Empty values are denoted in the TSV with a hyphen.
Field                           Description
oncovirus                       Virus name
sample                          Name of the sample
detected                        Value is "detected" if virus metrics are above thresholds
oncovirus_read_count            Number of reads that classified to the virus and its references
best_match_ref_accession        Accession of the reference with the highest k-mer fraction
best_match_ref_read_count       Number of reads that classified to the best-match reference
best_match_ref_kmer_fraction    Fraction of k-mers detected for the best-match reference
best_match_ref_length           Length of the best-match reference
best_match_ref_completeness     Length of the best-match reference compared to the RefSeq reference for this virus; capped at 1.0
best_primary_ref_accession      Accession of the primary (e.g. RefSeq) reference with the highest k-mer fraction
best_primary_ref_read_count     Number of reads that classified to the best-match primary reference
best_primary_ref_kmer_fraction  Fraction of k-mers detected for the best-match primary reference
best_primary_ref_length         Length of the best-match primary reference
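The detections TSV can be read with standard tooling. The following is a minimal sketch, assuming the file starts with a header row that uses the field names listed above. It converts the hyphen placeholder to None and selects rows whose detected field holds the value "detected". The function names are hypothetical.

```python
import csv


def read_detections(handle):
    """Parse an oncovirus detections TSV from an open text handle.

    Assumption: the first line is a header row with the documented field names.
    A lone hyphen marks an empty value and is converted to None.
    """
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({key: (None if value == "-" else value) for key, value in row.items()})
    return rows


def detected_viruses(rows):
    """Return (sample, oncovirus) pairs for rows flagged as detected."""
    return [(row["sample"], row["oncovirus"]) for row in rows if row["detected"] == "detected"]
```

When reading from disk, open the file with newline="" as recommended for the csv module, then pass the handle to read_detections.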
In order to be considered detected, an oncovirus must pass a read count threshold and have at least one reference that passes its k-mer fraction threshold.
The k-mer fraction quantifies how much of a reference sequence is supported by the sequencing data. First, all canonical k-mers are enumerated from the reference sequence. The k-mer fraction is then calculated as the proportion of these reference k-mers that are observed at least once in the reads. A value close to 1 indicates broad coverage across the reference, whereas lower values indicate partial or sparse support.
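The calculation described above can be illustrated with a short sketch. It is not DRAGEN's implementation: it assumes the common convention that the canonical form of a k-mer is the lexicographically smaller of the k-mer and its reverse complement, it skips k-mers containing non-ACGT characters, and the value of k is left as a parameter because the value DRAGEN uses is not stated here.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Return the set of canonical k-mers in seq.

    Assumption: canonical means min(k-mer, reverse complement).
    K-mers with characters outside ACGT (e.g. N) are skipped.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def kmer_fraction(reference, reads, k):
    """Fraction of the reference's canonical k-mers seen at least once in reads."""
    ref_kmers = canonical_kmers(reference, k)
    if not ref_kmers:
        return 0.0
    observed = set()
    for read in reads:
        # Only k-mers that belong to the reference count toward the fraction
        observed |= canonical_kmers(read, k) & ref_kmers
    return len(observed) / len(ref_kmers)
```

Because each reference k-mer is counted once no matter how many reads contain it, deep coverage of a small region cannot raise the fraction; only breadth of coverage across the reference does.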
Included Oncoviruses and Thresholds
Oncovirus                                       Read count threshold  k-mer fraction threshold
Epstein-Barr virus (EBV)                        5                     0.05                      196
Hepatitis B virus (HBV)                         5                     0.05                      5493
Hepatitis C virus (HCV)                         5                     0.05                      3293
Human papillomavirus (25+ types)*               5                     0.25                      310
Human T-lymphotropic virus 1 (HTLV-1)           5                     0.05                      11
Kaposi's sarcoma-associated herpesvirus (KSHV)  5                     0.05                      54
Merkel cell polyomavirus (MCPyV)                5                     0.05                      13
*Classifications are HPV6, HPV11, HPV16, HPV18, HPV26, HPV31, HPV33, HPV35, HPV39, HPV40, HPV42, HPV43, HPV44, HPV45, HPV51, HPV52, HPV53, HPV54, HPV56, HPV58, HPV59, HPV61, HPV66, HPV68, HPV69, HPV70, HPV73, HPV82, Other HPV
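The detection rule can be sketched as follows, reading the first two values listed for each virus above as its read count threshold and k-mer fraction threshold. This is an illustration, not DRAGEN's code: the class and function names are hypothetical, and treating a value equal to the threshold as passing is an assumption, since the text does not say whether the comparison is inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_read_count: int
    min_kmer_fraction: float


# Values taken from the thresholds listed above, keyed by abbreviation
THRESHOLDS = {
    "EBV": Thresholds(5, 0.05),
    "HBV": Thresholds(5, 0.05),
    "HCV": Thresholds(5, 0.05),
    "HPV": Thresholds(5, 0.25),
    "HTLV-1": Thresholds(5, 0.05),
    "KSHV": Thresholds(5, 0.05),
    "MCPyV": Thresholds(5, 0.05),
}


def is_detected(read_count, ref_kmer_fractions, thresholds):
    """Apply the two-part rule: enough reads AND at least one passing reference.

    Assumption: ">=" is used for both comparisons; the real tool may use ">".
    """
    passes_reads = read_count >= thresholds.min_read_count
    passes_kmers = any(f >= thresholds.min_kmer_fraction for f in ref_kmer_fractions)
    return passes_reads and passes_kmers
```

For example, an HPV result with 10 classified reads is reported as detected only if at least one HPV reference reaches a k-mer fraction of 0.25, whereas the same read count for EBV needs only 0.05.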
Integration Site Detection
When the SV caller is enabled alongside oncovirus detection, DRAGEN can call sites where oncoviral sequences have integrated into the human genome and report them in the SV VCF output. For details on enabling and interpreting viral integration site detection, see Viral Integration Site Detection in the SV Calling documentation.
Command Line Arguments
Option                                    Type    Description                                                     Default
enable-oncovirus-detection                bool    Enables detection of oncoviruses                                false
oncovirus-detection-db                    string  Path to directory containing resource files                     empty string
oncovirus-detection-all-reads             bool    Enable to use all reads instead of just unmapped reads          false
oncovirus-detection-softclipped-reads     bool    Enable to keep softclipped reads in addition to unmapped reads  false
oncovirus-detection-below-threshold       bool    Enable to include below-threshold viruses in detections TSV     false
oncovirus-detection-enable-read-output*   bool    Enable to create an output file with per-read results           false
oncovirus-detection-num-threads           int     Number of threads to use for processing reads                   8
*Note that when --oncovirus-detection-enable-read-output=true, --oncovirus-detection-num-threads must be set to 1 to ensure the per-read output file is properly formed.
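When DRAGEN is launched from a wrapper script, the constraint in the note above can be checked before the run starts. The following is a minimal sketch with a hypothetical helper, build_oncovirus_args, that assembles only the oncovirus options listed in this section, omits options left at their documented defaults, and rejects per-read output combined with more than one thread. It does not invoke DRAGEN.

```python
def build_oncovirus_args(db_dir, read_output=False, num_threads=8, below_threshold=False):
    """Return a list of oncovirus-related DRAGEN arguments.

    Hypothetical helper: it only formats options documented in this section
    and enforces that per-read output is requested with exactly one thread.
    """
    if read_output and num_threads != 1:
        raise ValueError(
            "--oncovirus-detection-enable-read-output=true requires "
            "--oncovirus-detection-num-threads=1"
        )
    args = [
        "--enable-oncovirus-detection=true",
        f"--oncovirus-detection-db={db_dir}",
    ]
    if below_threshold:
        args.append("--oncovirus-detection-below-threshold=true")
    if read_output:
        args.append("--oncovirus-detection-enable-read-output=true")
    if num_threads != 8:  # 8 is the documented default
        args.append(f"--oncovirus-detection-num-threads={num_threads}")
    return args
```

The returned list can be appended to the remaining arguments (reference, FASTQ lists, output options) before the command is run.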