Somatic WGS Heme Tumor Only

DRAGEN Recipe - Somatic WGS Heme Tumor Only

Overview

This recipe processes whole genome sequencing (WGS) data for somatic heme tumor-only workflows.

Example Command Line

For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.

  • Configure the INPUT options

  • Configure the OUTPUT options

  • Configure MAP/ALIGN depending on whether realignment is desired

  • Configure the VARIANT CALLERs based on the application

  • Configure any additional options

  • Build up the necessary options for each component separately, so that they can be re-used in the final command line.

We recommend using a linear (non-pangenome) reference for somatic analysis. For more details, refer to DRAGEN Reference Support.

The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.

#!/bin/bash
set -euo pipefail

# Path to DRAGEN hashtable
DRAGEN_HASH_TABLE=<REF_DIR>

# Path to output directory for the DRAGEN run
OUTPUT=<OUT_DIR>

# File prefix for DRAGEN output files
PREFIX=<OUT_PREFIX>

# Population SNP VCF. It can be retrieved from catalogs of population variation
# such as the 1000 Genomes Project or other large cohort discovery efforts.
# Only high-frequency SNPs should be included. A suitable file can be retrieved
# from the GATK resource bundle: 1000G_phase1.snps.high_confidence.vcf.gz
CNV_POP_VCF=<POPULATION_VCF_PATH>

# Path to VC systematic noise BED file. In tumor-only variant calling, this filter
# is essential for removing systematic noise observed in normal samples. Prebuilt
# systematic noise files are available for download on the DRAGEN Software
# Support Site page:
# https://support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform/product_files.html
# Alternatively, running the somatic TO pipeline on normal samples can generate a
# systematic noise file. We recommend using a systematic noise file based on
# normal samples that match the library prep of the tumor samples.
VC_SYSTEMATIC_NOISE_FILE=<VC_SYSTEMATIC_NOISE_BED_FILE_PATH>

# The Nirvana annotation database is downloadable at 
# https://support-docs.illumina.com/SW/DRAGEN_v310/Content/SW/DRAGEN/IAE_DownloadData.htm
NIRVANA_ANNOTATION_FOLDER=<NIRVANA_ANNOTATION_FOLDER_PATH>

# Define the input source: fastq list, fastq, bam, or cram. Keep only the block
# you use, because set -u aborts on the unset variables the other blocks reference.
INPUT_FASTQ_LIST="
  --tumor-fastq-list $TUMOR_FASTQ_LIST \
  --tumor-fastq-list-sample-id $TUMOR_FASTQ_LIST_SAMPLE_ID \
"

INPUT_FASTQ="
  --tumor-fastq1 $TUMOR_FASTQ1 \
  --tumor-fastq2 $TUMOR_FASTQ2 \
  --RGSM-tumor $RGSM_TUMOR \
  --RGID-tumor $RGID_TUMOR \
"

INPUT_BAM="
  --tumor-bam-input $TUMOR_BAM \
"

INPUT_CRAM="
  --tumor-cram-input $TUMOR_CRAM \
"

# Select the input source; this example uses INPUT_FASTQ_LIST
INPUT_OPTIONS="
  --ref-dir $DRAGEN_HASH_TABLE \
  $INPUT_FASTQ_LIST \
"

OUTPUT_OPTIONS="
  --output-directory $OUTPUT \
  --output-file-prefix $PREFIX \
"

MA_OPTIONS="
  --enable-map-align true \
  --enable-sort true \
  --enable-duplicate-marking true \
"

CNV_OPTIONS="
  --heme-cnv true \
  --cnv-population-b-allele-vcf $CNV_POP_VCF \
"

QC_OPTIONS="
  --gc-metrics-enable=true \
"

# REF_TYPE is GRCh37 or GRCh38. The note lives outside the quoted string so that
# it is not passed to DRAGEN as arguments.
SNV_OPTIONS="
  --enable-variant-caller true \
  --vc-systematic-noise $VC_SYSTEMATIC_NOISE_FILE \
  --vc-enable-germline-tagging true \
  --enable-variant-annotation true \
  --variant-annotation-data $NIRVANA_ANNOTATION_FOLDER \
  --variant-annotation-assembly $REF_TYPE \
"

SV_OPTIONS="
  --heme-sv true \
  --sv-systematic-noise $SV_SYSTEMATIC_NOISE_BEDPE \
"

DUX4_OPTIONS="
  --enable-dux4-caller true \
"

SNV_SV_DEDUPLICATION_OPTIONS="
  --enable-variant-deduplication true \
"

# Construct final command line
CMD="
  dragen \
  $INPUT_OPTIONS \
  $OUTPUT_OPTIONS \
  $MA_OPTIONS \
  $QC_OPTIONS \
  $CNV_OPTIONS \
  $SNV_OPTIONS \
  $SV_OPTIONS \
  $DUX4_OPTIONS \
  $SNV_SV_DEDUPLICATION_OPTIONS \
"

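# Execute. $CMD is intentionally left unquoted: word splitting collapses the
# newlines embedded in the option strings, so paths must not contain spaces.
echo $CMD
$CMD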
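The template above stores each component as a string and relies on word splitting when the command runs. As an alternative, here is a minimal sketch that keeps each component in a bash array, so options stay separate words even if a path contains spaces. It uses only options that already appear in the template. The placeholder paths and the `DRY_RUN` switch are hypothetical and not part of DRAGEN.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical placeholder values; override them via the environment.
DRAGEN_HASH_TABLE="${DRAGEN_HASH_TABLE:-/path/to/ref_dir}"
OUTPUT="${OUTPUT:-/path/to/out_dir}"
PREFIX="${PREFIX:-sample1}"
TUMOR_FASTQ_LIST="${TUMOR_FASTQ_LIST:-/path/to/fastq_list.csv}"
TUMOR_FASTQ_LIST_SAMPLE_ID="${TUMOR_FASTQ_LIST_SAMPLE_ID:-sample1}"

# One array per component: each option and value stays a separate word.
INPUT_OPTIONS=(
  --ref-dir "$DRAGEN_HASH_TABLE"
  --tumor-fastq-list "$TUMOR_FASTQ_LIST"
  --tumor-fastq-list-sample-id "$TUMOR_FASTQ_LIST_SAMPLE_ID"
)
OUTPUT_OPTIONS=(
  --output-directory "$OUTPUT"
  --output-file-prefix "$PREFIX"
)
MA_OPTIONS=(
  --enable-map-align true
  --enable-sort true
  --enable-duplicate-marking true
)

# Concatenate the components; further arrays (CNV, SNV, SV) slot in the same way.
CMD=(dragen "${INPUT_OPTIONS[@]}" "${OUTPUT_OPTIONS[@]}" "${MA_OPTIONS[@]}")

# Print a shell-quoted copy of the command for the log.
printf '%q ' "${CMD[@]}"
echo

# DRY_RUN is a hypothetical switch: the command only runs when DRY_RUN=0.
if [ "${DRY_RUN:-1}" = "0" ]; then
  "${CMD[@]}"
fi
```

By default the sketch only prints the assembled command, which makes it easy to review before launching a run with `DRY_RUN=0`.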

Additional Notes and Options

Optional settings per component are listed below. Full option list at this page.

CNV

| Option | Description |
| --- | --- |
| --heme-cnv true | Configures DRAGEN to use CNV settings for Liquid Tumors (e.g., AML/MLL). |

SNV

| Option | Description |
| --- | --- |
| --vc-sq-filter-threshold $THRESHOLD | Threshold for the sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity. |
| --vc-systematic-noise $SYSTEMATIC_NOISE_FILE | Systematic noise filter. In tumor-only variant calling, this filter is essential for removing systematic noise observed in normal samples. Prebuilt systematic noise files are available for download on the DRAGEN Software Support Site page. Alternatively, a systematic noise file can be generated by running the somatic TO pipeline on normal samples. We recommend using a systematic noise file based on normal samples that match the library prep of the tumor samples. |
| --vc-somatic-hotspots somatic_hotspots_GRCh38.vcf.gz | Hotspots file. By default, DRAGEN treats positions in the COSMIC database as hotspots, assigning an increased prior probability to variants at these positions. Use this option to override the default with a custom hotspot file if a list of positions of interest is available. |
| --vc-combine-phased-variants-distance $DIST | Combining phased variants. By default, DRAGEN does not combine nearby phased calls into MNVs or indels. To combine such calls, set this parameter to a value greater than zero, indicating the maximum distance at which calls should be combined. When enabling this feature, the recommended distance is 15 base pairs. The valid range is [0, 15]. |
| --vc-enable-germline-tagging true --enable-variant-annotation true --variant-annotation-data $NIRVANA_ANNOTATION_FOLDER --variant-annotation-assembly $REFERENCE | Germline filtering. Enable to tag variants as germline or somatic based on population databases. $REFERENCE can be GRCh37 or GRCh38 (GRCh37 is compatible with hs37d5 and hg19). The Nirvana annotation database is downloadable at this page. |
| --vc-target-vaf FLOAT | Available starting in v4.2. Selects the variant allele frequencies of interest: the variant caller aims to detect variants with allele frequencies equal to or larger than this setting. The setting does not apply a hard filter, so variants with allele frequencies below the selected threshold can still be detected. On high-coverage, clean datasets, a lower target VAF may help increase sensitivity. On noisy samples (such as FFPE), a higher target VAF may help reduce false positives. Using a low target VAF may also increase runtime. The valid range is [0, 1]. The default is 0.03 (or 0.001 when --vc-enable-umi-liquid=true). |
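Two of the optional SNV settings have documented valid ranges: [0, 1] for --vc-target-vaf and [0, 15] for --vc-combine-phased-variants-distance. Here is a minimal sketch of checking both before appending them to the SNV options. The helper `in_range` and the variables `VC_TARGET_VAF`, `COMBINE_DIST`, and `SNV_EXTRA_OPTIONS` are hypothetical names introduced for illustration. The defaults used are the documented 0.03 default and the recommended distance of 15.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: succeed when $1 is a non-negative number within [$2, $3].
in_range() {
  awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN {
    if (v !~ /^[0-9]*\.?[0-9]+$/) exit 1    # reject non-numeric input
    exit !(v + 0 >= lo + 0 && v + 0 <= hi + 0)
  }'
}

# Hypothetical variable names; the values can be overridden via the environment.
VC_TARGET_VAF="${VC_TARGET_VAF:-0.03}"
COMBINE_DIST="${COMBINE_DIST:-15}"

in_range "$VC_TARGET_VAF" 0 1 || {
  echo "--vc-target-vaf must be in [0, 1]: $VC_TARGET_VAF" >&2
  exit 1
}
in_range "$COMBINE_DIST" 0 15 || {
  echo "--vc-combine-phased-variants-distance must be in [0, 15]: $COMBINE_DIST" >&2
  exit 1
}

# Extra options to append to the SNV component of the command line.
SNV_EXTRA_OPTIONS="--vc-target-vaf $VC_TARGET_VAF --vc-combine-phased-variants-distance $COMBINE_DIST"
echo "$SNV_EXTRA_OPTIONS"
```

The check only covers the numeric range. It does not verify that the distance is a whole number or that the installed DRAGEN version supports --vc-target-vaf.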

SNV systematic noise file

Generic SNV noise files (including a heme-specific WGS noise file) can be downloaded from the DRAGEN Software Support Site page.

When possible, we recommend building a pipeline-specific systematic noise file that matches the library prep and sequencer of interest:

Step 1. Run DRAGEN somatic tumor-only on each of approximately 50 normal samples:

### Choose the input from either i) or ii) and keep only one INPUT assignment
### i) BAM
INPUT="--tumor-bam-input ${NORMAL_BAM}"
### ii) FASTQs (uncomment to use instead of the BAM input)
# INPUT="--tumor-fastq-list ${NORMAL_FASTQ_LIST} \
#   --tumor-fastq-list-sample-id ${NORMAL_FASTQ_LIST_SAMPLE_ID}"
###

# REF_TYPE is GRCh37 or GRCh38
dragen \
-r ${REFERENCE} \
${INPUT} \
--enable-variant-caller true \
--vc-detect-systematic-noise true \
--build-sys-noise-germline-vaf-threshold=1 \
--vc-enable-germline-tagging true \
--enable-variant-annotation true \
--variant-annotation-data ${NIRVANA_ANNOTATION_FOLDER} \
--variant-annotation-assembly ${REF_TYPE} \
--intermediate-results-dir ${TMP} \
--output-directory ${DIR} \
--output-file-prefix ${PREFIX}
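Step 1 has to be repeated for every normal sample. Here is a minimal sketch of generating one command per sample, assuming BAM inputs listed one path per line in a text file. The function `print_noise_cmds`, the placeholder paths, and the choice to derive the output prefix from the BAM file name are assumptions for illustration. The function only prints the commands so they can be reviewed or handed to a scheduler.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical placeholders; override them via the environment.
REFERENCE="${REFERENCE:-/path/to/ref_dir}"
NIRVANA_ANNOTATION_FOLDER="${NIRVANA_ANNOTATION_FOLDER:-/path/to/nirvana}"
REF_TYPE="${REF_TYPE:-GRCh38}"    # GRCh37 or GRCh38
TMP="${TMP:-/path/to/tmp}"
DIR="${DIR:-/path/to/noise_out}"

# Print one step 1 command per normal BAM listed in file $1 (one path per line).
print_noise_cmds() {
  local bam prefix
  while IFS= read -r bam || [ -n "$bam" ]; do
    [ -n "$bam" ] || continue            # skip blank lines
    prefix="$(basename "$bam" .bam)"     # output prefix derived from the BAM name
    echo "dragen -r $REFERENCE --tumor-bam-input $bam" \
      "--enable-variant-caller true --vc-detect-systematic-noise true" \
      "--build-sys-noise-germline-vaf-threshold=1" \
      "--vc-enable-germline-tagging true --enable-variant-annotation true" \
      "--variant-annotation-data $NIRVANA_ANNOTATION_FOLDER" \
      "--variant-annotation-assembly $REF_TYPE" \
      "--intermediate-results-dir $TMP" \
      "--output-directory $DIR --output-file-prefix $prefix"
  done < "$1"
}

# Usage (normals.txt is a hypothetical list of normal BAM paths):
#   print_noise_cmds normals.txt
```

Because the prefix comes from the BAM file name, two BAMs with the same base name in different directories would collide; rename or adapt the prefix logic in that case.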

Gather the full paths to the VCFs from step 1 in ${VCF_LIST}, one file per line.
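The list file can be assembled with standard tools. Here is a minimal sketch, assuming all step 1 outputs sit under one directory. The function `make_vcf_list` is a hypothetical helper, and the file name pattern passed to it is an assumption that must be adjusted to match the VCFs your step 1 runs actually produced.

```shell
#!/bin/bash

# Hypothetical helper: write the absolute path of every file under directory $1
# whose name matches the glob pattern $2 to the list file $3, one path per line.
make_vcf_list() {
  local dir pattern out
  dir="$(cd "$1" && pwd)" || return 1    # absolute path, so find prints full paths
  pattern="$2"
  out="$3"
  find "$dir" -type f -name "$pattern" | sort > "$out"
  # Fail loudly when nothing matched, instead of handing DRAGEN an empty list.
  [ -s "$out" ] || { echo "no files matching $pattern under $dir" >&2; return 1; }
}

# Usage (the pattern '*.vcf.gz' is an assumption; adjust it as needed):
#   make_vcf_list "${DIR}" '*.vcf.gz' "${VCF_LIST}"
```

Sorting keeps the list stable between runs, which makes it easier to compare noise files built from the same set of normals.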

Step 2. Generate the final noise file with:

dragen \
-r ${REF_DIR} \
--build-sys-noise-vcfs-list ${VCF_LIST} \
--build-sys-noise-method max \
--intermediate-results-dir ${TMP} \
--output-directory ${DIR} \
--output-file-prefix ${PREFIX}

SV

| Option | Description |
| --- | --- |
| --sv-systematic-noise $SV_SYSTEMATIC_NOISE_BEDPE | Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). For more information, see Systematic Noise Filtering. |
| --heme-sv true | Configures DRAGEN to use SV settings for Liquid Tumors (e.g., AML/MLL). |
| --sv-min-scored-variant-size $MIN_SCORED_VAR_SIZE | 100000 |
| --sv-somatic-ins-tandup-hotspot-regions-bed $BED_FILE | Specifies a BED file of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with 300 bp of padding on both sides. |

To build the SV systematic noise file

You can generate systematic noise BEDPE files from normal samples collected using the same library prep, sequencing system, and panels as the tumor samples.

To generate a BEDPE file:

  1. Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate one VCF output per normal sample.

  2. Build the BEDPE file from those VCFs by passing --sv-build-systematic-noise-vcfs-list a file that lists the input VCFs from the previous step, one VCF per line. An example command line is provided below.

dragen \
-r <HASHTABLE> \
--sv-build-systematic-noise-vcfs-list <LIST OF VCF FILES> \
--output-directory <OUTPUT> \
--output-file-prefix <SAMPLE>
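Before launching the build, it can help to confirm that every entry in the VCF list points at a real file. Here is a minimal sketch; `check_vcf_list` is a hypothetical helper and not a DRAGEN feature.

```shell
#!/bin/bash

# Hypothetical helper: verify that every non-empty line of list file $1 names an
# existing, readable file. Prints the number of valid entries on success.
check_vcf_list() {
  local list="$1" vcf count=0
  [ -r "$list" ] || { echo "cannot read list: $list" >&2; return 1; }
  while IFS= read -r vcf || [ -n "$vcf" ]; do
    [ -n "$vcf" ] || continue            # ignore blank lines
    if [ ! -r "$vcf" ]; then
      echo "missing or unreadable VCF: $vcf" >&2
      return 1
    fi
    count=$((count + 1))
  done < "$list"
  [ "$count" -gt 0 ] || { echo "list is empty: $list" >&2; return 1; }
  echo "$count"
}

# Usage (sv_vcf_list.txt is a hypothetical file name):
#   check_vcf_list sv_vcf_list.txt && echo "list is ready for the build step"
```

The check stops at the first missing file, so a mistyped path surfaces before any compute time is spent on the build.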

You can also build systematic noise BEDPE files in the cloud using the DRAGEN Baseline Builder App on BaseSpace.

Pre-built SV systematic noise file

The following prebuilt systematic noise files for WGS are available for download on the DRAGEN Software Support Site page. To generate these noise files, we used 46 unrelated normal samples.

| Pre-built Systematic Noise File | Comment | Systematic Noise Version | DRAGEN Compatibility |
| --- | --- | --- | --- |
| IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz | >200x coverage with 2x150bp reads for the HG38 reference | 3.0.0 | 4.3.* |

SNV-SV deduplication

We recommend using --enable-variant-deduplication true to filter out small indels in the structural variant VCF that also appear, and pass, in the small variant VCF (PASS in the FILTER column of the small variant VCF file). With this feature, DRAGEN creates a new VCF containing the variants in the SV VCF that do not match a variant in the SNV VCF file. The deduplicated SV VCF file is named with the prefix passed by --output-file-prefix followed by sv.small_indel_dedup.vcf.gz. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases. This feature is useful when running both the SV and SNV callers in somatic workflows: using both callers can increase sensitivity, and deduplication prevents duplicated variants within genes such as FLT3 and KMT2A.
