1 of 100

DRAGEN

Overview

Illumina® DRAGEN™ Secondary Analysis

Illumina DRAGEN (Dynamic Read Analysis for GENomics) secondary analysis was developed to address important challenges associated with analyzing NGS (Next Generation Sequencing) data for a range of applications, including genome, exome, transcriptome, and methylome studies. DRAGEN secondary analysis processes NGS data and enables tertiary analysis to drive insights. The available tools make up a highly accurate, comprehensive, and efficient solution that enables labs of all sizes and disciplines to do more with their genomic data.

Product highlights

Accurate results:

Pangenome reference genome and machine learning drive unprecedented accuracy
99.89% accuracy score with the Precision FDA Truth Challenge V2 benchmark data (2,3)

Comprehensive platform:

Analyze NGS data from whole genomes, exomes, methylomes, and transcriptomes
Available on platform of choice and scalable based on needs

Efficient analysis:

Process a 34x genome in ~ 30 minutes, with all supported callers with DRAGEN server v4 (1)
Reduce FASTQ file sizes up to 5x with DRAGEN ORA Compression

References:

Illumina data on file, 2022.
Illumina DRAGEN Secondary Analysis is the first single platform to achieve 99.89% accuracy based on PrecisionFDA v2 Truth Challenge Benchmark Data. Details here DRAGEN sets new standard for data accuracy in PrecisionFDA benchmark data. Accessed March 22, 2023
PrecisionFDA Truth Challenge V2: Calling Variants from Short and Long Reads in Difficult-to-Map Regions. precision.fda.gov/challenges/10. Accessed November 3, 2020.

DRAGEN Applications

Applications

DRAGEN analysis offers a large selection of application pipelines.

Pipeline

Description

Variant Types Detected

Metrics Provided

DRAGEN Demultiplexing

Rapid demultiplexing of NGS analysis

N/A

DRAGEN ORA Compression

DRAGEN ORA compression is optimized for high compression ratios of FASTQ files, as well as rapid compression and decompression, all while preserving data integrity.

N/A

Compression Ratio Run Time

DRAGEN Map + Align

The DRAGEN Map + Align can be run as a standalone or as part of DRAGEN’s suite of pipelines

N/A

Mapping metrics Duration Metrics Coverage Metrics

DRAGEN Germline

The DRAGEN Germline Pipeline provides end-to-end NGS analysis, including advanced error model calibration for increased accuracy, and repeat expansion detection and genotyping through Illumina Expansion Hunter.

SNV/Indel CNV SV Repeat Expansions

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Somatic

The DRAGEN Somatic Pipeline includes tumor-only and tumor–normal modes, designed for detecting somatic variants in tumor samples. Both modes make no ploidy assumptions, enabling detection of low-frequency alleles.

SNV/Indel CNV SV TMB MSI HLA

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Enrichment

The DRAGEN Enrichment Pipeline combines DRAGEN’s germline and somatic callers into a pipeline designed specifically for analyzing enrichment samples. Includes a full suite of enrichment metrics and reporting.

SNV/Indel CNV SV

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN RNA

The DRAGEN RNA Pipeline performs transcriptome analysis starting with splice junction discovery and alignment, followed by rapid alignment and splice junction mapping and quantification. For differential expression, Illumina recommends the DRAGEN Differential Expression app on BaseSpace Sequence Hub.

Gene fusion SNV/Indel

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Single Cell RNA

The DRAGEN Single Cell RNA pipeline performs demultiplexing, cell-barcode and UMI error correction, sequence alignment, and quantification of gene expression.

N/A

Mapping Metrics Duration Metrics Coverage Metrics Callability Report Cell Metrics

DRAGEN Joint Genotyping

The DRAGEN Joint Genotyping/Population Pipeline calls variants jointly across multiple genomes and scales to large cohorts of samples at expedited speeds with uncompromising accuracy.

SNV/Indel CNV SV Repeat Expansions

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Methylation

The DRAGEN Methylation Pipeline performs alignment, methyl calling, and calculates alignment and methylation metrics.

N/A

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Reference Builder

Accepts FASTA files, and builds the proprietary reference used by the DRAGEN apps.

N/A

DRAGEN TruSight Oncology 500 ctDNA Analysis Software

Secondary analysis support for Illumina’s TruSight Oncology 500 ctDNA. Available on the local DRAGEN Server version 3 and later.

SNV/Indel CNV DNA fusions MSI TMB

Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report

DRAGEN Imputation

The DRAGEN Imputation pipeline is an end to end user friendly tool that enables scalable low pass whole genome sequencing analysis

N/A

Impute ≤100 samples simultaneously 1.7x faster compared to original GLIMPSE code

Analysis Uses

DRAGEN analysis can be used in numerous fields in the biological sciences.

Analysis

Description

Genetic Diseases

Reduce time required for genomic analysis, with high accuracy and comprehensiveness

Oncology

Analyze tumor-only and tumor/normal samples with accuracy, comprehensiveness, and efficiency

Cell and Molecular Biology

Advance understanding of cellular mechanisms with rapid analysis pipelines for bulk and single cell samples

Population Genomics

Accurately and efficiently analyze sequenced genomes at scale. Accelerate re-analysis as computational tools improve over time

Infectious Disease

Detect and characterize infectious diseases with a comprehensive solution

Agrigenomics

Efficiently analyze animals and plants of varying genomic complexities with custom reference

Deployment Options

DRAGEN analysis is available on multiple platforms.

Platform

Description

DRAGEN on-premises server

DRAGEN on-premises server offers highly accurate secondary analysis in a fraction of time compared with a traditional CPU-based system. - Analyze and store data locally - Supports varying levels of command line interface - Replace up to 30 traditional compute instances - Fully process a 34× whole human genome in ~30 minutes. (1) - One unit supports two NovaSeq 6000 Systems running at full capacity

DRAGEN analysis on Illumina Connected Analytics

Couples the accuracy and speed of the DRAGEN with the ability to customize analysis pipeline to operationalize informatics on a secure platform.

DRAGEN on BaseSpace Sequence Hub (BSSH)

Push button analysis capability in an intuitive, easy-to-use interface with compliance, and storage features of BaseSpace Sequence Hub and Amazon Web Services (AWS).

DRAGEN onboard NovaSeq X Series

- Flexibly runs multiple secondary analysis pipelines in parallel. - Performs up to four simultaneous applications per flow cell in a single run. - Brings up to 5x lossless data compression, and analysis with supported applications - Provides savings on analysis, which over five years can exceed the price of the sequencer

DRAGEN onboard NextSeq 1000 and NextSeq 2000 Systems

- Provides access to select DRAGEN analysis informatics pipelines - Enables users to generate results in as little as two hours - Uses intuitive pipeline algorithms to reduce reliance on external informatics experts

DRAGEN onboard MiSeq i100 Series

Intuitive, ultra-rapid analysis including DRAGEN BCL convert, DRAGEN Library QC, DRAGEN small WGS and DRAGEN Microbial Enrichment Plus. - Rapid results with comprehensive secondary analysis generated in two hours or less (2) - Highly efficient workflow with a single user touchpoint to VCF and/or html report and no intermediate file transfers - Exceptionally easy with an intuitive interface for non-expert users

DRAGEN on AWS, Azure

DRAGEN supports the FPGA enabled instance types of AWS, Azure. Rpm installers and the Kernel driver can be installed on images managed by the user, and DRAGEN can be run by purchasing a license.

DRAGEN on AWS and Azure Marketplace

Pre-configured Amazon Machine Images (AMI) and Azure Virtual Machines with DRAGEN installed can be accessed from the respective marketplace offerings in a Pay-As-You-Use model.

DRAGEN on GCP

DRAGEN is made available on the Google Cloud Platform. Pre-configured instances with DRAGEN installed can be accessed through the GCP application interface. Limited availability. Please reach out to your Illumina representative for access.

(1) HG002 from PrecisionFDA truth challenge V2 run with DRAGEN analysis v4.0 on DRAGEN server v4, all callers

(2) When run according to sample recommendations

Product Guides

DRAGEN v4.3

Getting Started

DRAGEN provides tests you can run to make sure that your DRAGEN system is properly installed and configured. Before running the tests, make sure that the DRAGEN server has adequate power and cooling, and is connected to a network that is fast enough to move your data to and from the machine with adequate performance.

Please refer to the Server Site Prep & Installation Guide when installing a new system.

On-premises Installation

The software can be installed on an on-premises server by executing the .run installer for the desired version. Installers are made available for all releases at the DRAGEN Software Support Site page.

Installation procedure:

Download the desired installer from the support website and unzip the package
The archive integrity can be checked using: ./<dragen .run file> --check
Install the appropriate release based on your Linux OS with the command: sudo sh <dragen .run file>

The .run file includes a script that administers un-installation of an existing software, integrity checking of the package and files, installation of the new DRAGEN software version. The DRAGEN software is installed in part by use of the Linux RPM Package Manager (rpm). Several rpm packages comprise the installation of a single DRAGEN software version. The RPM packages also configure the system for dragen, like raised user ulimits, and the .run script starts services needed for functionality, such as the Licensing daemon dragen_licd, and the hugepages daemon, dragend_hp.

NOTE: Root privileges are required for the installation.

Single Version Installation

Up to DRAGEN Software v4.2, only one version of the DRAGEN software can be installed at a time. Executing the .run file will remove any existing installed version and (re)install the new version.

After installation, the application and associated files are available at /opt/edico.

The single version installer will add /opt/edico to the Linux $PATH, so that the user can just call dragen without specifying the full path.

Multi-Version Installation

Starting with DRAGEN Software v4.3 and later, multiple compatible versions of the DRAGEN software can be installed at a time. Executing the .run file will add the new version to the system.

After installation, the application files are available at /opt/dragen/{version} and FPGA files are located at /opt/bitstream/{bitstream version}.

The multi-version installer will NOT add /opt/dragen/{version} to the Linux $PATH, since multiple versions can be present at a given time. User should manage the desired paths to the specific version they want to run. When this guide provides command line examples, it will assume that the Linux $PATH is set to correct dragen version, and we will just refer to dragen <options>

Notes on multi-version installation:

Installers released for DRAGEN v4.2 and earlier are single version packages
Single version packages and multi-version packages can not be mixed
- Installation of a prior single version package will remove all the multi-version packages
- Installation of a multi-version package will remove any installed single version package
After installing a multi-version package, see a list of installed versions at any time by running /usr/bin/dragen_versions
To remove any multi-version package, call yum remove on its Path

Example:

$ dragen_versions
The output format of this command may change. Use --json for machine readable output.

Dragen Version           Size (MB)  Install Date         Path
4.3.2                    1378.03    2024-03-10 18:26:17  /opt/dragen/4.3.2
4.4.3                    1381.41    2024-03-18 20:56:39  /opt/dragen/4.4.3
4.3.5                    1379.25    2024-03-11 15:20:24  /opt/dragen/4.3.5

Bitstream Version        Size (MB)  Install Date         Path
07.031.732 (0x18101306)  598.95     2024-03-10 18:26:03  /opt/bitstream/07.031.732
07.031.745 (0x18101306)  598.95     2024-03-18 20:56:18  /opt/bitstream/07.031.745
 
To remove a dragen version, call `yum remove` on its Path.

Location of `dragen` and resource files

DRAGEN Version

on-premises server

cloud instance

4.3 and later

/opt/dragen/{version}

/opt/edico/

4.2 and earlier

/opt/edico/

Throughout this guide we will refer to <INSTALL_PATH> which will be either of the locations above

Running the System Check

After turning on the server, you can make sure that your DRAGEN server is functioning properly by running <INSTALL_PATH>/self_test/self_test.sh, which does the following:

Automatically indexes chromosome M from the hg19 reference genome
Loads the reference genome and index
Maps and aligns a set of reads
Saves the aligned reads in a BAM file
Asserts that the alignments exactly match the expected results

Each server ships with the test input FASTQ data for this script, which is located in <INSTALL_PATH>/self_test. The system check takes approximately 25--30 minutes.

The following example shows how to run the script and shows the output from a successful test.

$ /opt/dragen/4.3.4/self_test/self_test.sh
#############################################################
Logging to /var/log/dragen/self_test.1714627157_160164.0.details.log
Using dragen executables in /opt/dragen/4.3.4/bin
Using board(s): 0 
#############################################################
Running tests for board 0 (u200)
Using scratch directory /tmp/self_test.4BO0pfPST9/0
-------------------------------------------------------------
Board 0 test 1, FPGA MEMORY TEST
Loading DIAG bitstream
Running fpga memory test, this will take ~13 minutes
Board 0 test 1, FPGA MEMORY TEST: PASS
-------------------------------------------------------------
Board 0 test 2, BAR REGISTER ACCESS
Board 0 test 2, BAR REGISTER ACCESS: PASS
-------------------------------------------------------------
Board 0 test 3, FPGA TEMP REG ACCESS
FPGA Temperature: 27C  (Max Temp: 36C, Min Temp: 22C)
Board 0 test 3, FPGA TEMP REG ACCESS: PASS
-------------------------------------------------------------
Board 0 test 4, BOARD SERIAL # REG ACCESS
Serial Number: 2130069BM05V
Board 0 test 4, BOARD SERIAL # REG ACCESS: PASS
-------------------------------------------------------------
Board 0 test 5, DRAGEN GENOME LICENSE
Board 0 test 5, DRAGEN GENOME LICENSE: PASS
-------------------------------------------------------------
Board 0 test 6, CPLD DATE TEST
cpld date is n/a
Board 0 test 6, CPLD DATE TEST: PASS
-------------------------------------------------------------
Board 0 test 7, ENCRYPTION KEY EXISTENCE TEST
Board 0 test 7, ENCRYPTION KEY EXISTENCE TEST: PASS
-------------------------------------------------------------
Board 0 test 8, PARTIAL RECONFIGURATION
DNA-MAPPER: ok
RNA-MAPPER: ok
HMM: ok
ZIP: ok
UNZIP: ok
DIAG: ok
Board 0 test 8, PARTIAL RECONFIGURATION: PASS
-------------------------------------------------------------
Board 0 test 9, HASH TABLE GENERATION
Board 0 test 9, HASH TABLE GENERATION: PASS
-------------------------------------------------------------
Board 0 test 10, MAP AND ALIGNER
running mapper aligner: ok
unmapped input records percentages: ok
md5sum check dbam sorted: pass
Board 0 test 10, MAP AND ALIGNER: PASS
-------------------------------------------------------------
Board 0 test 11, VARIANT CALLER E2E
running variant caller: ok
md5sum check dbam sorted: ok
md5sum check VCF: ok
Board 0 test 11, VARIANT CALLER E2E: PASS
#############################################################
SELF TEST COMPLETED
SELF TEST RESULT : PASS
#############################################################
Log file at /var/log/dragen/self_test.1714627157_160164.0.details.log

If the output BAM file does not match expected results, then the last line of the above text is as follows:

SELF TEST RESULT : FAIL

If you experience a FAIL result after running this test script immediately after turning on your DRAGEN server, contact Illumina Technical Support.

Running Your Own Test

When you are satisfied that your DRAGEN system is performing as expected, you are ready to run some of your own data through the machine, as follows:

Load the reference table for the reference genome
Determine location of input and output files
Process input data

Loading the Reference Genome

Before a reference genome can be used with DRAGEN, it must be converted from FASTA format into a custom binary format for use with the DRAGEN hardware. For more information, see Prepare a Reference Genome.

The reference hash table specified on the command line is automatically loaded onto the board the first time you process data with a pipeline. You can manually load the hash table for your reference genome by using the following command:

dragen -r <reference_hash-table_directory>

Make sure that the reference hash table directory is on the fast file IO drive.

The default location for the hash table for hg19 is as follows.

/staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The command to load reference genome hg19 from the default location is as follows.

dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

This command loads the binary reference genome into memory on the DRAGEN board, where it is used for processing any number of input data sets. You do not need to reload the reference genome unless you restart the system or need to switch to a different reference genome. It can take up to a minute to load a reference genome.

DRAGEN checks whether the specified reference genome is already resident on the board. If it is, then the upload of the reference genome is automatically skipped. You can force reloading of the same reference genome using the force-load-reference (-l) command line option.

The command to load the reference genome prints the software and hardware versions to standard output. For example:

DRAGEN Host Software Version 01.001.035.01.00.30.6682 and

Bio-IT Processor Version 0x1001036

After the reference genome has been loaded, the following message is printed to standard output:

DRAGEN finished normally

Determine Input and Output File Locations

The DRAGEN Pipeline is very fast, which requires careful planning for the locations of the input and output files. If the input or output files are on a slow file system, then the overall performance of the system is limited by the throughput of that file system. It is recommended that inputs and outputs are streamed directly from/to a mounted external storage system.

The DRAGEN system is preconfigured with at least one fast file system consisting of a set of fast SSD disks grouped with RAID-0 for performance. This file system is mounted at /staging. This name was chosen to emphasize the fact that this area was built to be large and fast, but is not redundant. Failure of any of the file system's constituent disks leads to the loss of all data stored there.

During processing, DRAGEN generates and reads back temporary files. With DRAGEN, it is highly recommended to always direct temporary files to the fast SSD (or /staging) by using the --intermediate-results-dir option. If the --intermediate-results-dir option is not provided, temporary files are written to the --output-directory. DRAGEN recommends streaming inputs and outputs using an mounted external storage system.

Process Your Input Data

To analyze FASTQ data, use the dragen command. For example, the following command can be used to analyze a single-ended FASTQ file:

dragen \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
-1 /staging/test/data/SRA056922.fastq \
--output-directory /staging/test/output \
--output-file-prefix SRA056922_dragen \
--RGID DRAGEN_RGID \
--RGSM DRAGEN_RGSM

For detailed information on the command line options, see DRAGEN Host Software.

For recommended command lines in typical use cases, see DRAGEN Recipes.

DRAGEN Host Software

You use the DRAGEN host software program dragen to build and load reference genomes, and then to analyze sequencing data by decompressing the data, mapping, aligning, sorting, duplicate marking with optional removal, and variant calling.

Invoke the software using the dragen command. The command line options are described in the following sections.

Command line options can also be set in a configuration file. For more information on configuration files, see Configuration Files . If an option is set in the configuration file and is also specified on the command-line, the command line option overrides the configuration file.

Command-line Options

The following are examples of frequently used command lines:

Build Reference/Hash Table

dragen --build-hash-table true --ht-reference <REF_FASTA> \
--output-directory <REF_DIRECTORY>  [options]

Run Map/Align and Variant Caller (*.fastq to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
[-2 <FASTQ2>] --RGID <RG0> --RGSM <SM0> --enable-variant-caller true

Run Map/Align (*.fastq to *.bam)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] \
-1 <FASTQ1> [-2 <FASTQ2>]  \
--RGID <RG0> --RGSM

Run Variant Caller Only (*.bam to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
--enable-map-align false \
--enable-variant-caller true

Re-map and Run Variant Caller (*.bam to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
--enable-map-align true \
--enable-variant-caller true

Run BCL Converter (BCL to *.fastq)

dragen --bcl-conversion-only true --bcl-input-directory <BCL_DIRECTORY> \
--output-directory <OUT_DIRECTORY>

Run RNA Map/Align (*.fastq to *.bam)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
[-2 <FASTQ2>] --enable-rna true

For recommended command lines in typical use cases, see DRAGEN Recipes.

Reference Genome Options

Before you can use the DRAGEN system for aligning reads, you must load a reference genome and its associated hash tables onto the PCIe card. For information on preprocessing a reference genome's FASTA files into the native DRAGEN binary reference and hash table formats, see Prepare a Reference Genome. You must also specify the directory containing the preprocessed binary reference and hash tables with the -r [or --ref-dir] option. This argument is always required.

Use the following command to load the reference genome and hash tables to DRAGEN card memory separately from processing reads.

dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

Use the -l (--force-load-reference) option to force the reference genome to load even if it is already loaded.

dragen -l -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The time needed to load the reference genome depends on the size of the reference, but for typical recommended settings, it takes approximately 30--60 seconds.

Operating Modes

DRAGEN has two primary modes of operation, as follows:

Mapper/aligner
Variant caller

DRAGEN is capable of performing each mode independently or as an end-to-end solution. DRAGEN also allows you to enable and disable decompression, sorting, duplicate marking, and compression along the DRAGEN pipeline.

Full pipeline mode To execute full pipeline mode, set --enable-variant-caller to true and provide input as unmapped reads in *.fastq, *.bam, or *.cram formats. DRAGEN performs decompression, mapping, aligning, sorting, and optional duplicate marking and feeds directly into the variant caller to produce a VCF file. In this mode, DRAGEN uses parallel stages throughout the pipeline to drastically reduce the overall run time.
Map/align mode Map/align mode is enabled by default. Input is unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN produces an aligned and sorted BAM or CRAM file. To mark duplicate reads at the same time, set ‑-enable‑duplicate‑marking to true.
Variant caller mode To execute variant caller mode, set the --enable-variant-caller option to true, and set --enable-map-align option to false. The input must be a mapped and aligned BAM/CRAM file. DRAGEN produces a VCF file. DRAGEN will force-enable re-sorting of the BAM, because a number of read statistics and estimates are required for the Variant Caller to operate effectively. Setting --enable-sort to false will be overridden. BAM files cannot be duplicate marked in the DRAGEN pipeline prior to variant calling if they have not already been marked. Use the end-to-end mode of operation to take advantage of the mark-duplicates feature.
RNA-Seq data To enable processing of RNA-Seq--based data, set --enable-rna to true. DRAGEN uses the RNA spliced aligner during the mapper/aligner stage. DRAGEN dynamically switches between the required modes of operation..
Bisulfite MethylSeq data To enable processing of Bisulfite MethylSeq data, set the --enable-methylation-calling option to true. DRAGEN automates the processing of data for Lister (directional) and Cokus (nondirectional) protocols to generate a single BAM with bismark-compatible tags. Alternatively, you can run DRAGEN in a mode that produces a separate BAM file for each combination of the C->T and G->A converted reads and references. To enable this mode of processing, you need to build a set of reference hash tables with --ht-methylated enabled, and run DRAGEN with the appropriate ‑‑methylation-protocol setting.

Output Options

The following command line options for output are mandatory:

--output-directory <out_dir>—Specifies the output directory for generated files.
--output-file-prefix <out_prefix>-Specifies the output file prefix. DRAGEN appends the appropriate file extension onto this prefix for each generated file.
-r [--ref-dir ]—Specifies the reference hash table.

The following examples do not include these mandatory options.

For mapping and aligning, the output is sorted and compressed into BAM format by default before saving to disk. The user can control the output format from the map/align stage with the --output-format <SAM|BAM|CRAM> option. If the output file exists, the software issues a warning and exits. To force overwrite if the output file already exists, use the -f [ --force ] option.

For example, the following commands output to a compressed BAM file, and then forces overwrite:

dragen ... -f

dragen ... -f --output-format bam

To generate a BAI-format BAM index file (*.bai), set --enable-bam-indexing to true.

The following example outputs to a SAM file, and then forces overwrite:

dragen ... -f --output-format sam

The following example outputs to a CRAM file, and then forces overwrite:

dragen ... -f --output-format cram

DRAGEN only outputs lossless CRAM files. All QNAMEs and BAM tags are preserved in the CRAM.

Alignment tags

DRAGEN can generate mismatch difference (MD) tags, as described in the BAM standard. The feature is turned off by default because there is a small performance cost to generate these strings. To generate MD tags, set --generate-md-tags to true.

To generate ZS:Z alignment status tags, set --generate-zs-tags to true. These tags are only generated in the primary alignment and when a read has suboptimal alignments qualifying for secondary output (even if none were output because --Aligner.sec-aligns was set to 0). The following are valid tag values:

Tag

Tag meaning

ZS:Z:R

Multiple alignments with similar score were found.

ZS:Z:NM

No alignment was found.

ZS:Z:QL

An alignment was found but it was below the quality threshold.

To generate SA:Z tags, set --generate-sa-tags to true (the default). These tags provide alignment information (position, cigar, orientation) of groups of supplementary alignments, which are useful in structural variant calling.

To generate pair score in a ps:i tag, set --generate-ps-tags to true (false by default for DNA, true for RNA). The pair score is used in DRAGEN for computing MAPQ and can be used to check how well alignment candidate pairs score against each other.

DRAGEN can also output mate alignment tags. To generate the mate cigar (in the MC:Z tag), set --generate-mc-tags to true (this is the default). To generate the mate mapping quality (in the MQ:i) tag, set --generate-mq-tags to true (this is the default). To generate mate sequence (in the R2:Z tag) and mate base qualities (in the Q2:Z tag), set --generate-r2-tags to true (default is false) and set --generate-q2-tags to true (default is false) respectively. Please note that when enabled, R2:Z and Q2:Z tags are emitted only for improperly paired read alignments with fragment length atleast 1000 bp. Also, our methylation pipelines currently do not support the output of mate alignment tags.

DRAGEN also outputs a graph alignment tag ga:Z --generate-ga-tags (true by default for DNA, false for RNA) when applicable. This tag is used to describe the best alt contig alignment which improved the score of a primary-contig alignment at its liftover position. It can also be used to describe read alignments to alt contigs for which there is no liftover and the primary alignment is unmapped. For example, cases when the read maps best to an alt contig describing a novel long-insertion that is not present in the reference. In addition, read alignments that have been marked as unmapped because they map to auto-detected decoy contigs not present in the original user-provided FASTA also have their alignments described in the ga tag.

The ga tag uses the same format as the SA tag used to describe supplementary alignments.

CRAM Output

When CRAM is selected as output, DRAGEN generates a CRAM file with the following features:

CRAM format V3.0 is produced
The CRAM is lossless. Lossy compression is never employed and not optional
Quality score compression is lossless. Read names are preserved
Only the GZIP compression algorithm is employed for maximum compatibility. bgzip, lzma not employed. rANS is used for quality scores
All input BAM tags are preserved
The reference used to compress the CRAM file, is the DRAGEN Hash Table provided during the map/align run. When decompressing the CRAM with a FASTA file and 3rd party tools, the FASTA that was used to generate the Hash Table must be used.
A CRAM index is produced in .crai format
CRAM output is only possible when sort is enabled. CRAM alignments will always be positionally sorted

The following list of default settings are used for the CRAM output

CRAM option

Value

Description

SEQS_PER_SLICE

2000

Max sequences per slice

BASES_PER_SLICE

SEQS_PER_SLICE*500

Max bases per slice

SLICE_PER_CNT

Max slices per container

embed_ref

Do not embed reference sequence

noref

Do not use non-referenced based encoding

multiseq

-1

Do not use multiple references per slice

unsorted

Do not use unsorted mode

use_bz2

Do not compress using bzip2

use_lzma

Do not compress using lmza

use_rans

Use rANS for quality score compression

binning

NONE

Qual score binning not used

preserve_aux_order

Preserve all aux tags and order (incl RG,NM,MD)

preserve_aux_size

Aux tag sizes not preserved ('i', 's', 'c')

lossy_read_names

Preserve read names

lossy

Do not enable Illumina 8 quality-binning system

ignore_md5

Enable all checking of checksums

decode_md

Do not (re)generate MD and NM tags

Input Options

DRAGEN can process reads in FASTQ format or BAM/CRAM format. DRAGEN supports the following compression options for FASTQ input files.

Uncompressed
gzip or bgzip compression
ORA compression. To use ORA compression, you must provide an ORA reference and reference directory. See ORA Compression and Decompression.

If your input FASTQ files are gzipped, DRAGEN automatically decompresses the files using hardware-accelerated decompression, and then streams the reads into the mapper. If your files end in *.ora, DRAGEN automatically decompresses the files using ORA decompression, and then streams the reads into the mapper. The same FASTQ command-line options apply for all compression formats.

FASTQ Input Files

FASTQ input files can be single-ended or paired-end, as shown in the following examples.

Single-ended in one FASTQ file (-1 option)

dragen -r <REF_DIR> -1 <fastq> \
--output-directory <OUT_DIR> -output-file-prefix <OUTPUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

Paired-end in two matched FASTQ files(-1 and -2 options)

dragen -r <REF_DIR> -1 <fastq1> -2 <fastq2> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

Paired-end in a single interleaved FASTQ file(--interleaved (-i) option)

dragen -r <REF_DIR> -1 <INTERLEAVED_FASTQ> -i \
--RGID <RGID> --RGSM <RGSM>

Both bcl2fastq and the DRAGEN BCL command use a common file naming convention, as follows:

<SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz

Older versions of bcl2fastq and DRAGEN could segment FASTQ samples into multiple files to limit file size or to decrease the time to generate them.

For Example:

RDRS182520_S1_L001_R1_001.fastq.gz

RDRS182520_S1_L001_R1_002.fastq.gz

...

RDRS182520_S1_L001_R1_008.fastq.gz

These files do not need to be concatenated to be processed together by DRAGEN. To map/align any sample, provide the first file in the series (-1 <FileName>_001.fastq). DRAGEN reads all segment files in the sample consecutively for both of the FASTQ file sequences specified using the -1 and -2 options for paired-end input and for compressed fastq.gz files. To turn the behavior off, set ‑‑enable-auto-multifile to false on the command line.

DRAGEN can also optionally read multiple files by the sample name given in the file name, which can be used to combine samples that have been distributed across multiple BCL lanes or flow cells. To enable this feature, set the --combine-samples-by-name option to true

If the FASTQ files specified on the command-line use the Casava 1.8 file naming convention shown above and additional files in the same directory share that sample name, those files and all their segments are processed automatically. Note that sample name, read number, and file extension must match. Index barcode and lane number can differ.

To avoid impacting system performance, input files must be located on a fast file system.

Multiple FASTQ Input Files

To process multiple FASTQ input files as one sample, it is recommended that you use the --fastq-list <csv file name> option to specify the name of a CSV file containing the list of FASTQ files, instead of using the --combine-samples-by-name option.

For example:

dragen -r <ref_dir> --fastq-list <CSV_FILE> \
-fastq-list-sample-id <Sample_ID> -output-directory <OUT_DIR> 
--output-file-prefix <OUT_PREFIX>

Using a CSV file avoids having to concatenate the FASTQ files, for cases where there are multiple FASTQ files for a sample such as top-up scenarios or where FASTQ files are split across lanes. It also allows you to name the FASTQ input files, input from multiple subdirectories, and add BAM tags specified explicitly for each read group. DRAGEN automatically generates a CSV file of the correct format during BCL conversion to FASTQ. The CSV file is named fastq_list.csv and contains an entry for each FASTQ file or paired-end file pair produced during the run.

FASTQ CSV File Format

The first line of the CSV file specifies the title of each column, and is followed by one or more data lines. All lines in the CSV file must contain the same number of comma-separated values and should not contain white space or other extraneous characters.

Column titles are case-sensitive. The following column titles are required:

RGID--Read Group
RGSM--Sample ID
RGLB--Library
Lane--Flow cell lane
Read1File--Full path to a valid FASTQ input file
Read2File--Full path to a valid FASTQ input file. Required for paired-end input. If not using paired-end input, leave empty.

Each FASTQ file referenced in the CSV list can be referenced only once. All values in the Read2File column must be either nonempty and reference valid files, or they must all be empty.

When generating a BAM file using fastq-list input, one read group is generated per unique RGID value. The BAM header contains RG tags for the following read groups:

ID (from RGID)
SM (from RGSM)
LB (from RGLB)

You can specify additional tags for each read group by adding a column title. The column title must be only four upper-case characters and begin with RG. For example, to add a PU (platform unit) tag, add a column named RGPU and specify the value for each read group in this column. All column titles must be unique.

A fastq-list file can contain files for more than one sample. If a fastq-list file contains only one unique RGSM entry, then no additional options need to be specified, and DRAGEN processes all files listed in the fastq-list file. If there is more than one unique RGSM entry in a fastq-list file, --fastq-list-sample-id <SampleID> must be used in addition to --fastq-list <filename> to process only a specific sample from the CSV file. Only the entries in the fastq-list file with an RGSM value that match the specified SampleID are processed.

Independent processing and output for multiple individual samples in one run is not supported.
To process all listed files together as one sample, regardless of the RGSM value, the option --fastq-list-all-samples=true can be used instead of --fastq-list-sample-id.

Note

For a single run, only one BAM and VCF output file are produced because all input read groups are expected to belong to the same sample. To process multiple samples independently from one BCL conversion run, DRAGEN must be run multiple times using different values for the `--fastq-list-sample-id` option.

There is no option to specify groupings or subsets of RGSM values for more complex filtering, but the fastq-list file can be modified to achieve the same effect.

The following is an example FASTQ list CSV file with the required columns:

RGID,RGSM,RGLB,Lane,Read1File,Read2File
CACACTGA.1,RDSR181520,UnknownLibrary,1,/staging/RDSR181520_S1_L001_R1_001.fastq,
/staging/RDSR181520_S1_L001_R2_001.fastq
AGAACGGA.1,RDSR181521,UnknownLibrary,1,/staging/RDSR181521_S2_L001_R1_001.fastq,
/staging/RDSR181521_S2_L001_R2_001.fastq
TAAGTGCC.1,RDSR181522,UnknownLibrary,1,/staging/RDSR181522_S3_L001_R1_001.fastq,
/staging/RDSR181522_S3_L001_R2_001.fastq
AGACTGAG.1,RDSR181523,UnknownLibrary,1,/staging/RDSR181523_S4_L001_R1_001.fastq,
/staging/RDSR181523_S4_L001_R2_001.fastq

If you use the --tumor-fastq-list option for somatic input, use the --tumor-fastq-list-sample-id SampleID> option to specify the sample ID for the corresponding FASTQ list, as shown in the following example:

dragen -r <ref_dir> --tumor-fastq-list <csv_file> \
--tumor-fastq-list-sample-id <Sample_ID> \
--output-directory <out_dir> \
--output-file-prefix <out_prefix> --fastq-list <csv_file_2> \
--fastq-list-sample-id <Sample_ID_2>

Tumor-Normal Pairs Input

If using fastq_lists or tumor_fastq_lists comprising of multiple samples (RGSMs) in somatic mode, you can use a loop to iterate through the two lists to create tumor-normal pairs for testing. Create a *.txt file with the RGSM of each normal sample to be tested (one per line), and then create a separate *.txt file with the RGSM of the tumor samples to be tested. Make sure that the tumor sample RGSM is listed in the same order as the corresponding normal samples and to include a blank line after the last sample.

You can use the following example script to perform testing in somatic mode. Each iteration takes one entry from the tumor samples list and one entry from the normal samples list (from top to bottom) to create a tumor-normal pair as input for the DRAGEN run.

#!/bin/bash

HT="/staging/HT/"
tumor_fastq_list="/staging/inputs/tumor_fastq_list.csv"
normal_fastq_list="/staging/inputs/normal_fastq_list.csv"

tumor_samples_list="/staging/inputs/tumor_samples_list.txt"
normal_samples_list="/staging/inputs/normal_samples_list.txt"

while read -u 3 -r tumor_RGSM && read -u 4 -r normal_RGSM; do
output_dir="/staging/results/${tumor_RGSM}_${normal_RGSM}"
mkdir -p ${output_dir}

dragen \
-r ${HT} \
--tumor-fastq-list ${tumor_fastq_list} \
--tumor-fastq-list-sample-id ${tumor_RGSM} \
--fastq-list ${normal_fastq_list} \
--fastq-list-sample-id ${normal_RGSM} \
--output-directory ${output_dir} \
--output-file-prefix ${tumor_RGSM}_${normal_RGSM}
done 3<${tumor_samples_list} 4<${normal_samples_list}


Sample fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_N1.1,normal-1,ILLUMINA,1,/staging/inputs/normal-1_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N1.2,normal-1,ILLUMINA,2,/staging/inputs/normal-1_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.1,normal-2,ILLUMINA,1,/staging/inputs/normal-2_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.2,normal-2,ILLUMINA,2,/staging/inputs/normal-2_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.1,normal-3,ILLUMINA,1,/staging/inputs/normal-3_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.2,normal-3,ILLUMINA,2,/staging/inputs/normal-3_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L002_R2_001.fastq.gz

The following are examples of the FASTQ lists and samples lists used as input for the script.

Sample tumor_fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_T1.1,tumor-1,ILLUMINA,1,/staging/inputs/tumor-1_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T1.2,tumor-1,ILLUMINA,2,/staging/inputs/tumor-1_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.1,tumor-2,ILLUMINA,1,/staging/inputs/tumor-2_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.2,tumor-2,ILLUMINA,2,/staging/inputs/tumor-2_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.1,tumor-3,ILLUMINA,1,/staging/inputs/tumor-3_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.2,tumor-3,ILLUMINA,2,/staging/inputs/tumor-3_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L002_R2_001.fastq.gz

Sample normal_samples_list content

normal-1
normal-2
normal-3

Sample tumor_samples_list content

tumor-1
tumor-2
tumor-3

FASTQ ORA Input Files

You can use the same options as the other FASTQ input file types for ORA files. To use the ORA file, replace the FASTQ file name with the ORA file name and specify the ORA reference directory using --ora-reference.

See ORA Compression and Decompression for more information on ORA reference files.

The following command represents paired-end in two matched ORA FASTQ files (-1 and -2 options).

dragen -r <REF_DIR> -1 <fastq.ora1> -2 <fastq.ora2> \
--ora-reference <ORADATA_DIR> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

BAM Input Files

BAM files can be used as input to the mapper/aligner. By default --enable-map-align is true. You can use the BAM file as input to the variant caller by setting the --enable-map-align option to false.

When you specify a BAM file as input, with map/align enabled, DRAGEN ignores any alignment information contained in the input file, and outputs new alignments for all reads.

If the input file contains paired-end reads, it is important to specify that the input data should be sorted so that pairs can be processed together. Other pipelines would require you to re-sort the input data set by read name. DRAGEN vastly increases the speed of this operation by pairing the input reads, and sending them on to the mapper/aligner when pairs are identified. Use the --pair-by-name option to enable or disable this feature (the default is true).

Specify single-ended input in one BAM file with the (-b) and --pair-by-name=false options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name false

Specify paired-end input in one BAM file with the (-b) and \--pair-by-name=true options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

CRAM Input

You can use CRAM files as input to the DRAGEN mapper/aligner and variant caller. The DRAGEN functionality available when using CRAM input is the same as when using BAM input.

By default, the CRAM compressor and decompressor uses the DRAGEN reference specified with the --ref-dir option. CRAM compression is reference based, and the reference used for compression is not part of the CRAM file. Therefore, the CRAM input file must have been created with the same reference than what is provided to DRAGEN for the analysis.

DRAGEN supports the re-alignment of a CRAM input that was created with a different reference in one step. Re-aligning a CRAM file that was created with a different reference requires use of the --cram-reference option. This option will make the CRAM decompressor use the specified reference.

--cram-reference can be either a fasta file, or a DRAGEN hash table folder.
If pointing to a fasta file, the fasta .fai index file must be present next to the fasta file
CRAM output will always be compressed using the --ref-dir reference

Example: CRAM was created with hg19, re-analysis with hg38

dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <ref_dir HG19>

dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <hg19.fa>

The following options are used for providing a CRAM input to either mapper/aligner or variant caller:

--cram-input--The name and path for the CRAM file
--cram-input--One usage example is paired-end input in a single CRAM file. In addition, set the --pair-by-name option to true.

dragen -r <ref_dir> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

BCL Input Files

BCL is the output format of Illumina sequencing systems. Under limited circumstances, DRAGEN can read directly from BCL for map-align operations, saving the time needed for conversion to FASTQ.

DRAGEN can read directly from BCL in the following circumstances:

Only one lane is input as part of a run (specified on the command-line).
The lane has only a single sample specified in the SampleSheet.csv file. When converting BCL to FASTQ is required, DRAGEN provides a BCL to FASTQ converter (see DRAGEN BCL Data Conversion).

The following example command is for BCL input with only one lane of input:

dragen --bcl-input-dir <BCL_ROOT> --bcl-only-lane <num> -r <ref_dir> \
--output-directory <out_dir> --output-file-prefix <out_prefix>

For additional BCL conversion options, see Input File Types.

Handling of N bases

One of the techniques that DRAGEN uses to optimize handling sequences can lead to the overwriting the base quality score assigned to N base calls.

When you use the --fastq-n-quality and --fastq-offset options, the base quality scores are overwritten with a fixed base quality. The default values for these options are 2 and 33 to match the Illumina minimum quality of 35 (ASCII character ‘#’).

Read Names for Paired-End Reads

By a common convention, read names can include suffixes, such as /1 or /2), which indicate the end of a pair the read represents. For BAM input using the --pair-by-name option, DRAGEN ignores these suffixes to find matching pair names. By default, DRAGEN uses the forward slash character as the delimiter for these suffixes and ignores the /1 and /2 when comparing names. By default, DRAGEN strips these suffixes from the original read names.

DRAGEN has the following options to control how suffixes are used:

To change the delimiter character, for suffixes, use the --pair-suffix-delimiter option. Valid values for this option include forward-slash (/), dot (.), and colon (:).
To preserve the entire name, including the suffixes, set --strip-input-qname-suffixes to false.
To append a new set of suffixes to all read names, set --append-read-index-to-name to true. The delimiter is determined by the --pair-suffix-delimiter option. By default, the delimiter is a slash, so /1 and /2 are added to the names.

Gene Annotation Input Files

When processing RNA-Seq data, you can supply a gene annotations file by using the --annotation-file option. Providing this file improves the accuracy of the mapping and aligning stage (see [Input Files]{.underline}). The file should conform to the GTF/GFF format specification and should list annotated transcripts that match the reference genome being mapped against. The similar GFF3 format is currently not supported, due to inconsistent contig naming between GENCODE and Ensembl. See the RNA user guide section for more details on potential issues and workarounds.

DRAGEN can take the SJ.out.tab file (see [SJ.out.tab]{.underline}) as an annotations file to help guide the aligner in a two-pass mode of operation.

Networked Streaming

AWS S3, Azure Blob Storage, and AWS Presigned URL Input Streaming

DRAGEN can stream input files directly from an AWS S3 bucket, Azure Blob storage account, or by using AWS presigned URLs (presigned URLs are not supported for Azure Blob storage at this time). With streaming, input files are not required to be downloaded locally prior to being processed. The files are streamed over the network directly into the DRAGEN processor.

Input streaming is most beneficial for large input files. DRAGEN supports input streaming for BAMs and compressed FASTQ files. For FASTQ files, input streaming can be used in all the configurations, including single-end FASTQs, paired-end FASTQs, and FASTQ lists.

Input streaming is supported for the following use cases:

Mapping/aligning of FASTQ and BAM.
Germline and somatic small variant calling from BAM (without remapping).

For other file types that are significantly smaller in size, download them locally before running the analysis.

Streaming FASTQ Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 s3://s3-bucket-name/path/to/object_1.fastq.gz \
  -2 s3://s3-bucket-name/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 https://storage-account-name.blob.core.windows.net/path/to/object_1.fastq.gz \
  -2 https://storage-account-name.blob.core.windows.net/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 https://bucket-name.amazonaws.com/path/to/object_1.fastq.gz?querystring \
  -2 https://bucket-name.amazonaws.com/path/to/object_2.fastq.gz?querystring \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b s3://s3-bucket-name/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b https://storage-account-name.blob.core.windows.net/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b https://bucket-name.amazonaws.com/path/to/object_1.bam?querystring \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

AWS S3, Azure Blob Storage, Output Streaming

DRAGEN can stream its output to an AWS S3 Bucket or an Azure Blob Storage Account Container. Output streaming is beneficial for large output files and for sharing results.

Streaming output to AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory s3://s3-bucket-name/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Streaming output to Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory https://storage-account-name.blob.core.windows.net/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Security and Permissions

To stream input files or write to a cloud providers storage, you must have permission to access the remote files.

AWS S3

S3 requires AWS authentication and credentials. The authentication should already be set up on the instance you are running, for example, via IAM policies.

Azure Blob Storage Account

Azure requires authentication and environment variables. DRAGEN supports two cases: (1) Using managed identities and (2) Storage account access keys.

To use managed identities you must run DRAGEN on an Azure instance. The instance must have Contributor permissions (read/write) on the Storage Account it wants to read and write to. If the instance has a single managed identity, only the AZ_ACCOUNT_NAME=<azure-storage-account-name> environment variable is required. For multiple managed identities, you must also provide the AZR_IDENT_CLIENT_ID=<client-id> environment variable, with the client id of the identity that can access your storage bucket. This can be found on the Azure Portal.

With storage account access keys, DRAGEN can write to an Azure bucket both on and off Azure instances. For this use case, find the Storage Account Access Key and set the environment variables AZ_ACCOUNT_NAME=<azure-storage-account-name> and AZ_ACCOUNT_KEY=<account-key>.

Presigned URL (AWS only)

An AWS presigned URL most likely has a query string attached to it, which provides the authentication credentials or necessary tokens to grant permission to the S3 bucket (e.g., https://bucket-name.amazonaws.com/path/to/folder?querystring). Currently, streaming input to DRAGEN Azure presigned URLs is not supported.

Sample Sex

Use the --sample-sex command line option to control the sex karyotype input used in downstream components, such as variant callers. If a sample sex karyotype input is not specified using the command line, the sex karyotype is automatically determined. The sex karyotype input is converted to a reference sex karyotype for use in variant calling. Other components might support sex karyotype input. Refer to the corresponding section for the component you are using.

The --sample-sex option supports the following values. Values are not case-sensitive.

none: No sex karyotype input. Components use a default reference sex karyotype.
auto: The sex karyotype is estimated by the Ploidy Estimator. If using CNV calling, sex karyotype is determined using a separate sex estimation module. If DRAGEN cannot estimate the sex karyotype, then components do not have a sex karyotype input. This behavior is then the same as none. auto is the default value.
female: Sex karyotype input is XX.
male: Sex karyotype input is XY.

The following example command lines use --sample-sex to specify the sex karyotype.

--sample-sex FEMALE

--sample-sex MALE

--sample-sex NONE

If the value is none, female, or male, the Ploidy Estimator could still run and produce output, but variant callers will not use any estimated sex karyotype that is different than the sex karyotype provided via the command-line.

The sex karyotype input is converted to the reference sex karyotype for the different components as follows. See the relevant component section for more information on how --sample-sex is used.

Reference Sex Karyotype

Sex Karyotype Input

CNV Caller

ExpansionHunter

Ploidy Caller

Small Variant Caller

SV Caller

XXYY

XXY

XXYY

XYY

XXYY

XXXY

XXYY

XXX

XXYY

None

XX/XY

XXYY

For sex karyotype input of None, CNV independently checks the coverage ratio of X and Y to determine the reference sex karyotype. Detection of minimal Y coverage will yield XY, otherwise XX.

Preservation or Stripping of BQSR Tags

The Picard Base Quality Score Recalibration (BQSR) tool produces output BAM files that include tags BI and BD. BQSR calculates these tags relative to the exact sequence for a read. If a BAM file with BI and BD tags is used as input to mapper/aligner with hard clipping enabled, the BI and/or BD tags can become invalid.

The recommendation is to strip these tags when using BAM files as input. To remove the BI and BD tags, set the --preserve-bqsr-tags option to false. If you preserve the tags, DRAGEN warns you to disable hard clipping.

Read Group Options

DRAGEN assumes that all the reads in a given FASTQ belong to the same read group. DRAGEN creates a single @RG read group descriptor in the header of the output BAM file, with the ability to specify the following standard BAM attributes:

Attribute

Argument

Description

--RGID

Read group identifier. If you include any of the read group parameters, RGID is required. It is the value written into each output BAM record.

--RGLB

Library.

--RGPL

Platform/technology used to produce the reads. The BAM standard allows for values CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT and PACBIO.

--RGPU

Platform unit, eg, flowcell-barcode.lane.

--RGSM

Sample.

--RGCN

Name of the sequencing center that produced the read.

--RGDS

Description.

--RGDT

Date the run was produced.

--RGPI

Predicted mean insert size.

If any of these arguments are present, DRAGEN adds an RG tag to all the output records to indicate that they are members of a read group. The following example shows a command line that includes read group parameters:

dragen --RGID 1 --RGCN Broad --RGLB Solexa-135852 \
--RGPL Illumina --RGPU 1 --RGSM NA12878 \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
-1 SRA056922.fastq --output-directory /staging/tmp/ \
--output-file-prefix rg_example

When using the --fastq-list option to input multiple read groups, BAM tags (and others) are specified for each read group by adding columns to the fastq_list.csv file. Each column heading consists of four capital letters and each begins with 'RG'. For each column, each read group's values for that column are propagated to the output BAM file in an identically named tag.

License Options

To suppress the license status message at the end of the run, use the --lic-no-print option. The following shows an example of the license status message:

LICENSE_MSG| =====================================================
LICENSE_MSG| License report
LICENSE_MSG|   Genome status [ACxxxxxxxxxxx] : used 1263.9 Gbases
since 2018-Feb-15 (1263886160894 bases, unlimited)
LICENSE_MSG|   Genome  bases [ACxxxxxxxxxxx] : 202000000
LICENSE_MSG|   Genome  bases [total]         : 202000000

Autogenerated MD5SUM for BAM and CRAM Output Files

An MD5SUM file is generated automatically for BAM and CRAM output files. The MD5SUM file has the same name as the output file, with an .md5sum extension appended (eg, whole_genome_run_123.bam.md5sum). The MD5SUM file is a single-line text file that contains the md5sum of the output file, which exactly matches the output of the Linux md5sum command.

The MD5SUM calculation is performed as the output file is written, so there is no measurable performance impact (compared to the Linux md5sum command, which can take several minutes for a 30x BAM).

Configuration Files

Command line options can be stored in a configuration file. The location of the default configuration file is <INSTALL_PATH>/config/dragen-user-defaults.cfg. You can override this file by using the --config-file (-c) option to specify a different file. The configuration file used for a given run supplies the default settings for that run, any of which can be overridden by command line options.

The recommended approach is to use the dragen-user-defaults.cfg file as a template to create default settings for different use cases. Copy dragen-user-defaults.cfg, rename the copy, then modify the new file for the specific use-case. Best practice is to put options that rarely change into the configuration file and to specify options that vary from run to run on the command line.

Cloud Authentication and Licensing

Authentication is required for users that run DRAGEN on the cloud, with the Bring-Your-Own-License (BYOL) model, outside of integrated Illumina cloud products. A valid license is required to enable authentication and usage quotas.

License Server

DRAGEN cloud runs access the DRAGEN License Server to validate the credentials and licenses against the intended run. BYOL users must provide credentials and must allow access to the license server URL. The following command line option can be used to pass the credentials to DRAGEN: --lic-server=https://<user>:<pass>@license.edicogenome.com.

An alternative way to provide license server credentials is by using a license credentials file. The --lic-credentials input command line option can be used to provide the full path to the license credentials file. This provides a more secure way to pass cloud credentials, which avoids accidental credentials leaks from command line console logs.

A license credentials file is a plain text file audited by the customer. The format is the same as the DRAGEN config files: = , each {key,value} separated by new line. The following key names must be used: credentials1 = credentials2 =

AWS Instance Metadata Service Support (IMDSv1/IMDSv2)

DRAGEN uses AWS Instance Metadata Service (IMDS) to identify its own metadata within the AWS environment, including location, identity, and configuration.

DRAGEN supports both AWS IMDSv1, and the more secure AWS IMDSv2. AWS IMDSv1 is request/response based. It accesses metadata by HTTP requests to a specific endpoint on the instance. AWS IMDSv2 is token-based authentication with time-limited tokes.

AWS IMDSv2 must be enabled on the AWS instance, otherwise, IMDSv1 is used by default. DRAGEN software will automatically detect the IMDS version in use and adapt its behavior accordingly.

Instance Identity

DRAGEN cloud runs access the instance identity document via the Instance Metadata Service as part of the authentication. It uses the IPv4 local address. If access to the local address is not allowed, the authentication will fail. Alternately, the user may save the instance identity document(s) and point DRAGEN to use them instead, if the user does not want to allow applications to access this service. The method for providing instance identity documents to the software is described below.

Save the instance identity document(s) as files from the user's instance, and provide them as inputs to the DRAGEN software with each run.
The instance identity document(s) only need to be saved once per AWS account and region, and those files can be re-used subsequently.

Examples for saving instance identity document(s):

AWS

IMDSv1

curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/pkcs7" -o /opt/instance-identity/pkcs7
curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/document" -o /opt/instance-identity/document
cp /opt/instance-identity/pkcs7 /opt/instance-identity/signature

IMDSv2

curl -X PUT -H "X-aws-ec2-metadata-token-ttl-seconds: 300" -H "X-aws-ec2-metadata-token: required" --noproxy "*" "http://169.254.169.254/latest/api/token"
curl -H "X-aws-ec2-metadata-token: <your-token>" --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/document"
curl -H "X-aws-ec2-metadata-token: <your-token>" --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/signature"
curl -H "X-aws-ec2-metadata-token: <your-token>" --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/pkcs7"

There should be 3 files in this folder, respectively named pkcs7, signature and document. Run Dragen using the --lic-instance-id-location ${instance_identity} command option.

Azure

curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/instance?api-version=2020-09-01" -o /opt/instance-identity/instance
curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/attested/document?api-version=2020-09-01" -o /opt/instance-identity/document

There should be 2 files in this folder, respectively named instance and document. Run Dragen using the --lic-instance-id-location ${instance_identity} command option.

DRAGEN Secondary Analysis

The DRAGEN secondary analysis software utilizes a highly reconfigurable Field Programmable Gate Array (FPGA) card and is available on a preconfigured DRAGEN server that can be seamlessly integrated into bioinformatics workflows. The platform can be loaded with highly optimized algorithms for many different NGS secondary analysis pipelines, including the following:

Whole genome
Exome
RNA-Seq
Methylome
Cancer

All user interaction is accomplished via DRAGEN software that runs on the host server and manages all communication with the FPGA card. This user guide summarizes the technical aspects of the system and provides detailed information for all DRAGEN command line options. If you are working with DRAGEN for the first time, Illumina recommends that you first read the Getting Started section, which provides a short introduction to DRAGEN, including running a test of the server, generating a reference genome, and running example commands.

DNA Pipeline

DRAGEN DNA Pipeline

The DRAGEN DNA Pipeline massively accelerates the secondary analysis of NGS data. For example, the time taken to process an entire human genome at 30x coverage is reduced from approximately 10 hours (using the current industry standard, BWA-MEM+GATK-HC software) to approximately 20 minutes. Time scales linearly with coverage depth.

These pipelines harness the tremendous power of the DRAGEN server and include highly optimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling. They also use platform features such as hardware-accelerated compression and optimized BCL conversion, together with the full set of platform tools.

Unlike all other secondary analysis methods, DRAGEN DNA Applications do not reduce accuracy to achieve speed improvements. Accuracy for both SNPs and INDELs is improved over that of BWA-MEM+GATK-HC in side-by-side comparisons.

In addition to haplotype variant calling, the pipeline supports calling of copy number and structural variants as well as detection of repeat expansions.

RNA Pipeline

DRAGEN secondary anaylsis includes an RNA-seq (splicing-aware) aligner, as well as RNA-specific analysis components for gene expression quantification and gene fusion detection.

The DRAGEN RNA Pipeline shares many components with the DNA Pipeline. Mapping of short seed sequences from RNA-Seq reads is performed similarly to mapping DNA reads. In addition, splice junctions (the joining of noncontiguous exons in RNA transcripts) near the mapped seeds are detected and incorporated into the full read alignments.

DRAGEN secondary analysis uses hardware accelerated algorithms to map and align RNA-Seq--based reads faster and more accurately than popular software tools. For instance, it can align 100 million paired-end RNA-Seq--based reads in about three minutes. With simulated benchmark RNA-Seq data sets, its splice junction sensitivity and specificity are unsurpassed.

Methylation Pipeline

The DRAGEN Methylation Pipeline provides support for automating the processing of bisulfite sequencing data to generate a BAM with the tags required for methylation analysis and reports detailing the locations with methylated cytosines.

DRAGEN Reference Support

DRAGEN supports the construction of reference hash tables for both human and non-human reference genomes. The reference autodetect feature of DRAGEN is able to recognize the reference hash tables build on the four Human reference genomes: hg19 (hg19), GRCh37/hs37d5 (hs37d5), GRCh38/hs38d1(hg38), and T2T-CHM13v2.0 (chm13).

DRAGEN supports pangenome reference hash tables which extend the reference genomes with alternative variant paths from a sample cohort used to construct the pangenome reference. A pangenome-based reference improves the mapping accuracy of Illumina reads in the “Difficult-to-Map Regions” of the genome and the downstream variant calling.

Pre-built human references are available for download at DRAGEN Software Support Site page.

The pangenome is the recommended for Germline human analyses. The accuracy achieved with pangenome references are highlighted in the plot below.

In the following tables we summarize the reference support for each DRAGEN component and the recommended reference type for each component.

Germline

Component

Recommended Human Reference Type

Recommended Non-Human Reference Type

Human hg19

Human hs37d5

Human hg38

Human chm13

Non-Human

SNV

Pangenome

Linear

Yes

CNV

Pangenome

Linear

Yes

Yes*

Pangenome

Linear

Yes

Yes*

Yes

Expansion Hunter

Pangenome

Linear

Yes

Targeted Callers

Pangenome

Linear

Yes

RNA

Linear

Yes

Yes*

Yes

De Novo

Pangenome

Linear

Yes

Yes*

Yes

Joint Genotyping

Pangenome

Linear

Yes

Yes*

Yes

Biomarkers (HLA)

Pangenome

Linear

Yes

Yes*

gVCF genotyper

Pangenome

Linear

Yes

Yes*

Yes

Somatic

Component

Recommended Human Reference Type

Recommended Non-Human Reference Type

Human hg19

Human hs37d5

Human hg38

Human chm13

Non-Human

SNV

Linear

Yes

Yes*

UMI SNV

Linear

Yes

Yes*

CNV

Linear

Yes

Yes*

Linear

Yes

Yes*

Methylation

Component

Recommended Human Reference Type

Recommended Non-Human Reference Type

Human hg19

Human hs37d5

Human hg38

Human chm13

Non-Human

Methylation

Linear

Yes

Annotation

Component

Recommended Human Reference Type

Recommended Non-Human Reference Type

Human hg19

Human hs37d5

Human hg38

Human chm13

Non-Human

Nirvana

Pangenome

Linear

Yes

* DRAGEN supports the component execution, however the component's accuracy has not been established.

See Prepare a Reference Genome for how to build a custom reference genome.

Prepare a Reference Genome

Before a reference genome can be used with DRAGEN, it must be converted from FASTA format into a custom binary format for use with the DRAGEN hardware. The options used in this preprocessing step offer tradeoffs between performance and mapping quality.

Pre-built DRAGEN reference genomes are available for download in the Illumina customer portal. If you find that performance and mapping quality with these are adequate, there is a good chance that you can simply work with these supplied reference genomes. Depending on your read lengths and other particular aspects of your application, you may be able to improve mapping quality and/or performance by tuning the reference preprocessing options.

Hash Table Background

The DRAGEN mapper extracts many overlapping seeds (subsequences or K-mers) from each read, and looks up those seeds in a hash table residing in memory on its PCIe card, to identify locations in the reference genome where the seeds match. Hash tables are ideal for extremely fast lookups of exact matches. The DRAGEN hash table must be constructed from a chosen reference genome using the --build-hash-table option, which extracts many overlapping seeds from the reference genome, populates them into records in the hash table, and saves the hash table as a binary file.

Automatic Reference Detection

DRAGEN will attempt to detect the provided reference in order to automatically apply recommended resources and settings. There are four human references that DRAGEN can detect: hg38, hg19, hs37d5, and chm13v2. DRAGEN is able to detect references that contain a subset of the primary contigs from one of these references, as long as the names and lengths of the detected contigs are consistent with the names and lengths from the standarad assemblies of these references.

In detail, automatic reference detection operates as follows:

We define a primary contig of a human genome to be an autosome (1-22) or sex chromosome (X,Y). Let F be the input fasta. For each reference genome R in hg38, hg19, hs37d5, and chm13v2, DRAGEN checks if there are any contigs in F that have the same name and length as a primary contig in R, and that there are no contigs in F that have the same name as a contig in R, but with different length. If these conditions hold for exactly one of hg38, hg19, hs37d5, and chm13v2, then that reference is detected and resources may be applied automatically.

The DRAGEN hash table builder will automatically apply decoy contigs and mask bed files to detected reference. Other pipelines may also apply automatic resources. For example variant callers may apply machine learning models and target bed files.

Naming Conventions

In order for DRAGEN to correctly detect the provided reference, it is important to use the standard naming conventions for each of the four human assemblies that DRAGEN detects:

Reference Seed Interval

The size of the DRAGEN hash table is proportionate to the number of seeds populated from the reference genome. The default is to populate a seed starting at every position in the reference genome, ie, roughly 3 billion seeds from a human genome. This default requires at least 32 GB of memory on the DRAGEN PCIe board.

To operate on larger, nonhuman genomes or to reduce hash table congestion, it is possible to populate less than all reference seeds using the --ht-ref-seed-interval option to specify an average reference interval. The default interval for 100% population is --ht-ref-seed-interval 1, and 50% population is specified with --ht-ref-seed-interval 2. The population interval does not need to be an integer. For example, --ht-ref-seed-interval 1.2 indicates 83.3% population, with mostly 1-base and some 2-base intervals to achieve a 1.2 base interval on average.

Hash Table Occupancy

It is characteristic of hash tables that they are allocated a certain size, but always retain some empty records, so they are less than 100% occupied. A healthy amount of empty space is important for quick access to the DRAGEN hash table. Approximately 90% occupancy is a good upper bound. Empty space is important because records are pseudo-randomly placed in the hash table, resulting in an abnormally high number of records in some places. These congested regions can get quite large as the percentage of empty space approaches zero, and queries by the DRAGEN mapper for some seeds can become increasingly slow.

Hash Table / Seed Length

The hash table is populated with reference seeds of a single common length. This primary seed length is controlled with the --ht-seed-len option, which defaults to 21.

The longest primary seed supported is 27 bases when the table is 8 GB to 31.5 GB in size. Generally, longer seeds are better for run time performance, and shorter seeds are better for mapping quality (success rate and accuracy). A longer seed is more likely to be unique in the reference genome, facilitating fast mapping without needing to check many alternative locations. But a longer seed is also more likely to overlap a deviation from the reference (variant or sequencing error), which prevents successful mapping by an exact match of that seed (although another seed from the read may still map), and there are fewer long seed positions available in each read.

Longer seeds are more appropriate for longer reads, because there are more seed positions available to avoid deviations.

Seed Length Recommendations

Hash Table / Seed Extensions

Due to repetitive sequences, some seeds of any given length match many locations in the reference genome. DRAGEN uses a unique mechanism called seed extension to successfully map such high-frequency seeds. When the software determines that a primary seed occurs at many reference locations, it extends the seed by some number of bases at both ends, to some greater length that is more unique in the reference.

For example, a 21-base primary seed may be extended by 7 bases at each end to a 35-base extended seed. A 21-base primary seed may match 100 places in the reference. But 35-base extensions of these 100 seed positions may divide into 40 groups of 1-3 identical 35-base seeds. Iterative seed extensions are also supported, and are automatically generated when a large set of identical primary seeds contains various subsets that are best resolved by different extension lengths.

The maximum extended seed length, by default equal to the primary seed length plus 128, can be controlled with the --ht-max-ext-seed-len option. For example, for short reads, it is advisable to set the maximum extended seed shorter than the read length, because extensions longer than the whole read can never match.

It is also possible to tune how aggressively seeds are extended using the following options (advanced usage):

--ht-cost-coeff-seed-len

--ht-cost-coeff-seed-freq

--ht-cost-penalty

--ht-cost-penalty-incr

There is a tradeoff between extension length and hit frequency. Faster mapping can be achieved using longer seed extensions to reduce seed hit frequencies, or more accurate mapping can be achieved by avoiding seed extensions or keeping extensions short, while tolerating the higher hit frequencies that result. Shorter extensions can benefit mapping quality both by fitting seeds better between SNPs, and by finding more candidate mapping locations at which to score alignments. The default extension settings along with default seed frequency settings, lean aggressively toward mapping accuracy, with relatively short seed extensions and high hit frequencies.

The defaults for the seed frequency options are as follows:

Seed Frequency Limit and Target

One primary or extended seed can match multiple places in the reference genome. All such matches are populated into the hash table, and retrieved when the DRAGEN mapper looks up a corresponding seed extracted from a read. The multiple reference positions are then considered and compared to generate aligned mapper output. However, the DRAGEN software enforces a limit on the number of matches, or frequency, of each seed, which is controlled with the --ht-max-seed-freq option. By default, the frequency limit is 16. In practice, when the software encounters a seed with higher frequency, it extends it to a sufficiently long secondary seed that the frequency of any particular extended seed pattern falls within the limit. However, if a maximum seed extension would still exceed the limit, the seed is rejected, and not populated into the hash table. Instead, a single High Frequency record is populated.

This seed frequency limit does not tend to impact DRAGEN mapping quality notably, for two reasons. First, because seeds are rejected only when extension fails, only extremely high-frequency primary seeds, typically with many thousands of matches are rejected. Such seeds are not very useful for mapping. Second, there are other seed positions to check in a given read. If another seed position is unique enough to return one or more matches, the read can still be properly mapped. However, if all seed positions were rejected as high frequency, often this means that the entire read matches similarly well in many reference positions, so even if the read were mapped it would be an arbitrary choice, with very low or zero MAPQ.

Thus, the default frequency limit of 16 for --ht-max-seed-freq works well. However, it may be decreased or increased, up to a maximum of 256. A higher frequency limit tends to marginally increase the number of reads mapped (especially for short reads), but commonly the additional mapped reads have very low or zero MAPQ. This also tends to slow down DRAGEN mapping, because correspondingly large numbers of possible mappings are occasionally considered.

In addition to a frequency limit, a target seed frequency can be specified with --ht-target-seed-freq option. This target frequency is used when extensions are generated for high frequency primary seeds. Extension lengths are chosen with a preference toward extended seed frequencies near the target. The default of 4 for --ht-target-seed-freq means that the software is biased toward generating shorter seed extensions than necessary to map seeds uniquely.

References with ALT contigs

When building a reference hash table from a fasta with ALT contigs, it may be desired to mask certain regions of high similarity, or to establish a liftover realtionships between primary and alternate contigs. The recommended approach is masking, as described in the Map-Align section. When hg19 or hg38 alt contigs are detected, the hash table builder will require a liftover file or a bed file to mask the alt contigs. If non are provided, a mask bed file from <INSTALL_PATH>/fasta_mask/ will be used automaticaly.

Masked References

DRAGEN has adopted a masked approach to handle native reference ALT contigs, where strategic regions are masked to increased accuracy. The hash table builder will build the mapper hash table as if the regions that were specified in the argument for ht-mask-bed were masked with N's. The hash table builder will only allow setting one of ht-mask-bed or ht-alt-liftover. Each line in the bed file is expected to contain a contig name, start position (0-based), and end position (1-based), seperated by a single tab or space. Lines that start with # are ignored by the hash table builder to allow commenting. Any line with a contig name that is not found in the input fasta is skipped and logged to the DRAGEN log file. Likewise, lines that describe empty intervals are skipped. If all lines are skipped this way, the hash table builder will issue an error and abort, unless the mask bed file was automatically applied (see Automatic masking). The hash table builder will always issue an error and abort if an interval described in the BED file is outside of the range of the corresponding contig in the fasta. Lines that are not skipped are written to a file called mask.bed that will be present in the hash table output directory, and whose digest will appear in hash_table.cfg. This file is used when a reference is loaded to the FPGA card to dynamically mask reference.bin.

Automatic masking

When running from a fasta for which hg38 or hg19 is detected (See Automatic Reference Detection), and no argument for ht-mask-bed or ht-alt-liftover was provided, the hash table builder will automatically apply the corresponding bed file for the detected reference from <INSTALL_PATH>/fasta_mask/. Note that the hash table builder will identify alt contigs by name. So when running from an input fasta that contains alt contig with standard names but modified base content, it is recommended to suppress automatic masking by setting ht-suppress-mask=true or by passing a custom mask bed file to ht-mask-bed.

Handling Decoy Contigs

The behavior of DRAGEN with respect to the handling of decoy contigs in the reference has changed since version 2.6.

Starting with DRAGEN 3.x, DRAGEN's hash table builder automatically detects the absence of the decoy contigs from the reference and adds it to the FASTA file, prior to building the hash table. The decoys file is found at <INSTALL_PATH>/liftover/hs\_decoys.fa. If the reference is missing the decoy contigs, then the reads which map to the decoy contigs are artificially marked as unmapped in the output BAM (because the original reference does not have the decoy contig). This results in an artificially lower mapping rate, however, the accuracy of variant calling is improved thanks to removing false positive caused by decoy reads.

Illumina recommends using this feature by default. However, you can to set the --ht-suppress-decoys option to true to suppress adding these decoys to the hash table.

The table below describes the difference in behavior between older DRAGEN versions (2.6 and earlier) and DRAGEN 3.x versions with respect to the handling of decoy contigs in the hash table builder:

Prepare a Pangenome Reference

It is possible to build a custom pangenome reference in order to:

generate a population-specific-pangenome hash table from pangenome msVCF generated from the BSSH app.
generate a human or non-human pangenome hash table from customer-provided msVCF.

Usage

To enable the pangenome hash table builder, example command usage is :

dragen --build-hash-table true (required) --ht-graph-msvcf-file <path to a multi-sampple VCF file (required for pangenome reference) --ht-reference <reference.fasta> (required) --ht-graph-extra-kmer-bed < graph.bed> (optional) --ht-mask-bed <mask.bed> (optional) --ht-graph-exclusion-bed <exclusion bed> (optional) --output-directory <DIR> (required) [options]

Inputs

Set of population variants, in a multi-sample VCF (msVCF)

The custom pangenome hash table builder tool uses a set of population variants provided by the user to generate a pangenome hash table. The variants must be specified in VCF format, in a single multi-sample VCF (msVCF) file containing the variants for a set of individuals. This multi-sample VCF file must have specific formatting described below.

Specific msVCF input formatting

The custom pangenome hash table builder tool only supports msVCF file input respecting the format described below:

msVCF compliant with 4.2 VCF format specification
with variants positionally sorted in the same contig order as the main FASTA reference genome provided in --ht-reference
records shall include diploid or haploid GT calls
supports multi-allelic variants merged in multi-line or separated in multiple lines
with the following FILTER codes, non-PASS records are ignored:
- ##FILTER=<ID=PASS,Description="All filters passed">
with the following FORMAT field :
- ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
for better results, we recommend variants to be left-aligned.
maximum number of recommended samples in the msVCF is 256. Higher number may lead to very high memory usage at hash table creation.

Note: INFO/FORMAT subfields must be defined in the header. Events with undefined subfields are ignored.

To build a high-performance custom genome it is highly recommended to use long read sequencing data. We recommend using external tools such as Whatshap (https://github.com/whatshap/whatshap) to generate phased input. DRAGEN analysis leverages the phasing information to reconstruct population haplotypes.

Reference genome

Note: the reference genome provided as input must be the same as the one used to generate the input phased msVCF. If the msVCF contains variants from regions not present in the fasta file, the pangenome reference builder will stop with an error.

Exclusion bed file (optional)

A custom exclusion bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the exclusion bed file provided must be from the same build as the reference genome used to build the pangenome reference.

Extra kmer bed file (optional)

An Extra-kmer-bed bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the Extra-kmer-bed file provided must be from the same build as the reference genome used to build the graph reference.

Mask bed file (recommended)

A custom mask bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.

Note: records of the mask bed file provided must be from the same build as the reference genome used to build the graph reference.

Command line options

Note: The custom graph reference hash table end to end pipeline will return an error if options --ht-alt-liftover or --ht-allow-mask-and-liftover are specified.

Output

The hash table builder generates the following outputs:

Prepare a linear Reference

Usage

Use the --build-hash-table option to transform a reference FASTA into the hash table for DRAGEN mapping. It takes as input a FASTA file (multiple reference sequences being concatenated) and a preexisting output directory. Build command usage is as follows:

Input

The --ht-reference and --output-directory options are required for building a hash table. The --ht‑reference option specifies the path to the reference FASTA file, while --output-directory specifies a preexisting directory where the hash table output files are written. Illumina recommends organizing various hash table builds into different folders. As a best practice, folder names should include any nondefault parameter settings used to generate the contained hash table. The sequence names in the reference FASTA file must be unique.

Command line options

Liftover Based ALT-Aware Hash Tables

While masking is the recommended approach to dealing with ALT contigs, DRAGEN also supports a liftover based method. To enable liftover based ALT-aware mapping in DRAGEN, build the hash table with a liftover file by using the --ht-alt-liftover option. The hash table builder classifies each reference sequence as primary or alternate based on the liftover file, and packs primaries before alternates in reference.bin. SAM liftover files for hg38DH and hg19 are in the <INSTALL_PATH>/liftover folder.

Custom Liftover Files

Custom liftover files can be used in place of those provided with DRAGEN. Liftover files must be SAM format, but no SAM header is required. SEQ and QUAL fields can be omitted ('*'). Each alignment record should have an alternate haplotype reference sequence name as QNAME, indicating the RNAME and POS of its liftover alignment in a destination (normally primary assembly) reference sequence.

Reverse-complemented alignments are indicated by bit 0x10 in FLAG. Records flagged unmapped (0x4) or secondary (0x100) are ignored. The CIGAR may include hard or soft clipping, leaving parts of the ALT contig unaligned.

A single reference sequence cannot serve as both an ALT contig (appearing in QNAME) and a liftover destination (appearing in RNAME). Multiple ALT contigs can align to the same primary assembly location. Multiple alignments can also be provided for a single ALT contig (extras optionally be flagged 0x800 supplementary), such as to align one portion forward and another portion reverse-complemented. However, each base of the ALT contig only receives one liftover image, according to the first alignment record with an M CIGAR operation covering that base.

SAM records with QNAME missing from the reference genome are ignored, so that the same liftover file may be used for various reference subsets, but an error occurs if any alignment has its QNAME present but its RNAME absent.

Options for advanced users

Primary Seed Length

The --ht-seed-len option specifies the initial length in nucleotides of seeds from the reference genome to populate into the hash table. At run time, the mapper extracts seeds of this same length from each read, and looks for exact matches (unless seed editing is enabled) in the hash table.

The maximum primary seed length is a function of hash table size. The limit is k=27 for table sizes from 16 GB to 64 GB, covering typical sizes for whole human genome, or k=26 for sizes from 4 GB to 16 GB.

The minimum primary seed length depends mainly on the reference genome size and complexity. It needs to be long enough to resolve most reference positions uniquely. For whole human genome references, hash table construction typically fails with k < 16. The lower bound may be smaller for shorter genomes, or higher for less complex (more repetitive) genomes. The uniqueness threshold of --ht-seed-len 16 for the 3.1Gbp human genome can be understood intuitively because log4(3.1 G) ≈ 16, so it requires at least 16 choices from 4 nucleotides to distinguish 3.1 G reference positions.

Accuracy Considerations

For read mapping to succeed, at least one primary seed must match exactly (or with a single SNP when edited seeds are used). Shorter seeds are more likely to map successfully to the reference, because they are less likely to overlap variants or sequencing errors, and because more of them fit in each read. So for mapping accuracy, shorter seeds are mainly better.

However, very short seeds can sometimes reduce mapping accuracy. Very short seeds often map to multiple reference positions, and lead the mapper to consider more false mapping locations. Due to imperfect modeling of mutations and errors by Smith-Waterman alignment scoring and other heuristics, occasionally these noise matches may be reported. Run time quality filters such as --Aligner.aln_min_score can control the accuracy issues with very short seeds.

Speed Considerations

Shorter seeds tend to slow down mapping, because they map to more reference locations, resulting in more work such as Smith-Waterman alignments to determine the best result. This effect is most pronounced when primary seed length approaches the reference genome's uniqueness threshold, eg, K=16 for whole human genome.

Application Considerations

Read Length---Generally, shorter seeds are appropriate for shorter reads, and longer seeds for longer reads. Within a short read, a few mismatch positions (variants or sequencing errors) can chop the read into only short segments matching the reference, so that only a short seed can fit between the differences and match the reference exactly. For example, in a 36 bp read, just one SNP in the middle can block seeds longer than 18 bp from matching the reference. By contrast, in a 250 bp read, it takes 15 SNPs to exceed a 0.01% chance of blocking even 27 bp seeds.

Paired Ends---The use of paired end reads can make longer seeds yield good mapping accuracy. DRAGEN uses paired end information to improve mapping accuracy, including with rescue scans that search the expected reference window when only one mate has seeds mapping to a given reference region. Thus, paired end reads have essentially twice the opportunity for an exact matching seed to find their correct alignments.

Variant or Error Rate---When read differences from the reference are more frequent, shorter seeds may be required to fit between the difference positions in a given read and match the reference exactly.

Mapping Percentage Requirement---If the application requires a high percentage of reads to be mapped somewhere (even at low MAPQ), short seeds may be helpful. Some reads that do not match the reference well anywhere are more likely to map using short seeds to find partial matches to the reference.

Maximum Seed Length

The --ht-max-ext-seed-len option limits the length of extended seeds populated into the hash table. Primary seeds (length specified by --ht-seed-len) that match many reference positions can be extended to achieve more unique matching, which may be required to map seeds within the maximum hit frequency (--ht-max-seed-freq).

Given a primary seed length k, the maximum seed length can be configured between k and k+128. The default is the upper bound, k+128.

When to Limit Seed Extension

The --ht-max-ext-seed-len option is recommended for short reads, eg, less than 50 bp. In such cases, it is helpful to limit seed extension to the read length minus a small margin, such as 1-4 bp. For example, with 36 bp reads, setting --ht-max-ext-seed-len to 35 might be appropriate. This ensures that the hash table builder does not plan a seed extension longer than the read causing seed extension and mapping to fail at run time, for seeds that could have fit within the read with shorter extensions.

While seed extension can be similarly limited for longer reads, eg, setting --ht-max-ext-seed-len to 99 for 100 bp reads, there is little utility in this because seeds are extended conservatively in any event. Even with the default k+128 limit, individual seeds are only extended to the lengths required to fit under the maximum hit frequency (--ht-max-seed-freq), and at most a few bases longer to approach the target hit frequency (‑‑ht‑target-seed-freq), or to avoid taking too many incremental extension steps.

Maximum Hit Frequency

The --ht-max-seed-freq option sets a firm limit on the number of seed hits (reference genome locations) that can be populated for any primary or extended seed. If a given primary seed maps to more reference positions than this limit, it must be extended long enough that the extended seeds subdivide into smaller groups of identical seeds under the limit. If, even at the maximum extended seed length (--ht-max-ext-seed-len), a group of identical reference seeds is larger than this limit, their reference positions are not populated into the hash table. Instead, a single High Frequency record is populated.

The maximum hit frequency can be configured from 1 to 256. However, if this value is too low, hash table construction can fail because too many seed extensions are needed. The practical minimum for a whole human genome reference, other options being default, is 8.

Accuracy Considerations

Generally, a higher maximum hit frequency leads to more successful mapping. There are two reasons for this. First, a higher limit rejects fewer reference positions that cannot map under it. Second, a higher limit allows seed extensions to be shorter, improving the odds of exact seed matching without overlapping variants or sequencing errors.

However, as with very short seeds, allowing high hit counts can sometimes hurt mapping accuracy. Most of the seed hits in a large group are not to the true mapping location, and occasionally one of these noise hits may be reported due to imperfect scoring models. Also, the mapper limits the total number of reference positions it considers, and allowing very high hit counts can potentially crowd out the actual best match from consideration.

Speed Considerations

Higher maximum hit frequencies slow down read mapping, because seed mapping finds more reference locations, resulting in more work, such as Smith-Waterman alignments, to determine the best result.

Pangenome Reference

The DRAGEN Software enables the user to build a custom pangenome hash table from a set of population variants. The population variants are specified in a single multi-sample VCF file.

--ht-graph-msvcf-file: Input file containing list of population variants, in multi-sample VCF format.

This replaces the previous options that were previously used to build a graph Reference that are now deprecated.

List of deprecated options :

--ht-pop-alt-contigs: Population based alternate contigs FASTA.
--ht-pop-alt-liftover: Liftover SAM file of population alternate contigs.
--ht-pop-snps: Population based SNPs VCF

ALT-Contigs

The following options control building hash tables from references with ALT-contigs. See References with ALT contigs for more information.

--ht-mask-bed: Set a custom BED file that defines which regions to mask. If not provided, the DRAGEN software automatically applies BED files for hg38 and hg19 from <INSTALL_PATH>/fasta\_mask.
--ht-alt-liftover: Set a liftover file to build a liftover based ALT-aware hash table. SAM liftover files for hg38DH and hg19 are provided in <INSTALL_PATH>/liftover.
--ht-allow-mask-and-liftover: Allow the use of both --ht-mask-bed and --ht-alt-liftover together.
--ht-suppress-mask: Suppress automatic detection of the default mask bed files when building the hash table.

Decoy Contigs

--ht-decoys The DRAGEN software automatically detects the use of hg19 and hg38 references and adds decoys to the hash table when they are not found in the FASTA file. Use the --ht-decoys option to specify the path to a decoys file. The default is <INSTALL_PATH>/liftover/hs\_decoys.fa.
--ht-suppress-decoys: Suppress automatic detection of the default decoys file when building the hash table.

Processing Options

--ht-num-threads The --ht-num-threads option determines the maximum number of worker CPU threads that are used to speed up hash table construction. The default for this option is 8, with a maximum of 32 threads allowed. If your server supports execution of more threads, it is recommended that you use the maximum. For example, the DRAGEN servers contain 24 cores that have hyperthreading enabled, so a value of 32 should be used. When using a higher value, adjust --ht-max-table-chunks needs to be adjusted as well. The servers have 128 GB of memory available.
--ht-max-table-chunks The --ht-max-table-chunks option controls the memory footprint during hash table construction by limiting the number of ~1 GB hash table chunks that reside in memory simultaneously. Each additional chunk consumes roughly twice its size (~2 GB) in system memory during construction. The hash table is divided into power-of-two independent chunks, of a fixed chunk size, X, which depends on the hash table size, in the range 0.5 GB < X ≤ 1 GB. For example, a 24 GB hash table contains 32 independent 0.75 GB chunks that can be constructed by parallel threads with enough memory and a 16 GB hash table contains 16 independent 1 GB chunks. The default is --ht-max-table-chunks equal to --ht-num-threads, but with a minimum default --ht-max-table-chunks of 8. It makes sense to have these two options match, because building one hash table chunk requires one chunk space in memory and one thread to work on it. Nevertheless, there are build-speed advantages to raising --ht-max-table-chunks higher than --ht-num-threads, or to raising --ht-num-threads higher than --ht-max-table-chunks.

Size Options

--ht-mem-limit Memory Limit. The --ht-mem-limit option controls the generated hash table size by specifying the DRAGEN card memory available for both the hash table and the encoded reference genome. The ‑‑ht‑mem-limit option defaults to 32 GB when the reference genome approaches WHG size, or to a generous size for smaller references. Normally there is little reason to override these defaults.
--ht-size Hash Table Size. This option specifies the hash table size to generate, rather than calculating an appropriate table size from the reference genome size and the available memory (option --ht-mem-limit). Using default table sizing is recommended and using --ht-mem-limit is the next best choice.

Seed Population Options

--ht-ref-seed-interval Seed Interval. The --ht-ref-seed-interval option defines the step size between positions of seeds in the reference genome populated into the hash table. An interval of 1 (default) means that every seed position is populated, 2 means 50% of positions are populated, etc. Noninteger values are supported, eg, 2.5 yields 40% populated. Seeds from a whole human reference are easily 100% populated with 32 GB memory on DRAGEN boards. If a substantially larger reference genome is used, change this option.
--ht-soft-seed-freq-cap and --ht-max-dec-factor Soft Frequency Cap and Maximum Decimation Factor for Seed Thinning. Seed thinning is an experimental technique to improve mapping performance in high-frequency regions. When primary seeds have higher frequency than the cap indicated by the --ht-soft-seed-freq-cap option, only a fraction of seed positions are populated to stay under the cap. The --ht-max-dec-factor option specifies a maximum factor by which seeds can be thinned. For example, --ht-max-dec-factor 3 retains at least 1/3 of the original seeds. --ht-max-dec-factor 1 disables any thinning. Seeds are decimated in careful patterns to prevent leaving any long gaps unpopulated. The idea is that seed thinning can achieve mapped seed coverage in high frequency reference regions where the maximum hit frequency would otherwise have been exceeded. Seed thinning can also keep seed extensions shorter, which is also good for successful mapping. Based on testing to date, seed thinning has not proven to be superior to other accuracy optimization methods.
--ht-rand-hit-hifreq and --ht-rand-hit-extend Random Sample Hit with HIFREQ Record and EXTEND Record. Whenever a HIFREQ or EXTEND record is populated into the hash table, it stands in place of a large set of reference hits for a certain seed. Optionally, the hash table builder can choose a random representative of that set, and populate that HIT record alongside the HIFREQ or EXTEND record. Random sample hits provide alternative alignments that are very useful in estimating MAPQ accurately for the alignments that are reported. They are never used outside of this context for reporting alignment positions, because that would result in biased coverage of locations that happened to be selected during hash table construction. To include a sample hit, set --ht-rand-hit-hifreq to 1. The --ht-rand-hit-extend option is a minimum pre-extension hit count to include a sample hit, or zero to disable. Modifying these options is not recommended.

Seed Extension Control

DRAGEN seed extension is dynamic, applied as needed for particular K-mers that map to too many reference locations. Seeds are incrementally extended in steps of 2--14 bases (always even) from a primary seed length to a fully extended length. The bases are appended symmetrically in each extension step, determining the next extension increment if any.

There is a potentially complex seed extension tree associated with each high frequency primary seed. Each full tree is generated during hash table construction and a path from the root is traced by iterative extension steps during seed mapping. The hash table builder employs a dynamic programming algorithm to search the space of all possible seed extension trees for an optimal one, using a cost function that balances mapping accuracy and speed. The following options define that cost function:

--ht-target-seed-freq Target Hit Frequency. The --ht-target-seed-freq option defines the ideal number of hits per seed for which seed extension should aim. Higher values lead to fewer and shorter final seed extensions, because shorter seeds tend to match more reference positions.
--ht-cost-coeff-seed-len Cost Coefficient for Seed Length The --ht-cost-coeff-seed-len option assigns the cost component for each base by which a seed is extended. Additional bases are considered a cost because longer seeds risk overlapping variants or sequencing errors and losing their correct mappings. Higher values lead to shorter final seed extensions.
--ht-cost-coeff-seed-freq Cost Coefficient for Hit Frequency. The --ht-cost-coeff-seed-freq option assigns the cost component for the difference between the target hit frequency and the number of hits populated for a single seed. Higher values result primarily in high-frequency seeds being extended further to bring their frequencies down toward the target.
--ht-cost-penalty Cost Penalty for Seed Extension. The --ht-cost-penalty option assigns a flat cost for extending beyond the primary seed length. A higher value results in fewer seeds being extended at all. Current testing shows that zero (0) is appropriate for this parameter.
--ht-cost-penalty-incr Cost Increment for Extension Step. The --ht-cost-penalty-incr option assigns a recurring cost for each incremental seed extension step taken from primary to final extended seed length. More steps are considered a higher cost because extending in many small steps requires more hash table space for intermediate EXTEND records, and takes substantially more run time to execute the extensions. A higher value results in seed extension trees with fewer nodes, reaching from the root primary seed length to leaf extended seed lengths in fewer, larger steps.

Pipeline Specific Hash Tables

RNA-Seq

When building a hash table, DRAGEN configures the options for DNA analysis by default. To run RNA-Seq data, you must build an RNA-Seq hash table by setting --ht-build-rna-hashtable to true. If running RNA-Seq alignment, use the original --output-directory instead of the automatically generated subdirectory.

CNV

If using the CNV pipeline, set --ht-build-cnv-hashtable to true. The command generates an additional Kmer hash map that is used in the CNV algorithm. Illumina recommends to always use the --ht-build-cnv-hashtable option, so you can perform CNV calling with the same hash table used for mapping and aligning.

Methylation

To run the methylation pipeline, you must build a methylation-specific hash table. DRAGEN can build a single-pass or legacy multi-pass methylation hash table. Methylation runs using a single-pass hash table are completed faster than the legacy multipass hash tables. Single-pass hash tables are recommended for building methylation tables and running analyses.

Single-pass

The following is an example of a single-pass hash table build. The example generates a combined hash table in your reference index folder under the methyl_converted subdirectory.

dragen --build-hash-table true \ --output-directory $REFDIR \ --ht-reference $FASTA \ --ht-num-threads 40 \ --ht-methylated-combined=true \ --ht-seed-len 27

Multipass

Multi-pass methylation mapping requires building two special hash tables with reference bases converted from C to T in one table and G to A in the other table. The conversions are performed automatically when using the --ht-methylated command line option. The converted hash tables are generated in two subdirectories under the folder specified using the --output-directory command line option. The subdirectories are named CT_converted and GA_converted, corresponding with the base conversions. When using the hash tables for methylated alignment runs, make sure to refer to the --output-directory folder, not the subdirectories.

The base conversions remove a significant amount of information from the hash tables. You might need to use different hash table parameters than in a conventional hash table build. The following options are recommended for building hash tables for mammalian species.

dragen --build-hash-table=true --output-directory $REFDIR --ht-reference $FASTA --ht-max-seed-freq 16 --ht-seed-len 27 --ht-num-threads 40 --ht-methylated=true

HLA

To run the HLA caller, an HLA-specific anchored reference hash table must be built. Set --ht-build-hla-hashtable to true. The command will create a anchored_hla subdirectory inside the --output-directory. The HLA-specific reference subdirectory can be built at the same time as the primary reference construction.

DRAGEN DNA Pipeline

The DRAGEN DNA Pipeline accelerates the secondary analysis of NGS data by harnessing the tremendous power available on the DRAGEN Platform. The pipeline includes highly optimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling. In addition to haplotype variant calling, the pipeline supports calling of copy number and structural variants as well as detection of repeat expansions and targeted calls.

DNA Mapping

Seed Density Option

The seed-density option controls how many (normally overlapping) primary seeds from each read the mapper looks up in its hash table for exact matches. The maximum density value of 1.0 generates a seed starting at every position in the read, ie, (L-K+1) K-base seeds from an L-base read.

Seed density must be between 0.0 and 1.0. Internally, an available seed pattern equal or close to the requested density is selected. The sparsest pattern is one seed per 32 positions, or density 0.03125.

Accuracy Considerations--Generally, denser seed lookup patterns improve mapping accuracy. However, for modestly long reads (eg, 50 bp+) and low sequencer error rates, there is little to be gained beyond the default 50% seed lookup density.
Speed Considerations--Denser seed lookup patterns generally slow down mapping, and sparser seed patterns speed it up. However, when the seed mapping stage can run faster than the aligning stage, a sparser seed pattern does not make the mapper much faster.

Relationship to Reference Seed Interval

Functionally, a denser or sparser seed lookup pattern has an impact very similar to a shorter or longer reference seed interval (build hash table option --ht-ref-seed-interval). Populating 100% of reference seed positions and looking up 50% of read seed positions has the same effect as populating 50% of reference seed positions and looking up 100% of read seed positions. Either way, the expected density of seed hits is 50%.

More generally, the expected density of seed hits is the product of the reference seed density (the inverse of the reference seed interval) and the seed lookup density. For example, if 50% of reference seeds are populated and 33.3% (1/3) of read seed positions are looked up, then the expected seed hit density should be 16.7% (1/6).

DRAGEN automatically adjusts its precise seed lookup pattern to ensure it does not systematically miss the seed positions populated from the reference. For example, the mapper does not look up seeds matching only odd positions in the reference when only even positions are populated in the hash table, even if the reference seed interval is 2 and seed-density is 0.5.

Map Orientations Option

The --Mapper.map-orientations option is used in mapping reads for bisulfite methylation analysis. It is set automatically based on the value set for ‑‑methylation-protocol.

The --Mapper.map-orientations option can restrict the orientation of read mapping to only forward in the reference genome, or only reverse-complemented. The valid values for --map-orientations are as follows.

0--Either orientation (default)
1--Only forward mapping
2--Only reverse-complemented mapping

If mapping orientations are restricted and paired end reads are used, the expected pair orientation can only be FR, not FF or RF.

Seed-Editing Options

Although DRAGEN primarily maps reads by finding exact reference matches to short seeds, it can also map seeds differing from the reference by one nucleotide by also looking up single-SNP edited seeds. Seed editing is usually not necessary with longer reads (100 bp+), because longer reads have a high probability of containing at least one exact seed match. This is especially true when paired ends are used, because a seed match from either mate can successfully align the pair. But seed editing can, for example, be useful to increase mapping accuracy for short single-ended reads, with some cost in increased mapping time. The following options control seed editing:

Seed Editing Options

Command-Line Option Name

Configuration File Option Name

--Mappper.seed-density

seed-density

-Mapper.edit-mode

edit-mode

--Mapper.edit-seed-num

edit-seed-num

--Mapper.edit-read-len

edit-read-len

--Mapper.edit-chain-limit

edit-chain-limit

edit-mode and edit-chain-limit

The edit-mode and edit-chain-limit options control when seed editing is used. The following four edit-mode values are available:

Mode

Description

No editing (default)

Chain length test

Paired chain length test

Full seed editing

Edit mode 0 requires all seeds to match exactly. Mode 3 is the most expensive because every seed that fails to match the reference exactly is edited. Modes 1 and 2 employ heuristics to look up edited seeds only for reads most likely to be salvaged to accurate mapping.

The main heuristic in edit modes 1 and 2 is a seed chain length test. Exact seeds are mapped to the reference in a first pass over a given read, and the matching seeds are grouped into chains of similarly aligning seeds. If the longest seed chain (in the read) exceeds a threshold edit-chain-limit, the read is judged not to require seed editing, because there is already a promising mapping position.

Edit mode 1 triggers seed editing for a given read using the seed chain length test. If no seed chain exceeds edit-chain-limit (including if no exact seeds match), then a second seed mapping pass is attempted using edited seeds. Edit mode 2 further optimizes the heuristic for paired-end reads. If either mate has an exact seed chain longer than edit-chain-limit, then seed editing is disabled for the pair, because a rescue scan is likely to recover the mate alignment based on seed matches from one read. Edit mode 2 is the same as mode 1 for single-ended reads.

edit-seed-num and edit-read-len

For edit modes 1 and 2, when the heuristic triggers seed editing, these options control how many seed positions are edited in the second pass over the read. Although exact seed mapping can use a densely overlapping seed pattern, such as seeds starting at 50% or 100% of read positions, most of the value of seed editing can be obtained by editing a much sparser pattern of seeds, even a nonoverlapping pattern. Generally, if a user application can afford to spend some additional amount of mapping time on seed editing, a greater increase in mapping accuracy can be obtained for the same time cost by editing seeds in sparse patterns for a large number of reads, than by editing seeds in dense patterns for a small number of reads.

Whenever seed editing is triggered, these two options request edit-seed-num seed editing positions, distributed evenly over the first edit-read-len bases of the read. For example, with 21-base seeds, edit-seed-num=6 and edit-read-len=100, edited seeds can begin at offsets {0, 16, 32, 48, 64, 80} from the 5' end, consecutive seeds overlapping by 5 bases. Because sequencing technologies often yield better base qualities nearer the (5') beginning of each read, this can focus seed editing where it is most likely to succeed. When a particular read is shorter than edit-read-len, fewer seeds are edited.

Seed editing is more expensive when the reference seed interval (build hash table option ‑-ht‑ref-seed-interval) is greater than 1. For edit modes 1 and 2, additional seed editing positions are automatically generated to avoid missing the populated reference seed positions. For edit mode 3, the time cost can increase dramatically because query seeds matching unpopulated reference positions typically miss and trigger editing.

DNA Aligning

Smith-Waterman Alignment Scoring Settings

The first stage of mapping is to generate seeds from the read and look for exact matches in the reference genome. These results are then refined by running full Smith-Waterman alignments on the locations with the highest density of seed matches. This well-documented algorithm works by comparing each position of the read against all the candidate positions of the reference. These comparisons correspond to a matrix of potential alignments between read and reference. For each of these candidate alignment positions, Smith-Waterman generates scores that are used to evaluate whether the best alignment passing through that matrix cell reaches it by a nucleotide match or mismatch (diagonal movement), a deletion (horizontal movement), or an insertion (vertical movement). A match between read and reference provides a bonus, on the score, and a mismatch or indel imposes a penalty. The overall highest scoring path through the matrix is the alignment chosen.

The specific values chosen for scores in this algorithm indicate how to balance, for an alignment with multiple possible interpretations, the possibility of an indel as opposed to one or more SNPs, or the preference for an alignment without clipping. The default DRAGEN scoring values are reasonable for aligning moderate length reads to a whole human reference genome for variant calling applications. But any set of Smith-Waterman scoring parameters represents an imprecise model of genomic mutation and sequencing errors, and differently tuned alignment scoring values can be more appropriate for some applications.

The following alignment options control Smith-Waterman Alignment:

Command-Line Option Name

Configuration File Option Name

--Aligner.global

global

--Aligner.match-score

match-score

--Aligner.match-n-score

match-n-score

--Aligner.mismatch-pen

mismatch-pen

--Aligner.gap-open-pen

gap-open-pen

--Aligner.gap-ext-pen

gap-ext-pen

--Aligner.unclip-score

unclip-score

--Aligner.no-unclip-score

no-unclip-score

--Aligner.aln-min-score

aln-min-score

--Aligner.min-score-coeff

min-score-coeff

global The global option (value can be 0 or 1) controls whether alignment is forced to be end-to-end in the read. When set to 1, alignments are always end-to-end, as in the Needleman-Wunsch global alignment algorithm (although not end-to-end in the reference), and alignment scores can be positive or negative. When set to 0, alignments can be clipped at either or both ends of the read, as in the Smith-Waterman local alignment algorithm, and alignment scores are nonnegative. Generally, global=0 is preferred for longer reads, so significant read segments after a break of some kind (large indel, structural variant, chimeric read, and so forth) can be clipped without severely decreasing the alignment score. Setting global=1 might not have the desired effect with longer reads because insertions at or near the ends of a read can function as pseudoclipping. Also, with global=0, multiple (chimeric) alignments can be reported when various portions of a read match widely separated reference positions. Using global=1 is sometimes preferable with short reads, which are unlikely to overlap structural breaks, unable to support chimeric alignments, and are suspected of incorrect mapping if they cannot align well end-to-end. Consider using the unclip-score option, or increasing it, instead ofsetting global=1, to make a soft preference for unclipped alignments.
match-score The match-score option specifies the score for a read nucleotide matching a reference nucleotide (A, C, G, or T), or matching a reference 2–3 nucleotide IUPAC-IUB code. Its value is an unsigned integer, from 0 to 15. match_score=0 can only be used when global=1. A higher match score results in longer alignments, and fewer long insertions.
match-n-score The match-n-score option specifies the score for an aligned position where the read position and/or the reference position is an N code. This option is a signed integer, from -16 to 15.
mismatch-pen The mismatch-pen option is the penalty (negative score) for a read nucleotide mismatching any reference nucleotide or IUPAC-IUB code, except N. This option is an unsigned integer, from 0 to 63. A higher mismatch penalty results in alignments with more insertions, deletions, and clipping to avoid SNPs.
gap-open-pen The gap-open-pen option is the penalty (negative score) for opening a gap (ie, an insertion or deletion). This value is only for a 0-base gap. It is always added to the gap length times gap-ext-pen. This option is an unsigned integer, from 0 to 127. A higher gap open penalty causes fewer insertions and deletions of any length in alignment CIGARs, with clipping or alignment through SNPs used instead.
gap-ext-pen The gap-ext-pen option is the penalty (negative score) for extending a gap (ie, an insertion or deletion) by one base. This option is an unsigned integer, from 0 to 15. A higher gap extension penalty causes fewer long insertions and deletions in alignment CIGARs, with short indels, clipping, or alignment through SNPs used instead.
unclip-score The unclip-score option is the score bonus for an alignment reaching the beginning or end of the read. An end-to-end alignment receives twice this bonus. This option is an unsigned integer, from 0 to 127. A higher unclipped bonus causes alignment to reach the beginning and/or end of a read more often, where this can be done without too many SNPs or indels. A nonzero unclip-score is useful when global=0 to make a soft preference for unclipped alignments. Unclipped bonuses have little effect on alignments when global=1, because end-to-end alignments are forced anyway (although 2 × unclip-score does add to every alignment score unless no-unclip-score = 1). Note that, especially with longer reads, setting unclip-score much higher than gap-open-pen can have the undesirable effect of insertions at or near one end of a read being utilized as pseudoclipping, as happens with global=1
no-unclip-score The no-unclip-score option can be 0 or 1. The default is 1. When no-unclip-score is set to 1, any unclipped bonus (unclip-score) contributing to an alignment is removed from the alignment score before further processing, such as comparison with aln-min-score, comparison with other alignment scores, and reporting in AS or XS tags. However, the unclipped bonus still affects the best-scoring alignment found by Smith-Waterman alignment to a given reference segment, biasing toward unclipped alignments When unclip-score > 0 causes a Smith-Waterman local alignment to extend out to one or both ends of the read, the alignment score stays the same or increases if no-unclip-score=0, whereas it stays the same or decreases if no-unclip-score=1. The default, no-unclip-score=1, is recommended when global=1, because every alignment is end-to-end, and there is no need to add the same bonus to every alignment. When changing no-unclip-score, consider whether aln-min-score should be adjusted. When no-unclip-score=0, unclipped bonuses are included in alignment scores compared to the aln-min-score floor, so the subset of alignments filtered out by aln-min-score can change significantly with no-unclip-score.
aln-min-score The aln-min-score option specifies a minimum acceptable alignment score. Any alignment results scoring lower are discarded. Increasing or decreasing aln-min-score can reduce or increase the percentage of reads mapped. This option is a signed integer (negative alignment scores are possible with global=0). aln-min-score also affects MAPQ estimates. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores. A read's best alignment score is saved in the AS SAM tag, and the second-best score (if available) is saved in the XS tag. aln-min-score serves as the suboptimal alignment score if nothing higher was found except the best score. Therefore, increasing aln-min-score can decrease reported MAPQ for some low-scoring alignments. You can use the min-score-coeff option to adjust aln-min-score as a function of read length.
min-score-coeff The min-score-coeff option makes adjustments to aln-min-score per read base. When using the min-score-coeff and aln-min-score options together, you can define the minimum alignment score for each read as an affine function of read lengths. The minimum score for an N-base read is calculated as follows: (min-score-coeff)\*N+(aln-min-score) The min-score-coeff option is an integer ranging from –64 to 63.999. If the value is 0, then the minimum alignment score is fixed at aln-min-score for all read length. You can use positive values for min-score-coeff to allow shorter reads to match with lower alignment scores, but require longer reads to achieve higher scores.

Paired-End Options

DRAGEN can process paired-end data passed via a pair of FASTQ files or in a single interleaved FASTQ file. The hardware maps the two ends separately, and then determines a set of alignments that seem most likely to form a pair in the expected orientation and having roughly the expected insert size. The alignments for the two ends are evaluated for the quality of their pairing, with larger penalties for insert sizes far from the expected size. The following options control processing of paired-end data:

Reorientation The pe-orientation option specifies the expected paired-end orientation. Only pairs with this orientation can be flagged as proper pairs. Valid values are as follows:
- 0--FR (default)
- 1--RF
- 2--FF
unpaired-pen For paired end reads, best mapping positions are determined jointly for each pair, according to the largest pair score found, considering the various combinations of alignments for each mate. A pair score is the sum of the two alignment scores minus a pairing penalty, which estimates the unlikelihood of insert lengths further from the mean insert than this aligned pair. The unpaired-pen option specifies how much alignment pair scores should be penalized when the two alignments are not in properly paired position or orientation. This option also serves as the maximum pairing penalty for properly paired alignments with extreme insert lengths. The unpaired-pen option is specified in Phred scale, according to its potential impact on MAPQ. Internally, it is scaled into alignment score space based on Smith-Waterman scoring parameters.
pe-max-penalty

The pe-max-penalty option limits how much the estimated MAPQ for one read can increase because its mate aligned nearby. A paired alignment is never assigned MAPQ higher than the MAPQ that it would have received mapping single-ended, plus this value. By default, pe-max-penalty = mapq-max = 255, effectively disabling this limit. The key difference between unpaired-pen and pe-max-penalty is that unpaired-pen affects calculated pair scores and thus which alignments are selected and pe-max-penalty affects only reported MAPQ for paired alignments.

Mean Insert Size Detection

When working with paired-end data, DRAGEN must choose among the highest-quality alignments for the two ends to try to choose likely pairs. To make this choice, DRAGEN uses a skew normal insert model to evaluate the likelihood that a pair of alignments constitutes a pair. This model is based on the observation that common library preparation methods have insert-size distributions that are sometimes close to normal, but also sometimes clearly asymmetric, often skewing toward longer insert sizes. The skew normal insert model is used only for the DNA mode.

If you know the statistics of your library prep for an input file (and the file consists of a single read group), you can specify the characteristics of the insert-length distribution: mean, standard deviation, shape (or skewness) and three quartiles. These characteristics can be specified with the Aligner.pe-stat-mean-insert, Aligner.pe-stat-stddev-insert, Aligner.pe-stat-shape-insert, Aligner.pe-stat-quartiles-insert, and Aligner.pe-stat-mean-read-len options. However, it is typically preferable to allow DRAGEN to detect these characteristics automatically.

Dragen automatically samples the insert-length distribution. When the software starts execution, it runs a sample of up to 2,000,000 pairs through the aligner, calculates the distribution, and then uses the resulting statistics for evaluating all pairs in the input set.

The DRAGEN host software reports the statistics in its stdout log in a report, as follows:

Initial paired-end statistics detected for read group RGID, based on 39042 high quality pairs for FR orientation
        Quartiles (25 50 75) = 398 409 420
        Mean = 410.192
        Standard deviation = 14.1254
        NOTE: DRAGEN's insert estimates include corrections for clipping (so they are not identical to TLEN)

        Skew-normal insert distribution applied:
          Position (xi) = 424.084
          Scale (omega) = 19.8719
          Shape (alpha) = -1.88125

        To rerun with identical insert stats, specify:
          --Aligner.pe-stat-mean-insert=424.084
          --Aligner.pe-stat-stddev-insert=19.8719
          --Aligner.pe-stat-shape-insert=-1.88125
          --Aligner.pe-stat-quartiles-insert="398 409 420"
          --Aligner.pe-stat-mean-read-len=101

Note that the Mean, Standard deviation and Quartiles reported above are the sample mean, standard deviation and quartiles calculated from the initial sample of up to 2,000,000 pairs, assuming a normal distribution. The sample mean and standard deviation are used to fit the parameters of a skew-normal distribution. A skew-normal distribution is defined by starting with an underlying normal distribution (whose mean we call position or xi and standard deviation we call scale or omega) and folding a varying portion of the probability mass from one side of the mean (e.g., left side) to the other (e.g., right) side. The portion folded varies smoothly, from 0% at the original mean, approaching 100% from the left tail to the right tail. A shape parameter which we call alpha controls how rapidly the folded fraction increases, and at alpha=0 there is no folding and the distribution remains normal.

In the standard output, we also include the command line options needed to reproduce the DRAGEN run with the same insert stat settings. Note that when specifying stats on the command line, the skew-normal xi value should be used for Aligner.pe-stat-mean-insert. The omega value should be used for Aligner.pe-stat-stddev-insert, and the alpha value should be used for Aligner.pe-stat-shape-insert. If Aligner-pe-stat-shape-insert is not specified on the command line, a default value of 0 is assumed.

The insert length distribution for each sample is written to fragment_length_hist.csv. Each sample starts with the following lines

 #Sample: sample name
 FragmentLength,Count

These lines are followed by the histogram for the first ~2M read pairs for DNA (~100K read pairs for RNA). The histogram counts are aggregated across all read groups sharing the same sample id (RGSM field).

When the number of sample pairs is very small, there is not enough information to characterize the distribution with high confidence. In this case, DRAGEN applies default statistics that specify a very wide insert distribution, which tends to admit pairs of alignments as proper pairs, even if they may lie tens of thousands of bases apart. In this situation, DRAGEN outputs a message, as follows:

WARNING: Less than 28 high quality pairs found - standard deviation is
calculated from the small samples formula

The small samples formula calculates standard deviation as follows:

 if samples < 3 then                                                     
      standard deviation = 10000                                          
 else if samples < 28 then                                               
    standard deviation = 25 * (standard deviation + 1) / (samples - 2) 
 end if                                                                   
                                                                          
 if standard deviation < 12 then                                         
      standard deviation = 12                                             
 end if

The default model is "standard deviation = 10000". If the first 2M reads are unmapped or if all pairs are improper pairs, then the standard deviation is set to 10000 and the mean and quartiles are set to 0. Note that the minimum value for standard deviation is 12, which is independent of the number of samples. Also, in the DNA mode when we have fewer than 1000 high quality alignments we revert to the normal distribution based insert model, because of insufficient number of samples to accurately estimate the parameters of the skew normal distribution.

For RNA-Seq data, the insert size distribution is not normal due to pairs containing introns. The DRAGEN software estimates the distribution using a kernel density estimator to fit a long tail to the samples. This estimate leads to a more accurate mean and standard deviation for RNA-Seq data and proper pairing.

DRAGEN writes detected paired-end stats into a tab-delimited log file in the output directory called .insert-stats.tab. This file contains the statistical distribution of detected insert sizes for each read group, including quartiles, mean, standard deviation, shape, minimum, and maximum. The information matches the standard-out report above. Additionally, the log file includes the minimum and maximum insert limits that DRAGEN applied for rescue scans. Note that the reported mean and standard deviation in this tab-limited log file are the xi and omega parameters of the skew-normal distribution.

Rescue Scans

For paired-end reads, where a seed hit is found for one mate but not the other, rescue scans hunt for missing mate alignments within a rescue radius of the mean insert length. Normally, the DRAGEN host software sets the rescue radius to 2.5 standard deviations of the empirical insert distribution. But in cases where the insert standard deviation is large compared to the read length, the rescue radius is restricted to limit mapping slowdowns. In this case, a warning message is displayed, as follows:

Rescue radius = 220
     Effective rescue sigmas = 0.5
            WARNING: Default rescue sigmas value of 2.5 was overridden by host software!
            The user may wish to set rescue sigmas value explicitly with --Aligner.rescue-sigmas

Although the user can ignore this warning, or specify an intermediate rescue radius to maintain mapping speed, it is recommended to use 2.5 sigmas for the rescue radius to maintain mapping sensitivity. To disable rescue scanning, set max-rescues to 0.

Output Options

DRAGEN can track multiple independent alignments for each read. These alignments include the optimal (primary) one, as well as those mapping different subsegments of the read, (chimeric/supplementary), and sub-optimal (secondary) mappings of the read to different areas of the reference.

For DNA alignment by default, DRAGEN can emit one primary alignment for each read, up to three chimeric alignments (Aligner.supp-aligns=3), and no secondary alignments (Aligner.sec-aligns=0). The maximum user-specified value for supp-aligns or sec-aligns is 4095.

You can use the following configuration options to control how many of each type of alignment to include in DRAGEN output.

mapq-max The mapq-max option specifies a ceiling on the estimated MAPQ that can be reported for any alignment, from 0 to 255. If the calculated MAPQ is higher, this value is reported instead. The default is 60.
supp-aligns, sec-aligns The supp-aligns and sec-aligns options restrict the maximum number of supplementary (ie, chimeric and SAM FLAG 0x800) alignments and secondary (ie, suboptimal and SAM FLAG 0x100) alignments, respectively, that can be reported for each read. A maximum of 4095 supplementary alignments and 4095 secondary alignments can be reported for any read, in addition to a primary alignment. High settings for these two options impact speed so it is advisable to increase only as needed.
sec-phred-delta The sec-phred-delta option controls which secondary alignments are emitted based on the alignment score relative to the primary reported alignment. Only secondary alignments with likelihood within this Phred value of the primary are reported.
sec-aligns-hard The sec-aligns-hard option suppresses the output of all secondary alignments if there are more secondary alignments than can be emitted. Set sec-aligns-hard to 1 to force the read to be unmapped when not all secondary alignments can be output.
supp-as-sec When the supp-as-sec option is set to 1, then supplementary (chimeric) alignments are reported with SAM FLAG 0x100 instead of 0x800. The default is 0. The supp-as-sec option provides compatibility with tools that do not support FLAG 0x800.
hard-clips The hard-clips option is used as a field of 3 bits, with values ranging from 0 to 7. The bits specify alignments, as follows:
- Bit 0--primary alignments
- Bit 1--supplementary alignments
- Bit 2--secondary alignments

Each bit determines whether local alignments of that type are reported with hard clipping (1) or soft clipping (0). The default is 6, meaning primary alignments use soft clipping and supplementary and secondary alignments use hard clipping.

Mapping with ALT-contigs

The GRCh38 human reference contains many more alternate haplotypes (ALT contigs) than previous versions of the reference. Generally, including ALT contigs in the mapping reference improves mapping and variant calling specificity, because misalignments are eliminated for reads matching an ALT contig but scoring poorly against the primary assembly. However, mapping with GRCh38's ALT contigs without special treatment can substantially degrade variant calling sensitivity in corresponding regions, because many reads align equally well to an ALT contig and to the corresponding position in the primary assembly.

Masked Based ALT-awareness

The recomeneded and default approach for dealing with ALT-contigs in DRAGEN is masking regions of ALT contigs of high similarity to their corresponding primary contig. This approach is more accurate than liftover based ALT-awarness because there are many places where the "correct" or most useful liftover between a long ALT haplotype and the primary assembly is ambiguous. Incorrect liftover can produce dense clusters of mismapped reads and false variant calls. The base masking approach has the benefits of using ALT contigs without the negative consequences.

Masked hash tables are built from a standard hg18 or hg38 FASTA that contains ALT contigs. The hash table builder will automatically mask regions of the ALT contigs with Ns.

Liftover Based ALT-awarness

With liftover based ALT-awareness, the mapper and aligner are aware of the liftover relationship between ALT contig positions and corresponding primary assembly positions. Seed matches within ALT contigs are used to obtain corresponding primary assembly alignments, even if the latter score poorly. Liftover groups are formed, each containing a primary assembly alignment candidate, and zero or more ALT alignment candidates that lift to the same location. Each liftover group is scored according to its best-matching alignments, taking properly paired alignments into account. The winning liftover group provides its primary assembly representative as the primary output alignment, with MAPQ calculated based on the score difference to the second-best liftover group. Emitting primary alignments within the primary assembly maintains normal aligned coverage and facilitates variant calling there. If the --Aligner.en-alt-hap-aln option is set to 1 and --Aligner.supp-aligns is greater than 0, then corresponding alternate haplotype alignments can also be output, flagged as supplementary alignments.

DRAGEN requires ALT-Aware hash tables for any hg19 or GRCh38 reference where ALT contigs are detected. To disable this requirement in DRAGEN, set the --ht-alt-aware-validate option to false.

The following is a comparison of alternative options for dealing with alternate haplotypes.

Mapping without ALT contigs in the reference:
- False-positive variant calls result when reads matching an alternate haplotype misalign somewhere else.
- Poor mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.
Mapping with ALT contigs but no ALT awareness:
- False-positive variant calls from misaligned reads matching ALT contigs are eliminated.
- Low or zero aligned coverage in primary assembly regions covered by alternate haplotypes, due to some reads mapping to ALT contigs.
- Low or zero MAPQ in regions covered by alternate haplotypes, where they are similar or identical to the primary assembly.
- Variant calling sensitivity is dramatically reduced throughout regions covered by alternate haplotypes.
Mapping with ALT contigs and ALT awareness:
- False-positive variant calls from misaligned reads matching ALT contigs are eliminated.
- Normal aligned coverage in regions covered by alternate haplotypes because primary alignments are to the primary assembly.
- Normal MAPQs are assigned because alignment candidates in alternative haplotypes are not considered in competition.
- Good mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.

DRAGEN Multigenome Mapper

The Multigenome Mapper in DRAGEN significantly improves the accuracy of mapping Illumina reads, particularly in challenging regions such as segmental duplications and other difficult to map regions. This advanced method leverages population haplotypes from pangenome references to incorporate additional variant information, constructing alternative haplotype paths that improve reads mapping. By offering these alternate paths, the Multigenome Mapper enables reads containing population-specific variants to align directly to their most likely genomic locations, reducing mapping ambiguity. This improved mapping also results in improved variant calling accuracy.

When given a set of population variants (VCF) or haplotypes, the pangenome reference modification is categorized in the following types:

Alternate contigs represent population haplotypes. Alt-contigs can have a single variant or a combination of nearby phased variants.
Ambiguous codes (IUPAC codes) to represent SNPs. To improve alignment, it edits the reference FASTA with isolated population SNPs.
Haplotype database. An additional haplotype database is built and used to augment the reference FASTA with population variants. A multigenome based mapper algorithm is used to score read alignment according to the variants in this database.

The DRAGEN pangenome hashtables are available to download from the DRAGEN Software Support Site page.

Read Trimming

DRAGEN can remove artifacts from reads using hardware accelerated read trimming. Hardware accelerated read trimming is available on U200 and cloud systems, as part of the DRAGEN mapper and adds no additional run time. DRAGEN provides multiple independent trimming filters that target different types of artifacts or use cases. You can enable and configure the artifacts or use cases independently to tailor the read-trimming to your analysis. Read trimming uses two different modes, hard-trimming and soft-trimming.

To enable hard-trimming mode, use --read-trimmers. In hard-trimming mode, potential artifacts are removed from input reads. Reads that are trimmed to fewer than 20 bases are filtered and replaced with a placeholder read that uses 10 N bases. DRAGEN assigns the filtered reads a 0x200 flag set.

DRAGEN contains a novel lossless soft-trimming mode. In soft-trimming mode, reads are mapped as though they had been trimmed, but no bases are removed. To enable the trimmer in soft mode, use --soft-read-trimmers.

Soft-trimming suppresses systematic mismapping of reads that contain trimmable artifacts, without actually losing the trimmed bases in aligned output. Soft-trimming prevents reads with trimmable artifacts, such as Poly-G artifacts, from being mapped to reference G homopolymers, or prevents adapter sequences from being mapped to the matching reference loci. Soft-trimming might map reads to different positions in the reference than they would have been if not using soft-trimming. When using soft-trimmed, DRAGEN does not filter reads and does not map reads with bases that would have been trimmed entirely.

Soft-trimming for Poly-G artifacts is enabled by default on supported systems.

Read Trimming Tools

Fixed-Length Trimming

Fixed-length trimming removes a fixed number of bases from the 5' end of each read. If you are analyzing sequencing data from an amplicon of fixed size and expect the read-length to consistently exceed the length of quality sequence data, you can use the expected number in fixed-length trimming.

Poly-G Trimming

Poly-G artifacts appear on two-channel sequencing systems when the dark base G is called after synthesis has terminated. As a result, DRAGEN calls several erroneous high-confidence G bases on the ends of affected reads. For contaminated samples, many affected reads can be mapped to reference regions with high G content. The affected reads can cause problems for processing downstream.

Quality Trimming

Base quality can degrade over the length of a read toward the 5' end and separate from any artifacts from early termination of synthesis. The lower quality bases can affect mapping and alignment results, and might lead to incorrect variant or methylation calls downstream. The quality trimming tool calculates a rolling average of the base quality inward from the 5' end and removes the minimum number of bases, so the average number of bases is above the threshold specified using --trim-min-quality.

Adapter Trimming

Problems during library preparation, or libraries with smaller inserts can result in the synthesis of high quality reads containing sequence from the adapters used. If not removed before analysis, noninsert bases can reduce mapping efficiency and downstream accuracy. The adapter trimming tool uses the adapter sequences from the input FASTA file, and then removes all hits greater than a specified size. Adapter trimming allows for a 10% mismatch. For 3' adapters, trimming is from the first matching adapter base to the end of the read. For 5' adapters, trimming is from the first (3') matching adapter base to the beginning (5') of the read.

Ambiguous Base Trimming

If quality trimming is not feasible due to reduced yield or other limitations, an alternative option is to remove only explicitly ambiguous bases from the ends of read. If enabled the ambiguous base trimmer applies a simple exact-match search to both ends of all processed reads, regardless of mate-pair status.

Minimum Length Trimming

You can maximize trimmer sensitivity, by using the minimum length trimming tool to remove a fixed number of bases from each read after the trimmer tools above have run. For example, if you would like to remove 5 bp from each read, a 7 bp adapter hit could be missed if five of the bases are removed first. To mitigate this issue, DRAGEN provides an optional minimum trim-length filter.

Maximum Length Trimming

If using libraries of fixed-size inserts, such as small PCR amplicons, it is more convenient to specify a length that all reads should be trimmed to rather than the number of bases to remove. You can use the maximum length trimming tool.

PolyA Tail Trimming

If using RNA libraries, reads overlapping the poly-A tail of the transcripts may contain long poly-A/poly-T sequences at the end of the reads which may result in incorrect alignment. The poly-A trimmer mitigates this by trimming the poly-A tail from the end of the read. See additional description in RNA alignment section.

Read Trimming Metrics

The trimmer generates a metrics file titled \<output prefix\>.trimmer_metrics.csv. Metrics are available on an aggregate level over all input data. The metrics units are in reads or bases.

Total input reads Total number of reads in the input files.
Total input bases Total number of bases in the input reads.
Total input bases R1 Total number of bases in R1 reads.
Total input bases R2 Total number of bases in R2 reads.
Average input read length Total number of input bases divided by the number of input reads.
Total trimmed reads Total number of reads trimmed by at least one base, not including soft-trimming.
Total trimmed bases Total number of bases trimmed, not including soft-trimming.
Average bases trimmed per read The number of trimmed bases divided by the number of input reads.
Average bases trimmed per trimmed read The number of trimmed bases divided by the number of trimmed reads.
Remaining poly-G K-mers R1 3prime The number of R1 3' read ends that contain likely Poly-G artifacts after trimming.
Remaining poly-G K-mers R2 3prime The number of R2 3' read ends that contain likely Poly-G artifacts after trimming.
Total filtered reads The number of reads that were filtered out during trimming.
Reads filtered for minimum read length R1 The number of R1 reads that were filtered due to being trimmed below the minimum read length.
Reads filtered for minimum read length R2 The number of R2 reads that were filtered due to being trimmed below the minimum read length.
<Trimmer tool> trimmed reads The number of reads with at least one base trimmed by TRIMMER. DRAGEN reports the metric for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes reads that were trimmed during soft-trimming. Each trimming tool above produces the metric.
<Trimmer tool> trimmed bases The number of bases trimmed by TRIMMER. The metric is produced for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes bases from reads that were trimmed during soft trimming. Each trimming tool above produces the metric.

Read Trimming Settings

Read trimmer

Option

Description

--read-trimmers

To enable trimming filters in hard-trimming mode, set to a comma-separated list of the trimmer tools you would like to use (in the order of execution). To disable trimming, set to none. During mapping, artifacts are removed from all reads. The following are valid trimmer names:

fixed-len—Fixed-length trimming
polyg—Poly-G trimming
quality—Quality trimming
adapter—Adapter trimming
n—Ambiguous base trimming
min-len—Minimum length trimming
cut-end—Maximum length trimming
bisulfite—Bisulfite trimming

Read trimming is disabled by default (default: "none").

--soft-read-trimmers

To enable trimming filters in soft-trimming mode, set to a comma-separated list of the trimmer tools you would like to use (in the order of execution). To disable soft trimming, set to none. During mapping, reads are aligned as if trimmed, and bases are not removed from the reads. The following are the valid trimmer names.

fixed-len—Fixed-length trimming
polyg—Poly-G trimming
quality—Quality trimming
adapter—Adapter trimming
n—Ambiguous base trimming
min-len—Minimum length trimming
cut-end—Maximum length trimming
bisulfite—Bisulfite trimming

Soft-trimming is enabled for the polyg filter by default (default: "polyg").

--trimming-only

Disables mapping and alignment to run read-trimming only.

Filtering after the trimmer's execution

Option

Description

--trim-min-length

Specify a minimum read length allowed after the trimmer execution. DRAGEN filters any reads with a length less than the value after all read-trimming steps are completed (default: 20).

--trim-min-len-read1

Specify a minimum read length allowed for read1 after the trimmer execution. DRAGEN filters any reads with a length of read1 less than the value after all read-trimming steps are completed (default: 20).

--trim-min-len-read2

Specify a minimum read length allowed for read2 after the trimmer execution. DRAGEN filters any reads with a length of read2 less than the value after all read-trimming steps are completed (default: 20).

--trim-filter-dummy-len

Specify the number of N bases in dummy reads that replace filtered reads (default: 10).

--trim-filter-set-flag

If enabled, dummy reads will have their 0x200 SAM flag set (default: true).

Fixed-length trimming

Option

Description

--trim-r1-5prime

Specify a fixed number of bases to trim from the 5' end of Read 1 (default: 0).

--trim-r1-3prime

Specify a fixed number of bases to trim from the 3' end of Read 1 (default: 0).

--trim-r2-5prime

Specify a fixed number of bases to trim from the 5' end of Read 2 (default: 0).

--trim-r2-3prime

Specify a fixed number of bases to trim from the 3' end of Read 2 (default: 0).

Quality trimming

Option

Description

--trim-min-quality

Specify the minimum read quality. DRAGEN trims bases from the 3' end of reads with a quality below the value.

--trim-quality-r1-5prime

Specify the quality cutoff below which to trim from the 5' end of read 1.

--trim-quality-r1-3prime

Specify the quality cutoff below which to trim from the 3' end of read 1.

--trim-quality-r2-5prime

Specify the quality cutoff below which to trim from the 5' end of read 2.

--trim-quality-r2-3prime

Specify the quality cutoff below which to trim from the 3' end of read 2.

Adapter trimming

Option

Description

--trim-adapter-read1

Specify the FASTA file that contains adapter sequences to trim from the 3' end of Read 1.

--trim-adapter-read2

Specify the FASTA file that contains adapter sequences to trim from the 3' end of Read 2.

--trim-adapter-r1-5prime

Specify the FASTA file that contains adapter sequences to trim from the 5' end of Read 1. NB: the sequences should be in reverse order (with respect to their appearance in the FASTQ) but not complemented.

--trim-adapter-r2-5prime

Specify the FASTA file that contains adapter sequences to trim from the 5' end of Read 2. NB: the sequences should be in reverse order (with respect to their appearance in the FASTQ) but not complemented.

--trim-adapter-stringency

Specify the minimum number of adapter bases required for trimming (default: 4).

Bisulfite trimming

Option

Description

--trim-bisulfite-ends

Enable both 5-Prime and 3-Prime bisulfite trimming.

--trim-bisulfite-5prime

If a 3' adapter was trimmed, trim an additional 2bp from the 3' end, unless the 5' end matches 'CAA' or 'CGA'".

--trim-bisulfite-3prime

If the 5' end matches 'CAA' or 'CGA', trim the first two of these 5' bases.

Minimum-length trimming

Option

Description

--trim-min-r1-5prime

Specify the minimum number of bases to trim from the 5' end of Read 1 (default: 0).

--trim-min-r1-3prime

Specify the minimum number of bases to trim from the 3' end of Read 1 (default: 0).

--trim-min-r2-5prime

Specify the minimum number of bases to trim from the 5' end of Read 2 (default: 0).

--trim-min-r2-3prime

Specify the minimum number of bases to trim from the 3' end of Read 2 (default: 0).

Maximum-length trimming

Option

Description

--trim-max-length

Specify the maximum number of bases that can be trimmed from the sequences of both reads.

--trim-max-len-read1

Specify the maximum number of bases that can be trimmed from the sequences of read1.

--trim-max-len-read2

Specify the maximum number of bases that can be trimmed from the sequences of read2.

PolyA trimming

Option

Description

--trim-polya-min-trim

The minimum number of poly-As required for polya trimming (default: 20).

PolyG trimming

Option

Description

--trim-polyg-kmer-len

How many bases to check at each read end for poly-G artifact detection (default: 25).

--trim-polyg-kmer-non-g

The maximum number of non-G bases in the K-mer for poly-G artifact detection (default: 2).

--trim-polyg-g-score-r1-5prime

The score for G bases on the 5' end of read 1 (default: 0).

--trim-polyg-g-score-r1-3prime

The score for G bases on the 3' end of read 1 (default: 15).

--trim-polyg-g-score-r2-5prime

The score for G bases on the 5' end of read 2 (default: 0).

--trim-polyg-g-score-r2-3prime

The score for G bases on the 3' end of read 2 (default: 15).

--trim-polyg-min-trim-r1-5prime

The minimum number of G's to trim from the 5' end of read 1 (default: 6).

--trim-polyg-min-trim-r1-3prime

The minimum number of G's to trim from the 3' end of read 1 (default: 6).

--trim-polyg-min-trim-r2-5prime

The minimum number of G's to trim from the 5' end of read 2 (default: 6).

--trim-polyg-min-trim-r2-3prime

The minimum number of G's to trim from the 3' end of read 2 (default: 6).

--trim-polyg-early-exit-threshold

The signed score threshold for poly-G trimming to exit early (default: -500).

PolyX trimming

Option

Description

--trim-polyx-bases-r1-5prime

The bases to trim for polyX trimming from the 5' end of read 1 (default: empty string "" ).

--trim-polyx-bases-r1-3prime

The bases to trim for polyX trimming from the 3' end of read 1 (default: empty string "" ).

--trim-polyx-bases-r2-5prime

The bases to trim for polyX trimming from the 5' end of read 2 (default: empty string "" ).

--trim-polyx-bases-r2-3prime

The bases to trim for polyX trimming from the 3' end of read 2 (default: empty string "" ).

--trim-polyx-min-trim-r1-5prime

The minimum number of X's to trim from the 5' end of read 1 (default: 20).

--trim-polyx-min-trim-r1-3prime

The minimum number of X's to trim from the 3' end of read 1 (default: 20).

--trim-polyx-min-trim-r2-5prime

The minimum number of X's to trim from the 5' end of read 2 (default: 20).

--trim-polyx-min-trim-r2-3prime

The minimum number of X's to trim from the 3' end of read 2 (default: 20).

DRAGEN FASTQC

DRAGEN FastQC is a tool for calculating common metrics used for quality control of high-throughput sequencing data. The tool is modeled after the metrics generated by Babraham Institute's FastQC tool.

The metrics are generated automatically on all DRAGEN map-align workflows with no additional run time and output in a CSV format file called \<PREFIX\>.fastqc_metrics.csv. All metrics are calculated and reported separately for each mate-pair.

For users only interested in sample QC or would like to obtain FastQC results only, DRAGEN provides a mode to generate the fastqc_metrics.csv file directly.

By default DRAGEN FastQC and read-trimming are run as preprocessing steps to standard sequence alignment workflows. If DNA alignment is not needed or if QC results are needed more quickly, the mapping and BAM output portions of the workflow can be disabled. The workflow only outputs key metric files and runs ~70% faster. This option is available on the command-line by entering --fastqc-only=true after the DRAGEN command.

If FastQC runs stand-alone, then the license will not be consumed. If FastQC runs with map-align enabled, then the license will be consumed.

Differences from the Babraham Institutes' FastQC

DRAGEN FastQC is a complete reimplementation of the original FastQC tool developed by the Babraham Institute (henceforth BI-FastQC). The reimplementation of FastQC in DRAGEN, however, has been modified to take advantage of the hardware-acceleration provided by the DRAGEN Field-Programmable Gate Array (FPGA) for a significant speed improvement. As such, there are some differences in how the values are calculated and the resulting metrics will not be exactly identical between the two tools. The most significant differences are described below.

Binning: BI-FastQC uses a customizable binning strategy with a default of 5bp bins, while DRAGEN uses an algorithmic binning strategy based on the Granularity setting described below. In general, this should mean that DRAGEN provides more precise results at default settings.
Outputs: BI-FastQC text output contain the same information as their plots in tabular format, while DRAGEN-FastQC outputs it's raw data. For example, BI-FastQC both plots an outputs the average base quality per-position, while DRAGEN outputs the average base quality by both position and nucleotide. This allows for a more detailed analysis of the data, but requires slightly more work to generate the associated plot.
Rounding: DRAGEN consistently rounds it's calculations to the nearest integer, while the original FastQC uses a mixture of rounding and taking the mathematical floor, leading DRAGEN-FastQC to provide incrementally higher results for some metrics.
Smoothing: Both DRAGEN-FastQC and BI-FastQC utilize smoothing techniques for their distributions of %GC, to account for the fact that 151bp do not divide evenly into 100 percentile bins. However, to take advantage of the speed offered by the FPGA, DRAGEN utilizes a slightly different algorithm than BI-FastQC which results in slightly different results.

Metric Granularity

It is not possible due to memory constraints to guarantee single-base resolution for all metrics. DRAGEN provides an algorithmic solution for binning via --fastqc-granularity. DRAGEN allocates 256 bins in memory for each size or position-based metric. The granularity value of 4–7 inclusive can be used to determine the bin size. High values use smaller bins for greater resolution. Lower values can be used to create larger bins for larger read-lengths

If a value for --fastqc-granularity is not provided by the user, DRAGEN will attempt to estimate the read length of the input data and set the granularity accordingly.

Adapter and Kmer Sequence Files

To include metrics for adapter or other sequence content, DRAGEN FastQC needs to be provided with the desired sequences in FASTA format. DRAGEN provides two options for this purpose, --fastqc-adapter-file for adapter sequences and --fastqc-kmer-file for any additional kmers of interest so that users can add sequences of interest without changing the expected adapter results.

DRAGEN FastQC can accept up to a combined total of 16 adapters and kmer sequences. Each sequence can be a maximum of 12 bp in length. By default, DRAGEN uses the adapter file located at <INSTALL_PATH>/config/adapter_sequences.fasta. The file contains the following same adapter sequences as Babraham's FastQC v 0.11.10 and later.

Illumina Universal Adapter--AGATCGGAAGAG
Illumina Small RNA 3' Adapter--TGGAATTCTCGG
Illumina Small RNA 5' Adapter--GATCGTCGGACT
Nextera Transposase Sequence--CTGTCTCTTATA

FastQC Metrics Output

The FastQC metrics are output to a CSV file format in the run output directory called

<PREFIX>.fastqc_metrics.csv

The reported metrics are broken down into eight sections by metric type. Each section is broken down further into separate rows by either the length, position, or other relevant categorical variables. The following are the metric sections.

Read Mean Quality---Total number of reads. Each average Phred-scale quality value is rounded to the nearest integer.
Positional Base Mean Quality---Average Phred-scale quality value of bases with a specific nucleotide and at a given location in the read. Locations are listed first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, or T. N or ambiguous bases are assumed to have the system default value, usually QV2.
Positional Base Content---Number of bases of each specific nucleotide at given locations in the read. Locations are given first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, T, N.
Read Lengths---Total number of reads with each observed length. Lengths can be either specific sizes or ranges, depending on settings specified using --fastqc-granularity.
Read GC Content---Total number of reads with each GC content percentile between 0 % and 100 %.
Read GC Content Quality---Average Phred-scale read mean quality for reads with each GC content percentile between 0% and 100%.
Sequence Positions---Number of times an adapter or other kmer sequence is found, starting at a given position in the input reads. Sequences are listed first in the metric description in quotes. Locations are listed second and can be either specific positions or ranges.
Positional Quality---Phred-scale quality value for bases at a given location and a given quantile of the distribution. Locations are listed first and can be either specific positions or ranges. Quantiles are listed second and can be any whole integer 0–100.

The following are examples rows from each section.

Sorting and Duplicate Marking

Sorting

The map/align system produces a BAM file sorted by reference sequence and position by default. Creating this BAM file typically eliminates the requirement to run samtools sort or any equivalent postprocessing command. The ‑‑enable-sort option can be used to enable or disable creation of the BAM file, as follows:

To enable, set to true.
To disable, set to false.

On the reference hardware system, running with sort enabled increases run time for a 30x full genome by about 6--7 minutes.

Duplicate Marking

Marking or removing duplicate aligned reads is a common best practice in whole-genome sequencing. Not doing so can bias variant calling and lead to incorrect results.

The DRAGEN system can mark or remove duplicate reads, and produces a BAM file with duplicates marked in the FLAG field, or with duplicates entirely removed.

In testing, enabling duplicate marking adds minimal run time over and above the time required to produce the sorted BAM file. The additional time is approximately 1--2 minutes for a 30x whole human genome, which is a huge improvement over the long run times of open source tools.

The Duplicate Marking Algorithm

The DRAGEN duplicate-marking algorithm is modeled on the Picard toolkit's MarkDuplicates feature. All the aligned reads are grouped into subsets in which all the members of each subset are potential duplicates.

For two pairs to be duplicates, they must have the following:

Identical alignment coordinates (position adjusted for soft- or hard-clips from the CIGAR) at both ends.
Identical orientations (direction of the two ends, with the left-most coordinate being first).

In addition, an unpaired read may be marked as a duplicate if it has identical coordinate and orientation with either end of any other read, whether paired or not.

Unmapped read pairs are never marked as duplicates.

When DRAGEN has identified a group of duplicates, it picks one as the best of the group, and marks the others with the BAM duplicate flag (0x400, or decimal 1024). For this comparison, duplicates are scored based on the average sequence Phred quality. Pairs receive the sum of the scores of both ends, while unpaired reads get the score of the one mapped end. The idea of this score is to try, all other things being equal, to preserve the reads with the highest-quality base calls.

If two reads (or pairs) have exactly matching quality scores, DRAGEN breaks the tie by choosing the pair with the higher alignment score. If there are multiple pairs that also tie on this attribute, then DRAGEN chooses a winner arbitrarily.

The score for an unpaired read R is the average Phred quality score per base, calculated as follows:

Where R is a BAM record, QUAL is its array of Phred quality scores, and dedup-min-qual is a DRAGEN configuration option with default value of 15. For a pair, the score is the sum of the scores for the two ends.

This score is stored as a one-byte number, with values rounded down to the nearest one-quarter. This rounding may lead to different duplicate marks from those chosen by Picard, but because the reads were very close in quality this has negligible impact on variant calling results.

Duplicate Marking Limitations

The limitations to DRAGEN duplicate marking implementation are as follows:

When there are two duplicate reads or pairs with very close Phred sequence quality scores, DRAGEN might choose a different winner from that chosen by Picard. These differences have negligible impact on variant calling results.
If using a single FASTQ file as input, DRAGEN accepts only a single library ID as a command-line argument (RGLB). For this reason, the FASTQ inputs to the system must be already separated by library ID. Library ID cannot be used as a criterion for distinguishing non-duplicates.
DRAGEN does not distinguish between optical and PCR duplicates.

Duplicate Marking Settings

The following options can be used to configure duplicate marking in DRAGEN:

--enable-duplicate-marking Set to true to enable duplicate marking. When \--enable-duplicate-marking is enabled, the output is sorted, regardless of the value of the enable-sort option.
--remove-duplicates Set to true to suppress the output of duplicate records. If set to false, set the 0x400 flag in the FLAG field of duplicate BAM records. When --remove-duplicates is enabled, then enable- duplicate-marking is forced to enabled as well.
--dedup-min-qual Specifies the Phred quality score below which a base should be excluded from the quality score calculation used for choosing among duplicate reads.

Small Variant Calling

The DRAGEN Small Variant Caller is a high-speed haplotype caller implemented with a hybrid of hardware and software. The caller performs localized de novo assembly in regions of interest to generate candidate haplotypes, and then performs read likelihood calculations using a hidden Markov model (HMM).

Variant calling is disabled by default. To enable variant calling, set the --enable-variant-caller option to true. The VCF header is annotated with ##source=DRAGEN_SNV to indicate the file is generated by the DRAGEN SNV pipeline.

The Variant Caller Algorithm

The DRAGEN Small Variant Caller performs the following steps:

Active Region Identification---Identifies areas where multiple reads disagree with the reference are identified, and selects windows around them (active regions) for processing.

Localized Haplotype Assembly--- Assembles all overlapping reads in each active region into a de Bruijn graph (DBG). A DBG is a directed graph based on overlapping K-mers (length K subsequences) in each read or multiple reads. When all reads are identical, the DBG is linear. Where there are differences, the graph forms bubbles of multiple paths that diverge and rejoin. If the local sequence is too repetitive and K is too small, cycles can form, which invalidate the graph. DRAGEN uses K=10 and 25 as the default values. If those values produce an invalid graph, then additional values of K=35, 45, 55, 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, DRAGEN extracts every possible path to produce a complete list of candidate haplotypes, ie, hypotheses for what the true DNA sequence might be on at least one strand. In addition to graph assembly, haplotypes are also generated via columnwise detection, with candidate variant events identified directly from BAM alignments. Columnwise detection is enabled by default in all small variant calling pipelines and is supplementary to the DBG, but is especially useful in highly repetitive regions where DBG assembly of reads is more likely to fail.

Haplotype Alignment---Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome. The alignments determine what variations from the reference are present.

Read Likelihood Calculation---Tests each read against each haplotype to estimate a probability of observing the read assuming the haplotype was the true original DNA sampled. This calculation is performed by evaluating a pair hidden Markov model (HMM), which accounts for the various possible ways the haplotype might have been modified by PCR or sequencing errors into the read observed. The HMM evaluation uses a dynamic programming method to calculate the total probability of any series of Markov state transitions arriving at the observed read.

Genotyping---Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. Calculations use the constituent probabilities of observing each read, given each haplotype from the pair HMM evaluation. These calculations feed into the Bayesian formula to calculate the likelihood that each genotype is the genotype of the sample being analyzed, given the entire read pileup observed. Genotypes with maximum likelihood are reported.

Read filtering and reporting of vcf DP fields

In most pipelines, DRAGEN reports two types of depth counts, both of which may differ from the information in the BAM pileup due to various filtering steps that are applied throughout variant calling. Briefly:

Unfiltered depth is the number of reads covering the position, downstream of any read collapsing or deduplication that may have preceded the variant calling step, but upstream of most read filtering and overlapping mate handling. Unfiltered depth is reported as INFO/DP, except in the case of gVCF homref calls, where it is reported as FORMAT/DP.
Informative depth is the number of reads actually used to make the calling decision, where filtered reads and uninformative reads (reads that could not be assigned to a specific allele) have been excluded, and overlapping mate pairs are counted as single reads. When overlapping mate pairs are present, this may cause an apparent discrepancy between the reported depth and the pileup as viewed in a browser such as IGV. To resolve this, use the "View as pairs" option in IGV. Informative depth is reported as FORMAT/DP, except in the case of gvcf homref calls, where it is not reported. The FORMAT/AD and FORMAT/AF fields are based on informative depth.

The following figure summarizes the different filtering steps in more detail.

Filter 1 acts on the reads present in the BAM input (in UMI pipelines, these are the collapsed reads produced by the read collapsing step, not the raw reads) and filters out the following reads:
- Duplicate reads.
- Soft-clipped bases. DRAGEN filters out soft-clipped bases only when calculating coverage reports.
- [Somatic] Reads with MAPQ=0.
- [Somatic] Reads with MAPQ < vc-min-tumor-read-qual, where vc-min-tumor-read-qual >1.
Filter 2 trims bases with BQ < 10 and filters out the following reads:
- Unmapped reads.
- Secondary reads.
- Reads with bad cigars.
Filter 3 occurs after downsampling and HMM. Filter 3 filters out the following reads:
- Reads that are badly mated. A badly mated read is a read where the pair is mapped to two different reference contigs.
- Disqualified reads. Reads are disqualified if their HMM score is below a threshold.
Filter 4 occurs after the genotyper runs. The genotyper adds annotation information to the FORMAT field. Filter 4 filters out reads that are not informative. For example, if the HMM scores of the read against two different haplotypes are almost equal, the read is filtered out because it does not provide enough information to distinguish which of the two haplotypes are more likely.

Mosaic Calling

Since DRAGEN 4.3 the mosaic small variant caller runs downstream of the germline small variant caller. Non-cancer post-zygotic mosaic variants with typical AF lower than 50% detected by the mosaic caller are reported in the output VCF file with a MOSAIC INFO flag. As default, MOSAIC tagged variants with AF smaller than 20% are filtered with the MosaicLowAF filter.

Variant Caller Options

The following options control the variant caller stage of the DRAGEN host software.

--enable-variant-caller
Set --enable-variant-caller to true to enable the variant caller stage for the DRAGEN pipeline.
--vc-target-bed
[Optional] Restricts processing of the small variant caller, target BED related coverage, and callability metrics to regions specified in a BED file. The BED file is a text file containing at least three tab-delimited columns. The first three columns are chromosome, start position, and end position. The positions are zero-based. For example:

If the reference span of the variant overlaps with any of the regions in the target BED, then the variant is output. If the reference span does not overlap, the variant is not output. For SNPs and Insertions, the reference span is 1 bp. For deletions, the reference span is the length of the deletion.

--vc-target-bed-padding
[Optional] Pads all target BED regions with the specified value. For example, if a BED region is 1:1000–2000 and the specified padding value is 100, the result is equivalent to using a BED region of 1:900–2100 and a padding value of 0. Any padding added to --vc-target-bed-padding is used by the small variant caller and by the target bed coverage/callability reports. The default padding is 0.
--vc-target-coverage
Specifies the target coverage for downsampling. The default value is 500 for germline mode and 50 for somatic mode.
--vc-remove-all-soft-clips
Set to true to ignore soft-clipped bases during the haploytype assembly step.
--vc-decoy-contigs
Specifies a comma-separated list of contigs to skip during variant calling. This option can be set in the configuration file.
--vc-enable-decoy-contigs
Set to true to enable variant calls on the decoy contigs. The default value is false.
--vc-enable-phasing
Enable variants to be phased when possible. The default value is true.
--vc-combine-phased-variants-distance
Set the maximum distance in base pairs between phased variants to be combined. The default value is 0, which disables the option. If the user wants to enable the combining of phased variants the recommended value of the distance is 15 base pairs. The valid range is [0; 15].
--vc-enable-mosaic-detection
Set to true to enable DRAGEN mosaic detection with mosaic AF filter threshold set to 0.0. Set to false to disable DRAGEN mosaic detection. The default is true with mosaic AF filter threshold set to 0.2.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF filter to mosaic calls. All MOSAIC tagged variants with AF smaller than the AF threshold are filtered with the MosaicLowAF filter. The default mosaic AF filter threshold is set to 0.2 when the germline variant caller is enabled. The AF default threshold is set to 0.0 when the mosaic detection mode is enabled with --vc-enable-mosaic-detection=true.

Downsampling Options for Small Variant Calling

You can use the following options for downsampling reads in the small variant calling pipeline.

For mitochondrial small variant calling, the downsampling options can be set separately because the mitochondrial contig contains a higher depth than the rest of the contigs in a WGS data set. The following are the downsampling options for the mitochondrial contig.

--vc-target-coverage-mito
--vc-max-reads-per-active-region-mito
--vc-max-reads-per-raw-region-mito The target coverage and max/min reads in raw/active region options are not directly related and could be triggered independently.

The following are the default downsampling values for each small variant calling mode.

The target coverage downsampling step runs first and is meant to limit the the total coverage at a given position. This step is approximate and the coverage after downsampling at a given position could be a bit higher than the threshold due to the --vc-min-reads-per-start-pos setting.

If the number of reads at any position with same start position is equal to or lower than the --vc-min-reads-per-start-pos, that position is skipped for downsampling to make sure that there is always at least a minimum number of reads (set to --vc-min-reads-per-start-pos, default value is 10) at any start position.

The next downsampling step is to apply the --vc-max-reads-per-raw-region and --vc-max-reads-per-active-region limits. These options are used to limit the total number of reads in an entire region using a leveling downsampling method.

This downsampling mechanism scans each start position from the start boundary of the region and discards one read from that position, then moves on to the next position, until the total number of reads falls below the threshold. It can potentially take several passes across the entire region for the total number of reads in the entire region to fall below the threshold. After the threshold is met, the downsampling step is stopped regardless of which position was considered last in the region.

When downsampling occurs, the choice of which reads to keep or remove is random. However, the random number generator is seeded to a default value to make sure that the generator produces the same set of values in each run. This ensures reproducible results, which means there is no run to run variation when using the same input data.

gVCF Output

A genomic VCF (gVCF) file contains information on variants and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. The gVCF file includes an artificial <NON_REF> allele. Reads that do not support the reference or any variants are assigned the <NON_REF> allele. DRAGEN uses these reads to determine if the position can be called as a homozygous reference, as opposed to remaining uncalled. The resulting score represents the Phred-scaled level of confidence in a homozygous reference call. In germline mode, the score is FORMAT/GQ and in somatic mode the score is FORMAT/SQ.

The following options are available to enable and control gVCF output.

--vc-emit-ref-confidence
To enable gVCF output, set to GVCF. By default, contiguous runs of homozygous reference calls with similar scores are collapsed into blocks (hom-ref blocks). Hom-ref blocks save disk space and processing time of downstream analysis tools. DRAGEN recommends using the default mode.
To produce unbanded output, set --vc-emit-ref-confidence to BP_RESOLUTION.
--vc-enable-vcf-output
To enable VCF file output during a gVCF run, set to true. The default value is false.
--vc-gvcf-bands
If using the default --vc-emit-ref-confidence gvcf (banded mode), DRAGEN collapses gVCF records with a similar GQ or SQ score. By default, the cutoffs are 1 10 20 30 40 60 80 for germline and 1 3 10 20 50 80 for somatic. For example, to define the bands [0, 10), [10, 50), and ≥ 50 use --vc-gvcf-bands 10 50.
--vc-compact-gvcf
This option, when used for germline in conjunction with --vc-emit-ref-confidence gvcf, produces a much smaller gVCF output file than the default. It can be used when the gVCF is destined for ingestion into gVCF Genotyper, offering further savings on disk space and gVCF Genotyper runtime compared to the default. This option implies --vc-gvcf-bands 0 1 10 20 30 and additionally omits certain metrics that are not used by gVCF Genotyper. Note that files generated using this option will be rejected by the Pedigree Caller.

Not all entries in the gVCF are contiguous. The file might contain gaps that are not covered by either a variant line or a hom-ref block. The gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.

In germline mode, the thresholds for calling are lower for gVCFs than for VCFs. The gVCF output could show a different number of variants than a VCF run for the same sample. There is likely a different number of biallelic and multiallelic calls because gVCF mode includes all possible alleles at a locus, rather than only the two most likely alleles. This means that a biallelic call in the VCF can be output as a multiallelic call in the gVCF. The genotype in the gVCF still points to the two most likely alleles, so the variant call remains the same.

The following are example gVCF records that include a hom-ref block call and a variant call.

In single sample gVCF, FORMAT/DP reported at a HomRef position is the median DP in the band and AD is the corresponding value, so sum of AD will be DP even in a homref band. The minimum is also computed and printed as MIN_DP for the band.

QUAL, QD, and GQ Formulation

In single sample VCF and gVCF, the QUAL follows the definition of the VCF specification. For more information on the VCF specification, see the most current VCF documentation available on samtools/hts-specs GitHub repository.

QUAL is the Phred-scaled probability that the site has no variant and is computed as:
That is, QUAL = GP (GT=0/0), where GP = posterior genotype probability in Phred scale. QUAL = 20 means there is 99% probability that there is a variant at the site. The GP values are also given in Phred-scaled in the VCF file.
GQ for non-homref calls is the Phred-scaled probability that the call is incorrect. GQ=-10*log10(p), where p is the probability that the call is incorrect. GQ=-10*log10(sum(10.^(-GP(i)/10))) where the sum is taken over the GT that did not win. GQ of 3 indicates a 50 percent chance that the call is incorrect, and GQ of 20 indicates a 1 percent chance that the call is incorrect.
In gvcf mode, the evidence in favor of homozygous reference calls is also assessed. However, the posterior probability is not of interest in this case (with zero evidence, e.g. due to zero coverage, the strong prior in favor of homref would yield a strong posterior in favor of homref), so the value of GQ for homref calls reflects the evidence directly, defined using the likelihood ratio between the likelihoods for the homref hypothesis and the strongest competing variant hypothesis: 10*log10[P(D|homref)/P(D|variant)] where D represents the pileup data.
QD is the QUAL normalized by the read depth, DP.

The QUAL scores generated by DRAGEN differ significantly from those of GATK, as DRAGEN's algorithms for small variant detection provide more realistic scores. This improvement stems from two key factors:

Correlated Errors: DRAGEN accounts for real-world correlated errors, unlike GATK, which assumes errors are uncorrelated, leading to inflated QUAL scores in GATK.
Machine Learning (ML): DRAGEN-ML further recalibrates QUAL scores, making them more accurate than DRAGEN without ML. With ML enabled, QUAL scores tend to not exceed 75, compared to GATK, where they can exceed 1000. Consequently, DRAGEN-ML uses a lower QUAL filtering threshold (3) compared to DRAGEN without ML (10).

Our recommendation is to use the default filtering thresholds in DRAGEN: QUAL threshold of 3 with ML enabled.

Phasing and Phased Variants

DRAGEN supports output of phased variant records in both the germline and the somatic VCF and gVCF files. When two or more variants are phased together, the phasing information is encoded in a sample-level annotation, FORMAT/PS. FORMAT/PS identifies which set the phased variant is in. The value in the field in an integer representing the position of the first phased variant in the set. All records in the same contig with matching PS values belong to the same set.

The following is an example of a DRAGEN single sample gVCF, where two SNPs are phased together.

During the genotyping step, all haplotypes and all variants are considered over an active region. For each pair of variants, if both variants occur on all of the same haplotypes or if either is a homozygous variant, then they are phased together. If the variants only occur on different haplotypes, then they are phased opposite to each other. If any heterozygous variants are present on some of the same haplotypes but not others, phasing is aborted and no phasing information is output for the active region.

Combine Phased Variants

Phased variant records that belong to the same phasing set can be combined into a single VCF record. For example, assuming reference at position chr2 115035 is A, the following two phased variants are combined.

The phased variants are combined as follows.

The command-line option --vc-combine-phased-variants-distance specifies the maximum distance over which phased variants will be combined. The default value 0 disables the feature. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value.

DRAGEN supports phasing of the genotypes listed in the below table. Only the first row in the table is relevant to somatic, since the somatic pipeline only emits 0/1 and 0|1 genotypes. MNV calls can still be phased with other variant calls that fell outside the phased variants distance.

Examples of diploid haplotypes where phasing is supported:

Examples of diploid haplotypes where phasing is not supported:

By default in somatic mode, DRAGEN will output all component SNVs and INDELs that make up an MNV along with the MNV call itself. MNVs and their component calls can be identified and linked to one another by a common value in the INFO.MNVTAG field. Setting --vc-mnv-emit-component-calls=false can be used to restrict which component calls are reported. When DRAGEN reports an MNV call, it considers the difference between the VAF of the MNV call and the VAF of each component call, and reports any given component call in addition to the MNV call if this difference is greater than --vc-combine-phased-variants-max-vaf-delta (default: 0.1). The --vc-mnv-emit-component-calls and --vc-combine-phased-variants-max-vaf-delta options are only applicable in somatic mode and are not supported in germline mode. In germline mode, functionality to output component calls is not available and MNV calls are emitted only without component calls.

Variant Representation

Parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0.
Left aligning a variant means shifting the start position of that variant to the left till it is no longer possible to do so.
A variant is normalized if and only if it is parsimonious and left aligned

Additional notes on variant representation in the DRAGEN VCF:

Reference-trimming of alleles: A single padding reference base is used to represent insertions and deletions (i.e. the reference base preceding the insertion or deletion is included).
Allele decomposition: by default, multi-nucleotide polymorphisms (MNPs) are represented as separate, contiguous individual SNVs records in the VCF. If phasing can be determined, the FORMAT/GT is phased and the FORMAT/PS contains the coordinate position of the first variant in the set of phased variants. This determines which variant have occurred on the same haplotype. Phased variant records that belong to the same phasing set can be combined into a single VCF record by using the --vc-combine-phased-variants-distance command-line option and set it to a non-zero value. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value (specified in the number of basepairs).

Multi-allelic Variants and Overlapping Variants

A multiallelic site is a specific locus in a genome that contains three or more observed alleles, counting the reference as one, and therefore allowing for two or more variant alleles. Multi-allelic calls are output in a single variant record in the VCF as follows:

chr1 2656216 . A T,C 107.65 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=12;FS=0.000;MQ=28.95;QD=8.97;SOR=3.056;FractionInformativeReads=0.750 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,5,4:0.556,0.444:9:15:177,144,46,122,0,72:-17.704,-14.420,-4.626,-12.220,0.000,-7.244:1.076e+02,1.096e+02,1.465e+01,8.758e+01,1.520e-01,4.082e+01:0.00,34.77,37.77,34.77,69.54,37.77:0,0,1,8:0,0,4,5

Two indels are considered as multi-allelic if they share the same reference base preceding the indel. chr1 7392258 . C CT,CTTT 234.76 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=44;FS=0.000;MQ=199.22;QD=5.34;SOR=2.226;FractionInformativeReads=0.659 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,15,14:0.517,0.483:29:50:245,256,55,190,0,55:-24.476,-25.634,-5.492,-18.976,0.000,-5.500:2.348e+02,2.513e+02,5.292e+01,1.848e+02,4.401e-05,5.300e+01:0.00,5.00,8.00,5.00,10.00,8.00:0,0,7,22:0,0,17,12

If a SNP overlaps an INDEL, but the SNP does not align with the reference base preceding the indel, the SNP and INDEL are represented as two different variant records, as shown in the example below. However DRAGEN has the joint detection of overlaping variants feature which is designed to detect overlapping SNP and INDEL and output them in a single VCF variant record, represented as a multi-allelic genotype.

chr1 1029628 . C CGT 49.88 PASS AC=1;AF=0.500;AN=2;DP=37;FS=7.791;MQ=105.32;MQRankSum=-1.315;QD=1.35;ReadPosRankSum=1.423;SOR=1.510;FractionInformativeReads=0.892;R2_5P_bias=-19.742 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:17,16:0.485:33:48:81,0,50:-8.088,0.000,-5.000:4.988e+01,6.653e-05,5.300e+01:0.00,31.00,34.00:10,7,5,11:11,6,9,7:1029628 chr1 1029629 . A G 50.00 PASS AC=1;AF=0.500;AN=2;DP=37;FS=1.289;MQ=105.32;MQRankSum=-0.659;QD=1.35;ReadPosRankSum=-0.199;SOR=0.604;FractionInformativeReads=1.000;R2_5P_bias=-24.923 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:16,21:0.568:37:48:85,0,49:-8.477,0.000,-4.934:5.000e+01,6.886e-05,5.234e+01:0.00,34.77,37.77:9,7,10,11:10,6,13,8:1029628

Ploidy Support

The small variant caller currently only supports either ploidy 1 or 2 on all contigs within the reference except for the mitochondrial contig, which uses a continuous allele frequency approach (see Mitochondrial Calling). The selection of ploidy 1 or 2 for all other contigs is determined as follows.

If --sample-sex is not specified on the command line, the Ploidy Estimator determines the sex. If the Ploidy Estimator cannot determine the sex karyotype or detects sex chromosome aneuploidy, all contigs are processed with ploidy 2.
If --sample-sex is specified on the command line, contigs are processed as follows.
- For female samples, DRAGEN processes all contigs with ploidy 2, and marks variant calls on chrY with a filter PloidyConflict.
- For male samples, DRAGEN processes all contigs with ploidy 2, except for the sex chromosomes. DRAGEN processes chrX with ploidy 1, except in the PAR regions, where it is processed with ploidy 2. chrY is processed with ploidy 1 throughout.

For male samples in germline calling mode, DRAGEN calls potential mosaic variants in non-PAR regions of sex chromosomes. A variant is called as mosaic when the allele frequency (FORMAT/AF) is below 85% or if multiple alt alleles are called, suggesting incompatibility with the haploid assumption. The GT field for bi-allelic mosaic variants is "0/1", denoting a mixture of reference and alt alleles, as opposed to the regular GT of "1" for haploid variants. The GT field for multi-allelic mosaic variants is "1/2" in VCF. You can disable the calling of mosaic variants by setting --vc-enable-sex-chr-diploid to false.

An example germline VCF record of a mosaic variant in a haploid region: chrX 18622368 . C T 48.84 PASS AC=1;AF=0.500;AN=2;DP=22;FS=4.154;MQ=248.02;MQRankSum=3.272;QD=2.27;ReadPosRankSum=2.671;SOR=1.546;FractionInformativeReads=1.000;MOSAIC GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:9,13:0.5909:22:1,8:8,5:48:84,0,51:4.8837e+01,7.4031e-05,5.4007e+01:0.00,34.77,37.77:5,4,4,9:3,6,5,8

DRAGEN detects sex chromosomes by the naming convention, either X/Y or chrX/chrY. No other naming convention is supported.

Overlapping Mates in the Small Variant Calling

Instead of treating overlapping mates as independent evidence for a given event, DRAGEN handles overlapping mates in both the germline and somatic pipelines as follows.

When the two overlapping mates agree with each other on the allele with the highest HMM score, the genotyper uses the mate with the greatest difference between the highest and the second highest HMM score. The HMM score of the other mate becomes zero.
When the two overlapping mates disagree, the genotyper sums the HMM score between the two mates, assigns the combined score to the mate that agrees with the combined result, and changes the HMM score of the other mate to zero.
The base qualities of overlapping mates are no longer adjusted.

Mitochondrial Calling

Typically, there are approximately 100 mitochondria in each mammalian cell. Each mitochondrion harbors 2–10 copies of mitochondrial DNA (mtDNA). For example, if 20% of the chrM copies have a variant, then the allele frequency (AF) is 20%. This is also referred to as continuous allele frequency. The expectation is that the AF of variants on chrM is anywhere between 0% and 100%.

DRAGEN processes chrM through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. In this case, a single ALT allele is considered and the AF is estimated. The estimated AF can be anywhere between 0% and 100%. Default variant AF thresholds are applied to mitochondrial variant calling.

--vc-enable-af-filter-mito Whether to enable the allele frequency for mitochondrial variant calling. The default is true.
--vc-af-call-threshold-mito Set the threshold for emitting calls in the VCF. The default is 0.01.
--vc-af-filter-threshold-mito Set the threshold to mark emitted vcf call as filtered. The default is 0.02.

QUAL and GQ are not output in the chrM variant records. Instead, the confidence score is FORMAT/SQ, which gives the Phred-scaled confidence that a variant is present at a given locus. A call is made if FORMAT/SQ> vc-sq-call-threshold (default = 3.0).

The following filters can be applied to mitochondrial variant calls.

--vc-sq-call-threshold Set the SQ threshold for emitting calls in the VCF. The default is 0.1.
--vc-sq-filter-threshold Set the SQ threshold to mark emitted VCF calls as filtered. The default is 3.0
--vc-enable-triallelic-filter Enables the multiallelic filter. The default value is false.

If FORMAT/SQ < vc-sq-call-threshold, the variant is not emitted in the VCF. If FORMAT/SQ > vc-sq-call-threshold but FORMAT/SQ < vc-sq-filter-threshold, the variant is emitted in the VCF but FILTER=weak_evidence.

If FORMAT/SQ> vc-sq-call-threshold, FORMAT/SQ > vc-sq-filter-threshold, and no other filters are triggered, the variant is output in the VCF and FILTER=PASS.

The following are example VCF records on the chrM. The examples show one call with very high AF and another with low AF. In both cases FORMAT/SQ > vc-sq-call-threshold. FORMAT/SQ is also > vc-sq-filter-threshold, so the FILTER annotation is PASS.

FORMAT/GT

For homref calls (e.g. in NON_REF regions of gVCF output) the FORMAT/GT is hard-coded to 0/0. The FORMAT/AF yields an estimate on the variant allele frequency, which ranges anywhere within [0,1]. For variant calls with FORMAT/AF < 95%, the FORMAT/GT is set to 0/1. For variants with very high allele frequencies (FORMAT/AF ≥ 95%), the FORMAT/GT is set to 1/1.

The following is an example of a variant record on chrM in a trio joint VCF. The variant was detected in the second sample with a confidence score that passed the filter threshold. In the first and third samples GT=0/0, which indicates a tentative hom-ref call (ie, that position for the sample is in a NON_REF region over which no variant was detected with sufficient confidence), but the weak_evidence filter tag indicates that this call is made with low confidence.

Personalized Germline Small Variant Calling

We leverage the new pangenome reference and multi-genome mapper output to compute a personalized 2-haplotype reference for the input sample.

The computed 2-haplotype reference is used to impute variants, adjust priors probabilities for genotypes in the variant caller, create a new personalized machine learning model and significantly boosts accuracy of variant calling. False negatives are reduced by adjusting genotype priors based on imputed phased variants in the computed haplotypes. False positives are reduced by limiting the impact of noise from other population haplotypes.

To enable personalized variant calling and machine learning, set --enable-personalization to true (default: false).

Note that this is a beta feature and available only for the germline small variant caller when run with a V4 pangenome reference.

Joint Detection of Overlapping Variants

When variants at multiple loci in a single active region are detected jointly, genotyping can benefit. DRAGEN combines loci into a joint detection region if the following conditions are met:

Loci have alleles that overlap each other.
Loci are in the STR region or less than 10 bases apart of the STR region.
Loci are less than 10 bases apart of each other.

Joint detection generates a haplotype list where all possible combinations of the alleles in the joint detection regions are represented. This calculation leads to a larger number of haplotypes. During genotyping, joint detection calculates the likelihoods that each haplotype pair is the truth, given the observed read pileup. Genotype likelihoods are calculated as the sum of the likelihoods of haplotype pairs that support the alleles in the genotypes. Genotypes with maximum likelihood are reported.

Joint detection is enabled by default. To disable joint detection, set --vc-enable-joint-detection to false.

Modeling of Correlated Errors Across Reads

DRAGEN has two algorithms that model correlated errors across reads in a given pileup.

Foreign Read Detection

Foreign read detection (FRD) detects mismapped reads. FRD modifies the probability calculation to account for the possibility that a subset of the reads were mismapped. Instead of assuming that mapping errors occur independently per read, FRD estimates the probability that a burst of reads is mismapped, by incorporating such evidence as MAPQ and skewed AF.

Mapping errors typically occur in bursts, but treating mapping errors as independent error events per read can result in high confidence scores in spite of low MAPQ and/or skewed AF. One possible strategy to mitigate overestimation of confidence scores is to include a threshold on the minimum MAPQ used in the calculation. However, this strategy can discard evidence and result in false positives.

FRD extends the legacy genotyping algorithm by incorporating an additional hypothesis that reads in the pileup might be foreign reads (ie, their true location is elsewhere in the reference genome). The algorithm exploits multiple properties (skewed allele frequency and low MAPQ) and incorporates this evidence into the probability calculation.

Sensitivity is improved by rescuing FN, correcting genotypes, and enabling lowering of the MAPQ threshold for incoming reads into the variant caller. Specificity is improved by removing FP and correcting genotypes.

Base Quality Dropoff

The base quality drop off (BQD) algorithm detects systematic and correlated base call errors caused by the sequencing system. BQD exploits certain properties of those errors (strand bias, position of the error in the read, base quality) to estimate the probability that the alleles are the result of a systematic error event rather than a true variant.

Bursts of errors that occur at a specific locus have distinct characteristics differentiating them from true variants. The base quality drop off (BQD) algorithm is a detection mechanism that exploits certain properties of those errors (strand bias, position of the error in the read, low mean base quality over said subset of reads at the locus of interest) and incorporates them into the probability calculation.

ROH Caller

Regions of homozygosity (ROH) are detected as part of the small variant caller. The caller detects and outputs the runs of homozygosity from whole genome calls on autosomal human chromosomes. Sex chromosomes are ignored unless the sample sex karyotype is XX, as specified on the command line or determined by the Ploidy Estimator. ROH output allows downstream tools to screen for and predict consanguinity between the parents of the proband subject.

A region is defined as consecutive variant calls on the chromosome with no large gap in between these variants. In other words, regions are broken by chromosome or by large gaps with no SNV calls. The gap size is set to 3 Mbases.

ROH Algorithm

The ROH algorithm runs on the small variant calls. The algorithm excludes variants with multiallelic sites, indels, complex variants, non-PASS filtered calls, and homozygous reference sites. The variant calls are then filtered further using a block list BED, and finally depth filtering is applied after the block list filter. The default value for the fraction of filtered calls is 0.2, which filters the calls with the highest 10% and lowest 10% in DP values. The algorithm then uses the resulting calls to find regions.

The ROH algorithm first finds seed regions that contain at least 50 consecutive homozygous SNV calls with no heterozygous SNV or gaps of 500,000 bases between the variants. The regions can be extended using a scoring system that functions as follows.

Score increases with every additional homozygous variant (0.025) and decreases with a large penalty (1-0.025) for every heterozygous SNV. This provides some tolerance of presence of heterozygous SNV in the region.
Each region expands on both ends until the regions reach the end of a chromosome, a gap of 500,000 bases between SNVs occurs, or the score becomes too low (0).

Overlapping regions are merged into a single region. Regions can be merged across gaps of 500,000 bases between SNVs if a single region would have been called from the beginning of the first region to the end of the second region without the gap. There is no maximum size for regions, but regions always end at chromosome boundaries.

ROH Options

--vc-enable-roh Set to true to enable the ROH caller. The ROH caller is enabled by default for human autosomes only. Set to false to disable.
--vc-roh-blacklist-bed If provided, the ROH caller ignores variants that are contained in any region in the block list BED file. DRAGEN distributes block list files for all popular human genomes and automatically selects a block list to match the genome in use, unless this option is used to select a file.

ROH Output

The ROH caller produces an ROH output file named <output-file-prefix>.roh.bed in which each row represents one region of homozygosity. The BED file contains the following columns:

Chromosome Start End Score #Homozygous #Heterozygous

Score is a function of the number of homozygous and heterozygous variants, where each homozygous variant increases the score by 0.025, and each heterozygous variant reduces the score by 0.975.
Start and end positions are a 0-based, half-open interval.
#Homozygous is number of homozygous variants in the region.
#Heterozygous is number of heterozygous variants in the region. The caller also produces a metrics file named <output-file-prefix>.roh_metrics.csv that lists the number of large ROH and percentage of SNPs in large ROH (>3 MB).

Concordance with PLINK

The table below demonstrates how the PLINK options can be tuned to behave similarly to the DRAGEN ROH caller default settings (see column DRAGEN default). We observed that PLINK ROH calls (see column PLINK default) in default settings are more conservative compared to DRAGEN default settings. By default, PLINK reports ROH regions of size 1MB or larger (see PLINK option --homozyg-kb ) with at least 100 homozygous SNPs (see PLINK option --homozyg-snp) while DRAGEN ROH caller reports smaller regions with at least 50 homozygous SNPs (see DRAGEN ROH Algorithm section). In addition, PLINK by default allows for only 1 heterozygous SNP per scanning window (specified by PLINK option --homozyg-window-het) while DRAGEN uses a soft score threshold penalty without setting an upper bound on the allowed number of heterozygous SNPs (see DRAGEN ROH Algorithm section). The PLINK ROH calls are largely comparable to the DRAGEN ROH calls after relaxing the default PLINK settings, shown in column PLINK tuned. Prior to PLINK ROH calling, the input DRAGEN hard-filtered VCF files are filtered as per the instructions in DRAGEN ROH Algorithm section.

B-Allele Frequency Output

B-Allele frequency (BAF) output is enabled by default in germline and somatic VCF and gVCF runs.

The BAF value is calculated as either AF or (1 - AF), where

AF = (alt_count / (ref_count + alt_count))
BAF = 1 - AF, only when ref base < alt base, order of priority for bases is A < T < G < C < N.

The B-allele frequency values are often plotted to visually inspect the spread away from a perfectly diploid heterozygous call (BAF=50%). This plot is more easily interpreted if it is symmetric about the BAF=50% line. To ensure the symmetry, a heuristic must be used to determine when BAF = AF or BAF = 1-AF. This definition of B-Allele Frequency is based on the definition that is used for bead arrays, as most users are accustomed to that implementation. Here, the choice of the B allele is based on the color of dye attached to each nucleotide. A and T get one color, G and C get the other color. The bead array implementation has much more complex rule for tie-breaking between A and T or G and C that involves top and bottom strands. This is unnecessary and so the simpler hierarchical approach of using a priority for the nucleotides A<T<G<C<N is used.

For each small variant VCF entry with exactly one SNP alternate allele, the output contains a corresponding entry in the BAF output file.

<NON_REF> lines are excluded
- ForceGT variants (as marked by the "FGT" tag in the INFO field) are not included in the output, unless the variant also contains the "NML" tag in the INFO field.
- Variants where the ref_count and alt_count are both zero are not included in the output.

BAF Options

--vc-enable-baf Enable or disable B-allele frequency output. Enabled by default.

BAF Output

The BF generates are BigWig-compressed files, named <output-file-prefix>.baf.bw and <output-file-prefix>.hard-filtered.baf.bw. The hard-filtered file only contains entries for variants that pass the filters defined in the VCF (ie, PASS entries).

Each entry contains the following information: Chromosome Start End BAF

Where:

Chromosome is a string matching a reference contig.
Start and end values are zero-based, half open intervals.
BAF is a floating point value.

Ploidy Calling

Biomarkers

Downsampling

DRAGEN Single-Cell Pipeline

DRAGEN Host Software

Invoke the software using the dragen command. The command line options are described in the following sections.

Command-line Options

The following are examples of frequently used command lines:

Build Reference/Hash Table

dragen --build-hash-table true --ht-reference <REF_FASTA> \
--output-directory <REF_DIRECTORY>  [options]

Run Map/Align and Variant Caller (*.fastq to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
[-2 <FASTQ2>] --RGID <RG0> --RGSM <SM0> --enable-variant-caller true

Run Map/Align (*.fastq to *.bam)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] \
-1 <FASTQ1> [-2 <FASTQ2>]  \
--RGID <RG0> --RGSM

Run Variant Caller Only (*.bam to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
--enable-map-align false \
--enable-variant-caller true

Re-map and Run Variant Caller (*.bam to *.vcf)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
--enable-map-align true \
--enable-variant-caller true

Run BCL Converter (BCL to *.fastq)

dragen --bcl-conversion-only true --bcl-input-directory <BCL_DIRECTORY> \
--output-directory <OUT_DIRECTORY>

Run RNA Map/Align (*.fastq to *.bam)

dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
--output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
[-2 <FASTQ2>] --enable-rna true

For recommended command lines in typical use cases, see DRAGEN Recipes.

Reference Genome Options

Use the following command to load the reference genome and hash tables to DRAGEN card memory separately from processing reads.

dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

Use the -l (--force-load-reference) option to force the reference genome to load even if it is already loaded.

dragen -l -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The time needed to load the reference genome depends on the size of the reference, but for typical recommended settings, it takes approximately 30--60 seconds.

Operating Modes

DRAGEN has two primary modes of operation, as follows:

Mapper/aligner
Variant caller

Full pipeline mode To execute full pipeline mode, set --enable-variant-caller to true and provide input as unmapped reads in *.fastq, *.bam, or *.cram formats. DRAGEN performs decompression, mapping, aligning, sorting, and optional duplicate marking and feeds directly into the variant caller to produce a VCF file. In this mode, DRAGEN uses parallel stages throughout the pipeline to drastically reduce the overall run time.
Map/align mode Map/align mode is enabled by default. Input is unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN produces an aligned and sorted BAM or CRAM file. To mark duplicate reads at the same time, set ‑-enable‑duplicate‑marking to true.
Variant caller mode To execute variant caller mode, set the --enable-variant-caller option to true, and set --enable-map-align option to false. The input must be a mapped and aligned BAM/CRAM file. DRAGEN produces a VCF file. DRAGEN will force-enable re-sorting of the BAM, because a number of read statistics and estimates are required for the Variant Caller to operate effectively. Setting --enable-sort to false will be overridden. BAM files cannot be duplicate marked in the DRAGEN pipeline prior to variant calling if they have not already been marked. Use the end-to-end mode of operation to take advantage of the mark-duplicates feature.
RNA-Seq data To enable processing of RNA-Seq--based data, set --enable-rna to true. DRAGEN uses the RNA spliced aligner during the mapper/aligner stage. DRAGEN dynamically switches between the required modes of operation..
Bisulfite MethylSeq data To enable processing of Bisulfite MethylSeq data, set the --enable-methylation-calling option to true. DRAGEN automates the processing of data for Lister (directional) and Cokus (nondirectional) protocols to generate a single BAM with bismark-compatible tags. Alternatively, you can run DRAGEN in a mode that produces a separate BAM file for each combination of the C->T and G->A converted reads and references. To enable this mode of processing, you need to build a set of reference hash tables with --ht-methylated enabled, and run DRAGEN with the appropriate ‑‑methylation-protocol setting.

Output Options

The following command line options for output are mandatory:

--output-directory <out_dir>—Specifies the output directory for generated files.
--output-file-prefix <out_prefix>-Specifies the output file prefix. DRAGEN appends the appropriate file extension onto this prefix for each generated file.
-r [--ref-dir ]—Specifies the reference hash table.

The following examples do not include these mandatory options.

For example, the following commands output to a compressed BAM file, and then forces overwrite:

dragen ... -f

dragen ... -f --output-format bam

To generate a BAI-format BAM index file (*.bai), set --enable-bam-indexing to true.

The following example outputs to a SAM file, and then forces overwrite:

dragen ... -f --output-format sam

The following example outputs to a CRAM file, and then forces overwrite:

dragen ... -f --output-format cram

DRAGEN only outputs lossless CRAM files. All QNAMEs and BAM tags are preserved in the CRAM.

Alignment tags

Tag

Tag meaning

ZS:Z:R

Multiple alignments with similar score were found.

ZS:Z:NM

No alignment was found.

ZS:Z:QL

An alignment was found but it was below the quality threshold.

The ga tag uses the same format as the SA tag used to describe supplementary alignments.

CRAM Output

When CRAM is selected as output, DRAGEN generates a CRAM file with the following features:

CRAM format V3.0 is produced
The CRAM is lossless. Lossy compression is never employed and not optional
Quality score compression is lossless. Read names are preserved
Only the GZIP compression algorithm is employed for maximum compatibility. bgzip, lzma not employed. rANS is used for quality scores
All input BAM tags are preserved
The reference used to compress the CRAM file, is the DRAGEN Hash Table provided during the map/align run. When decompressing the CRAM with a FASTA file and 3rd party tools, the FASTA that was used to generate the Hash Table must be used.
A CRAM index is produced in .crai format
CRAM output is only possible when sort is enabled. CRAM alignments will always be positionally sorted

The following list of default settings are used for the CRAM output

CRAM option

Value

Description

SEQS_PER_SLICE

2000

Max sequences per slice

BASES_PER_SLICE

SEQS_PER_SLICE*500

Max bases per slice

SLICE_PER_CNT

Max slices per container

embed_ref

Do not embed reference sequence

noref

Do not use non-referenced based encoding

multiseq

-1

Do not use multiple references per slice

unsorted

Do not use unsorted mode

use_bz2

Do not compress using bzip2

use_lzma

Do not compress using lmza

use_rans

Use rANS for quality score compression

binning

NONE

Qual score binning not used

preserve_aux_order

Preserve all aux tags and order (incl RG,NM,MD)

preserve_aux_size

Aux tag sizes not preserved ('i', 's', 'c')

lossy_read_names

Preserve read names

lossy

Do not enable Illumina 8 quality-binning system

ignore_md5

Enable all checking of checksums

decode_md

Do not (re)generate MD and NM tags

Input Options

DRAGEN can process reads in FASTQ format or BAM/CRAM format. DRAGEN supports the following compression options for FASTQ input files.

Uncompressed
gzip or bgzip compression
ORA compression. To use ORA compression, you must provide an ORA reference and reference directory. See ORA Compression and Decompression.

FASTQ Input Files

FASTQ input files can be single-ended or paired-end, as shown in the following examples.

Single-ended in one FASTQ file (-1 option)

dragen -r <REF_DIR> -1 <fastq> \
--output-directory <OUT_DIR> -output-file-prefix <OUTPUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

Paired-end in two matched FASTQ files(-1 and -2 options)

dragen -r <REF_DIR> -1 <fastq1> -2 <fastq2> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

Paired-end in a single interleaved FASTQ file(--interleaved (-i) option)

dragen -r <REF_DIR> -1 <INTERLEAVED_FASTQ> -i \
--RGID <RGID> --RGSM <RGSM>

Both bcl2fastq and the DRAGEN BCL command use a common file naming convention, as follows:

<SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz

Older versions of bcl2fastq and DRAGEN could segment FASTQ samples into multiple files to limit file size or to decrease the time to generate them.

For Example:

RDRS182520_S1_L001_R1_001.fastq.gz

RDRS182520_S1_L001_R1_002.fastq.gz

...

RDRS182520_S1_L001_R1_008.fastq.gz

To avoid impacting system performance, input files must be located on a fast file system.

Multiple FASTQ Input Files

For example:

dragen -r <ref_dir> --fastq-list <CSV_FILE> \
-fastq-list-sample-id <Sample_ID> -output-directory <OUT_DIR> 
--output-file-prefix <OUT_PREFIX>

FASTQ CSV File Format

Column titles are case-sensitive. The following column titles are required:

RGID--Read Group
RGSM--Sample ID
RGLB--Library
Lane--Flow cell lane
Read1File--Full path to a valid FASTQ input file
Read2File--Full path to a valid FASTQ input file. Required for paired-end input. If not using paired-end input, leave empty.

Each FASTQ file referenced in the CSV list can be referenced only once. All values in the Read2File column must be either nonempty and reference valid files, or they must all be empty.

When generating a BAM file using fastq-list input, one read group is generated per unique RGID value. The BAM header contains RG tags for the following read groups:

ID (from RGID)
SM (from RGSM)
LB (from RGLB)

Independent processing and output for multiple individual samples in one run is not supported.
To process all listed files together as one sample, regardless of the RGSM value, the option --fastq-list-all-samples=true can be used instead of --fastq-list-sample-id.

Note

There is no option to specify groupings or subsets of RGSM values for more complex filtering, but the fastq-list file can be modified to achieve the same effect.

The following is an example FASTQ list CSV file with the required columns:

RGID,RGSM,RGLB,Lane,Read1File,Read2File
CACACTGA.1,RDSR181520,UnknownLibrary,1,/staging/RDSR181520_S1_L001_R1_001.fastq,
/staging/RDSR181520_S1_L001_R2_001.fastq
AGAACGGA.1,RDSR181521,UnknownLibrary,1,/staging/RDSR181521_S2_L001_R1_001.fastq,
/staging/RDSR181521_S2_L001_R2_001.fastq
TAAGTGCC.1,RDSR181522,UnknownLibrary,1,/staging/RDSR181522_S3_L001_R1_001.fastq,
/staging/RDSR181522_S3_L001_R2_001.fastq
AGACTGAG.1,RDSR181523,UnknownLibrary,1,/staging/RDSR181523_S4_L001_R1_001.fastq,
/staging/RDSR181523_S4_L001_R2_001.fastq

dragen -r <ref_dir> --tumor-fastq-list <csv_file> \
--tumor-fastq-list-sample-id <Sample_ID> \
--output-directory <out_dir> \
--output-file-prefix <out_prefix> --fastq-list <csv_file_2> \
--fastq-list-sample-id <Sample_ID_2>

Tumor-Normal Pairs Input

#!/bin/bash

HT="/staging/HT/"
tumor_fastq_list="/staging/inputs/tumor_fastq_list.csv"
normal_fastq_list="/staging/inputs/normal_fastq_list.csv"

tumor_samples_list="/staging/inputs/tumor_samples_list.txt"
normal_samples_list="/staging/inputs/normal_samples_list.txt"

while read -u 3 -r tumor_RGSM && read -u 4 -r normal_RGSM; do
output_dir="/staging/results/${tumor_RGSM}_${normal_RGSM}"
mkdir -p ${output_dir}

dragen \
-r ${HT} \
--tumor-fastq-list ${tumor_fastq_list} \
--tumor-fastq-list-sample-id ${tumor_RGSM} \
--fastq-list ${normal_fastq_list} \
--fastq-list-sample-id ${normal_RGSM} \
--output-directory ${output_dir} \
--output-file-prefix ${tumor_RGSM}_${normal_RGSM}
done 3<${tumor_samples_list} 4<${normal_samples_list}


Sample fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_N1.1,normal-1,ILLUMINA,1,/staging/inputs/normal-1_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N1.2,normal-1,ILLUMINA,2,/staging/inputs/normal-1_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.1,normal-2,ILLUMINA,1,/staging/inputs/normal-2_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.2,normal-2,ILLUMINA,2,/staging/inputs/normal-2_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.1,normal-3,ILLUMINA,1,/staging/inputs/normal-3_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.2,normal-3,ILLUMINA,2,/staging/inputs/normal-3_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L002_R2_001.fastq.gz

The following are examples of the FASTQ lists and samples lists used as input for the script.

Sample tumor_fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_T1.1,tumor-1,ILLUMINA,1,/staging/inputs/tumor-1_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T1.2,tumor-1,ILLUMINA,2,/staging/inputs/tumor-1_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.1,tumor-2,ILLUMINA,1,/staging/inputs/tumor-2_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.2,tumor-2,ILLUMINA,2,/staging/inputs/tumor-2_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.1,tumor-3,ILLUMINA,1,/staging/inputs/tumor-3_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.2,tumor-3,ILLUMINA,2,/staging/inputs/tumor-3_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L002_R2_001.fastq.gz

Sample normal_samples_list content

normal-1
normal-2
normal-3

Sample tumor_samples_list content

tumor-1
tumor-2
tumor-3

FASTQ ORA Input Files

See ORA Compression and Decompression for more information on ORA reference files.

The following command represents paired-end in two matched ORA FASTQ files (-1 and -2 options).

dragen -r <REF_DIR> -1 <fastq.ora1> -2 <fastq.ora2> \
--ora-reference <ORADATA_DIR> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

BAM Input Files

When you specify a BAM file as input, with map/align enabled, DRAGEN ignores any alignment information contained in the input file, and outputs new alignments for all reads.

Specify single-ended input in one BAM file with the (-b) and --pair-by-name=false options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name false

Specify paired-end input in one BAM file with the (-b) and \--pair-by-name=true options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

CRAM Input

You can use CRAM files as input to the DRAGEN mapper/aligner and variant caller. The DRAGEN functionality available when using CRAM input is the same as when using BAM input.

--cram-reference can be either a fasta file, or a DRAGEN hash table folder.
If pointing to a fasta file, the fasta .fai index file must be present next to the fasta file
CRAM output will always be compressed using the --ref-dir reference

Example: CRAM was created with hg19, re-analysis with hg38

dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <ref_dir HG19>

dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <hg19.fa>

The following options are used for providing a CRAM input to either mapper/aligner or variant caller:

--cram-input--The name and path for the CRAM file
--cram-input--One usage example is paired-end input in a single CRAM file. In addition, set the --pair-by-name option to true.

dragen -r <ref_dir> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

BCL Input Files

BCL is the output format of Illumina sequencing systems. Under limited circumstances, DRAGEN can read directly from BCL for map-align operations, saving the time needed for conversion to FASTQ.

DRAGEN can read directly from BCL in the following circumstances:

Only one lane is input as part of a run (specified on the command-line).
The lane has only a single sample specified in the SampleSheet.csv file. When converting BCL to FASTQ is required, DRAGEN provides a BCL to FASTQ converter (see DRAGEN BCL Data Conversion).

The following example command is for BCL input with only one lane of input:

dragen --bcl-input-dir <BCL_ROOT> --bcl-only-lane <num> -r <ref_dir> \
--output-directory <out_dir> --output-file-prefix <out_prefix>

For additional BCL conversion options, see Input File Types.

Handling of N bases

One of the techniques that DRAGEN uses to optimize handling sequences can lead to the overwriting the base quality score assigned to N base calls.

Read Names for Paired-End Reads

DRAGEN has the following options to control how suffixes are used:

To change the delimiter character, for suffixes, use the --pair-suffix-delimiter option. Valid values for this option include forward-slash (/), dot (.), and colon (:).
To preserve the entire name, including the suffixes, set --strip-input-qname-suffixes to false.
To append a new set of suffixes to all read names, set --append-read-index-to-name to true. The delimiter is determined by the --pair-suffix-delimiter option. By default, the delimiter is a slash, so /1 and /2 are added to the names.

Gene Annotation Input Files

DRAGEN can take the SJ.out.tab file (see [SJ.out.tab]{.underline}) as an annotations file to help guide the aligner in a two-pass mode of operation.

Networked Streaming

AWS S3, Azure Blob Storage, and AWS Presigned URL Input Streaming

Input streaming is supported for the following use cases:

Mapping/aligning of FASTQ and BAM.
Germline and somatic small variant calling from BAM (without remapping).

For other file types that are significantly smaller in size, download them locally before running the analysis.

Streaming FASTQ Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 s3://s3-bucket-name/path/to/object_1.fastq.gz \
  -2 s3://s3-bucket-name/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 https://storage-account-name.blob.core.windows.net/path/to/object_1.fastq.gz \
  -2 https://storage-account-name.blob.core.windows.net/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 https://bucket-name.amazonaws.com/path/to/object_1.fastq.gz?querystring \
  -2 https://bucket-name.amazonaws.com/path/to/object_2.fastq.gz?querystring \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b s3://s3-bucket-name/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b https://storage-account-name.blob.core.windows.net/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b https://bucket-name.amazonaws.com/path/to/object_1.bam?querystring \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

AWS S3, Azure Blob Storage, Output Streaming

DRAGEN can stream its output to an AWS S3 Bucket or an Azure Blob Storage Account Container. Output streaming is beneficial for large output files and for sharing results.

Streaming output to AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory s3://s3-bucket-name/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Streaming output to Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory https://storage-account-name.blob.core.windows.net/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Security and Permissions

To stream input files or write to a cloud providers storage, you must have permission to access the remote files.

AWS S3

S3 requires AWS authentication and credentials. The authentication should already be set up on the instance you are running, for example, via IAM policies.

Azure Blob Storage Account

Azure requires authentication and environment variables. DRAGEN supports two cases: (1) Using managed identities and (2) Storage account access keys.

Presigned URL (AWS only)

Sample Sex

The --sample-sex option supports the following values. Values are not case-sensitive.

none: No sex karyotype input. Components use a default reference sex karyotype.
auto: The sex karyotype is estimated by the Ploidy Estimator. If using CNV calling, sex karyotype is determined using a separate sex estimation module. If DRAGEN cannot estimate the sex karyotype, then components do not have a sex karyotype input. This behavior is then the same as none. auto is the default value.
female: Sex karyotype input is XX.
male: Sex karyotype input is XY.

The following example command lines use --sample-sex to specify the sex karyotype.

--sample-sex FEMALE

--sample-sex MALE

--sample-sex NONE

The sex karyotype input is converted to the reference sex karyotype for the different components as follows. See the relevant component section for more information on how --sample-sex is used.

Reference Sex Karyotype

Sex Karyotype Input

CNV Caller

ExpansionHunter

Ploidy Caller

Small Variant Caller

SV Caller

XXYY

XXY

XXYY

XYY

XXYY

XXXY

XXYY

XXX

XXYY

None

XX/XY

XXYY

For sex karyotype input of None, CNV independently checks the coverage ratio of X and Y to determine the reference sex karyotype. Detection of minimal Y coverage will yield XY, otherwise XX.

Preservation or Stripping of BQSR Tags

Read Group Options

Attribute

Argument

Description

--RGID

Read group identifier. If you include any of the read group parameters, RGID is required. It is the value written into each output BAM record.

--RGLB

Library.

--RGPL

Platform/technology used to produce the reads. The BAM standard allows for values CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT and PACBIO.

--RGPU

Platform unit, eg, flowcell-barcode.lane.

--RGSM

Sample.

--RGCN

Name of the sequencing center that produced the read.

--RGDS

Description.

--RGDT

Date the run was produced.

--RGPI

Predicted mean insert size.

dragen --RGID 1 --RGCN Broad --RGLB Solexa-135852 \
--RGPL Illumina --RGPU 1 --RGSM NA12878 \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
-1 SRA056922.fastq --output-directory /staging/tmp/ \
--output-file-prefix rg_example

License Options

To suppress the license status message at the end of the run, use the --lic-no-print option. The following shows an example of the license status message:

LICENSE_MSG| =====================================================
LICENSE_MSG| License report
LICENSE_MSG|   Genome status [ACxxxxxxxxxxx] : used 1263.9 Gbases
since 2018-Feb-15 (1263886160894 bases, unlimited)
LICENSE_MSG|   Genome  bases [ACxxxxxxxxxxx] : 202000000
LICENSE_MSG|   Genome  bases [total]         : 202000000

Autogenerated MD5SUM for BAM and CRAM Output Files

The MD5SUM calculation is performed as the output file is written, so there is no measurable performance impact (compared to the Linux md5sum command, which can take several minutes for a 30x BAM).

Configuration Files

Cloud Authentication and Licensing

License Server

AWS Instance Metadata Service Support (IMDSv1/IMDSv2)

DRAGEN uses AWS Instance Metadata Service (IMDS) to identify its own metadata within the AWS environment, including location, identity, and configuration.

AWS IMDSv2 must be enabled on the AWS instance, otherwise, IMDSv1 is used by default. DRAGEN software will automatically detect the IMDS version in use and adapt its behavior accordingly.

Instance Identity

Save the instance identity document(s) as files from the user's instance, and provide them as inputs to the DRAGEN software with each run.
The instance identity document(s) only need to be saved once per AWS account and region, and those files can be re-used subsequently.

Examples for saving instance identity document(s):

AWS

IMDSv1

curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/pkcs7" -o /opt/instance-identity/pkcs7
curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/document" -o /opt/instance-identity/document
cp /opt/instance-identity/pkcs7 /opt/instance-identity/signature

IMDSv2

curl -X PUT -H "X-aws-ec2-metadata-token-ttl-seconds: 300" -H "X-aws-ec2-metadata-token: required" --noproxy "*" "http://169.254.169.254/latest/api/token"
curl -H "X-aws-ec2-metadata-token: <your-token>" --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/document"
curl -H "X-aws-ec2-metadata-token: <your-token>" --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/signature"
curl -H "X-aws-ec2-metadata-token: <your-token>" --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/pkcs7"

There should be 3 files in this folder, respectively named pkcs7, signature and document. Run Dragen using the --lic-instance-id-location ${instance_identity} command option.

Azure

curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/instance?api-version=2020-09-01" -o /opt/instance-identity/instance
curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/attested/document?api-version=2020-09-01" -o /opt/instance-identity/document

There should be 2 files in this folder, respectively named instance and document. Run Dragen using the --lic-instance-id-location ${instance_identity} command option.

Joint Analysis

DRAGEN supports pedigree-based and population-based germline variant joint analysis for multiple samples. A pedigree-based analysis deals with samples from the same species which are related to each other. A population-based analysis compares samples of the same species which are unrelated to each other.

Joint analysis requires a gVCF file for each sample. To create a gVCF file, run the germline small variant caller with the --vc-emit-ref-confidence gVCF option. There is also the option to write a germline gVCF with reduced size using the option --vc-compact-gvcf. This results in a significant speed up for a downstream analysis using gVCF Genotyper. Please note that this compact format is not compatible with a pedigree analysis.

The gVCF file contains information on the variant positions and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. Contiguous homozygous runs of bases with similar levels of confidence are grouped into blocks, referred to as hom-ref blocks. Not all entries in the gVCF are contiguous. A reference might contain gaps that are not covered by either variant line or a hom-ref block. Gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.

Combined phased variants in the gVCF input

The DRAGEN germline variant caller has an option --vc-combine-phased-variants-distance to combine phased variants in the gVCF output. Input gVCF files created with this option cannot be processed in a population-based analysis using gVCF Genotyper.

The option to combine phased variants is switched off by default, for details please refer to the section on germline small variant calling in this user guide.

Force-genotyped and targeted variants in the gVCF input

If force genotyping was enabled for any input file, any ForceGT calls that are not also called by the variant caller will be ignored.

Similarly, targeted variant calls (option --targeted-merge-vc) in any gVCF file that are not also called by the variant caller will be ignored as well.

Processing GATK gVCF Files

Both pedigree- and population-based joint analysis can process gVCF files written by the GATK v4.1 variant caller.

Joint Analysis Output Format

There are two available joint analysis output files:

Multisample VCF--A VCF file containing a column with genotype information for each of the input samples according to the input variants.
Multisample gVCF--A gVCF file augmenting the content of a multisample VCF file, similar to how a gVCF file augments a VCF file for a single sample. In between variant sites, the multisample gVCF contains statistics that describe the level of confidence that each sample is homozygous to the reference genome. Multisample gVCF is a convenient format for combining the results from a pedigree or small cohort into a single file. If using a large number of samples, fluctuation in coverage or variation in any of the input samples creates a new hom-ref block, which causes a highly fragmented block structure and a large output file that can be slow to create.

The multisample gVCF output is only available in the pedigree-based analysis.

The following example shows a single line from a multi-sample VCF where one sample has a variant, and the other two samples are in a gVCF gap. Gaps are represented by "./.:.:".

1 605262 . G A 13.41 DRAGENHardQUAL
AC=2;AF=1.000;AN=2;DP=2;FS=0.000;MQ=14.00;QD=6.70;SOR=0.693
GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP ./.:.:.:.:.:LowDepth
1/1:0,2:1.000:2:4:PASS:0,0:0,2:50,6,0:1.383e+01,4.943e+00,1.951e+00
./.:.:.:.:.:LowDepth

Hom-ref Blocks FORMAT Fields

In hom-ref blocks, the following FORMAT fields are calculated uniquely.

FORMAT/DP--In a single sample gVCF, the FORMAT/DP reported at a hom-ref position is the median DP in that band. In a multisample gVCF, the FORMAT/DP reported at a hom-ref position is the MIN_DP from hom-ref calls.
FORMAT/AD--In single sample gVCF, values represent the position in the band where DP=median DP. In the multisample gVCF, AD values at hom-ref positions are copied from the single sample gVCF.
FORMAT/AF--Values are based on FORMAT/AD.
FORMAT/PL--Values represent the Phred likelihoods per genotype hypothesis. For hom-ref blocks, each value in FORMAT/PL represents the minimum value across all positions within the band.
FORMAT/SPL and FORMAT/ICNT--Parameters reported in the gVCF records, including both hom-ref blocks and variant records. The parameters are used to compute the confidence score of a variant being de novo in the proband of a trio. For SNP, FORMAT/PL and FORMAT/SPL are both used as input to the DeNovo Caller. FORMAT/PL represents Phred likelihoods obtained from the genotyper, if the genotyper is called. FORMAT/SPL represents Phred likelihoods obtained from column-wise estimation, pregraph. Each value in FORMAT/SPL represents the minimum across all positions within the band. For INDEL, the PL value is computed in the joint pedigree calling step based on the FORMAT/ICNT reported in the gVCF file. FORMAT/ICNT consist of two values. The first value is the number of reads with no indels at the position, and the second value is the number of reads with indels at the position. Each value in FORMAT/ICNT represents the maximum of the value across all positions within the band.

In the following example hom-ref block, ICNT provides information on whether each sample contains an Indel at the position of interest. If the proband contains an indel at the position and the ICNT of the parents does not indicate any read supporting an indel, then the confidence score is high for the proband to have an indel de novo call at the position.

chr1 10288 . C <NON_REF> . PASS END=10290
GT:AD:DP:GQ:MIN_DP:PL:SPL:ICNT
0/0:131,4:135:69:132:0,69,1035:0,125,255:23,1

chr1 10291 . C
T,<NON_REF> 38.45 PASS
DP=100;MQ=24.72;MQRankSum=0.733;ReadPosRankSum=4.112;FractionInformativeReads=0.600;R2_5P_bias=0.000
GT:AD:AF:DP:F1R2:F2R1:GQ:PL:SPL:ICNT:GP:PRI:SB:MB
0/1:28,32,0:0.533,0.000:60:20,21,0:8,11,0:15:73,0,12,307,157,464:255,0,255:23,10:3.8452e+01,1.3151e-01,1.5275e+01,3.0757e+02,1.9173e+02,4.5000e+02:0.00,34.77,37.77,34.77,69.54,37.77:4,24,7,25:8,20,14,18

SPL and ICNT values are specific to DRAGEN. The GATK variant caller does not output SPL and ICNT values.

In a single sample gVCF, FORMAT/DP reported at a hom-ref position is the median DP in the band. The minimum is also computed and printed as MIN_DP for the band.

In the multisample gVCF, MIN_DP from hom-ref calls is printed as FORMAT/DP, and AD is just copied from the gVCF. Therefore, at a hom-ref position in the multi-sample gVCF output, the DP is not necessarily going be the sum of AD.

Pedigree Mode

Use pedigree mode to jointly analyze samples from related individuals and to perform de novo calling.

To invoke pedigree mode, set the --enable-joint-genotyping option to true. Use the --pedigree-file option to specify the path to a pedigree file that describes the relationship between panels.

The pedigree file must be a tab-delimited text file with the file name ending in the .ped extension. The following information is required.

Column Header

Description

Family_ID

The pedigree identifier.

Individual_ID

The ID of the individual.

Paternal_ID

The ID of the individual's father. If the founder, the value is 0.

Maternal_ID

The ID of the individual's mother. If the founder, the value is 0.

Sex

The sex of the sample. If male, the value is 1. If female, the value is 2.

Phenotype

The genetic data of the sample. If unknown, the value is 0. If unaffected, the value is 1. If affected, the value is 2.

The following is an example of an input pedigree file.

#Family_ID Individual_ID Paternal_ID Maternal_ID Sex Phenotype
FAM001 NA12877_Father 0 0 1 1
FAM001 NA12878_Mother 0 0 2 1
FAM001 NA12882_Proband NA12877_Father NA12878_Mother 2 2
FAM001 NA12883_Proband NA12877_Father NA12878_Mother 1 0

De Novo Calling

The De Novo Caller identifies all the trios within the pedigree and generate a de novo score for each child. The De Novo Caller supports multiple trios within a single pedigree. Pedigree Mode supports de novo calling for small, structural, and copy number variants.

Pedigree Mode is run in multiple steps. The following is an example workflow for a trio using FASTQ input.

Run single sample alignment and variant calling to generate per sample output using the following inputs for Pedigree Mode.
- gVCF files for the Small Variant Caller.
- *.tn.tsv files for the Copy Number Caller.
- BAM files for the Structural Variant Caller.
Run Pedigree Mode for Small Variant Caller. For more information, see Small Variant DeNovo Calling.
Run Pedigree Mode for Copy Number Caller. For more information, see Multisample CNV Calling.
Run Pedigree Mode for Structural Variant Caller. For more information, see Structural Variant De Novo Quality Scoring.
Run DeNovo Variant Small Variant Filtering. For more information, see De Novo Small Variant Filtering.

Small Variant DeNovo Calling

The Small Variant De Novo Caller considers a trio of samples at a time. The samples are related via a pedigree file. The Small Variant De Novo Caller determines all positions that have a Mendelian conflict based on the genotype from the individual sample gVCFs. Sex chromosomes in males are treated as haploid apart from the PAR regions, which are treated as diploid.

Each of those positions is then processed through the Pedigree Caller to compute a joint posterior probability matrix for the possible genotypes. The probabilities are used to determine whether the proband has a de novo variant with a DQ confidence score. All three subjects are assumed to have an independent error probability.

At positions where the original genotype from the gVCFs shows a double Mendelian conflict (eg, 0/0+0/0->1/1 or 1/1+1/1->0/0), the genotypes of the trio samples can be adjusted to the highest joint posterior probability that has at least one Mendelian conflict.

The DQ formula is DQ = -10log10(1 - Pdenovo).

Pdenovo is the sum of all indexes in the joint posterior probability matrix with one of more Mendelian conflicts.

In the GT overwrite step, it is possible for the GT of the parents to be overwritten. In the case of multiple trios, the GT of the parents is based on the last trio processed. The trios are processed in the order they are listed in the pedigree file. DRAGEN currently does not add an annotation in the VCF in cases where the GT was overwritten.

The multisample VCF file is annotated with FORMAT/DQ and FORMAT/DN fields to the output a VCF file that represents a de novo quality score and an associated de novo call. The DN field in the VCF is used to indicate the de novo status for each segment.

The following are the possible values:

Inherited--The called trio genotype is consistent with Mendelian inheritance.
LowDQ--The called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold.
DeNovo--The called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold.

The following is an example VCF line for a trio:

1 16355525 . G A 34.46 PASS AC=1;AF=0.167;AN=6;DP=45;FS=6.69;MQ=108.04;MQRankSum=-0.156;QD=2.46;ReadPosRankSum=0;SOR=0.016 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ 0/1:11,3:0.214:14:39:PASS:8,2:3,1:74,0,47:39.454,0.00053613,49.99:0,1,104:74,0,47:DeNovo:0.67375 0/0:18,0:0:16:48:PASS:.:.:0,48,605:.:0,12,224:0,48,255:.:. 0/0:14,0:0:14:42:PASS:.:.:0,42,490:.:0,5,223:0,42,255:.:.

Pedigree Mode Options

The following command line options are available for de novo small variant calling.

--enable-joint-genotyping--Run the joint genotyping caller.
--pedigree-file--Specify the path to a pedigree file that describes the relationship between samples. It is possible to run JointGenotyper without a pedigree file on unrelated samples, but we do not recommend this anymore for gVCF variant calls from DRAGEN 3.10 or newer.
--variant or --variant-list --Specify the gVCF input to the workflow. The pedigree caller can read input gVCF files from an AWS S3 bucket, Azure storage BLOB, or pre-signed URL.
--qc-snp-denovo-quality-threshold--Specify the minimum DQ value for a SNP to be considered de novo. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
--qc-indel-denovo-quality-threshold--Specify the minimum DQ value for an indel to be considered de novo. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.
--output-directory--The output directory. This is required.
--output-file-prefix--The prefix used to label all output files. This is required.
-r The directory where the hash table resides.

The output of the joint genotyper depends on the order of input gVCF files passed on the command line using --variant or --variant-list. It is recommended to use the same input order when re-analyzing gVCFs to ensure the output is the same as an earlier run.

Population Mode

DRAGEN provides a population-based analysis option to jointly analyze samples from unrelated individuals.

The tool for population-based analysis is the iterative gVCF Genotyper. Its input is a set of single or multisample gVCFs. The output is a multisample VCF that contains one entry for any variant seen in any of the input gVCFs. The variants are genotyped across all input samples using information from the hom-ref blocks as necessary. The iterative gVCF Genotyper does not adjust genotypes based on population information but it provides means to filter variant sites based on information leveraged from the population. See Iterative gVCF Genotyper analysis for information on the available command line options.

To compare multiple pedigrees, you can run gVCF Genotyper on the output of a pedigree analysis and merge multiple joint-called pedigrees into a single multisample VCF. To enable, run the pedigree analysis using the --enable-multi-sample-gvcf=true option to write a multisample gVCF.

Iterative gVCF Genotyper analysis

gVCF Genotyper offers an iterative workflow to aggregate new samples into an existing cohort. The iterative workflow allows users to incrementally aggregate new batches of samples with existing batches, without having to redo the analysis from scratch across all samples, every time when new samples are available. The workflow takes single sample gVCF files as input, and can be performed in a "step-by-step mode" if multiple batches of samples are available, or "end-to-end mode", if only a single batch of samples is available. Multi-sample gVCF files output from the Pedigree Caller (described above) are also accepted as input. gVCF Genotyper can accept input gVCF files generated using DRAGEN version 3.2.6 or later.

Step 1 (gVCF aggregation): the user can use iterative gVCF Genotyper to aggregate a batch of gVCF files into a cohort file and a census file. The cohort file is a condensed data format to store gVCF data in multiple samples, similar to a multi-sample gVCF. The census file stores summary statistics of all the variants and hom-ref blocks among samples in the cohort. As part of this step, adjacent hom-ref blocks with matching FILTER columns are further merged to reduce the disk footprint of the intermediate files, FORMAT field values being base-pair weight averaged in the process.

When a large number of samples are available, the user can divide samples into multiple batches each with similar sample size (e.g. 1000 samples), and repeat Step 1 for every batch.

Step 2 (census aggregation): after all per batch census files are generated, the user can aggregate them into a single global census file. This step scales to aggregate thousands of batches, in a much more efficient way than aggregating gVCFs from all batches. When a new batch of samples becomes available, the user only needs to perform Step 1 on that batch, then aggregate the census file from the new batch with the global census file from all previous batches in order to generate an updated global census file.

Step 3 (msVCF generation): every time a global census file is updated, with new variant sites discovered and/or variant statistics updated at existing variant sites, the user can take a per-batch cohort file, per-batch census file and the global census file as input, and generate a multi-sample VCF for one batch of samples. The output multi-sample VCF contains the variants and alleles discovered in all samples from all batches, and also includes global statistics such as allele frequencies, the number of samples with or without genotypes, and the number of samples without coverage. Similar statistics among samples in the batch are also included. This step can be repeated for every batch of samples, and the number of records in each output multi-sample VCF is the same across all batches.

To facilitate parallel processing on distributed compute nodes, for every step above, the user can choose to split the genome into shards of equal size, and process each shard using one instance of iterative gVCF Genotyper on each compute node. See option --shard below.

There is a special treatment of alternative or unaligned contigs when the --shard option is enabled: all contigs that are not autosomes, X, Y or chrM are included in the last shard. No other contigs will be assigned to the last shard. The mitochondrial contig will always be on its own in the second to last shard.

If a combined msVCF of all batches is required, an additional step can be separately run to merge all of the batch msVCF files into a single msVCF containing all samples.

Commandline arguments common to all steps:

--enable-gvcf-genotyper-iterative: set to true to run the iterative gVCF Genotyper (always required).
--ht-reference: The file containing the reference sequence in FASTA format (always required).
--output-directory: The output directory (always required).
--output-file-prefix: The prefix used to label all output files (optional, default value dragen).
--shard: Use this option to process only a portion ('shard') of the genome, when distributing the work across multiple compute nodes in a production workflow. Provide the index (1-based) of the shard to process and the total number of shards, in the format of n/N (e.g. 1/50 means shard 1 of total 50 shards). To facilitate concurrent processing within each job, the shard will by default be split into 10x the number of available threads. This option assumes a Human reference genome and might not work for non-Human reference genomes.
--gg-regions: Use this option to test iterative gVCF Genotyper only for a subset of regions in the genome. The value is a list of regions (chr:start-end) delimited by comma. Contig names must match those in the reference and no region may overlap another. If a single region larger than 1Mb is selected, multiple threads are enabled. Otherwise, one thread is launched per region. This assumes that the --shard option is not given. It is important that the same regions are chosen for each step 1,2 and 3.
--gg-regions-bed: If a path to a BED file is provided as value, this option, like the one above, will limit the iterative gVCF Genotyper processing to the genome regions specified therein, which must be non-overlapping. This option is intended for exome input data. It results in faster processing times and is compatible with sharding. This option will only take effect in step 1 or end-to-end mode. It differs from the option above in that, if the number of regions exceeds 10 times the number of available threads, they will not necessarily be processed by independent threads.
--gg-discard-ac-zero If set to true, the gVCF Genotyper does not print variant alleles that are not called (hom-ref genotype) in any sample. The default value is true.
--gg-remove-nonref If set to true, the <NON_REF> symbolic allele is removed in the process of reading in input gVCF files. The default value is true.
--gg-vc-filter Discard input variants that failed filters in the upstream caller. The default is false. Affected records will have their genotype set to hom-ref and the filter string "ggf" added to FORMAT/FT.
--gg-hard-filter Specifies a filtering expression to be applied to the output msVCF records. See msVCF hard filtering below. The default is to apply no filters.
--gg-skip-filtered-sites Omits msVCF records that fail the given hard filter. The default is false.
--gg-msvcf-format-fields Can be used to override the default set of sample genotype fields in the output msVCF. See msVCF metric customization below.
--gg-msvcf-info-fields Can be used to override the default set of site-wise INFO fields in the output msVCF. See msVCF metric customization below.
--gg-squeeze-msvcf Set to omit genotype fields other than GT from the output msVCF for confidently called hom-ref sample records.
--gg-gq-squeezing-threshold Use in conjunction with the previous option to adjust the threshold on GQ (default 30) that signifies a confident hom-ref call.
--gg-output-type Set to spvcf to write the output in spVCF format rather then the default msVCF. See File size optimizations below for details.
--gg-diploidify In the output msVCF file, convert haploid calls to diploid. The diploidified genotype is homozygous in the haploid call e.g. 1 becomes 1/1. The LPL field is also diploidified for these samples. Site metrics, such as allele counts, are calculated before diploidification. Diploidifying genotypes may ease the ingestion of msVCF files into downstream analysis tool, such as Hail and Plink. When this option is enabled, it is possible to include the DF FORMAT field (included by default) that signifies whether or not a genotype has been diploidified, see msVCF metric customization below.

Commandline arguments for Step1 (step-by-step mode)

--gvcfs-to-cohort-census: set to true to aggregate gVCF files from one batch of samples into a cohort file and a census file.
--variant-list: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant: if --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.

Commandline arguments for Step2 (step-by-step mode)

--aggregate-censuses: set to true to aggregate a list of per batch census files into a global census file.
--input-census-list: the path to a file containing a list of input per batch census files (from Step1), with the absolute path to each file on a separate line.

Commandline arguments for Step3 (step-by-step mode)

--generate-msvcf: set to true to generate a multi-sample VCF for one batch of samples.
--input-cohort-file: the path to the per batch cohort file (from Step1).
--input-census-file: the path to the per batch census file (from Step1).
--input-global-census-file: the path to the global census file (from Step2).

Commandline arguments for running Step1 + Step2 + Step3 (end-to-end mode for a single batch)

--gvcfs-to-msvcf : set to true to enable the end-to-end mode. This is the default is none of the steps 1,2 or 3 above is selected.
--variant-list: the path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant: if --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.

Commandline arguments for merging per-batch msVCF files

--merge-batches: set to true to merge msVCF files for a set of batches.
--input-batch-list: the path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file, with the same set of options, and by default all batches pertaining to that global census must be included in the merge.
--gg-enable-indexing: set to true (the default) to generate a tabix index for the merged msVCF.
--gg-merge-subset: set to override the restriction that all batches must be included in the merge.

Enabling use of mimalloc library for enhanced performance

Mimalloc is a custom memory allocation library that can yield a significant speed-up in the iterative gVCF Genotyper workflow. In some deployments, e.g. cloud, it is automatically and seamlessly used, but in other contexts it requires special user intervention to be activated, as at present it cannot be included in standard DRAGEN by default.

For this purpose, the convenience script mi_dragen.sh is provided, which loads the bundled library and can be transparently used in the same way as the DRAGEN executable. Please note that its use is only intended and supported for use with the iterative gVCF Genotyper component, although it can in principle be applied for any other DRAGEN workflow too. Its use for other purposes is known to possibly lead to undesirable memory overuse and thus should be undertaken at the user's own risk.

The multi-sample VCF output of the iterative gVCF Genotyper

The output of gVCF Genotyper is a multi-sample VCF (msVCF) that contains metrics computed for all samples in the cohort.

The msVCF can become a very large file with increasing cohort size. In some cases, the file might need more storage than can be allocated by VCF parsers. This is caused by VCF entries such as FORMAT/PL which store a value for each combination of alleles. We therefore decided to replace FORMAT/PL with a tag FORMAT/LPL which stores a value only for the alleles that actually occur in the sample. Similarly, the msVCF also contains FORMAT/LAD which stores the allelic depth only for the alleles occurring in the sample.

We also added a new FORMAT/LAA field which lists 1-based indices of the alternate alleles that occur in the current sample. The allele order of other local fields is the same as that of LAA.

This approach is also referred to as local alleles and is also used by open source software such as bcftools and Hail.

Iterative gVCF Genotyper with Mitochondrial Variant Calls

When processing mitochondrial variant calls, which may contain separate records for each allele, iterative gVCF Genotyper processing differs in the following ways:

Only the record with the highest FORMAT/AF sum is kept.
The FORMAT/AF field will be additionally collected, and used to generate the FORMAT/LAF field in the output msVCF

QUAL column in msVCF

The value displayed in the QUAL column of the msVCF is the maximum of the input QUAL values for the site across the global cohort. The QUAL value will be missing if any of the batch census files used to create the global census were generated with a version of DRAGEN earlier than v4.2.

Measures of Hardy-Weinberg Equilibrium in the msVCF output

The Hardy-Weinberg Equilibrium (HWE) states that, given certain conditions, genotype and allele frequencies should remain constant between generations. Deviations from HWE can results from violations of the underlying HWE assumptions in the population, non-random sampling or may be artifacts of variant calling. Adherance to HWE can be assessed by comparing the observed frequencies of genotypes to those expected under HWE given the observed allele counts.

Iterative gVCF Genotyper offers several metrics for assessing adherence to HWE. It calculates both allele-wise and site-wise HWE P-values, an allele-wise excess heterozygosity (ExcHet) P-value and the site-wise inbreeding coefficient (IC). These metrics are calculated only for diploid sites and missing values are excluded from the calculations. These values are included as fields in the INFO column of the output msVCF file. Both batch-wise and global values are included, where the field names for the global values are prefixed with G.

Metric

Description

Scope

Number of values

HWE

Hardy-Weinberg Equilibrium P-value

Allele-wise

One for each alt allele

ExcHet

Excess Heterozygosity P-value

Allele-wise

One for each alt allele

HWEc2

Hardy-Weinberg Equilibrium P-value

Site-wise

Inbreeding Coefficient

Site-wise

Care should be taken when interpreting these metrics for small cohorts and/or low frequency alleles, as small changes in inputs can lead to large changes in their values. Further, violations of the underlying HWE assumptions (such as inbreeding), and non-random sampling (such as the presence of consanguineous samples), can adversely affect results, making identification of poorly called variants more difficult.

Where it is not possible to calculate the metric, they are represented as missing (i.e., ".") in the msVCF file. This can vary between the metrics, but may occur if non-diploid genotypes are encountered, if there is only one allele present at a site, or if no samples are genotyped at a site.

Allele-wise vs site-wise Hardy-Weinberg metrics

Iterative gVCF Genotyper offers both allele-wise and site-wise HWE P-values. The allele-wise P-values are based on the exact-conditional method Am J Hum Genet. 2005 May; 76(5): 887–893 the site-wise P-values are based on Pearson's chi-squared method. For bi-allelic sites, although both are measuring the same property, their values may differ. The differences between the methods are explored in Am J Hum Genet. 2005 May; 76(5): 887–893. Care should be taken when deciding which to use.

Allele-Wise Hardy-Weinberg Equilibrium and Excess Heterozygosity.

Iterative gVCF Genotyper calculates allele-wise HWE and the ExcHet P-values. The values are calculated using the exact-conditional method described in Am J Hum Genet. 2005 May; 76(5): 887–893. The implementation does not use a mid P-value correction.

For HWE a P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it. For ExcHet a P-value of ≈ 0.5 suggests that the number of heterozygotes is close to the number expected under HWE, while a value ≈ 1 suggests that there are more heterozygotes than expected and a value ≈ 0 suggests that there are fewer heterozygotes than expected.

For a bi-allelic site the HWE P-values is based on the numbers of homozygotes and heterozygotes comparing the observed to expected. For a multi-allelic site, P-values are calculated per ALT allele as if it were bi-allelic. Genotypes composed of only the ALT allele being considered are counted as alternative homozygous, any other genotype containing a copy of the ALT allele being considered are counted as a heterozygous, and any genotype with no copies of the ALT allele being considered are counted as reference homozygous (this may include genotypes containing other ALT alleles).

Site-Wise Hardy-Weinberg Equilibrium.

Iterative gVCF Genotyper calculates a site-wise HWE P-value. The value is calculated using the Pearson's chi-squared method, comparing the genotype counts expected under HWE to those observed. The chi-squared test statistic is calculated as

𝜒2 = ∑gt (Egt - Ogt)2 / Egt

where the summation is over gt is over all genotypes possible at the site given the alleles present, and Egt and Ogt are the expected and observed counts for genotype gt, respectively. From the chi-squared test statistic the P-value is then calculated from a chi-squared distribution where the number of degrees of freedom is the number of possible genotypes minus the number of alleles, which is

dof = n(n-1)/2

where n is the number of alleles.

The batch-wise value uses only the alleles present in the batch. Alleles with AC=0 are not included in the calculation.

A P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it.

The Inbreeding Coefficient

Iterative gVCF Genotyper calculates the inbreeding coefficient (IC) (sometimes called the Fixation index and denoted by F). It is defined as the proportion of the population that is inbred. The value of IC can be estimated by looking at the observed number of heterozygotes in comparison to the number expected under HWE:

IC = 1 - O(het) / E(het),

where O(het) and E(het) are the observed and expected number of heterozygotes in the cohort, respectively. Although initially conceived for studying inbreeding and defined as a non-negative value, it is also commonly used to look for deviations from HWE and can take values in the range [-1, 1].

Values of IC ≈ 0 suggest that the cohort is in HWE. Negative values suggest an excess of heterozygosity and a deviation from HWE, which can be symptomatic of poor variant calling. Positive values suggest a deficit of heterozygotes and the possible presence of inbreeding.

Using the above definition, IC should be a property of the population, and so would be expected to be drawn from the same distribution for all sites and for all variants at a site. Deviations from this distribution can suggest issues in calling a site correctly. Violations of HWE assumptions and/or non-random sampling may adversely affect the distribution of IC, causing it to be shifted. However, outliers can still be identified, although thresholds may need to be adjusted accordingly.

Allelic Balance

Allelic balance (AB) describes the proportion of reads that support each allele within a called genotype and can be calculated from the allelic depth (FORMAT/AD or FORMAT/LAD). For homozygotes this is taken as

AB = ADi / ∑j ADj

where i is the index of the called allele and j runs over all alleles. For heterozygotes this is taken as

ABi = ADi / (ADa + ADb)

where a and b are the indices of the called alleles and i can have values a or b. For homozygous genotypes AB is expected to be ≈ 1 and for heterozygous genotypes it is expected to be ≈ 0.5 for each allele. Deviations from the expected values can be indicative of an error.

DRAGEN's iterative gVCF Genotyper calculates site-wise AB values for each allele based on the read depths among all samples. Only diploid genotypes are included in the calculations. Values are calculated separately among homozygous (ABHom) and heterozygous (ABHet) genotypes. ABHet is calculated using the counts among all heterozygous calls that contains the allele under consideration. P-values for ABHet are also calculated (ABHetP) based on a binomial test with an expected probability of 0.5. A P-value of ≈ 1 signifies that results are in line with expectation while ≈ 0 signifies a deviation from expectation. Values are written to the INFO fields ABHom, ABHet and ABHetP, with one value for each allele (including the reference allele). Values should be in the range [0, 1]. Missing values are coded by -1, for example where there are no homozygous calls for an allele. If AD is not present in any input gVCF file, the values are not calculated and the fields will be omitted from the output msVCF file.

It is also possible to filter based on the maximum ABHetP value, see msVCF hard filtering.

msVCF hard filtering

Sites in the output msVCF can be filtered on the following global metrics:

QUAL
Number of samples with called genotypes (GNS_GT)
Inbreeding coefficient (GIC)
𝜒2 Hardy-Weinberg Equilibrium P-value (GHWEc2)
The maximum P-value for heterozygous allelic balance (GABHetP)

The syntax of a filtering expression is the same as that used by the small variant caller (see Germline Small Variant Hard Filtering). Filters are always applied to the globally-computed metrics, not the values for the current batch. Records failing filter will have the specified filter ID(s) written to the FILTER column of the msVCF, or will be omitted entirely if the --gg-skip-filtered-sites option is specified. Since filtering is on a per-site basis, filters cannot be applied separately to SNPs or indels as they can in the variant caller.

msVCF metric customization

The per-sample genotype metrics in the output msVCF can be customized by providing a colon-separated list of metrics, analogous to that of the VCF FORMAT column, to the --gg-msvcf-format-fields option, e.g. --gg-msvcf-format-fields=GT:LAD:LPL:LAA:QL. Supported metrics are GT, GQ, AD, LAD, FT, LPL, LAA, LA, LGT, QL, MQR, LAF and DF (N.B. LAF will only appear on the MT contig and DF will only appear if the --gg-diploidify option is enabled). Sample genotype (GT) is always present and always shown first, regardless of whether it is included in the option string or not. Alternatively, an msVCF containing only site statistics and no per-sample genotype fields can be generated using the option --gg-msvcf-format-fields=None.

Metric

Description

Number of values1

Type

Genotype

String

Genotype quality

Integer

Allelic depths

Integer

LAD

Localized allelic depths

Integer

Sample filter

String

LPL

Local normalized, Phred-scaled likelihoods for genotypes as in original gVCF

Integer

LAA

Mapping of local alt allele index from original gVCF to msVCF excluding the reference allele

Integer

Mapping of local allele indices from original gVCF to msVCF including the reference allele

Integer

LGT

Local GT value as in original gVCF

String

Phred-scaled probability that the site has no variant in this sample (original gVCF QUAL)

Float

MQR

Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities (original gVCF INFO/MQRankSum)

Float

LAF

Allele fractions for the local alt alleles

Float

Diploidified, 1 represents a genotype that was originally haploid, 0 represents originally diploid

Integer

The per-site INFO metrics in the output msVCF can be customized by providing a semicolon-separated list of metrics, analogous to that of the VCF INFO column, to the --gg-msvcf-info-fields option, e.g. --gg-msvcf-info-fields=AC;AN;NS;NS_GT;NS_NOGT;NS_NODATA;AF. Supported metrics are AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet, HWEc2, AF. The default set of metrics is AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet and HWEc2. All INFO fields can be included using the option --gg-msvcf-info-fields=All. All INFO fields can be dropped using the option --gg-msvcf-info-fields=None, in which case the INFO field will contain the missing symbol, .. For each specified metric, the value for the current batch and the global value are written. For global values, the metric names are prepended by G. INFO fields that have a missing value, ., at a site are omitted from the msVCF for that site, so sites may contain different sets of fields.

Metric

Description

Number of values1

Type

Allele count in genotypes

Integer

Total number of alleles in called genotypes

Integer

Total number of samples

Integer

NS_GT

Total number of samples with called genotypes

Integer

NS_NOGT

Total number of samples with unknown genotypes ./.

Integer

NS_NODATA

Total number of samples with no coverage

Integer

The inbreeding coefficient

Float

HWE

The exact conditional Hardy-Weinberg Equilibrium P-value

Float

ExcHet

he exact conditional Excess Heterozygosity P-value

Float

HWEc2

The chi-squared Hardy-Weinberg Equilibrium P-value

Float

The ALT allele frequencies (AC/AN)

Float

ABHom

The allelic balance among homozygotes

Float

ABHet

The allelic balance among heterozygotes

Float

ABHetP

The P-value for allelic balance among heterozygotes

Float

File size optimizations

For sizable cohorts, the file outputs from gVCF Genotyper can become extremely large. However, there are a number of options within the component which can mitigate this. As well as reduced footprint on disk, these options can lead to faster runtimes owing to the diminished I/O demands.

The following options have applicability to this:

The small variant caller's --vc-compact-gvcf, described previously. This doesn't reduce output file sizes, but the smaller input gVCFs reduce gVCF Genotyper runtime and could reduce data storage costs.
The removal of the NON_REF symbolic allele when ingesting the input gVCF files, which is the default behaviour. Doing this reduces the size not only of the final msVCF output, but also the intermediate cohort and census files.
Several options exist that reduce the volume of data written to the final msVCF file:
- Outputting local allele values, as described above.
- Use of the msVCF metric customization options to output only those metrics required for the downstream analysis.
- Omitting records that fail filters (--gg-skip-filtered-sites option).
- Dropping trailing genotype fields for hom-ref records (--gg-squeeze-msvcf option). This behaviour is explicitly permitted by the VCF specification.

The option that can have the biggest impact on the final output file size is that to write it directly in spVCF format. This is a lossless encoding and the space saving can be dramatic: file size reductions of multiple tens of times have been observed for large cohorts with sparsely distributed variants. Files output as spVCF at step 3 (--generate-msvcf) can be directly merged via the --merge-batches subcommand to produce a single spVCF file. spVCF-encoded files are likely to require decoding back to full msVCF for use with downstream tools, and a binary for this is available for download. The decoding will take time, but this is offset by the reduced time required within gVCF Genotyper to initially write the smaller spVCF files. Users are recommended to, if possible, directly pipe the decoded data into the downstream tool rather than first writing the full msVCF file to disk.

1: The number of values is coded as per the VCF specification, with A denoting one value per alt allele, R one value per possible allele (including the reference allele), G one value per possible genotype and . an unspecified number of values that may vary between site and sample. The number of elements in localised array FORMAT fields that depend on the number of local alleles will vary between samples and so are specified as ..