Illumina DRAGEN (Dynamic Read Analysis for GENomics) secondary analysis was developed to address important challenges associated with analyzing NGS (Next Generation Sequencing) data for a range of applications, including genome, exome, transcriptome, and methylome studies. DRAGEN secondary analysis processes NGS data and enables tertiary analysis to drive insights. The available tools make up a highly accurate, comprehensive, and efficient solution that enables labs of all sizes and disciplines to do more with their genomic data.
Product highlights
Accurate results:
Pangenome reference genome and machine learning drive unprecedented accuracy
99.89% accuracy score with the Precision FDA Truth Challenge V2 benchmark data (2,3)
Comprehensive platform:
Analyze NGS data from whole genomes, exomes, methylomes, and transcriptomes
Available on platform of choice and scalable based on needs
Efficient analysis:
Process a 34x genome in ~30 minutes with all supported callers on DRAGEN server v4 (1)
Reduce FASTQ file sizes up to 5x with DRAGEN ORA Compression
References:
(1) Illumina data on file, 2022.
(2) Illumina DRAGEN Secondary Analysis is the first single platform to achieve 99.89% accuracy based on PrecisionFDA Truth Challenge V2 benchmark data. See "DRAGEN sets new standard for data accuracy in PrecisionFDA benchmark data." Accessed March 22, 2023.
(3) PrecisionFDA Truth Challenge V2: Calling Variants from Short and Long Reads in Difficult-to-Map Regions. precision.fda.gov/challenges/10. Accessed November 3, 2020.
DRAGEN analysis is available on multiple platforms.
DRAGEN on-premises server
DRAGEN on-premises server offers highly accurate secondary analysis in a fraction of the time compared with a traditional CPU-based system.
- Analyze and store data locally
- Supports varying levels of command line interface
- Replace up to 30 traditional compute instances
- Fully process a 34× whole human genome in ~30 minutes (1)
- One unit supports two NovaSeq 6000 Systems running at full capacity
DRAGEN analysis on Illumina Connected Analytics
Couples the accuracy and speed of DRAGEN with the ability to customize analysis pipelines and operationalize informatics on a secure platform.
DRAGEN on BaseSpace Sequence Hub (BSSH)
Push-button analysis in an intuitive, easy-to-use interface, with the compliance and storage features of BaseSpace Sequence Hub and Amazon Web Services (AWS).
DRAGEN onboard NovaSeq X Series
- Flexibly runs multiple secondary analysis pipelines in parallel
- Performs up to four simultaneous applications per flow cell in a single run
- Brings up to 5x lossless data compression and analysis with supported applications
- Provides savings on analysis, which over five years can exceed the price of the sequencer
DRAGEN onboard NextSeq 1000 and NextSeq 2000 Systems
- Provides access to select DRAGEN analysis informatics pipelines
- Enables users to generate results in as little as two hours
- Uses intuitive pipeline algorithms to reduce reliance on external informatics experts
DRAGEN onboard MiSeq i100 Series
Intuitive, ultra-rapid analysis including DRAGEN BCL convert, DRAGEN Library QC, DRAGEN small WGS, and DRAGEN Microbial Enrichment Plus.
- Rapid results with comprehensive secondary analysis generated in two hours or less (2)
- Highly efficient workflow with a single user touchpoint to VCF and/or html report and no intermediate file transfers
- Exceptionally easy with an intuitive interface for non-expert users
DRAGEN on AWS, Azure
DRAGEN supports the FPGA-enabled instance types of AWS and Azure. The RPM installers and the kernel driver can be installed on images managed by the user, and DRAGEN can be run after purchasing a license.
DRAGEN on AWS and Azure Marketplace
Pre-configured Amazon Machine Images (AMI) and Azure Virtual Machines with DRAGEN installed can be accessed from the respective marketplace offerings in a Pay-As-You-Use model.
DRAGEN on GCP
DRAGEN is made available on the Google Cloud Platform. Pre-configured instances with DRAGEN installed can be accessed through the GCP application interface. Limited availability. Please reach out to your Illumina representative for access.
(1) HG002 from PrecisionFDA truth challenge V2 run with DRAGEN analysis v4.0 on DRAGEN server v4, all callers
(2) When run according to sample recommendations
DRAGEN analysis offers a large selection of application pipelines.
DRAGEN Demultiplexing
Rapid demultiplexing of NGS data
N/A
N/A
DRAGEN ORA Compression
DRAGEN ORA compression is optimized for high compression ratios of FASTQ files, as well as rapid compression and decompression, all while preserving data integrity.
N/A
Compression Ratio Run Time
DRAGEN Map + Align
DRAGEN Map + Align can be run standalone or as part of DRAGEN’s suite of pipelines.
N/A
Mapping metrics Duration Metrics Coverage Metrics
DRAGEN Germline
The DRAGEN Germline Pipeline provides end-to-end NGS analysis, including advanced error model calibration for increased accuracy, and repeat expansion detection and genotyping through Illumina Expansion Hunter.
SNV/Indel CNV SV Repeat Expansions
Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report
DRAGEN Somatic
The DRAGEN Somatic Pipeline includes tumor-only and tumor–normal modes, designed for detecting somatic variants in tumor samples. Both modes make no ploidy assumptions, enabling detection of low-frequency alleles.
SNV/Indel CNV SV TMB MSI HLA
Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report
DRAGEN Enrichment
The DRAGEN Enrichment Pipeline combines DRAGEN’s germline and somatic callers into a pipeline designed specifically for analyzing enrichment samples. Includes a full suite of enrichment metrics and reporting.
SNV/Indel CNV SV
Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report
DRAGEN RNA
The DRAGEN RNA Pipeline performs transcriptome analysis starting with splice junction discovery and alignment, followed by rapid alignment and splice junction mapping and quantification. For differential expression, Illumina recommends the DRAGEN Differential Expression app on BaseSpace Sequence Hub.
Gene fusion SNV/Indel
Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report
DRAGEN Single Cell RNA
The DRAGEN Single Cell RNA pipeline performs demultiplexing, cell-barcode and UMI error correction, sequence alignment, and quantification of gene expression.
N/A
Mapping Metrics Duration Metrics Coverage Metrics Callability Report Cell Metrics
DRAGEN Joint Genotyping
The DRAGEN Joint Genotyping/Population Pipeline calls variants jointly across multiple genomes and scales to large cohorts of samples at expedited speeds with uncompromising accuracy.
SNV/Indel CNV SV Repeat Expansions
Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report
DRAGEN Methylation
The DRAGEN Methylation Pipeline performs alignment, methyl calling, and calculates alignment and methylation metrics.
N/A
Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report
DRAGEN Reference Builder
Accepts FASTA files, and builds the proprietary reference used by the DRAGEN apps.
N/A
N/A
DRAGEN TruSight Oncology 500 ctDNA Analysis Software
Secondary analysis support for Illumina’s TruSight Oncology 500 ctDNA. Available on the local DRAGEN Server version 3 and later.
SNV/Indel CNV DNA fusions MSI TMB
Mapping metrics Duration Metrics Coverage Metrics Variant Metrics Callability Report
DRAGEN Imputation
The DRAGEN Imputation pipeline is an end-to-end, user-friendly tool that enables scalable low-pass whole-genome sequencing analysis.
N/A
Impute ≤100 samples simultaneously 1.7x faster compared to original GLIMPSE code
DRAGEN analysis can be used in numerous fields in the biological sciences.
Genetic Diseases
Reduce time required for genomic analysis, with high accuracy and comprehensiveness
Oncology
Analyze tumor-only and tumor/normal samples with accuracy, comprehensiveness, and efficiency
Cell and Molecular Biology
Advance understanding of cellular mechanisms with rapid analysis pipelines for bulk and single cell samples
Population Genomics
Accurately and efficiently analyze sequenced genomes at scale. Accelerate re-analysis as computational tools improve over time
Infectious Disease
Detect and characterize infectious diseases with a comprehensive solution
Agrigenomics
Efficiently analyze animals and plants of varying genomic complexities with custom reference
DRAGEN provides tests you can run to make sure that your DRAGEN system is properly installed and configured. Before running the tests, make sure that the DRAGEN server has adequate power and cooling, and is connected to a network that is fast enough to move your data to and from the machine with adequate performance.
Please refer to the Server Site Prep & Installation Guide when installing a new system.
The software can be installed on an on-premises server by executing the .run installer for the desired version. Installers are made available for all releases at the DRAGEN Software Support Site page.
Installation procedure:
Download the desired installer from the support website and unzip the package
The archive integrity can be checked using: ./<dragen .run file> --check
Install the appropriate release based on your Linux OS with the command: sudo sh <dragen .run file>
The .run file includes a script that uninstalls any existing DRAGEN software, checks the integrity of the package and files, and installs the new DRAGEN software version. The DRAGEN software is installed in part by use of the Linux RPM Package Manager (rpm). Several rpm packages make up the installation of a single DRAGEN software version. The RPM packages also configure the system for DRAGEN, for example by raising user ulimits, and the .run script starts the services needed for functionality, such as the licensing daemon dragen_licd and the hugepages daemon dragend_hp.
NOTE: Root privileges are required for the installation.
Up to DRAGEN Software v4.2, only one version of the DRAGEN software can be installed at a time. Executing the .run file will remove any existing installed version and (re)install the new version.
After installation, the application and associated files are available at /opt/edico.
The single version installer will add /opt/edico to the Linux $PATH, so that the user can just call dragen without specifying the full path.
Starting with DRAGEN Software v4.3 and later, multiple compatible versions of the DRAGEN software can be installed at a time. Executing the .run file will add the new version to the system.
After installation, the application files are available at /opt/dragen/{version} and FPGA files are located at /opt/bitstream/{bitstream version}.
The multi-version installer will NOT add /opt/dragen/{version} to the Linux $PATH, since multiple versions can be present at a given time. Users should manage the paths to the specific version they want to run. Command line examples in this guide assume that the Linux $PATH is set to the correct dragen version and refer simply to dragen <options>.
Notes on multi-version installation:
Installers released for DRAGEN v4.2 and earlier are single version packages
Single version packages and multi-version packages cannot be mixed
Installation of a prior single version package will remove all the multi-version packages
Installation of a multi-version package will remove any installed single version package
After installing a multi-version package, see a list of installed versions at any time by running /usr/bin/dragen_versions
To remove any multi-version package, call yum remove on its path
Example:
Install locations by DRAGEN version:
4.3 and later: dragen in /opt/dragen/{version}; resource files in /opt/edico/
4.2 and earlier: dragen in /opt/edico/; resource files in /opt/edico/
Throughout this guide we will refer to <INSTALL_PATH>, which will be either of the locations above.
After turning on the server, you can make sure that your DRAGEN server is functioning properly by running <INSTALL_PATH>/self_test/self_test.sh, which does the following:
Automatically indexes chromosome M from the hg19 reference genome
Loads the reference genome and index
Maps and aligns a set of reads
Saves the aligned reads in a BAM file
Asserts that the alignments exactly match the expected results
Each server ships with the test input FASTQ data for this script, which is located in <INSTALL_PATH>/self_test. The system check takes approximately 25-30 minutes.
The following example shows how to run the script and shows the output from a successful test.
If the output BAM file does not match expected results, then the last line of the above text is as follows:
SELF TEST RESULT : FAIL
If you experience a FAIL result after running this test script immediately after turning on your DRAGEN server, contact Illumina Technical Support.
When you are satisfied that your DRAGEN system is performing as expected, you are ready to run some of your own data through the machine, as follows:
Load the reference table for the reference genome
Determine location of input and output files
Process input data
Before a reference genome can be used with DRAGEN, it must be converted from FASTA format into a custom binary format for use with the DRAGEN hardware. For more information, see Prepare a Reference Genome.
The reference hash table specified on the command line is automatically loaded onto the board the first time you process data with a pipeline. You can manually load the hash table for your reference genome by using the following command:
dragen -r <reference_hash-table_directory>
Make sure that the reference hash table directory is on the fast file IO drive.
The default location for the hash table for hg19 is as follows.
/staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149
The command to load reference genome hg19 from the default location is as follows.
dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149
This command loads the binary reference genome into memory on the DRAGEN board, where it is used for processing any number of input data sets. You do not need to reload the reference genome unless you restart the system or need to switch to a different reference genome. It can take up to a minute to load a reference genome.
DRAGEN checks whether the specified reference genome is already resident on the board. If it is, then the upload of the reference genome is automatically skipped. You can force reloading of the same reference genome using the force-load-reference (-l) command line option.
The command to load the reference genome prints the software and hardware versions to standard output. For example:
After the reference genome has been loaded, the following message is printed to standard output:
The DRAGEN Pipeline is very fast, which requires careful planning for the locations of the input and output files. If the input or output files are on a slow file system, then the overall performance of the system is limited by the throughput of that file system. It is recommended that inputs and outputs are streamed directly from/to a mounted external storage system.
The DRAGEN system is preconfigured with at least one fast file system consisting of a set of fast SSD disks grouped with RAID-0 for performance. This file system is mounted at /staging. This name was chosen to emphasize the fact that this area was built to be large and fast, but is not redundant. Failure of any of the file system's constituent disks leads to the loss of all data stored there.
During processing, DRAGEN generates and reads back temporary files. It is highly recommended to always direct temporary files to the fast SSD (or /staging) by using the --intermediate-results-dir option. If the --intermediate-results-dir option is not provided, temporary files are written to the --output-directory. DRAGEN recommends streaming inputs and outputs using a mounted external storage system.
To analyze FASTQ data, use the dragen command. For example, the following command can be used to analyze a single-ended FASTQ file:
For detailed information on the command line options, see DRAGEN Host Software.
For recommended command lines in typical use cases, see DRAGEN Recipes.
The DRAGEN secondary analysis software utilizes a highly reconfigurable Field Programmable Gate Array (FPGA) card and is available on a preconfigured DRAGEN server that can be seamlessly integrated into bioinformatics workflows. The platform can be loaded with highly optimized algorithms for many different NGS secondary analysis pipelines, including the following:
Whole genome
Exome
RNA-Seq
Methylome
Cancer
All user interaction is accomplished via DRAGEN software that runs on the host server and manages all communication with the FPGA card. This user guide summarizes the technical aspects of the system and provides detailed information for all DRAGEN command line options. If you are working with DRAGEN for the first time, Illumina recommends that you first read the Getting Started section, which provides a short introduction to DRAGEN, including running a test of the server, generating a reference genome, and running example commands.
DRAGEN DNA Pipeline
The DRAGEN DNA Pipeline massively accelerates the secondary analysis of NGS data. For example, the time taken to process an entire human genome at 30x coverage is reduced from approximately 10 hours (using the current industry standard, BWA-MEM+GATK-HC software) to approximately 20 minutes. Time scales linearly with coverage depth.
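As a rough illustration of the linear scaling described above, the following Python sketch extrapolates run time from the quoted figure of approximately 20 minutes at 30x coverage. The function name is hypothetical, and the sketch assumes no fixed overhead (a line through the origin), which the guide does not state.

```python
def estimate_runtime_minutes(coverage, ref_coverage=30.0, ref_minutes=20.0):
    """Linearly extrapolate run time from one reference point.

    Assumption: ~20 minutes at 30x (figure quoted in the text) and no
    fixed start-up cost. Real run times depend on hardware and callers.
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return ref_minutes * coverage / ref_coverage


if __name__ == "__main__":
    for depth in (15, 30, 60):
        print(f"{depth}x -> ~{estimate_runtime_minutes(depth):.0f} min")
```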
These pipelines harness the tremendous power of the DRAGEN server and include highly optimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling. They also use platform features such as hardware-accelerated compression and optimized BCL conversion, together with the full set of platform tools.
Unlike all other secondary analysis methods, DRAGEN DNA Applications do not reduce accuracy to achieve speed improvements. Accuracy for both SNPs and INDELs is improved over that of BWA-MEM+GATK-HC in side-by-side comparisons.
In addition to haplotype variant calling, the pipeline supports calling of copy number and structural variants as well as detection of repeat expansions.
DRAGEN secondary analysis includes an RNA-seq (splicing-aware) aligner, as well as RNA-specific analysis components for gene expression quantification and gene fusion detection.
The DRAGEN RNA Pipeline shares many components with the DNA Pipeline. Mapping of short seed sequences from RNA-Seq reads is performed similarly to mapping DNA reads. In addition, splice junctions (the joining of noncontiguous exons in RNA transcripts) near the mapped seeds are detected and incorporated into the full read alignments.
DRAGEN secondary analysis uses hardware accelerated algorithms to map and align RNA-Seq-based reads faster and more accurately than popular software tools. For instance, it can align 100 million paired-end RNA-Seq-based reads in about three minutes. With simulated benchmark RNA-Seq data sets, its splice junction sensitivity and specificity are unsurpassed.
The DRAGEN Methylation Pipeline provides support for automating the processing of bisulfite sequencing data to generate a BAM with the tags required for methylation analysis and reports detailing the locations with methylated cytosines.
The seed-density option controls how many (normally overlapping) primary seeds from each read the mapper looks up in its hash table for exact matches. The maximum density value of 1.0 generates a seed starting at every position in the read, ie, (L-K+1) K-base seeds from an L-base read.
Seed density must be between 0.0 and 1.0. Internally, an available seed pattern equal or close to the requested density is selected. The sparsest pattern is one seed per 32 positions, or density 0.03125.
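The seed counts implied above can be worked through with a short Python sketch. The function names are hypothetical, the 150-base read is an example value, and the way a requested density is rounded to a seed count is an assumption: DRAGEN selects from its own internal set of seed patterns, which is not listed here.

```python
SPARSEST_DENSITY = 1 / 32  # one seed per 32 positions, as stated in the text


def seeds_at_full_density(read_len, seed_len):
    """Seeds generated at density 1.0: one K-base seed per start position (L-K+1)."""
    return max(read_len - seed_len + 1, 0)


def approx_seeds_looked_up(read_len, seed_len, seed_density):
    """Approximate seed lookups per read for a requested density.

    Assumption: the count is the density times the full-density count,
    with the density floored at the sparsest available pattern.
    """
    if not 0.0 <= seed_density <= 1.0:
        raise ValueError("seed-density must be between 0.0 and 1.0")
    effective = max(seed_density, SPARSEST_DENSITY)
    return round(effective * seeds_at_full_density(read_len, seed_len))


if __name__ == "__main__":
    print(seeds_at_full_density(150, 21))        # 130
    print(approx_seeds_looked_up(150, 21, 0.5))  # 65
    print(SPARSEST_DENSITY)                      # 0.03125
```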
Accuracy Considerations: Generally, denser seed lookup patterns improve mapping accuracy. However, for modestly long reads (eg, 50 bp+) and low sequencer error rates, there is little to be gained beyond the default 50% seed lookup density.
Speed Considerations: Denser seed lookup patterns generally slow down mapping, and sparser seed patterns speed it up. However, when the seed mapping stage can run faster than the aligning stage, a sparser seed pattern does not make the mapper much faster.
Relationship to Reference Seed Interval
Functionally, a denser or sparser seed lookup pattern has an impact very similar to a shorter or longer reference seed interval (build hash table option --ht-ref-seed-interval). Populating 100% of reference seed positions and looking up 50% of read seed positions has the same effect as populating 50% of reference seed positions and looking up 100% of read seed positions. Either way, the expected density of seed hits is 50%.
More generally, the expected density of seed hits is the product of the reference seed density (the inverse of the reference seed interval) and the seed lookup density. For example, if 50% of reference seeds are populated and 33.3% (1/3) of read seed positions are looked up, then the expected seed hit density should be 16.7% (1/6).
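The product rule above can be checked with exact fractions. This is a minimal Python sketch; the function name is hypothetical and the inputs mirror the examples in the text.

```python
from fractions import Fraction


def expected_seed_hit_density(ref_seed_interval, seed_density):
    """Expected hit density = (1 / reference seed interval) * seed lookup density."""
    if ref_seed_interval < 1:
        raise ValueError("reference seed interval must be at least 1")
    return Fraction(1, ref_seed_interval) * Fraction(seed_density)


if __name__ == "__main__":
    # 50% of reference seeds populated (interval 2), 1/3 of read positions looked up.
    print(expected_seed_hit_density(2, Fraction(1, 3)))  # 1/6
    # The two equivalent 50% configurations described in the text.
    print(expected_seed_hit_density(1, Fraction(1, 2)))  # 1/2
    print(expected_seed_hit_density(2, Fraction(1, 1)))  # 1/2
```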
DRAGEN automatically adjusts its precise seed lookup pattern to ensure it does not systematically miss the seed positions populated from the reference. For example, the mapper does not look up seeds matching only odd positions in the reference when only even positions are populated in the hash table, even if the reference seed interval is 2 and seed-density is 0.5.
The --Mapper.map-orientations option is used in mapping reads for bisulfite methylation analysis. It is set automatically based on the value set for --methylation-protocol.
The --Mapper.map-orientations option can restrict the orientation of read mapping to only forward in the reference genome, or only reverse-complemented. The valid values for --Mapper.map-orientations are as follows.
0: Either orientation (default)
1: Only forward mapping
2: Only reverse-complemented mapping
If mapping orientations are restricted and paired end reads are used, the expected pair orientation can only be FR, not FF or RF.
Although DRAGEN primarily maps reads by finding exact reference matches to short seeds, it can also map seeds differing from the reference by one nucleotide by also looking up single-SNP edited seeds. Seed editing is usually not necessary with longer reads (100 bp+), because longer reads have a high probability of containing at least one exact seed match. This is especially true when paired ends are used, because a seed match from either mate can successfully align the pair. But seed editing can, for example, be useful to increase mapping accuracy for short single-ended reads, with some cost in increased mapping time. The following options control seed editing:
Seed Editing Options
--Mapper.seed-density
seed-density
--Mapper.edit-mode
edit-mode
--Mapper.edit-seed-num
edit-seed-num
--Mapper.edit-read-len
edit-read-len
--Mapper.edit-chain-limit
edit-chain-limit
edit-mode and edit-chain-limit
The edit-mode and edit-chain-limit options control when seed editing is used. The following four edit-mode values are available:
0: No editing (default)
1: Chain length test
2: Paired chain length test
3: Full seed editing
Edit mode 0 requires all seeds to match exactly. Mode 3 is the most expensive because every seed that fails to match the reference exactly is edited. Modes 1 and 2 employ heuristics to look up edited seeds only for reads most likely to be salvaged to accurate mapping.
The main heuristic in edit modes 1 and 2 is a seed chain length test. Exact seeds are mapped to the reference in a first pass over a given read, and the matching seeds are grouped into chains of similarly aligning seeds. If the longest seed chain (in the read) exceeds a threshold edit-chain-limit, the read is judged not to require seed editing, because there is already a promising mapping position.
Edit mode 1 triggers seed editing for a given read using the seed chain length test. If no seed chain exceeds edit-chain-limit (including if no exact seeds match), then a second seed mapping pass is attempted using edited seeds. Edit mode 2 further optimizes the heuristic for paired-end reads. If either mate has an exact seed chain longer than edit-chain-limit, then seed editing is disabled for the pair, because a rescue scan is likely to recover the mate alignment based on seed matches from one read. Edit mode 2 is the same as mode 1 for single-ended reads.
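The decision logic for the four edit modes can be summarized in a short Python sketch. It is an illustration of the rules described above, not DRAGEN's implementation: the function name is hypothetical, and chain length is treated as a plain number (0 when no exact seed matched), because the unit in which DRAGEN measures it is not stated here.

```python
def needs_seed_editing(edit_mode, edit_chain_limit, longest_chain,
                       mate_longest_chain=None):
    """Return True if edited seeds would be looked up for this read.

    longest_chain: longest exact-seed chain found in the read (0 if none).
    mate_longest_chain: same for the mate, or None for single-ended reads.
    """
    if edit_mode == 0:   # no editing: all seeds must match exactly
        return False
    if edit_mode == 3:   # full editing: no per-read test is applied
        return True
    if edit_mode not in (1, 2):
        raise ValueError("edit-mode must be 0, 1, 2, or 3")
    # Modes 1 and 2: a long enough exact chain means a promising position exists.
    if longest_chain > edit_chain_limit:
        return False
    # Mode 2 only: a long chain in the mate also disables editing for the pair.
    if (edit_mode == 2 and mate_longest_chain is not None
            and mate_longest_chain > edit_chain_limit):
        return False
    return True


if __name__ == "__main__":
    print(needs_seed_editing(1, 20, 0))      # True: no exact seeds matched
    print(needs_seed_editing(1, 20, 25))     # False: chain exceeds the limit
    print(needs_seed_editing(2, 20, 5, 30))  # False: mate chain exceeds the limit
```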
edit-seed-num and edit-read-len
For edit modes 1 and 2, when the heuristic triggers seed editing, these options control how many seed positions are edited in the second pass over the read. Although exact seed mapping can use a densely overlapping seed pattern, such as seeds starting at 50% or 100% of read positions, most of the value of seed editing can be obtained by editing a much sparser pattern of seeds, even a nonoverlapping pattern. Generally, if a user application can afford to spend some additional amount of mapping time on seed editing, a greater increase in mapping accuracy can be obtained for the same time cost by editing seeds in sparse patterns for a large number of reads, than by editing seeds in dense patterns for a small number of reads.
Whenever seed editing is triggered, these two options request edit-seed-num seed editing positions, distributed evenly over the first edit-read-len bases of the read. For example, with 21-base seeds, edit-seed-num=6 and edit-read-len=100, edited seeds can begin at offsets {0, 16, 32, 48, 64, 80} from the 5' end, consecutive seeds overlapping by 5 bases. Because sequencing technologies often yield better base qualities nearer the (5') beginning of each read, this can focus seed editing where it is most likely to succeed. When a particular read is shorter than edit-read-len, fewer seeds are edited.
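The offsets in the example above can be reproduced with a few lines of Python. DRAGEN's exact placement rule is not stated here; the integer-division spacing below is an assumption that happens to reproduce the documented example, and the function names are hypothetical.

```python
def edit_seed_offsets(edit_seed_num, edit_read_len):
    """Evenly spaced seed start offsets over the first edit_read_len bases.

    Assumption: spacing is edit_read_len // edit_seed_num.
    """
    if edit_seed_num < 1:
        raise ValueError("edit-seed-num must be at least 1")
    step = edit_read_len // edit_seed_num
    return [i * step for i in range(edit_seed_num)]


def consecutive_seed_overlap(seed_len, offsets):
    """Bases shared by consecutive seeds (0 if they do not overlap)."""
    if len(offsets) < 2:
        return 0
    return max(seed_len - (offsets[1] - offsets[0]), 0)


if __name__ == "__main__":
    offsets = edit_seed_offsets(6, 100)
    print(offsets)                                # [0, 16, 32, 48, 64, 80]
    print(consecutive_seed_overlap(21, offsets))  # 5
```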
Seed editing is more expensive when the reference seed interval (build hash table option --ht-ref-seed-interval) is greater than 1. For edit modes 1 and 2, additional seed editing positions are automatically generated to avoid missing the populated reference seed positions. For edit mode 3, the time cost can increase dramatically because query seeds matching unpopulated reference positions typically miss and trigger editing.
The first stage of mapping is to generate seeds from the read and look for exact matches in the reference genome. These results are then refined by running full Smith-Waterman alignments on the locations with the highest density of seed matches. This well-documented algorithm works by comparing each position of the read against all the candidate positions of the reference. These comparisons correspond to a matrix of potential alignments between read and reference. For each of these candidate alignment positions, Smith-Waterman generates scores that are used to evaluate whether the best alignment passing through that matrix cell reaches it by a nucleotide match or mismatch (diagonal movement), a deletion (horizontal movement), or an insertion (vertical movement). A match between read and reference adds a bonus to the score, and a mismatch or indel imposes a penalty. The overall highest scoring path through the matrix is the alignment chosen.
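The matrix recurrence described above can be sketched in Python as a textbook Smith-Waterman local alignment. This is a simplified illustration, not DRAGEN's hardware implementation: it uses a single per-base gap penalty instead of separate open and extend penalties, returns only the best score, and the default scoring values are hypothetical.

```python
def smith_waterman_score(read, ref, match_score=1, mismatch_pen=4, gap_pen=6):
    """Best local alignment score between read (rows) and ref (columns)."""
    cols = len(ref) + 1
    prev = [0] * cols  # scores for the previous read position
    best = 0
    for i in range(1, len(read) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            same = read[i - 1] == ref[j - 1]
            # Diagonal movement: match (bonus) or mismatch (penalty).
            diag = prev[j - 1] + (match_score if same else -mismatch_pen)
            # Vertical movement: read base with no reference base (insertion).
            up = prev[j] - gap_pen
            # Horizontal movement: reference base with no read base (deletion).
            left = cur[j - 1] - gap_pen
            # Local alignment: scores never drop below zero.
            cur[j] = max(0, diag, up, left)
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "TTACGTTT"))  # 4
    print(smith_waterman_score("AAAA", "CCCC"))      # 0
```

Keeping only two rows of the matrix is enough when just the score is needed; recovering the alignment itself (the CIGAR) would require storing the full matrix and tracing back from the best cell.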
The specific values chosen for scores in this algorithm indicate how to balance, for an alignment with multiple possible interpretations, the possibility of an indel as opposed to one or more SNPs, or the preference for an alignment without clipping. The default DRAGEN scoring values are reasonable for aligning moderate length reads to a whole human reference genome for variant calling applications. But any set of Smith-Waterman scoring parameters represents an imprecise model of genomic mutation and sequencing errors, and differently tuned alignment scoring values can be more appropriate for some applications.
The following alignment options control Smith-Waterman Alignment:
--Aligner.global
global
--Aligner.match-score
match-score
--Aligner.match-n-score
match-n-score
--Aligner.mismatch-pen
mismatch-pen
--Aligner.gap-open-pen
gap-open-pen
--Aligner.gap-ext-pen
gap-ext-pen
--Aligner.unclip-score
unclip-score
--Aligner.no-unclip-score
no-unclip-score
--Aligner.aln-min-score
aln-min-score
--Aligner.min-score-coeff
min-score-coeff
global
The global option (value can be 0 or 1) controls whether alignment is forced to be end-to-end in the read. When set to 1, alignments are always end-to-end, as in the Needleman-Wunsch global alignment algorithm (although not end-to-end in the reference), and alignment scores can be positive or negative. When set to 0, alignments can be clipped at either or both ends of the read, as in the Smith-Waterman local alignment algorithm, and alignment scores are nonnegative. Generally, global=0 is preferred for longer reads, so significant read segments after a break of some kind (large indel, structural variant, chimeric read, and so forth) can be clipped without severely decreasing the alignment score. Setting global=1 might not have the desired effect with longer reads because insertions at or near the ends of a read can function as pseudoclipping. Also, with global=0, multiple (chimeric) alignments can be reported when various portions of a read match widely separated reference positions. Using global=1 is sometimes preferable with short reads, which are unlikely to overlap structural breaks, are unable to support chimeric alignments, and are suspected of incorrect mapping if they cannot align well end-to-end. Consider using the unclip-score option, or increasing it, instead of setting global=1, to make a soft preference for unclipped alignments.
match-score
The match-score option specifies the score for a read nucleotide matching a reference nucleotide (A, C, G, or T), or matching a reference 2–3 nucleotide IUPAC-IUB code. Its value is an unsigned integer, from 0 to 15. match-score=0 can only be used when global=1. A higher match score results in longer alignments, and fewer long insertions.
match-n-score
The match-n-score option specifies the score for an aligned position where the read position and/or the reference position is an N code. This option is a signed integer, from -16 to 15.
mismatch-pen
The mismatch-pen option is the penalty (negative score) for a read nucleotide mismatching any reference nucleotide or IUPAC-IUB code, except N. This option is an unsigned integer, from 0 to 63. A higher mismatch penalty results in alignments with more insertions, deletions, and clipping to avoid SNPs.
gap-open-pen
The gap-open-pen option is the penalty (negative score) for opening a gap (ie, an insertion or deletion). This value is only for a 0-base gap. It is always added to the gap length times gap-ext-pen. This option is an unsigned integer, from 0 to 127. A higher gap open penalty causes fewer insertions and deletions of any length in alignment CIGARs, with clipping or alignment through SNPs used instead.
gap-ext-pen The gap-ext-pen option is the penalty (negative score) for extending a gap (ie, an insertion or deletion) by one base. This option is an unsigned integer, from 0 to 15. A higher gap extension penalty causes fewer long insertions and deletions in alignment CIGARs, with short indels, clipping, or alignment through SNPs used instead.
unclip-score The unclip-score option is the score bonus for an alignment reaching the beginning or end of the read. An end-to-end alignment receives twice this bonus. This option is an unsigned integer, from 0 to 127. A higher unclipped bonus causes alignment to reach the beginning and/or end of a read more often, where this can be done without too many SNPs or indels. A nonzero unclip-score is useful when global=0 to make a soft preference for unclipped alignments. Unclipped bonuses have little effect on alignments when global=1, because end-to-end alignments are forced anyway (although 2 × unclip-score does add to every alignment score unless no-unclip-score=1). Note that, especially with longer reads, setting unclip-score much higher than gap-open-pen can have the undesirable effect of insertions at or near one end of a read being used as pseudoclipping, as happens with global=1.
no-unclip-score The no-unclip-score option can be 0 or 1. The default is 1. When no-unclip-score is set to 1, any unclipped bonus (unclip-score) contributing to an alignment is removed from the alignment score before further processing, such as comparison with aln-min-score, comparison with other alignment scores, and reporting in AS or XS tags. However, the unclipped bonus still affects the best-scoring alignment found by Smith-Waterman alignment to a given reference segment, biasing toward unclipped alignments. When unclip-score > 0 causes a Smith-Waterman local alignment to extend out to one or both ends of the read, the alignment score stays the same or increases if no-unclip-score=0, whereas it stays the same or decreases if no-unclip-score=1. The default, no-unclip-score=1, is recommended when global=1, because every alignment is end-to-end, and there is no need to add the same bonus to every alignment. When changing no-unclip-score, consider whether aln-min-score should be adjusted. When no-unclip-score=0, unclipped bonuses are included in alignment scores compared to the aln-min-score floor, so the subset of alignments filtered out by aln-min-score can change significantly with no-unclip-score.
aln-min-score The aln-min-score option specifies a minimum acceptable alignment score. Any alignment results scoring lower are discarded. Increasing or decreasing aln-min-score can reduce or increase the percentage of reads mapped. This option is a signed integer (negative alignment scores are possible with global=0). aln-min-score also affects MAPQ estimates. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores. A read's best alignment score is saved in the AS SAM tag, and the second-best score (if available) is saved in the XS tag. aln-min-score serves as the suboptimal alignment score if nothing higher was found except the best score. Therefore, increasing aln-min-score can decrease reported MAPQ for some low-scoring alignments. You can use the min-score-coeff option to adjust aln-min-score as a function of read length.
min-score-coeff The min-score-coeff option makes adjustments to aln-min-score per read base. When using the min-score-coeff and aln-min-score options together, you can define the minimum alignment score for each read as an affine function of read length. The minimum score for an N-base read is calculated as follows: (min-score-coeff) * N + (aln-min-score)
The min-score-coeff option is a signed value ranging from -64 to 63.999 (it does not have to be a whole number). If the value is 0, then the minimum alignment score is fixed at aln-min-score for all read lengths. You can use positive values for min-score-coeff to allow shorter reads to match with lower alignment scores, but require longer reads to achieve higher scores.
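The arithmetic behind two of these scoring options can be sketched in a few lines of Python. This only restates the formulas given above and is not DRAGEN's implementation; the function names and all numeric values are hypothetical, chosen for illustration rather than taken from DRAGEN defaults.

```python
def gap_penalty(gap_len, gap_open_pen, gap_ext_pen):
    """Penalty for one gap: gap-open-pen plus gap length times gap-ext-pen."""
    return gap_open_pen + gap_len * gap_ext_pen


def min_alignment_score(read_len, aln_min_score, min_score_coeff=0.0):
    """Score floor for an N-base read: (min-score-coeff) * N + (aln-min-score)."""
    return min_score_coeff * read_len + aln_min_score


# Hypothetical settings, for illustration only.
print(gap_penalty(3, gap_open_pen=6, gap_ext_pen=1))
print(min_alignment_score(150, aln_min_score=22))
print(min_alignment_score(150, aln_min_score=22, min_score_coeff=0.5))
```

With a coefficient of 0 the floor is the same for every read length; a positive coefficient raises the floor for longer reads.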
DRAGEN can process paired-end data passed via a pair of FASTQ files or in a single interleaved FASTQ file. The hardware maps the two ends separately, and then determines a set of alignments that seem most likely to form a pair in the expected orientation and having roughly the expected insert size. The alignments for the two ends are evaluated for the quality of their pairing, with larger penalties for insert sizes far from the expected size. The following options control processing of paired-end data:
Reorientation The pe-orientation option specifies the expected paired-end orientation. Only pairs with this orientation can be flagged as proper pairs. Valid values are as follows:
0--FR (default)
1--RF
2--FF
unpaired-pen For paired-end reads, best mapping positions are determined jointly for each pair, according to the largest pair score found, considering the various combinations of alignments for each mate. A pair score is the sum of the two alignment scores minus a pairing penalty, which estimates the unlikelihood of insert lengths further from the mean insert than this aligned pair. The unpaired-pen option specifies how much alignment pair scores should be penalized when the two alignments are not in properly paired position or orientation. This option also serves as the maximum pairing penalty for properly paired alignments with extreme insert lengths. The unpaired-pen option is specified in Phred scale, according to its potential impact on MAPQ. Internally, it is scaled into alignment score space based on Smith-Waterman scoring parameters.
pe-max-penalty The pe-max-penalty option limits how much the estimated MAPQ for one read can increase because its mate aligned nearby. A paired alignment is never assigned MAPQ higher than the MAPQ that it would have received mapping single-ended, plus this value. By default, pe-max-penalty = mapq-max = 255, effectively disabling this limit. The key difference between unpaired-pen and pe-max-penalty is that unpaired-pen affects calculated pair scores, and thus which alignments are selected, whereas pe-max-penalty affects only the reported MAPQ for paired alignments.
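The difference between the two options can be illustrated with a small Python sketch. It follows the descriptions above under simplifying assumptions: all values are already in the same units (DRAGEN scales unpaired-pen from Phred scale internally, which is not modeled here), and the function names and numbers are hypothetical.

```python
def pair_score(score1, score2, pairing_penalty, unpaired_pen, properly_paired=True):
    """Sum of the two alignment scores minus a pairing penalty.

    Improper pairs receive the full unpaired_pen; for proper pairs,
    unpaired_pen acts as the maximum pairing penalty.
    """
    if properly_paired:
        penalty = min(pairing_penalty, unpaired_pen)
    else:
        penalty = unpaired_pen
    return score1 + score2 - penalty


def capped_paired_mapq(paired_mapq, single_end_mapq, pe_max_penalty=255, mapq_max=255):
    """Paired MAPQ never exceeds the single-ended MAPQ plus pe-max-penalty."""
    return min(paired_mapq, single_end_mapq + pe_max_penalty, mapq_max)


# Hypothetical values, for illustration only.
print(pair_score(100, 100, pairing_penalty=5, unpaired_pen=20))
print(pair_score(100, 100, pairing_penalty=0, unpaired_pen=20, properly_paired=False))
print(capped_paired_mapq(60, single_end_mapq=10, pe_max_penalty=20, mapq_max=60))
```

The first function changes which candidate pair wins; the second only changes the MAPQ that is reported for the winner.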
When working with paired-end data, DRAGEN must choose among the highest-quality alignments for the two ends to try to choose likely pairs. To make this choice, DRAGEN uses a skew normal insert model to evaluate the likelihood that a pair of alignments constitutes a pair. This model is based on the observation that common library preparation methods have insert-size distributions that are sometimes close to normal, but also sometimes clearly asymmetric, often skewing toward longer insert sizes. The skew normal insert model is used only for the DNA mode.
If you know the statistics of your library prep for an input file (and the file consists of a single read group), you can specify the characteristics of the insert-length distribution: mean, standard deviation, shape (or skewness), and three quartiles. These characteristics can be specified with the Aligner.pe-stat-mean-insert, Aligner.pe-stat-stddev-insert, Aligner.pe-stat-shape-insert, Aligner.pe-stat-quartiles-insert, and Aligner.pe-stat-mean-read-len options. However, it is typically preferable to allow DRAGEN to detect these characteristics automatically.
DRAGEN automatically samples the insert-length distribution. When the software starts execution, it runs a sample of up to 2,000,000 pairs through the aligner, calculates the distribution, and then uses the resulting statistics for evaluating all pairs in the input set.
The DRAGEN host software reports the statistics in its stdout log in a report, as follows:
Note that the Mean, Standard deviation, and Quartiles reported above are the sample mean, standard deviation, and quartiles calculated from the initial sample of up to 2,000,000 pairs, assuming a normal distribution. The sample mean and standard deviation are used to fit the parameters of a skew-normal distribution. A skew-normal distribution is defined by starting with an underlying normal distribution (whose mean we call position or xi and whose standard deviation we call scale or omega) and folding a varying portion of the probability mass from one side of the mean (e.g., the left side) to the other (e.g., the right side). The portion folded varies smoothly, from 0% at the original mean, approaching 100% from the left tail to the right tail. A shape parameter, which we call alpha, controls how rapidly the folded fraction increases, and at alpha=0 there is no folding and the distribution remains normal.
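For readers who want to experiment with the shape described above, the following Python sketch evaluates the textbook skew-normal density with position xi, scale omega, and shape alpha, and converts those parameters to a mean and standard deviation. It assumes that DRAGEN's parameters follow the standard parameterization, which this guide does not state explicitly; the function names and example values are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def skew_normal_pdf(x, xi, omega, alpha):
    """Textbook skew-normal density: (2/omega) * phi(z) * Phi(alpha * z)."""
    z = (x - xi) / omega
    return 2.0 / omega * normal_pdf(z) * normal_cdf(alpha * z)


def skew_normal_mean_std(xi, omega, alpha):
    """Mean and standard deviation implied by (xi, omega, alpha)."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    mean = xi + omega * delta * math.sqrt(2.0 / math.pi)
    std = omega * math.sqrt(1.0 - 2.0 * delta * delta / math.pi)
    return mean, std


# Hypothetical insert-size parameters, for illustration only.
print(skew_normal_mean_std(xi=300.0, omega=50.0, alpha=0.0))
print(skew_normal_mean_std(xi=300.0, omega=50.0, alpha=2.0))
```

At alpha=0 the factor 2 * Phi(0) equals 1, so the density reduces to the underlying normal distribution, matching the "no folding" case. A positive alpha moves mass to the right of xi, so the mean exceeds xi.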
In the standard output, we also include the command line options needed to reproduce the DRAGEN run with the same insert stat settings. Note that when specifying stats on the command line, the skew-normal xi value should be used for Aligner.pe-stat-mean-insert, the omega value should be used for Aligner.pe-stat-stddev-insert, and the alpha value should be used for Aligner.pe-stat-shape-insert. If Aligner.pe-stat-shape-insert is not specified on the command line, a default value of 0 is assumed.
The insert length distribution for each sample is written to fragment_length_hist.csv. Each sample starts with the following lines
These lines are followed by the histogram for the first ~2M read pairs for DNA (~100K read pairs for RNA). The histogram counts are aggregated across all read groups sharing the same sample ID (RGSM field).
When the number of sample pairs is very small, there is not enough information to characterize the distribution with high confidence. In this case, DRAGEN applies default statistics that specify a very wide insert distribution, which tends to admit pairs of alignments as proper pairs, even if they may lie tens of thousands of bases apart. In this situation, DRAGEN outputs a message, as follows:
The small samples formula calculates standard deviation as follows:
The default model is "standard deviation = 10000". If the first 2M reads are unmapped or if all pairs are improper pairs, then the standard deviation is set to 10000 and the mean and quartiles are set to 0. Note that the minimum value for standard deviation is 12, which is independent of the number of samples. Also, in DNA mode, when there are fewer than 1000 high-quality alignments, DRAGEN reverts to the normal-distribution-based insert model, because there are too few samples to accurately estimate the parameters of the skew-normal distribution.
For RNA-Seq data, the insert size distribution is not normal due to pairs containing introns. The DRAGEN software estimates the distribution using a kernel density estimator to fit a long tail to the samples. This estimate leads to a more accurate mean and standard deviation for RNA-Seq data and proper pairing.
DRAGEN writes detected paired-end stats into a tab-delimited log file in the output directory called .insert-stats.tab. This file contains the statistical distribution of detected insert sizes for each read group, including quartiles, mean, standard deviation, shape, minimum, and maximum. The information matches the standard-out report above. Additionally, the log file includes the minimum and maximum insert limits that DRAGEN applied for rescue scans. Note that the reported mean and standard deviation in this tab-delimited log file are the xi and omega parameters of the skew-normal distribution.
For paired-end reads, where a seed hit is found for one mate but not the other, rescue scans hunt for missing mate alignments within a rescue radius of the mean insert length. Normally, the DRAGEN host software sets the rescue radius to 2.5 standard deviations of the empirical insert distribution. But in cases where the insert standard deviation is large compared to the read length, the rescue radius is restricted to limit mapping slowdowns. In this case, a warning message is displayed, as follows:
Although the user can ignore this warning, or specify an intermediate rescue radius to maintain mapping speed, it is recommended to use 2.5 sigmas for the rescue radius to maintain mapping sensitivity. To disable rescue scanning, set max-rescues to 0.
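As a rough illustration of the default rescue radius, the sketch below computes the range of insert lengths covered by a scan of 2.5 standard deviations around the mean. The function name, the clamping of the lower bound at zero, and the example numbers are assumptions made for this example; the restriction applied for very wide distributions is not modeled.

```python
def rescue_window(mean_insert, stddev_insert, sigmas=2.5):
    """Insert-length range covered by a rescue scan around the mean insert."""
    radius = sigmas * stddev_insert
    # Clamping at zero is an assumption of this sketch, not documented behavior.
    return max(0.0, mean_insert - radius), mean_insert + radius


# Hypothetical library statistics, for illustration only.
print(rescue_window(mean_insert=400.0, stddev_insert=80.0))
```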
DRAGEN can track multiple independent alignments for each read. These alignments include the optimal (primary) one, those mapping different subsegments of the read (chimeric/supplementary), and suboptimal (secondary) mappings of the read to different areas of the reference.
For DNA alignment by default, DRAGEN can emit one primary alignment for each read, up to three chimeric alignments (Aligner.supp-aligns=3), and no secondary alignments (Aligner.sec-aligns=0). The maximum user-specified value for supp-aligns or sec-aligns is 4095.
You can use the following configuration options to control how many of each type of alignment to include in DRAGEN output.
mapq-max The mapq-max option specifies a ceiling on the estimated MAPQ that can be reported for any alignment, from 0 to 255. If the calculated MAPQ is higher, this value is reported instead. The default is 60.
supp-aligns, sec-aligns The supp-aligns and sec-aligns options restrict the maximum number of supplementary (ie, chimeric and SAM FLAG 0x800) alignments and secondary (ie, suboptimal and SAM FLAG 0x100) alignments, respectively, that can be reported for each read. A maximum of 4095 supplementary alignments and 4095 secondary alignments can be reported for any read, in addition to a primary alignment. High settings for these two options impact speed, so it is advisable to increase them only as needed.
sec-phred-delta The sec-phred-delta option controls which secondary alignments are emitted based on the alignment score relative to the primary reported alignment. Only secondary alignments with likelihood within this Phred value of the primary are reported.
sec-aligns-hard The sec-aligns-hard option suppresses the output of all secondary alignments if there are more secondary alignments than can be emitted. Set sec-aligns-hard to 1 to force the read to be unmapped when not all secondary alignments can be output.
supp-as-sec When the supp-as-sec option is set to 1, supplementary (chimeric) alignments are reported with SAM FLAG 0x100 instead of 0x800. The default is 0. The supp-as-sec option provides compatibility with tools that do not support FLAG 0x800.
hard-clips The hard-clips option is used as a field of 3 bits, with values ranging from 0 to 7. The bits specify alignments, as follows:
Bit 0--primary alignments
Bit 1--supplementary alignments
Bit 2--secondary alignments
Each bit determines whether local alignments of that type are reported with hard clipping (1) or soft clipping (0). The default is 6, meaning primary alignments use soft clipping and supplementary and secondary alignments use hard clipping.
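The bit field can be decoded mechanically. The following Python sketch maps a hard-clips value to the clipping style of each alignment type, using the bit assignments listed above; the function and constant names are hypothetical.

```python
# Bit positions as listed above: bit 0 primary, bit 1 supplementary, bit 2 secondary.
HARD_CLIP_BITS = {"primary": 0, "supplementary": 1, "secondary": 2}


def decode_hard_clips(value):
    """Return 'hard' or 'soft' for each alignment type given a hard-clips value."""
    if not 0 <= value <= 7:
        raise ValueError("hard-clips must be between 0 and 7")
    return {
        kind: "hard" if (value >> bit) & 1 else "soft"
        for kind, bit in HARD_CLIP_BITS.items()
    }


# The default value 6 is binary 110: bits 1 and 2 set, bit 0 clear.
print(decode_hard_clips(6))
```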
The GRCh38 human reference contains many more alternate haplotypes (ALT contigs) than previous versions of the reference. Generally, including ALT contigs in the mapping reference improves mapping and variant calling specificity, because misalignments are eliminated for reads matching an ALT contig but scoring poorly against the primary assembly. However, mapping with GRCh38's ALT contigs without special treatment can substantially degrade variant calling sensitivity in corresponding regions, because many reads align equally well to an ALT contig and to the corresponding position in the primary assembly.
The recommended and default approach for dealing with ALT contigs in DRAGEN is masking regions of ALT contigs that are highly similar to their corresponding primary contig. This approach is more accurate than liftover-based ALT awareness because there are many places where the "correct" or most useful liftover between a long ALT haplotype and the primary assembly is ambiguous. Incorrect liftover can produce dense clusters of mismapped reads and false variant calls. The base masking approach has the benefits of using ALT contigs without the negative consequences.
Masked hash tables are built from a standard hg19 or hg38 FASTA that contains ALT contigs. The hash table builder automatically masks regions of the ALT contigs with Ns.
With liftover based ALT-awareness, the mapper and aligner are aware of the liftover relationship between ALT contig positions and corresponding primary assembly positions. Seed matches within ALT contigs are used to obtain corresponding primary assembly alignments, even if the latter score poorly. Liftover groups are formed, each containing a primary assembly alignment candidate, and zero or more ALT alignment candidates that lift to the same location. Each liftover group is scored according to its best-matching alignments, taking properly paired alignments into account. The winning liftover group provides its primary assembly representative as the primary output alignment, with MAPQ calculated based on the score difference to the second-best liftover group. Emitting primary alignments within the primary assembly maintains normal aligned coverage and facilitates variant calling there. If the --Aligner.en-alt-hap-aln option is set to 1 and --Aligner.supp-aligns is greater than 0, then corresponding alternate haplotype alignments can also be output, flagged as supplementary alignments.
DRAGEN requires ALT-Aware hash tables for any hg19 or GRCh38 reference where ALT contigs are detected. To disable this requirement in DRAGEN, set the --ht-alt-aware-validate option to false.
The following is a comparison of alternative options for dealing with alternate haplotypes.
Mapping without ALT contigs in the reference:
False-positive variant calls result when reads matching an alternate haplotype misalign somewhere else.
Poor mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.
Mapping with ALT contigs but no ALT awareness:
False-positive variant calls from misaligned reads matching ALT contigs are eliminated.
Low or zero aligned coverage in primary assembly regions covered by alternate haplotypes, due to some reads mapping to ALT contigs.
Low or zero MAPQ in regions covered by alternate haplotypes, where they are similar or identical to the primary assembly.
Variant calling sensitivity is dramatically reduced throughout regions covered by alternate haplotypes.
Mapping with ALT contigs and ALT awareness:
False-positive variant calls from misaligned reads matching ALT contigs are eliminated.
Normal aligned coverage in regions covered by alternate haplotypes because primary alignments are to the primary assembly.
Normal MAPQs are assigned because alignment candidates in alternative haplotypes are not considered in competition.
Good mapping and variant calling sensitivity where reads matching an ALT contig differ greatly from the primary assembly.
The Multigenome Mapper in DRAGEN significantly improves the accuracy of mapping Illumina reads, particularly in challenging regions such as segmental duplications and other difficult-to-map regions. This method uses population haplotypes from pangenome references to incorporate additional variant information, constructing alternative haplotype paths that improve read mapping. By offering these alternate paths, the Multigenome Mapper enables reads containing population-specific variants to align directly to their most likely genomic locations, reducing mapping ambiguity. The improved mapping also improves variant calling accuracy.
When given a set of population variants (VCF) or haplotypes, the modifications made to the pangenome reference fall into the following types:
Alternate contigs represent population haplotypes. Alt-contigs can have a single variant or a combination of nearby phased variants.
Ambiguous codes (IUPAC codes) to represent SNPs. To improve alignment, the reference FASTA is edited with isolated population SNPs.
Haplotype database. An additional haplotype database is built and used to augment the reference FASTA with population variants. A multigenome based mapper algorithm is used to score read alignment according to the variants in this database.
The DRAGEN pangenome hashtables are available to download from the DRAGEN Software Support Site page.
You use the DRAGEN host software program dragen to build and load reference genomes, and then to analyze sequencing data by decompressing the data, mapping, aligning, sorting, duplicate marking with optional removal, and variant calling.
Invoke the software using the dragen command. The command line options are described in the following sections.
Command line options can also be set in a configuration file. For more information on configuration files, see Configuration Files. If an option is set in the configuration file and is also specified on the command line, the command line option overrides the configuration file.
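The precedence rule can be pictured as a dictionary merge in which command line values are applied last. This Python sketch is only a model of the rule stated above; representing options as dictionaries, the function name, and the example values are all hypothetical.

```python
def effective_options(config_file_options, command_line_options):
    """Merge options so that command line values override configuration file values."""
    merged = dict(config_file_options)
    merged.update(command_line_options)
    return merged


# Hypothetical settings, for illustration only.
config_file = {"output-format": "bam", "enable-duplicate-marking": "true"}
command_line = {"output-format": "cram"}
print(effective_options(config_file, command_line))
```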
The following are examples of frequently used command lines:
Build Reference/Hash Table
Run Map/Align and Variant Caller (*.fastq to *.vcf)
Run Map/Align (*.fastq to *.bam)
Run Variant Caller Only (*.bam to *.vcf)
Re-map and Run Variant Caller (*.bam to *.vcf)
Run BCL Converter (BCL to *.fastq)
Run RNA Map/Align (*.fastq to *.bam)
For recommended command lines in typical use cases, see DRAGEN Recipes.
Before you can use the DRAGEN system for aligning reads, you must load a reference genome and its associated hash tables onto the PCIe card. For information on preprocessing a reference genome's FASTA files into the native DRAGEN binary reference and hash table formats, see Prepare a Reference Genome. You must also specify the directory containing the preprocessed binary reference and hash tables with the -r [or --ref-dir] option. This argument is always required.
Use the following command to load the reference genome and hash tables to DRAGEN card memory separately from processing reads.
dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149
Use the -l (--force-load-reference) option to force the reference genome to load even if it is already loaded.
dragen -l -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149
The time needed to load the reference genome depends on the size of the reference, but for typical recommended settings, it takes approximately 30 to 60 seconds.
DRAGEN has two primary modes of operation, as follows:
Mapper/aligner
Variant caller
DRAGEN is capable of performing each mode independently or as an end-to-end solution. DRAGEN also allows you to enable and disable decompression, sorting, duplicate marking, and compression along the DRAGEN pipeline.
Full pipeline mode To execute full pipeline mode, set --enable-variant-caller to true and provide input as unmapped reads in *.fastq, *.bam, or *.cram formats. DRAGEN performs decompression, mapping, aligning, sorting, and optional duplicate marking and feeds directly into the variant caller to produce a VCF file. In this mode, DRAGEN uses parallel stages throughout the pipeline to drastically reduce the overall run time.
Map/align mode Map/align mode is enabled by default. Input is unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN produces an aligned and sorted BAM or CRAM file. To mark duplicate reads at the same time, set --enable-duplicate-marking to true.
Variant caller mode To execute variant caller mode, set the --enable-variant-caller option to true, and set the --enable-map-align option to false. The input must be a mapped and aligned BAM/CRAM file. DRAGEN produces a VCF file. DRAGEN force-enables re-sorting of the BAM, because a number of read statistics and estimates are required for the Variant Caller to operate effectively. Setting --enable-sort to false is overridden. BAM files cannot be duplicate marked in the DRAGEN pipeline prior to variant calling if they have not already been marked. Use the end-to-end mode of operation to take advantage of the mark-duplicates feature.
RNA-Seq data To enable processing of RNA-Seq-based data, set --enable-rna to true. DRAGEN uses the RNA spliced aligner during the mapper/aligner stage. DRAGEN dynamically switches between the required modes of operation.
Bisulfite MethylSeq data To enable processing of Bisulfite MethylSeq data, set the --enable-methylation-calling option to true. DRAGEN automates the processing of data for Lister (directional) and Cokus (nondirectional) protocols to generate a single BAM with bismark-compatible tags. Alternatively, you can run DRAGEN in a mode that produces a separate BAM file for each combination of the C->T and G->A converted reads and references. To enable this mode of processing, you need to build a set of reference hash tables with --ht-methylated enabled, and run DRAGEN with the appropriate --methylation-protocol setting.
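The relationship between the two main switches and the resulting DNA mode of operation can be summarized in a small Python sketch. It only restates the descriptions above (map/align on by default, variant caller off unless enabled); the function name and the returned strings are hypothetical, and RNA and methylation processing are not modeled.

```python
def describe_mode(enable_map_align=True, enable_variant_caller=False):
    """Summarize the mode selected by the two main pipeline switches."""
    if enable_map_align and enable_variant_caller:
        return "full pipeline: unmapped reads in, VCF out"
    if enable_map_align:
        return "map/align: unmapped reads in, sorted BAM or CRAM out"
    if enable_variant_caller:
        return "variant caller: mapped and aligned BAM/CRAM in, VCF out"
    return "neither stage enabled"


print(describe_mode())
print(describe_mode(enable_variant_caller=True))
print(describe_mode(enable_map_align=False, enable_variant_caller=True))
```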
The following command line options for output are mandatory:
--output-directory <out_dir>: Specifies the output directory for generated files.
--output-file-prefix <out_prefix>: Specifies the output file prefix. DRAGEN appends the appropriate file extension onto this prefix for each generated file.
-r [ --ref-dir ]: Specifies the reference hash table.
The following examples do not include these mandatory options.
For mapping and aligning, the output is sorted and compressed into BAM format by default before saving to disk. You can control the output format from the map/align stage with the --output-format <SAM|BAM|CRAM> option. If the output file exists, the software issues a warning and exits. To force overwrite if the output file already exists, use the -f [ --force ] option.
For example, the following commands output to a compressed BAM file and force overwrite of an existing output file:
dragen ... -f
dragen ... -f --output-format bam
To generate a BAI-format BAM index file (*.bai), set --enable-bam-indexing to true.
The following example outputs to a SAM file, and then forces overwrite:
dragen ... -f --output-format sam
The following example outputs to a CRAM file, and then forces overwrite:
dragen ... -f --output-format cram
DRAGEN only outputs lossless CRAM files. All QNAMEs and BAM tags are preserved in the CRAM.
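When DRAGEN is launched from a script, the mandatory and output-format options described above can be assembled programmatically. The Python sketch below builds only that portion of the argument list and does not run anything; the function name and the output directory and prefix are hypothetical, and input options would still need to be added.

```python
def common_arguments(ref_dir, out_dir, prefix, output_format="bam", force=False):
    """Build the reference, output, and format arguments for a dragen command."""
    fmt = output_format.lower()
    if fmt not in {"sam", "bam", "cram"}:
        raise ValueError("output format must be SAM, BAM, or CRAM")
    args = [
        "-r", ref_dir,
        "--output-directory", out_dir,
        "--output-file-prefix", prefix,
        "--output-format", fmt,
    ]
    if force:
        args.append("-f")  # overwrite existing output instead of exiting
    return args


# The output directory and prefix are placeholders, for illustration only.
print(["dragen"] + common_arguments(
    "/staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149",
    "/staging/output", "sample1", output_format="cram", force=True))
```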
DRAGEN can generate mismatch difference (MD) tags, as described in the BAM standard. The feature is turned off by default because there is a small performance cost to generate these strings. To generate MD tags, set --generate-md-tags to true.
To generate ZS:Z alignment status tags, set --generate-zs-tags to true. These tags are only generated in the primary alignment and when a read has suboptimal alignments qualifying for secondary output (even if none were output because --Aligner.sec-aligns was set to 0). The following are valid tag values:
ZS:Z:R--Multiple alignments with similar score were found.
ZS:Z:NM--No alignment was found.
ZS:Z:QL--An alignment was found but it was below the quality threshold.
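Downstream scripts can read these tags from SAM text, where optional fields follow the 11 mandatory columns in TAG:TYPE:VALUE form. The Python sketch below is a minimal parser for that layout; the function names and the example record are hypothetical, and the status descriptions are copied from the list above.

```python
ZS_MEANINGS = {
    "R": "Multiple alignments with similar score were found.",
    "NM": "No alignment was found.",
    "QL": "An alignment was found but it was below the quality threshold.",
}


def optional_tags(sam_line):
    """Return {TAG: (TYPE, VALUE)} for the optional fields of one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 columns are mandatory SAM fields
        tag, value_type, value = field.split(":", 2)
        tags[tag] = (value_type, value)
    return tags


def zs_status(sam_line):
    """Return the description of the ZS tag, or None if the tag is absent."""
    tag = optional_tags(sam_line).get("ZS")
    return None if tag is None else ZS_MEANINGS.get(tag[1])


# A made-up record, for illustration only.
record = "\t".join(["read1", "0", "chr1", "100", "0", "10M", "*", "0", "0",
                    "ACGTACGTAC", "IIIIIIIIII", "AS:i:10", "ZS:Z:R"])
print(zs_status(record))
```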
To generate SA:Z tags, set --generate-sa-tags to true (the default). These tags provide alignment information (position, CIGAR, orientation) of groups of supplementary alignments, which are useful in structural variant calling.
To generate the pair score in a ps:i tag, set --generate-ps-tags to true (false by default for DNA, true for RNA). The pair score is used in DRAGEN for computing MAPQ and can be used to check how well alignment candidate pairs score against each other.
DRAGEN can also output mate alignment tags. To generate the mate CIGAR (in the MC:Z tag), set --generate-mc-tags to true (this is the default). To generate the mate mapping quality (in the MQ:i tag), set --generate-mq-tags to true (this is the default). To generate the mate sequence (in the R2:Z tag) and mate base qualities (in the Q2:Z tag), set --generate-r2-tags to true (default is false) and set --generate-q2-tags to true (default is false), respectively. Note that when enabled, R2:Z and Q2:Z tags are emitted only for improperly paired read alignments with fragment length at least 1000 bp. Also, the methylation pipelines currently do not support the output of mate alignment tags.
DRAGEN also outputs a graph alignment tag (ga:Z) when applicable. This output is controlled by --generate-ga-tags (true by default for DNA, false for RNA). The tag describes the best alt contig alignment that improved the score of a primary-contig alignment at its liftover position. It can also describe read alignments to alt contigs for which there is no liftover and the primary alignment is unmapped, for example when the read maps best to an alt contig describing a novel long insertion that is not present in the reference. In addition, read alignments that have been marked as unmapped because they map to auto-detected decoy contigs not present in the original user-provided FASTA also have their alignments described in the ga tag.
The ga tag uses the same format as the SA tag used to describe supplementary alignments.
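Because the ga tag shares the SA layout, one parser can serve both. In the SAM specification, an SA value is a semicolon-delimited list of entries of the form rname,pos,strand,CIGAR,mapQ,NM. The Python sketch below splits such a value into records; the function name and the example value are hypothetical.

```python
def parse_sa_value(value):
    """Split an SA-style tag value into one dictionary per alignment entry."""
    records = []
    for entry in value.split(";"):
        if not entry:  # the value normally ends with a trailing semicolon
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        records.append({
            "rname": rname,
            "pos": int(pos),
            "strand": strand,
            "cigar": cigar,
            "mapq": int(mapq),
            "nm": int(nm),
        })
    return records


# A made-up tag value, for illustration only.
for record in parse_sa_value("chr1,1000,+,50M50S,60,0;chr2,500,-,50S50M,30,1;"):
    print(record)
```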
When CRAM is selected as output, DRAGEN generates a CRAM file with the following features:
CRAM format V3.0 is produced
The CRAM is lossless. Lossy compression is never employed and not optional
Quality score compression is lossless. Read names are preserved
Only the GZIP compression algorithm is employed, for maximum compatibility. bzip2 and lzma are not employed. rANS is used for quality scores
All input BAM tags are preserved
The reference used to compress the CRAM file is the DRAGEN hash table provided during the map/align run. When decompressing the CRAM with a FASTA file and third-party tools, the FASTA that was used to generate the hash table must be used.
A CRAM index is produced in .crai format
CRAM output is only possible when sort is enabled. CRAM alignments will always be positionally sorted
The following default settings are used for CRAM output:
Setting | Value | Description
SEQS_PER_SLICE | 2000 | Max sequences per slice
BASES_PER_SLICE | SEQS_PER_SLICE*500 | Max bases per slice
SLICE_PER_CNT | 1 | Max slices per container
embed_ref | 0 | Do not embed reference sequence
noref | 0 | Do not use non-reference-based encoding
multiseq | -1 | Do not use multiple references per slice
unsorted | 0 | Do not use unsorted mode
use_bz2 | 0 | Do not compress using bzip2
use_lzma | 0 | Do not compress using lzma
use_rans | 1 | Use rANS for quality score compression
binning | NONE | Qual score binning not used
preserve_aux_order | 1 | Preserve all aux tags and order (incl RG,NM,MD)
preserve_aux_size | 0 | Aux tag sizes not preserved ('i', 's', 'c')
lossy_read_names | 0 | Preserve read names
lossy | 0 | Do not enable Illumina 8 quality-binning system
ignore_md5 | 0 | Enable all checking of checksums
decode_md | 0 | Do not (re)generate MD and NM tags
DRAGEN can process reads in FASTQ format or BAM/CRAM format. DRAGEN supports the following compression options for FASTQ input files.
Uncompressed
gzip or bgzip compression
ORA compression. To use ORA compression, you must provide an ORA reference and reference directory. See ORA Compression and Decompression.
If your input FASTQ files are gzipped, DRAGEN automatically decompresses the files using hardware-accelerated decompression, and then streams the reads into the mapper. If your files end in *.ora, DRAGEN automatically decompresses the files using ORA decompression, and then streams the reads into the mapper. The same FASTQ command-line options apply for all compression formats.
FASTQ input files can be single-ended or paired-end, as shown in the following examples.
Single-ended in one FASTQ file (-1 option)
Paired-end in two matched FASTQ files (-1 and -2 options)
Paired-end in a single interleaved FASTQ file (--interleaved (-i) option)
Both bcl2fastq and the DRAGEN BCL command use a common file naming convention, as follows:
<SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz
Older versions of bcl2fastq and DRAGEN could segment FASTQ samples into multiple files to limit file size or to decrease the time to generate them.
For Example:
These files do not need to be concatenated to be processed together by DRAGEN. To map/align any sample, provide the first file in the series (-1 <FileName>_001.fastq). DRAGEN reads all segment files in the sample consecutively for both of the FASTQ file sequences specified using the -1 and -2 options for paired-end input, and does the same for compressed fastq.gz files. To turn this behavior off, set --enable-auto-multifile to false on the command line.
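The naming convention and the segment series described above can be illustrated with a short Python sketch. It is not part of DRAGEN. The `L` and `R` prefixes and the three-digit lane and segment fields are assumptions based on common bcl2fastq output, and the function names are hypothetical.

```python
import re

# Pattern for <SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz.
# Assumed here: lane looks like L001, read like R1 or R2, segment like 001.
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<index>\d+)_(?P<lane>L\d{3})_(?P<read>R[12])"
    r"_(?P<segment>\d{3})\.fastq(?:\.gz)?$"
)


def parse_fastq_name(name):
    """Split a FASTQ file name into its named fields, or return None."""
    match = FASTQ_NAME.match(name)
    return match.groupdict() if match else None


def next_segment(name):
    """Return the file name of the following segment (for example 001 -> 002)."""
    match = FASTQ_NAME.match(name)
    if match is None:
        return None
    start, end = match.span("segment")
    width = end - start
    return f"{name[:start]}{int(match['segment']) + 1:0{width}d}{name[end:]}"
```

Given the first file in a series, `next_segment` yields the name that a reader of consecutive segments would look for next.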
DRAGEN can also optionally read multiple files by the sample name given in the file name, which can be used to combine samples that have been distributed across multiple BCL lanes or flow cells. To enable this feature, set the --combine-samples-by-name option to true.
If the FASTQ files specified on the command-line use the Casava 1.8 file naming convention shown above and additional files in the same directory share that sample name, those files and all their segments are processed automatically. Note that sample name, read number, and file extension must match. Index barcode and lane number can differ.
To avoid impacting system performance, input files must be located on a fast file system.
To process multiple FASTQ input files as one sample, it is recommended that you use the --fastq-list <csv file name> option to specify the name of a CSV file containing the list of FASTQ files, instead of using the --combine-samples-by-name option.
For example:
Using a CSV file avoids having to concatenate the FASTQ files in cases where there are multiple FASTQ files for a sample, such as top-up scenarios or FASTQ files that are split across lanes. It also allows you to name the FASTQ input files, read input from multiple subdirectories, and add BAM tags specified explicitly for each read group. DRAGEN automatically generates a CSV file of the correct format during BCL conversion to FASTQ. The CSV file is named fastq_list.csv and contains an entry for each FASTQ file or paired-end file pair produced during the run.
FASTQ CSV File Format
The first line of the CSV file specifies the title of each column, and is followed by one or more data lines. All lines in the CSV file must contain the same number of comma-separated values and should not contain white space or other extraneous characters.
Column titles are case-sensitive. The following column titles are required:
RGID--Read Group
RGSM--Sample ID
RGLB--Library
Lane--Flow cell lane
Read1File--Full path to a valid FASTQ input file
Read2File--Full path to a valid FASTQ input file. Required for paired-end input. If not using paired-end input, leave empty.
Each FASTQ file referenced in the CSV list can be referenced only once. All values in the Read2File column must be either nonempty and reference valid files, or they must all be empty.
When generating a BAM file using fastq-list input, one read group is generated per unique RGID value. The BAM header contains the following RG tags for each read group:
ID (from RGID)
SM (from RGSM)
LB (from RGLB)
You can specify additional tags for each read group by adding a column title. The column title must be only four upper-case characters and begin with RG. For example, to add a PU (platform unit) tag, add a column named RGPU and specify the value for each read group in this column. All column titles must be unique.
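The format rules above can be checked before a run with a small script. The following Python sketch is not a DRAGEN tool. The function name and the example paths are hypothetical, it interprets "four upper-case characters" as RG plus two letters A-Z, and it does not check that the listed files exist.

```python
import csv
import io
import re

REQUIRED = ["RGID", "RGSM", "RGLB", "Lane", "Read1File", "Read2File"]

# Hypothetical example content; the paths are placeholders.
EXAMPLE = (
    "RGID,RGSM,RGLB,Lane,Read1File,Read2File\n"
    "rg1,sampleA,lib1,1,/data/a_L001_R1.fastq.gz,/data/a_L001_R2.fastq.gz\n"
    "rg2,sampleA,lib1,2,/data/a_L002_R1.fastq.gz,/data/a_L002_R2.fastq.gz\n"
)


def validate_fastq_list(text):
    """Return a list of problems found in FASTQ list CSV text (empty if none)."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    for col in REQUIRED:
        if col not in header:
            problems.append(f"missing required column {col}")
    if len(set(header)) != len(header):
        problems.append("column titles are not unique")
    for col in header:
        # Extra columns must be read group tags: RG plus two upper-case letters.
        if col not in REQUIRED and not re.fullmatch(r"RG[A-Z]{2}", col):
            problems.append(f"unexpected column title {col}")
    if not data:
        problems.append("no data lines")
    for number, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} values, found {len(row)}"
            )
    if problems:
        return problems
    records = [dict(zip(header, row)) for row in data]
    files = [r[key] for r in records for key in ("Read1File", "Read2File") if r[key]]
    if len(set(files)) != len(files):
        problems.append("a FASTQ file is referenced more than once")
    has_read2 = [bool(r["Read2File"]) for r in records]
    if any(has_read2) and not all(has_read2):
        problems.append("Read2File must be set on every line or on none")
    return problems
```

Calling `validate_fastq_list(EXAMPLE)` returns an empty list. Adding a line with an empty Read2File value to the same text reports the mixed paired-end and single-end condition.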
A fastq-list file can contain files for more than one sample. If a fastq-list file contains only one unique RGSM entry, then no additional options need to be specified, and DRAGEN processes all files listed in the fastq-list file. If there is more than one unique RGSM entry in a fastq-list file, --fastq-list-sample-id <SampleID> must be used in addition to --fastq-list <filename> to process only a specific sample from the CSV file. Only the entries in the fastq-list file with an RGSM value that matches the specified SampleID are processed.
Independent processing and output for multiple individual samples in one run is not supported.
To process all listed files together as one sample, regardless of the RGSM value, the option --fastq-list-all-samples=true can be used instead of --fastq-list-sample-id.
Note
For a single run, only one BAM and VCF output file are produced because all input read groups are expected to belong to the same sample. To process multiple samples independently from one BCL conversion run, DRAGEN must be run multiple times using different values for the `--fastq-list-sample-id` option.
There is no option to specify groupings or subsets of RGSM values for more complex filtering, but the fastq-list file can be modified to achieve the same effect.
The following is an example FASTQ list CSV file with the required columns:
If you use the --tumor-fastq-list option for somatic input, use the --tumor-fastq-list-sample-id <SampleID> option to specify the sample ID for the corresponding FASTQ list, as shown in the following example:
Tumor-Normal Pairs Input
If using fastq_lists or tumor_fastq_lists comprising multiple samples (RGSMs) in somatic mode, you can use a loop to iterate through the two lists to create tumor-normal pairs for testing. Create a *.txt file with the RGSM of each normal sample to be tested (one per line), and then create a separate *.txt file with the RGSM of the tumor samples to be tested. Make sure that the tumor sample RGSMs are listed in the same order as the corresponding normal samples, and include a blank line after the last sample.
You can use the following example script to perform testing in somatic mode. Each iteration takes one entry from the tumor samples list and one entry from the normal samples list (from top to bottom) to create a tumor-normal pair as input for the DRAGEN run.
The following are examples of the FASTQ lists and samples lists used as input for the script.
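The pairing loop described above can be sketched in Python as follows. This is an illustration, not the original script. It uses only the FASTQ list options named in this section, assumes the executable is called `dragen`, and leaves out the reference, output, and caller options that a real run needs. The function names and file names are hypothetical.

```python
def read_sample_ids(text):
    """Return the non-empty lines of a samples list, in file order."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def pair_commands(tumor_ids, normal_ids, tumor_fastq_list, normal_fastq_list):
    """Build one partial command line per tumor-normal pair, top to bottom."""
    if len(tumor_ids) != len(normal_ids):
        raise ValueError("tumor and normal sample lists differ in length")
    commands = []
    for tumor_id, normal_id in zip(tumor_ids, normal_ids):
        commands.append([
            "dragen",  # assumed executable name
            "--tumor-fastq-list", tumor_fastq_list,
            "--tumor-fastq-list-sample-id", tumor_id,
            "--fastq-list", normal_fastq_list,
            "--fastq-list-sample-id", normal_id,
            # Reference, output, and caller options are intentionally omitted.
        ])
    return commands
```

Each returned list can be extended with the remaining options and passed to a process launcher such as `subprocess.run`.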
You can use the same options as the other FASTQ input file types for ORA files. To use the ORA file, replace the FASTQ file name with the ORA file name and specify the ORA reference directory using --ora-reference.
See ORA Compression and Decompression for more information on ORA reference files.
The following command represents paired-end in two matched ORA FASTQ files (-1 and -2 options).
BAM files can be used as input to the mapper/aligner. By default, --enable-map-align is true. You can use the BAM file as input to the variant caller by setting the --enable-map-align option to false.
When you specify a BAM file as input, with map/align enabled, DRAGEN ignores any alignment information contained in the input file, and outputs new alignments for all reads.
If the input file contains paired-end reads, it is important to specify that the input data should be sorted so that pairs can be processed together. Other pipelines would require you to re-sort the input data set by read name. DRAGEN vastly increases the speed of this operation by pairing the input reads, and sending them on to the mapper/aligner when pairs are identified. Use the --pair-by-name option to enable or disable this feature (the default is true).
Specify single-ended input in one BAM file with the -b and --pair-by-name=false options, as follows:
Specify paired-end input in one BAM file with the -b and --pair-by-name=true options, as follows:
You can use CRAM files as input to the DRAGEN mapper/aligner and variant caller. The DRAGEN functionality available when using CRAM input is the same as when using BAM input.
By default, the CRAM compressor and decompressor use the DRAGEN reference specified with the --ref-dir option. CRAM compression is reference based, and the reference used for compression is not part of the CRAM file. Therefore, the CRAM input file must have been created with the same reference as the one provided to DRAGEN for the analysis.
DRAGEN supports the re-alignment of a CRAM input that was created with a different reference in one step. Re-aligning such a CRAM file requires the --cram-reference option, which makes the CRAM decompressor use the specified reference.
--cram-reference can be either a FASTA file or a DRAGEN hash table folder.
If pointing to a FASTA file, the FASTA .fai index file must be present next to the FASTA file.
CRAM output is always compressed using the --ref-dir reference.
Example: CRAM was created with hg19, re-analysis with hg38
The following options are used for providing a CRAM input to either mapper/aligner or variant caller:
--cram-input: The name and path for the CRAM file. One usage example is paired-end input in a single CRAM file. In that case, also set the --pair-by-name option to true.
BCL is the output format of Illumina sequencing systems. Under limited circumstances, DRAGEN can read directly from BCL for map-align operations, saving the time needed for conversion to FASTQ.
DRAGEN can read directly from BCL in the following circumstances:
Only one lane is input as part of a run (specified on the command-line).
The lane has only a single sample specified in the SampleSheet.csv file.
When conversion from BCL to FASTQ is required, DRAGEN provides a BCL to FASTQ converter (see DRAGEN BCL Data Conversion).
The following example command is for BCL input with only one lane of input:
For additional BCL conversion options, see Input File Types.
One of the techniques that DRAGEN uses to optimize sequence handling can lead to overwriting the base quality score assigned to N base calls.
When you use the --fastq-n-quality and --fastq-offset options, the base quality scores are overwritten with a fixed base quality. The default values for these options are 2 and 33, which together match the Illumina minimum quality encoding of 35 (ASCII character '#').
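The arithmetic behind the defaults can be shown with a few lines of Python. This is an illustration of the encoding only. How DRAGEN applies the overwrite internally is not described here, and the function names are hypothetical.

```python
def n_quality_char(n_quality=2, offset=33):
    """Return the FASTQ quality character for the fixed N base quality."""
    return chr(n_quality + offset)


def overwrite_n_qualities(bases, quals, n_quality=2, offset=33):
    """Replace the quality character at every N base call with the fixed value."""
    fixed = chr(n_quality + offset)
    return "".join(fixed if base == "N" else qual for base, qual in zip(bases, quals))
```

With the defaults, `n_quality_char()` returns `#` because 2 + 33 = 35, and `overwrite_n_qualities("ACNT", "IIII")` returns `II#I`.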
By a common convention, read names can include suffixes, such as /1 or /2, which indicate the end of a pair that the read represents. For BAM input using the --pair-by-name option, DRAGEN ignores these suffixes to find matching pair names. By default, DRAGEN uses the forward slash character as the delimiter for these suffixes and ignores the /1 and /2 when comparing names. By default, DRAGEN also strips these suffixes from the original read names.
DRAGEN has the following options to control how suffixes are used:
To change the delimiter character for suffixes, use the --pair-suffix-delimiter option. Valid values for this option include forward-slash (/), dot (.), and colon (:).
To preserve the entire name, including the suffixes, set --strip-input-qname-suffixes to false.
To append a new set of suffixes to all read names, set --append-read-index-to-name to true. The delimiter is determined by the --pair-suffix-delimiter option. By default, the delimiter is a slash, so /1 and /2 are added to the names.
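The suffix handling described above can be mimicked in Python to preview how read names compare. This is a sketch of the described behavior, not DRAGEN code, and the function names are hypothetical.

```python
VALID_DELIMITERS = ("/", ".", ":")


def strip_pair_suffix(qname, delimiter="/"):
    """Remove a trailing <delimiter>1 or <delimiter>2 from a read name."""
    if delimiter not in VALID_DELIMITERS:
        raise ValueError(f"unsupported delimiter: {delimiter!r}")
    for suffix in (delimiter + "1", delimiter + "2"):
        if qname.endswith(suffix):
            return qname[: -len(suffix)]
    return qname


def append_read_index(qname, read_index, delimiter="/"):
    """Append <delimiter><read_index> to a read name."""
    return f"{qname}{delimiter}{read_index}"
```

Two mates named `read7/1` and `read7/2` both reduce to `read7`, which is the name used to match the pair.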
When processing RNA-Seq data, you can supply a gene annotations file by using the --annotation-file option. Providing this file improves the accuracy of the mapping and aligning stage (see Input Files). The file should conform to the GTF/GFF format specification and should list annotated transcripts that match the reference genome being mapped against. The similar GFF3 format is currently not supported, due to inconsistent contig naming between GENCODE and Ensembl. See the RNA user guide section for more details on potential issues and workarounds.
DRAGEN can take the SJ.out.tab file (see SJ.out.tab) as an annotations file to help guide the aligner in a two-pass mode of operation.
DRAGEN can stream input files directly from an AWS S3 bucket, Azure Blob storage account, or by using AWS presigned URLs (presigned URLs are not supported for Azure Blob storage at this time). With streaming, input files are not required to be downloaded locally prior to being processed. The files are streamed over the network directly into the DRAGEN processor.
Input streaming is most beneficial for large input files. DRAGEN supports input streaming for BAMs and compressed FASTQ files. For FASTQ files, input streaming can be used in all the configurations, including single-end FASTQs, paired-end FASTQs, and FASTQ lists.
Input streaming is supported for the following use cases:
Mapping/aligning of FASTQ and BAM.
Germline and somatic small variant calling from BAM (without remapping).
For other file types that are significantly smaller in size, download them locally before running the analysis.
Streaming FASTQ Input Using AWS S3
Streaming FASTQ Input Using Azure Blob Storage Account
Streaming FASTQ Input Using Presigned URLs (for AWS only)
Streaming BAM Input Using AWS S3
Streaming BAM Input Using Azure Blob Storage Account
Streaming BAM Input Using Presigned URLs (for AWS only)
DRAGEN can stream its output to an AWS S3 Bucket or an Azure Blob Storage Account Container. Output streaming is beneficial for large output files and for sharing results.
Streaming output to AWS S3
Streaming output to Azure Blob Storage Account
To stream input files from or write output to a cloud provider's storage, you must have permission to access the remote files.
AWS S3
S3 requires AWS authentication and credentials. The authentication should already be set up on the instance you are running, for example, via IAM policies.
Azure Blob Storage Account
Azure requires authentication and environment variables. DRAGEN supports two cases: (1) Using managed identities and (2) Storage account access keys.
To use managed identities, you must run DRAGEN on an Azure instance. The instance must have Contributor permissions (read/write) on the Storage Account it reads from and writes to. If the instance has a single managed identity, only the AZ_ACCOUNT_NAME=<azure-storage-account-name> environment variable is required. For multiple managed identities, you must also provide the AZR_IDENT_CLIENT_ID=<client-id> environment variable, with the client ID of the identity that can access your storage bucket. The client ID can be found on the Azure Portal.
With storage account access keys, DRAGEN can write to an Azure bucket both on and off Azure instances. For this use case, find the Storage Account Access Key and set the environment variables AZ_ACCOUNT_NAME=<azure-storage-account-name> and AZ_ACCOUNT_KEY=<account-key>.
Presigned URL (AWS only)
An AWS presigned URL most likely has a query string attached to it, which provides the authentication credentials or necessary tokens to grant permission to the S3 bucket (e.g., https://bucket-name.amazonaws.com/path/to/folder?querystring). Currently, streaming input to DRAGEN from Azure presigned URLs is not supported.
Use the --sample-sex command line option to control the sex karyotype input used in downstream components, such as variant callers. If a sample sex karyotype input is not specified on the command line, the sex karyotype is automatically determined. The sex karyotype input is converted to a reference sex karyotype for use in variant calling. Other components might support sex karyotype input. Refer to the corresponding section for the component you are using.
The --sample-sex option supports the following values. Values are not case-sensitive.
none: No sex karyotype input. Components use a default reference sex karyotype.
auto: The sex karyotype is estimated by the Ploidy Estimator. If using CNV calling, sex karyotype is determined using a separate sex estimation module. If DRAGEN cannot estimate the sex karyotype, then components do not have a sex karyotype input. This behavior is then the same as none. auto is the default value.
female: Sex karyotype input is XX.
male: Sex karyotype input is XY.
The following example command lines use --sample-sex to specify the sex karyotype.
If the value is none, female, or male, the Ploidy Estimator could still run and produce output, but variant callers will not use any estimated sex karyotype that differs from the sex karyotype provided on the command line.
The sex karyotype input is converted to the reference sex karyotype for the different components as follows. See the relevant component section for more information on how --sample-sex is used.
XX | XX | XX | XX | XX | XXYY
XY | XY | XY | XY | XY | XXYY
XXY | XY | XX | XY | XXYY | XXYY
XYY | XY | XY | XY | XXYY | XXYY
X0 | XX | XY | XX | XXYY | XXYY
XXXY | XY | XX | XY | XXYY | XXYY
XXX | XX | XX | XX | XXYY | XXYY
None | XX/XY | XX | XX | XXYY | XXYY
For sex karyotype input of None, CNV independently checks the coverage ratio of X and Y to determine the reference sex karyotype. Detection of minimal Y coverage will yield XY, otherwise XX.
The Picard Base Quality Score Recalibration (BQSR) tool produces output BAM files that include tags BI and BD. BQSR calculates these tags relative to the exact sequence for a read. If a BAM file with BI and BD tags is used as input to mapper/aligner with hard clipping enabled, the BI and/or BD tags can become invalid.
The recommendation is to strip these tags when using BAM files as input. To remove the BI and BD tags, set the --preserve-bqsr-tags option to false. If you preserve the tags, DRAGEN warns you to disable hard clipping.
DRAGEN assumes that all the reads in a given FASTQ belong to the same read group. DRAGEN creates a single @RG read group descriptor in the header of the output BAM file, with the ability to specify the following standard BAM attributes:
ID | --RGID | Read group identifier. If you include any of the read group parameters, RGID is required. It is the value written into each output BAM record.
LB | --RGLB | Library.
PL | --RGPL | Platform/technology used to produce the reads. The BAM standard allows for values CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT and PACBIO.
PU | --RGPU | Platform unit, eg, flowcell-barcode.lane.
SM | --RGSM | Sample.
CN | --RGCN | Name of the sequencing center that produced the read.
DS | --RGDS | Description.
DT | --RGDT | Date the run was produced.
PI | --RGPI | Predicted mean insert size.
If any of these arguments are present, DRAGEN adds an RG tag to all the output records to indicate that they are members of a read group. The following example shows a command line that includes read group parameters:
When using the --fastq-list option to input multiple read groups, BAM tags (and others) are specified for each read group by adding columns to the fastq_list.csv file. Each column heading consists of four capital letters and each begins with 'RG'. For each column, each read group's values for that column are propagated to the output BAM file in an identically named tag.
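The relationship between the read group arguments and the @RG header line can be sketched in Python. The mapping below follows the attribute list in this section and the tab-separated `KEY:value` layout of SAM header lines. The function names are hypothetical, and the exact header text that DRAGEN writes may differ.

```python
# Read group argument name -> @RG attribute key, as listed in this section.
RG_ARGS = {
    "RGID": "ID", "RGLB": "LB", "RGPL": "PL", "RGPU": "PU", "RGSM": "SM",
    "RGCN": "CN", "RGDS": "DS", "RGDT": "DT", "RGPI": "PI",
}


def check_read_group(values):
    """Raise ValueError if RGID is missing or an unknown argument is given."""
    if "RGID" not in values:
        raise ValueError("RGID is required when read group parameters are used")
    unknown = sorted(set(values) - set(RG_ARGS))
    if unknown:
        raise ValueError(f"unknown read group arguments: {unknown}")


def read_group_args(values):
    """Return command-line arguments such as ['--RGID', 'rg1', ...]."""
    check_read_group(values)
    args = []
    for name in RG_ARGS:
        if name in values:
            args += [f"--{name}", str(values[name])]
    return args


def read_group_header(values):
    """Return a tab-separated @RG header line for the given values."""
    check_read_group(values)
    fields = [f"{RG_ARGS[name]}:{values[name]}" for name in RG_ARGS if name in values]
    return "\t".join(["@RG"] + fields)
```

For example, the values `{"RGID": "rg1", "RGSM": "s1"}` produce the arguments `--RGID rg1 --RGSM s1` and a header line with the fields `ID:rg1` and `SM:s1`.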
To suppress the license status message at the end of the run, use the --lic-no-print option. The following shows an example of the license status message:
An MD5SUM file is generated automatically for BAM and CRAM output files. The MD5SUM file has the same name as the output file, with an .md5sum extension appended (eg, whole_genome_run_123.bam.md5sum). The MD5SUM file is a single-line text file that contains the md5sum of the output file, which exactly matches the output of the Linux md5sum command.
The MD5SUM calculation is performed as the output file is written, so there is no measurable performance impact (compared to the Linux md5sum command, which can take several minutes for a 30x BAM).
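A downstream check of the MD5SUM file can be sketched in Python. The sketch assumes that the digest is the first whitespace-separated token in the .md5sum file, which holds whether or not a file name follows it. The function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5sum_file(output_path):
    """Compare a file against the digest stored in <output_path>.md5sum."""
    with open(output_path + ".md5sum") as handle:
        recorded = handle.read().split()[0]  # assumed: digest is the first token
    return recorded.lower() == md5_of_file(output_path)
```

Reading in chunks keeps memory use flat for large BAM or CRAM files, at the cost of the extra pass over the data that DRAGEN itself avoids.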
Command line options can be stored in a configuration file. The location of the default configuration file is <INSTALL_PATH>/config/dragen-user-defaults.cfg. You can override this file by using the --config-file (-c) option to specify a different file. The configuration file used for a given run supplies the default settings for that run, any of which can be overridden by command line options.
The recommended approach is to use the dragen-user-defaults.cfg file as a template to create default settings for different use cases. Copy dragen-user-defaults.cfg, rename the copy, then modify the new file for the specific use-case. Best practice is to put options that rarely change into the configuration file and to specify options that vary from run to run on the command line.
Authentication is required for users that run DRAGEN on the cloud, with the Bring-Your-Own-License (BYOL) model, outside of integrated Illumina cloud products. A valid license is required to enable authentication and usage quotas.
DRAGEN cloud runs access the DRAGEN License Server to validate the credentials and licenses against the intended run. BYOL users must provide credentials and must allow access to the license server URL. The following command line option can be used to pass the credentials to DRAGEN: --lic-server=https://<user>:<pass>@license.edicogenome.com.
An alternative way to provide license server credentials is by using a license credentials file. The --lic-credentials command line option can be used to provide the full path to the license credentials file. This is a more secure way to pass cloud credentials, because it avoids accidental credential leaks from command line console logs.
A license credentials file is a plain text file audited by the customer. The format is the same as the DRAGEN config files: one key = value pair per line. The following key names must be used: credentials1 and credentials2.
DRAGEN uses AWS Instance Metadata Service (IMDS) to identify its own metadata within the AWS environment, including location, identity, and configuration.
DRAGEN supports both AWS IMDSv1 and the more secure AWS IMDSv2. AWS IMDSv1 is request/response based. It accesses metadata by HTTP requests to a specific endpoint on the instance. AWS IMDSv2 uses token-based authentication with time-limited tokens.
AWS IMDSv2 must be enabled on the AWS instance, otherwise, IMDSv1 is used by default. DRAGEN software will automatically detect the IMDS version in use and adapt its behavior accordingly.
DRAGEN cloud runs access the instance identity document via the Instance Metadata Service as part of the authentication. It uses the IPv4 local address. If access to the local address is not allowed, the authentication will fail. Alternately, the user may save the instance identity document(s) and point DRAGEN to use them instead, if the user does not want to allow applications to access this service. The method for providing instance identity documents to the software is described below.
Save the instance identity document(s) as files from the user's instance, and provide them as inputs to the DRAGEN software with each run.
The instance identity document(s) only need to be saved once per AWS account and region, and those files can be re-used subsequently.
Examples for saving instance identity document(s):
IMDSv1
IMDSv2
There should be 3 files in this folder, named pkcs7, signature, and document. Run DRAGEN using the --lic-instance-id-location ${instance_identity} command option.
There should be 2 files in this folder, named instance and document. Run DRAGEN using the --lic-instance-id-location ${instance_identity} command option.
B-Allele frequency (BAF) output is enabled by default in germline and somatic VCF and gVCF runs.
The BAF value is calculated as either AF or (1 - AF), where:
AF = alt_count / (ref_count + alt_count)
BAF = 1 - AF only when ref base < alt base; otherwise BAF = AF. The order of priority for bases is A < T < G < C < N.
The B-allele frequency values are often plotted to visually inspect the spread away from a perfectly diploid heterozygous call (BAF=50%). This plot is more easily interpreted if it is symmetric about the BAF=50% line. To ensure the symmetry, a heuristic must be used to determine when BAF = AF or BAF = 1-AF. This definition of B-Allele Frequency is based on the definition that is used for bead arrays, as most users are accustomed to that implementation. There, the choice of the B allele is based on the color of dye attached to each nucleotide. A and T get one color, G and C get the other color. The bead array implementation has a much more complex rule for tie-breaking between A and T or G and C that involves top and bottom strands. That complexity is unnecessary here, so the simpler hierarchical approach of using a priority for the nucleotides A<T<G<C<N is used.
For each small variant VCF entry with exactly one SNP alternate allele, the output contains a corresponding entry in the BAF output file.
<NON_REF> lines are excluded.
ForceGT variants (as marked by the "FGT" tag in the INFO field) are not included in the output, unless the variant also contains the "NML" tag in the INFO field.
Variants where the ref_count and alt_count are both zero are not included in the output.
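The BAF rule described above can be written as a short Python function. This is a sketch of the stated definition, not DRAGEN source code, and the function name is hypothetical.

```python
# Priority order from the text: A < T < G < C < N.
BASE_PRIORITY = {"A": 0, "T": 1, "G": 2, "C": 3, "N": 4}


def b_allele_frequency(ref_base, alt_base, ref_count, alt_count):
    """Return BAF for a single-SNP site, or None if there is no read support."""
    total = ref_count + alt_count
    if total == 0:
        return None  # such variants are not included in the BAF output
    af = alt_count / total
    if BASE_PRIORITY[ref_base] < BASE_PRIORITY[alt_base]:
        return 1 - af
    return af
```

For a site with ref A, alt G, 30 reference reads, and 10 alternate reads, AF is 0.25 and, because A < G, BAF is 0.75. Swapping the roles of the bases (ref G, alt A) with the same counts gives BAF = AF = 0.25.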
--vc-enable-baf
Enable or disable B-allele frequency output. Enabled by default.
The BAF output files are BigWig-compressed files, named <output-file-prefix>.baf.bw and <output-file-prefix>.hard-filtered.baf.bw. The hard-filtered file only contains entries for variants that pass the filters defined in the VCF (ie, PASS entries).
Each entry contains the following information: Chromosome Start End BAF
Where:
Chromosome is a string matching a reference contig.
Start and end values are zero-based, half open intervals.
BAF is a floating point value.
DRAGEN supports the construction of reference hash tables for both human and non-human reference genomes. The reference autodetect feature of DRAGEN recognizes reference hash tables built on the four human reference genomes: hg19 (hg19), GRCh37/hs37d5 (hs37d5), GRCh38/hs38d1 (hg38), and T2T-CHM13v2.0 (chm13).
DRAGEN supports pangenome reference hash tables which extend the reference genomes with alternative variant paths from a sample cohort used to construct the pangenome reference. A pangenome-based reference improves the mapping accuracy of Illumina reads in the “Difficult-to-Map Regions” of the genome and the downstream variant calling.
Pre-built human references are available for download on the DRAGEN Software Support Site page.
The pangenome reference is recommended for germline human analyses. The accuracy achieved with pangenome references is highlighted in the plot below.
In the following tables we summarize the reference support for each DRAGEN component and the recommended reference type for each component.
SNV | Pangenome | Linear | Yes | Yes | Yes | Yes | Yes
CNV | Pangenome | Linear | Yes | Yes | Yes | Yes* | No
SV | Pangenome | Linear | Yes | Yes | Yes | Yes* | Yes
Expansion Hunter | Pangenome | Linear | Yes | Yes | Yes | No | No
Targeted Callers | Pangenome | Linear | Yes | Yes | Yes | No | No
RNA | Linear | Linear | Yes | Yes | Yes | Yes* | Yes
De Novo | Pangenome | Linear | Yes | Yes | Yes | Yes* | Yes
Joint Genotyping | Pangenome | Linear | Yes | Yes | Yes | Yes* | Yes
Biomarkers (HLA) | Pangenome | Linear | Yes | Yes | Yes | Yes* | No
gVCF genotyper | Pangenome | Linear | Yes | Yes | Yes | Yes* | Yes

SNV | Linear | Linear | Yes | Yes | Yes | Yes* | No
UMI SNV | Linear | Linear | Yes | Yes | Yes | Yes* | No
CNV | Linear | Linear | Yes | Yes | Yes | Yes* | No
SV | Linear | Linear | Yes | Yes | Yes | Yes* | No
Methylation | Linear | Linear | Yes | Yes | Yes | No | No
Nirvana | Pangenome | Linear | Yes | Yes | Yes | No | Yes
* DRAGEN supports execution of the component, but the component's accuracy has not been established.
See Prepare a Reference Genome for how to build a custom reference genome.
The DRAGEN DNA Pipeline accelerates the secondary analysis of NGS data by harnessing the tremendous power available on the DRAGEN Platform. The pipeline includes highly optimized algorithms for mapping, aligning, sorting, duplicate marking, and haplotype variant calling. In addition to haplotype variant calling, the pipeline supports calling of copy number and structural variants as well as detection of repeat expansions and targeted calls.
DRAGEN can remove artifacts from reads using hardware accelerated read trimming. Hardware accelerated read trimming is available on U200 and cloud systems, as part of the DRAGEN mapper and adds no additional run time. DRAGEN provides multiple independent trimming filters that target different types of artifacts or use cases. You can enable and configure the artifacts or use cases independently to tailor the read-trimming to your analysis. Read trimming uses two different modes, hard-trimming and soft-trimming.
To enable hard-trimming mode, use --read-trimmers. In hard-trimming mode, potential artifacts are removed from input reads. Reads that are trimmed to fewer than 20 bases are filtered and replaced with a placeholder read that uses 10 N bases. DRAGEN sets the 0x200 flag on the filtered reads.
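The filtering rule for hard-trimmed reads can be expressed as a short Python sketch. It models only the behavior stated above (the 20-base minimum, the 10-N placeholder, and the 0x200 flag). The function and constant names are hypothetical.

```python
MIN_READ_LENGTH = 20      # reads trimmed below this length are filtered
PLACEHOLDER = "N" * 10    # placeholder sequence for filtered reads
FLAG_FILTERED = 0x200     # flag bit set on filtered reads


def apply_hard_trim_filter(trimmed_seq, flag=0):
    """Return (sequence, flag) after applying the minimum-length filter."""
    if len(trimmed_seq) < MIN_READ_LENGTH:
        return PLACEHOLDER, flag | FLAG_FILTERED
    return trimmed_seq, flag
```

A read trimmed down to 19 bases comes back as `NNNNNNNNNN` with the 0x200 bit set, while a read of 20 or more bases is returned unchanged.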
DRAGEN contains a novel lossless soft-trimming mode. In soft-trimming mode, reads are mapped as though they had been trimmed, but no bases are removed. To enable the trimmer in soft mode, use --soft-read-trimmers.
Soft-trimming suppresses systematic mismapping of reads that contain trimmable artifacts, without actually losing the trimmed bases in aligned output. Soft-trimming prevents reads with trimmable artifacts, such as Poly-G artifacts, from being mapped to reference G homopolymers, and prevents adapter sequences from being mapped to the matching reference loci. With soft-trimming, reads might map to different positions in the reference than they would without it. When using soft-trimming, DRAGEN does not filter reads, and it does not map reads whose bases would all have been trimmed.
Soft-trimming for Poly-G artifacts is enabled by default on supported systems.
Fixed-length trimming removes a fixed number of bases from the 5' end of each read. If you are analyzing sequencing data from an amplicon of fixed size and expect the read-length to consistently exceed the length of quality sequence data, you can use the expected number in fixed-length trimming.
Poly-G artifacts appear on two-channel sequencing systems when the dark base G is called after synthesis has terminated. As a result, DRAGEN calls several erroneous high-confidence G bases on the ends of affected reads. For contaminated samples, many affected reads can be mapped to reference regions with high G content. The affected reads can cause problems for processing downstream.
Base quality can degrade over the length of a read toward the 5' end and separate from any artifacts from early termination of synthesis. The lower quality bases can affect mapping and alignment results, and might lead to incorrect variant or methylation calls downstream. The quality trimming tool calculates a rolling average of the base quality inward from the 5' end and removes the minimum number of bases, so that the average base quality is above the threshold specified using --trim-min-quality.
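One way to read the rolling-average description is sketched below in Python. The window size, the exact stopping rule, and the function name are assumptions made for illustration. The text does not specify how DRAGEN computes the average, so this should not be taken as its implementation.

```python
def quality_trim_count(quals, min_quality, window=5):
    """Return how many bases to trim from the end of `quals`.

    `quals` holds Phred scores ordered so that the end being trimmed comes
    last. Bases are dropped one at a time until the mean quality of the last
    `window` remaining bases reaches `min_quality` (assumed stopping rule).
    """
    end = len(quals)
    while end > 0:
        start = max(0, end - window)
        mean = sum(quals[start:end]) / (end - start)
        if mean >= min_quality:
            break
        end -= 1
    return len(quals) - end
```

With qualities `[30, 30, 30, 30, 30, 30, 10, 10, 2, 2]`, a threshold of 20, and a window of 5, the window means are 10.8, then 16.4, then 22.0, so two bases are trimmed.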
Problems during library preparation, or libraries with smaller inserts can result in the synthesis of high quality reads containing sequence from the adapters used. If not removed before analysis, noninsert bases can reduce mapping efficiency and downstream accuracy. The adapter trimming tool uses the adapter sequences from the input FASTA file, and then removes all hits greater than a specified size. Adapter trimming allows for a 10% mismatch. For 3' adapters, trimming is from the first matching adapter base to the end of the read. For 5' adapters, trimming is from the first (3') matching adapter base to the beginning (5') of the read.
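A simplified 3' adapter search with the 10% mismatch allowance might look like the following Python sketch. The scan order, the `min_hit` parameter standing in for the "specified size", and the example adapter sequence are assumptions. Insertions and deletions are not handled.

```python
def trim_3prime_adapter(read, adapter, min_hit=5, max_mismatch_rate=0.10):
    """Trim from the first adapter hit to the end of the read.

    A hit at position `start` is the overlap between read[start:] and the
    start of the adapter, with at most 10% mismatching bases and at least
    `min_hit` overlapping bases.
    """
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_hit:
            break  # remaining overlaps are only shorter
        mismatches = sum(
            1 for a, b in zip(read[start:start + overlap], adapter) if a != b
        )
        if mismatches <= max_mismatch_rate * overlap:
            return read[:start]
    return read
```

For example, with the illustrative adapter `AGATCGGAAGAGC`, the read `ACGTACGTACAGATCGGAAG` is trimmed to `ACGTACGTAC`. The same result is obtained when one of the ten overlapping adapter bases is miscalled.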
If quality trimming is not feasible due to reduced yield or other limitations, an alternative option is to remove only explicitly ambiguous bases from the ends of reads. If enabled, the ambiguous base trimmer applies a simple exact-match search to both ends of all processed reads, regardless of mate-pair status.
You can maximize trimmer sensitivity by using the minimum length trimming tool to remove a fixed number of bases from each read after the trimmer tools above have run. For example, if you would like to remove 5 bp from each read, a 7 bp adapter hit could be missed if five of the bases are removed first. To mitigate this issue, DRAGEN provides an optional minimum trim-length filter.
If using libraries of fixed-size inserts, such as small PCR amplicons, it is more convenient to specify a length that all reads should be trimmed to rather than the number of bases to remove. You can use the maximum length trimming tool.
If using RNA libraries, reads overlapping the poly-A tail of the transcripts may contain long poly-A/poly-T sequences at the end of the reads which may result in incorrect alignment. The poly-A trimmer mitigates this by trimming the poly-A tail from the end of the read. See additional description in RNA alignment section.
The trimmer generates a metrics file titled <output prefix>.trimmer_metrics.csv. Metrics are available on an aggregate level over all input data. The metrics units are in reads or bases.
Total input reads--Total number of reads in the input files.
Total input bases--Total number of bases in the input reads.
Total input bases R1--Total number of bases in R1 reads.
Total input bases R2--Total number of bases in R2 reads.
Average input read length--Total number of input bases divided by the number of input reads.
Total trimmed reads--Total number of reads trimmed by at least one base, not including soft-trimming.
Total trimmed bases--Total number of bases trimmed, not including soft-trimming.
Average bases trimmed per read--The number of trimmed bases divided by the number of input reads.
Average bases trimmed per trimmed read--The number of trimmed bases divided by the number of trimmed reads.
Remaining poly-G K-mers R1 3prime--The number of R1 3' read ends that contain likely Poly-G artifacts after trimming.
Remaining poly-G K-mers R2 3prime--The number of R2 3' read ends that contain likely Poly-G artifacts after trimming.
Total filtered reads--The number of reads that were filtered out during trimming.
Reads filtered for minimum read length R1--The number of R1 reads that were filtered due to being trimmed below the minimum read length.
Reads filtered for minimum read length R2--The number of R2 reads that were filtered due to being trimmed below the minimum read length.
<Trimmer tool> trimmed reads--The number of reads with at least one base trimmed by TRIMMER. DRAGEN reports the metric for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes reads that were trimmed during soft-trimming. Each trimming tool above produces the metric.
<Trimmer tool> trimmed bases--The number of bases trimmed by TRIMMER. The metric is produced for both R1 and R2 mates and the filtering status (unfiltered or filtered) of the trimmed read. The metric includes bases from reads that were trimmed during soft-trimming. Each trimming tool above produces the metric.
--read-trimmers
To enable trimming filters in hard-trimming mode, set to a comma-separated list of the trimmer tools you would like to use (in the order of execution). To disable trimming, set to none. During mapping, artifacts are removed from all reads. The following are valid trimmer names:
fixed-len: Fixed-length trimming
polyg: Poly-G trimming
quality: Quality trimming
adapter: Adapter trimming
n: Ambiguous base trimming
min-len: Minimum length trimming
cut-end: Maximum length trimming
bisulfite: Bisulfite trimming
Read trimming is disabled by default (default: "none").
--soft-read-trimmers
To enable trimming filters in soft-trimming mode, set to a comma-separated list of the trimmer tools you would like to use (in the order of execution). To disable soft trimming, set to none. During mapping, reads are aligned as if trimmed, and bases are not removed from the reads. The following are valid trimmer names:
fixed-len: Fixed-length trimming
polyg: Poly-G trimming
quality: Quality trimming
adapter: Adapter trimming
n: Ambiguous base trimming
min-len: Minimum length trimming
cut-end: Maximum length trimming
bisulfite: Bisulfite trimming
Soft-trimming is enabled for the polyg filter by default (default: "polyg").
--trimming-only
Disables mapping and alignment to run read-trimming only.
--trim-min-length
Specify a minimum read length allowed after the trimmer execution. DRAGEN filters any reads with a length less than the value after all read-trimming steps are completed (default: 20).
--trim-min-len-read1
Specify a minimum read length allowed for read1 after the trimmer execution. DRAGEN filters any reads with a length of read1 less than the value after all read-trimming steps are completed (default: 20).
--trim-min-len-read2
Specify a minimum read length allowed for read2 after the trimmer execution. DRAGEN filters any reads with a length of read2 less than the value after all read-trimming steps are completed (default: 20).
--trim-filter-dummy-len
Specify the number of N bases in dummy reads that replace filtered reads (default: 10).
--trim-filter-set-flag
If enabled, dummy reads will have their 0x200 SAM flag set (default: true).
--trim-r1-5prime
Specify a fixed number of bases to trim from the 5' end of Read 1 (default: 0).
--trim-r1-3prime
Specify a fixed number of bases to trim from the 3' end of Read 1 (default: 0).
--trim-r2-5prime
Specify a fixed number of bases to trim from the 5' end of Read 2 (default: 0).
--trim-r2-3prime
Specify a fixed number of bases to trim from the 3' end of Read 2 (default: 0).
--trim-min-quality
Specify the minimum read quality. DRAGEN trims bases from the 3' end of reads with a quality below the value.
--trim-quality-r1-5prime
Specify the quality cutoff below which to trim from the 5' end of read 1.
--trim-quality-r1-3prime
Specify the quality cutoff below which to trim from the 3' end of read 1.
--trim-quality-r2-5prime
Specify the quality cutoff below which to trim from the 5' end of read 2.
--trim-quality-r2-3prime
Specify the quality cutoff below which to trim from the 3' end of read 2.
--trim-adapter-read1
Specify the FASTA file that contains adapter sequences to trim from the 3' end of Read 1.
--trim-adapter-read2
Specify the FASTA file that contains adapter sequences to trim from the 3' end of Read 2.
--trim-adapter-r1-5prime
Specify the FASTA file that contains adapter sequences to trim from the 5' end of Read 1. NB: the sequences should be in reverse order (with respect to their appearance in the FASTQ) but not complemented.
--trim-adapter-r2-5prime
Specify the FASTA file that contains adapter sequences to trim from the 5' end of Read 2. NB: the sequences should be in reverse order (with respect to their appearance in the FASTQ) but not complemented.
--trim-adapter-stringency
Specify the minimum number of adapter bases required for trimming (default: 4).
--trim-bisulfite-ends
Enable both 5-Prime and 3-Prime bisulfite trimming.
--trim-bisulfite-5prime
If a 3' adapter was trimmed, trim an additional 2bp from the 3' end, unless the 5' end matches 'CAA' or 'CGA'.
--trim-bisulfite-3prime
If the 5' end matches 'CAA' or 'CGA', trim the first two of these 5' bases.
--trim-min-r1-5prime
Specify the minimum number of bases to trim from the 5' end of Read 1 (default: 0).
--trim-min-r1-3prime
Specify the minimum number of bases to trim from the 3' end of Read 1 (default: 0).
--trim-min-r2-5prime
Specify the minimum number of bases to trim from the 5' end of Read 2 (default: 0).
--trim-min-r2-3prime
Specify the minimum number of bases to trim from the 3' end of Read 2 (default: 0).
--trim-max-length
Specify the maximum number of bases that can be trimmed from the sequences of both reads.
--trim-max-len-read1
Specify the maximum number of bases that can be trimmed from the sequences of read1.
--trim-max-len-read2
Specify the maximum number of bases that can be trimmed from the sequences of read2.
--trim-polya-min-trim
The minimum number of poly-As required for polya trimming (default: 20).
--trim-polyg-kmer-len
How many bases to check at each read end for poly-G artifact detection (default: 25).
--trim-polyg-kmer-non-g
The maximum number of non-G bases in the K-mer for poly-G artifact detection (default: 2).
--trim-polyg-g-score-r1-5prime
The score for G bases on the 5' end of read 1 (default: 0).
--trim-polyg-g-score-r1-3prime
The score for G bases on the 3' end of read 1 (default: 15).
--trim-polyg-g-score-r2-5prime
The score for G bases on the 5' end of read 2 (default: 0).
--trim-polyg-g-score-r2-3prime
The score for G bases on the 3' end of read 2 (default: 15).
--trim-polyg-min-trim-r1-5prime
The minimum number of G's to trim from the 5' end of read 1 (default: 6).
--trim-polyg-min-trim-r1-3prime
The minimum number of G's to trim from the 3' end of read 1 (default: 6).
--trim-polyg-min-trim-r2-5prime
The minimum number of G's to trim from the 5' end of read 2 (default: 6).
--trim-polyg-min-trim-r2-3prime
The minimum number of G's to trim from the 3' end of read 2 (default: 6).
--trim-polyg-early-exit-threshold
The signed score threshold for poly-G trimming to exit early (default: -500).
--trim-polyx-bases-r1-5prime
The bases to trim for polyX trimming from the 5' end of read 1 (default: empty string "" ).
--trim-polyx-bases-r1-3prime
The bases to trim for polyX trimming from the 3' end of read 1 (default: empty string "" ).
--trim-polyx-bases-r2-5prime
The bases to trim for polyX trimming from the 5' end of read 2 (default: empty string "" ).
--trim-polyx-bases-r2-3prime
The bases to trim for polyX trimming from the 3' end of read 2 (default: empty string "" ).
--trim-polyx-min-trim-r1-5prime
The minimum number of X's to trim from the 5' end of read 1 (default: 20).
--trim-polyx-min-trim-r1-3prime
The minimum number of X's to trim from the 3' end of read 1 (default: 20).
--trim-polyx-min-trim-r2-5prime
The minimum number of X's to trim from the 5' end of read 2 (default: 20).
--trim-polyx-min-trim-r2-3prime
The minimum number of X's to trim from the 3' end of read 2 (default: 20).
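The interaction between --trim-min-length, --trim-filter-dummy-len, and --trim-filter-set-flag described above can be sketched as below. This is a hedged illustration under stated assumptions: the function name and return shape are hypothetical, and only the documented defaults (20, 10, and true) and the 0x200 SAM flag come from the option descriptions.

```python
SAM_FLAG_QC_FAIL = 0x200  # SAM flag bit named in --trim-filter-set-flag


def apply_min_length_filter(trimmed_seq, min_len=20, dummy_len=10, set_flag=True):
    """Return (sequence, flag_bits) for one read after all trimming steps.

    Hypothetical helper: a read shorter than `min_len` is replaced by a dummy
    read made of `dummy_len` N bases, optionally marked with the 0x200 flag.
    """
    if len(trimmed_seq) >= min_len:
        return trimmed_seq, 0  # long enough, passes through unchanged
    flag = SAM_FLAG_QC_FAIL if set_flag else 0
    return "N" * dummy_len, flag


print(apply_min_length_filter("ACGT" * 6))  # 24 bases, kept
print(apply_min_length_filter("ACGT"))      # 4 bases, replaced by a dummy read
```

Replacing a filtered read with a dummy rather than dropping it keeps the read count and pairing of the output stable, which is presumably why the dummy length and flag are configurable.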
Before a reference genome can be used with DRAGEN, it must be converted from FASTA format into a custom binary format for use with the DRAGEN hardware. The options used in this preprocessing step offer tradeoffs between performance and mapping quality.
Pre-built DRAGEN reference genomes are available for download in the Illumina customer portal. If you find that performance and mapping quality with these are adequate, there is a good chance that you can simply work with these supplied reference genomes. Depending on your read lengths and other particular aspects of your application, you may be able to improve mapping quality and/or performance by tuning the reference preprocessing options.
The DRAGEN mapper extracts many overlapping seeds (subsequences or K-mers) from each read, and looks up those seeds in a hash table residing in memory on its PCIe card, to identify locations in the reference genome where the seeds match. Hash tables are ideal for extremely fast lookups of exact matches. The DRAGEN hash table must be constructed from a chosen reference genome using the --build-hash-table option, which extracts many overlapping seeds from the reference genome, populates them into records in the hash table, and saves the hash table as a binary file.
DRAGEN will attempt to detect the provided reference in order to automatically apply recommended resources and settings. There are four human references that DRAGEN can detect: hg38, hg19, hs37d5, and chm13v2. DRAGEN is able to detect references that contain a subset of the primary contigs from one of these references, as long as the names and lengths of the detected contigs are consistent with the names and lengths from the standard assemblies of these references.
In detail, automatic reference detection operates as follows:
We define a primary contig of a human genome to be an autosome (1-22) or sex chromosome (X,Y). Let F be the input fasta. For each reference genome R in hg38, hg19, hs37d5, and chm13v2, DRAGEN checks if there are any contigs in F that have the same name and length as a primary contig in R, and that there are no contigs in F that have the same name as a contig in R, but with different length. If these conditions hold for exactly one of hg38, hg19, hs37d5, and chm13v2, then that reference is detected and resources may be applied automatically.
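The detection rule above can be sketched as follows. This is only an illustration of the stated conditions: the function name, the dictionary layout, and the toy contig lengths are assumptions and do not reflect the real assemblies or DRAGEN's internal tables.

```python
def detect_reference(fasta_contigs, known_refs):
    """Return the single detected reference name, or None.

    fasta_contigs: {contig_name: length} for the input FASTA (F).
    known_refs: {ref_name: {"primary": {name: length}, "other": {name: length}}}.
    """
    matches = []
    for ref_name, ref in known_refs.items():
        all_contigs = {**ref["primary"], **ref["other"]}
        # Condition 1: some contig in F has the name and length of a primary contig in R.
        shares_primary = any(
            ref["primary"].get(name) == length
            for name, length in fasta_contigs.items()
        )
        # Condition 2: no contig in F reuses a name from R with a different length.
        conflicting = any(
            name in all_contigs and all_contigs[name] != length
            for name, length in fasta_contigs.items()
        )
        if shares_primary and not conflicting:
            matches.append(ref_name)
    # Detection succeeds only if exactly one reference satisfies both conditions.
    return matches[0] if len(matches) == 1 else None


# Toy stand-ins for the four supported assemblies (lengths are made up).
TOY_REFS = {
    "toyA": {"primary": {"chr1": 1000, "chr2": 900}, "other": {"chr1_alt": 50}},
    "toyB": {"primary": {"1": 1000, "2": 900}, "other": {}},
    "toyC": {"primary": {"chr1": 1100, "chr2": 900}, "other": {}},
}

print(detect_reference({"chr1": 1000}, TOY_REFS))  # only toyA is consistent
print(detect_reference({"chr2": 900}, TOY_REFS))   # ambiguous between toyA and toyC
```

The second call shows why a small subset of contigs can fail detection: when the shared contigs are identical in two assemblies, neither is selected.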
The DRAGEN hash table builder will automatically apply decoy contigs and mask bed files to a detected reference. Other pipelines may also apply automatic resources. For example, variant callers may apply machine learning models and target bed files.
In order for DRAGEN to correctly detect the provided reference, it is important to use the standard naming conventions for each of the four human assemblies that DRAGEN detects:
The size of the DRAGEN hash table is proportional to the number of seeds populated from the reference genome. The default is to populate a seed starting at every position in the reference genome, i.e., roughly 3 billion seeds from a human genome. This default requires at least 32 GB of memory on the DRAGEN PCIe board.
To operate on larger, nonhuman genomes or to reduce hash table congestion, it is possible to populate fewer than all reference seeds by using the --ht-ref-seed-interval option to specify an average reference interval. The default interval for 100% population is --ht-ref-seed-interval 1, and 50% population is specified with --ht-ref-seed-interval 2. The population interval does not need to be an integer. For example, --ht-ref-seed-interval 1.2 indicates 83.3% population, with mostly 1-base and some 2-base intervals to achieve a 1.2 base interval on average.
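The relationship between the seed interval and the population percentage can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the spacing pattern shown is just one way to obtain a fractional average interval, since the exact pattern DRAGEN uses is not described here.

```python
from fractions import Fraction


def population_percent(interval):
    """Percentage of reference positions that receive a seed."""
    return 100.0 / interval


def seed_start_positions(ref_len, interval):
    """One possible spacing: the k-th seed starts at floor(k * interval).

    Fraction avoids floating-point drift when the interval is not an integer.
    """
    step = Fraction(str(interval))
    positions = []
    k = 0
    while int(k * step) < ref_len:
        positions.append(int(k * step))
        k += 1
    return positions


print(round(population_percent(1.2), 1))  # 83.3
print(seed_start_positions(12, 1.2))      # [0, 1, 2, 3, 4, 6, 7, 8, 9, 10]
```

In the toy output, four 1-base steps are followed by one 2-base step, so 10 of 12 positions (83.3%) are populated.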
It is characteristic of hash tables that they are allocated a certain size, but always retain some empty records, so they are less than 100% occupied. A healthy amount of empty space is important for quick access to the DRAGEN hash table. Approximately 90% occupancy is a good upper bound. Empty space is important because records are pseudo-randomly placed in the hash table, resulting in an abnormally high number of records in some places. These congested regions can get quite large as the percentage of empty space approaches zero, and queries by the DRAGEN mapper for some seeds can become increasingly slow.
The hash table is populated with reference seeds of a single common length. This primary seed length is controlled with the --ht-seed-len option, which defaults to 21.
The longest primary seed supported is 27 bases when the table is 8 GB to 31.5 GB in size. Generally, longer seeds are better for run time performance, and shorter seeds are better for mapping quality (success rate and accuracy). A longer seed is more likely to be unique in the reference genome, facilitating fast mapping without needing to check many alternative locations. But a longer seed is also more likely to overlap a deviation from the reference (variant or sequencing error), which prevents successful mapping by an exact match of that seed (although another seed from the read may still map), and there are fewer long seed positions available in each read.
Longer seeds are more appropriate for longer reads, because there are more seed positions available to avoid deviations.
Seed Length Recommendations
Due to repetitive sequences, some seeds of any given length match many locations in the reference genome. DRAGEN uses a unique mechanism called seed extension to successfully map such high-frequency seeds. When the software determines that a primary seed occurs at many reference locations, it extends the seed by some number of bases at both ends, to some greater length that is more unique in the reference.
For example, a 21-base primary seed may be extended by 7 bases at each end to a 35-base extended seed. A 21-base primary seed may match 100 places in the reference. But 35-base extensions of these 100 seed positions may divide into 40 groups of 1-3 identical 35-base seeds. Iterative seed extensions are also supported, and are automatically generated when a large set of identical primary seeds contains various subsets that are best resolved by different extension lengths.
The maximum extended seed length, by default equal to the primary seed length plus 128, can be controlled with the --ht-max-ext-seed-len option. For example, for short reads, it is advisable to set the maximum extended seed shorter than the read length, because extensions longer than the whole read can never match.
It is also possible to tune how aggressively seeds are extended using the following options (advanced usage):
--ht-cost-coeff-seed-len
--ht-cost-coeff-seed-freq
--ht-cost-penalty
--ht-cost-penalty-incr
There is a tradeoff between extension length and hit frequency. Faster mapping can be achieved using longer seed extensions to reduce seed hit frequencies, or more accurate mapping can be achieved by avoiding seed extensions or keeping extensions short, while tolerating the higher hit frequencies that result. Shorter extensions can benefit mapping quality both by fitting seeds better between SNPs and by finding more candidate mapping locations at which to score alignments. The default extension settings, along with the default seed frequency settings, lean aggressively toward mapping accuracy, with relatively short seed extensions and high hit frequencies.
The defaults for the seed frequency options are as follows:
One primary or extended seed can match multiple places in the reference genome. All such matches are populated into the hash table, and retrieved when the DRAGEN mapper looks up a corresponding seed extracted from a read. The multiple reference positions are then considered and compared to generate aligned mapper output. However, the DRAGEN software enforces a limit on the number of matches, or frequency, of each seed, which is controlled with the --ht-max-seed-freq option. By default, the frequency limit is 16. In practice, when the software encounters a seed with higher frequency, it extends it to a secondary seed long enough that the frequency of any particular extended seed pattern falls within the limit. However, if a maximum seed extension would still exceed the limit, the seed is rejected, and not populated into the hash table. Instead, a single High Frequency record is populated.
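The frequency-limit behavior described above can be sketched on a toy string. This is an illustration only: the function names, the symmetric one-base-per-step extension, and the dropping of hits too close to the reference ends are assumptions, and DRAGEN's actual extension planning (cost coefficients, iterative extensions, target frequency) is not modeled. Only the defaults of 16 for the frequency limit and primary length plus 128 for the maximum extended length come from the text.

```python
from collections import defaultdict


def find_all(reference, seed):
    """All start positions where `seed` occurs exactly in `reference`."""
    return [i for i in range(len(reference) - len(seed) + 1)
            if reference.startswith(seed, i)]


def plan_seed(reference, seed, max_freq=16, max_ext_len=None):
    """Decide how a primary seed would be stored under a frequency limit."""
    k = len(seed)
    if max_ext_len is None:
        max_ext_len = k + 128
    hits = find_all(reference, seed)
    if len(hits) <= max_freq:
        return {"kind": "primary", "length": k, "groups": {seed: hits}}
    wing = 1  # bases added at each end of the primary seed
    while k + 2 * wing <= max_ext_len:
        groups = defaultdict(list)
        for pos in hits:
            lo, hi = pos - wing, pos + k + wing
            if lo >= 0 and hi <= len(reference):  # toy rule: skip hits at the edges
                groups[reference[lo:hi]].append(pos)
        if groups and max(len(g) for g in groups.values()) <= max_freq:
            return {"kind": "extended", "length": k + 2 * wing, "groups": dict(groups)}
        wing += 1
    # Even the longest allowed extension is too frequent: reject the seed.
    return {"kind": "high_frequency", "length": k, "groups": {}}


ref = "AACGTTCACGTGGACGTA"
print(plan_seed(ref, "ACGT", max_freq=2)["kind"])                   # extended
print(plan_seed("ACGT" * 10, "ACGT", max_freq=2, max_ext_len=8)["kind"])  # high_frequency
```

In the first call, the three identical 4-base hits become three distinct 6-base extended seeds. In the second, the reference is a pure repeat, so no permitted extension separates the hits and the seed is rejected.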
This seed frequency limit does not tend to impact DRAGEN mapping quality notably, for two reasons. First, because seeds are rejected only when extension fails, only extremely high-frequency primary seeds, typically with many thousands of matches, are rejected. Such seeds are not very useful for mapping. Second, there are other seed positions to check in a given read. If another seed position is unique enough to return one or more matches, the read can still be properly mapped. However, if all seed positions were rejected as high frequency, often this means that the entire read matches similarly well in many reference positions, so even if the read were mapped it would be an arbitrary choice, with very low or zero MAPQ.
Thus, the default frequency limit of 16 for --ht-max-seed-freq works well. However, it may be decreased or increased, up to a maximum of 256. A higher frequency limit tends to marginally increase the number of reads mapped (especially for short reads), but commonly the additional mapped reads have very low or zero MAPQ. This also tends to slow down DRAGEN mapping, because correspondingly large numbers of possible mappings are occasionally considered.
In addition to a frequency limit, a target seed frequency can be specified with the --ht-target-seed-freq option. This target frequency is used when extensions are generated for high frequency primary seeds. Extension lengths are chosen with a preference toward extended seed frequencies near the target. The default of 4 for --ht-target-seed-freq means that the software is biased toward generating shorter seed extensions than necessary to map seeds uniquely.
When building a reference hash table from a fasta with ALT contigs, it may be desired to mask certain regions of high similarity, or to establish liftover relationships between primary and alternate contigs. The recommended approach is masking, as described in the Map-Align section. When hg19 or hg38 alt contigs are detected, the hash table builder will require a liftover file or a bed file to mask the alt contigs. If none is provided, a mask bed file from <INSTALL_PATH>/fasta_mask/ will be used automatically.
DRAGEN has adopted a masked approach to handle native reference ALT contigs, where strategic regions are masked to increase accuracy. The hash table builder will build the mapper hash table as if the regions that were specified in the argument for ht-mask-bed were masked with N's. The hash table builder will only allow setting one of ht-mask-bed or ht-alt-liftover. Each line in the bed file is expected to contain a contig name, start position (0-based), and end position (1-based), separated by a single tab or space. Lines that start with # are ignored by the hash table builder to allow commenting. Any line with a contig name that is not found in the input fasta is skipped and logged to the DRAGEN log file. Likewise, lines that describe empty intervals are skipped. If all lines are skipped this way, the hash table builder will issue an error and abort, unless the mask bed file was automatically applied (see Automatic masking). The hash table builder will always issue an error and abort if an interval described in the BED file is outside of the range of the corresponding contig in the fasta. Lines that are not skipped are written to a file called mask.bed that will be present in the hash table output directory, and whose digest will appear in hash_table.cfg. This file is used when a reference is loaded to the FPGA card to dynamically mask reference.bin.
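The mask BED handling rules above can be sketched as a small validator. This is a hedged illustration: the function and exception names are hypothetical, blank lines are ignored as an assumption, and the order in which the out-of-range and empty-interval checks are applied is also an assumption.

```python
class MaskBedError(ValueError):
    """Raised where the text says the hash table builder errors and aborts."""


def filter_mask_bed(lines, contig_lengths, auto_applied=False):
    """Return (kept_intervals, skipped_lines) following the rules above."""
    kept, skipped = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # comments (and, by assumption, blank lines) are ignored
        contig, start, end = line.split()[:3]
        start, end = int(start), int(end)
        if contig not in contig_lengths:
            skipped.append(line)  # unknown contig: skipped (and logged in DRAGEN)
            continue
        if start < 0 or end > contig_lengths[contig]:
            raise MaskBedError(f"interval outside contig range: {line}")
        if end <= start:
            skipped.append(line)  # empty interval: skipped
            continue
        kept.append((contig, start, end))
    if not kept and not auto_applied:
        raise MaskBedError("all mask BED lines were skipped")
    return kept, skipped


bed = ["# example mask", "chr1\t0\t100", "chrUn 0 10", "chr1 5 5"]
print(filter_mask_bed(bed, {"chr1": 1000}))
```

The kept intervals correspond to what the text says is written to mask.bed in the hash table output directory.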
When running from a fasta for which hg38 or hg19 is detected (see Automatic Reference Detection), and no argument for ht-mask-bed or ht-alt-liftover was provided, the hash table builder will automatically apply the corresponding bed file for the detected reference from <INSTALL_PATH>/fasta_mask/. Note that the hash table builder will identify alt contigs by name. So when running from an input fasta that contains alt contigs with standard names but modified base content, it is recommended to suppress automatic masking by setting ht-suppress-mask=true or by passing a custom mask bed file to ht-mask-bed.
The behavior of DRAGEN with respect to the handling of decoy contigs in the reference has changed since version 2.6.
Starting with DRAGEN 3.x, DRAGEN's hash table builder automatically detects the absence of the decoy contigs from the reference and adds them to the FASTA file prior to building the hash table. The decoys file is found at <INSTALL_PATH>/liftover/hs_decoys.fa. If the reference is missing the decoy contigs, then the reads that map to the decoy contigs are artificially marked as unmapped in the output BAM (because the original reference does not have the decoy contigs). This results in an artificially lower mapping rate. However, the accuracy of variant calling is improved because false positives caused by decoy reads are removed.
Illumina recommends using this feature by default. However, you can set the --ht-suppress-decoys option to true to suppress adding these decoys to the hash table.
The table below describes the difference in behavior between older DRAGEN versions (2.6 and earlier) and DRAGEN 3.x versions with respect to the handling of decoy contigs in the hash table builder:
It is possible to build a custom pangenome reference in order to:
generate a population-specific-pangenome hash table from pangenome msVCF generated from the BSSH app.
generate a human or non-human pangenome hash table from customer-provided msVCF.
To enable the pangenome hash table builder, use a command of the following form:
dragen --build-hash-table true (required)
  --ht-graph-msvcf-file <path to a multi-sample VCF file> (required for pangenome reference)
  --ht-reference <reference.fasta> (required)
  --ht-graph-extra-kmer-bed <graph.bed> (optional)
  --ht-mask-bed <mask.bed> (optional)
  --ht-graph-exclusion-bed <exclusion bed> (optional)
  --output-directory <DIR> (required)
  [options]
The custom pangenome hash table builder tool uses a set of population variants provided by the user to generate a pangenome hash table. The variants must be specified in VCF format, in a single multi-sample VCF (msVCF) file containing the variants for a set of individuals. This multi-sample VCF file must have specific formatting described below.
The custom pangenome hash table builder tool only supports msVCF file input respecting the format described below:
msVCF compliant with 4.2 VCF format specification
with variants positionally sorted in the same contig order as the main FASTA reference genome provided in --ht-reference
records shall include diploid or haploid GT calls
supports multi-allelic variants, either merged into a single line or separated across multiple lines
with the following FILTER codes, non-PASS records are ignored:
##FILTER=<ID=PASS,Description="All filters passed">
with the following FORMAT field :
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
for better results, we recommend variants to be left-aligned.
the recommended maximum number of samples in the msVCF is 256. A higher number may lead to very high memory usage during hash table creation.
Note: INFO/FORMAT subfields must be defined in the header. Events with undefined subfields are ignored.
To build a high-performance custom genome, it is highly recommended to use long read sequencing data. We recommend using external tools such as WhatsHap (https://github.com/whatshap/whatshap) to generate phased input. DRAGEN analysis leverages the phasing information to reconstruct population haplotypes.
Note: the reference genome provided as input must be the same as the one used to generate the input phased msVCF. If the msVCF contains variants from regions not present in the fasta file, the pangenome reference builder will stop with an error.
A custom exclusion bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.
Note: records of the exclusion bed file provided must be from the same build as the reference genome used to build the pangenome reference.
An Extra-kmer-bed bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.
Note: records of the Extra-kmer-bed file provided must be from the same build as the reference genome used to build the graph reference.
A custom mask bed file can also be provided given the following format: tab delimited with first three columns being: contig name, start position, end position. Any line with a contig name that is not found in the input FASTA is skipped. Any lines that describe empty intervals are skipped.
Note: records of the mask bed file provided must be from the same build as the reference genome used to build the graph reference.
Note: The custom graph reference hash table end to end pipeline will return an error if options --ht-alt-liftover or --ht-allow-mask-and-liftover are specified.
The hash table builder generates the following outputs:
Use the --build-hash-table option to transform a reference FASTA into the hash table for DRAGEN mapping. It takes as input a FASTA file (multiple reference sequences being concatenated) and a preexisting output directory. Build command usage is as follows:
The --ht-reference and --output-directory options are required for building a hash table. The --ht-reference option specifies the path to the reference FASTA file, while --output-directory specifies a preexisting directory where the hash table output files are written. Illumina recommends organizing various hash table builds into different folders. As a best practice, folder names should include any nondefault parameter settings used to generate the contained hash table. The sequence names in the reference FASTA file must be unique.
While masking is the recommended approach to dealing with ALT contigs, DRAGEN also supports a liftover-based method. To enable liftover-based ALT-aware mapping in DRAGEN, build the hash table with a liftover file by using the --ht-alt-liftover option. The hash table builder classifies each reference sequence as primary or alternate based on the liftover file, and packs primaries before alternates in reference.bin. SAM liftover files for hg38DH and hg19 are in the <INSTALL_PATH>/liftover folder.
Custom Liftover Files
Custom liftover files can be used in place of those provided with DRAGEN. Liftover files must be SAM format, but no SAM header is required. SEQ and QUAL fields can be omitted ('*'). Each alignment record should have an alternate haplotype reference sequence name as QNAME, indicating the RNAME and POS of its liftover alignment in a destination (normally primary assembly) reference sequence.
Reverse-complemented alignments are indicated by bit 0x10 in FLAG. Records flagged unmapped (0x4) or secondary (0x100) are ignored. The CIGAR may include hard or soft clipping, leaving parts of the ALT contig unaligned.
A single reference sequence cannot serve as both an ALT contig (appearing in QNAME) and a liftover destination (appearing in RNAME). Multiple ALT contigs can align to the same primary assembly location. Multiple alignments can also be provided for a single ALT contig (extras can optionally be flagged 0x800 supplementary), such as to align one portion forward and another portion reverse-complemented. However, each base of the ALT contig only receives one liftover image, according to the first alignment record with an M CIGAR operation covering that base.
SAM records with QNAME missing from the reference genome are ignored, so that the same liftover file may be used for various reference subsets, but an error occurs if any alignment has its QNAME present but its RNAME absent.
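The record-level rules for custom liftover files can be sketched as a small loader. This is an illustrative sketch, not DRAGEN's parser: the function name and the returned dictionary keys are hypothetical, and CIGAR interpretation (which bases receive a liftover image) is deliberately left out.

```python
def load_liftover(sam_lines, reference_names):
    """Collect usable liftover records from headerless or headered SAM text."""
    records = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # a SAM header is optional, so header lines are simply skipped
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & (0x4 | 0x100):
            continue  # unmapped or secondary records are ignored
        if qname not in reference_names:
            continue  # ALT contig absent from this reference subset: ignored
        if rname not in reference_names:
            raise ValueError(f"liftover destination {rname} missing for {qname}")
        records.append({
            "alt": qname, "dest": rname, "pos": pos, "cigar": cigar,
            "reverse": bool(flag & 0x10),  # reverse-complemented alignment
        })
    # A sequence cannot be both an ALT contig and a liftover destination.
    both = {r["alt"] for r in records} & {r["dest"] for r in records}
    if both:
        raise ValueError(f"sequences used as both ALT and destination: {sorted(both)}")
    return records


example = [
    "chr1_alt\t0\tchr1\t101\t60\t50M\t*\t0\t0\t*\t*",
    "chr2_alt\t16\tchr2\t201\t60\t10S40M\t*\t0\t0\t*\t*",
    "chr3_alt\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
names = {"chr1", "chr2", "chr1_alt", "chr2_alt", "chr3_alt"}
for rec in load_liftover(example, names):
    print(rec["alt"], "->", rec["dest"], rec["pos"], "reverse" if rec["reverse"] else "forward")
```

The contig names and coordinates in the example are made up for illustration.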
The --ht-seed-len option specifies the initial length in nucleotides of seeds from the reference genome to populate into the hash table. At run time, the mapper extracts seeds of this same length from each read, and looks for exact matches (unless seed editing is enabled) in the hash table.
The maximum primary seed length is a function of hash table size. The limit is k=27 for table sizes from 16 GB to 64 GB, covering typical sizes for whole human genome, or k=26 for sizes from 4 GB to 16 GB.
The minimum primary seed length depends mainly on the reference genome size and complexity. It needs to be long enough to resolve most reference positions uniquely. For whole human genome references, hash table construction typically fails with k < 16. The lower bound may be smaller for shorter genomes, or higher for less complex (more repetitive) genomes. The uniqueness threshold of --ht-seed-len 16 for the 3.1 Gbp human genome can be understood intuitively because log4(3.1 G) ≈ 16, so it requires at least 16 choices from 4 nucleotides to distinguish 3.1 G reference positions.
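The log4 estimate above can be reproduced with a short calculation. The function name is illustrative; the calculation only counts how many 4-letter choices are needed to give each reference position a distinct label and ignores repeats, which is why real genomes may need longer seeds.

```python
import math


def uniqueness_threshold(genome_bases):
    """Smallest k such that 4**k is at least the number of reference positions."""
    return math.ceil(math.log(genome_bases, 4))


human = 3.1e9
print(round(math.log(human, 4), 2))  # about 15.76
print(uniqueness_threshold(human))   # 16
```

For comparison, 4**15 is about 1.07 billion and 4**16 is about 4.29 billion, so 16 is the first seed length with more possible patterns than human reference positions.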
For read mapping to succeed, at least one primary seed must match exactly (or with a single SNP when edited seeds are used). Shorter seeds are more likely to map successfully to the reference, because they are less likely to overlap variants or sequencing errors, and because more of them fit in each read. So for mapping accuracy, shorter seeds are mainly better.
However, very short seeds can sometimes reduce mapping accuracy. Very short seeds often map to multiple reference positions, and lead the mapper to consider more false mapping locations. Due to imperfect modeling of mutations and errors by Smith-Waterman alignment scoring and other heuristics, occasionally these noise matches may be reported. Run time quality filters such as --Aligner.aln_min_score can control the accuracy issues with very short seeds.
Shorter seeds tend to slow down mapping, because they map to more reference locations, resulting in more work such as Smith-Waterman alignments to determine the best result. This effect is most pronounced when the primary seed length approaches the reference genome's uniqueness threshold, e.g., K=16 for whole human genome.
Read Length---Generally, shorter seeds are appropriate for shorter reads, and longer seeds for longer reads. Within a short read, a few mismatch positions (variants or sequencing errors) can chop the read into only short segments matching the reference, so that only a short seed can fit between the differences and match the reference exactly. For example, in a 36 bp read, just one SNP in the middle can block seeds longer than 18 bp from matching the reference. By contrast, in a 250 bp read, it takes 15 SNPs to exceed a 0.01% chance of blocking even 27 bp seeds.
Paired Ends---The use of paired end reads can make longer seeds yield good mapping accuracy. DRAGEN uses paired end information to improve mapping accuracy, including with rescue scans that search the expected reference window when only one mate has seeds mapping to a given reference region. Thus, paired end reads have essentially twice the opportunity for an exact matching seed to find their correct alignments.
Variant or Error Rate---When read differences from the reference are more frequent, shorter seeds may be required to fit between the difference positions in a given read and match the reference exactly.
Mapping Percentage Requirement---If the application requires a high percentage of reads to be mapped somewhere (even at low MAPQ), short seeds may be helpful. Some reads that do not match the reference well anywhere are more likely to map using short seeds to find partial matches to the reference.
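The Read Length example above can be checked with a short calculation. The following is a minimal sketch, not DRAGEN code: the helper name longest_exact_segment is hypothetical, and it only computes the longest stretch of a read that contains no mismatch, which bounds the longest seed that can still match the reference exactly.

```python
def longest_exact_segment(read_len, mismatch_positions):
    """Length of the longest run of read bases that contains no mismatch.

    Positions are 0-based offsets into the read. A seed can only match the
    reference exactly if it fits entirely inside one such run.
    """
    longest = 0
    run_start = 0
    for pos in sorted(set(mismatch_positions)):
        longest = max(longest, pos - run_start)
        run_start = pos + 1
    return max(longest, read_len - run_start)


# 36 bp read with one SNP at 0-based offset 18 (the middle of the read):
# the flanking runs are 18 bp and 17 bp, so seeds longer than 18 bp cannot fit.
print(longest_exact_segment(36, [18]))  # 18
print(longest_exact_segment(250, []))   # 250
```

With more mismatches per read, the longest clean run shrinks, which is why shorter seeds suit shorter or more divergent reads.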
The --ht-max-ext-seed-len option limits the length of extended seeds populated into the hash table. Primary seeds (length specified by --ht-seed-len) that match many reference positions can be extended to achieve more unique matching, which may be required to map seeds within the maximum hit frequency (--ht-max-seed-freq).
Given a primary seed length k, the maximum seed length can be configured between k and k+128. The default is the upper bound, k+128.
The --ht-max-ext-seed-len option is recommended for short reads, eg, less than 50 bp. In such cases, it is helpful to limit seed extension to the read length minus a small margin, such as 1-4 bp. For example, with 36 bp reads, setting --ht-max-ext-seed-len to 35 might be appropriate. This ensures that the hash table builder does not plan a seed extension longer than the read, which would cause seed extension and mapping to fail at run time for seeds that could have fit within the read with shorter extensions.
While seed extension can be similarly limited for longer reads, eg, setting --ht-max-ext-seed-len to 99 for 100 bp reads, there is little utility in this because seeds are extended conservatively in any event. Even with the default k+128 limit, individual seeds are only extended to the lengths required to fit under the maximum hit frequency (--ht-max-seed-freq), and at most a few bases longer to approach the target hit frequency (--ht-target-seed-freq), or to avoid taking too many incremental extension steps.
The --ht-max-seed-freq option sets a firm limit on the number of seed hits (reference genome locations) that can be populated for any primary or extended seed. If a given primary seed maps to more reference positions than this limit, it must be extended long enough that the extended seeds subdivide into smaller groups of identical seeds under the limit. If, even at the maximum extended seed length (--ht-max-ext-seed-len), a group of identical reference seeds is larger than this limit, their reference positions are not populated into the hash table. Instead, a single High Frequency record is populated.
The maximum hit frequency can be configured from 1 to 256. However, if this value is too low, hash table construction can fail because too many seed extensions are needed. The practical minimum for a whole human genome reference, other options being default, is 8.
Generally, a higher maximum hit frequency leads to more successful mapping. There are two reasons for this. First, a higher limit rejects fewer reference positions that cannot map under it. Second, a higher limit allows seed extensions to be shorter, improving the odds of exact seed matching without overlapping variants or sequencing errors.
However, as with very short seeds, allowing high hit counts can sometimes hurt mapping accuracy. Most of the seed hits in a large group are not to the true mapping location, and occasionally one of these noise hits may be reported due to imperfect scoring models. Also, the mapper limits the total number of reference positions it considers, and allowing very high hit counts can potentially crowd out the actual best match from consideration.
Higher maximum hit frequencies slow down read mapping, because seed mapping finds more reference locations, resulting in more work, such as Smith-Waterman alignments, to determine the best result.
The DRAGEN Software enables the user to build a custom pangenome hash table from a set of population variants. The population variants are specified in a single multi-sample VCF file.
--ht-graph-msvcf-file: Input file containing list of population variants, in multi-sample VCF format.
This option replaces the following options, which were previously used to build a graph reference and are now deprecated:
--ht-pop-alt-contigs: Population based alternate contigs FASTA.
--ht-pop-alt-liftover: Liftover SAM file of population alternate contigs.
--ht-pop-snps: Population based SNPs VCF.
The following options control building hash tables from references with ALT-contigs. See References with ALT contigs for more information.
--ht-mask-bed: Set a custom BED file that defines which regions to mask. If not provided, the DRAGEN software automatically applies BED files for hg38 and hg19 from <INSTALL_PATH>/fasta_mask.
--ht-alt-liftover: Set a liftover file to build a liftover based ALT-aware hash table. SAM liftover files for hg38DH and hg19 are provided in <INSTALL_PATH>/liftover.
--ht-allow-mask-and-liftover: Allow the use of both --ht-mask-bed and --ht-alt-liftover together.
--ht-suppress-mask: Suppress automatic detection of the default mask BED files when building the hash table.
--ht-decoys
The DRAGEN software automatically detects the use of hg19 and hg38 references and adds decoys to the hash table when they are not found in the FASTA file. Use the --ht-decoys option to specify the path to a decoys file. The default is <INSTALL_PATH>/liftover/hs_decoys.fa.
--ht-suppress-decoys: Suppress automatic detection of the default decoys file when building the hash table.
--ht-num-threads
The --ht-num-threads option determines the maximum number of worker CPU threads that are used to speed up hash table construction. The default for this option is 8, with a maximum of 32 threads allowed. If your server supports execution of more threads, it is recommended that you use the maximum. For example, the DRAGEN servers contain 24 cores that have hyperthreading enabled, so a value of 32 should be used. When using a higher value, adjust --ht-max-table-chunks as well. The servers have 128 GB of memory available.
--ht-max-table-chunks
The --ht-max-table-chunks option controls the memory footprint during hash table construction by limiting the number of ~1 GB hash table chunks that reside in memory simultaneously. Each additional chunk consumes roughly twice its size (~2 GB) in system memory during construction. The hash table is divided into a power-of-two number of independent chunks, of a fixed chunk size, X, which depends on the hash table size, in the range 0.5 GB < X ≤ 1 GB. For example, a 24 GB hash table contains 32 independent 0.75 GB chunks that can be constructed by parallel threads with enough memory and a 16 GB hash table contains 16 independent 1 GB chunks. The default is --ht-max-table-chunks equal to --ht-num-threads, but with a minimum default --ht-max-table-chunks of 8. It makes sense to have these two options match, because building one hash table chunk requires one chunk space in memory and one thread to work on it. Nevertheless, there are build-speed advantages to raising --ht-max-table-chunks higher than --ht-num-threads, or to raising --ht-num-threads higher than --ht-max-table-chunks.
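The chunk arithmetic described for --ht-max-table-chunks can be reproduced with a small calculation. This is a sketch under stated assumptions, not the DRAGEN implementation: the function names are hypothetical, the chunk count is assumed to be the smallest power of two that keeps each chunk at or below 1 GB, and the memory estimate simply applies the "roughly twice its size" figure to every chunk held in memory.

```python
def hash_table_chunks(table_size_gb):
    """Return (chunk_count, chunk_size_gb) for a hash table of the given size.

    Assumption: the chunk count is the smallest power of two for which the
    chunk size is at most 1 GB, giving 0.5 GB < chunk size <= 1 GB for any
    table larger than 0.5 GB.
    """
    chunk_count = 1
    while table_size_gb / chunk_count > 1.0:
        chunk_count *= 2
    return chunk_count, table_size_gb / chunk_count


def approx_build_memory_gb(max_table_chunks, chunk_size_gb):
    """Rough system memory used by the chunks resident during construction.

    Assumption: every resident chunk costs about twice its own size.
    """
    return 2 * max_table_chunks * chunk_size_gb


print(hash_table_chunks(24))  # (32, 0.75)
print(hash_table_chunks(16))  # (16, 1.0)

# With --ht-max-table-chunks 32 and 0.75 GB chunks, the estimate is about 48 GB.
print(approx_build_memory_gb(32, 0.75))  # 48.0
```

An estimate like this can help decide whether a higher --ht-max-table-chunks value fits within the memory available on the build host.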
--ht-mem-limit
Memory Limit. The --ht-mem-limit option controls the generated hash table size by specifying the DRAGEN card memory available for both the hash table and the encoded reference genome. The --ht-mem-limit option defaults to 32 GB when the reference genome approaches whole human genome (WHG) size, or to a generous size for smaller references. Normally there is little reason to override these defaults.
--ht-size
Hash Table Size. This option specifies the hash table size to generate, rather than calculating an appropriate table size from the reference genome size and the available memory (option --ht-mem-limit). Using default table sizing is recommended and using --ht-mem-limit is the next best choice.
--ht-ref-seed-interval
Seed Interval. The --ht-ref-seed-interval option defines the step size between positions of seeds in the reference genome populated into the hash table. An interval of 1 (default) means that every seed position is populated, 2 means 50% of positions are populated, etc. Noninteger values are supported, eg, 2.5 yields 40% populated. Seeds from a whole human reference are easily 100% populated with 32 GB memory on DRAGEN boards. If a substantially larger reference genome is used, change this option.
--ht-soft-seed-freq-cap and --ht-max-dec-factor
Soft Frequency Cap and Maximum Decimation Factor for Seed Thinning. Seed thinning is an experimental technique to improve mapping performance in high-frequency regions. When primary seeds have higher frequency than the cap indicated by the --ht-soft-seed-freq-cap option, only a fraction of seed positions are populated to stay under the cap. The --ht-max-dec-factor option specifies a maximum factor by which seeds can be thinned. For example, --ht-max-dec-factor 3 retains at least 1/3 of the original seeds. --ht-max-dec-factor 1 disables any thinning. Seeds are decimated in careful patterns to prevent leaving any long gaps unpopulated. The idea is that seed thinning can achieve mapped seed coverage in high frequency reference regions where the maximum hit frequency would otherwise have been exceeded. Seed thinning can also keep seed extensions shorter, which is also good for successful mapping. Based on testing to date, seed thinning has not proven to be superior to other accuracy optimization methods.
--ht-rand-hit-hifreq and --ht-rand-hit-extend
Random Sample Hit with HIFREQ Record and EXTEND Record. Whenever a HIFREQ or EXTEND record is populated into the hash table, it stands in place of a large set of reference hits for a certain seed. Optionally, the hash table builder can choose a random representative of that set, and populate that HIT record alongside the HIFREQ or EXTEND record. Random sample hits provide alternative alignments that are very useful in estimating MAPQ accurately for the alignments that are reported. They are never used outside of this context for reporting alignment positions, because that would result in biased coverage of locations that happened to be selected during hash table construction. To include a sample hit, set --ht-rand-hit-hifreq to 1. The --ht-rand-hit-extend option is a minimum pre-extension hit count to include a sample hit, or zero to disable. Modifying these options is not recommended.
DRAGEN seed extension is dynamic, applied as needed for particular K-mers that map to too many reference locations. Seeds are incrementally extended in steps of 2--14 bases (always even) from a primary seed length to a fully extended length. The bases are appended symmetrically in each extension step, determining the next extension increment if any.
There is a potentially complex seed extension tree associated with each high frequency primary seed. Each full tree is generated during hash table construction and a path from the root is traced by iterative extension steps during seed mapping. The hash table builder employs a dynamic programming algorithm to search the space of all possible seed extension trees for an optimal one, using a cost function that balances mapping accuracy and speed. The following options define that cost function:
--ht-target-seed-freq
Target Hit Frequency. The --ht-target-seed-freq option defines the ideal number of hits per seed for which seed extension should aim. Higher values lead to fewer and shorter final seed extensions, because shorter seeds tend to match more reference positions.
--ht-cost-coeff-seed-len
Cost Coefficient for Seed Length. The --ht-cost-coeff-seed-len option assigns the cost component for each base by which a seed is extended. Additional bases are considered a cost because longer seeds risk overlapping variants or sequencing errors and losing their correct mappings. Higher values lead to shorter final seed extensions.
--ht-cost-coeff-seed-freq
Cost Coefficient for Hit Frequency. The --ht-cost-coeff-seed-freq option assigns the cost component for the difference between the target hit frequency and the number of hits populated for a single seed. Higher values result primarily in high-frequency seeds being extended further to bring their frequencies down toward the target.
--ht-cost-penalty
Cost Penalty for Seed Extension. The --ht-cost-penalty option assigns a flat cost for extending beyond the primary seed length. A higher value results in fewer seeds being extended at all. Current testing shows that zero (0) is appropriate for this parameter.
--ht-cost-penalty-incr
Cost Increment for Extension Step. The --ht-cost-penalty-incr option assigns a recurring cost for each incremental seed extension step taken from primary to final extended seed length. More steps are considered a higher cost because extending in many small steps requires more hash table space for intermediate EXTEND records, and takes substantially more run time to execute the extensions. A higher value results in seed extension trees with fewer nodes, reaching from the root primary seed length to leaf extended seed lengths in fewer, larger steps.
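To make the roles of the five cost options easier to compare, the following sketch combines them into a single per-seed cost. This is purely illustrative: the exact DRAGEN cost function is not given here, so the linear form, the use of an absolute difference for the frequency term, the class and function names, and the numeric values in the example are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ExtensionCostParams:
    target_seed_freq: float      # --ht-target-seed-freq
    cost_coeff_seed_len: float   # --ht-cost-coeff-seed-len
    cost_coeff_seed_freq: float  # --ht-cost-coeff-seed-freq
    cost_penalty: float          # --ht-cost-penalty
    cost_penalty_incr: float     # --ht-cost-penalty-incr


def extension_cost(params, extra_bases, hits, steps):
    """Hypothetical cost of one extended seed (assumed linear combination).

    extra_bases: bases added beyond the primary seed length
    hits:        reference hits populated for the seed
    steps:       incremental extension steps taken to reach this length
    """
    cost = params.cost_coeff_seed_len * extra_bases
    cost += params.cost_coeff_seed_freq * abs(hits - params.target_seed_freq)
    if extra_bases > 0:
        cost += params.cost_penalty  # flat cost for extending at all
    cost += params.cost_penalty_incr * steps
    return cost


# Illustrative values only; these are not documented defaults.
params = ExtensionCostParams(4, 1.0, 0.5, 0.0, 0.7)
one_big_step = extension_cost(params, extra_bases=8, hits=6, steps=1)
four_small_steps = extension_cost(params, extra_bases=8, hits=6, steps=4)
print(one_big_step < four_small_steps)  # True
```

Under these assumptions, raising the per-step increment favors trees that reach the same extended length in fewer, larger steps, which matches the behavior described for --ht-cost-penalty-incr.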
When building a hash table, DRAGEN configures the options for DNA analysis by default. To run RNA-Seq data, you must build an RNA-Seq hash table by setting --ht-build-rna-hashtable to true. If running RNA-Seq alignment, use the original --output-directory instead of the automatically generated subdirectory.
If using the CNV pipeline, set --ht-build-cnv-hashtable to true. The command generates an additional Kmer hash map that is used in the CNV algorithm. Illumina recommends always using the --ht-build-cnv-hashtable option, so you can perform CNV calling with the same hash table used for mapping and aligning.
To run the methylation pipeline, you must build a methylation-specific hash table. DRAGEN can build a single-pass or legacy multi-pass methylation hash table. Methylation runs that use a single-pass hash table complete faster than runs that use the legacy multi-pass hash tables. Single-pass hash tables are recommended for building methylation tables and running analyses.
The following is an example of a single-pass hash table build. The example generates a combined hash table in your reference index folder under the methyl_converted subdirectory.
dragen --build-hash-table true \
  --output-directory $REFDIR \
  --ht-reference $FASTA \
  --ht-num-threads 40 \
  --ht-methylated-combined=true \
  --ht-seed-len 27
Multi-pass methylation mapping requires building two special hash tables with reference bases converted from C to T in one table and G to A in the other table. The conversions are performed automatically when using the --ht-methylated command line option. The converted hash tables are generated in two subdirectories under the folder specified using the --output-directory command line option. The subdirectories are named CT_converted and GA_converted, corresponding with the base conversions. When using the hash tables for methylated alignment runs, make sure to refer to the --output-directory folder, not the subdirectories.
The base conversions remove a significant amount of information from the hash tables. You might need to use different hash table parameters than in a conventional hash table build. The following options are recommended for building hash tables for mammalian species.
dragen --build-hash-table=true --output-directory $REFDIR --ht-reference $FASTA --ht-max-seed-freq 16 --ht-seed-len 27 --ht-num-threads 40 --ht-methylated=true
To run the HLA caller, an HLA-specific anchored reference hash table must be built. Set --ht-build-hla-hashtable to true. The command creates an anchored_hla subdirectory inside the --output-directory folder. The HLA-specific reference subdirectory can be built at the same time as the primary reference construction.
The map/align system produces a BAM file sorted by reference sequence and position by default. Creating this BAM file typically eliminates the requirement to run samtools sort or any equivalent postprocessing command. The --enable-sort option can be used to enable or disable creation of the sorted BAM file, as follows:
To enable, set to true.
To disable, set to false.
On the reference hardware system, running with sort enabled increases run time for a 30x full genome by about 6--7 minutes.
Marking or removing duplicate aligned reads is a common best practice in whole-genome sequencing. Not doing so can bias variant calling and lead to incorrect results.
The DRAGEN system can mark or remove duplicate reads, and produces a BAM file with duplicates marked in the FLAG field, or with duplicates entirely removed.
In testing, enabling duplicate marking adds minimal run time over and above the time required to produce the sorted BAM file. The additional time is approximately 1--2 minutes for a 30x whole human genome, which is a huge improvement over the long run times of open source tools.
The DRAGEN duplicate-marking algorithm is modeled on the Picard toolkit's MarkDuplicates feature. All the aligned reads are grouped into subsets in which all the members of each subset are potential duplicates.
For two pairs to be duplicates, they must have the following:
Identical alignment coordinates (position adjusted for soft- or hard-clips from the CIGAR) at both ends.
Identical orientations (direction of the two ends, with the left-most coordinate being first).
In addition, an unpaired read may be marked as a duplicate if it has identical coordinate and orientation with either end of any other read, whether paired or not.
Unmapped read pairs are never marked as duplicates.
When DRAGEN has identified a group of duplicates, it picks one as the best of the group, and marks the others with the BAM duplicate flag (0x400, or decimal 1024). For this comparison, duplicates are scored based on the average sequence Phred quality. Pairs receive the sum of the scores of both ends, while unpaired reads get the score of the one mapped end. The idea of this score is to try, all other things being equal, to preserve the reads with the highest-quality base calls.
If two reads (or pairs) have exactly matching quality scores, DRAGEN breaks the tie by choosing the pair with the higher alignment score. If there are multiple pairs that also tie on this attribute, then DRAGEN chooses a winner arbitrarily.
The score for an unpaired read R is the average Phred quality score per base, calculated as follows:
Where R is a BAM record, QUAL is its array of Phred quality scores, and dedup-min-qual is a DRAGEN configuration option with default value of 15. For a pair, the score is the sum of the scores for the two ends.
This score is stored as a one-byte number, with values rounded down to the nearest one-quarter. This rounding may lead to different duplicate marks from those chosen by Picard, but because the reads were very close in quality this has negligible impact on variant calling results.
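The scoring rules above can be sketched in a few lines. This is a hedged illustration rather than the DRAGEN implementation: the function names are hypothetical, the average is assumed to be taken over all bases of the read with bases below dedup-min-qual contributing zero, and the quarter-point rounding is assumed to apply to the final score.

```python
import math


def read_dedup_score(quals, dedup_min_qual=15):
    """Average Phred quality per base, ignoring bases below dedup_min_qual.

    Assumption: excluded bases add nothing to the sum, but the divisor is
    still the full read length.
    """
    if not quals:
        return 0.0
    kept = sum(q for q in quals if q >= dedup_min_qual)
    return kept / len(quals)


def pair_dedup_score(quals_end1, quals_end2, dedup_min_qual=15):
    """Score for a pair: the sum of the scores of both ends."""
    return (read_dedup_score(quals_end1, dedup_min_qual)
            + read_dedup_score(quals_end2, dedup_min_qual))


def quantize_quarter(score):
    """Round a score down to the nearest one-quarter."""
    return math.floor(score * 4) / 4


print(read_dedup_score([30, 30, 10, 40]))                 # 25.0
print(quantize_quarter(read_dedup_score([30, 31, 31])))   # 30.5
```

Because scores are rounded down to quarters, two reads whose averages differ by less than 0.25 can tie, which is where the alignment-score tie-break applies.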
The limitations to DRAGEN duplicate marking implementation are as follows:
When there are two duplicate reads or pairs with very close Phred sequence quality scores, DRAGEN might choose a different winner from that chosen by Picard. These differences have negligible impact on variant calling results.
If using a single FASTQ file as input, DRAGEN accepts only a single library ID as a command-line argument (RGLB). For this reason, the FASTQ inputs to the system must be already separated by library ID. Library ID cannot be used as a criterion for distinguishing non-duplicates.
DRAGEN does not distinguish between optical and PCR duplicates.
The following options can be used to configure duplicate marking in DRAGEN:
--enable-duplicate-marking
Set to true to enable duplicate marking. When --enable-duplicate-marking is enabled, the output is sorted, regardless of the value of the --enable-sort option.
--remove-duplicates
Set to true to suppress the output of duplicate records. If set to false, duplicate records are output with the 0x400 flag set in the FLAG field. When --remove-duplicates is enabled, --enable-duplicate-marking is forced to enabled as well.
--dedup-min-qual
Specifies the Phred quality score below which a base should be excluded from the quality score calculation used for choosing among duplicate reads.
Regions of homozygosity (ROH) are detected as part of the small variant caller. The caller detects and outputs the runs of homozygosity from whole genome calls on autosomal human chromosomes. Sex chromosomes are ignored unless the sample sex karyotype is XX, as specified on the command line or determined by the Ploidy Estimator. ROH output allows downstream tools to screen for and predict consanguinity between the parents of the proband subject.
A region is defined as consecutive variant calls on the chromosome with no large gap in between these variants. In other words, regions are broken by chromosome or by large gaps with no SNV calls. The gap size is set to 3 Mbases.
ROH Algorithm
The ROH algorithm runs on the small variant calls. The algorithm excludes variants with multiallelic sites, indels, complex variants, non-PASS filtered calls, and homozygous reference sites. The variant calls are then filtered further using a block list BED, and finally depth filtering is applied after the block list filter. The default value for the fraction of filtered calls is 0.2, which filters the calls with the highest 10% and lowest 10% in DP values. The algorithm then uses the resulting calls to find regions.
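As an illustration of the depth filter, the sketch below drops the calls with the lowest and highest DP values so that the stated fraction is removed in total. It is not the DRAGEN implementation: the function name, the (position, DP) tuple layout, the rounding of the trim count, and the handling of ties are assumptions.

```python
def depth_filter(calls, filtered_fraction=0.2):
    """Drop the lowest and highest DP calls, half of the fraction at each end.

    calls: list of (position, dp) tuples in genomic order.
    Returns the surviving calls, still in genomic order.
    """
    n_trim = int(len(calls) * filtered_fraction / 2)
    by_depth = sorted(range(len(calls)), key=lambda i: calls[i][1])
    dropped = set(by_depth[:n_trim]) | set(by_depth[len(calls) - n_trim:])
    return [call for i, call in enumerate(calls) if i not in dropped]


calls = [(100, 30), (200, 5), (300, 31), (400, 90), (500, 29),
         (600, 28), (700, 32), (800, 27), (900, 33), (1000, 30)]
kept = depth_filter(calls)
print(len(kept))                  # 8
print((200, 5) in kept)           # False
print((400, 90) in kept)          # False
```

With ten calls and the default fraction of 0.2, one call is removed at each end of the DP distribution.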
The ROH algorithm first finds seed regions that contain at least 50 consecutive homozygous SNV calls with no heterozygous SNV or gaps of 500,000 bases between the variants. The regions can be extended using a scoring system that functions as follows.
The score increases by 0.025 with every additional homozygous variant and decreases by a large penalty (1 - 0.025 = 0.975) for every heterozygous SNV. This provides some tolerance for the presence of heterozygous SNVs in the region.
Each region expands on both ends until the regions reach the end of a chromosome, a gap of 500,000 bases between SNVs occurs, or the score becomes too low (0).
Overlapping regions are merged into a single region. Regions can be merged across gaps of 500,000 bases between SNVs if a single region would have been called from the beginning of the first region to the end of the second region without the gap. There is no maximum size for regions, but regions always end at chromosome boundaries.
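The scoring scheme can be illustrated with a short sketch. The reward and penalty values come from the description above; everything else is an assumption made for illustration, including the function names, the one-directional extension, and the choice to stop just before the call that would bring the score to zero or below. Gap and chromosome-boundary checks are omitted.

```python
HOM_REWARD = 0.025
HET_PENALTY = 1 - HOM_REWARD  # 0.975


def roh_score(n_hom, n_het):
    """Score of a region from its homozygous and heterozygous SNV counts."""
    return n_hom * HOM_REWARD - n_het * HET_PENALTY


def extend(score, calls):
    """Extend a region in one direction through successive SNV calls.

    calls: iterable of booleans, True for a heterozygous SNV, False for a
    homozygous one. Returns (number of calls absorbed, resulting score).
    """
    taken = 0
    for is_het in calls:
        new_score = score - HET_PENALTY if is_het else score + HOM_REWARD
        if new_score <= 0:
            break
        score = new_score
        taken += 1
    return taken, score


seed_score = roh_score(50, 0)      # minimum seed region: 50 homozygous SNVs
print(round(seed_score, 3))        # 1.25

# The seed tolerates one heterozygous SNV but not two in a row.
taken, score = extend(seed_score, [True, True])
print(taken, round(score, 3))      # 1 0.275
```

Since 0.975 / 0.025 = 39, a region needs 39 further homozygous SNVs to recover the score lost to a single heterozygous SNV.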
ROH Options
--vc-enable-roh
Set to true to enable the ROH caller. The ROH caller is enabled by default for human autosomes only. Set to false to disable.
--vc-roh-blacklist-bed
If provided, the ROH caller ignores variants that are contained in any region in the block list BED file. DRAGEN distributes block list files for all popular human genomes and automatically selects a block list to match the genome in use, unless this option is used to select a file.
ROH Output
The ROH caller produces an ROH output file named <output-file-prefix>.roh.bed in which each row represents one region of homozygosity. The BED file contains the following columns:
Chromosome Start End Score #Homozygous #Heterozygous
Score is a function of the number of homozygous and heterozygous variants, where each homozygous variant increases the score by 0.025, and each heterozygous variant reduces the score by 0.975.
Start and end positions are a 0-based, half-open interval.
#Homozygous is the number of homozygous variants in the region.
#Heterozygous is the number of heterozygous variants in the region.
The caller also produces a metrics file named <output-file-prefix>.roh_metrics.csv that lists the number of large ROH and percentage of SNPs in large ROH (>3 MB).
The table below demonstrates how the PLINK options can be tuned to behave similarly to the DRAGEN ROH caller default settings (see column DRAGEN default). We observed that PLINK ROH calls (see column PLINK default) in default settings are more conservative compared to DRAGEN default settings. By default, PLINK reports ROH regions of size 1MB or larger (see PLINK option --homozyg-kb ) with at least 100 homozygous SNPs (see PLINK option --homozyg-snp) while DRAGEN ROH caller reports smaller regions with at least 50 homozygous SNPs (see DRAGEN ROH Algorithm section). In addition, PLINK by default allows for only 1 heterozygous SNP per scanning window (specified by PLINK option --homozyg-window-het) while DRAGEN uses a soft score threshold penalty without setting an upper bound on the allowed number of heterozygous SNPs (see DRAGEN ROH Algorithm section). The PLINK ROH calls are largely comparable to the DRAGEN ROH calls after relaxing the default PLINK settings, shown in column PLINK tuned. Prior to PLINK ROH calling, the input DRAGEN hard-filtered VCF files are filtered as per the instructions in DRAGEN ROH Algorithm section.
The DRAGEN Small Variant Caller is a high-speed haplotype caller implemented with a hybrid of hardware and software. The caller performs localized de novo assembly in regions of interest to generate candidate haplotypes, and then performs read likelihood calculations using a hidden Markov model (HMM).
Variant calling is disabled by default. To enable variant calling, set the --enable-variant-caller option to true. The VCF header is annotated with ##source=DRAGEN_SNV to indicate the file is generated by the DRAGEN SNV pipeline.
The DRAGEN Small Variant Caller performs the following steps:
Active Region Identification---Identifies areas where multiple reads disagree with the reference, and selects windows around them (active regions) for processing.
Localized Haplotype Assembly--- Assembles all overlapping reads in each active region into a de Bruijn graph (DBG). A DBG is a directed graph based on overlapping K-mers (length K subsequences) in each read or multiple reads. When all reads are identical, the DBG is linear. Where there are differences, the graph forms bubbles of multiple paths that diverge and rejoin. If the local sequence is too repetitive and K is too small, cycles can form, which invalidate the graph. DRAGEN uses K=10 and 25 as the default values. If those values produce an invalid graph, then additional values of K=35, 45, 55, 65 are tried until a cycle-free graph is obtained. From this cycle-free DBG, DRAGEN extracts every possible path to produce a complete list of candidate haplotypes, ie, hypotheses for what the true DNA sequence might be on at least one strand. In addition to graph assembly, haplotypes are also generated via columnwise detection, with candidate variant events identified directly from BAM alignments. Columnwise detection is enabled by default in all small variant calling pipelines and is supplementary to the DBG, but is especially useful in highly repetitive regions where DBG assembly of reads is more likely to fail.
Haplotype Alignment---Uses the Smith-Waterman algorithm to align each extracted haplotype to the reference genome. The alignments determine what variations from the reference are present.
Read Likelihood Calculation---Tests each read against each haplotype to estimate a probability of observing the read assuming the haplotype was the true original DNA sampled. This calculation is performed by evaluating a pair hidden Markov model (HMM), which accounts for the various possible ways the haplotype might have been modified by PCR or sequencing errors into the read observed. The HMM evaluation uses a dynamic programming method to calculate the total probability of any series of Markov state transitions arriving at the observed read.
Genotyping---Forms the possible diploid combinations of variant events from the candidate haplotypes and, for each combination, calculates the conditional probability of observing the entire read pileup. Calculations use the constituent probabilities of observing each read, given each haplotype from the pair HMM evaluation. These calculations feed into the Bayesian formula to calculate the likelihood that each genotype is the genotype of the sample being analyzed, given the entire read pileup observed. Genotypes with maximum likelihood are reported.
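The Localized Haplotype Assembly step can be illustrated with a toy de Bruijn graph. The sketch below is not the DRAGEN assembler: the function names are hypothetical, nodes are plain K-mers with edges between consecutive K-mers of a read, and the K values are tried one after another until the graph has no cycle, which is a simplification of the behavior described above. Short K values are used in the example so that the reads can stay short.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Directed graph whose nodes are K-mers; edges link consecutive K-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def has_cycle(graph):
    """Iterative depth-first search with three-state node marking."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    for root in list(graph):
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GRAY:
                    return True  # back edge: the graph contains a cycle
                if color[nxt] == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                color[node] = BLACK
                stack.pop()
    return False


def first_acyclic_k(reads, k_values=(10, 25, 35, 45, 55, 65)):
    """Return the first K (and its graph) that yields a cycle-free graph."""
    for k in k_values:
        graph = build_dbg(reads, k)
        if graph and not has_cycle(graph):
            return k, graph
    return None, None


reads = ["ACGTACGA"]  # "ACG" repeats, so K=3 produces a cycle
k, graph = first_acyclic_k(reads, k_values=(3, 4, 5))
print(k)  # 4
```

A larger K resolves the repeat because the repeated 3-mer is no longer a whole node, which is the reason for retrying with longer K-mers in repetitive regions.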
In most pipelines, DRAGEN reports two types of depth counts, both of which may differ from the information in the BAM pileup due to various filtering steps that are applied throughout variant calling. Briefly:
Unfiltered depth is the number of reads covering the position, downstream of any read collapsing or deduplication that may have preceded the variant calling step, but upstream of most read filtering and overlapping mate handling. Unfiltered depth is reported as INFO/DP, except in the case of gVCF homref calls, where it is reported as FORMAT/DP.
Informative depth is the number of reads actually used to make the calling decision, where filtered reads and uninformative reads (reads that could not be assigned to a specific allele) have been excluded, and overlapping mate pairs are counted as single reads. When overlapping mate pairs are present, this may cause an apparent discrepancy between the reported depth and the pileup as viewed in a browser such as IGV. To resolve this, use the "View as pairs" option in IGV. Informative depth is reported as FORMAT/DP, except in the case of gVCF homref calls, where it is not reported. The FORMAT/AD and FORMAT/AF fields are based on informative depth.
The following figure summarizes the different filtering steps in more detail.
Filter 1 acts on the reads present in the BAM input (in UMI pipelines, these are the collapsed reads produced by the read collapsing step, not the raw reads) and filters out the following reads:
Duplicate reads.
Soft-clipped bases. DRAGEN filters out soft-clipped bases only when calculating coverage reports.
[Somatic] Reads with MAPQ=0.
[Somatic] Reads with MAPQ < vc-min-tumor-read-qual, where vc-min-tumor-read-qual >1.
Filter 2 trims bases with BQ < 10 and filters out the following reads:
Unmapped reads.
Secondary reads.
Reads with bad CIGARs.
Filter 3 occurs after downsampling and HMM. Filter 3 filters out the following reads:
Reads that are badly mated. A badly mated read is a read where the pair is mapped to two different reference contigs.
Disqualified reads. Reads are disqualified if their HMM score is below a threshold.
Filter 4 occurs after the genotyper runs. The genotyper adds annotation information to the FORMAT field. Filter 4 filters out reads that are not informative. For example, if the HMM scores of the read against two different haplotypes are almost equal, the read is filtered out because it does not provide enough information to distinguish which of the two haplotypes is more likely.
Since DRAGEN 4.3, the mosaic small variant caller runs downstream of the germline small variant caller. Non-cancer post-zygotic mosaic variants with typical AF lower than 50% detected by the mosaic caller are reported in the output VCF file with a MOSAIC INFO flag. By default, MOSAIC tagged variants with AF smaller than 20% are filtered with the MosaicLowAF filter.
The following options control the variant caller stage of the DRAGEN host software.
--enable-variant-caller
Set --enable-variant-caller to true to enable the variant caller stage for the DRAGEN pipeline.
--vc-target-bed
[Optional] Restricts processing of the small variant caller, target BED related coverage, and callability metrics to regions specified in a BED file. The BED file is a text file containing at least three tab-delimited columns. The first three columns are chromosome, start position, and end position. The positions are zero-based. For example:
If the reference span of the variant overlaps with any of the regions in the target BED, then the variant is output. If the reference span does not overlap, the variant is not output. For SNPs and insertions, the reference span is 1 bp. For deletions, the reference span is the length of the deletion.
--vc-target-bed-padding
[Optional] Pads all target BED regions with the specified value. For example, if a BED region is 1:1000–2000 and the specified padding value is 100, the result is equivalent to using a BED region of 1:900–2100 and a padding value of 0. Any padding specified with --vc-target-bed-padding is used by the small variant caller and by the target BED coverage/callability reports. The default padding is 0.
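The overlap rule for --vc-target-bed and the effect of --vc-target-bed-padding can be sketched as follows. This is an illustration under stated assumptions, not DRAGEN code: the function name is hypothetical, regions are treated as zero-based half-open intervals in the usual BED convention, and the caller is expected to supply the variant's zero-based start and reference span length directly.

```python
def overlaps_target(chrom, start, span_len, targets, padding=0):
    """Return True if a variant's reference span overlaps any target region.

    chrom, start: contig name and zero-based start of the reference span
    span_len:     1 for SNPs and insertions, deletion length for deletions
    targets:      list of (chrom, start, end) zero-based half-open regions
    padding:      bases added to both sides of every target region
    """
    end = start + span_len
    for t_chrom, t_start, t_end in targets:
        if t_chrom != chrom:
            continue
        padded_start = max(0, t_start - padding)
        padded_end = t_end + padding
        if start < padded_end and end > padded_start:
            return True
    return False


targets = [("1", 1000, 2000)]

# A SNP 50 bp upstream of the region is only reported when padding covers it.
print(overlaps_target("1", 950, 1, targets))               # False
print(overlaps_target("1", 950, 1, targets, padding=100))  # True

# Padding 100 on 1:1000-2000 behaves like an unpadded region 1:900-2100.
print(overlaps_target("1", 950, 1, [("1", 900, 2100)]))    # True

# A 15 bp deletion starting upstream can still overlap the region.
print(overlaps_target("1", 990, 15, targets))              # True
```

Keeping the span length as an explicit argument avoids committing to a particular convention for where a deletion's span begins relative to its VCF position.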
--vc-target-coverage
Specifies the target coverage for downsampling. The default value is 500 for germline mode and 50 for somatic mode.
--vc-remove-all-soft-clips
Set to true to ignore soft-clipped bases during the haplotype assembly step.
--vc-decoy-contigs
Specifies a comma-separated list of contigs to skip during variant calling. This option can be set in the configuration file.
--vc-enable-decoy-contigs
Set to true to enable variant calls on the decoy contigs. The default value is false.
--vc-enable-phasing
Enable variants to be phased when possible. The default value is true.
--vc-combine-phased-variants-distance
Set the maximum distance in base pairs between phased variants to be combined. The default value is 0, which disables the option. To enable the combining of phased variants, the recommended distance is 15 base pairs. The valid range is [0, 15].
--vc-enable-mosaic-detection
Set to true to enable DRAGEN mosaic detection with the mosaic AF filter threshold set to 0.0. Set to false to disable DRAGEN mosaic detection. The default is true with the mosaic AF filter threshold set to 0.2.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF filter to mosaic calls. All MOSAIC-tagged variants with AF smaller than the AF threshold are filtered with the MosaicLowAF filter. The default mosaic AF filter threshold is set to 0.2 when the germline variant caller is enabled. The default AF threshold is set to 0.0 when the mosaic detection mode is enabled with --vc-enable-mosaic-detection=true.
You can use the following options for downsampling reads in the small variant calling pipeline.
For mitochondrial small variant calling, the downsampling options can be set separately because the mitochondrial contig contains a higher depth than the rest of the contigs in a WGS data set. The following are the downsampling options for the mitochondrial contig.
--vc-target-coverage-mito
--vc-max-reads-per-active-region-mito
--vc-max-reads-per-raw-region-mito
The target coverage and max/min reads in raw/active region options are not directly related and could be triggered independently.
The following are the default downsampling values for each small variant calling mode.
The target coverage downsampling step runs first and is meant to limit the total coverage at a given position. This step is approximate, and the coverage after downsampling at a given position can be slightly higher than the threshold because of the --vc-min-reads-per-start-pos setting.
If the number of reads with the same start position is equal to or lower than --vc-min-reads-per-start-pos (default: 10), that position is skipped for downsampling, so that at least a minimum number of reads is always kept at any start position.
The next downsampling step is to apply the --vc-max-reads-per-raw-region and --vc-max-reads-per-active-region limits. These options are used to limit the total number of reads in an entire region using a leveling downsampling method.
This downsampling mechanism scans each start position from the start boundary of the region and discards one read from that position, then moves on to the next position, until the total number of reads falls below the threshold. It can potentially take several passes across the entire region for the total number of reads in the entire region to fall below the threshold. After the threshold is met, the downsampling step is stopped regardless of which position was considered last in the region.
When downsampling occurs, the choice of which reads to keep or remove is random. However, the random number generator is seeded to a default value to make sure that the generator produces the same set of values in each run. This ensures reproducible results, which means there is no run to run variation when using the same input data.
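The leveling downsampling pass can be sketched as below. This is only an illustration under stated assumptions: the function name, the dictionary layout (reads grouped by start position), and the seed value are hypothetical, and the loop stops as soon as the total no longer exceeds the limit. A fixed seed mirrors the reproducibility behavior described above.

```python
import random
from typing import Dict, List


def level_downsample(reads_by_start: Dict[int, List[str]],
                     max_reads: int, seed: int = 0) -> Dict[int, List[str]]:
    """Discard one random read per start position, pass after pass,
    until the region holds at most `max_reads` reads."""
    if max_reads < 0:
        raise ValueError("max_reads must be non-negative")
    rng = random.Random(seed)  # fixed seed: same input gives the same output
    kept = {pos: list(reads) for pos, reads in reads_by_start.items()}
    total = sum(len(reads) for reads in kept.values())
    while total > max_reads:
        # Scan start positions from the start boundary of the region.
        for pos in sorted(kept):
            if total <= max_reads:
                break  # stop regardless of which position was considered last
            if kept[pos]:
                kept[pos].pop(rng.randrange(len(kept[pos])))
                total -= 1
    return kept
```

Because each pass removes at most one read per start position, positions with deep pileups are leveled gradually rather than being emptied in one step.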
A genomic VCF (gVCF) file contains information on variants and positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. The gVCF file includes an artificial <NON_REF> allele. Reads that do not support the reference or any variants are assigned the <NON_REF> allele. DRAGEN uses these reads to determine if the position can be called as a homozygous reference, as opposed to remaining uncalled. The resulting score represents the Phred-scaled level of confidence in a homozygous reference call. In germline mode, the score is FORMAT/GQ, and in somatic mode the score is FORMAT/SQ.
The following options are available to enable and control gVCF output.
--vc-emit-ref-confidence
To enable gVCF output, set to GVCF. By default, contiguous runs of homozygous reference calls with similar scores are collapsed into blocks (hom-ref blocks). Hom-ref blocks save disk space and reduce the processing time of downstream analysis tools. DRAGEN recommends using the default mode.
To produce unbanded output, set --vc-emit-ref-confidence to BP_RESOLUTION.
--vc-enable-vcf-output
To enable VCF file output during a gVCF run, set to true. The default value is false.
--vc-gvcf-bands
If using the default --vc-emit-ref-confidence gvcf (banded mode), DRAGEN collapses gVCF records with a similar GQ or SQ score. By default, the cutoffs are 1 10 20 30 40 60 80 for germline and 1 3 10 20 50 80 for somatic. For example, to define the bands [0, 10), [10, 50), and ≥ 50, use --vc-gvcf-bands 10 50.
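To make the banding rule concrete, the following sketch maps a GQ or SQ score to its band for a given list of cutoffs. The function name is hypothetical; the inclusive lower edge and exclusive upper edge follow the [0, 10), [10, 50), ≥ 50 example above.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def band_for_score(score: float, cutoffs: List[float]) -> Tuple[float, Optional[float]]:
    """Return (lower, upper) for the band containing `score`.

    `upper` is None for the last, open-ended band.
    """
    cutoffs = sorted(cutoffs)
    idx = bisect_right(cutoffs, score)  # number of cutoffs <= score
    lower = 0 if idx == 0 else cutoffs[idx - 1]
    upper = cutoffs[idx] if idx < len(cutoffs) else None
    return lower, upper


# Bands produced by "--vc-gvcf-bands 10 50"
print(band_for_score(9, [10, 50]))   # (0, 10)
print(band_for_score(10, [10, 50]))  # (10, 50)
print(band_for_score(75, [10, 50]))  # (50, None)
```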
--vc-compact-gvcf
This option, when used for germline in conjunction with --vc-emit-ref-confidence gvcf, produces a much smaller gVCF output file than the default. It can be used when the gVCF is destined for ingestion into gVCF Genotyper, offering further savings on disk space and gVCF Genotyper runtime compared to the default. This option implies --vc-gvcf-bands 0 1 10 20 30 and additionally omits certain metrics that are not used by gVCF Genotyper. Note that files generated using this option will be rejected by the Pedigree Caller.
Not all entries in the gVCF are contiguous. The file might contain gaps that are not covered by either a variant line or a hom-ref block. The gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
In germline mode, the thresholds for calling are lower for gVCFs than for VCFs. The gVCF output could show a different number of variants than a VCF run for the same sample. There is likely a different number of biallelic and multiallelic calls because gVCF mode includes all possible alleles at a locus, rather than only the two most likely alleles. This means that a biallelic call in the VCF can be output as a multiallelic call in the gVCF. The genotype in the gVCF still points to the two most likely alleles, so the variant call remains the same.
The following are example gVCF records that include a hom-ref block call and a variant call.
In a single-sample gVCF, the FORMAT/DP reported at a hom-ref position is the median DP in the band, and AD is the corresponding value, so the sum of AD equals DP even in a hom-ref band. The minimum depth is also computed and reported as MIN_DP for the band.
In single sample VCF and gVCF, the QUAL follows the definition of the VCF specification. For more information on the VCF specification, see the most current VCF documentation available on samtools/hts-specs GitHub repository.
QUAL is the Phred-scaled probability that the site has no variant and is computed as:
QUAL = -10*log10(posterior probability of the homozygous reference genotype, GT=0/0)
That is, QUAL = GP(GT=0/0), where GP is the posterior genotype probability in Phred scale. QUAL = 20 means there is a 99% probability that there is a variant at the site. The GP values are also given in Phred scale in the VCF file.
GQ for non-homref calls is the Phred-scaled probability that the call is incorrect. GQ=-10*log10(p), where p is the probability that the call is incorrect. GQ=-10*log10(sum(10.^(-GP(i)/10))) where the sum is taken over the GT that did not win. GQ of 3 indicates a 50 percent chance that the call is incorrect, and GQ of 20 indicates a 1 percent chance that the call is incorrect.
In gvcf mode, the evidence in favor of homozygous reference calls is also assessed. However, the posterior probability is not of interest in this case (with zero evidence, e.g. due to zero coverage, the strong prior in favor of homref would yield a strong posterior in favor of homref), so the value of GQ for homref calls reflects the evidence directly, defined using the likelihood ratio between the likelihoods for the homref hypothesis and the strongest competing variant hypothesis: 10*log10[P(D|homref)/P(D|variant)] where D represents the pileup data.
QD is the QUAL normalized by the read depth, DP.
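The QUAL, GQ, and QD definitions above can be checked numerically with a short sketch. The helper names are hypothetical, and the winning genotype is assumed to be the one with the smallest Phred-scaled GP. The input values are taken from the multi-allelic chr1:2656216 example record in this guide (QUAL=107.65, DP=12, GQ=15, QD=8.97).

```python
import math
from typing import List


def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled value to a probability."""
    return 10 ** (-q / 10)


def gq_from_gp(gp: List[float]) -> float:
    """GQ = -10*log10(sum of posteriors of the genotypes that did not win)."""
    winner = min(range(len(gp)), key=gp.__getitem__)  # smallest Phred GP wins
    p_wrong = sum(phred_to_prob(v) for i, v in enumerate(gp) if i != winner)
    return -10 * math.log10(p_wrong)


def qd(qual: float, dp: int) -> float:
    """QD is QUAL normalized by the read depth."""
    return qual / dp


# FORMAT/GP of the example record, in genotype order 0/0, 0/1, 1/1, 0/2, 1/2, 2/2
gp = [107.6, 109.6, 14.65, 87.58, 0.152, 40.82]

print(gp[0])                      # GP(GT=0/0), close to the reported QUAL of 107.65
print(round(gq_from_gp(gp)))      # 15, matching FORMAT/GQ in the record
print(round(qd(107.65, 12), 2))   # 8.97, matching INFO/QD in the record
print(phred_to_prob(20))          # 0.01, a 1 percent chance the call is incorrect
```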
The QUAL scores generated by DRAGEN differ significantly from those of GATK, as DRAGEN's algorithms for small variant detection provide more realistic scores. This improvement stems from two key factors:
Correlated Errors: DRAGEN accounts for real-world correlated errors, unlike GATK, which assumes errors are uncorrelated, leading to inflated QUAL scores in GATK.
Machine Learning (ML): DRAGEN-ML further recalibrates QUAL scores, making them more accurate than DRAGEN without ML. With ML enabled, QUAL scores tend to not exceed 75, compared to GATK, where they can exceed 1000. Consequently, DRAGEN-ML uses a lower QUAL filtering threshold (3) compared to DRAGEN without ML (10).
Our recommendation is to use the default filtering thresholds in DRAGEN: QUAL threshold of 3 with ML enabled.
DRAGEN supports output of phased variant records in both the germline and the somatic VCF and gVCF files. When two or more variants are phased together, the phasing information is encoded in a sample-level annotation, FORMAT/PS. FORMAT/PS identifies which set the phased variant is in. The value in the field is an integer representing the position of the first phased variant in the set. All records in the same contig with matching PS values belong to the same set.
The following is an example of a DRAGEN single sample gVCF, where two SNPs are phased together.
During the genotyping step, all haplotypes and all variants are considered over an active region. For each pair of variants, if both variants occur on all of the same haplotypes or if either is a homozygous variant, then they are phased together. If the variants only occur on different haplotypes, then they are phased opposite to each other. If any heterozygous variants are present on some of the same haplotypes but not others, phasing is aborted and no phasing information is output for the active region.
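The pairwise phasing rule in the preceding paragraph can be sketched as a small decision function. The function name, the return labels, and the representation of each variant as the set of haplotype identifiers carrying it are all illustrative assumptions, not DRAGEN internals.

```python
from typing import Set


def phase_pair(haps_a: Set[int], haps_b: Set[int],
               a_is_hom: bool = False, b_is_hom: bool = False) -> str:
    """Decide how two variants in one active region are phased.

    haps_a / haps_b: identifiers of the haplotypes on which each variant occurs.
    Returns "same", "opposite", or "abort".
    """
    if a_is_hom or b_is_hom or haps_a == haps_b:
        return "same"      # phased together
    if haps_a.isdisjoint(haps_b):
        return "opposite"  # only ever seen on different haplotypes
    return "abort"         # shared on some haplotypes but not others


print(phase_pair({1}, {1}))        # same
print(phase_pair({1}, {2}))        # opposite
print(phase_pair({1, 2}, {2, 3}))  # abort
```

In the "abort" case the text above states that no phasing information is output for the whole active region, so a caller of this sketch would need to discard any phasing already decided for other pairs in that region.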
Phased variant records that belong to the same phasing set can be combined into a single VCF record. For example, assuming the reference at position chr2 115035 is A, the following two phased variants are combined.
The phased variants are combined as follows.
The command-line option --vc-combine-phased-variants-distance specifies the maximum distance over which phased variants will be combined. The default value 0 disables the feature. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value.
DRAGEN supports phasing of the genotypes listed in the below table. Only the first row in the table is relevant to somatic, since the somatic pipeline only emits 0/1 and 0|1 genotypes. MNV calls can still be phased with other variant calls that fell outside the phased variants distance.
Examples of diploid haplotypes where phasing is supported:
Examples of diploid haplotypes where phasing is not supported:
By default in somatic mode, DRAGEN outputs all component SNVs and INDELs that make up an MNV along with the MNV call itself. MNVs and their component calls can be identified and linked to one another by a common value in the INFO.MNVTAG field. Set --vc-mnv-emit-component-calls=false to restrict which component calls are reported. When DRAGEN reports an MNV call, it considers the difference between the VAF of the MNV call and the VAF of each component call, and reports a component call in addition to the MNV call if this difference is greater than --vc-combine-phased-variants-max-vaf-delta (default: 0.1). The --vc-mnv-emit-component-calls and --vc-combine-phased-variants-max-vaf-delta options apply only in somatic mode and are not supported in germline mode. In germline mode, component calls cannot be output, and MNV calls are emitted without them.
Parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0.
Left aligning a variant means shifting the start position of that variant to the left until it is no longer possible to do so.
A variant is normalized if and only if it is parsimonious and left aligned.
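The two definitions above can be illustrated with a commonly used normalization procedure: trim shared bases from the right, extending to the left whenever an allele would become empty, then trim shared bases from the left. The sketch below handles a single bi-allelic variant and is not claimed to be how DRAGEN normalizes variants; the function name and the toy reference sequence are assumptions.

```python
from typing import Tuple


def normalize(ref_seq: str, pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Left align and trim one bi-allelic variant.

    ref_seq: reference sequence of the contig; pos: 1-based position of `ref`.
    Assumes ref != alt.
    """
    changed = True
    while changed:
        changed = False
        # Right-trim: drop the last base while both alleles end with the same base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # No allele may have length 0: extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the contig start")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Left-trim: drop shared leading bases while both alleles keep >= 1 base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Deleting one "CA" unit from a CA repeat, written at a right-shifted position.
print(normalize("GGGCACACACAGGG", 6, "CAC", "C"))  # (3, 'GCA', 'G')
```

The result keeps a single padding reference base before the deletion, consistent with the reference-trimming convention described in this section.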
Additional notes on variant representation in the DRAGEN VCF:
Reference-trimming of alleles: A single padding reference base is used to represent insertions and deletions (i.e. the reference base preceding the insertion or deletion is included).
Allele decomposition: By default, multi-nucleotide polymorphisms (MNPs) are represented as separate, contiguous individual SNV records in the VCF. If phasing can be determined, the FORMAT/GT is phased and the FORMAT/PS contains the coordinate position of the first variant in the set of phased variants. This determines which variants have occurred on the same haplotype. Phased variant records that belong to the same phasing set can be combined into a single VCF record by setting the --vc-combine-phased-variants-distance command-line option to a non-zero value. When enabled, the option combines all phased variants in the phasing set that are within the provided distance value (specified in number of base pairs).
A multiallelic site is a specific locus in a genome that contains three or more observed alleles, counting the reference as one, and therefore allowing for two or more variant alleles. Multi-allelic calls are output in a single variant record in the VCF as follows:
chr1 2656216 . A T,C 107.65 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=12;FS=0.000;MQ=28.95;QD=8.97;SOR=3.056;FractionInformativeReads=0.750 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,5,4:0.556,0.444:9:15:177,144,46,122,0,72:-17.704,-14.420,-4.626,-12.220,0.000,-7.244:1.076e+02,1.096e+02,1.465e+01,8.758e+01,1.520e-01,4.082e+01:0.00,34.77,37.77,34.77,69.54,37.77:0,0,1,8:0,0,4,5
Two indels are considered multi-allelic if they share the same reference base preceding the indel.
chr1 7392258 . C CT,CTTT 234.76 PASS AC=1,1;AF=0.500,0.500;AN=2;DP=44;FS=0.000;MQ=199.22;QD=5.34;SOR=2.226;FractionInformativeReads=0.659 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB 1/2:0,15,14:0.517,0.483:29:50:245,256,55,190,0,55:-24.476,-25.634,-5.492,-18.976,0.000,-5.500:2.348e+02,2.513e+02,5.292e+01,1.848e+02,4.401e-05,5.300e+01:0.00,5.00,8.00,5.00,10.00,8.00:0,0,7,22:0,0,17,12
If a SNP overlaps an INDEL, but the SNP does not align with the reference base preceding the indel, the SNP and INDEL are represented as two different variant records, as shown in the example below. However, DRAGEN has a joint detection of overlapping variants feature, which is designed to detect an overlapping SNP and INDEL and output them in a single VCF variant record, represented as a multi-allelic genotype.
chr1 1029628 . C CGT 49.88 PASS AC=1;AF=0.500;AN=2;DP=37;FS=7.791;MQ=105.32;MQRankSum=-1.315;QD=1.35;ReadPosRankSum=1.423;SOR=1.510;FractionInformativeReads=0.892;R2_5P_bias=-19.742 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:17,16:0.485:33:48:81,0,50:-8.088,0.000,-5.000:4.988e+01,6.653e-05,5.300e+01:0.00,31.00,34.00:10,7,5,11:11,6,9,7:1029628
chr1 1029629 . A G 50.00 PASS AC=1;AF=0.500;AN=2;DP=37;FS=1.289;MQ=105.32;MQRankSum=-0.659;QD=1.35;ReadPosRankSum=-0.199;SOR=0.604;FractionInformativeReads=1.000;R2_5P_bias=-24.923 GT:AD:AF:DP:GQ:PL:GL:GP:PRI:SB:MB:PS 0|1:16,21:0.568:37:48:85,0,49:-8.477,0.000,-4.934:5.000e+01,6.886e-05,5.234e+01:0.00,34.77,37.77:9,7,10,11:10,6,13,8:1029628
The small variant caller currently only supports either ploidy 1 or 2 on all contigs within the reference except for the mitochondrial contig, which uses a continuous allele frequency approach (see Mitochondrial Calling). The selection of ploidy 1 or 2 for all other contigs is determined as follows.
If --sample-sex is not specified on the command line, the Ploidy Estimator determines the sex. If the Ploidy Estimator cannot determine the sex karyotype or detects sex chromosome aneuploidy, all contigs are processed with ploidy 2.
If --sample-sex is specified on the command line, contigs are processed as follows.
For female samples, DRAGEN processes all contigs with ploidy 2, and marks variant calls on chrY with a filter PloidyConflict.
For male samples, DRAGEN processes all contigs with ploidy 2, except for the sex chromosomes. DRAGEN processes chrX with ploidy 1, except in the PAR regions, where it is processed with ploidy 2. chrY is processed with ploidy 1 throughout.
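The ploidy selection rules above can be summarized in a small lookup function. This is a sketch only: the function name and the `par_regions` parameter are hypothetical, the PAR interval used in the example is a placeholder rather than a real coordinate, and the mitochondrial contig (handled by the continuous AF pipeline) is outside its scope.

```python
from typing import List, Optional, Tuple


def contig_ploidy(contig: str, pos: int, sample_sex: Optional[str],
                  par_regions: List[Tuple[int, int]]) -> int:
    """Return the ploidy (1 or 2) used at a position.

    sample_sex: "male", "female", or None when the sex is undetermined.
    par_regions: chrX PAR intervals as (start, end), end-exclusive.
    """
    # Sex chromosomes are recognized as X/Y or chrX/chrY only.
    name = contig[3:] if contig.startswith("chr") else contig
    if sample_sex != "male" or name not in ("X", "Y"):
        return 2  # female, undetermined sex, or an autosome
    if name == "X" and any(start <= pos < end for start, end in par_regions):
        return 2  # PAR regions of chrX stay diploid in male samples
    return 1      # non-PAR chrX and all of chrY in male samples


PLACEHOLDER_PAR = [(0, 1000)]  # illustrative interval, not a real PAR coordinate
print(contig_ploidy("chrX", 500, "male", PLACEHOLDER_PAR))    # 2
print(contig_ploidy("chrX", 5000, "male", PLACEHOLDER_PAR))   # 1
print(contig_ploidy("chrY", 500, "male", PLACEHOLDER_PAR))    # 1
print(contig_ploidy("chrX", 5000, "female", PLACEHOLDER_PAR)) # 2
```

For female samples the sketch returns ploidy 2 on chrY as well; marking those calls with the PloidyConflict filter is a separate step that is not modeled here.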
For male samples in germline calling mode, DRAGEN calls potential mosaic variants in non-PAR regions of sex chromosomes. A variant is called as mosaic when the allele frequency (FORMAT/AF) is below 85% or if multiple alt alleles are called, suggesting incompatibility with the haploid assumption. The GT field for bi-allelic mosaic variants is "0/1", denoting a mixture of reference and alt alleles, as opposed to the regular GT of "1" for haploid variants. The GT field for multi-allelic mosaic variants is "1/2" in VCF. You can disable the calling of mosaic variants by setting --vc-enable-sex-chr-diploid to false.
An example germline VCF record of a mosaic variant in a haploid region:
chrX 18622368 . C T 48.84 PASS AC=1;AF=0.500;AN=2;DP=22;FS=4.154;MQ=248.02;MQRankSum=3.272;QD=2.27;ReadPosRankSum=2.671;SOR=1.546;FractionInformativeReads=1.000;MOSAIC GT:AD:AF:DP:F1R2:F2R1:GQ:PL:GP:PRI:SB:MB 0/1:9,13:0.5909:22:1,8:8,5:48:84,0,51:4.8837e+01,7.4031e-05,5.4007e+01:0.00,34.77,37.77:5,4,4,9:3,6,5,8
DRAGEN detects sex chromosomes by the naming convention, either X/Y or chrX/chrY. No other naming convention is supported.
Instead of treating overlapping mates as independent evidence for a given event, DRAGEN handles overlapping mates in both the germline and somatic pipelines as follows.
When the two overlapping mates agree with each other on the allele with the highest HMM score, the genotyper uses the mate with the greatest difference between the highest and the second highest HMM score. The HMM score of the other mate becomes zero.
When the two overlapping mates disagree, the genotyper sums the HMM score between the two mates, assigns the combined score to the mate that agrees with the combined result, and changes the HMM score of the other mate to zero.
The base qualities of overlapping mates are no longer adjusted.
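The two overlapping-mate rules can be sketched as follows. Several details here are assumptions made for illustration: each mate is modeled as a dictionary of per-allele HMM scores where higher is better, "becomes zero" is modeled by zeroing all of that mate's scores, and if neither mate agrees with the combined result the combined score is given to the second mate. The function name is hypothetical.

```python
from typing import Dict, Tuple

Scores = Dict[str, float]  # allele -> HMM score (higher means better support)


def resolve_overlapping_mates(m1: Scores, m2: Scores) -> Tuple[Scores, Scores]:
    """Return adjusted scores for two overlapping mates (same allele keys)."""

    def best(scores: Scores) -> str:
        return max(scores, key=scores.get)

    def margin(scores: Scores) -> float:
        ranked = sorted(scores.values(), reverse=True)
        return ranked[0] - ranked[1] if len(ranked) > 1 else ranked[0]

    def zeroed(scores: Scores) -> Scores:
        return {allele: 0.0 for allele in scores}

    if best(m1) == best(m2):
        # Agreement: keep the mate with the larger gap between its top two scores.
        return (m1, zeroed(m2)) if margin(m1) >= margin(m2) else (zeroed(m1), m2)

    # Disagreement: sum the scores and give them to the mate that agrees
    # with the combined result.
    combined = {allele: m1[allele] + m2[allele] for allele in m1}
    if best(m1) == best(combined):
        return combined, zeroed(m2)
    return zeroed(m1), combined
```

Either way, only one mate contributes evidence at the overlapping position, which is the point of not treating the two mates as independent observations.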
Typically, there are approximately 100 mitochondria in each mammalian cell. Each mitochondrion harbors 2–10 copies of mitochondrial DNA (mtDNA). For example, if 20% of the chrM copies have a variant, then the allele frequency (AF) is 20%. This is also referred to as continuous allele frequency. The expectation is that the AF of variants on chrM is anywhere between 0% and 100%.
DRAGEN processes chrM through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. In this case, a single ALT allele is considered and the AF is estimated. The estimated AF can be anywhere between 0% and 100%. Default variant AF thresholds are applied to mitochondrial variant calling.
--vc-enable-af-filter-mito
Whether to enable the allele frequency filter for mitochondrial variant calling. The default is true.
--vc-af-call-threshold-mito
Set the threshold for emitting calls in the VCF. The default is 0.01.
--vc-af-filter-threshold-mito
Set the threshold to mark emitted VCF calls as filtered. The default is 0.02.
QUAL and GQ are not output in the chrM variant records. Instead, the confidence score is FORMAT/SQ, which gives the Phred-scaled confidence that a variant is present at a given locus. A call is made if FORMAT/SQ> vc-sq-call-threshold (default = 3.0).
The following filters can be applied to mitochondrial variant calls.
--vc-sq-call-threshold
Set the SQ threshold for emitting calls in the VCF. The default is 0.1.
--vc-sq-filter-threshold
Set the SQ threshold to mark emitted VCF calls as filtered. The default is 3.0
--vc-enable-triallelic-filter
Enables the multiallelic filter. The default value is false.
If FORMAT/SQ < vc-sq-call-threshold, the variant is not emitted in the VCF. If FORMAT/SQ > vc-sq-call-threshold but FORMAT/SQ < vc-sq-filter-threshold, the variant is emitted in the VCF with FILTER=weak_evidence.
If FORMAT/SQ > vc-sq-call-threshold, FORMAT/SQ > vc-sq-filter-threshold, and no other filters are triggered, the variant is output in the VCF with FILTER=PASS.
The following are example VCF records on the chrM. The examples show one call with very high AF and another with low AF. In both cases FORMAT/SQ > vc-sq-call-threshold. FORMAT/SQ is also > vc-sq-filter-threshold, so the FILTER annotation is PASS.
For homref calls (e.g. in NON_REF regions of gVCF output) the FORMAT/GT is hard-coded to 0/0. The FORMAT/AF yields an estimate on the variant allele frequency, which ranges anywhere within [0,1]. For variant calls with FORMAT/AF < 95%, the FORMAT/GT is set to 0/1. For variants with very high allele frequencies (FORMAT/AF ≥ 95%), the FORMAT/GT is set to 1/1.
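The SQ thresholds and the AF-based genotype assignment for chrM can be combined into a short sketch. The function names are hypothetical, the default values are taken from the option list above (0.1 for the call threshold, 3.0 for the filter threshold), and scores exactly equal to a threshold are treated as passing it, which the text does not specify. Other filters are not modeled.

```python
from typing import Optional


def mito_sq_filter(sq: float, call_threshold: float = 0.1,
                   filter_threshold: float = 3.0) -> Optional[str]:
    """Return None if the call is not emitted, else its SQ-based FILTER value."""
    if sq < call_threshold:
        return None             # not emitted in the VCF
    if sq < filter_threshold:
        return "weak_evidence"  # emitted but filtered
    return "PASS"               # assuming no other filter is triggered


def mito_genotype(af: float, is_variant_call: bool = True) -> str:
    """Assign FORMAT/GT on chrM from the estimated allele frequency."""
    if not is_variant_call:
        return "0/0"            # hom-ref calls are hard-coded to 0/0
    return "1/1" if af >= 0.95 else "0/1"


print(mito_sq_filter(0.05))   # None
print(mito_sq_filter(1.0))    # weak_evidence
print(mito_sq_filter(10.0))   # PASS
print(mito_genotype(0.96))    # 1/1
print(mito_genotype(0.20))    # 0/1
```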
The following is an example of a variant record on chrM in a trio joint VCF. The variant was detected in the second sample with a confidence score that passed the filter threshold. In the first and third samples GT=0/0, which indicates a tentative hom-ref call (i.e., that position for the sample is in a NON_REF region over which no variant was detected with sufficient confidence), but the weak_evidence filter tag indicates that this call is made with low confidence.
We leverage the new pangenome reference and multi-genome mapper output to compute a personalized 2-haplotype reference for the input sample.
The computed 2-haplotype reference is used to impute variants, adjust the prior probabilities for genotypes in the variant caller, and create a new personalized machine learning model, which significantly boosts the accuracy of variant calling. False negatives are reduced by adjusting genotype priors based on imputed phased variants in the computed haplotypes. False positives are reduced by limiting the impact of noise from other population haplotypes.
To enable personalized variant calling and machine learning, set --enable-personalization to true (default: false).
Note that this is a beta feature and available only for the germline small variant caller when run with a V4 pangenome reference.
When variants at multiple loci in a single active region are detected jointly, genotyping can benefit. DRAGEN combines loci into a joint detection region if the following conditions are met:
Loci have alleles that overlap each other.
Loci are in the STR region or less than 10 bases away from the STR region.
Loci are less than 10 bases apart from each other.
Joint detection generates a haplotype list where all possible combinations of the alleles in the joint detection regions are represented. This calculation leads to a larger number of haplotypes. During genotyping, joint detection calculates the likelihoods that each haplotype pair is the truth, given the observed read pileup. Genotype likelihoods are calculated as the sum of the likelihoods of haplotype pairs that support the alleles in the genotypes. Genotypes with maximum likelihood are reported.
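The haplotype enumeration and the genotype likelihood summation described above can be sketched as follows. The function names and the likelihood values are hypothetical, and computing the likelihood of each haplotype pair from the read pileup is outside the scope of this sketch.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, List, Tuple

Haplotype = Tuple[str, ...]  # one allele per locus in the joint detection region


def joint_haplotypes(alleles_per_locus: List[List[str]]) -> List[Haplotype]:
    """All combinations of the alleles observed at each locus."""
    return list(product(*alleles_per_locus))


def genotype_likelihoods(pair_likelihood: Dict[Tuple[Haplotype, Haplotype], float],
                         locus: int) -> Dict[Tuple[str, str], float]:
    """Sum haplotype-pair likelihoods per genotype at one locus."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for (hap1, hap2), likelihood in pair_likelihood.items():
        genotype = tuple(sorted((hap1[locus], hap2[locus])))
        totals[genotype] += likelihood
    return dict(totals)


# Two loci with two alleles each give four candidate haplotypes.
haps = joint_haplotypes([["A", "T"], ["C", "CG"]])

# Hypothetical likelihoods for three haplotype pairs.
pairs = {
    (("A", "C"), ("T", "C")): 0.6,
    (("A", "C"), ("T", "CG")): 0.3,
    (("A", "C"), ("A", "C")): 0.1,
}
per_genotype = genotype_likelihoods(pairs, locus=0)
best_genotype = max(per_genotype, key=per_genotype.get)  # ("A", "T")
```

The number of haplotypes grows as the product of the allele counts at each locus, which is why joint detection leads to a larger haplotype list.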
Joint detection is enabled by default. To disable joint detection, set --vc-enable-joint-detection to false.
DRAGEN has two algorithms that model correlated errors across reads in a given pileup.
Foreign read detection (FRD) detects mismapped reads. FRD modifies the probability calculation to account for the possibility that a subset of the reads were mismapped. Instead of assuming that mapping errors occur independently per read, FRD estimates the probability that a burst of reads is mismapped, by incorporating such evidence as MAPQ and skewed AF.
Mapping errors typically occur in bursts, but treating mapping errors as independent error events per read can result in high confidence scores in spite of low MAPQ and/or skewed AF. One possible strategy to mitigate overestimation of confidence scores is to include a threshold on the minimum MAPQ used in the calculation. However, this strategy can discard evidence and result in false positives.
FRD extends the legacy genotyping algorithm by incorporating an additional hypothesis that reads in the pileup might be foreign reads (ie, their true location is elsewhere in the reference genome). The algorithm exploits multiple properties (skewed allele frequency and low MAPQ) and incorporates this evidence into the probability calculation.
Sensitivity is improved by rescuing FN, correcting genotypes, and enabling lowering of the MAPQ threshold for incoming reads into the variant caller. Specificity is improved by removing FP and correcting genotypes.
The base quality drop off (BQD) algorithm detects systematic and correlated base call errors caused by the sequencing system. BQD exploits certain properties of those errors (strand bias, position of the error in the read, base quality) to estimate the probability that the alleles are the result of a systematic error event rather than a true variant.
Bursts of errors that occur at a specific locus have distinct characteristics differentiating them from true variants. The base quality drop off (BQD) algorithm is a detection mechanism that exploits certain properties of those errors (strand bias, position of the error in the read, low mean base quality over said subset of reads at the locus of interest) and incorporates them into the probability calculation.
DRAGEN FastQC is a tool for calculating common metrics used for quality control of high-throughput sequencing data. The tool is modeled after the metrics generated by Babraham Institute's FastQC tool.
The metrics are generated automatically on all DRAGEN map-align workflows with no additional run time and are output in a CSV file called <PREFIX>.fastqc_metrics.csv. All metrics are calculated and reported separately for each mate-pair.
For users who are only interested in sample QC or who want to obtain FastQC results only, DRAGEN provides a mode to generate the fastqc_metrics.csv file directly.
By default, DRAGEN FastQC and read trimming run as preprocessing steps to standard sequence alignment workflows. If DNA alignment is not needed or if QC results are needed more quickly, the mapping and BAM output portions of the workflow can be disabled. The workflow then outputs only key metric files and runs ~70% faster. This option is available on the command line by entering --fastqc-only=true after the DRAGEN command.
If FastQC runs stand-alone, then the license will not be consumed. If FastQC runs with map-align enabled, then the license will be consumed.
DRAGEN FastQC is a complete reimplementation of the original FastQC tool developed by the Babraham Institute (henceforth BI-FastQC). The reimplementation of FastQC in DRAGEN, however, has been modified to take advantage of the hardware-acceleration provided by the DRAGEN Field-Programmable Gate Array (FPGA) for a significant speed improvement. As such, there are some differences in how the values are calculated and the resulting metrics will not be exactly identical between the two tools. The most significant differences are described below.
Binning: BI-FastQC uses a customizable binning strategy with a default of 5bp bins, while DRAGEN uses an algorithmic binning strategy based on the Granularity setting described below. In general, this should mean that DRAGEN provides more precise results at default settings.
Outputs: BI-FastQC text output contains the same information as its plots in tabular format, while DRAGEN-FastQC outputs its raw data. For example, BI-FastQC both plots and outputs the average base quality per position, while DRAGEN outputs the average base quality by both position and nucleotide. This allows for a more detailed analysis of the data, but requires slightly more work to generate the associated plot.
Rounding: DRAGEN consistently rounds its calculations to the nearest integer, while the original FastQC uses a mixture of rounding and taking the mathematical floor, leading DRAGEN-FastQC to provide incrementally higher results for some metrics.
Smoothing: Both DRAGEN-FastQC and BI-FastQC utilize smoothing techniques for their distributions of %GC, to account for the fact that 151bp do not divide evenly into 100 percentile bins. However, to take advantage of the speed offered by the FPGA, DRAGEN utilizes a slightly different algorithm than BI-FastQC which results in slightly different results.
Because of memory constraints, it is not possible to guarantee single-base resolution for all metrics. DRAGEN provides an algorithmic solution for binning via --fastqc-granularity. DRAGEN allocates 256 bins in memory for each size or position-based metric. A granularity value of 4–7 inclusive determines the bin size. Higher values use smaller bins for greater resolution. Lower values create larger bins to accommodate longer read lengths.
If a value for --fastqc-granularity is not provided by the user, DRAGEN will attempt to estimate the read length of the input data and set the granularity accordingly.
To include metrics for adapter or other sequence content, DRAGEN FastQC needs to be provided with the desired sequences in FASTA format. DRAGEN provides two options for this purpose: --fastqc-adapter-file for adapter sequences and --fastqc-kmer-file for any additional kmers of interest, so that users can add sequences of interest without changing the expected adapter results.
DRAGEN FastQC can accept up to a combined total of 16 adapter and kmer sequences. Each sequence can be a maximum of 12 bp in length. By default, DRAGEN uses the adapter file located at <INSTALL_PATH>/config/adapter_sequences.fasta. The file contains the following adapter sequences, which are the same as those in Babraham's FastQC v 0.11.10 and later.
Illumina Universal Adapter--AGATCGGAAGAG
Illumina Small RNA 3' Adapter--TGGAATTCTCGG
Illumina Small RNA 5' Adapter--GATCGTCGGACT
Nextera Transposase Sequence--CTGTCTCTTATA
The FastQC metrics are output in CSV format to a file in the run output directory called <PREFIX>.fastqc_metrics.csv.
The reported metrics are broken down into eight sections by metric type. Each section is broken down further into separate rows by either the length, position, or other relevant categorical variables. The following are the metric sections.
Read Mean Quality---Total number of reads with each average Phred-scale quality value, rounded to the nearest integer.
Positional Base Mean Quality---Average Phred-scale quality value of bases with a specific nucleotide and at a given location in the read. Locations are listed first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, or T. N or ambiguous bases are assumed to have the system default value, usually QV2.
Positional Base Content---Number of bases of each specific nucleotide at given locations in the read. Locations are given first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, T, N.
Read Lengths---Total number of reads with each observed length. Lengths can be either specific sizes or ranges, depending on settings specified using --fastqc-granularity.
Read GC Content---Total number of reads with each GC content percentile between 0 % and 100 %.
Read GC Content Quality---Average Phred-scale read mean quality for reads with each GC content percentile between 0% and 100%.
Sequence Positions---Number of times an adapter or other kmer sequence is found, starting at a given position in the input reads. Sequences are listed first in the metric description in quotes. Locations are listed second and can be either specific positions or ranges.
Positional Quality---Phred-scale quality value for bases at a given location and a given quantile of the distribution. Locations are listed first and can be either specific positions or ranges. Quantiles are listed second and can be any whole integer 0–100.
The following are example rows from each section.
polya
—RNA Poly-A tail trimming. See additional description in section
DRAGEN analysis can map against a pangenome hash table. The pangenome hash table adds alternate graph paths to the linear reference hash table to better represent the allelic diversity of the population, either over the whole genome or in specific regions defined in a bed file. The accuracy gains from this method have been described in scientific blogs available on the . Multigenome hash tables for the CHM13_v2, hg38, hg19, and hs37d5 assemblies are available on the .
See for information on the multigenome mapping method.
customize the released pangenome hash table with custom bed files or hash table builder options. A set of bed files are available in the resource files on the .
The input files required are a single multi-sample VCF file containing the set of population variants, and optionally bed files restricting graph to some region. The generated files, including hash_table.cmp and associated files in the specified output directory, can then be used as the reference hash table for the DRAGEN mapper. DRAGEN software supports the tool on human reference with files available on the . For non-human, the user provides the required resource files.
A reference genome in FASTA format must be provided. Reference genomes are available to download from the .
This bed file is used to filter out regions of the msVCF file. Variants that fall within intervals defined in the "Graph exclusion bed" file are ignored and are not used in any part of the pangenome reference builder. The result is the same as if the input msVCF did not contain any variants in the regions defined in the exclusion bed. The file is optional. By default, every variant in the msVCF file is used. Exclusion bed files are available to download from .
This file defines regions in the genome where extra seeds are indexed in the hash table. By default, only seeds extracted from the primary reference are saved in the reference hash table for mapping. This option additionally generates seeds from population variants in the defined regions. Including the expected difficult regions in this bed file is recommended. Extra-kmer-bed files are available to download from for the human hg38, hg19, hs37d5, and chm13 references.
A mask bed file must be provided in order to mask certain regions of high similarity between primary and alternate contigs present in the main genome FASTA. Mask bed files are available to download from the .
An HLA resource file is packaged with DRAGEN and located at the following path after installation: <INSTALL_PATH>/resources/hla/HLA_resource.v1.fasta.gz. This file is used by default when building the HLA-specific anchored hash table. A custom file can be specified with --ht-hla-reference. See the HLA section for more information.
See for further details on the mosaic small variant caller and the mosaic detection mode and a comparison with DRAGEN 4.2 features.
DRAGEN outputs variants in a VCF file following variant normalization as described here . The normalization of a variant representation in VCF consists of two parts: parsimony and left alignment pertaining to the nature of a variant's length and position respectively.
In some cases, such as complex variants in repetitive regions, some variants cannot be normalized (i.e. converted into a standard representation) or represented uniquely. To counteract this problem, when comparing two VCFs (e.g. a DRAGEN VCF against a truth set VCF), it is recommended to use the RTG vcfeval tool which performs variant comparisons using a haplotype-aware approach. RTG vcfeval has been adopted as the standard VCF comparison tool by GA4GH and PrecisionFDA .
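The two parts of normalization described above, parsimony and left alignment, can be illustrated with a short sketch. This is a generic version of the widely used trim-and-shift procedure for a single biallelic variant, not DRAGEN's internal code; the function name `normalize_variant` and the toy reference string are assumptions made for illustration.

```python
def normalize_variant(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant.

    ref_seq: sequence of the reference contig
    pos:     1-based position of the first base of `ref`, as in a VCF record
    Returns a (pos, ref, alt) tuple in normalized form.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")

    changed = True
    while changed:
        changed = False
        # Parsimony on the right: drop a trailing base shared by both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # Left alignment: if an allele became empty, extend both alleles
        # by one reference base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend past the start of the contig")
            pos -= 1
            left_base = ref_seq[pos - 1]
            ref, alt = left_base + ref, left_base + alt
            changed = True

    # Parsimony on the left: drop shared leading bases, keeping one base each.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# A CA deletion inside a CA repeat, written at a non-leftmost position.
toy_reference = "GGGCACACACAGGG"
print(normalize_variant(toy_reference, 8, "CAC", "C"))
```

The deletion is shifted to the leftmost equivalent position in the repeat and reported with a single anchor base.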
hg38, hg19, chm13v2
chr1-chr22, chrX, chrY
hs37d5
1-22, X, Y
| Value for --ht-seed-len | Read Length |
| --- | --- |
| 21 | 100 bp to 150 bp |
| 17 to 19 | Shorter reads (36 bp) |
| 27 | 250+ bp |
--ht-cost-coeff-seed-len 1
--ht-cost-coeff-seed-freq 0.5
--ht-cost-penalty 0
--ht-cost-penalty-incr 0.7
--ht-max-seed-freq 16
--ht-target-seed-freq 4
Reference does not include the decoy contigs (eg, hg19)
Decoy reads mismap elsewhere in the genome due to the lack of contigs in the reference. Artificially higher mapping rate. False positive calls in noisy regions to which the decoy contigs are mismapped.
DRAGEN automatically detects the absence of the decoy contig from the reference and adds it to the FASTA file. Artificially lower mapping rate because decoy reads which map to the decoy contigs are artificially marked as unmapped in the output BAM (because the original reference does not have the decoy contig). False positive calls are avoided thanks to adding the decoy contigs under the hood. Therefore this helps variant calling.
Reference includes the decoy contigs (eg, hs37d5)
Decoy reads map to the decoy contigs. High mapping rate. No false positive calls caused by decoy reads because decoy reads map to the right place
Decoy reads map to the decoy contigs. High mapping rate. No false positive calls caused by decoy reads because decoy reads map to the right place
| Option | Required | Description |
| --- | --- | --- |
| --build-hash-table | Yes | Set to true |
| --ht-graph-msvcf-file | Yes | Path to the multi-sample VCF file containing population variants |
| --ht-reference | Yes | Path to the reference genome FASTA file |
| --ht-graph-extra-kmer-bed | No | Path to the extra kmer bed file |
| --ht-mask-bed | No (but recommended) | Path to the mask bed file |
| --ht-graph-exclusion-bed | No | Path to the exclusion bed file |
| --output-directory | Yes | Directory where all related hash table files are written |
reference.bin---The reference sequences, encoded in 4 bits per base, so the size in bytes is roughly half the reference genome size. Between reference sequences, N bases are trimmed and padding is inserted automatically. For example, hg19 has 3,137,161,264 bases in 93 sequences. This is encoded in 1,526,285,312 bytes = 1.42 GB, where 1 GB means 1 GiB or 2^30 bytes.
hash_table.cmp---Compressed hash table. The hash table is decompressed and used by the DRAGEN mapper to look up primary seeds, with the length specified by the --ht-seed-len option, and extended seeds of various lengths.
hash_table.cfg---A list of parameters and attributes for the generated hash table, in a text format. This file provides key information about the reference genome and hash table.
hash_table.cfg.bin---A binary version of hash_table.cfg used to configure the DRAGEN hardware.
hash_table_stats.txt---A text file listing extensive internal statistics on the constructed hash table, including the hash table occupancy percentages. This file is for information purposes only. It is not used by other tools.
mask.bed---Present only for masked hash tables. A tab-delimited bed file that describes the masked regions. Contains all lines from the input bed file except comment lines, lines that describe empty intervals, and lines with contig names that were not found in the input FASTA.
| Option | Required | Description |
| --- | --- | --- |
| --build-hash-table | Yes | Set to true |
| --ht-reference | Yes | Path to the reference genome FASTA file |
| --ht-mask-bed | No (but recommended) | Path to the mask bed file. If not provided, the DRAGEN software automatically applies BED files for hg38 and hg19 from /opt/edico/fasta_mask. |
| --output-directory | Yes | Directory where all related hash table files are written |
single-pass
--ht-methylated-combined=true
--ht-seed-len 27
multi-pass
--ht-methylated=true
--ht-seed-len 27
--ht-max-seed-freq 16
--homozyg-density
50
50
Minimum required density to call a ROH (1 SNP in 50 kb), can be increased to relax the per SNP density.
--homozyg-gap
1000
1000
3000
Maximal interval between two homozygous SNPs in a ROH (in kb)
--homozyg-kb
1000
500
All sizes reported
Minimal length of reported ROH (in kb)
--homozyg-snp
100
50
50
Minimal number of homozygous SNPs in the reported ROH
--homozyg-window-het
1
2
Soft score threshold (1-0.025) penalty for a het SNP and 0.025 gain for a hom SNP
Maximum number of heterozygous SNPs allowed in a scanning window
--homozyg-window-missing
5
5
Number of missing calls allowed in a scanning window
--homozyg-window-snp
50
50
Variants in a scanning window
--homozyg-window-threshold
0.05
0.05
For a SNP to be eligible for inclusion in a ROH, the hit rate/overlap of all scanning windows containing the SNP must be at least 0.05
--vc-target-coverage---Specifies the maximum number of reads covering any given position.
--vc-max-reads-per-active-region---Specifies the maximum number of reads covering a given active region.
--vc-max-reads-per-raw-region---Specifies the maximum number of reads covering a given raw region.
--vc-min-reads-per-start-pos---Specifies the minimum number of reads with a start position overlapping any given position.
--high-coverage-support-mode---Applies the high coverage mode down-sample options if set to true. Enabling this option is recommended for targeted panels with coverage over 1000x, but it increases run time.
| Pipeline | Option | Value |
| --- | --- | --- |
| Germline | --vc-target-coverage | 500 |
| Germline | --vc-max-reads-per-active-region | 10000 |
| Germline | --vc-max-reads-per-raw-region | 30000 |
| Somatic | --vc-target-coverage | 1000 |
| Somatic | --vc-max-reads-per-active-region | 10000 |
| Somatic | --vc-max-reads-per-raw-region | 30000 |
| High Coverage | --vc-target-coverage | 100000 |
| High Coverage | --vc-max-reads-per-active-region | 200000 |
| High Coverage | --vc-max-reads-per-raw-region | 200000 |
| Mitochondrial | --vc-target-coverage-mito | 40000 |
| Mitochondrial | --vc-max-reads-per-active-region-mito | 200000 |
| Mitochondrial | --vc-max-reads-per-raw-region-mito | 200000 |
| | QUAL | GQ (variant call) | GQ (hom-ref call) | QD |
| --- | --- | --- | --- | --- |
| Description | Probability that the site has no variant | Probability that the call is incorrect | Evidence supporting homref call | QUAL normalized by depth |
| Formulation | QUAL = GP(GT=0/0) | GQ = -10*log10(p) | GQ = 10*log10[P(D\|homref)/P(D\|variant)] | QUAL/DP |
| Scale | Unsigned Phred | Unsigned Phred | Signed Phred | Unsigned Phred |
| Numerical example | QUAL=20: 1% chance that there is no variant at the site. QUAL=50: 1 in 1e5 chance that there is no variant at the site. | GQ=3: 50% chance that the call is incorrect. GQ=20: 1% chance that the call is incorrect. | GQ=0: no evidence. GQ>0: evidence favors homref. | |
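The numerical examples above follow directly from the Phred definitions. The sketch below reproduces them; the function names are illustrative and are not DRAGEN identifiers.

```python
import math


def phred_to_probability(q):
    """Unsigned Phred score -> probability (used for QUAL and variant-call GQ)."""
    return 10 ** (-q / 10)


def probability_to_phred(p):
    """Probability -> unsigned Phred score."""
    return -10 * math.log10(p)


def homref_gq(likelihood_homref, likelihood_variant):
    """Signed Phred-scaled likelihood ratio, as in the hom-ref GQ formulation."""
    return 10 * math.log10(likelihood_homref / likelihood_variant)


def quality_by_depth(qual, depth):
    """QUAL normalized by depth (QUAL/DP)."""
    return qual / depth


for score in (3, 20, 50):
    print(score, phred_to_probability(score))
```

A score of 3 corresponds to a probability of about 0.5, 20 to 0.01, and 50 to 1e-5, matching the table.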
0|1
0|1
0/1
Germline and Somatic
Yes in 4.0
0/1
1/1
1/2
Germline
No
0/1
1/2
1/2
Germline
No
1/1
1/1
1/1
Germline
Yes in 4.2
male
Not relevant
Male
female
Not relevant
Female
none
Not relevant
None
auto (default)
XY
Male
auto (default)
XX
Female
auto (default)
Everything else
None
7
1-255
1
<256
6
1-128
2
>=256 and <507
5
1-64
4
>=507 and <4031
4
1-32
8
>=4031
READ MEAN QUALITY,Read1,Q38 Reads,965377
...
POSITIONAL BASE MEAN QUALITY,Read1,ReadPos 145-152 T Average Quality,34.49
POSITIONAL BASE MEAN QUALITY,Read1,ReadPos 150 T Average Quality,34.44
POSITIONAL BASE MEAN QUALITY,Read1,ReadPos 256+ T Average Quality,36.99
...
POSITIONAL BASE CONTENT,Read1,ReadPos 145-152 A Bases,113362306
POSITIONAL BASE CONTENT,Read1,ReadPos 150 A Bases,14300589
POSITIONAL BASE CONTENT,Read1,ReadPos 256+ A Bases,13249068
...
READ LENGTHS,Read1,150bp Length Reads,77304421
READ LENGTHS,Read1,144-151bp Length Reads,77304421
READ LENGTHS,Read1,>=255bp Length Reads,1000000
...
READ GC CONTENT,Read1,50% GC Reads,140878674373
...
READ GC CONTENT QUALITY,Read1,50% GC Reads Average Quality,36.20
...
SEQUENCE POSITIONS,Read1,'AGATCGGAAGAG' 137bp Starts,20
SEQUENCE POSITIONS,Read1,'AGATCGGAAGAG' 137-144bp Starts,23
...
POSITIONAL QUALITY,Read1,ReadPos 150 50% Quantile QV,37
POSITIONAL QUALITY,Read1,ReadPos 145-152 50% Quantile QV,37
...
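Rows like the examples above can be grouped by section for downstream plotting or QC checks. The following is a minimal sketch that assumes each data row has exactly four comma-separated fields (section, mate, metric description, value), as the example rows suggest; the function name `load_fastqc_metrics` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def load_fastqc_metrics(handle):
    """Group rows as {section: {(mate, metric): value}}."""
    sections = defaultdict(dict)
    for row in csv.reader(handle):
        if len(row) != 4:
            continue  # skip blank lines or rows in an unexpected layout
        section, mate, metric, value = (field.strip() for field in row)
        sections[section][(mate, metric)] = float(value)
    return sections


# A few rows copied from the examples above.
sample = io.StringIO(
    "READ MEAN QUALITY,Read1,Q38 Reads,965377\n"
    "POSITIONAL BASE MEAN QUALITY,Read1,ReadPos 150 T Average Quality,34.44\n"
    "READ LENGTHS,Read1,150bp Length Reads,77304421\n"
)
metrics = load_fastqc_metrics(sample)
print(metrics["READ LENGTHS"][("Read1", "150bp Length Reads")])
```

For a real run, the same function can be called on the opened <PREFIX>.fastqc_metrics.csv file instead of the in-memory sample.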
An MD5SUM file is generated automatically for VCF output files. This file is in the same output directory and has the same name as the VCF output file, but with an .md5sum extension appended, for example whole_genome_run_123.vcf.md5sum. The MD5SUM file is a single-line text file that contains the md5sum of the VCF output file. This md5sum exactly matches the output of the Linux md5sum command.
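The checksum can be verified after copying results to another system. The sketch below is one way to do this with the Python standard library; the helper names are hypothetical, and the code reads only the first whitespace-separated token of the .md5sum file so that it works whether or not a file name follows the hash.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def vcf_matches_md5sum(vcf_path):
    """Compare a VCF against the <vcf_path>.md5sum file next to it."""
    md5_path = Path(str(vcf_path) + ".md5sum")
    expected = md5_path.read_text().split()[0].lower()
    return md5_of_file(vcf_path) == expected
```

For example, `vcf_matches_md5sum("whole_genome_run_123.vcf")` returns True when the file is intact.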
DRAGEN secondary analysis employs machine learning based variant recalibration (DRAGEN-ML) for germline SNV VC. Variant calling accuracy is improved using powerful yet efficient machine learning techniques that augment the variant caller, by exploiting more of the available read and context information that does not easily integrate into the Bayesian processing used by the haplotype variant caller. A supervised machine learning method was developed using truth from the PrecisionFDA v4.2.1 sets to build a model that processes read and other contextual evidence to remove false positives, recover false negatives and reduce zygosity errors, for both SNVs and INDELs.
No additional setup is required. ML model files for the hg38 and hg19 human references are packaged with the DRAGEN installer. After installation, the files are present at <INSTALL_PATH>/resources/ml_model/<ref>.
DRAGEN-ML is enabled by default when running the germline SNV VC. DRAGEN automatically detects the reference used for analysis and uses the correct model files. If neither the hg38 nor the hg19 reference type is detected, ML recalibration is automatically disabled and SNV VC falls back to legacy operation.
DRAGEN-ML requires a run with BAM or FASTQ input, since the machine learning model extracts information from the read pile-up. DRAGEN-ML runs concurrently with DRAGEN SNV VC. DRAGEN-ML can be applied to WGS or WES samples. Re-calibration of existing VCF files is not supported.
DRAGEN-ML recalibrates all quality scores, changing the values of the QUAL and GQ fields in the output VCF/GVCF.
DRAGEN-ML also updates PL and GP in the output VCF/GVCF.
The genotypes (GT field) of some variants may be changed by ML e.g., 0/1 to 1/1 or vice versa.
DRAGEN-ML PHRED scores (e.g. QUAL) are better calibrated than and differ significantly from those with ML disabled and, as a consequence, QUAL scores tend to not exceed 75. By comparison, QUAL scores with ML disabled can exceed 1000. For this reason, the QUAL filtering threshold is set to 3 when DRAGEN-ML is enabled, compared to 10 for DRAGEN-VC when DRAGEN-ML is disabled.
The following variant types are recalibrated:
Biallelic and multiallelic variants
Autosomes and sex chromosomes, including haploid positions
Force GT calls
Non primary contigs
DRAGEN-ML typically removes 30-50% of SNP FPs, with smaller gains on INDELS. FN counts are reduced by 10% or more. The output QUAL/GQ of DRAGEN-ML is empirically more accurately calibrated than DRAGEN SNV VC without ML. There are significant gains in accuracy statistics across the entire genome with ML enabled. Note that a small number of variant calls may have degraded accuracy with ML enabled compared to VC without ML.
DRAGEN-ML adds about 10% to the run time compared to runs without ML.
DRAGEN supports force genotyping (ForceGT) for small variant calling. To use ForceGT, use the --vc-forcegt-vcf option with a list of small variants to force genotype. The input list of small variants can be a *.vcf or *.vcf.gz file.
The current limitations of ForceGT are as follows:
ForceGT is supported for germline small variant calling in the V3 mode. The V1, V2, and V2+ modes are not supported.
ForceGT is also supported for somatic small variant calling.
ForceGT variants do not propagate through joint genotyping.
DRAGEN supports only a single ForceGT VCF input file, which must meet the following requirements:
The input has to be a valid VCF file according to version 4.2 of the VCF standard. For instance, it has to have at least eight tab-delimited columns and records need to be sorted by reference contig and position.
The header has to list the same contigs as the reference used for variant calling. All variants must refer to one of these contig names.
Variants have to be normalized (parsimonious and left-aligned, see below).
It must not contain any multinucleotide or complex variants (AT -> C). These are variants that require more than one substitution / insertion / deletion to go from REF allele to ALT allele and are ignored.
Any deletions longer than 50bp are filtered out.
Any variant will only be called once. Duplicate entries will be ignored.
The following nonnormalized variant will cause undefined behavior in DRAGEN:
Instead of…
use…
Force genotyping requires an input VCF and can be used with DRAGEN software in VCF, GVCF or VCF+GVCF mode. In all cases the output file(s) contains all regular calls and the forceGT variants, as follows:
For a ForceGT call that was not called by the variant caller (not common), the call is tagged with FGT in the INFO field.
For a germline ForceGT call that was also called by the variant caller and filter field is PASS, the call is tagged with NML;FGT in the INFO field (NML denotes normal). In somatic mode, the call is tagged with FGT;SOM.
For a normal call (and PASS) by the variant caller, with no ForceGT call (normal), no extra tags are added (no NML tag, no FGT tag).
This scheme distinguishes among calls that are present due to FGT only, common in both ForceGT input and normal calling, and normal calls.
All the variants in the input ForceGT VCF are genotyped and present in the output file. The following table lists the reported GTs for the variants.
At a position with no coverage---./. or .
At a position with coverage but no reads supporting ALT allele---0/0 or 0
At a position with coverage and reads supporting ALT allele---dependent on pipeline (germline/somatic)
If DRAGEN calls a variant that is different from the one specified in the input ForceGT VCF, the output contains the following multiple entries at the same position:
One entry for the default DRAGEN variant call
One entry each for every variant call present in the input ForceGT-VCF at that position
If a target BED file is provided along with the input ForceGT VCF, then the output file only contains ForceGT variants that overlap the BED file positions.
The filtering step identifies de novo variant calls of the joint calling workflow in regions with ploidy changes. Because de novo calling can have reduced specificity in regions where at least one of the pedigree members shows non-diploid genotypes, the de novo variant filtering marks the relevant variants and can therefore improve the specificity of the call set.
Based on the structural and copy number variant calls of the pedigree, the FORMAT/DN field in the proband is changed from the original DeNovo value to DeNovoSV or DeNovoCNV if the de novo variant overlaps with a ploidy-changing SV or CNV, respectively. All other variant details remain unchanged, and all variants of the input VCF are also present in the filtered output VCF. Structural or copy number variants that result in no change of ploidy, such as inversions, are not considered in the filtering. As an example, a de novo SNV call in the input VCF
Overlapping with an SV duplication in the proband, mother or father would be represented in the filtered output VCF as follows:
The following is an example command line for running the de novo filtering, based on the files returned by the joint calling workflows:
The following options are used for de novo variant filtering:
--dn-input-vcf---Joint small variant VCF from the de novo calling step to be filtered.
--dn-output-vcf---File location to which the filtered VCF should be written. If not specified, the input VCF is overwritten.
--dn-sv-vcf---Joint structural variant VCF from the SV calling step. If omitted, checks with overlapping structural variants are skipped.
--dn-cnv-vcf---Joint copy number variant VCF from the CNV calling step. If omitted, checks with overlapping copy number variants are skipped.
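The relabeling rule described in this section can be sketched as a simple interval-overlap check. This is an illustration of the stated behavior only, not DRAGEN's implementation: the function names are hypothetical, the region dictionaries are assumed to already contain only ploidy-changing events (for example, with inversions removed), and checking SV overlap before CNV overlap is an arbitrary choice made for the sketch.

```python
def overlaps(pos, intervals):
    """True if a 1-based position falls inside any (start, end) interval."""
    return any(start <= pos <= end for start, end in intervals)


def relabel_de_novo(dn_value, chrom, pos, sv_regions, cnv_regions):
    """Return the FORMAT/DN value after de novo filtering.

    sv_regions / cnv_regions: {chrom: [(start, end), ...]} built from the
    joint SV and CNV VCFs. Pass an empty dict to skip the respective check,
    mirroring an omitted --dn-sv-vcf or --dn-cnv-vcf.
    """
    if dn_value != "DeNovo":
        return dn_value  # only de novo calls are relabeled
    if overlaps(pos, sv_regions.get(chrom, [])):
        return "DeNovoSV"
    if overlaps(pos, cnv_regions.get(chrom, [])):
        return "DeNovoCNV"
    return dn_value


# Hypothetical regions and positions.
sv = {"chr1": [(1000, 5000)]}
cnv = {"chr2": [(200, 900)]}
print(relabel_de_novo("DeNovo", "chr1", 1500, sv, cnv))
print(relabel_de_novo("DeNovo", "chr2", 500, sv, cnv))
print(relabel_de_novo("DeNovo", "chr3", 500, sv, cnv))
```

All other fields of the record stay untouched, so the filtered VCF has the same number of records as the input.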
DRAGEN provides post-VCF variant filtering based on annotations present in the VCF records. Default and non-default variant hard filtering are described below. However, due to the nature of DRAGEN's algorithms, which incorporate the hypothesis of correlated errors from within the core of variant caller, the pipeline has improved capabilities in distinguishing the true variants from noise, and therefore the dependency on post-VCF filtering is substantially reduced. For this reason, the default post-VCF filtering in DRAGEN is very simple.
The default filters in the germline pipeline are as follows:
##FILTER=<ID=DRAGENSnpHardQUAL,Description="Set if true:QUAL < 10.41 (3 when ML recalibration is enabled)">
##FILTER=<ID=DRAGENIndelHardQUAL,Description="Set if true:QUAL < 7.83 (3 when ML recalibration is enabled)">
##FILTER=<ID=LowDepth,Description="Set if true:DP <= 1">
##FILTER=<ID=PloidyConflict,Description="Genotype call from variant caller not consistent with chromosome ploidy">
DRAGENSnpHardQUAL and DRAGENIndelHardQUAL: For all contigs other than the mitochondrial contig, the default hard filtering consists of thresholding the QUAL value only. Different default QUAL thresholds are applied to SNPs and indels.
LowDepth: This filter is applied to all variant calls with INFO/DP <= 1.
PloidyConflict: This filter is applied to all variant calls on chrY of a female subject, if female is specified on the DRAGEN command line or if female is detected by the ploidy estimator.
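The QUAL and depth rules above can be summarized in a few lines. The sketch below uses the thresholds from the FILTER header lines (10.41 for SNPs, 7.83 for indels, 3 when ML recalibration is enabled); the function name is hypothetical, and the PloidyConflict filter and the mitochondrial contig are deliberately left out.

```python
def default_germline_filters(qual, depth, is_snp, ml_enabled=True):
    """Return the default hard-filter labels for one non-mitochondrial call."""
    if ml_enabled:
        threshold = 3.0
    else:
        threshold = 10.41 if is_snp else 7.83

    filters = []
    if qual < threshold:
        filters.append("DRAGENSnpHardQUAL" if is_snp else "DRAGENIndelHardQUAL")
    if depth <= 1:
        filters.append("LowDepth")
    return filters or ["PASS"]


print(default_germline_filters(qual=5.0, depth=30, is_snp=True, ml_enabled=False))
print(default_germline_filters(qual=5.0, depth=30, is_snp=True, ml_enabled=True))
```

The same QUAL of 5 fails the SNP filter without ML recalibration but passes when ML recalibration is enabled, because the threshold drops to 3.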
For the mitochondrial contig, DRAGEN processes it through a continuous AF pipeline, which is similar to the somatic variant calling pipeline. Please refer to Mitochondrial Calling for the filtering details.
DRAGEN supports basic filtering of variant calls as described in the VCF standard. You can apply any number of filters with the --vc-hard-filter option, which takes a semicolon-delimited list of expressions, as follows:
where the list of criteria is itself a list of expressions, delimited by the || (OR) operator in this format:
The meaning of these expression elements is as follows:
filterID---The name of the filter, which is entered in the FILTER column of the VCF file for calls that are filtered by that expression.
snp/indel/all---The subset of variant calls to which the expression should be applied.
annotation ID---The variant call record annotation for which values should be checked for the filter. Supported annotations include FS, MQ, MQRankSum, QD, and ReadPosRankSum.
comparison operator---The numeric comparison operator to use for comparing to the specified filter value. Supported operators include <, ≤, =, ≠, ≥, and >. For example, the following expression would mark with the label "SNP filter" any SNPs with FS < 2.1 or with MQ < 100, and would mark with "indel filter" any records with FS < 2.2 or with MQ < 110:
This example is for illustration purposes only and is NOT recommended for use with DRAGEN V3 output. Illumina recommends using the default hard filters. The only supported operation for combining value comparisons is OR, and there is no support for arithmetic combinations of multiple annotations. More complex expressions may be supported in the future.
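To make the structure of these expressions concrete, the sketch below parses and evaluates a filter specification outside of DRAGEN. Several details are assumptions made for illustration: the colon-separated `filterID:subset:criteria` layout, the ASCII operator spellings (`<=`, `!=`, `>=`), and the function names. Only the semicolon between expressions, the || (OR) between criteria, and the element meanings come from the description above.

```python
import operator
import re

OPERATORS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "!=": operator.ne, ">=": operator.ge, ">": operator.gt,
}
# annotation ID, comparison operator, numeric value
CRITERION = re.compile(r"^\s*(\w+)\s*(<=|>=|!=|=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_hard_filters(spec):
    """Parse 'id:subset:crit || crit; id:subset:crit' into a list of filters."""
    filters = []
    for expression in spec.split(";"):
        if not expression.strip():
            continue
        filter_id, subset, criteria = (p.strip() for p in expression.split(":", 2))
        parsed = []
        for criterion in criteria.split("||"):
            match = CRITERION.match(criterion)
            if match is None:
                raise ValueError(f"cannot parse criterion: {criterion!r}")
            annotation, op, value = match.groups()
            parsed.append((annotation, OPERATORS[op], float(value)))
        filters.append((filter_id, subset, parsed))
    return filters


def apply_hard_filters(filters, variant_type, annotations):
    """Return the filter IDs triggered for one call (OR over the criteria)."""
    hits = []
    for filter_id, subset, criteria in filters:
        if subset not in ("all", variant_type):
            continue
        if any(name in annotations and op(annotations[name], value)
               for name, op, value in criteria):
            hits.append(filter_id)
    return hits


spec = "SNP filter:snp:FS < 2.1 || MQ < 100; indel filter:indel:FS < 2.2 || MQ < 110"
filters = parse_hard_filters(spec)
print(apply_hard_filters(filters, "snp", {"FS": 1.0, "MQ": 250.0}))
```

Because criteria are combined with OR only, a call is labeled as soon as any single comparison is true.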
The orientation bias filter is designed to reduce noise typically associated with the following:
Pre-adapter artifacts introduced during genomic library preparation (eg, a combination of heat, shearing, and metal contaminants can result in 8-oxoguanine pairing with either cytosine or adenine, ultimately leading to G→T transversion mutations during PCR amplification), or
FFPE (formalin-fixed paraffin-embedded) artifacts. FFPE artifacts stem from formaldehyde deamination of cytosines, which results in C to T transition mutations.
The orientation bias filter can only be used on somatic pipelines. To enable the filter, set the --vc-enable-orientation-bias-filter option to true. The default is false.
The artifact type to be filtered can be specified with the --vc-orientation-bias-filter-artifacts option. The default is C/T,G/T, which correspond to FFPE and OxoG artifacts, respectively. Valid values include C/T, or G/T, or C/T,G/T,C/A.
An artifact (or an artifact and its reverse complement) cannot be listed twice. For example, C/T,G/A is not valid, because C→T and G→A are reverse complements.
The orientation bias filter adds the following information:
##FORMAT=<ID=F1R2,Number=R,Type=Integer,Description="Count of reads in F1R2 pair orientation supporting each allele">
##FORMAT=<ID=F2R1,Number=R,Type=Integer,Description="Count of reads in F2R1 pair orientation supporting each allele">
##FORMAT=<ID=OBC,Number=1,Type=String,Description="Orientation Bias Filter base context">
##FORMAT=<ID=OBPa,Number=1,Type=String,Description="Orientation Bias prior for artifact">
##FORMAT=<ID=OBParc,Number=1,Type=String,Description="Orientation Bias prior for reverse compliment artifact">
##FORMAT=<ID=OBPsnp,Number=1,Type=String,Description="Orientation Bias prior for real variant">
Please note that the OBF filter runs as a standalone process after DRAGEN is complete. The VC metrics that are computed as part of DRAGEN SNV caller will not be updated and will not reflect the additional variants that are filtered in this stage.
DRAGEN supports pedigree-based and population-based germline variant joint analysis for multiple samples. A pedigree-based analysis deals with samples from the same species which are related to each other. A population-based analysis compares samples of the same species which are unrelated to each other.
Joint analysis requires a gVCF file for each sample. To create a gVCF file, run the germline small variant caller with the --vc-emit-ref-confidence gVCF option. A reduced-size germline gVCF can also be written with the --vc-compact-gvcf option. This results in a significant speedup for downstream analysis using gVCF Genotyper. Note that this compact format is not compatible with a pedigree analysis.
The gVCF file contains information on the variant positions and on positions determined to be homozygous to the reference genome. For homozygous regions, the gVCF file includes statistics that indicate how well reads support the absence of variants or alternative alleles. Contiguous homozygous runs of bases with similar levels of confidence are grouped into blocks, referred to as hom-ref blocks. Not all entries in the gVCF are contiguous. A reference might contain gaps that are not covered by either a variant line or a hom-ref block. Gaps correspond to regions that are not callable. A region is not callable if there is not at least one read mapped to the region with a MAPQ score above zero.
The DRAGEN germline variant caller has an option, --vc-combine-phased-variants-distance, to combine phased variants in the gVCF output. Input gVCF files created with this option cannot be processed in a population-based analysis using gVCF Genotyper.
The option to combine phased variants is switched off by default. For details, refer to the section on germline small variant calling in this user guide.
If force genotyping was enabled for any input file, any ForceGT calls that are not also called by the variant caller will be ignored.
Similarly, targeted variant calls (option --targeted-merge-vc) in any gVCF file that are not also called by the variant caller are ignored.
Both pedigree- and population-based joint analysis can process gVCF files written by the GATK v4.1 variant caller.
There are two available joint analysis output files:
Multisample VCF--A VCF file containing a column with genotype information for each of the input samples according to the input variants.
Multisample gVCF--A gVCF file augmenting the content of a multisample VCF file, similar to how a gVCF file augments a VCF file for a single sample. In between variant sites, the multisample gVCF contains statistics that describe the level of confidence that each sample is homozygous to the reference genome. Multisample gVCF is a convenient format for combining the results from a pedigree or small cohort into a single file. If using a large number of samples, fluctuation in coverage or variation in any of the input samples creates a new hom-ref block, which causes a highly fragmented block structure and a large output file that can be slow to create.
The multisample gVCF output is only available in the pedigree-based analysis.
The following example shows a single line from a multi-sample VCF where one sample has a variant, and the other two samples are in a gVCF gap. Gaps are represented by "./.:.:".
In hom-ref blocks, the following FORMAT fields are calculated uniquely.
FORMAT/DP--In a single sample gVCF, the FORMAT/DP reported at a hom-ref position is the median DP in that band. In a multisample gVCF, the FORMAT/DP reported at a hom-ref position is the MIN_DP from hom-ref calls.
FORMAT/AD--In single sample gVCF, values represent the position in the band where DP=median DP. In the multisample gVCF, AD values at hom-ref positions are copied from the single sample gVCF.
FORMAT/AF--Values are based on FORMAT/AD.
FORMAT/PL--Values represent the Phred likelihoods per genotype hypothesis. For hom-ref blocks, each value in FORMAT/PL represents the minimum value across all positions within the band.
FORMAT/SPL and FORMAT/ICNT--Parameters reported in the gVCF records, including both hom-ref blocks and variant records. The parameters are used to compute the confidence score of a variant being de novo in the proband of a trio. For SNP, FORMAT/PL and FORMAT/SPL are both used as input to the DeNovo Caller. FORMAT/PL represents Phred likelihoods obtained from the genotyper, if the genotyper is called. FORMAT/SPL represents Phred likelihoods obtained from column-wise estimation, pregraph. Each value in FORMAT/SPL represents the minimum across all positions within the band. For INDEL, the PL value is computed in the joint pedigree calling step based on the FORMAT/ICNT reported in the gVCF file. FORMAT/ICNT consist of two values. The first value is the number of reads with no indels at the position, and the second value is the number of reads with indels at the position. Each value in FORMAT/ICNT represents the maximum of the value across all positions within the band.
In the following example hom-ref block, ICNT provides information on whether each sample contains an Indel at the position of interest. If the proband contains an indel at the position and the ICNT of the parents does not indicate any read supporting an indel, then the confidence score is high for the proband to have an indel de novo call at the position.
SPL and ICNT values are specific to DRAGEN. The GATK variant caller does not output SPL and ICNT values.
In a single sample gVCF, FORMAT/DP reported at a hom-ref position is the median DP in the band. The minimum is also computed and printed as MIN_DP for the band.
In the multisample gVCF, MIN_DP from hom-ref calls is printed as FORMAT/DP, and AD is copied from the single sample gVCF. Therefore, at a hom-ref position in the multisample gVCF output, DP is not necessarily the sum of the AD values.
Use pedigree mode to jointly analyze samples from related individuals and to perform de novo calling.
To invoke pedigree mode, set the --enable-joint-genotyping option to true. Use the --pedigree-file option to specify the path to a pedigree file that describes the relationship between samples.
The pedigree file must be a tab-delimited text file with the file name ending in the .ped extension. The following information is required.
Family_ID---The pedigree identifier.
Individual_ID---The ID of the individual.
Paternal_ID---The ID of the individual's father. If the founder, the value is 0.
Maternal_ID---The ID of the individual's mother. If the founder, the value is 0.
Sex---The sex of the sample. If male, the value is 1. If female, the value is 2.
Phenotype---The phenotype of the sample. If unknown, the value is 0. If unaffected, the value is 1. If affected, the value is 2.
The following is an example of an input pedigree file.
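A trio pedigree in the six-column, tab-delimited layout described above can be generated as in the sketch below. The family and individual IDs are hypothetical placeholders, and the helper names are not part of DRAGEN; `find_trios` simply lists the individuals whose father and mother are both present in the file.

```python
# Columns: Family_ID, Individual_ID, Paternal_ID, Maternal_ID, Sex, Phenotype
trio = [
    ("FAM1", "father", "0", "0", 1, 1),            # founder, male, unaffected
    ("FAM1", "mother", "0", "0", 2, 1),            # founder, female, unaffected
    ("FAM1", "proband", "father", "mother", 1, 2),  # child, male, affected
]


def format_ped(rows):
    """Render pedigree rows as tab-delimited text, one individual per line."""
    return "".join("\t".join(str(field) for field in row) + "\n" for row in rows)


def find_trios(rows):
    """Return (child, father, mother) for every individual with both parents listed."""
    ids = {row[1] for row in rows}
    return [
        (individual, father, mother)
        for _, individual, father, mother, _, _ in rows
        if father in ids and mother in ids
    ]


print(format_ped(trio), end="")
print(find_trios(trio))
```

The text returned by `format_ped` would be saved to a file whose name ends in .ped and passed with --pedigree-file.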
The De Novo Caller identifies all the trios within the pedigree and generates a de novo score for each child. The De Novo Caller supports multiple trios within a single pedigree. Pedigree Mode supports de novo calling for small, structural, and copy number variants.
Pedigree Mode is run in multiple steps. The following is an example workflow for a trio using FASTQ input.
1. Run single sample alignment and variant calling to generate the following per-sample outputs, which serve as inputs for Pedigree Mode:
   - gVCF files for the Small Variant Caller.
   - *.tn.tsv files for the Copy Number Caller.
   - BAM files for the Structural Variant Caller.
2. Run Pedigree Mode for Small Variant Caller. For more information, see Small Variant DeNovo Calling.
3. Run Pedigree Mode for Copy Number Caller. For more information, see Multisample CNV Calling.
4. Run Pedigree Mode for Structural Variant Caller. For more information, see Structural Variant De Novo Quality Scoring.
5. Run De Novo Small Variant Filtering. For more information, see De Novo Small Variant Filtering.
The Small Variant De Novo Caller considers a trio of samples at a time. The samples are related via a pedigree file. The Small Variant De Novo Caller determines all positions that have a Mendelian conflict based on the genotype from the individual sample gVCFs. Sex chromosomes in males are treated as haploid apart from the PAR regions, which are treated as diploid.
Each of those positions is then processed through the Pedigree Caller to compute a joint posterior probability matrix for the possible genotypes. The probabilities are used to determine whether the proband has a de novo variant with a DQ confidence score. All three subjects are assumed to have an independent error probability.
At positions where the original genotype from the gVCFs shows a double Mendelian conflict (e.g., 0/0+0/0->1/1 or 1/1+1/1->0/0), the genotypes of the trio samples can be adjusted to the combination with the highest joint posterior probability that has at least one Mendelian conflict.
The DQ formula is DQ = -10 log10(1 - Pdenovo), where Pdenovo is the sum of all entries in the joint posterior probability matrix that have one or more Mendelian conflicts.
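The DQ calculation can be sketched in Python as follows. This is a minimal illustration for a diploid, bi-allelic site: the joint posterior values are hypothetical, and the conflict check and data layout are assumptions, not DRAGEN's implementation.

```python
import math
from itertools import product

GENOTYPES = [(0, 0), (0, 1), (1, 1)]  # diploid, bi-allelic, unphased


def has_mendelian_conflict(father, mother, child):
    """True if the child genotype cannot be formed from one allele of each parent."""
    possible = {tuple(sorted((f, m))) for f in father for m in mother}
    return tuple(sorted(child)) not in possible


def de_novo_quality(joint_posterior):
    """Return (Pdenovo, DQ) from a {(father, mother, child): probability} mapping."""
    p_denovo = sum(
        p for (f, m, c), p in joint_posterior.items()
        if has_mendelian_conflict(f, m, c)
    )
    if p_denovo >= 1.0:
        return p_denovo, math.inf
    return p_denovo, -10 * math.log10(1 - p_denovo)


# Hypothetical posterior: most of the mass sits on an inherited configuration.
posterior = {combo: 0.0 for combo in product(GENOTYPES, repeat=3)}
posterior[((0, 1), (0, 0), (0, 1))] = 0.85   # allele inherited from the father
posterior[((0, 0), (0, 0), (0, 1))] = 0.15   # one Mendelian conflict
p_denovo, dq = de_novo_quality(posterior)
print(round(p_denovo, 3), round(dq, 3))
```

Because DQ grows without bound as Pdenovo approaches 1, the sketch returns infinity in that limit; how DRAGEN caps or rounds the value is not stated here.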
In the GT overwrite step, it is possible for the GT of the parents to be overwritten. In the case of multiple trios, the GT of the parents is based on the last trio processed. The trios are processed in the order they are listed in the pedigree file. DRAGEN currently does not add an annotation in the VCF in cases where the GT was overwritten.
The output multisample VCF file is annotated with the FORMAT/DQ and FORMAT/DN fields, which represent a de novo quality score and an associated de novo call. The DN field indicates the de novo status of each call.
The following are the possible values:
Inherited--The called trio genotype is consistent with Mendelian inheritance.
LowDQ--The called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold.
DeNovo--The called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold.
The following is an example VCF line for a trio:
1 16355525 . G A 34.46 PASS AC=1;AF=0.167;AN=6;DP=45;FS=6.69;MQ=108.04;MQRankSum=-0.156;QD=2.46;ReadPosRankSum=0;SOR=0.016 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ 0/1:11,3:0.214:14:39:PASS:8,2:3,1:74,0,47:39.454,0.00053613,49.99:0,1,104:74,0,47:DeNovo:0.67375 0/0:18,0:0:16:48:PASS:.:.:0,48,605:.:0,12,224:0,48,255:.:. 0/0:14,0:0:14:42:PASS:.:.:0,42,490:.:0,5,223:0,42,255:.:.
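To show how the DN and DQ annotations can be read back from such a record, the following Python sketch splits the example line above into per-sample FORMAT dictionaries and re-applies the DN rules. The helper names are hypothetical, the line is split on whitespace only because the example is shown space-separated (real VCF columns are tab-separated), and the assumption that the proband is the first sample column comes from the example, not from a documented rule.

```python
EXAMPLE_LINE = (
    "1 16355525 . G A 34.46 PASS "
    "AC=1;AF=0.167;AN=6;DP=45;FS=6.69;MQ=108.04;MQRankSum=-0.156;QD=2.46;"
    "ReadPosRankSum=0;SOR=0.016 "
    "GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP:PP:DPL:DN:DQ "
    "0/1:11,3:0.214:14:39:PASS:8,2:3,1:74,0,47:39.454,0.00053613,49.99:"
    "0,1,104:74,0,47:DeNovo:0.67375 "
    "0/0:18,0:0:16:48:PASS:.:.:0,48,605:.:0,12,224:0,48,255:.:. "
    "0/0:14,0:0:14:42:PASS:.:.:0,42,490:.:0,5,223:0,42,255:.:."
)


def sample_fields(vcf_line):
    """Return one {FORMAT key: value} dict per sample column."""
    columns = vcf_line.split()
    keys = columns[8].split(":")
    return [dict(zip(keys, sample.split(":"))) for sample in columns[9:]]


def classify(conflict, dq, threshold):
    """Apply the DN rules listed above to a called trio genotype."""
    if not conflict:
        return "Inherited"
    return "DeNovo" if dq >= threshold else "LowDQ"


proband = sample_fields(EXAMPLE_LINE)[0]
# 0/0 + 0/0 -> 0/1 is a Mendelian conflict. A threshold of 0.05 is used here
# purely as an illustrative value.
print(proband["DN"], classify(True, float(proband["DQ"]), 0.05))
```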
The following command line options are available for de novo small variant calling.
--enable-joint-genotyping: Run the joint genotyping caller.
--pedigree-file: Specify the path to a pedigree file that describes the relationship between samples. It is possible to run JointGenotyper without a pedigree file on unrelated samples, but this is no longer recommended for gVCF variant calls from DRAGEN 3.10 or newer.
--variant or --variant-list: Specify the gVCF input to the workflow. The pedigree caller can read input gVCF files from an AWS S3 bucket, Azure storage BLOB, or pre-signed URL.
--qc-snp-denovo-quality-threshold: Specify the minimum DQ value for a SNP to be considered de novo. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
--qc-indel-denovo-quality-threshold: Specify the minimum DQ value for an indel to be considered de novo. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.
--output-directory: The output directory. This is required.
--output-file-prefix: The prefix used to label all output files. This is required.
-r: The directory where the hash table resides.
The output of the joint genotyper depends on the order of input gVCF files passed on the command line using --variant or --variant-list. It is recommended to use the same input order when re-analyzing gVCFs to ensure the output is the same as an earlier run.
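The options above can be combined into a single command line. The following Python sketch assembles one as an argument list; it assumes the executable is named dragen, and every path and prefix shown is a hypothetical placeholder. Only options listed in this section are used.

```python
import shlex


def joint_genotyping_command(ref_dir, ped_file, gvcfs, out_dir, prefix):
    """Assemble a de novo small variant calling command as an argv list."""
    argv = [
        "dragen",
        "-r", ref_dir,
        "--enable-joint-genotyping", "true",
        "--pedigree-file", ped_file,
        "--output-directory", out_dir,
        "--output-file-prefix", prefix,
    ]
    # Keep the gVCF order fixed between runs: the output depends on it.
    for gvcf in gvcfs:
        argv += ["--variant", gvcf]
    return argv


# Hypothetical paths, for illustration only.
argv = joint_genotyping_command(
    ref_dir="/ref/hash_table",
    ped_file="/data/trio.ped",
    gvcfs=["/data/child.gvcf.gz", "/data/father.gvcf.gz", "/data/mother.gvcf.gz"],
    out_dir="/out",
    prefix="trio",
)
print(shlex.join(argv))
```

Building the command as a list rather than a string keeps the gVCF order explicit and avoids quoting problems when the list is handed to a process launcher.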
DRAGEN provides a population-based analysis option to jointly analyze samples from unrelated individuals.
The tool for population-based analysis is the iterative gVCF Genotyper. Its input is a set of single or multisample gVCFs. The output is a multisample VCF that contains one entry for any variant seen in any of the input gVCFs. The variants are genotyped across all input samples using information from the hom-ref blocks as necessary. The iterative gVCF Genotyper does not adjust genotypes based on population information but it provides means to filter variant sites based on information leveraged from the population. See Iterative gVCF Genotyper analysis for information on the available command line options.
To compare multiple pedigrees, you can run gVCF Genotyper on the output of a pedigree analysis and merge multiple joint-called pedigrees into a single multisample VCF. To enable this, run the pedigree analysis using the --enable-multi-sample-gvcf=true option to write a multisample gVCF.
gVCF Genotyper offers an iterative workflow to aggregate new samples into an existing cohort. The iterative workflow allows users to incrementally aggregate new batches of samples with existing batches, without having to redo the analysis from scratch across all samples every time new samples become available. The workflow takes single sample gVCF files as input, and can be performed in "step-by-step mode" if multiple batches of samples are available, or in "end-to-end mode" if only a single batch of samples is available. Multi-sample gVCF files output from the Pedigree Caller (described above) are also accepted as input. gVCF Genotyper can accept input gVCF files generated using DRAGEN version 3.2.6 or later.
Step 1 (gVCF aggregation): the user can use iterative gVCF Genotyper to aggregate a batch of gVCF files into a cohort file and a census file. The cohort file is a condensed data format to store gVCF data in multiple samples, similar to a multi-sample gVCF. The census file stores summary statistics of all the variants and hom-ref blocks among samples in the cohort. As part of this step, adjacent hom-ref blocks with matching FILTER columns are further merged to reduce the disk footprint of the intermediate files, FORMAT field values being base-pair weight averaged in the process.
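The hom-ref block merging described in Step 1 can be pictured with the following Python sketch, which merges adjacent blocks that share a FILTER value and averages a numeric FORMAT value weighted by block length in base pairs. The data structure, the LowGQ filter name, and the absence of rounding are assumptions for illustration; the actual intermediate file format is not described here.

```python
from dataclasses import dataclass


@dataclass
class HomRefBlock:
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    filter: str
    dp: float       # stands in for any numeric FORMAT value

    @property
    def length(self):
        return self.end - self.start + 1


def merge_blocks(blocks):
    """Merge adjacent blocks with matching FILTER, averaging DP by block length."""
    merged = []
    for block in blocks:
        prev = merged[-1] if merged else None
        if prev and prev.filter == block.filter and prev.end + 1 == block.start:
            total = prev.length + block.length
            dp = (prev.dp * prev.length + block.dp * block.length) / total
            merged[-1] = HomRefBlock(prev.start, block.end, prev.filter, dp)
        else:
            merged.append(block)
    return merged


blocks = [
    HomRefBlock(1, 10, "PASS", 30),
    HomRefBlock(11, 40, "PASS", 10),
    HomRefBlock(41, 50, "LowGQ", 5),   # hypothetical filter name
]
merged = merge_blocks(blocks)
print(merged)
```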
When a large number of samples are available, the user can divide samples into multiple batches each with similar sample size (e.g. 1000 samples), and repeat Step 1 for every batch.
Step 2 (census aggregation): after all per batch census files are generated, the user can aggregate them into a single global census file. This step scales to aggregate thousands of batches, in a much more efficient way than aggregating gVCFs from all batches. When a new batch of samples becomes available, the user only needs to perform Step 1 on that batch, then aggregate the census file from the new batch with the global census file from all previous batches in order to generate an updated global census file.
Step 3 (msVCF generation): every time a global census file is updated, with new variant sites discovered and/or variant statistics updated at existing variant sites, the user can take a per-batch cohort file, per-batch census file and the global census file as input, and generate a multi-sample VCF for one batch of samples. The output multi-sample VCF contains the variants and alleles discovered in all samples from all batches, and also includes global statistics such as allele frequencies, the number of samples with or without genotypes, and the number of samples without coverage. Similar statistics among samples in the batch are also included. This step can be repeated for every batch of samples, and the number of records in each output multi-sample VCF is the same across all batches.
To facilitate parallel processing on distributed compute nodes, for every step above, the user can choose to split the genome into shards of equal size, and process each shard using one instance of iterative gVCF Genotyper on each compute node. See option --shard below.
There is a special treatment of alternative or unaligned contigs when the --shard option is enabled: all contigs that are not autosomes, X, Y or chrM are included in the last shard. No other contigs will be assigned to the last shard. The mitochondrial contig will always be on its own in the second to last shard.
If a combined msVCF of all batches is required, an additional step can be separately run to merge all of the batch msVCF files into a single msVCF containing all samples.
--enable-gvcf-genotyper-iterative: Set to true to run the iterative gVCF Genotyper (always required).
--ht-reference: The file containing the reference sequence in FASTA format (always required).
--output-directory: The output directory (always required).
--output-file-prefix: The prefix used to label all output files (optional, default value dragen).
--shard: Use this option to process only a portion ('shard') of the genome, when distributing the work across multiple compute nodes in a production workflow. Provide the index (1-based) of the shard to process and the total number of shards, in the format n/N (e.g. 1/50 means shard 1 of 50 shards in total). To facilitate concurrent processing within each job, the shard will by default be split into 10x the number of available threads. This option assumes a human reference genome and might not work for non-human reference genomes.
--gg-regions: Use this option to test iterative gVCF Genotyper only for a subset of regions in the genome. The value is a comma-delimited list of regions (chr:start-end). Contig names must match those in the reference and no region may overlap another. If a single region larger than 1 Mb is selected, multiple threads are enabled. Otherwise, one thread is launched per region. This assumes that the --shard option is not given. It is important that the same regions are chosen for each of steps 1, 2 and 3.
--gg-regions-bed: If a path to a BED file is provided as the value, this option, like the one above, limits the iterative gVCF Genotyper processing to the genome regions specified therein, which must be non-overlapping. This option is intended for exome input data. It results in faster processing times and is compatible with sharding. This option only takes effect in step 1 or end-to-end mode. It differs from the option above in that, if the number of regions exceeds 10 times the number of available threads, they will not necessarily be processed by independent threads.
--gg-discard-ac-zero: If set to true, the gVCF Genotyper does not print variant alleles that are not called (hom-ref genotype) in any sample. The default value is true.
--gg-remove-nonref: If set to true, the <NON_REF> symbolic allele is removed in the process of reading in input gVCF files. The default value is true.
--gg-vc-filter: Discard input variants that failed filters in the upstream caller. The default is false. Affected records will have their genotype set to hom-ref and the filter string "ggf" added to FORMAT/FT.
--gg-hard-filter: Specifies a filtering expression to be applied to the output msVCF records. See msVCF hard filtering below. The default is to apply no filters.
--gg-skip-filtered-sites: Omits msVCF records that fail the given hard filter. The default is false.
--gg-msvcf-format-fields: Can be used to override the default set of sample genotype fields in the output msVCF. See msVCF metric customization below.
--gg-msvcf-info-fields: Can be used to override the default set of site-wise INFO fields in the output msVCF. See msVCF metric customization below.
--gg-squeeze-msvcf: Set to omit genotype fields other than GT from the output msVCF for confidently called hom-ref sample records.
--gg-gq-squeezing-threshold: Use in conjunction with the previous option to adjust the threshold on GQ (default 30) that signifies a confident hom-ref call.
--gg-output-type: Set to spvcf to write the output in spVCF format rather than the default msVCF. See File size optimizations below for details.
--gg-diploidify: In the output msVCF file, convert haploid calls to diploid. The diploidified genotype is homozygous for the allele in the haploid call, e.g. 1 becomes 1/1. The LPL field is also diploidified for these samples. Site metrics, such as allele counts, are calculated before diploidification. Diploidifying genotypes may ease the ingestion of msVCF files into downstream analysis tools, such as Hail and Plink. When this option is enabled, it is possible to include the DF FORMAT field (included by default), which signifies whether or not a genotype has been diploidified; see msVCF metric customization below.
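The genotype conversion performed by --gg-diploidify can be sketched in Python as below. This is only an illustration of the GT and DF behaviour described above; the function name is hypothetical, the handling of LPL is not shown, and treating a missing haploid call the same way as any other haploid call is an assumption.

```python
def diploidify(gt):
    """Return (genotype, DF flag); haploid calls become homozygous diploid calls."""
    if "/" in gt or "|" in gt:
        return gt, 0                 # already diploid, DF = 0
    return f"{gt}/{gt}", 1           # e.g. "1" -> "1/1", DF = 1


for call in ["1", "0", "0/1"]:
    print(call, "->", diploidify(call))
```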
--gvcfs-to-cohort-census: Set to true to aggregate gVCF files from one batch of samples into a cohort file and a census file.
--variant-list: The path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant: If --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.
--aggregate-censuses: Set to true to aggregate a list of per batch census files into a global census file.
--input-census-list: The path to a file containing a list of input per batch census files (from Step 1), with the absolute path to each file on a separate line.
--generate-msvcf: Set to true to generate a multi-sample VCF for one batch of samples.
--input-cohort-file: The path to the per batch cohort file (from Step 1).
--input-census-file: The path to the per batch census file (from Step 1).
--input-global-census-file: The path to the global census file (from Step 2).
--gvcfs-to-msvcf: Set to true to enable the end-to-end mode. This is the default if none of steps 1, 2 or 3 above is selected.
--variant-list: The path to a file containing a list of input gVCF files, with the absolute path to each file on a separate line.
--variant: If --variant-list is not given, use this option for each input gVCF file. Absolute file paths must be provided.
--merge-batches: Set to true to merge msVCF files for a set of batches.
--input-batch-list: The path to a file containing a list of msVCF files to be merged, with the path to each file on a separate line. All the files listed must have been generated from the same global census file, with the same set of options, and by default all batches pertaining to that global census must be included in the merge.
--gg-enable-indexing: Set to true (the default) to generate a tabix index for the merged msVCF.
--gg-merge-subset: Set to override the restriction that all batches must be included in the merge.
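The three steps of the step-by-step workflow can be expressed with the options listed above. The following Python sketch builds one argument list per step; it assumes the executable is named dragen, and all file names and paths (including the cohort and census file names) are hypothetical placeholders, since the actual output file naming is not described here.

```python
def common_args(reference, out_dir, prefix):
    """Options required by every iterative gVCF Genotyper invocation."""
    return [
        "dragen",
        "--enable-gvcf-genotyper-iterative", "true",
        "--ht-reference", reference,
        "--output-directory", out_dir,
        "--output-file-prefix", prefix,
    ]


def step1_aggregate_gvcfs(reference, out_dir, prefix, gvcf_list):
    return common_args(reference, out_dir, prefix) + [
        "--gvcfs-to-cohort-census", "true",
        "--variant-list", gvcf_list,
    ]


def step2_aggregate_censuses(reference, out_dir, prefix, census_list):
    return common_args(reference, out_dir, prefix) + [
        "--aggregate-censuses", "true",
        "--input-census-list", census_list,
    ]


def step3_generate_msvcf(reference, out_dir, prefix, cohort, census,
                         global_census, shard=None):
    argv = common_args(reference, out_dir, prefix) + [
        "--generate-msvcf", "true",
        "--input-cohort-file", cohort,
        "--input-census-file", census,
        "--input-global-census-file", global_census,
    ]
    if shard is not None:
        argv += ["--shard", shard]    # "n/N", e.g. "1/50"
    return argv


# Hypothetical file names, for illustration only.
step3 = step3_generate_msvcf(
    "/ref/genome.fa", "/out/batch1", "batch1",
    "/out/batch1/batch1.cohort", "/out/batch1/batch1.census",
    "/out/global/global.census", shard="1/50",
)
print(" ".join(step3))
```

When a new batch arrives, only step 1 for that batch and step 2 need to be rerun before step 3 is repeated per batch, as described above.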
Mimalloc is a custom memory allocation library that can yield a significant speed-up in the iterative gVCF Genotyper workflow. In some deployments, e.g. cloud, it is automatically and seamlessly used, but in other contexts it requires special user intervention to be activated, as at present it cannot be included in standard DRAGEN by default.
For this purpose, the convenience script mi_dragen.sh is provided, which loads the bundled library and can be used transparently in the same way as the DRAGEN executable. Its use is intended and supported only with the iterative gVCF Genotyper component, although it can in principle be applied to any other DRAGEN workflow. Its use for other purposes is known to possibly lead to undesirable memory overuse and should be undertaken at the user's own risk.
The output of gVCF Genotyper is a multi-sample VCF (msVCF) that contains metrics computed for all samples in the cohort.
The msVCF can become a very large file with increasing cohort size. In some cases, the file might need more storage than can be allocated by VCF parsers. This is caused by VCF entries such as FORMAT/PL which store a value for each combination of alleles. We therefore decided to replace FORMAT/PL with a tag FORMAT/LPL which stores a value only for the alleles that actually occur in the sample. Similarly, the msVCF also contains FORMAT/LAD which stores the allelic depth only for the alleles occurring in the sample.
We also added a new FORMAT/LAA field which lists 1-based indices of the alternate alleles that occur in the current sample. The allele order of other local fields is the same as that of LAA.
This approach is also referred to as local alleles and is also used by open source software such as bcftools and Hail.
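To make the local allele idea concrete, the following Python sketch reduces site-level AD and PL arrays to the local LAD and LPL arrays for a sample, given its LAA. It relies on the standard VCF ordering of diploid genotype likelihoods; the function names and input values are hypothetical, and the sketch illustrates only the indexing, not how DRAGEN derives these fields from the input gVCFs.

```python
def pl_index(a, b):
    """Index of unphased diploid genotype a/b in a PL array (VCF genotype order)."""
    lo, hi = sorted((a, b))
    return hi * (hi + 1) // 2 + lo


def localize(ad, pl, laa):
    """Reduce site-level AD and PL to the reference plus the alt alleles in LAA."""
    local = [0] + list(laa)          # site allele index of each local allele
    lad = [ad[i] for i in local]
    lpl = [
        pl[pl_index(local[j], local[k])]
        for k in range(len(local))
        for j in range(k + 1)
    ]
    return lad, lpl


# Site with REF and three ALT alleles: 4 AD values and 10 PL values.
site_ad = [20, 0, 15, 0]
site_pl = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
lad, lpl = localize(site_ad, site_pl, laa=[2])   # sample carries only ALT allele 2
print(lad, lpl)
```

With four alleles at the site the full PL array holds 10 values, while the local representation for this sample needs only 3, which is where the size saving comes from.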
When processing mitochondrial variant calls, which may contain separate records for each allele, iterative gVCF Genotyper processing differs in the following ways:
Only the record with the highest FORMAT/AF sum is kept.
The FORMAT/AF field is additionally collected and used to generate the FORMAT/LAF field in the output msVCF.
The value displayed in the QUAL column of the msVCF is the maximum of the input QUAL values for the site across the global cohort. The QUAL value will be missing if any of the batch census files used to create the global census were generated with a version of DRAGEN earlier than v4.2.
The Hardy-Weinberg Equilibrium (HWE) states that, given certain conditions, genotype and allele frequencies should remain constant between generations. Deviations from HWE can result from violations of the underlying HWE assumptions in the population, from non-random sampling, or may be artifacts of variant calling. Adherence to HWE can be assessed by comparing the observed frequencies of genotypes to those expected under HWE given the observed allele counts.
Iterative gVCF Genotyper offers several metrics for assessing adherence to HWE. It calculates both allele-wise and site-wise HWE P-values, an allele-wise excess heterozygosity (ExcHet) P-value and the site-wise inbreeding coefficient (IC). These metrics are calculated only for diploid sites and missing values are excluded from the calculations. These values are included as fields in the INFO column of the output msVCF file. Both batch-wise and global values are included, where the field names for the global values are prefixed with G.
| Metric | Description | Scope | Number of values |
| --- | --- | --- | --- |
| HWE | Hardy-Weinberg Equilibrium P-value | Allele-wise | One for each alt allele |
| ExcHet | Excess Heterozygosity P-value | Allele-wise | One for each alt allele |
| HWEc2 | Hardy-Weinberg Equilibrium P-value | Site-wise | 1 |
| IC | Inbreeding Coefficient | Site-wise | 1 |
Care should be taken when interpreting these metrics for small cohorts and/or low frequency alleles, as small changes in inputs can lead to large changes in their values. Further, violations of the underlying HWE assumptions (such as inbreeding), and non-random sampling (such as the presence of consanguineous samples), can adversely affect results, making identification of poorly called variants more difficult.
Where a metric cannot be calculated, it is represented as missing (i.e., ".") in the msVCF file. The conditions vary between the metrics, but this may occur if non-diploid genotypes are encountered, if there is only one allele present at a site, or if no samples are genotyped at a site.
Iterative gVCF Genotyper offers both allele-wise and site-wise HWE P-values. The allele-wise P-values are based on the exact-conditional method (Am J Hum Genet. 2005 May; 76(5): 887–893), whereas the site-wise P-values are based on Pearson's chi-squared method. For bi-allelic sites, although both measure the same property, their values may differ. The differences between the methods are explored in Am J Hum Genet. 2005 May; 76(5): 887–893. Care should be taken when deciding which to use.
Iterative gVCF Genotyper calculates allele-wise HWE and the ExcHet P-values. The values are calculated using the exact-conditional method described in Am J Hum Genet. 2005 May; 76(5): 887–893. The implementation does not use a mid P-value correction.
For HWE a P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it. For ExcHet a P-value of ≈ 0.5 suggests that the number of heterozygotes is close to the number expected under HWE, while a value ≈ 1 suggests that there are more heterozygotes than expected and a value ≈ 0 suggests that there are fewer heterozygotes than expected.
For a bi-allelic site, the HWE P-value is based on the observed numbers of homozygotes and heterozygotes compared to the expected numbers. For a multi-allelic site, P-values are calculated per ALT allele as if the site were bi-allelic. Genotypes composed of only the ALT allele being considered are counted as alternative homozygous, any other genotype containing a copy of the ALT allele being considered is counted as heterozygous, and any genotype with no copies of the ALT allele being considered is counted as reference homozygous (this may include genotypes containing other ALT alleles).
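The per-allele collapsing rule and an exact conditional test without mid-P correction can be sketched in Python as follows. This is a textbook-style illustration of the exact test, conditioning on the observed allele counts; it is not DRAGEN's code, and returning None for degenerate inputs is an assumption that mirrors the missing-value behaviour described above.

```python
import math


def collapse_genotypes(genotypes, alt):
    """Count (hom_ref, het, hom_alt) for one ALT allele, treating the site as bi-allelic."""
    hom_ref = het = hom_alt = 0
    for a, b in genotypes:
        copies = (a == alt) + (b == alt)
        if copies == 2:
            hom_alt += 1
        elif copies == 1:
            het += 1
        else:
            hom_ref += 1             # may contain other ALT alleles
    return hom_ref, het, hom_alt


def hwe_exact(hom_ref, het, hom_alt):
    """Exact conditional HWE P-value (no mid-P correction)."""
    n = hom_ref + het + hom_alt
    n_a = 2 * hom_ref + het          # copies of the first allele
    n_b = 2 * hom_alt + het          # copies of the second allele
    if n == 0 or n_a == 0 or n_b == 0:
        return None                  # treated as missing in this sketch

    def log_prob(h):
        # Log-probability of h heterozygotes given the allele counts.
        aa, bb = (n_a - h) // 2, (n_b - h) // 2
        return (h * math.log(2) + math.lgamma(n + 1)
                - math.lgamma(aa + 1) - math.lgamma(h + 1) - math.lgamma(bb + 1)
                + math.lgamma(n_a + 1) + math.lgamma(n_b + 1)
                - math.lgamma(2 * n + 1))

    # The heterozygote count always has the same parity as the allele counts.
    probs = {h: math.exp(log_prob(h)) for h in range(n_a % 2, min(n_a, n_b) + 1, 2)}
    total = sum(probs.values())
    observed = probs[het]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    p_value = sum(p for p in probs.values() if p <= observed * (1 + 1e-9)) / total
    return min(p_value, 1.0)


counts = collapse_genotypes([(0, 1), (1, 2), (2, 2), (0, 0)], alt=2)
print(counts, hwe_exact(*counts))
```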
Iterative gVCF Genotyper calculates a site-wise HWE P-value. The value is calculated using Pearson's chi-squared method, comparing the genotype counts expected under HWE to those observed. The chi-squared test statistic is calculated as

$$\chi^2 = \sum_{gt} \frac{(E_{gt} - O_{gt})^2}{E_{gt}}$$

where the summation is over all genotypes $gt$ possible at the site given the alleles present, and $E_{gt}$ and $O_{gt}$ are the expected and observed counts for genotype $gt$, respectively. From the chi-squared test statistic, the P-value is then calculated from a chi-squared distribution where the number of degrees of freedom is the number of possible genotypes minus the number of alleles, which is

$$\frac{n(n+1)}{2} - n = \frac{n(n-1)}{2}$$

where $n$ is the number of alleles.
The batch-wise value uses only the alleles present in the batch. Alleles with AC=0 are not included in the calculation.
A P-value of ≈ 1 suggests that the distribution of heterozygotes and homozygotes is close to that expected under HWE, while a P-value of ≈ 0 suggests a deviation from it.
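A minimal Python sketch of the site-wise chi-squared calculation is shown below. It uses only the standard library, so the chi-squared tail probability is computed with a small recurrence for integer degrees of freedom; the input layout (a mapping from diploid genotype to count) and the None return for undefined cases are assumptions for illustration, not DRAGEN's implementation.

```python
import math
from itertools import combinations_with_replacement


def chi2_sf(x, df):
    """Upper tail probability of a chi-squared distribution, integer df >= 1."""
    half = x / 2
    if df % 2:
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:
        q, a = math.exp(-half), 1.0
    # Recurrence Q(a + 1, h) = Q(a, h) + h^a e^(-h) / Gamma(a + 1), with h = x / 2.
    while 2 * a < df:
        if half > 0:
            q += math.exp(a * math.log(half) - half - math.lgamma(a + 1))
        a += 1
    return q


def hwe_chi2(genotype_counts):
    """Site-wise Pearson chi-squared HWE P-value from {(a, b): count}."""
    counts, allele_counts = {}, {}
    for (a, b), c in genotype_counts.items():
        key = (min(a, b), max(a, b))
        counts[key] = counts.get(key, 0) + c
        allele_counts[a] = allele_counts.get(a, 0) + c
        allele_counts[b] = allele_counts.get(b, 0) + c
    n_samples = sum(counts.values())
    alleles = sorted(a for a, c in allele_counts.items() if c > 0)  # drop AC=0
    n = len(alleles)
    if n_samples == 0 or n < 2:
        return None                  # treated as missing in this sketch
    freq = {a: allele_counts[a] / (2 * n_samples) for a in alleles}
    chi2 = 0.0
    for a, b in combinations_with_replacement(alleles, 2):
        expected = n_samples * freq[a] * freq[b] * (1 if a == b else 2)
        chi2 += (expected - counts.get((a, b), 0)) ** 2 / expected
    return chi2_sf(chi2, n * (n - 1) // 2)


print(hwe_chi2({(0, 0): 25, (0, 1): 50, (1, 1): 25}))
print(hwe_chi2({(0, 0): 50, (0, 1): 0, (1, 1): 50}))
```

The first call uses genotype counts that match HWE expectations exactly, so the statistic is zero; the second has no heterozygotes at all and yields a P-value close to zero.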
Iterative gVCF Genotyper calculates the inbreeding coefficient (IC) (sometimes called the fixation index and denoted by $F$). It is defined as the proportion of the population that is inbred. The value of IC can be estimated by comparing the observed number of heterozygotes to the number expected under HWE:

$$IC = 1 - \frac{O(het)}{E(het)}$$

where $O(het)$ and $E(het)$ are the observed and expected number of heterozygotes in the cohort, respectively. Although initially conceived for studying inbreeding and defined as a non-negative value, it is also commonly used to look for deviations from HWE and can take values in the range [-1, 1].
Values of IC ≈ 0 suggest that the cohort is in HWE. Negative values suggest an excess of heterozygosity and a deviation from HWE, which can be symptomatic of poor variant calling. Positive values suggest a deficit of heterozygotes and the possible presence of inbreeding.
Using the above definition, IC should be a property of the population, and so would be expected to be drawn from the same distribution for all sites and for all variants at a site. Deviations from this distribution can suggest issues in calling a site correctly. Violations of HWE assumptions and/or non-random sampling may adversely affect the distribution of IC, causing it to be shifted. However, outliers can still be identified, although thresholds may need to be adjusted accordingly.
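The following Python sketch estimates IC from genotype counts using the common estimator IC = 1 - O(het) / E(het), with E(het) derived from the observed allele frequencies. Whether DRAGEN computes the expected heterozygote count in exactly this way is an assumption, and the input layout is hypothetical.

```python
def inbreeding_coefficient(genotype_counts):
    """Estimate IC = 1 - O(het) / E(het) from {(a, b): count} diploid counts."""
    n_samples = sum(genotype_counts.values())
    if n_samples == 0:
        return None                  # no genotyped samples
    allele_counts = {}
    observed_het = 0
    for (a, b), count in genotype_counts.items():
        allele_counts[a] = allele_counts.get(a, 0) + count
        allele_counts[b] = allele_counts.get(b, 0) + count
        if a != b:
            observed_het += count
    total = 2 * n_samples
    # Expected heterozygotes under HWE: N * (1 - sum of squared allele frequencies).
    expected_het = n_samples * (1 - sum((c / total) ** 2 for c in allele_counts.values()))
    if expected_het == 0:
        return None                  # only one allele present
    return 1 - observed_het / expected_het


print(inbreeding_coefficient({(0, 0): 25, (0, 1): 50, (1, 1): 25}))   # in HWE
print(inbreeding_coefficient({(0, 0): 50, (1, 1): 50}))               # no heterozygotes
print(inbreeding_coefficient({(0, 1): 100}))                          # all heterozygotes
```

The three calls cover the centre and both ends of the [-1, 1] range: counts in HWE, a complete deficit of heterozygotes, and a complete excess.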
Allelic balance (AB) describes the proportion of reads that support each allele within a called genotype and can be calculated from the allelic depth (FORMAT/AD or FORMAT/LAD). For homozygotes this is taken as

$$AB = \frac{AD_i}{\sum_j AD_j}$$

where $i$ is the index of the called allele and $j$ runs over all alleles. For heterozygotes this is taken as

$$AB_i = \frac{AD_i}{AD_a + AD_b}$$

where $a$ and $b$ are the indices of the called alleles and $i$ can have values $a$ or $b$. For homozygous genotypes AB is expected to be ≈ 1 and for heterozygous genotypes it is expected to be ≈ 0.5 for each allele. Deviations from the expected values can be indicative of an error.
DRAGEN's iterative gVCF Genotyper calculates site-wise AB values for each allele based on the read depths among all samples. Only diploid genotypes are included in the calculations. Values are calculated separately among homozygous (ABHom) and heterozygous (ABHet) genotypes. ABHet is calculated using the counts among all heterozygous calls that contain the allele under consideration. P-values for ABHet are also calculated (ABHetP) based on a binomial test with an expected probability of 0.5. A P-value of ≈ 1 signifies that results are in line with expectation, while ≈ 0 signifies a deviation from expectation. Values are written to the INFO fields ABHom, ABHet and ABHetP, with one value for each allele (including the reference allele). Values should be in the range [0, 1]. Missing values are coded by -1, for example where there are no homozygous calls for an allele. If AD is not present in any input gVCF file, the values are not calculated and the fields are omitted from the output msVCF file.
It is also possible to filter based on the maximum ABHetP value, see msVCF hard filtering.
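The per-genotype allelic balance formulas and a binomial test against an expected probability of 0.5 can be sketched in Python as below. The function names are hypothetical, the two-sided form of the test is an assumption (the text only states that a binomial test is used), and the aggregation of depths across samples into site-wise values is not shown.

```python
import math


def ab_hom(ad, i):
    """Allelic balance for a homozygous call of allele i."""
    return ad[i] / sum(ad)


def ab_het(ad, a, b):
    """Allelic balance of alleles a and b for a heterozygous a/b call."""
    depth = ad[a] + ad[b]
    return ad[a] / depth, ad[b] / depth


def binomial_two_sided(k, n, p=0.5):
    """Two-sided binomial P-value: total probability of outcomes no more likely than k."""
    pmf = [math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    return min(1.0, sum(x for x in pmf if x <= pmf[k] * (1 + 1e-9)))


print(ab_hom([1, 29], 1))             # homozygous alt call, expected close to 1
print(ab_het([12, 8], 0, 1))          # heterozygous call, expected close to 0.5 each
print(binomial_two_sided(12, 20))     # reads supporting allele 0 out of 20
```

The sketch does not guard against zero depth; a caller would need to treat that case as missing.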
Sites in the output msVCF can be filtered on the following global metrics:
QUAL
Number of samples with called genotypes (GNS_GT)
Inbreeding coefficient (GIC)
𝜒2 Hardy-Weinberg Equilibrium P-value (GHWEc2)
The maximum P-value for heterozygous allelic balance (GABHetP)
The syntax of a filtering expression is the same as that used by the small variant caller (see Germline Small Variant Hard Filtering). Filters are always applied to the globally computed metrics, not the values for the current batch. Records that fail a filter will have the specified filter ID(s) written to the FILTER column of the msVCF, or will be omitted entirely if the --gg-skip-filtered-sites option is specified. Since filtering is on a per-site basis, filters cannot be applied separately to SNPs or indels as they can in the variant caller.
The per-sample genotype metrics in the output msVCF can be customized by providing a colon-separated list of metrics, analogous to that of the VCF FORMAT column, to the --gg-msvcf-format-fields option, e.g. --gg-msvcf-format-fields=GT:LAD:LPL:LAA:QL. Supported metrics are GT, GQ, AD, LAD, FT, LPL, LAA, LA, LGT, QL, MQR, LAF and DF (N.B. LAF will only appear on the MT contig and DF will only appear if the --gg-diploidify option is enabled). Sample genotype (GT) is always present and always shown first, regardless of whether it is included in the option string or not. Alternatively, an msVCF containing only site statistics and no per-sample genotype fields can be generated using the option --gg-msvcf-format-fields=None.
| ID | Description | Number of values (1) | Type |
| --- | --- | --- | --- |
| GT | Genotype | 1 | String |
| GQ | Genotype quality | 1 | Integer |
| AD | Allelic depths | R | Integer |
| LAD | Localized allelic depths | . | Integer |
| FT | Sample filter | 1 | String |
| LPL | Local normalized, Phred-scaled likelihoods for genotypes as in original gVCF | . | Integer |
| LAA | Mapping of local alt allele index from original gVCF to msVCF excluding the reference allele | . | Integer |
| LA | Mapping of local allele indices from original gVCF to msVCF including the reference allele | . | Integer |
| LGT | Local GT value as in original gVCF | 1 | String |
| QL | Phred-scaled probability that the site has no variant in this sample (original gVCF QUAL) | 1 | Float |
| MQR | Z-score from Wilcoxon rank sum test of Alt vs. Ref read mapping qualities (original gVCF INFO/MQRankSum) | 1 | Float |
| LAF | Allele fractions for the local alt alleles | . | Float |
| DF | Diploidified, 1 represents a genotype that was originally haploid, 0 represents originally diploid | 1 | Integer |
The per-site INFO metrics in the output msVCF can be customized by providing a semicolon-separated list of metrics, analogous to that of the VCF INFO column, to the --gg-msvcf-info-fields option, e.g. --gg-msvcf-info-fields=AC;AN;NS;NS_GT;NS_NOGT;NS_NODATA;AF. Supported metrics are AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet, HWEc2, AF. The default set of metrics is AC, AN, NS, NS_GT, NS_NOGT, NS_NODATA, IC, HWE, ExcHet and HWEc2. All INFO fields can be included using the option --gg-msvcf-info-fields=All. All INFO fields can be dropped using the option --gg-msvcf-info-fields=None, in which case the INFO field will contain the missing symbol ("."). For each specified metric, the value for the current batch and the global value are written. For global values, the metric names are prefixed with G.
INFO fields that have a missing value (".") at a site are omitted from the msVCF for that site, so sites may contain different sets of fields.
| ID | Description | Number of values (1) | Type |
| --- | --- | --- | --- |
| AC | Allele count in genotypes | A | Integer |
| AN | Total number of alleles in called genotypes | 1 | Integer |
| NS | Total number of samples | 1 | Integer |
| NS_GT | Total number of samples with called genotypes | 1 | Integer |
| NS_NOGT | Total number of samples with unknown genotypes ./. | 1 | Integer |
| NS_NODATA | Total number of samples with no coverage | 1 | Integer |
| IC | The inbreeding coefficient | 1 | Float |
| HWE | The exact conditional Hardy-Weinberg Equilibrium P-value | A | Float |
| ExcHet | The exact conditional Excess Heterozygosity P-value | A | Float |
| HWEc2 | The chi-squared Hardy-Weinberg Equilibrium P-value | 1 | Float |
| AF | The ALT allele frequencies (AC/AN) | A | Float |
| ABHom | The allelic balance among homozygotes | R | Float |
| ABHet | The allelic balance among heterozygotes | R | Float |
| ABHetP | The P-value for allelic balance among heterozygotes | R | Float |
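The two metric customization options take differently delimited lists, which is easy to get wrong. The following Python sketch builds both option strings and rejects names that are not in the supported lists given in the text above. The helper names are hypothetical, and the supported sets are copied from the prose, so they may not cover every field shown in the tables.

```python
SUPPORTED_FORMAT = ["GT", "GQ", "AD", "LAD", "FT", "LPL", "LAA", "LA",
                    "LGT", "QL", "MQR", "LAF", "DF"]
SUPPORTED_INFO = ["AC", "AN", "NS", "NS_GT", "NS_NOGT", "NS_NODATA",
                  "IC", "HWE", "ExcHet", "HWEc2", "AF"]


def _check(fields, supported):
    unknown = [f for f in fields if f not in supported]
    if unknown:
        raise ValueError(f"unsupported metrics: {', '.join(unknown)}")


def format_fields_option(fields):
    """Colon-separated list, as in the VCF FORMAT column."""
    _check(fields, SUPPORTED_FORMAT)
    return "--gg-msvcf-format-fields=" + ":".join(fields)


def info_fields_option(fields):
    """Semicolon-separated list, as in the VCF INFO column."""
    _check(fields, SUPPORTED_INFO)
    return "--gg-msvcf-info-fields=" + ";".join(fields)


print(format_fields_option(["GT", "LAD", "LPL", "LAA", "QL"]))
print(info_fields_option(["AC", "AN", "NS", "NS_GT", "NS_NOGT", "NS_NODATA", "AF"]))
```

When the INFO option is typed directly into a shell, the semicolons need quoting so that they are not interpreted as command separators.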
For sizable cohorts, the file outputs from gVCF Genotyper can become extremely large. However, there are a number of options within the component which can mitigate this. As well as reduced footprint on disk, these options can lead to faster runtimes owing to the diminished I/O demands.
The following options have applicability to this:
The small variant caller's --vc-compact-gvcf option, described previously. This doesn't reduce output file sizes, but the smaller input gVCFs reduce gVCF Genotyper runtime and could reduce data storage costs.
The removal of the NON_REF symbolic allele when ingesting the input gVCF files, which is the default behaviour. Doing this reduces the size not only of the final msVCF output, but also the intermediate cohort and census files.
Several options exist that reduce the volume of data written to the final msVCF file:
Outputting local allele values, as described above.
Use of the msVCF metric customization options to output only those metrics required for the downstream analysis.
Omitting records that fail filters (--gg-skip-filtered-sites option).
Dropping trailing genotype fields for hom-ref records (--gg-squeeze-msvcf option). This behaviour is explicitly permitted by the VCF specification.
The option that can have the biggest impact on the final output file size is to write it directly in spVCF format. This is a lossless encoding and the space saving can be dramatic: file size reductions of multiple tens of times have been observed for large cohorts with sparsely distributed variants. Files output as spVCF at step 3 (--generate-msvcf) can be directly merged via the --merge-batches subcommand to produce a single spVCF file. spVCF-encoded files are likely to require decoding back to full msVCF for use with downstream tools, and a binary for this is available for download. The decoding will take time, but this is offset by the reduced time required within gVCF Genotyper to initially write the smaller spVCF files. Where possible, users are advised to pipe the decoded data directly into the downstream tool rather than first writing the full msVCF file to disk.
1: The number of values is coded as per the VCF specification, with A denoting one value per alt allele, R one value per possible allele (including the reference allele), G one value per possible genotype, and "." an unspecified number of values that may vary between site and sample. The number of elements in localised array FORMAT fields that depend on the number of local alleles will vary between samples, and so these fields are specified as ".".
The DRAGEN Somatic Pipeline allows ultrarapid analysis of Next-Generation Sequencing (NGS) data to identify cancer-associated mutations in somatic chromosomes. DRAGEN calls SNVs and indels from both matched tumor-normal pairs and tumor-only samples using a probability model that considers the possibility of somatic variants, germline variants, and various systematic noise artifacts. The model is informed by sample-specific nucleotide and indel noise patterns that are estimated from the data at runtime. When considering somatic variants, DRAGEN does not make any ploidy assumptions, which enables detection of low-frequency alleles. For loci with coverage up to 100x in the tumor sample, DRAGEN can detect variant allele frequencies down to approximately 5%. This limit scales with increasing depth on a per-locus basis. It is recommended to provide DRAGEN with a systematic noise file that contains position- and allele-specific noise frequencies as estimated from a panel of normal samples (see below); DRAGEN uses this noise file to filter calls that can be explained as resulting from position- and allele-specific noise. After multiple filtering steps, the output is generated as a VCF file. Variants that fail the filtering steps are kept in the output VCF. The variants include a FILTER annotation that indicates which filtering steps have failed.
For the tumor-normal pipeline, both samples are analyzed jointly. DRAGEN assumes that germline variants and systematic noise artifacts are shared by both samples, whereas somatic variants are present only in the tumor sample. Only somatic variants are reported. To detect systematic noise artifacts, DRAGEN recommends that the coverage in the normal sample be at least half of the coverage in the tumor sample.
The tumor-only pipeline produces output that contains both germline and somatic variants and can be further analyzed to identify tumor mutations. The caller does not attempt to distinguish between them: filtering out common germline variants as reported in databases is currently the most reliable way to remove germline variants. The tumor-only pipeline provides a germline tagging feature and requires this feature to be explicitly enabled or disabled. When germline tagging is enabled, variant annotation must also be enabled; DRAGEN then tags variants that are common in the gnomAD database as germline so that they can be filtered out if desired. The tumor-only pipeline also requires the presence of a systematic noise file by default. To run without germline tagging and/or systematic noise files, these options need to be disabled explicitly.
DRAGEN uses a Bayesian approach to compute the posterior probability that a somatic variant is present and reports this as a phred-scale quantity, "somatic quality" (SQ):
##FORMAT=<ID=SQ,Number=1,Type=Float,Description="Somatic quality">
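To make the phred scale concrete, the following sketch converts between a posterior probability and a phred-scale score. It assumes the usual convention, score = -10 * log10(1 - P), where P is the posterior probability that the variant is present. The function names are hypothetical, and this is not a description of how DRAGEN computes SQ internally.

```python
import math


def prob_to_phred(p_variant: float) -> float:
    """Phred-scale the probability that the call is wrong (1 - p_variant).

    Assumes the conventional definition: -10 * log10(error probability).
    """
    if not 0.0 <= p_variant < 1.0:
        raise ValueError("p_variant must be in [0, 1)")
    return -10.0 * math.log10(1.0 - p_variant)


def phred_to_prob(score: float) -> float:
    """Inverse conversion: recover the posterior probability from a phred score."""
    return 1.0 - 10.0 ** (-score / 10.0)
```

Under this convention, a posterior of 0.99 corresponds to a score of about 20, and each additional 10 points corresponds to a tenfold smaller error probability.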
DRAGEN scores variants by computing likelihoods for several hypotheses and noise processes, taking into account many factors such as: the numbers of alt-supporting and ref-supporting reads in the tumor and normal samples (and hence the alt allele frequencies in both samples); mapping qualities and how these are distributed across the reads in the tumor and normal pileups; basecall qualities; forward vs reverse strand support; sample-wide estimates of insertion and deletion error probabilities as functions of repeat period, repeat length, and indel length; sample-wide estimates of nucleotide error biases; whether there are nearby co-phased events; and whether the positions and alleles in question are known somatic hotspots or associated with sequence-specific error patterns.
You can use SQ as the primary metric to describe the confidence with which the caller made a somatic call. SQ is reported as a format field for the tumor sample (exception: for homozygous reference calls in gVCF mode it is instead a likelihood ratio, analogous to homref GQ as described in the germline section). Variants with SQ score below the SQ filter threshold are filtered out using the weak_evidence tag. To trade off sensitivity against specificity, adjust the SQ filter threshold. Lower thresholds produce a more sensitive caller and higher thresholds produce a more conservative caller. If performing tumor-normal analysis, the SQ field for the normal sample contains the Phred-scaled posterior probability that a putative call is a germline variant. The somatic caller does not test for diploid genotype candidates and does not output GQ or QUAL values.
If tumor SQ > vc-sq-call-threshold (default is 3 for tumor-normal and 0.1 for tumor-only), then the FORMAT/GT for the tumor sample is hard-coded to 0/1, and the FORMAT/AF yields an estimate of the somatic variant allele frequency, which ranges anywhere within [0,1].
If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.
If tumor SQ < vc-sq-call-threshold, the variant is not emitted in the VCF.
If tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, the variant is emitted in the VCF, but FILTER=weak_evidence.
If tumor SQ > vc-sq-call-threshold and tumor SQ > vc-sq-filter-threshold, the variant is emitted in the VCF and FILTER=PASS (unless the variant is filtered by a different filter).
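The emit and filter rules above can be summarized as a small decision function. This is a minimal sketch, not DRAGEN code: the function name and return labels are hypothetical, the treatment of a score exactly equal to a threshold is an assumption (the rules above only state strict inequalities), and other filters and the clustered-events penalty are not modeled.

```python
def classify_sq(tumor_sq: float, call_threshold: float, filter_threshold: float) -> str:
    """Return 'not_emitted', 'weak_evidence', or 'PASS' for a tumor SQ score.

    Assumption: a score exactly equal to a threshold is treated as meeting it.
    """
    # If the filter threshold is lower than the call threshold,
    # the filter threshold is used in place of the call threshold.
    effective_call = min(call_threshold, filter_threshold)
    if tumor_sq < effective_call:
        return "not_emitted"
    if tumor_sq < filter_threshold:
        return "weak_evidence"
    # PASS here means "not filtered by SQ"; other filters may still apply.
    return "PASS"
```

For example, with a call threshold of 3 and a filter threshold of 17.5, a tumor SQ of 10 is emitted but marked weak_evidence.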
The default vc-sq-filter-threshold is 17.5 for tumor-normal and 3.0 for tumor-only analysis. The following is an example somatic T/N VCF record. Tumor SQ > vc-sq-call-threshold but tumor SQ < vc-sq-filter-threshold, so the FILTER is marked as weak_evidence.
The clustered-events penalty is an exception to the above rule for emitting variants. By default, the clustered-events penalty replaces the (obsolete) clustered-events filter. Instead of applying a hard filter when too many events are clustered together, DRAGEN applies a penalty to the SQ scores of cophased clustered events. Clustered events with weak evidence are no longer called, but clustered events with strong evidence can still be called. This is equivalent to lowering the prior probability of observing clustered cophased variants. The penalty is applied after the decision to emit variants, so that penalized variants still appear in the VCF if their unpenalized score is high enough. Variants that are combined into an MNV via the --combine-phased-variants-distance option are treated as a single variant for the purposes of the penalty. The penalty will not be applied to somatic hotspot variants. To disable the clustered-events penalty, set --vc-clustered-event-penalty=0.
Please see the DRAGEN Recipe sections for recommended command lines in typical workflows. The following command line options are typically used for somatic small-variant calling:
--tumor-fastq1 and --tumor-fastq2
Inputs a pair of FASTQ files into the mapper aligner and somatic variant caller. You can use these options with other FASTQ options to run in tumor-normal mode.
--tumor-fastq-list
Inputs a list of FASTQ files into the mapper aligner and somatic variant caller. You can use this option with other FASTQ options to run in tumor-normal mode.
--tumor-bam-input and --tumor-cram-input
Inputs a mapped BAM or CRAM file into the somatic variant caller. You can use these options with other BAM/CRAM options to run in tumor-normal mode.
--vc-sq-call-threshold and --vc-sq-filter-threshold
These options control the thresholds for emitting calls in the VCF and applying the weak_evidence filter tag (see above).
--vc-target-vaf
This option allows the user to adjust the allele frequencies of haplotypes that will be considered by the caller as potentially appearing in the sample. It is not a hard threshold, but the variant caller will aim to detect variants with allele frequencies larger than this setting. In the case of tumor-normal runs, the frequency is measured with respect to the full set of reads (tumor and normal combined). The default threshold of 0.03 was selected to be as low as possible without incurring an excessive false positive cost; a lower setting may increase sensitivity for low-frequency variants, but may increase false positives and runtime; a higher setting may reduce false positives. Setting the vc-target-vaf to 0 will result in all haplotypes with at least two supporting reads being taken into consideration.
--vc-somatic-hotspots, --vc-use-somatic-hotspots, and --vc-hotspot-log10-prior-boost
DRAGEN uses a hotspot VCF to indicate somatic mutations that are expected with increased frequency. The default hotspot file (automatically selected from <INSTALL_PATH>/resources/hotspots/somatic_hotspots_* based on the reference) is mostly based on the Memorial Sloan Kettering Cancer Center (MSKCC) published hotspots and positions in COSMIC with population allele counts (AC) >= 50. It is somewhat conservative and boosts only a few thousand positions. You can specify a custom hotspot file via the --vc-somatic-hotspots option (note: input VCF records must be sorted in the same order as contigs in the selected reference) or disable the hotspots feature with --vc-use-somatic-hotspots=false. The effect of the hotspot file is that the prior probability for hotspot variants is boosted by a factor, up to a maximum prior of 0.5. An SNV is considered to match a hotspot variant only if the allele in question is identical, whereas insertions or deletions are considered to match any insertion/deletion allele respectively. You can use --vc-hotspot-log10-prior-boost to control the size of the adjustment. The default value is 4 (log10 scale), corresponding to an increase of 40 phred; reducing this value results in a smaller adjustment.
--vc-systematic-noise
This option allows the user to specify the systematic noise file. To run without a systematic noise file (not recommended), specify vc-systematic-noise=NONE.
--vc-combine-phased-variants-distance
This option is the same as in the germline variant caller (see "Combine Phased Variants" in the germline small-variant caller section).
vc-skip-germline-tagging=true
This option disables the germline tagging feature in the tumor-only pipeline (not recommended).
--vc-callability-tumor-thresh
Specifies the callability threshold for tumor samples. The somatic callable regions report includes all regions with tumor coverage above the tumor threshold. The default value is 50. For more information on the somatic callable regions report, see Somatic Callable Regions Report.
--vc-callability-normal-thresh
Specifies the callability threshold for normal samples, if present. If applicable, the somatic callable regions report includes all regions with normal coverage above the normal threshold. The default value is 5. For more information on the somatic callable regions report, see Somatic Callable Regions Report.
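The arithmetic behind the hotspot prior boost described under --vc-hotspot-log10-prior-boost can be sketched as follows. The function names and the example base priors are hypothetical illustrations; only the default boost of 4 (log10 scale), the 40 phred equivalence, and the cap of 0.5 come from the description above.

```python
def boosted_hotspot_prior(prior: float, log10_boost: float = 4.0, max_prior: float = 0.5) -> float:
    """Multiply a prior by 10**log10_boost, capped at max_prior."""
    return min(prior * 10.0 ** log10_boost, max_prior)


def boost_in_phred(log10_boost: float = 4.0) -> float:
    """A factor of 10**b corresponds to 10*b on the phred scale."""
    return 10.0 * log10_boost
```

With the default boost, a hypothetical base prior of 1e-6 becomes 1e-2, while a base prior of 1e-3 would exceed the cap and is set to 0.5.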
In a tumor-normal analysis, DRAGEN accounts for tumor-in-normal (TiN) contamination by running liquid tumor mode. Liquid tumor mode is disabled by default, but we recommend enabling it with --vc-enable-liquid-tumor-mode=true if TiN contamination is expected. When liquid tumor mode is enabled, DRAGEN is able to call variants in the presence of TiN contamination up to a specified maximum tolerance level (default: 0.15). With the default maximum TiN contamination tolerance, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the corresponding allele in the tumor sample. Setting vc-tin-contam-tolerance enables liquid tumor mode and allows you to set the maximum TiN contamination tolerance.
Liquid tumor mode is not equivalent to liquid biopsy. Liquid tumors in liquid tumor mode refer to hematological cancers, such as leukemia. For liquid tumors, it is not feasible to use blood as a normal control because the tumor is present in the blood. Skin or saliva is typically used as the normal sample. However, skin and saliva samples can still contain blood cells, so that the matched normal control sample contains some traces of the tumor sample and somatic variants are observed at low frequencies in the normal sample. If the contamination is not accounted for, it can severely impact sensitivity by suppressing true somatic variants.
Liquid tumor mode typically uses a WGS or WES library with medium depth (for example, 100x tumor / 40x normal), and the lowest VAF detected at these depths is ~5%. Liquid biopsy typically uses a targeted gene panel (e.g., 500 genes) with very high raw depth, and uses UMI indexing (collapsing down to a depth of >2000x) to enable sensitivity at VAF down to 0.1% in some cases (the limit of detection will vary depending on coverage and data quality).
If using different sequencing systems or different library preparation methods for tumor and normal samples, we recommend setting --vc-override-tumor-pcr-params-with-normal=false. In tumor-normal mode, DRAGEN estimates a set of PCR error parameters separately for each of the tumor and normal samples. By default, DRAGEN ignores the tumor-sample parameters and uses normal-sample parameters for analysis of both samples. This default prevents overestimation of tumor-sample error rates that can occur if the somatic variant rate is high.
Allele frequency and related settings
There is no hard limit on the allele frequencies at which DRAGEN can report calls, but there are a number of points in the pipeline where low allele frequency can affect calling. The vc-target-vaf setting affects the threshold used to detect candidate haplotypes during localized haplotype assembly, but does not affect variant scoring. Once a candidate haplotype is detected, all putative variants appearing in the haplotype are scored and calls scoring above the SQ call threshold are emitted regardless of the allele frequency or the number of supporting reads.
The probability calculation in the somatic caller assesses variant and noise hypotheses at fixed allele frequencies defined by a discrete grid (by default at coverages <200: 0, 0.05, 0.1, ... 1.0). This means that the calculation will assess variants with allele frequencies below 0.05 as if the true frequency is equal to 0.05; this strategy does not preclude such variants from being called but may result in lower scores compared to if the true frequency had been considered. At positions with higher coverage, DRAGEN adds extra grid points as in the table below in order to consider hypotheses involving lower allele frequencies and effectively achieve a lower limit of detection (LOD), with the lowest VAF halving every time the coverage doubles:
Coverage | Lowest AF grid point
0-199 | 0.05
200-399 | 0.025
400-799 | 0.0125
... | ...
If calls below a certain VAF are not of interest, you can use --vc-enable-af-filter (see Post Somatic Calling Filtering below) to apply a hard filter on VAF.
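The coverage-dependent lower end of the allele frequency grid can be expressed as a short function. This is a sketch of the halving rule stated above (0.05 below 200x, halving each time the coverage doubles); the function name is hypothetical, and values beyond the three rows listed in the table are extrapolated from that rule rather than taken from the documentation.

```python
def lowest_grid_vaf(coverage: int) -> float:
    """Lowest non-zero allele frequency grid point at a given coverage.

    0-199 -> 0.05, 200-399 -> 0.025, 400-799 -> 0.0125, and so on,
    halving every time the coverage doubles (extrapolated past 799).
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    # Number of completed doublings of coverage starting from 100x:
    # 200-399 -> 1, 400-799 -> 2, 800-1599 -> 3, ...
    doublings = max(0, (coverage // 100).bit_length() - 1)
    return 0.05 / (2 ** doublings)
```

A variant with a true allele frequency below the returned value is assessed as if its frequency were equal to that grid point, which is why such variants can still be called but may score lower.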
DRAGEN can compensate for oxidation and deamination artifacts that might exist upstream of the sequencing system, and are common in FFPE samples. DRAGEN does this by estimating nucleotide mutation biases on a per sample basis, taking account of read orientation. During variant calling, DRAGEN then corrects for nucleotide substitution biases by combining the estimated parameters with the basecall quality scores, thus modifying the nucleotide error rates used by the hidden Markov model.
Nucleotide (NTD) Error Bias Estimation is on by default and recommended as a replacement for the orientation bias filter. Both methods take account of strand-specific biases (systematic differences between F1R2 and F2R1 reads). In addition, NTD error estimation accounts for non-strand-specific biases such as sample-wide elevation of a certain SNV type, e.g. C->T or any other transition or transversion. NTD error estimation can also capture these biases in a trinucleotide context, e.g. in the case of C->T it will break down the counts as ACA->ATA, CCA->CTA, GCA->GTA, TCA->TTA, etc.
This feature can be disabled by specifying --vc-enable-unequal-ntd-errors=false or set to auto-detect by specifying --vc-enable-unequal-ntd-errors=auto. In auto-detect mode, DRAGEN will run the estimation but then disable the use of the estimated parameters if it determines that the sample does not exhibit nucleotide error bias. When the feature is enabled, DRAGEN will by default estimate a smaller set of parameters in a monomer context. To estimate a larger set of parameters in a trimer context (recommended on sufficiently large panels when coverage is above 1000X), specify --vc-enable-trimer-context=true.
To specify the regions from which to estimate nucleotide substitution biases, use --vc-snp-error-cal-bed. Alternatively, if --vc-target-bed is used to specify the target regions for variant calling, and the total bed regions are sufficiently small (maximum 4 megabases), --vc-snp-error-cal-bed can be omitted and DRAGEN will use the target bed file for bias estimation. Otherwise, DRAGEN will use a default bed file selected to match the reference, and covering a mixture of coding and non-coding regions.
DRAGEN requires a panel size of at least 150kbp to correctly estimate nucleotide mutation biases when using trimer context, or at least 10kbp when using monomer context. If this requirement is not met for trimer context, DRAGEN falls back on the monomer model, and if it is not met for monomer context, DRAGEN turns the bias estimation feature off.
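The fallback behaviour described above can be written as a small decision function. This is an illustrative sketch only: the function name and return labels are hypothetical, and it assumes 1 kbp = 1,000 bp.

```python
def ntd_bias_context(panel_size_bp: int, trimer_requested: bool) -> str:
    """Return which bias-estimation context would be used for a panel size.

    Thresholds from the text: trimer needs >= 150 kbp, monomer needs >= 10 kbp.
    """
    if trimer_requested and panel_size_bp >= 150_000:
        return "trimer"
    if panel_size_bp >= 10_000:
        # Either monomer was requested, or trimer fell back to monomer.
        return "monomer"
    # Below the monomer requirement the estimation feature is turned off.
    return "disabled"
```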
DRAGEN provides two specialized UMI-aware variant calling pipelines for running from UMI-collapsed reads. These pipelines are optimized to take account of the increased read and basecall qualities that are typical in simplex- and duplex-collapsed reads. Both pipelines are disabled by default; when running with UMI collapsing enabled (--enable-umi true) or when running from UMI-collapsed BAMs, enable UMI-aware variant calling by setting one of the following options to true:
--vc-enable-umi-solid
The VC UMI solid mode is optimized for solid tumors with post-collapsed coverage rates of ~200-300X and target allele frequencies of 5% and higher.
--vc-enable-umi-liquid
The liquid biopsy pipeline is not equivalent to liquid tumor mode (see above). The liquid biopsy pipeline starts from a regular blood sample and looks for low VAF somatic variants from tumor cell free DNA floating in the blood. This type of test enables tumor profiling (diagnosis/biomarker identification) from plasma rather than from tissue, which requires an invasive biopsy. The VC UMI liquid mode is optimized for a liquid biopsy pipeline with post collapsed coverage rates of >2000X and target allele frequencies of 0.1% and higher.
If a third-party tool is used to produce the collapsed reads, then configure the tool so that the base call quality scores quantify the error produced by the sequencing system only. DRAGEN uses Sample-specific NTD Error Bias Estimation (see above) to account for errors upstream of the sequencing system, so such errors should not be included in base call quality scores.
You can output a gVCF file for tumor-only data sets. A gVCF file reports information on every position of the input genome, including homozygous reference (homref) positions, i.e. positions where no alt allele (either germline or somatic) is present. DRAGEN creates a new <NON_REF> allele, to which reads that do not support the reference allele or any reported variant allele are assigned. In tumors, variants could exist at arbitrarily low allele frequencies and be undetectable. Thus, a somatic homref call cannot guarantee that no somatic variant at any allele frequency exists at the position. Instead, DRAGEN considers a position to be a homozygous reference if there are no somatic variants with an allele frequency at or above the limit of detection (LOD). Whereas the SQ score for an ordinary alt allele is a phred-scale posterior probability, the SQ score for the <NON_REF> allele is a phred-scale ratio between the likelihood of a homref call and the likelihood of a variant call with allele frequency at the LOD (if an alt allele is also reported, the <NON_REF> SQ score is capped at the complement of the posterior probability for the alt allele). If the LOD value is lowered, fewer homref calls are made. If the LOD value is increased, more homref calls are made.
By default the LOD is set to 5%, but you can enter a different value using the --vc-gvcf-homref-lod option.
DRAGEN can add a number of filters by populating the FILTER column in the VCF. The output is provided in the *.hard-filtered.vcf.gz output file (note: the *.vcf.gz output file without "hard-filtered" in the filename differs only in that the filter column is unpopulated; the file is produced for historical reasons but is to be deprecated).
Options
The following options are available for post somatic calling filtering:
--vc-sq-call-threshold
Sets the SQ threshold for emitting calls in the VCF. The default is 3.0 for tumor-normal and 0.1 for tumor-only. If the value for vc-sq-filter-threshold is lower than vc-sq-call-threshold, the filter threshold value is used instead of the call threshold value.
--vc-sq-filter-threshold
Sets the SQ threshold below which emitted VCF calls are marked as filtered (weak_evidence). The default is 17.5 for tumor-normal and 3.0 for tumor-only.
--vc-enable-triallelic-filter
Enables the multiallelic filter. The default is true. This filter will not be applied to somatic hotspot variants.
--vc-enable-non-primary-allelic-filter
Similar to the triallelic filter, but filters less aggressively. Keeps the allele with the highest alt AD at each multiallelic position and filters only the rest. The default is false. This filter will not be applied to somatic hotspot variants. Cannot be enabled when the triallelic filter is also on.
--vc-enable-af-filter
Enables the allele frequency filter for nuclear chromosomes. The default value is false. When set to true, variants with allele frequencies below the AF call threshold are excluded from the VCF, and variants with allele frequencies below the AF filter threshold are tagged with the low_af filter tag. The default AF call threshold is 1% and the default AF filter threshold is 5%. To change the threshold values, use the vc-af-call-threshold and vc-af-filter-threshold command-line options. For mitochondrial allele frequency filtering, use vc-enable-af-filter-mito and the corresponding threshold options.
--vc-enable-non-homref-normal-filter
Enables the non-homref normal filter. The default value is true. When set to true, the VCF filters out variants if the normal sample genotype is not a homozygous reference.
--vc-enable-vaf-ratio-filter
Adds one condition to be filtered out by the alt_allele_in_normal filter. The default value is false. When set to true, the VCF filters out variants if the normal sample AF is greater than 20% of tumor sample AF.
--vc-depth-filter-threshold
Filters all somatic variants (alt or homref) with a depth below this threshold. The default value is 0 (no filtering).
vc-homref-depth-filter-threshold
In gvcf mode, filters all somatic homref variants with a depth below this threshold. The default value is 3.
vc-depth-annotation-threshold
Filters all non-PASS somatic alt variants with a depth below this threshold. The default value is 0 (no filtering).
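The two-threshold behaviour of --vc-enable-af-filter listed above can be sketched as follows. The function name and return labels are hypothetical, the handling of a frequency exactly equal to a threshold is an assumption, and the sketch applies only to nuclear chromosomes with the filter enabled.

```python
def af_filter_outcome(af: float, call_threshold: float = 0.01, filter_threshold: float = 0.05) -> str:
    """Classify a variant allele frequency under the AF filter.

    Defaults follow the text: AF call threshold 1%, AF filter threshold 5%.
    """
    if af < call_threshold:
        return "excluded"       # not written to the VCF
    if af < filter_threshold:
        return "low_af"         # written, but tagged with the low_af filter
    return "not_af_filtered"    # other filters may still apply
```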
Filters
Pipeline | Filter ID | Description
Tumor-Only & Tumor-Normal | weak_evidence | Variant does not meet the likelihood threshold. With the default thresholds, the tumor SQ is < 17.5 for tumor-normal or < 3.0 for tumor-only.
Tumor-Only & Tumor-Normal | multiallelic | Site filtered if there are two or more ALT alleles at this location in the tumor. Not applied to somatic hotspot variants.
Tumor-Only & Tumor-Normal | base_quality | Median base quality of ALT reads at this locus is < 20.
Tumor-Only & Tumor-Normal | mapping_quality | Median mapping quality of ALT reads at this locus is < 20 (tumor-normal) or < 30 (tumor-only).
Tumor-Only & Tumor-Normal | fragment_length | Absolute difference between the median fragment length of alt reads and median fragment length of ref reads at a given locus > 10000.
Tumor-Only & Tumor-Normal | read_position | Median of distances between the start and end of read and a given locus < 5 (the variant is too close to edge of all the reads). To output variant read position to the INFO field, use --vc-output-variant-read-position=true.
Tumor-Only & Tumor-Normal | low_af | Allele frequency is below the threshold specified with --vc-af-filter-threshold (default is 5%). Enabled only when using --vc-enable-af-filter=true.
Tumor-Only & Tumor-Normal | systematic_noise | If AQ score is < 10 (default) for tumor-normal or < 60 (default) for tumor-only, the site is filtered.
Tumor-Only & Tumor-Normal | low_frac_info_reads | The fraction of informative reads (denominator excludes filtered_out reads) is below the threshold. The default threshold value is 0.5.
Tumor-Only & Tumor-Normal | filtered_reads | More than 50% of reads have been filtered out.
Tumor-Only & Tumor-Normal | long_indel | Indel length is more than 100bp.
Tumor-Only & Tumor-Normal | low_depth | The site was filtered because the number of reads is too low. The filter is off by default.
Tumor-Only & Tumor-Normal | low_tlen | The site was filtered because the fraction of low TLEN ALT supporting reads is above a threshold. The default threshold is 0.4. Reads with TLEN smaller than -2.25 (default) standard deviations from the mean are considered to be low TLEN. This filter is not applied for reads sampled from tight insert distributions i.e., stddev / mean < 0.1 (default).
Tumor-Only & Tumor-Normal | no_reliable_supporting_read | No reliable supporting read was found in the tumor sample. A reliable supporting read is a read supporting the alt allele with mapping quality ≥ 40, fragment length ≤ 10,000, base call quality ≥ 25, and distance from start/end of read ≥ 5.
Tumor-Only & Tumor-Normal | too_few_supporting_reads | Variant is supported by < 3 reads in the tumor sample. This filter is not applied in UMI-aware pipelines.
Tumor-Normal | noisy_normal | More than three alleles are observed in the normal sample at allele frequency above 9.9%.
Tumor-Normal | alt_allele_in_normal | ALT allele frequency in the normal sample is above 0.2 plus the maximum contamination tolerance. For solid tumor mode, the tolerance is 0. For liquid tumor mode, the default tolerance is 0.15. See vc-enable-vaf-ratio-filter for optional conditions.
Tumor-Normal | non_homref_normal | Normal sample genotype is not a homozygous reference.
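Two of the filter definitions above (no_reliable_supporting_read and low_tlen) combine several numeric conditions, so a sketch may help in reading them. The class, field, and function names below are hypothetical, the per-read representation is a simplification, and the sample-wide TLEN mean and standard deviation are assumed to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AltRead:
    """Hypothetical summary of one alt-supporting read in the tumor sample."""
    mapq: int
    fragment_length: int
    base_quality: int
    dist_from_read_end: int


def is_reliable(read: AltRead) -> bool:
    # Thresholds as listed for no_reliable_supporting_read.
    return (read.mapq >= 40 and read.fragment_length <= 10_000
            and read.base_quality >= 25 and read.dist_from_read_end >= 5)


def no_reliable_supporting_read(alt_reads: Sequence[AltRead]) -> bool:
    """True if the filter would apply: no alt read meets all four criteria."""
    return not any(is_reliable(r) for r in alt_reads)


def low_tlen(alt_tlens: Sequence[float], mean: float, stddev: float,
             frac_threshold: float = 0.4, num_std: float = 2.25,
             tight_ratio: float = 0.1) -> bool:
    """True if the fraction of low-TLEN alt reads exceeds frac_threshold."""
    if not alt_tlens or mean <= 0:
        return False
    if stddev / mean < tight_ratio:
        return False  # tight insert distribution: filter not applied
    cutoff = mean - num_std * stddev
    low = sum(1 for t in alt_tlens if t < cutoff)
    return low / len(alt_tlens) > frac_threshold
```

For instance, with a TLEN mean of 300 and standard deviation of 50, the low-TLEN cutoff is 187.5, and the filter applies once more than 40% of the alt-supporting reads fall below it.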
The DRAGEN systematic noise filter significantly improves somatic variant calling precision, especially in tumor-only mode. DRAGEN enforces its use in the tumor-only pipeline by refusing to start a run without a noise file (this option can explicitly be disabled). This filter tackles noise that consistently appears at specific locations in the reference genome. This noise can arise from:
Mis-mapping in low-complexity regions: Repetitive sequences with low information content can lead to reads mapping to incorrect locations.
PCR noise in homopolymer regions: Regions with long stretches of the same nucleotide (e.g., AAAAA) can introduce errors during PCR amplification.
The systematic noise filter offers a significant improvement over the older "panel of normals" method. While the panel of normals simply excluded specific positions, the new filter employs a statistical model. This model compares the variant and its allele frequency (AF) to the noise level associated with that specific position and allele in the reference genome. This allows for a more nuanced filtering approach, reducing false positives without discarding potentially valid variants.
Note that the systematic noise filter specifically aims to remove noise, while the option --vc-enable-germline-tagging is used for identifying germline variants. The systematic noise filter is not recommended for germline admixture datasets, where tumor-normal pairs are simulated by combining germline samples from two different individuals. This is because such datasets contain (simulated) somatic variants at germline variant positions, and those positions may be present in the noise files with the result that desired variants are filtered out.
Newer versions of the systematic noise file include two columns, one for the "mean" noise and one for the "max" noise. The noise file header specifies a "##NoiseMethod", which names the column that is used by default during variant calling. For UMI/PANELs/WES it is recommended to use the "mean" noise, and for WGS it is recommended to use the "max" noise.
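To check which noise column a given file selects by default, the "##NoiseMethod" header can be read with a few lines of code. This sketch assumes the header lines start with "#", precede the data lines, and use the form "##NoiseMethod=<value>"; it works on any iterable of text lines, so opening (and, if needed, decompressing) the file is left to the caller. The function names are hypothetical, and the rule that an explicit command-line method takes precedence over the header is an assumption based on the header being described as the default.

```python
from typing import Iterable, Optional


def default_noise_method(lines: Iterable[str]) -> Optional[str]:
    """Return the value of the ##NoiseMethod header, or None if absent."""
    for line in lines:
        if not line.startswith("#"):
            break  # assumed: header lines come before data lines
        if line.startswith("##NoiseMethod="):
            return line.strip().split("=", 1)[1]
    return None


def resolve_noise_method(cli_value: Optional[str], header_value: Optional[str]) -> Optional[str]:
    """Assumed precedence: an explicit command-line value overrides the header."""
    return cli_value if cli_value is not None else header_value
```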
Prebuilt systematic noise files are available for download (see below), but when possible, it is recommended to build custom noise files from a panel of normal samples sequenced locally. This will ensure that the noise file is specific to the library preparation, sequencing system, and panel in use. Building your own noise file is especially helpful for clean UMI samples that tend to have less noise than WGS/WES samples. To generate a noise file it is recommended to use approximately 20-50 normal samples, although fewer normal samples (1-10) can still be used to generate useful noise files.
The systematic noise filter is used in the DRAGEN tumor-only or tumor-normal pipeline by adding the following commands:
Prebuilt systematic noise files can be downloaded here: DRAGEN Software Support Site page
Somatic Systematic Noise Baseline Collection v2.0.0 noise files were generated with V4.3 and for the first time include allele specific information. Each v2.0.0 noise file includes both "mean" and "max" noise in separate columns. A header line "##NoiseMethod=mean/max" specifies which noise column will be used by default.
Noise files generated with V4.3 contain extra columns and are not compatible with earlier versions. Older noise files are still supported in the current version of DRAGEN as per the table below.
Somatic Systematic Noise Baseline Collection v2.0.0 | V4.3 | hg19, hg38, hs37d5, WES, WGS | ~50 per cohort, 80-100X coverage
The default WES and WGS noise files were generated using a combination of Nextera and TruSeq samples (with and without PCR). There are also hg38 WGS HEME and FFPE specific noise files.
The BaseSpace Sequence Hub DRAGEN CNV Baseline Builder App can be used to build SNV and CNV noise files in the cloud. Alternatively, the following DRAGEN command lines can be used to generate the noise files locally:
First run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples using the following command:
Once the normal samples have completed, collect the normal VCFs in the VCF_LIST file (one vcf per line) and use DRAGEN to generate the systematic noise file:
Running the filter during somatic variant calling:
--vc-systematic-noise
Specifies a systematic noise BED file. If a somatic variant does not pass the AQ threshold, the variant is marked as 'systematic_noise' in the FILTER column of the output VCF.
--vc-systematic-noise-method
Specifies which column in the systematic noise file will be used: "max" is more aggressive and recommended for WGS, while "mean" preserves better sensitivity and is recommended for WES/PANELs.
--vc-systematic-noise-filter-threshold
Set the AQ threshold. Higher values filter more aggressively. By default the threshold value is 10 for tumor-normal and 60 for tumor-only. The valid range spans 0-100. For tumor-normal runs the threshold may be set higher (e.g. to 60) to improve specificity at the possible cost of some sensitivity.
--vc-systematic-noise-filter-threshold-in-hotspot
Set the AQ threshold to use in hotspot regions, where one may want to filter less aggressively than in the rest of the genome. By default, the threshold value is 10 for tumor-normal and 20 for tumor-only.
--vc-allele-specific-systematic-noise
Apply systematic noise in an allele-specific manner when allele information is available. This setting is ignored for v1 noise files. (Default=true)
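The default AQ thresholds listed above depend on both the pipeline and whether the site is a hotspot. The sketch below collects them in one place; the function names are hypothetical, and user-supplied threshold overrides are not modeled.

```python
def default_aq_threshold(tumor_normal: bool, in_hotspot: bool = False) -> float:
    """Default AQ threshold for the systematic noise filter.

    Outside hotspots: 10 (tumor-normal) or 60 (tumor-only).
    Inside hotspots:  10 (tumor-normal) or 20 (tumor-only).
    """
    if in_hotspot:
        return 10.0 if tumor_normal else 20.0
    return 10.0 if tumor_normal else 60.0


def is_systematic_noise(aq: float, tumor_normal: bool, in_hotspot: bool = False) -> bool:
    """True if a variant with this AQ score would be marked systematic_noise."""
    return aq < default_aq_threshold(tumor_normal, in_hotspot)
```

For example, a tumor-only call with AQ 30 is filtered outside hotspot regions but retained inside them.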
Running the tumor-only pipeline on the normals:
--vc-detect-systematic-noise
Run the tumor-only pipeline in an ultra sensitive mode and intentionally include noise in the output VCF. WARNING: this option should only be used with normal samples to characterize noise, it is NOT intended for analyzing tumor samples.
--vc-detect-systematic-noise-mode
Specify the library type when generating the systematic noise. Only required for UMI samples. This mode will generate GVCFs which are especially useful for capturing very low levels of noise. The default mode will work well for WGS/WES and non-UMI panels. Valid options include [UMI, DEFAULT]
Building the noise file:
--build-sys-noise-method
Specifies the default value for vc-systematic-noise-method by adding it as part of the header in the systematic noise file. It is recommended to select 'mean' for UMI/PANELS/WES data and 'max' for WGS data (default is 'max').
--build-sys-noise-vcfs-list
Text file containing the paths of normal VCFs. Specify the full VCF file paths. List one file per line.
--build-sys-noise-germline-vaf-threshold
Variant calls with VAF higher than this threshold will be considered germline and will not contribute to the noise estimate. This option is disabled by default by setting the threshold to 1. (Default 1)
--build-sys-noise-use-germline-tag
This option ensures that variants tagged by vc-enable-germline-tagging=true are not counted as noise. (Default true)
--build-sys-noise-min-sample-cov
Minimum coverage at a site for a sample to be used for noise estimation. At low coverage, estimated allele frequencies become less reliable. Accurate AF estimation is important for germline variant detection, and also for noise detection when using MAX noise. (Default 5)
--build-sys-noise-min-supporting-samples
Minimum number of samples with noise at a position for the position to be considered systematic noise. (Default 1)
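The VCF list passed to --build-sys-noise-vcfs-list is a plain text file with one full VCF path per line. A minimal Python sketch for assembling it is shown here; the helper name and the `*.hard-filtered.vcf.gz` file pattern are assumptions, so adjust the pattern to match the actual output names of the normal runs.

```python
from pathlib import Path


def write_vcf_list(normal_dir, list_path, pattern="*.hard-filtered.vcf.gz"):
    """Write the full path of every matching VCF in normal_dir, one per line.

    The glob pattern is an assumed naming convention, not a DRAGEN requirement.
    """
    vcfs = sorted(p.resolve() for p in Path(normal_dir).glob(pattern))
    if not vcfs:
        raise FileNotFoundError(f"no files matching {pattern} in {normal_dir}")
    Path(list_path).write_text("".join(f"{p}\n" for p in vcfs))
    return vcfs
```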
When enabling DRAGEN for tumor-only somatic calling, potential germline variants can be tagged in the INFO field with 'GermlineStatus' using population databases. Current databases include 1KG and both exome and genome sequencing data from gnomAD. The following options are available for this feature:
--vc-enable-germline-tagging
Enable germline tagging. The default is 'false'. When this option is set to 'true', the following annotation-related parameters must also be set:
--enable-variant-annotation=true
--variant-annotation-data
Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)
--variant-annotation-assembly
The genome build, GRCh37 or GRCh38
Additional options to control how to define germline variants.
--germline-tagging-db-threshold
The minimum alternative allele count across population databases for a variant to be defined as germline (default=50).
--germline-tagging-pop-af-threshold
The minimum population allele frequency for a variant to be defined as germline. Once specified, this will override the input from --germline-tagging-db-threshold.
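The interaction between the two thresholds can be illustrated with a short Python sketch. It assumes the population-database evidence has already been reduced to a single alternative allele count and a single population allele frequency per variant; how DRAGEN aggregates across databases is not described here, and the function name is hypothetical.

```python
def is_germline(alt_allele_count, pop_af, db_threshold=50, pop_af_threshold=None):
    """Decide whether a variant would be tagged as germline.

    If pop_af_threshold is given, it overrides db_threshold, mirroring the
    option descriptions above. Inputs are assumed to be pre-aggregated.
    """
    if pop_af_threshold is not None:
        return pop_af >= pop_af_threshold
    return alt_allele_count >= db_threshold
```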
When enabling DRAGEN for tumor-only or tumor-normal pipelines with Nirvana Annotation, the Nirvana JSON output can be converted into a Mutation Annotation Format (MAF) file. The MAF file is a tab-separated values file containing aggregated mutation information and will be saved to the output directory that you specify. You can enable MAF conversion directly as part of the somatic small variant calling workflow (integrated mode) or separately by providing a path to a VCF file or annotated JSON file (standalone mode).
When running MAF conversion as part of the somatic small variant calling workflow, the following options are required for this feature:
Annotation options:
--enable-variant-annotation=true
Enable variant annotation
--variant-annotation-data
Nirvana annotation database (Downloadable at https://support.illumina.com/content/dam/illumina-support/help/Illumina_DRAGEN_Bio_IT_Platform_v3_7_1000000141465/Content/SW/Informatics/Dragen/Nirvana_DownloadData_fDG.htm)
--variant-annotation-assembly
Genome build, GRCh37 or GRCh38
MAF conversion options:
--enable-maf-output=true
Enable MAF output
--maf-transcript-source
Desired transcript source, RefSeq or Ensembl
Additional standalone options (when running without the variant caller):
--maf-input-vcf
Input VCF with the following form: <path>/<file_name>.hard-filtered.vcf.gz
--maf-input-json
Input JSON with the following form: <path>/<file_name>.hard-filtered.annotated.json.gz
Please note that when specifying standalone mode with VCF input, you must also enable annotation options to generate the JSON file. Conversely, annotation options should not be specified when running standalone mode with an input annotated JSON file.
Optional options:
--maf-include-non-pass-variants
Enabling this option outputs all variants, including non-PASS variants, in the MAF output file. By default, the MAF output contains only variants that have the PASS filter in the hard-filtered VCF file.
Example command lines:
MAF output from BAM input and variant caller:
MAF output from output directory and output file prefix, where the output directory contains a VCF file prefixed by the output file prefix:
MAF output from source VCF file:
Note: This command line will output the MAF file in the same location as the input VCF file. To specify a directory for output, add the --output-dir and --output-file-prefix options.
MAF output from source annotated JSON file:
Note: This command line will output the MAF file in the same location as the input annotated JSON file. To specify a directory for output, add the --output-dir and --output-file-prefix options.
Multisample CNV calling is possible starting from tangent normalized counts files (*.tn.tsv.gz) specified with the --cnv-input option (one per sample). Multisample CNV analysis benefits from using joint segmentation to increase the sensitivity of detection of copy number variable segments. For each copy number variable segment identified, the copy number genotype of each sample is emitted in a single VCF entry to facilitate annotation and interpretation.
Multisample CNV analysis is supported for WGS and WES workflows.
The following is an example command line for running a trio analysis:
Make sure all input samples have gone through the same single sample workflow and have identical intervals. If the samples are WES inputs, then you must generate the samples using the same panel of normals, and the autosomal intervals for all samples must match.
The following options are used in DeNovo CNV calling:
--cnv-input
For DeNovo CNV calling, this specifies the input tangent-normalized signal files (*.tn.tsv.gz) from the single sample runs. This option can be specified multiple times, once for each input sample.
--cnv-filter-de-novo-qual
Phred-scaled threshold at which a putative event in the proband sample is marked as DeNovo. Default value is 0.125.
--pedigree-file
Pedigree file specifying the relationship between the input samples.
First, CNV calling is performed on each sample independently. Joint segmentation then uses the copy number variable segments from each single sample analysis to derive a set of joint copy number variable segments. This set of joint segments is determined simply by taking the union of all breakpoints from the copy number variable segments of all samples. This results in the splitting of any partially overlapping segments across different samples. For example:
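The union-of-breakpoints step can be sketched in Python as follows. This is an illustrative reading of the description rather than DRAGEN's code: segments are given as half-open (start, end) pairs on a single contig, and the sample names and coordinates in the usage note are made up.

```python
def joint_segments(per_sample_segments):
    """Split copy number variable segments at the union of all breakpoints.

    per_sample_segments maps a sample name to a list of (start, end) pairs,
    half-open, all on the same contig. Returns the sorted joint segments.
    """
    all_segments = [seg for segs in per_sample_segments.values() for seg in segs]
    breakpoints = sorted({pos for seg in all_segments for pos in seg})
    joint = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        # Keep only intervals that lie inside at least one sample's segment.
        if any(start <= left and right <= end for start, end in all_segments):
            joint.append((left, right))
    return joint
```

For a hypothetical father segment (100, 500) and proband segment (300, 800), the partially overlapping segments are split into (100, 300), (300, 500), and (500, 800).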
Following joint segmentation, copy number calling is again performed independently on each sample using the joint segments. Segments can be merged as with the single sample analysis, but each joint segment is emitted in the multisample VCF as a single entry. The quality score (QS in the VCF) from the sample's merged segment, if applicable, is used for filtering the call. Sample calls are filtered using the sample's FT field in the multisample VCF. The QUAL column of the multisample VCF is always missing (i.e., "."). The FILTER column of the multisample VCF is SampleFT if none of the samples' FT fields are PASS, and PASS if any of the samples' FT fields are PASS.
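The record-level FILTER rule is simple enough to state as code. The sketch below assumes the per-sample FT values have already been extracted from the SAMPLE columns; the function name is hypothetical.

```python
def record_filter(sample_ft_values):
    """Return the multisample FILTER value from the per-sample FT fields."""
    return "PASS" if any(ft == "PASS" for ft in sample_ft_values) else "SampleFT"
```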
Note, however, that when a single segment in one sample overlaps multiple segments in another sample, the larger segment annotation is replicated across multiple records, e.g. (only relevant VCF fields are printed below):
A de novo event is defined as the existence of a genotype at a particular locus in a proband's genome that did not result from standard Mendelian inheritance from the parents. The de novo calling stage identifies putative de novo events in the proband of each trio of a multisample analysis. In some cases, these putative de novo events may be real, but they can also arise from sequencing or analysis artifacts. Consequently, a de novo quality score is assigned to each putative de novo event and used to filter out low-quality de novo events. Trios are defined by providing a .ped file with the --pedigree-file option. Multiple trios can be specified (e.g., quad analysis), and all valid trios will be processed.
For each joint segment in a trio, the de novo caller determines if there is a Mendelian inheritance conflict for the called copy number genotypes. The CNV caller does not identify the copy number for each allele of a given diploid segment, which means assumptions are made about the possible allelic composition of the parent genotypes.
The assumption is that the copy number 0 allele is not present for diploid regions of a parent's genome (sex dependent) when the assigned copy number is 2 or greater. This results in simplifications, as follows:
The following are examples of consistent and inconsistent copy number genotypes for diploid regions using these assumptions:
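One way to make the stated assumption concrete is the Python sketch below. It is an interpretation, not DRAGEN's implementation, and it covers only diploid regions: a parent with copy number 2 or greater is assumed to carry no 0-copy allele, so each of its two alleles has at least one copy. The function names are hypothetical.

```python
def transmissible_alleles(parent_cn):
    """Per-allele copy numbers a parent could pass on in a diploid region."""
    if parent_cn < 0:
        raise ValueError("copy number cannot be negative")
    if parent_cn >= 2:
        # Assumed: no 0-copy allele, so the two alleles are (a, CN - a), a >= 1.
        return set(range(1, parent_cn))
    if parent_cn == 1:
        return {0, 1}
    return {0}


def is_mendelian_consistent(mother_cn, father_cn, proband_cn):
    """True if the proband CN can be formed from one allele of each parent."""
    return any(
        m + f == proband_cn
        for m in transmissible_alleles(mother_cn)
        for f in transmissible_alleles(father_cn)
    )
```

Under this reading, parents with copy numbers 2 and 2 are consistent only with a proband copy number of 2, while a parent with copy number 1 can transmit either 0 or 1 copies.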
If a joint segment has a Mendelian inheritance conflict, a Phred-scaled de novo quality score (DQ
field in the VCF) is calculated using the likelihoods for each copy number state (see Quality Scoring section) of each sample in the trio, combined with a prior for the trio genotypes:
Where
The DN field in the VCF is used to indicate the de novo status for each segment. Possible values are:
Inherited - the called trio genotype is consistent with Mendelian inheritance
LowDQ - the called trio genotype is inconsistent with Mendelian inheritance and DQ is less than the de novo quality threshold (default 0.125)
DeNovo - the called trio genotype is inconsistent with Mendelian inheritance and DQ is greater than or equal to the de novo quality threshold (default 0.125)
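The three DN values map directly onto a small decision function. The sketch below takes the Mendelian consistency result and the DQ score as inputs; the function name is hypothetical, and the default threshold is the 0.125 value stated above.

```python
def de_novo_status(mendelian_consistent, dq, threshold=0.125):
    """Return the DN value for a joint segment in a trio."""
    if mendelian_consistent:
        return "Inherited"
    return "DeNovo" if dq >= threshold else "LowDQ"
```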
The records in a multisample CNV VCF differ slightly from the single sample case. The major differences are as follows:
The per-record entries are broken down into the segments among the union of all the input samples breakpoints, which means there are more entries in the overall VCF.
The QUAL column is not used and its value is ".". The per-sample quality is carried over into the SAMPLE columns with the QS tag.
The FILTER column indicates PASS if any of the individual SAMPLE columns PASS. Otherwise, it indicates SampleFT.
The per-sample annotations are carried over from their originating calls. The single sample filters are applied at the sample level and are emitted in the FT annotation.
Additionally, if a valid pedigree is used, then de novo calling is performed, which adds the following two annotations to the proband sample.
While the VCF contains many entries due to the joint segmentation stage, the de novo events can be found by extracting entries that have a DN and DQ annotation. In the de novo calling case, these records are also extracted and converted to GFF3.
DRAGEN emits the calls in the standard VCF format. By default for analyses other than somatic WGS, the VCF file includes only copy number gain and loss events (and LOH events, where allele-specific copy number is available). To include copy neutral (REF) calls in the output VCF, set --cnv-enable-ref-calls to true.
File extension: *.cnv.vcf.gz
The CNV VCF file follows the standard VCF format. Due to the nature of how CNV events are represented versus how structural variants are represented, not all fields are applicable. In general, if more information is available about an event, then the information is annotated. Some fields in the DRAGEN CNV VCF are unique to CNVs. The VCF header is annotated with ##source=DRAGEN_CNV to indicate the file is generated by the DRAGEN CNV pipeline.
The following is an example of some of the header lines that are specific to CNV:
The following header lines are specific to somatic WGS CNV calling:
ModelSource
The primary basis on which the final tumor model was chosen. The following values can be included:
DEPTH+BAF: Depth+BAF signal is used to determine tumor model.
DEPTH+BAF_DOUBLED: The initial depth+BAF model is duplicated based on VAF signal or excess segments at half the expected depth change.
DEPTH+BAF_DEDUPLICATED: The depth+BAF model is deduplicated based on VAF signal or insufficient segments supporting a duplication.
DEPTH+BAF_WEAK: Depth+BAF signal is used to determine lower-confidence tumor model.
VAF: VAF signal is used to determine tumor model due to insufficient depth+BAF signal.
DEGENERATE_DIPLOID: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. The diploid coverage is set to lowest value observed in a substantial number of bases in segments with BAF=50%.
SAMPLE_MEDIAN: Sample is treated as high-purity diploid in absence of adequate signal from depth+BAF and VAF. Diploid coverage set to sample median.
EstimatedTumorPurity
Estimated fraction of cells in the sample due to tumor. The range of this field is [0, 1] or NA if a confident model could not be determined.
DiploidCoverage
Expected read count for a target bin in a diploid region. The numeric value is unlimited.
OverallPloidy
Length weighted average of tumor copy number for PASS events. The numeric value is unlimited.
AlternativeModelDedup
An alternative to the best model corresponding to one less whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation if the best model might involve a spurious genome duplication.
AlternativeModelDup
An alternative to the best model corresponding to one more whole-genome duplication. The alternative is given as a pair of values (tumor purity, diploid coverage). This can be useful for manual investigation where the best model might have missed a true genome duplication.
OutlierBafFraction
A QC metric that measures the fraction of b-allele frequencies that are incompatible with the segment the BAFs belong to. High values might indicate a mismatched normal, substantial cross-sample contamination, or a different source of a mosaic genome, such as bone marrow transplantation. The range of this field is [0, 1].
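As an illustration of the OverallPloidy definition (length-weighted average of tumor copy number for PASS events), the following Python sketch computes the same kind of average from a list of segments. The tuple layout and function name are hypothetical, and which records DRAGEN counts as events for this metric is not specified here.

```python
def overall_ploidy(segments):
    """Length-weighted mean copy number over PASS segments.

    segments is an iterable of (length, copy_number, filter_value) tuples.
    Returns None when no PASS segment is available.
    """
    passing = [(length, cn) for length, cn, flt in segments if flt == "PASS"]
    total_length = sum(length for length, _ in passing)
    if total_length == 0:
        return None
    return sum(length * cn for length, cn in passing) / total_length
```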
All coordinates in the VCF are 1-based.
The CHROM column specifies the chromosome (or contig) on which the copy number variant being described occurs.
The POS column is the start position of the variant. According to the VCF specification, if any of the ALT alleles is a symbolic allele, such as <DEL>, then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.
The ID column is used to represent the event. The ID field encodes the event type and coordinates of the event (1-based, inclusive). In addition to representing GAIN, LOSS, and REF events, in Somatic WGS CNV, the ID could include the Copy Neutral Loss of Heterozygosity (CNLOH) or Copy Number Gain with LOH (GAINLOH) events.
The REF column contains an N for all CNV events.
The ALT column specifies the type of CNV event. Because the VCF contains only CNV events, only the <DEL> or <DUP> entries are used. If REF calls are emitted, their ALT is always ".". In Somatic WGS CNV, the ALT field can contain two alleles, such as <DEL>,<DUP>, which allows representation of allele-specific copy numbers if they differ in copy number states.
The QUAL column contains an estimated quality score for the CNV event, which is used in hard filtering. Each CNV workflow has different defaults and the value used can be found in the VCF header.
The FILTER column contains PASS if the CNV event passes all filters, otherwise the column contains the name of the failed filter. Default values are defined in the header line for each available FILTER.
Available FILTERs:
cnvLength - Indicates that the length of the CNV is lower than a threshold.
cnvQual - Indicates that the QUAL of the CNV is lower than a threshold.
Germline CNV has the following additional FILTERs:
cnvCopyRatio - Indicates that the segment mean of the CNV is not far enough from copy neutral.
Both Germline CNV workflows have the following additional FILTERs:
dinucQual - Indicates a CNV call where some of its dinucleotide percentages are outside typical ranges, and thus the call is likely to be a false positive.
Germline WGS CNV has the following additional FILTERs:
cnvBinSupportRatio - Indicates that, for CNVs greater than 80 kb, the percent span of supporting target intervals is lower than a threshold.
highCN - Indicates a CNV call with implausible copy number (>6).
Germline WES CNV has the following additional FILTERs:
cnvLikelihoodRatio - Indicates that the log10 likelihood ratio of ALT to REF is less than a threshold.
Both Somatic CNV workflows have the following additional FILTERs:
binCount - Filters CNV events with a bin count lower than a threshold.
Somatic WGS CNV has the following additional FILTERs:
lengthDegenerate - Marks records as non-PASS based on each record's length (REFLEN) when the caller returns the default model. Segments shorter than 1 Mb are assigned this filter when returning the default model.
segmentMean - Marks records as non-PASS based on each record's segment mean (SM) when the caller returns the default model. Segments having insufficient SM in DELs or DUPs are assigned this filter when returning the default model.
Somatic WES CNV has the following additional FILTERs:
SqQual - Marks records as non-PASS based on each record's somatic quality (SQ) when the caller returns the default model. Segments having insufficient SQ are assigned this FILTER when returning the default model. SQ is the somatic quality value, which is a Phred-scaled score of the p-value from a 2-sample t-test comparing normalized counts of CASE vs PON.
The INFO column contains information representing the event.
REFLEN indicates the length of the event.
SVLEN is a signed representation of REFLEN (e.g., a negative value indicates a deletion), and it is only present for non-REF records.
SVTYPE is always CNV and only present for non-REF records.
END indicates the end position of the event (1-based, inclusive).
If using a segment BED file, then the segment identifier is carried over from the input to the SEGID field.
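The INFO keys described above can be read with a few lines of Python using standard VCF conventions (semicolon-separated key=value pairs). The helper names are hypothetical, and the INFO string in the usage note is made up for illustration rather than copied from DRAGEN output.

```python
def parse_info(info):
    """Parse a VCF INFO column into a dict; flag keys map to True."""
    fields = {}
    if info in ("", "."):
        return fields
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def is_deletion(info_fields):
    """True when SVLEN is present and negative (absent for REF records)."""
    return "SVLEN" in info_fields and int(info_fields["SVLEN"]) < 0
```

For example, `parse_info("END=2000;REFLEN=1000;SVLEN=-1000;SVTYPE=CNV")` yields a dict whose SVLEN is the negated REFLEN, and `is_deletion` returns True for it.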
The common FORMAT fields are described in the header:
Germline WGS CNV includes the following FORMAT fields:
Germline WES CNV includes the following FORMAT fields:
Somatic WGS CNV and Somatic WES CNV with ASCN (allele-specific copy number) support include the following FORMAT fields:
Somatic WES CNV without ASCN support provides only the common FORMAT fields and does not include the CN entry, since it does not estimate the tumor purity fraction and cannot make an estimate of the copy number.
Because germline copy number calling determines overall copy number rather than the copy number on each haplotype, the genotype (GT) field contains missing values for diploid regions when CN is greater than or equal to 2. The following are examples of the GT field for various VCF entries:
The DRAGEN CNV pipeline provides a measure of the quality of the data for a sample. If using the WGS self-normalization method, the additional CoverageUniformity metric is present in the VCF header. The metric is only available for germline samples. The CNV pipeline assumes that post-normalization target counts are independently and identically distributed (IID). Coverage in most high-quality WGS samples is uniform enough for the CNV caller to produce accurate calls, but some samples violate the IID assumption. Issues during library preparation or sample contamination can lead to several extreme outliers and/or waviness of target counts, which can result in a large number of false positive CNV calls. The CoverageUniformity metric quantifies the degree of local coverage correlation in the sample to help identify poor-quality samples.
A larger value for this metric means the coverage in a sample is less uniform, which indicates that the sample has more nonrandom noise, and could be considered poor quality. The CoverageUniformity metric depends on factors other than sample quality, such as the cnv-interval-width setting and sample mean coverage. DRAGEN recommends using this score to compare the quality of samples with similar mean coverage that were run with the same command line options. Because of this, DRAGEN CNV only provides the metric and does not take any action based on it.
DRAGEN CNV outputs metrics in CSV format. The output follows the general convention for QC metrics reporting in DRAGEN. The CNV metrics are output to a file with a *.cnv_metrics.csv file extension. The following list summarizes the metrics that are output from a CNV run.
Sex Genotyper:
The estimated sex of the case sample and of all panel of normals samples is reported. For WGS workflows, the estimated sex karyotype is reported, and for non-WGS workflows the gender is reported.
Confidence score (ranging from 0.0 to 1.0). If the sample sex is specified, this metric is 0.0.
CNV Summary:
Bases in reference genome in use
Average alignment coverage over genome
Number of alignment records processed
Number of filtered records (total)
Number of filtered records (due to duplicates)
Number of filtered records (due to MAPQ)
Number of filtered records (due to being unmapped)
Coverage MAD
Median Bin Count
Number of target intervals
Number of normal samples
Number of segments
Number of amplifications
Number of deletions
Number of PASS amplifications
Number of PASS deletions
Coverage MAD and Median Bin Count are only printed for WES germline/somatic CNV. Coverage MAD is the median absolute deviation of normalized case counts. Higher values indicate noisier sample data (poor quality). Median Bin Count is the median of raw counts normalized by interval size.
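The median absolute deviation itself can be sketched in Python as shown here. This is the textbook definition with no scaling constant, which is an assumption; the exact normalization DRAGEN applies before computing Coverage MAD is not reproduced.

```python
from statistics import median


def coverage_mad(normalized_counts):
    """Median absolute deviation of a sequence of normalized counts."""
    values = list(normalized_counts)
    if not values:
        raise ValueError("at least one count is required")
    center = median(values)
    return median(abs(value - center) for value in values)
```

Because the median is used twice, a single extreme interval has little effect on the result, while widespread noise raises it.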
Intermediate stages of the pipeline produce various intermediate output files. These files can be useful for visualization of the evidence or results from each stage, and may aid in fine-tuning options.
All files have a structure similar to a BED file with optional header line(s).
The file *.target.counts.gz is a compressed tab-delimited text file that contains the number of read counts per target interval. This is the raw signal as extracted from the alignments of the BAM or CRAM file. The format is identical for both the case sample and any panel of normals samples. There is also a bigWig representation of a target.counts.diploid file, which is normalized to the normal ploidy level of 2 instead of raw counts.
It has the following columns:
Contig identifier
Start position
End position
Target interval name
Count of alignments in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
An example of a *.target.counts.gz file is shown below.
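Given the six-column layout described above, a *.target.counts.gz file can be loaded with the Python standard library alone. The sketch below is a hedged example: the function name and the dictionary keys are hypothetical, and it skips any line whose start column is not an integer on the assumption that a column-name line may follow the # header lines.

```python
import csv
import gzip

COLUMNS = ("contig", "start", "end", "name", "count", "improper_pairs")


def read_target_counts(path):
    """Read per-interval counts from a gzipped, tab-delimited counts file."""
    rows = []
    with gzip.open(path, "rt", newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            if len(fields) < len(COLUMNS) or fields[0].startswith("#"):
                continue  # header or malformed line
            if not fields[1].isdigit():
                continue  # assumed column-name line
            row = dict(zip(COLUMNS, fields))
            row["start"] = int(row["start"])
            row["end"] = int(row["end"])
            # float also covers GC-corrected counts, which may be fractional
            row["count"] = float(row["count"])
            row["improper_pairs"] = float(row["improper_pairs"])
            rows.append(row)
    return rows
```

The same function can be pointed at the GC-corrected and tangent-normalized files only if their value column is non-negative in the start field and the layout matches; log2 values in the fifth column parse as floats without changes.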
B-allele counts are calculated at sites in the tumor sample where the normal sample is likely to be heterozygous. When analyzed in conjunction with a matched normal sample, the sites are those that are called as heterozygous SNVs in the normal sample. When analyzed in tumor-only mode, they are taken from a collection of sites that have high-frequency SNVs in the population. Each B-allele site consists of a reference allele and a variant allele, and the number of reads in the tumor sample supporting each of these alleles is counted.
B-allele counts are written both to the gzipped tsv file *.ballele.counts.gz and to the gzipped bedgraph file *.baf.bedgraph.gz.
The tsv file format is the following:
Contig identifier
Start, BED-style (zero-based inclusive) start position of the reference allele
Stop, BED-style (zero-based exclusive) stop position of the reference allele
Base sequence for the reference allele
Base sequence for the first allele being counted
Base sequence for the second allele being counted
The number of qualified reads containing a sequence matching the first allele
The number of qualified reads containing a sequence matching the second allele
Additionally, in the case of B-allele sites from a population VCF, the following two additional columns are added after the columns listed above:
Population frequency for the first allele
Population frequency for the second allele
An example of B-allele counts file is provided below:
The bedgraph file format is similar to the BED format and it has the following columns:
Contig identifier
Start
Stop
Ratio of allele counts
The numerator and denominator of the ratio are determined by sorting the allele counts according to the priority of the corresponding bases. The order of the bases in descending priority is {A, T, G, C}.
When the priority of allele1 is higher than the priority of allele2, the output frequency is calculated by:
When the priority of allele2 is higher than the priority of allele1, the output frequency is calculated by:
By prioritizing the bases in this way, the output frequencies will be deterministically distributed in a roughly equal proportion above and below 0.5. When plotting these B-allele frequencies (e.g., in IGV), this gives an easy way to visually determine significant changes in b-allele frequency between neighboring segments of the genome. It also provides a similar visualization to that typically used for array data.
An example of the bedgraph file is shown below:
The file *.target.counts.gc-corrected.gz contains the number of GC-corrected read counts per target interval. The format is equivalent to the *.target.counts.gz file:
Contig identifier
Start position
End position
Target interval name
GC-corrected read counts in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #. They contain the DRAGEN version and command line that was used to generate the file, as well as other meta information.
An example of a *.target.counts.gc-corrected.gz file is shown below.
The file *.combined.counts.txt.gz is a column-wise concatenation of individual *.target.counts.gz and *.target.counts.gc-corrected.gz files used to form the panel of normals.
The file *.tn.tsv.gz contains the normalized signal of the case sample, per target interval, i.e., the log2-normalized copy ratio signal. A strong signal deviation from 0.0 indicates a potential for a CNV event. The format is equivalent to the *.target.counts.gz file:
Contig identifier
Start position
End position
Target interval name
Log2-normalized read counts in this interval
Count of improperly paired alignments in this interval
Header lines are also included that start with #.
An example of a *.tn.tsv.gz file is shown below.
File extension: *.seg, *.seg.called, *.seg.called.merged
Files containing the segments produced by the segmentation algorithm. The Segment_Mean value of a segment is the ratio of the mean of that segment to the whole-sample median, without log transformation (linear copy-ratio). A strong signal deviation from 1.0 indicates a potential for a CNV event.
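The *.tn.tsv.gz values are on a log2 scale (neutral at 0.0), while Segment_Mean is on a linear scale (neutral at 1.0). The Python sketch below only converts between the two scales so that values can be shown on a common axis; it does not reproduce how DRAGEN derives Segment_Mean from the per-interval signal, and the function names are hypothetical.

```python
import math


def log2_to_linear(log2_ratio):
    """Convert a log2 copy ratio to a linear copy ratio (0.0 maps to 1.0)."""
    return 2.0 ** log2_ratio


def linear_to_log2(linear_ratio):
    """Convert a linear copy ratio to log2 scale (1.0 maps to 0.0)."""
    if linear_ratio <= 0:
        raise ValueError("linear copy ratio must be positive")
    return math.log2(linear_ratio)
```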
The *.seg file has the following columns:
Sample name
Contig identifier
Start position
End position
Number of intervals in the segment
Linear copy-ratio of the segment
An example of a *.seg file is shown below.
The *.seg.called file is identical to the *.seg file, with an additional column indicating the initial call for whether the segment is a duplication (+) or a deletion (-).
The *.seg.called.merged file is identical to the *.seg.called file, but with segments potentially merged when they meet internal merging criteria. In addition to the columns described above, this file also has the following columns:
QUAL
FILTER
Copy number assignment
Ploidy
Improper_Pairs count
In addition to segmentation of target counts, some workflows perform segmentation of B-allele loci. The output file has the suffix *.baf.seg and has the same format as the *.seg file, with two modifications. First, the Segment_Mean value is the mean over B-allele loci of the smaller observed allele fraction. Second, there is an additional column:
BAF_SLM_STATE: Integer between 0 and 10, indicating bins of minor-allele fraction (low to high), or "." when the BAF data are too variable to estimate a minor-allele fraction
An example of segmentation output file is shown below:
The file *.cnv.purity.coverage.models.tsv describes the different tested models and their log-likelihood. It has the following columns:
Model purity (Cellularity)
Model diploid coverage
Model log-likelihood
An example is shown below:
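When inspecting the models file, it can help to rank the tested models by log-likelihood. The Python sketch below assumes the three columns appear in the order listed above (purity, diploid coverage, log-likelihood) and skips any line that does not parse as three numbers, such as a header. The function name is hypothetical, and the highest-likelihood row is not necessarily the model the caller finally reports, since the ModelSource values indicate that other signals can influence the choice.

```python
import csv


def best_model(path):
    """Return (purity, diploid_coverage, log_likelihood) with the highest
    log-likelihood, or None if no numeric row is found."""
    best = None
    with open(path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            try:
                purity, coverage, loglik = (float(v) for v in fields[:3])
            except ValueError:
                continue  # header, comment, or short line
            if best is None or loglik > best[2]:
                best = (purity, coverage, loglik)
    return best
```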
To generate additional equivalent bigWig and gff files, set the --cnv-enable-tracks option to true. These files can be loaded into IGV along with other tracks that are available, such as RefSeq genes. Using these tracks alongside publicly available tracks allows for easier interpretation of calls. DRAGEN autogenerates an IGV session XML file if tracks are generated by DRAGEN CNV. The *.cnv.igv_session.xml file can be loaded directly into IGV for analysis.
The following IGV tracks are automatically populated in the output IGV session file:
*.target.counts.bw --- BigWig representation of the target counts bins. Setting the track view in IGV to barchart or points is recommended. Values are GC-corrected if GC correction was performed.
*.improper_pairs.bw --- BigWig representation of the improper pairs counts. Setting the track view in IGV to barchart is recommended.
*.tn.bw --- BigWig representation of the tangent normalized signal. Setting the track view in IGV to points is recommended.
*.seg.bw --- BigWig representation of the segments. Setting the track view in IGV to points is recommended.
*.baf.bedgraph.gz --- BED graph representation of B-allele frequency (if available). Setting the track view in IGV to points is recommended.
*.cnv.gff3 --- GFF3 representation of the CNV events. DEL events show as blue and DUP events show as red. Filtered events are a light gray. If REF events are enabled, then they will show up as green. An example of DRAGEN CNV gff3 is shown below (different CNV workflows might output different attributes in the 9th column):
For somatic WGS analyses, the following additional files are included in the IGV session xml:
*.baf.seg.bw --- BigWig representation of the BAF segments. Setting the track view in IGV to points is recommended.
*.tumor.baf.bedgraph.gz --- Bedgraph representation of the B-allele frequencies. Setting the track view in IGV to points and windowing function to none is recommended.
File extension: *.igv_session.xml
The IGV session XML file is prepopulated with track files generated by DRAGEN. The session file loads the reference genome that best matches the standard reference genomes in an IGV installation, by comparing the name of the --ref-dir specified on the command line. Standard UCSC human reference genomes are autodetected, but any variations from the standard reference genomes might not be autodetected. To edit the genome detection, alter the genome attribute in the Session element to the reference genome you would like for analysis before loading into IGV. The reference identifier used by IGV might differ from the actual name of the genome. The following is an example edited session file.
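Instead of editing the XML by hand, the genome attribute can be changed with Python's standard XML library. This is a minimal sketch: the function name is hypothetical, it assumes the root element of the session file is Session as described above, and the genome identifier passed in must be one that the local IGV installation recognizes.

```python
import xml.etree.ElementTree as ET


def set_session_genome(session_path, genome_id, output_path=None):
    """Rewrite the genome attribute of the Session element in an IGV session."""
    tree = ET.parse(session_path)
    root = tree.getroot()
    if root.tag != "Session":
        raise ValueError(f"unexpected root element: {root.tag}")
    root.set("genome", genome_id)
    # Overwrites the input file unless a separate output path is given.
    tree.write(output_path or session_path, encoding="UTF-8", xml_declaration=True)
```

Writing to a separate output path keeps the DRAGEN-generated session file untouched.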
Note that depending on the IGV version installed, it may come prepackaged with different flavors of GRCh37. The reference naming conventions have changed, so a user may have to edit the genome field in the XML file directly. For example, IGV has traditionally packaged a b37 reference genome, but may also include a 1kg_v37 or a 1kg_b37+decoy, which will appear on the IGV user interface as "1kg, b37" or "1kg, b37+decoy" respectively.
You can determine the correct encoding of a reference genome by going to File > Save Session...
and then inspecting the generated igv_session.xml file.
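The genome attribute edit described above can also be scripted. The following is a minimal sketch using Python's standard library; it assumes the Session element is the root of the XML file, and the helper name set_session_genome is hypothetical (not part of DRAGEN or IGV).

```python
import xml.etree.ElementTree as ET


def set_session_genome(xml_text, genome_id):
    """Return session XML text with the Session element's genome attribute replaced.

    Assumes Session is the root element of the session file.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "Session":
        raise ValueError(f"expected a Session root element, found {root.tag!r}")
    # Overwrite (or create) the genome attribute; other attributes are untouched.
    root.set("genome", genome_id)
    return ET.tostring(root, encoding="unicode")
```

The returned text omits the XML declaration, so compare it against a session saved from IGV before relying on it.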
DRAGEN CNV outputs can be ingested using third-party libraries in most commonly used languages, such as Python or R. The typically used files are:
*.target.counts.gz
or *.target.counts.gc-corrected.gz
, containing the number of alignments, or corrected alignments, per interval. Used to display the coverage profile across all intervals.
*.tn.tsv.gz
, containing the log2-normalized copy ratio per interval.
*.baf.bedgraph.gz
, if BAF is available, containing the BAF for each considered site. Used to display the BAF profile across all sites.
In all previously specified files, the format is similar to BED, allowing them to be loaded as any other tab-separated files.
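As a minimal sketch of loading one of these files with only the Python standard library, the function below reads a gzip-compressed, tab-separated file and returns its data rows. The helper name read_bed_like is hypothetical, and the handling of header lines (skipping lines that start with "#" and rows whose second column is not an integer) is an assumption; check the header of your own files before relying on column positions.

```python
import csv
import gzip


def read_bed_like(path, comment_prefix="#"):
    """Read a gzip-compressed, tab-separated BED-like file into a list of rows.

    Each returned row is (contig, start, end, remaining_columns).
    """
    rows = []
    with gzip.open(path, "rt", newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and comment lines (assumed "#" convention).
            if len(fields) < 3 or fields[0].startswith(comment_prefix):
                continue
            # Skip a column-header row, recognised by a non-numeric start field.
            if not fields[1].isdigit():
                continue
            contig, start, end, *rest = fields
            rows.append((contig, int(start), int(end), rest))
    return rows
```

The remaining columns are returned as strings because their meaning differs between the counts, copy ratio, and BAF files.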
A similar workflow can be used to plot copy number calls (and minor copy number calls, if available) by using the *.cnv.gff3
output file. Some examples of DRAGEN output GFF3 are shown below:
Germline WGS
Somatic WGS
From the output GFF3, the typical steps are to parse each segment's coordinates and the CopyNumber
annotation (or any other annotation the user might want to plot), and to plot them using the libraries listed previously for coverage/BAF profiles (or any other library and language of the user's choice).
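The parsing step can be sketched as follows, assuming standard GFF3 layout (nine tab-separated columns, 1-based inclusive start and end in columns 4 and 5, and key=value attributes separated by semicolons in column 9). The function name parse_gff3_copy_numbers is hypothetical, and the attribute value is kept as a string because the attributes differ between CNV workflows.

```python
def parse_gff3_copy_numbers(lines, attribute="CopyNumber"):
    """Extract (contig, start, end, attribute_value) from GFF3 lines.

    Records without the requested attribute are skipped.
    """
    segments = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        # Column 9 holds semicolon-separated key=value pairs.
        attrs = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        if attribute in attrs:
            segments.append((cols[0], int(cols[3]), int(cols[4]), attrs[attribute]))
    return segments
```

The resulting tuples can be passed to any plotting library, for example as horizontal line segments at the copy number value.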
To improve accuracy, the DRAGEN CNV Pipeline excludes genomic intervals if one or more of the target intervals failed at least one quality requirement. The excluded intervals are reported to the *.excluded_intervals.bed.gz
file. The file is in BED format, identifies the regions of the genome that are not callable for CNV analysis, and describes the reason each interval was excluded in the fourth column. The following are the possible reasons for exclusion.
An example of a *.excluded_intervals.bed.gz
file is shown below:
The DRAGEN CNV Pipeline generates the PON Metrics File (.cnv.pon_metrics.tsv.gz
) if a Panel of Normals is provided and --cnv-generate-pon-metric-file
is set to true
. If PON size is less than 2, then an empty file will be generated.
The PON Metric File includes basic statistics of the coverage profile for each interval. To remove sample coverage bias, DRAGEN applies sample median normalization, and then computes the following metrics:
Example:
The DRAGEN CNV Pipeline generates the PON Correlation File (.cnv.pon_correlation.txt.gz
) if a Panel of Normals is provided. The PON Correlation File includes correlation between CASE sample and each PON sample.
Example:
The SegDups extension provides intermediate and final outputs. All intervals follow the bed format (0-based, start inclusive, end exclusive) and they are in tab-delimited text files (gzip compressed).
The final output has extension .cnv.segdups.rescued_intervals.tsv.gz
, and contains the rescued target intervals which can then be injected before segmentation. It has columns:
Chromosome name
Start - 0-based inclusive
Stop - 0-based exclusive
Target interval name prefixed with "target-wgs-"
Sample Counts (in header, identifier taken from RGSM) - log2-scale normalized counts for each interval
Improper pairs - Kept for compatibility with CNV workflow, set to 0 for rescued intervals
Target region ID - ID of the target region (aka pair of rescued target intervals)
The joint normalized coverage profile (log2-scale) for each region is provided in output to file .cnv.segdups.joint_coverage.tsv.gz
with columns:
Target region ID
Joint normalized coverage (log2-scale) of the two intervals in the target region
Copy Number Float - estimate of joint copy number for the target region (e.g., CNF ~ 4)
The differentiating sites' data is provided in output to file .cnv.segdups.site_ratios.tsv.gz
with columns:
Differentiating site name
Target (gene A) counts at site
Non-target (gene B) counts at site
Target ratio: gene A counts over total (i.e., gene A + gene B) counts at site
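The target ratio in the last column follows directly from the two count columns. A small illustration (the function name target_ratio is hypothetical, and returning None for a site with no counts is a choice made for this sketch, not documented DRAGEN behavior):

```python
def target_ratio(target_counts, non_target_counts):
    """Gene A counts divided by total (gene A + gene B) counts at a site."""
    total = target_counts + non_target_counts
    if total == 0:
        return None  # ratio is undefined when no reads cover the site
    return target_counts / total
```

For example, 30 target counts and 10 non-target counts give a ratio of 30 / 40 = 0.75.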
The DRAGEN CNV caller leverages depth as its primary signal for calling copy number variants. Depth alone poses challenges for calling events that are less than 10kbp. The sensitivity of CNVs at lengths less than 10kbp can be improved by leveraging junction signals from the DRAGEN structural variant caller.
When both the DRAGEN CNV and SV caller are executed in a single invocation, then an additional integration step is done at the end of a DRAGEN run to improve the CNV calls. This feature is enabled automatically when DRAGEN detects a germline WGS analysis.
The SV/CNV Integration module takes in DEL and DUP calls from the output data structures of the germline CNV and SV callers, identifies putative matches, updates annotations, filters, scores, and outputs the refined records in a new output VCF. By leveraging junction signals from the SV caller and depth signals from the CNV caller, this approach allows for sensitive CNV detection down to 1kbp while also improving recall and precision across length scales. This is achieved by rescuing previously low quality calls if evidence is found from both callers, and also by adjusting CNV breakends to the more accurate SV breakends. The matching algorithm takes into account the proximity of the events as well as the transition states at the breakends, among other things.
The following is an example command line for running a germline WGS analysis for both CNV and SV.
Other optional CNV or SV parameters can also be added.
The original CNV and SV VCF output files, prior to integration, are available for users in the DRAGEN output directory, as described elsewhere. Additionally, there is an enhanced CNV VCF available with the *.cnv_sv.vcf.gz
extension. The VCF header lines in the *.cnv_sv.vcf.gz
mostly correspond to a concatenation of the individual header lines from the CNV and SV VCFs, with a few lines deduplicated and some new ones added. For details on the legacy header lines, please refer to the individual CNV and SV user guide sections.
Newly added header lines are described in the following table.
Records that can be matched or rescued will have annotations indicating the breakpoint linkage between a CNV and SV record. If a complete match is found, then the MatchSv
annotation will be present in the record, indicating the SV record's ID
field for this CNV record. Furthermore, the use of the SVCLAIM
field will indicate if the record has evidence arising from depth signal D
, or junction signals J
, or both DJ
.
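As an illustration of reading this annotation downstream, the sketch below extracts SVCLAIM from one VCF record and maps it to the evidence type. It assumes SVCLAIM is carried as a key=value entry in the INFO column (column 8); the function name and the made-up record in the usage note are not DRAGEN output.

```python
EVIDENCE_BY_SVCLAIM = {
    "D": "depth",
    "J": "junction",
    "DJ": "depth and junction",
}


def svclaim_evidence(vcf_line):
    """Return the evidence type for a VCF record, or None if SVCLAIM is absent."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = fields[7]  # INFO is the 8th VCF column
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "SVCLAIM":
            return EVIDENCE_BY_SVCLAIM.get(value)
    return None
```

For a record whose INFO column is END=3000;SVCLAIM=DJ, the function returns "depth and junction".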
Because of the mixing of standalone SV records and CNV records, the FORMAT field may have different annotations. For details on the CNV or SV specific annotations, please refer to the individual CNV and SV user guide sections.
Records that can be matched or rescued will have FILTER set to PASS. The original FILTERs are retained for records that were not matched or rescued. For example, the cnvLength
FILTER will still be applied to standalone CNV records (those with SVCLAIM=D
).
Example records are shown below.
The DRAGEN small variant caller is a haplotype-based caller which performs local assembly of all reads in an active region into a de Bruijn graph (DBG). The assembly process uses all the read bases including the soft-clip bases of reads. The soft-clip bases provide evidence for the presence of variants, specifically longer insertions and deletions which are not present in the read cigar and hence cannot be directly viewed in IGV.
The assembly and realignment step (using pair-HMM) performed by the variant caller aims to correct mapping errors made by the original aligner and improves overall variant calling accuracy. The evidence BAM shows how the variant caller sees the read evidence and how the reads have been realigned, making it a very useful debugging tool.
By default, the evidence BAM contains only a subset of regions processed by the small variant caller. Only regions that have candidate indel variants and some percentage of soft-clip reads in the pileup are realigned and output in the evidence BAM. This is done to reduce the run-time overhead needed to generate the evidence BAM.
The output of the VC Evidence BAM feature will match the output format that the customer has selected using --output-format option. The default format is bam.
A bam/cram/sam file with the suffix _evidence.bam/cram/sam
and the corresponding index file. The evidence BAM can be enabled along with the regular BAM output from the Map-Align step. When multiple BAMs are passed as inputs to the variant caller, for example in Tumor-Normal calling, they are combined in the evidence BAM output and tagged with appropriate read groups.
A bed file with regions that were realigned and output in VC Evidence BAM with suffix ".realigned-regions.bed".
The evidence BAM consists of realigned reads, badly mated reads and reads that are disqualified by the variant caller based on the read likelihood scores.
Disqualified and Badly Mated reads
Reads that are badly mated (when the read and its mate are mapped to different chromosomes) are tagged with a BM tag (integer), and reads that are disqualified (based on read likelihoods) are tagged with the DQ tag (integer). These reads are filtered out by the genotyper in the variant caller. The alignment score tag AS is forced to 0 for such reads in the evidence BAM, so they can be filtered from the IGV pileup by setting the minimum AS score to 1 instead of 0.
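The same minimum-AS filter can be applied outside IGV. The following is a minimal sketch that works on SAM text records (for example, the evidence output written with the sam output format) using only the standard library. The function names are hypothetical, and keeping reads that carry no AS tag is a choice made for this sketch.

```python
def alignment_score(sam_line):
    """Return the integer AS tag of a SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags follow the 11 mandatory SAM fields.
    for tag in fields[11:]:
        if tag.startswith("AS:i:"):
            return int(tag[5:])
    return None


def keep_for_pileup(sam_line, min_as=1):
    """Drop reads whose AS is below min_as; reads without an AS tag are kept."""
    score = alignment_score(sam_line)
    return score is None or score >= min_as
```

With min_as=1, reads whose AS was forced to 0 (badly mated or disqualified) are dropped.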
Graph Haplotypes
When enabling graph haplotypes output using --vc-evidence-bam-output-haplotypes
, all the haplotypes constructed by the de Bruijn graph are output in the evidence BAM as single reads covering the entire active region. The reads and haplotypes are tagged with different read groups, which makes them easy to distinguish in IGV. In IGV, use “Color Alignments By” or “Group Alignments By” > read group to separate the reads from the haplotypes. The haplotypes are tagged with read group EvidenceHaplotype
and the reads are part of the EvidenceRead_Normal/Tumor
read group.
The haplotypes are named Haplotype 1, Haplotype 2, and so on, and have an additional ‘HC’ tag (integer). The realigned reads also have an HC tag, which encodes the haplotype that best matches the read based on the likelihood calculation. Only reads that are supported by a single unique haplotype have the HC tag; reads that match more than one haplotype well do not have an HC tag. This tag is primarily intended to enable highlighting of reads in IGV. Go to "Color Alignments By > Tag" and enter "HC" to view which reads are uniquely supported by a given graph haplotype.
The default mode of the small variant caller has been optimized to detect germline variants with typical allele frequencies (AFs) of 0%, 50% or 100%. On the other hand, non-cancer post-zygotic mosaic variants have typical AFs lower than 50% and are therefore more challenging to find with the default small variant caller. To improve sensitivity for low AF calls, a machine learning (ML) model trained using read and context evidence from low AF calls is used. This allows the model to identify variants down to approximately 5% AF on 35x WGS and 3% AF on 300x WGS. The mosaic ML model is applied to all calls that are rejected by the germline model, and variants detected with the mosaic detection are identified by a MOSAIC
flag in the VCF INFO field.
When mosaic detection is enabled, the hard filter QUAL
threshold for both SNPs and INDELs is lowered to 0.4
to allow low AF calls to be set as PASS
in the FILTER field. MOSAIC
tagged variants with QUAL
smaller than 3
are filtered with the MosaicHardQUAL
filter.
We provide an optional MosaicLowAF
filtering option to filter MOSAIC
tagged variants with AF
smaller than the AF
threshold. The threshold for this filter can be set with the --vc-mosaic-af-filter-threshold
option.
Furthermore, the output of MOSAIC
tagged calls can be restricted using an optional target BED provided with the --vc-mosaic-target-bed
option.
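The two mosaic-specific filters described above can be summarized in a short sketch. It is a simplified reading of this section, not DRAGEN's implementation: it covers only MOSAIC tagged variants, ignores any other filters, and the function name is hypothetical. The default thresholds (3.0 for QUAL, 0.2 for AF) are the defaults stated in this section; pass af_threshold=0.0 to mirror the behavior described for --vc-enable-mosaic-detection=true.

```python
def mosaic_filters(qual, af, qual_threshold=3.0, af_threshold=0.2):
    """Return the mosaic filters that would apply to a MOSAIC tagged variant."""
    filters = []
    if qual < qual_threshold:
        filters.append("MosaicHardQUAL")  # QUAL smaller than the QUAL threshold
    if af < af_threshold:
        filters.append("MosaicLowAF")  # AF smaller than the AF threshold
    return filters or ["PASS"]
```

With an AF threshold of 0.0, no call can have a smaller AF, so the MosaicLowAF filter never applies.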
--vc-enable-mosaic-detection
Set to true to enable mosaic detection with mosaic AF filter threshold set to 0.0
. Set to false to disable mosaic detection. The default is true with mosaic AF filter threshold set to 0.2
.
--vc-mosaic-af-filter-threshold
Set the allele frequency threshold for the application of the MosaicLowAF
filter to mosaic calls. All MOSAIC
tagged variants with AF
smaller than the AF
threshold are filtered with the MosaicLowAF
filter. The default mosaic AF
filter threshold is set to 0.2
when the germline variant caller is enabled. The AF default threshold is set to 0.0
when the mosaic detection mode is enabled with --vc-enable-mosaic-detection=true
.
--vc-mosaic-qual-filter-threshold
Set the QUAL
threshold for the application of the MosaicHardQUAL
filter to mosaic calls. All MOSAIC
tagged variants with QUAL
smaller than the threshold QUAL
are filtered with the MosaicHardQUAL
filter. The default mosaic QUAL
filter threshold is set to 3.0
.
--vc-mosaic-target-bed
Optional target BED file to restrict the output of MOSAIC
tagged variant calls only in the specified regions.
Small variant calling features comparison between default germline small variant caller and mosaic detection mode in DRAGEN 4.2 and DRAGEN 4.3
DRAGEN Multi-region Joint Detection (MRJD) is a de novo germline small variant caller for paralogous regions. In DRAGEN v4.3, MRJD covers regions that include six clinically relevant genes: NEB, TTN, SMN1/2, PMS2, STRC, and IKBKG. MRJD is compatible with the hg38, hg19, and GRCh37 reference genomes. The table below includes hg38 region coordinates covered by MRJD.
MRJD is a variant calling method that is designed to detect de novo germline small variants in paralogous regions of the genome. A conventional variant caller relies on the read aligner to determine which reads likely originated from a given location. This method works well when the region of interest does not resemble any other region of the genome over the span of a single read (or a pair of reads for paired-end sequencing). However, a significant fraction of the human genome does not meet this criterion. At least 5% of the human genome consists of segmental duplications. Many regions of the genome have near-identical copies elsewhere, and as a result, the true source location of a read might be subject to considerable uncertainty. If a group of reads is mapped with low confidence, a conventional variant caller might ignore the reads, even though they contain useful information. If a read is mismapped (i.e., the primary alignment is not the true source of the read), it can result in variant detection errors.
MRJD is designed to tackle the complexities raised by segmental duplication regions. Instead of considering each region in isolation, MRJD considers all locations from which a group of reads may have originated and attempts to detect the underlying sequences jointly across all paralogous regions in the sample of interest.
Below is a diagram showing the general workflow of MRJD in PMS2 and PMS2CL regions. MRJD takes primary alignments in all paralogous regions, regardless of mapping quality, builds and places haplotypes based on reads and prior knowledge, and computes joint genotypes to call small variants.
Figure 1. MRJD Caller workflow.
As shown in the diagram above, there are two modes of the DRAGEN MRJD Caller, default mode and high sensitivity mode. Here are details on the differences between the two modes.
With --enable-mrjd=true
, the MRJD Caller will report the following two types of variants:
Uniquely placed variants, which means the variant is found and placed in one of the paralogous regions without ambiguity. See variants labeled with “type 1” in Figure 2.
Region-ambiguous variants. In this case, the aggregated genotype contains a variant allele with high confidence, but MRJD Caller is unable to place the variant allele in one of the paralogous regions with high confidence. The MRJD Caller will report the variant allele in all paralogous regions. See variants labeled with “type 2” in Figure 2.
With both --enable-mrjd=true
and --mrjd-enable-high-sensitivity-mode=true
, the MRJD Caller reports the same variants as from the default mode, plus two other types of variants.
Positions where the reference alleles in all paralogous regions are not the same. It is well established that gene conversion, including reciprocal crossover, is a common event between paralogous regions (such as PMS2 and PMS2CL). When a reciprocal crossover event occurs, the prior model, without nearby information on phasing, might end up placing the converted haplotype in the source region instead of the destination region, resulting in no variant. The high sensitivity mode compensates for this event by reporting the variant in corresponding positions in all paralogous regions. See variants labeled with “type 3” in Figure 2.
Variants that have been placed uniquely in one of the paralogous regions, with no variant in the corresponding position in the other region. The high sensitivity mode reports the variant in the rest of the paralogous regions. This compensates for cases where the prior knowledge used to help place the variant is not sufficient or is estimated incorrectly. In those cases, the variant allele still exists but is placed in the wrong paralogous region. Reporting the variant in the other paralogous regions therefore helps maximize sensitivity even when the prior is lacking. See variants labeled with “type 4” in Figure 2.
Figure 2. Different variant types reported by MRJD Caller default mode and high sensitivity mode.
The MRJD Caller is disabled by default and requires WGS data aligned to a human reference genome (hg38, hg19, or GRCh37).
Here is the list of options related to MRJD.
--enable-mrjd
If set to true, MRJD is enabled for the DRAGEN pipeline. Note that MRJD cannot run together with SNV caller in the current version of DRAGEN (default = ‘false’).
--mrjd-enable-high-sensitivity-mode
If set to true, MRJD high sensitivity mode is enabled for the DRAGEN pipeline. See previous section on what variant types are reported in MRJD default mode and high sensitivity mode (default = ‘false’).
The following command-line example uses FASTQ input and runs MRJD Caller with high sensitivity mode:
The following command-line example uses BAM input that has already been aligned and runs MRJD Caller with high sensitivity mode:
Here are the example command lines to first run DNA Mapping and Small Variant Calling workflow using FASTQ files as input, and then run MRJD using BAM file generated by the DNA Mapping workflow as input.
The MRJD Caller generates a .mrjd.hard-filtered.vcf.gz file in the output directory. The output file is a compressed VCFv4.2 formatted file that contains the VCF representation of the small variants from the identified genotype.
The following is an example of the output format for a uniquely placed variant. The DRAGENHardQual filter is applied to a record if the variant has QUAL < 3.00.
Figure 3. VCF output format example for uniquely placed call.
For variants that are not uniquely placed (type 2-4 variants in Figure 2), the MRJD Caller also reports variants in diploid genotype format, which can be interpreted the same way as for uniquely placed variants (the genotype is region-specific instead of an aggregate across all regions). In this format, QUAL is the phred-scaled quality score for the assertion made in ALT (i.e., −10 log10 prob(GT==0/0)). Note that the QUAL score will be equal to or less than 3 (if QUAL > 3, then the call should be uniquely placed).
The QUAL, GT, GQ and PL will be reported similar to the DRAGEN germline VC. To avoid losing information about the aggregated genotype across paralogous regions, the MRJD Caller reports genotype, phred-scaled quality score, and the phred-scaled genotype likelihoods for aggregated genotype using JGT, JQL, and JPL in the FORMAT column.
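To make the phred scale concrete, the sketch below evaluates the QUAL formula given above, −10 log10 prob(GT==0/0), for a given probability that the genotype is homozygous reference. The function name phred_qual is hypothetical.

```python
import math


def phred_qual(prob_hom_ref):
    """Phred-scaled QUAL: -10 * log10(prob(GT == 0/0))."""
    if not 0.0 < prob_hom_ref <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10.0 * math.log10(prob_hom_ref)
```

On this scale, QUAL = 3 corresponds to prob(GT==0/0) of roughly 0.5, so the non-uniquely-placed records (QUAL of 3 or less) are those where the probability of a homozygous reference genotype in that specific region is about 0.5 or higher.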
Figure 4. VCF output format example for non-uniquely-placed call.
The VCF imputation tool can infer multi-allelic SNP and INDEL variants from low-coverage sequencing samples by packaging the GLIMPSE software (2020, Olivier Delaneau & Simone Rubinacci). The DRAGEN implementation of the GLIMPSE software allows for scalability of variant imputation:
with an end-to-end pipeline where the 3 phases of the GLIMPSE software (Chunk, Phase and Ligate) get executed in a single command, on one chromosome or on multiple chromosomes
with acceleration supported with Advanced Vector Extensions (AVX)
The DRAGEN VCF imputation tool infers variants on autosomes and chromosome X of haploid and diploid species.
For data other than human data (reference build hg38), users must provide their own reference panel and genetic map. A custom reference panel can be built with the DRAGEN Population Haplotyping tool.
Notes:
The output is in biallelic format, one line per ALT.
The VCF imputation tool only supports input sample data generated with the DRAGEN secondary analysis software.
The following is an example of commands to impute a VCF on a single chromosome:
The following is an example of commands to impute a VCF on chromosome X:
The imputation tool infers multi-allelic SNP and INDEL variants from low-coverage sequencing samples that are provided by the user. To maximize the accuracy of the imputed variant per sample, the tool leverages the information from all provided samples.
The sample(s) to be imputed must have the following format:
VCF, multi-sample VCF, BCF or multi-sample BCF (zip or unzipped). gVCF is not supported
Must contain GL (Genotype Likelihoods) or PL (phred-scaled genotype likelihoods) information
To impute INDELs and get the best accuracy on INDELs, it is recommended:
and to set the command --imputation-phase-impute-reference-only-variants
to true.
Notes: IRPv1.x does not support chrX. IRPv2.x supports chrX. chrY and chrM are not supported.
A custom reference panel can be built with the DRAGEN Population Haplotyping tool. When providing a custom reference panel, ensure that each mixed-ploidy chromosome is divided into its PAR and non-PAR regions, and that the basename matches the subregion names defined in the JSON config file. The format should be <PREFIX>.basename
. Examples: IRPv2.0.chrX_par1, IRPv2.0.chrX_par2, and IRPv2.0.chrX_nonpar.
<chromosome name>.gmap.gz
3 columns: position, chromosome number, distance (cM)
compliant with the reference genome used to generate the sample input
In the IRP reference panel folder available on DRAGEN support page, the JSON config file corresponds to human data. The user can edit this file if imputation is done on another species.
For imputing VCF on human data with typename “M” for Male and “F” for Female (“M” and “F” are the values used in the sample type file):
The JSON config file is made of two fields as defined in the table below
Note: ensure the subregion names match the genetic map name. Example: if "chrX_nonpar" is defined in the "region" field of the JSON config file, then the genetic map corresponding to chromosome X non PAR region in the Reference Panel folder must be named "chrX_nonpar".gmap.gz.
The sample type file is required when haplotyping is performed on non-PAR regions of mixed ploidy chromosomes to define the typename of each sample.
The sample type file is a txt file with the following format
2 columns, tabs or space delimited
First column: list of all sample names present in the input sample
Second column: typename value for each sample. This typename value should match the typenames used in the JSON config file.
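A sample type file in this format can be validated before a run with a few lines of Python. This is a minimal sketch; the function name parse_sample_types is hypothetical and the sample names in the usage note are made up.

```python
def parse_sample_types(lines):
    """Parse 'sample<whitespace>typename' lines into a dict."""
    sample_types = {}
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()  # splits on tabs or spaces
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, found {len(fields)}"
            )
        sample, typename = fields
        sample_types[sample] = typename
    return sample_types
```

For example, the lines "sampleA F" and "sampleB M" (tab or space separated) produce {"sampleA": "F", "sampleB": "M"}. The typenames can then be compared against those used in the JSON config file.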
The VCF imputation tool generates several outputs:
The imputed variant file with concatenated imputed variants: one single VCF or msVCF file for all specified regions/chromosomes with name <prefix>.impute.vcf.gz
The intermediate files:
chunk regions to be passed along to the internal Phase step with name <prefix>.impute.chunk.out.txt
imputed variants per chunks identified: VCF or msVCF depending on the input sample format with name <prefix>_chr_start-end.impute.phase.vcf.gz
text file with path to all the <prefix>_chr_start-end.impute.phase.vcf.gz
generated with name <prefix>.impute.phase.out.txt
Note: while the imputation tool can impute multi-allelic positions, the output is in biallelic format, one line per ALT. The bcftools tool can be used to post-collapse all ALT in one line with the command: bcftools norm -m +snps
Note: with this end-to-end implementation of the GLIMPSE software, the parameters window_size and buffer_size are respectively set to 2 Mb and 200 kb.
The DRAGEN Copy Number Variant (CNV) Pipeline can call CNV events using next-generation sequencing (NGS) data. This pipeline supports multiple applications in a single interface via the DRAGEN Host Software, including processing of whole-genome sequencing (WGS) data and whole-exome sequencing (WES) data for germline analysis.
The DRAGEN CNV pipeline supports two normalization modes of operation. The two modes apply different normalization techniques to handle biases that differ based on the application, for example, WGS versus WES. While the default option settings attempt to provide the best trade-off in terms of speed and accuracy, a specific workflow may require more finely tuned option settings.
The DRAGEN CNV pipeline follows the workflow shown in the following figure.
DRAGEN CNV Pipeline Workflow
The DRAGEN CNV Pipeline uses many aspects of the DRAGEN secondary analysis available in other pipelines, such as hardware acceleration and efficient I/O processing. To enable CNV processing in the DRAGEN Host Software, set the --enable-cnv
command line option to true.
The CNV pipeline has the following processing modules:
Target Counts --- Binning of the read counts and other signals from alignments.
Bias Correction --- Correction of intrinsic system biases.
Normalization --- Detection of normal ploidy levels and normalization of the case sample.
Segmentation --- Breakpoint detection via segmentation of the normalized signal.
Calling / Genotyping --- Thresholding, scoring, qualifying, and filtering of putative events as copy number variants.
The normalization module can optionally take in a panel of normals (PoN), which is used when a cohort or population samples are readily available. Note that PoN normalization is not available for somatic WGS analysis. All other modules are shared between the different CNV algorithms.
The following figures show a high-level overview of the steps in the DRAGEN CNV Pipeline as the signal traverses through the various stages. These figures are examples and are not identical to the plots that are generated from the DRAGEN CNV Pipeline.
The first step in the DRAGEN CNV Pipeline is the target counts stage. The target counts stage extracts signals such as read count and improper pairs and puts them into target intervals.
Read Count Signal
Improper Pairs Signal
Next, the case sample is normalized against the panel of normals or against the estimated normal ploidy level. Any other biases are subtracted out of the signal to amplify any event level signals.
Normalization
The normalized signal is then segmented using one of the available segmentation algorithms. Events are then called from the segments.
Segments
Called Events
The events are then scored and emitted in the output VCF.
The following are the top-level options that are shared with the DRAGEN Host Software to control the CNV pipeline. You can input a BAM or CRAM file into the CNV pipeline. If you are using the DRAGEN mapper and aligner, you can use FASTQ files.
--bam-input
--- The BAM file to be processed.
--cram-input
--- The CRAM file to be processed.
--enable-cnv
--- Enable or disable CNV processing. Set to true to enable CNV processing.
--enable-map-align
--- Enables the mapper and aligner module. The default is true, so all input reads are remapped and aligned unless this option is set to false.
--fastq-file1
, --fastq-file2
--- FASTQ file or files to be processed.
--output-directory
--- Output directory where all results are stored.
--output-file-prefix
--- Output file prefix that will be prepended to all result file names.
--ref-dir
--- The DRAGEN reference genome hashtable directory.
The output and filtering options control the CNV output files.
--cnv-exclude-bed
--- Specifies a BED file that indicates the intervals to exclude from the CNV analysis. If a target interval overlaps regions specified from exclude BED file more than cnv-exclude-bed-min-overlap
, the target interval is suppressed.
--cnv-exclude-bed-min-overlap
--- Specifies a fraction for filtering threshold of overlap amount between a target interval and the excluded region (0.5).
--cnv-enable-ref-calls
--- Emit copy neutral (REF) calls in the output VCF file. The default is true for single WGS CNV analysis.
--cnv-enable-tracks
--- Generate track files that can be imported into IGV for viewing. When this option is enabled, a *.gff
file for the output variant calls is generated, as well as *.bw
files for the tangent normalized signal. The default is true.
--cnv-filter-bin-support-ratio
--- Filters out a candidate event if the span of supporting bins is less than the specified ratio with respect to the overall event length. This filter only applies to records with length greater than cnv-filter-bin-support-ratio-min-len
. The default ratio is 0.2 (20% support). As an example, if an event is called and has a length of 100,000 bp, but the target interval bins that support the call span a total of only 15,000 bp (15,000/100,000 = 0.15), then the event is filtered out. If applied, the record will have cnvBinSupportRatio
as a filter.
--cnv-filter-bin-support-ratio-min-len
--- Minimum length of candidate event at which to apply cnv-filter-bin-support-ratio
. Currently only applied to germline WGS workflows, with default value of 80,000 bp.
--cnv-filter-copy-ratio
--- Specifies the minimum copy ratio (CR) threshold value centered about 1.0 at which a reported event is marked as PASS in the output VCF file. The default value is 0.2, which leads to calls with CR between 0.8 and 1.2 being filtered. If applied, the record will have cnvCopyRatio
as a filter.
--cnv-filter-length
--- Specifies the minimum event length in bases at which a reported event is marked as PASS in the output VCF file. The default is 10000. If applied, the record will have cnvLength
as a filter.
--cnv-filter-qual
--- Specifies the QUAL value at which a reported event is marked as PASS in the output VCF file. You should adjust the parameter value according to your own application data. If applied, the record will have cnvQual
as a filter.
--cnv-min-qual
--- Specifies the minimum reported QUAL. The default is 3.
--cnv-max-qual
--- Specifies the maximum reported QUAL. The default is 200.
--cnv-qual-length-scale
--- Specifies the bias weighting factor to adjust QUAL estimates for segments with longer lengths. This is an advanced option and should not need to be modified. The default is 0.9303 (2^-0.1).
--cnv-qual-noise-scale
--- Specifies the bias weighting factor to adjust QUAL estimates based on sample variance. This is an advanced option and should not need to be modified. The default is 1.0.
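The length, copy ratio, and bin support filters described above can be combined in a short sketch. This is a simplified reading of the option descriptions, not DRAGEN's implementation: the function name is hypothetical, the handling of values exactly at a threshold is an assumption, and the QUAL filter is omitted because no default is given for --cnv-filter-qual.

```python
def cnv_filters(
    length,
    copy_ratio,
    supporting_bp,
    min_length=10_000,        # --cnv-filter-length default
    copy_ratio_band=0.2,      # --cnv-filter-copy-ratio default
    min_support_ratio=0.2,    # --cnv-filter-bin-support-ratio default
    support_min_len=80_000,   # --cnv-filter-bin-support-ratio-min-len default
):
    """Return the filter names that would apply to a candidate event."""
    filters = []
    if length < min_length:
        filters.append("cnvLength")
    # Copy ratios within the band around 1.0 (0.8 to 1.2 by default) are filtered.
    if abs(copy_ratio - 1.0) < copy_ratio_band:
        filters.append("cnvCopyRatio")
    # The bin support filter only applies to events longer than support_min_len.
    if length > support_min_len and supporting_bp / length < min_support_ratio:
        filters.append("cnvBinSupportRatio")
    return filters or ["PASS"]
```

Using the example above, a 100,000 bp event with 15,000 bp of supporting bins has a support ratio of 0.15, which is below 0.2, so it receives cnvBinSupportRatio.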
For the DRAGEN CNV pipeline, the hashtable must be generated with the --enable-cnv option
set to true, in addition to any other options required by other pipelines. When --enable-cnv
is true, DRAGEN generates an additional k-mer uniqueness map that the CNV algorithm uses to counteract mappability biases. You only need to generate the k-mer uniqueness map file one time per reference hashtable. The generation takes about 1.5 hours per whole human genome.
The following example command generates a hashtable.
The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the DRAGEN CNV Pipeline.
You need to generate alignment files for all samples that have not already been mapped and aligned, including any samples to be used as references for normalization. Each sample must have a unique sample identifier. Use the --RGSM
option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM
option is not required.
The following example command maps and aligns a FASTQ file:
The following example command maps and aligns an existing BAM file:
The following example command maps and aligns an existing CRAM file:
DRAGEN can map and align FASTQ samples, and then directly stream them to downstream callers, such as the CNV Caller and the Haplotype Variant Caller. You can use this process to skip generation of a BAM or CRAM file, which bypasses the need to store additional files.
To stream alignments directly to the DRAGEN CNV pipeline, run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable CNV. The following example command line maps and aligns a FASTQ file, and then sends the file to the Germline CNV WGS pipeline.
The target counts stage is the first processing stage for the DRAGEN CNV pipeline. This stage bins the alignments into intervals. The primary analysis format for CNV processing is the target counts file, which contains the feature signals that are extracted from the alignments to be used in downstream processing. The binning strategy, interval sizes, and their boundaries are controlled by the target counts generation options, and the normalization technique used.
When working with whole genome sequence data, the intervals are autogenerated from the reference hashtable. Only the primary contigs from the reference hashtable are considered for binning. You can specify additional contigs to bypass with the --cnv-skip-contig-list
option.
With whole exome sequence data, DRAGEN uses the target BED file supplied with the --cnv-target-bed
option to determine the intervals for analysis.
The target counts stage generates a .target.counts.gz
file. You can use the file later in place of any BAM or CRAM by specifying the file with the --cnv-input
option for the normalization stage. The .target.counts.gz
file is an intermediate file for the DRAGEN CNV pipeline and should not be modified.
If the samples are whole genome, then the effective target intervals width is specified with the --cnv-interval-width
option. The higher the coverage of a sample, the higher the resolution that can be detected. This option is important when running with a panel of normals because all samples must have matching intervals. For self-normalization, the actual width of a given target interval might be larger than the specified value.
The default value for WGS is 1000 bp with a sample coverage of ≥ 30x.
Using a cnv-interval-width
of less than 250 bp for WGS analysis can drastically increase runtime.
The intervals are autogenerated for every primary contig in the reference. Only references that have the UCSC or GRC convention are supported. For example, chr1, chr2, chr3, ..., chrX, chrY
or 1, 2, 3, ..., X, Y
. You can specify a list of contigs to skip by using the --cnv-skip-contig-list
option. This option takes a comma-separated list of contig identifiers. The contig identifiers must match the reference hashtable that you are using. By default, only the mitochondrial chromosomes are skipped. Non-primary contigs are never processed.
For example, to skip chromosome M, X, and Y, use the following option:
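As a minimal sketch, the option value can be assembled from a list of contig names and checked against the contigs of the reference in use. The helper name `skip_contig_option` is hypothetical, and the contig names assume a UCSC-style reference.

```python
def skip_contig_option(contigs_to_skip, reference_contigs):
    """Build the --cnv-skip-contig-list argument as a comma-separated list.

    Hypothetical helper. Raises if a contig is not present in the reference,
    because the identifiers must match the reference hashtable.
    """
    unknown = [c for c in contigs_to_skip if c not in reference_contigs]
    if unknown:
        raise ValueError(f"contigs not in reference: {unknown}")
    return ["--cnv-skip-contig-list", ",".join(contigs_to_skip)]


# Assumed UCSC-style contig names for illustration.
args = skip_contig_option(["chrM", "chrX", "chrY"],
                          {"chr1", "chr2", "chrM", "chrX", "chrY"})
```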
If the samples are whole exome samples, supply a target BED file with the --cnv-target-bed $TARGET_BED
option. The intervals in the target BED file indicate regions where alignments are expected based on the target capture kit. The BED file intervals are further split into intervals of smaller size, depending on the value of cnv-interval-width
.
To use a standard BED file, make sure that there is no header present in the file. In this case, all columns after the third column are ignored, similar to the operation of DRAGEN Variant Caller.
The following options control the generation of target counts.
--cnv-counts-method
--- Specifies the counting method for an alignment to be counted in a target bin. Values are midpoint, start, or overlap. The default value is overlap when using the panel of normals approach, which means if an alignment overlaps any part of the target bin, the alignment is counted for that bin. In the self-normalization mode, the default counting method is start.
--cnv-min-mapq
--- Specifies the minimum MAPQ for an alignment to be counted during target counts generation. The default value is 3 for self-normalization and 20 otherwise. When generating counts for panel of normals, all MAPQ0 alignments are counted.
--cnv-target-bed
--- Specifies a properly formatted BED file that indicates the target intervals to sample coverage over. For use in WES analysis.
--cnv-interval-width
--- Specifies the width of the sampling interval for CNV processing. This option controls the effective window size. The default is 1000 for WGS analysis and 500 for WES analysis.
--cnv-skip-contig-list
--- Specifies a comma-separated list of contig identifiers to skip when generating intervals for WGS analysis. The default contigs that are skipped, if not specified, are chrM,MT,m,chrm
.
--cnv-filter-duplicate-alignments
--- Filters duplicate-marked alignments during target counts generation if the option is set to true
. The default setting is false
.
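The three values of --cnv-counts-method can be illustrated with a short sketch. The function name `bins_for_alignment` and the half-open coordinate convention are assumptions made for the example, not details taken from DRAGEN.

```python
def bins_for_alignment(aln_start, aln_end, bins, method="overlap"):
    """Return the indices of the target bins in which an alignment is counted.

    bins is a list of (start, end) half-open intervals on one contig.
    Hypothetical helper illustrating the midpoint, start, and overlap methods.
    """
    if method == "overlap":
        # Counted in every bin that the alignment overlaps by at least one base.
        return [i for i, (s, e) in enumerate(bins)
                if aln_start < e and aln_end > s]
    if method == "start":
        point = aln_start
    elif method == "midpoint":
        point = (aln_start + aln_end) // 2
    else:
        raise ValueError(f"unknown counting method: {method}")
    # Counted only in the bin that contains the chosen point.
    return [i for i, (s, e) in enumerate(bins) if s <= point < e]
```

With overlap, an alignment that straddles a bin boundary contributes to both bins. With start or midpoint, it contributes to one bin only.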
Target counts options are recorded in the header of each counts file to facilitate review and validation of a panel of normals. If the PON counts were generated with different count options than the case sample, DRAGEN returns an option validation error.
PCR duplicates are often considered as noise in coverage depth information. DRAGEN CNV has an option to include/exclude duplicate marked alignments: --cnv-filter-duplicate-alignments
when counting alignments. This relies on the alignments having the duplicate-marked bit (0x400) in the SAM flag set correctly.
If --enable-map-align=false
, then duplicate marking should be present in the input file (pre-aligned BAM/CRAM). If --enable-map-align=true
, then --enable-duplicate-marking=true
should be set.
Note that CNV will wait for duplicate marking from the Map/Aligner which may increase overall run time.
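The duplicate check relies on a single bit of the SAM flag. The following sketch shows the bit test. The function name `keep_alignment` is hypothetical and stands in for whatever counting logic is applied.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit for PCR or optical duplicates


def keep_alignment(sam_flag, filter_duplicates=False):
    """Return False when duplicate filtering is on and the duplicate bit is set.

    Hypothetical helper; filter_duplicates mirrors --cnv-filter-duplicate-alignments.
    """
    if filter_duplicates and (sam_flag & DUPLICATE_FLAG):
        return False
    return True
```

If the duplicate bit was never set upstream, the filter has no effect, which is why duplicate marking must be present in the input or enabled in the map/align step.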
In the WGS case where a BED file is not specified for a given reference, the same intervals should be generated each time. The intervals created take into account the mappability of the reference genome using a k-mer uniqueness map created during hashtable generation.
Due to ambiguity that may arise from non-unique genomic loci, only regions corresponding to unique k-mers are considered. A position in the reference genome is marked as a unique k-mer if the k-mer starting at that position does not show up anywhere else in the reference genome (or non-unique, otherwise). Furthermore, if the k-mer contains any bases other than A, C, T or G, it is marked as non-unique.
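The uniqueness rule can be sketched on a toy sequence. This is a simplified illustration: the function name is hypothetical, and only the forward strand is considered, which is an assumption rather than a documented detail of the DRAGEN uniqueness map.

```python
from collections import Counter


def unique_kmer_positions(sequence, k):
    """Return one boolean per k-mer start position: True if the k-mer is unique.

    A k-mer is unique if it occurs exactly once in the sequence and contains
    only A, C, G, or T. Forward strand only (assumption).
    """
    seq = sequence.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    valid_bases = set("ACGT")
    return [counts[kmer] == 1 and set(kmer) <= valid_bases for kmer in kmers]
```

In the sequence ACGTACGA with k = 3, the k-mer ACG starts at two positions, so both positions are marked non-unique.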
For WGS samples and in absence of a cnv-target-bed
file, the target intervals are auto generated based on the pre-computed k-mer-uniqueness map for a given input reference hashtable, and the cnv-interval-width
option, which defaults to 1000bp. The cnv-interval-width
option determines the minimum number of unique k-mer positions required in the interval. There is an upper bound to the length of the interval: when the length of the interval is greater than double the size of cnv-interval-width
, without reaching the required count of unique k-mer positions, the interval is discarded and the process starts again at the next genomic position. Discarded regions are called "dropout" regions and are recorded with the exclusion reason NON_KMER_UNIQUE
in the *.cnv.excluded_intervals.bed.gz
file.
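The interval autogeneration described above can be sketched as follows. This is a simplified reading of the text, not the DRAGEN implementation: the function name is hypothetical, the input is a 0/1 list marking unique k-mer positions on one contig, and restarting immediately after a discarded span (and treating a short tail at the end of the contig as a dropout) are assumptions.

```python
def generate_intervals(unique, width):
    """Split one contig into target intervals and dropout regions.

    unique: list of 0/1 flags, 1 where a unique k-mer starts.
    width:  required number of unique positions per interval
            (the role played by cnv-interval-width).
    Returns (intervals, dropouts) as lists of half-open (start, end) pairs.
    """
    intervals, dropouts = [], []
    n = len(unique)
    start = 0
    while start < n:
        count = 0
        end = start
        # Extend until enough unique positions are collected, or until the
        # interval would exceed double the requested width.
        while end < n and count < width and (end - start) < 2 * width:
            count += unique[end]
            end += 1
        if count >= width:
            intervals.append((start, end))
        else:
            dropouts.append((start, end))  # would be reported as NON_KMER_UNIQUE
        start = end
    return intervals, dropouts
```

Note that an interval crossing a partly non-unique stretch ends up longer than `width`, which matches the statement that the actual width of a target interval can be larger than the specified value.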
A dropout region is a complex region that does not count alignments and results in an interval missing from the analysis. Dropout regions include centromeres, telomeres, and low complexity regions. If there is sufficient signal in the flanking regions, an event can still span these dropout regions, even if alignment counting does not occur in the regions. The event is handled by the segmentation stage.
GC bias is the relationship between GC content and read coverage across a genome. Biases can arise from library prep, capture kits, sequencing system differences, and mapping, and they can make CNV events difficult to call. The DRAGEN GC bias correction module attempts to correct these biases.
Typical whole-exome capture kits have over 200,000 targets spanning the regions of interest. If your BED file has fewer than 200,000 targets, or if the target regions are localized to a specific region in the genome (such that GC bias statistics might be skewed), then GC bias correction should be disabled.
The following options control the GC bias correction module.
--cnv-enable-gcbias-correction
--- Enable or disable GC bias correction when generating target counts. The default is true.
--cnv-enable-gcbias-smoothing
--- Enable or disable smoothing the GC bias correction across adjacent GC bins with an exponential kernel. The default is true.
--cnv-num-gc-bins
--- Specifies the number of bins for GC bias correction. Each bin represents the GC content percentage. Allowed values are 10, 20, 25, 50, or 100. The default is 25.
The DRAGEN CNV pipeline supports two normalization algorithms:
Self-Normalization --- Estimates the autosomal diploid level from the sample under analysis to determine the baseline level to normalize by. Sex chromosomes and PAR regions are handled based on the sample sex.
Panel of Normals --- A reference-based normalization algorithm that uses additional matched normal samples to determine a baseline level from which to call CNV events. Here, matched normal samples are samples that have undergone the same library prep and sequencing workflow as the case sample.
Which algorithm to use depends on the available data and the application. Use the following guidelines to select the mode of normalization.
Self-normalization:
Whole genome sequencing
Single sample analysis
Additional matched samples are not readily available
Simpler workflow via a single invocation
Only references with chr1, chr2, chr3, ..., chrX, chrY
or 1, 2, 3, ..., X, Y
naming conventions are supported.
Panel of normals:
Whole genome sequencing (non-somatic)
Whole exome sequencing
Targeted panels, including somatic panels
Additional matched samples are available
Nonhuman samples
The DRAGEN CNV pipeline provides the self-normalization mode that does not require a reference sample or a panel of normals. To enable this mode, set --cnv-enable-self-normalization
to true. Self-normalization mode bypasses the need to run two stages and can save time. It uses the statistics within the case sample to determine the baseline from which to make a call.
Because self normalization uses the statistics within the case sample, this mode is not recommended for WES or targeted sequencing analysis due to the potential for insufficient data.
The self-normalization mode is the recommended approach for whole-genome sequencing single sample processing. The pipeline continues through to the segmentation and calling stage to produce the final called events.
If you are running from a FASTQ sample, then the default mode of operation is self-normalization.
When operating in self-normalization mode, the --cnv-interval-width
option used during the target counts stage becomes the effective interval width based on the number of unique k-mer positions. You typically do not have to modify this option.
Self-normalization autogenerates the target intervals to use during the analysis based on the reference genome and is only compatible with standard human references.
If the user wishes to attempt self-normalization mode on non-standard human references, an override can be set via --cnv-bypass-contig-check=true
. Under this setting, the CNV caller performs a naive median normalization across all of the contigs within the reference genome. This feature is experimental and for research use only. No claims are made and no validation has been performed for its use.
The panel of normals mode uses a set of matched normal samples to determine the baseline level from which to call CNV events. These matched normal samples should be derived from the same library prep and sequencing workflow that was used for the case sample. This allows the algorithm to subtract system level biases that are not sample specific. The generation of the target counts for these normal samples should also have identical command line options with the case sample under analysis.
In this mode, the DRAGEN CNV Pipeline is broken down into two distinct stages. The target counts stage is performed on each sample, case, and normals, to bin the alignments. The normalization and call detection stage is then performed with the case sample against the panel of normals to determine the events.
Target counts should be generated for all samples, whether the samples are to be used as references or are the case samples under analysis. The case samples and all samples to be used as a panel of normals sample must have identical intervals and therefore should be generated with identical settings. The target counts stage also performs GC Bias correction. GC Bias correction is enabled by default.
The following examples are for WES processing, which is the case where a panel of normals is required.
The following is an example command for processing a BAM file.
The following is an example command for processing a CRAM file.
When running an analysis with a panel of normals (set of target counts), then a column wise concatenated version of the panel is output as a *.combined.counts.txt.gz file. If the user wishes to generate this file without running the actual calling step, then this can be done by adding the --cnv-generate-combined-counts=true
option to the command line. The individual panel of normals target counts file must be specified either via --cnv-normals-file
(one per file) or --cnv-normals-list
(single text file with paths to each sample).
The following is an example command line using a normals list:
The next step in the CNV pipeline when using a panel of normals is to perform the normalization and to make the calls. This involves a separate execution of DRAGEN during which the normalization is performed and calls are made. This step requires the specification of a set of target counts files to be used for reference-based median normalization.
Ideally the panel of normals samples follow library prep and sequencing workflows that are identical to the workflows of the case sample under analysis. In order to be applicable to both male and female case samples, the panel of normals should include a balanced set of both male and female samples. DRAGEN automatically handles calling on sex chromosomes based on the predicted sex of each sample in the panel.
For optimal bias correction, a minimum of 50 samples is recommended as a panel. DRAGEN can run with smaller numbers of samples in the panel, down to even just a single sample, but smaller panels can result in artifactual calls in the test sample where at least some of the panel samples have copy number changes. Larger panels do not entirely prevent such issues, but they limit it to regions where non-reference copy numbers are common.
The following is an example of PON files, which uses a subset of the GC corrected files from the target counts stage.
DRAGEN accepts 3 different file formats for a Panel of Normals (PON).
The CNV caller can also be started from the *.target.counts.gz
(raw counts) or *.target.counts.gc-corrected.gz
(GC corrected counts) files of the case sample, by specifying the selected file with the --cnv-input
option and the PON options described above. When selecting GC corrected counts the option --cnv-enable-gcbias-correction
should be set to false to disable the GC-correction stage.
For example, the following command normalizes the case sample against the panel of normals.
These options control the preconditioning of the panel of normals and the normalization of the case sample.
--cnv-enable-self-normalization
--- Enable/disable self normalization mode, which does not require a panel of normals.
--cnv-extreme-percentile
--- Specifies the extreme median percentile value at which to filter out samples. The default is 2.5.
--cnv-input
--- Specifies a target counts file for the case sample under analysis when using a panel of normals.
--cnv-normals-file
--- Specifies a target.counts.gz file to be used in the panel of normals. You can use this option multiple times, one time for each file.
--cnv-normals-list
--- Specifies a text file that contains paths to the list of reference target counts files to be used as a panel of normals. Absolute paths are recommended in case the workflow is used later or shared with other users. Relative paths are supported if the paths are relative to the current working directory.
--cnv-max-percent-zero-samples
--- Specifies the percentage of zero coverage samples allowed for a target. If the target exceeds the specified threshold, then the target is filtered out. The default value is 5%. The option is sensitive to the number of normal samples being used. Make sure you adjust the threshold accordingly. If your panel of normals is small and the threshold is not adjusted, the option could filter out targets that were not intended to be filtered.
--cnv-max-percent-zero-targets
--- Specifies the percentage of zero coverage targets allowed for a sample. If the sample exceeds the specified threshold, then the sample is filtered out. The default value is 2.5%. The option is sensitive to the total number of target intervals. Make sure you adjust the threshold accordingly. If the capture kit has a small number of probes and the threshold is not adjusted, the option could filter out samples that were not intended to be filtered.
--cnv-target-factor-threshold
--- Specifies the bottom percentile of panel of normals medians to filter out useable targets. The default is 1% for whole genome processing and 5% for targeted sequencing processing.
--cnv-truncate-threshold
--- Specifies a percentage threshold for truncating extreme outliers. The default is 0.1%.
--cnv-enable-gender-matched-pon
--- Enable/disable gender matched PON normalization. If enabled, DRAGEN uses matched gender PON for sex chromosome normalization. Sex chromosome intervals are filtered if PON has no matched gender sample. The default value is true.
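The sensitivity of the two zero-coverage thresholds to panel size can be seen in a small sketch. The function name is hypothetical, and evaluating both filters independently on the full counts matrix is an assumption. The order in which DRAGEN applies them is not described here.

```python
def filter_zero_coverage(counts, max_pct_zero_samples=5.0,
                         max_pct_zero_targets=2.5):
    """Return (kept_sample_indices, kept_target_indices).

    counts[s][t] is the read count of sample s in target t.
    Defaults mirror --cnv-max-percent-zero-samples and
    --cnv-max-percent-zero-targets.
    """
    n_samples = len(counts)
    n_targets = len(counts[0])
    kept_targets = []
    for t in range(n_targets):
        zero_samples = sum(1 for s in range(n_samples) if counts[s][t] == 0)
        if 100.0 * zero_samples / n_samples <= max_pct_zero_samples:
            kept_targets.append(t)
    kept_samples = []
    for s in range(n_samples):
        zero_targets = sum(1 for c in counts[s] if c == 0)
        if 100.0 * zero_targets / n_targets <= max_pct_zero_targets:
            kept_samples.append(s)
    return kept_samples, kept_targets
```

With a panel of four samples, a single zero-coverage sample already represents 25% of the panel, so any target with one zero is removed at the default 5% threshold. This is why the thresholds should be adjusted for small panels.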
After a case sample has been normalized, the sample goes through a segmentation stage. DRAGEN implements multiple segmentation algorithms, including the following algorithms:
Circular Binary Segmentation (CBS)
Shifting Level Models (SLM)
The SLM algorithm has three variants: SLM, Heterogeneous SLM (HSLM), and Adaptive SLM (ASLM). HSLM is for use in exome analysis and handles target capture kits that are not equally spaced. ASLM includes additional sample-specific estimation of technical variability of depth of coverage, as opposed to changes in copy number. The estimations are based on the median variance within fixed windows or a preliminary set of segments based on b-allele ratios. The ASLM algorithm mitigates over-segmentation due to noisy or wavy samples; this is the default mode for somatic WGS analysis.
By default, SLM is the segmentation algorithm for germline whole genome processing, ASLM is the algorithm for somatic whole genome processing, and HSLM is the algorithm for whole exome processing.
For the targeted sequencing workflows, you can also run with a --cnv-segmentation-bed
. The option pre-defines the segments to estimate copy numbers for and skips the segmentation step of the workflow. See Targeted Segmentation (Segment BED) for more information.
--cnv-segmentation-mode
--- Specifies the segmentation algorithm to perform. The following values are available.
bed
cbs
slm
--- The default for germline WGS analysis.
aslm
--- The default for somatic WGS analysis.
hslm
--- The default for targeted/WES analysis.
--cnv-merge-distance
--- Specifies the maximum number of base pairs between two segments that would allow them to be merged. The default value is 0 for germline WGS, which means the segments must be directly adjacent. For WES analysis, this parameter is disabled by default due to the spacing of targeted intervals.
--cnv-merge-threshold
--- Specifies the maximum segment mean difference at which two adjacent segments should be merged. The segment mean is represented as a linear copy ratio value. The default is 0.2 for WGS and 0.4 for WES. To disable merging, set the value to 0.
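The interaction of the two merge options can be sketched as a single pass over adjacent segments. This is an illustration under stated assumptions: the function name is hypothetical, segments are (start, end, mean copy ratio, number of bins) tuples on one contig, and the merged mean is taken as a bin-weighted average, which is not specified in the text.

```python
def merge_segments(segments, merge_distance=0, merge_threshold=0.2):
    """Merge adjacent segments that are close and have similar means.

    segments: sorted list of (start, end, mean, n_bins) on a single contig.
    A merge_threshold of 0 disables merging.
    """
    if not segments or merge_threshold == 0:
        return list(segments)
    merged = [segments[0]]
    for start, end, mean, n_bins in segments[1:]:
        p_start, p_end, p_mean, p_bins = merged[-1]
        close_enough = (start - p_end) <= merge_distance
        similar = abs(mean - p_mean) <= merge_threshold
        if close_enough and similar:
            total = p_bins + n_bins
            # Bin-weighted mean of the two segments (assumption).
            new_mean = (p_mean * p_bins + mean * n_bins) / total
            merged[-1] = (p_start, end, new_mean, total)
        else:
            merged.append((start, end, mean, n_bins))
    return merged
```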
Circular Binary Segmentation is implemented directly in DRAGEN and is based on "A faster circular binary segmentation algorithm for the analysis of array CGH data"¹ with enhancements to improve sensitivity for NGS data. The following options control Circular Binary Segmentation.
--cnv-cbs-alpha
--- Specifies the significance level for the test to accept change points. The default is 0.01.
--cnv-cbs-eta
--- Specifies the Type I error rate of the sequential boundary for early stopping when using the permutation method. The default is 0.05.
--cnv-cbs-kmax
--- Specifies maximum width of smaller segment for permutation. The default is 25.
--cnv-cbs-min-width
--- Specifies the minimum number of markers for a changed segment. The default is 2.
--cnv-cbs-nmin
--- Specifies the minimum length of data for maximum statistic approximation. The default is 200.
--cnv-cbs-nperm
--- Specifies the number of permutations used for p-value computation. The default is 10000.
--cnv-cbs-trim
--- Specifies the proportion of data to be trimmed for variance calculations. The default is 0.025.
¹Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657-663. doi:10.1093/bioinformatics/btl646
The Shifting Level Models (SLM) segmentation mode follows from the R implementation of SLMSuite: a suite of algorithms for segmenting genomic profiles².
--cnv-slm-eta
--- Baseline probability that the mean process changes its value. The default is 4e-5.
--cnv-slm-fw
--- Minimum number of data points for a CNV to be emitted. The default is 0, which means segments with one design probe could in effect be emitted.
--cnv-slm-omega
--- Scaling parameter that modulates relative weight between experimental or biological variance. The default is 0.3.
--cnv-slm-stepeta
--- Distance normalization parameter. The default is 10000. This option is only valid for HSLM.
Regardless of segmentation method, initial segments are split across large gaps where depth data is unavailable, such as across centromeres.
²Orlandini V, Provenzano A, Giglio S, Magi A. SLMSuite: a suite of algorithms for segmenting genomic profiles. BMC Bioinformatics. 2017;18(1). doi:10.1186/s12859-017-1734-5
In applications for targeted panels, you can limit the segmentation and calling performed on intervals by specifying a --cnv-segmentation-bed
. For example, the specified intervals might correspond to gene boundaries matched to the targeted assay. This segmentation mode is only supported with the panel of normals and requires an accompanying --cnv-target-bed
. Also specify the --cnv-segmentation-bed
during the panel of normals generation step, so that all interval boundaries during analysis are matched. For more information on panel of normals generation, see Panel of Normals.
The recommended format for the BED file includes four columns and a header. The four columns are contig
, start
, stop
, and name
. The name column represents the name of the gene and must be unique within the BED file. The name is used in the output VCF and annotated as a segment identifier in the INFO/SEGID
field. The following example file is in the recommended format:
If using a three-column BED file, then do not include a header or the name field values. Three-column BED files should only include the contig
, start
, and stop
values. In this case, the segment identifier is autogenerated from the coordinate fields.
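The two accepted layouts can be sketched with a small parser. The function name is hypothetical, tab-separated input is assumed, and the `contig:start-stop` identifier used for three-column files is a made-up format for illustration. The text does not state how DRAGEN builds the autogenerated identifier.

```python
def parse_segmentation_bed(lines):
    """Parse a segmentation BED given as an iterable of text lines.

    Accepts four columns with a header (contig, start, stop, name) or
    three columns without a header. Returns (contig, start, stop, name) tuples.
    """
    segments = []
    seen_names = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        contig, start, stop = fields[0], fields[1], fields[2]
        if not start.isdigit():
            continue  # treated as the header row (assumption)
        if len(fields) >= 4:
            name = fields[3]
        else:
            name = f"{contig}:{start}-{stop}"  # hypothetical identifier format
        if name in seen_names:
            raise ValueError(f"duplicate segment name: {name}")
        seen_names.add(name)
        segments.append((contig, int(start), int(stop), name))
    return segments
```

The duplicate check reflects the requirement that names be unique within the BED file, because each name becomes a segment identifier in the output VCF.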
Quality scores are computed using a probabilistic model that uses a mixture of heavy tailed probability distributions (one per integer copy number) with a weighting for event length. Noise variance is estimated. The output VCF contains a Phred-scaled metric that measures confidence in called amplification (CN > 2 for diploid locus), deletion (CN < 2 for diploid locus), or copy neutral (CN=2 for diploid locus) events.
The scoring algorithm also calculates exact copy-number quality scores that are inputs to the DeNovo CNV detection pipeline.
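Phred scaling converts an error probability p into a score Q = -10 log10(p). The sketch below applies that standard conversion and then bounds the result with the --cnv-min-qual and --cnv-max-qual defaults. The function name is hypothetical, and treating the two options as a simple clamp is an assumption. The underlying mixture model that produces p is not reproduced here.

```python
import math


def phred_qual(p_error, min_qual=3, max_qual=200):
    """Convert an error probability to a Phred-scaled QUAL, bounded to a range.

    Hypothetical helper; the bounds mirror the documented defaults of
    --cnv-min-qual (3) and --cnv-max-qual (200).
    """
    if p_error <= 0:
        return max_qual  # log10 is undefined at 0; report the maximum
    qual = -10.0 * math.log10(p_error)
    return max(min_qual, min(max_qual, qual))
```

An error probability of 0.001 corresponds to a QUAL of about 30, and each further factor of 10 in confidence adds 10 to the score.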
You can input an exclude BED to the CNV caller to filter out regions from analysis. Inputting an exclude bed is useful if there are certain regions in the genome that are known to be problematic due to library prep, sequencing, or mapping issues. You can also exclude intervals that specify common CNVs to aid in downstream analysis. You can specify an exclude BED file using cnv-exclude-bed
. DRAGEN does not provide an exclude BED. The intervals to exclude should be formatted in standard three-column BED format.
The intervals in the exclude BED are compared with the original target counts intervals. If the overlap is greater than cnv-exclude-bed-min-overlap, the target counts interval is excluded from analysis. The *.target.counts.gz
file still includes the interval, so you can inspect the original read counts. The excluded intervals are removed during the normalization stage. The *.tn.tsv.gz
file excludes the intervals removed during normalization.
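The overlap comparison can be sketched as follows. The function name is hypothetical, the overlap is measured as a fraction of the target interval length, and the exclude intervals are assumed not to overlap one another. The text does not state how DRAGEN measures the overlap or what the default of cnv-exclude-bed-min-overlap is, so the threshold is a required argument.

```python
def excluded_targets(targets, exclude, min_overlap):
    """Return the target intervals whose excluded fraction exceeds min_overlap.

    targets, exclude: lists of (contig, start, end) half-open intervals.
    min_overlap: fraction of the target length, between 0 and 1 (assumption).
    """
    result = []
    for contig, start, end in targets:
        covered = sum(
            max(0, min(end, x_end) - max(start, x_start))
            for x_contig, x_start, x_end in exclude
            if x_contig == contig
        )
        if covered / (end - start) > min_overlap:
            result.append((contig, start, end))
    return result
```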
DRAGEN can perform mapping and aligning of FASTQ samples, and then directly stream the data to downstream callers. If the input is a FASTQ sample, a single sample can run through both the CNV Caller and the small variant caller. This triggers self-normalization by default.
Run the FASTQ sample through a regular DRAGEN map/align workflow, and then provide additional arguments to enable the CNV, VC, or both. The options that apply to CNV in the standalone workflows are also applicable here.
The following examples show different commands.
When running the target counts stage or the normalization stage, the DRAGEN CNV pipeline also provides the following information about the samples in the run.
A correlation metric of the read count profile between the case sample and any panel of normals samples. A correlation metric greater than 0.90 is recommended for confident analysis, but there is no hard restriction enforced by the software.
The predicted sex of each sample in the run. The sex is predicted based on the read count information in the sex chromosomes and the autosomal chromosomes. The median value for the counts is printed to the screen for the autosomal chromosomes, the X chromosome, and the Y chromosome. This estimation requires a minimum of 300 target intervals on the sex chromosomes to proceed.
The results are printed to the screen when running the pipeline. For example:
The predicted sexes for samples in use are also printed to the *.cnv_metrics.csv output file. For a panel of normals, the predicted sexes are used to determine which panel samples are leveraged for normalization on sex chromosomes. If the estimated sex of the sample is UNDETERMINED, the sex of the sample is set to FEMALE.
You can override the predicted sex of the sample with the --sample-sex
option.
The germline CNV workflow can be extended to call copy number alterations in a curated subset of segmentally duplicated regions. Segmental duplications are large blocks of DNA ≥ 1kb, characterized by a high degree of sequence identity at nucleotide level (> 90%). This poses a challenge for traditional approaches, and such regions are usually excluded.
This extension complements the original germline CNV workflow by using a tailored algorithm to compute the normalized coverage in such regions, which is then injected before the segmentation step and becomes part of the main CNV workflow in downstream steps. We currently recommend WGS data aligned to a supported human reference genome (currently only hg38
) with at least 30x coverage. See below for additional requirements.
The following pairs of genes defining Segmental Duplications are included:
This extension is enabled by default in the germline CNV workflow. However, it requires:
Normalization set to self-normalization (--cnv-enable-self-normalization=true
).
GC bias correction enabled (--cnv-enable-gcbias-correction=true
).
Counts method set to start
(--cnv-counts-method=start
).
Interval width not greater than 10kb. However, we recommend using the cnv-interval-width
default (1kb) for best performance.
A supported reference genome build as input (currently supported builds are based on: hg38
).
If necessary, the extension can be disabled through setting --cnv-enable-segdups-extension
to false.
For each duplicated region, the extension collects all reads falling on top of the two homologous intervals of the pair, and it computes the normalized joint coverage (output to *.cnv.segdups.joint_coverage.tsv.gz
).
Through differentiating sites between the two homologous intervals, the extension computes the proportion of coverage to associate to the first and to the second interval (output to *.cnv.segdups.site_ratios.tsv.gz
).
Such proportion is used to redistribute the joint normalized coverage between the two homologous intervals.
The rescued intervals are output to the *.cnv.segdups.rescued_intervals.tsv.gz
file for inspection and they are automatically injected before the segmentation step.
During integration with the original intervals from the CNV caller, the rescued intervals are considered higher priority, thus replacing all original intervals that they overlap with.
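The redistribution step in the list above reduces to splitting the joint coverage by the estimated proportion. The sketch below shows that arithmetic only. The function name is hypothetical, and how the proportion is derived from the differentiating sites is not reproduced.

```python
def redistribute_joint_coverage(joint_coverage, first_interval_ratio):
    """Split the joint normalized coverage between two homologous intervals.

    first_interval_ratio is the proportion of coverage attributed to the
    first interval; the remainder goes to the second interval.
    """
    if not 0.0 <= first_interval_ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    first = joint_coverage * first_interval_ratio
    second = joint_coverage - first
    return first, second
```

For example, a joint coverage of 4.0 with a proportion of 0.25 for the first interval yields 1.0 for the first interval and 3.0 for the second.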
To detect somatic copy number aberrations and regions with loss of heterozygosity, run the DRAGEN CNV Caller on a tumor sample with a VCF that contains germline SNVs from matched normal sample or population SNV VCF. The output file is a VCF file. Components of the germline CNV caller are reused in the somatic algorithm with the addition of a somatic modeling component, which estimates tumor purity and ploidy.
The germline SNVs are used to compute B-allele ratios in the tumor, which allows for allele-specific copy number calling on the tumor sample. Where possible, use the small-variant VCF from a matched normal sample.
A panel of normals is used as the reference baseline to provide insight into copy number variants. The ASCN somatic WES CNV model is similar to the somatic WGS CNV model (with different internal parameters tuned for WES), but it uses a panel of normals to remove coverage bias in each target region.
The pipeline accepts various input types for matched normal sample or population SNV VCF for B-allele loci. If the normal sample was already processed using the germline small variant caller, the user can provide its output VCF file.
If the normal sample was not processed, the user can provide raw reads or aligned reads and enable the concurrent execution of the small variant caller. In that case, DRAGEN CNV receives the small variant caller's output and uses it to estimate B-allele frequencies from the germline SNVs.
If there is no matched normal sample, the user can provide a population SNV VCF. DRAGEN will intersect the population SNV VCF with the target region provided by the cnv-target-bed
and use the resulting SNVs to estimate B-allele frequencies.
You can use the following somatic WES CNV calling command-line options:
1 tumor input
1 normal input (either option 1, 2, or 3)
Panel of normals (either option 1, 2, 3 or 4)
Target region
When the normal sample input is not in VCF format (for example, FASTQ, BAM, or CRAM), the normal sample can also be used as part of the PON. However, if the normal sample is already included in the PON, it is not added again.
The following is an example command line for running ASCN tumor-normal somatic WES CNV calling with matched normal SNV VCF.
The following example command line runs ASCN tumor-normal somatic WES CNV calling concurrently with the Somatic SNV Caller, which allows you to use the matched normal germline heterozygous sites directly from the SNV Caller by setting the option cnv-use-somatic-vc-baf true
.
If a matched normal is not available, DRAGEN CNV requires a population SNV VCF to run in tumor-only mode. The following example command line runs tumor-only somatic WGS CNV calling with a population SNV VCF.
For somatic whole-exome sequencing (WES) and somatic targeted panels, you can use a panel of normals as the reference baseline to provide insight into copy number variants. The reported events are based solely on the normalized copy ratio values and the deviation from the expected reference baseline levels. This workflow can be useful for applications that require only the detection of gains and losses in targeted genes. The somatic WES CNV model is similar to the germline WES CNV model, but utilizes a different quality scoring and calling model.
Use one of the following input options.
--tumor-fastq1 and --tumor-fastq2: Specify a FASTQ file
--tumor-bam-input: Specify an existing BAM file
--tumor-cram-input: Specify an existing CRAM file
The Somatic WES CNV Caller requires a panel of normals. The panel of normals samples help measure intrinsic biases of the upstream processes to allow for proper normalization. To generate a panel of normals, see Panel of Normals. The panel of normals samples should be well matched to the case sample under analysis.
If a matched normal sample is available, the sample can be included in the panel of normals. The workflow is the same whether or not a matched normal is available.
The following example command line runs somatic analysis on WES data.
If you are using somatic targeted panels with a set of genes supplied with the capture kit, then you can bypass segmentation by specifying a cnv-segmentation-bed and using cnv-segmentation-mode=bed. If using this option, all events in the segmentation BED are reported in the output VCF. For more information on the segmentation BED file, see [Targeted Segmentation (Segment BED)].
The following example command line runs somatic analysis on a targeted panel.
The Somatic WES CNV Caller computes quality scores using a two-sample t-test between the normalized copy ratio of the case sample and the panel of normals samples. The caller computes a p-value per segment. The p-values are then converted to Phred-scaled scores. For copy neutral events, the caller computes quality scores as 1-p.
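The Phred conversion described above can be sketched as follows. This is a minimal illustration, not the caller's implementation: the function names and the score cap are hypothetical, and reading "quality scores as 1-p" as Phred-scaling 1-p instead of p is an interpretation of the text.

```python
import math


def phred_scaled(p_value, max_score=1000.0):
    """Convert a p-value to a Phred-scaled score, Q = -10 * log10(p).

    max_score is a hypothetical cap used here only to avoid log10(0).
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must be in [0, 1]")
    if p_value == 0.0:
        return max_score
    return min(max_score, -10.0 * math.log10(p_value))


def segment_quality(p_value, copy_neutral):
    """Score a segment; copy neutral segments use 1 - p in place of p (assumption)."""
    return phred_scaled(1.0 - p_value if copy_neutral else p_value)
```

Under this reading, a segment with p = 0.001 scores about 30, and a copy neutral segment with p = 0.9 scores about 10.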
DUP/DEL event calls are made based on the limit of detection (LoD) threshold, which is set using cnv-filter-limit-of-detection (default 0.2). For each segment, the caller computes a p-value for the hypothetical counts Case Counts X (1 +/- LoD) against the PON. If the p-value of Case Counts X (1+LoD) is highest, then the segment is called DUP. If the p-value of Case Counts X (1-LoD) is highest, then the segment is called DEL. Otherwise, the segment is called REF.
The output VCF contains the quality score in the QUAL field.
The non-ASCN Somatic WES CNV Caller only reports copy ratio, also known as fold change. Fold change is encoded in the FORMAT/SM field as a linear copy ratio of the segment mean. If tumor purity is known, you can infer the ploidy of a gene or segment in the sample from the reported fold change using the following calculation.
For example, if the tumor purity is 30% for MET with a fold change of 2.2x, then there are 10 copies of MET DNA in the sample.
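The MET example is consistent with the usual purity-adjusted relationship between fold change and copy number. The sketch below assumes that normal cells contribute two copies and that fold change is measured against a diploid baseline, so that n = [200 x FC - 2 x (100 - X)] / X, where X is the tumor purity in percent. The formula is reconstructed from the worked example and the function name is hypothetical.

```python
def copies_from_fold_change(fold_change, purity_percent, normal_copies=2):
    """Infer tumor copy number from a linear fold change and tumor purity.

    Assumes the sample is a mixture of tumor cells (purity_percent) and
    normal cells carrying normal_copies copies, with fold change measured
    against a baseline of normal_copies.
    """
    if not 0 < purity_percent <= 100:
        raise ValueError("purity_percent must be in (0, 100]")
    purity = purity_percent / 100.0
    observed = fold_change * normal_copies          # average copies per cell
    from_normal = (1.0 - purity) * normal_copies    # contribution of normal cells
    return (observed - from_normal) / purity


# Purity 30%, fold change 2.2x: approximately 10 copies.
met_copies = copies_from_fold_change(2.2, 30)
```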
Short tandem repeats (STRs) are regions of the genome consisting of repetitions of short DNA segments called repeat units. STRs can expand to lengths beyond the normal range and cause mutations called repeat expansions. Repeat expansions are responsible for many diseases, including Fragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.
DRAGEN includes a repeat expansion detection method called ExpansionHunter. ExpansionHunter performs sequence-graph based realignment of reads that originate inside and around each target repeat. ExpansionHunter then genotypes the length of the repeat in each allele based on these graph alignments.
ExpansionHunter is designed for PCR-free whole genome samples. Repeats are only genotyped if the coverage at the locus is at least 10x, but a minimum of 30x is recommended. Sequencing reads must be paired-end with a minimum read length of 100 (2x100bp). ExpansionHunter cannot be run on multiple FASTQ files that are assigned to different library IDs in the fastq_list.csv file.
ExpansionHunter does not support somatic analysis.
More information and analysis is available in the following ExpansionHunter papers:
Dolzhenko et al., Detection of long repeat expansions from PCR-free whole-genome sequence data 2017
Dolzhenko et al., ExpansionHunter: A sequence-graph based tool to analyze variation in short tandem repeat regions 2019
To enable DRAGEN repeat expansion detection, the following command-line options are required.
--repeat-genotype-enable=true
--repeat-genotype-specs=<path to specification file>
You can use the --sample-sex option to specify the sex of the sample. The following options are optional.
--repeat-genotype-region-extension-length=<length of region around repeat to examine> (default 1000 bp)
--repeat-genotype-min-baseq=<Minimum base quality for high confidence bases> (default 20)
The main output of repeat expansion detection is a VCF file that contains the variants found via this analysis.
The repeat-specification (also called variant catalog) JSON file defines the repeat regions for ExpansionHunter to analyze. Default repeat-specification files for some pathogenic and polymorphic repeats are in the <INSTALL_PATH>/resources/repeat-specs/ directory, based on the reference genome used with DRAGEN.
You can create specification files for new repeat regions by using one of the provided specification files as a template. See the ExpansionHunter documentation for details on the format.
--repeat-genotype-specs is required for ExpansionHunter. If the option is not provided, DRAGEN attempts to autodetect the applicable catalog file from <INSTALL_PATH>/resources/repeat-specs/ based on the reference provided.
The ExpansionHunter can detect pathogenic expansions of FXN, ATXN3, ATN1, AR, DMPK, HTT, FMR1, ATXN1, C9ORF72 repeats with high accuracy (see the ExpansionHunter papers above). The pathogenicity status of some repeats might depend on the presence of sequence interruptions or motif changes that ExpansionHunter does not call. If you would like to visually inspect the relevant read alignments, you can use a Repeat Expansion Viewer third-party tool.
Included below are the repeat unit expansion thresholds (normal, pre-mutation and expansion) for some common repeats.
The results of repeat genotyping are output as a separate VCF file, which provides the length of each allele at each callable repeat defined in the repeat-specification catalog file. The file is named <outputPrefix>.repeats.vcf (*.gz). The VCF output file begins with the following fields.
Table 2 Core VCF Fields
Table 3 Additional INFO Fields
Table 4 GENOTYPE (Per Sample) Fields
For example, the following VCF entry describes the ATXN1 repeat in a sample NA13537.
In this example, the first allele spans 33 repeat units while the second allele spans 58 repeat units. The repeat unit is TGC (RU INFO field), so the sequence of the first allele is TGC x 33 and the sequence of the second allele is TGC x 58. The repeat spans 30 repeat units in the reference (REF INFO field).
The length of the short allele was estimated from spanning reads (SPANNING) while the length of the expanded allele was estimated from in-repeat reads (INREPEAT). The confidence interval for the size of the expanded allele is (52,71). There are 4 spanning and 69 flanking reads consistent with the repeat allele of size 33; that is, 4 reads fully contain the repeat of size 33 and 69 flanking reads overlap at most 33 repeat units. There are 83 flanking and 4 in-repeat reads consistent with the repeat allele of size 58. The average coverage of this locus is 37.46x.
The sequence-graph alignments of reads in the targeted repeat regions are output in a BAM file. You can use a specialized GraphAlignmentViewer tool available on GitHub to visualize the alignments. Programs like Integrative Genomics Viewer (IGV) are not designed for displaying graph-aligned reads and cannot visualize these BAMs.
The BAMs store graph alignments in custom XG tags using the format <LocusName>,<StartPosition>,<GraphCIGAR>.
LocusName---A locus identifier that matches the corresponding entry in the repeat expansion specification file.
StartPosition---The starting alignment position of a read on the first graph node.
GraphCIGAR---The alignment of a read against the graph starting from that position. GraphCIGAR consists of a sequence of graph node identifiers and linear CIGARS describing the alignment of the read to each node.
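The three-part XG tag value can be split as sketched below. This is an illustration only: the helper names and the example tag value are hypothetical, locus names are assumed to contain no commas, and the node syntax `<node id>[<linear CIGAR>]` is an assumption about how the GraphCIGAR is written.

```python
import re
from typing import List, NamedTuple, Tuple


class GraphAlignment(NamedTuple):
    locus: str                          # matches an entry in the specification file
    start: int                          # start position on the first graph node
    node_cigars: List[Tuple[int, str]]  # (node id, linear CIGAR) per visited node


# Assumed GraphCIGAR syntax: node id followed by a bracketed linear CIGAR.
_NODE = re.compile(r"(\d+)\[([^\]]*)\]")


def parse_xg(tag_value: str) -> GraphAlignment:
    """Split an XG tag value into locus name, start position, and per-node CIGARs."""
    locus, start, graph_cigar = tag_value.split(",", 2)
    node_cigars = [(int(node), cigar) for node, cigar in _NODE.findall(graph_cigar)]
    return GraphAlignment(locus, int(start), node_cigars)


# Hypothetical tag value for a read crossing the left flank, two repeat
# units, and the right flank of a locus named ATXN1.
alignment = parse_xg("ATXN1,12,0[10M]1[3M]1[3M]2[8M]")
```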
Quality scores in the BAM file are binary. High-scoring bases are assigned a score of 40, and low-scoring bases are assigned a score of 0.
Short tandem repeats (STRs) are regions of the genome consisting of repetitions of short DNA segments called repeat units. STRs can expand to lengths beyond the normal range and cause mutations called repeat expansions. Repeat expansions are responsible for many diseases, including Fragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.
ExpansionHunter de novo allows the discovery of expanded STR regions from paired-end reads across a cohort of samples. It is designed to work with PCR-free samples of 100-200bp paired-end reads at >30X coverage.
Note:
STRs shorter than the read length are ignored; the program is appropriate only for detecting expansions that exceed the read length.
The location of each reported STR is approximate (up to about 500bp-1Kbp)
STRs are not genotyped; the program reports a depth-normalized count of reads originating inside each STR; this count can be used as a very approximate measure of the repeat length
To achieve best results all samples must be sequenced on the same instrument to similar coverage, have the same read and fragment lengths, and be subjected to the same computational pre-processing (e.g. reads must be aligned by the same aligner)
For more information refer to:
Dolzhenko, E., Bennett, M.F., Richmond, P.A. et al. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol 21, 102 (2020)
Briefly, the workflow can be separated in two distinct steps: profiling and analysis. In the profiling step, repetitive reads are found and used to infer the location of potential STR regions. The regions and the respective read counts are then saved in a "profile" on disk. The profiling step is run for each sample and the resulting profiles are merged into a single dataset for the analysis. In the analysis step the user needs to provide a table describing the experimental design to run either an outlier analysis which tests one sample against the rest or a case-control analysis where the samples are split in two groups.
In DRAGEN, the analysis is more streamlined than the standalone EHdn tool and has considerable performance improvements, while retaining the same output.
Note: The output in the case of outlier analysis might not be exactly identical because it involves bootstrapping. In DRAGEN, the random sampling function necessary for bootstrapping is different than what is implemented by Numpy in the standalone EHdn.
Note: The DRAGEN implementation is based on EHdn version v0.9.1
The two steps of the workflow, profiling and analysis, are performed by two separate DRAGEN commands.
In the first step, the profiles are computed and saved as ProtoBuf messages (<out_prefix>.data). The profile can be saved in a specific directory with the --str-profiler-output-directory flag. The sample name is saved in the profile and can be specified at the profiling stage with the flag --str-profiler-sample-name. If not specified, the sample name in the RGSM field is used instead.
DRAGEN has to be called once for each sample, for example with the command:
After all the profiles are computed, they have to be divided into 'cases' and 'controls' directories. This can be achieved while computing the profiles by passing the directory with the --str-profiler-output-directory flag. The input can be a list of samples with the --fastq-list option. DRAGEN can take as input a list of FASTQ files and save each profile in the directory specified with --str-profiler-output-directory. A list of cases and a list of controls can be run in this manner.
Example command:
The analysis is performed with a separate DRAGEN command, which takes as input the path to the two directories.
Two analysis types can be specified:
outlier = bootstraps the sampling distribution of the 95% quantile and then calculates the z-scores for the case samples
casecontrol = case and control counts are compared with a one-sided Wilcoxon rank-sum test and a Bonferroni correction is applied to the resulting p-values
Providing the --str-profile-analysis flag will trigger the analysis workflow. Example command:
The standalone version of EHdn performs 100 rounds of resampling during bootstrapping due to computational constraints. In DRAGEN the resampling has been increased to 1000 by default thanks to the much faster computation. This number can be adjusted with the flag --str-profiler-resampling-rounds. Increasing the number of resampling cycles will improve the precision of the estimate but also linearly increase the compute times.
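The role of the resampling rounds in the outlier analysis can be illustrated with a small sketch. It is not the DRAGEN or EHdn implementation: it assumes that case and control counts for one locus are pooled, that each round resamples the pool with replacement and records its 95% quantile, and that each case count is then expressed as a z-score against the bootstrapped quantiles. The function names, the quantile rule, and the floor on the spread are all hypothetical.

```python
import random
import statistics


def quantile(values, q):
    """Return a simple empirical quantile (nearest-rank style)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def outlier_z_scores(case_counts, control_counts, rounds=1000, seed=0):
    """Bootstrap the 95% quantile of the pooled counts and z-score the cases."""
    rng = random.Random(seed)
    pooled = list(case_counts) + list(control_counts)
    estimates = []
    for _ in range(rounds):
        # One resampling round: draw len(pooled) counts with replacement.
        resample = rng.choices(pooled, k=len(pooled))
        estimates.append(quantile(resample, 0.95))
    center = statistics.mean(estimates)
    # Floor the spread so identical bootstrap estimates do not divide by zero.
    spread = max(statistics.pstdev(estimates), 1e-9)
    return [(count - center) / spread for count in case_counts]
```

The cost of the loop grows linearly with `rounds`, which mirrors the trade-off between precision and compute time described above.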
DRAGEN will spread the computation across 48 threads by default, but the number can be adjusted on the command line with the flag --str-profiler-threads.
The output (as in the standalone EHdn implementation) is composed of two tables, one for the "locus" level analysis and one for the "motif" level analysis, which are saved as <output-prefix>.str_profiler_locus.tsv and <output-prefix>.str_profiler_motif.tsv respectively. Below is a description of the locus analysis output. The motif table is the same as the locus table but without the contig, start and end columns.
The previous can be visualized as:
is the set of all genotypes
is the set of conflicting genotypes
is the Mother copy number
is the Father copy number
is the Proband copy number
is the prior for the trio genotype
For more information on how to use the output files to aid in debug and analysis, see .
In Somatic WGS CNV, the INFO column can also contain the HET tag, when the call is considered sub-clonal. See .
When matching CNV with SV output, additional INFO annotations are added. See .
Using R, a good starting point is the package. The main workflow involves reading the *.target.counts.gz file as an R dataframe, converting it to a GRanges object, then plotting the target intervals as points with the karyoploteR package. The same workflow can be used to plot the GC-corrected counts, the log2 normalized copy ratios and the BAF profiles.
Using Python, the workflow is similar to R's but using Python's libraries such as , to convert DRAGEN output files to dataframe, and , to plot coverage and BAF profiles across the genome.
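As a starting point for the Python route, the sketch below loads a target counts file using only the standard library. It assumes the file is a gzip-compressed, tab-separated table in which comment lines start with `#` and the header row names the columns `contig`, `start`, `stop`, `name`, followed by one count column. That layout is an assumption and should be checked against an actual file; the function name is hypothetical.

```python
import csv
import gzip


def read_target_counts(path):
    """Return one dict per target interval from a *.target.counts.gz file.

    Assumes '#'-prefixed comment lines and a tab-separated header row that
    includes 'start' and 'stop' columns.
    """
    intervals = []
    with gzip.open(path, "rt", newline="") as handle:
        # Skip comment lines so the first remaining line is the column header.
        rows = (line for line in handle if not line.startswith("#"))
        for row in csv.DictReader(rows, delimiter="\t"):
            row["start"] = int(row["start"])
            row["stop"] = int(row["stop"])
            intervals.append(row)
    return intervals
```

The resulting list can be handed to a dataframe or plotting library of choice to draw per-interval counts along each contig.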
MRJD cannot run together with the DRAGEN Small Variant Caller in this DRAGEN version. We recommend running the DNA Mapping and Small Variant Calling workflow first, and then running MRJD using the aligned BAM file generated by the DNA Mapping workflow as input. This workflow creates two VCF files (.hard-filtered.vcf.gz by the DRAGEN Small Variant Caller and .mrjd.hard-filtered.vcf.gz by DRAGEN MRJD). To help users obtain a single VCF file for downstream analysis, we prepared a utility tool that replaces the DRAGEN Small Variant Caller output in the homology region of the six medically relevant and challenging genes with the MRJD caller output. The tool also annotates the calls made by MRJD (with the "MRJD" tag in the INFO column). Please refer to the to download the utility tool.
Upon completion, the tool generates imputed variants based on a reference panel, a genetic map, and input samples provided. The DRAGEN secondary analysis software supports VCF imputation on human data and provides a reference panel and a genetic map for the hg38 reference build accessible on the .
To achieve more accurate results, it is recommended to use input VCF generated with the force genotyping capability of the DRAGEN secondary analysis software so that it contains all the positions that are present in the reference panel. A file to be used as input of the force genotyping run of the DRAGEN variant caller, with all sites present in the IRP reference panel (built from human reference genome hg38) is provided in the Imputation files accessible in the . When running the force genotype option (of DRAGEN variant caller) for imputation, it is recommended to disable the machine learning tool (--vc-ml-enable-recalibration=false).
to force genotype the input VCF with a SNPs-only sites.vcf file using DRAGEN argument --vc-forcegt-vcf. This SNPs-only sites.vcf file contains only the SNPs sites present in the reference panel. A SNPs-only VCF file is also available in the IRP reference panel (built from human reference genome hg38) in the Imputation files accessible in the .
A per-chromosome reference panel in BCF format that lists all the imputation positions in the targeted regions along with the corresponding haplotypes must be provided. A reference panel (with prefix IRPv{x}) is available in the Imputation files accessible in the . IRPv2.0 is a multi-allelic SNP and INDEL reference panel containing 3202 samples from the 1000 Genomes Project, which have been variant called using DRAGEN 4.0 against hg38.
A genetic map per chromosome is required to obtain the imputed variants. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use a prebuilt genetic map corresponding to the human hg38 reference genome. A prebuilt map is available as part of the Imputation files, accessible at the . DRAGEN does not generate custom genetic map files. The genetic map should follow the format:
This config file allows the proper handling of haploid/diploid chromosomes. This file is present in the same directory of the input reference panel with PREFIX and is available in the . It must follow the naming convention: {$DIR}/{$PREFIX}.config.json. When the config file is not present in the directory, the tool assumes that the imputation is done on all diploid chromosomes.
The DRAGEN CNV pipeline supports multiple input formats. To run the DRAGEN CNV pipeline directly with FASTQ input without generating a BAM or CRAM file, see for instructions on streaming alignment records directly from the DRAGEN map/align stage.
DRAGEN CNV also supports running from an already mapped and aligned BAM or CRAM file. If you have data that has not yet been mapped and aligned, see .
The reference hashtable is a pregenerated binary representation of the reference genome. For information on generating a hashtable, see .
For information on running CNV concurrently with the Haplotype Variant Caller, see .
Further details are available in the section.
Some of the excluded intervals can be rescued through the segmental duplication extension to the germline CNV workflow. See below on Section for more details.
The GC bias correction module immediately follows the target counts stage and operates on the *.target.counts.gz file. GC bias correction generates a GC bias corrected version of the file, which has a *.target.counts.gc-corrected.gz extension in the file name. The GC bias corrected versions are recommended for any downstream processing when working with WGS data. For WES, if there are enough target regions, then the GC bias corrected counts can also be used. See for further details on GC-corrected target counts files.
See for a description of the target counts files.
An excluded interval does not guarantee that a CNV call does not span the interval. If there is sufficient data flanking the region, the segmentation stage along with any merging might still generate a call spanning the excluded interval. However, the call would not take read counts from excluded intervals into account. You can view explanations for excluded intervals in the *.excluded_intervals.bed.gz file. See for further details.
See for a description of the extension output files.
The ASCN somatic WES CNV pipeline uses the same methods and workflow as the DRAGEN Somatic WGS CNV pipeline. Please see for more details.
Tumor purity can be estimated automatically through the workflow.
For more information on the specification file specified by the --repeat-genotype-specs option, see .
Users can choose between any of the three default repeat-specification files packaged with DRAGEN using the command line option: --repeat-genotype-use-catalog=<default|default_plus_smn|expanded>. The default option includes ~60 repeats. The default_plus_smn option includes the SMN repeat in addition to all the repeats in the default catalog. The expanded catalog includes ~174K repeats, see . If --repeat-genotype-use-catalog is not specified on the command line, then the default catalog is used.
The repeat genotyping results will be incorrect if the selected reference genome is not compatible with the repeat specification file. When this occurs, many repeats may be marked as "LowDepth" in the VCF output file or estimated to have zero length. This can be further confirmed by visualizing read alignments with the .
The default variant catalog contains specifications on disease-causing repeats located in AFF2, AR, ARX_1, ARX_2, ATN1, ATXN1, ATXN10, ATXN2, ATXN3, ATXN7, ATXN8OS, BEAN1, C9ORF72, CACNA1A, CBL, CNBP, COMP, CSTB, DAB1, DIP2B, DMD, DMPK, EIF4A3, FMR1, FOXL2, FXN, GIPC1, GLS, HOXA13_1, HOXA13_2, HOXA13_3, HOXD13, HTT, JPH3, LRP12, MARCHF6, NIPA1, NOP56, NOTCH2NLC, NUTM2B-AS1, PABPN1, PHOX2B, PPP2R2B, PRDM12, PRNP, RAPGEF2, RFC1, RUNX2, SAMD12, SOX3, STARD7, TBP, TBX1, TCF4, TNRC6A, VWA1, XYLT1, YEATS2, ZIC2 and ZIC3 genes. More information about disease-causing repeats can also be found .
In addition to the aforementioned disease-causing repeats, the expanded variant catalog contains ~174K additional polymorphic repeats. They are initially detected using STR-Finder from the 1000 Genomes Project. The candidate repeats are then filtered using a customized quality control pipeline, see details .
2 | 0/2, 1/1 | 1/1
3 | 0/3, 1/2 | 1/2
4 | 0/4, 1/3, 2/2 | 1/3, 2/2
N | x/(N-x) for x <= N/2 | x/(N-x) for 1 <= x <= N/2
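The general rule in the last row can be written as a short enumeration. This is a sketch under the assumption that the first column is a total copy number N, that the second column lists every split x/(N-x) with x <= N/2, and that the third column keeps only splits with x >= 1. The function and parameter names are hypothetical.

```python
def copy_number_genotypes(total, min_allele=0):
    """List genotypes x/(total - x) with min_allele <= x <= total / 2."""
    if total < 0 or min_allele < 0:
        raise ValueError("copy numbers must be non-negative")
    return [f"{x}/{total - x}" for x in range(min_allele, total // 2 + 1)]


all_splits = copy_number_genotypes(4)                   # second column for N = 4
nonzero_splits = copy_number_genotypes(4, min_allele=1)  # third column for N = 4
```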
2 | 2 | 2 | Yes
2 | 2 | 1 | No
3 | 2 | 4 | No
3 | 2 | 2 | Yes
2 | 0 | 2 | No
GT | Genotype
SM | Linear copy ratio of the segment mean
CN | Estimated copy number
BC | Number of bins in the region
PE | Number of improperly paired end reads at start and stop breakpoints
GC | GC dinucleotide percentage
CT | CT dinucleotide percentage
AC | AC dinucleotide percentage
LR | Log10 likelihood ratio of ALT to REF
AS | Number of allelic read count sites
BC | Number of read count bins
CN | Estimated total copy number in tumor fraction of sample. This field is not present if the model cannot be estimated with high confidence.
CNF | Floating point estimate of tumor copy number. This field is not present if the model cannot be estimated with high confidence.
CNQ | Exact total copy number Q-score. This field is not present if the model cannot be estimated with high confidence.
MAF | Maximum estimate of the minor allele frequency
MCN | Estimated minor-haplotype copy number. This field is not present if the model cannot be estimated with high confidence.
MCNF | Floating point estimate of tumor minor-haplotype copy number. This field is not present if the model cannot be estimated with high confidence.
MCNQ | Minor copy number Q-score. This field is not present if the model cannot be estimated with high confidence.
NCN | Normal-sample copy number. The field is only present in germline-aware mode.
SCND | Difference between CN and GCN. The field is only present in germline-aware mode.
SD | Best estimate of segment's bias-corrected read count
Diploid | . | 2 | ./.
Diploid | <DUP> | >2 | ./1
Diploid | <DEL> | 1 | 0/1
Diploid | <DEL> | 0 | 1/1
Haploid | . | 1 | 0
Haploid | <DUP> | >1 | 1
Haploid | <DEL> | 0 | 1
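The rows above can be read as a lookup from region ploidy and copy number to the reported ALT allele and genotype. The sketch below encodes that reading, assuming the four columns are ploidy, ALT, copy number, and GT. The column interpretation and the function name are assumptions, not part of the DRAGEN interface.

```python
def cnv_alt_and_genotype(ploidy, copy_number):
    """Return (ALT, GT) for a segment, following the rows listed above."""
    if copy_number < 0:
        raise ValueError("copy_number must be non-negative")
    if ploidy == 2:
        if copy_number > 2:
            return ("<DUP>", "./1")
        if copy_number == 2:
            return (".", "./.")      # reference copy number
        if copy_number == 1:
            return ("<DEL>", "0/1")  # one copy lost
        return ("<DEL>", "1/1")      # both copies lost
    if ploidy == 1:
        if copy_number > 1:
            return ("<DUP>", "1")
        if copy_number == 1:
            return (".", "0")        # reference copy number
        return ("<DEL>", "1")
    raise ValueError("only haploid (1) and diploid (2) regions are listed")
```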
NON_KMER_UNIQUE | Non-unique Kmer bases are larger than 50% of interval. | Not applicable. This reason only applies to self-normalization mode.
EXCLUDE_BED | Interval overlaps with exclude BED larger than threshold. | --cnv-exclude-bed-min-overlap
PON_MAX_PERCENT_ZERO_SAMPLES | Number of PON samples with 0 coverage is larger than threshold. | --cnv-max-percent-zero-samples
PON_TARGET_FACTOR_THRESHOLD | Median coverage of interval is lower than threshold of overall median coverage. | --cnv-target-factor-threshold
PON_MISSING_INTERVAL | Target interval not found in PON. | Not applicable
1 | contig | chromosome name
2 | start | genomic locus of interval start
3 | stop | genomic locus of interval stop
4 | name | interval name
5 | mean | average coverage depth
6 | std | standard deviation
7 | normalizedStd | normalized standard deviation (std/mean)
8 | min | minimum
9 | 25% | 25 percentile
10 | 50% | median
11 | 75% | 75 percentile
12 | max | maximum
13 | intervalSize | interval size (stop-start)
14 | gcContents | percent GC
END_LEFT_BND_OF | 1 | String | ID of CNV whose left end is matched to the end of SV
END_RIGHT_BND_OF | 1 | String | ID of CNV whose right end is matched to the end of SV
LEFT_BND | 1 | String | ID of SV that matches the left end of CNV record
LEFT_BND_OF | 1 | String | ID of CNV whose left end is matched to SV
MatchSv | 1 | Integer | ID of original SV that was merged with CNV record
OrigCnvEnd | 1 | Integer | Coordinate of original CNV end
OrigCnvPos | 1 | Integer | Coordinate of original CNV pos
RIGHT_BND | 1 | String | ID of SV that matches the right end of CNV record
RIGHT_BND_OF | 1 | String | ID of CNV whose right end is matched to SV
SVCLAIM | A | String | Claim made by the structural variant call. Valid values are D, J, DJ for abundance, adjacency and both respectively
vc-output-evidence-bam | Enable evidence BAM output | False
vc-evidence-bam-output-haplotypes | Output graph haplotypes in evidence BAM | False
vc-evidence-bam-clipped-read-threshold | Percentage of clipped reads in active region to enable evidence BAM output for that region | 10%
vc-evidence-bam-force-output | Force evidence BAM output for all active regions | False
4.2 | DRAGEN 4.2 default Small Variant Caller | --enable-variant-caller=true | 3 | No | No | N/A
4.2 HSM | DRAGEN 4.2 High Sensitivity Mode | --enable-variant-caller=true --vc-enable-high-sensitivity-mode=true | 0.4 | Yes | Yes (Alpha) | N/A
4.3 | DRAGEN 4.3 default Small Variant Caller | --enable-variant-caller=true | 3 | Yes | Yes (Full) | 20%
4.3 Mosaic | DRAGEN 4.3 Mosaic Detection Mode | --enable-variant-caller=true --vc-enable-mosaic-detection=true | 0.4 | Yes | Yes (Full) | 0%
chr2 | 151578759 | 151588523 | NEB exon 98-105
chr2 | 151589318 | 151599076 | NEB exon 90-97
chr2 | 151599871 | 151609628 | NEB exon 82-89
chr2 | 178653238 | 178654995 | TTN exon 172-180
chr2 | 178657498 | 178659255 | TTN exon 181-189
chr2 | 178661759 | 178663516 | TTN exon 190-198
chr5 | 70049522 | 70077596 | SMN2
chr5 | 70924940 | 70953013 | SMN1
chr7 | 5970924 | 5980896 | PMS2 exon 13-15
chr7 | 5980968 | 5987689 | PMS2 exon 11-12
chr7 | 6737007 | 6743712 | PMS2CL exon 2-3
chr7 | 6743880 | 6753867 | PMS2CL exon 4-6
chr15 | 43599563 | 43602630 | STRC exon 24-29
chr15 | 43602982 | 43611000 | STRC exon 14-23
chr15 | 43611040 | 43618800 | STRC exon 1-13
chr15 | 43699379 | 43702452 | STRCP1 exon 23-28
chr15 | 43702488 | 43710472 | STRCP1 exon 13-22
chr15 | 43710502 | 43718262 | STRCP1 exon 1-12
chrX | 154555884 | 154565047 | IKBKG exon 3-10
chrX | 154639390 | 154648553 | IKBKGP1
regions | Required only when a chromosome of mixed ploidy is present in the Reference Panel folder | Define contig name and subregion name of mixed ploidy chromosome | Dictionary in the form: contigname_of_mixed_ploidy :[contigname_of_mixed_ploidy"_par1", contigname_of_mixed_ploidy"_par2", contigname_of_mixed_ploidy"_nonpar1", contig_name_of_mixed_ploidy"_nonpar2"...]
ploidy | “default” is a required name; contigname_of_mixed_ploidy_"nonpar" is required only when a chromosome of mixed ploidy is present in the Reference Panel folder | Define: ploidy behavior when different from “default”; default ploidy behavior | Dictionary in the form: contigname_of_mixed_ploidy_"nonpar": { typename1 : 1, typename2 : 2} "default" : { "typename1": 2, "typename 2": 2} typename is used in the Sample Type file input
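To make the two dictionaries more concrete, the sketch below writes a config file in the shape described above, using the {$DIR}/{$PREFIX}.config.json naming convention. The contig, the subregion names (chrX_par1, chrX_nonpar, chrX_par2), the typenames (male, female), and the prefix used in the usage note are all hypothetical placeholders; the config file shipped with the reference panel remains the authoritative example.

```python
import json
import os


def write_imputation_config(directory, prefix, contig, nonpar_ploidy, default_ploidy):
    """Write {directory}/{prefix}.config.json for one mixed-ploidy contig.

    nonpar_ploidy and default_ploidy map each typename to a ploidy value.
    Subregion names are illustrative, not prescribed by DRAGEN.
    """
    nonpar = f"{contig}_nonpar"
    config = {
        "regions": {contig: [f"{contig}_par1", nonpar, f"{contig}_par2"]},
        "ploidy": {nonpar: nonpar_ploidy, "default": default_ploidy},
    }
    path = os.path.join(directory, f"{prefix}.config.json")
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2)
    return path
```

For example, `write_imputation_config(panel_dir, "PANEL", "chrX", {"male": 1, "female": 2}, {"male": 2, "female": 2})` would describe a panel in which one sample type is haploid on the non-PAR part of chrX.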
--enable-imputation | NA | Yes | Set to true to enable vcf imputation pipeline
--imputation-ref-panel-dir | STRING | Yes | Directory containing per-chromosome reference panel VCF and optionally the JSON config file
--imputation-ref-panel-prefix | STRING | Yes | Prefix for reference panel files and the JSON config file
--imputation-genome-map-dir | STRING | Yes | Directory containing per-chromosome genome map files
--imputation-chunk-input-region | STRING | Yes for single region | Target region, usually a full chromosome (e.g. chr20:1000000-2000000 or chr20).
--imputation-chunk-input-region-list | STRING | Yes for list of regions | Text file listing chromosomes or regions to be processed, one chromosome/region per line.
--imputation-phase-input | STRING | Yes for single VCF file | Sample input file with VCF/BCF format. Single VCF or multi-sample VCF
--imputation-phase-input-list | STRING | Yes for multiple VCF files | Text file listing sample input in VCF/BCF format, one input file per line
--imputation-phase-sample-type | STRING | Yes when imputing on a non PAR region of mixed ploidy chromosome AND a single VCF file | Define typename of the VCF file imputed. The typename must match one of the two typenames defined in the JSON config file
--imputation-phase-sample-type-list | STRING | Yes when imputing on a non PAR region of mixed ploidy chromosome AND a list of VCF files | Path to the Sample Type file
--output-directory | STRING | Yes | Output directory
--output-file-prefix | STRING | Yes | Output files prefix
--imputation-phase-threads | INT | No | Specify the number of threads to use. Default is the number of system threads
--imputation-phase-filter-input-sample-in-ref | NA | No | Default is true: if sample ID matches between reference panel and sample input, then the corresponding samples are ignored from the reference panel to avoid imputation against itself. To be turned to false if all samples from the reference panel should be kept regardless of their presence in the sample input.
--imputation-phase-impute-reference-only-variants | STRING | No | Default is false. If set to true, allows imputation at variants only present in the reference panel. The use of this option is intended only to allow imputation at sporadic missing variants. If the number of missing variants is non-sporadic, please re-run the genotype likelihood computation at all reference variants and avoid using this option, since data from the reads should be used. When the input sample variant calling was performed using --vc-forcegt-vcf with SNPs-only sites.vcf file, it is recommended to set this option to true to also impute INDELs positions from the reference panel.
--imputation-phase-input-independently | STRING | No | Default is false. If set to true, allows to treat each sample input independently without using them in the reference panel calculation
5 | 10000
10 | 5000
>= 30 | 1000
Fastq | TRUE | --enable-map-align=true, --enable-duplicate-marking=true
BAM | TRUE | --enable-map-align=true, --enable-duplicate-marking=true
BAM | FALSE | --enable-map-align=false
--cnv-normals-file | Individual normal file. This option uses a single file name and can be specified multiple times.
--cnv-normals-list | List of normal files. A plain text file in which each line contains a path pointing to a *.target.counts.gz or *.target.counts.gc-corrected.gz file generated from the target counts stage. Relative paths are supported if the paths are relative to the current working directory. Absolute paths are recommended in case the workflow is used later or shared with other users.
--cnv-combined-counts | PON file that combines all normal files in a single file. The combined counts file (*.combined.counts.txt.gz) can be found in the output folder of a prior DRAGEN run with the same panel of normals. Some prepackaged PON files downloaded directly from the Illumina support site need to use this option.
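A normals list file of the kind accepted by --cnv-normals-list can be assembled with a few lines of Python. The sketch below is one way to do it, assuming all target counts files of interest sit in a single directory; the function name and directory layout are hypothetical. It writes absolute paths, following the recommendation above.

```python
import glob
import os


def write_normals_list(counts_dir, list_path, gc_corrected=True):
    """Write one absolute path per line for every target counts file in counts_dir."""
    pattern = "*.target.counts.gc-corrected.gz" if gc_corrected else "*.target.counts.gz"
    paths = sorted(
        os.path.abspath(path) for path in glob.glob(os.path.join(counts_dir, pattern))
    )
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

Selecting either the GC-corrected or the uncorrected files, but not a mix, keeps the panel internally consistent.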
CYP2A6
CYP2A7
FCGR3A
FCGR3B
RHD
RHCE
STRC
STRCP1
ACSM2A
ACSM2B
ACTR3B
ACTR3C
AQP12A
AQP12B
ASAH2
ASAH2B
CCDC74A
CCDC74B
CD177
CD177p1
CD8B
CD8B2
CFH1
CFHR1
CYP4A11
CYP4A22
DHX40
DHX40P1
EIF5AL1
EIF5AP4
FCGR2A
FCGR2C
FFAR3
GPR42
FOLH1
FOLH1B
FRMPD2
FRMPD2B
GPAT2
GPAT2P1
GSTT2B
GSTT2
DDT
DDTL
HCAR2
HCAR3
HSPA1A
HSPA1B
KRT81
KRT86
LGALS7
LGALS7B
MRPL45
MRPL45P2
MSTO1
MSTO2p
MUC20
MUC20P1
MZT2A
MZT2B
OTOA
OTOAp1
PDPR
PDPR2P
PIEZO2
ENST00000591853.1
ZP3
POMZP3
PRAMEF7
PRAMEF8
PROS1
PROS2P
RMND5A
ANAPC1P2
ROCK1
ROCK1p1
SERPINB3
SERPINB4
SYT3
ZNF473CR
TBC1D26
TBC1D28
TOP3B
TOP3BP1
TUBA3D
TUBA3E
ZNF443
ZNF799
Tumor input
--tumor-fastq1, --tumor-fastq2, --tumor-bam-input, --tumor-cram-input
file
Specify a tumor input file.
Normal input Option 1
--fastq1, --fastq2, --bam-input, --cram-input
file
Specify a normal input file (if normal VCF is not ready).
Normal input Option 1
--cnv-use-somatic-vc-baf
true/false
If running in tumor-normal mode with the SNV caller enabled, use this option to specify the germline heterozygous sites. For more information on specifying b-allele loci, see Specification of B-Allele Loci.
Normal input Option 2
--cnv-normal-b-allele-vcf
vcf file
Specify a matched normal SNV VCF. For more information on specifying b-allele loci, see Specification of B-Allele Loci.
Normal input Option 3
--cnv-population-b-allele-vcf
vcf file
Specify a population SNP catalog. For more information on specifying b-allele loci, see Specification of B-Allele Loci.
PON option 1
--cnv-normals-file
normal count file
Specify individual normal counts file (target.counts.gz or target.counts.gc-corrected.gz) for PON. You can use this option multiple times, one time for each file.
PON option 2
--cnv-normals-list
text file listing one normal count file per line
Specify a text file that lists the paths to the reference target counts files to use as a panel of normals, one path per line.
PON option 3
--cnv-combined-counts
file
Specify combined PON file (.combined.counts.txt.gz).
PON option 4
NA
If no PON sample is specified, DRAGEN uses the matched normal sample as a single-sample PON. Available for Normal input Option 1.
Target region
--cnv-target-bed
bed file
Specify target region bed file
Sample sex
--sample-sex
male/female/auto/none
If known, specify the sex of the sample. If the sample sex is not specified, the caller attempts to estimate the sample sex from tumor alignments.
DMPK
< 37
37-50
> 50
FXN
< 33
33-66
> 66
HTT
< 35
35-40
> 40
ATN1
< 35
35-49
> 49
ATXN1
< 40
40-41
> 41
AR
< 35
35-36
> 36
FMR1
< 55
55 - 200
> 200
ATXN10
< 32
32-33
> 33
ATXN2
< 31
31-32
> 32
ATXN7
< 27
27-33
> 33
CACNA1A
< 19
19-20
> 20
CBL
< 80
80-81
> 81
CSTB
< 29
29-30
> 30
JPH3
< 28
28-40
> 40
PPP2R2B
< 32
32-65
> 65
C9ORF72
< 30
30-31
> 31
ATXN3
< 41
41-52
> 52
CHROM
Chromosome identifier
POS
Position of the first base before the repeat region in the reference
ID
Always .
REF
The reference base at position POS
ALT
List of repeat alleles in the format <STRn>, where n is the number of repeat units. If REF, the value is a period (.).
QUAL
Always .
FILTER
LowDepth filter is applied when the overall locus depth is below 10x or number of reads that span one or both breakends is below 5.
END
Position of the last base of the repeat region in the reference
REF
Number of repeat units spanned by the repeat in the reference
RL
Reference length in bp
VARID
Variant ID from the variant catalog
RU
Repeat unit in the reference orientation
REPID
Variant ID from the variant catalog
GT
Genotype
SO
Type of reads that support the allele. Values can be SPANNING, FLANKING, or INREPEAT. These values indicate if the reads span, flank, or are fully contained in the repeat.
REPCN
Number of repeat units spanned by the allele
REPCI
Confidence interval for REPCN
ADSP
Number of spanning reads consistent with the allele
ADFL
Number of flanking reads consistent with the allele
ADIR
Number of in-repeat reads consistent with the allele
LC
Locus Coverage
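The ALT convention described above (<STRn> alleles, or a period when only the reference allele is present) is simple to parse. The following is a minimal sketch, assuming multiple ALT alleles are comma-separated as in standard VCF; the function name parse_repeat_alt is illustrative and not part of any DRAGEN tool.

```python
import re

# Matches one symbolic repeat allele such as <STR12>.
_STR_ALT = re.compile(r"^<STR(\d+)>$")


def parse_repeat_alt(alt_field):
    """Return the repeat-unit count of each ALT allele; [] when ALT is '.'."""
    if alt_field == ".":
        return []
    counts = []
    for allele in alt_field.split(","):
        match = _STR_ALT.match(allele)
        if match is None:
            raise ValueError(f"unexpected ALT allele: {allele!r}")
        counts.append(int(match.group(1)))
    return counts
```

For instance, an ALT value of <STR12>,<STR40> yields the counts 12 and 40.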
contig
Contig of the repeat region
start
Approximate start of the repeat
end
Approximate end of the repeat
motif
Inferred repeat motif
top_case_zscore
Top z-score of a case sample
high_case_counts
Counts of case samples corresponding to z-score greater than 1.0
counts
Nonzero counts for all samples
contig
Contig of the repeat region
start
Approximate start of the repeat
end
Approximate end of the repeat
motif
Inferred repeat motif
pvalue
P-value from Wilcoxon rank-sum test
bonf_pvalue
P-value after Bonferroni correction
counts
Depth-normalized counts of anchored in-repeat reads for each sample (omitting samples with zero count)
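The relationship between the pvalue and bonf_pvalue columns can be illustrated with the standard Bonferroni adjustment. This is a minimal sketch only: the number of tests used by DRAGEN is not stated here, so num_tests is an assumed input, and the function name is illustrative.

```python
def bonferroni(pvalue, num_tests):
    """Bonferroni-adjust one p-value for num_tests comparisons, capped at 1."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return min(1.0, pvalue * num_tests)
```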
The CYP2D6 Caller is capable of genotyping the CYP2D6 gene from whole-genome sequencing (WGS) data and is derived from the method implemented in Cyrius¹. Due to high sequence similarity with its pseudogene paralog CYP2D7 and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants and identify likely star allele haplotypes.
The CYP2D6 Caller performs the following steps:
Determines total CYP2D6 and CYP2D7 copy number from read depth.
Determines CYP2D6-derived copy number at CYP2D6/CYP2D7 differentiating sites.
Detects SV breakpoints by calculating the changes in CYP2D6-derived copy number along the CYP2D6 gene.
Calls small variants in CYP2D6 copies.
Identifies star alleles from the detected SV breakpoints and small variants.
Identifies the most likely genotype for the called star alleles.
The first step of CYP2D6 calling is to determine the combined copy number of CYP2D6 and CYP2D7. Reads aligned to regions in either CYP2D6 or CYP2D7 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP2D6 and CYP2D7 copy number is then calculated from the average sequencing depth across the CYP2D6 and CYP2D7 regions.
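The normalization to a diploid baseline can be sketched as follows. This is a simplified illustration, not the DRAGEN implementation: it assumes the depths are already GC-corrected and uses the median depth of the normalization regions as the two-copy baseline, which is an assumption. The function name is hypothetical.

```python
from statistics import median


def combined_copy_number(region_depths, normalization_depths):
    """Estimate combined copy number from depth relative to a diploid baseline.

    region_depths: GC-corrected depths across the gene and paralog regions.
    normalization_depths: GC-corrected depths of the genome-wide
    normalization regions, each assumed to be present in two copies.
    """
    baseline = median(normalization_depths)  # depth expected for 2 copies
    if baseline <= 0:
        raise ValueError("normalization depth must be positive")
    mean_depth = sum(region_depths) / len(region_depths)
    return 2.0 * mean_depth / baseline
```

With a baseline depth of 30 and an average depth of 60 across the two regions, the estimate is 4 copies, the value expected for two copies each of gene and paralog.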
The CYP2D6-derived copy number is calculated at 117 predefined differentiating sites across the CYP2D6 gene. The differentiating sites are selected at positions with sequence differences in CYP2D6 and CYP2D7 where calling the CYP2D6-derived copy number shows an accuracy of greater than 98% based on sequencing data from the 1000 Genomes Project.
For each differentiating site, CYP2D6-specific and CYP2D7-specific alleles are counted in reads mapping to either CYP2D6 or the homologous region in CYP2D7. The CYP2D6-derived copy number is then calculated from the two gene-specific allele counts using the total CYP2D6 and CYP2D7 copy number calculated from the previous step.
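The calculation at a single differentiating site can be sketched as scaling the total copy number by the fraction of gene-specific reads. This is a minimal illustration, assuming simple rounding to the nearest integer; the actual caller may use a likelihood model instead, and the function name is hypothetical.

```python
def gene_derived_copy_number(gene_reads, paralog_reads, total_copy_number):
    """Estimate gene-derived copy number at one differentiating site.

    gene_reads / paralog_reads: counts of reads carrying the gene-specific
    or paralog-specific allele. Returns None when no reads cover the site.
    """
    total_reads = gene_reads + paralog_reads
    if total_reads == 0:
        return None
    return round(total_copy_number * gene_reads / total_reads)
```

A drop or rise in this value between neighboring sites along the gene is the signal that the SV breakpoint step described next relies on.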
The CYP2D6-derived copy number along the CYP2D6 gene is used to identify known population structural variants (SVs), including whole gene deletions and duplications as well as certain gene conversions and gene fusions. The following fusion variants are detected:
exon 9
2D6-2D7
*4.013, *36, *57, *83
exon 9
2D7-2D6
*13
intron 4
2D7-2D6
*13
intron 1
2D7-2D6
*13
intron 1
2D6-2D7
*68
In addition to the exon 9 fusion breakpoints, exon 9 can participate in CYP2D7 gene conversion, resulting in an embedded CYP2D7 sequence instead of a true hybrid. The structural variant caller also detects exon 9 gene conversions. Because only changes in CYP2D6-derived copy number yield structural variant calls, there might be rare cases where two hybrid copies result in no structural variant calls, for example when both *36 and *13 with a fusion breakpoint in exon 9 are present. However, the structural variant caller can detect multiple copies of the same fusion type (eg, *36x2) and cases where both an exon 9 gene conversion copy and an exon 9 2D6-2D7 hybrid are present.
118 small variants that define various star alleles are detected from the read alignments. 96 of these variants are in unique (nonhomologous) regions of CYP2D6 with high mapping quality. Only reads mapping to CYP2D6 are used for calling variants in nonhomologous regions. The other 22 variants occur in homologous regions of CYP2D6 where reads mapping to either CYP2D6 or CYP2D7 are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant alleles are counted. A binomial model that incorporates the sequencing errors is then used to determine the most likely variant copy number (0 for nonvariant). A strand bias filter is applied to a small subset of variants that would otherwise tend to have false positive calls in the population.
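The binomial model can be sketched as a maximum-likelihood choice over the possible variant copy numbers. This is an illustrative simplification, assuming a single symmetric error rate (the default of 0.01 is a placeholder, not a DRAGEN value) and no strand bias filter; the function name is hypothetical.

```python
from math import log


def most_likely_variant_copy_number(variant_reads, total_reads, copy_number,
                                    error_rate=0.01):
    """Return the variant copy number k in 0..copy_number with the highest
    binomial likelihood given the observed read counts."""
    if copy_number < 1 or not 0 < error_rate < 0.5:
        raise ValueError("need copy_number >= 1 and 0 < error_rate < 0.5")
    best_k, best_loglik = 0, float("-inf")
    for k in range(copy_number + 1):
        frac = k / copy_number
        # Expected variant-read fraction, allowing for sequencing errors.
        p = frac * (1 - error_rate) + (1 - frac) * error_rate
        # The binomial coefficient is the same for every k, so it is omitted.
        loglik = (variant_reads * log(p)
                  + (total_reads - variant_reads) * log(1 - p))
        if loglik > best_loglik:
            best_k, best_loglik = k, loglik
    return best_k
```

Ranking the remaining values of k by likelihood would give the alternate, less likely copy numbers that the caller reports for noisy samples.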
Samples with poor sequencing quality or greater than five copies of CYP2D6 will have allele counts with higher variance. This elevated variance increases the chance that the most likely variant copy number is wrong. To handle these cases, the small variant caller also indicates alternate, less likely variant copy numbers.
The called SVs and small variant genotypes are matched against the definitions of 128 different star alleles. More than one set of star alleles might match the called variant genotypes, such as *1, *46 and *43, *45, where both sets of star alleles contain the same 4 small variants. When the small variant caller emits alternate, less likely variant copy numbers in addition to the most likely variant copy numbers, further sets of star alleles might be identified, because these alternate sets of variant copy numbers are also matched to the star allele definitions. The number of matched star alleles must match the number of CYP2D6-derived gene copies determined from previous steps. When there are fewer than two CYP2D6-derived gene copies, one or more *5 deletion haplotypes are included in the output set of star alleles. If all variant genotypes cannot be matched to a set of star alleles, the CYP2D6 Caller returns a no call during the genotyping step with filter value No_call.
Given a possible set of star alleles, the genotyping step attempts to identify the two likely haplotypes that contain all star alleles in the set. The deletion haplotype (*5) is considered as a possible haplotype during this process. The likelihood of any given genotype is determined from a table of population frequencies derived from the 1000 Genomes Project, and the genotype with the highest population frequency is selected. When two or more possible genotypes are identified with similar population frequencies, all genotypes are emitted. This results in a call with filter value More_than_one_possible_genotype.
The CYP2D6 Caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json, which also contains calls from other targets (see Targeted JSON File). An example of the CYP2D6 caller content in the output is as follows:
genotype
called star allele genotype
string (semi-colon delimited list of possible genotypes with haplotypes separated by /)
genotypeFilter
The filter status for the genotype call
string (The value can include: PASS, No_call, or More_than_one_possible_genotype)
phenotypeDatabaseAnnotation
The metabolism status corresponding to the genotype, mapped from phenotypeDatabaseSources
string
Each CYP2D6 genotype contains two haplotypes separated by a slash (eg, *1/*2). Each haplotype consists of one or more star alleles separated by a plus sign (eg, *10+*36). When a haplotype contains more than one copy of the same star allele, that star allele appears only once and is followed by a multiplication sign and the number of copies (eg, *1x2 for two copies of *1).
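The genotype notation described above can be unpacked with a small parser. This is a minimal sketch, assuming star allele names are numeric with an optional suballele suffix (such as *4.013) and that the multiplication sign is the letter x, as in the examples; the function name parse_genotypes is illustrative. Any other content in the field raises an error instead of being guessed at.

```python
import re

# One star allele with an optional copy multiplier, eg *1, *4.013, *36x2.
_ALLELE = re.compile(r"^(\*\d+(?:\.\d+)?)(?:x(\d+))?$")


def parse_genotypes(field):
    """Parse a genotype field into genotypes -> haplotypes -> star alleles.

    ';' separates alternative genotypes, '/' separates the two haplotypes,
    '+' separates star alleles, and 'xN' repeats a star allele N times.
    """
    genotypes = []
    for genotype in field.split(";"):
        haplotypes = []
        for haplotype in genotype.strip().split("/"):
            alleles = []
            for token in haplotype.split("+"):
                match = _ALLELE.match(token.strip())
                if match is None:
                    raise ValueError(f"unexpected star allele: {token!r}")
                copies = int(match.group(2) or 1)
                alleles.extend([match.group(1)] * copies)
            haplotypes.append(alleles)
        genotypes.append(haplotypes)
    return genotypes
```

Expanding the multiplier into repeated entries makes it easy to count gene copies per haplotype with len().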
¹Chen X, Shen F, Gonzaludo N, et al. Cyrius: accurate CYP2D6 genotyping using whole-genome sequencing data. The Pharmacogenomics Journal. 2021;21(2):251-261. doi:10.1038/s41397-020-00205-5
The CYP2B6 Caller is capable of genotyping the CYP2B6 gene from whole-genome sequencing (WGS) data. Due to high sequence similarity with its pseudogene paralog CYP2B7 and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants and identify likely star allele haplotypes.
The CYP2B6 Caller performs the following steps:
Determines total CYP2B6 and CYP2B7 copy number from read depth.
Determines CYP2B6-derived copy number at CYP2B6/CYP2B7 differentiating sites.
Detects SV breakpoints by calculating the changes in CYP2B6-derived copy number along the CYP2B6 gene.
Calls small variants in CYP2B6 copies.
Identifies star alleles from the detected SV breakpoints and small variants.
Identifies the most likely genotype for the called star alleles.
The first step of CYP2B6 calling is to determine the combined copy number of CYP2B6 and CYP2B7. Reads aligned to regions in either CYP2B6 or CYP2B7 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP2B6 and CYP2B7 copy number is then calculated from the average sequencing depth across the CYP2B6 and CYP2B7 regions.
The CYP2B6-derived copy number is calculated at 99 predefined differentiating sites across the CYP2B6 gene. The differentiating sites are selected at positions with sequence differences in CYP2B6 and CYP2B7 where calling the CYP2B6-derived copy number shows an accuracy of greater than 98% based on sequencing data from the 1000 Genomes Project.
For each differentiating site, CYP2B6-specific and CYP2B7-specific alleles are counted in reads mapping to either CYP2B6 or the homologous region in CYP2B7. The CYP2B6-derived copy number is then calculated from the two gene-specific allele counts using the total CYP2B6 and CYP2B7 copy number calculated from the previous step.
The CYP2B6-derived copy number along the CYP2B6 gene is used to identify known population structural variants (SVs), including whole gene deletions and duplications as well as certain gene conversions and gene fusions. The following fusion variants are detected:
intron 4-exon 5
2B7-2B6
*29
intron 4-exon 5
2B6-2B7
*30
35 small variants that define various star alleles are detected from the read alignments. All of these variants are in unique (nonhomologous) regions of CYP2B6 with high mapping quality. Only reads mapping to CYP2B6 are used for calling variants in nonhomologous regions.
For each variant, reads containing either the variant allele or the nonvariant alleles are counted. A binomial model that incorporates the sequencing errors is then used to determine the most likely variant copy number (0 for nonvariant).
Samples with poor sequencing quality or greater than five copies of CYP2B6 will have allele counts with higher variance. This elevated variance increases the chance that the most likely variant copy number is wrong. To handle these cases, the small variant caller also indicates alternate, less likely variant copy numbers.
The recombinant (gene conversion) variant 18053A>G is detected by phasing the variant site with five flanking differentiating sites. When the haplotypes formed from phasing these sites support the gene conversion in CYP2B6, a read depth analysis at the gene conversion breakpoints (transitions from either CYP2B6->CYP2B7 or CYP2B7->CYP2B6) is performed. When the posterior probability that there is at least one gene conversion variant is above 0.7, DRAGEN uses the variant for star allele identification.
The called SVs and small variant genotypes are matched against the definitions of 39 different star alleles. More than one set of star alleles might match the called variant genotypes, such as *1, *6 and *4, *49, where both sets of star alleles contain the same two small variants. When the small variant caller emits alternate, less likely variant copy numbers in addition to the most likely variant copy numbers, further sets of star alleles might be identified, because these alternate sets of variant copy numbers are also matched to the star allele definitions. The number of matched star alleles must match the number of CYP2B6-derived gene copies determined from previous steps. If no variant genotypes can be matched to a set of star alleles, the CYP2B6 Caller returns a no call during the genotyping step with filter value No_call.
Given a possible set of star alleles, the genotyping step attempts to identify the two likely haplotypes that contain all star alleles in the set. The likelihood of any given genotype is determined from a table of population frequencies derived from the 1000 Genomes Project, and the genotype with the highest population frequency is selected. When two or more possible genotypes are identified with similar population frequencies, all genotypes are emitted. This results in a call with filter value More_than_one_possible_genotype.
The caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json, which also contains calls from other targets (see Targeted JSON File).
An example of the CYP2B6 caller content in the output is as follows:
For CYP2B6 caller, the fields are defined as follows.
genotype
star allele genotype identified for sample
string
genotypeFilter
The filter status for the genotype call
string (The value can include: PASS, No_call, or More_than_one_possible_genotype)
phenotypeDatabaseAnnotation
The metabolism status corresponding to the genotype, mapped from phenotypeDatabaseSources
string
The CYP21A2 Caller is capable of genotyping the CYP21A2 gene from whole-genome sequencing (WGS) data. Due to high sequence similarity with its pseudogene paralog CYP21A1P and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants.
The CYP21A2 calling workflow is broken up into the following major stages:
Loading input configuration
Processing read data
Analyzing read data
Read data analysis is further split into the following steps:
Determine total CYP21A2 and CYP21A1P copy number from read depth.
Call small variants in CYP21A2 copies.
Phase reads to detect common variants and recombination events.
Identify most likely haplotypes.
The CYP21A2 Caller requires WGS data aligned to a human reference genome with at least 30x coverage.
The first step of CYP21A2 calling is to determine the combined copy number of CYP21A2 and CYP21A1P. Reads aligned to regions in either CYP21A2 or CYP21A1P are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined CYP21A2 and CYP21A1P copy number is then calculated from the average sequencing depth across the CYP21A2 and CYP21A1P regions.
Of the known nonrecombinant-like variants, some are in unique (nonhomologous) regions of CYP21A2 with high mapping quality. Only reads mapping to CYP21A2 are used for calling variants in nonhomologous regions. The other variants occur in homologous regions of CYP21A2/CYP21A1P where reads mapping to either are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant allele are counted. A binomial model that incorporates the sequencing error rate is then used to determine the most likely variant copy number (0 for nonvariant).
For a list of the supported nonrecombinant-like variants, refer to the targeted/cyp21a2/target_variants_*.tsv files located in the resources directory of the DRAGEN install location.
To analyze the homologous region even further, DRAGEN phases reads covering differentiating sites and known variant sites. Whenever a detected haplotype has a CYP21A2->CYP21A1P or CYP21A1P->CYP21A2 transition that is consistent with one of the known recombinant-like variants, the transition is considered as a candidate breakpoint for calling those variants. Reads containing phasing information for the two sites flanking each candidate breakpoint are used for variant calling. When the read data supports the hypothesis that the sample contains at least one copy of a candidate breakpoint, the associated haplotype is a recombinant haplotype candidate. Recombinant haplotype candidates are sorted by likelihood and the number of variant sites. If no wild type haplotype was detected, DRAGEN reports any detected homozygous recombinant haplotype, or up to two different recombinant haplotypes (i.e. compound het) if detected. If any wild type haplotype was found, DRAGEN reports a maximum of one recombinant haplotype. When no recombinant haplotypes are detected two wild type haplotypes are reported.
For a list of recombinant variant sites, refer to the targeted/cyp21a2/recombinant_variants_*.tsv files located in the resources directory of the DRAGEN install location.
Note that NM_000500.9:c.710_719delinsACGAGGAGAA is reported as the following three variants on the same haplotype: NM_000500.9:c.710T>A, NM_000500.9:c.713T>A, and NM_000500.9:c.719T>A.
The CYP21A2 Caller generates its output in the targeted caller output file <output-file-prefix>.targeted.json, which also contains calls from other targets (see Targeted JSON File).
totalCopyNumber
Total copy number of CYP21A2 and CYP21A1P genes including hybrids
nonnegative integer
deletionBreakpointInGene
null (i.e. unknown) if totalCopyNumber > 3; true if totalCopyNumber <= 3 and a deletion-like recombinant variant haplotype is detected; false if totalCopyNumber <= 3 and no deletion-like recombinant variant haplotype is detected
true, false, null
recombinantHaplotypes
List of detected haplotypes arising from nonallelic homologous recombination variant calling
Array of two strings. Each string consists of all associated allele IDs (if any) within the haplotype. Consecutive IDs in the same haplotype are separated by a '+'.
variants
List of single site, nonrecombinant-like variants (i.e. not arising from nonallelic homologous recombination). An empty list if no variants are detected.
Array of nonrecombinant-like variants.
Note: A deletion-like recombinant variant haplotype (as opposed to a gene conversion-like recombinant variant haplotype) is defined as a haplotype with one or fewer switch sites (transitions from a CYP21A1P allele to a CYP21A2 allele) after excluding some sites with common gene conversions in CYP21A1P.
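The three-valued deletionBreakpointInGene field follows directly from the rules in the table above. This is a minimal sketch of that decision only, with Python None standing in for JSON null; detecting the deletion-like haplotype itself is outside its scope, and the function name is illustrative.

```python
def deletion_breakpoint_in_gene(total_copy_number, deletion_like_detected):
    """Mirror the documented rule for deletionBreakpointInGene.

    Returns None (unknown) when totalCopyNumber > 3; otherwise returns
    whether a deletion-like recombinant variant haplotype was detected.
    """
    if total_copy_number > 3:
        return None
    return bool(deletion_like_detected)
```

Downstream code should therefore test for None explicitly instead of relying on truthiness, because None and False carry different meanings here.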
Each nonrecombinant-like variant reported in the variants array has the fields below.
alleleId
HGVS identifier of the variant allele
string
alleleCopyNumber
Copy number of the allele in the called genotype
nonnegative integer
genotypeQuality
Phred-scaled quality for the called genotype
nonnegative integer
filter
Filter for the called genotype
string. "PASS" when not filtered
Recombinant-like and nonrecombinant-like variants are reported in VCF format. See Targeted VCF File for details about how these variants are reported in VCF.
An example of the CYP21A2 caller content in the <output-file-prefix>.targeted.json output file is shown below.
The HBA Caller is capable of genotyping the HBA1 and HBA2 genes from whole-genome sequencing (WGS) data. Due to high sequence similarity between the genes, a specialized caller is necessary to resolve the possible genotypes of the pair of genes. We consider regions surrounding the HBA1 and HBA2 sites to resolve the possible HBA1 and HBA2 genotypes.
The HBA Caller performs the following steps:
Determines total copy number from read depth of the regions surrounding the HBA1 and HBA2 sites.
Determines HBA genotypes based on the copy number of the regions surrounding the HBA1 and HBA2 sites.
Calls small variants in the HBA1 and HBA2 regions based on the region copy number derived from the genotype along with allele counts from read information.
The HBA Caller requires WGS data aligned to a human reference genome with at least 30x coverage. Reference genome builds must be based on hg19, GRCh37, or hg38.
For a comprehensive evaluation of the HBA caller, see HBA targeted caller blog post.
The first step of HBA calling is to determine the copy number of the regions surrounding the HBA1 and HBA2 sites. Reads aligned to the regions are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. Finally, a Gaussian Mixture Model (GMM) is used to obtain the integer region copy number from the region normalized counts.
The genotyping step attempts to identify the two likely haplotypes described in the following table, where "a" stands for a functional copy of either HBA1 or HBA2, "-" stands for a nonfunctional/missing copy of either HBA1 or HBA2, while "3.7" and "4.2" describe the recombinant event that likely caused the deletion/duplication of the functional HBA copy. The second column of the following table reports the interpretation of the genotype.
aaa3.7/aa
alpha-globin triplication
aaa4.2/aa
alpha-globin triplication
aa/aa
Normal
-a3.7/aa
Silent Carrier
-a4.2/aa
Silent Carrier
--/aaa3.7
Carrier
--/aaa4.2
Carrier
-a3.7/-a3.7
Carrier
-a4.2/-a4.2
Carrier
-a3.7/-a4.2
Carrier
--/aa
Carrier
--/-a3.7
HbH
--/-a4.2
HbH
--/--
Hb Bart's
If none of the genotypes listed above is identified, no call is made and the caller reports a "None" genotype.
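For downstream reporting, the genotype table above can be held as a lookup. This is a minimal sketch that copies the genotype strings and interpretations from the table; the names HBA_INTERPRETATION and interpret_hba_genotype are illustrative and not part of DRAGEN.

```python
# Genotype -> interpretation, copied from the table above.
HBA_INTERPRETATION = {
    "aaa3.7/aa": "alpha-globin triplication",
    "aaa4.2/aa": "alpha-globin triplication",
    "aa/aa": "Normal",
    "-a3.7/aa": "Silent Carrier",
    "-a4.2/aa": "Silent Carrier",
    "--/aaa3.7": "Carrier",
    "--/aaa4.2": "Carrier",
    "-a3.7/-a3.7": "Carrier",
    "-a4.2/-a4.2": "Carrier",
    "-a3.7/-a4.2": "Carrier",
    "--/aa": "Carrier",
    "--/-a3.7": "HbH",
    "--/-a4.2": "HbH",
    "--/--": "Hb Bart's",
}


def interpret_hba_genotype(genotype):
    """Return the interpretation, or None for a no call or unknown string."""
    return HBA_INTERPRETATION.get(genotype)
```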
18 small variants are detected from the read alignments. These variants occur in homologous regions of HBA1 and HBA2 where reads mapping to either HBA1 or HBA2 are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant allele are counted and a binomial model is used to determine the likelihood for each possible variant allele copy number up to the maximum possible as determined from the HBA1/HBA2 genotyping.
The HBA Caller generates its output in the targeted caller output file <output-file-prefix>.targeted.json, which also contains calls from other targets (see Targeted JSON File).
genotype
The HBA genotype.
string
genotypeFilter
The HBA genotype filter.
string, [PASS, HBALowGQ, HBALowPValue, No_call]
genotypeQual
The HBA Phred genotype quality.
double
minPValue
The minimum copy number p-value of regions used to determine copy number genotype of the HBA locus.
double
variants
List of detected homology region variants in HBA1/HBA2.
Array of variants
Each variant reported in the variants array has the fields below.
alleleId
HGVS identifier of the variant allele
string
alleleCopyNumber
Copy number of the allele in the called genotype
nonnegative integer
genotypeQuality
Phred-scaled quality for the called genotype
nonnegative integer
filter
Filter for the called genotype
string. "PASS" when not filtered
Structural variant and homology region variants are reported in VCF format. See Targeted VCF File for details about how these variants are reported in VCF.
An example of the HBA caller content in the <output-file-prefix>.targeted.json output file is shown below.
The LPA Caller is capable of identifying the LPA Kringle-IV-2 (KIV-2) VNTR unit copy number from whole-genome sequencing (WGS) data. Due to high sequence similarity between the KIV-2 repeat units, a specialized caller is necessary to resolve the VNTR unit copy number.
The LPA Caller performs the following steps:
Determines total LPA KIV-2 VNTR unit copy number.
Determines the heterozygous LPA KIV-2 VNTR unit copy number if heterozygous KIV-2 markers are present.
Calls small variants in the LPA KIV-2 VNTR region based on the unit copy number along with allele counts from read information.
The LPA Caller requires WGS data aligned to a human reference genome with at least 30x coverage. Reference genome builds must be based on hg19, GRCh37, or hg38.
The first step of LPA calling is to determine the unit copy number of LPA KIV-2. Reads aligned to the LPA KIV-2 are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples.
The second step of LPA calling is to determine the heterozygous unit copy number of LPA KIV-2. Heterozygous unit copy number is determined using two specific linked SNV sites that have been identified as a combined marker allele that is always present in every copy of the repeat unit concordantly. That is, if any copy of the repeat unit in an LPA haplotype contains the ALT alleles at those two SNV sites, then every copy of the repeat unit in that LPA haplotype contains the ALT alleles at those two sites. The relative read coverage for the ALT and REF cases at these sites can therefore be used to determine the proportions of the overall copy number across the KIV-2 repeat array that belong to each haplotype.
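The proportional split described above can be sketched as follows. This is an illustrative simplification, assuming the marker read counts are already pooled across the two linked SNV sites and that the split is strictly proportional; the function name is hypothetical.

```python
def split_kiv2_copy_number(total_copy_number, ref_marker_reads, alt_marker_reads):
    """Split the total KIV-2 unit copy number between the two haplotypes.

    Returns (ref_allele_cn, alt_allele_cn). Returns (None, None) when only
    one marker type is observed, because a homozygous markers call gives
    no information for separating the haplotypes.
    """
    if ref_marker_reads == 0 or alt_marker_reads == 0:
        return None, None
    total_reads = ref_marker_reads + alt_marker_reads
    ref_cn = total_copy_number * ref_marker_reads / total_reads
    return ref_cn, total_copy_number - ref_cn
```

For a total of 40 units with 30 REF-marker reads and 10 ALT-marker reads, the split is 30 units on the REF-marker haplotype and 10 on the ALT-marker haplotype.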
Two small variants are detected from the read alignments. These variants occur in the LPA KIV-2 VNTR region, where reads mapping to any of the 6 units in the reference are used for variant calling.
For each variant, reads containing either the variant allele or the nonvariant allele are counted and a binomial model is used to determine the likelihood for each possible variant allele copy number up to the maximum possible as determined from the LPA KIV-2 VNTR unit copy number.
The LPA Caller generates a <output-file-prefix>.targeted.json file in the output directory. The output file is a JSON formatted file containing the fields below.
sample
The sample name.
string
dragenVersion
The version of DRAGEN.
string
lpa
The LPA targeted caller specific fields.
dictionary
The lpa fields are defined as below.
kiv2CopyNumber
Total KIV-2 unit copy number
float
refMarkerAlleleCopyNumber
Null for a homozygous REF/ALT markers call. For a heterozygous markers call, a float that stores the KIV-2 unit copy number of the allele having REF markers.
float, null
altMarkerAlleleCopyNumber
Null for a homozygous REF/ALT markers call. For a heterozygous markers call, a float that stores the KIV-2 unit copy number of the allele having ALT markers.
float, null
type
"Heterozygous markers call" if both REF and ALT markers are observed; "Homozygous REF markers call" if only REF markers are observed; "Homozygous ALT markers call" if only ALT markers are observed.
string, "Heterozygous markers call", "Homozygous REF markers call", "Homozygous ALT markers call"
variants
List of known variants that were detected in the KIV-2 region.
list of variants
For the variants list, the fields are defined as below.
hgvs
HGVS identifier of the variant
string
qual
Phred QUAL score of the variant
double
altCopyNumber
Copy number of the ALT variant
double
altCopyNumberQuality
Phred QUAL copy number of the ALT variant
double
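The fields defined above can be read with any JSON library. The following is a minimal sketch using the field names from the tables above; the function name summarize_lpa and the shape of its return value are illustrative choices, not a DRAGEN interface.

```python
import json


def summarize_lpa(targeted_json_text):
    """Extract the main LPA results from targeted JSON text."""
    doc = json.loads(targeted_json_text)
    lpa = doc["lpa"]
    summary = {
        "sample": doc.get("sample"),
        "kiv2CopyNumber": lpa["kiv2CopyNumber"],
        "type": lpa["type"],
        # Per-allele copy numbers are null unless the markers are heterozygous.
        "alleleCopyNumbers": None,
    }
    if lpa["type"] == "Heterozygous markers call":
        summary["alleleCopyNumbers"] = (
            lpa["refMarkerAlleleCopyNumber"],
            lpa["altMarkerAlleleCopyNumber"],
        )
    summary["variants"] = [
        (v["hgvs"], v["altCopyNumber"]) for v in lpa.get("variants", [])
    ]
    return summary
```

Checking the type field before reading the per-allele values avoids treating a null from a homozygous markers call as a copy number.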
The LPA Caller also generates a <output-file-prefix>.targeted.vcf[.gz] file in the output directory. The output file is in VCFv4.2 format and might be compressed.
Examples of the LPA Caller content in the output json file are shown below.
The following are example output files:
The GBA Caller is capable of detecting both recombinant-like and nonrecombinant-like variants in the GBA gene from whole-genome sequencing (WGS) data. Disruption of all copies of the GBA gene in an individual causes the autosomal recessive disorder Gaucher disease, and carriers are at increased risk of Parkinson's disease and Lewy body dementia. Due to high sequence similarity with its pseudogene paralog GBAP1, calling recombinant-like variants in GBA requires a specialized caller.
To enable the GBA Caller, use --enable-gba=true
as part of a germline-only WGS analysis workflow. The GBA Caller is disabled by default and requires WGS data aligned to a human reference genome with at least 30x coverage.
The GBA Caller performs the following steps:
Determine the total combined GBA and GBAP1 copy number
Detect nonrecombinant-like variants from a set of 111 known variants
Assemble phased haplotypes in the exon 9-11 region where recombinant variants occur
Detect any GBAP1 -> GBA breakpoints that are consistent with one of the 7 known recombinant-like variants
A 10 kb region of unique sequence in between GBA and GBAP1 is used to compute the copy number change due to reciprocal recombination events. Reads that align to this 10 kb region are counted and the count is normalized to a diploid baseline derived from 3000 preselected 2 kb regions across the genome. The 3000 normalization regions are randomly selected from the portion of the reference genome that has stable coverage across population samples. The total combined GBA and GBAP1 copy number is then calculated as two more than the copy number of this 10 kb region.
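The arithmetic in the paragraph above can be sketched as follows. This is an illustration under stated assumptions, not the DRAGEN implementation: the function name is hypothetical, the median is assumed as the summary of the normalization regions, and the normalized depth is assumed to be rounded to the nearest integer.

```python
from statistics import median


def total_gba_gbap1_copy_number(spacer_reads, baseline_reads,
                                spacer_bp=10_000, baseline_bp=2_000):
    """Estimate the combined GBA + GBAP1 copy number from read counts.

    spacer_reads:   reads aligned to the unique 10 kb region between GBA and GBAP1
    baseline_reads: read counts for the 2 kb normalization regions (assumed diploid)
    """
    # Convert counts to reads per base so regions of different size are comparable.
    spacer_density = spacer_reads / spacer_bp
    diploid_density = median(baseline_reads) / baseline_bp  # median is an assumption
    # A region at the diploid baseline density has copy number 2.
    spacer_copy_number = round(2 * spacer_density / diploid_density)
    # The total is two more than the copy number of the 10 kb region.
    return spacer_copy_number + 2
```

For example, a 10 kb region at the diploid baseline density gives a total of 4, and one at half that density gives a total of 3.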
Of the known nonrecombinant-like variants, some are in unique (nonhomologous) regions of GBA with high mapping quality. Only reads mapping to GBA are used for calling variants in nonhomologous regions. The other variants occur in homologous regions of GBA/GBAP1, where reads mapping to either GBA or GBAP1 are used for variant calling.
For each variant, reads containing the variant allele and the nonvariant alleles are counted. A binomial model that incorporates the sequencing error rate is then used to determine the most likely variant allele copy number (0 for nonvariant).
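A binomial model of this kind can be sketched as below. The function name, the default error rate, and the way the error rate enters the allele fraction are assumptions made for illustration. The source does not specify the exact model.

```python
from math import log


def most_likely_allele_copy_number(variant_reads, other_reads, total_copy_number,
                                   error_rate=0.01):
    """Return the variant allele copy number with the highest binomial likelihood.

    total_copy_number is the number of gene copies the reads are drawn from
    (must be at least 1); error_rate is an assumed per-read miscall probability
    strictly between 0 and 1.
    """
    best_cn, best_log_likelihood = 0, float("-inf")
    for cn in range(total_copy_number + 1):
        fraction = cn / total_copy_number
        # Probability that a read shows the variant allele, allowing for errors.
        p = fraction * (1 - error_rate) + (1 - fraction) * error_rate
        # The binomial coefficient is the same for every hypothesis, so it is omitted.
        log_likelihood = variant_reads * log(p) + other_reads * log(1 - p)
        if log_likelihood > best_log_likelihood:
            best_cn, best_log_likelihood = cn, log_likelihood
    return best_cn
```

With four gene copies in total, 10 variant reads out of 40 favor a copy number of 1, and no variant reads favor 0 (nonvariant).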
For a list of the supported nonrecombinant-like variants, refer to the targeted/gba/target_variants_*.tsv
files located in the resources
directory of the DRAGEN install location.
A collection of 10 differentiating sites in the exon 9-11 region of GBA are used to detect the GBA and GBAP1 haplotypes present in the sample. An iterative phasing algorithm is used to build up haplotypes that are supported by the read data. The phasing algorithm starts with seed sites which are then iteratively extended to neighboring sites. At each iteration, reads that can be unambiguously assigned to one of the detected partial haplotypes are used to extend the next neighboring site for each partial haplotype. Iteration continues until all sites have been extended. Some haplotypes may have sites that are unresolved (i.e. ambiguous), but these haplotypes can still participate in GBA -> GBAP1 breakpoint detection.
If any of the 10 differentiating sites in exon 9-11 indicates that there are no wild-type GBA allele copies, then the sample is called as homozygous variant and the recombinant-like variant that best matches the depth calls at the 10 sites is reported.
When the sample is not homozygous variant, the phased haplotypes are used to detect heterozygous variants. The detected haplotypes are compared against a set of 7 known recombinant-like variants: A495P, L483P, D448H, c.1263del, RecNciI, RecTL, and c.1263del+RecTL. Whenever a detected haplotype has a GBA->GBAP1 or GBAP1->GBA transition that is consistent with one of these 7 known recombinant-like variants, the transition is considered a candidate breakpoint for calling that recombinant-like variant. Reads containing phasing information for the two sites flanking each candidate breakpoint are used for variant calling. When the read data supports the hypothesis that the sample contains at least one copy of a candidate breakpoint, the associated haplotype is a recombinant haplotype candidate. Recombinant haplotype candidates are sorted by likelihood and the number of variant sites. If no wild-type haplotype was detected, DRAGEN reports any detected homozygous recombinant haplotype, or up to two different recombinant haplotypes (i.e. compound het) if detected. If any wild-type haplotype was found, DRAGEN reports a maximum of one recombinant haplotype. When no recombinant haplotypes are detected, two wild-type haplotypes are reported.
The caller can detect the following recombinant variant haplotypes: A495P, L483P, D448H, c.1263del, RecNciI, RecTL, and c.1263del+RecTL. Note: RecNciI, RecTL, and c.1263del+RecTL may be deletion-like recombinant variants. A deletion-like recombinant variant haplotype (as opposed to a gene conversion-like recombinant variant haplotype) is defined as a haplotype with one or fewer switch sites (transitions from a GBAP1 allele to a GBA allele).
The table below shows the HGVS identifiers associated with each recombinant variant haplotype.
A495P
NM_000157.4:c.1483G>C
L483P
NM_000157.4:c.1448T>C
D448H
NM_000157.4:c.1342G>C
c.1263del
NM_000157.4:c.1265_1319del
RecNciI
NM_000157.4:c.1483G>C, NM_000157.4:c.1448T>C
RecTL
NM_000157.4:c.1483G>C, NM_000157.4:c.1448T>C, NM_000157.4:c.1342G>C
c.1263del+RecTL
NM_000157.4:c.1483G>C, NM_000157.4:c.1448T>C, NM_000157.4:c.1342G>C, NM_000157.4:c.1265_1319del
The GBA Caller generates its output in the targeted caller output file <output-file-prefix>.targeted.json
that also contains calls from other targets (see Targeted JSON File).
totalCopyNumber
Total copy number of all GBA and GBAP1 genes including hybrids
nonnegative integer
deletionBreakpointInGene
true if totalCopyNumber <= 3 and a deletion-like recombinant variant haplotype is detected. false if totalCopyNumber <= 3 and no deletion-like recombinant variant is detected. null (i.e. unknown) if totalCopyNumber > 3.
true, false, null
recombinantHaplotypes
List of detected haplotypes arising from nonallelic homologous recombination variant calling
Array of two strings. Each string consists of all associated allele IDs (if any) within the haplotype. Consecutive IDs in the same haplotype are separated by a '+'.
variants
List of single site, nonrecombinant-like variants (i.e. not arising from nonallelic homologous recombination). An empty list if no variants are detected.
Array of nonrecombinant-like variants.
Each nonrecombinant-like variant reported in the variants
array will have the fields below.
alleleId
HGVS identifier of the variant allele
string
alleleCopyNumber
Copy number of the allele in the called genotype
nonnegative integer
genotypeQuality
Phred-scaled quality for the called genotype
nonnegative integer
filter
Filter for the called genotype
string. "PASS" when not filtered
Recombinant-like and nonrecombinant-like variants are reported in VCF format. See Targeted VCF File for details about how these variants are reported in VCF.
An example of the GBA caller content in the <output-file-prefix>.targeted.json
output file is shown below.
You can enable de novo structural variant quality scoring in DRAGEN.
To enable de novo scoring for structural variant joint diploid calling, set --sv-denovo-scoring
to true. To adjust the threshold value for which variants are classified as de novo, use the --sv-denovo-threshold
command line option. See DN Field for more information.
De novo scoring requires the following two files:
A pedigree file that specifies the relationship of all samples in the pedigree.
The VCF output from germline structural variant calling analysis run jointly over all samples in the pedigree.
The pedigree file is required for de novo scoring. Use the same file format as required for joint small variant calling analysis and de novo scoring. For information on the file format, see Small Variant De Novo Calling. The file specifies which sample in the trio is the proband, mother, or father. If there are multiple trios specified in the pedigree file (eg, multigeneration pedigree or siblings), DRAGEN automatically detects the trios and provides the de novo scores on the proband sample of each detected trio.
DRAGEN applies de novo scoring to the VCF output from germline structural variant analysis for all samples specified in the pedigree file. You can supply the VCF file directly using the command line or produce the file as part of the DRAGEN run where de novo scoring is enabled.
De novo scoring adds the de novo quality score (DQ
) and de novo call (DN
) fields for each sample in the output VCF file.
The DQ
field is defined as follows.
The DQ
field represents the Phred-scaled posterior probability of the variant being de novo in the proband. For example, DQ scores of 13 and 20 correspond to a posterior probability of a de novo variant of 0.95 and 0.99. If DRAGEN can calculate the DQ score, the score is added to the proband samples. If the DQ score cannot be calculated, the field is set to ".".
The DN
field is defined as follows.
DRAGEN compares valid (> 0) DQ scores against a threshold value. You can set the threshold value using the --sv-denovo-threshold
command line option. For example, to set the threshold value to 10, add --sv-denovo-threshold 10
to the command line. The default threshold value is 20.
If a DQ score is greater than or equal to the threshold value, the DN
field is set to DeNovo
. If the DQ score is below the threshold value, the DN
field is set to LowDQ
. If the DQ is 0 or ".", the DQ score is invalid and the DN
field is set to ".".
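The DQ and DN rules above can be sketched in a few lines of Python. The function names are hypothetical, a missing score (".") is represented as `None`, and treating negative scores as invalid is an assumption. The default threshold of 20 and the DeNovo / LowDQ / "." outcomes follow the text.

```python
def dq_to_probability(dq):
    """Convert a Phred-scaled DQ score to the posterior probability of a de novo variant."""
    return 1 - 10 ** (-dq / 10)


def dn_field(dq, threshold=20.0):
    """Classify a DQ score the way the DN field is described above.

    dq is None when the VCF value is "." (score could not be calculated).
    """
    if dq is None or dq <= 0:
        return "."  # invalid or missing DQ score
    return "DeNovo" if dq >= threshold else "LowDQ"
```

For instance, `dn_field(15)` gives `LowDQ` with the default threshold, while `dn_field(15, threshold=10)` gives `DeNovo`, mirroring the effect of `--sv-denovo-threshold 10`.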
You can use de novo structural variant scoring in the following workflows.
Perform de novo scoring in two DRAGEN runs. In the first, run germline structural variant analysis jointly over all samples in the pedigree file. In the second, apply de novo structural variant scoring to the joint germline VCF output. See Two-Run Workflow.
Perform de novo scoring in one DRAGEN run. Run germline structural variant analysis jointly over all samples in the pedigree file, and then apply de novo scoring to the joint germline structural variant calls. See One-Run Workflow.
In the two-run workflow, first run a standard DRAGEN joint germline analysis over multiple samples as shown in the following example.
In the second run, use the VCF output (<OUT_DIR1>/<PREFIX1>.sv.vcf.gz
) as input for de novo scoring. You can provide the VCF input using the --variant
option. The following command line provides an example of the second run.
The resulting output VCF file (<OUT_DIR2>/<PREFIX2>.sv.vcf.gz
) includes all de novo scoring annotations.
Run a standard DRAGEN joint germline analysis over multiple samples with all required de novo scoring options. The following example shows the one-run workflow.
The resulting output VCF file (<OUT_DIR>/<PREFIX>.sv.vcf.gz
) includes all de novo scoring annotations.
Disruption of all copies of the SMN1 gene in an individual causes spinal muscular atrophy (SMA). SMN1 has a high-identity paralog, SMN2, which differs from SMN1 in only approximately 10 SNVs and small indels. For example, hg19 chr5:70247773 C->T affects splicing and largely disrupts the production of functional SMN protein from SMN2. Due to the high-similarity duplication combined with common copy number variation, standard whole-genome sequencing (WGS) analysis does not produce complete variant calling results for SMN. Because 95% of SMA cases result from the absence of the functional C (SMN1) allele in any copy of SMN¹, a targeted calling solution can be effective in detecting SMA.
DRAGEN offers the following two independent components that can call the SMN1 copy number using WGS data from a germline sample.
ExpansionHunter
SMN Caller
SMA calling is implemented together with repeat expansion detection using sequence-graph realignment to align reads to a single reference that represents SMN1 and SMN2.
In addition to the standard diploid genotype call, SMA Calling with ExpansionHunter uses a direct statistical test to check for presence of any C allele. If a C allele is not detected, the sample is called affected, otherwise unaffected.
SMA calling is only supported for human whole-genome sequencing with PCR-free libraries.
To enable SMA calling along with repeat expansion detection, set the --repeat-genotype-enable
option to true
. For information on graph-alignment options, see Repeat Expansion Detection with ExpansionHunter.
To activate SMA calling, the variant specification catalog file must include a description of the targeted SMN1/SMN2 variant. The <INSTALL_PATH>/resources/repeat-specs/experimental
folder contains example files.
The <output-file-prefix>.repeat.vcf
file includes SMN output along with any targeted repeats. SMN output is represented as a single SNV call at the splice-affecting position in SMN1 with SMA status in the following custom fields.
VARID
SMN marks the SMN call.
GT
Genotype call at this position using a normal (diploid) genotype model.
DST
SMA status call: + indicates detected, - indicates undetected, ? indicates undetermined.
AD
Total read counts supporting the C and T alleles.
RPL
Log10 likelihood ratio between the unaffected and affected models. Positive scores indicate the unaffected model is more likely.
The SMN Caller calls SMN1 and SMN2 copy numbers and detects the presence of a SNP, NM_000344.4:c.*3+80T>G
that is associated with the two-copy SMN1 allele. The caller is derived from the method implemented in Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data.²
To enable the SMN Caller, use --enable-smn=true
as part of a germline-only WGS analysis workflow. It can also be enabled along with other targets from the targeted caller by using the option --enable-targeted=true
. The SMN Caller is disabled by default.
The SMN Caller performs the following steps:
Determines total and intact SMN copy numbers
Calls SMN1 copy number at eight differentiating sites
Determines copy number for NM_000344.4:c.*3+80T>G
The SMN Caller requires WGS data aligned to a human reference genome with at least 30x coverage.
Two common copy-number variants (CNVs) in SMN1 and SMN2 include whole gene CNV and a partial gene deletion of exons 7 and 8. Reads that align to either SMN1 or SMN2 are counted. The read counts in exon 1 through exon 6 are used to determine total SMN copy number. The read counts in exon 7 and 8 are used to determine the SMN copies that do not have the exon 7 and 8 deletion (intact SMN copy number). To estimate the SMN copy number for these two regions, read counts are normalized to a diploid baseline derived from 3000 preselected 2 kb regions across the genome. The 3000 normalization regions are randomly selected from the portion of the reference genome that has stable coverage across population samples. The SMN Caller then calculates the number of SMN copies that have the exon 7 and 8 deletion by subtracting the intact SMN copy number from the total SMN copy number.
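The copy number bookkeeping described above can be sketched as follows. The function and key names are hypothetical, and rounding the normalized depth to the nearest integer is an assumption. Only the subtraction (total minus intact) is taken directly from the text.

```python
def smn_copy_numbers(exon1_6_depth, exon7_8_depth, diploid_depth):
    """Derive total, intact, and exon 7-8 deleted SMN copy numbers.

    All three depths must be in the same units (for example, reads per base),
    with diploid_depth taken from the genome-wide normalization regions.
    """
    # A region at the diploid baseline depth corresponds to copy number 2.
    total = round(2 * exon1_6_depth / diploid_depth)   # exons 1 to 6
    intact = round(2 * exon7_8_depth / diploid_depth)  # exons 7 and 8
    return {
        "total": total,
        "intact": intact,
        # Copies carrying the exon 7 and 8 deletion: total minus intact.
        "delta78": total - intact,
    }
```

A sample with depth 60 over exons 1 to 6, depth 45 over exons 7 and 8, and a diploid baseline of 30 would give 4 total copies, 3 intact copies, and 1 copy with the deletion.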
To calculate the SMN1 copy number, the caller uses eight predefined differentiating sites in exons 7 and 8 of SMN1 and SMN2. One of these sites is the splice site variant used for SMA calling with ExpansionHunter (see SMA Calling With ExpansionHunter). The caller selects differentiating sites at positions that have sequence differences between SMN1 and SMN2 where calling the SMN1 copy number is most likely to be correct based on sequencing data from the 1000 Genomes Project.
For each differentiating site, the SMN1-specific and SMN2-specific alleles are counted in reads mapping to either SMN1 or the homologous region in SMN2. The caller uses a binomial model to calculate the likelihood of each possible SMN1 copy number from the two gene-specific counts given the intact SMN copy number calculated in the previous step.
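A per-site likelihood calculation of this kind might look like the sketch below. The function name, the default error rate, and the error handling are assumptions. The source states only that a binomial model is applied to the two gene-specific counts given the intact SMN copy number, and does not describe how the eight sites are combined.

```python
from math import comb


def smn1_copy_number_likelihoods(smn1_reads, smn2_reads, intact_copy_number,
                                 error_rate=0.01):
    """Binomial likelihood of each SMN1 copy number at one differentiating site.

    Returns a dict mapping each candidate SMN1 copy number
    (0..intact_copy_number) to its likelihood. Intended for the modest read
    counts seen at a single site.
    """
    if intact_copy_number < 1:
        raise ValueError("intact_copy_number must be at least 1")
    n = smn1_reads + smn2_reads
    likelihoods = {}
    for smn1_cn in range(intact_copy_number + 1):
        fraction = smn1_cn / intact_copy_number
        # Expected fraction of reads carrying the SMN1-specific allele,
        # allowing for an assumed per-read error rate.
        p = fraction * (1 - error_rate) + (1 - fraction) * error_rate
        likelihoods[smn1_cn] = comb(n, smn1_reads) * p ** smn1_reads * (1 - p) ** smn2_reads
    return likelihoods
```

The most likely copy number at the site is then `max(likelihoods, key=likelihoods.get)`.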
NM_000344.4:c.*3+80T>G
The SNP NM_000344.4:c.*3+80T>G (also referred to as g.27134T>G) has been reported in the literature to be associated with the two-copy SMN1 allele.
For this high-homology region SNP, reads mapping to either SMN1 or SMN2 are used for variant calling. Reads containing the variant allele and reads containing the nonvariant allele are counted, and then a binomial model that incorporates the sequencing error rate is used to determine the most likely variant allele copy number (0 for nonvariant).
The SMN Caller prints out its calls in the targeted caller output file, <output-file-prefix>.targeted.json
that also contains calls from other targets (see Targeted JSON File). An example of the SMN caller content in this file is shown below.
For the SMN Caller, the fields are defined as follows.
smn1CopyNumber
Copy number of intact SMN1
nonnegative integer or null
smn2CopyNumber
Copy number of intact SMN2
nonnegative integer or null
smn2Delta78CopyNumber
Copy number of SMN2Δ7–8 (deletion of exon 7 and 8)
nonnegative integer
totalCopyNumber
Raw normalized depth of total SMN (exons 1 to 6)
nonnegative floating point number
fullLengthCopyNumber
Raw normalized depth of intact SMN (exons 7 & 8)
nonnegative floating point number
variants
A JSON array containing information about specific SMN variants
JSON array
Each variant reported in the variants
array will have the fields below.
hgvs
HGVS id of the variant being reported
string
qual
Phred quality that at least one copy of the variant allele is found
nonnegative floating point number
altCopyNumber
Detected copy number of the variant allele
nonnegative floating point number
altCopyNumberQuality
Phred quality of the detected copy number
nonnegative floating point number
The variant NM_000344.4:c.*3+80T>G
is also reported in a <output-file-prefix>.targeted.vcf[.gz]
file in the output directory. The output file is a VCFv4.2
formatted file, possibly compressed. The variant is reported with the VARIANT_IN_HOMOLOGY_REGION
flag in the INFO
field and also filtered with the TargetedRepeatConflict
filter. This variant lies in a region of homology between SMN1 and SMN2, so it is reported twice, once for each of the SMN1 and SMN2 regions, and the two records are connected by the same EVENT
in the INFO
field. The ploidy of the variant is reported in concordance with the identified genotype.
An example of the vcf entry for the variant NM_000344.4:c.*3+80T>G is as follows.
The variant NM_000344.4:c.*3+80T>G in the <output-file-prefix>.targeted.vcf[.gz]
file can also be included in the <output-file-prefix>.hard-filtered.vcf[.gz]
by including smn
in the --targeted-merge-vc
list, i.e. --targeted-merge-vc smn
. The output file <output-file-prefix>.targeted.vcf[.gz]
is compressed by default. This option can be disabled using --enable-vcf-compression=false
.
¹Wirth B. An update of the mutation spectrum of the survival motor neuron gene (SMN1) in autosomal recessive spinal muscular atrophy (SMA). Human Mutation. 2000;15(3):228-237. doi:10.1002/(SICI)1098-1004(200003)15:3<228::AID-HUMU3>3.0.CO;2-9
²Chen X, Sanchis-Juan A, French CE, et al. Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data. Genetics in Medicine. 2020;22(5):945-953. doi:10.1038/s41436-020-0754-0
The Rh Caller is capable of identifying a common gene conversion between the RHD and RHCE genes from whole-genome sequencing (WGS) data, referred to as the RHCE Exon2 gene conversion. Due to high sequence similarity between the genes, a specialized caller is necessary to resolve the gene conversion between the pair of genes. The caller considers 798 loci, called differentiating sites, that represent differences between the RHD and RHCE genes and are well preserved in the population.
The Rh Caller performs the following steps:
Determines total copy number from read depth of the RHD and RHCE regions.
Detects RHD -> RHCE breakpoints that are consistent with the RHCE Exon2 gene conversion.
The Rh Caller requires WGS data aligned to a human reference genome with at least 30x coverage. Reference genome builds must be based on hg19
, GRCh37
, or hg38
.
The Rh Caller is run by default when the small variant caller is enabled, the sample is not a tumor sample, and the sample is detected as WGS by the Ploidy Estimator.
The first step of Rh calling is to determine the copy number of the RHD and RHCE regions. Reads aligned to the RHD and RHCE regions are counted according to their support of the differentiating sites. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples.
A collection of 4 differentiating sites in the exon 2 region of RHD and RHCE are used to detect the presence of the RHCE Exon2 gene conversion in the sample. An iterative phasing algorithm is used to build up haplotypes that are supported by the read data. The phasing algorithm starts with candidate haplotypes formed from all possible bases at the first differentiating site. The haplotypes are then extended at the next differentiating site by considering all reads that can be uniquely assigned to a single candidate haplotype. If these reads support only a single base at the next differentiating site for a given candidate haplotype, then the haplotype is extended with that base. When a candidate haplotype can be extended by both bases at the next differentiating site then both possible extended haplotypes are included in the set of candidate haplotypes, growing the set by 1. Subsequent extension steps are performed at neighboring differentiating sites until all sites have been processed. Some haplotypes may have sites that are unresolved (i.e. ambiguous), but these haplotypes can still participate in RHD -> RHCE breakpoint detection.
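The iterative extension described above can be sketched as follows. This is a simplified illustration, not the DRAGEN algorithm. Reads are modeled as dictionaries mapping a site index to the observed base label, seeds are restricted to bases actually observed at the first site, and a site with no uniquely assignable reads is recorded as `None` (unresolved). All names are hypothetical.

```python
def is_compatible(read, haplotype):
    """True if the read overlaps the haplotype and agrees at every shared resolved site."""
    overlaps = False
    for site, base in read.items():
        if site < len(haplotype) and haplotype[site] is not None:
            if haplotype[site] != base:
                return False
            overlaps = True
    return overlaps


def phase_haplotypes(reads, n_sites):
    """Build candidate haplotypes site by site from reads spanning differentiating sites."""
    # Seed with the bases observed at the first site (a simplification).
    candidates = [(base,) for base in sorted({r[0] for r in reads if 0 in r})]
    for site in range(1, n_sites):
        extended = []
        for hap in candidates:
            observed = set()
            for read in reads:
                if site not in read:
                    continue
                # Only reads assignable to exactly one candidate may extend it.
                matches = [h for h in candidates if is_compatible(read, h)]
                if matches == [hap]:
                    observed.add(read[site])
            if observed:
                # One base extends the haplotype; two bases branch it into two.
                extended.extend(hap + (base,) for base in sorted(observed))
            else:
                extended.append(hap + (None,))  # unresolved site
        candidates = extended
    return candidates
```

In this sketch a read that is compatible with two candidates contributes to neither, which is why reads that span several differentiating sites are the most informative.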
When the phased haplotypes support the RHCE Exon2 gene conversion, the caller visits all the differentiating sites and reports them as variants in the output VCF file, with ploidy identified using the copy number estimated from the read depth of the differentiating site.
The Rh Caller generates a <output-file-prefix>.targeted.json
file in the output directory. The output file is a JSON formatted file containing the fields below.
sample
The sample name.
string
dragenVersion
The version of DRAGEN.
string
rh
The RH targeted caller specific fields.
dictionary
The rh
fields are defined as below.
totalCopyNumber
Total RHD/RHCE copy number
integer
rhdCopyNumber
RHD gene copy number
integer
rhceCopyNumber
RHCE copy number
integer
variants
List of known variants from recombination that were detected in RHD/RHCE.
list of variants
For the variants
the fields are defined as below.
hgvs
HGVS identifier of the variant
string, "NC_000001.11g.25405596_25409676con25283766_25287797"
qual
Phred QUAL score of the variant
double
altCopyNumber
Copy number of the ALT variant
double
altCopyNumberQuality
Phred QUAL copy number of the ALT variant
double
Examples of the Rh Caller content in the output json file are shown below.
The Rh Caller also generates a <output-file-prefix>.targeted.vcf[.gz]
file in the output directory. The output file is a VCFv4.2
formatted file, possibly compressed.
The following are example output files:
To detect somatic copy number aberrations and regions with loss of heterozygosity, run the DRAGEN CNV Caller on a tumor sample with a VCF that contains germline SNVs. The output file is a VCF file. Components of the germline CNV caller are reused in the somatic algorithm with the addition of a somatic modeling component, which estimates tumor purity and ploidy.
The germline SNVs are used to compute B-allele ratios in the tumor, which allows for allele-specific copy number calling on the tumor sample. Where possible, use of the small-variant VCF from a matched normal sample is preferred (tumor-normal mode) for best results, but a catalog of population SNPs can be used when a matched normal sample is not available (tumor-only mode).
When a matched normal sample is available, the sample should first be processed using the germline small variant caller. In this case, only germline-heterozygous SNV sites are used for determining B-allele ratios. If no matched normal is available, population SNP B-allele ratios are computed as for matched normal heterozygous loci, but are treated as variants of unknown germline genotype; possible genotype assignments are statistically integrated to determine allele-specific copy number.
In matched normal mode, a VCF containing germline copy number changes for the individual may optionally be input. This makes sure that germline CNVs are output as separate segments in the somatic whole-genome sequencing (WGS) CNV VCF, and annotated with the germline copy number so that it is clear whether there are specifically-somatic copy number changes in the region.
You can use the following somatic WGS CNV calling command-line options:
--tumor-fastq1, --tumor-fastq2, --tumor-bam-input, --tumor-cram-input
Specify a tumor input file.
--cnv-normal-b-allele-vcf
--cnv-population-b-allele-vcf
--cnv-use-somatic-vc-baf
--sample-sex
If known, specify the sex of the sample. If the sample sex is not specified, the caller attempts to estimate the sample sex from tumor alignments.
--cnv-normal-cnv-vcf
--cnv-use-somatic-vc-vaf
--cnv-somatic-enable-het-calling
The following is an example command line for running tumor-normal somatic WGS CNV calling with a matched normal SNV VCF.
If a matched normal is not available, you must disable CNV calling or run in tumor-only mode. Running with a mismatched normal in tumor-normal mode yields unexpected results. The following example command line runs tumor-only somatic WGS CNV calling with a population SNV VCF.
The following example command line runs tumor-normal somatic WGS CNV calling concurrently with the Somatic SNV Caller, which allows you to use the matched normal germline heterozygous sites directly from the SNV Caller with the option --cnv-use-somatic-vc-baf true
.
You can enable additional features when a matched normal sample and the outputs from DRAGEN Germline analysis are also available. If a matched normal sample is available, enable germline-aware mode and VAF-aware mode using the following example command line. For more information on germline-aware mode and VAF-aware mode, see Germline-aware Mode and VAF-aware Mode.
The target counting stage and its output are the same as for the germline CNV calling case. The target intervals with the read counts are output in a *.target.counts.gz
file. If there is insufficient read depth coverage detected, processing will halt. For low depth tumor samples, the value of --cnv-interval-width
can be increased to capture more alignments. The B-allele counting occurs in parallel with the read counting phase, and the values are output in a *.baf.bedgraph.gz
file. This file can be loaded into IGV along with other bigwig files generated by DRAGEN for visualization. See Output Files for more details on output files.
The Somatic WGS CNV Caller requires a source of heterozygous SNP sites to measure B-allele counts of the tumor sample. The following are the available modes.
cnv-normal-b-allele-vcf
Specify a matched normal SNV VCF. Use when a matched normal sample and the matched normal SNV VCF are available. To use this option, you must run the matched normal sample through the DRAGEN Germline workflow.
cnv-population-b-allele-vcf
Specify a population SNP VCF. Use when a matched normal sample is not available and analysis must be performed in tumor-only mode.
cnv-use-somatic-vc-baf
Set to true
to enable DRAGEN to identify germline variants during a tumor/matched-normal run, rather than requiring a separate run on the normal sample. Use if and only if tumor and matched normal input are available. Also enable the Somatic SNV Caller via enable-variant-caller
to use this option.
To specify a matched normal sample SNV VCF, use the --cnv-normal-b-allele-vcf
option. The VCF file should come from processing the matched normal sample through the DRAGEN germline small variant caller with filters applied. Typically, this file name has a *.hard-filtered.vcf.gz
extension. All records marked as PASS that are determined to be heterozygous in the normal sample are used to measure the b-allele counts of the tumor sample. You can also use the equivalent gVCF file (*.hard-filtered.gvcf.gz
), but the processing time is significantly longer due to the number of records, most of which are not heterozygous sites.
To specify a population SNP VCF, use the --cnv-population-b-allele-vcf
option. To obtain a population SNP VCF, process an appropriate catalog of population variation, such as from dbSNP, the 1000 Genomes Project, or other large cohort discovery efforts. A suitable example file for this parameter is "1000G_phase1.snps.high_confidence.vcf.gz" from the GATK resource bundle. Only high-frequency SNPs should be included. For example, include SNPs with minor allele population frequency ≥ 10% to limit run time impact and reduce artifacts. Specify the ALT allele frequency by adding AF=<alt frequency>
to the INFO
section of each record. Additional INFO
fields might be present, but DRAGEN only parses and uses the AF field. Sites specified with --cnv-population-b-allele-vcf
can be either heterozygous or homozygous in the germline genome from which the tumor genome derives.
The following is an example valid population SNP record:
DRAGEN considers the following requirements when parsing records from the b-allele VCF:
Only simple SNV sites.
Records must be marked PASS
in the FILTER
field.
If there are records with the same CHROM
and POS
values in the VCF
, then DRAGEN uses the first record that occurs.
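The record requirements above can be approximated with a small pre-filter, shown here as a sketch. It is not DRAGEN's parser. The function name is hypothetical, "simple SNV" is assumed to mean a single-base REF and a single-base ALT, and duplicates are resolved by keeping the first record that passes the other checks, which is one possible reading of the rule.

```python
BASES = {"A", "C", "G", "T"}


def select_b_allele_records(vcf_lines):
    """Return (chrom, pos, ref, alt, af) tuples for records usable as b-allele sites."""
    seen = set()
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, filt, info = fields[:8]
        if filt != "PASS":
            continue  # records must be marked PASS in the FILTER field
        if ref not in BASES or alt not in BASES:
            continue  # keep only simple single-base substitutions
        if (chrom, pos) in seen:
            continue  # a record at the same CHROM and POS was already kept
        seen.add((chrom, pos))
        # Only the AF key of INFO is read; any other INFO keys are ignored.
        info_items = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        af = float(info_items["AF"]) if "AF" in info_items else None
        kept.append((chrom, int(pos), ref, alt, af))
    return kept
```

Running a population VCF through a filter like this before passing it to DRAGEN is optional. It simply makes the effect of the listed requirements visible.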
If a tumor sample and matched normal input are available, you can avoid having to separately process the matched normal with the DRAGEN germline pipeline by specifying --cnv-use-somatic-vc-baf true
. You must also enable the Somatic SNV Caller. If using this option, DRAGEN determines the germline heterozygous sites from the matched normal input and measures the b-allele counts of the tumor sample. The information is passed to the Somatic WGS CNV Caller to simplify the overall somatic workflow.
To enable --cnv-use-somatic-vc-baf
, enter the following command line options.
--tumor-bam-input <TUMOR_BAM>: Specify the tumor input.
--bam-input <NORMAL_BAM>: Specify the matched normal input.
--enable-variant-caller true: Enable the somatic SNV variant caller.
--cnv-use-somatic-vc-baf true: Enable somatic VC BAF.
To specify germline CNVs from a matched normal sample, use --cnv-normal-cnv-vcf
. When specified, CNV records marked as PASS
in the normal sample are used during tumor-sample segmentation to make sure that confident germline CNV boundaries are also boundaries in the somatic output. Segments with germline copy number changes that are relative to reference ploidy are excluded from somatic model selection. During somatic copy number calling and scoring, the germline copy number is used to modify the expected depth contribution from the normal contamination fraction of the tumor sample. The process leads to more accurate assignment of somatic copy number in regions of germline CNV. DRAGEN then annotates the somatic WGS CNV VCF entries with germline copy number (NCN) and the somatic copy number difference relative to germline (SCND) for the segments that have germline CNVs.
If both the small variant caller and the CNV caller are enabled in a tumor-matched normal run, the somatic SNV results can affect the estimated purity and ploidy of the tumor sample. The somatic SNV variant allele frequencies (VAFs) that are captured by the allele depth values from passing somatic SNVs reflect the combination of tumor purity, total tumor copy number at a somatic SNV locus, and the number of tumor copies bearing the somatic allele. Clusters of somatic SNVs with similar allele depths inform the tumor model.
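The relationship between VAF, purity, and copy number mentioned above is commonly written as a simple mixture of tumor and normal cells. The sketch below uses that textbook relationship and assumes a diploid normal component by default. It is not a description of DRAGEN's internal tumor model, and the function name is hypothetical.

```python
def expected_somatic_vaf(purity, tumor_copy_number, mutant_copies, normal_copy_number=2):
    """Expected variant allele frequency of a clonal somatic SNV.

    purity:            fraction of tumor cells in the sample (0 to 1)
    tumor_copy_number: total copy number at the locus in tumor cells
    mutant_copies:     number of tumor copies carrying the somatic allele
    """
    # Alleles contributed per average cell by the tumor and normal fractions.
    tumor_alleles = purity * tumor_copy_number
    normal_alleles = (1 - purity) * normal_copy_number
    # Only tumor cells carry the somatic allele.
    return purity * mutant_copies / (tumor_alleles + normal_alleles)
```

For example, a heterozygous somatic SNV in a copy-neutral region of a 50% pure tumor has an expected VAF of 0.25, which illustrates why clusters of SNVs with similar allele depths are informative about purity and ploidy.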
When a tumor has limited copy number variation and/or CNVs are mostly subclonal, such as in many liquid tumors, VAFs can help prevent incorrect or low-confidence estimated tumor models. Incorrect or low-confidence estimated tumor models can lead to wrong or filtered calls. VAF information can also help determine the presence or absence of a genome duplication even in samples from clonal tumors with clear CNVs.
To utilize VAF information, run somatic WGS CNV calling with small variant calling on tumor and matched-normal read alignment inputs. For example, you could use the following command line:
--enable-variant-caller=true --enable-cnv=true --tumor-bam-input <TUMOR_BAM> --bam-input <NORMAL_BAM>
For tumor/matched-normal runs with --enable-variant-caller true, VAF-based modeling is enabled by default. To disable VAF-based modeling, set --cnv-use-somatic-vc-vaf to false.
DRAGEN uses HET-calling mode for segments with a copy number that is estimated to be heterogeneous (HET) among different subclones. Based on a statistical model, a segment is considered to be heterogeneous when the depths or BAF values in a segment are too far away from what is expected for the closest integer-copy number.
To turn on HET calling, specify --cnv-somatic-enable-het-calling=true on the command line. Note that this setting is honored only when DRAGEN is able to identify a confident purity/ploidy model. When a confident model cannot be identified, the caller returns a default model and HET calling is always disabled (see the Somatic WGS CNV Model section for more details and nuances of this approach).
When a segment is considered as heterogeneous, the output for the segment is changed as follows.
The HET tag is added to the INFO field for the segment.
At least one of the CN and MCN values is given as a non-REF value. Specifically, the values are given as the integer values closest to CNF and MCNF. If the integer values would result in a REF call, then at least one of the CN and MCN values is adjusted to the closest non-REF value.
The ID, ALT, and GT fields are set appropriately for the chosen CN and MCN.
The QUAL score reflects confidence that the segment has nonreference copy number in at least a fraction of the sample.
The CNQ and MCNQ values reflect confidence that the assigned CN and MCN values are true in all of the tumor cells, so at least one of the CNQ and MCNQ values is typically less than five.
Selecting a tumor purity and diploid coverage level (ploidy) is a key component of the somatic WGS CNV caller. The somatic WGS CNV caller uses a grid-search approach that evaluates many candidate models to attempt to fit the observed read counts and b-allele counts across all segments in the tumor sample. A log likelihood score is emitted for each candidate. The log likelihood scores are output in the *.cnv.purity.coverage.models.tsv file. The somatic WGS CNV caller chooses the (purity, coverage) pair with the highest log likelihood, and then computes several measures of model confidence based on the relative likelihood of the chosen model compared to alternative models.
If the confidence in the chosen model is low, the caller returns the default model with estimated tumor purity set to NA. The default model provides an alternative methodology to identify large somatic alterations (length of at least 1 Mb): records are filtered by this model based on their segment mean value (SM). The threshold values used by the caller are estimated automatically considering the variance of the sample, with larger SM thresholds for DUPs when the variance is higher. The user can use alternative threshold values through the --cnv-filter-del-mean and --cnv-filter-dup-mean parameters. Finally, when the caller returns the default model, the fields regarding copy number states based on model estimation (i.e., CN, CNF, CNQ, MCN, MCNF, MCNQ) are omitted from the final VCF output.
To improve the accuracy of the tumor ploidy model estimation, the somatic WGS CNV caller checks whether the chosen model calls homozygous deletions in regions whose loss is likely to reduce the overall fitness of cells. Such regions are deemed "essential" and under negative selection. Recent studies have attempted to map such cell-essential genes¹.
The check on essential regions is controlled with --cnv-somatic-enable-lower-ploidy-limit (default true). Default BED files describing the essential regions are provided for hg19, GRCh37, hs37d5, and GRCh38, but a custom BED file can also be provided through the --cnv-somatic-essential-genes-bed=<BEDFILE_PATH> parameter. In that case, the feature is automatically enabled. A custom essential regions BED file must have the following format: four tab-separated columns, where the first three columns identify the coordinates of the essential region (chromosome, 0-based start, excluded end) and the fourth column is the region ID (string type). The algorithm currently uses only the first three columns. However, the fourth column can help when manually investigating which regions drove the caller's decisions on model plausibility.
If the somatic WGS CNV caller does not find any overlap between any of the homozygous deletions and any of the essential regions, the model is considered plausible and the model optimization ends. Otherwise, when at least one overlap is found, the model is declared invalid and the model search is repeated on the subset of models that support at least one copy (CN = 1) for the essential region with the lowest coverage among the regions overlapping homozygous deletions.
¹E.g., in 2015 - https://www.science.org/doi/10.1126/science.aac7041
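The overlap check between homozygous deletions and essential regions can be sketched as follows. This is a minimal illustration that assumes the four-column BED layout described above and half-open intervals; the function names and the in-memory representation of deletions are hypothetical and do not reflect DRAGEN internals.

```python
import csv
import io


def read_essential_bed(handle):
    """Parse a 4-column, tab-separated BED (chrom, 0-based start, excluded end, id)."""
    regions = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        region_id = row[3] if len(row) > 3 else ""
        regions.append((row[0], int(row[1]), int(row[2]), region_id))
    return regions


def overlapping_essential_regions(hom_dels, regions):
    """Return IDs of essential regions hit by any homozygous deletion.

    hom_dels is a list of (chrom, start, end) half-open intervals.
    """
    hits = []
    for chrom, start, end, region_id in regions:
        for d_chrom, d_start, d_end in hom_dels:
            # Half-open intervals overlap when each starts before the other ends.
            if chrom == d_chrom and start < d_end and d_start < end:
                hits.append(region_id)
                break
    return hits


bed_text = "chr1\t1000\t2000\tGENE_A\nchr2\t500\t900\tGENE_B\n"
regions = read_essential_bed(io.StringIO(bed_text))
hom_dels = [("chr1", 1500, 3000), ("chr2", 900, 1200)]
print(overlapping_essential_regions(hom_dels, regions))  # ['GENE_A']
```

In this toy input, the chr2 deletion starts exactly at the excluded end of GENE_B, so it does not count as an overlap. An empty result would correspond to a plausible model in the description above.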
The segmentation stage might produce adjacent or nearby segments that are assigned the same copy number and have similar depth and BAF data. This segmentation can result in a region with consistent true copy number being fragmented into several pieces. The fragmentation might be undesirable for downstream use of copy number estimates. Also, for some uses, it can be preferable to smooth short segments that would be assigned different copy numbers whether due to a true copy number change or an artifact. To reduce undesirable fragmentation, initial segments can be merged during a postcalling segment smoothing step.
After initial calling, segments shorter than the specified value of --cnv-filter-length are deemed negligible. Among the remaining nonnegligible segments, successive pairs are evaluated for merging. On a trial basis, the Somatic WGS CNV Caller combines two successive segments that are within --cnv-merge-distance (default value of 10000 for WGS Somatic CNV) of one another and have the same CN and MCN assignments, along with any intervening negligible segments, into a single segment that is recalled and rescored. If the merged segment receives the same CN and MCN as its constituent nonnegligible pieces with a sufficiently high quality score, the original segments are replaced with the merged segment. The merged segment might be further merged with other initial or merged segments to either side. Merging proceeds until all segment pairs that meet the criteria are merged. Note that when germline CN information is available and two segments have different germline CN, they are not merged.
The Somatic WGS CNV Caller can report the total tumor copy number by estimating tumor purity. The BAF estimations from matched normal SNVs or population SNPs allow for allele-specific copy number calling. The following table provides examples for a DUP in a reference-diploid region:
Total copy number (CN) | Minor copy number (MCN) | Allele-specific copy number
4 | 2 | 2+2
4 | 1 | 3+1
*[LOH] 4 | 0 | 4+0
*The entry represents a Loss of Heterozygosity (LOH) case. The total copy number is still considered a DUP, so the entry is annotated as GAINLOH to distinguish it from Copy Neutral LOH (CNLOH), which would be annotated as 2+0.
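The allele-specific notation in the examples above can be derived from the total and minor copy numbers. The sketch below is illustrative only: the function name is hypothetical, and it models just the two LOH labels named in the text (GAINLOH and CNLOH) for a reference-diploid region.

```python
def describe_allelic_state(cn, mcn, ref_cn=2):
    """Return the 'major+minor' string and an LOH label, if one applies."""
    if mcn < 0 or 2 * mcn > cn:
        raise ValueError("minor copy number must satisfy 0 <= MCN <= CN / 2")
    major = cn - mcn
    label = None
    if mcn == 0 and cn == ref_cn:
        label = "CNLOH"    # copy-neutral loss of heterozygosity
    elif mcn == 0 and cn > ref_cn:
        label = "GAINLOH"  # LOH with a total copy number gain
    return f"{major}+{mcn}", label


for cn, mcn in [(4, 2), (4, 1), (4, 0), (2, 0)]:
    print(cn, mcn, describe_allelic_state(cn, mcn))
```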
Repetitive regions in the human genome pose a challenge for general variant calling approaches which typically cannot make use of potentially misplaced MAPQ0 reads. Furthermore, high sequence homology of some genes with a pseudogene paralog can lead to a wide variety of common structural variants (SVs) in the population, requiring specialized targeted calling approaches. DRAGEN supports targeted calling for a number of genes/targets as described in subsequent target-specific sections.
The targeted caller can be enabled using the command line option --enable-targeted=true, or a subset of targets can be enabled by providing a space-separated list of target names. The supported target names are: cyp2b6, cyp2d6, cyp21a2, gba, hba, lpa, rh, and smn. For a list of all supported targeted caller options along with their default values, see Targeted Caller Options. The targeted caller produces a <output-file-prefix>.targeted.json file containing a summary of the variant caller results for each target. Additional details of individual variant calls are reported in VCF format in the <output-file-prefix>.targeted.vcf.gz output file.
The targeted caller requires WGS data aligned to a human reference genome with at least 30x coverage. The caller may be less reliable at lower coverage. Human reference genome builds based on hg19, hs37d5 (including GRCh37), or hg38 are supported. The targeted caller should not be enabled with low-coverage, exome, or enrichment sequencing data.
The targeted caller generates a <output-file-prefix>.targeted.json file in the output directory. The output file is a JSON-formatted file containing the fields below.
Field | Description | Type | Presence
sampleId | The sample name. | string | always
softwareVersion | The version of DRAGEN. | string | always
phenotypeDatabaseSources | Resources used for calling metabolism status (phenotype). | json array of strings | CYP2B6 or CYP2D6 is enabled
cyp2b6 | The CYP2B6 caller fields. | dictionary | CYP2B6 caller is enabled
cyp2d6 | The CYP2D6 caller fields. | dictionary | CYP2D6 caller is enabled
cyp21a2 | The CYP21A2 caller fields. | dictionary | CYP21A2 caller is enabled
gba | The GBA caller fields. | dictionary | GBA caller is enabled
hba | The HBA caller fields. | dictionary | HBA caller is enabled
lpa | The LPA caller fields. | dictionary | LPA caller is enabled
rh | The RH caller fields. | dictionary | RH caller is enabled
smn | The SMN caller fields. | dictionary | SMN caller is enabled
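As a minimal sketch of how the top-level fields listed above might be consumed, the following example reads the sample name, software version, and the set of targets present in the report. The contents of the per-target dictionaries are not modeled here (empty placeholders are used), and the function name and example values are hypothetical.

```python
import json

# Top-level keys that hold per-target results, as listed above.
TARGET_KEYS = ("cyp2b6", "cyp2d6", "cyp21a2", "gba", "hba", "lpa", "rh", "smn")


def summarize_targeted_json(text):
    """Return (sampleId, softwareVersion, targets present in the report)."""
    report = json.loads(text)
    present = [key for key in TARGET_KEYS if key in report]
    return report["sampleId"], report["softwareVersion"], present


# Placeholder document: the per-target dictionaries are left empty on purpose.
example = json.dumps(
    {"sampleId": "SAMPLE1", "softwareVersion": "x.y.z", "gba": {}, "smn": {}}
)
print(summarize_targeted_json(example))
```

For a real run, the text would come from reading the <output-file-prefix>.targeted.json file instead of the placeholder string.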
The targeted caller generates a <output-file-prefix>.targeted.vcf.gz file in the output directory. The output file is a VCFv4.2 formatted file. The targets that have VCF output are: cyp21a2, gba, hba, lpa, rh, and smn.
Small variants, structural variants, and copy number variants are reported in the same VCF file.
The <output-file-prefix>.targeted.vcf.gz file includes the following source header line:
For lpa, rh and smn targets, the EVENT and EVENTTYPE INFO fields are used to identify the called variants.
The EVENT and EVENTTYPE INFO fields are formally introduced in VCFv4.4 to enable the representation of complex rearrangements. This is achieved using the EVENT field to group all the related VCF records together, and the EVENTTYPE field to classify the event. The corresponding header lines are the following.
However, the use of EVENT is not limited to complex rearrangements and can be used to associate nonsymbolic alleles, for example in cases of variant position ambiguity in high homology regions.
Since the EVENTTYPE values are implementation-defined, custom EVENTTYPE header lines are included to describe each EVENTTYPE.
For cyp21a2, gba, and hba targets, the ALLELE_ID INFO field is used to identify the called variant alleles.
The missing value . is used when no identifier is available (e.g. a wild type allele) or applicable (e.g. allele index 0 for a structural variant record).
In the case of target variants in a high homology region, each variant is reported ambiguously at all corresponding homologous positions (i.e. in both the pseudogene and in the target gene). Additional analysis for these variants can be performed if absolute certainty that these variants are located in the target gene (e.g. in gba or cyp21a2) is required.
For lpa and smn, the ploidy of the called genotype (FORMAT/GT field) corresponds to the combined copy number from all the homologous positions. For cyp21a2, gba and hba, this "joint" genotype from all the homologous positions is instead reported in a separate FORMAT/JGT field, which is then collapsed into a diploid genotype and reported in the FORMAT/GT field. The following fields are reported for "joint" calls:
Note that the FORMAT/GQ and FORMAT/JGQ fields contain the unconditional genotype quality, unlike the VCF spec where FORMAT/GQ is defined as the genotype quality conditioned on the site being variant.
In the depicted example, there are two genes, A and B, that include a high homology region. The usual process to call variants in these regions is to make a joint pileup of the reads aligning in both genes A and B and call the variants using a model with a ploidy proportional to the total copy number of the regions. This generates divergent possible genotypes that are equally likely since the variant cannot be confidently placed in either gene A or gene B. For lpa and smn, the variant would be reported as follows:
Given the unconventional ploidy of the FORMAT/GT field used in this representation, a TargetedRepeatConflict filter is applied to these records. The header line for the filter is the following.
For cyp21a2, gba and hba, a conventional diploid FORMAT/GT is reported and so no TargetedRepeatConflict filter is applied. Due to the ambiguity in placing target variants in high homology regions, the corresponding QUAL and FORMAT/GQ fields can be much lower than conventional small variant calls (i.e. Phred 3 for a single variant allele copy across two homologous diploid positions). Therefore, instead of filtering on QUAL and FORMAT/GQ for these records, the records are filtered based on the FORMAT/JVQL and FORMAT/JGQ fields:
Since the wild type alleles at homologous positions may be different from each other or different from the reference alleles, an additional filter is applied when only wild type alleles are detected across the homologous positions. This avoids making ambiguous variant calls when no target variant of interest is detected.
In the case of an identified gene conversion event in rh, a small variant is reported at each differentiating site in the acceptor region.
In the depicted example, there are two genes, A and B, and gene A is the acceptor of a gene conversion from gene B (green box in the figure). Gene conversions are identified by observing variations in copy number at differentiating sites (blue and pink bars in the figure) in consecutive regions. Copy number variations between regions define the breakends of the gene conversion. An equivalent VCF representation for gene conversion would use CNV and SV entries with breakends corresponding to the donor/acceptor regions; however, only the small variant representation is currently supported.
In the case of a detected gene conversion event, there may be differentiating sites with a genotype that is inconsistent with that gene conversion event. In these cases, the RecombinantConflict filter is applied. The RecombinantConflict filter is defined by the following header line.
In the example, the resulting representation is as follows.
For cyp21a2 and gba, nonallelic homologous recombination can result in gene deletion or duplication in the case of reciprocal recombination, or gene conversion in the case of nonreciprocal recombination. Both gene deletion and gene conversion can introduce loss-of-function variants, and in both cases the targeted caller will report these variants in the target gene. In the case of gene deletion, the differentiating sites at the nontarget (i.e. pseudogene) positions will contain the overlapping deletion allele * while the differentiating sites in the target will contain any variant alleles. Although an equivalent VCF representation would be to simply report the deletion with a single structural variant VCF record, reporting small variant VCF records in the target gene allows for identification of the specific mutations that may occur in a gene transcript and matches well with annotation using HGVS nomenclature. Similarly, for gene conversions, variants are reported at differentiating sites in the target gene, rather than as pairs of structural variant breakends.
Calls at differentiating sites within the recombinant variant calling region will contain the same "joint" fields as are reported for nonrecombinant-like variants in high homology regions (see Nonrecombinant-like Variants In High Homology Regions). However, the collapsed diploid FORMAT/GT will be based on any detected recombination events. Because detected recombinant variants are placed in the target gene, these records are filtered differently than the ambiguously placed, nonrecombinant-like variants in high homology regions. The INFO/Recombinant flag is added to calls derived from recombinant variant calling to distinguish them from nonrecombinant-like variant calls in high homology regions. The FORMAT/VQL field is used to apply the RecombinantLowVQL filter for low quality recombinant variants, and the RecombinantREF filter is applied when the collapsed diploid FORMAT/GT contains only reference alleles.
The use of GT=0 for symbolic structural variant alleles is formally disambiguated in VCFv4.4, specifying that "GT=0 indicates the absence of any of the ALT symbolic structural variants defined in the record". With this convention, compound overlapping heterozygous structural variants can be reported.
In the hba genotype depicted above, two overlapping SVs can be represented as follows:
The relevant header lines for the VCF records above are:
In the depicted example, there is a Variable Number Tandem Repeat (VNTR) region composed of three repeat units in the reference. The CN INFO field is used to report the allele copy number, the CN FORMAT field is used to report the region total copy number given by the sum of the allele copy numbers, and the REPCN FORMAT field is used to report the repeat unit copy number, equal to the allele copy number multiplied by the number of repeat units in the reference.
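The arithmetic relating these three fields can be sketched as follows, assuming a reference with three repeat units as in the example. The function name and the example allele copy numbers are illustrative and are not taken from DRAGEN output.

```python
def vntr_copy_numbers(allele_cns, ref_repeat_units):
    """Return (FORMAT CN, per-allele REPCN) from per-allele INFO CN values."""
    total_cn = sum(allele_cns)  # region total copy number
    repcn = [cn * ref_repeat_units for cn in allele_cns]  # repeat units per allele
    return total_cn, repcn


# One allele at reference length (CN 1) and one duplicated allele (CN 2).
print(vntr_copy_numbers([1, 2], ref_repeat_units=3))  # (3, [3, 6])
```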
This VNTR can be represented as follows:
The REPCN and CN header lines are:
For lpa, rh and smn, the TargetedLowQual filter is applied if the QUAL of a target variant is less than 3.00.
Similarly, for cyp21a2 and gba, the TargetedLowVQL filter is applied if the VQL of a target variant in a low-homology region is less than 3.00.
The TargetedLowGQ filter is applied if the targeted variant has a GQ smaller than 3.
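The three quality thresholds above can be summarized in a small sketch. It covers only these three filters (other targeted filters, such as TargetedRepeatConflict, are not modeled), and the function signature is hypothetical rather than a DRAGEN interface.

```python
QUAL_TARGETS = {"lpa", "rh", "smn"}   # filtered on QUAL
VQL_TARGETS = {"cyp21a2", "gba"}      # filtered on VQL in low-homology regions
THRESHOLD = 3.0


def quality_filters(target, qual=None, vql=None, gq=None, low_homology=False):
    """Return the quality filters (of the three described) that would apply."""
    filters = []
    if target in QUAL_TARGETS and qual is not None and qual < THRESHOLD:
        filters.append("TargetedLowQual")
    if target in VQL_TARGETS and low_homology and vql is not None and vql < THRESHOLD:
        filters.append("TargetedLowVQL")
    if gq is not None and gq < THRESHOLD:
        filters.append("TargetedLowGQ")
    return filters


print(quality_filters("smn", qual=2.5, gq=10))
print(quality_filters("gba", vql=1.0, gq=2, low_homology=True))
```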
hard-filtered Files
When the small variant caller is enabled, the targeted small variant VCF calls can be merged into the <output-file-prefix>.hard-filtered.vcf.gz and <output-file-prefix>.hard-filtered.gvcf.gz files (referred to below as the hard-filtered files). The --targeted-merge-vc command line option controls which targets have their small variant VCF records merged into the hard-filtered files. For example, --targeted-merge-vc rh enables merging of the calls from the rh caller into the hard-filtered files, and --targeted-merge-vc rh hba enables merging of the calls from the rh and hba targets into the hard-filtered files. The true value merges calls from all supported targets into the hard-filtered files, while the false value merges no calls into the hard-filtered files.
The targeted calls merged into the hard-filtered files are marked with a TARGETED INFO flag.
When enabled, targeted small variants are merged into the hard-filtered files regardless of any regions that may be provided using the --vc-target-bed option.
The merging strategy for targeted small variant calls is to prioritize the targeted calls over small variant calls from the germline small variant caller. When a germline small variant call overlaps a targeted caller call, then the small variant call is filtered with a TargetedConflict filter if any of the following holds:
The targeted caller call is PASS.
The small variant call and targeted caller call have incompatible genotypes and the targeted caller call is not filtered with the TargetedLowGQ filter.
The strategy is summarized in the following examples.
The TARGETED call is PASS.
The TARGETED call and the small variant call are not overlapping.
The TARGETED call is filtered with TargetedLowQual and has a discordant variant representation with the overlapping small variant call.
The TARGETED call is filtered with TargetedLowQual and has a discordant genotype with the overlapping small variant call.
The TARGETED call is filtered with TargetedLowGQ and has a discordant genotype with the overlapping small variant call.
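The TargetedConflict rule stated above can be expressed as a small decision function. This is a sketch under simplifying assumptions: the caller is assumed to already know whether the two calls overlap and whether their genotypes are compatible, and the function and parameter names are hypothetical.

```python
def gets_targeted_conflict(overlaps, targeted_filters, genotypes_compatible):
    """Decide whether the germline small variant call is marked TargetedConflict.

    overlaps             -- True if the small variant call overlaps the targeted call
    targeted_filters     -- set of FILTER values on the targeted call (empty = PASS)
    genotypes_compatible -- True if the two calls have compatible genotypes
    """
    if not overlaps:
        return False
    targeted_pass = not targeted_filters or targeted_filters == {"PASS"}
    if targeted_pass:
        return True
    return not genotypes_compatible and "TargetedLowGQ" not in targeted_filters


# Targeted call is PASS and overlaps the small variant call.
print(gets_targeted_conflict(True, set(), True))
# Targeted call filtered with TargetedLowQual, discordant genotype.
print(gets_targeted_conflict(True, {"TargetedLowQual"}, False))
# Targeted call filtered with TargetedLowGQ, discordant genotype.
print(gets_targeted_conflict(True, {"TargetedLowGQ"}, False))
```

Under this reading of the rule, the three printed cases give True, True, and False, and a non-overlapping pair is never marked.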
The targeted caller can be enabled in parallel with other components as part of a human WGS germline analysis workflow (see DRAGEN Recipe - Germline WGS).
The following command-line example runs the targeted caller from FASTQ input:
The following command-line example runs cyp21a2 only using BAM input without realignment:
The DRAGEN Variable Number Tandem Repeat (VNTR) Caller detects expansions and contractions in tandem repeat (TR) regions. For specified TR regions in the genome, the DRAGEN VNTR Caller estimates the size of the haplotypes in each region and provides variant calls, including the number of copies of the repeat for the sample in question. The DRAGEN VNTR Caller only considers TR regions included in a pre-specified VNTR catalog file.
For each region in the VNTR catalog, the VNTR Caller performs the following steps:
Read fragment collection, including wrap-around alignment and read classification;
Genotyping, including the scoring of candidate haplotype lengths using a Bayesian likelihood model.
The output of the VNTR Caller is the total length of sequence present in each TR region for the sample in question, resolved for each haplotype if possible; the copy number for each region is calculated from the length. Calls are reported in a VNTR output VCF file following the VCFv4.4 spec.
The DRAGEN VNTR Caller can be enabled by setting the --enable-vntr option to true. The VNTR Caller requires whole-genome sequencing (WGS) data aligned to a human reference genome with at least 30x coverage.
This diagram illustrates the overall workflow of the VNTR Caller. The VNTR Caller takes as input a set of aligned reads from the sample in question (either from the DRAGEN mapper or from an input BAM/CRAM) and a VNTR catalog file.
The VNTR catalog is a bed file specifying the TR regions for the VNTR Caller to act upon. Each region in the bed file is expected to be the start and the end of a tandem repeat sequence with no additional buffer sequence. The catalog also includes a unique TR ID for each region and the sequence of the repeat unit/pattern (see below for more details on the VNTR catalog file format).
The VNTR Caller processes each TR region in parallel, starting with read fragment collection. The VNTR Caller considers read fragments (i.e. paired-end reads as a single unit) rather than individual reads. To obtain all of the relevant read fragments for each region, all of the reads that overlap the region are found, and then all of their mates are collected as well.
Due to the repetitive nature of TR regions, existing read-alignments may be unreliable. Therefore spanning reads, unmapped reads, and reads with soft-clips undergo a specialized wrap-around alignment algorithm, which allows for a read to align to the same pattern sequence multiple times without penalty (mirroring the structure of the tandem repeat). This algorithm produces more reliable alignments of the read fragments to the TR region. Additionally, another rule to virtually extend the boundaries of the repetitive region into the flanks is applied to resolve some alignment ambiguities arising from the wrap-around alignment.
Once reliable alignments of the read fragments have been obtained, the next step is to classify each read. Reads are classified as non-overlapping, flanking, spanning, and contained relative to the TR region based on the following figure.
The output of fragment collection is the set of all read fragments in each TR region, re-aligned as necessary, with each read given a classification. This collection of read fragments is referred to as a pileup and acts as the input to the genotyper.
The genotyper determines the top-scoring haplotypes based on the read fragment evidence for each TR region. Given a pileup, the genotyper further classifies each read fragment into fragment classes.
The number of fragments in each class acts as evidence for the haplotype lengths of the TR region. A Bayesian likelihood model is used to evaluate what pair of haplotypes have the highest likelihood of generating the observation of these fragment class counts. A set of candidate haplotype lengths is generated based on fractional increments of the repeat pattern length, and each pair of haplotype lengths is evaluated as a candidate diploid genotype. If the caller detects that individual haplotype lengths cannot be resolved, the total length is considered as a candidate genotype instead (referred to as a total call). In subsequent steps, these total call candidates are assessed as if they were homozygous diploid genotypes.
For a Bayesian model, the posterior probability of each candidate diploid genotype must be considered. The posterior probability is made up of two parts: the genotype prior and the pileup likelihood. Three types of priors are currently supported:
No prior (all alleles weighted equally, referred to as model 0)
Het/hom priors (four classes with different weights: homozygous reference, ref/alt, homozygous alt, and alt1/alt2, referred to as model 1)
Population haplotype frequencies (region-specific haplotype frequencies over high-quality population data sets, referred to as model 3 and used by default)
The priors model can be chosen by setting the option --vntr-priors-model to 0, 1, or 3 (the default being 3).
The pileup likelihood is calculated as the likelihood of observing the fragment class counts given the candidate diploid genotype (based on an underlying model for how fragments are generated from a TR region haplotype of a given length). With the prior and the pileup likelihood, the posterior probability of each candidate diploid genotype can be computed. The diploid genotype with the highest posterior probability is chosen as the resulting call for each region.
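The selection of the highest-posterior genotype can be sketched in log space. The likelihood and prior values below are made-up numbers for illustration, not output of the DRAGEN fragment model, and the function name is hypothetical. With no priors supplied, all candidates are weighted equally, analogous to priors model 0.

```python
import math


def call_genotype(log_likelihoods, log_priors=None):
    """Return (best genotype, normalized posterior) over candidate genotypes.

    log_likelihoods -- dict: genotype -> log P(pileup | genotype)
    log_priors      -- dict: genotype -> log P(genotype); None means flat priors
    """
    scores = {}
    for genotype, log_lik in log_likelihoods.items():
        log_prior = 0.0 if log_priors is None else log_priors[genotype]
        scores[genotype] = log_lik + log_prior  # unnormalized log posterior
    best = max(scores, key=scores.get)
    peak = scores[best]
    # Log-sum-exp normalization keeps the computation numerically stable.
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores.values()))
    return best, math.exp(peak - log_norm)


# Candidate diploid genotypes as (haplotype 1 length, haplotype 2 length).
log_liks = {(30, 30): -12.0, (30, 45): -10.0, (45, 45): -15.0}
log_priors = {(30, 30): math.log(0.90), (30, 45): math.log(0.05), (45, 45): math.log(0.05)}

print(call_genotype(log_liks)[0])              # (30, 45)
print(call_genotype(log_liks, log_priors)[0])  # (30, 30)
```

In this toy example, a strong prior toward the reference-length genotype changes the call, which illustrates why the choice of priors model matters for regions with weak read evidence.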
The VNTR Caller is disabled by default. To enable the VNTR Caller, set --enable-vntr to true. The VNTR Caller can run directly from FASTQ input with the mapper or from prealigned BAM/CRAM input. You can also enable the VNTR Caller in parallel with any other germline variant callers as part of a WGS germline analysis workflow. For more information on other variant callers, see the DRAGEN DNA Pipeline.
FASTQ input example:
BAM input example:
Additional Options:
The number of threads used for the DRAGEN VNTR caller can be adjusted using the --vntr-num-threads <number_of_threads> option.
The VNTR catalog is a bed file with the following required fields:
chromosome (or contig)
start position (0-based inclusive)
end position (0-based exclusive = 1-based inclusive)
TR ID (unique ID for TR region)
repeat unit sequence (sequence of repeat pattern/motif)
The reference haplotype length is calculated by subtracting the start position from the end position, and the number of repeat units in the reference can be found by dividing the reference haplotype length by the length of the repeat unit.
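The calculation described above can be sketched for a single catalog line. The example line, its TR ID, and the dictionary keys are hypothetical; only the five required fields and the two formulas come from the text.

```python
def parse_catalog_line(line):
    """Parse one VNTR catalog BED line and derive the reference repeat count."""
    chrom, start, end, tr_id, repeat_unit = line.rstrip("\n").split("\t")[:5]
    start, end = int(start), int(end)
    ref_length = end - start  # 0-based start, exclusive end
    return {
        "chrom": chrom,
        "tr_id": tr_id,
        "repeat_unit": repeat_unit,
        "ref_length": ref_length,
        "ref_repeat_units": ref_length / len(repeat_unit),
    }


record = parse_catalog_line("chr1\t1000\t1030\tTR_0001\tACGTACGTAC")
print(record["ref_length"], record["ref_repeat_units"])  # 30 3.0
```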
When using a standard reference (hg38, hg19, or GRCh37), DRAGEN automatically uses a matching pre-packaged catalog by default. A custom catalog can be provided with the option --vntr-catalog-bed <custom_catalog_bed_file>.
For references other than the ones mentioned above, a catalog must be provided. Furthermore, the caller requires a set of normalization regions (--vntr-normalization-regions-bed <bed file>). These regions should be well-behaved and free of any VNTRs or other large variants. We recommend using a few thousand regions of at least 2 kb each. These two files are enough to run the genotyper with no priors or with het/hom priors (--vntr-priors-model 0 or 1). To enable population priors (--vntr-priors-model 3), one additional file has to be provided: --vntr-priors-file <json file>. The JSON file contains data obtained from a population analysis, structured like the following example with one entry per catalog region:
The output of the DRAGEN VNTR Caller includes a VNTR VCF file, a table output TSV file, and a VNTR metrics file.
The VNTR VCF file follows version 4.4 of the VCF spec. The VCF includes a call for every TR region provided in the VNTR catalog unless it was hard-filtered in the fragment collection or the genotyping (the filter annotation can be found in the table output file).
Each call is an estimate of the lengths of the haplotypes present in that region for the sample in question. If the individual haplotype lengths can be distinguished, then a diploid call is reported, including the lengths and copy number of each haplotype (in the INFO RB and RUC fields, respectively). Otherwise, a total call is made, only reporting the total length and total copy number for the region (in the FORMAT TOTALRB and TOTALRUC fields). For total calls, GT = ./.
If the length of a haplotype is within a certain threshold of the reference array length for the region, then it is reported as a reference allele (the default reference threshold is 10%). If both haplotypes are reference alleles or if the total length of a total call is within the total reference threshold, then a reference call is reported in the VCF, with HomRef in the FILTER field and GT = 0/0. A symbolic <CNV:TR> is reported in the ALT field for each non-reference allele in the call.
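The reference-threshold rule for diploid calls can be sketched as below. The sketch assumes the threshold is a fraction of the reference array length and that the comparison is inclusive; neither detail is stated explicitly in the text, total calls are not modeled, and the function names are hypothetical.

```python
def is_reference_allele(hap_length, ref_length, threshold=0.10):
    """True if the haplotype length is within the threshold of the reference length."""
    return abs(hap_length - ref_length) <= threshold * ref_length


def classify_diploid_call(hap1_length, hap2_length, ref_length, threshold=0.10):
    """Flag each haplotype as reference-like and report whether the call is HomRef."""
    ref_flags = [
        is_reference_allele(length, ref_length, threshold)
        for length in (hap1_length, hap2_length)
    ]
    return {"ref_alleles": ref_flags, "hom_ref": all(ref_flags)}


print(classify_diploid_call(95, 140, ref_length=100))
print(classify_diploid_call(95, 105, ref_length=100))
```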
The following fields are included for each VCF entry:
INFO:SVTYPE: set to "CNV" for all VNTR calls
INFO:SVLEN: set to the reference array length
INFO:EVENTTYPE: set to "VNTR"
INFO:RUS: the sequence of the repeat unit (i.e. the repeat pattern or motif)
INFO:RUL: the length of the repeat unit
INFO:REFRUC: the number of copies of the repeat in the reference haplotype
INFO:RB: the length of each ALT haplotype being reported
INFO:CN: the copy number per ALT haplotype relative to the reference (equal to RB / SVLEN)
INFO:CNVTRLEN: the change in length of each ALT haplotype compared to the reference (equal to RB - SVLEN)
INFO:RUC: the number of repeat unit copies for each ALT haplotype being reported (equal to RB / RUL)
FORMAT:SVFT: any filters that will be applied only in the merged SV + VNTR VCF
FORMAT:GQ: Genotype quality score
FORMAT:CN: the total copy number relative to the reference (equal to TOTALRB / SVLEN)
FORMAT:TOTALRB: the total length of all haplotypes (including reference haplotypes if present)
FORMAT:TOTALCNVTRLEN: the total change in length of all haplotypes relative to the reference (equal to TOTALRB - 2*SVLEN)
FORMAT:TOTALRUC: the total number of repeat unit copies of all haplotypes (including reference haplotypes if present; equal to TOTALRB / RUL)
Additional fields:
INFO:RUCCHANGE: the change in the RUC compared to the reference for each ALT haplotype (equal to RUC - REFRUC)
INFO:LOGPROB: the log probability of the called alleles from the genotyper
INFO:VNTRCLASSFIT: the score of how well the fragment classes fit the expected distribution
INFO:TOTALFRAGCOUNT: the number of fragments used to make the call
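The derived fields listed above follow from a few inputs by simple arithmetic. The sketch below applies the stated formulas to a made-up region; the function name and example numbers are illustrative, and the rounding or formatting used in the actual VCF is not modeled.

```python
def vntr_derived_fields(alt_lengths, total_rb, svlen, rul):
    """Compute derived INFO/FORMAT values from haplotype lengths.

    alt_lengths -- RB values, one per ALT haplotype
    total_rb    -- TOTALRB, total length of all haplotypes
    svlen       -- SVLEN, the reference array length
    rul         -- RUL, the repeat unit length
    """
    refruc = svlen / rul
    info = {
        "REFRUC": refruc,
        "CN": [rb / svlen for rb in alt_lengths],
        "CNVTRLEN": [rb - svlen for rb in alt_lengths],
        "RUC": [rb / rul for rb in alt_lengths],
        "RUCCHANGE": [rb / rul - refruc for rb in alt_lengths],
    }
    fmt = {
        "CN": total_rb / svlen,
        "TOTALCNVTRLEN": total_rb - 2 * svlen,
        "TOTALRUC": total_rb / rul,
    }
    return info, fmt


# Reference array of 30 bp with a 10 bp unit; one reference haplotype (30 bp)
# and one ALT haplotype of 60 bp, so TOTALRB is 90.
info, fmt = vntr_derived_fields(alt_lengths=[60], total_rb=90, svlen=30, rul=10)
print(info)
print(fmt)
```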
The table output file provides a simple summary of the VNTR Caller output. Every single region included in the VNTR catalog bed will also be included in the table output file; regions where a hard-filter was applied in the fragment collection or the genotyping will still be included with the reason for the filter annotated (these regions will not appear in the VCF file).
For each region, the following information is provided:
trId: the unique ID of the TR region
patternSize: the length of the repeat unit (i.e. the repeat pattern/motif length)
refArraySize: the length of the reference haplotype for this region
Hap1Size: the length of the first haplotype of the call (Hap1Size <= Hap2Size; NA for total calls)
Hap2Size: the length of the second haplotype of the call (NA for total calls)
TotalSize: the total length of all haplotypes in the call (equal to Hap1Size + Hap2Size for diploid calls)
Likelihood: the log likelihood of the called alleles from the genotyper (equal to INFO:LOGPROB in the VCF)
QUAL: QUAL score (matches QUAL field in the VCF)
GQScore: Genotype quality score (equal to FORMAT:GQ in the VCF)
ClassDistributionFit: the score of how well the fragment classes fit the expected distribution (equal to INFO:VNTRCLASSFIT in the VCF)
FragmentCount: the number of fragments used to make the call (equal to INFO:TOTALFRAGCOUNT in the VCF)
Flags: the flags and filters applied to the call
The VNTR metrics file reports summary statistics for the VNTR caller including region counts, read class counts, and call counts.
Region counts include the number of normalization regions, the number of prior regions, the number of TR regions with nonzero coverage, and the total number of TR regions.
Read class counts include the total number of reads in each class (strictly left, left-flanking, spanning, contained, right-flanking, strictly right, and unmapped).
Call counts include the total number of uncalled TR regions, as well as the total number of deletion, insertion, and reference calls for diploid and total calls (note that for a region where a diploid call is made, two calls are reported, but for a total call region, only one call is reported).
DRAGEN supports the automatic merger of the VNTR VCF calls with the DRAGEN Structural Variant (SV) Caller output VCF. By default, if both the DRAGEN VNTR Caller and the DRAGEN SV Caller are enabled (with the options --enable-vntr true
and --enable-sv true
, respectively), then calls made by the VNTR Caller will also be included in the DRAGEN SV VCF (<output_prefix>.sv.vcf.gz
). This behavior can be disabled by adding the option --sv-vntr-merge false
. The VNTR VCF does not change even if DRAGEN SV is enabled.
When VNTR calls are added to the SV VCF, the following changes are applied:
VNTR diploid calls with GT = 1/2
are split into two separate VCF entries, each of which is reported as GT = 0/1
.
A lt50bp
filter is applied to all VNTR calls with INFO:CNVTRLEN < 50
(the min SV length parameter is set to 50 bp by default). For total calls with no INFO:CNVTRLEN
, FORMAT:TOTALCNVTRLEN
is used instead.
A TotalCall
filter is applied to all VNTR total calls (calls with GT = ./.
). This behavior can be disabled by adding the option --sv-vntr-filter-total-calls false
.
A LowPopulationVariance
filter is applied to all VNTR calls with the FORMAT:SVFT
field equal to LowPopulationVariance
. The filter indicates that there are few population samples with an SV variant for this region.
An OverlapsVNTR
filter is applied to any SV call that overlaps with a VNTR call (even with a HomRef filter) UNLESS the VNTR call has a TotalCall
or a LowPopulationVariance
filter.
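The filter rules above can be summarized as a small decision sketch. This is not DRAGEN's implementation: the function names, argument layout, and closed-interval overlap test are assumptions made for illustration, and the length comparison follows the text literally (the text does not say whether the magnitude of the length change is used for deletions).

```python
def vntr_filters_in_sv_vcf(gt, cnvtrlen, total_cnvtrlen, svft,
                           min_sv_len=50, filter_total_calls=True):
    """Return the filters a VNTR call would receive in the merged SV VCF."""
    filters = []
    # Total calls have no INFO:CNVTRLEN, so FORMAT:TOTALCNVTRLEN is used instead.
    length_change = cnvtrlen if cnvtrlen is not None else total_cnvtrlen
    if length_change < min_sv_len:
        filters.append("lt50bp")
    # Total calls are those with GT = ./. (can be turned off by the option
    # --sv-vntr-filter-total-calls false).
    if gt == "./." and filter_total_calls:
        filters.append("TotalCall")
    if svft == "LowPopulationVariance":
        filters.append("LowPopulationVariance")
    return filters


def sv_gets_overlaps_vntr(sv_start, sv_end, vntr_calls):
    """vntr_calls: iterable of (start, end, filters) tuples (hypothetical layout)."""
    exempt = {"TotalCall", "LowPopulationVariance"}
    for start, end, filters in vntr_calls:
        overlaps = sv_start <= end and start <= sv_end  # closed intervals assumed
        if overlaps and not exempt & set(filters):
            return True
    return False
```

For example, `vntr_filters_in_sv_vcf("./.", None, 40, None)` returns both the `lt50bp` and `TotalCall` filters, because the total length change is used when no per-haplotype value exists.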
DRAGEN can find and remove variants that are common to separate VCF files. DRAGEN supports the following modes:
Small indel deduplication—If using a structural variant VCF and a small variant VCF, DRAGEN filters all small indels in the structural variant VCF that appear and are passing in the small variant VCF (PASS
in the FILTER
column of the small variant VCF file). With this feature, DRAGEN creates a new VCF, leaving the SV and SNV VCF files unchanged, that contains the variants in the SV VCF that do not match any variant in the SNV VCF. The new deduplicated SV VCF file has the prefix passed by --output-file-prefix
followed by sv.small_indel_dedup.vcf.gz
as suffix. The diagram below describes the small indel deduplication pipeline. You must provide a reference genome to generate the VCF files to normalize the variants. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases. One use case for this feature is running both the SV and SNV callers in somatic workflows, which can increase sensitivity while preventing duplicated variants within genes such as FLT3 and KMT2A.
SMN deduplication—If using a small variant VCF and an ExpansionHunter VCF, DRAGEN filters any lines in the small variant VCF that have the same chromosome and position as lines in the ExpansionHunter VCF with the INFO tag VARID=SMN
. A reference genome is not required.
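The SMN deduplication rule can be illustrated with a minimal sketch that works on plain VCF text lines. It assumes standard tab-delimited VCF records with the INFO column in the eighth field. The function names and the sample records are hypothetical and do not reflect DRAGEN internals.

```python
def smn_positions(eh_vcf_lines):
    """Collect (chromosome, position) pairs of ExpansionHunter records tagged VARID=SMN."""
    positions = set()
    for line in eh_vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        if "VARID=SMN" in fields[7].split(";"):
            positions.add((fields[0], fields[1]))
    return positions


def smn_dedup(small_variant_lines, eh_vcf_lines):
    """Keep header lines and every small variant record that does not share
    chromosome and position with an SMN record in the ExpansionHunter VCF."""
    blocked = smn_positions(eh_vcf_lines)
    kept = []
    for line in small_variant_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        chrom, pos = line.split("\t")[:2]
        if (chrom, pos) not in blocked:
            kept.append(line)
    return kept


# Synthetic records for illustration only.
eh = ["##fileformat=VCFv4.2",
      "chr5\t100\t.\tC\t<STR>\t.\tPASS\tVARID=SMN;END=101"]
small = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
         "chr5\t100\t.\tC\tT\t50\tPASS\t.",
         "chr5\t200\t.\tA\tG\t50\tPASS\t."]
for record in smn_dedup(small, eh):
    print(record)
```

Because only chromosome and position are compared, no normalization and therefore no reference genome is needed in this mode.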
Use the following command line options to input VCF or gVCF files. The input files are not altered.
vd-sv-vcf
—Specify a structural variant VCF or gVCF.
vd-small-variant-vcf
—Specify a small variant VCF or gVCF.
vd-eh-vcf
—Specify an ExpansionHunter VCF or gVCF.
DRAGEN determines the name and type of the output file as follows.
Output prefix
If a value is specified for output-file-prefix
, the prefix is used as usual. If the value is not valid, the name of the filtered input is used as the prefix.
Deduplication mode
The prefix is followed by .small_indel_dedup
or .smn_dedup
depending on the deduplication mode used.
File type
The output file type matches the input file type (VCF or gVCF). If enable-vcf-compression
is set to true
, the output file is gzip compressed, regardless of whether the input file was compressed. The name of the match log is either match_log.smn_dedup.txt
or match_log.small_indel_dedup.txt
depending on which deduplication mode you use.
You can use the following command line options for variant deduplication.
enable-variant-deduplication
To enable variant deduplication, set to true
. The default is false
.
enable-vcf-indexing
To generate tabix index files, set to true. The default is true.
vd-output-match-log
To log matching lines to a text file, set to true. The default is false. For each match, the two matching lines are written consecutively, followed by a new line.
The following is an example command for an SMN deduplication standalone run:
You can also run small indel deduplication automatically on outputs from the DRAGEN joint caller where both structural variant and small variant callers are enabled. To run small indel deduplication automatically, set enable-variant-deduplication
to true
, and make sure the vd-sv-vcf
, vd-small-variant-vcf
, and vd-eh-vcf
input options are not set. Only small indel deduplication can be run automatically.
The following is an example command for an automatic small indel deduplication run.
The Ploidy Caller uses the per contig median coverage values from the Ploidy Estimator to detect aneuploidy and chromosomal mosaicism in mammalian germline samples from whole genome sequencing data.
The Ploidy Caller runs by default except in the following circumstances:
The Ploidy Estimator cannot determine if the input data is from whole genome sequencing. For example, data from exome or targeted sequencing.
The reference genome does not contain any autosomes following the expected naming convention (e.g. chr1
or 1
).
There is no germline sample. For example, tumor-only analysis.
Chromosomal mosaicism is detected when there is a significant shift in median coverage of a chromosome compared to the overall autosomal median coverage.
The following table displays some examples of expected shifts in coverage for a given aneuploidy and mosaic fraction.
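The arithmetic behind such expected shifts can be sketched as follows. The sketch assumes a simple mixture model in which a fraction of cells carries the aneuploid copy number and the remaining cells carry the reference ploidy. The function name and this model are illustrative assumptions, not a description of DRAGEN internals.

```python
def expected_coverage_shift(copy_number, mosaic_fraction, reference_ploidy=2):
    """Expected relative shift in chromosome coverage versus the autosomal median.

    copy_number: chromosome copies in the affected cells (for example 3 for a trisomy).
    mosaic_fraction: fraction of cells carrying the aneuploidy (1.0 = all cells).
    """
    mean_copies = (1 - mosaic_fraction) * reference_ploidy + mosaic_fraction * copy_number
    return mean_copies / reference_ploidy - 1


# Full trisomy, then a monosomy present in 10% of cells.
print(f"{expected_coverage_shift(3, 1.0):+.0%}")
print(f"{expected_coverage_shift(1, 0.1):+.0%}")
```

Under this model a low mosaic fraction produces a shift of only a few percent, which is why the variance of the coverage estimate matters for detection.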
The Ploidy Caller models coverage as a normal distribution for both the null (neutral) and the alternative (mosaic) hypotheses. The two normal distributions have equal mean at the median autosomal coverage for the sample, but the variance of the alternative normal distribution is greater than that of the null normal distribution. The baseline variance of the two models at 30x coverage was determined empirically from a cohort of ~2500 WGS samples. The actual variance used for the two models is calculated from the baseline variance at 30x coverage, adjusting for the median autosomal coverage of the sample. Below are the likelihood distributions for the null and alternative hypotheses for a sample with 35x median autosomal coverage.
After applying an empirically estimated prior for chromosomal mosaicism, the Ploidy Caller generates ploidy calls according to the posterior probability of the null and alternative hypotheses, as shown below for a sample with 35x median autosomal sequencing coverage.
At 35x median autosomal coverage, the threshold for deciding between a neutral (REF) and an alternative (DEL or DUP) call is roughly at +/- 5% shift in coverage for an autosome. At 100x median autosomal coverage, the threshold is at roughly +/- 3% shift in coverage for an autosome. A Q20 threshold is used to filter low quality calls.
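The two-hypothesis model described above can be sketched as a posterior computation over two normal distributions with a shared mean. The standard deviations and prior passed in below are placeholders chosen purely for illustration. The empirically derived DRAGEN values are not given in this section, and the 0.5 decision point and Phred conversion are assumptions of the sketch.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def mosaic_posterior(chrom_cov, autosomal_median, null_sd, alt_sd, prior_mosaic):
    """Posterior probability of the alternative (mosaic) hypothesis.

    Both hypotheses are centered on the autosomal median; the alternative
    simply has the larger standard deviation.
    """
    like_null = normal_pdf(chrom_cov, autosomal_median, null_sd)
    like_alt = normal_pdf(chrom_cov, autosomal_median, alt_sd)
    weighted_alt = prior_mosaic * like_alt
    return weighted_alt / (weighted_alt + (1 - prior_mosaic) * like_null)


def ploidy_call(chrom_cov, autosomal_median, null_sd, alt_sd, prior_mosaic):
    """Return (call, Phred-scaled quality) for one chromosome."""
    post = mosaic_posterior(chrom_cov, autosomal_median, null_sd, alt_sd, prior_mosaic)
    if post > 0.5:
        call = "DUP" if chrom_cov > autosomal_median else "DEL"
        p_error = 1 - post
    else:
        call = "REF"
        p_error = post
    quality = -10 * math.log10(max(p_error, 1e-300))  # floor avoids log10(0)
    return call, quality


# Placeholder parameters (NOT DRAGEN's values): 35x sample, hypothetical sds and prior.
for cov in (35.0, 36.0, 40.0, 30.0):
    print(cov, ploidy_call(cov, 35.0, null_sd=0.5, alt_sd=3.0, prior_mosaic=0.01))
```

A call whose quality falls below 20 would then be filtered, mirroring the Q20 threshold mentioned above.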
In addition to detecting aneuploidy and chromosomal mosaicism in autosomes where the expected reference ploidy is 2, the Ploidy Caller can also detect these variants in allosomes. The reference sex karyotype used for making calls on the allosomes is determined from the sex karyotype of the sample either provided on the command line using the --sample-sex
option or from the Ploidy Estimator. If the sex karyotype of the sample is not provided on the command line and not determined by the Ploidy Estimator, then the sex karyotype is assumed to be XX. Whenever the sex karyotype contains at least one Y chromosome, the reference sex karyotype is XY. If the sex karyotype does not contain at least one Y chromosome, then the reference sex karyotype is XX. The following table displays each of the possible sex karyotypes for a sample. If the Y chromosome reference ploidy is zero, then ploidy calling is not performed on the Y chromosome.
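The selection rule for the reference sex karyotype can be written as a short sketch. It assumes that karyotypes are represented as strings such as "XX", "XY", or "XXY", and that a karyotype provided on the command line takes precedence over the estimated one. Both the representation and the function names are assumptions for illustration.

```python
def reference_sex_karyotype(provided=None, estimated=None):
    """Pick the reference sex karyotype used for allosome calls.

    provided: karyotype derived from the command line, if any.
    estimated: karyotype from the Ploidy Estimator, if determined.
    """
    sample_karyotype = provided or estimated or "XX"  # XX is assumed when unknown
    return "XY" if "Y" in sample_karyotype else "XX"


def y_reference_ploidy(reference_karyotype):
    """Reference ploidy of chrY; zero means no ploidy calling on chrY."""
    return reference_karyotype.count("Y")


print(reference_sex_karyotype(estimated="XXY"))
print(y_reference_ploidy(reference_sex_karyotype()))
```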
The Ploidy Caller generates a <output-file-prefix>.ploidy.vcf.gz
file in the output directory. The output file follows the VCF 4.2 Specification. A single record is reported for each reference autosome and allosome, except for the Y chromosome if the reference sex karyotype is XX. Calls are not made for other sequences in the reference genome, such as mitochondrial DNA, unlocalized or unplaced sequences, alternate contigs, decoy contigs, or the Epstein-Barr virus sequence. The VCF header is annotated with ##source=DRAGEN_PLOIDY
to indicate the file is generated by the DRAGEN PLOIDY pipeline.
The following information is provided in the VCF file.
Meta-information--The VCF output file contains common meta-information such as DRAGENVersion
and DRAGEN CommandLine
, as well as Ploidy Caller specific information. The VCF header contains the meta-information for median autosome depth of coverage, the provided sex karyotype if available, the estimated sex karyotype from the Ploidy Estimator if available, and the reference sex karyotype. The following is an example of the header lines:
FILTER Fields--The VCF output file includes the LowQual filter, which filters results with quality score below 20.
INFO Fields--The VCF output INFO fields include the following:
END—End position of the variant described in this record.
SVTYPE—Type of structural variant.
FORMAT Fields--The VCF output file includes the following format fields. There is no GT
FORMAT field. A variant call in the VCF displays either <DUP>
or <DEL>
in the ALT column. A non-variant call displays .
in the ALT column. If using the output file for downstream use, a GT field can be added for variant calls using ./1
for a diploid contig and 1 for a haploid contig. For non-variant calls, use 0/0
for diploid and 0
for haploid.
DC—Depth of coverage.
NDC—Normalized depth of coverage.
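For downstream tools that expect a genotype, the GT convention described above can be applied with a small helper. This is a minimal sketch and the function name is hypothetical. The contig ploidy (2 for diploid, 1 for haploid) has to be supplied by the caller.

```python
def derive_gt(alt, contig_ploidy):
    """Derive a GT value for a ploidy VCF record.

    alt: content of the ALT column ("<DUP>", "<DEL>", or "." for non-variant).
    contig_ploidy: 2 for a diploid contig, 1 for a haploid contig.
    """
    is_variant = alt in ("<DUP>", "<DEL>")
    if contig_ploidy == 2:
        return "./1" if is_variant else "0/0"
    if contig_ploidy == 1:
        return "1" if is_variant else "0"
    raise ValueError(f"unsupported contig ploidy: {contig_ploidy}")


print(derive_gt("<DUP>", 2))
print(derive_gt(".", 1))
```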
The following is an example output file for a sample with mosaic loss of the Y chromosome.
The following is an example output file for a sample with trisomy 21.
Samples derived from cell lines frequently have coverage artifacts that might result in variant ploidy calls on some chromosomes. These cell line coverage artifacts are most common on chromosomes 17, 19, and 22. When performing accuracy assessments of ploidy calls on cell line samples, filter out chromosomes with known cell line artifacts.
The DRAGEN offering encompasses a multitude of bioinformatics tools and allows for rapid end-to-end analysis of NGS data. The most common workflow is running FASTQ data through the DRAGEN map/align component and streaming directly to the small variant caller. This eliminates the need for a user to construct a workflow from off-the-shelf tools and to deal with interfaces, incompatibility issues, and external library dependencies. In this section, we expand on the capabilities of DRAGEN to ease the workflow needs of common bioinformatics analyses.
Most components in DRAGEN can be enabled or disabled independently. These are controlled by enable-<component>
flags on the command line. Based on which components are enabled, DRAGEN will resolve any inconsistencies (if applicable) and construct the desired workflow. Where possible, DRAGEN will run components in parallel to save time and compute costs. Some examples of the top level options are listed here:
enable-map-align
enable-sort
enable-duplicate-marking
enable-variant-caller
enable-cnv
enable-sv
Each component has its own set of options which are used to configure the behavior of the component. These options typically control specific input settings, internal algorithm parameters, or output files and filtering criteria. Refer to the individual component sections for more details. As an example, a different BED file may be provided separately for each caller:
cnv-target-bed
sv-call-regions-bed
vc-target-bed
Additionally, some options are shared amongst callers, such as output-directory
and sample-sex
. Each variant caller will also produce its own set of VCFs and metric output files.
DRAGEN accepts the following common standard NGS input formats:
FASTQ (fastq-file1
and fastq-file2
)
FASTQ List (fastq-list
)
BAM (bam-input
)
CRAM (cram-input
)
Somatic workflows can use tumor-equivalent input files (e.g. tumor-bam-input
).
When running from unaligned reads, the reads first go through the map/align component to produce alignments which continue downstream to the variant callers. When running from prealigned reads, the user has the choice to re-align with the DRAGEN map/align component or to use the existing alignments from the source input. It is common to run with enable-map-align false
if you already have DRAGEN alignments available in BAM or CRAM format.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work. In this section we outline some best practices for doing so.
Configure the INPUT options
Configure the OUTPUT options
Configure MAP/ALIGN depending on if realignment is desired or not
Configure the VARIANT CALLERs based on the application
Build up the necessary options for each component separately, so that they can be reused in the final command line.
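One way to follow this practice is to keep each option group in its own structure and join the groups only when the final command line is needed. The sketch below uses option names that appear in this guide, but the file paths, the output prefix, and the helper name are hypothetical, and the resulting command is an illustration rather than a validated DRAGEN invocation.

```python
import shlex


def to_args(options):
    """Turn {"option-name": value} into ["--option-name", "value", ...]."""
    args = []
    for name, value in options.items():
        args += [f"--{name}", str(value)]
    return args


# Each group is built separately so it can be reused across runs.
reference_opts = {"ref-dir": "/staging/ref/hg38"}            # hypothetical path
input_opts = {"fastq-file1": "/data/sample_R1.fastq.gz",     # hypothetical paths
              "fastq-file2": "/data/sample_R2.fastq.gz"}
output_opts = {"output-directory": "/data/out",
               "output-file-prefix": "sample"}
map_align_opts = {"enable-map-align": "true"}
caller_opts = {"enable-variant-caller": "true",
               "enable-sv": "true"}

command = ["dragen"]
for group in (reference_opts, input_opts, output_opts, map_align_opts, caller_opts):
    command += to_args(group)

print(shlex.join(command))
```

Swapping the input group for a BAM-based one, or adding another caller group, then leaves the other groups untouched.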
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
The following table summarizes the support for different input formats and variant callers.
For brevity, other features and callers are not listed in the table even though they may be supported. Examples include repeat genotyping, SMA, CYP2D6, and ploidy calling. DRAGEN can run all germline callers for WGS analysis in a single command line (CNV + SNV + SV + ...). Similar support also exists for WES analysis, if the component is supported in single caller mode and there is no conflict with the input configurations.
The somatic workflows can be constructed in a similar manner by specifying tumor and normal inputs. The need for potentially two input files (tumor and matched normal) as well as the need for a matched normal SNV VCF for the Somatic CNV caller means extra care has to be taken.
One recommended tumor/normal workflow first starts with running the matched normal through the Germline Workflow.
Run matched normal through Germline workflow (CNV + SNV + SV + ...). This is required to first generate the matched normal SNV VCF. See the Somatic CNV section for more details.
Run tumor and matched normal through Somatic workflow (CNV + SNV + SV + ...)
Optionally, a full tumor/normal analysis can be done in a single execution if both the SNV and CNV modules are enabled, by leveraging the BAF information directly from the small variant caller. See the Somatic CNV section for more details. In brief, this requires the use of --enable-variant-caller true
and --cnv-use-somatic-vc-baf true
.
The following table lists the various combinations that are supported under the tumor/normal mode of operation.
Running in tumor only mode just requires removing the matched normal input from the INPUT
options and configuring each individual caller to run in tumor only mode (for example, CNV uses a population B-allele VCF instead of the matched normal SNP VCF).
The following table lists the combinations that are supported under the tumor only mode of operation.
These modes are for WGS analysis. Similar support also exists for WES analysis, if the mode is supported in single caller mode and there is no conflict in the input configurations. For WES analysis, note that CNV requires a panel of normals regardless of whether it is Tumor Normal or Tumor Only analysis.
The Ploidy Estimator runs by default. The Ploidy Estimator uses reads from the mapper/aligner to calculate the sequencing depth of coverage for each autosome and allosome in the human genome. The sex karyotype of the sample is then estimated using the ratios of the median sex chromosome coverages to the median autosomal coverage. The sex karyotype is estimated based on the range the ratios fall in. If the ratios are outside all expected ranges, then the Ploidy Estimator does not determine a sex karyotype.
Ploidy estimation can fail if the type of input sequencing data cannot be determined to be either WGS or WES. When ploidy estimation fails the estimated median coverage values will be zero. The type of input sequencing data is determined using coverage skewness.
skewness = std::abs(autosomeMean - autosomeMedian) / autosomeMean
When skewness is <= 0.2, the data is determined to be WGS. Note that a minimum of 2x coverage is required for WGS. WGS with coverage lower than 2x may not be detected properly or may be detected as WES. When skewness is >= 0.6, the data is determined to be WES. When skewness is between 0.2 and 0.6, the input sequencing data type is undefined and the reported estimated median coverage values are zero.
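The classification by coverage skewness can be sketched directly from the formula and thresholds above. The function name and the handling of a non-positive mean are assumptions. The thresholds 0.2 and 0.6 come from the text.

```python
def classify_sequencing_type(autosome_mean, autosome_median):
    """Classify input data as "WGS" or "WES" from autosomal coverage skewness.

    Returns None when the type is undefined (skewness between the two
    thresholds, or no usable coverage).
    """
    if autosome_mean <= 0:
        return None  # assumption: no coverage means the type cannot be determined
    skewness = abs(autosome_mean - autosome_median) / autosome_mean
    if skewness <= 0.2:
        return "WGS"
    if skewness >= 0.6:
        return "WES"
    return None


print(classify_sequencing_type(30.0, 29.0))  # mean close to median
print(classify_sequencing_type(10.0, 2.0))   # mean far above median
print(classify_sequencing_type(10.0, 6.0))   # in between
```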
For WES data, the median exome coverage is estimated using the 99th percentile of coverage bins across each contig. This estimated median exome coverage is then reported by the Ploidy Estimator and used for sex estimation.
If there is not sufficient sequencing coverage in the autosomes (at least 2x for either WGS or WES) then the Ploidy Estimator does not determine a sex karyotype.
When both tumor and matched normal reads are provided as input, the Ploidy Estimator only estimates sequencing coverage and sex karyotype for the matched normal sample and ignores the tumor reads. If only tumor reads are provided as input, the Ploidy Estimator estimates sequencing coverage and sex karyotype for the tumor sample.
The Ploidy Estimator results, including each normalized per-contig median coverage, are reported in the <output-file-prefix>.ploidy_estimation_metrics.csv
file and in standard output.
The following is an example of the results.
DRAGEN supports Tumor Mutational Burden (TMB) in Tumor-Only or Tumor-Normal Mode.
It is important to note that in T/O mode germline variants must be identified and filtered using database information and optionally also allele frequency information. These germline filtering techniques are generally not as accurate as tumor normal subtraction. When using databases only to subtract germline variants, the TMB may be slightly higher than the more accurate T/N estimate. When using database and allele frequency information to remove germline variants, the TMB may be slightly underestimated for high purity tumor samples.
DRAGEN TMB comprises the following steps:
Please refer to "Somatic mode" for detailed variant calling options.
TMB is computed over protein coding regions with sufficient coverage. If DRAGEN detects a reference hg19/38, GRCh37/38 or hs37d5 it will automatically select the appropriate coding region based on the bed files available in "<INSTALL_PATH>/resources/tmb/". By default the coverage threshold for eligible regions is 50.
The protein coding region bed file and the coverage settings can explicitly be specified using the qc-coverage
options listed below in [QC coverage settings to override the default eligible region]. If DRAGEN does not automatically detect the reference it is required to specify these settings.
The following variants are excluded from the TMB calculation:
Non-PASS variants
Mitochondrial variants
MNVs
Variants that do not meet the minimum depth (DP) threshold. Use the --vc-callability-tumor-thresh
command line option to specify the threshold value.
Variants that do not meet the minimum variant allele threshold. Use the --tmb-vaf-threshold
command line option to specify the threshold value.
Variants that fall outside the eligible regions.
Tumor driver mutations. Variants with a COSMIC count ≥ 50 are treated as tumor driver mutations. You can specify the cosmic driver threshold using the tmb-cosmic-count-threshold
command line option. The tumor driver mutations filter relies on Nirvana annotations and will additionally require settings for --enable-variant-annotation=true
, --variant-annotation-assembly
, and --variant-annotation-data
.
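The exclusion rules above can be read as a sequence of checks applied to each variant. The sketch below illustrates that reading. The `Variant` record, its field names, and the mitochondrial contig names are hypothetical, and the default thresholds are the documented defaults for the corresponding options.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    filter: str               # FILTER column value
    chrom: str
    is_mnv: bool
    depth: int                # DP
    vaf: float                # variant allele frequency
    in_eligible_region: bool
    cosmic_count: int         # observations in COSMIC (from annotation)


def tmb_exclusion_reason(v, min_depth=50, min_vaf=0.05, cosmic_threshold=50):
    """Return why a variant is excluded from TMB, or None if it is kept."""
    if v.filter != "PASS":
        return "non-PASS"
    if v.chrom in ("chrM", "MT", "M"):  # assumed mitochondrial contig names
        return "mitochondrial"
    if v.is_mnv:
        return "MNV"
    if v.depth < min_depth:
        return "low depth"
    if v.vaf < min_vaf:
        return "low VAF"
    if not v.in_eligible_region:
        return "outside eligible region"
    if v.cosmic_count >= cosmic_threshold:
        return "tumor driver"
    return None


kept = Variant("PASS", "chr1", False, 120, 0.20, True, 0)
driver = Variant("PASS", "chr7", False, 120, 0.30, True, 75)
print(tmb_exclusion_reason(kept), tmb_exclusion_reason(driver))
```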
By default, germline variants are not counted towards TMB. Variants are determined as germline based on a database or a proxi filter. The database germline filter can be disabled with tmb-skip-db-filter
. Disabling the database germline filter will effectively also disable the germline proxi filter.
Database filter
Variants with a population allele count ≥ 50 that are observed in either the 1000 Genome or gnomAD database will be marked as germline. Use germline-tagging-db-threshold
to change the population allele counts. The database germline filter relies on Nirvana annotations and requires settings for --enable-variant-annotation=true
, --variant-annotation-assembly
, and --variant-annotation-data
.
Proxi filter
Proxi filter can be enabled with tmb-enable-proxi-filter
. The proxi filter will flag any variants with VAF > 0.9 as germline. The proxi filter scans the variants surrounding a specific variant and identifies those variants with similar VAFs. The proxi window size that determines the number of surrounding variants can be specified with tmb-proxi-window-size
. If 95% (default value for tmb-proxi-fraction-threshold
) and no less than 5 (tmb-proxi-count-threshold
) of the surrounding variants of similar VAF are germline, then the current variant is also marked as germline.
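The neighborhood rule can be illustrated with the following sketch. It assumes variants are ordered by position, that the window counts variants before and after the target, and that "similar VAF" means within a fixed tolerance. The tolerance value and the function name are hypothetical, since this section does not define how similarity is measured.

```python
def proxi_is_germline(index, vafs, db_germline, window=500, vaf_tolerance=0.05,
                      fraction_threshold=0.95, count_threshold=5):
    """Decide whether the variant at `index` is flagged as germline.

    vafs: VAF per variant, ordered by position.
    db_germline: per-variant flag set by the database germline filter.
    """
    if vafs[index] > 0.9:
        return True
    lo = max(0, index - window)
    hi = min(len(vafs), index + window + 1)
    # Surrounding variants whose VAF is close to the target's VAF.
    similar = [j for j in range(lo, hi)
               if j != index and abs(vafs[j] - vafs[index]) <= vaf_tolerance]
    if not similar:
        return False
    germline = sum(1 for j in similar if db_germline[j])
    return (germline >= count_threshold
            and germline / len(similar) >= fraction_threshold)


vafs = [0.5] * 7
flags = [True, True, True, False, True, True, True]
print(proxi_is_germline(3, vafs, flags))
```

In this toy input the target has six neighbors of similar VAF, all marked germline by the database filter, so it is flagged as germline as well.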
Proxi filter can also be done via a probabilistic approach, which can be enabled with tmb-enable-prob-proxi
. It estimates the expected germline allele frequency using the surrounding germline variants and then tests whether the allele frequency of the target variant is similar to the expected germline allele frequency. The P value threshold can be set with tmb-prob-proxi-p-value
(the default value 1e-15 is set for ultra-deep sequenced samples, e.g. cfDNA).
Note that proxi filters can be too aggressive for 100% pure cell lines. The probabilistic proxi filter can be problematic for mixed or contaminated samples, as these samples do not have clear germline variant allele frequency distributions.
CH filter
When processing ctDNA samples it may be beneficial to also remove CH (clonal hematopoiesis) variants. Circulating tumor DNA generally has shorter fragment size. CH variants can be identified based on the insert size of the reads supporting the call. To capture the insert-size distribution for each variant call, it is required to specify vc-log-insert-size
during variant calling (step1). Once specified, potential CH variants based on insert size distribution will be labeled in the output. Additionally, CH variants can also be labeled via a bed file supplied to tmb-ch-bed
. Variants other than germline variants that overlap the region will be labeled as CH.
Nonsynonymous consequences are detected based on the Nirvana annotations. Nirvana variants that are annotated with the following consequences are labeled as nonsynonymous:
feature_elongation, feature_truncation, frameshift_variant, incomplete_terminal_codon_variant, inframe_deletion, inframe_insertion, missense_variant, protein_altering_variant, splice_acceptor_variant, splice_donor_variant, start_lost, stop_gained, stop_lost, transcript_truncation
TMB outputs a tmb.trace.csv file with detailed information on each variant used in the TMB score. The trace file contains a column "Nonsynonymous" that indicates the appropriate status for each variant.
The subset of filtered variants that are nonsynonymous is used as the numerator in the "Filtered Nonsyn Variant Count" metric.
TMB = Filtered Variants / Eligible Region (Mbp)
Nonsynonymous TMB = Filtered Nonsynonymous Variants / Eligible Region (Mbp)
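As a worked illustration of the two formulas, the sketch below computes both scores from variant counts and an eligible region size given in base pairs. The function name and the numbers are made up for the example.

```python
def tmb_score(variant_count, eligible_region_bp):
    """Variants per megabase of eligible region."""
    if eligible_region_bp <= 0:
        raise ValueError("eligible region must be larger than zero")
    return variant_count / (eligible_region_bp / 1e6)


# Illustrative numbers: 300 filtered variants, 210 of them nonsynonymous,
# over a 30 Mbp eligible region.
print(tmb_score(300, 30_000_000))
print(tmb_score(210, 30_000_000))
```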
The maximum somatic allele frequency (MSAF) mode outputs the estimated maximum somatic allele frequency of the sample, determined by finding the confident somatic variants with the highest allele frequency. MSAF is a rough approximation of the tumor fraction of cfDNA in peripheral blood samples. The MSAF mode can be enabled with tmb-enable-msaf
.
[Required]
--enable-tmb true
Enables TMB. If set, the small variant caller, Illumina Annotation Engine, and the related callability report are enabled.
[Recommended]
[QC coverage settings to override the default eligible region]
The protein coding region and the coverage settings can explicitly be specified using the qc-coverage
options listed below. All four settings must be specified to override the defaults. If DRAGEN does not automatically detect the reference it is required to specify these settings.
--qc-coverage-region-1
Specify the coding regions bed file to use.
--qc-coverage-tag-1=tmb
Required to associate these coverage settings with TMB. If this setting is not specified then DRAGEN will revert to default coding regions.
--vc-callability-tumor-thresh
Specify the somatic_callable bed minimum threshold. This limits the regions over which TMB is computed (default is 50).
--qc-coverage-reports-1=callability
. The callability report is required whenever it is desired to override the default TMB coverage settings.
[Optional settings]
--tmb-vaf-threshold
Specify the minimum VAF threshold for a variant. Variants that do not meet the threshold are filtered out (default=0.05)
--tmb-cosmic-count-threshold
The minimum number of observations in COSMIC for a variant to be considered a driver mutation. Driver mutations are not counted in TMB. This setting has very little impact on WES/WGS, but can help avoid bias in small panels (default=50)
--tmb-skip-db-filter
Skip database germline filtering. The database germline filter is required for tumor-only samples, but can be skipped for tumor-normal (default=false)
--germline-tagging-db-threshold
Specify the minimum allele count (total number of observations) for an allele in gnomAD or 1000 Genome to be considered a germline variant. Variant calls that have the same position and allele are excluded from the TMB calculation (default=50)
--tmb-germline-max-cosmic-count
Restrict the db-filter. Variants with cosmic allele count higher than this threshold will never be marked as germline. Set to 0 to disable. (default=0, range=[0;1000]).
--tmb-germline-min-vaf
Restrict the db-filter. Variants with a variant allele frequency lower than this threshold will never be marked as germline. Set to 0 to disable. (default=0, range=[0;1])
--tmb-enable-proxi-filter
Enable proxi filter functionality in germline filtering. This is an optional feature that may be appropriate for T/O runs. In T/O mode the DB germline filter may not be able to detect all germline variants, especially for ethnicity groups that are not well represented in germline databases. The proxi filter uses allele frequency information to help remove germline variants missed by the DB, and can help to obtain more accurate (lower) TMB values on samples with low tumor purity. In samples with high tumor purity this filter may be too aggressive and mark some somatic variants as germline, resulting in TMB scores that are too low. (default=false)
--tmb-proxi-count-threshold
Proxi filter surrounding variant count threshold in germline filtering (default=5)
--tmb-proxi-fraction-threshold
Proxi filter surrounding variant db filter fraction threshold in germline filtering (default=0.95)
--tmb-proxi-window-size
Number of surrounding variants before and after the target variant for proxi filter (default=500)
--tmb-ch-bed
Variants in the region will be labeled as clonal hematopoiesis (CH) variants.
--tmb-ch-insert-p-value
Minimum P value to classify a variant as CH using insert size (default=0.1)
--tmb-ch-insert-min-len
Minimum fragment size to test for CH using insert size (default=100)
--tmb-ch-insert-max-len
Maximum fragment size to test for CH using insert size (default=200)
--tmb-ch-insert-min-num
Minimum number of fragment size records to test for CH using insert size (default=50)
--tmb-enable-msaf
Enable MSAF output (default=false)
--tmb-msaf-p-value
Maximum P value (from insert size) to call a confident somatic variant (default=1e-5)
--tmb-msaf-rank-num
If no confident somatic variant is found, the variant at the specified rank is used (default=4)
The TMB values are output to <output prefix>.tmb.metrics.csv
. The file format uses the following CSV column convention, similar to other metric CSV files.
The TMB module also outputs a tmb.trace.csv file that provides detailed information on each variant that was included in the TMB calculation.
When enabling MSAF, the information is output to <output prefix>.tmb.msaf.csv
.
DRAGEN includes a dedicated human leukocyte antigen (HLA) genotyper for calling HLA class I and class II alleles with two-field resolution (a.k.a. four-digit resolution). At this resolution, DRAGEN HLA genotyper is able to discern and report HLA alleles based on their protein sequences. For more information on HLA nomenclature, see Nomenclature for factors of the HLA system¹.
Class I HLA typing is enabled by setting the --enable-hla
flag to true
. Additionally, class II HLA typing is enabled by setting the --hla-enable-class-2
flag to true
. For TSO500-solid or TSO500-liquid runs, HLA typing should be enabled instead through the following batch options: --tso500-solid-hla=true
and --tso500-liquid-hla=true
respectively. NOTE: class II HLA typing is not supported for TSO500 runs.
The HLA Caller primarily executes the following four steps:
Extract reads mapped to the HLA genes. These are the HLA-A, -B, and -C loci for class I, and the HLA-DQA1, -DQB1, and -DRB1 loci for class II. The human reference version is auto-detected during this step. The human reference builds hg19, hs37d5, and GRCh38 are fully supported; the CHM13 build is enabled but not supported.
Align the extracted HLA reads to a reference set of 9,086 HLA alleles using the DRAGEN map-align processor. Only full-sequence alleles from the IMGT/HLA database (v3.45) that have also been reported on the Allele Frequency Net database were selected in building the default HLA reference resource.
Filter out HLA-specific alignments with sub-maximal alignment scores, and optimize the read distribution using Expectation-Maximization.
Select the most likely genotype for each HLA locus from a short list of candidate alleles using a homozygosity threshold set at 20%.
The reference directory that is supplied at command-line with --ref-dir
must contain anchored_hla
, a specific subdirectory with HLA-specific reference files. The DRAGEN default reference directories have been updated to contain the anchored_hla
subdirectory.
An HLA-specific reference subdirectory can be built by executing
This command will create anchored_hla
as a subdirectory of the target {REF-DIR}
supplied as an argument to --output-directory
as above.
The HLA-specific reference subdirectory can be built at the same time as the primary reference construction. An example command-line for this mode is
An HLA resource file, HLA_resource.v2.fasta.gz
, is packaged with DRAGEN. It is located at <INSTALL_PATH>/resources/hla/HLA_resource.v2.fasta.gz
An HLA allele reference FASTA file can be used as input to the hash-table building option --ht-hla-reference
.
Note: Using custom HLA reference files to generate the HLA-specific reference subdirectory anchored_hla
is not recommended, as accuracy cannot be guaranteed.
Custom input FASTA files (which can be zipped or unzipped) must contain only HLA allele sequences, and all allele names must adhere to the HLA star-allele nomenclature¹, where the first character of each allele name indicates the HLA locus, e.g. A*02:01:01:01. Allele names extracted from such a custom input file start at the first character of the allele name (to be preceded by character '>') and end at the last character of the name or until the first delimiter character '-' is reached.
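The name-extraction rule above can be sketched in a few lines of Python. This is an illustrative reading of the rule, not DRAGEN code: it assumes the allele name is the first whitespace-delimited token after '>', and that the locus is the part of the name before the '*' character. Both helper names are hypothetical.

```python
def extract_allele_name(header_line: str) -> str:
    """Return the allele name from a FASTA header such as '>A*02:01:01:01'."""
    if not header_line.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    # Assumption: the name is the first whitespace-delimited token after '>'.
    tokens = header_line[1:].split()
    if not tokens:
        raise ValueError("empty FASTA header")
    # The name ends at the first '-' delimiter, if one is present.
    return tokens[0].split("-", 1)[0]


def locus_of(allele_name: str) -> str:
    """Return the HLA locus, taken here as the text before the '*'."""
    locus, sep, _ = allele_name.partition("*")
    if not sep or not locus:
        raise ValueError(f"not a star-allele name: {allele_name!r}")
    return locus
```

A check like this can be run over a custom FASTA file before building the reference to catch headers that do not follow the star-allele nomenclature.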
The following is an illustration of a valid HLA reference input file to option --ht-hla-reference
:
Custom HLA reference files might require customized memory allocation, which can be specified with an argument to the command-line option --ht-hla-ext-table-alloc
.
The HLA component has no additional user-settable command-line options.
Note: this HLA component replaces prior workflows. See the appropriate guide for the DRAGEN software version being used in order to determine valid parameters.
The HLA Caller requires the DRAGEN mapper-aligner to be enabled (enabled via option --enable-map-align=true
, or through TSO500-batch options).
The HLA Caller generates a tab-delimited output file for class I and, if enabled, class II alleles. Class I results contain six class I alleles, with two alleles per class I HLA gene (HLA-A, -B and -C), and class II results contain six class II alleles, with two alleles per class II HLA gene (HLA-DQA1, -DQB1, and -DRB1). Homozygous calls show identical alleles at the respective loci.
The genotype output file is <prefix>.hla.tsv
, and it is located in the user-specified output directory. In tumor-only mode the output is stored to <prefix>.hla.tumor.tsv
file. In tumor-normal mode, two output genotype files are generated from tumor and normal samples: <prefix>.hla.tumor.tsv
and <prefix>.hla.tsv
.
In all cases, the genotype file contains a header row with one column for each of the class I and/or class II alleles and a body row with the HLA type of each allele at two-field resolution.
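Because the genotype file is a single header row plus a single body row, it can be read into a dictionary with the standard library. The sketch below assumes nothing about the column names beyond what is in the file; the names used in the usage note (A1, A2, and so on) are hypothetical placeholders.

```python
import csv
import io


def read_hla_genotype(tsv_text: str) -> dict:
    """Map each header column of an HLA genotype TSV to its allele call."""
    rows = [row for row in csv.reader(io.StringIO(tsv_text), delimiter="\t") if row]
    if len(rows) != 2:
        raise ValueError("expected exactly one header row and one body row")
    header, body = rows
    if len(header) != len(body):
        raise ValueError("header and body rows have different column counts")
    return dict(zip(header, body))


def is_homozygous(genotype: dict, first_column: str, second_column: str) -> bool:
    """Homozygous calls show identical alleles in both columns of a locus."""
    return genotype[first_column] == genotype[second_column]
```

With a file whose header is `A1<TAB>A2` and whose body is `A*02:01<TAB>A*02:01`, `is_homozygous(genotype, "A1", "A2")` returns True.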
The following is an example output file produced by DRAGEN class I and II HLA typing:
The HLA Caller generates two additional HLA files.
<prefix>.hla_metrics.csv
—Contains the number of reads supporting each allele result (individual reads may support multiple alleles), and the total number of HLA reads analyzed.
<prefix>.hla_2field_EM.tsv
—Contains the maximal likelihood output from the Expectation-Maximization step: a list of candidate alleles at two-field resolution and corresponding intermediate posterior probability.
Internal checks for sufficient coverage at each HLA locus will trigger a warning message when fewer than 50 reads support any given allele call, or when fewer than 300 HLA reads are detected overall. In both settings, an allele call will still be attempted, but the results may be unreliable.
An empty genotype call at a given HLA locus is returned when there are no reads supporting that locus. In this scenario, a warning message will indicate missing coverage.
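The two coverage thresholds can be expressed as a small check. This is a sketch for reasoning about the warnings, not the DRAGEN implementation; the function name, argument shapes, and message texts are hypothetical, and only the thresholds (50 reads per allele, 300 HLA reads overall) come from the description above.

```python
MIN_READS_PER_ALLELE = 50
MIN_TOTAL_HLA_READS = 300


def coverage_warnings(reads_per_allele: dict, total_hla_reads: int) -> list:
    """Return warning strings for low-coverage HLA calls (hypothetical wording)."""
    warnings = []
    if total_hla_reads < MIN_TOTAL_HLA_READS:
        warnings.append(f"only {total_hla_reads} HLA reads detected overall")
    for allele, n_reads in sorted(reads_per_allele.items()):
        if n_reads < MIN_READS_PER_ALLELE:
            warnings.append(f"only {n_reads} reads support {allele}")
    return warnings
```

An empty list means neither threshold was crossed. A non-empty list does not block the call; it only flags results that may be unreliable.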
No HLA genotype will be returned with single-end DNA read inputs.
By default, DRAGEN only genotypes HLA alleles that have full-nucleotide sequence data in IMGT/HLA v3.45 and that have also been reported on the Allele Frequency Net database. As such, no partial alleles are currently called using the supplied resource reference FASTA file HLA_resource.v2.fasta
.
The HLA Caller accepts standard input files in FASTQ or BAM format.
The following example command line uses FASTQ file inputs.
The following example command line uses BAM file inputs (with map-align enabled). NOTE: the --hla-enable-class-2
enables class II HLA typing.
The following example command line uses tumor-normal paired file inputs from FASTQ.
The following example command line activates HLA typing in a TSO500-solid run from FASTQ input. A TSO500-compatible reference_directory is one which uses the same reference genome as in TSO i.e. hg19.
The following example command line activates HLA typing in a TSO500-liquid run from FASTQ input. A TSO500-compatible reference_directory is one which uses the same reference genome as in TSO i.e. hg19.
¹Marsh SG, et al. Nomenclature for factors of the HLA system, 2010. Tissue Antigens. 2010 75:291-455.
The DRAGEN Structural Variant (SV) Caller integrates and extends Manta structural variant calling methods to provide SV and indel calls 50 bases or larger. SVs and indels are called from mapped paired-end sequencing reads. The SV caller is optimized for analysis of diploid germline variation in small sets of individuals and somatic variation in tumor-normal sample pairs.
The SV caller performs the following actions:
Discovers, assembles, and scores large-scale SVs, medium-sized indels, and large insertions within a single efficient workflow.
Combines paired and split-read evidence during SV discovery and scoring to improve accuracy, but does not require split-reads or successful breakpoint assemblies to report a variant in cases where there is strong evidence otherwise.
Scores known SV deletions and insertions from an input VCF file against one or more input samples, either as a standalone procedure or together with standard SV discovery.
Provides scoring models for 1) germline variants in small sets of diploid samples, 2) somatic variants in matched tumor-normal sample pairs, and 3) somatic and germline variants in tumor-only samples.
All SV and indel inferences are output in VCF 4.2 format.
The DRAGEN SV Caller divides the SV and indel discovery process into the following steps.
Scans the genome or a subset of the genome (specified by the call regions) to build various genome-wide data structures, including a breakend association graph of all SV associated regions. The graph contains edges that connect all regions of the genome that have a possible breakend association. Edges can connect two different regions of the genome to represent evidence of a long-range association, or an edge can connect to a region to capture a local indel/small SV association. These associations are more general than a specific SV hypothesis and multiple breakend candidates might be found on one edge. Typically only one or two candidates are found per edge. Instead of passing an inclusion region BED file, an exclusion region BED file can be passed to DRAGEN so that any SV breakend that overlaps with these regions gets removed from downstream analyses. The excluded regions are removed from the graph building process, but active regions can get extended and present in the excluded regions in the refinement step. This can happen for the active regions that are close to the boundaries of the excluded regions. Hence, the final SV calls may still get extended to these regions.
Analyzes the breakend association graph to discover candidate SVs, then scores discovered candidate SVs and any known SVs from the input. Analysis and scoring are performed as follows.
Infers SV candidates that are associated with the given graph edge.
Assembles the SV breakends.
Merges discovered SV candidates with any known SV candidates included in the input data.
Scores/genotypes and filters each SV candidate under various biological models (currently germline, tumor-normal, and tumor-only).
Outputs scored SVs to VCF.
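The breakend association graph described in the first step can be pictured as a multiset of region pairs. The sketch below is a simplified illustration under stated assumptions: regions are plain string labels, each read-level observation contributes one count to an undirected edge, and an edge from a region to itself stands for a local indel or small SV association. The function names are hypothetical; the default of 3 mirrors the documented default of the --sv-min-edge-observations option.

```python
from collections import Counter


def build_edge_counts(observations):
    """Count evidence per undirected edge.

    observations: iterable of (region_a, region_b) pairs; a pair with
    region_a == region_b is a self-edge for a local association.
    """
    counts = Counter()
    for region_a, region_b in observations:
        counts[tuple(sorted((region_a, region_b)))] += 1
    return counts


def prune_edges(counts, min_observations=3):
    """Drop edges supported by fewer than min_observations observations."""
    return {edge: n for edge, n in counts.items() if n >= min_observations}
```

Candidate SVs are then sought only on the edges that survive pruning, and one edge may yield more than one breakend candidate.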
For each structural variant and indel, the SV Caller attempts to assemble the breakends by gathering nearby evidential reads, and to call SV events to base-pair resolution by aligning the assemblies against the reference genome. The SV Caller then reports the left-shifted breakend coordinate (per the VCF 4.2 SV reporting guidelines), together with any breakend homology sequence and/or inserted sequence between the breakends. As a result, the reported coordinates of an SV event may not directly match the read alignments as displayed in IGV. The assembly often fails to provide a confident explanation of the data, especially in repeat regions. In such cases, the SV Caller does not report single-base resolution breakpoints or the associated split-read support. Instead, it approximates the event breakpoints, scores the event under the same unified likelihood model used in regular cases, and reports the variant as IMPRECISE.
You can provide known SVs as input for forced genotyping. This known SV input can be scored either standalone or together with the standard SV discovery workflow, in which case the known and discovered SVs are merged.
The sequencing reads provided as input to the SV Caller are expected to be from a paired-end sequencing assay that results in an "innie" orientation between the two reads of each sequence fragment, each presenting a read from the outer edge of the fragment insert inward.
The SV Caller is primarily tested for whole-genome and whole-exome (or other targeted enrichment) sequencing assays on DNA. For these assays the following applications are supported:
Joint analysis of 5 or fewer diploid individuals
Subtractive analysis of a matched tumor-normal sample pair
Analysis of an individual tumor sample
For joint analysis, there is no specific restriction against larger cohorts, but there might be stability or call quality issues.
Tumor samples can be analyzed without a matched normal sample. In this case, both germline and somatic variants are scored and reported in the output.
The SV Caller can discover all variation classes that can be explained as novel DNA adjacencies in the genome. Novel DNA adjacencies are classified into the following categories based on the breakend pattern:
Deletions
Insertions
Insertions in the result are divided into two subclasses, depending on whether the inserted sequence can be fully assembled: 1) fully-assembled insertions; 2) partially-assembled (inferred) insertions.
Mobile Element Insertions that are not called by the general purpose SV routine will be rescued by the MEI specific routine based on similarity between assembled contigs and known sequences in the MEI catalog described in the file <INSTALL_PATH>/config/sv_mobile_element_sequences.fa
.
Tandem Duplications
Inversions
Unclassified breakend pairs corresponding to intra- and inter-chromosomal translocations, or complex structural variants.
The SV Caller cannot directly discover the following variant types:
Dispersed duplications.
Dispersed duplications may be indirectly called as insertions or unclassified breakends.
Most expansion/contraction variants of a reference tandem repeat.
Breakends corresponding to small inversions.
The limiting size is not tested, but in theory, detection falls off below ~200 bases. Micro-inversions might be detected indirectly as combined insertion/deletion variants.
Fully-assembled large insertions.
The maximum fully-assembled insertion size should correspond to approximately twice the read-pair fragment size, but power to fully assemble the insertion should fall off to impractical levels before this size.
The SV Caller does detect and report very large insertions when the breakend signature of such an event is found, even though the inserted sequence cannot be fully assembled.
More general repeat-based limitations exist for all variant types:
Power to assemble variants to breakend resolution falls to zero as breakend repeat length approaches the read size.
Power to detect any breakend falls to (nearly) zero as the breakend repeat length approaches the fragment size.
While the SV Caller classifies certain novel DNA-adjacencies into variant classes, it has a limited ability to infer high-level events resulting from complex rearrangements, so certain calls summarized as deletions, duplications, and insertions might be better described by looking at the full system of breakends and copy number changes associated with a given event.
The DRAGEN SV caller is capable of forced genotyping a set of SVs input from a VCF file. Forced genotyping means that the input SVs are scored and emitted in the output of the SV Caller even if the variant is not supported in the sample data. For example, given a germline analysis, the input variants are processed and written to the output VCF, even if the variant quality falls below the threshold normally required for an SV to be emitted.
Forced genotyping typically enables known SVs to be detected at higher recall than standard SV discovery (particularly for SV discovery on a lower-depth sample). Forced genotyping can also be useful to assert against the presence of an SV allele. For example, you can use forced genotyping to distinguish a confident homozygous reference genotype from a lack of sequencing coverage over the SV locus.
Forced genotyping SVs are processed according to the current SV analysis being run. For example, if a germline analysis is configured by providing one or more normal samples as input, then the input SVs are scored under a germline model.
Forced genotyping alleles are always emitted in the output and might have modified scoring and filtering rules applied compared to SVs only discovered from the sample data.
Forced Genotyping can be run in two modes.
Standalone --- Only the SVs described in an input VCF are scored and emitted.
Integrated --- The standard SV discovery analysis is run and the results are merged with SVs scored from the forced genotyping input. The workflow outputs the union of SVs discovered from the sample data and any additional forced genotyping alleles. The workflow is run whenever the --sv-discovery
option is true.
You can specify forced genotyping input using the --sv-forcegt-vcf
option. The input must be a VCF of SV alleles. The SV allele types are restricted to insertions, deletions, tandem duplications, and breakends that are not labeled with the INFO/IMPRECISE
flag. The following are the filtering criteria required for the VCF record to be processed as an input SV allele. If any of the criteria are not met, the VCF record is removed from the set of input SVs for forced genotyping. When a forced genotyping VCF is specified on the command line, the SV caller reports the total number of SV records used as input SVs and the total number of records filtered (if any) due to the following criteria.
Describes an insertion, deletion, tandem duplication, or breakend record.
Cannot contain the INFO/IMPRECISE
flag.
Cannot contain multiple ALT alleles.
Has a FILTER
value of PASS
or unknown (.
).
All indels are at least the minimum scored variant size (default is 50).
Cannot repeat an SV allele previously described in the same file.
The REF
field cannot be empty or unknown (.
).
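The criteria above can be prototyped as a pre-check on a candidate input VCF. The following is a sketch on a simplified record type, not the DRAGEN parser: the `VcfRecord` class, the reason strings, and the use of `<DUP:TANDEM>` as the tandem duplication symbol are assumptions for illustration, and the size of a symbolic deletion is taken as END minus POS.

```python
from dataclasses import dataclass, field


@dataclass
class VcfRecord:
    chrom: str
    pos: int
    ref: str
    alts: list
    filt: str = "PASS"
    info: dict = field(default_factory=dict)


def sv_type(rec):
    """Classify a single-ALT record; None means the type is not accepted."""
    alt = rec.alts[0]
    if "[" in alt or "]" in alt:
        return "BND"
    if alt == "<DEL>":
        return "DEL"
    if alt == "<DUP:TANDEM>":
        return "DUP"
    if alt.startswith("<"):
        return None  # e.g. <INS> or <INV>, which are not accepted
    if len(alt) > len(rec.ref):
        return "INS"
    if len(alt) < len(rec.ref):
        return "DEL"
    return None


def indel_size(rec):
    alt = rec.alts[0]
    if alt.startswith("<"):
        return rec.info["END"] - rec.pos
    return abs(len(alt) - len(rec.ref))


def allele_key(rec):
    return (rec.chrom, rec.pos, rec.ref, tuple(rec.alts), rec.info.get("END"))


def rejection_reason(rec, seen, min_size=50):
    """Return why a record is filtered, or None if it is accepted."""
    if len(rec.alts) != 1:
        return "multiple ALT alleles"
    if rec.ref in ("", "."):
        return "empty or unknown REF"
    if rec.filt not in ("PASS", "."):
        return "FILTER is not PASS or unknown"
    if "IMPRECISE" in rec.info:
        return "IMPRECISE flag set"
    if rec.alts[0].startswith("<") and "END" not in rec.info:
        return "symbolic ALT without INFO/END"
    kind = sv_type(rec)
    if kind is None:
        return "unsupported SV type"
    if kind in ("INS", "DEL") and indel_size(rec) < min_size:
        return "indel below minimum scored size"
    if allele_key(rec) in seen:
        return "duplicate SV allele"
    return None


def select_forcegt_records(records, min_size=50):
    """Split records into accepted inputs and (record, reason) rejections."""
    accepted, rejected, seen = [], [], set()
    for rec in records:
        reason = rejection_reason(rec, seen, min_size)
        if reason is None:
            seen.add(allele_key(rec))
            accepted.append(rec)
        else:
            rejected.append((rec, reason))
    return accepted, rejected
```

The lengths of the two returned lists correspond to the counts of used and filtered records that the SV Caller reports when a forced genotyping VCF is supplied.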
You must describe insertions using the VCF small indel format, including an ALT
entry that describes the complete insertion sequence. Using <INS>
as a symbolic alt allele is not accepted. You can describe deletions using either the VCF small indel format or the <DEL>
symbolic alt allele. For any variant described using a symbolic alt allele, you must also provide a value for INFO/END
. Inversions represented in a single VCF record using the <INV>
alt allele are not accepted, but the inversion can be genotyped if converted to a set of breakend records. Each breakpoint is described by a pair of breakend VCF records. If the forced genotyping input contains just one record of the pair and the input conditions above are met, the input is still accepted for forced genotyping, and the distal breakend is inferred from the local record.
You can describe breakpoint insertions for non-insertion SV alleles using one of the following two methods. Both methods correspond to the format used to describe breakpoint insertions in the SV VCF output.
For SVs described using the symbolic ALT
format, such as <DEL>
, the INFO/SVINSSEQ
field is parsed to read the breakpoint insertion sequence.
For smaller indels described directly in the REF
and ALT
fields, the contents of the ALT
field describe the breakend sequence.
Forced genotyping SVs are always output to the standard VCF output of the SV Caller, regardless of whether the forced genotyping is standalone or integrated with SV calling. When the same SV allele is independently discovered from the sample data, only the discovered SV appears in the final output. The discovered SV allele is annotated to indicate the match to a forced genotyping input SV, and the scoring and filtration rules are changed to match.
VCF output records influenced by forced genotyping have the following associated fields.
The flag INFO/NotDiscovered
is set for any VCF record that was not independently discovered from the sample data. When forced genotyping is run standalone, all output records contain the flag. When integrated with SV calling, the flag can distinguish the SV alleles that would not have been discovered in a standard SV analysis.
For these variants only, the usual SV Caller ID field generated from the SV Locus graph is not available. Instead, the ID is taken from the corresponding user input VCF. The suffix UserInput${InputVCFRecordNumber}
is appended to the ID, separated by an underscore. If your input VCF contains only one of the two VCF records that comprise a breakend variant, then the ID is taken from the mate breakend record and the _Mate
suffix is added.
Any output VCF record that corresponds to a forced genotyping input VCF record has the value INFO/UserInputId=${ID}
set to reflect the VCF ID value of the input VCF record. The corresponding record might have also been discovered independently from the sample data and might not have the INFO/NotDiscovered
flag set.
Any output VCF record whose alleles exactly match a forced genotyping input SV has the flag INFO/KnownSVScoring
. VCF records with this flag are always emitted in the output of the SV Caller. Several filters, such as MaxDepth, are not applied.
When DRAGEN-SV is used in the somatic mode (tumor-only or tumor-normal), a BEDPE file with a set of paired-end regions in the BEDPE file format can be specified to filter out sequencing / systematic noise and also recurrent germline calls. Any variant that overlaps with one of the systematic noise paired regions (with a population count of at least 2) and has the same orientation will be marked as SystematicNoise
in the final VCF file. This BEDPE file can be passed via the command line option --sv-systematic-noise
.
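The matching rule can be sketched as follows. This is an illustration under stated assumptions, not the DRAGEN implementation: the two dataclasses and their field names are hypothetical, coordinates are treated as 0-based with half-open region intervals (the usual BEDPE convention), and "overlap" is taken to mean that both breakends fall inside the corresponding regions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseRegionPair:
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    strand1: str
    strand2: str
    population_count: int


@dataclass(frozen=True)
class BreakendPair:
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def is_systematic_noise(variant, noise_regions, min_population_count=2):
    """True if the variant matches a sufficiently recurrent noise region pair."""
    for region in noise_regions:
        if region.population_count < min_population_count:
            continue
        same_orientation = (variant.strand1, variant.strand2) == (
            region.strand1,
            region.strand2,
        )
        overlaps = (
            variant.chrom1 == region.chrom1
            and region.start1 <= variant.pos1 < region.end1
            and variant.chrom2 == region.chrom2
            and region.start2 <= variant.pos2 < region.end2
        )
        if same_orientation and overlaps:
            return True
    return False
```

A region pair seen in only one normal sample is skipped here because its population count is below 2.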
The systematic noise BEDPE file is built using VCFs that were generated by the DRAGEN-SV tumor-only pipeline when run on normal samples that do not necessarily match to the subject the tumor sample was taken from. The file might contain several dozen samples.
You can generate systematic noise BEDPE files from normal samples collected using library prep, sequencing system, and panels.
To generate a BEDPE file, do as follows.
Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.
Build the BEDPE file from the VCFs by using the --sv-build-systematic-noise-vcfs-list option, which takes the list of input VCFs from the previous step (one VCF per line). An example command line is provided below.
The systematic noise BEDPE should follow a particular format
The SV Caller applies a diploid scoring model for one or more diploid samples (treated as unrelated), as well as a somatic scoring model when a tumor and matched normal sample pair are given.
The germline scoring model produces diploid genotype probabilities for each candidate structural variant. Most candidates are approximated as independent for scoring purposes and modeled under a bi-allelic, diploid genotype likelihood setting. DRAGEN solves for the posterior probability over possible genotypes given the sequencing data by approximating it as proportional to the product of the prior probability of a genotype and the conditional probability of observing a set of read fragments (post-filtering) given the underlying genotype. DRAGEN treats each read fragment independently and represents the conditional probability of the set of read fragments as the product of the conditional probabilities of the individual read fragments. For each individual read fragment's conditional probability, DRAGEN combines both paired-read and split-read evidence components and approximates their contributions as independent by representing the probability as a product of these components. The paired-read component is weighted by a linear ramp from one to zero that depends on the candidate event type and size, because a tiny event does not significantly affect paired-read mapping status.
The paired-read component is modeled as a function measuring the deviation of the inferred fragment length from the overall distribution.
The split-read component is modeled as a function measuring the correctness of a read alignment to a breakend, computed by multiplying, across all non-gap bases, the probability of observing a given base call given the corresponding base of the evaluated allele.
Each read fragment may contribute only paired-read support, only split-read support, or both. Where a fragment contributes split-read support, this support may come from either or both reads in the read pair.
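The structure of the germline model can be made concrete with a small numerical sketch. This is not the DRAGEN implementation. It assumes three diploid genotypes with expected alternate-allele fractions of 0, 0.5, and 1, that each fragment's likelihood under a genotype is the corresponding mixture of its reference-allele and alternate-allele likelihoods, and that the paired-read weight acts as an exponent on the paired-read component. The ramp endpoints and all function names are hypothetical, and all likelihoods must be strictly positive.

```python
import math

# Expected alternate-allele fraction for each diploid genotype (assumption).
ALT_FRACTION = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}


def pair_weight_for_size(size, ramp_start, ramp_end):
    """Linear ramp from 0 (tiny events) to 1 (large events)."""
    if size <= ramp_start:
        return 0.0
    if size >= ramp_end:
        return 1.0
    return (size - ramp_start) / (ramp_end - ramp_start)


def fragment_likelihood(pair_lik, split_lik, pair_weight):
    """Product of the weighted paired-read and the split-read components."""
    return (pair_lik ** pair_weight) * split_lik


def genotype_posteriors(fragments, priors, pair_weight):
    """Posterior over genotypes, proportional to prior times fragment product.

    fragments: list of dicts with keys 'ref' and 'alt', each holding a
    (paired_read_likelihood, split_read_likelihood) tuple for that allele.
    """
    log_post = {}
    for genotype, alt_fraction in ALT_FRACTION.items():
        total = math.log(priors[genotype])
        for frag in fragments:
            ref_lik = fragment_likelihood(*frag["ref"], pair_weight)
            alt_lik = fragment_likelihood(*frag["alt"], pair_weight)
            total += math.log((1 - alt_fraction) * ref_lik + alt_fraction * alt_lik)
        log_post[genotype] = total
    # Normalize in log space to avoid underflow with many fragments.
    peak = max(log_post.values())
    unnormalized = {g: math.exp(v - peak) for g, v in log_post.items()}
    norm = sum(unnormalized.values())
    return {g: v / norm for g, v in unnormalized.items()}
```

With a weight of 0 the paired-read term drops out entirely, which matches the intent that tiny events are scored on split-read evidence alone.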
The somatic scoring model is a Bayesian probabilistic model using a tumor-normal joint genotyping approach. It aims to call somatic structural variants in tumors while avoiding germline variants and noisy variants. In this model, the tumor and normal allele frequencies are treated as nonindependent random variables. DRAGEN calculates posterior probabilities for a range of genotype hypotheses, under the assumption that the normal sample conforms to the diploid germline genotype considering homozygous reference, heterozygous, and homozygous states. The tumor sample is a mixture of the germline genotype and, if present, the somatic allele. For the somatic genotype, we consider only two states referring to the absence and presence of the somatic variant in the tumor sample. In cases where the somatic variant is not present, we account for unsystematic independent noise, while assuming an error-free scenario when the somatic variant is considered. To calculate the genotype likelihoods, the model integrates allele frequency likelihoods over the joint tumor and normal allele frequencies and applies modifications to address liquid tumors with Tumor-in-Normal (TiN) contamination. The integration is approximated with a discrete summation. In these calculations, the likelihood for each read to support a given allele is shared with the germline scoring model. The tumor-only somatic scoring model is seen as a special case of the somatic scoring model in the absence of normal data (zero coverage). The posterior probability is converted into a Phred quality score and reported in the VCF output INFO/SOMATICSCORE field.
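The conversion from a posterior probability to a Phred-scaled score follows the standard definition Q = -10 * log10(1 - P). The sketch below shows that conversion only. The cap applied when the posterior is numerically 1 is a hypothetical choice for illustration and is not taken from DRAGEN, which may also round or bound the value differently.

```python
import math


def phred_from_posterior(p_somatic, max_score=999.0):
    """Phred-scale the probability that the somatic call is wrong."""
    if not 0.0 <= p_somatic <= 1.0:
        raise ValueError("posterior must be between 0 and 1")
    error = 1.0 - p_somatic
    if error <= 0.0:
        return max_score  # hypothetical cap for a posterior of exactly 1
    return min(max_score, -10.0 * math.log10(error))
```

Under this definition a posterior of 0.99 maps to a score of about 20, and a posterior of 0.999 to about 30.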
The SV Caller is optimized for paired-end libraries where the fragment size is typically larger than the size of both reads. Overlapping read pairs can be used to discover SVs, but might not always be handled optimally. For libraries where the typical fragment size is less than the read length, the SV caller attempts to differentiate reads sequencing into adapter sequence from the variant signal. In such cases, the SV Caller's input quality checks may fail and cause SV analysis to be skipped.
If using the standalone mode, your BAM/CRAM inputs must first be mapped. If you have not mapped and aligned your data yet, you can generate an alignment file.
If running from a mapped and aligned BAM, then the contigs specified in the header must strictly match those in the DRAGEN hashtable specified in the current analysis. Missing or extra contigs will lead to a "Reference genome mismatch" configuration error and the analysis will not proceed. If such an error is observed, it is recommended to regenerate the alignment file with the intended DRAGEN hashtable, or to run with the DRAGEN map/align module enabled.
The SV Caller runs quality checks on the input sequencing reads for each sample to make sure that the input corresponds to a paired-read assay with the expected FR orientation, before estimating the fragment size distribution. To check consensus read pair orientation, a subset of high-quality read pairs is sampled. At least 90% of these must have the expected FR orientation for SV analysis to continue. Otherwise, the SV Caller issues a warning, skips any further analysis, and writes empty results to its output files.
The SV Caller can tolerate nonpaired reads in the input, if sufficient paired-end reads exist to estimate the fragment size distribution. To estimate the fragment size distribution, the SV Caller requires at least 100 read pairs that meet the quality requirements of the estimation routine. Both reads of the pair must have a non-zero mapping quality to the same chromosome, are not filtered or part of a split read mapping, and do not contain indels or soft-clipping. If a sample does not contain a sufficient number of such read pairs, the SV Caller issues a warning, skips any further analysis, and writes empty results to its output files.
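The two input checks can be summarized in code. This is a sketch of the stated rules only, not the DRAGEN sampling or estimation routine: the `ReadPair` type and function names are hypothetical, and CIGAR strings are inspected with a simple character test for insertions (I), deletions (D), and soft clips (S).

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    chrom2: str
    mapq1: int
    mapq2: int
    cigar1: str
    cigar2: str
    filtered: bool = False
    split: bool = False


def passes_orientation_check(n_fr_pairs, n_sampled_pairs, min_percent=90):
    """At least 90% of sampled high-quality pairs must be in FR orientation."""
    return n_sampled_pairs > 0 and n_fr_pairs * 100 >= min_percent * n_sampled_pairs


def usable_for_fragment_size(pair):
    """Apply the quality requirements for fragment size estimation."""
    clean_cigars = not any(op in pair.cigar1 + pair.cigar2 for op in "IDS")
    return (
        pair.chrom1 == pair.chrom2
        and pair.mapq1 > 0
        and pair.mapq2 > 0
        and not pair.filtered
        and not pair.split
        and clean_cigars
    )


def can_estimate_fragment_size(pairs, min_pairs=100):
    """At least 100 usable read pairs are required."""
    return sum(1 for pair in pairs if usable_for_fragment_size(pair)) >= min_pairs
```

If either check fails, the described behavior is a warning followed by empty SV output rather than a hard error.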
The SV Caller disregards any read group labels applied to the input sequences. Each input sample is treated as a separate library with a single fragment size distribution.
At least one BAM or CRAM file must be provided for the normal or tumor sample. A matched tumor-normal sample pair can be provided as well. If multiple input files are provided for the normal sample, each file is treated as a separate sample as part of a joint diploid sample analysis.
In standalone mode, input BAM or CRAM files have the following limitations:
Alignments cannot have an unknown read sequence (SEQ="*")
Alignments cannot contain the "=" character in the SEQ field.
Alignments cannot use the sequence match/mismatch ("="/"X") CIGAR notation.
RG (read group) tags in the alignment records are ignored. Each alignment file is treated as representing one sample.
Alignments with base call quality values greater than 70 are rejected. These are not supported on the assumption that this indicates an offset error.
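These limitations can be checked on individual alignment records before running the SV Caller in standalone mode. The sketch below works on already-parsed fields (the SEQ string, the CIGAR string, and a list of integer Phred qualities); the function name and message texts are hypothetical, and reading the BAM or CRAM file itself is out of scope here.

```python
MAX_BASE_QUALITY = 70


def standalone_input_problems(seq, cigar, base_qualities):
    """Return the list of limitations that one alignment record violates."""
    problems = []
    if seq == "*":
        problems.append("unknown read sequence")
    if "=" in seq:
        problems.append("'=' character in SEQ")
    if "=" in cigar or "X" in cigar:
        problems.append("match/mismatch CIGAR notation")
    if any(q > MAX_BASE_QUALITY for q in base_qualities):
        problems.append("base quality above 70")
    return problems
```

An empty list means the record satisfies all four limitations.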
The following command-line examples show how to run the DRAGEN map/align pipeline depending on your input type. The map/align pipeline generates an alignment file in the form of a BAM or CRAM file that can then be used in the pipeline.
You need to generate alignment files for all samples that have not already been mapped and aligned. Each sample must have a unique sample identifier. Use the --RGSM option to specify the identifier. For BAM and CRAM input files, the sample identifier is taken from the file, so the --RGSM option is not required.
The following example command maps and aligns a FASTQ file:
The following example command maps and aligns an existing BAM file:
The following example command maps and aligns an existing CRAM file:
The SV caller can be configured for targeted sequencing inputs, which disables high-depth filters. Exome mode can be directly set to true or false with the command line option --sv-exome
. If not set directly, exome mode defaults to false unless you run the SV Caller in integrated mode and the sequencing input is no more than 50 Gb.
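The default resolution for exome mode can be written as a small decision function. This is a sketch of the rule as described, with hypothetical names; `None` stands for the option not being set on the command line, and the 50 Gb limit is taken from the text.

```python
def resolve_exome_mode(sv_exome, integrated, input_gb):
    """Return the effective exome mode setting.

    sv_exome: True, False, or None when the option is not set.
    integrated: True when the SV Caller runs with the mapper/aligner.
    input_gb: amount of sequencing input in Gb.
    """
    if sv_exome is not None:
        return sv_exome
    return integrated and input_gb <= 50
```

An explicit setting always wins, so a standalone run on targeted data needs the option set to true by hand.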
You can use the --sv-somatic-ins-tandup-hotspot-regions-bed ${BEDFILE}
option to specify ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from <INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed
. The file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with some padding on both sides (300 bps). To disable this feature, enter --sv-enable-somatic-ins-tandup-hotspot-regions false
.
In liquid tumor mode, liquid tumors refer to hematological cancers, such as leukemia. In tumor-normal analysis, liquid tumor mode accounts for Tumor-in-Normal (TiN) contamination by allowing a nonzero variant allele frequency in the matched normal sample when calculating the posterior probability of the somatic state. If the contamination is not accounted for, it can severely reduce sensitivity by suppressing true somatic variants.
Use the following two options to control liquid tumor mode behavior.
--sv-enable-liquid-tumor-mode
---Enable liquid tumor mode. Liquid tumor mode is disabled by default.
--sv-tin-contam-tolerance
---Set the TiN contamination tolerance level. DRAGEN calls variants in the presence of TiN contamination up to a specified maximum tolerance level. You can enter any value between 0–1. The default maximum TiN contamination tolerance is 0.15. If using the default value, somatic variants are expected to be observed in the normal sample with allele frequencies up to 15% of the corresponding allele in the tumor sample.
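The meaning of the tolerance value can be shown with simple arithmetic. The sketch below only illustrates the stated relationship between tumor and normal allele frequencies and is not the scoring model itself; the function names are hypothetical.

```python
def max_expected_normal_af(tumor_af, tin_tolerance=0.15):
    """Highest normal-sample allele frequency still attributable to TiN."""
    if not 0.0 <= tin_tolerance <= 1.0:
        raise ValueError("TiN contamination tolerance must be between 0 and 1")
    return tumor_af * tin_tolerance


def within_tin_tolerance(normal_af, tumor_af, tin_tolerance=0.15):
    """True if the normal allele frequency fits the tolerated contamination."""
    return normal_af <= max_expected_normal_af(tumor_af, tin_tolerance)
```

For a somatic allele at 40% in the tumor and the default tolerance, a normal-sample frequency of up to about 6% is consistent with contamination.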
The following command line options are supported for the Structural Variant Caller.
The following are the top-level options that are shared with the DRAGEN Host Software to control the SV pipeline. You can use BAM and CRAM files as input. Alternatively, if using read mapping with the SV calling in a single run, you can use all of the DRAGEN input options, including FASTQ, BAM, and CRAM files.
--cram-input
---The CRAM file to be processed.
--tumor-cram-input
---If performing tumor-normal or tumor-only analysis, the tumor CRAM file to be processed.
--fastq-file1
, --fastq-file2
, --fastq-list
---Input FASTQ files or a list of files to be processed.
--tumor-fastq1
, --tumor-fastq2
, --tumor-fastq-list
---Input tumor FASTQ file or list of files to be processed.
--enable-map-align
---Enables DRAGEN map/align. The default is true, so all input reads are remapped and aligned unless the option is set to false.
--output-directory
---Output directory where all results are stored.
--output-file-prefix
---Output file prefix that will be prepended to all result file names.
--bam-input
---The BAM file to be processed.
--tumor-bam-input
---If performing tumor-normal or tumor-only analysis, the tumor BAM file to be processed.
--enable-sv
---Enable or disable the structural variant caller. The default is false.
--sv-call-regions-bed
---Specifies a BED file containing the set of regions to call. Optionally, you can compress the file in gzip or bgzip format.
--sv-exclusion-bed
--- Specifies a BED file containing the set of regions to exclude for the SV calling. Optionally, you can compress the file in gzip or bgzip format.
--sv-region
--- Limit the analysis to a specified region of the genome for debugging purposes. This option can be specified multiple times to build a list of regions. The value must be in the format "chr:startPos-endPos".
--sv-exome
--- Set to true to configure the variant caller for targeted sequencing inputs, which includes disabling high depth filters. In integrated mode, the default is to autodetect targeted sequencing input, and in standalone mode the default is false.
--sv-output-contigs
--- Set to true to have assembled contig sequences output in a VCF file. The default is false.
--sv-forcegt-vcf
--- Specify a VCF of structural variants for forced genotyping. The variants are scored and emitted in the output VCF even if not found in the sample data. The variants are merged with any additional variants discovered directly from the sample data.
--sv-discovery
--- Enable SV discovery. This flag can be set to false only when --sv-forcegt-vcf
is used. When set to false, SV discovery is disabled and only the forced genotyping input variants are processed. The default is true.
--sv-use-overlap-pair-evidence
--- Allow overlapping read pairs to be considered as evidence. The default is false.
--sv-somatic-ins-tandup-hotspot-regions-bed
--- Specify a BED of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file from <INSTALL_PATH>/config/sv_somatic_ins_tandup_hotspot_*.bed
.
--sv-enable-somatic-ins-tandup-hotspot-regions
--- Enable or disable the ITD hotspot region input. The default is true in somatic variant analysis.
--sv-min-edge-observations
--- Remove all edges from the graph with fewer than this many observations. The default value is set to 3.
--sv-min-candidate-spanning-count
--- Run SV caller and report all large SVs with at least this many spanning support observations. The default value is set to 3.
--sv-min-candidate-variant-size
--- Run SV caller and report all SVs/indels at or above this size. The default value is set to 10.
--sv-min-scored-variant-size
--- After candidate identification, only score and report SVs/indels at or above this size. The default value is set to 50. This parameter doesn't affect the somatic hotspot region.
--sv-hotspot-min-scored-variant-size
--- After candidate identification, only score and report SVs/indels at or above this size inside the SV hotspot region, which includes FLT3, ARHGEF7, and KMT2A genes by default. The default value is set to 25.
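Structural Variant calling can run in the following modes:
Standalone -- Runs on existing mapped and aligned input files. This mode requires the following options: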
--enable-map-align false
--enable-sv true
Integrated -- Automatically runs on the output of the DRAGEN mapper/aligner. This mode requires the following options:
--enable-map-align true
--enable-sv true
--enable-map-align-output true
--output-format bam
You can also enable Structural Variant calling with any other caller.
The following is an example command line for Integrated mode:
The following is an example command line for joint diploid calling in standalone mode:
The structural variants VCF output file is available in the output directory. The file is named <output-file-prefix>.sv.vcf.gz
. The contents of the file depend on the type of analysis.
For each major analysis category (germline, tumor-normal, and tumor-only), the appropriate VCF output file is output, reflecting variant calls made under the variant calling mode corresponding to the given analysis type.
VCF output follows the VCF 4.2 specification for describing structural variants. It uses standard field names wherever possible. All custom fields are described in the VCF header. The following sections provide information on the variant representation details and the primary VCF field values.
Sample names output in the VCF output are extracted from each input alignment file from the first read group (@RG) record found in the header. If no sample name is found, a default (SAMPLE1, SAMPLE2, etc.) label is used instead.
All variants are reported in the VCF using symbolic alleles unless they are classified as a small indel, in which case full sequences are provided for the VCF REF and ALT allele fields. A variant is classified as a small indel if all of the following criteria are met:
The variant can be entirely expressed as a combination of inserted and deleted sequences.
The deletion or insertion length is less than 1000.
The variant breakends and/or the inserted sequence are not imprecise.
The variant has not been converted from a deletion to intra-chromosomal breakends by the depth-based SV classification routine.
When VCF records are output in the small indel format, they also include the CIGAR INFO tag describing the combined insertion and deletion event.
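The four criteria above can be read as a single conjunction. The following is a minimal Python sketch of that reading. The SvCandidate class and its field names are hypothetical and exist only for illustration; they are not DRAGEN data structures. The sketch also assumes that the length limit applies to both the deleted and the inserted length.

```python
from dataclasses import dataclass

# Lengths of 1000 or greater are reported with symbolic alleles.
SMALL_INDEL_MAX_LENGTH = 1000


@dataclass
class SvCandidate:
    """Hypothetical, simplified view of one SV call (illustration only)."""
    deleted_length: int           # number of reference bases removed
    inserted_length: int          # number of bases inserted
    is_simple_indel: bool         # expressible purely as inserted + deleted sequence
    is_imprecise: bool            # breakends and/or inserted sequence are imprecise
    converted_to_breakends: bool  # deletion changed to breakends by the depth test


def is_small_indel(sv: SvCandidate) -> bool:
    """Return True only when all four documented criteria are met."""
    return (
        sv.is_simple_indel
        and sv.deleted_length < SMALL_INDEL_MAX_LENGTH
        and sv.inserted_length < SMALL_INDEL_MAX_LENGTH
        and not sv.is_imprecise
        and not sv.converted_to_breakends
    )


# A precise 57 base deletion with a 1 base insertion qualifies.
print(is_small_indel(SvCandidate(57, 1, True, False, False)))
```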
Large insertions are reported in some cases even when the insert sequence cannot be fully assembled. In this case, the SV Caller reports the insertion using the <INS>
symbolic allele and includes the special INFO
fields LEFT_SVINSSEQ
and RIGHT_SVINSSEQ
to describe the assembled left and right ends of the insert sequence. The following is an example of such a record from the joint diploid analysis of NA12878, NA12891 and NA12892 mapped to hg19:
The SV caller can also represent tandem duplications as insertions. This representation creates ambiguity in how the variants are presented in the VCF output, especially for small tandem duplications. The representation can lead to complications, such as unrecognized call duplication.
To better normalize the SV caller output, so that the same variant type is not represented in two different VCF formats, small tandem duplications (< 1000 bases) are converted to insertions in the VCF output. Insertions converted from such tandem duplications have a formatting similar to incomplete insertions, using the symbolic allele <INS>
for the ALT
field. The following example shows an insertion, which was converted from a tandem duplication during this normalization process.
Converted insertions include copies of certain output fields. The fields appear the same as in a tandem duplication record. For example, INFO/DUPSVINSSEQ
provides a copy of the breakpoint insertion value computed for the duplication. In the context of a duplication, the breakpoint insertion value would normally be written to INFO/SVINSSEQ
. The following example shows a converted insertion with a breakpoint insertion value:
For more information about copied INFO
fields, see VCF INFO Fields. All INFO
fields use the same DUP
prefix.
Inversions are reported as a set of breakends. For example, given a simple reciprocal inversion, four breakends are reported, sharing the same EVENT INFO
tag. The following are example breakend records representing a simple reciprocal inversion:
In the germline calling model, when SV candidates are discovered from the sample data and have sufficient paired and split read evidence to be reported in the output, the SV caller applies additional depth-based tests to more accurately classify certain SV candidate types. Candidate breakpoints that are consistent with a deletion are tested for the lower read depth that is expected inside the deleted region. Candidate breakpoints consistent with a tandem duplication are tested for the higher read depth expected in the duplicated region. Candidate SV calls that fail the depth-based tests are still reported in the output, but changed to intrachromosomal breakends. Candidate SV calls that pass continue to be reported in the standard deletion and tandem duplication output formats.
SVs frequently include a small sequence insertion at the breakpoint. Breakpoint insertions are represented differently depending on the SV type. The INFO/SVINSSEQ
field in the VCF output provides the most general description of breakpoint insertions by describing the insertion sequence itself. The corresponding INFO/SVINSLEN
field describes the length of the insertion sequence. For example, the following VCF record describes a large (~8.8 kb) deletion, which includes a single base insertion (C) between the left and right deletion breakends.
The INFO/SVINSSEQ
field is also used to describe breakpoint insertions for tandem duplication and breakend records. The field can also be used to describe the insertion sequence of a large SV insertion.
In the following small indel format example, the VCF record describes a 57 base deletion that includes a single base insertion (A) between the left and right deletion breakends.
Breakend records include an additional encoding of breakpoint insertion sequence, as described in the VCF specification for the breakend ALT
field. The SV caller also provides the information to the INFO/SVINSSEQ
field for consistency with other SV record types.
The following example shows a breakend connecting a region of chromosomes 1 and 12 in the sample with a breakend insertion sequence of CA
between the two breakends. The insertion sequence is described in both the ALT
and INFO/SVINSSEQ
fields.
SV Breakpoint Insertion Orientation
The breakpoint insertion sequence is always provided with respect to the strand of the current SV record. Some breakend records have inverted orientation. For inverted orientations, the pair of breakend records contains an insertion sequence that is reverse complemented compared to the mated record.
The following breakend pair example demonstrates an inverted orientation.
Each VCF record output by the SV caller is shifted to the left-most position of the exact homology range of the breakpoint. The exact homology range of the breakpoint is the continuous range of positions over which the SV could be represented while still describing the same SV haplotype. The exact homology range is described in the VCF output with the INFO/HOMSEQ
field, which describes the sequence of the exact homology range and the corresponding INFO/HOMLEN
field, which describes the length of the range.
The following example shows a 62 base deletion with an 11 base breakend homology region. Without left-shifting, the SV has an equivalent representation anywhere from position 39497639 to 39497650.
The following simplified examples illustrate exact breakend homology for a three base deletion and a three base insertion. In both the insertion and deletion, the variant is left-shifted, so that the corresponding VCF record position is 2.
Deletion
Reference: GTCAGCGA
Variant: GT---CGA
Insertion
Reference: GT---CAG
Variant: GTCGGCAG
In both the insertion and deletion, there is a single base of exact breakend homology C
, so that the same variant can be represented one base to the right.
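The left-shifting rule for the deletion example above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SV caller's implementation. The function name left_shift_deletion and its 0-based start argument are assumptions made for this example.

```python
from typing import Tuple


def left_shift_deletion(ref: str, start: int, length: int) -> Tuple[int, str]:
    """Left-shift a deletion of ref[start:start + length] (0-based start).

    Returns (vcf_pos, homseq): the 1-based position of the base preceding the
    left-most equivalent deletion, and the exact homology sequence.
    """
    # Move left while the base entering the deleted span on the left equals
    # the base leaving it on the right; the resulting haplotype is unchanged.
    left = start
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    # From the left-most placement, count how far the deletion can slide right.
    right = left
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    # The 0-based start of the deleted span equals the 1-based position of
    # the preceding reference base, which is the value held in the VCF POS column.
    return left, ref[left:right]


# Deleting AGC (one base to the right of CAG) describes the same haplotype,
# so both placements normalize to the same record.
print(left_shift_deletion("GTCAGCGA", 2, 3))  # (2, 'C')
print(left_shift_deletion("GTCAGCGA", 3, 3))  # (2, 'C')
```

The length of the returned sequence corresponds to INFO/HOMLEN and the sequence itself to INFO/HOMSEQ. A deletion starting at the first reference base is not handled by this sketch.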
Germline
The following table lists the VCF FILTER fields applied to germline VCF output.
The following table lists the VCF FILTER fields applied to tumor-normal somatic VCF output.
The following table lists the VCF FILTER fields applied to tumor-only VCF output.
There are two levels of VCF filters: record level (FILTER
) and sample level (FORMAT/FT
). Most record-level filters are independent of those at the sample-level. However, in a germline analysis, if none of the samples pass all sample-level filters, the SampleFT
record-level filter is applied.
Some structural variants reported in the VCF, such as translocations, represent a single novel sequence junction in the sample. The INFO/EVENT
field indicates that two or more such junctions are hypothesized to occur together as part of a single variant event. All individual variant records belonging to the same event share the same INFO/EVENT
string. Note that although such an inference could be applied after SV calling by analyzing the relative distance and orientation of the called variant breakpoints, the SV Caller incorporates this event mechanism into the calling process to increase sensitivity towards such larger-scale events. Given that at least one junction in the event has already passed standard variant candidacy thresholds, sensitivity is improved by lowering the evidence thresholds for additional junctions which occur in a pattern consistent with a multijunction event (such as a reciprocal translocation pair).
Although this mechanism could generalize to events including an arbitrary number of junctions, it is currently limited to two. Thus, at present it is most useful for identifying and improving sensitivity towards reciprocal translocation pairs.
Because some evidential read pairs can provide both PR and SR support, VF is defined as an additional field that gives the number of sequence fragments (read pairs) that strongly support the REF or ALT alleles, in the listed order. This field facilitates an unbiased calculation of the variant allele fraction (VAF), where VAF = VF_ALT/(VF_ALT+VF_REF).
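As a worked example of the VAF formula, the following Python sketch computes VAF from a VF value. It assumes that the VF value is written as two comma-separated counts in REF,ALT order, in the same way as other per-allele count fields. The function name and the sample counts are illustrative only.

```python
def vaf_from_vf(vf_value: str) -> float:
    """Compute VAF = VF_ALT / (VF_ALT + VF_REF) from a 'REF,ALT' count string."""
    vf_ref, vf_alt = (int(count) for count in vf_value.split(","))
    total = vf_ref + vf_alt
    if total == 0:
        # No fragment strongly supports either allele; VAF is undefined.
        raise ValueError("VAF is undefined when VF_REF + VF_ALT is 0")
    return vf_alt / total


# 30 fragments support REF and 10 support ALT.
print(vaf_from_vf("30,10"))  # 0.25
```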
The VCF ID
, or identifier, field can be used for annotation, or in the case of BND
(breakend) records for translocations, the ID
value is used to link breakend mates or partners. The following is an example of a VCF ID
field from the SV caller:
The value provided in the ID
field reflects the SV association graph edge(s) from which the SV or indel was discovered. The value is guaranteed to be unique within any single VCF output file produced by the SV Caller. These values are therefore used to link associated breakend records using the standard VCF MATEID
key. The exact structure of this identifier may change in the future. You can use the entire value as a unique key, but parsing the key could lead to incompatibility with future DRAGEN versions. See the DRAGEN Software Support Site for information on the latest version of DRAGEN.
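The following Python sketch shows one way to link breakend mates while treating the identifier as an opaque key, as recommended above. The record dictionaries and the identifier strings are hypothetical placeholders. They do not reflect the actual DRAGEN identifier format.

```python
from typing import Dict, List, Optional, Tuple


def pair_breakends(records: List[Dict[str, Optional[str]]]) -> List[Tuple[str, str]]:
    """Pair breakend records by exact ID/MATEID match, without parsing the ID."""
    by_id = {rec["id"]: rec for rec in records}
    pairs = []
    seen = set()
    for rec in records:
        mate_id = rec.get("mate_id")
        # Skip records without a mate, already paired records, and mates
        # that are missing from the file.
        if mate_id is None or rec["id"] in seen or mate_id not in by_id:
            continue
        pairs.append((rec["id"], mate_id))
        seen.update((rec["id"], mate_id))
    return pairs


# Hypothetical identifiers used only as opaque dictionary keys.
example_records = [
    {"id": "bnd_A", "mate_id": "bnd_B"},
    {"id": "bnd_B", "mate_id": "bnd_A"},
    {"id": "del_1", "mate_id": None},
]
print(pair_breakends(example_records))  # [('bnd_A', 'bnd_B')]
```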
It can sometimes be convenient to express structural variants in BEDPE format. For such applications, DRAGEN recommends the script vcfToBedpe available on GitHub. The repository is forked from @hall-lab with modifications to support VCF 4.2 SV format.
BEDPE format greatly reduces structural variant information compared to the SV Caller VCF output. In particular, breakend orientation, breakend homology, and insertion sequence are lost, in addition to the ability to define fields for locus and sample specific information. For this reason, Illumina only recommends BEDPE as a temporary output for applications that require it.
DRAGEN generates multiple pipeline-specific metrics including:
Mapping and Aligning metrics
Variant calling metrics
Biomarker metrics
Coverage (or enrichment) metrics and reports
Duration (or run time) metrics
Figure 10: Generation of Metrics and Reports
The QC metrics are printed to the standard output. In addition, CSV files are written to the run output directory:
<output prefix>.mapping_metrics.csv
<output prefix>.vc_metrics.csv
<output prefix>.<coverage region prefix>_coverage_metrics.csv
<output prefix>.time_metrics.csv
<output prefix>.<other coverage reports>.csv
Each CSV file contains five columns: Section, Subsection (for example, read group or sample), Metric, Value 1 (Count/Ratio/Minutes), and Value 2 (Percentage/Seconds).
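The five-column layout can be loaded with a few lines of Python. The sketch below assumes comma-separated rows with no header row, and it assumes that the Value 2 column may be absent on some rows. Both points are assumptions for illustration. The function name read_metrics is hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Tuple

MetricKey = Tuple[str, str, str]  # (section, subsection, metric)


def read_metrics(lines: Iterable[str]) -> Dict[MetricKey, Tuple[str, str]]:
    """Map (section, subsection, metric) to the (value 1, value 2) strings."""
    metrics = {}
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        row = row + [""] * (5 - len(row))  # pad a missing Value 2 column
        section, subsection, metric, value1, value2 = row[:5]
        metrics[(section, subsection, metric)] = (value1, value2)
    return metrics


# One row shaped like the contamination metric, with an empty subsection.
sample = io.StringIO("MAPPING/ALIGNING SUMMARY,,Estimated sample contamination,0.011\n")
parsed = read_metrics(sample)
print(parsed[("MAPPING/ALIGNING SUMMARY", "", "Estimated sample contamination")])
```

Values are kept as strings because a metric can be reported as NA.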
DRAGEN computes mapping and aligning metrics similar to Samtools Flagstat.
Mapping metrics are:
available both on an aggregate level and on a per read group level.
in germline and somatic tumor-only mode, only one set of mapping metrics is available.
in somatic tumor-normal mode, the mapping and aligning metrics are generated separately for the tumor and normal samples, with each line beginning with TUMOR or NORMAL to indicate the sample. The metrics for the tumor sample are output first, followed by the metrics for the normal sample. Metrics per read group are also separated into tumor and normal read groups.
unless explicitly stated, the metrics units are in reads (not in terms of pairs).
Definitions:
Total input reads---Total number of reads in the input FASTQ files.
Number of duplicate marked reads---Reads marked as duplicates as a result of the --enable-duplicate-marking
option being set to true.
Number of duplicate marked and mate reads removed---Reads marked as duplicates, along with any mate reads, that are removed when the --remove-duplicates
option is set to true.
Number of unique reads---Total number of reads minus the duplicate marked reads.
Reads with mate sequenced---Number of reads with a mate.
Reads without mate sequenced---Total number of reads minus number of reads with mate sequenced.
QC-failed reads---Reads that did not pass platform/ vendor quality checks (SAM flag 0x200).
Mapped reads---Total number of mapped reads.
Mapped reads with filtered mapping---Total number of mapped reads plus reads mapped to non-reference decoy contigs plus reads mapped to the rRNA filter contig.
Mapped reads adjusted for excluded mapping---Total number of mapped reads plus reads mapped to the excluded RNA mitochondrial contig.
Mapped reads adjusted for filtered and excluded mapping---Total number of mapped reads plus reads mapped to the rRNA filter contig plus reads mapped to the excluded RNA mitochondrial contig.
Number of unique and mapped reads---Number of mapped reads minus number of duplicate marked reads.
Unmapped reads---Total number of reads that could not be mapped.
Unmapped reads minus filtered mapping---Total number of unmapped reads minus reads mapped to non-reference decoy contigs minus reads mapped to the rRNA filter contig.
Unmapped reads adjusted for excluded mapping---Total number of unmapped reads minus reads mapped to the excluded RNA mitochondrial contig.
Unmapped reads adjusted for filtered and excluded mapping---Total number of unmapped reads minus reads mapped to the rRNA filter contig minus reads mapped to the excluded RNA mitochondrial contig.
Singleton reads---Number of reads where the read could be mapped, but the paired mate could not be mapped.
Paired reads---Count of reads in which both reads in the pair are mapped.
Properly paired reads---Both reads in the pair are mapped and fall within an acceptable range from each other based on the estimated insert length distribution.
Not properly paired reads (discordant)---The number of paired reads minus the number of properly paired reads.
Paired reads mapped to different chromosomes---The number of reads with a mate, where the mate was mapped to a different chromosome.
Paired reads mapped to different chromosomes (MAPQ >= 10)---The number of reads with a MAPQ >= 10 and with a mate, where the mate was mapped to a different chromosome.
Reads with indel R1---The percentage of R1 reads containing at least 1 indel.
Reads with indel R2---The percentage of R2 reads containing at least 1 indel.
Soft-clipped bases R1---The percentage of bases in R1 reads that are soft-clipped.
Soft-clipped bases R2---The percentage of bases in R2 reads that are soft-clipped.
Mismatched bases R1---The number of mismatched bases on R1, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R2---The number of mismatched bases on R2, which is the sum of SNP count and indel lengths. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R1 (excluding indels)---The number of mismatched bases on R1. Indel lengths are ignored. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Mismatched bases R2 (excluding indels)---The number of mismatched bases on R2. Indel lengths are ignored. The metric does not count anything within soft clipping or RNA introns. The metric also does not count a mismatch if either the reference base or read base is N.
Q30 Bases---The total number of bases with a BQ >= 30. Includes mapped & unmapped reads & bases. Excludes duplicate marked reads & secondary alignments.
Q30 Bases R1---The total number of bases on R1 with a BQ >= 30.
Q30 Bases R2---The total number of bases on R2 with a BQ >= 30.
Q30 Bases (excluding dups and clipped bases)---The number of bases on non-duplicate and non-clipped bases with a BQ >= 30.
Histogram of reads map qualities
Reads with MAPQ [40:inf)
Reads with MAPQ [30:40)
Reads with MAPQ [20:30)
Reads with MAPQ [10:20)
Reads with MAPQ [0:10)
Total alignments---Total number of loci to which reads aligned with a quality > 0.
Secondary alignments---Number of secondary alignment loci.
Supplementary (chimeric) alignments---A chimeric read is split over multiple loci (possibly due to structural variants). One alignment is referred to as the representative alignment. The others are supplementary.
Estimated read length---Total number of input bases divided by the number of reads.
Insert length: mean---Mean insert size estimated for the read group.
Insert length: median---Median insert size estimated for the read group.
Insert length: standard deviation---Standard deviation of insert size estimated for the read group.
Note: The insert length metrics reported above are computed using high quality (MAPQ >= 20) properly paired read pairs, considering all the read pairs for the read group. It may be different from the standard output log reported during insert stats sampling which reports these metrics only for the first ~2M read pairs for DNA (~100K read pairs for RNA).
Whole read group insert length estimation for RNA datasets is currently not supported. For RNA runs, the reported insert length metrics are computed using up to the first ~100K high quality read pairs for the read group from the input FASTQ/BAM/CRAM file.
Input bases divided by reference genome size---Raw coverage as computed by summing all read lengths (including duplicate marked reads, but excluding secondary and supplementary alignments) and dividing by the reference genome size.
Input bases divided by target bed size---Raw coverage as computed by summing all read lengths (including duplicate marked reads, but excluding secondary and supplementary alignments) and dividing by the target bed size.
Estimated sample contamination---The estimated fraction of reads in a sample that may be from another human source.
The DRAGEN cross-sample contamination module uses a probabilistic mixture model to estimate the fraction of reads in a sample that may be from another human source. DRAGEN supports separate modes for germline and somatic samples.
The germline model, like VerifyBamID, assumes that a sample can be modeled as a DNA mixture from two or more individuals. Pileup analysis is used to investigate loci where variants are common in the general population. Variants with high allele frequencies are likely to be real germline variants in the individual of interest, while low allele frequency variants at these common germline loci are likely noise or germline variants from a contaminating sample B. The probabilistic mixture model accounts for noise and then tries to detect consistent allele frequency distributions. As an example, if the pileups show consistent low allele frequencies of 1% or 2%, then the mixture model will likely infer 2% contamination from sample B, where the 1% and 2% AF variants correspond to heterozygous and homozygous germline calls in sample B, respectively.
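The arithmetic behind the 1% and 2% example can be made explicit with a short Python sketch. It assumes a diploid contaminating sample and a site where the sample of interest is homozygous reference, and it ignores sequencing noise. It illustrates the expected allele frequencies only and is not the DRAGEN mixture model. The function name is hypothetical.

```python
def expected_contaminant_af(contamination: float, alt_copies: int) -> float:
    """Expected ALT allele frequency contributed by a diploid contaminant.

    contamination: fraction of reads that come from the contaminating sample.
    alt_copies: ALT allele copies in the contaminant (0 = hom ref, 1 = het, 2 = hom alt).
    """
    if alt_copies not in (0, 1, 2):
        raise ValueError("alt_copies must be 0, 1, or 2 for a diploid sample")
    # Each contaminant read carries the ALT allele with probability alt_copies / 2.
    return contamination * alt_copies / 2


# With 2% contamination, het sites appear near 1% AF and hom alt sites near 2% AF.
print(expected_contaminant_af(0.02, 1))
print(expected_contaminant_af(0.02, 2))
```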
The germline cross-contamination metric is enabled by using the following setting and pointing to a VCF that includes marker sites (RSIDs) with population allele frequencies that are close to 0.5.
--qc-cross-cont-vcf <INSTALL_PATH>/resources/qc/sample_cross_contamination_resource_[hg19 or GRCh37 or GRCh38].vcf
The somatic model, like GATK CalculateContamination, supports tumor-only or tumor-normal runs. The somatic model is more advanced than the germline model in the way that it accounts for somatic CNVs or LoH regions where the diploid assumptions may be broken. The algorithm also tries to account for FFPE deamination and oxidation noise by empirically adjusting base qualities prior to estimation.
The somatic cross-contamination metric is enabled by pointing to the VCF that includes the marker sites (RSIDs) with high population allele frequencies.
--qc-somatic-contam-vcf <INSTALL_PATH>/resources/qc/somatic_sample_cross_contamination_resource_[hg19 or GRCh37 or GRCh38].vcf.gz
The metric value is printed as a fraction, so a value of 0.011 represents 1.1% contamination from another sample.
MAPPING/ALIGNING SUMMARY Estimated sample contamination 0.011
The precision of variant calling, particularly for somatic variants, can be significantly impacted by cross-sample contamination. To ensure safe usage of a sample, the level of cross-sample contamination must be considerably lower than the minimum allele frequencies of interest. For instance, if a sample has 1% contamination, it may be necessary to disregard all variants with less than 5% allele frequency. The cross-contamination metric for a sample reaches saturation near 30% contamination.
The contamination module requires a minimum of 100 valid pileups for contamination estimation, where a pileup is considered valid if it has at least 10X coverage and 95% or more reads are deemed valid. Soft clipped reads that could indicate INDELs or structural variants are not considered valid, and datasets with untrimmed adapters may lead to most reads being soft clipped and classified as invalid. If the contamination module reports "NA," even for high-coverage samples, it is recommended to inspect a few pileup locations in IGV for evidence of untrimmed bases.
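The validity thresholds above can be expressed as a simple check. The following Python sketch assumes that the 10X coverage requirement counts all reads at the site and that each pileup is summarized as a (depth, valid reads) pair. Both are assumptions for illustration, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

MIN_VALID_PILEUPS = 100         # minimum number of valid pileups required
MIN_PILEUP_DEPTH = 10           # a valid pileup needs at least 10X coverage
MIN_VALID_READ_FRACTION = 0.95  # and 95% or more of its reads deemed valid


def pileup_is_valid(depth: int, valid_reads: int) -> bool:
    """Apply the two documented per-pileup thresholds."""
    return depth >= MIN_PILEUP_DEPTH and valid_reads / depth >= MIN_VALID_READ_FRACTION


def can_estimate_contamination(pileups: Iterable[Tuple[int, int]]) -> bool:
    """Return True when enough pileups are valid for an estimate."""
    n_valid = sum(1 for depth, valid in pileups if pileup_is_valid(depth, valid))
    return n_valid >= MIN_VALID_PILEUPS


# Heavily soft-clipped data: every site has 30X depth but only 20 valid reads.
print(can_estimate_contamination([(30, 20)] * 500))  # False
```

The usage line mirrors the untrimmed adapter case described above, where high coverage alone does not produce enough valid pileups.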
Optional Contamination Settings:
The generated variant calling metrics are similar to the metrics computed by RTG vcfstats. Metrics are reported for each sample in multi sample VCF and gVCF files and in a csv file with the file name ending in "vc_metrics.csv". Based on the run case, metrics are reported either as standard VARIANT CALLER or JOINT CALLER. Metrics are reported both for the raw (PREFILTER) and hard filtered (POSTFILTER) VCF file.
Panel of Normals (PON) and COSMIC filtered variants are counted as PASS variants in the POSTFILTER VCF metrics. These PASS variants can cause higher than expected variant counts in the POSTFILTER VCF metrics.
Number of samples---Number of samples in the population/ joint VCF.
Reads Processed---The number of reads used for variant calling, excluding any duplicate marked reads and reads falling outside of the target region.
Total---The total number of variants (SNPs + MNPs + indels).
Biallelic---Number of sites in a genome that contain two observed alleles. The reference is counted as one allele, which allows for one variant allele.
Multiallelic---Number of sites in the VCF that contain three or more observed alleles. The reference is counted as one, which allows for two or more variant alleles.
SNPs---A variant is counted as an SNP when the reference, allele 1, and allele 2 are all length 1.
Insertions (Hom)---Number of variants that contain homozygous insertions.
Insertions (Het)---Number of variants where both alleles are insertions, but not homozygous.
Deletions (Hom)---Number of variants that contain homozygous deletions.
Deletions (Het)---Number of variants where both alleles are deletions, but not homozygous.
INDELS (Het)---Number of variants where genotypes are either [insertion+deletion], [insertion+SNP], or [deletion+SNP].
De Novo SNPs---De novo marked SNPs with DQ greater than the threshold. Set the --qc-snp-denovo-quality-threshold
option to the required threshold. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
De Novo INDELs---De novo marked indels with DQ values greater than the threshold. This DQ threshold can be specified by setting the --qc-indel-denovo-quality-threshold
option to the required DQ threshold. The default is 0.4 if ML recalibration is off, 0.04 if ML recalibration is on.
De Novo MNPs---De novo marked MNPs with DQ greater than the threshold. Set the --qc-snp-denovo-quality-threshold
option to the required threshold. The default is 0.05 if ML recalibration is off, 0.0017 if ML recalibration is on.
(Chr X SNPs)/(Chr Y SNPs) ratio in the genome (or the target region) ---Number of SNPs in chromosome X (or in the intersection of chromosome X with the target region) divided by the number of SNPs in chromosome Y (or in the intersection of chromosome Y with the target region). If there was no alignment to either chromosome X or chromosome Y, this metric shows as NA.
SNP Transitions---An interchange of two purines (A<->G) or two pyrimidines (C<->T).
SNP Transversions---An interchange of purine and pyrimidine bases.
Ti/Tv ratio---Ratio of transitions to transversions.
Heterozygous---Number of heterozygous variants.
Homozygous---Number of homozygous variants.
Het/Hom ratio---Heterozygous/ homozygous ratio.
In dbSNP---Number of variants detected that are present in the dbSNP reference file. If no dbSNP file is provided via the --dbsnp
option, then both the In dbSNP and Novel metrics show as NA.
Novel---Total number of variants minus number of variants in dbSNP.
Percent Callability---Available in germline and somatic modes with gVCF output. The percentage of non-N reference positions having a PASSing genotype call. Multiallelic variants are not counted. Deletions are counted for all the deleted reference positions only for homozygous calls. Only autosomes and chromosomes X, Y, and M are considered. To produce this metric for non-human references, set --qc-callability-autosome-contigs to specify the autosome contig names. Optionally, --qc-callability-xym-contigs allows setting X, Y and M contig names.
Percent Autosome Callability---Only autosomes are considered. To produce this metric for non-human references, set --qc-callability-autosome-contigs to specify the autosome contig names.
Percent QC Region Callability in Region i (i is equivalent to regions 1,2, or 3)---Available if callability for custom regions is requested via the --qc-coverage-region-i
option and the callability output is specified with --qc-coverage-reports-i
. All contigs are considered. Setting --qc-callability-autosome-contigs enables outputting this metric for non-human references.
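The transition and transversion definitions in the list above can be illustrated with a short Python sketch that classifies SNPs and computes the Ti/Tv ratio. The function names and the example SNP list are illustrative only. This is not the DRAGEN implementation.

```python
from typing import Iterable, Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for purine <-> purine or pyrimidine <-> pyrimidine changes."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) SNP pairs."""
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            continue  # not a substitution
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    # The ratio is undefined when no transversions are observed.
    return transitions / transversions if transversions else float("nan")


# Two transitions (A>G, C>T) and one transversion (A>C).
print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))  # 2.0
```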
When the germline small variant caller is executed, DRAGEN calculates a het/hom ratio per contig.
The het/hom ratio values can be used as an indication of whole chromosome uniparental disomy (UPD). UPD of certain chromosomes is associated with genetic syndromes known as imprinting disorders. Chromosomes affected by whole chromosome UPD have het/hom ratios close to 0.0. Typical ratios vary, but are usually between 1.0 and 2.0. The het/hom ratios should be interpreted in the context of the specific assay.
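As a rough illustration of how per-contig ratios might be screened, the following Python sketch computes the het/hom ratio for each contig and lists contigs whose ratio falls below a cutoff. The cutoff of 0.2 and the variant counts are arbitrary values chosen for this example. They are not DRAGEN defaults or recommended thresholds.

```python
from typing import Dict, List, Optional, Tuple


def het_hom_ratio(n_het: int, n_hom: int) -> Optional[float]:
    """Heterozygous / homozygous variant count; None when n_hom is 0."""
    return n_het / n_hom if n_hom else None


def contigs_with_low_ratio(
    counts: Dict[str, Tuple[int, int]], cutoff: float = 0.2
) -> List[str]:
    """Return contigs whose het/hom ratio is below the (assumed) cutoff."""
    flagged = []
    for contig, (n_het, n_hom) in counts.items():
        ratio = het_hom_ratio(n_het, n_hom)
        if ratio is not None and ratio < cutoff:
            flagged.append(contig)
    return flagged


# Made-up counts: chr15 has almost no heterozygous variants.
example_counts = {"chr14": (90000, 60000), "chr15": (1200, 70000)}
print(contigs_with_low_ratio(example_counts))  # ['chr15']
```

Any flagged contig still needs to be interpreted in the context of the specific assay.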
DRAGEN reports the ratios for both the raw (PREFILTER) and hard-filtered (POSTFILTER) VCF. The metrics are output to the .vc_hethom_ratio_metrics.csv
file.
The file contains the following values for each primary contig processed.
Contig
Number of heterozygous variants
Number of homozygous variants
Het/Hom ratio
The following example shows a section of the metrics.
DRAGEN supports a number of reports dedicated to coverage metrics. Some other DRAGEN components, including the mapper and aligner, ploidy caller, and variant callers, may emit limited coverage-related metrics. The metrics from these other components may not always exactly match the metrics in the DRAGEN coverage reports. The following table lists some important differences.
Table 6 Coverage reported in files other than the main coverage reports
The coverage reports listed in Table 7 all follow the same default rules for counting or excluding reads:
Duplicate reads are ignored.
Soft and/or hard-clipped bases are ignored.
Overlapping mates are double-counted.
Reads with MAPQ > 0 are included (that is, MAPQ=0 reads are filtered).
Bases with BQ >= 0 are included.
Table 7 DRAGEN Coverage Reports
By default, DRAGEN generates coverage reports over the whole genome and, if a target region is provided, over the target region. DRAGEN also supports custom regions and report types of interest.
In somatic tumor-normal mode, DRAGEN generates separate reports for the tumor and normal samples. Each report is labeled according to the sample type. Tumor sample reports include tumor
at the end of the file name, and normal sample reports include normal
at the end of the file name. To include both tumor and normal sample results in one file, set the --vc-enable-separate-t-n-metrics
option to false. DRAGEN then reports on the aggregate of both samples.
The coverage reports do not require the mapper or variant callers; however, they are not compatible with --enable-sort=false
.
The following command shows an example use case for specifying custom coverage reports:
The settings --qc-coverage-region-i
and --qc-coverage-reports-i
work as a pair (i can be 1, 2, or 3). The former setting specifies the region, while the latter specifies the report type for that region. The number i
links the settings. Up to 3 such region and report pairs may be specified.
The --qc-coverage-region-i
option requires a BED file argument (i can be 1, 2, or 3).
Regions in each BED file can be optionally padded using --qc-coverage-region-padding-i
option (by default 0 padding is applied).
A set of default reports are generated for each region.
Additional reports can be specified for each region by using the --qc-coverage-reports-i
option.
If multiple report types are selected per region, they should be space-separated, e.g. --qc-coverage-reports-1 callability full_res
.
Default settings used for all DRAGEN coverage reports:
Duplicate reads are ignored.
Soft and/or hard-clipped bases are ignored.
Overlapping mates are double-counted.
Reads with MAPQ > 0 are included. MAPQ=0 reads are filtered.
Bases with BQ >= 0 are included.
Non-default setting
As an example, the following options are used to enable full (basepair) resolution coverage output with more stringent MAPQ and BQ filtering:
The argument syntax mapq<value,bq<value means that reads with a mapping quality less than the specified value, or bases with a base call quality less than the specified value, are ignored.
Valid filter arguments are mapq and bq only. Either, or both, can be specified.
Only the < operator is supported. The operators <=, >, >=, and = are not supported.
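To make the filter syntax concrete, the following Python sketch validates a filter string against the rules above: only mapq and bq, only the < operator, and either or both terms. The function name `parse_coverage_filters` is hypothetical, and treating the values as non-negative integers is an assumption.

```python
import re
from typing import Dict

# One term looks like "mapq<10" or "bq<30"; no other operator is accepted.
_TERM = re.compile(r"^(mapq|bq)<(\d+)$")


def parse_coverage_filters(spec: str) -> Dict[str, int]:
    """Parse a string such as 'mapq<10,bq<30' into minimum thresholds."""
    thresholds: Dict[str, int] = {}
    for term in spec.split(","):
        match = _TERM.match(term.strip())
        if match is None:
            raise ValueError(f"unsupported filter term: {term!r}")
        name, value = match.group(1), int(match.group(2))
        if name in thresholds:
            raise ValueError(f"{name} specified more than once")
        thresholds[name] = value
    return thresholds


if __name__ == "__main__":
    # Reads with MAPQ below 10 and bases with BQ below 30 would be ignored.
    print(parse_coverage_filters("mapq<10,bq<30"))
```

A term such as `mapq<=10` is rejected by this sketch because the text after `<` must be a number.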
By default DRAGEN will emit a _coverage_metrics.csv file for each available region type, including the full genome, target region, and any additionally specified QC regions.
The _coverage_metrics.csv file is generally the most useful of all the coverage reports and will probably be the first file to inspect when performing coverage based QC.
The first column of the output file contains the section name COVERAGE SUMMARY and the second column (the subsection) is empty for all entries in the file.
The following metrics are calculated:
Aligned bases in region---Number of uniquely mapped bases to region and the percentage relative to the number of uniquely mapped bases to the genome.
Average alignment coverage over region---Number of uniquely mapped bases to region divided by the number of sites in region.
Uniformity of coverage (PCT > 0.2*mean) over region---Percentage of sites with coverage greater than 20% of the mean coverage in region.
PCT of region with coverage [ix, inf)---Percentage of sites in region with at least ix coverage, where i can equal 100, 50, 20, 15, 10, 3, 1 and 0.
PCT of region with coverage [ix, jx)---Percentage of sites in region with at least ix but less than jx coverage, where (i, j) can equal (50, 100), (20, 50), (15, 20), (10, 15), (3, 10), (1, 3) and (0, 1).
Average chromosome X coverage over region---Total number of bases that aligned to the intersection of chromosome X with region divided by the total number of loci in the intersection of chromosome X with region. If there is no chromosome X in the reference genome or the region does not intersect chromosome X, this metric shows as NA.
Average chromosome Y coverage over region---Total number of bases that aligned to the intersection of chromosome Y with region divided by the total number of loci in the intersection of chromosome Y with region. If there is no chromosome Y in the reference genome or the region does not intersect chromosome Y, this metric shows as NA.
XAvgCov/YAvgCov ratio over genome/target region---Average chromosome X alignment coverage in region divided by the average chromosome Y alignment coverage in region. If there is no chromosome X or chromosome Y in the reference genome or the region does not intersect chromosome X or Y, this metric shows as NA.
Average mitochondrial coverage over region---Total number of bases that aligned to the intersection of the mitochondrial chromosome with region divided by the total number of loci in the intersection of the mitochondrial chromosome with region. If there is no mitochondrial chromosome in the reference genome or the region does not intersect mitochondrial chromosome, this metric shows as NA.
Average autosomal coverage over region---Total number of bases that aligned to the autosomal loci in region divided by the total number of loci in the autosomal loci in region. If there is no autosome in the reference genome, or the region does not intersect autosomes, this metric shows as NA.
Median autosomal coverage over region---Median alignment coverage over the autosomal loci in region. If there is no autosome in the reference genome or the region does not intersect autosomes, this metric shows as NA.
Mean/Median autosomal coverage ratio over region---Mean autosomal coverage in region divided by the median autosomal coverage in region. If there is no autosome in the reference genome or the region does not intersect autosomes, this metric shows as NA.
Aligned reads in region---Number of uniquely mapped reads to region and percentage relative to the number of uniquely mapped reads to the genome. Only reads with MAPQ ≥ 1 are included. Secondary and supplementary alignments are ignored.
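As a rough illustration of how several of the metrics above relate to per-site depth, the following Python sketch computes the average coverage, the uniformity value, and the "at least ix" percentages from a list of depths. The input list and the function name `coverage_summary` are hypothetical. DRAGEN derives these values from alignments after applying its read and base filters, which this sketch does not model.

```python
from typing import Dict, Sequence

# Coverage levels listed for the "PCT of region with coverage [ix, inf)" metric.
LEVELS = (100, 50, 20, 15, 10, 3, 1, 0)


def coverage_summary(depths: Sequence[int]) -> Dict[str, object]:
    """Summarize per-site depths for one region (illustrative only)."""
    sites = len(depths)
    if sites == 0:
        raise ValueError("region has no sites")
    mean = sum(depths) / sites
    # Percentage of sites with coverage greater than 20% of the mean.
    uniformity = 100.0 * sum(1 for d in depths if d > 0.2 * mean) / sites
    # Percentage of sites with at least ix coverage, for each listed level.
    pct_at_least = {
        level: 100.0 * sum(1 for d in depths if d >= level) / sites
        for level in LEVELS
    }
    return {"mean": mean, "uniformity": uniformity, "pct_at_least": pct_at_least}


if __name__ == "__main__":
    print(coverage_summary([0, 10, 20, 30, 40]))
```

The [ix, jx) range metrics follow from the cumulative values by subtraction, for example the share in [20x, 50x) is the share at 20x or more minus the share at 50x or more.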
The following is an example of the contents of the _coverage_metrics.csv file:
The fine histogram report outputs a _fine_hist.csv file, which contains two columns: Depth and Overall. The value in the Depth column ranges from 0 to 2000+, and the Overall column indicates the number of loci covered at the corresponding depth.
Masked regions in the FASTA are ignored, and no depth is reported for these regions.
The histogram report outputs a _hist.csv file, which provides the following:
Percentage of bases in the coverage BED/target BED/WGS region that fall within a certain range of coverage.
Duplicate reads are ignored if DRAGEN is run with --enable-duplicate-marking true.
The following ranges are used: "[100x:inf)", "[1x:3x)", "[0x:1x)"
Masked regions in the FASTA are ignored, and no depth is reported for these regions.
The Overall Mean Coverage report generates an _overall_mean_cov.csv file, which contains the average alignment coverage over the coverage BED/target BED/WGS, as applicable.
The following is an example of the contents of the _overall_mean_cov.csv file:
Average alignment coverage over target_bed,80.69
Masked regions in the FASTA are ignored, and no depth is reported for these regions.
The Contig Mean Coverage report generates a _contig_mean_cov.csv file, which contains the estimated coverage for all contigs and an autosomal estimated coverage. The file includes the following three columns:
Masked regions in the FASTA are ignored, and no depth is reported for these regions.
The Full Res Report outputs a _full_res.bed file in tab-delimited format. The first three columns are the standard BED fields, and the fourth column is the depth. Each record in the file is for a given interval that has a constant depth. If the depth changes, then a new record is written to the file. Alignments that have a mapping quality value of 0, duplicate reads, and clipped bases are not counted towards the depth.
Only base positions that fall under the user-specified coverage-region bed regions are present in the _full_res.bed output file.
The _full_res.bed file structure is similar to the output file of bedtools genomecov -bg. The contents are identical if the bedtools command line is executed after filtering out alignments with mapping quality value of 0, and possibly filtering by a target BED (if specified).
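The record structure described above (one record per interval of constant depth) amounts to run-length encoding of the per-base depth. The following Python sketch shows that idea under stated assumptions: the function name `depth_to_bed_records` is hypothetical, the depths are assumed to be already filtered, and coordinates are 0-based half-open as in standard BED. Whether zero-depth intervals are written is not stated here, so the sketch exposes that choice as a parameter. Note that bedtools genomecov -bg omits zero-coverage intervals.

```python
from typing import List, Sequence, Tuple

BedRecord = Tuple[str, int, int, int]  # chrom, start, end, depth


def depth_to_bed_records(chrom: str, start: int, depths: Sequence[int],
                         emit_zero: bool = True) -> List[BedRecord]:
    """Collapse per-base depths into intervals of constant depth."""
    records: List[BedRecord] = []
    run_start = start
    for i in range(1, len(depths) + 1):
        # Close the current run at the end of the data or when the depth changes.
        if i == len(depths) or depths[i] != depths[i - 1]:
            depth = depths[i - 1]
            if emit_zero or depth != 0:
                records.append((chrom, run_start, start + i, depth))
            run_start = start + i
    return records


if __name__ == "__main__":
    for rec in depth_to_bed_records("chr1", 100, [3, 3, 5, 5, 5, 0]):
        print("\t".join(str(field) for field in rec))
```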
The following is an example of the contents of the _full_res.bed file:
Coverage is reported for all positions specified by qc-coverage-region-i. Masked regions in the FASTA are not ignored.
When --enable-metrics-compression is set to true, the 1 bp resolution coverage metrics output BED file (_full_res.bed) is compressed to bigwig format.
The cov_report report generates a _cov_report.bed file in tab-delimited format. This report includes summary coverage statistics per BED region. The first three columns are the standard BED fields. The DRAGEN Amplicon pipeline includes a fourth column for name and a fifth column for gene_id. The remaining columns are statistics calculated over the interval specified on the same record line.
The following table lists the appended columns.
total_cvg---The total coverage value.
mean_cvg---The mean coverage value.
Q1_cvg---The lower quartile (25th percentile) coverage value.
median_cvg---The median coverage value.
Q3_cvg---The upper quartile (75th percentile) coverage value.
min_cvg---The minimum coverage value.
max_cvg---The maximum coverage value.
pct_above_X---Indicates the percentage of bases over the specified interval region that had a depth coverage greater than X.
By default, if an interval has a total coverage of 0, then the record is written to the output file. To filter out intervals with zero coverage, set vc-emit-zero-coverage-intervals to false in the configuration file.
If --qc-coverage-region-i-thresholds is not set, the thresholds default to 5, 15, 20, 30, 50, 100, 200, 300, 400, 500, 1000.
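The appended cov_report columns can be illustrated with a short Python sketch that computes them for one interval from its per-base depths. The function name `interval_stats` is hypothetical. Two points are assumptions: total_cvg is taken to be the sum of the per-base depths, and the quartiles use the inclusive method of Python's statistics module, because the quartile method DRAGEN uses is not stated here.

```python
import statistics
from typing import Dict, Sequence

# Default thresholds used when no custom thresholds are set.
DEFAULT_THRESHOLDS = (5, 15, 20, 30, 50, 100, 200, 300, 400, 500, 1000)


def interval_stats(depths: Sequence[int],
                   thresholds: Sequence[int] = DEFAULT_THRESHOLDS) -> Dict[str, float]:
    """Compute cov_report-style statistics for one BED interval."""
    bases = len(depths)
    if bases < 2:
        raise ValueError("this sketch needs at least two positions")
    # Assumption: inclusive quartiles; the actual method may differ.
    q1, median, q3 = statistics.quantiles(depths, n=4, method="inclusive")
    stats: Dict[str, float] = {
        "total_cvg": float(sum(depths)),  # assumption: sum of per-base depths
        "mean_cvg": sum(depths) / bases,
        "Q1_cvg": q1,
        "median_cvg": median,
        "Q3_cvg": q3,
        "min_cvg": float(min(depths)),
        "max_cvg": float(max(depths)),
    }
    for x in thresholds:
        # Percentage of bases in the interval with depth greater than X.
        stats[f"pct_above_{x}"] = 100.0 * sum(1 for d in depths if d > x) / bases
    return stats


if __name__ == "__main__":
    for key, value in interval_stats([0, 10, 20, 30, 40]).items():
        print(key, value)
```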
The following is an example of the contents of the _cov_report.bed file:
The read_cov_report report generates a _read_cov_report.bed file in tab-delimited format. The first five columns are the chrom, start, end, name, and gene_id BED fields. The following additional columns represent statistics that are calculated over the interval specified on the same record line.
total_cvg---The total coverage value.
read1_cvg---The total Read 1 coverage value.
read2_cvg---The total Read 2 coverage value.
If an alignment overlaps more than one region, the alignment is counted toward the region with the largest overlap. If the alignment overlaps equally with more than one region, the alignment is counted toward the first intersecting region.
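The assignment rule above can be sketched in Python as follows. The function names are hypothetical. Intervals are assumed to be 0-based half-open and on the same contig, and "first intersecting region" is assumed to mean the first in input order.

```python
from typing import Optional, Sequence, Tuple

Interval = Tuple[int, int]  # start, end (0-based, half-open)


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals (0 if they do not intersect)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def assign_region(alignment: Interval, regions: Sequence[Interval]) -> Optional[int]:
    """Return the index of the region an alignment is counted toward."""
    best_index: Optional[int] = None
    best_overlap = 0
    for index, region in enumerate(regions):
        shared = overlap_length(alignment, region)
        # Strict comparison keeps the first region when overlaps are equal.
        if shared > best_overlap:
            best_index, best_overlap = index, shared
    return best_index


if __name__ == "__main__":
    regions = [(0, 100), (100, 200)]
    print(assign_region((50, 150), regions))  # equal overlap with both regions
    print(assign_region((80, 150), regions))  # larger overlap with the second region
```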
The following shows the contents of the _read_cov_report.bed file:
Callability is defined as the fraction of positions in the genome or target region having a GVCF PASSing genotype call. The callability report can be interpreted as the fraction of sites in the genome or target bed where the small variant caller had sufficient information (enough good quality reads) to confidently either call a variant or a HOM-REF region.
The callability report requires DRAGEN to be run in gVCF mode. When gVCF mode is enabled, DRAGEN will automatically generate a callability report as part of variant caller metrics.
The following criteria are used to calculate callability metrics:
Callability is calculated over all positions included in the gVCF.
Decoy contigs are ignored.
Unplaced and unlocalized contigs are ignored.
Masked regions in the FASTA (bases set to N) are ignored.
For regions where no variant calling was performed, callability is 0.
A homozygous deletion counts as a PASSing genotype call for all the reference positions spanned by the deletion.
If the --vc-target-bed option is specified, the output is a target_bed_callability.bed file that contains the overall and autosome callability over the input target BED region. The padding size specified by the --vc-target-bed-padding option is used, and overlapping regions are merged.
Callability can also be output over custom regions. If the --qc-coverage-region-i option is used with --qc-coverage-reports-i (where i is 1, 2, or 3), callability can be added as a report type for that region. The output is a qc-coverage-region-i_callability.bed file. For each specified qc-coverage-region-i file, the average callability is reported in the variant calling metrics file. The padding size specified by the --qc-coverage-region-padding-i option is used, and overlapping regions are merged.
The optional minimum MAPQ and minimum BQ filters only influence read and base counting and do not influence the callability reports. The callability reports depend only on the gVCF PASS variants.
The following table shows which outputs are generated when default options (--vc-target-bed) versus optional coverage region options (--coverage-region) are used.
The GC bias metric is computed as follows.
Calculates GC content using a 100 bp wide, per-base rolling window over all chromosomes in the reference genome, excluding any decoys and alternate contigs. Windows containing more than four masked (N) bases in the reference are discarded.
Calculates the average coverage for each window, excluding any non-PF, duplicate, secondary, and supplementary reads.
Calculates the average global coverage across the whole genome.
Groups valid windows based on the percentage of GC content, both at individual percentages and five 20% ranges as summary.
Calculates the normalized coverage for each group by dividing the average coverage for the bin by the global average coverage across the genome. Values below 1.0 indicate a lower than expected coverage at the given GC percent or range. Coverages significantly deviating from 1.0 at greater GC values are an expected result.
Calculates dropout metrics as the sum of all positive values of (percentage of windows at GC X - percentage of aligned reads at GC X), summed over each GC ≤ 50% for AT dropout and over each GC > 50% for GC dropout.
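The steps above can be sketched in simplified form in Python. This is not the DRAGEN implementation. The function name `gc_bias` is hypothetical, the input is a single reference string with matching per-base depths that are assumed to be already filtered, the global average is taken over the valid windows, and the share of total window coverage stands in for the percentage of aligned reads at each GC value.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def gc_bias(reference: str, depths: Sequence[float], window: int = 100,
            max_masked: int = 4) -> Tuple[Dict[int, float], float, float]:
    """Return (normalized coverage per GC percent, AT dropout, GC dropout)."""
    if len(reference) != len(depths):
        raise ValueError("reference and depths must have the same length")
    cov_sum: Dict[int, float] = defaultdict(float)  # summed window coverage per GC percent
    win_count: Dict[int, int] = defaultdict(int)    # number of valid windows per GC percent
    for start in range(len(reference) - window + 1):
        seq = reference[start:start + window].upper()
        if seq.count("N") > max_masked:
            continue  # windows with too many masked bases are discarded
        gc_pct = round(100 * (seq.count("G") + seq.count("C")) / window)
        cov_sum[gc_pct] += sum(depths[start:start + window]) / window
        win_count[gc_pct] += 1
    total_windows = sum(win_count.values())
    total_cov = sum(cov_sum.values())
    if total_windows == 0 or total_cov == 0:
        raise ValueError("no valid windows or no coverage")
    global_mean = total_cov / total_windows
    normalized = {gc: (cov_sum[gc] / win_count[gc]) / global_mean
                  for gc in sorted(win_count)}
    at_dropout = gc_dropout = 0.0
    for gc in win_count:
        # Positive difference means fewer reads than the window share predicts.
        diff = 100 * win_count[gc] / total_windows - 100 * cov_sum[gc] / total_cov
        if diff > 0:
            if gc <= 50:
                at_dropout += diff
            else:
                gc_dropout += diff
    return normalized, at_dropout, gc_dropout


if __name__ == "__main__":
    # Toy input with a 4 bp window so that the result is easy to follow.
    print(gc_bias("AAAAGGGG", [10, 10, 10, 10, 30, 30, 30, 30], window=4))
```

In the toy input, coverage rises with GC content, so the low-GC bins have normalized coverage below 1.0 and only AT dropout is non-zero.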
By default, the GC bias metric report is not calculated. To enable GC Bias calculations, enter the --gc-metrics-enable command line option. The following is an example command:
$ dragen -b <BAM file> -r <reference genome> --gc-metrics-enable=true
The GC metrics report generates a gc_metrics.csv file. The file is structured as follows.
The GC bias report also includes the following command line options, but they are not recommended.
| Setting | Description |
|:---|:---|
| --gc-metrics-window-size | Overrides the default rolling window size of 100 bp. |
| --gc-metrics-num-bins | Overrides the number of summary bins. |
In somatic mode, DRAGEN automatically generates a somatic callable regions report as a bed file. The somatic callable regions report includes all regions with tumor coverage at least as high as the tumor threshold and (if applicable) normal coverage at least as high as the normal threshold. If only the tumor sample is provided, then the report includes all regions with tumor coverage at least as high as the tumor threshold. Each line in the bed output file is formatted as follows.
chromosome region_start region_end
You can specify the threshold values using the --vc-callability-tumor-thresh or --vc-callability-normal-thresh options. The default value for the tumor threshold is 15. The default value for the normal threshold is 5. For more information on each option, see Somatic Mode Options.
If the target BED or the --qc-coverage-region-i option (where i is 1, 2, or 3) is included in the run, DRAGEN generates corresponding somatic callable regions BED files in addition to the whole genome somatic callable regions BED file.
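The threshold rule for somatic callable regions can be illustrated with the following Python sketch, which merges consecutive positions that meet the thresholds into BED-style intervals. The function name `somatic_callable_regions` and the per-base depth lists are hypothetical, and the coordinates are assumed to be 0-based half-open. The default thresholds are the ones stated above.

```python
from typing import List, Optional, Sequence, Tuple


def somatic_callable_regions(chrom: str, tumor: Sequence[int],
                             normal: Optional[Sequence[int]] = None,
                             tumor_thresh: int = 15,
                             normal_thresh: int = 5) -> List[Tuple[str, int, int]]:
    """Return intervals where the coverage thresholds are met."""
    if normal is not None and len(normal) != len(tumor):
        raise ValueError("tumor and normal depths must have the same length")
    regions: List[Tuple[str, int, int]] = []
    run_start: Optional[int] = None
    for pos, tumor_depth in enumerate(tumor):
        callable_here = tumor_depth >= tumor_thresh and (
            normal is None or normal[pos] >= normal_thresh)
        if callable_here and run_start is None:
            run_start = pos                           # open a new region
        elif not callable_here and run_start is not None:
            regions.append((chrom, run_start, pos))   # close the region
            run_start = None
    if run_start is not None:
        regions.append((chrom, run_start, len(tumor)))
    return regions


if __name__ == "__main__":
    tumor = [20, 20, 10, 15, 15]
    normal = [5, 4, 9, 5, 6]
    print(somatic_callable_regions("chr1", tumor, normal))  # tumor-normal
    print(somatic_callable_regions("chr1", tumor))          # tumor-only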
The duration metrics section includes a breakdown of the run duration for each process. For example, the following metrics are generated for the mapper and variant caller pipeline:
Time loading reference
Time aligning reads
Time sorting and marking duplicates
Time DRAGStr calibration
Time partial reconfiguration
Time variant calling
Total run time
Specify a matched normal SNV VCF. For more information on specifying b-allele loci, see .
Specify a population SNP catalog. For more information on specifying b-allele loci, see .
If running in tumor-normal mode with the SNV caller enabled, use this option to specify the germline heterozygous sites. For more information on specifying b-allele loci, see .
Specify germline CNVs from the matched normal sample. For more information, see .
Use the variant allele frequencies (VAFs) from the somatic SNVs to help select the tumor model for the sample. For more information, see .
Enable HET-calling mode for heterogeneous segments. For more information, see .
--enable-variant-annotation=true
, --variant-annotation-assembly
, and --variant-annotation-data
enables Nirvana, the Illumina Annotation Engine. For more information on selecting the correct assembly and downloading reference files, see .
This file is used by default when building the HLA-specific hash-table as above, see .
Map-align must be enabled for HLA (see ). As such, tumor-normal paired file inputs from BAM are not currently supported for HLA calling.
Reads input files to estimate alignment statistics, including fragment size distribution and chromosome level depth. For more information on the SV Caller input options, see .
The DRAGEN SV Caller can discover all identifiable structural variant types in the absence of copy number analysis and large-scale de novo assembly. For more information on detectable types, see .
When performing somatic calling on liquid tumor samples, the matched normal sample might be contaminated with tumor cells. The contamination can substantially reduce somatic variant recall. To account for Tumor-in-Normal (TiN) contamination, you can enable liquid tumor mode. For more information, see .
You can also build systematic noise BEDPE files in the cloud using the .
The following prebuilt systematic noise files for WGS are available for download on the . To generate these noise files, we used 100 unrelated normal samples from the 1000 Genomes Project. Each systematic noise file contains a version string that DRAGEN uses to check the compatibility by default and exits early if a wrong systematic noise file is provided.
When running the SV Caller, the input sequencing reads must be from a standard Illumina paired-end sequencing assay with an FR read pair orientation, where for each sequence fragment, a read proceeds from each end of the fragment inwards. For more information, see .
In standalone mode, input sequencing reads must be mapped and provided as input in either BAM or CRAM format. Each input file must be coordinate sorted and indexed to produce an htslib-style index in a file named to match the input BAM or CRAM file with an additional .bai
, .crai
, or .csi
file name extension. For more information on standalone mode, see .
--ref-dir
---The DRAGEN reference genome hashtable directory. For more information about the reference genome hashtable, see .
--sv-enable-liquid-tumor-mode
---Enable liquid tumor mode. See .
--sv-tin-contam-tolerance
--- Set the Tumor-in-Normal (TiN) contamination tolerance level. See for more information.
--sv-systematic-noise
--- Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). For more information see .
--sv-detect-systematic-noise
--- Set to true to generate VCF output per normal sample. For more information see
--sv-build-systematic-noise-vcfs-list
--- List of input VCFs from previous step. Enter one VCF per line. For more information see
Standalone --- Uses mapped BAM/CRAM input files. If you have not mapped and aligned your data yet, see . This mode requires the following options:
Breakpoint insertions are represented differently in the VCF small indel format. The SV caller represents small deletions and insertions using the VCF small indel format instead of symbolic ALT alleles. Any breakpoint insertion that occurs in the VCF small indel format is represented as part of the VCF ALT field. See for information on the conditions this format is used for SVs under.
The GC bias report provides information on GC content and the associated read coverage across a genome. DRAGEN GC bias metric is modeled after the Picard implementation and adapted to preexisting internal measures. The DRAGEN GC bias correction module attempts to correct these biases following the target count stage. For more information, see
| 2 | 1 | 10% | -5% |
| 2 | 1 | 5% | -2.5% |
| 2 | 3 | 5% | +2.5% |
| 2 | 3 | 10% | +5% |
| XX | 2 | 0 |
| XY | 1 | 1 |
| XXY | 1 | 1 |
| XYY | 1 | 1 |
| X0 | 2 | 0 |
| XXXY | 1 | 1 |
| XXX | 2 | 0 |
| CNV+SNV | Supported | Supported | Supported |
| CNV+SV | Supported | Supported | Supported |
| SNV+SV | Supported | Supported | Supported |
| CNV+SNV+SV | Supported | Supported | Supported |
| CNV+SNV | Supported | Supported | Not Supported |
| CNV+SV | Supported | Supported | Not Supported |
| SNV+SV | Supported | Supported | Not Supported |
| CNV+SNV+SV | Supported | Supported | Not Supported |
| CNV+SNV | Supported | Supported | Supported |
| CNV+SV | Supported | Supported | Supported |
| SNV+SV | Supported | Supported | Supported |
| CNV+SNV+SV | Supported | Supported | Supported |
| XX | 0.75 | 1.25 | 0.00 | 0.25 |
| XY | 0.25 | 0.75 | 0.25 | 0.75 |
| XXY | 0.75 | 1.25 | 0.25 | 0.75 |
| XYY | 0.25 | 0.75 | 0.75 | 1.25 |
| X0 | 0.25 | 0.75 | 0.00 | 0.25 |
| XXXY | 1.25 | 1.75 | 0.25 | 0.75 |
| XXX | 1.25 | 1.75 | 0.00 | 0.25 |
| --vc-callability-tumor-thresh | The minimum coverage for usable coding regions. | 50 (default) | 50 (default) | 1000 (not default) |
| --tmb-vaf-threshold | Minimum variant allele frequency for usable variants. | 0.05 (default) | 0.05 (default) | 0.002 (not default) |
| --tmb-cosmic-count-threshold | Number of observations in COSMIC for a variant to be considered a driver mutation. | 50 (default) | 50 (default) | 50 (default) |
| --tmb-skip-db-filter | Do not use the Nirvana database to filter germline variants. | TRUE (default:T/N) | FALSE (default:T/O) | FALSE (not default) |
| --tmb-enable-proxi-filter | Use allele frequency information to filter germline variants. | OPTIONAL (default is FALSE) | FALSE (not default) | TRUE (not default) |
Eligible Region (Mbp)---The size, in Mbp, of the specified custom regions that meet the minimum coverage threshold.
Filtered Variant Count---Remaining variants after variant and germline filters.
Filtered Nonsyn Variant Count---Subset of filtered variants that are nonsynonymous.
TMB---Filtered variants normalized by the eligible regions (Mbp).
Nonsyn TMB---Filtered nonsynonymous variants normalized by the eligible regions (Mbp).
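The normalization in the TMB and Nonsyn TMB metrics is a simple ratio, sketched below in Python. The function name `tumor_mutational_burden` and the example numbers are hypothetical.

```python
def tumor_mutational_burden(filtered_variant_count: int,
                            eligible_region_mbp: float) -> float:
    """Filtered variants per Mbp of eligible region."""
    if eligible_region_mbp <= 0:
        raise ValueError("eligible region must be greater than 0 Mbp")
    return filtered_variant_count / eligible_region_mbp


if __name__ == "__main__":
    # Hypothetical counts: 30 filtered variants, 18 of them nonsynonymous,
    # over 1.5 Mbp of eligible region.
    print(tumor_mutational_burden(30, 1.5))  # TMB
    print(tumor_mutational_burden(18, 1.5))  # Nonsyn TMB
```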
A*26:01
A*29:02
B*44:02
B*44:03
C*05:01
C*16:01
DQA1*01:03
DQA1*01:02
DQB1*06:03
DQB1*06:02
DRB1*15:01
DRB1*15:01
| WGS_hg19_v3.0.0_systematic_noise.sv.bedpe.gz | >30x coverage using the Illumina NovaSeq 6000 system with 2x150bp reads for the HG19 reference | 3.0.0 | 4.3.* |
| WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz | >30x coverage using the Illumina NovaSeq 6000 system with 2x150bp reads for the HG38 reference | 3.0.0 | 4.3.* |
| WGS_hs37d5_v3.0.0_systematic_noise.sv.bedpe.gz | >30x coverage using the Illumina NovaSeq 6000 system with 2x150bp reads for the HS37D5 reference | 3.0.0 | 4.3.* |
contig1---Chromosome of the first region (string)
start1---Start position of the first region (0-based left-most position of the first breakpoint containing genomic interval, integer)
end1---End position of the first region (0-based left-most position of the first breakpoint containing genomic interval, integer)
contig2---Chromosome of the second region (string)
start2---Start position of the second region (0-based left-most position of the second breakpoint containing genomic interval, integer)
end2---End position of the second region (0-based left-most position of the second breakpoint containing genomic interval, integer)
event_id---The paired region unique ID (string)
score---The number of occurrences in the cohort
orientation1---Direction of breakpoint1 relative to the reference; "+" indicates to the right, "-" to the left (string, "+", or "-")
orientation2---Direction of breakpoint2 relative to the reference; "+" indicates to the right, "-" to the left (string, "+", or "-")
assembly-status---If all variants used to generate the noise candidate have end-to-end local assemblies, the noise candidate is "precise", otherwise it is "imprecise" (string, "precise", or "imprecise")
IMPRECISE---Flag indicating that the structural variation is imprecise, ie, the exact breakpoint location is not found
SVTYPE---Type of structural variant
SVLEN---Difference in length between REF and ALT alleles
END---End position of the variant described in this record
CIPOS---Confidence interval around POS
CIEND---Confidence interval around END
CIGAR---CIGAR alignment for each alternate indel allele
MATEID---ID of mate breakend
EVENT---ID of event associated to breakend
HOMLEN---Length of base pair identical homology at event breakpoints
HOMSEQ---Sequence of base pair identical homology at event breakpoints
SVINSLEN---Length of insertion
SVINSSEQ---Sequence of insertion
LEFT_SVINSSEQ---Known left side of insertion for an insertion of unknown length
RIGHT_SVINSSEQ---Known right side of insertion for an insertion of unknown length
PAIR_COUNT---Read pairs supporting this variant where both reads are confidently mapped
BND_PAIR_COUNT---Confidently mapped reads supporting this variant at this breakend (mapping may not be confident at remote breakend)
UPSTREAM_PAIR_COUNT---Confidently mapped reads supporting this variant at the upstream breakend (mapping may not be confident at downstream breakend)
DOWNSTREAM_PAIR_COUNT---Confidently mapped reads supporting this variant at this downstream breakend (mapping may not be confident at upstream breakend)
BND_DEPTH---Read depth at local translocation breakend
MATE_BND_DEPTH---Read depth at remote translocation mate breakend
JUNCTION_QUAL---If the SV junction is part of an EVENT (ie, a multi-adjacency variant), this field provides the QUAL value for the adjacency in question only
SOMATIC---Flag indicating a somatic variant
SOMATICSCORE---Somatic variant quality score
SOMATIC_EVENT---If the probability of the SV being a germline variant is greater than the probability of the SV being a somatic variant, this is 0. Otherwise, this is 1.
JUNCTION_SOMATICSCORE---If the SV junction is part of an EVENT (ie, a multi-adjacency variant), this field provides the SOMATICSCORE value for the adjacency in question only
CONTIG---Assembled contig sequence, if the variant is not imprecise (with --outputContig)
DUPSVLEN---Length of duplicated reference sequence
DUPHOMLEN---Length of base pair identical homology at event breakpoints excluding duplicated reference sequence
DUPHOMSEQ---Sequence of base pair identical homology at event breakpoints excluding duplicated reference sequence
DUPSVINSLEN---Length of inserted sequence after duplicated reference sequence
DUPSVINSSEQ---Inserted sequence after duplicated reference sequence
NotDiscovered---Variant candidate specified by the user and not discovered from input sequencing data
UserInputId---Variant ID from user input VCF
KnownSVScoring---Variant is associated with a user specified input variant, therefore scoring and filtration criteria are relaxed under a stronger prior assumption of truth
GT---Genotype
FT---Sample filter, 'PASS' indicates that all filters have passed for this sample
GQ---Genotype Quality
PL---Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification
PR---Number of spanning read pairs which strongly support the REF or ALT alleles
SR---Number of split-reads which strongly support the REF or ALT alleles
VF---Number of fragments which strongly support the REF or ALT alleles
| MinQUAL | Record | QUAL score is less than a threshold. The filter is not applied to records with the KnownSVScoring flag. |
| Ploidy | Record | For DEL and DUP variants, the genotypes of overlapping variants with similar size are inconsistent with diploid expectation. The filter is not applied to records with the KnownSVScoring flag. |
| MaxDepth | Record | Depth is greater than 3x the median chromosome depth near one or both variant breakends. The filter is not applied to records with the KnownSVScoring flag. |
| MaxMQ0Frac | Record | For a small variant (<1000 bases), the fraction of reads in all samples with MAPQ0 around either breakend exceeds 0.4. The filter is not applied to records with the KnownSVScoring flag. |
| NoPairSupport | Record | For variants significantly larger than the paired read fragment size, no paired reads support the alternate allele in any sample. The filter is not applied to records with the KnownSVScoring flag. |
| SampleFT | Record | No sample passes all the sample-level filters. |
| MinGQ | Sample | GQ score is less than 15. The filter is applied at the sample level and is not applied to records with the KnownSVScoring flag. |
| HomRef | Sample | Homozygous reference call. The filter is applied at the sample level. |
| MinSomaticScore | Record | SOMATICSCORE is less than a threshold. |
| MaxDepth | Record | Normal sample site depth is greater than 3x the median chromosome depth near one or both variant breakends. The filter is not applied to records with the KnownSVScoring flag. |
| MaxMQ0Frac | Record | For a small variant (<1000 bases) in the normal sample, the fraction of reads with MAPQ0 around either breakend exceeds 0.4. The filter is not applied to records with the KnownSVScoring flag. |
| SystematicNoise | Record | Variant overlaps with one of the paired regions in the systematic noise BEDPE file with matched orientation. The filter is not applied to records with the KnownSVScoring flag. |
| MinSomaticScore | Record | SOMATICSCORE is less than a threshold. |
| SystematicNoise | Record | Variant overlaps with one of the paired regions in the systematic noise BEDPE file with matched orientation. The filter is not applied to records with the KnownSVScoring flag. |
| MaxDepth | Record | Normal sample site depth is greater than 3x the median chromosome depth near one or both variant breakends. The filter is not applied to records with the KnownSVScoring flag. |
| MaxMQ0Frac | Record | For a small variant (<1000 bases), the fraction of reads with MAPQ0 around either breakend exceeds 0.4. The filter is not applied to records with the KnownSVScoring flag. |
qc-contam-min-cov---The minimum read coverage required for a pileup to be used in contamination detection (default is 10). Lower coverage may produce unreliable results.
qc-contam-min-valid-read-ratio---The minimum ratio of valid reads in a pileup for it to be considered valid. The default setting is 0.95, meaning 95% of the reads in a pileup must be valid. This value may be lowered to 0.75 and still yield accurate contamination estimates. If many reads are classified as invalid, it is likely due to untrimmed adapters that are being systematically soft clipped. It is recommended to fix the BAM file rather than force the contamination module to use these reads.
DRAGEN SNV VCF INFO DP field---The SNV VCF INFO DP field is computed after excluding unmapped reads, secondary alignments, bases with BQ<10, and bad quality reads (badly mated reads and reads with bad CIGARs). It is generally equal to or lower than the coverage reported in the fine_hist or other coverage reports. It is also expected to be lower than the unfiltered coverage track reported in IGV.
DRAGEN SNV VCF FORMAT DP field---The SNV VCF FORMAT DP field is similar to the INFO DP field, but it also excludes non-informative reads that match more than one haplotype equally well. In general, the following pattern is expected: FORMAT DP <= INFO DP <= per position coverage in the full_res report.
Input bases divided by reference genome size.
Available in mapping_metrics.csv file. This metric is a useful indication of the raw sequencing coverage. All primary reads (including duplicates) are considered. Secondary and supplementary alignments are ignored, but no other filters are applied.
Autosomal Median Coverage
Available in ploidy_estimation_metrics.csv. This is an internal development metric that makes various assumptions about which regions will be treated as callable or not. This metric will not be consistent with "Median autosomal coverage over genome" in "wgs_coverage_metrics.csv". It is not recommended for any QC.
| Coverage metrics | _coverage_metrics.csv | Important coverage summary statistics. On by default. |
| Fine histogram coverage | _fine_hist.csv | Detailed coverage histogram. On by default. |
| Histogram coverage | _hist.csv | Binned coverage histogram. On by default. |
| Overall mean coverage | _overall_mean_cov.csv | Redundant subset of information available in _coverage_metrics.csv. On by default. |
| Per contig mean coverage | _contig_mean_cov.csv | On by default. |
| Read-level coverage report | _read_cov_report.bed | On by default. |
| Basepair full resolution | _full_res.bed | Optionally enabled with custom reports. |
| Per BED region cov_report | _cov_report.bed | Optionally enabled with custom reports. |
| GVCF callability | _callability.bed | Optionally enabled with custom reports. |
| Basepair full resolution | --qc-coverage-region-1=BED_FILE_PATH --qc-coverage-reports-1 full_res |
| Per BED region cov_report | --qc-coverage-region-1=BED_FILE_PATH --qc-coverage-reports-1 cov_report |
| GVCF callability | --qc-coverage-region-1=BED_FILE_PATH --qc-coverage-reports-1 callability |
Handling of overlapping mates---By default, overlapping mates are double-counted. Set --qc-coverage-ignore-overlaps=true to resolve all of the alignments for each fragment and avoid double-counting any overlapping bases. This might result in marginally longer run times. This option also requires setting --enable-map-align=true. --qc-coverage-ignore-overlaps is a global setting and updates all qc-coverage reports.
Soft-clipped bases---By default, soft-clipped bases are not counted towards coverage. Set --qc-coverage-count-soft-clipped-bases=true to also include those bases in the coverage calculations. --qc-coverage-count-soft-clipped-bases is a global setting and updates all qc-coverage reports.
MAPQ and BQ filters---The --qc-coverage-filters-i setting can be used to override the minimum MAPQ and BQ filters. A coverage filter is enabled by using one of the --qc-coverage-filters-i options (where i is 1, 2, or 3) in combination with the associated --qc-coverage-region-i option. The default value for --qc-coverage-filters-i is mapq<1,bq<0. The default includes all BQ values but filters reads with MAPQ=0.
Contig name
Number of bases aligned to that contig, which excludes bases from duplicate marked reads, reads with MAPQ=0, and clipped bases.
Estimated coverage, as follows: <number of bases aligned to the contig (ie, Col2)> divided by <length of the contig or (if a target BED is used) the total length of the target region spanning that contig>.
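As a rough illustration of the third column, the estimated coverage is a simple ratio of the two quantities described above. The following Python sketch is not DRAGEN code; the function name and the example numbers are hypothetical.

```python
def estimated_coverage(aligned_bases: int, region_length: int) -> float:
    """Return counted bases divided by the contig (or target region) length."""
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    return aligned_bases / region_length


# Hypothetical contig of 1,000,000 bp with 30,000,000 counted bases.
print(estimated_coverage(30_000_000, 1_000_000))  # 30.0
```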
--vc-target-bed specified? Y/N
--qc-coverage-region-i (i equal to 1, 2, or 3) specified? Y/N
Expected Output Files
N
N
wgs_coverage_metrics.csv wgs_fine_hist.csv wgs_hist.csv wgs_overall_mean_cov.csv wgs_contig_mean_cov.csv
N
Y
wgs_coverage_metrics.csv wgs_fine_hist.csv wgs_hist.csv wgs_overall_mean_cov.csv wgs_contig_mean_cov.csv For each coverage region specified by the user: qc-coverage-region-i_coverage_metrics.csv qc-coverage-region-i_fine_hist.csv qc-coverage-region-i_hist.csv qc-coverage-region-i_overall_mean_cov.csv qc-coverage-region-i_contig_mean_cov.csv qc-coverage-region-i_full_res.bed if full_res report type is requested for qc-coverage-region-i qc-coverage-region-i_cov_report.bed if cov_report report type is requested for qc-coverage-region-i qc-coverage-region-i_callability.bed if GVCF mode is enabled and the callability or exome-callability report type is requested
Y
N
wgs_coverage_metrics.csv wgs_fine_hist.csv wgs_hist.csv wgs_overall_mean_cov.csv wgs_contig_mean_cov.csv target_bed_coverage_metrics.csv target_bed_fine_hist.csv target_bed_hist.csv target_bed_overall_mean_cov.csv target_bed_contig_mean_cov.csv target_bed_callability.bed if GVCF mode is enabled
Y
Y
wgs_coverage_metrics.csv wgs_fine_hist.csv wgs_hist.csv wgs_overall_mean_cov.csv wgs_contig_mean_cov.csv target_bed_coverage_metrics.csv target_bed_fine_hist.csv target_bed_hist.csv target_bed_overall_mean_cov.csv target_bed_contig_mean_cov.csv target_bed_callability.bed if GVCF mode is enabled For each coverage region specified by the user: qc-coverage-region-i_coverage_metrics.csv qc-coverage-region-i_fine_hist.csv qc-coverage-region-i_hist.csv qc-coverage-region-i_overall_mean_cov.csv qc-coverage-region-i_contig_mean_cov.csv qc-coverage-region-i_full_res.bed if full_res report type is requested for qc-coverage-region-i qc-coverage-region-i_cov_report.bed if cov_report report type is requested for qc-coverage-region-i qc-coverage-region-i_callability.bed if GVCF mode is enabled and the callability or exome-callability report type is requested
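The four cases in the table above can be summarized programmatically. The following Python sketch only restates the table; it is not part of DRAGEN, the function and argument names are hypothetical, and the returned names are the suffixes listed in the table rather than full output paths.

```python
BASE_SUFFIXES = [
    "coverage_metrics.csv",
    "fine_hist.csv",
    "hist.csv",
    "overall_mean_cov.csv",
    "contig_mean_cov.csv",
]


def expected_outputs(target_bed=False, regions=None, gvcf=False):
    """List the coverage outputs expected for a given configuration.

    regions maps a region index (1, 2, or 3) to the optional report
    types requested for that region, for example {1: {"full_res"}}.
    """
    files = [f"wgs_{s}" for s in BASE_SUFFIXES]
    if target_bed:
        files += [f"target_bed_{s}" for s in BASE_SUFFIXES]
        if gvcf:
            files.append("target_bed_callability.bed")
    for i, reports in sorted((regions or {}).items()):
        reports = set(reports)
        prefix = f"qc-coverage-region-{i}"
        files += [f"{prefix}_{s}" for s in BASE_SUFFIXES]
        if "full_res" in reports:
            files.append(f"{prefix}_full_res.bed")
        if "cov_report" in reports:
            files.append(f"{prefix}_cov_report.bed")
        if gvcf and reports & {"callability", "exome-callability"}:
            files.append(f"{prefix}_callability.bed")
    return files
```

For example, expected_outputs(regions={1: {"full_res"}}) returns the five wgs_ files, the five qc-coverage-region-1 files, and qc-coverage-region-1_full_res.bed.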
DRAGEN can subsample a random, fractional percentage of reads from an input file using the fractional downsampler. You can use downsampling to subsample data sets in order to simulate different amounts of sequencing. DRAGEN randomly subsamples reads from primary analysis without any modification (e.g. no trimming, no filtering, etc.).
To enable fractional downsampling, set the --enable-fractional-down-sampler
command line option to true
.
Any valid sequencing data format that is compatible with the DRAGEN Host Software can be used. For more information on compatible input options, see Input Options.
In addition to enabling the fractional downsampling command line option, you must set the subsample fraction to downsample. To set the subsample fraction, use --down-sampler-normal-subsample
and/or --down-sampler-tumor-subsample
depending on the input files.
You can also specify a seed using --down-sampler-random-seed
to generate different subsamples of the input data set.
--enable-fractional-down-sampler
Set to true
to enable fractional downsampling. The default value is false.
--down-sampler-normal-subsample
Specify the fraction of reads to keep as a subsample of normal input data. The default value is 1.0 (100%).
--down-sampler-tumor-subsample
Specify the fraction of reads to keep as a subsample of tumor input data. The default value is 1.0 (100%).
--down-sampler-random-seed
Specify the random seed for different runs of the same input data. The default value is 42.
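To illustrate how a subsample fraction and a random seed interact, the following Python sketch keeps each read independently with the given probability. It is only a conceptual model: the sampling method DRAGEN actually uses is not described here, and the function name subsample_reads is hypothetical.

```python
import random


def subsample_reads(reads, fraction, seed=42):
    """Keep each read independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # same seed and input give the same subsample
    return [read for read in reads if rng.random() < fraction]


reads = [f"read_{i}" for i in range(1000)]
kept = subsample_reads(reads, fraction=0.25, seed=42)
# len(kept) is close to 250; the exact count depends on the seed.
```

In this model, rerunning with the same seed reproduces the same subset, and changing the seed draws a different subset of the same input.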
DRAGEN can reserve a random subset of fragments that are separate from the normal alignment outputs using downsampling. You can use downsampling to generate data sets for performing comparisons between samples or between replicates. DRAGEN samples fragments after performing any hardware accelerated trimming or filtering functions, which enables DRAGEN to rapidly create analysis-read test data sets.
To enable downsampling, set the --enable-down-sampler
command line option to true
.
You can use any valid sequencing data format that is compatible with the DRAGEN Host Software. For more information on compatible input options, see Input Options.
DRAGEN downsampling outputs the reserved subset of data in FASTQ format. If the input is paired-end, DRAGEN outputs two FASTQ files that contain subsampled data. If the input is unpaired, DRAGEN outputs two FASTQ files.
In addition to enabling the downsampling command line option, you must set the quantity of fragments to downsample. To set the quantity of fragments, use either --down-sampler-fragments
or --down-sampler-coverage
.
If you specify a coverage level, you must also specify a genome using --ref-dir
or manually specify the genome size using --down-sampler-genome-size
. If you specify both a fragment limit and a coverage limit, DRAGEN applies both limits and keeps whichever result is smaller.
--enable-down-sampler
Set to true
to enable downsampling. The default value is false. If enabled, you must set either --down-sampler-fragments
or --down-sampler-coverage
.
--down-sampler-num-threads
Specify the number of threads to use for down-sampled reads. The default value is 8.
--down-sampler-random-seed
Set random seed for down-sampled fragments. The default value is 42.
--down-sampler-genome-size
Set target genome size for downsampling coverage. The default value is 0. The --down-sampler-genome-size
option is not compatible with the --ref-dir
option.
--down-sampler-fragments
Specify the target number of fragments for downsampling. The default value is 0.
--down-sampler-coverage
Set target genomic coverage for downsampling. The default value is 0. If enabled, you must set either --ref-dir
or --down-sampler-genome-size
.
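The relationship between a coverage target, the genome size, and a fragment count can be sketched as follows. This is a back-of-the-envelope model and not the DRAGEN implementation: the conversion from coverage to fragments, the read_length and paired parameters, and the treatment of 0 as "not set" (suggested by the default values of 0) are all assumptions.

```python
def fragments_for_coverage(coverage, genome_size, read_length, paired=True):
    """Approximate number of fragments needed to reach a mean coverage."""
    bases_per_fragment = read_length * (2 if paired else 1)
    return int(coverage * genome_size / bases_per_fragment)


def effective_fragment_limit(fragments=0, coverage=0.0, genome_size=0,
                             read_length=150, paired=True):
    """Combine both limits and keep the smaller; 0 is treated as 'not set'."""
    limits = []
    if fragments > 0:
        limits.append(fragments)
    if coverage > 0:
        if genome_size <= 0:
            raise ValueError("a genome size is required with a coverage target")
        limits.append(
            fragments_for_coverage(coverage, genome_size, read_length, paired)
        )
    if not limits:
        raise ValueError("set a fragment count or a coverage target")
    return min(limits)


# 10x over a 3 Gbp genome with 2 x 150 bp reads.
print(fragments_for_coverage(10, 3_000_000_000, 150))  # 100000000
```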
Large genomic rearrangements affecting one or more exons account for approximately 5–10% of all disease-causing mutations in the BRCA1 and BRCA2 genes in patients with hereditary breast and ovarian cancer syndrome. DRAGEN LR can detect within-gene large genomic rearrangements in tumor-only mode for targeted panels such as TruSight Oncology 500. The performance has been verified for BRCA1/2 with the TruSight Oncology 500 Assay.
Use the following command-line options to run large rearrangement detection. The same command-line options can be tested on other tumor-only pipelines.
--tso500-solid-brca-lr=true
Set to true
to enable large rearrangement parameters. This option is not limited to the TruSight Oncology 500 Assay.
--cnv-normals-list
Specify the panel of normal samples used to measure intrinsic biases of the upstream processes to allow for proper normalization. To generate a panel of normals, see the example command line. The panel of normal samples should be well matched to the case sample under analysis.
--cnv-target-bed
Specify the targeted regions of the panel.
--cnv-within-gene-lr-bed
Specify the gene regions in BED format in which to perform large rearrangement calling. Example file:
Run the following command on each normal sample to generate .target.counts.gc-corrected.gz
file.
Put the paths to the generated .target.counts.gc-corrected.gz
files into a text file, one path per line. Pass this file to --cnv-normals-list
.
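Building the list file is easy to script. The following Python sketch assumes that all per-sample count files were written to a single directory; the function name, directory layout, and list file name are hypothetical.

```python
from pathlib import Path


def write_normals_list(counts_dir, list_path):
    """Write one .target.counts.gc-corrected.gz path per line."""
    paths = sorted(Path(counts_dir).glob("*.target.counts.gc-corrected.gz"))
    if not paths:
        raise FileNotFoundError(f"no count files found in {counts_dir}")
    # Absolute paths keep the list valid regardless of the working directory.
    Path(list_path).write_text("".join(f"{p.resolve()}\n" for p in paths))
    return len(paths)
```

For example, write_normals_list("normals_out", "normals.txt") would produce a normals.txt file suitable for the --cnv-normals-list option.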
The output file .cnv.LR.json
contains the breakpoints detected for each specified gene region. The following is an example output file.
Note that the coordinates follow the BED format [start, stop), meaning:
start: segment start coordinate (0-based, inclusive: the first base on the chromosome is numbered 0, and the start coordinate is included in the interval).
stop: segment stop coordinate (0-based, exclusive: the first base on the chromosome is numbered 0, and the stop coordinate is not included in the interval).
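The half-open convention can be made concrete with a short Python sketch. The helper names are hypothetical; the arithmetic follows directly from the BED definition above.

```python
def bed_to_one_based(start, stop):
    """Convert a BED [start, stop) interval to 1-based inclusive coordinates."""
    if stop <= start:
        raise ValueError("stop must be greater than start")
    return start + 1, stop


def bed_length(start, stop):
    """Number of bases covered by a BED [start, stop) interval."""
    return stop - start


print(bed_to_one_based(0, 100))  # (1, 100)
print(bed_length(0, 100))        # 100
```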
Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.
DRAGEN MSI supports running in tumor-normal and tumor-only modes. Tumor-normal is generally expected to generate more accurate results. The tumor-only mode will require a panel of normals. The panel of normals will be generated using the collect-evidence
mode.
The following is an example command for tumor-normal
mode. Default resource files are available for WES and WGS. Please note that the WES and WGS tumor-normal
modes are fully supported and tested. Custom panels may require more extensive validation and possibly require generating a new sites file.
The following is an example command for the tumor-only
mode. Please note that the WES and WGS tumor-only
modes are not as extensively tested as the tumor-normal
modes. The TSO500 panels do not have normal controls, and are only tested and validated in tumor-only
mode.
msi-command tumor-only/tumor-normal/collect-evidence
Mode of execution: tumor-only, tumor-normal, or collect-evidence.
msi-microsatellites-file
Specify the file containing the microsatellites. You can generate this file by scanning the genome for microsatellites using an MSI-sensor. DRAGEN has tested with ≥ 10 bp homopolymers for solid samples, and 6-7 bp homopolymers for liquid samples.
msi-ref-normal-dir
Full name of directory containing files with normal reference repeat length distribution. Used only in tumor-only
mode. These files can be generated by running collect-evidence
on each normal sample. At least 20 normal samples are required.
msi-coverage-threshold
Specify the minimum spanning read coverage for a microsatellite. Microsatellites that do not meet the specified threshold are not included in analysis. DRAGEN recommends using 60 as the value for solid samples. For TSO500 liquid, a value of 500 is recommended.
msi-distance-threshold
Threshold for distance distributions to be considered different. Default is 0.1. For liquid samples, a value of 0.02 is recommended.
TSO500 Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". It is generally recommended to use "PercentageUnstableSites" as metric for determining the MSI status. This metric is normalized, and is expected to be more consistent for different pipelines and with different input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (PCR / UMI etc) and may need some empirical calibration.
Solid
TSO500
Part of TSO500 resource bundle. Repeats 10 - 50. 130 sites.
msi-distance-threshold=0.1
20
Heme
TSO500
N/A
N/A
N/A
Liquid (cfDNA)
TSO500
Part of TSO500 resource bundle. Repeats 6,7. 2344 sites.
msi-distance-threshold=0.02
TBD
Solid, Heme
WES
Available for download. Repeats 10 - 50. Approx. 3.5K sites.
msi-distance-threshold=0.1
TBD
Liquid (cfDNA)
WES
Available for download. Repeats 10 - 50. Approx. 3.5K sites.
msi-distance-threshold=0.02
TBD
Solid, Heme
WGS
Available for download. Repeats 10 - 50. Approx. 1 mil sites.
msi-distance-threshold=0.1
TBD
Liquid (cfDNA)
WGS
Available for download. Repeats 10 - 50. Approx. 1 mil sites.
msi-distance-threshold=0.02
TBD
The following is an example of a microsatellite file:
Default WES and WGS Microsatellite site files can be downloaded here: DRAGEN Software Support Site page
For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This will avoid using any off-target reads in the MSI analysis.
Custom Microsatellite site files may be required if a small panel is targeted and the default site files do not have sufficient overlapping sites.
Custom Microsatellite site files can be generated by using msi-sensor [https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices].
A subsequent post-processing step is recommended:
only keep microsatellites sites with a repeat unit of length 1
keep sites with 10 - 50bp repeats (a max length of 100bp repeats is supported)
remove any sites containing Ns in the left or right anchors
downsample the remaining sites to contain at least 2000 sites, but no more than 1 million sites (to avoid excessive run time)
Please note an error would occur if long (>100bp) microsatellite sites are present in the file.
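The post-processing rules above can be expressed as a small filter. The following Python sketch operates on already parsed sites; the dictionary keys (unit, repeats, left, right) are hypothetical and must be mapped onto the columns of the actual site file. It enforces the upper bound on the number of sites but does not check the recommended minimum of 2000.

```python
import random


def filter_sites(sites, min_len=10, max_len=50, max_sites=1_000_000, seed=0):
    """Keep homopolymer sites of 10-50 bp without Ns in either anchor."""
    kept = [
        s for s in sites
        if len(s["unit"]) == 1                                   # repeat unit of length 1
        and min_len <= len(s["unit"]) * s["repeats"] <= max_len  # total repeat length in bp
        and "N" not in s["left"].upper()
        and "N" not in s["right"].upper()
    ]
    if len(kept) > max_sites:
        # Random downsampling that preserves the original site order.
        chosen = sorted(random.Random(seed).sample(range(len(kept)), max_sites))
        kept = [kept[i] for i in chosen]
    return kept
```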
Normal reference files can be generated by running collect-evidence
mode on a panel of normal samples.
Please note:
The collect-evidence
mode MUST be run in DRAGEN germline mode.
The --msi-microsatellites-file
and --msi-coverage-threshold
settings used in collect-evidence
mode must be consistent with the settings used during tumor-only MSI calling.
At least 20 normal samples are required.
The output containing the MSI score (PercentageUnstableSites
) is stored in <output prefix>.microsat_output.json
.
The "SumDistance" is the sum of the Jensen-Shannon distances of all unstable sites, based on the distances between the tumor and normal distributions. "SumDistance" depends on the size of the microsatellite file and is not normalized. In general, it is recommended to set MSI thresholds based on "PercentageUnstableSites" rather than "SumDistance".
In TSO500, Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.
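Applying the threshold to the JSON output could look like the following Python sketch. It assumes that PercentageUnstableSites is a top-level key of the JSON document and that its value can be converted to a number; the function name and the status labels are hypothetical, and the default threshold of 20 is the TSO500 Solid value quoted above.

```python
import json


def msi_status(json_text, threshold=20.0):
    """Classify a sample from its PercentageUnstableSites value."""
    report = json.loads(json_text)
    # Assumption: the metric is stored at the top level of the JSON document.
    pct = float(report["PercentageUnstableSites"])
    label = "MSI-unstable" if pct >= threshold else "MSI-stable"
    return label, pct


print(msi_status('{"PercentageUnstableSites": "25.5"}'))  # ('MSI-unstable', 25.5)
```

For other assays, the threshold argument would need to be replaced with an empirically calibrated value.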
There are two other output files (*_diffs.txt
and *.dist
) that are useful for debugging.
Here is an example of *_diffs.txt
file
The fourth column (Assessed) is the coverage filter. Any site with coverage >= 60 is true for this column
The sixth column (PassFilter) is an internal flag used for the left allele filter. It removes low-quality sites that have no coverage and helps to increase prediction accuracy. It is true when the following conditions are met.
The *.dist
file stores the read counts for each repeat length of the microsatellite site
The coverage of the site can be obtained by summing up all counts in the last column
<output prefix>.microsat_output.json
(described above)
<output prefix>.microsat_tumor.dist
. This file contains the repeat length array for every microsatellite.
Column length_dis
is the repeat length array.
<output prefix>.microsat_diffs.txt
. This file contains the distance metrics for every microsatellite between tumor/normal or tumor/reference normals.
Column Assessed
indicates if a site passes the coverage filter (msi-coverage-threshold
). Column PassFilter
is an internal metric and currently is not used for filtering microsatellites.
The MSI algorithm performs the following steps:
Tabulates tumor and normal counts from the read alignments for each microsatellite site.
Calculates Jensen-Shannon distance of tumor and normal distribution for each microsatellite site (tumor-normal
mode), or Jensen-Shannon distance of two normal baseline samples (tumor-only
mode).
Determines unstable sites by performing chi-square testing of tumor and normal distribution. Unstable sites have repeat length distributions that are significantly shifted between tumor and normal measured by Jensen-Shannon distance (tumor-normal
mode). In tumor-only
mode, JSD is calculated for each pair of tumor and normal reference samples, as well as for each pair of normal-normal samples. The two sets of JSD values are then compared to derive a mean distance difference and a p-value calculated from Student's t-test. Microsatellite instability is called if the mean distance difference is greater than or equal to the distance threshold (default 0.1) and the p-value is less than or equal to the p-value threshold (default 0.01).
Produces a report given assessed site count, unstable site count, the percentage of unstable sites in all assessed sites and the sum of the Jensen-Shannon distance of all the unstable sites.
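The distance used in the steps above can be sketched in Python as follows. This is a generic implementation of the Jensen-Shannon distance between two repeat-length histograms, not DRAGEN's code; the use of base-2 logarithms (which bounds the distance between 0 and 1) and the dictionary input format are assumptions.

```python
import math


def normalize(counts):
    """Turn a {repeat_length: read_count} histogram into probabilities."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("histogram has no reads")
    return {length: n / total for length, n in counts.items()}


def jensen_shannon_distance(counts_a, counts_b):
    """Square root of the base-2 Jensen-Shannon divergence of two histograms."""
    p, q = normalize(counts_a), normalize(counts_b)
    divergence = 0.0
    for length in set(p) | set(q):
        pk, qk = p.get(length, 0.0), q.get(length, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            divergence += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * math.log2(qk / mk)
    return math.sqrt(max(divergence, 0.0))


normal = {12: 5, 13: 90, 14: 5}
tumor = {10: 30, 11: 30, 13: 40}
distance = jensen_shannon_distance(tumor, normal)
```

Identical distributions give a distance of 0, and distributions with no repeat lengths in common give a distance of 1, so a threshold such as 0.1 corresponds to a modest but consistent shift in repeat lengths.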
While DRAGEN secondary analysis is capable of supporting up to 1000x coverage, its default settings are tuned for a more typical sample size in the ~100x range. So if you find that the processing of your large sample doesn't complete, or gives unexpected results, there are options available to improve the behavior.
Users may want to analyze high amounts of data using the DRAGEN secondary analysis. For instance, in somatic contexts it can be beneficial to sequence the tumor at a very high depth to detect mutations at even lower frequencies. DRAGEN reliably supports a total average coverage of up to 1000x. As the input read data can grow excessively, but the system memory is limited, DRAGEN can only keep a subset of the input in RAM at the same time. The area reserved for the read data is called bin_memory. Higher bin_memory size means that bigger chunks can be processed simultaneously, but less memory is available to the rest of DRAGEN or for other processes.
After the map-align step, reads are loaded into the bin_memory, looking for regions of zero coverage. A set of reads that spans two such zero-coverage loci is a callable region. The memory used by a callable region is determined by the number of reads and their length. For instance, a long region with few reads per position uses the same amount of memory as a short region with a spike in coverage. The size of a callable region must stay well below the size of the bin_memory. To this end, any callable region that surpasses the --vc-max-callable-region-memory-usage
threshold is cut into smaller regions. Due to these cuts, the accuracy of variant calls in the vicinity may be affected.
The following options can be used to change the bin_memory and callable region size.
--bin_memory
Set the amount of memory reserved for read data. Defaults to at least 20GB for germline and 40GB for a somatic run.
--vc-max-callable-region-memory-usage
Set the maximum size of a single callable region. Default is 13GB.
The DRAGEN Indel Re-aligner is a consensus-based re-alignment step, independent of other DRAGEN callers and pipelines. Re-aligned reads are reflected in the output BAM file, and their original alignment is described in an OA tag. The implementation is similar to the Indel Re-aligner tool that was found in GATK3. The tool is designed to reduce false positive SNPs by considering evidence of nearby indels.
The pipeline comprises two concurrent steps: interval creation and re-alignment. The interval creation step identifies genomic intervals for which there is evidence of insertions or deletions in the CIGARs of properly paired (if paired) reads aligned with positive MAPQ. To output these intervals as a text file, use the command line argument --ir-write-intervals-file=true
. Each line will describe a genomic interval as chrom:start-end, or chrom:start for intervals of length one. The start and end positions are both inclusive and 1-based. The intervals file will be written to the DRAGEN output directory, with the suffix realign-intervals.txt
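A reader for the intervals file could look like the following Python sketch, which follows the line format described above. The function names are hypothetical. Splitting on the last colon is a defensive choice for contig names that themselves contain colons.

```python
def parse_interval(line):
    """Parse 'chrom:start-end' or 'chrom:start' (1-based, inclusive)."""
    chrom, _, span = line.strip().rpartition(":")
    if not chrom or not span:
        raise ValueError(f"malformed interval: {line!r}")
    start_text, dash, end_text = span.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else start  # 'chrom:start' has length one
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates: {line!r}")
    return chrom, start, end


def total_interval_length(lines):
    """Sum of interval lengths; both ends are inclusive."""
    return sum(end - start + 1 for _, start, end in map(parse_interval, lines))


print(parse_interval("chr1:100-200"))                     # ('chr1', 100, 200)
print(total_interval_length(["chr1:100-200", "chr2:55"]))  # 102
```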
For each genomic interval, the realignment step groups all aligned reads that intersect the interval. If there are more than ir-max-num-reads
reads that intersect the interval, it is skipped. The following reads are then discarded from the re-alignment analysis:
Non-primary aligned reads.
Reads whose mapping quality is zero.
Paired end reads that mapped to different contigs.
Paired end reads that mapped to the same contig with start positions more than ir-max-distance-between-mates
apart.
Reads that have not been discarded are candidates for re-alignment. If there are more than ir-max-num-candidates
candidates, the interval is skipped. A consensus read is generated from each re-alignment candidate that has a single indel that is not the first or last CIGAR operation, excluding clip operations. If there are more than ir-max-num-consensus
consensus reads, the interval is skipped. Each re-alignment candidate is then scored against each consensus to determine the winning consensus. If the combined score for the interval against the winning consensus is better than the score against the reference by a difference of at least ir-realignment-threshold
, each read's start position, CIGAR, and NM tag are updated to reflect the re-alignment. The scoring used is Hamming distance weighted by base qualities. OA tags that describe the original alignment are added to any re-aligned reads. Mate positions of reads whose mate was re-aligned are updated as well.
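The scoring rule can be illustrated with a simplified Python sketch. It assumes that each read has already been laid out against an equal-length window of the reference and of the winning consensus, so that no gaps need to be handled; the function names and this data layout are hypothetical, and DRAGEN's internal implementation may differ.

```python
def weighted_mismatches(read, quals, template):
    """Hamming distance weighted by base quality: sum quals at mismatches."""
    if not (len(read) == len(quals) == len(template)):
        raise ValueError("read, quals, and template must have equal length")
    return sum(q for base, ref, q in zip(read, template, quals) if base != ref)


def prefers_consensus(reads, threshold=50):
    """Decide whether an interval should be re-aligned.

    reads is an iterable of (bases, quals, reference_window, consensus_window).
    Lower scores are better; re-alignment requires an improvement of at
    least `threshold` summed over all reads in the interval.
    """
    ref_score = 0
    consensus_score = 0
    for bases, quals, ref_window, consensus_window in reads:
        ref_score += weighted_mismatches(bases, quals, ref_window)
        consensus_score += weighted_mismatches(bases, quals, consensus_window)
    return ref_score - consensus_score >= threshold


read = ("ACGT", [30, 30, 30, 30], "ACCT", "ACGT")
print(prefers_consensus([read, read]))  # True: improvement of 60 >= 50
print(prefers_consensus([read]))        # False: improvement of 30 < 50
```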
When the re-alignment step is complete, a summary is printed to standard output. It describes the number of intervals found, the sum of the lengths of all intervals, the number of reads that intersected intervals, the number of reads that were re-aligned, and the number of reads that were skipped due to memory constraints. Such reads are documented in the DRAGEN log. This may happen in regions with very deep coverage.
The DRAGEN Indel Re-aligner is designed to improve the quality of the DRAGEN BAM output for downstream analysis. The DRAGEN small variant caller pipeline does not read the output BAM, and has its own internal haplotype assembly step, which usually recovers most of the artifacts found during Indel Re-alignment. Limited testing has shown that there may be a small improvement in DRAGEN small variant calls when Indel Re-alignment is enabled. However, Indel Re-alignment slows down a DRAGEN Map/Align + VC run by roughly a factor of two. For that reason, it is not recommended to enable Indel Re-alignment with the DRAGEN VC, and it is not enabled by default.
The Indel Re-alignment pipeline cannot run with:
The UMI pipeline.
The Methylation pipelines.
--qc-coverage-ignore-overlaps=true
.
SA tag generation (--generate-sa-tags=true
).
The Expansion Hunter pipeline.
enable-indel-realigner
Enable indel re-alignment
False
ir-write-intervals-file
Output a file with the reference intervals that contain evidence for re-alignment.
False
ir-max-num-reads
Max number of reads in an interval for re-alignment.
20,000
ir-max-num-candidates
Max number of re-alignment candidates in an interval for re-alignment.
256
ir-max-num-consensus
Max number of consensus reads in an interval for re-alignment.
256
ir-max-distance-between-mates
Max distance between the start positions of mates mapped to the same contig for a paired-end read to be considered for re-alignment.
100,000
ir-realignment-threshold
Minimal improvement of sum of mismatching base qualities to merit realignment.
50
CheckFingerprint is broadly based on Picard CheckFingerprint. CheckFingerprint outputs a LOD score that indicates whether the genetic data in two files comes from the same individual.
If the LOD score is positive, the two samples come from the same individual. Otherwise, the two samples come from different individuals.
In general, the sign of the LOD in the summary file should be consistent with the Picard CheckFingerprint summary file, but the exact values may differ.
Validation was done on whole-genome sequencing (WGS) data and on a mix of WGS samples and whole-exome sequencing data.
The checks can run in one of two modes:
Read comparison mode. Aligned reads are compared with the expected VCF
VCF comparison mode. Output VCF is compared with the expected VCF
To enable CheckFingerprint module, the following command-line options are required.
--enable-checkfingerprint true
--checkfingerprint-expected-vcf [path_to_expected_sample_vcf]
Read comparison mode is enabled by default. Read comparison mode is recommended for small datasets or whole-exome sequencing data.
To switch to VCF comparison mode, use the following options
--checkfingerprint-enable-vcf-comparison true
--enable-variant-caller true
VCF comparison mode is recommended for larger samples, such as whole-genome sequencing data with an average coverage of 30x or whole-exome sequencing data.
Read mode. Input BAM/FASTQ/CRAM, examine the individual reads in input sample, and compare individual reads with expected VCF file.
VCF mode. Input BAM/FASTQ/CRAM, generate a VCF file first, and compare the VCF file with expected VCF file
VCF mode. Input an observed VCF file, and compare observed VCF file with expected VCF file
The input files used by DRAGEN CheckFingerprint are: a) haplotype map (configuration files), b) FASTQ/BAM/CRAM (user input) or observed VCF file (user input), c) expected VCF file (user input).
a) Haplotype Map
Haplotype maps for hg19, hg38, and chm13 are packaged with DRAGEN and automatically selected by the software. The haplotype map is a set of SNPs grouped into haplotype blocks (also known as linkage disequilibrium blocks). The SNPs in the haplotype map are used for fingerprinting.
The following columns are of interest:
NAME
SNP identifier
MAF
minor allele frequency
ANCHOR_SNP
refers to the NAME of a SNP. SNPs with the same ANCHOR_SNP have high linkage disequilibrium with each other.
b) Sample Input
Samples are input from bam/cram/fastq or observed vcf files.
The following command-line example uses FASTQ input:
The following command-line example uses vcf input:
c) Expected VCF Input
VCF output from DRAGEN is recommended. It can contain multiple samples. Multiple-sample VCFs can be combined and passed to --checkfingerprint-expected-vcf
CheckFingerprint calculates the LOD between the input sample (BAM/CRAM/FASTQ or VCF) and each sample in the expected VCF file.
There are two main output files:
[output-file-prefix].CheckFingerprint.summary.txt : contains LOD scores between input sample and expected sample
[output-file-prefix].CheckFingerprint.detail.txt : contains LOD scores between individual SNPs.
CheckFingerprint.detail.txt example
CheckFingerprint.summary.txt example LOD_EXPECTED_SAMPLE is the LOD score between two samples
CheckFingerprint calculates the LOD score to identify whether two samples are from the same individual. A positive value indicates that the two samples are from the same individual. A negative value indicates that the two samples do not match. LOD is on a logarithmic scale (base 10). Thus, a LOD of 4 indicates that it is 10,000 times more likely that the data matches the genotypes than not. A score close to 0 is inconclusive and can result from low coverage or missing informative genotypes. The identity check takes advantage of the haplotype blocks defined in the configuration file (hg38_nochr.map, hg19_nochr.map). Checking SNPs in haplotype blocks can improve the statistical power of identity detection.
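Because the LOD is a base-10 logarithm, it converts directly to a likelihood ratio. The following Python sketch shows the conversion and a simple interpretation. The width of the inconclusive band around 0 is a hypothetical choice made for illustration only; the text above does not define a specific cutoff.

```python
def lod_to_likelihood_ratio(lod):
    """Convert a base-10 LOD score to a likelihood ratio."""
    return 10.0 ** lod


def interpret_lod(lod, inconclusive_band=1.0):
    """Label a LOD score; the band around 0 is an illustrative assumption."""
    if abs(lod) < inconclusive_band:
        return "inconclusive"
    return "match" if lod > 0 else "mismatch"


print(lod_to_likelihood_ratio(4))  # 10000.0
print(interpret_lod(4))            # match
print(interpret_lod(-3))           # mismatch
```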
In VCF mode, CheckFingerprint uses PL to estimate genotype probabilities.
Limitations: Currently, VCF mode is designed for whole-genome sequencing samples with 30x coverage; Read mode is designed for whole-exome sequencing. Larger datasets may encounter timeout errors. VCF mode is recommended for general use. Read mode should be used in isolation without other components enabled and should only be used if VCF mode does not provide sufficient accuracy.
The Star Allele Caller identifies the genotypes and metabolism status of the following PGx genes that are included in FDA's PGx recommendations or have CPIC Level A designation : CACNA1S, CFTR, CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, IFNL3, RYR1, NUDT15, SLCO1B1, TPMT, UGT1A1, VKORC1, DPYD, G6PD, MT-RNR1, BCHE, ABCG2, NAT2, F5 and UGT2B17. It finds optimal genotypes for the above genes, based on star allele definitions from resources listed below. It calls metabolism status based on a PharmCAT resource file that provides mappings between genotypes and phenotypes. The file is here. The primary support for the Star Allele Caller is for human reference hg38 for which it supports the above mentioned genes. Additionally, it also supports the following genes on references hg19 and GRCh37 : CACNA1S, CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, IFNL3, NUDT15, SLCO1B1, VKORC1, DPYD, ABCG2, F5.
For genes CACNA1S, CFTR, CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, IFNL3, RYR1, NUDT15, SLCO1B1, TPMT, UGT1A1, VKORC1, DPYD, G6PD, MT-RNR1, ABCG2 the allele definitions are sourced from PharmGKB which are found here. For BCHE and NAT2, the alleles are sourced from this paper and this website, respectively. For UGT2B17, the star alleles are defined here. Note that since BCHE does not have defined star alleles, the Star Allele Caller checks if a sample is positive for any of the variants that are reported in the paper.
For genes CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, NUDT15, SLCO1B1, DPYD, the definitions are sourced from PharmVAR and can be found here. For the remaining hg19/GRCh37 genes, i.e., ABCG2, CACNA1S, IFNL3, F5 and VKORC1 - the allele definitions have been lifted from their corresponding definitions for hg38 (which are sourced from PharmGKB as noted above).
The Star Allele Caller has the following features.
It calls star allele genotypes from different types of genomic data like FASTQ, BAM, gVCF, VCF.
It provides additional details about the genotype call, including a confidence score.
It assumes genotypes for missing positions to be ref - these positions are listed in the output.
It assumes filtered genotype calls to be ref - these records are also listed in the output.
If multiple optimal diplotypes are satisfied, then it lists them all.
It supports different versions of the human reference hg38, hg19 and GRCh37.
For the genes UGT2B17 and CYP2C19, the caller analyzes CNV calls to detect star alleles.
The Star Allele Caller can accept different forms of sequence data as input, such as FASTQ files, BAM/CRAM files, or gVCF/VCF files.
If small variant VCF/gVCF and CNV-VCF files are used as input, they should meet the following specifications.
Must be aligned to the same human reference that is passed through the -r option.
Variants should follow a parsimonious left aligned variant representation format.
Complex variants - for example, representing closely located, independent variants, in a single record - are NOT supported.
Note that VCF/gVCF files can also be provided as compressed GZ files (i.e. <file_name>.vcf.gz
or <file_name>.gvcf.gz
).
For running the caller, the human reference needs to be always passed as a command line option. The Star Allele Caller detects the reference version (i.e., hg19, GRCh37 or hg38) and accordingly reads in the correct allele definitions.
The Star allele caller can be enabled in parallel with other components as part of a WGS germline analysis workflow using the option --enable-pgx
(see DRAGEN Recipe - Germline WGS)
In the simplest case, the caller takes DRAGEN gVCF and DRAGEN CNV-VCF files as input. The following is an example of the command line for the basic use case.
Contrary to a variant-only VCF file, a DRAGEN gVCF file contains the genotypes for all positions in a genome. Although the gVCF format is the preferred format for the caller, it can also accept a standard variant-only VCF file as input. The command line for this case will be the same as above, with the VCF file passed instead of a gVCF file. Also, the CNV-VCF file is optional - in this case the Star Allele Caller will not call star alleles that are detected through CNV analysis. An example of this use case, with only a variant only VCF file as input, is as follows.
To run the Star Allele Caller from a BAM input, the variant caller must also be enabled. The CNV caller should preferably also be enabled so that CNV star alleles can be analyzed. An example of the command line for this use case is as follows.
Note that the Star Allele Caller supports force genotyping option of the variant caller (set by --vc-forcegt-vcf
) but other variant caller options, such as combining phased variants (set using --vc-combine-phased-variants-distance
), are NOT supported at this time.
If a FASTQ file is used as input, additional options, --RGID
and --RGSM
need to be set in the command line. An example of the command line for this use case is as follows.
Following completion of the DRAGEN Star Allele Caller run, the following output files are produced.
When the Star Allele Caller is run with small variant calling, or directly from genome VCF input, then the main output file, <prefix>.targeted.json
contains the complete and detailed results for all genes. This is an example output for one gene DPYD
and for one sample NA19374
.
The fields in the JSON file are as follows.
"genomeBuild": Reference version being used
"softwareVersion": Version of DRAGEN being run
"sampleId": Sample name
"phenotypeDatabaseSources": Resources used for calling metabolism status (phenotype)
"starAlleleDatabaseSources": Resources used for identifying star alleles (genotype)
"locusAnnotations": List of star allele caller results, one for each gene
"gene": Gene name
"geneId": Static HGNC or Ensembl ID of the gene
"starAlleleDatabaseSource": Resource for the star allele definitions file
"genotype": The detected star allele diplotype (or haplotype for a haploid gene)
"genotypeQuality": Phred-scaled quality score for the genotype
"phenotypeDatabaseAnnotation": Metabolism status corresponding to the genotype called
"supportingVariants": List of star alleles that are satisfied by the found variants. The id field denotes the name of the star allele. Each non-ref star allele has a list of supportingVariants that gives the variant details (the same as in the small variant VCF file; the quality field denotes the GQ field from the VCF record)
"missingVariantSites": List of relevant gene sites for which VCF records are missing or filtered
"variantStarAllelesFound": List of star allele haplotypes that are satisfied by the found variants
Each star allele genotype contains one or two haplotypes separated by a slash (e.g. *1/*2): a single haplotype for the chrM gene MT-RNR1 and, in male samples, the chrX gene G6PD, and a diplotype for all other genes. Each haplotype is a predefined star allele, and the definitions can be found under the allele definitions URL. Note that star allele definitions and notations may vary depending on the resource and when it was last updated. When the Star Allele Caller cannot identify an optimal genotype for a gene, it makes a no-call (./. or .). In certain cases more than one genotype is optimally satisfied; all satisfied genotypes are then listed, separated by a semicolon (e.g. *1/*2;*3/*4).
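The genotype notation described here is simple to parse downstream. The following is a minimal Python sketch, not part of DRAGEN: the function names are hypothetical, the placement of "gene" and "genotype" inside each "locusAnnotations" entry is assumed from the field list, and the sample record is made up for illustration.

```python
def parse_star_genotype(genotype):
    """Split a genotype string into candidate genotypes.

    Each candidate is a list of one or two haplotype names.
    A no-call ("./." or ".") yields an empty list.
    """
    candidates = []
    for candidate in genotype.split(";"):
        haplotypes = candidate.strip().split("/")
        if all(h == "." for h in haplotypes):
            continue  # no-call
        candidates.append(haplotypes)
    return candidates


def summarize_report(report):
    """Map each gene in "locusAnnotations" to its parsed genotype candidates."""
    summary = {}
    for locus in report.get("locusAnnotations", []):
        summary[locus["gene"]] = parse_star_genotype(locus["genotype"])
    return summary


# Made-up record for illustration only; not real caller output.
example_report = {
    "sampleId": "SAMPLE_1",
    "locusAnnotations": [
        {"gene": "GENE_A", "genotype": "*1/*2;*3/*4", "genotypeQuality": 20},
        {"gene": "GENE_B", "genotype": "./.", "genotypeQuality": 0},
    ],
}

if __name__ == "__main__":
    print(summarize_report(example_report))
```

In practice the dictionary would come from `json.load` on the `<prefix>.targeted.json` file rather than from an inline literal.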
TSV and JSON files (<prefix>.star_allele.tsv and <prefix>.star_allele.json, respectively) are produced when the Star Allele Caller is run stand-alone from a gVCF or VCF file, or if the option --targeted-enable-legacy-output is set. The JSON file has the same format as <prefix>.targeted.json (shown above), while the TSV file contains summarized star allele calls for each gene. The following is an example for one gene from the TSV output. The fields are gene name and genotype.
DRAGEN implements a beta version of the Population Haplotyping tool. This tool supports the estimation of haplotypes from a population scale dataset via the packaging of the SHAPEIT5 Software (2022, Hofmeister RJ, Ribeiro DM, Rubinacci S., Delaneau O). It is designed to phase common variants as well as rare variants in a step-by-step mode. The following step-by-step workflow must be reproduced to phase each chromosome of the studied genome.
Step 1: Phase Common step to estimate the haplotypes of common variants (variants with allele frequency above a given allele frequency threshold) on defined regions.
Step 2: Common Ligate step to ligate the phased common variants from step 1 into a single chromosome.
Step 3: Phase Rare step to add the haplotypes of rare variants (variants with allele frequency below a given allele frequency threshold) on defined regions to the common variant scaffold obtained in step 2.
Step 4: Concat All step to concatenate the haplotype regions obtained in step 3 into a single chromosome.
This tool is most accurate on population-scale datasets with thousands of samples. Running it on multiple nodes to parallelize processing is recommended. A common use case of the Population Haplotyping tool is the generation of a custom reference panel to be used for the VCF Imputation pipeline.
The tool supports autosomes and mixed ploidy chromosomes for diploid species only. It does not use FPGA acceleration and can run on a generic software-only compute node.
Note: the Population Haplotyping tool only supports input msVCF produced with the DRAGEN gVCF Genotyper tool.
The following are examples of the commands required to generate haplotypes for common and rare variants (with the default allele frequency threshold) on a population-scale dataset:
To generate per chromosome haplotypes:
To generate per genome haplotyped sites
For the Phase Common step (step 1), it is recommended to provide msVCF generated with the DRAGEN gVCF Genotyper tool. This first step takes as input a .txt file with path to a single msVCF or a list of msVCF, one line per path. The msVCF must comply with the following requirements:
per chromosome msVCF OR positionally sorted msVCF shards spanning a whole chromosome without overlap. See below for shard definition
generated from the same reference build
compressed and indexed
with unphased GT calls
with no duplicates
with header ##contig "ID" and "length" fields for all contigs present in the studied genome
Note: for mixed ploidy chromosomes, each PAR and non-PAR region of the chromosome must be treated as a single chromosome. For example, on human data, the sample input msVCF for chrX must be divided into chrX_par1, chrX_par2, and chrX_nonpar.
The msVCF input list provided at step 1 is pre-processed to generate a formatted msVCF called <prefix>.preprocess.vcf.gz. This formatted msVCF is generated in the output directory and must be used as input of the Phase Rare step (step 3).
To facilitate parallel processing on distributed compute nodes, and to avoid downloading and uploading a chromosome-level multisample VCF for each sub-chromosome job, chromosome portions of equal size (shards) can be used as input. The gVCF Genotyper tool can generate these equal-size shards when run with the appropriate option. Note: streaming from the cloud is not supported. Instead, download the input in advance and process it locally to achieve maximum I/O efficiency and stability.
A per-chromosome genetic map corresponding to the studied species and to the reference build used for the msVCF input is required. You can use your own genetic map computed from the recombination rate of the species and its reference genome, or use the genetic map corresponding to the human hg38 reference genome, available for download from the DRAGEN Software Support Site page. DRAGEN does not generate custom genetic map files.
The genetic map should follow the format:
3 columns: position, chromosome number, distance (cM), in this order and tab separated
The genetic map for a mixed ploidy chromosome must be separated into its PAR and non-PAR regions (e.g. for human, chromosome X is split into PAR1 chrX_par1, PAR2 chrX_par2, and non-PAR chrX_nonpar regions)
A genetic map is not needed for a region in which all samples are haploid (e.g. for human, chromosome Y chrY)
The user must ensure that the genetic maps provided are from the same reference build as the reference used to generate the msVCF input.
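The three-column layout described above can be checked before a run. The following is a minimal Python sketch, not part of DRAGEN: the function name is hypothetical, and treating a non-numeric first line as a header is an assumption, since the text does not say whether the map carries a header row.

```python
def read_genetic_map(lines):
    """Parse genetic map lines: position, chromosome, distance (cM), tab separated.

    Returns a list of (position, chromosome, centimorgans) tuples and raises
    ValueError on any line that does not have exactly three columns.
    """
    records = []
    seen_first_line = False
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-separated columns")
        position_text, chromosome, cm_text = fields
        is_first = not seen_first_line
        seen_first_line = True
        if is_first and not position_text.isdigit():
            continue  # assumed header row
        records.append((int(position_text), chromosome, float(cm_text)))
    return records


if __name__ == "__main__":
    # Toy values for illustration only.
    example = ["pos\tchr\tcM\n", "1000\tchr21\t0.0\n", "2500\tchr21\t0.15\n"]
    print(read_genetic_map(example))
```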
This configuration file is a required text file. It allows proper handling of haploid and diploid chromosomes and verification of concordance between the genetic maps, the msVCF input, and the sample type file. The current configuration supports binary gender (male or female) and ploidy 2 or 1. When a region has different ploidies in male and female samples, the region is considered a mixed ploidy region (e.g. for human, the non-PAR region on chromosome X, chrX_nonpar).
You can provide your own config file or use the one available for download from the DRAGEN Software Support Site page.
The config file is a text file with the headers:
##version
##ref_build indicating the reference build used for the study.
The config file is a .txt file and contains 4 tab-delimited columns. Each of them must be populated.
First column: filename
Specifies the genetic map basename, 1 name per line. Mixed ploidy chromosomes must be separated into par and non-par regions. Basenames must match genetic map basenames.
Second column: region
Specifies the start and end positions of the chromosome or sub-chromosome region with format <contig_name>:<start_position>-<end_position>
. For chromosomes without mixed ploidy regions, the start position is 1, end position is the length of the chromosome (1-based, inclusive). For chromosomes with mixed ploidy regions, for each region, the start and end positions are those of the region (1-based, inclusive).
Third column: mixed ploidy subject
Specifies 2 for diploid chromosomes and PAR regions, and 1 for non-PAR regions.
Fourth column: diploid subject
Specifies 2 for all chromosomes
Note: for a mixed ploidy chromosome, ensure the genetic map is separated into its PAR and non-PAR regions with no overlap. Example: for human data, the prefixes should be chrX_par1, chrX_nonpar, and chrX_par2.
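The column rules above can be expressed as a small validator. This is a minimal Python sketch, not part of DRAGEN: the function name is hypothetical, header lines are kept as raw text because their exact syntax is not given here, and the example rows use toy coordinates.

```python
import re

# <contig_name>:<start_position>-<end_position>
REGION_PATTERN = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


def parse_haplotyping_config(lines):
    """Return (headers, rows) from config file lines.

    headers: raw lines starting with "##".
    rows: dicts with filename, contig, start, end, mixed_ploidy, diploid_ploidy.
    """
    headers, rows = [], []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("##"):
            headers.append(line)
            continue
        fields = line.split("\t")
        if len(fields) != 4 or any(field == "" for field in fields):
            raise ValueError(f"line {line_no}: expected 4 populated tab-delimited columns")
        filename, region, mixed, diploid = fields
        match = REGION_PATTERN.match(region)
        if match is None:
            raise ValueError(f"line {line_no}: bad region '{region}'")
        start, end = int(match.group(2)), int(match.group(3))
        if start < 1 or end < start:
            raise ValueError(f"line {line_no}: region must be 1-based with start <= end")
        if mixed not in ("1", "2"):
            raise ValueError(f"line {line_no}: third column must be 1 or 2")
        if diploid != "2":
            raise ValueError(f"line {line_no}: fourth column must be 2")
        rows.append({
            "filename": filename,
            "contig": match.group(1),
            "start": start,
            "end": end,
            "mixed_ploidy": int(mixed),
            "diploid_ploidy": int(diploid),
        })
    return headers, rows


if __name__ == "__main__":
    # Toy rows for illustration only; coordinates are not real boundaries.
    example = ["##version", "##ref_build", "chr21\tchr21:1-1000\t2\t2",
               "chrX_nonpar\tchrX:201-800\t1\t2"]
    print(parse_haplotyping_config(example))
```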
The sample type file is a required file. The number of samples and name of samples in the input multisample VCF and sample type file should match.
The sample type file is a txt file with the following format
2 columns, tab- or space-delimited
First column: list of all sample names present in the input msVCF
Second column: 1 or 2. 1 for subject with mixed ploidy chromosomes, 2 for subject with all diploid chromosomes.
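Because the sample names must match the input multisample VCF, a quick consistency check can catch mismatches early. The following is a minimal Python sketch, not part of DRAGEN: the function names are hypothetical, and the list of msVCF sample names is assumed to have been read from the VCF header by other means.

```python
def read_sample_types(lines):
    """Parse sample type lines: <sample name> <1 or 2>, tab or space delimited."""
    sample_types = {}
    for line_no, raw in enumerate(lines, start=1):
        fields = raw.split()
        if not fields:
            continue
        if len(fields) != 2 or fields[1] not in ("1", "2"):
            raise ValueError(f"line {line_no}: expected '<sample> <1|2>'")
        name, code = fields
        if name in sample_types:
            raise ValueError(f"line {line_no}: duplicate sample '{name}'")
        sample_types[name] = int(code)
    return sample_types


def compare_samples(vcf_samples, sample_types):
    """Return (missing_from_type_file, missing_from_vcf) as sorted lists."""
    vcf_set = set(vcf_samples)
    type_set = set(sample_types)
    return sorted(vcf_set - type_set), sorted(type_set - vcf_set)


if __name__ == "__main__":
    # Made-up sample names for illustration only.
    types = read_sample_types(["S1\t1\n", "S2 2\n"])
    print(compare_samples(["S1", "S2", "S3"], types))
```

Both returned lists are empty when the two inputs agree.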
The Phase Common step (step 1) is run on a defined region, and outputs:
a single scaffold msVCF and related msVCF index with phased common variants for that region. The default name is dragen.ph_phase_common.vcf.gz.
a single formatted msVCF called <prefix>.preprocess.vcf.gz and related index. This formatted msVCF is generated in the output directory and must be used as input of the Phase Rare step (step 3).
The Ligate Common step (step 2) ligates the regions phased in step 1 and outputs a single scaffold msVCF and related msVCF index with phased common variants for a single chromosome. The default name is “dragen.ph_ligate_common.vcf.gz”.
The Phase Rare step (step 3) is run on a defined region on a chromosome with preprocessed unphased msVCF from step 1 and phased scaffold msVCF from step 2, and outputs:
a single phased msVCF and related msVCF index with phased common and rare variants for that region. The default name is “dragen.ph_rare_common.vcf.gz”.
a single 8-column VCF and related index listing all sites that have been phased for that region. The default name is “dragen.ph_rare_common.sites.vcf.gz”. This output is used at the Concat All step to generate a VCF file with all phased sites across the genome.
The Concat All processing is used to generate 2 types of output
Phased common and rare variants for a chromosome
The Concat All step (step 4) concatenates the regions phased in step 3 and outputs an msVCF and related index with phased common and rare variants for a single chromosome. The default name is “dragen.ph_concat_all.vcf.gz”.
List of phased sites
This output is useful as input for the ForceGT tool. The Concat All step lists all phased sites in 8-column VCF format and outputs a VCF and related index with the list of phased sites. This output can be generated either directly from the list of phased-site VCFs across the genome from step 3, or in a second pass once the per-chromosome site lists have been generated. The default name is “dragen.ph_concat_all.sites.vcf.gz”.
--enable-population-haplotyping
Yes
Set to true to enable population haplotyping tool.
--enable-phase-common
Yes
Set to true to enable the Phase Common step.
--ph-phase-common-input-list
Yes
Provides a .txt file listing the sample input pertaining to one chromosome, with path to a single msVCF or a list of msVCF, one line per path. Note: in the case of mixed ploidy chromosome each PAR and non-PAR regions must be treated as a single chromosome.
--ph-phase-common-input-region
Yes
Specifies the target region to be phased. String in the format contigname: startposition-endposition. Adjacent regions must overlap so that they can be joined by the downstream Ligate Common step. Example input region length for human data: 10 Mbp.
Note: for a chromosome with both mixed ploidy and diploid regions, run the command with one region at a time (e.g. three runs with the three regions chrX_par1, chrX_nonpar, and chrX_par2, instead of one run with region chrX).
--ph-phase-common-map
Yes
Provides path to the chromosome genetic map. Note: in the case of mixed ploidy chromosome, the genetic map name must be divided into PAR and non-PAR regions.
--ph-phase-common-config
Yes
Provides path to the txt config file.
--ph-phase-common-reference
No
Provides the path to a reference panel of haplotypes in msVCF format. Useful for iterative haplotyping to accelerate the process.
--ph-phase-common-scaffold
No
Provides the path to a scaffold of haplotypes in msVCF format. Useful for iterative haplotyping to accelerate the process.
--ph-phase-common-sample-type
Yes
Provides the path to the Sample type file.
--ph-phase-common-filter-maf
No
Default 0.001. Set the Minimum Allele Frequency threshold. All variants with allele frequency equal or above this MAF are phased during this Phase Common step.
--ph-phase-common-max-miss-gt-rate
No
Default 0.1. Set the threshold for variants to be skipped if the rate of missing GT is higher than this value.
--output-directory
Yes
Specifies the output directory.
--output-file-prefix
No
Outputs filename with the defined prefix for the file generated by the pipeline.
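Since Phase Common runs once per region and adjacent regions must overlap, the per-region runs are usually generated by a script. The following is a minimal Python sketch under stated assumptions: the function names, the overlap size, and all file paths are hypothetical; the region string follows the `<contig_name>:<start_position>-<end_position>` form used in the config file; the executable name `dragen` and the `--option=value` form are assumed. Only options listed in the table above are used.

```python
def overlapping_regions(contig, contig_length, chunk_size, overlap):
    """Tile a contig with 1-based inclusive regions that overlap their neighbors."""
    if contig_length < 1 or chunk_size < 1:
        raise ValueError("contig_length and chunk_size must be positive")
    if not 0 < overlap < chunk_size:
        raise ValueError("overlap must be positive and smaller than chunk_size")
    regions = []
    start = 1
    while True:
        end = min(start + chunk_size - 1, contig_length)
        regions.append(f"{contig}:{start}-{end}")
        if end == contig_length:
            break
        start = end - overlap + 1  # step back so consecutive regions share bases
    return regions


def phase_common_args(region, input_list, genetic_map, config, sample_type,
                      output_dir, prefix=None):
    """Assemble an argument list for one Phase Common run (paths are placeholders)."""
    args = [
        "dragen",  # assumed executable name
        "--enable-population-haplotyping=true",
        "--enable-phase-common=true",
        f"--ph-phase-common-input-list={input_list}",
        f"--ph-phase-common-input-region={region}",
        f"--ph-phase-common-map={genetic_map}",
        f"--ph-phase-common-config={config}",
        f"--ph-phase-common-sample-type={sample_type}",
        f"--output-directory={output_dir}",
    ]
    if prefix:
        args.append(f"--output-file-prefix={prefix}")
    return args


if __name__ == "__main__":
    # Toy contig length; 10 Mbp chunks with a hypothetical 2 Mbp overlap.
    for region in overlapping_regions("chr21", 25_000_000, 10_000_000, 2_000_000):
        print(" ".join(phase_common_args(region, "inputs.txt", "chr21.gmap",
                                         "config.txt", "sample_types.txt", "out")))
```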
--enable-population-haplotyping
Yes
Set to true to enable population haplotyping tool.
--enable-ligate-common
Yes
Set to true to enable the Ligate Common step.
--ph-ligate-common-input-list
Yes
Provide a .txt file with list of phased msVCF pertaining to a single chromosome. The msVCF files provided are the output files of Phase Common step, in ascending position order. Note: in the case of mixed ploidy chromosome each PAR and non-PAR regions must be treated as a single chromosome
--output-directory
Yes
Specifies the output directory.
--output-file-prefix
No
Outputs filename with the defined prefix for the file generated by the pipeline.
--enable-population-haplotyping
Yes
Set to true to enable population haplotyping tool.
--enable-phase-rare
Yes
Set to true to enable the Phase Rare step.
--ph-phase-rare-input
Yes
Provides the path to the preprocessed unphased msVCF generated from Phase Common step covering the phase rare region.
--ph-phase-rare-input-region
Yes
Specifies the target region to be phased. String in the format contigname: startposition-endposition. Regions must not overlap or have gaps between them.
Note: for a chromosome with both mixed ploidy and diploid regions, run the command with one region at a time (e.g. three runs with the three regions chrX_par1, chrX_nonpar, and chrX_par2, instead of one run with region chrX).
--ph-phase-rare-map
Yes
Provides the path to the genetic map of the chromosome. Note: in the case of mixed ploidy chromosome, the genetic map name must be divided into PAR and non-PAR regions.
--ph-phase-rare-config
Yes
Provides the path to the txt config file.
--ph-phase-rare-scaffold
Yes
Provides the path to the scaffold of haplotypes in msVCF format generated from Ligate Common step.
--ph-phase-rare-scaffold-region
Yes
Specifies the scaffold region to be phased. String in the format contigname: startposition-endposition. This scaffold region needs to cover the Input region and to allow buffer between regions. The buffer length impacts the accuracy and speed of the process: longer length is slower but improves accuracy.
--ph-phase-rare-sample-type
Yes
Provides the path to the Sample type file.
--ph-phase-rare-filter-maf
No
Default 0.001. Sets the maximum allele frequency threshold. All variants with allele frequency below this threshold are phased during the Phase Rare step. This value must be the same as the one provided with --ph-phase-common-filter-maf. If the values differ, not all variants will be phased.
--output-directory
Yes
Specifies the output directory.
--output-file-prefix
No
Outputs filename with the defined prefix for the file generated by the pipeline.
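For Phase Rare, the input regions must tile the chromosome without overlaps or gaps, while each scaffold region must cover its input region plus a buffer. The following is a minimal Python sketch of one way to derive both: the function name and the chunk and buffer sizes are hypothetical, and the region string follows the `<contig_name>:<start_position>-<end_position>` form used in the config file.

```python
def phase_rare_regions(contig, contig_length, chunk_size, buffer_size):
    """Return (input_region, scaffold_region) pairs for one contig.

    Input regions are contiguous and non-overlapping (1-based, inclusive).
    Each scaffold region extends its input region by buffer_size on both
    sides, clipped to the contig boundaries.
    """
    if contig_length < 1 or chunk_size < 1 or buffer_size < 0:
        raise ValueError("lengths must be positive and buffer non-negative")
    pairs = []
    for start in range(1, contig_length + 1, chunk_size):
        end = min(start + chunk_size - 1, contig_length)
        scaffold_start = max(1, start - buffer_size)
        scaffold_end = min(contig_length, end + buffer_size)
        pairs.append((f"{contig}:{start}-{end}",
                      f"{contig}:{scaffold_start}-{scaffold_end}"))
    return pairs


if __name__ == "__main__":
    # Toy sizes for illustration only.
    for input_region, scaffold_region in phase_rare_regions("chr21", 25, 10, 3):
        print(input_region, scaffold_region)
```

The first value of each pair would be passed to --ph-phase-rare-input-region and the second to --ph-phase-rare-scaffold-region. A larger buffer trades speed for accuracy, as noted in the option description.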
--enable-population-haplotyping
Yes
Set to true to enable population haplotyping tool.
--enable-concat-all
Yes
Set to true to enable the Concat All step.
--ph-concat-all-input-list
Yes when --ph-concat-all-input-list-sites-only is not provided
Provides a .txt file with list of phased msVCF pertaining to a single chromosome. The msVCF files provided are the output files of Phase Rare step, in ascending position order. Note: in the case of mixed ploidy chromosome each PAR and non-PAR regions must be treated as a single chromosome.
--ph-concat-all-input-list-sites-only
Yes when --ph-concat-all-input-list is not provided
Provides a .txt file with list of VCF containing all the haplotyped sites. The VCF files provided are the output files of Phase Rare step, in ascending position order, sex chromosomes at the end.
--output-directory
Yes
Specifies the output directory.
--output-file-prefix
No
Outputs filename with the defined prefix for the file generated by the pipeline.
An additional module of the Population Haplotyping tool checks for the quality of the haplotypes produced based on a phased truth set provided as input.
--enable-population-haplotyping
Yes
Set to true to enable population haplotyping tool.
--enable-phase-qc
Yes
Set to true to enable the quality control module.
--ph-phase-qc-validation
Yes
Provides the path to the phased truth set msVCF. Note: the validation msVCF must have the same samples as in the estimation msVCF for which the phasing accuracy is to be estimated.
--ph-phase-qc-estimation
Yes
Provides the path to the phased msVCF, output of Concat All to be validated.
--ph-phase-qc-input-region
Yes
Specifies the target region to be phased. String in the format contigname: startposition-endposition (startposition-endposition is optional). Regions must not overlap or have gaps between them.
--output-directory
Yes
Specifies the output directory.
--output-file-prefix
No
Outputs filename with the defined prefix for the file generated by the pipeline.
Fragmentomics is the study of fragmentation patterns of cell-free DNA or circulating tumor DNA (ctDNA). DNA molecules are released into the plasma from various tissues and cell types. Fragmentation features of cell-free DNA, such as fragment sizes and end motifs, carry the characteristics of their tissue of origin. Studies have shown that fragmentation features differ between ctDNA derived from cancer and noncancer cells. The genome-wide fragment profile of cell-free DNA has proven to be a powerful tool to infer cancer status and tissue of origin. The DRAGEN fragmentomics component computes the following three fragmentomics metrics.[1]
Fragment profile
End motif frequency
Window protection score (WPS)
The fragmentomics component takes aligned reads from the mapper, calculates per-read metrics, and tabulates them into per-bin or per-target-region metrics. DRAGEN first gets the chromosome sizes from the reference genome. Only autosomes and the X and Y chromosomes are considered for fragment profile calculation. The genome is binned with the bin size specified by the user. Each aligned read is processed sequentially. Only reads that satisfy the following criteria are considered: 1) mapped, 2) mate mapped, 3) not a PCR duplicate, 4) primary alignment, 5) mapping quality no less than the minimum MAPQ specified by the user. Reads with a template length within the short fragment size range are counted as short fragments. Reads with a template length within the long fragment size range are counted as long fragments. The fragment profile is calculated as the ratio of short to long fragment counts for each genomic bin. Genome-wide short fragment counts, long fragment counts, and their ratio are normalized against the GC bias of each genomic bin using the GC correction module from the DRAGEN CNV component.
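The per-bin counting described above can be illustrated in a few lines. This is a minimal Python sketch, not DRAGEN's implementation: the function name is hypothetical, reads are assumed to have already passed the five filters, binning by leftmost alignment position is an assumption, the size ranges in the example are placeholders, and GC correction is omitted.

```python
from collections import defaultdict


def fragment_profile(fragments, bin_size, short_range, long_range):
    """Compute short and long fragment counts and their ratio per genomic bin.

    fragments: iterable of (chromosome, leftmost_position, template_length).
    short_range, long_range: inclusive (min, max) template length ranges.
    Returns {(chromosome, bin_index): (short_count, long_count, ratio)};
    ratio is None when a bin has no long fragments.
    """
    counts = defaultdict(lambda: [0, 0])
    for chromosome, position, template_length in fragments:
        size = abs(template_length)  # TLEN is negative for the reverse mate
        key = (chromosome, position // bin_size)
        if short_range[0] <= size <= short_range[1]:
            counts[key][0] += 1
        elif long_range[0] <= size <= long_range[1]:
            counts[key][1] += 1
    profile = {}
    for key in sorted(counts):
        short_count, long_count = counts[key]
        ratio = short_count / long_count if long_count else None
        profile[key] = (short_count, long_count, ratio)
    return profile


if __name__ == "__main__":
    # Made-up fragments and placeholder size ranges for illustration only.
    toy = [("chr1", 100, 120), ("chr1", 200, 180), ("chr1", 300, -170),
           ("chr1", 1500, 130)]
    print(fragment_profile(toy, 1000, (100, 150), (151, 220)))
```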
End motif frequency calculation is enabled when --fragmentomics-end-motif-len is set to a positive integer. Unmapped, duplicate, and secondary alignments are excluded from end motif frequency calculation. The first x bases (x is specified by --fragmentomics-end-motif-len) at the 5' end of each read are tabulated into a frequency dictionary, with the sequences as keys and the frequencies as values. If the first x bases contain any 'N', the read is ignored. After all reads are processed, the frequency table is sorted by sequence in alphabetical order.
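The tabulation above amounts to counting prefixes. The following is a minimal Python sketch, not DRAGEN's implementation: the function name is hypothetical, the input sequences are assumed to be in original read orientation (so the first bases are the 5' end) and already filtered, and raw counts are reported because the exact normalization of the frequencies is not stated here.

```python
from collections import Counter


def end_motif_counts(read_sequences, motif_len):
    """Count the first motif_len bases of each read, sorted alphabetically.

    Reads shorter than motif_len, or whose motif contains 'N', are skipped.
    Returns a list of (motif, count) tuples.
    """
    if motif_len < 1:
        raise ValueError("motif_len must be a positive integer")
    counts = Counter()
    for sequence in read_sequences:
        motif = sequence[:motif_len].upper()
        if len(motif) < motif_len or "N" in motif:
            continue
        counts[motif] += 1
    return sorted(counts.items())


if __name__ == "__main__":
    # Made-up reads for illustration only.
    print(end_motif_counts(["ACGTAA", "ACGTCC", "NCGTAA", "TTGA", "AC"], 4))
```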
Window protection score (WPS) calculation is enabled when a target region file is provided with --fragmentomics-wps-target-file. This file must be a BED format text file with three columns. Each row in the file represents a 120-bp region for which WPS is calculated. An interval tree is constructed for the target regions. Each aligned read is then processed sequentially, and unmapped, duplicate, and secondary alignments are excluded. Any read whose 5' end falls in a target region increments the read count for that region by one. Forward and reverse reads are counted separately. If a read fully spans the region, the fully spanning read count for the region is incremented by one. After all reads are processed, WPS is calculated for each target region. Two WPS calculations are supported: 1) the number of fully spanning reads minus the number of reads whose 5' end falls in the region; 2) the percentage of reads ending in the region among all reads mapped to the region.
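The two WPS variants can be sketched for a single target region as follows. This is a minimal Python illustration, not DRAGEN's implementation: the function name is hypothetical, coordinates are assumed to be 0-based half-open as in BED, forward and reverse counts are pooled for brevity, and the denominator of the percentage is assumed to be the sum of spanning and ending reads, which the text does not define precisely.

```python
def window_protection_scores(reads, region_start, region_end):
    """Return (difference_wps, percent_wps) for one target region.

    reads: iterable of (start, end, is_reverse) alignments, 0-based half-open,
    already filtered for unmapped, duplicate, and secondary alignments.
    """
    ending = 0    # reads whose 5' end falls inside the region
    spanning = 0  # reads that fully span the region
    for start, end, is_reverse in reads:
        five_prime = end - 1 if is_reverse else start
        if region_start <= five_prime < region_end:
            ending += 1
        if start <= region_start and end >= region_end:
            spanning += 1
    difference_wps = spanning - ending
    total = spanning + ending  # assumed denominator
    percent_wps = 100.0 * ending / total if total else None
    return difference_wps, percent_wps


if __name__ == "__main__":
    # Made-up alignments around a 120-bp region for illustration only.
    toy_reads = [(50, 300, False), (40, 250, True), (150, 400, False),
                 (10, 180, True), (120, 200, False)]
    print(window_protection_scores(toy_reads, 100, 220))
```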
DRAGEN Fragmentomics currently supports Tumor-only and Normal-only sequencing data from TSO500/WES/WGS ctDNA assays. The results for Tumor-Normal pair data are undefined because ctDNA data are derived from a mixture of tumor and normal DNA. Therefore, users should avoid running Fragmentomics in Tumor-Normal mode.
Enable the Fragmentomics component:
The target regions file is used only in window protection score calculation. The target regions file is in BED format with three columns.
Users can provide a blocklist of regions, such as low-mappability regions, whose reads are removed from fragment profile calculation. This file is in BED format with three columns.
The component outputs the fragment profile file and, if enabled, the end motif frequency file and the WPS file.
The fragment profile file is in the following format:
The end motif frequency file is in the following format:
The WPS file is in the following format:
[1] Lo YMD, Han DSC, Jiang P, Chiu RWK. Epigenetics, fragmentomics, and topology of cell-free DNA in liquid biopsies. Science. 2021. DOI: 10.1126/science.aaw3616
DRAGEN Homologous Recombination Deficiency (HRD) Scoring takes in allele-specific copy number calls, either in VCF format or streamed directly from the somatic copy number callers. DRAGEN HRD then calculates scores for Loss of Heterozygosity (LOH), Telomeric Allelic Imbalance (TAI), and Large-Scale State Transition (LST). The three scores are output to the .hrdscore.csv file. DRAGEN HRD can only be used with results from WGS somatic CNV calling or ASCN WES somatic CNV calling.
Use the following command-line options to run HRD scoring. You can run HRD scoring together with somatic CNV calling, or afterward on existing somatic CNV calling results.
To run HRD scoring together with somatic CNV calling, use the following options. For more CNV parameters, please refer to CNV calling.
--enable-hrd=true
Set to true to enable HRD scoring to quantify genomic instability.
--enable-cnv=true
Set to true to enable CNV calling to run together with HRD scoring.
To run HRD scoring after somatic CNV calling, use the following options:
--enable-hrd=true
Set to true to enable HRD scoring to quantify genomic instability.
--hrd-input-ascn
Specify the allele-specific copy number file (*cnv.vcf.gz). The CNV VCF file should include REF calls for proper HRD segmentation. See the option --cnv-enable-ref-calls in the CNV section.
--hrd-input-tn
Specify the tumor normalized bin count file (*.tn.tsv.gz).
If the reference cannot be auto-detected, specify the centromere and blacklist files with the following options:
--hrd-input-centromere
Centromere locations per chromosome in tsv format
--hrd-input-blacklist
Blacklist bed file
The following metrics are included in the .hrdscore.csv output file. The following is an example output file.
Sample
16
17
28
61
The following example command runs the HRD end-to-end workflow with CNV calling. This is an example of somatic WGS T/N. See the Somatic CNV section for other use cases. HRD is supported for any CNV workflow that supports ASCN; simply add --enable-hrd=true to the CNV command line.
The following example command runs HRD standalone.
DRAGEN can process data from whole genome and hybrid-capture assays with unique molecular identifiers (UMI). UMIs are molecular tags added to DNA fragments before amplification to determine the original input DNA molecule of the amplified fragments. UMIs help reduce errors and biases introduced by DNA damage such as deamination before library prep, PCR error, or sequencing errors.
To use the UMI Pipeline, the input reads files must be from a paired-end run. Input can be pairs of FASTQ files or aligned/unaligned BAM input. DRAGEN supports the following UMI types:
Dual, nonrandom UMIs, such as TruSight Oncology (TSO) UMI Reagents or IDT xGen Prism.
Dual, random UMIs, such as Agilent SureSelect XT HS2 molecular barcodes (MBC) or IDT xGen Duplex Seq Adapters.
Single-ended, random UMIs, such as Agilent SureSelect XT HS molecular barcodes (MBC) or IDT xGen dual index UMI Adapters.
DRAGEN uses the UMI sequence to group the read pairs by their original input fragment and generates a consensus read pair for each such group, or family. The consensus reduces error rates to detect rare and low frequency somatic variants in DNA samples with high accuracy. DRAGEN generates a consensus as follows.
Aligns reads.
Groups reads into groups with matching UMI and pair alignments. These groups are referred to as families.
Generates a single consensus read pair for each read family.
These generated reads have higher quality scores than the input reads and reflect the increased confidence gained by combining multiple observations into each base call. The UMI workflow is compatible only with small variant calling and SV calling in DRAGEN.
Enter UMIs in one of the following formats:
Read name—The UMI sequence is located in the eighth colon-delimited field of the read name (QNAME). For example, NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA
BAM tag—The UMI is present as an RX tag in prealigned or aligned BAM file (standard SAM format).
FASTQ file—The UMI is located in a third FASTQ file using the same read order as the read pairs.
To create FASTQ, append the UMI to the read name, and then specify the appropriate OverrideCycles setting in the BCL conversion tool (see Illumina BCL Data Conversion). DRAGEN supports UMIs with two parts each with a maximum of 8 bp and separated by +, or a single UMI with a maximum of 15 bp.
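The read-name layout and length limits described above can be checked with a small helper. This is a minimal Python sketch, not part of DRAGEN: the function name is hypothetical, and stripping a leading "@" and anything after the first space is an assumption about how a FASTQ header line would be passed in.

```python
def umi_from_read_name(read_name):
    """Extract the UMI parts from the eighth colon-delimited field of a read name.

    Returns a list with one part (single UMI, at most 15 bp) or two parts
    (dual UMI, each at most 8 bp, separated by '+').
    """
    qname = read_name.strip().split(" ")[0].lstrip("@")
    fields = qname.split(":")
    if len(fields) < 8:
        raise ValueError("read name has fewer than eight colon-delimited fields")
    parts = fields[7].split("+")
    if any(part == "" for part in parts):
        raise ValueError("empty UMI part")
    if len(parts) == 1 and len(parts[0]) <= 15:
        return parts
    if len(parts) == 2 and all(len(part) <= 8 for part in parts):
        return parts
    raise ValueError("UMI does not match the supported layouts")


if __name__ == "__main__":
    print(umi_from_read_name(
        "NDX550136:7:H2MTNBDXX:1:13302:3141:10799:AAGGATG+TCGGAGA"))
```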
The UMI workflow must be executed using a set of reads that correspond to a unique set of RGSM/RGLB. DRAGEN supports multiple lanes if all lanes correspond to the same RGSM/RGLB set.
DRAGEN UMI does not support a tumor-normal analysis, because a tumor-normal run corresponds to two different RGSM. In a tumor-normal run, one sample name is used for tumor and one sample name is used for normal. DRAGEN UMI supports one sample in a run.
If using a BAM file or a list of FASTQ files as the input, the input might contain multiple samples. DRAGEN checks if only one sample is included in the run and if the sample uses only a single, unique RGLB library. DRAGEN also accepts a library that was spread across multiple lanes. If there is a single sample and single library, DRAGEN processes all included reads. If there are multiple samples or multiple libraries, DRAGEN aborts analysis with an error.
For dual, nonrandom UMIs, you can provide a predefined UMI correction table or a list of valid UMI sequences as input. To create the UMI correction table, use a tab-delimited file, include a header, and add the following fields.
If a customized correction table is not specified, DRAGEN uses the default table for TruSight Oncology (TSO) UMI Reagents, located at <INSTALL_PATH>/resources/umi/umi_correction_table.txt.gz. Alternatively, you can provide a whitelist file of nonrandom UMIs with one valid UMI sequence per line. DRAGEN then autogenerates a UMI correction table with a Hamming distance of one.
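To illustrate what a Hamming-distance-one correction table contains, the following minimal Python sketch derives one from a whitelist. It is not DRAGEN's generator and does not reproduce the correction table file format: the function name is hypothetical, only A/C/G/T substitutions are considered, and dropping sequences that are one mismatch away from more than one valid UMI is an assumption.

```python
def build_correction_table(whitelist):
    """Map each sequence within Hamming distance one of a valid UMI to that UMI.

    Valid UMIs map to themselves. Sequences that are one mismatch away from
    two or more different valid UMIs are ambiguous and are left out.
    """
    valid = set(whitelist)
    table = {umi: umi for umi in valid}
    ambiguous = set()
    for umi in sorted(valid):
        for index, base in enumerate(umi):
            for alternative in "ACGT":
                if alternative == base:
                    continue
                neighbor = umi[:index] + alternative + umi[index + 1:]
                if neighbor in valid:
                    continue  # never rewrite one valid UMI into another
                if neighbor in table and table[neighbor] != umi:
                    ambiguous.add(neighbor)
                else:
                    table[neighbor] = umi
    for neighbor in ambiguous:
        del table[neighbor]
    return table


if __name__ == "__main__":
    # Toy two-base whitelist for illustration only.
    for observed, corrected in sorted(build_correction_table(["AA", "CC"]).items()):
        print(observed, corrected)
```

This also shows why such tables work best when the valid UMIs are well separated: the closer two whitelist entries are, the more single-mismatch sequences become ambiguous.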
--umi-library-type
—Set the batch option for different UMIs correction. Three batch modes are available that optimize collapsing configurations for different UMI types. Use one of the following modes:
random-duplex
—Dual, random UMIs.
random-simplex
—Single-ended, random UMIs.
nonrandom-duplex
—Dual, nonrandom UMIs. To use this option, provide either --umi-nonrandom-whitelist
or --umi-correction-table
.
--umi-min-supporting-reads
—Specify the number of matching UMI inputs reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. For example, the following are the recommended settings for FFPE and ctDNA.
[FFPE] If the variant > 1%, use --umi-min-supporting-reads=1
with the --vc-enable-umi-solid
variant caller parameter. For more information on variant caller options, see Variant Caller Options.
[ctDNA] If the variant < 1%, use --umi-min-supporting-reads=2
with the --vc-enable-umi-liquid
variant caller parameter. For more information on variant caller options, see Variant Caller Options.
--umi-enable
—To enable read collapsing, set the --umi-enable option
to true. This option is not compatible with --enable-duplicate-marking
because the UMI pipeline generates a consensus read from a set of candidate input reads, rather than choosing the best nonduplicate read. If using the --umi-library-type
option, --umi-enable
is not required.
--umi-emit-multiplicity
—Set the consensus sequence type to output. DRAGEN UMI allows you to collapse duplex sequences from the two strands of the original molecules. Duplex sequence is typically ~20–60% of total library, depending on library kit, input material, and sequencing depth. Enter one of the following consensus sequence types:
both
—Output both simplex and duplex sequences. This option is the default.
simplex
—Output only simplex sequences.
duplex
—Output only duplex sequences.
--umi-source
—Specify the input type for the UMI sequence. The following are valid values: qname
, bamtag
, fastq
. If using --umi-source=fastq
, provide the UMI sequence from FASTQ file using --umi-fastq
.
--umi-correction-table
—Enter the path to a customized correction table. By default, DRAGEN uses lookup correction with a built-in table for the Illumina TruSight Oncology and Illumina for IDT UMI Index Anchor kits.
--umi-nonrandom-whitelist
—Enter the path for a customized, valid UMI sequence.
--umi-metrics-interval-file
—Enter the path for target region in BED format.
--umi-output-uncollapsed-bam
—Output uncollapsed (raw) reads map/aligning results to separate BAM with filename <output_prefix>.uncollapsed.bam.
DRAGEN processes UMIs by grouping reads by UMI and alignment position. If there are sequencing errors in the UMIs, DRAGEN can detect and correct small sequencing errors by using a lookup table or by using sequence similarity and read counts. Specify the type of correction with the --umi-library-type option, or with the --umi-correction-scheme option using the values lookup, random, or none.
For sparse sets of nonrandom UMIs, it is possible to create a lookup table that specifies which sequences can be corrected and how to correct them. This correction scheme works best on UMI sets whose sequences have a minimum Hamming/edit distance between them. By default, DRAGEN uses lookup correction with a built-in correction table for the Illumina TruSight Oncology and Illumina for IDT UMI Index Anchor kits. Specify the path to your correction file using the --umi-correction-table option. If you are using a different set of nonrandom UMIs, contact Illumina Technical Support for information on generating the corresponding correction file.
In the random UMI correction scheme, DRAGEN must infer which UMIs at a given position are likely to be errors relative to other UMIs observed at the same position. The error modes include small UMI errors, such as a single mismatch, and UMI jumping or hopping artifacts from library prep. DRAGEN accomplishes this as follows.
Groups reads by fragment alignment position.
Within a small fuzzy window at each position, groups the reads first by exact UMI sequence, which forms a family.
Estimates the UMI jumping or hopping probability from the insert size distribution and the number of distinct UMIs at certain positions.
Within a fuzzy window, calculates the pairwise likelihood ratio to assess whether two families with different UMI sequences and genomic positions are derived from the same original molecule.
Merges families with a likelihood lower than the threshold. The default threshold is 1.
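The first two steps above (grouping by alignment position, then by exact UMI within a fuzzy window) can be sketched in Python. This is only an illustration under stated assumptions, not DRAGEN's implementation: the read representation as `(start, end, umi)` tuples and the function name are hypothetical, the window of 3 mirrors the documented umi-fuzzy-window-size default, and the jumping-probability and likelihood-ratio steps are omitted because their models are not described here.

```python
from collections import defaultdict


def group_into_families(reads, fuzzy_window=3):
    """Group fragments into families by exact UMI within a positional fuzzy window.

    reads: iterable of (start, end, umi) tuples for aligned fragments (hypothetical layout).
    Returns a list of dicts, each with an anchor position, a UMI, and its member reads.
    """
    families = []
    by_umi = defaultdict(list)  # umi -> families already created for that UMI
    for start, end, umi in sorted(reads):
        for fam in by_umi[umi]:
            a_start, a_end = fam["anchor"]
            # Same UMI and both fragment ends within the window: same family.
            if abs(start - a_start) <= fuzzy_window and abs(end - a_end) <= fuzzy_window:
                fam["reads"].append((start, end, umi))
                break
        else:
            # No compatible family found, so this read starts a new one.
            fam = {"anchor": (start, end), "umi": umi, "reads": [(start, end, umi)]}
            by_umi[umi].append(fam)
            families.append(fam)
    return families
```

In this sketch, two reads with the same UMI whose start positions differ by one base end up in one family, while a read with a one-base UMI mismatch at the same position stays in its own family until a later correction step decides whether to merge it.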
Duplex UMI adapters simultaneously tag both strands of double-stranded DNA fragments. It is then possible to identify reads resulting from amplification of each strand of the original fragment.
DRAGEN considers two collapsed read pairs to be the sequences of the two strands of the same original DNA fragment if they have the same alignment position (within a fuzzy window), complementary orientations, and UMIs that are swapped between Read 1 and Read 2. If there is only a single-ended UMI, DRAGEN compares the start and end positions of families from the two strands and computes a pairwise likelihood to determine whether they likely originated from two distinct families or should be merged as a duplex sequence. By default, DRAGEN outputs both simplex and duplex consensus sequences. To change the consensus sequence output type, use --umi-emit-multiplicity
.
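The three duplex conditions for double-ended UMIs (same position within a fuzzy window, complementary orientations, swapped UMIs) can be written as a small predicate. This is a hedged sketch: the `Family` fields, the function name, and the window of 3 are assumptions made for illustration, and the likelihood-based handling of single-ended UMIs is not covered.

```python
from dataclasses import dataclass


@dataclass
class Family:
    """Hypothetical summary of one collapsed read pair."""
    start: int
    end: int
    read1_forward: bool  # True if Read 1 aligns to the forward strand
    umi_read1: str
    umi_read2: str


def is_duplex_pair(a, b, fuzzy_window=3):
    """Return True if two collapsed families look like the two strands of one molecule."""
    same_position = (abs(a.start - b.start) <= fuzzy_window
                     and abs(a.end - b.end) <= fuzzy_window)
    complementary = a.read1_forward != b.read1_forward
    umis_swapped = a.umi_read1 == b.umi_read2 and a.umi_read2 == b.umi_read1
    return same_position and complementary and umis_swapped
```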
The following is an example DRAGEN command for generating a consensus BAM file from input reads with Illumina UMIs:
To run with another random UMI library type, change --umi-library-type
to random-simplex
or random-duplex
.
If you enable BAM output, DRAGEN generates a <output_prefix>.bam that includes all UMI consensus reads. The QNAMEs for the reads are generated based on the following convention.
refID1—The reference ID of Read 1.
pos1—The genomic position of Read 1.
refID2—The reference ID of Read 2.
pos2—The genomic position of Read 2.
orientation—The orientation of Read 1 and Read 2. Orientation can be one of the following values. Position refers to the outermost aligned position of the read and is adjusted for soft clips.
1—Read 1 is forward and Read 2 is reverse. The starting position for Read 1 is less than or equal to the Read 2 end position.
2—Read 1 is reverse and Read 2 is forward. The starting position for Read 2 is less than or equal to the Read 1 end position.
3—Read 1 is forward and Read 2 is reverse. The starting position for Read 1 is greater than the Read 2 end position.
4—Read 1 is reverse and Read 2 is forward. The starting position for Read 2 is greater than the Read 1 end position.
5—Read 1 and Read 2 are forward.
6—Read 1 and Read 2 are reverse.
XV
and XW
tags are added to consensus reads to specify the number of supporting reads in a collapsed family. The XV
tag indicates the number of fragments, and the XW
tag indicates the number of duplex fragments.
DRAGEN outputs an <output_prefix>.umi_metrics.csv file that describes the statistics for UMI collapsing. This file summarizes statistics on input reads, how they were grouped into families, how UMIs were corrected, and how families generated consensus reads. The following metrics can be useful when tuning the pipeline for your application:
Discarded families---Any families having fewer than --umi-min-supporting-reads
input or having a different duplex/simplex status than specified by --umi-emit-multiplicity
are discarded. These reads are logged as Reads filtered out. The families are logged as Families discarded.
UMI correction---Families may be combined in various ways. The number of such corrections are reported as follows.
Families shifted---Families with fragment alignment coordinates up to the distance specified by the umi-fuzzy-window-size
parameter. The default umi-fuzzy-window-size parameter is 3.
Families contextually corrected---Families with exactly the same fragment alignment coordinates and compatible UMIs are merged.
Duplex families---Families with close alignment coordinates and complementary UMIs are merged.
When you specify a valid path for --umi-metrics-interval-file, DRAGEN outputs a separate set of on target UMI statistics that contains only families within the specified BED file.
If you need to analyze the extent to which the observed UMIs cover the full space of possible UMI sequences, the histogram of unique UMIs per fragment position metric may be helpful. It is a zero-based histogram, where the index indicates a count of unique UMIs at a particular fragment position and the value represents the number of positions with that count.
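To make the histogram definition concrete, the following Python sketch builds a zero-based histogram from a list of `(position, umi)` observations. The input layout and function name are assumptions for illustration; DRAGEN computes this metric internally and reports it in its metrics output.

```python
from collections import Counter


def unique_umi_histogram(observations):
    """Build a zero-based histogram of unique UMIs per fragment position.

    observations: iterable of (position, umi) pairs; position can be any hashable key.
    Returns a list where index i holds the number of positions with exactly i unique UMIs.
    """
    umis_at = {}
    for position, umi in observations:
        umis_at.setdefault(position, set()).add(umi)
    counts = Counter(len(umis) for umis in umis_at.values())
    size = max(counts, default=0) + 1
    return [counts.get(i, 0) for i in range(size)]
```

For example, if one position carries two distinct UMIs and another carries one, the result is `[0, 1, 1]`: no positions with zero UMIs, one position with one UMI, and one position with two.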
The following figures and table describe available UMI metrics.
Fig1) Read pairs with duplex UMI
Fig2) UMI error correction
Fig3) UMI collapsible regions
DRAGEN RNA variant calling uses the DRAGEN Somatic Small Variant Caller to call SNVs and indels. DRAGEN uses somatic variant calling to account for nongermline variant allele frequencies in RNA-seq data caused by differential expression. To perform variant calling, DRAGEN uses a probability model that weighs the evidence of a real variant against evidence for various noise models. If the quality score for a variant exceeds a certain threshold, then the variant is reported in the output VCF with the PASS label. DRAGEN also applies filters, such as weak_evidence and base_quality, that might indicate if the variant does not reach the thresholds required to qualify as a passing call. For more information on DRAGEN DNA somatic variant calling, see Somatic Mode.
DRAGEN RNA also supports forced genotyping (ForceGT). A ForceGT VCF that contains variants of interest can be provided to DRAGEN RNA VC, and the output VCF will contain all variants from the input with annotation. ForceGT might be unable to accurately call complex variants or variants with long deletions (> 50 bp). Complex variants are variants that require more than one substitution, insertion, or deletion event to transform the REF allele into the ALT allele.
DRAGEN RNA does not attempt to accurately genotype variants as heterozygous or homozygous (since it uses the DRAGEN somatic caller and somatic variants do not normally fall into those classes). Instead, a heuristic is applied based on the variant allele frequency: if the AF is at least 85%, then the GT field will be set to 1/1. Otherwise GT will always be reported as 0/1. This behavior and threshold can be adjusted with the following options:
You can use a FASTQ, BAM, or CRAM file as input. Optionally, you can provide a GTF annotation file for more accurate split junction mapping.
Use the following command line options for FASTQ input files.
Use the following command line options for a list of FASTQ input files.
Use the following command line options for a BAM input file.
To enable RNA variant calling, set --enable-rna
and --enable-variant-caller
to true. To enable ForceGT, use --vc-forcegt-vcf <forcegt_vcf_file>
.
RNA variant calling outputs a VCF file that includes PASS variants and variants that did not pass, due to filters or weak evidence. For more information on filters and additional command line options, see Somatic Mode.
(NOTE: gVCF mode is not supported with RNA variant calling.)
The following is an example RNA variant calling command line.
RNA quantification and/or fusion calling can be performed along with RNA variant calling by adding the appropriate option(s) in addition to --enable-rna=true
and --enable-variant-caller=true
.
Options:
RNA quantification: --enable-rna-quantification=true
RNA gene fusion calling: --enable-rna-gene-fusion=true
For more information and options related to RNA quantification and fusion calling, see those sections of the user guide.
DRAGEN includes an RNA-seq (splicing-aware) aligner, as well as RNA specific analysis components for gene expression quantification, gene fusion detection, splice variant calling, and small variant calling. All of these analysis components require the aligner to be enabled.
Most of the functionality and options described in Host Software Options and DNA Mapping also apply to RNA applications. Additional RNA-specific aspects are described in this section.
In addition to the standard input files (reads from FASTQ or BAM, reference genome, etc.), DRAGEN can also take a gene annotations file as input. A gene annotations file aids in the alignment of reads to known splice junctions and is required for gene expression quantification and gene fusion calling.
To specify a gene annotation file, use the -a
(--annotation-file
) command line option. The input file must conform to the GTF/GFF specification (http://uswest.ensembl.org/info/website/upload/gff.html). The file must contain features of type exon, and the record must contain attributes of type gene_id
and transcript_id
. An example of a valid GTF file is shown below.
Similarly, a GFF file can be used. Each exon feature must have as a Parent a transcript identifier that is used to group exons. An example of a valid GFF file is shown below.
NB. For proper handling of genes in the PAR regions of chromosome X and Y, it is required that the gene_id
attribute of all exons of the same gene is distinct between the two chromosomes, in order to distinguish exons within the PAR region of chromosome X from the ones within the PAR region of chromosome Y. That is, it is often the case that the gene_id
of all exons of a transcript from geneA
is equal to gene_id=geneA
in chromosome X, and gene_id=geneA_PAR_Y
in chromosome Y. This allows the GTF/GFF parser and downstream components to discriminate data associated with PAR genes in chromosome X from data associated with the same PAR genes in chromosome Y.
The DRAGEN host software parses the file for exons within the transcripts and produces splice junctions. The following output displays the number of splice junctions detected.
The splice junctions that are detected from the annotation file are also written to *.sjdb.annotations.out.tab
. Splice junctions below a minimum length are excluded, which helps filter annotation artifacts. This minimum annotation splice junction length is controlled by the --rna-ann-sj-min-len
option, which has a default value of 6
.
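As a rough illustration of how splice junctions follow from exon records, the sketch below derives the introns between consecutive exons of one transcript and drops those shorter than a minimum length. It is not the DRAGEN parser: the function name is hypothetical, exon coordinates are assumed to be 1-based inclusive as in GTF, and measuring junction length as the intron length in bases is an assumption.

```python
def annotated_splice_junctions(exons, min_len=6):
    """Derive splice junctions from the exons of a single transcript.

    exons: list of (start, end) tuples, 1-based inclusive.
    Returns (first_intron_base, last_intron_base) tuples for introns of at least min_len bases.
    """
    ordered = sorted(exons)
    junctions = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        first, last = prev_end + 1, next_start - 1
        # Very short gaps between exons are treated as annotation artifacts.
        if last - first + 1 >= min_len:
            junctions.append((first, last))
    return junctions
```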
Note that GFF3 is a different file format from GFF. GFF3 files are not officially supported due to inconsistent contig naming conventions between GENCODE and Ensembl.
For the same reference, GENCODE provides all the attributes necessary for DRAGEN to build a hierarchical structure:
Ensembl has a different notation:
Ensembl uses different notation for contigs (for GRCh38) than GENCODE. Ensembl contigs do not have the "chr" prefix. The contig identifiers in the annotation file must match the DRAGEN reference in use, and by most conventions GRCh38/hg38 contigs are prefixed with "chr".
If necessary, DRAGEN may support GFF3 files that are GENCODE-compatible with the following annotations present in the attributes of each exon record:
For gene: "gene_name" or "name" or "gene" or "gene_id"
For transcript: "transcript_id" or "Parent"
Due to the flexibility of the GFF3 file format, issues may arise as it continues to evolve.
Please be aware that depending on the characteristics of the input file (i.e. read depth and distribution) the second pass using the first pass SJ.out.tab
may take longer than the first pass.
NOTE: Components downstream of the aligner, such as gene expression quantification, gene fusion detection, and RNA variant calling, require a GTF file as the input annotations file and are NOT compatible with two-pass splice-junction alignment mode.
The DRAGEN RNA pipeline contains a gene expression quantification module that estimates the expression of each transcript and gene in an RNA-seq data set. The module first internally translates the genomic mapping of each read (read pair) to the corresponding transcript mappings. It then uses an Expectation-Maximization (EM) algorithm to infer the transcript expression values that best match all the observed reads. The EM algorithm can also model and correct for GC-bias in the reported quantification results.
To enable the quantification module, set the --enable-rna-quantification
option to true
in your current RNA-seq command-line scripts. Additionally, you must provide a gene annotation file (GTF/GFF) that contains the genomic position of all transcripts to quantify. You can specify the GTF/GFF file using the -a
or --annotation-file
option.
Transcript quantification results are reported in the <outputPrefix>.quant.sf
text file. The file lists results for each transcript. You can use the output file as input for differential gene expression using tools such as tximport and DESeq2.
The following is an example of the file contents:
The gene expression quantification module also outputs the files below. For information on the metrics included, see the section Quantification and RNA QC Metrics
.
<outputPrefix>.quant.genes.sf
—Contains quantification results at the gene level. The results are produced by summing together all transcripts with the same geneID in the annotation file (GTF). Length and EffectiveLength are the expression-weighted means of the individual transcripts in the gene.
<outputPrefix>.quant.metrics.csv
—Summary statistics relevant to RNA transcripts and quantification. See Quantification and RNA QC Metrics
.
<outputPrefix>.quant.transcript_fragment_lengths.txt
—Full fragment length distribution of reads mapped to transcripts, output as length-probability pairs from the minimum length through >999 bases. Summing the products of the two columns yields the average fragment length.
<outputPrefix>.quant.transcript_coverage.txt
—Measures coverage uniformity with a normalized average of 5' to 3' coverage pattern along transcripts in increments of 1%. A summation of the 100 coverage bins should yield 100%.
<outputPrefix>.SJ.saturation.txt
—Measures sequencing saturation of the library, including the number of unique splice junctions observed as a function of reads processed.
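The average fragment length mentioned for the fragment length distribution file is the sum of length times probability over all rows. The sketch below shows that arithmetic; the two-column whitespace-separated layout and the function names are assumptions, and rows whose fields are not plain numbers (for example, a final ">999" bin label) are simply skipped here rather than interpreted.

```python
def parse_length_probability_pairs(text):
    """Parse 'length probability' rows, skipping headers and non-numeric rows."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            pairs.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # e.g. a header row or a '>999' overflow bin
    return pairs


def mean_fragment_length(pairs):
    """Average fragment length: sum of length * probability over all rows."""
    return sum(length * prob for length, prob in pairs)
```

The same summation idea gives a quick sanity check for the transcript coverage file: the 100 coverage bins should add up to about 100%.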
The RNA Quantification module outputs metrics related to the gene expression results and more general RNA QC metrics that rely on the transcript-level analysis. A summary of the metrics is output to the <outputPrefix>.quant.metrics.csv
file.
Only unfiltered and properly paired reads (for paired-end sequencing) are counted in the above metrics. The seven fragment types that are listed (Forward transcript, Reverse transcript, Strand mismatched, Ambiguous strand, Intron, Intergenic, Unknown transcript) add up to 100% of the counted fragments, and the percentage of this total is provided next to each fragment metric count.
The DUX4 Rearrangement Caller identifies the events of potential structural rearrangements between DUX4 and other genes (including IGH). The primary support for the DUX4 Rearrangement Caller is for human reference hg38.
The DUX4 Rearrangement Caller has the following features:
Calls DUX4 rearrangement events from various genomic data formats, such as FASTQ, BAM, and CRAM.
Scans the whole genome and identifies potential DUX4 rearrangement events.
Runs in parallel with the host DRAGEN software with minimal overhead.
A sequencing dataset that is tumor-only, paired-end, and whole-genome.
A sequencing dataset with mean coverage between 25X and 120X.
A sequencing dataset with mean fragment length between 300 and 500 bp.
A sequencing dataset with mean read length between 100 and 151 bp.
A reference genome that is compatible with DRAGEN software. You can download prebuilt reference genomes from our website or build your own customized version with: dragen --build-hash-table true --output-directory <HASHTABLE_DIR> --ht-reference <REF_FASTA> [options]
The DRAGEN DUX4 caller has been validated with a cohort of samples that fall within the above defined parameters. If you have datasets that don't comply with the above parameters, you can bypass the requirements check by specifying --dux4-skip-santiy-check true
to obtain experimental results.
The basic syntax of the DRAGEN command line is:
dragen [global options] [pipeline options] [output options]
The global options are common to all pipelines and control the general behavior of DRAGEN, such as the input and output files/directories, the reference genome, and the license file.
The pipeline options are specific to each pipeline and control the parameters and features of the analysis, such as the variant callers, the filters and the annotations.
The output options control the format and content of the output files, such as the VCF, BAM, and the metrics files.
For the DUX4 caller, a simple example is:
In this example, DRAGEN takes sequencing data in FASTQ format (BAM, CRAM, and ORA are also accepted) and maps and aligns the reads to the reference genome. The mapped and sorted reads are then consumed by the DUX4 caller.
Alternatively, the DRAGEN DUX4 caller can start from BAM input and skip the map/align step (assuming the BAM file is sorted and duplicates are marked):
The DUX4 caller can also run in parallel with other variant callers:
The DUX4 VCF results are written to the directory specified by --output-dir, with the prefix specified by --output-file-prefix.
The DUX4 VCF contains positive calls that represent translocation events across gene pairs. Each event consists of a set of four VCF breakend records that describe the potential translocation. Each record contains PR:SR:SRPB tags that describe the number of fragments supporting the event, where PR is the number of spanning paired reads, SR is the number of spanning split reads, and SRPB is the number of supporting read pairs per billion reads processed. Two predefined sets of genomic target regions, "CoreDUX4" and "ExtendedDUX4", are used to optimize event detection. The "CoreDUX4" regions are a subset of the "ExtendedDUX4" regions.
An output VCF example will look like this:
The DRAGEN RNA pipeline uses the DRAGEN RNA-Seq spliced aligner. Mapping of short seed sequences from RNA-Seq reads is performed similarly to mapping DNA reads. In addition, splice junctions (the joining of noncontiguous exons in RNA transcripts) near the mapped seeds are detected and incorporated into the full read alignments.
The output files generated when running DRAGEN in RNA mode are similar to those generated in DNA mode. RNA mode also produces extra information related to spliced alignments. Details regarding the splice junctions are present both in the SAM alignment record and an additional file, the SJ.out.tab file.
The output BAM file meets the SAM specification and is compatible with downstream RNA-Seq analysis tools.
RNA-Seq BAM Tags
The following BAM tags are emitted alongside spliced alignments.
If a gene annotations file is used during the map/align stage, and the splice junction is detected as an annotated junction, then 20 is added to its motif value.
Cufflinks might require spliced alignments to emit the XS:A
strand tag. This tag is present in the SAM record if the alignment contains a splice junction. The possible values for XS:A
strand tag are as follows:
'.' (undefined), '+' (forward strand), '-' (reverse strand), or '*' (ambiguous).
By default, if the spliced alignment has an undefined strand or an ambiguous (conflicting) strand, then the alignment output is suppressed. These alignments can be output into the output alignment file by setting the --no-ambig-strand
option to 1.
Cufflinks also expects that the MAPQ for a uniquely mapped read is a single value. This value is specified by the --rna-mapq-unique
option. To force all uniquely mapped reads to have a MAPQ equal to this value, set --rna-mapq-unique
to a nonzero value.
Along with the alignments emitted in the SAM/BAM file, an additional SJ.out.tab file summarizes the high confidence splice junctions in a tab-delimited file. The columns for this file are as follows:
contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)
strand (0: undefined, 1: +, 2: -)
intron motif: 0: noncanonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT
0: unannotated, 1: annotated, only if an input gene annotations file was used
number of uniquely mapping reads spanning the splice junction
number of multimapping reads spanning the splice junction
maximum spliced alignment overhang
The maximum spliced alignment overhang (column 9) field in the SJ.out.tab file is the anchoring alignment overhang. For example, if a read is spliced as ACGTACGT------------ACGT
, then the overhang is 4. For the same splice junction, across all reads that span this junction, the maximum overhang is reported. The maximum overhang is a confidence indicator that the splice junction is correct based on anchoring alignments.
There are two SJ.out.tab files generated by the DRAGEN host software, an unfiltered version and a filtered version. The records in the unfiltered file are a consolidation of all spliced alignment records from the output SAM/BAM. However, the filtered version has a much higher confidence for being correct due to the use of the following filters.
A splice junction entry in the SJ.out.tab file is filtered out if any of these conditions are met:
SJ is a noncanonical motif and is only supported by < 3 unique mappings.
SJ of length > 50000 and is only supported by < 2 unique mappings.
SJ of length > 100000 and is only supported by < 3 unique mappings.
SJ of length > 200000 and is only supported by < 4 unique mappings.
SJ is a noncanonical motif and the maximum spliced alignment overhang is < 30.
SJ is a canonical motif and the maximum spliced alignment overhang is < 12.
The filtered SJ.out.tab is recommended for use with any downstream analysis or post processing tools. Alternatively, you can use the unfiltered SJ.out.tab and apply your own filters (for example, with basic awk commands).
Note that the filter does not apply to the alignments present in the BAM or SAM file.
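If you start from the unfiltered SJ.out.tab and want to reproduce or adapt the listed conditions yourself, a Python sketch along the following lines may help. It follows the nine-column layout and the six conditions described above, but it is not the DRAGEN implementation: the function names are hypothetical, and computing junction length as `last - first + 1` is an assumption.

```python
def sj_is_filtered(first, last, motif, unique_reads, max_overhang):
    """Return True if a splice junction meets any of the documented filter conditions."""
    length = last - first + 1  # assumed definition of junction length
    noncanonical = motif == 0
    return (
        (noncanonical and unique_reads < 3)
        or (length > 50000 and unique_reads < 2)
        or (length > 100000 and unique_reads < 3)
        or (length > 200000 and unique_reads < 4)
        or (noncanonical and max_overhang < 30)
        or (not noncanonical and max_overhang < 12)
    )


def filter_sj_lines(lines):
    """Keep only the tab-delimited SJ.out.tab records that pass every filter."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed or empty lines
        first, last = int(fields[1]), int(fields[2])
        motif, unique_reads, overhang = int(fields[4]), int(fields[6]), int(fields[8])
        if not sj_is_filtered(first, last, motif, unique_reads, overhang):
            kept.append(line)
    return kept
```

Keeping the predicate separate from the file parsing makes it straightforward to tighten or relax individual thresholds for a particular study.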
If there are chimeric alignments present in the sample, then a supplementary Chimeric.out.junction file is also output. This file contains information about split-reads that can be used to perform downstream gene fusion detection. Each line contains one chimerically aligned read. The columns of the file are as follows:
Chromosome of the donor.
First base of the intron of the donor (1-based).
Strand of the donor.
Chromosome of the acceptor.
First base of the intron of the acceptor (1-based).
Strand of the acceptor.
N/A---not used, but is present to be compatible with other tools. It will always be 1
.
N/A---not used, but is present to be compatible with other tools. It will always be *
.
N/A---not used, but is present to be compatible with other tools. It will always be *
.
Read name.
First base of the first segment, on the + strand.
CIGAR of the first segment.
First base of the second segment.
CIGAR of the second segment.
CIGARs in this file follow the standard CIGAR operations as found in the SAM specification, with the addition of a gap length L that is encoded with the operation p. For paired end reads, the sequence of the second mate is always reverse complemented before determining strandedness.
The following is an example entry that shows two chimerically aligned read pairs, in which one of the mates is split, mapping segments of chr19 to chr12. Also shown are the corresponding SAM records associated with these entries.
Filtered rRNA reads---Total number of ribosomal RNA reads that are filtered out with the --rrna-filter-enable
option.
Mitochondrial reads excluded---Total number of reads detected to be in ChrM if the --rna-mapping-metrics-exclude-chrm
option is enabled.
Mapped reads adjusted for filtered mapping---Adjusted count of mapped reads by adding in the filtered rRNA reads.
Mapped reads adjusted for excluded mapping---Adjusted count of mapped reads by adding in the excluded mitochondrial reads.
Mapped reads adjusted for filtered and excluded mapping---Adjusted count of mapped reads by adding in both the filtered rRNA and excluded mitochondrial reads.
Unmapped reads adjusted for filtered mapping---Adjusted count of unmapped reads by subtracting out the filtered rRNA reads.
Unmapped reads adjusted for excluded mapping---Adjusted count of unmapped reads by subtracting out the excluded mitochondrial reads.
Unmapped reads adjusted for filtered and excluded mapping---Adjusted count of unmapped reads by subtracting out both the filtered rRNA and the excluded mitochondrial reads.
Reads with splice junction---Total number of reads that included a spliced alignment that crosses an intron.
The aligner stage of the RNA spliced aligner uses Smith-Waterman Alignment Scoring options and Splicing Scoring Options.
Refer to Smith-Waterman Alignment Scoring Settings for more details about the alignment algorithm used within DRAGEN. The following scoring options are specific to the processing of canonical and noncanonical motifs within introns.
--Aligner.intron-motif12-pen
The --Aligner.intron-motif12-pen
option controls the penalty for canonical motifs 1/2 (GT/AG, CT/AC). The default value calculated by the host software is 1 * (match-score + mismatch-pen)
.
--Aligner.intron-motif34-pen
The --Aligner.intron-motif34-pen
option controls the penalty for canonical motifs 3/4 (GC/AG, CT/GC). The default value calculated by the host software is 3 * (match-score + mismatch-pen)
.
--Aligner.intron-motif56-pen
The --Aligner.intron-motif56-pen
option controls the penalty for canonical motifs 5/6 (AT/AC, GT/AT). The default value calculated by the host software is 4 * (match-score + mismatch-pen)
.
--Aligner.intron-motif0-pen
The --Aligner.intron-motif0-pen
option controls the penalty for noncanonical motifs. The default value calculated by the host software is 6 * (match-score + mismatch-pen)
.
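The four default penalties above share the same base term, match-score + mismatch-pen, scaled by 1, 3, 4, and 6. The following Python sketch only reproduces that arithmetic; the function name is hypothetical, and the match score of 1 and mismatch penalty of 4 used in the usage note are example inputs, not documented defaults.

```python
# Multipliers taken from the option descriptions above.
MOTIF_MULTIPLIERS = {
    "intron-motif12-pen": 1,  # GT/AG, CT/AC
    "intron-motif34-pen": 3,  # GC/AG, CT/GC
    "intron-motif56-pen": 4,  # AT/AC, GT/AT
    "intron-motif0-pen": 6,   # noncanonical motifs
}


def default_intron_penalties(match_score, mismatch_pen):
    """Compute the default intron motif penalties from the Smith-Waterman scores."""
    base = match_score + mismatch_pen
    return {name: multiplier * base for name, multiplier in MOTIF_MULTIPLIERS.items()}
```

With the example inputs `match_score=1` and `mismatch_pen=4`, the base term is 5, so the penalties come out as 5, 15, 20, and 30, which shows how much more strongly a noncanonical motif is penalized than a GT/AG junction.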
--Mapper.min-intron-bases
For RNA-Seq mapping, a reference alignment gap can be interpreted as a deletion or an intron. In the absence of an annotated splice junction, the min-intron-bases option is a threshold gap length separating this distinction. Reference gaps at least this long are interpreted and scored as introns, and shorter reference gaps are interpreted and scored as deletions. However, alignments can be returned with annotated splice junctions shorter than this threshold.
--Mapper.max-intron-bases
The max-intron-bases option controls the largest possible intron that is reported, which is useful for preventing false splice junctions that would otherwise be reported. Set this option to a value that is suitable to the species you are mapping against.
--Mapper.ann-sj-max-indel
For RNA-seq, seed mapping can discover a reference gap in the position of an annotated intron, but with slightly different length. If the length difference does not exceed this option, the mapper investigates the possibility that the intron is present exactly as annotated, but an indel on one side or the other near the splice junction explains the length difference. Indels longer than this option and very near annotated splice junctions are not likely to be detected. Higher values may increase mapping time and false detections.
DRAGEN RNA can detect duplicate reads, which are defined as fragments that have both ends mapping to the same (clipping-adjusted) position during alignment. In RNA-Seq data, duplicate reads can be PCR duplicates introduced during library prep or can result from deep coverage of highly expressed regions.
If --enable-duplicate-marking
is set to true, duplicate fragments are marked in the BAM file and the total number of duplicate reads is reported as a mapping metric. Marking of duplicates does not affect gene expression quantification and gene fusion calling.
DRAGEN RNA also supports internal downsampling, a process by which a random subsample of reads is selected from the dataset after trimming and alignment for downstream analysis. In RNA-Seq, this can be useful in two ways: it can speed up analysis of samples with excessively high coverage, and it can allow for more accurate cross-comparisons between different samples.
If --enable-down-sampler
is set to true and a value specified for --down-sampler-reads
, DRAGEN will use only that many RNA fragments (including both Read 1 and Read 2) for quantification, fusion, variant calling and output to BAM. Note that the entire input dataset is still used for the generation of trimming, FastQC, and mapping metrics.
Ribosomal RNA (rRNA) sequences can contribute a large fraction of reads in some RNA-Seq datasets, depending on the sample type and library prep method. You can use the DRAGEN RNA pipeline to filter rRNA reads during alignment, because the reads are not relevant for downstream analysis. By filtering rRNA, you can reduce run time and file size and avoid deep read alignment pile ups at rRNA repeat loci on the genome to make downstream analysis of RNA BAM files easier.
rRNA filtering relies on a decoy contig with the rRNA sequence included in the reference hash table. Any read that maps to the decoy contig, including multimappers, is tagged with rRNA and is not mapped in the output.
NOTE: The rrna filter option only accepts a single contig by default. In the event multiple contigs need to be provided, they can be concatenated using a 1kb N mask between them, and added to the reference FASTA while creating the hash table.
NOTE: rRNA filtering is not supported with chm13
-based references and it will be automatically disabled.
The following are the required command-line options for rRNA filtering.
--rrna-filter-enable=true
--Enables rRNA filtering. Set to true
to enable rRNA filtering. The default value is false
.
--rrna-filter-contig
--Specify the name of the rRNA sequences to use for filtering. If you do not specify a value, the default gl000220
is provided for human genome alignments by using the reference autodetect feature. gl000220
is an unplaced contig included in the hg19 and hg38 genomes, and it includes a full copy of the rRNA repeat. For other genomes, you must include an rRNA decoy contig when creating a hash table.
All filtered rRNA reads are left unaligned in the BAM files and tagged ZS:Z:FLT
. The number and percentage of filtered rRNA reads are reported as the mapper metric Adjustment of reads matching filter contigs
.
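Because filtered rRNA reads carry the ZS:Z:FLT tag, you can cross-check the reported metric against the alignments themselves. The sketch below counts tagged records in SAM-formatted text lines (for example, lines produced by converting the BAM to SAM with a tool of your choice). The function name is hypothetical, and the count is per SAM record, which may differ from how DRAGEN counts reads for its own metric.

```python
def count_rrna_filtered(sam_lines):
    """Count SAM records tagged ZS:Z:FLT.

    sam_lines: iterable of SAM text lines, header lines included or not.
    Returns (filtered_records, total_records).
    """
    total = 0
    filtered = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        total += 1
        # Optional tags start after the 11 mandatory SAM columns.
        if "ZS:Z:FLT" in fields[11:]:
            filtered += 1
    return filtered, total
```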
The --mapq-strict-sjs
option is specific to RNA, and applies where at least one exon segment is aligned confidently, but there is ambiguity regarding possible splice junctions. When this option is set to 0, a higher MAPQ value is returned, expressing confidence that the alignment is at least partially correct. When this option is set to 1, a lower MAPQ value is returned, reflecting the splice junction ambiguity.
Some downstream tools, such as Cufflinks, expect the MAPQ value to be a unique value for all uniquely mapped reads. This value is specified with the --rna-mapq-unique
option. Setting this option to a nonzero value overrides all MAPQ estimates based on alignment score. Instead, all uniquely mapped reads have a MAPQ set to the value of --rna-mapq-unique
. All multimapped reads have a MAPQ value of int(-10*log10(1 ‑ 1/NH)
, where the NH value is the number of hits (primary and secondary alignments) for that read.
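The multimapper formula is easy to evaluate directly. The following Python sketch computes it for a given NH and applies the fixed value to uniquely mapped reads. The function names are hypothetical, and the value passed for the unique MAPQ in any call is whatever you set with --rna-mapq-unique, not a default asserted here.

```python
import math


def multimapper_mapq(nh):
    """MAPQ for a multimapped read: int(-10 * log10(1 - 1/NH)), defined for NH >= 2."""
    if nh < 2:
        raise ValueError("NH must be at least 2 for a multimapped read")
    return int(-10 * math.log10(1 - 1 / nh))


def assigned_mapq(nh, rna_mapq_unique):
    """Fixed MAPQ for uniquely mapped reads, formula-based MAPQ for multimappers."""
    if nh == 1:
        return rna_mapq_unique
    return multimapper_mapq(nh)
```

Evaluating the formula gives a MAPQ of 3 for NH = 2, 1 for NH = 3 or 4, and 0 for NH of 5 or more, so multimapped reads are always clearly separated from a nonzero unique value.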
The DRAGEN Gene Fusion module uses the DRAGEN RNA splice-aware aligner to detect gene fusion events. The supplementary (chimeric) alignments are used to find potential breakpoints and read evidence is accumulated for the resulting fusion event candidates. Then, an ML model is applied to score the putative fusion events to filter potential false positives. The ML scoring model is currently available on human samples and does not support non-human reference genomes.
You can run the DRAGEN Gene Fusion module together with a regular RNA-Seq map/align job. To enable the DRAGEN Gene Fusion module, set --enable-rna-gene-fusion
to true in your current RNA-Seq command-line scripts. The DRAGEN Gene Fusion module requires a gene annotations file in GTF or GFF format.
The following is an example command line for running an end-to-end RNA-Seq experiment.
At the end of a run, a summary of detected gene fusion events is output, which is similar to the following example.
The <OUTPUT_PREFIX>.fusion_candidates.features.csv
file lists the detected gene fusion events. The output CSV file includes the following columns.
#FusionGene
: Parent gene names (in 5' to 3' order of transcript) participating in the fusion, hereafter referred to as Gene 1 and Gene 2. If a fusion breakpoint overlaps multiple genes, the genes are listed by default as separate candidates (rows). To show them as a semi-colon separated gene list on the same row, the option --rna-gf-merge-calls
can be set to true
as described in the Gene Fusion Options and Filters section.
Score
: Fusion call confidence score predicted by the ML model. If the ML model is used, the score ranges from 0 (low confidence) to 1 (high confidence). Currently the ML model only supports human references. If no ML model is available, the number of supporting reads is reported as the score.
LeftBreakpoint
: Gene 1 breakpoint formatted as <Chromosome>:<Position>:<Strand>
.
RightBreakpoint
: Gene 2 breakpoint formatted as <Chromosome>:<Position>:<Strand>
.
Filter
: Semicolon separated list of filter flags. The LOW_SCORE
filter is used to filter low confidence fusion candidates. If --rna-gf-enable-post-filtering=true
, other confidence filters will also be applied. Informative filters, on the other hand, do not fail the fusion. In the absence of the ML model scoring (i.e. a non-human reference is used), a more aggressive post-filtering will take place and all confidence and informative filters will be applied.
The following are the available filters.
Note that the specific features and column values are subject to change in future DRAGEN versions as more RNA data is analyzed.
#SplitScore
: Combined count of fusion supporting read pairs reported as split reads and soft-clipped reads
#NumSplitReads
: Number of fusion supporting read pairs with at least one split read alignment.
#NumSoftClippedReads
: Number of fusion supporting read pairs with no split read alignment, but at least one soft clipped alignment. Included in SplitScore
and includes soft-clipped reads for both Gene1 and Gene2
#NumSoftClippedReadsGene1
: Number of fusion supporting read pairs with no split read alignment, but at least one soft clipped alignment to Gene 1
#NumSoftClippedReadsGene2
: See above (NumSoftClippedReadsGene1
) for Gene 2
#NumPairedReads
: Number of fusion supporting read pairs such that one of the reads maps to Gene1 and the other maps to Gene2, without any breakpoint overlap
#NumRefSplitReadsGene1
: Number of read pairs that map fully within Gene 1 such that at least one of the reads aligns across the breakpoint. These reads support the reference transcript and do not support the fusion.
#NumRefSplitReadsGene2
: See above (NumRefSplitReadsGene1
) for Gene 2
#NumRefPairedReadsGene1
: Number of read pairs such that one of the reads maps on the left side of the Gene1 breakpoint and the other maps on the right side of the Gene1 breakpoint, without overlapping the break. These reads support the reference transcript and do not support the fusion.
#NumRefPairedReadsGene2
: See above (NumRefPairedReadsGene1
) for Gene 2
#AltToRef
: Ratio of (fusion split + soft clipped reads) / max(NumRefSplitReadsGene1, NumRefSplitReadsGene2); used for the LOW_ALT_TO_REF
filter
#UniqueAlignmentsGene1
: Unique (start-end) positions of fusion supporting read alignments to Gene 1 (after dedup); used for the LOW_UNIQUE_ALIGNMENTS
filter
#UniqueAlignmentsGene2
: Unique (start-end) positions of fusion supporting read alignments to Gene 2 (after dedup); used for the LOW_UNIQUE_ALIGNMENTS
filter
#MaxMapqGene1
: Maximum MAPQ for fusion supporting reads in Gene 1
#AvgMapqGene1
: Average MAPQ for fusion supporting reads in Gene 1
#MaxMapqGene2
: Maximum MAPQ for fusion supporting reads in Gene 2
#AvgMapqGene2
: Average MAPQ for fusion supporting reads in Gene 2
#CoverageBasesGene1
: Bases in Gene 1 with read coverage within a certain distance (default 1000 bp) of the breakpoint in the direction of the breakpoint strand which is part of the fusion transcript
#CoverageBasesGene2
: See above (CoverageBasesGene1
) for Gene 2
#DeltaExonBoundaryGene1
: Distance from the Gene 1 breakpoint for the closest fusion supporting alignment (higher distance to boundary lowers score)
#DeltaExonBoundaryGene2
: See above (DeltaExonBoundaryGene1
) for Gene 2
#IsRestrictedGene1
: Indicator variable of whether the Gene 1 is tagged as protein coding in the annotation file
#IsRestrictedGene2
: Indicator variable of whether the Gene 2 is tagged as protein coding in the annotation file
#IsEnrichedGene1
: If enrichment or amplicon assay, then indicates whether Gene 1 is enriched. If whole transcriptome sequencing, then set to 1
#IsEnrichedGene2
: See above (IsEnrichedGene1
) for Gene 2
#CisDistance
: Distance between breakpoints if they are adjacent to each other and on the same strand. Large value (100M) if not a CIS break; used for the READ_THROUGH
filter.
#BreakpointDistance
: Distance between breakpoints if they are adjacent. Large value (100M) if not within same chromosome
#GenePairHomologyEval
: E-value of pairwise BLAST alignment of the parent genes
#AnchorLength1
: Longest alignment of a fusion supporting read to Gene 1
#AnchorLength2
: Longest alignment of a fusion supporting read to Gene 2
#FusionLengthGene1
: Distance from breakpoint to the end of Gene 1
#FusionLengthGene2
: Distance from breakpoint to the end of Gene 2
#NonFusionLengthGene1
: Breakpoint distance to the end of transcript not part of the fusion for Gene 1
#NonFusionLengthGene2
: Breakpoint distance to the end of transcript not part of the fusion for Gene 2
#Gene1Id
: Gene ID reported in the annotation file for Gene 1
#Gene2Id
: Gene ID reported in the annotation file for Gene 2
#Gene1Location
:
IntactExon: Breakpoint matches exon boundary,
BrokenExon: Breakpoint is within an exon but does not match the exon boundary,
Intron: Breakpoint is within an intron,
Intergenic: Breakpoint does not overlap any gene
#Gene2Location
: See above (Gene1Location
) for Gene 2
#Gene1Sense
: True
if the Gene 1 5' to 3' direction matches the breakpoint order, indicating that the gene is the upstream gene in the fusion transcript
#Gene2Sense
: See above (Gene1Sense
) for Gene 2
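Two of the columns above lend themselves to a short worked example: the `<Chromosome>:<Position>:<Strand>` breakpoint format and the AltToRef ratio. The following Python sketch is illustrative only. The names `Breakpoint`, `parse_breakpoint`, and `alt_to_ref` are hypothetical, and returning infinity when no reference-supporting reads exist is an assumption, since the text does not say how DRAGEN handles a zero denominator.

```python
from typing import NamedTuple


class Breakpoint(NamedTuple):
    chromosome: str
    position: int
    strand: str


def parse_breakpoint(text: str) -> Breakpoint:
    """Parse a LeftBreakpoint/RightBreakpoint value."""
    # Split from the right so contig names that contain ':' stay intact.
    chromosome, position, strand = text.rsplit(":", 2)
    return Breakpoint(chromosome, int(position), strand)


def alt_to_ref(split_score: int, ref_split_gene1: int, ref_split_gene2: int) -> float:
    """(fusion split + soft-clipped reads) / max(reference split reads)."""
    denominator = max(ref_split_gene1, ref_split_gene2)
    if denominator == 0:
        return float("inf")  # assumption: no reference support at all
    return split_score / denominator


print(parse_breakpoint("chr2:42522656:+"))
print(alt_to_ref(split_score=5, ref_split_gene1=100, ref_split_gene2=400))  # 0.0125
```

Here `split_score` stands for the SplitScore column, which the list above defines as the combined count of split and soft-clipped supporting read pairs.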
In addition, if --rna-gf-merge-calls
is enabled, DRAGEN merges the fusion candidates that overlap the same breakpoint into a single row reporting the feature values for the highest scoring passing candidate (or highest scoring failing candidate if no passing candidate is reported). For each breakpoint, in the column #FusionGene
, it reports a semi-colon separated list of names of all overlapping genes with a passing candidate. The following two columns are added to the features.csv
output file:
#AdditionalGenes1
: If a mix of passing and failing candidates are reported for the same breakpoint of Gene 1, genes with only failing candidates are listed. If no passing candidate exists, then all overlapping genes are reported in the #FusionGene
column.
#AdditionalGenes2
: See above (AdditionalGenes1
) for Gene 2
The <OUTPUT_PREFIX>.fusion_candidates.final
output file lists each passing fusion along with the read names that support the fusion, including split reads, soft-clipped reads, and paired (discordant) reads, and the passing scores. These reads can be extracted from the output BAM file and then used to visualize the fusions (e.g., in IGV). The same information for the non-passing fusions is provided in the <OUTPUT_PREFIX>.filter_info
output file.
The <OUTPUT_PREFIX>.fusion_candidates.vcf.gz
output file provides the VCF representation for all of the breakpoints for the candidate fusions using structural variant-style BND notation. The VCF header is annotated with ##source=DRAGEN_RNA_GF
to indicate the file is generated by the DRAGEN RNA Gene Fusion pipeline. All fusion candidates (passing and failing) are represented in the VCF output with one entry for each side of the fusion breakpoint (Gene 1 and Gene 2).
The <OUTPUT_PREFIX>.fusion_metrics.csv
output file provides a simple count of the total number of fusion candidates, those passing the scoring filter, and the number of unique left-right gene combinations that are found.
The following thresholds and options may be used to configure the fusion caller:
--rna-gf-enriched-genes
For RNA enrichment assays, a list of targeted genes specified as one gene-name per line. Only fusion calls involving at least one gene on the list are reported. The enriched genes list should only contain genes listed in the input annotation file. This option cannot be provided together with --rna-gf-enriched-regions
. If RNA amplicon mode is enabled and the amplicon BED file already includes the gene name, then you do not need to set this option; DRAGEN reads the enriched gene names from the amplicon BED file (fifth column). See DRAGEN Amplicon Pipeline for further information.
--rna-gf-enriched-regions
Alternative to --rna-gf-enriched-genes
, but input is provided as a BED file with region coordinates instead of a gene list. All the genes in the provided annotation file that overlap these regions are included. Genes that are extracted in this way are summarized in the *.fusion.enriched_genes.txt
file. This option cannot be provided together with --rna-gf-enriched-genes
.
--rna-repeat-genes
Text file that contains the names or IDs (from the annotation file) of targeted repetitive genes for sensitive fusion detection. Mutually exclusive with --rna-repeat-intervals
. This option overrides the default BED file. The repeat genes list should only contain genes listed in the input annotation file.
--rna-repeat-intervals
BED file that contains a target list of repeat intervals for sensitive fusion detection. Mutually exclusive with --rna-repeat-genes
. This option overrides the default files, which contain the genes CIC, DUX4, PSPH, and SEPTIN14 for GRCh38 and hg19 reference genomes.
--enable-variant-annotation=true
, --variant-annotation-assembly
, and --variant-annotation-data
Enable Illumina Annotation Engine (IAE) to report fusion annotations in JSON format. --enable-variant-annotation
must be set to true. For more information, see Illumina Annotation Engine.
--rna-gf-restrict-genes
When parsing the gene annotations file for use in the DRAGEN Gene Fusion module, you can use this option to restrict the entries of interest to only protein-coding regions. Restricting the annotation to only the protein-coding genes reduces false positive rates in currently studied fusion events. To report non-coding gene fusions such as pseudo genes and lincRNAs, turn off this option. The default value is true
.
--rna-gf-merge-calls
If multiple genes overlap a fusion breakpoint, DRAGEN generates and scores a separate fusion candidate for each gene pair overlapping the breakpoint. The default value is false
so that each reported fusion event only has one left and right gene in the fusion, and overlapping genes are output as separate events.
--rna-gf-allow-overlapping-genes
Allows for fusion calls between overlapping genes. The default value is "false".
--rna-gf-enable-post-filters
Enable post-filtering of RNA gene fusion candidates by confidence flags. The filter flags are listed in the table above. The default value is "false".
--enable-rna-amplicon
A separate fusion filtering model is trained for RNA amplicon mode. Duplicate removal for fusion supporting reads is disabled for RNA amplicon mode and both genes are required to be in the list of enriched genes. By default, the DRAGEN fusion caller filters candidates if a breakpoint overlaps both transcripts (e.g. fusions such as FIP1L1--PDGFRA and GOPC--ROS1). In RNA amplicon mode, such candidates are not filtered. See DRAGEN Amplicon Pipeline for further information. The default is "false".
--rna-gf-sv-vcf
Structural Variant VCF file output from DRAGEN DNA structural variant caller run in somatic mode. See the next section for more information.
You can run the DRAGEN Gene Fusion module with a VCF file containing somatic Structural Variant (SV) calls. DRAGEN will report SV events matching each fusion candidate in the *.features.csv
output file for informational purposes, but will not use this data in the scoring or filtering of the fusion candidates. The SV calls must come from a run in somatic mode (for more information see DRAGEN Structural Variant Calling pipeline). The following is an example command line for running an end-to-end RNA-Seq experiment with a somatic SV VCF file.
When the SV VCF input is provided to the RNA fusion caller, the following additional features will be reported in the features.csv
output file:
#SvEvent
: A semi-colon separated string representation of SV events matching the fusion candidate.
#SvType
: A semi-colon separated list of the types of the matching SV events.
#SomaticScore
: The highest SomaticScore value of the matching SV events.
#SvDistance
: The maximum distance between any SV breakpoint and any fusion breakpoint (if there are multiple matching SV events, the minimum of these maximum distances over all SV events).
#LeftSvDistance
: The distance between the left fusion breakpoint and the corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV events).
#RightSvDistance
: The distance between the right fusion breakpoint and the corresponding SV breakpoint (if multiple matching SV events, then minimum over all SV events).
#SvPresent
: Set to 1 if matching SV event is present, otherwise 0.
#SvAbsent
: Set to 1 if no matching SV event is present, otherwise 0.
Instead of using a GTF file for annotated splice junctions, the DRAGEN software is also capable of reading in an SJ.out.tab
file (see ). This file enables DRAGEN to run in a two-pass mode, where the splice junctions discovered in the first pass (output as an SJ.out.tab file) are used to guide the mapping and alignment of reads during a second run through DRAGEN. This mode of operation is useful to increase sensitivity for spliced alignments in cases when a gene annotations file is not readily available for the target genome. If a well-curated GTF is already available for your target genome, then there is no need to run a second pass with the SJ.out.tab
.
The RNA Pipeline reports summary and per read group statistics pertaining to read mapping in the mapping_metrics.csv
file. The majority of the metrics are as described in the section, but the metrics that are specific to RNA-seq are listed below.
PolyA tails may be trimmed by including the settings --read-trimmers polya
or --soft-read-trimmers polyg,polya
(Note: polyg soft trimming is enabled by default). The minimum number of poly-A/poly-T bases required for trimming may be set using --trim-polya-min-trim
. The default threshold is 20 poly-A/poly-T bases. Refer to section for usage of read trimmers options.
The PolyA trimmer determines which end of the reads to trim for poly-A and poly-T sequences based on the library type. For example, for Illumina forward stranded paired reads the trimmer will trim poly-A sequences at 3' end of read 1 and poly-T sequences at 5' end of read 2. If --rna-library-type
is not provided or set to autodetect (A
), the trimmer assumes the library is unstranded and trims poly-A sequences from 3' end of each read and poly-T sequences from 5' end of each read. The option --rna-library-type
is described in the section.
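The end selection and minimum-length threshold described above can be illustrated with a simplified Python sketch. It is not the DRAGEN trimmer: it only removes exact runs of A or T, whereas the real trimmer's handling of mismatches and base qualities is not described here. The function names are hypothetical, and the default of 20 mirrors the documented default for --trim-polya-min-trim.

```python
def trim_poly_a_3prime(read: str, min_trim: int = 20) -> str:
    """Remove a trailing run of 'A' bases if it is at least min_trim long."""
    stripped = read.rstrip("A")
    run_length = len(read) - len(stripped)
    return stripped if run_length >= min_trim else read


def trim_poly_t_5prime(read: str, min_trim: int = 20) -> str:
    """Remove a leading run of 'T' bases if it is at least min_trim long."""
    stripped = read.lstrip("T")
    run_length = len(read) - len(stripped)
    return stripped if run_length >= min_trim else read


def trim_unstranded(read: str, min_trim: int = 20) -> str:
    """Unstranded case: poly-A from the 3' end and poly-T from the 5' end."""
    return trim_poly_t_5prime(trim_poly_a_3prime(read, min_trim), min_trim)


print(trim_poly_a_3prime("ACGT" + "A" * 20))  # ACGT
print(trim_poly_a_3prime("ACGT" + "A" * 19))  # unchanged: run is too short
```

For a forward-stranded paired library, only `trim_poly_a_3prime` would apply to read 1 and only `trim_poly_t_5prime` to read 2, following the example in the text.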
By default, the MAPQ calculation for RNA-Seq is identical to DNA-Seq. The primary contributor to MAPQ calculation is the difference between the best and second-best alignment scores. Therefore, adjusting the alignment scoring parameters impacts the MAPQ estimate. These adjustments are outlined in .
--rna-gf-blast-pairs
A tab separated file listing gene pairs that have a high level of similarity. The first and second column are the gene names, and the third column is the e-score. This list of gene pairs is used as a homology filter to reduce false positives. For runs on human genome assemblies GRCh38 and hg19, DRAGEN automatically applies a default file generated using annotations for primary chromosomes if no other file is specified on the command line.
UMI
The UMI sequence. For example, ACGTAC
IsValid
Specify if the UMI sequence is valid. Enter either: TRUE
or FALSE
NearestCodes
Colon-separated list of nearest UMI sequences. For example, ACGTAA:ACGTAT
SecondNearestCodes
Colon-separated list of second nearest sequences. For example, ACGGAA:ACGGAT
Number of reads
Total number of reads.
NA
Fig1) 14 pairs of read X 2 = 28 reads
Number of reads with valid or correctable UMIs
Number of reads for which the UMIs could be corrected based on the lookup table.
Number of reads
Fig2) Valid UMI read count (Exact match+Correctable UMI)
Number of reads in discarded families
Number of reads in discarded families. Families are discarded when there are not enough raw reads to support the family (family size less than "--umi-min-supporting-reads"). For "--umi-emit-multiplicity=duplex" option, simplex families will be discarded.
Number of reads
Fig1) Number of reads in Families discarded (See "Families discarded" for more detail)
Reads filtered out
Number of reads filtered out in total, either for properties or in a discarded family.
Number of reads
Number of reads in discarded families + Reads with all-G UMIs + Number of unpaired reads
Reads with all-G UMIs filtered out
Number of reads filtered out due to all-G in UMI sequence.
Number of reads
Fig2) PolyG UMI read count
Reads with uncorrectable UMIs
Number of reads where the UMI could not be corrected.
Number of reads
Fig2) Uncorrectable + Ambiguous correction + PolyG
Total number of families
Number of simplex collapsed reads.
NA
Fig1) F1~F10.
Families contextually corrected
Number of families that have some contextual correction. Contextual correction is based on other families at the same mapping location including UMI sequencing error and UMI jumping.
Total number of families
Fig2) Family count of correctable UMI
Families shifted
Number of families that have some shift correction. Shift correction merges families with fragment alignment coordinates up to the distance specified by the umi-fuzzy-window-size
parameter.
Total number of families
Fig1) First read pair of DF1 (If shifted distance <= "umi-fuzzy-window-size")
Families discarded
Number of families filtered out by failing min supporting reads criteria or umi-emit type of simplex/duplex.
Total number of families
Fig1) Families discarded by min-support-reads + Families discarded by duplex/simplex (See below for detail)
Families discarded by min-support-reads
Number of families filtered out by failing min supporting reads criteria.
Total number of families
Fig1) Number of families size less than "umi-min-supporting-reads" option Size 1: F6, F10 Size 2: DF3, F5, F9 Size 3: DF1, DF2
Families discarded by duplex/simplex
Number of families filtered out by failing umi-emit type of simplex/duplex.
Total number of families
Fig1) Number of simplex families (F5, F6, F9, F10) filtered. Note that simplex reads are only filtered if umi-emit-multiplicity=duplex (default: both)
Families with ambiguous correction
Number of families where the UMI cannot be corrected because more than one possible UMI correction exists.
Total number of families
Fig2) Number of families of ambiguous correction UMI
Duplex families
Number of families that are merged as duplex (both strands).
Consensus pairs emitted
Fig1) DF1, DF2, DF3
Consensus pairs emitted
Number of collapsed reads in output BAM.
NA
Fig1) Depends on umi-emit-multiplicity=simplex/duplex/both, umi-min-supporting-reads=x simplex=F1~F10 (F2, F3, F6, F7, F8, F10 filtered if x>=2) duplex=DF1, DF2, DF3 both=DF1, DF2, DF3, F5, F6, F9, F10 (F6, F10 filtered if x>=2)
Mean family depth
Average number of read pairs per family. Filtered reads and families are excluded.
NA
Fig1) Number of reads per family: DF1=3, DF2=3, DF3=2, F5=2, F6=1, F9=2, F10=1 Mean family depth = (3+3+2+2+1+2+1)/7 = 2
Histogram of num supporting fragments
Number of families with zero raw reads, one raw read, two raw reads, three raw reads, etc
NA
Fig1) 0 reads: None 1 reads: F6, F10 = 2 (0 if umi-min-supporting-reads=2) 2 reads: DF3, F5, F9 = 3 3 reads: DF1, DF2 = 2 Histogram = {0|0|3|2}
Number of collapsible regions
Number of regions.
NA
Fig3) R1~R7
Min collapsible region size (num reads)
Number of reads in the least populated region.
NA
Fig3) 2 reads (R4)
Max collapsible region size (num reads)
Number of reads in the most populated region.
NA
Fig3) 18 reads (R2)
Mean collapsible region size (num reads)
Average number of reads per region.
NA
Fig3) 8.3
Collapsible region size standard deviation
Standard deviation of the number of reads per region.
NA
Fig3) 5.8
On target number of reads
Number of reads that overlapped with the UMI target interval --umi-metrics-interval-file
.
NA
Fig1, Fig3) All On target metrics are same as corresponding metric but only considering fragments overlap with target intervals. i.e. DF3, F9, F10 in figure1 and R1, R3, R4, R6, R7 in figure3 are excluded from metric
On target number of bases
Number of bases that overlapped with the UMI target interval --umi-metrics-interval-file
.
NA
On target number of reads with valid or correctable UMIs
Number of reads with a UMI that matched a UMI in the lookup table, including error allowance, and overlapped with the UMI target interval.
On target number of reads
On target number of reads in discarded families
Number of reads in discarded families that overlapped with the UMI target interval.
On target number of reads
On target duplex families
Number of families that are merged as duplex among all the families that are overlapped with UMI target interval.
On target consensus pairs emitted
On target mean family depth
Average number of reads per family that overlapped with UMI target interval.
NA
On target families discarded
Number of families that overlapped with UMI target interval filtered out by failing min supporting reads criteria or umi-emit type of simplex/duplex.
On target number of families
On target families discarded by min-support-reads
Number of families that overlapped with UMI target interval filtered out by failing min supporting reads criteria.
On target number of families
On target families discarded by duplex/simplex
Number of families that overlapped with UMI target interval filtered out by failing umi-emit type of simplex/duplex.
On target number of families
On target families with ambiguous correction
Number of families that overlapped with UMI target interval where the UMI cannot be corrected because more than one possible UMI correction exists.
On target number of families
Histogram of unique UMIs per fragment position
Number of positions with zero UMI sequences, one UMI sequence, two UMI sequences, etc.
NA
Fig1) 0 UMI sequence: None 1 UMI sequences: ins2 (F5), ins3 (F6) 2 UMI sequences: ins1 (DF1, DF2) 3 UMI sequences: ins4 (DF3, F9, F10) Histogram = {0|2|1|1}
Total Families in Probability Model Estimation
Total number of families used in estimation of UMI jumping rate and fragment size distribution used for probabilistic family merging.
NA
Number of potential Jumping Families
Total number of families that are potential UMI jumping candidates and the corresponding ratio.
Total Families in Probability Model Estimation
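The Fig1 arithmetic quoted above for Mean family depth and Histogram of num supporting fragments can be reproduced with a short Python sketch. The family sizes are copied from the Fig1 example values in the table. The function names are hypothetical, and applying the minimum-support threshold before building the histogram is an assumption that matches the quoted result {0|0|3|2} for umi-min-supporting-reads=2.

```python
from collections import Counter

# Read pairs per family, as listed in the Fig1 example.
FAMILY_SIZES = {"DF1": 3, "DF2": 3, "DF3": 2, "F5": 2, "F6": 1, "F9": 2, "F10": 1}


def mean_family_depth(sizes: dict) -> float:
    """Average number of read pairs per family."""
    return sum(sizes.values()) / len(sizes)


def support_histogram(sizes: dict, min_supporting_reads: int = 1) -> list:
    """Count families with 0, 1, 2, ... raw reads after the support filter."""
    kept = Counter(n for n in sizes.values() if n >= min_supporting_reads)
    return [kept.get(n, 0) for n in range(max(sizes.values()) + 1)]


print(mean_family_depth(FAMILY_SIZES))                          # 2.0
print(support_histogram(FAMILY_SIZES))                          # [0, 2, 3, 2]
print(support_histogram(FAMILY_SIZES, min_supporting_reads=2))  # [0, 0, 3, 2]
```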
--enable-rna-quantification
If set to true, enables RNA quantification. Requires --enable-rna
to be set to true.
--rna-library-type
Specifies the type of RNA-seq library. The following are the available values:
IU
—Paired-end unstranded library.
ISR
—Paired-end stranded library in which read2 matches the transcript strand (eg, Illumina Stranded Total RNA Prep).
ISF
—Paired-end stranded library in which read1 matches the transcript strand.
U
—Single-end unstranded library.
SR
—Single-end stranded library in which reads are in reverse orientation to the transcript strand (eg, Illumina Stranded Total RNA Prep).
SF
—Single-end stranded library in which reads match the transcript strand.
A
— DRAGEN examines the first read pairs in the data set to automatically detect the correct library type. For poly-A tail trimming, the library type is assumed to be unstranded. Autodetect is the default value.
--rna-quantification-gc-bias
GC bias correction estimates the effect of transcript %GC on sequencing coverage and accounts for the effect when estimating expression. To disable GC bias correction, set to false.
--rna-quantification-fld-max --rna-quantification-fld-mean --rna-quantification-fld-sd
Use these options to specify the insert size distribution of the RNA-seq library for single-end runs. These options are relevant for GC bias correction. The defaults are 250 ± 25. The maximum allowed value is 1000. To improve accuracy, modify the values to match your library.
Name
The ID of the transcript.
Length
The length of the (spliced) transcript in base pairs.
EffectiveLength
The length as accessible to RNA-seq, accounting for insert-size and edge effects.
TPM
Transcripts per Million (TPM) represents the expression of the transcript when normalized for transcript length and sequencing depth.
NumReads
The estimated number of reads from the transcript. The values are not normalized.
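The relationship between the NumReads, EffectiveLength, and TPM columns can be sketched with the conventional TPM definition: divide each transcript's read count by its effective length, then scale the resulting rates to sum to one million. The Python sketch below assumes DRAGEN follows this conventional definition, which the text does not state explicitly. The function name `tpm` is hypothetical.

```python
def tpm(num_reads: list, effective_lengths: list) -> list:
    """Conventional TPM: length-normalized rates scaled to sum to 1e6."""
    if len(num_reads) != len(effective_lengths):
        raise ValueError("inputs must have the same length")
    rates = [reads / length for reads, length in zip(num_reads, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]  # no expressed transcripts
    return [rate / total * 1e6 for rate in rates]


# Two transcripts with equal read counts: the shorter one gets the higher TPM.
values = tpm(num_reads=[100, 100], effective_lengths=[500, 1500])
print(values)       # roughly [750000.0, 250000.0]
print(sum(values))  # roughly 1e6
```

Because TPM values always sum to one million within a sample, they describe relative abundance rather than absolute read depth.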
Library orientation
Library orientation of the RNA-seq reads relative to the original transcripts. The library orientation can be automatically detected, or can be explicitly provided. See Quantification Options
for more information.
Total Genes
Total number of genes from the gene annotation (GTF/GFF) input used for analysis.
Coding Genes
Number of coding genes from the gene annotation (GTF/GFF) excluding pseudo-genes and biotypes which are non-coding.
Total Transcripts
Number of transcripts from the gene annotation (GTF/GFF) input used for analysis.
Median transcript CV coverage
Median Coefficient of Variation (CV), which is standard deviation divided by mean coverage, of the 1000 most highly expressed transcripts. This metric measures uniformity of RNA-seq read coverage.
Median 5' coverage bias
Median 5 prime bias of the 1000 most highly expressed transcripts, calculated per transcript as mean coverage of the 5'-most 100 bases divided by the mean coverage of the whole transcript.
Median 3' coverage bias
Median 3 prime bias of the 1000 most highly expressed transcripts, calculated per transcript as mean coverage of the 3'-most 100 bases divided by the mean coverage of the whole transcript.
Forward transcript fragments
The number of read pairs that match transcripts on the forward strand. Only reads that align fully within exons are counted.
Reverse transcript fragments
The number of read pairs that match transcripts on the reverse strand. Only reads that align fully within exons are counted.
Strand mismatched fragments
In the case of stranded library orientation, number of read pairs that do not match the expected strand of the transcript. Only reads that align fully within exons are counted.
Ambiguous strand fragments
Read pairs that match transcripts in both forward and reverse orientation. Only reads that align fully within exons are counted.
Intron fragments
Read pairs that overlap with a gene, but do not overlap with any exons.
Intergenic fragments
Read pairs that do not overlap with any gene.
Unknown transcript fragments
Read pairs that partially align with an exon but overlap non-exonic regions (usually due to alternative splicing).
Number of genes with coverage > 1x,10x,30x,100x
The count of the number of genes where the most highly expressed transcript has average coverage greater than 1x, 10x, 30x, and 100x.
Fold coverage of all exons
The average sequencing coverage across all annotated exons, determined using the most highly expressed transcript for each gene.
Fold coverage of coding exons
The average sequencing coverage across only exons within coding genes, determined using the most highly expressed transcript for each gene.
Fold coverage of introns
The average sequencing coverage across detected introns.
Fold coverage of intergenic regions
The average sequencing coverage across areas detected outside annotated genes.
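The per-transcript quantities behind Median transcript CV coverage and the 5'/3' coverage bias metrics can be sketched in Python as follows. This is an illustration under stated assumptions: the coverage list is taken to be per-base depth in 5' to 3' transcript order, the population standard deviation is used for the CV, and the function names are hypothetical. The text does not specify which standard deviation estimator DRAGEN uses.

```python
import statistics


def coverage_cv(coverage: list) -> float:
    """Coefficient of variation: standard deviation divided by mean coverage."""
    mean = statistics.fmean(coverage)
    if mean == 0:
        return float("nan")
    return statistics.pstdev(coverage) / mean


def end_bias(coverage: list, end: str, window: int = 100) -> float:
    """Mean coverage of the 5'-most or 3'-most bases over the transcript mean."""
    if end not in ("5'", "3'"):
        raise ValueError("end must be \"5'\" or \"3'\"")
    mean = statistics.fmean(coverage)
    if mean == 0:
        return float("nan")
    segment = coverage[:window] if end == "5'" else coverage[-window:]
    return statistics.fmean(segment) / mean


# A 400-base transcript whose coverage rises toward the 3' end.
cov = [10] * 100 + [20] * 200 + [30] * 100
print(end_bias(cov, "5'"))  # 0.5
print(end_bias(cov, "3'"))  # 1.5
print(coverage_cv(cov))
```

The reported metrics would then be the median of these values over the 1000 most highly expressed transcripts, for example with `statistics.median`.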
XS:A
The XS tag denotes the strand orientation of an intron. See Compatibility with Cufflinks.
NH:i
A standard SAM tag indicating the number of reported alignments that contain the query in the current record. This tag may be used for downstream tools such as featureCounts.
HI:i
A standard SAM tag denoting the query hit index, with its value indicating that this alignment is the i-th one stored in the SAM. Its value ranges from 1 ... NH. This tag may be used for downstream tools such as featureCounts.
jM:B
The jM tag lists the intron motifs for all junctions in the alignments. It has the following definitions:
jM:B
Definition
0
non-canonical
1
GT/AG
2
CT/AC
3
GC/AG
4
CT/GC
5
AT/AC
6
GT/AT
LOW_SCORE
Confidence (always applied)
The fusion candidate has low probabilistic score (< 0.5) as determined by the features of the candidate.
--rna-gf-min-score
MIN_SUPPORT
Confidence (optional)
The fusion candidate has at least one fusion supporting read pairs.
--rna-gf-min-split-support
LOW_UNIQUE_ALIGNMENTS
Confidence (optional)
All fusion supporting read alignments near at least one of the breakpoints have the same start and end position.
--rna-gf-min-unique-alignments
LOW_MAPQ
Confidence (optional)
All fusion supporting read alignments at either breakpoint have MAPQ < 20.
--rna-gf-min-breakpoint-mapq
DOUBLE_BROKEN_EXON
Confidence (optional)
If both breakpoints are 50 bp from annotated exon boundaries, then the number of supporting reads do not satisfy a high threshold requirement (≥10 supporting reads).
--rna-gf-exon-snap
--rna-gf-min-support-be
UNENRICHED_GENES
Confidence (optional)
If an enrichment list is provided, then neither parent gene is enriched. If amplicon mode is enabled, then at least one parent gene is not enriched (see DRAGEN Amplicon Pipeline for further information).
--rna-gf-enriched-only
MITOCHONDRIAL_GENES
Confidence (optional)
The fusion candidate involves mitochondrial genes. Set --rna-gf-filter-chrm=false
to disable this filter.
--rna-gf-filter-chrm
READ_THROUGH
Confidence (optional)
The breakpoints are cis neighbors (< 200,000 bp) on the reference genome.
--rna-gf-min-cis-distance
ANCHOR_SUPPORT
Information only
Read alignments of fusion supporting reads are not long enough (less than 12 bp) at either breakpoint.
--rna-gf-min-anchor
HOMOLOGOUS
Information only
The candidate is likely to be a false candidate generated because the two genes involved have high gene homology.
--rna-gf-min-blast-pairs-eval
LOW_ALT_TO_REF
Information only
The number of reads supporting the fusion is < 1% of the number of reads supporting the reference transcript at either breakpoint.
--rna-gf-min-alt-to-ref
LOW_GENE_COVERAGE
Information only
Either breakpoint has less than 125 bp with nonzero read coverage.
--rna-gf-min-covered-bases
The identification of alternatively spliced isoforms (using their constitutive splice variants) and their functional effects is of high importance in the study of genetic variation and diseases, including cancer and neurological disorders. The main types of alternative splicing events resulting in splice variants are:
Exon skipping
Intron retention
Mutually exclusive exons
Alternative 5' splice site
Alternative 3' splice site
When enabled with the --enable-rna-splice-variant=true
option added to an RNA Map/Align job, DRAGEN runs a Splice Variant caller by taking advantage of its fast and highly accurate splice-aware read mapper/aligner that aligns to the whole genome to identify novel alternative Splice Junction (SJ) candidates. These candidates can be filtered by additional information provided such as a "normals list" and a "target regions list", or whitelisted with a "knowns" list.
Next, during the read sorting phase, evidence for these alternative splice candidates (alts) vs. reference splicing is accumulated. Finally, each candidate is scored based on the accumulated read evidence, and the results are written to TSV and VCF files for downstream tertiary analysis.
Following is an example command line.
In addition to the required inputs listed in the above example (i.e. paired FASTQ reads, reference hashtable, and annotation), the following three optional input resource files can be provided to improve precision by reducing the false positive (FP) count.
A list of Normal splice variants that will be filtered out of the final output (i.e. operating as a blacklist), as long as they are not in the "knowns" list, using the "--rna-splice-variant-normals" option.
The format of this file should be a tab separated file in the same format as the SJ.out.tab, except only the first 4 columns are used, i.e.
contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)
strand (0: undefined, 1: +, 2: -)
To create a Normals list file, a collection of DRAGEN RNA mapper output SJ.out.tab files for at least 30 samples can be used along with a simple script to process all the SJs in these files. The pseudo code block below describes the function of this script:
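As an illustration only, such a script could be sketched in Python as follows. It collects the first four columns of every junction across a set of SJ.out.tab files and keeps junctions seen in at least `min_samples` files. The `min_samples` cutoff, the function names, and the choice to write only the four evaluated columns are assumptions of this sketch, not documented DRAGEN behavior.

```python
from collections import Counter


def read_junctions(path):
    """Return the set of (contig, first_base, last_base, strand) keys in one SJ.out.tab file."""
    keys = set()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4 or not fields[0]:
                continue  # skip blank or malformed lines
            keys.add((fields[0], int(fields[1]), int(fields[2]), int(fields[3])))
    return keys


def build_normals_list(sj_files, out_path, min_samples=1):
    """Write junctions seen in at least min_samples files (hypothetical cutoff) as a 4-column TSV."""
    counts = Counter()
    for path in sj_files:
        # Each file contributes at most one count per junction.
        counts.update(read_junctions(path))
    kept = sorted(key for key, n in counts.items() if n >= min_samples)
    with open(out_path, "w") as out:
        for contig, start, end, strand in kept:
            out.write(f"{contig}\t{start}\t{end}\t{strand}\n")
    return kept
```

With `min_samples=1` the result is the plain union of all junctions; a higher value keeps only junctions that recur across normal samples.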
A list of known splice variants that are exempt from being filtered out of the final output (i.e. operating as a whitelist), using the "--rna-splice-variant-knowns" option. The format of the file should be a tab separated file in the same format as the SJ.out.tab with 9 columns present, except only the first 4 columns are evaluated, i.e.
contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)
strand (0: undefined, 1: +, 2: -)
By default, the caller does not consider any splice variant candidate found in the input annotation file, since it is looking for de novo variants, unless the candidate is included in the knowns list, which directs the caller not to discard it. Note that some newer gene annotation models have added alt transcripts that contain clinically relevant splice variants, which causes DRAGEN to skip reporting them.
To ensure these are reported, the user may want to pass these in with a knowns file containing these common variants if they are found in the annotation that is used. An example is shown below using hg38 coordinates specifying the MET exon 14 skip, EGFRv3, and ARv7 alt splicing events, respectively.
A list of regions that called splice variants must fall within using the "--rna-splice-variant-regions" option. Any splice variant candidates will be excluded if they are not within these regions.
This file should be in BED file format with the following fields, except that the regions are 1-based:
chromosome id
start position (1-based)
end position (1-based)
region (i.e. gene) name
The detected splice variants are output as two separate TSV files for the intragenic and intergenic candidates, and as a VCF for the intragenic candidates.
The following categories are used when accumulating read counts for each alt SJ candidate:
DedupUniqueSupportingReads - Non-duplicate marked reads that are unique and precisely align with the SJ
DupUniqueSupportingReads - Duplicate marked reads that are unique and precisely align with the SJ
DedupUniqueNonsupportingReads - Non-duplicate marked reads that are unique but don't support the splice variant
DupUniqueNonsupportingReads - Duplicate marked reads that are unique but don't support the splice variant
To be counted, a paired read alignment:
must be primary and properly paired
must contain a splice junction (i.e. an alignment gap in the CIGAR containing skip ops)
must have overhangs on either side of the skip that are at least 6 bases
is considered "unique" only if NH=1 and MAPQ > 35
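The criteria above can be sketched as a small check on SAM fields. This is a minimal illustration, not DRAGEN's implementation: it assumes "primary" means neither the secondary (0x100) nor the supplementary (0x800) flag is set, and it measures an overhang as the number of aligned read bases (M, = or X ops) on each side of a skip. Function names are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
MIN_OVERHANG = 6       # minimum aligned bases on each side of the skip
MIN_UNIQUE_MAPQ = 35   # "unique" requires MAPQ strictly greater than this


def splice_overhangs(cigar):
    """Return a (left, right) aligned-base overhang pair for each N (skip) op."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    overhangs = []
    for i, (_, op) in enumerate(ops):
        if op == "N":
            left = sum(n for n, o in ops[:i] if o in "M=X")
            right = sum(n for n, o in ops[i + 1:] if o in "M=X")
            overhangs.append((left, right))
    return overhangs


def is_counted(flag, cigar):
    """True if the alignment is primary, properly paired, spliced, and has long enough overhangs."""
    primary = not (flag & 0x100 or flag & 0x800)
    properly_paired = bool(flag & 0x1) and bool(flag & 0x2)
    overhangs = splice_overhangs(cigar)
    return (primary and properly_paired and bool(overhangs)
            and all(min(left, right) >= MIN_OVERHANG for left, right in overhangs))


def is_unique(nh, mapq):
    """True if the read maps to one location (NH=1) with MAPQ above the cutoff."""
    return nh == 1 and mapq > MIN_UNIQUE_MAPQ
```

For example, `is_counted(99, "20M1000N30M")` accepts a properly paired primary read with 20- and 30-base overhangs, while `is_counted(99, "5M1000N45M")` rejects it because the left overhang is shorter than 6 bases.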
These two output files are named:
.splice_variants.tsv which contains the intragenic alt splice junctions that result in transcript variants
.splice_variant_fusions.tsv which contains the intergenic alt splice junctions that result in fusions across genes
Each detected splice junction contains the following columns:
gene_start - Gene name(s) at the start of the SJ. Multiple genes are separated by a semicolon
gene_end - Gene name(s) at the end of the SJ. Multiple genes are separated by a semicolon
chromosome - Chromosome containing the SJ
start - SJ's start position (1-based genomic coordinate)
end - SJ's end position (1-based genomic coordinate)
strand - Detected strand for the SJ (1: +, 2: -)
motif - intron motif
annotated - True if annotated, otherwise False
split_unique_reads_ref - DedupUniqueNonsupportingReads count that support reference
split_total_reads_ref - DupUniqueNonsupportingReads + DedupUniqueNonsupportingReads count that support reference
split_unique_reads_alt - DedupUniqueSupportingReads count that support variant
split_total_reads_alt - DupUniqueSupportingReads + DedupUniqueSupportingReads count that support variant
max_spliced_alignment_overhang - maximum spliced alignment overhang from all supporting reads
score - The splice junction variant score (ranging from 0.0 to 1.0). Currently, this is just a linear function of the number of split_unique_reads divided by 10, i.e. equals MIN(1.0, split_unique_reads_alt/10)
Note:
In the intragenic output file containing transcript variant splice junctions, the gene_start and gene_end columns must match.
In the intergenic output file containing fusions from splice junctions, the gene_start and gene_end columns must be different.
The detected intragenic splice junction variants that are not filtered out are also written to a compressed VCF file named .splice_variants.vcf.gz, where each splice variant candidate is a one-line VCF record containing the fields below:
CHROM - Chromosome of the splice
POS - SJ start position (1-based) i.e. first base of intron
ID - "." (unused)
REF - Base from the reference genome FASTA at the SJ start position
ALT - "<DEL>"
QUAL - The junction score from 0.0 - 1.0
FILTER - Semicolon separated list of filters: LowQ and LowUniqueAlignments
INFO - See the possible INFO fields below
FORMAT - AD:DP
SAMPLE - Counts for {DedupUniqueSupportingReads}:{DedupUniqueNonsupportingReads}
The following lines of the VCF header describe columns 5 to 10 (last 6 columns)
Note on Filter Thresholds
The passing thresholds for the LowQ and LowUniqueAlignments filters are fixed to the settings below.
| Filter | Description | Value |
| --- | --- | --- |
| LowQ | Below splice variant score threshold | less than 1.0 |
| LowUniqueAlignments | Below unique supporting read count threshold | less than 2 |
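The score formula and the two fixed thresholds can be combined in a short sketch. It assumes that the unique supporting read count is the split_unique_reads_alt value and that a record with no failing filter is labeled PASS, following the usual VCF convention; neither detail is stated explicitly above, and the function names are hypothetical.

```python
def splice_variant_score(split_unique_reads_alt):
    """Score = MIN(1.0, split_unique_reads_alt / 10)."""
    return min(1.0, split_unique_reads_alt / 10)


def splice_variant_filters(split_unique_reads_alt):
    """Return the semicolon-separated FILTER value for one candidate."""
    filters = []
    if splice_variant_score(split_unique_reads_alt) < 1.0:
        filters.append("LowQ")                 # score below 1.0
    if split_unique_reads_alt < 2:
        filters.append("LowUniqueAlignments")  # fewer than 2 unique supporting reads
    return ";".join(filters) if filters else "PASS"
```

Under these thresholds a candidate needs at least 10 unique, non-duplicate supporting reads to pass: `splice_variant_filters(5)` returns `LowQ`, and `splice_variant_filters(1)` returns `LowQ;LowUniqueAlignments`.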
When the splice variant caller and gene fusion caller are both enabled, the passing and failed intergenic fusion SJ's will also be merged into the relevant fusion output TSV files.
The passing calls get added to the fusion caller's .fusion_candidates.final file. The tab separated fields are described below.
| Field Names | Description |
| --- | --- |
| FusionGene | Left and Right gene names (separated by '--') |
| Score | Value between 0 and 1 |
| LeftBreakpoint, RightBreakpoint | The location for left and right sides of the splice with three colon separated fields: chromosome:coordinate:strand(+/-) |
| Gene1Location, Gene2Location | Splice Variant caller always outputs "SpliceVar" here instead of Exon/Intron location |
| Gene1Sense, Gene2Sense | Always TRUE by design |
| Gene1Id, Gene2Id | Long form ID (i.e. for Gencode it is usually "ENSG.version") |
| NumSplitReads | Taken from the dedupUniqueSupportingReads count (i.e. split_unique_reads_alt column value) |
| NumSoftClippedReads, NumPairedReads | These values are not used by the RSV caller and are set to '0' |
| ReadNames | Not provided by this caller and set to 'N/A' |
The failing calls get added into the fusion caller's .fusion_candidates.filter_info output file. The output fields are the same as described above for the "final" output file, with the addition of the FILTER_INFO field in the first column. The value in this field will be "RSV_FILTER:" followed by the specific filters that are not passing, as described in the table below.
| Filter Names | Description |
| --- | --- |
| LOW_QUAL | Below the "low quality score" threshold |
| LOW_UNIQUE_ALIGNS | Number of unique anchors on either the left or right side is below the MIN_UNIQUE_ALIGNS=2 threshold |
| LOW_EVIDENCE_OR_OVERHANG | Not meeting the SJ.out read count vs. splice length and overhang requirements |
| READTHROUGH | Gene partner is the next downstream annotated gene |
The following sub-pages contain recommended command line options for specific DRAGEN pipelines. For an overview of DRAGEN command line parsing, also see Multicaller Workflows
This recipe is for processing whole genome sequencing data for germline workflows.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.
Configure the INPUT options
Configure the OUTPUT options
Configure MAP/ALIGN depending on if realignment is desired or not
Configure the VARIANT CALLERs based on the application
Configure any additional options
Build up the necessary options for each component separately, so that they can be re-used in the final command line.
We highly recommend using a pangenome reference for human samples (excluding RNA). For more details, refer to Dragen Reference Support.
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
Optional settings per component are listed below. Full option list at this page.
Note that we do not recommend changing the default QUAL thresholds of 3 for DRAGEN-ML and 10 for DRAGEN without ML. These values differ from each other because DRAGEN-ML improves the calibration of QUAL scores, leading to a change in the scoring range (see QUAL, QD, and GQ Formulation).
enable-hla
Enable HLA typer (this setting by default will only genotype class 1 genes)
hla-enable-class-2
Extend genotyping to HLA class 2 genes
The metagenomics classifier uses a k-mer based classification algorithm to classify each query sequence (usually a read) against a collection of reference sequences. There are two logical steps to this process: 1) reference sequences are indexed into a searchable database 2) reference sequence database is searched using query sequences and query sequences are classified to taxid(s) associated with the reference sequences. This guide explains how to run query sequences against a pre-existing reference sequence database (As of DRAGEN 4.3+, users can build their own custom reference sequence database).
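The two logical steps can be illustrated with a toy example. This is a deliberately simplified sketch of k-mer classification in general, not DRAGEN's algorithm or database format: it indexes each reference k-mer to a single taxid, drops k-mers shared between taxids (real classifiers typically resolve these through the taxonomy), and assigns a read to the taxid with the most k-mer hits. All names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Step 1: map each k-mer to its taxid. references is a dict of taxid -> sequence."""
    index = {}
    ambiguous = set()
    for taxid, seq in references.items():
        for kmer in set(kmers(seq, k)):
            if kmer in index and index[kmer] != taxid:
                ambiguous.add(kmer)  # seen under more than one taxid
            index[kmer] = taxid
    for kmer in ambiguous:
        del index[kmer]              # toy simplification: ignore shared k-mers
    return index


def classify(read, index, k):
    """Step 2: return (taxid, hit_count) for the best-supported taxid, or (None, 0)."""
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    if not hits:
        return None, 0
    return hits.most_common(1)[0]


references = {1: "ACGTACGTGG", 2: "TTTTCCCCAA"}  # made-up sequences and taxids
index = build_index(references, k=4)
print(classify("ACGTAC", index, k=4))
```

The indexing cost is paid once per database, which is why a prebuilt database is queried many times rather than rebuilt per run.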
Required Inputs
--enable-kmer-classifier
Enables the Kmer Classifier. (Default=false).
--output-file-prefix
Prefix for all output files.
--output-directory
Directory for all output files.
--kmer-classifier-input-read-file
Input sequence file (zipped or unzipped) to the Kmer Classifier.
--kmer-classifier-db-file
Database of sequences to classify against.
Optional Inputs
--intermediate-results-dir
Area for temporary files. Size must be greater than size of all FASTQ files multiplied by 2.
--kmer-classifier-load-db-ram
Load the database onto RAM. Do not use if database is on ramdisk. (Default=false).
--kmer-classifier-multiple-inputs
Set to true to run with multiple inputs. The input read file is now a .tsv file that has three columns: Sample ID, Read1 file, (optional) Read 2 file. (Default=false).
--kmer-classifier-min-window
The minimum number of consecutive kmers to classify assignment at taxid. (Default=1).
--kmer-classifier-output-read-seq
Option to enable read sequence column in the output file. (Default=false).
--kmer-classifier-output-taxid-seq
Option to enable a taxid string column in the output file. (Default=false).
--kmer-classifier-db-to-taxid-json
Path to JSON file that maps database IDs to external taxids, names, and ranks.
--kmer-classifier-no-read-output
Option to not create individual read output. (Default=false).
--kmer-classifier-no-taxid-counts
Option to not write taxid count output file. (Default=false).
--kmer-classifier-protein-input
Option to indicate protein query sequences. To use this option, the reference sequence database MUST be of protein sequences. (Default=false).
--kmer-classifier-ncpus
Option to set the number of CPUs available for processing.
Applies to: --kmer-classifier-input-read-file, --kmer-classifier-multiple-inputs
If the analysis is for a single FASTA/FASTQ read file, pass that filename to --kmer-classifier-input-read-file and set --kmer-classifier-multiple-inputs=false. However, many read files can be submitted to the Kmer Classifier at one time, which minimizes the load time for a large reference sequence database. In this case, the input file must be a tab-separated .tsv file with two columns (optionally three). The first column is a unique ID, the second column is the path to the read file, and the optional third column is the path to the second read file in the case of paired-end reads. The ID is used to distinguish the output files. There is no header line. Pass this .tsv file to --kmer-classifier-input-read-file and set --kmer-classifier-multiple-inputs=true.
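A multiple-input file of this shape can be generated with a few lines of Python. The sketch below writes one headerless, tab-separated row per sample and rejects duplicate IDs, since the ID distinguishes the output files. The function name and the sample paths in the usage note are hypothetical.

```python
import csv


def write_classifier_input_tsv(samples, out_path):
    """Write (sample_id, read1, read2_or_None) tuples as a headerless TSV."""
    seen = set()
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample_id, read1, read2 in samples:
            if sample_id in seen:
                raise ValueError(f"duplicate sample ID: {sample_id}")
            seen.add(sample_id)
            # The third column is only written for paired-end samples.
            writer.writerow([sample_id, read1] + ([read2] if read2 else []))
```

For example, `write_classifier_input_tsv([("s1", "/data/s1_R1.fastq.gz", "/data/s1_R2.fastq.gz"), ("s2", "/data/s2.fastq.gz", None)], "inputs.tsv")` produces a three-column row for the paired-end sample and a two-column row for the single-end sample.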
Applies to: --kmer-classifier-db-file, --kmer-classifier-db-to-taxid-json, --kmer-classifier-load-db-ram
A file of reference sequences (the "database") can be quite large. If the database file is stored on a normal file system, it is recommended that you set --kmer-classifier-load-db-ram=true. This tells the Kmer Classifier to load the database file into memory for faster analysis. The database file can also be stored on a RAM disk, which reduces load time over many Kmer Classifier runs. In this case, it is recommended to set --kmer-classifier-load-db-ram=false.
Applies to: --kmer-classifier-db-to-taxid-json
This input file is downloaded alongside the reference sequence database. It associates a taxid internal to the classifier database with an external source, like the NCBI taxonomy. This JSON file is a dictionary in which each key is an internal taxid that maps to an external taxid, name, and rank. Example:
The internal taxids are used in the output files. This JSON file can be used to map the results to taxids from the NCBI taxonomy.
The genome database includes NCBI RefSeq genomes for human, bacteria, archaea, viruses, and fungi. The December 3 2023 NCBI taxonomy was used to build the database, and the sequences were collected in December 2023.
To download the reference index file and the taxid mapping JSON:
This database includes the contents of the Genome database and all of the NCBI nucleotide (nt) database. The sequences from the NCBI nucleotide database were collected in July 2023, and the December 3 2023 NCBI taxonomy was used to build the database. Two versions of this database are available for download: One that requires a machine with >= 550GB RAM, and a compressed version that trades approximately 5-10% accuracy for a smaller RAM footprint and requires a machine with >= 225GB RAM.
To download the reference index file and the taxid mapping JSON:
To download the compressed reference index file and the taxid mapping JSON:
This database includes all protein sequences of the UniRef90 database. The sequences were collected in March 2024 and the March 28 2024 NCBI taxonomy was used to build the database.
To download the reference index file and the taxid mapping JSON:
This database includes full length bacterial 16S sequences from the NCBI. The sequences were collected in April 2024 and the March 28 2024 NCBI taxonomy was used to build the database.
To download the reference index file and the taxid mapping JSON:
There are two output files, one organized around the reads, and the other organized around the taxids.
Applies to: --kmer-classifier-output-taxid-seq, --kmer-classifier-output-read-seq
The main output file is a .tsv file with the extension .read_classifications.tsv. It has no header line, has tab-separated columns, and can vary in the number of columns depending on command line options. It details the results for each read.
| Column | Description | Type |
| --- | --- | --- |
| 1 | Read index | integer |
| 2 | Read name | string |
| 3 | Taxid the read classified to | integer |
| 4 | Maximum number of contiguous kmers that classified to this taxid | integer |
| 5 | Score assigned to the classification | integer |
| 6 | Number of kmers that classified to this taxid | integer |
| 7 | Read duplication count | integer |
| 8 | Name associated with taxid, if given with --kmer-classifier-db-to-taxid-json | string |
| 9 | Taxonomic rank associated with taxid, if given with --kmer-classifier-db-to-taxid-json | string |
| 10 | Taxid that each kmer classified to (output when the --kmer-classifier-output-taxid-seq flag is set) | list of integers separated by commas |
| 11 | Read sequence (output when the --kmer-classifier-output-read-seq flag is set) | string |
The second output file is a .tsv file with the extension .classifier.taxid_kmer_counts.tsv. It has a header line and has tab-separated columns. It summarizes the results for each taxid.
| Column | Description | Type |
| --- | --- | --- |
| db_taxid | Identifier for this taxid used internally in the database | integer |
| duplicity | Ratio of total number of kmers from reads assigned to this taxid compared to the number of distinct kmers from reads assigned to this taxid | float |
| distinct_coverage | Percent of kmers in the database assigned to this taxid that are covered by kmers in the reads assigned to this taxid | integer |
| read_count | Number of reads that classified to this taxid | integer |
| total_kmer_count | Number of kmers that classified to this taxid | integer |
| distinct_kmer_count | Number of distinct kmers that classified to this taxid | integer |
| cumulative_read_count | Cumulative number of reads assigned to this taxid and its taxonomic descendants | integer |
| taxid | Taxid | integer |
| name | Name associated with the taxid, if given with --kmer-classifier-db-to-taxid-json | string |
| rank | Taxonomic rank of the taxid, if given with --kmer-classifier-db-to-taxid-json | string |
| taxid_distinct_kmer_count | Number of distinct kmers assigned to this taxid from the reference sequences | string |
| probability_present | Not in use | float |
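Because the taxid count file has a header line, it can be read by column name. The sketch below loads the file, ranks taxids by read_count, and recomputes the duplicity ratio from total_kmer_count and distinct_kmer_count as a sanity check. The function names are hypothetical, and treating duplicity as exactly total divided by distinct is an interpretation of the column description above.

```python
import csv


def load_taxid_counts(path):
    """Read a .classifier.taxid_kmer_counts.tsv file into a list of dicts keyed by header name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def top_taxa(rows, n=5):
    """Return the n rows with the highest read_count."""
    return sorted(rows, key=lambda row: int(row["read_count"]), reverse=True)[:n]


def recomputed_duplicity(row):
    """Total k-mers divided by distinct k-mers, or 0.0 when no k-mers were assigned."""
    distinct = int(row["distinct_kmer_count"])
    return int(row["total_kmer_count"]) / distinct if distinct else 0.0
```

The db_taxid values in these rows are internal identifiers; the taxid, name, and rank columns carry the external mapping when the JSON file is supplied.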
The DRAGEN Single-Cell Multiomics (scRNA + scATAC) Pipeline can process data sets from single-cell RNA-Seq and ATAC-Seq reads to a cell-by-feature count matrix.
The pipeline is compatible with library designs that have:
For single-cell RNA: one read in a fragment matches a transcript and the other contains a cell-barcode and UMI.
For single-cell ATAC: two reads match the genome and a third contains the cell-barcode.
The pipeline includes the following functions:
Alignment
RNA-Seq (splice-aware) alignment and matching to annotated genes for the transcript reads.
ATAC-Seq alignment
Cell-barcode (both RNA- and ATAC-Seq) and UMI (RNA-Seq only) error correction for the barcode read.
Read counting
UMI counting per cell and gene to measure gene expression.
Fragment counting per cell and peak to measure chromatin accessibility.
Sparse matrix output and QC metrics.
A standard DRAGEN reference hashtable with both DNA and RNA capability is required for the Single-Cell Multiomics Pipeline. Building a reference hashtable using --ht-build-rna-hashtable=true should satisfy this requirement. See the section Prepare a Reference Genome for more details.
The pipeline also requires a gene annotation file in GTF format, provided with the --annotation (-a) option.
Since the multiomics workflow assumes at least one pair of single-cell RNA FASTQ files and one triple of single-cell ATAC FASTQ files, the only method to provide read input to DRAGEN is through the fastq-list file mechanism. A multiomics fastq-list file is a CSV file with the following mandatory columns:
| Column | Description |
| --- | --- |
| Lane | Sequencing lane |
| RGID | Read group ID |
| RGSM | Read group sample |
| RGLB | Read group library |
| Read1File | Read 1 FASTQ file |
| Read2File | Read 2 FASTQ file |
| UmiFile | FASTQ file with cell-barcodes and/or UMIs |
| InputGroup | Input group: either scATAC or scRNA (not case sensitive) |
Entries in a fastq-list file corresponding to single-cell RNA analysis must have read 2 FASTQ files in the UmiFile column, and the Read2File column entries must be left empty. An example is shown below:
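As a hedged illustration of these rules, the following Python sketch generates a multiomics fastq-list CSV with one scRNA row and one scATAC row. The column order, read group values, and FASTQ file names are placeholders chosen for this example; only the column names and the scRNA rule (read 2 in UmiFile, Read2File empty) come from the text above.

```python
import csv
import io

COLUMNS = ["RGID", "RGSM", "RGLB", "Lane", "Read1File", "Read2File", "UmiFile", "InputGroup"]

# Hypothetical file names and read group values, for illustration only.
EXAMPLE_ROWS = [
    {"RGID": "rna_rg", "RGSM": "sample1", "RGLB": "lib1", "Lane": "1",
     "Read1File": "sample1_rna_R1.fastq.gz", "Read2File": "",
     "UmiFile": "sample1_rna_R2.fastq.gz", "InputGroup": "scRNA"},
    {"RGID": "atac_rg", "RGSM": "sample1", "RGLB": "lib1", "Lane": "1",
     "Read1File": "sample1_atac_R1.fastq.gz", "Read2File": "sample1_atac_R2.fastq.gz",
     "UmiFile": "sample1_atac_barcode.fastq.gz", "InputGroup": "scATAC"},
]


def fastq_list_csv(rows):
    """Return the fastq-list CSV text, enforcing the scRNA Read2File rule."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        if row["InputGroup"].lower() == "scrna" and row["Read2File"]:
            raise ValueError("scRNA rows must leave Read2File empty")
        writer.writerow(row)
    return buffer.getvalue()


print(fastq_list_csv(EXAMPLE_ROWS))
```

The scATAC row uses all three file columns, with the two genomic reads in Read1File and Read2File and the barcode read in UmiFile.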
To use the multiomics workflow, enter --enable-rna=true --enable-single-cell-rna=true --enable-single-cell-atac=true.
By default the multiomics workflow assumes that the overall barcode/UMI sequence is made up of a single-cell barcode (possibly split into multiple blocks) and a single UMI. Enter the following command to identify the location of the single-cell barcode and single-cell UMI in the barcode read:
For more details, please consult with the corresponding section of scRNA and scATAC workflows' documentation.
Barcode lists are mandatory for both scRNA and scATAC reads. You need to provide the list of cell barcode sequences using the following command:
--scrna-barcode-sequence-list </path/to/scRNAbarcodeAllowlist.txt> --scatac-barcode-sequence-list </path/to/scATACbarcodeAllowlist.txt>
The files must contain one possible cell barcode sequence per line. You can compress the file with gzip (*.txt.gz). During cell-barcode error correction, any observed barcodes that do not match a sequence specified in the file are considered errors. If possible, the barcodes are corrected to a similar allowed sequence. If the barcodes cannot be corrected, they are filtered out.
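The text does not say how "similar" is measured, so the sketch below uses a common convention as an assumption: a barcode is corrected only when exactly one allowlist entry lies within one mismatch (Hamming distance 1), and is otherwise discarded. The function names and the one-mismatch limit are hypothetical and may differ from DRAGEN's actual rule.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, allowlist, max_mismatches=1):
    """Return the allowlist barcode for an observed sequence, or None to filter it out."""
    if observed in allowlist:
        return observed  # exact match, nothing to correct
    candidates = [bc for bc in allowlist
                  if len(bc) == len(observed) and hamming(bc, observed) <= max_mismatches]
    # Correct only when the nearest allowed barcode is unambiguous.
    return candidates[0] if len(candidates) == 1 else None


allowlist = {"AAAA", "CCCC", "AAAT"}       # toy allowlist
print(correct_barcode("CCCG", allowlist))  # corrected to CCCC
print(correct_barcode("AAAC", allowlist))  # ambiguous, so None
```

Requiring a unique nearest neighbor avoids assigning a read to the wrong cell when two allowed barcodes are equally close.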
For each individual modality (either scRNA or scATAC), DRAGEN uses a threshold on the total count of unique UMIs (or reads) per cell barcode, to determine which barcodes are likely to correspond to single-cells in the original sample, instead of background noise. The threshold is determined based on the distribution of counts along barcodes and on the expected number of true cells in the sample. For more information, see the corresponding section in scATAC/scRNA documentation.
After the count thresholds for each individual modality are computed, DRAGEN performs a joint cell filtering step. Each cell barcode is represented in a 2-D space with coordinates computed as the total UMI count across genes and the total fragment count across peaks. Initially, a cell-barcode is considered as passing the joint filter if it passes the filter in each individual modality. DRAGEN then groups all cell barcodes into two categories: those passing both individual modality filters and the remaining cell barcodes. A k-means algorithm with 2 clusters is run and the filtering status of each cell barcode is refined.
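The joint filtering idea can be sketched as a two-cluster k-means seeded with the per-modality results. This is an illustration under stated assumptions, not DRAGEN's implementation: the log transform of the counts, the Euclidean distance, the stopping rule, and all names are choices made for this sketch.

```python
import math


def joint_filter(umi_counts, frag_counts, pass_rna, pass_atac, max_iter=100):
    """Refine per-barcode pass/fail labels with 2-cluster k-means in (UMI, fragment) space."""
    # Assumption: work in log space so neither modality dominates the distance.
    points = [(math.log1p(u), math.log1p(f)) for u, f in zip(umi_counts, frag_counts)]
    # Initial grouping: passing both individual filters vs. everything else.
    labels = [r and a for r, a in zip(pass_rna, pass_atac)]
    for _ in range(max_iter):
        centroids = {}
        for label in (True, False):
            members = [p for p, l in zip(points, labels) if l == label]
            if not members:
                return labels  # one group is empty, nothing to refine
            centroids[label] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
        # Reassign each barcode to the nearer centroid.
        new_labels = [math.dist(p, centroids[True]) <= math.dist(p, centroids[False])
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

In this sketch a barcode that narrowly fails one modality but sits close to the passing cluster in both dimensions can be rescued, while a barcode far from the passing cluster stays filtered.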
The following is an example command line to run the DRAGEN Single Cell Multiomics Pipeline.
Single-cell Multiomics outputs are found in the standard DRAGEN output location using the prefix <sample>. in case of a single library and the prefix <sample>.<libId>. in case of multiple libraries. All single-cell Multiomics output files contain the word multiomics in their names.
The following three files provide per-cell feature counts in Matrix Market (*.mtx) format:
<prefix>.multiomics.matrix.mtx.gz
Count of unique UMIs or fragments for each cell/feature pair in sparse matrix format.
<prefix>.multiomics.barcodes.tsv.gz
Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.multiomics.features.tsv.gz
Feature name and ID for each feature in the matrix.
The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.multiomics.barcodeSummary.tsv, indicated by the values PASS and FAIL.
The output includes filtered matrix files, which only include the per-cell feature count level for the filtered cells in Matrix Market (*.mtx) format. The multiomics.features.tsv.gz file is common for the unfiltered and filtered matrices:
<prefix>.multiomics.filtered.matrix.mtx.gz
Count of unique UMIs for each filtered cell/feature pair in sparse matrix format.
<prefix>.multiomics.filtered.barcodes.tsv.gz
Cell-barcode sequence for each filtered cell from the matrix.
Some users might want to explore the output matrix in a human-readable format. To do so, a possible way would be to load the matrix in a "dense" dataframe in python (similar methodologies can be used in alternative programming languages). It is important to remember, however, that when possible a "sparse" representation of the matrix is preferable, due to the significant usage of memory and disk space of "dense" matrices. Several tools are available to work efficiently with "sparse" representations of single cell matrices (e.g., scanpy in python).
The matrix can be converted into a "dense" representation through two python modules: scanpy and pandas. This has been tested with python 3.10.0, scanpy 1.9.3, pandas 1.5.3.
First, it is necessary to install the required libraries:
Within python, the matrix can be loaded in "dense" representation using the following commands:
The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.
DRAGEN Single-Cell Multiomics outputs two BAM files sorted by coordinate: one with suffix scRNA.bam and one with suffix scATAC.bam. For more details, consult the corresponding section of the scATAC/scRNA user guide.
When running in multiomics mode, since the DRAGEN aligner processes both RNA and ATAC reads, there is a separate mapper metrics summary for each modality. Each summary is identified via the RGID field as rna or atac. There is no common map/align metrics summary for all input reads since it would merge the two modalities. The <prefix>.mapping_metrics.csv file also reflects this.
Below is an example snippet of the alignment metrics printed at the end of a DRAGEN Single-Cell Multiomics run.
The <prefix>.multiomics_metrics.csv file contains per sample scATAC and scRNA metrics. For more details, consult the corresponding section of the scATAC/scRNA user guide.
Here is an example of what a <prefix>.multiomics_metrics.csv file can look like:
The <prefix>.multiomics.barcodeSummary.tsv file contains summary statistics for each unique cell-barcode per cell after error correction. Here is an example of what a <prefix>.multiomics.barcodeSummary.tsv file can look like:
Amplicon sequencing is a highly targeted approach that enables you to analyze genetic variation in specific genomic regions. The ultradeep sequencing of PCR products (amplicons) allows you to efficiently identify and characterize variants. This method uses oligonucleotide probes designed to target and capture regions of interest, followed by next-generation sequencing (NGS).
The Amplicon Pipeline supports both DNA and RNA data. The Amplicon Pipeline turns off duplicate marking because there are only a few unique start and end positions for fragments from an amplicon target due to the assay.
The DNA Amplicon Pipeline uses the DRAGEN DNA Pipeline and adds a step after mapping and aligning that soft-clips primers and rewrites alignments. If the target amplicon is found, DRAGEN tags each alignment with the target amplicon and performs soft-clipping on the primer sequences. DRAGEN performs tagging by adding an XN:Z:<amplicon name> tag to the output BAM/CRAM record. Soft-clipping makes sure that the primer sequences do not contribute to the variant calls.
In the primer clipping step, the following poorly aligned reads are also unaligned, with MAPQ set to 0:
Alignments that don't consume any reference bases after soft-clipping.
Off-target alignments overlapping target regions.
Alignments with a substitution fraction more than a threshold. Substitution fraction is the ratio of mismatch count to the sum of match and mismatch counts, and the probe regions are excluded from the calculation. The threshold is specified by --amplicon-max-substitution-fraction with a default of 0.04.
Alignments with read base count less than the short-read threshold after soft-clipping and with a substitution fraction more than a threshold including the probes. The short-read threshold is specified by --amplicon-shortread-length-threshold with a default of 30. The probe regions are included in the calculation and soft-clipped bases are treated as mismatches. The substitution threshold is set by --amplicon-max-shortread-substitution-fraction with a default of 0.1.
Alignments with a soft-clipping fraction more than a threshold. The probe regions are excluded from the calculation and the threshold is set by --amplicon-max-softclip-fraction with a default of 0.1.
Off-target alignments with a soft-clipping fraction more than a threshold. The probe regions are included in the calculation and the threshold is set by --amplicon-max-offtarget-softclip-fraction with a default of 0.2.
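The substitution-fraction rules can be written out as a short sketch. It assumes the fraction is mismatches divided by matches plus mismatches, which is the reading that the 0.04 default implies, and it leaves the counting of matches and mismatches (with probe regions excluded or included as described) to the caller. Function names are hypothetical.

```python
def substitution_fraction(matches, mismatches):
    """Mismatches as a fraction of all aligned (match + mismatch) bases."""
    total = matches + mismatches
    return mismatches / total if total else 0.0


def exceeds_substitution_threshold(matches, mismatches, max_fraction=0.04):
    """Rule for regular reads; counts exclude probe regions."""
    return substitution_fraction(matches, mismatches) > max_fraction


def fails_short_read_rule(read_bases_after_clip, matches, mismatches,
                          length_threshold=30, max_fraction=0.1):
    """Rule for short reads; counts include probes, with soft-clipped bases counted as mismatches."""
    return (read_bases_after_clip < length_threshold
            and substitution_fraction(matches, mismatches) > max_fraction)
```

With the defaults, a read with 96 matches and 4 mismatches sits exactly at 0.04 and is kept, while 95 matches and 5 mismatches (0.05) is unaligned.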
The RNA Amplicon Pipeline uses the DRAGEN RNA Pipeline. Amplicon-specific parameters are set for fusion calling, including a fusion scoring model trained on RNA amplicon data. Small variant calling is not supported in RNA amplicon mode.
The DRAGEN Amplicon Pipeline requires an amplicon BED file and all input files required by the DRAGEN DNA or RNA pipeline. Each row in an amplicon BED file describes an amplicon target. The fields are as follows.
| Field | Description |
| --- | --- |
| chrom | The name of the chromosome. |
| chromStart | The 0-based inclusive start position of the target, excluding the primer. |
| chromEnd | The 0-based exclusive end position of the target, excluding the primer. |
| name | The name of the amplicon target. |
| gene | [Optional] The gene ID. |
| targetType | [Optional] The target type. |
In copy number variant calling of DNA amplicon mode, the default segmentation mode is bed, which can be changed via --cnv-segmentation-mode. The CNV segmentation BED is gene-level and auto-generated based on the gene ID column in the amplicon BED file. In RNA amplicon mode, targetType is used to identify fusion targets, whose targetType is Fusion. The gene IDs for fusion targets are collected and written to an output file. The default value of --rna-gf-enriched-genes is then set to this file containing fusion gene IDs. A candidate fusion is required to have both partner genes in the gene list. Base-level and read-level coverage is calculated for each region in the amplicon BED file. It is recommended that the fusion targets are commented to avoid competition with gene expression targets.
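To make the role of the gene and targetType columns concrete, the sketch below parses amplicon BED rows and collects the gene IDs of targets whose targetType is Fusion, mirroring the gene list described above. It is an illustration only: the parser, the skipping of lines that start with `#`, and the example rows (coordinates, names, and the "Expression" type) are hypothetical.

```python
def parse_amplicon_bed(lines):
    """Parse tab-separated amplicon BED rows into dicts; gene and targetType are optional."""
    targets = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: blank and comment lines are ignored
        fields = line.split("\t")
        targets.append({
            "chrom": fields[0],
            "chromStart": int(fields[1]),  # 0-based inclusive
            "chromEnd": int(fields[2]),    # 0-based exclusive
            "name": fields[3],
            "gene": fields[4] if len(fields) > 4 else None,
            "targetType": fields[5] if len(fields) > 5 else None,
        })
    return targets


def fusion_gene_ids(targets):
    """Sorted, de-duplicated gene IDs of targets marked as Fusion."""
    return sorted({t["gene"] for t in targets if t["targetType"] == "Fusion" and t["gene"]})


example = [
    "chr1\t1000\t1150\tampA\tGENE1\tFusion",      # made-up rows
    "chr2\t5000\t5120\tampB\tGENE2\tExpression",
    "chr3\t700\t820\tampC\tGENE3\tFusion",
    "chr4\t10\t90\tampD",
]
print(fusion_gene_ids(parse_amplicon_bed(example)))
```

Because chromStart is inclusive and chromEnd is exclusive, the target length is simply chromEnd minus chromStart.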
To use the DNA amplicon pipeline, set --enable-dna-amplicon to true. Use --amplicon-target-bed to specify the path to your amplicon BED file.
To enable small variant calling, set --enable-variant-calling to true. To enable copy number variant calling, set --enable-cnv to true. GC bias correction when generating target counts is enabled by default. The target counts for the normal samples should be generated with command line options identical to those used for the case sample under analysis.
To enable structural variant calling, set --enable-sv to true.
The target small variant calling BED input is set to the amplicon BED file by default and can be changed via --vc-target-bed. The CNV segmentation BED is auto-generated based on the gene ID column in the amplicon BED file and can be changed via --cnv-segmentation-bed. See CNV Targeted Segmentation (Segment BED) for more information. The amplicon pipeline can be run in either germline or somatic mode. For the somatic mode, specify a tumor-only or tumor-normal input. For more details about somatic mode, see Somatic Mode and Somatic Mode Options. For more information on the multicaller (germline & somatic) workflows, see Multicaller Workflows. If calling somatic small variants, we also recommend setting --vc-use-somatic-hotspots to false.
By default the maximum amplicon primer length is set to 50. You can specify a different value using --amplicon-primer-length
. The parameter affects whether an alignment is assigned to an amplicon target. If an alignment starts inside the primer region of the amplicon target, the alignment is assigned to the amplicon. For a properly paired alignment, both the alignment and the mate must come from the same amplicon target. However, in order to detect deletion events that are close to the target boundaries, we now require only one of the reads to start in the primer region (--amplicon-allow-partial-target=true
by default). For candidate deletions, we rewrite the CIGAR to make them candidates for columnwise detection (--amplicon-enable-deletion-realigner=true
by default).
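The assignment rule above can be sketched as follows. This is a minimal illustration, not DRAGEN's implementation: the `Amplicon` class, the function names, the use of the read's 5' position, and the definition of the primer region as the first or last `primer_length` bases of the target are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Amplicon:
    name: str
    start: int  # first base of the target (0-based); hypothetical convention
    end: int    # one past the last base of the target


def starts_in_primer_region(five_prime_pos: int, is_reverse: bool,
                            amplicon: Amplicon, primer_length: int = 50) -> bool:
    """True if the read's 5' end lies in the primer region at the matching target end.

    Assumption: forward reads are checked against the left primer region and
    reverse reads against the right primer region.
    """
    if is_reverse:
        return amplicon.end - primer_length <= five_prime_pos < amplicon.end
    return amplicon.start <= five_prime_pos < amplicon.start + primer_length


def pair_assigned(read: tuple[int, bool], mate: tuple[int, bool], amplicon: Amplicon,
                  primer_length: int = 50, allow_partial_target: bool = True) -> bool:
    """Decide whether a read pair is assigned to the amplicon.

    Each read is a (five_prime_pos, is_reverse) tuple. With allow_partial_target,
    one read starting in a primer region is enough; otherwise both must.
    """
    hits = [starts_in_primer_region(pos, rev, amplicon, primer_length)
            for pos, rev in (read, mate)]
    return any(hits) if allow_partial_target else all(hits)


amplicon = Amplicon("example_target", 1000, 1200)
# The mate starts in the middle of the target, as it might near a boundary deletion.
print(pair_assigned((1010, False), (1100, True), amplicon))
print(pair_assigned((1010, False), (1100, True), amplicon, allow_partial_target=False))
```

The `allow_partial_target` argument mirrors the intent of --amplicon-allow-partial-target: relaxing the requirement from both reads to one read.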
The following is an example command line to run the DRAGEN DNA Amplicon Pipeline with copy number, structural variant and germline small variant calling.
To use the RNA amplicon pipeline, set --enable-rna-amplicon
to true
. Use --amplicon-target-bed
to specify the path to your amplicon BED file.
We do not recommend enabling RNA quantification to produce the .sf
quantification output files as a panel-specific GTF file is usually not used. The .target_bed_read_cov_report.bed
read-level coverage output file should be used instead. This file is produced automatically when map/align output is enabled.
To enable RNA gene fusion calling, set --enable-rna-gene-fusion
to true
. Fusion calling parameters are automatically set in RNA amplicon mode but can be overridden in the command line. If fusion targets are not listed in the amplicon BED file, users can explicitly set --rna-gf-enriched-genes
to a file containing fusion gene IDs or symbols.
The following is an example command line to run the DRAGEN RNA Amplicon Pipeline with gene fusion calling.
The DRAGEN Single-Cell ATAC (scATAC) Pipeline can process single-cell ATAC-Seq data sets from reads to a cell-by-peak read count matrix. The pipeline includes the following functions:
ATAC-Seq alignment.
Cell-barcode error correction for the barcode read.
Chromatin accessibility peak calling.
Fragment counting per cell and peak to measure chromatin accessibility.
Sparse matrix output and QC metrics.
The functionality and options related to alignment and gene annotation are identical to DNA pipelines. For information, see DRAGEN DNA Pipeline.
Use a standard DRAGEN DNA reference genome or hashtable for the scATAC Pipeline.
The DRAGEN scATAC Pipeline requires both the genomic sequence and the barcode sequence for each fragment (read) as input. The genomic sequence is aligned to the reference genome to determine where the fragment originates, and the single-cell barcode sequence is used to identify the unique cell. When starting from FASTQ, you can either include the UMI in the read name or provide separate cell-barcode FASTQ files.
Provide the genomic reads as paired-end FASTQ files with the barcode sequence in the eighth field of the read-name line. Separate sequences using a colon. The following example uses read2 (sample.R2.fastq.gz
) as the genomic read.
In the example, the GAAACTCGTTCAGCGC
sequence is the barcode read and the ACAG...
sequence is the genomic read.
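As an illustration of the read-name convention described above, the following sketch pulls the eighth colon-separated field out of a FASTQ read-name line. The function name and the read name used here are hypothetical; only the barcode sequence is taken from the text.

```python
def barcode_from_read_name(header: str) -> str:
    """Return the eighth colon-separated field of a FASTQ read-name line."""
    # Drop the leading '@' and anything after the first whitespace (the comment part).
    name = header.lstrip("@").split()[0]
    fields = name.split(":")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 colon-separated fields, got {len(fields)}")
    return fields[7]


# Hypothetical read name with the barcode appended as the eighth field.
header = "@SIM:1:FLOWCELL:1:1101:1000:2000:GAAACTCGTTCAGCGC 2:N:0:1"
print(barcode_from_read_name(header))  # GAAACTCGTTCAGCGC
```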
These FASTQ files can be generated by bclConvert and bcl2fastq using the UMI settings to define the single-cell barcode read. If using bclConvert, enter the barcode information using the OverrideCycles
setting. For more information, see the BCL Convert Software Guide (document # 1000000094725).
Note: bclConvert refers to the entire single-cell barcode sequence as UMI.
Enter the following command line option to use the generated FASTQ files from bclConvert: dragen -1 <file name 1> -2 <file name 2> --umi-source=qname
The option is also compatible with the --fastq-list
input options and with read input from BAM files.
A single-cell ATAC fastq-list file is a CSV file with the following mandatory columns:
Lane: Sequencing lane
RGID: Read group ID
RGSM: Read group sample
RGLB: Read group library
Read1File: Read 1 FASTQ file
Read2File: Read 2 FASTQ file
UmiFile: Read 3 FASTQ file (FASTQ file with cell-barcodes)
An example is shown below:
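A fastq-list file with these columns can also be built and checked programmatically. The sketch below is only an illustration: the helper names, the column order, and all values in the example row are hypothetical, and only the column names come from the list above.

```python
import csv
import io

# Mandatory columns of a single-cell ATAC fastq-list file, as listed above.
MANDATORY_COLUMNS = ["Lane", "RGID", "RGSM", "RGLB", "Read1File", "Read2File", "UmiFile"]


def build_fastq_list(rows: list[dict]) -> str:
    """Serialize rows to CSV text with the mandatory columns as header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MANDATORY_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def missing_columns(csv_text: str) -> list[str]:
    """Return the mandatory columns absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = set(reader.fieldnames or [])
    return [column for column in MANDATORY_COLUMNS if column not in present]


# Hypothetical sample: one library, one lane, placeholder file names.
text = build_fastq_list([{
    "Lane": "1", "RGID": "rg1", "RGSM": "sample1", "RGLB": "lib1",
    "Read1File": "sample1_R1.fastq.gz",
    "Read2File": "sample1_R2.fastq.gz",
    "UmiFile": "sample1_R3.fastq.gz",
}])
print(text)
print(missing_columns(text))  # []
```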
UMI Fastq Files
An alternative option is to provide the genomic and barcode sequences as three separate FASTQ files. Two files contain only the genomic reads and one contains the corresponding barcode reads in the same order. This is similar to how read pairs are normally handled. If using separate UMI files, the sequencing system run setup and bclConvert are not aware of the UMI and treat it as a normal read sequence by default.
Enter the following command line option to use the separate UMI FASTQ files: dragen -1 <file name 1> -2 <file name 2> --umi-fastq=<file name 3> --umi-source=fastq
To use this method with multiple FASTQ files, follow these steps:
Enter the barcode FASTQ files as read1 in the fastq-list
file, and then enter the genomic read FASTQ files matching the default fastq_list.csv generated by bclConvert as read2 and umifile.
Enter the following command: dragen --fastq-list fastq_list.csv --umi-source=umifile
The scATAC pipeline can process a single biological sample per DRAGEN run. To process multiple single-cell libraries together, split the single sample into multiple single-cell libraries with a unique set of cells in each. DRAGEN keeps the cells (barcodes and UMIs) from each library separate and provides merged outputs across all. Read groups are used to specify the library for each FASTQ file using the RGLB attribute.
To use the scATAC workflow, enter --enable-single-cell-atac=true
. This section includes information on additional scATAC settings.
By default the scATAC workflow assumes that the overall barcode sequence is made up of a single-cell barcode (possibly split into multiple blocks). Enter the following command to identify the location of the single-cell barcode:
--scatac-barcode-position <blockPos>[+<blockPos>+<blockPos>...][(:-)|(:+)]
blockPos
describes the offset of the first and last inclusive base of the block and is formatted as <startPos>_<endPos>
. For example, for a library with a 16 bp cell-barcode, enter: --scatac-barcode-position 0_15
. For a library with the cell-barcode split into three blocks of 9 bp separated by fixed linker sequences and an 8 bp UMI, enter: --scatac-barcode-position=0_8+21_29+43_51
.
By default, the barcode position is assumed to be indicated on the forward strand. To explicitly specify the forward strand, enter: --scatac-barcode-position 0_15:+
or --scatac-barcode-position=0_8+21_29+43_51:+
. Conversely, to specify the reverse strand, enter: --scatac-barcode-position 0_15:-
or --scatac-barcode-position=0_8+21_29+43_51:-
.
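To make the position syntax concrete, the following sketch parses a barcode position string into its inclusive blocks and strand. It is an illustration of the format described above, not DRAGEN code; the `BarcodePosition` type and the function names are hypothetical.

```python
from typing import NamedTuple


class BarcodePosition(NamedTuple):
    blocks: list[tuple[int, int]]  # inclusive (start, end) offsets of each block
    strand: str                    # "+" (default) or "-"


def parse_barcode_position(value: str) -> BarcodePosition:
    """Parse '<start>_<end>[+<start>_<end>...][:+|:-]' into blocks and strand."""
    strand = "+"
    if value.endswith((":+", ":-")):
        value, strand = value[:-2], value[-1]
    blocks = []
    for block in value.split("+"):
        start_text, end_text = block.split("_")
        start, end = int(start_text), int(end_text)
        if start < 0 or end < start:
            raise ValueError(f"invalid block: {block}")
        blocks.append((start, end))
    return BarcodePosition(blocks, strand)


def barcode_length(position: BarcodePosition) -> int:
    """Total barcode length; both block ends are inclusive."""
    return sum(end - start + 1 for start, end in position.blocks)


single = parse_barcode_position("0_15")
split = parse_barcode_position("0_8+21_29+43_51:-")
print(single, barcode_length(single))  # one 16 bp block on the forward strand
print(split, barcode_length(split))    # three 9 bp blocks (27 bp) on the reverse strand
```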
You can provide a list of cell barcode sequences to include using the following command:
--scatac-barcode-sequence-list </path/to/barcodeAllowlist.txt>
In the case where the --scatac-barcode-position
parameter is not split into multiple blocks (see Barcode Position section), the file must contain one possible cell barcode sequence per line. When the barcode position is split into multiple blocks, the file must instead contain multiple sections, one for each block, each listing the possible cell barcode block sequences for the corresponding block. Each section should start with a line with prefix #-
, e.g.:
The input file can be compressed with gzip (*.txt.gz
).
During cell-barcode error correction any observed barcodes that do not match a sequence specified in the file are considered errors. If possible, the barcodes are corrected to a similar allowed sequence. See Barcode Error Correction for more information. If the barcodes cannot be corrected, they are filtered out.
DRAGEN uses a threshold on the total count of unique reads per cell barcode to determine which barcodes are likely to correspond to single cells in the original sample rather than background noise. The threshold is determined based on the distribution of counts across barcodes and on the expected number of true cells in the sample.
--single-cell-number-cells
--- [Optional] Set the expected number of cells. The default is 3000. Adjust only if the expected number of cells is so far from the default that DRAGEN does not call the correct cell filtering threshold automatically.
--single-cell-threshold
--- Specify the method for determining the count threshold value. The available values are fixed
, ratio
, or inflection
.
If using ratio
, DRAGEN estimates the count threshold as max(T_e, T_m)
. T_m
is a threshold based on a fraction of the counts seen in the most abundant cell-barcodes. T_e
is a threshold based on a fraction of the least abundant expected cell.
If using inflection
, DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts.
If using fixed
, the count threshold is set to force the expected number of cells (--single-cell-number-cells
option), rather than estimating it from the data. The exact number of passing cells might be slightly larger than the number of requested single-cells because several cells in the tail of the count distribution can have the same count.
For example, to set a fixed number of cells rather than use the automatically determined threshold, use the following command:
--single-cell-threshold=fixed --single-cell-number-cells=X
The command forces DRAGEN to select the top X cells, plus any extra cells with the same count as the last selected cell.
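The effect of ties on the fixed threshold can be seen in a small sketch. This is an illustration under stated assumptions, not DRAGEN's implementation: the function name and the example counts are hypothetical, and the threshold is taken to be the count of the X-th most abundant barcode.

```python
def select_cells_fixed(counts: dict[str, int], number_cells: int) -> tuple[int, list[str]]:
    """Return (count threshold, passing barcodes) for a fixed expected number of cells."""
    if number_cells < 1:
        raise ValueError("number_cells must be at least 1")
    if not counts:
        return 0, []
    ranked = sorted(counts.values(), reverse=True)
    # Count of the last selected cell, that is, the X-th most abundant barcode.
    threshold = ranked[min(number_cells, len(ranked)) - 1]
    # Barcodes tied with the last selected cell also pass.
    passing = [barcode for barcode, count in counts.items() if count >= threshold]
    return threshold, passing


# Hypothetical per-barcode counts; two barcodes are tied at 750.
counts = {"AAAC": 900, "AAAG": 750, "AACT": 750, "AAGT": 12}
print(select_cells_fixed(counts, 2))  # (750, ['AAAC', 'AAAG', 'AACT'])
```

Requesting two cells returns three here because the second and third barcodes share the same count.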
The following are additional options you can use to configure the Single-Cell ATAC Pipeline settings.
--qc-enable-depth-metrics
--- Set to false
to disable depth metrics for faster run time. The default is true
.
--scatac-write-fragments
--- Set to true
to write counted fragments to the disk (in both tsv
(<prefix>.scATAC.fragments.tsv
) and BigWig
(<prefix>.scATAC.fragments.bigwig
) format). The default is false
.
The following is an example command line to run the DRAGEN Single Cell ATAC Pipeline.
Single-cell ATAC outputs are found in the standard DRAGEN output location using the prefix <sample>
. in case of a single library and the prefix <sample>.<libId>
. in case of multiple libraries. All single-cell ATAC output files contain the word scATAC
in their names.
The following three files provide information about per-cell chromatin accessibility level in matrix market (*.mtx
) format:
<prefix>.scATAC.matrix.mtx.gz: Count of unique fragments for each cell/peak pair in sparse matrix format.
<prefix>.scATAC.barcodes.tsv.gz: Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scATAC.peaks.tsv.gz: Peak name and ID for each peak in the matrix.
The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.scATAC.barcodeSummary.tsv
indicated by values PASS
and FAIL
.
The output includes filtered matrix files which only include the per-cell chromatin accessibility level for the PASS
cells in matrix market (*.mtx
) format. The scATAC.peaks.tsv.gz
file is common for the unfiltered and filtered matrices:
<prefix>.scATAC.filtered.matrix.mtx.gz: Count of unique fragments for each filtered cell/peak pair in sparse matrix format.
<prefix>.scATAC.filtered.barcodes.tsv.gz: Cell-barcode sequence for each filtered cell from the matrix.
To explore the output matrix in a human-readable format, you can load it into a "dense" dataframe in python (similar approaches work in other programming languages). When possible, however, a "sparse" representation of the matrix is preferable, because "dense" matrices use significantly more memory and disk space. Several tools work efficiently with "sparse" representations of single-cell matrices (e.g., scanpy in python).
The matrix can be converted into a "dense" representation through two python modules: scanpy
and pandas
. This has been tested with python 3.10.0, scanpy 1.9.3, pandas 1.5.3.
First, it is necessary to install the required libraries:
Within python, the matrix can be loaded in "dense" representation using the following commands:
The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.
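To show what a dense conversion involves without any third-party dependency, the sketch below reads a gzip-compressed Matrix Market coordinate file using only the Python standard library. It is an alternative illustration rather than the scanpy and pandas route mentioned above. The function name and the tiny example matrix are hypothetical, and integer counts in a `general` coordinate file are assumed.

```python
import gzip
import os
import tempfile


def read_mtx_dense(path: str) -> list[list[int]]:
    """Read a gzip-compressed Matrix Market coordinate file into a dense list of rows."""
    with gzip.open(path, "rt") as handle:
        stripped = (line.strip() for line in handle)
        # Skip the '%%MatrixMarket' banner, '%' comment lines, and blank lines.
        data_lines = (line for line in stripped if line and not line.startswith("%"))
        n_rows, n_cols, _n_entries = map(int, next(data_lines).split())
        dense = [[0] * n_cols for _ in range(n_rows)]
        for line in data_lines:
            row, col, value = line.split()
            # Matrix Market indices are 1-based.
            dense[int(row) - 1][int(col) - 1] = int(value)
    return dense


# Tiny hypothetical 2 x 3 matrix, written only to demonstrate the reader.
example = "%%MatrixMarket matrix coordinate integer general\n2 3 2\n1 1 5\n2 3 7\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.matrix.mtx.gz")
    with gzip.open(path, "wt") as handle:
        handle.write(example)
    dense = read_mtx_dense(path)
print(dense)  # [[5, 0, 0], [0, 0, 7]]
```

A dense list of this kind grows with rows times columns regardless of how many entries are non-zero, which is why sparse representations are preferred for full-size matrices.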
Alignments of the genomic reads are sorted by coordinate and output as a BAM file. Each alignment is annotated with an XB
tag containing the cell-barcode. The alignments use the original sequences without any errors corrected. Fragments that did not have an associated barcode read, for example fragments trimmed on the input data, do not have an XB
tag.
The <prefix>.scATAC_metrics.csv
file contains per sample scATAC metrics.
Invalid barcode read: Overall barcode sequence failed basic checks. For example, the barcode read was missing or too short.
Error free cell-barcode: Reads with cell-barcode sequences that were not altered during error correction. For example, if the read was an exact match to the allow list.
Error corrected cell-barcode: Reads with cell-barcode sequences successfully corrected to a valid sequence.
Filtered cell-barcode: Reads with cell-barcode sequences that could not be corrected to a valid sequence. For example, the sequence does not match any allow list entry with at most one mismatch.
Fragments passing filters: Non-chimeric non-mitochondrial fragments that align to primary contigs with a high mapping quality (greater than 30 by default).
Non-primary contig fragments: Fragments that align to non-primary contigs (any contig other than the autosomes, X, and Y).
Chimeric fragments: Fragments with the two reads aligning to different contigs.
Mitochondrial fragments: Fragments aligning to the mitochondrial contigs.
Low mapping quality fragments: Fragments in which the reads align with a mapping quality below the configured threshold (30 by default).
Improperly mapped fragments: The two reads in the fragment are not mapped as a proper pair (SAM flag "read mapped in proper pair" is set to 0).
Fragment threshold for passing cells: Number of fragments required for a cell-barcode to pass filtering.
Passing cells: Number of cell-barcodes that passed the filters.
Fraction peak fragments in passing cells: Percentage of counted fragments intersecting peaks assigned to cells that passed the filters.
Fraction fragments in passing cells: Percentage of all counted fragments assigned to cells that passed the filters.
Median fragments per cells: Median of the total counted fragments per cell, across cells that passed the filters.
Median peaks per cells: Median number of peaks with at least one fragment per cell, across cells that passed the filters.
Total peaks detected: Peaks with at least one fragment in at least one cell that passed the filters.
The <prefix>.scATAC.barcodeSummary.tsv
contains summary statistics for each unique cell-barcode per cell after error correction.
ID: Unique numeric ID for the cell-barcode.
Barcode: The cell-barcode sequence.
TotalFragments: Total fragments with the cell-barcode sequence.
UniqueFragments: Unique fragments counted towards a peak.
NonPrimaryContigFragments: Unique non-primary contig fragments.
ChimericFragments: Unique chimeric fragments.
LowMapqFragments: Unique low mapping quality fragments.
MitochondrialFragments: Unique fragments mapped to mitochondrial genome.
Peaks: Unique peaks detected.
Filter: The following are the available filter values:
PASS
: Cell-barcode passes the filter.
LOW
: Fragment count is below threshold.
Cell-barcode sequences from the input reads are error corrected based on the frequency with which each one is seen and an optional allow list of expected cell-barcode sequences. A cell-barcode sequence is considered a neighbor of another cell-barcode if there is at most one mismatch. The sequence error correction scheme is similar to the directional algorithm described in (Smith, Heger and Sudbery, 2020). When corrected, all reads with the cell-barcode are assigned instead to the neighboring cell-barcode. A cell-barcode sequence is corrected to its neighbor in the following circumstances:
The neighboring cell-barcode is at least two times more frequent across all input reads.
The neighboring cell-barcode is on the cell-barcode allow list, but the original cell-barcode is not.
To avoid overcounting cell-barcodes based on sequence errors, cell-barcode error correction is performed among all reads with the same cell-barcode mapping to the same peak region. Cell-barcode sequences that are likely errors of another cell-barcode are not counted.
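The two correction rules can be sketched as follows. This is a simplified illustration, not DRAGEN's implementation: the function names are hypothetical, corrections are applied in a single step without following chains, the most frequent qualifying neighbor is chosen (an assumption), and all barcodes are compared pairwise, which is only practical for small inputs.

```python
from collections import Counter


def is_neighbor(a: str, b: str) -> bool:
    """Two distinct barcodes of equal length are neighbors if they differ at one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def correct_barcodes(counts: Counter, allow_list: frozenset = frozenset()) -> dict[str, str]:
    """Map each observed barcode to itself or to the neighbor it is corrected to."""
    candidates_pool = set(counts) | set(allow_list)
    corrected = {}
    for barcode, count in counts.items():
        candidates = [
            other for other in candidates_pool
            if is_neighbor(barcode, other) and (
                counts.get(other, 0) >= 2 * count                    # rule 1: at least 2x more frequent
                or (other in allow_list and barcode not in allow_list)  # rule 2: allow list membership
            )
        ]
        # Assumption: prefer the most frequent candidate, ties broken alphabetically.
        corrected[barcode] = max(candidates, key=lambda b: (counts.get(b, 0), b), default=barcode)
    return corrected


# Hypothetical read counts per observed barcode.
observed = Counter({"AAAA": 10, "AAAT": 2, "CCCC": 5, "CCCG": 4})
print(correct_barcodes(observed))
print(correct_barcodes(observed, frozenset({"CCCG"})))
```

Without an allow list only `AAAT` is corrected (to `AAAA`); with `CCCG` on the allow list, `CCCC` is additionally corrected to `CCCG` even though it is more frequent.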
Ref: Smith, T., Heger, A. and Sudbery, I., 2020. UMI-Tools: Modeling Sequencing Errors In Unique Molecular Identifiers To Improve Quantification Accuracy. [PDF] Cold Spring Harbor Laboratory Press. Available at: <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340976/> [Accessed 1 March 2022].
DRAGEN calls peaks using an algorithm based on MACS, version 3 (Zhang et al., 2008). To customize the behavior of the peak calling algorithm, modify any of the command line parameters specified in the table below.
atac-peak-qvalue: Threshold for q-value to call peaks. Default: 0.05
atac-peak-fold-change: Fold change threshold (relative to the background) to call peaks. Default: 1.0
atac-peak-min-length: Minimum length of a peak, bp. Default: NA*
atac-peak-max-gap: Maximum pileup gap, bp (if the gap is larger, another peak is initiated). Default: NA**
(*), (**) - default value is computed automatically as the mean fragment length.
Alternatively, if fragment counting needs to be performed on a pre-specified set of peaks, provide a peak BED file using command line parameter atac-peak-bed-file
.
Ref: Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum, C., Myers, R.M., Brown M., Li. W. and Liu, X.S., 2008. Model-based Analysis of ChIP-Seq (MACS). [PDF] Available at: <https://pubmed.ncbi.nlm.nih.gov/18798982/> [Accessed 1 March 2022].
DRAGEN annotates each peak with respect to a gene symbol as promoter, distal, or intergenic depending on the genomic position of both the peak and the gene. The following rules are used to determine the annotation of a peak:
If a peak overlaps with promoter region (-1000bp, +100bp) of any transcription start site (TSS), it is annotated as a promoter peak of the gene.
If a peak is within 200kb of the closest TSS, and if it is not a promoter peak of the gene of the closest TSS, it will be annotated as a distal peak of that gene.
If a peak overlaps the body of a transcript, and it is neither a promoter nor a distal peak of the gene, it will be annotated as a distal peak of that gene with distance set as zero.
If a peak has not been mapped to any gene by this step, it will be annotated as an intergenic peak without a gene symbol assigned.
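The rules above can be sketched as a small function. This is an illustration under several assumptions, not DRAGEN's implementation: the `Transcript` class and function names are hypothetical, the promoter window is oriented by strand (1000 bp upstream and 100 bp downstream of the TSS), the reported distance is the gap between the peak and the TSS, and coordinates are treated as inclusive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Transcript:
    gene: str
    chrom: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate, inclusive
    strand: str  # "+" or "-"

    @property
    def tss(self) -> int:
        return self.start if self.strand == "+" else self.end


def gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance between two inclusive intervals; 0 if they overlap."""
    return max(0, b_start - a_end, a_start - b_end)


def annotate_peak(chrom: str, start: int, end: int,
                  transcripts: list[Transcript]) -> list[tuple[Optional[str], Optional[int], str]]:
    """Return (gene, distance, annotation) rows for one peak."""
    nearby = [t for t in transcripts if t.chrom == chrom]
    found: dict[str, tuple[int, str]] = {}
    # Rule 1: the peak overlaps the promoter window of any TSS.
    for t in nearby:
        upstream, downstream = (1000, 100) if t.strand == "+" else (100, 1000)
        if gap(start, end, t.tss - upstream, t.tss + downstream) == 0:
            found[t.gene] = (gap(start, end, t.tss, t.tss), "promoter")
    # Rule 2: the peak is within 200 kb of the closest TSS and is not a promoter peak of that gene.
    if nearby:
        closest = min(nearby, key=lambda t: gap(start, end, t.tss, t.tss))
        distance = gap(start, end, closest.tss, closest.tss)
        if distance <= 200_000 and closest.gene not in found:
            found[closest.gene] = (distance, "distal")
    # Rule 3: the peak overlaps a transcript body of a gene not annotated so far.
    for t in nearby:
        if t.gene not in found and gap(start, end, t.start, t.end) == 0:
            found[t.gene] = (0, "distal")
    # Rule 4: nothing matched.
    if not found:
        return [(None, None, "intergenic")]
    return [(gene, distance, label) for gene, (distance, label) in found.items()]


# Hypothetical transcripts and peaks.
transcripts = [
    Transcript("G1", "chr1", 10_000, 20_000, "+"),
    Transcript("G2", "chr1", 500_000, 510_000, "-"),
]
print(annotate_peak("chr1", 9_500, 9_800, transcripts))      # promoter peak of G1
print(annotate_peak("chr1", 50_000, 50_500, transcripts))    # distal peak of G1
print(annotate_peak("chr1", 300_000, 300_500, transcripts))  # intergenic
```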
To enable peak annotation in DRAGEN scATAC-Seq workflow, specify a gene annotation file (GTF) using the option -a
. Peak annotations are written to a file with name <prefix>.scATAC.peaks.tsv
and each annotation is represented as a row with the following 6 columns:
Chromosome number
Start position
End position
Gene symbol
Distance from peak to gene
Peak annotation (i.e., promoter, distal, or intergenic).
In this analysis, peaks are matched to a set of known transcription factor (TF) binding sites, and for each cell barcode the fragment counts are grouped by the transcription factors to which their peaks are assigned. This results in a more compact representation of chromatin accessibility patterns. To enable TF motif analysis, specify a database of position-weight matrices (PWM) corresponding to transcription factor motifs in JASPAR format:
--atac-jaspar-database=JASPAR2022_CORE_non-redundant_pfms_jaspar.txt
DRAGEN will produce two files <prefix>.scATAC.tf.matrix.mtx.gz
and <prefix>.scATAC.tf.motifs.tsv.gz
which combined with file <prefix>.tf.barcodes.tsv.gz
represent the cell-by-TF count matrix in matrix market format.
The DRAGEN Single-Cell RNA (scRNA) Pipeline can process multiplexed single-cell RNA-Seq data sets from reads to a cell-by-gene UMI count gene expression matrix. The pipeline is compatible with library designs that have one read in a fragment matched to a transcript and the other containing a cell-barcode and UMI. The pipeline includes the following functions:
RNA-Seq (splice-aware) alignment and matching to annotated genes for the transcript reads.
Cell-barcode and UMI error correction for the barcode read.
UMI counting per cell and gene to measure gene expression.
Sparse matrix output and QC metrics.
Feature counting, such as with cell-surface proteins.
The functionality and options related to alignment and gene annotation are identical to the RNA-Seq pipeline. For information, see DRAGEN RNA Pipeline. Other RNA-Seq modules, such as gene fusion calling or transcript-level gene expression quantification are not supported for Single-Cell RNA.
Use a standard DRAGEN RNA reference genome or hashtable for the scRNA Pipeline. For example, build using --ht-build-rna-hashtable=true
. The pipeline also requires a gene annotation file in GTF format, provided with the --annotation
(-a
) option.
The DRAGEN scRNA Pipeline requires both the transcript sequence and the barcode+UMI sequence for each fragment (read) as input. The transcript sequence is aligned to the genome to determine the expressed gene. The barcode+UMI sequence is split into the single-cell barcode, which identifies the unique cell, and a single-cell UMI for unique molecule quantification. When starting from FASTQ, you can either include the UMI in the read name or provide separate UMI FASTQ files.
Provide the transcriptome reads as a single-end FASTQ file with the Barcode+UMI sequence in the eighth field of the read-name line. Separate sequences using a colon. The following example uses read2 (sample.R2.fastq.gz
) as the transcript read.
In the example, the GAA sequence is the barcode+UMI read and the ACAG sequence is the transcriptome read.
These FASTQ files can be generated by bclConvert and bcl2fastq using the UMI settings to define the single-cell barcode+UMI read. If using bclConvert, enter the barcode/UMI information using the OverrideCycles
setting. For more information, see the BCL Convert Software Guide (document # 1000000094725).
Note: bclConvert refers to the entire single-cell barcode+UMI sequence as UMI.
Enter the following command line option to use the generated FASTQ files from bclConvert:
The option is also compatible with the --fastq-list
input options and with read input from BAM files.
A single-cell RNA fastq-list file is a CSV file with the following mandatory columns:
Lane: Sequencing lane
RGID: Read group ID
RGSM: Read group sample
RGLB: Read group library
Read1File: Read 1 FASTQ file (usually FASTQ file with reads containing cell-barcodes and UMIs)
Read2File: Read 2 FASTQ file (usually transcriptomic reads)
In this case, run DRAGEN with the --umi-source=read1
(or --umi-source=read2
if swapped) command-line option. For example, a fastq-list file with the contents shown below must be used in combination with --umi-source=read1
option:
Alternatively, the FASTQ file with transcriptomic reads can be specified under Read1File
column and the FASTQ file with cell-barcodes and/or UMIs under the UmiFile
column. For example, a fastq-list file with the contents shown below must be used in combination with --umi-source=umifile
option:
UMI Fastq Files
An alternative option is to provide the transcript and barcode+UMI sequences as two separate FASTQ files. One file contains only the transcriptome reads and one contains the corresponding barcode reads in the same order. This is similar to how read pairs are normally handled. If using separate UMI files, the sequencing system run setup and bclConvert are not aware of the UMI and treat it as a normal read sequence by default.
Enter the following command line option to use the separate UMI FASTQ files:
To use this method with multiple FASTQ files:
Enter the barcode+UMI FASTQ files as read1 in the fastq-list
file, and then enter the transcriptome read FASTQ files, matching the default fastq_list.csv generated by bclConvert, as read2.
Use the following command:
The scRNA pipeline can process a single biological sample per DRAGEN run. To process multiple single-cell libraries together, split the single sample into multiple single-cell libraries with a unique set of cells in each. DRAGEN keeps the cells (barcodes and UMIs) from each library separate and provides merged outputs across all. Read groups are used to specify the library for each FASTQ file using the RGLB attribute.
To use the scRNA workflow, use the options --enable-rna=true --enable-single-cell-rna=true
. This section includes information on additional scRNA settings.
By default the scRNA workflow assumes that the overall barcode/UMI sequence is made up of a single-cell barcode (possibly split into multiple blocks) and a single UMI. Enter the following command to define the location of the single-cell barcode and single-cell UMI in the barcode read:
blockPos
describes the offset of the first and last inclusive base of the block and is formatted as <startPos>_<endPos>
. For example, for a library with a 16 bp cell-barcode followed by a 10 bp UMI, use: --scrna-barcode-position 0_15 --scrna-umi-position 16_25
. For a library with the cell-barcode split into three blocks of 9 bp separated by fixed linker sequences and an 8 bp UMI, use: --scrna-barcode-position=0_8+21_29+43_51 --scrna-umi-position=52_59
.
By default, the barcode position is assumed to be indicated on the forward strand. To explicitly specify the forward strand, use: --scrna-barcode-position 0_15:+
or --scrna-barcode-position=0_8+21_29+43_51:+
. To explicitly specify the reverse strand, use: --scrna-barcode-position 0_15:-
or --scrna-barcode-position=0_8+21_29+43_51:-
.
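To illustrate how the barcode and UMI positions carve up a barcode read, the sketch below extracts both from a read sequence. It is an illustration only: the function names and read sequences are hypothetical, and only forward-strand positions without a strand suffix are handled.

```python
def extract_blocks(read: str, position: str) -> str:
    """Concatenate the inclusive <start>_<end> blocks of `position` taken from `read`."""
    pieces = []
    for block in position.split("+"):
        start, end = (int(part) for part in block.split("_"))
        if end >= len(read):
            raise ValueError(f"barcode read is too short for block {block}")
        pieces.append(read[start:end + 1])  # both ends are inclusive
    return "".join(pieces)


def split_barcode_read(read: str, barcode_position: str, umi_position: str) -> tuple[str, str]:
    """Return (cell barcode, UMI) for one barcode+UMI read."""
    return extract_blocks(read, barcode_position), extract_blocks(read, umi_position)


# 16 bp cell-barcode followed by a 10 bp UMI (hypothetical sequences).
simple_read = "ACGTACGTACGTACGT" + "TTTTTTTTTT"
print(split_barcode_read(simple_read, "0_15", "16_25"))

# Three 9 bp barcode blocks separated by linkers (shown as N), then an 8 bp UMI.
split_read = "A" * 9 + "N" * 12 + "C" * 9 + "N" * 13 + "G" * 9 + "T" * 8
print(split_barcode_read(split_read, "0_8+21_29+43_51", "52_59"))
```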
UMI position can also be specified for feature reads, which are reads with a sequence tag specific to a feature (e.g., cell-surface protein or antibody). When the feature-specific UMIs are located on Read 2, you can use --scrna-feature-barcode-r2umi=0_11
to specify a 12 bp feature UMI at the beginning of each feature read.
You can provide a list of cell barcode sequences to include using the following command:
In the case where the --scrna-barcode-position
parameter is not split into multiple blocks (see Barcode Position section), the file must contain one possible cell barcode sequence per line. When the barcode position is split into multiple blocks, the file must instead contain multiple sections, one for each block, each listing the possible cell barcode block sequences for the corresponding block. Each section should start with a line with prefix #-
, e.g.:
The input file can also be provided compressed with gzip (*.txt.gz
).
During cell-barcode error correction any observed barcodes that do not match a sequence specified in the file are considered errors. If possible, the barcodes are corrected to a similar allowed sequence. See Barcode Error Correction for more information. If the barcodes cannot be corrected, they are filtered out.
DRAGEN uses a threshold on the total count of unique UMIs (or reads) per cell barcode to determine which barcodes are likely to correspond to single cells in the original sample rather than background noise. The threshold is determined based on the distribution of counts across barcodes and on the expected number of true cells in the sample.
--single-cell-number-cells
--- [Optional] Set the expected number of cells. The default is 3000. Adjust only if the expected number of cells is so far from the default that DRAGEN does not call the correct cell filtering threshold automatically.
--single-cell-threshold
--- Specify the method for determining the count threshold value. The available values are fixed
, ratio
, or inflection
.
If using ratio
, DRAGEN estimates the count threshold as max(T_e, T_m)
. T_m
is a threshold based on a fraction of the counts seen in the most abundant cell-barcodes. T_e
is a threshold based on a fraction of the least abundant expected cell.
If using inflection
, DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts.
If using fixed
, the count threshold is set to force the expected number of cells (--single-cell-number-cells
option), rather than estimating it from the data. The exact number of passing cells might be slightly larger than the number of requested single-cells because several cells in the tail of the count distribution can have the same count.
--single-cell-threshold-filterby
--- [Optional] Set the count distribution to consider for cell filtering. Can be either "umi" (default) or "read".
To set a specific, fixed number of cells, rather than use the automatically determined threshold, use the following command:
The command forces DRAGEN to select the top X cells, plus any extra cells with the same count as the last selected cell.
The following are additional options you can use to configure the Single-Cell RNA Pipeline settings.
--rna-library-type
--- Set the orientation of transcript reads relative to the genomes. Enter SF
for forward, SR
for reverse, or U
for unstranded. The default is SF
.
--scrna-count-introns
--- Include intronic reads in gene expression estimation. The default is false
.
--qc-enable-depth-metrics
--- Set to false
to disable depth metrics for faster run time. The default is true
.
--bypass-anchor-mapping
--- Set to true
to disable RNA anchor (two-pass) mapping for increased performance. The default is false
.
The following is an example command line to run the DRAGEN Single Cell RNA Pipeline.
Single-cell RNA outputs are found in the standard DRAGEN output location using the prefix <sample>.
in case of a single library and the prefix <sample>.<libId>.
in case of multiple libraries. All single-cell RNA output files contain the word scRNA
in their names.
The following three files provide information about per-cell gene expression level in matrix market (*.mtx
) format:
<prefix>.scRNA.matrix.mtx.gz: Count of unique UMIs for each cell/gene pair in sparse matrix format.
<prefix>.scRNA.barcodes.tsv.gz: Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scRNA.genes.tsv.gz: Gene name and ID for each gene in the matrix.
The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.scRNA.barcodeSummary.tsv
indicated by values PASS
and FAIL
.
The output includes filtered matrix files which only include the per-cell gene expression for the PASS
cells in matrix market (*.mtx
) format. The scRNA.genes.tsv.gz
file is common for the unfiltered and filtered matrices:
<prefix>.scRNA.filtered.matrix.mtx.gz: Count of unique UMIs for each filtered cell/gene pair in sparse matrix format.
<prefix>.scRNA.filtered.barcodes.tsv.gz: Cell-barcode sequence for each filtered cell from the matrix.
To explore the output matrix in a human-readable format, you can load it into a "dense" dataframe in python (similar approaches work in other programming languages). When possible, however, a "sparse" representation of the matrix is preferable, because "dense" matrices use significantly more memory and disk space. Several tools work efficiently with "sparse" representations of single-cell matrices (e.g., scanpy
in python).
The matrix can be converted into a "dense" representation through two python modules: scanpy
and pandas
. This has been tested with python 3.10.0, scanpy 1.9.3, pandas 1.5.3.
First, it is necessary to install the required libraries:
Within python, the matrix can be loaded in "dense" representation using the following commands:
The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.
Alignments of the transcript reads are sorted by coordinate and output as a BAM file. Each alignment is annotated with an XB
tag containing the cell-barcode and an RX
tag containing the UMI. The alignments use the original sequences without any errors corrected. Fragments that do not have an associated barcode read, for example fragments trimmed on the input data, do not have XB
and RX
tags.
The <prefix>.scRNA_metrics.csv
file contains per sample scRNA metrics. In scRNA, mapped reads are currently reported by default under R1 metrics, irrespective of whether R1 is the read aligned to the transcriptome or the one containing cell barcode and UMI.
Invalid barcode read: Overall barcode sequence (cell barcode + UMI) failed basic checks. For example, the barcode read was missing or too short.
Error free cell-barcode: Reads with cell-barcode sequences that were not altered during error correction. For example, if the read was an exact match to the allow list.
Error corrected cell-barcode: Reads with cell-barcode sequences successfully corrected to a valid sequence.
Filtered cell-barcode: Reads with cell-barcode sequences that could not be corrected to a valid sequence. For example, the sequence does not match allow list with at most one mismatch.
Unique exon match: Reads with valid cell-barcode and UMI that match a unique gene.
Unique intron match: Reads do not match exons, but introns of exactly one gene. For example, if using the command --scrna-count-introns=true
.
Ambiguous match: Reads match to multiple genes.
Wrong strand: Reads overlap a gene on the opposite strand defined by library type.
Mitochondrial reads: Reads that map to the mitochondrial genome, if there is a matching gene.
No gene match: Reads do not match to any gene. Includes intronic reads unless using --scrna-count-introns=true
.
Filtered multimapper: Reads excluded due to multiple alignment positions in the genome.
Feature reads: Reads matching to features, when using feature counting.
Total counted reads: Reads with valid cell-barcode and UMI, matching a unique gene.
Reads with error-corrected UMI: Counted reads where the UMI was error-corrected to match another similar UMI sequence.
Reads with invalid UMI: Reads that were not counted due to invalid UMI sequence. For example, pure homopolymer reads or reads containing Ns.
Sequencing saturation: Fraction of reads with duplicate UMIs, calculated as 1 - (UMIs / Reads).
Unique cell-barcodes: Overall number of unique cell-barcode sequences in counted reads only.
Total UMIs: Overall number of unique cell-barcode and UMI combinations counted.
UMI threshold for passing cells: Number of UMIs required for a cell-barcode to pass filtering.
Passing cells: Number of cell-barcodes that passed the filters.
Fraction genic reads in cells: Counted reads assigned to cells that passed the filters.
Fraction reads in putative cells: All counted reads assigned to cells that passed the filters.
Median reads per cells: Total counted reads per cell that passed the filters.
Median UMIs per cells: Total counted UMIs per cell that passed the filters.
Median genes per cells: Genes with at least one UMI per cell that passed the filters.
Total genes detected: Genes with at least one UMI in at least one cell that passed the filters.
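The sequencing saturation formula above can be reproduced from two of the other metrics. The sketch below assumes that "UMIs" corresponds to Total UMIs and "Reads" to Total counted reads; the source does not state this mapping explicitly, and the numbers are made up for illustration.

```python
def sequencing_saturation(total_umis, counted_reads):
    """Fraction of counted reads that carry a duplicate UMI: 1 - (UMIs / Reads)."""
    if counted_reads <= 0:
        raise ValueError("counted_reads must be positive")
    return 1.0 - total_umis / counted_reads

# Hypothetical values: 100,000 counted reads collapsing to 40,000 UMIs.
saturation = sequencing_saturation(total_umis=40_000, counted_reads=100_000)
```

A value close to 1 means that most reads are duplicates of molecules that were already observed, so additional sequencing would add few new UMIs.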
The <prefix>.scRNA.barcodeSummary.tsv
contains summary statistics for each unique cell-barcode per cell after error correction.
ID: Unique numeric ID for the cell-barcode. The ID corresponds to the line number of that barcode in the UMI count matrix (*.mtx) output.
Barcode: The cell-barcode sequence.
TotalReads: Total reads with the cell-barcode sequence. This includes error corrected reads.
GeneReads: Reads (primary read alignments) counted towards a gene.
UMIs: Total number of UMIs in counted reads.
Genes: Unique genes detected.
Mitochondrial Reads: Reads mapped to mitochondrial genome.
Filter: The following are the available filter values:
PASS
: Cell-barcode passes the filter.
LOW
: UMI count is below threshold.
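A typical first step with this file is to keep only the barcodes that passed filtering. The sketch below assumes a tab-separated file with a header row whose column names match the list above; the example rows and the function name passing_barcodes are hypothetical.

```python
import csv
import io

def passing_barcodes(handle):
    """Return the cell-barcodes whose Filter column is PASS."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["Barcode"] for row in reader if row["Filter"] == "PASS"]

# Hypothetical excerpt of a <prefix>.scRNA.barcodeSummary.tsv file.
example = io.StringIO(
    "ID\tBarcode\tTotalReads\tGeneReads\tUMIs\tGenes\tFilter\n"
    "1\tAAACCCAAGAAACACT\t1200\t900\t450\t210\tPASS\n"
    "2\tAAACCCAAGAAACCAT\t15\t9\t7\t6\tLOW\n"
)
passing = passing_barcodes(example)
```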
Cell-barcode sequences from the input reads are error corrected based on their frequency, and optionally through a list of expected cell-barcode sequences. A cell-barcode sequence is corrected into another cell-barcode sequence if they differ only by one base (Hamming distance 1) and:
Either the corrected cell-barcode is at least two times more frequent across all input reads
Or the corrected cell-barcode is on the list of expected cell-barcode sequences, but the original cell-barcode is not
When corrected, all the original cell-barcode reads are assigned to the corrected cell-barcode. The sequence error correction scheme is similar to the directional algorithm described in (Smith, Heger and Sudbery, 2017)¹.
¹Smith, T., Heger, A. and Sudbery, I., 2017. UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Research, 27(3), pp.491-499. Available at: <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340976/> [Accessed 15 October 2020].
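The two correction rules can be sketched as follows. This is an illustration of the rules as stated above, not DRAGEN's implementation. It makes several assumptions that the source does not specify: correction is done in a single pass (no chaining), the most frequent candidate wins when several qualify, and "at least two times more frequent" is read as count >= 2 x count. All names are hypothetical.

```python
from collections import Counter

def hamming1_neighbors(seq, alphabet="ACGT"):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, base in enumerate(seq):
        for alt in alphabet:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]

def correct_barcodes(counts, allow_list=None):
    """Map each observed barcode to its corrected barcode and merge read counts."""
    allow = set(allow_list or ())
    mapping = {}
    for barcode, n in counts.items():
        candidates = []
        for other in hamming1_neighbors(barcode):
            m = counts.get(other, 0)
            by_frequency = m >= 2 * n  # rule 1: at least twice as frequent
            by_allow_list = other in allow and barcode not in allow  # rule 2
            if by_frequency or by_allow_list:
                candidates.append((m, other))
        # Assumption: prefer the most frequent candidate; keep the barcode otherwise.
        mapping[barcode] = max(candidates)[1] if candidates else barcode
    corrected = Counter()
    for barcode, n in counts.items():
        corrected[mapping[barcode]] += n
    return mapping, corrected

# Hypothetical read counts per observed barcode.
counts = {"ACGT": 100, "ACGA": 3, "TTTT": 40, "TTTA": 30}
mapping, corrected = correct_barcodes(counts)
mapping_al, corrected_al = correct_barcodes(counts, allow_list=["ACGT", "TTTT"])
```

Without an allow list, only ACGA is merged into ACGT, because TTTT is not twice as frequent as TTTA. With the allow list, TTTA is also merged into TTTT.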
To avoid overcounting UMIs based on sequence errors, UMI error correction is performed among all reads with the same cell-barcode mapping to the same gene. UMI sequences that are likely errors of another UMI are not counted.
DRAGEN implements several strategies for demultiplexing data sets that represent mixtures of cells from different individuals, such as cells pooled in one library prep or microfluidic run. Two of these strategies are genotype-based and genotype-free demultiplexing. In both methods, DRAGEN assigns a sample identity to each cell based on the alleles observed in the reads of that cell (only SNVs are considered). DRAGEN flags any doublets, such as droplets that contain multiple cells from different individuals.
To use genotype-based sample demultiplexing, you must provide a VCF file with genotypes for each sample in the data set. To use genotype-free sample demultiplexing, you must provide a VCF file with a set of external samples preferably coming from a population with the same genetic background. The GT
field represents the sample genotypes.
For information on the cell-hashing demultiplexing method, see Cell-Hashing.
You can use the following command line options for scRNA demultiplexing.
One of the two:
--scrna-demux-sample-vcf
— If using genotype-based sample demultiplexing, specify the VCF file that contains the sample genotypes.
--scrna-demux-reference-vcf
— If using genotype-free sample demultiplexing, specify the VCF file that contains the genotypes of a population with a similar genetic background to the samples you are using.
--scrna-demux-detect-doublets
— Enable the doublet detection in genotype-based sample demultiplexing. The default value is false
.
--scrna-demux-number-samples
— The number of samples you are using. This option is only applicable when using an external VCF reference specified with the --scrna-demux-reference-vcf
option, for genotype-free sample demultiplexing.
The following is an example command line to run the DRAGEN Single Cell RNA Pipeline with genotype-based demultiplexing.
The following is an example command line to run the DRAGEN Single Cell RNA Pipeline with genotype-free demultiplexing.
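The constraints described for these options can be captured in a small helper that assembles only the demultiplexing portion of a command line. This is a sketch: the function name and file names are hypothetical, the "--option value" spelling is one of the forms used elsewhere in this guide, and the remaining DRAGEN options (inputs, reference, output) are deliberately not shown.

```python
def demux_options(sample_vcf=None, reference_vcf=None,
                  number_samples=None, detect_doublets=False):
    """Build the scRNA demultiplexing options as an argument list."""
    # Exactly one of the two VCF options must be given.
    if (sample_vcf is None) == (reference_vcf is None):
        raise ValueError("specify exactly one of sample_vcf or reference_vcf")
    # The sample count only applies to genotype-free demultiplexing.
    if number_samples is not None and reference_vcf is None:
        raise ValueError("number_samples requires reference_vcf")
    args = []
    if sample_vcf is not None:
        args += ["--scrna-demux-sample-vcf", sample_vcf]
    else:
        args += ["--scrna-demux-reference-vcf", reference_vcf]
    if number_samples is not None:
        args += ["--scrna-demux-number-samples", str(number_samples)]
    if detect_doublets:
        args += ["--scrna-demux-detect-doublets", "true"]
    return args

# Hypothetical file names.
genotype_based = demux_options(sample_vcf="samples.vcf.gz", detect_doublets=True)
genotype_free = demux_options(reference_vcf="population.vcf.gz", number_samples=4)
```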
You can find information related to the output of genotype-based scRNA sample demultiplexing in the following three files.
The <prefix>.scRNA.barcodeSummary.tsv
file contains per-cell metrics, including cell barcodes. The following columns contain information on demultiplexing per-cell. See Outputs for more information on <prefix>.scRNA.barcodeSummary.tsv
metrics.
SampleIdentity
The SampleIdentity
column can contain the following values:
sampleX
—The particular cell (barcode) is uniquely assigned to a sample.
AMB(sampleX,sampleY)
—The algorithm cannot determine the sample to assign the barcode to.
MIX(mixing_coef*sampleX+(100-mixing_coef)*sampleY)
—The cell barcode is classified as doublet. For example, MIX(50*sampleX+50*sampleY)
.
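Downstream scripts often need to split these three forms apart. The sketch below parses a SampleIdentity value into a status and its sample names. It assumes that sample names contain no commas, plus signs, or parentheses, and that mixing coefficients are written as plain numbers as in the example above; the function name is hypothetical.

```python
import re

_AMB = re.compile(r"^AMB\(([^,()]+),([^,()]+)\)$")
_MIX = re.compile(
    r"^MIX\((\d+(?:\.\d+)?)\*([^+()]+)\+(\d+(?:\.\d+)?)\*([^+()]+)\)$"
)

def parse_sample_identity(value):
    """Classify a SampleIdentity string as singlet, ambiguous, or doublet."""
    match = _AMB.match(value)
    if match:
        return {"status": "ambiguous", "samples": [match.group(1), match.group(2)]}
    match = _MIX.match(value)
    if match:
        return {
            "status": "doublet",
            "samples": [match.group(2), match.group(4)],
            "weights": [float(match.group(1)), float(match.group(3))],
        }
    # Anything else is treated as a single sample name.
    return {"status": "singlet", "samples": [value]}

singlet = parse_sample_identity("sampleX")
ambiguous = parse_sample_identity("AMB(sampleX,sampleY)")
doublet = parse_sample_identity("MIX(50*sampleX+50*sampleY)")
```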
IdentityQscore
The IdentityQscore column contains the value used to estimate the confidence of the sample identity call. After DRAGEN determines the doublet status of the cell as singlet
, ambiguous
, or doublet
, the identity Q-score is defined as -10 * log10(Probability that the assigned identity is correct, given the second most likely identity and the doublet status). The higher values of identity Q-score correspond to more confident sample identity calls.
The <prefix>.scRNA.demux.tsv
file contains sample demultiplexing statistics that were used to infer sample identity of each cell.
Barcode
The cell barcode associated with the cell.
DemuxSNPCount
The number of SNPs that the reads of the cell barcode intersect.
DemuxReadCount
Number of UMIs of the cell barcode that intersect at least one SNP.
Pure samples
Samples from the VCF file.
BestMixtureIdentity
Mixture sample with the highest log likelihood. Only available if --scrna-demux-detect-doublets=true
.
BestMixtureLogLikelihood
The log likelihood of the best mixture sample. Only available if --scrna-demux-detect-doublets=true
.
The <prefix>.scRNA.demuxSamples_metrics.csv
file contains per-sample metrics, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv
.
Passing cells
The number of cell barcodes that passed.
Fraction genic reads in cells
Counted reads assigned to the cells that passed.
Median reads per cell
Total counted reads per cell that passed the filters.
Median UMIs per cell
Total counted UMIs per cell that passed the filters.
Median genes per cell
Genes with at least one UMI per cell that passed the filters.
In addition to cell demultiplexing based on sample or population genotypes, there is an additional option to calculate read likelihoods for all scRNA reads conditioned on a given list of SNVs. The SNVs are specified by providing a somatic VCF (typically from a DNA WES tumor-normal somatic VC output). Note that the input VCF is "trusted" completely (ignoring variant call quality). The read-likelihoods are combined at the cell level and used to classify cells as tumor or non-tumor.
--scrna-demux-tumor-normal-vcf
— A tumor-normal somatic variant call VCF that has tumor and normal sample columns. It is recommended for the VCF to have more than ~100 confident SNVs but that threshold may vary based on the coverage of the scRNA data. Having too few somatic variants runs the risk of not having enough scRNA reads intersecting the SNVs, leading to most cells being unclassified due to insufficient information.
--scrna-demux-tn-threshold
— The log-likelihood ratio threshold for classifying a cell as tumor or non-tumor. The default value is 2.0, but we recommend testing control samples (e.g., high and low tumor-fraction samples) to find an optimal threshold tailored to your experimental design and sample type.
Example DRAGEN Single Cell RNA command line using tumor/non-tumor demultiplexing:
The output for tumor/non-tumor scRNA cell demultiplexing can be found in the following three files.
The <prefix>.scRNA.barcodeSummary.tsv
file contains per-cell metrics, including cell barcodes. The following columns contain information on demultiplexing per-cell. See Outputs for more information on <prefix>.scRNA.barcodeSummary.tsv
metrics.
SampleIdentity
The SampleIdentity
column can contain the following values:
SNG:TUMOR or SNG:NORMAL
—The particular cell (barcode) is uniquely classified as a tumor or non-tumor cell, respectively.
AMB(NORMAL,TUMOR)
—The algorithm has found variant-supporting reads, but the likelihoods of the cell being tumor or non-tumor are the same, and therefore an ambiguous call is made.
NA
—The cell barcode has no variant-supporting reads, so there is not enough information to make a classification.
IdentityQscore
The IdentityQscore column contains the value used to estimate the confidence of the sample identity call. After DRAGEN determines the identity of the cell, the identity Q-score is defined as -10 * log10(probability of the most likely identity / (probability of the most likely identity + probability of the second most likely identity)). Higher values of identity Q-score correspond to more confident sample identity calls.
The <prefix>.scRNA.demux.tsv
file contains demultiplexing statistics that are used to infer the identity of each cell. Only cells in which variant-supporting reads are found are recorded in this file.
Barcode
The cell barcode associated with the cell.
DemuxSNPCount
The number of variants that the reads of the cell barcode intersect.
DemuxReadCount
The number of UMIs of the cell barcode that intersect at least one variant.
NORMAL
The log-likelihood of the cell being non-tumor.
TUMOR
The log-likelihood of a cell being tumor.
The <prefix>.scRNA.demuxSamples_metrics.csv
file contains per-sample metrics, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv
. Ambiguous calls, i.e. 'AMB(NORMAL,TUMOR)', are excluded from the calculation.
Passing cells
The number of cell barcodes that are PASS.
Fraction genic reads in cells
Fraction of gene-coding reads assigned to PASS cells.
Median reads per cell
Median count of reads per cell that passed the filters.
Median UMIs per cell
Median count of UMIs per cell that passed the filters.
Median genes per cell
Median count of genes with at least one UMI per cell that passed the filters.
DRAGEN implements several strategies for demultiplexing data sets that represent mixtures of cells from different individuals, such as cells pooled in one library prep or microfluidic run. One of these methods is based on sample oligo tags and is referred to as cell-hashing.
To use cell-hashing, you must provide a cell-hashing CSV or FASTA reference file. In CSV format, the feature barcode reference file uses the following header: id,name,read,position,sequence,feature_type
.
id
— Identifier of the feature. For example, ADT_A1018.
name
— Name of the feature. For example, ADT_Hu.HLA.DR.DP.DQ_A1018.
read
— Read 1 (R1) or Read 2 (R2).
position
— Position on the specified read, including starting position and the length of the feature barcode. For example, a position of 0_15 represents a feature barcode that starts at position 0 and has a length of 15.
sequence
— DNA sequence of the feature barcode. For example, CAGCCCGATTAAGGT.
feature_type
— Type of the feature. For example, Antibody Capture.
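The CSV layout described above can be loaded and sanity-checked as follows. This is a minimal sketch: it assumes the read column holds the short form (R1 or R2) and that the sequence length should equal the length encoded in position, which holds for the example values given above but is not stated as a rule in the source. The function name is hypothetical.

```python
import csv
import io

EXPECTED_HEADER = ["id", "name", "read", "position", "sequence", "feature_type"]

def load_feature_reference(handle):
    """Parse a feature/cell-hashing reference CSV and decode the position field."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    features = []
    for row in reader:
        # position is <start>_<length>, for example 0_15.
        start, length = (int(x) for x in row["position"].split("_"))
        if len(row["sequence"]) != length:
            raise ValueError(f"{row['id']}: sequence length does not match position")
        features.append({**row, "start": start, "length": length})
    return features

# One row built from the example values listed above.
example = io.StringIO(
    "id,name,read,position,sequence,feature_type\n"
    "ADT_A1018,ADT_Hu.HLA.DR.DP.DQ_A1018,R2,0_15,CAGCCCGATTAAGGT,Antibody Capture\n"
)
features = load_feature_reference(example)
```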
To enable cell-hashing sample demultiplexing, specify the following command line options.
--scrna-cell-hashing-reference
--- Specify a CSV or FASTA cell-hashing reference file that contains sample-specific oligo-tags.
--scrna-demux-detect-doublets
--- Enable doublet detection in cell-hashing sample demultiplexing. The default value is false
.
--scrna-demux-sample-fastq
--- Output sample-specific FASTQ files. See Sample-Specific FASTQ Output Files for more information.
The <prefix>.scRNA.barcodeSummary.tsv
file contains per-cell metrics, including cell barcodes. The following column in the <prefix>.scRNA.barcodeSummary.tsv
contains cell-hashing per-cell information. For more information on the <prefix>.scRNA.barcodeSummary.tsv
file, see Single Cell RNA Outputs.
SampleIdentity
The SampleIdentity
column can contain the following values:
sampleX
—The particular cell (barcode) is uniquely assigned to a sample.
AMB(sampleX,sampleY)
—The algorithm cannot determine the sample to assign the barcode to.
MIX(mixing_coef*sampleX+(100-mixing_coef)*sampleY)
—The cell barcode is classified as doublet. For example, MIX(50*sampleX+50*sampleY)
.
The <prefix>.scRNA.demux.tsv
file contains sample demultiplexing statistics that were used to infer sample identity of each cell.
Barcode
The cell barcode associated with the cell.
Pure samples
Cell-hashing read count for each sample.
If you have enabled either of the sample demultiplexing algorithms, you can output sample-specific FASTQ files after the sample identities for each cell are available using the command line option --scrna-demux-sample-fastq
.
If gzip
is specified, then the sample-specific output FASTQ files are compressed in gzip format. If fastq
is specified, then the sample-specific FASTQ files are not compressed. The default option is none
, which indicates that no sample-specific FASTQ files are produced.
Feature counting is a technique to profile the expression of proteins (e.g., cell surface proteins or antibodies). Feature reads differ from RNA reads in that their read R2 is not a transcriptomic sequence, but rather an oligo tag corresponding to a particular type of protein.
To enable feature counting, specify the following command-line options:
--scrna-feature-barcode-reference
— Specify a CSV or FASTA feature reference file that contains feature barcodes. The CSV file format is similar to the format used for Cell-Hashing.
--scrna-feature-barcode-groups
— If feature reads are specified as separate FASTQ files, specify a comma-separated list of read groups that correspond to feature FASTQ files.
The output of feature counting is appended to the output expression matrix. The extended expression matrix contains additional rows corresponding to the features.
This recipe is for processing whole exome sequencing data for germline workflows.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.
Configure the INPUT options
Configure the OUTPUT options
Configure MAP/ALIGN depending on if realignment is desired or not
Configure the VARIANT CALLERs based on the application
Configure any additional options
Build up the necessary options for each component separately, so that they can be re-used in the final command line.
We highly recommend using a pangenome reference for human samples (excluding RNA). For more details, refer to Dragen Reference Support.
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
Optional settings per component are listed below. Full option list at this page.
Please include the matched normal sample in the CNV panel of normals.
--cnv-enable-gcbias-correction true
Generating Panel of Normals (PON)
WES CNV requires PON files. Follow the two steps below to generate CNV PON:
Target counts generation (per normal sample): Generate target counts for each individual normal sample as a baseline. Any options used for panel of normals generation (BED file, GC bias correction, etc.) should match the options used when processing the case sample.
Combined counts generation: Individual PON counts can be merged into a single file as a <prefix>.combined.counts.txt.gz
file.
$CNV_NORMALS_LIST
is a single text file with paths to each target counts file generated in step 1 (either .target.counts.gz
or .target.counts.gc-corrected.gz
). The output is a PON file with the suffix .combined.counts.txt.gz
. Use the PON file when running DRAGEN CNV on case samples with the --cnv-combined-counts
option.
For more information, see Panel of Normals.
Note that we do not recommend changing the default QUAL thresholds of 3 for DRAGEN-ML and 10 for DRAGEN without ML. These values differ from each other because DRAGEN-ML improves the calibration of QUAL scores, leading to a change in the scoring range (see QUAL, QD, and GQ Formulation).
enable-hla
Enable HLA typer (this setting by default will only genotype class 1 genes)
hla-enable-class-2
Extend genotyping to HLA class 2 genes
The epigenetic methylation of cytosine bases in DNA can have a dramatic effect on gene expression, and bisulfite sequencing is the most common method for detecting epigenetic methylation patterns at single-base resolution. This technique involves chemically treating DNA with sodium bisulfite, which converts unmethylated cytosine bases to uracil, but does not alter methylated cytosines. Subsequent PCR amplification converts any uracils to thymines.
A bisulfite sequencing library can either be nondirectional or directional. For nondirectional, each double-stranded DNA fragment yields four distinct strands for sequencing, post-amplification, as shown in the following figure:
Bisulfite Watson (BSW), reverse complement of BSW (BSWR),
Bisulfite Crick (BSC), reverse complement of BSC (BSCR)
For directional libraries, the four strand types are generated, but adapters are attached to the DNA fragments such that only the BSW and BSC strands are sequenced (Lister protocol). Less commonly, the BSWR and BSCR strands are selected for sequencing (e.g., PBAT).
BSW and BSC strands:
A, G, T: unchanged
Methylated C remains C
Unmethylated C converted to T
BSWR and BSCR strands:
Bases complementary to original Watson/Crick A, G, T bases remain unchanged
G complementary to original Watson/Crick methylated C remains G
G complementary to original Watson/Crick unmethylated C becomes A
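The conversion rules above can be checked on a toy sequence. The sketch below simulates bisulfite treatment of a Watson strand and then derives the reverse-complement strand; the sequence, the methylated position, and the function names are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def bisulfite_convert(strand, methylated_positions=()):
    """Convert every unmethylated C to T; C at a methylated 0-based position stays C."""
    protected = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(strand)
    )

watson = "ACGTCCGA"                     # hypothetical fragment, C at index 1 is methylated
bsw = bisulfite_convert(watson, {1})    # bisulfite-converted Watson strand
bswr = reverse_complement(bsw)          # strand produced from BSW during amplification
original_rc = reverse_complement(watson)
```

Here bsw is ACGTTTGA: the methylated C is kept and the two unmethylated Cs read as T. Comparing bswr (TCAAACGT) with original_rc (TCGGACGT) shows the complementary pattern: the G opposite the methylated C remains G, and the Gs opposite unmethylated Cs appear as A.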
Therefore, several steps are performed to map methylation, as shown in the following flowchart:
Details on each part of the mapping process can be found below.
DRAGEN methylation mapping works in one single alignment run. During alignment, the mapper considers all possible base and reference conversions for the read and emits the single best alignment to a particular methylation strand if one exists. Any read (pair) that did not have a single best scoring alignment across all methylation strands tested appears in the output BAM with MAPQ 0. The output BAM in single-pass might contain mapped reads that do not have the XM, XR, and XG methylation tags. Methylation data from reads that do not have methylation tags are not tallied into the reports or metrics files.
--enable-methylation-calling
must be set to true to enable single-pass methylation mapping.
A read must meet the following requirements to have methylation tags in the output BAM and to get tallied into the reports and metrics:
The read and its mate (if applicable) are mapped with MAPQ above the value specified using --methylation-mapq-threshold
. The default value is 0.
The read is not part of an improper pair.
The DRAGEN methylation pipeline requires a methylation-specific hash table that combines both C to T and G to A conversions of contigs from the original FASTA. Specifically, each contig appears twice, once with C bases converted to T bases, and once with G bases converted to A bases. When --ht-methylated-combined=true
is set, the DRAGEN hash table builder will create a methylation hash table in a sub-directory of the output directory called methyl_converted
. When running the DRAGEN mapper, the top level directory should be provided (parent of methyl_converted), and DRAGEN will automatically locate the methylation sub-directory and use it as needed. The top level directory can be used for regular mapping.
Due to the base conversions in the methylation hash table, short seeds map poorly. So the default and recommended seed length of the methylation hash table is 27.
The following is an example command line for a methylation hash table.
Different methylation protocols require the generation of two or four alignments per input read, followed by an analysis to choose a best alignment and determine which cytosines are methylated. DRAGEN can automate this process by generating a single output BAM file with Bismark-compatible tags (XR, XG, and XM) that can be used for methylation calling and other downstream workflows.
When the --methylation-protocol
option is set to a valid value other than none, DRAGEN automatically produces the required set of alignment runs. Each alignment run includes the appropriate base conversions on the reads, base conversions on the reference, and constraints on whether reads must be forward-aligned or reverse complement (RC) aligned with the reference. The following options are automatically configured:
--generate-md-tags true
--Aligner.global 1
--Aligner.no-unpaired 1
--Aligner.aln-min-score 0
--Aligner.min-score-coeff -0.2
--Aligner.match-score 0
--Aligner.mismatch-pen 4
--Aligner.gap-open-pen 6
--Aligner.gap-ext-pen 1
--Aligner.supp-aligns 0
--Aligner.sec-aligns 0
--seed-density 1
Because global alignments (end-to-end in the reads) are generated, DRAGEN recommends trimming any artifacts introduced by library prep and adapter sequences.
The following table describes the properties of the alignment runs:
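Protocol | Run | Reference | Read 1 | Read 2 | Orientation constraint
directional | 1 | C->T | C->T | G->A | Forward-only
directional | 2 | G->A | C->T | G->A | RC-only
non-directional, or directional-complement | 1 | C->T | C->T | G->A | Forward-only
non-directional, or directional-complement | 2 | G->A | C->T | G->A | RC-only
non-directional, or directional-complement | 3 | C->T | G->A | C->T | RC-only
non-directional, or directional-complement | 4 | G->A | G->A | C->T | Forward-only
PBAT | 3 | C->T | G->A | C->T | RC-only
PBAT | 4 | G->A | G->A | C->T | Forward-only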
In directional protocols, the library is prepared such that only the BSW and BSC strands are sequenced. Thus, alignment runs are performed with the two combinations of base conversions and orientation constraints best suited for these strands (directional runs 1 and 2 above).
With nondirectional protocols, reads from each of the four strands are equally likely, so alignment runs must be performed with two more combinations of base conversions and orientation constraints (nondirectional runs 3 and 4 above).
In PBAT protocols, the library is prepared so only the BSWR and BSCR strands are sequenced. Only two alignment runs are performed with the combinations of base conversions and orientation constraints best suited for these strands (runs 3 and 4).
The directional-complement protocol can also be used for PBAT or similar libraries where mainly the BSWR and BSCR strands are sequenced. With this protocol, all four aligner runs are performed, but relatively few good alignments are expected from the runs for the BSW and BSC strands, so DRAGEN is automatically tuned to a faster analysis mode for those runs.
The following is an example DRAGEN command line for the directional protocol:
An end-to-end (i.e., FASTQ -> BAM -> cytosine report) run can be performed as follows:
To generate sorted alignment output (in BAM format), set --enable-sort
to true.
To detect duplicate reads, set --enable-duplicate-marking
to true.
[Optional] To remove duplicate reads, set --remove-duplicates
to true.
[Optional] Set --methylation-generate-cytosine-report
and --methylation-generate-mbias-report
to true or false as needed.
By default, DRAGEN methylation performs strand-aware dedup in concordance with Bismark. Strand-aware dedup partitions the mapped reads into four groups, one per methylation strand. Within each group, DRAGEN performs a normal dedup. For paired reads, the strand of the pair is defined as the strand of the first read in the pair.
The following example demonstrates strand-aware dedup for paired-end reads. The example pairs all map to the same position, but the first read in each pair (BAM flag 83 and 99) is mapped to a different methylation strand, as shown by the different values of the XR and XG tags. None of these pairs are marked as duplicates.
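The grouping logic can be sketched in a few lines. This is an illustration of the idea rather than DRAGEN's implementation: it keeps the first pair seen in each group and flags later ones, whereas a real deduplication step chooses which pair to keep by other criteria. The record fields and values are hypothetical, and the strand of a pair is taken from the XR and XG tags of its first read, as described above.

```python
def mark_strand_aware_duplicates(pairs):
    """Return one duplicate flag per pair; pairs on different methylation
    strands never count as duplicates of each other."""
    seen = set()
    flags = []
    for pair in pairs:
        # The (XR, XG) combination of the first read identifies the strand.
        key = (pair["chrom"], pair["pos"], pair["mate_pos"], pair["XR"], pair["XG"])
        flags.append(key in seen)
        seen.add(key)
    return flags

# Four pairs at the same position, one per methylation strand, plus a repeat.
position = {"chrom": "chr1", "pos": 1000, "mate_pos": 1150}
pairs = [
    {**position, "XR": "CT", "XG": "CT"},
    {**position, "XR": "GA", "XG": "CT"},
    {**position, "XR": "CT", "XG": "GA"},
    {**position, "XR": "GA", "XG": "GA"},
    {**position, "XR": "CT", "XG": "CT"},  # same strand as the first pair
]
flags = mark_strand_aware_duplicates(pairs)
```

Only the fifth pair is flagged, because it shares both position and methylation strand with the first.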
DRAGEN supports FASTQ files that contain UMI barcodes during the alignment phase. The principle and requirements are identical to those for DNA UMI. Briefly, during library prep, (methyl-treated) DNA fragments can be barcoded with unique molecular identifiers, so that true signals from the original fragments can be separated from PCR and sequencing errors, which enables more accurate methylation calling. The FASTQ files must carry the UMI barcode in the QNAME field that follows the seventh colon (the eighth colon-delimited field).
For example: @NS500561:434:H5LC2BGXJ:1:11101:10798:1359:CACATGA+ACATTC 1:N:0:TGGTACCTAA+AGTACTCATG
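In the example header above, the UMI follows the seventh colon of the read name. A minimal sketch for extracting it, with a hypothetical function name:

```python
def umi_from_fastq_header(header):
    """Return the UMI stored after the seventh colon of the read name."""
    # The read name is the first whitespace-delimited token, without the leading '@'.
    qname = header.split()[0].lstrip("@")
    fields = qname.split(":")
    if len(fields) < 8:
        raise ValueError("read name has no UMI field")
    return fields[7]

header = ("@NS500561:434:H5LC2BGXJ:1:11101:10798:1359:CACATGA+ACATTC "
          "1:N:0:TGGTACCTAA+AGTACTCATG")
umi = umi_from_fastq_header(header)
```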
To enable UMI, either set --umi-enable true
if you are using random UMI (common), or set --tso500-solid-umi true
if you are using the same non-random UMIs as the TSO500 solid panel. When UMI processing is enabled, read collapsing is performed among reads with the same UMI that map to the same genomic location on the same strand, either the top (OT/CTOT) or the bottom (OB/CTOB) strand.
See DRAGEN DNA Pipeline / Unique Molecular Identifiers for more details.
TET-Assisted Pyridine Borane Sequencing (TAPS) is a new assay which directly converts methylated C to T, whereas typical bisulfite conversion converts unmethylated C to T. This approach preserves genomic complexity and uses less destructive chemicals to enable lower input DNA. To enable analysis of FASTQ data generated through TAPS, set --methylation-TAPS
to true. By default, the option is false. This option is performed only during the alignment step and is not necessary when generating methylation cytosine and M-Bias reports from an existing BAM.
When --enable-methylation-calling
is set to true, DRAGEN analyzes the alignments produced for the configured --methylation-protocol
and generates a single output BAM file that includes methylation-related tags for all mapped reads. As in Bismark, reads without a unique best alignment are excluded from the output BAM. The added tags are as follows.
XR:Z
Read conversion
For the best alignment, which base-conversion was performed on the read: CT or GA.
XG:Z
Reference conversion
For the best alignment, which base-conversion was performed on the reference: CT or GA.
XM:Z
Methylation call
A byte-per-base methylation string.
The XM:Z (methylation call) tag contains a byte that corresponds to each base in the sequence of the read. Each position that does not involve a cytosine contains a period (.). Each position that does involve a cytosine contains a letter. The letter indicates the context (CpG, CHG, CHH, or unknown). The case indicates methylation. Methylated positions use upper-case and unmethylated positions use lower-case. The letters used at cytosine positions are as follows.
Character | Methylated? | Context
. | not cytosine | not cytosine
z | No | CpG
Z | Yes | CpG
x | No | CHG
X | Yes | CHG
h | No | CHH
H | Yes | CHH
u | No | Unknown
U | Yes | Unknown
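A per-read tally of methylation calls follows directly from this encoding. The sketch below counts calls by context and methylation state, using the convention stated above that upper-case means methylated and lower-case means unmethylated (so unmethylated CHG is a lower-case x). The example string and function name are made up.

```python
from collections import Counter

CONTEXT_BY_LETTER = {"z": "CpG", "x": "CHG", "h": "CHH", "u": "Unknown"}

def tally_xm(xm):
    """Count methylation calls in an XM string as {(context, is_methylated): n}."""
    counts = Counter()
    for char in xm:
        if char == ".":
            continue  # not a cytosine position
        context = CONTEXT_BY_LETTER.get(char.lower())
        if context is None:
            raise ValueError(f"unexpected XM character: {char!r}")
        counts[(context, char.isupper())] += 1
    return counts

calls = tally_xm("..Z..x.hH.z")
```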
You can use DRAGEN to generate a genome-wide cytosine methylation report. The command line options to set depend on whether you are running from FASTQ through the aligner or from a prealigned BAM that already contains the methylation tags.
For FASTQ input, set --methylation-generate-cytosine-report=true
For BAM input, set --methylation-reports-only=true
To keep all cytosines from your reference in the CX_report
, even if they are not included in the input sequences, set --methylation-keep-ref-cytosine true
. The default value is false. Setting this option to true increases run time and the CX_report
file size.
To compress the cytosine report, set --methylation-compress-cx-report=true
. The default value is false. DRAGEN outputs a compressed *.CX_report.txt.gz
, instead of a *.CX_report.txt
.
The position and strand of each C in the genome are given in the first three fields of the report. A record with a - in the strand field is used for a G in the reference FASTA. The counts of methylated and unmethylated Cs covering the position are given in the fourth and fifth fields. The C context in the reference (CG, CHG, or CHH) is given in the sixth field. The trinucleotide sequence context is given in the last field (e.g., CCC, CGT, CGA, and so on). The cytosine report only includes records for positions that have one or more spanning alignments. The following is an example cytosine report record:
chr2 24442367 + 18 0 CG CGC
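The seven fields of a record can be unpacked as shown below, with a methylation percentage derived from the two counts. This is a sketch: the field names and the derived percentage are illustrative and not part of the report itself.

```python
def parse_cx_record(line):
    """Split one cytosine report record into named fields."""
    chrom, pos, strand, meth, unmeth, context, trinucleotide = line.split()
    meth, unmeth = int(meth), int(unmeth)
    covered = meth + unmeth
    return {
        "chrom": chrom,
        "pos": int(pos),
        "strand": strand,
        "methylated": meth,
        "unmethylated": unmeth,
        "context": context,
        "trinucleotide": trinucleotide,
        # Derived value; None when no call covers the position.
        "percent_methylated": 100.0 * meth / covered if covered else None,
    }

record = parse_cx_record("chr2 24442367 + 18 0 CG CGC")
```

For the example record, all 18 covering calls are methylated, so the derived percentage is 100.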
To generate an M-bias report, set --methylation-generate-mbias-report
to true. For single-ended data, this report contains three tables, one for each C-context. For paired-end data, it contains six tables. Each table is a series of records, with one record per read base position. For example, the first record for the CHG table contains the counts of methylated Cs (field 2) and unmethylated Cs (field 3) that occur in the first read base position, restricted to those reads in which the first base is aligned to a CHG location in the genome. Each record of a table also includes the percent methylated C bases (field 4) and the sum of methylated and unmethylated C counts (field 5).
The following is an example M-bias record for read base position 10:
10 7335 2356 75.69 9691
For data sets with paired-end reads that overlap, neither the cytosine report nor the M-bias report counts any Cs in the portion of the second read that overlaps the first read. In addition, 1-based coordinates are used for positions in both reports.
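The relationship between the five fields of an M-bias record can be verified with simple arithmetic. The sketch below recomputes fields 4 and 5 from fields 2 and 3 for the example record shown earlier; the function name and the rounding tolerance are assumptions.

```python
def check_mbias_record(line, tolerance=0.01):
    """Return True when fields 4 and 5 are consistent with fields 2 and 3."""
    _position, meth, unmeth, percent, total = line.split()
    meth, unmeth, total = int(meth), int(unmeth), int(total)
    # Field 5 is the sum of methylated and unmethylated counts.
    if meth + unmeth != total:
        return False
    # Field 4 is the percentage of methylated calls.
    recomputed = 100.0 * meth / total if total else 0.0
    return abs(recomputed - float(percent)) <= tolerance

consistent = check_mbias_record("10 7335 2356 75.69 9691")
```

For this record, 7335 + 2356 = 9691 and 7335 / 9691 is approximately 75.69%, so the check passes.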
The quality of each methylation run can be summarized in the following two metric files.
*.mapping_metrics.csv
—Contains mapping-specific metrics that are generated for the alignment phase, including benchmarks like number of total reads, aligned reads, deduped reads, base quality, etc.
*.methyl_metrics.csv
—Contains methylation-specific metrics that are generated for the methylation calling phase, including benchmarks like the total number of cytosines analyzed, count and rate of methylation in each cytosine context, strand of the best alignment, etc.
The Explify Analysis Pipeline offers a dedicated informatics solution with flexible analysis options for the following Illumina Infectious Disease and Microbiology target-capture enrichment panel kits: the Illumina Respiratory Pathogen ID/AMR Enrichment Panel Kit (RPIP), Illumina Urinary Pathogen ID/AMR Enrichment Panel Kit (UPIP), and Illumina Viral Surveillance Panel V2 Kit (VSP V2). The application delivers easy-to-use, powerful secondary analysis of Illumina sequencing data, with workflows for sample QC, viral WGS (whole-genome sequencing), pathogen detection and quantification, and antimicrobial resistance (AMR) marker profiling. It also supports custom reference sequence analysis.
RPIP: Target-capture enrichment of >280 RNA and DNA respiratory pathogens, including SARS-CoV-2, Influenza viruses, Respiratory syncytial virus, Mycobacterium and Legionella species, and >4000 AMR markers.
UPIP: Target-capture enrichment of >170 genitourinary pathogens, including fastidious, slow-growing, and anaerobic uropathogens, sexually transmitted microorganisms, and >4000 bacterial AMR markers.
VSP V2: Target-capture enrichment for whole-genome sequencing (WGS) of 200 RNA and DNA viruses prioritized as high-risk to public health, zoonotic surveillance, and biotech, and >200 viral AMR markers.
Custom: Analyze FASTQ/FASTA read files with a custom reference sequence database.
Note that samples enriched using the Illumina Respiratory Virus Oligo Panel/Respiratory Virus Enrichment Kit (RVOP/RVEK) and Viral Surveillance Panel Kit (VSP) can also be analyzed using the Explify Analysis Pipeline and the VSP V2 database.
Applies to: --explify-sample-list
The sample input list is a column-formatted file with tab separations between the columns (i.e., a .tsv file).
Notes:
The SampleID values must be unique.
BatchID and RunID are to help users track and manage sample analyses. Often the BatchID is used to track libraries that were prepared together, and the RunID is used to track sequencing runs. They can also be left blank.
The ControlFlag value can be POS, NEG, BLANK, or left empty.
POS is used to indicate a positive control sample.
NEG is used to indicate a negative control sample.
BLANK is used to indicate a blank control sample (e.g. buffer only).
If a sample has multiple FASTQ files, they are tab delimited.
Be very careful when editing .tsv files. Some editors replace tabs with spaces without alerting the user.
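The notes above lend themselves to an automated pre-flight check. The sketch below is illustrative only: it assumes a column order of SampleID, BatchID, RunID, ControlFlag, followed by one or more FASTQ paths, with no header row. That layout is an assumption made for this example, and `check_sample_list` is a hypothetical helper, not a DRAGEN tool.

```python
VALID_CONTROL_FLAGS = {"POS", "NEG", "BLANK", ""}


def check_sample_list(lines):
    """Return a list of problems found in sample-list rows.

    Assumed (hypothetical) column order: SampleID, BatchID, RunID,
    ControlFlag, then one or more FASTQ paths; no header row.
    """
    problems = []
    seen_ids = set()
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if "\t" not in line:
            # Typical symptom of an editor converting tabs to spaces.
            problems.append(f"line {number}: no tab characters found")
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            problems.append(f"line {number}: expected at least 5 columns, found {len(fields)}")
            continue
        sample_id, _batch_id, _run_id, control_flag = fields[:4]
        if sample_id in seen_ids:
            problems.append(f"line {number}: duplicate SampleID {sample_id!r}")
        seen_ids.add(sample_id)
        if control_flag not in VALID_CONTROL_FLAGS:
            problems.append(f"line {number}: invalid ControlFlag {control_flag!r}")
    return problems


rows = [
    "S1\tB1\tR1\tPOS\ts1_R1.fastq.gz\ts1_R2.fastq.gz",
    "S1\t\t\t\ts1b.fastq.gz",          # duplicate SampleID, blank optional fields
    "S3 B1 R1 NEG s3.fastq.gz",        # spaces instead of tabs
    "S4\tB1\tR1\tMAYBE\ts4.fastq.gz",  # ControlFlag outside the allowed set
]
for problem in check_sample_list(rows):
    print(problem)
```

Running a check like this before launching an analysis catches the silent tab-to-space conversion mentioned above.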
Applies to: --explify-internal-control, --explify-internal-control-concentration
The user may specify one of the internal controls listed below. If NONE is specified, the internal control concentration is ignored. These are case-sensitive and must be input exactly as they appear:
Allobacillus halotolerans
Armored RNA Quant Internal Process Control
Enterobacteria phage T7 (default)
Escherichia virus MS2
Escherichia virus Qbeta
Escherichia virus T4
Imtechella halotolerans
Phocid alphaherpesvirus 1
Phocine morbillivirus
Truepera radiovictrix
NONE
The internal control concentration is an integer representing the number of copies/mL of sample for the internal control.
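As a hedged illustration of the rules above (exact, case-sensitive names; NONE causes the concentration to be ignored; the concentration is an integer), a wrapper script could validate the two values before passing them to DRAGEN. `validate_internal_control` is a hypothetical helper, and rejecting negative concentrations is an assumption of this sketch.

```python
INTERNAL_CONTROLS = (
    "Allobacillus halotolerans",
    "Armored RNA Quant Internal Process Control",
    "Enterobacteria phage T7",
    "Escherichia virus MS2",
    "Escherichia virus Qbeta",
    "Escherichia virus T4",
    "Imtechella halotolerans",
    "Phocid alphaherpesvirus 1",
    "Phocine morbillivirus",
    "Truepera radiovictrix",
    "NONE",
)
DEFAULT_INTERNAL_CONTROL = "Enterobacteria phage T7"


def validate_internal_control(name=DEFAULT_INTERNAL_CONTROL, concentration=None):
    """Return the (name, concentration) pair that would take effect."""
    if name not in INTERNAL_CONTROLS:  # comparison is case-sensitive
        raise ValueError(f"unknown internal control: {name!r}")
    if name == "NONE":
        return name, None  # the concentration is ignored for NONE
    if concentration is not None:
        if isinstance(concentration, bool) or not isinstance(concentration, int):
            raise ValueError("concentration must be an integer (copies/mL of sample)")
        if concentration < 0:  # assumption: negative values are not meaningful
            raise ValueError("concentration must not be negative")
    return name, concentration


print(validate_internal_control("Escherichia virus MS2", 100000))
print(validate_internal_control("NONE", 100000))
```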
Applies to: --explify-ref-db-dir, --explify-test-panel-name, --explify-test-panel-version, --explify-load-db-ram, --explify-custom-ref-fasta, --explify-custom-ref-bed
An Explify Reference Database is required to run the Explify Analysis Pipeline in DRAGEN. The databases are stored remotely and must be downloaded prior to running an analysis. The database download script provided to facilitate the download is described below.
Prior to downloading the databases, create a directory that will be dedicated to storing them. It is recommended that the directory be on a disk with at least 150 GB of free space. The path to this directory is used for the -d parameter when the download script is run in subsequent steps. The examples below use "explify-databases/".
Download and management of Explify reference databases is handled by a shell script. The script can be downloaded with the following command:
The search subcommand can be used to list the databases that can be downloaded:
The -d argument is the base directory used for storage of the databases.
Optionally, when a test panel name is specified with the -p argument, the results are limited to that panel.
Optionally, setting the -n argument filters the search to databases that have not already been downloaded.
The download subcommand is used to download the database files for a test panel:
The -d argument is the base directory used for storage of the databases.
The -p argument is the test panel name.
The -v argument is the test panel version.
The -n argument is the number of CPUs that can be used to download the files (defaults to 1).
Additional notes:
In this example, after the UPIP-8.6.0 files are downloaded, additional required files are downloaded to a subdirectory named "common".
After the files are downloaded, their checksums are checked automatically.
Due to the size of some of the files, this command takes some time. It is best to run it via screen or nohup.
The list subcommand is used to view the databases that have already been downloaded:
The -d argument is the base directory used for storage of the databases.
Optionally, when a test panel name is specified with the -p argument, the results are limited to that panel.
The download subcommand automatically checks the file checksums after download. The check subcommand can also be used on its own to check the files:
The -d argument is the base directory used for storage of the databases.
The -p argument is the test panel name.
The -v argument is the test panel version.
The -n argument is the number of CPUs that can be used to download the files (defaults to 1).
Assume the Explify database distributable, when unpacked, has a root directory named /explify-databases. The database files are organized in this root directory first by test panel type, then by test panel version:
To run an analysis with RPIP 6.5.1, for example, the following inputs would be needed:
The Explify Analysis Pipeline uses these inputs to navigate to the specified database location, namely /explify-databases/RPIP/6.5.1.
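The directory convention described above (root directory, then test panel type, then test panel version) can be expressed as a short sketch. `explify_db_path` is a hypothetical helper, and mapping its three arguments to --explify-ref-db-dir, --explify-test-panel-name, and --explify-test-panel-version is an assumption based on the option names listed for this section.

```python
from pathlib import PurePosixPath


def explify_db_path(ref_db_dir, panel_name, panel_version):
    """Build <root>/<panel type>/<panel version> as described in the text."""
    return PurePosixPath(ref_db_dir) / panel_name / panel_version


db_path = explify_db_path("/explify-databases", "RPIP", "6.5.1")
print(db_path)
```

A wrapper script can use the resulting path to confirm that the expected database directory exists before starting a run.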
If the databases are stored on a normal file system, it is recommended that you set --explify-load-db-ram=true. This tells the Explify Analysis Pipeline to load the databases into memory for faster analysis. The databases can also be stored on a RAM disk, which reduces load time over many Explify Analysis Pipeline runs. In this case, it is recommended to set --explify-load-db-ram=false.
To use a Custom database, supply references through a FASTA file via --explify-custom-ref-fasta and, optionally, a BED file via --explify-custom-ref-bed. Note that you must have downloaded the Custom database, set --explify-test-panel-name to "Custom", and set --explify-test-panel-version to the version you have downloaded. The supplied Custom Explify Reference Database is used by the Explify Analysis Pipeline to filter out host reads.
In the FASTA file, sequence names must be unique and should not contain spaces. If the FASTA header contains a space, the part before the first space is taken as the sequence name. It is recommended to use only the following characters in sequence names: letters, digits, underscore (_), hyphen (-), parentheses (( and )), and period (.). Otherwise, the sequence names may appear different in the output.
The BED file must be tab-delimited with at least 4 columns:
chrom: the sequence name as it appears in the FASTA
chromStart: start position (always set to 0)
chromEnd: end position (sequence length)
genomeName: name of the genome, target, or microorganism the sequence belongs to (e.g. Monkeypox virus clade II)
segmentName (optional): the name of the segment or gene (e.g. Segment 4 (HA)). Set to 'Full' if the sequence is the full genome
Sequence names must match between the FASTA file and BED file, and the same set of sequences must appear in both files. If there are multiple viruses, their names should be unique. For example, if there are multiple Influenza genomes, they should not be labeled with the same virus name in the 4th column.
The BED file controls how sequences are labeled in the output JSON. If the custom reference FASTA file includes sequences from multiple segments, it is recommended to provide a BED file so that the segments are included under the results of that microorganism.
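The FASTA and BED requirements above can be cross-checked before an analysis is started. The following is a minimal sketch under stated assumptions: both files are small enough to hold in memory as text, and `fasta_lengths` and `check_bed` are hypothetical helpers rather than part of the Explify Analysis Pipeline.

```python
import re

# Characters recommended for sequence names in the text above.
RECOMMENDED_NAME = re.compile(r"^[A-Za-z0-9_\-().]+$")


def fasta_lengths(fasta_text):
    """Map each sequence name (header text before the first space) to its length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split(" ", 1)[0]
            if name in lengths:
                raise ValueError(f"duplicate sequence name: {name}")
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def check_bed(bed_text, lengths):
    """Return problems found when comparing BED records with FASTA sequences."""
    problems = []
    seen = set()
    for number, line in enumerate(bed_text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            problems.append(f"line {number}: fewer than 4 tab-delimited columns")
            continue
        chrom, start, end = fields[0], fields[1], fields[2]
        seen.add(chrom)
        if chrom not in lengths:
            problems.append(f"line {number}: {chrom} is not in the FASTA")
        elif start != "0" or end != str(lengths[chrom]):
            problems.append(f"line {number}: expected 0..{lengths[chrom]} for {chrom}")
    for name in lengths:
        if name not in seen:
            problems.append(f"{name} is in the FASTA but not in the BED")
        if not RECOMMENDED_NAME.match(name):
            problems.append(f"{name} uses characters outside the recommended set")
    return problems


fasta = ">seqA example header\nACGT\nAC\n>seqB\nGGG\n"
bed = "seqA\t0\t6\tVirus A\tFull\nseqB\t0\t4\tVirus B\n"
print(check_bed(bed, fasta_lengths(fasta)))
```

In this toy input, seqB is 3 bases long but the BED record claims an end position of 4, so one problem is reported.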
The output of the Explify Analysis Pipeline is a single ap.json file written to the specified output directory. It contains general metadata, version information, sample QC, microorganism and AMR marker results, and detailed test information.
Top-Level Node
The top-level section of the output JSON contains general metadata and version information.
.qcReport.sampleQc Node
This section contains information about sample quality control (QC). The fields are relative to .qcReport.sampleQc
.qcReport.enrichmentFactor Node
This section contains information about the enrichment factor calculation. Detection of an appropriate Internal Control is required. The fields are relative to .qcReport.enrichmentFactor
.qcReport.sampleComposition Node
This section contains information about the composition of the sample. The fields are relative to .qcReport.sampleComposition
.qcReport.internalControls Node
This section contains information about internal control detection. The value of the .qcReport.internalControls field is an array of objects containing name and RPKM information for each Internal Control. See the code block below for an example:
.userOptions Node
This section gives information about analysis options specified by the user. The fields are relative to .userOptions
.targetReport.microorganisms[] Node
The value of the .targetReport.microorganisms[] field is an array of objects containing information about detected microorganisms. The following table describes one .targetReport.microorganisms[] object. The fields are relative to .targetReport.microorganisms[]
.targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[] Node
The value of the .targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[] field is an array of objects containing information about genetically related microorganisms. The following table describes one .targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[] object. The fields are relative to .targetReport.microorganisms[].predictionInformation[].relatedMicroorganisms[]
.targetReport.microorganisms[].variants[] Node
The value of the .targetReport.microorganisms[].variants[] field is an array of objects containing information about viral variants. Variants are reported for all VSP V2 viruses and, for RPIP, for SARS-CoV-2 and FluA/B/C only. The following table describes one .targetReport.microorganisms[].variants[] object. The fields are relative to .targetReport.microorganisms[].variants[]
.targetReport.amrMarkers[] Node
The value of the .targetReport.amrMarkers[] field is an array of objects containing information about detected bacterial AMR markers. The following table describes one .targetReport.amrMarkers[] object. The fields are relative to .targetReport.amrMarkers[]
.targetReport.amrMarkers[].variants[] Node
The value of the .targetReport.amrMarkers[].variants[] field is an array of objects containing information about variants for bacterial AMR markers with "protein variant" or "rRNA variant" model types. The following table describes one .targetReport.amrMarkers[].variants[] object. The fields are relative to .targetReport.amrMarkers[].variants[]
.targetReport.customReferences[] Node
This section contains information about custom reference detection results and is only present for custom database analyses. When only a custom reference FASTA file is provided (no BED file), each .targetReport.customReferences[] object contains information for a single reference sequence. When both a FASTA and BED file are provided, each .targetReport.customReferences[] object contains information for a single genome/microorganism, which can be a collection of one or more reference sequences. The fields are relative to .targetReport.customReferences[]
.targetReport.customReferences[].consensusSequences[] Node
The value of the .targetReport.customReferences[].consensusSequences[] field is an array of objects, each containing majority consensus sequence information for a single custom reference sequence. When only a FASTA file is provided (no BED file), there is only one object in the array. When both a FASTA and BED file are provided, there may be more than one object in the array. The fields are relative to .targetReport.customReferences[].consensusSequences[]
.targetReport.customReferences[].variants[] Node
The value of the .targetReport.customReferences[].variants[] field is an array of objects, each containing information about a single detected variant. The fields are relative to .targetReport.customReferences[].variants[]
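The node names in this section use a dotted path notation in which a trailing [] denotes an array. As a hedged illustration of how such paths map onto a parsed ap.json, the sketch below resolves a path against a Python dictionary. `select` is a hypothetical helper, and the key `exampleField` in the toy document is a placeholder, not a real output field.

```python
import json


def select(node, path):
    """Resolve a path such as '.targetReport.microorganisms[].variants[]'.

    A trailing '[]' on a key iterates over that array, so the result is
    always a flat list of matching values. Missing keys yield no results.
    """
    nodes = [node]
    for part in path.strip(".").split("."):
        iterate = part.endswith("[]")
        key = part[:-2] if iterate else part
        next_nodes = []
        for current in nodes:
            if not isinstance(current, dict) or key not in current:
                continue
            value = current[key]
            if iterate:
                if isinstance(value, list):
                    next_nodes.extend(value)
            else:
                next_nodes.append(value)
        nodes = next_nodes
    return nodes


# Toy document with placeholder content; real files carry many more fields.
report = json.loads(
    '{"targetReport": {"microorganisms": ['
    '{"variants": [{"exampleField": 1}, {"exampleField": 2}]},'
    '{"variants": []}]}}'
)
variants = select(report, ".targetReport.microorganisms[].variants[]")
print(len(variants))
```

Because missing keys return an empty list rather than raising an error, the same helper works for nodes that are only present in some analyses, such as .targetReport.customReferences[].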
This recipe is for processing whole genome sequencing data for somatic tumor only workflows.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.
Configure the INPUT options
Configure the OUTPUT options
Configure MAP/ALIGN depending on whether realignment is desired
Configure the VARIANT CALLERs based on the application
Configure any additional options
Build up the necessary options for each component separately, so that they can be re-used in the final command line.
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
In general, for most libraries and sample types, we recommend the default values. For specific libraries or sample types where different values may be advisable, those values are listed explicitly below each variant caller section under "library specific settings".
When possible, it is recommended to build a pipeline-specific systematic noise file that matches the library prep and sequencer of interest:
Step 1. Run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples:
Gather the full paths to the VCFs from step 1 in ${VCF_LIST} by specifying 1 file per line.
Step 2. Generate the final noise file with:
ALUs comprise approximately 11% of the genome and are common in introns. High rates of deamination false positive (FP) calls have been observed in some FFPE libraries. If the ALU regions are not clinically significant for a specific analysis, it is recommended to filter out the entire ALU region using the DRAGEN excluded regions filter: --vc-excluded-regions-bed $BED.
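The list file referred to as ${VCF_LIST} in the noise-file steps above (full paths, one file per line) can be assembled with a few lines of scripting. This is a minimal sketch: `write_vcf_list` is a hypothetical helper, and the `*.vcf.gz` naming pattern and the file names in the demonstration are assumptions, not DRAGEN output names.

```python
import tempfile
from pathlib import Path


def write_vcf_list(vcf_dir, list_path, pattern="*.vcf.gz"):
    """Write the absolute path of every matching VCF, one per line."""
    paths = sorted(p.resolve() for p in Path(vcf_dir).glob(pattern))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("normal_01.vcf.gz", "normal_02.vcf.gz", "notes.txt"):
        (Path(tmp) / name).touch()
    listed = write_vcf_list(tmp, Path(tmp) / "vcf_list.txt")
    list_lines = (Path(tmp) / "vcf_list.txt").read_text().splitlines()

print(len(listed), "VCF paths written")
```

Checking the returned count against the approximately 20-50 normal samples suggested above is a simple sanity test before the noise file is generated.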
SV library-specific settings
To build the SV systematic noise file
You can generate systematic noise BEDPE files from normal samples collected using library prep, sequencing system, and panels.
To generate a BEDPE file, do the following:
Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.
Build the BEDPE file from the VCFs using --sv-build-systematic-noise-vcfs-list, which takes the list of input VCFs from the previous step, one VCF per line. An example command line is provided below.
We recommend using --enable-variant-deduplication true to filter out all small indels in the structural variant VCF that also appear, and are passing, in the small variant VCF (PASS in the FILTER column of the small variant VCF file). With this feature, DRAGEN creates a new VCF that contains the variants in the SV VCF that do not match a variant in the SNV VCF file. The new deduplicated SV VCF file has the prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases. One use case is running both the SV and SNV callers in somatic workflows, where this feature can increase sensitivity and prevent duplicate variants within genes such as FLT3 and KMT2A.
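To illustrate why normalization matters for deduplication, the sketch below applies the widely used trim-and-left-align procedure so that two representations of the same indel compare equal. This is not DRAGEN's implementation; `normalize` is a hypothetical helper, positions are 0-based indexes into the reference string, and `max_shift=500` only mirrors the limit stated above.

```python
def normalize(ref_seq, pos, ref, alt, max_shift=500):
    """Trim and left-align a variant; pos is the 0-based index of ref in ref_seq."""
    shifted = 0
    # While both alleles end in the same base, drop it. If an allele would
    # become empty, shift left by prepending the preceding reference base.
    while ref[-1] == alt[-1]:
        if len(ref) > 1 and len(alt) > 1:
            ref, alt = ref[:-1], alt[:-1]
        elif pos > 0 and shifted < max_shift:
            pos -= 1
            base = ref_seq[pos]
            ref, alt = base + ref[:-1], base + alt[:-1]
            shifted += 1
        else:
            break
    # Trim shared leading bases, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


reference = "GGGCACACACAGGG"
# Two representations of the same 2-base deletion inside the CA repeat.
first = normalize(reference, 7, "CAC", "C")
second = normalize(reference, 2, "GCA", "G")
print(first, second, first == second)
```

Both calls describe the same deletion, and after normalization they share one position and one allele pair, which is what allows a matching record in another VCF to be recognized.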
Build normal references of microsatellite repeat distribution
Normal reference files can be generated by running collect-evidence mode on a panel of normal samples. This ONLY works with DRAGEN germline mode.
The --msi-microsatellites-file should be the same file used for running tumor-only mode. --msi-coverage-threshold should also be the same value used for running tumor-only mode.
A minimum of 20 normal samples is required for tumor-only mode.
This recipe is for processing sequencing data with unique molecular identifier (UMI) for somatic tumor only workflows.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.
Configure the INPUT options
Configure the OUTPUT options
Configure MAP/ALIGN depending on whether realignment is desired
Configure the VARIANT CALLERs based on the application
Configure any additional options
Build up the necessary options for each component separately, so that they can be re-used in the final command line.
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
For UMI samples and panels, it is strongly recommended to build a custom systematic noise file as follows:
Step 1. Run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples:
Gather the full paths to the VCFs from step 1 in ${VCF_LIST} by specifying 1 file per line.
Step 2. Generate the final noise file with:
ALUs comprise approximately 11% of the genome and are common in introns. High rates of deamination false positive (FP) calls have been observed in some FFPE libraries. If the ALU regions are not clinically significant for a specific analysis, it is recommended to filter out the entire ALU region using the DRAGEN excluded regions filter: --vc-excluded-regions-bed $BED.
Generating Panel of Normals (PON)
Somatic WES CNV requires PON files. Follow the two steps below to generate a CNV PON:
Target counts generation (per normal sample): Generate target counts for each individual normal sample as a baseline. Any options used for panel of normals generation (BED file, GC bias correction, etc.) should match those used when processing the case sample.
Combined counts generation: Merge the individual PON counts into a single <prefix>.combined.counts.txt.gz file.
$CNV_NORMALS_LIST is a single text file with the path to each target counts file generated in step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Use the PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the manifest. This will avoid using any off-target reads in the MSI analysis. For small panels it may be required to generate custom site files to ensure the panel covers at least 2000 sites. To generate custom MSI site files please refer to the MSI Biomarker section in the user guide.
Normal reference files can be generated by running collect-evidence mode on a panel of normal samples. This ONLY works with DRAGEN germline mode.
The --msi-microsatellites-file should be the same file used for running tumor-only mode. --msi-coverage-threshold should also be the same value used for running tumor-only mode.
A minimum of 20 normal samples is required for tumor-only mode.
This recipe is for processing whole genome sequencing data for somatic tumor normal workflows.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.
Configure the INPUT options
Configure the OUTPUT options
Configure MAP/ALIGN depending on whether realignment is desired
Configure the VARIANT CALLERs based on the application
Configure any additional options
Build up the necessary options for each component separately, so that they can be re-used in the final command line.
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
SNV library specific settings
When possible, it is recommended to build a pipeline-specific systematic noise file that matches the library prep and sequencer of interest:
Step 1. Run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples:
Gather the full paths to the VCFs from step 1 in ${VCF_LIST} by specifying 1 file per line.
Step 2. Generate the final noise file with:
ALUs comprise approximately 11% of the genome and are common in introns. High rates of deamination false positive (FP) calls have been observed in some FFPE libraries. If the ALU regions are not clinically significant for a specific analysis, it is recommended to filter out the entire ALU region using the DRAGEN excluded regions filter: --vc-excluded-regions-bed $BED.
Generating SV systematic noise BEDPE file
You can generate systematic noise BEDPE files from normal samples collected using library prep, sequencing system, and panels.
To build the SV systematic noise file
Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.
Build the BEDPE file from the VCFs using --sv-build-systematic-noise-vcfs-list, which takes the list of input VCFs from the previous step, one VCF per line. An example command line is provided below.
We recommend using --enable-variant-deduplication true to filter out all small indels in the structural variant VCF that also appear, and are passing, in the small variant VCF (PASS in the FILTER column of the small variant VCF file). With this feature, DRAGEN creates a new VCF that contains the variants in the SV VCF that do not match a variant in the SNV VCF file. The new deduplicated SV VCF file has the prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases. One use case is running both the SV and SNV callers in somatic workflows, where this feature can increase sensitivity and prevent duplicate variants within genes such as FLT3 and KMT2A.
This recipe is for processing whole exome sequencing data for somatic tumor only workflows.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.
Configure the INPUT options
Configure the OUTPUT options
Configure MAP/ALIGN depending on whether realignment is desired
Configure the VARIANT CALLERs based on the application
Configure any additional options
Build up the necessary options for each component separately, so that they can be re-used in the final command line.
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
Generating Panel of Normals (PON)
Somatic WES CNV requires PON files. Follow the two steps below to generate a CNV PON:
Target counts generation (per normal sample): Generate target counts for each individual normal sample as a baseline. Any options used for panel of normals generation (BED file, GC bias correction, etc.) should match those used when processing the case sample.
Combined counts generation: Merge the individual PON counts into a single <prefix>.combined.counts.txt.gz file.
$CNV_NORMALS_LIST is a single text file with the path to each target counts file generated in step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Use the PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
When possible, it is recommended to build a pipeline-specific systematic noise file that matches the library prep and sequencer of interest:
Step 1. Run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples:
Gather the full paths to the VCFs from step 1 in ${VCF_LIST} by specifying 1 file per line.
Step 2. Generate the final noise file with:
ALUs comprise approximately 11% of the genome and are common in introns. High rates of deamination false positive (FP) calls have been observed in some FFPE libraries. If the ALU regions are not clinically significant for a specific analysis, it is recommended to filter out the entire ALU region using the DRAGEN excluded regions filter: --vc-excluded-regions-bed $BED.
We recommend using --enable-variant-deduplication true to filter out all small indels in the structural variant VCF that also appear, and are passing, in the small variant VCF (PASS in the FILTER column of the small variant VCF file). With this feature, DRAGEN creates a new VCF that contains the variants in the SV VCF that do not match a variant in the SNV VCF file. The new deduplicated SV VCF file has the prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases. One use case is running both the SV and SNV callers in somatic workflows, where this feature can increase sensitivity and prevent duplicate variants within genes such as FLT3 and KMT2A.
Normal reference files can be generated by running collect-evidence mode on a panel of normal samples. This ONLY works with DRAGEN germline mode.
The --msi-microsatellites-file should be the same file used for running tumor-only mode. --msi-coverage-threshold should also be the same value used for running tumor-only mode.
A minimum of 20 normal samples is required for tumor-only mode.
This recipe is for processing whole exome sequencing data for somatic tumor normal workflows.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.
Configure the INPUT options
Configure the OUTPUT options
Configure MAP/ALIGN depending on whether realignment is desired
Configure the VARIANT CALLERs based on the application
Configure any additional options
Build up the necessary options for each component separately, so that they can be re-used in the final command line.
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
Please include the matched normal sample in the CNV panel of normals.
Generating Panel of Normals (PON)
Somatic WES CNV requires PON files. Follow the two steps below to generate a CNV PON:
Target counts generation (per normal sample): Generate target counts for each individual normal sample as a baseline. Any options used for panel of normals generation (BED file, GC bias correction, etc.) should match those used when processing the case sample.
Combined counts generation: Merge the individual PON counts into a single <prefix>.combined.counts.txt.gz file.
$CNV_NORMALS_LIST is a single text file with the path to each target counts file generated in step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Use the PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
When possible, it is recommended to build a pipeline-specific systematic noise file that matches the library prep and sequencer of interest:
Step 1. Run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples:
Gather the full paths to the VCFs from step 1 in ${VCF_LIST} by specifying 1 file per line.
Step 2. Generate the final noise file with:
ALUs comprise approximately 11% of the genome and are common in introns. High rates of deamination false positive (FP) calls have been observed in some FFPE libraries. If the ALU regions are not clinically significant for a specific analysis, it is recommended to filter out the entire ALU region using the DRAGEN excluded regions filter: --vc-excluded-regions-bed $BED.
We recommend using --enable-variant-deduplication true to filter out all small indels in the structural variant VCF that also appear, and are passing, in the small variant VCF (PASS in the FILTER column of the small variant VCF file). With this feature, DRAGEN creates a new VCF that contains the variants in the SV VCF that do not match a variant in the SNV VCF file. The new deduplicated SV VCF file has the prefix passed by --output-file-prefix followed by sv.small_indel_dedup. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases. One use case is running both the SV and SNV callers in somatic workflows, where this feature can increase sensitivity and prevent duplicate variants within genes such as FLT3 and KMT2A.
This recipe is for processing sequencing data with unique molecular identifier (UMI) for somatic tumor normal workflows.
For Somatic UMI Tumor Normal inputs, the tumor and normal samples must be run separately for the Map/Align stage. Variant Calling then starts from the tumor and normal UMI-collapsed BAMs.
For Map/Align stage:
Configure the INPUT options
Configure the OUTPUT options
Configure MAP/ALIGN
Configure UMI options
For Variant Calling stage:
Configure the INPUT options
Configure the OUTPUT options
Configure the VARIANT CALLERs based on the application
Configure any additional options
Build up the necessary options for each component separately, so that they can be re-used in the final command line.
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
Map/Align stage
Variant Calling (and optional biomarkers) stage:
For UMI samples and panels, it is strongly recommended to build a custom systematic noise file as follows:
Step 1. Run DRAGEN somatic tumor-only on each of approximately 20-50 normal samples:
Gather the full paths to the VCFs from step 1 in ${VCF_LIST} by specifying 1 file per line.
Step 2. Generate the final noise file with:
ALUs comprise approximately 11% of the genome and are common in introns. High rates of deamination false positive (FP) calls have been observed in some FFPE libraries. If the ALU regions are not clinically significant for a specific analysis, it is recommended to filter out the entire ALU region using the DRAGEN excluded regions filter: --vc-excluded-regions-bed $BED.
Please include the matched normal sample in the CNV panel of normals.
Generating Panel of Normals (PON)
Somatic WES CNV requires PON files. Follow the two steps below to generate a CNV PON:
Target counts generation (per normal sample): Generate target counts for each individual normal sample as a baseline. Any options used for panel of normals generation (BED file, GC bias correction, etc.) should match those used when processing the case sample.
Combined counts generation: Merge the individual PON counts into a single <prefix>.combined.counts.txt.gz file.
$CNV_NORMALS_LIST is a single text file with the path to each target counts file generated in step 1 (either .target.counts.gz or .target.counts.gc-corrected.gz). The output is a PON file with the suffix .combined.counts.txt.gz. Use the PON file in case sample runs of DRAGEN CNV with the --cnv-combined-counts option.
For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the manifest. This will avoid using any off-target reads in the MSI analysis. For small panels it may be required to generate custom site files to ensure the panel covers at least 2000 sites. To generate custom MSI site files please refer to the MSI Biomarker section in the user guide.
The metagenomics classifier uses a k-mer based classification algorithm to classify each query sequence (usually a read) against a collection of reference sequences. There are two logical steps to this process: 1) the reference sequences are indexed into a searchable database, and 2) the reference sequence database is searched using the query sequences, and the query sequences are classified to the taxid(s) associated with the reference sequences. This guide explains how to generate your own indexed, searchable database of reference sequences to be used by the k-mer classifier.
K-mer/G-mer length considerations:
G-mer length refers to the size of the window from which to pick a minimizer. The larger the window, the fewer minimizers will be chosen overall, resulting in a smaller database. However, this can cause a loss of sensitivity since fewer k-mers overall will be saved.
K-mer length refers to the size of the minimizer to be saved in a window of the size specified by the G-mer length. In general, larger k-mers result in greater specificity, while shorter k-mers result in greater sensitivity. However, this general rule may not hold for a specific application, and we recommend trying a few different g-mer and k-mer lengths to determine what works best for a given sequence type and application.
As a general rule, we recommend starting with a g-mer length of 35 and k-mer length of 31.
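The relationship between the two lengths can be illustrated with a small minimizer sketch. It assumes the minimizer is the lexicographically smallest k-mer in each g-length window; the ordering actually used by the database builder is not stated here and may differ (for example, a hash-based ordering).

```python
def minimizers(seq, k, g):
    """Return the set of (position, k-mer) minimizers, one per g-length window.

    Assumption: the minimizer is the lexicographically smallest k-mer in the
    window, with ties broken by the leftmost position.
    """
    if not 0 < k <= g <= len(seq):
        raise ValueError("need 0 < k <= g <= len(seq)")
    chosen = set()
    for w in range(len(seq) - g + 1):
        window_kmers = [(seq[i:i + k], i) for i in range(w, w + g - k + 1)]
        kmer, pos = min(window_kmers)
        chosen.add((pos, kmer))
    return chosen
```

With g equal to k, every k-mer is kept. As g grows, neighboring windows tend to share the same minimizer, so fewer k-mers are stored, which is the size versus sensitivity tradeoff described above.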
Pre-built Explify Reference Database k-mer/g-mer length settings for reference:
Very large collection of NCBI Refseq genomes and the entirety of the NCBI nucleotide (nt) database (more than 2 Terabases of sequence): G-mer length of 41 and k-mer length of 31. Compressed version built with g-mer length of 64 and k-mer length of 31. This results in a database with less than half the storage requirements.
Subset of reference genomes from Refseq with a focus on viral detection: G-mer length of 35 and k-mer length of 31
Collection of 16S sequences for bacterial identification / profiling: G-mer length of 31 and k-mer length of 31
Uniref90 protein sequences: G-mer length of 15, k-mer length of 12 (these are k-mers of amino acids, not nucleotides)
Three types of databases can be built with this tool:
Binner: each k-mer is assigned to a category/bin.
Must use --kmer-class-db-builder-num-categories.
Do not use --kmer-class-db-builder-tax-tree-file, --kmer-class-db-builder-save-weights, or --kmer-class-db-builder-kmer-cutoff.
Classifier: each k-mer is assigned to one taxid.
Must define a taxonomic tree with --kmer-class-db-builder-tax-tree-file.
Do not use --kmer-class-db-builder-num-categories, --kmer-class-db-builder-save-weights, or --kmer-class-db-builder-kmer-cutoff.
Classifier with weights: each k-mer is assigned to one or more taxids; associated weights are also stored (frequency of k-mer's association with a taxid). Uses much more memory, but is more accurate.
Must use --kmer-class-db-builder-save-weights and define a taxonomic tree with --kmer-class-db-builder-tax-tree-file.
Can use --kmer-class-db-builder-kmer-cutoff.
Do not use --kmer-class-db-builder-num-categories.
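The option rules for the three database types can be written down as a small pre-flight check. This is a sketch, not part of DRAGEN: the type labels and the function name check_builder_options are hypothetical, while the option names are the ones listed above.

```python
TAX_TREE = "--kmer-class-db-builder-tax-tree-file"
NUM_CATEGORIES = "--kmer-class-db-builder-num-categories"
SAVE_WEIGHTS = "--kmer-class-db-builder-save-weights"
KMER_CUTOFF = "--kmer-class-db-builder-kmer-cutoff"

# Required and disallowed options per database type, as listed above.
RULES = {
    "binner": {
        "required": {NUM_CATEGORIES},
        "forbidden": {TAX_TREE, SAVE_WEIGHTS, KMER_CUTOFF},
    },
    "classifier": {
        "required": {TAX_TREE},
        "forbidden": {NUM_CATEGORIES, SAVE_WEIGHTS, KMER_CUTOFF},
    },
    "classifier_with_weights": {
        "required": {SAVE_WEIGHTS, TAX_TREE},
        "forbidden": {NUM_CATEGORIES},
    },
}


def check_builder_options(db_type, options):
    """Return a list of problems for the given set of option names."""
    used = set(options)
    rule = RULES[db_type]
    problems = [f"missing {o}" for o in sorted(rule["required"] - used)]
    problems += [f"not allowed: {o}" for o in sorted(rule["forbidden"] & used)]
    return problems
```

An empty list means the combination is consistent with the rules above. The k-mer cutoff is optional for the weighted classifier, so it appears in neither set for that type.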
Enable or disable GC bias correction when generating target counts. For more information, see .
We recommend using a linear (non-pangenome) reference for somatic analysis. For more details, refer to .
Optional settings per component are listed below. Full option list at .
Generic SNV noise files can be downloaded here:
The ALU bed file can be downloaded as part of the Bed File Collection:
You can also build systematic noise BEDPE files in the cloud using the .
Microsatellite sites file can be downloaded here:
Required Inputs
--enable-explify
Enables the Explify Analysis Pipeline. (Default=false)
--output-file-prefix
Prefix for all output files.
--output-directory
Directory for all output files.
--explify-sample-list
Input sample list .tsv file with sample IDs, FASTQs, etc.
--explify-test-panel-name
Name of the test panel: "RPIP", "UPIP", "VSPv2", or "Custom".
--explify-test-panel-version
Set to test panel version (e.g. "1.0.0").
--explify-ref-db-dir
Path to root directory for Explify Database files.
Optional Inputs
--intermediate-results-dir
Area for temporary files. Available space must be greater than three times the combined size of all FASTQ files.
--explify-load-db-ram
Option to load database into RAM if not on ramdisk. (Default=false).
--explify-no-read-qc
Option to turn off read QC on FASTQs before analysis. (Default=false).
--explify-internal-control
Option to set internal control from an accepted list. (Default="Enterobacteria phage T7")
--explify-internal-control-concentration
Option to set internal control concentration. (Default=12100000)
--explify-ncpus
Option to set the number of CPUs available for processing.
--explify-sensitivity-threshold
Option to set sensitivity threshold. Range: 0 < Integer < 1000. Only valid for VSPv2. (Default=5).
--explify-custom-ref-fasta
Reference FASTA file. Required for Custom reference DBs.
--explify-custom-ref-bed
Reference BED file. Optional for Custom reference DBs.
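The required inputs can be assembled into an argument list as sketched below. The option names are taken from the table above; the space-separated `--option value` form, the helper names, and all example values are assumptions for illustration.

```python
ALLOWED_PANELS = {"RPIP", "UPIP", "VSPv2", "Custom"}


def explify_args(prefix, out_dir, sample_list, panel, panel_version,
                 ref_db_dir, custom_ref_fasta=None):
    """Build a DRAGEN argument list from the required Explify inputs."""
    if panel not in ALLOWED_PANELS:
        raise ValueError(f"unknown test panel: {panel}")
    if panel == "Custom" and custom_ref_fasta is None:
        raise ValueError("Custom panels require --explify-custom-ref-fasta")
    args = [
        "dragen",
        "--enable-explify", "true",
        "--output-file-prefix", prefix,
        "--output-directory", out_dir,
        "--explify-sample-list", sample_list,
        "--explify-test-panel-name", panel,
        "--explify-test-panel-version", panel_version,
        "--explify-ref-db-dir", ref_db_dir,
    ]
    if custom_ref_fasta is not None:
        args += ["--explify-custom-ref-fasta", custom_ref_fasta]
    return args


def scratch_is_large_enough(free_bytes, fastq_sizes):
    """Temporary area must exceed 3x the combined FASTQ size."""
    return free_bytes > 3 * sum(fastq_sizes)
```

The list can be passed to a process launcher once the scratch-space check for --intermediate-results-dir passes.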
.accession
Identifier used for the sample
.deploymentEnvironment
Environment in which the results were produced
.batchId
Identifier used for the batch of samples processed together
.analysisId
Identifier used for the analysis
.runId
Identifier used for the sequencing run
.controlFlag
Indicates whether the sample is a control. It is based on the ControlFlag field in the sample .tsv
and can be set to “POS”, “NEG”, “BLANK”, or “-”
.dragenVersion
DRAGEN release version
.analysisPipelineVersion
Analysis Pipeline release version
.testType
Type of test panel ("RPIP", "UPIP", "VSPv2", "Custom")
.testVersion
Test panel release version
.testName
Full name of test panel
.testUse
Test use. "For Research Use Only. Not for use in diagnostic procedures"
.reportTime
Date and time the report was generated
.warnings
List of warnings encountered during the analysis
.errors
List of errors encountered during the analysis
.totalRawBases
Number of base pairs in sample before read QC processing
.totalRawReads
Number of reads in sample before read QC processing
.uniqueReads
Number of distinct reads in sample before read QC processing
.uniqueReadsProportion
Proportion of distinct reads in sample before read QC processing
.preQualityMeanReadLength
Average read length before read QC processing
.postQualityMeanReadLength
Average read length after read QC processing
.postQualityReads
Number of reads in sample after read QC processing, inclusive of any duplicate reads
.postQualityReadsProportion
Proportion of post-quality reads in sample relative to total raw reads
.removedInDehostingReads
Number of host reads in sample removed during dehosting (host = human)
.removedInDehostingReadsProportion
Proportion of host reads in sample removed relative to total raw reads (host = human)
.entropy
Shannon entropy of the counts of 5-mers in the reads after read QC processing, which is a measure of randomness
.gContent
Proportion of guanine (G) base calls in reads after read QC processing
.libraryQScore
Quality score of the library after read QC processing
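For orientation, the .entropy metric above can be approximated as the Shannon entropy of 5-mer counts. This is a sketch under assumptions: the logarithm base (base 2 here) and the exact counting rules used by the pipeline are not stated in this section.

```python
import math
from collections import Counter


def kmer_entropy(reads, k=5):
    """Shannon entropy (in bits, an assumed unit) of k-mer counts across reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A library dominated by a few repeated 5-mers gives a value near zero, while a more random library gives a higher value.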
.value
Enrichment factor value reflecting how well targeted regions were enriched
.category
Enrichment factor category: "poor", "fair", "good", or "not calculated"
.readClassification
Proportion of post-quality reads classified to the following categories:
.readClassification.targetedMicrobial
Targeted microbial
.readClassification.targetedInternalControl
Targeted Internal Control
.readClassification.untargeted
Untargeted
.readClassification.ambiguous
More than one category
.readClassification.unclassified
No category
.readClassification.lowComplexity
Low complexity
.targetedMicrobial
Proportion of post-quality targeted microbial reads classified to the following sub-categories:
.targetedMicrobial.viral
Viral targeted
.targetedMicrobial.bacterial
Bacterial targeted
.targetedMicrobial.fungal
Fungal targeted
.targetedMicrobial.parasitic
Parasitic targeted
.targetedMicrobial.bacterialAmr
Bacterial AMR targeted
.untargeted
Proportion of post-quality untargeted reads classified to the following sub-categories:
.untargeted.viral
Viral untargeted
.untargeted.bacterial
Bacterial untargeted
.untargeted.fungal
Fungal untargeted
.untargeted.parasitic
Parasitic untargeted
.untargeted.bacterialAmr
Bacterial AMR untargeted
.untargeted.internalControl
Internal Control untargeted
.untargeted.human
Human untargeted
.viral
Proportion of post-quality viral reads classified to the following categories:
.viral.targeted
Viral targeted
.viral.untargeted
Viral untargeted
.viral.untargetedSubcategories
Proportion of post-quality viral untargeted reads classified to the following sub-categories:
.viral.untargetedSubcategories.panel
Viral panel members
.viral.untargetedSubcategories.phage
Viral phage
.viral.untargetedSubcategories.other
Viral other (not a panel member or phage)
.bacterial
Proportion of post-quality bacterial reads classified to the following categories:
.bacterial.targeted
Bacterial targeted
.bacterial.untargeted
Bacterial untargeted
.bacterial.untargetedSubcategories
Proportion of post-quality bacterial untargeted reads classified to the following sub-categories:
.bacterial.untargetedSubcategories.panel
Bacterial panel members
.bacterial.untargetedSubcategories.ribosomalDna
Bacterial ribosomal DNA (16S)
.bacterial.untargetedSubcategories.plasmid
Bacterial plasmids
.bacterial.untargetedSubcategories.other
Bacterial other (not a panel member, ribosomal DNA, or plasmid)
.fungal
Proportion of post-quality fungal reads classified to the following categories:
.fungal.targeted
Fungal targeted
.fungal.untargeted
Fungal untargeted
.fungal.untargetedSubcategories
Proportion of post-quality fungal untargeted reads classified to the following sub-categories:
.fungal.untargetedSubcategories.panel
Fungal panel members
.fungal.untargetedSubcategories.ribosomalDna
Fungal ribosomal DNA (18S)
.fungal.untargetedSubcategories.other
Fungal other (not a panel member or ribosomal DNA)
.parasitic
Proportion of post-quality parasitic reads classified to the following categories:
.parasitic.targeted
Parasitic targeted
.parasitic.untargeted
Parasitic untargeted
.parasitic.untargetedSubcategories
Proportion of post-quality parasitic untargeted reads classified to the following sub-categories:
.parasitic.untargetedSubcategories.panel
Parasitic panel members
.parasitic.untargetedSubcategories.ribosomalDna
Parasitic ribosomal DNA (18S)
.parasitic.untargetedSubcategories.other
Parasitic other (not a panel member or ribosomal DNA)
.human
Proportion of post-quality human reads classified to the following categories:
.human.untargeted
Human untargeted
.human.untargetedSubcategories
Proportion of post-quality human untargeted reads classified to the following sub-categories:
.human.untargetedSubcategories.ribosomalDna
Human ribosomal DNA
.human.untargetedSubcategories.codingSequence
Human coding sequence
.human.untargetedSubcategories.other
Human other (not ribosomal DNA or coding sequence)
.internalControl
Proportion of post-quality Internal Control reads classified to the following categories:
.internalControl.targeted
Internal Control targeted
.internalControl.untargeted
Internal Control untargeted
.microbialAndInternalControl
Proportion of post-quality Microbial and Internal Control reads classified to the following categories:
.microbialAndInternalControl.targeted
Microbial and Internal Control targeted
.microbialAndInternalControl.untargeted
Microbial and Internal Control untargeted
.bacterialAmr
Proportion of post-quality bacterial AMR reads classified to the following categories:
.bacterialAmr.targeted
Bacterial AMR targeted
.bacterialAmr.untargeted
Bacterial AMR untargeted
.quantitativeInternalControlName
Quantitative Internal Control used for microorganism absolute quantification (recommendation: Enterobacteria phage T7)
.quantitativeInternalControlConcentration
Quantitative Internal Control concentration (recommendation: 1.21 x 10^7 copies/mL of sample)
.readQcEnabled
Boolean indicating if read QC (trimming and filtering based on quality and read length) is enabled
.readClassificationSensitivity
(VSP V2 only) Sensitivity threshold for classifying reads. Determines whether alignment should proceed for a microorganism and/or reference sequence. Value is an integer with a valid range of 1 to 1000, inclusive
.customPanelFastaFile
(Custom Panel only) Name of the custom reference FASTA file
.customPanelBedFile
(Custom Panel only) Name of the custom reference BED file
.class
Microorganism class ("viral", "bacterial", "fungal", "parasite")
.name
Name of microorganism
.coverage
Proportion of targeted microorganism reference sequence bases that appear in sample sequencing reads
.ani
Average nucleotide identity of consensus sequence to targeted microorganism reference sequences
.medianDepth
Median depth of sample sequencing reads aligned to targeted microorganism reference sequences, indicating the median number of times each targeted microorganism reference sequence base appears in sample sequencing reads
.condensedDepthVector
Read depth across the targeted microorganism reference sequences, condensed to 256 bins
.rpkm
Normalized representation of the number of sample sequencing reads aligned to targeted microorganism reference sequences (targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads)
.alignedReadCount
Number of sample sequencing reads that aligned to targeted microorganism reference sequences
.kmerReadCount
(UPIP only) Number of sample sequencing reads classified to targeted microorganism reference sequences
.absoluteQuantityRatio
Numerical absolute quantification value. Quantitative internal control required for calculation
.absoluteQuantityRatioFormatted
Formatted absolute quantification value with units. Quantitative internal control required for calculation
.phenotypicGroup
(RPIP, UPIP only) Grouping indicating general association with normal flora, colonization, or contamination from the environment or other sources, as well as general association with disease
.associatedAmrMarkers
(Bacteria only) Information about the bacterial AMR markers associated with the microorganism
.associatedAmrMarkers.applicable
Boolean indicating whether one or more bacterial AMR markers are associated with the microorganism
.associatedAmrMarkers.detected
List of detected bacterial AMR markers associated with the microorganism
.associatedAmrMarkers.predicted
List of predicted bacterial AMR markers associated with the microorganism
.consensusGenomeSequences
(RPIP, VSP V2 viruses only) Information about the majority consensus genome (or segment) sequence
.consensusGenomeSequences.sequence
Consensus genome (or segment) sequence bases
.consensusGenomeSequences.referenceAccession
Accession of the reference genome (or segment) sequence
.consensusGenomeSequences.referenceDescription
Description of the reference genome (or segment) sequence
.consensusGenomeSequences.referenceLength
Length of the reference genome (or segment) sequence
.consensusGenomeSequences.maximumAlignmentLength
Longest contiguous alignment between consensus sequence and reference genome (or segment) sequence
.consensusGenomeSequences.maximumGapLength
Longest contiguous alignment gap (insertion or deletion) between consensus sequence and reference genome (or segment) sequence
.consensusGenomeSequences.maximumUnalignedLength
Longest section of the reference genome (or segment) sequence not aligned to by consensus sequence
.consensusGenomeSequences.coverage
Proportion of reference genome (or segment) sequence bases that appear in sample sequencing reads
.consensusGenomeSequences.ani
Average nucleotide identity of consensus sequence to reference genome (or segment) sequence
.consensusGenomeSequences.alignedReadCount
Number of sample sequencing reads that aligned to reference genome (or segment) sequence
.consensusGenomeSequences.medianDepth
Median depth of sample sequencing reads aligned to reference genome (or segment) sequence, indicating the median number of times each reference genome (or segment) sequence base appears in sample sequencing reads
.consensusGenomeSequences.targetAnnotation
List of targeted region annotations for the reference genome (or segment) sequence. Each annotation is a JSON object with the following fields: start (int), end (int), strand (string: "+", "-"), target_name (string), type (string)
.consensusGenomeSequences.condensedDepthVector
Read depth across the reference genome (or segment) sequence, condensed to 256 bins
.consensusTargetSequences
(RPIP viruses only) Information about the majority targeted region consensus sequences
.consensusTargetSequences.sequence
Consensus targeted region sequence bases
.consensusTargetSequences.name
Name of the targeted region
.consensusTargetSequences.referenceAccession
Accession of the targeted region reference sequence
.consensusTargetSequences.depthVector
Read depth across the targeted region reference sequence, not condensed
.predictionInformation
Information about microorganism prediction results
.predictionInformation.predictedPresent
Boolean indicating whether the microorganism passed its reporting logic algorithm
.predictionInformation.notes
List of notes about the prediction result
.predictionInformation.subpanels
List of pre-defined subpanels that the microorganism belongs to
.predictionInformation.relatedMicroorganisms
Array of objects with information about genetically related microorganisms. See below for details
.variants
(all VSP V2 viruses, RPIP: SARS-CoV-2 & FluA/B/C only) Information about viral variants. See below for details
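The .rpkm definition above (targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads) corresponds to the following arithmetic. The function and parameter names are illustrative and not part of the report schema.

```python
def rpkm(aligned_reads, target_length_bp, quality_filtered_reads):
    """Reads per kilobase of targeted sequence per million quality-filtered reads."""
    if target_length_bp <= 0 or quality_filtered_reads <= 0:
        raise ValueError("lengths and read totals must be positive")
    kilobases = target_length_bp / 1_000
    millions = quality_filtered_reads / 1_000_000
    return aligned_reads / kilobases / millions
```

For example, 500 aligned reads on 2,000 bp of targeted sequence in a sample with 1,000,000 quality-filtered reads gives 500 / 2 / 1 = 250.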
.name
Name of related microorganism
.onPanel
Boolean indicating whether the related microorganism is a panel member
.kmerReadCount
(UPIP only) Number of sample sequencing reads classified to related microorganism reference sequences
.coverage
Proportion of related microorganism reference sequence bases that appear in sample sequencing reads
.ani
Average nucleotide identity of consensus sequence to related microorganism reference sequences
.alignedReadCount
Number of sample sequencing reads that aligned to related microorganism reference sequences
.referenceAccession
Accession of reference genome (or segment) sequence used for variant calling
.segment
(Segmented viruses only) Segment number of reference segment sequence
.ntChange
Nucleotide change associated with variant
.referencePosition
Variant position in viral reference genome (or segment) sequence
.referenceAllele
Reference allele at variant position
.variantAllele
Variant allele
.depth
Variant depth, indicating the number of times variant position appears in sample sequencing reads
.alleleFrequency
Frequency of variant allele in sample sequencing reads
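As an illustration of how the variant fields above might be consumed, the sketch below filters variant records by depth and allele frequency. The thresholds and the example record values are hypothetical; only the field names come from the table.

```python
def confident_variants(variants, min_depth=10, min_allele_frequency=0.05):
    """Keep variant records that meet illustrative depth and frequency cutoffs."""
    return [
        v for v in variants
        if v["depth"] >= min_depth and v["alleleFrequency"] >= min_allele_frequency
    ]


# Hypothetical records using the field names from the table above.
example = [
    {"referencePosition": 241, "referenceAllele": "C", "variantAllele": "T",
     "depth": 150, "alleleFrequency": 0.98},
    {"referencePosition": 3037, "referenceAllele": "C", "variantAllele": "T",
     "depth": 4, "alleleFrequency": 0.75},
]
kept = confident_variants(example)
```

Suitable cutoffs depend on the panel and sample type, so the defaults here should be treated as placeholders.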
.class
Microorganism class ("bacterial")
.cardModelType
Bacterial AMR marker model type in the Comprehensive Antibiotic Resistance Database (CARD) ("homolog", "protein variant", "rRNA variant")
.cardGeneFamily
Bacterial AMR marker gene family in the Comprehensive Antibiotic Resistance Database (CARD)
.name
Bacterial AMR marker name
.cardName
Bacterial AMR marker name in the Comprehensive Antibiotic Resistance Database (CARD)
.ncbiName
Bacterial AMR marker name in the National Center for Biotechnology Information (NCBI) Reference Gene Catalog
.referenceAccession
Accession of the bacterial AMR marker reference sequence
.coverage
Proportion of bacterial AMR marker reference sequence residues that appear in sample sequencing reads (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)
.pid
Percent identity of consensus sequence aligned to bacterial AMR marker reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)
.medianDepth
Median depth of sample sequencing reads aligned to bacterial AMR marker reference sequence, indicating the median number of times each bacterial AMR marker sequence residue appears in sample sequencing reads (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)
.rpkm
Normalized representation of the number of sample sequencing reads aligned to bacterial AMR reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)
.alignedReadCount
Number of sample sequencing reads that aligned to bacterial AMR reference sequence (protein alignment for "homolog" and "protein variant" model types; nucleotide alignment for "rRNA variant" model type)
.nucleotideConsensusSequence
Nucleotide consensus sequence bases
.proteinConsensusSequence
Protein consensus sequence residues
.nucleotideDepthVector
Read depth across the bacterial AMR marker nucleotide reference sequence, not condensed
.proteinDepthVector
Read depth across the bacterial AMR marker protein reference sequence, not condensed
.associatedMicroorganisms
Information about the microorganisms associated with the bacterial AMR marker
.associatedMicroorganisms.all
List of all microorganisms associated with the bacterial AMR marker
.associatedMicroorganisms.detected
List of detected microorganisms associated with the bacterial AMR marker
.associatedMicroorganisms.predicted
List of predicted microorganisms associated with the bacterial AMR marker
.predictionInformation
Information about bacterial AMR marker prediction results
.predictionInformation.predictedPresent
Boolean indicating whether the bacterial AMR marker passed its reporting logic algorithm
.predictionInformation.confidence
Confidence level of bacterial AMR marker prediction ("high", "medium", "low")
.predictionInformation.notes
List of notes about the prediction result
.category
Variant category ("Bacterial Variant; Known AMR")
.referenceSourceMicroorganism
Microorganism that reference sequence is associated with in NCBI
.comments
List of additional information regarding the variant
.product
Protein product of gene
.ntChange
Nucleotide change associated with variant
.referencePosition
Variant position in reference sequence
.referenceAllele
Reference allele at variant position
.variantAllele
Variant allele
.depth
Variant depth, indicating the number of times variant position appears in sample sequencing reads
.alleleFrequency
Frequency of variant allele in sample sequencing reads
.annotation
Type of change (e.g. "Nonsynonymous Variant")
.aaChange
Amino acid change associated with variant
.epistaticGroups
List of epistatic groups variant is associated with
.name
Provided name of custom reference sequence, accession, genome, or microorganism
.coverage
Proportion of custom reference sequence bases that appear in sample sequencing reads
.ani
Average nucleotide identity of consensus sequence to custom reference sequence or, if specified, collection of one or more custom reference sequences
.medianDepth
Median depth of sample sequencing reads aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences, indicating the median number of times each custom reference sequence base appears in sample sequencing reads
.condensedDepthVector
Read depth across custom reference sequence or, if specified, collection of one or more custom reference sequences, condensed to 256 bins
.rpkm
Normalized number of sample sequencing reads aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences (targeted Reads mapped Per Kilobase of targeted sequence per Million quality-filtered reads)
.alignedReadCount
Number of sample sequencing reads that aligned to custom reference sequence or, if specified, collection of one or more custom reference sequences
.consensusSequences
Array of objects with information about each consensus sequence. See below for details
.variants
Array of objects with information about variants detected in custom reference sequence or, if specified, collection of one or more custom reference sequences. See below for details
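The depth-derived fields above (.coverage, .medianDepth, .condensedDepthVector) can be related to a per-base depth vector as sketched here. The binning rule is an assumption: each of the 256 bins holds the mean depth of its slice, and vectors shorter than 256 positions are returned unchanged. The pipeline's actual rule is not described in this section.

```python
from statistics import median


def coverage(depths):
    """Proportion of reference bases seen in at least one read."""
    return sum(1 for d in depths if d > 0) / len(depths)


def median_depth(depths):
    """Median of the per-base read depths."""
    return median(depths)


def condense(depths, bins=256):
    """Condense a per-base depth vector to `bins` values (mean per bin, assumed)."""
    n = len(depths)
    if n <= bins:
        return list(depths)
    condensed = []
    for b in range(bins):
        chunk = depths[b * n // bins:(b + 1) * n // bins]
        condensed.append(sum(chunk) / len(chunk))
    return condensed
```

The condensed vector is convenient for plotting, whereas the uncondensed .depthVector in the consensus sequence objects keeps one value per reference position.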
.sequence
Majority consensus sequence bases
.referenceAccession
Accession of custom reference sequence
.referenceDescription
Description of custom reference sequence
.referenceLength
Length of custom reference sequence
.coverage
Proportion of custom reference sequence bases that appear in sample sequencing reads
.ani
Average nucleotide identity of consensus sequence to custom reference sequence
.medianDepth
Median depth of sample sequencing reads aligned to custom reference sequence, indicating the median number of times each custom reference sequence base appears in sample sequencing reads
.depthVector
Read depth across custom reference sequence, not condensed
.alignedReadCount
Number of sample sequencing reads that aligned to custom reference sequence
.maximumAlignmentLength
Longest contiguous alignment between consensus sequence and custom reference sequence
.maximumGapLength
Longest contiguous alignment gap (insertion or deletion) between consensus sequence and custom reference sequence
.maximumUnalignedLength
Longest section of custom reference sequence not aligned to by consensus sequence
.ntChange
Nucleotide change associated with variant
.referenceAccession
Accession of custom reference sequence used for variant calling
.referencePosition
Variant position in custom reference sequence
.referenceAllele
Reference allele at variant position
.variantAllele
Variant allele
.depth
Variant depth, indicating the number of times variant position appears in sample sequencing reads
.alleleFrequency
Frequency of variant allele in sample sequencing reads
--heme-cnv true
Configures DRAGEN to use CNV settings for Liquid Tumors (e.g., AML/MLL).
--vc-sq-filter-threshold $THRESHOLD
Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.
--vc-systematic-noise $SYSTEMATIC_NOISE_FILE
Systematic noise filter. In tumor-only variant calling, this filter is essential for removing systematic noise observed in normal samples. Prebuilt systematic noise files are available for download on the DRAGEN Software Support Site page. Alternatively, a systematic noise file can be generated by running the somatic TO pipeline on normal samples. We recommend using a systematic noise file based on normal samples that match the library prep of the tumor samples.
--vc-somatic-hotspots somatic_hotspots_GRCh38.vcf.gz
Hotspots file. By default, DRAGEN treats positions in the COSMIC database as hotspots, assigning an increased prior probability to variants at these positions. Use this option to override the default with a custom hotspots file if a list of positions of interest is available.
--vc-combine-phased-variants-distance $DIST
Combining phased variants. By default, DRAGEN does not combine nearby phased calls into MNVs or indels. To combine such calls, set this parameter to a value greater than zero, indicating the maximum distance at which calls should be combined. When enabling this feature, the recommended distance is 15 base pairs. The valid range is [0, 15].
--vc-enable-germline-tagging true --enable-variant-annotation true --variant-annotation-data $NIRVANA_ANNOTATION_FOLDER --variant-annotation-assembly $REFERENCE
Germline filtering. Enable to tag variants as germline or somatic based on population databases. $REFERENCE can be GRCh37 or GRCh38 (GRCh37 is compatible with hs37d5 and hg19). The Nirvana annotation database is downloadable at this page.
--vc-target-vaf FLOAT
This option is available starting in V4.2. --vc-target-vaf selects the variant allele frequencies of interest. The variant caller aims to detect variants with allele frequencies equal to or larger than this setting. The setting does not apply a hard filter, so variants with allele frequencies below the selected threshold can still be detected. On high-coverage, clean datasets, a lower target VAF may help increase sensitivity. On noisy samples (such as FFPE), a higher target VAF may help reduce false positives. Using a low target VAF may also increase runtime. The valid range is [0, 1]. The default is 0.03 (or 0.001 when --vc-enable-umi-liquid=true).
--vc-systematic-noise-method
The 'max' method is recommended for WGS and results in a more aggressive filter. The 'mean' method is recommended for UMI/PANELs/WES and results in a less aggressive filter. The default is specified in the noise file header.
--vc-excluded-regions-bed $BED
Some FFPE samples may have a high rate of FP calls in SINE (and specifically in ALU) regions. Optionally use an ALU bed to hard filter all calls in this region. Steps are provided below to download an ALU region bed.
--sv-systematic-noise $SV_SYSTEMATIC_NOISE_BEDPE
Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). For more information, see Systematic Noise Filtering.
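The value ranges given in the table above can be checked before launching a run. The sketch below is not a DRAGEN feature: the function name and the dictionary-based input are hypothetical, and treating 'max' and 'mean' as the only noise methods is an assumption based on the two values mentioned above.

```python
def check_somatic_options(opts):
    """Return a list of range problems for a dict of option name -> value."""
    problems = []
    dist = opts.get("--vc-combine-phased-variants-distance")
    if dist is not None and not 0 <= dist <= 15:
        problems.append("phased-variant distance must be in [0, 15]")
    vaf = opts.get("--vc-target-vaf")
    if vaf is not None and not 0.0 <= vaf <= 1.0:
        problems.append("target VAF must be in [0, 1]")
    method = opts.get("--vc-systematic-noise-method")
    if method is not None and method not in ("max", "mean"):
        problems.append("noise method must be 'max' or 'mean'")
    return problems
```

Options that are absent from the dictionary are skipped, which mirrors leaving them at their defaults.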
--heme-sv true
Configures DRAGEN to use SV settings for Liquid Tumors (e.g., AML/MLL).
--sv-systematic-noise $SV_SYSTEMATIC_NOISE_BEDPE
A prebuilt systematic noise BEDPE file can be downloaded from the DRAGEN Software Support Site page
--sv-min-scored-variant-size $MIN_SCORED_VAR_SIZE
100000
--enable-hla
Enable the HLA typer (by default, this setting genotypes only class 1 genes).
--hla-enable-class-2
Extend genotyping to HLA class 2 genes.
--umi-source qname/fastq/bamtag
Specify the input type for the UMI sequence. For more information, see UMI Options.
--umi-library-type random-duplex/random-simplex/nonrandom-duplex
Set the batch option for the type of UMI correction to apply. For more information, see UMI Options.
--umi-nonrandom-whitelist $WHITELIST
If the UMIs are nonrandom, enter the path to a customized file of valid UMI sequences.
--umi-min-supporting-reads 2
Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. For more information, see UMI Options.
--umi-metrics-interval-file $UMI_TARGET_BED
Enter the path to the target regions file in BED format.
--umi-emit-multiplicity both
Set the consensus sequence type to output. DRAGEN UMI allows you to collapse duplex sequences from the two strands of the original molecules. For more information, see Merge Duplex UMIs.
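The effect of --umi-min-supporting-reads can be illustrated with a toy collapsing routine. It is a simplification under stated assumptions: reads are grouped by exact UMI and position, and the consensus is a per-base majority vote. DRAGEN's UMI handling (including UMI error correction and duplex merging) is more involved than this sketch.

```python
from collections import Counter, defaultdict


def collapse(reads, min_supporting_reads=2):
    """Collapse (umi, position, sequence) reads into per-family consensus reads.

    Families with fewer than `min_supporting_reads` members are discarded.
    """
    families = defaultdict(list)
    for umi, pos, seq in reads:
        families[(umi, pos)].append(seq)
    consensus = {}
    for key, seqs in families.items():
        if len(seqs) < min_supporting_reads:
            continue
        length = min(len(s) for s in seqs)
        consensus[key] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus
```

Raising the minimum removes singleton families, which trades some sensitivity for fewer consensus reads built from weak evidence.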
--vc-enable-umi-solid true / --vc-enable-umi-liquid true
When running from UMI data, one of these options is required to let DRAGEN know that the reads have been UMI-collapsed and are therefore more reliable than non-UMI reads. Solid mode is optimized for solid tumors with post-collapsed coverage rates of ~200–300X and target allele frequencies of 5% and higher. Liquid mode is optimized for a liquid biopsy pipeline with post-collapsed coverage rates of ~2000–2500X and target allele frequencies of 0.4% and higher. As a rough rule of thumb, choose solid for coverage below 1000X and liquid for higher coverage.
--vc-sq-filter-threshold $THRESHOLD
Threshold for sensitivity-specificity tradeoff. The default threshold is 4 (solid) or 2 (liquid). Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.
--vc-systematic-noise $SYSTEMATIC_NOISE_FILE
Systematic noise filter. In tumor-only variant calling, this filter is essential for removing systematic noise observed in normal samples. Prebuilt systematic noise files are available for download on the DRAGEN Software Support Site page. Alternatively, a systematic noise file can be generated by running the somatic TO pipeline on normal samples. We recommend using a systematic noise file based on normal samples that match the library prep of the tumor samples.
--vc-somatic-hotspots somatic_hotspots_GRCh38.vcf.gz
Hotspots file. By default, DRAGEN treats positions in the COSMIC database as hotspots, assigning an increased prior probability to variants at these positions. Use this option to override with a custom hotspots file if a list of positions of interest is available.
--vc-combine-phased-variants-distance $DIST
Combining phased variants. By default, DRAGEN does not combine nearby phased calls into MNVs or indels. To combine such calls, set this parameter to a value greater than zero, indicating the maximum distance at which calls should be combined. The recommended distance when enabling this feature is 15 base pairs. The valid range is [0, 15].
--vc-enable-germline-tagging true --enable-variant-annotation true --variant-annotation-data $NIRVANA_ANNOTATION_FOLDER --variant-annotation-assembly $REFERENCE
Germline filtering. Enable to tag variants as germline or somatic based on population databases. $REFERENCE can be GRCh37 or GRCh38 (GRCh37 is compatible with hs37d5 and hg19). The Nirvana annotation database is downloadable at this page.
--vc-target-vaf FLOAT
This option is available starting in V4.2. --vc-target-vaf selects the variant allele frequencies of interest: the variant caller aims to detect variants with allele frequencies equal to or larger than this setting. The setting does not apply a hard filter, so variants with allele frequencies below the selected threshold can still be detected. On high-coverage, clean datasets, a lower target VAF may help increase sensitivity. On noisy samples (such as FFPE), a higher target VAF may help reduce false positives. Using a low target VAF may also increase runtime. The valid range is [0, 1]. The default is 0.03 (or 0.001 when --vc-enable-umi-liquid=true).
--vc-systematic-noise-method
The 'max' method is recommended for WGS and results in a more aggressive filter. The 'mean' method is recommended for UMI/PANELs/WES and results in a less aggressive filter. The default is specified in the noise file header.
--vc-excluded-regions-bed $BED
Some FFPE samples may have a high rate of FP calls in SINE (and specifically in ALU) regions. Optionally use an ALU bed to hard filter all calls in this region. Steps are provided below to download an ALU region bed.
--cnv-enable-gcbias-correction true
Enable or disable GC bias correction when generating target counts. For more information, see GC Bias Correction.
--cnv-segmentation-mode $SEG_MODE
Specifies the segmentation algorithm to perform. For more information, see Segmentation.
--sv-systematic-noise $SYSTEMATIC_NOISE_BEDPE
Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). For more information, see Systematic Noise Filtering.
--tmb-vaf-threshold FLOAT
Variant minimum allele frequency for usable variants.
0.05 (default)
0.002
--msi-coverage-threshold INT
Minimum coverage for a microsatellite
60 (default)
500
--msi-distance-threshold FLOAT
Minimum Jensen-Shannon distance between tumor and normal for a microsatellite
0.1 (default)
0.02
--enable-hla
Enable the HLA typer. By default, this setting genotypes class 1 genes only.
--hla-as-filter-min-threshold
Internal option to set the minimum alignment score threshold.
--hla-as-filter-ratio-threshold
Minimum alignment score of a read mate to be considered.
--hla-enable-class-2
Extend genotyping to HLA class 2 genes.
--heme-cnv true
Configures DRAGEN to use CNV settings for HEME.
--cnv-normal-cnv-vcf $CNV_NORMAL_VCF
Specify germline CNVs from the matched normal sample. For more information, see Germline-aware Mode.
--vc-sq-filter-threshold $THRESHOLD
Threshold for sensitivity-specificity tradeoff. The default threshold is 17.5. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.
--vc-systematic-noise $SYSTEMATIC_NOISE_FILE
Systematic noise filter. In tumor-normal calling, this filter is recommended for removing systematic noise observed in normal samples. Prebuilt systematic noise files are available for download on the DRAGEN Software Support Site page. Alternatively, a systematic noise file can be generated by running the somatic TO pipeline on normal samples. We recommend using a systematic noise file based on normal samples that match the library prep of the tumor samples.
--vc-somatic-hotspots somatic_hotspots_GRCh38.vcf.gz
Hotspots file. By default, DRAGEN treats positions in the COSMIC database as hotspots, assigning an increased prior probability to variants at these positions. Use this option to override with a custom hotspots file if a list of positions of interest is available.
--vc-combine-phased-variants-distance $DIST
Combining phased variants. By default, DRAGEN does not combine nearby phased calls into MNVs or indels. To combine such calls, set this parameter to a value greater than zero, indicating the maximum distance at which calls should be combined. The recommended distance when enabling this feature is 15 base pairs. The valid range is [0, 15].
--vc-enable-liquid-tumor-mode true
Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.
--vc-override-tumor-pcr-params-with-normal false
Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.
--vc-target-vaf FLOAT
This option is available starting in V4.2. --vc-target-vaf selects the variant allele frequencies of interest: the variant caller aims to detect variants with allele frequencies equal to or larger than this setting. The setting does not apply a hard filter, so variants with allele frequencies below the selected threshold can still be detected. On high-coverage, clean datasets, a lower target VAF may help increase sensitivity. On noisy samples (such as FFPE), a higher target VAF may help reduce false positives. Using a low target VAF may also increase runtime. The valid range is [0, 1]. The default is 0.03 (or 0.001 when --vc-enable-umi-liquid=true).
--vc-systematic-noise-method
The 'max' method is recommended for WGS and results in a more aggressive filter. The 'mean' method is recommended for UMI/PANELs/WES and results in a less aggressive filter. The default is specified in the noise file header.
--vc-excluded-regions-bed $BED
Some FFPE samples may have a high rate of FP calls in SINE (and specifically in ALU) regions. Optionally use an ALU bed to hard filter all calls in this region. Steps are provided below to download an ALU region bed.
--heme-sv true
Configure DRAGEN to use SV settings for HEME.
--sv-enable-liquid-tumor-mode true
Enable liquid tumor mode. For more information, see Liquid Tumor Mode.
--sv-tin-contam-tolerance $TIN_CONTAM_TOLERANCE
Set the Tumor-in-Normal (TiN) contamination tolerance level. For more information, see Liquid Tumor Mode.
--sv-systematic-noise $SYSTEMATIC_NOISE_BEDPE
Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). For more information, see Systematic Noise Filtering.
--enable-hla
Enable the HLA typer. By default, this setting genotypes class 1 genes only.
--hla-enable-class-2
Extend genotyping to HLA class 2 genes.
--cnv-enable-gcbias-correction true
Enable or disable GC bias correction when generating target counts. For more information, see GC Bias Correction.
--cnv-segmentation-mode $SEG_MODE
Specifies the segmentation algorithm to perform. For more information, see Segmentation.
--cnv-population-b-allele-vcf $CNV_POP_VCF
Specifies a population SNV catalog for ASCN CNV. For more information on specifying b-allele loci, see Specification of B-Allele Loci.
--vc-sq-filter-threshold $THRESHOLD
Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.
--vc-systematic-noise $SYSTEMATIC_NOISE_FILE
Systematic noise filter. In tumor-only variant calling, this filter is essential for removing systematic noise observed in normal samples. Prebuilt systematic noise files are available for download on the DRAGEN Software Support Site page. Alternatively, a systematic noise file can be generated by running the somatic TO pipeline on normal samples. We recommend using a systematic noise file based on normal samples that match the library prep of the tumor samples.
--vc-somatic-hotspots somatic_hotspots_GRCh38.vcf.gz
Hotspots file. By default, DRAGEN treats positions in the COSMIC database as hotspots, assigning an increased prior probability to variants at these positions. Use this option to override with a custom hotspots file if a list of positions of interest is available.
--vc-combine-phased-variants-distance $DIST
Combining phased variants. By default, DRAGEN does not combine nearby phased calls into MNVs or indels. To combine such calls, set this parameter to a value greater than zero, indicating the maximum distance at which calls should be combined. The recommended distance when enabling this feature is 15 base pairs. The valid range is [0, 15].
--vc-enable-germline-tagging true --enable-variant-annotation true --variant-annotation-data $NIRVANA_ANNOTATION_FOLDER --variant-annotation-assembly $REFERENCE
Germline filtering. Enable to tag variants as germline or somatic based on population databases. $REFERENCE can be GRCh37 or GRCh38 (GRCh37 is compatible with hs37d5 and hg19). The Nirvana annotation database is downloadable at this page.
--vc-target-vaf FLOAT
This option is available starting in V4.2. --vc-target-vaf selects the variant allele frequencies of interest: the variant caller aims to detect variants with allele frequencies equal to or larger than this setting. The setting does not apply a hard filter, so variants with allele frequencies below the selected threshold can still be detected. On high-coverage, clean datasets, a lower target VAF may help increase sensitivity. On noisy samples (such as FFPE), a higher target VAF may help reduce false positives. Using a low target VAF may also increase runtime. The valid range is [0, 1]. The default is 0.03 (or 0.001 when --vc-enable-umi-liquid=true).
--vc-systematic-noise-method
The 'max' method is recommended for WGS and results in a more aggressive filter. The 'mean' method is recommended for UMI/PANELs/WES and results in a less aggressive filter. The default is specified in the noise file header.
--vc-excluded-regions-bed $BED
Some FFPE samples may have a high rate of FP calls in SINE (and specifically in ALU) regions. Optionally use an ALU bed to hard filter all calls in this region. Steps are provided below to download an ALU region bed.
--enable-hla
Enable the HLA typer. By default, this setting genotypes class 1 genes only.
--hla-enable-class-2
Extend genotyping to HLA class 2 genes.
--cnv-enable-gcbias-correction true
Enable or disable GC bias correction when generating target counts. For more information, see GC Bias Correction.
--cnv-segmentation-mode $SEG_MODE
Specifies the segmentation algorithm to perform. For more information, see Segmentation.
--vc-sq-filter-threshold $THRESHOLD
Threshold for sensitivity-specificity tradeoff. The default threshold is 17.5. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.
--vc-systematic-noise $SYSTEMATIC_NOISE_FILE
Systematic noise filter. In tumor-normal calling, this filter is recommended for removing systematic noise observed in normal samples. Prebuilt systematic noise files are available for download on the DRAGEN Software Support Site page. Alternatively, a systematic noise file can be generated by running the somatic TO pipeline on normal samples. We recommend using a systematic noise file based on normal samples that match the library prep of the tumor samples.
--vc-somatic-hotspots somatic_hotspots_GRCh38.vcf.gz
Hotspots file. By default, DRAGEN treats positions in the COSMIC database as hotspots, assigning an increased prior probability to variants at these positions. Use this option to override with a custom hotspots file if a list of positions of interest is available.
--vc-combine-phased-variants-distance $DIST
Combining phased variants. By default, DRAGEN does not combine nearby phased calls into MNVs or indels. To combine such calls, set this parameter to a value greater than zero, indicating the maximum distance at which calls should be combined. The recommended distance when enabling this feature is 15 base pairs. The valid range is [0, 15].
--vc-enable-liquid-tumor-mode true
Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.
--vc-override-tumor-pcr-params-with-normal false
Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.
--vc-target-vaf FLOAT
This option is available starting in V4.2. --vc-target-vaf selects the variant allele frequencies of interest: the variant caller aims to detect variants with allele frequencies equal to or larger than this setting. The setting does not apply a hard filter, so variants with allele frequencies below the selected threshold can still be detected. On high-coverage, clean datasets, a lower target VAF may help increase sensitivity. On noisy samples (such as FFPE), a higher target VAF may help reduce false positives. Using a low target VAF may also increase runtime. The valid range is [0, 1]. The default is 0.03 (or 0.001 when --vc-enable-umi-liquid=true).
--vc-systematic-noise-method
The 'max' method is recommended for WGS and results in a more aggressive filter. The 'mean' method is recommended for UMI/PANELs/WES and results in a less aggressive filter. The default is specified in the noise file header.
--vc-excluded-regions-bed $BED
Some FFPE samples may have a high rate of FP calls in SINE (and specifically in ALU) regions. Optionally use an ALU bed to hard filter all calls in this region. Steps are provided below to download an ALU region bed.
--sv-enable-liquid-tumor-mode true
Enable liquid tumor mode. For more information, see Liquid Tumor Mode.
--sv-tin-contam-tolerance $TIN_CONTAM_TOLERANCE
Set the Tumor-in-Normal (TiN) contamination tolerance level. For more information, see Liquid Tumor Mode.
--enable-hla
Enable the HLA typer. By default, this setting genotypes class 1 genes only.
--hla-enable-class-2
Extend genotyping to HLA class 2 genes.
--umi-source qname/fastq/bamtag
Specify the input type for the UMI sequence. For more information, see UMI Options.
--umi-library-type random-duplex/random-simplex/nonrandom-duplex
Set the batch option for the different UMI correction modes. For more information, see UMI Options.
--umi-nonrandom-whitelist $WHITELIST
If the UMIs are nonrandom, enter the path to a file of customized, valid UMI sequences.
--umi-min-supporting-reads 2
Specify the number of matching UMI input reads required to generate a consensus read. Any family with insufficient supporting reads is discarded. For more information, see UMI Options.
--umi-metrics-interval-file $UMI_TARGET_BED
Enter the path for target region in BED format.
--umi-emit-multiplicity both
Set the consensus sequence type to output. DRAGEN UMI allows you to collapse duplex sequences from the two strands of the original molecules. For more information, see Merge Duplex UMIs.
--vc-enable-umi-solid true / --vc-enable-umi-liquid true
When running from UMI data, one of these options is required to tell DRAGEN that the reads have been UMI-collapsed and are therefore more reliable than non-UMI reads. Solid mode is optimized for solid tumors with post-collapsed coverage rates of ~200–300X and target allele frequencies of 5% and higher. Liquid mode is optimized for a liquid biopsy pipeline with post-collapsed coverage rates of ~2000–2500X and target allele frequencies of 0.4% and higher. As a rough rule of thumb, choose solid for coverage below 1000X and liquid for higher coverage.
--vc-sq-filter-threshold $THRESHOLD
Threshold for sensitivity-specificity tradeoff. The default threshold is 4 (solid) / 2 (liquid). Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.
--vc-systematic-noise $SYSTEMATIC_NOISE_FILE
Systematic noise filter. In tumor-normal calling, this filter is recommended for removing systematic noise observed in normal samples. Prebuilt systematic noise files are available for download on the DRAGEN Software Support Site page. Alternatively, a systematic noise file can be generated by running the somatic TO pipeline on normal samples. We recommend using a systematic noise file based on normal samples that match the library prep of the tumor samples.
--vc-somatic-hotspots somatic_hotspots_GRCh38.vcf.gz
Hotspots file. By default, DRAGEN treats positions in the COSMIC database as hotspots, assigning an increased prior probability to variants at these positions. Use this option to override with a custom hotspots file if a list of positions of interest is available.
--vc-combine-phased-variants-distance $DIST
Combining phased variants. By default, DRAGEN does not combine nearby phased calls into MNVs or indels. To combine such calls, set this parameter to a value greater than zero, indicating the maximum distance at which calls should be combined. The recommended distance when enabling this feature is 15 base pairs. The valid range is [0, 15].
--vc-enable-liquid-tumor-mode true
Tumor-in-normal contamination. Only use if there is some tumor leakage in the normal control.
--vc-override-tumor-pcr-params-with-normal false
Mixed sample preparation. Only use if the tumor and normal samples exhibit different PCR (indel) noise patterns, e.g., due to using different sample preparation.
--vc-target-vaf FLOAT
This option is available starting in V4.2. --vc-target-vaf selects the variant allele frequencies of interest: the variant caller aims to detect variants with allele frequencies equal to or larger than this setting. The setting does not apply a hard filter, so variants with allele frequencies below the selected threshold can still be detected. On high-coverage, clean datasets, a lower target VAF may help increase sensitivity. On noisy samples (such as FFPE), a higher target VAF may help reduce false positives. Using a low target VAF may also increase runtime. The valid range is [0, 1]. The default is 0.03 (or 0.001 when --vc-enable-umi-liquid=true).
--vc-systematic-noise-method
The 'max' method is recommended for WGS and results in a more aggressive filter. The 'mean' method is recommended for UMI/PANELs/WES and results in a less aggressive filter. The default is specified in the noise file header.
--vc-target-vaf FLOAT
In FFPE samples with UMI Simplex collapsing it may be beneficial to increase the vc-target-vaf to 0.2 or 0.3. In FFPE samples with UMI Duplex collapsing some of the strand specific FFPE deamination noise may be removed by the duplex collapsing so that the default vc-target-vaf of 0.01 may remain appropriate.
--vc-excluded-regions-bed $BED
Some FFPE samples may have a high rate of FP calls in SINE (and specifically in ALU) regions. Optionally use an ALU bed to hard filter all calls in this region. Steps are provided below to download an ALU region bed.
--cnv-enable-gcbias-correction true
Enable or disable GC bias correction when generating target counts. For more information, see GC Bias Correction.
--cnv-segmentation-mode $SEG_MODE
Specifies the segmentation algorithm to perform. For more information, see Segmentation.
--sv-enable-liquid-tumor-mode true
Enable liquid tumor mode. For more information, see Liquid Tumor Mode.
--sv-tin-contam-tolerance $TIN_CONTAM_TOLERANCE
Set the Tumor-in-Normal (TiN) contamination tolerance level. For more information, see Liquid Tumor Mode.
--tmb-vaf-threshold FLOAT
Variant minimum allele frequency for usable variants.
0.05 (default)
0.002
--msi-coverage-threshold INT
Minimum coverage for a microsatellite
60 (default)
500
--msi-distance-threshold FLOAT
Minimum Jensen-Shannon distance between tumor and normal for a microsatellite
0.1 (default)
0.02
--enable-hla
Enable the HLA typer. By default, this setting genotypes class 1 genes only.
--hla-as-filter-min-threshold
Internal option to set the minimum alignment score threshold.
--hla-as-filter-ratio-threshold
Minimum alignment score of a read mate to be considered.
--hla-enable-class-2
Extend genotyping to HLA class 2 genes.
Required Inputs
--enable-kmer-class-db-builder
Enables the Kmer Classifier Database Builder. (Default=false).
--kmer-class-db-builder-input-file
Headerless, tab-delimited file where each line is (1) the path to a reference fasta file and (2) the associated taxid. When using --kmer-class-db-builder-taxids-as-seq-name, the second column is required but ignored.
--output-file-prefix
Prefix for all output files.
--output-directory
Directory for all output files.
--kmer-class-db-builder-kmer-length
Kmer length (Range: [4, 31]).
--kmer-class-db-builder-gmer-length
Gmer length (must be >= kmer length. Range: [4, 64]).
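The required inputs above can be prepared and sanity-checked before launching the builder. The following is a minimal Python sketch, not part of DRAGEN: the helper names `validate_lengths` and `write_input_file` are hypothetical, and the checks only restate the ranges and file layout described above.

```python
def validate_lengths(kmer_length, gmer_length):
    """Check the ranges stated for the kmer and gmer length options."""
    if not 4 <= kmer_length <= 31:
        raise ValueError("kmer length must be in [4, 31]")
    if not 4 <= gmer_length <= 64:
        raise ValueError("gmer length must be in [4, 64]")
    if gmer_length < kmer_length:
        raise ValueError("gmer length must be >= kmer length")


def write_input_file(path, references):
    """Write a headerless, tab-delimited file of (fasta path, taxid) pairs.

    `references` is an iterable of (fasta_path, taxid) tuples.
    """
    with open(path, "w") as handle:
        for fasta_path, taxid in references:
            handle.write(f"{fasta_path}\t{taxid}\n")
```

The resulting file is what would be passed to --kmer-class-db-builder-input-file; the FASTA paths and taxids are supplied by the user.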
Optional Inputs
--kmer-class-db-builder-tax-tree-file
.tri file with nodes in the taxonomic tree for a classifier database (not required if building a binner database). Headerless, tab-delimited file where each line has (1) the child node taxid and (2) the parent node taxid. Root of tree must be 1 and have a parent of 0.
--kmer-class-db-builder-protein
Set to indicate input sequences are protein sequences. (Default=false).
--kmer-class-db-builder-taxids-to-keep
File with taxids to keep. If set, any kmers with taxids not in this file will be excluded from database.
--kmer-class-db-builder-num-categories
Set to build a binner database with this number of categories. Max is 25 categories, assumes categories are from 2^0..2^n sequentially. The categories take the place of taxids in the input file.
--kmer-class-db-builder-save-weights
Set to build classification database that saves all kmers / taxids / weights.
--kmer-class-db-builder-kmer-cutoff
Cutoff that excludes k-mers that are found in more than cutoff number of taxids when building a database using --kmer-class-db-builder-save-weights. Helps speed up classification. (Default=1000).
--kmer-class-db-builder-mask-bits
Number of bits to mask in kmer before building / searching. (Default=7).
--kmer-class-db-builder-num-cpus
Option to set the number of CPUs available for processing.
--kmer-class-db-builder-num-kmers-per-bucket
Set to output number of kmers in each minimizer bucket. (Default=false).
--kmer-class-db-builder-include-lowercase
Set to include kmers with lowercase bases (usually repeatmasked). (Default=false).
--kmer-class-db-builder-taxids-as-seq-name
Set to indicate that the reference fastas listed in the input file have taxids as sequence name. In this case, the second column of the input file is ignored. (Default=false).
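The layout of the taxonomic tree file described for --kmer-class-db-builder-tax-tree-file can be checked before a build. The following is a minimal Python sketch under stated assumptions: `check_tax_tree` is a hypothetical helper, not a DRAGEN tool, and the rules that each child appears only once and that every parent is itself listed are assumptions added here, not requirements stated in this guide.

```python
def check_tax_tree(lines):
    """Validate child/parent taxid lines and return a {child: parent} dict."""
    parents = {}
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 2 tab-delimited columns")
        child, parent = int(fields[0]), int(fields[1])
        if child in parents:
            # Assumption: a node has exactly one parent in the tree.
            raise ValueError(f"line {number}: taxid {child} listed twice")
        parents[child] = parent
    # Stated requirement: the root must be 1 and have a parent of 0.
    if parents.get(1) != 0:
        raise ValueError("root taxid 1 with parent 0 is missing")
    for child, parent in parents.items():
        # Assumption: every non-root node points at a listed node.
        if child != 1 and parent not in parents:
            raise ValueError(f"taxid {child} has unknown parent {parent}")
    return parents
```

A typical call would be `check_tax_tree(open(tree_path))`, where `tree_path` is the user's own .tri file.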
This recipe is for processing panel data for RNA workflows.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.
Configure the INPUT options
Configure the OUTPUT options
Configure the RNA MAP/ALIGN options
Configure the QUANT options
Configure the SPLICE options
Configure the FUSION options
Configure the VARIANT options
We recommend using a linear (non-pangenome) reference for RNA analysis. For more details, refer to Dragen Reference Support.
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
Amplicon data
If you are running amplicon data, set --enable-rna-amplicon true --amplicon-target-bed <AMPLICON_BED_PATH>.
If RNA amplicon mode is enabled and the amplicon BED file already includes the gene names, you do not need to set the ENRICH options; DRAGEN reads the enriched gene names from the fifth column of the amplicon BED file.
SPLICE options
You can provide a list of normal splice variants to reduce noisy calls. The file should be a tab-separated file with the following first four columns:
contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)
strand (0: undefined, 1: +, 2: -)
Use the optional --rna-splice-variant-normals <SPLICE_NORMAL_FILE_PATH> option to provide the normal splice variants.
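The four-column layout of the normal splice variants file described above can be produced with a few lines of Python. This is a minimal sketch, not a DRAGEN utility: `format_splice_normal` and `STRAND_CODES` are hypothetical names, and the check that the last base is not before the first base is an assumption added for safety.

```python
# Strand codes as listed above: 0 undefined, 1 plus, 2 minus.
STRAND_CODES = {"undefined": 0, "+": 1, "-": 2}


def format_splice_normal(contig, first_base, last_base, strand):
    """Return one tab-separated line for a normal splice junction.

    Coordinates are 1-based, matching the column description above.
    """
    if first_base < 1 or last_base < first_base:
        raise ValueError("expected 1-based coordinates with first <= last")
    return f"{contig}\t{first_base}\t{last_base}\t{STRAND_CODES[strand]}"
```

Writing one such line per junction to a text file gives a file that can be passed to --rna-splice-variant-normals.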
This recipe is for processing whole genome sequencing data for somatic heme tumor only workflows.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.
Configure the INPUT options
Configure the OUTPUT options
Configure MAP/ALIGN depending on if realignment is desired or not
Configure the VARIANT CALLERs based on the application
Configure any additional options
Build up the necessary options for each component separately, so that they can be re-used in the final command line.
We recommend using a linear (non-pangenome) reference for somatic analysis. For more details, refer to Dragen Reference Support.
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
Optional settings per component are listed below. Full option list at this page.
--heme-cnv true
Configures DRAGEN to use CNV settings for Liquid Tumors (e.g., AML/MLL).
--vc-sq-filter-threshold $THRESHOLD
Threshold for sensitivity-specificity tradeoff. The default threshold is 3. Raise this value to improve specificity at the cost of sensitivity, or lower it to improve sensitivity at the cost of specificity.
--vc-systematic-noise $SYSTEMATIC_NOISE_FILE
--vc-somatic-hotspots somatic_hotspots_GRCh38.vcf.gz
Hotspots file. By default, DRAGEN treats positions in the COSMIC database as hotspots, assigning an increased prior probability to variants at these positions. Use this option to override with a custom hotspots file if a list of positions of interest is available.
--vc-combine-phased-variants-distance $DIST
Combining phased variants. By default, DRAGEN does not combine nearby phased calls into MNVs or indels. To combine such calls, set this parameter to a value greater than zero, indicating the maximum distance at which calls should be combined. The recommended distance when enabling this feature is 15 base pairs. The valid range is [0, 15].
--vc-enable-germline-tagging true --enable-variant-annotation true --variant-annotation-data $NIRVANA_ANNOTATION_FOLDER --variant-annotation-assembly $REFERENCE
--vc-target-vaf FLOAT
This option is available starting in V4.2. --vc-target-vaf selects the variant allele frequencies of interest: the variant caller aims to detect variants with allele frequencies equal to or larger than this setting. The setting does not apply a hard filter, so variants with allele frequencies below the selected threshold can still be detected. On high-coverage, clean datasets, a lower target VAF may help increase sensitivity. On noisy samples (such as FFPE), a higher target VAF may help reduce false positives. Using a low target VAF may also increase runtime. The valid range is [0, 1]. The default is 0.03 (or 0.001 when --vc-enable-umi-liquid=true).
Generic SNV noise files (including a HEME-specific WGS noise file) can be downloaded from the DRAGEN Software Support Site page.
When possible, we recommend building a pipeline-specific systematic noise file that matches the library prep and sequencer of interest:
Step 1. Run DRAGEN somatic tumor-only on each of approximately 50 normal samples:
Gather the full paths to the VCFs from step 1 in ${VCF_LIST} by specifying 1 file per line.
Step 2. Generate the final noise file with:
--sv-systematic-noise $SV_SYSTEMATIC_NOISE_BEDPE
--heme-sv true
Configures DRAGEN to use SV settings for Liquid Tumors (e.g., AML/MLL).
--sv-min-scored-variant-size $MIN_SCORED_VAR_SIZE
100000
--sv-somatic-ins-tandup-hotspot-regions-bed $BED_FILE
Specify a BED file of ITD hotspot regions to increase sensitivity for calling ITDs in somatic variant analysis. By default, DRAGEN SV automatically selects a reference-specific hotspots BED file. The default file includes FLT3, ARHGEF7, KMT2A, and UBTF exonic regions with 300 bp of padding on both sides.
To build the SV systematic noise file
You can generate systematic noise BEDPE files from normal samples collected using the library prep, sequencing system, and panels of interest.
To generate a BEDPE file, do as follows.
Run DRAGEN somatic tumor-only on normal samples with --sv-detect-systematic-noise set to true to generate VCF output per normal sample.
Build the BEDPE file from those VCFs using --sv-build-systematic-noise-vcfs-list, which takes the list of input VCFs from the previous step (one VCF per line). An example command line is provided below.
You can also build systematic noise BEDPE files in the cloud using the DRAGEN Baseline Builder App on BaseSpace.
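The list of input VCFs described above (one VCF path per line) can be assembled with a short script. The following Python sketch is an illustration only and is not the DRAGEN command line: `write_vcf_list`, the directory layout, and the `*.vcf.gz` pattern are assumptions and should be adapted to wherever the per-sample VCFs were written.

```python
from pathlib import Path


def write_vcf_list(vcf_dir, list_path, pattern="*.vcf.gz"):
    """Write the absolute path of each matching VCF, one per line.

    Returns the list of paths that were written.
    """
    paths = sorted(str(p.resolve()) for p in Path(vcf_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern} in {vcf_dir}")
    Path(list_path).write_text("\n".join(paths) + "\n")
    return paths
```

The file written to `list_path` is the kind of list that --sv-build-systematic-noise-vcfs-list expects; the same one-path-per-line layout is described for the SNV noise file list.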
Pre-built SV systematic noise file
The following prebuilt systematic noise files for WGS are available for download on the DRAGEN Software Support Site page. To generate these noise files, we used 46 unrelated normal samples.
IDPF_WGS_hg38_v3.0.0_systematic_noise.sv.bedpe.gz
>200x coverage with 2x150bp reads for the HG38 reference
3.0.0
4.3.*
We recommend using --enable-variant-deduplication true to filter out all small indels in the structural variant VCF that also appear, and are passing, in the small variant VCF (PASS in the FILTER column of the small variant VCF file). With this feature, DRAGEN creates a new VCF containing the variants in the SV VCF that do not match a variant in the SNV VCF file. The new deduplicated SV VCF file has the prefix passed by --output-file-prefix followed by sv.small_indel_dedup.vcf.gz. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases. This feature is useful, for example, when running both the SV and SNV callers in somatic workflows: combining the callers can increase sensitivity, and deduplication prevents the same variant from being reported twice in genes such as FLT3 and KMT2A.
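To illustrate what trimming and left shifting mean when comparing an SV call with a small variant call, the following Python sketch applies the generic textbook normalization to a single variant. It is not DRAGEN's implementation: `normalize_variant` is a hypothetical helper, positions are 0-based indexes into an in-memory reference string, and the `max_shift` cap of 500 simply mirrors the limit mentioned above.

```python
def normalize_variant(ref_seq, pos, ref, alt, max_shift=500):
    """Trim and left-shift one variant.

    ref_seq: reference sequence of the contig (string)
    pos: 0-based index in ref_seq of the first base of `ref`
    ref, alt: non-empty REF and ALT allele strings
    Returns the normalized (pos, ref, alt).
    """
    shifted = 0
    # While both alleles end in the same base, drop that base. If an
    # allele is down to one base, shift left instead: drop the shared
    # last base and prepend the preceding reference base to both.
    while ref[-1] == alt[-1]:
        if len(ref) > 1 and len(alt) > 1:
            ref, alt = ref[:-1], alt[:-1]
            continue
        if pos == 0 or shifted >= max_shift:
            break
        pos -= 1
        base = ref_seq[pos]
        ref, alt = base + ref[:-1], base + alt[:-1]
        shifted += 1
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

Two calls that describe the same indel at different positions inside a repeat normalize to the same (position, REF, ALT) triple, which is what makes an exact comparison between the two VCFs possible.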
Systematic noise filter. In tumor-only variant calling, this filter is essential for removing systematic noise observed in normal samples. Prebuilt systematic noise files are available for download on the DRAGEN Software Support Site page. Alternatively, a systematic noise file can be generated by running the somatic TO pipeline on normal samples. We recommend using a systematic noise file based on normal samples that match the library prep of the tumor samples.
Germline filtering. Enable to tag variants as germline or somatic based on population databases. $REFERENCE can be GRCh37 or GRCh38 (GRCh37 is compatible with hs37d5 and hg19). The Nirvana annotation database is downloadable at this page.
Systematic noise BEDPE file containing the set of noisy paired regions (optionally gzip or bzip compressed). For more information, see Systematic Noise Filtering.
This recipe is for processing Whole Transcriptome Sequencing data for RNA workflows.
For most scenarios, simply creating the union of the command line options from the single caller scenarios will work.
Configure the INPUT options
Configure the OUTPUT options
Configure the RNA MAP/ALIGN options
Configure the QUANT options
Configure the SPLICE options
Configure the FUSION options
Configure the VARIANT options
We recommend using a linear (non-pangenome) reference for RNA analysis. For more details, refer to Dragen Reference Support.
The following are partial templates that can be used as starting points. Adjust them accordingly for your specific use case.
For SPLICE options, you can provide a list of normal splice variants to reduce noisy calls. The file should be a tab-separated file with the following first four columns:
contig name
first base of the splice junction (1-based)
last base of the splice junction (1-based)
strand (0: undefined, 1: +, 2: -)
Use the optional --rna-splice-variant-normals <SPLICE_NORMAL_FILE_PATH> option to provide the normal splice variants.