DRAGEN Host Software

You use the DRAGEN host software program dragen to build and load reference genomes, and then to analyze sequencing data by decompressing the data, mapping, aligning, sorting, duplicate marking with optional removal, and variant calling.

Invoke the software using the dragen command. The command line options are described in the following sections.

Command line options can also be set in a configuration file. For more information on configuration files, see Configuration Files. If an option is set in the configuration file and is also specified on the command line, the command line option overrides the configuration file.

Command-line Options

The following are examples of frequently used command lines:

  • Build Reference/Hash Table

    dragen --build-hash-table true --ht-reference <REF_FASTA> \
    --output-directory <REF_DIRECTORY>  [options]
  • Run Map/Align and Variant Caller (*.fastq to *.vcf)

    dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
    --output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
    [-2 <FASTQ2>] --RGID <RG0> --RGSM <SM0> --enable-variant-caller true
  • Run Map/Align (*.fastq to *.bam)

    dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
    --output-file-prefix <FILE_PREFIX> [options] \
    -1 <FASTQ1> [-2 <FASTQ2>] \
    --RGID <RG0> --RGSM <SM0>
  • Run Variant Caller Only (*.bam to *.vcf)

    dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
    --output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
    --enable-map-align false \
    --enable-variant-caller true
  • Re-map and Run Variant Caller (*.bam to *.vcf)

    dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
    --output-file-prefix <FILE_PREFIX> [options] -b <BAM> \
    --enable-map-align true \
    --enable-variant-caller true
  • Run BCL Converter (BCL to *.fastq)

    dragen --bcl-conversion-only true --bcl-input-directory <BCL_DIRECTORY> \
    --output-directory <OUT_DIRECTORY>
  • Run RNA Map/Align (*.fastq to *.bam)

    dragen -r <REF_DIRECTORY> --output-directory <OUT_DIRECTORY> \
    --output-file-prefix <FILE_PREFIX> [options] -1 <FASTQ1> \
    [-2 <FASTQ2>] --enable-rna true

For recommended command lines in typical use cases, see DRAGEN Recipes.

Reference Genome Options

Before you can use the DRAGEN system for aligning reads, you must load a reference genome and its associated hash tables onto the PCIe card. For information on preprocessing a reference genome's FASTA files into the native DRAGEN binary reference and hash table formats, see Prepare a Reference Genome. You must also specify the directory containing the preprocessed binary reference and hash tables with the -r [or --ref-dir] option. This argument is always required.

Use the following command to load the reference genome and hash tables to DRAGEN card memory separately from processing reads.

dragen -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

Use the -l (--force-load-reference) option to force the reference genome to load even if it is already loaded.

dragen -l -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149

The time needed to load the reference genome depends on the size of the reference, but for typical recommended settings, it takes approximately 30 to 60 seconds.

Operating Modes

DRAGEN has two primary modes of operation, as follows:

  • Mapper/aligner

  • Variant caller

DRAGEN is capable of performing each mode independently or as an end-to-end solution. DRAGEN also allows you to enable and disable decompression, sorting, duplicate marking, and compression along the DRAGEN pipeline.

  • Full pipeline mode: To execute full pipeline mode, set --enable-variant-caller to true and provide input as unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN performs decompression, mapping, aligning, sorting, and optional duplicate marking, and feeds the results directly into the variant caller to produce a VCF file. In this mode, DRAGEN uses parallel stages throughout the pipeline to drastically reduce the overall run time.

  • Map/align mode: Map/align mode is enabled by default. Input is unmapped reads in *.fastq, *.bam, or *.cram format. DRAGEN produces an aligned and sorted BAM or CRAM file. To mark duplicate reads at the same time, set --enable-duplicate-marking to true.

  • Variant caller mode: To execute variant caller mode, set the --enable-variant-caller option to true and the --enable-map-align option to false. The input must be a mapped and aligned BAM/CRAM file. DRAGEN produces a VCF file. DRAGEN force-enables re-sorting of the BAM because the variant caller requires a number of read statistics and estimates to operate effectively, so setting --enable-sort to false is overridden. BAM files that have not already been duplicate marked cannot be duplicate marked in the DRAGEN pipeline prior to variant calling. To take advantage of the mark-duplicates feature, use the end-to-end mode of operation.

  • RNA-Seq data: To enable processing of RNA-Seq-based data, set --enable-rna to true. DRAGEN uses the RNA spliced aligner during the mapper/aligner stage. DRAGEN dynamically switches between the required modes of operation.

  • Bisulfite MethylSeq data: To enable processing of Bisulfite MethylSeq data, set the --enable-methylation-calling option to true. DRAGEN automates the processing of data for Lister (directional) and Cokus (nondirectional) protocols to generate a single BAM with bismark-compatible tags. Alternatively, you can run DRAGEN in a mode that produces a separate BAM file for each combination of the C->T and G->A converted reads and references. To enable this mode of processing, you need to build a set of reference hash tables with --ht-methylated enabled, and run DRAGEN with the appropriate --methylation-protocol setting.
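The mode switches described above can be summarized in a small helper. This is a minimal sketch, not part of DRAGEN: the function name dragen_mode_flags and the mode keywords are hypothetical, and only options named in this section are used. The example prints a command line rather than running it.

```shell
# Hypothetical helper: print the DRAGEN options that select an operating mode.
# The mode keywords (full, map-align, ...) are made up for this sketch.
dragen_mode_flags() {
    case "$1" in
        full)           echo "--enable-variant-caller true" ;;
        map-align)      echo "--enable-map-align true" ;;
        variant-caller) echo "--enable-map-align false --enable-variant-caller true" ;;
        rna)            echo "--enable-rna true" ;;
        methylation)    echo "--enable-methylation-calling true" ;;
        *)              echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) a variant-caller-only command line with placeholders.
echo "dragen -r <REF_DIRECTORY> -b <BAM> $(dragen_mode_flags variant-caller)"
```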

Output Options

The following command line options for output are mandatory:

  • --output-directory <out_dir>: Specifies the output directory for generated files.

  • --output-file-prefix <out_prefix>: Specifies the output file prefix. DRAGEN appends the appropriate file extension onto this prefix for each generated file.

  • -r [ --ref-dir ]: Specifies the reference hash table.

The following examples do not include these mandatory options.

For mapping and aligning, the output is sorted and compressed into BAM format by default before saving to disk. You can control the output format from the map/align stage with the --output-format <SAM|BAM|CRAM> option. If the output file exists, the software issues a warning and exits. To force overwrite if the output file already exists, use the -f [ --force ] option.

For example, each of the following commands outputs to a compressed BAM file and forces overwrite of an existing output file:

dragen ... -f

dragen ... -f --output-format bam

To generate a BAI-format BAM index file (*.bai), set --enable-bam-indexing to true.

The following example outputs to a SAM file and forces overwrite of an existing output file:

dragen ... -f --output-format sam

The following example outputs to a CRAM file and forces overwrite of an existing output file:

dragen ... -f --output-format cram

DRAGEN only outputs lossless CRAM files. All QNAMEs and BAM tags are preserved in the CRAM.

Alignment Tags

DRAGEN can generate mismatch difference (MD) tags, as described in the BAM standard. The feature is turned off by default because there is a small performance cost to generate these strings. To generate MD tags, set --generate-md-tags to true.

To generate ZS:Z alignment status tags, set --generate-zs-tags to true. These tags are only generated in the primary alignment and when a read has suboptimal alignments qualifying for secondary output (even if none were output because --Aligner.sec-aligns was set to 0). The following are valid tag values:

Tag        Tag meaning
ZS:Z:R     Multiple alignments with similar score were found.
ZS:Z:NM    No alignment was found.
ZS:Z:QL    An alignment was found but it was below the quality threshold.
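To see how these values appear in practice, the following sketch scans SAM text for ZS:Z tags and prints the meaning from the table above. The function name zs_report and the sample record are made up for illustration. It relies only on the SAM convention that optional tags start at the twelfth tab-separated column.

```shell
# Hypothetical helper: read SAM records on stdin and print the read name,
# the ZS:Z value, and the meaning listed in the table above.
zs_report() {
    awk -F '\t' '
        BEGIN {
            meaning["R"]  = "multiple alignments with similar score"
            meaning["NM"] = "no alignment found"
            meaning["QL"] = "alignment below quality threshold"
        }
        /^@/ { next }                      # skip SAM header lines
        {
            for (i = 12; i <= NF; i++)     # optional tags start at column 12
                if ($i ~ /^ZS:Z:/) {
                    v = substr($i, 6)      # text after the "ZS:Z:" prefix
                    print $1 "\t" v "\t" ((v in meaning) ? meaning[v] : "unknown value")
                }
        }'
}

# Example with a made-up, tab-separated SAM record.
printf 'read1\t0\tchr1\t100\t3\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\tZS:Z:R\n' | zs_report
```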

To generate SA:Z tags, set --generate-sa-tags to true (the default). These tags provide alignment information (position, CIGAR, orientation) of groups of supplementary alignments, which are useful in structural variant calling.

To generate pair score in a ps:i tag, set --generate-ps-tags to true (false by default for DNA, true for RNA). The pair score is used in DRAGEN for computing MAPQ and can be used to check how well alignment candidate pairs score against each other.

DRAGEN can also output mate alignment tags. To generate the mate CIGAR (in the MC:Z tag), set --generate-mc-tags to true (this is the default). To generate the mate mapping quality (in the MQ:i tag), set --generate-mq-tags to true (this is the default). To generate the mate sequence (in the R2:Z tag) and mate base qualities (in the Q2:Z tag), set --generate-r2-tags to true and --generate-q2-tags to true, respectively (both default to false). When enabled, R2:Z and Q2:Z tags are emitted only for improperly paired read alignments with a fragment length of at least 1000 bp. The methylation pipelines currently do not support the output of mate alignment tags.

DRAGEN also outputs a graph alignment tag (ga:Z) when applicable. This is controlled by --generate-ga-tags (true by default for DNA, false for RNA). The tag describes the best alt contig alignment that improved the score of a primary-contig alignment at its liftover position. It can also describe read alignments to alt contigs for which there is no liftover and the primary alignment is unmapped, for example when the read maps best to an alt contig describing a novel long insertion that is not present in the reference. In addition, read alignments that have been marked as unmapped because they map to auto-detected decoy contigs not present in the original user-provided FASTA also have their alignments described in the ga tag.

The ga tag uses the same format as the SA tag used to describe supplementary alignments.
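Because the ga tag shares the SA layout, one parser can handle both. The sketch below assumes the SA:Z field order given in the SAM optional-fields specification (rname,pos,strand,CIGAR,mapQ,NM, with each alignment terminated by a semicolon). The function name split_sa_value and the sample value are hypothetical.

```shell
# Hypothetical helper: split an SA-style tag value into one line per alignment.
# Field order follows the SAM specification for SA:Z:
#   rname,pos,strand,CIGAR,mapQ,NM;
split_sa_value() {
    printf '%s\n' "$1" | tr ';' '\n' | awk -F ',' 'NF == 6 {
        printf "contig=%s pos=%s strand=%s cigar=%s mapq=%s nm=%s\n", $1, $2, $3, $4, $5, $6
    }'
}

# Example with a made-up value (the "ga:Z:" prefix is already removed).
split_sa_value 'chr1_alt,1500,+,100M,60,0;'
```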

CRAM Output

When CRAM is selected as output, DRAGEN generates a CRAM file with the following features:

  • CRAM format V3.0 is produced

  • The CRAM is lossless. Lossy compression is never used and cannot be enabled

  • Quality score compression is lossless. Read names are preserved

  • Only the GZIP compression algorithm is used, for maximum compatibility. bzip2 and lzma are not used. rANS is used for quality scores

  • All input BAM tags are preserved

  • The reference used to compress the CRAM file is the DRAGEN hash table provided during the map/align run. When decompressing the CRAM with a FASTA file and third-party tools, you must use the FASTA that was used to generate the hash table.

  • A CRAM index is produced in .crai format

  • CRAM output is only possible when sort is enabled. CRAM alignments will always be positionally sorted

The following default settings are used for CRAM output:

CRAM option          Value                Description
SEQS_PER_SLICE       2000                 Max sequences per slice
BASES_PER_SLICE      SEQS_PER_SLICE*500   Max bases per slice
SLICE_PER_CNT        1                    Max slices per container
embed_ref            0                    Do not embed reference sequence
noref                0                    Do not use non-reference-based encoding
multiseq             -1                   Do not use multiple references per slice
unsorted             0                    Do not use unsorted mode
use_bz2              0                    Do not compress using bzip2
use_lzma             0                    Do not compress using lzma
use_rans             1                    Use rANS for quality score compression
binning              NONE                 Qual score binning not used
preserve_aux_order   1                    Preserve all aux tags and order (incl RG,NM,MD)
preserve_aux_size    0                    Aux tag sizes not preserved ('i', 's', 'c')
lossy_read_names     0                    Preserve read names
lossy                0                    Do not enable Illumina 8 quality-binning system
ignore_md5           0                    Enable all checking of checksums
decode_md            0                    Do not (re)generate MD and NM tags

Input Options

DRAGEN can process reads in FASTQ format or BAM/CRAM format. DRAGEN supports the following compression options for FASTQ input files.

  • Uncompressed

  • gzip or bgzip compression

  • ORA compression. To use ORA compression, you must provide an ORA reference and reference directory. See ORA Compression and Decompression.

If your input FASTQ files are gzipped, DRAGEN automatically decompresses the files using hardware-accelerated decompression, and then streams the reads into the mapper. If your files end in *.ora, DRAGEN automatically decompresses the files using ORA decompression, and then streams the reads into the mapper. The same FASTQ command-line options apply for all compression formats.
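DRAGEN selects the decompression path itself, but a pre-flight check can confirm which one applies to a given file name. The following sketch classifies by suffix only. The function name fastq_compression is hypothetical, and treating a plain .fastq suffix as uncompressed is an assumption. A bgzip file is reported as gzip because both use the .gz suffix.

```shell
# Hypothetical helper: classify a FASTQ path by file name suffix only.
# This is a sanity check, not a description of DRAGEN's internal detection.
fastq_compression() {
    case "$1" in
        *.ora)   echo "ora" ;;            # ORA-compressed, needs --ora-reference
        *.gz)    echo "gzip" ;;           # gzip or bgzip
        *.fastq) echo "uncompressed" ;;   # assumption: plain-text FASTQ
        *)       echo "unknown" ;;
    esac
}

fastq_compression sample_S1_L001_R1_001.fastq.gz
fastq_compression sample_S1_L001_R1_001.fastq.ora
```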

FASTQ Input Files

FASTQ input files can be single-ended or paired-end, as shown in the following examples.

  • Single-ended in one FASTQ file (-1 option)

    dragen -r <REF_DIR> -1 <fastq> \
    --output-directory <OUT_DIR> --output-file-prefix <OUTPUT_PREFIX> \
    --RGID <RGID> --RGSM <RGSM>
  • Paired-end in two matched FASTQ files (-1 and -2 options)

    dragen -r <REF_DIR> -1 <fastq1> -2 <fastq2> \
    --output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
    --RGID <RGID> --RGSM <RGSM>
  • Paired-end in a single interleaved FASTQ file (--interleaved (-i) option)

    dragen -r <REF_DIR> -1 <INTERLEAVED_FASTQ> -i \
    --RGID <RGID> --RGSM <RGSM>

Both bcl2fastq and the DRAGEN BCL command use a common file naming convention, as follows:

<SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz

Older versions of bcl2fastq and DRAGEN could segment FASTQ samples into multiple files to limit file size or to decrease the time to generate them.

For example:

RDRS182520_S1_L001_R1_001.fastq.gz

RDRS182520_S1_L001_R1_002.fastq.gz

...

RDRS182520_S1_L001_R1_008.fastq.gz

These files do not need to be concatenated to be processed together by DRAGEN. To map/align any sample, provide the first file in the series (-1 <FileName>_001.fastq). DRAGEN reads all segment files in the sample consecutively. This applies to both FASTQ file sequences specified with the -1 and -2 options for paired-end input, and to compressed fastq.gz files. To turn this behavior off, set --enable-auto-multifile to false on the command line.
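To preview which segment files belong to a series before starting a run, a sketch like the following can help. It only mimics the behavior described above and is not DRAGEN's own lookup. The function name list_segments is hypothetical, and stopping at the first missing segment number is an assumption.

```shell
# Hypothetical helper: given the first segment (..._001.fastq.gz), print the
# consecutive segment files that exist on disk, stopping at the first gap.
# Usage: list_segments /staging/RDRS182520_S1_L001_R1_001.fastq.gz
list_segments() (
    first=$1
    stem=${first%_001.fastq.gz}
    [ "$stem" != "$first" ] || { echo "not a _001.fastq.gz file: $first" >&2; exit 1; }
    n=1
    while :; do
        f=$(printf '%s_%03d.fastq.gz' "$stem" "$n")
        [ -e "$f" ] || break
        printf '%s\n' "$f"
        n=$((n + 1))
    done
)
```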

DRAGEN can also optionally read multiple files by the sample name given in the file name, which can be used to combine samples that have been distributed across multiple BCL lanes or flow cells. To enable this feature, set the --combine-samples-by-name option to true.

If the FASTQ files specified on the command-line use the Casava 1.8 file naming convention shown above and additional files in the same directory share that sample name, those files and all their segments are processed automatically. Note that sample name, read number, and file extension must match. Index barcode and lane number can differ.
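The matching rule can be illustrated by splitting file names into their fields. This is a sketch under assumptions: the lane field is taken to be L followed by digits and the read field R followed by one digit, as in the example names above, and the S<#> field is allowed to differ along with the lane. The function names are hypothetical, and sample IDs containing spaces are not handled.

```shell
# Hypothetical helper: split a FASTQ file name that follows the convention
#   <SampleID>_S<#>_<Lane>_<Read>_<segment#>.fastq.gz
# into "sample sample_number lane read segment". Prints nothing on no match.
parse_fastq_name() {
    basename "$1" | sed -n \
        's/^\(.*\)_S\([0-9][0-9]*\)_L\([0-9][0-9]*\)_\(R[0-9]\)_\([0-9][0-9]*\)\.fastq\.gz$/\1 \2 \3 \4 \5/p'
}

# Two files are grouped when sample name and read number match.
# The S<#> field and the lane may differ (assumption for this sketch).
same_sample_and_read() (
    set -- $(parse_fastq_name "$1") $(parse_fastq_name "$2")
    [ $# -eq 10 ] && [ "$1" = "$6" ] && [ "$4" = "$9" ]
)

parse_fastq_name /staging/RDRS182520_S1_L001_R1_001.fastq.gz
```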

To avoid impacting system performance, input files must be located on a fast file system.

Multiple FASTQ Input Files

To process multiple FASTQ input files as one sample, it is recommended that you use the --fastq-list <csv file name> option to specify the name of a CSV file containing the list of FASTQ files, instead of using the --combine-samples-by-name option.

For example:

dragen -r <ref_dir> --fastq-list <CSV_FILE> \
--fastq-list-sample-id <Sample_ID> --output-directory <OUT_DIR> \
--output-file-prefix <OUT_PREFIX>

Using a CSV file avoids having to concatenate the FASTQ files in cases where there are multiple FASTQ files for a sample, such as top-up scenarios or FASTQ files split across lanes. It also lets you name the FASTQ input files freely, read input from multiple subdirectories, and specify BAM tags explicitly for each read group. DRAGEN automatically generates a CSV file of the correct format during BCL conversion to FASTQ. The CSV file is named fastq_list.csv and contains an entry for each FASTQ file or paired-end file pair produced during the run.

FASTQ CSV File Format

The first line of the CSV file specifies the title of each column, and is followed by one or more data lines. All lines in the CSV file must contain the same number of comma-separated values and should not contain white space or other extraneous characters.

Column titles are case-sensitive. The following column titles are required:

  • RGID: Read Group

  • RGSM: Sample ID

  • RGLB: Library

  • Lane: Flow cell lane

  • Read1File: Full path to a valid FASTQ input file

  • Read2File: Full path to a valid FASTQ input file. Required for paired-end input. If not using paired-end input, leave empty.

Each FASTQ file referenced in the CSV list can be referenced only once. All values in the Read2File column must be either nonempty and reference valid files, or they must all be empty.
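The structural rules above can be checked before a run. The following validator is a sketch, not a DRAGEN tool: the function name validate_fastq_list is hypothetical. It checks the required column titles, the value count per line, white space, duplicate file references, and the all-or-none rule for Read2File. It does not check that the FASTQ files exist.

```shell
# Hypothetical validator for a fastq-list CSV, covering the rules stated above.
# Prints one message per problem and returns non-zero if any were found.
validate_fastq_list() {
    awk -F ',' '
        NR == 1 {
            ncol = NF
            for (i = 1; i <= NF; i++) col[$i] = i
            n = split("RGID RGSM RGLB Lane Read1File Read2File", req, " ")
            for (i = 1; i <= n; i++)
                if (!(req[i] in col)) { print "missing column: " req[i]; bad = 1 }
            if (bad) exit
            next
        }
        NF != ncol { print "line " NR ": expected " ncol " values, found " NF; bad = 1 }
        /[ \t]/    { print "line " NR ": contains white space"; bad = 1 }
        {
            r1 = $(col["Read1File"]); r2 = $(col["Read2File"])
            if (seen[r1]++) { print "line " NR ": file listed twice: " r1; bad = 1 }
            if (r2 != "" && seen[r2]++) { print "line " NR ": file listed twice: " r2; bad = 1 }
            if (r2 == "") empty++; else paired++
        }
        END {
            if (empty && paired) { print "Read2File must be all empty or all set"; bad = 1 }
            exit (bad ? 1 : 0)
        }' "$1"
}
```

For example, validate_fastq_list fastq_list.csv && echo "structure looks valid" reports success only when none of the checks fail.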

When generating a BAM file using fastq-list input, one read group is generated per unique RGID value. The BAM header contains the following RG tags for each read group:

  • ID (from RGID)

  • SM (from RGSM)

  • LB (from RGLB)

You can specify additional tags for each read group by adding a column title. The column title must be only four upper-case characters and begin with RG. For example, to add a PU (platform unit) tag, add a column named RGPU and specify the value for each read group in this column. All column titles must be unique.
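A column title can be checked against this rule with a shell pattern. The sketch below reads "upper-case characters" as upper-case letters, which is an assumption. The function name valid_rg_column is hypothetical.

```shell
# Hypothetical check: an extra read-group column title must be exactly four
# upper-case characters and begin with RG (for example RGPU).
valid_rg_column() {
    case "$1" in
        RG[[:upper:]][[:upper:]]) return 0 ;;
        *)                        return 1 ;;
    esac
}

for title in RGPU RGpu RGPLX; do
    if valid_rg_column "$title"; then echo "$title: ok"; else echo "$title: rejected"; fi
done
```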

A fastq-list file can contain files for more than one sample. If a fastq-list file contains only one unique RGSM entry, then no additional options need to be specified, and DRAGEN processes all files listed in the fastq-list file. If there is more than one unique RGSM entry in a fastq-list file, --fastq-list-sample-id <SampleID> must be used in addition to --fastq-list <filename> to process only a specific sample from the CSV file. Only the entries in the fastq-list file with an RGSM value that match the specified SampleID are processed.

  • Independent processing and output for multiple individual samples in one run is not supported.

  • To process all listed files together as one sample, regardless of the RGSM value, the option --fastq-list-all-samples=true can be used instead of --fastq-list-sample-id.

Note

For a single run, only one BAM and VCF output file are produced because all input read groups are expected to belong to the same sample. To process multiple samples independently from one BCL conversion run, DRAGEN must be run multiple times using different values for the --fastq-list-sample-id option.

There is no option to specify groupings or subsets of RGSM values for more complex filtering, but the fastq-list file can be modified to achieve the same effect.
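Both approaches can be scripted. The sketch below lists the unique RGSM values in a fastq-list file, prints one DRAGEN command per sample without running it, and writes a reduced fastq-list for a chosen subset of RGSM values. All function names are hypothetical, and the reference and output paths are placeholders supplied by the caller.

```shell
# Hypothetical helper: list the unique RGSM values in a fastq-list CSV.
# The RGSM column is located by its title, so column order does not matter.
list_rgsm() {
    awk -F ',' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "RGSM") c = i; next }
        c && !seen[$c]++ { print $c }' "$1"
}

# Print (rather than run) one DRAGEN command per sample.
# $1 = fastq-list CSV, $2 = reference directory, $3 = output root.
print_per_sample_commands() {
    list_rgsm "$1" | while IFS= read -r sm; do
        echo "dragen -r $2 --fastq-list $1 --fastq-list-sample-id $sm" \
             "--output-directory $3/$sm --output-file-prefix $sm"
    done
}

# Write a reduced fastq-list with only the rows whose RGSM is listed.
# Usage: filter_fastq_list fastq_list.csv normal-1 normal-3 > subset.csv
filter_fastq_list() (
    csv=$1; shift
    awk -F ',' -v keep="$*" '
        BEGIN { n = split(keep, k, " "); for (i = 1; i <= n; i++) want[k[i]] = 1 }
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "RGSM") c = i; print; next }
        c && ($c in want)' "$csv"
)
```

A reduced file that still contains more than one RGSM value can then be processed as one sample with --fastq-list-all-samples=true.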

The following is an example FASTQ list CSV file with the required columns:

RGID,RGSM,RGLB,Lane,Read1File,Read2File
CACACTGA.1,RDSR181520,UnknownLibrary,1,/staging/RDSR181520_S1_L001_R1_001.fastq,/staging/RDSR181520_S1_L001_R2_001.fastq
AGAACGGA.1,RDSR181521,UnknownLibrary,1,/staging/RDSR181521_S2_L001_R1_001.fastq,/staging/RDSR181521_S2_L001_R2_001.fastq
TAAGTGCC.1,RDSR181522,UnknownLibrary,1,/staging/RDSR181522_S3_L001_R1_001.fastq,/staging/RDSR181522_S3_L001_R2_001.fastq
AGACTGAG.1,RDSR181523,UnknownLibrary,1,/staging/RDSR181523_S4_L001_R1_001.fastq,/staging/RDSR181523_S4_L001_R2_001.fastq

If you use the --tumor-fastq-list option for somatic input, use the --tumor-fastq-list-sample-id <SampleID> option to specify the sample ID for the corresponding FASTQ list, as shown in the following example:

dragen -r <ref_dir> --tumor-fastq-list <csv_file> \
--tumor-fastq-list-sample-id <Sample_ID> \
--output-directory <out_dir> \
--output-file-prefix <out_prefix> --fastq-list <csv_file_2> \
--fastq-list-sample-id <Sample_ID_2>

Tumor-Normal Pairs Input

If you use a fastq-list or tumor-fastq-list file comprising multiple samples (RGSMs) in somatic mode, you can use a loop to iterate through the two lists to create tumor-normal pairs for testing. Create a *.txt file with the RGSM of each normal sample to be tested (one per line), and then create a separate *.txt file with the RGSM of each tumor sample to be tested. Make sure that the tumor sample RGSMs are listed in the same order as the corresponding normal samples, and include a blank line after the last sample.

You can use the following example script to perform testing in somatic mode. Each iteration takes one entry from the tumor samples list and one entry from the normal samples list (from top to bottom) to create a tumor-normal pair as input for the DRAGEN run.

#!/bin/bash

# Reference hash table directory and input FASTQ lists
HT="/staging/HT/"
tumor_fastq_list="/staging/inputs/tumor_fastq_list.csv"
normal_fastq_list="/staging/inputs/normal_fastq_list.csv"

# One RGSM per line; tumor and normal entries must be in matching order
tumor_samples_list="/staging/inputs/tumor_samples_list.txt"
normal_samples_list="/staging/inputs/normal_samples_list.txt"

# Read one tumor RGSM (fd 3) and one normal RGSM (fd 4) per iteration
while read -u 3 -r tumor_RGSM && read -u 4 -r normal_RGSM; do
    output_dir="/staging/results/${tumor_RGSM}_${normal_RGSM}"
    mkdir -p "${output_dir}"

    dragen \
        -r "${HT}" \
        --tumor-fastq-list "${tumor_fastq_list}" \
        --tumor-fastq-list-sample-id "${tumor_RGSM}" \
        --fastq-list "${normal_fastq_list}" \
        --fastq-list-sample-id "${normal_RGSM}" \
        --output-directory "${output_dir}" \
        --output-file-prefix "${tumor_RGSM}_${normal_RGSM}"
done 3<"${tumor_samples_list}" 4<"${normal_samples_list}"

The following are examples of the FASTQ lists and samples lists used as input for the script.

Sample normal_fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_N1.1,normal-1,ILLUMINA,1,/staging/inputs/normal-1_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N1.2,normal-1,ILLUMINA,2,/staging/inputs/normal-1_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.1,normal-2,ILLUMINA,1,/staging/inputs/normal-2_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N2.2,normal-2,ILLUMINA,2,/staging/inputs/normal-2_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.1,normal-3,ILLUMINA,1,/staging/inputs/normal-3_S1_L001_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_N3.2,normal-3,ILLUMINA,2,/staging/inputs/normal-3_S1_L002_R1_001.fastq.gz,/staging/inputs/normal-3_S1_L002_R2_001.fastq.gz

Sample tumor_fastq_list.csv content:

RGPL,RGID,RGSM,RGLB,Lane,Read1File,Read2File
DRAGEN_RGPL,DRAGEN_RGID_T1.1,tumor-1,ILLUMINA,1,/staging/inputs/tumor-1_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T1.2,tumor-1,ILLUMINA,2,/staging/inputs/tumor-1_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-1_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.1,tumor-2,ILLUMINA,1,/staging/inputs/tumor-2_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T2.2,tumor-2,ILLUMINA,2,/staging/inputs/tumor-2_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-2_S1_L002_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.1,tumor-3,ILLUMINA,1,/staging/inputs/tumor-3_S1_L001_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L001_R2_001.fastq.gz
DRAGEN_RGPL,DRAGEN_RGID_T3.2,tumor-3,ILLUMINA,2,/staging/inputs/tumor-3_S1_L002_R1_001.fastq.gz,/staging/inputs/tumor-3_S1_L002_R2_001.fastq.gz

Sample normal_samples_list.txt content:

normal-1
normal-2
normal-3

Sample tumor_samples_list.txt content:

tumor-1
tumor-2
tumor-3
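Because the loop pairs entries purely by position, it is worth checking the two samples lists before starting. The following pre-check is a sketch with a hypothetical function name. It interprets the requirement for a blank line after the last sample as a terminating newline, which the shell read command needs to return the final entry.

```shell
# Hypothetical pre-check for the two samples lists used by the loop above:
# both must have the same number of lines, and each must end with a newline
# so that read sees the last entry.
check_sample_lists() (
    for f in "$1" "$2"; do
        [ -s "$f" ] || { echo "$f: missing or empty" >&2; exit 1; }
        [ -z "$(tail -c 1 "$f")" ] || { echo "$f: no newline after last sample" >&2; exit 1; }
    done
    t=$(($(wc -l < "$1")))
    n=$(($(wc -l < "$2")))
    [ "$t" -eq "$n" ] || { echo "tumor list has $t entries, normal list has $n" >&2; exit 1; }
)
```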

FASTQ ORA Input Files

You can use the same options as the other FASTQ input file types for ORA files. To use the ORA file, replace the FASTQ file name with the ORA file name and specify the ORA reference directory using --ora-reference.

See ORA Compression and Decompression for more information on ORA reference files.

The following command processes paired-end input in two matched ORA-compressed FASTQ files (-1 and -2 options).

dragen -r <REF_DIR> -1 <fastq.ora1> -2 <fastq.ora2> \
--ora-reference <ORADATA_DIR> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX> \
--RGID <RGID> --RGSM <RGSM>

BAM Input Files

BAM files can be used as input to the mapper/aligner. By default --enable-map-align is true. You can use the BAM file as input to the variant caller by setting the --enable-map-align option to false.

When you specify a BAM file as input, with map/align enabled, DRAGEN ignores any alignment information contained in the input file, and outputs new alignments for all reads.

If the input file contains paired-end reads, the reads must be ordered so that the two reads of each pair can be processed together. Other pipelines would require you to re-sort the input data set by read name. DRAGEN vastly increases the speed of this operation by pairing the input reads and sending them on to the mapper/aligner as pairs are identified. Use the --pair-by-name option to enable or disable this feature (the default is true).

Specify single-ended input in one BAM file with the (-b) and --pair-by-name=false options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name false

Specify paired-end input in one BAM file with the (-b) and --pair-by-name=true options, as follows:

dragen -r <ref_dir> -b <bam> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

CRAM Input

You can use CRAM files as input to the DRAGEN mapper/aligner and variant caller. The DRAGEN functionality available when using CRAM input is the same as when using BAM input.

By default, the CRAM compressor and decompressor use the DRAGEN reference specified with the --ref-dir option. CRAM compression is reference based, and the reference used for compression is not part of the CRAM file. Therefore, the CRAM input file must have been created with the same reference as the one provided to DRAGEN for the analysis.

DRAGEN supports the re-alignment of a CRAM input that was created with a different reference in one step. Re-aligning a CRAM file that was created with a different reference requires use of the --cram-reference option. This option will make the CRAM decompressor use the specified reference.

  • --cram-reference can be either a FASTA file or a DRAGEN hash table folder.

  • If pointing to a FASTA file, the FASTA .fai index file must be present next to the FASTA file.

  • CRAM output will always be compressed using the --ref-dir reference.
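The first two rules can be verified before launching a run. The following pre-check is a sketch with a hypothetical function name. It treats any directory as a hash table folder without inspecting its contents, which is a simplification.

```shell
# Hypothetical pre-check for a --cram-reference argument, following the rules
# listed above: a directory is treated as a DRAGEN hash table folder, and a
# FASTA file must have its .fai index next to it.
check_cram_reference() (
    ref=$1
    if [ -d "$ref" ]; then
        echo "hash-table-dir"
    elif [ -f "$ref" ]; then
        [ -f "$ref.fai" ] || { echo "missing index: $ref.fai" >&2; exit 1; }
        echo "fasta"
    else
        echo "not found: $ref" >&2
        exit 1
    fi
)
```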

Example: CRAM was created with hg19, re-analysis with hg38

# CRAM reference given as a DRAGEN hash table directory
dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <ref_dir HG19>

# CRAM reference given as a FASTA file (the .fai index must be next to it)
dragen -r <ref_dir HG38> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --cram-reference <hg19.fa>

The following option provides a CRAM input to either the mapper/aligner or the variant caller:

  • --cram-input: The name and path of the CRAM file.

For paired-end input in a single CRAM file, also set the --pair-by-name option to true, as in the following example:

dragen -r <ref_dir> --cram-input <cram> --output-directory <out_dir> \
--output-file-prefix <out_prefix> --pair-by-name true

Multiple BAM or CRAM Input Files

To provide multiple BAM input files, you can use the --bam-list <csv file name> option to specify the name of a CSV file containing the list of BAM files. For example:

dragen -r <ref_dir> --bam-list <CSV_FILE> \
--output-directory <OUT_DIR> --output-file-prefix <OUT_PREFIX>

To provide multiple CRAM input files, you can use the --cram-list <csv file name> option.

BAM or CRAM CSV Input File Format

The first line of the CSV file specifies the header containing the title for each column and each subsequent line is a data line. All lines in the CSV file must contain the same number of comma-separated values and should not contain white space or any other extraneous characters.

An example BAM CSV file:

BamFile
/path/to/bam/one
/path/to/bam/two

Column titles are case-sensitive. The following column title is required:

  • BamFile: Path to the BAM file

Only the "BamFile" column is supported at this time. Extra fields may be specified in the CSV file, but DRAGEN does not process them.

CRAM CSV input follows the same format above, with "CramFile" as the column title instead.
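A list file in this format can be generated from a set of paths. The helper below is a sketch with a hypothetical function name. It rejects paths containing white space or commas because the format description above does not allow them.

```shell
# Hypothetical helper: write a one-column CSV for --bam-list or --cram-list.
# Usage: make_alignment_list BamFile a.bam b.bam > bam_list.csv
make_alignment_list() (
    title=$1; shift
    case "$title" in
        BamFile|CramFile) ;;
        *) echo "column title must be BamFile or CramFile" >&2; exit 1 ;;
    esac
    printf '%s\n' "$title"
    for f in "$@"; do
        case "$f" in
            *[[:space:],]*) echo "unsupported character in path: $f" >&2; exit 1 ;;
        esac
        printf '%s\n' "$f"
    done
)
```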

Restrictions and Limitations:

DRAGEN bam-list and cram-list are intended to mirror manually merging BAM or CRAM files via a utility such as samtools or MergeSamFiles (Picard). As a result, using bam-list or cram-list is analogous to having a single merged BAM or CRAM input file. Some callers (for example, DRAGEN variant calling) cannot process a bam-list or cram-list composed of input files that contain multiple samples.

In the case where identical read group IDs appear across multiple files and you want to treat them as distinct read groups, you can use the --prepend-filename-to-rgid=true option to distinguish between read groups.

If enabled, the resulting output BAM or CRAM file will contain all read groups from the input BAM or CRAM files passed in the CSV list file.

Tumor-Normal Pairs Input

You can also use --tumor-bam-list <csv file name> or --tumor-cram-list <csv file name> when running with tumor-only or tumor-normal inputs to DRAGEN. The CSV file has the same format as the options described above.

BCL Input Files

BCL is the output format of Illumina sequencing systems. Under limited circumstances, DRAGEN can read directly from BCL for map-align operations, saving the time needed for conversion to FASTQ.

DRAGEN can read directly from BCL in the following circumstances:

  • Only one lane is input as part of a run (specified on the command-line).

  • The lane has only a single sample specified in the SampleSheet.csv file.

When conversion from BCL to FASTQ is required, DRAGEN provides a BCL to FASTQ converter (see DRAGEN BCL Data Conversion).

The following example command is for BCL input with only one lane of input:

dragen --bcl-input-dir <BCL_ROOT> --bcl-only-lane <num> -r <ref_dir> \
--output-directory <out_dir> --output-file-prefix <out_prefix>

For additional BCL conversion options, see Input File Types.

Handling of N Bases

One of the techniques that DRAGEN uses to optimize the handling of sequences can lead to overwriting the base quality score assigned to N base calls.

When you use the --fastq-n-quality and --fastq-offset options, the base quality scores of N base calls are overwritten with a fixed base quality. The default values for these options are 2 and 33, respectively, which together match the Illumina minimum quality of 35 (ASCII character '#').
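The arithmetic behind the defaults can be reproduced in the shell: the quality character is the one whose ASCII code equals the quality value plus the offset. The function name n_quality_char is hypothetical.

```shell
# Sketch: print the FASTQ quality character for a given quality and offset.
# The character code is quality + offset; the defaults 2 and 33 give code 35.
n_quality_char() {
    code=$(( $1 + $2 ))
    # Convert the decimal code to a three-digit octal escape and print it.
    printf "\\$(printf '%03o' "$code")\n"
}

n_quality_char 2 33
```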

Read Names for Paired-End Reads

By a common convention, read names can include suffixes (such as /1 or /2) that indicate which end of a pair the read represents. For BAM input using the --pair-by-name option, DRAGEN ignores these suffixes to find matching pair names. By default, DRAGEN uses the forward slash character as the delimiter for these suffixes and ignores the /1 and /2 when comparing names. By default, DRAGEN strips these suffixes from the original read names.

DRAGEN has the following options to control how suffixes are used:

  • To change the delimiter character for suffixes, use the --pair-suffix-delimiter option. Valid values for this option include forward slash (/), dot (.), and colon (:).

  • To preserve the entire name, including the suffixes, set --strip-input-qname-suffixes to false.

  • To append a new set of suffixes to all read names, set --append-read-index-to-name to true. The delimiter is determined by the --pair-suffix-delimiter option. By default, the delimiter is a slash, so /1 and /2 are added to the names.
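The effect of suffix stripping on name matching can be illustrated with shell parameter expansion. This sketch only mirrors the behavior described above and is not DRAGEN's implementation. The function name strip_pair_suffix is hypothetical.

```shell
# Hypothetical illustration of suffix stripping for pair-by-name matching.
# $1 is the read name, $2 the delimiter (default "/").
strip_pair_suffix() {
    d=${2:-/}
    case "$1" in
        *"$d"1|*"$d"2) name=${1%"$d"?} ;;   # drop the delimiter and the digit
        *)             name=$1 ;;           # no pair suffix: keep as is
    esac
    printf '%s\n' "$name"
}

strip_pair_suffix 'read42/1'
strip_pair_suffix 'read42.2' .
```

With the default delimiter, read42/1 and read42/2 both reduce to read42, so the two reads are recognized as mates.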

Gene Annotation Input Files

When processing RNA-Seq data, you can supply a gene annotations file by using the --annotation-file option. Providing this file improves the accuracy of the mapping and aligning stage (see Input Files). The file should conform to the GTF/GFF format specification and should list annotated transcripts that match the reference genome being mapped against. The similar GFF3 format is currently not supported, due to inconsistent contig naming between GENCODE and Ensembl. See the RNA user guide section for more details on potential issues and workarounds.

DRAGEN can take the SJ.out.tab file (see SJ.out.tab) as an annotations file to help guide the aligner in a two-pass mode of operation.

Networked Streaming

AWS S3, Azure Blob Storage, and AWS Presigned URL Input Streaming

DRAGEN can stream input files directly from an AWS S3 bucket, Azure Blob storage account, or by using AWS presigned URLs (presigned URLs are not supported for Azure Blob storage at this time). With streaming, input files are not required to be downloaded locally prior to being processed. The files are streamed over the network directly into the DRAGEN processor.

Input streaming is most beneficial for large input files. DRAGEN supports input streaming for BAMs and compressed FASTQ files. For FASTQ files, input streaming can be used in all the configurations, including single-end FASTQs, paired-end FASTQs, and FASTQ lists.

Input streaming is supported for the following use cases:

  • Mapping/aligning of FASTQ and BAM.

  • Germline and somatic small variant calling from BAM (without remapping).

For other file types that are significantly smaller in size, download them locally before running the analysis.

Streaming FASTQ Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 s3://s3-bucket-name/path/to/object_1.fastq.gz \
  -2 s3://s3-bucket-name/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 https://storage-account-name.blob.core.windows.net/path/to/object_1.fastq.gz \
  -2 https://storage-account-name.blob.core.windows.net/path/to/object_2.fastq.gz \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming FASTQ Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 https://bucket-name.amazonaws.com/path/to/object_1.fastq.gz?querystring \
  -2 https://bucket-name.amazonaws.com/path/to/object_2.fastq.gz?querystring \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b s3://s3-bucket-name/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b https://storage-account-name.blob.core.windows.net/path/to/object_1.bam \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

Streaming BAM Input Using Presigned URLs (for AWS only)

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -b https://bucket-name.amazonaws.com/path/to/object_1.bam?querystring \
  --output-directory /staging/examples/ \
  --output-file-prefix streaming

AWS S3 and Azure Blob Storage Output Streaming

DRAGEN can stream its output to an AWS S3 Bucket or an Azure Blob Storage Account Container. Output streaming is beneficial for large output files and for sharing results.

Streaming output to AWS S3

dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory s3://s3-bucket-name/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Streaming output to Azure Blob Storage Account

AZ_ACCOUNT_NAME="storage-account-name" AZ_ACCOUNT_KEY="<account-key>" dragen -f \
  -r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
  -1 SRA056922.fastq \
  --RGID object_ID \
  --RGSM sample_name \
  --output-directory https://storage-account-name.blob.core.windows.net/path/to/output \
  --intermediate-results-dir /staging/examples \
  --output-file-prefix streaming

Security and Permissions

To stream input files from, or write output to, a cloud provider's storage, you must have permission to access the remote files.

AWS S3

S3 requires AWS authentication and credentials. The authentication should already be set up on the instance you are running, for example, via IAM policies.

Azure Blob Storage Account

Azure requires authentication and environment variables. DRAGEN supports two cases: (1) managed identities and (2) storage account access keys.

To use managed identities, you must run DRAGEN on an Azure instance. The instance must have Contributor permissions (read/write) on the Storage Account that it reads from and writes to. If the instance has a single managed identity, only the AZ_ACCOUNT_NAME=<azure-storage-account-name> environment variable is required. For multiple managed identities, you must also provide the AZR_IDENT_CLIENT_ID=<client-id> environment variable, set to the client ID of the identity that can access your storage bucket. The client ID can be found on the Azure Portal.

With storage account access keys, DRAGEN can write to an Azure bucket both on and off Azure instances. For this use case, find the Storage Account Access Key and set the environment variables AZ_ACCOUNT_NAME=<azure-storage-account-name> and AZ_ACCOUNT_KEY=<account-key>.
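As a sketch of the two setups described above, the environment variables can be exported in the shell before invoking dragen. All values shown are placeholders.

```shell
# Case 1: managed identity on an Azure instance.
export AZ_ACCOUNT_NAME="storage-account-name"
# Needed only when the instance has multiple managed identities.
export AZR_IDENT_CLIENT_ID="<client-id>"

# Case 2: storage account access key (on or off Azure instances).
# export AZ_ACCOUNT_NAME="storage-account-name"
# export AZ_ACCOUNT_KEY="<account-key>"
```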

Presigned URL (AWS only)

An AWS presigned URL most likely has a query string attached to it, which provides the authentication credentials or necessary tokens to grant permission to the S3 bucket (e.g., https://bucket-name.amazonaws.com/path/to/folder?querystring). Currently, streaming input to DRAGEN from Azure presigned URLs is not supported.

Sample Sex

Use the --sample-sex command line option to control the sex karyotype input used in downstream components, such as variant callers. If a sample sex karyotype input is not specified using the command line, the sex karyotype is automatically determined. The sex karyotype input is converted to a reference sex karyotype for use in variant calling. Other components might support sex karyotype input. Refer to the corresponding section for the component you are using.

The --sample-sex option supports the following values. Values are not case-sensitive.

  • none: No sex karyotype input. Components use a default reference sex karyotype.

  • auto: The sex karyotype is estimated by the Ploidy Estimator. If using CNV calling, sex karyotype is determined using a separate sex estimation module. If DRAGEN cannot estimate the sex karyotype, then components do not have a sex karyotype input. This behavior is then the same as none. auto is the default value.

  • female: Sex karyotype input is XX.

  • male: Sex karyotype input is XY.

The following example command lines use --sample-sex to specify the sex karyotype.

--sample-sex FEMALE

--sample-sex MALE

--sample-sex NONE

If the value is none, female, or male, the Ploidy Estimator could still run and produce output, but variant callers will not use any estimated sex karyotype that is different than the sex karyotype provided via the command-line.

The sex karyotype input is converted to the reference sex karyotype for the different components as follows. See the relevant component section for more information on how --sample-sex is used.

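Reference Sex Karyotype

| Sex Karyotype Input | CNV Caller | ExpansionHunter | Ploidy Caller | Small Variant Caller | SV Caller |
|---|---|---|---|---|---|
| XX | XX | XX | XX | XX | XXYY |
| XY | XY | XY | XY | XY | XXYY |
| XXY | XY | XX | XY | XXYY | XXYY |
| XYY | XY | XY | XY | XXYY | XXYY |
| X0 | XX | XY | XX | XXYY | XXYY |
| XXXY | XY | XX | XY | XXYY | XXYY |
| XXX | XX | XX | XX | XXYY | XXYY |
| None | XX/XY | XX | XX | XXYY | XXYY |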

  • For sex karyotype input of None, CNV independently checks the coverage ratio of X and Y to determine the reference sex karyotype. Detection of minimal Y coverage will yield XY, otherwise XX.
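To make the conversion concrete, the Small Variant Caller column of the table above can be written as a lookup. This is only an illustration of the table; the function name is hypothetical and this is not how DRAGEN implements the conversion.

```shell
# Illustrative lookup of the Small Variant Caller column in the table above.
small_variant_ref_karyotype() {
  case "$1" in
    XX) echo "XX" ;;
    XY) echo "XY" ;;
    *)  echo "XXYY" ;;   # XXY, XYY, X0, XXXY, XXX, None
  esac
}

ref_for_xx=$(small_variant_ref_karyotype XX)
ref_for_xxy=$(small_variant_ref_karyotype XXY)
echo "XX -> $ref_for_xx, XXY -> $ref_for_xxy"
```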

Preservation or Stripping of BQSR Tags

The Picard Base Quality Score Recalibration (BQSR) tool produces output BAM files that include the tags BI and BD. BQSR calculates these tags relative to the exact sequence for a read. If a BAM file with BI and BD tags is used as input to the mapper/aligner with hard clipping enabled, the BI and/or BD tags can become invalid.

The recommendation is to strip these tags when using BAM files as input. To remove the BI and BD tags, set the --preserve-bqsr-tags option to false. If you preserve the tags, DRAGEN warns you to disable hard clipping.
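A command line that strips the tags might look like the following sketch. It is written as a dry run that prints the command instead of executing it, so it can be tried without a DRAGEN installation. The run helper is hypothetical, and the angle-bracket values are the same placeholders used in the command-line examples at the start of this guide.

```shell
# Dry-run helper: prints the command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Re-map a BAM and strip the BI and BD tags.
bqsr_cmd=$(run dragen \
  -r "<REF_DIRECTORY>" \
  -b "<BAM>" \
  --output-directory "<OUT_DIRECTORY>" \
  --output-file-prefix "<FILE_PREFIX>" \
  --preserve-bqsr-tags false)

echo "$bqsr_cmd"
```

Removing the run prefix turns the dry run into a real invocation.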

Read Group Options

DRAGEN assumes that all the reads in a given FASTQ belong to the same read group. DRAGEN creates a single @RG read group descriptor in the header of the output BAM file, with the ability to specify the following standard BAM attributes:

| Attribute | Argument | Description |
|---|---|---|
| ID | --RGID | Read group identifier. If you include any of the read group parameters, RGID is required. It is the value written into each output BAM record. |
| LB | --RGLB | Library. |
| PL | --RGPL | Platform/technology used to produce the reads. The BAM standard allows for values CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT and PACBIO. |
| PU | --RGPU | Platform unit, eg, flowcell-barcode.lane. |
| SM | --RGSM | Sample. |
| CN | --RGCN | Name of the sequencing center that produced the read. |
| DS | --RGDS | Description. |
| DT | --RGDT | Date the run was produced. |
| PI | --RGPI | Predicted mean insert size. |

If any of these arguments are present, DRAGEN adds an RG tag to all the output records to indicate that they are members of a read group. The following example shows a command line that includes read group parameters:

dragen --RGID 1 --RGCN Broad --RGLB Solexa-135852 \
--RGPL Illumina --RGPU 1 --RGSM NA12878 \
-r /staging/human/reference/hg19/hg19.fa.k_21.f_16.m_149 \
-1 SRA056922.fastq --output-directory /staging/tmp/ \
--output-file-prefix rg_example

When using the --fastq-list option to input multiple read groups, BAM tags (and others) are specified for each read group by adding columns to the fastq_list.csv file. Each column heading consists of four capital letters and each begins with 'RG'. For each column, each read group's values for that column are propagated to the output BAM file in an identically named tag.
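The following sketch builds a small fastq_list.csv and lists the columns that follow the read group naming rule described above. The Lane, Read1File, and Read2File column names, the file paths, and the read group values are assumptions for illustration and are not defined in this section.

```shell
# Hypothetical file contents: the non-RG column names and all values
# are assumptions for illustration.
list_dir=$(mktemp -d)
fastq_list="$list_dir/fastq_list.csv"
cat > "$fastq_list" <<'EOF'
RGID,RGSM,RGLB,Lane,Read1File,Read2File
rg1,NA12878,Solexa-135852,1,/staging/fastq/s1_L001_R1.fastq.gz,/staging/fastq/s1_L001_R2.fastq.gz
rg2,NA12878,Solexa-135852,2,/staging/fastq/s1_L002_R1.fastq.gz,/staging/fastq/s1_L002_R2.fastq.gz
EOF

# Read group tag columns: four capital letters, starting with RG.
rg_columns=$(head -n 1 "$fastq_list" | tr ',' '\n' | grep -E '^RG[A-Z]{2}$' | paste -s -d ' ' -)
echo "read group columns: $rg_columns"
```

Here each row describes one read group, and the RGID, RGSM, and RGLB values would be carried into the ID, SM, and LB tags of the output BAM.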

License Options

To suppress the license status message at the end of the run, use the --lic-no-print option. The following shows an example of the license status message:

LICENSE_MSG| =====================================================
LICENSE_MSG| License report
LICENSE_MSG|   Genome status [ACxxxxxxxxxxx] : used 1263.9 Gbases since 2018-Feb-15 (1263886160894 bases, unlimited)
LICENSE_MSG|   Genome  bases [ACxxxxxxxxxxx] : 202000000
LICENSE_MSG|   Genome  bases [total]         : 202000000

Autogenerated MD5SUM for BAM and CRAM Output Files

An MD5SUM file is generated automatically for BAM and CRAM output files. The MD5SUM file has the same name as the output file, with an .md5sum extension appended (eg, whole_genome_run_123.bam.md5sum). The MD5SUM file is a single-line text file that contains the md5sum of the output file, which exactly matches the output of the Linux md5sum command.

The MD5SUM calculation is performed as the output file is written, so there is no measurable performance impact (compared to the Linux md5sum command, which can take several minutes for a 30x BAM).
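After copying an output file to another system, the .md5sum file can be used to confirm the copy is intact. The sketch below creates a stand-in file and its .md5sum companion so that it runs without DRAGEN; in a real run, DRAGEN writes both files. Only the hash field is compared, so the check does not depend on whether the .md5sum file also contains a file name.

```shell
# Stand-ins for a DRAGEN output file and its companion .md5sum file.
md5_dir=$(mktemp -d)
bam="$md5_dir/example.bam"
printf 'placeholder BAM content\n' > "$bam"
md5sum "$bam" > "$bam.md5sum"

# Compare only the hash field of each side.
expected=$(cut -d ' ' -f 1 "$bam.md5sum")
actual=$(md5sum "$bam" | cut -d ' ' -f 1)

if [ "$expected" = "$actual" ]; then
  md5_status="OK"
else
  md5_status="MISMATCH"
fi
echo "$md5_status"

rm -rf "$md5_dir"
```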

Configuration Files

Command line options can be stored in a configuration file. The location of the default configuration file is <INSTALL_PATH>/config/dragen-user-defaults.cfg. You can override this file by using the --config-file (-c) option to specify a different file. The configuration file used for a given run supplies the default settings for that run, any of which can be overridden by command line options.

The recommended approach is to use the dragen-user-defaults.cfg file as a template to create default settings for different use cases. Copy dragen-user-defaults.cfg, rename the copy, then modify the new file for the specific use-case. Best practice is to put options that rarely change into the configuration file and to specify options that vary from run to run on the command line.
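A use-case configuration file and the matching command line might look like the following sketch. The file name is hypothetical, and writing each option as its name without the leading dashes followed by = and the value is an assumption; check it against the dragen-user-defaults.cfg template in your installation. The command is printed as a dry run, not executed.

```shell
# Hypothetical per-use-case configuration file with rarely changing options.
cfg_dir=$(mktemp -d)
cfg="$cfg_dir/dragen-germline.cfg"
cat > "$cfg" <<'EOF'
enable-map-align = true
enable-variant-caller = true
output-directory = /staging/examples/
EOF

# Dry-run helper: prints the command instead of executing it.
run() { printf '+ %s\n' "$*"; }

# Options that vary from run to run stay on the command line.
cfg_cmd=$(run dragen -c "$cfg" \
  -r "<REF_DIRECTORY>" \
  -1 "<FASTQ1>" -2 "<FASTQ2>" \
  --RGID "<RG0>" --RGSM "<SM0>" \
  --output-file-prefix "<FILE_PREFIX>")

echo "$cfg_cmd"
```

Any option given on the command line would still override the value in the file.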

Cloud Authentication and Licensing

Authentication is required for users that run DRAGEN on the cloud, with the Bring-Your-Own-License (BYOL) model, outside of integrated Illumina cloud products. A valid license is required to enable authentication and usage quotas.

License Server

DRAGEN cloud runs access the DRAGEN License Server to validate the credentials and licenses against the intended run. BYOL users must provide credentials and must allow access to the license server URL. The following command line option can be used to pass the credentials to DRAGEN: --lic-server=https://<user>:<pass>@license.edicogenome.com.

An alternative way to provide license server credentials is by using a license credentials file. The --lic-credentials input command line option can be used to provide the full path to the license credentials file. This provides a more secure way to pass cloud credentials, which avoids accidental credentials leaks from command line console logs.

A license credentials file is a plain text file audited by the customer. The format is the same as the DRAGEN config files: <key> = <value>, with each {key,value} pair on its own line. The following key names must be used: credentials1 and credentials2.
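One way to create such a file so that only its owner can read it is sketched below. The file name and the placeholder values are hypothetical; substitute the credentials issued for your license.

```shell
# Hypothetical location for the credentials file.
cred_dir=$(mktemp -d)
cred_file="$cred_dir/dragen-license.cfg"

# Create the file inside a subshell with a restrictive umask,
# so it is readable and writable by the owner only.
(
  umask 077
  cat > "$cred_file" <<'EOF'
credentials1 = <credential-1>
credentials2 = <credential-2>
EOF
)

# The full path is then passed to DRAGEN, for example:
#   dragen ... --lic-credentials "$cred_file"
cred_keys=$(grep -c '^credentials[12] = ' "$cred_file")
echo "keys written: $cred_keys"
```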

AWS Instance Metadata Service Support (IMDSv1/IMDSv2)

DRAGEN uses AWS Instance Metadata Service (IMDS) to identify its own metadata within the AWS environment, including location, identity, and configuration.

DRAGEN supports both AWS IMDSv1 and the more secure AWS IMDSv2. AWS IMDSv1 is request/response based: it accesses metadata through HTTP requests to a specific endpoint on the instance. AWS IMDSv2 uses token-based authentication with time-limited tokens.

AWS IMDSv2 must be enabled on the AWS instance; otherwise, IMDSv1 is used by default. DRAGEN software automatically detects the IMDS version in use and adapts its behavior accordingly.

Instance Identity

DRAGEN cloud runs access the instance identity document via the Instance Metadata Service as part of the authentication. It uses the IPv4 local address. If access to the local address is not allowed, the authentication fails. Alternatively, if you do not want to allow applications to access this service, you can save the instance identity document(s) and point DRAGEN to the saved files instead. The method for providing instance identity documents to the software is described below.

  • Save the instance identity document(s) as files from the user's instance, and provide them as inputs to the DRAGEN software with each run.

  • The instance identity document(s) only need to be saved once per AWS account and region, and those files can be re-used subsequently.

Examples for saving instance identity document(s):

AWS

IMDSv1

curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/pkcs7" -o /opt/instance-identity/pkcs7
curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/document" -o /opt/instance-identity/document
cp /opt/instance-identity/pkcs7 /opt/instance-identity/signature

IMDSv2

# Request a session token (valid for 300 seconds) and keep it in TOKEN.
TOKEN=$(curl -X PUT -H "X-aws-ec2-metadata-token-ttl-seconds: 300" -H "X-aws-ec2-metadata-token: required" --noproxy "*" "http://169.254.169.254/latest/api/token")
# Use the token to save each identity document as a file.
curl -H "X-aws-ec2-metadata-token: $TOKEN" --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/document" -o /opt/instance-identity/document
curl -H "X-aws-ec2-metadata-token: $TOKEN" --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/signature" -o /opt/instance-identity/signature
curl -H "X-aws-ec2-metadata-token: $TOKEN" --noproxy "*" "http://169.254.169.254/latest/dynamic/instance-identity/pkcs7" -o /opt/instance-identity/pkcs7

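There should be three files in this folder, named pkcs7, signature, and document. Run DRAGEN using the --lic-instance-id-location ${instance_identity} command line option, where ${instance_identity} is the folder that contains these files (/opt/instance-identity in the examples above).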

Azure

curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/instance?api-version=2020-09-01" -o /opt/instance-identity/instance
curl -v -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/attested/document?api-version=2020-09-01" -o /opt/instance-identity/document

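There should be two files in this folder, named instance and document. Run DRAGEN using the --lic-instance-id-location ${instance_identity} command line option, where ${instance_identity} is the folder that contains these files (/opt/instance-identity in the example above).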
