DRAGEN FASTQC

DRAGEN FastQC is a tool for calculating common metrics used for quality control of high-throughput sequencing data. The tool is modeled after the metrics generated by Babraham Institute's FastQC tool.

The metrics are generated automatically in all DRAGEN map-align workflows with no additional run time, and are written to a CSV file named <PREFIX>.fastqc_metrics.csv. All metrics are calculated and reported separately for each mate of a read pair.

For users who are interested only in sample QC, or who want FastQC results alone, DRAGEN provides a mode that generates the fastqc_metrics.csv file directly.

By default, DRAGEN FastQC and read trimming run as preprocessing steps to standard sequence alignment workflows. If DNA alignment is not needed, or if QC results are needed more quickly, the mapping and BAM output portions of the workflow can be disabled. In that case, the workflow outputs only key metric files and runs approximately 70% faster. To enable this mode, add --fastqc-only=true to the DRAGEN command line.
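As a minimal sketch of enabling this mode from a wrapper script, the Python helper below adds --fastqc-only=true to an existing DRAGEN argument list. The helper name and the placeholder arguments are illustrative assumptions and not part of DRAGEN; only the --fastqc-only=true option comes from the text above.

```python
def with_fastqc_only(dragen_args):
    """Return a copy of a DRAGEN argument list with FastQC-only mode enabled.

    `dragen_args` is whatever command line you already use (hypothetical here).
    The option is appended only if it is not already present.
    """
    args = list(dragen_args)
    if "--fastqc-only=true" not in args:
        args.append("--fastqc-only=true")
    return args


# Placeholder command: substitute your own input and output options.
base_command = ["dragen", "<your usual input and output options>"]
print(" ".join(with_fastqc_only(base_command)))
```

The helper returns a new list rather than modifying its input, so the same base command can be reused for a full map-align run.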

If FastQC runs stand-alone, then the license will not be consumed. If FastQC runs with map-align enabled, then the license will be consumed.

Differences from the Babraham Institute's FastQC

DRAGEN FastQC is a complete reimplementation of the original FastQC tool developed by the Babraham Institute (henceforth BI-FastQC). The reimplementation of FastQC in DRAGEN, however, has been modified to take advantage of the hardware-acceleration provided by the DRAGEN Field-Programmable Gate Array (FPGA) for a significant speed improvement. As such, there are some differences in how the values are calculated and the resulting metrics will not be exactly identical between the two tools. The most significant differences are described below.

  • Binning: BI-FastQC uses a customizable binning strategy with a default of 5bp bins, while DRAGEN uses an algorithmic binning strategy based on the Granularity setting described below. In general, this should mean that DRAGEN provides more precise results at default settings.

  • Outputs: BI-FastQC text output contains the same information as its plots in tabular format, while DRAGEN-FastQC outputs its raw data. For example, BI-FastQC both plots and outputs the average base quality per position, while DRAGEN outputs the average base quality by both position and nucleotide. This allows for a more detailed analysis of the data, but requires slightly more work to generate the associated plot.

  • Rounding: DRAGEN consistently rounds its calculations to the nearest integer, while the original FastQC uses a mixture of rounding and taking the mathematical floor. As a result, DRAGEN-FastQC reports incrementally higher results for some metrics.

  • Smoothing: Both DRAGEN-FastQC and BI-FastQC use smoothing techniques for their distributions of %GC, to account for the fact that 151bp do not divide evenly into 100 percentile bins. However, to take advantage of the speed offered by the FPGA, DRAGEN uses a slightly different algorithm than BI-FastQC, which produces slightly different results.
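The effect described in the Rounding item can be illustrated with a short Python sketch. The input value is hypothetical, and the half-up tie-breaking used here is an assumption, since the text above only says "nearest integer".

```python
import math


def round_nearest(value):
    """Round a non-negative value to the nearest integer (ties go up; an assumption)."""
    return math.floor(value + 0.5)


def round_floor(value):
    """Take the mathematical floor of a value."""
    return math.floor(value)


mean_quality = 36.6  # hypothetical average Phred-scale quality
print(round_nearest(mean_quality), round_floor(mean_quality))  # 37 36
```

Whenever the fractional part is 0.5 or more, the rounded value is one higher than the floored value, which is the source of the incrementally higher results.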

Metric Granularity

Due to memory constraints, it is not possible to guarantee single-base resolution for all metrics. DRAGEN provides an algorithmic solution for binning via --fastqc-granularity. DRAGEN allocates 256 bins in memory for each size-based or position-based metric. A granularity value from 4 to 7, inclusive, determines the bin size. Higher values use smaller bins for greater resolution. Lower values create larger bins to accommodate longer read lengths.

| Granularity | Single Base Resolution (bp) | Resolution at 150 (bp) | Recommended Read-Lengths (bp) |
|---|---|---|---|
| 7 | 1-255 | 1 | <256 |
| 6 | 1-128 | 2 | >=256 and <507 |
| 5 | 1-64 | 4 | >=507 and <4031 |
| 4 | 1-32 | 8 | >=4031 |

If a value for --fastqc-granularity is not provided by the user, DRAGEN will attempt to estimate the read length of the input data and set the granularity accordingly.
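To choose a value explicitly, the recommended read-length ranges in the table above can be expressed as a simple lookup. The Python sketch below only encodes that table; the function name is hypothetical, and this is not necessarily how DRAGEN estimates the setting internally.

```python
def recommended_granularity(read_length):
    """Map a read length in bp to the granularity recommended in the table above."""
    if read_length < 1:
        raise ValueError("read length must be a positive number of bases")
    if read_length < 256:
        return 7
    if read_length < 507:
        return 6
    if read_length < 4031:
        return 5
    return 4


# Hypothetical read lengths, from short reads to long reads.
for length in (150, 300, 1000, 10000):
    print(length, recommended_granularity(length))
```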

Adapter and Kmer Sequence Files

To include metrics for adapter or other sequence content, DRAGEN FastQC needs to be provided with the desired sequences in FASTA format. DRAGEN provides two options for this purpose: --fastqc-adapter-file for adapter sequences and --fastqc-kmer-file for any additional kmers of interest. Keeping the two files separate lets users add sequences of interest without changing the expected adapter results.

DRAGEN FastQC can accept up to a combined total of 16 adapter and kmer sequences. Each sequence can be a maximum of 12 bp in length. By default, DRAGEN uses the adapter file located at <INSTALL_PATH>/config/adapter_sequences.fasta. The file contains the same adapter sequences as Babraham's FastQC v0.11.10 and later, listed below.

  • Illumina Universal Adapter--AGATCGGAAGAG

  • Illumina Small RNA 3' Adapter--TGGAATTCTCGG

  • Illumina Small RNA 5' Adapter--GATCGTCGGACT

  • Nextera Transposase Sequence--CTGTCTCTTATA
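Before passing custom files to --fastqc-adapter-file or --fastqc-kmer-file, it can be useful to check them against the limits stated above (16 sequences combined, 12 bp each). The following Python sketch does that for FASTA text. The function names and the FASTA header names are illustrative assumptions, and the exact layout of the default adapter file is not confirmed here.

```python
MAX_SEQUENCES = 16  # combined adapter + kmer sequences, per the limits above
MAX_LENGTH = 12     # maximum length of each sequence in bp


def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_sequences(adapter_fasta, kmer_fasta=""):
    """Return a list of problems; an empty list means the input is within limits."""
    records = read_fasta(adapter_fasta) + read_fasta(kmer_fasta)
    problems = []
    if len(records) > MAX_SEQUENCES:
        problems.append(
            f"{len(records)} sequences exceed the combined limit of {MAX_SEQUENCES}"
        )
    for name, sequence in records:
        if len(sequence) > MAX_LENGTH:
            problems.append(f"{name}: {len(sequence)} bp exceeds {MAX_LENGTH} bp")
    return problems


# The four default adapters listed above; header names are assumed.
default_adapters = """>Illumina Universal Adapter
AGATCGGAAGAG
>Illumina Small RNA 3' Adapter
TGGAATTCTCGG
>Illumina Small RNA 5' Adapter
GATCGTCGGACT
>Nextera Transposase Sequence
CTGTCTCTTATA
"""

print(check_sequences(default_adapters))  # []
print(check_sequences(default_adapters, ">too_long\nACGTACGTACGTA\n"))
```

The second call reports the hypothetical 13 bp kmer as exceeding the 12 bp limit.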

FastQC Metrics Output

The FastQC metrics are written to a CSV file in the run output directory, named as follows.

  • <PREFIX>.fastqc_metrics.csv

The reported metrics are broken down into eight sections by metric type. Each section is broken down further into separate rows by either the length, position, or other relevant categorical variables. The following are the metric sections.

  • Read Mean Quality---Total number of reads with each average Phred-scale quality value, rounded to the nearest integer.

  • Positional Base Mean Quality---Average Phred-scale quality value of bases with a specific nucleotide and at a given location in the read. Locations are listed first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, or T. N or ambiguous bases are assumed to have the system default value, usually QV2.

  • Positional Base Content---Number of bases of each specific nucleotide at given locations in the read. Locations are given first and can be either specific positions or ranges. The nucleotide is listed second and can be A, C, G, T, N.

  • Read Lengths---Total number of reads with each observed length. Lengths can be either specific sizes or ranges, depending on settings specified using --fastqc-granularity.

  • Read GC Content---Total number of reads with each GC content percentile between 0% and 100%.

  • Read GC Content Quality---Average Phred-scale read mean quality for reads with each GC content percentile between 0% and 100%.

  • Sequence Positions---Number of times an adapter or other kmer sequence is found, starting at a given position in the input reads. Sequences are listed first in the metric description in quotes. Locations are listed second and can be either specific positions or ranges.

  • Positional Quality---Phred-scale quality value for bases at a given location and a given quantile of the distribution. Locations are listed first and can be either specific positions or ranges. Quantiles are listed second and can be any whole integer 0–100.

The following are example rows from each section.

| Section | Mate | Metric | Value |
|---|---|---|---|
| READ MEAN QUALITY | Read1 | Q38 Reads | 965377 |
| ... | | | |
| POSITIONAL BASE MEAN QUALITY | Read1 | ReadPos 145-152 T Average Quality | 34.49 |
| POSITIONAL BASE MEAN QUALITY | Read1 | ReadPos 150 T Average Quality | 34.44 |
| POSITIONAL BASE MEAN QUALITY | Read1 | ReadPos 256+ T Average Quality | 36.99 |
| ... | | | |
| POSITIONAL BASE CONTENT | Read1 | ReadPos 145-152 A Bases | 113362306 |
| POSITIONAL BASE CONTENT | Read1 | ReadPos 150 A Bases | 14300589 |
| POSITIONAL BASE CONTENT | Read1 | ReadPos 256+ A Bases | 13249068 |
| ... | | | |
| READ LENGTHS | Read1 | 150bp Length Reads | 77304421 |
| READ LENGTHS | Read1 | 144-151bp Length Reads | 77304421 |
| READ LENGTHS | Read1 | >=255bp Length Reads | 1000000 |
| ... | | | |
| READ GC CONTENT | Read1 | 50% GC Reads | 140878674373 |
| ... | | | |
| READ GC CONTENT QUALITY | Read1 | 50% GC Reads Average Quality | 36.20 |
| ... | | | |
| SEQUENCE POSITIONS | Read1 | 'AGATCGGAAGAG' 137bp Starts | 20 |
| SEQUENCE POSITIONS | Read1 | 'AGATCGGAAGAG' 137-144bp Starts | 23 |
| ... | | | |
| POSITIONAL QUALITY | Read1 | ReadPos 150 50% Quantile QV | 37 |
| POSITIONAL QUALITY | Read1 | ReadPos 145-152 50% Quantile QV | 37 |
| ... | | | |
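Because DRAGEN reports mean base quality per position and per nucleotide, producing a single per-position value (as BI-FastQC plots) requires combining rows. The Python sketch below shows one way to do this, weighting each nucleotide's mean quality by its base count from the Positional Base Content section. It assumes the CSV has four comma-separated columns in the order Section, Mate, Metric, Value and that metric labels follow the forms in the example rows. The function name and the sample values are hypothetical, and this is not necessarily the calculation BI-FastQC performs.

```python
import csv
import io
import re
from collections import defaultdict

# Matches labels such as "ReadPos 150 T Average Quality" or "ReadPos 145-152 A Bases".
POSITIONAL = re.compile(r"^ReadPos (\S+) ([ACGTN]) (Average Quality|Bases)$")


def per_position_mean_quality(csv_text, mate="Read1"):
    """Combine per-nucleotide mean qualities into a count-weighted mean per position."""
    quality = {}  # (position label, nucleotide) -> mean quality
    count = {}    # (position label, nucleotide) -> number of bases
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 4 or row[1] != mate:
            continue
        section, _, metric, value = row
        match = POSITIONAL.match(metric)
        if not match:
            continue
        key = (match.group(1), match.group(2))
        if section == "POSITIONAL BASE MEAN QUALITY":
            quality[key] = float(value)
        elif section == "POSITIONAL BASE CONTENT":
            count[key] = int(value)

    weighted = defaultdict(float)
    total = defaultdict(int)
    # Only nucleotides that have a quality row contribute, so N counts are ignored.
    for (position, nucleotide), mean_q in quality.items():
        bases = count.get((position, nucleotide), 0)
        weighted[position] += mean_q * bases
        total[position] += bases
    return {pos: weighted[pos] / total[pos] for pos in total if total[pos] > 0}


# Hypothetical rows in the assumed four-column layout.
sample = """\
POSITIONAL BASE MEAN QUALITY,Read1,ReadPos 150 A Average Quality,36.00
POSITIONAL BASE MEAN QUALITY,Read1,ReadPos 150 T Average Quality,34.00
POSITIONAL BASE CONTENT,Read1,ReadPos 150 A Bases,300
POSITIONAL BASE CONTENT,Read1,ReadPos 150 T Bases,100
"""
print(per_position_mean_quality(sample))  # {'150': 35.5}
```

Position labels are kept as strings, so binned ranges such as 145-152 and open-ended bins such as 256+ pass through unchanged.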
