Read Trimming
DRAGEN can remove artifacts from reads using hardware-accelerated read trimming. Hardware-accelerated read trimming is available on U200 and cloud systems as part of the DRAGEN mapper and adds no additional run time. DRAGEN provides multiple independent trimming filters that target different types of artifacts or use cases. You can enable and configure each filter independently to tailor read trimming to your analysis. Read trimming has two modes: hard-trimming and soft-trimming.
To enable hard-trimming mode, use --read-trimmers. In hard-trimming mode, potential artifacts are removed from input reads. Reads that are trimmed to fewer than 20 bases are filtered and replaced with a placeholder read that consists of 10 N bases. DRAGEN sets the 0x200 flag on the filtered reads.
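The filtering rule above can be sketched in a few lines of Python. This is an illustrative model only, not DRAGEN's implementation: the function name apply_hard_trim_filter and the constant names are hypothetical, while the 20-base threshold, the 10-N placeholder, and the 0x200 flag come from the text.

```python
from typing import Tuple

MIN_LENGTH = 20          # from the text: reads trimmed below this length are filtered
PLACEHOLDER = "N" * 10   # from the text: placeholder read of 10 N bases
FILTERED_FLAG = 0x200    # from the text: flag set on filtered reads


def apply_hard_trim_filter(trimmed_seq: str, flag: int = 0) -> Tuple[str, int]:
    """Return the (sequence, flag) pair for a read after hard trimming.

    Hypothetical helper: models only the length filter, not the trimmers.
    """
    if len(trimmed_seq) < MIN_LENGTH:
        # Read is too short: replace it with the placeholder and mark it.
        return PLACEHOLDER, flag | FILTERED_FLAG
    # Read is long enough: keep it unchanged.
    return trimmed_seq, flag
```

For example, a read trimmed to 19 bases comes back as ten N bases with 0x200 set, while a 20-base read passes through untouched.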
DRAGEN contains a novel lossless soft-trimming mode. In soft-trimming mode, reads are mapped as though they had been trimmed, but no bases are removed. To enable the trimmer in soft mode, use --soft-read-trimmers.
Soft-trimming suppresses systematic mismapping of reads that contain trimmable artifacts, without losing the trimmed bases in the aligned output. For example, soft-trimming prevents reads with Poly-G artifacts from being mapped to reference G homopolymers, and prevents adapter sequences from being mapped to matching reference loci. Soft-trimming might map reads to different reference positions than they would have been mapped to without soft-trimming. When using soft-trimming, DRAGEN does not filter reads, and does not map reads whose bases would have been trimmed entirely.
Soft-trimming for Poly-G artifacts is enabled by default on supported systems.
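The difference between the two modes can be modeled as follows. This is a conceptual Python sketch, not DRAGEN's data model: the TrimmedRead class, the trim_read function, and the trim_3prime parameter (the number of 3' bases a trimmer has marked) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class TrimmedRead:
    bases_for_mapping: str  # what the mapper aligns
    bases_in_output: str    # what is kept in the aligned output


def trim_read(seq: str, trim_3prime: int, soft: bool) -> TrimmedRead:
    """Model hard versus soft trimming of trim_3prime bases from the 3' end."""
    kept = seq[:len(seq) - trim_3prime]
    if soft:
        # Soft mode: map as though trimmed, but keep every base in the output.
        return TrimmedRead(bases_for_mapping=kept, bases_in_output=seq)
    # Hard mode: the trimmed bases are removed from the read itself.
    return TrimmedRead(bases_for_mapping=kept, bases_in_output=kept)
```

In both modes the mapper works with the same trimmed sequence; only the bases retained in the output differ.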
Read Trimming Tools
Fixed-Length Trimming
Fixed-length trimming removes a fixed number of bases from the 5' end of each read. If you are analyzing sequencing data from an amplicon of fixed size and expect the read length to consistently exceed the length of quality sequence data, you can specify the expected number of excess bases for fixed-length trimming.
Poly-G Trimming
Poly-G artifacts appear on two-channel sequencing systems when the dark base G is called after synthesis has terminated. As a result, several erroneous high-confidence G bases are called on the ends of affected reads. For contaminated samples, many affected reads can be mapped to reference regions with high G content. The affected reads can cause problems in downstream processing.
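As a rough illustration of the idea, a Poly-G trimmer can be sketched as removing a trailing run of G bases from the 3' end of a read. The function name trim_poly_g and the min_run threshold (and its default of 10) are assumptions made for this sketch; they do not describe DRAGEN's actual detection logic.

```python
def trim_poly_g(seq: str, min_run: int = 10) -> str:
    """Remove a trailing run of G bases if it is at least min_run long.

    min_run is a hypothetical threshold that avoids trimming short,
    genuine G runs at the end of a read.
    """
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    return stripped if run_length >= min_run else seq
```

A read ending in twelve G bases is cut back to the last non-G base, while a read ending in only two G bases is left alone.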
Quality Trimming
Base quality can degrade over the length of a read toward the 3' end, separately from any artifacts caused by early termination of synthesis. The lower-quality bases can affect mapping and alignment results, and might lead to incorrect variant or methylation calls downstream. The quality trimming tool calculates a rolling average of the base quality inward from the 3' end and removes the minimum number of bases, so that the average quality is above the threshold specified using --trim-min-quality.
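One way to picture the rolling-average rule is the Python sketch below. It is a simplified model under stated assumptions, not DRAGEN's algorithm: the function name bases_to_trim and the window size are hypothetical, qualities are passed in order starting at the read end being trimmed, and a window average at or above the threshold is treated as passing.

```python
from typing import Sequence


def bases_to_trim(quals_from_end: Sequence[int], threshold: float,
                  window: int = 4) -> int:
    """Return the minimum number of end bases to remove.

    quals_from_end lists base qualities starting at the end being trimmed
    and moving inward. window is an assumed rolling-window size.
    """
    n = len(quals_from_end)
    for removed in range(n + 1):
        # Window of up to `window` bases starting at the new read end.
        win = quals_from_end[removed:removed + window]
        if win and sum(win) / len(win) >= threshold:
            return removed
    # No window passed: every base would be trimmed.
    return n
```

With qualities 2, 5, 10 followed by five bases at 30, a threshold of 20, and a window of 4, the first passing window starts after two bases, so two bases are removed.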
Adapter Trimming
Problems during library preparation, or libraries with smaller inserts, can result in the synthesis of high-quality reads that contain sequence from the adapters used. If not removed before analysis, non-insert bases can reduce mapping efficiency and downstream accuracy. The adapter trimming tool uses the adapter sequences from the input FASTA file, and then removes all hits greater than a specified size. Adapter trimming allows for a 10% mismatch. For 3' adapters, trimming is from the first matching adapter base to the end of the read. For 5' adapters, trimming is from the 3'-most matching adapter base to the beginning (5') of the read.
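The 3' adapter case can be sketched as a scan for the first read position where the adapter matches within the mismatch tolerance. This is a minimal illustration, not DRAGEN's matching algorithm: the function name trim_3prime_adapter and the min_hit parameter (a stand-in for the "specified size") are hypothetical, and only substitutions are counted as mismatches.

```python
def trim_3prime_adapter(read: str, adapter: str, min_hit: int = 5,
                        max_mismatch_rate: float = 0.10) -> str:
    """Trim from the first matching adapter base to the end of the read.

    A hit is an overlap of at least min_hit bases (assumed minimum size)
    between the read, starting at some position, and the adapter prefix,
    with at most max_mismatch_rate mismatching bases (10% per the text).
    """
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_hit:
            break  # remaining read tail is too short to count as a hit
        mismatches = sum(
            1 for r, a in zip(read[start:start + overlap], adapter) if r != a
        )
        if mismatches <= max_mismatch_rate * overlap:
            return read[:start]
    return read
```

Because the overlap shrinks near the end of the read, the sketch also catches an adapter that is only partially present, as long as the remaining piece is at least min_hit bases long.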
Ambiguous Base Trimming
If quality trimming is not feasible due to reduced yield or other limitations, an alternative option is to remove only explicitly ambiguous bases from the ends of reads. If enabled, the ambiguous base trimmer applies a simple exact-match search to both ends of all processed reads, regardless of mate-pair status.
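A minimal Python sketch of this behavior is shown below, assuming that N is the only ambiguous base code of interest and that the quality string must be cut in step with the sequence. The function name trim_ambiguous_ends is hypothetical.

```python
from typing import Tuple


def trim_ambiguous_ends(seq: str, qual: str) -> Tuple[str, str]:
    """Strip runs of N from both ends of a read, keeping qualities aligned."""
    left = len(seq) - len(seq.lstrip("N"))   # number of leading N bases
    right = len(seq.rstrip("N"))             # index just past the last non-N base
    if left >= right:
        # The read consists only of N bases (or is empty).
        return "", ""
    return seq[left:right], qual[left:right]
```

Ambiguous bases inside the read are left in place; only the runs at either end are removed.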
Minimum Length Trimming
To maximize trimmer sensitivity, use the minimum length trimming tool to remove a fixed number of bases from each read after the other trimming tools have run. For example, if you want to remove 5 bp from each read and those bases are removed first, a 7 bp adapter hit could be missed because five of its bases are already gone. To mitigate this issue, DRAGEN provides an optional minimum trim-length filter.
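The ordering problem in the example can be reproduced with a small Python sketch. Everything here is illustrative: trim_adapter_suffix is a deliberately simplified exact-match adapter trimmer, and enforce_min_trim reflects one interpretation of the minimum trim-length filter (top up the trimming so that at least min_trim bases are removed in total), which is an assumption rather than documented DRAGEN behavior.

```python
def trim_adapter_suffix(read: str, adapter: str, min_hit: int) -> str:
    """Remove the longest adapter prefix (>= min_hit bases) found at the read end."""
    for k in range(min(len(adapter), len(read)), min_hit - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def enforce_min_trim(original_len: int, read: str, min_trim: int) -> str:
    """Trim further from the 3' end until at least min_trim bases are gone."""
    already_trimmed = original_len - len(read)
    extra = min(max(0, min_trim - already_trimmed), len(read))
    return read[:len(read) - extra]


insert = "ACGTACGTACGTACGTACGT"
adapter = "AGATCGGAAGAGC"
read = insert + adapter[:7]  # 7 bp adapter hit at the 3' end

# Fixed trim first: only 2 adapter bases remain, so the adapter hit is missed.
fixed_first = trim_adapter_suffix(read[:-5], adapter, min_hit=5)

# Adapter trim first, then the minimum trim-length step: the insert is recovered.
adapter_first = enforce_min_trim(
    len(read), trim_adapter_suffix(read, adapter, min_hit=5), min_trim=5
)
```

In the first order two adapter bases survive in the read; in the second the adapter trimmer already removes 7 bases, so the minimum trim-length step has nothing left to do.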
Maximum Length Trimming
If using libraries of fixed-size inserts, such as small PCR amplicons, it is more convenient to specify a length that all reads should be trimmed to rather than the number of bases to remove. You can use the maximum length trimming tool for this purpose.
PolyA Tail Trimming
If using RNA libraries, reads that overlap the poly-A tail of a transcript can contain long poly-A or poly-T sequences at the end of the read, which can result in incorrect alignment. The poly-A trimmer mitigates this by trimming the poly-A tail from the end of the read. For more information, see the RNA alignment section.
Read Trimming Metrics
The trimmer generates a metrics file titled <output prefix>.trimmer_metrics.csv. Metrics are available on an aggregate level over all input data. The metrics units are in reads or bases.
Total input reads: Total number of reads in the input files.
Total input bases: Total number of bases in the input reads.
Total input bases R1: Total number of bases in R1 reads.
Total input bases R2: Total number of bases in R2 reads.
Average input read length: Total number of input bases divided by the number of input reads.
Total trimmed reads: Total number of reads trimmed by at least one base, not including soft-trimming.
Total trimmed bases: Total number of bases trimmed, not including soft-trimming.
Average bases trimmed per read: The number of trimmed bases divided by the number of input reads.
Average bases trimmed per trimmed read: The number of trimmed bases divided by the number of trimmed reads.
Remaining poly-G K-mers R1 3prime: The number of R1 3' read ends that contain likely Poly-G artifacts after trimming.
Remaining poly-G K-mers R2 3prime: The number of R2 3' read ends that contain likely Poly-G artifacts after trimming.
Total filtered reads: The number of reads that were filtered out during trimming.
Reads filtered for minimum read length R1: The number of R1 reads that were filtered due to being trimmed below the minimum read length.
Reads filtered for minimum read length R2: The number of R2 reads that were filtered due to being trimmed below the minimum read length.
<Trimmer tool> trimmed reads: The number of reads with at least one base trimmed by the named trimmer tool. DRAGEN reports the metric for both R1 and R2 mates and for each filtering status (unfiltered or filtered) of the trimmed read. The metric includes reads that were trimmed during soft-trimming. Each trimming tool above produces the metric.
<Trimmer tool> trimmed bases: The number of bases trimmed by the named trimmer tool. DRAGEN reports the metric for both R1 and R2 mates and for each filtering status (unfiltered or filtered) of the trimmed read. The metric includes bases from reads that were trimmed during soft-trimming. Each trimming tool above produces the metric.
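The three averaged metrics above follow directly from the totals. The Python sketch below restates those definitions as arithmetic; the function name derived_trimmer_metrics is hypothetical, and returning 0.0 when a denominator is zero is an assumption made to keep the example safe, not documented behavior.

```python
from typing import Dict


def derived_trimmer_metrics(input_reads: int, input_bases: int,
                            trimmed_reads: int, trimmed_bases: int) -> Dict[str, float]:
    """Compute the averaged metrics from the aggregate totals."""

    def ratio(numerator: int, denominator: int) -> float:
        # Assumption: report 0.0 instead of failing on an empty denominator.
        return numerator / denominator if denominator else 0.0

    return {
        "Average input read length": ratio(input_bases, input_reads),
        "Average bases trimmed per read": ratio(trimmed_bases, input_reads),
        "Average bases trimmed per trimmed read": ratio(trimmed_bases, trimmed_reads),
    }
```

For example, 1,000 input reads totaling 150,000 bases, with 200 reads trimmed by 3,000 bases in total, give an average input read length of 150, an average of 3 bases trimmed per read, and 15 bases trimmed per trimmed read.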
Read Trimming Settings
Read trimmer
Filtering after the trimmer's execution
Fixed-length trimming
Quality trimming
Adapter trimming
Bisulfite trimming
Minimum-length trimming
Maximum-length trimming
PolyA trimming
PolyG trimming
PolyX trimming