Indel Re-aligner (Beta)

Overview

The DRAGEN Indel Re-aligner is a consensus-based re-alignment step that is independent of other DRAGEN callers and pipelines. Re-aligned reads are reflected in the output BAM file, and their original alignment is described in an OA tag. The implementation is similar to the Indel Re-aligner tool that was found in GATK3. The tool is designed to reduce false positive SNPs by considering evidence of nearby indels.
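The original alignment recorded in the OA tag can be read back when inspecting re-aligned reads. The following is a minimal sketch, assuming the tag value follows the layout defined for OA in the SAM optional-fields specification (RNAME,POS,strand,CIGAR,MAPQ,NM; repeated, with NM allowed to be empty). The names OriginalAlignment and parse_oa_tag are hypothetical helpers, not part of DRAGEN.

```python
import re
from typing import List, NamedTuple, Optional


class OriginalAlignment(NamedTuple):
    contig: str
    pos: int            # 1-based start position before re-alignment
    strand: str         # "+" or "-"
    cigar: str
    mapq: int
    nm: Optional[int]   # None when the NM field was left empty


# One entry per previous alignment: RNAME,POS,strand,CIGAR,MAPQ,NM;
_OA_ENTRY = re.compile(r"([^,]+),(\d+),([+-]),([^,]+),(\d+),(\d*);")


def parse_oa_tag(value: str) -> List[OriginalAlignment]:
    """Parse an OA tag value, with or without the leading 'OA:Z:' prefix."""
    if value.startswith("OA:Z:"):
        value = value[len("OA:Z:"):]
    entries = []
    for match in _OA_ENTRY.finditer(value):
        contig, pos, strand, cigar, mapq, nm = match.groups()
        entries.append(
            OriginalAlignment(contig, int(pos), strand, cigar, int(mapq),
                              int(nm) if nm else None)
        )
    return entries
```

Comparing the parsed position and CIGAR with the record's current values shows how a read was moved by re-alignment.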

Description

The pipeline consists of two concurrent steps: interval creation and re-alignment. The interval creation step identifies genomic intervals for which there is evidence of insertions or deletions in the CIGARs of properly paired (if paired) reads aligned with positive mapping quality (MAPQ). To output these intervals as a text file, use the command line argument --ir-write-intervals-file=true. Each line describes a genomic interval as chrom:start-end, or chrom:start for intervals of length one. The start and end positions are both inclusive and 1-based. The intervals file is written to the DRAGEN output directory, with the suffix realign-intervals.txt.
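The interval line format described above can be parsed as follows. This is a minimal sketch that assumes each line contains exactly one interval and nothing else; Interval, parse_interval, and to_bed are hypothetical helper names. The conversion to 0-based half-open coordinates is included only because many downstream tools expect BED-style coordinates.

```python
from typing import NamedTuple, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def parse_interval(line: str) -> Interval:
    """Parse 'chrom:start-end' or 'chrom:start' (length-one interval)."""
    text = line.strip()
    # Split on the last colon so contig names that contain ':' still work.
    chrom, _, span = text.rpartition(":")
    if not chrom or not span:
        raise ValueError(f"not an interval line: {line!r}")
    start_text, dash, end_text = span.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else start
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in: {line!r}")
    return Interval(chrom, start, end)


def to_bed(interval: Interval) -> Tuple[str, int, int]:
    """Convert to 0-based, half-open coordinates (BED convention)."""
    return (interval.chrom, interval.start - 1, interval.end)
```

Because both positions are inclusive, an interval written as chrom:start has a length of one, and its length in general is end - start + 1.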

For each genomic interval, the re-alignment step groups all aligned reads that intersect the interval. If more than ir-max-num-reads reads intersect the interval, the interval is skipped. The following reads are then discarded from the re-alignment analysis:

  • Non-primary aligned reads.

  • Reads whose mapping quality is zero.

  • Paired end reads that mapped to different contigs.

  • Paired end reads that mapped to the same contig with start positions more than ir-max-distance-between-mates apart.
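The read filtering rules listed above can be sketched as follows. This is an illustration only, not DRAGEN's implementation: the Read record and both function names are hypothetical, the default limits are taken from the Command Line Options table, and the handling of reads with an unmapped mate is not described in the text and is therefore not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Read:
    contig: str
    start: int
    mapq: int
    is_primary: bool = True
    is_paired: bool = False
    mate_contig: str = ""   # only consulted when is_paired is True
    mate_start: int = 0     # only consulted when is_paired is True


def is_candidate(read: Read, max_mate_distance: int = 100_000) -> bool:
    """Apply the four discard rules to one read."""
    if not read.is_primary:
        return False
    if read.mapq == 0:
        return False
    if read.is_paired:
        if read.mate_contig != read.contig:
            return False
        if abs(read.mate_start - read.start) > max_mate_distance:
            return False
    return True


def select_candidates(reads: List[Read],
                      max_num_reads: int = 20_000,
                      max_mate_distance: int = 100_000) -> Optional[List[Read]]:
    """Return the re-alignment candidates, or None if the interval is skipped."""
    if len(reads) > max_num_reads:
        return None
    return [r for r in reads if is_candidate(r, max_mate_distance)]
```

Returning None for a skipped interval keeps it distinct from an interval in which every read was discarded, which yields an empty list.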

Reads that have not been discarded are candidates for re-alignment. If there are more than ir-max-num-candidates candidates, the interval is skipped. A consensus read is generated from each re-alignment candidate that has a single indel that is neither the first nor the last CIGAR operation, excluding clip operations. If there are more than ir-max-num-consensus consensus reads, the interval is skipped. Each re-alignment candidate is then scored against each consensus to determine the winning consensus. If the combined score for the interval against the winning consensus is better than the score against the reference by a difference of at least ir-realignment-threshold, each re-aligned read's start position, CIGAR, and NM tag are updated to reflect the re-alignment. The scoring used is Hamming distance weighted by base qualities. OA tags that describe the original alignment are added to any re-aligned reads. Mate positions of reads whose mate was re-aligned are updated as well.
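The scoring and threshold decision can be illustrated with a short sketch. It assumes that the quality-weighted Hamming distance is the sum of base qualities at mismatching positions, and that each read is scored at its best ungapped placement on a sequence. The text does not specify how DRAGEN places reads, so that part, along with all function names, is an assumption made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

ReadData = Tuple[str, Sequence[int]]  # (bases, base qualities)


def mismatch_quality_sum(bases: str, quals: Sequence[int],
                         target: str, offset: int) -> int:
    """Sum the base qualities at positions where the read disagrees with target."""
    window = target[offset:offset + len(bases)]
    return sum(q for b, t, q in zip(bases, window, quals) if b != t)


def best_placement_score(bases: str, quals: Sequence[int], target: str) -> int:
    """Lowest mismatch-quality sum over all ungapped placements (assumption)."""
    last_offset = len(target) - len(bases)
    if last_offset < 0:
        raise ValueError("read is longer than the target sequence")
    return min(mismatch_quality_sum(bases, quals, target, o)
               for o in range(last_offset + 1))


def pick_consensus(reads: List[ReadData], reference: str,
                   consensuses: List[str], threshold: int = 50) -> Optional[int]:
    """Return the index of the winning consensus, or None to keep the reads as-is."""
    if not consensuses:
        return None
    ref_total = sum(best_placement_score(b, q, reference) for b, q in reads)
    totals = [sum(best_placement_score(b, q, c) for b, q in reads)
              for c in consensuses]
    winner = min(range(len(totals)), key=totals.__getitem__)
    # Re-align only if the winner improves on the reference by the threshold.
    improvement = ref_total - totals[winner]
    return winner if improvement >= threshold else None
```

For example, a read AAGGGT with all base qualities equal to 30 mismatches the reference AAACCCGGGTTT at two or more positions in every ungapped placement, but matches a consensus AAAGGGTTT (the reference with CCC deleted) exactly, so the improvement is 60 and exceeds the default threshold of 50.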

When the re-alignment step is complete, a summary is printed to standard output. It reports the number of intervals found, the sum of the lengths of all intervals, the number of reads that intersected intervals, the number of reads that were re-aligned, and the number of reads that were skipped due to memory constraints. Such skipped reads are documented in the DRAGEN log. This may happen in regions with very deep coverage.

Limitations

The DRAGEN Indel Re-aligner is designed to improve the quality of the DRAGEN BAM output for downstream analysis. The DRAGEN small variant caller pipeline does not read the output BAM, and has its own internal haplotype assembly step, which usually recovers most of the artifacts found during Indel Re-alignment. Limited testing has shown that there may be a small improvement in DRAGEN small variant calls when Indel Re-alignment is enabled. However, Indel Re-alignment slows down a DRAGEN Map/Align + VC run by roughly a factor of two. For that reason, enabling Indel Re-alignment with the DRAGEN VC is not recommended, and it is not enabled by default.

The Indel Re-alignment pipeline cannot run with:

  • The UMI pipeline.

  • The Methylation pipelines.

  • --qc-coverage-ignore-overlaps=true.

  • SA tag generation (--generate-sa-tags=true).

  • The Expansion Hunter pipeline.
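A command line can be checked for two of the incompatibilities above before a run is launched. The sketch below is a hypothetical pre-flight helper, not a DRAGEN feature. It covers only the two options whose names are given in the list, because the options that enable the UMI, Methylation, and Expansion Hunter pipelines are not named in this section, and it assumes options are written in the --name=value form.

```python
from typing import Dict, List

# Only the options spelled out in the list above; pipeline-level conflicts
# (UMI, Methylation, Expansion Hunter) are not covered by this sketch.
INCOMPATIBLE_WITH_REALIGNER: Dict[str, str] = {
    "qc-coverage-ignore-overlaps": "true",
    "generate-sa-tags": "true",
}


def find_conflicts(argv: List[str]) -> List[str]:
    """Return the options in argv that conflict with indel re-alignment."""
    options: Dict[str, str] = {}
    for arg in argv:
        # Assumes the --name=value form; other spellings are ignored.
        if arg.startswith("--") and "=" in arg:
            name, _, value = arg[2:].partition("=")
            options[name] = value.lower()
    if options.get("enable-indel-realigner") != "true":
        return []
    return [f"--{name}={value}"
            for name, value in INCOMPATIBLE_WITH_REALIGNER.items()
            if options.get(name) == value]
```

An empty result means only that neither of the two named options conflicts; it does not rule out the pipeline-level restrictions.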

Command Line Options

| Name | Description | Default Value |
| --- | --- | --- |
| enable-indel-realigner | Enable indel re-alignment. | False |
| ir-write-intervals-file | Output a file with the reference intervals that contain evidence for re-alignment. | False |
| ir-max-num-reads | Max number of reads in an interval for re-alignment. | 20,000 |
| ir-max-num-candidates | Max number of re-alignment candidates in an interval for re-alignment. | 256 |
| ir-max-num-consensus | Max number of consensus reads in an interval for re-alignment. | 256 |
| ir-max-distance-between-mates | Max distance between the start positions of mates mapped to the same contig for a read to be considered for re-alignment. | 100,000 |
| ir-realignment-threshold | Minimal improvement of sum of mismatching base qualities to merit re-alignment. | 50 |
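The options above can be assembled into command-line arguments programmatically. The following is a minimal sketch: indel_realigner_args is a hypothetical helper, the defaults are copied from the table, and the --name=value form is assumed for every option based on the --ir-write-intervals-file=true example in the Description. The remainder of the DRAGEN command line (inputs, reference, output settings) is outside the scope of this section and is not shown.

```python
from typing import List, Union

OptionValue = Union[bool, int]

# Defaults as listed in the Command Line Options table.
IR_DEFAULTS = {
    "ir-write-intervals-file": False,
    "ir-max-num-reads": 20_000,
    "ir-max-num-candidates": 256,
    "ir-max-num-consensus": 256,
    "ir-max-distance-between-mates": 100_000,
    "ir-realignment-threshold": 50,
}


def indel_realigner_args(**overrides: OptionValue) -> List[str]:
    """Build the re-aligner arguments; keyword names use '_' instead of '-'."""
    args = ["--enable-indel-realigner=true"]
    for key, value in overrides.items():
        name = key.replace("_", "-")
        if name not in IR_DEFAULTS:
            raise KeyError(f"unknown indel re-aligner option: {name}")
        default = IR_DEFAULTS[name]
        if type(value) is not type(default):
            raise TypeError(f"{name} expects {type(default).__name__}")
        if value == default:
            continue  # no need to pass a value that matches the default
        text = str(value).lower() if isinstance(value, bool) else str(value)
        args.append(f"--{name}={text}")
    return args
```

Rejecting unknown names and mismatched types catches misspelled options before they reach the command line.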
