Filter Duplicate Variants

DRAGEN can find and remove variants that are common to separate VCF files. DRAGEN supports the following modes:

  • Small indel deduplication—If you provide a structural variant (SV) VCF and a small variant VCF, DRAGEN filters every small indel in the SV VCF that also appears in the small variant VCF with PASS in the FILTER column. The input SV and small variant VCF files are not changed. Instead, DRAGEN creates a new VCF that contains only the SV VCF variants that do not match a variant in the small variant VCF. The name of the deduplicated SV VCF file is the prefix passed with --output-file-prefix followed by the suffix sv.small_indel_dedup.vcf.gz. The diagram below describes the small indel deduplication pipeline. You must provide a reference genome so that DRAGEN can normalize the variants. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases. For example, this mode is useful in somatic workflows that run both the SV and small variant callers to increase sensitivity, where it prevents the same variant from being reported twice in genes such as FLT3 and KMT2A.

  • SMN deduplication—If using a small variant VCF and an ExpansionHunter VCF, DRAGEN filters any lines in the small variant VCF that have the same chromosome and position as lines in the ExpansionHunter VCF with the INFO tag VARID=SMN. A reference genome is not required.
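
The SMN matching rule can be illustrated with a short shell sketch. This is not DRAGEN's implementation; it only mirrors the rule stated above on uncompressed VCF files. The function name smn_dedup_sketch is hypothetical, and the sketch assumes tab-separated VCF columns (CHROM, POS, and INFO in columns 1, 2, and 8) and that the tag appears in INFO exactly as VARID=SMN.

```shell
# Hypothetical helper: prints the small variant VCF without SMN duplicates.
# $1: ExpansionHunter VCF (uncompressed), $2: small variant VCF (uncompressed)
smn_dedup_sketch() {
  awk -F'\t' '
    # First file (ExpansionHunter VCF): remember CHROM and POS of SMN records.
    FILENAME == ARGV[1] {
      if ($0 !~ /^#/) {
        n = split($8, info, ";")
        for (i = 1; i <= n; i++)
          if (info[i] == "VARID=SMN") seen[$1 FS $2] = 1
      }
      next
    }
    # Second file (small variant VCF): keep header lines as they are.
    /^#/ { print; next }
    # Keep a record only if its CHROM and POS were not seen above.
    {
      key = $1 FS $2
      if (!(key in seen)) print
    }
  ' "$1" "$2"
}
```

Called as smn_dedup_sketch <expansion hunter vcf> <small variant vcf>, the function writes the filtered records to standard output and leaves both inputs untouched.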

Use the following command line options to input VCF or gVCF files. The input files are not altered.

  • vd-sv-vcf—Specify a structural variant VCF or gVCF.

  • vd-small-variant-vcf—Specify a small variant VCF or gVCF.

  • vd-eh-vcf—Specify an ExpansionHunter VCF or gVCF.

DRAGEN determines the name and type of the output file as follows.

Component | Description
Output prefix | If a value is specified for output-file-prefix, the prefix is used as usual. If the value is not valid, the name of the filtered input is used as the prefix.
Deduplication mode | The prefix is followed by .small_indel_dedup or .smn_dedup, depending on the deduplication mode used.
File type | The output file type matches the input file type (VCF or gVCF). If enable-vcf-compression is set to true, the output file is gzip compressed, regardless of whether the input file was compressed.

The name of the match log is either match_log.smn_dedup.txt or match_log.small_indel_dedup.txt, depending on which deduplication mode you use.

Command-Line Options

You can use the following command line options for variant deduplication.

Option | Description
enable-variant-deduplication | To enable variant deduplication, set to true. The default is false.
enable-vcf-indexing | To generate tabix index files, set to true. The default is true.
vd-output-match-log | To log matching lines to a text file, set to true. The default is false. For each match, the two matching lines are written one after the other, followed by a new line.
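
To get a quick count of matches from the log, a sketch like the following may help. It assumes the log contains only the matching lines (two per match) plus separator lines with no content; count_matches is a hypothetical helper, not a DRAGEN tool.

```shell
# Hypothetical helper: estimates the number of matches in a match log.
# $1: match log, for example match_log.smn_dedup.txt
count_matches() {
  # Count non-empty lines; each match is assumed to contribute exactly two.
  nonempty=$(grep -c . "$1" || true)
  echo $((nonempty / 2))
}
```

If the log turns out to contain header or summary lines, the count will be off and the sketch needs adjusting.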

The following is an example command for an SMN deduplication standalone run:

dragen --enable-map-align false \
--enable-variant-deduplication true \
--vd-small-variant-vcf <small variant vcf> \
--vd-eh-vcf <expansion hunter vcf> \
--output-directory /tmp/ \
--vd-output-match-log true
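
A standalone small indel deduplication run might look like the following sketch, which swaps the ExpansionHunter input for a structural variant VCF. It assumes the required reference genome is supplied with --ref-dir, as in the automatic run example later in this section. The wrapper function and the DRAGEN_BIN variable are hypothetical and exist only so that the command line can be checked without a DRAGEN installation.

```shell
# Hypothetical wrapper around a standalone small indel deduplication run.
# $1: SV VCF, $2: small variant VCF, $3: reference directory, $4: output directory
run_small_indel_dedup() {
  # DRAGEN_BIN is an assumed override for the executable; it defaults to dragen.
  "${DRAGEN_BIN:-dragen}" \
    --enable-map-align false \
    --enable-variant-deduplication true \
    --vd-sv-vcf "$1" \
    --vd-small-variant-vcf "$2" \
    --ref-dir "$3" \
    --output-directory "$4" \
    --vd-output-match-log true
}
```

Setting DRAGEN_BIN=echo prints the assembled command instead of running it, which is a convenient way to review the options first.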

You can also run small indel deduplication automatically on outputs from the DRAGEN joint caller where both the structural variant and small variant callers are enabled. To run small indel deduplication automatically, set enable-variant-deduplication to true and make sure that the vd-sv-vcf, vd-small-variant-vcf, and vd-eh-vcf input options are not set. Only small indel deduplication can be run automatically.

The following is an example command for an automatic small indel deduplication run:

dragen \
--ref-dir <REF> \
--output-directory <DIR> \
--output-file-prefix <PREFIX> \
-b <BAM> \
--enable-map-align false \
--enable-variant-caller true \
--enable-sv true \
--enable-variant-deduplication true \
--vd-output-match-log true
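
After the run, one way to see how many SV records were filtered is to compare record counts between the original SV VCF and the deduplicated file. The helper below is a hypothetical sketch, not part of DRAGEN; it counts non-header lines and assumes that files ending in .gz are gzip compressed.

```shell
# Hypothetical helper: counts the variant records (non-header lines) in a VCF.
# $1: path to a VCF, plain text or gzip compressed
count_vcf_records() {
  case "$1" in
    *.gz) gzip -cd "$1" ;;
    *)    cat "$1" ;;
  esac | grep -vc '^#'
}
```

Subtracting the count for the deduplicated file from the count for the original SV VCF gives the number of filtered records. The actual file names depend on your output directory and prefix.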
