Filter Duplicate Variants
DRAGEN can find and remove variants that are common to separate VCF files. DRAGEN supports the following modes:
Small indel deduplication: If you provide a structural variant (SV) VCF and a small variant (SNV) VCF, DRAGEN filters all small indels in the SV VCF that also appear and are passing in the small variant VCF (PASS in the FILTER column of the small variant VCF file). DRAGEN leaves the SV and SNV VCF files unchanged and creates a new VCF that contains only the variants in the SV VCF that do not match a variant in the SNV VCF file. The name of the new deduplicated SV VCF file consists of the prefix passed by --output-file-prefix followed by the suffix sv.small_indel_dedup.vcf.gz. The diagram below describes the small indel deduplication pipeline. You must provide the reference genome used to generate the VCF files so that DRAGEN can normalize the variants. DRAGEN normalizes variants by trimming and left shifting by up to 500 bases. One use case is running both the SV and SNV callers in somatic workflows, which can increase sensitivity; deduplication then prevents the same variant from being reported twice in genes such as FLT3 and KMT2A.
SMN deduplication: If you provide a small variant VCF and an ExpansionHunter VCF, DRAGEN filters any lines in the small variant VCF that have the same chromosome and position as lines in the ExpansionHunter VCF with the INFO tag VARID=SMN. A reference genome is not required.
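The trimming and left shifting used for small indel matching can be illustrated with a short Python sketch. It follows the standard left-alignment procedure for VCF records and is not DRAGEN's actual implementation. The function name normalize_variant, the toy reference sequence, and the way the 500-base limit is applied (as a cap on the number of one-base left shifts) are assumptions made for illustration.

```python
def normalize_variant(pos, ref, alt, genome, max_shift=500):
    """Trim and left-shift a single REF/ALT pair.

    pos is 1-based (as in VCF); genome is the reference sequence of the
    chromosome as a plain string. max_shift is a hypothetical cap on the
    number of one-base left shifts.
    """
    shifted = 0
    # Right-trim bases shared by both alleles. When trimming would empty an
    # allele, extend both alleles one base to the left first (a left shift).
    while ref[-1] == alt[-1]:
        if min(len(ref), len(alt)) == 1:
            if shifted >= max_shift or pos == 1:
                break  # shift budget used up, or start of the sequence
            pos -= 1
            base = genome[pos - 1]  # 1-based position -> 0-based index
            ref, alt = base + ref, base + alt
            shifted += 1
        ref, alt = ref[:-1], alt[:-1]
    # Left-trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat; the same 2-base deletion can be written
# at several positions inside the repeat.
genome = "GGGCACACACAGGG"
print(normalize_variant(8, "CAC", "C", genome))  # (3, 'GCA', 'G')
```

Two records that describe the same deletion at different positions inside a repeat reduce to the same (position, REF, ALT) triple after this step, which is what makes a direct comparison between the SV and small variant VCF records possible.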
Use the following command line options to input VCF or gVCF files. The input files are not altered.
vd-sv-vcf: Specify a structural variant VCF or gVCF.
vd-small-variant-vcf: Specify a small variant VCF or gVCF.
vd-eh-vcf: Specify an ExpansionHunter VCF or gVCF.
DRAGEN determines the name and type of the output file as follows.
Output prefix
If a value is specified for output-file-prefix, the prefix is used as usual. If the value is not valid, the name of the filtered input is used as the prefix.
Deduplication mode
The prefix is followed by .small_indel_dedup or .smn_dedup, depending on the deduplication mode used.
File type
The output file type matches the input file type (VCF or gVCF). If enable-vcf-compression is set to true, the output file is gzip compressed, regardless of whether the input file was compressed. The name of the match log is either match_log.smn_dedup.txt or match_log.small_indel_dedup.txt, depending on which deduplication mode you use.
You can use the following command line options for variant deduplication.
enable-variant-deduplication
To enable variant deduplication, set to true. The default is false.
enable-vcf-indexing
To generate tabix index files, set to true. The default is true.
vd-output-match-log
To log matching lines to a text file, set to true. The default is false. For each match, the two matching lines are written one after the other, followed by a new line.
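The SMN matching rule and the match log layout can be sketched in a few lines of Python. This is an illustration only, not DRAGEN's code: the function names, the toy VCF records, the order of the two logged lines, and the reading of "followed by a new line" as a blank separator line are all assumptions.

```python
def smn_positions(eh_lines):
    """Map (chromosome, position) to ExpansionHunter records tagged VARID=SMN."""
    hits = {}
    for line in eh_lines:
        if line.startswith("#"):
            continue  # skip VCF header lines
        fields = line.rstrip("\n").split("\t")
        if "VARID=SMN" in fields[7].split(";"):  # column 8 is INFO
            hits[(fields[0], fields[1])] = line.rstrip("\n")
    return hits


def smn_dedup(small_variant_lines, eh_lines):
    """Return the kept small variant lines and the match log text."""
    hits = smn_positions(eh_lines)
    kept, log = [], []
    for line in small_variant_lines:
        line = line.rstrip("\n")
        key = None if line.startswith("#") else tuple(line.split("\t")[:2])
        if key in hits:
            # Assumed log layout: the two matching lines, then a blank line.
            log.append(f"{line}\n{hits[key]}\n\n")
        else:
            kept.append(line)
    return kept, "".join(log)


# Made-up records for illustration only.
small = [
    "##fileformat=VCFv4.2",
    "chr5\t1000\t.\tC\tT\t50\tPASS\t.",
    "chr1\t200\t.\tA\tG\t50\tPASS\t.",
]
eh = ["chr5\t1000\t.\tC\t<STR2>\t.\tPASS\tVARID=SMN;REPID=SMN"]

kept, log = smn_dedup(small, eh)
print("\n".join(kept))
print(log, end="")
```

Only chromosome and position are compared here, which mirrors why SMN deduplication needs no reference genome: no allele normalization is involved.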
The following is an example command for an SMN deduplication standalone run:
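As a rough sketch, a standalone SMN deduplication run could be assembled from the options described in this section. The Python snippet below only builds and prints an argument list and does not run DRAGEN. The file paths and the prefix are hypothetical, and --output-directory is an assumption taken from general DRAGEN usage rather than from this section; a real run may need further options.

```python
import shlex


def smn_dedup_command(small_variant_vcf, eh_vcf, output_dir, prefix,
                      match_log=False):
    """Build an argument list for a standalone SMN deduplication run."""
    cmd = [
        "dragen",
        "--enable-variant-deduplication", "true",
        "--vd-small-variant-vcf", small_variant_vcf,
        "--vd-eh-vcf", eh_vcf,
        "--output-directory", output_dir,   # assumption: not listed in this section
        "--output-file-prefix", prefix,
    ]
    if match_log:
        cmd += ["--vd-output-match-log", "true"]
    return cmd


# Hypothetical paths, for illustration only.
cmd = smn_dedup_command(
    "/data/sample.hard-filtered.vcf.gz",
    "/data/sample.repeats.vcf.gz",
    "/data/out",
    "sample",
    match_log=True,
)
print(shlex.join(cmd))
```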
You can also run small indel deduplication automatically on outputs from the DRAGEN joint caller where both structural variant and small variant callers are enabled. To run small indel deduplication automatically, set enable-variant-deduplication to true, and make sure the vd-sv-vcf, vd-small-variant-vcf, and vd-eh-vcf input options are not set. Only small indel deduplication can be run automatically.
The following is an example command for an automatic small indel deduplication run.
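As a hedged sketch, an automatic run could look like a normal calling command with both callers and deduplication enabled, and with none of the vd-* input options set. The Python snippet below only builds and prints an argument list. Apart from enable-variant-deduplication and output-file-prefix, the options shown (--ref-dir, --bam-input, --enable-variant-caller, --enable-sv, --output-directory) are assumptions taken from general DRAGEN usage, and all paths are hypothetical.

```python
import shlex


def auto_small_indel_dedup_command(ref_dir, bam, output_dir, prefix):
    """Build an argument list for a run with automatic small indel deduplication."""
    cmd = [
        "dragen",
        "--ref-dir", ref_dir,                 # assumption: reference for normalization
        "--bam-input", bam,                   # assumption: input type
        "--enable-variant-caller", "true",    # small variant caller
        "--enable-sv", "true",                # structural variant caller
        "--enable-variant-deduplication", "true",
        "--output-directory", output_dir,
        "--output-file-prefix", prefix,
    ]
    # Automatic mode requires that no vd-* input option is set.
    assert not any(arg.startswith("--vd-") for arg in cmd)
    return cmd


# Hypothetical paths, for illustration only.
cmd = auto_small_indel_dedup_command(
    "/refs/hg38", "/data/tumor.bam", "/data/out", "tumor"
)
print(shlex.join(cmd))
```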