Repeat Expansion Detection
Short tandem repeats (STRs) are regions of the genome consisting of repetitions of short DNA segments called repeat units. STRs can expand to lengths beyond the normal range and cause mutations called repeat expansions. Repeat expansions are responsible for many diseases, including Fragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.
DRAGEN includes a repeat expansion detection tool for STRs, DRAGEN-STR. DRAGEN-STR performs sequence-graph-based realignment of reads that originate inside and around each target repeat, then genotypes the length of the repeat in each allele based on these graph alignments.
DRAGEN-STR is designed for PCR-free whole genome samples. Repeats are only genotyped if the coverage at the locus is at least 10x, but a minimum of 30x is recommended. Sequencing reads must be paired-end with a minimum read length of 100 bp (2x100 bp). DRAGEN-STR cannot be run on multiple FASTQ files that are assigned to different library IDs in the fastq_list.csv file.
DRAGEN-STR does not support somatic analysis.
NOTE:
DRAGEN-STR is based on the ExpansionHunter tool. For more information about implementation details and performance assessment, refer to the ExpansionHunter publications.
To enable DRAGEN repeat expansion detection, the following command-line options are required.
--repeat-genotype-enable=true
--repeat-genotype-specs=<path to specification file>
You can use the --sample-sex option to specify the sex of the sample.
The following options are optional.
--repeat-genotype-region-extension-length=<length of region around repeat to examine> (default 1000 bp)
--repeat-genotype-min-baseq=<Minimum base quality for high confidence bases> (default 20)
For more information on the specification file specified by the --repeat-genotype-specs option, see the description of repeat-specification files below.
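As a rough illustration of how these options combine, the sketch below assembles only the repeat-genotyping arguments listed above. The function name, the catalog path, and the `female` value passed to --sample-sex are assumptions made for the example; a real DRAGEN invocation also needs its usual input, reference, and output options, which are not shown here.

```python
def build_repeat_args(specs_path, sample_sex=None,
                      region_extension=None, min_baseq=None):
    """Return the repeat-genotyping options as a list of CLI arguments.

    Only options named in this section are emitted. Optional values left
    as None are omitted so that DRAGEN applies its documented defaults
    (1000 bp region extension, base quality 20).
    """
    args = [
        "--repeat-genotype-enable=true",
        f"--repeat-genotype-specs={specs_path}",
    ]
    if sample_sex is not None:
        args.append(f"--sample-sex={sample_sex}")
    if region_extension is not None:
        args.append(
            f"--repeat-genotype-region-extension-length={region_extension}")
    if min_baseq is not None:
        args.append(f"--repeat-genotype-min-baseq={min_baseq}")
    return args


# Hypothetical catalog path and sample sex, for illustration only.
example_args = build_repeat_args("/path/to/catalog.json",
                                 sample_sex="female", min_baseq=25)
print(" ".join(example_args))
```

Keeping the optional values as `None` by default means the generated command line only overrides what the caller explicitly sets.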
The main output of repeat expansion detection is a VCF file that contains the variants found via this analysis.
The repeat-specification (also called variant catalog) JSON file defines the repeat regions for DRAGEN-STR to analyze. Default repeat-specification files for some pathogenic and polymorphic repeats are in the <INSTALL_PATH>/resources/repeat-specs/ directory, organized by the reference genome used with DRAGEN.
DRAGEN-STR requires a repeat-specification file, which is normally supplied with --repeat-genotype-specs. If the option is not provided, DRAGEN attempts to autodetect the applicable catalog file from <INSTALL_PATH>/resources/repeat-specs/ based on the reference provided.
The results of repeat genotyping are output as a separate VCF file, which provides the length of each allele at each callable repeat defined in the repeat-specification catalog file. The file is named <outputPrefix>.repeats.vcf (*.gz).
Each record in the VCF output file starts with the following core fields.
Table 2 Core VCF Fields
| Field | Description |
| --- | --- |
| CHROM | Chromosome identifier |
| POS | Position of the first base before the repeat region in the reference |
| ID | Always . |
| REF | The reference base at position POS |
| ALT | List of repeat alleles in the format <STRn>, where n is the number of repeat units. If the allele matches the reference (REF), the value is . |
| QUAL | Always . |
| FILTER | LowDepth filter is applied when the overall locus depth is below 10x or the number of reads that span one or both breakends is below 5. |
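The LowDepth rule in the FILTER row can be written as a simple predicate. This is a minimal sketch that takes the two quantities as precomputed inputs; how DRAGEN counts the reads spanning a breakend is not described here, so the `breakend_spanning_reads` argument is an assumption about what a caller would supply.

```python
def is_low_depth(locus_depth, breakend_spanning_reads,
                 min_depth=10.0, min_spanning_reads=5):
    """Return True when a locus would receive the LowDepth filter.

    Mirrors the documented rule: depth below 10x, or fewer than 5 reads
    spanning one or both breakends. Both thresholds are strict ("below").
    """
    return (locus_depth < min_depth
            or breakend_spanning_reads < min_spanning_reads)


print(is_low_depth(37.46, 12))  # well-covered locus
print(is_low_depth(9.5, 12))    # depth below 10x
```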
Table 3 Additional INFO Fields
| Field | Description |
| --- | --- |
| END | Position of the last base of the repeat region in the reference |
| REF | Number of repeat units spanned by the repeat in the reference |
| RL | Reference length in bp |
| VARID | Variant ID from the variant catalog |
| RU | Repeat unit in the reference orientation |
| REPID | Variant ID from the variant catalog |
Table 4 GENOTYPE (Per Sample) Fields
| Field | Description |
| --- | --- |
| GT | Genotype |
| SO | Type of reads that support the allele. Values can be SPANNING, FLANKING, or INREPEAT. These values indicate if the reads span, flank, or are fully contained in the repeat. |
| REPCN | Number of repeat units spanned by the allele |
| REPCI | Confidence interval for REPCN |
| ADSP | Number of spanning reads consistent with the allele |
| ADFL | Number of flanking reads consistent with the allele |
| ADIR | Number of in-repeat reads consistent with the allele |
| LC | Locus coverage |
For example, the following VCF entry describes the ATXN1 repeat in a sample NA13537.
In this example, the first allele spans 33 repeat units while the second allele spans 58 repeat units. The repeat unit is TGC (RU INFO field), so the sequence of the first allele is TGC x 33 and the sequence of the second allele is TGC x 58. The repeat spans 30 repeat units in the reference (REF INFO field).
The length of the short allele was estimated from spanning reads (SPANNING), while the length of the expanded allele was estimated from in-repeat reads (INREPEAT). The confidence interval for the size of the expanded allele is (52,71). There are 4 spanning and 69 flanking reads consistent with the repeat allele of size 33; that is, 4 reads fully contain the repeat of size 33 and 69 flanking reads overlap at most 33 repeat units. There are 83 flanking and 4 in-repeat reads consistent with the repeat allele of size 58. The average coverage of this locus is 37.46x.
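To make the per-sample fields concrete, the sketch below splits a FORMAT/sample column pair into one record per allele. The sample string is reconstructed from the prose above, not copied from an actual DRAGEN output. Several details are assumptions: the `/` separator between per-allele values, the `low-high` syntax for REPCI, the `1/2` genotype, the `33-33` interval for the short allele, and the zero counts for read types the text does not mention. Missing values encoded as `.` are not handled.

```python
PER_ALLELE_KEYS = ("SO", "REPCN", "REPCI", "ADSP", "ADFL", "ADIR")


def parse_repeat_genotype(format_col, sample_col):
    """Split VCF FORMAT and sample columns into per-allele records."""
    fields = dict(zip(format_col.split(":"), sample_col.split(":")))
    n_alleles = len(fields["REPCN"].split("/"))
    alleles = []
    for i in range(n_alleles):
        raw = {key: fields[key].split("/")[i] for key in PER_ALLELE_KEYS}
        ci_low, ci_high = raw["REPCI"].split("-")
        alleles.append({
            "support": raw["SO"],
            "repeat_units": int(raw["REPCN"]),
            "ci": (int(ci_low), int(ci_high)),
            "spanning": int(raw["ADSP"]),
            "flanking": int(raw["ADFL"]),
            "in_repeat": int(raw["ADIR"]),
        })
    return {"GT": fields["GT"], "LC": float(fields["LC"]), "alleles": alleles}


# Illustrative values reconstructed from the ATXN1 description (see caveats).
FORMAT = "GT:SO:REPCN:REPCI:ADSP:ADFL:ADIR:LC"
SAMPLE = "1/2:SPANNING/INREPEAT:33/58:33-33/52-71:4/0:69/83:0/4:37.46"

call = parse_repeat_genotype(FORMAT, SAMPLE)
for allele in call["alleles"]:
    print(allele["repeat_units"], allele["support"], allele["ci"])
```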
The sequence-graph alignments of reads in the targeted repeat regions are output in a BAM file. You can use a specialized GraphAlignmentViewer tool available on GitHub to visualize the alignments. Programs like Integrative Genomics Viewer (IGV) are not designed for displaying graph-aligned reads and cannot visualize these BAMs.
The BAMs store graph alignments in custom XG tags using the format <LocusName>,<StartPosition>,<GraphCIGAR>.
LocusName: A locus identifier that matches the corresponding entry in the repeat expansion specification file.
StartPosition: The starting alignment position of a read on the first graph node.
GraphCIGAR: The alignment of a read against the graph starting from that position. GraphCIGAR consists of a sequence of graph node identifiers and linear CIGARs describing the alignment of the read to each node.
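The three components of an XG tag can be pulled apart with a few lines of Python. The sketch assumes that each GraphCIGAR element is written as a node identifier followed by a bracketed linear CIGAR (for example `1[3M]`), which is the ExpansionHunter convention; the text above does not spell out the exact syntax. The tag value in the example is made up.

```python
import re

# One GraphCIGAR element: numeric node id, then a linear CIGAR in brackets.
NODE_ALIGNMENT = re.compile(r"(\d+)\[([^\]]*)\]")


def parse_xg_tag(value):
    """Split an XG tag value into locus name, start position and node CIGARs."""
    # Split at most twice so the GraphCIGAR itself is left untouched.
    locus_name, start_position, graph_cigar = value.split(",", 2)
    node_alignments = [
        (int(node_id), cigar)
        for node_id, cigar in NODE_ALIGNMENT.findall(graph_cigar)
    ]
    return locus_name, int(start_position), node_alignments


# Hypothetical tag: left flank (node 0), two repeat units (node 1),
# right flank (node 2).
locus, start, nodes = parse_xg_tag("ATXN1,42,0[10M]1[3M]1[3M]2[25M]")
print(locus, start, nodes)
```

A read that traverses the repeat node several times lists that node once per traversal, so counting the repeated node in `nodes` gives the number of repeat units the read covers under this assumed encoding.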
Quality scores in the BAM file are binary. High-scoring bases are assigned a score of 40, and low-scoring bases are assigned a score of 0.
Some STR loci have polymorphic motifs, meaning that the different repetitions of the motif have minor variations in their sequence. For example, an STR may contain some repetitions of AAAGG and some repetitions of AAGGG, all in the same haplotype.
DRAGEN can estimate the global motif composition of STR loci. In some cases, when the STR expansion is heterozygous, it can also estimate the per-allele composition by using reads that are compatible with the estimated repeat size of only one haplotype.
DRAGEN has two workflows to compute motif fractions:
Kmer-counting: Kmers extracted from the graph alignment of each read to the locus graph are used to detect and count the motifs in the locus.
HMM labeling: The repetitive patterns in each read are labeled using an HMM generated from a set of possible motifs supplied by the user.
There are advantages and disadvantages to each technique:
Kmer counting does not need to know the set of motifs a priori, but it will only count kmers the same size as the pattern.
HMM labeling needs the set of possible motifs to be known a priori, but it will perform much better when the set contains motifs of varying length.
NOTE If the set of motifs is known, it is always advisable to leverage the more accurate HMM labeling over the kmer-counting.
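The idea behind kmer-counting, and its restriction to motifs of a single length, can be illustrated with a toy example. This is not DRAGEN's implementation: it works on a plain repeat sequence rather than on graph alignments, ignores base qualities, and assumes the sequence starts in phase with the motif. The function name is hypothetical.

```python
from collections import Counter


def motif_fractions(repeat_sequence, motif_length):
    """Count in-phase kmers of one fixed length and return their fractions."""
    kmers = [
        repeat_sequence[i:i + motif_length]
        for i in range(0, len(repeat_sequence) - motif_length + 1, motif_length)
    ]
    counts = Counter(kmers)
    total = sum(counts.values())
    return {motif: count / total for motif, count in counts.items()}


# Three copies of AAAGG followed by two copies of AAGGG on one haplotype.
sequence = "AAAGG" * 3 + "AAGGG" * 2
fractions = motif_fractions(sequence, 5)
print(fractions)
```

Because every kmer has the same length, a motif of a different size in the same repeat would be split across several kmers, which is the case where HMM labeling with a known motif set is expected to do better.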
Motif analysis is activated on a per-locus basis by adding a field to the respective catalog entry.
NOTE Either HMM or kmer motif analysis can be enabled, not both.
IMPORTANT The HMM motif analysis is enabled by default only for the RFC1 locus in built-in DRAGEN catalogs with the following motifs set:
FORMAT fields
When motif analysis is enabled on a locus, the following FORMAT fields are present in the VCF record:
| Field | Description |
| --- | --- |
| MOTIFS | Set of high-quality motifs detected from graph-aligned reads |
| MF | Fraction of quality-weighted counts for each motif in FORMAT:MOTIFS |
| AMF | Fraction of quality-weighted counts for each motif in FORMAT:MOTIFS, stratified by allele |
To reduce false positives caused by drops in base quality, DRAGEN reports a motif only if it is detected in high-quality parts of a read. When there are no high-quality motifs in the sample, the corresponding MF and AMF fields are 0 when using HMM labeling, or are not reported in the VCF when using kmer-counting.
If there is no genotype call, the MF and AMF fields will be empty. If a genotype call is present, but the genotype is homozygous, the AMF fields will be empty. Empty fields will be encoded with a dot (.).
You can create specification files for new repeat regions by using one of the provided specification files as a template. See the ExpansionHunter documentation for details on the format.
Users can choose between any of the three default repeat-specification files packaged with DRAGEN using the command-line option --repeat-genotype-use-catalog=<default|default_plus_smn|expanded>. The default option includes ~60 repeats. The default_plus_smn option includes the SMN repeat in addition to all the repeats in the default catalog. The expanded catalog includes ~174K repeats. If --repeat-genotype-use-catalog is not specified on the command line, then the default catalog is used.
The repeat genotyping results will be incorrect if the selected reference genome is not compatible with the repeat specification file. When this occurs, many repeats may be marked as "LowDepth" in the VCF output file or estimated to have zero length. This can be further confirmed by visualizing the read alignments.
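One quick way to look for this symptom is to compute the share of records carrying the LowDepth filter. The sketch below assumes a standard tab-separated VCF layout with FILTER in the seventh column; the records in the example are made up, and the text does not define what fraction counts as "many", so any cutoff is left to the user.

```python
def low_depth_fraction(vcf_lines):
    """Return the fraction of VCF records whose FILTER includes LowDepth."""
    total = 0
    low_depth = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        columns = line.rstrip("\n").split("\t")
        total += 1
        if "LowDepth" in columns[6].split(";"):
            low_depth += 1
    return low_depth / total if total else 0.0


# Made-up records for illustration only.
records = [
    "##fileformat=VCFv4.1",
    "chrA\t100\t.\tT\t<STR12>\t.\tPASS\tEND=130",
    "chrA\t900\t.\tG\t.\t.\tLowDepth\tEND=960",
    "chrB\t400\t.\tC\t.\t.\tLowDepth\tEND=445",
    "chrB\t750\t.\tA\t<STR20>\t.\tPASS\tEND=790",
]
print(low_depth_fraction(records))
```

To run it on a real output, the lines of <outputPrefix>.repeats.vcf can be passed in directly from an open file handle (using `gzip.open(path, "rt")` for the compressed variant).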
The default
variant catalog contains specifications on disease-causing repeats located in AFF2, AR, ARX_1, ARX_2, ATN1, ATXN1, ATXN10, ATXN2, ATXN3, ATXN7, ATXN8OS, BEAN1, C9ORF72, CACNA1A, CBL, CNBP, COMP, CSTB, DAB1, DIP2B, DMD, DMPK, EIF4A3, FMR1, FOXL2, FXN, GIPC1, GLS, HOXA13_1, HOXA13_2, HOXA13_3, HOXD13, HTT, JPH3, LRP12, MARCHF6, NIPA1, NOP56, NOTCH2NLC, NUTM2B-AS1, PABPN1, PHOX2B, PPP2R2B, PRDM12, PRNP, RAPGEF2, RFC1, RUNX2, SAMD12, SOX3, STARD7, TBP, TBX1, TCF4, TNRC6A, VWA1, XYLT1, YEATS2, ZIC2 and ZIC3 genes. More information about disease-causing repeats can also be found .
In addition to the disease-causing repeats listed above, the expanded variant catalog contains ~174K polymorphic repeats. These repeats were initially detected using STR-Finder from the 1000 Genomes Project, and the candidates were then filtered by a customized quality control pipeline.
DRAGEN-STR can detect pathogenic expansions of the FXN, ATXN3, ATN1, AR, DMPK, HTT, FMR1, ATXN1, and C9ORF72 repeats with high accuracy. The pathogenicity status of some repeats might depend on the presence of sequence interruptions or motif changes that DRAGEN-STR does not call. To visually inspect the relevant read alignments, you can use the third-party Repeat Expansion Viewer tool.
In some cases, the motif composition affects the pathogenic threshold: a long expansion of one motif variant can be harmless while a short expansion of another can be pathogenic. The main example of this is the RFC1 locus.