De Novo STR Expansion Detection

Short tandem repeats (STRs) are regions of the genome consisting of repetitions of a short DNA segment called the repeat unit. STRs can expand to lengths beyond the normal range, producing mutations known as repeat expansions. Repeat expansions are responsible for many diseases, including Fragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.

ExpansionHunter Denovo (EHdn) allows the discovery of expanded STR regions from paired-end reads across a cohort of samples. It is designed to work with PCR-free samples of 100-200 bp paired-end reads at >30x coverage.

Note:

  • STRs shorter than the read length are ignored; the program is appropriate only for detecting expansions that exceed the read length.

  • The location of each reported STR is approximate (accurate only to within about 500 bp-1 kbp)

  • STRs are not genotyped; the program reports a depth-normalized count of reads originating inside each STR, which can be used as a very approximate measure of the repeat length (a sketch of this normalization follows the list)

  • To achieve the best results, all samples must be sequenced on the same instrument to similar coverage, have the same read and fragment lengths, and be subjected to the same computational pre-processing (e.g. reads must be aligned by the same aligner)
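
To make the depth normalization mentioned in the notes concrete, below is a minimal Python sketch of one plausible scheme: a raw in-repeat read count is rescaled to a common reference depth. The function name and the 40x reference depth are illustrative assumptions, not the constants used internally by DRAGEN.

def depth_normalized_count(raw_count: int, sample_depth: float,
                           reference_depth: float = 40.0) -> float:
    """Rescale a raw in-repeat read count to a common reference depth.

    reference_depth=40.0 is an illustrative assumption; the constant used
    internally by EHdn/DRAGEN may differ.
    """
    return raw_count * reference_depth / sample_depth

# Example: 12 reads observed at 60x correspond to 8 reads at the 40x reference depth.
print(depth_normalized_count(12, 60.0))  # 8.0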

For more information refer to:

Dolzhenko, E., Bennett, M.F., Richmond, P.A. et al. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol 21, 102 (2020)

Briefly, the workflow can be separated into two distinct steps: profiling and analysis. In the profiling step, repetitive reads are located and used to infer the positions of potential STR regions; the regions and the respective read counts are then saved to disk as a "profile". The profiling step is run once for each sample, and the resulting profiles are merged into a single dataset for the analysis. In the analysis step, the user provides a description of the experimental design and runs either an outlier analysis, which tests individual samples against the rest of the cohort, or a case-control analysis, in which the samples are split into two groups.

In DRAGEN, the analysis is more streamlined than in the standalone EHdn tool and considerably faster, while retaining the same output.

Note: The output of the outlier analysis might not be exactly identical to that of standalone EHdn because the analysis involves bootstrapping: the random sampling function used for bootstrapping in DRAGEN differs from the one implemented by Numpy in the standalone EHdn.

Running DRAGEN

Note: The DRAGEN implementation is based on EHdn v0.9.1

The two steps of the workflow, profiling and analysis, are performed by two separate DRAGEN commands.

Profiling

In the first step, we compute the profiles, which are saved as ProtoBuf messages (<out_prefix>.data). The profiles can be saved to a specific directory with the --str-profiler-output-directory flag. The sample name is stored in the profile and can be specified at the profiling stage with the --str-profiler-sample-name flag; if not specified, the sample name from the RGSM field is used instead.

DRAGEN has to be called once for each sample, for example with the command:

dragen \
  --enable-map-align=false \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --bam-input=<bam_input> \
  --enable-str-profiler=true \
  --str-profiler-sample-name=<optional_name> \            # defaults to the RGSM sample name if not set
  --str-profiler-output-directory=<path_to_directory> \   # defaults to --output-directory if not set
  -r=<dragen_ref>
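
Since profiling runs once per sample, cohorts are typically processed in a loop. The following Python sketch drives the profiling command for a set of BAM files; the sample names, paths, and directory layout are placeholders.

import subprocess
from pathlib import Path

# Placeholder cohort layout; sample names and paths are illustrative.
bams = {
    "sample_01": "/data/sample_01.bam",
    "sample_02": "/data/sample_02.bam",
}
ref_dir = "/references/dragen_ref"
profile_dir = Path("/analysis/str_profiles")
profile_dir.mkdir(parents=True, exist_ok=True)

for sample, bam in bams.items():
    out_dir = Path("/analysis") / sample
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "dragen",
        "--enable-map-align=false",
        f"--output-directory={out_dir}",
        f"--output-file-prefix={sample}",
        f"--bam-input={bam}",
        "--enable-str-profiler=true",
        f"--str-profiler-sample-name={sample}",
        f"--str-profiler-output-directory={profile_dir}",
        f"-r={ref_dir}",
    ], check=True)  # stop on the first failed DRAGEN run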

After all the profiles are computed, they have to be divided into 'cases' and 'controls' directories. This can be done at profiling time by passing the appropriate directory to the --str-profiler-output-directory flag. Alternatively, DRAGEN can take a list of FASTQ files as input with the --fastq-list option and save each sample's profile in the directory specified with --str-profiler-output-directory; a list of cases and a list of controls can be run in this manner. (A sketch for sorting already-computed profiles into the two directories follows the example command below.)

Example command:

dragen \
  --enable-map-align=true \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --fastq-list=<fastq_list> \
  --fastq-list-all-samples=true \                # necessary to run samples separately based on sample-id
  --enable-multi-sample=true \                   # necessary to run samples separately based on sample-id
  --enable-per-sample-map-align-output=true \    # can be set to false if map align output (BAM) is not needed
  --enable-str-profiler=true \
  --str-profiler-output-directory=<path_to_directory> \
  -r=<dragen_ref>
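
If the profiles were computed without separating them upfront, the required layout can be created afterwards by moving the profile files. Below is a minimal Python sketch, assuming each profile was written as <sample>.data (per the naming described above); the status map is a placeholder for the actual experimental design.

import shutil
from pathlib import Path

profile_dir = Path("/analysis/str_profiles")  # where all <sample>.data profiles were written

# Placeholder experimental design: sample name -> group.
status = {"sample_01": "cases", "sample_02": "controls"}

for group in ("cases", "controls"):
    (profile_dir / group).mkdir(exist_ok=True)

for sample, group in status.items():
    profile = profile_dir / f"{sample}.data"
    if profile.exists():
        shutil.move(str(profile), str(profile_dir / group / profile.name))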

Analysis

The analysis is performed with a separate DRAGEN command, which takes as input the paths to the two directories.

Two analysis types can be specified:

  • outlier = bootstraps the sampling distribution of the 95% quantile of the counts and then calculates z-scores for the case samples

  • casecontrol = case and control counts are compared with a one-sided Wilcoxon rank-sum test, and a Bonferroni correction is applied to the resulting p-values (a simplified sketch of both tests follows this list)
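
To make the two tests concrete, here is a simplified, single-region Python sketch that mirrors the descriptions above. It is an illustration, not the DRAGEN implementation: in particular, bootstrapping the quantile over the pooled counts and the exact z-score computation are assumptions.

import numpy as np
from scipy.stats import ranksums  # `alternative` requires scipy >= 1.7

rng = np.random.default_rng(0)

def outlier_zscores(case_counts, control_counts, rounds=1000, quantile=95):
    # Bootstrap the sampling distribution of the 95% quantile of the counts,
    # then z-score each case sample against that distribution.
    counts = np.concatenate([case_counts, control_counts])
    boot = np.array([
        np.percentile(rng.choice(counts, size=counts.size, replace=True), quantile)
        for _ in range(rounds)
    ])
    return (np.asarray(case_counts) - boot.mean()) / boot.std()

def case_control_pvalues(case_counts, control_counts, n_tests=1):
    # One-sided Wilcoxon rank-sum test (cases shifted above controls),
    # followed by a Bonferroni correction across n_tests regions.
    p = ranksums(case_counts, control_counts, alternative="greater").pvalue
    return p, min(1.0, p * n_tests)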

Providing the --str-profiler-analysis flag triggers the analysis workflow. Example command:

dragen \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --enable-str-profiler=true \
  --str-profiler-analysis=<outlier|casecontrol> \
  --str-profiler-cases-directory=<directory_with_cases_profiles> \
  --str-profiler-controls-directory=<directory_with_controls_profiles> \
  -r=<dragen_ref>

A note about bootstrapping

The standalone version of EHdn performs 100 rounds of resampling during bootstrapping due to computational constraints. In DRAGEN, thanks to much faster computation, the default has been increased to 1000 rounds. This number can be adjusted with the --str-profiler-resampling-rounds flag. Increasing the number of resampling rounds improves the precision of the estimate but increases compute time linearly.
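
The effect of the number of rounds can be demonstrated on synthetic data: the run-to-run spread of the bootstrapped quantile estimate shrinks roughly with the square root of the number of rounds, while the cost grows linearly. A self-contained Python sketch (all numbers here are synthetic):

import numpy as np

rng = np.random.default_rng(1)
counts = rng.exponential(scale=2.0, size=200)  # synthetic depth-normalized counts

def q95_estimate(rounds):
    # One full bootstrap of the 95% quantile with the given number of rounds.
    reps = [np.percentile(rng.choice(counts, size=counts.size, replace=True), 95)
            for _ in range(rounds)]
    return float(np.mean(reps))

for rounds in (100, 1000):
    # Repeat the whole bootstrap 50 times to measure the spread of the estimate.
    estimates = [q95_estimate(rounds) for _ in range(50)]
    print(f"{rounds:>4} rounds: spread of the estimate (std) = {np.std(estimates):.4f}")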

DRAGEN will spread the computation across 48 threads by default, but the number can be adjusted on the command line with the flag --str-profiler-threads.

Output

The output (as in the standalone EHdn implementation) is composed of two tables, one for the "motif"-level analysis and one for the "locus"-level analysis, saved as <output-prefix>.str_profiler_motif.tsv and <output-prefix>.str_profiler_locus.tsv respectively. Below is a description of the locus analysis output; the motif table is the same as the locus table but without the contig, start, and end columns. A short snippet for inspecting these tables follows the column descriptions.

Outlier analysis (locus) output

Column             Description
contig             Contig of the repeat region
start              Approximate start of the repeat
end                Approximate end of the repeat
motif              Inferred repeat motif
top_case_zscore    Top z-score of a case sample
high_case_counts   Counts of case samples corresponding to a z-score greater than 1.0
counts             Nonzero counts for all samples

Case-control analysis (locus) output

Column        Description
contig        Contig of the repeat region
start         Approximate start of the repeat
end           Approximate end of the repeat
motif         Inferred repeat motif
pvalue        P-value from the Wilcoxon rank-sum test
bonf_pvalue   P-value after Bonferroni correction
counts        Depth-normalized counts of anchored in-repeat reads for each sample (omitting samples with zero count)
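
Both tables are plain TSV files and can be inspected with standard tooling. For example, a short Python/pandas snippet to pull out the strongest signals; the output prefix "cohort" and the 0.05 significance threshold are placeholders:

import pandas as pd

# "cohort" is a placeholder output prefix.
locus = pd.read_csv("cohort.str_profiler_locus.tsv", sep="\t")

# Outlier analysis: rank regions by the strongest case z-score.
if "top_case_zscore" in locus.columns:
    print(locus.sort_values("top_case_zscore", ascending=False).head(10))

# Case-control analysis: keep regions significant after Bonferroni correction.
if "bonf_pvalue" in locus.columns:
    print(locus[locus["bonf_pvalue"] < 0.05])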
