De Novo STR Expansion Detection

Short tandem repeats (STRs) are regions of the genome consisting of repetitions of a short DNA segment called the repeat unit. STRs can expand to lengths beyond the normal range, producing mutations known as repeat expansions. Repeat expansions are responsible for many diseases, including Fragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.

ExpansionHunter Denovo (EHdn) allows the discovery of expanded STR regions from paired-end reads across a cohort of samples. It is designed to work with PCR-free samples of 100-200 bp paired-end reads at >30x coverage.

Note:

  • STRs shorter than the read length are ignored; the program is appropriate only for detecting expansions that exceed the read length.

  • The location of each reported STR is approximate (accurate only to within about 500 bp-1 kbp)

  • STRs are not genotyped; the program reports a depth-normalized count of reads originating inside each STR, which can be used as a very approximate measure of the repeat length (a sketch of this normalization follows the list)

  • To achieve the best results, all samples must be sequenced on the same instrument to similar coverage, have the same read and fragment lengths, and be subjected to the same computational pre-processing (e.g. reads must be aligned by the same aligner)
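
To make the depth normalization mentioned in the notes concrete, below is a minimal Python sketch of one plausible scheme: a raw in-repeat read count is rescaled to a common reference depth. The function name and the 40x reference depth are illustrative assumptions, not the constants used internally by DRAGEN.

def depth_normalized_count(raw_count: int, sample_depth: float,
                           reference_depth: float = 40.0) -> float:
    """Rescale a raw in-repeat read count to a common reference depth.

    reference_depth=40.0 is an illustrative assumption; the constant used
    internally by EHdn/DRAGEN may differ.
    """
    return raw_count * reference_depth / sample_depth

# Example: 12 reads observed at 60x correspond to 8 reads at the 40x reference depth.
print(depth_normalized_count(12, 60.0))  # 8.0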

For more information refer to:

Dolzhenko, E., Bennett, M.F., Richmond, P.A. et al. ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol 21, 102 (2020)

Briefly, the workflow can be separated into two distinct steps: profiling and analysis. In the profiling step, repetitive reads are located and used to infer the positions of potential STR regions; the regions and the respective read counts are then saved to disk as a "profile". The profiling step is run once for each sample, and the resulting profiles are merged into a single dataset for the analysis. In the analysis step, the user provides a description of the experimental design and runs either an outlier analysis, which tests individual samples against the rest of the cohort, or a case-control analysis, in which the samples are split into two groups.

In DRAGEN, the analysis is more streamlined than in the standalone EHdn tool and considerably faster, while retaining the same output.

Note: The output of the outlier analysis might not be exactly identical to that of standalone EHdn because the analysis involves bootstrapping: the random sampling function used for bootstrapping in DRAGEN differs from the one implemented by Numpy in the standalone EHdn.

Running DRAGEN

Note: The DRAGEN implementation is based on EHdn v0.9.1

The two steps of the workflow, profiling and analysis, are performed by two separate DRAGEN commands.

Profiling

In the first step, we compute the profiles, which are saved as ProtoBuf messages (<out_prefix>.data). The profiles can be saved to a specific directory with the --str-profiler-output-directory flag. The sample name is stored in the profile and can be specified at the profiling stage with the --str-profiler-sample-name flag; if not specified, the sample name from the RGSM field is used instead.

DRAGEN has to be called once for each sample, for example with the command:

dragen \
  --enable-map-align=false \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --bam-input=<bam_input> \
  --enable-str-profiler=true \
  --str-profiler-sample-name=<optional_name> \            # defaults to the RGSM sample name if not set
  --str-profiler-output-directory=<path_to_directory> \   # defaults to --output-directory if not set
  -r=<dragen_ref>
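
Since profiling runs once per sample, cohorts are typically processed in a loop. The following Python sketch drives the profiling command for a set of BAM files; the sample names, paths, and directory layout are placeholders.

import subprocess
from pathlib import Path

# Placeholder cohort layout; sample names and paths are illustrative.
bams = {
    "sample_01": "/data/sample_01.bam",
    "sample_02": "/data/sample_02.bam",
}
ref_dir = "/references/dragen_ref"
profile_dir = Path("/analysis/str_profiles")
profile_dir.mkdir(parents=True, exist_ok=True)

for sample, bam in bams.items():
    out_dir = Path("/analysis") / sample
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "dragen",
        "--enable-map-align=false",
        f"--output-directory={out_dir}",
        f"--output-file-prefix={sample}",
        f"--bam-input={bam}",
        "--enable-str-profiler=true",
        f"--str-profiler-sample-name={sample}",
        f"--str-profiler-output-directory={profile_dir}",
        f"-r={ref_dir}",
    ], check=True)  # stop on the first failed DRAGEN run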

After all the profiles are computed, they have to be divided into 'cases' and 'controls' directories. This can be done at profiling time by passing the appropriate directory to the --str-profiler-output-directory flag. Alternatively, DRAGEN can take a list of FASTQ files as input with the --fastq-list option and save each sample's profile in the directory specified with --str-profiler-output-directory; a list of cases and a list of controls can be run in this manner. (A sketch for sorting already-computed profiles into the two directories follows the example command below.)

Example command:

dragen \
  --enable-map-align=true \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --fastq-list=<fastq_list> \
  --fastq-list-all-samples=true \                # necessary to run samples separately based on sample-id
  --enable-multi-sample=true \                   # necessary to run samples separately based on sample-id
  --enable-per-sample-map-align-output=true \    # can be set to false if map align output (BAM) is not needed
  --enable-str-profiler=true \
  --str-profiler-output-directory=<path_to_directory> \
  -r=<dragen_ref>
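
If the profiles were computed without separating them upfront, the required layout can be created afterwards by moving the profile files. Below is a minimal Python sketch, assuming each profile was written as <sample>.data (per the naming described above); the status map is a placeholder for the actual experimental design.

import shutil
from pathlib import Path

profile_dir = Path("/analysis/str_profiles")  # where all <sample>.data profiles were written

# Placeholder experimental design: sample name -> group.
status = {"sample_01": "cases", "sample_02": "controls"}

for group in ("cases", "controls"):
    (profile_dir / group).mkdir(exist_ok=True)

for sample, group in status.items():
    profile = profile_dir / f"{sample}.data"
    if profile.exists():
        shutil.move(str(profile), str(profile_dir / group / profile.name))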

Analysis

The analysis is performed with a separate DRAGEN command, which takes as input the paths to the two directories.

Two analysis types can be specified:

  • outlier = bootstraps the sampling distribution of the 95% quantile of the counts and then calculates z-scores for the case samples

  • casecontrol = case and control counts are compared with a one-sided Wilcoxon rank-sum test, and a Bonferroni correction is applied to the resulting p-values (a simplified sketch of both tests follows this list)
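
To make the two tests concrete, here is a simplified, single-region Python sketch that mirrors the descriptions above. It is an illustration, not the DRAGEN implementation: in particular, bootstrapping the quantile over the pooled counts and the exact z-score computation are assumptions.

import numpy as np
from scipy.stats import ranksums  # `alternative` requires scipy >= 1.7

rng = np.random.default_rng(0)

def outlier_zscores(case_counts, control_counts, rounds=1000, quantile=95):
    # Bootstrap the sampling distribution of the 95% quantile of the counts,
    # then z-score each case sample against that distribution.
    counts = np.concatenate([case_counts, control_counts])
    boot = np.array([
        np.percentile(rng.choice(counts, size=counts.size, replace=True), quantile)
        for _ in range(rounds)
    ])
    return (np.asarray(case_counts) - boot.mean()) / boot.std()

def case_control_pvalues(case_counts, control_counts, n_tests=1):
    # One-sided Wilcoxon rank-sum test (cases shifted above controls),
    # followed by a Bonferroni correction across n_tests regions.
    p = ranksums(case_counts, control_counts, alternative="greater").pvalue
    return p, min(1.0, p * n_tests)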

Providing the --str-profiler-analysis flag triggers the analysis workflow. Example command:

dragen \
  --output-directory=<out_dir>  \
  --output-file-prefix=<out_prefix>  \
  --enable-str-profiler=true \
  --str-profiler-analysis=<outlier|casecontrol> \
  --str-profiler-cases-directory=<directory_with_cases_profiles> \
  --str-profiler-controls-directory=<directory_with_controls_profiles> \
  -r=<dragen_ref>

A note about bootstrapping

The standalone version of EHdn performs 100 rounds of resampling during bootstrapping due to computational constraints. In DRAGEN, thanks to much faster computation, the default has been increased to 1000 rounds. This number can be adjusted with the --str-profiler-resampling-rounds flag. Increasing the number of resampling rounds improves the precision of the estimate but increases compute time linearly.
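
The effect of the number of rounds can be demonstrated on synthetic data: the run-to-run spread of the bootstrapped quantile estimate shrinks roughly with the square root of the number of rounds, while the cost grows linearly. A self-contained Python sketch (all numbers here are synthetic):

import numpy as np

rng = np.random.default_rng(1)
counts = rng.exponential(scale=2.0, size=200)  # synthetic depth-normalized counts

def q95_estimate(rounds):
    # One full bootstrap of the 95% quantile with the given number of rounds.
    reps = [np.percentile(rng.choice(counts, size=counts.size, replace=True), 95)
            for _ in range(rounds)]
    return float(np.mean(reps))

for rounds in (100, 1000):
    # Repeat the whole bootstrap 50 times to measure the spread of the estimate.
    estimates = [q95_estimate(rounds) for _ in range(50)]
    print(f"{rounds:>4} rounds: spread of the estimate (std) = {np.std(estimates):.4f}")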

DRAGEN will spread the computation across 48 threads by default, but the number can be adjusted on the command line with the flag --str-profiler-threads.

Output

The output (as in the standalone EHdn implementation) is composed of two tables, one for the "motif"-level analysis and one for the "locus"-level analysis, saved as <output-prefix>.str_profiler_motif.tsv and <output-prefix>.str_profiler_locus.tsv respectively. Below is a description of the locus analysis output; the motif table is the same as the locus table but without the contig, start, and end columns. A short snippet for inspecting these tables follows the column descriptions.

Outlier analysis (locus) output

Column             Description
contig             Contig of the repeat region
start              Approximate start of the repeat
end                Approximate end of the repeat
motif              Inferred repeat motif
top_case_zscore    Top z-score of a case sample
high_case_counts   Counts of case samples corresponding to a z-score greater than 1.0
counts             Nonzero counts for all samples

Case-control analysis (locus) output

Column        Description
contig        Contig of the repeat region
start         Approximate start of the repeat
end           Approximate end of the repeat
motif         Inferred repeat motif
pvalue        P-value from the Wilcoxon rank-sum test
bonf_pvalue   P-value after Bonferroni correction
counts        Depth-normalized counts of anchored in-repeat reads for each sample (omitting samples with zero count)
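
Both tables are plain TSV files and can be inspected with standard tooling. For example, a short Python/pandas snippet to pull out the strongest signals; the output prefix "cohort" and the 0.05 significance threshold are placeholders:

import pandas as pd

# "cohort" is a placeholder output prefix.
locus = pd.read_csv("cohort.str_profiler_locus.tsv", sep="\t")

# Outlier analysis: rank regions by the strongest case z-score.
if "top_case_zscore" in locus.columns:
    print(locus.sort_values("top_case_zscore", ascending=False).head(10))

# Case-control analysis: keep regions significant after Bonferroni correction.
if "bonf_pvalue" in locus.columns:
    print(locus[locus["bonf_pvalue"] < 0.05])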
