Microsatellite Instability
Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.
DRAGEN MSI supports running in tumor-normal and tumor-only modes. The tumor-only mode requires a panel of normals. The panel of normals can be generated using the collect-evidence mode.
The default microsatellite site lists and the panel of normals are available for WES and WGS. Custom panels other than WES and WGS may require more extensive validation and possibly custom microsatellite site files.
The MSI algorithm performs the following steps:
Tabulate the number of read alignments for each microsatellite site in tumor and normal samples.
A read is counted toward a repeat length only if its sequence contains the repeat sequence plus the 5 flanking bases on each of the left and right sides, as specified in the microsatellite site list.
When msi-read-stitching is turned on, a pair of reads is counted as one read if the two reads overlap.
Calculate the Jensen-Shannon (JS) distance between the tumor and normal distributions.
In tumor-normal mode, the JS distance is calculated between the tumor sample and the normal sample.
In tumor-only mode, intra-normal JS distances are first calculated between all pairs of normal samples. The mean JS distance between the tumor sample and all normal samples is then normalized by the mean intra-normal distance.
Compute P-values for each site using:
chi-square testing between tumor and normal distributions in tumor-normal mode, and
Student's t-testing between mean tumor and normal distributions in tumor-only mode.
Determine whether the site is assessed. A site is assessed if the following criteria are satisfied:
the total number of supporting reads is greater than SpanningCoverageThreshold in both tumor and normal samples
the number of reads supporting the reference repeat length is larger than MinReferencePeakHeight.
Determine if a site is unstable based on both the Jensen-Shannon distance and the P-value. A site is unstable if the JS distance is larger than DistanceThreshold (default=0.1) and the P-value is smaller than PValueThreshold (default=0.01).
Determine whether the site passes filters based on specific peak heights. A site passes if the following criteria are satisfied:
the number of reads supporting (reference repeat length - 1) is greater than or equal to MinLeftPeakHeight
the ratio of the number of reads supporting the reference repeat length to the number supporting (reference repeat length - 1) is between MinLeftPeakRatio and MaxLeftPeakRatio.
If a filter is not passed, the site is counted toward the total assessed sites but not toward the unstable sites, even if its distance and P-value pass the thresholds.
Summarize the statistics and produce a report giving the assessed site count, the unstable site count, the percentage of unstable sites among all assessed sites, and the sum of the Jensen-Shannon distances of all unstable sites. The parameter values mentioned above are also reported.
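The distance, classification, and summary steps above can be illustrated with a minimal Python sketch. It is not DRAGEN's implementation: the function names, the dictionary keys, and the choice of base-2 logarithms (which bounds the distance to [0, 1]) are assumptions made for illustration. The P-value is taken as an input because the statistical tests are not reproduced here, and empty distributions or identical normals (which would cause a division by zero) are not handled.

```python
import math
from itertools import combinations


def js_distance(p_counts, q_counts):
    """Jensen-Shannon distance between two repeat length count vectors."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    p = [c / p_total for c in p_counts]
    q = [c / q_total for c in q_counts]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; empty bins contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


def tumor_only_distance(tumor, normals):
    """Mean tumor-vs-normal distance divided by the mean intra-normal distance."""
    to_normals = [js_distance(tumor, n) for n in normals]
    intra = [js_distance(a, b) for a, b in combinations(normals, 2)]
    mean_intra = sum(intra) / len(intra)
    return (sum(to_normals) / len(to_normals)) / mean_intra


def is_unstable(distance, p_value, pass_filter,
                distance_threshold=0.1, p_value_threshold=0.01):
    """A site is unstable only if it passes the filters and both thresholds."""
    return (pass_filter
            and distance > distance_threshold
            and p_value < p_value_threshold)


def summarize(sites):
    """Summarize per-site results given as dicts (hypothetical keys)."""
    assessed = [s for s in sites if s["assessed"]]
    unstable = [s for s in assessed
                if is_unstable(s["distance"], s["p_value"], s["pass_filter"])]
    percentage = 100.0 * len(unstable) / len(assessed) if assessed else 0.0
    return {
        "assessed_sites": len(assessed),
        "unstable_sites": len(unstable),
        "percentage_unstable_sites": percentage,
        "sum_distance": sum(s["distance"] for s in unstable),
    }
```

Note that a site failing the peak filters still counts toward the assessed total in this sketch, which mirrors the rule stated above.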
tumor-only mode
It is recommended to use tumor-only mode rather than tumor-normal mode if a panel of normals is available. It is also recommended to match the sample types of the panel of normals and the tumor sample for optimal performance. For example, a panel of normals that are FFPE samples should be used with FFPE or FF (Fresh-Frozen) tumor samples.
The TSO500 panels do not have normal controls, and are only tested and validated in tumor-only mode.
tumor-normal mode
The paired normal sample is specified by --fastq-file1 and --fastq-file2.
msi-command
Mode of execution: tumor-only, tumor-normal, or collect-evidence.
msi-microsatellites-file
msi-ref-normal-dir
msi-ref-normal-input
Full name of a combined file with reference normal repeat length distributions from multiple samples.
msi-read-stitching
Whether to count overlapping reads as one fragment. It is recommended to set this option to True for libraries with short fragments. Turning on read stitching lowers the read coverage at each site, so it is also recommended to lower msi-coverage-threshold, especially for lower-coverage samples.
msi-coverage-threshold
msi-distance-threshold
Threshold for distance distributions to be considered different. Default is 0.1. For liquid samples, a value of 0.02 is recommended.
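As an illustration of how the options above fit together, the following hypothetical Python helper assembles only the MSI-related option tokens. The helper name and its argument names are assumptions, as is the "true"/"false" spelling of the boolean value. Everything else a real run needs (reference, input files, output location) is deliberately left out, so the result is not a complete command line.

```python
VALID_COMMANDS = {"tumor-only", "tumor-normal", "collect-evidence"}


def build_msi_options(command, microsatellites_file, ref_normal_dir=None,
                      ref_normal_input=None, coverage_threshold=None,
                      distance_threshold=None, read_stitching=None):
    """Return MSI option tokens only; other required options are omitted."""
    if command not in VALID_COMMANDS:
        raise ValueError(f"unknown msi-command: {command}")
    # Tumor-only mode needs a panel of normals in one of the two formats.
    if command == "tumor-only" and not (ref_normal_dir or ref_normal_input):
        raise ValueError("tumor-only mode requires a panel of normals")

    options = ["--msi-command", command,
               "--msi-microsatellites-file", microsatellites_file]
    optional = [
        ("--msi-ref-normal-dir", ref_normal_dir),
        ("--msi-ref-normal-input", ref_normal_input),
        ("--msi-coverage-threshold", coverage_threshold),
        ("--msi-distance-threshold", distance_threshold),
    ]
    for flag, value in optional:
        if value is not None:
            options += [flag, str(value)]
    if read_stitching is not None:
        # Boolean spelling is an assumption of this sketch.
        options += ["--msi-read-stitching", "true" if read_stitching else "false"]
    return options
```

For a liquid sample, for example, the caller would pass distance_threshold=0.02 in line with the recommendation above.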
TSO500 Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". It is generally recommended to use "PercentageUnstableSites" as the metric for determining MSI status. This metric is normalized and is expected to be more consistent across pipelines and input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (PCR, UMI, etc.) and may need empirical calibration.
| Sample type | Assay | Microsatellite site file | Recommended setting | PercentageUnstableSites threshold |
| --- | --- | --- | --- | --- |
| Solid | TSO500 | Part of TSO500 resource bundle. Repeats 10 - 50. 130 sites. | msi-distance-threshold=0.1 | 20 |
| Heme | TSO500 | N/A | N/A | N/A |
| Liquid (cfDNA) | TSO500 | Part of TSO500 resource bundle. Repeats 6,7. 2344 sites. | msi-distance-threshold=0.02 | TBD |
| Solid, Heme | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.1 | TBD |
| Liquid (cfDNA) | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.02 | TBD |
| Solid, Heme | WGS | Available for download. Repeats 10 - 50. Approx. 1 mil sites. | msi-distance-threshold=0.1 | TBD |
| Liquid (cfDNA) | WGS | Available for download. Repeats 10 - 50. Approx. 1 mil sites. | msi-distance-threshold=0.02 | TBD |
The following is an example of a microsatellite file:
For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This will avoid using any off-target reads in the MSI analysis.
chromosome
Chromosome of the site
location
Start location of the site
repeat_unit_length
Size of the repeat unit
repeat_unit_binary
Binary encoding of the repeat unit base converted to decimal (A: 0, C: 1, G: 2, T: 3)
repeat_times
Number of repeat units in the reference
left_flank_binary
Left flank bases in terms of binary encoding converted to decimal
right_flank_binary
Right flank bases in terms of binary encoding converted to decimal
repeat_unit_bases
Repeat unit base in A/T/C/G
left_flank_bases
Five bases on the left flank of the microsatellite site
right_flank_bases
Five bases on the right flank of the microsatellite site
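The binary columns described above can be illustrated with a short sketch. It assumes two bits per base with the first base as the most significant digit, which is one natural reading of "binary encoding converted to decimal". The exact bit ordering used in the site file is not stated here, so treat this as an assumption and check it against a real site file.

```python
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_bases(bases):
    """Encode a base string as a base-4 number (2 bits per base)."""
    value = 0
    for base in bases.upper():
        # Shift the running value by one base (x4) and add the new code.
        value = value * 4 + BASE_CODES[base]
    return value


# A homopolymer T repeat unit encodes to 3.
unit_code = encode_bases("T")
# A five-base flank such as AGCCT encodes to 0b0010010111.
flank_code = encode_bases("AGCCT")
```

A base other than A, C, G, or T (for example N) raises a KeyError in this sketch, which is consistent with the requirement that flanks containing Ns be removed.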
Custom Microsatellite site files may be required if a small panel is targeted and/or the default site files do not have sufficient overlapping sites.
Custom Microsatellite site files can be generated by using MSIsensor-Pro https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices.
A subsequent post-processing step is required for the site list to be used by DRAGEN:
only keep microsatellites sites with a repeat unit of length 1
keep sites with 10 - 50bp repeats (a max length of 100bp repeats is supported)
remove any sites containing Ns in the left or right anchors
downsample the remaining sites to contain no more than 1 million sites (to avoid excessive run time)
An error occurs if long (>100 bp) microsatellite sites are present in the file.
The microsatellite site file output by MSIsensor-Pro is in a different format from the DRAGEN site file. A post-processing step is required to convert the format.
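The filtering rules above can be sketched in Python as follows. The sketch assumes each site has already been parsed into a dictionary keyed by the column names of the DRAGEN site list. The function name, the fixed random seed, and the use of uniform random downsampling are illustrative choices, not requirements of DRAGEN.

```python
import random


def filter_sites(sites, max_sites=1_000_000, seed=0):
    """Apply the site list post-processing rules to parsed site records."""
    kept = [
        s for s in sites
        if s["repeat_unit_length"] == 1            # homopolymers only
        and 10 <= s["repeat_times"] <= 50          # 10 - 50 bp repeats
        and "N" not in s["left_flank_bases"].upper()
        and "N" not in s["right_flank_bases"].upper()
    ]
    if len(kept) > max_sites:
        # Downsample uniformly, keeping the original site order.
        chosen = sorted(random.Random(seed).sample(range(len(kept)), max_sites))
        kept = [kept[i] for i in chosen]
    return kept
```

Converting the MSIsensor-Pro column layout to the DRAGEN layout is a separate step that is not shown here.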
We recommend filtering out microsatellite sites that overlap with known population variants. A locus affected by small variants results in artificially inflated differences between samples. For example, consider a site that overlaps a heterozygous variant (possibly a one-base insertion or deletion) in the normal sample. If the heterozygosity is lost (LOH) in the paired tumor sample, the difference observed between the two distributions is due to LOH rather than to microsatellite instability.
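A minimal sketch of such a filter is shown below. It assumes sites are dictionaries keyed by the site list column names and that variants are given as hypothetical (chromosome, position, allele frequency) tuples in the same coordinate system as the site start. Including the 5-base flanks in the overlap window and treating each variant as a single position are simplifications of this sketch.

```python
def overlaps_common_variant(site, variants, min_af=0.01, flank=5):
    """True if a variant above min_af falls within the site or its flanks."""
    start = site["location"] - flank
    end = (site["location"]
           + site["repeat_unit_length"] * site["repeat_times"]
           + flank)
    return any(
        chrom == site["chromosome"] and start <= pos < end and af > min_af
        for chrom, pos, af in variants
    )


def remove_sites_near_common_variants(sites, variants, min_af=0.01):
    """Keep only sites that do not overlap a common population variant."""
    return [s for s in sites
            if not overlaps_common_variant(s, variants, min_af)]
```

The linear scan over all variants is fine for a small panel; a genome-wide site list would need an indexed lookup instead.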
The normals reference can be provided in two formats: as separate files in one directory, or as a single file containing distributions from multiple samples.
Separate files can be provided with msi-ref-normal-dir. The directory should contain only the .dist files that are used as normal references.
A combined file can be provided with msi-ref-normal-input. The combined .dist file must contain an additional column that specifies the name of the sample for each distribution.
Please note:
The collect-evidence mode MUST be run in DRAGEN germline mode, as indicated by the FASTQ options -1 and -2.
The --msi-microsatellites-file and --msi-coverage-threshold settings used in collect-evidence mode must be consistent with the settings used during tumor-only MSI calling.
At least 20 normal samples are required.
DRAGEN outputs the following files during the MSI workflow:
<prefix>.microsat_output.json
<prefix>.microsat_diffs.txt
<prefix>.microsat_normal.dist
<prefix>.microsat_tumor.dist
<prefix>.microsat_log.txt
Logs the runtime and MSI results
The JSON file <prefix>.microsat_output.json contains the parameters needed to reproduce the experiment and the MSI results (including the MSI score PercentageUnstableSites).
The "SumDistance" is the sum of the Jensen-Shannon distances of all unstable sites, based on the distances between tumor and normal distributions. The "SumDistance" depends on the size of the microsatellite file and is not normalized. In general it is recommended to set MSI thresholds based on "PercentageUnstableSites" rather than "SumDistance".
In TSO500, Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.
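The threshold rule can be applied to a parsed report with a few lines of Python. This sketch assumes that PercentageUnstableSites is a top-level key of the JSON document and that its value may be stored as either a number or a string. Both points are assumptions about the file layout, and the sample JSON text is made up for illustration.

```python
import json


def is_msi(report, threshold=20.0):
    """Apply the PercentageUnstableSites >= threshold rule to a parsed report."""
    # float() accepts both numeric and string-encoded values.
    percentage = float(report["PercentageUnstableSites"])
    return percentage >= threshold


# Made-up report content, used only to show the call.
example_report = json.loads('{"PercentageUnstableSites": "25.3"}')
example_call = is_msi(example_report)
```

The default of 20 reflects the TSO500 Solid definition; other assays would pass their own empirically calibrated threshold.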
DRAGEN MSI computes the number of repeat units (repeat lengths) supported by each read fragment.
The distribution is recorded in <prefix>.microsat_normal.dist and <prefix>.microsat_tumor.dist for normal and tumor samples, respectively.
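To make the idea of a per-read repeat length concrete, here is a simplified sketch that works on raw read sequences. It counts a read only when the left flank, the repeat, and the right flank all appear in the read, and it fills a vector of 100 bins. DRAGEN operates on read alignments, so this string-matching approach and the function names are assumptions for illustration only.

```python
import re


def repeat_length(read, left_flank, repeat_unit, right_flank):
    """Return the number of repeat units spanned by the read, or None."""
    pattern = (re.escape(left_flank)
               + "((?:" + re.escape(repeat_unit) + ")+)"
               + re.escape(right_flank))
    match = re.search(pattern, read)
    if match is None:
        return None  # read does not span both flanks
    return len(match.group(1)) // len(repeat_unit)


def length_distribution(reads, left_flank, repeat_unit, right_flank, size=100):
    """Tabulate read support for repeat lengths 1..size."""
    dist = [0] * size
    for read in reads:
        n = repeat_length(read, left_flank, repeat_unit, right_flank)
        if n is not None and 1 <= n <= size:
            dist[n - 1] += 1  # index 0 holds repeat length 1
    return dist
```

A read that ends inside the repeat lacks one flank and is therefore not counted, which matches the spanning requirement described in the algorithm steps.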
Example .dist file:
Summing up the numbers in the last column gives the total number of reads covering the site.
Columns in .dist files:
chromosome
chromosome of the site
location
start position of the site
repeat_unit_bases
the base(s) of the repeat unit in reference in A/T/C/G string
reference_allele
the number of repeats in reference
covered
whether the site is covered by sufficient reads (determined by msi-coverage-threshold)
length_distribution
A vector of size 100 that records read support for each repeat length from 1 to 100.
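Given a length_distribution vector and the reference_allele value, the coverage and peak criteria from the algorithm steps can be sketched as below. The threshold arguments are named after the reported parameters, but no default values are assumed because none are given here. Treating the ratio bounds as inclusive and applying the checks to a single sample's distribution are assumptions of this sketch.

```python
def has_enough_support(dist, reference_allele,
                       spanning_coverage_threshold, min_reference_peak_height):
    """Coverage and reference peak checks for one sample's distribution."""
    total_reads = sum(dist)                    # all reads spanning the site
    reference_peak = dist[reference_allele - 1]
    return (total_reads > spanning_coverage_threshold
            and reference_peak > min_reference_peak_height)


def passes_peak_filter(dist, reference_allele, min_left_peak_height,
                       min_left_peak_ratio, max_left_peak_ratio):
    """Left peak height and reference-to-left peak ratio checks."""
    reference_peak = dist[reference_allele - 1]
    left_peak = dist[reference_allele - 2] if reference_allele >= 2 else 0
    if left_peak == 0 or left_peak < min_left_peak_height:
        return False
    ratio = reference_peak / left_peak
    return min_left_peak_ratio <= ratio <= max_left_peak_ratio
```

Per the algorithm steps, the coverage check has to hold in both the tumor and the normal sample, so a caller would evaluate has_enough_support once per sample.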
Example <prefix>.microsat_diffs.txt file
Columns in <prefix>.microsat_diffs.txt
Chromosome
chromosome of the site
Start
start position of the site
RepeatUnit
the base(s) of the repeat unit in reference in A/T/C/G string
Assessed
whether the site is assessed, based on read coverage and the number of reads supporting the reference length
Distance
the Jensen-Shannon distance between tumor and normal distributions
PValue
statistical significance of the difference observed between distributions
PassFilter
whether the site passes filters based on specific peak heights
msi-microsatellites-file
Specify the file containing the microsatellite site list. DRAGEN has been tested with ≥ 10 bp homopolymers for solid samples, and 6-7 bp homopolymers for liquid samples.
msi-ref-normal-dir
Full name of the directory containing files with normal reference repeat length distributions. These files can be generated by running collect-evidence on each normal sample. A site is only evaluated if at least 20 normal samples have enough coverage for that site.
msi-coverage-threshold
Specify the minimum spanning read coverage for a microsatellite. Microsatellites that do not meet the specified threshold are not assessed in the analysis. DRAGEN recommends using 60 as the default value for solid samples. If the coverage is low, the threshold can be lowered to 30 to increase the number of microsatellite sites assessed in the analysis. For TSO500 liquid, a value of 500 is recommended. The algorithm steps describe how spanning reads are counted.
Default WES and WGS Microsatellite site files can be downloaded here:
When converting MSIsensor-Pro output, rearrange the columns to match the format of a DRAGEN microsatellite site list.
We recommend using a population variant reference database to filter out all sites that overlap with small variants with population allele frequencies > 1%.
Normal reference files can be generated by running collect-evidence mode on a panel of normal samples. The output is in the same format as the .dist files described in this document. The default normal reference files are also available for WES and WGS.
<prefix>.microsat_output.json reports MSI status and the parameters used, in JSON format.
<prefix>.microsat_diffs.txt reports the statistical distance between tumor and normal samples for each site, and the statistics used to determine the status of the site. This file is not part of the output in collect-evidence mode.
<prefix>.microsat_normal.dist reports the repeat length distribution of each site in a normal sample. This file is not part of the output in tumor-only mode.
<prefix>.microsat_tumor.dist reports the repeat length distribution of each site in the tumor sample. This file is not part of the output in collect-evidence mode.
The above figure shows a mock example of read pileup (left) at a pre-specified homopolymer site with 10 repeat units of T in reference with two abnormal alignments at bottom, and the distribution of repeat lengths (right) corresponding to the pileup.
The details of how column values are computed can be found in .