Microsatellite Instability
Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.
DRAGEN MSI supports tumor-normal and tumor-only modes. The tumor-only mode requires a panel of normals, which can be generated using the collect-evidence mode.
The default microsatellite site lists and the panel of normals are available for WES and WGS (DRAGEN Software Support Site page). Custom panels other than WES and WGS may require more extensive validation and possibly require generating a new sites file.
MSI Algorithm
The MSI algorithm performs the following steps:
Tabulate the number of read alignments for each microsatellite site in tumor and normal samples.
A read is counted toward a repeat length only if its sequence contains the repeat sequence together with the 5 flanking bases on each of the left and right sides, as specified in the microsatellite site list.
When msi-read-stitching is turned on, a pair of reads is counted as one read if the two reads overlap each other.
Calculate the Jensen-Shannon (JS) distance between the tumor and normal distributions.
In tumor-normal mode, the JS distance is calculated between the tumor sample and the normal sample.
In tumor-only mode, the intra-normal JS distances between all pairs of normal samples are calculated first. The mean JS distance between the tumor sample and all normal samples is then normalized by the mean intra-normal distance.
Compute P-values for each site using
chi-square testing between tumor and normal distributions in tumor-normal mode, and
Student's t-testing between mean tumor and normal distributions in tumor-only mode.
Determine whether the site is assessed. A site is assessed if the following criteria are satisfied:
the total number of supporting reads is greater than SpanningCoverageThreshold in both tumor and normal samples
the number of reads supporting the reference repeat length is larger than MinReferencePeakHeight.
Determine if a site is unstable based on both the Jensen-Shannon distance and the P-value. A site is unstable if the JS distance is larger than DistanceThreshold (default=0.1) and the P-value is smaller than PValueThreshold (default=0.01).
Determine if the site passes filters based on specific peak heights. A site passes if the following criteria are satisfied:
the number of reads supporting (reference repeat length - 1) is greater than or equal to MinLeftPeakHeight
the ratio of the number of reads supporting the reference repeat length and (reference repeat length - 1) is between MinLeftPeakRatio and MaxLeftPeakRatio.
If a filter is not passed, the site is counted toward the total assessed sites, but is not counted toward the unstable sites even if its distance and P-value pass the thresholds.
Summarize the statistics and produce a report in the JSON output file containing the assessed site count, the unstable site count, the percentage of unstable sites among all assessed sites, and the sum of the Jensen-Shannon distances of all unstable sites. The parameter values mentioned above are also reported.
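The distance calculation in the steps above can be sketched in Python as follows. This is a minimal illustration and not DRAGEN's implementation: the function names js_distance and normalized_tumor_only_distance are hypothetical, the base-2 logarithm (which bounds the distance to the range 0 to 1) is an assumption, and "normalize" is interpreted here as a plain division by the mean intra-normal distance.

```python
import math
from itertools import combinations


def js_distance(p_counts, q_counts):
    """Jensen-Shannon distance between two repeat-length count vectors."""
    if len(p_counts) != len(q_counts):
        raise ValueError("distributions must have the same length")

    def normalize(counts):
        total = sum(counts)
        if total == 0:
            raise ValueError("distribution has no supporting reads")
        return [c / total for c in counts]

    p, q = normalize(p_counts), normalize(q_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Bins with zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # The distance is the square root of the divergence (log base 2 assumed).
    return math.sqrt(max(divergence, 0.0))


def normalized_tumor_only_distance(tumor, normals):
    """Mean tumor-vs-normal distance divided by the mean intra-normal distance.

    Assumption: normalization is a simple ratio of the two means.
    """
    if len(normals) < 2:
        raise ValueError("at least two normal samples are needed")
    intra = [js_distance(a, b) for a, b in combinations(normals, 2)]
    mean_intra = sum(intra) / len(intra)
    if mean_intra == 0:
        raise ValueError("normal samples are identical; ratio is undefined")
    to_normals = [js_distance(tumor, n) for n in normals]
    return (sum(to_normals) / len(to_normals)) / mean_intra
```

In tumor-normal mode only js_distance would be needed; the second function mirrors the tumor-only description, where the tumor sample is compared against every sample in the panel of normals.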
Command-Line Options
Example command for tumor-only mode
It is recommended to use tumor-only mode rather than tumor-normal mode if a panel of normals is available. It is also recommended to match the sample types of the panel of normals and the tumor sample for optimal performance. For example, a panel of normals that are FFPE samples should be used with FFPE or FF (Fresh-Frozen) tumor samples.
The TSO500 panels do not have normal controls, and are only tested and validated in tumor-only mode.
Example command for tumor-normal mode
The paired normal sample is specified by --fastq-file1 and --fastq-file2.
msi-command
Mode of execution: tumor-only, tumor-normal, or collect-evidence.
msi-microsatellites-file
msi-ref-normal-dir
msi-ref-normal-input
Full name of a combined file with reference normal repeat length distributions from multiple samples.
msi-read-stitching
Whether to count overlapping reads as one fragment. It is recommended to set this option to True for libraries with short fragments. When read stitching is turned on, the read coverage at each site is lowered, so it is recommended to also lower msi-coverage-threshold, especially for lower-coverage samples.
msi-coverage-threshold
msi-distance-threshold
Threshold for distance distributions to be considered different. Default is 0.1. For liquid samples, a value of 0.02 is recommended.
Assay-Specific Settings
For TSO500 Solid, microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". It is generally recommended to use "PercentageUnstableSites" as the metric for determining the MSI status. This metric is normalized and is expected to be more consistent across different pipelines and different input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (PCR, UMI, etc.) and may need some empirical calibration.
Sample type | Assay | Microsatellite sites file | Recommended setting | PercentageUnstableSites threshold
Solid | TSO500 | Part of TSO500 resource bundle. Repeats 10 - 50. 130 sites. | msi-distance-threshold=0.1 | 20
Heme | TSO500 | N/A | N/A | N/A
Liquid (cfDNA) | TSO500 | Part of TSO500 resource bundle. Repeats 6,7. 2344 sites. | msi-distance-threshold=0.02 | TBD
Solid, Heme | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.1 | TBD
Liquid (cfDNA) | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.02 | TBD
Solid, Heme | WGS | Available for download. Repeats 10 - 50. Approx. 1 mil sites. | msi-distance-threshold=0.1 | TBD
Liquid (cfDNA) | WGS | Available for download. Repeats 10 - 50. Approx. 1 mil sites. | msi-distance-threshold=0.02 | TBD
Microsatellite sites files
The following is an example of a microsatellite file:
Default WES and WGS Microsatellite site files can be downloaded here: DRAGEN Software Support Site page
For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This will avoid using any off-target reads in the MSI analysis.
Microsatellite site list columns
chromosome
Chromosome of the site
location
Start location of the site
repeat_unit_length
Size of the repeat unit
repeat_unit_binary
Binary encoding of the repeat unit base converted to decimal (A: 0, C: 1, G: 2, T: 3)
repeat_times
Number of repeat units in the reference
left_flank_binary
Left flank bases in terms of binary encoding converted to decimal
right_flank_binary
Right flank bases in terms of binary encoding converted to decimal
repeat_unit_bases
Repeat unit base in A/T/C/G
left_flank_bases
Five bases on the left flank of the microsatellite site
right_flank_bases
Five bases on the right flank of the microsatellite site
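The binary columns above can be illustrated with a short Python sketch. It assumes two bits per base with the first base as the most significant digit, which is the convention used by MSIsensor-style site lists; the helper names encode_bases and decode_bases are hypothetical and are not part of DRAGEN.

```python
# Two-bit codes from the column description: A: 0, C: 1, G: 2, T: 3.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_bases(bases):
    """Encode a base string as a decimal number (first base most significant)."""
    value = 0
    for base in bases.upper():
        # A KeyError here means the string contains N or another symbol.
        value = value * 4 + BASE_CODES[base]
    return value


def decode_bases(value, length):
    """Inverse of encode_bases for a string of known length."""
    letters = "ACGT"
    out = []
    for _ in range(length):
        out.append(letters[value % 4])
        value //= 4
    return "".join(reversed(out))
```

Under this assumption a left flank of AGCCG encodes to 150, and decoding 685 with length 5 yields GGGTC. The length must be supplied when decoding because leading A bases encode as zero.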
Custom Microsatellite files
Custom Microsatellite site files may be required if a small panel is targeted and/or the default site files do not have sufficient overlapping sites.
Custom Microsatellite site files can be generated by using MSIsensor-Pro https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices.
A subsequent post-processing step is required for the site list to be used by DRAGEN:
only keep microsatellites sites with a repeat unit of length 1
keep sites with 10 - 50bp repeats (a max length of 100bp repeats is supported)
remove any sites containing Ns in the left or right anchors
downsample the remaining sites to contain no more than 1 million sites (to avoid excessive run time)
rearrange the columns to match the format of a DRAGEN microsatellite site list (see Microsatellite site list columns)
An error occurs if long (>100bp) microsatellite sites are present in the file.
The microsatellite site file output by MSIsensor-Pro is in a different format from the DRAGEN site file, so a post-processing step is required to convert the format.
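The post-processing rules above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name postprocess_sites is hypothetical, and each input row is assumed to be a dictionary (for example from csv.DictReader) whose keys have already been mapped to the DRAGEN column names listed in Microsatellite site list columns. The actual MSIsensor-Pro column names and order are not shown here and should be checked against its output.

```python
import random

# Column order taken from the "Microsatellite site list columns" section.
DRAGEN_COLUMNS = [
    "chromosome", "location", "repeat_unit_length", "repeat_unit_binary",
    "repeat_times", "left_flank_binary", "right_flank_binary",
    "repeat_unit_bases", "left_flank_bases", "right_flank_bases",
]


def postprocess_sites(rows, max_sites=1_000_000, seed=0):
    """Filter, downsample, and reorder site rows (illustrative sketch)."""
    kept = []
    for row in rows:
        # Keep only sites with a repeat unit of length 1.
        if int(row["repeat_unit_length"]) != 1:
            continue
        # Keep 10 - 50 repeats (equal to 10 - 50bp for a unit length of 1).
        if not 10 <= int(row["repeat_times"]) <= 50:
            continue
        # Drop sites with N in either flank (anchor).
        flanks = row["left_flank_bases"] + row["right_flank_bases"]
        if "N" in flanks.upper():
            continue
        # Rearrange into the DRAGEN column order, dropping extra columns.
        kept.append({col: row[col] for col in DRAGEN_COLUMNS})

    if len(kept) > max_sites:
        # Random downsampling that preserves the original site order.
        rng = random.Random(seed)
        chosen = sorted(rng.sample(range(len(kept)), max_sites))
        kept = [kept[i] for i in chosen]
    return kept
```

A fixed seed keeps the downsampled site list reproducible, which matters because the same site file has to be used for collect-evidence and for tumor-only calling.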
Germline variant filtering
We recommend filtering out microsatellite sites that overlap with known population variants. A locus affected by small variants will result in artificially inflated differences between samples. In the example below, the site in the normal sample overlaps with a heterozygous variant (possibly a one-base ins/del). In the paired tumor sample, the heterozygosity is lost (LOH). The difference observed between the two distributions is not due to microsatellite instability, but to LOH.
We recommend using gnomAD as the reference database to filter all sites that overlap with small variants with population allele frequencies > 1%.
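A minimal Python sketch of this filtering step is shown below. It is an illustration only: the function name and the tuple layouts are hypothetical, variants are assumed to be already extracted from gnomAD as (chromosome, position, allele frequency) in the same coordinate system as the site list, each variant is treated as a single position, and the flanks are not included in the site span.

```python
import bisect


def filter_sites_by_population_variants(sites, variants, max_af=0.01):
    """Drop sites whose repeat span contains a variant with AF > max_af.

    sites:    iterable of (chromosome, start, repeat_unit_length, repeat_times)
    variants: iterable of (chromosome, position, allele_frequency)
    """
    # Index positions of common variants per chromosome for binary search.
    common = {}
    for chrom, pos, af in variants:
        if af > max_af:
            common.setdefault(chrom, []).append(pos)
    for positions in common.values():
        positions.sort()

    kept = []
    for site in sites:
        chrom, start, unit_len, times = site
        end = start + unit_len * times  # half-open span [start, end)
        positions = common.get(chrom, [])
        i = bisect.bisect_left(positions, start)
        if i < len(positions) and positions[i] < end:
            continue  # a common variant falls inside the repeat
        kept.append(site)
    return kept
```

A production version would also need to account for the reference span of indels and, if desired, extend the site span to cover the five-base flanks.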
Normal references of microsatellite repeat distribution
The normals reference can be provided in two formats: as separate files in one directory, or as a single file containing distributions from multiple samples.
Separate files can be provided with msi-ref-normal-dir. The directory should contain only the .dist files that are used as normal references.
A combined file can be provided with msi-ref-normal-input. The combined .dist file must contain an additional column that specifies the name of the sample for each distribution.
Normal reference files can be generated by running collect-evidence mode on a panel of normal samples. The output is in the same format as the .dist file described in MSI output. The default normal reference files are also available for WES and WGS at DRAGEN Software Support Site page.
Please note:
The collect-evidence mode MUST be run in DRAGEN germline mode, as indicated by fastq options -1 and -2.
The --msi-microsatellites-file and --msi-coverage-threshold settings used in collect-evidence mode must be consistent with the settings used during tumor-only MSI calling.
At least 20 normal samples are required.
MSI Output
DRAGEN outputs the following files during the MSI workflow:
<prefix>.microsat_output.json
<prefix>.microsat_diffs.txt
<prefix>.microsat_normal.dist
<prefix>.microsat_tumor.dist
<prefix>.microsat_log.txt
Logs the runtime and MSI results
MSI score report
The JSON file <prefix>.microsat_output.json contains the parameters to reproduce the experiments, and the MSI results (including the MSI score PercentageUnstableSites).
The "SumDistance" is the sum of the Jensen-Shannon distances of all unstable sites, based on the distances between tumor and normal distributions. "SumDistance" depends on the size of the microsatellite file and is not normalized. In general it is recommended to set MSI thresholds based on "PercentageUnstableSites" rather than "SumDistance".
In TSO500, Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.
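Applying such a threshold to the report could look like the following Python sketch. It is hedged in two ways: the function name is hypothetical, and the "PercentageUnstableSites" field is assumed to sit at the top level of the JSON document, which should be verified against an actual <prefix>.microsat_output.json file. The default of 20 is the TSO500 Solid threshold quoted above and is not validated for other assays.

```python
import json


def is_microsatellite_unstable(report_text, threshold=20.0,
                               key="PercentageUnstableSites"):
    """Return True if the reported percentage meets the threshold.

    Assumption: `key` is a top-level field of the JSON report.
    """
    report = json.loads(report_text)
    # float() accepts both JSON numbers and numeric strings.
    percentage = float(report[key])
    return percentage >= threshold
```

The threshold is a parameter because the exact cutoff for other assays is expected to need empirical calibration.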
Distribution of repeat lengths
DRAGEN MSI computes the number of repeat units (repeat lengths) supported by each read fragment.
The distribution is recorded in <prefix>.microsat_normal.dist and <prefix>.microsat_tumor.dist for normal and tumor samples, respectively.
Example .dist file:
Summing up the numbers in the last column gives the total number of reads covering the site.
Columns in .dist files:
chromosome
chromosome of the site
location
start position of the site
repeat_unit_bases
the base(s) of the repeat unit in reference in A/T/C/G string
reference_allele
the number of repeats in reference
covered
whether the site is covered by sufficient reads (determined by msi-coverage-threshold)
length_distribution
A vector of size 100 that records read support for each repeat length from 1 to 100.
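To show how the length_distribution vector relates to the coverage and peak-height checks described in MSI algorithm, here is a small Python sketch. It works on an already parsed vector and makes no assumption about the file delimiters. The function names are hypothetical, the ratio is taken as reference peak divided by left peak, the "between" bounds are treated as inclusive, and a left peak of zero is treated as failing; all of these are assumptions, and no default threshold values are implied.

```python
def peak_summary(length_distribution, reference_repeats):
    """Summarize one site's repeat-length vector.

    Index 0 of the vector corresponds to repeat length 1.
    """
    total = sum(length_distribution)
    reference_peak = length_distribution[reference_repeats - 1]
    left_peak = (
        length_distribution[reference_repeats - 2]
        if reference_repeats >= 2 else 0
    )
    return {
        "total_reads": total,
        "reference_peak": reference_peak,
        "left_peak": left_peak,
    }


def passes_peak_filter(summary, min_left_peak_height,
                       min_left_peak_ratio, max_left_peak_ratio):
    """Check the left-peak height and ratio criteria (illustrative)."""
    left = summary["left_peak"]
    if left < min_left_peak_height or left == 0:
        return False  # too few reads, or the ratio would be undefined
    ratio = summary["reference_peak"] / left
    return min_left_peak_ratio <= ratio <= max_left_peak_ratio
```

For a site with 12 reference repeats, the reference peak is read from index 11 of the vector and the left peak from index 10.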
Difference between tumor and normal samples
Example <prefix>.microsat_diffs.txt file
Columns in <prefix>.microsat_diffs.txt
The details of how column values are computed can be found in MSI algorithm.
Chromosome
chromosome of the site
Start
start position of the site
RepeatUnit
the base(s) of the repeat unit in reference in A/T/C/G string
Assessed
whether the site is assessed based on read coverage and the number of reads supporting the reference length
Distance
the Jensen-Shannon distance between tumor and normal distributions
PValue
statistical significance of the difference observed between distributions
PassFilter
whether the site passes filters based on specific peak heights
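The way these columns combine into the summary metrics can be sketched in Python as follows. The rows are assumed to be already parsed into dictionaries keyed by the column names above, with Assessed and PassFilter as booleans and Distance and PValue as floats; the function name and the AssessedSites and UnstableSites keys are hypothetical, while the default thresholds are the ones given in MSI algorithm.

```python
def summarize_sites(rows, distance_threshold=0.1, p_value_threshold=0.01):
    """Recompute summary metrics from per-site rows (illustrative sketch)."""
    assessed = 0
    unstable = 0
    sum_distance = 0.0
    for row in rows:
        if not row["Assessed"]:
            continue
        assessed += 1
        # A site that fails the peak filters still counts as assessed,
        # but never as unstable.
        if (row["PassFilter"]
                and row["Distance"] > distance_threshold
                and row["PValue"] < p_value_threshold):
            unstable += 1
            sum_distance += row["Distance"]
    percentage = 100.0 * unstable / assessed if assessed else 0.0
    return {
        "AssessedSites": assessed,
        "UnstableSites": unstable,
        "PercentageUnstableSites": percentage,
        "SumDistance": sum_distance,
    }
```

Because the percentage is divided by the assessed site count, it stays comparable across site files of different sizes, whereas the summed distance grows with the number of sites.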