Microsatellite Instability

Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.

DRAGEN MSI supports running in tumor-normal and tumor-only modes. The tumor-only mode will require a panel of normals. The panel of normals can be generated using the collect-evidence mode.

The default microsatellite site lists and the panel of normals are available for WES and WGS (DRAGEN Software Support Site page). Custom panels other than WES and WGS may require more extensive validation and possibly require generating a new sites file.

MSI Algorithm

The MSI algorithm performs the following steps:

  1. Tabulate the number of read alignments for each microsatellite site in tumor and normal samples.

    • A read is counted toward a repeat length only if it contains the repeat sequence together with the 5 flanking bases on both the left and right sides, as specified in the microsatellite site list.

    • When msi-read-stitching is turned on, a pair of overlapping reads is counted as one read.

  2. Calculate the Jensen-Shannon (JS) distance between the tumor and normal distributions.

    • In tumor-normal mode, the JS distance is calculated between the tumor sample and the normal sample.

    • In tumor-only mode, we first calculate intra-normal JS distances between all pairs of normal samples. Then, we normalize the mean JS distance between the tumor sample and all normal samples by the mean intra-normal distance.

  3. Compute P-values for each site using

    • chi-square testing between tumor and normal distributions in tumor-normal mode, and

    • Student's t-test between the mean tumor and normal distributions in tumor-only mode.

  4. Determine whether the site is assessed. A site is assessed if the following criteria are satisfied:

    • the total number of supporting reads is greater than SpanningCoverageThreshold in both tumor and normal samples

    • the number of reads supporting the reference repeat length is larger than MinReferencePeakHeight.

  5. Determine if a site is unstable based on both the Jensen-Shannon distance and P-values. A site is unstable if JS distance is larger than DistanceThreshold (default=0.1), and P-value is smaller than PValueThreshold (default=0.01).

  6. Determine whether the site passes the filters based on specific peak heights. A site passes if the following criteria are satisfied:

    • the number of reads supporting (reference repeat length - 1) is greater than or equal to MinLeftPeakHeight

    • the ratio of number of reads supporting reference repeat length and (reference repeat length - 1) is between MinLeftPeakRatio and MaxLeftPeakRatio.

    If a filter is not passed, the site is counted toward the total assessed sites, but is not counted toward the unstable sites even if the distance and P-value pass the thresholds.

  7. Summarize the stats and produce a report in the JSON output file with the assessed site count, the unstable site count, the percentage of unstable sites among all assessed sites, and the sum of the Jensen-Shannon distances of all unstable sites. The parameter values mentioned above are also reported.
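The distance and threshold steps above can be sketched in Python. This is a minimal illustration, not the DRAGEN implementation: it assumes the Jensen-Shannon distance is the square root of the base-2 JS divergence, and the function names (js_distance, tumor_only_distance, is_unstable) are hypothetical. The P-value is taken as an input because the exact test setup is not described here.

```python
import math
from itertools import combinations


def _normalize(counts):
    """Convert read counts per repeat length into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("site has no supporting reads")
    return [c / total for c in counts]


def _kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p == 0 contribute 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_distance(counts_a, counts_b):
    """Assumed definition: square root of the base-2 Jensen-Shannon divergence."""
    p, q = _normalize(counts_a), _normalize(counts_b)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = (_kl(p, m) + _kl(q, m)) / 2
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def tumor_only_distance(tumor, normals):
    """Mean tumor-vs-normal distance divided by the mean intra-normal distance."""
    intra = [js_distance(a, b) for a, b in combinations(normals, 2)]
    if not intra or sum(intra) == 0:
        raise ValueError("need at least two non-identical normal distributions")
    mean_intra = sum(intra) / len(intra)
    mean_tumor = sum(js_distance(tumor, n) for n in normals) / len(normals)
    return mean_tumor / mean_intra


def is_unstable(distance, p_value, distance_threshold=0.1, p_value_threshold=0.01):
    """Step 5: both the distance and the P-value must pass their thresholds."""
    return distance > distance_threshold and p_value < p_value_threshold
```

With this definition the distance is 0 for identical distributions and 1 for distributions with no shared repeat lengths, which keeps a fixed threshold such as 0.1 comparable across sites.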

Command-Line Options

Example command for tumor-only mode

It is recommended to use tumor-only mode rather than tumor-normal if a panel of normals is available. It is also recommended to match the sample types of the panel of normals and the tumor sample for optimal performance. For example, a panel of normals that are FFPE samples should be used with FFPE or FF (Fresh-Frozen) tumor samples.

The TSO500 panels do not have normal controls, and are only tested and validated in tumor-only mode.

dragen \
--msi-command tumor-only \
--msi-coverage-threshold 60 \
--msi-microsatellites-file ${microsatellite_file} \
--msi-ref-normal-dir ${normal_reference_directory} \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
--enable-map-align=true \
--RGID=read_group_ID \
--RGSM=read_group_sample \
--ref-dir ${reference_directory} \
--enable-map-align-output=true \
--enable-sort true \
--enable-duplicate-marking=true \
--tumor-fastq1 ${tumor_fq1} \
--tumor-fastq2 ${tumor_fq2}

Example command for tumor-normal mode

The paired normal sample is specified by --fastq-file1 and --fastq-file2.

dragen \
--msi-command tumor-normal \
--msi-coverage-threshold 60 \
--msi-microsatellites-file ${microsatellite_file} \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
--enable-map-align true \
--RGID=read_group_ID \
--RGSM=read_group_sample \
--ref-dir ${reference_directory} \
--enable-map-align-output true \
--enable-sort true \
--enable-duplicate-marking true \
--tumor-fastq1 ${tumor_fq1} \
--tumor-fastq2 ${tumor_fq2} \
--fastq-file1 ${fq1} \
--fastq-file2 ${fq2}

Option
Description

msi-command

Mode of execution: tumor-only, tumor-normal, or collect-evidence.

msi-microsatellites-file

Specify the file containing the microsatellite sites. DRAGEN has tested with ≥ 10 bp homopolymers for solid samples, and 6-7 bp homopolymers for liquid samples.

msi-ref-normal-dir

Full name of directory containing files with normal reference repeat length distribution. These files can be generated by running collect-evidence on each normal sample. A site is only evaluated if at least 20 normal samples have enough coverage for that site.

msi-ref-normal-input

Full name of a combined file with reference normal repeat length distributions from multiple samples.

msi-read-stitching

Whether to count overlapping reads as one fragment. It is recommended to set this option to True for libraries with short fragments. Because read stitching lowers the read coverage at each site, it is also recommended to lower msi-coverage-threshold, especially for lower-coverage samples.

msi-coverage-threshold

Specify the minimum spanning read coverage for a microsatellite. Microsatellites that do not meet the specified threshold are not assessed in the analysis. DRAGEN recommends using 60 as the default value for solid samples. If the coverage is low, you can lower the threshold to 30 to increase the number of microsatellite sites assessed. For TSO500 liquid, a value of 500 is recommended. See MSI Algorithm for details on how the number of spanning reads is counted.

msi-distance-threshold

Threshold for distance distributions to be considered different. Default is 0.1. For liquid samples, a value of 0.02 is recommended.

Assay-Specific Settings

TSO500 Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". It is generally recommended to use "PercentageUnstableSites" as the metric for determining the MSI status. This metric is normalized, and is expected to be more consistent across different pipelines and different input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (for example, PCR or UMI) and may need some empirical calibration.

Sample Type     Assay   Microsatellite file                                            Specific Settings            PercentageUnstableSites Threshold
Solid           TSO500  Part of TSO500 resource bundle. Repeats 10 - 50. 130 sites.    msi-distance-threshold=0.1   20
Heme            TSO500  N/A                                                            N/A                          N/A
Liquid (cfDNA)  TSO500  Part of TSO500 resource bundle. Repeats 6,7. 2344 sites.       msi-distance-threshold=0.02  TBD
Solid, Heme     WES     Available for download. Repeats 10 - 50. Approx. 3.5K sites.   msi-distance-threshold=0.1   TBD
Liquid (cfDNA)  WES     Available for download. Repeats 10 - 50. Approx. 3.5K sites.   msi-distance-threshold=0.02  TBD
Solid, Heme     WGS     Available for download. Repeats 10 - 50. Approx. 1 mil sites.  msi-distance-threshold=0.1   TBD
Liquid (cfDNA)  WGS     Available for download. Repeats 10 - 50. Approx. 1 mil sites.  msi-distance-threshold=0.02  TBD

Microsatellite sites files

The following is an example of a microsatellite file:

#chromosome     location        repeat_unit_length      repeat_unit_binary      repeat_times    left_flank_binary       right_flank_binary      repeat_unit_bases       left_flank_bases    right_flank_bases
chr1	985443	1	2	15	676	992	G	GGGCA	TTGAA
chr1	7980985	1	0	10	231	1020	A	ATGCT	TTTTA
chr1	8022800	1	3	19	13	41	T	AAATC	AAGGC
chr1	8029500	1	2	10	39	0	G	AAGCT	AAAAA
chr1	9146447	1	3	15	887	248	T	TCTCT	ATTGA
chr1	9767837	1	3	12	704	195	T	GTAAA	ATAAT

Default WES and WGS Microsatellite site files can be downloaded here: DRAGEN Software Support Site page

For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This will avoid using any off-target reads in the MSI analysis.

Microsatellite site list columns

Column name
Description

chromosome

Chromosome of the site

location

Start location of the site

repeat_unit_length

Size of the repeat unit

repeat_unit_binary

Binary encoding of the repeat unit base converted to decimal (A: 0, C: 1, G: 2, T: 3)

repeat_times

Number of repeat units in reference

left_flank_binary

Left flank bases in terms of binary encoding converted to decimal

right_flank_binary

Right flank bases in terms of binary encoding converted to decimal

repeat_unit_bases

Repeat unit base in A/T/C/G

left_flank_bases

Five bases on the left flank of the microsatellite site

right_flank_bases

Five bases on the right flank of the microsatellite site
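The binary columns in the example rows are consistent with reading each base as a base-4 digit (A: 0, C: 1, G: 2, T: 3), most significant digit first. The following sketch assumes that interpretation; encode_bases is a hypothetical helper, not part of DRAGEN.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_bases(bases):
    """Encode a base string as a decimal number, two bits per base, left to right."""
    value = 0
    for base in bases.upper():
        value = value * 4 + BASE_CODE[base]  # raises KeyError on N or other symbols
    return value


# First row of the example file: left flank GGGCA, right flank TTGAA, repeat unit G
print(encode_bases("GGGCA"), encode_bases("TTGAA"), encode_bases("G"))
```

A check like this can be used to validate a custom site list before running DRAGEN, by confirming that each binary column agrees with the corresponding bases column.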

Custom Microsatellite files

Custom Microsatellite site files may be required if a small panel is targeted and/or the default site files do not have sufficient overlapping sites.

Custom Microsatellite site files can be generated by using MSIsensor-Pro https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices.

msisensor-pro scan -d /path/to/reference.fa -o ${microsatellite_file}

A subsequent post-processing step is required for the site list to be used by DRAGEN:

  • only keep microsatellites sites with a repeat unit of length 1

  • keep sites with 10 - 50bp repeats (a max length of 100bp repeats is supported)

  • remove any sites containing Ns in the left or right anchors

  • downsample the remaining sites to contain no more than 1 million sites (to avoid excessive run time)

  • rearrange the columns to match the format of a DRAGEN microsatellite site list (see Microsatellite site list columns)

An error occurs if long (>100bp) microsatellite sites are present in the file.

The microsatellite site file output by MSIsensor-Pro is in a different format from the DRAGEN site file, so the post-processing step must also convert the format.
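The filtering part of the post-processing can be sketched as follows. This is an illustration only: it assumes the sites have already been loaded as dictionaries keyed by the column names in Microsatellite site list columns (for example with csv.DictReader), and it does not cover the column rearrangement. The function name filter_sites and its parameters are hypothetical.

```python
import random


def filter_sites(rows, min_repeats=10, max_repeats=50, max_sites=1_000_000, seed=0):
    """Keep homopolymer sites in the repeat range with N-free flanks, then downsample."""
    kept = []
    for row in rows:
        if int(row["repeat_unit_length"]) != 1:
            continue  # only repeat units of length 1
        if not min_repeats <= int(row["repeat_times"]) <= max_repeats:
            continue  # outside the 10 - 50 repeat range
        flanks = (row["left_flank_bases"] + row["right_flank_bases"]).upper()
        if "N" in flanks:
            continue  # ambiguous base in a left or right anchor
        kept.append(row)

    if len(kept) > max_sites:
        # Random downsampling with a fixed seed; original file order is preserved.
        rng = random.Random(seed)
        chosen = sorted(rng.sample(range(len(kept)), max_sites))
        kept = [kept[i] for i in chosen]
    return kept
```

A fixed seed makes the downsampled site list reproducible, which matters because the same site file must be used for collect-evidence and for tumor-only calling.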

Germline variant filtering

We recommend filtering out microsatellite sites that overlap with known population variants. A locus affected by small variants produces artificially inflated differences between samples. In the example below, the site in the normal sample overlaps with a heterozygous variant (possibly a one-base insertion or deletion). In the paired tumor sample, the heterozygosity is lost (LOH). The difference observed between the two distributions is therefore due to LOH, not microsatellite instability.

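[Figure msi-snv: repeat length distributions at a site that overlaps a heterozygous variant in the normal sample and shows LOH in the paired tumor sample.]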

We recommend using gnomAD as the reference database to filter all sites that overlap with small variants with population allele frequencies > 1%.
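A minimal sketch of this filter is shown below. It assumes the variants have already been extracted from the population database into simple records (chrom, pos, ref_len, af), that sites and variants use the same coordinate convention, and that a site spans repeat_unit_length × repeat_times bases from its start location. All field and function names are hypothetical; in practice an interval tool may be more convenient for large files.

```python
from collections import defaultdict


def drop_sites_with_common_variants(sites, variants, max_af=0.01):
    """Return the sites that do not overlap any variant with AF > max_af."""
    common = defaultdict(list)
    for v in variants:
        if v["af"] > max_af:
            # A variant covers at least one reference base.
            common[v["chrom"]].append((v["pos"], v["pos"] + max(1, v["ref_len"])))

    kept = []
    for site in sites:
        start = site["location"]
        end = start + site["repeat_unit_length"] * site["repeat_times"]  # half-open
        overlaps = any(
            v_start < end and start < v_end
            for v_start, v_end in common.get(site["chromosome"], [])
        )
        if not overlaps:
            kept.append(site)
    return kept
```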

Normal references of microsatellite repeat distribution

The normals reference can be provided in two formats: as separate files in one directory, or as a single file containing distributions from multiple samples.

  • Separate files can be provided with msi-ref-normal-dir. The directory should contain only the .dist files that are used as normal references.

  • A combined file can be provided with msi-ref-normal-input. The combined .dist file must contain an additional column that specifies the name of the sample for each distribution.

Normal reference files can be generated by running collect-evidence mode on a panel of normal samples. The output is in the same format as the .dist file described in MSI output. The default normal reference files are also available for WES and WGS at DRAGEN Software Support Site page.

dragen -f \
--msi-command collect-evidence \
--ref-dir ${reference_directory} \
--msi-microsatellites-file ${microsatellite_file} \
--msi-coverage-threshold 60 \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
-1 ${normal_fq1} \
-2 ${normal_fq2}

Please note:

  • The collect-evidence mode MUST be run in DRAGEN germline mode, as indicated by fastq options -1 and -2.

  • The --msi-microsatellites-file and --msi-coverage-threshold settings used in collect-evidence mode must be consistent with the settings used during tumor-only MSI calling.

  • At least 20 normal samples are required.

MSI Output

DRAGEN outputs the following files during the MSI workflow:

File name
Description

<prefix>.microsat_output.json

MSI score report. Reports the MSI status and the parameters used, in JSON format.

<prefix>.microsat_diffs.txt

Difference between tumor and normal samples. Reports the statistical distance between tumor and normal samples for each site, and the stats used to determine the status of the site. This file is not part of the output in collect-evidence mode.

<prefix>.microsat_normal.dist

Distribution of repeat lengths. Reports the repeat length distribution of each site in a normal sample. This file is not part of the output in tumor-only mode.

<prefix>.microsat_tumor.dist

Distribution of repeat lengths. Reports the repeat length distribution of each site in the tumor sample. This file is not part of the output in collect-evidence mode.

<prefix>.microsat_log.txt

Logs the runtime and MSI results

MSI score report

The JSON file <prefix>.microsat_output.json contains the parameters needed to reproduce the analysis and the MSI results (including the MSI score PercentageUnstableSites).

{   
    "Settings":{
        "Command": "tumor-normal",
        ...,
    },
    "TotalMicrosatelliteSitesAssessed": "20020",
    "TotalMicrosatelliteSitesUnstable": "4374",
    "PercentageUnstableSites": "21.850000000000001",
    "ResultIsValid": "true",
    "ResultMessage": "",
    "SumDistance": "1214.174" 
}

The "SumDistance" is the sum of the Jensen-Shannon distances of all unstable sites, computed from the tumor versus normal distributions. "SumDistance" depends on the number of sites in the microsatellite file and is not normalized. In general, it is recommended to set MSI thresholds based on "PercentageUnstableSites" rather than "SumDistance".

In TSO500, Solid microsatellite instability is defined as all samples with "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.
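As an illustration of how the report might be consumed, the sketch below recomputes the percentage of unstable sites from the two site counts in the report and applies the TSO500 Solid threshold of 20. It assumes, as in the example above, that the values are stored as strings and that a result should only be interpreted when "ResultIsValid" is "true". The function names are hypothetical, and the threshold should be recalibrated for other assays.

```python
import json


def unstable_percentage(report):
    """Percentage of unstable sites among assessed sites, or None if not usable."""
    if report.get("ResultIsValid") != "true":
        return None
    assessed = int(report["TotalMicrosatelliteSitesAssessed"])
    unstable = int(report["TotalMicrosatelliteSitesUnstable"])
    if assessed == 0:
        return None
    return 100.0 * unstable / assessed


def exceeds_threshold(report, threshold=20.0):
    """True if the sample meets the percentage threshold (20 for TSO500 Solid)."""
    pct = unstable_percentage(report)
    return pct is not None and pct >= threshold


# Counts taken from the example report above.
example = json.loads(
    '{"TotalMicrosatelliteSitesAssessed": "20020",'
    ' "TotalMicrosatelliteSitesUnstable": "4374",'
    ' "ResultIsValid": "true"}'
)
print(round(unstable_percentage(example), 2), exceeds_threshold(example))
```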

Distribution of repeat lengths

DRAGEN MSI computes the number of repeat units (repeat lengths) supported by each read fragment.

The above figure shows a mock example of read pileup (left) at a pre-specified homopolymer site with 10 repeat units of T in reference with two abnormal alignments at bottom, and the distribution of repeat lengths (right) corresponding to the pileup.

The distribution is recorded in <prefix>.microsat_normal.dist and <prefix>.microsat_tumor.dist for normal and tumor samples, respectively.

Example .dist file:

#chromosome     location        repeat_unit_bases       reference_allele        covered length_distribution
chr1    985443  G       15      false   0,0,0,0,0,0,0,0,0,0,0,0,...,0
chr1    7980985 A       10      true    0,0,0,0,0,0,2,0,8,393,14,1,...,0
chr1    8022800 T       19      true    0,0,0,0,0,0,0,0,0,0,0,0,0,2,3,3,4,35,42,13,2,2,0,0...0,0

Summing up the numbers in the last column gives the total number of reads covering the site.

Columns in .dist files:

Column name
Description

chromosome

chromosome of the site

location

start position of the site

repeat_unit_bases

the base(s) of the repeat unit in reference in A/T/C/G string

reference_allele

the number of repeats in reference

covered

whether the site is covered by sufficient reads (determined by msi-coverage-threshold)

length_distribution

A vector of size 100 that records read support for each repeat length from 1 to 100.
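A .dist row can be read with a few lines of Python. The sketch below assumes whitespace-separated columns and a comma-separated length_distribution in which entry i (1-based) holds the read support for repeat length i. The helper names are hypothetical, and the demo row is synthetic: it is modelled on the second example row, with zeros assumed for the elided entries.

```python
def parse_dist_line(line):
    """Parse one data row of a .dist file into a dictionary."""
    chrom, location, unit, ref_allele, covered, dist = line.split()
    return {
        "chromosome": chrom,
        "location": int(location),
        "repeat_unit_bases": unit,
        "reference_allele": int(ref_allele),
        "covered": covered == "true",
        "counts": [int(x) for x in dist.split(",")],
    }


def total_reads(record):
    """Total number of reads covering the site."""
    return sum(record["counts"])


def reads_at_length(record, repeat_length):
    """Read support for a given repeat length (lengths start at 1)."""
    return record["counts"][repeat_length - 1]


# Synthetic 100-entry row; elided entries of the example are assumed to be zero.
counts = [0] * 100
for length, n in {7: 2, 9: 8, 10: 393, 11: 14, 12: 1}.items():
    counts[length - 1] = n
line = "chr1\t7980985\tA\t10\ttrue\t" + ",".join(map(str, counts))

record = parse_dist_line(line)
print(total_reads(record), reads_at_length(record, record["reference_allele"]))
```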

Difference between tumor and normal samples

Example <prefix>.microsat_diffs.txt file

#Chromosome	Start	RepeatUnit	Assessed	Distance	PValue	PassFilter
chr1	69106	T	true	0.04105300052	0.4786448589	true
chr1	69116	TC	false	0	0	false

Columns in <prefix>.microsat_diffs.txt

The details of how column values are computed can be found in MSI algorithm.

Column name
Description

Chromosome

chromosome of the site

Start

start position of the site

RepeatUnit

the base(s) of the repeat unit in reference in A/T/C/G string

Assessed

whether the site is assessed based on read coverage and number of reads supporting the reference length

Distance

the Jensen-Shannon distance between tumor and normal distributions

PValue

statistical significance of the difference observed between distributions

PassFilter

whether the site passes filters based on specific peak heights
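The per-site table can be summarized along the lines of steps 5 to 7 of the MSI algorithm. The sketch below is an approximation under stated assumptions: the file is tab-separated with a header line starting with #, the default thresholds (0.1 and 0.01) apply, and a site counts as unstable only if it is assessed, passes the filters, and passes both thresholds. The function name summarize_diffs is hypothetical, and the result is not guaranteed to match the JSON report exactly.

```python
import csv
import io


def summarize_diffs(text, distance_threshold=0.1, p_value_threshold=0.01):
    """Return (assessed, unstable, percentage) from the contents of a diffs file."""
    reader = csv.DictReader(io.StringIO(text.lstrip("#")), delimiter="\t")
    assessed = unstable = 0
    for row in reader:
        if row["Assessed"] != "true":
            continue  # not assessed: excluded from both counts
        assessed += 1
        if (
            row["PassFilter"] == "true"
            and float(row["Distance"]) > distance_threshold
            and float(row["PValue"]) < p_value_threshold
        ):
            unstable += 1
    percentage = 100.0 * unstable / assessed if assessed else 0.0
    return assessed, unstable, percentage


# The two rows from the example file above.
example_text = (
    "#Chromosome\tStart\tRepeatUnit\tAssessed\tDistance\tPValue\tPassFilter\n"
    "chr1\t69106\tT\ttrue\t0.04105300052\t0.4786448589\ttrue\n"
    "chr1\t69116\tTC\tfalse\t0\t0\tfalse\n"
)
print(summarize_diffs(example_text))
```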
