Microsatellite Instability

Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.

DRAGEN MSI supports tumor-normal and tumor-only modes. Tumor-normal mode is generally expected to produce more accurate results. Tumor-only mode requires a panel of normals, which is generated by running collect-evidence mode on normal samples.

Command-Line Options

The following is an example command for tumor-normal mode. Default resource files are available for WES and WGS. Please note that the WES and WGS tumor-normal modes are fully supported and tested. Custom panels may require more extensive validation and possibly require generating a new sites file.

# For --msi-microsatellites-file, see section: Default Microsatellite sites files
dragen \
--msi-command tumor-normal \
--msi-coverage-threshold 60 \
--msi-microsatellites-file ${microsatellite_file} \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
--enable-map-align true \
--RGID=read_group_ID \
--RGSM=read_group_sample \
--ref-dir ${reference_directory} \
--enable-map-align-output true \
--enable-sort true \
--enable-duplicate-marking true \
--tumor-fastq1 ${tumor_fq1} \
--tumor-fastq2 ${tumor_fq2} \
--fastq-file1 ${fq1} \
--fastq-file2 ${fq2}

The following is an example command for the tumor-only mode. Please note that the WES and WGS tumor-only modes are not as extensively tested as the tumor-normal modes. The TSO500 panels do not have normal controls, and are only tested and validated in tumor-only mode.

| Option | Description |
| --- | --- |
| msi-command | Mode of execution: tumor-only, tumor-normal, or collect-evidence. |
| msi-microsatellites-file | Specify the file containing the microsatellites. You can generate this file by scanning the genome for microsatellites with a tool such as MSIsensor-pro. DRAGEN has been tested with ≥ 10 bp homopolymers for solid samples and 6-7 bp homopolymers for liquid samples. |
| msi-ref-normal-dir | Full path of the directory containing the normal reference repeat length distribution files. Used only in tumor-only mode. These files can be generated by running collect-evidence on each normal sample. At least 20 normal samples are required. |
| msi-coverage-threshold | Specify the minimum spanning read coverage for a microsatellite. Microsatellites that do not meet the specified threshold are not included in the analysis. DRAGEN recommends a value of 60 for solid samples. For TSO500 liquid, a value of 500 is recommended. |
| msi-distance-threshold | Distance threshold above which two repeat length distributions are considered different. Default is 0.1. For liquid samples, a value of 0.02 is recommended. |

Assay Specific Settings

For TSO500 Solid, a sample is considered microsatellite unstable when "PercentageUnstableSites >= 20". It is generally recommended to use "PercentageUnstableSites" as the metric for determining MSI status. This metric is normalized and is expected to be more consistent across pipelines and input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (PCR, UMI, etc.) and may need some empirical calibration.

| Sample Type | Assay | Microsatellite file | Specific Settings | PercentageUnstableSites Threshold |
| --- | --- | --- | --- | --- |
| Solid | TSO500 | Part of TSO500 resource bundle. Repeats 10 - 50. 130 sites. | msi-distance-threshold=0.1 | 20 |
| Heme | TSO500 | N/A | N/A | N/A |
| Liquid (cfDNA) | TSO500 | Part of TSO500 resource bundle. Repeats 6,7. 2344 sites. | msi-distance-threshold=0.02 | TBD |
| Solid, Heme | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.1 | TBD |
| Liquid (cfDNA) | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.02 | TBD |
| Solid, Heme | WGS | Available for download. Repeats 10 - 50. Approx. 1 million sites. | msi-distance-threshold=0.1 | TBD |
| Liquid (cfDNA) | WGS | Available for download. Repeats 10 - 50. Approx. 1 million sites. | msi-distance-threshold=0.02 | TBD |

Default Microsatellite sites files

The following is an example of a microsatellite file:

Default WES and WGS microsatellite site files can be downloaded from the DRAGEN Software Support Site page.

For panels it is recommended to post-process the file by intersecting the WES or WGS sites with the panel of interest. This will avoid using any off-target reads in the MSI analysis.

Custom Microsatellite files

Custom Microsatellite site files may be required if a small panel is targeted and the default site files do not have sufficient overlapping sites.

Custom microsatellite site files can be generated with MSIsensor-pro [https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices].

A subsequent post-processing step is recommended:

  • keep only microsatellite sites with a repeat unit of length 1

  • keep sites with 10 - 50 bp repeats (repeats of up to 100 bp are supported)

  • remove any sites containing Ns in the left or right anchors

  • downsample the remaining sites so that the file contains at least 2000 sites, but no more than 1 million sites (to avoid excessive run time)

Note that an error occurs if long (> 100 bp) microsatellite sites are present in the file.
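The post-processing rules above can be sketched in Python. This is only an illustration under stated assumptions: the Site record and its field names are hypothetical and do not reflect the actual site file layout, so parsing and writing the file are left out. The filter keeps homopolymer sites of 10 - 50 bp with no N in either anchor, then randomly downsamples to a maximum number of sites.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical in-memory record for one microsatellite site."""
    chrom: str
    pos: int
    repeat_unit: str    # e.g. "A" for a homopolymer
    repeat_count: int   # number of times the unit is repeated
    left_anchor: str    # flanking sequence to the left of the repeat
    right_anchor: str   # flanking sequence to the right of the repeat


def keep_site(site, min_bp=10, max_bp=50):
    """Return True if the site passes the recommended filters."""
    if len(site.repeat_unit) != 1:
        return False  # keep only repeat units of length 1
    repeat_bp = len(site.repeat_unit) * site.repeat_count
    if not (min_bp <= repeat_bp <= max_bp):
        return False  # keep only 10 - 50 bp repeats
    anchors = (site.left_anchor + site.right_anchor).upper()
    return "N" not in anchors  # drop sites with Ns in either anchor


def postprocess_sites(sites, max_sites=1_000_000, seed=0):
    """Filter sites, then downsample to at most max_sites, keeping input order."""
    kept = [s for s in sites if keep_site(s)]
    if len(kept) > max_sites:
        rng = random.Random(seed)  # fixed seed so the selection is reproducible
        chosen = set(rng.sample(range(len(kept)), max_sites))
        kept = [s for i, s in enumerate(kept) if i in chosen]
    # The caller should still check that at least 2000 sites remain.
    return kept
```

If fewer than 2000 sites remain after filtering, the panel may be too small for the default site files, which is the case the section above describes.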

Normal references of microsatellite repeat distribution

Normal reference files can be generated by running collect-evidence mode on a panel of normal samples.

Please note:

  • The collect-evidence mode MUST be run in DRAGEN germline mode.

  • The --msi-microsatellites-file and --msi-coverage-threshold settings used in collect-evidence mode must be consistent with the settings used during tumor-only MSI calling.

  • At least 20 normal samples are required.

MSI Output

The MSI score (PercentageUnstableSites) is stored in <output prefix>.microsat_output.json.

"SumDistance" is the sum of the Jensen-Shannon distances between the tumor and normal distributions over all unstable sites. "SumDistance" depends on the size of the microsatellite file and is not normalized. In general, it is recommended to set MSI thresholds based on "PercentageUnstableSites" rather than "SumDistance".

For TSO500 Solid, a sample is considered microsatellite unstable when "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.
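Applying such a threshold to the JSON output might look like the following Python sketch. It assumes that "PercentageUnstableSites" is a top-level key holding a number or a numeric string. Only the metric names come from this page, so the exact JSON layout is an assumption, and the function names and the "stable"/"unstable" labels are hypothetical. The default of 20 is the TSO500 Solid threshold and may not suit other assays.

```python
import json


def load_report(path):
    """Read a <output prefix>.microsat_output.json file into a dict."""
    with open(path) as handle:
        return json.load(handle)


def msi_status(report, threshold=20.0):
    """Classify a sample from its PercentageUnstableSites value.

    Assumes the key sits at the top level of the report (not confirmed here).
    """
    percentage = float(report["PercentageUnstableSites"])
    return "unstable" if percentage >= threshold else "stable"
```

For another assay, pass an empirically calibrated value, for example msi_status(report, threshold=my_threshold).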

MSI Debugging Output

There are two other output files (*_diffs.txt and *.dist) that are useful for debugging.

Here is an example of *_diffs.txt file

The fourth column (Assessed) is the coverage filter. It is true for any site whose coverage meets the coverage threshold (for example, coverage >= 60).

The sixth column (PassFilter) is an internal flag used for the left allele filter. It removes low-quality sites that have no coverage and helps to increase prediction accuracy. It is true when the following conditions are met.

The *.dist file stores the read counts for each repeat length of the microsatellite site.

The coverage of a site can be obtained by summing all counts in the last column.

MSI Verbose Output

    1. <output prefix>.microsat_output.json (described above)

    2. <output prefix>.microsat_tumor.dist. This file contains the repeat length array for every microsatellite. Column length_dis is the repeat length array.

    3. <output prefix>.microsat_diffs.txt. This file contains the distance metrics for every microsatellite between tumor/normal or tumor/reference normals. Column Assessed indicates whether a site passes the coverage filter (msi-coverage-threshold). Column PassFilter is an internal metric and is currently not used for filtering microsatellites.

MSI Algorithm

The MSI algorithm performs the following steps:

  1. Tabulates tumor and normal counts from the read alignments for each microsatellite site.

  2. Calculates the Jensen-Shannon distance (JSD) between the tumor and normal distributions for each microsatellite site (tumor-normal mode), or the Jensen-Shannon distance between two normal baseline samples (tumor-only mode).

  3. Determines unstable sites by performing chi-square testing of the tumor and normal distributions. In tumor-normal mode, unstable sites have repeat length distributions that are significantly shifted between tumor and normal, as measured by the Jensen-Shannon distance. In tumor-only mode, the JSD is calculated for each pair of tumor and normal reference samples, as well as for each pair of normal-normal samples. The two sets of JSD values are then compared to derive a mean distance difference and a p-value from Student's t-test. A site is called unstable if the mean distance difference is greater than or equal to the distance threshold (default 0.1) and the p-value is less than or equal to the p-value threshold (default 0.01).

  4. Produces a report containing the assessed site count, the unstable site count, the percentage of unstable sites among all assessed sites, and the sum of the Jensen-Shannon distances of all unstable sites.
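As a rough illustration of steps 2 to 4 in tumor-normal mode, the Python sketch below computes a Jensen-Shannon distance between two repeat length count arrays and summarizes a set of sites. It is not DRAGEN's implementation. The function names, the base-2 logarithm, the rule that both samples must meet the coverage threshold, the result keys, and the use of the distance threshold alone (the chi-square test is omitted) are all assumptions made for this example.

```python
import math


def _normalize(counts):
    """Turn read counts per repeat length into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads span this site")
    return [c / total for c in counts]


def js_distance(tumor_counts, normal_counts):
    """Jensen-Shannon distance (square root of the JS divergence, log base 2).

    Both arrays must be indexed by the same repeat lengths. The result lies
    in [0, 1]: 0 for identical distributions, 1 for non-overlapping ones.
    """
    if len(tumor_counts) != len(normal_counts):
        raise ValueError("count arrays must cover the same repeat lengths")
    p = _normalize(tumor_counts)
    q = _normalize(normal_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


def summarize(sites, distance_threshold=0.1, coverage_threshold=60):
    """Summarize (tumor_counts, normal_counts) pairs, one pair per site.

    Simplification: a site counts as unstable when its distance reaches the
    threshold; no statistical test is applied here.
    """
    assessed = unstable = 0
    sum_distance = 0.0
    for tumor_counts, normal_counts in sites:
        # Assumed rule: both samples need enough spanning reads.
        if min(sum(tumor_counts), sum(normal_counts)) < coverage_threshold:
            continue
        assessed += 1
        distance = js_distance(tumor_counts, normal_counts)
        if distance >= distance_threshold:
            unstable += 1
            sum_distance += distance
    percentage = 100.0 * unstable / assessed if assessed else 0.0
    return {
        "assessed_sites": assessed,
        "unstable_sites": unstable,
        "percentage_unstable_sites": percentage,
        "sum_distance": sum_distance,
    }
```

In this sketch the summed distance grows with the number of unstable sites while the percentage does not, which mirrors why the normalized percentage is the recommended metric for thresholds.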
