Microsatellite Instability

Microsatellites are genomic regions of short DNA motifs that are repeated 5–50 times and are associated with high mutation rates. Microsatellite Instability (MSI) results from deficiencies in the DNA mismatch repair pathway and can be used as a critical biomarker to predict immunotherapy responses in multiple tumor types.

DRAGEN MSI supports tumor-normal and tumor-only modes. Tumor-normal mode is generally expected to generate more accurate results. Tumor-only mode requires a panel of normals, which is generated using the collect-evidence mode.

Command-Line Options

The following is an example command for tumor-normal mode. Default resource files are available for WES and WGS. Please note that the WES and WGS tumor-normal modes are fully supported and tested. Custom panels may require more extensive validation and possibly require generating a new sites file.

# --msi-microsatellites-file: see section "Default Microsatellite sites files"
dragen \
--msi-command tumor-normal \
--msi-coverage-threshold 60 \
--msi-microsatellites-file ${microsatellite_file} \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
--enable-map-align true \
--RGID=read_group_ID \
--RGSM=read_group_sample \
--ref-dir ${reference_directory} \
--enable-map-align-output true \
--enable-sort true \
--enable-duplicate-marking true \
--tumor-fastq1 ${tumor_fq1} \
--tumor-fastq2 ${tumor_fq2} \
--fastq-file1 ${fq1} \
--fastq-file2 ${fq2}

The following is an example command for the tumor-only mode. Please note that the WES and WGS tumor-only modes are not as extensively tested as the tumor-normal modes. The TSO500 panels do not have normal controls, and are only tested and validated in tumor-only mode.

# --msi-microsatellites-file: see section "Default Microsatellite sites files"
# --msi-ref-normal-dir: see section "Normal references of microsatellite repeat distribution"
dragen \
--msi-command tumor-only \
--msi-coverage-threshold 60 \
--msi-microsatellites-file ${microsatellite_file} \
--msi-ref-normal-dir ${normal_reference_directory} \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
--enable-map-align true \
--RGID=read_group_ID \
--RGSM=read_group_sample \
--ref-dir ${reference_directory} \
--enable-map-align-output true \
--enable-sort true \
--enable-duplicate-marking true \
--tumor-fastq1 ${tumor_fq1} \
--tumor-fastq2 ${tumor_fq2}

| Option | Description |
| --- | --- |
| msi-command | Mode of execution: tumor-only, tumor-normal, or collect-evidence. |
| msi-microsatellites-file | Specify the file containing the microsatellites. You can generate this file by scanning the genome for microsatellites with msisensor-pro (see Custom Microsatellite files). DRAGEN has been tested with ≥ 10 bp homopolymers for solid samples, and 6-7 bp homopolymers for liquid samples. |
| msi-ref-normal-dir | Full path of the directory containing the files with the normal reference repeat length distributions. Used only in tumor-only mode. These files can be generated by running collect-evidence on each normal sample. At least 20 normal samples are required. |
| msi-coverage-threshold | Specify the minimum spanning read coverage for a microsatellite. Microsatellites that do not meet the specified threshold are not included in the analysis. DRAGEN recommends a value of 60 for solid samples. For TSO500 liquid, a value of 500 is recommended. |
| msi-distance-threshold | Threshold for distance distributions to be considered different. Default is 0.1. For liquid samples, a value of 0.02 is recommended. |

Assay Specific Settings

For TSO500 Solid, a sample is considered microsatellite unstable when "PercentageUnstableSites >= 20". It is generally recommended to use "PercentageUnstableSites" as the metric for determining the MSI status. This metric is normalized and is expected to be more consistent across pipelines and across input site files. The exact thresholds for other assays may still depend on the sample noise characteristics (for example PCR or UMI protocols) and may need some empirical calibration.

| Sample Type | Assay | Microsatellite file | Specific Settings | PercentageUnstableSites Threshold |
| --- | --- | --- | --- | --- |
| Solid | TSO500 | Part of TSO500 resource bundle. Repeats 10 - 50. 130 sites. | msi-distance-threshold=0.1 | 20 |
| Heme | TSO500 | N/A | N/A | N/A |
| Liquid (cfDNA) | TSO500 | Part of TSO500 resource bundle. Repeats 6,7. 2344 sites. | msi-distance-threshold=0.02 | TBD |
| Solid, Heme | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.1 | TBD |
| Liquid (cfDNA) | WES | Available for download. Repeats 10 - 50. Approx. 3.5K sites. | msi-distance-threshold=0.02 | TBD |
| Solid, Heme | WGS | Available for download. Repeats 10 - 50. Approx. 1 mil sites. | msi-distance-threshold=0.1 | TBD |
| Liquid (cfDNA) | WGS | Available for download. Repeats 10 - 50. Approx. 1 mil sites. | msi-distance-threshold=0.02 | TBD |

Default Microsatellite sites files

The following is an example of a microsatellite file:

#chromosome     location        repeat_unit_length      repeat_unit_binary      repeat_times    left_flank_binary       right_flank_binary      repeat_unit_bases       left_flank_bases    right_flank_bases
chr1	985443	1	2	15	676	992	G	GGGCA	TTGAA
chr1	7980985	1	0	10	231	1020	A	ATGCT	TTTTA
chr1	8022800	1	3	19	13	41	T	AAATC	AAGGC
chr1	8029500	1	2	10	39	0	G	AAGCT	AAAAA
chr1	9146447	1	3	15	887	248	T	TCTCT	ATTGA
chr1	9767837	1	3	12	704	195	T	GTAAA	ATAAT

Default WES and WGS microsatellite site files can be downloaded from the DRAGEN Software Support Site page.

For panels, it is recommended to post-process the file by intersecting the WES or WGS sites with the target regions of the panel of interest. This avoids using off-target reads in the MSI analysis.
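The intersection step could be sketched as follows. This is a minimal illustration, not a DRAGEN tool: the function names `load_bed` and `on_target` are hypothetical, the panel is assumed to be a BED file with 0-based half-open intervals, and the site `location` column is assumed to be a 0-based start position. Verify the coordinate convention of your site file before relying on it.

```python
from bisect import bisect_right
from collections import defaultdict


def load_bed(lines):
    """Read BED lines and return merged, sorted intervals per chromosome."""
    by_chrom = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom[chrom].append((int(start), int(end)))
    merged = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def on_target(site_line, targets):
    """Return True if the whole repeat lies inside one target interval.

    Assumes tab-separated site columns in the order shown in the example
    file: chromosome, location, repeat_unit_length, ..., repeat_times, ...
    """
    fields = site_line.rstrip("\n").split("\t")
    chrom, start = fields[0], int(fields[1])
    end = start + int(fields[2]) * int(fields[4])
    intervals = targets.get(chrom, [])
    # Last interval whose start is <= the site start.
    i = bisect_right(intervals, (start, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= start and end <= intervals[i][1]
```

Requiring the full repeat to fall inside a target is a conservative choice; a looser overlap rule may be preferable when targets are padded.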

Custom Microsatellite files

Custom Microsatellite site files may be required if a small panel is targeted and the default site files do not have sufficient overlapping sites.

Custom microsatellite site files can be generated with msisensor-pro [https://github.com/xjtu-omics/msisensor-pro/wiki/Best-Practices]:

msisensor-pro scan -d /path/to/reference.fa -o ${microsatellite_file}

A subsequent post-processing step is recommended:

  • only keep microsatellite sites with a repeat unit of length 1

  • keep sites with 10 - 50bp repeats (a max length of 100bp repeats is supported)

  • remove any sites containing Ns in the left or right anchors

  • downsample the remaining sites to contain at least 2000 sites, but no more than 1 million sites (to avoid excessive run time)

Please note that an error occurs if long (>100bp) microsatellite sites are present in the file.
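The post-processing steps above could be sketched as follows. This is an illustrative example under stated assumptions, not an official DRAGEN utility: the function name `filter_sites` is hypothetical, the input is assumed to be tab-separated with the column order shown in the example microsatellite file, and random downsampling with a fixed seed is one possible way to cap the number of sites.

```python
import random


def filter_sites(lines, min_repeats=10, max_repeats=50,
                 max_sites=1_000_000, seed=0):
    """Filter scan output lines and return (header, kept_lines)."""
    header, kept = None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The header may or may not start with '#', depending on the source.
        if header is None and line.startswith(("#", "chromosome")):
            header = line
            continue
        fields = line.split("\t")
        unit_length, repeat_times = int(fields[2]), int(fields[4])
        left_flank, right_flank = fields[8].upper(), fields[9].upper()
        if unit_length != 1:
            continue  # keep homopolymers only
        if not min_repeats <= repeat_times <= max_repeats:
            continue  # with unit length 1, repeat_times equals length in bp
        if "N" in left_flank or "N" in right_flank:
            continue  # drop sites with ambiguous anchors
        kept.append(line)
    if len(kept) > max_sites:
        # Downsample while preserving the original file order.
        rng = random.Random(seed)
        chosen = sorted(rng.sample(range(len(kept)), max_sites))
        kept = [kept[i] for i in chosen]
    return header, kept
```

The sketch does not enforce the lower bound of 2000 sites; if fewer sites remain after filtering, check `len(kept)` and consider relaxing the panel intersection or the repeat range.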

Normal references of microsatellite repeat distribution

Normal reference files can be generated by running collect-evidence mode on a panel of normal samples.

dragen -f \
--msi-command collect-evidence \
--ref-dir ${reference_directory} \
--msi-microsatellites-file ${microsatellite_file} \
--msi-coverage-threshold 60 \
--output-directory ${output_directory} \
--output-file-prefix ${prefix} \
-1 ${normal_fq1} \
-2 ${normal_fq2}

Please note:

  • The collect-evidence mode MUST be run in DRAGEN germline mode.

  • The --msi-microsatellites-file and --msi-coverage-threshold settings used in collect-evidence mode must be consistent with the settings used during tumor-only MSI calling.

  • At least 20 normal samples are required.

MSI Output

The output containing the MSI score (PercentageUnstableSites) is stored in <output prefix>.microsat_output.json.

"TotalMicrosatelliteSitesAssessed": "20020",
"TotalMicrosatelliteSitesUnstable": "4374",
"PercentageUnstableSites": "21.850000000000001",
"ResultIsValid": "true",
"ResultMessage": "",
"SumDistance": "1214.174"

"SumDistance" is the sum of the Jensen-Shannon distances between the tumor and normal distributions over all unstable sites. "SumDistance" depends on the size of the microsatellite file and is not normalized. In general, it is recommended to set MSI thresholds based on "PercentageUnstableSites" rather than "SumDistance".

For TSO500 Solid, a sample is considered microsatellite unstable when "PercentageUnstableSites >= 20". The exact thresholds for other assays with different site files and noise characteristics may need some empirical calibration.
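Reading the report and applying a threshold could look like the sketch below. It is an illustration only: the function name `msi_status` and the returned labels are hypothetical, the fields are assumed to sit at the top level of a JSON object with string values as in the excerpt above, and the default threshold of 20 is the TSO500 Solid value, which may not suit other assays. The lookup accepts PercentageUnstableSites and also the PecentageUnstableSites variant, since the exact key spelling should be checked against real output.

```python
import json


def msi_status(json_path, threshold=20.0):
    """Return (label, percentage) from a microsat_output.json report."""
    with open(json_path) as handle:
        report = json.load(handle)
    # Do not interpret the score when DRAGEN flags the result as invalid.
    if str(report.get("ResultIsValid", "")).lower() != "true":
        return "invalid", None
    raw = report.get("PercentageUnstableSites",
                     report.get("PecentageUnstableSites"))
    if raw is None:
        raise KeyError("percentage of unstable sites not found in report")
    percentage = float(raw)  # values are stored as strings in the excerpt
    label = "unstable" if percentage >= threshold else "stable"
    return label, percentage
```

As a consistency check, the percentage in the excerpt matches the two site counts: 4374 / 20020 is approximately 21.85%.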

MSI Debugging Output

There are two other output files (*_diffs.txt and *.dist) that are useful for debugging.

The following is an example of a *_diffs.txt file:

#Chromosome	Start	RepeatUnit	Assessed	Distance	PValue	PassFilter
chr1	69106	T	true	0.04105300052	0.4786448589	true
chr1	69116	TC	false	0	0	false

The fourth column (Assessed) is the coverage filter. It is true for any site with coverage >= msi-coverage-threshold (60 in the example commands).

The seventh column (PassFilter) is an internal flag used for the left allele filter. It removes low-quality sites that have no coverage and helps to increase prediction accuracy. It is true when the following conditions are met:

1. read count for [reference repeat length] > 0
2. read count for [reference repeat length - 1] >= 0
3. the ratio of read count for [reference repeat length - 1] and [reference repeat length] >= 0

The *.dist file stores the read counts for each repeat length of the microsatellite site:

#chromosome	location	repeat_unit_bases	reference_allele	covered	length_distribution
chr1	69106	T	5	true	0,0,0,0,103,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

The coverage of the site can be obtained by summing all counts in the last column.
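Summing the counts could be done as in the sketch below. The function name `parse_dist_line` is hypothetical. Based on the example line (reference_allele 5, with all 103 reads in the fifth entry), the sketch assumes that entry i of length_distribution, counted from 1, holds the read count for repeat length i; treat that indexing as an inference from the example rather than a documented guarantee.

```python
def parse_dist_line(line):
    """Parse one data line of a *.dist file into a dictionary."""
    chrom, location, unit, ref_allele, covered, dist = line.split()
    # Ignore empty entries so a trailing comma does not break parsing.
    counts = [int(x) for x in dist.split(",") if x]
    ref_length = int(ref_allele)
    return {
        "chromosome": chrom,
        "location": int(location),
        "repeat_unit_bases": unit,
        "reference_allele": ref_length,
        "covered": covered == "true",
        "coverage": sum(counts),
        # Assumed 1-based indexing by repeat length (see lead-in).
        "reference_count": counts[ref_length - 1] if ref_length <= len(counts) else 0,
    }


def read_dist(lines):
    """Parse all data lines, skipping the '#' header and blank lines."""
    return [parse_dist_line(l) for l in lines if l.strip() and not l.startswith("#")]
```

Comparing `coverage` with the value passed to --msi-coverage-threshold is a quick way to see why a site was or was not assessed.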

MSI Verbose Output

    1. <output prefix>.microsat_output.json (described above)

    2. <output prefix>.microsat_tumor.dist. This file contains the repeat length array for every microsatellite.

#chromosome     location        repeat_unit_bases       reference_allele        covered length_dis
chr1    16200729        T       10      true    0,0,0,0,0,0,0,0,0,118,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
chr1    40361307        A       10      true    0,0,0,0,0,0,2,0,2,95,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Column length_dis is the repeat length array.

    3. <output prefix>.microsat_diffs.txt. This file contains the distance metrics for every microsatellite between tumor/normal or tumor/reference normals.

#Chromosome     Start   RepeatUnit      Assessed        Distance        PValue  PassFilter
chr1    16200729        T       true    0.03348841224   0.002411104562  true
chr1    40361307        A       true    0.0406985608    0.0006306633961 true
chr1    156842471       T       false   0       0       true
chr1    239881908       T       true    0.003136536956  0.5983661726    true

Column Assessed indicates if a site passes the coverage filter (msi-coverage-threshold). Column PassFilter is an internal metric and currently is not used for filtering microsatellites.
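A per-site summary of a diffs file could be sketched as follows. The function name `summarize_diffs` is hypothetical, and the per-site rule is an assumption: it applies the default thresholds from the MSI Algorithm section (distance >= 0.1 and p-value <= 0.01) to the Distance and PValue columns of assessed sites. DRAGEN may apply additional internal logic, so the result is not guaranteed to reproduce the reported counts.

```python
def summarize_diffs(lines, distance_threshold=0.1, p_value_threshold=0.01):
    """Return (assessed, unstable, percentage_unstable) for a diffs file."""
    assessed = unstable = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Columns: Chromosome Start RepeatUnit Assessed Distance PValue PassFilter
        _chrom, _start, _unit, is_assessed, distance, p_value, _flag = line.split()
        if is_assessed != "true":
            continue  # site failed the coverage filter
        assessed += 1
        if (float(distance) >= distance_threshold
                and float(p_value) <= p_value_threshold):
            unstable += 1
    percentage = 100.0 * unstable / assessed if assessed else 0.0
    return assessed, unstable, percentage
```

With the four example rows above, three sites are assessed and none reaches the default distance threshold of 0.1.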

MSI Algorithm

The MSI algorithm performs the following steps:

  1. Tabulates tumor and normal counts from the read alignments for each microsatellite site.

  2. Calculates Jensen-Shannon distance of tumor and normal distribution for each microsatellite site (tumor-normal mode), or Jensen-Shannon distance of two normal baseline samples (tumor-only mode).

  3. Determines unstable sites by performing chi-square testing of the tumor and normal distributions. Unstable sites have repeat length distributions that are significantly shifted between tumor and normal, as measured by the Jensen-Shannon distance (tumor-normal mode). In tumor-only mode, the Jensen-Shannon distance (JSD) is calculated for each pair of tumor and normal reference samples, as well as for each pair of normal-normal samples. The two sets of JSDs are then compared to derive a mean distance difference and a p-value calculated with Student's t-test. Microsatellite instability is called if the mean distance difference is greater than or equal to the distance threshold (default 0.1) and the p-value is less than or equal to the p-value threshold (default 0.01).

  4. Produces a report given assessed site count, unstable site count, the percentage of unstable sites in all assessed sites and the sum of the Jensen-Shannon distance of all the unstable sites.
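To make the distance in step 2 concrete, the sketch below computes a Jensen-Shannon distance between two repeat length count arrays. It follows the textbook definition (square root of the Jensen-Shannon divergence, with base-2 logarithms so the result lies between 0 and 1). The log base and any smoothing used by DRAGEN are not stated in this document, so the values may differ from those in the diffs file; the function name is hypothetical.

```python
import math


def jensen_shannon_distance(counts_a, counts_b):
    """Jensen-Shannon distance between two read-count arrays."""
    if len(counts_a) != len(counts_b):
        raise ValueError("count arrays must have the same length")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("both arrays need at least one read")
    # Normalize counts to probability distributions.
    p = [c / total_a for c in counts_a]
    q = [c / total_b for c in counts_b]
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(dist, mix):
        # Kullback-Leibler divergence; zero-probability terms contribute 0.
        return sum(d * math.log2(d / w) for d, w in zip(dist, mix) if d > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))


# Hypothetical counts: a tumor with reads shifted to shorter repeat lengths.
normal = [0, 0, 2, 95, 3]
tumor = [5, 20, 30, 40, 5]
distance = jensen_shannon_distance(tumor, normal)
```

Identical distributions give a distance of 0, and distributions with no shared repeat lengths give 1, which makes a fixed cutoff such as the default 0.1 easier to interpret.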
