Mutational Signatures

Each tumor genome accumulates somatic mutations through one or more mutagenic processes, such as UV radiation, tobacco exposure, APOBEC enzyme activity, or defective DNA repair pathways. Each process leaves a characteristic pattern, or mutational signature, imprinted on the distribution of mutation types across a sample. DRAGEN Mutational Signatures Analysis identifies which of the known COSMIC v3.5 mutational signatures are active in a tumor sample, and quantifies how much each contributes to the observed somatic mutation burden. The analysis is supported for WGS workflows only. Because it auto-enables, with no additional options required, for any tumor-normal run, it will also run on WES and panel inputs. However, the results obtained from WES or panel inputs are likely not reliable (see Limitations below for details), and their use is not recommended.

Terminology in this section follows COSMIC usage, particularly for the meaning of class and type, but deviates as described in the table below.

Term
Meaning in this section

Mutation class

Refers to the topmost level classification of mutations (e.g., SBS, DBS, ID are classes)

Mutation type

Within a class, mutations which are unambiguously distinguished are referred to as mutation types

Mutational profile

Observed counts of each distinguished mutation type within a class, in a defined order (a vector)

Mutation spectrum

This term is equivalent to COSMIC's use of the term Mutational Profile, and will be the preferred term for this document

Signature Attribution[4]

COSMIC refers to this as "activities", and older terminology uses "exposures". It describes how much of the observed mutation spectrum or profile for a tumor is attributed to each mutational signature in a class.

Three classes of somatic mutations are analyzed independently:

  • Single Base Substitutions (SBS): the 96-mutation-type trinucleotide context spectrum

  • Double Base Substitutions (DBS): the 78-mutation-type adjacent dinucleotide spectrum

  • Small Insertions and Deletions (ID): the 83-mutation-type indel classification spectrum

Results are reported as three pairs of TSV files: the observed mutation count spectrum <prefix>.ms_obs.{sbs,dbs,id}.tsv and the corresponding signature attributions <prefix>.ms_atr.{sbs,dbs,id}.tsv.

Notable Signatures

The table below highlights a subset of COSMIC v3.5 signatures that have well-established aetiologies or clear clinical associations, and are therefore of particular interpretive interest. This is not a comprehensive list; it reflects signatures that are either highly specific for a known phenotype, strongly associated with a clinically actionable biology, or likely to be encountered in routine oncology sequencing. Each entry links to the COSMIC signature page for the primary reference.

Where multiple signatures share the same aetiology (e.g., the four UV signatures SBS7a–d, or the seven MMR-deficiency SBS signatures), one representative is listed and the others are noted.

Signature(s)
Proposed Aetiology
Notes

HR deficiency, often BRCA1/2

Deletions at microhomology-flanked sites arising from NHEJ repair of DSBs in the absence of HR. High sensitivity and specificity for HRD phenotype; one of the most diagnostically useful signatures in this regard. Co-occurs with SBS3 and DBS13.

Homologous recombination (HR) deficiency

C>T and C>G substitutions concentrated at the T[C>x]T trinucleotide context. Strongly associated with BRCA1/2 loss-of-function. Co-occurs with ID6 and DBS13.

MMR deficiency / MSI

One of seven SBS signatures associated with defective DNA mismatch repair (others: SBS14, 15, 20, 21, 26, 44). Elevated in microsatellite-unstable tumors; commonly co-occurs with ID1 and ID2.

POLE exonuclease domain mutation

Ultramutator phenotype (>100 mut/Mb). Somatic mutations in the POLE proofreading domain generate these very high-TMB signatures. Co-occurs with DBS3 and SBS28.

Tobacco smoking

C>A transversions in specific trinucleotide contexts, consistent with benzo[a]pyrene adducts. Validated experimentally. Dominant in lung and head & neck cancers of smokers. Co-occurs with ID3 and DBS2.

UV light exposure

C>T transitions at dipyrimidines and CC>TT double substitutions, characteristic of cyclobutane pyrimidine dimers and 6-4 photoproducts. Highly specific for UV-exposed skin cancers. Co-occurs with DBS1 and ID13.

APOBEC cytidine deaminase activity

C>T (SBS2) and C>G (SBS13) mutations in TCA/TCT contexts driven by APOBEC3A/3B activity. Almost always co-occur. Associated with viral infection, retrotransposon activity, and clustered hypermutation (kataegis).

Colibactin exposure (E. coli pks+)

T>N mutations with preference for adenines at −3/−4 positions. Experimentally validated. Predominant in colorectal cancers; most active early in life. Associated with pks-island-carrying E. coli colonization.

Replication slippage; markedly elevated in MMR deficiency

Single-base insertions (ID1) and deletions (ID2) at T-homopolymer runs. Ubiquitous at low levels in normal aging; dramatically elevated in MSI tumors alongside the MMR-associated SBS signatures.

DSB repair by NHEJ (proposed)

Deletions >1 bp, primarily lacking microhomology, consistent with NHEJ-mediated end-joining. Aetiology partially unclear; age-correlated clock-like component also present. A minority of cases are linked to somatic TOP2A mutations.

UV light exposure

Predominantly CC>TT doublets, a hallmark of UV-induced tandem pyrimidine lesions.

HR deficiency

Predominantly TC>NN doublet base substitutions. Independently corroborates an HRD finding when co-observed with SBS3 and ID6.

Supported Workflows

Only WGS sample types and workflows are supported. Low tumor sequencing depth limits the somatic variant caller and therefore the input to this analysis. Mutational signatures analysis can run on WES and panel inputs, but this is not recommended; see the Limitations section for more detail. The auto-enabling logic does not currently check whether the input is WGS, so the analysis will auto-enable on WES and panel workflows under the same conditions. Output files will be produced, but the results are not considered reliable; use --enable-mutational-signatures false to suppress the analysis explicitly if needed.

Workflow
Requirements
Auto-enables?

Tumor-Normal (T/N)

Small variant caller enabled in a somatic workflow with a matched normal input

Yes

Tumor-Only (T/O)

Small variant caller enabled plus --vc-enable-germline-tagging true[3] and --tmb-enable-proxi-filter true

Yes, when both prerequisite options are set

Standalone / post-hoc

--enable-mutational-signatures true and --mutsig-input-snv-vcf vcf-file pointing to an existing hard-filtered somatic VCF

N/A (always explicit)

A human reference genome is required. Non-human references are not supported.

Limitations

Aspect
Explanation

WES or Panels

Although possible, this usage cannot be recommended at this time. The major limitation is the reduced opportunity to observe somatic variants. A typical whole exome covers about 1-2% of the genome, and panels much less. The somatic mutation spectrum may simply not have enough counts for a signature to be clearly recognizable. See also the relative power limitation below.

Tumor-Only vs Tumor-Normal

Tumor-only (T/O) mode is generally less accurate than tumor-normal (T/N) analysis. Without a matched normal, germline variants must be removed using database filtering and allele frequency-based heuristics (--vc-enable-germline-tagging and --tmb-enable-proxi-filter). Residual germline contamination in the somatic variant set can distort the mutation spectrum and lead to incorrect or inflated signature attributions. Conversely, at high tumor purities, the proxi filter can exclude true somatic variants: it suspects they are germline because their allele frequency is similar to that of a nearby variant known to be segregating in the population. This reduces the sensitivity of the analysis. The two problems are not mutually exclusive; a tumor-only analysis can suffer from both.

Mutation Spectrum Information Content

The power to detect mutational signatures present in the sample depends strongly on the total somatic mutation count, and the extent to which truly active signatures may obscure each other. All processes, including those not described by COSMIC v3.5, are superimposed in the observed spectrum.

Relative Power to detect each signature

Distinctive signatures are more easily recognized than relatively flat ones, whose counts are distributed across most or all mutation types. The relative power to detect a signature also depends on how distinct it is from all the other signatures considered. Characterizing this further is beyond the scope of this document.

Signatures not in COSMIC v3.5

Only signatures defined in COSMIC v3.5 can be detected. Any novel, rare, or tissue-specific mutational process not represented in the reference panel will not be identified; its mutations will instead be attributed to the nearest matching known signature(s).

Dependence on variant calling accuracy

Results are entirely dependent on the accuracy of small variant calling and the accuracy of distinguishing somatic variants from germline. Errors, miscalibration, or systematic biases in variant calling propagate directly into the mutation spectrum and will distort signature attributions.

Low tumor mutation burden

Samples with low TMB may not have enough somatic variants for reliable signature attribution. The mutation spectrum at low counts is dominated by Poisson noise, making it difficult to distinguish genuine signature activity from sampling variation regardless of workflow or germline filtering quality.

Mutation Classes

DRAGEN analyzes three classes of somatic mutations, each matched against the corresponding COSMIC v3.5[1] reference signature panel.

Class
Abbreviation
Number of Distinguished Mutation Types
Description

Single Base Substitution

SBS

96

Single nucleotide variants classified by the substitution type and the immediately flanking 5′ and 3′ bases (trinucleotide context). By convention, the strand is standardized such that the changed reference base is a pyrimidine. Thus, all substitutions are expressed as C>N or T>N changes (six substitution types, each in 16 flanking contexts, giving 96 mutation types) and are independent of strand as well as position.

Double Base Substitution

DBS

78

Two adjacent single nucleotide substitutions occurring simultaneously, classified by the dinucleotide reference and alternate alleles.

Small Insertion and Deletion

ID

83

Insertions and deletions classified by the indel length, whether the affected sequence is a single-base or multi-base repeat, and the number of repeat units present in the surrounding reference sequence. Multi-base deletions adjacent to a short stretch of identical sequence are classified as microhomology-mediated deletions.
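
To illustrate the SBS strand convention described in the table above, the following is a minimal Python sketch that maps a substitution and its flanking bases to a COSMIC-style label such as A[C>A]A. The function name and its arguments are hypothetical and are not part of DRAGEN; the sketch only shows how a purine reference base is reverse-complemented so that every label is a C>N or T>N change.

```python
# Hypothetical helper, not part of DRAGEN: illustrates the pyrimidine
# strand convention for the 96 SBS mutation types.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sbs_mutation_type(five_prime, ref, alt, three_prime):
    """Return a label such as 'A[C>A]A', or None if the input is ambiguous."""
    bases = (five_prime, ref, alt, three_prime)
    if any(b not in COMPLEMENT for b in bases) or ref == alt:
        return None  # N in the context, or not a substitution
    if ref in ("A", "G"):
        # Purine reference: reverse-complement the trinucleotide so the
        # changed base is reported as a pyrimidine (C or T).
        five_prime, three_prime = COMPLEMENT[three_prime], COMPLEMENT[five_prime]
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{five_prime}[{ref}>{alt}]{three_prime}"
```

For example, a G>T change in the context T-G-A is reported on the opposite strand as T[C>A]A, so both strands of the same event collapse to one mutation type.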

Signature Decomposition Method

For each mutation class, DRAGEN tallies the observed somatic mutations into their respective mutation types, to produce the somatic mutation spectrum vector. COSMIC v3.5 signature profiles are then used as a reference matrix, and Non-Negative Least Squares (NNLS) decomposition finds the combination of known signatures that best reconstructs the observed spectrum.[2] Each attributed signature is tested for significance using a parametric bootstrap procedure: synthetic datasets are generated under the null hypothesis that the signature contributes nothing, and a Monte-Carlo estimated p-value is reported as the proportion of simulated datasets that would produce an attribution at least as large as what was observed. Smaller p-values indicate greater confidence that the attribution is genuine rather than a chance artifact of the sample's mutation count and spectrum shape.

The overall quality of the decomposition is also reported in the run log as root mean squared error (RMSE) and an R-squared value for each mutation class.
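
The decomposition step can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not DRAGEN's implementation: it solves the non-negative least squares problem with a simple coordinate-descent loop (a stand-in chosen to keep the example dependency-free), and it assumes that R-squared is the standard coefficient of determination. The function names are hypothetical.

```python
import math


def nnls_coordinate_descent(signatures, observed, sweeps=500):
    """Minimise ||S x - observed||^2 subject to x >= 0.

    signatures: list of reference signatures, each a list with one
                probability per mutation type.
    observed:   observed mutation counts, one per mutation type.
    """
    x = [0.0] * len(signatures)
    residual = list(observed)  # observed - S x, starting from x = 0
    for _ in range(sweeps):
        for j, col in enumerate(signatures):
            norm = sum(c * c for c in col)
            if norm == 0.0:
                continue
            step = sum(c * r for c, r in zip(col, residual)) / norm
            new_xj = max(0.0, x[j] + step)  # clamp to keep x non-negative
            delta = new_xj - x[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                x[j] = new_xj
    return x


def fit_quality(signatures, attributions, observed):
    """Return (RMSE, R-squared) of the reconstructed spectrum."""
    recon = [
        sum(a * col[i] for a, col in zip(attributions, signatures))
        for i in range(len(observed))
    ]
    sse = sum((o - r) ** 2 for o, r in zip(observed, recon))
    mean = sum(observed) / len(observed)
    sst = sum((o - mean) ** 2 for o in observed)
    rmse = math.sqrt(sse / len(observed))
    r_squared = 1.0 - sse / sst if sst else float("nan")
    return rmse, r_squared
```

With real data, `signatures` would hold the COSMIC v3.5 profiles for one mutation class and `observed` the spectrum vector for that class; a production solver would normally use a dedicated NNLS routine rather than this loop.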

Variant Filters

The following variants are excluded from mutational signatures analysis:

  • Variants with more than one alternate allele

  • Non-PASS variants, except those where the only filter flag set is mnv_component

  • Mitochondrial variants

  • Alt haplotype contigs; only the main autosomal contigs and the X and Y chromosomes are included

  • Context ambiguity: either the 5′ or 3′ flanking base in the reference is N

  • The MNV VCF record itself, where both ref and alt alleles are longer than one base — however, the component single-base substitutions are used (see MNV note below)

  • Tumor-Only workflows: variants not tagged as Somatic by germline tagging are excluded

  • Tumor-Normal workflows with germline variant reporting enabled: only variants tagged with the INFO/SOMATIC flag are counted
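
The workflow-independent exclusion rules above can be summarized as a single predicate. This is a hedged sketch only: the function name, the argument layout, and the "chr"-prefixed contig names are assumptions made for illustration, and the workflow-specific somatic tagging rules (the last two bullets) are deliberately left out.

```python
# Assumed contig naming (chr1..chr22, chrX, chrY); adjust for references
# that omit the "chr" prefix.
MAIN_CONTIGS = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}


def passes_signature_filters(chrom, ref, alts, filters, flank5, flank3):
    """Return True if a VCF record would be counted under the rules above.

    alts and filters are lists of strings; flank5 and flank3 are the
    reference bases immediately 5' and 3' of the variant.
    """
    if len(alts) != 1:
        return False  # more than one alternate allele
    if set(filters) not in ({"PASS"}, {"mnv_component"}):
        return False  # non-PASS, unless mnv_component is the only flag
    if chrom not in MAIN_CONTIGS:
        return False  # mitochondrial, alt haplotype, and other contigs
    if len(ref) > 1 and len(alts[0]) > 1:
        return False  # combined MNV record; its components are used instead
    if "N" in (flank5, flank3):
        return False  # ambiguous flanking context
    return True
```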

MNV handling: When two single-base substitutions occur at immediately adjacent positions, each is counted individually as an SBS mutation type, and the pair is additionally counted as a DBS mutation type. For an MNV composed of N consecutive substitutions, N SBS classifications and N−1 DBS classifications are made. The combined MNV VCF record is skipped; only the component records are used.
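
The N and N−1 counting rule can be checked with a short Python sketch. The helper below is hypothetical and assumes that its input is the sorted list of positions of the component single-base substitutions on one contig; every pair of immediately adjacent positions contributes one DBS classification.

```python
def mnv_classification_counts(positions):
    """Return (sbs_count, dbs_count) for sorted substitution positions.

    Each substitution is one SBS classification; each pair of immediately
    adjacent positions is additionally one DBS classification.
    """
    sbs_count = len(positions)
    dbs_count = sum(1 for a, b in zip(positions, positions[1:]) if b - a == 1)
    return sbs_count, dbs_count
```

A run of three consecutive substitutions, for example at positions 100, 101, and 102, therefore yields three SBS classifications and two DBS classifications.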

Diagnosing unexpectedly low mutation counts:

  • Tumor-Normal: If the total mutation count in any class seems lower than expected, the most common cause is that variants have a filter field value other than PASS or mnv_component. Small variant caller settings outside the scope of this section may need adjustment for the sample type or use case.

  • Tumor-Only: The same applies, but an additional frequent cause is the TMB proxi filter aggressively removing true somatic variants. This is most pronounced at high tumor purities, where somatic variants can have VAFs similar to nearby germline variants and be incorrectly flagged. See TMB: Support for germline variants for details on how the proxi filter works and its known trade-offs.

Auto-Enabling Behavior

DRAGEN will automatically enable or suppress mutational signatures analysis based on the configured workflow. The module can always be explicitly overridden using --enable-mutational-signatures true or --enable-mutational-signatures false.

  • Tumor-Normal mode: Automatically enabled when the small variant caller is active in a somatic T/N workflow. No additional options are required.

  • Tumor-Only mode: Automatically enabled only when both --vc-enable-germline-tagging true and --tmb-enable-proxi-filter true are set. If either option is absent, a warning is written to the run log and mutational signatures analysis is skipped. The warning is intentionally not displayed on the console, as it would appear for every tumor-only run regardless of whether mutational signatures are relevant.

  • GVCF-only output: Analysis is not auto-enabled when only GVCF output is requested. To run mutational signatures alongside GVCF output, add --vc-enable-vcf-output true. Both GVCF and VCF outputs can be enabled simultaneously.

  • Methylation conversion: Analysis is not auto-enabled when any --methylation-conversion option is in use. It can be explicitly enabled with --enable-mutational-signatures true if desired.

  • Tumor-Normal with germline variant reporting: When --vc-report-germline-variants is set in a T/N workflow, only variants tagged with the INFO/SOMATIC flag are considered for the mutation spectrum.
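
The rules above can be read as a small decision function. The Python sketch below is an interpretation of this section, not DRAGEN source code: the function and parameter names are hypothetical, the small variant caller is assumed to be active, and the order in which the rules are checked is an assumption.

```python
def mutsig_auto_enabled(
    workflow,                     # "tumor-normal" or "tumor-only"
    explicit=None,                # value of --enable-mutational-signatures, if given
    germline_tagging=False,       # --vc-enable-germline-tagging
    proxi_filter=False,           # --tmb-enable-proxi-filter
    vcf_output=True,              # False models a GVCF-only run
    methylation_conversion=False, # any --methylation-conversion option in use
):
    """Return True if mutational signatures analysis would run."""
    if explicit is not None:
        return explicit  # an explicit setting always overrides auto-enabling
    if not vcf_output or methylation_conversion:
        return False     # never auto-enabled in these configurations
    if workflow == "tumor-normal":
        return True
    if workflow == "tumor-only":
        return germline_tagging and proxi_filter
    return False
```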

Command-Line Options

Enabling

Option
Description

--enable-mutational-signatures

Set to true to explicitly enable mutational signatures analysis. Set to false to explicitly disable it. Auto-enabled in all tumor-normal workflows and conditionally in tumor-only workflows; see Auto-Enabling Behavior.

Tumor-Only prerequisites

These options must both be set for mutational signatures to run in tumor-only mode. Additionally, --vc-enable-germline-tagging true requires at least two more options as described below.

Option
Description

--vc-enable-germline-tagging true

Enables germline variant tagging using population databases. Required in tumor-only mode to exclude germline variants from the mutation spectrum. Requires --enable-variant-annotation true, --variant-annotation-assembly, and --variant-annotation-data for Nirvana annotations.

--tmb-enable-proxi-filter true

Enables the allele frequency proximity filter to remove additional germline variants not captured by the database filter. Required in tumor-only mode. See TMB Proxi Filter.

--variant-annotation-data <path/to/that>

The Nirvana data download directory, see variant annotation documentation for more details.

Standalone input

Option
Description

--mutsig-input-snv-vcf

Path to an existing somatic hard-filtered SNV VCF file (*.hard-filtered.vcf.gz). When provided, the module runs in standalone mode using this file rather than the variant caller output from the current run. A reference directory (-r) is still required.

Example Commands

Tumor-Normal (integrated with variant calling)

Mutational signatures analysis is automatically enabled in T/N somatic mode when the variant caller is active. No additional options are needed beyond a standard somatic T/N variant calling command.

Tumor-Only (integrated with variant calling)

Both germline tagging and the TMB proxi filter must be enabled for mutational signatures to run in tumor-only mode. These, and why they are needed, are documented in more detail in TMB - Support for germline variants.

Standalone (post-hoc analysis of an existing VCF)

Mutational signatures can be run on its own against a previously generated hard-filtered somatic VCF without re-running variant calling.
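
As a hedged illustration, the Python sketch below assembles an argument list for a standalone run from the options documented in this section (-r, --enable-mutational-signatures, --mutsig-input-snv-vcf). The helper name and all paths are hypothetical, and the inclusion of the general DRAGEN output options --output-directory and --output-file-prefix is an assumption about what a complete command needs.

```python
import shlex


def standalone_mutsig_command(reference_dir, vcf_path, output_dir, prefix):
    """Build the argument list for a standalone mutational signatures run."""
    return [
        "dragen",
        "-r", reference_dir,
        "--enable-mutational-signatures", "true",
        "--mutsig-input-snv-vcf", vcf_path,
        # Assumed general output options, not specific to this module:
        "--output-directory", output_dir,
        "--output-file-prefix", prefix,
    ]


def as_shell_string(args):
    """Render the argument list as a single shell-quoted command line."""
    return shlex.join(args)
```

Building the command as a list keeps each option paired with its value and avoids quoting mistakes when paths contain spaces.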

Output Files

DRAGEN produces six output files per run: an observed mutation spectrum and a signature attribution table for each of the three mutation classes.

File
Description

<prefix>.ms_obs.sbs.tsv

Observed SBS mutation counts

<prefix>.ms_obs.dbs.tsv

Observed DBS mutation counts

<prefix>.ms_obs.id.tsv

Observed ID mutation counts

<prefix>.ms_atr.sbs.tsv

COSMIC SBS signature attributions

<prefix>.ms_atr.dbs.tsv

COSMIC DBS signature attributions

<prefix>.ms_atr.id.tsv

COSMIC ID signature attributions

Observed mutation spectrum files (ms_obs.{sbs,dbs,id}.tsv)

Each file contains one row per mutation type, in the same order used by COSMIC.

Column
Description

Mutation Type

Mutation type label (e.g., A[C>A]A for SBS, AC>CA for DBS, 1:Del:C:0 for ID)

Count

Raw count of observed somatic mutations in this mutation type

Proportion

Fraction of total observed mutations in this mutation type; values sum to 1.0 across the file

Signature attribution files (ms_atr.{sbs,dbs,id}.tsv)

Each file contains one row per COSMIC signature in the reference panel.

Column
Description

Signature

COSMIC signature name (e.g., SBS1, DBS2, ID3)

Attributed Mutations

Estimated number of mutations in the sample attributable to this signature

Proportion

Fraction of total attributed mutations from this signature

Significance

P-value from parametric bootstrap significance testing. Small values (e.g., < 0.001) indicate the attribution is unlikely to have arisen by chance. Signatures which have a zero attribution are assigned p = 1.0 without simulation. There is no p-value adjustment performed to account for multiple testing.

Cosine Loss

Reduction in cosine similarity between the observed spectrum and the reconstructed spectrum when this signature is removed. Larger values indicate the signature is more important to the quality of the overall fit.

Interpreting results: A signature is most confidently considered active when it has both a small Significance value (p-value), and a large enough number of Attributed Mutations. What should be considered large enough depends on the use case, workflow, tumor type, and tumor age or tumor mutational burden. No general guidance can be given here. Cosine Loss is provided because it is often of interest. Signatures with attributions > 0 but p = 1.0 should be treated as not significant.
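
The interpretation rule above can be applied to an attribution file with a few lines of Python. This sketch assumes the TSV has a header row whose column names match the table above exactly; the function name and both thresholds are illustrative placeholders, since suitable cutoffs depend on the use case.

```python
import csv
import io


def significant_signatures(tsv_text, max_p=0.001, min_mutations=50.0):
    """Return (signature, attributed_mutations, p_value) tuples that pass
    both thresholds, sorted by attributed mutations in descending order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        p_value = float(row["Significance"])
        attributed = float(row["Attributed Mutations"])
        if p_value <= max_p and attributed >= min_mutations:
            hits.append((row["Signature"], attributed, p_value))
    return sorted(hits, key=lambda hit: -hit[1])
```

The text of a file such as <prefix>.ms_atr.sbs.tsv can be read with open(...).read() and passed in as `tsv_text`.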

Parametric Bootstrap with Monte-Carlo p-value estimation: Attribution Significance Test

After DRAGEN finds the best-fit signature attributions using NNLS, it tests each attribution for statistical significance using a parametric bootstrap procedure. The null hypothesis is that the signature under test contributes nothing — that the observed mutations attributed to it could instead be explained by some combination of the remaining signatures. To evaluate this, the analysis fits the best possible model without the signature under test, using all other signatures. Synthetic null datasets are then sampled from that reduced model's fitted spectrum. Both the sampling noise and the overlap of other signatures can result in observation of mutation types associated with the signature under test and a false non-zero attribution when the full model is fit to the synthetic data. This is repeated as many times as needed to estimate a p-value, up to a limit of 100,000.

For each signature with a non-negligible attribution, the test proceeds as follows:

  1. Fit a reduced model. The signature under test is removed from the reference matrix and NNLS is solved again using only the remaining signatures. The resulting attributions under the reduced model are used to reconstruct a mutation spectrum without that signature present. This is termed the null fitted spectrum below.

  2. Simulate null datasets. Random synthetic mutation spectra are drawn from the null fitted spectrum, representing what the data might look like under that null hypothesis with only Poisson counting noise. Two sampling strategies are used depending on total mutation count:

  • At lower mutation counts (below 50,000), a Poisson-multinomial sampler is used: a total count is drawn from a Poisson distribution, then distributed across mutation types by multinomial sampling. This accurately models both the randomness in how many mutations occur and how they fall across mutation types.

  • At higher mutation counts, each mutation type is sampled independently from its own Poisson distribution. This is computationally faster and statistically equivalent at large counts where the two approaches converge.

  3. Count exceedances. Each synthetic null spectrum is decomposed using NNLS against the full signature matrix, which includes the signature under test. A "success" is recorded whenever the synthetic attribution for the tested signature meets or exceeds the originally observed attribution. This can only be due to noise, cross-attribution from other similar signatures, or both.

  4. Adaptive early stopping. Rather than always running the maximum number of simulations (100,000), the loop terminates as soon as 10 successes accumulate. Signatures that are clearly not significant reach 10 successes quickly and are resolved with minimal computation; genuinely significant signatures run longer.

  5. Compute the p-value. The p-value estimator depends on which condition ended the loop. In both cases, k is the number of successes and n is the number of simulations run:

  • If 10 successes were reached: p = k / n, the maximum likelihood estimate for the negative binomial distribution.

  • If the simulation ran to completion: p = (k + 0.5) / (n + 1), a Jeffreys prior Bayesian estimate for the binomial proportion. This avoids returning p = 0 when no null simulation ever matched the observed attribution — a finite simulation cannot rule out an event with absolute certainty, and this estimator reflects that residual uncertainty.
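
The exceedance counting, early stopping, and the two estimators can be sketched together in Python. The function below is an illustration, not DRAGEN's implementation: `null_exceeds` is a hypothetical callable that stands in for one full cycle of sampling a null spectrum, refitting with NNLS, and comparing the simulated attribution to the observed one.

```python
def monte_carlo_p_value(null_exceeds, max_sims=100_000, target_successes=10):
    """Estimate a p-value with adaptive early stopping.

    null_exceeds() runs one null simulation and returns True when the
    simulated attribution meets or exceeds the observed attribution.
    """
    k = 0  # number of successes so far
    for n in range(1, max_sims + 1):
        if null_exceeds():
            k += 1
            if k == target_successes:
                # Stopped early: negative binomial maximum likelihood estimate.
                return k / n
    # Ran to completion: Jeffreys-prior estimate, which is never exactly zero.
    return (k + 0.5) / (max_sims + 1)
```

If no simulation ever matches the observed attribution, the returned value is 0.5 / 100,001, roughly 5e-6, which is the smallest p-value this scheme can report at the default limit.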

In addition to the p-value, the cosine similarity between the observed spectrum and the null reconstruction (without the tested signature) is recorded. The Cosine Loss — the reduction in cosine similarity between the full model and the reduced model — gives an indication of how much other signatures can compensate if the one under test is not present. This may or may not be useful and will depend on the individual signature.
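
Cosine Loss as described here can be written out directly. The Python sketch below uses hypothetical function names and assumes the three inputs are equal-length vectors over the mutation types of one class: the observed spectrum, the reconstruction from the full model, and the reconstruction from the reduced model without the tested signature.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


def cosine_loss(observed, full_reconstruction, reduced_reconstruction):
    """Drop in cosine similarity when the tested signature is removed."""
    return cosine_similarity(observed, full_reconstruction) - cosine_similarity(
        observed, reduced_reconstruction
    )
```

A value near zero means the remaining signatures reconstruct the observed spectrum almost as well without the tested signature.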

[1] Alexandrov, L.B. et al. (2020). The repertoire of mutational signatures in human cancer. Nature, 578, 94–101. https://doi.org/10.1038/s41586-020-1943-3

[2] Díaz-Gay, M. et al. (2023). Assigning mutational signatures to individual samples and individual somatic mutations with SigProfilerAssignment. Bioinformatics, 39(12), btad756. https://doi.org/10.1093/bioinformatics/btad756

[3] --vc-enable-germline-tagging true requires --variant-annotation-data <path/to/nirvana/downloads> and --variant-annotation-assembly (GRCh37|GRCh38) to also be specified.

[4] This term is preferred because it does not imply that an underlying process must exist, or that it is exogenous rather than endogenous.
