# SMN Caller

Disruption of all copies of the *SMN1* gene in an individual causes spinal muscular atrophy (SMA). *SMN1* has a high-identity paralog, *SMN2*, which differs from *SMN1* by only approximately 10 SNVs and small indels. One of these differences, hg19 chr5:70247773 C->T, affects splicing and largely disrupts the production of functional SMN protein from *SMN2*. Because of this high-similarity duplication combined with common copy-number variation, standard whole-genome sequencing (WGS) analysis does not produce complete variant calling results for SMN. Since 95% of SMA cases result from the absence of the functional C (*SMN1*) allele in any copy of SMN¹, a targeted calling solution can be effective in detecting SMA.

DRAGEN offers the following two independent components that can call the *SMN1* copy number from a germline sample.

* DRAGEN-STR
* SMN Caller

## SMA Calling With DRAGEN-STR

SMA calling is implemented together with repeat expansion detection using sequence-graph realignment to align reads to a single reference that represents *SMN1* and *SMN2*.

In addition to the standard diploid genotype call, SMA calling with DRAGEN-STR uses a direct statistical test to check for the presence of any C allele. If no C allele is detected, the sample is called affected. Otherwise, the sample is called unaffected.

SMA calling is only supported for human whole-genome sequencing with PCR-free libraries.

To enable SMA calling along with repeat expansion detection, set the `--repeat-genotype-enable` option to `true`. For information on graph-alignment options, see [Repeat Expansion Detection](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-dna-pipeline/repeat-expansions).

To activate SMA calling, the variant specification catalog file must include a description of the targeted *SMN1*/*SMN2* variant. The `<INSTALL_PATH>/resources/repeat-specs/experimental` folder contains example files.

The `<output-file-prefix>.repeat.vcf` file includes SMN output along with any targeted repeats. SMN output is represented as a single SNV call at the splice-affecting position in *SMN1* with SMA status in the following custom fields.

### SMA Result in repeat.vcf Output File

| Field | Description                                                                                                                      |
| ----- | -------------------------------------------------------------------------------------------------------------------------------- |
| VARID | SMN marks the SMN call.                                                                                                          |
| GT    | Genotype call at this position using a normal (diploid) genotype model.                                                          |
| DST   | <p>SMA status call:<br>+ indicates detected<br>- indicates undetected<br>? indicates undetermined.</p>                           |
| AD    | Total read counts supporting the C and T allele.                                                                                 |
| RPL   | Log10 likelihood ratio between the unaffected and affected models. Positive scores indicate the unaffected model is more likely. |
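
To illustrate how a log10 likelihood ratio such as `RPL` relates to the `AD` read counts, the following is a minimal sketch under a simple binomial model. It is not the DRAGEN implementation: the function names, the `total_copies` and `error_rate` parameters, and the two model definitions (affected means C reads arise only from sequencing errors, unaffected means at least one C copy is present) are assumptions made for illustration.

```python
import math

def log10_binom_pmf(k: int, n: int, p: float) -> float:
    """log10 of the binomial probability of k successes in n trials."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep the logarithms finite
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return (log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)) / math.log(10)

def sma_log10_ratio(c_reads: int, t_reads: int,
                    total_copies: int = 4, error_rate: float = 0.01) -> float:
    """Illustrative log10 ratio: unaffected model vs. affected model.

    total_copies and error_rate are hypothetical parameters, not DRAGEN options.
    """
    n = c_reads + t_reads
    # Affected model: no C allele, so C reads come only from sequencing errors.
    affected = log10_binom_pmf(c_reads, n, error_rate)
    # Unaffected model: best-fitting number of C copies, from 1 to total_copies.
    unaffected = max(
        log10_binom_pmf(
            c_reads, n,
            (c / total_copies) * (1 - error_rate) + (1 - c / total_copies) * error_rate,
        )
        for c in range(1, total_copies + 1)
    )
    return unaffected - affected

print(sma_log10_ratio(30, 30) > 0)  # balanced C and T support favors the unaffected model
print(sma_log10_ratio(0, 60) < 0)   # no C support favors the affected model
```

As in the table above, a positive value in this sketch means the unaffected model explains the read counts better.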

## SMN Caller

The SMN Caller calls *SMN1* and *SMN2* copy numbers and detects the presence of a SNP, `NM_000344.4:c.*3+80T>G`, that is associated with the two-copy *SMN1* allele. The caller is derived from the method implemented in *Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data*.²

For information about enabling the SMN Caller, see [Targeted Caller](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-dna-pipeline/targeted-caller/..#targeted-caller).

The SMN Caller performs the following steps:

1. Determines total and intact SMN copy numbers
2. Calls *SMN1* copy number at eight differentiating sites
3. Determines copy number for `NM_000344.4:c.*3+80T>G`

### Total and Intact SMN Copy Number

Two common copy-number variants (CNVs) affect *SMN1* and *SMN2*: whole-gene CNV and a partial gene deletion of exons 7 and 8. To determine copy number, the caller counts reads that align to either *SMN1* or *SMN2*. The read counts in exons 1 through 6 are used to determine the total SMN copy number. The read counts in exons 7 and 8 are used to determine the number of SMN copies that do not carry the exon 7 and 8 deletion (the intact SMN copy number). To estimate the SMN copy number for these two regions, read counts are normalized to a diploid baseline derived from 3000 preselected 2 kb regions across the genome. These normalization regions are randomly selected from the portion of the reference genome that has stable coverage across population samples. The SMN Caller then calculates the number of SMN copies that carry the exon 7 and 8 deletion by subtracting the intact SMN copy number from the total SMN copy number.
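
The arithmetic in this step can be sketched as follows. This is a simplified illustration, not the caller's actual model: it assumes read counts have already been converted to length-normalized depths, uses the median of the baseline regions as the diploid reference, and rounds to the nearest integer. The function names and the `delta78CopyNumber` key are hypothetical.

```python
from statistics import median

def estimate_copy_number(region_depth: float, baseline_depths: list[float]) -> float:
    """Scale a region's depth so that the baseline median corresponds to 2 copies."""
    diploid_depth = median(baseline_depths)
    return 2.0 * region_depth / diploid_depth

def smn_copy_numbers(exon1_6_depth: float, exon7_8_depth: float,
                     baseline_depths: list[float]) -> dict:
    """Total, intact, and exon 7-8 deleted SMN copy numbers (SMN1 + SMN2 combined)."""
    total_float = estimate_copy_number(exon1_6_depth, baseline_depths)
    intact_float = estimate_copy_number(exon7_8_depth, baseline_depths)
    # Assumption: plain rounding stands in for the caller's copy-number model.
    total, intact = round(total_float), round(intact_float)
    return {
        "totalCopyNumber": total,
        "fullLengthCopyNumber": intact,
        "delta78CopyNumber": total - intact,  # copies carrying the exon 7 and 8 deletion
        "totalCopyNumberFloat": total_float,
        "fullLengthCopyNumberFloat": intact_float,
    }

# Made-up depths: five baseline regions instead of the 3000 the caller uses.
baseline = [30.0, 31.0, 29.0, 30.0, 30.0]
print(smn_copy_numbers(exon1_6_depth=60.2, exon7_8_depth=45.1, baseline_depths=baseline))
```

With these made-up depths, the sketch yields four total copies and three intact copies, so one copy carries the exon 7 and 8 deletion.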

### *SMN1* Copy Number at Differentiating Sites

To calculate the *SMN1* copy number, the caller uses eight predefined differentiating sites in exons 7 and 8 of *SMN1* and *SMN2*. One of these sites is the splice site variant used for SMA calling with DRAGEN-STR (see *SMA Calling With DRAGEN-STR*). The caller selects differentiating sites at positions that have sequence differences between *SMN1* and *SMN2* where calling the *SMN1* copy number is most likely to be correct based on sequencing data from the 1000 Genomes Project.

For each differentiating site, the *SMN1*-specific and *SMN2*-specific alleles are counted in reads mapping to either *SMN1* or the homologous region in *SMN2*. The caller uses a binomial model to calculate the likelihood of each possible *SMN1* copy number from the two gene-specific counts given the intact SMN copy number calculated in the previous step.
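
A binomial model of this kind can be sketched for a single differentiating site as shown below. The sketch assumes that the expected fraction of *SMN1*-supporting reads equals the candidate *SMN1* copy number divided by the intact SMN copy number, clamps that fraction away from 0 and 1 to tolerate sequencing errors, and uses a uniform prior over candidates. The function name and the `error_floor` parameter are hypothetical and not part of DRAGEN.

```python
import math

def smn1_copy_number_posteriors(smn1_reads: int, smn2_reads: int,
                                intact_copy_number: int,
                                error_floor: float = 0.01) -> list[float]:
    """Posterior probability of each SMN1 copy number from 0 to intact_copy_number."""
    if intact_copy_number <= 0:
        raise ValueError("intact_copy_number must be positive")
    n = smn1_reads + smn2_reads
    likelihoods = []
    for cn in range(intact_copy_number + 1):
        # Expected fraction of SMN1-specific reads for this candidate copy number.
        p = cn / intact_copy_number
        p = min(max(p, error_floor), 1 - error_floor)
        likelihoods.append(math.comb(n, smn1_reads) * p**smn1_reads * (1 - p)**smn2_reads)
    total = sum(likelihoods)
    return [value / total for value in likelihoods]

posteriors = smn1_copy_number_posteriors(smn1_reads=20, smn2_reads=20, intact_copy_number=4)
best = max(range(len(posteriors)), key=posteriors.__getitem__)
print(best)  # most likely SMN1 copy number at this site
```

The copy number with the highest posterior is the call for that site. Very deep coverage would call for log-space arithmetic rather than the direct products used here.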

A no-call is made for the *SMN1*-specific and *SMN2*-specific copy number when there is no consensus among the eight differentiating sites. This can occur due to low-quality sequencing data or the presence of rare *SMN1*/*SMN2* haplotypes in the sample.
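
The consensus step could look like the following sketch. The exact agreement rule is not stated here, so the `min_agreeing_sites` threshold of 5 is a hypothetical value chosen only for illustration, as is the function name.

```python
from collections import Counter
from typing import Optional

def consensus_copy_number(site_calls: list[Optional[int]],
                          min_agreeing_sites: int = 5) -> Optional[int]:
    """Return the copy number most sites agree on, or None for a no-call.

    min_agreeing_sites is a hypothetical threshold, not a documented DRAGEN value.
    """
    called = [call for call in site_calls if call is not None]
    if not called:
        return None
    value, count = Counter(called).most_common(1)[0]
    return value if count >= min_agreeing_sites else None

print(consensus_copy_number([2, 2, 2, 2, 2, 2, 1, 3]))  # six of eight sites agree
print(consensus_copy_number([1, 1, 2, 2, 3, 3, 0, 0]))  # no consensus, so a no-call
```

A `None` result corresponds to the `null` value reported for `smn1CopyNumber` and `smn2CopyNumber` in the JSON output.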

### *SMN1* Duplication Haplotype Detection

Accurate detection of the *SMN1* duplication haplotype is essential for identifying spinal muscular atrophy (SMA) silent carriers. These individuals have a 2+0 *SMN1* configuration, meaning two copies of *SMN1* reside on one haplotype while the other haplotype carries none. Because the total *SMN1* copy number remains two, silent carriers appear copy-number neutral and are therefore missed by conventional carrier screening assays that report only total *SMN1* copy number.

To address this limitation, the DRAGEN SMN caller facilitates detection of the *SMN1* duplication haplotype using two complementary and independent approaches:

1. Targeted detection of known duplication-associated variants
2. A proprietary DRAGEN biomarker-driven, ML-based prediction model

#### Duplication-Associated Variant Detection

[Two variants have been reported in the literature to be associated with the two-copy *SMN1* allele:](https://pmc.ncbi.nlm.nih.gov/articles/PMC6138687/)

1. SNP: NM\_000344.4:c.\*3+80T>G (also referred to as g.27134T>G)
2. Indel: NM\_000344.4:c.211\_212del (also referred to as g.27706\_27707delAT)

These variants lie within a high-homology region shared by *SMN1* and *SMN2*. As a result, reads mapping to either gene are included in variant calling. For each site, the number of reads supporting the variant and non-variant alleles is counted, and a binomial model incorporating the sequencing error rate is applied to estimate the most likely variant allele copy number (with 0 indicating absence of the variant).
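
The estimate described above can be sketched as a maximum-likelihood search over candidate copy numbers. This is an illustration under stated assumptions, not the DRAGEN implementation: it assumes the candidates range from 0 to the total SMN copy number, and the function name and the `error_rate` default are hypothetical.

```python
import math

def variant_allele_copy_number(variant_reads: int, reference_reads: int,
                               total_copy_number: int,
                               error_rate: float = 0.01) -> int:
    """Most likely number of SMN copies carrying the variant allele (0 = absent)."""
    if total_copy_number <= 0:
        raise ValueError("total_copy_number must be positive")
    best_cn, best_loglik = 0, -math.inf
    for cn in range(total_copy_number + 1):
        frac = cn / total_copy_number
        # Probability that a read shows the variant allele, allowing for errors
        # in both directions. A nonzero error_rate keeps p strictly inside (0, 1).
        p = frac * (1 - error_rate) + (1 - frac) * error_rate
        # The binomial coefficient is the same for every candidate, so it is omitted.
        loglik = variant_reads * math.log(p) + reference_reads * math.log(1 - p)
        if loglik > best_loglik:
            best_cn, best_loglik = cn, loglik
    return best_cn

print(variant_allele_copy_number(10, 30, total_copy_number=4))  # a quarter of reads carry the variant
print(variant_allele_copy_number(1, 59, total_copy_number=4))   # one stray read is explained as an error
```

Including the error rate is what lets a single stray variant read be attributed to sequencing error instead of producing a nonzero copy number.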

When detected, these variants are reported in the `targeted.json` file under the `variants` field associated with the SMN Caller output, and in the corresponding `targeted.vcf` file.

#### Biomarker-Driven ML-Based Prediction Model

In addition to targeted variant detection, DRAGEN provides an ML-driven feature for *SMN1* duplication haplotype detection. This approach predicts *SMN1* duplication haplotype status using a logistic regression model trained on over 200 biomarker SNPs associated with the duplication haplotype. Rather than relying on any single variant, this model integrates signals across all biomarker SNPs to determine duplication status. When the total *SMN1* copy number is two, the model output is reported in the targeted.json file under the following fields associated with the SMN caller output: `smn1Duplication`, `smn1DuplicationQual` and `smn1DuplicationFilter`.

Detailed descriptions of these fields are provided in the following section.

### SMN Output File

The SMN Caller writes its calls to the targeted caller output file, `<output-file-prefix>.targeted.json`, which also contains calls from other targets (see [Targeted JSON File](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-dna-pipeline/targeted-caller/..#targeted-json-file)). An example of the SMN Caller content in this file is shown below.

```
"smn": {
    "fullLengthCopyNumber": 4,
    "totalCopyNumber": 4,
    "smn1CopyNumber": 2,
    "smn2CopyNumber": 2,
    "smn2Delta78CopyNumber": 0,
    "fullLengthCopyNumberFloat": "3.99",
    "totalCopyNumberFloat": "4.01",
    "smn1Duplication": false,
    "smn1DuplicationQual": 0.15504023290928903,
    "smn1DuplicationFilter": "TargetedSmn1DuplicationLowQual",
    "variants": [
        {
            "alleleId": "NM_000344.4:c.*3+80T>G",
            "alleleCopyNumber": 1,
            "genotypeQuality": 26,
            "filter": "PASS"
        }
    ]
}
```

The SMN Caller fields are defined as follows.

| Fields in JSON            | Explanation                                                                                                                                         | Type and Possible Values                              |
| ------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------- |
| fullLengthCopyNumber      | Copy number of intact SMN (exons 7 & 8)                                                                                                             | nonnegative integer                                   |
| totalCopyNumber           | Copy number of total SMN (exons 1 to 6)                                                                                                             | nonnegative integer                                   |
| smn1CopyNumber            | Copy number of intact *SMN1*                                                                                                                        | nonnegative integer or null for no-call               |
| smn2CopyNumber            | Copy number of intact *SMN2*                                                                                                                        | nonnegative integer or null for no-call               |
| smn2Delta78CopyNumber     | Copy number of *SMN2*Δ7–8 (deletion of exon 7 and 8)                                                                                                | nonnegative integer                                   |
| fullLengthCopyNumberFloat | Raw normalized depth of intact SMN (exons 7 & 8)                                                                                                    | string representing nonnegative floating point number |
| totalCopyNumberFloat      | Raw normalized depth of total SMN (exons 1 to 6)                                                                                                    | string representing nonnegative floating point number |
| smn1Duplication           | Present only when `smn1CopyNumber` is 2. Call for the *SMN1* copy-neutral DEL/DUP genotype: `true` if detected, `false` if not detected            | bool                                                  |
| smn1DuplicationQual       | Present only when `smn1CopyNumber` is 2. Phred-scaled quality for detection of the *SMN1* copy-neutral DEL/DUP genotype                             | nonnegative decimal number                            |
| smn1DuplicationFilter     | Present only when `smn1CopyNumber` is 2. Filter status for detection of the *SMN1* copy-neutral DEL/DUP genotype                                    | string (`PASS` or `TargetedSmn1DuplicationLowQual`)   |
| variants                  | JSON array containing information about specific SMN variants                                                                                       | JSON array                                            |

Each variant reported in the `variants` array will have the fields below.

| Fields in JSON   | Explanation                                      | Type and Possible Values         |
| ---------------- | ------------------------------------------------ | -------------------------------- |
| alleleId         | HGVS identifier of the variant allele            | string                           |
| alleleCopyNumber | Copy number of the allele in the called genotype | nonnegative integer              |
| genotypeQuality  | Phred-scaled quality for the called genotype     | nonnegative integer              |
| filter           | Filter for the called genotype                   | string. "PASS" when not filtered |
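
As a reading aid for the fields above, the following sketch parses the example SMN section and summarizes it. It assumes that `smn` is a top-level key of the targeted JSON document. The `summarize_smn` helper and its status strings are hypothetical and not part of DRAGEN.

```python
import json

# SMN section copied from the example above, wrapped in braces to form valid JSON.
# Assumption: "smn" is a top-level key of the targeted JSON document.
EXAMPLE = """
{
  "smn": {
    "fullLengthCopyNumber": 4,
    "totalCopyNumber": 4,
    "smn1CopyNumber": 2,
    "smn2CopyNumber": 2,
    "smn2Delta78CopyNumber": 0,
    "fullLengthCopyNumberFloat": "3.99",
    "totalCopyNumberFloat": "4.01",
    "smn1Duplication": false,
    "smn1DuplicationQual": 0.15504023290928903,
    "smn1DuplicationFilter": "TargetedSmn1DuplicationLowQual",
    "variants": [
      {"alleleId": "NM_000344.4:c.*3+80T>G", "alleleCopyNumber": 1,
       "genotypeQuality": 26, "filter": "PASS"}
    ]
  }
}
"""

def summarize_smn(targeted: dict) -> dict:
    """Collect the SMN1 copy number, duplication status, and passing variants."""
    smn = targeted["smn"]
    if "smn1Duplication" not in smn:
        duplication = "not reported"  # field is absent unless smn1CopyNumber is 2
    elif smn.get("smn1DuplicationFilter") != "PASS":
        duplication = "low quality"
    else:
        duplication = "detected" if smn["smn1Duplication"] else "not detected"
    passing = [v["alleleId"] for v in smn.get("variants", []) if v.get("filter") == "PASS"]
    return {
        "smn1CopyNumber": smn.get("smn1CopyNumber"),  # None represents a no-call
        "duplicationStatus": duplication,
        "passingVariants": passing,
    }

summary = summarize_smn(json.loads(EXAMPLE))
print(summary)
```

For a real run, the same function could be applied to the result of `json.load` on the `<output-file-prefix>.targeted.json` file.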

The variant `NM_000344.4:c.*3+80T>G` is also reported in VCF format. See [Targeted VCF File](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-dna-pipeline/targeted-caller/..#targeted-vcf-file) for details about how these variants are reported in VCF.

The *SMN1* copy number, when called, is reported in the `<output-file-prefix>.targeted.vcf.gz` output file. Below is a portion of the VCF output from a sample having an *SMN1* copy-neutral DEL/DUP genotype call.

```
##ALT=<ID=DEL,Description="Region of lowered copy number relative to the reference, or a deletion breakpoint">
##ALT=<ID=DUP,Description="Region of elevated copy number relative to the reference, or a tandem duplication breakpoint">
##FILTER=<ID=TargetedSmn1DuplicationLowQual,Description="SMN1 copy-neutral DEL/DUP genotype is detected with QUAL below 10.00.">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG02762
chr5    70925086        smn1_region     C       <DEL>,<DUP>     22.91   PASS    END=70953015;IMPRECISE;SVLEN=27929,27929;SVCLAIM=D,D;CN=0,2;TargetedCaller=smn  GT:GQ:CN        1/2:23:2
```
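
The record above can be unpacked with a few lines of standard-library Python. This is a minimal sketch for a single-sample record, assuming tab-separated columns as in standard VCF. The `parse_record` helper is hypothetical, and a VCF library would be a more robust choice for production use.

```python
# Example record from above, written with explicit tab separators.
RECORD = (
    "chr5\t70925086\tsmn1_region\tC\t<DEL>,<DUP>\t22.91\tPASS\t"
    "END=70953015;IMPRECISE;SVLEN=27929,27929;SVCLAIM=D,D;CN=0,2;TargetedCaller=smn\t"
    "GT:GQ:CN\t1/2:23:2"
)

def parse_record(line: str) -> dict:
    """Split one single-sample VCF data line into its columns, INFO map, and sample map."""
    chrom, pos, rec_id, ref, alt, qual, flt, info, fmt, sample = line.rstrip("\n").split("\t")
    info_map = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_map[key] = value if sep else True  # flags such as IMPRECISE carry no value
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": rec_id,
        "ref": ref,
        "alts": alt.split(","),
        "qual": float(qual),
        "filter": flt,
        "info": info_map,
        "sample": dict(zip(fmt.split(":"), sample.split(":"))),
    }

record = parse_record(RECORD)
print(record["alts"], record["info"]["CN"], record["sample"])
```

In this record, the `1/2` genotype pairs one `<DEL>` allele (`CN=0`) with one `<DUP>` allele (`CN=2`), so the sample-level `CN` remains 2.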

## References

¹Wirth B. An update of the mutation spectrum of the survival motor neuron gene (SMN1) in autosomal recessive spinal muscular atrophy (SMA). *Human Mutation.* 2000;15(3):228-237. doi:10.1002/(SICI)1098-1004(200003)15:3<228::AID-HUMU3>3.0.CO;2-9

²Chen X, Sanchis-Juan A, French CE, et al. Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data. *Genetics in Medicine.* 2020;22(5):945-953. doi:10.1038/s41436-020-0754-0
