# SMN Caller

Disruption of all copies of the SMN1 gene in an individual causes spinal muscular atrophy (SMA). SMN1 has a high-identity paralog, SMN2, which differs from SMN1 by only approximately 10 SNVs and small indels. For example, hg19 chr5:70247773 C->T affects splicing and largely disrupts the production of functional SMN protein from SMN2. Because of this high-similarity duplication combined with common copy-number variation, standard whole-genome sequencing (WGS) analysis does not produce complete variant calling results for SMN. Because 95% of SMA cases result from the absence of the functional C (SMN1) allele in any copy of SMN¹, a targeted calling solution can be effective in detecting SMA.

DRAGEN offers the following two independent components that can call the SMN1 copy number from a germline sample.

* DRAGEN-STR
* SMN Caller

## SMA Calling With DRAGEN-STR

SMA calling is implemented together with repeat expansion detection. It uses sequence-graph realignment to align reads to a single reference that represents both SMN1 and SMN2.

In addition to the standard diploid genotype call, SMA calling with DRAGEN-STR uses a direct statistical test to check for the presence of any C allele. If no C allele is detected, the sample is called affected; otherwise, it is called unaffected.

SMA calling is only supported for human whole-genome sequencing with PCR-free libraries.

To enable SMA calling along with repeat expansion detection, set the `--repeat-genotype-enable` option to `true`. For information on graph-alignment options, see [Repeat Expansion Detection](https://help.dragen.illumina.com/dragen-v4.4/product-guide/dragen-v4.4/dragen-dna-pipeline/repeat-expansions).

To activate SMA calling, the variant specification catalog file must include a description of the targeted SMN1/SMN2 variant. The `<INSTALL_PATH>/resources/repeat-specs/experimental` folder contains example files.

The `<output-file-prefix>.repeat.vcf` file includes SMN output along with any targeted repeats. SMN output is represented as a single SNV call at the splice-affecting position in SMN1 with SMA status in the following custom fields.

### SMA Result in repeat.vcf Output File

| Field | Description                                                                                                                      |
| ----- | -------------------------------------------------------------------------------------------------------------------------------- |
| VARID | SMN marks the SMN call.                                                                                                          |
| GT    | Genotype call at this position using a normal (diploid) genotype model.                                                          |
| DST   | <p>SMA status call:<br>+ indicates detected<br>- indicates undetected<br>? indicates undetermined.</p>                           |
| AD    | Total read counts supporting the C and T alleles.                                                                                |
| RPL   | Log10 likelihood ratio between the unaffected and affected models. Positive scores indicate the unaffected model is more likely. |
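
As a rough illustration of how these fields might be read downstream, the following Python sketch scans VCF lines for the record marked `VARID=SMN`. The table above does not state whether each field is stored in the INFO column or in the per-sample FORMAT column, so the sketch collects both; the function name `find_smn_record` and the example record (including its AD and RPL values) are hypothetical and are not actual DRAGEN output.

```python
def find_smn_record(vcf_lines):
    """Return INFO and first-sample fields of the VARID=SMN record, or None."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF data line
        fields = {}
        # INFO column: semicolon-separated KEY=VALUE pairs (flags have no value).
        for item in cols[7].split(";"):
            key, _, value = item.partition("=")
            fields[key] = value
        # FORMAT keys paired with the first sample's values, when present.
        if len(cols) >= 10:
            fields.update(zip(cols[8].split(":"), cols[9].split(":")))
        if fields.get("VARID") == "SMN":
            return fields
    return None


# Hypothetical record for illustration only; the field layout is an assumption.
example_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "chr5\t70247773\t.\tC\tT\t.\tPASS\tVARID=SMN\tGT:DST:AD:RPL\t0/1:-:31,29:14.2",
]
smn = find_smn_record(example_lines)
print(smn["DST"], smn["RPL"])
```

Checking both columns keeps the sketch usable without committing to a layout that this page does not specify.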

## SMN Caller

The SMN Caller calls SMN1 and SMN2 copy numbers and detects the presence of a SNP, `NM_000344.4:c.*3+80T>G`, that is associated with the two-copy SMN1 allele. The caller is derived from the method implemented in *Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data*.²

For information about enabling the SMN Caller, see [Targeted Caller](https://help.dragen.illumina.com/dragen-v4.4/product-guide/dragen-v4.4/dragen-dna-pipeline/targeted-caller/broken-reference).

The SMN Caller performs the following steps:

1. Determines total and intact SMN copy numbers
2. Calls SMN1 copy number at eight differentiating sites
3. Determines copy number for `NM_000344.4:c.*3+80T>G`

### Total and Intact SMN Copy Number

Two copy-number variants (CNVs) are common in SMN1 and SMN2: whole-gene CNV and a partial gene deletion of exons 7 and 8. Reads that align to either SMN1 or SMN2 are counted. The read counts in exons 1 through 6 are used to determine the total SMN copy number. The read counts in exons 7 and 8 are used to determine the number of SMN copies that do not have the exon 7 and 8 deletion (intact SMN copy number). To estimate the SMN copy number for these two regions, read counts are normalized to a diploid baseline derived from 3000 preselected 2 kb regions across the genome. The 3000 normalization regions are randomly selected from the portion of the reference genome that has stable coverage across population samples. The SMN Caller then calculates the number of SMN copies that have the exon 7 and 8 deletion by subtracting the intact SMN copy number from the total SMN copy number.
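
The normalization and subtraction described above can be sketched in a few lines of Python. This is a simplified illustration, not the caller's implementation: it assumes that per-region depths are already length-normalized, that the diploid baseline is summarized by the median, and that copy numbers are obtained by simple rounding. The function names are hypothetical.

```python
from statistics import median


def estimate_copy_number(region_depth, baseline_depths, baseline_ploidy=2):
    """Scale a region's depth so the baseline median corresponds to two copies."""
    baseline = median(baseline_depths)
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    return baseline_ploidy * region_depth / baseline


def smn_copy_numbers(exon1_6_depth, exon7_8_depth, baseline_depths):
    """Return total, intact, and exon 7-8 deleted copy numbers."""
    total_float = estimate_copy_number(exon1_6_depth, baseline_depths)
    intact_float = estimate_copy_number(exon7_8_depth, baseline_depths)
    total = round(total_float)    # assumption: plain rounding to an integer
    intact = round(intact_float)
    return {
        "total": total,
        "intact": intact,
        # Deleted copies = total minus intact, floored at zero for noisy data.
        "delta78": max(0, total - intact),
        "total_float": total_float,
        "intact_float": intact_float,
    }


# Depths pooled over reads aligned to either SMN1 or SMN2 (made-up values).
result = smn_copy_numbers(60.0, 45.0, [30, 31, 29, 30, 32])
print(result["total"], result["intact"], result["delta78"])
```

With a baseline of 30x for two copies, a pooled depth of 60x over exons 1 to 6 corresponds to four total copies, and 45x over exons 7 and 8 corresponds to three intact copies, leaving one copy with the deletion.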

### SMN1 Copy Number at Differentiating Sites

To calculate the SMN1 copy number, the caller uses eight predefined differentiating sites in exons 7 and 8 of SMN1 and SMN2. One of these sites is the splice site variant used for SMA calling with DRAGEN-STR (see *SMA Calling With DRAGEN-STR*). The caller selects differentiating sites at positions that have sequence differences between SMN1 and SMN2 where calling the SMN1 copy number is most likely to be correct based on sequencing data from the 1000 Genomes Project.

For each differentiating site, the SMN1-specific and SMN2-specific alleles are counted in reads mapping to either SMN1 or the homologous region in SMN2. The caller uses a binomial model to calculate the likelihood of each possible SMN1 copy number from the two gene-specific counts given the intact SMN copy number calculated in the previous step.
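
A minimal sketch of such a binomial likelihood for a single differentiating site is shown below. It assumes that the expected fraction of SMN1-supporting reads equals the candidate SMN1 copy number divided by the intact SMN copy number, and that the fraction is kept away from 0 and 1 by a small error rate. The function name, the `error_rate` default, and the read counts are illustrative assumptions; how the caller combines evidence across the eight sites is not covered here.

```python
from math import comb


def smn1_copy_number_likelihoods(smn1_reads, smn2_reads, intact_cn, error_rate=0.01):
    """Binomial likelihood of each candidate SMN1 copy number (0..intact_cn)."""
    n = smn1_reads + smn2_reads
    likelihoods = []
    for k in range(intact_cn + 1):
        # Expected fraction of reads carrying the SMN1-specific allele.
        p = k / intact_cn if intact_cn > 0 else 0.0
        # Clamp so a few erroneous reads do not give a likelihood of zero.
        p = min(max(p, error_rate), 1 - error_rate)
        likelihoods.append(
            comb(n, smn1_reads) * p**smn1_reads * (1 - p) ** smn2_reads
        )
    return likelihoods


# Made-up counts: 40 SMN1-specific and 20 SMN2-specific reads, 3 intact copies.
likelihoods = smn1_copy_number_likelihoods(40, 20, 3)
best_cn = likelihoods.index(max(likelihoods))
print(best_cn)
```

In this example, two thirds of the reads support SMN1, so the candidate of two SMN1 copies out of three intact copies has the highest likelihood.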

### Copy Number Call for `NM_000344.4:c.*3+80T>G`

The SNP [NM\_000344.4:c.\*3+80T>G](https://www.ncbi.nlm.nih.gov/clinvar/RCV001806742/) (also referred to as g.27134T>G) has been reported in the literature to be associated with the two-copy SMN1 allele.

For this SNP in a high-homology region, reads mapping to either *SMN1* or *SMN2* are used for variant calling. The reads containing the variant allele and the reads containing the nonvariant allele are counted, and a binomial model that incorporates the sequencing error rate is then used to determine the most likely variant allele copy number (0 for nonvariant).
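
The following Python sketch illustrates one way a binomial model with a sequencing error rate could produce a copy number call and a Phred-scaled quality. It is an illustration under stated assumptions rather than the caller's implementation: the symmetric error model, the flat prior over candidate copy numbers, the quality cap of 99, and the names `call_variant_copy_number` and `site_copy_number` are all assumptions.

```python
from math import exp, log, log10


def call_variant_copy_number(variant_reads, reference_reads, site_copy_number,
                             error_rate=0.01, max_quality=99):
    """Return (most likely variant copy number, Phred-scaled quality)."""
    if site_copy_number < 1:
        raise ValueError("site_copy_number must be at least 1")
    log_likelihoods = []
    for k in range(site_copy_number + 1):
        fraction = k / site_copy_number
        # Probability that a read shows the variant allele, allowing for errors.
        p = fraction * (1 - error_rate) + (1 - fraction) * error_rate
        # The binomial coefficient is the same for every k, so it is omitted.
        log_likelihoods.append(
            variant_reads * log(p) + reference_reads * log(1 - p)
        )
    # Normalize in log space to obtain posteriors under a flat prior.
    best = max(log_likelihoods)
    weights = [exp(ll - best) for ll in log_likelihoods]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    call = posteriors.index(max(posteriors))
    error_prob = sum(p for i, p in enumerate(posteriors) if i != call)
    if error_prob <= 0:
        quality = max_quality
    else:
        quality = min(max_quality, round(-10 * log10(error_prob)))
    return call, quality


# Made-up counts: 20 variant reads and 40 nonvariant reads over 3 copies.
call, quality = call_variant_copy_number(20, 40, 3)
print(call, quality)
```

Working in log space avoids numerical underflow at high read depth, and including the error rate means that a handful of mismatching reads does not rule out a copy number of 0.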

### SMN Output File

The SMN Caller writes its calls to the targeted caller output file, `<output-file-prefix>.targeted.json`, which also contains calls from other targets (see [Targeted JSON File](https://help.dragen.illumina.com/dragen-v4.4/product-guide/dragen-v4.4/dragen-dna-pipeline/targeted-caller/broken-reference)). An example of the SMN Caller content in this file is shown below.

```
"smn": {
    "fullLengthCopyNumber": 3,
    "totalCopyNumber": 3,
    "smn1CopyNumber": 2,
    "smn2CopyNumber": 1,
    "smn2Delta78CopyNumber": 0,
    "fullLengthCopyNumberFloat": "2.99",
    "totalCopyNumberFloat": "3.01",
    "variants": [
        {
            "alleleId": "NM_000344.4:c.*3+80T>G",
            "alleleCopyNumber": 1,
            "genotypeQuality": 26,
            "filter": "PASS"
        }
    ]
}
```

For the SMN Caller, the fields are defined as follows.

| Fields in JSON            | Explanation                                              | Type and Possible Values                              |
| ------------------------- | -------------------------------------------------------- | ----------------------------------------------------- |
| fullLengthCopyNumber      | Copy number of intact SMN (exons 7 & 8)                  | nonnegative integer                                   |
| totalCopyNumber           | Copy number of total SMN (exons 1 to 6)                  | nonnegative integer                                   |
| smn1CopyNumber            | Copy number of intact SMN1                               | nonnegative integer or null for no-call               |
| smn2CopyNumber            | Copy number of intact SMN2                               | nonnegative integer or null for no-call               |
| smn2Delta78CopyNumber     | Copy number of SMN2Δ7–8 (deletion of exons 7 and 8)      | nonnegative integer                                   |
| fullLengthCopyNumberFloat | Raw normalized depth of intact SMN (exons 7 & 8)         | string representing nonnegative floating point number |
| totalCopyNumberFloat      | Raw normalized depth of total SMN (exons 1 to 6)         | string representing nonnegative floating point number |
| variants                  | JSON array containing information about specific SMN variants | JSON array                                       |

Each variant reported in the `variants` array has the following fields.

| Fields in JSON   | Explanation                                      | Type and Possible Values         |
| ---------------- | ------------------------------------------------ | -------------------------------- |
| alleleId         | HGVS identifier of the variant allele            | string                           |
| alleleCopyNumber | Copy number of the allele in the called genotype | nonnegative integer              |
| genotypeQuality  | Phred-scaled quality for the called genotype     | nonnegative integer              |
| filter           | Filter for the called genotype                   | string. "PASS" when not filtered |
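
As an illustration of consuming these fields, the Python sketch below parses the SMN section of a targeted JSON document with the standard `json` module. It assumes that `smn` is a top-level key of the file, which the fragment above suggests but does not confirm. The function name `summarize_smn` and the summary keys are hypothetical; the example values are taken from the fragment shown earlier.

```python
import json


def summarize_smn(targeted_json_text):
    """Extract SMN copy numbers and passing variants from targeted JSON text."""
    data = json.loads(targeted_json_text)
    smn = data.get("smn")  # assumption: "smn" is a top-level key
    if smn is None:
        return None
    total = smn["totalCopyNumber"]
    intact = smn["fullLengthCopyNumber"]
    return {
        # May be None when the caller reports a no-call (JSON null).
        "smn1": smn.get("smn1CopyNumber"),
        "smn2": smn.get("smn2CopyNumber"),
        # Deleted copies should equal total minus intact copies.
        "delta78_consistent": total - intact == smn["smn2Delta78CopyNumber"],
        # The *Float fields are strings, so convert them explicitly.
        "total_depth": float(smn["totalCopyNumberFloat"]),
        "intact_depth": float(smn["fullLengthCopyNumberFloat"]),
        "passing_variants": [
            v["alleleId"] for v in smn.get("variants", [])
            if v.get("filter") == "PASS"
        ],
    }


example_text = """
{
  "smn": {
    "fullLengthCopyNumber": 3,
    "totalCopyNumber": 3,
    "smn1CopyNumber": 2,
    "smn2CopyNumber": 1,
    "smn2Delta78CopyNumber": 0,
    "fullLengthCopyNumberFloat": "2.99",
    "totalCopyNumberFloat": "3.01",
    "variants": [
      {
        "alleleId": "NM_000344.4:c.*3+80T>G",
        "alleleCopyNumber": 1,
        "genotypeQuality": 26,
        "filter": "PASS"
      }
    ]
  }
}
"""
summary = summarize_smn(example_text)
print(summary["smn1"], summary["smn2"], summary["passing_variants"])
```

Because `smn1CopyNumber` and `smn2CopyNumber` can be null for a no-call, downstream code should check for `None` before doing arithmetic on these values.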

The variant `NM_000344.4:c.*3+80T>G` is also reported in VCF format. See [Targeted VCF File](https://help.dragen.illumina.com/dragen-v4.4/product-guide/dragen-v4.4/dragen-dna-pipeline/targeted-caller/broken-reference) for details about how these variants are reported in VCF.

## References

¹Wirth B. An update of the mutation spectrum of the survival motor neuron gene (SMN1) in autosomal recessive spinal muscular atrophy (SMA). *Human Mutation.* 2000;15(3):228-237. doi:10.1002/(SICI)1098-1004(200003)15:3<228::AID-HUMU3>3.0.CO;2-9

²Chen X, Sanchis-Juan A, French CE, et al. Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data. *Genetics in Medicine.* 2020;22(5):945-953. doi:10.1038/s41436-020-0754-0
