SMN Caller

Disruption of all copies of the SMN1 gene in an individual causes spinal muscular atrophy (SMA). SMN1 has a high-identity paralog, SMN2, which differs from SMN1 by only approximately 10 SNVs and small indels. For example, hg19 chr5:70247773 C>T affects splicing and largely disrupts the production of functional SMN protein from SMN2. Because of this high-similarity duplication combined with common copy number variation, standard whole-genome sequencing (WGS) analysis does not produce complete variant calling results for SMN. Since 95% of SMA cases result from the absence of the functional C (SMN1) allele in any copy of SMN¹, a targeted calling solution can be effective in detecting SMA.

DRAGEN offers the following two independent components that can call the SMN1 copy number using WGS data from a germline sample.

  • ExpansionHunter

  • SMN Caller

SMA Calling With ExpansionHunter

SMA calling is implemented together with repeat expansion detection using sequence-graph realignment to align reads to a single reference that represents SMN1 and SMN2.

In addition to the standard diploid genotype call, SMA Calling with ExpansionHunter uses a direct statistical test to check for the presence of any C allele. If no C allele is detected, the sample is called affected; otherwise, it is called unaffected.

SMA calling is only supported for human whole-genome sequencing with PCR-free libraries.

To enable SMA calling along with repeat expansion detection, set the --repeat-genotype-enable option to true. For information on graph-alignment options, see Repeat Expansion Detection with ExpansionHunter.

To activate SMA calling, the variant specification catalog file must include a description of the targeted SMN1/SMN2 variant. The <INSTALL_PATH>/resources/repeat-specs/experimental folder contains example files.

The <output-file-prefix>.repeat.vcf file includes SMN output along with any targeted repeats. SMN output is represented as a single SNV call at the splice-affecting position in SMN1 with SMA status in the following custom fields.

SMA Result in repeat.vcf Output File

| Field | Description |
|---|---|
| VARID | SMN marks the SMN call. |
| GT | Genotype call at this position using a normal (diploid) genotype model. |
| DST | SMA status call: + indicates detected, - indicates undetected, ? indicates undetermined. |
| AD | Total read counts supporting the C and T alleles. |
| RPL | Log10 likelihood ratio between the unaffected and affected models. Positive scores indicate the unaffected model is more likely. |

SMN Caller

The SMN Caller calls SMN1 and SMN2 copy numbers and detects the presence of the SNP NM_000344.4:c.*3+80T>G, which is associated with the two-copy SMN1 allele. The caller is derived from the method implemented in Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data².

To enable the SMN Caller, use --enable-smn=true as part of a germline-only WGS analysis workflow. It can also be enabled along with the other targets of the targeted caller by using the option --enable-targeted=true. The SMN Caller is disabled by default.

The SMN Caller performs the following steps:

  1. Determines total and intact SMN copy numbers

  2. Calls SMN1 copy number at eight differentiating sites

  3. Determines copy number for NM_000344.4:c.*3+80T>G

The SMN Caller requires WGS data aligned to a human reference genome with at least 30x coverage.

Total and Intact SMN Copy Number

Two common copy-number variants (CNVs) in SMN1 and SMN2 are whole-gene CNV and a partial gene deletion of exons 7 and 8. Reads that align to either SMN1 or SMN2 are counted. The read counts in exons 1 through 6 are used to determine the total SMN copy number. The read counts in exons 7 and 8 are used to determine the number of SMN copies that do not have the exon 7 and 8 deletion (intact SMN copy number). To estimate the SMN copy number for these two regions, read counts are normalized to a diploid baseline derived from 3000 preselected 2 kb regions across the genome. The 3000 normalization regions are randomly selected from the portion of the reference genome that has stable coverage across population samples. The SMN Caller then calculates the number of SMN copies that have the exon 7 and 8 deletion by subtracting the intact SMN copy number from the total SMN copy number.
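The normalization and subtraction described above can be illustrated with a minimal Python sketch. It is not the DRAGEN implementation: the function names, the use of the median as the diploid baseline, the region lengths, and the simple rounding step are all assumptions made for illustration.

```python
from statistics import median


def depth_per_kb(read_count, length_bp):
    """Read count scaled to reads per kilobase of region."""
    return read_count / (length_bp / 1000.0)


def normalized_copy_number(region_reads, region_length_bp, baseline_reads,
                           baseline_length_bp=2000):
    """Scale region depth so that the baseline depth corresponds to 2 copies."""
    # Assumption: the diploid baseline is the median depth of the 2 kb regions.
    baseline_depth = median(
        depth_per_kb(count, baseline_length_bp) for count in baseline_reads
    )
    return 2.0 * depth_per_kb(region_reads, region_length_bp) / baseline_depth


def smn_copy_numbers(exon1_6_reads, exon1_6_length_bp,
                     exon7_8_reads, exon7_8_length_bp, baseline_reads):
    """Estimate total, intact, and exon 7-8 deleted SMN copy numbers."""
    total_depth = normalized_copy_number(
        exon1_6_reads, exon1_6_length_bp, baseline_reads)
    intact_depth = normalized_copy_number(
        exon7_8_reads, exon7_8_length_bp, baseline_reads)
    # Assumption: plain rounding stands in for the caller's copy number model.
    total_cn = round(total_depth)
    intact_cn = round(intact_depth)
    return {
        "total_depth": total_depth,
        "intact_depth": intact_depth,
        "total_cn": total_cn,
        "intact_cn": intact_cn,
        # Copies carrying the exon 7 and 8 deletion = total - intact.
        # Assumption: clamp at 0 so depth noise cannot yield a negative count.
        "delta78_cn": max(total_cn - intact_cn, 0),
    }


# Hypothetical counts: 3000 baseline regions with 400 reads each,
# a 20 kb exon 1-6 region, and a 6 kb exon 7-8 region.
example = smn_copy_numbers(8000, 20000, 1800, 6000, [400] * 3000)
```

Because reads from both SMN1 and SMN2 are pooled, the normalized values describe the combined copy number of the two genes, so a sample with two copies of each gene is expected to give a value near 4.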

SMN1 Copy Number at Differentiating Sites

To calculate the SMN1 copy number, the caller uses eight predefined differentiating sites in exons 7 and 8 of SMN1 and SMN2. One of these sites is the splice site variant used for SMA calling with ExpansionHunter (see SMA Calling With ExpansionHunter). The caller selects differentiating sites at positions that have sequence differences between SMN1 and SMN2 where calling the SMN1 copy number is most likely to be correct based on sequencing data from the 1000 Genomes Project.

For each differentiating site, the SMN1-specific and SMN2-specific alleles are counted in reads mapping to either SMN1 or the homologous region in SMN2. The caller uses a binomial model to calculate the likelihood of each possible SMN1 copy number from the two gene-specific counts given the intact SMN copy number calculated in the previous step.
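As a rough illustration of the binomial model for a single differentiating site, the following Python sketch scores each candidate SMN1 copy number from 0 to the intact SMN copy number. The function names, the flat prior used to normalize the likelihoods, and the `epsilon` clamp that keeps the expected allele fraction away from 0 and 1 are assumptions, not details taken from the caller.

```python
from math import exp, lgamma, log


def log_binomial_pmf(k, n, p):
    """Log of the binomial probability of k successes in n trials."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(p) + (n - k) * log(1.0 - p)


def smn1_copy_number_posteriors(smn1_reads, smn2_reads, intact_cn, epsilon=0.01):
    """Normalized likelihood of each SMN1 copy number at one site."""
    if intact_cn < 1:
        # No intact copies: SMN1 copy number can only be 0.
        return {0: 1.0}
    n = smn1_reads + smn2_reads
    log_liks = {}
    for cn in range(intact_cn + 1):
        # Expected fraction of SMN1-specific reads if cn of the intact
        # copies are SMN1. Assumption: clamp to avoid log(0).
        p = min(max(cn / intact_cn, epsilon), 1.0 - epsilon)
        log_liks[cn] = log_binomial_pmf(smn1_reads, n, p)
    # Normalize in log space for numerical stability (flat prior assumed).
    peak = max(log_liks.values())
    weights = {cn: exp(ll - peak) for cn, ll in log_liks.items()}
    total = sum(weights.values())
    return {cn: w / total for cn, w in weights.items()}


def most_likely_copy_number(posteriors):
    """Copy number with the highest normalized likelihood."""
    return max(posteriors, key=posteriors.get)


# Hypothetical counts at one site with an intact SMN copy number of 4.
balanced = smn1_copy_number_posteriors(30, 30, 4)
skewed = smn1_copy_number_posteriors(45, 15, 4)
```

The sketch handles one site only; how the calls from the eight sites are combined into a final SMN1 copy number is not covered here.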

Copy Number Call for NM_000344.4:c.*3+80T>G

The SNP NM_000344.4:c.*3+80T>G (also referred to as g.27134T>G) has been reported in the literature to be associated with the two-copy SMN1 allele.

For this high-homology region SNP, reads mapping to either SMN1 or SMN2 are used for variant calling. Reads containing the variant allele and reads containing the nonvariant allele are counted, and a binomial model that incorporates the sequencing error rate is then used to determine the most likely variant allele copy number (0 for nonvariant).
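A binomial model with a sequencing error term might look like the following Python sketch. The function name, the default `error_rate`, the symmetric error model, and the choice of the combined copy number of the region (`region_cn`) as the denominator are assumptions for illustration only.

```python
from math import exp, lgamma, log


def variant_copy_number(alt_reads, ref_reads, region_cn, error_rate=0.01):
    """Return (most likely variant copy number, normalized likelihoods)."""
    if region_cn < 1:
        raise ValueError("region_cn must be at least 1")
    n = alt_reads + ref_reads
    log_comb = lgamma(n + 1) - lgamma(alt_reads + 1) - lgamma(ref_reads + 1)
    log_liks = {}
    for cn in range(region_cn + 1):
        frac = cn / region_cn
        # Probability of observing the variant allele in a read: a true
        # variant read that is read correctly, or a nonvariant read that is
        # misread (assumed symmetric error model).
        p_alt = frac * (1.0 - error_rate) + (1.0 - frac) * error_rate
        log_liks[cn] = (log_comb
                        + alt_reads * log(p_alt)
                        + ref_reads * log(1.0 - p_alt))
    # Normalize in log space; a flat prior over copy numbers is assumed.
    peak = max(log_liks.values())
    weights = {cn: exp(ll - peak) for cn, ll in log_liks.items()}
    total = sum(weights.values())
    posteriors = {cn: w / total for cn, w in weights.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Hypothetical read counts for a region with a combined copy number of 4.
best_cn, cn_posteriors = variant_copy_number(15, 45, 4)
```

With the error term, a handful of variant reads in a deep pileup remains most consistent with copy number 0, which is the purpose of including the sequencing error rate in the model.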

SMN Output File

The SMN Caller writes its calls to the targeted caller output file, <output-file-prefix>.targeted.json, which also contains calls from other targets (see Targeted JSON File). An example of the SMN Caller content in this file is shown below.

"smn": {
    "smn1CopyNumber": 3,
    "smn2CopyNumber": 1,
    "smn2Delta78CopyNumber": 0,
    "totalCopyNumber": "3.89",
    "fullLengthCopyNumber": "4.10",
    "variants": [
    {
        "hgvs": "NM_000344.4:c.*3+80T>G",
        "qual": null,
        "altCopyNumber": 1,
        "altCopyNumberQuality": 54.63854344932561
    }
    ]
}

For the SMN Caller, the fields are defined as follows.

| Fields in JSON | Explanation | Type and Possible Values |
|---|---|---|
| smn1CopyNumber | Copy number of intact SMN1 | nonnegative integer or null |
| smn2CopyNumber | Copy number of intact SMN2 | nonnegative integer or null |
| smn2Delta78CopyNumber | Copy number of SMN2Δ7–8 (deletion of exon 7 and 8) | nonnegative integer |
| totalCopyNumber | Raw normalized depth of total SMN (exons 1 to 6) | nonnegative floating point number |
| fullLengthCopyNumber | Raw normalized depth of intact SMN (exons 7 & 8) | nonnegative floating point number |
| variants | a json array containing info about specific SMN variants | json-array |

Each variant reported in the variants array has the fields below.

| Fields in JSON | Explanation | Type and Possible Values |
|---|---|---|
| hgvs | HGVS id of the variant being reported | string |
| qual | Phred quality that at least one copy of the variant allele is found | nonnegative floating point number |
| altCopyNumber | detected copy number of the variant allele | nonnegative floating point number |
| altCopyNumberQuality | Phred quality of the detected copy number | nonnegative floating point number |
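The fields above can be read with a few lines of Python. The sketch below is only an illustration: it assumes that smn is a top-level key of the targeted JSON file, and the helper names `to_float` and `summarize_smn` are hypothetical. The embedded document reuses the example values shown earlier, where the two depth fields appear as quoted strings and qual is null, so the sketch converts strings to floats and passes null through as None.

```python
import json

# Example content, wrapped in braces so that it parses as a JSON document.
EXAMPLE_TARGETED_JSON = """
{
  "smn": {
    "smn1CopyNumber": 3,
    "smn2CopyNumber": 1,
    "smn2Delta78CopyNumber": 0,
    "totalCopyNumber": "3.89",
    "fullLengthCopyNumber": "4.10",
    "variants": [
      {
        "hgvs": "NM_000344.4:c.*3+80T>G",
        "qual": null,
        "altCopyNumber": 1,
        "altCopyNumberQuality": 54.63854344932561
      }
    ]
  }
}
"""


def to_float(value):
    """Convert a number or numeric string to float; keep null as None."""
    return None if value is None else float(value)


def summarize_smn(targeted):
    """Extract the SMN Caller fields from a parsed targeted JSON document."""
    smn = targeted.get("smn")
    if smn is None:
        return None  # SMN Caller output not present
    return {
        "smn1_cn": smn.get("smn1CopyNumber"),
        "smn2_cn": smn.get("smn2CopyNumber"),
        "delta78_cn": smn.get("smn2Delta78CopyNumber"),
        "total_depth": to_float(smn.get("totalCopyNumber")),
        "intact_depth": to_float(smn.get("fullLengthCopyNumber")),
        "variants": [
            {
                "hgvs": v.get("hgvs"),
                "qual": to_float(v.get("qual")),
                "alt_cn": v.get("altCopyNumber"),
                "alt_cn_quality": to_float(v.get("altCopyNumberQuality")),
            }
            for v in smn.get("variants", [])
        ],
    }


summary = summarize_smn(json.loads(EXAMPLE_TARGETED_JSON))
```

For a real run, the same function can be applied to the result of `json.load` on the opened <output-file-prefix>.targeted.json file.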

The variant NM_000344.4:c.*3+80T>G is also reported in the <output-file-prefix>.targeted.vcf[.gz] file in the output directory. The output file is in VCFv4.2 format and may be compressed. The variant is reported with the VARIANT_IN_HOMOLOGY_REGION flag in the INFO field and is filtered with the TargetedRepeatConflict filter. Because the variant lies in a region of homology between SMN1 and SMN2, it is reported twice, once for each of the SMN1 and SMN2 regions, and the two records are connected by the same EVENT in the INFO field. The ploidy of the variant is reported in concordance with the identified genotype.

An example of the VCF entry for the variant NM_000344.4:c.*3+80T>G is as follows.

##fileformat=VCFv4.2
...
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG04038
chr5	70076654	.	T	G	150	TargetedRepeatConflict	EVENT=NM_000344.4:c.*3+80T>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:GQ	0/0/0/1:55
chr5	70952074	.	T	G	150	TargetedRepeatConflict	EVENT=NM_000344.4:c.*3+80T>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION	GT:GQ	0/0/0/1:55
...
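Because the two records describe a single event, downstream code may want to pair them by their EVENT value. The following Python sketch shows one way to do so with the two example records above. The function names are hypothetical, and the parsing assumes tab-separated records with a single sample column, as in the example.

```python
import re
from collections import defaultdict

# The two example records, with tab-separated columns.
VCF_LINES = [
    "chr5\t70076654\t.\tT\tG\t150\tTargetedRepeatConflict\t"
    "EVENT=NM_000344.4:c.*3+80T>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION\t"
    "GT:GQ\t0/0/0/1:55",
    "chr5\t70952074\t.\tT\tG\t150\tTargetedRepeatConflict\t"
    "EVENT=NM_000344.4:c.*3+80T>G;EVENTTYPE=VARIANT_IN_HOMOLOGY_REGION\t"
    "GT:GQ\t0/0/0/1:55",
]


def parse_record(line):
    """Parse one single-sample VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, filt, info, fmt, sample = fields[:10]
    info_map = dict(
        item.split("=", 1) for item in info.split(";") if "=" in item
    )
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    alleles = re.split(r"[/|]", sample_map["GT"])
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "filter": filt,
        "event": info_map.get("EVENT"),
        "event_type": info_map.get("EVENTTYPE"),
        "ploidy": len(alleles),            # number of alleles in GT
        "alt_copies": alleles.count("1"),  # copies of the first ALT allele
    }


def group_by_event(lines):
    """Group data lines that share the same EVENT value."""
    events = defaultdict(list)
    for line in lines:
        if not line or line.startswith("#"):
            continue  # skip header and empty lines
        record = parse_record(line)
        if record["event"] is not None:
            events[record["event"]].append(record)
    return dict(events)


events = group_by_event(VCF_LINES)
```

In the example, both records carry the genotype 0/0/0/1, so the sketch reports a ploidy of 4 and one copy of the variant allele for the shared event.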

The variant NM_000344.4:c.*3+80T>G in the <output-file-prefix>.targeted.vcf[.gz] file can also be included in the <output-file-prefix>.hard-filtered.vcf[.gz] file by adding smn to the --targeted-merge-vc list (that is, --targeted-merge-vc smn). The output file <output-file-prefix>.targeted.vcf[.gz] is compressed by default. Compression can be disabled using --enable-vcf-compression=false.

References

¹Wirth B. An update of the mutation spectrum of the survival motor neuron gene (SMN1) in autosomal recessive spinal muscular atrophy (SMA). Human Mutation. 2000;15(3):228-237. doi:10.1002/(SICI)1098-1004(200003)15:3<228::AID-HUMU3>3.0.CO;2-9

²Chen X, Sanchis-Juan A, French CE, et al. Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data. Genetics in Medicine. 2020;22(5):945-953. doi:10.1038/s41436-020-0754-0
