# CYP2D6 Caller

The CYP2D6 Caller is capable of genotyping the *CYP2D6* gene from whole-genome sequencing (WGS) data and is derived from the method implemented in Cyrius¹. Due to high sequence similarity with its pseudogene paralog *CYP2D7* and a wide variety of common structural variants (SVs), a specialized caller is necessary to resolve variants and identify likely star allele haplotypes.

The CYP2D6 Caller performs the following steps:

1. Determines total *CYP2D6* and *CYP2D7* copy number from read depth.
2. Determines *CYP2D6*-derived copy number at *CYP2D6*/*CYP2D7* differentiating sites.
3. Detects SV breakpoints by calculating the changes in *CYP2D6*-derived copy number along the *CYP2D6* gene.
4. Calls small variants in *CYP2D6* copies.
5. Identifies star alleles from the detected SV breakpoints and small variants.
6. Identifies the most likely genotype for the called star alleles.

## Total *CYP2D6* and *CYP2D7* Copy Number

The first step of CYP2D6 calling is to determine the combined copy number of *CYP2D6* and *CYP2D7*. Reads aligned to regions in either *CYP2D6* or *CYP2D7* are counted. The counts in each region are corrected for GC-bias, and then normalized to a diploid baseline. The GC-bias correction and normalization factors are determined from read counts in 3000 preselected 2 kb regions across the genome. These 3000 normalization regions were randomly selected from the portion of the reference genome having stable coverage across population samples. The combined *CYP2D6* and *CYP2D7* copy number is then calculated from the average sequencing depth across the *CYP2D6* and *CYP2D7* regions.

## Differentiating Sites

The *CYP2D6*-derived copy number is calculated at 117 predefined differentiating sites across the *CYP2D6* gene. The differentiating sites are selected at positions with sequence differences in *CYP2D6* and *CYP2D7* where calling the *CYP2D6*-derived copy number shows an accuracy of greater than 98% based on sequencing data from the 1000 Genomes Project.

For each differentiating site, *CYP2D6*-specific and *CYP2D7*-specific alleles are counted in reads mapping to either *CYP2D6* or the homologous region in *CYP2D7*. The *CYP2D6*-derived copy number is then calculated from the two gene-specific allele counts using the total *CYP2D6* and *CYP2D7* copy number calculated from the previous step.

## Structural Variant Calling

The *CYP2D6*-derived copy number along the *CYP2D6* gene is used to identify known population structural variants (SVs), including whole gene deletions and duplications as well as certain gene conversions and gene fusions. The following fusion variants are detected:

| Fusion Breakpoint | Hybrid Gene Structure | Star-Allele Designation                                                                       |
| ----------------- | --------------------- | --------------------------------------------------------------------------------------------- |
| exon 9            | 2D6-2D7               | <p><code>\*4.013</code>,<br><code>\*36</code>,<br><code>\*57</code>,<br><code>\*83</code></p> |
| exon 9            | 2D7-2D6               | `*13`                                                                                         |
| intron 4          | 2D7-2D6               | `*13`                                                                                         |
| intron 1          | 2D7-2D6               | `*13`                                                                                         |
| intron 1          | 2D6-2D7               | `*68`                                                                                         |

In addition to the exon 9 fusion breakpoints, exon 9 can participate in *CYP2D7* gene conversion resulting in an embedded *CYP2D7* sequence instead of a true hybrid. The structural variant caller also detects exon 9 gene conversions. Because only changes in *CYP2D6*-derived copy number yield structural variant calls, there might be rare cases where two hybrid copies result in no structural variant calls. For example, when both `*36` and `*13` with fusion breakpoint in exon 9 are present. However, the structural variant caller is capable of detecting multiple copies of the same fusion type (eg, `*36x2`) or cases where both an exon 9 gene conversion copy and an exon 9 2D6-2D7 hybrid are present.

## Small Variant Calling

118 small variants that define various star alleles are detected from the read alignments. 96 of these variants are in unique (nonhomologous) regions of *CYP2D6* with high mapping quality. Only reads mapping to *CYP2D6* are used for calling variants in nonhomologous regions. The other 22 variants occur in homologous regions of *CYP2D6* where reads mapping to either *CYP2D6* or *CYP2D7* are used for variant calling.

For each variant, reads containing either the variant allele or the nonvariant alleles are counted. A binomial model that incorporates the sequencing errors is then used to determine the most likely variant copy number (0 for nonvariant). A strand bias filter is applied to a small subset of variants that would otherwise tend to have false positive calls in the population.

Samples with poor sequencing quality or greater than five copies of *CYP2D6* will have allele counts with higher variance. This elevated variance increases the chance that the most likely variant copy number is wrong. To handle these cases, the small variant caller also indicates alternate, less likely variant copy numbers.

## Star Allele Identification

The called SVs and small variant genotypes are matched against the definitions of 128 different star alleles (PharmVar version 4.17, July 3, 2020). This might result in different sets of star alleles matching the called variant genotypes, such as with `*1`, `*46` and `*43`, `*45` where both sets of star alleles contain the same 4 small variants. When the small variant caller emits alternate, less likely variant copy numbers in addition to the most likely variant copy numbers, this might result in different sets of star alleles being identified, since these alternate sets of variant copy numbers are also matched to the star allele definitions. The number of matched star alleles must match the number of *CYP2D6*-derived gene copies determined from previous steps. When there are fewer than two *CYP2D6*-derived gene copies, then one or more `*5` deletion haplotypes are included in the output set of star alleles. If all variant genotypes cannot be matched to a set of star alleles, the CYP2D6 Caller returns a no call during the genotyping step with filter value `No_call`.

## Genotyping

Given a possible set of star alleles, the genotyping step attempts to identify the two likely haplotypes that contain all star alleles in the set. The deletion haplotype (`*5`) is considered as a possible haplotype during this process. The likelihood of any given genotype is determined from a table of population frequencies determined from the 1000 Genomes Project and the genotype with the highest population frequency is selected. When two or more possible genotypes are identified with similar population frequencies, then all genotypes are emitted. This results in a call with filter value `More_than_one_possible_genotype`.

## CYP2D6 Output File

The CYP2D6 Caller prints out its calls in the targeted caller output file, `<output-file-prefix>.targeted.json` that also contains calls from other targets (see [Targeted JSON File](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-dna-pipeline/targeted-caller/..#targeted-json-file)). An example of the CYP2D6 caller content in the output is as follows:

```
{
    cyp2d6": {
        "totalCopyNumber": 4,
        "genotype": "*113/*2",
        "genotypeFilter": "PASS",
        "genotypeQuality": 150,
        "phenotypeDatabaseAnnotation": "Indeterminate",
        "variants": [
        {
            "alleleId": "g.42126752C>T",
            "alleleCopyNumber": 1,
            "genotypeQuality": 28,
            "filter": "PASS"
        },
        {
            "alleleId": "g.42126938C>T",
            "alleleCopyNumber": 2,
            "genotypeQuality": 58,
            "filter": "PASS"
        },
        {
            "alleleId": "g.42126611C>G",
            "alleleCopyNumber": 1,
            "genotypeQuality": 150,
            "filter": "PASS"
        },
        {
            "alleleId": "g.42127941G>A",
            "alleleCopyNumber": 1,
            "genotypeQuality": 150,
            "filter": "PASS"
        }
        ]
    }
}
```

| Fields in JSON              | Explanation                                                                               | Type and Possible Values                                                                  |
| --------------------------- | ----------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| totalCopyNumber             | Total combined copy number of *CYP2D6* and *CYP2D7*                                       | nonnegative integer                                                                       |
| genotype                    | called star allele genotype                                                               | string (semi-colon delimited list of possible genotypes with haplotypes separated by `/`) |
| genotypeFilter              | The filter status for the genotype call                                                   | string (The value can include: PASS, No\_call, or More\_than\_one\_possible\_genotype)    |
| genotypeQuality             | The quality value for the genotype call                                                   | integer (The value can be in the range of 0-150)                                          |
| phenotypeDatabaseAnnotation | The metabolism status corresponding to the genotype, mapped from phenotypeDatabaseSources | string or null if not present in the database                                             |
| variants                    | a json array containing info about specific *CYP2D6* variants                             | json-array                                                                                |

Each variant reported in the `variants` array will have the fields below.

| Fields in JSON   | Explanation                                      | Type and Possible Values         |
| ---------------- | ------------------------------------------------ | -------------------------------- |
| alleleId         | HGVS identifier of the variant allele            | string                           |
| alleleCopyNumber | Copy number of the allele in the called genotype | nonnegative integer              |
| genotypeQuality  | Phred-scaled quality for the called genotype     | nonnegative integer              |
| filter           | Filter for the called genotype                   | string. "PASS" when not filtered |

Each *CYP2D6* genotype contains two haplotypes separated by a slash (eg `*1/*2`). Each haplotype consists of one or more star alleles separated by a plus sign (eg `*10+*36`). When a haplotype contains more than one copy of the same star allele, that star allele only appears once and is followed by a multiplication sign, and then the number of copies (eg `*1x2` for two copies of `*1`).

CYP2D6 Caller can also print out its variant calls into VCF format which has been described in the top-level targeted-caller section (see [Targeted VCF File](https://help.dragen.illumina.com/product-guides/dragen-v4.5/dragen-dna-pipeline/targeted-caller/..#targeted-vcf-file)).

¹Chen X, Shen F, Gonzaludo N, et al. Cyrius: accurate CYP2D6 genotyping using whole-genome sequencing data. The Pharmacogenomics Journal. 2021;21(2):251-261. doi:10.1038/s41397-020-00205-5
