# Star Allele Caller

## Overview

The Star Allele Caller identifies the genotypes and metabolism status of the following PGx genes, which are included in [FDA's PGx recommendations](https://www.fda.gov/medical-devices/precision-medicine/table-pharmacogenetic-associations) or have [CPIC Level A designation](https://cpicpgx.org/genes-drugs/): CACNA1S, CFTR, CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, IFNL3, RYR1, NUDT15, SLCO1B1, TPMT, UGT1A1, VKORC1, DPYD, G6PD, MT-RNR1, BCHE, ABCG2, NAT2, F5, and UGT2B17. It finds the optimal genotypes for these genes based on the star allele definitions from the resources listed below. It calls metabolism status using a [PharmCAT resource file](https://github.com/PharmGKB/PharmCAT/blob/aeecfe5f787e95dfb31ede62884e287affef45b3/src/main/resources/org/pharmgkb/pharmcat/definition/gene_phenotypes.json) that maps genotypes to phenotypes. The Star Allele Caller supports the human references hg38, hg19, and GRCh37.

## Star allele definition resources for hg38

For the genes CACNA1S, CFTR, CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, IFNL3, RYR1, NUDT15, SLCO1B1, TPMT, UGT1A1, VKORC1, DPYD, G6PD, MT-RNR1, and ABCG2, the allele definitions are sourced from PharmGKB (Snapshot 2025.07.17) and can be found [here](https://www.pharmgkb.org/page/pgxGeneRef). For BCHE and NAT2, the alleles are sourced from [this](https://www.dovepress.com/getfile.php?fileID=61995) paper and [this](https://api.pharmgkb.org/v1/download/submission/1447964753) website (Snapshot-2025.07.17), respectively. For UGT2B17, the star alleles are defined [here](https://www.pharmacogenomics.pha.ulaval.ca/wp-content/uploads/2015/04/HAP-UGT2B17.htm) (Snapshot 2025.07.17). Because BCHE does not have defined star alleles, the Star Allele Caller instead checks whether a sample is positive for any of the variants reported in the paper.

## Star allele definition resources for hg19/GRCh37

For the genes CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, NUDT15, SLCO1B1, and DPYD, the definitions are sourced from PharmVar and can be found [here](https://www.pharmvar.org/genes) (PharmVar Version 5.2.11 for DPYD, and Version 6.2.14 for the rest of these genes). For the remaining Star Allele Caller genes, the allele definitions have been lifted over from their corresponding hg38 definitions (which are sourced from PharmGKB as noted above).

## Functionality

The Star Allele Caller has the following features.

* It calls star allele genotypes from different types of genomic data, such as FASTQ, BAM, gVCF, and VCF.
* It provides additional details about the genotype call, including a confidence score.
* It assumes genotypes for missing positions to be ref - these positions are listed in the output.
* It assumes filtered genotype calls to be ref - these records are also listed in the output.
* If multiple optimal diplotypes are satisfied, then it lists them all.
* It supports different versions of the human reference: hg38, hg19, and GRCh37.
* For the genes UGT2B17 and CYP2C19, the caller analyzes CNV calls to detect star alleles.

## Input files and command line examples

The Star Allele Caller accepts different forms of sequence data as input, such as FASTQ files, BAM/CRAM files, or gVCF/VCF files.

If small variant VCF/gVCF and CNV-VCF files are used as input, they should meet the following specifications.

* Must be aligned to the same human reference that is passed through the -r option.
* Variants should follow a parsimonious, left-aligned variant representation format.
* Complex variants - for example, representing closely located, independent variants, in a single record - are NOT supported.

Note that VCF/gVCF files can also be provided as compressed GZ files (i.e. `<file_name>.vcf.gz` or `<file_name>.gvcf.gz`).

The human reference must always be passed as a command line option when running the caller. The Star Allele Caller detects the reference version (i.e., hg19, GRCh37 or hg38) and reads in the corresponding allele definitions.

### Configuration files

The Star Allele Caller uses configuration files that are included in the `resources/star_allele` directory of the DRAGEN install location. These files include the star allele definitions at `resources/star_allele/star_allele_definitions_hg19.json` and `resources/star_allele/star_allele_definitions_hg38.json`. It is possible to modify these files to customize the functionality of the caller, such as defining custom star alleles or modifying the names of the star alleles. Use the following steps to run DRAGEN with a custom set of star allele caller configuration files:

1. Copy the `<dragen_install_dir>/resources/star_allele` directory to a new location
2. Modify the configuration files in the new location as needed
3. Run DRAGEN with the additional command line option to specify the new resources directory: `--star_allele-resources-path /path/to/new/resources/star_allele`

Note the underscore in `--star_allele-resources-path`. Modification of the star allele caller configuration files can cause unexpected results or errors and should be done with caution.
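The steps above can be sketched as a small helper. This is only an illustration: the function name `prepare_custom_resources` and the temporary stand-in for the install directory are assumptions made for the example, and only the `--star_allele-resources-path` option and the `resources/star_allele` layout come from the description above.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def prepare_custom_resources(dragen_install_dir: Path, new_location: Path) -> List[str]:
    """Copy resources/star_allele to new_location and return the extra DRAGEN arguments.

    Hypothetical helper: the files still have to be edited by hand after the copy.
    """
    source = dragen_install_dir / "resources" / "star_allele"
    target = new_location / "star_allele"
    shutil.copytree(source, target)  # step 1: copy the directory
    # step 2 (editing the copied files) happens outside this function
    return ["--star_allele-resources-path", str(target)]  # step 3: extra option


# Demonstration against a temporary stand-in for the install directory.
with tempfile.TemporaryDirectory() as tmp:
    fake_install = Path(tmp) / "dragen"
    resources = fake_install / "resources" / "star_allele"
    resources.mkdir(parents=True)
    (resources / "star_allele_definitions_hg38.json").write_text("{}")

    extra_args = prepare_custom_resources(fake_install, Path(tmp) / "custom")
    copied = (Path(tmp) / "custom" / "star_allele" / "star_allele_definitions_hg38.json").exists()
    print(extra_args, copied)
```

The returned list can be appended to the rest of a DRAGEN command line. The original install directory is left untouched, so the default definitions remain available.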

### Recommended command line

From a BAM/CRAM/FASTQ input, the Star Allele Caller can be enabled in parallel with other components as part of a WGS germline analysis workflow using the option `--enable-pgx` (see [DRAGEN Recipe - Germline WGS](https://help.dragen.illumina.com/product-guides/dragen-recipes/dna-germline-wgs#dragen-recipe-dna-germline-wgs)). This is the simplest and recommended way to run the Star Allele Caller.

The Star Allele Caller can also be enabled separately using the following command line options.

### Command line with gVCF input

In the simplest case, the caller takes DRAGEN gVCF and DRAGEN CNV-VCF files as input. The following is an example of the command line for the basic use case.

```
dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/${HASH_TABLE_VERSION} \
--star-allele-gvcf /staging/test/data/NA12878.gvcf \
--star-allele-cnv-vcf /staging/test/data/NA12878.cnv.vcf.gz \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-star-allele true
```

### Command line with VCF input

Unlike a variant-only VCF file, a DRAGEN gVCF file contains the genotypes for all positions in a genome. Although gVCF is the preferred format for the caller, it can also accept a standard variant-only VCF file as input. The command line for this case is the same as above, with the VCF file passed instead of a gVCF file. The CNV-VCF file is optional; if it is omitted, the Star Allele Caller does not call star alleles that are detected through CNV analysis. The following example uses only a variant-only VCF file as input.

```
dragen \
-r /staging/human/reference/hg38_alt_aware+cnv+hla+rna_v2/DRAGEN/${HASH_TABLE_VERSION} \
--star-allele-gvcf /staging/test/data/NA12878.vcf \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-star-allele true
```

### Command line with BAM input

To run the Star Allele Caller from a BAM input, the variant caller also needs to be enabled. Enabling the CNV caller is optional but recommended, so that CNV star alleles can be analyzed. An example of the command line for this use case is as follows.

```
dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/${HASH_TABLE_VERSION} \
--bam-input /staging/test/data/NA12878.bam \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-map-align false \
--enable-star-allele true \
--enable-variant-caller true \
--vc-emit-ref-confidence gvcf \
--enable-cnv true \
--cnv-enable-self-normalization true
```

**Note that the Star Allele Caller supports the force genotyping option of the variant caller (set by `--vc-forcegt-vcf`), but other variant caller options, such as combining phased variants (set using `--vc-combine-phased-variants-distance`), are NOT supported at this time.**

### Command line with FASTQ input

If a FASTQ file is used as input, the additional options `--RGID` and `--RGSM` need to be set in the command line. An example of the command line for this use case is as follows.

```
dragen \
-r /staging/human/reference/hg38_alt_aware+cnv+hla+rna_v2/DRAGEN/${HASH_TABLE_VERSION} \
-1 /scratch/NA11829.fq1.gz \
-2 /scratch/NA11829.fq2.gz \
--RGID DRAGEN_RGID \
--RGSM DRAGEN_RGSM \
--enable-map-align true \
--output-directory /staging/test/output \
--output-file-prefix NA11829 \
--enable-star-allele true \
--enable-variant-caller true \
--vc-emit-ref-confidence gvcf \
--enable-cnv true \
--cnv-enable-self-normalization true
```

## Output files

Following completion of the DRAGEN Star Allele Caller run, the following output files are produced.

1. When the Star Allele Caller is run with small variant calling, or directly from genome VCF input, the main output file, `<prefix>.targeted.json`, contains the complete and detailed results for all genes. This is an example output for two genes, `CYP3A5` and `UGT1A1`, and for one sample, `HG00236`.

```
{
  "genomeBuild": "hg38",
  "softwareVersion": "dragen <VERSION>",
  "sampleId": "HG00236",
  "phenotypeDatabaseSources": [
    "PharmCAT Phenotypes Version: Snapshot-2022.09.15"
  ],
  "starAlleleDatabaseSources": [
    "PharmGKB Database Version: Snapshot-2022.01.01",
    "PharmGKB Database Version: Snapshot-2022.03.01",
    "UGT Nomenclature Committee Version: Snapshot-01.01.2023",
    "Zhu et al. 2020, PMID: 33061533"
  ],
  "locusAnnotations": [
    {
      "gene": "CYP3A5",
      "geneId": "HGNC:2638",
      "starAlleleDatabaseSource": "PharmGKB Database Version: Snapshot-2022.01.01",
      "genotype": "*3/*3",
      "genotypeQuality": 43,
      "phenotypeDatabaseAnnotation": "Poor Metabolizer",
      "supportingVariants": [
        {
          "alleleId": "*3",
          "chrom": "chr7",
          "pos": 99672916,
          "ref": "T",
          "alt": "C,<NON_REF>",
          "gt": "1/1",
          "quality": 43
        }
      ],
      "missingVariantSites": [],
      "variantStarAllelesFound": "*3",
      "variantStarAllelesChecked": [
        "*3",
        "*6",
        "*7",
        "*8",
        "*9"
      ]
    },
    {
      "gene": "UGT1A1",
      "geneId": "HGNC:12530",
      "starAlleleDatabaseSource": "PharmGKB Database Version: Snapshot-2022.01.01",
      "genotype": "*1/*1",
      "genotypeQuality": 0,
      "phenotypeDatabaseAnnotation": "Normal Metabolizer",
      "supportingVariants": [],
      "missingVariantSites": [
        {
          "id": "233760233:C:CAT",
          "alleleIds": "*28,*80+*28"
        },
        {
          "id": "chr2:233759924:C:T,<NON_REF>:0/1:0:10:LowGQ",
          "alleleIds": "*80,*80+*28,*80+*37"
        }
      ],
      "variantStarAllelesFound": "",
      "variantStarAllelesChecked": [
        "*6",
        "*27",
        "*28",
        "*36",
        "*37",
        "*80",
        "*80+*28",
        "*80+*37"
      ]
    }
  ]
}
```

The fields in the json file are as follows.

* "genomeBuild": Reference version being used
* "softwareVersion": Version of DRAGEN being run
* "sampleId": Sample name
* "phenotypeDatabaseSources": Resources used for calling metabolism status (phenotype)
* "starAlleleDatabaseSources": Resources used for identifying star alleles (genotype)
* "locusAnnotations": List of star allele caller results, one for each gene
* "gene": Gene name
* "geneId": Static HGNC or Ensembl ID of the gene
* "starAlleleDatabaseSource": Resource for the star allele definitions file
* "genotype": The detected star allele diplotype (or haplotype for haploid gene)
* "genotypeQuality": Phred scaled quality score for the genotype
* "phenotypeDatabaseAnnotation": Metabolism status corresponding to the genotype called
* "supportingVariants": List of variants corresponding to the star-allele genotype. The alleleId field denotes the name of the star allele. Each non-ref star allele has a list of supportingVariants that displays the variant details, which are the same as in the small variant vcf file. The quality field denotes the gq field from the vcf record
* "missingVariantSites": List of relevant gene sites for which vcf records are missing or filtered
* "variantStarAllelesFound": List of star allele haplotypes that are satisfied by the found variants
* "variantStarAllelesChecked": List of all star alleles checked by the caller
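As a minimal sketch of how these fields can be consumed, the following snippet extracts one summary row per gene from a report with the structure described above. The function names and the trimmed inline report are assumptions made for the example. The conversion from a Phred-scaled score to an error probability uses the conventional definition `10^(-Q/10)`, which is assumed here to apply to "genotypeQuality".

```python
import json
from typing import List, Tuple


def summarize_calls(report: dict) -> List[Tuple[str, str, int, str]]:
    """Return (gene, genotype, genotypeQuality, phenotype) for each locus annotation."""
    rows = []
    for locus in report.get("locusAnnotations", []):
        rows.append((
            locus["gene"],
            locus["genotype"],
            locus["genotypeQuality"],
            locus.get("phenotypeDatabaseAnnotation", ""),
        ))
    return rows


def phred_to_error_probability(q: float) -> float:
    """Conventional Phred conversion: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Trimmed, hand-written report with the same field names as the example output.
sample = json.loads("""
{
  "sampleId": "HG00236",
  "locusAnnotations": [
    {
      "gene": "CYP3A5",
      "genotype": "*3/*3",
      "genotypeQuality": 43,
      "phenotypeDatabaseAnnotation": "Poor Metabolizer"
    }
  ]
}
""")

for gene, genotype, quality, phenotype in summarize_calls(sample):
    print(gene, genotype, quality, phenotype, phred_to_error_probability(quality), sep="\t")
```

For a real run, the inline string would be replaced by the contents of `<prefix>.targeted.json`, for example with `json.load` on an open file handle.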

The fields in "supportingVariants" are as follows.

* "alleleId": The star allele associated with this variant
* "chrom": Chromosome
* "pos": Position
* "ref": Reference allele
* "alt": Alt alleles (comma separated)
* "gt": Genotype call for the variant
* "quality": Quality for the variant (the gq field from the vcf record)

Note that the fields other than alleleId correspond to the vcf record call for the variant.

The fields in the missingVariantSites are as follows.

* "id": an id for a missing or filtered variant site. For a missing variant site the format is `CHROM:REF:ALT`. For a filtered variant site the format is `CHROM:POS:REF:ALT:GT:GQ:DP:FILTER`. These fields correspond to the vcf record call for the filtered variant. The `ALT` and `FILTER` fields may contain comma separated alt alleles and filters.
* "alleleIds": star alleles that are associated with the missed or filtered variant sites.
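The filtered-site format can be split back into its components as sketched below. The `FilteredSite` type and `parse_filtered_site_id` function are hypothetical names introduced for illustration. The sketch assumes that no individual field contains a colon, which holds for the small variant example shown above but may not hold for every symbolic allele.

```python
from typing import List, NamedTuple


class FilteredSite(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    gt: str
    gq: str  # kept as text, since a vcf value may be "." rather than a number
    dp: str
    filters: List[str]


def parse_filtered_site_id(site_id: str) -> FilteredSite:
    """Split a CHROM:POS:REF:ALT:GT:GQ:DP:FILTER id into named fields."""
    parts = site_id.split(":")
    if len(parts) != 8:
        raise ValueError(f"expected 8 colon-separated fields, got {len(parts)}: {site_id!r}")
    chrom, pos, ref, alt, gt, gq, dp, flt = parts
    # ALT and FILTER may hold several comma separated values.
    return FilteredSite(chrom, int(pos), ref, alt.split(","), gt, gq, dp, flt.split(","))


site = parse_filtered_site_id("chr2:233759924:C:T,<NON_REF>:0/1:0:10:LowGQ")
print(site)
```

Ids for missing variant sites have fewer fields and are rejected by this function with a `ValueError`, so a caller can tell the two cases apart.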

Each star allele genotype contains one or two haplotypes (a haplotype for the chrM gene MT-RNR1 and, in male samples, the chrX gene G6PD, and a diplotype for all other genes) separated by a slash (e.g. `*1/*2`). Each haplotype is a pre-defined star allele, and the definitions are from the resources listed in the field "starAlleleDatabaseSources". Note that these resources are periodically updated by the agencies maintaining them, and may receive updates that are not yet covered by a specific version of our caller. When the Star Allele Caller cannot identify an optimal genotype for a gene, a no-call (`./.` or `.`) is made. In certain cases, more than one genotype is optimally satisfied. In that case, all satisfied genotypes are listed, separated by a semicolon (e.g. `*1/*2;*3/*4`).
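A minimal sketch of how a downstream script might split such a genotype string is shown below. The function name `parse_genotype` and the choice to represent a no-call as `None` are assumptions made for this example, not part of the caller.

```python
from typing import List, Optional


def parse_genotype(genotype: str) -> List[Optional[List[str]]]:
    """Split a genotype string into its candidate genotypes.

    Each candidate is a list of one or two haplotypes. A no-call
    ("./." or ".") is returned as None.
    """
    candidates: List[Optional[List[str]]] = []
    for candidate in genotype.split(";"):  # several optimal genotypes
        haplotypes = candidate.strip().split("/")  # haplotype or diplotype
        if all(h == "." for h in haplotypes):
            candidates.append(None)  # no-call
        else:
            candidates.append(haplotypes)
    return candidates


for text in ["*1/*2", "*1/*2;*3/*4", "*36/*80+*37", "./.", "."]:
    print(text, "->", parse_genotype(text))
```

Combined alleles such as `*80+*37` are kept as a single haplotype name, because only the slash and the semicolon are treated as separators.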

2. Tsv and json files (`<prefix>.star_allele.tsv` and `<prefix>.star_allele.json`, respectively) are produced when the Star Allele Caller is run stand-alone from a gvcf or vcf file or if the option `--targeted-enable-legacy-output` is set. The json file has the same format as `<prefix>.targeted.json` (shown above) while the tsv file contains summarized star allele calls for each gene. This is an example for one gene from the tsv output. The fields are gene name and genotype.

```
UGT1A1  *36/*80+*37
```
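The tsv output can be read into a gene-to-genotype mapping as sketched below. The function name is hypothetical, and the sketch assumes two whitespace-separated columns per line and no header row; neither detail is confirmed beyond the single example line above.

```python
import io
from typing import Dict, TextIO


def read_star_allele_tsv(handle: TextIO) -> Dict[str, str]:
    """Map each gene name to its genotype string."""
    calls: Dict[str, str] = {}
    for line in handle:
        fields = line.split()  # tolerate tabs or runs of spaces
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        calls[fields[0]] = fields[1]
    return calls


# In-memory stand-in for <prefix>.star_allele.tsv
example = io.StringIO("UGT1A1\t*36/*80+*37\n")
calls = read_star_allele_tsv(example)
print(calls)
```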
