Star Allele Caller

Overview

The Star Allele Caller identifies the genotypes and metabolism status of the following PGx genes, which are included in the FDA's PGx recommendations or have a CPIC Level A designation: CACNA1S, CFTR, CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, IFNL3, RYR1, NUDT15, SLCO1B1, TPMT, UGT1A1, VKORC1, DPYD, G6PD, MT-RNR1, BCHE, ABCG2, NAT2, F5 and UGT2B17. It finds the optimal genotypes for these genes based on the star allele definitions from the resources listed below. It calls metabolism status using a PharmCAT resource file that maps genotypes to phenotypes. The file is here. The Star Allele Caller primarily supports the human reference hg38, for which all of the genes above are supported. It also supports the following genes on the hg19 and GRCh37 references: CACNA1S, CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, IFNL3, NUDT15, SLCO1B1, VKORC1, DPYD, ABCG2, F5.

Star allele definition resources for hg38

For the genes CACNA1S, CFTR, CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, IFNL3, RYR1, NUDT15, SLCO1B1, TPMT, UGT1A1, VKORC1, DPYD, G6PD, MT-RNR1 and ABCG2, the allele definitions are sourced from PharmGKB and can be found here. For BCHE and NAT2, the alleles are sourced from this paper and this website, respectively. For UGT2B17, the star alleles are defined here. Because BCHE does not have defined star alleles, the Star Allele Caller instead checks whether a sample is positive for any of the variants reported in the paper.

Star allele definition resources for hg19/GRCh37

For the genes CYP2C19, CYP2C9, CYP3A4, CYP3A5, CYP4F2, NUDT15, SLCO1B1 and DPYD, the definitions are sourced from PharmVar and can be found here. For the remaining hg19/GRCh37 genes (ABCG2, CACNA1S, IFNL3, F5 and VKORC1), the allele definitions have been lifted over from their corresponding hg38 definitions (which are sourced from PharmGKB as noted above).

Functionality

The Star Allele Caller has the following features.

  • It calls star allele genotypes from several types of genomic data: FASTQ, BAM, gVCF and VCF.

  • It provides additional details about the genotype call, including a confidence score.

  • It assumes genotypes for missing positions to be ref - these positions are listed in the output.

  • It assumes filtered genotype calls to be ref - these records are also listed in the output.

  • If more than one diplotype is optimal, it lists them all.

  • It supports several versions of the human reference: hg38, hg19 and GRCh37.

  • For the genes UGT2B17 and CYP2C19, the caller analyzes CNV calls to detect star alleles.

Input files and command line examples

The Star Allele Caller accepts several forms of sequence data as input: FASTQ files, BAM/CRAM files or gVCF/VCF files.

If small variant VCF/gVCF and CNV-VCF files are used as input, they should meet the following specifications.

  • Must be aligned to the same human reference that is passed through the -r option.

  • Variants should follow a parsimonious, left-aligned variant representation.

  • Complex variants - for example, representing closely located, independent variants, in a single record - are NOT supported.
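The representation requirement above can be spot-checked before running the caller. The following is a minimal sketch and not part of DRAGEN: the function name is hypothetical, and it applies a commonly used allele-only test for a parsimonious, left-aligned record (the alleles must not all share their last base, and must not all share their first base unless the shortest allele is a single base). Symbolic alleles such as <NON_REF> are ignored.

```python
def is_parsimonious_left_aligned(ref, alts):
    """Allele-only check of one VCF record (hypothetical helper, not a DRAGEN API)."""
    # Symbolic alleles (<NON_REF>), '*' and '.' carry no sequence, so skip them.
    sequence_alts = [a.upper() for a in alts if not a.startswith(("<", "*", "."))]
    alleles = [ref.upper()] + sequence_alts
    if any(len(a) == 0 for a in alleles):
        return False  # empty alleles are not a valid representation
    if len(alleles) < 2:
        return True  # reference-only record, nothing to check
    # A shared last base means the record can be trimmed or shifted to the left.
    if len({a[-1] for a in alleles}) < 2:
        return False
    # A shared first base is only allowed as the single anchor base of an indel.
    if min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) < 2:
        return False
    return True


print(is_parsimonious_left_aligned("T", ["C", "<NON_REF>"]))  # SNV from a gVCF record
print(is_parsimonious_left_aligned("TC", ["TG"]))  # SNV padded with an extra base
```

Records that fail this check would typically be normalized with a dedicated tool before being passed to the caller.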

Note that VCF/gVCF files can also be supplied as compressed GZ files (i.e. <file_name>.vcf.gz or <file_name>.gvcf.gz).

To run the caller, the human reference must always be passed as a command line option. The Star Allele Caller detects the reference version (i.e., hg19, GRCh37 or hg38) and reads in the corresponding allele definitions.

The Star Allele Caller can be enabled in parallel with other components as part of a WGS germline analysis workflow using the option --enable-pgx (see DRAGEN Recipe - Germline WGS).

Command line with gVCF input

In the simplest case, the caller takes DRAGEN gVCF and DRAGEN CNV-VCF files as input. The following is an example of the command line for the basic use case.

dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/${HASH_TABLE_VERSION} \
--star-allele-gvcf /staging/test/data/NA12878.gvcf \
--star-allele-cnv-vcf /staging/test/data/NA12878.cnv.vcf.gz \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-star-allele true

Command line with VCF input

Unlike a variant-only VCF file, a DRAGEN gVCF file contains the genotypes for all positions in a genome. Although gVCF is the preferred format for the caller, it can also accept a standard variant-only VCF file as input. The command line for this case is the same as above, with the VCF file passed instead of a gVCF file. The CNV-VCF file is optional; if it is omitted, the Star Allele Caller does not call star alleles that are detected through CNV analysis. An example of this use case, with only a variant-only VCF file as input, is as follows.

dragen \
-r /staging/human/reference/hg38_alt_aware+cnv+hla+rna_v2/DRAGEN/${HASH_TABLE_VERSION} \
--star-allele-gvcf /staging/test/data/NA12878.vcf \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-star-allele true
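When many samples are processed, the two stand-alone command lines above can be assembled programmatically. The sketch below only builds the argument list from the options shown in this section. The function name and the example paths are illustrative assumptions, and nothing is executed.

```python
def build_star_allele_command(reference_dir, small_variant_vcf, output_dir,
                              prefix, cnv_vcf=None):
    """Build a stand-alone Star Allele Caller command (hypothetical helper)."""
    cmd = [
        "dragen",
        "-r", reference_dir,
        # The same option takes either a gVCF or a variant-only VCF file.
        "--star-allele-gvcf", small_variant_vcf,
    ]
    if cnv_vcf is not None:
        # Optional: without it, CNV-based star alleles are not called.
        cmd += ["--star-allele-cnv-vcf", cnv_vcf]
    cmd += [
        "--output-directory", output_dir,
        "--output-file-prefix", prefix,
        "--enable-star-allele", "true",
    ]
    return cmd


# Placeholder reference path; substitute the hash table directory in use.
command = build_star_allele_command(
    "/path/to/reference/hash_table",
    "/staging/test/data/NA12878.vcf",
    "/staging/test/output",
    "NA12878_dragen",
)
print(command)
```

The resulting list can be passed to subprocess.run(command, check=True) on a system where dragen is installed.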

Command line with BAM input

To run the Star Allele Caller from a BAM input, the variant caller must also be enabled. Enabling the CNV caller is optional but recommended, because it is needed for analyzing CNV star alleles. An example of the command line for this use case is as follows.

dragen \
-r /staging/human/reference/hg38_alt_aware/DRAGEN/${HASH_TABLE_VERSION} \
--bam-input /staging/test/data/NA12878.bam \
--output-directory /staging/test/output \
--output-file-prefix NA12878_dragen \
--enable-map-align false \
--enable-star-allele true \
--enable-variant-caller true \
--vc-emit-ref-confidence gvcf \
--enable-cnv true \
--cnv-enable-self-normalization true

Note that the Star Allele Caller supports the force genotyping option of the variant caller (set by --vc-forcegt-vcf), but other variant caller options, such as combining phased variants (set using --vc-combine-phased-variants-distance), are NOT supported at this time.

Command line with FASTQ input

If a FASTQ file is used as input, the additional options --RGID and --RGSM need to be set in the command line. An example of the command line for this use case is as follows.

dragen \
-r /staging/human/reference/hg38_alt_aware+cnv+hla+rna_v2/DRAGEN/${HASH_TABLE_VERSION} \
-1 /scratch/NA11829.fq1.gz \
-2 /scratch/NA11829.fq2.gz \
--RGID DRAGEN_RGID \
--RGSM DRAGEN_RGSM \
--enable-map-align true \
--output-directory /staging/test/output \
--output-file-prefix NA11829 \
--enable-star-allele true \
--enable-variant-caller true \
--vc-emit-ref-confidence gvcf \
--enable-cnv true \
--cnv-enable-self-normalization true

Output files

Following completion of the DRAGEN Star Allele Caller run, the following output files are produced.

  1. When the Star Allele Caller is run with small variant calling, or directly from genome VCF input, the main output file, <prefix>.targeted.json, contains the complete and detailed results for all genes. This is an excerpt of the output showing two genes, CYP3A5 and F5, for one sample, HG00236.

{
  "genomeBuild": "hg38",
  "softwareVersion": "dragen v4.4.0-52-g09190b26",
  "sampleId": "HG00236",
  "phenotypeDatabaseSources": [
    "PharmCAT Phenotypes Version: Snapshot-2022.09.15"
  ],
  "starAlleleDatabaseSources": [
    "PharmGKB Database Version: Snapshot-2022.01.01",
    "PharmGKB Database Version: Snapshot-2022.03.01",
    "UGT Nomenclature Committee Version: Snapshot-01.01.2023",
    "Zhu et al. 2020, PMID: 33061533"
  ],
  "locusAnnotations": [
    {
      "gene": "CYP3A5",
      "geneId": "HGNC:2638",
      "starAlleleDatabaseSource": "PharmGKB Database Version: Snapshot-2022.01.01",
      "genotype": "*3/*3",
      "genotypeQuality": 43,
      "phenotypeDatabaseAnnotation": "Poor Metabolizer",
      "supportingVariants": [
        {
          "alleleId": "*3",
          "chrom": "chr7",
          "pos": 99672916,
          "ref": "T",
          "alt": "C,<NON_REF>",
          "gt": "1/1",
          "quality": 43
        }
      ],
      "variantStarAllelesFound": "*3",
      "missingVariantSites": []
    },
    {
      "gene": "F5",
      "geneId": "HGNC:3542",
      "starAlleleDatabaseSource": "PharmGKB Database Version: Snapshot-2022.01.01",
      "genotype": "rs6025reference(C)/rs6025reference(C)",
      "genotypeQuality": 0,
      "phenotypeDatabaseAnnotation": null,
      "supportingVariants": [],
      "variantStarAllelesFound": "",
      "missingVariantSites": [
        {
          "id": "169549811:C:T",
          "alleleIds": "rs6025variant(T)"
        }
      ]
    },

The fields in the JSON file are as follows.

  • "genomeBuild": Reference version being used

  • "softwareVersion": Version of DRAGEN being run

  • "sampleId": Sample name

  • "phenotypeDatabaseSources": Resources used for calling metabolism status (phenotype)

  • "starAlleleDatabaseSources": Resources used for identifying star alleles (genotype)

  • "locusAnnotations": List of star allele caller results, one for each gene

  • "gene": Gene name

  • "geneId": Stable HGNC or Ensembl ID of the gene

  • "starAlleleDatabaseSource": Resource for the star allele definitions file

  • "genotype": The detected star allele diplotype (or haplotype for haploid gene)

  • "genotypeQuality": Phred scaled quality score for the genotype

  • "phenotypeDatabaseAnnotation": Metabolism status corresponding to the genotype called

  • "supportingVariants": List of found variants that satisfy the called star alleles. The alleleId field denotes the name of the star allele. Each non-ref star allele has a list of supporting variants, with variant details taken from the small variant VCF file; the quality field denotes the GQ field from the VCF record.

  • "missingVariantSites": List of relevant gene sites for which VCF records are missing or filtered

  • "variantStarAllelesFound": List of star allele haplotypes that are satisfied by the found variants
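The fields above can be read with any JSON parser. As a minimal sketch (the helper name is illustrative, and the inline sample reuses values from the example output), the following collects one summary row per gene. It also converts genotypeQuality to an error probability, assuming the standard Phred relation P = 10^(-Q/10).

```python
import json


def summarize_loci(report):
    """Return one summary dict per entry in locusAnnotations (hypothetical helper)."""
    rows = []
    for locus in report.get("locusAnnotations", []):
        quality = locus.get("genotypeQuality")
        rows.append({
            "gene": locus["gene"],
            "genotype": locus["genotype"],
            "phenotype": locus.get("phenotypeDatabaseAnnotation"),
            # Assumed standard Phred scaling: Q = -10 * log10(P_error).
            "error_probability": None if quality is None else 10 ** (-quality / 10),
            "missing_sites": len(locus.get("missingVariantSites", [])),
        })
    return rows


# Trimmed sample; a real run would load <prefix>.targeted.json with json.load().
sample_text = """
{
  "genomeBuild": "hg38",
  "sampleId": "HG00236",
  "locusAnnotations": [
    {"gene": "CYP3A5", "genotype": "*3/*3", "genotypeQuality": 43,
     "phenotypeDatabaseAnnotation": "Poor Metabolizer",
     "missingVariantSites": []},
    {"gene": "F5", "genotype": "rs6025reference(C)/rs6025reference(C)",
     "genotypeQuality": 0, "phenotypeDatabaseAnnotation": null,
     "missingVariantSites": [{"id": "169549811:C:T",
                              "alleleIds": "rs6025variant(T)"}]}
  ]
}
"""
report = json.loads(sample_text)
rows = summarize_loci(report)
for row in rows:
    print(row["gene"], row["genotype"], row["phenotype"], row["missing_sites"])
```

Checking missing_sites alongside the genotype is useful because missing or filtered positions are assumed to be ref, as in the F5 entry of the sample.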

Each star allele genotype contains one or two haplotypes separated by a slash (e.g. *1/*2): a single haplotype for the chrM gene MT-RNR1 and, in male samples, for the chrX gene G6PD, and a diplotype for all other genes. Each haplotype is a predefined star allele, and the definitions can be found under the allele definitions URL. Note that star allele definitions and notations may vary depending on the resource and when it was last updated. When the Star Allele Caller cannot identify an optimal genotype for a gene, it makes a no-call (./. or .). In certain cases more than one genotype is optimally satisfied; all satisfied genotypes are then listed, separated by a semicolon (e.g. *1/*2;*3/*4).
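The genotype notation described here can be split mechanically. The following is a minimal sketch with hypothetical function names. It assumes that the semicolon and slash characters appear only as separators and never inside an allele name.

```python
def is_no_call(genotype):
    """True for the no-call notations ./. (diploid) and . (haploid)."""
    return genotype.strip() in {"./.", "."}


def parse_genotype(genotype):
    """Split a genotype string into candidate genotypes, each a list of haplotypes."""
    if is_no_call(genotype):
        return []
    candidates = []
    # Equally optimal genotypes are separated by semicolons.
    for candidate in genotype.split(";"):
        # Haplotypes within one genotype are separated by a slash.
        haplotypes = [h.strip() for h in candidate.split("/")]
        candidates.append(haplotypes)
    return candidates


print(parse_genotype("*1/*2;*3/*4"))
print(parse_genotype("*3/*3"))
print(parse_genotype("./."))
```

A result with more than one candidate signals an ambiguous call that downstream code may want to handle separately from a unique genotype.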

  2. TSV and JSON files (<prefix>.star_allele.tsv and <prefix>.star_allele.json, respectively) are produced when the Star Allele Caller is run stand-alone from a gVCF or VCF file, or if the option --targeted-enable-legacy-output is set. The JSON file has the same format as <prefix>.targeted.json (shown above), while the TSV file contains summarized star allele calls for each gene. This is an example for one gene from the TSV output. The fields are gene name and genotype.

UGT1A1  *36/*80+*37
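The TSV summary can be loaded into a simple gene-to-genotype mapping. The sketch below uses a hypothetical function name and assumes that each data line holds two tab-separated fields (gene name, then genotype). Blank lines and lines starting with # are skipped as a precaution; whether the real file contains such lines is not specified here.

```python
import io


def read_star_allele_tsv(handle):
    """Map gene name to genotype from a <prefix>.star_allele.tsv-style stream."""
    calls = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank or comment lines (assumed, may not occur)
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not a gene/genotype pair
        calls[fields[0].strip()] = fields[1].strip()
    return calls


# In-memory stand-in for the file, using the example line shown above.
example = io.StringIO("UGT1A1\t*36/*80+*37\n")
calls = read_star_allele_tsv(example)
print(calls)
```

For a real run, the same function can be given an open file handle for <prefix>.star_allele.tsv.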
