# CheckFingerprint

CheckFingerprint identifies whether two or more sequencing datasets originate from the same individual. It supports multiple comparison modes depending on the available inputs and the desired trade-off between **statistical rigor**, **runtime**, and **scalability**.

LOD-based CheckFingerprint modes are broadly based on Picard CheckFingerprint and report logarithmic odds (LOD) scores to assess sample identity using probabilistic genotype modeling and haplotype blocks.

Pairwise pileup comparison mode is a new, experimental method that reports a simple **MatchRate** based on direct genotype concordance. This mode is intended for rapid screening and large-scale comparisons.

A positive LOD score or a high MatchRate indicates that samples are likely derived from the same individual.

***

## CheckFingerprint Modes (Summary)

CheckFingerprint supports two types of identity metrics:

* **LOD score** (logarithmic odds) quantifies how much more likely two samples are to come from the same person than from different people. A positive score indicates a likely match, with higher values indicating stronger evidence.
* **Match rate** is a simpler 0–1 concordance score based on direct genotype comparison, where values above 0.90–0.95 indicate a likely match.

| Mode                             | What you need                                                                                                | Metric     | Runtime   | When to use                                                                            |
| -------------------------------- | ------------------------------------------------------------------------------------------------------------ | ---------- | --------- | -------------------------------------------------------------------------------------- |
| From reads (generate VCF)        | BAM/CRAM + expected genotype VCF. Requires the DRAGEN germline small variant caller to be enabled.           | LOD score  | Medium    | WGS or other large datasets; best general-purpose option                               |
| Precomputed VCF                  | One or more observed genotype VCFs + expected genotype VCF (no BAM needed). Small variant caller is skipped. | LOD score  | Fast      | Germline VCFs are already available                                                    |
| Pairwise pileup *(experimental)* | Pileup files generated during DRAGEN contamination detection                                                 | Match rate | Very fast | Batch screening across many samples; requires contamination detection to have been run |

***

## Processing Flow

### LOD-Based Modes (On-the-fly VCF / Precomputed Germline VCF)

In LOD-based modes, CheckFingerprint uses reference-specific haplotype map files (`*.map`), bundled with DRAGEN and automatically selected based on the reference, to define curated SNPs grouped into haplotype (linkage disequilibrium) blocks. Genotype likelihoods are estimated from VCF PL values, and evidence is aggregated at the haplotype level to avoid over-counting correlated variants. A logarithmic odds (LOD) score is then computed to quantify how much more likely the samples originate from the same individual than from different individuals.

**Processing steps:**

1. Select SNPs from reference-specific haplotype maps
2. Estimate genotype likelihoods from VCF PL values
3. Aggregate evidence across haplotype blocks
4. Compute LOD scores for sample pairs

LOD-based modes provide a statistically rigorous identity assessment and are recommended for final confirmation.
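
Step 2 relies on the VCF `PL` field, which stores phred-scaled genotype likelihoods (PL = -10 × log10 of the likelihood, scaled so that the most likely genotype has PL 0). As a rough illustration only, and not a description of DRAGEN's internal implementation, the following sketch converts one comma-separated PL value into normalized likelihoods. The function name `pl_to_likelihoods` is made up for this example.

```bash
# Hypothetical helper: convert a VCF PL string such as "0,10,20"
# (hom-ref, het, hom-alt) into likelihoods that sum to 1.
pl_to_likelihoods() {
  printf '%s\n' "$1" | awk -F',' '{
    total = 0
    for (i = 1; i <= NF; i++) {
      lik[i] = 10 ^ (-$i / 10)      # undo the phred scaling
      total += lik[i]
    }
    for (i = 1; i <= NF; i++)
      printf "%s%.6f", (i > 1 ? " " : ""), lik[i] / total
    printf "\n"
  }'
}

pl_to_likelihoods "0,10,20"
```

A PL of `0,10,20` therefore favors the first genotype by roughly a factor of 10 over the second and 100 over the third.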

***

### Pairwise Pileup Mode (Experimental)

In pairwise pileup mode, CheckFingerprint performs a fast, direct comparison of genotypes using pileup files, without haplotype modeling or probabilistic inference. This mode is optimized for rapid screening and large-scale, multi-sample comparisons.

**Processing steps:**

1. Load pileup files for all input samples
2. Select overlapping marker sites across samples
3. Apply minimum depth and heterozygosity filters
4. Exclude uninformative sites (e.g. homozygous reference in both samples)
5. Compare genotypes at remaining sites
6. Compute a MatchRate for all pairwise sample comparisons
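
The steps above can be sketched in a few lines of shell and awk. This is a simplified illustration under stated assumptions, not the DRAGEN implementation: the input is a hypothetical tab-separated table (site, depth and alternate-allele fraction for sample A, depth and alternate-allele fraction for sample B) rather than the actual pileup format, the function name `match_rate` is invented, alternate-allele fractions below or above the heterozygous window are assumed to mean homozygous reference or homozygous alternate, and the minimum-passing-sites check is left out.

```bash
# match_rate MIN_DEPTH HET_WIDTH < sites.tsv
# Hypothetical input columns: site, depthA, altAF_A, depthB, altAF_B
match_rate() {
  awk -F'\t' -v min_depth="$1" -v het_width="$2" '
    # Classify a genotype from the alternate-allele fraction (assumed rule).
    function genotype(af,   lo, hi) {
      lo = 0.5 - het_width / 2
      hi = 0.5 + het_width / 2
      if (af < lo) return "HOM_REF"
      if (af > hi) return "HOM_ALT"
      return "HET"
    }
    {
      if ($2 < min_depth || $4 < min_depth) next     # depth filter
      a = genotype($3); b = genotype($5)
      if (a == "HOM_REF" && b == "HOM_REF") next     # uninformative site
      if (a == b) n_match++; else n_mismatch++
    }
    END {
      if (n_match + n_mismatch == 0) print "NA"
      else printf "%.3f\n", n_match / (n_match + n_mismatch)
    }'
}
```

With a minimum depth of 10 and a heterozygous width of 0.5, `match_rate 10 0.5 < sites.tsv` prints a single value between 0 and 1, or `NA` when no informative site remains.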

***

## Interpretation of Results

### LOD-Based Modes

* **LOD > 0**: samples likely from the same individual
* **LOD < 0**: samples likely from different individuals
* **LOD ≈ 0**: inconclusive (often due to low coverage)

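LOD scores are reported on a base-10 logarithmic scale. For example, a LOD of 4 means the observed data are 10,000× more likely if the samples come from the same individual than if they come from different individuals; a LOD of -4 means the reverse.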
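
In symbols, writing D for the observed genotype data (notation chosen here for illustration), the score and the example above read:

$$
\mathrm{LOD} = \log_{10} \frac{P(D \mid \text{same individual})}{P(D \mid \text{different individuals})},
\qquad
\mathrm{LOD} = 4 \;\Rightarrow\; \frac{P(D \mid \text{same individual})}{P(D \mid \text{different individuals})} = 10^{4} = 10{,}000
$$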

***

### Pairwise Pileup Mode

* **MatchRate ≥ 0.90–0.95**: samples likely from the same individual
* **Lower MatchRate**: samples likely from different individuals
* **MatchRate = NA**: insufficient overlapping informative sites

MatchRate is intended for screening and triage, not formal identity confirmation.

***

## Command-Line Options

### \[Required]

* `--enable-checkfingerprint true`

***

### \[Required for LOD-Based Modes]

* `--checkfingerprint-expected-vcf <expected.vcf>`

The expected VCF may contain one or multiple samples. The input sample is compared independently against each expected sample.

***

### \[Mode Selection Options]

| Option                                            | Description                                                                |
| ------------------------------------------------- | -------------------------------------------------------------------------- |
| `--checkfingerprint-enable-vcf-comparison true`   | Enable VCF comparison mode (required for both precomputed and on-the-fly modes) |
| `--checkfingerprint-observed-vcf <vcf>`           | Enable precomputed VCF comparison mode                                     |
| `--checkfingerprint-pairwise-read-files <pileup>` | Enable pairwise pileup mode (repeatable)                                   |

***

### \[Optional – Advanced (LOD-Based Modes)]

* `--checkfingerprint-haplotype-map <map_file>`\
  Specify a custom haplotype map file. By default, DRAGEN automatically selects a reference-specific haplotype map bundled with the software.

***

### \[Pairwise Pileup Mode Settings]

| Setting                                         | Description                                                                              | Default |
| ----------------------------------------------- | ---------------------------------------------------------------------------------------- | ------- |
| `--checkfingerprint-pairwise-min-depth`         | Minimum depth required at a locus                                                        | 10      |
| `--checkfingerprint-pairwise-het-width`         | Total width of the allele-frequency (AF) window centered on 0.5 that classifies a site as heterozygous (e.g. 0.5 → AF 0.25–0.75) | 0.5     |
| `--checkfingerprint-pairwise-min-passing-sites` | Minimum overlapping passing sites required to compute MatchRate                          | 500     |

***

### \[Tumor-Aware Settings – LOD Modes]

| Setting                                      | Description                                                   | Default |
| -------------------------------------------- | ------------------------------------------------------------- | ------- |
| `--checkfingerprint-enable-tumor-aware true` | Enable tumor-aware LOD computation                            |         |
| `--checkfingerprint-loss-of-het-rate`        | Rate at which heterozygous sites become homozygous due to loss of heterozygosity (LOH) | 0.5     |

***

## Command-Line Examples

### On-the-fly VCF Comparison Mode

**Most applicable for:** Whole-genome sequencing (WGS) datasets (≈30× coverage) and general-purpose identity checking.

```bash
dragen -r <ref_dir> -b <input.bam> \
  --output-directory <outdir> \
  --output-file-prefix sample \
  --enable-checkfingerprint true \
  --checkfingerprint-expected-vcf expected.vcf \
  --checkfingerprint-enable-vcf-comparison true \
  --enable-variant-caller true
```

***

### Standalone VCF Comparison Mode

**Most applicable for:** VCF-only workflows where both observed and expected VCFs are already available.

```bash
dragen -r <ref_dir> \
  --output-directory <outdir> \
  --output-file-prefix sample \
  --enable-checkfingerprint true \
  --checkfingerprint-expected-vcf expected.vcf \
  --checkfingerprint-observed-vcf observed.vcf \
  --checkfingerprint-enable-vcf-comparison true
```

***

### Pairwise Pileup Comparison Mode (Experimental)

**Most applicable for:** Rapid batch-level screening of many samples (e.g. WGS runs), duplicate detection, and large-scale identity sanity checks.

```bash
dragen -r <ref_dir> \
  --enable-checkfingerprint true \
  --checkfingerprint-pairwise-read-files sampleA.pileup.txt \
  --checkfingerprint-pairwise-read-files sampleB.pileup.txt \
  --checkfingerprint-pairwise-read-files sampleC.pileup.txt \
  --checkfingerprint-pairwise-min-depth 20 \
  --checkfingerprint-pairwise-het-width 0.5 \
  --output-directory <outdir> \
  --output-file-prefix batch
```

Pileup files can be generated by:

* DRAGEN contamination detection during the map-align step (`--qc-detect-contamination true`)
* External tools such as `samtools mpileup`
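
For larger batches, typing one `--checkfingerprint-pairwise-read-files` option per sample becomes tedious. The sketch below shows one possible way to assemble the repeated option from a list of pileup paths. The helper name `build_pairwise_args` and the `pileups/` directory in the comments are hypothetical; only options already shown in the example above are used.

```bash
# Hypothetical helper: emit one option/value pair per pileup file,
# one token per line, so the result can be read into an array.
build_pairwise_args() {
  for f in "$@"; do
    printf '%s\n%s\n' '--checkfingerprint-pairwise-read-files' "$f"
  done
}

build_pairwise_args sampleA.pileup.txt sampleB.pileup.txt sampleC.pileup.txt

# In bash, the output can be collected and appended to the dragen call:
#   mapfile -t pairwise_args < <(build_pairwise_args pileups/*.pileup.txt)
#   dragen -r <ref_dir> --enable-checkfingerprint true "${pairwise_args[@]}" \
#     --output-directory <outdir> --output-file-prefix batch
```

Emitting one token per line keeps file names with spaces intact when the output is read with `mapfile -t`.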

***

## Outputs

### LOD-Based Modes

* `<prefix>.CheckFingerprint.summary.txt`
* `<prefix>.CheckFingerprint.detail.txt`

***

### Pairwise Pileup Mode

* `<prefix>.CheckFingerprint.pairwise.csv`

The CSV file contains all pairwise sample comparisons, sorted by MatchRate (highest to lowest):

| Column               | Description                                      |
| -------------------- | ------------------------------------------------ |
| SampleA / SampleB    | Input pileup file names                          |
| OverlappingSites     | Total shared loci                                |
| PassingSites         | Loci passing depth and genotype filters          |
| UninformativeSites   | Loci where both samples are homozygous reference |
| MatchingGenotypes    | Matching genotype calls                          |
| MismatchingGenotypes | Mismatching genotype calls                       |
| MatchRate            | Matching / (Matching + Mismatching), or NA       |

If `PassingSites` is lower than the value of `--checkfingerprint-pairwise-min-passing-sites`, MatchRate is reported as `NA`.
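
For triage, the pairwise CSV can be filtered for likely matches with standard tools. The following is a minimal sketch that assumes the file is comma-separated and starts with a header row containing the column names `SampleA`, `SampleB`, and `MatchRate` exactly as listed in the table above; the function name `likely_matches` is made up for this example.

```bash
# likely_matches THRESHOLD < <prefix>.CheckFingerprint.pairwise.csv
# Prints SampleA, SampleB and MatchRate for pairs at or above THRESHOLD.
likely_matches() {
  awk -F',' -v thr="$1" '
    NR == 1 {
      # Locate the needed columns by header name (assumed names).
      for (i = 1; i <= NF; i++) col[$i] = i
      a = col["SampleA"]; b = col["SampleB"]; mr = col["MatchRate"]
      next
    }
    # Skip pairs without a score, then apply the threshold numerically.
    $mr != "NA" && ($mr + 0) >= (thr + 0) { print $a, $b, $mr }'
}

# Example call (file name follows the "batch" prefix used earlier):
#   likely_matches 0.90 < batch.CheckFingerprint.pairwise.csv
```

Pairs reported as `NA` are skipped rather than treated as mismatches, since `NA` only signals too few passing sites.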

***

## Limitations

**Pairwise pileup mode:**

* Experimental; intended for rapid screening
* Non-probabilistic and haplotype-free
* Less sensitive for low-coverage or targeted panels

**LOD-based modes:**

* Tumor-aware LOD assumes loss of heterozygosity (LOH) at the rate set by `--checkfingerprint-loss-of-het-rate`
* Observed and expected VCFs should originate from the same pipeline
* Compatible only with DRAGEN germline and tumor-only pipelines
