Single-Cell Multiomics

The DRAGEN Single-Cell Multiomics (scRNA + scATAC) Pipeline can process data sets from single-cell RNA-Seq and ATAC-Seq reads to a cell-by-feature count matrix.

The pipeline is compatible with library designs that have:

  • For single-cell RNA: one read in a fragment match to a transcript and the other containing a cell-barcode and UMI.

  • For single-cell ATAC: two reads matching to the genome and the third one containing cell-barcode.

The pipeline includes the following functions:

  • Alignment

    • RNA-Seq (splice-aware) alignment and matching to annotated genes for the transcript reads.

    • ATAC-Seq alignment

  • Cell-barcode (both RNA- and ATAC-Seq) and UMI (RNA-Seq only) error correction for the barcode read.

  • Read counting

    • UMI counting per cell and gene to measure gene expression.

    • Fragment counting per cell and peak to measure chromatin accessibility.

  • Sparse matrix output and QC metrics.

Input Files

Alignment Reference

A standard DRAGEN reference hashtable with both DNA and RNA capability is required for the Single-Cell Multiomics Pipeline. Building a reference hashtable using --ht-build-rna-hashtable=true should satisfy this requirement. See the section Prepare a Reference Genome for more details.

The pipeline also requires a gene annotation file in GTF format, provided with the with the --annotation (-a) option.

Read Input

Since the multiomics workflow assumes at least one pair of single-cell RNA FASTQ files and one triple of single-cell ATAC FASTQ files, the only method to provide read input to DRAGEN is through the fastq-list file mechanism. A multiomics fastq-list file is a CSV file with the following mandatory columns:

ColumnDescription

Lane

Sequencing lane

RGID

Read group ID

RGSM

Read group sample

RGLB

Read group library

Read1File

Read 1 FASTQ file

Read2File

Read 2 FASTQ file

UmiFile

FASTQ file with cell-barcodes and/or UMIs

InputGroup

Input group: either scATAC or scRNA (not case sensitive)

Entries in a fastq-list file corresponding to single-cell RNA analysis must have read 2 FASTQ files in the UmiFile column and the Read2File column entries must be left empty. An example is shown below:

Lane,RGID,RGSM,RGLB,Read1File,Read2File,UmiFile,InputGroup
1,rna,sample1,illumina,scRNA.R1.fastq.gz,,scRNA.R2.fastq.gz,SCRNA
2,atac,sample1,illumina,scATAC.R1.fastq.gz,scATAC.R2.fastq.gz,scATAC.R3.fastq.gz,SCATAC

DRAGEN Single-cell Settings

To use the multiomics workflow, enter --enable-rna=true --enable-single-cell-rna=true --enable-single-cell-atac=true.

Barcode Position

By default the multiomics workflow assumes that the overall barcode/UMI sequence is made up of a single-cell barcode (possibly split into multiple blocks) and a single UMI. Enter the following command to identify the location of the single-cell barcode and single-cell UMI in the barcode read:

--scrna-barcode-position <blockPos>[+<blockPos>+<blockPos>...][(:-)|(:+)] --scrna-umi-position <blockPos> --scatac-barcode-position <blockPos>[+<blockPos>+<blockPos>...][(:-)|(:+)]

For more details, please consult with the corresponding section of scRNA and scATAC workflows' documentation.

Known Barcode Lists

Barcode lists are mandatory for both scRNA and scATAC reads. You need to provide the list of cell barcode sequences using the following command:

--scrna-barcode-sequence-list </path/to/scRNAbarcodeAllowlist.txt> --scatac-barcode-sequence-list </path/to/scATACbarcodeAllowlist.txt>

The files must contain one possible cell barcode sequence per line. You can compress the file with gzip (*.txt.gz). During cell-barcode error correction any observed barcodes that do not match a sequence specified in the file are considered errors. If possible, the barcodes are corrected to a similar allowed sequence. If the barcodes cannot be corrected, they are filtered out.

Cell Filtering

For each individual modality (either scRNA or scATAC), DRAGEN uses a threshold on the total count of unique UMIs (or reads) per cell barcode, to determine which barcodes are likely to correspond to single-cells in the original sample, instead of background noise. The threshold is determined based on the distribution of counts along barcodes and on the expected number of true cells in the sample. For more information, see the corresponding section in scATAC/scRNA documentation.

After count thresholds in each individual modality is computed, DRAGEN performs a joint cell filtering step. Each cell barcode is represented in a 2-D space with coordinates computed as the total UMI count across genes and the total fragment count across peaks. Initially, a cell-barcode is considered as passing the joint filter if it is passing the filter in each individual modality. DRAGEN then groups all cell barcodes in two categories: those passing both individual modality filters and the rest of cell barcodes. A k-means algorithm with 2 clusters is run and the filtering status of each cell barcode is refined.

Command-line Example

The following is an example command line to run the DRAGEN Single Cell Multiomics Pipeline.


dragen \
--enable-rna=true \
--enable-single-cell-atac=true \
-r hg19.fa.default \
--ht-alt-aware-validate=false \
--fastq-list=multiomics_fastq_list.csv \
--umi-source=umifile \
--output-dir=multiomics_output \
--output-file-prefix=sample1 \
--scatac-barcode-position=0_15 \
--scrna-barcode-position=0_15 \
--scrna-umi-position=16_25 \
--single-cell-threshold=ratio \
--enable-single-cell-rna=true \
-a gencode.v32.primary_assembly.annotation.filtered.gtf \
--scrna-barcode-sequence-whitelist=737K-arc-v1-gex.gz \
--scatac-barcode-sequence-whitelist=737K-arc-v1-atac.gz

Outputs

Single-cell Multiomics outputs are found in the standard DRAGEN output location using the prefix <sample>. in case of a single library and the prefix <sample>.<libId>. in case of multiple libraries. All single-cell Multiomics output files contain word multiomics in their names.

Counts

The following three files provide information per-cell feature count level in matrix market (*.mtx) format:

  • <prefix>.multiomics.matrix.mtx.gz

    • Count of unique UMIs or fragments for each cell/feature pair in sparse matrix format.

  • <prefix>.multiomics.barcodes.tsv.gz

    • Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.

  • <prefix>.multiomics.features.tsv.gz

    • Feature name and ID for each feature in the matrix.

The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.multiomics.barcodeSummary.tsv indicated by values PASS and FAIL.

The output includes filtered matrix files which only include the per-cell feature count level for the filtered cells in matrix market (*.mtx) format. The multiomics.features.tsv.gz file is common for the unfiltered and filtered matrices:

  • <prefix>.multiomics.filtered.matrix.mtx.gz

    • Count of unique UMIs for each filtered cell/feature pair in sparse matrix format.

  • <prefix>.multiomics.filtered.barcodes.tsv.gz

    • Cell-barcode sequence for each filtered cell from the matrix.

Loading output in a dense matrix

Some users might want to explore the output matrix in a human-readable format. To do so, a possible way would be to load the matrix in a "dense" dataframe in python (similar methodologies can be used in alternative programming languages). It is important to remember, however, that when possible a "sparse" representation of the matrix is preferable, due to the significant usage of memory and disk space of "dense" matrices. Several tools are available to work efficiently with "sparse" representations of single cell matrices (e.g., scanpy in python).

The matrix can be converted into a "dense" representation through two python modules: scanpy and pandas. This has been tested with python 3.10.0, scanpy 1.9.3, pandas 1.5.3.

First, it is necessary to install the required libraries:

> pip install -U scanpy pandas

Within python, the matrix can be loaded in "dense" representation using the following commands:

# import libraries
import pandas as pd
import scanpy as sc

# define path to input files
matrix_path = "path/to/matrix.mtx.gz"
features_path = "path/to/features.tsv.gz"
barcodes_path = "path/to/barcodes.tsv.gz"

# load matrix through scanpy
adata = sc.read_mtx(matrix_path).T
adata.var_names = pd.read_csv(features_path, sep="\t", header=None, compression="gzip")[1]
adata.obs_names = pd.read_csv(barcodes_path, sep="\t", header=None, compression="gzip")[0]

# convert scanpy internal format (AnnData) to dense pandas DataFrame
df = pd.DataFrame(adata.X.todense(), index=adata.obs_names, columns=adata.var_names)

# save it as CSV file
df.to_csv("output_matrix.csv")

The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.

Alignments

DRAGEN Single-Cell Multiomics outputs two BAM files sorted by coordinate - one with suffix scRNA.bam and one with suffix scATAC.bam. For more details, please consult the corresponding section from scATAC/scRNA user guide.

When running in multiomics mode, since the DRAGEN aligner processes both RNA and ATAC reads, there will be separate mapper metrics summary for each modality. This is identified via the RGID field as rna or atac. There is no common map/align metrics summary for all input reads since it would merge the two modalities. The <prefix>.mapping_metrics.csv file will also reflect this.

Below is an example snippet of the alignment metrics printed at the end of a DRAGEN Single-Cell Multiomics run.

MAPPING/ALIGNING PER RG   rna   Total reads in RG                                        52319680       100.00
MAPPING/ALIGNING PER RG   rna   Number of duplicate marked reads                         0              0.00
MAPPING/ALIGNING PER RG   rna   Number of duplicate marked and mate reads removed        NA
MAPPING/ALIGNING PER RG   rna   Number of unique reads (excl. duplicate marked reads)    52319680       100.00
...

MAPPING/ALIGNING PER RG   atac  Total reads in RG                                        203484254      100.00
MAPPING/ALIGNING PER RG   atac  Number of duplicate marked reads                         0              0.00
MAPPING/ALIGNING PER RG   atac  Number of duplicate marked and mate reads removed        NA
MAPPING/ALIGNING PER RG   atac  Number of unique reads (excl. duplicate marked reads)    203484254      100.00
...

Overall Metrics

The <prefix>.multiomics_metrics.csv file contains per sample scATAC and scRNA metrics. For more details, please consult with the corresponding section from scATAC/scRNA user guide.

Here is an example of how a <prefix>.multiomics_metrics.csv file can look like:

SINGLE-CELL ATAC METRICS,lib1,Invalid barcode fragments,0
SINGLE-CELL ATAC METRICS,lib1,Error free cell-barcode,308656
SINGLE-CELL ATAC METRICS,lib1,Error corrected cell-barcode,251364
...
SINGLE-CELL RNA METRICS,lib1,Median genes per cell,1456
SINGLE-CELL RNA METRICS,lib1,Total genes detected,14128
SINGLE-CELL METRICS,lib1,Passing cells,1920

Per-cell Metrics

The <prefix>.multiomics.barcodeSummary.tsv contains summary statistics for each unique cell-barcode per cell after error correction. Here is an example of how a <prefix>.multiomics.barcodeSummary.tsv file can look like:

ID  Barcode           BarcodeScAtac     BarcodeScRna      UniqueFragments  TotalFragments  PeakFragments  NonPrimaryContigFragments  ChimericFragments  LowMapqFragments  MitochondrialFragments  Peaks  TotalReads  GeneReads  UMIs  Genes  MitochondrialReads  Filter
1   AAACAAGCAAACAAAG  AAACAAGCAAACAAAG  GGCTAGTGTTCGGTAA  22               48              13             17                         0                  2                 7                       9      59          0          0     0      0                   LOW
2   AAACAAGCAAACCAGC  AAACAAGCAAACCAGC  GGCTAGTGTACTGAAT  7                7               3              0                          0                  0                 0                       3      8           0          0     0      0                   LOW

Last updated