Single-Cell Multiomics
Last updated
Last updated
The DRAGEN Single-Cell Multiomics (scRNA + scATAC) Pipeline can process data sets from single-cell RNA-Seq and ATAC-Seq reads to a cell-by-feature count matrix.
The pipeline is compatible with library designs that have:
For single-cell RNA: one read in a fragment match to a transcript and the other containing a cell-barcode and UMI.
For single-cell ATAC: two reads matching to the genome and the third one containing cell-barcode.
The pipeline includes the following functions:
Alignment
RNA-Seq (splice-aware) alignment and matching to annotated genes for the transcript reads.
ATAC-Seq alignment
Cell-barcode (both RNA- and ATAC-Seq) and UMI (RNA-Seq only) error correction for the barcode read.
Read counting
UMI counting per cell and gene to measure gene expression.
Fragment counting per cell and peak to measure chromatin accessibility.
Sparse matrix output and QC metrics.
A standard DRAGEN reference hashtable with both DNA and RNA capability is required for the Single-Cell Multiomics Pipeline. Building a reference hashtable using --ht-build-rna-hashtable=true
should satisfy this requirement. See the section Prepare a Reference Genome for more details.
The pipeline also requires a gene annotation file in GTF format, provided with the with the --annotation
(-a
) option.
Since the multiomics workflow assumes at least one pair of single-cell RNA FASTQ files and one triple of single-cell ATAC FASTQ files, the only method to provide read input to DRAGEN is through the fastq-list
file mechanism. A multiomics fastq-list
file is a CSV file with the following mandatory columns:
Lane
Sequencing lane
RGID
Read group ID
RGSM
Read group sample
RGLB
Read group library
Read1File
Read 1 FASTQ file
Read2File
Read 2 FASTQ file
UmiFile
FASTQ file with cell-barcodes and/or UMIs
InputGroup
Input group: either scATAC
or scRNA
(not case sensitive)
Entries in a fastq-list
file corresponding to single-cell RNA analysis must have read 2 FASTQ files in the UmiFile column and the Read2File column entries must be left empty. An example is shown below:
To use the multiomics workflow, enter --enable-rna=true --enable-single-cell-rna=true --enable-single-cell-atac=true
.
By default the multiomics workflow assumes that the overall barcode/UMI sequence is made up of a single-cell barcode (possibly split into multiple blocks) and a single UMI. Enter the following command to identify the location of the single-cell barcode and single-cell UMI in the barcode read:
For more details, please consult with the corresponding section of scRNA and scATAC workflows' documentation.
Barcode lists are mandatory for both scRNA and scATAC reads. You need to provide the list of cell barcode sequences using the following command:
--scrna-barcode-sequence-list </path/to/scRNAbarcodeAllowlist.txt> --scatac-barcode-sequence-list </path/to/scATACbarcodeAllowlist.txt>
The files must contain one possible cell barcode sequence per line. You can compress the file with gzip (*.txt.gz
). During cell-barcode error correction any observed barcodes that do not match a sequence specified in the file are considered errors. If possible, the barcodes are corrected to a similar allowed sequence. If the barcodes cannot be corrected, they are filtered out.
For each individual modality (either scRNA or scATAC), DRAGEN uses a threshold on the total count of unique UMIs (or reads) per cell barcode, to determine which barcodes are likely to correspond to single-cells in the original sample, instead of background noise. The threshold is determined based on the distribution of counts along barcodes and on the expected number of true cells in the sample. For more information, see the corresponding section in scATAC/scRNA documentation.
After count thresholds in each individual modality is computed, DRAGEN performs a joint cell filtering step. Each cell barcode is represented in a 2-D space with coordinates computed as the total UMI count across genes and the total fragment count across peaks. Initially, a cell-barcode is considered as passing the joint filter if it is passing the filter in each individual modality. DRAGEN then groups all cell barcodes in two categories: those passing both individual modality filters and the rest of cell barcodes. A k-means algorithm with 2 clusters is run and the filtering status of each cell barcode is refined.
The following is an example command line to run the DRAGEN Single Cell Multiomics Pipeline.
Single-cell Multiomics outputs are found in the standard DRAGEN output location using the prefix <sample>
. in case of a single library and the prefix <sample>.<libId>
. in case of multiple libraries. All single-cell Multiomics output files contain word multiomics
in their names.
The following three files provide information per-cell feature count level in matrix market (*.mtx) format:
<prefix>.multiomics.matrix.mtx.gz
Count of unique UMIs or fragments for each cell/feature pair in sparse matrix format.
<prefix>.multiomics.barcodes.tsv.gz
Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.multiomics.features.tsv.gz
Feature name and ID for each feature in the matrix.
The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.multiomics.barcodeSummary.tsv
indicated by values PASS
and FAIL
.
The output includes filtered matrix files which only include the per-cell feature count level for the filtered cells in matrix market (*.mtx
) format. The multiomics.features.tsv.gz
file is common for the unfiltered and filtered matrices:
<prefix>.multiomics.filtered.matrix.mtx.gz
Count of unique UMIs for each filtered cell/feature pair in sparse matrix format.
<prefix>.multiomics.filtered.barcodes.tsv.gz
Cell-barcode sequence for each filtered cell from the matrix.
Some users might want to explore the output matrix in a human-readable format. To do so, a possible way would be to load the matrix in a "dense" dataframe in python (similar methodologies can be used in alternative programming languages). It is important to remember, however, that when possible a "sparse" representation of the matrix is preferable, due to the significant usage of memory and disk space of "dense" matrices. Several tools are available to work efficiently with "sparse" representations of single cell matrices (e.g., scanpy in python).
The matrix can be converted into a "dense" representation through two python modules: scanpy
and pandas
. This has been tested with python 3.10.0, scanpy 1.9.3, pandas 1.5.3.
First, it is necessary to install the required libraries:
Within python, the matrix can be loaded in "dense" representation using the following commands:
The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.
DRAGEN Single-Cell Multiomics outputs two BAM files sorted by coordinate - one with suffix scRNA.bam
and one with suffix scATAC.bam
. For more details, please consult the corresponding section from scATAC/scRNA user guide.
When running in multiomics mode, since the DRAGEN aligner processes both RNA and ATAC reads, there will be separate mapper metrics summary for each modality. This is identified via the RGID field as rna
or atac
. There is no common map/align metrics summary for all input reads since it would merge the two modalities. The <prefix>.mapping_metrics.csv
file will also reflect this.
Below is an example snippet of the alignment metrics printed at the end of a DRAGEN Single-Cell Multiomics run.
The <prefix>.multiomics_metrics.csv
file contains per sample scATAC and scRNA metrics. For more details, please consult with the corresponding section from scATAC/scRNA user guide.
Here is an example of how a <prefix>.multiomics_metrics.csv
file can look like:
The <prefix>.multiomics.barcodeSummary.tsv
contains summary statistics for each unique cell-barcode per cell after error correction. Here is an example of how a <prefix>.multiomics.barcodeSummary.tsv
file can look like: