scATAC
Last updated
Last updated
The DRAGEN Single-Cell ATAC (scATAC) Pipeline can process single-cell ATAC-Seq data sets from reads to a cell-by-peak read count matrix. The pipeline includes the following functions:
ATAC-Seq alignment.
Cell-barcode error correction for the barcode read.
Chromatin accessibility peak calling.
Fragment counting per cell and peak to measure chromatin accessibility.
Sparse matrix output and QC metrics.
The functionality and options related to alignment and gene annotation are identical to DNA pipelines. For information, see DRAGEN DNA Pipeline.
Use a standard DRAGEN DNA reference genome or hashtable for the scATAC Pipeline.
The DRAGEN scATAC Pipeline requires both the genomic sequence and the barcode sequence for each fragment (read) as input. The genomic sequence is aligned to the reference genome to determine the expressed gene, the single-cell barcode sequence is used to identify the unique cell. When starting from FASTQ, you can either include the UMI in the read name or provide separate cell-barcode FASTQ files.
Provide the genomic reads as a paired-end FASTQ files with the Barcode sequence in the eighth field of the read-name line. Separate sequences using a colon. The following example uses read2 (sample.R2.fastq.gz
) as the genomic read.
In the example, the GAAACTCGTTCAGCGC
sequence is the barcode read and the ACAG...
sequence is the genomic read.
These FASTQ files can be generated by bclConvert and bcl2fastq using the UMI settings to define the single-cell barcode read. If using bclConvert, enter the barcode information using the OverrideCycles1
setting. For more information, see the BCL Convert Software Guide (document # 1000000094725).
Note: bclConvert refers to the entire single-cell barcode sequence as UMI.
Enter the following command line option to use the generated FASTQ files from bclConvert: dragen -1 <file name 1> -2 <file name 2> --umi-source=qname
The option is also compatible with the --fastq-list
input options and with read input from BAM files.
A single-cell ATAC fastq-list file is a CSV file with the following mandatory columns:
Lane
Sequencing lane
RGID
Read group ID
RGSM
Read group sample
RGLB
Read group library
Read1File
Read 1 FASTQ file
Read2File
Read 2 FASTQ file
UmiFile
Read 3 FASTQ file (FASTQ file with cell-barcodes)
An example is shown below:
UMI Fastq
FilesAn alternative option is to provide the genomic and barcode sequences as three separate FASTQ files. Two files contain only the genomic reads and one contains the corresponding barcode-reads in the same order. This file is similar to how read-pairs are normally handled. If using separate UMI files, the sequencing system run setup and bclConvert are not aware of the UMI and treat it as normal read sequence by default.
Enter the following command line option to use the separate UMI FASTQ files: dragen -1 <file name 1> -2 <file name 2> --umi-fastq=<file name 3> --umi-source=fastq
To use this method with multiple FASTQ files, follow these steps:
Enter the barcode FASTQ files as read1 in the fastq-list
file, and then enter the genomic read FASTQ files matching the default fastq_list.csv generated by bclConvert as read2 and umifile.
Enter the following command: dragen --fastq-list fastq_list.csv --umi-source=umifile
The scATAC pipeline can process a single biological sample per DRAGEN run. To process multiple single-cell libraries together, split the single sample into multiple single-cell libraries with a unique set of cells in each DRAGEN keeps the cells (barcodes and UMIs) from each library separate and provides merged outputs across all. Read groups are used to specify the library for each FASTQ file using the RGLB attribute.
To use the scATAC workflow, enter --enable-single-cell-atac=true
. This section includes information on additional scATAC settings.
By default the scATAC workflow assumes that the overall barcode sequence is made up of a single-cell barcode (possibly split into multiple blocks). Enter the following command to identify the location of the single-cell barcode:
--scatac-barcode-position <blockPos>[+<blockPos>+<blockPos>...][(:-)|(:+)]
blockPos
describes the offset of the first and last inclusive base of the block and is formatted as <startPos>_<endPos>
. For example, for a library with a 16 bp cell-barcode, enter: --scatac-barcode-position 0_15
. For a library with the cell-barcode split into three blocks of 9 bp separated by fixed linker sequences and an 8 bp UMI, enter: --scatac-barcode-position=0_8+21_29+43_51
.
By default, the barcode position is assumed to be indicated on the forward strand. To explicitly specify the forward strand, enter: --scatac-barcode-position 0_15:+
or --scatac-barcode-position=0_8+21_29+43_51:+
. Conversely, to specify the reverse strand, enter: --scatac-barcode-position 0_15:-
or --scatac-barcode-position=0_8+21_29+43_51:-
.
You can provide a list of cell barcode sequences to include using the following command:
--scatac-barcode-sequence-list </path/to/barcodeAllowlist.txt>
In the case where the --scatac-barcode-position
parameter is not split into multiple blocks (see Barcode Position section) the file must contain one possible cell barcode sequence per line. Differently, when the barcode position is split into multiple blocks, the file must contain a list composed by multiple sections (one for each block): each section must indicate the possible cell barcode block sequences for the corresponding block. Each section should start with a line with prefix #-
, e.g.:
The input file might be compressed with gzip (*.txt.gz
).
During cell-barcode error correction any observed barcodes that do not match a sequence specified in the file are considered errors. If possible, the barcodes are corrected to a similar allowed sequence. See Barcode Error Correction for more information. If the barcodes cannot be corrected, they are filtered out.
DRAGEN uses a threshold on the total count of unique reads per cell barcode, to determine which barcodes are likely to correspond to single-cells in the original sample, instead of background noise. The threshold is determined based on the distribution of counts along barcodes and on the expected number of true cells in the sample.
--single-cell-number-cells
--- [Optional] Set the expected number of cells. The default is 3000. Adjust only if the expected number of cells is so far from the default that DRAGEN does not call the correct cell filtering threshold automatically.
--single-cell-threshold
--- Specify the method for determining the count threshold value. The available values are fixed
, ratio
, or inflection
.
If using ratio
, DRAGEN estimates the expected number of cells as max(T_e, T_m)
. T_m
is a threshold based on a fraction of the counts seen in most abundant cell-barcodes. T_e
is a threshold based on a fraction of the least abundant expected cell.
If using inflection
, DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts.
If using fixed
, the count threshold is set to force the expected number of cells (--single-cell-number-cells
option), rather than estimating it from the data. The exact number of passing cells might be slightly larger than the number of requested single-cells because several cells in the tail of the count distribution can have the same count.
For example, to set a fixed number of cells rather than use the automatically determined threshold, use the following command:
--single-cell-threshold=fixed --single-cell-number-cells=X
The command forces DRAGEN to select the top X cells and extra cells with the same number of counts of the last selected cell.
The following are additional options you can use to configure the Single-Cell ATAC Pipeline settings.
--qc-enable-depth-metrics
--- Set to false
to disable depth metrics for faster run time. The default is true
.
--scatac-write-fragments
--- Set to true
to write counted fragments to the disk (in both tsv
(<prefix>.scATAC.fragments.tsv
) and BigWig
(<prefix>.scATAC.fragments.bigwig
) format). The default is false
.
The following is an example command line to run the DRAGEN Single Cell ATAC Pipeline.
Single-cell ATAC outputs are found in the standard DRAGEN output location using the prefix <sample>
. in case of a single library and the prefix <sample>.<libId>
. in case of multiple libraries. All single-cell ATAC output files contain word scATAC
in their names.
The following three files provide information about per-cell chromatin accessibility level in matrix market (*.mtx
) format:
<prefix>.scATAC.matrix.mtx.gz
Count of unique fragments for each cell/peak pair in sparse matrix format.
<prefix>.scATAC.barcodes.tsv.gz
Cell-barcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scATAC.peaks.tsv.gz
Peak name and ID for each peak in the matrix.
The subset of barcodes corresponding to passing cells can be found under the Filter column in <prefix>.scATAC.barcodeSummary.tsv
indicated by values PASS
and FAIL
.
The output includes filtered matrix files which only include the per-cell chromatin accessibility level for the PASS
cells in matrix market (*.mtx
) format. The scATAC.peaks.tsv.gz
file is common for the unfiltered and filtered matrices:
<prefix>.scATAC.filtered.matrix.mtx.gz
Count of unique UMIs for each filtered cell/peak pair in sparse matrix format.
<prefix>.scATAC.filtered.barcodes.tsv.gz
Cell-barcode sequence for each filtered cell from the matrix.
Some users might want to explore the output matrix in a human-readable format. To do so, a possible way would be to load the matrix in a "dense" dataframe in python (similar methodologies can be used in alternative programming languages). It is important to remember, however, that when possible a "sparse" representation of the matrix is preferable, due to the significant usage of memory and disk space of "dense" matrices. Several tools are available to work efficiently with "sparse" representations of single cell matrices (e.g., scanpy in python).
The matrix can be converted into a "dense" representation through two python modules: scanpy
and pandas
. This has been tested with python 3.10.0, scanpy 1.9.3, pandas 1.5.3.
First, it is necessary to install the required libraries:
Within python, the matrix can be loaded in "dense" representation using the following commands:
The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.
Alignments of the genomic reads are sorted by coordinate and output as a BAM file. Each alignment is annotated with an XB
tag containing the cell-barcode. The alignments use the original sequences without any errors corrected. Fragments that did not have an associated barcode read, for example fragments trimmed on the input data, do not have XB
tag.
The <prefix>.scATAC_metrics.csv
file contains per sample scATAC metrics.
Invalid barcode read: Overall barcode sequence failed basic checks. For example, the barcode read was missing or too short.
Error free cell-barcode: Reads with cell-barcode sequences that were not altered during error correction. For example, if the read was an exact match to the allow list.
Error corrected cell-barcode: Reads with cell-barcode sequences successfully corrected to a valid sequence.
Filtered cell-barcode: Reads with cell-barcode sequences that could not be corrected to a valid sequence. For example, the sequence does not match allow list with at most one mismatch.
Fragments passing filters: Non-chimeric non-mitochondrial fragments that align to primary contigs with a high mapping quality (greater than 30 by default).
Non-primary contig fragments: Fragments that align to non-primary contigs (any contigs that are not autosome, X and Y).
Chimeric fragments: Fragments with the two reads aligning to different contigs.
Mitochondrial fragments: Fragments aligning to the mitochondrial contigs.
Low mapping quality fragments: Fragments with the two reads aligning with a mapping quality set to some specific value (default is 30).
Improperly mapped fragments: The two reads in the fragment are not mapped in proper pair (SAM flag "read mapped in proper pair" is set to 0).
Fragment threshold for passing cells: Number of fragments required for a cell-barcode to pass filtering.
Passing cells: Number of cell-barcodes that passed the filters.
Fraction peak fragments in passing cells: Percentage of counted fragments intersecting peaks assigned to cells that passed the filters.
Fraction fragments in passing cells: Percentage of all counted fragments assigned to cells that passed the filters.
Median fragments per cells: Total counted fragments per cell that passed the filters.
Median peaks per cells: Peaks with at least one fragment per cell that passed the filters.
Total peaks detected: Peaks with at least one fragment in at least one cell that passed the filters.
The <prefix>.scATAC.barcodeSummary.tsv
contains summary statistics for each unique cell-barcode per cell after error correction.
ID: Unique numeric ID for the cell-barcode.
Barcode: The cell-barcode sequence.
TotalFragments: Total fragments with the cell-barcode sequence.
UniqueFragments: Unique fragments counted towards a peak.
NonPrimaryContigFragments: Unique non-primary contig framgnets.
ChimericFragments: Unique chimeric fragments.
LowMapqFragments: Unique low mapping quality fragments.
MitochondrialFragments: Unique fragments mapped to mitochondrial genome.
Peaks: Unique peaks detected.
Filter: The following are the available filter values:
PASS
: Cell-barcode passes the filter.
LOW
: UMI count is below threshold.
Cell-barcode sequences from the input reads are error corrected based on the frequency with which each one is seen and an optional allow list of expected cell-barcode sequences. A cell-barcode sequence is considered a neighbor of another cell-barcode if there is at most one mismatch. A cell-barcode sequence is corrected to its neighbor in the following circumstances. When corrected, all reads with the cell-barcode are assigned instead to the neighboring cell-barcode. The sequence error correction scheme is similar to the directional algorithm described in (Smith, Heger and Sudbery, 2020).
The neighboring cell-barcode is at least two times more frequent across all input reads.
The neighboring cell-barcode is on the cell-barcode allow list, but the original cell-barcode is not.
To avoid overcounting cell-barcodes based on sequence errors, cell-barcode error correction is performed among all reads with the same cell-barcode mapping to the same peak region. Cell-barcode sequences that are likely errors of another cell-barcodes are not counted.
Ref: Smith, T., Heger, A. and Sudbery, I., 2020. UMI-Tools: Modeling Sequencing Errors In Unique Molecular Identifiers To Improve Quantification Accuracy. [PDF] Cold Spring Harbor Laboratory Press. Available at: <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340976/> [Accessed 1 March 2022].
DRAGEN calls peaks using an algorithm based on MACS, version 3 (Zhang et al., 2008). To customize the behavior of the peak calling algorithm, modify any of the command line parameters specified in the table below.
atac-peak-qvalue
Threshold for q-value to call peaks
0.05
atac-peak-fold-change
Fold change threshold (relative to the background) to call peaks
1.0
atac-peak-min-length
Minimum length of a peak, bp
NA*
atac-peak-max-gap
Maximum pileup gap (if the gap is larger - initiate another peak), bp
NA**
(*), (**) - default value is computed automatically as the mean fragment length.
Alternatively, if fragment counting needs to be performed on a pre-specfified set of peaks, provide a peak BED file using command line parameter atac-peak-bed-file
.
Ref: Zhang, Y., Liu, T., Meyer, C.A., Eeckhoute, J., Johnson, D.S., Bernstein, B.E., Nusbaum, C., Myers, R.M., Brown M., Li. W. and Liu, X.S., 2008. Model-based Analysis of ChIP-Seq (MACS). [PDF] Available at: <https://pubmed.ncbi.nlm.nih.gov/18798982/> [Accessed 1 March 2022].
DRAGEN annotates each peak with respect to a gene symbol as promoter, distal, or intergenic depending on the genomic position of both the peak and the gene. The following rules are used to determine the annotation of a peak:
If a peak overlaps with promoter region (-1000bp, +100bp) of any transcription start site (TSS), it is annotated as a promoter peak of the gene.
If a peak is within 200kb of the closest TSS, and if it is not a promoter peak of the gene of the closest TSS, it will be annotated as a distal peak of that gene.
If a peak overlaps the body of a transcript, and it is not a promoter nor a distal peak of the gene, it will be annotated as a distal peak of that gene with distance set as zero.
If a peak has not been mapped to any gene at the step, it will be annotated as an intergenic peak without a gene symbol assigned.
To enable peak annotation in DRAGEN scATAC-Seq workflow, specify a gene annotation file (GTF) using the option -a
. Peak annotations are written to a file with name <prefix>.scATAC.peaks.tsv
and each annotation is represented as a row with the following 6 columns:
Chromosome number
Start position
End position
Gene symbol
Distance from peak to gene
Peak annotation
(i.e, promoter, distal, or intergenic).
In this analysis, peaks are matched to a set of known transcription factor (TF) binding sites and for each cell barcode the fragment counts are grouped based on transcription factors their peaks are assigned to. This results in a more compact representation of chromatin accessibility patterns. To enable TF motif analysis, specify a database of position-weight matrices (PWM) corresponding to transcription factor motifs in JASPAR format:
--atac-jaspar-database=JASPAR2022_CORE_non-redundant_pfms_jaspar.txt
DRAGEN will produce two files <prefix>.scATAC.tf.matrix.mtx.gz
and <prefix>.scATAC.tf.motifs.tsv.gz
which combined with file <prefix>.tf.barcodes.tsv.gz
represent the cell-by-TF count matrix in matrix market format.