Other scRNA Prep
The DRAGEN scRNA Pipeline can process multiplexed single-cell RNA-Seq data sets from reads to a cell-by-gene expression matrix. The pipeline is compatible with library designs that have one read in a fragment matched to a transcript and the other containing a cell barcode and a unique molecular identifier (UMI).
The pipeline includes the following functions:
Cell-barcode + UMI extraction and error correction for each read
RNA-Seq (splice-aware) alignment and matching to annotated genes
UMI counting per cell and gene to measure gene expression
Sparse matrix, molecule info (HDF5 file), h5ad, and QC metrics outputs
Feature counting, such as with cell-surface proteins

The functionality and options related to alignment and gene annotation are identical to the RNA-Seq pipeline (run in single-end read mode). For more information, see DRAGEN RNA Pipeline. None of the other DRAGEN RNA-Seq modules (variant calling, splice variant calling, gene fusion calling and transcript-level gene expression quantification) are supported for Single-Cell RNA.
Note: Illumina Single Cell 3' RNA Prep using PIPseq technology utilizes a new and novel way for molecular counting [1]. While file inputs and outputs are similar, the sequences described below as UMIs are used as binning indexes (BIs), and molecules are identified by a combination of BIs and intrinsic molecular identifiers (IMIs). The single-cell RNA pipeline is adapted to process these sequences when PIPseq mode is enabled with --scrna-enable-pipseq-mode=true. Furthermore, some arguments such as barcode positions, barcode whitelist and trimming are set automatically in PIPseq mode as well. Please see the Illumina Single-Cell RNA DRAGEN Analysis Workflow page for more information.
Input Files
Alignment Reference
Use a standard DRAGEN RNA reference genome or hashtable for the scRNA Pipeline. Recommended references (found on the DRAGEN Product Files page) include the linear hg38 v5 hashtable, the linear v5 versions of the hg19 and hs37d5 hashtables, and the linear hg38-mm39 v5 hashtable for barnyard datasets. The reference is provided with the --ref-dir (-r) option. Custom references can be built using --ht-build-rna-hashtable=true (see the DRAGEN Prepare a Reference Genome page for more details on creating a reference hashtable). The pipeline also requires a gene annotation file in GTF format, provided with the --annotation (-a) option. Recommended annotation files include the gencode v44 and the gencode hg38_v44+mm39_vM30 annotation files (available on the DRAGEN Product Files page as well).
It is recommended to exclude from the analysis pseudogenes and other entries that may overlap with protein-coding genes and reduce mapping accuracy. This can be done by setting the --scrna-filter-gene-biotypes option to true, which will automatically exclude the following biotypes:
shortRNA
microRNA
rRNA
pseudogenes
Read Input
The DRAGEN scRNA Pipeline requires both the transcript sequence and the barcode+UMI sequence for each fragment (read) as input. The pipeline typically expects read input as either FASTQ files or in a FASTQ list. The transcript sequence is typically provided in read 2 (R2) and the cell-barcode and binning index (BI) will be provided in read 1 (R1). Alternatively, the cell-barcode and BI can be provided in the read name (QNAME). The transcript sequence is aligned to the genome to determine the expressed gene, the barcode+UMI sequence is split into the single-cell barcode to identify the unique cell, and a single-cell UMI for unique molecule quantification.
FASTQ Input
If only a single pair of FASTQ files will be run, then FASTQ file input can be used. FASTQ file 1 typically contains the barcode and UMI sequences and FASTQ file 2 contains the transcript sequence (this is enabled using the --umi-source read1 option). The read group sample name (RGSM) and read group ID (RGID) should be set as well (to be used when labeling samples in output files).
Alternatively, the barcode and UMI can be provided in the read names of the transcript reads in a single FASTQ file. See below for details on adding the barcode and BI to the read name (using the --umi-source qname option).
FASTQ List Input
A single-cell RNA fastq-list file is a CSV file with the following mandatory columns:
RGID
Read group ID
RGSM
Read group sample
RGLB
Read group library
Lane
Sequencing lane
Read1File
Read 1 FASTQ file (usually FASTQ file with reads containing cell-barcodes and UMIs)
Read2File
Read 2 FASTQ file (usually transcript reads)
For example:
FASTQ list options:
The same FASTQ list option can be used if the barcodes and UMIs are provided in the read names.
UMIs in the Read Name
If barcode and UMI sequences are provided in the read name, then follow these steps:
Provide the transcript reads as a single-end FASTQ file with the Barcode+UMI sequence in the eighth field of the read-name line.
Separate sequences using a colon.
In the example, the sequence starting with AGA is the barcode+UMI read, the sequences beginning with GTG and ACG are the i7 and i5 sequences, respectively, and the sequence starting with TGCA is the transcript read sequence.
These FASTQ files can be generated by bclConvert and bcl2fastq using the UMI settings to define the single-cell barcode+UMI read. If using BCL Convert, enter the barcode/UMI information using the OverrideCycles setting. For more information, see the BCL Convert Software Guide (document # 1000000094725).
Note: BCL Convert refers to the entire single-cell barcode+UMI sequence as the UMI.
The resulting file can be provided as fastq file 1 using FASTQ input or fastq file 2 using FASTQ list input. This option is also compatible with BAM files using the following options:
Using Multiple Libraries
The scRNA pipeline can process a single biological sample per DRAGEN run by default. To process multiple single-cell libraries together, split the single sample into multiple single-cell libraries with a unique set of cells in each. DRAGEN keeps the cells (barcodes and UMIs) from each library separate and provides merged outputs across all. Read groups are used to specify the library for each FASTQ file using the RGLB attribute.
Example FASTQ list CSV file for multiple libraries:
Notes:
The RGID is referred to as the "read group" of the input file
The RGSM values should match for all reads to be processed in a single DRAGEN run
The RGLB values should match for all reads to be analyzed together (the results will be combined into a single metrics output file). There are two separate libraries here so two metrics files will be produced.
DRAGEN Single-cell Usage
The DRAGEN scRNA workflow is enabled with the options --enable-rna=true --enable-single-cell-rna=true. The following is an example command line.
Additional Options
The following are additional options you can use to configure the Single-Cell RNA Pipeline settings.
--rna-library-type--- Set the orientation of transcript reads relative to the genomes. EnterSFfor forward,SRfor reverse, orUfor unstranded. The default isSF.--enable-map-align-output--- Save the output BAM. The default is false. See the BAM File Output section for more details.--single-cell-threshold--- Set the cell filtering method. Available options arefixed,ratio, andinflection. The default isratio. See Cell Filtering for more details.--scrna-min-mapq--- Sets the minimum MAPQ threshold for reads to be counted.--scrna-count-introns--- Include intronic reads in gene expression estimation. The default istrue.--qc-enable-depth-metrics--- Set tofalseto disable depth metrics for faster run time. The default istrue.--bypass-anchor-mapping--- Set totrueto disable RNA anchor (two-pass) mapping for increased performance. The default isfalse.--scrna-global-umi--- Set totrueto require UMIs to be unique across all genes in a cell. Only the gene with the highest count will be considered for each UMI per cell. This option also causes all mappings of a read to be counted. The default isfalse.
Barcode Position
By default the scRNA workflow assumes that the overall barcode/UMI sequence is made up of a single-cell barcode (possibly split into multiple blocks) and a single UMI. Enter the following command to define the location of the single-cell barcode and single-cell UMI in the barcode read:
blockPos describes the offset of the first and last inclusive base of the block and is formatted as <startPos>_<endPos>. For example, for a library with a 16 bp cell-barcode followed by a 10 bp UMI, use: --scrna-barcode-position 0_15 --scrna-umi-position 16_25. For a library with the cell-barcode split into three blocks of 9 bp separated by fixed linker sequences and an 8 bp UMI, use: --scrna-barcode-position=0_8+21_29+43_51 --scrna-umi-position=52_59.
By default, the barcode position is assumed to be indicated on the forward strand. To explicitly specify the forward strand, use: --scrna-barcode-position 0_15:+ or --scrna-barcode-position=0_8+21_29+43_51:+. To explicitly specify the reverse strand, use: --scrna-barcode-position 0_15:- or --scrna-barcode-position=0_8+21_29+43_51:-.
UMI position can also be specified for feature reads, which are reads with a sequence tag specific to a feature (eg, cell-surface protein or antibody). When the feature-specific UMIs are located on Read 2, you can use --scrna-feature-barcode-r2umi=0_11 to specify a 12 bp feature UMI at the beginning of each feature read.
Known Barcode Lists
You can provide a list of cell barcode sequences to include using the following command:
In the case where the --scrna-barcode-position parameter is not split into multiple blocks (see the Barcode Position section) the file must contain one possible cell barcode sequence per line. Alternatively, when the barcode position is split into multiple blocks, the file must contain a list composed by multiple sections (one for each block): each section must indicate the possible cell-barcode block sequences for the corresponding block. Each section should start with a line with prefix #-. For example,
The input file can also be provided compressed with gzip (*.txt.gz).
During cell-barcode error correction, any observed barcodes that do not match a sequence specified in the file are considered errors. If possible, the barcodes are corrected to a similar allowed sequence. See Barcode Error Correction for more information. Barcodes that cannot be corrected are filtered out.
Barcode error correction
Cell-barcode sequences from the input reads are error-corrected based on their frequency, and optionally through a list of expected cell-barcode sequences. A cell-barcode sequence is corrected into another cell-barcode sequence if they differ only by one base (a Hamming distance of 1) and:
Either the corrected cell-barcode is at least two times more frequent across all input reads
Or the corrected cell-barcode is on the list of expected cell-barcode sequences, but the original cell-barcode is not
When corrected, all the original cell-barcode reads are assigned to the corrected cell-barcode. The sequence error correction scheme is similar to the directional algorithm described in (Smith, Heger and Sudbery, 2020)[2].
To avoid overcounting UMIs based on sequence errors, UMI error correction is performed using the same algorithm. UMIs are corrected for all reads within the same cell-barcode mapping to the same gene.
Mapping
The DRAGEN RNA mapper is used to map all of the transcript reads to the reference genome. Note that normally the barcode and UMI sequences are extracted from R1 and only the R2 reads are passed to the mapper. Therefore the mapper considers them to be single-ended reads labeled as R1 in the mapping metrics. Further information on the DRAGEN RNA mapper can be found on the RNA alignment page.
DRAGEN is also introducing support for the STAR aligner in DRAGEN 4.5 as an alternative to the DRAGEN RNA mapper. Further information on STAR aligner support can be found in the STAR aligner section.
Each read is mapped to the reference and then matched to genes based on where it aligns. If the read matches to the exons of a gene, then it is classified as an exon-matching read. Otherwise it is given another classification. DRAGEN scRNA currently only includes the primary alignment of each read (assigned randomly by the mapper if there are multiple alignments with identical scores). Secondary and supplementary alignments are skipped (but they are still included in the BAM output). The possible classifications for each read (based on its primary alignment) are:
Unique exon matching --- reads that overlap with gene exons by at least 50% of read length
Unique intron matching --- reads that overlap with the gene by at least 20% of read length
Mitochondrial --- reads that map to the mitochondrial genome
Filtered antisense --- reads that overlap a gene on the opposite strand defined by the library type
Filtered ambiguously matching --- reads that match to multiple genes
Filtered low MAPQ --- reads with low MAPQ (below 1 by default) that map ambiguously
Filtered non-matching --- reads that do not map to any genes
For the remainder of the pipeline, only the unique exon matching, unique intron matching, and mitochondrial reads (together referred to as "mapped reads") are counted toward the molecule counts of each barcode. The other "filtered" reads will be included in the "Total" reads per barcode but not counted towards the molecule counts of each gene. If --scrna-count-introns=false is set, then intron reads will also be considered "filtered" rather than "mapped".
The total number of reads in each classification can be found in the <prefix>.scRNA_metrics.csv file as described in the Read Matching Metrics section.
Biotype Filtering
Add the argument --scrna-filter-gene-biotypes=true to exclude pseudogene, shortRNA and rRNA biotypes during mapping.
Transcript Counting
DRAGEN scRNA performs UMI correction per cell-barcode and gene as described in the Barcode error correction section. Then the corrected UMIs are deduplicated and counted for each cell-barcode and gene.
Cell Filtering
DRAGEN uses a threshold on the total count of distinct UMIs (or reads) per cell barcode, to determine which barcodes are likely to correspond to single-cells in the original sample, instead of background noise. The threshold is determined based on the distribution of counts along barcodes and on the expected number of true cells in the sample.
--single-cell-number-cells--- [Optional] Sets the expected number of cells. The default is 3000. Adjust this value if using afixedthreshold or if the number of called cells deviates from expectation.--single-cell-threshold--- Specifies the method for determining the count threshold value. The available values arefixed,ratio, orinflection(the default isratio).If using
ratio, DRAGEN calculates the expected number of cells usingmax(T_e, T_m).T_mis a threshold based on a fraction of the molecules seen in the most abundant cell-barcodes. If the top decile cell is the cell occurring at the 90th percentile by molecule count, thenT_mis calculated as 10% of the number of molecules occurring in the top decile cell.T_eis a threshold based on a fraction of the least abundant expected cell. The least abundant expected cell is defined as the cell occurring at the rank of the number of expected cells (e.g. if the number of expected cells is 2,000, then the least abundant expected cell is the cell ranked 2,000th by molecule count abundance).T_eis calculated as 50% of the number of molecules found in the least abundant expected cell.For example, if dataset A is provided to DRAGEN with an expected number of cells of 2,000 and it is found to contain 30,000 cell-barcodes, and the 3,000th most abundant cell(i.e. the top decile cell) has 1,100 molecules while the 2,000th most abundant cell (i.e. the least abundant expected cell) has 3,700 molecules, then
T_mwould be equal to10% * 1,100 = 110 moleculesandT_ewould be equal to50% * 3,700 = 1,850 molecules. The threshold would then bemax(1,850, 110) = 1,850 molecules.
If using
inflection, DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts. The second derivatives of the read/UMI counts are used to estimate the inflection point that is chosen as the threshold.If using
fixed, the count threshold is set to force the expected number of cells (--single-cell-number-cellsoption), rather than estimating it from the data. The exact number of passing cells might be slightly larger than the number of requested single-cells because several cells in the tail of the count distribution can have the same count.
--single-cell-threshold-filterby--- [Optional] Set the count distribution to consider for cell filtering. Can be eitherumi(default) orread.
To set a specific, fixed number of cells, rather than use the automatically determined threshold, use the following command:
The command forces DRAGEN to select the top X cells and extra cells with the same number of counts of the last selected cell.
Outputs
Single-cell RNA outputs are found in the standard DRAGEN output location. The files start with a <prefix> which may correspond to <sample>. in the case of a single library and <sample>.<libId>. in the case of multiple libraries. All single-cell RNA output files contain the word scRNA in their names.
Count Files
The count files contain the overall molecule counts on a per-cell gene expression level. Counts are reported in two formats: Market Exchange (MEX) and h5ad format. Furthermore, counts are reported for all barcodes (raw) as well as for the set of filtered barcodes that pass the cell filtering threshold.
Raw MEX files:
<prefix>.scRNA.matrix.mtx.gzCount of molecules for each cell/gene pair in sparse matrix format.
<prefix>.scRNA.barcodes.tsv.gzBarcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scRNA.features.tsv.gzGene name, ID and feature type for each gene and feature in the matrix.
Raw h5ad file:
<prefix>.scRNA.h5adCount matrix in AnnData format. Barcodes are stored in the OBS group and genes are stored in the VAR group. Further information on the h5ad format can be found on the AnnData page.
Note: the h5ad output can be disabled with the option
--single-cell-enable-h5ad-output=false
Filtered MEX files (containing only cells that pass the cell filtering threshold):
<prefix>.scRNA.filtered.matrix.mtx.gzCount of unique MIs for each passing cell/gene pair in sparse matrix format.
<prefix>.scRNA.filtered.barcodes.tsv.gzCell-barcode sequence for each passing cell from the matrix.
<prefix>.scRNA.filtered.features.tsv.gzGene name, ID and feature type for each gene and feature in the matrix (identical to
<prefix>.scRNA.features.tsv.gz).
Filtered h5ad file:
<prefix>.scRNA.filtered.h5adFiltered count matrix in AnnData format. Barcodes are stored in the OBS group and genes are stored in the VAR group. Further information on the h5ad format can be found on the AnnData page.
Note: the h5ad output can be disabled with the option
--single-cell-enable-h5ad-output=false
BAM file
--enable-map-align-output=true--- default isfalse--output-format=BAM--- default isBAMA BAM file can be generated by adding in the options above. Note that enabling the BAM output may increase the runtime and memory usage of DRAGEN.
All alignments of all of the reads are included in the BAM, with unmapped reads coming at the end. The primary alignment of each read (the only that is considered in the pipeline, see Mapping) is annotated with single cell-specific BAM tags. The following tags are used:
CR--- Uncorrected barcode sequenceCB--- Corrected barcode sequence (identical toCRif no correction occurred)UR--- Uncorrected BI sequence
Non-primary alignments and reads not associated with a barcode will not have any single cell BAM tags.
By default, the BAM file will not be sorted but all reads with the same barcode will be grouped together. The following options may be used to force the BAM to be sorted by reference position: --split-by-cell-barcode=false. However, this may lead to significant memory issues and is not recommended. If a sorted BAM file is desired, it is recommended to have DRAGEN produce an unsorted BAM and then run samtools sort input.bam -o output_sorted.bam to sort by position.
Additional metrics related to the mapping of reads to the reference may be found in the <prefix>.mapping_metrics.csv file. Note that the "mapped reads (RNA) to rRNA and filtered" metric in that file only includes non-coding rRNA (it excludes protein-coding ribosomal RNA transcripts), so it is not applicable to single cell RNA analyses.
Single Cell RNA Metrics
The <prefix>.scRNA_metrics.csv file contains per sample single cell RNA metrics. In scRNA, mapped reads are currently reported by default under R1 metrics, irrespective of whether R1 is the read aligned to the transcriptome or the one containing the cell barcode and BI. The single cell RNA metrics file is in CSV format with the following fields:
Metric group
Sample name
Metric name
Metric value
Metric percentage (omitted for some metrics)
The single cell RNA metrics are broken down into four metric groups (additional groups are included if certain options are enabled).
Barcode Read Metrics
Barcode read metrics report the breakdown of how many of the input read pairs contain valid barcodes (matching to the acceptable sequence list). The denominator of the metric percentages in this group is "Total input reads".
Reads missing barcodes: The number of reads for which the overall barcode sequence (cell barcode + BI) failed basic checks, e.g. the barcode read was missing or too short.
Reads with exactly matching barcodes: The number of reads with barcode sequences that were not altered during error-correction, i.e. the barcode sequence exactly matched a barcode in the sequence list.
Reads with corrected barcodes: The number of reads with barcode sequences that were corrected to a valid sequence in the sequence list (within a hamming distance of 1).
Reads with non-matching barcodes: The number of reads with barcode sequences that did not match valid sequences in the sequence list, even after barcode correction (within a hamming distance of 1).
Total input reads: The total number of reads provided as input for this DRAGEN run. The metric value will be the sum of the above four metrics.
Total barcoded reads: The number of reads with valid barcodes that will proceed to the read matching stage. The metric value will be the sum of "Reads with exactly matching barcodes" and "Reads with corrected barcodes".
Read Matching Metrics
Read matching metrics report how many barcoded reads match to genes and how many were filtered due to various reasons. The denominator of the metric percentages in this group is "Total barcoded reads".
Unique exon matching reads: The number of reads that uniquely match with >50% overlap to the exon region of a single gene.
Unique intron matching reads: The number of reads that do not meet the exon-matching criteria above but map to a single gene with >20% overlap. N.B. if
--scrna-count-introns=falseis set, then this metric will become "Filtered intron matching reads".Filtered antisense reads: The number of reads that overlap a gene on the opposite strand based on the library type of the reads.
Filtered ambiguously matching reads: The number of reads that match to multiple genes equally well.
Filtered low MAPQ reads: The number of reads that are filtered due to low MAPQ values (default threshold of 4).
Filtered non-matching reads: The number of reads that do not match to any gene.
Mitochondrial reads: The number of reads that map to mitochondria.
Mapped reads: The number of reads considered to be "mapping," i.e. that are found to match to a gene. The metric value will be the sum of "Unique exon matching reads," "Unique intron matching reads", and "Mitochondrial reads". N.B. if
--scrna-count-introns=falseis set, then introns will be excluded.Reads processed through mapper: The total number of reads provided as input to the mapper. The metric value will be the sum of all of the above Read matching metrics excluding "Mapped reads".
Molecule Count Metrics
Molecule count metrics report how many reads and molecules contribute toward the count table. No metric percentages are reported.
Reads with valid molecular identifier sequences: The number of reads with exactly matching molecular identifier (MI) sequences (i.e. BI sequences). If no MI correction is enabled, then all counted gene reads will be included here.
Reads with corrected molecular identifier sequences: The number of reads with MI sequences that were corrected to a valid sequence. If no MI correction is enabled, then the value will be 0.
Reads with invalid molecular identifier sequences: The number of reads that were not counted due to having an invalid MI (BI) sequence. Homopolymer sequences and Ns are examples of invalid MI sequences. If no MI correction is enabled, then the value will be 0.
Total gene reads: The total number of gene expression or transcript reads. This value includes both mapped reads and filtered reads and should match the "Reads processed through mapper" metric if all reads had valid BI sequences.
Total counted gene reads: The total number of mapped reads with valid barcode and BI sequences. The value includes exon-matching, intron-matching, and mitochondrial reads.
Total molecules: The total number of individual molecules found after performing IPM correction on the counted gene reads per unique barcode-BI-IMI combination.
Sequencing saturation: The likelihood that if an additional read is sequenced, it would already have been observed. It is calculated as the fraction of "duplicate reads" from the same molecule using the formula 1 - ( "Total molecules" / "Total counted gene reads" ).
Cell Metrics
Cell metrics report on the number of barcodes, passing cells based on the count threshold, and mean and median reads, molecules, and genes per cell. No metric percentages are reported.
Total unique barcodes: The total number of unique barcode sequences.
Passing cell threshold (by molecule/read count): The number of molecules or reads required for a barcode to be considered a (passing) cell. N.B. the
--single-cell-threshold-filter-byoption determines whether molecule or read counts are used for this threshold (default: molecule).Passing cells: The total number of cells called (i.e. the number of barcodes that passed the filter threshold).
Fraction of reads in passing cells: The fraction of reads that were assigned to passing cells, calculated as total gene reads in passing cells over total gene reads. N.B. total gene reads include filtered reads in the Read matching metrics group.
Fraction of reads in passing cells that are genic: The fraction of reads assigned to passing cells that are genic reads i.e. matching to genes, calculated as total counted gene reads in passing cells over total gene reads in passing cells. Genic reads correspond to counted gene reads which include exon-matching, intron-matching, and mitochondrial reads.
Mean reads per cell (Total input reads / Passing cells): The average number of reads per passing cell calculated by dividing "Total input reads" by "Passing cells".
Median molecules per cell: The median number of molecules per passing cell based on sorting the passing cells by molecule count.
Median genes per cell: The median number of genes per passing cell based on sorting the passing cells by gene count.
Total genes detected: The total number of unique genes detected in the passing cells (i.e. the number of genes with at least one molecule in at least one passing cell).
Barcode Summary File
<prefix>.scRNA.barcodeSummary.tsv
The barcode summary file provides a list of all barcodes as well as the number of reads, genes, and molecules in each barcode and its filter status (either PASS for above the cell filter threshold or LOW for below the threshold).
Fields in the barcode summary file:
ID: Unique numeric ID for the barcode. The ID corresponds to the line number of that barcode in the MEX output.
Barcode: The barcode sequence.
TotalReads: Total reads containing the barcode sequence. This includes counted gene reads as well as reads that were filtered for wrong strand, low MAPQ, ambiguously matched or not mapped to a gene.
GeneReads: Reads (primary read alignments) that were counted towards a gene. This includes exons, mitochondrial reads, and introns (if
scrna-count-introns=truewhich is set by default).Molecules: Total molecule count.
Genes: Unique genes detected.
MitochondrialReads: Reads mapped to the mitochondrial genome.
Filter: Cell-filtering status of the barcode. Possible values are:
PASS: Barcode passes the cell filter threshold.LOW: Barcode count is below the threshold.
Molecule Info File
<prefix>.scRNA.moleculeInfo.h5
The molecule info file provides a summary of the read counts for each barcode, gene, BI and IMI combination. The contents of the HDF5 file can be viewed using HDFview or using the h5dump command:
barcode_idx: the barcode index number (Note: this internal value does not match with the barcodes.tsv line number)
barcodes: the barcode sequence
binning_idx: the binning index (BI) as an integer (e.g. AAA -> 0)
gene_ids: the gene ID
gene_idx: the gene index number (Note: this internal value does not match with the features.tsv line number)
gene_names: the gene name
imis: the IMI value corresponding to the global position where the read was mapped
read_counts: the number of reads found for the given barcode, gene, BI and IMI
The HDF5 file can be further manipulated in python using the h5py package. Below is example code for creating a pandas DataFrame object from the molecule info file.
Barcode Rank Plot
DRAGEN can generate a barcode rank plot with the help of DRAGEN Reports.
The <prefix>.scRNA.barcodeCounts.txt file contains the final raw counts of every barcode in decreasing order. The barcode counts file is used by DRAGEN Reports to produce a barcode rank plot (also known as a "knee" plot).
If DRAGEN is run on the cloud via BSSH or ICA, then DRAGEN Reports will automatically be run and produce the final report.html file, which will contain the barcode rank plot on the "Sample Report" page.
If DRAGEN is run on-prem, then DRAGEN Reports can be run after DRAGEN single cell RNA has finished running. An example DRAGEN Reports command line is:
The barcode rank plot will appear in the output report.html "Sample Report" page. More information on DRAGEN Reports can be found here.
UMAP and Marker Gene Table
DRAGEN can also generate a UMAP plot and marker gene table with the help of DRAGEN Reports. Note: both the UMAP plot and marker gene table are only available if DRAGEN Reports is run on the cloud via BSSH or ICA.
If DRAGEN is run on the cloud via BSSH or ICA, then DRAGEN Reports will automatically be run and produce the final report.html file. When the file is opened, click the sample name to go to the "Sample Report" page and then click "Single-Cell Clustering". The UMAP Plot and Top Marker Genes table will appear.
If DRAGEN is run on-prem, then DRAGEN Reports can be run after DRAGEN single cell RNA has finished running. An example DRAGEN Reports command line is:
The Top Marker Genes table will appear in the output report.html under the "Single-Cell Clustering" tab of the "Sample Report" page. The UMAP plot and marker gene table are not available when running DRAGEN on-prem. More information on DRAGEN Reports can be found here.
Feature Counting
DRAGEN single cell RNA supports feature counting which can be used to profile the expression of proteins (e.g. antibodies or cell-surface proteins). Feature reads differ from transcript (or gene expression) reads in that their read 2 sequence (R2) contains an oligo tag corresponding to a particular type of protein. Reads marked as feature reads will be matched against the list of feature barcode reference sequences (in contrast to gene expression reads which are matched to genes). Only feature reads matching feature barcode reference sequences will be counted in the final matrix.
Inputs
Feature counting is activated by providing a feature barcode reference file using --scrna-feature-barcode-reference. The feature barcode reference file can either be in FASTA format or CSV format (see below for examples). If the feature barcode reference file is provided in FASTA format, then the feature names should be provided in the header and the feature barcode reference sequences should be provided as the sequence.
If the feature barcode reference file is provided in CSV format, the following fields should be included:
id
Feature/gene ID
name
Feature/gene name
read
Read that the feature sequence is on (R1 or R2)
position
Starting position of the feature barcode reference sequence relative to the start of the R1 or R2 sequence (should be >= 0)
sequence
Feature barcode reference sequence
feature_type
Name of feature (e.g. Antibody Capture)
When feature counting is used, a FASTQ list is usually provided containing both the gene expression FASTQs and the feature read FASTQs. The read groups (RGIDs) of the feature read input FASTQs must be provided using the --scrna-feature-barcode-groups=<feature_read_groups> option (multiple read groups should be comma-separated). See the CRISPR mode examples below for how to provide the read groups to the --scrna-feature-barcode-groups option.
Examples
Example feature barcode reference file (FASTA):
Example feature barcode reference file (CSV):
Example command line:
Outputs
When feature counting is enabled, DRAGEN produces additional outputs. The features that were found are appended to the end of <prefix>.scRNA.features.tsv.gz. The feature counts for all the barcodes are found in the <prefix>.scRNA.matrix.mtx.gz raw matrix file and the feature counts for the filtered barcodes are found in the <prefix>.scRNA.filtered.matrix.mtx.gz matrix file.
The feature barcode reference file will also be included in the DRAGEN output directory with the name <prefix>.feature_barcode_reference with the corresponding extension for a CSV or a FASTA file (depending on what was provided as input).
The following columns are also added to the <prefix>.scRNA.barcodeSummary.tsv file when feature counting is enabled:
TotalFeatureReads: The total number of feature reads found for the barcode sequence (including feature reads that did not successfully match to the feature barcode reference sequence).
FeatureMolecules: The number of feature molecules found for the barcode sequence (only including feature reads matching to the feature barcode reference sequence).
Features: The total number of features found for the barcode sequence.
TotalGene+FeatureReads: The sum of the TotalGeneReads and TotalFeatureReads for the barcode sequence.
Gene+FeatureMolecules: The sum of the Molecules and FeatureMolecules for the barcode sequence.
Genes+Features: The sum of the Genes and Features for the barcode sequence.
Feature Metrics
When feature counting is enabled, the following feature metrics are also output in <prefix>.scRNA_metrics.csv:
Total input feature reads: The total number of feature reads provided as input to DRAGEN. The value of this metric will equal the sum of "Total barcoded feature reads" and "Feature reads with invalid barcodes". The denominator for this metric percentage is "Total input feature reads".
Total barcoded feature reads: The total number of barcoded feature reads that will proceed to the feature barcode reference sequence matching stage. N.B. if CRISPR mode is enabled, this value will only include reads for which the specific CRISPR barcode linkers have been correctly detected. The denominator for this metric percentage is "Total input feature reads".
Feature reads with invalid barcodes: The total number of feature reads with barcodes that did not match valid sequences in the sequence list, even after barcode correction (within a hamming distance of 1). If CRISPR mode is enabled, this value also includes reads that did not have the correct CRISPR barcode linker distance. The denominator for this metric percentage is "Total input feature reads".
Feature matching reads: The number of feature reads that match to the feature barcode reference sequences (allowing for a hamming distance of 1). The denominator for this metric percentage is "Total barcoded feature reads".
Feature non-matching reads: The number of feature reads that do not match to the feature barcode reference sequences (even allowing for a hamming distance of 1). The denominator for this metric percentage is "Total barcoded feature reads".
Total feature reads: The total number of feature reads processed by the workflow (this should match the "Total barcoded feature reads" metric above). The denominator for this metric percentage is "Total barcoded feature reads".
Total feature molecules: The total number of feature molecules found (this should typically match the "Feature matching reads" metric above). No denominator.
Fraction of feature reads in passing cells: The fraction of feature reads that were assigned to passing cells, calculated as total feature reads in passing cells over total feature reads.
Mean feature reads per cell (Total input feature reads / Passing cells): The average number of feature reads per passing cell calculated by dividing "Total input feature reads" by "Passing cells".
Median feature molecules per passing cell: The median number of feature molecules per passing cell based on sorting the passing cells by feature molecule count.
Median features per passing cell: The median number of features per passing cell based on sorting the passing cells by feature count.
Total features detected: The total number of unique features detected in the passing cells (i.e. the number of features with at least one molecule in at least one passing cell).
N.B. the "Filtered low MAPQ reads" metric will be excluded from the <prefix>.scRNA_metrics.csv file if feature counting is enabled. The reads that are filtered due to low MAPQ will instead be counted toward the "Filtered non-matching reads" metric.
CRISPR Mode
DRAGEN also supports processing samples from Illumina's CRISPR Single Cell kits using PIPseq technology. Set --scrna-enable-pipseq-crispr-mode to true to activate this mode. This mode also requires feature counting to be enabled (CRISPR sequences are one type of feature read).
Activating PIPseq CRISPR mode automatically configures DRAGEN to process feature reads containing the CRISPR guide RNA (gRNA) sequences. This includes handling the offsets in the cell-barcode position for the gRNA reads, transforming the gRNA cell-barcodes to match the gene expression ones, utilizing the "hook and grab" approach for matching the feature barcode reference sequences and identifying the gRNA reads, and counting the gRNA reads (disregarding BIs and IMIs). Both gene expression and gRNA reads are processed in a single DRAGEN run. Note: unmapped reads do not contribute to gene expression read counts but are still included in gRNA counts if they match the hook sequence.
The “hook and grab” method is a targeted approach for identifying CRISPR perturbation reads. It leverages a conserved sequence within the guide RNA structural region as a “hook” to locate the guide RNA and then “grabs” the specific guide by mapping it to a database of known sequences based on their displacement from the hook.

Currently the application of combining CRISPR with other features (e.g. CITE-seq) in a single run is not fully supported yet due to the specialized handling of CRISPR sequences.
Inputs
PIPseq CRISPR mode is activated by setting --scrna-enable-pipseq-crispr-mode=true. CRISPR mode requires --scrna-enable-pipseq-mode to be set to true as well. Note that when --scrna-enable-pipseq-crispr-mode is set to true, the following options are also set automatically:
--Aligner.aln-min-score=25--scrna-crispr-guide-calling-mode=gmm
For a standard PIPseq CRISPR workflow, a FASTQ list is usually provided containing both the gene expression FASTQs and the CRISPR FASTQs. The read groups of the CRISPR input FASTQs must be provided using the --scrna-feature-barcode-groups=<CRISPR_read_groups> option (multiple read groups should be comma-separated). Furthermore, a CRISPR barcode reference file must be provided using the --scrna-feature-barcode-reference argument.
The CRISPR barcode reference file should be in CSV format with the following header fields:
id
Feature/gene ID
name
Feature/gene name
read
Read that the guide RNA is on (R1 or R2)
target
Target or "hook" sequence. If the value is set to "5p", then the target sequence is based on the beginning of the read (default value: "5p")
position
Displacement of the start of the identifying guide RNA sequence relative to the start of the target sequence (can be positive or negative)
sequence
Guide RNA sequence
feature_type
Name of feature (e.g. CRISPR Direct Capture)
Example FASTQ list CSV file:
Notes:
The RGID is referred to as the "read group" and CRISPR RGIDs should be passed to the
--scrna-feature-barcode-groupsoptionThe RGSM values should match for all reads to be processed in a single DRAGEN run
The RGLB values should match for all reads to be analyzed together (the results will be combined into a single metrics output file)
Example feature barcode reference file:
Examples of target, position, and sequence values for common use cases are provided below:

Example Command Line
CRISPR Outputs
The features that were found are appended to the end of <prefix>.scRNA.features.tsv.gz. The feature counts for all the barcodes are found in the <prefix>.scRNA.matrix.mtx.gz raw matrix file and the feature counts for the filtered barcodes are found in the <prefix>.scRNA.filtered.matrix.mtx.gz matrix file.
CRISPR Metrics
When PIPseq CRISPR mode is enabled, the following CRISPR metrics are also output in <prefix>.scRNA_metrics.csv (in addition to the Feature Metrics):
CRISPR reads with exactly matching barcodes: The number of CRISPR reads with barcode sequences that were not altered during error-correction, i.e. the barcode sequence exactly matched a barcode in the sequence list.
CRISPR reads with corrected barcodes: The number of CRISPR reads with barcode sequences that were corrected to a valid sequence in the sequence list (within a hamming distance of 1).
Total CRISPR reads matching known barcodes: The total number of CRISPR reads with valid barcodes that match the sequence list after error-correction. The metric value will be the sum of the above two metrics.
CRISPR tag mapping rate: The rate at which CRISPR reads match to the feature barcode sequences, calculated as the number of valid CRISPR reads (i.e. CRISPR reads that match to a feature barcode sequence) divided by "Total CRISPR reads matching known barcodes". The value for this metric should match the percentage value for the "Feature matching reads" metric.
CRISPR fraction valid reads in cells: The rate at which valid CRISPR reads are found in cells, calculated as the number of valid CRISPR reads in cells divided by "Total CRISPR reads matching known barcodes".
Fraction passing cells with CRISPR reads: The fraction of passing cells containing at least one CRISPR read, calculated as the number of passing cells with at least 1 CRISPR read divided by the total number of passing cells.
Guide RNA Calling
DRAGEN now also supports guide RNA calling - identifying cells that express guide RNA sequences. Although the counts matrix reports how many feature counts were found for each cell, there is a chance that the feature reads happened to match to the feature barcode reference sequence or the feature count was assigned to the cell by chance rather than due to real signal. Guide RNA calling applies an additional filtering step to distinguish between cells truly expressing guide RNA sequences and noise (similar to how cell filtering is performed to distinguish true passing cells from noise).
DRAGEN utilizes a Gaussian Mixture Model (GMM) approach to perform guide RNA calling. For each feature, DRAGEN constructs a GMM on the feature counts for the set of cells containing that feature. The GMM has two components: the signal distribution and the noise distribution. The model is simplified by assuming the signal and noise distributions have the same variance. Initial parameters are set based on the threshold that maximizes the likelihood of the model and then the Expectation-Maximization (EM) algorithm (add citation) is used to optimize the parameters. After the model converges to a final set of parameters, a threshold is chosen that maximally separates the signal and noise populations, and each cell is classified as expressing the feature or not.
When PIPseq CRISPR mode is enabled, guide RNA calling using GMMs is automatically enabled as well with the --scrna-crispr-guide-calling-mode=gmm option. Guide RNA calling can be disabled by setting --scrna-crispr-guide-calling-mode=none.
Guide Calling Outputs
If guide RNA calling is enabled, then DRAGEN will generate three additional files:
<prefix>.scRNA.positive_cell_guide_assignments.csvA CSV file containing all of the passing cells with the following fields:
cell_barcode: The cell barcode sequence.
num_features: The number of features called for the cell (may be 0).
feature_call: A list of all features called for the cell (separated with a "|"). May be empty if num_features is 0.
num_transcripts: The number of feature counts for each feature for the cell (separated with a "|"). May be empty if num_features is 0.
<prefix>.scRNA.singlet_cell_guide_assignments.csvA CSV file containing only the cells for which exactly one feature was called, with the following fields:
cell_barcode: The cell barcode sequence.
feature_call: The feature that was called for that cell.
<prefix>.scRNA.guide_metrics.csvA CSV metrics file containing the following Guide Metrics:
Total cells: The total number of passing cells.
Total cells with one or more guides detected: The number of cells with at least one feature passing the guide calling threshold.
Total cells with two or more guides detected: The number of cells with at least two features passing the guide calling threshold.
Fractional Downsampling
DRAGEN supports fractional downsampling with PIPseq mode. Fractional downsampling refers to downsampling the input reads are a specified rate. Downsampling is performed randomly on all input reads.
Fractional downsampling is enabled with the following options:
--enable-fractional-down-sampler=true--down-sampler-normal-subsample=<down_sampling_rate>
Set the down_sampling_rate equal to the desired fraction of reads to keep. For example, a rate of 70% means that 30% of the input reads will be filtered out.
More information on DRAGEN fractional downsampling can be found on this page.
Demultiplexing
DRAGEN supports different methods of demultiplexing samples. Genotype demultiplexing can be used if a VCF containing variants differentiating the samples of interest is available. Tumor-normal demultiplexing can be used to distinguish between tumor and non-tumor cells. Cell hashing can be used if the samples contain distinguishing oligo-tags (typically in read R2).
Genotype Demultiplexing
Genotype demultiplexing refers to using single nucleotide variants (SNVs) in the samples to differentiate them. A VCF file must be provided with those variants to use genotype demultiplexing. With genotype demultiplexing, DRAGEN assigns sample identity to cells based on the alleles observed in reads in each cell - only SNVs are considered. DRAGEN flags any doublets, such as droplets that contain multiple cells from different individuals.
DRAGEN supports two flavors of genotype demultiplexing. Genotype-based sample demultiplexing should be used if the VCF file matches the samples in the data set provided to DRAGEN (e.g. if the VCF was generated from matched DNA samples for the same individual). Genotype-free sample demultiplexing should be used if the VCF file is derived from a background population corresponding to the samples in the data set - the VCF is not expected to directly match the samples but the variant genotypes in the VCF should appear in the samples at some frequency.
Command Line Options
Genotype-based sample demultiplexing is enabled with the option:
--scrna-demux-sample-vcf=<VCF_file>- the VCF file should contain the genotypes corresponding to the exact samples provided to DRAGEN. Note: the number of samples for DRAGEN to report is automatically detected from the number of SAMPLE fields in the VCF.
Genotype-free sample demultiplexing is enabled with the options:
--scrna-demux-reference-vcf=<VCF_file>- the VCF file should contain genotypes for the background population from which the samples originated (they do not have to match the samples exactly)--scrna-demux-number-samples=<number>— The number of samples to be detected by DRAGEN. DRAGEN will demultiplex the dataset into this many samples.
Additional options:
--scrna-demux-detect-doublets=<true/false>— Enable doublet detection for demultiplexing (set totrueby default).
Example command line for genotype-based sample demultiplexing:
Example command line for genotype-free sample demultiplexing:
Outputs
When genotype demultiplexing (demux) is enabled, additional outputs are produced by DRAGEN. Additional columns are added to the barcode summary file and two new files are genereated: a demux TSV file with the number of reads and SNPs used to demultiplex each barcode and a demux metrics file with some summary metrics for the demultiplexed samples.
The <prefix>.scRNA.barcodeSummary.tsv file contains the per-cell, barcode summary metrics. When demultiplexing is enabled, this file will have these additional columns describing the sample each barcode is assigned to. See the Barcode Summary File section for more information on the <prefix>.scRNA.barcodeSummary.tsv metrics.
SampleIdentity
The SampleIdentity column will contain one of the following values:
sampleX—The particular barcode is uniquely assigned to a sample.AMB(sampleX,sampleY)—The algorithm cannot determine the sample to assign the barcode to.MIX(mixing_coef*sampleX+(100-mixing_coef)*sampleY)—The cell barcode is classified as a doublet of sampleX and sampleY. For example,MIX(50*sampleX+50*sampleY).
IdentityQscore
The IdentityQscore column contains the value used to estimate the confidence of the sample-identity call. After DRAGEN determines the doublet status of the cell as singlet, ambiguous, or doublet, the identity Q-score is defined as -10 * log10(Probability that the assigned identity is correct, given the second most likely identity and the doublet status). Higher values for the identity Q-score correspond to more confident sample-identity calls.
The <prefix>.scRNA.demux.tsv file contains sample-demultiplexing statistics that were used to infer the sample identity of each cell barcode.
Barcode
The cell barcode associated with the cell.
DemuxSNPCount
The number of SNPs that the reads of the cell barcode intersect.
DemuxReadCount
The number of reads of the cell barcode that intersect at least one SNP.
Pure samples
Samples from the VCF file.
BestMixtureIdentity
Mixture sample with the highest log likelihood. Only available if --single-cell-demux-detect-doublets=true is set.
BestMixtureLogLikelihood
The log likelihood of the best mixture sample. Only available if --single-cell-demux-detect-doublets=true is set.
The <prefix>.scRNA.demuxSamples_metrics.csv file contains per-sample metrics for each demultiplexed sample, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv.
Passing cells
The number of cells called (i.e. the number of barcodes that passed the filter threshold).
Fraction genic reads in cells
The fraction of reads assigned to passing cells that are genic reads i.e. matching to genes, calculated as total counted gene reads in passing cells over total gene reads in passing cells. Genic reads correspond to counted gene reads which include exon-matching, intron-matching, and mitochondrial reads.
Median reads per cell
The median number of reads per passing cell based on sorting the passing cells by read count.
Median molecules per cell
The median number of molecules per passing cell based on sorting the passing cells by molecule count.
Median genes per cell
The median number of genes per passing cell based on sorting the passing cells by gene count.
Total genes detected
The total number of unique genes detected in the passing cells (i.e. the number of genes with at least one molecule in at least one passing cell).
Sample-Specific FASTQ Output Files
If you have enabled either of the sample demultiplexing algorithms, you can output sample-specific FASTQ files after the sample identities for each cell is available using the command line option --scrna-demux-sample-fastq.
If gzip is specified, then the sample-specific output FASTQ files are compressed in gzip format. If fastq is specified, then the sample-specific FASTQ files are not compressed. The default option is none, which indicates that no sample-specific FASTQ files are produced.
Tumor/non-tumor Cell Demultiplexing
In addition to cell demultiplexing based on sample or population genotypes, there is an additional option to calculate read likelihoods for all scRNA reads conditioned on a given list of SNVs. The SNVs are specified by providing a somatic VCF (typically from a DNA WES tumor-normal somatic VC output). Note that the input VCF is "trusted" completely (ignoring variant call quality). The read-likelihoods are combined at the cell level and used to classify cells as tumor or non-tumor.
Command-Line Options
--scrna-demux-tumor-normal-vcf— A tumor-normal somatic variant call VCF that has tumor and normal sample columns. It is recommended for the VCF to have more than ~100 confident SNVs but that threshold may vary based on the coverage of the scRNA data. Having too few somatic variants runs the risk of not having enough scRNA reads intersecting the SNVs, leading to most cells being unclassified due to insufficient information.--scrna-demux-tn-threshold— The log-likelihood ratio threshold for classifying a cell as tumor or non-tumor. The default value is set to 2.0, but we recommend user to test control samples (e.g. high and low tumor-fraction samples) to find an optimal threshold tailored to their experimental design and sample type.
Example DRAGEN Single Cell RNA command line using genotype-based demultiplexing:
Outputs
The output for tumor/non-tumor scRNA cell demultiplexing can be found in the following three files.
The <prefix>.scRNA.barcodeSummary.tsv file contains per-cell metrics, including cell barcodes. The following columns contain information on demultiplexing per-cell. See Barcode Summary File for more information on the <prefix>.scRNA.barcodeSummary.tsv metrics.
SampleIdentity
The SampleIdentity column can contain the following values::
SNG:TUMOR' or 'SNG:NORMAL—The particular cell (barcode) is uniquely classified as a tumor or non-tumor cell, respectively.AMB(NORMAL,TUMOR)—The algorithm has found variant-supporting reads, but the likelihoods of the cell being tumor or non-tumor are the same, and therefore an ambiguous call is made.NA—The cell barcode has no variant-supporting reads, so there is not enough information to make a classification.
IdentityQscore
The IdentityQscore column contains the value used to estimate the confidence of the sample identity call. After DRAGEN determines the identity of the cell, the identity Q-score is defined as -10 * log10(probability of the most likely identity / (probability of the most likely identity + probability of the secound most likely identity)). Higher values of identity Q-score correspond to more confident sample identity calls.
The <prefix>.scRNA.demux.tsv file contains demultiplexing statistics that are used to infer the identity of each cell. Only cells in which variant-supporting reads are found are recorded in this file.
Barcode
The cell barcode associated with the cell.
DemuxSNPCount
The number of variants that the reads of the cell barcode intersect.
DemuxReadCount
The number of MIs of the cell barcode that intersect at least one variant.
NORMAL
The log-likelihood of the cell being non-tumor
TUMOR
The log-likelihood of a cell being tumor.
The <prefix>.scRNA.demuxSamples_metrics.csv file contains per-cell metrics, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv. Ambiguous calls, i.e. 'AMB(NORMAL,TUMOR)', are excluded from the calculation.
Passing cells
The number of cell barcodes that are PASS.
Fraction genic reads in cells
Fraction of gene-coding reads assigned to PASS cells.
Median reads per cell
Median count of reads per cell that passed the filters.
Median molecules per cell
Median count of MIs per cell that passed the filters.
Median genes per cell
Median count of genes with at least one MI per cell that passed the filters.
Cell-Hashing
Cell-hashing refers to using oligo-tags usually in read R2 sequences to demultiplex samples. A cell-hashing reference file is required to match those oligo-tags to the corresponding samples. DRAGEN will match each R2 sequence against the cell-hashing reference sequences to determine the sample identity of each barcode. DRAGEN flags any doublets, such as droplets that contain multiple cells from different individuals.
The cell-hashing reference file should be a FASTA or CSV file and is specified with the option --scrna-cell-hashing-reference=<cell_hashing_ref_file>. The format of the file is similar to that required for feature counting (see Feature Counting). If the cell-hashing reference file is provided in FASTA format, then the sample names should be provided in the header and the oligo-tag reference sequences should be provided as the sequence.
If the cell-hashing reference file is provided in CSV format, the following fields should be included:
id— Sample identifiername— Sample nameread— Read that the oligo-tag sequence is on (R1 or R2)position— Starting position of the oligo-tag reference sequence relative to the start of the R1 or R2 sequence (should be >= 0)sequence— DNA sequence of the sample-specific oligo. For example, CAGCCCGATTAAGGT.type— Type of feature (e.g. Cell Hashing)
Read groups (RGID) for FASTQ files that contain cell hashing oligos should be specified with the option --scrna-hto-barcode-groups=<cell_hashing_RGIDs>.
Command Line Options
Cell-hashing is enabled with the options:
--scrna-cell-hashing-reference=<cell_hashing_ref_file>- the FASTA or CSV file with the sample-specific oligo-tags.--scrna-hto-barcode-groups=<cell_hashing_RGIDs>- the RGIDs of the FASTQ files containing cell hashing reads.
Additionally, doublet detection can be enabled/disabled with the following option:
--scrna-demux-detect-doublets=<true/false>— Enable/disable doublet detection for cell-hashing sample demultiplexing (set totrueby default).
Example command line for cell-hashing:
Outputs
When cell-hashing sample demultiplexing (demux) is enabled, additional outputs are produced by DRAGEN. An additional column is added to the barcode summary file and two new files are genereated: a demux TSV file with the number of reads used to demultiplex each barcode and a demux metrics file with some summary metrics for the demultiplexed samples.
The <prefix>.scRNA.barcodeSummary.tsv file contains the per-cell, barcode summary metrics. When demultiplexing is enabled, this file will have an additional column describing the sample each barcode is assigned to. See the Barcode Summary File section for more information on the <prefix>.scRNA.barcodeSummary.tsv metrics.
SampleIdentity
The SampleIdentity column will contain one of the following values:
sampleX—The particular barcode is uniquely assigned to a sample.AMB(sampleX,sampleY)—The algorithm cannot determine the sample to assign the barcode to.MIX(mixing_coef*sampleX+(100-mixing_coef)*sampleY)—The cell barcode is classified as a doublet of sampleX and sampleY. For example,MIX(50*sampleX+50*sampleY).
The <prefix>.scRNA.demux.tsv file contains sample-demultiplexing statistics that were used to infer the sample identity of each cell barcode.
Barcode
The cell barcode associated with the cell.
DemuxReadCount
Cell-hashing read count for each sample.
The <prefix>.scRNA.demuxSamples_metrics.csv file contains per-sample metrics for each demultiplexed sample, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv.
Passing cells
The number of cells called (i.e. the number of barcodes that passed the filter threshold).
Fraction genic reads in cells
The fraction of reads assigned to passing cells that are genic reads i.e. matching to genes, calculated as total counted gene reads in passing cells over total gene reads in passing cells. Genic reads correspond to counted gene reads which include exon-matching, intron-matching, and mitochondrial reads.
Median reads per cell
The median number of reads per passing cell based on sorting the passing cells by read count.
Median molecules per cell
The median number of molecules per passing cell based on sorting the passing cells by molecule count.
Median genes per cell
The median number of genes per passing cell based on sorting the passing cells by gene count.
Total genes detected
The total number of unique genes detected in the passing cells (i.e. the number of genes with at least one molecule in at least one passing cell).
STAR Aligner
To add optional compatibility with legacy PIPseq workflows, we have included Spliced Transcripts Alignment to a Reference (STAR) as an alternative to the default hardware-accelerated DRAGEN mapper. STAR is an open-source RNA-sequence read mapper, with support for splice-junction and fusion read detection.
For further information on STAR mapper:
While STAR may be run as an independent executable, running STAR as an internal feature from DRAGEN offers simpler and seamless functionality within the scRNA pipeline.
Important notes:
To run STAR, the Single Cell RNA pipeline must be enabled with either
--enable-single-cell-rna trueor--scrna-enable-pipseq-mode true.STAR uses its own format for reference genome hash tables. This format is both different and incompatible with DRAGEN's reference. When using STAR, both a DRAGEN and a STAR reference must be supplied, and these references must be built from identical whole-genome FASTA files and annotations. If STAR and DRAGEN are given references built from different files, undefined behavior may result.
When running STAR mapper, the following intermediate files will be created:
If a FASTQ list is provided as input and barcode/BI sequences are contained in read R1, then an additional barcode/BI merged FASTQ file will be created with the following name:
<prefix>.umi_merged.rg_<Read group index>.fastq. An intermediate FASTQ file will be created for each read group provided.A BAM file with the name:
<prefix>.star_mapper.Aligned.out.bam. This is the output BAM record containing all mapped reads.
The following are options for configuring STAR within DRAGEN:
--single-cell-enable-star-mapper-- ("true" or "false") Enable STAR and use it for mapping in lieu of the DRAGEN mapper.--single-cell-star-mapper-build-reference-- ("true" or "false") Enable STAR and use it for building STAR-formatted reference genome hash tables. STAR may build a reference and map reads from the same command line. STAR cannot use a DRAGEN-formatted reference.--single-cell-star-mapper-reference-dir-- (required, string) Users must supply a directory for STAR-formatted reference genome hash tables. If STAR is building a reference, this directory is where the output reference files will be stored. If the STAR mapper runs from a pre-built reference, this directory is where the mapper looks for the reference files.--single-cell-star-mapper-path-- (optional, string) Users may input their own path to a STAR executable. If not supplied, DRAGEN uses an internally-supplied STAR executable.--single-cell-star-mapper-options-- (optional, string) Users may manually adjust settings for STAR and overwrite the default settings set by DRAGEN. For a complete list of settings for STAR, please refer to its user manual--single-cell-compress-intermediate-fastq-- (optional, "true" or "false", default="false") When a FASTQ list is provided as input and barcode/BI sequences are contained in read R1, intermediate FASTQ files are created. This option enables compressing those intermediate FASTQ files with a ".fastq.gz" format.--single-cell-num-parallel-fastq-umis-- (optional, integer, default=4) When a FASTQ list is provided as input and barcode/BI sequences are contained in read R1, intermediate FASTQ files are created. This option sets how many parallel FASTQ files are processed simultaneously.
Because STAR uses a different map/align algorithm than DRAGEN, there will be some differences in the read-mapping results and thus in the cell and gene counts results as well.
Some STAR options that can be added (using the --single-cell-star-mapper-options argument) to further align STAR's output to DRAGEN's include the following:
--alignIntronMax 25000- limit the maximum intron size to 25,000 bp--clip5pNbases 1- clip one base from the 5' end of read 2--clip3pAdapterSeq polyA- trim poly-A sequences from 3' end of read 2
Command-line Examples Using STAR
Example command line for running the STAR mapper with DRAGEN single cell for Illumina Single Cell Prep.
Example command line for running the STAR mapper from a FASTQ list CSV with two non-default settings for STAR.
Example command line for creating a STAR reference and running the STAR mapper with DRAGEN single cell for Illumina Single Cell Prep. Note: STAR requires an uncompressed annotations file.
Tertiary Analysis
We recommend using Illumina Connected Multiomics (ICM) for single cell tertiary analysis. ICM is a fully-featured tool that allows users to combine and analyze different biological data sets to turn raw samples into published results faster and more easily.
Loading Count Files to a Dense Matrix
Some users might want to explore the output matrix in a human-readable format. There are several tools available including Illumina Connected Multiomics (ICM), scanpy, and seurat. To do so, a possible way would be to load the matrix in a "dense" dataframe in Python (similar methodologies can be used in other programming languages). It is important to remember, however, that when possible a "sparse" representation of the matrix is preferable, due to the significant usage of memory and disk space of "dense" matrices. Several tools are available to work efficiently with "sparse" representations of single cell matrices (e.g., scanpy in Python).
The matrix can be converted into a "dense" representation through two Python modules: scanpy and pandas. This has been tested with Python 3.9.7, scanpy 1.10.3, pandas 2.2.3.
First, it is necessary to install the required libraries:
Within Python, the matrix can be loaded in "dense" representation using the following commands:
The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.
References
Fontanez et al., 2024. Intrinsic molecular identifiers enable robust molecular counting in single-cell sequencing. bioRxiv 2024.10.04.616561.
Smith, T., Heger, A. and Sudbery, I., 2020. UMI-Tools: Modeling Sequencing Errors In Unique Molecular Identifiers To Improve Quantification Accuracy. [PDF] Cold Spring Harbor Laboratory Press. Available at: <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340976/> [Accessed 15 October 2020].
Last updated
Was this helpful?