Illumina PIPseq scRNA
Illumina scRNA Prep
The DRAGEN scRNA PIPseq Pipeline is designed to process data obtained with the proprietary Illumina Single Cell 3' RNA Prep library based on PIPseq technology [1]. The command line option --scrna-enable-pipseq-mode=true runs the scRNA pipeline with methods that are optimized to process the unique structure of the Illumina Single Cell 3' RNA Prep library.
DRAGEN scRNA Pipeline
The DRAGEN scRNA Pipeline can process multiplexed single-cell RNA-Seq data sets from reads to a cell-by-gene expression matrix.
The pipeline includes the following functions:
Cell-barcode and binning index extraction and error correction for each read
RNA-seq (splice-aware) alignment and matching to annotated genes
Molecule counting per cell and gene to measure gene expression
Sparse matrix, molecule info (HDF5 file), h5ad, and QC metrics outputs
Feature counting, e.g. CITE-seq, Perturb-seq, etc.

The functionality and options related to alignment and gene annotation are identical to the RNA-Seq pipeline (run in single-end read mode). For more information, see the DRAGEN RNA Pipeline. None of the other DRAGEN RNA-Seq modules (variant calling, splice variant calling, gene fusion calling and transcript-level gene expression quantification) are supported for Single-Cell RNA.
Input Files
Alignment Reference
Use a standard DRAGEN RNA reference genome or hashtable for the scRNA Pipeline. Recommended references (found on the DRAGEN Product Files page) include the linear hg38 v5 hashtable, the linear v5 versions of the hg19 and hs37d5 hashtables, and the linear hg38-mm39 v5 hashtable for barnyard datasets. The reference is provided with the --ref-dir (-r) option. Custom references can be built using --ht-build-rna-hashtable=true (see the DRAGEN Prepare a Reference Genome page for more details on creating a reference hashtable). The pipeline also requires a gene annotation file in GTF format, provided with the --annotation (-a) option. Recommended annotation files include the gencode v44 and the gencode hg38_v44+mm39_vM30 annotation files (available on the DRAGEN Product Files page as well).
It is recommended to exclude from the analysis pseudogenes and other entries that may overlap with protein-coding genes and reduce mapping accuracy. This can be done by setting the --scrna-filter-gene-biotypes option to true, which will automatically exclude the following biotypes:
shortRNA
microRNA
rRNA
pseudogenes
Read Input
The DRAGEN scRNA Pipeline typically expects read input as either FASTQ files or in a FASTQ list. The transcript sequence is typically provided in read 2 (R2) and the cell-barcode and binning index (BI) will be provided in read 1 (R1). Alternatively, the cell-barcode and BI can be provided in the read name (QNAME). More info about the barcode and BI for Illumina scRNA Prep is provided in the FASTQ Processing section.
FASTQ Input
If only a single pair of FASTQ files will be run, then FASTQ file input can be used. FASTQ file 1 is expected to contain the barcode and BI sequences and FASTQ file 2 will contain the transcript sequence (this is enabled using the --umi-source read1 option which is automatically set when PIPseq mode is enabled). The read group sample name (RGSM) and read group ID (RGID) should be set as well (to be used when labeling samples in output files).
Alternatively, the barcode and BI can be provided in the read names of the transcript reads in a single FASTQ file. See below for details on adding the barcode and BI to the read name (using the --umi-source qname option).
FASTQ List Input
A single-cell RNA fastq-list file is a CSV file with the following mandatory columns:
RGID
Read group ID
RGSM
Read group sample
RGLB
Read group library
Lane
Sequencing lane
Read1File
Read 1 FASTQ file (usually FASTQ file with reads containing cell-barcodes and BIs)
Read2File
Read 2 FASTQ file (usually transcript reads)
For example:
FASTQ list options (--umi-source=read1 automatically set with PIPseq mode):
The same FASTQ list option can be used if the barcodes and BIs are provided in the read names.
Barcode and BIs in the Read Name
If barcode and BI sequences are provided in the read name, then follow these steps:
Provide the transcript reads as a single-end FASTQ file with the Barcode+BI sequence in the eighth field of the read-name line.
Separate sequences using a colon.
In the example, the sequence starting with AGA is the barcode+BI read, the sequences beginning with GTG and ACG are the i7 and i5 sequences, respectively, and the sequence starting with TGCA is the transcript read sequence.
These FASTQ files can be generated by BCL Convert using the UMI settings to define the single-cell barcode+BI read. If using BCL Convert, enter the barcode/BI information using the OverrideCycles setting. For more information, see the BCL Convert Software Guide (document # 1000000094725).
Note: BCL Convert refers to the entire single-cell barcode+BI sequence as the UMI.
The resulting file can be provided as fastq file 1 using FASTQ input or fastq file 2 using FASTQ list input. This option is also compatible with BAM files using the following options:
Using Multiple Libraries
The scRNA pipeline processes a single biological sample per DRAGEN run by default. To process multiple single-cell libraries together, split the single sample into multiple single-cell libraries with a unique set of cells in each. DRAGEN keeps the cells (barcodes and BIs) from each library separate and provides merged outputs across all. Read groups are used to specify the library for each FASTQ file using the RGLB attribute.
Example FASTQ list CSV file for multiple libraries:
Notes:
The RGID is referred to as the "read group" of the input file
The RGSM values should match for all reads to be processed in a single DRAGEN run
The RGLB values should match for all reads to be analyzed together (the results will be combined into a single metrics output file). There are two separate libraries here so two metrics files will be produced.
scRNA PIPseq Licensing
For DRAGEN OnPrem Servers and DRAGEN FPGA Cloud BYOL customers, this pipeline requires the 'PipSeq' license. The cost of the PipSeq license is included with the associated Assay and has been automatically assigned. For those using DRAGEN OnPrem Servers, please see the OnPrem Licensing Installation Reference Section for more information. If your pre-existing DRAGEN OnPrem Server is not connected to the internet, please contact us to retrieve your PipSeq License.
Usage
The following is an example command line to run the DRAGEN Single Cell RNA PIPseq Pipeline.
When the --scrna-enable-pipseq-mode batch option is set to true, the following options are automatically included:
--enable-rna=true--enable-single-cell-rna=true--bypass-anchor-mapping=true--annotation-file-ignore-biotypes=pseudogene,shortrna,rrna--scrna-barcode-position=0_7+11_16+20_25+31_38--scrna-umi-position=39_41--umi-source=read1--scrna-barcode-sequence-list=<PIPSEQ_BARCODE_SEQ_LIST>--scrna-enable-barcode-phasing=true--scrna-num-barcode-phases=4--scrna-count-introns=true--scrna-min-mapq=1--qc-enable-depth-metrics=false--read-trimmers=fixed-len,adapter,polyg--trim-r1-5prime=1--trim-adapter-stringency=12--trim-adapter-r1-5prime=<PIPSEQ_ADAPTER_LIST>--trim-adapter-read1=<POLYA_TRIM_LIST>--soft-read-trimmers=none
Note that the binning index position is set using the --scrna-umi-position option. The <PIPSEQ_BARCODE_SEQ_LIST>, <PIPSEQ_ADAPTER_LIST>, and <POLYA_TRIM_LIST> are packaged with DRAGEN and the path to the files will be automatically detected. --scrna-min-mapq sets the minimum MAPQ threshold for reads to be counted. If any of these options are also included in the DRAGEN command line, then it will override the value set by the batch option.
Additional optional arguments:
--rna-library-type--- Set the orientation of transcript reads relative to the genomes. EnterSFfor forward,SRfor reverse, orUfor unstranded. The default isSF.--enable-map-align-output--- Save the output BAM. The default is false. See the BAM File Output for more details.--single-cell-threshold--- Set the cell filtering method. Available options arefixed,ratio, andinflection. The default isratio. See Cell Filtering for more details.
Pipeline
FASTQ Processing
The purpose of this step is to:
Extract cell-barcodes and binning indices (BIs) from reads
Correct cell-barcode sequences
Remove unwanted technical sequences from the data before mapping
Assign intrinsic molecular identifiers (IMIs) to reads
Structure of the Illumina Single Cell 3' RNA Prep FASTQ Input Files
The figure below describes the fragmentation process taking place in PIPseq chemistry. Each read in R1 includes 0 to 3 bases at the start indicating the "phase" of the read, a barcode sequence with 4 tiers, which identifies the individual PIP, followed by a 3-base binning index (BI) sequence. The number of phasing bases varies from read to read to increase base diversity for Read 1. Each barcode tier is composed of 96 possible sequences, allowing 96^4 total barcode combinations. By default, the barcode positions are set to 0_7+11_16+20_25+31_38. The binning index provides information for identifying the individual molecules within a cell. By default, the UMI positions are set to 39_41.
Read 1 barcode and BI structure: 
R2 includes the sequenced cDNA constructs created from the captured mRNA, which contain random cut sites that serve as intrinsic molecular identifiers (IMIs), and are used for molecular counting. Up to fifteen different cut sites can be created from a single captured molecule using five cycles of whole transcriptome amplification. The genome alignment position, which will be different for each cut site, serves as the IMI for each read.

Barcode Extraction and Correction
The barcode sequence for each read is extracted from read R1 and then compared against the barcode sequence list. Illumina Single Cell 3' RNA Prep uses a tiered barcode system, where every one of four tiers includes one of a specified list of possible barcodes. Barcode matching can therefore be done highly efficiently by matching each tier in isolation.
DRAGEN performs barcode phasing for each read. Between 0 and 3 bases are included at the start of the barcode to diversify the sequences. These are referred to as the "phase" of the barcode and must be removed. DRAGEN processes the barcodes by considering all possible sequences from each of the four phases, counts the number of mismatches between each sequence and the best matching sequence in the barcode sequence list, and chooses the phase with the fewest mismatches.
After phasing, DRAGEN performs error correction for each barcode. The pipeline allows for a maximum Hamming distance of 1 per tier, i.e., these may be one mismatch between the candidate and the sequence list barcode for each tier and still be matched. Barcodes within that Hamming distance will be corrected to the first matching barcode in the barcode sequence list. Barcodes with a greater Hamming distance will be considered non-matching and removed from downstream processing. If a read is truncated and missing the barcode sequence, then it will be considered a read with an invalid barcode. DRAGEN reports the total number of reads with invalid barcodes, exactly-matching barcodes, corrected barcodes, and non-matching barcodes in the <prefix>.scRNA_metrics.csv file. See the Barcode Read Metrics section for more details.
The BI sequence is also extracted from R1 but no correction is performed on it.
The barcode positions, BI positions and the expected barcode list are set automatically with the --scrna-enable-pipseq-mode argument, so they do not need to be specified by the user.
Custom barcode positions, BI positions and expected barcode list can be set with the following arguments:
Barcode positions
--scrna-barcode-position
0_7+11_16+20_25+31_38
BI position
--scrna-umi-position
39_41
Expected barcode list
--scrna-barcode-sequence-list
<PIPSEQ_BARCODE_SEQ_LIST_PATH>
R2 Trimming
The R2 file, which includes the cDNA construct, also includes technical sequences related to the PIPseq chemistry, namely the template switch oligo (TSO) and poly-A sequences. The prevalence of those sequences is inversely correlated with the fragment size. Because a read is less likely to be mapped accurately if it contains extraneous sequences, TSO sequences are removed from the 5’ end and poly-A sequences are removed from the 3’ end in the R2 FASTQ file. In addition, the first base, a constant T, is removed from the 5’ end of R2. These are done using the DRAGEN trimmers, with the settings automatically included under the --scrna-enable-pipseq-mode flag. Stats on the number of reads and bases trimmed can be found in the <prefix>.trimmer_metrics.csv file.
Mapping
The DRAGEN RNA mapper is used to map all of the R2 reads to the reference genome. Note that the barcode and BI sequences are extracted from R1 and only the R2 reads are passed to the mapper. Therefore the mapper considers them to be single-ended reads labeled as R1 in the mapping metrics. Further information on the DRAGEN RNA mapper can be found on the RNA alignment page.
DRAGEN is also introducing support for the STAR aligner in DRAGEN 4.5 as an alternative to the DRAGEN RNA mapper. Further information on STAR aligner support can be found in the STAR aligner section.
Each read is mapped to the reference and then matched to genes based on where it aligns. If the read matches to the exons of a gene, then it is classified as an exon-matching read. Otherwise it is given another classification. DRAGEN scRNA currently only includes the primary alignment of each read (assigned randomly by the mapper if there are multiple alignments with identical scores). Secondary and supplementary alignments are skipped (but they are still included in the BAM output). The possible classifications for each read (based on its primary alignment) are:
Unique exon matching --- reads that overlap with gene exons by at least 50% of read length
Unique intron matching --- reads that overlap with the gene by at least 20% of read length
Mitochondrial --- reads that map to the mitochondrial genome
Filtered antisense --- reads that overlap a gene on the opposite strand defined by the library type
Filtered ambiguously matching --- reads that match to multiple genes
Filtered low MAPQ --- reads with low MAPQ (below 1 by default) that map ambiguously
Filtered non-matching --- reads that do not map to any genes
For the remainder of the pipeline, only the unique exon matching, unique intron matching, and mitochondrial reads (together referred to as "mapped reads") are counted toward the molecule counts of each barcode. The other "filtered" reads will be included in the "Total" reads per barcode but not counted towards the molecule counts of each gene. If --scrna-count-introns=false is set, then intron reads will also be considered "filtered" rather than "mapped".
The total number of reads in each classification can be found in the <prefix>.scRNA_metrics.csv file as described in the Read Matching Metrics section.
Biotype Filtering
Add the argument --scrna-filter-gene-biotypes=true to exclude pseudogene, shortRNA and rRNA biotypes during mapping.
Transcript Counting
Illumina Single Cell 3' RNA Prep generates fragments with unique starting positions, which serve as intrinsic identifiers that can be used for molecular counting. Reads are grouped together based on the cell barcode and the assigned gene. Within each barcode and gene combination, IMIs are grouped in one of 64 bins, based on the 3-base binning index. For each bin, all identical IMIs are collapsed into a single count, since they are likely PCR duplicates of the same fragment generated during library prep.
At this point, any barcode and gene combination which has ten or fewer unique binning indexes is assigned the number of unique binning indexes as its final count estimate. The pipeline then totals the number of IMIs associated with each remaining barcode and gene combination, and divides that number by a correction factor estimated from the data, which accounts for the additional copies generated from a single captured molecule during five amplification cycles. The final count estimate is the maximum between the floor of this value and the number of unique binning indexes for this barcode and gene.
Estimating the correction factor is described in detail in the next section. The example below shows molecular counting with a correction factor of 1.35. The barcode and gene combination in this example has 12 unique binning indexes, so the correction factor is applied to produce the final count. The correction factor of 1.35 indicates that, on average, a single molecule produced 1.35 fragmented copies in this sample. Note that the corrected count will always be at least equal to the number of unique bins, and will not exceed the number of unique IMI and bin combinations. The result of this correction process, 14, is the number that is recorded in the count matrix for this barcode and gene.

Estimating the Correction Factor
A single molecule can produce anywhere between one and fifteen fragmented copies during five amplification cycles. The true distribution of the number of fragmented copies varies between datasets, and is influenced by factors such as sample type and sequencing depth. For a given dataset, the correction factor is an estimate of the average of this distribution, the average IMIs per molecule (IPM) in the sample.
Because all IMIs from the same parent molecule share a binning index, the number of unique binning indexes observed within a specific barcode and gene is determined by the number of molecules, and is not impacted by the number of IMIs that were produced by the molecules. This means the probabilistic relationship between the number of unique bins and the true number of molecules in a barcode and gene combination is constant, and is the result of random sampling from the 64 possible bin indexes when each molecule is captured. We can represent this process by the Coupon Collector Problem (CCP) [2], which defines this distribution of the number of samples (molecules) required to a certain number of distinct types (unique bins) among the total collection of unique types (64 binning indexes). The combination of constant molecular probabilities based on the number of unique bins and observed unique IMIs in barcode and gene combinations in the sample enable the IPM estimation. We select the subset of barcode and gene combinations with between 5 and 32 unique bin indexes. For each of these barcode and genes, we divide the total number of IMIs by the average number of molecules expected based on the number of unique bin indexes. The estimated average IMIs per molecule (IPM) for the sample is the average of these values across this subset of barcode and genes. The subset of barcode and gene combinations is used rather than all available combinations because we expect the CCP model to be most representative of the true molecular count and noise from sources such as sequencing errors or reads outside the cell fraction to be minimized within this subset.
After calculating the average IMIs per molecule in the sample, we apply the IPM to estimate the molecular count from the IMI count in individual barcode and gene combinations. The estimated molecular count for a barcode and gene is the total number of IMIs divided by the IPM, rounded down. The more true molecules a barcode and gene combination has, the closer the estimated molecular count is expected to be with this approach (as the number of molecules increases, the true average IMIs per molecule for a barcode and gene should approach the average IPM of the sample). For barcode and gene combinations with very few molecules, the number of unique bins is expected to be a better predictor of the molecular count than the number of IMIs because the variance in the true IMIs per molecule among this group is high since the number of samples (molecules) in each individual barcode and gene combination is low. For this reason, we apply IPM correction for barcode and gene combinations with more than 10 unique bin indexes, and otherwise the corrected count is equal to the number of unique bin indexes.
Cell Filtering
DRAGEN uses a threshold on the total count of distinct molecular identifiers (or reads) per cell barcode, to determine which barcodes are likely to correspond to single-cells in the original sample, instead of background noise. The threshold is determined based on the distribution of counts along barcodes and on the expected number of true cells in the sample.
--single-cell-number-cells--- [Optional] Sets the expected number of cells. The default value is set to 2,000, 10,000 or 100,000 which is estimated based on the total number of barcodes observed. Adjust this value if using afixedthreshold or if the number of called cells deviates from expectation.--single-cell-threshold--- Specifies the method for determining the count threshold value. The available values arefixed,ratio, orinflection(the default isratio).If using
ratio, DRAGEN calculates the expected number of cells usingmax(T_e, T_m).T_mis a threshold based on a fraction of the molecules seen in the most abundant cell-barcodes. If the top decile cell is the cell occurring at the 90th percentile by molecule count, thenT_mis calculated as 10% of the number of molecules occurring in the top decile cell.T_eis a threshold based on a fraction of the least abundant expected cell. The least abundant expected cell is defined as the cell occurring at the rank of the number of expected cells (e.g. if the number of expected cells is 2,000, then the least abundant expected cell is the cell ranked 2,000th by molecule count abundance).T_eis calculated as 50% of the number of molecules found in the least abundant expected cell.For example, if dataset A is provided to DRAGEN with an expected number of cells of 2,000 and it is found to contain 30,000 cell-barcodes, and the 3,000th most abundant cell(i.e. the top decile cell) has 1,100 molecules while the 2,000th most abundant cell (i.e. the least abundant expected cell) has 3,700 molecules, then
T_mwould be equal to10% * 1,100 = 110 moleculesandT_ewould be equal to50% * 3,700 = 1,850 molecules. The threshold would then bemax(1,850, 110) = 1,850 molecules.
If using
inflection, DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts. The second derivatives of the read/molecule counts are used to estimate the inflection point that is chosen as the threshold.If using
fixed, the count threshold is set to force the expected number of cells (--single-cell-number-cellsoption), rather than estimating it from the data. The exact number of passing cells might be slightly larger than the number of requested single-cells because several cells in the tail of the count distribution can have the same count.
--single-cell-threshold-filterby--- [Optional] Set the count distribution to consider for cell filtering. Can be eitherumi(default) orread.
To set a specific, fixed number of cells, rather than use the automatically determined threshold, use the following command:
The command forces DRAGEN to select the top X cells and extra cells with the same number of counts of the last selected cell.
Outputs
Single-cell RNA outputs are found in the standard DRAGEN output location. All single-cell RNA output files contain the word scRNA in their names.
Count Files
The count files contain the overall molecule counts on a per-cell gene expression level. Counts are reported in two formats: Market Exchange (MEX) and h5ad format. Furthermore, counts are reported for all barcodes (raw) as well as for the set of filtered barcodes that pass the cell filtering threshold.
Raw MEX files:
<prefix>.scRNA.matrix.mtx.gzCount of molecules for each cell/gene pair in sparse matrix format.
<prefix>.scRNA.barcodes.tsv.gzBarcode sequence for each cell from the matrix. This includes all cell-barcodes.
<prefix>.scRNA.features.tsv.gzGene name, ID and feature type for each gene and feature in the matrix.
Raw h5ad file:
<prefix>.scRNA.h5adCount matrix in AnnData format. Barcodes are stored in the OBS group and genes are stored in the VAR group. Further information on the h5ad format can be found on the AnnData page.
Note: the h5ad output can be disabled with the option
--single-cell-enable-h5ad-output=false
Filtered MEX files (containing only cells that pass the cell filtering threshold):
<prefix>.scRNA.filtered.matrix.mtx.gzCount of unique MIs for each passing cell/gene pair in sparse matrix format.
<prefix>.scRNA.filtered.barcodes.tsv.gzCell-barcode sequence for each passing cell from the matrix.
<prefix>.scRNA.filtered.features.tsv.gzGene name, ID and feature type for each gene and feature in the matrix (identical to
<prefix>.scRNA.features.tsv.gz).
Filtered h5ad file:
<prefix>.scRNA.filtered.h5adFiltered count matrix in AnnData format. Barcodes are stored in the OBS group and genes are stored in the VAR group. Further information on the h5ad format can be found on the AnnData page.
Note: the h5ad output can be disabled with the option
--single-cell-enable-h5ad-output=false
BAM file
--enable-map-align-output=true--- default isfalse--output-format=BAM--- default isBAMA BAM file can be generated by adding in the options above. Note that enabling the BAM output may increase the runtime and memory usage of DRAGEN.
All alignments of all of the reads are included in the BAM, with unmapped reads coming at the end. The primary alignment of each read (the only that is considered in the pipeline, see Mapping) is annotated with single cell-specific BAM tags. The following tags are used:
CR--- Uncorrected barcode sequenceCB--- Corrected barcode sequence (identical toCRif no correction occurred)UR--- Uncorrected BI sequence
Non-primary alignments and reads not associated with a barcode will not have any single cell BAM tags.
By default, the BAM file will not be sorted but all reads with the same barcode will be grouped together. The following options may be used to force the BAM to be sorted by reference position: --scrna-split-counts-by-barcode=false. However, this may lead to significant memory issues and is not recommended. If a sorted BAM file is desired, it is recommended to have DRAGEN produce an unsorted BAM and then run samtools sort input.bam -o output_sorted.bam to sort by position.
Additional metrics related to the mapping of reads to the reference may be found in the <prefix>.mapping_metrics.csv file. Note that the "mapped reads (RNA) to rRNA and filtered" metric in that file only includes non-coding rRNA (it excludes protein-coding ribosomal RNA transcripts), so it is not applicable to single cell RNA analyses.
Single Cell RNA Metrics
The <prefix>.scRNA_metrics.csv file contains per sample single cell RNA metrics. In scRNA, mapped reads are currently reported by default under R1 metrics, irrespective of whether R1 is the read aligned to the transcriptome or the one containing the cell barcode and BI. The single cell RNA metrics file is in CSV format with the following fields:
Metric group
Sample name
Metric name
Metric value
Metric percentage (omitted for some metrics)
The single cell RNA metrics are broken down into four metric groups (additional groups are included if certain options are enabled).
Barcode Read Metrics
Barcode read metrics report the breakdown of how many of the input read pairs contain valid barcodes (matching to the acceptable sequence list). The denominator of the metric percentages in this group is "Total input reads".
Reads missing barcodes: The number of reads for which the overall barcode sequence (cell barcode + BI) failed basic checks, e.g. the barcode read was missing or too short.
Reads with exactly matching barcodes: The number of reads with barcode sequences that were not altered during error-correction, i.e. the barcode sequence exactly matched a barcode in the sequence list.
Reads with corrected barcodes: The number of reads with barcode sequences that were corrected to a valid sequence in the sequence list (within a hamming distance of 1).
Reads with non-matching barcodes: The number of reads with barcode sequences that did not match valid sequences in the sequence list, even after barcode correction (within a hamming distance of 1).
Total input reads: The total number of reads provided as input for this DRAGEN run. The metric value will be the sum of the above four metrics.
Total barcoded reads: The number of reads with valid barcodes that will proceed to the read matching stage. The metric value will be the sum of "Reads with exactly matching barcodes" and "Reads with corrected barcodes".
Read Matching Metrics
Read matching metrics report how many barcoded reads match to genes and how many were filtered due to various reasons. The denominator of the metric percentages in this group is "Total barcoded reads".
Unique exon matching reads: The number of reads that uniquely match with >50% overlap to the exon region of a single gene.
Unique intron matching reads: The number of reads that do not meet the exon-matching criteria above but map to a single gene with >20% overlap. N.B. if
--scrna-count-introns=falseis set, then this metric will become "Filtered intron matching reads".Filtered antisense reads: The number of reads that overlap a gene on the opposite strand based on the library type of the reads.
Filtered ambiguously matching reads: The number of reads that match to multiple genes equally well.
Filtered low MAPQ reads: The number of reads that are filtered due to low MAPQ values (default threshold of 4).
Filtered non-matching reads: The number of reads that do not match to any gene.
Mitochondrial reads: The number of reads that map to mitochondria.
Mapped reads: The number of reads considered to be "mapping," i.e. that are found to match to a gene. The metric value will be the sum of "Unique exon matching reads," "Unique intron matching reads", and "Mitochondrial reads". N.B. if
--scrna-count-introns=falseis set, then introns will be excluded.Reads processed through mapper: The total number of reads provided as input to the mapper. The metric value will be the sum of all of the above Read matching metrics excluding "Mapped reads".
Molecule Count Metrics
Molecule count metrics report how many reads and molecules contribute toward the count table. No metric percentages are reported.
Reads with valid molecular identifier sequences: The number of reads with exactly matching molecular identifier (MI) sequences (i.e. BI sequences). If no MI correction is enabled, then all counted gene reads will be included here.
Reads with corrected molecular identifier sequences: The number of reads with MI sequences that were corrected to a valid sequence. If no MI correction is enabled, then the value will be 0.
Reads with invalid molecular identifier sequences: The number of reads that were not counted due to having an invalid MI (BI) sequence. Homopolymer sequences and Ns are examples of invalid MI sequences. If no MI correction is enabled, then the value will be 0.
Total gene reads: The total number of gene expression or transcript reads. This value includes both mapped reads and filtered reads and should match the "Reads processed through mapper" metric if all reads had valid BI sequences.
Total counted gene reads: The total number of mapped reads with valid barcode and BI sequences. The value includes exon-matching, intron-matching, and mitochondrial reads.
Total molecules: The total number of individual molecules found after performing IPM correction on the counted gene reads per unique barcode-BI-IMI combination.
Sequencing saturation: The likelihood that if an additional read is sequenced, it would already have been observed. It is calculated as the fraction of "duplicate reads" from the same molecule using the formula 1 - ( "Total molecules" / "Total counted gene reads" ).
Cell Metrics
Cell metrics report on the number of barcodes, passing cells based on the count threshold, and mean and median reads, molecules, and genes per cell. No metric percentages are reported.
Total unique barcodes: The total number of unique barcode sequences.
Passing cell threshold (by molecule/read count): The number of molecules or reads required for a barcode to be considered a (passing) cell. N.B. the
--single-cell-threshold-filter-byoption determines whether molecule or read counts are used for this threshold (default: molecule).Passing cells: The total number of cells called (i.e. the number of barcodes that passed the filter threshold).
Fraction of reads in passing cells: The fraction of reads that were assigned to passing cells, calculated as total gene reads in passing cells over total gene reads. N.B. total gene reads include filtered reads in the Read matching metrics group.
Fraction of reads in passing cells that are genic: The fraction of reads assigned to passing cells that are genic reads i.e. matching to genes, calculated as total counted gene reads in passing cells over total gene reads in passing cells. Genic reads correspond to counted gene reads which include exon-matching, intron-matching, and mitochondrial reads.
Mean reads per cell (Total input reads / Passing cells): The average number of reads per passing cell calculated by dividing "Total input reads" by "Passing cells".
Median molecules per cell: The median number of molecules per passing cell based on sorting the passing cells by molecule count.
Median genes per cell: The median number of genes per passing cell based on sorting the passing cells by gene count.
Total genes detected: The total number of unique genes detected in the passing cells (i.e. the number of genes with at least one molecule in at least one passing cell).
Barcode Summary File
<prefix>.scRNA.barcodeSummary.tsv
The barcode summary file provides a list of all barcodes as well as the number of reads, genes, and molecules in each barcode and its filter status (either PASS for above the cell filter threshold or LOW for below the threshold).
Fields in the barcode summary file:
ID: Unique numeric ID for the barcode. The ID corresponds to the line number of that barcode in the MEX output.
Barcode: The barcode sequence.
TotalReads: Total reads containing the barcode sequence. This includes counted gene reads as well as reads that were filtered for wrong strand, low MAPQ, ambiguously matched or not mapped to a gene.
GeneReads: Reads (primary read alignments) that were counted towards a gene. This includes exons, mitochondrial reads, and introns (if
scrna-count-introns=truewhich is set by default).Molecules: Total molecule count.
Genes: Unique genes detected.
MitochondrialReads: Reads mapped to the mitochondrial genome.
Filter: Cell-filtering status of the barcode. Possible values are:
PASS: Barcode passes the cell filter threshold.LOW: Barcode count is below the threshold.
Molecule Info File
<prefix>.scRNA.moleculeInfo.h5
The molecule info file provides a summary of the read counts for each barcode, gene, BI and IMI combination. The contents of the HDF5 file can be viewed using HDFview or using the h5dump command:
barcode_idx: the barcode index number (Note: this internal value does not match with the barcodes.tsv line number)
barcodes: the barcode sequence
binning_idx: the binning index (BI) as an integer (e.g. AAA -> 0)
gene_ids: the gene ID
gene_idx: the gene index number (Note: this internal value does not match with the features.tsv line number)
gene_names: the gene name
imis: the IMI value corresponding to the global position where the read was mapped
read_counts: the number of reads found for the given barcode, gene, BI and IMI
The HDF5 file can be further manipulated in python using the h5py package. Below is example code for creating a pandas DataFrame object from the molecule info file.
Barcode Rank Plot
DRAGEN can generate a barcode rank plot with the help of DRAGEN Reports.
The <prefix>.scRNA.barcodeCounts.txt file contains the final raw counts of every barcode in decreasing order. The barcode counts file is used by DRAGEN Reports to produce a barcode rank plot (also known as a "knee" plot).
If DRAGEN is run on the cloud via BSSH or ICA, then DRAGEN Reports will automatically be run and produce the final report.html file, which will contain the barcode rank plot on the "Sample Report" page.
If DRAGEN is run on-prem, then DRAGEN Reports can be run after DRAGEN single cell RNA has finished running. An example DRAGEN Reports command line is:
The barcode rank plot will appear in the output report.html "Sample Report" page. More information on DRAGEN Reports can be found here.
UMAP and Marker Gene Table
DRAGEN can also generate a UMAP plot and marker gene table with the help of DRAGEN Reports. Note: both the UMAP plot and marker gene table are only available if DRAGEN Reports is run on the cloud via BSSH or ICA.
If DRAGEN is run on the cloud via BSSH or ICA, then DRAGEN Reports will automatically be run and produce the final report.html file. When the file is opened, click the sample name to go to the "Sample Report" page and then click "Single-Cell Clustering". The UMAP Plot and Top Marker Genes table will appear.
If DRAGEN is run on-prem, then DRAGEN Reports can be run after DRAGEN single cell RNA has finished running. An example DRAGEN Reports command line is:
The Top Marker Genes table will appear in the output report.html under the "Single-Cell Clustering" tab of the "Sample Report" page. The UMAP plot and marker gene table are not available when running DRAGEN on-prem. More information on DRAGEN Reports can be found here.
Feature Counting
DRAGEN single cell RNA supports feature counting which can be used to profile the expression of proteins (e.g. antibodies or cell-surface proteins). Feature reads differ from transcript (or gene expression) reads in that their read 2 sequence (R2) contains an oligo tag corresponding to a particular type of protein. Reads marked as feature reads will be matched against the list of feature barcode reference sequences (in contrast to gene expression reads which are matched to genes). Only feature reads matching feature barcode reference sequences will be counted in the final matrix.
Inputs
Feature counting is activated by providing a feature barcode reference file using --scrna-feature-barcode-reference. The feature barcode reference file can either be in FASTA format or CSV format (see below for examples). If the feature barcode reference file is provided in FASTA format, then the feature names should be provided in the header and the feature barcode reference sequences should be provided as the sequence.
If the feature barcode reference file is provided in CSV format, the following fields should be included:
id
Feature/gene ID
name
Feature/gene name
read
Read that the feature sequence is on (R1 or R2)
position
Starting position of the feature barcode reference sequence relative to the start of the R1 or R2 sequence (should be >= 0)
sequence
Feature barcode reference sequence
gene_id
(Optional) ID of the gene corresponding to the feature (for downstream analysis, not processed by DRAGEN)
gene_name
(Optional) Name of the gene corresponding to the feature (for downstream analysis, not processed by DRAGEN)
When feature counting is used, a FASTQ list is usually provided containing both the gene expression FASTQs and the feature read FASTQs. The read groups (RGIDs) of the feature read input FASTQs must be provided using the --scrna-feature-barcode-groups=<feature_read_groups> option (multiple read groups should be comma-separated). See the CRISPR mode examples below for how to provide the read groups to the --scrna-feature-barcode-groups option.
Examples
Example feature barcode reference file (FASTA):
Example feature barcode reference file (CSV):
Example command line:
Outputs
When feature counting is enabled, DRAGEN produces additional outputs. The features that were found are appended to the end of <prefix>.scRNA.features.tsv.gz. The feature counts for all the barcodes are found in the <prefix>.scRNA.matrix.mtx.gz raw matrix file and the feature counts for the filtered barcodes are found in the <prefix>.scRNA.filtered.matrix.mtx.gz matrix file.
The feature barcode reference file will also be included in the DRAGEN output directory with the name <prefix>.feature_barcode_reference with the corresponding extension for a CSV or a FASTA file (depending on what was provided as input).
The following columns are also added to the <prefix>.scRNA.barcodeSummary.tsv file when feature counting is enabled:
TotalFeatureReads: The total number of feature reads found for the barcode sequence (including feature reads that did not successfully match to the feature barcode reference sequence).
FeatureMolecules: The number of feature molecules found for the barcode sequence (only including feature reads matching to the feature barcode reference sequence).
Features: The total number of features found for the barcode sequence.
TotalGene+FeatureReads: The sum of the TotalGeneReads and TotalFeatureReads for the barcode sequence.
Gene+FeatureMolecules: The sum of the Molecules and FeatureMolecules for the barcode sequence.
Genes+Features: The sum of the Genes and Features for the barcode sequence.
Feature Metrics
When feature counting is enabled, the following feature metrics are also output in <prefix>.scRNA_metrics.csv:
Total input feature reads: The total number of feature reads provided as input to DRAGEN. The value of this metric will equal the sum of "Total barcoded feature reads" and "Feature reads with invalid barcodes". The denominator for this metric percentage is "Total input feature reads".
Total barcoded feature reads: The total number of barcoded feature reads that will proceed to the feature barcode reference sequence matching stage. N.B. if CRISPR mode is enabled, this value will only include reads for which the specific CRISPR barcode linkers have been correctly detected. The denominator for this metric percentage is "Total input feature reads".
Feature reads with invalid barcodes: The total number of feature reads with barcodes that did not match valid sequences in the sequence list, even after barcode correction (within a hamming distance of 1). If CRISPR mode is enabled, this value also includes reads that did not have the correct CRISPR barcode linker distance. The denominator for this metric percentage is "Total input feature reads".
Feature matching reads: The number of feature reads that match to the feature barcode reference sequences (allowing for a hamming distance of 1). The denominator for this metric percentage is "Total barcoded feature reads".
Feature non-matching reads: The number of feature reads that do not match to the feature barcode reference sequences (even allowing for a hamming distance of 1). The denominator for this metric percentage is "Total barcoded feature reads".
Total feature reads: The total number of feature reads processed by the workflow (this should match the "Total barcoded feature reads" metric above). The denominator for this metric percentage is "Total barcoded feature reads".
Total feature molecules: The total number of feature molecules found (this should typically match the "Feature matching reads" metric above). No denominator.
Fraction of feature reads in passing cells: The fraction of feature reads that were assigned to passing cells, calculated as total feature reads in passing cells over total feature reads.
Mean feature reads per cell (Total input feature reads / Passing cells): The average number of feature reads per passing cell calculated by dividing "Total input feature reads" by "Passing cells".
Median feature molecules per passing cell: The median number of feature molecules per passing cell based on sorting the passing cells by feature molecule count.
Median features per passing cell: The median number of features per passing cell based on sorting the passing cells by feature count.
Total features detected: The total number of unique features detected in the passing cells (i.e. the number of features with at least one molecule in at least one passing cell).
N.B. the "Filtered low MAPQ reads" metric will be excluded from the <prefix>.scRNA_metrics.csv file if feature counting is enabled. The reads that are filtered due to low MAPQ will instead be counted toward the "Filtered non-matching reads" metric.
CRISPR Mode
DRAGEN also supports processing samples from Illumina's CRISPR Single Cell kits using PIPseq technology. Set --scrna-enable-pipseq-crispr-mode to true to activate this mode. This mode also requires feature counting to be enabled (CRISPR sequences are one type of feature read).
Activating PIPseq CRISPR mode automatically configures DRAGEN to process feature reads containing the CRISPR guide RNA (gRNA) sequences. This includes handling the offsets in the cell-barcode position for the gRNA reads, transforming the gRNA cell-barcodes to match the gene expression ones, utilizing the "hook and grab" approach for matching the feature barcode reference sequences and identifying the gRNA reads, and counting the gRNA reads (disregarding BIs and IMIs). Both gene expression and gRNA reads are processed in a single DRAGEN run. Note: unmapped reads do not contribute to gene expression read counts but are still included in gRNA counts if they match the hook sequence.
The “hook and grab” method is a targeted approach for identifying CRISPR perturbation reads. It leverages a conserved sequence within the guide RNA structural region as a “hook” to locate the guide RNA and then “grabs” the specific guide by mapping it to a database of known sequences based on their displacement from the hook.

Currently the application of combining CRISPR with other features (e.g. CITE-seq) in a single run is not fully supported yet due to the specialized handling of CRISPR sequences.
Inputs
PIPseq CRISPR mode is activated by setting --scrna-enable-pipseq-crispr-mode=true. CRISPR mode requires --scrna-enable-pipseq-mode to be set to true as well. Note that when --scrna-enable-pipseq-crispr-mode is set to true, the following options are also set automatically:
--Aligner.aln-min-score=25--scrna-crispr-guide-calling-mode=gmm
For a standard PIPseq CRISPR workflow, a FASTQ list is usually provided containing both the gene expression FASTQs and the CRISPR FASTQs. The read groups of the CRISPR input FASTQs must be provided using the --scrna-feature-barcode-groups=<CRISPR_read_groups> option (multiple read groups should be comma-separated). Furthermore, a CRISPR barcode reference file must be provided using the --scrna-feature-barcode-reference argument.
The CRISPR barcode reference file should be in CSV format with the following header fields:
id
Feature/gene ID
name
Feature/gene name
read
Read that the guide RNA is on (R1 or R2)
target
Target or "hook" sequence. If the value is set to "5p", then the target sequence is based on the beginning of the read (default value: "5p")
position
Displacement of the start of the identifying guide RNA sequence relative to the start of the target sequence (can be positive or negative)
sequence
Guide RNA sequence
feature_type
Name of feature (e.g. CRISPR Direct Capture)
gene_id
(Optional) ID of the gene corresponding to the feature (for downstream analysis, not processed by DRAGEN)
gene_name
(Optional) Name of the gene corresponding to the feature (for downstream analysis, not processed by DRAGEN)
Example FASTQ list CSV file:
Notes:
The RGID is referred to as the "read group" and CRISPR RGIDs should be passed to the
--scrna-feature-barcode-groupsoptionThe RGSM values should match for all reads to be processed in a single DRAGEN run
The RGLB values should match for all reads to be analyzed together (the results will be combined into a single metrics output file)
Example feature barcode reference file:
Examples of target, position, and sequence values for common use cases are provided below:

Example Command Line
CRISPR Outputs
The features that were found are appended to the end of <prefix>.scRNA.features.tsv.gz. The feature counts for all the barcodes are found in the <prefix>.scRNA.matrix.mtx.gz raw matrix file and the feature counts for the filtered barcodes are found in the <prefix>.scRNA.filtered.matrix.mtx.gz matrix file.
CRISPR Metrics
When PIPseq CRISPR mode is enabled, the following CRISPR metrics are also output in <prefix>.scRNA_metrics.csv (in addition to the Feature Metrics):
CRISPR reads with exactly matching barcodes: The number of CRISPR reads with barcode sequences that were not altered during error-correction, i.e. the barcode sequence exactly matched a barcode in the sequence list.
CRISPR reads with corrected barcodes: The number of CRISPR reads with barcode sequences that were corrected to a valid sequence in the sequence list (within a hamming distance of 1).
Total CRISPR reads matching known barcodes: The total number of CRISPR reads with valid barcodes that match the sequence list after error-correction. The metric value will be the sum of the above two metrics.
CRISPR tag mapping rate: The rate at which CRISPR reads match to the feature barcode sequences, calculated as the number of valid CRISPR reads (i.e. CRISPR reads that match to a feature barcode sequence) divided by "Total CRISPR reads matching known barcodes". The value for this metric should match the percentage value for the "Feature matching reads" metric.
CRISPR fraction valid reads in cells: The rate at which valid CRISPR reads are found in cells, calculated as the number of valid CRISPR reads in cells divided by "Total CRISPR reads matching known barcodes".
Fraction passing cells with CRISPR reads: The fraction of passing cells containing at least one CRISPR read, calculated as the number of passing cells with at least 1 CRISPR read divided by the total number of passing cells.
Guide RNA Calling
DRAGEN now also supports guide RNA calling - identifying cells that express guide RNA sequences. Although the counts matrix reports how many feature counts were found for each cell, there is a chance that the feature reads happened to match to the feature barcode reference sequence or the feature count was assigned to the cell by chance rather than due to real signal. Guide RNA calling applies an additional filtering step to distinguish between cells truly expressing guide RNA sequences and noise (similar to how cell filtering is performed to distinguish true passing cells from noise).
DRAGEN utilizes a Gaussian Mixture Model (GMM) approach to perform guide RNA calling. For each feature, DRAGEN constructs a GMM on the feature counts for the set of cells containing that feature. The GMM has two components: the signal distribution and the noise distribution. The model is simplified by assuming the signal and noise distributions have the same variance. Initial parameters are set based on the threshold that maximizes the likelihood of the model and then the Expectation-Maximization (EM) algorithm (add citation) is used to optimize the parameters. After the model converges to a final set of parameters, a threshold is chosen that maximally separates the signal and noise populations, and each cell is classified as expressing the feature or not.
When PIPseq CRISPR mode is enabled, guide RNA calling using GMMs is automatically enabled as well with the --scrna-crispr-guide-calling-mode=gmm option. Guide RNA calling can be disabled by setting --scrna-crispr-guide-calling-mode=none.
Guide Calling Outputs
If guide RNA calling is enabled, then DRAGEN will generate three additional files:
<prefix>.scRNA.positive_cell_guide_assignments.csvA CSV file containing all of the passing cells with the following fields:
cell_barcode: The cell barcode sequence.
num_features: The number of features called for the cell (may be 0).
feature_call: A list of all features called for the cell (separated with a "|"). May be empty if num_features is 0.
num_transcripts: The number of feature counts for each feature for the cell (separated with a "|"). May be empty if num_features is 0.
<prefix>.scRNA.singlet_cell_guide_assignments.csvA CSV file containing only the cells for which exactly one feature was called, with the following fields:
cell_barcode: The cell barcode sequence.
feature_call: The feature that was called for that cell.
<prefix>.scRNA.guide_metrics.csvA CSV metrics file containing the following Guide Metrics:
Total cells: The total number of passing cells.
Total cells with one or more guides detected: The number of cells with at least one feature passing the guide calling threshold.
Total cells with two or more guides detected: The number of cells with at least two features passing the guide calling threshold.
Fractional Downsampling
DRAGEN supports fractional downsampling with PIPseq mode. Fractional downsampling refers to downsampling the input reads are a specified rate. Downsampling is performed randomly on all input reads.
Fractional downsampling is enabled with the following options:
--enable-fractional-down-sampler=true--down-sampler-normal-subsample=<down_sampling_rate>
Set the down_sampling_rate equal to the desired fraction of reads to keep. For example, a rate of 70% means that 30% of the input reads will be filtered out.
More information on DRAGEN fractional downsampling can be found on this page.
Demultiplexing
DRAGEN supports different methods of demultiplexing samples. Genotype demultiplexing can be used if a VCF containing variants differentiating the samples of interest is available. Cell hashing can be used if the samples contain distinguishing oligo-tags (typically in read R2).
Genotype Demultiplexing
Genotype demultiplexing refers to using single nucleotide variants (SNVs) in the samples to differentiate them. A VCF file must be provided with those variants to use genotype demultiplexing. With genotype demultiplexing, DRAGEN assigns sample identity to cells based on the alleles observed in reads in each cell - only SNVs are considered. DRAGEN flags any doublets, such as droplets that contain multiple cells from different individuals.
DRAGEN supports two flavors of genotype demultiplexing. Genotype-based sample demultiplexing should be used if the VCF file matches the samples in the data set provided to DRAGEN (e.g. if the VCF was generated from matched DNA samples for the same individual). Genotype-free sample demultiplexing should be used if the VCF file is derived from a background population corresponding to the samples in the data set - the VCF is not expected to directly match the samples but the variant genotypes in the VCF should appear in the samples at some frequency.
Command Line Options
Genotype-based sample demultiplexing is enabled with the option:
--scrna-demux-sample-vcf=<VCF_file>- the VCF file should contain the genotypes corresponding to the exact samples provided to DRAGEN. Note: the number of samples for DRAGEN to report is automatically detected from the number of SAMPLE fields in the VCF.
Genotype-free sample demultiplexing is enabled with the options:
--scrna-demux-reference-vcf=<VCF_file>- the VCF file should contain genotypes for the background population from which the samples originated (they do not have to match the samples exactly)--scrna-demux-number-samples=<number>— The number of samples to be detected by DRAGEN. DRAGEN will demultiplex the dataset into this many samples.
Additional options:
--scrna-demux-detect-doublets=<true/false>— Enable doublet detection for demultiplexing (set totrueby default).
Example command line for genotype-based sample demultiplexing:
Example command line for genotype-free sample demultiplexing:
Outputs
When genotype demultiplexing (demux) is enabled, additional outputs are produced by DRAGEN. Additional columns are added to the barcode summary file and two new files are genereated: a demux TSV file with the number of reads and SNPs used to demultiplex each barcode and a demux metrics file with some summary metrics for the demultiplexed samples.
The <prefix>.scRNA.barcodeSummary.tsv file contains the per-cell, barcode summary metrics. When demultiplexing is enabled, this file will have these additional columns describing the sample each barcode is assigned to. See the Barcode Summary File section for more information on the <prefix>.scRNA.barcodeSummary.tsv metrics.
SampleIdentity
The SampleIdentity column will contain one of the following values:
sampleX—The particular barcode is uniquely assigned to a sample.AMB(sampleX,sampleY)—The algorithm cannot determine the sample to assign the barcode to.MIX(mixing_coef*sampleX+(100-mixing_coef)*sampleY)—The cell barcode is classified as a doublet of sampleX and sampleY. For example,MIX(50*sampleX+50*sampleY).
IdentityQscore
The IdentityQscore column contains the value used to estimate the confidence of the sample-identity call. After DRAGEN determines the doublet status of the cell as singlet, ambiguous, or doublet, the identity Q-score is defined as -10 * log10(Probability that the assigned identity is correct, given the second most likely identity and the doublet status). Higher values for the identity Q-score correspond to more confident sample-identity calls.
The <prefix>.scRNA.demux.tsv file contains sample-demultiplexing statistics that were used to infer the sample identity of each cell barcode.
Barcode
The cell barcode associated with the cell.
DemuxSNPCount
The number of SNPs that the reads of the cell barcode intersect.
DemuxReadCount
The number of reads of the cell barcode that intersect at least one SNP.
Pure samples
Samples from the VCF file.
BestMixtureIdentity
Mixture sample with the highest log likelihood. Only available if --single-cell-demux-detect-doublets=true is set.
BestMixtureLogLikelihood
The log likelihood of the best mixture sample. Only available if --single-cell-demux-detect-doublets=true is set.
The <prefix>.scRNA.demuxSamples_metrics.csv file contains per-sample metrics for each demultiplexed sample, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv.
Passing cells
The number of cells called (i.e. the number of barcodes that passed the filter threshold).
Fraction genic reads in cells
The fraction of reads assigned to passing cells that are genic reads i.e. matching to genes, calculated as total counted gene reads in passing cells over total gene reads in passing cells. Genic reads correspond to counted gene reads which include exon-matching, intron-matching, and mitochondrial reads.
Median reads per cell
The median number of reads per passing cell based on sorting the passing cells by read count.
Median molecules per cell
The median number of molecules per passing cell based on sorting the passing cells by molecule count.
Median genes per cell
The median number of genes per passing cell based on sorting the passing cells by gene count.
Total genes detected
The total number of unique genes detected in the passing cells (i.e. the number of genes with at least one molecule in at least one passing cell).
Cell-Hashing
Cell-hashing refers to using oligo-tags usually in read R2 sequences to demultiplex samples. A cell-hashing reference file is required to match those oligo-tags to the corresponding samples. DRAGEN will match each R2 sequence against the cell-hashing reference sequences to determine the sample identity of each barcode. DRAGEN flags any doublets, such as droplets that contain multiple cells from different individuals.
The cell-hashing reference file should be a FASTA or CSV file and is specified with the option --scrna-cell-hashing-reference=<cell_hashing_ref_file>. The format of the file is similar to that required for feature counting (see Feature Counting). If the cell-hashing reference file is provided in FASTA format, then the sample names should be provided in the header and the oligo-tag reference sequences should be provided as the sequence.
If the cell-hashing reference file is provided in CSV format, the following fields should be included:
id— Sample identifiername— Sample nameread— Read that the oligo-tag sequence is on (R1 or R2)position— Starting position of the oligo-tag reference sequence relative to the start of the R1 or R2 sequence (should be >= 0)sequence— DNA sequence of the sample-specific oligo. For example, CAGCCCGATTAAGGT.type— Type of feature (e.g. Cell Hashing)
Read groups (RGID) for FASTQ files that contain cell hashing oligos should be specified with the option --scrna-hto-barcode-groups=<cell_hashing_RGIDs>.
Command Line Options
Cell-hashing is enabled with the options:
--scrna-cell-hashing-reference=<cell_hashing_ref_file>- the FASTA or CSV file with the sample-specific oligo-tags.--scrna-hto-barcode-groups=<cell_hashing_RGIDs>- the RGIDs of the FASTQ files containing cell hashing reads.
Additionally, doublet detection can be enabled/disabled with the following option:
--scrna-demux-detect-doublets=<true/false>— Enable/disable doublet detection for cell-hashing sample demultiplexing (set totrueby default).
Example command line for cell-hashing:
Outputs
When cell-hashing sample demultiplexing (demux) is enabled, additional outputs are produced by DRAGEN. An additional column is added to the barcode summary file and two new files are genereated: a demux TSV file with the number of reads used to demultiplex each barcode and a demux metrics file with some summary metrics for the demultiplexed samples.
The <prefix>.scRNA.barcodeSummary.tsv file contains the per-cell, barcode summary metrics. When demultiplexing is enabled, this file will have an additional column describing the sample each barcode is assigned to. See the Barcode Summary File section for more information on the <prefix>.scRNA.barcodeSummary.tsv metrics.
SampleIdentity
The SampleIdentity column will contain one of the following values:
sampleX—The particular barcode is uniquely assigned to a sample.AMB(sampleX,sampleY)—The algorithm cannot determine the sample to assign the barcode to.MIX(mixing_coef*sampleX+(100-mixing_coef)*sampleY)—The cell barcode is classified as a doublet of sampleX and sampleY. For example,MIX(50*sampleX+50*sampleY).
The <prefix>.scRNA.demux.tsv file contains sample-demultiplexing statistics that were used to infer the sample identity of each cell barcode.
Barcode
The cell barcode associated with the cell.
DemuxReadCount
Cell-hashing read count for each sample.
The <prefix>.scRNA.demuxSamples_metrics.csv file contains per-sample metrics for each demultiplexed sample, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv.
Passing cells
The number of cells called (i.e. the number of barcodes that passed the filter threshold).
Fraction genic reads in cells
The fraction of reads assigned to passing cells that are genic reads i.e. matching to genes, calculated as total counted gene reads in passing cells over total gene reads in passing cells. Genic reads correspond to counted gene reads which include exon-matching, intron-matching, and mitochondrial reads.
Median reads per cell
The median number of reads per passing cell based on sorting the passing cells by read count.
Median molecules per cell
The median number of molecules per passing cell based on sorting the passing cells by molecule count.
Median genes per cell
The median number of genes per passing cell based on sorting the passing cells by gene count.
Total genes detected
The total number of unique genes detected in the passing cells (i.e. the number of genes with at least one molecule in at least one passing cell).
STAR Aligner
To add optional compatibility with legacy PIPseq workflows, we have included Spliced Transcripts Alignment to a Reference (STAR) as an alternative to the default hardware-accelerated DRAGEN mapper. STAR is an open-source RNA-sequence read mapper, with support for splice-junction and fusion read detection.
For further information on STAR mapper:
While STAR may be run as an independent executable, running STAR as an internal feature from DRAGEN offers simpler and seamless functionality within the scRNA pipeline.
Important notes:
To run STAR, the Single Cell RNA pipeline must be enabled with either
--enable-single-cell-rna trueor--scrna-enable-pipseq-mode true.STAR uses its own format for reference genome hash tables. This format is both different and incompatible with DRAGEN's reference. When using STAR, both a DRAGEN and a STAR reference must be supplied, and these references must be built from identical whole-genome FASTA files and annotations. If STAR and DRAGEN are given references built from different files, undefined behavior may result.
When running STAR mapper, the following intermediate files will be created:
If a FASTQ list is provided as input and barcode/BI sequences are contained in read R1, then an additional barcode/BI merged FASTQ file will be created with the following name:
<prefix>.umi_merged.rg_<Read group index>.fastq. An intermediate FASTQ file will be created for each read group provided.A BAM file with the name:
<prefix>.star_mapper.Aligned.out.bam. This is the output BAM record containing all mapped reads.
The following are options for configuring STAR within DRAGEN:
--single-cell-enable-star-mapper-- ("true" or "false") Enable STAR and use it for mapping in lieu of the DRAGEN mapper.--single-cell-star-mapper-build-reference-- ("true" or "false") Enable STAR and use it for building STAR-formatted reference genome hash tables. STAR may build a reference and map reads from the same command line. STAR cannot use a DRAGEN-formatted reference.--single-cell-star-mapper-reference-dir-- (required, string) Users must supply a directory for STAR-formatted reference genome hash tables. If STAR is building a reference, this directory is where the output reference files will be stored. If the STAR mapper runs from a pre-built reference, this directory is where the mapper looks for the reference files.--single-cell-star-mapper-path-- (optional, string) Users may input their own path to a STAR executable. If not supplied, DRAGEN uses an internally-supplied STAR executable.--single-cell-star-mapper-options-- (optional, string) Users may manually adjust settings for STAR and overwrite the default settings set by DRAGEN. For a complete list of settings for STAR, please refer to its user manual--single-cell-compress-intermediate-fastq-- (optional, "true" or "false", default="false") When a FASTQ list is provided as input and barcode/BI sequences are contained in read R1, intermediate FASTQ files are created. This option enables compressing those intermediate FASTQ files with a ".fastq.gz" format.--single-cell-num-parallel-fastq-umis-- (optional, integer, default=4) When a FASTQ list is provided as input and barcode/BI sequences are contained in read R1, intermediate FASTQ files are created. This option sets how many parallel FASTQ files are processed simultaneously.
Because STAR uses a different map/align algorithm than DRAGEN, there will be some differences in the read-mapping results and thus in the cell and gene counts results as well.
Some STAR options that can be added (using the --single-cell-star-mapper-options argument) to further align STAR's output to DRAGEN's include the following:
--alignIntronMax 25000- limit the maximum intron size to 25,000 bp--clip5pNbases 1- clip one base from the 5' end of read 2--clip3pAdapterSeq polyA- trim poly-A sequences from 3' end of read 2
Command-line Examples Using STAR
Example command line for running the STAR mapper with DRAGEN single cell for Illumina Single Cell Prep.
Example command line for running the STAR mapper from a FASTQ list CSV with two non-default settings for STAR.
Example command line for creating a STAR reference and running the STAR mapper with DRAGEN single cell for Illumina Single Cell Prep. Note: STAR requires an uncompressed annotations file.
Tertiary Analysis
We recommend using Illumina Connected Multiomics (ICM) for single cell tertiary analysis. ICM is a fully-featured tool that allows users to combine and analyze different biological data sets to turn raw samples into published results faster and more easily.
Loading Count Files to a Dense Matrix
Some users might want to explore the output matrix in a human-readable format. There are several tools available including Illumina Connected Multiomics (ICM), scanpy, and seurat. To do so, a possible way would be to load the matrix in a "dense" dataframe in Python (similar methodologies can be used in other programming languages). It is important to remember, however, that when possible a "sparse" representation of the matrix is preferable, due to the significant usage of memory and disk space of "dense" matrices. Several tools are available to work efficiently with "sparse" representations of single cell matrices (e.g., scanpy in Python).
The matrix can be converted into a "dense" representation through two Python modules: scanpy and pandas. This has been tested with Python 3.9.7, scanpy 1.10.3, pandas 2.2.3.
First, it is necessary to install the required libraries:
Within Python, the matrix can be loaded in "dense" representation using the following commands:
The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.
References
Fontanez et al., 2024. Intrinsic molecular identifiers enable robust molecular counting in single-cell sequencing. bioRxiv 2024.10.04.616561.
“The Coupon Collector Problem.” 2022. University of Alabama in Huntsville. April 24, 2022. https://stats.libretexts.org/@go/page/10250.
Last updated
Was this helpful?