Illumina PIPseq scRNA

Illumina scRNA Prep

The DRAGEN scRNA PIPseq Pipeline is designed to process data obtained with the proprietary Illumina Single Cell 3' RNA Prep library based on PIPseq technology [1]. The command line option --scrna-enable-pipseq-mode=true runs the scRNA pipeline with methods that are optimized to process the unique structure of the Illumina Single Cell 3' RNA Prep library.

DRAGEN scRNA Pipeline

The DRAGEN scRNA Pipeline can process multiplexed single-cell RNA-Seq data sets from reads to a cell-by-gene expression matrix.

The pipeline includes the following functions:

  • Cell-barcode and binning index extraction and error correction for each read

  • RNA-seq (splice-aware) alignment and matching to annotated genes

  • Molecule counting per cell and gene to measure gene expression

  • Sparse matrix, molecule info (HDF5 file), h5ad, and QC metrics outputs

  • Feature counting, e.g. CITE-seq, Perturb-seq, etc.

The functionality and options related to alignment and gene annotation are identical to the RNA-Seq pipeline (run in single-end read mode). For more information, see the DRAGEN RNA Pipeline. None of the other DRAGEN RNA-Seq modules (variant calling, splice variant calling, gene fusion calling and transcript-level gene expression quantification) are supported for Single-Cell RNA.

Input Files

Alignment Reference

Use a standard DRAGEN RNA reference genome or hashtable for the scRNA Pipeline. Recommended references (found on the DRAGEN Product Filesarrow-up-right page) include the linear hg38 v5 hashtable, the linear v5 versions of the hg19 and hs37d5 hashtables, and the linear hg38-mm39 v5 hashtable for barnyard datasets. The reference is provided with the --ref-dir (-r) option. Custom references can be built using --ht-build-rna-hashtable=true (see the DRAGEN Prepare a Reference Genome page for more details on creating a reference hashtable). The pipeline also requires a gene annotation file in GTF format, provided with the --annotation (-a) option. Recommended annotation files include the gencode v44 and the gencode hg38_v44+mm39_vM30 annotation files (available on the DRAGEN Product Filesarrow-up-right page as well).

It is recommended to exclude from the analysis pseudogenes and other entries that may overlap with protein-coding genes and reduce mapping accuracy. This can be done by setting the --scrna-filter-gene-biotypes option to true, which will automatically exclude the following biotypes:

  • shortRNA

  • microRNA

  • rRNA

  • pseudogenes

Read Input

The DRAGEN scRNA Pipeline typically expects read input as either FASTQ files or in a FASTQ list. The transcript sequence is typically provided in read 2 (R2) and the cell-barcode and binning index (BI) will be provided in read 1 (R1). Alternatively, the cell-barcode and BI can be provided in the read name (QNAME). More info about the barcode and BI for Illumina scRNA Prep is provided in the FASTQ Processing section.

FASTQ Input

If only a single pair of FASTQ files will be run, then FASTQ file input can be used. FASTQ file 1 is expected to contain the barcode and BI sequences and FASTQ file 2 will contain the transcript sequence (this is enabled using the --umi-source read1 option which is automatically set when PIPseq mode is enabled). The read group sample name (RGSM) and read group ID (RGID) should be set as well (to be used when labeling samples in output files).

Alternatively, the barcode and BI can be provided in the read names of the transcript reads in a single FASTQ file. See below for details on adding the barcode and BI to the read name (using the --umi-source qname option).

FASTQ List Input

A single-cell RNA fastq-list file is a CSV file with the following mandatory columns:

Column
Description

RGID

Read group ID

RGSM

Read group sample

RGLB

Read group library

Lane

Sequencing lane

Read1File

Read 1 FASTQ file (usually FASTQ file with reads containing cell-barcodes and BIs)

Read2File

Read 2 FASTQ file (usually transcript reads)

For example:

FASTQ list options (--umi-source=read1 automatically set with PIPseq mode):

The same FASTQ list option can be used if the barcodes and BIs are provided in the read names.

Barcode and BIs in the Read Name

If barcode and BI sequences are provided in the read name, then follow these steps:

  • Provide the transcript reads as a single-end FASTQ file with the Barcode+BI sequence in the eighth field of the read-name line.

  • Separate sequences using a colon.

In the example, the sequence starting with AGA is the barcode+BI read, the sequences beginning with GTG and ACG are the i7 and i5 sequences, respectively, and the sequence starting with TGCA is the transcript read sequence.

These FASTQ files can be generated by BCL Convert using the UMI settings to define the single-cell barcode+BI read. If using BCL Convert, enter the barcode/BI information using the OverrideCycles setting. For more information, see the BCL Convert Software Guide (document # 1000000094725).

Note: BCL Convert refers to the entire single-cell barcode+BI sequence as the UMI.

The resulting file can be provided as fastq file 1 using FASTQ input or fastq file 2 using FASTQ list input. This option is also compatible with BAM files using the following options:

Using Multiple Libraries

The scRNA pipeline processes a single biological sample per DRAGEN run by default. To process multiple single-cell libraries together, split the single sample into multiple single-cell libraries with a unique set of cells in each. DRAGEN keeps the cells (barcodes and BIs) from each library separate and provides merged outputs across all. Read groups are used to specify the library for each FASTQ file using the RGLB attribute.

Example FASTQ list CSV file for multiple libraries:

Notes:

  • The RGID is referred to as the "read group" of the input file

  • The RGSM values should match for all reads to be processed in a single DRAGEN run

  • The RGLB values should match for all reads to be analyzed together (the results will be combined into a single metrics output file). There are two separate libraries here so two metrics files will be produced.

scRNA PIPseq Licensing

For DRAGEN OnPrem Servers and DRAGEN FPGA Cloud BYOL customers, this pipeline requires the 'PipSeq' license. The cost of the PipSeq license is included with the associated Assay and has been automatically assigned. For those using DRAGEN OnPrem Servers, please see the OnPrem Licensing Installation Reference Section for more information. If your pre-existing DRAGEN OnPrem Server is not connected to the internet, please contact usarrow-up-right to retrieve your PipSeq License.

Usage

The following is an example command line to run the DRAGEN Single Cell RNA PIPseq Pipeline.

When the --scrna-enable-pipseq-mode batch option is set to true, the following options are automatically included:

  • --enable-rna=true

  • --enable-single-cell-rna=true

  • --bypass-anchor-mapping=true

  • --annotation-file-ignore-biotypes=pseudogene,shortrna,rrna

  • --scrna-barcode-position=0_7+11_16+20_25+31_38

  • --scrna-umi-position=39_41

  • --umi-source=read1

  • --scrna-barcode-sequence-list=<PIPSEQ_BARCODE_SEQ_LIST>

  • --scrna-enable-barcode-phasing=true

  • --scrna-num-barcode-phases=4

  • --scrna-count-introns=true

  • --scrna-min-mapq=1

  • --qc-enable-depth-metrics=false

  • --read-trimmers=fixed-len,adapter,polyg

  • --trim-r1-5prime=1

  • --trim-adapter-stringency=12

  • --trim-adapter-r1-5prime=<PIPSEQ_ADAPTER_LIST>

  • --trim-adapter-read1=<POLYA_TRIM_LIST>

  • --soft-read-trimmers=none

Note that the binning index position is set using the --scrna-umi-position option. The <PIPSEQ_BARCODE_SEQ_LIST>, <PIPSEQ_ADAPTER_LIST>, and <POLYA_TRIM_LIST> are packaged with DRAGEN and the path to the files will be automatically detected. --scrna-min-mapq sets the minimum MAPQ threshold for reads to be counted. If any of these options are also included in the DRAGEN command line, then it will override the value set by the batch option.

Additional optional arguments:

  • --rna-library-type --- Set the orientation of transcript reads relative to the genomes. Enter SF for forward, SR for reverse, or U for unstranded. The default is SF.

  • --enable-map-align-output --- Save the output BAM. The default is false. See the BAM File Output for more details.

  • --single-cell-threshold --- Set the cell filtering method. Available options are fixed, ratio, and inflection. The default is ratio. See Cell Filtering for more details.

Pipeline

FASTQ Processing

The purpose of this step is to:

  • Extract cell-barcodes and binning indices (BIs) from reads

  • Correct cell-barcode sequences

  • Remove unwanted technical sequences from the data before mapping

  • Assign intrinsic molecular identifiers (IMIs) to reads

Structure of the Illumina Single Cell 3' RNA Prep FASTQ Input Files

The figure below describes the fragmentation process taking place in PIPseq chemistry. Each read in R1 includes 0 to 3 bases at the start indicating the "phase" of the read, a barcode sequence with 4 tiers, which identifies the individual PIP, followed by a 3-base binning index (BI) sequence. The number of phasing bases varies from read to read to increase base diversity for Read 1. Each barcode tier is composed of 96 possible sequences, allowing 96^4 total barcode combinations. By default, the barcode positions are set to 0_7+11_16+20_25+31_38. The binning index provides information for identifying the individual molecules within a cell. By default, the UMI positions are set to 39_41.

Read 1 barcode and BI structure:

R2 includes the sequenced cDNA constructs created from the captured mRNA, which contain random cut sites that serve as intrinsic molecular identifiers (IMIs), and are used for molecular counting. Up to fifteen different cut sites can be created from a single captured molecule using five cycles of whole transcriptome amplification. The genome alignment position, which will be different for each cut site, serves as the IMI for each read.

Barcode Extraction and Correction

The barcode sequence for each read is extracted from read R1 and then compared against the barcode sequence list. Illumina Single Cell 3' RNA Prep uses a tiered barcode system, where every one of four tiers includes one of a specified list of possible barcodes. Barcode matching can therefore be done highly efficiently by matching each tier in isolation.

DRAGEN performs barcode phasing for each read. Between 0 and 3 bases are included at the start of the barcode to diversify the sequences. These are referred to as the "phase" of the barcode and must be removed. DRAGEN processes the barcodes by considering all possible sequences from each of the four phases, counts the number of mismatches between each sequence and the best matching sequence in the barcode sequence list, and chooses the phase with the fewest mismatches.

After phasing, DRAGEN performs error correction for each barcode. The pipeline allows for a maximum Hamming distance of 1 per tier, i.e., these may be one mismatch between the candidate and the sequence list barcode for each tier and still be matched. Barcodes within that Hamming distance will be corrected to the first matching barcode in the barcode sequence list. Barcodes with a greater Hamming distance will be considered non-matching and removed from downstream processing. If a read is truncated and missing the barcode sequence, then it will be considered a read with an invalid barcode. DRAGEN reports the total number of reads with invalid barcodes, exactly-matching barcodes, corrected barcodes, and non-matching barcodes in the <prefix>.scRNA_metrics.csv file. See the Barcode Read Metrics section for more details.

The BI sequence is also extracted from R1 but no correction is performed on it.

The barcode positions, BI positions and the expected barcode list are set automatically with the --scrna-enable-pipseq-mode argument, so they do not need to be specified by the user.

Custom barcode positions, BI positions and expected barcode list can be set with the following arguments:

Input
Command Line Argument
Default Value in PIPseq Mode

Barcode positions

--scrna-barcode-position

0_7+11_16+20_25+31_38

BI position

--scrna-umi-position

39_41

Expected barcode list

--scrna-barcode-sequence-list

<PIPSEQ_BARCODE_SEQ_LIST_PATH>

R2 Trimming

The R2 file, which includes the cDNA construct, also includes technical sequences related to the PIPseq chemistry, namely the template switch oligo (TSO) and poly-A sequences. The prevalence of those sequences is inversely correlated with the fragment size. Because a read is less likely to be mapped accurately if it contains extraneous sequences, TSO sequences are removed from the 5’ end and poly-A sequences are removed from the 3’ end in the R2 FASTQ file. In addition, the first base, a constant T, is removed from the 5’ end of R2. These are done using the DRAGEN trimmers, with the settings automatically included under the --scrna-enable-pipseq-mode flag. Stats on the number of reads and bases trimmed can be found in the <prefix>.trimmer_metrics.csv file.

Mapping

The DRAGEN RNA mapper is used to map all of the R2 reads to the reference genome. Note that the barcode and BI sequences are extracted from R1 and only the R2 reads are passed to the mapper. Therefore the mapper considers them to be single-ended reads labeled as R1 in the mapping metrics. Further information on the DRAGEN RNA mapper can be found on the RNA alignment page.

DRAGEN is also introducing support for the STAR aligner in DRAGEN 4.5 as an alternative to the DRAGEN RNA mapper. Further information on STAR aligner support can be found in the STAR aligner section.

Each read is mapped to the reference and then matched to genes based on where it aligns. If the read matches to the exons of a gene, then it is classified as an exon-matching read. Otherwise it is given another classification. DRAGEN scRNA currently only includes the primary alignment of each read (assigned randomly by the mapper if there are multiple alignments with identical scores). Secondary and supplementary alignments are skipped (but they are still included in the BAM output). The possible classifications for each read (based on its primary alignment) are:

  • Unique exon matching --- reads that overlap with gene exons by at least 50% of read length

  • Unique intron matching --- reads that overlap with the gene by at least 20% of read length

  • Mitochondrial --- reads that map to the mitochondrial genome

  • Filtered antisense --- reads that overlap a gene on the opposite strand defined by the library type

  • Filtered ambiguously matching --- reads that match to multiple genes

  • Filtered low MAPQ --- reads with low MAPQ (below 1 by default) that map ambiguously

  • Filtered non-matching --- reads that do not map to any genes

For the remainder of the pipeline, only the unique exon matching, unique intron matching, and mitochondrial reads (together referred to as "mapped reads") are counted toward the molecule counts of each barcode. The other "filtered" reads will be included in the "Total" reads per barcode but not counted towards the molecule counts of each gene. If --scrna-count-introns=false is set, then intron reads will also be considered "filtered" rather than "mapped".

The total number of reads in each classification can be found in the <prefix>.scRNA_metrics.csv file as described in the Read Matching Metrics section.

Biotype Filtering

Add the argument --scrna-filter-gene-biotypes=true to exclude pseudogene, shortRNA and rRNA biotypes during mapping.

Transcript Counting

Illumina Single Cell 3' RNA Prep generates fragments with unique starting positions, which serve as intrinsic identifiers that can be used for molecular counting. Reads are grouped together based on the cell barcode and the assigned gene. Within each barcode and gene combination, IMIs are grouped in one of 64 bins, based on the 3-base binning index. For each bin, all identical IMIs are collapsed into a single count, since they are likely PCR duplicates of the same fragment generated during library prep.

At this point, any barcode and gene combination which has ten or fewer unique binning indexes is assigned the number of unique binning indexes as its final count estimate. The pipeline then totals the number of IMIs associated with each remaining barcode and gene combination, and divides that number by a correction factor estimated from the data, which accounts for the additional copies generated from a single captured molecule during five amplification cycles. The final count estimate is the maximum between the floor of this value and the number of unique binning indexes for this barcode and gene.

Estimating the correction factor is described in detail in the next section. The example below shows molecular counting with a correction factor of 1.35. The barcode and gene combination in this example has 12 unique binning indexes, so the correction factor is applied to produce the final count. The correction factor of 1.35 indicates that, on average, a single molecule produced 1.35 fragmented copies in this sample. Note that the corrected count will always be at least equal to the number of unique bins, and will not exceed the number of unique IMI and bin combinations. The result of this correction process, 14, is the number that is recorded in the count matrix for this barcode and gene.

Estimating the Correction Factor

A single molecule can produce anywhere between one and fifteen fragmented copies during five amplification cycles. The true distribution of the number of fragmented copies varies between datasets, and is influenced by factors such as sample type and sequencing depth. For a given dataset, the correction factor is an estimate of the average of this distribution, the average IMIs per molecule (IPM) in the sample.

Because all IMIs from the same parent molecule share a binning index, the number of unique binning indexes observed within a specific barcode and gene is determined by the number of molecules, and is not impacted by the number of IMIs that were produced by the molecules. This means the probabilistic relationship between the number of unique bins and the true number of molecules in a barcode and gene combination is constant, and is the result of random sampling from the 64 possible bin indexes when each molecule is captured. We can represent this process by the Coupon Collector Problem (CCP) [2], which defines this distribution of the number of samples (molecules) required to a certain number of distinct types (unique bins) among the total collection of unique types (64 binning indexes). The combination of constant molecular probabilities based on the number of unique bins and observed unique IMIs in barcode and gene combinations in the sample enable the IPM estimation. We select the subset of barcode and gene combinations with between 5 and 32 unique bin indexes. For each of these barcode and genes, we divide the total number of IMIs by the average number of molecules expected based on the number of unique bin indexes. The estimated average IMIs per molecule (IPM) for the sample is the average of these values across this subset of barcode and genes. The subset of barcode and gene combinations is used rather than all available combinations because we expect the CCP model to be most representative of the true molecular count and noise from sources such as sequencing errors or reads outside the cell fraction to be minimized within this subset.

After calculating the average IMIs per molecule in the sample, we apply the IPM to estimate the molecular count from the IMI count in individual barcode and gene combinations. The estimated molecular count for a barcode and gene is the total number of IMIs divided by the IPM, rounded down. The more true molecules a barcode and gene combination has, the closer the estimated molecular count is expected to be with this approach (as the number of molecules increases, the true average IMIs per molecule for a barcode and gene should approach the average IPM of the sample). For barcode and gene combinations with very few molecules, the number of unique bins is expected to be a better predictor of the molecular count than the number of IMIs because the variance in the true IMIs per molecule among this group is high since the number of samples (molecules) in each individual barcode and gene combination is low. For this reason, we apply IPM correction for barcode and gene combinations with more than 10 unique bin indexes, and otherwise the corrected count is equal to the number of unique bin indexes.

Cell Filtering

DRAGEN uses a threshold on the total count of distinct molecular identifiers (or reads) per cell barcode, to determine which barcodes are likely to correspond to single-cells in the original sample, instead of background noise. The threshold is determined based on the distribution of counts along barcodes and on the expected number of true cells in the sample.

  • --single-cell-number-cells --- [Optional] Sets the expected number of cells. The default value is set to 2,000, 10,000 or 100,000 which is estimated based on the total number of barcodes observed. Adjust this value if using a fixed threshold or if the number of called cells deviates from expectation.

  • --single-cell-threshold --- Specifies the method for determining the count threshold value. The available values are fixed, ratio, or inflection (the default is ratio).

    • If using ratio, DRAGEN calculates the expected number of cells using max(T_e, T_m).

      • T_m is a threshold based on a fraction of the molecules seen in the most abundant cell-barcodes. If the top decile cell is the cell occurring at the 90th percentile by molecule count, then T_m is calculated as 10% of the number of molecules occurring in the top decile cell.

      • T_e is a threshold based on a fraction of the least abundant expected cell. The least abundant expected cell is defined as the cell occurring at the rank of the number of expected cells (e.g. if the number of expected cells is 2,000, then the least abundant expected cell is the cell ranked 2,000th by molecule count abundance). T_e is calculated as 50% of the number of molecules found in the least abundant expected cell.

      • For example, if dataset A is provided to DRAGEN with an expected number of cells of 2,000 and it is found to contain 30,000 cell-barcodes, and the 3,000th most abundant cell(i.e. the top decile cell) has 1,100 molecules while the 2,000th most abundant cell (i.e. the least abundant expected cell) has 3,700 molecules, then T_m would be equal to 10% * 1,100 = 110 molecules and T_e would be equal to 50% * 3,700 = 1,850 molecules. The threshold would then be max(1,850, 110) = 1,850 molecules.

    • If using inflection, DRAGEN estimates the count threshold by analyzing inflection points in the cumulative distribution of counts. The second derivatives of the read/molecule counts are used to estimate the inflection point that is chosen as the threshold.

    • If using fixed, the count threshold is set to force the expected number of cells (--single-cell-number-cells option), rather than estimating it from the data. The exact number of passing cells might be slightly larger than the number of requested single-cells because several cells in the tail of the count distribution can have the same count.

  • --single-cell-threshold-filterby --- [Optional] Set the count distribution to consider for cell filtering. Can be either umi (default) or read.

To set a specific, fixed number of cells, rather than use the automatically determined threshold, use the following command:

The command forces DRAGEN to select the top X cells and extra cells with the same number of counts of the last selected cell.

Outputs

Single-cell RNA outputs are found in the standard DRAGEN output location. All single-cell RNA output files contain the word scRNA in their names.

Count Files

The count files contain the overall molecule counts on a per-cell gene expression level. Counts are reported in two formats: Market Exchange (MEX) and h5ad format. Furthermore, counts are reported for all barcodes (raw) as well as for the set of filtered barcodes that pass the cell filtering threshold.

Raw MEX files:

  • <prefix>.scRNA.matrix.mtx.gz

    • Count of molecules for each cell/gene pair in sparse matrix format.

  • <prefix>.scRNA.barcodes.tsv.gz

    • Barcode sequence for each cell from the matrix. This includes all cell-barcodes.

  • <prefix>.scRNA.features.tsv.gz

    • Gene name, ID and feature type for each gene and feature in the matrix.

Raw h5ad file:

  • <prefix>.scRNA.h5ad

    • Count matrix in AnnData format. Barcodes are stored in the OBS group and genes are stored in the VAR group. Further information on the h5ad format can be found on the AnnData pagearrow-up-right.

    • Note: the h5ad output can be disabled with the option --single-cell-enable-h5ad-output=false

Filtered MEX files (containing only cells that pass the cell filtering threshold):

  • <prefix>.scRNA.filtered.matrix.mtx.gz

    • Count of unique MIs for each passing cell/gene pair in sparse matrix format.

  • <prefix>.scRNA.filtered.barcodes.tsv.gz

    • Cell-barcode sequence for each passing cell from the matrix.

  • <prefix>.scRNA.filtered.features.tsv.gz

    • Gene name, ID and feature type for each gene and feature in the matrix (identical to <prefix>.scRNA.features.tsv.gz).

Filtered h5ad file:

  • <prefix>.scRNA.filtered.h5ad

    • Filtered count matrix in AnnData format. Barcodes are stored in the OBS group and genes are stored in the VAR group. Further information on the h5ad format can be found on the AnnData pagearrow-up-right.

    • Note: the h5ad output can be disabled with the option --single-cell-enable-h5ad-output=false

BAM file

  • --enable-map-align-output=true --- default is false

  • --output-format=BAM --- default is BAM A BAM file can be generated by adding in the options above. Note that enabling the BAM output may increase the runtime and memory usage of DRAGEN.

All alignments of all of the reads are included in the BAM, with unmapped reads coming at the end. The primary alignment of each read (the only that is considered in the pipeline, see Mapping) is annotated with single cell-specific BAM tags. The following tags are used:

  • CR --- Uncorrected barcode sequence

  • CB --- Corrected barcode sequence (identical to CR if no correction occurred)

  • UR --- Uncorrected BI sequence

Non-primary alignments and reads not associated with a barcode will not have any single cell BAM tags.

By default, the BAM file will not be sorted but all reads with the same barcode will be grouped together. The following options may be used to force the BAM to be sorted by reference position: --scrna-split-counts-by-barcode=false. However, this may lead to significant memory issues and is not recommended. If a sorted BAM file is desired, it is recommended to have DRAGEN produce an unsorted BAM and then run samtools sort input.bam -o output_sorted.bam to sort by position.

Additional metrics related to the mapping of reads to the reference may be found in the <prefix>.mapping_metrics.csv file. Note that the "mapped reads (RNA) to rRNA and filtered" metric in that file only includes non-coding rRNA (it excludes protein-coding ribosomal RNA transcripts), so it is not applicable to single cell RNA analyses.

Single Cell RNA Metrics

The <prefix>.scRNA_metrics.csv file contains per sample single cell RNA metrics. In scRNA, mapped reads are currently reported by default under R1 metrics, irrespective of whether R1 is the read aligned to the transcriptome or the one containing the cell barcode and BI. The single cell RNA metrics file is in CSV format with the following fields:

  • Metric group

  • Sample name

  • Metric name

  • Metric value

  • Metric percentage (omitted for some metrics)

The single cell RNA metrics are broken down into four metric groups (additional groups are included if certain options are enabled).

Barcode Read Metrics

Barcode read metrics report the breakdown of how many of the input read pairs contain valid barcodes (matching to the acceptable sequence list). The denominator of the metric percentages in this group is "Total input reads".

  • Reads missing barcodes: The number of reads for which the overall barcode sequence (cell barcode + BI) failed basic checks, e.g. the barcode read was missing or too short.

  • Reads with exactly matching barcodes: The number of reads with barcode sequences that were not altered during error-correction, i.e. the barcode sequence exactly matched a barcode in the sequence list.

  • Reads with corrected barcodes: The number of reads with barcode sequences that were corrected to a valid sequence in the sequence list (within a hamming distance of 1).

  • Reads with non-matching barcodes: The number of reads with barcode sequences that did not match valid sequences in the sequence list, even after barcode correction (within a hamming distance of 1).

  • Total input reads: The total number of reads provided as input for this DRAGEN run. The metric value will be the sum of the above four metrics.

  • Total barcoded reads: The number of reads with valid barcodes that will proceed to the read matching stage. The metric value will be the sum of "Reads with exactly matching barcodes" and "Reads with corrected barcodes".

Read Matching Metrics

Read matching metrics report how many barcoded reads match to genes and how many were filtered due to various reasons. The denominator of the metric percentages in this group is "Total barcoded reads".

  • Unique exon matching reads: The number of reads that uniquely match with >50% overlap to the exon region of a single gene.

  • Unique intron matching reads: The number of reads that do not meet the exon-matching criteria above but map to a single gene with >20% overlap. N.B. if --scrna-count-introns=false is set, then this metric will become "Filtered intron matching reads".

  • Filtered antisense reads: The number of reads that overlap a gene on the opposite strand based on the library type of the reads.

  • Filtered ambiguously matching reads: The number of reads that match to multiple genes equally well.

  • Filtered low MAPQ reads: The number of reads that are filtered due to low MAPQ values (default threshold of 4).

  • Filtered non-matching reads: The number of reads that do not match to any gene.

  • Mitochondrial reads: The number of reads that map to mitochondria.

  • Mapped reads: The number of reads considered to be "mapping," i.e. that are found to match to a gene. The metric value will be the sum of "Unique exon matching reads," "Unique intron matching reads", and "Mitochondrial reads". N.B. if --scrna-count-introns=false is set, then introns will be excluded.

  • Reads processed through mapper: The total number of reads provided as input to the mapper. The metric value will be the sum of all of the above Read matching metrics excluding "Mapped reads".

Molecule Count Metrics

Molecule count metrics report how many reads and molecules contribute toward the count table. No metric percentages are reported.

  • Reads with valid molecular identifier sequences: The number of reads with exactly matching molecular identifier (MI) sequences (i.e. BI sequences). If no MI correction is enabled, then all counted gene reads will be included here.

  • Reads with corrected molecular identifier sequences: The number of reads with MI sequences that were corrected to a valid sequence. If no MI correction is enabled, then the value will be 0.

  • Reads with invalid molecular identifier sequences: The number of reads that were not counted due to having an invalid MI (BI) sequence. Homopolymer sequences and Ns are examples of invalid MI sequences. If no MI correction is enabled, then the value will be 0.

  • Total gene reads: The total number of gene expression or transcript reads. This value includes both mapped reads and filtered reads and should match the "Reads processed through mapper" metric if all reads had valid BI sequences.

  • Total counted gene reads: The total number of mapped reads with valid barcode and BI sequences. The value includes exon-matching, intron-matching, and mitochondrial reads.

  • Total molecules: The total number of individual molecules found after performing IPM correction on the counted gene reads per unique barcode-BI-IMI combination.

  • Sequencing saturation: The likelihood that if an additional read is sequenced, it would already have been observed. It is calculated as the fraction of "duplicate reads" from the same molecule using the formula 1 - ( "Total molecules" / "Total counted gene reads" ).

Cell Metrics

Cell metrics report on the number of barcodes, passing cells based on the count threshold, and mean and median reads, molecules, and genes per cell. No metric percentages are reported.

  • Total unique barcodes: The total number of unique barcode sequences.

  • Passing cell threshold (by molecule/read count): The number of molecules or reads required for a barcode to be considered a (passing) cell. N.B. the --single-cell-threshold-filter-by option determines whether molecule or read counts are used for this threshold (default: molecule).

  • Passing cells: The total number of cells called (i.e. the number of barcodes that passed the filter threshold).

  • Fraction of reads in passing cells: The fraction of reads that were assigned to passing cells, calculated as total gene reads in passing cells over total gene reads. N.B. total gene reads include filtered reads in the Read matching metrics group.

  • Fraction of reads in passing cells that are genic: The fraction of reads assigned to passing cells that are genic reads i.e. matching to genes, calculated as total counted gene reads in passing cells over total gene reads in passing cells. Genic reads correspond to counted gene reads which include exon-matching, intron-matching, and mitochondrial reads.

  • Mean reads per cell (Total input reads / Passing cells): The average number of reads per passing cell calculated by dividing "Total input reads" by "Passing cells".

  • Median molecules per cell: The median number of molecules per passing cell based on sorting the passing cells by molecule count.

  • Median genes per cell: The median number of genes per passing cell based on sorting the passing cells by gene count.

  • Total genes detected: The total number of unique genes detected in the passing cells (i.e. the number of genes with at least one molecule in at least one passing cell).

Barcode Summary File

  • <prefix>.scRNA.barcodeSummary.tsv

The barcode summary file provides a list of all barcodes as well as the number of reads, genes, and molecules in each barcode and its filter status (either PASS for above the cell filter threshold or LOW for below the threshold).

Fields in the barcode summary file:

  • ID: Unique numeric ID for the barcode. The ID corresponds to the line number of that barcode in the MEX output.

  • Barcode: The barcode sequence.

  • TotalReads: Total reads containing the barcode sequence. This includes counted gene reads as well as reads that were filtered for wrong strand, low MAPQ, ambiguously matched or not mapped to a gene.

  • GeneReads: Reads (primary read alignments) that were counted towards a gene. This includes exons, mitochondrial reads, and introns (if scrna-count-introns=true which is set by default).

  • Molecules: Total molecule count.

  • Genes: Unique genes detected.

  • MitochondrialReads: Reads mapped to the mitochondrial genome.

  • Filter: Cell-filtering status of the barcode. Possible values are:

    • PASS: Barcode passes the cell filter threshold.

    • LOW: Barcode count is below the threshold.

Molecule Info File

  • <prefix>.scRNA.moleculeInfo.h5

The molecule info file provides a summary of the read counts for each barcode, gene, BI and IMI combination. The contents of the HDF5 file can be viewed using HDFviewarrow-up-right or using the h5dump command:

  • barcode_idx: the barcode index number (Note: this internal value does not match with the barcodes.tsv line number)

  • barcodes: the barcode sequence

  • binning_idx: the binning index (BI) as an integer (e.g. AAA -> 0)

  • gene_ids: the gene ID

  • gene_idx: the gene index number (Note: this internal value does not match with the features.tsv line number)

  • gene_names: the gene name

  • imis: the IMI value corresponding to the global position where the read was mapped

  • read_counts: the number of reads found for the given barcode, gene, BI and IMI

The HDF5 file can be further manipulated in python using the h5py packagearrow-up-right. Below is example code for creating a pandas DataFrame object from the molecule info file.

Barcode Rank Plot

DRAGEN can generate a barcode rank plot with the help of DRAGEN Reports.

The <prefix>.scRNA.barcodeCounts.txt file contains the final raw counts of every barcode in decreasing order. The barcode counts file is used by DRAGEN Reports to produce a barcode rank plot (also known as a "knee" plot).

If DRAGEN is run on the cloud via BSSH or ICA, then DRAGEN Reports will automatically be run and produce the final report.html file, which will contain the barcode rank plot on the "Sample Report" page.

If DRAGEN is run on-prem, then DRAGEN Reports can be run after DRAGEN single cell RNA has finished running. An example DRAGEN Reports command line is:

The barcode rank plot will appear in the output report.html "Sample Report" page. More information on DRAGEN Reports can be found here.

UMAP and Marker Gene Table

DRAGEN can also generate a UMAP plot and marker gene table with the help of DRAGEN Reports. Note: both the UMAP plot and marker gene table are only available if DRAGEN Reports is run on the cloud via BSSH or ICA.

If DRAGEN is run on the cloud via BSSH or ICA, then DRAGEN Reports will automatically be run and produce the final report.html file. When the file is opened, click the sample name to go to the "Sample Report" page and then click "Single-Cell Clustering". The UMAP Plot and Top Marker Genes table will appear.

If DRAGEN is run on-prem, then DRAGEN Reports can be run after DRAGEN single cell RNA has finished running. An example DRAGEN Reports command line is:

The Top Marker Genes table will appear in the output report.html under the "Single-Cell Clustering" tab of the "Sample Report" page. The UMAP plot and marker gene table are not available when running DRAGEN on-prem. More information on DRAGEN Reports can be found here.

Feature Counting

DRAGEN single cell RNA supports feature counting which can be used to profile the expression of proteins (e.g. antibodies or cell-surface proteins). Feature reads differ from transcript (or gene expression) reads in that their read 2 sequence (R2) contains an oligo tag corresponding to a particular type of protein. Reads marked as feature reads will be matched against the list of feature barcode reference sequences (in contrast to gene expression reads which are matched to genes). Only feature reads matching feature barcode reference sequences will be counted in the final matrix.

Inputs

Feature counting is activated by providing a feature barcode reference file using --scrna-feature-barcode-reference. The feature barcode reference file can either be in FASTA format or CSV format (see below for examples). If the feature barcode reference file is provided in FASTA format, then the feature names should be provided in the header and the feature barcode reference sequences should be provided as the sequence.

If the feature barcode reference file is provided in CSV format, the following fields should be included:

Header field
Description

id

Feature/gene ID

name

Feature/gene name

read

Read that the feature sequence is on (R1 or R2)

position

Starting position of the feature barcode reference sequence relative to the start of the R1 or R2 sequence (should be >= 0)

sequence

Feature barcode reference sequence

gene_id

(Optional) ID of the gene corresponding to the feature (for downstream analysis, not processed by DRAGEN)

gene_name

(Optional) Name of the gene corresponding to the feature (for downstream analysis, not processed by DRAGEN)

When feature counting is used, a FASTQ list is usually provided containing both the gene expression FASTQs and the feature read FASTQs. The read groups (RGIDs) of the feature read input FASTQs must be provided using the --scrna-feature-barcode-groups=<feature_read_groups> option (multiple read groups should be comma-separated). See the CRISPR mode examples below for how to provide the read groups to the --scrna-feature-barcode-groups option.

Examples

Example feature barcode reference file (FASTA):

Example feature barcode reference file (CSV):

Example command line:

Outputs

When feature counting is enabled, DRAGEN produces additional outputs. The features that were found are appended to the end of <prefix>.scRNA.features.tsv.gz. The feature counts for all the barcodes are found in the <prefix>.scRNA.matrix.mtx.gz raw matrix file and the feature counts for the filtered barcodes are found in the <prefix>.scRNA.filtered.matrix.mtx.gz matrix file.

The feature barcode reference file will also be included in the DRAGEN output directory with the name <prefix>.feature_barcode_reference with the corresponding extension for a CSV or a FASTA file (depending on what was provided as input).

The following columns are also added to the <prefix>.scRNA.barcodeSummary.tsv file when feature counting is enabled:

  • TotalFeatureReads: The total number of feature reads found for the barcode sequence (including feature reads that did not successfully match to the feature barcode reference sequence).

  • FeatureMolecules: The number of feature molecules found for the barcode sequence (only including feature reads matching to the feature barcode reference sequence).

  • Features: The total number of features found for the barcode sequence.

  • TotalGene+FeatureReads: The sum of the TotalGeneReads and TotalFeatureReads for the barcode sequence.

  • Gene+FeatureMolecules: The sum of the Molecules and FeatureMolecules for the barcode sequence.

  • Genes+Features: The sum of the Genes and Features for the barcode sequence.

Feature Metrics

When feature counting is enabled, the following feature metrics are also output in <prefix>.scRNA_metrics.csv:

  • Total input feature reads: The total number of feature reads provided as input to DRAGEN. The value of this metric will equal the sum of "Total barcoded feature reads" and "Feature reads with invalid barcodes". The denominator for this metric percentage is "Total input feature reads".

  • Total barcoded feature reads: The total number of barcoded feature reads that will proceed to the feature barcode reference sequence matching stage. N.B. if CRISPR mode is enabled, this value will only include reads for which the specific CRISPR barcode linkers have been correctly detected. The denominator for this metric percentage is "Total input feature reads".

  • Feature reads with invalid barcodes: The total number of feature reads with barcodes that did not match valid sequences in the sequence list, even after barcode correction (within a hamming distance of 1). If CRISPR mode is enabled, this value also includes reads that did not have the correct CRISPR barcode linker distance. The denominator for this metric percentage is "Total input feature reads".

  • Feature matching reads: The number of feature reads that match to the feature barcode reference sequences (allowing for a hamming distance of 1). The denominator for this metric percentage is "Total barcoded feature reads".

  • Feature non-matching reads: The number of feature reads that do not match to the feature barcode reference sequences (even allowing for a hamming distance of 1). The denominator for this metric percentage is "Total barcoded feature reads".

  • Total feature reads: The total number of feature reads processed by the workflow (this should match the "Total barcoded feature reads" metric above). The denominator for this metric percentage is "Total barcoded feature reads".

  • Total feature molecules: The total number of feature molecules found (this should typically match the "Feature matching reads" metric above). No denominator.

  • Fraction of feature reads in passing cells: The fraction of feature reads that were assigned to passing cells, calculated as total feature reads in passing cells over total feature reads.

  • Mean feature reads per cell (Total input feature reads / Passing cells): The average number of feature reads per passing cell calculated by dividing "Total input feature reads" by "Passing cells".

  • Median feature molecules per passing cell: The median number of feature molecules per passing cell based on sorting the passing cells by feature molecule count.

  • Median features per passing cell: The median number of features per passing cell based on sorting the passing cells by feature count.

  • Total features detected: The total number of unique features detected in the passing cells (i.e. the number of features with at least one molecule in at least one passing cell).

N.B. the "Filtered low MAPQ reads" metric will be excluded from the <prefix>.scRNA_metrics.csv file if feature counting is enabled. The reads that are filtered due to low MAPQ will instead be counted toward the "Filtered non-matching reads" metric.

CRISPR Mode

DRAGEN also supports processing samples from Illumina's CRISPR Single Cell kits using PIPseq technology. Set --scrna-enable-pipseq-crispr-mode to true to activate this mode. This mode also requires feature counting to be enabled (CRISPR sequences are one type of feature read).

Activating PIPseq CRISPR mode automatically configures DRAGEN to process feature reads containing the CRISPR guide RNA (gRNA) sequences. This includes handling the offsets in the cell-barcode position for the gRNA reads, transforming the gRNA cell-barcodes to match the gene expression ones, utilizing the "hook and grab" approach for matching the feature barcode reference sequences and identifying the gRNA reads, and counting the gRNA reads (disregarding BIs and IMIs). Both gene expression and gRNA reads are processed in a single DRAGEN run. Note: unmapped reads do not contribute to gene expression read counts but are still included in gRNA counts if they match the hook sequence.

The “hook and grab” method is a targeted approach for identifying CRISPR perturbation reads. It leverages a conserved sequence within the guide RNA structural region as a “hook” to locate the guide RNA and then “grabs” the specific guide by mapping it to a database of known sequences based on their displacement from the hook.

Currently the application of combining CRISPR with other features (e.g. CITE-seq) in a single run is not fully supported yet due to the specialized handling of CRISPR sequences.

Inputs

PIPseq CRISPR mode is activated by setting --scrna-enable-pipseq-crispr-mode=true. CRISPR mode requires --scrna-enable-pipseq-mode to be set to true as well. Note that when --scrna-enable-pipseq-crispr-mode is set to true, the following options are also set automatically:

  • --Aligner.aln-min-score=25

  • --scrna-crispr-guide-calling-mode=gmm

For a standard PIPseq CRISPR workflow, a FASTQ list is usually provided containing both the gene expression FASTQs and the CRISPR FASTQs. The read groups of the CRISPR input FASTQs must be provided using the --scrna-feature-barcode-groups=<CRISPR_read_groups> option (multiple read groups should be comma-separated). Furthermore, a CRISPR barcode reference file must be provided using the --scrna-feature-barcode-reference argument.

The CRISPR barcode reference file should be in CSV format with the following header fields:

Header field
Description

id

Feature/gene ID

name

Feature/gene name

read

Read that the guide RNA is on (R1 or R2)

target

Target or "hook" sequence. If the value is set to "5p", then the target sequence is based on the beginning of the read (default value: "5p")

position

Displacement of the start of the identifying guide RNA sequence relative to the start of the target sequence (can be positive or negative)

sequence

Guide RNA sequence

feature_type

Name of feature (e.g. CRISPR Direct Capture)

gene_id

(Optional) ID of the gene corresponding to the feature (for downstream analysis, not processed by DRAGEN)

gene_name

(Optional) Name of the gene corresponding to the feature (for downstream analysis, not processed by DRAGEN)

Example FASTQ list CSV file:

Notes:

  • The RGID is referred to as the "read group" and CRISPR RGIDs should be passed to the --scrna-feature-barcode-groups option

  • The RGSM values should match for all reads to be processed in a single DRAGEN run

  • The RGLB values should match for all reads to be analyzed together (the results will be combined into a single metrics output file)

Example feature barcode reference file:

Examples of target, position, and sequence values for common use cases are provided below:

Example Command Line

CRISPR Outputs

The features that were found are appended to the end of <prefix>.scRNA.features.tsv.gz. The feature counts for all the barcodes are found in the <prefix>.scRNA.matrix.mtx.gz raw matrix file and the feature counts for the filtered barcodes are found in the <prefix>.scRNA.filtered.matrix.mtx.gz matrix file.

CRISPR Metrics

When PIPseq CRISPR mode is enabled, the following CRISPR metrics are also output in <prefix>.scRNA_metrics.csv (in addition to the Feature Metrics):

  • CRISPR reads with exactly matching barcodes: The number of CRISPR reads with barcode sequences that were not altered during error-correction, i.e. the barcode sequence exactly matched a barcode in the sequence list.

  • CRISPR reads with corrected barcodes: The number of CRISPR reads with barcode sequences that were corrected to a valid sequence in the sequence list (within a hamming distance of 1).

  • Total CRISPR reads matching known barcodes: The total number of CRISPR reads with valid barcodes that match the sequence list after error-correction. The metric value will be the sum of the above two metrics.

  • CRISPR tag mapping rate: The rate at which CRISPR reads match to the feature barcode sequences, calculated as the number of valid CRISPR reads (i.e. CRISPR reads that match to a feature barcode sequence) divided by "Total CRISPR reads matching known barcodes". The value for this metric should match the percentage value for the "Feature matching reads" metric.

  • CRISPR fraction valid reads in cells: The rate at which valid CRISPR reads are found in cells, calculated as the number of valid CRISPR reads in cells divided by "Total CRISPR reads matching known barcodes".

  • Fraction passing cells with CRISPR reads: The fraction of passing cells containing at least one CRISPR read, calculated as the number of passing cells with at least 1 CRISPR read divided by the total number of passing cells.

Guide RNA Calling

DRAGEN now also supports guide RNA calling - identifying cells that express guide RNA sequences. Although the counts matrix reports how many feature counts were found for each cell, there is a chance that the feature reads happened to match to the feature barcode reference sequence or the feature count was assigned to the cell by chance rather than due to real signal. Guide RNA calling applies an additional filtering step to distinguish between cells truly expressing guide RNA sequences and noise (similar to how cell filtering is performed to distinguish true passing cells from noise).

DRAGEN utilizes a Gaussian Mixture Model (GMM) approach to perform guide RNA calling. For each feature, DRAGEN constructs a GMM on the feature counts for the set of cells containing that feature. The GMM has two components: the signal distribution and the noise distribution. The model is simplified by assuming the signal and noise distributions have the same variance. Initial parameters are set based on the threshold that maximizes the likelihood of the model and then the Expectation-Maximization (EM) algorithm (add citation) is used to optimize the parameters. After the model converges to a final set of parameters, a threshold is chosen that maximally separates the signal and noise populations, and each cell is classified as expressing the feature or not.

When PIPseq CRISPR mode is enabled, guide RNA calling using GMMs is automatically enabled as well with the --scrna-crispr-guide-calling-mode=gmm option. Guide RNA calling can be disabled by setting --scrna-crispr-guide-calling-mode=none.

Guide Calling Outputs

If guide RNA calling is enabled, then DRAGEN will generate three additional files:

  • <prefix>.scRNA.positive_cell_guide_assignments.csv

    • A CSV file containing all of the passing cells with the following fields:

      • cell_barcode: The cell barcode sequence.

      • num_features: The number of features called for the cell (may be 0).

      • feature_call: A list of all features called for the cell (separated with a "|"). May be empty if num_features is 0.

      • num_transcripts: The number of feature counts for each feature for the cell (separated with a "|"). May be empty if num_features is 0.

  • <prefix>.scRNA.singlet_cell_guide_assignments.csv

    • A CSV file containing only the cells for which exactly one feature was called, with the following fields:

      • cell_barcode: The cell barcode sequence.

      • feature_call: The feature that was called for that cell.

  • <prefix>.scRNA.guide_metrics.csv

    • A CSV metrics file containing the following Guide Metrics:

      • Total cells: The total number of passing cells.

      • Total cells with one or more guides detected: The number of cells with at least one feature passing the guide calling threshold.

      • Total cells with two or more guides detected: The number of cells with at least two features passing the guide calling threshold.

Fractional Downsampling

DRAGEN supports fractional downsampling with PIPseq mode. Fractional downsampling refers to downsampling the input reads are a specified rate. Downsampling is performed randomly on all input reads.

Fractional downsampling is enabled with the following options:

  • --enable-fractional-down-sampler=true

  • --down-sampler-normal-subsample=<down_sampling_rate>

Set the down_sampling_rate equal to the desired fraction of reads to keep. For example, a rate of 70% means that 30% of the input reads will be filtered out.

More information on DRAGEN fractional downsampling can be found on this page.

Demultiplexing

DRAGEN supports different methods of demultiplexing samples. Genotype demultiplexing can be used if a VCF containing variants differentiating the samples of interest is available. Cell hashing can be used if the samples contain distinguishing oligo-tags (typically in read R2).

Genotype Demultiplexing

Genotype demultiplexing refers to using single nucleotide variants (SNVs) in the samples to differentiate them. A VCF file must be provided with those variants to use genotype demultiplexing. With genotype demultiplexing, DRAGEN assigns sample identity to cells based on the alleles observed in reads in each cell - only SNVs are considered. DRAGEN flags any doublets, such as droplets that contain multiple cells from different individuals.

DRAGEN supports two flavors of genotype demultiplexing. Genotype-based sample demultiplexing should be used if the VCF file matches the samples in the data set provided to DRAGEN (e.g. if the VCF was generated from matched DNA samples for the same individual). Genotype-free sample demultiplexing should be used if the VCF file is derived from a background population corresponding to the samples in the data set - the VCF is not expected to directly match the samples but the variant genotypes in the VCF should appear in the samples at some frequency.

Command Line Options

Genotype-based sample demultiplexing is enabled with the option:

  • --scrna-demux-sample-vcf=<VCF_file> - the VCF file should contain the genotypes corresponding to the exact samples provided to DRAGEN. Note: the number of samples for DRAGEN to report is automatically detected from the number of SAMPLE fields in the VCF.

Genotype-free sample demultiplexing is enabled with the options:

  • --scrna-demux-reference-vcf=<VCF_file> - the VCF file should contain genotypes for the background population from which the samples originated (they do not have to match the samples exactly)

  • --scrna-demux-number-samples=<number> — The number of samples to be detected by DRAGEN. DRAGEN will demultiplex the dataset into this many samples.

Additional options:

  • --scrna-demux-detect-doublets=<true/false> — Enable doublet detection for demultiplexing (set to true by default).

Example command line for genotype-based sample demultiplexing:

Example command line for genotype-free sample demultiplexing:

Outputs

When genotype demultiplexing (demux) is enabled, additional outputs are produced by DRAGEN. Additional columns are added to the barcode summary file and two new files are genereated: a demux TSV file with the number of reads and SNPs used to demultiplex each barcode and a demux metrics file with some summary metrics for the demultiplexed samples.

The <prefix>.scRNA.barcodeSummary.tsv file contains the per-cell, barcode summary metrics. When demultiplexing is enabled, this file will have these additional columns describing the sample each barcode is assigned to. See the Barcode Summary File section for more information on the <prefix>.scRNA.barcodeSummary.tsv metrics.

Column
Description

SampleIdentity

The SampleIdentity column will contain one of the following values:

  • sampleX—The particular barcode is uniquely assigned to a sample.

  • AMB(sampleX,sampleY)—The algorithm cannot determine the sample to assign the barcode to.

  • MIX(mixing_coef*sampleX+(100-mixing_coef)*sampleY)—The cell barcode is classified as a doublet of sampleX and sampleY. For example, MIX(50*sampleX+50*sampleY).

IdentityQscore

The IdentityQscore column contains the value used to estimate the confidence of the sample-identity call. After DRAGEN determines the doublet status of the cell as singlet, ambiguous, or doublet, the identity Q-score is defined as -10 * log10(Probability that the assigned identity is correct, given the second most likely identity and the doublet status). Higher values for the identity Q-score correspond to more confident sample-identity calls.

The <prefix>.scRNA.demux.tsv file contains sample-demultiplexing statistics that were used to infer the sample identity of each cell barcode.

Column
Description

Barcode

The cell barcode associated with the cell.

DemuxSNPCount

The number of SNPs that the reads of the cell barcode intersect.

DemuxReadCount

The number of reads of the cell barcode that intersect at least one SNP.

Pure samples

Samples from the VCF file.

BestMixtureIdentity

Mixture sample with the highest log likelihood. Only available if --single-cell-demux-detect-doublets=true is set.

BestMixtureLogLikelihood

The log likelihood of the best mixture sample. Only available if --single-cell-demux-detect-doublets=true is set.

The <prefix>.scRNA.demuxSamples_metrics.csv file contains per-sample metrics for each demultiplexed sample, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv.

Column
Description

Passing cells

The number of cells called (i.e. the number of barcodes that passed the filter threshold).

Fraction genic reads in cells

The fraction of reads assigned to passing cells that are genic reads i.e. matching to genes, calculated as total counted gene reads in passing cells over total gene reads in passing cells. Genic reads correspond to counted gene reads which include exon-matching, intron-matching, and mitochondrial reads.

Median reads per cell

The median number of reads per passing cell based on sorting the passing cells by read count.

Median molecules per cell

The median number of molecules per passing cell based on sorting the passing cells by molecule count.

Median genes per cell

The median number of genes per passing cell based on sorting the passing cells by gene count.

Total genes detected

The total number of unique genes detected in the passing cells (i.e. the number of genes with at least one molecule in at least one passing cell).

Cell-Hashing

Cell-hashing refers to using oligo-tags usually in read R2 sequences to demultiplex samples. A cell-hashing reference file is required to match those oligo-tags to the corresponding samples. DRAGEN will match each R2 sequence against the cell-hashing reference sequences to determine the sample identity of each barcode. DRAGEN flags any doublets, such as droplets that contain multiple cells from different individuals.

The cell-hashing reference file should be a FASTA or CSV file and is specified with the option --scrna-cell-hashing-reference=<cell_hashing_ref_file>. The format of the file is similar to that required for feature counting (see Feature Counting). If the cell-hashing reference file is provided in FASTA format, then the sample names should be provided in the header and the oligo-tag reference sequences should be provided as the sequence.

If the cell-hashing reference file is provided in CSV format, the following fields should be included:

  • id — Sample identifier

  • name — Sample name

  • read — Read that the oligo-tag sequence is on (R1 or R2)

  • position — Starting position of the oligo-tag reference sequence relative to the start of the R1 or R2 sequence (should be >= 0)

  • sequence — DNA sequence of the sample-specific oligo. For example, CAGCCCGATTAAGGT.

  • type — Type of feature (e.g. Cell Hashing)

Read groups (RGID) for FASTQ files that contain cell hashing oligos should be specified with the option --scrna-hto-barcode-groups=<cell_hashing_RGIDs>.

Command Line Options

Cell-hashing is enabled with the options:

  • --scrna-cell-hashing-reference=<cell_hashing_ref_file> - the FASTA or CSV file with the sample-specific oligo-tags.

  • --scrna-hto-barcode-groups=<cell_hashing_RGIDs> - the RGIDs of the FASTQ files containing cell hashing reads.

Additionally, doublet detection can be enabled/disabled with the following option:

  • --scrna-demux-detect-doublets=<true/false> — Enable/disable doublet detection for cell-hashing sample demultiplexing (set to true by default).

Example command line for cell-hashing:

Outputs

When cell-hashing sample demultiplexing (demux) is enabled, additional outputs are produced by DRAGEN. An additional column is added to the barcode summary file and two new files are genereated: a demux TSV file with the number of reads used to demultiplex each barcode and a demux metrics file with some summary metrics for the demultiplexed samples.

The <prefix>.scRNA.barcodeSummary.tsv file contains the per-cell, barcode summary metrics. When demultiplexing is enabled, this file will have an additional column describing the sample each barcode is assigned to. See the Barcode Summary File section for more information on the <prefix>.scRNA.barcodeSummary.tsv metrics.

Column
Description

SampleIdentity

The SampleIdentity column will contain one of the following values:

  • sampleX—The particular barcode is uniquely assigned to a sample.

  • AMB(sampleX,sampleY)—The algorithm cannot determine the sample to assign the barcode to.

  • MIX(mixing_coef*sampleX+(100-mixing_coef)*sampleY)—The cell barcode is classified as a doublet of sampleX and sampleY. For example, MIX(50*sampleX+50*sampleY).

The <prefix>.scRNA.demux.tsv file contains sample-demultiplexing statistics that were used to infer the sample identity of each cell barcode.

Column
Description

Barcode

The cell barcode associated with the cell.

DemuxReadCount

Cell-hashing read count for each sample.

The <prefix>.scRNA.demuxSamples_metrics.csv file contains per-sample metrics for each demultiplexed sample, similar to the metrics reported for the overall dataset in <prefix>.scRNA_metrics.csv.

Column
Description

Passing cells

The number of cells called (i.e. the number of barcodes that passed the filter threshold).

Fraction genic reads in cells

The fraction of reads assigned to passing cells that are genic reads i.e. matching to genes, calculated as total counted gene reads in passing cells over total gene reads in passing cells. Genic reads correspond to counted gene reads which include exon-matching, intron-matching, and mitochondrial reads.

Median reads per cell

The median number of reads per passing cell based on sorting the passing cells by read count.

Median molecules per cell

The median number of molecules per passing cell based on sorting the passing cells by molecule count.

Median genes per cell

The median number of genes per passing cell based on sorting the passing cells by gene count.

Total genes detected

The total number of unique genes detected in the passing cells (i.e. the number of genes with at least one molecule in at least one passing cell).

STAR Aligner

To add optional compatibility with legacy PIPseq workflows, we have included Spliced Transcripts Alignment to a Reference (STAR) as an alternative to the default hardware-accelerated DRAGEN mapper. STAR is an open-source RNA-sequence read mapper, with support for splice-junction and fusion read detection.

For further information on STAR mapper:

While STAR may be run as an independent executable, running STAR as an internal feature from DRAGEN offers simpler and seamless functionality within the scRNA pipeline.

Important notes:

  • To run STAR, the Single Cell RNA pipeline must be enabled with either --enable-single-cell-rna true or --scrna-enable-pipseq-mode true.

  • STAR uses its own format for reference genome hash tables. This format is both different and incompatible with DRAGEN's reference. When using STAR, both a DRAGEN and a STAR reference must be supplied, and these references must be built from identical whole-genome FASTA files and annotations. If STAR and DRAGEN are given references built from different files, undefined behavior may result.

  • When running STAR mapper, the following intermediate files will be created:

    • If a FASTQ list is provided as input and barcode/BI sequences are contained in read R1, then an additional barcode/BI merged FASTQ file will be created with the following name: <prefix>.umi_merged.rg_<Read group index>.fastq. An intermediate FASTQ file will be created for each read group provided.

    • A BAM file with the name: <prefix>.star_mapper.Aligned.out.bam. This is the output BAM record containing all mapped reads.

The following are options for configuring STAR within DRAGEN:

  • --single-cell-enable-star-mapper -- ("true" or "false") Enable STAR and use it for mapping in lieu of the DRAGEN mapper.

  • --single-cell-star-mapper-build-reference -- ("true" or "false") Enable STAR and use it for building STAR-formatted reference genome hash tables. STAR may build a reference and map reads from the same command line. STAR cannot use a DRAGEN-formatted reference.

  • --single-cell-star-mapper-reference-dir -- (required, string) Users must supply a directory for STAR-formatted reference genome hash tables. If STAR is building a reference, this directory is where the output reference files will be stored. If the STAR mapper runs from a pre-built reference, this directory is where the mapper looks for the reference files.

  • --single-cell-star-mapper-path -- (optional, string) Users may input their own path to a STAR executable. If not supplied, DRAGEN uses an internally-supplied STAR executable.

  • --single-cell-star-mapper-options -- (optional, string) Users may manually adjust settings for STAR and overwrite the default settings set by DRAGEN. For a complete list of settings for STAR, please refer to its user manualarrow-up-right

  • --single-cell-compress-intermediate-fastq -- (optional, "true" or "false", default="false") When a FASTQ list is provided as input and barcode/BI sequences are contained in read R1, intermediate FASTQ files are created. This option enables compressing those intermediate FASTQ files with a ".fastq.gz" format.

  • --single-cell-num-parallel-fastq-umis -- (optional, integer, default=4) When a FASTQ list is provided as input and barcode/BI sequences are contained in read R1, intermediate FASTQ files are created. This option sets how many parallel FASTQ files are processed simultaneously.

Because STAR uses a different map/align algorithm than DRAGEN, there will be some differences in the read-mapping results and thus in the cell and gene counts results as well.

Some STAR options that can be added (using the --single-cell-star-mapper-options argument) to further align STAR's output to DRAGEN's include the following:

  • --alignIntronMax 25000 - limit the maximum intron size to 25,000 bp

  • --clip5pNbases 1 - clip one base from the 5' end of read 2

  • --clip3pAdapterSeq polyA - trim poly-A sequences from 3' end of read 2

Command-line Examples Using STAR

Example command line for running the STAR mapper with DRAGEN single cell for Illumina Single Cell Prep.

Example command line for running the STAR mapper from a FASTQ list CSV with two non-default settings for STAR.

Example command line for creating a STAR reference and running the STAR mapper with DRAGEN single cell for Illumina Single Cell Prep. Note: STAR requires an uncompressed annotations file.

Tertiary Analysis

We recommend using Illumina Connected Multiomicsarrow-up-right (ICM) for single cell tertiary analysis. ICMarrow-up-right is a fully-featured tool that allows users to combine and analyze different biological data sets to turn raw samples into published results faster and more easily.

Loading Count Files to a Dense Matrix

Some users might want to explore the output matrix in a human-readable format. There are several tools available including Illumina Connected Multiomics (ICM)arrow-up-right, scanpy, and seurat. To do so, a possible way would be to load the matrix in a "dense" dataframe in Python (similar methodologies can be used in other programming languages). It is important to remember, however, that when possible a "sparse" representation of the matrix is preferable, due to the significant usage of memory and disk space of "dense" matrices. Several tools are available to work efficiently with "sparse" representations of single cell matrices (e.g., scanpy in Python).

The matrix can be converted into a "dense" representation through two Python modules: scanpy and pandas. This has been tested with Python 3.9.7, scanpy 1.10.3, pandas 2.2.3.

First, it is necessary to install the required libraries:

Within Python, the matrix can be loaded in "dense" representation using the following commands:

The matrix can be saved through different output formats (e.g., CSV), although this is not recommended due to high disk usage.

References

  1. Fontanez et al., 2024. Intrinsic molecular identifiers enable robust molecular counting in single-cell sequencing. bioRxiv 2024.10.04.616561.

  2. “The Coupon Collector Problem.” 2022. University of Alabama in Huntsville. April 24, 2022. https://stats.libretexts.org/@go/page/10250.

Last updated

Was this helpful?