ORA Compression
DRAGEN ORA Compression is a fully lossless compression method that compresses *.fastq and *.fastq.gz files into *.fastq.ora files. DRAGEN ORA supports FASTQ files generated by Illumina sequencing systems. With the ORA format, the md5 checksum of the FASTQ content is preserved across a compression and decompression cycle, which confirms that the compression is lossless.
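The checksum property described above can be checked independently of DRAGEN. The following is a minimal sketch, assuming that "FASTQ content" means the uncompressed FASTQ bytes, so that a *.fastq file and its gzipped copy yield the same digest. The function name fastq_content_md5 is hypothetical and is not part of DRAGEN.

```python
import gzip
import hashlib
from pathlib import Path


def fastq_content_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the uncompressed FASTQ content.

    Assumption: "FASTQ content" means the raw FASTQ bytes, so a .fastq
    file and its gzipped copy should produce the same digest.
    """
    path = Path(path)
    # Gzipped input is decompressed on the fly; plain FASTQ is read as-is.
    opener = gzip.open if path.suffix == ".gz" else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        # Read in chunks so large FASTQ files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the digest of the original file with the digest of the *.fastq.gz obtained after decompression should give identical values if the round trip is lossless.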
DRAGEN ORA Compression requires a separate license. Decompression and ingestion of *.fastq.ora files into DRAGEN map/align do not require a license. If your DRAGEN server is connected to a network, DRAGEN ORA compression can be used after installing DRAGEN v3.8 or later. If your DRAGEN server is offline, contact Illumina Customer Service.
For human data generated by the NovaSeq 6000, NextSeq 1000, or NextSeq 2000 sequencing systems, the compression ratio is expected to be up to 6x compared to *.fastq.gz. The compressed file uses the *.fastq.ora extension.
The input to DRAGEN ORA Compression is *.fastq or *.fastq.gz. Input can be a single file or a list of files. A list of files can be specified on the command line or taken from a *.fastq-list.csv file generated by the BCL Convert BaseSpace Sequence Hub App or DRAGEN BCL Convert. Input located in local storage, AWS S3, or Azure Blob storage is supported.
*.fastq.ora files are decompressed into *.fastq.gz.
Note: *.fastq.ora can be generated starting from BCL. To convert BCL into *.fastq.ora, specific commands need to be used. Follow the DRAGEN ORA compression from BCL instructions.
ORA Reference
To compress or decompress ORA files, you must provide the ORA reference files and specify an ORA reference directory.
Several references are supported for compressing data from different species and from different types of human data. Refer to the list of supported references below.
You can download ORA reference files from the DRAGEN Software Support Site page. To ensure proper management of the reference files, do not change any of the file names of the downloaded archive.
To specify an ORA reference directory, do as follows.
Download the oradata-2.tar.gz archive (or the archive relevant to your studied model) from the DRAGEN Software Support Site.
Move the file to the location where you want the reference directory, and then enter the following to extract the contents.
tar -xzvf oradata-2.tar.gz
Set the --ora-reference command line option to the path of the extracted oradata folder.
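The extraction step above can also be scripted. The following is a minimal sketch using the Python standard library, assuming that the archive unpacks to a top-level folder named oradata, as the steps above suggest. The function name extract_ora_reference is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_ora_reference(archive, destination):
    """Extract an ORA reference archive and return the oradata folder path.

    Assumption: the archive unpacks to a top-level folder named "oradata".
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    # Equivalent to running "tar -xzvf <archive>" inside the destination.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(destination)
    oradata = destination / "oradata"
    if not oradata.is_dir():
        raise FileNotFoundError(f"no oradata folder found under {destination}")
    return oradata
```

The returned path is the value to pass to --ora-reference.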
The oradata folder should follow one of the following structures:
When only one reference is handled: --ora-reference should still point to the parent oradata folder.
When one or more references are handled:
You can select which reference species to use at compression with the option --ora-compression-species <species_scientific_name>. If unspecified, the Homo sapiens reference is used by default. Using a reference species that does not match the organism sequenced in your FASTQ file still produces a valid ORA compressed file, albeit with a lower compression ratio. If the oradata folder pointed to by --ora-reference does not contain the requested species, DRAGEN stops with an error. At decompression, the species used to compress the ORA file is detected automatically. DRAGEN looks for the appropriate species in the oradata folder pointed to by --ora-reference. If it is missing, DRAGEN stops with an error message indicating the name of the missing species. In that case, download the reference from the DRAGEN Software Support Site page.
Command Line Options
The following example command contains the required DRAGEN ORA compression options to compress regular human data:
dragen --enable-map-align false --ora-input <FILE> --enable-ora true --ora-reference <...> --output-directory <...>
or
dragen --enable-map-align false --fastq-list <FILE .csv> --enable-ora true --ora-reference <...> --output-directory <...>
The following example command contains the required DRAGEN ORA decompression options (for human and non-human data):
dragen --enable-map-align false --ora-input <FILE> --enable-ora true --ora-decompress true --ora-reference <...> --output-directory <...>
The following example commands contain the required options to compress the FASTQ files of a fastq-list.csv file containing multiple samples (regular human data):
When all samples must be compressed: dragen --enable-map-align false --fastq-list <FILE .csv> --enable-ora true --fastq-list-all-samples true --ora-reference <...> --output-directory <...>
When only specific samples must be compressed: dragen --enable-map-align false --fastq-list <FILE .csv> --enable-ora true --fastq-list-sample-id <sample> --ora-reference <...> --output-directory <...>
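Before choosing between --fastq-list-all-samples and --fastq-list-sample-id, it can help to see which samples a fastq-list.csv contains. The following is a minimal sketch, assuming that the sample ID is stored in a column named RGSM. Check this against the FASTQ CSV File Format section. The function name sample_ids is hypothetical.

```python
import csv


def sample_ids(fastq_list_csv):
    """Return the distinct sample IDs in a fastq-list.csv, in file order.

    Assumption: the sample ID is stored in a column named RGSM.
    """
    seen = []
    with open(fastq_list_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            if row["RGSM"] not in seen:
                seen.append(row["RGSM"])
    return seen
```

If the list holds a single sample, neither option should be needed. Otherwise, any returned value can be passed to --fastq-list-sample-id.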
The following example command contains the required options for interleaved compression of paired-read files from a fastq-list.csv file (regular human data):
dragen --enable-map-align false --fastq-list <FILE .csv> --enable-ora true --ora-interleaved-compression true --ora-reference <...> --output-directory <...>
The following example command contains the required DRAGEN ORA compression options to compress non-human or specific human data, chicken data in this case:
dragen --enable-map-align false --ora-input <FILE> --enable-ora true --ora-compression-species <gallus_gallus> --ora-reference <...> --output-directory <...>
The following example command prints the file information summary of an ORA compressed file. Compression or decompression is not performed.
dragen --enable-map-align false --ora-input <FILE> --enable-ora=true --ora-print-file-info
The following example command compares the FASTQ file checksum with the decompressed FASTQ.ORA file checksum. It outputs "ORA integrity check successful" if the checksums are equal or "integrity check failed" if they are not.
dragen --enable-map-align false --ora-input <FILE> --enable-ora=true --ora-reference <...> --ora-check-file-integrity=true
The following example command contains the required DRAGEN ORA compression options to print the list of available references:
dragen --enable-map-align false --enable-ora true --ora-list-species true
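When the example commands above are launched from a script, building the argument list programmatically avoids quoting mistakes. The following is a minimal sketch that emits only options shown in the examples above. The function name ora_command and its parameters are hypothetical, and the sketch does not cover every option combination.

```python
def ora_command(ora_reference, output_directory, ora_input=None,
                fastq_list=None, decompress=False, species=None):
    """Build a dragen argument list for ORA compression or decompression.

    Only options shown in the example commands are emitted.
    """
    if (ora_input is None) == (fastq_list is None):
        raise ValueError("provide exactly one of ora_input or fastq_list")
    if decompress and species is not None:
        # The species is detected automatically at decompression.
        raise ValueError("species applies to compression only")
    args = ["dragen", "--enable-map-align", "false", "--enable-ora", "true"]
    if ora_input is not None:
        args += ["--ora-input", str(ora_input)]
    else:
        args += ["--fastq-list", str(fastq_list)]
    if decompress:
        args += ["--ora-decompress", "true"]
    if species is not None:
        args += ["--ora-compression-species", species]
    args += ["--ora-reference", str(ora_reference),
             "--output-directory", str(output_directory)]
    return args
```

On a server with DRAGEN installed, the returned list can be passed to subprocess.run(args, check=True).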
The following are the command line options for running DRAGEN ORA Compression and Decompression.
Option | Required | Description |
---|---|---|
--enable-map-align | Yes | Set to false. |
--enable-ora | Yes | Set to true. |
--ora-reference | Yes | Path to the directory that contains the compression reference and index file. |
--ora-input | Yes (or --fastq-list) | Specifies the input files for compression or decompression. |
--fastq-list | Yes (or --ora-input) | Specifies a .csv file with a list of FASTQ files to be compressed. This option is not specific to DRAGEN ORA compression; its usage is explained in the FASTQ CSV File Format section of this manual. Compressing a list of FASTQ files that contains different species is not supported, whereas decompressing files from different species is supported. |
--ora-input2 | No | Used for interleaved compression of paired-read files when input files are specified with --ora-input. |
--ora-interleaved-compression | No | Used for interleaved compression of paired-read files when input files are specified with --fastq-list. |
--ora-compression-species | No | String specifying the reference species used to compress the data. Possible values are listed under List of supported references. |
--ora-decompress | No | Set to true to decompress ORA files. |
--force | No | Compresses to output directory even if the compressed file already exists. The existing compressed file is overwritten. |
--ora-threads-per-file <#> | No | Manually controls the number of CPU threads for compressing each FASTQ input file. The default value is 8. |
--ora-parallel-files <#> | No | Manually controls the number of input FASTQ files processed in parallel. The default value is 4. |
--ora-use-hw | No | Set to |
--ora-print-file-info | No | Prints file information summary of ORA compressed files. Note: this option cannot be used simultaneously with the --ora-decompress option and the --ora-check-file-integrity option. |
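--ora-list-species | No | Set to true to print the list of available references. |
--ora-check-file-integrity | No | Set to true to compare the FASTQ file checksum with the decompressed ORA file checksum. |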
--ora-enable-md5 | No | Set to |
--ora-delete-input-files | No | Set to |
--ora-original-name | No | At decompression, set to |
Use the --output-directory option to specify the directory in which to store the compressed or decompressed output files.
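The --ora-threads-per-file and --ora-parallel-files options interact when sizing a run. The following is a minimal sketch, assuming that total thread demand is roughly the product of the two values, so the defaults of 8 and 4 correspond to about 32 threads. This product is an assumption and is not stated in this manual. The function name parallel_files_for is hypothetical.

```python
def parallel_files_for(cpu_threads, threads_per_file=8):
    """Suggest a value for --ora-parallel-files given a CPU thread budget.

    Assumption: total thread demand is roughly
    threads_per_file * parallel_files.
    """
    if cpu_threads < 1 or threads_per_file < 1:
        raise ValueError("thread counts must be positive")
    # Always process at least one file, even on a small machine.
    return max(1, cpu_threads // threads_per_file)
```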
Interleaved Compression
There are two methods to achieve paired compression, also known as interleaved compression:
Using --ora-input and --ora-input2. The nth file of the --ora-input list is compressed together with the nth file of the --ora-input2 list.
Using --fastq-list with --ora-interleaved-compression set to true. The paired-read files from the nth line of fastq-list.csv are compressed together.
Both files are interleaved within a single ORA output file with a file name containing -interleaved. Using these options to compress paired files together improves compression by up to 10%. When an ORA file that contains paired data is decompressed, it is automatically decompressed to two separate files. To map an ORA file that contains paired interleaved data with the DRAGEN mapper, use the --interleaved option.
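The positional pairing used with --ora-input and --ora-input2 can be illustrated as follows. This is a minimal sketch, assuming that both lists must contain the same number of files, which this manual does not state explicitly. The function name pair_inputs is hypothetical.

```python
def pair_inputs(read1_files, read2_files):
    """Pair the nth --ora-input file with the nth --ora-input2 file.

    Assumption: both lists must have the same length.
    """
    if len(read1_files) != len(read2_files):
        raise ValueError("both lists must contain the same number of files")
    return list(zip(read1_files, read2_files))
```

Listing the pairs before launching a run makes it easier to spot a file order that would interleave reads from unrelated files.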
How to use ORA input files with DRAGEN Map/Align
DRAGEN can directly process ORA files. The same options as for the other FASTQ input file types can be used. To use an ORA file, replace the FASTQ file name with the ORA file name and specify the ORA reference directory using --ora-reference.
The following command represents paired-end input in two matched ORA FASTQ files (-1 and -2 options).
List of supported references
Below is a list of supported references. This list may not be exhaustive. The most up-to-date list of supported references can be found on the DRAGEN Software Support Site page.
Either the whole database or specific references can be downloaded.
Model | Valid string value | Size |
---|---|---|
Human | Homo_sapiens | 6.5 GB |
Human methylated data | Homo_sapiens_bisulfite | 11 GB |
Pig | Sus_scrofa | 5.0 GB |
Chicken | Gallus_gallus | 3.8 GB |
Rice | Oryza_sativa | 1.9 GB |
Arabidopsis | Arabidopsis_thaliana | 478 MB |
Wheat | Triticum_aestivum | 13 GB |
Cattle | Bos_taurus | 5.3 GB |
Soybean | Glycine_max | 2.0 GB |
Rat | Rattus_norvegicus | 4.5 GB |
Maize | Zea_mays | 4.2 GB |
Zebrafish | Danio_rerio | 4.8 GB |
Mouse | Mus_musculus | 4.5 GB |
Roundworm | Caenorhabditis_elegans | 569 MB |
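The valid string values in the table above can be checked before a value is passed to --ora-compression-species. The following is a minimal sketch, assuming that matching should ignore letter case and return the spelling used in the table. Whether DRAGEN itself is case-sensitive is not stated here. The names SUPPORTED_SPECIES and canonical_species are hypothetical.

```python
# Valid string values copied from the table of supported references.
SUPPORTED_SPECIES = [
    "Homo_sapiens", "Homo_sapiens_bisulfite", "Sus_scrofa", "Gallus_gallus",
    "Oryza_sativa", "Arabidopsis_thaliana", "Triticum_aestivum", "Bos_taurus",
    "Glycine_max", "Rattus_norvegicus", "Zea_mays", "Danio_rerio",
    "Mus_musculus", "Caenorhabditis_elegans",
]


def canonical_species(value):
    """Return the table spelling of a species string, ignoring case."""
    lookup = {name.lower(): name for name in SUPPORTED_SPECIES}
    try:
        return lookup[value.lower()]
    except KeyError:
        raise ValueError(f"unsupported reference species: {value}") from None
```

Because the table may not be exhaustive, a rejected value should be checked against the DRAGEN Software Support Site before being ruled out.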