DRAGEN Amplicon Pipeline

Amplicon sequencing is a highly targeted approach that enables you to analyze genetic variation in specific genomic regions. The ultradeep sequencing of PCR products (amplicons) allows you to efficiently identify and characterize variants. This method uses oligonucleotide probes designed to target and capture regions of interest, followed by next-generation sequencing (NGS).

The Amplicon Pipeline supports both DNA and RNA data. It turns off duplicate marking because, by design of the assay, fragments from an amplicon target have only a few unique start and end positions.

The DNA Amplicon Pipeline builds on the DRAGEN DNA Pipeline, adding a step after mapping and aligning that soft-clips primers and rewrites alignments. If the target amplicon is found, DRAGEN tags each alignment with the target amplicon and soft-clips the primer sequences. DRAGEN performs tagging by adding an XN:Z:<amplicon name> tag to the output BAM/CRAM record. Soft-clipping makes sure that the primer sequences do not contribute to the variant calls.
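As a rough illustration of how the XN tag can be consumed downstream, the sketch below pulls the amplicon name out of one SAM-format text line (for example, a BAM record converted to text). The function name amplicon_tag and the sample record are hypothetical; only the XN:Z:<amplicon name> tag format comes from the description above.

```python
from typing import Optional


def amplicon_tag(sam_line: str) -> Optional[str]:
    """Return the amplicon name stored in the XN:Z: tag, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # A SAM record has 11 mandatory columns; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        if field.startswith("XN:Z:"):
            return field[len("XN:Z:"):]
    return None


# Hypothetical record, shortened for illustration only.
record = "read1\t0\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tFFFF\tNM:i:0\tXN:Z:amp1"
print(amplicon_tag(record))
```

Records without the tag return None, which in this sketch simply means no amplicon name was attached to the alignment.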

The primer clipping step also unaligns the following poorly aligned reads and sets their MAPQ to 0:

  • Alignments that don't consume any reference bases after soft-clipping.

  • Off-target alignments overlapping target regions.

  • Alignments with a substitution fraction above a threshold. The substitution fraction is the ratio of the mismatch count to the combined match and mismatch count, and the probe regions are excluded from the calculation. The threshold is specified by --amplicon-max-substitution-fraction with a default of 0.04.

  • Alignments whose read base count after soft-clipping is less than the short-read threshold and whose substitution fraction is above a threshold. The short-read threshold is specified by --amplicon-shortread-length-threshold with a default of 30. For this check, the probe regions are included in the calculation and soft-clipped bases are treated as mismatches. The substitution threshold is set by --amplicon-max-shortread-substitution-fraction with a default of 0.1.

  • Alignments with a soft-clipping fraction above a threshold. The probe regions are excluded from the calculation and the threshold is set by --amplicon-max-softclip-fraction with a default of 0.1.

  • Off-target alignments with a soft-clipping fraction more than a threshold. The probe regions are included in the calculation and the threshold is set by --amplicon-max-offtarget-softclip-fraction with a default of 0.2.
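The two simplest threshold checks in the list above can be sketched as follows. This is a simplified illustration, not DRAGEN's implementation: it assumes the substitution fraction is mismatches divided by matches plus mismatches, that the caller has already excluded probe regions from the counts, and that the soft-clipping fraction is soft-clipped bases divided by read bases. All function names are hypothetical; the constants are the documented defaults.

```python
MAX_SUBSTITUTION_FRACTION = 0.04  # default of --amplicon-max-substitution-fraction
MAX_SOFTCLIP_FRACTION = 0.1       # default of --amplicon-max-softclip-fraction


def substitution_fraction(matches: int, mismatches: int) -> float:
    """Share of mismatches among matched and mismatched bases (assumed definition)."""
    aligned = matches + mismatches
    return mismatches / aligned if aligned else 0.0


def softclip_fraction(softclipped: int, read_bases: int) -> float:
    """Share of soft-clipped bases in the read (assumed definition)."""
    return softclipped / read_bases if read_bases else 0.0


def fails_filters(matches: int, mismatches: int, softclipped: int, read_bases: int) -> bool:
    """True if either fraction is above its threshold, i.e. the read would be unaligned."""
    if substitution_fraction(matches, mismatches) > MAX_SUBSTITUTION_FRACTION:
        return True
    return softclip_fraction(softclipped, read_bases) > MAX_SOFTCLIP_FRACTION


print(fails_filters(matches=95, mismatches=5, softclipped=0, read_bases=100))
```

The short-read and off-target checks follow the same pattern with their own thresholds and with probe regions included in the counts.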

The RNA Amplicon Pipeline uses the DRAGEN RNA Pipeline. Amplicon-specific parameters are set for fusion calling, including a fusion scoring model trained on RNA amplicon data. Small variant calling is not supported in RNA amplicon mode.

Amplicon BED File

The DRAGEN Amplicon Pipeline requires an amplicon BED file and all input files required by the DRAGEN DNA or RNA pipeline. Each row in an amplicon BED file describes an amplicon target. The fields are as follows.

Field        Description

chrom        The name of the chromosome.
chromStart   The 0-based inclusive start position of the target, excluding the primer.
chromEnd     The 0-based exclusive end position of the target, excluding the primer.
name         The name of the amplicon target.
gene         [Optional] The gene ID.
targetType   [Optional] The target type.
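To make the field layout concrete, here is a minimal parsing sketch for one row of an amplicon BED file. It assumes tab-delimited columns in the order listed above, as is conventional for BED files; the AmpliconTarget type, the function name, and the sample row are hypothetical.

```python
from typing import NamedTuple, Optional


class AmpliconTarget(NamedTuple):
    chrom: str
    start: int                       # 0-based, inclusive, primer excluded
    end: int                         # 0-based, exclusive, primer excluded
    name: str
    gene: Optional[str] = None       # optional gene ID
    target_type: Optional[str] = None  # optional target type


def parse_amplicon_bed_line(line: str) -> AmpliconTarget:
    """Parse one tab-delimited row; the last two columns may be missing."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least chrom, chromStart, chromEnd, name")
    gene = fields[4] if len(fields) > 4 and fields[4] else None
    target_type = fields[5] if len(fields) > 5 and fields[5] else None
    return AmpliconTarget(fields[0], int(fields[1]), int(fields[2]), fields[3], gene, target_type)


# Hypothetical row for illustration only.
target = parse_amplicon_bed_line("chr1\t100\t250\tamp1\tGENE1\tFusion")
print(target.name, target.end - target.start)
```

Because the start is inclusive and the end is exclusive, the target length is simply end minus start.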

In copy number variant calling in DNA amplicon mode, the default segmentation mode is bed, which can be changed via --cnv-segmentation-mode. The CNV segmentation BED is gene-level and is auto-generated from the gene ID column in the amplicon BED file.

In RNA amplicon mode, targetType is used to identify fusion targets, which have a targetType of Fusion. The gene IDs for fusion targets are collected and written to an output file. The default value of --rna-gf-enriched-genes is then set to this file containing fusion gene IDs. A candidate fusion is required to have both partner genes in the gene list.

Base-level and read-level coverage is calculated for each region in the amplicon BED file. It is recommended that the fusion targets are commented to avoid competition with gene expression targets.
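The fusion gene list described above can be approximated with a short sketch. It assumes tab-delimited rows with the gene ID in the fifth column and the target type in the sixth, and it skips blank lines and lines starting with "#". The function names and sample rows are hypothetical, and this is not how DRAGEN itself builds the file.

```python
from typing import Iterable, List


def fusion_gene_ids(bed_lines: Iterable[str]) -> List[str]:
    """Collect unique gene IDs from rows whose targetType is Fusion, in file order."""
    genes: List[str] = []
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines (assumed convention)
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 6 and fields[5] == "Fusion" and fields[4]:
            if fields[4] not in genes:
                genes.append(fields[4])
    return genes


def both_partners_listed(gene_a: str, gene_b: str, genes: List[str]) -> bool:
    """A candidate fusion needs both partner genes in the gene list."""
    return gene_a in genes and gene_b in genes


# Hypothetical rows for illustration only.
rows = [
    "chr1\t100\t250\tamp1\tGENE1\tFusion",
    "chr2\t300\t420\tamp2\tGENE2\tFusion",
    "chr3\t500\t640\tamp3\tGENE3\tExpression",
]
genes = fusion_gene_ids(rows)
print(genes, both_partners_listed("GENE1", "GENE3", genes))
```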

DRAGEN DNA Amplicon Settings

To use the DNA amplicon pipeline, set --enable-dna-amplicon to true. Use --amplicon-target-bed to specify the path to your amplicon BED file.

To enable small variant calling, set --enable-variant-caller to true. The target BED input for small variant calling is set to the amplicon BED file by default and can be changed via --vc-target-bed.

To enable copy number variant calling, set --enable-cnv to true. GC bias correction when generating target counts is enabled by default. Generate the target counts for the normal samples with the same command-line options as the case sample under analysis. The CNV segmentation BED is auto-generated from the gene ID column in the amplicon BED file and can be changed via --cnv-segmentation-bed. See CNV Targeted Segmentation (Segment BED) for more information.

To enable structural variant calling, set --enable-sv to true.

The amplicon pipeline can be run in either germline or somatic mode. For somatic mode, specify a tumor-only or tumor-normal input. For more details about somatic mode, see Somatic Mode and Somatic Mode Options. For more information on the multicaller (germline and somatic) workflows, see Multicaller Workflows. If calling somatic small variants, we also recommend setting --vc-use-somatic-hotspots to false.

By default the maximum amplicon primer length is set to 50. You can specify a different value using --amplicon-primer-length. The parameter affects whether an alignment is assigned to an amplicon target. If an alignment starts inside the primer region of the amplicon target, the alignment is assigned to the amplicon. For a properly paired alignment, both the alignment and the mate must come from the same amplicon target. However, in order to detect deletion events that are close to the target boundaries, we now require only one of the reads to start in the primer region (--amplicon-allow-partial-target=true by default). For candidate deletions, we rewrite the CIGAR to make them candidates for columnwise detection (--amplicon-enable-deletion-realigner=true by default).

  |-- primer --|-- amplicon target --|-- primer --|
     ---------- read ----------------->
              <---------- read -----------------
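The assignment rule in the diagram can be sketched as below. The exact boundary handling is an assumption: the sketch treats the primer regions as the primer_length bases immediately flanking the target and takes the read's 5' position as a 0-based reference coordinate. Function and parameter names are hypothetical and do not reflect DRAGEN internals.

```python
def starts_in_primer(target_start: int, target_end: int, five_prime_pos: int,
                     is_reverse: bool, primer_length: int = 50) -> bool:
    """True if the read's 5' end falls in the flanking primer region (assumed bounds)."""
    if is_reverse:
        # Reverse-strand reads start in the primer to the right of the target.
        return target_end <= five_prime_pos < target_end + primer_length
    # Forward-strand reads start in the primer to the left of the target.
    return target_start - primer_length <= five_prime_pos < target_start


def pair_assigned(read_ok: bool, mate_ok: bool, allow_partial_target: bool = True) -> bool:
    """With partial targets allowed, one read starting in a primer is enough."""
    return (read_ok or mate_ok) if allow_partial_target else (read_ok and mate_ok)


# Hypothetical target spanning [1000, 1200).
fwd = starts_in_primer(1000, 1200, five_prime_pos=980, is_reverse=False)
rev = starts_in_primer(1000, 1200, five_prime_pos=1300, is_reverse=True)
print(fwd, rev, pair_assigned(fwd, rev))
```

In this example only the forward read starts in a primer region, so the pair is assigned when partial targets are allowed and rejected when they are not.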

The following is an example command line to run the DRAGEN DNA Amplicon Pipeline with copy number, structural variant and germline small variant calling.

dragen --enable-dna-amplicon true --enable-map-align=true --enable-sort=true --enable-map-align-output=true -r reference_genomes/Hsapiens/hg19_alt_aware/DRAGEN/8 --amplicon-target-bed=CancerHotSpot-v2.dna_manifest.20180509.bed --enable-variant-caller=true --enable-cnv=true --enable-sv=true --fastq-file1=read1.fastq.gz --fastq-file2=read2.fastq.gz --RGSM NA12878 --RGID 1 --output-directory=/staging/out --output-file-prefix=NA12878

DRAGEN RNA Amplicon Settings

To use the RNA amplicon pipeline, set --enable-rna-amplicon to true. Use --amplicon-target-bed to specify the path to your amplicon BED file.

We do not recommend enabling RNA quantification to produce the .sf quantification output files, because a panel-specific GTF file is usually not used. Use the .target_bed_read_cov_report.bed read-level coverage output file instead. This file is produced automatically when map/align output is enabled.

To enable RNA gene fusion calling, set --enable-rna-gene-fusion to true. Fusion calling parameters are automatically set in RNA amplicon mode but can be overridden in the command line. If fusion targets are not listed in the amplicon BED file, users can explicitly set --rna-gf-enriched-genes to a file containing fusion gene IDs or symbols.

The following is an example command line to run the DRAGEN RNA Amplicon Pipeline with gene fusion calling.

dragen --enable-rna-amplicon true --enable-map-align=true --enable-sort=true --enable-map-align-output=true -r reference_genomes/Hsapiens/hg19_alt_aware/DRAGEN/8 --amplicon-target-bed=Myeloid.rna_manifest.20201014.bed --enable-rna-gene-fusion=true --ann-sj-file=gencode.v19.annotation.gtf --output-format=BAM --fastq-file1=read1.fastq.gz --fastq-file2=read2.fastq.gz --RGSM Seraseq --RGID 1 --output-directory=/staging/out --output-file-prefix=Seraseq
