Tagging and indexing approaches in metabarcoding studies
Today, the most commonly used high-throughput sequencing platforms for
metabarcoding studies are those of the Illumina series, of which, for
example, the MiSeq, iSeq, HiSeq, NextSeq, and NovaSeq have been employed
(Jarman et al. 2018). These platforms offer high throughput, relatively low error
rates, and paired-end reads, typically up to 150 bp for each
paired read on the NextSeq 550/1000/2000, HiSeq 3000/4000 and NovaSeq (up
to 250 bp on the SP flow cell), and 300 bp for each paired read on the MiSeq
platform
(www.illumina.com,
applied in e.g. Shehzad et al. 2012b; Quéméré et al. 2013;
Hope et al. 2014; Elbrecht et al. 2017; Stoeck et
al. 2018; Singer et al. 2019).
The sequencing depth required per sample is commonly much lower in
metabarcoding studies than in shotgun sequencing studies
(e.g.
Srivathsan et al. 2015; Stat et al. 2017), and in
metabarcoding studies it is (economically) feasible to sequence tens,
hundreds, or even thousands of samples per sequencing run. To allow
pooling and parallel sequencing of this magnitude, different molecular
labelling systems have been developed. For metabarcoding studies, the
addition of sample-specific identifiers to PCR amplicons can be achieved
either as nucleotide tags during the metabarcoding PCR, or as library
indices when converting amplicons into sequencing libraries.
Metabarcoding approaches can be divided into three overall strategies
for adding nucleotide tags and library indices
(Taberlet et al. 2018) (Fig. 2):
- The ‘one-step PCR’ approach in which sample DNA extracts are amplified
and built into sequence libraries in one reaction. Here, metabarcoding
primers carry sequencing adapters and library indices, referred to as
‘fusion primers’ (Fig. 2B). This approach is used in e.g. Kozich et al.
(2013), Elbrecht and Leese (2015), Sickel et al. (2015), Grealy et al.
(2016), Berry et al. (2017), Elbrecht et al. (2017), Hardy et al. (2017),
Seersholm et al. (2018) and Bessey et al. (2020). In
the one-step PCR approach, each PCR replicate or sample is a
sequencing library and as such is returned as a separate fastq file
following sequencing. It should be noted that a few studies modify
this approach by adding nucleotide tags to the fusion primers instead
of library indices
(e.g.
Elbrecht & Steinke 2018). When doing that, each PCR replicate is not
an individual sequencing library.
- The ‘two-step PCR’ approach in which sample DNA extracts are
PCR-amplified with two primer sets. In the primary reaction, the
metabarcoding primers carry 5’ sequence overhangs of ca. 33-34
nucleotides in length and no nucleotide tags. The sequence overhangs
allow the resulting amplicons to be targeted by the second round of
primers, which carry sequencing adapters and indices (Fig. 2C). Most
commonly, two consecutive PCRs are carried out, such as in Miya et al.
(2015), de Vere et al. (2017), Galan et al. (2017), Kaunisto et al. (2017),
Swift et al. (2018) and Vesterinen et al. (2018). However, a few studies
carry out only one reaction with the two primer sets, such as Clarke et al.
(2014a). The
two-step PCR approach is based on Illumina’s 16S rRNA system
originally developed for microbiome studies (www.illumina.com). In the
two-step approach, each PCR replicate is an individual sequencing
library and as such is returned as a separate fastq file following
sequencing. It should be noted that a few studies modify the two-step
PCR approach to include nucleotide labelling in the first PCR, see
Kitson et al. (2018).
- The ‘tagged PCR’ approach, in which sample DNA extracts are
PCR-amplified with metabarcoding primers that carry 5’ nucleotide tags.
The individually tagged PCR products are pooled, and ligation-based
library preparation is carried out on pools of 5’ tagged amplicons.
The ligated adapters can themselves contain indices, which eliminates
the need for a second PCR step
(e.g.
Thomsen et al. 2016; Carøe & Bohmann 2020), or the adapter
ligation can be followed by a PCR step with indexed primers
(e.g.
Hope et al. 2014; Bohmann et al. 2018). This approach
was first demonstrated by Binladen et al. (2007) on the 454 FLX platform
and has since been used in e.g. Shehzad et al. (2012a), Hibert et al.
(2013), Hope et al. (2014), Thomsen et al. (2016), Apothéloz-Perret-Gentil
et al. (2017), Sigsgaard et al. (2017), Bakker et al. (2017), Kocher et al.
(2017), Thomsen and Sigsgaard (2019) and Lynggaard et al. (2020) (Fig. 2D).
In this approach, each library pool of PCR replicates is a
sequencing library and is returned as a separate fastq file, each of
which can contain data from a large number of PCR replicates.
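Because a fastq file from the tagged PCR approach holds reads from many PCR replicates, those reads must be sorted by their 5’ nucleotide tags after sequencing. The following Python sketch illustrates the idea for paired reads. It is a minimal illustration and not the method of any cited study: the tag length, tag sequences, primer sequences, replicate names and function names are all hypothetical placeholders, exact matching is assumed (no sequencing errors), and amplicons ligated in the reverse orientation are ignored.

```python
from collections import defaultdict

# All values below are hypothetical placeholders for illustration only.
TAG_LEN = 6
FWD_PRIMER = "ACCTGC"  # placeholder, not a real metabarcoding primer
REV_PRIMER = "TTGGCA"  # placeholder, not a real metabarcoding primer

# (forward tag, reverse tag) -> PCR replicate, as recorded in the lab.
SAMPLE_SHEET = {
    ("AACCGG", "TTGGAA"): "sample1_rep1",
    ("AACCGG", "CCTTAA"): "sample1_rep2",
    ("GGTTCC", "TTGGAA"): "sample2_rep1",
}


def assign_read_pair(read1, read2):
    """Return (replicate, insert1, insert2), or None if tags or primers do not match."""
    tag1, rest1 = read1[:TAG_LEN], read1[TAG_LEN:]
    tag2, rest2 = read2[:TAG_LEN], read2[TAG_LEN:]
    # The primer must directly follow the tag in each read.
    if not rest1.startswith(FWD_PRIMER) or not rest2.startswith(REV_PRIMER):
        return None
    replicate = SAMPLE_SHEET.get((tag1, tag2))
    if replicate is None:
        # Tag combination that was never used in the lab.
        return None
    return replicate, rest1[len(FWD_PRIMER):], rest2[len(REV_PRIMER):]


def demultiplex(read_pairs):
    """Sort read pairs from one library into bins keyed by PCR replicate."""
    bins = defaultdict(list)
    for read1, read2 in read_pairs:
        hit = assign_read_pair(read1, read2)
        if hit is None:
            bins["unassigned"].append((read1, read2))
        else:
            replicate, insert1, insert2 = hit
            bins[replicate].append((insert1, insert2))
    return dict(bins)
```

In this sketch, read pairs carrying a tag combination that is absent from the sample sheet are kept in an "unassigned" bin rather than discarded, so that their number can be inspected.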
All three main strategies offer the option to add extra nucleotides to
shift PCR amplicons in relation to each other and thereby to increase
sequence complexity on the flow cell (‘heterogeneity spacers’, see e.g.
De Barba et al. 2014; Elbrecht & Leese 2015; Bohmann et al. 2018). Note
that, given the inconsistent use of terminology in the metabarcoding
literature, for clarity we use the original term for nucleotide tags in
amplicon sequencing, as used by Binladen et al. (2007), and
Illumina’s terminology for the reads that are used to
demultiplex sequencing libraries, i.e. the i5 and i7 index reads. That is, 5’
nucleotide tags are sequenced together with the metabarcoding primers and
marker in Illumina sequencing read 1 (and read 2 for paired-end
sequencing), while library indices are sequenced in separate index
reads, i.e. as i5 and i7 reads if dual indexing is performed (Fig. 2A)
(https://support.illumina.com).
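The effect of heterogeneity spacers on sequence complexity can be illustrated with a small Python sketch. It counts how many distinct bases occur at each sequencing cycle when all reads start with the same primer, compared with reads whose primer is preceded by spacers of different lengths. The primer and spacer sequences and the function name are hypothetical placeholders chosen for illustration, and tags are left out for simplicity.

```python
# Hypothetical placeholder sequences for illustration only.
PRIMER = "GGTCAACAAATC"
SPACERS = ["", "A", "CT", "GAC"]  # spacers of 0 to 3 nucleotides


def distinct_bases_per_cycle(reads, n_cycles):
    """Number of different bases observed at each of the first n_cycles positions."""
    return [len({read[i] for read in reads if i < len(read)}) for i in range(n_cycles)]


# Without spacers, every read shows the same base at every cycle.
unshifted = [PRIMER for _ in SPACERS]

# With spacers, the primer starts at a different position in each read.
shifted = [spacer + PRIMER for spacer in SPACERS]

print(distinct_bases_per_cycle(unshifted, 4))
print(distinct_bases_per_cycle(shifted, 4))
```

When spacers are used, the variable spacer length must also be accounted for when locating tags and primers in the reads during demultiplexing.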
In this article, we discuss the three main metabarcoding strategies. One
approach not mentioned here is library preparation on individual
unlabelled PCR products through a ligation-based library preparation
protocol with or without an index PCR step. However, such a ligation-based
protocol would entail several steps on each PCR product, such as
end-repair and ligation of adapters (e.g. carrying indices, as in
Illumina’s TruSeq Nano DNA Library Prep kit; see Zizka et al. 2019). We
do not consider this approach a main metabarcoding
strategy because of its low reported use, its high cost and
workload, and thereby its limited throughput
(Zizka et al. 2019).