Tagging and indexing approaches in metabarcoding studies
Today, the most commonly used high-throughput sequencing platforms for metabarcoding studies are those of the Illumina series, of which for example the MiSeq, iSeq, HiSeq, NextSeq, and NovaSeq have been employed (Jarman et al. 2018). These platforms offer high throughput, relatively low error rates, and paired-end reads, typically up to 150 bp for each paired read on the NextSeq 550/1000/2000, HiSeq 3000/4000 and NovaSeq (up to 250 bp on the SP flow cell), and 300 bp for each paired read on the MiSeq platform (www.illumina.com, applied in e.g. Shehzad et al. 2012b; Quéméré et al. 2013; Hope et al. 2014; Elbrecht et al. 2017; Stoeck et al. 2018; Singer et al. 2019).
The sequencing depth required per sample is commonly much lower in metabarcoding studies than in shotgun sequencing studies (e.g. Srivathsan et al. 2015; Stat et al. 2017), and in metabarcoding studies it is (economically) feasible to sequence tens, hundreds, or even thousands of samples per sequencing run. To allow pooling and parallel sequencing of this magnitude, different molecular labelling systems have been developed. For metabarcoding studies, the addition of sample-specific identifiers to PCR amplicons can be achieved either as nucleotide tags during the metabarcoding PCR, or as library indices when converting amplicons into sequencing libraries.
Metabarcoding approaches can be divided into three overall strategies for adding nucleotide tags and library indices (Taberlet et al. 2018) (Fig. 2):
  1. The ‘one-step PCR’ approach in which sample DNA extracts are amplified and built into sequence libraries in one reaction. Here, metabarcoding primers carry sequencing adapters and library indices, referred to as ‘fusion primers’ (Fig. 2B). This approach is used in e.g. Kozich et al. (2013), Elbrecht and Leese (2015), Sickel et al. (2015), Grealy et al. (2016), Berry et al. (2017), Elbrecht et al. (2017), Hardy et al. (2017), Seersholm et al. (2018) and Bessey et al. (2020). In the one-step PCR approach, each PCR replicate or sample is a sequencing library and as such is returned as a separate fastq file following sequencing. It should be noted that a few studies modify this approach by adding nucleotide tags to the fusion primers instead of library indices (e.g. Elbrecht & Steinke 2018). When doing that, each PCR replicate is not an individual sequencing library.
  2. The ‘two-step PCR’ approach in which sample DNA extracts are PCR-amplified with two primer sets. In the primary reaction, metabarcoding primers carry 5’ sequence overhangs of ca. 33-34 nucleotides in length and no nucleotide tags. The sequence overhangs allow the resulting amplicons to be targeted by the second round of primers, which carry sequencing adapters and indices (Fig. 2C). Most commonly, two consecutive PCRs are carried out, such as in Miya et al. (2015), de Vere et al. (2017), Galan et al. (2017), Kaunisto et al. (2017), Swift et al. (2018) and Vesterinen et al. (2018). However, a few studies carry out only one reaction with the two primer sets, such as Clarke et al. (2014a). The two-step PCR approach is based on Illumina’s 16S rRNA system originally developed for microbiome studies (www.illumina.com). In the two-step approach, each PCR replicate is an individual sequencing library and as such is returned as a separate fastq file following sequencing. It should be noted that a few studies modify the two-step PCR approach to include nucleotide labelling in the first PCR, see Kitson et al. (2018).
  3. The ‘tagged PCR’ approach, in which sample DNA extracts are PCR-amplified with metabarcoding primers that carry 5’ nucleotide tags. The individually tagged PCR products are pooled, and ligation-based library preparation is carried out on pools of 5’ tagged amplicons. The ligated adapters can themselves contain indices, which eliminates the need for a second PCR step (e.g. Thomsen et al. 2016; Carøe & Bohmann 2020), or the adapter ligation can be followed by a PCR step with indexed primers (e.g. Hope et al. 2014; Bohmann et al. 2018). This approach was first demonstrated by Binladen et al. (2007) on the 454 FLX platform and has since been used in e.g. Shehzad et al. (2012a), Hibert et al. (2013), Hope et al. (2014), Thomsen et al. (2016), Apothéloz-Perret-Gentil et al. (2017), Sigsgaard et al. (2017), Bakker et al. (2017), Kocher et al. (2017), Thomsen and Sigsgaard (2019) and Lynggaard et al. (2020) (Fig. 2D). In this approach, each library pool of PCR replicates is a sequencing library and is returned as a separate fastq file, each of which can contain data from a large number of PCR replicates.
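Because a library pool from the tagged PCR approach is returned as a single fastq file, reads must afterwards be assigned to PCR replicates from the 5’ nucleotide tags found at the start of each read. The following Python code is a minimal sketch of that assignment, not a published pipeline. The primer sequences, the 6-nucleotide tags, the replicate names and the function names are all hypothetical, and the sketch assumes exact matching and that adapters can ligate to either end of an amplicon, so that the forward primer may appear in read 1 or in read 2.

```python
from collections import Counter

# All sequences below are hypothetical and only serve to illustrate the logic.
FORWARD_PRIMER = "GGTCAACAAATCATAAAGATATTGG"
REVERSE_PRIMER = "TAAACTTCAGGGTGACCAAAAAATCA"
TAG_LENGTH = 6

# (forward tag, reverse tag) -> PCR replicate within one library pool
TAG_COMBINATIONS = {
    ("ACGTAC", "TGCATG"): "sample1_rep1",
    ("ACGTAC", "GATCGA"): "sample1_rep2",
    ("CTAGTC", "TGCATG"): "sample2_rep1",
}


def assign_read_pair(read1, read2):
    """Return the PCR replicate for a read pair, or None if it cannot be assigned."""
    tag1, rest1 = read1[:TAG_LENGTH], read1[TAG_LENGTH:]
    tag2, rest2 = read2[:TAG_LENGTH], read2[TAG_LENGTH:]
    # Assumption: the amplicon can sit in either orientation relative to the
    # ligated adapters, so both primer arrangements are checked.
    if rest1.startswith(FORWARD_PRIMER) and rest2.startswith(REVERSE_PRIMER):
        return TAG_COMBINATIONS.get((tag1, tag2))
    if rest1.startswith(REVERSE_PRIMER) and rest2.startswith(FORWARD_PRIMER):
        return TAG_COMBINATIONS.get((tag2, tag1))
    return None


def count_reads_per_replicate(read_pairs):
    """Count read pairs per PCR replicate; unassigned pairs are counted under None."""
    return Counter(assign_read_pair(r1, r2) for r1, r2 in read_pairs)
```

Read pairs carrying a tag combination that was never used in the laboratory are left unassigned here (returned as None) rather than being forced into a replicate. Exact matching of tags and primers is a simplification made for this sketch.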
All three main strategies offer the option to add extra nucleotides to shift PCR amplicons in relation to each other and thereby to increase sequence complexity on the flow cell (‘heterogeneity spacers’, see e.g. De Barba et al. 2014; Elbrecht & Leese 2015; Bohmann et al. 2018). Note that, given the inconsistent use of terminology in the metabarcoding literature, for clarity we use the original term for nucleotide tags in amplicon sequencing as used by Binladen et al. (2007), and Illumina’s terminology to describe the nucleotide reads that are used to demultiplex sequencing libraries, the i5 and i7 index reads. That is, 5’ nucleotide tags are sequenced together with the metabarcoding marker and primers in the Illumina sequencing read 1 (and read 2 for paired-end sequencing), while library indices are sequenced as separate index reads, i.e. as i5 and i7 index reads if dual indexing is performed (Fig. 2A) (https://support.illumina.com).
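The effect of heterogeneity spacers can be illustrated by counting how many different bases are read in each sequencing cycle across a set of amplicons. The Python code below is a minimal sketch under stated assumptions: the primer sequence, the spacers of 0 to 3 nucleotides and the function name are hypothetical and chosen only for illustration, and only the first cycles covering the primer are considered.

```python
# Hypothetical primer and spacers, used only to illustrate the principle.
PRIMER = "GGTCAACAAATCATAAAGATATTGG"
SPACERS = ["", "A", "TC", "GAT"]


def distinct_bases_per_cycle(read_starts, n_cycles):
    """Return, for each of the first n_cycles, the number of different bases read."""
    return [len({read[i] for read in read_starts}) for i in range(n_cycles)]


# Without spacers every amplicon starts with the same primer sequence,
# so each cycle reads a single base across all amplicons.
without_spacers = [PRIMER] * len(SPACERS)

# With spacers of different lengths the primer is shifted between amplicons,
# so several different bases are read in the same cycle.
with_spacers = [spacer + PRIMER for spacer in SPACERS]

print(distinct_bases_per_cycle(without_spacers, 8))
print(distinct_bases_per_cycle(with_spacers, 8))
```

In this toy example every cycle reads one base without spacers, whereas the shifted amplicons yield two or more bases per cycle, which is the increase in sequence complexity referred to above.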
In this article, we discuss the three main metabarcoding strategies. One approach not mentioned here is library preparation on individual unlabelled PCR products through a ligation-based library preparation protocol with or without an index PCR step. However, such a ligation-based protocol would entail several steps on each PCR product, such as end-repair and ligation of adapters (e.g. carrying indices such as in Illumina’s TruSeq Nano DNA Library Prep kit, see Zizka et al. 2019). We do not consider this approach a main metabarcoding strategy because of its low reported use, its high cost and workload, and thereby its limited throughput (Zizka et al. 2019).