Metabarcodes for the monitoring of freshwater benthic
biodiversity through environmental DNA
Gentile Francesco Ficetola1,2, Frédéric
Boyer1, Alice Valentini3, Aurélie
Bonin1,2, Albin Meyer4, Tony
Dejean3, Coline Gaboriaud3, Philippe
Usseglio-Polatera4, Pierre
Taberlet1,5
1 Univ. Grenoble Alpes, CNRS, Univ. Savoie Mont Blanc,
LECA, Laboratoire d’Ecologie Alpine, F-38000 Grenoble
2 Department of Environmental Sciences and Policy;
University of Milano, Milano, Italia.
3 SPYGEN, Le Bourget du Lac, France
4 Université de Lorraine, CNRS, LIEC, F-57000 Metz,
France
5 UiT – The Arctic University of Norway, Tromsø
Museum, Tromsø, Norway
Abstract
Environmental DNA and metabarcoding have great potential for the
biomonitoring of freshwater environments. However, successful
application of metabarcoding to biodiversity monitoring requires
universal primers with high taxonomic coverage that amplify
highly-variable, short metabarcodes with high taxonomic resolution.
Moreover, reliable and extensive reference databases are essential to
match the outcome of metabarcoding analyses with the available taxonomy
and biomonitoring indices. Benthic invertebrates, particularly insects,
are key taxa for freshwater biomonitoring. Nevertheless, so far, no
formal comparison has assessed primers for metabarcoding of freshwater
macrobenthos. Here we combined in vitro and in silicoanalyses to test the performance of metabarcoding primers amplifying
regions in the 18S rDNA (Euka02 metabarcode), 16S rDNA (Inse01), and COI
(BF1_BR2-COI) genes, and developed an extensive database of benthic
invertebrates of France and Europe, with a special focus on three key
insect orders (Ephemeroptera, Plecoptera and Trichoptera). In
vitro analyses on 1514 individuals, belonging to 578 different
taxonomic units showed very different amplification rates across primer
combinations. The Euka02 marker showed the highest universality, while
the Inse01 marker showed excellent performance for the amplification of
insects. The BF1_BR2-COI metabarcode showed the highest resolution,
while the resolution of Euka02 was often limited. By combining in
vitro data with GenBank information, we developed a curated database
including sequences representing 822 genera. The heterogeneous
performance of the different metabarcodes highlights the complexity of
the identification of the best markers, and advocates for the
integration of multiple metabarcodes for a more comprehensive and
accurate understanding of ecological impacts on freshwater biodiversity.
Keywords: freshwater biodiversity; biomonitoring; biotic indices; DNA
metabarcoding; primer bias; invertebrates; cytochrome c oxidase I;
amplification rate; universality; taxonomic resolution.
1 INTRODUCTION
Freshwater environments are essential providers of clean water and other
services for human society. They also host a substantial biodiversity,
still they are globally subjected to the joint impact of multiple
stressors such as pollution, eutrophication, climate change and
hydrological and hydromorphological modifications (Noges et al.2016; Iversen et al. 2019). As a consequence, numerous
regulations have been adopted at both the national and international
level for the protection of water resources, such as the European Water
Framework Directive (Directive 2000/60/EC) and the Clean Water Act of
the US Environmental Protection Agency (33 U.S.C. §§1251-1387 1972)
(Pawlowski et al. 2018). These regulations generally require the
monitoring of freshwater environments through a combination of
physicochemical, hydrological, and biotic parameters, to obtain prompt
measurements of water quality and of the ecological status of
ecosystems.
Multiple approaches exist to assess freshwater quality using aquatic
organisms. Benthic invertebrates are perhaps the most frequently used
biological group in aquatic bioassessment (Birk et al. 2012),
because they are (i) taxonomically, biologically and functionally
diverse (Usseglio-Polatera et al. 2000; Usseglio-Polateraet al. 2001), (ii) rather easy to identify at the genus or family
levels (Tachet et al. 2010), (iii) often sedentary and reacting
rapidly to anthropogenic pressures in all types of freshwater bodies
(Hering et al. 2006b; Archaimbault et al. 2010; Heringet al. 2013), and (iv) their occurrence integrates the effects of
environmental changes over several months (Floury et al. 2013).
Invertebrate assemblages are thus a tool of choice to assess the
ecological status of water bodies (e.g. Marzin et al.2012; Hering et al. 2013; Mondy & Usseglio-Polatera 2013) and to
demonstrate environmental degradation (Miler et al. 2013; Mondy
& Usseglio-Polatera 2013; Theodoropoulos et al. 2020) or
restoration (Arce et al. 2014; Kupilas et al. 2016;
Camargo 2017; Carlson et al. 2018).
Generally, bioassessment indices relying on benthic communities are
based on the standardized collection of invertebrate assemblages from
monitored sites, followed by organism sorting and taxonomic
identification using morphological criteria. Then, quality scores can be
attributed on the basis of the presence and/or abundance of certain taxa
(Friberg et al. 2006; Birk et al. 2012). As morphological
identification is often challenging, in many case protocols do not
require species-level identification, and identification at the genus or
family level (and, in some cases, even at coarser levels) can be enough
for the calculation of many biotic indices evaluating the ecological
status of rivers (Bailey et al. 2001; Chessman et al.2007; Birk et al. 2012). Nevertheless, the morphological
identification of hundreds collected specimens, including young,
small-sized, larval stages and organisms damaged during sampling,
remains time-consuming and requires a substantial taxonomic expertise,
increasing the cost and time required for in-depth assessment of water
quality (Haase et al. 2004; Hering et al. 2018).
Environmental DNA (eDNA) and metabarcoding are revolutionizing the
monitoring of biodiversity at all levels, because they circumvent the
challenge of morphological identification and allow the efficient
detection of many taxa that are difficult to capture and detect using
traditional methods (Taberlet et al. 2018). eDNA and
metabarcoding are therefore extremely promising for the assessment of
freshwater communities (Hering et al. 2018). DNA can be extracted
from the tissue of pooled invertebrate communities, amplified using
universal primers, sequenced, and identified on the basis of reference
databases (Baird & Hajibabaei 2012; Yu et al. 2012; Andújaret al. 2018). This approach uses the same starting material than
traditional biomonitoring, but allows skipping the complexity of
morphology-based taxonomy (Baird & Hajibabaei 2012). Alternatively, DNA
can be obtained directly from the water (Ficetola et al. 2008).
Environmental DNA extracted from freshwaters allows the detection of
many taxa that are difficult to capture and detect using traditional
methods, but also poses new challenges compared to metabarcoding
performed on the tissues of captured individuals. When in aquatic
environments, DNA undergoes rapid degradation (Eichmiller et al.2016; Buxton et al. 2017); therefore eDNA is generally
characterized by small fragment sizes (Jo et al. 2017; Bylemanset al. 2018), but see also (Sigsgaard et al. 2017). This
generally precludes the use of ”standard” barcode primers, which often
amplify long DNA fragments (e.g. >300 bp in the most
frequently used COI markers; Andújar et al. 2018). Furthermore,
highly degenerated primers increase the risk of non-specific
amplification, thus this kind of primers is not really suitable for the
amplification of the complex mix of DNA extracted from the environment.
As a consequence, the monitoring of benthic invertebrates using eDNA
requires the development and assessment of primers with appropriate
features.
Besides the length of the amplified region, three main characteristics
are essential for satisfactory eDNA metabarcodes. First, the eDNA
amplification rate generally decreases with the number of mismatches
between target fragments and primers. Primers must therefore be designed
in order to have a consistently low number of mismatches within
sequences of the target group (high universality or taxonomic coverage;
Ficetola et al. 2010; Piñol et al. 2015; Marquina et
al. 2019). Taxonomic coverage can be assessed through both in
silico and in vitro analyses. In silico analyses can
allow the rapid assessment of all the taxa for which information is
publicly available in databases, but in vitro tests are still
needed to confirm the conditions under which primers work in the real
world. Second, the amplified region must be highly variable, to ensure
the identification of amplified organisms at the desired taxonomic level
(high resolution; Ficetola et al. 2010; Tang et al. 2012;
Marquina et al. 2019). Finally, extensive databases are essential
if we want to assign the amplified sequences to known taxa. Even though
attempts have been made for the assessment of environmental quality
without a taxonomic assignment of DNA fragments (Ji et al. 2013;
Apothéloz-Perret-Gentil et al. 2017), taxonomic assignment is
essential if we want to produce data comparable with traditional indices
of water quality, or if we want to combine eDNA data with information
obtained through traditional methods (e.g. to analyse long-term
series of water body surveys). Despite several attempts to assess
freshwater quality using eDNA (Hering et al. 2018; Serranaet al. 2019; Czechowski et al. 2020; Pont et al.2020; Yang & Zhang 2020), so far no formal comparison has been
performed among short primers suitable for eDNA metabarcoding of
freshwater macrobenthos. In addition, there is a pressing need of
exhaustive reference databases for taxonomic assignment.
In this study we combined in vitro and in silico analyses
to compare the performance of three primer pairs potentially suitable
for the analysis of eDNA from freshwater invertebrates (macrobenthos),
and we developed an extensive reference database for benthic
invertebrates living in European freshwaters. We mostly focused on three
insect orders (Ephemeroptera, Plecoptera and Trichoptera), which are
among the most frequently used invertebrates for the bioassessment of
streams (e.g. Brabec et al. 2004; Hering et al. 2006a;
Gabriels et al. 2010; Arman et al. 2019; but see also Coxet al. 2019). We also considered a broad range of organisms
belonging to other orders of insects and to other classes. We first
produced the metabarcodes on the broadest available number of taxa from
France, and then combined metabarcodes obtained in vitro with
sequences available in public database, to obtain extensive and reliable
measures of metabarcode performance, and to produce a extensive
reference database for the monitoring of freshwaters through eDNA.
2 MATERIAL AND METHODS
We used the standardized database of European freshwater organisms
(Schmidt-Kloiber & Hering 2015; download on 01 March 2018) as taxonomic
reference for our analyses, considering all the benthic
macro-invertebrates. Although in some cases this database considers
non-monophyletic groups (e.g. Crustacea), it provides an
exhaustive checklist of benthic macroinvertebrates that serve as an
essential basis for monitoring bioassessment.
2.1 In vitro analyses of reference specimens
Most of the reference specimens were provided by OPIE-Benthos which is a
working group of OPIE (Office Pour les Insectes et leur Environnement)
especially dedicated to aquatic insect studies and aquatic ecosystem
protection in France. OPIE-Benthos has developed a national inventory
and reference collection of aquatic insects, including Ephemeroptera,
Plecoptera, Trichoptera, and more recently aquatic Coleoptera, aquatic
and semi-aquatic Heteroptera, aquatic larval stages of Megaloptera,
Nevroptera and Diptera (Ptychopteridae)
(http://www.opie-benthos.fr/opie/insecte.php). Corresponding organisms,
identified at the highest possible level (species, if possible) by
experienced taxonomists, were provided in triplicates (i.e. three
specimens per taxon). The collection was completed by additional taxa
(e.g. non-insect taxa) specifically sampled by the authors for
this reference database.
Specimens were stored in 99% ethanol before DNA extraction. Total DNA
was extracted from the entire organism. Samples (constituted of one
specimen) were initially incubated overnight at 56 °C in 0.5 ml of lysis
buffer (Tris-HCl 0.1 M, EDTA 0.1 M, NaCl 0.01 M and N-lauroyl sarcosine
1%, pH 7.5–8.0). Extractions were then completed using the DNeasy
Blood Tissue Kit (Qiagen GmbH, Hilden, Germany), according to the
manufacturer’s instructions. DNA extracts were recovered in a total
volume of 300 μl of elution buffer. Negative extractions without
specimens were systematically performed to monitor possible
contaminations. Three DNA amplifications were carried out for each
sample using the following primer pairs: Inse01, amplifying a
~155 bp region of the 16S mitochondrial rDNA (Taberletet al. 2018); Euka02, amplifying a ~123 bp region
of the 18S rDNA (Guardiola et al. 2015; Taberlet et al.2018); and the BF1 and BR2 primers, which amplify a ~316
bp region of the cytochrome c oxidase I (Elbrecht & Leese 2017). Inse01
has been developed mostly to amplify insects, Euka02 to amplify all
eukaryotes, while BF1 and BR2 were designed to amplify freshwater
macroinvertebrates (Elbrecht & Leese 2017; Taberlet et al.2018). DNA amplifications were performed in a final volume of 20 μL,
using 2 μL of DNA extract as template. The amplification mixture
contained 10 µL of Applied Biosystems™ Master Mix AmpliTaq Gold™ 360,
0.2 μg/μL of bovine serum albumin (BSA, Roche Diagnostic, Basel,
Switzerland) and 0.5 µM of each primer for COI and Inse01, or 0.2 µM of
each primer for Euka2. Forward and reverse primers were 5’-labeled with
eight-nucleotide tags with at least three differences between any pair
of tags, so that each PCR replicate was identified by a unique
combination of tags. This allowed the assignment of each sequence to the
corresponding replicate during sequence analysis (Coissac 2012; Taberletet al. 2018). The PCR mixture was denatured at 95°C for 10 min,
followed by 35 cycles of 30 s at 95°C, 30 s at 52°C for COI and Inse01
or 45°C for Euka2, and 1 min at 72°C (1m 30s for COI), and followed by a
final elongation at 72°C for 7 min. Negative DNA extraction and PCR
controls (ultrapure water, with 3 replicates as well) were analysed in
parallel with the samples to monitor possible contaminations during the
PCR step.
For Euka02 and Inse01, sequencing was performed by 2 × 125-bp pair-end
sequencing on Illumina HiSeq 2500 platform, while for BF1_BR2-COI
sequencing was performed by 2 × 250-bp pair-end sequencing on Illumina
MiSeq platform at Fasteris (Geneva, Switzerland). Sequencing data were
processed using the OBITools (Boyer et al. 2016). Raw sequences
were first aligned (illuminapairedend) to recover the amplicon sequence
and then demultiplex (ngsfilter) to assign them to the samples. This was
followed by dereplication (obiuniq) keeping track for each sequence of
its count in the samples. Then for each sample, the ratio of counts for
the most abundant sequence and the second most abundant sequence was
calculated. Only the most abundant sequences having a count greater than
1000 and a ratio above 1/10 were considered to get rid of badly
amplified samples and samples were several product were amplified.
As a further validation step, all the retrieved metabarcodes were
matched against NCBI using BLAST, to identify eventual cases in which
the obtained metabarcode is a spurious amplification of a non-target
organism (e.g. fungi or algae). The in vitro amplification
rate was measured for each taxon as the proportion of specimens for
which we obtained valid metabarcodes.
2.2 Setting up the composite reference databases
For each species within the database of European freshwater organisms
(Schmidt-Kloiber & Hering 2015), we matched the binomial name with the
NCBI taxonomy database to retrieve their NCBI taxonomic code (taxid).
All the available metabarcodes for the three regions of interest,
together with their associated taxid, were extracted from the EMBL
sequence data repository (release 136) using the ecoPCR program
(Ficetola et al. 2010) by matching the primer sequences with up
to 3 errors and restricting the metabarcodes to relevant lengths (length
>30 bp for Euka02, length 70-270 bp for Inse01, length
100-500 for BF1_BR2-COI). The three composite reference databases (one
for each metabarcode region) were then built by aggregating metabarcodes
for each genus with those obtained from specimens analysed in
vitro . In order to obtain the most complete coverage of genera found in
France, we obtained the taxid of all metabarcodes produced throughin vitro analyses as well as metabarcodes extracted from EMBL and
associated to the taxid of a species found in France. For genera for
which no such metabarcode existed, we included the metabarcodes
extracted from EMBL and associated to the taxid of a species of the same
genus found in Europe. If no such metabarcode existed, we included all
the metabarcodes extracted from EMBL, and associated to a taxid
belonging to this genus, also considering species that are not native in
Europe.
2.3 Assessing the resolution of metabarcodes
We assessed the resolution of each metabarcoding region with the same
procedure. First, the metabarcodes obtained as described above were
compared to each other to find identical metabarcodes; this allowed
producing a list of unique metabarcodes. For each unique metabarcode, we
obtained the list of all the associated taxids. We tested taxonomic
resolution at four levels: order, family, genus, and species. More
specifically, we tested if, at a given taxonomic level, the list of
associated taxids would collapse to a unique taxid or not (i.e.all taxids have the same ancestor taxid at that level). If a list would
not collapse to one unique taxid for the tested taxonomic level, it
meant that this metabarcode was not discriminant for this taxonomic
level. Consider for instance a given metabarcode associated to multiple
species within multiple genera within one single family. This particular
metabarcode showed a family-level resolution, but not a species- or a
genus-level resolution. It must be remarked that these measures of
taxonomic resolution are heavily dependent on the available database.
For example, if the database includes the metabarcode of only one
species within a genus, this analysis could return a species-level
resolution, even though it is possible that unanalysed species within
the same genus share the same metabarcode.
3 RESULTS
3.1 In vitro analyses of reference specimens
We extracted and amplified DNA from 1514 individuals, belonging to 578
different taxa (Table 1). The majority of specimens were insects, and
three insect orders with macrobenthic larvae (Ephemeroptera, Plecoptera
and Trichoptera) altogether accounted for 80% of analysed specimens.
Out of these specimens, 99% were morphologically identified at the
family level or higher, 95% at the genus level or higher, and 62% at
the species level. The average number of sampled individuals was 2.6
individuals per taxon (range: 1-12; median: 3). For Ephemeroptera,
Plecoptera, Trichoptera and Megaloptera the analysed specimens covered
well the diversity of French and European benthic fauna (100%, 74%,
78% and 100% of genera recorded in France for Ephemeroptera,
Plecoptera, Trichoptera and Megaloptera, respectively; 70%, 52%, 65%
and 100% of all the genera recorded in Europe; Table 2). Representation
was relatively good for Coleoptera, Hemiptera and Neuroptera, whereas
coverage was weaker for the remaining orders of insects and for
non-insects.
The amplification rate using the three metabarcodes was highly
heterogeneous among taxonomic groups (Fig. 1). Euka02 (18S) showed the
highest average amplification success (88%), with consistently high
amplification success in all the taxa except Malacostraca (Fig. 1).
Within insects, Euka02 showed excellent amplification success in most of
orders, but its amplification success was poor with Diptera (Fig. 1b).
As expected, Inse01 showed good amplification success for insects
(82%), while it showed a limited amplification of the remaining taxa
(Fig. 1a). Within insects, Inse01 showed excellent amplification success
in all the orders except Trichoptera, where amplification success was
71% (Fig. 1b).
Finally, BF1_BR2-COI showed an average amplification rate of 48%, with
highly variable results among taxa (Fig. 1a). BF1_BR2-COI showed a good
amplification rate with Gastropoda, Clitellata and Malacostraca, while
the rate was lower for several orders of insects. Within insects,
BF1_BR2-COI showed good performance in Coleoptera and Diptera
(amplification success ≥ 74%), while it amplified less than 50% of
specimens from Ephemeroptera, Plecoptera and Trichoptera (Fig. 1b).
3.2 Combined database
When we combined sequences obtained in vitro with sequences
obtained from GenBank, we obtained a total of 18 834 metabarcodes (3 441
for Euka02, 9 715 for Inse01 and 5 678 for BF1_BR2-COI). Insects
accounted for the majority of metabarcodes, followed by Crustacea and
Clitellata (Table 3). The combined database showed a good coverage of
the diversity of European benthic fauna. For the Euka02 primer pair, the
completeness of the database was particularly good (>80%)
for Turbellaria, Coleoptera and Odonata. For Inse01, the completeness
was particularly good for Coleoptera, Ephemeroptera and Odonata, while
BF1_BR2-COI showed a relatively homogeneous completeness across taxa,
with values between 50 and 70% for most of taxa (Fig. 2).
3.3 Taxonomic resolution of metabarcodes
The taxonomic resolution was strongly different among metabarcodes. For
Euka02, 21% of metabarcodes were associated with more than one species
in the database (Fig. 3a). The best resolution was observed for
BF1_BR2-COI, with just 3% of metabarcodes associated with more than
one species, while Inse01 showed an intermediate resolution (10% of
metabarcodes associated with more than one species; Fig. 3a). The
taxonomic resolutions of these metabacodes were clearly better if we
consider the identification at the genus level (Fig. 3b). Euka02 showed
the weakest performance, with around 6% of metabarcodes associated with
more than one genus, while BF1_BR2-COI showed the best performance,
with less than 1% of metabarcodes associated with more than one genus.
Inse01 showed a generally good performance, with less than 1% of
metabarcodes associated with more than one genus for most taxa. The
performance was slightly poorer for Plecoptera and Trichoptera, with
around 4% of metabarcodes associated with more than one genus. Family
level identification was very good for all the metabarcodes, with a
slightly poorer performance of Euka02 (Fig. 3c). It must be remarked
that these values of resolution are calculated on an incomplete set of
data, since our database did not include the sequences of many species
and genera (Table 3), and all resolution estimates would probably be
poorer if calculated on a complete database.