INTRODUCTION
A decade and a half ago, DNA barcoding was presented as a novel system
to provide wide-scale and quick species identification using certain
gene sequences as molecular species-specific tags (Hebert, Cywinska,
Ball, & de Waard 2003). Since then, the number of species sequenced has
increased exponentially, and DNA barcodes are available for almost 200K
named species in international databases such as the Barcode of Life
Data Systems (BOLD; Ratnasingham & Hebert 2007). DNA barcoding often
allows for the identification of morphologically cryptic species and
individuals at life stages difficult to determine morphologically (e.g.
insect larvae) (Bonal, Muñoz & Vogler 2011; Ahrens, Fabrizi, Skipek &
Lago 2013). It has boosted biodiversity inventories and environmental
monitoring, and constitutes a useful tool in taxonomy, ecology,
agriculture and conservation as well as for customs, police, food and
feed control (Savolainen, Cowan, Vogler, Roderick & Lane 2005; Jinbo,
Kato & Ito 2011; Bergsten et al. 2012). The Barcoding of
Life initiative constitutes a historic feat, but this success does not
mean that the method is free of shortcomings (Dubey, Michaux, Brünner,
Hutterer & Vogler 2009; Berthier, Chapuis, Moosavi, Tohidi-Esfahani &
Sword 2011; Bergsten et al. 2012; Nicholls, Challis, Mutun &
Stone 2012). One of them is the potential decline of identification
accuracy as intra-specific genetic divergence increases along with the
geographical scale (Meyer & Paulay 2005; Bergsten et al. 2012;
but see Lukhtanov, Sourakov, Zakharov & Hebert 2009). In this study,
we examine whether this problem is aggravated when genetic diversity
hotspots are undersampled.
In animals, a 648 bp section of the universal mitochondrial gene
encoding for the protein cytochrome c oxidase subunit I (COI) has been
adopted as the standard barcode (Hebert et al. 2003; Ratnasingham
& Hebert 2007). The logic behind DNA barcoding relies on the structure
of genetic variability above and below the species level. Individuals of
the same species display lower levels of genetic divergence among
themselves compared with heterospecific individuals (Hebert et
al. 2003; Hajibabaei, Singer, Hebert & Hickey 2007). Any genetic
threshold used for identification of queries is arbitrary, ideally
optimized for the dataset in question, and may for instance be 1, 2 or
3% (Ratnasingham & Hebert 2007; Lemos, Fulthorpe, Triplett & Roesch
2011; Collins & Cruickshank 2013). The BOLD identification engine, for
instance, uses a 1% threshold for species-level taxon assignment
(Ratnasingham & Hebert 2007). However, a plethora of methods have been
proposed for identification of unknowns against a reference library
(tree-based, distance-based and character-based), but few improve
noticeably on a standard “best close match” sequence distance strategy
(Spouge 2016).
Many of the more sophisticated methods are also too slow to be
applicable to the growing needs of taxon assignment from the DNA
metabarcoding community.
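The threshold logic described above can be illustrated with a minimal Python sketch. It is not the BOLD identification engine or any published implementation: the function names (p_distance, best_close_match), the toy sequences and the use of an uncorrected p-distance are assumptions made for illustration only. A query receives the name of its nearest reference barcode only if that distance does not exceed the threshold; otherwise it is left unassigned, and ties between different species are flagged as ambiguous.

```python
def p_distance(a, b):
    """Uncorrected p-distance: proportion of differing sites between two
    aligned sequences, ignoring sites with gaps or ambiguous bases."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def best_close_match(query, reference, threshold=0.01):
    """Assign a query to the species of its closest reference barcode.

    reference: list of (species_name, aligned_sequence) tuples.
    Returns the species name, 'ambiguous' if equally close barcodes
    belong to different species, or 'no match' if the closest barcode
    is further away than the threshold.
    """
    distances = [(p_distance(query, seq), species) for species, seq in reference]
    best = min(d for d, _ in distances)
    if best > threshold:
        return "no match"
    closest = {species for d, species in distances if d == best}
    return closest.pop() if len(closest) == 1 else "ambiguous"


# Hypothetical 20 bp toy library (real COI barcodes are about 648 bp).
library = [
    ("sp_A", "ACGTACGTACGTACGTACGT"),
    ("sp_B", "ACGTACGTACTTACGTAGGT"),
]
query = "ACGTACGTACGTACGTACGA"  # one mismatch (5%) to sp_A, three to sp_B

print(best_close_match(query, library, threshold=0.10))  # sp_A
print(best_close_match(query, library, threshold=0.01))  # no match
```

The second call shows why the choice of threshold matters: the same query is identified under a permissive threshold and rejected under a strict one, which is the situation that arises when intra-specific divergence in the reference library is underestimated.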
While the presence of pseudogenes (Dubey et al. 2009; Berthier et al.
2011), past hybridization or incomplete lineage sorting
(Nicholls et al. 2012) may mislead identification, one of the
main caveats of DNA barcoding is not related to the evolutionary
history of the genes, but to the geographical distribution of the
samples. When the geographical scale increases, intra-specific
divergence increases and the distance to the closest related taxa
decreases, which results in more ambiguous specimen identification
(Bergsten et al. 2012).
The geographical scale effect on intra-specific divergence is based on
the well-known concept of genetic isolation by distance (Wright 1943),
but the relationship between genetic divergence and distance may differ
geographically. Taking Europe as an example, studies with different
types of organisms have demonstrated that, far from being homogeneously
distributed, genetic diversity is concentrated in certain areas of the
continent (Hewitt 1996; Avise 2000; Schmitt 2007). Thus, for a given
spatial distance between the sampling sites of two DNA barcodes, the
genetic distance could be higher if at least one of them comes from a
genetic diversity hotspot.
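The comparison implied here, genetic distance set against spatial distance for each pair of barcodes, can be sketched as follows. This is an illustrative Python sketch and not the analysis used in this study: the record fields (id, seq, lat, lon, hotspot), the toy coordinates and sequences, and the choice of great-circle (haversine) distance and uncorrected p-distance are all assumptions.

```python
import math
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_table(records):
    """One row per pair of conspecific barcodes: spatial distance, genetic
    distance, and whether at least one barcode comes from a hotspot."""
    rows = []
    for r1, r2 in combinations(records, 2):
        rows.append({
            "pair": (r1["id"], r2["id"]),
            "spatial_km": haversine_km(r1["lat"], r1["lon"], r2["lat"], r2["lon"]),
            "genetic": p_distance(r1["seq"], r2["seq"]),
            "involves_hotspot": r1["hotspot"] or r2["hotspot"],
        })
    return rows


# Hypothetical records for a single species (toy 10 bp sequences).
records = [
    {"id": "north_1", "seq": "ACGTACGTAC", "lat": 60.0, "lon": 25.0, "hotspot": False},
    {"id": "north_2", "seq": "ACGTACGTAC", "lat": 52.0, "lon": 13.0, "hotspot": False},
    {"id": "south_1", "seq": "ACGTTCGTAA", "lat": 40.0, "lon": -4.0, "hotspot": True},
]

for row in pairwise_table(records):
    print(row["pair"], round(row["spatial_km"]), row["genetic"], row["involves_hotspot"])
```

A table of this kind allows the genetic distance to be compared between pairs that do and do not involve a hotspot sample while accounting for spatial distance.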
In Europe, apart from taxon-specific projects, large national barcoding
initiatives are all in northern and central Europe, e.g. Germany
(Gemeinholzer et al. 2011), the Netherlands (Beentjes, Speksnijder,
Van der Hoorn & Van Tol 2015), Norway (Ekrem et al. 2015) and
Finland (Huemer, Mutanen, Sefc & Hebert 2014; Pentinsaari, Hebert &
Mutanen 2014), far from the southern peninsulas (Iberia, Italy and the
Balkans) that host higher levels of biodiversity, endemism and genetic
diversity (Hewitt 1996; Murienne & Giribet 2009; Pinto, Muñoz,
Chávez-Galarza & De la Rua 2012; Geiger et al. 2014). In fact,
when a few smaller-scale barcoding initiatives have been carried out in
southern Europe for specific groups (like butterflies in the Iberian
Peninsula or freshwater fish around the Mediterranean Basin), the
results have revealed a high genetic richness and distinctiveness and
the existence of a number of potential cryptic species (Geiger et
al. 2014; Dincă et al. 2015).
In this study, we analysed the geographical scale effect on
intra-specific genetic distance, using as a study model a group of
Heteroceran Lepidoptera (i.e. moths) whose caterpillars feed on oak
(Quercus spp.) leaves. These moth species are widely distributed
over most parts of Europe (Camus 1936-1954). We could thus download a
high number of DNA barcodes from the public repository BOLD that were
pooled in the analyses with newly sequenced Iberian samples. Previous
reports on Lepidoptera have shown little intra-specific divergence at a
large geographical scale between Central and Northern Europe (Huemer
et al. 2014). In this study, we included individuals from the
south of the continent to assess the effect of genetic diversity
hotspots on intra-specific genetic distance and identification success.
Our specific objectives were:
- To analyse to what extent the availability of DNA barcodes is biased
towards central and northern Europe.
- To determine whether, in pairwise sequence comparisons, for any given
spatial distance the genetic divergence is higher if at least one of
the sequences comes from a southern European peninsula.
- To reconstruct a COI gene-tree to assess the geographical distribution
of genetic diversity on the continent and the presence of monophyletic
clades (intra-specific distinct lineages and potential cryptic
species) exclusive to southern Europe.