INTRODUCTION
A decade and a half ago, DNA barcoding was presented as a novel system for rapid, large-scale species identification using certain gene sequences as species-specific molecular tags (Hebert, Cywinska, Ball & de Waard 2003). Since then, the number of species sequenced has increased exponentially, and DNA barcodes are now available for almost 200K named species in international databases such as the Barcode of Life Data Systems, BOLD (Ratnasingham & Hebert 2007). DNA barcoding often allows the identification of morphologically cryptic species and of individuals at life stages that are difficult to determine morphologically (e.g. insect larvae) (Bonal, Muñoz & Vogler 2011; Ahrens, Fabrizi, Skipek & Lago 2013). It has boosted biodiversity inventories and environmental monitoring, and constitutes a useful tool in taxonomy, ecology, agriculture and conservation, as well as for customs, police, and food and feed control (Savolainen, Cowan, Vogler, Roderick & Lane 2005; Jinbo, Kato & Ito 2011; Bergsten et al. 2012). The Barcoding of Life initiative is a historic achievement, but its success does not mean that the method is free of shortcomings (Dubey, Michaux, Brünner, Hutterer & Vogler 2009; Berthier, Chapuis, Moosavi, Tohidi-Esfahani & Sword 2011; Bergsten et al. 2012; Nicholls, Challis, Mutun & Stone 2012). One of them is the potential decline in identification accuracy as intra-specific genetic divergence increases with geographical scale (Meyer & Paulay 2005; Bergsten et al. 2012; but see Lukhtanov, Sourakov, Zakharov & Hebert 2009). In this study, we examine whether this problem is aggravated when genetic diversity hotspots are undersampled.
In animals, a 648 bp section of the universal mitochondrial gene encoding the protein cytochrome c oxidase subunit I (COI) has been adopted as the standard barcode (Hebert et al. 2003; Ratnasingham & Hebert 2007). The logic behind DNA barcoding relies on the structure of genetic variability above and below the species level: individuals of the same species display lower levels of genetic divergence among themselves than they do from heterospecific individuals (Hebert et al. 2003; Hajibabaei, Singer, Hebert & Hickey 2007). Any genetic threshold used to identify queries is arbitrary, is ideally optimized for the dataset in question, and may for instance be 1, 2 or 3% (Ratnasingham & Hebert 2007; Lemos, Fulthorpe, Triplett & Roesch 2011; Collins & Cruickshank 2013). The BOLD identification engine, for instance, uses 1% for species-level taxon assignment (Ratnasingham & Hebert 2007). A plethora of methods (tree-based, distance-based and character-based) have been proposed for identifying unknowns against a reference library, but few improve noticeably on a standard “best close match” sequence distance strategy (Spouge 2016). Many of the more sophisticated methods are also too slow to meet the growing taxon assignment needs of the DNA metabarcoding community.
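To make the threshold logic concrete, the following is a minimal sketch of a "best close match" style assignment, assuming pre-aligned sequences of equal length and uncorrected p-distances. The function names (`p_distance`, `best_close_match`), the return labels and the toy sequences are illustrative assumptions, not the procedure used by BOLD or by this study.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only sites where both sequences carry an unambiguous base.
    sites = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def best_close_match(query, references, threshold=0.01):
    """Assign a query to the species of its closest reference barcode.

    references: list of (species_name, aligned_sequence) tuples.
    threshold:  maximum distance accepted for an assignment (assumed 1% here).
    Returns (label, distance_to_closest_reference).
    """
    distances = [(p_distance(query, seq), species) for species, seq in references]
    best = min(d for d, _ in distances)
    if best > threshold:
        # The closest barcode is too divergent to support an assignment.
        return "no identification", best
    closest_species = {species for d, species in distances if d == best}
    if len(closest_species) > 1:
        # Several species are equally close to the query.
        return "ambiguous", best
    return closest_species.pop(), best


# Toy 10 bp sequences for illustration only (a standard COI barcode is 648 bp).
library = [("Species A", "ACGTACGTAC"), ("Species B", "ACGTACGTTT")]
print(best_close_match("ACGTACGTAC", library))  # ('Species A', 0.0)
```

Because the threshold is a parameter, the same sketch can be rerun with 2% or 3% to see how sensitive the assignments are to that choice.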
While the presence of pseudogenes (Dubey et al. 2009; Berthier et al. 2011), past hybridization or incomplete lineage sorting (Nicholls et al. 2012) may mislead identification, one of the main caveats of DNA barcoding is related not to the evolutionary history of the genes but to the geographical distribution of the samples. As the geographical scale increases, intra-specific divergence increases and the distance to the most closely related taxon decreases, which results in more ambiguous specimen identification (Bergsten et al. 2012).
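The scale effect described above can be illustrated by comparing, for each species, the maximum intra-specific distance with the minimum distance to any heterospecific sequence. The sketch below assumes aligned sequences of equal length and uncorrected p-distances; the function `barcoding_gap` and the toy sequences are hypothetical and serve only to show the two quantities moving in opposite directions when a divergent conspecific sample is added.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcoding_gap(records):
    """records: list of (species, aligned_sequence) tuples.

    Returns {species: (max_intraspecific, min_interspecific)}.
    min_interspecific is None when only one species is present.
    """
    max_intra = {species: 0.0 for species, _ in records}
    min_inter = {species: None for species, _ in records}
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        d = p_distance(s1, s2)
        if sp1 == sp2:
            max_intra[sp1] = max(max_intra[sp1], d)
        else:
            # A heterospecific pair updates the nearest-neighbour distance of both species.
            for species in (sp1, sp2):
                if min_inter[species] is None or d < min_inter[species]:
                    min_inter[species] = d
    return {species: (max_intra[species], min_inter[species]) for species in max_intra}


# Toy sequences: a "local" sample, then the same sample plus one divergent
# conspecific sequence standing in for a geographically distant population.
local = [("A", "AAAAAAAAAA"), ("A", "AAAAAAAAAC"), ("B", "AAAAACCCCC")]
wide = local + [("A", "AAAAAAACCC")]
print(barcoding_gap(local)["A"])  # (0.1, 0.4)
print(barcoding_gap(wide)["A"])   # (0.3, 0.2)
```

In the toy example, widening the sample raises the maximum intra-specific distance of species A above its distance to the nearest heterospecific sequence, which is the situation in which threshold-based identification becomes ambiguous.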
The geographical scale effect on intra-specific divergence is based on the well-known concept of genetic isolation by distance (Wright 1943), but the relationship between genetic divergence and distance may differ geographically. Taking Europe as an example, studies with different types of organisms have demonstrated that, far from being homogeneously distributed, genetic diversity is concentrated in certain areas of the continent (Hewitt 1996; Avise 2000; Schmitt 2007). Thus, for a given spatial distance between the sampling sites of two DNA barcodes, the genetic distance could be higher if at least one of them comes from a genetic diversity hotspot.
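The comparison implied here, genetic distance against spatial distance with pairs flagged when at least one sample comes from a hotspot, can be sketched as follows. This is an illustration under stated assumptions, not the analysis performed in this study: great-circle distances use the haversine formula with a mean Earth radius of 6371 km, genetic distances are uncorrected p-distances, and the record fields (`lat`, `lon`, `region`, `seq`), the region labels and the toy data are hypothetical.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against rounding pushing the argument of asin above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_table(samples, hotspot_regions):
    """One row per conspecific sample pair: spatial distance, genetic distance, hotspot flag."""
    rows = []
    for s1, s2 in combinations(samples, 2):
        rows.append({
            "geo_km": haversine_km(s1["lat"], s1["lon"], s2["lat"], s2["lon"]),
            "gen_dist": p_distance(s1["seq"], s2["seq"]),
            # True if at least one sample of the pair comes from a hotspot region.
            "hotspot": s1["region"] in hotspot_regions or s2["region"] in hotspot_regions,
        })
    return rows


# Toy records for a single species; coordinates are approximate city locations.
samples = [
    {"lat": 40.4, "lon": -3.7, "region": "Iberia", "seq": "ACGTACGTTT"},
    {"lat": 60.2, "lon": 24.9, "region": "Finland", "seq": "ACGTACGTAC"},
    {"lat": 52.5, "lon": 13.4, "region": "Germany", "seq": "ACGTACGTAC"},
]
for row in pairwise_table(samples, {"Iberia", "Italy", "Balkans"}):
    print(round(row["geo_km"]), row["gen_dist"], row["hotspot"])
```

Rows of this kind could then be passed to a regression of genetic distance on spatial distance with the hotspot flag as a covariate; the choice of model is left open here.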
In Europe, apart from taxon-specific projects, the large national barcoding initiatives are all in northern and central Europe, e.g. Germany (Gemeinholzer et al. 2011), the Netherlands (Beentjes, Speksnijder, Van der Hoorn & Van Tol 2015), Norway (Ekrem et al. 2015) and Finland (Huemer, Mutanen, Sefc & Hebert 2014; Pentinsaari, Hebert & Mutanen 2014), far from the southern peninsulas (Iberia, Italy and the Balkans) that host higher levels of biodiversity, endemism and genetic diversity (Hewitt 1996; Murienne & Giribet 2009; Pinto, Muñoz, Chávez-Galarza & De la Rua 2012; Geiger et al. 2014). In fact, the few smaller-scale barcoding initiatives carried out in southern Europe for specific groups (such as butterflies in the Iberian Peninsula or freshwater fish around the Mediterranean Basin) have revealed high genetic richness and distinctiveness and the existence of a number of potential cryptic species (Geiger et al. 2014; Dincă et al. 2015).
In this study, we analysed the geographical scale effect on intra-specific genetic distance using as a study model a group of heteroceran Lepidoptera (i.e. moths) whose caterpillars feed on oak (Quercus spp.) leaves. These moth species are widely distributed over most parts of Europe (Camus 1936-1954). We could thus download a large number of DNA barcodes from the public repository BOLD and pool them in the analyses with newly sequenced Iberian samples. Previous reports on Lepidoptera have shown little intra-specific divergence at a large geographical scale between Central and Northern Europe (Huemer et al. 2014). Here, we included individuals from the south of the continent to assess the effect of genetic diversity hotspots on intra-specific genetic distance and identification success. Our specific objectives were:
  1. To analyse the extent to which the availability of DNA barcodes is biased towards central and northern Europe.
  2. To determine whether, in pairwise sequence comparisons, the genetic divergence for any given spatial distance is higher if at least one of the sequences comes from a southern European peninsula.
  3. To reconstruct a COI gene tree to assess the geographical distribution of genetic diversity across the continent and the presence of monophyletic clades (distinct intra-specific lineages and potential cryptic species) exclusive to southern Europe.