Materials and Methods
Plant material
For this study, we used the materials collected from 78 populations ofCeratonia siliqua across the Mediterranean basin as described in Viruel et al. (2020). Based on field observations, the habitat of each population was recorded as natural, semi-natural or cultivated. Habitats where human impact is low and there is no evidence of recent land use were recorded as ‘natural’. ‘Semi-natural habitats’ lack cultivation but show the clear presence of human activities such as pasture, old farming or sometimes scattered almond, carob or olive trees from which fruits are occasionally harvested. Habitats containing carob trees in specialized or consociated orchards (with cereals or other fruit trees) ―even if abandoned―, were recorded as ‘cultivated’.
Preliminary delimitation of Ceratonia siliqua units using microsatellite data
We used microsatellite data for 17 loci and 56 localities across the Mediterranean basin as described in Viruel et al. (2020). We selected localities with at least ten carob genotypes per population (totalling 1020 trees). We also scored 15 single nucleotide polymorphisms present in the flanking regions of the 17 SSR loci following Viruel et al. (2018). We calculated genetic differentiation between localities using D index (Winter 2012, Mmod R package) and converted it into Euclidean genetic distances based on the coordinates of a NMDS (Non-metric Multidimensional Scaling; vegan R package). The Euclidean genetic distances and Euclidean geographical distances were then processed by ClustGeo (Chavent et al. 2018; clustgeo R package), which relies on Ward’s method to minimize intra-group variance for both geographical and genetic distances. K and 𝛼 are the main parameters. K is the number of clusters and 𝛼 is a mixing parameter determining the weight of the spatial constraint. Several values of 𝛼 were tested to optimize the spatial contiguity of populations without deteriorating genetic differentiation structure. Several values of K were also tested. This clustering method provided genetically homogeneous groups of spatially adjacent carob populations, named “CEUs” (Carob Evolutionary Units), which were used in subsequent analyses. A neighbor-joining tree based on pairwise genetic differentiation (Gst, Mmod package) among the seven CEUs was built to display the overall differentiation structure.
RADseq methodology and sequencing
CEUs were used to select 376 samples representative of the genetic diversity and structure of C. siliqua for RADseq. Eight samples of the only other species in the genus, C. oreothauma, were added. Genomic library preparation and sequencing were conducted by Microsynth ecogenics GmbH (Blagach, Switzerland). DNA samples (200-400 ng input) were digested with the restriction enzymes EcoRI/MSeI following heat inactivation according to the manufacturer’s protocol (New England Biolabs, NEB). Fragments between 500 and 600 bp were selected by automated gel cut, Illumina Y-shaped adaptors were ligated, and ligation products were bead purified. Each library was then individually barcoded by PCR using a dual-indexing strategy. Individually barcoded libraries were pooled and subsequently purified before sequencing on an Illumina NextSeq platform (300 million of 75 bp reads per run).
Bioinformatic pipeline to extract and filter SNPs from RADseq data
The bioinformatic approach used in this study is summarized in Fig. 1. FASTQC reports (multiQC, Ewels et al., 2016) were used to exclude 26C. siliqua and 4 C. oreothauma samples due to low sequencing coverage. The remaining 354 samples (350 C. siliquaand 4 C. oreothauma ) had an average of 3 million raw reads, ranging between 0.7 and 15 million. Assembly (Fig. 1) was performed using ipyrad (Eaton and Overcast 2020) in a high-performance computing cluster (HPC pytheas). Thereafter, a “locus” is a RADseq marker of 65 bp, that resulted from the ipyrad workflow, and a “SNP” is a polymorphic position of a specific locus, considering that a locus can contain several SNPs. We conducted an assembly limited to C. siliqua following two steps. First, we selected 36 samples representative of the diversity in C. siliqua with high sequencing coverage to conduct four de novo assemblies with varying thresholds of clustering reads (clust_threshold ) between 0.9 and 0.96, the minimum number of samples per locus (min_sample_locus ) fixed to 30, the minimum depth (mindepth_statistical ) for base calling fixed to 8, the minimum size of reads fixed to 50 and the first 5 bp of both ends of locus trimmed. The maximum number of indels and the maximum percentage of SNPs per locus were set to 1 and 10%, respectively. Default values were used for all other parameters. Considering the number of loci, heterozygosity, and error rates, we estimated an optimal clustering threshold of 0.94 (see Tab. S1 supplementary material). A reference file was generated by extracting the first sequence of each locus with pyrad2fasta script (available on https://github.com/pimbongaerts). Second, this file was used for a subsequent reference assembly for the 350 samples with the same parameters as previously except for the minimum number of samples per loci, which was fixed to 45. The dataset constructed using this reference assembly contained more than 64% missing data. The vcf file obtained from ipyrad was then processed to run a principal component analysis (PCA) using the glPca function of adegenet R package, and a neighbor-joining tree, based on pairwise genetic differentiation (Gst, Mmod R package) among the seven CEUs, to check that the structure from RAD markers was congruent with previous analysis based on microsatellite markers.
We used Matrix condenser software (de Medeiros and Farrell, 2018; de Medeiros 2019) to visualize the samples causing this high missing data rate. We filtered samples to optimize locus coverage and to keep the sampling equilibrium among CEUs, and reconducted the reference assembly with ipyrad on a new set of 190 samples. The data was then reduced to one SNP by locus, based on two criteria: SNP having the maximal minimum allelic frequency per locus and only keeping SNPs with allelic frequency above 1.05 % (i.e., rarest allele present in at least two individuals). To examine the carob genetic diversity based on neutral processes, such as genetic drift and gene flow, we reduced the effect of outlier loci (i.e. having unexpectedly high differentiation among populations, which could have a great effect on the variance among populations). We used Outflank software (Whitlock & Lotterhos, 2015), which produces a lower false-positive rate compared to other methods (e.g. Silliman 2019), considering CEUs as populations. The false discovery rate (qvalue) was fixed to 0.05. BLAST searches using the nucleotide NCBI database limited to flowering plant sequences were conducted for outliers, which were mostly assigned to plastid (pDNA) or mitochondrial (mtDNA) genomes. Nuclear sequences were then discarded whereas pDNA and mtDNA sequences were used as a reference in a new ipyrad assembly, with adapted parameters for haploid data. The sequences of these new sets of markers were concatenated to construct separate alignments of mitochondrial and plastid haplotypes for the 190 carob trees.
Hereafter, different analyses were conducted on three datasets: the genome-wide nuclear data (GWN), was used for the analysis of genetic differentiation, population divergence history and gene flow, whereas the mitochondrial and plastid-based haplotypes datasets were used in phylogenetic reconstructions (MH and PH datasets respectively).
Raw and filtered vcf files, as well as custom R scripts are available in ####: ####.
Population genomic analysis
The GWN dataset was converted from genind to genlight, vcf and treemix data formats using dartR (Gruber et al., 2018) and radiator (Gosselin 2020) R packages. Population structure analysis was performed using snmf (R LEA package, Frichot and François 2015) which estimates admixture coefficients from the genotypic matrix assuming K ancestral populations. Snmf runs fast even on large datasets and without loss of accuracy compared to other Bayesian modelling software such as ADMIXTURE (Alexander et al. 2009; Frichot et al. 2014). Snmf was run for K = 2 to 15, 100 repetitions, regularization parameter of 250 and 25% of the genotypes were masked to compute the cross-entropy criterion. Barplots showing ancestry coefficients were obtained with the compoplot function in adegenet R package (Jombart and Ahmed 2011), with genotypes sorted according to CEUs. To visualize the genetic diversity structure, we performed a PCA using the glPca function of adegenet R package. Pairwise differentiation among CEUs (Gst) were also computed with the Mmod package. The R scripts of these analyses are available in #### web site: ####.
A new reference assembly was performed to add C. oreothaumasamples to the GWN dataset and obtain one including the two species. As previously, to obtain unlinked SNPs, we retained one SNP per loci, choosing the one with the highest frequency. Using PAUP*4.0a (Swofford, 2018), we built a coalescent-based tree with the SDV quartet method (Chifman & Kubatko 2014). Ten million randomly selected quartets were analyzed and node support was assessed by 1,000 bootstrap replicates. We used the ‘distribute’ option for heterozygous sites.
The GWN dataset without C. oreothauma genotypes was used to perform a Treemix analysis (Pickrell and Pritchard 2012) to infer population splits and mixtures across the evolutionary history of the carob tree. Treemix builds a backbone tree based on population allelic frequency without gene flow and then adds reticulate branches, representing gene flow, aiming at improving the fit of the data. For this analysis, we used the CEUs as populations. One of the CEUs was fixed as an outgroup according to the results obtained in the phylogenetic analysis (see SDV quartet method in Results). As previously, only unlinked SNPs were used.
Assemblies based on MH and PH sequences failed to include C. oreothauma . The MH and PH matrices were analyzed with IQTREE with the default parameters. The F81+F+I and TN+F+I+G4 models were retained for the MH and PH matrices, respectively.
Footprints of domestication were investigated by estimating genetic diversity and/or the presence of candidate loci under selection (outliers) for the CEUs. A stronger effect of domestication is expected in cultivated habitats compared to natural habitats and also in the eastern CEUs compared to the western ones. Therefore, the analyses were organized according to combinations of these two factors. Genetic diversity indexes (Hobs, Hexp, f and rbardD) were estimated using hierfstats and poppr R packages (Goudet 2005; Kamvar et al. 2014). Outliers were searched with the Outflank method with a FDR threshold of 0.05 as described above when building the GWN dataset considering combinations of habitats and CEUs as populations.