Material and Methods
Study sites, sample, and physicochemical data
collection
Our twelve study sites are located at the junction of the Ottawa river
and the St. Lawrence River near Montreal, QC, Canada (Fig. 1). The
Ottawa River water is calcium-poor (10-15 mg/L calcium), and the St.
Lawrence River water is comparatively calcium-rich (30-40 mg/L) due to
the different geological characteristics of their watersheds. These
water masses mix at the junction of two major river systems at Lake
Saint Louis, a widening of the St. Lawrence River, but the calcium
gradient persists in the north and south shores, and water masses are
distinct (Vis et al., 1998). In 2005, round gobies invaded the upper St.
Lawrence River and the southern shore of Lake Saint-Louis, but not the
calcium-poor Ottawa River nor calcium-poor sites on the north shore of
Lake Saint-Louis (Kipp & Ricciardi, 2012).
Twelve Amnicola populations were sampled from the study sites in
this fluvial ecosystem (Fig. 1), with three populations fully in the
Ottawa River, three fully in the St. Lawrence River, and six populations
in the Lake St-Louis, including three on the north shore and three on
the south shore. We coded populations collected in the Ottawa River
water as LCGA (low calcium- water and gobies absent) and populations
from the St. Lawrence River water as HCGP (high calcium water and gobies
present). Two populations had inverted patterns: RAF is calcium-poor,
but gobies are present (LCGP), and PDC is calcium-rich, but gobies are
absent (HCGA). It is noteworthy that PDC is located in a refuge habitat
(wetland; Astorg et al., 2021) but close to invaded sites, and thus
might receive strong gene flow from nearby invaded populations.
Field-collected Amnicola snails were obtained near the shore via
hand picking and brought back to the lab for further processing (DNA
extractions and the common garden experiment) in June-October 2017. Goby
abundance was measured in the field between July and September 2017 on a
single occasion at each site. For this, each site was sampled using
three seine net passes, with intermission periods between seining times.
The seine net used for sampling nearshore habitats was 30 feet long by 6
feet deep and 1/8 mesh on a 10 m distance. Round gobies were placed into
bins and released after the three hauls. The geographic location and
environmental characteristics of our sampling sites are detailed in
Table S1. We measured dissolved oxygen (DO; mgL-1),
pH, water temperature (oC), and conductivity
(µS.cm-2) using a Professional Plus Model YSI
multi-parameter probe (model 10102030; Yellow Springs Inc.) at each
study site in 2017 at the time of gastropod collection. On the same
occasions, we collected water samples and analyzed them for calcium
(Ca), total phosphorus (TP), total nitrogen (TN), as well as dissolved
organic carbon (DOC) at the GRIL-UQAM analytical lab (Supplementary
Methods). Site-specific invasion status by round goby (invaded /
uninvaded) is defined by presence/absence (Table S1).
De novo genome assembly and
pool-sequencing
For the de novo genome
assembly, we extracted DNA from the tissue of one individual snail
collected in 2017 using a standard Phenol Chloroform extraction method,
after removing the shell and excising the mollusk guts to avoid
contaminants. Briefly, tissue samples were placed in a digestion buffer
containing proteinase K and digested at 55°C. DNA was then isolated
using an isoamyl-phenol-chloroform solution, followed by ethanol
precipitation. DNA quantity and quality were verified using a
combination of different quality control methods: Qubit assay (Thermo
Fisher Scientific Inc.), Tapestation (Agilent Inc.), and Femto Pulse
(Agilent Inc,). Fragments longer than 1 kb were selected for further
processing. Library preparation was performed using 10X Chromium
Linked-Read library kit (10X Genomics Inc.) and sequenced on 3 lanes of
Illumina HiSeqX PE150 at Genome Quebec. We ran fastp v.0.23.4 on the
three 10X paired-end reads to obtain the insert size, using the -Q
option to disable quality filtering. Fastp results showed two estimated
insert size peaks at 175 and 270 bp. Reads were assembled with Supernova
v.2.1.1. The assembled genome is 1,899,346,312 bp in length, with
815,134 scaffolds and a N50 of approximately 5kb. We estimated the
genome size with Jellyfish 2.3.0 by reading simultaneously the R1 reads
of the three runs of 10X sequencing using the options -F 3, -m 21, and
-s 2G. The resulting histogram was then processed with GenomeScope
http://qb.cshl.edu/genomescope/ (Vurture et al., 2017), which yielded an
estimated haploid genome length of 382,882,063 bp, with 2.25% of
repeats and 4.72% of heterozygosity. This is much smaller than the
assembled genome size (1,899,346,312 bp) due to fragmentation. We also
used Benchmarking Universal Single-Copy Orthologs BUSCO v5.2.2 (Manni et
al., 2021) to assess gene completeness by searching for core mollusc
orthologous genes, using the option -genome and the BUSCO.v4 lineage
mollusca_odb10.2019-11-20. Most core genes were missing from our draft
genome (74.7%), with only 19% complete core genes recovered
(C:19%[S:18.1%, D:0.9%], F: 6.3%, M:74.7%, n=5295, with C:
complete single copy BUSCO genes, S: complete and single-copy BUSCOs, D:
complete and duplicated BUSCOs, F: fragmented BUSCOs, M: missing BUSCOs,
n: total BUSCOs searched).
For the pooled sequencing, we extracted DNA from the tissues of 40
individuals per pool/population using the same standard Phenol
Chloroform extraction method mentioned above. We quantified all samples
using a Picogreen ds DNA assay (Thermo Fisher Scientific Inc.) on an
Infinite 200 Nanoquant (Tecan Group Ltd). Samples were normalized to a
dsDNA concentration of 15ng/µL, re-quantified, and pooled according to
the sampling population. Thus, we created 12 pools of 40 individuals
each at 15ng/µL. Libraries were prepared with the NEB Ultra II kit for
shotgun sequencing and sequenced on 5 lanes of HiSeq2500 125 bp
pair-ended at Genome Quebec. The number of reads sequenced per
population was between 187-248 million paired-end reads. We used the
following formula to calculate the expected coverage: read length
number of reads / estimated haploid genome length. Given an estimated
genome size of 382,882,063 bp, a read length of 125bp, and 93.7-124.2
million single-end reads sequenced, we calculated that our expected
coverage was between 30-40X. We assessed the quality of our pool-seq
illumina libraries with fastqc 0.11.5, from which we obtained a
percentage of repeats between 18.4 and 39.7%.
Read processing and SNPs
calling
We prepared the assembled reference genome of Amnicola limosus by
first indexing it with the Burrows-Wheeler Aligner ( BWA; Li & Durbin,
2009) v0.7.17 and with Samtools faidx v1.12, and by creating a
dictionary with Picard Tools v2.23.3. We then used a custom pipeline for
pool-seq quality processing, read alignment, and SNP discovery. We first
trimmed reads with the function trim-fastq.pl from popoolation v1.2.2
(Kofler, Orozco-terWengel, et al., 2011) for a base quality of 20 and a
minimum length of 50 bp, and assessed the quality of the trimmed reads
with fastqc. We aligned trimmed reads to the reference genome with
bwa-mem v0.7.17. We filtered out ambiguously aligned reads with samtools
v1.13 using a score of 20 and sorted bam files with samtools. We used
samtools flagstat to find the percentage of Illumina reads aligned to
the reference genome, which was on average 53.5% SD 4.0%. We obtained
an mpileup file with samtools mpileup, then filtered SNPs with a minimum
global coverage of 5. We converted the mpileup file to a sync file with
Popoolation2 v1.10.03 (Kofler, Pandey, et al., 2011), with a quality
score of 20. The sync file was then converted to a ”pooldata” object
with the poolfstat package in R (Hivert et al., 2018), using a haploid
pool size of 80 for all populations, a minimum read count per base of
two, a minimum coverage of five and a maximum of 300, a minimal minor
allele frequency of 0.0125 (to remove singletons) and discarding indels.
This pipeline retained 21,312,700 biallelic SNPs.
Detecting genomic signatures of
selection
To detect putative loci under selection, we used both outlier and
environmental association analyses approaches. We conducted the outlier
analysis using the core model from hierarchical Bayesian models
implemented in Baypass, using default parameters (Gautier, 2015).
Baypass is advantageous in the context of our study (potential
bottlenecks in invaded populations) as it enables the detection of
outlier SNPs after taking demographic history into account, thus
avoiding the confounding effect of demography. The core model estimates
the scaled covariance matrix Ω of population allele frequencies, which
summarizes population history and is then explicitly accounted for
through Ω. The full dataset was divided into 27 pseudo-independent
datasets to overcome computing limitations. The ”pooldata” object from
poolfstat was converted to the 27 sub-dataset Baypass input files with
the ”thinning” subsampling method and sub-sample size of 750,000 SNPs.
We used the core model to estimate the XtX statistic and associated
p-value under a χ2 distribution with 12 degrees of
freedom (bilateral test, Baypass manual). We considered SNPs as outliers
when their p-value derived from the XtX estimator was < 0.001.
The shape of the histogram p-values derived from the XtX statistics
confirmed that they were well-behaved (A peak close to 0 for loci
putatively under selection and a uniform distribution between [0,1]
for neutral loci; Fig. S1B; François et al., 2016). The Ω matrices from
the 27 sub-datasets were compared visually to assess the concordance of
the results, and then the statistics obtained for each SNP were
combined.
For the environmental association analysis, we opted for the standard
model STD under the Importance Sampling approach in Baypass, in which
the association between covariables and SNP allele frequencies is
assessed independently. This model computes for each SNP its regression
coefficient βik of the association between the SNP
allele frequencies and a covariable to compare the model with
association (βik ≠ 0) against the null model
(βik = 0), from which a Bayes factor
BFis is derived. We selected two environmental
covariables: invasion status (presence/absence of the gobies) and
calcium concentration. We also estimated the
C2-statistic with the STD model (Olazcuaga et al.,
2021), which is more appropriate for binary variables and was used for
the association with goby presence/absence. We checked the Pearson
correlation coefficient between covariables with the function
pairs.panel() in the package psych in R, which was r = 0.71 (slightly
above the recommended threshold for the regression method of
|r| < 0.7, Fig. S2).\(\hat{}\)\(\hat{}\)For the calcium covariable, we ran three
independent runs of the STD model with the -seed option to ensure
consistency of the MCMC results, then computed the median
BFIS across runs. To check for convergence of the
independent runs, we calculated the Forstner and Moonen distance (FMD;
Förstner & Moonen, 2003) between Ω matrices from each sub-dataset with
the fmd.dist function in R (included in Baypass). Results were found to
be consistent, with all FMD values < 0.12. Covariables were
all standardized to \(\hat{}\) = 0 and \(\hat{}\) = 1. For the
calcium association, SNPs were considered significantly associated with
a covariable when BFis > 20 dB . For the
association with goby presence/absence, we used the R package qvalue to
correct for multiple hypothesis testing on the p-values derived from the
C2-statistic and applied a False Discovery Rate of α =
0.01 as a q-values cut-off for outlier detection.
As a complementary analysis to investigate the potential for adaptation
to the invasive predator and low calcium concentrations, we identified
outlier SNPs showing consistent allele frequency differences between
environment types using poolFreqDiff (Wiberg et al., 2017). Note that
with this approach, our aim was not to identify independent instances of
parallel adaptation but rather detect genotype-environment associations;
consistent allele frequency differences could arise due to adaptation
occurring in a shared recent ancestor. This method relies on modeling
allele frequencies with a generalized linear model (GLM) and a
quasibinomial error distribution, which should result in a uniform
distribution of p-values between [0,1] under the neutral (null)
scenario (Wiberg et al., 2017). It also accounts for bias in allele
frequency estimation (e.g., Gautier et al., 2013) by rescaling allele
counts to an effective sample size neff (Feder et al.,
2012). We ran the analysis separately for the two covariables as binary
comparisons: invasion status (presence/absence) and calcium
concentration (low < 24.3 mg/L, high > 34.3
mg/L). For the minimum read count per base, and the minimum and maximum
coverage, we used the same values as for the poolfstat filtering, and we
also rescaled the allele counts with neff and added one
to zero count cells. To account for demography and genetic structure, we
applied the empirical-null hypothesis approach (François et al., 2016)
to recalibrate p-values based on a genomic inflation factor of λ = 0.85.
We confirmed that recalibrated p-values were well-behaved based on the
observed peak close to 0 and the uniform distribution between [0,1]
(Fig. S3; François et al., 2016). We then transformed the recalibrated
p-values into q-values with the R package qvalue, and defined outliers
if their q-value was below the FDR α = 0.01.
Reciprocal transplant
experiment
We conducted a laboratory reciprocal transplant experiment at UQAM with
field-collected F0-generation A. limosus to
investigate the response of gastropods with different source population
habitat types (low calcium/uninvaded Ottawa River or high
calcium/invaded St. Lawrence River) to home and transplant water
(calcium-rich water from the St. Lawrence River or calcium-poor water
from the Ottawa River), in the presence or absence of goby cues. The
goby cue treatment was used to test for predator effects on life history
fitness components (survival and fecundity). Amnicola snails that were
involved in the experiment were mostly at adult or sub-adult stages as
we selected the largest individuals collected in the field and the dates
of collection correspond to the presence of adult cohorts in the field
(Pinel-Alloul & Magnin, 1973). Two additional water treatments were
also tested: the artificial freshwater medium COMBO , with and without
the addition of calcium, to test for the specific effect of calcium (Ca)
concentration on fitness components. The overall design was therefore a
two (origin water: St. Lawrence River SL versus Ottawa River OR)\(\times\) four (treatment water from St. Lawrence versus Ottawa River,
growth media with/without Ca) \(\times\) two (presence versus absence of
round goby cue) factorial experiment, with 12 replicates (corresponding
to our sampling populations) per treatment combination.
We raised wild F0 individuals from the 12 populations in
the laboratory for up to 73 days. Between 15 and 22 individuals
(average: 19.6\(\pm\)1.3) were initially placed in 250 ml plastic cups
with river water and reared in growth chambers (Thermo Scientific
Precision Model 818) at 18°C with a light:dark cycle of 12:12 hours. We
fed Amnicola snails ad libitum with defrosted spinach every 2-3
days if needed or at each water change. Water in the water treatments
was changed, and old spinach was removed every 3-4 days. For the goby
cues treatment, gobies were kept in a 50-liter aquarium for two weeks
prior to the experiment, set in a growth chamber at 18°C with a 12:12h
light. Gobies were fed 3-4 times a week with flake fish food (TetraFin).
The goby cue treatment was added as 5 mL of water from the goby aquarium
per Amnicola culture at each water change (every 2-3 days), which
represents 2% of the volume of the culture. The addition of water was
done manually with a 30 mL syringe. We recorded survival and fecundity
(total number of eggs produced per individual) as response variables
every 19 \(\pm\) 13 days throughout the experiment, using
high-resolution stereomicroscopes (Olympus). However, due to the very
low survival for all populations for the treatment testing the effects
of calcium in growth media, we removed this comparison from further
analyses (see Fig. S4).
We analyzed fecundity (total number of eggs produced) and survival rates
with a generalized linear model (GLM) and a generalized linear mixed
effect model (GLMM) using the lme4 package in R respectively. We modeled
fecundity with a negative binomial distribution while survival was
modeled with a binomial distribution and a logit link function. We
checked the models for overdispersion using the overdisp_fun function
from https://bbolker.github.io. We tested both models with and
without the random effect of populations, using an AIC approach
corrected for small sample size (AICc) and the ΔAIC criterion to
evaluate the random effects (kept when ΔAIC > 2) with the R
package bbmle. Likelihood ratio tests were used to evaluate the fixed
effects for both the GLM and GLMM models. For the GLM model of
fecundity, we checked for the influence of outliers on the model, by
using both visual and quantitative diagnostics of the leverage and
Cook’s distance. We did not find a consistent effect of outliers on this
model and thus did not remove outliers. Fixed effect coefficients and
their confidence intervals were converted to incident rate ratios
(fecundity) and odd ratios (survival) using an exponential function.
Population structure, genetic diversity, and
demography
We first estimated population structure with the core model from
Baypass, with the scaled covariance matrix Ω of population allele
frequencies summarizing some aspects of population history. We also
obtained a genome-wide pairwise FST matrix from the
poolfstat package, using the same parameters as described above. We used
the pairwise FST matrix to assess the potential for
isolation by distance, using the relationship between the genetic
distance (FST/(1- FST); Rousset, 1997)
and the log of the geographical distance (2D distribution of
populations) with a Mantel test (9999 permutations) using the vegan
package in R. Distances between sites were obtained by measuring paths
between populations along the rivers (in m) with Google Earth
v.10.38.0.0. We also tested for isolation by environment, by first
calculating the environmental distance between population pairs using
the squared Mahalanobis distance, calculated from the calcium
concentration and goby presence/absence with the R package ecodist. We
verified that there was no correlation between the environmental
distance and geographic distance (non-significant Mantel test with 9999
permutations: r2 = 0.03, p-value = 0.20). Then we
tested for a relationship between environmental distance and genetic
distance as FST/(1-FST) with a Mantel
test (9999 permutations).
We obtained the observed heterozygosity from the poolfstat package and
compared heterozygosity levels between habitat types with a t-test after
checking for the assumptions of normality (qqplot) and homoscedasticity
(Bartlett test). We calculated genome-wide diversity indices (Tajima’s
pi, Watterson theta, and Tajima’s D) using popoolation (Kofler,
Orozco-terWengel, et al., 2011). First, we generated mpileup files for
each population separately with samtools (Li et al., 2009) from the
sorted.bam files output by the custom pipeline. Then we computed the
genome-wide diversity indices using non-overlapping windows of 100kb, a
minimum coverage of 20 (as recommended in Kofler, Orozco-terWengel, et
al., 2011 except for Tajima’s D with minimum coverage = 13, as the
corrected estimator requires the pool size < 3 minimum
coverage), a minimum quality of 20, a minimum fraction covered of 0.05
and a pool size of 40. It should be noted that popoolation calculates
the diversity indices along chromosomes; thus, due to the fragmentation
of our draft genome, the diversity indices were calculated mostly among
separate contigs and in windows < 100kb. The minimum number of
SNPs per window across populations using this filter was 25. We used
Hedge’s G to detect a potential difference in the three diversity
indices between the St. Lawrence and Ottawa rivers.
We also investigated the demographic history of three population pairs
using the diffusion approximation method implemented in δaδi (Gutenkunst
et al., 2009). We aimed to detect a potential bottleneck in the invaded
populations and to quantify the magnitude and direction of gene flow
between the two habitat types (invaded and refuge). We selected the
populations PB-LCGA, IPE-LCGA, and PDC-HCGA as refuges, paired with
PG-HCGP, BEA-HCGP, and GOY-HCGP as invaded populations respectively.
PDC-HCGA was of particular interest as a high calcium population located
in an uninvaded wetland (refuge). Note that the limited number of
population pairs investigated is due to the large computation time
required to analyze the various models considered. Our most complex
model (Fig. S8) has defined effective population sizes after the split
(nu1 and nu2), followed by a bottleneck in both populations (modeling a
scenario in which the goby invasion impacted population abundance at the
whole ecosystem scale) followed by exponential recovery in both
populations. TS is the scaled time between the split and
the bottleneck and TB is the scaled time between the
bottleneck and present. Migration rates are asymmetric but constant
through time after the split, with mIR the migration
from refuge to invaded populations and mRI in the
opposite direction. As we knew the time of the potential bottleneck (12
years before sampling with one generation per year), we set
TB as a fixed parameter. As we set TB as
a fixed parameter, the parameter \(\theta=4\mu L\) was an explicit
parameter in the models that included a bottleneck. We defined θwith μ the mutation rate of \(7.6{\times 10}^{-9}\)substitutions per site per year from the Caenograstropoda speciesNucella lamellose , and L the effective sequenced length,
calculated as\(L\ \approx\ total\ length\ of\ sequence\ analyzed\ \times\ SNPs\ retained\ for\ use\ in\ dadi/total\ SNPs\ in\ analyzed\ sequence\).
We investigated six additional non-nested models: a) bottleneck and
growth only in the invaded population (constant Ne for
the refuge population) with uneven migration, b) only bottlenecks in
both populations without recovery and uneven migration, c) only
bottleneck in the invaded population with uneven migration (constant
Ne for the refuge population), d) a simple population
split at TS with uneven migration, e) a population split
with symmetric migration and f) a population split without migration.
The default local optimizer was used on the log of parameters with
random perturbation of the parameters to obtain a set of parameter
values resulting in the highest composite likelihood. Optimization was
conducted repeatedly until convergence was reached (i.e., three
optimization runs with log-likelihood within 1% of the best
likelihood). Only one model in one population pair did not reach
convergence after 30 optimization runs (bottleneck without recovery for
the PDC-HCGA and GOY-HCGP population pair). Finally, we compared our
seven models based on the differences in the likelihoods and plots of
residuals of the models. As we obtained unlikely results during the
conversion of parameters in our best models, possibly due to imprecise
mutation rates, we did not conduct parameter conversion. To obtain
uncertainties on our parameters while accounting for the effect of
linkage, we used bootstrapping and the Godambe Information Matrix
approach (Coffman et al., 2016). For this, we generated 100 bootstrapped
datasets with a chunk size of
To address the potential effect of using a pool-seq approach on the
variance in allele frequency estimates stemming from differences in
coverage between pools (Gautier et al., 2013), we used a filter to
obtain relatively homogenous coverage between our two selected
populations/pools. From the initial SNPs dataset output by poolfstat
(21,312,700 SNPs), we retained SNPs that fell within the
1st and 3rd quartiles of coverage in
both populations (11-19X for PB-LCGA and 10-18X for PG-HCGP; 11-18X for
IPE-LCGA and 9-15X for BEA-HCGP; 10-16X for PDC-HCGA and 10-17X for
GOY-HCGP). We also filtered out SNPs that were detected as outliers
(putatively under selection) in the Baypass (core and aux or STD models)
and poolFreqDiff analyses and removed uninformative SNPs (fixed or lost
in both populations). To accommodate for large computation time during
the optimization, the datasets were thinned at random to retain a final
dataset of ≈ 1 million SNPs per population pair. We used a custom script
and the dadi_input_pools function from the genomalicious R package
(Thia & Riginos, 2019) with the “probs” parameter in the methodSFS
option to transform allele frequency data into the SNP data format from
δaδi. We used δaδi to infer the folded SFS as we did not have
information on the ancestral allele state. Due to low confidence in the
low-frequency estimates, we also masked entries from 0 to 5
reads