Training Data
Data required for ESPM include training and predictor datasets. Training (sometimes called test or response) data (Figure 1B) are derived from ecosystem field surveys, as described above. These data must undergo preparation, refinement, and partitioning similar to that recommended for species-level (see Guisan et al 2017) and community-level modelling (see Ovaskainen and Abrego 2020, Mokany et al 2022), as applicable. A notable distinction is that ecosystem training data include both biotic and abiotic variables. Differences between these two kinds of survey variables introduce incompatible data structures into the training pool, which can affect predictive model outputs. We provide a brief overview of these issues and suggest options to address them. Last, we introduce measures to finalize the training data selected for model fitting.
Different measurement scales and units are typically employed for quantifying biotic and abiotic phenomena in field surveys. In our study, vascular plant, bryophyte, and lichen taxa were measured by estimating areal cover by individual species. These estimates were expressed as a percent of total cover for individual vertical vegetation strata (e.g., canopy tree, woody shrub, tree sapling and seedling, herbaceous, lichen, and bryophyte). In contrast, abiotic attributes were recorded using a variety of measurement scales and units, depending on the attribute. These abiotic records include measures in linear (e.g., humus depth in centimeters), categorical (e.g., landform type), concentration (e.g., milligrams/kilogram of calcium in soil), volume (e.g., above-ground woody biomass in cubic meters), ratio (e.g., percent slope), or logarithmic (e.g., pH) units. Pooling ecological variables with disparate measurement scales, units, and limits misrepresents their statistical relationships to one another, and introduces non-interoperable dimensions into a training dataset (König et al 2019). One possible means to alleviate this problem is to convert all variables to presence/absence records. This method was particularly effective for the types of categorical variables we have in our data (e.g., species identity and abiotic attributes such as humus, soil, and bedrock type) but less appropriate for continuous measures, particularly those which were present at all survey locations (e.g., vegetation height). To address this latter problem, ubiquitous variables can be eliminated from the data pool, a procedure applied to biotic training data in community-level modelling (e.g., Ovaskainen and Abrego 2020). Alternatively, some continuous data (e.g., pH) could be converted to categorical classes, employing empirically supported class intervals.
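The three options described above (presence/absence conversion, removal of ubiquitous variables, and binning of a continuous measure) can be sketched as follows. This is a minimal illustration and not the procedure used in our case study: the plot records, variable names, and pH class breaks are hypothetical placeholders, and real class intervals would need empirical support.

```python
# Hypothetical plot records: percent cover for two taxa, one ubiquitous
# continuous measure (vegetation height), and one logarithmic measure (pH).
plots = [
    {"species_a": 40.0, "species_b": 0.0, "veg_height_m": 8.0, "soil_pH": 4.2},
    {"species_a": 0.0, "species_b": 15.0, "veg_height_m": 0.6, "soil_pH": 5.8},
    {"species_a": 5.0, "species_b": 30.0, "veg_height_m": 2.5, "soil_pH": 7.1},
]


def bin_ph(value, breaks=(5.5, 7.0),
           labels=("acidic", "circumneutral", "alkaline")):
    """Assign a pH value to a class. The breaks are placeholders only."""
    for upper, label in zip(breaks, labels):
        if value < upper:
            return label
    return labels[-1]


def to_presence_absence(records, binners):
    """Convert numeric records to 0/1 columns.

    Variables listed in `binners` become one indicator column per class;
    every other variable is scored 1 where its value is greater than zero.
    """
    rows = []
    for rec in records:
        row = {}
        for name, value in rec.items():
            if name in binners:
                row[f"{name}={binners[name](value)}"] = 1
            else:
                row[name] = 1 if value > 0 else 0
        rows.append(row)
    # Give every plot the same set of columns, filling missing classes with 0.
    columns = sorted({c for row in rows for c in row})
    return [{c: row.get(c, 0) for c in columns} for row in rows]


def drop_ubiquitous(rows):
    """Remove columns recorded as present at every plot."""
    keep = [c for c in rows[0] if not all(r[c] == 1 for r in rows)]
    return [{c: r[c] for c in keep} for r in rows]


pa = drop_ubiquitous(to_presence_absence(plots, {"soil_pH": bin_ph}))
for row in pa:
    print(row)
```

In this sketch vegetation height is dropped because it is non-zero at every plot, whereas pH survives as three class indicators.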
Given the diverse data types in our case study, we excluded ubiquitous variables and fit a hurdle model (see Ovaskainen and Abrego 2020) with and without abundance data.
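A hurdle approach treats occurrence and abundance conditional on presence as two separate responses. The sketch below shows only the data preparation step, under stated assumptions: the function name and the cover values are hypothetical, and the log transformation of non-zero cover is an illustrative choice rather than a detail reported for our case study.

```python
import math


def hurdle_split(abundance):
    """Split a plots-by-taxa abundance table into two hurdle responses.

    Returns (occurrence, conditional):
      occurrence  - 1 where a taxon was recorded, 0 otherwise
      conditional - log abundance where present, None (missing) where absent
    """
    occurrence = [[1 if v > 0 else 0 for v in row] for row in abundance]
    conditional = [
        [math.log(v) if v > 0 else None for v in row] for row in abundance
    ]
    return occurrence, conditional


# Hypothetical percent-cover values: rows are plots, columns are taxa.
cover = [
    [40.0, 0.0],
    [0.0, 15.0],
    [5.0, 30.0],
]
occurrence, conditional = hurdle_split(cover)
print(occurrence)
print(conditional)
```

Fitting only the occurrence table corresponds to modelling without abundance data; adding the conditional table brings abundance in without letting absences influence the abundance response.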
A second challenge arising in training data compilation is the disproportionate ratio of biotic to abiotic variables recorded in field surveys. Tens or hundreds of species are frequently documented from individual ecosystem plots, whereas the number of abiotic attributes assessed (Figure 1A) is rarely comparable. Our case study plot data comprise abundance records of over 900 vascular plant, lichen, and bryophyte species, coupled with only 20 abiotic variables recorded at the site level and 10 to 30 additional abiotic measurements taken from soil profiles at each survey location (see section S2). Imbalanced numbers of biotic and abiotic survey variables unduly weight biota in training data pools, with implications for the relative influence these variables have on predictions of joint concordance among biotic and abiotic ecosystem constituents. One option to address this imbalance is to aggregate species into groups based on common ecological traits (i.e., life history, morphological, or physiological characteristics of an individual) and to employ the resulting trait groups to represent biota in a test data pool. Incorporating species traits into predictive spatial biodiversity models has been shown to greatly improve model performance (Regos et al 2019). Traits can also be invaluable for understanding species responses to their environment and their ecological roles in communities and ecosystems (Kissling et al 2018). One solution we implemented in our case study was to aggregate species with common traits to the genus level; an example was Sphagnum, a species-rich genus commonly found on moister soils.
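The genus-level aggregation described above can be illustrated with a short sketch. It assumes that taxa are recorded as binomials whose first word is the genus and that covers of congeners can simply be summed; both are simplifying assumptions, and the function name and cover values are hypothetical.

```python
from collections import defaultdict


def aggregate_to_genus(plot_cover, genera):
    """Sum the cover of species belonging to the listed genera.

    Taxa outside `genera` are passed through unchanged. The genus is taken
    as the first word of the taxon name (an assumption about naming).
    """
    out = defaultdict(float)
    for taxon, cover in plot_cover.items():
        genus = taxon.split()[0]
        key = genus if genus in genera else taxon
        out[key] += cover
    return dict(out)


# Hypothetical cover values for a single plot.
plot = {
    "Sphagnum fuscum": 20.0,
    "Sphagnum capillifolium": 15.0,
    "Picea mariana": 40.0,
}
grouped = aggregate_to_genus(plot, genera={"Sphagnum"})
print(grouped)
```

Collapsing a species-rich genus in this way reduces the number of biotic columns while retaining the ecological signal shared by its members.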
The methods outlined above can be applied to different subsets of training data. This approach can help reveal ecosystems at different scales and determine whether separate models, with distinct pools of training data, are required for predicting sub-components of total ecosystem diversity in a study landscape. Variations in training data structure and origin can strongly shape the predictive performance of spatial biodiversity models (Guillera‐Arroita et al 2015, Mod et al 2020). In our case study, pools of test data were scaled across levels of ecological complexity to determine how variations in training data dimensionality affected model performance. For example, differences in growth-form (tree, shrub, herb) dominance, surficial origin (glacial, aeolian, marine, lithic), soil properties (upland, wetland soils), and biogeoclimatic regionality (alpine, boreal, temperate) were employed to parse training data into less complex subsets.
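Parsing training data into less complex subsets amounts to grouping plots by a stratifying attribute. The sketch below is a minimal illustration of that step; the plot identifiers, attribute names, and function name are hypothetical.

```python
from collections import defaultdict


def partition_plots(plots, key):
    """Group plot identifiers by the value of one stratifying attribute."""
    subsets = defaultdict(list)
    for plot in plots:
        subsets[plot[key]].append(plot["plot_id"])
    return dict(subsets)


# Hypothetical plot attributes.
plots = [
    {"plot_id": "P1", "soil_group": "upland", "region": "boreal"},
    {"plot_id": "P2", "soil_group": "wetland", "region": "boreal"},
    {"plot_id": "P3", "soil_group": "upland", "region": "temperate"},
]
by_soil = partition_plots(plots, "soil_group")
by_region = partition_plots(plots, "region")
print(by_soil)
print(by_region)
```

Each resulting subset can then be used to fit a separate model, so that performance can be compared across levels of training data complexity.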