Figure 4: Maps of the mean difference in the ClimateBench target
variables for each baseline emulator against the target NorESM values
under the test ssp245 scenario, averaged over 2051-2100. Differences
that are insignificant at the 5% level are masked from the plots.
Gaussian process regression
Gaussian processes (GPs) (Rasmussen and Williams, 2005) are
probabilistic models which assume predictions can be modelled jointly as
normally distributed. GPs have been widely used for nonlinear and
nonparametric regression problems in the geosciences (Camps-Valls et
al., 2016). A GP is fully determined by the expectation of individual
predictions – referred to as the mean – and the covariance between
pairs of predictions. Such covariance is typically user-specified as a
bivariate function of the input data called the kernel function. The
choice of kernel function restricts the functional class the GP belongs
to, offering, for example, control over functional smoothness. GPs for
regression solve a supervised problem in which the
observed input-output sample pairs are used to: (1) infer the emulator
parameters (typically only the noise variance and the kernel parameters)
by maximising the log-likelihood of the observations under the
Gaussianity assumption (the so-called evidence); and then (2) obtain the
posterior probability distribution that is used to make predictions for
unseen inputs.
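Concretely, for n training inputs X with (standardised, mean-subtracted) targets y, Gram matrix K = k(X, X) and noise variance \sigma_n^2, the evidence maximised in step (1) and the Gaussian posterior predictive distribution of step (2) take the standard forms (Rasmussen and Williams, 2005):

$$\log p(\mathbf{y} \mid X) = -\tfrac{1}{2}\,\mathbf{y}^{\top}(K + \sigma_n^{2} I)^{-1}\mathbf{y} - \tfrac{1}{2}\log\bigl|K + \sigma_n^{2} I\bigr| - \tfrac{n}{2}\log 2\pi,$$

$$\boldsymbol{\mu}_{*} = K_{*}^{\top}(K + \sigma_n^{2} I)^{-1}\mathbf{y}, \qquad \Sigma_{*} = K_{**} - K_{*}^{\top}(K + \sigma_n^{2} I)^{-1}K_{*},$$

where $K_{*} = k(X, X_{*})$ and $K_{**} = k(X_{*}, X_{*})$ for test inputs $X_{*}$.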
To prepare the input samples, we average ensemble members for SO2 and BC
emission maps and CO2 and CH4 global emissions. The dimensionality of
each aerosol emission map is reduced with principal component analysis,
restricting ourselves to the first five principal components. All input
covariates and target outputs are standardised using training data mean
and standard deviation.
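The following sketch illustrates this preprocessing with scikit-learn; the array shapes and random stand-in data are assumptions for illustration, not the actual NorESM emission fields.

```python
# A sketch of the input preparation described above, assuming scikit-learn;
# the map size and random stand-in data are illustrative only.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_years, n_grid = 165, 96 * 144                    # illustrative grid size
so2_maps = rng.normal(size=(n_years, n_grid))      # ensemble-mean SO2 maps

# Reduce each aerosol emission map to its first five principal components.
so2_pcs = PCA(n_components=5).fit_transform(so2_maps)

# Standardise covariates (and, likewise, targets) using training statistics.
so2_std = StandardScaler().fit(so2_pcs).transform(so2_pcs)
```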
The GP is set with a constant mean prior and separate kernels are
devised for each species. Automatic relevance determination (ARD)
kernels are used for SO2 and BC, allowing each principal component to be
treated independently with its own lengthscale parameter. The GP
covariance function is obtained by summing all kernels together, thus
accounting for multiscale feature relations (see Camps-Valls et al.,
2016 for several composite kernel constructions in remote sensing and
geoscience problems). To account for internal variability between
ensemble members, we consider an additional white noise term with
homoscedastic variance over the output targets, which is also inferred
from the training phase.
We use Matérn-1.5 kernels for each input. This guarantees the GP is a
continuous function; details are provided in Section A1. The mean value,
kernel parameters and internal variability variance are jointly tuned
against the training data by marginal likelihood maximisation with the
L-BFGS optimisation scheme. The emulators used have 16 parameters in
total.
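A minimal sketch of this construction is given below, written with GPflow; the input column layout, initial values and library choice are our assumptions, and the baseline's exact implementation may differ.

```python
# A minimal sketch of the composite-kernel GP described above, using GPflow;
# column layout and synthetic data are illustrative assumptions.
import numpy as np
import gpflow

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 12))   # columns: [CO2, CH4, 5 SO2 PCs, 5 BC PCs]
Y = rng.normal(size=(80, 1))    # standardised target, e.g. surface temperature

# One Matern-1.5 kernel per species, summed into the GP covariance; the SO2
# and BC kernels use ARD, i.e. one lengthscale per principal component.
kernel = (
    gpflow.kernels.Matern32(active_dims=[0])                      # CO2
    + gpflow.kernels.Matern32(active_dims=[1])                    # CH4
    + gpflow.kernels.Matern32(lengthscales=np.ones(5),
                              active_dims=list(range(2, 7)))      # SO2 PCs
    + gpflow.kernels.Matern32(lengthscales=np.ones(5),
                              active_dims=list(range(7, 12)))     # BC PCs
)

# The GPR likelihood variance plays the role of the homoscedastic white-noise
# term accounting for internal variability between ensemble members.
model = gpflow.models.GPR((X, Y), kernel=kernel,
                          mean_function=gpflow.mean_functions.Constant())

# Jointly tune the mean, kernel parameters and noise variance by maximising
# the marginal likelihood; gpflow.optimizers.Scipy defaults to L-BFGS-B.
gpflow.optimizers.Scipy().minimize(model.training_loss,
                                   model.trainable_variables)
```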
As reported in Table 2, the 2051-2100 averaged RMSE of the mean
predictions with the GPs are the best of all the emulators when
predicting surface air temperature, diurnal temperature range,
precipitation rate and 90th percentile of
precipitation. This is remarkable given the limited number of parameters
that are tuned. It suggests the GP prior is an adequate choice for the
purposes of emulation. Study of the inferred kernel variance (not shown)
suggests that cumulative CO2 emissions generally influence all
predictions, and unequivocally dominate the predictions for surface air
temperature and diurnal temperature range. CH4 and BC emissions on the
other hand appear to have negligible influence on the predictions. Since
the GP also provides posterior estimates of the variance (which will
incorporate an estimate of internal variability) we also calculate the
CRPS for this emulator (see Table 2). While we are unable to compare
these scores with the other baseline methods, the similarity to the RMSE
indicates that the GP is also predicting the internal variability
accurately (otherwise it would be penalised in the CRPS relative to the
RMSE).
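Since the GP posterior at each grid cell is Gaussian, the CRPS can be evaluated in closed form from the posterior mean and standard deviation (Gneiting and Raftery, 2007); a minimal sketch:

```python
# Closed-form CRPS of a Gaussian predictive distribution N(mu, sigma^2)
# against an observation y (Gneiting and Raftery, 2007); a sketch of how
# the score can be computed from the GP posterior, not the exact code used.
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm.cdf(z) - 1.0)
                    + 2.0 * norm.pdf(z) - 1.0 / np.sqrt(np.pi))

# Example: posterior mean/std at one grid cell and year (illustrative values).
print(crps_gaussian(y=1.2, mu=1.0, sigma=0.3))
```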
Random Forests
Random forests aggregate predictions of multiple decision trees (Ho,
1995; Breiman, 2001). These trees repeatedly split data into subsets
according to its features such that in-subset variance is low and
between-subset variance is high. This makes decision trees good at
modelling non-linear functions, in particular interactions between
different variables. However, they are prone to overfitting (Ho, 1995).
This problem is alleviated by ensemble methods which train a large
number of different trees. Weak learners are combined to give strong
learners. Bagging (bootstrap aggregation), as used in random forests,
trains each tree on a different random subset of the data and/or
withholds some of the feature dimensions from each individual tree. The
forest makes a prediction by averaging the predictions of all individual
trees.
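A minimal sketch of this averaging with scikit-learn's RandomForestRegressor, on synthetic data chosen purely for illustration:

```python
# Bagged tree averaging: the forest prediction equals the mean over trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                      # synthetic features
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # synthetic target

rf = RandomForestRegressor(n_estimators=100, max_features="sqrt").fit(X, y)

# Average the individual tree predictions and compare to the forest output.
tree_preds = np.stack([tree.predict(X) for tree in rf.estimators_])
assert np.allclose(tree_preds.mean(axis=0), rf.predict(X))
```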
Two main arguments support an ensemble-method approach to climate model
emulation. First, although these methods are skilful at interpolation
tasks, by construction they are unable to extrapolate (Breiman, 2001);
however, for applications of climate model emulation, the predictions of
interest will likely lie inside the hypercube delimited by the
historical data, low-emissions (ssp126) and business-as-usual (ssp585)
scenarios. Second, a major advantage of ensemble methods over more
complex ML methods such as neural networks (and even ESMs) is their
interpretability. This is important because predictions should
ultimately inform decision-making: being able to explain why a given
input led to a prediction helps in understanding the consequences of
decisions about emission pathways.
To train the random forest emulator, we use the historical dataset and
the socio-economic pathways ssp585, ssp126, and ssp370.
For each dataset, we take the average of its ensemble members and reduce
the dimensionality of SO2 and BC emission maps to the first five
principal components. Separate random forest emulators are trained for
the four target variables. The following hyperparameters are tuned using
random search: number of trees, tree depth, number of samples required
to split a node and to be at each leaf node. The hyperparameters used
for each emulator are indicated in Section A2.
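A sketch of such a random search using scikit-learn's RandomizedSearchCV; the search ranges and stand-in data are illustrative assumptions, not the values reported in Section A2.

```python
# Random search over the four hyperparameters named above; ranges are
# illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))     # [CO2, CH4, 5 SO2 PCs, 5 BC PCs]
y = rng.normal(size=300)           # one target variable per emulator

param_distributions = {
    "n_estimators": [50, 100, 250, 500],       # number of trees
    "max_depth": [5, 10, 20, None],            # tree depth
    "min_samples_split": [2, 5, 10],           # samples required to split
    "min_samples_leaf": [1, 2, 4],             # samples required at a leaf
}
search = RandomizedSearchCV(RandomForestRegressor(), param_distributions,
                            n_iter=20, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```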
As shown in Table 2, the mean RMSE scores for the years 2051-2100 of the
random forest regressors are comparable to the performance of the other
emulators for diurnal temperature range, precipitation and extreme
precipitation. The temperature prediction is comparable to, though
slightly worse than, that of the neural network architecture. To assess
the impact of the four input features on the prediction, we calculate
the permutation feature importance. It is defined as the decrease in a
model score when a single feature value is randomly shuffled (Breiman,
2001). Figure 5 shows that CO2 emissions dominate the predictions.
For temperature predictions the other features are negligible. SO2 and
BC aerosol emissions have a small impact on the global mean temperature
and precipitation predictions. This is in line with the physical
understanding that while anthropogenic aerosol can influence
precipitation rates (both radiatively and through aerosol-cloud
interactions), aerosol contributions play a negligible role at the end
of the century in the test scenario. The regional influences may be more
significant, however, and these will be explored separately.
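Permutation feature importance as described above can be computed with scikit-learn; in this sketch the feature names and toy data are ours, constructed so that the first feature dominates.

```python
# Permutation feature importance: decrease in model score when a single
# feature is randomly shuffled; illustrative data, not the paper's.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                    # [CO2, CH4, SO2, BC] summaries
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=300)   # CO2-dominated toy target

rf = RandomForestRegressor(n_estimators=100).fit(X, y)

# Score decrease per feature, averaged over repeated shuffles.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
for name, imp in zip(["CO2", "CH4", "SO2", "BC"], result.importances_mean):
    print(f"{name}: {imp:.3f}")
```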