Figure 4: Maps of the mean difference in the ClimateBench target variables between each baseline emulator and the target NorESM values under the test ssp245 scenario, averaged over 2051-2100. Differences that are not significant at the 5% level are masked from the plots.

Gaussian process regression

Gaussian processes (GPs) (Rasmussen and Williams, 2005) are probabilistic models which assume predictions can be modelled jointly as normally distributed. GPs have been widely used for nonlinear and nonparametric regression problems in the geosciences (Camps-Valls et al., 2016). A GP is fully determined by the expectation of individual predictions – referred to as the mean – and the covariance between pairs of predictions. This covariance is typically user-specified as a bivariate function of the input data called the kernel function. The choice of kernel restricts the functional class the GP belongs to, offering, for example, control over functional smoothness. GPs for regression solve a supervised problem in which the observed input-output pairs are used to: (1) infer the emulator parameters (typically only the noise variance and the kernel parameters) by maximising the log-likelihood of the observations under the Gaussianity assumption (the so-called evidence); and then (2) derive the posterior probability distribution that is used to make predictions for unseen inputs.
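For concreteness, both steps admit standard closed forms (Rasmussen and Williams, 2005), shown here for a zero-mean prior for brevity. With n training inputs X, targets y, kernel matrix K with entries K_ij = k(x_i, x_j), and noise variance sigma^2, step (1) maximises the log marginal likelihood and step (2) evaluates the posterior mean and variance at a new input x_*:

```latex
% Evidence maximised in step (1):
\log p(\mathbf{y} \mid X) = -\tfrac{1}{2}\,\mathbf{y}^\top (K + \sigma^2 I)^{-1}\mathbf{y}
  - \tfrac{1}{2}\log\lvert K + \sigma^2 I\rvert - \tfrac{n}{2}\log 2\pi

% Posterior used for prediction in step (2), with (\mathbf{k}_*)_i = k(x_i, x_*):
\mu(x_*) = \mathbf{k}_*^\top (K + \sigma^2 I)^{-1}\mathbf{y}, \qquad
\sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^\top (K + \sigma^2 I)^{-1}\mathbf{k}_*
```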
To prepare the input samples, we average over ensemble members for the SO2 and BC emission maps and the CO2 and CH4 global emissions. The dimensionality of each aerosol emission map is reduced with principal component analysis, retaining the first five principal components. All input covariates and target outputs are standardised using the training data mean and standard deviation.
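A minimal sketch of this preprocessing is given below on synthetic stand-ins for the ClimateBench inputs; the array names and shapes are illustrative assumptions, not the authors' code.

```python
# Sketch of the input preprocessing on synthetic stand-ins (illustrative only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_years, n_grid = 165, 96 * 144  # e.g. historical years x flattened lat-lon grid

so2_maps = rng.normal(size=(n_years, n_grid))  # ensemble-averaged SO2 emission maps
bc_maps = rng.normal(size=(n_years, n_grid))   # ensemble-averaged BC emission maps
co2, ch4 = rng.normal(size=n_years), rng.normal(size=n_years)  # global scalars

# Reduce each aerosol emission map to its first five principal components
# (in practice the PCA is fit on the training scenarios only).
so2_pcs = PCA(n_components=5).fit_transform(so2_maps)
bc_pcs = PCA(n_components=5).fit_transform(bc_maps)

# Stack into a 12-dimensional input: 2 global scalars + 5 SO2 PCs + 5 BC PCs.
X = np.column_stack([co2, ch4, so2_pcs, bc_pcs])

# Standardise with training-set statistics.
X_std = StandardScaler().fit_transform(X)
```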
The GP uses a constant mean prior, and separate kernels are devised for each species. Automatic relevance determination (ARD) kernels are used for SO2 and BC, allowing each principal component to be treated independently with its own lengthscale parameter. The GP covariance function is obtained by summing all kernels together, thus accounting for multiscale feature relations (see Camps-Valls et al., 2016 for several composite kernel constructions in remote sensing and geoscience problems). To account for internal variability between ensemble members, we include an additional homoscedastic white noise term over the output targets, whose variance is also inferred during training.
We use Matérn-1.5 kernels for each input, which guarantees the GP is a continuous function; details are provided in Section A1. The constant mean value, the kernel parameters and the internal variability variance are jointly tuned against the training data by marginal likelihood maximisation using the L-BFGS optimisation scheme. The emulators have 16 parameters in total.
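The sketch below illustrates this construction in GPflow (where Matérn-1.5 is the Matern32 kernel), assuming the 12-dimensional standardised inputs from the preprocessing sketch above. The library choice and exact parameterisation are assumptions for illustration and may not match the paper's 16-parameter count precisely.

```python
# Hedged sketch of the composite-kernel GP described above (GPflow, toy data).
import gpflow
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(165, 12))  # standardised inputs: CO2, CH4, 5 SO2 PCs, 5 BC PCs
Y = rng.normal(size=(165, 1))   # standardised global-mean target, e.g. tas

# One Matern-1.5 kernel per species; ARD (per-dimension lengthscales) for the
# aerosol principal components. Summing the kernels gives the GP covariance.
k_co2 = gpflow.kernels.Matern32(active_dims=[0])
k_ch4 = gpflow.kernels.Matern32(active_dims=[1])
k_so2 = gpflow.kernels.Matern32(active_dims=[2, 3, 4, 5, 6], lengthscales=np.ones(5))
k_bc = gpflow.kernels.Matern32(active_dims=[7, 8, 9, 10, 11], lengthscales=np.ones(5))
kernel = k_co2 + k_ch4 + k_so2 + k_bc

# Constant mean prior; the homoscedastic Gaussian likelihood variance plays the
# role of the white-noise term capturing internal variability.
model = gpflow.models.GPR((X, Y), kernel=kernel,
                          mean_function=gpflow.mean_functions.Constant())

# Jointly tune the mean, kernel parameters and noise variance by maximising
# the log marginal likelihood with L-BFGS.
gpflow.optimizers.Scipy().minimize(model.training_loss,
                                   model.trainable_variables,
                                   method="L-BFGS-B")
```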
As reported in Table 2, the 2051-2100 averaged RMSE of the mean GP predictions is the best of all the emulators for surface air temperature, diurnal temperature range, precipitation rate and the 90th percentile of precipitation. This is remarkable given the limited number of tuned parameters, and it suggests the GP prior is an adequate choice for the purposes of emulation. Inspection of the inferred kernel variances (not shown) suggests that cumulative CO2 emissions influence all predictions and unequivocally dominate those for surface air temperature and diurnal temperature range. CH4 and BC emissions, on the other hand, appear to have negligible influence on the predictions. Since the GP also provides posterior estimates of the variance (which incorporate an estimate of internal variability), we also calculate the CRPS for this emulator (see Table 2). While we are unable to compare these scores with the other baseline methods, the similarity of the CRPS to the RMSE indicates that the GP is also predicting the internal variability accurately (otherwise it would be penalised in the CRPS relative to the RMSE).

Random Forests

Random forests aggregate the predictions of multiple decision trees (Ho, 1995; Breiman, 2001). These trees repeatedly split the data into subsets according to their features, such that in-subset variance is low and between-subset variance is high. This makes decision trees good at modelling non-linear functions, in particular interactions between different variables. However, they are prone to overfitting (Ho, 1995). This problem is alleviated by ensemble methods, which train a large number of different trees and combine these weak learners into a strong learner. Bagging, as used in random forests, describes training different trees on different subsets of the data, or holding back some of the data dimensions for each individual tree. The forest makes a prediction by averaging over the predictions of all individual trees.
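A minimal illustration of bagging on toy data (an assumed setup, not the emulator itself): each tree sees a bootstrap resample of the data and a random subset of features at each split, and the forest averages their predictions.

```python
# Toy bagging sketch: average many trees trained on bootstrap resamples.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=200)  # non-linear target

trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample of the rows
    # max_features=1: each split considers only one randomly chosen feature.
    trees.append(DecisionTreeRegressor(max_features=1).fit(X[idx], y[idx]))

X_test = rng.uniform(-3, 3, size=(10, 2))
y_pred = np.mean([t.predict(X_test) for t in trees], axis=0)  # forest = tree average
```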
Two main arguments support an ensemble method approach to climate model emulation. First, these methods are skilful at interpolation tasks, even though by construction they are unable to extrapolate (Breiman, 2001); for climate model emulation, however, the predictions of interest will likely lie inside the hypercube delimited by the historical data, the low-emissions (ssp126) scenario and the business-as-usual (ssp585) scenario. Second, a major advantage of ensemble methods over more complex ML methods such as neural networks (and even ESMs) is their interpretability. This is important as predictions should ultimately inform decision-making: being able to explain why a given input led to a prediction helps in understanding the consequences of decisions about emission pathways.
To train the random forest emulator, we use the historical dataset and the socio-economic pathways ssp585, ssp126, and ssp370. For each dataset, we take the average of its ensemble members and reduce the dimensionality of the SO2 and BC emission maps to their first five principal components. Separate random forest emulators are trained for the four target variables. The following hyperparameters are tuned using random search: the number of trees, the tree depth, and the minimum numbers of samples required to split a node and to sit at each leaf node. The hyperparameters used for each emulator are indicated in Section A2.
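The sketch below shows how such a random search over these four hyperparameters can be set up in scikit-learn; the parameter ranges and toy data are illustrative assumptions, not the paper's exact search space.

```python
# Sketch of the random hyperparameter search (scikit-learn, toy data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 12))  # 2 global scalars + 5 SO2 PCs + 5 BC PCs
y = rng.normal(size=250)        # one target variable, e.g. global-mean tas

param_distributions = {
    "n_estimators": np.arange(50, 300, 50),  # number of trees
    "max_depth": [5, 10, 20, None],          # tree depth
    "min_samples_split": [2, 5, 10],         # min samples to split a node
    "min_samples_leaf": [1, 2, 4],           # min samples at each leaf
}
search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_distributions, n_iter=20, cv=3, random_state=0)
search.fit(X, y)  # a separate forest is tuned like this for each target variable
```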
As shown in Table 2, the mean RMSE scores for the years 2051-2100 of the random forest regressors are comparable to those of the other emulators for diurnal temperature range, precipitation and extreme precipitation. The temperature prediction is comparable to, but slightly worse than, that of the neural network architecture. To assess the impact of the four input features on the prediction, we calculate the permutation feature importance, defined as the decrease in a model score when a single feature value is randomly shuffled (Breiman, 2001). Figure 5 shows that CO2 emissions dominate the predictions; for temperature predictions the other features are negligible. SO2 and BC aerosol emissions have a small impact on the global mean temperature and precipitation predictions. This is in line with the physical understanding that while anthropogenic aerosol can influence precipitation rates (both radiatively and through aerosol-cloud interactions), aerosol contributions play a negligible role at the end of the century in the test scenario. The regional influences may be more significant, however, and this will be explored separately.
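A minimal sketch of the permutation importance computation is given below, using scikit-learn's permutation_importance on toy data; the feature names and target are illustrative assumptions, not the paper's setup.

```python
# Sketch of permutation feature importance (scikit-learn, toy data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 12))             # 2 global scalars + 10 aerosol PCs
y = 3.0 * X[:, 0] + rng.normal(size=250)   # toy target dominated by column 0 ("CO2")

forest = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each column in turn and record the resulting drop in the R^2 score;
# large drops mark the features the forest relies on.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
names = (["CO2", "CH4"] + [f"SO2 PC{i}" for i in range(1, 6)]
         + [f"BC PC{i}" for i in range(1, 6)])
for name, imp in zip(names, result.importances_mean):
    print(f"{name}: {imp:.3f}")
```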