Figure 6: The relative change in global mean precipitation as a function of global mean temperature change in the baseline emulators and NorESM2, averaged in 5-year increments to reduce internal variability. Hollow and solid points indicate years before and after 2050, respectively. The changes predicted by the Clausius-Clapeyron relationship and by energy conservation considerations are shown as dashed lines.
Much attention has recently been given to 'interpretable' and 'explainable' machine learning models, the former of which are said to behave in a priori understandable ways, while the latter provide mechanisms for post-hoc understanding. While these may be desirable properties in many settings, they are subjective concepts and provide much weaker foundations on which to build trust than physical laws and thorough evaluation. For example, the lengthscales inferred by a GP reflect the relevance of the corresponding features, and the regularization term accounts for observational noise, but this may not be at all obvious to a climate scientist. Indeed, the physical ESMs currently considered the 'gold standard' of climate modelling are only interpretable or explainable by expert practitioners, and it is often part of their role to explain the behaviour of their models in response to different drivers. Given their computational efficiency, it is hoped that ClimateBench emulators might be useful in analysing and understanding the response of the underlying physical models themselves.
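To make the point about GP lengthscales concrete, the following is a minimal sketch (using scikit-learn on synthetic data, not the ClimateBench emulators themselves) of how automatic relevance determination exposes feature relevance: after fitting, a short lengthscale marks a relevant input and a long one an irrelevant input.

```python
# Sketch: ARD lengthscales in a GP as an indicator of feature relevance.
# All data and parameter values here are illustrative placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(80, 2))
y = np.sin(3 * X[:, 0])  # the output depends only on the first feature

# One lengthscale per input dimension (automatic relevance determination)
kernel = RBF(length_scale=[1.0, 1.0])
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-4,
                              n_restarts_optimizer=2,
                              random_state=0).fit(X, y)

# The irrelevant second feature is assigned a much longer lengthscale,
# i.e. the fitted function barely varies along that dimension.
print(gp.kernel_.length_scale)
```

This is the kind of diagnostic that is natural to an ML practitioner but, as argued above, may need explicit translation for domain scientists.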

Research opportunities

While the challenges outlined above are mostly surmountable with modern architectures and carefully chosen workflows, ClimateBench also presents several broader opportunities to advance the state of the art in climate model emulation.
As already mentioned, one area of particular interest is hybrid modelling, whereby statistical or ML-based emulators embed physical equations, constraints or symmetries in order to improve accuracy, robustness and generalisability (Camps-Valls et al., 2021; Reichstein et al., 2019; Karpatne et al., 2017). One obvious way to apply such approaches to ClimateBench is to marry the simple impulse response models discussed in Section 1 with more complex methods for predicting the spatial response. Such an approach has recently been demonstrated for temperature (Beusch et al., 2021) but could conceivably be extended to model each of the fields targeted in ClimateBench. A more unified, and ambitious, approach would be to model the ordinary differential equations of the response to a forcing directly in the statistical emulator using either numerical GPs (Raissi et al., 2018) or Fourier neural operators (Li et al., 2020).
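The first of these hybrid approaches can be sketched in a few lines: an impulse response model (here a simple exponential-decay convolution of a forcing time series) predicts the global mean response, and a per-grid-cell linear regression ("pattern scaling") maps that global mean onto a spatial field. All data and parameter values below are synthetic placeholders, not those of any particular ESM.

```python
# Sketch of impulse response + pattern scaling (synthetic data throughout).
import numpy as np

rng = np.random.default_rng(0)
n_years, n_lat, n_lon = 50, 4, 8

# Impulse response: convolve an annual forcing series with an
# exponential-decay kernel to obtain a global mean temperature response.
forcing = np.linspace(0.0, 4.0, n_years)      # W m^-2, synthetic ramp
tau, sensitivity = 10.0, 0.5                  # assumed, illustrative values
kernel = np.exp(-np.arange(n_years) / tau) / tau
T_global = sensitivity * np.convolve(forcing, kernel)[:n_years]

# Pattern scaling: fit a regression slope per grid cell against the
# global mean to recover the spatial response pattern.
true_pattern = rng.uniform(0.5, 2.0, size=(n_lat, n_lon))
T_field = T_global[:, None, None] * true_pattern   # synthetic "truth"
slopes = ((T_field * T_global[:, None, None]).sum(axis=0)
          / (T_global ** 2).sum())

# In this noiseless example the slopes recover the pattern exactly.
print(np.allclose(slopes, true_pattern))
```

In practice the global mean step would be a calibrated impulse response model (or a GP over emissions), and the spatial step could be any of the nonlinear methods discussed above rather than a per-cell linear fit.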
Another important open question when using data-driven approaches to emulate the climate is how to ensure predictions are performed at locations within the distribution of the training data; in other words, how to ensure the emulator is interpolating between existing model simulations rather than extrapolating to completely unseen regions of input space. This is easy to test for in low dimensions but becomes increasingly difficult in higher dimensions, and while the training and test data in ClimateBench have been chosen to minimise the risk of extrapolation, broader use could be hindered by the risk of inadvertently requesting an out-of-distribution prediction. The predictive variance of a GP provides such an indication (outside the sampled range the GP mean returns to the prior and the covariance is maximised), but this is not so easy for other techniques, and modern methods for detecting such occurrences (e.g., Lee et al., 2018; Rabanser et al., 2018) could be of great value in minimising this risk.
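The GP behaviour described above is easy to demonstrate: a minimal sketch on a synthetic one-dimensional problem (using scikit-learn, purely for illustration) shows the predictive standard deviation collapsing inside the training range and reverting towards the prior far outside it, so a simple threshold on it can flag out-of-distribution queries.

```python
# Sketch: GP predictive variance as an out-of-distribution indicator.
# Synthetic data; library and kernel choices are illustrative.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(40, 1))
y_train = np.sin(2 * np.pi * X_train[:, 0])

gp = GaussianProcessRegressor(RBF(0.2), alpha=1e-4).fit(X_train, y_train)

_, std_in = gp.predict(np.array([[0.5]]), return_std=True)   # within range
_, std_out = gp.predict(np.array([[3.0]]), return_std=True)  # far outside

# Inside the training range the predictive std is small; far from the
# data it reverts towards the prior, flagging extrapolation.
print(std_in[0], std_out[0])
```

Equipping non-probabilistic emulators (e.g., neural networks) with a comparably cheap and reliable signal is precisely the open question raised above.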

Application to detection and attribution

An efficient and accurate means of estimating the climate impacts of different emission scenarios is not limited to exploring future pathways. We may also ask: 'What observed climate states and events can be attributed to anthropogenic emissions?'. A whole field, which started with the seminal work of Hasselmann (1993), has developed rapidly in the last decade (Stott et al., 2016; Barnett et al., 2005; Stott et al., 2010; Shindell et al., 2009; Otto et al., 2016) attempting to answer this question. A common approach is to use climate model (or ESM) simulations to determine optimal 'fingerprints' against which to test observations, as well as the power of such a fingerprint under internal variability. However, these approaches typically have to make fairly strong assumptions about the form of the climate response (often relying on multiple linear regression) and can incorporate observations of only a few dimensions.
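The regression step at the heart of such fingerprinting can be sketched in its simplest form: regress (synthetic) observations onto a model-derived response pattern by ordinary least squares and test whether the scaling factor is distinguishable from zero. This deliberately omits the optimal weighting by the inverse covariance of internal variability that real fingerprinting methods apply; all values are synthetic.

```python
# Highly simplified sketch of fingerprint regression (synthetic data).
# Real optimal fingerprinting weights by the inverse noise covariance.
import numpy as np

rng = np.random.default_rng(0)
n_cells = 100

fingerprint = rng.normal(size=n_cells)      # forced response pattern
true_scaling = 1.2                          # synthetic "truth"
noise = 0.3 * rng.normal(size=n_cells)      # internal variability
obs = true_scaling * fingerprint + noise

# OLS scaling factor beta for the model obs ≈ beta * fingerprint
beta = fingerprint @ obs / (fingerprint @ fingerprint)

# A beta significantly above zero supports attribution to the forcing.
print(beta)
```

The strong assumptions noted above are visible here: the response is forced to be a single scaled pattern, which is exactly the restriction that higher-dimensional, emulator-based approaches might relax.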
One possible application of the efficient emulators trained using ClimateBench could then be to enable inference in higher-dimensional attribution problems, incorporating more information (such as the DTR and PR) and potentially providing more confident assessments. It would be straightforward to implement such an approach using the ESEm package, which provides a convenient interface for such inferences using, e.g., approximate Bayesian computation (ABC), variational inference or Markov chain Monte Carlo sampling. Future work will investigate these possibilities.
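As a rough illustration of how such an inference could be framed, the following sketches rejection ABC with a cheap emulator: draw candidate parameters from a prior, run the emulator, and keep draws whose output lies within a tolerance of the observations. The "emulator" here is a toy linear stand-in, not the ESEm interface or a ClimateBench emulator.

```python
# Sketch of rejection ABC with a toy emulator (not the ESEm API).
import numpy as np

rng = np.random.default_rng(0)

def emulator(scaling):
    """Toy stand-in: response as a linear function of a forcing ramp."""
    return scaling * np.linspace(0.0, 2.0, 20)

# Synthetic "observations" generated with a known scaling of 1.5
observed = emulator(1.5) + 0.05 * rng.normal(size=20)

# Rejection ABC: sample the prior, keep draws whose simulated output
# lies within an RMS tolerance of the observations.
prior_draws = rng.uniform(0.0, 3.0, size=5000)
distances = np.array([np.sqrt(np.mean((emulator(s) - observed) ** 2))
                      for s in prior_draws])
posterior = prior_draws[distances < 0.1]

print(posterior.mean(), posterior.size)
```

With a fast emulator in place of the toy function, the same scheme scales to the many-thousand-sample budgets such sampling methods require, which is precisely what makes emulators attractive for attribution.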