Introduction
Many different emission pathways are compatible with the Paris
Agreement goal of limiting the global mean temperature increase to “well
below 2°C above pre-industrial levels and pursuing efforts to limit the
temperature increase to 1.5°C”, and many more are possible that miss
that target. Sampling possible emissions scenarios is therefore crucial
for policy makers to weigh the economic cost and societal impact of
different mitigation and adaptation strategies. While many of the most
complex Earth System Models (ESMs) have simulated a small selection of
‘Shared Socioeconomic Pathways’ (SSPs; self-consistent emissions
scenarios based on assumptions about future socio-economic changes and
imperatives), it is impractical to use these expensive models to fully
explore the space of possibilities (O’Neill et al., 2016). Such
explorations therefore mostly rely on one-dimensional impulse response
models or simple pattern scaling approaches to approximate the physical
climate response to a given scenario (e.g., Millar et al., 2017).
Impulse response models (Smith et al., 2018; Meinshausen et al., 2011;
Nicholls et al., 2020) are physically interpretable and can capture the
general non-linear behaviour of the system, but are inherently unable to
model regional climate changes, while pattern scaling approaches rely on
a simple scaling of spatial distributions of temperature (e.g., Alexeeff
et al., 2018) by global mean temperature changes. This approach breaks
down when considering precipitation, however, because of the strong
non-linearities in its response to temperature (e.g., Cabré et al.,
2010). Statistical emulators of the regional climate have been
developed, although these are either quite bespoke (Castruccio et al.,
2014) or focus on the relatively simple problem of emulating temperature
(Holden and Edwards, 2010). These approaches also do not account for the
influence of aerosol, which can be important for both regional
temperature and precipitation (e.g., Kasoar et al., 2018; Wilcox et al.,
2020). As has been noted recently (Watson-Parris, 2021), approaches
including non-linear pattern scaling (Beusch et al., 2020) and Gaussian
process (GP) regression of long-term climate responses (Mansfield et
al., 2020) suggest the possibility of using modern machine learning (ML)
tools to produce robust and general emulators of future scenarios.
However, comparing and contrasting these approaches is currently
hindered by the lack of a consistent benchmark.
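As a minimal illustration of the simple pattern scaling idea described above, the following Python sketch fits one scaling coefficient per grid cell by least squares through the origin and then scales that pattern by a new global mean temperature change. The function names, the grid size and the synthetic data are assumptions made for illustration only; they are not taken from any of the cited studies.

```python
import numpy as np


def fit_pattern(global_mean_dt, local_dt):
    """Fit a per-grid-cell scaling pattern by least squares through the origin.

    global_mean_dt: shape (n_years,), global mean temperature change.
    local_dt: shape (n_years, n_lat, n_lon), local temperature change.
    Returns an (n_lat, n_lon) array of local change per unit global change.
    """
    g = np.asarray(global_mean_dt, dtype=float)
    y = np.asarray(local_dt, dtype=float)
    # Slope for each cell: sum_t(g_t * y_t) / sum_t(g_t ** 2)
    return np.tensordot(g, y, axes=(0, 0)) / np.dot(g, g)


def apply_pattern(pattern, global_mean_dt):
    """Scale the fitted pattern by each global mean temperature change."""
    g = np.asarray(global_mean_dt, dtype=float)
    return g[:, None, None] * pattern[None, :, :]


# Synthetic, exactly linear training data on a hypothetical 4 x 8 grid.
rng = np.random.default_rng(0)
true_pattern = rng.uniform(0.5, 2.0, size=(4, 8))
train_global = np.linspace(0.1, 2.0, 30)
train_local = train_global[:, None, None] * true_pattern

pattern = fit_pattern(train_global, train_local)
predicted = apply_pattern(pattern, np.array([1.5, 3.0]))
```

Because the prediction is linear in the global mean change by construction, a sketch of this kind cannot represent a variable whose local response is strongly non-linear in temperature, which is the limitation noted above for precipitation.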
ClimateBench defines a set of criteria and metrics for objectively
evaluating such climate model emulation; aims to demonstrate the
feasibility of such emulators; and provides a curated dataset that will
allow, and hopefully encourage, broader engagement with this challenge
in the same way that WeatherBench (Rasp et al., 2020) has for
weather modelling. The target is to predict annual mean global
distributions of temperature (T), diurnal temperature range (DTR),
precipitation (PR) and the 90th percentile of
precipitation (PR90). These variables are chosen to represent a range of
important climate variables which respond differently to each forcing
and include extreme changes (PR90) that might not be expected to scale
in the same way as the mean. For example, while T has been shown to
scale roughly linearly with global mean temperature changes (Castruccio
et al., 2014), PR responds non-linearly, and DTR is more sensitive to
aerosol perturbations than global mean temperature changes (Hansen et
al., 1995). Four of the main anthropogenic forcing agents are provided
as emulator inputs (predictors): carbon dioxide (CO2), sulphur dioxide
(SO2; a precursor to sulphate aerosol), black carbon (BC) and methane
(CH4). To enable spatially accurate emulators, ClimateBench includes
(as annual means): spatial distributions of emissions for the short-lived
aerosol species (SO2 and BC), globally averaged emissions of CH4, and
global total concentrations of CO2.
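To make the four target variables concrete, the sketch below derives them from daily fields for a single year. It assumes that daily fields are available, that DTR is the mean of daily maximum minus daily minimum temperature, and that PR90 is the 90th percentile of the daily precipitation values within the year; the text above does not specify these details, and the function and argument names are hypothetical.

```python
import numpy as np


def annual_targets(daily_tas, daily_tasmax, daily_tasmin, daily_pr):
    """Reduce daily fields of shape (n_days, n_lat, n_lon) to annual targets.

    Assumption: PR90 is taken over the daily values within the year.
    """
    return {
        "T": daily_tas.mean(axis=0),
        "DTR": (daily_tasmax - daily_tasmin).mean(axis=0),
        "PR": daily_pr.mean(axis=0),
        "PR90": np.percentile(daily_pr, 90, axis=0),
    }


# Short synthetic record standing in for one year on a hypothetical 2 x 3 grid.
n_days, n_lat, n_lon = 11, 2, 3
ramp = np.arange(n_days, dtype=float)[:, None, None] * np.ones((1, n_lat, n_lon))

targets = annual_targets(
    daily_tas=280.0 + ramp,
    daily_tasmax=284.0 + ramp,
    daily_tasmin=276.0 + ramp,
    daily_pr=ramp,
)
```

In this toy record the mean precipitation is 5 while PR90 is 9, which shows how an upper percentile can differ markedly from the mean of the same series.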
The training data provided to support such predictions are generated
from simulations performed with the second (and latest) version of the
Norwegian Earth System Model (NorESM2; Seland et al., 2020) as part of
the sixth Coupled Model Intercomparison Project (CMIP6;
Eyring et al., 2016). The provided inputs are constructed from the same
input data that is used to drive the original simulations. Although
simulations from several models could in principle have been included,
only one model submitted all of the DECK (Diagnostic, Evaluation, and
Characterization of Klima), historical, AerChemMIP (Collins et al.,
2017) and ScenarioMIP (O’Neill et al., 2016) experiments required for
our purposes, making it impossible to provide a harmonised multi-model
dataset.
Further, there is no agreed way of robustly combining multiple models,
and while statistically combining multiple different models can lead to
improved skill (Pincus et al., 2008) the resulting variance is not
reliable since the models are not truly independent (Knutti et al.,
2013). Nevertheless, this single model dataset still allows us to
explore both scenario uncertainty and internal variability. It is also
common with simple climate models to fit a separate emulator to each
model independently, which improves interpretability. If an emulator is
shown to have good skill in this task, it seems reasonable to assume that
it will perform similarly well for other models (or combinations of
models), so multi-model ensembles may be easily incorporated in the
future.
The remainder of this paper describes the development of the dataset,
including the underlying ESM and all post-processing (Section 2); the
evaluation metrics used to rank ClimateBench submissions (Section 3); a
selection of baseline emulators developed to demonstrate a variety of
approaches to ClimateBench (Section 4); and a discussion of these
approaches and future opportunities for diverse approaches (Section 5),
before providing a few concluding remarks in Section 6.