where \(F(x)\) and \(F^{t}\left(x\right)\) are the cumulative distribution functions (CDFs) over the predicted and target ensembles, respectively (Gneiting et al., 2005). This metric measures the area between the two CDFs, so that smaller values are better, and it has the benefit of retaining a well-defined interpretation even in the case of a single target observation, whose CDF is the Heaviside step function. The CDFs can be approximated over finite ensembles using quadrature, or by direct integration if the PDFs can be assumed to be Gaussian. Methods to calculate both metrics, based on the climpred package (Brady and Spring, 2021), are provided in the example notebooks included with the dataset. While this metric is not included in the headline ranking of ClimateBench approaches, we include an example approach using GPs, which is discussed in more detail in Section 4.1.
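As a concrete illustration of this kind of CDF-based score, the following sketch estimates the CRPS of a finite ensemble against a single observation using the standard sample identity \(\mathrm{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|\), which is equivalent to integrating the squared difference between the empirical ensemble CDF and the observation's Heaviside CDF. This is a minimal NumPy version for illustration only, not the climpred implementation used in the dataset's notebooks:

```python
import numpy as np

def crps_ensemble(ensemble, observation):
    """Sample-based CRPS for a finite ensemble and a single observation.

    Uses CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, which equals the
    integral of the squared difference between the ensemble CDF and
    the Heaviside step CDF located at the observation.
    """
    x = np.asarray(ensemble, dtype=float)
    # Mean absolute error of each member against the observation.
    term1 = np.mean(np.abs(x - observation))
    # Mean absolute pairwise spread of the ensemble (all m*m pairs).
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2
```

For a two-member ensemble \([0, 1]\) and an observation at 0.5, both terms can be checked by hand: the score is \(0.5 - 0.25 = 0.25\), matching the direct integral of the squared CDF difference.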

Baseline evaluation

Before evaluating some baseline statistical emulators, it is useful to consider two cases with which we hope to bracket the data-driven approaches. The first is the internal variability of the NorESM2 target ensemble, which provides an upper bound on the predictability of the scenario in the presence of the natural variability of the Earth system. The second is a comparison to another ESM which also performed the test projection, in this case the UKESM1 model (Sellar et al., 2019). This provides an example of the inter-model spread encountered within CMIP6 and a lower bound on the accuracy we would like our emulators to achieve.
As noted previously, the NorESM2 ssp245 projections included three ensemble members sampling internal variability, generated by choosing different initial model states from the start of the piControl simulation at intervals of 30 model years. The average RMSE for each variable at each target time between ensemble members 1 and 2, and 1 and 3, is provided in Table 2 and gives an estimate of the best achievable skill over this period (since the members differ only in their internal state). In practice, the emulators can (and do) outperform this baseline because they target the mean over all three ensemble members, reducing the effect of internal variability.
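The internal-variability baseline described above amounts to computing pairwise RMSEs between ensemble members. A minimal sketch of that calculation is below; the array shapes and the random fields standing in for model output are hypothetical (the real members would be loaded from the dataset's netCDF files), and area weighting by latitude is omitted for brevity:

```python
import numpy as np

def member_rmse(member_a, member_b):
    """RMSE between two ensemble members of shape (time, lat, lon).

    Because the members differ only in their initial state, their
    mutual spread estimates the noise floor set by internal variability.
    Note: an unweighted mean is used here for simplicity; a real
    evaluation would weight grid cells by area (e.g. cos(latitude)).
    """
    return np.sqrt(np.mean((member_a - member_b) ** 2))

# Hypothetical stand-ins for the three NorESM2 ssp245 members.
rng = np.random.default_rng(0)
m1, m2, m3 = (rng.normal(size=(10, 96, 144)) for _ in range(3))

# Average the member-1-vs-2 and member-1-vs-3 scores, as in Table 2.
baseline = 0.5 * (member_rmse(m1, m2) + member_rmse(m1, m3))
```

An emulator trained on the three-member mean can score below this baseline, since averaging the members suppresses the very variability the baseline measures.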
The UKESM1 model performed the same ssp245 projection and shows a substantially larger temperature response than NorESM2, which may be due to the different land models used; UKESM1 uses JULES (Harper et al., 2018) and NorESM2 uses CLM5 (Lawrence et al., 2019). Despite the large difference in temperature response, the global precipitation response is, interestingly, broadly in agreement, although the spatial patterns suggest quite distinct changes in the hydrological cycle. For example, the UKESM1 precipitation does not show a clear shift in the ITCZ and shows larger changes in the extra-tropics. The RMSE between UKESM1 and NorESM2 is correspondingly large for the temperature metrics and closer to the baseline approaches for precipitation, as shown in Table 2.