where \(F(x)\) and \(F^{t}\left(x\right)\) are the cumulative
distribution functions (CDFs) over the predicted and target ensembles
respectively (Gneiting et al. 2005). This measures the area between the
two CDFs, so smaller values are better, and it has the benefit of
retaining a well-defined interpretation even in the case of a single
target observation (whose CDF is the Heaviside step function). The CDFs
can be approximated over finite ensembles using quadrature, or by direct
integration if the PDFs can be assumed to be Gaussian. Methods to
calculate both metrics based on the climpred (Brady and Spring, 2021)
package are provided in the example notebooks included with the dataset.
While this metric is not included in the headline ranking of
ClimateBench approaches, we include an example approach using GPs which
is discussed in more detail in Section 4.1.
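As a concrete illustration, the area between an ensemble CDF and the Heaviside CDF of a single observation corresponds to the continuous ranked probability score (CRPS), which can be computed directly from the ensemble members without explicit quadrature. The sketch below is illustrative only (the function name is hypothetical and not part of the climpred API); it uses the standard identity CRPS = E|X − y| − ½ E|X − X′| for an empirical ensemble.

```python
import numpy as np

def ensemble_crps(ensemble, observation):
    """CRPS of a finite ensemble against a single observation.

    Uses the identity CRPS = E|X - y| - 0.5 * E|X - X'|, which equals
    the integrated squared difference between the empirical forecast
    CDF and the Heaviside step CDF of the observation, so smaller
    values indicate a better match.
    """
    x = np.asarray(ensemble, dtype=float)
    # Mean absolute error of the members against the observation.
    term1 = np.mean(np.abs(x - observation))
    # Mean absolute spread between all pairs of ensemble members.
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2
```

For a one-member "ensemble" this reduces to the absolute error, and a perfectly sharp, correct ensemble scores zero, matching the interpretation given above.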
Baseline evaluation
Before evaluating some baseline statistical emulators, it is useful to
consider two cases with which we hope to bracket the data-driven
approaches. The first is the internal variability of the NorESM2 target
ensemble which provides an upper bound on the predictability of the
scenario in the presence of the natural variability of the Earth system.
The second is a comparison to another ESM which also performed the test
projection, in this case the UKESM1 model (Sellar et al., 2019). This
provides an example of the inter-model spread encountered within CMIP6
and a lower bound on the accuracy we would like our emulators to
achieve.
As noted previously, the NorESM2 ssp245 projections included
three ensemble members sampling internal variability by choosing
different initial model states taken from the start of the piControl
simulation at intervals of 30 model years. The average RMSE for
each variable at each target time between ensemble members 1 and 2, and
1 and 3 are given in Table 2 and provide an estimate of the best
achievable skill over this period (since the members only differ by
their internal state). In practice, the emulators can (and do)
outperform this baseline because they target the mean over all three
ensemble members, reducing the effect of internal variability.
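A minimal sketch of the kind of area-weighted RMSE underlying this baseline, assuming fields on a regular latitude-longitude grid (the function name and cosine weighting are illustrative assumptions; the values in Table 2 come from the dataset's own evaluation code):

```python
import numpy as np

def global_rmse(field_a, field_b, lats):
    """Latitude-weighted RMSE between two 2-D (lat, lon) fields.

    Grid cells shrink towards the poles, so each squared error is
    weighted by cos(latitude) before averaging over the globe.
    """
    # Column vector of weights, broadcast across longitudes.
    w = np.cos(np.deg2rad(np.asarray(lats)))[:, None]
    weights = np.broadcast_to(w, np.shape(field_a))
    sq_err = (np.asarray(field_a) - np.asarray(field_b)) ** 2
    return float(np.sqrt(np.average(sq_err, weights=weights)))
```

Applying such a function to pairs of ensemble members at each target time, and averaging over the period, would yield an internal-variability estimate analogous to the one described above.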
The UKESM1 model performed the same ssp245 projection but shows a much
larger temperature response than NorESM2. This difference may be due in
part to the different land models used: UKESM1 uses JULES (Harper et
al. 2018) and NorESM2 uses CLM5 (Lawrence et al. 2019).
Despite the large difference in temperature response, the precipitation
responses of the two models are, interestingly, broadly in agreement in
overall magnitude, although their spatial patterns suggest quite
distinct changes in the hydrological cycle: for example, the UKESM1
precipitation does not show a clear shift in the ITCZ and shows larger
changes in the extra-tropics. The RMSE between UKESM1 and NorESM2 is
correspondingly large for the temperature metrics and closer to the
baseline approaches for precipitation, as shown in Table 2.