Over recent decades, climate science has evolved rapidly across multiple expert domains. Our best tools to capture state-of-the-art knowledge in an internally self-consistent modelling framework are the increasingly complex fully coupled Earth System Models (ESMs). However, computational limitations and the structural rigidity of ESMs mean that the full range of uncertainties across multiple domains is difficult to capture with ESMs alone. The tools of choice are instead more computationally efficient reduced complexity models (RCMs), which are structurally flexible and can span the response dynamics across a range of domain-specific models and ESM experiments. Here we present Phase 2 of the Reduced Complexity Model Intercomparison Project (RCMIP Phase 2), the first comprehensive intercomparison of RCMs that are probabilistically calibrated with key benchmark ranges from specialised research communities. Unsurprisingly, but crucially, we find that models that have been constrained to reflect the key benchmarks do reflect them better. Under the low-emissions SSP1-1.9 scenario, median peak warming projections across the RCMs range from 1.3 to 1.7°C (relative to 1850–1900, using an observationally based historical warming estimate of 0.8°C between 1850–1900 and 1995–2014). Further developing methodologies to constrain these projection uncertainties seems paramount given the international community’s goal to contain warming to below 1.5°C above pre-industrial levels in the long term. Our findings suggest that users of RCMs should carefully evaluate their RCM, specifically its skill against key benchmarks, and consider the need to include projection benchmarks, either from ESM results or from other assessments, to reduce divergence in future projections.