Climate model emulators are widely used to generate temperature projections for climate scenarios, including in the recent IPCC Sixth Assessment Report. Here we evaluate the performance of a two-layer energy balance model in emulating historical and future temperature projections from CMIP6 models. We find that prediction errors can be large (greater than 0.5 °C in a given year) and differ markedly between climate models, forcing scenarios and time periods. Errors arise in emulating the near-surface temperature response to both greenhouse gas and aerosol forcing; in some periods the errors due to these forcings oppose one another, giving the spurious impression of better emulator performance. Time-varying and state-dependent feedbacks may contribute to prediction errors. Close emulations can be produced for a given period but, crucially, this does not guarantee reliable emulations of other scenarios and periods. Therefore, rigorous out-of-sample evaluation is necessary to characterize emulator performance.
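
For readers unfamiliar with this class of emulator, the following is a minimal sketch of a two-layer energy balance model in its commonly used form, in which a surface layer exchanges heat with a deep-ocean layer. It is an illustration only and not necessarily the configuration evaluated here: the function names, the forward-Euler time stepping and all default parameter values are assumptions chosen for the example.

```python
def two_layer_ebm(forcing, lam=1.2, gamma=0.7, c_surface=8.0, c_deep=100.0, dt=1.0):
    """Integrate a two-layer energy balance model with forward Euler.

    forcing   : sequence of radiative forcing values (W m^-2), one per time step
    lam       : climate feedback parameter (W m^-2 K^-1), illustrative value
    gamma     : surface/deep heat exchange coefficient (W m^-2 K^-1), illustrative
    c_surface : surface-layer heat capacity (W yr m^-2 K^-1), illustrative
    c_deep    : deep-layer heat capacity (W yr m^-2 K^-1), illustrative
    dt        : time step in years

    Returns a list of surface temperature anomalies (K), one per forcing value.
    """
    t_surface, t_deep = 0.0, 0.0
    temperatures = []
    for f in forcing:
        # Heat flux from the surface layer into the deep layer.
        heat_exchange = gamma * (t_surface - t_deep)
        # Surface layer: forcing minus radiative response minus heat export.
        t_surface += dt * (f - lam * t_surface - heat_exchange) / c_surface
        # Deep layer: warms only through exchange with the surface layer.
        t_deep += dt * heat_exchange / c_deep
        temperatures.append(t_surface)
    return temperatures


def rmse(emulated, target):
    """Root-mean-square error between two equal-length temperature series."""
    squared = [(e - t) ** 2 for e, t in zip(emulated, target)]
    return (sum(squared) / len(squared)) ** 0.5
```

In an out-of-sample test of this kind, the parameters would be calibrated against one scenario or period and `rmse` would then be computed on a different scenario or period that played no part in the calibration.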