Earth System Models (ESMs) are essential tools for understanding the interaction of the human and Earth systems. One key application of these models is studying extreme weather events, such as heat waves or high-intensity precipitation events, which have significant socioeconomic consequences. However, the computational demands of running enough simulations to robustly characterize expected changes in these hazards, and therefore to provide a strong basis for analyzing the ensuing risks, are often prohibitive. In this paper we demonstrate that diffusion models – a class of generative deep learning models – can effectively emulate the spatio-temporal trends of ESM daily output. Trained on a handful of runs that reflect a wide range of radiative forcings, our DiffESM model takes monthly mean precipitation or temperature as input and produces daily values of temperature and precipitation whose statistical characteristics are close to those of the ESM output. This approach requires only a small fraction of the computational resources that would be needed to run a large ensemble under any scenario of interest. We evaluate model behavior across a range of scenarios, time horizons, and two ESMs, using a number of metrics of extremes, including ones that have long been established in the climate modeling and analysis community. Our results show that the samples produced by DiffESM closely match the spatio-temporal behavior of the ESM output it emulates in terms of the frequency and spatial characteristics of phenomena such as heat waves, dry spells, or rainfall intensity.