Causal Drivers of Land-Atmosphere Carbon Fluxes from Machine Learning
Models and Data
Abstract
Interactions among atmospheric, root-soil, and vegetation processes
drive carbon dioxide fluxes (Fc) from land to atmosphere. The eddy
covariance method is commonly used to measure Fc at sub-daily
timescales and to validate process-based and data-driven models. However,
such validations reveal neither process interactions and thresholds nor
key differences in how models replicate them. We use information
theory-based measures to explore multivariate information flow pathways
from forcing data to observed and modeled hourly Fc, using flux tower
datasets in the Midwestern U.S. in intensively managed corn-soybean
landscapes. We compare Multiple Linear Regression (MLR), Long Short-Term
Memory (LSTM) networks, and Random Forests (RF) to evaluate how different
model structures use information from combinations of sources to predict
Fc. We extend a framework for evaluating models' predictive and
functional performance, which examines the full suite of dependencies
from all forcing variables to the observed or modeled target. Of the
three model types, RF exhibited the highest functional and predictive
performance. Regionally trained models exhibited lower predictive but
higher functional performance than site-specific models, suggesting
that they better reproduce observed relationships. This study
shows that some metrics of predictive performance encapsulate functional
behaviors better than others, highlighting the need for multiple metrics
of both types. The results improve our understanding of carbon fluxes in
an intensively managed landscape, and more generally provide insight
into how model structures and forcing variables translate to
interactions that are well versus poorly captured in models.
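
As a rough illustration of the distinction between predictive and functional performance, the sketch below estimates the mutual information shared between a single forcing variable and Fc, once for observed Fc and once for modeled Fc, and reports the difference. It is a minimal sketch under stated assumptions, not the multivariate measures used in this study: the function names (mutual_information, functional_gap), the fixed-bin histogram estimator, the bin count, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def mutual_information(x, y, bins=10):
    """Plug-in (histogram) estimate of I(X;Y) in bits.

    Assumption: equal-width bins over the range of each variable.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint probability table
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    nz = pxy > 0                           # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))


def functional_gap(source, fc_obs, fc_mod, bins=10):
    """Information from source to modeled Fc minus that to observed Fc.

    Positive values suggest the model leans on the source more than the
    observations do; negative values suggest it underuses the source.
    """
    return (mutual_information(source, fc_mod, bins)
            - mutual_information(source, fc_obs, bins))


# Synthetic stand-ins (hypothetical, not flux tower data): one forcing
# variable, a noisy "observed" Fc, and a "modeled" Fc that follows the
# forcing more tightly than the observations do.
rng = np.random.default_rng(0)
source = rng.normal(size=5000)
fc_obs = -source + rng.normal(scale=1.0, size=5000)
fc_mod = -source + rng.normal(scale=0.2, size=5000)

gap = functional_gap(source, fc_obs, fc_mod)
predictive = mutual_information(fc_obs, fc_mod)  # information shared by model and observation
print(f"functional gap: {gap:.3f} bits, predictive MI: {predictive:.3f} bits")
```

In this toy setup the modeled Fc shares substantial information with the observations yet depends on the forcing more strongly than the observations do, which is the kind of mismatch a purely predictive metric can miss.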