Mozhgan A Farahani

and 1 more

Interactions among atmospheric, root-soil, and vegetation processes drive carbon dioxide fluxes (Fc) from land to atmosphere. Eddy covariance is commonly used to measure Fc at sub-daily timescales and to validate process-based and data-driven models. However, these validations do not reveal process interactions, thresholds, or key differences in how models replicate them. We use information theory-based measures to explore multivariate information flow pathways from forcing data to observed and modeled hourly Fc, using flux tower datasets from intensively managed corn-soybean landscapes in the Midwestern U.S. We compare Multiple Linear Regression (MLR), Long Short-Term Memory (LSTM), and Random Forest (RF) models to evaluate how different model structures use information from combinations of sources to predict Fc. We extend a framework that assesses both model predictive performance and functional performance, where the latter examines the full suite of dependencies from all forcing variables to the observed or modeled target. Of the three model types, RF exhibited the highest functional and predictive performance. Regionally trained models demonstrate lower predictive but higher functional performance than site-specific models, suggesting that they better reproduce observed relationships. This study shows that some metrics of predictive performance encapsulate functional behaviors better than others, highlighting the need for multiple metrics of both types. This study improves our understanding of carbon fluxes in an intensively managed landscape and, more generally, provides insight into how model structures and forcing variables translate to interactions that are well versus poorly captured in models.
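As a rough illustration of the kind of information-theoretic measure mentioned above, the sketch below estimates mutual information between one forcing variable and a target using fixed-width histogram bins. It is not the authors' framework: the function name, the bin count, and the synthetic "forcing" and "flux" series are all assumptions made for illustration only.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram-based estimate of I(X;Y) in bits (illustrative, biased for small samples)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()             # joint probability table
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nonzero = pxy > 0                     # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])))

# Synthetic data only (assumption): a forcing series, a target that depends on it
# nonlinearly, and an unrelated series for comparison.
rng = np.random.default_rng(0)
forcing = rng.normal(size=5000)
flux_dependent = forcing**2 + 0.1 * rng.normal(size=5000)
flux_unrelated = rng.normal(size=5000)

mi_dependent = mutual_information(forcing, flux_dependent)
mi_unrelated = mutual_information(forcing, flux_unrelated)
```

Computing the same measure once for an observed target and once for a modeled target gives a simple way to check whether a model reproduces a given dependency, which is the general idea behind comparing functional and predictive performance.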

Tarun Agrawal

and 2 more

Surface runoff and infiltrated water interact with dynamic landscape properties en route to the stream, ranging from vegetation and microbial activities to soil and geological attributes. Stream solute concentrations are highly variable and interconnected due to these interactions, flow paths, and residence times, and often exhibit hysteresis with flow. Significant unknowns remain about how point measurements of stream solute chemistry reflect interdependent hydrobiogeochemical and physical processes, and how signatures are encapsulated as nonlinear dynamical relationships between variables. We take a machine learning approach to understand and capture these dynamical relationships and improve predictions of solutes at short and long time scales. We introduce a physical process-based "flow-gate" into an LSTM (long short-term memory) model, which enables the model to learn hysteresis behaviors if they exist. Further, we use information-theoretic metrics to detect how solutes are interdependent, and iteratively select source solutes that best predict a given target solute concentration. The "flow-gate LSTM" model improves predictions relative to the standard LSTM model for all nine solutes included in the study (RMSE values decrease by 1% to 32%). These predictive improvements highlight the importance of lagged concentration and discharge relationships for certain solutes. They also indicate a potential limitation of the traditional LSTM approach: flow rates are always provided as inputs, yet this information is not fully utilized. This work provides a starting point for a predictive understanding of geochemical interdependencies using machine-learning approaches and highlights potential improvements in model architecture.
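The iterative selection of source solutes described above can be sketched as a greedy forward search that, at each step, adds the source giving the largest joint mutual information with the target. This is a minimal sketch under stated assumptions, not the authors' procedure: the quantile binning, the stopping rule (a fixed number of sources), the function names, and the synthetic "solute" series are all hypothetical.

```python
import numpy as np

def discretize(values, bins=5):
    """Map a continuous series to integer labels using equal-frequency (quantile) bins."""
    edges = np.quantile(values, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(values, edges)

def entropy(labels):
    """Shannon entropy in bits of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def joint_labels(columns):
    """Combine several label arrays into a single label per time step."""
    stacked = np.stack(columns, axis=1)
    _, inverse = np.unique(stacked, axis=0, return_inverse=True)
    return inverse.ravel()

def mutual_information(source_columns, target):
    """I(sources; target) = H(sources) + H(target) - H(sources, target), in bits."""
    s = joint_labels(source_columns)
    return entropy(s) + entropy(target) - entropy(joint_labels([s, target]))

def greedy_select(sources, target, max_sources=2):
    """Greedily pick the sources (dict of name -> labels) most informative about the target."""
    selected, remaining = [], list(sources)
    while remaining and len(selected) < max_sources:
        best = max(
            remaining,
            key=lambda name: mutual_information(
                [sources[n] for n in selected + [name]], target
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic series only (assumption): the target depends strongly on solute_a,
# weakly on solute_b, and not at all on solute_c.
rng = np.random.default_rng(1)
a, b, c = rng.normal(size=(3, 5000))
target = discretize(a + 0.5 * b + 0.05 * rng.normal(size=5000))
sources = {
    "solute_a": discretize(a),
    "solute_b": discretize(b),
    "solute_c": discretize(c),
}
selected = greedy_select(sources, target, max_sources=2)
```

Joint histogram estimates become unreliable as more sources are combined, so a sketch like this is only reasonable for a small number of sources and coarse bins.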