Revealing causal controls of storage-streamflow relationships with a
data-centric Bayesian framework combining machine learning and
process-based modeling
Abstract
Some machine learning (ML) methods, such as classification trees, are
useful tools for generating hypotheses about how hydrologic systems
function. However, data limitations dictate that ML alone often cannot
differentiate between causal and associative relationships. For example,
previous ML analysis suggested that soil thickness is the key
physiographic factor determining the storage-streamflow correlations in
the eastern US. This conclusion is not robust, especially when the data are
perturbed, and alternative, competing explanations exist, including
soil texture and terrain slope. At the same time, typical causal analysis
based on process-based models (PBMs) is inefficient and susceptible to human
bias. Here we demonstrate a more efficient and objective analysis
procedure where ML is first applied to generate data-consistent
hypotheses, and then a PBM is invoked to verify these hypotheses. We
employed a surface-subsurface process model and conducted perturbation
experiments to implement these competing hypotheses and assess the
impacts of the changes. The experimental results strongly support the
soil thickness hypothesis as opposed to the terrain slope and soil
texture ones, which are co-varying and coincidental factors. Thicker
soil permits larger saturation excess and longer system memory that
carries wet season water storage to influence dry season baseflows. We
further suggest this analysis could be formalized into a novel,
data-centric Bayesian framework. This study demonstrates that PBMs
offer indispensable value for problems that ML cannot solve alone, and
it is meant to encourage more synergy between ML and PBMs in the future.