Stefan F. Gary

and 6 more

River sediment microbial respiration is a key indicator of ecosystem functioning, and the biogeochemical fluxes across this critical zone link surface and subsurface waters. As such, there is tremendous interest in measuring and mapping these respiration rates. Because respiration observations are expensive and labor intensive, limited data are available to the community. An open-science, collaborative initiative is collecting samples for respiration rate analysis along with multi-scale metadata; this evolving data set is being used to make machine learning (ML) predictions at unsampled sites to help inform continued community engagement. However, finding an optimal configuration for ML models that work with this feature-rich data set (i.e., 100+ possible input variables) is a challenge. Here, we present results from a two-tiered approach to managing the analysis of this complex data set: 1) a stacked ensemble of models that automatically optimizes hyperparameters and manages the training of many models, and 2) feature permutation importance to detect the most important features in the models. The major elements of this workflow are modular, portable, open, and cloud-based, making this implementation a potential template for other applications. The models developed here indicate that sediment organic matter chemistry is one of the most important features for predicting sediment respiration rate. Other important, larger-scale features fall into the categories of climatic, ecological, geological, and fluvial settings. Leveraging these larger-scale features to generate data-driven estimates of river sediment respiration rates reveals spatially consistent but heterogeneous patterns across the river network of the Columbia River Basin.
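
The two-tiered approach described above can be illustrated with a minimal sketch. It is not the authors' workflow: it assumes scikit-learn as the toolkit, uses a synthetic data set in place of the respiration observations, fixes the hyperparameters by hand rather than optimizing them automatically, and picks the base models (a random forest and a ridge regression) purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 20 candidate features,
# of which only the first 3 (columns 0-2, because shuffle=False) drive the target.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=3,
    noise=5.0, shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tier 1: a stacked ensemble. Base models are trained with cross-validation,
# and a final estimator learns how to combine their predictions.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
        ("ridge", Ridge(alpha=1.0)),
    ],
    final_estimator=Ridge(),
    cv=5,
)
stack.fit(X_train, y_train)
test_r2 = stack.score(X_test, y_test)

# Tier 2: feature permutation importance. Each feature is shuffled in turn on
# held-out data; a large drop in score marks a feature the model relies on.
result = permutation_importance(
    stack, X_test, y_test, n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first

print(f"held-out R^2: {test_r2:.3f}")
print("features ranked by importance:", ranking[:5].tolist())
```

Computing the importance on held-out data rather than training data is a deliberate choice in this sketch: it ranks features by their contribution to generalization rather than to fitting the training set.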