Machine learning advances
Machine learning (ML) is a data-driven approach that trains a regression or classification model on a training data set, typically by fitting a complex nonlinear mapping with adjustable parameters. Several recent river carbon cycle studies have used random forest (RF) algorithms. For example, Maavara et al. (2023) calibrated an RF that extrapolated GPP to almost 100,000 river reaches and lakes within the watershed using available predictor data such as flow, temperature, and canopy cover (Abbe & Montgomery, 1996). Segatto et al. (2021, 2023) also improved metabolic upscaling by incorporating a temporal dimension into their predictions of metabolic regimes: they trained RFs on long-term, sensor-based estimates of GPP and ER in the Ybbs River catchment in Austria, together with catchment physical and climate properties. However, RFs typically require large datasets, and their transferability to systems on which they have not been trained can be problematic. Deep learning (DL) is a further branch of ML, distinguished by neural network architectures with multiple layers of neurons, which give it a greater capacity to represent complex functions than shallow neural networks (Zhang et al., 2021a).
Accurate quantification of carbon emissions from aquatic systems remains constrained by scientific uncertainties and by the high complexity of physical and chemical process linkages, which exhibit non-stationarity, dynamism, and non-linearity. As a result, prediction and forecasting with process-driven methods can be inaccurate; for rivers, water temperature and discharge data currently provide the best opportunities for forecasting, whereas research on near-term biological/chemical predictions has advanced more quickly for lakes (McClure et al., 2021). DL has been suggested as a potential means to overcome uncertainty and nonlinearity in river sciences (Shen, 2018) and is now being applied to hydrologic predictions such as water level and discharge (Xu et al., 2022), regional rainfall-runoff linkages (Zhang et al., 2021a), and water quality dynamics (Zheng et al., 2023). These applications are important because climate change is increasing the need to reduce flood risk. DL also has relevance in aquatic ecosystem prediction, including data mining and identifying outliers (Kim et al., 2022). With respect to water quality data, DL methods have shown potential to predict N and P concentrations from physical data that can be collected more easily with sensors (e.g. pH, turbidity, temperature, DO, conductivity) (Ba-Alawi et al., 2023). Moreover, DL can serve both as an auxiliary tool for process-driven methods, reducing computational loads in uncertainty analyses (Li et al., 2020), and as a component of process-driven models, describing a process that is difficult to characterize mathematically (Huang et al., 2022).
Physical models can now be embedded into DL models to improve performance and mitigate risks by providing important supplementary information (Reichstein et al., 2019; Huang et al., 2022). Physics-informed neural network (PINN) models incorporate the residual of physical principles (e.g. governing equations) as a regularization term in the loss function, so that learning penalizes predictions that violate those principles (Tartakovsky et al., 2020). PINNs are increasingly being applied in areas such as estimating water quantity and quality (Liang et al., 2019). The development of physics-informed surrogate models that link DOM concentrations and other water quality data with river flows could therefore offer the potential to forecast carbon emissions with greater accuracy and with improved consideration of uncertainty propagation.
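As an illustration of how a physics residual can enter a loss function, the following minimal sketch combines a data misfit with a penalty on the residual of a governing equation. Everything specific here is an assumption made for illustration and is not taken from the cited studies: the governing law (first-order decay of a concentration C over travel time, dC/dt = -kC), the rate K, the initial value C0, the function name physics_informed_loss, and the use of a central finite difference in place of the automatic differentiation that PINN implementations normally rely on. Two closed-form candidate functions stand in for a trained neural network.

```python
import math
from typing import Callable, Sequence, Tuple


def physics_informed_loss(
    model: Callable[[float], float],
    t_obs: Sequence[float],
    c_obs: Sequence[float],
    t_col: Sequence[float],
    k: float,
    weight: float = 1.0,
    h: float = 1e-4,
) -> Tuple[float, float, float]:
    """Return (total, data_term, physics_term) for a candidate model C(t)."""
    # Data misfit: mean squared error at the observation times.
    data_term = sum((model(t) - c) ** 2 for t, c in zip(t_obs, c_obs)) / len(t_obs)

    # Residual of the ASSUMED law dC/dt + k*C = 0, evaluated at collocation
    # times with a central finite difference (a stand-in for autodiff).
    def residual(t: float) -> float:
        dc_dt = (model(t + h) - model(t - h)) / (2.0 * h)
        return dc_dt + k * model(t)

    physics_term = sum(residual(t) ** 2 for t in t_col) / len(t_col)
    return data_term + weight * physics_term, data_term, physics_term


# Hypothetical setup: decay rate, initial concentration, two observations.
K, C0 = 0.5, 10.0
t_obs = [0.0, 1.0]
c_obs = [C0 * math.exp(-K * t) for t in t_obs]
t_col = [0.0, 0.5, 1.0, 1.5, 2.0]  # includes times with no observations


def exponential(t: float) -> float:
    """Candidate that obeys the assumed decay law."""
    return C0 * math.exp(-K * t)


def straight_line(t: float) -> float:
    """Candidate that passes through both observations but ignores the law."""
    return c_obs[0] + (c_obs[1] - c_obs[0]) * t


loss_exp = physics_informed_loss(exponential, t_obs, c_obs, t_col, K)
loss_line = physics_informed_loss(straight_line, t_obs, c_obs, t_col, K)
print("exponential:", loss_exp)
print("straight line:", loss_line)
```

Both candidates reproduce the two observations, so the data term alone cannot distinguish them; only the physics term penalizes the straight line, including at collocation times where no data exist. The weight argument controls how strongly the governing equation constrains the fit relative to the observations.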
Transfer learning (TL) developments offer additional potential for DL applications in water resource science and management. TL takes knowledge gained from a previous task and applies it to a new task (Pan & Yang, 2010). The previous task usually involves an ML model trained effectively on large datasets, whereas the new tasks are related to the previous task but have smaller datasets. TL methods in hydrology have focused mainly on data interpolation and prediction in areas where observed data are missing or unavailable. For example, Willard et al. (2021) showed how lake water temperature can be predicted in areas without monitoring, and Zhou (2020) developed real-time predictions of river water quality for situations where data were missing (e.g. broken sensors). Applications to river carbon cycle understanding and management could include learning between catchments that differ in data availability (e.g. Figure 1), enabling knowledge gained from the better-studied catchment(s) to advance understanding of the less-studied system(s).
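The idea of learning between a data-rich and a data-poor catchment can be sketched with a deliberately simple toy model. The example below is an assumption-laden illustration, not a method from the cited studies: the two synthetic "catchments", the linear response y = w*x + b, the shared slope with a shifted intercept, and the helper names fit_linear and mse are all hypothetical. A model is pretrained on the data-rich source, then only its intercept is fine-tuned on a single target observation; this is compared with training both parameters from scratch on that same observation.

```python
from typing import Sequence, Tuple


def fit_linear(
    xs: Sequence[float],
    ys: Sequence[float],
    w: float = 0.0,
    b: float = 0.0,
    lr: float = 0.1,
    steps: int = 200,
    train_w: bool = True,
) -> Tuple[float, float]:
    """Gradient descent on mean squared error for y = w*x + b."""
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        if train_w:  # the slope is frozen during fine-tuning
            w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(xs: Sequence[float], ys: Sequence[float], w: float, b: float) -> float:
    """Mean squared error of the linear model on (xs, ys)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic, data-rich source catchment (ASSUMED response: y = 2x + 1).
x_src = [i / 50 for i in range(50)]
y_src = [2.0 * x + 1.0 for x in x_src]

# Synthetic, data-poor target catchment: same slope, shifted intercept
# (ASSUMED response: y = 2x + 3), with only one observation available.
x_tgt = [0.5]
y_tgt = [2.0 * 0.5 + 3.0]

# Held-out target values used only to evaluate generalization.
x_eval = x_src
y_eval = [2.0 * x + 3.0 for x in x_eval]

# 1) Pretrain on the source catchment.
w_src, b_src = fit_linear(x_src, y_src, lr=0.5, steps=2000)

# 2) Transfer: keep the learned slope, fine-tune the intercept on the target.
w_tl, b_tl = fit_linear(x_tgt, y_tgt, w=w_src, b=b_src, train_w=False)

# 3) Baseline: train both parameters from scratch on the single observation.
w_sc, b_sc = fit_linear(x_tgt, y_tgt)

mse_transfer = mse(x_eval, y_eval, w_tl, b_tl)
mse_scratch = mse(x_eval, y_eval, w_sc, b_sc)
print("transfer:", mse_transfer, "scratch:", mse_scratch)
```

Both models fit the single target observation, but one observation cannot determine two parameters, so the from-scratch model generalizes poorly across the target range. The transferred model does better only because the assumed slope is genuinely shared between the two systems; where that assumption fails, the transferred knowledge could mislead rather than help.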
Despite numerous successful DL applications in aquatic sciences, challenges and risks remain in applying these approaches to aquatic carbon management. First, an overarching issue for all ML applications is the potential for sensor and data processing security breaches (Richards et al., 2023), which pose risks for water security. A second issue concerns detection: the accuracy of DL methods relies on the quantity of observational data, and insufficient data may prevent DL from achieving satisfactory precision (Cao et al., 2022). Even in developed countries with well-established infrastructure, the cost of obtaining a substantial volume of high-precision environmental monitoring data, such as that needed for river carbon cycle estimation, could hinder the application of DL in some locations (Richards et al., 2023). Moreover, in developing countries even water quality monitoring networks are often limited by financial resources and technical capabilities and so must prioritize resource allocation. Third, DL methods work well only when training and test data are drawn from the same feature space and distribution (Pan & Yang, 2010). This implies that DL methods must be designed and tailored for their specific context. Because of factors such as geometry and land cover, aquatic systems often differ between watersheds, meaning that models developed in other study areas can lead to prediction errors and risks for decision-making. However, DL models that incorporate explicit mechanisms into the training process are beginning to emerge to overcome these issues, offering strong potential to further advance our understanding of river carbon cycling and emissions.