Figure 5. Boxplots of mean KGE for the evaluation of multiple variables with different calibration strategies. (I) Evaluation for the period of calibration (2009–2012); (II) Evaluation for a different period than calibration (2006–2008 for Q, A, TWS, ET; 2013–2014 for h and W). “Initial guess” refers to model runs with the a priori parameter sets. (a) Single-variable (discharge, water level, flood extent, TWS, vegetation ET, soil moisture) and (b) multi-variable calibration (all except discharge, water level + soil moisture). The spread of the values in the boxplots stems from 300 model runs (100 for each of three calibration experiments). Numbers next to the boxplots represent Skill Score (%). Colors refer to classes of Skill Score. Note that the KGE scales are different for each variable. Asterisks refer to cases when the evaluation period resulted in a different performance than the calibration period (i.e., positive Skill Score in calibration followed by negative Skill Score in evaluation, or vice-versa). Note also that Skill Score values are computed based on mean values, while the boxplots depict median values.

How does RS-based model calibration improve the water cycle representation?

When performing a single-variable calibration, the performance of the variable itself always improves, which is evidenced by the positive values in the main diagonal (Figure 5-I-a, for calibration period, and Figure 5-II-a, for evaluation period). Calibration with water level was also able to improve estimates of flood extent, TWS, ET and soil moisture (cal period), and all variables (eval period). Calibration with flood extent improved water level, TWS, ET and soil moisture. Calibration with TWS improved all variables. Calibration with ET was able to improve discharge and flood extent. Calibration with soil moisture improved all variables but ET. Results for calibration and evaluation periods agree (i.e., improvement (positive Skill Score) or deterioration (negative Skill Score) for both cal and eval) in 43 out of the 48 cases (89.6%). In the five remaining cases (10.4%), results between calibration and evaluation periods differ: three of them are in the evaluation with TWS, and two of them are in the discharge evaluation (calibration with water level and flood extent).
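The agreement check described above can be sketched in a few lines. This is only an illustration: the Skill Score definition is not restated in this section, so the function below assumes the common form in which the score is the fraction of the gap between the initial-guess KGE and a perfect KGE of 1 that calibration closes. The function names and the four example values are hypothetical and are not results from this study.

```python
from typing import Sequence


def skill_score(kge_calibrated: float, kge_initial: float,
                kge_perfect: float = 1.0) -> float:
    """Assumed definition: share of the gap to a perfect KGE closed by calibration."""
    return (kge_calibrated - kge_initial) / (kge_perfect - kge_initial)


def agreement_rate(cal_scores: Sequence[float],
                   eval_scores: Sequence[float]) -> float:
    """Share of cases whose calibration and evaluation scores have the same sign.

    A score of exactly zero is treated as "not an improvement".
    """
    same_sign = sum((c > 0) == (e > 0) for c, e in zip(cal_scores, eval_scores))
    return same_sign / len(cal_scores)


# Hypothetical skill scores for four cases (illustrative only).
cal = [0.17, 0.08, -0.05, 0.30]
ev = [0.26, 0.13, 0.04, 0.22]
print(f"{100 * agreement_rate(cal, ev):.1f}% of the example cases agree")

# The percentage quoted in the text follows directly from the counts.
print(f"{100 * 43 / 48:.1f}% agreement for 43 of 48 cases")
```

Comparing signs rather than magnitudes mirrors the asterisk convention in Figure 5, where a case is flagged only when improvement in one period turns into deterioration in the other.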
In an ideal modeling scenario, calibration with any variable should improve the performance of all other variables. However, this did not happen in our experiments, which can be due to uncertainties in model structure, in parameterization, or in the observations. Previous studies have also found significant advantages in using RS-based model calibration to identify structural model issues (e.g., Werth et al., 2009; Willem Vervoort et al., 2014; Winsemius et al., 2008), detect uncertainties in input data (e.g., Milzow et al., 2011), identify deficiencies in model parameterization (e.g., Franks et al., 1998; Koppa et al., 2019), or increase model reliability (e.g., Koch et al., 2018; Manfreda et al., 2018).
According to Figure 4b and supporting information (Figure S1), calibration with discharge improved estimates of almost all variables. However, calibration with discharge deteriorated the performance for vegetation ET time series. Vegetation ET estimated by MOD16 varies by at most 30 mm/month. MGB calibration with discharge led to ET variations of 100 mm/month, with ET falling to around 30 mm/month in the driest periods, whereas MOD16 estimates do not drop below 100 mm/month in these periods (time series in Figure 4b). Moreover, not even the seasonality of the MGB and MOD16 time series agrees. This could be due to relatively high uncertainties in vegetation ET estimates from MOD16 for the Amazon basin (around 23 mm/month, according to Gomis-Cebolla et al., 2019). Nonetheless, it could also be related to model structural and/or parameter deficiencies, in which case the model might be “right for the wrong reasons”. In order to identify the source of this ET inconsistency, we compared MOD16 and MGB results to in-situ measurements of ET in the Purus River Basin, provided by Gomis-Cebolla et al. (2019) and Maeda et al. (2017). We found a much stronger agreement, both in seasonality and in amplitude, of in-situ observations with MOD16 observations than with MGB model output. Hasler & Avissar (2007) and Pan et al. (2020) have already warned about the overestimation of dry season water stress in hydrological models, probably related to the misrepresentation of soil water availability for plants. This was also found by Maeda et al. (2017), who highlighted that ET was not water-limited because of the plants’ access to deep soil water, which had been previously documented by Nepstad et al. (1994). They found that, in the Southern Amazon ecotone, deep root water uptake plays a key role in maintaining ecosystem productivity during the dry season. The MGB model is probably misrepresenting these processes, which would have remained unknown if the model had only been compared to discharge time series.
Even though calibration with discharge observations was not able to accurately estimate ET, calibration with the remaining variables (except for soil moisture) was able to improve ET estimates. For instance, in Figure 3b, ET and water level presented low correlation (r = 0.08), but calibration with water level improved ET estimates by S = 16.9% (cal period) and S = 25.6% (eval period). Conversely, in Figure 3b, ET and TWS presented a higher correlation (r = 0.47), but calibration with TWS improved ET estimates by only S = 7.9% (cal period) and S = 13.1% (eval period).
In general, calibration with TWS had little influence on any of the variables. In spite of some improvements, skill scores were usually low. Conversely, TWS estimates were relatively easily improved by calibration with any variable (except for ET, for cal period; or discharge, for eval period). These results for TWS contrast with previous work from Lo et al. (2010), Nijzink et al. (2018), Rakovec et al. (2016), Schumacher et al. (2018), and Werth & Güntner (2010), which highlighted the value of GRACE data when incorporated into hydrological modeling. This may be due to the strong seasonality of the Purus River Basin, where TWS adds little information and high correlation values bias the calibration. Even for the initial guess (uncalibrated) setup, TWS performances were already very good: KGE values were around 0.8, while for all other variables, except for ET (for which KGE values were negative), KGE values were around 0.3 for the uncalibrated setup.
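The effect of strong seasonality on KGE can be illustrated with a small sketch. It assumes the original Gupta et al. (2009) formulation, KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2), where r is the linear correlation, alpha the ratio of standard deviations, and beta the ratio of means; the KGE variant used in this study is not stated in this section. The monthly series are synthetic and purely illustrative.

```python
import math
from statistics import mean, pstdev


def pearson_r(sim, obs):
    """Linear correlation between simulated and observed series."""
    ms, mo = mean(sim), mean(obs)
    cov = sum((s - ms) * (o - mo) for s, o in zip(sim, obs)) / len(obs)
    return cov / (pstdev(sim) * pstdev(obs))


def kge(sim, obs):
    """Kling-Gupta efficiency, assuming the Gupta et al. (2009) formulation."""
    r = pearson_r(sim, obs)
    alpha = pstdev(sim) / pstdev(obs)  # variability ratio
    beta = mean(sim) / mean(obs)       # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Synthetic monthly series: a dominant annual cycle plus a small
# non-seasonal component that only the "observations" contain.
months = range(48)
obs = [100 + 50 * math.sin(2 * math.pi * m / 12)
       + 5 * math.cos(2 * math.pi * m / 7) for m in months]
seasonal_only = [100 + 50 * math.sin(2 * math.pi * m / 12) for m in months]

print(f"KGE of a purely seasonal simulation: {kge(seasonal_only, obs):.3f}")
```

In this synthetic case, a simulation that reproduces only the annual cycle already scores close to 1, so the metric leaves little room for calibration to discriminate between parameter sets.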
Flood extent and water level performances were improved by calibration with discharge, water level and flood extent, but these calibrations had little effect on ET (which was actually degraded by discharge calibration) and soil moisture. This is probably because water level and flood extent are related to river transport processes (e.g., flood routing and floodplain storage), while ET and soil moisture are more related to vertical hydrological processes (e.g., soil water balance). This highlights the complementarity between variables that relate to different processes.
Calibration with soil moisture improves the performance of all variables (water level to a lesser extent), except for ET. Consistently, calibration with any variable (except ET) is able to improve soil moisture to some extent.

What is the added value of complementary RS observations?

By calibrating with all variables together except Q (Figure 5b), we found improvements for almost all variables, with the most significant improvements for flood extent (S = 25% for cal and eval periods) and ET (S = 20% for cal and eval periods). For discharge, performance for the evaluation period was improved (S = 17.4%), which is important for estimating discharge in poorly gauged basins. However, for the calibration period, the Skill Score for discharge performance was S = 1.7%, which might reflect some limitations in retrieving discharge based on the calibration of the RS-derived variables, as discussed previously.
Therefore, we chose a specific arrangement of two complementary variables in order to check whether this calibration setup might lead to better retrievals for discharge and the other variables. The chosen variables were soil moisture and water level, because of their complementarity. Based on the Skill Score values in Figure 5-I, calibration with water level alone improves all variables except discharge (and soil moisture to a lesser extent), while calibration with soil moisture alone improves all variables except ET (and water level to a lesser extent).
The calibration arrangement of water level and soil moisture led to improvements not only in soil moisture and water level themselves, but also in all other variables (ET to a lesser extent). For instance, flood extent was improved by S = 52.6% and S = 34.1% (cal and eval period, respectively). Discharge was improved by S = 59.9%, with a resulting mean KGE = 0.70 for the calibration period (S = 45.0% and mean KGE = 0.35 for the evaluation period). These results agree with previous studies that found an improvement in model performance by multi-variable calibration of soil moisture and evapotranspiration (e.g., Koppa et al., 2019; López et al., 2017), discharge and evapotranspiration (e.g., Herman et al., 2018; Pan et al., 2018; Poméon et al., 2018), discharge and soil moisture (e.g., Li et al., 2018; Rajib et al., 2016), discharge and TWS (e.g., Rakovec et al., 2016; Schumacher et al., 2018; Werth & Güntner, 2010), and discharge and water level (e.g., Kittel et al., 2018; Schneider et al., 2017; W. Sun et al., 2012). However, it is difficult to compare this study to previous ones, because most of them used discharge observations as constraints. In this study, we avoided the use of discharge observations for multi-variable calibration, in order to analyze the applicability of the RS-based calibration method for poorly-gauged regions.
Calibration with water level and soil moisture had little influence on ET performance, because the model setup does not represent deep root water uptake during the dry season in this watershed, as discussed previously.
By comparing the two frameworks for multi-variable calibration (all except Q versus h+W calibration), we found that calibration with all variables except Q is useful to some extent, but deliberately selecting complementary variables for model calibration resulted in the best overall performance.

Are we getting the right results for the right sets of parameters?

When analyzing the dispersions of parameters before and after calibration with each variable (Figure 6 for a few selected parameters, Supporting Information Figure S2 for all calibrated parameters), it can be observed that the range of parameters varies largely depending on the calibration variable. For instance, Wm is a conceptual soil parameter related to maximum water storage in the soil. In the calibration based on single variables (except ET) it converged to low values (300 mm), while in the calibration with ET it reached high values (2000 mm). This probably occurred in order to compensate, by overparameterization, for a structural error in the model, i.e., the model's inability to represent deep root water uptake in the dry season. These trade-offs between model parameters during calibration have also been reported and discussed by Koppa et al. (2019).
The surface resistance parameter also resulted in a wide range of values depending on the calibration target variable. When calibrated with water level, flood extent, or ‘all except Q’ experiments, it reached median values higher than 150 s/m, but calibration with h+W led to median values lower than 50 s/m. Surface resistance is a vegetation parameter directly related to ET dynamics, so it is important to note that calibration with ET was able to reduce the dispersion of this parameter, reaching a median value of about 80 s/m (similar to calibration with Q and W).
Another interesting result relates to the channel Manning’s coefficient, which presented different values for each calibration experiment. This agrees with previous findings that the Manning parameter is often used as an effective parameter that compensates for neglected hydrodynamic processes such as localized channel head losses, poor cross section representation, or non-represented 2D processes (Neal et al., 2015).
Many previous studies have highlighted the use of multi-variable calibration to narrow the parameters’ search space (Nijzink et al., 2018; W. Sun et al., 2018), but this was not observed in our results. Based on the limited multi-variable calibration experiments performed here (‘all except Q’ and h+W), no narrowing in the parameters’ search space was found. For most parameters (except for Wm), calibration with ‘all except Q’ and h+W resulted in a wide range of values. This can be due to differing convergence sets of parameters between each of the triplicate runs. A more robust experiment comparing more multi-variable calibration strategies (e.g., Q + different RS-based variables) might provide a better understanding of this topic.
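One simple way to quantify whether a calibration narrows a parameter's search space is to compare the interquartile range (IQR) of the parameter across runs before and after calibration. The sketch below is only an illustration of that idea and is not the analysis performed in this study; the function names and the Wm values (in mm) are hypothetical.

```python
from statistics import quantiles


def interquartile_range(values):
    """Spread between the first and third quartiles of a parameter sample."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def narrowing_ratio(prior_values, calibrated_values):
    """IQR after calibration divided by IQR before; below 1 means narrowing."""
    return interquartile_range(calibrated_values) / interquartile_range(prior_values)


# Hypothetical Wm samples in mm (illustrative only, not values from the study).
wm_prior = [200, 500, 800, 1100, 1400, 1700, 2000]
wm_calibrated = [250, 280, 300, 310, 330, 350, 400]

ratio = narrowing_ratio(wm_prior, wm_calibrated)
print(f"IQR ratio (calibrated / prior): {ratio:.2f}")
```

A ratio well below 1 would indicate that the calibrated runs converge to a narrower range, whereas a ratio near or above 1 would correspond to the wide post-calibration ranges reported for most parameters here.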