Are remote sensing evapotranspiration models reliable across South American climates and ecosystems?

Many remote sensing-based evapotranspiration (RSBET) algorithms have been proposed in the past decades and evaluated using flux tower data, mainly over North America and Europe. Model evaluation across South America has been done locally or using only a single algorithm at a time. Here, we provide the first evaluation of multiple RSBET models, at a daily scale, across a wide variety of biomes, climate zones, and land uses in South America. We used meteorological data from 25 flux towers to force four remote sensing based ET models: Priestley & Taylor Jet Propulsion Laboratory (PT-JPL), Global Land Evaporation Amsterdam Model (GLEAM), Penman-Monteith Mu model (PM-MOD), and Penman-Monteith Nagler model (PM-VI). ET was predicted satisfactorily by all four models, with correlations consistently higher (R2>0.6) for GLEAM and PT-JPL, and PM-MOD and PM-VI presenting overall better responses in terms of PBIAS (-10

tification of their sources of uncertainty [Fisher et al., 2017;Zhang et al., 2017;Paca et al., 145 2019]. Indeed, despite the recent developments of remote sensing methods, there are still 146 challenges concerning the refinement of those algorithms to remedy the lack of information 147 on specific surface characteristics and fluxes of undersampled climate zones and vegetation 148 types, such as fractional vegetation cover and net radiation, which are a substantial source 149 of uncertainty in global satellite-based estimates [Ferguson et al., 2010;Vinukollu et al., 150 2011;Badgley et al., 2015]. 151 Here, we evaluated the predictive skills of four satellite-based models, designed 152 for regional and continental scale applications, over South America. The main question we 153 seek to answer is whether such models can be applied consistently to reliably capture in The study area encompasses five biomes (  [Olson et al., 2001]. 164 We used daily meteorological data from 25 flux tower sites located across various 165 South American biomes and land cover types to verify the predictive skill of the selected 166 RSBET models ( Figure 1a, Table S3 in SM). The time period considered for analysis was 167 determined by the available time-series for each site ( Figure S1 in SM). Further information 168 about each biome is provided in SM. Ten sites are from FLUXNET [Pastorello et al., 2020], 169 AmeriFlux networks [Novick et al., 2018] and Large-Scale Biosphere-Atmosphere Exper-170 iment in the Amazon (LBA) project [Saleska et al., 2013], while the remaining data were 171 obtained from the respective principal investigators. The spatial patterns of mean annual pre-172 cipitation ( ), air temperature ( ), and potential evapotranspiration ( ) show that selected 173 sites encompass a wide variety of climates ( Figure 1b). 174 As we are interested in assessing models, instead of using the EC-measured latent 189 heat flux, , to represent , we derived from the other energy balance fluxes, i.e. Wilson et al., 2002;Stoy et al., 2013;Fisher et al., 191 2020], where is the net radiation, is the soil heat flux, and is the sensible heat flux.

192
The closure of the energy budget is rarely observed with flux tower measurements [Wilson 193 et al., 2002;Foken, 2008]. Usually, the available energy ( − ) is greater than ( + ).

194
The imbalances in the surface energy budget, reported here as an energy balance ratio, EBR 195 (i.e. ( + )/( − )), range from 0.73 to 1.16 (mean ∼0.90) ( daytime records and twenty during the night. As in Mu et al. [2011], the shortwave incoming 202 radiation ( ↓) was used to distinguish between daytime ( ↓>10 W m −2 ) and nighttime 203 ( ↓ < 10 W m −2 ). Regarding the fluxes, we used quality checked raw data that had not 204 been gap-filled. 205 The quality control procedure described above was not adopted for the SDF, TF1, and 206 TF2 towers (see Figure 1a). At those sites, horizontal advection plays an important role due 207 to extreme weather variations throughout the year [Levy et al., 2020], such that the energy Tundra (Td) [Peel et al., 2007]. (b) Gridded annual average (AVG) and standard deviation (SD) for temperature ( ), rainfall ( ), and potential evapotranspiration ( ) across South America and the monitored sites [Harris et al., 2020]. balance closure cannot be diagnosed by EBR, as described above. For instance, the SDF 209 zone is known as an anticyclone pathway between the Pacific and Atlantic oceans, and TF1 210 and TF2 are located in the extreme southern parts of Patagonia, a region characterized by 211 strong winds. Thus, for those sites, we used derived from measured LE. in PT-JPL, is obtained from total fractional vegetation cover, whereas in PM-MOD the 227 1-km MODIS (MOD15) product is adopted. The original procedures to obtain those 228 variables were not changed here. The following treatment was applied to the MODIS-derived 229 data. "Good quality" pixels were selected, based on the quality assurance (QA) flags. Next, 230 an autoregressive model was applied to fill in the gaps [Akaike, 1969]. Finally, we imple-231 mented a temporal filter to improve the and time series to reproduce precisely 232 all pre-processing steps of the standard PM-MOD algorithm [Mu et al., 2011]. whereas for open water evaporation no stress factor is applied. For the tall vegetation cover 252 fraction, rainfall interception loss is estimated with Gash's analytical model [Gash, 1979;253

272
The MOD16 ET model (PM-MOD) is based on the Penman-Monteith equation to pro-273 duce a daily global ET product summing up daytime and nighttime ET [Mu et al., 2011]. In 274 this model, total is partitioned into , , and . To compute , PM-MOD 275 uses potential soil evaporation and a soil moisture constraint function based on and air 276 relative humidity ( ) [Fisher et al., 2008]. The evaporation of the water intercepted by the 277 canopy, , is calculated using the relevant equations from a revised version of the Biome-

278
BGC model [Thornton, 1998]. The PM-MOD assumes that occurs when the vegetation 279 is covered with water, i.e. when the water cover fraction ( ) > 0, which is constrained by 280 [Mu et al., 2011]. In the PM-MOD model is calculated as in the PT-JPL model:

282
The PM-MOD model is designed to allow to occur during daytime and nighttime, 283 by adding constraints to stomatal conductance for and minimum temperature, and ig-284 noring constraints relating to high air temperature [Running et al., 2019].

294
This model relies upon the hypothesis that is mostly controlled by specific domi-295 nant processes, such as transpiration and photosynthesis, hence a good correlation between 296 such processes and is necessary for good model performance [Nagler et al., 2007]. There 297 are several formulations to estimate from VIs [Nagler et al., 2005[Nagler et al., , 2009 shown in the main text were obtained using data from days that were common across models, 357 resulting in 4718 data points.

358
To illustrate the relative contribution of each site to the scatter plots in Figure 3, we

364
Nevertheless there is some spread for a few sites, e.g., in the PT-JPL scatter plot that displays 365 a few sites with large bias despite strong overall correlation and .

366
The models slightly overestimate as suggested by higher density of points below      The central panels in Figure 4 provide evidence for the high variability of model pre-   Figures S4-S7 in the SM. In Figure 5 show that variation is similar among models, except for PT-JPL which presents the 430 lowest (e.g., K67). Figure 5 shows that for PM-VI varies around zero across 431 sites, which is expected given the model requires calibration with local data. However, based 432 on 2 , it is apparent that this model's skill is quite limited for > ∼1.2. In general, the PT-433 based models showed larger biases, with PT-JPL and GLEAM consistently overestimating 434 and underestimating , respectively. In terms of 2 , the PT-models ranked better than the 435 PM-models for more than ∼50% of the towers.   land or riparian ecosystems [e.g., Nagler et al., 2005Nagler et al., , 2009Nagler et al., , 2013Jarchow et al., 2017].

486
Here, we tested PM-VI for a much wider variety of biomes, climates and land uses, and 487 found a poor predictive skill for several sites with > 1.2 (e.g K67, K77, K83) or < 0.8 488 (e.g., CAA and SLU), even though the model accounts for a site-specific calibration. indices (see Section 3.0 in the SM for more details). Although we did not verify this in our 501 study, we do not dismiss the possibility that known uncertainties in the estimation of site-502 specific vegetation characteristics (e.g., and leaf stomatal conductance in the PM-503 MOD; Ershadi et al. [2014]) are also causes of lower model performance. In our study, we 504 used soil heat flux ( ) which is generally measured below ground (usually at 5-20 cm deep) 505 using soil heat flux plates. It could be argued that not correcting for the heat storage be-506 tween the plate and the soil surface could lead to sub-optimal estimates of when is 507 calculated as the residual of the energy balance, especially for towers where the soil is bare 508 or covered by sparse vegetation, where can be relatively large. This, in turn, could lead to 509 the conclusion that the models are performing worse than is actually the case. Although de-510 sirable, correcting G for heat storage is rarely possible due to data unavailability (few sites 511 only measure soil moisture and temperature, which are required to estimate soil heat capac-512 ity, and heat storage using the calorimetric method). Moreover, at daily scales and for most 513 sites, G is either negligible in SA (summer or winter, when the amount of heat stored during 514 the day roughly equals that lost during the night) or represents a minor portion only (spring 515 and autumn) of the energy balance. As detailed and discussed in Section S3.0 and Figure S8 516 in SM , it is highly unlikely that neglecting such corrections will have affected the results.

517
There are, however, some issues worth mentioning here. Cause (vi), for instance, is a 518 major issue for PM-VI, as expected because the model is highly dependent on VI dynamics 519 (see Section 2.4) [Nagler et al., 2005]. Regarding cause (iv), the superior performance of 520 the PT models over PM-MOD at most sites is probably linked to uncertainties resulting from 521 the estimation of aerodynamic resistance [Ershadi et al., 2014]. In PM-MOD, the aerody- reported [Ershadi et al., 2015]. Conversely, PT models are highly dependent on (causes 528 iv and v); hence they often fail in dry environments (see metrics for < ∼0.6 in Figure   529 5) where seasonality is dictated by and not radiation, or in regions with low (e.g.,

530
TF2). Poor model responses at K77 (cropland, Figure S10 in SM) were attributed to causes 531 (i) and (ii), as remnants of forest and shrubs were identified within the tower footprint and 532 within MODIS pixel. VI products with higher resolution than MODIS exist and have been 533 used to estimate ET [e.g., Aragon et al., 2018;Fisher et al., 2020]; thus offering a possible 534 solution for causes (i) and (ii). Time lag between ET and EVI (cause iii) was identified at 535 EUC, where EVI followed the decline of ET after ∼1-2 months of interval.

536
Remote sensing based partitioning is expected to present some divergences from 537 ground based measurements. This is the case especially for , because of the difficulty to 538 obtain remote sensing information on soil characteristics that drive , such as soil tem-539 perature and moisture [Talsma et al., 2018a,b], in particular at high vegetation cover frac-540 tions. Globally, transpiration has been reported to account for 57-90% of global , based 541 on in situ data and model outputs [Jasechko et al., 2013;Wei et al., 2017 Ground-based partitioning data are generally not widely available; this also goes 561 for most land cover types included in this study. We compared the models' outputs with field 562 experiment studies that measured one or more components either at the same sites as 563 those used here or within the same region (Table 1). partitioning values derived from 564 GLEAM seem to be more consistent with ground-based information available for tropical 565 rainforests, croplands and grasslands than for wetlands, and mixed and deciduous needle-566 leaf forests (Table 1). As for PT-JPL; its partitioning fits reasonably well with observa-  Despite our efforts to gather as much tower data as possible, with the goal of having a 585 common data set for all models, we faced several limitations including: differences in lengths 586 of observational time series across towers (up to 3 years), as well as lack in overlap of these 587 time series; uneven distribution of towers across groups (e.g., biomes); and, finally, South

588
American geographical features that were not considered in this study (e.g., MGS biome 589 or desert climate type, BWk). Thus, it was not possible to assess, for all towers, model re-    The data used in this study will be available through a data-sharing repository. Funding

S1.0 Study area -Biomes
The tropical & subtropical moist broadleaf forests, i.e. TSMBF, cover approximately 50% of the South American territory. This biome is mainly characterized by a warm climate and high rainfall rates. ET in this region is responsible for generating ∼30% of the atmospheric moisture that precipitates in the Amazon basin [Eltahir and Bras, 1994]. The dry seasonal forests, i.e. TSDBF, are located mostly in the center (Bolivia) and east (Brazilian semi-arid region) of South America; here mean rainfall variability is temporally and spatially high, with pronounced dry seasons that can extend up to 10 months. The biome that encompasses grasslands, savannas, and shrublands, i.e. TSGSS is the second largest biome in South America. This biome has a high concentration of endemic species whose habitats have experienced exceptional losses, therefore many areas of this biome are part of the global biodiversity hotspots for conservation [Myers et al., 2000]. The wetlands, i.e. FBS, host a vast diversity of aquatic and palustrine vegetation [Junk et al., 2006a,b] and provide crucial ecosystem services to the Pantanal (the largest FGS's ecoregion) [Costanza et al., 1997]. FBS are completely surrounded by grasslands, savannas & shrublands, and dry forests. The temperate forests, i.e. TMBF, are located in southern Chile and Argentina. This region contains the Central Chile biodiversity hotspot [Myers et al., 2000]. Mean annual temperature decreases from north to south and from low to high altitudes in the Andes. Within this biome, peatland ecosystems cover most of the area (440,000 km 2 , Arroyo et al. [2005]) in the southernmost part of South America and along the Pacific coast of Chile.

S2.0 Gap filling of meteorological data
Relative humidity ( ) and vapor pressure ( ) records are the most common missing variables among the tower data. Missing values were filled using and saturated air vapor pressure ( ) calculated from Tetens' equation [Tetens, 1930]. Missing values were filled accordingly, given that and temperature ( ) are available at that day or 30min interval. Days without records were discarded. Missing atmospheric pressure ( ) values were estimated from ground elevation at the tower sites using equation 7 from the FAO-56 manual [Allen et al., 1998].

S3.0 Results and Discussion
partitioning Greater insight into model-based partitioning can be gained by comparing our results with other estimates from the literature. Previous model-based estimates using the Gash model [Gash et al., 1995;Valente et al., 1997] for USR (sugar cane) and PDG (woodland savanna) showed that accounts for ∼10% of [Cabral et al., 2012[Cabral et al., , 2015. For USR, all three models limited / < 10%; with PM-MOD offering a mean estimate (∼9%) closest to the estimates cited above, and GLEAM presenting the lowest value of / (<2%). While GLEAM estimated / ≈ 5%, PM-MOD and PT-JPL estimates (20 and 30%, respectively) agreed better with the 20-40% reported by [Denmead et al., 1997] for two sugar cane fields in Australia. For the EUC site, interception estimates from GLEAM (∼10%) are much closer to the / ≈ 13% reported by Cabral et al. [2010], when com-  Table 1 in the main text). However, / > 75% for the Brazilian semi-arid areas are reasonable during the rainy season, when about 70% of the annual occurs [Mutti et al., 2019;Marques et al., 2020].
All models found practically negligible values of (<2% of ) in three Brazilian  [Zhou et al., 2016;Wei et al., 2017;Stoy et al., 2019]. Despite the wide variability of / among models, the overall predictive skill was satisfactory, thus not associated with their capability to correctly estimate each component individually.

Impact of correcting
The impact of the correction of in the energy balance, thus in the estimation of the observed (= − − ), is addressed here. Most Fluxnet-type datasets do not offer soil moisture and soil temperature data, hence correction of G is possible only for a reduced number of validation tower sites. At daily scales, in particular under densely vegetated areas, G is often negligible. The grass sites included in this study lack the data required for correcting . Therefore, we select the GRO tower (soybean) to assess the role of in the model metrics. In Figure S8, we show the metrics when tower ET is calculated using G directly measured by soil heat flux plates below the surface (raw G, black dots) and when the corrected G is used (G surface , red dots), i.e. when heat storage above the plate has been taken into account. As shown in Fig S9, no significant change was noted in 2 , or ; and in most cases the correction caused a deterioration in the metrics. The largest changes were: • GLEAM: 2 decreased from 0.79 to 0.74; • PM-MOD: PBIAS increased from -4.59 to 2.07% • PT-JPL: RMSE increased from 0.83 to 0.9 mm d −1

Diagnosis
Some factors, e.g. those relating to tower location and the surrounding environment, can also affect model performances. A likely cause for model-observation mismatch, for instance at site K77, could be the heterogeneity of the fluxes measured by the towers [Bai et al., 2015]. Previous studies have shown that land cover heterogeneity may induce some variations in the footprint [Chen et al., 2009]. In the case of the K77 site, the footprint is es- Aside from patch-scale heterogeneities, there is also the possibility that the 250-m MODIS pixel is returning a mixed response of different types of vegetation within it (subpixel spatial heterogeneity), which has been shown to compromise model performance [Fisher et al., 2008;Mu et al., 2011;Nagler et al., 2005;McCabe et al., 2016;Fisher et al., 2020].
"Mixed" pixels of croplands have been shown to be problematic, especially when the VI adopted is [e.g., Wardlow et al., 2007]. In this regard, might present a certain advantage as it has been shown to better capture the changes in the green biomass level, making it more sensitive to sudden changes among crops in the senescence phase, whereas would be more suitable for mapping crops at the peak phase [Embry and Nothnagel, 1994;Wardlow et al., 2007].
Therefore, the mismatch between and for K77 site are probably linked to the VI adopted here. An alternative for future work would be to use even higher spatial resolution VNIR/NDVI data than the 250 m MODIS data. Examples include 10-m Sentinel-2 or 3-m Planet . For instance, Aragon et al. [2018] used 3-m Planet NDVI to produce ultra high resolution using PT-JPL. Another promising way forward is the 70-m ECOSTRESS data (e.g., Figure S11), which enables a closer alignment to eddy covariance footprints [Fisher et al., 2020]. The land cover representativity issue is even worse in the case of GLEAM that uses microwave data for vegetation phenology, a much coarser product. As for the EUC site (∼500-m footprint), the plantation border is at a sufficient distance from the tower (∼2 km), compared to the surface heterogeneities encountered for the other sites, and occupies multiple MODIS pixels. Hence none of the problems mentioned above are reasonable explanations for the model overprediction at that site in the dry period. Moreover, data from MODIS captured the vegetation variability during transition in terms of amplitude but we noticed a lag between and ( Figure S9). A similar pattern was reported by [Fisher et al., 2008] while validating PT-JPL against for a savanna tower.
In the case of PT-JPL model estimates at the EUC site, the model's higher sensitivity to than to seems to be a plausible explanation for the model performance ( Figure 5; Figure S9). In the PT-JPL model, affects indirectly, through the soil moisture constraint and interception loss. Conversely, influence of is indirect: a fraction of it is used to compute all three components and it affects the temperature which, in turn, is used to calculate the plant temperature constraint [Fisher et al., 2008]. Moreover, has a direct effect on as it controls the potential in the PT equation. Both the soil moisture and the plant temperature constraints are positively correlated with modelled and they have been shown to be the most sensitive parameters in the PT-JPL model, potentially resulting in ∼20% of model uncertainty [García et al., 2013]. Although the -dependent constraint presented a larger variability at the EUC site, it tended to follow variability whereas the -indirectly-dependent constraint remained constant at its maximum value (one) during the months during which overprediction occurred.
The performance of the PM-based models observed here might have been affected by the uncertainties in the estimation of site-specific properties, as we used the leaf conductances (as well as other parameters) from the Biome Properties Look-Up Table in [Mu et al., 2011], instead of calibrating them for the selected sites. As noted by Ershadi et al. [2015], the parameterisation seems to be crucial for PM modes. Therefore, a clear path for improving PM-based models over South America is to include its towers for calibrating the biome specific parameters (e.g., potential stomatal conductance, surface conductance etc) in order to make them more representative and account for their inter-continental variability. Moreover, Mu et al. [2011] also acknowledged that values of / from MODIS may introduce a bias in estimates, which may explain the systematic errors at some sites (e.g., CAA, FM, K83).
Regarding the PM-VI model, we found it to be the model with the largest variability (0 < 2 < 0.85, 0.15 < < 1.25, 0 < < 0.98) in performance among the selected towers.
Because this model consists of a combination of and an -based crop coefficient, we would expect it to perform best at crop sites. Although that was true for some sites (EUC and USR), PM-VI had the poorest performances at most sites with an aridity index > 1.2. Moreover, the model was capable of estimating for non-crop areas, with good predictive skills found at sites with moist forest (K34), mixed forest (FM), woodland savanna (PDG), and two permanent wetland sites (TF1 and TF2). It is interesting to note the contrasting behaviour of the PM-based models. While the PM-MOD is much more complex than PM-VI, the latter outperforms the former at some sites (e.g., TF1 and TF2). This is most likely caused by the fact that PM-MOD has not been calibrated for the flux towers considered here, unlike PM-VI.