Abstract
The identification of factors driving climate extremes has
conventionally relied on physical models evaluated with global
climate models and/or on statistical analysis.
However, owing to the lack of spatial historical records, both of these
approaches face a data insufficiency challenge. Moreover, identifying
the primary drivers of climate extremes from a larger set of factors
poses another challenge. Bagging machine learning models in conjunction
with synthetic sampling techniques can address both of these
challenges.
Here, I demonstrate the applicability of
three sampling techniques combined with Random Forest (RF) to
identify the main drivers and their spatial locations affecting the
heatwave days over India for the period of 1979-2013. The three sampling
techniques used to generate balanced data are undersampling,
oversampling, and the synthetic minority oversampling technique (SMOTE). The
RF model with SMOTE identified the most important factors
with greater precision and recall ($F_1$-score of 0.85) than with the
other sampling techniques. Geopotential height at 500 hPa
along with sensible heating fluxes were identified as important factors
characterizing the Indian heatwave days. The work has implications for
any climate extreme that lacks balanced data and has
significantly fewer observations than candidate factors.
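
As an illustration only, the three resampling strategies named above can be sketched in plain Python as follows. The toy data, the two feature columns, the function names, and the neighbour count `k` are assumptions made for this example and are not taken from the study; in practice the balanced output would be passed to a classifier such as RF.

```python
import random

def undersample(majority, minority, rng):
    """Randomly drop majority rows until both classes have equal size."""
    return rng.sample(majority, len(minority)), list(minority)

def oversample(majority, minority, rng):
    """Randomly duplicate minority rows (with replacement) up to majority size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra

def smote(majority, minority, rng, k=3):
    """Add synthetic minority rows by interpolating toward a near neighbour."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(len(majority) - len(minority)):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row, excluding the row itself
        neighbours = sorted((m for m in minority if m is not base),
                            key=lambda m: sq_dist(base, m))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return list(majority), list(minority) + synthetic

# Hypothetical toy data: two standardized features per day (assumed, not real).
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]    # non-heatwave days
minority = [(rng.gauss(2, 0.5), rng.gauss(2, 0.5)) for _ in range(5)]  # heatwave days

for name, sampler in [("undersampling", undersample),
                      ("oversampling", oversample),
                      ("SMOTE", smote)]:
    maj, mino = sampler(majority, minority, rng)
    print(name, len(maj), len(mino))
```

Undersampling discards majority-class information and plain oversampling only repeats existing rows, whereas SMOTE fills the minority class with new points lying between observed minority rows.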