Abstract
Landslides cause billions of dollars in property damage and thousands of
deaths every year worldwide. More than 15% of India's land area is
prone to landslides; hence, mapping these areas for the presence of
landslides is of utmost importance. Landslide susceptibility zonation
maps give approximate information about the occurrence of landslides.
Various factors are responsible for slope instability. In this work,
11 causative factors are considered: aspect, elevation, geology,
distance from thrusts, distance from streams, plan curvature, profile
curvature, slope, stream power index, tangential curvature, and
topographic wetness index. Machine learning methods such as
artificial neural networks and support vector machines require a large
amount of training data; however, the number of landslide occurrences in
a study area is limited. The limited number of landslides leads to a
small number of positive-class pixels in the training data. In contrast,
the number of non-landslide pixels (negative-class pixels) is very
large. This under-representation and the severe skew in class
distribution create a data imbalance for learning algorithms and lead to
sub-optimal models, which are biased towards the majority class
(non-landslide pixels) and perform poorly on the minority class
(landslide pixels).
Generally, data is considered imbalanced when the class ratio is of the
order of 100:1, 1000:1, or 10000:1 (i.e., one class has 100, 1000, or
10000 times as many points as the other). In our work, the class ratio
is more than 300:1 (i.e., for each landslide pixel, there are more than
300 non-landslide pixels). Thus, we can clearly say that our
data is imbalanced. There are two major data balancing techniques:
oversampling of the minority class and under-sampling of the majority
class. Minority oversampling cannot be applied, as it would create
false landslide pixels. We perform under-sampling of non-landslide
pixels using various techniques. We discuss landslide susceptibility
zonation with and without data balancing and show major improvements in
accuracy over imbalanced learning.
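
As a rough illustration of majority-class under-sampling, the sketch below keeps every landslide (positive) pixel and a random subset of the non-landslide (negative) pixels. It assumes plain random under-sampling, which may or may not be among the techniques used in this work. The function name `random_undersample`, its `ratio` parameter, and the synthetic data are hypothetical and stand in for the real pixel features.

```python
import random


def random_undersample(samples, labels, ratio=1.0, seed=0):
    """Keep all positives and a random subset of negatives.

    ratio is the number of negatives kept per positive
    (an illustrative parameter, not taken from the paper).
    """
    rng = random.Random(seed)
    pos_idx = [i for i, lab in enumerate(labels) if lab == 1]
    neg_idx = [i for i, lab in enumerate(labels) if lab == 0]
    # Never ask for more negatives than exist.
    n_keep = min(len(neg_idx), round(ratio * len(pos_idx)))
    kept_neg = rng.sample(neg_idx, n_keep)  # sampling without replacement
    keep = sorted(pos_idx + kept_neg)
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Synthetic stand-in data: 11 features per pixel, 300:1 class ratio.
data_rng = random.Random(42)
n_pos, n_neg = 20, 6000
X = [[data_rng.gauss(0.0, 1.0) for _ in range(11)] for _ in range(n_pos + n_neg)]
y = [1] * n_pos + [0] * n_neg

X_bal, y_bal = random_undersample(X, y, ratio=1.0)
print(len(X_bal), sum(y_bal))  # 40 samples, 20 of them positive
```

With `ratio=1.0` the balanced set has one negative per positive. A larger ratio keeps more of the majority class at the cost of a milder rebalancing.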