A quantitative approach for comparing statistical classifications
founded in machine learning and information theory
Abstract
Statistical classifications and machine-learning-based predictive models
are increasingly used for environmental data analysis and management.
Numerous classifications now exist on the same topic, for example
geomorphic classifications, but are applied to different regions or
spatial scales. However, no quantitative meta-analysis framework exists
to compare and reconcile multiple classifications. To fill this
gap, we jointly characterize statistical classifications and predictions
by combining information theory and machine learning in three novel
ways: (i) measuring the degree of discriminatory information underlying a
statistical classification; (ii) estimating the stability of the
learning process with tuning entropy; and (iii) leveraging the
sequential coarse-graining of information inherent to deep neural
networks but absent from traditional machine learning models. This
framework is applied through a benchmark of 59 million models on a
unique example of a single statistical classification methodology
applied to nine different regions of California, USA. Regional results
show that random forest consistently outperforms deep neural networks.
In addition, a correlation analysis between regional characteristics,
the level of discriminatory information of each classification, and the
performance in statistical learning explains variations in performance
and reveals the decisive role of the spatial scale of classification
outputs. Because this spatial scale is itself linked to the common
situation of limited field sampling, directly comparing findings from
statistical classifications and associated predictions appears seldom
justified. A more desirable avenue to compare findings lies in combining
data underlying statistical approaches in an interpretable and
justifiable environmental data science.