A quantitative approach for comparing statistical classifications founded in machine learning and information theory
  • Hervé Guillon,
  • Belize Lane,
  • Colin Francis Byrne,
  • Gregory Brian Pasternack,
  • Samuel Sandoval Solis
Hervé Guillon
University of California, Davis

Corresponding Author: [email protected]

Belize Lane
Utah State University
Colin Francis Byrne
University of California, Davis
Gregory Brian Pasternack
University of California, Davis
Samuel Sandoval Solis
University of California, Davis

Abstract

Statistical classifications and machine-learning-based predictive models are increasingly used for environmental data analysis and management. There now exist numerous classifications on the same topic but applied to different regions or spatial scales, such as geomorphic classifications. However, no quantitative meta-analysis framework exists to compare and reconcile multiple classifications. To fill this gap, we jointly characterize statistical classifications and predictions by combining information theory and machine learning in three novel ways: (i) measuring the degree of discriminatory information underlying a statistical classification; (ii) estimating the stability of the learning process with tuning entropy; and (iii) leveraging the sequential coarse-graining of information inherent to deep neural networks but absent from traditional machine learning models. This framework is applied through a benchmark of 59 million models on a unique example of a single statistical classification methodology applied to nine different regions of California, USA. Regional results show that random forest consistently outperforms deep neural networks. In addition, a correlation analysis between regional characteristics, the level of discriminatory information of each classification, and the performance in statistical learning explains variations in performance and reveals the decisive role of the spatial scale of classification outputs. Because such a spatial scale is itself linked to the common situation of limited field sampling, directly comparing findings from statistical classifications and associated predictions appears seldom justified. A more desirable avenue to compare findings lies in combining data underlying statistical approaches in an interpretable and justifiable environmental data science.