An algorithm for unsupervised partitioning of geoscientific datasets using flexible similarity metrics

Grant Petty

doi:10.1002/essoar.10510188.1

Abstract

A simple yet flexible and robust unsupervised classification algorithm is described for efficiently partitioning a dataset into compact, non-overlapping groups or classes based on pairwise similarity. Unlike most clustering algorithms, it does not assume that natural clusters exist in the dataset, though some clusters, when present, may be preferentially assigned to one or more classes. The method also does not require data objects to be compared within any coordinate system but rather permits the user to quantify pairwise similarity using almost any conceivable criterion. For these reasons, the method lends itself to certain geoscientific applications for which conventional clustering methods are unsuited, including two non-trivial and distinctly different datasets presented as examples. The computer memory required for the user-defined similarity matrix is 4N^2 bytes and is the sole practical limitation on the size N of the dataset that can be directly classified. Much larger datasets can be readily accommodated by assigning members to classes previously determined from a representative subset.
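
The 4N^2-byte memory figure can be turned into a quick sizing check. The sketch below is only illustrative arithmetic: it assumes 4 bytes per matrix element (which matches the stated 4N^2 total for a full N x N matrix), and the function names `similarity_matrix_bytes` and `max_dataset_size` are hypothetical helpers, not part of the described algorithm.

```python
import math

# Assumption: each pairwise similarity occupies 4 bytes, which is what
# the stated 4N^2 total implies for a full N x N matrix.
BYTES_PER_ELEMENT = 4


def similarity_matrix_bytes(n: int) -> int:
    """Memory in bytes needed for a full N x N similarity matrix."""
    return BYTES_PER_ELEMENT * n * n


def max_dataset_size(memory_bytes: int) -> int:
    """Largest N whose similarity matrix fits in memory_bytes."""
    # Largest integer n with 4 * n^2 <= memory_bytes.
    return math.isqrt(memory_bytes // BYTES_PER_ELEMENT)


if __name__ == "__main__":
    n = 50_000
    print(f"N = {n}: {similarity_matrix_bytes(n) / 1e9:.1f} GB")
    budget = 16 * 2**30  # 16 GiB available for the matrix alone
    print(f"16 GiB budget: N <= {max_dataset_size(budget)}")
```

Under these assumptions, a dataset of 50,000 objects needs about 10 GB for the matrix, and a 16 GiB budget caps directly classifiable datasets at N = 65,536. This quadratic growth is why larger datasets are handled by classifying a representative subset first.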