Is there evidence of irregular structure in vegetation data, as reflected in the performance of different algorithms?
We hypothesised that in cases where the structure of vegetation data is variable (irregularly shaped clusters or variable density), an algorithm sensitive to such variability would perform better (lower rates of misclassification) than one that optimises central tendency (more homogeneous clusters). While the structure of our vegetation data is unknown, it is unlikely to be regular, being neither wholly continuous along environmental gradients nor arranged in discrete clusters. Theory and empirical evidence suggest that assemblages of species form multi-dimensional continua (Whittaker, 1975; Goodall, 1978; Kent, 2011). However, discontinuities may arise where environmental gradients are discontinuous in geographic space, or where parts of the environmental spectrum are not represented (Austin, 2013). Discontinuities are also likely to arise in our data at broader thematic scales due to biases in the distribution of samples (Gellie et al., 2018), and they patently exist at continental scales between climatically similar sub-continental regions which are separated by water or by large areas of unsuitable climate, and which therefore share few species (Tozer et al., 2017). We further hypothesised, therefore, that Chameleon’s primary advantage was likely to lie in the elucidation of upper-hierarchical clusters.
Overall, the results of our analyses support both hypotheses, although it is clear that: i) the utility of the different clustering methods cannot be encapsulated solely in terms of cluster homogeneity and rates of misclassification; ii) internal evaluators can be misleading in terms of cluster quality; and iii) the superior performance of Chameleon in elucidating upper-hierarchical clusters is entirely dependent on selecting appropriate parameters from an effectively infinite range of combinations. The clearest support for our hypotheses was evident in the comparison between solutions derived using Chameleon and those derived by k-means over the range of 15–250 clusters. Chameleon’s best 15- and 30-cluster solutions exhibited significantly lower rates of misclassification than those of k-means, at the cost of an increase in heterogeneity (Figure C), while at progressively higher levels of thematic detail (60–250 clusters) the respective metrics converged. We speculate that increasing rates of misclassification at finer thematic scales are indicative of the partitioning of a continuum. That is, at fine thematic scales communities increasingly intergrade, such that the proportion of their (ever-decreasing) member-sets which most closely resemble samples in adjacent clusters increases. If there was indeed variability in the structure of our data at broad thematic scales, then the algorithms performed as hypothesised. We conclude there was a clear advantage in using Chameleon over k-means to elucidate our upper-hierarchical clusters (and at relatively little cost), but no apparent advantage at finer thematic scales in terms of cluster metrics. However, since Chameleon solutions of progressively finer scale can be produced by continuing to partition the sparse graph, the algorithm potentially offers a straightforward method of integrating plot-based classifications at multiple hierarchical scales.
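For concreteness, the sketch below is a minimal illustration (not the code used in this study) of one reading of the misclassification rate discussed here: the proportion of plots whose compositionally nearest neighbour lies in a different cluster. It scores k-means solutions at several thematic scales; the random plot-by-species matrix, the Bray–Curtis measure and the cluster numbers are all illustrative assumptions.

```python
# Minimal sketch (assumed inputs): nearest-neighbour misclassification
# proxy applied to k-means solutions at several thematic scales.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((300, 40))  # stand-in plot-by-species abundance matrix

def misclassification_rate(D, labels):
    """Fraction of plots whose compositionally nearest neighbour
    sits in a different cluster (one proxy for 'misclassification')."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)   # exclude self-matches
    nearest = D.argmin(axis=1)    # nearest neighbour of each plot
    return float(np.mean(labels[nearest] != labels))

D = squareform(pdist(X, metric="braycurtis"))
for k in (15, 30, 60):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(misclassification_rate(D, km.labels_), 3))
```

On random data this proxy rises with the number of clusters, consistent with the intuition that partitioning a continuum ever more finely pushes each cluster’s nearest compositional neighbours into adjacent clusters.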
Accounting for the performance of the agglomerative and divisive clustering algorithms is more complicated. First, on the basis of cluster homogeneity and rates of misclassification, our agglomerative algorithm performed better than either Chameleon or k-means, scoring better on both metrics at all levels of thematic detail, while our divisive algorithm scored worse (Figure C). Both, however, produced 15-cluster solutions with much greater unevenness in membership numbers than k-means or Chameleon (Figure D) which, if evidence of chaining (sensu Peet & Roberts, 2013), could suggest that both solutions were less informative in relation to the nature of upper-hierarchical groupings. Conversely, our three traditional algorithms scored equally highly in terms of the number of diagnostic species, and clearly higher than the best Chameleon solutions, suggesting that unevenness in cluster membership numbers could, in fact, be symptomatic of biases in the distribution of samples among ‘natural’ clusters, and that the three traditional algorithms performed better in detecting these uneven clusters (as evidenced by higher numbers of diagnostic species).
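To illustrate how chaining manifests as uneven membership, the following sketch contrasts single linkage (a classically chaining-prone method, standing in for our algorithms only by analogy) with complete linkage on synthetic data, summarising the evenness of the resulting 15-cluster memberships. All inputs are assumptions for illustration.

```python
# Sketch (assumed inputs): evenness of 15-cluster membership numbers
# under a chaining-prone linkage versus a compactness-seeking one.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = rng.random((300, 40))  # stand-in plot-by-species matrix
d = pdist(X, metric="braycurtis")

def size_evenness(labels):
    """Pielou-style evenness of cluster sizes: 1 = perfectly even."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(counts)))

for method in ("single", "complete"):
    labels = fcluster(linkage(d, method=method), t=15, criterion="maxclust")
    print(method, round(size_evenness(labels), 3))
```

Single linkage typically yields one very large chained cluster plus many singletons (low evenness), whereas complete linkage returns far more even memberships; a diagnostic of the kind summarised in Figure D.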
Comparisons with a reference classification suggest that unevenness in cluster size is more likely to be indicative of chaining, because indicator species of the largest clusters tended to represent large numbers of known classes, some of which are relatively distantly related, a phenomenon most strongly evident in the agglomerative and divisive solutions (Tables 3a–c). This reflects a well-known weakness of agglomerative and divisive methods which base merge or split decisions on the aggregate properties of clusters. Such methods rely on unrealistic assumptions concerning the structure of the data and/or on sequential merge/split decisions which cannot be reversed and which are necessarily sensitive to the composition of the dataset (Han et al., 2012). While we did not evaluate the quality of solutions of greater than 15 classes, we suggest our agglomerative algorithm outperformed all others in producing 250-cluster solutions with low rates of misclassification and high homogeneity, but that subsequent, upper-hierarchical groupings became progressively less meaningful because of poor merging decisions. We conclude that Chameleon and k-means generated the most informative 15-cluster solutions, with the former perhaps better representing the natural structure of the data while the latter produced more homogeneous groupings.
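A simple screen for chaining of the kind described above, sketched here under assumed inputs rather than our actual data, is to cross-tabulate a candidate solution against the reference classification and flag large clusters whose members span many reference classes.

```python
# Sketch (assumed inputs): flag clusters whose members span many
# reference classes, a symptom of chained, heterogeneous clusters.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
reference = rng.integers(0, 40, 300)  # stand-in known class per plot
labels = rng.integers(0, 15, 300)     # stand-in 15-cluster solution

xtab = pd.crosstab(pd.Series(labels, name="cluster"),
                   pd.Series(reference, name="reference"))
summary = pd.DataFrame({
    "size": xtab.sum(axis=1),             # cluster membership count
    "ref_classes": (xtab > 0).sum(axis=1) # reference classes spanned
})
print(summary.sort_values("size", ascending=False).head())
```

Under chaining, the largest clusters dominate this table while spanning disproportionately many reference classes, matching the pattern we observed in the agglomerative and divisive solutions.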