Are natural clusters likely to be more stable/robust to new data?
Clustering solutions are notoriously sensitive to classification protocols, and it has generally proven difficult to retrieve the units of individual CCSs via meta-analysis of combined data (Tichý et al. 2011, 2014). Wiser & De Cáceres (2013) and Tichý et al. (2014) characterised this problem in terms of the need to preserve the units of one or more CCSs while allowing previously unrecognised units to be identified following the acquisition of new plot data. Their respective solutions comprise alternative forms of semi-supervised clustering, promising approaches that allow units to be “fixed” by specifying their plot membership a priori while allowing unattributed plots to form new clusters. The question of when units should be “fixed” must still be addressed. If the problem arises either because algorithms cluster irregular data in idiosyncratic ways or because there are biases in the distribution of samples in compositional space, then some understanding of the underlying data structure is likely to be informative.
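The general idea of fixing units a priori while letting unattributed plots form new clusters can be illustrated with a minimal k-means-style sketch. This is not the algorithm of either Wiser & De Cáceres (2013) or Tichý et al. (2014); the function names, the use of Euclidean distance, the farthest-first seeding of new clusters and the synthetic data are all assumptions made purely for illustration.

```python
import numpy as np

def seed_new_centroids(X_free, fixed_c, n_new):
    """Farthest-first seeding (an illustrative choice): each new centroid starts
    at the unattributed plot farthest from all centroids chosen so far."""
    cents = fixed_c.copy()
    for _ in range(n_new):
        d = np.linalg.norm(X_free[:, None, :] - cents[None, :, :], axis=2).min(axis=1)
        cents = np.vstack([cents, X_free[d.argmax()]])
    return cents[len(fixed_c):].copy()

def semi_supervised_kmeans(X, fixed, n_new, n_iter=50):
    """Hypothetical sketch of clustering with fixed units.

    X     : (n_plots, n_features) array.
    fixed : (n_plots,) int array; unit index 0..m-1 for plots whose membership
            is fixed a priori, -1 for unattributed plots (assumes m >= 1).
    n_new : number of previously unrecognised clusters allowed to form.
    Returns labels in 0..m+n_new-1; labels >= m denote new clusters.
    """
    X = np.asarray(X, dtype=float)
    fixed = np.asarray(fixed)
    m = int(fixed.max()) + 1
    free = np.flatnonzero(fixed < 0)
    # Centroids of fixed units come from their fixed plots only and are never updated.
    fixed_c = np.array([X[fixed == u].mean(axis=0) for u in range(m)])
    new_c = seed_new_centroids(X[free], fixed_c, n_new)
    labels = fixed.copy()
    for _ in range(n_iter):
        centroids = np.vstack([fixed_c, new_c])
        dist = np.linalg.norm(X[free][:, None, :] - centroids[None, :, :], axis=2)
        labels[free] = dist.argmin(axis=1)   # only unattributed plots are reassigned
        for j in range(n_new):
            members = X[labels == m + j]
            if len(members):
                new_c[j] = members.mean(axis=0)
    return labels

# Synthetic example: two existing units (partly fixed) plus one unrecognised group.
rng = np.random.default_rng(1)
A = rng.normal(0, 0.1, size=(10, 2))
B = rng.normal(0, 0.1, size=(10, 2)) + [5, 0]
C = rng.normal(0, 0.1, size=(10, 2)) + [0, 5]
X = np.vstack([A, B, C])
fixed = np.full(30, -1)
fixed[:5] = 0      # five plots fixed to unit 0
fixed[10:15] = 1   # five plots fixed to unit 1
labels = semi_supervised_kmeans(X, fixed, n_new=1)
```

In this sketch the fixed plots can never change unit, so the sketch itself offers no guidance on when fixing is justified; that still depends on how well the fixed units reflect the underlying data structure.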
In theory, algorithms sensitive to data structure may reduce the extent of this problem, at least at some thematic scales. Tozer et al. (2022) concluded that Chameleon’s novel approach to modelling inter-sample relationships greatly facilitated the revision of an earlier broad-thematic classification of forested wetlands (Keith & Scott 2005), which had been based on substantially fewer plot samples. Unlike many traditional methods, which base merge or split decisions on the aggregate properties of clusters, Chameleon operates on inter-connected neighbourhood sets structured, in the case of Tozer et al. (2022), on the same similarity metric used in the original analysis. They considered these features pivotal, because the algorithm could potentially minimise the impact of incremental additions of new data by retaining connections between samples from the original set (Tozer et al. 2022). They speculated that this could best be achieved by specifying a large neighbourhood, although on the basis of our results we suggest that a large neighbourhood is not appropriate for minimising the rate of misclassification. Tozer et al. (2022) reasoned that, since Chameleon dissolves connections between relatively weakly connected samples in the partitioning phases, the strong pairwise relationships between samples that underpinned clusters in the original analysis were potentially preserved (and reflected more faithfully) in their new Chameleon-derived clusters, provided they were not displaced by a sufficiently large number of more strongly inter-connected samples. This interesting feature requires further study.
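The notion that neighbourhood connections among original samples may be retained, or displaced, when new samples are added can be explored with a short sketch. It covers only the construction of a k-nearest-neighbour graph, not Chameleon's partitioning or merging phases, and it assumes Euclidean distance and random synthetic data in place of the similarity metric and plot data of the original studies; the function names are hypothetical.

```python
import numpy as np

def knn_edges(X, k):
    """Directed k-nearest-neighbour edges (i, j) under Euclidean distance
    (an assumption; any dissimilarity matrix could be substituted)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a sample is not its own neighbour
    nbrs = np.argsort(d, axis=1)[:, :k]
    return {(i, int(j)) for i in range(len(X)) for j in nbrs[i]}

def retained_fraction(X_old, X_new, k):
    """Fraction of the original samples' k-NN edges that survive after
    new samples are appended to the data set."""
    before = knn_edges(X_old, k)
    # Original samples keep indices 0..n_old-1 because they are stacked first.
    after = knn_edges(np.vstack([X_old, X_new]), k)
    return len(before & after) / len(before)

# Synthetic data standing in for original and newly acquired plots.
rng = np.random.default_rng(0)
X_old = rng.normal(size=(60, 5))
X_new = rng.normal(size=(20, 5))
for k in (3, 10, 30):
    print(k, round(retained_fraction(X_old, X_new, k), 2))
```

An original edge is lost only when enough new samples lie closer to a sample than its former neighbour, which corresponds to the displacement proviso noted above. The sketch measures edge retention only and says nothing about the rate of misclassification.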