Are natural clusters likely to be more stable/robust to new data?
Clustering solutions are notoriously sensitive to classification
protocols, and it has generally proven difficult to retrieve the units
of individual CCSs via meta-analysis of combined data (Tichý et al.
2011, 2014). Wiser & De Cáceres (2013) and Tichý et al.
(2014) characterised this problem in terms of the need to preserve units
of one or more CCSs while allowing for previously unrecognised units to
be identified following the acquisition of new plot data. Their
respective solutions comprise alternative forms of semi-supervised
clustering, promising approaches that allow for units to be “fixed” by
specifying their plot membership a priori while allowing for
unattributed plots to form new clusters. The question of when units
should be “fixed” must still be addressed. If the problem arises
either because algorithms cluster irregular data in idiosyncratic ways
or because there are biases in the distribution of samples in
compositional space, then some understanding of the underlying data
structure is likely to be informative.
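The general idea of semi-supervised clustering with “fixed” units can be illustrated with a minimal sketch. It is not the method of Wiser & De Cáceres (2013) or Tichý et al. (2014); it assumes plots are numeric vectors compared by Euclidean distance, and the function names and the `max_dist` threshold are hypothetical. Fixed units keep their a priori membership and centroids, while new plots that lie too far from every existing unit seed new clusters.

```python
import math

def centroid(points):
    """Mean vector of a non-empty list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dims))

def classify_with_fixed_units(fixed_units, new_plots, max_dist):
    """Assign new plots to fixed units or to newly formed clusters.

    fixed_units: dict mapping unit name -> list of plot vectors whose
                 membership is specified a priori (never modified here).
    new_plots:   list of unattributed plot vectors.
    max_dist:    hypothetical threshold; a plot farther than this from
                 every existing centroid starts a new cluster.
    """
    fixed = {name: centroid(pts) for name, pts in fixed_units.items()}
    new_members = {}  # new cluster name -> plots assigned so far
    labels = []
    for plot in new_plots:
        # Candidate centroids: fixed units (static) plus new clusters (updated).
        candidates = dict(fixed)
        candidates.update({n: centroid(m) for n, m in new_members.items()})
        name = min(candidates, key=lambda n: math.dist(plot, candidates[n]))
        if math.dist(plot, candidates[name]) > max_dist:
            name = f"new_{len(new_members) + 1}"
            new_members[name] = []
        if name in new_members:
            new_members[name].append(plot)
        labels.append(name)
    return labels

fixed_units = {"A": [(0, 0), (0, 2)], "B": [(10, 10), (10, 12)]}
new_plots = [(1, 1), (9, 11), (20, 0), (21, 1)]
labels = classify_with_fixed_units(fixed_units, new_plots, max_dist=3)
```

In this toy example the first two plots join the fixed units A and B, and the last two, which lie far from both, form a single new cluster. The outcome depends on the threshold and on the order in which plots are processed, which is one reason the choice of which units to fix deserves scrutiny.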
In theory, algorithms sensitive to data structure may reduce the extent
of this problem, at least at some thematic scales. Tozer et al. (2022)
concluded that Chameleon’s novel approach to modelling inter-sample
relationships greatly facilitated the revision of an earlier
broad-thematic classification of forested wetlands based on
substantially fewer plot samples (Keith & Scott 2005). Unlike many
traditional methods, which base merge or split decisions on
the aggregate properties of clusters, Chameleon operates on
inter-connected neighbourhood sets built, in Tozer et al.’s
(2022) case, on the same similarity metric used in the original
analysis. They considered these features pivotal, because the algorithm
could potentially minimise the impact of incremental additions of new
data by retaining connections between samples from the original set
(Tozer et al. 2022), although they speculated that this could best be
achieved by specifying a large neighbourhood, which our results suggest
is not appropriate for minimising the rate of misclassification.
Tozer et al. (2022) reasoned that, since
Chameleon dissolves connections between relatively weakly connected
samples in the partitioning phases, strong pairwise relationships
between samples underpinning clusters in the original analysis were
potentially preserved (and reflected more faithfully) in their new
Chameleon-derived clusters (provided they were not displaced by a
sufficiently large number of more strongly inter-connected samples).
This interesting feature requires further study.
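The retention of connections described above can be explored with a small sketch. It covers only the nearest-neighbour graph on which Chameleon operates, not its partitioning or merging phases, and it assumes Euclidean distance on toy coordinates; the function names are hypothetical. The sketch measures the fraction of neighbour links among the original samples that survive after new samples are added.

```python
import math

def knn_edges(points, k):
    """Directed edges (i, j) where j is among the k nearest neighbours of i.

    Ties in distance are broken by index so the result is deterministic.
    """
    edges = set()
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        edges.update((i, j) for j in others[:k])
    return edges

def retained_fraction(original, added, k):
    """Share of the original k-NN edges still present after adding samples.

    Original samples keep their indices because they come first in the
    combined list, so edges can be compared directly.
    """
    before = knn_edges(original, k)
    after = knn_edges(original + added, k)
    return len(before & after) / len(before)

original = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
far_addition = [(50, 50), (51, 50)]   # new samples remote from old ones
near_addition = [(0.1, 0.1)]          # new sample inside an old group

unchanged = retained_fraction(original, far_addition, k=2)
displaced = retained_fraction(original, near_addition, k=2)
```

Here the remote additions leave every original link intact, whereas the single sample placed inside an existing group displaces some of that group's links. This mirrors the proviso that original relationships persist only where they are not displaced by more strongly connected new samples.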