Are ‘natural’ clusters necessarily less homogeneous?
Although a degradation of cluster homogeneity is implicit in our model, the degree to which this is realised is likely to be highly dependent on the structure of individual datasets. In our case study, the mis-classification rate achieved by Chameleon was half that of k-means, at the cost of a 10% reduction in cluster homogeneity. We speculate that, if the clusters Chameleon retrieved in our dataset are indeed irregular in shape, they are unlikely to be highly elongated, and that the variability in our data structure tends toward uneven density rather than irregular shape.
The question of whether ‘natural’ clusters necessarily have fewer diagnostic species is more difficult to resolve based on our analyses. A priori, we inclined to the notion that more heterogeneous clusters would mean fewer diagnostic species, and this is the pattern reflected in our results. However, Schmidtlein et al. (2010) demonstrated that Isopam, an algorithm that adapts to irregular cluster shapes, consistently out-performed other algorithms in terms of the number of indicator species (sensu Dufrêne & Legendre 1997) and was also highly ranked in terms of the number of species with standardized phi >0.35 (Tichý & Chytrý 2006). Higher numbers of diagnostic species could reflect the sampling of a wider species pool, since samples sharing no species can occupy the same cluster if they comprise an interconnected neighbourhood (Schmidtlein et al. 2010). However, it is not clear that higher numbers of diagnostic species are not an artefact of Isopam’s partitioning of the ordination space by medoids, notwithstanding the fact that the ordination axes are adjusted to accommodate non-linearities (and hence irregularities).
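As an illustration of how a count of diagnostic species might be obtained for one cluster, the following minimal sketch computes the phi coefficient between each species and a target cluster from the 2x2 contingency table of presence against cluster membership, and then counts the species exceeding the 0.35 threshold. It is not the procedure used in this study: the function names `phi_coefficient` and `count_diagnostic`, the presence/absence input and the toy data are assumptions made for illustration, and the group-size standardisation of Tichý & Chytrý (2006) is deliberately omitted, so the values are unstandardised phi.

```python
import numpy as np

def phi_coefficient(presence, labels, cluster):
    """Unstandardised phi between each species and one cluster.

    presence : (n_plots, n_species) array of 0/1 or booleans (hypothetical input)
    labels   : (n_plots,) cluster label of each plot
    cluster  : label of the target cluster
    """
    presence = np.asarray(presence, dtype=bool)
    in_cluster = np.asarray(labels) == cluster

    N = float(presence.shape[0])                      # all plots
    Np = float(in_cluster.sum())                      # plots in target cluster
    n = presence.sum(axis=0).astype(float)            # occurrences per species
    n_p = presence[in_cluster].sum(axis=0).astype(float)  # occurrences in cluster

    num = N * n_p - n * Np
    den = np.sqrt(n * Np * (N - n) * (N - Np))
    # Species present in every plot or in none give a zero denominator;
    # phi is undefined there and is set to 0 in this sketch.
    with np.errstate(divide="ignore", invalid="ignore"):
        phi = np.where(den > 0, num / den, 0.0)
    return phi

def count_diagnostic(presence, labels, cluster, threshold=0.35):
    """Number of species whose phi for the cluster exceeds the threshold."""
    return int((phi_coefficient(presence, labels, cluster) > threshold).sum())

# Toy data: 6 plots, 4 species, two clusters of 3 plots each.
presence = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
])
labels = np.array([0, 0, 0, 1, 1, 1])

print(phi_coefficient(presence, labels, cluster=0))
print(count_diagnostic(presence, labels, cluster=0))
```

Because unstandardised phi depends on the relative size of the target cluster, counts obtained this way are not directly comparable across clusters of very different sizes, which is the motivation for the standardised form cited above.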
On the evidence of our results, we conclude that our original contention is supported: cluster solutions derived by algorithms sensitive to data structure are unlikely to be as compact or homogeneous as those derived by optimising central tendency, although the differences may not always be pronounced, depending on the characteristics of individual datasets and the thematic scale of investigation. We therefore suggest that further research is required into metrics that give insight into how well cluster solutions model the structure of vegetation data (e.g. within-cluster inter-connectedness, mis-classification rates), to better understand the potential trade-offs involved in maximising homogeneity or indicator values.