Performance of Chameleon under combinations of varying parameters: single linkage
Trends in the misclassification rate and average within-cluster homogeneity of Chameleon cluster solutions generated using the weighted single-link functions are summarised in Figure 3. The misclassification rate rose with increasing neighbourhood size (Figure 3A). This result may reflect aberrations caused by forcing members of small clusters to forge links with samples in other clusters, as illustrated by Chameleon’s attribution of the simulated data presented in Figure 1 given neighbourhoods of different sizes (Figure 4). Solutions derived by agglomeration from 30 sub-partitions had consistently lower rates of misclassification, but beyond 30 sub-partitions solutions became increasingly uneven (chaining) and misclassification rates became meaningless because a high proportion of samples was concentrated in a few clusters. The problem of chaining was not corrected by directing the algorithm to prioritise large clusters over small ones in the partitioning phase. However, more even clusters were produced when the cluster-weighted complete-link function was employed in the agglomerative phase of the algorithm, and subsequent analyses were performed using this option, as described in the next section. There was no clear trend in within-cluster homogeneity with increasing neighbourhood size when the agglomeration phase was omitted (Figure 3B). Solutions derived by agglomeration from 30 sub-partitions had the highest homogeneity with a neighbourhood size of 100. Beyond 30 sub-partitions, homogeneity showed no clear trend and varied erratically depending on the unevenness of the solutions.
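The suggested link between neighbourhood size and forced cross-cluster links can be illustrated with a toy example. The sketch below is not Chameleon itself: it only builds a plain k-nearest-neighbour graph on two well-separated synthetic clusters (5 and 50 points, both hypothetical) and counts neighbour links that join samples from different clusters. The function name `cross_cluster_links` and all data are assumptions made for illustration.

```python
import math
import random


def cross_cluster_links(points, labels, k):
    """Count k-nearest-neighbour links that join samples with different labels."""
    total = 0
    for i, p in enumerate(points):
        # Distances from sample i to every other sample, nearest first.
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        # A link is "forced" across clusters when the neighbour's label differs.
        total += sum(1 for _, j in others[:k] if labels[j] != labels[i])
    return total


# Hypothetical data: a small tight cluster and a large tight cluster, far apart.
random.seed(0)
small = [(random.gauss(0.0, 0.1), random.gauss(0.0, 0.1)) for _ in range(5)]
large = [(random.gauss(10.0, 0.1), random.gauss(10.0, 0.1)) for _ in range(50)]
points = small + large
labels = [0] * 5 + [1] * 50

links_k3 = cross_cluster_links(points, labels, k=3)    # neighbourhood fits inside the small cluster
links_k10 = cross_cluster_links(points, labels, k=10)  # neighbourhood exceeds the small cluster
print(links_k3, links_k10)
```

In this toy setting, each member of the 5-point cluster has only four same-cluster neighbours available, so any neighbourhood larger than four must reach into the other cluster regardless of how far away it is. Whether this mechanism fully accounts for the trend in Figure 3A is not established by the sketch.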
Clusters of 15 solutions generated using the cluster-weighted complete-link function exhibited higher rates of misclassification and lower within-cluster homogeneity when either neighbourhood size (n) or the number of sub-partitions (a) in the agglomerative phase was increased. Increasing n disproportionately affected the misclassification rate, whereas increasing a disproportionately affected cluster homogeneity (Figure 5).
Both the rate of misclassification and within-cluster homogeneity increased with increasing thematic resolution (Figure 6). Chameleon solutions derived using small neighbourhood sizes, and either modest numbers of sub-partitions (twice the number of classes in the solution) or no agglomeration phase, were better (lower rates of misclassification and higher homogeneity) than those derived with the divisive algorithm, but worse than those derived with the agglomerative algorithm (Figure 6). However, 15-class solutions derived by Chameleon were more even than those produced by either the agglomerative or divisive algorithms (Figure 7). Chameleon solutions were better than those of k-means at broad thematic scales (15–60 classes) but equivalent at finer scales (90–250 classes). Chameleon also produced more even 15-class solutions than k-means (Figure 7).
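The comparisons of evenness above depend on how evenness of cluster sizes is quantified, which this section does not state. As one possible illustration only, the sketch below uses Pielou's evenness index (Shannon entropy of cluster proportions divided by its maximum, ln K). This choice of index is an assumption, as are the function name `pielou_evenness` and the example label vectors.

```python
import math
from collections import Counter


def pielou_evenness(labels):
    """Pielou's J: Shannon entropy of cluster proportions divided by ln(K).

    Returns 1.0 for perfectly even cluster sizes and approaches 0 when
    most samples are concentrated in one cluster (as with chaining).
    """
    counts = list(Counter(labels).values())
    n = sum(counts)
    k = len(counts)
    if k < 2:
        # Evenness is undefined for a single cluster; 0.0 is returned here
        # purely as a convention of this sketch.
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts)
    return entropy / math.log(k)


# Hypothetical 3-cluster solutions of 30 samples each.
even_solution = [0] * 10 + [1] * 10 + [2] * 10   # equal-sized clusters
chained_solution = [0] * 28 + [1] * 1 + [2] * 1  # one dominant cluster

even_score = pielou_evenness(even_solution)
chained_score = pielou_evenness(chained_solution)
print(even_score, chained_score)
```

A normalised index of this kind allows solutions with different numbers of classes to be compared on a common 0 to 1 scale, although other measures of evenness would serve equally well.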
Clusters derived by Chameleon were generally characterised by fewer diagnostic species than those derived using the traditional algorithms (Table 2). However, species diagnostic of Chameleon clusters corresponded more closely with those characterising units of a reference classification for our study area than did those diagnostic of clusters derived by the agglomerative or divisive algorithms, both in the range of units represented and in showing less overlap between unrelated units (Table 3a, 3b, 3c). Clusters derived by k-means retrieved units of the reference classification with efficiency similar to that of Chameleon (Table 3d).