Methods for selecting clusters — method

Finding the optimal number of clusters is generally a balance between optimal fit statistics, parsimony, and interpretability. These functions help select the number of clusters to return from hc, a hierarchical clustering object:

k_strict() selects a number of clusters in which there is no distance between cluster members.
k_elbow() selects a number of clusters in which there is a fair trade-off between parsimony and fit according to the elbow method.
k_silhouette() selects a number of clusters that maximises the silhouette score.

These functions are generally not user-facing but are used internally, e.g. in the *_equivalence() functions.

k_strict(hc, .data)

k_elbow(hc, .data, motif, Kmax)

k_silhouette(hc, .data, Kmax)

k_gap(hc, motif, Kmax, sims = 100)

Arguments

hc: A hierarchical clustering object.
.data: A network object of class mnet, igraph, tbl_graph, network, or similar. For more information on the standard coercion possible, see manynet::as_tidygraph().
motif: A motif census object.
Kmax: An integer indicating the maximum number of clusters to consider. The minimum of this and the number of nodes in the network is used.
sims: An integer indicating how many simulations to generate as a reference distribution.

Value

A single integer indicating the number of clusters to return.

Strict method

The strict method selects the number of clusters in which there is no distance between cluster members. This is a very conservative method that may be appropriate when the goal is to identify clusters of nodes that are exactly the same. However, it may not be appropriate where the data are noisy or the clusters are not well defined, as it may then return a large number of small clusters.
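The idea can be illustrated with a minimal base R sketch. The helper k_strict_sketch() below is hypothetical and is not the package's k_strict(): it assumes a dist object d is available alongside hc, and it returns the smallest number of clusters for which every within-cluster distance is zero (up to an assumed tolerance tol).

```r
# Hypothetical helper for illustration only; not the package's k_strict().
# Assumes `d` is the dist object that `hc` was built from.
k_strict_sketch <- function(hc, d, tol = 1e-8) {
  d <- as.matrix(d)
  n <- nrow(d)
  for (k in seq_len(n)) {
    membership <- stats::cutree(hc, k = k)
    # TRUE wherever two nodes fall in the same cluster
    same_cluster <- outer(membership, membership, "==")
    # Accept the first k at which no cluster members are any distance apart
    if (max(d[same_cluster]) <= tol) return(k)
  }
  n
}

# Toy profiles: a and b are identical, c and d are identical, e is unique
x <- rbind(a = c(1, 0, 1), b = c(1, 0, 1), c = c(0, 1, 0),
           d = c(0, 1, 0), e = c(1, 1, 1))
d <- stats::dist(x)
hc <- stats::hclust(d, method = "complete")
k_strict_sketch(hc, d)
```

With these toy profiles the sketch settles on three clusters: the two pairs of identical rows and the one unique row.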

Elbow method

The elbow method is a heuristic used in cluster analysis to determine the optimal number of clusters. It is based on the idea of plotting the within-cluster correlation as a function of the number of clusters and looking for an "elbow": the point after which adding further clusters yields markedly smaller improvements in correlation. The point at which the elbow occurs is often considered a good choice for the number of clusters, as it represents a balance between model complexity and fit to the data.
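One common way to locate an elbow numerically, sketched here under assumptions, is to take the point on the fit curve that lies furthest from the straight line joining its first and last points. The helper elbow_point() and the fit values below are hypothetical and purely illustrative; the source does not state which rule k_elbow() applies.

```r
# Hypothetical helper for illustration only; not the package's k_elbow().
# `fit` holds one fit value (e.g. within-cluster correlation) per k = 1, 2, ...
elbow_point <- function(fit) {
  k <- seq_along(fit)
  n <- length(fit)
  # Rescale both axes to [0, 1] so that neither dominates the distance
  x <- (k - min(k)) / (max(k) - min(k))
  y <- (fit - min(fit)) / (max(fit) - min(fit))
  # Perpendicular distance of each point from the chord joining the endpoints
  dx <- x[n] - x[1]
  dy <- y[n] - y[1]
  dist_to_chord <- abs(dy * (x - x[1]) - dx * (y - y[1])) / sqrt(dx^2 + dy^2)
  k[which.max(dist_to_chord)]
}

# Made-up fit values that improve quickly and then level off
fit <- c(0.20, 0.55, 0.80, 0.85, 0.88, 0.90)
elbow_point(fit)
```

For these made-up values the largest departure from the chord is at the third point, after which the gains flatten out.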

Silhouette method

The silhouette method is based on the concept of cohesion and separation. Cohesion refers to how closely related the nodes within a cluster are, while separation refers to how distinct the clusters are from each other. The silhouette score combines these two concepts into a single metric that can be used to evaluate the quality of a clustering solution.

The silhouette score is calculated as follows. For each node i, calculate the average distance to all other nodes in the same cluster, a(i), and the average distance to all nodes in the nearest other cluster, b(i). The silhouette score for each node is then calculated as:

$$S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

The score ranges from -1 to 1. A higher silhouette score indicates that the node is well-matched to its own cluster and poorly matched to neighbouring clusters. The silhouette score for the entire clustering is the average silhouette score across all nodes.

Maximising the silhouette score across a range of potential clusterings allows researchers to identify the number of clusters that best captures the underlying structure of the data. It is particularly useful when the clusters are well-separated.
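The calculation described above can be sketched in base R as follows. Both helpers, mean_silhouette() and k_silhouette_sketch(), are hypothetical names introduced for illustration and are not the package's k_silhouette(). The sketch assumes a dist object d, scores singleton clusters as 0 (a common convention), and only considers two or more clusters, since the score is undefined for a single cluster.

```r
# Hypothetical helpers for illustration only; not the package's k_silhouette().
mean_silhouette <- function(d, membership) {
  d <- as.matrix(d)
  clusters <- unique(membership)
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- membership == membership[i]
    own[i] <- FALSE                      # exclude the node itself
    if (!any(own)) next                  # singleton cluster: score stays 0
    a <- mean(d[i, own])                 # cohesion: mean distance within own cluster
    b <- min(vapply(setdiff(clusters, membership[i]),
                    function(cl) mean(d[i, membership == cl]),
                    numeric(1)))         # separation: nearest other cluster
    if (max(a, b) > 0) s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

k_silhouette_sketch <- function(hc, d, Kmax) {
  n <- attr(d, "Size")
  ks <- 2:min(Kmax, n - 1)               # silhouette needs at least two clusters
  scores <- vapply(ks, function(k) mean_silhouette(d, stats::cutree(hc, k = k)),
                   numeric(1))
  ks[which.max(scores)]
}

# Toy data: two well-separated groups on a line
x <- matrix(c(0, 0.1, 0.2, 5, 5.1, 5.2), ncol = 1)
d <- stats::dist(x)
hc <- stats::hclust(d, method = "average")
k_silhouette_sketch(hc, d, Kmax = 4)
```

Because the toy data form two tight, distant groups, the two-cluster solution has the highest average silhouette; splitting either group further lowers the score.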

References

On the elbow method

Thorndike, Robert L. 1953. “Who Belongs in the Family?” Psychometrika, 18(4): 267–76. doi:10.1007/BF02289263.

On the silhouette method

Rousseeuw, Peter J. 1987. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics, 20: 53–65. doi:10.1016/0377-0427(87)90125-7.