THEORY

Clustering — Finding Structure Without Labels

How K-means discovers groups through iterative refinement, how to pick K, and how hierarchical clustering reveals structure at every scale.

Overview

Clustering discovers structure in unlabeled data. This chapter develops K-means from first principles, discusses how to choose K, and introduces hierarchical clustering and dendrograms for multi-scale analysis.

You Will Learn

  • The K-means objective and its alternating minimisation algorithm
  • The E-step and M-step in K-means and why they decrease the objective
  • The Elbow method for choosing the number of clusters
  • Agglomerative hierarchical clustering and linkage criteria
  • How to interpret dendrograms and choose clusterings at different scales

Main Content

The K-means Objective

Given data points {x₁, …, x_n}, K-means seeks K centroids μ₁, …, μ_K and assignments z_i ∈ {1, …, K} that minimise the within-cluster sum of squares J = Σ_i ‖x_i − μ_{z_i}‖². Intuitively, each point is assigned to the nearest centroid, and centroids are set to the mean of their assigned points. This objective prefers compact, spherical clusters in Euclidean space.
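As a concrete check, the objective can be evaluated directly from data, centroids, and assignments. The points, centroids, and labels below are made-up toy values chosen so each point sits near its assigned centroid:

```python
import numpy as np

# Hypothetical toy data: six 2-D points forming two tight groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
centroids = np.array([[0.1, 0.1], [5.0, 5.0]])   # mu_1, mu_2
z = np.array([0, 0, 0, 1, 1, 1])                 # assignments z_i

# J = sum_i ||x_i - mu_{z_i}||^2  (within-cluster sum of squares)
J = np.sum((X - centroids[z]) ** 2)
```

Indexing `centroids[z]` broadcasts each point's assigned centroid into an array aligned with `X`, so the objective is a single vectorised sum.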

E-step and M-step in K-means

K-means iterates two steps: (E-step) assign each point to the nearest centroid by Euclidean distance; (M-step) recompute each centroid as the mean of its assigned points. Each step does not increase J; the E-step minimises J over assignments with centroids fixed, and the M-step minimises J over centroids with assignments fixed. Because there are finitely many possible assignments, the algorithm converges to a local minimum.
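A short self-contained run illustrates the monotonicity argument. The two synthetic blobs and the choice of two clusters are illustrative assumptions; J is recorded after every full E-step/M-step iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: two synthetic 2-D blobs of 20 points each.
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
               rng.normal(4.0, 0.5, (20, 2))])
centroids = X[rng.choice(len(X), 2, replace=False)]  # random init

J_history = []
for _ in range(10):
    # E-step: assign each point to its nearest centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # M-step: mean of assigned points (keep old centroid if a cluster empties).
    centroids = np.vstack([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(2)
    ])
    J_history.append(np.sum((X - centroids[labels]) ** 2))
# J_history is non-increasing, as the argument above predicts.
```

Because each E-step minimises J over assignments and each M-step minimises it over centroids, every recorded value is at most the previous one, regardless of the initialisation.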

Choosing K with the Elbow Method

The number of clusters K must be specified a priori. The Elbow method runs K-means for K = 1, 2, …, K_max and records the inertia (final value of J). As K increases, J decreases monotonically, but the marginal gain diminishes. Plotting J vs K produces a curve with a noticeable 'elbow' where additional clusters yield only small improvements. This elbow suggests a reasonable K, though it is not always clear-cut and domain knowledge is often required.
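The procedure can be sketched as follows. The data are three well-separated synthetic blobs, and `kmeans_inertia` is a hypothetical helper written for this sketch, not a library function:

```python
import numpy as np

def kmeans_inertia(X, K, iters=20, seed=0):
    """Run plain K-means for a fixed number of iterations and
    return the final inertia J (a sketch, not an optimised solver)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.vstack([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
    return np.sum((X - centroids[labels]) ** 2)

rng = np.random.default_rng(1)
# Three well-separated blobs along the diagonal (true K = 3).
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0.0, 4.0, 8.0)])
inertias = [kmeans_inertia(X, K) for K in range(1, 7)]
# Plotting inertias against K = 1..6 shows a sharp drop up to K = 3,
# then a flattening: the 'elbow'.
```

In practice one would plot `inertias` against K and look for the bend; on clean data like this the elbow at the true K is obvious, but on real data it can be ambiguous.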

Agglomerative Hierarchical Clustering

Hierarchical clustering does not fix K in advance. Agglomerative methods start with each point as its own cluster and repeatedly merge the two closest clusters according to a linkage criterion (single, complete, average, Ward). This produces a dendrogram that visualises the nested grouping structure. Cutting the dendrogram at different heights yields different numbers of clusters; examining these cuts gives insight into multi-scale structure in the data.
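The merge loop can be sketched naively (recomputing all pairwise cluster distances at every step, so for exposition only). The function name `agglomerative` and the four-point dataset are illustrative:

```python
import numpy as np

def agglomerative(X, linkage="single"):
    """Naive agglomerative clustering (sketch). Returns the merge
    sequence as (cluster_a, cluster_b, height) triples, i.e. the
    information a dendrogram visualises."""
    clusters = {i: [i] for i in range(len(X))}        # start: singletons
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    merges = []
    while len(clusters) > 1:
        best = None
        for a in clusters:
            for b in clusters:
                if a >= b:
                    continue
                pair = D[np.ix_(clusters[a], clusters[b])]
                # This sketch handles single and complete linkage only.
                d = pair.min() if linkage == "single" else pair.max()
                if best is None or d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        clusters[a] = clusters[a] + clusters[b]        # merge b into a
        del clusters[b]
        merges.append((a, b, d))
    return merges

# Two tight pairs far apart: the pairs merge first at small heights,
# and the final merge joining the two groups happens much higher.
X = np.array([[0.0, 0.0], [0.0, 0.3], [5.0, 5.0], [5.0, 5.4]])
merges = agglomerative(X)
```

The gap between the last merge height and the earlier ones is exactly what a dendrogram makes visible: cutting below the tall final link recovers the two groups.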

Linkage Criteria and Their Effects

Single linkage defines inter-cluster distance as the minimum distance between any pair of points across clusters, often producing 'chained' clusters. Complete linkage uses the maximum distance and tends to produce compact clusters of roughly similar diameter. Average linkage averages all pairwise distances. Ward's method merges the pair of clusters whose union minimally increases the total within-cluster variance and often yields the most balanced, interpretable trees for numeric data.
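The three distance-based criteria can be compared directly on two small hypothetical clusters A and B laid out on a line:

```python
import numpy as np

# Two made-up clusters on the x-axis.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [6.0, 0.0]])

# All pairwise distances between A and B (a 2x2 matrix here).
pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = pair.min()     # closest pair across clusters
complete = pair.max()   # farthest pair across clusters
average = pair.mean()   # mean of all cross-cluster pairs
```

Single linkage sees these clusters as close (driven by the nearest pair), complete linkage sees them as far (driven by the farthest pair), and average linkage lands in between, which is why the three criteria can merge clusters in different orders on the same data.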

Examples

K-means Alternating Minimisation Sketch

One iteration of K-means in NumPy. The original sketch lacked the NumPy import and would produce a NaN centroid if a cluster received no points; here an empty cluster keeps its previous centroid.

import numpy as np

def kmeans_step(X, centroids):
    # E-step: assign each point to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # M-step: recompute each centroid as the mean of its assigned points,
    # keeping the old centroid when a cluster is empty (avoids a NaN mean).
    new_centroids = np.vstack([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(centroids.shape[0])
    ])
    return labels, new_centroids

Common Mistakes

Interpreting K-means clusters as always meaningful

Why: K-means can force arbitrary partitions even when data has no real cluster structure.

Fix: Validate clusters with domain knowledge, visualisation and stability analyses; do not over-interpret arbitrary groupings.

Using K-means with non-spherical or categorical data

Why: K-means assumes Euclidean geometry and is biased toward spherical, equal-variance clusters.

Fix: Consider alternative methods (e.g., Gaussian mixtures, density-based clustering, k-modes) when data violate K-means assumptions.

Mini Exercises

1. Prove that the M-step update in K-means (setting μ_k to the mean of assigned points) minimises the squared error within cluster k.

2. On a 2D dataset with obvious non-spherical clusters (e.g., two moons), what happens when you apply K-means? Why?

Further Reading