Clustering Notebook — K-Means from Scratch & Customer Segmentation

Implement K-means from scratch, apply the Elbow method, build dendrograms, and segment 2,206 real customers from the iFood dataset.

Overview

This chapter implements K-means from scratch, applies the Elbow method, explores hierarchical clustering with dendrograms, and performs real customer segmentation on the iFood marketing dataset.

You Will Learn

  • Writing K-means in NumPy with explicit E- and M-steps
  • Using the Elbow method to choose K in practice
  • Constructing dendrograms and interpreting them
  • Performing and interpreting customer segmentation on real data

Main Content

Implementing K-means from Scratch

You implement functions to (1) initialise centroids, (2) assign points to nearest centroids, and (3) recompute centroids. Running this implementation on Iris confirms that it quickly converges and recovers setosa almost perfectly, with some mixing between versicolor and virginica where their petal features overlap.
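The three steps above can be sketched in a few NumPy functions. This is a minimal illustration, not the notebook's own implementation: function and variable names are illustrative, and it does not handle the empty-cluster edge case.

```python
import numpy as np

def init_centroids(X, k, rng):
    # Pick k distinct data points as starting centroids
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx]

def assign(X, centroids):
    # E-step: label each point with the index of its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def update(X, labels, k):
    # M-step: move each centroid to the mean of its assigned points
    # (note: a cluster that ends up empty would produce NaN here)
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = init_centroids(X, k, rng)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        new_centroids = update(X, labels, k)
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids
```

Alternating the E- and M-steps monotonically decreases the within-cluster sum of squares, which is why convergence is typically fast in practice.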

Choosing K via the Elbow Method

By running K-means for K = 1…10 and plotting inertia as a function of K, you learn to visually detect elbows. On Iris, K = 3 is usually an obvious elbow. On more complex data the elbow may be subtle, which is a useful reminder that model selection often involves imperfect but pragmatic heuristics.
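The Elbow sweep amounts to a short loop over K, recording scikit-learn's `inertia_` (within-cluster sum of squares) at each value; a sketch on Iris:

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases as K grows; the "elbow" is the K
# after which the drop flattens out
for k, inertia in zip(range(1, 11), inertias):
    print(f"k={k:2d}  inertia={inertia:8.1f}")
```

Plotting `inertias` against K (e.g., with matplotlib) makes the elbow at K = 3 easy to spot on Iris.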

Hierarchical Clustering and Dendrograms

Using scipy’s linkage and dendrogram functions, you perform agglomerative clustering on Iris and visualise the hierarchy. Cutting the dendrogram at different heights produces varying clusterings that correspond to coarse and fine-grained groupings. Comparing these to K-means cluster assignments shows where methods agree and differ.

Customer Segmentation on iFood

On the iFood marketing dataset, you standardise numeric features, select relevant subsets (e.g., spending by category, channel usage, demographics), and apply K-means with a K chosen via the Elbow method and domain knowledge. Profiling each cluster by its centroid yields interpretable customer personas (e.g., high-income wine enthusiasts, deal-seeking families), illustrating the practical value of unsupervised learning.
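The scale-cluster-profile pipeline can be sketched as follows. A small synthetic DataFrame stands in for the iFood table here, and the column names (`Income`, `MntWines`, `NumDealsPurchases`) are assumptions; substitute the dataset's actual columns when running the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for the iFood customer table (columns are illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Income": rng.normal(50_000, 15_000, 300),
    "MntWines": rng.gamma(2.0, 150.0, 300),
    "NumDealsPurchases": rng.poisson(3, 300),
})

# Standardise so no single feature dominates the Euclidean distance
X = StandardScaler().fit_transform(df)

km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
df["cluster"] = km.labels_

# Profile each segment by the mean of its original-scale features:
# this table is what you read off to name the personas
profile = df.groupby("cluster").mean().round(1)
print(profile)
```

Reading the profile table row by row is how personas emerge: a cluster with high mean `Income` and high `MntWines` but few `NumDealsPurchases` would suggest the "high-income wine enthusiast" segment.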

Examples

Running K-means on Iris

Use scikit-learn’s KMeans to confirm your implementation.

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=42)  # pin n_init for reproducibility across sklearn versions
labels = km.fit_predict(X)
print("Cluster centers:\n", km.cluster_centers_)

Common Mistakes

Omitting feature scaling before clustering

Why: Unscaled features with different units dominate distance computations, biasing clusters.

Fix: Standardise or normalise features before applying distance-based clustering methods.
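A minimal illustration of the problem, using made-up customer values: with raw units, a dollar-scale feature drowns out a year-scale one, and standardising restores the balance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Income in dollars vs. age in years (illustrative values)
X = np.array([[30_000, 25],
              [30_500, 60],
              [90_000, 26]], dtype=float)

# Distances from the first customer, before and after scaling
raw_d = np.linalg.norm(X - X[0], axis=1)
Xs = StandardScaler().fit_transform(X)
scaled_d = np.linalg.norm(Xs - Xs[0], axis=1)

# Raw distance calls customer 1 (similar income, very different age)
# the nearest neighbour, because income differences dominate;
# after standardising, customer 2 (similar age) becomes nearest
print(raw_d)
print(scaled_d)
```

Any distance-based method (K-means, hierarchical clustering, k-NN) is affected the same way, which is why scaling comes first in the segmentation pipeline.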

Assuming clusters discovered in customer data are static over time

Why: Customer behaviour shifts; segments may drift or split.

Fix: Re-segment periodically and track stability of cluster assignments over time.

Mini Exercises

1. On the iFood dataset, build at least three different K-means segmentations using different feature subsets. How do the inferred personas change?

2. Compare K-means and hierarchical clustering on the same subset of customers. Where do they agree and disagree?

Further Reading