THEORY

Density Estimation & the EM Algorithm

How Gaussian mixtures model complex distributions, why EM works through soft assignments, and how density models become classifiers.

Overview

Density estimation aims to model the full probability distribution p(x) rather than only conditional probabilities. This chapter introduces Gaussian Mixture Models (GMMs), the Expectation–Maximization (EM) algorithm, and shows how density models become classifiers.

You Will Learn

  • Why single Gaussians are often insufficient to model real data
  • How Gaussian mixtures approximate complex distributions
  • The EM algorithm: E-step responsibilities and M-step parameter updates
  • How to use class-conditional densities for classification
  • The singularity problem in maximum likelihood and how to regularise covariances

Main Content

From Single Gaussians to Mixtures

Real-world data often exhibit multi-modal structure: heights of a population, formant frequencies of vowels, or pixel intensities in images. A single Gaussian with one mean and covariance cannot capture multiple distinct modes. Gaussian Mixture Models address this by representing p(x) as a weighted sum of K Gaussians with distinct means and covariances: p(x) = Σ_k π_k 𝓝(x; μ_k, Σ_k). With enough components, a GMM can approximate virtually any smooth density.
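The mixture density above can be evaluated directly. A minimal NumPy sketch, using illustrative (assumed) parameters for a three-component 1D mixture; a quick trapezoidal check confirms the weighted sum still integrates to about 1:

```python
import numpy as np

def normal_pdf(x, m, s):
    # Univariate Gaussian density N(x; m, s^2)
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

def gmm_pdf(x, pi, mu, sigma):
    # p(x) = sum_k pi_k * N(x; mu_k, sigma_k^2)
    return sum(pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(len(pi)))

# Illustrative three-component mixture (parameters are assumptions, not fitted)
pi = np.array([0.3, 0.5, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.6, 1.0, 0.8])

xs = np.linspace(-6, 7, 500)
density = gmm_pdf(xs, pi, mu, sigma)

# Trapezoidal rule: the mixture density should integrate to roughly 1
area = float(np.sum(0.5 * (density[:-1] + density[1:]) * np.diff(xs)))
```

Because the weights π_k sum to 1 and each component integrates to 1, the mixture is itself a valid density regardless of K.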

Latent Variable Viewpoint

GMMs can be viewed as introducing a latent discrete variable Z taking values 1…K indicating which component generated a sample. The generative process is: sample Z ∼ Categorical(π), then sample X ∼ 𝓝(μ_Z, Σ_Z). The joint density factorises as p(x, z) = π_z 𝓝(x; μ_z, Σ_z). Learning a GMM from data thus amounts to estimating π_k, μ_k, Σ_k from unlabeled samples.
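The generative process can be simulated by ancestral sampling: draw the component index first, then the observation. A sketch with assumed toy parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy parameters for a 1D two-component mixture
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.0])

N = 10_000
# Step 1: Z ~ Categorical(pi) selects a component for each sample
z = rng.choice(len(pi), size=N, p=pi)
# Step 2: X ~ N(mu_Z, sigma_Z^2), vectorised over all samples
x = rng.normal(mu[z], sigma[z])
# The empirical mean should approach sum_k pi_k mu_k = 0.4*(-1) + 0.6*2 = 0.8
```

Learning inverts this process: given only the samples x, the component labels z are unobserved and must be inferred.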

Expectation–Maximization Algorithm

EM alternates between assigning soft cluster memberships and re-estimating parameters. In the E-step, responsibilities γ_{nk} = p(z_n = k | x_n) are computed using the current parameters via Bayes’ rule: γ_{nk} ∝ π_k 𝓝(x_n; μ_k, Σ_k), normalised so that Σ_k γ_{nk} = 1. In the M-step, parameters are updated using these responsibilities as fractional counts: μ_k = Σ_n γ_{nk} x_n / Σ_n γ_{nk}, Σ_k = Σ_n γ_{nk} (x_n−μ_k)(x_n−μ_k)ᵀ / Σ_n γ_{nk}, and π_k = (1/N) Σ_n γ_{nk}. Each EM iteration is guaranteed not to decrease the log-likelihood; it climbs a hill in parameter space until convergence to a local optimum.
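The alternation above can be sketched as a complete EM loop in 1D. The synthetic data, initialisation, and iteration count below are illustrative assumptions; the updates are the E- and M-step formulas from this section:

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, m, s):
    # Univariate Gaussian density N(x; m, s^2)
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

# Synthetic 1D data drawn from two assumed components
x = np.concatenate([rng.normal(-1.0, 0.5, 200), rng.normal(2.0, 1.0, 300)])
N, K = len(x), 2

# Deliberately rough initialisation
pi = np.full(K, 1.0 / K)
mu = np.array([0.0, 1.0])
sigma = np.array([1.0, 1.0])

log_liks = []
for _ in range(50):
    # E-step: gamma_{nk} proportional to pi_k * N(x_n; mu_k, sigma_k)
    num = np.stack([pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(K)], axis=1)
    log_liks.append(np.log(num.sum(axis=1)).sum())  # log-likelihood under current params
    gamma = num / num.sum(axis=1, keepdims=True)    # shape (N, K), rows sum to 1

    # M-step: responsibilities act as fractional counts
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / N
```

The recorded log-likelihoods should form a non-decreasing sequence, which is a useful sanity check on any EM implementation.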

Density Models as Classifiers

If you fit separate GMMs p(x | y = k) for each class k and know or estimate class priors p(y = k), you can classify new samples via Bayes’ rule: p(y = k | x) ∝ p(x | y = k) p(y = k). The decision boundary is where two such posteriors are equal. Unlike linear classifiers, these boundaries can be highly nonlinear and reflect multi-modal structure in class-conditional densities.
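This classification rule can be sketched directly: score each class by its prior times its class-conditional GMM density, then pick the maximiser. The per-class parameters and priors below are illustrative assumptions, not fitted values:

```python
import numpy as np

def normal_pdf(x, m, s):
    # Univariate Gaussian density N(x; m, s^2)
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (np.sqrt(2 * np.pi) * s)

def gmm_pdf(x, pi, mu, sigma):
    # Class-conditional mixture density p(x | y = k)
    return sum(p * normal_pdf(x, m, s) for p, m, s in zip(pi, mu, sigma))

# Assumed class-conditional GMM parameters: class 0 is bimodal, class 1 unimodal
class_models = {
    0: dict(pi=[0.5, 0.5], mu=[-2.0, 0.0], sigma=[0.5, 0.5]),
    1: dict(pi=[1.0], mu=[3.0], sigma=[1.0]),
}
priors = {0: 0.5, 1: 0.5}  # p(y = k), assumed equal

def classify(x):
    # Bayes' rule: p(y = k | x) proportional to p(x | y = k) * p(y = k)
    scores = {k: priors[k] * gmm_pdf(x, **m) for k, m in class_models.items()}
    return max(scores, key=scores.get)
```

Because class 0 is itself bimodal, the resulting decision regions need not be contiguous, which no linear classifier can reproduce.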

Singularities and Regularisation

Maximum likelihood estimation of GMMs is susceptible to singularities: a component’s covariance can collapse toward zero around a single data point, sending its likelihood to infinity and making the overall log-likelihood unbounded above. Numerically this manifests as nearly singular covariance matrices and exploding responsibilities. A practical remedy is to regularise the covariance by adding a small multiple of the identity (εI) after each M-step. This prevents collapse while minimally perturbing well-behaved covariances.
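The εI remedy is a one-line fix. A sketch, assuming a 2D covariance with one eigenvalue that has nearly collapsed to zero:

```python
import numpy as np

def regularise_covariance(Sigma, eps=1e-6):
    # Add eps * I after each M-step to keep the covariance well-conditioned
    return Sigma + eps * np.eye(Sigma.shape[0])

# A covariance that has nearly collapsed onto a single point
Sigma = np.array([[1.0, 0.0], [0.0, 1e-12]])
Sigma_reg = regularise_covariance(Sigma, eps=1e-6)
eigvals = np.linalg.eigvalsh(Sigma_reg)  # every eigenvalue shifted up by eps
```

Adding εI shifts every eigenvalue up by ε, so the determinant stays bounded away from zero and the likelihood cannot diverge, while well-behaved covariances are barely perturbed.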

Examples

Responsibilities in a 1D GMM

Compute posteriors γ_{nk} for a toy 1D mixture of two Gaussians.

import numpy as np

# Mixture parameters: weights, means, standard deviations
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.0])

# Grid of evaluation points standing in for observations x_n
x = np.linspace(-4, 5, 100)

def normal_pdf(x, m, s):
    # Univariate Gaussian density N(x; m, s^2)
    return 1.0 / (np.sqrt(2 * np.pi) * s) * np.exp(-0.5 * ((x - m) / s) ** 2)

# Bayes' rule: gamma_{nk} = pi_k N(x_n; mu_k, sigma_k) / sum_j pi_j N(x_n; mu_j, sigma_j)
num = np.vstack([pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)])
den = num.sum(axis=0, keepdims=True)
resp = num / den  # responsibilities, shape (2, 100); each column sums to 1

Common Mistakes

Assuming EM finds the global optimum

Why: EM is a local optimisation method and can converge to different solutions depending on initialisation.

Fix: Use multiple random restarts or informed initialisation (e.g., K-means centroids) and compare final log-likelihoods.

Ignoring degeneracy warnings or nearly singular covariance matrices

Why: These signal singularities that invalidate the learned model.

Fix: Inspect eigenvalues of covariances and apply diagonal regularisation or constrain minimum variance.
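Eigenvalue inspection and a minimum-variance constraint can be combined in one helper: decompose the covariance, floor small eigenvalues, and reconstruct. A sketch under assumed values (the threshold and the example matrix are illustrative):

```python
import numpy as np

def floor_eigenvalues(Sigma, min_eig=1e-6):
    # Eigendecompose, floor eigenvalues below min_eig, and reconstruct
    vals, vecs = np.linalg.eigh(Sigma)
    degenerate = bool(vals.min() < min_eig)  # flag a near-singular covariance
    vals = np.maximum(vals, min_eig)
    return (vecs * vals) @ vecs.T, degenerate

# A covariance that has nearly collapsed along one direction
Sigma = np.array([[1.0, 0.999], [0.999, 1.0]])  # eigenvalues 1.999 and 0.001
Sigma_fixed, was_degenerate = floor_eigenvalues(Sigma, min_eig=0.01)
```

Unlike adding εI, flooring only touches the offending directions, leaving healthy eigenvalues exactly as estimated.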

Mini Exercises

1. Explain why GMMs can approximate any continuous density on a compact set given enough components.

2. For a 2-component GMM in 1D, derive the E-step responsibility formula explicitly.

Further Reading