THEORY

PCA & Dimensionality Reduction — Making Sense of High Dimensions

Why high-dimensional data is deceptively tricky, how PCA finds the most informative directions, and when to select features instead.

Overview

Principal Component Analysis (PCA) is a cornerstone technique for understanding and compressing high-dimensional data. This chapter explains its geometric intuition, the eigen-decomposition behind it, how explained variance guides component selection, and the trade-offs against feature selection.

You Will Learn

  • Why high-dimensional data often lives near a lower-dimensional subspace
  • How PCA finds principal directions via covariance eigenvectors
  • Interpreting explained variance ratios and choosing the number of components
  • Limitations of PCA (linearity, variance vs task relevance)
  • When to use PCA vs feature selection based on correlation or chi-squared tests

Main Content

The Curse and Blessing of Dimensionality

In high-dimensional spaces, distances can become less informative, and data tends to be sparse. However, many real datasets (images, speech, text embeddings) lie near low-dimensional manifolds embedded in high-dimensional ambient spaces. PCA exploits this by finding directions along which the data varies most, revealing intrinsic structure and enabling dimensionality reduction without severe information loss.
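The claim that distances become less informative can be sketched with synthetic data (all numbers below are illustrative assumptions, not from the chapter): as the dimension grows, the ratio of the smallest to the largest pairwise distance among random points drifts towards 1, so distance comparisons carry less contrast.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # number of random points at each dimension

for d in (2, 100, 10_000):
    X = rng.standard_normal((n, d))
    # Squared pairwise distances via the Gram matrix identity
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y, avoiding an n×n×d tensor.
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    D = np.sqrt(np.clip(D2[np.triu_indices(n, k=1)], 0, None))
    print(f"d={d:>6}: min/max distance ratio = {D.min() / D.max():.3f}")
```

The ratio rises steadily with d, which is one concrete face of the curse of dimensionality.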

Covariance and Eigen-Decomposition

Given a mean-centred data matrix X ∈ ℝ^{n×d}, the empirical covariance matrix is C = (1/n) XᵀX. PCA finds eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λ_d and corresponding eigenvectors v₁, …, v_d of C. Each v_k defines a principal direction, and λ_k gives the variance along that direction. Projecting the data onto the first m eigenvectors yields an m-dimensional representation that captures the largest possible variance among all rank-m linear projections.
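The projection step can be sketched in NumPy (the data below is synthetic, chosen only for illustration). Projecting the centred data onto the top-m eigenvectors gives the scores; a useful check is that the mean squared reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 5-dimensional data (an assumption for illustration).
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
Xc = X - X.mean(axis=0)

C = (Xc.T @ Xc) / Xc.shape[0]
vals, vecs = np.linalg.eigh(C)            # eigh: C is symmetric
order = np.argsort(vals)[::-1]            # sort by descending variance
vals, vecs = vals[order], vecs[:, order]

m = 2
Z = Xc @ vecs[:, :m]                      # n×m scores in the top-m subspace
X_hat = Z @ vecs[:, :m].T                 # rank-m reconstruction (still centred)
mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print("reconstruction error:", mse)
print("discarded variance:  ", vals[m:].sum())
```

The two printed numbers agree, which is exactly the optimality property stated above: the top-m projection wastes only the variance of the dropped directions.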

Explained Variance and Component Selection

The proportion of total variance explained by the k-th component is λ_k / Σ_j λ_j. Cumulative explained variance curves show how many components are needed to capture, say, 90% or 95% of the variance. For MNIST digits, a few hundred components (often far fewer) retain most of the information contained in the original 784 pixel dimensions. Component selection trades compression and speed against potential performance loss on downstream tasks.
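A cumulative explained variance curve can be computed directly from the eigenvalues. The data and the 90% threshold below are illustrative choices, not from the chapter:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data whose per-direction variance decays geometrically.
X = rng.standard_normal((500, 20)) * (0.8 ** np.arange(20))
Xc = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix, sorted in descending order.
vals = np.linalg.eigvalsh(Xc.T @ Xc / Xc.shape[0])[::-1]
ratios = vals / vals.sum()           # λ_k / Σ_j λ_j for each component
cumulative = np.cumsum(ratios)       # cumulative explained variance curve

m_90 = int(np.searchsorted(cumulative, 0.90)) + 1
print(f"components needed for 90% variance: {m_90} of {len(vals)}")
```

The same recipe applied to MNIST pixels produces the curve described in the text, with the curve flattening well before all 784 components are used.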

PCA vs Feature Selection

PCA creates new orthogonal features (principal components) as linear combinations of original features. This often yields better compression but sacrifices direct interpretability: 'principal component 7' is harder to explain than 'blood pressure.' Feature selection, in contrast, chooses a subset of original features based on criteria such as correlation with the target or chi-squared scores. It preserves semantics but may be less compact or efficient. In practice, you often use PCA for representation learning and selection methods when interpretability is paramount.
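A minimal sketch of correlation-based feature selection, assuming a synthetic target that depends on only two columns (the data, coefficients, and k are all made up for illustration). Unlike PCA, the output is a set of original column indices whose semantics are preserved:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 300, 10
X = rng.standard_normal((n, d))
# Hypothetical target depending on features 0 and 3 only, plus small noise.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.standard_normal(n)

# Score each original column by |correlation with the target| and keep the
# top k. The selected features keep their names and units, unlike PCA scores.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(d)])
k = 2
selected = np.argsort(corr)[::-1][:k]
print("selected feature indices:", sorted(selected.tolist()))
```

The selector recovers the two informative columns; a chi-squared score would play the same role for non-negative count features against a categorical target.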

Examples

Eigen-Decomposition with NumPy

Compute PCA directions for a small dataset using eigen-decomposition of the covariance matrix.

import numpy as np

# Toy dataset: five points in two dimensions.
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Centre each feature so the covariance is taken about the mean.
X_centered = X - X.mean(axis=0, keepdims=True)
C = (X_centered.T @ X_centered) / X_centered.shape[0]

# eigh is the right routine for a symmetric matrix; it returns eigenvalues
# in ascending order, so re-sort into descending order of variance.
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
print("Eigenvalues:", vals)
print("First principal direction:", vecs[:, 0])
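As a sanity check (a sketch, not part of the original example), the same quantities fall out of the singular value decomposition of the centred data: the right singular vectors are the principal directions, and the eigenvalues of C are the squared singular values divided by n.

```python
import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])
Xc = X - X.mean(axis=0)

# Right singular vectors of the centred data = principal directions;
# eigenvalues of the covariance matrix = s**2 / n.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
print("Eigenvalues via SVD:", s**2 / Xc.shape[0])
print("First principal direction (up to sign):", Vt[0])
```

The SVD route is numerically preferable in practice because it never forms XᵀX explicitly.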

Common Mistakes

Applying PCA on unscaled features in heterogeneous units

Why: Features with large variance purely due to units dominate the principal components.

Fix: Standardise features (zero mean, unit variance) before PCA when units differ.
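To see the effect concretely, here is a sketch with made-up quantities in mismatched units (height in metres, arm span in millimetres); the exact numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical example: two strongly related quantities recorded in very
# different units, so the millimetre column has vastly larger raw variance.
height_m = rng.normal(1.7, 0.1, 500)
span_mm = height_m * 1000 + rng.normal(0, 50, 500)
X = np.column_stack([height_m, span_mm])

def first_direction(X):
    """Leading eigenvector of the empirical covariance matrix."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
    return vecs[:, np.argmax(vals)]

print("raw units:   ", first_direction(X))      # millimetre column dominates
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print("standardised:", first_direction(X_std))  # both features contribute
```

On the raw data the first direction is almost entirely the millimetre axis; after standardisation both features carry comparable weight.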

Interpreting PCA components as directly causal or semantically pure

Why: Components are linear combinations of features and may mix multiple underlying factors.

Fix: Use PCA primarily as a mathematical tool for compression and visualisation; be cautious with causal narratives.

Mini Exercises

1. Derive the PCA optimisation objective and show that its solution is given by the top eigenvectors of the covariance matrix.

2. Explain why the sum of all eigenvalues of C equals the total variance of the dataset.

Further Reading