Overview
This chapter applies PCA to MNIST for compression and visualisation, compares classification performance on raw vs PCA-reduced features, and explores feature selection on the Diabetes dataset.
You Will Learn
- Using IncrementalPCA and low-rank PCA on large datasets
- Visualising reconstruction quality as more components are added
- Training MLPs on raw and PCA-transformed MNIST features
- Computing correlation and chi-squared scores for feature selection
Main Content
Incremental PCA on MNIST
You apply IncrementalPCA to 70,000 MNIST images, processing data in batches to avoid memory issues. Inspecting explained variance ratios reveals that relatively few components suffice to capture most of the variability in handwritten digits, enabling substantial dimensionality reduction.
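The batched fit and the explained-variance inspection can be sketched as follows. This is a minimal, self-contained sketch: random data stands in for the 70,000 flattened 28×28 MNIST images, and the batch size of 500 is an assumption (IncrementalPCA requires each batch to contain at least `n_components` samples).

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Synthetic stand-in for flattened 28x28 MNIST images (real data: 70,000 x 784)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 784))

# Fit in batches of 500 so the whole matrix never has to sit in memory at once
ipca = IncrementalPCA(n_components=100)
for start in range(0, len(X), 500):
    ipca.partial_fit(X[start:start + 500])

# Cumulative explained variance shows how much variability the first m components capture
cumvar = np.cumsum(ipca.explained_variance_ratio_)
print(f"first 100 components explain {cumvar[-1]:.1%} of the variance")
```

On real MNIST the curve rises far more steeply than on this random stand-in, which is exactly the structure PCA exploits.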
Reconstruction Experiments
By projecting images to m components and back, you qualitatively assess how much information each additional component provides. Reconstructions with 10 components are blurry but recognisable; with 100–150 components they become nearly indistinguishable from originals. This visually ties explained variance ratios to perceptual quality.
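The project-down-then-back experiment reduces to `transform` followed by `inverse_transform`. A runnable sketch, again with synthetic 784-dimensional data standing in for MNIST images:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for flattened MNIST images (real images are 784-dimensional)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 784))

errors = {}
for m in (10, 100):
    pca = PCA(n_components=m).fit(X)
    X_rec = pca.inverse_transform(pca.transform(X))  # project to m dims, then back
    errors[m] = np.mean((X - X_rec) ** 2)            # mean squared reconstruction error
    print(f"m={m:3d}  reconstruction MSE={errors[m]:.4f}")
```

Plotting the reconstructed rows as 28×28 images (on real data) gives the qualitative blurry-to-sharp progression described above; the MSE gives the quantitative counterpart.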
MLP Classification on Raw vs PCA Features
You train identical MLP architectures on raw 784-dimensional inputs and on PCA-reduced inputs of various dimensionalities. Measuring accuracy and training time demonstrates the trade-off: PCA typically preserves accuracy while reducing training time and risk of overfitting, especially when the classifier has many parameters in the first layer.
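A compact version of this comparison, using `load_digits` (8×8 images) as an offline stand-in for MNIST and a small `MLPClassifier`; the hidden layer size and component count here are illustrative assumptions, not the chapter's exact settings. Note that PCA is fitted on the training split only:

```python
import time
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# load_digits (8x8 images, 64 features) stands in for MNIST so the sketch runs offline
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def fit_and_score(X_train, X_test):
    """Train an identical MLP and report test accuracy plus wall-clock fit time."""
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
    t0 = time.time()
    clf.fit(X_train, y_tr)
    return clf.score(X_test, y_te), time.time() - t0

acc_raw, t_raw = fit_and_score(X_tr, X_te)

pca = PCA(n_components=20).fit(X_tr)  # fit the projection on training data only
acc_pca, t_pca = fit_and_score(pca.transform(X_tr), pca.transform(X_te))

print(f"raw 64-dim: acc={acc_raw:.3f}  time={t_raw:.2f}s")
print(f"PCA 20-dim: acc={acc_pca:.3f}  time={t_pca:.2f}s")
```

The first-layer weight matrix shrinks from 64×64 to 20×64, which is where most of the speed-up and the reduced overfitting risk come from.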
Feature Selection on the Diabetes Dataset
For the Diabetes dataset, you compute Pearson correlation between each feature and the target, as well as chi-squared scores for discretised features. Ranking features by these scores highlights the most predictive variables. Comparing models trained on the full feature set, on PCA components, and on selected features reveals how different dimensionality reduction strategies impact interpretability and performance.
Examples
IncrementalPCA Usage Sketch
Applying IncrementalPCA with batching on MNIST.
from sklearn.decomposition import IncrementalPCA
ipca = IncrementalPCA(n_components=100)
for X_batch in iterate_mnist_batches():
    ipca.partial_fit(X_batch)  # incrementally update components, one batch at a time
X_reduced = ipca.transform(X_full)
Common Mistakes
Projecting validation/test data using PCA fitted on the full dataset
Why: Fitting PCA on all data leaks information from validation/test into the projection, compromising evaluation.
Fix: Fit PCA on the training set only, then apply the learned transform to validation and test sets.
Choosing number of components solely by variance explained without considering downstream task performance
Why: Some components capturing small variance may still be crucial for prediction.
Fix: Combine explained variance analysis with cross-validated performance of downstream models.
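The recommended fix can be automated by treating `n_components` as a hyperparameter and cross-validating the downstream model. A minimal sketch, using `load_digits` and logistic regression as stand-ins for the chapter's MNIST/MLP setup:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Choose n_components by cross-validated downstream accuracy, not variance alone
X, y = load_digits(return_X_y=True)
pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression(max_iter=2000))])
grid = GridSearchCV(pipe, {"pca__n_components": [5, 20, 40]}, cv=3)
grid.fit(X, y)
print("best n_components:", grid.best_params_["pca__n_components"])
```

Because PCA sits inside the pipeline, each CV fold refits the projection on its own training portion, which also avoids the leakage mistake above.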
Mini Exercises
1. Train an MLP on MNIST using raw pixels, 50-component PCA, 100-component PCA and 200-component PCA. Compare accuracy and training time.
2. On the Diabetes dataset, compare feature subsets chosen by correlation and chi-squared score. Where do they agree and differ?