PRACTICE

Probability in Practice — Bayes, Gaussians & the Iris Dataset

Implement Bayes' theorem from scratch, plot Gaussian PDFs, compute covariance and PCC by hand and with NumPy, and explore feature distributions on the Iris dataset.

Download Notebook (.ipynb)

Overview

This chapter turns probability theory into executable code. You will experiment numerically with Bayes’ theorem, Gaussians, covariance, correlation and distribution visualisation on real datasets such as Iris.

You Will Learn

  • Implementing Bayes’ theorem numerically and exploring base-rate effects
  • Plotting and comparing Gaussian PDFs with different means and variances
  • Computing covariance and Pearson correlation from scratch and via NumPy
  • Exploring empirical joint and marginal distributions on the Iris dataset

Main Content

Numerical Bayes and the Base-Rate Fallacy

When implementing Bayes’ theorem in code, you quickly see how sensitive the posterior P(A | B) is to the prior P(A). In highly imbalanced settings (rare diseases, fraud detection), even very accurate tests can yield a majority of false positives. Simulating this numerically with different priors and likelihoods is an excellent way to build intuition for why model calibration and prior knowledge matter.
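The base-rate effect can be sketched in a few lines. The sensitivity and specificity values below are illustrative assumptions, not figures from the text; the point is how the posterior collapses as the prior shrinks.

```python
def posterior(prior, sensitivity=0.99, specificity=0.99):
    # Bayes' theorem: P(disease | positive) = P(pos | disease) P(disease) / P(pos)
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_pos

for prior in [0.5, 0.1, 0.01, 0.001]:
    print(f"prior={prior:6.3f}  posterior={posterior(prior):.3f}")
```

With a 0.1% base rate, even a test that is 99% sensitive and 99% specific yields a posterior of roughly 9%: most positives are false positives.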

Visualising Gaussian Families

By plotting multiple Gaussian PDFs on the same axes you can visualise how μ and σ² shape the distribution. Keeping μ fixed and varying σ² shows the trade-off between concentration and spread. Keeping σ² fixed and varying μ illustrates how the mode shifts. These plots directly inform how you think about feature distributions and the assumptions behind Gaussian-based models.

Empirical Covariance and Correlation on Iris

Using the Iris dataset, you can compute the full covariance matrix and correlation matrix across the four standard features. High correlation between petal length and petal width, and much lower correlation between sepal and petal features, highlights which features carry redundant information. Reproducing these matrices with both manual loops and NumPy’s built-ins reinforces your understanding of the formulas.
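The from-scratch versus built-in comparison can be sketched as follows. To keep the snippet self-contained it uses synthetic correlated data rather than Iris, but the same function applies directly to any pair of Iris feature columns.

```python
import numpy as np

def pearson(x, y):
    # Pearson correlation from its definition: cov(x, y) / (std_x * std_y)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)  # strongly correlated with x

print(pearson(x, y))            # manual implementation
print(np.corrcoef(x, y)[0, 1])  # NumPy built-in; the two should agree
```

Because the normalising factors cancel, the result is the same whether you divide covariance and variances by n or by n − 1, which is why `np.corrcoef` needs no ddof argument.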

Approximating Joint and Conditional Distributions

Two-dimensional histograms (or kernel density estimates) approximate the joint distribution p(x, y). Conditioning on species and slicing these histograms reveals class-conditional structures. For example, p(petal_length | species = virginica) looks very different from p(petal_length | species = setosa). This is exactly why simple linear classifiers can separate these classes effectively.
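A minimal sketch of the class-conditional idea, using synthetic stand-ins for the two petal-length distributions (the means and spreads below are assumed for illustration, not measured from Iris):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical petal_length samples for two species
setosa = rng.normal(1.5, 0.2, size=200)
virginica = rng.normal(5.5, 0.5, size=200)

# Normalised histograms approximate the class-conditional densities p(x | species)
bins = np.linspace(0, 8, 33)
p_setosa, _ = np.histogram(setosa, bins=bins, density=True)
p_virginica, _ = np.histogram(virginica, bins=bins, density=True)

# The two densities peak in disjoint regions, which is why a simple
# threshold on petal length already separates these classes well.
print(bins[np.argmax(p_setosa)], bins[np.argmax(p_virginica)])
```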

Examples

Plotting Gaussian PDFs

Overlay several Gaussians with different means and variances.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 400)

def gaussian(x, mu, sigma):
    return 1.0 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

plt.plot(x, gaussian(x, 0, 1), label="μ=0, σ=1")
plt.plot(x, gaussian(x, 0, 2), label="μ=0, σ=2")
plt.plot(x, gaussian(x, 2, 1), label="μ=2, σ=1")
plt.legend(); plt.show()

Covariance Matrix on Iris

Compute the covariance matrix of the four Iris features manually.

from sklearn.datasets import load_iris
import numpy as np

iris = load_iris()
X = iris.data  # shape (150, 4)
X_centered = X - X.mean(axis=0, keepdims=True)

n = X.shape[0]
# Population covariance (divide by n); use n - 1 for the sample estimate,
# which is what np.cov(X.T) returns by default (ddof=1)
Sigma = (X_centered.T @ X_centered) / n
print("Covariance matrix:\n", Sigma)

Common Mistakes

Using population vs sample formulas interchangeably

Why: The difference between dividing by n and (n − 1) matters for small samples and affects the bias of your estimator.

Fix: Be explicit about whether you want population or sample estimates; NumPy and pandas expose both via ddof.

Interpreting noisy empirical histograms as exact distributions

Why: Finite samples and binning choices can create artefacts, especially in tails or sparse regions.

Fix: Use multiple bin sizes, consider kernel density estimation, and always interpret empirical plots with uncertainty in mind.

Mini Exercises

1. Simulate 10,000 samples from two different Gaussians and empirically verify that the sample mean and variance converge to the true parameters.

2. On the Iris dataset, which pair of features has the highest absolute Pearson correlation and what does that imply for feature engineering?

Further Reading