Overview
Probability is the language of uncertainty. This chapter builds a rigorous intuition for random variables, events, and Bayes’ theorem, then connects Gaussians, covariance, and correlation directly to how ML models reason about data.
You Will Learn
- Core probability concepts: events, conditional probability, independence
- Bayes’ theorem and posterior reasoning on realistic examples
- Gaussian distributions: parameters, geometry, and why they dominate ML
- Covariance and Pearson correlation as measures of linear dependence
- Joint, marginal, and conditional distributions and how they relate
Main Content
Events, Random Variables, and Probability
A random variable X is a quantity whose value is uncertain before we observe it: tomorrow’s temperature, a customer’s spending, the label of an image. An event is a statement about that variable, such as X > 30 or 'this email is spam.' The probability P(A) quantifies how often we expect event A to occur in repeated trials. For ML, we rarely work with only one event: we care about questions like 'what is the probability this email is spam given that it contains the word lottery?' which is precisely a conditional probability.
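The spam question above can be answered directly from counts. As a minimal sketch (the email counts here are made up purely for illustration), the conditional probability P(spam | lottery) is the fraction of "lottery" emails that are spam:

```python
# Hypothetical email counts, chosen only for illustration.
n_total = 1000
n_lottery = 60            # emails containing the word "lottery"
n_spam_and_lottery = 45   # emails that are spam AND contain "lottery"

# Conditional probability: P(spam | lottery) = P(spam, lottery) / P(lottery)
p_lottery = n_lottery / n_total
p_spam_and_lottery = n_spam_and_lottery / n_total
p_spam_given_lottery = p_spam_and_lottery / p_lottery
print(f"P(spam | lottery) = {p_spam_given_lottery:.2f}")  # 0.75
```

Note that the totals cancel: conditioning restricts attention to the 60 "lottery" emails and asks what fraction of them are spam.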
Bayes’ Theorem: Updating Beliefs
Bayes’ theorem lets us update prior beliefs in light of new evidence. For events A and B with P(B) > 0, it states P(A | B) = P(B | A) P(A) / P(B). Here P(A) is the prior, P(B | A) is the likelihood, and P(A | B) is the posterior. In a loan-default setting, A might be 'customer will default' and B might be 'customer has a certain credit pattern.' Even if the pattern is highly indicative of default (large P(B | A)), the posterior can still be small if defaults are rare overall (small P(A)). This is the base-rate effect, and misunderstanding it leads many people to overinterpret diagnostic tests and model outputs. In ML terms, most classification models are trying to approximate P(y | x) for some label y and features x; Bayes’ rule links this posterior to class-conditional densities P(x | y).
Gaussian Distributions and Parameters
The univariate Gaussian (normal) distribution with mean μ and variance σ² has density p(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)). Geometrically, μ sets the centre of the bell curve, while σ² controls its spread. In higher dimensions, a Gaussian is specified by a mean vector μ ∈ ℝᵈ and a covariance matrix Σ ∈ ℝ^{d×d}. Contours of constant density are ellipses (or ellipsoids) whose shape is determined by Σ. Many ML algorithms either explicitly assume Gaussian structure (Gaussian Naive Bayes, Gaussian mixture models, LDA) or behave as if the data were locally Gaussian due to the Central Limit Theorem. Understanding how μ and Σ affect the geometry of the distribution is crucial when interpreting learned parameters and diagnostic plots.
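The density formula and the geometric role of Σ can be checked numerically. The following sketch evaluates the univariate density directly from the formula above, then inspects a hypothetical 2×2 covariance matrix: the eigenvectors of Σ give the directions of the elliptical contours' axes, and the square roots of the eigenvalues give their relative lengths.

```python
import numpy as np

# Univariate Gaussian density, written directly from the formula
def gaussian_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# The density peaks at the mean; for the standard normal this is ~0.3989
print(gaussian_pdf(0.0, 0.0, 1.0))

# In 2D, the covariance matrix shapes the elliptical contours.
# This Sigma is an arbitrary example; the positive off-diagonal
# entry tilts the ellipse away from the coordinate axes.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
eigvals, eigvecs = np.linalg.eigh(Sigma)
# Eigenvectors: axis directions; sqrt(eigenvalues): axis lengths
print("axis lengths ~", np.sqrt(eigvals))
print("axis directions:\n", eigvecs)
```

A diagonal Σ would produce axis-aligned ellipses; equal diagonal entries with zero off-diagonals would produce circles.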
Covariance and Pearson Correlation
Given two scalar random variables X and Y with means μ_X and μ_Y, the covariance is defined as cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]. Empirically we estimate it with (1/n) Σ_i (x_i − x̄)(y_i − ȳ), or with 1/(n − 1) in place of 1/n for the unbiased version. If X tends to be above its mean when Y is above its mean, the covariance is positive; if X tends to be above its mean when Y is below its mean, it is negative. The issue with raw covariance is its dependence on the units of measurement. Pearson’s correlation coefficient fixes this by normalising with the standard deviations: ρ = cov(X, Y) / (σ_X σ_Y). This yields a dimensionless value in [−1, 1] that measures linear dependence. In ML, correlation analysis is used to detect redundant features and to sanity-check data: very high |ρ| between features may indicate multicollinearity that can destabilise linear models.
Joint, Marginal, and Conditional Distributions
The joint distribution p(x, y) tells you how likely specific pairs (x, y) are. From the joint you can recover marginals and conditionals. The marginal p(x) is obtained by summing or integrating out Y: p(x) = Σ_y p(x, y). The conditional p(y | x) = p(x, y) / p(x) describes how Y behaves when X is fixed. In practice, we often estimate these quantities via histograms or kernel density estimators. For example, in the Iris dataset, the joint over petal length and petal width differs drastically between species, which is why a classifier can separate them. Understanding these relationships at the probability level makes it clear why feature engineering and distribution shifts matter so much.
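For discrete variables, the marginalisation and conditioning rules above reduce to row and column operations on a table. The following sketch uses a small made-up joint table over X ∈ {0, 1} (rows) and Y ∈ {0, 1, 2} (columns), chosen only so that it sums to 1:

```python
import numpy as np

# Toy joint distribution p(x, y); values are hypothetical
joint = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.20, 0.10]])
assert np.isclose(joint.sum(), 1.0)  # a valid joint sums to 1

# Marginal p(x): sum out Y across the columns
p_x = joint.sum(axis=1)              # [0.4, 0.6]

# Conditional p(y | x) = p(x, y) / p(x): renormalise each row
p_y_given_x = joint / p_x[:, None]
print("p(x)       =", p_x)
print("p(y | x=0) =", p_y_given_x[0])  # each row sums to 1
```

Marginalising discards information about Y, while conditioning on X = x zooms in on one row of the table and rescales it into a proper distribution over Y.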
Examples
Loan Default Example with Bayes’ Theorem
Computing the posterior probability of default given an observed credit pattern.
prior_default = 0.01
likelihood_pattern_given_default = 0.80
likelihood_pattern_given_safe = 0.10
p_pattern = (
likelihood_pattern_given_default * prior_default
+ likelihood_pattern_given_safe * (1 - prior_default)
)
posterior_default = (
likelihood_pattern_given_default * prior_default / p_pattern
)
print(f"P(default | pattern) = {posterior_default:.4f}")
Estimating Covariance and Correlation in NumPy
Compute empirical covariance and Pearson correlation between two features.
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])
x_mean, y_mean = x.mean(), y.mean()
cov_xy = np.mean((x - x_mean) * (y - y_mean))
cor_xy = cov_xy / (x.std(ddof=0) * y.std(ddof=0))
print("cov(x, y) =", cov_xy)
print("corr(x, y) =", cor_xy)
Common Mistakes
Treating P(A | B) and P(B | A) as interchangeable
Why: They can differ dramatically when class priors are imbalanced; P(default | pattern) is not the same as P(pattern | default).
Fix: Always write out Bayes’ theorem explicitly and check which conditional you actually have from data or a model.
Interpreting correlation as causation
Why: High |ρ| indicates a strong linear relationship, but a third variable can drive both X and Y.
Fix: Use correlation for feature screening and diagnostics, but rely on experimental design or causal methods for causal claims.
Ignoring feature distributions when building models
Why: Many algorithms implicitly assume certain distributions; heavy tails or multimodality can break these assumptions.
Fix: Inspect univariate and bivariate plots (histograms, KDEs, scatter plots) for key features before choosing or trusting a model.
Mini Exercises
1. In a disease screening test, prevalence is 0.5%, sensitivity is 99%, and specificity is 95%. Compute P(disease | positive test).
2. Explain in words what the covariance between two features represents.
3. Given a joint distribution p(x, y), describe how to obtain the marginal p(x) and the conditional p(y | x).