Overview
Linear regression is the canonical model for predicting continuous quantities. This chapter develops its hypothesis function, mean squared error loss, gradient descent optimisation, feature normalisation, and the bias–variance trade-off, culminating in Ridge (L2) regularisation.
You Will Learn
- The linear regression hypothesis ŷ = wᵀx + b and geometric interpretation
- Mean Squared Error (MSE) as a loss function and its properties
- Batch gradient descent updates derived from first principles
- Why feature scaling is critical for optimisation
- How model complexity, noise and regularisation interact via bias–variance
- How L2 regularisation (Ridge) constrains weights and controls overfitting
Main Content
Hypothesis Function and Geometry
For a d-dimensional feature vector x ∈ ℝᵈ, linear regression assumes a hypothesis of the form ŷ = wᵀx + b, where w ∈ ℝᵈ and b ∈ ℝ. Each weight w_j measures how sensitive the prediction is to feature x_j. In two dimensions the graph is a line; in higher dimensions it is a hyperplane. Interpreting w_j in the context of standardised features allows you to say things like 'a one-standard-deviation increase in BMI is associated with a w_j-unit change in the target,' which is the basis for effect size interpretation in many applied fields.
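The hypothesis is just an inner product plus a scalar offset. A minimal sketch, using made-up weights and a single feature vector (all values here are illustrative, not from the text):

```python
import numpy as np

# Assumed example values: d = 3 features.
w = np.array([2.0, -1.0, 0.5])   # weight vector w ∈ ℝ³
b = 4.0                           # bias term
x = np.array([1.0, 3.0, 2.0])    # one feature vector

# ŷ = wᵀx + b
y_hat = w @ x + b
print(y_hat)  # → 4.0
```

Each term w_j * x_j contributes independently, which is what makes the per-feature sensitivity reading possible.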
Mean Squared Error and Convexity
Given n training examples (x_i, y_i), the Mean Squared Error loss is J(w, b) = (1/n) Σ (wᵀx_i + b − y_i)². This is a convex quadratic function in (w, b), which means it has a unique global minimum. Convexity is extremely valuable: any local descent method that continues to decrease the loss must converge to the global optimum. Analytically, one can solve for (w, b) in closed form via the normal equations, but in high dimensions or when adding regularisation, gradient-based methods are more practical and extend naturally to neural networks.
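The closed-form solution mentioned above can be sketched by absorbing b into w via a column of ones and solving the normal equations; `np.linalg.lstsq` is the numerically stable way to do this. The data below is synthetic and noiseless, chosen so the recovered parameters are exact:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
# Noiseless targets with true w = (1.5, -2.0) and b = 0.5.
y = X @ np.array([1.5, -2.0]) + 0.5

# Append a column of ones so the bias b is absorbed into the weight vector.
Xb = np.c_[X, np.ones(len(X))]

# Normal equations: w = (XᵀX)⁻¹ Xᵀ y, solved stably via least squares.
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(w)  # ≈ [1.5, -2.0, 0.5]
```

With noisy data the recovered parameters would only approximate the true ones, but the convexity argument guarantees this is still the unique global minimiser of the MSE.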
Gradient Descent for Linear Regression
To minimise MSE via gradient descent we compute partial derivatives. Writing predictions as ŷ_i = wᵀx_i + b, the gradient with respect to w is ∂J/∂w = (2/n) Xᵀ(Xw + b1 − y), and the gradient for b is ∂J/∂b = (2/n) Σ (ŷ_i − y_i). The update rule is w ← w − α ∂J/∂w and b ← b − α ∂J/∂b, where α is the learning rate. Too large an α leads to divergence; too small slows convergence. In practice you diagnose bad choices of α by visualising the training loss over iterations.
Feature Scaling and Conditioning
If features have wildly different scales (e.g., income in thousands vs age in years), the loss surface becomes elongated and ill-conditioned. Gradient descent then 'zig-zags' and converges slowly. Standardisation (z-scoring) each column of X to zero mean and unit variance makes the problem isotropic in parameter space, leading to more circular level sets and faster, more stable optimisation. Importantly, you must compute scaling parameters on the training set only and reuse them on validation and test sets to avoid data leakage.
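The train-only scaling discipline can be sketched as follows; the toy values (income in currency units vs age in years) are illustrative:

```python
import numpy as np

# Assumed toy data: column 0 is income, column 1 is age.
X_train = np.array([[1000.0, 25.0],
                    [2000.0, 35.0],
                    [3000.0, 45.0]])
X_test = np.array([[1500.0, 30.0]])

# Compute scaling statistics on the training set ONLY.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma   # reuse training statistics: no leakage
```

After this transform every training column has zero mean and unit variance, while the test set is mapped through the same (training-derived) statistics, so no information from the test set influences the model.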
Bias–Variance and L2 Regularisation
Increasing model flexibility (more features, polynomial terms) reduces bias but increases variance: the model can fit training noise. L2 regularisation modifies the objective to J_reg(w, b) = J(w, b) + λ‖w‖²₂, leaving the bias b unpenalised. This shrinks weights toward zero, effectively penalising overly complex solutions. Geometrically, the penalised optimum is equivalent to constraining w to lie in an L2 ball whose radius shrinks as λ grows; the solution sits where the loss contours first touch that ball. As λ grows, variance decreases but bias increases. Cross-validation over a grid of λ values lets you trade these off empirically and pick the value that minimises validation error.
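Ridge also has a closed form, obtained by adding λI to the normal equations. A minimal sketch (synthetic data, bias omitted for simplicity, `ridge_fit` is a hypothetical helper name) showing that larger λ shrinks the weight norm:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=30)

def ridge_fit(X, y, lam):
    """Closed-form Ridge solution: w = (XᵀX + λI)⁻¹ Xᵀ y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The weight norm shrinks monotonically as λ grows.
norms = [np.linalg.norm(ridge_fit(X, y, lam))
         for lam in (0.0, 1.0, 10.0, 100.0)]
print(norms)
```

In practice you would evaluate each candidate λ on held-out data (as the text recommends) rather than inspecting weight norms, but the shrinkage effect is the mechanism behind the variance reduction.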
Examples
Gradient Descent on a Toy Dataset
Implement batch gradient descent for a 1D linear regression problem.
import numpy as np
# Synthetic data
x = np.linspace(0, 10, 100)
y_true = 3 * x + 5
noise = np.random.randn(*x.shape) * 2
y = y_true + noise
X = np.c_[x, np.ones_like(x)] # shape (100, 2)
w = np.zeros(2)
alpha = 1e-3
for _ in range(1000):
    preds = X @ w                           # ŷ = Xw (intercept absorbed via ones column)
    grad = 2 / len(X) * X.T @ (preds - y)   # ∂J/∂w for MSE
    w -= alpha * grad
print("Learned weights:", w)
Common Mistakes
Using unscaled features with gradient descent
Why: Leads to slow or unstable convergence because the loss surface is poorly conditioned.
Fix: Standardise features (subtract mean, divide by standard deviation) based on training data before optimisation.
Choosing λ for L2 regularisation by eye on training loss
Why: Regularisation trades training error for generalisation; training loss alone cannot tell you the best λ.
Fix: Use a validation set or cross-validation to select λ that minimises validation error.
Mini Exercises
1. Show that the MSE loss for linear regression is convex in (w, b). Sketch the argument.
2. Explain how increasing λ in Ridge regression affects the bias and variance of the estimator.