Overview
Linear regression is the canonical model for predicting continuous quantities. This chapter develops its hypothesis function, mean squared error loss, gradient descent optimisation, feature normalisation, and the bias–variance trade-off, culminating in Ridge (L2) regularisation.
You Will Learn
- The linear regression hypothesis ŷ = wᵀx + b and geometric interpretation
- Mean Squared Error (MSE) as a loss function and its properties
- Batch gradient descent updates derived from first principles
- Why feature scaling is critical for optimisation
- How model complexity, noise and regularisation interact via bias–variance
- How L2 regularisation (Ridge) constrains weights and controls overfitting
Main Content
Hypothesis Function and Geometry
For a d-dimensional feature vector x ∈ ℝᵈ, linear regression assumes a hypothesis of the form ŷ = wᵀx + b, where w ∈ ℝᵈ and b ∈ ℝ. Each weight w_j measures how sensitive the prediction is to feature x_j. In two dimensions the graph is a line; in higher dimensions it is a hyperplane. Interpreting w_j in the context of standardised features allows you to say things like 'a one-standard-deviation increase in BMI is associated with a w_j-unit change in the target,' which is the basis for effect size interpretation in many applied fields.
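The hypothesis is just an inner product plus a scalar offset. A minimal sketch, using made-up weights and a single feature vector (all values here are illustrative, not from the text):

```python
import numpy as np

# Assumed example values: d = 3 features.
w = np.array([2.0, -1.0, 0.5])   # weight vector w ∈ ℝ³
b = 4.0                           # bias term
x = np.array([1.0, 3.0, 2.0])    # one feature vector

# ŷ = wᵀx + b
y_hat = w @ x + b
print(y_hat)  # → 4.0
```

Each term w_j * x_j contributes independently, which is what makes the per-feature sensitivity reading possible.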
Mean Squared Error and Convexity
Given n training examples (x_i, y_i), the Mean Squared Error loss is J(w, b) = (1/n) Σ (wᵀx_i + b − y_i)². This is a convex quadratic function in (w, b), which means it has a unique global minimum. Convexity is extremely valuable: any local descent method that continues to decrease the loss must converge to the global optimum. Analytically, one can solve for (w, b) in closed form via the normal equations, but in high dimensions or when adding regularisation, gradient-based methods are more practical and extend naturally to neural networks.
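The closed-form solution mentioned above can be sketched by absorbing b into w via a column of ones and solving the normal equations; `np.linalg.lstsq` is the numerically stable way to do this. The data below is synthetic and noiseless, chosen so the recovered parameters are exact:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
# Noiseless targets with true w = (1.5, -2.0) and b = 0.5.
y = X @ np.array([1.5, -2.0]) + 0.5

# Append a column of ones so the bias b is absorbed into the weight vector.
Xb = np.c_[X, np.ones(len(X))]

# Normal equations: w = (XᵀX)⁻¹ Xᵀ y, solved stably via least squares.
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(w)  # ≈ [1.5, -2.0, 0.5]
```

With noisy data the recovered parameters would only approximate the true ones, but the convexity argument guarantees this is still the unique global minimiser of the MSE.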
Gradient Descent for Linear Regression
To minimise MSE via gradient descent we compute partial derivatives. Writing predictions as ŷ_i = wᵀx_i + b, the gradient with respect to w is ∂J/∂w = (2/n) Xᵀ(Xw + b1 − y), and the gradient for b is ∂J/∂b = (2/n) Σ (ŷ_i − y_i). The update rule is w ← w − α ∂J/∂w and b ← b − α ∂J/∂b, where α is the learning rate. Too large an α leads to divergence; too small slows convergence. In practice you diagnose bad choices of α by visualising the training loss over iterations.
Feature Scaling and Conditioning
If features have wildly different scales (e.g., income in thousands vs age in years), the loss surface becomes elongated and ill-conditioned. Gradient descent then 'zig-zags' and converges slowly. Standardisation (z-scoring) each column of X to zero mean and unit variance makes the problem isotropic in parameter space, leading to more circular level sets and faster, more stable optimisation. Importantly, you must compute scaling parameters on the training set only and reuse them on validation and test sets to avoid data leakage.
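The train-only scaling discipline can be sketched as follows; the toy values (income in currency units vs age in years) are illustrative:

```python
import numpy as np

# Assumed toy data: column 0 is income, column 1 is age.
X_train = np.array([[1000.0, 25.0],
                    [2000.0, 35.0],
                    [3000.0, 45.0]])
X_test = np.array([[1500.0, 30.0]])

# Compute scaling statistics on the training set ONLY.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma   # reuse training statistics: no leakage
```

After this transform every training column has zero mean and unit variance, while the test set is mapped through the same (training-derived) statistics, so no information from the test set influences the model.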
Bias–Variance and L2 Regularisation
Increasing model flexibility (more features, polynomial terms) reduces bias but increases variance: the model can fit training noise. L2 regularisation modifies the objective to J_reg(w, b) = J(w, b) + λ‖w‖²₂, leaving the bias b unpenalised. This shrinks weights toward zero, effectively penalising overly complex solutions. Geometrically, the penalised optimum is equivalent to constraining w to lie in an L2 ball whose radius shrinks as λ grows; the solution sits where the loss contours first touch that ball. As λ grows, variance decreases but bias increases. Cross-validation over a grid of λ values lets you trade these off empirically and pick the value that minimises validation error.
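Ridge also has a closed form, obtained by adding λI to the normal equations. A minimal sketch (synthetic data, bias omitted for simplicity, `ridge_fit` is a hypothetical helper name) showing that larger λ shrinks the weight norm:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=30)

def ridge_fit(X, y, lam):
    """Closed-form Ridge solution: w = (XᵀX + λI)⁻¹ Xᵀ y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The weight norm shrinks monotonically as λ grows.
norms = [np.linalg.norm(ridge_fit(X, y, lam))
         for lam in (0.0, 1.0, 10.0, 100.0)]
print(norms)
```

In practice you would evaluate each candidate λ on held-out data (as the text recommends) rather than inspecting weight norms, but the shrinkage effect is the mechanism behind the variance reduction.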
Examples
Gradient Descent on a Toy Dataset
Implement batch gradient descent for a 1D linear regression problem.
import numpy as np
# Synthetic data
x = np.linspace(0, 10, 100)
y_true = 3 * x + 5
noise = np.random.randn(*x.shape) * 2
y = y_true + noise
X = np.c_[x, np.ones_like(x)] # shape (100, 2)
w = np.zeros(2)
alpha = 1e-3
for _ in range(1000):
    preds = X @ w                           # ŷ = Xw (intercept absorbed via ones column)
    grad = 2 / len(X) * X.T @ (preds - y)   # ∂J/∂w for MSE
    w -= alpha * grad
print("Learned weights:", w)
Common Mistakes
Using unscaled features with gradient descent
Why: Leads to slow or unstable convergence because the loss surface is poorly conditioned.
Fix: Standardise features (subtract mean, divide by standard deviation) based on training data before optimisation.
Choosing λ for L2 regularisation by eye on training loss
Why: Regularisation trades training error for generalisation; training loss alone cannot tell you the best λ.
Fix: Use a validation set or cross-validation to select λ that minimises validation error.
Mini Exercises
1. Show that the MSE loss for linear regression is convex in (w, b). Sketch the argument.
2. Explain how increasing λ in Ridge regression affects the bias and variance of the estimator.