Overview
Machine learning is the practice of teaching computers to learn patterns from data instead of hand-coding rules. This chapter introduces the formal notation, core paradigms (supervised, unsupervised, reinforcement learning), the end-to-end ML pipeline, and the bias–variance tradeoff that underpins every model choice.
You Will Learn
- Formal definitions: dataset notation {(x_i, y_i)}, hypothesis f(x;θ), empirical vs expected risk, generalization
- Supervised vs unsupervised vs reinforcement learning
- The ML pipeline: features → model → loss → optimization → generalization
- Parametric vs non-parametric models
- Bias–variance tradeoff and overfitting vs underfitting
- A short Python demo with sklearn (linear regression or classification)
Main Content
What Is Machine Learning?
Machine learning flips the traditional programming paradigm. Instead of writing explicit rules (if house has 3 bedrooms and is in Brooklyn, then price ≈ X), you provide examples — thousands of input–output pairs — and an algorithm discovers the mapping. The computer learns from data.
Formal Notation
A dataset is a collection of examples: D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}. Each xᵢ is a feature vector (e.g., [sqft, bedrooms, age]) and yᵢ is the target (e.g., price). A model is a function f(x; θ) parametrized by θ. Training finds θ that minimizes a loss function. The empirical risk is the average loss on the training set: R̂(θ) = (1/n) Σ L(f(xᵢ; θ), yᵢ). The expected risk is the loss over the true data distribution — what we care about in production. Generalization is the gap between training performance and performance on unseen data.
Supervised vs Unsupervised vs Reinforcement Learning
In supervised learning, every example has a label (x, y). The model learns to predict y from x. Regression predicts continuous values (prices, temperatures); classification predicts discrete labels (spam/not-spam, digit 0–9). In unsupervised learning, there are no labels. The algorithm discovers structure: clustering groups similar points, dimensionality reduction finds compact representations. Reinforcement learning trains an agent that takes actions and receives rewards; the goal is to maximize cumulative reward over time.
The ML Pipeline
Every ML project follows the same flow: (1) Collect and clean data. (2) Extract features — numeric representations the model can use. (3) Choose a model family (linear, tree, neural network). (4) Define a loss function that measures prediction error. (5) Optimize: find parameters that minimize the loss (e.g., via gradient descent). (6) Evaluate on held-out data to assess generalization. (7) Deploy and monitor.
Parametric vs Non-Parametric
Parametric models (e.g., linear regression, neural networks) have a fixed number of parameters regardless of dataset size. Non-parametric models (e.g., k-NN, decision trees) grow with the data — they store or use the training set at prediction time. Parametric models are faster at inference; non-parametric ones can be more flexible but need more data.
Bias–Variance Tradeoff
Error decomposes into bias (how wrong the model is on average — underfitting) and variance (how much predictions vary with different training sets — overfitting). Simple models have high bias, low variance; complex models have low bias, high variance. Regularization, cross-validation, and early stopping help find the sweet spot.
Examples
Linear Regression with scikit-learn
A minimal example: load data, fit a linear model, and evaluate.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"R² on test set: {score:.4f}")Classification with Logistic Regression
Binary classification on the Iris dataset (two species).
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
# Use only 2 classes for binary classification
X, y = X[y != 2], y[y != 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f"Accuracy: {acc:.4f}")Common Mistakes
Evaluating on the same data you trained on
Why: The model can memorize training examples, giving falsely high accuracy.
Fix: Always use a held-out test set (or cross-validation) for honest evaluation.
Using test data for model selection or tuning
Why: You leak information into the model; the test set is no longer 'unseen.'
Fix: Use a validation set for hyperparameter tuning; reserve the test set for final evaluation only.
Ignoring feature scale
Why: Gradient descent and distance-based methods are sensitive to feature magnitudes.
Fix: Standardize (z-score) or normalize features before training.
Mini Exercises
1. In your own words, what is the difference between empirical risk and expected risk?
2. Give one example each of a supervised, unsupervised, and reinforcement learning problem.
3. Why does a model that fits the training data perfectly often perform poorly on new data?