THEORY

Decision Trees, Naive Bayes & the Class Imbalance Problem

How decision trees split on entropy, why Naive Bayes is 'naive' yet powerful, confusion matrices, and strategies for imbalanced datasets.

Overview

This chapter extends classification beyond linear models to decision trees, Gaussian Naive Bayes, and the challenges of class imbalance. You will understand entropy-based splitting, the Naive Bayes assumption, and diagnostic tools such as confusion matrices.

You Will Learn

  • How decision trees grow by greedy entropy-based splits
  • Information gain and the role of impurity measures
  • The Naive Bayes assumption and why it often still works
  • Confusion matrices, precision, recall and F1-score
  • The impact of class imbalance and strategies to mitigate it

Main Content

Decision Trees and Entropy

A decision tree recursively partitions feature space into axis-aligned regions by asking questions of the form x_j ≤ t. At each node it chooses the feature and threshold that maximise a purity gain. Purity is measured via entropy H(S) = −Σ_k p_k log₂ p_k, where p_k is the proportion of class k in node S. A pure node (all samples in one class) has entropy 0; a maximally mixed node attains the maximum of log₂ K bits when all K classes are equally represented. Information gain is IG = H(parent) − Σ_child (|child|/|parent|) H(child), the weighted reduction in entropy achieved by the split. Greedy splitting on the highest IG approximates a complex decision surface by a hierarchy of simple rules.
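The greedy threshold search can be sketched for a single feature. This is a minimal illustration, not a production tree learner; the helper names `entropy` and `best_split` are chosen here for clarity.

```python
import numpy as np

def entropy(y):
    """Entropy in bits of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(x, y):
    """Scan candidate thresholds of one feature, keep the one with highest IG."""
    H_parent = entropy(y)
    best_t, best_ig = None, -1.0
    for t in np.unique(x)[:-1]:           # last value cannot split the node
        left, right = y[x <= t], y[x > t]
        ig = (H_parent
              - (len(left) / len(y)) * entropy(left)
              - (len(right) / len(y)) * entropy(right))
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
t, ig = best_split(x, y)
print(t, ig)  # threshold 3.0 separates the classes perfectly: IG = 1.0 bit
```

A real tree learner repeats this search over every feature at every node, which is why training cost grows with both depth and feature count.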

Overfitting in Trees

An unconstrained tree grown to purity will memorise the training data, achieving near-zero training error but often poor test performance. This is overfitting. Controlling tree depth, minimum samples per leaf, and minimum impurity decrease are common regularisation strategies. Ensemble methods such as random forests and gradient-boosted trees combat overfitting by averaging or boosting many weakly correlated trees.
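The effect of these constraints is easy to see on pure-noise labels, where anything "learned" is memorisation. A short sketch, assuming scikit-learn is installed; the specific hyperparameter values are illustrative, not recommendations:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)  # random labels: there is nothing to learn

# Unconstrained tree: grows until every leaf is pure.
deep = DecisionTreeClassifier(random_state=0).fit(X, y)

# Regularised tree: depth and leaf-size limits prevent memorisation.
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10,
                                 random_state=0).fit(X, y)

print("unconstrained train accuracy:", deep.score(X, y))   # 1.0 (memorised)
print("regularised train accuracy:", shallow.score(X, y))  # well below 1.0
```

High training accuracy on noise is a warning sign, not an achievement; the regularised tree's lower training score here reflects the honest absence of signal.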

Gaussian Naive Bayes

Naive Bayes models p(x | y) as a product of per-feature class-conditional densities: p(x | y) = Π_j p(x_j | y). The 'naive' assumption is that features are conditionally independent given the class. In Gaussian Naive Bayes, each p(x_j | y=k) is modelled as a univariate Gaussian with class-specific mean μ_{jk} and variance σ²_{jk}. Training is trivial: for each class and feature, compute μ and σ² (in practice a small floor is added to each variance for numerical stability). Prediction uses Bayes' rule: p(y | x) ∝ p(y) Π_j p(x_j | y), typically evaluated in log space to avoid underflow. Despite the unrealistic independence assumption, Naive Bayes can perform surprisingly well and is extremely fast.
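The whole fit-and-predict cycle fits in a few lines of NumPy. This is a from-scratch sketch for exposition (the function names `fit_gnb` and `predict_gnb` are ours), working in log space as described above:

```python
import numpy as np

def fit_gnb(X, y):
    """Per class: prior, per-feature mean, per-feature variance."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X),       # prior p(y = k)
                     Xk.mean(axis=0),        # mu_{jk}
                     Xk.var(axis=0) + 1e-9)  # sigma^2_{jk}, small floor added
    return params

def predict_gnb(X, params):
    scores = []
    for k, (prior, mu, var) in params.items():
        # log p(y=k) + sum_j log N(x_j | mu_jk, sigma^2_jk)
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_lik)
    keys = np.array(list(params.keys()))
    return keys[np.argmax(np.array(scores), axis=0)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),   # class 0 around (0, 0)
               rng.normal(4.0, 1.0, size=(50, 2))])  # class 1 around (4, 4)
y = np.array([0] * 50 + [1] * 50)
preds = predict_gnb(X, fit_gnb(X, y))
print("training accuracy:", (preds == y).mean())
```

Note that there is no iterative optimisation at all: training is a single pass of counting and averaging, which is why Naive Bayes scales so well.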

Confusion Matrices and Class Imbalance

Accuracy alone obscures where a classifier fails, especially in imbalanced settings. A confusion matrix counts true vs predicted labels for each class, exposing which classes are often confused. From it you derive precision (TP/(TP+FP)), recall (TP/(TP+FN)), and F1-score (the harmonic mean of precision and recall). In heavily imbalanced datasets it is common to see high accuracy but terrible recall on minority classes, which may be exactly the classes you care about (fraud, rare diseases).
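The gap between accuracy and minority-class metrics is easy to demonstrate with a tiny made-up label set (the data below is illustrative only):

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 8 negatives, 2 positives
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Confusion matrix: rows = true class, columns = predicted class.
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

TP, FP, FN = cm[1, 1], cm[0, 1], cm[1, 0]
accuracy = (y_true == y_pred).mean()
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print(accuracy, precision, recall, f1)  # 0.8 accuracy, but only 0.5 recall
```

An 80% accuracy sounds respectable, yet the classifier misses half of the positives: exactly the failure mode accuracy hides.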

Handling Imbalanced Data

Two broad strategies address imbalance: modifying the loss and modifying the data. Class-weighted losses penalise mistakes on minority classes more heavily, e.g., via class_weight='balanced' in scikit-learn. Data-level methods oversample minority classes (possibly with synthetic samples as in SMOTE) or undersample majority classes. Evaluating with per-class metrics and inspecting confusion matrices is essential after applying these techniques.
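The simplest data-level method, random oversampling with replacement, can be sketched in plain NumPy (SMOTE additionally interpolates between minority neighbours, which is not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)  # 95:5 imbalance

minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]

# Draw minority indices with replacement until both classes are equal in size.
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])
X_bal, y_bal = X[idx], y[idx]

print(np.bincount(y_bal))  # [95 95]
```

Oversampling duplicates information rather than adding it, so it must be done inside the training split only; oversampling before a train/test split leaks copies of minority examples into the test set.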

Examples

Entropy and Information Gain Example

Compute entropy and information gain for a simple binary split.

import numpy as np

def entropy(counts):
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    return -(probs * np.log2(probs)).sum()

parent = np.array([9, 5])  # e.g., 9 positives, 5 negatives
left = np.array([8, 2])
right = np.array([1, 3])

H_parent = entropy(parent)
H_left = entropy(left)
H_right = entropy(right)

n = parent.sum()
IG = H_parent - (left.sum()/n)*H_left - (right.sum()/n)*H_right
print("Information gain:", IG)

Common Mistakes

Allowing trees to grow without constraints

Why: Unpruned trees can perfectly memorise the training set and generalise poorly.

Fix: Limit depth, require minimum samples per leaf, or use ensembles with built-in regularisation.

Using accuracy as the sole metric in imbalanced datasets

Why: High accuracy can coincide with near-zero recall on the minority class.

Fix: Inspect confusion matrices and report precision, recall and F1 for each class; consider AUROC/PR curves.

Mini Exercises

1. Explain how SMOTE generates synthetic examples for a minority class.

2. Why might a shallow tree underfit while a very deep tree overfits?

Further Reading