PRACTICE

Classifiers in Action — Iris, Forest Covertype & SMOTE

Train multinomial logistic regression, decision trees, and Gaussian Naive Bayes with sklearn; build confusion matrices; tackle class imbalance with balanced weights and SMOTE on a 580K-sample dataset.

Download Notebook (.ipynb)

Overview

This chapter puts logistic regression, decision trees and Gaussian Naive Bayes into practice on Iris and the imbalanced Forest Covertype dataset, emphasising evaluation via confusion matrices and strategies for dealing with imbalance.

You Will Learn

  • Training and comparing multiple classifiers with scikit-learn
  • Constructing and interpreting confusion matrices
  • Computing per-class precision, recall and F1-score
  • Handling class imbalance with class weights and SMOTE

Main Content

Iris as a Controlled Benchmark

The Iris dataset, with three balanced classes and four features, serves as a clean playground. Training logistic regression, decision tree and GaussianNB side by side reveals different strengths: logistic regression yields smooth linear boundaries, trees can carve complex piecewise-constant regions, and Naive Bayes is extremely fast but can be miscalibrated when its feature-independence assumption is violated.

Confusion Matrices and Per-Class Metrics

For each classifier you compute a confusion matrix and derive per-class precision, recall and F1. On Iris, all models often achieve high overall accuracy, but the confusion matrices expose which classes they confuse (usually versicolor vs virginica). This trains you to go beyond a single scalar score and inspect model behaviour class-by-class.
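The matrix and per-class metrics come straight out of scikit-learn; a minimal sketch, reusing the same Iris split as the example below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes; off-diagonal
# entries show exactly which pairs the model confuses.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# One precision/recall/F1 value per class, rather than a single average.
prec, rec, f1, support = precision_recall_fscore_support(y_test, y_pred)
print(prec, rec, f1)
```

Inspecting the off-diagonal cells for the versicolor and virginica rows is what reveals the confusion the accuracy score hides.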

Forest Covertype: Large-Scale and Imbalanced

Forest Covertype has roughly 581,000 instances, 54 features, and seven classes with highly skewed frequencies. Training the same classifiers naively yields misleadingly high accuracy dominated by majority classes. Confusion matrices reveal that some rare classes have near-zero recall. This is a realistic scenario where default settings fail silently.
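The accuracy illusion is easy to reproduce without downloading the full dataset. A sketch on synthetic labels with a covertype-like skew (the 90/7/3 split is illustrative, not the real covertype frequencies):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
# Skewed labels: class 0 dominates, class 2 is rare.
y = rng.choice([0, 1, 2], size=10_000, p=[0.90, 0.07, 0.03])
X = rng.normal(size=(10_000, 5))  # uninformative features

# Always predicting the majority class already scores ~90% accuracy...
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)
print("accuracy:", accuracy_score(y, y_pred))

# ...while recall on the two rare classes is exactly zero.
print("per-class recall:", recall_score(y, y_pred, average=None))
```

Any real classifier that beats this baseline only slightly may still be ignoring the rare classes entirely, which is why the per-class recall row matters more than the headline accuracy here.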

Balancing with Class Weights and SMOTE

By enabling class_weight='balanced' in logistic regression and decision trees, the learning algorithm upweights minority-class errors and downweights majority-class errors. Applying SMOTE to oversample minority classes before training further equalises class frequencies. Comparing confusion matrices before and after these interventions shows significant recall improvements on rare classes, often at an acceptable cost to precision.
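The effect of class weighting can be sketched on a small synthetic imbalanced problem (scikit-learn only; the 95/5 split is an illustrative stand-in for covertype's rare classes, and SMOTE from the separate imbalanced-learn package would slot in as a resampling step before fit):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Binary problem with a 95/5 class imbalance.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05],
    class_sep=0.8, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Compare minority-class recall with and without balanced weights.
results = {}
for cw in (None, "balanced"):
    clf = LogisticRegression(max_iter=1000, class_weight=cw)
    clf.fit(X_train, y_train)
    results[cw] = recall_score(y_test, clf.predict(X_test))
    print(f"class_weight={cw}: minority recall = {results[cw]:.2f}")
```

The balanced run trades some precision for a clear jump in minority recall, mirroring what the covertype confusion matrices show at full scale.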

Examples

Training Multiple Classifiers on Iris

Logistic regression, decision tree and Gaussian Naive Bayes with scikit-learn.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "logreg": LogisticRegression(max_iter=1000),  # raised iteration cap so the solver converges
    "tree": DecisionTreeClassifier(max_depth=4),  # shallow tree to limit overfitting
    "gnb": GaussianNB(),
}

# Fit each model on the same split and report per-class metrics.
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name)
    print(classification_report(y_test, y_pred))

Common Mistakes

Failing to stratify train/test splits on imbalanced data

Why: Random splits can produce even more skewed class distributions in train and test, making evaluation unstable.

Fix: Use stratified splitting so that each partition retains the overall class proportions.

Applying SMOTE to both training and test sets

Why: Synthetic samples in the test set break the evaluation by leaking generated patterns.

Fix: Apply SMOTE only on the training set; keep validation and test sets strictly real data.
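Since SMOTE itself lives in the separate imbalanced-learn package, the train-only discipline can be illustrated with plain random oversampling via sklearn.utils.resample (a simpler substitute for SMOTE, not the same algorithm); SMOTE's fit_resample call would replace the resample step in the same position:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Oversample the minority class IN THE TRAINING SET ONLY.
minority = y_train == 1
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    n_samples=int((~minority).sum()),  # match the majority count
    replace=True, random_state=0,
)
X_bal = np.vstack([X_train[~minority], X_min_up])
y_bal = np.concatenate([y_train[~minority], y_min_up])

# The test set keeps its natural imbalance -- no synthetic rows here.
print("balanced train counts:", np.bincount(y_bal))
print("untouched test counts:", np.bincount(y_test))
```

The key invariant: X_test and y_test never pass through the resampler, so evaluation reflects the class frequencies the model will meet in the wild.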

Mini Exercises

1. On a subset of Forest Covertype, compare the confusion matrix of a decision tree trained with and without class_weight='balanced'. What changes do you observe for minority classes?

2. Explain why Naive Bayes can perform well even when its independence assumptions are violated.

Further Reading