Python Fundamentals for ML

Hands-on Python from zero: variables, data types, lists, tuples, dicts, sets, operators, control flow, I/O, and modules — with exercises you actually run.

Overview

Python is the lingua franca of machine learning. This chapter covers data types, collections (lists, dicts, sets, tuples) with ML-oriented examples, NumPy array basics and shapes, functions and comprehensions for dataset prep, reading CSV, basic data cleaning, and a simple train/test split.

You Will Learn

  • Data types (int, float, str, bool) and their use in ML code
  • Lists, dicts, sets, tuples with ML examples (config dicts, class labels, tensor shapes)
  • NumPy arrays: creation, shapes, indexing
  • Functions and comprehensions for dataset preparation
  • Reading CSV, basic data cleaning, train/test split

Main Content

Data Types in ML

Integers store counts (num_epochs, batch_size). Floats store continuous hyperparameters (learning_rate, dropout). Strings store names and paths (optimizer='Adam', model_path). Booleans toggle behavior (is_training=True). These four types appear in nearly every ML script.
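A minimal sketch of the four types in a training setup. The variable names follow the ones mentioned above; the values and the dataset size of 1000 are illustrative assumptions.

```python
# Illustrative values only; the names follow the examples in the text.
num_epochs = 10          # int: a count
learning_rate = 0.001    # float: a hyperparameter
optimizer = "Adam"       # str: a name
is_training = True       # bool: a behavior toggle

for name, value in [
    ("num_epochs", num_epochs),
    ("learning_rate", learning_rate),
    ("optimizer", optimizer),
    ("is_training", is_training),
]:
    print(f"{name}: {type(value).__name__}")

# Floor division keeps an int; true division always returns a float.
steps_per_epoch = 1000 // 32   # 31 full batches from 1000 samples
fraction = 1000 / 32           # 31.25
```

The distinction between // and / matters whenever a result is used as a count or an index, since those must be integers.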

Lists, Dicts, Sets, Tuples

A list is an ordered, mutable sequence — e.g., a column of feature values. A tuple is immutable, ideal for fixed shapes like (28, 28, 1). A dict maps keys to values: config = {'lr': 0.01, 'epochs': 100} bundles hyperparameters. A set stores unique elements — useful for distinct class labels: set(y_train) gives you all classes.
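The four collections can be compared side by side in a short sketch. The label values and the batch_size key are made up for illustration.

```python
# A list is mutable: feature values can be appended or changed.
feature_column = [0.5, 1.5, 2.5]
feature_column.append(3.5)

# A tuple is immutable: assigning to an element raises TypeError.
image_shape = (28, 28, 1)
tuple_is_immutable = False
try:
    image_shape[0] = 32
except TypeError:
    tuple_is_immutable = True

# A dict bundles named hyperparameters.
config = {"lr": 0.01, "epochs": 100}
config["batch_size"] = 32  # hypothetical extra key

# A set keeps only distinct values; sorted() gives a stable order.
y_train = [0, 1, 1, 2, 0, 2, 1]  # made-up labels
classes = sorted(set(y_train))
print(classes)  # [0, 1, 2]
```

Sets are unordered, so sorting the result is a reasonable habit when the class order has to be the same on every run.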

NumPy Arrays

NumPy arrays are the foundation of numerical Python. Create them with np.array(), np.zeros(), np.ones(), or np.arange(). The .shape attribute is critical: a design matrix has shape (n_samples, n_features). Indexing and slicing work as they do for lists but extend to multiple dimensions, e.g. X[:, 0] selects the first column. Vectorization (operating on whole arrays instead of writing explicit Python loops) is essential for speed.
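The creation functions, shapes, and multi-dimensional indexing described above can be sketched as follows. The array contents are arbitrary illustrative values.

```python
import numpy as np

zeros = np.zeros((3, 2))        # 3 samples, 2 features, all 0.0
ones = np.ones(4)               # 1D array of four 1.0 values
steps = np.arange(0, 10, 2)     # 0, 2, 4, 6, 8

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(X.shape)  # (3, 2)

first_row = X[0]        # shape (2,): the first sample
second_col = X[:, 1]    # shape (3,): the second feature of every sample
block = X[1:, :1]       # shape (2, 1): last two rows, first column only

# Vectorized: scales every element without an explicit Python loop.
scaled = X * 0.5
```

Note that X[:, 1] drops a dimension while X[1:, :1] keeps both; checking .shape after each slice is a cheap way to catch mix-ups.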

Functions and Comprehensions

Functions encapsulate logic: def normalize(x): return (x - x.mean()) / x.std(). List comprehensions build lists concisely: [x**2 for x in features]. Dict comprehensions: {k: v*2 for k, v in config.items()}. These patterns appear constantly in data preprocessing.
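Written out as runnable code, the patterns above look roughly like this. The sketch assumes normalize receives a NumPy array (a plain list has no .mean() method) whose values are not all identical, since a zero standard deviation would cause a division by zero. The sample data is invented.

```python
import numpy as np

def normalize(x):
    """Shift x to zero mean and scale it to unit standard deviation."""
    return (x - x.mean()) / x.std()

features = np.array([2.0, 4.0, 6.0, 8.0])
z = normalize(features)

# Dict comprehension: build a new dict from an existing one.
config = {"lr": 0.01, "epochs": 100}
doubled = {k: v * 2 for k, v in config.items()}

# List comprehension with a condition: keep only positive readings.
raw = [3.0, -1.0, 4.5, 0.0, 2.0]
positive = [r for r in raw if r > 0]
```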

Reading CSV and Data Cleaning

Use pandas: df = pd.read_csv('data.csv'). Inspect with df.head(), df.info(), df.describe(). Handle missing values with df.dropna() or df.fillna(value); both return a new DataFrame, so assign the result. Drop invalid rows by keeping only the valid ones: df = df[df['age'] > 0]. Convert types: df['date'] = pd.to_datetime(df['date']).
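A small end-to-end sketch of these cleaning steps. To stay runnable without a file on disk it reads an inline string through io.StringIO instead of data.csv; the column names and values are invented for illustration.

```python
import io
import pandas as pd

# Inline stand-in for data.csv so the sketch runs without a file.
csv_text = """age,date,income
34,2021-01-05,52000
-1,2021-02-11,48000
45,2021-03-20,
29,2021-04-02,61000
"""

df = pd.read_csv(io.StringIO(csv_text))
n_missing = int(df["income"].isna().sum())  # one empty income cell

# Keep only rows with a plausible age; copy() avoids chained-assignment warnings.
df = df[df["age"] > 0].copy()

# Fill the remaining gap with the column median.
# In a real pipeline, compute this statistic from the training split only.
df["income"] = df["income"].fillna(df["income"].median())

# Parse the date strings into datetime values.
df["date"] = pd.to_datetime(df["date"])
```

Running df.info() before and after the cleaning steps is a quick way to confirm that the row count, missing-value counts, and column types changed as intended.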

Train/Test Split

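Never evaluate on data you trained on. Use scikit-learn's train_test_split (from sklearn.model_selection): X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42). Here test_size=0.2 holds out 20% of the rows, and a fixed random_state makes the split reproducible. For classification, pass stratify=y to preserve class proportions in both splits.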

Examples

Config Dict and List Comprehension

Hyperparameters as a dict; build a list of squared features.

config = {"lr": 0.001, "batch_size": 32, "epochs": 100}
features = [1.2, 3.4, 5.6, 7.8]
squared = [f**2 for f in features]
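# Raw floats carry tiny binary rounding errors (e.g. 11.559999999999999), so round for display.
print([round(s, 2) for s in squared])  # [1.44, 11.56, 31.36, 60.84]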

NumPy Array and Shape

Create a design matrix and check its shape.

import numpy as np
X = np.array([[1, 2], [3, 4], [5, 6]])
print(X.shape)  # (3, 2) — 3 samples, 2 features
print(X.mean(axis=0))  # [3. 4.] (mean of each feature column)

CSV Load and Train/Test Split

Load a CSV, extract features and target, split.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")
X = df[["feat1", "feat2"]].to_numpy()  # feature matrix, shape (n_samples, 2)
y = df["target"].to_numpy()            # target vector, shape (n_samples,)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Common Mistakes

Using Python lists for large numerical computations

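Why: A Python loop over a list runs in the interpreter one element at a time; NumPy performs the same arithmetic in optimized C code over a typed array.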

Fix: Convert to np.array() and use vectorized operations.
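The two approaches can be compared in a short sketch. It only checks that both give the same result; to measure the speed difference on your own machine, wrap each version in the standard library's timeit.

```python
import numpy as np

values = list(range(1_000))

# Loop version: one interpreter-level multiplication per element.
squared_loop = [v * v for v in values]

# Vectorized version: one expression, with the loop running in compiled code.
arr = np.array(values)
squared_vec = arr * arr

same = squared_vec.tolist() == squared_loop
print(same)  # True
```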

Normalizing before splitting

Why: If you compute cleaning or normalization statistics (mean, std) on the full dataset, information from the test set leaks into training.

Fix: Split first, then compute normalization (mean, std) from the training set only and apply to both.
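The fix can be sketched with NumPy alone. The data here is synthetic, and the manual index shuffle is a stand-in for train_test_split so that the example has no extra dependencies.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features

# Split first: shuffle the row indices and hold out the last 20%.
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]
X_train, X_test = X[train_idx], X[test_idx]

# Compute the statistics from the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the same training-set statistics to both splits.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

After scaling, the training features have mean 0 and standard deviation 1 by construction. The test features will be close to that but not exact, which is expected: the test set played no part in choosing the statistics.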

Forgetting random_state in train_test_split

Why: Results are not reproducible; each run gives different splits.

Fix: Always set random_state=42 (or another fixed value) for reproducibility.
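A quick way to confirm the effect, assuming scikit-learn is installed: two calls with the same random_state produce identical splits. The toy arrays are made up for the check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features
y = np.array([0, 1] * 5)

# Same seed twice: the held-out rows should match exactly.
_, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
_, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

same_split = np.array_equal(a_test, b_test)
print(same_split)  # True
```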

Mini Exercises

1. Write a function that takes a list of numbers and returns (mean, variance). Use only a loop and arithmetic.

2. Given a list of dicts (e.g., [{'valid': True, 'x': 1}, {'valid': False, 'x': 2}]), write a comprehension to filter only valid entries.

3. What does X.shape return for a 2D array with 100 samples and 5 features?

Further Reading