Overview
PyTorch tensors are the building blocks of deep learning: multi-dimensional arrays that run on CPU or GPU and support automatic differentiation. This chapter covers tensor creation, dtype and device, shape semantics, broadcasting rules, matrix multiplication, reshape/view/transpose/permute, and autograd basics.
You Will Learn
- Tensor creation: from lists, NumPy, zeros, ones, randn, arange
- dtype and device (CPU vs CUDA)
- Shape semantics and common bugs
- Broadcasting rules with clear examples
- Matrix multiplication (matmul) rules and examples
- reshape, view, transpose, permute
- autograd: requires_grad, backward()
Main Content
Tensors: Creation and Properties
Create tensors with torch.tensor(), torch.zeros(), torch.ones(), torch.randn(), torch.arange(). Every tensor has .shape, .dtype (float32, int64, etc.), and .device (cpu or cuda). Check these constantly — shape mismatches are the #1 source of bugs in deep learning.
Shape Semantics
Convention: (batch, features) for 2D, (batch, channels, height, width) for images. A design matrix X has shape (n_samples, n_features). A batch of images has (B, C, H, W). Linear layer expects input (batch, in_features) and weight (out_features, in_features); output is (batch, out_features).
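These shape conventions can be checked directly with nn.Linear. A minimal sketch (the sizes here are arbitrary, chosen just to make the shapes visible):

```python
import torch
import torch.nn as nn

batch, in_features, out_features = 8, 64, 32
x = torch.randn(batch, in_features)           # design matrix: (batch, in_features)
layer = nn.Linear(in_features, out_features)  # weight stored as (out_features, in_features)

y = layer(x)  # computes x @ W.T + b
print(layer.weight.shape)  # torch.Size([32, 64])
print(y.shape)             # torch.Size([8, 32])
```

Note that the weight is stored transposed relative to the math you might expect; this is why a manual linear layer is written x @ W.T.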
Broadcasting
Dimensions are compared from right to left. They are compatible if equal or one is 1. (3, 4) and (4,) → (3, 4). (3, 1) and (1, 5) → (3, 5). Example: subtract a mean vector from a batch: x - x.mean(dim=0) broadcasts the mean across the batch.
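The two rules above (compare right to left; equal or 1 is compatible) can be verified in a few lines. The shapes below are illustrative:

```python
import torch

# (3, 1) and (1, 5): each size-1 dim stretches to match the other -> (3, 5)
a = torch.randn(3, 1)
b = torch.randn(1, 5)
print((a + b).shape)  # torch.Size([3, 5])

# (3, 4) and (4,): the missing left dim is treated as 1 -> (3, 4)
m = torch.randn(3, 4)
v = torch.randn(4)
print((m + v).shape)  # torch.Size([3, 4])
```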
Matrix Multiplication
torch.matmul(A, B) or A @ B. For 2D: (m, k) @ (k, n) → (m, n). For batches: (b, m, k) @ (b, k, n) → (b, m, n). Element-wise * is the Hadamard product — same shape required. A linear layer does y = x @ W.T + b.
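A quick sketch of the batched case and the matmul-vs-element-wise distinction (batch and matrix sizes are arbitrary):

```python
import torch

# batched matmul: the leading batch dim is carried through
A = torch.randn(10, 3, 4)  # batch of 10 matrices, each (3, 4)
B = torch.randn(10, 4, 5)
C = A @ B                  # (10, 3, 4) @ (10, 4, 5) -> (10, 3, 5)
print(C.shape)             # torch.Size([10, 3, 5])

# element-wise * (Hadamard) needs matching shapes, not matching inner dims
P = torch.randn(3, 4)
Q = torch.randn(3, 4)
print((P * Q).shape)       # torch.Size([3, 4])
```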
Reshape, View, Transpose, Permute
view and reshape change shape without copying (if contiguous). squeeze() removes dims of size 1; unsqueeze(dim) adds one. transpose(dim0, dim1) swaps two dimensions. permute(dims) reorders all dimensions — e.g., (B,H,W,C) → (B,C,H,W) for conv layers.
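Each of these operations can be sketched on a small tensor (shapes chosen only for illustration):

```python
import torch

x = torch.randn(2, 3, 4)
print(x.view(6, 4).shape)         # torch.Size([6, 4]): same data, new shape
print(x.reshape(-1).shape)        # torch.Size([24]): -1 infers the size
print(x.transpose(0, 1).shape)    # torch.Size([3, 2, 4]): swap two dims
print(x.unsqueeze(0).shape)       # torch.Size([1, 2, 3, 4]): add a size-1 dim

# channels-last image batch -> channels-first for conv layers
img = torch.randn(8, 32, 32, 3)        # (B, H, W, C)
print(img.permute(0, 3, 1, 2).shape)   # torch.Size([8, 3, 32, 32])
```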
Autograd Basics
Set requires_grad=True on tensors you want to differentiate. Operations build a computation graph. Call .backward() on a scalar loss to compute gradients. Access gradients via .grad. Use torch.no_grad() when you don't need gradients (e.g., validation).
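A minimal sketch of the full cycle, including torch.no_grad() (values are arbitrary):

```python
import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

loss = (w * x).sum()   # operations on w are recorded in the graph
loss.backward()        # populates w.grad
print(w.grad)          # equals x, since d/dw of sum(w * x) is x

with torch.no_grad():  # e.g., validation: no graph is built
    val = (w * x).sum()
print(val.requires_grad)  # False
```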
Examples
Tensor Creation and Shape
Create tensors and inspect properties.
import torch
x = torch.randn(3, 4)
print(x.shape) # torch.Size([3, 4])
print(x.dtype) # torch.float32
print(x.device) # cpu
Broadcasting
Subtract per-feature mean from a batch.
import torch
X = torch.randn(32, 10) # 32 samples, 10 features
mean = X.mean(dim=0) # (10,)
X_centered = X - mean # (32, 10) - (10,) broadcasts to (32, 10)
Matrix Multiplication
Linear transformation: (batch, in) @ (out, in).T
import torch
batch, in_f, out_f = 8, 64, 32
x = torch.randn(batch, in_f)
W = torch.randn(out_f, in_f)
y = x @ W.T # (8, 64) @ (32, 64).T -> (8, 32)
Autograd
Compute gradients for a simple loss.
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2
loss = y.sum()
loss.backward()
print(x.grad) # tensor([2., 4., 6.])
Common Mistakes
Using * instead of @ for matrix multiplication
Why: * is element-wise; you get shape errors or wrong results.
Fix: Use torch.matmul or @ for matrix multiplication.
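The difference is easy to demonstrate: with transposed-compatible shapes, @ succeeds while * fails to broadcast (shapes chosen to trigger the error):

```python
import torch

A = torch.randn(2, 3)
B = torch.randn(3, 2)

print((A @ B).shape)  # torch.Size([2, 2]): true matrix product

# (2, 3) * (3, 2) is not broadcastable: 3 vs 2 in the last dim
try:
    A * B
except RuntimeError as e:
    print("element-wise * failed:", type(e).__name__)
```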
Broadcasting producing wrong shapes silently
Why: PyTorch expands size-1 dims silently; e.g., subtracting a (3,) vector from a (3, 1) tensor broadcasts to (3, 3) instead of raising a shape error.
Fix: Check .shape after every operation; use unsqueeze explicitly when needed.
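A sketch of the silent-broadcast trap and the explicit unsqueeze fix:

```python
import torch

x = torch.randn(3, 1)
m = torch.randn(3)

# (3, 1) - (3,): the (3,) is treated as (1, 3), so this broadcasts to (3, 3)
print((x - m).shape)               # torch.Size([3, 3]) -- likely a bug

# explicit fix: make the shapes line up before subtracting
print((x - m.unsqueeze(1)).shape)  # torch.Size([3, 1])
```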
In-place operations breaking autograd
Why: x.add_(1) modifies x in place; the graph may not track it correctly.
Fix: Avoid in-place ops on tensors with requires_grad=True; use x = x + 1.
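In the simplest case PyTorch rejects the in-place op outright: mutating a leaf tensor that requires grad raises a RuntimeError, while the out-of-place version works. A small sketch:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

try:
    x.add_(1)  # in-place op on a leaf that requires grad
except RuntimeError:
    print("in-place op on a leaf tensor rejected")

y = x + 1          # out-of-place: creates a new node in the graph
y.sum().backward()
print(x.grad)      # tensor([1., 1.])
```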
Mini Exercises
1. What is the output shape of torch.randn(5, 3) @ torch.randn(3, 7)?
2. Given x of shape (3, 4), write one line to add a batch dimension so it becomes (1, 3, 4).
3. Why does loss.backward() require loss to be a scalar?