Lesson 3: Convolutional Neural Networks (CNNs)
Learning Objectives
- Understand the problem CNNs solve compared to fully connected networks
- Build intuition for convolution, receptive fields, and pooling
- Explain regularization (Dropout) and normalization (BatchNorm) in CNNs
- Understand the LeNet architecture and where each layer fits
- Train a simple CNN on MNIST in the companion PyTorch notebook
Motivation: Why CNNs?
- Images are high-dimensional: a \(28 \times 28\) grayscale image has 784 pixels; larger images have tens of thousands
- Fully connected layers treat pixels as one long, unordered feature vector and ignore spatial structure
- We want models that are sensitive to local patterns (edges, corners, textures) and reuse them across the image
- Convolutional layers do exactly this through local connectivity and weight sharing
From Dense to Convolutional
Fully Connected Layer
- Each output unit connects to every input pixel
- Number of parameters grows quickly with image size
- No notion of neighbors or locality
Convolutional Layer
- Each filter looks at a small patch (e.g., \(3\times3\)) at a time
- Same filter slides over the whole image (weight sharing)
- Output feature map answers: “Where does this pattern appear?”
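The parameter savings from weight sharing are easy to see by counting. Here is an illustrative comparison (the layer sizes are chosen arbitrarily for the example):

```python
import torch.nn as nn

# A 28x28 grayscale image flattened into 784 inputs.
fc = nn.Linear(784, 100)                  # dense layer with 100 output units
conv = nn.Conv2d(1, 100, kernel_size=3)   # 100 filters, each 3x3

fc_params = sum(p.numel() for p in fc.parameters())
conv_params = sum(p.numel() for p in conv.parameters())

print(fc_params)    # 784*100 weights + 100 biases = 78500
print(conv_params)  # 100*(1*3*3) weights + 100 biases = 1000
```

The convolutional layer also keeps this count fixed as the image grows, while the dense layer's count scales with the number of pixels.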
Convolution: Local Receptive Fields
A convolution layer learns filters \(K\) that are applied locally to the input image \(X\). (Deep learning frameworks actually compute cross-correlation, i.e. without flipping the kernel, but since \(K\) is learned the distinction is harmless.)
\[
Y[i, j] = \sum_{u,v} K[u, v] \cdot X[i+u,\, j+v]
\]
- Each output position \((i, j)\) “sees” only a small neighborhood of the input
- Stacking convolutions increases the receptive field (how much of the original image a unit depends on)
- Early layers: edges and simple shapes; deeper layers: object parts and concepts
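The formula above can be checked on a tiny example. This sketch uses a 4×4 input and an all-ones 2×2 filter, so each output entry is just the sum over its local neighborhood:

```python
import torch
import torch.nn.functional as F

# Toy 4x4 "image" with values 0..15; batch and channel dims are both 1.
x = torch.arange(16.0).reshape(1, 1, 4, 4)
k = torch.ones(1, 1, 2, 2)  # 2x2 filter of ones

y = F.conv2d(x, k)  # no padding ("valid"): output is 3x3
# Each Y[i,j] = sum_{u,v} K[u,v] * X[i+u, j+v],
# i.e. the sum over the 2x2 neighborhood at (i, j).
print(y.shape)               # torch.Size([1, 1, 3, 3])
print(y[0, 0, 0, 0].item())  # 0 + 1 + 4 + 5 = 10.0
```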
Diagram: Sliding Filter
digraph conv2d {
rankdir=LR;
node [fontsize=11];
subgraph cluster_input {
label="Input (image)";
style=dashed;
img [label="28×28 pixels", shape=box];
}
subgraph cluster_kernel {
label="Filter (kernel)";
style=dashed;
k [label="3×3 weights", shape=box];
}
subgraph cluster_output {
label="Feature map";
style=dashed;
fmap [label="26×26 activations", shape=box];
}
img -> k [label="slide", fontsize=10];
k -> fmap [label="dot products", fontsize=10];
}
As the filter slides, it produces a feature map that is high where the pattern is present.
Shapes in Conv Layers
- Input: \((N, C_{\text{in}}, H, W)\) — batch, channels, height, width
- Conv2d with \(C_{\text{out}}\) filters, kernel \(K \times K\), stride \(s\), padding \(p\)
- Output height:
\[
H_{\text{out}} = \left\lfloor \frac{H + 2p - K}{s} \right\rfloor + 1
\]
- Output shape: \((N, C_{\text{out}}, H_{\text{out}}, W_{\text{out}})\)
- “Same” padding keeps \(H_{\text{out}} \approx H\); “valid” padding lets it shrink
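The output-size formula can be verified against PyTorch directly. A small sketch (the layer hyperparameters here are arbitrary examples):

```python
import torch
import torch.nn as nn

def out_size(H, K, s, p):
    # H_out = floor((H + 2p - K) / s) + 1
    return (H + 2 * p - K) // s + 1

# Example: 5x5 kernel, stride 2, padding 2 on a 28x28 input.
conv = nn.Conv2d(1, 8, kernel_size=5, stride=2, padding=2)
x = torch.randn(1, 1, 28, 28)

print(conv(x).shape)          # torch.Size([1, 8, 14, 14])
print(out_size(28, 5, 2, 2))  # 14 -- matches the layer's output
```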
Pooling: Downsampling Features
- Pooling reduces spatial size while keeping important information
- Max pooling: keeps the largest activation in each window (e.g., \(2\times2\))
- Adds some translation invariance (small shifts in the image do not change the pooled output much)
- Used after one or more conv layers to gradually reduce resolution and number of parameters
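A tiny numeric example of 2×2 max pooling (the input values are made up for illustration):

```python
import torch
import torch.nn.functional as F

# A 4x4 single-channel "image"; batch and channel dims are both 1.
x = torch.tensor([[1., 2., 0., 1.],
                  [3., 4., 1., 0.],
                  [0., 1., 2., 1.],
                  [1., 0., 1., 3.]]).reshape(1, 1, 4, 4)

pooled = F.max_pool2d(x, kernel_size=2)  # keeps the max of each 2x2 window
print(pooled[0, 0])
# tensor([[4., 1.],
#         [1., 3.]])
```

Shifting the large values by one pixel within a window would leave the pooled output unchanged, which is the translation invariance mentioned above.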
Code: Basic Conv + Pool in PyTorch
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
This block takes a \(1 \times 28 \times 28\) image, extracts 6 feature maps, applies ReLU, then halves height and width via max pooling.
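These shapes are easy to confirm with a dummy batch (the block is re-declared here so the snippet is self-contained):

```python
import torch
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(4, 1, 28, 28)   # batch of 4 grayscale 28x28 images
print(conv_block(x).shape)      # torch.Size([4, 6, 14, 14])
```

Padding 2 with a 5×5 kernel keeps the spatial size at 28×28 ("same" padding), and the 2×2 pooling then halves it to 14×14.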
LeNet: Classic CNN for Digits
LeNet-5 (LeCun et al., 1998) is one of the earliest successful CNNs for digit recognition.
digraph lenet {
rankdir=LR;
node [fontsize=12, shape=box, style=rounded, height=0.7];
edge [penwidth=1.5];
graph [nodesep=0.5, ranksep=0.7];
input [label="Input\n1×32×32"];
c1 [label="Conv C1\n6×5×5 + ReLU\n→ 6×28×28", style="filled,rounded", fillcolor="#e3f2fd"];
s2 [label="Pool S2\n2×2 max\n→ 6×14×14", style="filled,rounded", fillcolor="#fff3e0"];
c3 [label="Conv C3\n16×5×5 + ReLU\n→ 16×10×10", style="filled,rounded", fillcolor="#e3f2fd"];
s4 [label="Pool S4\n2×2 max\n→ 16×5×5", style="filled,rounded", fillcolor="#fff3e0"];
f5 [label="FC F5\n120", style="filled,rounded", fillcolor="#e8f5e9"];
f6 [label="FC F6\n84", style="filled,rounded", fillcolor="#e8f5e9"];
out [label="Output\n10 classes", style="filled,rounded", fillcolor="#c8e6c9"];
input -> c1 -> s2 -> c3 -> s4 -> f5 -> f6 -> out;
}
Modern CNNs (VGG, ResNet, etc.) follow a similar pattern (stacked conv + nonlinearity + downsampling feeding a classifier head), with additions such as skip connections and global average pooling.
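The diagram above can be sketched directly in PyTorch. This is an illustrative, slightly modernized version (ReLU and max pooling in place of the original tanh and average pooling, matching the diagram):

```python
import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(),   # C1: 6x28x28
    nn.MaxPool2d(2),                             # S2: 6x14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),  # C3: 16x10x10
    nn.MaxPool2d(2),                             # S4: 16x5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),       # F5
    nn.Linear(120, 84), nn.ReLU(),               # F6
    nn.Linear(84, 10),                           # output: 10 classes
)

x = torch.randn(1, 1, 32, 32)  # LeNet expects 32x32 input
print(lenet(x).shape)          # torch.Size([1, 10])
```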
Regularization: Overfitting in CNNs
- CNNs can easily memorize training images if the model is too large or data is limited
- Symptoms: training accuracy high, validation accuracy low and noisy
- Regularization methods reduce overfitting by limiting model capacity or adding noise
- Common tools in CNNs: weight decay (L2), Dropout, data augmentation (covered more in Lesson 4)
Dropout: Turning Off Units
During training, Dropout zeroes each activation independently with probability \(p\) and rescales the survivors.
\[
\tilde{h}_i =
\begin{cases}
0 & \text{with probability } p \\
\dfrac{h_i}{1 - p} & \text{with probability } 1 - p
\end{cases}
\]
- \(h_i\) is the activation of unit \(i\) before dropout; \(\tilde{h}_i\) is the activation after dropout
- Forces the network not to rely on any single path or feature
- Behaves like training an ensemble of many “thinned” networks
- At inference time, Dropout is turned off; we use the full network without randomly dropping units
Code: Dropout in PyTorch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(120, 84),
    nn.ReLU(),
    nn.Linear(84, 10),
)
Dropout is only active in model.train() mode. In model.eval(), all units are used.
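The train/eval difference is easy to see on a small tensor. A quick sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
h = torch.ones(8)

drop.train()     # training mode: dropout active
print(drop(h))   # roughly half the entries are 0; survivors scaled by 1/(1-p) = 2

drop.eval()      # inference mode: dropout is a no-op
print(drop(h))   # all ones, unchanged
```

The \(1/(1-p)\) rescaling during training is what lets inference simply use the unmodified activations.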
Normalization: BatchNorm (Intuition)
BatchNorm normalizes activations using mini-batch statistics, then rescales them with learnable parameters.
- Intermediate activations can have very different scales across layers and batches
- BatchNorm normalizes per feature/channel so each batch has roughly zero mean and unit variance
- Learnable \(\gamma\) and \(\beta\) restore useful scales and offsets after normalization
- Helps stabilize training, allows higher learning rates, and can act as a mild regularizer
- Uses mini-batch statistics during training, running averages during inference
BatchNorm: Equations
Given a mini-batch \(\{x_1, \dots, x_m\}\) for one feature/channel:
\[
\mu_B = \frac{1}{m} \sum_{i=1}^m x_i, \quad
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2
\]
\[
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \quad
y_i = \gamma \hat{x}_i + \beta
\]
- \(\mu_B\), \(\sigma_B^2\): batch mean and variance for this feature/channel
- \(\hat{x}_i\): normalized activation (zero mean, unit variance within the batch)
- \(y_i\): final activation after BatchNorm, passed to the next layer
- \(\gamma\), \(\beta\): learned scale and shift that let the layer choose a good range again
- \(\varepsilon\): small constant to avoid division by zero
- During inference, \(\mu_B\) and \(\sigma_B^2\) are replaced by running averages accumulated during training.
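The equations above can be reproduced by hand and checked against nn.BatchNorm2d in training mode (an illustrative sketch with an arbitrary random batch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 4, 4)   # batch of 8, 3 channels, 4x4 feature maps

bn = nn.BatchNorm2d(3)
bn.train()                    # use mini-batch statistics
y = bn(x)

# Manual version: per channel, statistics over the (N, H, W) dimensions.
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + bn.eps)
y_manual = bn.weight.view(1, 3, 1, 1) * x_hat + bn.bias.view(1, 3, 1, 1)

print(torch.allclose(y, y_manual, atol=1e-6))  # True
```

Note that for a conv layer, "one feature" means one channel: the mean and variance are pooled over the batch and both spatial dimensions.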
Code: Conv Block with BatchNorm
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
BatchNorm keeps the distribution of activations more stable across batches, which often speeds up convergence.
Putting It Together: Simple CNN for MNIST
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 14×14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 7×7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
This structure (conv blocks → pooling → fully connected classifier) mirrors LeNet but with slightly modernized choices.
Switch to Notebook: CNN on MNIST
- Now we will train a CNN end-to-end on MNIST in the companion notebook
- Open notebooks/lesson3_Pytorch_MNIST_cnn.ipynb to walk through:
- Loading MNIST, defining the CNN, training with Dropout/BatchNorm, evaluation, and inference on sample images
Homework
- Experiment with different CNN architectures for MNIST: change number of filters, kernel sizes, or depth.
- Compare models with and without Dropout and BatchNorm. Plot training and validation curves and discuss overfitting.
- Visualize intermediate feature maps for a few test images to see what early and late layers are detecting.
- Try training on Fashion-MNIST with your best CNN and compare results to Lesson 2’s fully connected network.