Lesson 3: Convolutional Neural Networks (CNNs)
Learning Objectives
- Understand the problem CNNs solve compared to fully connected networks
- Build intuition for convolution, receptive fields, and pooling
- Explain regularization (Dropout) and normalization (BatchNorm) in CNNs
- Understand the LeNet architecture and where each layer fits
- Train a simple CNN on MNIST in the companion PyTorch notebook
Motivation: Why CNNs?
- Images are high-dimensional: a \(28 \times 28\) grayscale image has 784 pixels; larger images have tens of thousands
- Fully connected layers treat pixels as one long, unordered feature vector and ignore spatial structure
- We want models that are sensitive to local patterns (edges, corners, textures) and reuse them across the image
- Convolutional layers do exactly this through local connectivity and weight sharing
From Dense to Convolutional
Fully Connected Layer
- Each output unit connects to every input pixel
- Number of parameters grows quickly with image size
- No notion of neighbors or locality
Convolutional Layer
- Each filter looks at a small patch (e.g., \(3\times3\)) at a time
- Same filter slides over the whole image (weight sharing)
- Output feature map answers: “Where does this pattern appear?”
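The parameter savings from weight sharing are easy to see by counting. Here is an illustrative comparison (the layer sizes are chosen arbitrarily for the example):

```python
import torch.nn as nn

# A 28x28 grayscale image flattened into 784 inputs.
fc = nn.Linear(784, 100)                  # dense layer with 100 output units
conv = nn.Conv2d(1, 100, kernel_size=3)   # 100 filters, each 3x3

fc_params = sum(p.numel() for p in fc.parameters())
conv_params = sum(p.numel() for p in conv.parameters())

print(fc_params)    # 784*100 weights + 100 biases = 78500
print(conv_params)  # 100*(1*3*3) weights + 100 biases = 1000
```

The convolutional layer also keeps this count fixed as the image grows, while the dense layer's count scales with the number of pixels.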
Convolution: Local Receptive Fields
A convolution layer learns filters \(K\) that are applied locally to the input image \(X\). (Deep learning frameworks actually compute cross-correlation, i.e. without flipping the kernel, but since \(K\) is learned the distinction is harmless.)
\[
Y[i, j] = \sum_{u,v} K[u, v] \cdot X[i+u,\, j+v]
\]
- Each output position \((i, j)\) “sees” only a small neighborhood of the input
- Stacking convolutions increases the receptive field (how much of the original image a unit depends on)
- Early layers: edges and simple shapes; deeper layers: object parts and concepts
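The formula above can be checked on a tiny example. This sketch uses a 4×4 input and an all-ones 2×2 filter, so each output entry is just the sum over its local neighborhood:

```python
import torch
import torch.nn.functional as F

# Toy 4x4 "image" with values 0..15; batch and channel dims are both 1.
x = torch.arange(16.0).reshape(1, 1, 4, 4)
k = torch.ones(1, 1, 2, 2)  # 2x2 filter of ones

y = F.conv2d(x, k)  # no padding ("valid"): output is 3x3
# Each Y[i,j] = sum_{u,v} K[u,v] * X[i+u, j+v],
# i.e. the sum over the 2x2 neighborhood at (i, j).
print(y.shape)               # torch.Size([1, 1, 3, 3])
print(y[0, 0, 0, 0].item())  # 0 + 1 + 4 + 5 = 10.0
```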
Diagram: Sliding Filter
digraph conv2d {
rankdir=LR;
node [fontsize=11];
subgraph cluster_input {
label="Input (image)";
style=dashed;
img [label="28×28 pixels", shape=box];
}
subgraph cluster_kernel {
label="Filter (kernel)";
style=dashed;
k [label="3×3 weights", shape=box];
}
subgraph cluster_output {
label="Feature map";
style=dashed;
fmap [label="26×26 activations", shape=box];
}
img -> k [label="slide", fontsize=10];
k -> fmap [label="dot products", fontsize=10];
}
As the filter slides, it produces a feature map that is high where the pattern is present.
Shapes in Conv Layers
- Input: \((N, C_{\text{in}}, H, W)\) — batch, channels, height, width
- Conv2d with \(C_{\text{out}}\) filters, kernel \(K \times K\), stride \(s\), padding \(p\)
- Output height:
\[
H_{\text{out}} = \left\lfloor \frac{H + 2p - K}{s} \right\rfloor + 1
\]
- Output shape: \((N, C_{\text{out}}, H_{\text{out}}, W_{\text{out}})\)
- “Same” padding keeps \(H_{\text{out}} \approx H\); “valid” padding lets it shrink
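The output-size formula can be verified against PyTorch directly. A small sketch (the layer hyperparameters here are arbitrary examples):

```python
import torch
import torch.nn as nn

def out_size(H, K, s, p):
    # H_out = floor((H + 2p - K) / s) + 1
    return (H + 2 * p - K) // s + 1

# Example: 5x5 kernel, stride 2, padding 2 on a 28x28 input.
conv = nn.Conv2d(1, 8, kernel_size=5, stride=2, padding=2)
x = torch.randn(1, 1, 28, 28)

print(conv(x).shape)          # torch.Size([1, 8, 14, 14])
print(out_size(28, 5, 2, 2))  # 14 -- matches the layer's output
```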
Pooling: Downsampling Features
- Pooling reduces spatial size while keeping important information
- Max pooling: keeps the largest activation in each window (e.g., \(2\times2\))
- Adds some translation invariance (small shifts in the image do not change the pooled output much)
- Used after one or more conv layers to gradually reduce resolution and number of parameters
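A tiny numeric example of 2×2 max pooling (the input values are made up for illustration):

```python
import torch
import torch.nn.functional as F

# A 4x4 single-channel "image"; batch and channel dims are both 1.
x = torch.tensor([[1., 2., 0., 1.],
                  [3., 4., 1., 0.],
                  [0., 1., 2., 1.],
                  [1., 0., 1., 3.]]).reshape(1, 1, 4, 4)

pooled = F.max_pool2d(x, kernel_size=2)  # keeps the max of each 2x2 window
print(pooled[0, 0])
# tensor([[4., 1.],
#         [1., 3.]])
```

Shifting the large values by one pixel within a window would leave the pooled output unchanged, which is the translation invariance mentioned above.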
Code: Basic Conv + Pool in PyTorch
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
This block takes a \(1 \times 28 \times 28\) image, extracts 6 feature maps, applies ReLU, then halves height and width via max pooling.
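These shapes are easy to confirm with a dummy batch (the block is re-declared here so the snippet is self-contained):

```python
import torch
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.randn(4, 1, 28, 28)   # batch of 4 grayscale 28x28 images
print(conv_block(x).shape)      # torch.Size([4, 6, 14, 14])
```

Padding 2 with a 5×5 kernel keeps the spatial size at 28×28 ("same" padding), and the 2×2 pooling then halves it to 14×14.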
LeNet: Classic CNN for Digits
LeNet-5 (LeCun et al., 1998) is one of the earliest successful CNNs for digit recognition.
digraph lenet {
rankdir=LR;
node [fontsize=12, shape=box, style=rounded, height=0.7];
edge [penwidth=1.5];
graph [nodesep=0.5, ranksep=0.7];
input [label="Input\n1×32×32"];
c1 [label="Conv C1\n6×5×5 + ReLU\n→ 6×28×28", style="filled,rounded", fillcolor="#e3f2fd"];
s2 [label="Pool S2\n2×2 max\n→ 6×14×14", style="filled,rounded", fillcolor="#fff3e0"];
c3 [label="Conv C3\n16×5×5 + ReLU\n→ 16×10×10", style="filled,rounded", fillcolor="#e3f2fd"];
s4 [label="Pool S4\n2×2 max\n→ 16×5×5", style="filled,rounded", fillcolor="#fff3e0"];
f5 [label="FC F5\n120", style="filled,rounded", fillcolor="#e8f5e9"];
f6 [label="FC F6\n84", style="filled,rounded", fillcolor="#e8f5e9"];
out [label="Output\n10 classes", style="filled,rounded", fillcolor="#c8e6c9"];
input -> c1 -> s2 -> c3 -> s4 -> f5 -> f6 -> out;
}
Modern CNNs (VGG, ResNet, etc.) follow a similar pattern (stacked conv + nonlinearity + downsampling feeding a classifier head), with additions such as skip connections and global average pooling.
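The diagram above can be sketched directly in PyTorch. This is an illustrative, slightly modernized version (ReLU and max pooling in place of the original tanh and average pooling, matching the diagram):

```python
import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(),   # C1: 6x28x28
    nn.MaxPool2d(2),                             # S2: 6x14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),  # C3: 16x10x10
    nn.MaxPool2d(2),                             # S4: 16x5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),       # F5
    nn.Linear(120, 84), nn.ReLU(),               # F6
    nn.Linear(84, 10),                           # output: 10 classes
)

x = torch.randn(1, 1, 32, 32)  # LeNet expects 32x32 input
print(lenet(x).shape)          # torch.Size([1, 10])
```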
Regularization: Overfitting in CNNs
- CNNs can easily memorize training images if the model is too large or data is limited
- Symptoms: training accuracy high, validation accuracy low and noisy
- Regularization methods reduce overfitting by limiting model capacity or adding noise
- Common tools in CNNs: weight decay (L2), Dropout, data augmentation (covered more in Lesson 4)
Dropout: Turning Off Units
During training, Dropout zeroes each activation independently with probability \(p\) and rescales the survivors.
\[
\tilde{h}_i =
\begin{cases}
0 & \text{with probability } p \\
\dfrac{h_i}{1 - p} & \text{with probability } 1 - p
\end{cases}
\]
- \(h_i\) is the activation of unit \(i\) before dropout; \(\tilde{h}_i\) is the activation after dropout
- Forces the network not to rely on any single path or feature
- Behaves like training an ensemble of many “thinned” networks
- At inference time, Dropout is turned off; we use the full network without randomly dropping units
Code: Dropout in PyTorch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(120, 84),
    nn.ReLU(),
    nn.Linear(84, 10),
)
Dropout is only active in model.train() mode. In model.eval(), all units are used.
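The train/eval difference is easy to see on a small tensor. A quick sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
h = torch.ones(8)

drop.train()     # training mode: dropout active
print(drop(h))   # roughly half the entries are 0; survivors scaled by 1/(1-p) = 2

drop.eval()      # inference mode: dropout is a no-op
print(drop(h))   # all ones, unchanged
```

The \(1/(1-p)\) rescaling during training is what lets inference simply use the unmodified activations.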
Normalization: BatchNorm (Intuition)
BatchNorm normalizes activations using mini-batch statistics, then rescales them with learnable parameters.
- Intermediate activations can have very different scales across layers and batches
- BatchNorm normalizes per feature/channel so each batch has roughly zero mean and unit variance
- Learnable \(\gamma\) and \(\beta\) restore useful scales and offsets after normalization
- Helps stabilize training, allows higher learning rates, and can act as a mild regularizer
- Uses mini-batch statistics during training, running averages during inference
BatchNorm: Equations
Given a mini-batch \(\{x_1, \dots, x_m\}\) for one feature/channel:
\[
\mu_B = \frac{1}{m} \sum_{i=1}^m x_i, \quad
\sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2
\]
\[
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \quad
y_i = \gamma \hat{x}_i + \beta
\]
- \(\mu_B\), \(\sigma_B^2\): batch mean and variance for this feature/channel
- \(\hat{x}_i\): normalized activation (zero mean, unit variance within the batch)
- \(y_i\): final activation after BatchNorm, passed to the next layer
- \(\gamma\), \(\beta\): learned scale and shift that let the layer choose a good range again
- \(\varepsilon\): small constant to avoid division by zero
- During inference, \(\mu_B\) and \(\sigma_B^2\) are replaced by running averages accumulated during training.
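The equations above can be reproduced by hand and checked against nn.BatchNorm2d in training mode (an illustrative sketch with an arbitrary random batch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 4, 4)   # batch of 8, 3 channels, 4x4 feature maps

bn = nn.BatchNorm2d(3)
bn.train()                    # use mini-batch statistics
y = bn(x)

# Manual version: per channel, statistics over the (N, H, W) dimensions.
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + bn.eps)
y_manual = bn.weight.view(1, 3, 1, 1) * x_hat + bn.bias.view(1, 3, 1, 1)

print(torch.allclose(y, y_manual, atol=1e-6))  # True
```

Note that for a conv layer, "one feature" means one channel: the mean and variance are pooled over the batch and both spatial dimensions.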
Code: Conv Block with BatchNorm
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
BatchNorm keeps the distribution of activations more stable across batches, which often speeds up convergence.
Putting It Together: Simple CNN for MNIST
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 14×14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 7×7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x
This structure (conv blocks → pooling → fully connected classifier) mirrors LeNet but with slightly modernized choices.
Switch to Notebook: CNN on MNIST
- Now we will train a CNN end-to-end on MNIST in the companion notebook
- Open notebooks/lesson3_Pytorch_MNIST_cnn.ipynb to walk through:
- Loading MNIST, defining the CNN, training with Dropout/BatchNorm, evaluation, and inference on sample images
Homework
- Experiment with different CNN architectures for MNIST: change number of filters, kernel sizes, or depth.
- Compare models with and without Dropout and BatchNorm. Plot training and validation curves and discuss overfitting.
- Visualize intermediate feature maps for a few test images to see what early and late layers are detecting.
- Try training on Fashion-MNIST with your best CNN and compare results to Lesson 2’s fully connected network.