Lesson 3: Convolutional Neural Networks (CNNs)

Learning Objectives

  • Understand the problem CNNs solve compared to fully connected networks
  • Build intuition for convolution, receptive fields, and pooling
  • Explain regularization (Dropout) and normalization (BatchNorm) in CNNs
  • Understand the LeNet architecture and where each layer fits
  • Train a simple CNN on MNIST in the companion PyTorch notebook

Motivation: Why CNNs?

  • Images are high-dimensional: a \(28 \times 28\) grayscale image has 784 pixels; larger images have tens of thousands
  • Fully connected layers treat every pixel independently and ignore spatial structure
  • We want models that are sensitive to local patterns (edges, corners, textures) and reuse them across the image
  • Convolutional layers do exactly this through local connectivity and weight sharing

From Dense to Convolutional

Fully Connected Layer
  • Each output unit connects to every input pixel
  • Number of parameters grows quickly with image size
  • No notion of neighbors or locality
Convolutional Layer
  • Each filter looks at a small patch (e.g., \(3\times3\)) at a time
  • Same filter slides over the whole image (weight sharing)
  • Output feature map answers: “Where does this pattern appear?”

Convolution: Local Receptive Fields

A convolution layer learns filters \(K\) that are applied locally to the input image \(X\). (Strictly speaking, the operation below is cross-correlation, since the filter is not flipped; this is what deep learning libraries implement and call "convolution", and the distinction does not matter for learned filters.)

\[ Y[i, j] = \sum_{u,v} K[u, v] \cdot X[i+u,\, j+v] \]

  • Each output position \((i, j)\) “sees” only a small neighborhood of the input
  • Stacking convolutions increases the receptive field (how much of the original image a unit depends on)
  • Early layers: edges and simple shapes; deeper layers: object parts and concepts
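The sum above fits in a few lines of code. Here is a minimal NumPy sketch (conv2d_valid is our own helper name; stride 1, no padding), applied to an image with a vertical edge:

```python
import numpy as np

def conv2d_valid(X, K):
    """Slide K over X with stride 1 and no padding ("valid" mode)."""
    H, W = X.shape
    k = K.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Y[i, j] = sum_{u,v} K[u, v] * X[i+u, j+v]
            out[i, j] = np.sum(K * X[i:i + k, j:j + k])
    return out

# Image that is dark on the left half, bright on the right half
X = np.zeros((6, 6))
X[:, 3:] = 1.0
# Filter that responds to a left-to-right increase in intensity
K = np.array([[-1.0, 0.0, 1.0]] * 3)

Y = conv2d_valid(X, K)   # shape (4, 4); large values where the edge sits
```

The feature map peaks exactly at the dark-to-bright boundary and is zero in the flat regions, which is the "Where does this pattern appear?" answer in concrete form.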

Diagram: Sliding Filter


digraph conv2d {
  rankdir=LR;
  node [fontsize=11];

  subgraph cluster_input {
    label="Input (image)";
    style=dashed;
    img [label="28×28 pixels", shape=box];
  }

  subgraph cluster_kernel {
    label="Filter (kernel)";
    style=dashed;
    k [label="3×3 weights", shape=box];
  }

  subgraph cluster_output {
    label="Feature map";
    style=dashed;
    fmap [label="26×26 activations", shape=box];
  }

  img -> k [label="slide", fontsize=10];
  k -> fmap [label="dot products", fontsize=10];
}
          

As the filter slides, it produces a feature map that is high where the pattern is present.

Shapes in Conv Layers

  • Input: \((N, C_{\text{in}}, H, W)\) — batch, channels, height, width
  • Conv2d with \(C_{\text{out}}\) filters, kernel \(K \times K\), stride \(s\), padding \(p\)
  • Output height: \[ H_{\text{out}} = \left\lfloor \frac{H + 2p - K}{s} \right\rfloor + 1 \]
  • Output shape: \((N, C_{\text{out}}, H_{\text{out}}, W_{\text{out}})\)
  • “Same” padding keeps \(H_{\text{out}} \approx H\); “valid” padding lets it shrink
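The output-size formula is easy to get wrong by one, so it is worth checking against an actual layer. A small sketch (conv_out_size is our own helper name):

```python
import torch
import torch.nn as nn

def conv_out_size(H, K, s=1, p=0):
    """H_out = floor((H + 2p - K) / s) + 1"""
    return (H + 2 * p - K) // s + 1

# "Same"-style padding for a 5x5 kernel: p = 2 keeps H_out = H
conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
x = torch.randn(8, 1, 28, 28)     # (N, C_in, H, W)
y = conv(x)                       # (8, 6, 28, 28)

assert y.shape[2] == conv_out_size(28, K=5, s=1, p=2) == 28
```

With p = 0 ("valid" padding) the same formula gives conv_out_size(28, 5) = 24, matching the 28×28 → 26×26 shrinkage shown for a 3×3 kernel in the diagram earlier.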

Pooling: Downsampling Features

  • Pooling reduces spatial size while keeping important information
  • Max pooling: keeps the largest activation in each window (e.g., \(2\times2\))
  • Adds some translation invariance (small shifts in the image do not change the pooled output much)
  • Used after one or more conv layers to gradually reduce resolution and number of parameters
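The invariance claim can be checked directly: a one-pixel shift that stays inside the same \(2\times2\) pooling window leaves the pooled output unchanged (a shift across a window boundary can still change it):

```python
import torch
import torch.nn.functional as F

x = torch.zeros(1, 1, 4, 4)
x[0, 0, 0, 0] = 1.0                        # one bright pixel at (0, 0)
x_shift = torch.roll(x, shifts=1, dims=3)  # shift it right to (0, 1)

pooled = F.max_pool2d(x, kernel_size=2)          # 2x2 output map
pooled_shift = F.max_pool2d(x_shift, kernel_size=2)
# Both pixels land in the same top-left window, so the pooled maps are identical
```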

Code: Basic Conv + Pool in PyTorch

import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

This block takes a \(1 \times 28 \times 28\) image, extracts 6 feature maps, applies ReLU, then halves height and width via max pooling.

LeNet: Classic CNN for Digits

LeNet-5 (LeCun et al., 1998) is one of the earliest successful CNNs for digit recognition. The version shown here swaps the original tanh activations and average pooling for the modern ReLU and max pooling.


digraph lenet {
  rankdir=LR;
  node [fontsize=12, shape=box, style=rounded, height=0.7];
  edge [penwidth=1.5];
  graph [nodesep=0.5, ranksep=0.7];

  input  [label="Input\n1×32×32"];
  c1     [label="Conv C1\n6×5×5 + ReLU\n→ 6×28×28", style="filled,rounded", fillcolor="#e3f2fd"];
  s2     [label="Pool S2\n2×2 max\n→ 6×14×14", style="filled,rounded", fillcolor="#fff3e0"];
  c3     [label="Conv C3\n16×5×5 + ReLU\n→ 16×10×10", style="filled,rounded", fillcolor="#e3f2fd"];
  s4     [label="Pool S4\n2×2 max\n→ 16×5×5", style="filled,rounded", fillcolor="#fff3e0"];
  f5     [label="FC F5\n120", style="filled,rounded", fillcolor="#e8f5e9"];
  f6     [label="FC F6\n84", style="filled,rounded", fillcolor="#e8f5e9"];
  out    [label="Output\n10 classes", style="filled,rounded", fillcolor="#c8e6c9"];

  input -> c1 -> s2 -> c3 -> s4 -> f5 -> f6 -> out;
}
          

Modern CNNs (VGG, ResNet, etc.) build on the same basic pattern: stacks of conv + nonlinearity + downsampling followed by a classifier head, with additions such as skip connections.
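The diagram maps directly onto a PyTorch module. A sketch (LeNet is our own class name, following the diagram's ReLU/max-pool variant):

```python
import torch
import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 1x32x32 -> 6x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                   # S2: -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),   # C3: -> 16x10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                   # S4: -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),        # F5
            nn.ReLU(),
            nn.Linear(120, 84),                # F6
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNet()(torch.randn(1, 1, 32, 32))    # shape (1, 10)
```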

Regularization: Overfitting in CNNs

  • CNNs can easily memorize training images if the model is too large or data is limited
  • Symptoms: training accuracy high, validation accuracy low and noisy
  • Regularization methods reduce overfitting by limiting model capacity or adding noise
  • Common tools in CNNs: weight decay (L2), Dropout, data augmentation (covered more in Lesson 4)

Dropout: Turning Off Units

During training, Dropout randomly zeros out a fraction \(p\) of activations.

\[ \tilde{h}_i = \begin{cases} 0 & \text{with probability } p \\ \dfrac{h_i}{1 - p} & \text{with probability } 1 - p \end{cases} \]

  • \(h_i\) is the activation of unit \(i\) before dropout; \(\tilde{h}_i\) is the activation after dropout
  • Forces the network not to rely on any single path or feature
  • Behaves like training an ensemble of many “thinned” networks
  • At inference time, Dropout is turned off; we use the full network without randomly dropping units
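The case analysis above is "inverted dropout", which is what nn.Dropout implements. A minimal hand-rolled sketch shows why the \(1/(1-p)\) rescaling matters: it keeps the expected activation unchanged, so nothing needs to be adjusted at inference time:

```python
import torch

def inverted_dropout(h, p=0.5):
    """Zero each unit with probability p; rescale survivors by 1/(1-p)."""
    mask = (torch.rand_like(h) >= p).float()
    return h * mask / (1 - p)

torch.manual_seed(0)
h = torch.ones(100_000)
h_drop = inverted_dropout(h, p=0.5)
# About half the entries are exactly 0.0, the rest exactly 2.0,
# so the mean stays close to the original activation value of 1.0
```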

Code: Dropout in PyTorch

import torch.nn as nn

classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(120, 84),
    nn.ReLU(),
    nn.Linear(84, 10),
)

Dropout is only active in model.train() mode. In model.eval(), all units are used.
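This train/eval switch can be observed directly on a standalone layer (a small sketch):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()           # training mode: units dropped, survivors rescaled to 2.0
y_train = drop(x)

drop.eval()            # inference mode: Dropout is the identity
y_eval = drop(x)       # equal to x
```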

Normalization: BatchNorm (Intuition)

BatchNorm normalizes activations using mini-batch statistics, then rescales them with learnable parameters.

  • Intermediate activations can have very different scales across layers and batches
  • BatchNorm normalizes per feature/channel so each batch has roughly zero mean and unit variance
  • Learnable \(\gamma\) and \(\beta\) restore useful scales and offsets after normalization
  • Helps stabilize training, allows higher learning rates, and can act as a mild regularizer
  • Uses mini-batch statistics during training, running averages during inference

BatchNorm: Equations

Given a mini-batch \(\{x_1, \dots, x_m\}\) for one feature/channel:

\[ \mu_B = \frac{1}{m} \sum_{i=1}^m x_i, \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2 \]

\[ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \quad y_i = \gamma \hat{x}_i + \beta \]

  • \(\mu_B\), \(\sigma_B^2\): batch mean and variance for this feature/channel
  • \(\hat{x}_i\): normalized activation (zero mean, unit variance within the batch)
  • \(y_i\): final activation after BatchNorm, passed to the next layer
  • \(\gamma\), \(\beta\): learned scale and shift that let the layer choose a good range again
  • \(\varepsilon\): small constant to avoid division by zero
  • During inference, \(\mu_B\) and \(\sigma_B^2\) are replaced by running averages accumulated during training.
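The equations can be verified against nn.BatchNorm1d: a freshly created layer has \(\gamma = 1\) and \(\beta = 0\), so its training-mode output should match the hand-computed \(\hat{x}_i\):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = 3.0 * torch.randn(32, 4) + 5.0   # batch of 32, 4 features, off-scale on purpose

# Manual BatchNorm with training-mode (batch) statistics, per feature
eps = 1e-5
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)   # biased batch variance, as in the equations
x_hat = (x - mu) / torch.sqrt(var + eps)

# A fresh BatchNorm1d starts with gamma=1, beta=0, so y should equal x_hat
bn = nn.BatchNorm1d(4)
bn.train()
y = bn(x)
```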

Code: Conv Block with BatchNorm

import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

BatchNorm keeps the distribution of activations more stable across batches, which often speeds up convergence.

Putting It Together: Simple CNN for MNIST

import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),   # 14×14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),   # 7×7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

This structure (conv blocks → pooling → fully connected classifier) mirrors LeNet but with slightly modernized choices.
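A common stumbling block is the 64 * 7 * 7 input size of the first Linear layer. Rather than computing it by hand, you can run a dummy input through the feature extractor and read off the shape (a sketch rebuilding the same features stack as above):

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32),
    nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64),
    nn.ReLU(), nn.MaxPool2d(2),
)
features.eval()                      # use running stats so a batch of 1 is fine
with torch.no_grad():
    out = features(torch.zeros(1, 1, 28, 28))
print(out.shape)                     # torch.Size([1, 64, 7, 7]); Flatten gives 64*7*7
```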

Switch to Notebook: CNN on MNIST

  • Now we will train a CNN end-to-end on MNIST in the companion notebook
  • Open notebooks/lesson3_Pytorch_MNIST_cnn.ipynb to walk through loading MNIST, defining the CNN, training with Dropout/BatchNorm, evaluating, and running inference on sample images


Homework

  • Experiment with different CNN architectures for MNIST: change number of filters, kernel sizes, or depth.
  • Compare models with and without Dropout and BatchNorm. Plot training and validation curves and discuss overfitting.
  • Visualize intermediate feature maps for a few test images to see what early and late layers are detecting.
  • Try training on Fashion-MNIST with your best CNN and compare results to Lesson 2’s fully connected network.