Lesson 4 — Data Augmentation & CNN Architectures

Learning Objectives

Understand why and when we use data augmentation for vision models
Recognize common augmentation operations and how to implement them in PyTorch
Build intuition for the evolution of CNN architectures from LeNet to ResNet
Connect CNN features to downstream tasks like object tracking

Recap: CNNs and Overfitting

CNNs can have millions of parameters and easily memorize the training set
Overfitting symptoms: training loss \(\downarrow\) while validation loss \(\uparrow\) or plateaus early
We already saw regularization like Dropout and stabilization techniques like BatchNorm
In this lesson we add weight decay and data augmentation to our toolkit

Regularization with Weight Decay

Weight decay (L2 regularization) discourages very large weights by adding a penalty to the loss.

\[ L_{\text{total}} = L_{\text{data}} + \lambda \lVert W \rVert_2^2 \]

\(\lambda\) controls the strength of the penalty (e.g., \(10^{-4}\) or \(10^{-3}\))
Encourages simpler models and can reduce overfitting, especially with many parameters
In PyTorch, set weight_decay in the optimizer instead of manually modifying the loss

import torch.optim as optim

optimizer = optim.Adam(model.parameters(),
                       lr=1e-3,
                       weight_decay=1e-4)  # L2 penalty

Combine weight decay with Dropout, BatchNorm, and data augmentation for more robust CNNs.
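
One detail worth checking is how \(\lambda\) in the formula above maps to PyTorch's weight_decay argument. The sketch below uses a toy linear model on made-up data (all names and values are illustrative, not from the lesson). It compares one plain SGD step with an explicit penalty against one step with optimizer-side weight decay. Because SGD adds weight_decay * w to the gradient, matching \(\lambda \lVert W \rVert_2^2\) requires weight_decay = 2λ.

```python
import torch

torch.manual_seed(0)
w_init = torch.randn(5)   # toy weight vector
x = torch.randn(8, 5)     # toy inputs
y = torch.randn(8)        # toy targets
lam, lr = 1e-2, 0.1       # illustrative values


def data_loss(w):
    """Mean squared error of a linear model; stands in for L_data."""
    return ((x @ w - y) ** 2).mean()


# (a) Explicit penalty: L_total = L_data + lam * ||w||^2
w_a = w_init.clone().requires_grad_(True)
opt_a = torch.optim.SGD([w_a], lr=lr)
opt_a.zero_grad()
(data_loss(w_a) + lam * w_a.pow(2).sum()).backward()
opt_a.step()

# (b) Optimizer-side decay: SGD adds weight_decay * w to the gradient,
#     and the gradient of lam * ||w||^2 is 2 * lam * w.
w_b = w_init.clone().requires_grad_(True)
opt_b = torch.optim.SGD([w_b], lr=lr, weight_decay=2 * lam)
opt_b.zero_grad()
data_loss(w_b).backward()
opt_b.step()

same_update = torch.allclose(w_a, w_b, atol=1e-6)
print("updates match:", same_update)
```

With Adam, weight_decay is also added to the gradient before the adaptive scaling. torch.optim.AdamW applies the decay to the weights directly instead, which is often the preferred variant.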

Idea: Data Augmentation

Data augmentation creates new training examples by applying label-preserving transformations.

Example: flipping, rotating, cropping, or color-jittering images
We do not change the label — a rotated cat is still a cat
Model learns to be robust to common variations (pose, lighting, small translations)
Acts like training on a much larger dataset without collecting more images

Types of Augmentation

Geometric

Random crop / resize
Horizontal / vertical flip
Small rotations (e.g., \(\pm 10^\circ\))
Random affine transforms (scale, shear)

Photometric

Brightness / contrast changes
Color jitter (hue, saturation)
Gaussian noise or blur
Random grayscale

Choose transforms that make sense for your data and task (e.g., avoid vertical flips for digits)

Data Augmentation Pipeline


digraph aug_pipeline {
  rankdir=LR;
  node [fontsize=13, shape=box, style=rounded, height=0.7];
  edge [penwidth=1.5];
  graph [nodesep=0.7, ranksep=0.9];

  raw   [label="Original image\n(train sample)"];
  aug   [label="Random transforms\n(flip, rotate, color jitter)", style="filled,rounded", fillcolor="#fff3e0"];
  batch [label="Augmented batch\n(x_batch, y_batch)"];
  model [label="CNN model\n(f(x; θ))", style="filled,rounded", fillcolor="#e3f2fd"];

  raw -> aug -> batch -> model;
}

Each epoch, the same image can look different → the model sees a stream of varied views.

PyTorch: Augmentation with `torchvision.transforms`

import torchvision.transforms as T

train_transform = T.Compose([
    # No RandomHorizontalFlip here: a mirrored digit is not a valid MNIST digit
    T.RandomRotation(degrees=10),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
    T.Normalize((0.1307,), (0.3081,)),  # MNIST mean / std
])

test_transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.1307,), (0.3081,)),
])

Apply train_transform only to training data; keep validation/test transforms deterministic.

Using Transforms in a Dataset

from torchvision import datasets
from torch.utils.data import DataLoader

train_ds = datasets.MNIST(
    root="data",
    train=True,
    download=True,
    transform=train_transform,
)

test_ds = datasets.MNIST(
    root="data",
    train=False,
    download=True,
    transform=test_transform,
)

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=128)

Every time train_loader samples an image, a fresh random augmentation is applied.

Choosing Augmentations Carefully

An augmentation is valid only if the transform does not change the label
Digits (MNIST): a 180° rotation turns a “6” into a “9”, and flips produce mirrored shapes that are not valid digits → avoid
Natural images: flips and small rotations are usually safe
Domain-specific tasks (medical, documents) require extra care and domain knowledge

When Does Augmentation Help Most?

You have limited labeled data and a relatively large model
Test-time conditions are diverse: different devices, lighting, or viewpoints
You want robustness to small perturbations (cropping, noise, color shifts)
Often combined with Dropout/BatchNorm and early stopping

From LeNet to Modern CNNs

Architecture design evolved, but core building blocks stayed similar.

LeNet-5 (1998): small CNN for digits (MNIST-like); conv → pool → conv → pool → FC
AlexNet (2012): deeper CNN for ImageNet; ReLU, Dropout, trained on GPUs
VGG (2014): very deep, simple stacks of \(3\times3\) convs and max pooling
Inception / GoogLeNet (2014): multi-branch “Inception” modules with different kernel sizes
ResNet (2015): residual connections (skip connections) enabled very deep networks

AlexNet: Bigger CNN for ImageNet

Problem: ImageNet has millions of RGB images and 1000 classes — LeNet is too small and shallow.
Key ideas: deeper conv stack (5 conv + 3 FC), ReLU everywhere, Dropout in fully connected layers, trained on GPUs.
What it solved: showed CNNs can scale to large, real-world datasets and dramatically beat hand-crafted features.
Trade-offs: large model (tens of millions of parameters), heavy compute and memory usage.

AlexNet Architecture (Diagram)

[Figure: high-level view of the original AlexNet layers (input, conv / pooling blocks, fully connected head)]

[Figure: architecture sketch, LeNet vs AlexNet]

LeNet vs AlexNet (Summary)

Both use conv + nonlinearity + pooling → fully connected layers; AlexNet scales this pattern up for large-scale vision.

Why AlexNet? ImageNet is much larger and more varied than MNIST, so we need a deeper, higher-capacity CNN.
How: more conv layers, many more channels, aggressive pooling, and heavy use of ReLU + Dropout.
Key changes vs LeNet: supports RGB images, trains on GPUs, and scales up width/depth to handle 1000-way classification.

VGG: Deep and Simple

Key idea: stack many \(3\times3\) conv layers instead of a few large kernels
Pattern: \((\text{Conv} \rightarrow \text{ReLU})\) repeated 2–3 times → max pool → repeat
Why: deeper networks capture more complex patterns; keeping blocks identical simplifies design and tuning.
Changes vs AlexNet: replaces AlexNet's large, varied kernels (11×11, 5×5) with repeated \(3\times3\) convs, favoring depth and a regular structure.
Downside: many parameters (roughly 138M for VGG-16) and heavy computation (expensive without modern accelerators).

# simplified VGG-style block
import torch.nn as nn

vgg_block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),  # spatial size / 2
)
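
Why small kernels? Two stacked \(3\times3\) convs cover the same \(5\times5\) receptive field as a single \(5\times5\) conv, but with fewer weights and an extra nonlinearity in between. The sketch below checks the weight counts for an illustrative channel width of 64 (bias terms are left out to keep the arithmetic clean).

```python
import torch
import torch.nn as nn

C = 64  # illustrative channel width

# Two stacked 3x3 convs: receptive field 5x5, with a ReLU in between
stack_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

# One 5x5 conv: same receptive field, single linear map
single_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)


def n_params(module):
    """Total number of learnable weights in a module."""
    return sum(p.numel() for p in module.parameters())


params_3x3 = n_params(stack_3x3)   # 2 * (3*3*C*C)
params_5x5 = n_params(single_5x5)  # 5*5*C*C

x = torch.randn(1, C, 32, 32)
out_3x3 = stack_3x3(x)
out_5x5 = single_5x5(x)
print(params_3x3, params_5x5, out_3x3.shape == out_5x5.shape)
```

For C = 64 this gives 18·C² = 73,728 weights for the stack versus 25·C² = 102,400 for the single large kernel, with identical output shapes.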

Inception: Multi-Scale Features

Inception modules apply multiple filter sizes in parallel and concatenate outputs.


digraph inception_block {
  rankdir=TB;
  node [fontsize=10, shape=box, style=rounded];

  input  [label="input feature map"];
  b1_1x1 [label="1×1 conv"];
  b2_1x1 [label="1×1 conv\n(reduce)"];
  b2_3x3 [label="3×3 conv"];
  b3_1x1 [label="1×1 conv\n(reduce)"];
  b3_5x5 [label="5×5 conv"];
  b4_pool [label="3×3 max pool"];
  b4_1x1 [label="1×1 conv"];
  concat [label="concat\nchannels"];

  input -> b1_1x1 -> concat;
  input -> b2_1x1 -> b2_3x3 -> concat;
  input -> b3_1x1 -> b3_5x5 -> concat;
  input -> b4_pool -> b4_1x1 -> concat;
}

The network can pick up useful features at multiple scales (1×1, 3×3, 5×5) within the same layer.

Why: different objects and patterns appear at different scales; a single kernel size can miss useful structure.
How: parallel branches with 1×1, 3×3, 5×5 convs and pooling, plus 1×1 “bottlenecks” to keep compute affordable.
Changes vs VGG: moves from a single conv path per block to multi-branch modules that learn multi-scale features in one stage.
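
The four-branch structure in the diagram can be sketched in PyTorch as follows. This is a simplified module, not the full GoogLeNet implementation: the class name and argument names are hypothetical, and the channel counts in the usage lines are illustrative values chosen so the branches add up to 256 output channels.

```python
import torch
import torch.nn as nn


class InceptionModule(nn.Module):
    """Simplified Inception-style block: four parallel branches, concatenated."""

    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        # Branch 1: 1x1 conv only
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        # Branch 2: 1x1 reduce, then 3x3 conv
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_reduce, 1), nn.ReLU(),
            nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU(),
        )
        # Branch 3: 1x1 reduce, then 5x5 conv
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_reduce, 1), nn.ReLU(),
            nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU(),
        )
        # Branch 4: 3x3 max pool (stride 1 keeps spatial size), then 1x1 conv
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(),
        )

    def forward(self, x):
        branches = (self.branch1, self.branch2, self.branch3, self.branch4)
        # All branches keep H x W, so outputs can be concatenated on channels
        return torch.cat([b(x) for b in branches], dim=1)


module = InceptionModule(192, c1=64, c3_reduce=96, c3=128,
                         c5_reduce=16, c5=32, pool_proj=32)
x = torch.randn(2, 192, 28, 28)
out = module(x)
print(out.shape)  # channels: 64 + 128 + 32 + 32
```

The padding in each branch is chosen so spatial size is preserved; without that, the channel-wise concatenation would fail.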

ResNet: Residual Connections

Residual (skip) connections help train very deep networks by making layers learn a residual correction.

Why: very deep plain networks are hard to train: gradients vanish, and accuracy degrades as layers are added (deeper plain nets show higher training error, so this is not just overfitting).
How: each block learns a residual \(F(x)\); the skip connection lets gradients flow directly through \(x\).
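
A small check makes the "gradients flow directly through \(x\)" point concrete. In this sketch (layer sizes are arbitrary, and zeroing the last conv is only a device for isolating the skip path), the residual branch contributes nothing, so the block is exactly the identity and the gradient reaching the input is exactly 1 everywhere.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Residual branch F(x): conv -> ReLU -> conv
residual_fn = nn.Sequential(
    nn.Conv2d(4, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(4, 4, kernel_size=3, padding=1),
)
# Zero the last conv so F(x) == 0 and passes no gradient back to x
nn.init.zeros_(residual_fn[2].weight)
nn.init.zeros_(residual_fn[2].bias)

x = torch.randn(1, 4, 8, 8, requires_grad=True)
y = x + residual_fn(x)   # y = x + F(x), no final ReLU in this sketch
y.sum().backward()

block_is_identity = torch.allclose(y, x)
grad_is_one = torch.allclose(x.grad, torch.ones_like(x))
print(block_is_identity, grad_is_one)
```

Even when the convolutional path passes no gradient at all, the skip path still delivers the full gradient to earlier layers. In a plain (non-residual) stack the same situation would give zero gradient.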

ResNet: Going Deeper

Stacking many residual blocks with skip connections makes 50–100+ layer CNNs trainable in practice.

Changes vs Inception/VGG: focuses on depth with identity shortcuts instead of complex multi-branch modules.
Residual paths act like “highways” for gradients, reducing vanishing-gradient issues in very deep models.

Code: Simple Residual Block (PyTorch)

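import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Residual branch F(x): conv -> BN -> ReLU -> conv -> BN
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # Identity shortcut when channel counts match; otherwise a 1x1 conv
        # projects x to out_ch channels so the addition is well-defined
        self.shortcut = (
            nn.Identity() if in_ch == out_ch
            else nn.Conv2d(in_ch, out_ch, kernel_size=1)
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv(x)
        out = out + self.shortcut(x)  # skip connection
        return self.relu(out)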

Here the block learns a residual \(F(x)\) and adds it back to the original input \(x\) before the final ReLU.

xResNet: Inception-Inspired CNN

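xResNet variants, as used in this course, combine residual connections with efficiency ideas from the Inception family (depthwise separable convolutions are the core building block of Xception and MobileNet).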
Depthwise separable convolutions: factor a standard conv into cheaper steps (per-channel spatial conv + 1×1 mixing) to reduce parameters and FLOPs.
Batch normalization + ReLU: stabilize training and allow deeper networks.
Global average pooling: replace large fully-connected layers with a compact pooling + small classifier head.

Depthwise Separable Convolutions

Standard conv: every filter looks at all input channels at once (expensive: \(k \times k \times C_{in} \times C_{out}\)).
Depthwise step: apply one \(k \times k\) filter per input channel (no channel mixing yet).
Pointwise step: a cheap \(1 \times 1\) conv mixes channels to get the desired \(C_{out}\).
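Total cost drops from \(k^2 C_{in} C_{out}\) to \(k^2 C_{in} + C_{in} C_{out}\) weights, a factor of roughly \(1/C_{out} + 1/k^2\), so we can build deeper or wider CNNs within a similar compute budget, often at comparable accuracy.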
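
The two steps map directly onto nn.Conv2d: the depthwise step is a conv with groups equal to the number of input channels, and the pointwise step is a \(1\times1\) conv. The sketch below compares weight counts for illustrative sizes (64 → 128 channels, \(3\times3\) kernel, no bias terms).

```python
import torch
import torch.nn as nn

c_in, c_out, k = 64, 128, 3  # illustrative sizes

# Standard conv: every filter sees all input channels
standard = nn.Conv2d(c_in, c_out, kernel_size=k, padding=1, bias=False)

# Depthwise separable conv: per-channel spatial filter, then 1x1 channel mixing
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, kernel_size=k, padding=1, groups=c_in, bias=False),  # depthwise
    nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),                          # pointwise
)


def count_params(module):
    """Total number of learnable weights in a module."""
    return sum(p.numel() for p in module.parameters())


n_standard = count_params(standard)    # k*k*c_in*c_out
n_separable = count_params(separable)  # k*k*c_in + c_in*c_out

x = torch.randn(1, c_in, 32, 32)
same_shape = standard(x).shape == separable(x).shape
print(n_standard, n_separable, round(n_separable / n_standard, 3), same_shape)
```

Here the separable version needs 8,768 weights against 73,728 for the standard conv, about 12% of the original, while producing an output of the same shape.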

Depthwise Separable Conv: xResNet18 Block

xResNet18-style block: depthwise + pointwise conv inside a residual (skip) connection.


digraph depthwise_sep_conv {
  rankdir=LR;
  node [fontsize=10, shape=box, style=rounded];

  input     [label="input\nfrom previous block"];
  depthwise [label="depthwise conv\nk×k per channel"];
  pointwise [label="pointwise conv\n1×1 across channels"];
  skip      [label="skip / identity\n(optional 1×1 conv)"];
  add       [label="add\n(residual + conv output)"];
  output    [label="output\nto next block"];

  input -> depthwise -> pointwise -> add -> output;
  input -> skip -> add;
}

Transfer Learning with CNNs

Modern practice: start from a CNN pre-trained on a large dataset (e.g., ImageNet)
Freeze most convolutional layers; replace and train the final classifier head
Works well even with limited labeled data in your domain
In this lesson we implement transfer learning with data augmentation in the companion notebook

Transfer Learning with xResNet18

Use a pretrained xresnet18 backbone instead of training all weights from scratch on CIFAR-10.
Phase 1 (frozen): freeze the pretrained backbone and train only the new classification head for a few epochs.
Phase 2 (fine-tune): unfreeze more layers and train with a smaller learning rate to gently adapt pretrained features.
Helps small or medium-sized datasets achieve higher accuracy and faster convergence with less overfitting.

Switch to Notebook / Code

Open notebooks/lesson4_data_augmentation.ipynb
Add and visualize data augmentation for your training images
Apply transfer learning with a pre-trained CNN (e.g., ResNet) on a smaller dataset
Compare results with and without augmentation / fine-tuning

Image Feature Vectors

A feature vector is a numeric representation of an image: a 1D vector \([f_1, f_2, \dots, f_d]\).
A pretrained CNN (e.g., xresnet18) maps each image to a point in a high-dimensional space (e.g., 512-D).
Nearby points correspond to visually or semantically similar images (e.g., similar objects, colors, textures).
We can store these vectors and compare them instead of comparing raw pixels.

Cosine Similarity for Image Search

Cosine similarity measures the angle between two feature vectors \(u\) and \(v\): \(\cos(\theta) = \frac{u \cdot v}{\|u\|\|v\|}\).
Values are in \([-1, 1]\); higher values (closer to 1) mean more similar direction in feature space.
For image retrieval: compute the feature vector of a query image and rank dataset images by cosine similarity.
This is the core idea behind many “find similar images” and recommendation systems.
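
The ranking step can be sketched in a few lines. The helper name rank_by_cosine and the 2-D toy vectors are illustrative; real feature vectors would be, for example, 512-D outputs of a backbone.

```python
import torch
import torch.nn.functional as F


def rank_by_cosine(query, refs, k):
    """Return indices and scores of the k reference vectors most similar to query."""
    q = F.normalize(query, dim=0)  # unit-length query
    r = F.normalize(refs, dim=1)   # unit-length reference rows
    sims = r @ q                   # dot product of unit vectors = cosine similarity
    scores, idx = torch.topk(sims, k)
    return idx.tolist(), scores.tolist()


query = torch.tensor([1.0, 0.0])
refs = torch.tensor([
    [2.0, 0.0],    # same direction as the query, different length
    [0.0, 1.0],    # orthogonal
    [-1.0, 0.0],   # opposite direction
    [1.0, 1.0],    # 45 degrees away
])

top_idx, top_scores = rank_by_cosine(query, refs, k=3)
print(top_idx, [round(s, 3) for s in top_scores])
```

The first reference ranks highest despite being twice as long as the query: cosine similarity depends only on direction, not on vector magnitude.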

Evaluating Similarity Search

Accuracy: fraction of predictions that are correct, \(\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\).
Precision: of the images you retrieved as “similar”, what fraction are actually relevant? \(\text{Precision} = \frac{TP}{TP + FP}\).
Recall: of all relevant images in the dataset, what fraction did your system retrieve? \(\text{Recall} = \frac{TP}{TP + FN}\).
For each backbone (e.g., ResNet18 vs xResNet18), compute these metrics on a small labeled subset to compare retrieval quality.
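
For a single query, the three metrics follow directly from set operations on image ids. The function name and the toy ids below are made up for illustration, and the sketch assumes that retrieved and relevant ids are both drawn from all_ids.

```python
def retrieval_metrics(retrieved, relevant, all_ids):
    """Precision, recall, and accuracy for one query, from sets of image ids."""
    retrieved, relevant, all_ids = set(retrieved), set(relevant), set(all_ids)
    tp = len(retrieved & relevant)            # retrieved and relevant
    fp = len(retrieved - relevant)            # retrieved but not relevant
    fn = len(relevant - retrieved)            # relevant but missed
    tn = len(all_ids - retrieved - relevant)  # correctly left out
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(all_ids)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


all_ids = range(10)    # ten images in the reference set
relevant = {0, 1, 2}   # ground-truth matches for this query
retrieved = [0, 1, 7]  # what the system returned as top-3

metrics = retrieval_metrics(retrieved, relevant, all_ids)
print(metrics)
```

Here TP = 2, FP = 1, FN = 1, TN = 6, giving precision 2/3, recall 2/3, and accuracy 0.8. When only a few images are relevant, accuracy is dominated by true negatives, so precision and recall are usually more informative for retrieval.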

Check Your Understanding

Why does data augmentation usually improve generalization for CNNs trained on small image datasets?
How do depthwise separable convolutions reduce computation compared to a standard convolution?
What does a high cosine similarity between two image feature vectors tell you about the underlying images?
Why can transfer learning with a pretrained backbone outperform training the same architecture from scratch on CIFAR-10?

Homework

Build a small Gradio app that takes a custom input image (file upload) and returns the most similar images from a reference set.
Use a pretrained CNN backbone (e.g., xresnet18) to extract a feature vector for each image.
Compute cosine similarity between the query feature vector and all reference features, and display the top-\(k\) matches with similarity scores.
Optionally, compare at least two different backbones (e.g., ResNet18 vs xResNet18) and report precision, recall, and accuracy on a small labeled evaluation set.
Document your design choices (feature extractor, reference dataset, normalization) and briefly discuss limitations of this approach.