Lesson 5 — Variational Autoencoders & Latent PCA

Lesson 5: Variational Autoencoders & Latent PCA

Learning Objectives

Understand how the convolutional VAE in projects/face_autoencoder/src/model.py encodes and decodes face images.
Explain the reparameterization trick and the \(\beta\)-ELBO loss implemented in VAELoss.
Collect latent representations, run PCA, and use the Gradio sliders in app.py to traverse interpretable directions.

Plain Autoencoder (Deterministic)

A basic autoencoder compresses and reconstructs data without any probabilistic latent space.


digraph ae_flow {
  rankdir=LR;
  node [fontsize=11, shape=box, style=rounded];

  x      [label="Input image\nx"];
  enc    [label="Encoder\nConv / Linear layers"];
  z      [label="Latent code\nz"];
  dec    [label="Decoder\nConvTranspose / Linear"];
  x_hat  [label="Reconstruction\nx̂"];

  x -> enc -> z -> dec -> x_hat;
}

Train by minimizing reconstruction loss (e.g., MSE) between \(x\) and \(\hat{x}\).
Latent code \(z\) is a deterministic function of \(x\) with no explicit prior.
Works well for compression and denoising, but latent space may have gaps and is not guaranteed to be smooth.

Autoencoders on MNIST / Fashion-MNIST

Before VAEs, it is useful to build intuition with a small, deterministic autoencoder on simple 28x28 images.

Datasets: MNIST (digits) and Fashion-MNIST (clothing items).
Goal: learn a compact latent code that can reconstruct inputs and optionally denoise simple corruptions.
Architecture: 2–3 fully connected or convolutional layers in the encoder and a mirrored decoder.

class MnistAutoencoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 28 * 28), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z).view(-1, 1, 28, 28)
        return x_hat

We will use this style of model for the homework on MNIST and Fashion-MNIST.

Where the VAE Fits

We train a generative model that learns a smooth latent space of 64x64 RGB faces and supports interactive exploration.


digraph vae_flow {
  rankdir=LR;
  node [fontsize=12, shape=box, style=rounded, height=0.7];
  edge [penwidth=1.5];
  graph [nodesep=0.5, ranksep=0.8];

  images   [label="Input faces\n(x) [64x64x3]"];
  encoder  [label="Encoder CNN\nstrided conv blocks", style="filled,rounded", fillcolor="#e3f2fd"];
  heads    [label="Latent heads\nfc_mu & fc_logvar"];
  sample   [label="Reparameterize\nmu + sigma * eps", style="filled,rounded", fillcolor="#fff3e0"];
  decoder  [label="Decoder CNN\nConvTranspose + Tanh", style="filled,rounded", fillcolor="#e3f2fd"];
  recon    [label="Reconstruction\n~x", style="filled,rounded", fillcolor="#e8f5e9"];
  pca      [label="Latent PCA\nmetadata + sliders"];

  images -> encoder -> heads -> sample -> decoder -> recon;
  heads -> pca [style=dashed];
  sample -> pca [style=dashed];
}

The encoder compresses spatial structure; latent heads model a Gaussian posterior; the decoder mirrors the encoder.
Latent samples feed both reconstruction training and the PCA analysis used later in the app.

Encoder Anatomy

Four `_conv_block` stages progressively downsample 64 -> 32 -> 16 -> 8 -> 4 while increasing channels (32 -> 256).
Each block: `Conv2d(stride=2)` + `BatchNorm2d` + `LeakyReLU(0.2)` for stable feature scaling and nonlinearity.
A dummy tensor at init time infers the flattened dimension so we can support arbitrary image sizes.

self.encoder_cnn = nn.Sequential(
    _conv_block(image_channels, base_channels),
    _conv_block(base_channels, base_channels * 2),
    _conv_block(base_channels * 2, base_channels * 4),
    _conv_block(base_channels * 4, base_channels * 8),
)

h = self.encoder_cnn(x)
h = h.view(x.size(0), -1)

Latent Gaussian + Reparameterization

Two linear heads (`fc_mu`, `fc_logvar`) map flattened features to the mean and log-variance of \(q_\phi(z \mid x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\).
`reparameterize` keeps gradients flowing by sampling \(\epsilon \sim \mathcal{N}(0, I)\) and computing \(z = \mu + \sigma \odot \epsilon\).
`logvar` is stored instead of \(\sigma\) to ensure positivity via `torch.exp(0.5 * logvar)`.

def encode(self, x):
    h = self.encoder_cnn(x).view(x.size(0), -1)
    mu = self.fc_mu(h)
    logvar = self.fc_logvar(h)
    return mu, logvar

def reparameterize(self, mu, logvar):
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std

Decoder & Reconstruction Quality

`decoder_input` projects each latent vector back to the 4x4xC tensor inferred from the encoder.
Three `_deconv_block` stages mirror the encoder by doubling spatial size via `ConvTranspose2d` strides.
A final `ConvTranspose2d -> Tanh` predicts RGB pixels scaled to [-1, 1]; downstream utilities denormalize for visualization.

def decode(self, z):
    h = self.decoder_input(z)
    h = h.view(z.size(0), self.enc_channels, self.enc_spatial, self.enc_spatial)
    return self.decoder_cnn(h)

\(\beta\)-ELBO Loss

Objective: maximize the Evidence Lower Bound \(\mathcal{L} = \mathbb{E}[\log p_\theta(x \mid z)] - \beta \; \mathrm{KL}(q_\phi(z \mid x) \Vert p(z))\).
`VAELoss` uses `F.mse_loss` for the reconstruction term (faces are continuous) and a closed-form KL between diagonal Gaussians.
Tuning \(\beta\) trades sharp reconstructions (low \(\beta\)) for more disentangled latents (high \(\beta\)).

recon_loss = F.mse_loss(output.reconstruction, target, reduction="mean")
kl_div = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon_loss + beta * kl_div

Training Loop Helpers (`training_utils.py`)

`seed_everything` and callbacks are small utilities that make VAE experiments reproducible and easy to monitor.
Latent vectors \(\mu(x)\) collected from the encoder can be analyzed with PCA to find major directions of variation.
A simple Gradio app can then expose a few PCA directions as sliders to explore the learned latent space interactively.

Case study (optional): inspect training_utils.py and app.py in the project for a concrete implementation of these ideas.

Homework: Train an Autoencoder on MNIST & Fashion-MNIST

Implement a small PyTorch autoencoder (MLP or shallow CNN) for 28x28 grayscale images, following the MnistAutoencoder pattern.
Train on MNIST until reconstruction loss plateaus; visualize a grid of original vs reconstructed digits.
Repeat training on Fashion-MNIST; compare what kinds of structure the model captures (edges, shapes, textures).
Optionally add Gaussian noise to inputs and train a denoising autoencoder; report how this changes visual quality.
Reflect: how does a deterministic autoencoder's latent space differ from the VAE latent space used in the face project?

Why Reparameterization Trick?

Problem: sampling \(z \sim \mathcal{N}(\mu, \sigma^2)\) is non-differentiable
Direct sampling breaks backpropagation: gradients can't flow through random operations
Solution: separate randomness from learnable parameters

Without reparameterization:

# ❌ Can't backprop through this
z = torch.normal(mu, std)
# mu and std have no gradients!

With reparameterization:

# ✅ Gradients flow through mu, std
eps = torch.randn_like(std)  # fixed noise
z = mu + eps * std
# Both mu and std are differentiable!

Key insight: \(\epsilon\) is sampled once and fixed; gradients flow through \(\mu\) and \(\sigma\).

Understanding KL Divergence

The KL term regularizes the learned posterior \(q_\phi(z \mid x)\) to match the prior \(p(z) = \mathcal{N}(0, I)\).

\[ \mathrm{KL}(q_\phi(z \mid x) \Vert p(z)) = \frac{1}{2} \sum_{i=1}^d \left( \sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2 \right) \]

\(\sigma_i^2 + \mu_i^2 - 1\): penalizes large means and variances (pushes toward standard normal)
\(-\log \sigma_i^2\): prevents collapse (keeps variance from going to zero)
Without KL: encoder could map each \(x\) to a different region → no smooth latent space
With KL: encoder learns to use a shared, structured latent space

\(\beta\)-VAE: Controlling Disentanglement

The \(\beta\) parameter in \(\mathcal{L} = \mathbb{E}[\log p_\theta(x \mid z)] - \beta \; \mathrm{KL}(q_\phi(z \mid x) \Vert p(z))\) controls the trade-off:

Low \(\beta\) (e.g., 0.1-0.5)

Emphasizes reconstruction quality
Sharper, more detailed outputs
Less structured latent space
Latent dimensions may be correlated

High \(\beta\) (e.g., 2-10)

Emphasizes latent regularization
More disentangled features
Smoother, more interpretable latents
May sacrifice some reconstruction quality

Standard VAE: \(\beta = 1.0\) (balanced). \(\beta\)-VAE: \(\beta > 1\) for better disentanglement.

Reconstruction Loss: MSE vs Log(MSE) vs Perceptual

MSE Loss (Standard)

Pixel-wise squared error: \(\|x - \hat{x}\|^2\)
Simple, fast, differentiable
Problem: averages out details → blurry reconstructions
Large errors dominate gradients

Blurry face reconstruction from MSE loss

Example: MSE loss produces blurry, averaged reconstructions

recon_loss = F.mse_loss(
    output.reconstruction, 
    target
)

Log(MSE) Loss (Alternative)

\(\log(\|x - \hat{x}\|^2 + \epsilon)\)
Reweights gradients: \(\frac{1}{\text{MSE}} \cdot \frac{\partial \text{MSE}}{\partial \theta}\)
Large errors contribute less; small errors contribute more
Often produces sharper results than standard MSE
Valid alternative that sometimes outperforms MSE

mse = F.mse_loss(output.reconstruction, target)
recon_loss = torch.log(mse + 1e-8)
# Or: loss_func = VAELoss(beta=1.0, use_log_mse=True)

Visual Comparison of VAE Losses

Different reconstruction losses change how the VAE trades sharpness vs smoothness in generated faces.

Side-by-side comparison of VAE reconstructions under different losses

MSE: smooth but often blurry reconstructions.
Log(MSE): sharper details, still relatively cheap to train.
Perceptual: most realistic faces, at the cost of extra compute.

Reconstruction Loss: Perceptual Loss

Perceptual Loss (Enhanced)

Feature-space error: \(\|f(x) - f(\hat{x})\|^2\)
Uses pre-trained VGG features
Better preserves facial structure and details
Sharper, more realistic reconstructions
Most computationally expensive option

# Compare VGG features instead
recon_features = vgg(x)
target_features = vgg(target)
perceptual_loss = F.mse_loss(
    recon_features, 
    target_features
)

# Or use PerceptualVAELoss:
loss_func = PerceptualVAELoss(
    beta=1.0,
    feature_layer='relu3_3'
)

Summary: MSE (standard) → Log(MSE) (often better) → Perceptual (best quality, slower)

Perceptual Loss: VGG Feature Layers

Different VGG layers capture different levels of abstraction:

relu1_2: Low-level (edges, colors) — too shallow for faces
relu2_2: Mid-early (textures, patterns) — good detail preservation
relu3_3: Mid-level (object parts, facial structures) — recommended for 64-96px faces
relu4_3: High-level (semantic features) — may lose spatial detail for small images

# In PerceptualVAELoss
loss_func = PerceptualVAELoss(
    beta=1.0,
    perceptual_weight=1.0,
    mse_weight=0.1,  # Optional: hybrid approach
    feature_layer='relu3_3'  # Best for faces
)

For face images, relu3_3 balances structure preservation with detail.

Architecture Design Choices

Encoder

Strided convolutions: Efficient downsampling (no pooling needed)
BatchNorm: Stabilizes training, allows higher learning rates
LeakyReLU(0.2): Prevents dead neurons, common in GANs/VAEs
Progressive channels: 32→64→128→256 captures hierarchical features

Decoder

ConvTranspose2d: Learns upsampling (better than interpolation)
Mirror structure: Symmetric to encoder for balanced capacity
Tanh output: Maps to [-1, 1] matching normalized inputs
No final activation: Could use sigmoid for [0,1], but Tanh works well

Training Considerations

Learning rate: Start with 1e-3 to 2e-3; VAEs can be sensitive to LR
Batch size: Larger batches (256+) help stabilize BatchNorm and gradients
KL annealing: Start with \(\beta=0\), gradually increase to final value (helps early training)
Latent dimension: Too small → information bottleneck; too large → underutilized capacity
Monitoring: Watch both reconstruction loss (quality) and KL (regularization) separately
Early stopping: Stop when validation loss plateaus; VAEs can overfit to training faces

# Example: KL annealing
for epoch in range(epochs):
    beta = min(1.0, epoch / 50)  # Ramp up over 50 epochs
    loss_func.beta = beta
    learn.fit_one_cycle(1, lr_max=lr)

Why PCA on Latent Space?

Interpretability: PCA finds orthogonal directions of maximum variance
Dimensionality reduction: First few components capture most variation
Controllable generation: Each component often corresponds to a semantic attribute (smile, pose, lighting)
Linear interpolation: Moving along a component is smooth and predictable

PCA Components:

PC1: Often captures pose/angle
PC2: Often captures expression
PC3: Often captures lighting
PC4+: More subtle variations

Explained Variance:

First 8 components: ~60-80% variance
First 16 components: ~85-95% variance
Remaining: fine details

Latent Space Properties

Smoothness: Nearby points in latent space → similar faces (enabled by KL regularization)
Completeness: Most of latent space decodes to valid faces (not just training examples)
Interpolation: Linear paths between latents produce smooth face morphing
Arithmetic: Can do "smiling face - neutral + angry" = new expression (if disentangled)

# Example: Interpolation
z1 = model.encode(face1)[0]  # mu for face1
z2 = model.encode(face2)[0]  # mu for face2

for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    z_interp = (1 - alpha) * z1 + alpha * z2
    face_interp = model.decode(z_interp)
    # Smooth morphing between faces!

Common Issues & Solutions

Problem: Blurry Reconstructions

Try perceptual loss instead of MSE
Reduce \(\beta\) (less KL pressure)
Increase model capacity (more channels)
Check if latent dim is too small

Problem: Posterior Collapse

KL → 0, encoder ignores input
Increase \(\beta\) gradually
Use KL annealing
Check decoder isn't too powerful

Problem: Training Instability

Lower learning rate
Gradient clipping
Warm-up period
Check data normalization

Problem: Poor Latent Structure

Increase \(\beta\) for disentanglement
Use \(\beta\)-VAE (\(\beta > 1\))
Train longer
Check KL term is active

VAE vs Regular Autoencoder

Regular Autoencoder

Encoder: \(x \to z\) (deterministic)
Decoder: \(z \to \hat{x}\)
Latent space: unconstrained; may be irregular with “holes”.
Problem: sampling random \(z\) often lands off the data manifold → poor or meaningless generations.

Variational Autoencoder

Encoder: \(x \to (\mu, \sigma)\) (probabilistic)
Sample: \(z \sim \mathcal{N}(\mu, \sigma^2)\)
Decoder: \(z \to \hat{x}\)
Latent space: continuous, smooth, generative

VAEs learn a distribution over latents, enabling smooth interpolation and generation of new faces.

From Autoencoder Loss to VAE Loss

Autoencoder objective: minimize reconstruction error only, e.g. \(L_{\text{AE}} = \mathbb{E}_{x}[\|x - \hat{x}\|^2]\).
No explicit constraint on where latent codes \(z\) live → irregular latent space and poor samples from random \(z\).
VAE objective: keep reconstruction term, but add a KL regularizer that pulls \(q_\phi(z \mid x)\) toward a simple prior \(p(z)\).
This gives the Evidence Lower Bound (ELBO): \(\mathcal{L} = \mathbb{E}[\log p_\theta(x \mid z)] - \beta \, \mathrm{KL}(q_\phi(z \mid x) \Vert p(z))\).

VAE Generative Story

Prior: choose a simple latent distribution \(p(z) = \mathcal{N}(0, I)\).
Decoder: learn \(p_\theta(x \mid z)\) (e.g., Gaussian with mean given by a neural network) to reconstruct data from latent codes.
Encoder: learn \(q_\phi(z \mid x)\) that maps each example to a latent Gaussian \(\mathcal{N}(\mu(x), \sigma^2(x))\).
At test time we can either encode–decode (like an autoencoder) or sample \(z \sim p(z)\) and decode to generate new examples.

Step-by-Step: AE → VAE

Plain Autoencoder

Input \(x\)
Encoder outputs latent code \(z\)
Decoder outputs \(\hat{x}\)
Loss: reconstruction only (e.g., MSE)
Sampling: ad hoc; random \(z\) may not decode to realistic data

Variational Autoencoder

Input \(x\)
Encoder outputs \((\mu(x), \log \sigma^2(x))\)
Sample \(z = \mu + \sigma \odot \epsilon\), \(\epsilon \sim \mathcal{N}(0, I)\)
Decoder outputs \(\hat{x}\)
Loss: reconstruction + KL to prior → well-structured latent space

Next Steps

Reproduce the training pipeline in projects/face_autoencoder/face_autoencoder_training.ipynb and compare \(\beta\) values.
Experiment with different latent dimensions or `base_channels`, and observe how PCA variance ratios change.
Try PerceptualVAELoss instead of VAELoss and compare reconstruction quality.
Extend the Gradio app with preset buttons (e.g., "add smile", "turn head") by saving curated coefficient vectors.
Implement latent interpolation: encode two faces, interpolate in latent space, decode to see morphing.
Optional: try ICA or t-SNE on the collected latents to contrast with PCA's linear assumptions.