Lesson 2: How Neural Networks Learn

Learning Objectives

  • Explain how neural networks learn from data (forward pass, loss, gradients, optimizers)
  • Understand and compare common activation functions and when to use them
  • Understand regression and classification losses, including cross-entropy as “surprise”
  • Understand SGD, momentum, and Adam update rules and their effect on training
  • Apply these ideas by training, evaluating, and running inference on MNIST in the PyTorch notebook

Recap: Single Neuron & Activation

From Lesson 1: a single neuron computes a weighted sum, then applies an activation.

\[ z = \sum_{i=1}^d w_i x_i + b, \quad a = \sigma(z) \]

  • \(x_i\): input features, \(w_i\): weights, \(b\): bias
  • \(\sigma(\cdot)\): activation function (e.g., ReLU, sigmoid)
  • Activation introduces nonlinearity so networks can model complex patterns

Visualization: Simple Neuron


digraph simple_neuron {
  rankdir=LR;
  node [fontsize=14];
  edge [penwidth=1.5];

  x1 [label="x₁", width=0.6];
  x2 [label="x₂", width=0.6];
  x3 [label="x₃", width=0.6];
  neuron [label="z = w·x + b\n a = σ(z)", shape=circle, style=filled, fillcolor="#e3f2fd", width=1.5];
  y [label="output a", shape=box, style="filled,rounded", fillcolor="#e8f5e9"];

  x1 -> neuron;
  x2 -> neuron;
  x3 -> neuron;
  neuron -> y;
}
          

Inputs are combined linearly into \(z\), then passed through an activation \(\sigma\) to produce output \(a\).

Activations and Learning

  • Without nonlinear activations, stacked layers collapse to a single linear map
  • Nonlinearities shape how gradients flow and what patterns can be learned
  • Hidden layers: usually ReLU or variants (LeakyReLU, GELU)
  • Output layer: choose activation to match task and loss
  • Rule of thumb: ReLU in hidden layers, sigmoid or softmax at the output

ReLU in Practice

\(\mathrm{ReLU}(z) = \max(0, z)\) — standard for hidden units.

ReLU activation function
  • Simple piecewise linear shape (0 for \(z \lt 0\), linear for \(z \gt 0\)); cheap to compute
  • Sparse activations (many zeros) can help generalization
  • Works well with modern optimizers like Adam for vision and tabular data
  • If many units die (always zero), try LeakyReLU

LeakyReLU in Practice

\(\mathrm{LeakyReLU}_\alpha(z) = \max(\alpha z, z)\) with small \(\alpha \approx 0.01\).

LeakyReLU activation function and derivative
  • Like ReLU but with a small negative slope for \(z \lt 0\)
  • Reduces the “dying ReLU” problem by keeping gradients non-zero for negative inputs
  • Useful when many ReLU units are stuck at zero during training
  • Slightly more complex than ReLU but still cheap to compute

ELU in Practice

\(\mathrm{ELU}_\alpha(z) = \begin{cases} z & \text{if } z \ge 0 \\ \alpha (e^{z} - 1) & \text{if } z < 0 \end{cases}\), typically with \(\alpha = 1\).

ELU activation function and derivative
  • Smooth version of ReLU: negative inputs map to negative outputs instead of exactly 0
  • Can help keep activations more zero-centered and reduce bias shift
  • Slightly more expensive than ReLU/LeakyReLU due to the exponential
  • Less common today than ReLU/LeakyReLU but still useful to try in some vision models

GELU in Practice

\(\mathrm{GELU}(z) \approx \tfrac{1}{2} z \big(1 + \tanh\big(\sqrt{\tfrac{2}{\pi}} (z + 0.044715 z^3)\big)\big)\).

Smooth, probabilistic variant of ReLU, commonly used in transformers.

GELU activation function and derivative
  • Soft ReLU-like behavior with smooth gradients over all \(z\)
  • Popular in large transformer models; more expensive than ReLU/LeakyReLU
  • For small/medium models in this course, ReLU (or LeakyReLU) is usually sufficient

Sigmoid in Practice

Sigmoid activation function and derivative
  • Sigmoid squashes real-valued inputs to \((0, 1)\)
  • Derivative is largest near 0 and tiny near 0 or 1 → can cause vanishing gradients when saturated
  • Good for binary outputs with BCE-style losses; interpret as probability of class 1
  • Avoid in deep hidden layers; prefer ReLU/LeakyReLU or GELU

Tanh in Practice

\(\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\) — squashes to \((-1, 1)\).

Tanh activation function and derivative
  • Outputs are zero-centered, which can help optimization compared to sigmoid
  • Derivative is largest near 0 and saturates (goes to 0) for large \(|z|\)
  • Historically popular in RNNs; less common in modern deep nets due to saturation
  • Use when you need outputs roughly in \([-1, 1]\) and can tolerate vanishing gradients

Softmax in Practice

\[ \text{softmax}_k(\mathbf{z}) = \frac{e^{z_k}}{\sum_j e^{z_j}} \]


digraph softmax_flow {
  rankdir=LR;
  node [fontsize=10];

  x1  [label="NN score\n(cat)", shape=circle];
  x2  [label="NN score\n(dog)", shape=circle];
  x3  [label="NN score\n(plane)", shape=circle];
  x4  [label="NN score\n(car)", shape=circle];

  z1  [label="10\n(cat logit)", shape=box];
  z2  [label="4\n(dog logit)", shape=box];
  z3  [label="5\n(plane logit)", shape=box];
  z4  [label="8\n(car logit)", shape=box];
  sm  [label="softmax", shape=box, style=rounded];
  p1  [label="0.873\nP(cat)"];
  p2  [label="0.002\nP(dog)"];
  p3  [label="0.006\nP(plane)"];
  p4  [label="0.119\nP(car)"];

  x1 -> z1 -> sm;
  x2 -> z2 -> sm;
  x3 -> z3 -> sm;
  x4 -> z4 -> sm;
  sm -> p1;
  sm -> p2;
  sm -> p3;
  sm -> p4;
}
          
  • Converts a vector of logits \(\mathbf{z}\) into a probability distribution over classes
  • Used at the output of multi-class classifiers together with cross-entropy loss

Activation Cheat Sheet

  • Hidden layers: ReLU (default), LeakyReLU or ELU if many units die; GELU for transformer-style models
  • Regression output: no activation (linear); optionally clamp in code if needed
  • Binary classification: sigmoid output + binary cross-entropy style loss
  • Multi-class (single label): logits + softmax + cross-entropy loss

From Forward Pass to Learning

  • Forward pass: given weights, compute predictions \(\hat{y} = f_\theta(x)\)
  • Learning: adjust parameters \(\theta\) to make predictions match labels \(y\)
  • \(\theta\) collects all trainable parameters of the model (weights, biases, embeddings, etc.)
  • We need: a way to measure error (loss) and a way to update \(\theta\) (optimizer)
  • Training loop = repeat: forward → loss → gradients → parameter update

Diagram: Training Loop


digraph training_loop {
  rankdir=LR;
  node [shape=box, fontsize=13, style=rounded, height=0.6];
  edge [penwidth=1.5];
  graph [nodesep=0.6, ranksep=0.8];

  data      [label="Mini-batch (x, y)"];
  model     [label="Model f_θ", style="filled,rounded", fillcolor="#e3f2fd"];
  preds     [label="Predictions ŷ"];
  loss      [label="Loss L(ŷ, y)", style="filled,rounded", fillcolor="#ffcdd2"];
  grads     [label="Gradients ∂L/∂θ"];
  optimizer [label="Optimizer update\nθ ← θ - η∂L/∂θ", style="filled,rounded", fillcolor="#c8e6c9"];

  data -> model -> preds -> loss -> grads -> optimizer;
  optimizer -> model [label="updated θ", fontsize=10, style=dashed];
}
          

Key idea: the loss tells us how wrong we are; gradients tell us how to change \(\theta\).

Gradient Descent: Core Idea

We want to move parameters \(\theta\) in the direction that reduces loss.

\[ \theta_{\text{new}} = \theta_{\text{old}} - \eta \, \nabla_\theta L(\theta) \]

  • \(\nabla_\theta L\): gradient — direction of steepest increase in loss
  • \(\eta\): learning rate — how big each step is
  • \(\theta\) typically includes all weights and biases (and any other trainable parameters) in the neural network
  • Backpropagation efficiently computes these gradients layer by layer

Mini-Batch Training

  • Compute loss and gradients on a small batch of examples (e.g., 64 images)
  • Update parameters after each batch → faster, more memory efficient
  • Noisy gradients can actually help escape poor local minima
  • In PyTorch, DataLoader handles batching and shuffling

Code: Backpropagation in PyTorch

for x_batch, y_batch in data_loader:
    optimizer.zero_grad()        # reset gradients
    logits = model(x_batch)      # forward pass
    loss = loss_fn(logits, y_batch)
    loss.backward()              # backprop: compute ∂L/∂θ
    optimizer.step()             # gradient descent step on θ

This loop implements the steps: forward → loss → gradients via backward() → parameter update via optimizer.step().

Loss Functions: Measuring Error

  • Loss \(L(\hat{y}, y)\) is low when predictions are good, high when they are bad
  • We minimize average loss over the dataset: \(\frac{1}{N}\sum_i L(\hat{y}_i, y_i)\)
  • Regression: often Mean Squared Error (MSE) or Mean Absolute Error (MAE)
  • Classification: usually cross-entropy loss (works well with probabilities)

MSE Loss in Practice

Mean Squared Error for real-valued targets \(y\) and predictions \(\hat{y}\):

\[ L_{\text{MSE}} = \frac{1}{N}\sum_i (\hat{y}_i - y_i)^2 \]

  • Penalizes large errors more strongly (squares the error)
  • Smooth gradients make it a good default for many regression problems
  • Example (target \(y = 4\)): prediction \(5 \Rightarrow \text{MSE}=1\); prediction \(8 \Rightarrow \text{MSE}=16\)

MAE Loss in Practice

Mean Absolute Error for real-valued targets \(y\) and predictions \(\hat{y}\):

\[ L_{\text{MAE}} = \frac{1}{N}\sum_i \lvert \hat{y}_i - y_i \rvert \]

  • More robust to outliers: large errors are not squared
  • Gradient magnitude does not grow with the size of the error
  • Example (target \(y = 4\)): prediction \(5 \Rightarrow \text{MAE}=1\); prediction \(8 \Rightarrow \text{MAE}=4\)
  • Consider MAE when you have many outliers or heavy-tailed noise

Binary Cross-Entropy (Sigmoid Output)

For binary labels \(y \in \{0, 1\}\) and predicted probability \(p = \hat{y}\):

\[ L_{\text{BCE}}(p, y) = -\big( y \log p + (1-y)\log(1-p) \big) \]

  • Large penalty when we are confident and wrong
  • Small penalty when we are confident and correct
  • Works naturally with a sigmoid output in \((0,1)\)
  • Interpretation: when \(y=1\), the loss is \(-\log p\); when \(y=0\), it is \(-\log(1-p)\) — in both cases this measures the model's "surprise" at the true label
  • Example: \(y=1, p=0.9 \Rightarrow L \approx 0.11\); \(y=1, p=0.1 \Rightarrow L \approx 2.30\) (confident and wrong)

Multiclass Cross-Entropy (Softmax Output)

For class scores \(\mathbf{z} \in \mathbb{R}^K\) and true class index \(y \in \{0, \dots, K-1\}\):

\[ \text{softmax}_k(\mathbf{z}) = \frac{e^{z_k}}{\sum_j e^{z_j}}, \quad L(\mathbf{z}, y) = -\log \text{softmax}_{y}(\mathbf{z}) \]

  • Here \(y\) is the index of the true class; we take the softmax probability at position \(y\) and apply \(-\log\) to it
  • Softmax converts scores into a probability distribution over classes
  • Cross-entropy encourages high probability on the correct class
  • Interpretation: \(-\log p(\text{correct class})\) is the model's "surprise" — low surprise for confident, correct predictions; high surprise when it assigns low probability to the true class
  • Default choice for multi-class problems like MNIST (10 digits)
  • Example: if the correct class has probability \(0.8\), loss is \(-\log 0.8 \approx 0.22\); if it has probability \(0.2\), loss is \(-\log 0.2 \approx 1.61\)

Logits, Softmax, and CrossEntropyLoss

  • Model output: the last linear layer returns logits (unnormalized scores), not probabilities
  • nn.CrossEntropyLoss expects logits and internally applies LogSoftmax + negative log-likelihood, so do not add a softmax layer inside the model
import torch
import torch.nn as nn

logits = model(x_batch)               # shape: [batch_size, num_classes]
targets = y_batch                     # integer class labels, shape: [batch_size]

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)       # expects raw logits

pred_classes = logits.argmax(dim=1)   # class predictions from logits
probs = torch.softmax(logits, dim=1)  # optional: class probabilities for analysis

Training: pass logits + integer labels to CrossEntropyLoss. Inference: use argmax on logits; apply softmax only if you need explicit probabilities.

Matching Activation and Loss (PyTorch)

  • Binary classification: use a single logit and BCEWithLogitsLoss (includes sigmoid)
  • Multi-class classification: use logits of size K and CrossEntropyLoss (includes softmax)
  • Regression: use linear outputs with MSELoss (or L1Loss for MAE)
  • Always check that the model outputs and loss function expect the same shape and scale

Optimizers: How We Step

  • SGD: basic gradient descent with learning rate \(\eta\)
  • SGD + Momentum: smooths updates using a running average of gradients
  • Adam: adaptive step sizes per-parameter + momentum (good default)
  • Choice affects convergence speed and stability, not model capacity

SGD with Momentum (Update Rule)

Momentum adds a velocity term that accumulates gradients over time.

\[ v_{t} = \beta v_{t-1} + (1 - \beta)\,\nabla_\theta L_t, \quad \theta_{t+1} = \theta_t - \eta v_t \]

  • \(v_t\): running average of recent gradients (velocity)
  • \(\beta \in [0,1)\): momentum factor (e.g., 0.9) controlling how much history to keep
  • Helps smooth noisy gradients and accelerates progress along consistent directions
  • In PyTorch: torch.optim.SGD(model.parameters(), lr=..., momentum=0.9)

Adam Optimizer (Update Rule)

Adam keeps moving averages of both gradients and squared gradients.

\[ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,\nabla_\theta L_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,(\nabla_\theta L_t)^2 \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\ \theta_{t+1} &= \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{aligned} \]

  • \(\hat{m}_t\): momentum-like term (first moment); \(\hat{v}_t\): per-parameter variance estimate (second moment)
  • Adaptive step size: parameters with noisy/large gradients get smaller effective steps
  • Works well out-of-the-box with defaults (\(\beta_1=0.9, \beta_2=0.999, \epsilon=10^{-8}\))
  • In PyTorch: torch.optim.Adam(model.parameters(), lr=1e-3)

Optimizer Cheat Sheet (PyTorch)

  • Default: Adam(model.parameters(), lr=1e-3) for many small/medium models
  • When to favor SGD: very large datasets or convnets, when you can tune learning rate + momentum
  • AdamW: variant of Adam with better weight decay; common for transformers

Optimizer Comparison on MNIST

Training loss comparison for SGD, SGD+Momentum, and Adam on MNIST
  • On a small MNIST MLP, SGD with momentum and Adam reduce training loss faster than plain SGD
  • Momentum accelerates progress along consistent directions; Adam adapts step sizes per parameter
  • Use Adam as a strong default; switch to SGD + momentum when you can afford tuning for large convnets

Learning Rate Intuition

  • Too small → very slow training, may get stuck
  • Too large → loss oscillates or diverges
  • Practical tip: try \(\eta \in \{1e{-3}, 3e{-3}, 1e{-4}\}\) with Adam
  • Always monitor training and validation curves

Reading Training Curves

  • Training loss ↓ and validation loss ↓ → learning and generalizing
  • Training loss ↓, validation loss ↑ → overfitting
  • Both flat → optimizer or learning rate issues
  • Use curves to decide when to stop or adjust hyperparameters

Switch to Notebook: MNIST

  • Now we will apply these ideas end-to-end on MNIST in the companion notebook
  • Open notebooks/lesson2_Pytorch_MNIST.ipynb to walk through:
  • Loading MNIST, defining the model, training, evaluation, and simple inference examples

Homework

  • Train a model on the Fashion-MNIST dataset and evaluate its performance (CIFAR-10 is preferred if you have a GPU or strong CPU).
  • Experiment to find a good combination of architecture, optimizer, learning rate, and number of epochs.
  • Add visualizations similar to the Lesson 2 MNIST notebook: plot training and validation loss and accuracy curves.
  • Run inference with your best model on a new image captured with a camera and analyze the prediction.