Lesson 2 — How Neural Networks Learn

Learning Objectives

Explain how neural networks learn from data (forward pass, loss, gradients, optimizers)
Understand and compare common activation functions and when to use them
Understand regression and classification losses, including cross-entropy as “surprise”
Understand SGD, momentum, and Adam update rules and their effect on training
Apply these ideas by training, evaluating, and running inference on MNIST in the PyTorch notebook

Recap: Single Neuron & Activation

From Lesson 1: a single neuron computes a weighted sum, then applies an activation.

\[ z = \sum_{i=1}^d w_i x_i + b, \quad a = \sigma(z) \]

\(x_i\): input features, \(w_i\): weights, \(b\): bias
\(\sigma(\cdot)\): activation function (e.g., ReLU, sigmoid)
Activation introduces nonlinearity so networks can model complex patterns

Visualization: Simple Neuron


digraph simple_neuron {
  rankdir=LR;
  node [fontsize=14];
  edge [penwidth=1.5];

  x1 [label="x₁", width=0.6];
  x2 [label="x₂", width=0.6];
  x3 [label="x₃", width=0.6];
  neuron [label="z = w·x + b\n a = σ(z)", shape=circle, style=filled, fillcolor="#e3f2fd", width=1.5];
  y [label="output a", shape=box, style="filled,rounded", fillcolor="#e8f5e9"];

  x1 -> neuron;
  x2 -> neuron;
  x3 -> neuron;
  neuron -> y;
}

Inputs are combined linearly into \(z\), then passed through an activation \(\sigma\) to produce output \(a\).
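
As a minimal sketch of the formula above, the snippet below computes \(z\) and \(a\) for one neuron in plain Python. The input values, weights, and bias are made-up numbers for illustration, and ReLU is assumed as the activation.

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)

def neuron(x, w, b, activation=relu):
    """Weighted sum of the inputs plus bias, followed by an activation."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return z, activation(z)

# Hypothetical inputs, weights, and bias (illustrative values only)
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.1

z, a = neuron(x, w, b)
print("z =", z, "a =", a)  # z is negative here, so ReLU outputs 0
```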

Activations and Learning

Without nonlinear activations, stacked layers collapse to a single linear map
Nonlinearities shape how gradients flow and what patterns can be learned
Hidden layers: usually ReLU or variants (LeakyReLU, GELU)
Output layer: choose activation to match task and loss
Rule of thumb: ReLU in hidden layers; at the output, sigmoid (binary) or softmax (multi-class) for classification and no activation for regression
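
The "collapse" claim can be checked numerically. The sketch below uses two small hand-picked weight matrices (hypothetical values, biases omitted for brevity) and shows that two stacked linear layers give the same output as the single matrix \(W_2 W_1\), while inserting a ReLU between them changes the result.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu_vec(v):
    return [max(0.0, v_i) for v_i in v]

# Hypothetical weights and input, chosen only for illustration
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

two_linear = matvec(W2, matvec(W1, x))          # layer 2 applied to layer 1
one_linear = matvec(matmul(W2, W1), x)          # single combined linear map
with_relu = matvec(W2, relu_vec(matvec(W1, x))) # nonlinearity in between

print(two_linear, one_linear, with_relu)
```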

ReLU in Practice

\(\mathrm{ReLU}(z) = \max(0, z)\) — standard for hidden units.

Simple piecewise linear shape (0 for \(z \lt 0\), linear for \(z \gt 0\)); cheap to compute
Sparse activations (many zeros) can help generalization
Works well with modern optimizers like Adam for vision and tabular data
If many units die (always zero), try LeakyReLU

LeakyReLU in Practice

\(\mathrm{LeakyReLU}_\alpha(z) = \max(\alpha z, z)\) with small \(\alpha \approx 0.01\).

Figure: LeakyReLU activation function and derivative

Like ReLU but with a small negative slope for \(z \lt 0\)
Reduces the “dying ReLU” problem by keeping gradients non-zero for negative inputs
Useful when many ReLU units are stuck at zero during training
Slightly more complex than ReLU but still cheap to compute
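
To see why the negative slope matters, here is a small sketch of both activations and their derivatives in plain Python. The derivative value at exactly \(z = 0\) is a convention chosen here, not something the slides specify.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # 1 for z > 0, 0 for z < 0 (0 is used at z == 0 by convention)
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    # 1 for z > 0, alpha otherwise, so the gradient never vanishes completely
    return 1.0 if z > 0 else alpha

for z in (-2.0, 0.5):
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```

For a negative input, ReLU passes back a gradient of exactly 0 (the unit cannot recover through that example), while LeakyReLU passes back \(\alpha\).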

ELU in Practice

\(\mathrm{ELU}_\alpha(z) = \begin{cases} z & \text{if } z \ge 0 \\ \alpha (e^{z} - 1) & \text{if } z < 0 \end{cases}\), typically with \(\alpha = 1\).

Smooth version of ReLU: negative inputs map to negative outputs instead of exactly 0
Can help keep activations more zero-centered and reduce bias shift
Slightly more expensive than ReLU/LeakyReLU due to the exponential
Less common today than ReLU/LeakyReLU but still useful to try in some vision models

GELU in Practice

\(\mathrm{GELU}(z) \approx \tfrac{1}{2} z \big(1 + \tanh\big(\sqrt{\tfrac{2}{\pi}} (z + 0.044715 z^3)\big)\big)\).

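Smooth variant of ReLU that weights each input by the standard normal CDF, \(\mathrm{GELU}(z) = z\,\Phi(z)\); the formula above is the common tanh approximation. Commonly used in transformers.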

Soft ReLU-like behavior with smooth gradients over all \(z\)
Popular in large transformer models; more expensive than ReLU/LeakyReLU
For small/medium models in this course, ReLU (or LeakyReLU) is usually sufficient
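
As a quick check of the tanh approximation, the sketch below compares it with the exact definition \(z\,\Phi(z)\), where \(\Phi\) is the standard normal CDF computed via math.erf. The function names are illustrative.

```python
import math

def gelu_exact(z):
    """Exact GELU: z times the standard normal CDF of z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Tanh approximation of GELU, as in the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

points = [-3.0, -1.0, 0.0, 1.0, 3.0]
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in points)
for z in points:
    print(z, round(gelu_exact(z), 4), round(gelu_tanh(z), 4))
print("largest gap on these points:", max_gap)
```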

Sigmoid in Practice

Figure: Sigmoid activation function and derivative

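Sigmoid \(\sigma(z) = \frac{1}{1 + e^{-z}}\) squashes real-valued inputs to \((0, 1)\)
Derivative \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) peaks at 0.25 when \(z = 0\) and becomes tiny when the output is near 0 or 1 → can cause vanishing gradients when saturated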
Good for binary outputs with BCE-style losses; interpret as probability of class 1
Avoid in deep hidden layers; prefer ReLU/LeakyReLU or GELU
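
A short sketch of saturation in numbers, using the standard identity \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\). The sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    # Fine for moderate z; very negative z would overflow math.exp
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0):
    print(z, round(sigmoid(z), 5), sigmoid_grad(z))
```

The gradient falls from 0.25 at \(z = 0\) to well below \(10^{-4}\) at \(z = 10\); multiplied across many layers, factors like these are what make gradients vanish.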

Tanh in Practice

\(\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}\) — squashes to \((-1, 1)\).

Outputs are zero-centered, which can help optimization compared to sigmoid
Derivative is largest near 0 and saturates (goes to 0) for large \(|z|\)
Historically popular in RNNs; less common in modern deep nets due to saturation
Use when you need outputs roughly in \([-1, 1]\) and can tolerate vanishing gradients

Softmax in Practice

\[ \text{softmax}_k(\mathbf{z}) = \frac{e^{z_k}}{\sum_j e^{z_j}} \]


digraph softmax_flow {
  rankdir=LR;
  node [fontsize=10];

  x1  [label="NN score\n(cat)", shape=circle];
  x2  [label="NN score\n(dog)", shape=circle];
  x3  [label="NN score\n(plane)", shape=circle];
  x4  [label="NN score\n(car)", shape=circle];

  z1  [label="10\n(cat logit)", shape=box];
  z2  [label="4\n(dog logit)", shape=box];
  z3  [label="5\n(plane logit)", shape=box];
  z4  [label="8\n(car logit)", shape=box];
  sm  [label="softmax", shape=box, style=rounded];
  p1  [label="0.874\nP(cat)"];
  p2  [label="0.002\nP(dog)"];
  p3  [label="0.006\nP(plane)"];
  p4  [label="0.118\nP(car)"];

  x1 -> z1 -> sm;
  x2 -> z2 -> sm;
  x3 -> z3 -> sm;
  x4 -> z4 -> sm;
  sm -> p1;
  sm -> p2;
  sm -> p3;
  sm -> p4;
}

Converts a vector of logits \(\mathbf{z}\) into a probability distribution over classes
Used at the output of multi-class classifiers together with cross-entropy loss
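
The diagram's numbers can be reproduced (up to rounding) with a few lines of plain Python. Subtracting the largest logit before exponentiating is a common stability trick and does not change the result; it is an addition here, not something the slides require.

```python
import math

def softmax(logits):
    """Softmax with the max-subtraction trick to avoid overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10.0, 4.0, 5.0, 8.0]  # cat, dog, plane, car (from the diagram)
probs = softmax(logits)
print([round(p, 3) for p in probs], "sum =", sum(probs))
```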

Activation Cheat Sheet

Hidden layers: ReLU (default), LeakyReLU or ELU if many units die; GELU for transformer-style models
Regression output: no activation (linear); optionally clamp in code if needed
Binary classification: sigmoid output + binary cross-entropy style loss
Multi-class (single label): logits + softmax + cross-entropy loss

From Forward Pass to Learning

Forward pass: given weights, compute predictions \(\hat{y} = f_\theta(x)\)
Learning: adjust parameters \(\theta\) to make predictions match labels \(y\)
\(\theta\) collects all trainable parameters of the model (weights, biases, embeddings, etc.)
We need: a way to measure error (loss) and a way to update \(\theta\) (optimizer)
Training loop = repeat: forward → loss → gradients → parameter update

Diagram: Training Loop


digraph training_loop {
  rankdir=LR;
  node [shape=box, fontsize=13, style=rounded, height=0.6];
  edge [penwidth=1.5];
  graph [nodesep=0.6, ranksep=0.8];

  data      [label="Mini-batch (x, y)"];
  model     [label="Model f_θ", style="filled,rounded", fillcolor="#e3f2fd"];
  preds     [label="Predictions ŷ"];
  loss      [label="Loss L(ŷ, y)", style="filled,rounded", fillcolor="#ffcdd2"];
  grads     [label="Gradients ∂L/∂θ"];
  optimizer [label="Optimizer update\nθ ← θ - η∂L/∂θ", style="filled,rounded", fillcolor="#c8e6c9"];

  data -> model -> preds -> loss -> grads -> optimizer;
  optimizer -> model [label="updated θ", fontsize=10, style=dashed];
}

Key idea: the loss tells us how wrong we are; gradients tell us how to change \(\theta\).

Gradient Descent: Core Idea

We want to move parameters \(\theta\) in the direction that reduces loss.

\[ \theta_{\text{new}} = \theta_{\text{old}} - \eta \, \nabla_\theta L(\theta) \]

\(\nabla_\theta L\): gradient — direction of steepest increase in loss
\(\eta\): learning rate — how big each step is
\(\theta\) typically includes all weights and biases (and any other trainable parameters) in the neural network
Backpropagation efficiently computes these gradients layer by layer
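
To make the update rule concrete, here is a sketch on a one-parameter toy loss \(L(\theta) = (\theta - 3)^2\), whose gradient is known in closed form. The loss, starting point, and learning rate are assumptions chosen for illustration; a real network gets its gradients from backpropagation instead.

```python
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    # dL/dtheta for the toy loss above
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
eta = 0.1     # learning rate
for step in range(50):
    theta = theta - eta * grad(theta)   # theta_new = theta_old - eta * gradient

print("theta:", theta, "loss:", loss(theta))
```

Each step shrinks the distance to the minimum at \(\theta = 3\) by a factor of \(1 - 2\eta = 0.8\).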

Mini-Batch Training

Compute loss and gradients on a small batch of examples (e.g., 64 images)
Update parameters after each batch → faster, more memory efficient
Noisy gradients can actually help escape poor local minima
In PyTorch, DataLoader handles batching and shuffling
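
Conceptually, batching and shuffling amount to the following pure-Python sketch. The helper name minibatches and its seed parameter are hypothetical; in practice PyTorch's DataLoader does this for you and handles much more (tensors, worker processes, collation).

```python
import random

def minibatches(examples, batch_size, shuffle=True, seed=0):
    """Yield lists of examples, batch_size at a time (last batch may be smaller)."""
    indices = list(range(len(examples)))
    if shuffle:
        random.Random(seed).shuffle(indices)   # new order, same examples
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]

data = list(range(10))                  # stand-in for 10 training examples
batches = list(minibatches(data, batch_size=4))
print([len(b) for b in batches])
```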

Code: Backpropagation in PyTorch

for x_batch, y_batch in data_loader:
    optimizer.zero_grad()        # reset gradients
    logits = model(x_batch)      # forward pass
    loss = loss_fn(logits, y_batch)
    loss.backward()              # backprop: compute ∂L/∂θ
    optimizer.step()             # gradient descent step on θ

This loop implements the steps: forward → loss → gradients via backward() → parameter update via optimizer.step().
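
For a version of this loop that runs on its own, the sketch below fills in the missing pieces with assumptions: a tiny synthetic two-class dataset instead of MNIST, a small MLP, and Adam with an arbitrarily chosen learning rate. None of these choices come from the lesson notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for a real dataset: 256 points, 2 features, 2 classes
x = torch.randn(256, 2)
y = (x[:, 0] + x[:, 1] > 0).long()
data_loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

epoch_losses = []
for epoch in range(20):
    running = 0.0
    for x_batch, y_batch in data_loader:
        optimizer.zero_grad()                 # reset gradients
        logits = model(x_batch)               # forward pass
        loss = loss_fn(logits, y_batch)       # scalar loss for this batch
        loss.backward()                       # backprop
        optimizer.step()                      # parameter update
        running += loss.item() * x_batch.size(0)
    epoch_losses.append(running / len(x))     # average loss over the epoch

print("first epoch:", epoch_losses[0], "last epoch:", epoch_losses[-1])
```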

Loss Functions: Measuring Error

Loss \(L(\hat{y}, y)\) is low when predictions are good, high when they are bad
We minimize average loss over the dataset: \(\frac{1}{N}\sum_i L(\hat{y}_i, y_i)\)
Regression: often Mean Squared Error (MSE) or Mean Absolute Error (MAE)
Classification: usually cross-entropy loss (works well with probabilities)

MSE Loss in Practice

Mean Squared Error for real-valued targets \(y\) and predictions \(\hat{y}\):

\[ L_{\text{MSE}} = \frac{1}{N}\sum_i (\hat{y}_i - y_i)^2 \]

Penalizes large errors more strongly (squares the error)
Smooth gradients make it a good default for many regression problems
Example (target \(y = 4\)): prediction \(5 \Rightarrow \text{MSE}=1\); prediction \(8 \Rightarrow \text{MSE}=16\)

MAE Loss in Practice

Mean Absolute Error for real-valued targets \(y\) and predictions \(\hat{y}\):

\[ L_{\text{MAE}} = \frac{1}{N}\sum_i \lvert \hat{y}_i - y_i \rvert \]

More robust to outliers: large errors are not squared
Gradient magnitude does not grow with the size of the error
Example (target \(y = 4\)): prediction \(5 \Rightarrow \text{MAE}=1\); prediction \(8 \Rightarrow \text{MAE}=4\)
Consider MAE when you have many outliers or heavy-tailed noise
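
The difference between the two losses shows up most clearly with an outlier. The sketch below uses made-up predictions for a constant target of 4: one set where every error is 1, and one where a single prediction is off by 8.

```python
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def mae(preds, targets):
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

targets = [4.0, 4.0, 4.0, 4.0]
close = [5.0, 3.0, 5.0, 3.0]          # every error has size 1
one_outlier = [4.0, 4.0, 4.0, 12.0]   # three perfect predictions, one error of 8

print("close:      ", mse(close, targets), mae(close, targets))
print("one outlier:", mse(one_outlier, targets), mae(one_outlier, targets))
```

Both losses agree on the first set, but the single large error raises MSE sixteen-fold and MAE only two-fold.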

Binary Cross-Entropy (Sigmoid Output)

For binary labels \(y \in \{0, 1\}\) and predicted probability \(p = \hat{y}\):

\[ L_{\text{BCE}}(p, y) = -\big( y \log p + (1-y)\log(1-p) \big) \]

Large penalty when we are confident and wrong
Small penalty when we are confident and correct
Works naturally with a sigmoid output in \((0,1)\)
Interpretation: when \(y=1\), the loss is \(-\log p\); when \(y=0\), it is \(-\log(1-p)\) — in both cases this measures the model's "surprise" at the true label
Example: \(y=1, p=0.9 \Rightarrow L \approx 0.11\); \(y=1, p=0.1 \Rightarrow L \approx 2.30\) (confident and wrong)
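
The example values can be checked directly from the formula. This sketch assumes \(p\) is strictly between 0 and 1; it does no clipping.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one example; requires 0 < p < 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

confident_correct = bce(0.9, 1)
confident_wrong = bce(0.1, 1)
print(round(confident_correct, 2), round(confident_wrong, 2))
```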

Multiclass Cross-Entropy (Softmax Output)

For class scores \(\mathbf{z} \in \mathbb{R}^K\) and true class index \(y \in \{0, \dots, K-1\}\):

\[ \text{softmax}_k(\mathbf{z}) = \frac{e^{z_k}}{\sum_j e^{z_j}}, \quad L(\mathbf{z}, y) = -\log \text{softmax}_{y}(\mathbf{z}) \]

Here \(y\) is the index of the true class; we take the softmax probability at position \(y\) and apply \(-\log\) to it
Softmax converts scores into a probability distribution over classes
Cross-entropy encourages high probability on the correct class
Interpretation: \(-\log p(\text{correct class})\) is the model's "surprise" — low surprise for confident, correct predictions; high surprise when it assigns low probability to the true class
Default choice for multi-class problems like MNIST (10 digits)
Example: if the correct class has probability \(0.8\), loss is \(-\log 0.8 \approx 0.22\); if it has probability \(0.2\), loss is \(-\log 0.2 \approx 1.61\)
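
Since \(-\log \text{softmax}_y(\mathbf{z}) = \log \sum_j e^{z_j} - z_y\), the loss can be computed straight from the logits. The sketch below does this with the usual max-subtraction trick for stability; the function name is illustrative, and the logits are constructed so that the softmax probabilities are exactly 0.8, 0.1, 0.1.

```python
import math

def cross_entropy_from_logits(logits, target_index):
    """-log softmax(logits)[target_index], computed via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

# Logits whose softmax is (0.8, 0.1, 0.1)
logits = [math.log(0.8), math.log(0.1), math.log(0.1)]

loss_if_class0 = cross_entropy_from_logits(logits, 0)  # true class had p = 0.8
loss_if_class1 = cross_entropy_from_logits(logits, 1)  # true class had p = 0.1
print(round(loss_if_class0, 3), round(loss_if_class1, 3))
```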

Logits, Softmax, and CrossEntropyLoss

Model output: the last linear layer returns logits (unnormalized scores), not probabilities
nn.CrossEntropyLoss expects logits and internally applies LogSoftmax + negative log-likelihood, so do not add a softmax layer inside the model

import torch
import torch.nn as nn

logits = model(x_batch)               # shape: [batch_size, num_classes]
targets = y_batch                     # integer class labels, shape: [batch_size]

loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)       # expects raw logits

pred_classes = logits.argmax(dim=1)   # class predictions from logits
probs = torch.softmax(logits, dim=1)  # optional: class probabilities for analysis

Training: pass logits + integer labels to CrossEntropyLoss. Inference: use argmax on logits; apply softmax only if you need explicit probabilities.
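
To confirm that CrossEntropyLoss really does take raw logits, the sketch below compares it with a manual log-softmax followed by picking out the true-class entry. The logits and targets are arbitrary example values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])   # shape: [batch_size=2, num_classes=3]
targets = torch.tensor([0, 2])             # integer class labels

loss_builtin = nn.CrossEntropyLoss()(logits, targets)

# Manual version: log-softmax, take the true-class log-probability, average
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(len(targets)), targets].mean()

print(loss_builtin.item(), loss_manual.item())
```

Applying softmax inside the model and then passing the result to CrossEntropyLoss would apply the normalization twice and give a different loss.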

Matching Activation and Loss (PyTorch)

Binary classification: use a single logit and BCEWithLogitsLoss (includes sigmoid)
Multi-class classification: use logits of size K and CrossEntropyLoss (includes softmax)
Regression: use linear outputs with MSELoss (or L1Loss for MAE)
Always check that the model outputs and loss function expect the same shape and scale
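
One reason to feed a raw logit to a "with logits" loss rather than applying sigmoid yourself is numerical stability. The sketch below contrasts a naive sigmoid-then-BCE computation with one standard stable rewrite, \(\max(z, 0) - z y + \log(1 + e^{-|z|})\). It illustrates the idea in plain Python and is not claimed to be PyTorch's exact internal code.

```python
import math

def naive_bce(z, y):
    """Sigmoid followed by BCE; breaks when the sigmoid rounds to exactly 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(z, y):
    """Same loss computed directly from the logit z in a stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Moderate logit: both versions agree
agree_gap = abs(naive_bce(2.0, 1) - bce_with_logits(2.0, 1))

# Large logit with the wrong label: sigmoid(40) rounds to 1.0, so log(1 - p) fails
try:
    naive_bce(40.0, 0)
    naive_failed = False
except ValueError:
    naive_failed = True

stable_value = bce_with_logits(40.0, 0)
print(agree_gap, naive_failed, stable_value)
```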

Optimizers: How We Step

SGD: basic gradient descent with learning rate \(\eta\)
SGD + Momentum: smooths updates using a running average of gradients
Adam: adaptive step sizes per-parameter + momentum (good default)
Choice affects convergence speed and stability, not model capacity

SGD with Momentum (Update Rule)

Momentum adds a velocity term that accumulates gradients over time.

\[ v_{t} = \beta v_{t-1} + (1 - \beta)\,\nabla_\theta L_t, \quad \theta_{t+1} = \theta_t - \eta v_t \]

\(v_t\): running average of recent gradients (velocity)
\(\beta \in [0,1)\): momentum factor (e.g., 0.9) controlling how much history to keep
Helps smooth noisy gradients and accelerates progress along consistent directions
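In PyTorch: torch.optim.SGD(model.parameters(), lr=..., momentum=0.9); by default PyTorch accumulates \(v_t = \beta v_{t-1} + \nabla_\theta L_t\) (no \(1-\beta\) factor), so the same \(\eta\) gives larger effective steps than the formula above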
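
The update rule above, in its running-average form with the \((1 - \beta)\) factor, can be sketched on the toy loss \(L(\theta) = (\theta - 3)^2\). The loss, hyperparameters, and function names are assumptions for illustration.

```python
def grad(theta):
    # Gradient of the toy loss (theta - 3)^2
    return 2.0 * (theta - 3.0)

def sgd_momentum(theta, eta=0.1, beta=0.9, steps=200):
    v = 0.0                                  # velocity starts at zero
    for _ in range(steps):
        g = grad(theta)
        v = beta * v + (1.0 - beta) * g      # running average of gradients
        theta = theta - eta * v              # step along the velocity
    return theta

theta_final = sgd_momentum(theta=0.0)
print(theta_final)
```

On this problem the iterate overshoots the minimum slightly and then settles, which is the typical "heavy ball" behavior of momentum.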

Adam Optimizer (Update Rule)

Adam keeps moving averages of both gradients and squared gradients.

\[ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,\nabla_\theta L_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,(\nabla_\theta L_t)^2 \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\ \theta_{t+1} &= \theta_t - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \end{aligned} \]

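\(\hat{m}_t\): momentum-like term (first moment); \(\hat{v}_t\): per-parameter running average of squared gradients (second moment, an uncentered variance estimate); dividing by \(1-\beta^t\) corrects the bias from initializing both averages at zero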
Adaptive step size: parameters with noisy/large gradients get smaller effective steps
Works well out-of-the-box with defaults (\(\beta_1=0.9, \beta_2=0.999, \epsilon=10^{-8}\))
In PyTorch: torch.optim.Adam(model.parameters(), lr=1e-3)
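
A scalar sketch of these four equations makes the "adaptive step size" visible: on the first step the update has size close to \(\eta\) no matter how large the gradient is. The function name adam_minimize and the toy loss are hypothetical; the default hyperparameters mirror the ones listed above.

```python
import math

def adam_minimize(grad_fn, theta, eta=1e-3, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta

def grad(theta):
    return 2.0 * (theta - 3.0)                   # toy loss (theta - 3)^2

def grad_scaled(theta):
    return 1000.0 * grad(theta)                  # same loss, scaled by 1000

one_step = adam_minimize(grad, 0.0, eta=0.1, steps=1)
one_step_scaled = adam_minimize(grad_scaled, 0.0, eta=0.1, steps=1)
after_20 = adam_minimize(grad, 0.0, eta=0.1, steps=20)
print(one_step, one_step_scaled, after_20)
```

Plain gradient descent would take a step 1000 times larger on the scaled loss; Adam's first step is essentially the same in both cases.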

Optimizer Cheat Sheet (PyTorch)

Default: Adam(model.parameters(), lr=1e-3) for many small/medium models
When to favor SGD: very large datasets or convnets, when you can tune learning rate + momentum
AdamW: variant of Adam with decoupled weight decay (applied directly to the weights rather than through the gradient); common for transformers

Optimizer Comparison on MNIST

Figure: Training loss comparison for SGD, SGD+Momentum, and Adam on MNIST

On a small MNIST MLP, SGD with momentum and Adam reduce training loss faster than plain SGD
Momentum accelerates progress along consistent directions; Adam adapts step sizes per parameter
Use Adam as a strong default; switch to SGD + momentum when you can afford tuning for large convnets

Learning Rate Intuition

Too small → very slow training, may get stuck
Too large → loss oscillates or diverges
Practical tip: try \(\eta \in \{10^{-4}, 10^{-3}, 3 \times 10^{-3}\}\) with Adam
Always monitor training and validation curves
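
The "too small" and "too large" regimes are easy to reproduce with plain gradient descent on the toy loss \(L(\theta) = (\theta - 3)^2\). The specific learning rates below are tied to this toy problem and plain gradient descent; they are not recommendations for Adam or for real networks.

```python
def run_gd(eta, steps=20, theta=0.0):
    """Plain gradient descent on (theta - 3)^2 for a fixed number of steps."""
    for _ in range(steps):
        theta = theta - eta * 2.0 * (theta - 3.0)
    return theta

too_small = run_gd(0.001)    # barely moves away from the start
reasonable = run_gd(0.1)     # gets close to the minimum at 3
too_large = run_gd(1.1)      # every step overshoots and the error grows

for name, value in [("too small", too_small), ("reasonable", reasonable),
                    ("too large", too_large)]:
    print(name, "distance to minimum:", abs(value - 3.0))
```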

Reading Training Curves

Training loss ↓ and validation loss ↓ → learning and generalizing
Training loss ↓, validation loss ↑ → overfitting
Both flat from the start → the model is not learning; suspect the learning rate or optimizer setup first
Use curves to decide when to stop or adjust hyperparameters
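
One simple way to turn curves into a stopping decision is to pick the epoch with the lowest validation loss. The sketch below does this on made-up loss values that show the overfitting pattern described above; best_epoch is a hypothetical helper, not a PyTorch function.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

# Made-up curves: training loss keeps falling, validation loss turns upward
train_losses = [1.00, 0.70, 0.50, 0.40, 0.30, 0.20]
val_losses = [1.10, 0.80, 0.65, 0.60, 0.70, 0.85]

stop_at = best_epoch(val_losses)
overfitting_after = val_losses[-1] > val_losses[stop_at]
print("best epoch:", stop_at, "overfitting afterwards:", overfitting_after)
```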

Switch to Notebook: MNIST

Now we will apply these ideas end-to-end on MNIST in the companion notebook
Open notebooks/lesson2_Pytorch_MNIST.ipynb to walk through:
Loading MNIST, defining the model, training, evaluation, and simple inference examples

Homework

Train a model on the Fashion-MNIST dataset and evaluate its performance (CIFAR-10 is preferred if you have a GPU or strong CPU).
Experiment to find a good combination of architecture, optimizer, learning rate, and number of epochs.
Add visualizations similar to the Lesson 2 MNIST notebook: plot training and validation loss and accuracy curves.
Run inference with your best model on a new image captured with a camera and analyze the prediction.