Lesson 1: From Regression to Deep Learning

Learning Objectives

  • Why linear models fail on complex data
  • How layers + activations extend regression
  • Visualize decision boundaries forming
  • Forward pass intuition with minimal math

Why Regression Isn’t Enough

  • Recall: logistic regression draws a straight-line (hyperplane) decision boundary
  • Curved datasets (circles, spirals) break linear separability
  • Prompt: “How could we bend this decision line?”
Figure: two example datasets, concentric circles and intertwined spirals.
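
To make the failure concrete, the sketch below fits plain logistic regression by gradient descent on two noisy rings, using only NumPy. The generator make_rings, the learning rate, and the step count are illustrative assumptions, not part of the lesson's notebook. No single line separates an inner ring from an outer one, so accuracy stays well short of a good fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rings(n_per_class=500, noise=0.05):
    """Two noisy concentric circles: radius 0.5 (label 1) and radius 1.0 (label 0)."""
    n = 2 * n_per_class
    t = rng.uniform(0, 2 * np.pi, size=n)
    r = np.r_[np.full(n_per_class, 0.5), np.full(n_per_class, 1.0)]
    X = np.c_[r * np.cos(t), r * np.sin(t)] + noise * rng.standard_normal((n, 2))
    y = np.r_[np.ones(n_per_class), np.zeros(n_per_class)]
    return X, y

X, y = make_rings()

# Logistic regression: p = sigmoid(w . x + b), trained with plain gradient descent.
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)   # gradient of mean cross-entropy w.r.t. w
    b -= 0.5 * np.mean(p - y)             # gradient w.r.t. b

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
linear_acc = float(np.mean((p > 0.5) == (y == 1)))
print(f"Linear model accuracy on rings: {linear_acc:.2f}")
```

Even the best possible straight line gets only about two thirds of these points right, which is the gap hidden layers are meant to close.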

Notation

  • Input vector: \(\mathbf{x} \in \mathbb{R}^d\)
  • Weights: \(\mathbf{w} \in \mathbb{R}^d\), bias: \(b \in \mathbb{R}\)
  • Activation (nonlinearity): \(\sigma(\cdot)\) e.g., ReLU, sigmoid

Tip: hidden layers typically use ReLU; outputs use sigmoid for binary classification.

Single Neuron (Scalar Output)

Linear combination + nonlinearity:

\[ \hat{y} = \sigma\big( \mathbf{w}^\top \mathbf{x} + b \big) \]

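Without \(\sigma\), this reduces to the linear score \(\mathbf{w}^\top \mathbf{x} + b\): the prediction in linear regression, or the logit in logistic regression. With \(\sigma\) = sigmoid, the neuron is exactly logistic regression.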

Shapes: \(\mathbf{x}\in\mathbb{R}^d\), \(\mathbf{w}\in\mathbb{R}^d\), \(b\in\mathbb{R}\), \(\hat{y}\in\mathbb{R}\). Common \(\sigma\): ReLU, tanh, sigmoid.
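
A minimal NumPy sketch of the single-neuron formula, with numbers small enough to check by hand. The function name neuron and the example values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """y_hat = activation(w^T x + b) for one input vector x."""
    z = np.dot(w, x) + b
    return activation(z)

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.0

# z = 0.5*1 + (-0.25)*2 + 0 = 0, and sigmoid(0) = 0.5
y_hat = neuron(x, w, b)
print(y_hat)
```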

Neuron + Bias Intuition

  • Linear part: z = w · x + b
  • Activation: a = σ(z) adds nonlinearity
  • Bias b moves the decision threshold
  • Layers stack these simple units
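
One way to see the bias at work is the one-input case: the neuron switches where z = w·x + b crosses zero, which is at x = -b / w. The helper decision_threshold below is a hypothetical name used only for this sketch.

```python
def decision_threshold(w, b):
    """Input x at which z = w*x + b crosses zero (one-input case, w != 0)."""
    return -b / w

w = 2.0
for b in (-2.0, 1.0, 4.0):
    # Changing b slides the crossing point along the x axis; w is unchanged.
    print(f"b = {b:+.1f} -> boundary at x = {decision_threshold(w, b):+.2f}")
```

With a sigmoid, z = 0 is where the output crosses 0.5; with ReLU, it is where the unit turns on.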

Diagram: Single Neuron


digraph neuron {
  rankdir=LR;
  node [shape=circle, fontsize=14, width=1.0];
  edge [penwidth=1.5];

  x1 [label="x₁"];
  x2 [label="x₂"];
  x3 [label="x₃"];
  h  [label="σ(w·x + b)", width=1.5];

  x1 -> h;
  x2 -> h;
  x3 -> h;
}
          

Multiple inputs are combined into a single neuron that applies σ to \(w \cdot x + b\).

ReLU Activation

\(\mathrm{ReLU}(z) = \max(0, z)\) — default for hidden layers.

Figure: ReLU activation function.
  • Zero for negative inputs, linear for positive
  • Keeps computation simple and fast
  • Works well as a default for hidden layers
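
In NumPy, ReLU is a one-liner. This is a sketch for intuition; in PyTorch the same operation is provided by nn.ReLU.

```python
import numpy as np

def relu(z):
    """Elementwise max(0, z)."""
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
out = relu(z)   # negatives become 0, positives pass through unchanged
print(out)
```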

Sigmoid Activation

\(\sigma(z) = \frac{1}{1 + e^{-z}}\) — used for binary outputs.

Figure: sigmoid activation function.
  • Maps any real number to (0, 1)
  • Interpretable as probability for binary class
  • Used on the final neuron in this lesson
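
The sketch below evaluates the sigmoid at a few points and turns the probabilities into hard labels. The 0.5 cutoff is the usual convention and is assumed here, not prescribed by the lesson.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
p = sigmoid(z)                   # every value lies strictly between 0 and 1
labels = (p > 0.5).astype(int)   # hard class label from a 0.5 cutoff
print(p, labels)
```

Note the symmetry: sigmoid(-z) = 1 - sigmoid(z), so the outputs for -5 and 5 sum to 1.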

Layer (Vector Form)

Compute multiple neurons at once:

\[ \mathbf{a}^{(1)} = \sigma\big( \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)} \big) \]

\(\mathbf{W}^{(1)} \in \mathbb{R}^{m\times d}\) maps input to \(m\) hidden units.

Shapes: \(\mathbf{b}^{(1)}\in\mathbb{R}^m\), \(\mathbf{a}^{(1)}\in\mathbb{R}^m\). \(\sigma\) applies elementwise.
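
To check that a layer really is "multiple neurons at once", this sketch computes the layer as one matrix-vector product and again one neuron at a time. The sizes d = 3 and m = 4 and the random values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 4                        # input size and number of hidden units
x = rng.standard_normal(d)         # shape (d,)
W1 = rng.standard_normal((m, d))   # shape (m, d): one row of weights per neuron
b1 = rng.standard_normal(m)        # shape (m,): one bias per neuron

def relu(z):
    return np.maximum(0.0, z)

# Whole layer in one matrix-vector product.
a1 = relu(W1 @ x + b1)             # shape (m,)

# Same result, computed neuron by neuron.
a1_loop = np.array([relu(W1[j] @ x + b1[j]) for j in range(m)])

print(a1.shape, np.allclose(a1, a1_loop))
```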

Two-Layer Network

Compose layers to get flexible decision boundaries:

\[ \hat{y} = \sigma^{(2)}\!\big( \mathbf{W}^{(2)} \, \sigma^{(1)}( \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)} ) + \mathbf{b}^{(2)} \big) \]

Each layer: linear map + nonlinearity. Stacking learns features of features.

Shapes: hidden size \(m\), output size \(k\). \(\mathbf{W}^{(2)}\in\mathbb{R}^{k\times m}\), \(\mathbf{b}^{(2)}\in\mathbb{R}^k\), \(\hat{y}\in\mathbb{R}^k\). Typical in this lesson: \(\sigma^{(1)}=\)ReLU, \(\sigma^{(2)}=\)sigmoid.
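
The formula above treats \(\mathbf{x}\) as a single column vector, while code usually processes a batch with one sample per row. The sketch below, with assumed sizes n = 5, d = 2, m = 8, k = 1, shows the batched form: X @ W.T applies \(\mathbf{W}\mathbf{x}\) to every row. PyTorch's nn.Linear uses the same convention, storing its weight with shape (out_features, in_features).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, k = 5, 2, 8, 1                  # batch, input, hidden, output sizes

X = rng.standard_normal((n, d))          # one sample per row
W1, b1 = rng.standard_normal((m, d)), np.zeros(m)
W2, b2 = rng.standard_normal((k, m)), np.zeros(k)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A1 = relu(X @ W1.T + b1)                 # shape (n, m)
Y_hat = sigmoid(A1 @ W2.T + b2)          # shape (n, k)

# Row 0 of the batch matches the column-vector formula for that sample.
single = sigmoid(W2 @ relu(W1 @ X[0] + b1) + b2)
print(A1.shape, Y_hat.shape, np.allclose(Y_hat[0], single))
```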

Diagram: Two-Layer Network


digraph two_layer {
  rankdir=LR;
  node [shape=circle, fontsize=14, width=0.8];
  edge [penwidth=1.5];
  graph [nodesep=0.5, ranksep=1.0];

  subgraph cluster_input {
    label="Inputs";
    color="white";
    x1 [label="x₁"];
    x2 [label="x₂"];
  }

  subgraph cluster_hidden {
    label="Hidden layer (ReLU)";
    style=filled; fillcolor="#f5f5f5";
    h1 [label="h₁"];
    h2 [label="h₂"];
    h3 [label="h₃"];
  }

  subgraph cluster_output {
    label="Output";
    color="white";
    y [label="ŷ"];
  }

  x1 -> h1; x1 -> h2; x1 -> h3;
  x2 -> h1; x2 -> h2; x2 -> h3;
  h1 -> y;  h2 -> y;  h3 -> y;
}
          

Inputs feed into hidden units, which then feed into the output neuron.

Forward Pass — Step by Step

  1. Compute each layer’s z = W·a_prev + b (for the first layer, a_prev = x)
  2. Apply activation a = σ(z)
  3. Feed a into next layer
  4. Final output → prediction

We’ll worry about learning (backprop) next lesson.
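
The four steps can be sketched as a short loop over layers. The forward function and the tiny hand-picked weights are assumptions made for illustration, chosen so each intermediate value can be verified by hand.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """layers: list of (W, b, activation) tuples, applied in order."""
    a = x
    for W, b, activation in layers:
        z = W @ a + b        # step 1: linear part
        a = activation(z)    # step 2: nonlinearity; a feeds the next layer (step 3)
    return a                 # step 4: output of the final layer

layers = [
    (np.eye(2), np.zeros(2), relu),                      # z = [1, -1], a = [1, 0]
    (np.array([[1.0, 1.0]]), np.array([-1.0]), sigmoid)  # z = 1 + 0 - 1 = 0, a = 0.5
]
y_hat = forward(np.array([1.0, -1.0]), layers)
print(y_hat)
```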

Key Themes

  • Linear part learns weighted combinations; bias shifts thresholds
  • Nonlinearity lets boundaries bend (beyond any single line)
  • Depth composes simple units into complex patterns
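
Why the nonlinearity matters can be checked numerically: two stacked linear layers with no activation between them collapse into a single linear layer, so depth alone adds nothing. The weights below are hand-picked for illustration so the arithmetic is easy to follow.

```python
import numpy as np

x = np.array([1.0, -1.0])
W1, b1 = np.array([[2.0, 1.0], [1.0, 3.0]]), np.array([0.0, 1.0])
W2, b2 = np.array([[1.0, 1.0]]), np.array([0.5])

# Two layers with no activation in between: z1 = [1, -1], output = 0.5
stacked = W2 @ (W1 @ x + b1) + b2

# One merged linear layer gives the same output.
W, b = W2 @ W1, W2 @ b1 + b2
collapsed = W @ x + b

# With ReLU in between, the negative hidden value is zeroed: output = 1.5
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print(stacked, collapsed, with_relu)
```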

Interactive Demo (TF Playground)

  1. Select circular or spiral dataset
  2. No hidden layers → poor separation
  3. 1 hidden layer (8 units) → improvement
  4. 2 hidden layers (8 + 8 units) → complex shapes classified
  5. Tune activations, learning rate, noise

Discuss: Why deeper → better patterns? What might each neuron learn?

Open TensorFlow Playground

Mini Hands-On

Implement the idea in PyTorch on synthetic 2D points.

import torch
import torch.nn as nn

model = nn.Sequential(
  nn.Linear(2, 8),
  nn.ReLU(),
  nn.Linear(8, 8),
  nn.ReLU(),
  nn.Linear(8, 1),
  nn.Sigmoid(),
)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss_fn = nn.BCELoss()
# training loop ...
          

Plot decision regions; relate to Playground behavior.

Code: Make a 2D Dataset

# Option A: use scikit-learn (concise)
from sklearn.datasets import make_circles
import numpy as np

X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1, random_state=0)
X = X.astype('float32')
y = y.astype('float32')

# Option B: quick NumPy spiral (for the curious)
def make_spiral(n=500, noise=0.2):
    n2 = n//2
    t = np.linspace(0, 2*np.pi, n2)
    r = np.linspace(0.2, 1.0, n2)
    x1 = np.c_[r*np.cos(t), r*np.sin(t)] + noise*np.random.randn(n2,2)
    x2 = np.c_[-r*np.cos(t), -r*np.sin(t)] + noise*np.random.randn(n2,2)
    Xs = np.vstack([x1, x2]).astype('float32')
    ys = np.r_[np.zeros(n2), np.ones(n2)].astype('float32')
    return Xs, ys
          

Code: Train and Evaluate (PyTorch)

import torch
import torch.nn as nn

X_t = torch.from_numpy(X)
y_t = torch.from_numpy(y).unsqueeze(1)

model = nn.Sequential(
    nn.Linear(2, 8),
    nn.ReLU(),
    nn.Linear(8, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for epoch in range(20):
    optimizer.zero_grad()
    preds = model(X_t)
    loss = loss_fn(preds, y_t)
    loss.backward()
    optimizer.step()

print('Final training loss:', float(loss))
          

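Expect the training loss to decrease over epochs. Twenty full-batch steps at this learning rate is only a start, so raise the epoch count to see the boundary take shape. This loop has no validation split; details next lesson.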
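
The loop above reports only the loss. A minimal sketch of an accuracy check follows; binary_accuracy is a hypothetical helper, and the probabilities are made-up stand-ins for model outputs.

```python
import numpy as np

def binary_accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the 0/1 label."""
    probs = np.asarray(probs).reshape(-1)
    labels = np.asarray(labels).reshape(-1)
    return float(np.mean((probs > threshold) == (labels > 0.5)))

probs = np.array([0.9, 0.2, 0.6, 0.4])    # stand-ins for sigmoid outputs
labels = np.array([1.0, 0.0, 0.0, 1.0])
acc = binary_accuracy(probs, labels)      # first two correct, last two wrong
print(acc)
```

With the PyTorch model above, something like binary_accuracy(preds.detach().numpy(), y) would give the training accuracy. A validation figure would need data held out from training.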

Code: Plot Decision Regions

import numpy as np, matplotlib.pyplot as plt
import torch

xx, yy = np.meshgrid(np.linspace(X[:,0].min()-0.5, X[:,0].max()+0.5, 200),
                     np.linspace(X[:,1].min()-0.5, X[:,1].max()+0.5, 200))
grid = np.c_[xx.ravel(), yy.ravel()].astype('float32')
with torch.no_grad():
    grid_t = torch.from_numpy(grid)
    probs = model(grid_t).detach().numpy().reshape(xx.shape)

plt.figure(figsize=(5,4))
plt.contourf(xx, yy, probs, levels=20, cmap='RdBu', alpha=0.6)
plt.scatter(X[:,0], X[:,1], c=y, cmap='RdBu', edgecolor='k', s=12)
plt.title('Decision regions')
plt.show()
          

Relate the learned boundary to the Playground visuals.

Wrap-Up

  • Neuron = regression + nonlinearity on top
  • Layers stack neurons to learn features of features
  • Depth + activations bend decision boundaries into complex shapes

Homework:

  • In the spirals notebook, recreate one of the Playground experiments (spirals or circles).
  • Try at least two different architectures.
  • Find the deepest network (most layers) that uses the fewest units but still fits the data.
  • Find the shallowest network (fewest layers) that uses the fewest units but still fits the data.
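
When comparing architectures for the homework, it may help to tally their size. The two helpers below are hypothetical and assume fully connected layers with one bias per unit, described by a list such as [2, 8, 8, 1] (inputs, hidden sizes, outputs).

```python
def count_hidden_units(layer_sizes):
    """Total units across hidden layers (everything except input and output)."""
    return sum(layer_sizes[1:-1])

def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

for sizes in ([2, 8, 8, 1], [2, 16, 1]):
    print(sizes, count_hidden_units(sizes), count_parameters(sizes))
# [2, 8, 8, 1]: 16 hidden units, (2*8+8) + (8*8+8) + (8*1+1) = 105 parameters
# [2, 16, 1]:   16 hidden units, (2*16+16) + (16*1+1) = 65 parameters
```

The two example networks use the same number of hidden units but differ in depth and parameter count, which is the trade-off the last two homework items ask about.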

Next time: how networks actually learn these weights using loss functions and gradient descent.