Lesson 6: NLP — From Words to Embeddings

Learning Objectives

  • Understand how raw text becomes tokens and integer ids.
  • Explain word embeddings and the idea that similar context → similar vectors.
  • Use nn.Embedding in PyTorch to map tokens to dense vectors.
  • Outline a simple sentiment classifier built on embeddings.

Why NLP Feels Different from Vision

  • Images: fixed-size grids of pixels (e.g., 28×28, 64×64), naturally numeric.
  • Text: variable-length sequences of discrete symbols (characters, words, subwords).
  • Goal: turn text into numbers while preserving meaning and order.

digraph nlp_vs_vision {
  rankdir=LR;
  node [fontsize=11, shape=box, style=rounded];

  img   [label="Image\n(64x64x3 pixels)"];
  txt   [label="Text\n(\"I really liked this movie\")"];
  cnn   [label="CNN / VAE"];
  nlp   [label="Embedding + NLP model"];

  img -> cnn;
  txt -> nlp;
}
          

Text Processing Pipeline

High-level steps to go from raw text to model-ready tensors.


digraph nlp_pipeline {
  rankdir=LR;
  node [fontsize=12, shape=box, style=rounded, height=0.7];
  edge [penwidth=1.5];
  graph [nodesep=0.5, ranksep=0.8];

  text   [label="Raw text\n\"I loved the movie!\""];
  tokens [label="Tokenization\n['I', 'loved', 'the', 'movie']", style="filled,rounded", fillcolor="#e3f2fd"];
  ids    [label="Vocabulary lookup\n[12, 57, 3, 98]", style="filled,rounded", fillcolor="#fff3e0"];
  emb    [label="Embedding layer\nlookup vectors", style="filled,rounded", fillcolor="#e8f5e9"];
  model  [label="NLP model\nclassifier / LM"];

  text -> tokens -> ids -> emb -> model;
}
          
  • Today: focus on tokens, vocabulary, and embeddings.
  • Next lesson: sequence models (LSTM, Transformers) on top of embeddings.

Tokenization: Breaking Text into Pieces

Tokenization splits text into units that the model will see.

Common Choices

  • Whitespace / word-level: simple, language-dependent.
  • Character-level: smallest units, robust but long sequences.
  • Subword (BPE, WordPiece): balance between words and chars.

Example (word-level)

text = "I really liked this movie!"
tokens = ["I", "really", "liked", "this", "movie", "!"]
  • Punctuation can be kept as separate tokens.
  • Case folding (lowercasing) is a design choice.
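The word-level choices above can be sketched as a toy tokenizer (a hypothetical `tokenize` helper using a simple regex, not a production tokenizer):

```python
import re

def tokenize(text, lowercase=True):
    """Toy word-level tokenizer: optionally lowercase, then split
    into runs of word characters or single punctuation marks."""
    if lowercase:
        text = text.lower()
    # \w+ matches word characters; [^\w\s] matches lone punctuation
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I really liked this movie!"))
# ['i', 'really', 'liked', 'this', 'movie', '!']
```

Note that punctuation comes out as separate tokens and lowercasing is a flag, mirroring the design choices listed above.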

Vocabulary & Integer Encoding

Models operate on integers, not strings. We build a vocabulary mapping each token to an id.

tokens = ["i", "really", "liked", "this", "movie", "!"]

vocab = {
    "<pad>": 0,
    "<unk>": 1,
    "i": 2,
    "really": 3,
    "liked": 4,
    "this": 5,
    "movie": 6,
    "!": 7,
}

ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
# ids: [2, 3, 4, 5, 6, 7]
  • <unk> handles out-of-vocabulary words at inference time.
  • <pad> is used later when batching sequences of different lengths.
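A vocabulary like the one above can be built from token counts. A minimal sketch with a hypothetical `build_vocab` helper (the `min_freq` cutoff is a common trick to drop rare tokens):

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=1):
    """Build a token -> id mapping; ids 0 and 1 are reserved
    for the padding and unknown tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

vocab = build_vocab([["i", "liked", "this", "movie"],
                     ["i", "liked", "it"]])
ids = [vocab.get(t, vocab["<unk>"]) for t in ["i", "loved", "it"]]
# "loved" never appeared in training data, so it maps to <unk> (id 1)
```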

From One-Hot to Embeddings

One-Hot Representation

  • Vector of size \(|V|\) (vocabulary size).
  • Exactly one position is 1, others 0.
  • Very sparse and high-dimensional.
  • Does not capture similarity between words.

Embedding Vectors

  • Dense vectors of size \(d\) (e.g., 64, 128).
  • Learned from data together with the model.
  • Similar words → similar vectors (in practice).
  • Basis of Word2Vec, GloVe, BERT, GPT, etc.

Embedding Math (Single Token)

Let \(|V|\) be vocabulary size and \(d\) embedding dimension.

  • Embedding matrix: \(E \in \mathbb{R}^{|V| \times d}\).
  • One-hot vector for token \(i\): \(\mathbf{e}_i \in \mathbb{R}^{|V|}\) with 1 at position \(i\), 0 elsewhere.
  • Embedding lookup is just matrix multiplication: \[ \mathbf{x}_i = \mathbf{e}_i^\top E \in \mathbb{R}^d. \]
  • In code, we skip building \(\mathbf{e}_i\) and directly index row \(i\) of \(E\).
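The equivalence between the one-hot product and direct row indexing can be checked numerically. A small sketch with a random matrix standing in for a trained \(E\):

```python
import torch

V, d = 5, 3
E = torch.randn(V, d)        # embedding matrix (random stand-in)

i = 2
e_i = torch.zeros(V)
e_i[i] = 1.0                 # one-hot vector for token i

x_matmul = e_i @ E           # one-hot times E: shape (d,)
x_index = E[i]               # direct lookup of row i

# x_matmul and x_index are the same vector
```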

Embedding Table Example

Toy vocabulary with \(|V| = 4\) and \(d = 2\).

Token      Id   One-hot \(\mathbf{e}_i\)   Embedding \(\mathbf{x}_i\)
good       0    \([1, 0, 0, 0]\)           \([0.8,\ 0.6]\)
great      1    \([0, 1, 0, 0]\)           \([0.9,\ 0.7]\)
bad        2    \([0, 0, 1, 0]\)           \([-0.7,\ -0.6]\)
terrible   3    \([0, 0, 0, 1]\)           \([-0.9,\ -0.8]\)

Vectors for “good” and “great” are close; “bad” and “terrible” are close but far from the positives.
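The closeness claim can be checked with cosine similarity on the toy vectors (a quick sketch; `cosine` is a hypothetical helper, not a library function):

```python
import math

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

good, great = [0.8, 0.6], [0.9, 0.7]
bad = [-0.7, -0.6]

print(cosine(good, great))  # close to +1: similar direction
print(cosine(good, bad))    # close to -1: opposite direction
```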

PyTorch nn.Embedding

The embedding layer is just a learnable matrix of shape \((|V|, d)\).

import torch
import torch.nn as nn

vocab_size = len(vocab)
embedding_dim = 64

emb = nn.Embedding(num_embeddings=vocab_size,
                   embedding_dim=embedding_dim)

ids = torch.tensor([[2, 3, 4, 5, 6, 7]])  # shape: (batch=1, seq_len=6)
embedded = emb(ids)  # shape: (1, 6, 64)
  • Each token id picks one row from the embedding matrix.
  • Gradients update the embeddings during training, just like CNN weights.

Distributional Semantics & Word2Vec Intuition

Core idea: “You shall know a word by the company it keeps.” (Firth, 1957)

  • Words appearing in similar contexts should have similar embeddings.
  • Word2Vec trains a small network to predict a word from its neighbors (CBOW) or neighbors from a word (Skip-gram).
  • The hidden layer weights become useful word vectors.
# Pseudocode: one skip-gram training example
center = "movie"
context = ["great", "funny", "exciting"]

# Objective: the embedding of "movie" should be
# good at predicting each of its context words.

CBOW: Predict Center Word from Context

  • Input: context words around a missing center word.
  • Look up embeddings, then average (or sum) them into a single context vector.
  • Linear layer + softmax predicts the center word id.
  • Training objective: maximize \(p(w_{\text{center}} \mid \text{context})\).

digraph cbow {
  rankdir=LR;
  node [fontsize=11, shape=box, style=rounded];

  c1   [label="Context word\n'I'"];
  c2   [label="Context word\n'the'"];
  c3   [label="Context word\n'movie'"];
  emb  [label="Embedding\nlookup"];
  avg  [label="Average\ncontext vectors"];
  out  [label="Linear + Softmax\np(center | context)"];
  ctr  [label="Predicted\ncenter word\n'loved'"];

  {c1 c2 c3} -> emb -> avg -> out -> ctr;
}
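The CBOW forward pass above can be sketched as a tiny PyTorch module (a minimal illustration with made-up sizes; real Word2Vec training also uses tricks like negative sampling):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, window) ids of surrounding words
        ctx = self.embedding(context_ids).mean(dim=1)  # average context vectors
        return self.out(ctx)                           # logits over the vocabulary

model = CBOW(vocab_size=100)
logits = model(torch.tensor([[2, 5, 6]]))  # e.g. context 'i', 'the', 'movie'
# train with nn.CrossEntropyLoss against the center word id
```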
              

Skip-Gram: Predict Context from Center Word

  • Input: single center word.
  • Output: each context word in a sliding window.
  • Training objective: \[ \sum_{c \in \text{context}} \log p(c \mid w_{\text{center}}). \]
  • Tends to learn good embeddings even for rare words.

digraph skipgram {
  rankdir=LR;
  node [fontsize=11, shape=box, style=rounded];

  ctr   [label="Center word\n'movie'"];
  emb   [label="Embedding\nlookup"];
  out   [label="Linear + Softmax\np(context | center)"];
  ctx1  [label="Context\n'great'"];
  ctx2  [label="Context\n'funny'"];
  ctx3  [label="Context\n'exciting'"];

  ctr -> emb -> out;
  out -> {ctx1 ctx2 ctx3};
}
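Skip-gram reverses the direction: one center word scores every vocabulary word as a potential context word. A minimal sketch (again omitting negative sampling; sizes are illustrative):

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, emb_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, center_ids):
        # center_ids: (batch,) one center word per training pair
        x = self.embedding(center_ids)   # (batch, emb_dim)
        return self.out(x)               # logits over possible context words

model = SkipGram(vocab_size=100)
logits = model(torch.tensor([6]))  # e.g. center word 'movie'
# one (center, context) pair per step: cross-entropy against the context id
```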
              

3D PCA Visualization of Word2Vec

We can visualize embeddings by projecting high-dimensional vectors (e.g., 300-D Word2Vec) down to 3D with PCA.

  • Tool: TensorFlow Embedding Projector.
  • Preset dataset: Word2Vec 10K (10,000 common English words).
  • Projection: choose PCA and set the view to 3D.
  • Explore: search for words like “king”, “queen”, “man”, “woman” and see clusters and directions.

PCA finds directions of maximum variance; 3D PCA gives an intuitive, interactive view of the global structure of the embedding space.
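The same projection can be done offline. A sketch that centers the vectors and keeps the top three principal components via SVD (random vectors stand in for trained embeddings):

```python
import torch

emb = torch.randn(1000, 300)          # stand-in for 300-D word vectors
centered = emb - emb.mean(dim=0)      # PCA requires mean-centered data

# SVD of the centered matrix; rows of Vh are the principal directions
U, S, Vh = torch.linalg.svd(centered, full_matrices=False)
proj3d = centered @ Vh[:3].T          # project onto the top-3 components

print(proj3d.shape)  # torch.Size([1000, 3])
```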

Simple Sentiment Classifier Architecture

  • Input: token ids for a review (e.g., IMDB sentence).
  • Embedding layer maps each token to a vector.
  • Pool over time (e.g., mean) to get a single sentence vector.
  • Linear layer maps to sentiment score (positive / negative).
class SimpleSentimentModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.fc = nn.Linear(emb_dim, 1)  # binary

    def forward(self, ids):
        # ids: (batch, seq_len)
        x = self.embedding(ids)          # (batch, seq_len, emb_dim)
        x = x.mean(dim=1)                # (batch, emb_dim)
        logits = self.fc(x)              # (batch, 1)
        return logits.squeeze(-1)
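A minimal training step for this model might look like the following sketch (the model class is repeated for completeness; the batch, labels, and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class SimpleSentimentModel(nn.Module):  # same model as above
    def __init__(self, vocab_size, emb_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.fc = nn.Linear(emb_dim, 1)

    def forward(self, ids):
        return self.fc(self.embedding(ids).mean(dim=1)).squeeze(-1)

model = SimpleSentimentModel(vocab_size=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()      # sigmoid + binary cross-entropy

batch_ids = torch.tensor([[2, 3, 4, 5, 6, 7],
                          [2, 4, 6, 0, 0, 0]])  # 0 = <pad>
labels = torch.tensor([1.0, 0.0])               # positive, negative

logits = model(batch_ids)             # shape: (batch,)
loss = loss_fn(logits, labels)
loss.backward()                       # gradients reach the embedding rows used
optimizer.step()
```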

Handling Variable-Length Sequences

  • Real sentences have different lengths → we need batching.
  • Common approach: pad shorter sequences with <pad> up to a fixed length.
  • Later (LSTMs/Transformers) we will also use attention masks to ignore padding.
max_len = 10

def pad(ids, max_len, pad_id=0):
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))

batch_ids = [
    pad([2, 3, 4, 5, 6, 7], max_len),
    pad([2, 4, 6], max_len),
]
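One caveat of plain mean pooling: the <pad> vectors get averaged in too. A sketch of a masked mean that only averages over real tokens (`masked_mean` is a hypothetical helper; the pad id is assumed to be 0):

```python
import torch

def masked_mean(embedded, ids, pad_id=0):
    """Average token vectors over non-pad positions only."""
    mask = (ids != pad_id).unsqueeze(-1).float()   # (batch, seq, 1)
    summed = (embedded * mask).sum(dim=1)          # zero out pad vectors
    counts = mask.sum(dim=1).clamp(min=1.0)        # avoid division by zero
    return summed / counts                         # (batch, emb_dim)

ids = torch.tensor([[2, 4, 6, 0, 0]])              # two trailing pads
embedded = torch.ones(1, 5, 3)                     # toy embeddings
pooled = masked_mean(embedded, ids)
# pooled averages only the 3 real token vectors
```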

Homework / Hands-On Ideas

  • Load a small sentiment dataset (e.g., movie reviews with positive/negative labels).
  • Implement a simple tokenizer and vocabulary builder.
  • Use nn.Embedding + mean pooling + linear classifier to predict sentiment.
  • Inspect a few learned word vectors (e.g., cosine similarity between “great”, “good”, “terrible”).
  • Reflect: compared to image models, what feels similar vs different about the pipeline?

Looking Ahead: Sequence Models & Transformers

  • Recurrent networks (LSTMs) process sequences step by step, keeping a hidden state.
  • Attention and Transformers let models look at all positions in parallel.
  • Pretrained models (BERT, GPT) start from large-scale text and fine-tune to tasks.
  • All of them start from embeddings like the ones we built today.