Lesson 6: NLP — From Words to Embeddings

Learning Objectives

  • Understand how raw text becomes tokens and integer ids.
  • Explain word embeddings and the idea that similar context → similar vectors.
  • Use nn.Embedding in PyTorch to map tokens to dense vectors.
  • Outline a simple sentiment classifier built on embeddings.

Why NLP Feels Different from Vision

  • Images: fixed-size grids of pixels (e.g., 28×28, 64×64), naturally numeric.
  • Text: variable-length sequences of discrete symbols (characters, words, subwords).
  • Goal: turn text into numbers while preserving meaning and order.

digraph nlp_vs_vision {
  rankdir=LR;
  node [fontsize=11, shape=box, style=rounded];

  img   [label="Image\n(64x64x3 pixels)"];
  txt   [label="Text\n(\"I really liked this movie\")"];
  cnn   [label="CNN / VAE"];
  nlp   [label="Embedding + NLP model"];

  img -> cnn;
  txt -> nlp;
}
          

Text Processing Pipeline

High-level steps to go from raw text to model-ready tensors.


digraph nlp_pipeline {
  rankdir=LR;
  node [fontsize=12, shape=box, style=rounded, height=0.7];
  edge [penwidth=1.5];
  graph [nodesep=0.5, ranksep=0.8];

  text   [label="Raw text\n\"I loved the movie!\""];
  tokens [label="Tokenization\n['I', 'loved', 'the', 'movie']", style="filled,rounded", fillcolor="#e3f2fd"];
  ids    [label="Vocabulary lookup\n[12, 57, 3, 98]", style="filled,rounded", fillcolor="#fff3e0"];
  emb    [label="Embedding layer\nlookup vectors", style="filled,rounded", fillcolor="#e8f5e9"];
  model  [label="NLP model\nclassifier / LM"];

  text -> tokens -> ids -> emb -> model;
}
          
  • Today: focus on tokens, vocabulary, and embeddings.
  • Next lesson: sequence models (LSTM, Transformers) on top of embeddings.

Tokenization: Breaking Text into Pieces

Tokenization splits text into units that the model will see.

Common Choices

  • Whitespace / word-level: simple, language-dependent.
  • Character-level: smallest units, robust but long sequences.
  • Subword (BPE, WordPiece): balance between words and chars.

Example (word-level)

text = "I really liked this movie!"
tokens = ["I", "really", "liked", "this", "movie", "!"]
  • Punctuation can be kept as separate tokens.
  • Case folding (lowercasing) is a design choice.
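The word-level choices above can be sketched as a toy tokenizer (a hypothetical `tokenize` helper using a simple regex, not a production tokenizer):

```python
import re

def tokenize(text, lowercase=True):
    """Toy word-level tokenizer: optionally lowercase, then split
    into runs of word characters or single punctuation marks."""
    if lowercase:
        text = text.lower()
    # \w+ matches word characters; [^\w\s] matches lone punctuation
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I really liked this movie!"))
# ['i', 'really', 'liked', 'this', 'movie', '!']
```

Note that punctuation comes out as separate tokens and lowercasing is a flag, mirroring the design choices listed above.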

Vocabulary & Integer Encoding

Models operate on integers, not strings. We build a vocabulary mapping each token to an id.

tokens = ["i", "really", "liked", "this", "movie", "!"]

vocab = {
    "<pad>": 0,
    "<unk>": 1,
    "i": 2,
    "really": 3,
    "liked": 4,
    "this": 5,
    "movie": 6,
    "!": 7,
}

ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
# ids: [2, 3, 4, 5, 6, 7]
  • <unk> handles out-of-vocabulary words at inference time.
  • <pad> is used later when batching sequences of different lengths.
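A vocabulary like the one above can be built from token counts. A minimal sketch with a hypothetical `build_vocab` helper (the `min_freq` cutoff is a common trick to drop rare tokens):

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=1):
    """Build a token -> id mapping; ids 0 and 1 are reserved
    for the padding and unknown tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

vocab = build_vocab([["i", "liked", "this", "movie"],
                     ["i", "liked", "it"]])
ids = [vocab.get(t, vocab["<unk>"]) for t in ["i", "loved", "it"]]
# "loved" never appeared in training data, so it maps to <unk> (id 1)
```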

From One-Hot to Embeddings

One-Hot Representation

  • Vector of size \(|V|\) (vocabulary size).
  • Exactly one position is 1, others 0.
  • Very sparse and high-dimensional.
  • Does not capture similarity between words.

Embedding Vectors

  • Dense vectors of size \(d\) (e.g., 64, 128).
  • Learned from data together with the model.
  • Similar words → similar vectors (in practice).
  • Basis of Word2Vec, GloVe, BERT, GPT, etc.

Embedding Math (Single Token)

Let \(|V|\) be vocabulary size and \(d\) embedding dimension.

  • Embedding matrix: \(E \in \mathbb{R}^{|V| \times d}\).
  • One-hot vector for token \(i\): \(\mathbf{e}_i \in \mathbb{R}^{|V|}\) with 1 at position \(i\), 0 elsewhere.
  • Embedding lookup is just matrix multiplication: \[ \mathbf{x}_i = \mathbf{e}_i^\top E \in \mathbb{R}^d. \]
  • In code, we skip building \(\mathbf{e}_i\) and directly index row \(i\) of \(E\).
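The equivalence between the one-hot product and direct row indexing can be checked numerically. A small sketch with a random matrix standing in for a trained \(E\):

```python
import torch

V, d = 5, 3
E = torch.randn(V, d)        # embedding matrix (random stand-in)

i = 2
e_i = torch.zeros(V)
e_i[i] = 1.0                 # one-hot vector for token i

x_matmul = e_i @ E           # one-hot times E: shape (d,)
x_index = E[i]               # direct lookup of row i

# x_matmul and x_index are the same vector
```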

Embedding Table Example

Toy vocabulary with \(|V| = 4\) and \(d = 2\).

Token      Id   One-hot \(\mathbf{e}_i\)   Embedding \(\mathbf{x}_i\)
good       0    \([1, 0, 0, 0]\)           \([0.8,\ 0.6]\)
great      1    \([0, 1, 0, 0]\)           \([0.9,\ 0.7]\)
bad        2    \([0, 0, 1, 0]\)           \([-0.7,\ -0.6]\)
terrible   3    \([0, 0, 0, 1]\)           \([-0.9,\ -0.8]\)

Vectors for “good” and “great” are close; “bad” and “terrible” are close but far from the positives.
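The closeness claim can be checked with cosine similarity on the toy vectors (a quick sketch; `cosine` is a hypothetical helper, not a library function):

```python
import math

def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

good, great = [0.8, 0.6], [0.9, 0.7]
bad = [-0.7, -0.6]

print(cosine(good, great))  # close to +1: similar direction
print(cosine(good, bad))    # close to -1: opposite direction
```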

PyTorch nn.Embedding

The embedding layer is just a learnable matrix of shape \((|V|, d)\).

import torch
import torch.nn as nn

vocab_size = len(vocab)
embedding_dim = 64

emb = nn.Embedding(num_embeddings=vocab_size,
                   embedding_dim=embedding_dim)

ids = torch.tensor([[2, 3, 4, 5, 6, 7]])  # shape: (batch=1, seq_len=6)
embedded = emb(ids)  # shape: (1, 6, 64)
  • Each token id picks one row from the embedding matrix.
  • Gradients update the embeddings during training, just like CNN weights.

Distributional Semantics & Word2Vec Intuition

Core idea: “You shall know a word by the company it keeps.” (Firth, 1957)

  • Words appearing in similar contexts should have similar embeddings.
  • Word2Vec trains a small network to predict a word from its neighbors (CBOW) or neighbors from a word (Skip-gram).
  • The hidden layer weights become useful word vectors.
# Pseudocode: one skip-gram training example
center = "movie"
context = ["great", "funny", "exciting"]

# Objective: the embedding of "movie" should be
# good at predicting each of its context words.

CBOW: Predict Center Word from Context

  • Input: context words around a missing center word.
  • Look up embeddings, then average (or sum) them into a single context vector.
  • Linear layer + softmax predicts the center word id.
  • Training objective: maximize \(p(w_{\text{center}} \mid \text{context})\).

digraph cbow {
  rankdir=LR;
  node [fontsize=11, shape=box, style=rounded];

  c1   [label="Context word\n'I'"];
  c2   [label="Context word\n'the'"];
  c3   [label="Context word\n'movie'"];
  emb  [label="Embedding\nlookup"];
  avg  [label="Average\ncontext vectors"];
  out  [label="Linear + Softmax\np(center | context)"];
  ctr  [label="Predicted\ncenter word\n'loved'"];

  {c1 c2 c3} -> emb -> avg -> out -> ctr;
}
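The CBOW forward pass above can be sketched as a tiny PyTorch module (a minimal illustration with made-up sizes; real Word2Vec training also uses tricks like negative sampling):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, window) ids of surrounding words
        ctx = self.embedding(context_ids).mean(dim=1)  # average context vectors
        return self.out(ctx)                           # logits over the vocabulary

model = CBOW(vocab_size=100)
logits = model(torch.tensor([[2, 5, 6]]))  # e.g. context 'i', 'the', 'movie'
# train with nn.CrossEntropyLoss against the center word id
```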
              

Skip-Gram: Predict Context from Center Word

  • Input: single center word.
  • Output: each context word in a sliding window.
  • Training objective: \[ \sum_{c \in \text{context}} \log p(c \mid w_{\text{center}}). \]
  • Tends to learn good embeddings even for rare words.

digraph skipgram {
  rankdir=LR;
  node [fontsize=11, shape=box, style=rounded];

  ctr   [label="Center word\n'movie'"];
  emb   [label="Embedding\nlookup"];
  out   [label="Linear + Softmax\np(context | center)"];
  ctx1  [label="Context\n'great'"];
  ctx2  [label="Context\n'funny'"];
  ctx3  [label="Context\n'exciting'"];

  ctr -> emb -> out;
  out -> {ctx1 ctx2 ctx3};
}
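Skip-gram reverses the direction: one center word scores every vocabulary word as a potential context word. A minimal sketch (again omitting negative sampling; sizes are illustrative):

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, emb_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, center_ids):
        # center_ids: (batch,) one center word per training pair
        x = self.embedding(center_ids)   # (batch, emb_dim)
        return self.out(x)               # logits over possible context words

model = SkipGram(vocab_size=100)
logits = model(torch.tensor([6]))  # e.g. center word 'movie'
# one (center, context) pair per step: cross-entropy against the context id
```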
              

3D PCA Visualization of Word2Vec

We can visualize embeddings by projecting high-dimensional vectors (e.g., 300-D Word2Vec) down to 3D with PCA.

  • Tool: TensorFlow Embedding Projector.
  • Preset dataset: Word2Vec 10K (10,000 common English words).
  • Projection: choose PCA and set the view to 3D.
  • Explore: search for words like “king”, “queen”, “man”, “woman” and see clusters and directions.

PCA finds directions of maximum variance; 3D PCA gives an intuitive, interactive view of the global structure of the embedding space.
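The same projection can be done offline. A sketch that centers the vectors and keeps the top three principal components via SVD (random vectors stand in for trained embeddings):

```python
import torch

emb = torch.randn(1000, 300)          # stand-in for 300-D word vectors
centered = emb - emb.mean(dim=0)      # PCA requires mean-centered data

# SVD of the centered matrix; rows of Vh are the principal directions
U, S, Vh = torch.linalg.svd(centered, full_matrices=False)
proj3d = centered @ Vh[:3].T          # project onto the top-3 components

print(proj3d.shape)  # torch.Size([1000, 3])
```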

Simple Sentiment Classifier Architecture

  • Input: token ids for a review (e.g., IMDB sentence).
  • Embedding layer maps each token to a vector.
  • Pool over time (e.g., mean) to get a single sentence vector.
  • Linear layer maps to sentiment score (positive / negative).
class SimpleSentimentModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.fc = nn.Linear(emb_dim, 1)  # binary

    def forward(self, ids):
        # ids: (batch, seq_len)
        x = self.embedding(ids)          # (batch, seq_len, emb_dim)
        x = x.mean(dim=1)                # (batch, emb_dim)
        logits = self.fc(x)              # (batch, 1)
        return logits.squeeze(-1)
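A minimal training step for this model might look like the following sketch (the model class is repeated for completeness; the batch, labels, and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class SimpleSentimentModel(nn.Module):  # same model as above
    def __init__(self, vocab_size, emb_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.fc = nn.Linear(emb_dim, 1)

    def forward(self, ids):
        return self.fc(self.embedding(ids).mean(dim=1)).squeeze(-1)

model = SimpleSentimentModel(vocab_size=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()      # sigmoid + binary cross-entropy

batch_ids = torch.tensor([[2, 3, 4, 5, 6, 7],
                          [2, 4, 6, 0, 0, 0]])  # 0 = <pad>
labels = torch.tensor([1.0, 0.0])               # positive, negative

logits = model(batch_ids)             # shape: (batch,)
loss = loss_fn(logits, labels)
loss.backward()                       # gradients reach the embedding rows used
optimizer.step()
```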

Handling Variable-Length Sequences

  • Real sentences have different lengths → we need batching.
  • Common approach: pad shorter sequences with <pad> up to a fixed length.
  • Later (LSTMs/Transformers) we will also use attention masks to ignore padding.
max_len = 10

def pad(ids, max_len, pad_id=0):
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))

batch_ids = [
    pad([2, 3, 4, 5, 6, 7], max_len),
    pad([2, 4, 6], max_len),
]
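One caveat of plain mean pooling: the <pad> vectors get averaged in too. A sketch of a masked mean that only averages over real tokens (`masked_mean` is a hypothetical helper; the pad id is assumed to be 0):

```python
import torch

def masked_mean(embedded, ids, pad_id=0):
    """Average token vectors over non-pad positions only."""
    mask = (ids != pad_id).unsqueeze(-1).float()   # (batch, seq, 1)
    summed = (embedded * mask).sum(dim=1)          # zero out pad vectors
    counts = mask.sum(dim=1).clamp(min=1.0)        # avoid division by zero
    return summed / counts                         # (batch, emb_dim)

ids = torch.tensor([[2, 4, 6, 0, 0]])              # two trailing pads
embedded = torch.ones(1, 5, 3)                     # toy embeddings
pooled = masked_mean(embedded, ids)
# pooled averages only the 3 real token vectors
```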

Homework / Hands-On Ideas

  • Load a small sentiment dataset (e.g., movie reviews with positive/negative labels).
  • Implement a simple tokenizer and vocabulary builder.
  • Use nn.Embedding + mean pooling + linear classifier to predict sentiment.
  • Inspect a few learned word vectors (e.g., cosine similarity between “great”, “good”, “terrible”).
  • Reflect: compared to image models, what feels similar vs different about the pipeline?

Looking Ahead: Sequence Models & Transformers

  • Recurrent networks (LSTMs) process sequences step by step, keeping a hidden state.
  • Attention and Transformers let models look at all positions in parallel.
  • Pretrained models (BERT, GPT) start from large-scale text and fine-tune to tasks.
  • All of them start from embeddings like the ones we built today.