Teaching machines to understand both images and language.
digraph vision_language {
rankdir=LR;
node [fontsize=11, shape=box, style=rounded];
img [label="🖼️ Image\n(pixels)", shape=box];
vision [label="Visual\nUnderstanding"];
text [label="📝 Text\n(tokens)", shape=box];
lang [label="Language\nGeneration"];
q [label="???"];
img -> vision;
vision -> q;
text -> lang;
lang -> q;
}
Images and text live in completely different spaces:
Key insight: We need a way to map both modalities into a shared representation space.
Task-specific models (2012–2019)
The dominant paradigm before CLIP: CNN encoder + RNN decoder.
digraph captioning {
rankdir=LR;
node [fontsize=14, shape=box, style=rounded, width=1.5, height=0.8];
edge [penwidth=2];
graph [pad="0.5", nodesep="0.8", ranksep="1.2"];
img [label="Image\n224×224×3"];
cnn [label="CNN\n(ResNet/VGG)"];
feat [label="Feature Vector\n2048-d"];
rnn [label="RNN/LSTM\nDecoder"];
cap [label="Caption\n\"A cat sitting\non a mat\""];
img -> cnn -> feat -> rnn -> cap;
}
Show and Tell (Vinyals et al., 2015), Google's influential image captioning model.
# Simplified PyTorch-style pseudocode
class ShowAndTell(nn.Module):
    def __init__(self, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.cnn = models.resnet50(pretrained=True)
        self.cnn.fc = nn.Linear(2048, embed_dim)  # project to embed space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image, captions):
        # Encode image → single vector, fed to the LSTM as its first "token"
        img_features = self.cnn(image).unsqueeze(1)  # (batch, 1, embed_dim)
        # Decode caption word by word
        embeddings = self.embed(captions)  # (batch, seq_len, embed_dim)
        inputs = torch.cat([img_features, embeddings], dim=1)
        lstm_out, _ = self.lstm(inputs)
        return self.fc_out(lstm_out)
The CNN (GoogLeNet in the original paper; VGG and ResNet in later variants) extracts a fixed-size feature vector from the image, which is fed to the LSTM as its first input. The LSTM then generates the caption word by word.
Added visual attention: the decoder can focus on different image regions.
digraph attend_tell {
rankdir=LR;
node [fontsize=11, shape=box, style=rounded];
img [label="Image"];
cnn [label="CNN\n(no pooling)"];
grid [label="Spatial Features\n14×14×512"];
attn [label="Attention\nMechanism"];
context [label="Context c_t"];
lstm [label="LSTM"];
word [label="Next Word"];
img -> cnn -> grid -> attn;
attn -> context [label="weighted sum"];
context -> lstm -> word;
lstm -> attn [style=dashed, color=blue, constraint=false, xlabel="h_{t-1}"];
}
Key idea: The LSTM's previous hidden state tells attention "what to look for". Attention returns a context vector that helps predict the next word.
When generating each word, the model learns to focus on the relevant image regions. White/bright areas show where the model "looks" when generating that particular word.
Attention and LSTM work together in a recurrent loop at each timestep:
digraph recurrent_loop {
rankdir=LR;
node [fontsize=11, shape=box, style=rounded];
newrank=true;
subgraph cluster_fixed {
label="Computed Once";
style=filled; fillcolor="#fff3e0";
features [label="Image Features\n(L regions)"];
}
subgraph cluster_recurrent {
label="Recurrent Loop (repeated for each word)";
style=filled; fillcolor="#e3f2fd";
color="#1976d2"; penwidth=2;
h_prev [label="h_{t-1}", style=filled, fillcolor="#bbdefb"];
attn [label="Attention"];
context [label="Context c_t"];
word [label="Word w_t"];
lstm [label="LSTM"];
h_next [label="h_t", style=filled, fillcolor="#c8e6c9"];
output [label="Predict\nNext Word"];
{rank=same; h_prev; word}
{rank=same; attn}
{rank=same; context}
{rank=same; lstm}
{rank=same; h_next; output}
}
features -> attn;
h_prev -> attn;
word -> lstm;
attn -> context -> lstm;
h_prev -> lstm [style=dashed, label="state"];
lstm -> h_next;
h_next -> output;
h_next -> h_prev [style=dashed, color="#1976d2", penwidth=2, constraint=false, xlabel="loop back"];
}
Blue box: Repeated for each word. Orange box: Computed once from image.
At timestep \(t\), generating word \(t+1\):
The loop continues until we generate [END] token or max length.
What happens inside the attention block? Three key steps:
digraph attention_detail {
rankdir=LR;
node [fontsize=12, shape=box, style=rounded];
subgraph cluster_input {
label="Inputs";
style=filled; fillcolor="#e3f2fd";
features [label="Image Features\na₁, a₂, ..., a_L\n(L regions)"];
hidden [label="Previous State\nh_{t-1}"];
}
subgraph cluster_attention {
label="Attention Computation";
style=filled; fillcolor="#fff3e0";
score [label="1. Score\nFunction"];
softmax [label="2. Softmax\nα = softmax(e)"];
weighted [label="3. Weighted\nSum"];
}
context [label="Context\nVector c_t", style=filled, fillcolor="#e8f5e9"];
features -> score;
hidden -> score;
score -> softmax -> weighted;
features -> weighted;
weighted -> context;
}
For each image region, compute a "relevance score" given the current decoder state:
# Common scoring function: additive (MLP) attention
# (W_a, W_h, b, and v are learned parameters)
def attention_score(a_i, h_t):
    # Combine image feature and hidden state
    combined = torch.tanh(W_a @ a_i + W_h @ h_t + b)
    # Project to a scalar relevance score
    e = v @ combined  # scalar
    return e
Convert scores to a probability distribution over image regions:
Properties of \(\alpha\):
Example (L=4 regions):
Scores e:  [2.1, 0.5, 0.8, -0.3]
    ↓ softmax
Weights α: [0.64, 0.13, 0.17, 0.06]
    ↓
Region 1 gets 64% of the attention
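The arithmetic can be checked directly with `torch.softmax` (a quick verification, not part of the model):

```python
import torch

# Scores for the L=4 image regions from the example
e = torch.tensor([2.1, 0.5, 0.8, -0.3])

# Softmax turns raw scores into a probability distribution
alpha = torch.softmax(e, dim=0)

print(alpha.tolist())  # ≈ [0.64, 0.13, 0.17, 0.06]
```

The weights are non-negative and always sum to exactly 1, whatever the input scores.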
Combine image features using attention weights:
# Weighted sum of image features
def compute_context(features, alpha):
    """
    features: (L, D) - L regions, D-dimensional each
    alpha: (L,) - attention weights (sum to 1)
    """
    context = (alpha.unsqueeze(1) * features).sum(dim=0)  # (D,)
    return context
Putting it all together for one decoding step:
class Attention(nn.Module):
    def __init__(self, feature_dim, hidden_dim, attention_dim):
        super().__init__()
        self.W_a = nn.Linear(feature_dim, attention_dim)  # project image features
        self.W_h = nn.Linear(hidden_dim, attention_dim)   # project hidden state
        self.v = nn.Linear(attention_dim, 1)              # score to scalar

    def forward(self, features, hidden):
        # features: (batch, L, feature_dim) - L image regions
        # hidden: (batch, hidden_dim) - LSTM state
        # Step 1: Compute scores for each region
        scores = self.v(torch.tanh(
            self.W_a(features) + self.W_h(hidden).unsqueeze(1)
        ))  # (batch, L, 1)
        # Step 2: Softmax → attention weights
        alpha = F.softmax(scores.squeeze(2), dim=1)  # (batch, L)
        # Step 3: Weighted sum → context vector
        context = (alpha.unsqueeze(2) * features).sum(dim=1)  # (batch, feature_dim)
        return context, alpha
The attention mechanism is trained end-to-end with backpropagation:
digraph training {
rankdir=LR;
node [fontsize=11, shape=box, style=rounded];
edge [fontsize=9];
img [label="Image"];
cnn [label="CNN"];
attn [label="Attention\n(W_a, W_h, v)"];
lstm [label="LSTM"];
output [label="Softmax\nover vocab"];
loss [label="Cross-Entropy\nLoss", style=filled, fillcolor="#ffcdd2"];
target [label="Ground Truth\nCaption"];
img -> cnn -> attn -> lstm -> output -> loss;
target -> loss;
loss -> output -> lstm -> attn -> cnn [style=dashed, color=red, label="gradients"];
}
Key insight: All operations (linear layers, tanh, softmax, weighted sum) are differentiable!
The training loop for image captioning with attention:
for images, captions in dataloader:
    optimizer.zero_grad()
    # 1. Extract image features (L regions)
    features = cnn(images)  # (batch, L, feature_dim)
    # 2. Initialize LSTM hidden state from mean-pooled features
    hidden = init_hidden(features.mean(dim=1))
    loss = 0
    for t in range(caption_length - 1):
        # 3. Compute attention weights
        context, alpha = attention(features, hidden)
        # 4. LSTM step: input = [previous word embedding, context]
        input_t = torch.cat([word_embed(captions[:, t]), context], dim=1)
        hidden = lstm(input_t, hidden)
        # 5. Predict next word
        logits = output_layer(hidden)  # (batch, vocab_size)
        # 6. Cross-entropy loss against the ground-truth next word
        loss += F.cross_entropy(logits, captions[:, t + 1])
    # 7. Backpropagate through everything (including attention!)
    loss.backward()
    optimizer.step()
Gradients flow back through the attention computation:
Result: The model learns which regions to attend to by minimizing caption prediction loss!
We never tell the model "look here for this word"; it discovers this on its own!
Training signal:
What attention learns:
Attention solves the "information bottleneck" problem:
This same idea powers Transformers! (Lesson 7)
We've computed attention weights \(\alpha_i\) for each region. But how do we use them?
The "Show, Attend and Tell" paper explored two approaches:
Weighted average of all regions:
Sample ONE region:
In practice, soft attention is almost always used. Here's why:
| | Soft Attention | Hard Attention |
|---|---|---|
| Training | Standard backprop ✅ | REINFORCE (high variance) ⚠️ |
| Gradients | Smooth, stable | Noisy, needs baselines |
| Computation | Deterministic | Stochastic (needs multiple samples) |
| Interpretation | "Soft" focus on multiple regions | "Hard" focus on one region |
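The two variants side by side, as a toy sketch (random tensors stand in for real features; the REINFORCE machinery needed to train the hard path is omitted):

```python
import torch

torch.manual_seed(0)
features = torch.randn(4, 512)                # L=4 regions, 512-d each
alpha = torch.softmax(torch.randn(4), dim=0)  # attention weights over regions

# Soft attention: differentiable weighted average over ALL regions
soft_context = (alpha.unsqueeze(1) * features).sum(dim=0)  # (512,)

# Hard attention: sample ONE region index from α; the sampling step
# is not differentiable, hence the need for REINFORCE
idx = torch.multinomial(alpha, num_samples=1).item()
hard_context = features[idx]                               # (512,)
```

Note that `soft_context` blends all four regions, while `hard_context` is exactly one row of `features`.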
The model learns where to look when generating each word:
"A bird flying over water"
(Attention-map illustration: a 14×14 grid over the image, with high α on the bird region when generating "bird" and high α on the water region when generating "water".)
Soft attention in action:
This same idea powers Transformers! (Lesson 7)
Question: Can we learn a more general connection between vision and language?
From task-specific to representation learning
Researchers changed the question they were asking:
"Can we generate captions for images?"
→ Task-specific, narrow
"Can we learn a shared meaning space for images and text?"
→ General-purpose, transferable
This shift from task-specific training to representation learning leads directly to CLIP.
The goal: map images and text to the same vector space.
digraph shared_space {
rankdir=LR;
node [fontsize=10, shape=box, style=rounded];
subgraph cluster_input {
label="Inputs";
style=filled; fillcolor="#f5f5f5";
img1 [label="🐱 cat photo"];
img2 [label="🐶 dog photo"];
txt1 [label="\"a cat\""];
txt2 [label="\"a dog\""];
}
subgraph cluster_space {
label="Shared Embedding Space";
style=filled; fillcolor="#e8f4f8";
node [shape=circle, width=0.3];
e1 [label="●"];
e2 [label="●"];
e3 [label="●"];
e4 [label="●"];
}
img1 -> e1 [label="encode"];
txt1 -> e3 [label="encode"];
img2 -> e2 [label="encode"];
txt2 -> e4 [label="encode"];
}
t-SNE projection of CLIP embeddings. Images and their corresponding text descriptions cluster together in the shared space. Different semantic categories form distinct clusters.
With a shared embedding space, many tasks become simple:
| Task | How It Works |
|---|---|
| Image → Text | Find text embeddings closest to image embedding |
| Text → Image | Find image embeddings closest to text embedding |
| Classification | Compare image to text descriptions of each class |
| Similarity | Directly measure distance between any image and text |
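All four rows of the table reduce to the same nearest-neighbor lookup. A sketch with random vectors standing in for real encoder outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Random unit vectors stand in for the outputs of real image/text encoders
img_embs = F.normalize(torch.randn(3, 512), dim=-1)  # 3 image embeddings
txt_embs = F.normalize(torch.randn(3, 512), dim=-1)  # 3 text embeddings

sims = img_embs @ txt_embs.T  # (3, 3) cosine similarities

best_text_for_img0 = sims[0].argmax()    # image → text retrieval
best_img_for_txt0 = sims[:, 0].argmax()  # text → image retrieval
```

Classification is the same lookup where the "texts" are class descriptions.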
Contrastive Language–Image Pretraining
Contrastive Language–Image Pretraining (OpenAI, 2021)
digraph clip_core {
rankdir=TB;
node [fontsize=10, shape=box, style=rounded];
img [label="Image"];
txt [label="Text"];
img_enc [label="Image Encoder\n(ResNet/ViT)"];
txt_enc [label="Text Encoder\n(Transformer)"];
img_emb [label="Image\nEmbedding", shape=ellipse];
txt_emb [label="Text\nEmbedding", shape=ellipse];
sim [label="Cosine\nSimilarity", shape=diamond];
img -> img_enc -> img_emb;
txt -> txt_enc -> txt_emb;
img_emb -> sim;
txt_emb -> sim;
}
Left: Image and text encoders project inputs to a shared embedding space.
Right: The N×N similarity matrix where diagonal entries (matching pairs) are maximized during training.
CLIP uses two separate encoders that project to the same space:
Critical: Both encoders output vectors of the same dimension!
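A minimal sketch of the dual-encoder shape (the linear layers are illustrative stand-ins for the real ResNet/ViT and Transformer backbones; all dimensions are assumptions):

```python
import torch
import torch.nn as nn

embed_dim = 512  # the shared space

# Stand-in encoders: each backbone ends in a projection into the shared space
image_proj = nn.Linear(2048, embed_dim)  # pooled image features → 512-d
text_proj = nn.Linear(768, embed_dim)    # pooled text features  → 512-d

img_emb = image_proj(torch.randn(8, 2048))  # (8, 512)
txt_emb = text_proj(torch.randn(8, 768))    # (8, 512)

# Same output dimension → cosine similarity between them is well defined
assert img_emb.shape[-1] == txt_emb.shape[-1]
```

The inputs live in different spaces of different sizes; only the projected outputs share a geometry.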
CLIP was trained on a massive dataset of image-text pairs:
| Dataset | Size | Type |
|---|---|---|
| COCO Captions | ~330K images | Curated |
| Visual Genome | ~100K images | Curated |
| CLIP WIT | ~400M pairs | Web-scraped |
1000× more data than previous vision-language datasets!
The training objective that makes CLIP work
Learn by comparing: push similar things together, dissimilar things apart.
Before Training
Embedding Space:
● img1
● txt1   ● txt2
● img2
● img3
● txt3
(scattered randomly)
After Training
Embedding Space:
●● (img1, txt1)
●● (img2, txt2)
●● (img3, txt3)
(matching pairs cluster)
No labels needed! The pairing itself provides the supervision signal.
In each training batch of \(N\) image-text pairs:
digraph batch {
rankdir=LR;
node [fontsize=10, shape=box, style=rounded];
subgraph cluster_batch {
label="Batch of N pairs";
style=filled; fillcolor="#f5f5f5";
i1 [label="img₁"]; t1 [label="txt₁"];
i2 [label="img₂"]; t2 [label="txt₂"];
i3 [label="img₃"]; t3 [label="txt₃"];
iN [label="img_N"]; tN [label="txt_N"];
}
matrix [label="N×N\nSimilarity\nMatrix", shape=box3d];
i1 -> matrix; i2 -> matrix; i3 -> matrix; iN -> matrix;
t1 -> matrix; t2 -> matrix; t3 -> matrix; tN -> matrix;
}
For a batch of N image-text pairs, we compute an N×N matrix of cosine similarities. The training objective maximizes the diagonal (correct pairs) while minimizing off-diagonal entries (incorrect pairs).
CLIP measures similarity using normalized dot products:
# PyTorch: cosine similarity between all image/text pairs
def cosine_similarity(I, T):
    I_norm = I / I.norm(dim=-1, keepdim=True)
    T_norm = T / T.norm(dim=-1, keepdim=True)
    return I_norm @ T_norm.T  # (N, N) similarity matrix
For a batch, we compute all pairwise similarities:
Similarity Matrix (N=4):
        txt₁   txt₂   txt₃   txt₄
      ┌───────────────────────────┐
img₁  │ 0.92   0.15   0.08   0.21 │
img₂  │ 0.11   0.89   0.23   0.05 │
img₃  │ 0.18   0.12   0.95   0.14 │
img₄  │ 0.09   0.22   0.17   0.88 │
      └───────────────────────────┘

Diagonal = correct pairs ✓
Training objective:
The loss function that makes contrastive learning work:
Key insight: Each image must "pick" its correct text from N choices (and vice versa).
CLIP uses a symmetric loss β both directions matter:
Image → Text:
For each image, classify which text matches
Rows of similarity matrix
Text → Image:
For each text, classify which image matches
Columns of similarity matrix
This ensures embeddings work well for both retrieval directions.
import torch
import torch.nn.functional as F

def clip_loss(image_embeddings, text_embeddings, temperature=0.07):
    """
    Compute CLIP contrastive loss.
    image_embeddings: (N, D) normalized image vectors
    text_embeddings: (N, D) normalized text vectors
    """
    # Compute similarity matrix: (N, N)
    logits = (image_embeddings @ text_embeddings.T) / temperature
    # Labels: diagonal entries are the correct pairs
    labels = torch.arange(len(logits), device=logits.device)
    # Symmetric cross-entropy loss
    loss_i2t = F.cross_entropy(logits, labels)    # image → text
    loss_t2i = F.cross_entropy(logits.T, labels)  # text → image
    return (loss_i2t + loss_t2i) / 2
That's it! The simplicity of the loss is part of what makes CLIP so effective.
Key insight: Contrastive objectives are much more sample-efficient than generative (prediction) objectives. CLIP gets more "learning" per image-text pair!
Temperature controls the "sharpness" of the softmax distribution:
High \(\tau\) (e.g., 1.0):
Low \(\tau\) (e.g., 0.01):
CLIP learns \(\tau\) as a parameter, typically converging to ~0.07.
# Temperature as learnable parameter
self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
logits = logits * self.logit_scale.exp()
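A quick sketch of what τ does to the same scores (values are illustrative):

```python
import torch

sims = torch.tensor([0.9, 0.5, 0.1])  # one image's similarity to 3 texts

soft = torch.softmax(sims / 1.0, dim=0)    # high τ: soft, spread-out distribution
sharp = torch.softmax(sims / 0.07, dim=0)  # low τ: nearly one-hot

print(soft.tolist())    # ≈ [0.47, 0.32, 0.21]
print(sharp[0].item())  # ≈ 1.0 — the top match dominates
```

With a low τ, small similarity gaps become large logit gaps, so the loss pushes hard on near-miss negatives.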
Zero-shot transfer and beyond
CLIP can classify images into any categories without training!
digraph zero_shot {
rankdir=TB;
node [fontsize=10, shape=box, style=rounded];
img [label="🐱 Test Image"];
img_enc [label="Image\nEncoder"];
img_emb [label="Image\nEmbedding", shape=ellipse];
t1 [label="\"a photo of a cat\""];
t2 [label="\"a photo of a dog\""];
t3 [label="\"a photo of a car\""];
txt_enc [label="Text\nEncoder"];
compare [label="Cosine\nSimilarity", shape=diamond];
pred [label="Prediction:\ncat (0.92)"];
img -> img_enc -> img_emb -> compare;
t1 -> txt_enc -> compare;
t2 -> txt_enc -> compare;
t3 -> txt_enc -> compare;
compare -> pred;
}
No training on these classes! Just encode and compare.
The image is encoded once, then compared against text embeddings of all possible class labels. The class with highest cosine similarity is selected as the prediction β no task-specific training required!
import clip
import torch
from PIL import Image
# Load CLIP model
model, preprocess = clip.load("ViT-B/32", device="cuda")
# Prepare image
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to("cuda")
# Define class prompts
classes = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
text = clip.tokenize(classes).to("cuda")
# Encode both modalities
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
# Normalize and compute similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Predictions: {similarity[0]}") # [0.92, 0.05, 0.03]
The text prompts matter! Better prompts = better zero-shot accuracy.
| Simple Prompt | Better Prompt |
|---|---|
| "cat" | "a photo of a cat" |
| "airplane" | "a photo of an airplane, a type of aircraft" |
| "beach" | "a photo of a beach, with sand and ocean" |
# Ensemble of prompts for "cat"
prompts = [
    "a photo of a cat",
    "a photograph of a cat",
    "an image of a cat",
    "a picture of a cat",
]
text_embeddings = [model.encode_text(clip.tokenize(p)) for p in prompts]
cat_embedding = torch.stack(text_embeddings).mean(dim=0)
Given an image, find matching text (or vice versa):
# Find best caption for image
image_emb = model.encode_image(image)
text_embs = model.encode_text(all_texts)
# Rank by similarity
sims = image_emb @ text_embs.T
best_idx = sims.argmax()
print(all_texts[best_idx])
# Find best image for query
text_emb = model.encode_text(query)
image_embs = model.encode_image(all_images)
# Rank by similarity
sims = text_emb @ image_embs.T
best_idx = sims.argmax()
show(all_images[best_idx])
This powers image search engines, visual databases, and content moderation!
CLIP achieves strong performance without task-specific training:
| Dataset | ResNet-50 (supervised) | CLIP ViT-L/14 (zero-shot) |
|---|---|---|
| ImageNet | 76.1% | 75.5% |
| CIFAR-10 | 95.6% | 95.6% |
| Food-101 | 72.8% | 92.9% |
| STL-10 | 96.3% | 99.3% |
Key insight: CLIP matches or beats supervised models on many datasets, without seeing a single example from those datasets during training!
From embeddings to generation
CLIP's aligned embeddings enable many downstream systems:
digraph clip_ecosystem {
rankdir=TB;
node [fontsize=10, shape=box, style=rounded];
clip [label="CLIP\nEmbeddings", shape=ellipse, style=filled, fillcolor="#e8f4f8"];
caption [label="+ Language Model\n→ Image Captioning"];
diffusion [label="+ Diffusion Model\n→ Text-to-Image"];
vqa [label="+ QA Head\n→ Visual QA"];
seg [label="+ Decoder\n→ Segmentation"];
clip -> caption;
clip -> diffusion;
clip -> vqa;
clip -> seg;
}
CLIP provides the vision-language bridge; other models provide the generation capabilities.
Use CLIP as the visual encoder, feed into a language model:
digraph clip_caption {
rankdir=LR;
node [fontsize=12, shape=box, style=rounded, width=1.3, height=0.7];
edge [penwidth=1.5];
graph [nodesep=0.8, ranksep=1.0];
img [label="Image"];
clip [label="CLIP\nImage Encoder", style="filled,rounded", fillcolor="#e3f2fd"];
proj [label="Projection\nLayer"];
lm [label="Language Model\n(GPT-2, OPT, ...)", style="filled,rounded", fillcolor="#fff3e0"];
cap [label="Caption:\n\"A tabby cat\nsitting on...\"", style="filled,rounded", fillcolor="#e8f5e9"];
img -> clip -> proj -> lm -> cap;
}
CLIP's role:
LM's role:
This is how models like BLIP, Flamingo, LLaVA work!
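A minimal sketch of the glue between the two models, in the spirit of prefix-style approaches like ClipCap (all names and dimensions here are illustrative assumptions, not any model's actual sizes):

```python
import torch
import torch.nn as nn

# Map one CLIP image embedding to k "prefix" vectors in the LM's space
clip_dim, lm_dim, prefix_len = 512, 768, 4

projection = nn.Linear(clip_dim, lm_dim * prefix_len)

img_emb = torch.randn(1, clip_dim)  # stand-in for a CLIP image embedding
prefix = projection(img_emb).view(1, prefix_len, lm_dim)

# `prefix` is prepended to the caption token embeddings; the language
# model then generates text conditioned on the image
print(prefix.shape)  # torch.Size([1, 4, 768])
```

Often only this projection (and sometimes the LM) is trained; the CLIP encoder stays frozen.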
Early text-to-image used CLIP to guide diffusion models:
digraph clip_guidance {
rankdir=LR;
node [fontsize=11, shape=box, style=rounded];
subgraph cluster_text {
label="Text Path";
style=filled; fillcolor="#e3f2fd";
prompt [label="Prompt\n\"a cat wearing\na tiny hat\""];
clip_txt [label="CLIP Text\nEncoder"];
txt_emb [label="Text Emb", shape=ellipse];
}
subgraph cluster_image {
label="Image Generation Path";
style=filled; fillcolor="#fff3e0";
noise [label="Random\nNoise"];
diffusion [label="Diffusion\nModel"];
img [label="Generated\nImage"];
clip_img [label="CLIP Image\nEncoder"];
img_emb [label="Image Emb", shape=ellipse];
}
sim [label="Similarity", shape=diamond, style=filled, fillcolor="#e8f5e9"];
prompt -> clip_txt -> txt_emb;
noise -> diffusion -> img -> clip_img -> img_emb;
txt_emb -> sim;
img_emb -> sim;
sim -> diffusion [style=dashed, color=red, label="gradient\n(guide)"];
}
CLIP is used twice: encode prompt, evaluate generated image, guide optimization!
Stable Diffusion uses the CLIP text encoder to convert prompts into embeddings, which guide the diffusion process via cross-attention in the U-Net. The model operates in a compressed latent space for efficiency.
The latent diffusion approach: images are encoded to a smaller latent space, diffusion happens there (faster!), then decoded back to pixels. CLIP text embeddings condition the denoising process.
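A toy sketch of one such cross-attention block (dimensions are illustrative; the real U-Net interleaves many of these with convolutions):

```python
import torch
import torch.nn as nn

d = 64  # illustrative width

# Cross-attention: latent image features are the queries; CLIP text
# embeddings supply the keys and values
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

latent = torch.randn(1, 16, d)  # 16 spatial positions of the latent
text = torch.randn(1, 77, d)    # 77 text token embeddings (CLIP's context length)

conditioned, weights = attn(query=latent, key=text, value=text)
print(conditioned.shape)  # torch.Size([1, 16, 64])
print(weights.shape)      # torch.Size([1, 16, 77])
```

Each spatial position of the latent attends over the prompt tokens, so the text steers every denoising step.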
DALLΒ·E 2, Stable Diffusion, Midjourney all build on CLIP ideas:
| Model | CLIP's Role |
|---|---|
| DALLΒ·E 2 | Uses CLIP image embeddings as conditioning for diffusion |
| Stable Diffusion | Uses CLIP text encoder (frozen) to condition generation |
| Imagen | Uses T5 text encoder instead of CLIP |
CLIP also remains crucial for filtering, ranking, and safety in production systems.
Putting it all together
# Install: pip install git+https://github.com/openai/CLIP.git
import clip
import torch
from PIL import Image
# Load model (downloads on first run)
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Encode an image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
image_features = model.encode_image(image)
# Encode text
text = clip.tokenize(["a dog", "a cat", "a bird"]).to(device)
text_features = model.encode_text(text)
# Compute similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", similarity) # Probability for each class
OpenAI released several CLIP variants:
| Model | Image Encoder | Embedding Dim | Speed | Accuracy |
|---|---|---|---|---|
| RN50 | ResNet-50 | 1024 | Fast | Good |
| RN101 | ResNet-101 | 512 | Medium | Better |
| ViT-B/32 | ViT-Base, 32px patches | 512 | Fast | Good |
| ViT-B/16 | ViT-Base, 16px patches | 512 | Medium | Better |
| ViT-L/14 | ViT-Large, 14px patches | 768 | Slow | Best |
# List available models
print(clip.available_models())
# ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64',
# 'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']