Embeddings¶
The Problem¶
We have token IDs: 'e' = 4, 'm' = 12, etc. But a single number doesn't carry much information. The model needs a richer representation.
Consider: is 'e' (4) more similar to 'd' (3) than to 'z' (25)? No — the numeric ordering is arbitrary. But with a single number, the model has no way to tell.
The Solution: Embedding Vectors¶
Instead of representing each token as a single number, we represent it as a list of numbers (a vector). Each token gets its own 16-dimensional vector:
'a' (token 0) → [0.01, -0.03, 0.02, ..., 0.04] (16 numbers)
'b' (token 1) → [-0.02, 0.01, -0.01, ..., 0.03] (16 numbers)
'e' (token 4) → [0.03, 0.02, -0.04, ..., -0.01] (16 numbers)
These 16 numbers capture a token's "meaning" in a way the model can work with. After training, tokens with similar roles will have similar vectors.
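To make the lookup concrete, here is a minimal sketch in plain Python. It builds a 27×16 table of small random floats (standing in for the untrained `Value` parameters) and fetches one row by token ID. The names `VOCAB_SIZE`, `N_EMBD`, and the initialization scale are illustrative assumptions, not the tutorial's exact code:

```python
import random

random.seed(0)

VOCAB_SIZE = 27   # 26 letters plus one special token (assumed from the 27x16 table)
N_EMBD = 16       # embedding dimension

# Embedding table: one 16-number vector per token, initialized to
# small random values, as a real table would be before training.
wte = [[random.gauss(0, 0.02) for _ in range(N_EMBD)]
       for _ in range(VOCAB_SIZE)]

token_id = 4              # 'e'
tok_emb = wte[token_id]   # the lookup: just grab row 4, no arithmetic

print(len(tok_emb))       # → 16
```

Note that the lookup is pure indexing; the "learning" happens later, when training nudges the numbers stored in each row.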
The Code (Lines 109–111)¶
tok_emb = state_dict['wte'][token_id] # token embedding
pos_emb = state_dict['wpe'][pos_id] # position embedding
x = [t + p for t, p in zip(tok_emb, pos_emb)] # combined embedding
state_dict['wte'] is a 27×16 matrix — the token embedding table. For token_id = 4 (the letter 'e'), we grab row 4 — a list of 16 Value objects.
This is not a computation, just a lookup. But the values in this table are parameters — they'll be updated during training.
state_dict['wpe'] is an 8×16 matrix — the position embedding table. It encodes where a token sits in the sequence.
Why positions matter
The letter 'e' at position 0 (first letter of a name) is very different from 'e' at position 4 (middle of a name). Without position information, the model would treat every occurrence identically.
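The effect of adding position embeddings can be sketched with plain floats instead of `Value` objects. The tables below are random stand-ins (seeded for reproducibility), so the specific numbers are hypothetical; the point is that the same token at two different positions produces two different combined vectors:

```python
import random

random.seed(1)

N_EMBD, VOCAB_SIZE, BLOCK_SIZE = 16, 27, 8

# Toy stand-ins for the real tables (which hold trainable Value objects).
wte = [[random.gauss(0, 0.02) for _ in range(N_EMBD)] for _ in range(VOCAB_SIZE)]
wpe = [[random.gauss(0, 0.02) for _ in range(N_EMBD)] for _ in range(BLOCK_SIZE)]

token_id, pos_id = 4, 2                         # 'e' at position 2
tok_emb = wte[token_id]                         # what the token is
pos_emb = wpe[pos_id]                           # where the token is
x = [t + p for t, p in zip(tok_emb, pos_emb)]   # element-wise sum

# The same letter 'e' at position 5 gets a different combined vector:
x_other = [t + p for t, p in zip(wte[token_id], wpe[5])]
print(x == x_other)   # → False: position changes the representation
```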
Visual Summary¶
flowchart LR
TID["token_id = 4<br>('e')"] --> WTE["wte<br>27 × 16"]
WTE --> |"row 4"| TOK["tok_emb<br>(16 Values)"]
PID["pos_id = 2"] --> WPE["wpe<br>8 × 16"]
WPE --> |"row 2"| POS["pos_emb<br>(16 Values)"]
TOK --> ADD["⊕ element-wise add"]
POS --> ADD
ADD --> X["x = 'e' at position 2<br>(16 Values)"]

Why 16 Dimensions?¶
The choice of 16 is a hyperparameter (n_embd). More dimensions = more expressive power, but also more parameters and slower computation. For our tiny names dataset, 16 is sufficient.
Info
In GPT-2, n_embd = 768. Each token is a 768-dimensional vector. That's a much richer representation, needed for understanding complex language.
Terminology
| Term | Meaning |
|---|---|
| Embedding | A vector (list of numbers) representing a token |
| Embedding table | A matrix where each row is one token's embedding |
| Token embedding | Encodes what the token is |
| Position embedding | Encodes where the token is in the sequence |
| Lookup | Selecting a row from a table by index (no arithmetic) |
| Dimension | The number of elements in an embedding vector |