Multi-Head Attention¶
The Problem¶
A single attention mechanism computes one set of attention weights — one "perspective." But different aspects of language require different types of attention:
- One perspective might focus on adjacent characters (common pairs like "th", "er")
- Another might focus on the first character (names starting with certain letters tend to end in certain ways)
- Another might focus on repeated patterns (after "mm", an "a" often follows)
One attention "head" can only learn one perspective. We need multiple.
The Solution: Multiple Heads¶
Split the 16-dimensional Q, K, V vectors into 4 independent groups of 4 dimensions each. Each group runs its own attention. Then concatenate the results.
flowchart LR
Q["Q (16 dims)"] --> H0["Head 0<br>dims 0-3"]
Q --> H1["Head 1<br>dims 4-7"]
Q --> H2["Head 2<br>dims 8-11"]
Q --> H3["Head 3<br>dims 12-15"]
H0 --> CAT["Concatenate<br>(16 dims)"]
H1 --> CAT
H2 --> CAT
H3 --> CAT
CAT --> WO["linear(Wo)<br>→ output (16 dims)"]

Each head:
- Has its own 4-dimensional slice of Q, K, V
- Computes its own attention scores and weights
- Produces a 4-dimensional output
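To make the split concrete, here is a tiny sketch of the slice bookkeeping described above (plain Python; the names mirror the excerpt in the next section, but the snippet itself is only illustrative):

```python
n_embd, n_head = 16, 4
head_dim = n_embd // n_head                    # 4 dims per head

# Each head h reads its own contiguous slice [hs, hs + head_dim) of Q, K, V.
for h in range(n_head):
    hs = h * head_dim
    print(f"head {h}: dims {hs}-{hs + head_dim - 1}")   # 0-3, 4-7, 8-11, 12-15
```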
The Code (Lines 123–132)¶
microgpt.py — Lines 123-132
x_attn = []
for h in range(n_head):                                   # n_head = 4
    hs = h * head_dim                                     # head start index: 0, 4, 8, 12
    q_h = q[hs:hs+head_dim]                               # slice of Q for this head
    k_h = [ki[hs:hs+head_dim] for ki in keys[li]]         # slice of each K
    v_h = [vi[hs:hs+head_dim] for vi in values[li]]       # slice of each V
    # Scaled dot-product attention (per head)
    attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
                   for t in range(len(k_h))]
    attn_weights = softmax(attn_logits)
    head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
                for j in range(head_dim)]
    x_attn.extend(head_out)                               # concatenate head outputs
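The excerpt relies on names defined elsewhere in microgpt.py (q, keys[li], values[li], softmax, n_head, head_dim). The following is a self-contained sketch of the same loop with toy random vectors, a minimal softmax, and the layer index dropped; everything not in the excerpt is illustrative:

```python
import math, random

n_embd, n_head = 16, 4
head_dim = n_embd // n_head
T = 3                                        # toy context length (illustrative)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

random.seed(0)
q = [random.gauss(0, 1) for _ in range(n_embd)]                              # query for the current token
keys = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(T)]       # one key per position
values = [[random.gauss(0, 1) for _ in range(n_embd)] for _ in range(T)]     # one value per position

x_attn = []
for h in range(n_head):
    hs = h * head_dim
    q_h = q[hs:hs+head_dim]
    k_h = [ki[hs:hs+head_dim] for ki in keys]
    v_h = [vi[hs:hs+head_dim] for vi in values]
    attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
                   for t in range(len(k_h))]
    attn_weights = softmax(attn_logits)                  # sums to 1 over the T positions
    head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
                for j in range(head_dim)]
    x_attn.extend(head_out)

assert len(x_attn) == n_embd                 # 4 heads x 4 dims -> 16-dim concatenation
```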
Output Projection (Line 133)¶
After concatenating the 4 heads, one more linear layer mixes them together. This lets the model combine information from different attention perspectives.
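Line 133 itself is not reproduced here. As a minimal sketch of what the diagram's linear(Wo) step does, assuming the output projection is a 16×16 matrix named wo (name and layout hypothetical) applied to the concatenated x_attn:

```python
import random

random.seed(1)
n_embd = 16
# wo: hypothetical 16x16 output-projection weights (name and layout assumed, not from the source).
wo = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(n_embd)]
# x_attn: the 16-dim concatenation of the 4 head outputs (toy values here).
x_attn = [random.gauss(0, 1) for _ in range(n_embd)]

# One matrix-vector product mixes information across the heads.
x = [sum(wo[i][j] * x_attn[j] for j in range(n_embd)) for i in range(n_embd)]
```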
Why Split Instead of 4 Full-Size Heads?¶
| Approach | Operations per step |
|---|---|
| 4 full-size heads (16 dims each) | \(4 \times (16 \times 16) = 1024\) |
| 4 split heads (4 dims each) | \(4 \times (4 \times 4) = 64\) |
Tip
Same number of total parameters (the Q, K, V matrices are still 16×16), but the attention computations are 16× cheaper. And empirically, multiple small heads learn better than one big head.
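A quick check of the counts from the table above (just the arithmetic as stated):

```python
full_size = 4 * (16 * 16)     # 4 full-size heads
split = 4 * (4 * 4)           # 4 split heads of 4 dims each
print(full_size, split, full_size // split)   # 1024 64 16
```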
Terminology
| Term | Meaning |
|---|---|
| Multi-head attention | Running multiple attention heads in parallel on subsets of dimensions |
| Head | One independent attention mechanism |
| head_dim | The dimension of each head (\(n_\text{embd} / n_\text{head}\)) |
| Concatenation | Joining head outputs end-to-end |
| Output projection | A linear layer that mixes the concatenated heads |