# The Training Loop

## The Complete Code (Lines 151–184)
microgpt.py — Lines 151-184
```python
# ── SETUP ──
num_steps = 500
for step in range(num_steps):

    # ── 1. SAMPLE ──
    doc = docs[step % len(docs)]  # pick a name (cycle through dataset)
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]  # tokenize
    n = min(block_size, len(tokens) - 1)  # cap at block_size (8)

    # ── 2. FORWARD PASS + LOSS ──
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)  # model prediction
        probs = softmax(logits)  # to probabilities
        loss_t = -probs[target_id].log()  # cross-entropy loss
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)  # average loss over the sequence

    # ── 3. BACKWARD PASS ──
    loss.backward()  # compute all gradients

    # ── 4. PARAMETER UPDATE (Adam) ──
    lr_t = learning_rate * 0.5 * (1 + math.cos(math.pi * step / num_steps))
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0  # reset gradient

    # ── 5. LOG ──
    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}")
```
## Tracing Step 0 with "emma"
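Step 1 turns "emma" into the token sequence the loop trains on. Here is a minimal sketch, assuming uchars is the sorted list of the 26 lowercase letters and BOS is index 26 (both defined earlier in microgpt.py):

```python
uchars = [chr(ord("a") + i) for i in range(26)]  # stand-in for the real vocab
BOS = 26                                         # sequence-boundary token

doc = "emma"
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
print(tokens)                   # [26, 4, 12, 12, 0, 26]
print(min(8, len(tokens) - 1))  # 5 training positions: the five rows below
```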
| Position | Input | Target | Loss |
|---|---|---|---|
| 0 | BOS (26) | 'e' (4) | 3.29 |
| 1 | 'e' (4) | 'm' (12) | 3.33 |
| 2 | 'm' (12) | 'm' (12) | 3.30 |
| 3 | 'm' (12) | 'a' (0) | 3.31 |
| 4 | 'a' (0) | BOS (26) | 3.28 |
| Avg | | | 3.302 |
At step 0, the loss is ~3.3. For a model that guesses uniformly over a 27-token vocabulary, the expected cross-entropy loss is \(-\log(1/27) \approx 3.30\), so the model is exactly at random chance.
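A quick sanity check of that number:

```python
import math

vocab_size = 27
print(-math.log(1 / vocab_size))  # 3.2958..., matching the step-0 loss above
```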
Step 4 then runs for each of the 4,064 parameters: update m[i] and v[i], nudge p.data, and reset p.grad to 0.
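To make one of those updates concrete, here is a single-parameter sketch of the Adam math at step 0, using a made-up gradient; the hyperparameter values are assumptions (the standard Adam defaults), not necessarily microgpt's:

```python
beta1, beta2, eps_adam = 0.9, 0.999, 1e-8  # assumed: standard Adam defaults
lr_t, grad = 0.01, 0.02                    # hypothetical lr and gradient
m_i, v_i = 0.0, 0.0                        # both moments start at zero

m_i = beta1 * m_i + (1 - beta1) * grad      # 0.002
v_i = beta2 * v_i + (1 - beta2) * grad ** 2 # 4e-07
m_hat = m_i / (1 - beta1 ** 1)              # bias-corrected -> 0.02 (= grad)
v_hat = v_i / (1 - beta2 ** 1)              # -> 0.0004 (= grad ** 2)
print(lr_t * m_hat / (v_hat ** 0.5 + eps_adam))  # ≈ 0.01
```

At the first step the bias corrections exactly undo the moment decay, so m_hat / sqrt(v_hat) reduces to grad / |grad| and the update magnitude is about lr_t no matter how large or small the raw gradient is. That normalization is Adam's per-parameter adaptation at work.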
## The Arc of Training
| Step | What the model has learned |
|---|---|
| 1 | Nothing. Random predictions. |
| 50 | Common characters predicted more often |
| 100 | Frequent letter pairs (th, er, an) |
| 200 | Name-like structures (consonant-vowel patterns) |
| 300 | When names should end (predicts BOS) |
| 500 | Reasonable name generation capability |
## Data Cycling
The % (modulo) operator cycles through the dataset: step step trains on docs[step % len(docs)]. With ~32,000 names and only 500 steps, we see about 1.5% of the data; a longer training run would cycle through more of it.
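A toy illustration of the cycling, with a hypothetical three-name dataset standing in for the real docs list:

```python
docs = ["emma", "olivia", "ava"]  # hypothetical stand-in dataset
for step in range(7):
    print(step, docs[step % len(docs)])
# 0 emma, 1 olivia, 2 ava, 3 emma, 4 olivia, 5 ava, 6 emma
```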
## Block Size Truncation
If a name is longer than block_size (8), we truncate. Most names are shorter than 8 characters, so this rarely matters.
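For the rare name that does exceed the limit, here is a sketch of how the cap from step 1 plays out (the name is illustrative, and strings stand in for token ids):

```python
block_size = 8
doc = "alexandria"                      # 10 characters, longer than block_size
tokens = ["BOS"] + list(doc) + ["BOS"]  # 12 tokens -> 11 possible targets
n = min(block_size, len(tokens) - 1)
print(n)  # 8: the last 3 prediction targets are simply dropped
```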
Checkpoint ✓
You now understand the entire training process:
- Sampling a document and tokenizing it
- Forward pass: running the model on each token
- Loss: measuring prediction quality with cross-entropy
- Backward pass: computing all gradients automatically
- Adam optimizer: updating parameters with momentum and adaptation
- Cosine learning rate decay: smoothly shrinking the step size to zero (sketched below)
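To close the loop, here is the cosine schedule from step 4 evaluated at a few steps; the learning_rate value is an assumption for illustration, not necessarily microgpt's setting:

```python
import math

learning_rate, num_steps = 0.01, 500  # learning_rate assumed for illustration
for step in [0, 125, 250, 375, 499]:
    lr_t = learning_rate * 0.5 * (1 + math.cos(math.pi * step / num_steps))
    print(step, f"{lr_t:.5f}")
# 0 0.01000, 125 0.00854, 250 0.00500, 375 0.00146, 499 0.00000
```

The schedule starts at the full learning rate, halves it mid-run, and glides to nearly zero by the final step, so the last updates make only tiny refinements.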