What Is Training?
The Big Idea
You have a model, gpt(), that takes a token and produces predictions for the next one. Right now its 4,064 parameters are random, so its predictions are garbage.
Training is the process of repeatedly showing the model examples and adjusting its parameters so that its predictions improve.
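To make "adjusting parameters" concrete before looking at the real loop, here is a toy sketch that is not taken from microgpt.py. It fits a single parameter `w` so that `w * x` matches `y`, using a hand-written derivative in place of automatic differentiation. The data, the learning rate, and every name in it are assumptions chosen for illustration.

```python
# Toy illustration (not from microgpt.py): learn w so that w * x is close to y.
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hypothetical (x, y) pairs with y = 2x

w = 0.0               # an untrained starting value
learning_rate = 0.05  # assumed step size


def mean_squared_error(weight):
    """Average squared error of the model `weight * x` over all examples."""
    return sum((weight * x - y) ** 2 for x, y in examples) / len(examples)


loss_before = mean_squared_error(w)

for step in range(100):
    x, y = examples[step % len(examples)]  # 1. sample one example
    prediction = w * x                     # 2. forward pass
    loss = (prediction - y) ** 2           # 3. loss for this example
    grad = 2 * (prediction - y) * x        # 4. backward: d(loss)/dw, derived by hand
    w -= learning_rate * grad              # 5. update: step against the gradient

loss_after = mean_squared_error(w)
print(f"w = {w:.4f}, loss before = {loss_before:.4f}, loss after = {loss_after:.8f}")
```

The real model does the same thing with thousands of parameters at once, and it lets `backward()` work out the gradients instead of deriving them by hand.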
Before vs. After Training
The Training Recipe
Each training step follows this exact recipe:
flowchart LR
    S["1. SAMPLE<br>a name"] --> F["2. FORWARD<br>predict each<br>next character"]
    F --> L["3. LOSS<br>measure<br>how wrong"]
    L --> B["4. BACKWARD<br>compute<br>gradients"]
    B --> U["5. UPDATE<br>nudge<br>parameters"]
    U --> |"repeat ×500"| S

The Training Loop (Lines 153–184, Overview)
microgpt.py — Lines 152-184 (simplified)
num_steps = 500
for step in range(num_steps):
    # 1. Sample a document
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # 2. Forward pass + 3. Loss computation
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)

    # 4. Backward pass
    loss.backward()

    # 5. Update parameters (Adam optimizer)
    # ... (covered in a later lesson)
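The two lines `probs = softmax(logits)` and `loss_t = -probs[target_id].log()` carry most of the loss computation, so a standalone sketch may help. It uses plain floats and `math.log` rather than the objects the loop above operates on (which presumably support `.log()` and `.backward()`), and the three logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]  # hypothetical scores for a 3-token vocabulary
probs = softmax(logits)

# Negative log-likelihood: small when the target got high probability,
# large when the model considered the target unlikely.
loss_if_target_0 = -math.log(probs[0])
loss_if_target_2 = -math.log(probs[2])

print([round(p, 3) for p in probs])
print(f"target 0: {loss_if_target_0:.3f}   target 2: {loss_if_target_2:.3f}")

# The per-step loss is the mean over positions, like (1 / n) * sum(losses).
losses = [loss_if_target_0, loss_if_target_2]
mean_loss = (1 / len(losses)) * sum(losses)
```

A loss of 0 would mean the model gave the correct token probability 1, and the loss grows without bound as that probability approaches 0.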
The flow for a name like "emma"
- Wrap it: [BOS, 'e', 'm', 'm', 'a', BOS]
- For each position, predict the next character
- Compute a loss (how wrong were the predictions overall)
- Call backward() to compute all gradients
- Update each parameter in the direction that reduces the loss
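The wrapping and the "predict the next character" pairing described above can be sketched in isolation. This is a minimal illustration under two assumptions: the vocabulary `uchars` is built from this one name only, and `BOS` is given the id right after the last character. The real script's vocabulary and `BOS` value may differ.

```python
doc = "emma"
uchars = sorted(set(doc))  # ['a', 'e', 'm'], assumed vocabulary for this sketch only
BOS = len(uchars)          # assumed: BOS gets the next free token id

# Same wrapping expression as in the training loop.
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]


def show(token_id):
    """Readable name for a token id (hypothetical helper for printing)."""
    return "BOS" if token_id == BOS else uchars[token_id]


# Each position's input is paired with the token that follows it.
pairs = [(show(tokens[i]), show(tokens[i + 1])) for i in range(len(tokens) - 1)]

for given, target in pairs:
    print(f"given {given:>3} -> predict {target}")
```

A four-letter name therefore yields five predictions: one for each letter, plus the final prediction that the name has ended (`BOS`).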
How the Loss Evolves
step 1 / 500 | loss 3.8912 ← random predictions, high loss
step 50 / 500 | loss 2.8401 ← starting to learn common letters
step 100 / 500 | loss 2.4102 ← learning letter pairs
step 200 / 500 | loss 2.0312 ← learning name-like patterns
step 300 / 500 | loss 1.8543 ← getting good
step 500 / 500 | loss 1.5012 ← reasonably trained
A falling loss means the model is assigning higher probability to the correct next character, i.e. it is getting better at predicting.
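One way to read these numbers: because the loss is a mean of negative log-probabilities, `exp(-loss)` is the geometric mean of the probabilities the model gave to the correct characters. The sketch below applies that conversion to the first and last values from the log above. The variable names are illustrative.

```python
import math

# Loss values copied from the log above, keyed by step number.
logged_losses = {1: 3.8912, 500: 1.5012}

# exp(-mean negative log-probability) = geometric mean of the probabilities
# assigned to the correct next characters.
typical_prob = {step: math.exp(-loss) for step, loss in logged_losses.items()}

for step, prob in typical_prob.items():
    print(f"step {step:>3}: loss {logged_losses[step]:.4f} -> typical probability {prob:.3f}")
```

By this reading, the model goes from giving the right character roughly a 2% chance at step 1 to roughly a 22% chance at step 500.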
Why only 500 steps?
This is a tiny model on a simple dataset, so 500 steps is enough for reasonable results. Larger models on larger datasets train for far more steps, often hundreds of thousands or more.
Terminology
| Term | Meaning |
|---|---|
| Training | Iteratively adjusting parameters to minimize loss |
| Training step | One iteration of forward → loss → backward → update |
| Epoch | One pass through the entire dataset |
| num_steps | How many training steps to run |
| Training loop | The for-loop that runs all the training steps |
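Steps and epochs are related through the dataset size. Since the loop picks `docs[step % len(docs)]`, each step consumes one document in order, so one epoch takes `len(docs)` steps. The sketch below works that out for two hypothetical dataset sizes, since the actual size is not stated in this lesson. `describe_schedule` is an illustrative helper, not part of microgpt.py.

```python
def describe_schedule(num_steps, num_docs):
    """Return (full_epochs, leftover_steps) when each step uses one document."""
    return divmod(num_steps, num_docs)


# Hypothetical dataset sizes, chosen only to show both cases.
small = describe_schedule(num_steps=500, num_docs=100)
large = describe_schedule(num_steps=500, num_docs=32000)

print(f"100 docs:   {small[0]} full epochs, {small[1]} leftover steps")
print(f"32000 docs: {large[0]} full epochs, {large[1]} leftover steps")
```

When the dataset holds more documents than `num_steps`, training ends before a single epoch completes and some documents are never seen.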