# The Training Loop

## The Complete Code (Lines 151–184)
microgpt.py — Lines 151-184
```python
# ── SETUP ──
num_steps = 500
for step in range(num_steps):

    # ── 1. SAMPLE ──
    doc = docs[step % len(docs)]  # pick a name (cycle through dataset)
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]  # tokenize
    n = min(block_size, len(tokens) - 1)  # cap at block_size (8)

    # ── 2. FORWARD PASS + LOSS ──
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)  # model prediction
        probs = softmax(logits)  # to probabilities
        loss_t = -probs[target_id].log()  # cross-entropy loss
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)  # average loss over the sequence

    # ── 3. BACKWARD PASS ──
    loss.backward()  # compute all gradients

    # ── 4. PARAMETER UPDATE (Adam) ──
    lr_t = learning_rate * 0.5 * (1 + math.cos(math.pi * step / num_steps))
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0  # reset gradient

    # ── 5. LOG ──
    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}")
```
## Tracing Step 0 with "emma"
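Step 1 turns "emma" into the token sequence the loop trains on. Here is a minimal sketch, assuming uchars is the sorted list of the 26 lowercase letters and BOS is index 26 (both defined earlier in microgpt.py):

```python
uchars = [chr(ord("a") + i) for i in range(26)]  # stand-in for the real vocab
BOS = 26                                         # sequence-boundary token

doc = "emma"
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
print(tokens)                   # [26, 4, 12, 12, 0, 26]
print(min(8, len(tokens) - 1))  # 5 training positions: the five rows below
```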
| Position | Input | Target | Loss |
|---|---|---|---|
| 0 | BOS (26) | 'e' (4) | 3.29 |
| 1 | 'e' (4) | 'm' (12) | 3.33 |
| 2 | 'm' (12) | 'm' (12) | 3.30 |
| 3 | 'm' (12) | 'a' (0) | 3.31 |
| 4 | 'a' (0) | BOS (26) | 3.28 |
| Avg | | | 3.302 |
At step 0, the loss is ~3.3. For a model that guesses uniformly over a 27-token vocabulary, the expected cross-entropy loss is \(-\log(1/27) \approx 3.30\), so the model is exactly at random chance.
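A quick sanity check of that number:

```python
import math

vocab_size = 27
print(-math.log(1 / vocab_size))  # 3.2958..., matching the step-0 loss above
```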
Step 4 then runs for each of the 4,064 parameters: update m[i] and v[i], nudge p.data, and reset p.grad to 0.
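To make one of those updates concrete, here is a single-parameter sketch of the Adam math at step 0, using a made-up gradient; the hyperparameter values are assumptions (the standard Adam defaults), not necessarily microgpt's:

```python
beta1, beta2, eps_adam = 0.9, 0.999, 1e-8  # assumed: standard Adam defaults
lr_t, grad = 0.01, 0.02                    # hypothetical lr and gradient
m_i, v_i = 0.0, 0.0                        # both moments start at zero

m_i = beta1 * m_i + (1 - beta1) * grad      # 0.002
v_i = beta2 * v_i + (1 - beta2) * grad ** 2 # 4e-07
m_hat = m_i / (1 - beta1 ** 1)              # bias-corrected -> 0.02 (= grad)
v_hat = v_i / (1 - beta2 ** 1)              # -> 0.0004 (= grad ** 2)
print(lr_t * m_hat / (v_hat ** 0.5 + eps_adam))  # ≈ 0.01
```

At the first step the bias corrections exactly undo the moment decay, so m_hat / sqrt(v_hat) reduces to grad / |grad| and the update magnitude is about lr_t no matter how large or small the raw gradient is. That normalization is Adam's per-parameter adaptation at work.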
## The Arc of Training
| Step | What the model has learned |
|---|---|
| 1 | Nothing. Random predictions. |
| 50 | Common characters predicted more often |
| 100 | Frequent letter pairs (th, er, an) |
| 200 | Name-like structures (consonant-vowel patterns) |
| 300 | When names should end (predicts BOS) |
| 500 | Reasonable name generation capability |
## Data Cycling
The % (modulo) operator cycles through the dataset: step step trains on docs[step % len(docs)]. With ~32,000 names and only 500 steps, we see about 1.5% of the data; a longer training run would cycle through more of it.
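A toy illustration of the cycling, with a hypothetical three-name dataset standing in for the real docs list:

```python
docs = ["emma", "olivia", "ava"]  # hypothetical stand-in dataset
for step in range(7):
    print(step, docs[step % len(docs)])
# 0 emma, 1 olivia, 2 ava, 3 emma, 4 olivia, 5 ava, 6 emma
```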
## Block Size Truncation
If a name is longer than block_size (8), we truncate. Most names are shorter than 8 characters, so this rarely matters.
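For the rare name that does exceed the limit, here is a sketch of how the cap from step 1 plays out (the name is illustrative, and strings stand in for token ids):

```python
block_size = 8
doc = "alexandria"                      # 10 characters, longer than block_size
tokens = ["BOS"] + list(doc) + ["BOS"]  # 12 tokens -> 11 possible targets
n = min(block_size, len(tokens) - 1)
print(n)  # 8: the last 3 prediction targets are simply dropped
```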
Checkpoint ✓
You now understand the entire training process:
- Sampling a document and tokenizing it
- Forward pass: running the model on each token
- Loss: measuring prediction quality with cross-entropy
- Backward pass: computing all gradients automatically
- Adam optimizer: updating parameters with momentum and adaptation
- Cosine learning rate decay: smoothly shrinking the step size to zero (sketched below)
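To close the loop, here is the cosine schedule from step 4 evaluated at a few steps; the learning_rate value is an assumption for illustration, not necessarily microgpt's setting:

```python
import math

learning_rate, num_steps = 0.01, 500  # learning_rate assumed for illustration
for step in [0, 125, 250, 375, 499]:
    lr_t = learning_rate * 0.5 * (1 + math.cos(math.pi * step / num_steps))
    print(step, f"{lr_t:.5f}")
# 0 0.01000, 125 0.00854, 250 0.00500, 375 0.00146, 499 0.00000
```

The schedule starts at the full learning rate, halves it mid-run, and glides to nearly zero by the final step, so the last updates make only tiny refinements.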