The Loss Function

What Loss Measures

After the model predicts probabilities for the next character, we need a single number that says: "How wrong were you?"

That number is the loss. Lower = better.

Cross-Entropy Loss

The loss function in microgpt.py is cross-entropy loss:

\[\text{loss} = -\log(P(\text{correct token}))\]

It is just the negative natural log of the probability the model assigned to the correct answer.

Why This Works

\[P(\text{correct}) = 0.9 \implies \text{loss} = -\log(0.9) = 0.105\]

Low loss — the model knew the answer. ✅

\[P(\text{correct}) = 0.2 \implies \text{loss} = -\log(0.2) = 1.609\]

Moderate loss — the model wasn't sure.

\[P(\text{correct}) = 0.01 \implies \text{loss} = -\log(0.01) = 4.605\]

High loss — the model was confident about the wrong answer. ❌
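The three cases above can be reproduced with a few lines of Python. This is a minimal sketch using plain floats and the standard `math` module; the helper name `cross_entropy` is chosen here for illustration and is not taken from microgpt.py.

```python
import math

def cross_entropy(p_correct: float) -> float:
    """Negative natural log of the probability assigned to the correct token."""
    return -math.log(p_correct)

# The three probabilities used in the examples above.
for p in (0.9, 0.2, 0.01):
    print(f"P(correct) = {p:<5} loss = {cross_entropy(p):.3f}")
```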

Warning

Cross-entropy heavily punishes confident wrong predictions. Dropping P(correct) from 0.01 to 0.001 adds about 2.30 to the loss, more than the roughly 1.61 added by dropping from 0.5 to 0.1. This pushes the model away from being overconfident about wrong answers.
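The comparison in the warning can be checked numerically. The sketch below assumes nothing beyond the formula loss = -log(P(correct)); the helper `added_loss` is a hypothetical name introduced for this example.

```python
import math

def added_loss(p_before: float, p_after: float) -> float:
    """Extra loss incurred when P(correct) drops from p_before to p_after."""
    return -math.log(p_after) - (-math.log(p_before))

tail_drop = added_loss(0.01, 0.001)  # equals log(10)
mid_drop = added_loss(0.5, 0.1)      # equals log(5)

print(f"0.01 -> 0.001 adds {tail_drop:.3f}")
print(f"0.5  -> 0.1   adds {mid_drop:.3f}")
```

The added loss depends only on the ratio of the two probabilities, so a tenfold drop costs the same amount wherever it starts. Near zero, a tenfold drop corresponds to a tiny absolute change in probability, which is why the curve is so steep there.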

The Loss Curve

\[y = -\log(x)\]
loss
 5 │ *
   │  *
 4 │   *
   │    *
 3 │     *
   │       *
 2 │         *
   │            *
 1 │                *
   │                       *
 0 │──────────────────────────── *
   0    0.2   0.4   0.6   0.8   1.0
              P(correct)

| \(P(\text{correct})\) | Loss | Interpretation |
|---|---|---|
| 1.0 | 0.0 | Perfect prediction |
| 0.5 | 0.693 | 50/50 guess |
| \(1/27 \approx 0.037\) | 3.296 | Random chance (27 tokens) |
| 0.01 | 4.605 | Barely considers the correct answer |

The Code (Lines 166–170)

microgpt.py — Lines 166-170
logits = gpt(token_id, pos_id, keys, values)
probs = softmax(logits)
loss_t = -probs[target_id].log()
losses.append(loss_t)

gpt() returns 27 raw logits (scores).

softmax() converts logits to 27 probabilities summing to 1.

probs[target_id] grabs the probability of the correct next character. .log() computes the natural logarithm. The - sign makes it a positive loss.

Append this position's loss. We'll average all positions at the end.
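The softmax-then-log steps described above can be tried in isolation. This is a simplified sketch, not the code from microgpt.py: it uses plain floats and `math.log` where the original calls `.log()` on its own value objects, the logits and `target_id` are made up, and subtracting the maximum logit before exponentiating is a common stability choice assumed here.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 4-token vocabulary (the text describes 27).
logits = [2.0, 1.0, 0.1, -1.0]
target_id = 0  # assumed index of the correct next token

probs = softmax(logits)
loss_t = -math.log(probs[target_id])  # cross-entropy at this position

print(f"P(correct) = {probs[target_id]:.3f}, loss = {loss_t:.3f}")
```

Raising the logit at `target_id` relative to the others increases `probs[target_id]` and lowers `loss_t`, which is the direction training pushes in.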

Averaging Over the Sequence (Line 171)

microgpt.py — Line 171
loss = (1 / n) * sum(losses)

For a name like "emma" there are 5 prediction positions (the four letters plus the end-of-name token), so we compute the loss at each position and average:

\[\text{loss} = \frac{1}{5}(\text{loss}_0 + \text{loss}_1 + \text{loss}_2 + \text{loss}_3 + \text{loss}_4)\]

Why average?

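Names have different lengths. Without averaging, a longer name would accumulate a higher total loss simply because it has more positions, so losses would not be comparable across names and longer names would produce larger updates.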
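A small illustration of the length effect, under the assumption that every position is equally wrong. The per-position losses are made-up numbers, and `sequence_loss` is a hypothetical helper that mirrors the `(1 / n) * sum(losses)` expression.

```python
def sequence_loss(losses):
    """Mean per-position loss, mirroring loss = (1 / n) * sum(losses)."""
    n = len(losses)
    return (1 / n) * sum(losses)

# Hypothetical per-position losses: each position has a loss of 2.0.
short_name = [2.0] * 3
long_name = [2.0] * 9

# The totals differ only because the lengths differ.
print("sums: ", sum(short_name), sum(long_name))
# The means agree, so the two names are directly comparable.
print("means:", sequence_loss(short_name), sequence_loss(long_name))
```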

What's the Initial Loss?

At step 0, the model assigns roughly equal probability to all 27 tokens:

\[P(\text{correct}) \approx \frac{1}{27} \implies \text{loss} \approx -\log\left(\frac{1}{27}\right) \approx 3.30\]

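If your model's first loss is near 3.3, the model is starting from the expected uniform baseline, which is a good sign that initialization and the loss computation are wired up correctly. If it is much higher, something is likely wrong.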
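The baseline can be computed directly and turned into a quick sanity check. This is a sketch only: `looks_sane` and its tolerance of 0.5 are arbitrary choices made for this example, not part of microgpt.py.

```python
import math

vocab_size = 27  # token count stated in the text

# A model with no preference gives every token the same probability.
uniform_p = 1 / vocab_size
expected_initial_loss = -math.log(uniform_p)
print(f"expected initial loss: {expected_initial_loss:.3f}")

def looks_sane(first_loss: float, tolerance: float = 0.5) -> bool:
    """Hypothetical check: is the first loss close to the uniform baseline?"""
    return abs(first_loss - expected_initial_loss) <= tolerance
```

For example, `looks_sane(3.3)` passes while `looks_sane(6.0)` does not; the second case would suggest the model starts out strongly favoring wrong tokens.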

Terminology

| Term | Meaning |
|---|---|
| Loss | A single number measuring prediction error (lower = better) |
| Cross-entropy | \(-\log(P(\text{correct}))\), the standard loss for classification |
| Logits | Raw scores before softmax |
| Target | The correct next token |
| Average loss | Mean loss over all positions in a sequence |