Softmax
The Problem
A linear layer outputs a list of raw numbers called logits. These can be any values: positive, negative, huge, or tiny.
But we need probabilities — numbers between 0 and 1 that sum to 1. For example: "There's a 40% chance the next character is 'a', 30% chance it's 'e', etc."
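The Softmax Formula
For logits \(z_1, \dots, z_n\), the probability assigned to position \(i\) is:
\[
\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}
\]
In plain English: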
- Apply \(e^x\) (exponential) to each logit → makes everything positive
- Divide each by the total → makes everything sum to 1
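The two steps above can be sketched with plain Python floats. This is a minimal illustration rather than the code from microgpt.py: the name `softmax_plain` and the sample logits are made up for the example, and it leaves out the stability shift covered in the code walkthrough.

```python
import math

def softmax_plain(logits):
    """Softmax over a list of plain floats (no stability shift)."""
    exps = [math.exp(z) for z in logits]  # step 1: every value becomes positive
    total = sum(exps)                     # the normalizing constant
    return [e / total for e in exps]      # step 2: the values now sum to 1

# Made-up logits, including a negative one, to show that outputs stay positive.
probs = softmax_plain([-1.0, 0.0, 3.0])
print(probs)
print(sum(probs))
```

Even the negative logit ends up with a small positive probability, because \(e^x\) is positive for every real \(x\).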
Step by Step
| Logit | \(e^z\) | Probability |
|---|---|---|
| 2.0 | 7.389 | \(7.389 / 11.212 = 0.659\) (65.9%) |
| 1.0 | 2.718 | \(2.718 / 11.212 = 0.242\) (24.2%) |
| 0.1 | 1.105 | \(1.105 / 11.212 = 0.099\) (9.9%) |
| Total | 11.212 | 1.000 |
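The largest logit (2.0) gets the largest probability (65.9%). The exponential function amplifies differences: a logit gap of 1.0 (2.0 vs. 1.0) becomes a probability ratio of \(e \approx 2.718\) (65.9% vs. 24.2%).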
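The table can be reproduced in a few lines of Python. This sketch uses plain floats and the same three logits. Note that the table's total of 11.212 adds the already-rounded exponentials, so the unrounded total printed here comes out as 11.213.

```python
import math

logits = [2.0, 1.0, 0.1]
exps = [math.exp(z) for z in logits]   # the e^z column
total = sum(exps)                      # unrounded sum of the e^z column
probs = [e / total for e in exps]      # the Probability column

for z, e, p in zip(logits, exps, probs):
    print(f"{z:>5} | {e:.3f} | {p:.3f}")
print(f"Total | {total:.3f} | {sum(probs):.3f}")

# A logit gap of 1.0 turns into a probability ratio of e.
ratio = probs[0] / probs[1]
print(f"ratio of top two probabilities: {ratio:.3f}")
```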
The Code (Lines 97–101)
def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]
Wait — what's max_val doing there?
Line 98 finds the maximum logit, and line 99 subtracts it from every logit before exponentiating. This is the numerical stability trick.
Without it:
- \(e^{1000}\) is too large for a 64-bit float (overflow!)
- \(e^{-1000}\) rounds to exactly 0 (underflow!)
By subtracting the max, the largest value becomes 0, and \(e^0 = 1\). No overflow.
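The math is unchanged, because subtracting a constant \(c\) from every logit multiplies the numerator and the denominator by the same factor \(e^{-c}\), which cancels:
\[
\frac{e^{z_i - c}}{\sum_{j} e^{z_j - c}} = \frac{e^{-c}\, e^{z_i}}{e^{-c} \sum_{j} e^{z_j}} = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
\]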
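Both claims, that the shift avoids overflow and that it leaves the result unchanged, can be checked with plain floats. In this sketch the names `softmax_naive` and `softmax_stable` and the test logits are illustrative assumptions. It relies on the fact that Python's `math.exp` raises `OverflowError` for arguments as large as 1000.

```python
import math

def softmax_naive(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    max_val = max(logits)                           # largest logit
    exps = [math.exp(z - max_val) for z in logits]  # largest exponent is now 0
    total = sum(exps)
    return [e / total for e in exps]

# On small logits the two versions agree.
small = [2.0, 1.0, 0.1]
same = all(math.isclose(a, b)
           for a, b in zip(softmax_naive(small), softmax_stable(small)))

# On large logits only the shifted version survives.
big = [1000.0, 999.0, 998.1]
try:
    softmax_naive(big)
    naive_failed = False
except OverflowError:
    naive_failed = True  # math.exp(1000.0) is outside the float range

stable_big = softmax_stable(big)
print(same, naive_failed, stable_big)
```

The logits in `big` are those in `small` plus 1000, so `stable_big` matches the probabilities for `small` up to floating-point rounding.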
Properties of Softmax
| Property | Why it matters |
|---|---|
| All outputs are positive | Probabilities can't be negative |
| Outputs sum to 1 | They represent a valid probability distribution |
| Preserves ordering | Largest logit → largest probability |
| Differentiable | We can compute gradients through it |
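The four properties can be spot-checked numerically. The sketch below uses a plain-float softmax and random logits, both chosen just for illustration. For differentiability it compares a finite-difference slope against the standard derivative \(\partial p_0 / \partial z_0 = p_0 (1 - p_0)\), which is not stated in the table and is brought in here as a known result.

```python
import math
import random

def softmax(logits):
    max_val = max(logits)
    exps = [math.exp(z - max_val) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
logits = [random.uniform(-5.0, 5.0) for _ in range(6)]
probs = softmax(logits)

# Properties 1 and 2: positive outputs that sum to 1.
all_positive = all(p > 0 for p in probs)
sums_to_one = math.isclose(sum(probs), 1.0)

# Property 3: ranking positions by logit or by probability gives the same order.
order_by_logit = sorted(range(len(logits)), key=lambda i: logits[i])
order_by_prob = sorted(range(len(probs)), key=lambda i: probs[i])
same_order = order_by_logit == order_by_prob

# Property 4: nudge one logit and compare the measured slope with p_0 * (1 - p_0).
eps = 1e-6
bumped = logits[:]
bumped[0] += eps
numeric = (softmax(bumped)[0] - probs[0]) / eps
analytic = probs[0] * (1 - probs[0])

print(all_positive, sums_to_one, same_order, numeric, analytic)
```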
Where Softmax Is Used
In microgpt.py, softmax appears in two places:
- Attention weights (line 130): Converting attention scores into probabilities
  "How much attention should I pay to each previous token?"
- Output prediction (line 166): Converting final logits into character probabilities
  "What's the probability of each possible next character?"
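A rough sketch of both uses, with plain floats instead of the objects microgpt.py works with. Everything specific here is hypothetical: the scores, the one-number "values", the four-character vocabulary, and the sampling step, which is one common way to pick a character and is assumed rather than taken from the source.

```python
import math
import random

def softmax(logits):
    """Plain-float stand-in for the softmax shown earlier."""
    max_val = max(logits)
    exps = [math.exp(z - max_val) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Use 1: attention. Hypothetical scores for three previous tokens.
scores = [0.5, 2.0, -1.0]
weights = softmax(scores)        # how much to attend to each token
values = [10.0, 20.0, 30.0]      # hypothetical one-number value per token
attended = sum(w * v for w, v in zip(weights, values))  # weighted average

# Use 2: output prediction. Hypothetical vocabulary and final logits.
vocab = ["a", "e", "t", "z"]
char_probs = softmax([1.2, 0.9, 0.3, -2.0])
most_likely = vocab[char_probs.index(max(char_probs))]

# Assumed sampling step: draw the next character according to its probability.
random.seed(42)
sampled = random.choices(vocab, weights=char_probs, k=1)[0]

print(weights, attended, most_likely, sampled)
```

In both cases softmax plays the same role: it turns arbitrary scores into weights that are positive and sum to 1, so they can be used as an average (attention) or as a distribution to choose from (prediction).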
Terminology
| Term | Meaning |
|---|---|
| Logits | Raw, unnormalized scores from a linear layer |
| Softmax | Function that converts logits to probabilities |
| Probability distribution | List of non-negative numbers that sum to 1 |
| Numerical stability | Avoiding overflow/underflow by shifting values |
| \(e^x\) | The exponential function (\(\approx 2.718^x\)) |