Temperature and Sampling¶
The Problem¶
When generating text, the model outputs probabilities for the next token. But how "creative" should the model be?
- Too predictable: Always picking the most likely token → boring, repetitive
- Too random: Picking tokens nearly uniformly → nonsensical
The temperature parameter controls this tradeoff.
What Temperature Does¶
Before applying the softmax, each logit z_i is divided by the temperature T, so the probability of token i becomes p_i = exp(z_i / T) / Σ_j exp(z_j / T). Dividing by a small T exaggerates differences between logits; dividing by a large T shrinks them.
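In code, this could look like the following minimal sketch (the function name and example logits are illustrative, not taken from microgpt.py):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by T, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exp() to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures make the returned distribution more peaked around the largest logit; the probabilities always sum to 1.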
The Pattern¶
| Temperature | Effect | Result |
|---|---|---|
| → 0 | Probabilities become one-hot | Always picks the most likely token |
| = 1 | Standard probabilities | Balanced |
| → ∞ | Probabilities become uniform | Pure random |
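The pattern in the table can be seen numerically. A small demonstration with made-up logits (none of these values come from microgpt.py):

```python
import math

def softmax(logits, t):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / t) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up logits for three tokens
for t in (0.1, 1.0, 10.0):
    # t = 0.1 → nearly one-hot; t = 1.0 → standard; t = 10.0 → nearly uniform
    print(t, [round(p, 3) for p in softmax(logits, t)])
```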
Why temperature = 0.5 in microgpt.py?
A temperature of 0.5 makes the model fairly confident — it strongly favors high-probability characters. This produces more "realistic" names. Higher temperatures produce more creative but potentially nonsensical combinations.
The Sampling Step¶
token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
random.choices(population, weights=...) returns a list of selections (here, just one) where each item's chance of being picked is proportional to its weight; the trailing [0] extracts the single sampled token id.
Even with temperature = 0.5, there's still randomness. The most likely character is usually picked, but not always. This is what makes each generation unique.
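This can be checked by sampling repeatedly from a fixed distribution. A small sketch with a made-up 3-token vocabulary (the probabilities and seed are illustrative):

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the demo is reproducible
probs = [0.7, 0.2, 0.1]  # made-up probabilities over a 3-token vocab
samples = random.choices(range(3), weights=probs, k=1000)
counts = Counter(samples)
print(counts)  # token 0 dominates, but tokens 1 and 2 still appear
```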
Alternative Sampling Methods¶
| Method | How it works | Tradeoff |
|---|---|---|
| Greedy | Always pick highest probability | Deterministic, repetitive |
| Random | Pick based on probabilities (what we do) | Varied, sometimes odd |
| Top-k | Only consider the top k most likely tokens | Less randomness |
| Nucleus (top-p) | Consider tokens until cumulative prob reaches p | Adaptive k |
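Neither top-k nor nucleus sampling appears in microgpt.py, but both are short to sketch. A minimal, illustrative implementation over a plain probability list (function names and parameters are my own):

```python
import random

def top_k_sample(probs, k):
    """Keep only the k highest-probability tokens, then sample among them."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    return random.choices(top, weights=weights)[0]

def top_p_sample(probs, p):
    """Nucleus sampling: keep the smallest set of top tokens whose
    cumulative probability reaches p, then sample among them."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    weights = [probs[i] for i in kept]
    return random.choices(kept, weights=weights)[0]
```

Note that top-p adapts: for a peaked distribution it may keep only one or two tokens, while for a flat one it keeps many, which is why the table calls it "adaptive k".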
microgpt.py uses simple random sampling with temperature — the most straightforward approach.
Why is it called 'temperature'?
From statistical mechanics in physics:
- High temperature → particles move randomly (high entropy)
- Low temperature → particles settle into ordered states (low entropy)
Same idea: high temperature = more randomness, low temperature = more order.
Terminology
| Term | Meaning |
|---|---|
| Temperature | A scalar that controls randomness in generation |
| Sharpening | Low temperature makes the distribution more peaked |
| Flattening | High temperature makes the distribution more uniform |
| Greedy decoding | Always choosing the most likely token |
| Entropy | A measure of randomness/uncertainty in a distribution |