Normalization (RMSNorm)¶
The Problem¶
As data flows through layers of linear transformations, the numbers can drift — becoming very large or very small. This causes two problems:
- Exploding values: Numbers get so large that \(e^x\) overflows → training crashes
- Vanishing values: Numbers get so small that they're effectively zero → model stops learning
We need a way to keep the numbers "well-behaved."
RMS Normalization¶
microgpt.py uses RMSNorm (Root Mean Square Normalization), a simplified version of the more common LayerNorm.
The Formula¶
\[
\text{rmsnorm}(\mathbf{x})_i = \frac{x_i}{\text{RMS}(\mathbf{x})}
\qquad\text{where}\qquad
\text{RMS}(\mathbf{x}) = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^2 + \epsilon}
\]
In words: divide each element by the "average magnitude" of all elements. Here \(n\) is the number of elements and \(\epsilon\) is a tiny constant for numerical safety.
Step by Step¶
Example: \(\mathbf{x} = [3.0, 4.0, 0.0]\)
| Step | Computation | Result |
|---|---|---|
| 1. Square each element | \([3.0^2, 4.0^2, 0.0^2]\) | \([9.0, 16.0, 0.0]\) |
| 2. Mean of squares | \((9.0 + 16.0 + 0.0) / 3\) | \(8.333\) |
| 3. Square root (RMS) | \(\sqrt{8.333}\) | \(2.887\) |
| 4. Divide each by RMS | \([3.0/2.887, 4.0/2.887, 0.0/2.887]\) | \([1.039, 1.386, 0.0]\) |
The values now have a consistent scale regardless of the original magnitudes.
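As a quick check, here is the same arithmetic in plain Python; the vector and the rounding come from the example above, not from microgpt.py:

```python
x = [3.0, 4.0, 0.0]

squares = [xi * xi for xi in x]      # step 1: [9.0, 16.0, 0.0]
ms = sum(squares) / len(squares)     # step 2: mean of squares = 8.333...
rms = ms ** 0.5                      # step 3: root mean square = 2.887...
normalized = [xi / rms for xi in x]  # step 4: divide each element by the RMS

print(rms)         # ~2.887
print(normalized)  # ~[1.039, 1.386, 0.0]
```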
The Code (Lines 103–106)¶
```python
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
```
The first line, `ms = sum(xi * xi for xi in x) / len(x)`, computes \(x_i^2\) for each element, sums them, and divides by the count. This is the "mean square" (the MS in RMS).

The next line, `scale = (ms + 1e-5) ** -0.5`, computes \(\frac{1}{\sqrt{\text{ms} + \epsilon}}\):

- `** -0.5` means "1 divided by the square root"
- `1e-5` (\(= 0.00001\)) is a tiny epsilon (\(\epsilon\)) that prevents division by zero
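To see why the epsilon matters, try an all-zero input using the function from the listing above. Without the epsilon, the scale would be \(0^{-0.5}\), which raises an error in Python:

```python
print(rmsnorm([0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0], no crash thanks to the epsilon
# Without the epsilon, 0.0 ** -0.5 would raise ZeroDivisionError.
```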
RMSNorm vs LayerNorm¶
| | RMSNorm | LayerNorm |
|---|---|---|
| Formula | \(x / \text{RMS}(x)\) | \((x - \mu) / \sigma\) |
| Centers at zero? | No | Yes (subtracts mean) |
| Used in | LLaMA, microgpt.py | GPT-2, BERT |
| Advantage | Simpler, less computation | Slightly more stable |
RMSNorm skips the mean subtraction. Research showed it works nearly as well with less computation.
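For comparison, here is a minimal LayerNorm written in the same style as the rmsnorm function above. This is an illustrative sketch, not code from microgpt.py, and it omits the learnable gain and bias that full LayerNorm implementations usually include:

```python
def layernorm(x):
    mean = sum(x) / len(x)                            # mu: the value everything is centered on
    var = sum((xi - mean) ** 2 for xi in x) / len(x)  # sigma^2: mean squared deviation from mu
    scale = (var + 1e-5) ** -0.5                      # 1 / sqrt(var + epsilon)
    return [(xi - mean) * scale for xi in x]          # (x - mu) / sigma
```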
Where RMSNorm Is Used¶
```python
# Line 112 — after combining embeddings
x = rmsnorm(x)

# Line 117 — before attention
x = rmsnorm(x)

# Line 137 — before MLP
x = rmsnorm(x)
```
Pre-normalization
RMSNorm is applied before each major block (attention and MLP). This is called pre-normalization — it stabilizes the input to each block.
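As a rough sketch of what that ordering means in code, here is the pre-norm pattern with a placeholder block standing in for attention or the MLP. The names and the residual wiring below are illustrative, not lifted from microgpt.py:

```python
def rmsnorm(x):
    # same rmsnorm as defined earlier on this page
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

def block(x):
    return x  # placeholder for the real attention or MLP computation

x = [3.0, 4.0, 0.0]

# Pre-normalization: the block sees a normalized input, and its output
# is added back to the residual stream.
x = [xi + yi for xi, yi in zip(x, block(rmsnorm(x)))]

# Post-normalization (the older convention) would instead normalize
# after the residual addition:
# x = rmsnorm([xi + yi for xi, yi in zip(x, block(x))])
```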
Terminology
| Term | Meaning |
|---|---|
| Normalization | Scaling values to have consistent magnitude |
| RMSNorm | Dividing by the root mean square of the values |
| LayerNorm | Subtract mean, divide by standard deviation |
| Epsilon (\(\epsilon\)) | A tiny number (1e-5) to prevent division by zero |
| Pre-normalization | Normalizing before (not after) each block |