Math Refresher¶
Everything you need to know for this course — and nothing more.
1. Exponents¶
What: Repeated multiplication.
Rules:
| Rule | Example |
|---|---|
| \(a^m \times a^n = a^{m+n}\) | \(2^3 \times 2^2 = 2^5 = 32\) |
| \(a^{-n} = 1 / a^n\) | \(2^{-3} = 1/8\) |
| \(a^{1/2} = \sqrt{a}\) | \(9^{1/2} = 3\) |
| \(a^{-1/2} = 1/\sqrt{a}\) | \(4^{-0.5} = 1/2\) |
| \(a^0 = 1\) | \(5^0 = 1\) |
Where it appears in `microgpt.py`

- `(ms + 1e-5) ** -0.5` — computing \(1/\sqrt{\text{mean square}}\) in RMSNorm
- `self.data ** other` — the power operation in the `Value` class
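The rules in the table can be checked directly in Python; the `ms` and `eps` names below mirror the RMSNorm pattern (illustrative values, not taken from the course code):

```python
# Each exponent rule, verified numerically.
assert 2**3 * 2**2 == 2**5 == 32       # a^m * a^n = a^(m+n)
assert 2**-3 == 1 / 2**3 == 0.125      # a^-n = 1 / a^n
assert 9**0.5 == 3.0                   # a^(1/2) = sqrt(a)
assert 4**-0.5 == 0.5                  # a^(-1/2) = 1 / sqrt(a)
assert 5**0 == 1                       # a^0 = 1

# The RMSNorm pattern: (ms + eps) ** -0.5 is 1 / sqrt(ms + eps).
ms, eps = 4.0, 1e-5
scale = (ms + eps) ** -0.5
```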
2. The Exponential Function (\(e^x\))¶
The number \(e \approx 2.718\) raised to the power \(x\).
Key properties:
- Always positive: \(e^x > 0\) for all \(x\)
- Grows very fast for large \(x\)
- Approaches zero for large negative \(x\)
- Its derivative is itself: \(\frac{d}{dx}e^x = e^x\)
Where it appears
- `math.exp(self.data)` — used in softmax to make all values positive.
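A minimal softmax sketch showing why `math.exp` is used: it maps every value to a positive number, so dividing by the total yields a valid probability distribution. Subtracting the max is a common numerical-stability step, not necessarily the exact form in `microgpt.py`:

```python
import math

def softmax(logits):
    # Subtract the max so math.exp never overflows; this does not change
    # the result, since the shift cancels in the division below.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # all positive, since e^x > 0
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```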
3. The Logarithm (\(\ln\) or \(\log\))¶
The inverse of the exponential. \(\ln(x)\) answers: "what power do I raise \(e\) to, to get \(x\)?"
Key properties:
- Only defined for positive numbers
- \(\ln(1) = 0\)
- \(\ln(x) < 0\) when \(0 < x < 1\)
- Derivative: \(\frac{d}{dx}\ln(x) = 1/x\)
Where it appears
- `-probs[target_id].log()` — the cross-entropy loss function.
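Cross-entropy in plain Python (a sketch; the `probs` values and `target_id` here are illustrative):

```python
import math

probs = [0.1, 0.7, 0.2]   # probability distribution over 3 tokens
target_id = 1             # index of the correct next token

# Cross-entropy loss: -ln(probability of the correct token).
# ln(p) < 0 when 0 < p < 1, so negating it gives a positive loss;
# the closer p is to 1, the closer the loss is to 0.
loss = -math.log(probs[target_id])
```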
4. Summation (\(\Sigma\))¶
Shorthand for "add up a bunch of things": \(\sum_{i=0}^{2} x_i = x_0 + x_1 + x_2\).
In Python: `sum(x[i] for i in range(3))`
5. Derivatives (Basics)¶
The derivative \(\frac{dy}{dx}\) tells you: "if \(x\) changes by a tiny bit, how much does \(y\) change?"
| Function | Derivative | In English |
|---|---|---|
| \(y = c\) | \(\frac{dy}{dx} = 0\) | Constants don't change |
| \(y = x\) | \(\frac{dy}{dx} = 1\) | 1-to-1 relationship |
| \(y = cx\) | \(\frac{dy}{dx} = c\) | Scales the change |
| \(y = x^2\) | \(\frac{dy}{dx} = 2x\) | |
| \(y = x^n\) | \(\frac{dy}{dx} = nx^{n-1}\) | Power rule |
| \(y = e^x\) | \(\frac{dy}{dx} = e^x\) | Its own derivative |
| \(y = \ln(x)\) | \(\frac{dy}{dx} = 1/x\) | |
| \(y = \max(0,x)\) | \(\frac{dy}{dx} = \begin{cases}1 & x>0 \\ 0 & x \leq 0\end{cases}\) | Step function |
The Chain Rule: if \(y = f(g(x))\), then \(\frac{dy}{dx} = f'(g(x)) \cdot g'(x)\).
"Multiply the derivatives along the chain."
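The chain rule can be sanity-checked numerically. Here the composite \(y = e^{x^2}\) is an illustrative example: the derivative of the outer \(e^u\) (which is itself) is multiplied by the derivative of the inner \(x^2\) (which is \(2x\)):

```python
import math

def f(x):
    return math.exp(x ** 2)   # a chain: square x, then exponentiate

x = 0.7
# Chain rule: dy/dx = e^(x^2) * 2x
analytic = math.exp(x ** 2) * 2 * x

# Finite-difference estimate of the same derivative for comparison.
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
```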
6. Vectors (Lists of Numbers)¶
A vector is a list of numbers, e.g. \([1, 2, 3]\).
| Operation | Example | Result |
|---|---|---|
| Addition | \([1, 2] + [3, 4]\) | \([4, 6]\) |
| Scalar multiplication | \(2 \times [3, 4]\) | \([6, 8]\) |
| Dot product | \([1, 2] \cdot [3, 4]\) | \(1 \times 3 + 2 \times 4 = 11\) |
Every embedding, every layer input/output is a vector.
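The table's three operations as a pure-Python sketch (plain lists, in the spirit of `microgpt.py`, though these helper names are not from the course code):

```python
def vec_add(a, b):
    # Elementwise addition of two vectors of equal length.
    return [x + y for x, y in zip(a, b)]

def scalar_mul(c, a):
    # Multiply every element by the scalar c.
    return [c * x for x in a]

def dot(a, b):
    # Dot product: multiply elementwise, then sum.
    return sum(x * y for x, y in zip(a, b))

assert vec_add([1, 2], [3, 4]) == [4, 6]
assert scalar_mul(2, [3, 4]) == [6, 8]
assert dot([1, 2], [3, 4]) == 11
```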
7. Matrices (Grids of Numbers)¶
A matrix is a 2D grid of numbers, arranged in rows and columns.
Matrix-vector multiplication — the `linear()` function in `microgpt.py`:
Each row's dot product with the input gives one output element.
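The rule above can be sketched as a tiny `linear()` in pure Python (the exact signature and argument order in `microgpt.py` may differ):

```python
def linear(x, w):
    # w is a matrix stored as a list of rows; each row's dot product
    # with the input x gives one element of the output vector.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

w = [[1, 2],
     [3, 4]]          # a 2x2 matrix
x = [5, 6]            # the input vector
y = linear(x, w)      # [1*5 + 2*6, 3*5 + 4*6] = [17, 39]
```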
8. Probability¶
A probability distribution assigns a number between 0 and 1 to each outcome, with all probabilities summing to 1:
Random sampling: Choosing an outcome where each option's chance equals its probability.
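Random sampling can be sketched with the standard library's `random.choices`, which draws an index with weight proportional to its probability (the distribution here is illustrative):

```python
import random

probs = [0.1, 0.7, 0.2]   # valid distribution: each in [0, 1], sums to 1
random.seed(0)            # seeded so the sketch is reproducible

# One draw: each index's chance equals its probability.
sample = random.choices(range(len(probs)), weights=probs)[0]

# Over many draws, empirical frequencies approach the probabilities.
counts = [0, 0, 0]
for _ in range(10_000):
    counts[random.choices(range(3), weights=probs)[0]] += 1
freqs = [c / 10_000 for c in counts]
```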
9. Square Root (\(\sqrt{\phantom{x}}\))¶
In code: `x ** 0.5` or `math.sqrt(x)`
Where it appears
- `head_dim ** 0.5` — scaling in attention
- `v_hat ** 0.5` — in the Adam optimizer
10. Mean (Average)¶
In code: `sum(x) / len(x)`
Where it appears
- `sum(xi * xi for xi in x) / len(x)` — the mean of squares in RMSNorm.
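The mean of squares ties sections 1, 9, and 10 together; a minimal RMSNorm sketch (without the learned gain, and assuming the same `1e-5` epsilon used earlier):

```python
def rmsnorm(x, eps=1e-5):
    # Mean of squares, then scale each element by 1 / sqrt(ms + eps),
    # so the output has a mean square of (approximately) 1.
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

out = rmsnorm([3.0, 4.0])   # ms = (9 + 16) / 2 = 12.5
```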
That's all the math. Every formula in microgpt.py uses only these concepts.