Math Refresher

Everything you need to know for this course — and nothing more.

1. Exponents

What: Repeated multiplication.

\[2^3 = 2 \times 2 \times 2 = 8, \quad 5^2 = 5 \times 5 = 25\]

Rules:

Rule	Example
\(a^m \times a^n = a^{m+n}\)	\(2^3 \times 2^2 = 2^5 = 32\)
\(a^{-n} = 1 / a^n\)	\(2^{-3} = 1/8\)
\(a^{1/2} = \sqrt{a}\)	\(9^{1/2} = 3\)
\(a^{-1/2} = 1/\sqrt{a}\)	\(4^{-1/2} = 1/2\)
\(a^0 = 1\)	\(5^0 = 1\)
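
The rules in the table can be checked directly with Python's `**` operator. This is a small sketch that reuses the table's own example values; the variable names are made up for illustration.

```python
# One line per rule from the table, using the same example numbers.
product_rule = 2**3 * 2**2   # a^m * a^n = a^(m+n)
negative_exp = 2**-3         # a^(-n) = 1 / a^n
half_exp = 9**0.5            # a^(1/2) = sqrt(a)
neg_half_exp = 4**-0.5       # a^(-1/2) = 1 / sqrt(a)
zero_exp = 5**0              # a^0 = 1

print(product_rule, negative_exp, half_exp, neg_half_exp, zero_exp)
```

Note that a negative or fractional exponent makes Python return a float even when the base is an integer.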

Where it appears in microgpt.py

(ms + 1e-5) ** -0.5 — computing \(1/\sqrt{\text{mean square} + 10^{-5}}\) in RMSNorm; the small constant guards against division by zero
self.data**other — the power operation in the Value class

2. The Exponential Function (\(e^x\))

The number \(e \approx 2.718\) raised to the power \(x\):

\[e^0 = 1, \quad e^1 \approx 2.718, \quad e^2 \approx 7.389, \quad e^{-1} \approx 0.368\]

Key properties:

Always positive: \(e^x > 0\) for all \(x\)
Grows very fast for large \(x\)
Approaches zero for large negative \(x\)
Its derivative is itself: \(\frac{d}{dx}e^x = e^x\)

Where it appears

math.exp(self.data) — used in softmax to make all values positive.
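
To see why the exponential is useful here, the following is a minimal softmax sketch on plain floats. It is not copied from microgpt.py, which applies `exp` to `Value` objects; the function name and the max-subtraction step (a common trick to avoid overflow) are assumptions made for this example.

```python
import math

def softmax(logits):
    """Turn arbitrary real numbers into positive values that sum to 1."""
    max_logit = max(logits)  # shifting by the max keeps exp() from overflowing
    exps = [math.exp(v - max_logit) for v in logits]  # every entry is > 0
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])
print(probs)
```

Because \(e^x > 0\) for every \(x\), even a negative input such as -1.0 ends up with a small positive probability.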

3. The Logarithm (\(\ln\) or \(\log\))

The inverse of the exponential. \(\ln(x)\) answers: "what power do I raise \(e\) to, to get \(x\)?"

\[\ln(1) = 0, \quad \ln(e) = 1, \quad \ln(7.389) \approx 2\]

Key properties:

Only defined for positive numbers
\(\ln(1) = 0\)
\(\ln(x) < 0\) when \(0 < x < 1\)
Derivative: \(\frac{d}{dx}\ln(x) = 1/x\)

Where it appears

-probs[target_id].log() — the cross-entropy loss function.
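
A sketch of that loss on plain floats, assuming `probs` is a list of probabilities and `target_id` is the index of the correct outcome. The function name `cross_entropy` and the sample numbers are illustrative.

```python
import math

def cross_entropy(probs, target_id):
    """Negative log of the probability assigned to the correct outcome."""
    return -math.log(probs[target_id])

probs = [0.3, 0.5, 0.2]
loss_likely = cross_entropy(probs, 1)    # correct outcome had probability 0.5
loss_unlikely = cross_entropy(probs, 2)  # correct outcome had probability 0.2
print(loss_likely, loss_unlikely)
```

Since \(\ln(x) < 0\) for \(0 < x < 1\), the minus sign makes the loss positive, and the loss grows as the probability of the correct outcome shrinks.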

4. Summation (\(\Sigma\))

Shorthand for "add up a bunch of things":

\[\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3\]

In Python, where indices start at 0 (so \(x_1, x_2, x_3\) are x[0], x[1], x[2]): sum(x[i] for i in range(3))

5. Derivatives (Basics)

The derivative \(\frac{dy}{dx}\) tells you: "if \(x\) changes by a tiny bit, how much does \(y\) change?"

Function	Derivative	In English
\(y = c\)	\(\frac{dy}{dx} = 0\)	Constants don't change
\(y = x\)	\(\frac{dy}{dx} = 1\)	1-to-1 relationship
\(y = cx\)	\(\frac{dy}{dx} = c\)	Scales the change
\(y = x^2\)	\(\frac{dy}{dx} = 2x\)	Power rule with \(n = 2\)
\(y = x^n\)	\(\frac{dy}{dx} = nx^{n-1}\)	Power rule
\(y = e^x\)	\(\frac{dy}{dx} = e^x\)	Its own derivative
\(y = \ln(x)\)	\(\frac{dy}{dx} = 1/x\)	Slope shrinks as \(x\) grows
\(y = \max(0,x)\)	\(\frac{dy}{dx} = \begin{cases}1 & x>0 \\ 0 & x \leq 0\end{cases}\)	Step function (the value 0 at \(x = 0\) is a convention; the slope is undefined there)
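
One way to build trust in the table is to compare each formula with a numerical estimate of the slope. The sketch below uses a central difference; the helper `numeric_derivative`, the step size `h`, and the test point are choices made for this example.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Estimate f'(x) from two nearby function values."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 3.0
# Each entry pairs (numerical estimate, formula from the table).
checks = {
    "x^2": (numeric_derivative(lambda t: t**2, x), 2 * x),
    "x^3": (numeric_derivative(lambda t: t**3, x), 3 * x**2),
    "e^x": (numeric_derivative(math.exp, x), math.exp(x)),
    "ln(x)": (numeric_derivative(math.log, x), 1 / x),
    "max(0, x), x > 0": (numeric_derivative(lambda t: max(0.0, t), x), 1.0),
    "max(0, x), x < 0": (numeric_derivative(lambda t: max(0.0, t), -x), 0.0),
}

for name, (estimate, formula) in checks.items():
    print(f"{name}: estimate={estimate:.6f} formula={formula:.6f}")
```

The estimates agree with the formulas up to small floating-point error.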

The Chain Rule:

\[\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)\]
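
A worked instance of the rule, using a function chosen purely for illustration: take \(y = (3x + 1)^2\), so the inner function is \(g(x) = 3x + 1\) and the outer function is \(f(u) = u^2\).

\[\frac{dy}{dx} = f'(g(x)) \cdot g'(x) = 2(3x + 1) \cdot 3 = 6(3x + 1)\]

At \(x = 1\) this gives 24, which matches differentiating the expanded form \(9x^2 + 6x + 1\) directly: \(18x + 6 = 24\).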

"Multiply the derivatives along the chain."

6. Vectors (Lists of Numbers)

A vector is a list of numbers:

v = [3.0, -1.0, 2.5]    # a 3-dimensional vector

Operation	Example	Result
Addition	\([1, 2] + [3, 4]\)	\([4, 6]\)
Scalar multiplication	\(2 \times [3, 4]\)	\([6, 8]\)
Dot product	\([1, 2] \cdot [3, 4]\)	\(1 \times 3 + 2 \times 4 = 11\)

Every embedding and every layer input or output is a vector.
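
The three operations in the table can be written in a few lines over plain Python lists. The helper names `add`, `scale`, and `dot` are illustrative and are not taken from microgpt.py.

```python
def add(u, v):
    """Element-wise sum of two vectors of the same length."""
    return [ui + vi for ui, vi in zip(u, v)]

def scale(c, v):
    """Multiply every element of v by the scalar c."""
    return [c * vi for vi in v]

def dot(u, v):
    """Multiply matching elements, then add the products up."""
    return sum(ui * vi for ui, vi in zip(u, v))

print(add([1, 2], [3, 4]))    # addition row of the table
print(scale(2, [3, 4]))       # scalar multiplication row
print(dot([1, 2], [3, 4]))    # dot product row
```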

7. Matrices (Grids of Numbers)

A matrix is a 2D grid:

M = [[1, 2, 3],
     [4, 5, 6]]    # a 2×3 matrix

Matrix-vector multiplication — the linear() function in microgpt.py:

\[\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 14 \\ 32 \end{bmatrix}\]

Each row's dot product with the input gives one output element.
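
A minimal sketch of such a function on plain numbers, reproducing the example above. The argument order `(x, w)` and the use of nested lists are assumptions for this illustration; the version in microgpt.py may differ in detail.

```python
def linear(x, w):
    """Matrix-vector product: one dot product per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

M = [[1, 2, 3],
     [4, 5, 6]]    # a 2x3 matrix
y = linear([1, 2, 3], M)
print(y)
```

A matrix with 2 rows and 3 columns maps a 3-element input to a 2-element output.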

8. Probability

A probability distribution assigns a number between 0 and 1 to each outcome, with all probabilities summing to 1:

\[P(\text{a}) = 0.3, \quad P(\text{b}) = 0.5, \quad P(\text{c}) = 0.2 \quad \implies \quad \text{Sum} = 1.0\]

Random sampling: Drawing an outcome at random so that, over many draws, each outcome appears with a frequency close to its probability.
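
Sampling can be tried out with the standard library's `random.choices`, here using the distribution from the example above. The seed and the number of draws are arbitrary choices for this sketch.

```python
import random

outcomes = ["a", "b", "c"]
probs = [0.3, 0.5, 0.2]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = rng.choices(outcomes, weights=probs, k=10_000)

# Fraction of draws that landed on each outcome.
frequencies = {o: draws.count(o) / len(draws) for o in outcomes}
print(frequencies)
```

With this many draws, the observed frequencies should sit close to 0.3, 0.5, and 0.2, though never exactly on them.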

9. Square Root (\(\sqrt{\phantom{x}}\))

\[\sqrt{4} = 2, \quad \sqrt{9} = 3, \quad \sqrt{2} \approx 1.414\]

In code: x ** 0.5 or math.sqrt(x)

Where it appears

head_dim**0.5 — scaling in attention
v_hat ** 0.5 — in Adam optimizer
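
As a rough illustration of the attention case, the sketch below divides a dot product by the square root of the vector length. The vectors `q` and `k` and the value of `head_dim` are invented for this example and do not come from microgpt.py.

```python
import math

head_dim = 4
q = [1.0, 0.0, 2.0, -1.0]   # hypothetical query vector
k = [0.5, 1.0, 1.0, 2.0]    # hypothetical key vector

raw_score = sum(qi * ki for qi, ki in zip(q, k))  # plain dot product
scaled_score = raw_score / head_dim**0.5          # divide by sqrt(head_dim)
print(raw_score, scaled_score)
```

Both spellings of the square root give the same number here: `head_dim**0.5` and `math.sqrt(head_dim)` are each 2.0.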

10. Mean (Average)

\[\text{mean}([2, 4, 6]) = \frac{2 + 4 + 6}{3} = 4\]

In code: sum(x) / len(x)

Where it appears

sum(xi * xi for xi in x) / len(x) — mean of squares in RMSNorm.
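
Putting the mean of squares together with the `** -0.5` power from section 1 gives a small RMSNorm sketch. It works on plain floats, and the function name and `eps` parameter are assumptions for this example; the version in microgpt.py may be structured differently.

```python
def rmsnorm(x, eps=1e-5):
    """Rescale x so that the mean of its squared elements is close to 1."""
    ms = sum(xi * xi for xi in x) / len(x)   # mean of squares
    scale = (ms + eps) ** -0.5               # 1 / sqrt(ms + eps)
    return [xi * scale for xi in x]

normalized = rmsnorm([2.0, 4.0, 6.0])
mean_square_after = sum(v * v for v in normalized) / len(normalized)
print(normalized, mean_square_after)
```

The result keeps the ratios between elements (1 : 2 : 3 here) and changes only the overall size of the vector.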

That's all the math. Every formula in microgpt.py uses only these concepts.