The Adam Optimizer¶
The Problem with Plain Gradient Descent¶
Gradient descent uses the same learning rate for every parameter and has no memory of past gradients. Adam fixes both.
What Adam Adds¶
Adam (Adaptive Moment Estimation) maintains two extra values for each parameter:
- m (first moment): A running average of past gradients → momentum
- v (second moment): A running average of past squared gradients → adaptive learning rate
Setup (Lines 147–149)¶
```python
learning_rate, beta1, beta2, eps_adam = 1e-2, 0.9, 0.95, 1e-8
m = [0.0] * len(params)  # first moment buffer (momentum)
v = [0.0] * len(params)  # second moment buffer (adaptive rates)
```
| Value | What it is | Typical range |
|---|---|---|
| `learning_rate = 0.01` | Base step size | 1e-4 to 1e-2 |
| `beta1 = 0.9` | How much past momentum to keep | 0.9 – 0.99 |
| `beta2 = 0.95` | How much past variance to keep | 0.95 – 0.999 |
| `eps_adam = 1e-8` | Tiny constant to prevent division by zero | 1e-8 |
The Update (Lines 175–182)¶
```python
lr_t = learning_rate * 0.5 * (1 + math.cos(math.pi * step / num_steps))
for i, p in enumerate(params):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad          # update momentum
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2     # update variance
    m_hat = m[i] / (1 - beta1 ** (step + 1))            # bias correction
    v_hat = v[i] / (1 - beta2 ** (step + 1))            # bias correction
    p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)  # update parameter
    p.grad = 0                                          # reset gradient
```
The first line is a cosine learning-rate schedule:

- Starts at `learning_rate` (0.01)
- Smoothly decreases to 0 by the end of training
- Large steps early for fast learning, tiny steps late for fine-tuning
The formula: \(\cos(\pi \cdot t/T)\) goes from \(\cos(0)=1\) to \(\cos(\pi)=-1\). So \(\frac{1+\cos(...)}{2}\) goes from 1 → 0.
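A quick way to see the schedule in action, using the same expression as the snippet above (`num_steps = 1000` here is purely illustrative):

```python
import math

learning_rate, num_steps = 1e-2, 1000  # num_steps chosen for illustration

def lr_at(step):
    # same cosine schedule as in the training loop
    return learning_rate * 0.5 * (1 + math.cos(math.pi * step / num_steps))

print(lr_at(0))               # full learning rate at the start (0.01)
print(lr_at(num_steps // 2))  # exactly half at the midpoint
print(lr_at(num_steps))       # decayed all the way to 0 at the end
```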
The first moment \(m\) is an exponential moving average of the gradient: with \(\beta_1 = 0.9\), it keeps 90% of the previous momentum and adds 10% of the current gradient.
Individual gradients are noisy (each is computed from a single training example — here, a single name). Averaging smooths out the noise.
It's like a rolling ball — instead of changing direction every instant, it has inertia.
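The smoothing effect is easy to see on a made-up gradient sequence (the values below are illustrative, not from the training run):

```python
beta1 = 0.9
noisy_grads = [1.0, -0.2, 1.1, -0.1, 0.9, 0.0, 1.2]  # flips sign, but positive on average

m = 0.0
for g in noisy_grads:
    m = beta1 * m + (1 - beta1) * g  # keep 90% of history, add 10% of the new gradient

# m settles near the average direction instead of flipping sign with every gradient
print(m)
```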
The second moment \(v\) tracks how "noisy" each parameter's gradient is:
- Consistently large gradients → large \(v\) → smaller effective step
- Small gradients → small \(v\) → larger effective step
This is the adaptive part — each parameter gets its own effective learning rate.
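A hypothetical two-parameter illustration of that adaptivity: feed one parameter gradients 100× larger than the other, and the division by \(\sqrt{v}\) normalizes them so both end up taking nearly the same size step (the gradient histories here are invented for the demo):

```python
beta2, eps_adam, lr = 0.95, 1e-8, 1e-2

def effective_step(grads):
    # run the second-moment EMA over a gradient history,
    # then compute the Adam-style scaled step for the last gradient
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2
    return lr * grads[-1] / (v ** 0.5 + eps_adam)

big  = effective_step([10.0] * 50)  # consistently large gradients -> large v
tiny = effective_step([0.1] * 50)   # consistently small gradients -> small v

# despite a 100x difference in gradient magnitude, the actual steps nearly match
print(big, tiny)
```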
Since \(m\) and \(v\) start at 0, they're biased toward zero early on. Correction inflates them:
| Step | Correction factor for \(m\) | Effect |
|---|---|---|
| 0 | \(1/(1-0.9^1) = 10\times\) | Large correction |
| 10 | \(1/(1-0.9^{11}) \approx 1.5\times\) | Moderate |
| 100 | \(1/(1-0.9^{101}) \approx 1.0\times\) | Almost none |
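The factors in the table come straight from the \(1/(1-\beta_1^{t+1})\) term in the code:

```python
beta1 = 0.9

def m_correction(step):
    # m_hat = m / (1 - beta1**(step + 1)), so the inflation factor is the reciprocal
    return 1 / (1 - beta1 ** (step + 1))

for step in (0, 10, 100):
    print(step, round(m_correction(step), 2))
```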
In the parameter update, the ratio \(\hat{m}/\sqrt{\hat{v}}\) combines both ideas:

- \(\hat{m}\) = smoothed gradient (direction + momentum)
- \(\sqrt{\hat{v}}\) = gradient volatility (adaptive scaling)
- High variance → smaller steps (cautious)
- Low variance → larger steps (confident)
The Formula in One Line¶
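In symbols, the update implemented above (with \(t\) the current step) is:

\[
\theta \;\leftarrow\; \theta \;-\; \mathrm{lr}_t \cdot \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon},
\qquad
\hat{m} = \frac{m}{1-\beta_1^{\,t+1}},
\qquad
\hat{v} = \frac{v}{1-\beta_2^{\,t+1}}
\]

Momentum picks the direction, \(\sqrt{\hat{v}}\) rescales the step per parameter, and \(\epsilon\) keeps the division safe.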
Terminology¶
| Term | Meaning |
|---|---|
| Adam | Adaptive Moment Estimation — a popular optimizer |
| Momentum | Running average of past gradients (smooths noise) |
| Adaptive learning rate | Different effective step sizes per parameter |
| Exponential moving average | \(\text{new} = \beta \cdot \text{old} + (1-\beta) \cdot \text{current}\) |
| Bias correction | Compensating for zero-initialization in early steps |
| Cosine decay | Learning rate schedule that smoothly decreases to zero |
| Epsilon (\(\epsilon\)) | Tiny constant to prevent division by zero |
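To see all the pieces working together, here is the same loop applied to a toy problem — minimizing \(f(x, y) = x^2 + 100y^2\) with hand-computed gradients. The function, starting point, and `num_steps` are illustrative choices; the optimizer code itself is unchanged from above:

```python
import math

learning_rate, beta1, beta2, eps_adam = 1e-2, 0.9, 0.95, 1e-8
num_steps = 2000  # illustrative

params = [3.0, -2.0]             # [x, y], far from the minimum at (0, 0)
m = [0.0] * len(params)          # first moment buffer
v = [0.0] * len(params)          # second moment buffer

for step in range(num_steps):
    x, y = params
    grads = [2 * x, 200 * y]     # d/dx and d/dy of x^2 + 100*y^2

    lr_t = learning_rate * 0.5 * (1 + math.cos(math.pi * step / num_steps))
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        params[i] -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)

# both coordinates are driven toward 0 despite the 100x difference in curvature
print(params)
```

Note how the steep `y` direction (gradients 100× larger) does not need a 100× smaller learning rate — the division by \(\sqrt{\hat{v}}\) handles the scale difference automatically.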