Gradient Descent¶
The Simplest Optimizer¶
Now that we have gradients, updating parameters is conceptually simple:

\[
\theta \leftarrow \theta - \eta \cdot g
\]
Or in plain code:
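A minimal sketch of the update loop, assuming each parameter object exposes `.data` (its current value) and `.grad` (the gradient from backprop) — the attribute names here are illustrative, not quoted from microgpt.py:

```python
# Minimal gradient descent step over a list of parameters.
class Param:
    def __init__(self, data):
        self.data = data   # current value
        self.grad = 0.0    # filled in by backprop

def sgd_step(params, lr):
    for p in params:
        p.data -= lr * p.grad  # move against the gradient

# One step on a single parameter:
w = Param(2.0)
w.grad = 0.5           # pretend backprop produced this
sgd_step([w], lr=0.1)
print(w.data)          # 2.0 - 0.1 * 0.5 = 1.95
```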
That's gradient descent — literally "descending" along the gradient (slope) of the loss surface.
Why Subtract?¶
The gradient points in the direction of steepest increase of the loss. We want the loss to decrease. So we move in the opposite direction.
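A one-variable example makes the sign concrete. Take the toy loss \(L(\theta) = \theta^2\), whose gradient is \(2\theta\):

```python
# Toy loss L(theta) = theta**2, gradient dL/dtheta = 2*theta.
theta = 3.0
grad = 2 * theta          # = 6.0, positive: loss increases to the right
theta -= 0.1 * grad       # so subtracting steps left, toward the minimum at 0
print(theta)              # 3.0 - 0.1 * 6.0 = 2.4
print(theta ** 2)         # new loss 5.76, down from 9.0
```

Adding the gradient instead would move to \(\theta = 3.6\) and *increase* the loss — hence the minus sign.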
The Learning Rate¶
The gradient tells us the direction to move, but not how far. The learning rate (\(\eta\)) controls the step size:
- **Too large**: steps are so big you overshoot the minimum and the loss bounces around.
- **Too small**: steps are so tiny that training takes forever.
- **Just right**: steady progress toward lower loss.
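You can watch all three regimes on the same toy loss \(L(\theta) = \theta^2\). The three learning rates below are illustrative values, not ones used by microgpt.py:

```python
# Minimize L(theta) = theta**2 starting from theta = 3.0.
def run(lr, steps=20):
    theta = 3.0
    for _ in range(steps):
        theta -= lr * 2 * theta   # gradient of theta**2 is 2*theta
    return theta

print(run(1.1))    # too large: theta overshoots and grows -> diverges
print(run(0.001))  # too small: barely moved after 20 steps
print(run(0.3))    # just right: essentially at the minimum (0)
```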
Why Not Just Use Gradient Descent?¶
Simple gradient descent (in practice run on random minibatches, hence SGD, Stochastic Gradient Descent) works but has problems:
| Problem | Why it matters |
|---|---|
| One learning rate for all parameters | Some parameters might need small steps, others large ones |
| No momentum | Each step only uses the current gradient. If the gradient is noisy, the path zigzags |
| Hard to tune | Training is very sensitive to the learning-rate choice |
Info
This is why microgpt.py uses Adam — a smarter optimizer that fixes all three problems.
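For intuition, here is a compressed sketch of the standard Adam update for a single parameter. The hyperparameter names and defaults (`beta1`, `beta2`, `eps`) are the usual published ones; microgpt.py's actual implementation may differ in detail:

```python
import math

# One Adam step; m and v are running moments carried between
# steps (both start at 0), t is the 1-based step counter.
def adam_step(theta, grad, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # momentum: smoothed gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # per-parameter scale estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for the
    v_hat = v / (1 - beta2 ** t)              # zero-initialized moments
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=4.0, m=m, v=v, t=1)
print(theta)  # first step has size ~lr regardless of gradient magnitude
```

The division by \(\sqrt{\hat{v}}\) gives each parameter its own effective step size, and `m` smooths noisy gradients — addressing the per-parameter and momentum problems from the table above.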
Terminology
| Term | Meaning |
|---|---|
| Gradient descent | Update: \(\theta \leftarrow \theta - \eta \cdot g\) |
| SGD | Stochastic Gradient Descent — gradient descent on a random subset of data |
| Learning rate (\(\eta\)) | How big each step is |
| Convergence | When the loss stops decreasing |
| Overshoot | When the learning rate is so large that updates make things worse |