Backpropagation
The One-Line Magic
One line. It computes the gradient of the loss with respect to all 4,064 parameters, simultaneously.
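For concreteness, the call in question might look like this, assuming `loss` is the final `Value` produced by the forward pass in the Module 2 engine:

```python
loss.backward()  # computes d(loss)/d(p) for every Value that fed into loss
```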
What Just Happened?
If you went through Module 2, you know exactly what this does:
- Topological sort — Order all `Value` nodes so children come before parents
- Seed — Set `loss.grad = 1`
- Propagate — Walk the graph in reverse, applying the chain rule at each node

The result: every `Value` in `params` now has its `.grad` field set.
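As a refresher, here is a minimal sketch of those three steps. It assumes a micrograd-style `Value` class where each node stores its inputs in `_prev` and its local chain-rule step in `_backward` (the names may differ in your Module 2 code):

```python
def backward(loss):
    # 1. Topological sort: order nodes so children come before parents.
    topo, visited = [], set()
    def build(v):
        if v not in visited:
            visited.add(v)
            for child in v._prev:
                build(child)
            topo.append(v)
    build(loss)

    # 2. Seed: the gradient of the loss with respect to itself is 1.
    loss.grad = 1.0

    # 3. Propagate: walk the graph in reverse, applying the chain rule at each node.
    for v in reversed(topo):
        v._backward()
```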
The Scale of This Computation
For a single training step on a 5-character name:
- Forward pass: ~24,000 `Value` nodes created (5 positions × ~4,800 operations each)
- Backward pass: ~24,000 nodes visited
- Each node: 1-2 multiplications + additions
- Total: ~50,000 arithmetic operations
All of this to fill in 4,064 .grad values.
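If you want to check the ~24,000 figure yourself, a small graph walk will do it. This is a sketch that assumes each `Value` exposes its inputs as `_prev`, as in the usual micrograd-style engine:

```python
def count_nodes(root):
    # Count every distinct Value reachable from the given node.
    seen = set()
    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for child in v._prev:
            visit(child)
    visit(root)
    return len(seen)

# count_nodes(loss) should come out around 24,000 for a 5-character name
```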
Before and After
Each gradient means: "if this parameter increased by a tiny amount, this is how much the loss would change."
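Illustratively (the numbers here are made up), inspecting a single parameter before and after the call might look like this:

```python
p = params[0]      # any one weight
print(p.grad)      # 0.0   -- before backward(), nothing has been computed yet
loss.backward()
print(p.grad)      # e.g. 0.0183 -- after: d(loss)/d(p) for this weight
```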
Why This Is Remarkable
Consider what it would take to compute all 4,064 gradients on every training step:
| Method | Passes per step | Steps | Total passes |
|---|---|---|---|
| Finite differences | 4,065 forward | 500 | ~2,000,000 |
| Autograd | 1 forward + 1 backward | 500 | 1,000 |
Tip
Autograd is ~2000× more efficient than the brute-force approach. This is why automatic differentiation was a breakthrough.
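To see where the 4,065 comes from, here is a sketch of the finite-difference approach. It assumes `loss_fn()` reruns the forward pass and returns the loss as a plain float, and that each parameter is a `Value` with a `.data` field:

```python
def finite_difference_grads(loss_fn, params, eps=1e-5):
    base = loss_fn()                  # 1 baseline forward pass
    grads = []
    for p in params:                  # one extra forward pass per parameter...
        old = p.data
        p.data = old + eps            # nudge this parameter up by eps
        grads.append((loss_fn() - base) / eps)
        p.data = old                  # restore it
    return grads                      # ...so 4,064 + 1 = 4,065 passes per step
```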
The Gradient Reset (Line 182)
Critical
After using each gradient for the parameter update, it's reset to zero. Why? Because backward() uses += to accumulate gradients. Without resetting, gradients from previous steps would leak into the current step, corrupting the computation.
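Put together, one training step follows the usual update-then-reset pattern. A sketch, assuming `params` is the list of 4,064 `Value` parameters and `lr` is a learning rate you've chosen:

```python
loss.backward()               # fill in every p.grad
for p in params:
    p.data -= lr * p.grad     # gradient descent: step against the gradient
    p.grad = 0.0              # reset so the next backward() starts from zero
```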
Terminology
| Term | Meaning |
|---|---|
| Backpropagation | Running backward() on the loss to compute all gradients |
| Gradient | \(\partial\text{loss}/\partial\text{parameter}\) — how much the loss changes per unit parameter change |
| Gradient reset | Setting all gradients to zero before the next step |