# The Backward Pass

## The Problem
After the forward pass, we have:
- A final output: the loss (a single number measuring how wrong the model was)
- A computation graph: every operation that led to the loss, recorded as `Value` nodes
Now we need to answer: for every parameter in the model, how much did it contribute to the loss? That is, we need \(\frac{d(\text{loss})}{d(\text{parameter})}\) for every parameter.
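Throughout this section, the code assumes each `Value` node remembers the nodes it was computed from (`_children`) and the derivative of its output with respect to each of them (`_local_grads`). Here is a minimal sketch of such a node; the operator implementations are illustrative assumptions, not the source's verbatim code:

```python
class Value:
    """A scalar that records how it was computed, for autodiff."""

    def __init__(self, data, _children=(), _local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = _children        # the Values this node was computed from
        self._local_grads = _local_grads  # d(self)/d(child), one per child
        # The backward() method shown in the listing below completes this class.

    def __add__(self, other):
        # d(a+b)/da = 1 and d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b and d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        # d(relu(x))/dx = 1 if x > 0, else 0
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))
```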
## The Backward Pass Algorithm (Lines 59–72)
```python
def backward(self):
    # Step 1: topologically sort the graph (children before parents)
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(self)

    # Step 2: seed the gradient, since d(loss)/d(loss) = 1
    self.grad = 1

    # Step 3: propagate gradients from the loss back to the leaves
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad
```
This is only 14 lines of code, but it's doing something profound. Let's decompose it.
## Step 1: Topological Sort (Lines 60–68)
```python
topo = []
visited = set()
def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._children:
            build_topo(child)
        topo.append(v)
build_topo(self)
```
**What is a topological sort?**

A topological sort is an ordering of nodes such that every node comes after all of its children.

The algorithm uses depth-first search: visit all of a node's children before appending the node itself. The `visited` set prevents visiting the same node twice (since a node can be used in multiple places).
### Why do we need this?
Because the backward pass processes nodes from the output back to the inputs. By reversing the topological order (`reversed(topo)`), we guarantee that every node's gradient is fully computed before we try to propagate it to its children.
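To see the ordering on a real graph, here is a standalone copy of `build_topo` (assuming the hypothetical `Value` sketch from earlier), applied to the example traced in the walkthrough below:

```python
def topo_order(root):
    # Same DFS as build_topo, pulled out so the result can be inspected.
    topo, visited = [], set()
    def build(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build(child)
            topo.append(v)
    build(root)
    return topo

a, b, d = Value(2.0), Value(-3.0), Value(10.0)
f = ((a * b) + d).relu()

order = topo_order(f)
# order is [a, b, a*b, d, a*b+d, f]: children always precede parents,
# so reversed(order) starts at f and ends at the leaf nodes a, b, d.
```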
## Step 2: Seed the Gradient (Line 69)
`self` is the loss node — the root of the graph. We set its gradient to 1 because the derivative of anything with respect to itself is always 1: \(\frac{d(\text{loss})}{d(\text{loss})} = 1\). This is our starting point.
## Step 3: Propagate Gradients Backward (Lines 70–72)
```python
for v in reversed(topo):
    for child, local_grad in zip(v._children, v._local_grads):
        child.grad += local_grad * v.grad
```
For each node \(v\) (starting from the loss, going backwards), we push \(v\)'s gradient down to each of its children. This is the chain rule in action: the gradient of a child = (local gradient) × (parent's gradient). We use `+=` because a node might have multiple parents.
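In symbols, the line `child.grad += local_grad * v.grad` applies the chain rule link by link:

\[
\frac{d(\text{loss})}{d(\text{child})} \mathrel{{+}{=}} \frac{d(v)}{d(\text{child})} \times \frac{d(\text{loss})}{d(v)}
\]

where the first factor on the right is the stored `local_grad` and the second is `v.grad`, which is already final thanks to the reversed topological order.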
## Full Walkthrough
Let's trace the backward pass for \(f = \text{relu}(a \times b + d)\) with \(a = 2.0\), \(b = -3.0\), \(d = 10.0\), so that \(c = a \times b = -6.0\) and \(e = c + d = 4.0\):

- \(f = \text{relu}(e)\): since the input \(e = 4.0\) is positive, the local gradient is \(1.0\)
- \(e = c + d\): the local gradients are both \(1.0\)
- \(c = a \times b\): the local gradient w.r.t. \(a\) is \(b = -3.0\), and w.r.t. \(b\) is \(a = 2.0\)
Result:
| Node | `.grad` | Meaning |
|---|---|---|
| \(a\) | \(-3.0\) | "If \(a\) increases by 1, \(f\) decreases by 3" |
| \(b\) | \(2.0\) | "If \(b\) increases by 1, \(f\) increases by 2" |
| \(c\) | \(1.0\) | |
| \(d\) | \(1.0\) | |
| \(e\) | \(1.0\) | |
| \(f\) | \(1.0\) | seed |
**Verification**
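One way to check these numbers is a finite-difference estimate: nudge an input by a small \(h\) and measure how \(f\) moves. A sketch using the hypothetical `Value` class from earlier:

```python
def f_of(a_val):
    # Rebuild f = relu(a*b + d) from scratch for a given value of a.
    a, b, d = Value(a_val), Value(-3.0), Value(10.0)
    return ((a * b) + d).relu().data

h = 1e-6
numeric_grad = (f_of(2.0 + h) - f_of(2.0)) / h
print(numeric_grad)  # ≈ -3.0, matching a.grad from the backward pass
```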
The "+=" is Crucial¶
**Warning**

Notice `child.grad +=` (not `=`). This is because a value might be used in multiple places:
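```python
a = Value(3.0)
b = a + a      # the same node a appears twice among b's children
b.backward()   # using backward() from the listing above
print(a.grad)  # 2.0
```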
Here \(a\) is a child of \(b\) through two paths. The gradient contributions from both paths must be summed:
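\[
\frac{db}{da} = \underbrace{1}_{\text{first path}} + \underbrace{1}_{\text{second path}} = 2
\]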
And indeed, if \(a = 3\) then \(b = 6\), and if \(a = 4\) then \(b = 8\): the rate of change is \(2\), exactly what `a.grad` reports.
## What This Means for Training

After calling `loss.backward()`:

- Every parameter (`Value` in the model) has its `.grad` set
- This gradient tells us: "nudge this parameter in the opposite direction of `.grad` to reduce the loss"
- The optimizer then uses these gradients to update all parameters
**Important**

All of this happens automatically (a sketch of a full training step follows below). The programmer only needs to:

- Build the forward pass (which records the graph)
- Call `.backward()` on the loss
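For completeness, here is a hedged sketch of how an optimizer might use these gradients; `compute_loss`, `model`, `batch`, and `model.parameters()` are hypothetical stand-ins, not names from the source:

```python
learning_rate = 0.01

for step in range(100):
    # Forward pass: computing the loss records the graph as a side effect.
    loss = compute_loss(model, batch)   # hypothetical helpers

    # Reset gradients: += accumulates, so stale grads must be cleared.
    for p in model.parameters():
        p.grad = 0.0

    # Backward pass: fills p.grad for every parameter.
    loss.backward()

    # Gradient descent: step each parameter opposite its gradient.
    for p in model.parameters():
        p.data -= learning_rate * p.grad
```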
**Terminology**
| Term | Meaning |
|---|---|
| Backward pass | Walking the graph in reverse to compute gradients |
| Topological sort | Ordering nodes so children come before parents |
| Gradient accumulation | Summing gradient contributions from multiple paths (`+=`) |
| Seed gradient | Setting the loss's gradient to 1 to start the backward pass |
| Backpropagation | Another name for the backward pass |