Building a Computation Graph¶
Putting It All Together¶
We've seen the individual pieces:
- Value wraps numbers and records operations
- The forward pass builds the graph
- The backward pass walks it in reverse to compute gradients
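As a quick refresher, here is a minimal sketch of that Value class, just enough to make the examples in this lesson runnable. It is a simplification (the real class from Lesson 2 supports more operations), but the idea is identical: each Value stores its data, its parent nodes, and the local gradient for each parent.

```python
class Value:
    """A scalar that records how it was computed (minimal sketch of Lesson 2's class)."""

    def __init__(self, data, _children=(), _local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = _children        # the Values this one was computed from
        self._local_grads = _local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # local gradients: d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, k):
        # power rule: d(a**k)/da = k * a**(k-1)
        return Value(self.data ** k, (self,), (k * self.data ** (k - 1),))

    def backward(self):
        # topological sort: children land before parents, so the reversed list
        # reaches each node only after every node that consumes it
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0  # seed the output's gradient
        for node in reversed(topo):  # walk the graph in reverse
            for child, local in zip(node._children, node._local_grads):
                child.grad += local * node.grad  # chain rule
```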
Now let's see how a realistic example looks — one that mirrors what actually happens inside microgpt.py.
A Mini Neural Network¶
Let's build the simplest possible "network" — one that takes a number and tries to predict another number:
```python
# Two parameters (the model's "knowledge")
w = Value(0.5)  # weight
b = Value(0.1)  # bias

# Input and target
x = 2.0       # input (plain number, not a Value)
target = 3.0  # we want the model to output 3.0

# Forward pass: the prediction
prediction = w * x + b       # 0.5 × 2.0 + 0.1 = 1.1

# Loss: how wrong are we?
error = prediction - target  # 1.1 - 3.0 = -1.9
loss = error ** 2            # (-1.9)² = 3.61
```
The graph that gets built:¶
```mermaid
flowchart LR
    W["w (0.5)"] --> MUL["× → wx (1.0)"]
    X["x (2.0)"] --> MUL
    MUL --> ADD["+ → pred (1.1)"]
    B["b (0.1)"] --> ADD
    ADD --> SUB["- → err (-1.9)"]
    T["target (3.0)"] --> SUB
    SUB --> POW["² → loss (3.61)"]
```

Now backward:¶
| Node | Gradient | Computation |
|---|---|---|
| loss | \(1.0\) | seed |
| err | \(-3.8\) | \(2 \times (-1.9) \times 1.0\) (power rule) |
| pred | \(-3.8\) | \(1 \times \text{err.grad}\) |
| b | \(-3.8\) | \(1 \times \text{pred.grad}\) (addition) |
| wx | \(-3.8\) | \(1 \times \text{pred.grad}\) (addition) |
| w | \(-7.6\) | \(x \times \text{wx.grad} = 2.0 \times (-3.8)\) |
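Using the sketch from the top of this lesson, you can reproduce this table directly; the printed numbers are the same ones computed by hand above.

```python
loss.backward()

print(error.grad)  # -3.8
print(w.grad)      # -7.6
print(b.grad)      # -3.8
```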
This makes sense! Our prediction was 1.1 but the target was 3.0, so we were too low. Both gradients are negative, which means increasing \(w\) and \(b\) raises the prediction and lowers the loss; gradient descent therefore subtracts the gradient:
```python
learning_rate = 0.01
w.data -= learning_rate * w.grad  # 0.5 - 0.01 × (-7.6) = 0.576
b.data -= learning_rate * b.grad  # 0.1 - 0.01 × (-3.8) = 0.138
```
New prediction: \(0.576 \times 2.0 + 0.138 = 1.29\) (closer to 3.0!)
Repeat this hundreds of times, and \(w\) and \(b\) will converge to values that make the prediction close to 3.0.
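Here is a sketch of what that loop could look like, reusing the Value class from the top of the lesson (the step count of 300 is an arbitrary choice for illustration):

```python
w, b = Value(0.5), Value(0.1)
for step in range(300):
    # forward pass: rebuild the graph from scratch each step
    prediction = w * x + b
    loss = (prediction - target) ** 2

    # backward pass: compute fresh gradients
    w.grad, b.grad = 0.0, 0.0  # backward() accumulates with +=, so reset first
    loss.backward()

    # gradient descent update
    w.data -= learning_rate * w.grad
    b.data -= learning_rate * b.grad

print(w.data * x + b.data)  # ≈ 3.0 after a few hundred steps
```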
Scale: What the Real Graph Looks Like¶
In the mini example above, the graph had ~8 nodes. In microgpt.py, a single forward pass through the gpt() function creates thousands of nodes:
| Component | Approximate # of Value operations |
|---|---|
| Embedding lookup | ~16 |
| RMSNorm | ~50 |
| One attention head | ~200 |
| Four attention heads | ~800 |
| MLP (expand + activate + compress) | ~1,500 |
| Output linear | ~400 |
| Softmax | ~80 |
| Loss (log) | ~2 |
| Total per token | ~3,000+ |
| Per 8-token sequence | ~24,000+ |
The Power of Autograd
All of these nodes are Value objects sitting in memory, linked together. When you call loss.backward(), the algorithm visits each node exactly once (thanks to topological sort) and computes the gradient. One pass, all gradients, every parameter.
- Without autograd: ~20,000 forward passes to estimate gradients for 10,000 parameters (nudge each parameter and re-run the whole model, roughly two evaluations per parameter)
- With autograd: exactly 2 passes, one forward and one backward
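You can see the "one visit per node" structure by counting the nodes reachable from a graph's output. The helper below is hypothetical (it is not part of microgpt.py) and assumes the `_children` attribute from the Value sketch above:

```python
def count_nodes(root):
    """Count the distinct Value nodes in the graph below root (hypothetical helper)."""
    seen = set()
    def walk(v):
        if v not in seen:
            seen.add(v)
            for child in v._children:
                walk(child)
    walk(root)
    return len(seen)

print(count_nodes(loss))  # 8 for the mini example: w, x, wx, b, pred, target, err, loss
```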
Checkpoint ✓¶
What you understand now
- Derivatives: "which direction to nudge" (Lesson 0)
- Chain rule: composing derivatives through a chain (Lesson 1)
- Value class: recording operations and storing local gradients (Lesson 2)
- Forward pass: computing the output and building the graph (Lesson 3)
- Backward pass: walking the graph in reverse to compute all gradients (Lesson 4)
- The full picture: how these pieces work together (this lesson)
What we don't know yet: what does the gpt() function actually compute? What is attention? What is a linear layer? For that, we need to understand the architecture.