Building a Computation Graph¶
Putting It All Together¶
We've seen the individual pieces:
- Value wraps numbers and records operations
- The forward pass builds the graph
- The backward pass walks it in reverse to compute gradients
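As a quick refresher, here is a minimal sketch of that Value class, just enough to make the examples in this lesson runnable. It is a simplification (the real class from Lesson 2 supports more operations), but the idea is identical: each Value stores its data, its parent nodes, and the local gradient for each parent.

```python
class Value:
    """A scalar that records how it was computed (minimal sketch of Lesson 2's class)."""

    def __init__(self, data, _children=(), _local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = _children        # the Values this one was computed from
        self._local_grads = _local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # local gradients: d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, k):
        # power rule: d(a**k)/da = k * a**(k-1)
        return Value(self.data ** k, (self,), (k * self.data ** (k - 1),))

    def backward(self):
        # topological sort: children land before parents, so the reversed list
        # reaches each node only after every node that consumes it
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0  # seed the output's gradient
        for node in reversed(topo):  # walk the graph in reverse
            for child, local in zip(node._children, node._local_grads):
                child.grad += local * node.grad  # chain rule
```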
Now let's see how a realistic example looks — one that mirrors what actually happens inside microgpt.py.
A Mini Neural Network¶
Let's build the simplest possible "network" — one that takes a number and tries to predict another number:
```python
# Two parameters (the model's "knowledge")
w = Value(0.5)  # weight
b = Value(0.1)  # bias

# Input and target
x = 2.0       # input (plain number, not a Value)
target = 3.0  # we want the model to output 3.0

# Forward pass: the prediction
prediction = w * x + b       # 0.5 × 2.0 + 0.1 = 1.1

# Loss: how wrong are we?
error = prediction - target  # 1.1 - 3.0 = -1.9
loss = error ** 2            # (-1.9)² = 3.61
```
The graph that gets built:¶
```mermaid
flowchart LR
    W["w (0.5)"] --> MUL["× → wx (1.0)"]
    X["x (2.0)"] --> MUL
    MUL --> ADD["+ → pred (1.1)"]
    B["b (0.1)"] --> ADD
    ADD --> SUB["- → err (-1.9)"]
    T["target (3.0)"] --> SUB
    SUB --> POW["² → loss (3.61)"]
```

Now backward:¶
| Node | Gradient | Computation |
|---|---|---|
| loss | \(1.0\) | seed |
| err | \(-3.8\) | \(2 \times (-1.9) \times 1.0\) (power rule) |
| pred | \(-3.8\) | \(1 \times \text{err.grad}\) |
| b | \(-3.8\) | \(1 \times \text{pred.grad}\) (addition) |
| wx | \(-3.8\) | \(1 \times \text{pred.grad}\) (addition) |
| w | \(-7.6\) | \(x \times \text{wx.grad} = 2.0 \times (-3.8)\) |
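Using the sketch from the top of this lesson, you can reproduce this table directly; the printed numbers are the same ones computed by hand above.

```python
loss.backward()

print(error.grad)  # -3.8
print(w.grad)      # -7.6
print(b.grad)      # -3.8
```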
This makes sense! Our prediction was 1.1 but the target was 3.0, so we were too low. Both gradients are negative, which means increasing \(w\) and \(b\) raises the prediction and lowers the loss; gradient descent therefore subtracts the gradient:
```python
learning_rate = 0.01
w.data -= learning_rate * w.grad  # 0.5 - 0.01 × (-7.6) = 0.576
b.data -= learning_rate * b.grad  # 0.1 - 0.01 × (-3.8) = 0.138
```
New prediction: \(0.576 \times 2.0 + 0.138 = 1.29\) (closer to 3.0!)
Repeat this hundreds of times, and \(w\) and \(b\) will converge to values that make the prediction close to 3.0.
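Here is a sketch of what that loop could look like, reusing the Value class from the top of the lesson (the step count of 300 is an arbitrary choice for illustration):

```python
w, b = Value(0.5), Value(0.1)
for step in range(300):
    # forward pass: rebuild the graph from scratch each step
    prediction = w * x + b
    loss = (prediction - target) ** 2

    # backward pass: compute fresh gradients
    w.grad, b.grad = 0.0, 0.0  # backward() accumulates with +=, so reset first
    loss.backward()

    # gradient descent update
    w.data -= learning_rate * w.grad
    b.data -= learning_rate * b.grad

print(w.data * x + b.data)  # ≈ 3.0 after a few hundred steps
```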
Scale: What the Real Graph Looks Like¶
In the mini example above, the graph had ~8 nodes. In microgpt.py, a single forward pass through the gpt() function creates thousands of nodes:
| Component | Approximate # of Value operations |
|---|---|
| Embedding lookup | ~16 |
| RMSNorm | ~50 |
| One attention head | ~200 |
| Four attention heads | ~800 |
| MLP (expand + activate + compress) | ~1,500 |
| Output linear | ~400 |
| Softmax | ~80 |
| Loss (log) | ~2 |
| Total per token | ~3,000+ |
| Per 8-token sequence | ~24,000+ |
The Power of Autograd
All of these nodes are Value objects sitting in memory, linked together. When you call loss.backward(), the algorithm visits each node exactly once (thanks to topological sort) and computes the gradient. One pass, all gradients, every parameter.
- Without autograd: ~20,000 forward passes to estimate gradients for 10,000 parameters (nudge each parameter and re-run the whole model, roughly two evaluations per parameter)
- With autograd: exactly 2 passes, one forward and one backward
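You can see the "one visit per node" structure by counting the nodes reachable from a graph's output. The helper below is hypothetical (it is not part of microgpt.py) and assumes the `_children` attribute from the Value sketch above:

```python
def count_nodes(root):
    """Count the distinct Value nodes in the graph below root (hypothetical helper)."""
    seen = set()
    def walk(v):
        if v not in seen:
            seen.add(v)
            for child in v._children:
                walk(child)
    walk(root)
    return len(seen)

print(count_nodes(loss))  # 8 for the mini example: w, x, wx, b, pred, target, err, loss
```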
Checkpoint ✓¶
What you understand now
- Derivatives: "which direction to nudge" (Lesson 0)
- Chain rule: composing derivatives through a chain (Lesson 1)
- Value class: recording operations and storing local gradients (Lesson 2)
- Forward pass: computing the output and building the graph (Lesson 3)
- Backward pass: walking the graph in reverse to compute all gradients (Lesson 4)
- The full picture: how these pieces work together (this lesson)
What we don't know yet: what does the gpt() function actually compute? What is attention? What is a linear layer? For that, we need to understand the architecture.