# The Forward Pass

## What Is a "Forward Pass"?
The forward pass is simply computing the output given an input. You feed in numbers on one end, and math operations transform them step by step until you get a result.
It's called "forward" because data flows in one direction: from inputs → through operations → to output.
## A Concrete Example
Let's trace a tiny computation:
```python
a = Value(2.0)
b = Value(-3.0)
c = a * b       # -6.0
d = Value(10.0)
e = c + d       # 4.0
f = e.relu()    # 4.0 (positive, so unchanged)
```
| Step | Computation | Result |
|---|---|---|
| 1 | a.data = 2.0 | given |
| 2 | b.data = -3.0 | given |
| 3 | c.data = 2.0 × (-3.0) | -6.0 |
| 4 | d.data = 10.0 | given |
| 5 | e.data = -6.0 + 10.0 | 4.0 |
| 6 | f.data = max(0, 4.0) | 4.0 |
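The trace above can be reproduced end to end with a minimal sketch of such a `Value` class. This is a guess at its shape based on the fields described in this chapter (`.data`, `._children`, `._local_grads`); the real implementation may differ.

```python
class Value:
    """A scalar that records how it was computed."""

    def __init__(self, data, _children=(), _local_grads=()):
        self.data = data
        self._children = _children        # inputs that produced this node
        self._local_grads = _local_grads  # d(self)/d(child) for each child

    def __mul__(self, other):
        # Local derivatives of a*b: d/da = b, d/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def __add__(self, other):
        # Local derivatives of a+b: both are 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def relu(self):
        # Local derivative: 1 if the input is positive, else 0
        return Value(max(0.0, self.data), (self,),
                     (1.0 if self.data > 0 else 0.0,))


a = Value(2.0)
b = Value(-3.0)
c = a * b        # -6.0
d = Value(10.0)
e = c + d        # 4.0
f = e.relu()     # 4.0
print(c.data, e.data, f.data)  # -6.0 4.0 4.0
```

Running this reproduces exactly the six-step table above: every operation returns a fresh `Value` whose `.data` is the computed result.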
```mermaid
flowchart LR
    A["a (2.0)"] --> MUL["× → c (-6.0)"]
    B["b (-3.0)"] --> MUL
    MUL --> ADD["+ → e (4.0)"]
    D["d (10.0)"] --> ADD
    ADD --> RELU["relu → f (4.0)"]
```

Each box holds a `Value`; each operation creates a new `Value` node.
## What Gets Recorded
During the forward pass, each new Value stores:
- The computed result (`.data`)
- References to the inputs (`._children`)
- The local derivatives (`._local_grads`)
For node `c = a * b`, that means: `c.data = -6.0`, `c._children = (a, b)`, and the local derivatives ∂c/∂a = b.data = -3.0 and ∂c/∂b = a.data = 2.0.

This recording builds the computation graph as a side effect of the forward pass. We'll need this graph for the backward pass.
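The recording can be seen directly by inspecting a node after the operation. A self-contained sketch, with field names following the description above (the actual class may differ):

```python
class Value:
    """Minimal sketch: a scalar that records its own provenance."""

    def __init__(self, data, _children=(), _local_grads=()):
        self.data = data
        self._children = _children        # references to the input nodes
        self._local_grads = _local_grads  # d(self)/d(child) for each child

    def __mul__(self, other):
        # For c = a*b: dc/da = b.data, dc/db = a.data
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))


a = Value(2.0)
b = Value(-3.0)
c = a * b

print(c.data)                 # -6.0
print(c._children == (a, b))  # True
print(c._local_grads)         # (-3.0, 2.0)
```

Nothing extra had to be done to build the graph: creating `c` is what records the edge from `a` and `b` to `c`.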
## The Forward Pass in microgpt.py
In the actual model, the forward pass happens when we call the gpt() function.
This call triggers a cascade of hundreds of operations:
- Look up embeddings (addition of two `Value` rows)
- Normalize (multiply, divide, and power operations on `Value` nodes)
- Attention (matrix multiplications, softmax — all on `Value` nodes)
- MLP (more linear transforms and activations)
- Output logits (one final linear transform)
> **Important**
>
> Every single arithmetic operation creates a new `Value` node. By the time `logits` is returned, there is a massive computation graph in memory, with every node remembering exactly how it was produced.
## Why Build This Graph?
Because the backward pass will walk this graph in reverse to compute gradients. Without the graph, we wouldn't know which operations happened, in what order, with what inputs. The graph is the "recording" that makes automatic differentiation possible.
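A sketch of what that reverse walk could look like (hypothetical implementation; the real one may differ): topologically order the graph starting from the output, then apply the chain rule node by node, using exactly the `._children` and `._local_grads` fields recorded during the forward pass.

```python
class Value:
    """Minimal sketch of a recording scalar (assumed field names)."""

    def __init__(self, data, _children=(), _local_grads=()):
        self.data = data
        self.grad = 0.0                   # filled in by backward()
        self._children = _children
        self._local_grads = _local_grads

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def relu(self):
        return Value(max(0.0, self.data), (self,),
                     (1.0 if self.data > 0 else 0.0,))


def backward(root):
    # Order nodes so every child appears before its parent...
    order, seen = [], set()

    def visit(node):
        if node not in seen:
            seen.add(node)
            for child in node._children:
                visit(child)
            order.append(node)

    visit(root)

    # ...then walk in reverse, applying the chain rule at each node:
    # child.grad accumulates (local derivative) x (gradient flowing in).
    root.grad = 1.0
    for node in reversed(order):
        for child, local in zip(node._children, node._local_grads):
            child.grad += local * node.grad


# Same graph as the running example: f = relu(a*b + d)
a, b, d = Value(2.0), Value(-3.0), Value(10.0)
f = (a * b + d).relu()
backward(f)
print(a.grad, b.grad, d.grad)  # -3.0 2.0 1.0
```

The gradients match the hand calculation: ∂f/∂a = b = -3, ∂f/∂b = a = 2, and ∂f/∂d = 1, since the ReLU is in its positive region.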
## The Full Picture
```mermaid
flowchart LR
    direction LR
    subgraph inputs["Inputs"]
        A["a (2.0)"]
        B["b (-3.0)"]
        D["d (10.0)"]
    end
    subgraph ops["Operations"]
        MUL["× → c (-6.0)"]
        ADD["+ → e (4.0)"]
        RELU["relu → f (4.0)"]
    end
    A --> MUL
    B --> MUL
    MUL --> ADD
    D --> ADD
    ADD --> RELU
```

**FORWARD** = left to right (compute values) ➡️

**BACKWARD** = right to left (compute gradients) ⬅️
## Terminology
| Term | Meaning |
|---|---|
| Forward pass | Computing the output from the input, step by step |
| Computation graph | The graph (a DAG) of `Value` nodes built during the forward pass |
| Leaf node | An input `Value` with no children (parameters and raw inputs) |
| Internal node | A `Value` created by an operation on other `Value`s |
| Root node | The final output (usually the loss) |
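The three node kinds in the table can be told apart purely from the recorded structure: a leaf has no `._children`, and the root is whichever node you start the backward pass from. A small sketch (hypothetical helper, not part of the actual code):

```python
class Value:
    """Minimal sketch: only the fields needed for classification."""

    def __init__(self, data, _children=()):
        self.data = data
        self._children = _children


def kind(node, root):
    """Classify a node by its position in the computation graph."""
    if node is root:
        return "root"
    return "leaf" if not node._children else "internal"


a = Value(2.0)                      # leaf: created directly, no children
b = Value(-3.0)                     # leaf
c = Value(a.data * b.data, (a, b))  # produced by an operation; also the output
print(kind(a, c), kind(b, c), kind(c, c))  # leaf leaf root
```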