Linear Layers¶
The Problem¶
We have a 16-dimensional embedding vector x. Now we need to transform it — mix the information in its 16 dimensions to produce a new vector. This is the most fundamental computation in neural networks.
What Is a Linear Layer?¶
A linear layer multiplies an input vector by a weight matrix. It's the neural network's way of "mixing and recombining" information.
Concrete Example¶
A tiny 3→2 example (3 inputs, 2 outputs):
Each output element is a weighted sum of all input elements. The weights determine "how much of each input goes into each output."
With numbers
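Filling in concrete numbers for the 3→2 case (the weight values here are made up purely for illustration):

\[
W = \begin{pmatrix} 1 & 0 & 2 \\ 0.5 & 1 & -1 \end{pmatrix},\quad
x = \begin{pmatrix} 3 \\ 4 \\ 5 \end{pmatrix},\quad
y = Wx = \begin{pmatrix} 1\cdot 3 + 0\cdot 4 + 2\cdot 5 \\ 0.5\cdot 3 + 1\cdot 4 - 1\cdot 5 \end{pmatrix} = \begin{pmatrix} 13 \\ 0.5 \end{pmatrix}
\]

Row 0 of \(W\) produces \(y_0\), row 1 produces \(y_1\): one weighted sum per output.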
The Code (Lines 94–95)¶
```python
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]
```
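A quick sanity check, calling `linear` on a made-up 3→2 weight matrix (values chosen for illustration, not from microgpt.py):

```python
def linear(x, w):
    # Repeated from microgpt.py (lines 94-95) so this snippet runs standalone:
    # one dot product per weight row.
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

w = [
    [1.0, 0.0, 2.0],   # row 0 -> y0
    [0.5, 1.0, -1.0],  # row 1 -> y1
]
x = [3.0, 4.0, 5.0]

y = linear(x, w)
print(y)  # [13.0, 0.5]
```

Two rows in `w`, two numbers out: the weight matrix's row count sets the output size.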
What's a Dot Product?¶
The inner sum is called a dot product: \(x \cdot w = \sum_i x_i w_i\) — multiply the elements pairwise, then add everything up.
It's a measure of "similarity" between two vectors. High dot product = vectors point in the same direction.
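In code, the dot product is one line (vectors here are arbitrary example values):

```python
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Multiply pairwise, then sum: 1*4 + 2*5 + 3*6
dot = sum(ai * bi for ai, bi in zip(a, b))
print(dot)  # 32.0
```

This is exactly the inner `sum(...)` inside `linear` — each output element of a linear layer is one dot product between the input and a row of the weight matrix.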
Visual Representation¶
```mermaid
flowchart LR
    X["x<br>[x₀, x₁, x₂]"] --> ROW0["row 0: [w₀₀, w₀₁, w₀₂]<br>→ dot product → y₀"]
    X --> ROW1["row 1: [w₁₀, w₁₁, w₁₂]<br>→ dot product → y₁"]
```

Each row of the weight matrix produces one output element. The number of rows determines the output size.
Where Linear Layers Appear in microgpt.py¶
| Line | Usage | Transform |
|---|---|---|
| 118 | linear(x, attn_wq) | 16 → 16 (queries) |
| 119 | linear(x, attn_wk) | 16 → 16 (keys) |
| 120 | linear(x, attn_wv) | 16 → 16 (values) |
| 133 | linear(x_attn, attn_wo) | 16 → 16 (output projection) |
| 138 | linear(x, mlp_fc1) | 16 → 64 (expand) |
| 140 | linear(x, mlp_fc2) | 64 → 16 (compress) |
| 143 | linear(x, lm_head) | 16 → 27 (final prediction) |
The linear function is the workhorse: six calls in every Transformer layer, plus one more for the final lm_head prediction.
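The MLP rows of the table (16 → 64 → 16) can be checked with a shape-only sketch. The random weights below are placeholders just to verify dimensions; in microgpt.py these are learned parameters:

```python
import random

def linear(x, w):
    # Same function as microgpt.py: one dot product per weight row.
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

n_embd = 16

# Placeholder weights with the same shapes as mlp_fc1 / mlp_fc2:
# rows = output size, columns = input size.
mlp_fc1 = [[random.uniform(-1, 1) for _ in range(n_embd)] for _ in range(4 * n_embd)]
mlp_fc2 = [[random.uniform(-1, 1) for _ in range(4 * n_embd)] for _ in range(n_embd)]

x = [random.uniform(-1, 1) for _ in range(n_embd)]
h = linear(x, mlp_fc1)   # 16 -> 64 (expand)
y = linear(h, mlp_fc2)   # 64 -> 16 (compress)
print(len(h), len(y))    # 64 16
```

Chaining linear layers works whenever each matrix's column count matches the previous output size — the row count is free to expand or compress.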
Why "Linear"?¶
Because \(y = Wx\) is a linear function — if you double the input, you double the output. No curves, no bends. Just scaling and mixing.
Limitation
Linear layers can only represent straight-line relationships. To model complex patterns, we need non-linearity — that's what activation functions like ReLU provide.
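One way to see the limitation: a linear map must satisfy \(f(-x) = -f(x)\). A linear layer passes that test; add a ReLU and it fails, which is precisely the point — the network can now bend. (Weights below are made-up 2×2 values for illustration.)

```python
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

def relu(v):
    # Zero out negatives; this is the bend that breaks linearity.
    return [max(0.0, vi) for vi in v]

w = [[1.0, -2.0], [3.0, 0.5]]  # arbitrary example weights
x = [1.0, 2.0]
neg_x = [-xi for xi in x]

# A pure linear layer obeys f(-x) == -f(x):
assert linear(neg_x, w) == [-yi for yi in linear(x, w)]

# With ReLU applied after it, the same identity fails:
assert relu(linear(neg_x, w)) != [-yi for yi in relu(linear(x, w))]
```

Without a nonlinearity between them, two stacked linear layers collapse into a single linear layer — depth buys nothing until something like ReLU sits in between.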
No Bias Terms
Standard neural networks often add a bias: \(y = Wx + b\). Karpathy's implementation skips biases entirely — a simplification common in modern Transformer architectures.
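For completeness, a biased variant would look like this — `linear_with_bias` is a hypothetical helper, not part of microgpt.py, and the weights are illustrative:

```python
def linear_with_bias(x, w, b):
    # y = Wx + b: the bias shifts each output independently.
    # microgpt.py omits b, which is the same as fixing b = 0.
    return [sum(wi * xi for wi, xi in zip(wo, x)) + bi for wo, bi in zip(w, b)]

w = [[1.0, 0.0],
     [0.0, 1.0]]       # identity weights, for clarity
b = [10.0, -10.0]

y = linear_with_bias([1.0, 2.0], w, b)
print(y)  # [11.0, -8.0]
```

Dropping biases saves parameters and, in practice, costs little — normalization layers elsewhere in the Transformer already provide a learnable shift.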
Terminology
| Term | Meaning |
|---|---|
| Linear layer | Multiplying input by a weight matrix; \(y = Wx\) |
| Dot product | Multiply elements pairwise, then sum |
| Weight matrix | The grid of learnable parameters in a linear layer |
| Bias | An optional additive vector (omitted in microgpt.py) |
| Projection | Another word for "linear transformation" |