Skip to content

Linear Layers

The Problem

We have a 16-dimensional embedding vector x. Now we need to transform it — mix the information in its 16 dimensions to produce a new vector. This is the most fundamental computation in neural networks.

What Is a Linear Layer?

A linear layer multiplies an input vector by a weight matrix. It's the neural network's way of "mixing and recombining" information.

Concrete Example

A tiny 3→2 example (3 inputs, 2 outputs):

\[\mathbf{x} = [x_0, x_1, x_2], \quad W = \begin{bmatrix} w_{00} & w_{01} & w_{02} \\ w_{10} & w_{11} & w_{12} \end{bmatrix}\]
\[y_0 = w_{00} \cdot x_0 + w_{01} \cdot x_1 + w_{02} \cdot x_2\]
\[y_1 = w_{10} \cdot x_0 + w_{11} \cdot x_1 + w_{12} \cdot x_2\]

Each output element is a weighted sum of all input elements. The weights determine "how much of each input goes into each output."

With numbers

\[\mathbf{x} = [2.0, 3.0, 1.0]\]
\[W = \begin{bmatrix} 0.5 & -0.3 & 0.1 \\ 0.2 & 0.4 & -0.2 \end{bmatrix}\]
\[y_0 = 0.5 \times 2 + (-0.3) \times 3 + 0.1 \times 1 = 1.0 - 0.9 + 0.1 = 0.2\]
\[y_1 = 0.2 \times 2 + 0.4 \times 3 + (-0.2) \times 1 = 0.4 + 1.2 - 0.2 = 1.4\]
\[\mathbf{y} = [0.2, 1.4]\]

The Code (Lines 94–95)

microgpt.py — Lines 94-95
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]
# For each row 'wo' in weight matrix 'w':
for wo in w:
    # Compute the dot product of that row with input 'x':
    sum(wi * xi for wi, xi in zip(wo, x))
Row 0: wo = [0.5, -0.3, 0.1]
       sum(0.5×2, -0.3×3, 0.1×1) = 0.2

Row 1: wo = [0.2, 0.4, -0.2]
       sum(0.2×2, 0.4×3, -0.2×1) = 1.4

Result: [0.2, 1.4]

What's a Dot Product?

The inner sum is called a dot product:

\[\text{dot}([a, b, c], [d, e, f]) = a \cdot d + b \cdot e + c \cdot f\]

It's a measure of "similarity" between two vectors. High dot product = vectors point in the same direction.

Visual Representation

flowchart LR
    X["x<br>[x₀, x₁, x₂]"] --> ROW0["row 0: [w₀₀, w₀₁, w₀₂]<br>→ dot product → y₀"]
    X --> ROW1["row 1: [w₁₀, w₁₁, w₁₂]<br>→ dot product → y₁"]

Each row of the weight matrix produces one output element. The number of rows determines the output size.

Where Linear Layers Appear in microgpt.py

Line Usage Transform
118 linear(x, attn_wq) 16 → 16 (queries)
119 linear(x, attn_wk) 16 → 16 (keys)
120 linear(x, attn_wv) 16 → 16 (values)
133 linear(x_attn, attn_wo) 16 → 16 (output projection)
138 linear(x, mlp_fc1) 16 → 64 (expand)
140 linear(x, mlp_fc2) 64 → 16 (compress)
143 linear(x, lm_head) 16 → 27 (final prediction)

The linear function is the workhorse — used 7 times per layer, plus one more for the output.

Why "Linear"?

Because \(y = Wx\) is a linear function — if you double the input, you double the output. No curves, no bends. Just scaling and mixing.

Limitation

Linear layers can only represent straight-line relationships. To model complex patterns, we need non-linearity — that's what activation functions like ReLU provide.

No Bias Terms

Standard neural networks often add a bias: \(y = Wx + b\). Karpathy's implementation skips biases entirely — a simplification common in modern Transformer architectures.

Terminology
Term Meaning
Linear layer Multiplying input by a weight matrix; \(y = Wx\)
Dot product Multiply elements pairwise, then sum
Weight matrix The grid of learnable parameters in a linear layer
Bias An optional additive vector (omitted in microgpt.py)
Projection Another word for "linear transformation"