
Parameters Are Knowledge

The Problem

We have an autograd engine that can compute gradients. But gradients of what? We need actual numbers to compute with — the model's parameters.

Parameters are the thousands of numbers that the model will tune during training to get good at prediction. Before training, they're random. After training, they encode everything the model has "learned."

The Hyperparameters (Lines 75–79)

microgpt.py — Lines 75-79
n_embd = 16     # embedding dimension
n_head = 4      # number of attention heads
n_layer = 1     # number of layers
block_size = 8  # maximum sequence length
head_dim = n_embd // n_head  # dimension of each head = 4

These are hyperparameters — settings that the programmer chooses, not things the model learns:

| Hyperparameter | Value | What it controls |
| --- | --- | --- |
| n_embd | 16 | How "rich" each token's representation is |
| n_head | 4 | How many different "perspectives" in attention |
| n_layer | 1 | How many times we repeat the attention+MLP block |
| block_size | 8 | Maximum number of characters the model can see |
| head_dim | 4 | Size of each attention head (\(16 / 4 = 4\)) |

Scale comparison

In real GPT models, these numbers are much larger (GPT-2: n_embd=768, n_head=12, n_layer=12). The structure is identical.

Creating Parameter Matrices (Line 80)

microgpt.py — Line 80
matrix = lambda nout, nin, std=0.02: [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]

This helper creates a 2D grid (matrix) of Value objects, each initialized with a small random number from a Gaussian distribution:

\[\text{random.gauss}(0, 0.02) \implies \text{a random number near 0, usually between } -0.06 \text{ and } +0.06\]

Example: matrix(3, 2)

[
  [Value(0.01), Value(-0.03)],   # row 0
  [Value(0.02), Value(0.01)],    # row 1
  [Value(-0.01), Value(0.04)],   # row 2
]

A 3×2 grid of random Value objects.
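
A quick way to sanity-check the "usually between -0.06 and +0.06" claim is to sample the same distribution directly (a throwaway experiment, not part of microgpt.py):

import random

samples = [random.gauss(0, 0.02) for _ in range(10_000)]    # same distribution the matrix helper uses
within = sum(abs(s) < 0.06 for s in samples) / len(samples)  # fraction within 3 standard deviations
print(f"min={min(samples):.3f} max={max(samples):.3f} within 0.06: {within:.1%}")  # roughly 99.7%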

Why random? If all parameters started at the same value, they'd all receive the same gradient and update in lockstep forever. Randomness breaks this symmetry (a tiny sketch below illustrates the lockstep problem).

Why small? Large initial values cause numerical instability; starting near zero is safe.

Why Gaussian? random.gauss draws values from a bell curve centered at 0, so most values are close to 0 and rarely far from it.
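
Here is a tiny, self-contained illustration of the lockstep problem, using plain floats and a made-up gradient rather than the Value class:

w1, w2 = 0.5, 0.5               # two weights that start identical
for step in range(3):
    grad = 2 * (w1 + w2)        # made-up gradient of a toy loss; identical for both weights
    w1 -= 0.1 * grad            # gradient-descent update
    w2 -= 0.1 * grad
    print(step, w1, w2, w1 == w2)   # always equal: the symmetry never breaks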

The State Dictionary (Lines 81–89)

microgpt.py — Lines 81-89
state_dict = {
    'wte': matrix(vocab_size, n_embd),   # token embeddings: 27 × 16
    'wpe': matrix(block_size, n_embd),    # position embeddings: 8 × 16
    'lm_head': matrix(vocab_size, n_embd), # output layer: 27 × 16
}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)    # 16 × 16
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)    # 16 × 16
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)    # 16 × 16
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd, std=0)  # 16 × 16
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)  # 64 × 16
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd, std=0)  # 16 × 64

| Name | Shape | Purpose |
| --- | --- | --- |
| wte | 27 × 16 | Token embedding: gives each of 27 tokens a 16-dimensional "meaning" |
| wpe | 8 × 16 | Position embedding: encodes position (1st, 2nd, ..., 8th) |
| lm_head | 27 × 16 | Output layer: converts internal state back to token predictions |
| attn_wq | 16 × 16 | Query weights for attention |
| attn_wk | 16 × 16 | Key weights for attention |
| attn_wv | 16 × 16 | Value weights for attention (not the Value class; confusing, but standard terminology) |
| attn_wo | 16 × 16 | Output projection for attention |
| mlp_fc1 | 64 × 16 | Expand layer in the MLP block (16 → 64) |
| mlp_fc2 | 16 × 64 | Compress layer in the MLP block (64 → 16) |
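
As a peek ahead (the real forward pass comes later in microgpt.py), a token's embedding is simply the corresponding row of wte; the token id below is made up for illustration:

token_id = 5                                   # hypothetical token id in 0..26
token_embedding = state_dict['wte'][token_id]  # row 5 of the 27 × 16 table
print(len(token_embedding))                    # 16 Value objects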

Why std=0 for some matrices?

attn_wo and mlp_fc2 are initialized with std=0, meaning all zeros. These are output projection matrices. Initializing them to zero means the attention and MLP blocks initially do nothing: they output zeros, so the residual connection just passes the input through. This is a stability trick for training.
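
A toy sketch of why zero initialization works, using made-up names and plain floats (block_output stands in for "attention or MLP result projected through the output matrix"; this is not microgpt's actual forward code):

def block_output(x, wo):
    # multiply the input vector by an output projection matrix
    return [sum(w * xi for w, xi in zip(row, x)) for row in wo]

x = [1.0, 2.0, 3.0]
wo = [[0.0] * 3 for _ in range(3)]                       # std=0 initialization: all zeros
y = [xi + oi for xi, oi in zip(x, block_output(x, wo))]  # residual: x + block(x)
print(y)                                                 # [1.0, 2.0, 3.0], the input passes through unchanged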

Flattening the Parameters (Line 89)

microgpt.py — Line 89
params = [p for mat in state_dict.values() for row in mat for p in row]
print(f"num params: {len(params)}")

This flattens all matrices into a single flat list of Value objects. The optimizer needs one flat list to loop over all parameters.
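
The comprehension is equivalent to these nested loops, written out for clarity:

params = []
for mat in state_dict.values():   # each named matrix
    for row in mat:               # each row of the matrix
        for p in row:             # each individual Value object
            params.append(p)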

How many parameters?

| Matrix | Shape | Count |
| --- | --- | --- |
| wte | 27 × 16 | 432 |
| wpe | 8 × 16 | 128 |
| lm_head | 27 × 16 | 432 |
| attn_wq | 16 × 16 | 256 |
| attn_wk | 16 × 16 | 256 |
| attn_wv | 16 × 16 | 256 |
| attn_wo | 16 × 16 | 256 |
| mlp_fc1 | 64 × 16 | 1,024 |
| mlp_fc2 | 16 × 64 | 1,024 |
| Total | | 4,064 |
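
You can reproduce this table directly from state_dict (a quick check, not part of microgpt.py):

total = 0
for name, mat in state_dict.items():
    nout, nin = len(mat), len(mat[0])
    print(f"{name}: {nout} x {nin} = {nout * nin}")
    total += nout * nin
print(f"total: {total}")   # 4064, matching len(params)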

4,064 Value objects, each a small random number, each tracking its gradient. By comparison, GPT-2 has 124 million parameters, and GPT-4 is rumored to have over a trillion.

Terminology
| Term | Meaning |
| --- | --- |
| Parameters | The learnable numbers in the model (weights and biases) |
| Hyperparameters | Settings chosen by the programmer (n_embd, n_head, etc.) |
| State dict | A dictionary mapping names to parameter matrices |
| Weight matrix | A 2D grid of parameters used in a linear transformation |
| Initialization | The strategy for setting initial parameter values |
| Gaussian | A bell-curve distribution; most values cluster near the mean |