The Full GPT Function

Putting It All Together

We've studied every component. Now let's see the complete gpt() function: every piece assembled into a single pipeline that takes a token ID and its position and returns a score for each possible next token.

The Code (Lines 108–144)

microgpt.py — Lines 108-144
def gpt(token_id, pos_id, keys, values):
    # Step 1: Embed
    tok_emb = state_dict['wte'][token_id]       # token embedding lookup
    pos_emb = state_dict['wpe'][pos_id]          # position embedding lookup
    x = [t + p for t, p in zip(tok_emb, pos_emb)] # combine: what + where
    x = rmsnorm(x)                                # normalize

    for li in range(n_layer):                     # for each layer (just 1)
        # 1) Multi-head attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])    # query: what this token looks for
        k = linear(x, state_dict[f'layer{li}.attn_wk'])    # key: what this token offers
        v = linear(x, state_dict[f'layer{li}.attn_wv'])    # value: what this token passes on
        keys[li].append(k)                                  # cache k for later positions
        values[li].append(v)                                # cache v for later positions
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim                               # start index of head h's slice
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5
                           for t in range(len(k_h))]        # scaled dot product per cached token
            attn_weights = softmax(attn_logits)
            head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h)))
                        for j in range(head_dim)]           # weighted sum of values
            x_attn.extend(head_out)                         # concatenate heads
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])   # output projection
        x = [a + b for a, b in zip(x, x_residual)]         # residual

        # 2) MLP block
        x_residual = x
        x = rmsnorm(x)
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])    # expand 16→64
        x = [xi.relu() ** 2 for xi in x]                    # activate
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])    # compress 64→16
        x = [a + b for a, b in zip(x, x_residual)]         # residual

    logits = linear(x, state_dict['lm_head'])               # project to vocab
    return logits
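
To see the calling pattern, here is a float-only sketch of the same pipeline that runs on its own. It is not the original file: the real code operates on autograd Value objects and trained weights, while this version assumes plain floats, a hypothetical matrix() initializer with random weights, and assumed definitions of linear, softmax, and rmsnorm. The point is only to show how keys and values are created once and reused across positions.

```python
import math
import random

random.seed(0)
n_embd, n_head, n_layer, block_size, vocab_size = 16, 4, 1, 8, 27
head_dim = n_embd // n_head

def matrix(nout, nin):
    # Hypothetical initializer: small random floats stand in for trained weights.
    return [[random.gauss(0, 0.02) for _ in range(nin)] for _ in range(nout)]

state_dict = {'wte': matrix(vocab_size, n_embd), 'wpe': matrix(block_size, n_embd),
              'lm_head': matrix(vocab_size, n_embd)}
for li in range(n_layer):
    for name in ('attn_wq', 'attn_wk', 'attn_wv', 'attn_wo'):
        state_dict[f'layer{li}.{name}'] = matrix(n_embd, n_embd)
    state_dict[f'layer{li}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
    state_dict[f'layer{li}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)

def linear(x, w):
    # Assumed definition: one dot product per weight row (no bias).
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x):
    # Assumed definition: rescale to unit root-mean-square.
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

def gpt(token_id, pos_id, keys, values):
    x = [t + p for t, p in zip(state_dict['wte'][token_id], state_dict['wpe'][pos_id])]
    x = rmsnorm(x)
    for li in range(n_layer):
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        keys[li].append(linear(x, state_dict[f'layer{li}.attn_wk']))
        values[li].append(linear(x, state_dict[f'layer{li}.attn_wv']))
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim
            q_h = q[hs:hs + head_dim]
            k_h = [ki[hs:hs + head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs + head_dim] for vi in values[li]]
            w = softmax([sum(a * b for a, b in zip(q_h, kt)) / head_dim ** 0.5 for kt in k_h])
            x_attn.extend(sum(w[t] * v_h[t][j] for t in range(len(v_h))) for j in range(head_dim))
        x = [a + b for a, b in zip(linear(x_attn, state_dict[f'layer{li}.attn_wo']), x_residual)]
        x_residual = x
        x = linear(rmsnorm(x), state_dict[f'layer{li}.mlp_fc1'])
        x = [max(0.0, xi) ** 2 for xi in x]  # squared ReLU on floats
        x = [a + b for a, b in zip(linear(x, state_dict[f'layer{li}.mlp_fc2']), x_residual)]
    return linear(x, state_dict['lm_head'])

# Feed a short sequence one token at a time, reusing the same caches.
keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
token_ids = [26, 4, 12]  # <BOS>, 'e', 'm' under the a=0..z=25, <BOS>=26 mapping
for pos_id, token_id in enumerate(token_ids):
    logits = gpt(token_id, pos_id, keys, values)
    print(pos_id, len(logits), len(keys[0]))  # cache grows by one entry per call
```

Each call returns 27 logits, and the cache for layer 0 gains one key and one value per call. That growth is what lets later positions attend to earlier ones without recomputing them.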

The Data Flow

Input: token_id = 4 ('e'), pos_id = 0
tok_emb = wte[4]        → 16 numbers representing 'e'
pos_emb = wpe[0]        → 16 numbers representing "first position"
x = tok_emb + pos_emb   → 16 numbers: "'e' at position 0"
x = rmsnorm(x)          → 16 numbers, normalized
x_residual = x                                      (save)
x = rmsnorm(x)                                      (normalize)
q, k, v = linear(x, Wq), linear(x, Wk), linear(x, Wv)
For each of 4 heads:
  Compute attention weights over all tokens so far (the cached keys, including the current token)
  Weighted sum of values → 4 numbers
Concatenate → 16 numbers
x = linear(concat, Wo) → 16 numbers
x = x + x_residual                                  (residual)
x_residual = x                                      (save)
x = rmsnorm(x)                                      (normalize)
x = linear(x, fc1)      → 64 numbers (expanded)
x = ReLU(x)²            → 64 numbers (activated)
x = linear(x, fc2)      → 16 numbers (compressed)
x = x + x_residual                                  (residual)
logits = linear(x, lm_head) → 27 numbers (one per character)
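
The head-splitting step in this trace can be checked in isolation. The sketch below uses made-up vectors rather than real model outputs, plus an assumed softmax definition, and mirrors the input above: at pos_id = 0 the cache holds a single key and value. With only one token to attend to, each head's attention weight is 1.0, so the concatenated result is just the value vector.

```python
import math

n_embd, n_head = 16, 4
head_dim = n_embd // n_head  # 4 numbers per head

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative vectors (not real model outputs): one query, one cached key/value.
q = [0.1 * i for i in range(n_embd)]
keys = [[0.5] * n_embd]
values = [[float(i) for i in range(n_embd)]]

x_attn = []
for h in range(n_head):
    hs = h * head_dim
    q_h = q[hs:hs + head_dim]                       # this head's 4-number slice of q
    k_h = [k[hs:hs + head_dim] for k in keys]
    v_h = [v[hs:hs + head_dim] for v in values]
    scores = [sum(a * b for a, b in zip(q_h, kt)) / head_dim ** 0.5 for kt in k_h]
    weights = softmax(scores)                       # one cached token, so weights == [1.0]
    head_out = [sum(weights[t] * v_h[t][j] for t in range(len(v_h))) for j in range(head_dim)]
    x_attn.extend(head_out)                         # 4 heads x 4 numbers = 16

print(len(x_attn))           # 16
print(x_attn == values[0])   # True
```

From the second position onward the weights are spread over several cached tokens, and each head can spread them differently.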

The Architecture Diagram

flowchart TD
    TID["token_id"] --> WTE["wte (embed)"]
    PID["pos_id"] --> WPE["wpe (embed)"]
    WTE --> ADD1["⊕"]
    WPE --> ADD1
    ADD1 --> NORM0["RMSNorm"]
    NORM0 --> SAVE1

    subgraph layer["Layer 0"]
        SAVE1["save x_residual"]
        SAVE1 --> NORM1["RMSNorm"]
        NORM1 --> ATTN["Multi-Head Attention<br>(4 heads)"]
        ATTN --> RES1["⊕ residual"]
        SAVE1 -. "skip" .-> RES1
        RES1 --> SAVE2["save x_residual"]
        SAVE2 --> NORM2["RMSNorm"]
        NORM2 --> MLP["MLP<br>(16→64→16)"]
        MLP --> RES2["⊕ residual"]
        SAVE2 -. "skip" .-> RES2
    end

    RES2 --> LM["lm_head<br>(16 → 27)"]
    LM --> OUT["logits<br>(27 scores)"]
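
Both halves of the layer in the diagram follow the same pattern: save the input, normalize, transform, add the saved input back. As a rough illustration, the sketch below factors that pattern into a residual_block helper. The helper is hypothetical (the original code writes the steps inline), and rmsnorm is an assumed definition.

```python
def rmsnorm(x):
    # Assumed definition: scale x to unit root-mean-square (small epsilon for stability).
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

def residual_block(x, sublayer):
    # Hypothetical helper: normalize, transform, then add the saved input back.
    x_residual = x
    out = sublayer(rmsnorm(x))
    return [a + b for a, b in zip(out, x_residual)]

x = [1.0, -2.0, 3.0, 0.5]
y_skip = residual_block(x, lambda v: [0.0] * len(v))  # sublayer contributes nothing
y_id = residual_block(x, lambda v: v)                 # sublayer passes the normalized input through
print(y_skip == x)  # True: the skip path alone carries x through unchanged
```

In the real function the sublayer would be the attention computation or the MLP. Even when a sublayer contributes nothing, the skip path preserves the input.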

What the Logits Mean

The output is 27 numbers — one for each token in the vocabulary:

logits[0]  → raw score for 'a'
logits[1]  → raw score for 'b'
...
logits[25] → raw score for 'z'
logits[26] → raw score for <BOS>

Warning

These are not probabilities yet. They're raw scores that can be negative or very large. To get probabilities, we apply softmax outside this function.
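
As a small illustration of that last step, the sketch below applies softmax to a made-up logits list and reads off the most likely character. The logit values are invented for the example, the softmax definition is assumed, and plain floats are used for simplicity.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Index-to-token mapping from the list above: a=0 ... z=25, <BOS>=26.
chars = [chr(ord('a') + i) for i in range(26)] + ['<BOS>']

logits = [0.0] * 27   # invented scores, not real model output
logits[4] = 3.0       # make 'e' the favorite
logits[0] = -1.5      # negative scores are fine

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(chars[best])                  # e
print(all(p > 0 for p in probs))    # True: every token keeps some probability
```

After softmax, every entry is positive and the entries sum to 1 (up to floating-point rounding), so the list can be sampled from directly.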

Why "GPT"?

GPT = Generative Pre-trained Transformer

  • Generative: It generates text (one token at a time)
  • Pre-trained: It's trained on data before being used
  • Transformer: The architecture — attention + MLP + residual connections

Tip

This function IS the Transformer. The rest is training and inference.

Scaling Up

              microgpt.py   GPT-2 Small   GPT-3
n_embd        16            768           12,288
n_head        4             12            96
n_layer       1             12            96
block_size    8             1,024         2,048
Parameters    4,064         124M          175B

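Same core architecture, just with bigger matrices and more layers. The details differ slightly (GPT-2, for example, uses LayerNorm and GELU where microgpt.py uses RMSNorm and squared ReLU), but the pipeline is the one shown above.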
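
The microgpt.py parameter count can be reproduced from the matrix shapes used in gpt(). The count_params helper below is a hypothetical name, and it assumes what the code above suggests: no bias terms, a 4x MLP expansion, and a separate lm_head matrix.

```python
def count_params(vocab_size, n_embd, n_layer, block_size):
    embeddings = vocab_size * n_embd + block_size * n_embd   # wte + wpe
    attention = 4 * n_embd * n_embd                           # wq, wk, wv, wo
    mlp = 2 * (4 * n_embd) * n_embd                           # fc1 (16→64) + fc2 (64→16)
    lm_head = vocab_size * n_embd                             # final projection to vocab
    return embeddings + n_layer * (attention + mlp) + lm_head

total = count_params(vocab_size=27, n_embd=16, n_layer=1, block_size=8)
print(total)  # 4064
```

The same formula would not reproduce the GPT-2 figure exactly, since GPT-2 also has bias and LayerNorm parameters and shares its token embedding matrix with the output projection.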

Checkpoint ✓

You now understand the entire model architecture:

  • ✅ Parameters — numbers that start out random and are tuned during training to encode knowledge
  • ✅ Embeddings — representing tokens and positions as vectors
  • ✅ Linear layers — mixing information via matrix multiplication
  • ✅ Softmax — converting scores to probabilities
  • ✅ RMSNorm — keeping values well-behaved
  • ✅ Attention — deciding which tokens to focus on
  • ✅ Multi-head — multiple attention perspectives
  • ✅ Residual connections — preserving original information
  • ✅ MLP — non-linear processing and knowledge storage
  • ✅ Full GPT — all pieces assembled