The Complete Picture

You've Made It

If you've read through every lesson in order, you now understand every single line of microgpt.py. Let's zoom out and see the whole thing as one coherent story.

The Story of 200 Lines

"Let there be data."

We start with a text file of 32,000 human names. Each name is a sequence of characters. We build a tokenizer: 26 letters + 1 BOS token = 27 tokens total.

What we built: A dataset and a way to encode/decode text.
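
In code, that tokenizer can be as small as the sketch below. The names (stoi, itos, encode, decode) are illustrative rather than guaranteed to match microgpt.py, but the mapping is the same: 26 letters plus one BOS token, 27 IDs total.

```python
# Illustrative character-level tokenizer: 26 lowercase letters + 1 BOS token = 27 tokens.
letters = "abcdefghijklmnopqrstuvwxyz"
BOS = 0                                              # reserve ID 0 for the BOS marker
stoi = {ch: i + 1 for i, ch in enumerate(letters)}   # 'a' -> 1, ..., 'z' -> 26
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    return [stoi[ch] for ch in name]                 # "emma" -> [5, 13, 13, 1]

def decode(ids):
    return "".join(itos[i] for i in ids)             # [5, 13, 13, 1] -> "emma"
```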

"Let there be learning."

We build a tiny automatic differentiation engine — the Value class. Every number remembers how it was computed. When we call backward(), it walks the computation graph in reverse, computing derivatives using the chain rule.

What we built: An autograd engine that makes training possible.
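
Here is a stripped-down sketch in the spirit of the Value class, supporting only addition and multiplication. The real class covers more operations, but the core idea is the one described above: each Value remembers its children, and backward() walks the graph in reverse, applying the chain rule.

```python
# Minimal autograd sketch (add and mul only), in the spirit of the Value class.
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # the Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse order.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, local in zip(v._children, v._local_grads):
                child.grad += local * v.grad

# y = a*b + a  =>  dy/da = b + 1 = 4, dy/db = a = 2
a, b = Value(2.0), Value(3.0)
y = a * b + a
y.backward()
print(a.grad, b.grad)   # 4.0 2.0
```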

"Let there be intelligence."

We initialize ~4,000 random parameters (4,064, to be exact) and define the architecture:

Embed → Normalize → Attend → Think → Predict

  • Embeddings give each token a rich representation
  • Attention lets tokens look at their context
  • MLP does non-linear processing
  • Residual connections preserve information
  • RMSNorm keeps values stable

What we built: A Transformer that maps tokens to predictions.
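
To see how those pieces fit together, here is a toy forward pass written with NumPy. It is a sketch of the data flow, not the Value-based code in microgpt.py: a single attention head, ReLU in the MLP, and made-up weight names are all simplifications. The pipeline itself, though, is the one listed above: embed, normalize, attend, think, predict, with residual connections along the way.

```python
import numpy as np

# Toy forward pass (NumPy sketch, not the Value-based code in microgpt.py):
# embed -> RMSNorm -> single-head attention -> MLP -> logits, with residuals.
vocab, n_embd, T = 27, 16, 4             # vocab size, embedding dim, sequence length
rng = np.random.default_rng(0)

E = rng.normal(size=(vocab, n_embd))                      # token embeddings
Wq, Wk, Wv, Wo = (rng.normal(size=(n_embd, n_embd)) * 0.02 for _ in range(4))
W1 = rng.normal(size=(n_embd, 4 * n_embd)) * 0.02         # MLP expand
W2 = rng.normal(size=(4 * n_embd, n_embd)) * 0.02         # MLP compress
Whead = rng.normal(size=(n_embd, vocab)) * 0.02           # prediction head

def rmsnorm(x):
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + 1e-5)

tokens = np.array([0, 5, 13, 13])        # e.g. BOS, e, m, m
x = E[tokens]                            # Embed: (T, n_embd)

# Attend: each position looks at itself and earlier positions (causal mask).
q, k, v = rmsnorm(x) @ Wq, rmsnorm(x) @ Wk, rmsnorm(x) @ Wv
scores = q @ k.T / np.sqrt(n_embd)
scores = np.where(np.tril(np.ones((T, T))) == 1, scores, -1e9)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
x = x + (weights @ v) @ Wo               # residual connection

# Think: MLP with expand -> activate -> compress, plus residual.
x = x + np.maximum(0, rmsnorm(x) @ W1) @ W2

# Predict: logits over the 27 tokens at every position.
logits = rmsnorm(x) @ Whead              # (T, vocab)
```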

"Let there be knowledge."

The model starts knowing nothing. Over 500 steps, we show it names, measure prediction error (cross-entropy), compute gradients (backpropagation), and adjust parameters (Adam with cosine decay).

The loss drops from ~3.3 (random chance: \(-\log(1/27) \approx 3.3\)) to ~1.5 (reasonably good).

What we built: A training loop that instills knowledge into parameters.
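
Two of those ingredients benefit from a concrete sketch: the cross-entropy loss and the Adam update with cosine learning-rate decay. The hyperparameter names and values below (lr_max, beta1, beta2, num_steps) are illustrative, not necessarily the ones in microgpt.py, and adam_step assumes Value-like parameters with .data and .grad, as in the autograd sketch earlier.

```python
import math

def cross_entropy(logits, target):
    # Softmax over the logits, then -log of the probability given to the correct token.
    exps = [math.exp(l - max(logits)) for l in logits]
    return -math.log(exps[target] / sum(exps))

lr_max, beta1, beta2, eps, num_steps = 1e-2, 0.9, 0.95, 1e-8, 500   # illustrative values

def adam_step(params, m, v, step):
    lr = lr_max * 0.5 * (1 + math.cos(math.pi * step / num_steps))  # cosine decay toward 0
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad          # momentum (1st moment)
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2     # adaptive scale (2nd moment)
        m_hat = m[i] / (1 - beta1 ** (step + 1))            # bias correction
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr * m_hat / (math.sqrt(v_hat) + eps)
        p.grad = 0.0                                        # clear the gradient for the next step

print(cross_entropy([0.0] * 27, target=3))   # uniform logits -> -log(1/27) ≈ 3.30, the "random chance" loss
```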

"Let there be creation."

We generate 20 new names: start with BOS, predict characters one at a time, sample with temperature=0.5, stop when BOS reappears.

What we built: An inference loop that generates new text.
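
A sketch of that sampling loop is below. model_logits stands in for the trained model's forward pass and is assumed rather than defined here; decode and BOS come from the tokenizer sketch above, and max_len is just a hypothetical safety cap.

```python
import math, random

BOS, temperature = 0, 0.5

def sample(logits, temperature):
    scaled = [l / temperature for l in logits]              # <1 sharpens, >1 flattens the distribution
    exps = [math.exp(s - max(scaled)) for s in scaled]      # numerically stable softmax
    probs = [e / sum(exps) for e in exps]
    return random.choices(range(len(probs)), weights=probs)[0]

def generate(model_logits, max_len=20):
    ids = [BOS]
    while len(ids) < max_len:
        next_id = sample(model_logits(ids), temperature)    # predict one character at a time
        if next_id == BOS:                                   # BOS reappearing means the name is done
            break
        ids.append(next_id)
    return decode(ids[1:])                                   # drop BOS, map IDs back to characters
```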

The Complete Dependency Map

```mermaid
flowchart TD
    DATA["Data<br>(names.txt)"] --> TOK["Tokenizer<br>(chars → IDs)"]
    TOK --> AG["Autograd Engine<br>(Value class)"]
    AG --> PARAMS["Parameters<br>(4,064 Values)"]
    AG --> ARCH["Architecture<br>(gpt function)"]
    PARAMS --> TRAIN["Training Loop"]
    ARCH --> TRAIN
    TRAIN --> TRAINED["Trained Parameters"]
    TRAINED --> INF["Inference Loop"]
    INF --> NAMES["Generated Names"]
```

microgpt.py vs. ChatGPT

| microgpt.py | ChatGPT | Same algorithm? |
| --- | --- | --- |
| Character-level tokenizer | BPE tokenizer (50k+ tokens) | ✅ |
| Value class (Python) | PyTorch autograd (CUDA) | ✅ |
| 4,064 params | 175B+ params | ✅ |
| 1 layer, 4 heads | 96 layers, 96 heads | ✅ |
| 500 training steps | Millions of steps | ✅ |
| 1 CPU | Thousands of GPUs | ✅ |
| names.txt | Terabytes of internet text | ✅ |

Everything else is just efficiency

The algorithm is identical. The differences are scale (more parameters, data, compute), speed (GPU acceleration), and polish (better tokenizers, fine-tuning, RLHF). But the fundamental loop — embed, attend, predict, compute loss, backpropagate, update — is the same.

Concepts You Now Understand

| Concept | What you know |
| --- | --- |
| Tokenization | Converting text to numbers and back |
| Embeddings | Representing tokens as learnable vectors |
| Attention | \(Q \cdot K / \sqrt{d_k}\) to compute relevance, weighted sum of \(V\) |
| Multi-head attention | Multiple parallel attention perspectives |
| Residual connections | Skip connections that preserve information |
| RMSNorm | Keeping values well-scaled |
| MLP | Non-linear processing (expand → activate → compress) |
| Forward pass | Computing predictions and building the graph |
| Backward pass | Computing gradients via chain rule |
| Cross-entropy loss | \(-\log(P(\text{correct}))\) |
| Adam optimizer | Momentum + adaptive learning rates |
| Temperature | Controlling generation randomness |
| Autoregressive generation | Each output becomes the next input |

Where to Go From Here

Experiments to try

  • Change n_embd (16 → 32) and see the effect
  • Change temperature (0.5 → 0.1, 1.0, 2.0)
  • Train for more steps (500 → 2000)
  • Use a different dataset (cities, words, anything)

The Final Analogy

It's like learning that a car engine has just four strokes: intake, compress, ignite, exhaust. Everything else — turbochargers, fuel injection, cooling systems — is optimization. But the four strokes ARE the engine.

In our case:

  • Embed (intake)
  • Attend + Think (compress + ignite)
  • Predict → Loss → Gradient → Update (exhaust + repeat)

That's the engine. You now understand every moving part.