Glossary
A reference of every term used in the course, in alphabetical order.
A
Activation Function — A non-linear function applied between linear layers. Enables networks to learn complex patterns. In microgpt.py: ReLU² (\(\max(0, x)^2\)).
Adam — Adaptive Moment Estimation. An optimizer that combines momentum (running average of gradients) and adaptive learning rates (per-parameter step sizes). Lines 174–182.
Attention — A mechanism that lets tokens "look at" other tokens to gather context. Computes relevance scores via dot products between queries and keys, then takes a weighted average of values.
Autograd — Automatic differentiation. A system that computes derivatives automatically by recording operations and replaying them in reverse. The Value class implements this.
Autoregressive — A generation strategy where each output token becomes the input for producing the next token.
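The Autograd entry above can be sketched in a few lines. This is a simplified illustration of the recording-and-replaying idea, not the course's exact Value implementation: here each operation stores its inputs and their local gradients, and backward() walks the graph in reverse applying the chain rule.

```python
# Minimal autograd sketch (simplified; not microgpt.py's exact Value class).
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # inputs that produced this node
        self._local_grads = local_grads  # d(output)/d(input) for each child

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Topologically sort so every node comes after its inputs,
        # then walk the order in reverse, applying the chain rule.
        order, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad  # += accumulates across paths

a, b = Value(2.0), Value(3.0)
c = a * b + a   # dc/da = b + 1 = 4, dc/db = a = 2
c.backward()
```

Note the `+=` in the last loop: because `a` feeds `c` through two paths (`a * b` and the addition), its gradient contributions must be summed, not overwritten.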
B
Backward Pass — Walking the computation graph in reverse to compute gradients using the chain rule. Triggered by loss.backward().
Bias Correction — Dividing Adam's moment estimates by \(1 - \beta^t\) to counteract their zero initialization, which would otherwise bias them toward zero in early training steps.
Block Size — The maximum sequence length the model can process. Set to 8 in microgpt.py.
BOS (Beginning of Sequence) — A special token (ID 26) used to mark the start and end of a sequence.
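Adam's momentum, adaptive scaling, and bias correction fit in one small update rule. The sketch below uses common default hyperparameters (beta1=0.9, beta2=0.999, eps=1e-8), which are illustrative; microgpt.py's exact constants may differ.

```python
import math

# One bias-corrected Adam step (sketch; hyperparameters are illustrative).
def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # momentum: EMA of gradients
    v = beta2 * v + (1 - beta2) * g * g    # adaptive scale: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction: m and v start at 0,
    v_hat = v / (1 - beta2 ** t)           # so early averages are too small
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

At step t=1 the corrections divide by 0.1 and 0.001, restoring the freshly initialized averages to the gradient's actual scale; as t grows, the correction factors approach 1 and fade away.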
C–D
Causal Masking — Preventing the model from attending to future tokens. In microgpt.py, this happens naturally because tokens are processed one at a time.
Chain Rule — The derivative of a composition of functions equals the product of the individual derivatives: \(\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)\).
Computation Graph — A directed acyclic graph (DAG) of Value nodes recording all operations performed during the forward pass. Not strictly a tree: one node can feed several later operations, which is why gradients accumulate.
Cosine Decay — A learning rate schedule that smoothly decreases from the initial value to zero using a cosine curve.
Cross-Entropy Loss — The loss function \(-\log(P(\text{correct token}))\). Heavily penalizes confident wrong predictions.
Dataset — The collection of training examples. Here: ~32,000 human names from input.txt.
Derivative — The rate of change of a function's output with respect to its input. Tells us "which direction to nudge."
Document — A single training example. In this case, one name (e.g., "emma").
Dot Product — Multiply corresponding elements and sum: \(\text{dot}(\mathbf{a}, \mathbf{b}) = a_0 b_0 + a_1 b_1 + \cdots\). Measures similarity.
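The Cross-Entropy Loss formula above is short enough to evaluate directly. The probabilities below are made-up values just to illustrate the "heavily penalizes confident wrong predictions" behavior:

```python
import math

# Cross-entropy from a probability distribution (probs are illustrative).
def cross_entropy(probs, target):
    return -math.log(probs[target])

probs = [0.7, 0.2, 0.1]
cross_entropy(probs, 0)  # model confident and right: small loss (~0.36)
cross_entropy(probs, 2)  # model confident elsewhere: large loss (~2.30)
```

Because the loss is \(-\log\) of the probability assigned to the correct token, it approaches 0 as that probability approaches 1, and blows up toward infinity as it approaches 0.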
E–G
Embedding — A learnable vector representation of a token. Converts token IDs into rich numerical representations.
Embedding Table — A matrix where row \(i\) is the embedding for token \(i\).
Epoch — One complete pass through the entire dataset.
Epsilon (\(\epsilon\)) — A tiny number (e.g., \(10^{-5}\) or \(10^{-8}\)) added to prevent division by zero.
Exponential Moving Average — \(\text{new} = \beta \cdot \text{old} + (1-\beta) \cdot \text{current}\). Smooths a sequence of values.
Forward Pass — Computing the output from the input, step by step, creating the computation graph.
Gradient — The derivative of the loss with respect to a parameter. Tells us how to adjust the parameter to reduce the loss.
Gradient Accumulation — Summing gradient contributions from multiple paths through the graph (\(+=\) in backward).
Gradient Descent — The simplest optimizer: \(\theta \leftarrow \theta - \eta \cdot g\), where \(\eta\) is the learning rate and \(g\) is the gradient.
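The gradient-descent update rule can be watched in action on a toy loss. This sketch minimizes \(f(x) = x^2\) (whose gradient is \(2x\)) rather than a real network loss:

```python
# Plain gradient descent on the toy loss f(x) = x^2, gradient 2x.
x, lr = 4.0, 0.1
for _ in range(50):
    g = 2 * x       # gradient of the loss at the current x
    x = x - lr * g  # step against the gradient
# x has converged close to the minimum at 0
```

Each step shrinks x by a factor of (1 - 2*lr) = 0.8, so fifty steps leave it vanishingly small; too large a learning rate would instead make the iterates overshoot and diverge.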
H–L
Head (Attention Head) — One independent attention mechanism operating on a subset of dimensions.
Hyperparameter — A setting chosen by the programmer (n_embd, n_head, learning_rate, etc.), not learned during training.
Inference — Using the trained model to generate new predictions, without updating parameters.
KV Cache — Storing keys and values from previous tokens to avoid recomputation during generation.
Layer — One complete attention+MLP block in the Transformer (microgpt.py has 1 layer).
Learning Rate — Step size for parameter updates. Controls how much parameters change each step.
Linear Layer — Matrix multiplication: \(y = Wx\). Mixes and recombines information.
Local Gradient — The derivative of a single operation with respect to its immediate input.
Logits — Raw, unnormalized scores output by the model before softmax.
Loss — A single number measuring how wrong the model's predictions were. Lower is better.
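The Head, Key/Query/Value, and scaled-attention entries combine into one short computation. The sketch below shows a single attention step for the current token with toy 2-dimensional vectors; the names `q`, `keys`, `values` are illustrative, not microgpt.py's identifiers:

```python
import math

# One single-head attention step (toy dimensions; names are illustrative).
def attend(q, keys, values):
    d_k = len(q)
    # Relevance of each key to the query: scaled dot products
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k in keys]
    # Softmax the scores into attention weights (max-subtracted for stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Weighted average of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

Dividing by \(\sqrt{d_k}\) keeps the scores from growing with dimension, so the softmax stays soft instead of collapsing onto a single token.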
M–P
MLP (Multi-Layer Perceptron) — A two-layer feedforward network with a non-linear activation in between. Expand → Activate → Compress.
Momentum — A running average of past gradients, used in Adam to smooth out noisy updates.
Multi-Head Attention — Running multiple attention heads in parallel, each on a subset of dimensions, then concatenating results.
Normalization — Scaling values to have consistent magnitude. Prevents numerical instability.
Output Projection — A linear layer applied after multi-head attention to mix the concatenated head outputs.
Parameters — The learnable numbers in the model. Start random, get tuned during training.
Position Embedding — A vector encoding a token's position in the sequence.
Pre-normalization — Applying normalization before (not after) each block. Used in microgpt.py.
Probability Distribution — A list of non-negative numbers that sum to 1.
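The MLP's "Expand → Activate → Compress" shape can be sketched with plain lists. The weight matrices here are toy values chosen for illustration; real microgpt.py weights are learned:

```python
# MLP block shape: expand, apply ReLU^2, compress (toy, hand-picked weights).
def relu2(x):
    return max(0.0, x) ** 2

def mlp(x, W1, W2):
    # Expand to the hidden dimension and apply the activation
    h = [relu2(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    # Compress back down to the input dimension
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

W1 = [[1, 0], [0, 1], [1, 1]]   # 2 -> 3: expand
W2 = [[1, 1, 1], [0, 1, 0]]     # 3 -> 2: compress
mlp([1.0, -2.0], W1, W2)
```

Without the non-linear activation in the middle, the two matrix multiplications would collapse into a single linear map, and the extra layer would add nothing.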
Q–S
Query (Q) — In attention: "What am I looking for?" The current token's search vector.
Key (K) — In attention: "What do I offer?" Each token's advertisement vector.
Value (V) — In attention: "Here's my content." The actual information a token provides.
ReLU — Rectified Linear Unit: \(\max(0, x)\). A simple activation function.
Residual Connection — Adding the input back to the output: \(y = x + f(x)\).
RMSNorm — Root Mean Square Normalization: \(x / \sqrt{\text{mean}(x^2) + \epsilon}\), with a small \(\epsilon\) to prevent division by zero.
Sampling — Randomly choosing the next token based on the probability distribution.
Scaled Attention — Dividing attention scores by \(\sqrt{d_k}\) to prevent softmax saturation.
Sequence — An ordered list of tokens.
Softmax — Function that converts logits to probabilities: \(e^{x_i} / \sum_j e^{x_j}\).
State Dict — A dictionary mapping parameter names to weight matrices.
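The softmax formula above translates directly into code. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick that leaves the result unchanged:

```python
import math

# Softmax: convert raw logits into a probability distribution.
def softmax(logits):
    m = max(logits)                           # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

softmax([2.0, 1.0, 0.0])  # largest logit gets the largest probability
```

The output is always a valid probability distribution: every entry is positive and they sum to 1, regardless of the scale or sign of the input logits.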
T–Z
Temperature — A scalar that divides the logits before softmax to control randomness during generation. Low values make output more deterministic; high values make it more varied.
Token — The smallest unit the model processes. In microgpt.py: individual characters.
Tokenizer — The system that converts between text and token IDs.
Topological Sort — Ordering graph nodes so every node appears after all of its inputs. The backward pass visits this order in reverse, so each node's gradient is complete before it is propagated to its inputs.
Training — The process of iteratively adjusting parameters to minimize loss.
Training Step — One complete forward → loss → backward → update cycle.
Transformer — The architecture: attention + MLP + residual connections + normalization.
Vocabulary — The complete set of all possible tokens (27 in microgpt.py).
Vocab Size — The number of unique tokens (27 = 26 letters + BOS).
Weight Matrix — A 2D grid of learnable parameters used in linear transformations.
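Temperature and sampling come together in the generation loop. This sketch shows the spirit of temperature-scaled sampling with made-up logits; it is not microgpt.py's exact generation code:

```python
import math
import random

# Temperature-scaled sampling from logits (sketch; logits are illustrative).
def sample(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]  # low T sharpens, high T flattens
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling: walk the distribution until r falls in a bucket
    r, cumulative = random.random(), 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

With a very low temperature the distribution concentrates almost all mass on the largest logit, so sampling becomes effectively deterministic; with a high temperature the distribution flattens and rarer tokens are chosen more often.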