The BOS Token¶
The Problem¶
Suppose the model is learning from the name "emma". We train it on these predictions:
| Given | Predict |
|---|---|
| 'e' | 'm' |
| 'm' | 'm' |
| 'm' | 'a' |
But two critical things are missing:
Two missing pieces
- How does the model know to start with 'e'? Something has to come before the first character.
- How does the model know to stop after 'a'? It could keep generating characters forever.
We need a way to say: "This is the beginning" and "This is the end."
The Solution: BOS (Beginning of Sequence)¶
microgpt.py uses a single special token called BOS for both purposes:
```python
BOS = len(uchars)            # token id 26 (one past the last character)
vocab_size = len(uchars) + 1 # 27 total tokens
```
BOS gets the token ID 26 — the next available number after 'z' (which is 25).
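To make the numbering concrete, here is a minimal sketch of the token setup, assuming a lowercase a–z character set (the variable names `uchars`, `BOS`, and `vocab_size` follow microgpt.py):

```python
# Build the character vocabulary: 'a' -> 0, 'b' -> 1, ..., 'z' -> 25.
uchars = sorted(set("abcdefghijklmnopqrstuvwxyz"))

BOS = len(uchars)             # 26: the next id after 'z' (25)
vocab_size = len(uchars) + 1  # 27 tokens total

print(BOS)         # 26
print(vocab_size)  # 27
```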
How BOS Works¶
When preparing a name for training, the code wraps it with BOS on both sides:
For the name "emma":
```mermaid
flowchart LR
    B1["BOS<br>(26)"] --> E["e<br>(4)"] --> M1["m<br>(12)"] --> M2["m<br>(12)"] --> A["a<br>(0)"] --> B2["BOS<br>(26)"]
```

Now the training pairs become:
| Input | → | Target | Meaning |
|---|---|---|---|
| BOS | → | 'e' | "After the start signal, predict 'e'" |
| 'e' | → | 'm' | "After 'e', predict 'm'" |
| 'm' | → | 'm' | "After 'm', predict 'm'" |
| 'm' | → | 'a' | "After 'm', predict 'a'" |
| 'a' | → | BOS | "After 'a', predict the STOP signal" |
This solves both problems:
- Starting: The model learns what characters are likely after BOS (i.e., which characters names typically start with)
- Stopping: The model learns when to produce BOS (i.e., when the name should end)
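The wrapping step and the resulting (input, target) pairs can be sketched in a few lines. This is an illustrative version, not microgpt.py's exact code; `ctoi` is a hypothetical character-to-id mapping:

```python
# Character-to-id mapping: 'a' -> 0, ..., 'z' -> 25; BOS gets id 26.
uchars = sorted(set("abcdefghijklmnopqrstuvwxyz"))
ctoi = {c: i for i, c in enumerate(uchars)}
BOS = len(uchars)

def wrap(name):
    """Wrap a name with BOS on both sides: [BOS, ...chars..., BOS]."""
    return [BOS] + [ctoi[c] for c in name] + [BOS]

tokens = wrap("emma")
# [26, 4, 12, 12, 0, 26]

# Each token predicts the next one: zip the sequence with itself shifted by one.
pairs = list(zip(tokens, tokens[1:]))
# [(26, 4), (4, 12), (12, 12), (12, 0), (0, 26)]
```

The first pair teaches the model how names start, and the last pair teaches it when to stop.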
Why One Token for Both?¶
Elegant design
Using the same token for start and stop is elegant:
- It means fewer special tokens (smaller vocabulary)
- During generation, we feed BOS to start, and stop when the model produces BOS
- Mathematically, it doesn't matter — the model learns from context whether BOS means "start" or "stop"
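The generation loop described above can be sketched as follows. Here `sample_next` is a hypothetical stand-in for the trained model (it just returns a random token id), so the point is only the control flow: seed with BOS, stop on BOS:

```python
import random

uchars = sorted(set("abcdefghijklmnopqrstuvwxyz"))
itoc = {i: c for i, c in enumerate(uchars)}  # id -> character
BOS = len(uchars)                            # 26

def sample_next(context):
    # Placeholder for the model's prediction (hypothetical stub).
    return random.randint(0, BOS)

def generate(max_len=20):
    context = [BOS]  # feed BOS to start
    out = []
    for _ in range(max_len):
        tok = sample_next(context)
        if tok == BOS:  # the model produced BOS: the name ends here
            break
        out.append(itoc[tok])
        context.append(tok)
    return "".join(out)
```

Because BOS never appears inside a name, seeing it during sampling can only mean "the name is over."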
The Complete Token System¶
| Token ID | Character | Type |
|---|---|---|
| 0 | 'a' | regular |
| 1 | 'b' | regular |
| ... | ... | regular |
| 25 | 'z' | regular |
| 26 | <BOS> | special |
Altogether, `vocab_size = 27`.
Checkpoint ✓¶
What we know so far
- How to load a dataset of names
- How to convert characters to numbers (tokenization)
- How to mark the start and end of each name (BOS token)
What we don't know yet: how does the model actually use these numbers to make predictions? For that, we need some math — specifically, we need a way to figure out "which direction to nudge" after making a wrong prediction.
Terminology
| Term | Meaning |
|---|---|
| BOS | Beginning of Sequence — a special token marking the start (and end) of a document |
| Special token | A token that doesn't represent a real character; it's a control signal |
| Sequence | An ordered list of tokens |
| Wrapping | Adding special tokens around a document before training |