
The BOS Token

The Problem

Suppose the model is learning from the name "emma". We train it on these predictions:

Given   Predict
'e'     'm'
'm'     'm'
'm'     'a'

But two critical things are missing:

Two missing pieces

  1. How does the model know to start with 'e'? Something has to come before the first character.
  2. How does the model know to stop after 'a'? It could keep generating characters forever.

We need a way to say: "This is the beginning" and "This is the end."
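
To see the gap concretely, here is a tiny sketch (not code from microgpt.py) that builds the naive training pairs for "emma". Nothing in these pairs teaches the model how to begin or when to stop:

doc = "emma"

# Naive (input, target) pairs: each character predicts the next.
pairs = list(zip(doc, doc[1:]))
print(pairs)  # [('e', 'm'), ('m', 'm'), ('m', 'a')]
# No pair says "start with 'e'", and no pair says "stop after 'a'".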

The Solution: BOS (Beginning of Sequence)

microgpt.py uses a single special token called BOS for both purposes:

microgpt.py — Lines 25-26
BOS = len(uchars)      # token id 26 (one past the last character)
vocab_size = len(uchars) + 1  # 27 total tokens

BOS gets the token ID 26 — the next available number after 'z' (which is 25).
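
As an illustration of that arithmetic (microgpt.py derives uchars from the dataset itself; here the lowercase alphabet is hard-coded as an assumption):

# Illustration only: microgpt.py builds uchars from the dataset;
# the names dataset happens to use exactly the lowercase alphabet.
uchars = sorted(set("abcdefghijklmnopqrstuvwxyz"))

BOS = len(uchars)             # 26, one past 'z' (id 25)
vocab_size = len(uchars) + 1  # 27

print(uchars.index('z'), BOS, vocab_size)  # 25 26 27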

How BOS Works

When preparing a name for training, the code wraps it with BOS on both sides:

microgpt.py — Line 157
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]

For the name "emma":

BOS (26) → 'e' (4) → 'm' (12) → 'm' (12) → 'a' (0) → BOS (26)
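
That sequence is exactly what the wrapping line produces. A self-contained sketch (with uchars hard-coded as above, an assumption rather than microgpt.py's own setup):

uchars = sorted("abcdefghijklmnopqrstuvwxyz")  # assumed vocabulary
BOS = len(uchars)  # 26

doc = "emma"
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
print(tokens)  # [26, 4, 12, 12, 0, 26]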

Now the training pairs become:

Input   Target   Meaning
BOS     'e'      "After the start signal, predict 'e'"
'e'     'm'      "After 'e', predict 'm'"
'm'     'm'      "After 'm', predict 'm'"
'm'     'a'      "After 'm', predict 'a'"
'a'     BOS      "After 'a', predict the STOP signal"
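
These pairs fall out of the wrapped token list by zipping it against itself shifted by one position (a sketch; microgpt.py may construct them differently):

tokens = [26, 4, 12, 12, 0, 26]  # BOS + "emma" + BOS

pairs = list(zip(tokens, tokens[1:]))
print(pairs)  # [(26, 4), (4, 12), (12, 12), (12, 0), (0, 26)]
# (26, 4): after BOS, predict 'e'  -> starting is learnable
# (0, 26): after 'a', predict BOS  -> stopping is learnable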

This solves both problems:

  • Starting: The model learns what characters are likely after BOS (i.e., which characters names typically start with)
  • Stopping: The model learns when to produce BOS (i.e., when the name should end)

Why One Token for Both?

Elegant design

Using the same token for start and stop is elegant:

  • It means fewer special tokens (smaller vocabulary)
  • During generation, we feed BOS to start, and stop when the model produces BOS (see the sketch after this list)
  • Mathematically, it doesn't matter — the model learns from context whether BOS means "start" or "stop"
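
Here is that generation loop as a sketch. The sample_next_token function is a hypothetical stand-in; in microgpt.py this step would be a forward pass through the model plus sampling. The structure, feed BOS to start and stop on BOS, is the point:

import random

def sample_next_token(context):
    # Hypothetical stand-in for the real model: picks a random token id.
    return random.randrange(27)

BOS = 26
tokens = [BOS]              # feed BOS to start
while True:
    nxt = sample_next_token(tokens)
    if nxt == BOS:          # the model produced BOS: stop
        break
    tokens.append(nxt)

print(tokens[1:])           # the generated name, as token ids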

The Complete Token System

Token ID   Character   Type
0          'a'         regular
1          'b'         regular
...        ...         regular
25         'z'         regular
26         <BOS>       special

vocab_size = 27
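
The whole table fits in a few lines of code (a sketch; microgpt.py does not build this dictionary):

uchars = sorted("abcdefghijklmnopqrstuvwxyz")  # assumed vocabulary
BOS = len(uchars)

id_to_token = {i: ch for i, ch in enumerate(uchars)}
id_to_token[BOS] = "<BOS>"

print(id_to_token[0], id_to_token[25], id_to_token[26])  # a z <BOS>
print(len(id_to_token))  # 27 == vocab_size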

Checkpoint ✓

What we know so far

  • ✅ How to load a dataset of names
  • ✅ How to convert characters to numbers (tokenization)
  • ✅ How to mark the start and end of each name (BOS token)

What we don't know yet: how does the model actually use these numbers to make predictions? For that, we need some math — specifically, we need a way to figure out "which direction to nudge" after making a wrong prediction.

Terminology
Term            Meaning
BOS             Beginning of Sequence: a special token marking the start (and end) of a document
Special token   A token that doesn't represent a real character; it's a control signal
Sequence        An ordered list of tokens
Wrapping        Adding special tokens around a document before training