GPT in 243 lines — an annotated tour
What the file builds, from top to bottom:
- Dataset (names) → tokenizer (characters) → training samples
- A tiny autograd engine (Value) to do backprop
- A GPT-2-like transformer core (attention + MLP + residuals)
- Adam optimizer + training loop + inference sampling
Part 1 — Setup: imports, randomness, data
The file announces its philosophy: keep only what is algorithmically necessary, remove all “framework comfort”.
No NumPy, no PyTorch. Just filesystem checks, scalar math, and random numbers.
Same seed → same initialization and sampling (useful for learning + debugging).
If the dataset file isn’t present, the script fetches a names list. Then it loads and shuffles documents.
Characters become tokens. One special token BOS (“beginning of sequence”) is added to mark boundaries and to stop generation.
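A minimal sketch of what that tokenizer looks like, assuming a handful of example names (the identifiers `uchars`, `stoi`, `itos`, and `BOS` are illustrative, not necessarily the file's own):

```python
# Hypothetical sketch of a character-level tokenizer with one BOS token.
docs = ["emma", "olivia", "ava"]  # example names

# Vocabulary: every unique character in the data, plus one special BOS token.
uchars = sorted(set("".join(docs)))
BOS = len(uchars)                 # BOS takes the last token id
vocab_size = len(uchars) + 1

stoi = {ch: i for i, ch in enumerate(uchars)}  # char -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> char

def encode(doc):
    # Wrap each document with BOS on both ends: it marks "start" and "end".
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    return "".join(itos[t] for t in tokens if t != BOS)
```

Because BOS sits outside the character set, the same token can mark the start of a sequence during training and the stop signal during generation.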
Part 2 — Autograd: the tiny engine that makes learning possible
Without a deep learning framework, you still need one superpower: compute gradients. microGPT implements a minimal scalar autograd with a Value node: each operation creates a node in a graph, and backward() walks that graph in reverse.
Each scalar stores its numeric value (data), gradient accumulator (grad), and pointers to children + local derivatives.
Each math op returns a new Value and stores “how to backprop through it” (local gradients).
Build a topological ordering of the computation DAG, set loss.grad = 1, then walk the ordering in reverse, summing each node's contribution into its children's gradients until every parameter has its full gradient.
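The three bullets above can be condensed into a toy `Value` class. This is a simplified sketch of the idea, not the file's exact code (it supports only `+`, `*`, and `tanh`):

```python
import math

class Value:
    """A scalar that remembers how it was computed, so backward() can chain-rule through it."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data            # numeric value
        self.grad = 0.0             # gradient accumulator
        self._children = children   # nodes this one was computed from
        self._local = local_grads   # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Topologically order the DAG, then propagate gradients in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local):
                child.grad += lg * v.grad   # chain rule, summed over all paths
```

For z = x·y + x with x = 2 and y = 3, calling `z.backward()` yields x.grad = y + 1 = 4 and y.grad = x = 2, because x contributes through two paths and the gradients sum.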
Part 3 — Parameters: weights, embeddings, transformer blocks
After autograd exists, the rest is “just a program that builds a big computation graph”. microGPT creates a state_dict full of parameters (all are Value scalars).
Tiny values on purpose: small network, short context, fast training on CPU.
Creates weight matrices as nested Python lists of Value initialized from a normal distribution.
The same ingredients as GPT-2 (slightly simplified): embeddings, attention projections, MLP layers, and output projection to vocabulary logits.
Collect every scalar parameter into one list so the optimizer can iterate easily.
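A sketch of what that parameter setup might look like (plain floats here for brevity; in the file every entry would be a `Value`, and the names `wte`, `wpe`, `lm_head` and the sizes are illustrative assumptions):

```python
import random
random.seed(42)  # same seed -> same initialization

# Weight matrices as nested Python lists, drawn from a normal distribution.
def matrix(nout, nin, std=0.02):
    return [[random.gauss(0.0, std) for _ in range(nin)] for _ in range(nout)]

n_embd, vocab_size, context_len = 16, 27, 8  # tiny on purpose

state_dict = {
    "wte": matrix(vocab_size, n_embd),       # token embeddings ("what")
    "wpe": matrix(context_len, n_embd),      # position embeddings ("where")
    "lm_head": matrix(vocab_size, n_embd),   # output projection to logits
}

# Flatten every scalar into one list so the optimizer can iterate easily.
params = [p for mat in state_dict.values() for row in mat for p in row]
```

A real build would add the attention and MLP projections per block, but the pattern is the same: nested lists in, one flat parameter list out.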
Part 4 — Building blocks: linear, softmax, normalization
A pure-Python dense layer: each output element is the dot product of one weight row with the input vector. Returns a vector of Value.
Turns logits into probabilities. Subtracting max improves numerical stability even in tiny examples.
A simplified normalization (RMSNorm) to keep activations in a reasonable range.
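All three building blocks fit in a few lines each. A sketch on plain floats (the file's versions operate on `Value` scalars instead, so gradients flow through them):

```python
import math

def linear(x, W):
    # One dot product per weight row: out[i] = sum_j W[i][j] * x[j]
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def softmax(logits):
    # Subtract the max so exp() can never overflow; the result is unchanged.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x, eps=1e-5):
    # Divide by the root-mean-square so activations stay in a reasonable range.
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(ms + eps) for xi in x]
```

Note that this RMSNorm has no learned gain, in keeping with the "only what is algorithmically necessary" philosophy; production variants usually scale by a trainable weight per channel.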
Part 5 — The GPT forward pass: embeddings → attention → MLP → logits
This is the heart: a function that takes the current token, current position, and cached keys/values and returns logits for “what comes next”.
Token embedding carries “what”; position embedding carries “where”. Their sum forms the input vector.
Compute queries/keys/values, append keys/values to cache, compute attention scores across previous positions, then combine values into a context vector. Residual connection adds stability.
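The attention step above, sketched for a single head on plain floats (the function name `attend` and its exact signature are my own; the file interleaves this with the `Value` graph):

```python
import math

def attend(q, k, v, k_cache, v_cache):
    """One attention step: append k/v to the cache, attend over all cached positions."""
    k_cache.append(k)
    v_cache.append(v)
    d = len(q)
    # Scaled dot-product score of the query against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, kc)) / math.sqrt(d)
              for kc in k_cache]
    # Softmax over positions (max-subtracted for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of cached values -> context vector.
    return [sum(w * vc[i] for w, vc in zip(weights, v_cache))
            for i in range(d)]
```

Because the cache only ever grows with past positions, causality is automatic: position t can only attend to positions 0..t.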
A simple feed-forward network expands then contracts (4× width), adding nonlinearity and capacity. Another residual keeps gradients healthy.
Final linear layer produces one logit per vocabulary token: the raw “scores” for next-token prediction.
Part 6 — Adam: the optimizer that updates parameters
Adam keeps running estimates of first and second moments of gradients (m and v). Here they are plain Python floats.
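The update rule, sketched with plain-float buffers (hyperparameter values are the usual Adam defaults, assumed rather than taken from the file):

```python
def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update over flat lists of parameters and gradients; t starts at 1."""
    for i, g in enumerate(grads):
        m[i] = b1 * m[i] + (1 - b1) * g        # first moment: running mean of grads
        v[i] = b2 * v[i] + (1 - b2) * g * g    # second moment: running mean of grad^2
        m_hat = m[i] / (1 - b1 ** t)           # bias correction (moments start at 0)
        v_hat = v[i] / (1 - b2 ** t)
        params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)
    return params
```

The division by `sqrt(v_hat)` is what makes the step size roughly uniform across parameters regardless of gradient scale; the bias correction matters most in the first few iterations.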
Part 7 — Training loop: build graph → loss → backward → update
This loop is the entire “learning” story: tokenize a document, predict the next token at each position, compute cross-entropy loss, backpropagate, then update all weights with Adam.
We wrap each name with BOS at both ends so the model learns “start” and “end”.
Classic next-token training: at each position, predict the following character. The loss is average negative log-likelihood.
Builds gradients for every parameter by walking the computation graph in reverse.
Updates p.data using Adam’s normalized step, then resets gradients to zero for the next iteration.
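The loss half of the loop can be sketched like this; `probs_at` is a stand-in for the real forward pass (which returns `Value`s so that `backward()` works):

```python
import math

def cross_entropy_loss(tokens, probs_at):
    """Average negative log-likelihood of each next token in one document."""
    losses = []
    for pos in range(len(tokens) - 1):
        target = tokens[pos + 1]            # the token we should predict next
        p = probs_at(tokens[pos], pos)      # model's next-token distribution
        losses.append(-math.log(p[target])) # low probability -> high loss
    return sum(losses) / len(losses)
```

A sanity check: a model that is uniformly uncertain over a vocabulary of size V scores exactly log(V) per position, which is the loss you expect to see at the start of training.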
Part 8 — Inference: generate new names (token by token)
Lower temperature → safer, more repetitive outputs. Higher temperature → more variety (and more nonsense).
At each step: compute probabilities, sample a token, stop if BOS (end), otherwise append character and continue.
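The sampling loop, sketched; `logits_at` stands in for the model's forward pass, and the function name and signature are illustrative:

```python
import math
import random

def sample_name(logits_at, BOS, itos, temperature=0.8, max_len=20):
    """Generate one name token by token, stopping when BOS is sampled."""
    token, out = BOS, []
    for pos in range(max_len):
        # Divide logits by temperature: <1 sharpens, >1 flattens the distribution.
        logits = [l / temperature for l in logits_at(token, pos)]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        token = random.choices(range(len(probs)), weights=probs)[0]
        if token == BOS:      # BOS doubles as the end-of-name signal
            break
        out.append(itos[token])
    return "".join(out)
```

As temperature approaches 0 this collapses toward greedy decoding (always the most likely character); as it grows, the choice approaches uniform and the names get noisier.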
What I take from this (as a builder)
- LLMs aren’t magic. The “algorithmic core” fits on a page; scale and efficiency are the real beasts.
- Frameworks hide the graph. Here you can literally feel what backprop means.
- Production ≠ learning. This file is education, not a replacement for modern tooling.
- It’s a great mental model. When you build AI products, it helps to remember what is truly happening.