GPT in 243 lines — an annotated tour
What the file builds, from top to bottom:
- Dataset (names) → tokenizer (characters) → training samples
- A tiny autograd engine (Value) to do backprop
- A GPT-2-like transformer core (attention + MLP + residuals)
- Adam optimizer + training loop + inference sampling
Part 1 — Setup: imports, randomness, data
The file announces its philosophy: keep only what is algorithmically necessary, remove all “framework comfort”.
No NumPy, no PyTorch. Just filesystem checks, scalar math, and random numbers.
Same seed → same initialization and sampling (useful for learning + debugging).
If the dataset file isn’t present, the script fetches a names list. Then it loads and shuffles documents.
Characters become tokens. One special token BOS (“beginning of sequence”) is added to mark boundaries and to stop generation.
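A minimal sketch of what that tokenizer looks like, assuming a handful of example names (the identifiers `uchars`, `stoi`, `itos`, and `BOS` are illustrative, not necessarily the file's own):

```python
# Hypothetical sketch of a character-level tokenizer with one BOS token.
docs = ["emma", "olivia", "ava"]  # example names

# Vocabulary: every unique character in the data, plus one special BOS token.
uchars = sorted(set("".join(docs)))
BOS = len(uchars)                 # BOS takes the last token id
vocab_size = len(uchars) + 1

stoi = {ch: i for i, ch in enumerate(uchars)}  # char -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> char

def encode(doc):
    # Wrap each document with BOS on both ends: it marks "start" and "end".
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    return "".join(itos[t] for t in tokens if t != BOS)
```

Because BOS sits outside the character set, the same token can mark the start of a sequence during training and the stop signal during generation.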
Part 2 — Autograd: the tiny engine that makes learning possible
Without a deep learning framework, you still need one superpower: compute gradients. microGPT implements a minimal scalar autograd with a Value node: each operation creates a node in a graph, and backward() walks that graph in reverse.
Each scalar stores its numeric value (data), gradient accumulator (grad), and pointers to children + local derivatives.
Each math op returns a new Value and stores “how to backprop through it” (local gradients).
Build a topological ordering of the computation DAG, set loss.grad = 1, then walk the ordering in reverse, summing each node's contribution into its children's gradients until every parameter has its full gradient.
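The three bullets above can be condensed into a toy `Value` class. This is a simplified sketch of the idea, not the file's exact code (it supports only `+`, `*`, and `tanh`):

```python
import math

class Value:
    """A scalar that remembers how it was computed, so backward() can chain-rule through it."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data            # numeric value
        self.grad = 0.0             # gradient accumulator
        self._children = children   # nodes this one was computed from
        self._local = local_grads   # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Topologically order the DAG, then propagate gradients in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local):
                child.grad += lg * v.grad   # chain rule, summed over all paths
```

For z = x·y + x with x = 2 and y = 3, calling `z.backward()` yields x.grad = y + 1 = 4 and y.grad = x = 2, because x contributes through two paths and the gradients sum.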
Part 3 — Parameters: weights, embeddings, transformer blocks
After autograd exists, the rest is “just a program that builds a big computation graph”. microGPT creates a state_dict full of parameters (all are Value scalars).
Tiny values on purpose: small network, short context, fast training on CPU.
Creates weight matrices as nested Python lists of Value initialized from a normal distribution.
The same ingredients as GPT-2 (slightly simplified): embeddings, attention projections, MLP layers, and output projection to vocabulary logits.
Collect every scalar parameter into one list so the optimizer can iterate easily.
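A sketch of what that parameter setup might look like (plain floats here for brevity; in the file every entry would be a `Value`, and the names `wte`, `wpe`, `lm_head` and the sizes are illustrative assumptions):

```python
import random
random.seed(42)  # same seed -> same initialization

# Weight matrices as nested Python lists, drawn from a normal distribution.
def matrix(nout, nin, std=0.02):
    return [[random.gauss(0.0, std) for _ in range(nin)] for _ in range(nout)]

n_embd, vocab_size, context_len = 16, 27, 8  # tiny on purpose

state_dict = {
    "wte": matrix(vocab_size, n_embd),       # token embeddings ("what")
    "wpe": matrix(context_len, n_embd),      # position embeddings ("where")
    "lm_head": matrix(vocab_size, n_embd),   # output projection to logits
}

# Flatten every scalar into one list so the optimizer can iterate easily.
params = [p for mat in state_dict.values() for row in mat for p in row]
```

A real build would add the attention and MLP projections per block, but the pattern is the same: nested lists in, one flat parameter list out.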
Part 4 — Building blocks: linear, softmax, normalization
A pure-Python dense layer: each output element is the dot product of one weight row with the input vector. Returns a vector of Value.
Turns logits into probabilities. Subtracting max improves numerical stability even in tiny examples.
A simplified normalization (RMSNorm) to keep activations in a reasonable range.
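All three building blocks fit in a few lines each. A sketch on plain floats (the file's versions operate on `Value` scalars instead, so gradients flow through them):

```python
import math

def linear(x, W):
    # One dot product per weight row: out[i] = sum_j W[i][j] * x[j]
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def softmax(logits):
    # Subtract the max so exp() can never overflow; the result is unchanged.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rmsnorm(x, eps=1e-5):
    # Divide by the root-mean-square so activations stay in a reasonable range.
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(ms + eps) for xi in x]
```

Note that this RMSNorm has no learned gain, in keeping with the "only what is algorithmically necessary" philosophy; production variants usually scale by a trainable weight per channel.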
Part 5 — The GPT forward pass: embeddings → attention → MLP → logits
This is the heart: a function that takes the current token, current position, and cached keys/values and returns logits for “what comes next”.
Token embedding carries “what”; position embedding carries “where”. Their sum forms the input vector.
Compute queries/keys/values, append keys/values to cache, compute attention scores across previous positions, then combine values into a context vector. Residual connection adds stability.
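The attention step above, sketched for a single head on plain floats (the function name `attend` and its exact signature are my own; the file interleaves this with the `Value` graph):

```python
import math

def attend(q, k, v, k_cache, v_cache):
    """One attention step: append k/v to the cache, attend over all cached positions."""
    k_cache.append(k)
    v_cache.append(v)
    d = len(q)
    # Scaled dot-product score of the query against every cached key.
    scores = [sum(qi * ki for qi, ki in zip(q, kc)) / math.sqrt(d)
              for kc in k_cache]
    # Softmax over positions (max-subtracted for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of cached values -> context vector.
    return [sum(w * vc[i] for w, vc in zip(weights, v_cache))
            for i in range(d)]
```

Because the cache only ever grows with past positions, causality is automatic: position t can only attend to positions 0..t.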
A simple feed-forward network expands then contracts (4× width), adding nonlinearity and capacity. Another residual keeps gradients healthy.
Final linear layer produces one logit per vocabulary token: the raw “scores” for next-token prediction.
Part 6 — Adam: the optimizer that updates parameters
Adam keeps running estimates of first and second moments of gradients (m and v). Here they are plain Python floats.
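The update rule, sketched with plain-float buffers (hyperparameter values are the usual Adam defaults, assumed rather than taken from the file):

```python
def adam_step(params, grads, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update over flat lists of parameters and gradients; t starts at 1."""
    for i, g in enumerate(grads):
        m[i] = b1 * m[i] + (1 - b1) * g        # first moment: running mean of grads
        v[i] = b2 * v[i] + (1 - b2) * g * g    # second moment: running mean of grad^2
        m_hat = m[i] / (1 - b1 ** t)           # bias correction (moments start at 0)
        v_hat = v[i] / (1 - b2 ** t)
        params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)
    return params
```

The division by `sqrt(v_hat)` is what makes the step size roughly uniform across parameters regardless of gradient scale; the bias correction matters most in the first few iterations.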
Part 7 — Training loop: build graph → loss → backward → update
This loop is the entire “learning” story: tokenize a document, predict the next token at each position, compute cross-entropy loss, backpropagate, then update all weights with Adam.
We wrap each name with BOS at both ends so the model learns “start” and “end”.
Classic next-token training: at each position, predict the following character. The loss is average negative log-likelihood.
Builds gradients for every parameter by walking the computation graph in reverse.
Updates p.data using Adam’s normalized step, then resets gradients to zero for the next iteration.
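The loss half of the loop can be sketched like this; `probs_at` is a stand-in for the real forward pass (which returns `Value`s so that `backward()` works):

```python
import math

def cross_entropy_loss(tokens, probs_at):
    """Average negative log-likelihood of each next token in one document."""
    losses = []
    for pos in range(len(tokens) - 1):
        target = tokens[pos + 1]            # the token we should predict next
        p = probs_at(tokens[pos], pos)      # model's next-token distribution
        losses.append(-math.log(p[target])) # low probability -> high loss
    return sum(losses) / len(losses)
```

A sanity check: a model that is uniformly uncertain over a vocabulary of size V scores exactly log(V) per position, which is the loss you expect to see at the start of training.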
Part 8 — Inference: generate new names (token by token)
Lower temperature → safer, more repetitive outputs. Higher temperature → more variety (and more nonsense).
At each step: compute probabilities, sample a token, stop if BOS (end), otherwise append character and continue.
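The sampling loop, sketched; `logits_at` stands in for the model's forward pass, and the function name and signature are illustrative:

```python
import math
import random

def sample_name(logits_at, BOS, itos, temperature=0.8, max_len=20):
    """Generate one name token by token, stopping when BOS is sampled."""
    token, out = BOS, []
    for pos in range(max_len):
        # Divide logits by temperature: <1 sharpens, >1 flattens the distribution.
        logits = [l / temperature for l in logits_at(token, pos)]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        token = random.choices(range(len(probs)), weights=probs)[0]
        if token == BOS:      # BOS doubles as the end-of-name signal
            break
        out.append(itos[token])
    return "".join(out)
```

As temperature approaches 0 this collapses toward greedy decoding (always the most likely character); as it grows, the choice approaches uniform and the names get noisier.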
What I take from this (as a builder)
- LLMs aren’t magic. The “algorithmic core” fits on a page; scale and efficiency are the real beasts.
- Frameworks hide the graph. Here you can literally feel what backprop means.
- Production ≠ learning. This file is education, not a replacement for modern tooling.
- It’s a great mental model. When you build AI products, it helps to remember what is truly happening.