Demystify LLMs in 200 Lines of Code.
Read the actual code behind language models. Highlight any line. Get a precise explanation of what it does and why.
class Head(nn.Module):
    """ one head of self-attention """
    def forward(self, x):
        q = self.query(x)
        k = self.key(x)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5
Head is one attention head. It projects input x into queries, keys, and values, then computes scaled dot-product attention...
or scroll to understand how it works ↓
How LLMs Learn
At its core, an LLM learns by repeatedly trying to predict the next token in a sequence — and getting better at it over time. Here's the loop:
Feed tokens one by one
The model reads a sequence ("e", "m", "m", "a"…) and at each step outputs a probability distribution over all possible next tokens.
Measure surprise (the loss)
If the correct next token had low probability, the model was "surprised" — that's a high loss. If it predicted well, low loss. This is called cross-entropy loss.
Backpropagate
The loss flows backwards through the entire network via the chain rule of calculus, computing for every parameter: "if I nudge this number, does the loss go up or down?" Those are the gradients.
Update parameters
An optimizer (like Adam) uses the gradients to nudge every parameter slightly in the direction that reduces the loss. Repeat thousands — or billions — of times.
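The four steps above fit in a handful of PyTorch lines. This is a toy sketch, not the full transformer: the "model" is a single embedding table (a bigram predictor) and the data is random, but the loop is the same one the real code runs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 65  # e.g. number of distinct characters
# Toy "model": one embedding table mapping each token straight to
# logits for the next token (a bigram model, not the full transformer).
model = torch.nn.Embedding(vocab_size, vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

# Fake training data: random token sequences, shifted by one position.
tokens = torch.randint(0, vocab_size, (8, 17))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

losses = []
for step in range(200):
    logits = model(inputs)                     # 1. predict: (batch, time, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                           targets.reshape(-1))  # 2. measure surprise
    losses.append(loss.item())
    optimizer.zero_grad()
    loss.backward()    # 3. backpropagate: compute all the gradients
    optimizer.step()   # 4. nudge every parameter to reduce the loss

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Run it and the printed loss goes down: that decrease, repeated at scale, is all "learning" is.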
That's it. The "knowledge" of the model is nothing more than billions of floating-point numbers that have been gradually shaped by this process until the model becomes very good at predicting what comes next in text.
A model trained on internet-scale text ends up implicitly learning grammar, facts, reasoning patterns, and even style — not because anyone programmed those in, but because they're all useful for predicting the next token.
FAQ
Every technical term you'll encounter in the code, explained in plain English.
What is a tensor?
A tensor is a multi-dimensional array of numbers. A single number is a 0D tensor, a list is 1D, a table is 2D, and so on. In the code, torch.tensor(...) wraps data into this format so the GPU can process it efficiently.
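For instance (shapes shown with PyTorch, as in the code):

```python
import torch

scalar = torch.tensor(3.14)                   # 0D tensor: a single number
vector = torch.tensor([1.0, 2.0, 3.0])        # 1D tensor: a list
matrix = torch.tensor([[1., 2.], [3., 4.]])   # 2D tensor: a table

print(scalar.shape, vector.shape, matrix.shape)
```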
What is a token?
A token is the smallest unit of text the model works with. In this code each character is a token; production models use subword chunks (like "the" or "ing") so one word may be one or several tokens.
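A minimal character-level tokenizer, in the spirit of the one in the code (variable names here are illustrative):

```python
text = "emma"
vocab = sorted(set(text))                     # ['a', 'e', 'm']
stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> token ID
itos = {i: ch for ch, i in stoi.items()}      # token ID -> character

encoded = [stoi[ch] for ch in text]           # "emma" -> [1, 2, 2, 0]
decoded = "".join(itos[i] for i in encoded)   # [1, 2, 2, 0] -> "emma"
print(encoded, decoded)
```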
What is an embedding?
An embedding is a learned lookup table that converts a token ID (just a number) into a vector of floating-point numbers the network can actually compute with. The model learns what these vectors should be during training.
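In PyTorch the lookup table is `nn.Embedding` (dimensions here are illustrative):

```python
import torch

vocab_size, n_embd = 65, 32
emb = torch.nn.Embedding(vocab_size, n_embd)  # 65 learnable 32-dim vectors

token_ids = torch.tensor([7, 7, 42])  # three token IDs (two identical)
vectors = emb(token_ids)              # look up a vector for each ID

print(vectors.shape)  # (3, 32): one vector per token
```

Note that the two identical IDs get the identical vector: the table is just a lookup.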
What is self-attention?
Self-attention is the mechanism that lets each token "look at" every previous token and decide how much to attend to each one. It computes queries, keys, and values — then uses dot products to figure out what information to pull in.
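Stripped to its core, one causal self-attention step looks like this (a sketch with tiny dimensions, not the exact code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 1, 4, 8                      # batch, tokens, channels
x = torch.randn(B, T, C)

query = torch.nn.Linear(C, C, bias=False)
key   = torch.nn.Linear(C, C, bias=False)
value = torch.nn.Linear(C, C, bias=False)

q, k, v = query(x), key(x), value(x)
wei = q @ k.transpose(-2, -1) * C**-0.5          # scaled dot products
mask = torch.tril(torch.ones(T, T))              # lower-triangular mask
wei = wei.masked_fill(mask == 0, float("-inf"))  # no looking at the future
wei = F.softmax(wei, dim=-1)                     # each row sums to 1
out = wei @ v                                    # weighted mix of values

print(out.shape)  # (1, 4, 8): same shape as x
```

The mask is what makes this "previous tokens only": token 0 puts zero weight on everything after it.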
What is an attention head?
An attention head is one independent attention computation. Each head can learn to focus on a different pattern (e.g. one might track grammar, another proximity). Multi-head attention runs several heads in parallel and concatenates the results.
What is a loss function?
A loss function produces a single number that measures how wrong the model's predictions are. Lower is better. The code uses cross-entropy loss: it checks how much probability the model assigned to the correct next token.
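For example, with a made-up 3-token vocabulary:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1]])  # model's scores for 3 tokens

low_loss  = F.cross_entropy(logits, torch.tensor([0]))  # predicted well
high_loss = F.cross_entropy(logits, torch.tensor([2]))  # "surprised"

# cross-entropy is just -log(probability assigned to the correct token)
p_correct = F.softmax(logits, dim=-1)[0, 0]
print(low_loss.item(), high_loss.item())
```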
What is a gradient?
For each parameter, the gradient says "if I nudge this number up a tiny bit, does the loss go up or down, and by how much?" It's the direction and steepness the optimizer uses to improve the model.
What is backpropagation?
Backpropagation is the algorithm that computes gradients. It starts at the loss and works backwards through every operation in the network, applying the chain rule of calculus at each step. That's what loss.backward() does.
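Both ideas fit in a few lines. Here the whole "network" is one parameter w, small enough to check the gradient by hand:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # one trainable parameter
x = torch.tensor(3.0)
loss = (w * x - 5.0) ** 2                  # loss = (2*3 - 5)^2 = 1

loss.backward()    # chain rule, applied backwards from the loss
print(w.grad)      # d(loss)/dw = 2*(w*x - 5)*x = 2*1*3 = 6
```

The gradient of 6 means: nudge w up a tiny bit and the loss rises about 6 times as fast.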
What is an optimizer like Adam?
An optimizer is the algorithm that actually updates the parameters using the gradients. Adam is smarter than plain gradient descent: it keeps a running average of past gradients and adapts the learning rate per-parameter.
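In miniature (one parameter, one step; the target value is made up):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.Adam([w], lr=0.1)

loss = (w - 5.0) ** 2   # minimized at w = 5
loss.backward()          # w.grad is now 2*(2-5) = -6
optimizer.step()         # Adam uses w.grad to nudge w toward lower loss

print(w.item())  # moved from 2.0 toward 5
```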
What is a learning rate?
The learning rate controls how big a step the optimizer takes each update. Too high and the model overshoots; too low and it learns too slowly. The code sets it to 3e-4 (0.0003).
What is dropout?
Dropout is a technique that randomly zeroes out a fraction of activations during training. This forces the network not to rely on any single neuron and helps prevent overfitting. It's set to 0.2 in the code (20% of values dropped).
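You can watch it work (p=0.2 as in the code; the tensor of ones is just for counting):

```python
import torch

torch.manual_seed(0)
drop = torch.nn.Dropout(p=0.2)   # drop 20% of values during training

x = torch.ones(10000)
drop.train()                     # training mode: dropout is active
y = drop(x)
frac_zeroed = (y == 0).float().mean().item()
print(frac_zeroed)               # roughly 0.2

drop.eval()                      # inference mode: dropout is a no-op
print(torch.equal(drop(x), x))
```

The surviving values are scaled up by 1/(1-p) = 1.25, so the expected total stays the same between training and inference.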
What is LayerNorm?
LayerNorm normalizes the values within each layer so they don't grow or shrink uncontrollably. It stabilizes training and makes the network easier to optimize.
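For example, large drifting values come out centered near 0 with spread near 1:

```python
import torch

ln = torch.nn.LayerNorm(4)
x = torch.tensor([[100.0, 101.0, 102.0, 103.0]])  # large, shifted values
y = ln(x)

mean = y.mean().item()
var = ((y - y.mean()) ** 2).mean().item()
print(mean, var)  # ~0 and ~1: rescaled to a standard range
```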
What is a residual connection?
A residual connection is the "x = x + block(x)" pattern. Instead of replacing the input, the block's output is added to it. This lets gradients flow straight through and makes deep networks trainable.
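The pattern in one line, plus a sanity check: if the block contributes nothing, the input passes through untouched, which is what makes deep stacks easy to train (the Linear block here is a stand-in for attention or an MLP):

```python
import torch

torch.manual_seed(0)
block = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

out = x + block(x)   # residual connection: add the block's output to x

# If the block learned to output zeros, the layer is a perfect identity:
with torch.no_grad():
    block.weight.zero_()
    block.bias.zero_()
print(torch.equal(x + block(x), x))
```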
What is a feed-forward network (MLP)?
A feed-forward network (or MLP) is a two-layer neural network applied to each token independently. If attention is "communication" between tokens, the MLP is "computation" — where the model does per-position processing.
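A sketch of the expand-then-project shape (the 4x expansion is the usual transformer convention), with a check that each token really is processed independently:

```python
import torch

torch.manual_seed(0)
n_embd = 32
mlp = torch.nn.Sequential(
    torch.nn.Linear(n_embd, 4 * n_embd),  # expand 4x
    torch.nn.ReLU(),                      # nonlinearity
    torch.nn.Linear(4 * n_embd, n_embd),  # project back down
)

x = torch.randn(2, 10, n_embd)   # (batch, tokens, channels)
out = mlp(x)

# Changing token 0 leaves every other token's output unchanged:
x2 = x.clone()
x2[:, 0, :] = 0.0
independent = torch.equal(mlp(x)[:, 1:], mlp(x2)[:, 1:])
print(out.shape, independent)
```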
What are logits?
Logits are raw scores the model outputs for each possible next token (one number per vocabulary entry). They're not probabilities yet — you pass them through softmax to get those.
What is softmax?
Softmax converts a vector of raw scores into probabilities that sum to 1. Higher scores get exponentially more probability. It's how the model turns logits into a prediction distribution.
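With made-up scores:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 1.0, 0.2])   # raw scores
probs = F.softmax(logits, dim=-1)        # exponentiate, then normalize

print(probs, probs.sum().item())  # probabilities summing to 1
```

Note how the gap widens: a score lead of 2 becomes a much bigger probability lead, because softmax exponentiates.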
What is batch size?
Batch size is how many independent sequences the model processes at once. Bigger batches give more stable gradient estimates but use more memory. The code uses 64.
What is context length (block size)?
Context length is the maximum number of tokens the model can look back at. Set to 256 in the code — meaning the model sees at most 256 characters of history when predicting the next one.