Demystify LLMs in 200 Lines of Code.
Read the actual code behind language models. Highlight any line. Get a precise explanation of what it does and why.
class Head(nn.Module):
    """ one head of self-attention """
    def forward(self, x):
        q = self.query(x)
        k = self.key(x)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5
Head is one attention head. It projects input x into queries, keys, and values, then computes scaled dot-product attention...
or scroll to understand how it works ↓
How LLMs Learn
At its core, an LLM learns by repeatedly trying to predict the next token in a sequence — and getting better at it over time. Here's the loop:
Feed tokens one by one
The model reads a sequence ("e", "m", "m", "a"…) and at each step outputs a probability distribution over all possible next tokens.
Measure surprise (the loss)
If the correct next token had low probability, the model was "surprised" — that's a high loss. If it predicted well, low loss. This is called cross-entropy loss.
Backpropagate
The loss flows backwards through the entire network via the chain rule of calculus, computing for every parameter: "if I nudge this number, does the loss go up or down?" Those are the gradients.
Update parameters
An optimizer (like Adam) uses the gradients to nudge every parameter slightly in the direction that reduces the loss. Repeat thousands — or billions — of times.
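The four steps above fit in a handful of PyTorch lines. This is a toy sketch, not the full transformer: the "model" is a single embedding table (a bigram predictor) and the data is random, but the loop is the same one the real code runs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 65  # e.g. number of distinct characters
# Toy "model": one embedding table mapping each token straight to
# logits for the next token (a bigram model, not the full transformer).
model = torch.nn.Embedding(vocab_size, vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

# Fake training data: random token sequences, shifted by one position.
tokens = torch.randint(0, vocab_size, (8, 17))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

losses = []
for step in range(200):
    logits = model(inputs)                     # 1. predict: (batch, time, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                           targets.reshape(-1))  # 2. measure surprise
    losses.append(loss.item())
    optimizer.zero_grad()
    loss.backward()    # 3. backpropagate: compute all the gradients
    optimizer.step()   # 4. nudge every parameter to reduce the loss

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Run it and the printed loss goes down: that decrease, repeated at scale, is all "learning" is.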
That's it. The "knowledge" of the model is nothing more than billions of floating-point numbers that have been gradually shaped by this process until the model becomes very good at predicting what comes next in text.
A model trained on internet-scale text ends up implicitly learning grammar, facts, reasoning patterns, and even style — not because anyone programmed those in, but because they're all useful for predicting the next token.
FAQ
Every technical term you'll encounter in the code, explained in plain English.
What is a tensor?
A tensor is a multi-dimensional array of numbers. A single number is a 0D tensor, a list is 1D, a table is 2D, and so on. In the code, torch.tensor(...) wraps data into this format so the GPU can process it efficiently.
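For instance (shapes shown with PyTorch, as in the code):

```python
import torch

scalar = torch.tensor(3.14)                   # 0D tensor: a single number
vector = torch.tensor([1.0, 2.0, 3.0])        # 1D tensor: a list
matrix = torch.tensor([[1., 2.], [3., 4.]])   # 2D tensor: a table

print(scalar.shape, vector.shape, matrix.shape)
```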
What is a token?
A token is the smallest unit of text the model works with. In this code each character is a token; production models use subword chunks (like "the" or "ing") so one word may be one or several tokens.
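A minimal character-level tokenizer, in the spirit of the one in the code (variable names here are illustrative):

```python
text = "emma"
vocab = sorted(set(text))                     # ['a', 'e', 'm']
stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> token ID
itos = {i: ch for ch, i in stoi.items()}      # token ID -> character

encoded = [stoi[ch] for ch in text]           # "emma" -> [1, 2, 2, 0]
decoded = "".join(itos[i] for i in encoded)   # [1, 2, 2, 0] -> "emma"
print(encoded, decoded)
```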
What is an embedding?
An embedding is a learned lookup table that converts a token ID (just a number) into a vector of floating-point numbers the network can actually compute with. The model learns what these vectors should be during training.
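In PyTorch the lookup table is `nn.Embedding` (dimensions here are illustrative):

```python
import torch

vocab_size, n_embd = 65, 32
emb = torch.nn.Embedding(vocab_size, n_embd)  # 65 learnable 32-dim vectors

token_ids = torch.tensor([7, 7, 42])  # three token IDs (two identical)
vectors = emb(token_ids)              # look up a vector for each ID

print(vectors.shape)  # (3, 32): one vector per token
```

Note that the two identical IDs get the identical vector: the table is just a lookup.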
What is self-attention?
Self-attention is the mechanism that lets each token "look at" every previous token and decide how much to attend to each one. It computes queries, keys, and values — then uses dot products to figure out what information to pull in.
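Stripped to its core, one causal self-attention step looks like this (a sketch with tiny dimensions, not the exact code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C = 1, 4, 8                      # batch, tokens, channels
x = torch.randn(B, T, C)

query = torch.nn.Linear(C, C, bias=False)
key   = torch.nn.Linear(C, C, bias=False)
value = torch.nn.Linear(C, C, bias=False)

q, k, v = query(x), key(x), value(x)
wei = q @ k.transpose(-2, -1) * C**-0.5          # scaled dot products
mask = torch.tril(torch.ones(T, T))              # lower-triangular mask
wei = wei.masked_fill(mask == 0, float("-inf"))  # no looking at the future
wei = F.softmax(wei, dim=-1)                     # each row sums to 1
out = wei @ v                                    # weighted mix of values

print(out.shape)  # (1, 4, 8): same shape as x
```

The mask is what makes this "previous tokens only": token 0 puts zero weight on everything after it.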
What is an attention head?
An attention head is one independent attention computation. Each head can learn to focus on a different pattern (e.g. one might track grammar, another proximity). Multi-head attention runs several heads in parallel and concatenates the results.
What is a loss function?
A loss function produces a single number that measures how wrong the model's predictions are. Lower is better. The code uses cross-entropy loss: it checks how much probability the model assigned to the correct next token.
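For example, with a made-up 3-token vocabulary:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1]])  # model's scores for 3 tokens

low_loss  = F.cross_entropy(logits, torch.tensor([0]))  # predicted well
high_loss = F.cross_entropy(logits, torch.tensor([2]))  # "surprised"

# cross-entropy is just -log(probability assigned to the correct token)
p_correct = F.softmax(logits, dim=-1)[0, 0]
print(low_loss.item(), high_loss.item())
```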
What is a gradient?
For each parameter, the gradient says "if I nudge this number up a tiny bit, does the loss go up or down, and by how much?" It's the direction and steepness the optimizer uses to improve the model.
What is backpropagation?
Backpropagation is the algorithm that computes gradients. It starts at the loss and works backwards through every operation in the network, applying the chain rule of calculus at each step. That's what loss.backward() does.
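Both ideas fit in a few lines. Here the whole "network" is one parameter w, small enough to check the gradient by hand:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # one trainable parameter
x = torch.tensor(3.0)
loss = (w * x - 5.0) ** 2                  # loss = (2*3 - 5)^2 = 1

loss.backward()    # chain rule, applied backwards from the loss
print(w.grad)      # d(loss)/dw = 2*(w*x - 5)*x = 2*1*3 = 6
```

The gradient of 6 means: nudge w up a tiny bit and the loss rises about 6 times as fast.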
What is an optimizer like Adam?
An optimizer is the algorithm that actually updates the parameters using the gradients. Adam is smarter than plain gradient descent: it keeps a running average of past gradients and adapts the learning rate per-parameter.
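In miniature (one parameter, one step; the target value is made up):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.Adam([w], lr=0.1)

loss = (w - 5.0) ** 2   # minimized at w = 5
loss.backward()          # w.grad is now 2*(2-5) = -6
optimizer.step()         # Adam uses w.grad to nudge w toward lower loss

print(w.item())  # moved from 2.0 toward 5
```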
What is a learning rate?
The learning rate controls how big a step the optimizer takes each update. Too high and the model overshoots; too low and it learns too slowly. The code sets it to 3e-4 (0.0003).
What is dropout?
Dropout is a technique that randomly zeroes out a fraction of activations during training. This forces the network not to rely on any single neuron and helps prevent overfitting. It's set to 0.2 in the code (20% of values dropped).
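You can watch it work (p=0.2 as in the code; the tensor of ones is just for counting):

```python
import torch

torch.manual_seed(0)
drop = torch.nn.Dropout(p=0.2)   # drop 20% of values during training

x = torch.ones(10000)
drop.train()                     # training mode: dropout is active
y = drop(x)
frac_zeroed = (y == 0).float().mean().item()
print(frac_zeroed)               # roughly 0.2

drop.eval()                      # inference mode: dropout is a no-op
print(torch.equal(drop(x), x))
```

The surviving values are scaled up by 1/(1-p) = 1.25, so the expected total stays the same between training and inference.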
What is LayerNorm?
LayerNorm normalizes the values within each layer so they don't grow or shrink uncontrollably. It stabilizes training and makes the network easier to optimize.
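For example, large drifting values come out centered near 0 with spread near 1:

```python
import torch

ln = torch.nn.LayerNorm(4)
x = torch.tensor([[100.0, 101.0, 102.0, 103.0]])  # large, shifted values
y = ln(x)

mean = y.mean().item()
var = ((y - y.mean()) ** 2).mean().item()
print(mean, var)  # ~0 and ~1: rescaled to a standard range
```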
What is a residual connection?
A residual connection is the "x = x + block(x)" pattern. Instead of replacing the input, the block's output is added to it. This lets gradients flow straight through and makes deep networks trainable.
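The pattern in one line, plus a sanity check: if the block contributes nothing, the input passes through untouched, which is what makes deep stacks easy to train (the Linear block here is a stand-in for attention or an MLP):

```python
import torch

torch.manual_seed(0)
block = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

out = x + block(x)   # residual connection: add the block's output to x

# If the block learned to output zeros, the layer is a perfect identity:
with torch.no_grad():
    block.weight.zero_()
    block.bias.zero_()
print(torch.equal(x + block(x), x))
```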
What is a feed-forward network (MLP)?
A feed-forward network (or MLP) is a two-layer neural network applied to each token independently. If attention is "communication" between tokens, the MLP is "computation" — where the model does per-position processing.
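A sketch of the expand-then-project shape (the 4x expansion is the usual transformer convention), with a check that each token really is processed independently:

```python
import torch

torch.manual_seed(0)
n_embd = 32
mlp = torch.nn.Sequential(
    torch.nn.Linear(n_embd, 4 * n_embd),  # expand 4x
    torch.nn.ReLU(),                      # nonlinearity
    torch.nn.Linear(4 * n_embd, n_embd),  # project back down
)

x = torch.randn(2, 10, n_embd)   # (batch, tokens, channels)
out = mlp(x)

# Changing token 0 leaves every other token's output unchanged:
x2 = x.clone()
x2[:, 0, :] = 0.0
independent = torch.equal(mlp(x)[:, 1:], mlp(x2)[:, 1:])
print(out.shape, independent)
```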
What are logits?
Logits are raw scores the model outputs for each possible next token (one number per vocabulary entry). They're not probabilities yet — you pass them through softmax to get those.
What is softmax?
Softmax converts a vector of raw scores into probabilities that sum to 1. Higher scores get exponentially more probability. It's how the model turns logits into a prediction distribution.
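With made-up scores:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 1.0, 0.2])   # raw scores
probs = F.softmax(logits, dim=-1)        # exponentiate, then normalize

print(probs, probs.sum().item())  # probabilities summing to 1
```

Note how the gap widens: a score lead of 2 becomes a much bigger probability lead, because softmax exponentiates.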
What is batch size?
Batch size is how many independent sequences the model processes at once. Bigger batches give more stable gradient estimates but use more memory. The code uses 64.
What is context length (block size)?
Context length is the maximum number of tokens the model can look back at. Set to 256 in the code — meaning the model sees at most 256 characters of history when predicting the next one.