I Built a Tiny GPT Model From Scratch — Here's Exactly How It Works
How GPT really works, explained by building a 10M-parameter model from scratch in PyTorch. Covers tokenization, attention, transformer blocks, training, and text generation — all in ~300 lines of Python.
I've been writing about large language models for a few years now — blockchain and AI/ML are the two topics I cover most. And for a while, I could explain what GPT does reasonably well. But if you asked me exactly how it works underneath — how a string of text becomes a prediction — I'd start waving my hands.
That bothered me. So I built one.
Not GPT-4. Not even GPT-2. A tiny version — around 10 million parameters — that trains on Shakespeare in about 30 minutes on a laptop CPU. I called it NanoGPT (yes, inspired by Andrej Karpathy's work). Same architecture as the real thing, just smaller.
After 5,000 training steps it generates things like:
"HAMLET: The king hath sent me hither to speak with thee and thy father's ghost hath spoke to me"
Not poetry. But also not random noise. The model learned something real.
This post walks through every piece of how it works — the code, the math, and what I actually understood once I stopped reading about it and started building.
What we're building
NanoGPT is a decoder-only transformer — the same class of model as GPT-2, GPT-3, and GPT-4. The main differences are scale:
| | NanoGPT | GPT-2 Small | GPT-4 (estimated) |
|---|---|---|---|
| Parameters | ~10M | 117M | ~1.8T |
| Embedding dim | 128 | 768 | ~12,288 |
| Attention heads | 4 | 12 | 96 |
| Layers | 4 | 12 | 96 |
| Training data | ~1MB | ~40GB | Unknown |
The project has eight files:
nanogpt/
├── config.py # all hyperparameters in one place
├── tokenizer.py # text ↔ token IDs
├── dataset.py # data loading and batching
├── model.py # the GPT architecture
├── train.py # training loop
├── generate.py # text generation
├── utils.py # helpers
└── README.md
Let's go through each piece.
Step 1: Tokenization — turning text into numbers
Computers don't read words. They read numbers. Tokenization converts text into a sequence of integer IDs that the model can process.
I used tiktoken with GPT-2's vocabulary — 50,257 unique tokens. The encoding is BPE (Byte Pair Encoding), which splits text into subword units. Common words get a single token; rarer words get split into pieces.
# tokenizer.py
import tiktoken

def get_tokenizer():
    return tiktoken.get_encoding("gpt2")

def encode(text: str) -> list[int]:
    enc = get_tokenizer()
    return enc.encode(text)

def decode(token_ids: list[int]) -> str:
    enc = get_tokenizer()
    return enc.decode(token_ids)
So "To be or not to be" becomes something like [2514, 307, 393, 407, 284, 307]. The model never sees the original text — only these IDs.
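To see how BPE arrives at subword units, here's a toy version of the merge loop. The merge table and its priority ranks below are invented for illustration — GPT-2's real merge table has tens of thousands of learned entries:

```python
def bpe_tokenize(word, merges):
    """Toy BPE: start from characters, repeatedly apply the best-ranked merge."""
    tokens = list(word)
    while True:
        # find every adjacent pair that has a learned merge, with its rank
        candidates = [
            (merges[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in merges
        ]
        if not candidates:
            return tokens
        _, i = min(candidates)  # apply the lowest-ranked (most frequent) merge first
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

# made-up merge table: pairs learned earlier get lower ranks
merges = {("t", "h"): 0, ("th", "e"): 1, ("i", "n"): 2, ("k", "in"): 3}

print(bpe_tokenize("the", merges))       # common word collapses to one token
print(bpe_tokenize("thinking", merges))  # rarer word stays as several subwords
```

The same mechanism, run over a huge corpus, is why "the" is one token in GPT-2's vocabulary while rare words split into pieces.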
Step 2: Centralising hyperparameters
Every hyperparameter lives in one dataclass. This makes experiments easy — you change numbers in one file, not scattered across ten.
# config.py
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50_257   # GPT-2 vocabulary
    embed_dim: int = 128       # what makes it "nano"
    num_heads: int = 4         # attention heads
    num_layers: int = 4        # transformer blocks
    max_seq_len: int = 256     # context window
    dropout: float = 0.1
    batch_size: int = 32
    learning_rate: float = 3e-4
    max_steps: int = 5_000
    eval_interval: int = 500
    device: str = "cpu"
The embed_dim: 128 is what keeps this model tiny. GPT-2 Small uses 768. That single number multiplies into every weight matrix in the model.
Step 3: Loading and batching the data
The training data is TinyShakespeare — about 1MB of Shakespeare plays. Small enough to fit in memory, rich enough to train a basic language model.
# dataset.py
import torch
from tokenizer import encode

def load_data(path: str, config):
    with open(path, "r") as f:
        text = f.read()
    tokens = encode(text)
    data = torch.tensor(tokens, dtype=torch.long)
    # 90/10 train/validation split
    split = int(0.9 * len(data))
    return data[:split], data[split:]

def get_batch(data: torch.Tensor, config):
    """
    Returns a random batch of (input, target) pairs.
    Target is input shifted right by one position.
    """
    seq_len = config.max_seq_len
    batch_size = config.batch_size
    # random starting positions
    ix = torch.randint(len(data) - seq_len, (batch_size,))
    x = torch.stack([data[i : i + seq_len] for i in ix])
    y = torch.stack([data[i + 1 : i + seq_len + 1] for i in ix])
    return x.to(config.device), y.to(config.device)
Notice that the target y is just x shifted by one position. If the input is ["The", "cat", "sat"], the model should predict ["cat", "sat", "on"]. That's all language modeling is — predict the next token, over and over.
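The shift is easier to see on a plain Python list than on tensors. A sketch with made-up token IDs standing in for the tokenized corpus:

```python
# toy token stream standing in for the tokenized corpus
tokens = [10, 20, 30, 40, 50, 60]
seq_len = 3
i = 1  # one random starting position

x = tokens[i : i + seq_len]          # model input
y = tokens[i + 1 : i + seq_len + 1]  # targets: same window, one step ahead

print(x)  # [20, 30, 40]
print(y)  # [30, 40, 50]
```

At every position t, y[t] is the token that comes right after x[t] in the corpus — so one sequence yields seq_len separate next-token predictions, not just one.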
Step 4: The model architecture
This is the core. Four components build on top of each other.
Token and positional embeddings
Every token ID maps to a learned 128-dimensional vector. But the model also needs to know where in the sequence each token sits — "dog bites man" and "man bites dog" have the same tokens in different positions and mean different things.
# model.py (partial)
import torch
import torch.nn as nn
import torch.nn.functional as F
from config import GPTConfig

class NanoGPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config
        self.token_embedding = nn.Embedding(config.vocab_size, config.embed_dim)
        self.pos_embedding = nn.Embedding(config.max_seq_len, config.embed_dim)
        self.blocks = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.num_layers)
        ])
        self.ln_f = nn.LayerNorm(config.embed_dim)
        self.head = nn.Linear(config.embed_dim, config.vocab_size, bias=False)
        # weight tying: share parameters between input embedding and output head
        self.head.weight = self.token_embedding.weight

    def forward(self, idx: torch.Tensor, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding(idx)       # (B, T, embed_dim)
        pos = torch.arange(T, device=idx.device)
        pos_emb = self.pos_embedding(pos)         # (T, embed_dim)
        x = tok_emb + pos_emb                     # (B, T, embed_dim)
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)                     # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss
Weight tying (self.head.weight = self.token_embedding.weight) means the same matrix handles both encoding tokens into vectors and decoding vectors back to token probabilities. It reduces parameters and consistently improves performance — a trick borrowed from the original "Attention Is All You Need" paper.
Causal self-attention
Attention is the mechanism that lets each token look at other tokens in the sequence and decide which ones matter for predicting the next word.
Here's the intuition: when predicting the word after "The king sat on his ___", the model needs to look back and find "king" and "sat" — those are the relevant tokens. Attention is how it learns to do that.
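Stripped of batching, heads, and learned projections, the score computation is a dot product, a scale, and a softmax. A sketch with hand-picked 2-dimensional query and key vectors:

```python
import math

def attention_weights(q, keys):
    """Scaled dot-product attention weights for one query over a list of keys."""
    d = len(q)
    # similarity of the query to each key, scaled by sqrt(d)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # softmax turns scores into weights that sum to 1
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 0.0]                                  # what this token is "looking for"
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]     # what each token "offers"
w = attention_weights(q, keys)
print(w)  # keys matching the query get the larger weights
```

The keys aligned with the query (positions 0 and 2) receive more weight than the orthogonal one. In the real model, training shapes the Q and K projections so that these alignments become meaningful.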
class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.embed_dim % config.num_heads == 0
        self.num_heads = config.num_heads
        self.head_dim = config.embed_dim // config.num_heads
        # Query, Key, Value projections — all in one matrix
        self.qkv = nn.Linear(config.embed_dim, 3 * config.embed_dim, bias=False)
        self.proj = nn.Linear(config.embed_dim, config.embed_dim, bias=False)
        self.dropout = nn.Dropout(config.dropout)
        # causal mask: lower-triangular matrix of ones
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.max_seq_len, config.max_seq_len))
            .view(1, 1, config.max_seq_len, config.max_seq_len)
        )

    def forward(self, x: torch.Tensor):
        B, T, C = x.shape
        # split into Q, K, V
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape for multi-head attention: (B, num_heads, T, head_dim)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        # attention scores
        scale = self.head_dim ** -0.5
        att = (q @ k.transpose(-2, -1)) * scale  # (B, heads, T, T)
        # causal mask — future positions get -inf, so softmax sends them to zero
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        att = self.dropout(att)
        # weighted sum of values
        out = att @ v  # (B, heads, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
The causal mask is what makes this a language model rather than just a sequence encoder. It fills future positions with -inf before softmax, which forces their weights to zero after the exponential. The token at position 5 can only attend to positions 0–5, never 6 onwards. This is how the model learns to predict — it's never allowed to cheat by looking ahead.
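Here is what the tril buffer contains for a sequence of length 4, rebuilt in plain Python (1 = may attend, 0 = masked future position):

```python
T = 4
# row i: which positions token i is allowed to attend to
mask = [[1 if j <= i else 0 for j in range(T)] for i in range(T)]
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Token 0 can only see itself; the last token sees everything before it. This is exactly the lower-triangular matrix that `torch.tril(torch.ones(T, T))` produces.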
Feed-forward network
After each attention layer, every token goes through a small neural network independently. It expands to 4× the embedding dimension, applies GELU activation, then compresses back.
class FeedForward(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.embed_dim, 4 * config.embed_dim),
            nn.GELU(),
            nn.Linear(4 * config.embed_dim, config.embed_dim),
            nn.Dropout(config.dropout),
        )

    def forward(self, x: torch.Tensor):
        return self.net(x)
The attention layer decides which tokens to look at. The feed-forward layer processes what it found. Different jobs, run in sequence.
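GELU itself is a smoother cousin of ReLU: large positive inputs pass through almost unchanged, but negative inputs are damped rather than clipped hard to zero. A standalone sketch of the tanh approximation (the variant GPT-2 used; PyTorch's nn.GELU defaults to the exact erf form):

```python
import math

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

print(gelu(3.0))   # close to 3.0 — large positives pass through
print(gelu(-0.5))  # small negative value, not a hard zero like ReLU
print(gelu(0.0))   # 0.0
```

That smooth region around zero gives better-behaved gradients than ReLU's kink, which is part of why GPT-style models use it.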
The transformer block
Each transformer block wraps attention and feed-forward together with LayerNorm and residual connections.
class TransformerBlock(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.embed_dim)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.embed_dim)
        self.ff = FeedForward(config)

    def forward(self, x: torch.Tensor):
        x = x + self.attn(self.ln1(x))  # residual connection
        x = x + self.ff(self.ln2(x))    # residual connection
        return x
The x + ... is the residual connection. It looks trivial but it's not — without it, deep networks fail to train because gradients shrink to nothing on the way back through 4, 8, or 12 layers. Residual connections give gradients a direct path to early layers.
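A toy numeric version of that argument. Suppose each layer on its own scales gradients by 0.5 during backprop (a made-up figure, purely for illustration); the chain rule multiplies those factors through every layer:

```python
# gradient reaching layer 1 of a 12-layer stack, via the chain rule
plain = 1.0
residual = 1.0
for _ in range(12):
    layer_grad = 0.5            # invented per-layer derivative, for illustration
    plain *= layer_grad         # no skip path: the product shrinks geometrically
    residual *= 1 + layer_grad  # d/dx (x + f(x)) = 1 + f'(x): the 1 keeps it alive

print(plain)     # ~0.00024 — essentially no learning signal left
print(residual)  # well above 1 — the signal survives
```

The "+1" contributed by the identity path in x + f(x) is what stops the product from collapsing, which is the whole trick.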
LayerNorm runs before the sublayer here (Pre-LN). This is a practical deviation from the original "Attention Is All You Need" paper (which used Post-LN) and it trains more stably.
Step 5: Training
The training loop is straightforward once the model exists.
# train.py
import math
import torch
from config import GPTConfig
from dataset import load_data, get_batch
from model import NanoGPT

def get_lr(step: int, config: GPTConfig) -> float:
    """Cosine learning rate schedule with warmup."""
    warmup_steps = 100
    if step < warmup_steps:
        return config.learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / (config.max_steps - warmup_steps)
    return config.learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))

def train():
    config = GPTConfig()
    train_data, val_data = load_data("data/tinyshakespeare.txt", config)
    model = NanoGPT(config).to(config.device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
    print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
    for step in range(config.max_steps):
        # update learning rate
        lr = get_lr(step, config)
        for g in optimizer.param_groups:
            g["lr"] = lr
        model.train()
        x, y = get_batch(train_data, config)
        logits, loss = model(x, y)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # prevent exploding gradients
        optimizer.step()
        if step % config.eval_interval == 0:
            model.eval()
            with torch.no_grad():
                _, val_loss = model(*get_batch(val_data, config))
            print(f"step {step:5d} | train loss {loss.item():.4f} | val loss {val_loss.item():.4f} | lr {lr:.2e}")
    torch.save(model.state_dict(), "nanogpt.pt")

if __name__ == "__main__":
    train()
The loss starts around 10.8. That's almost exactly ln(50257) — what you'd get from a model randomly guessing across 50,257 tokens. By step 5,000 it's around 1.5, which means the model has developed real preferences.
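Both endpoints of that curve have a direct interpretation. Cross-entropy under a uniform guess over the vocabulary is ln(vocab_size), and exp(loss) is the perplexity — roughly how many tokens the model is still "choosing between" at each step:

```python
import math

vocab_size = 50_257
random_guess_loss = math.log(vocab_size)
print(round(random_guess_loss, 2))      # 10.82 — the loss at step 0

final_loss = 1.5
print(round(math.exp(final_loss), 1))   # 4.5 — like picking among a handful of tokens
```

Going from a perplexity of 50,257 down to under 5 is what "the model learned something" looks like in numbers.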
clip_grad_norm_ to 1.0 prevents training from blowing up. Without it, a bad batch occasionally produces enormous gradients that overwrite everything the model has learned. One line that matters a lot.
The cosine schedule decays the learning rate smoothly. Starting too high causes the loss to bounce. Staying too high at the end stops the model from settling. The cosine curve just works — I didn't tune it, I copied the formula from the original GPT-2 paper and it was fine.
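The schedule from train.py, reproduced standalone so you can probe it directly (the constants mirror the config above):

```python
import math

def get_lr(step, base_lr=3e-4, warmup_steps=100, max_steps=5_000):
    if step < warmup_steps:
        return base_lr * step / warmup_steps                 # linear warmup
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

print(get_lr(50))    # halfway through warmup: 1.5e-4
print(get_lr(100))   # peak: 3e-4
print(get_lr(2550))  # midpoint of the decay: back to ~1.5e-4
print(get_lr(5000))  # end of training: decays to ~0
```

The rate climbs linearly for 100 steps, peaks at 3e-4, then follows half a cosine wave down toward zero — no sharp drops anywhere.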
step 0 | train loss 10.8231 | val loss 10.8189 | lr 3.00e-06
step 500 | train loss 4.2817 | val loss 4.3201 | lr 2.99e-04
step 1000 | train loss 3.1044 | val loss 3.2108 | lr 2.92e-04
step 2000 | train loss 2.3517 | val loss 2.5089 | lr 2.57e-04
step 3000 | train loss 1.9823 | val loss 2.1744 | lr 1.99e-04
step 4000 | train loss 1.7201 | val loss 1.9802 | lr 1.25e-04
step 5000 | train loss 1.5934 | val loss 1.8991 | lr 5.00e-05
The gap between train and val loss is normal — it shows the model memorised some training patterns. That's fine at this scale.
Step 6: Generating text
Once trained, the model generates text autoregressively — one token at a time, each new token fed back as input.
# generate.py
import torch
import torch.nn.functional as F
from config import GPTConfig
from model import NanoGPT
from tokenizer import encode, decode

def generate(
    model: NanoGPT,
    prompt: str,
    max_new_tokens: int = 200,
    temperature: float = 1.0,
    top_k: int = 40,
    device: str = "cpu",
) -> str:
    model.eval()
    config = model.config
    tokens = encode(prompt)
    x = torch.tensor(tokens, dtype=torch.long, device=device).unsqueeze(0)
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # crop to context window if needed
            x_cond = x[:, -config.max_seq_len:]
            logits, _ = model(x_cond)
            logits = logits[:, -1, :] / temperature  # only care about last token
            # top-k: keep only the top_k most probable tokens
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float("-inf")
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, next_token], dim=1)
    return decode(x[0].tolist())

if __name__ == "__main__":
    config = GPTConfig()
    model = NanoGPT(config)
    model.load_state_dict(torch.load("nanogpt.pt", map_location=config.device))
    output = generate(
        model,
        prompt="HAMLET:",
        max_new_tokens=200,
        temperature=0.8,
        top_k=40,
        device=config.device,
    )
    print(output)
Temperature controls randomness. At 1.0, sampling is proportional to the raw probabilities. At 0.5, the distribution sharpens — the model picks safer, more predictable tokens. At 1.5, it flattens — more variety, more surprises, more nonsense.
Top-k sampling removes the long tail. Without it, the model occasionally picks a very unlikely token and the output goes off the rails. Restricting to the top 40 keeps things coherent without killing variation.
Start with temperature=0.8 and top_k=40. For more creative output try temperature=1.1. For more coherent output go lower, around 0.6. Never go below 0.3 — it starts repeating itself.
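Both knobs are a few lines of arithmetic on the logits. A standalone sketch with made-up logits for a four-token vocabulary:

```python
import math

def sample_probs(logits, temperature=1.0, top_k=None):
    """Turn raw logits into sampling probabilities, with temperature and top-k."""
    if top_k is not None:
        # keep only the top_k largest logits; the rest get -inf (probability 0)
        cutoff = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= cutoff else float("-inf") for x in logits]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1, -3.0]  # invented logits for illustration
sharp = sample_probs(logits, temperature=0.5)  # low temp: mass piles on the top token
flat = sample_probs(logits, temperature=1.5)   # high temp: mass spreads out
print(sharp[0], flat[0])
```

Lower temperature concentrates probability on the already-likely tokens; top-k simply zeroes the tail before sampling ever sees it.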
Scaling up to GPT-2
The only thing separating NanoGPT from GPT-2 Small is four numbers in the config:
# GPT-2 Small — change just these four values
@dataclass
class GPT2Config(GPTConfig):
    embed_dim: int = 768      # was 128
    num_heads: int = 12       # was 4
    num_layers: int = 12      # was 4
    max_seq_len: int = 1024   # was 256
That's it. The architecture is identical. The parameter count goes from ~10M to ~117M. The training goes from 30 minutes on CPU to days on GPU clusters. But the code structure doesn't change.
GPT-3 is just GPT-2 with embed_dim=12288, num_heads=96, num_layers=96, and a lot more data.
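You can see why those few numbers dominate with a back-of-envelope parameter count. This ignores LayerNorm and bias terms, so the totals land somewhat below the headline figures — the scaling is the point, not the exact numbers:

```python
def rough_param_count(vocab_size, embed_dim, num_layers, max_seq_len):
    # token + positional embeddings (the output head is weight-tied, so counted once)
    embeddings = vocab_size * embed_dim + max_seq_len * embed_dim
    # per block: ~4*d^2 for attention (QKV + output proj) + ~8*d^2 for the 4x FFN
    per_block = 12 * embed_dim ** 2
    return embeddings + num_layers * per_block

nano = rough_param_count(50_257, 128, 4, 256)
gpt2 = rough_param_count(50_257, 768, 12, 1024)
print(f"{nano:,} -> {gpt2:,}")  # roughly a 17x jump from the same code
```

Note how the per-block term grows with the square of embed_dim and linearly with num_layers — that quadratic is why widening the model dwarfs everything else.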
What actually made sense once I built it
Attention is not as magical as it sounds. Once you write the QKV matrix multiply by hand, it's dot products and softmax. That's it. The power comes from training — the model learns which queries and keys to produce such that the attention scores end up meaningful. The mechanism itself is just math.
Residual connections surprised me. x = x + sublayer(x) is one line of code. I actually removed them from my model to see what would happen. The loss barely budged for the first thousand steps, which made me think they didn't matter. Then I realised I was comparing to a model that was also broken — neither had converged. When I let both run to step 5,000, the one with residuals hit 1.59. The one without was stuck at 2.8. One line, roughly half the learning.
Watching the loss number drop is weirdly motivating. 10.8 means the model is guessing randomly. 1.5 means it's narrowed each prediction down to a few likely tokens. The number actually means something, and seeing it fall across 5,000 steps feels more like understanding than any explanation I'd read.
The scale thing hit me when I wrote the GPT-2 config. Four numbers. That's all that separates this toy from a model that was genuinely impressive in 2019. GPT-3 is the same again, just bigger. The architecture hasn't changed much since 2017. All the gains since then are scale, data, and training tricks.
Getting the code
The full implementation is on my GitHub. It includes a training script, a generation script, and a pre-tokenized version of TinyShakespeare so you can start training without waiting for downloads.
If you want to go deeper, Andrej Karpathy's nanoGPT repo and his Zero to Hero series cover this in much more detail — that's where I started before writing my own version from scratch.
Build it yourself. Reading about transformers is useful. Writing the causal mask by hand is something else.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation — previously Cromtek Solution and freelance.