
I Built a Tiny GPT Model From Scratch — Here's Exactly How It Works

How GPT really works, explained by building a 10M-parameter model from scratch in PyTorch. Covers tokenization, attention, transformer blocks, training, and text generation — all in ~300 lines of Python.

Krunal Kanojiya

I've been writing about large language models for a few years now — blockchain and AI/ML are the two topics I cover most. And for a while, I could explain what GPT does reasonably well. But if you asked me exactly how it works underneath — how a string of text becomes a prediction — I'd start waving my hands.

That bothered me. So I built one.

Not GPT-4. Not even GPT-2. A tiny version — around 10 million parameters — that trains on Shakespeare in about 30 minutes on a laptop CPU. I called it NanoGPT (yes, inspired by Andrej Karpathy's work). Same architecture as the real thing, just smaller.

After 5,000 training steps it generates things like:

"HAMLET: The king hath sent me hither to speak with thee and thy father's ghost hath spoke to me"

Not poetry. But also not random noise. The model learned something real.

This post walks through every piece of how it works — the code, the math, and what I actually understood once I stopped reading about it and started building.


What we're building

NanoGPT is a decoder-only transformer — the same class of model as GPT-2, GPT-3, and GPT-4. The main differences are scale:

|                 | NanoGPT | GPT-2 Small | GPT-3   |
|-----------------|---------|-------------|---------|
| Parameters      | ~10M    | 117M        | 175B    |
| Embedding dim   | 128     | 768         | 12,288  |
| Attention heads | 4       | 12          | 96      |
| Layers          | 4       | 12          | 96      |
| Training data   | ~1MB    | ~40GB       | ~570GB  |

The project has eight files:

plaintext
nanogpt/
├── config.py       # all hyperparameters in one place
├── tokenizer.py    # text ↔ token IDs
├── dataset.py      # data loading and batching
├── model.py        # the GPT architecture
├── train.py        # training loop
├── generate.py     # text generation
├── utils.py        # helpers
└── README.md

Let's go through each piece.


Step 1: Tokenization — turning text into numbers

Computers don't read words. They read numbers. Tokenization converts text into a sequence of integer IDs that the model can process.

I used tiktoken with GPT-2's vocabulary — 50,257 unique tokens. The encoding is BPE (Byte Pair Encoding), which splits text into subword units. Common words get a single token; rarer words get split into pieces.

python
# tokenizer.py
import tiktoken

def get_tokenizer():
    return tiktoken.get_encoding("gpt2")

def encode(text: str) -> list[int]:
    enc = get_tokenizer()
    return enc.encode(text)

def decode(token_ids: list[int]) -> str:
    enc = get_tokenizer()
    return enc.decode(token_ids)

So "To be or not to be" becomes something like [2514, 307, 393, 407, 284, 307]. The model never sees the original text — only these IDs.
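The merging idea behind BPE can be sketched in a few lines of plain Python. This is a toy illustration of the mechanics only, not tiktoken's actual implementation — real tokenizers learn their merge rules from a huge corpus, not from a single word:

```python
from collections import Counter

def bpe_merges(word: str, num_merges: int) -> list[str]:
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    symbols = list(word)
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            # replace every occurrence of the pair with one merged symbol
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_merges("banana", 2))   # ['ban', 'an', 'a']
```

Frequent pairs collapse into single units, which is why common words end up as one token while rare words stay split into pieces.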


Step 2: Centralising hyperparameters

Every hyperparameter lives in one dataclass. This makes experiments easy — you change numbers in one file, not scattered across ten.

python
# config.py
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50_257       # GPT-2 vocabulary
    embed_dim: int = 128           # what makes it "nano"
    num_heads: int = 4             # attention heads
    num_layers: int = 4            # transformer blocks
    max_seq_len: int = 256         # context window
    dropout: float = 0.1
    batch_size: int = 32
    learning_rate: float = 3e-4
    max_steps: int = 5_000
    eval_interval: int = 500
    device: str = "cpu"

That embed_dim = 128 is what keeps this model tiny. GPT-2 Small uses 768. That single number multiplies into every weight matrix in the model.
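You can see how embed_dim drives the size with back-of-envelope arithmetic. This rough count ignores LayerNorm weights and biases, and counts the token embedding once because it is shared with the output head:

```python
# Rough parameter count for the GPTConfig above.
vocab, d, n_layers, seq = 50_257, 128, 4, 256

embeddings = vocab * d + seq * d   # token + positional embeddings
attention = 4 * d * d              # qkv projection (3*d*d) + output projection (d*d)
feed_forward = 8 * d * d           # d -> 4d -> d
total = embeddings + n_layers * (attention + feed_forward)

print(f"{total:,}")                # about 7.3M — the "~10M" ballpark
```

Swap in d = 768, n_layers = 12, seq = 1024 and the same arithmetic lands in the same region as GPT-2 Small's ~117M. Almost every term is a multiple of d or d², which is why one number dominates the budget.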


Step 3: Loading and batching the data

The training data is TinyShakespeare — about 1MB of Shakespeare plays. Small enough to fit in memory, rich enough to train a basic language model.

python
# dataset.py
import torch
from tokenizer import encode

def load_data(path: str, config):
    with open(path, "r") as f:
        text = f.read()

    tokens = encode(text)
    data = torch.tensor(tokens, dtype=torch.long)

    # 90/10 train/validation split
    split = int(0.9 * len(data))
    return data[:split], data[split:]

def get_batch(data: torch.Tensor, config):
    """
    Returns a random batch of (input, target) pairs.
    Target is input shifted right by one position.
    """
    seq_len = config.max_seq_len
    batch_size = config.batch_size

    # random starting positions
    ix = torch.randint(len(data) - seq_len, (batch_size,))
    x = torch.stack([data[i : i + seq_len] for i in ix])
    y = torch.stack([data[i + 1 : i + seq_len + 1] for i in ix])
    return x.to(config.device), y.to(config.device)

Notice that the target y is just x shifted by one position. If the input is ["The", "cat", "sat"], the model should predict ["cat", "sat", "on"]. That's all language modeling is — predict the next token, over and over.
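The shift means every position in the sequence is its own training example. A quick illustration with plain lists, using words to stand in for integer token IDs:

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]
seq_len = 3

x = tokens[0:seq_len]        # ["The", "cat", "sat"]
y = tokens[1:seq_len + 1]    # ["cat", "sat", "on"]

# One sequence of length 3 yields 3 next-token prediction problems:
for t in range(seq_len):
    print(x[:t + 1], "->", y[t])
# ['The'] -> cat
# ['The', 'cat'] -> sat
# ['The', 'cat', 'sat'] -> on
```

This is why the causal mask matters later: position t must make its prediction from positions 0 through t only, or these training problems would be trivial.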


Step 4: The model architecture

This is the core. Four components build on top of each other.

Token and positional embeddings

Every token ID maps to a learned 128-dimensional vector. But the model also needs to know where in the sequence each token sits — "dog bites man" and "man bites dog" have the same tokens in different positions and mean different things.

python
# model.py (partial)
import torch
import torch.nn as nn
import torch.nn.functional as F
from config import GPTConfig

class NanoGPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config

        self.token_embedding = nn.Embedding(config.vocab_size, config.embed_dim)
        self.pos_embedding = nn.Embedding(config.max_seq_len, config.embed_dim)

        self.blocks = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.num_layers)
        ])
        self.ln_f = nn.LayerNorm(config.embed_dim)
        self.head = nn.Linear(config.embed_dim, config.vocab_size, bias=False)

        # weight tying: share parameters between input embedding and output head
        self.head.weight = self.token_embedding.weight

    def forward(self, idx: torch.Tensor, targets=None):
        B, T = idx.shape

        tok_emb = self.token_embedding(idx)               # (B, T, embed_dim)
        pos = torch.arange(T, device=idx.device)
        pos_emb = self.pos_embedding(pos)                 # (T, embed_dim)
        x = tok_emb + pos_emb                             # (B, T, embed_dim)

        for block in self.blocks:
            x = block(x)

        x = self.ln_f(x)
        logits = self.head(x)                             # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )

        return logits, loss

Weight tying (self.head.weight = self.token_embedding.weight) means the same matrix handles both encoding tokens into vectors and decoding vectors back to token probabilities. It reduces parameters and consistently improves performance — a trick borrowed from the original "Attention Is All You Need" paper.

Causal self-attention

Attention is the mechanism that lets each token look at other tokens in the sequence and decide which ones matter for predicting the next word.

Here's the intuition: when predicting the word after "The king sat on his ___", the model needs to look back and find "king" and "sat" — those are the relevant tokens. Attention is how it learns to do that.

python
class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.embed_dim % config.num_heads == 0

        self.num_heads = config.num_heads
        self.head_dim = config.embed_dim // config.num_heads

        # Query, Key, Value projections — all in one matrix
        self.qkv = nn.Linear(config.embed_dim, 3 * config.embed_dim, bias=False)
        self.proj = nn.Linear(config.embed_dim, config.embed_dim, bias=False)
        self.dropout = nn.Dropout(config.dropout)

        # causal mask: lower-triangular matrix of ones (1 = allowed to attend)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.max_seq_len, config.max_seq_len))
            .view(1, 1, config.max_seq_len, config.max_seq_len)
        )

    def forward(self, x: torch.Tensor):
        B, T, C = x.shape

        # split into Q, K, V
        q, k, v = self.qkv(x).split(C, dim=2)

        # reshape for multi-head attention: (B, num_heads, T, head_dim)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        # attention scores
        scale = self.head_dim ** -0.5
        att = (q @ k.transpose(-2, -1)) * scale         # (B, heads, T, T)

        # apply causal mask — block attention to future positions
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        att = self.dropout(att)

        # weighted sum of values
        out = att @ v                                    # (B, heads, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)

The causal mask is what makes this a language model rather than just a sequence encoder. It fills future positions with -inf before softmax, which forces their attention weights to exactly zero. The token at position 5 can only attend to positions 0–5, never 6 onwards. This is how the model learns to predict — it's never allowed to cheat by looking ahead.
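The -inf trick is easy to verify numerically. Here's a minimal pure-Python sketch of one query's attention row (the real code does this for every head and position at once):

```python
import math

def masked_softmax(scores: list[float], last_visible: int) -> list[float]:
    """Softmax over one query's attention scores, with every position
    after `last_visible` masked to -inf, exactly like the causal mask."""
    masked = [s if i <= last_visible else float("-inf")
              for i, s in enumerate(scores)]
    peak = max(masked)                            # subtract max for stability
    exps = [math.exp(s - peak) for s in masked]   # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

# A query at position 1 sees scores for 4 positions but may only use 0 and 1:
weights = masked_softmax([2.0, 1.0, 0.5, 3.0], last_visible=1)
print(weights)   # positions 2 and 3 get exactly zero weight
```

Note that position 3 had the highest raw score — the mask still zeroes it out completely.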

Feed-forward network

After each attention layer, every token goes through a small neural network independently. It expands to 4× the embedding dimension, applies GELU activation, then compresses back.

python
class FeedForward(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.embed_dim, 4 * config.embed_dim),
            nn.GELU(),
            nn.Linear(4 * config.embed_dim, config.embed_dim),
            nn.Dropout(config.dropout),
        )

    def forward(self, x: torch.Tensor):
        return self.net(x)

The attention layer decides which tokens to look at. The feed-forward layer processes what it found. Different jobs, run in sequence.

The transformer block

Each transformer block wraps attention and feed-forward together with LayerNorm and residual connections.

python
class TransformerBlock(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.embed_dim)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.embed_dim)
        self.ff = FeedForward(config)

    def forward(self, x: torch.Tensor):
        x = x + self.attn(self.ln1(x))   # residual connection
        x = x + self.ff(self.ln2(x))     # residual connection
        return x

The x + ... is the residual connection. It looks trivial but it's not — without it, deep networks fail to train because gradients shrink to nothing on the way back through 4, 8, or 12 layers. Residual connections give gradients a direct path to early layers.
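A toy calculation shows the effect. Suppose each sublayer, on its own, scales gradients by 0.5 on the way back (an assumed number, purely for illustration). Without residuals the factors multiply away to nothing; with residuals the backward pass gets a derivative of 1 + f'(x), so the identity path keeps the signal alive:

```python
layers = 12
local_grad = 0.5   # assumed per-sublayer gradient factor, for illustration only

plain = local_grad ** layers            # gradient reaching layer 1, no residuals
with_residual = (1 + local_grad) ** layers   # the "+1" identity path added

print(plain)          # ~0.00024: almost nothing survives 12 layers
print(with_residual)  # the signal survives (here even grows)
```

This is a caricature — real training also relies on LayerNorm to keep scales in check — but it captures why that "+1" in the derivative matters.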

LayerNorm runs before the sublayer here (Pre-LN). This is a practical deviation from the original "Attention Is All You Need" paper (which used Post-LN) and it trains more stably.


Step 5: Training

The training loop is straightforward once the model exists.

python
# train.py
import torch
from config import GPTConfig
from dataset import load_data, get_batch
from model import NanoGPT
import math

def get_lr(step: int, config: GPTConfig) -> float:
    """Cosine learning rate schedule with warmup."""
    warmup_steps = 100
    if step < warmup_steps:
        return config.learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / (config.max_steps - warmup_steps)
    return config.learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))

def train():
    config = GPTConfig()
    train_data, val_data = load_data("data/tinyshakespeare.txt", config)

    model = NanoGPT(config).to(config.device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)

    print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

    for step in range(config.max_steps):
        # update learning rate
        lr = get_lr(step, config)
        for g in optimizer.param_groups:
            g["lr"] = lr

        model.train()
        x, y = get_batch(train_data, config)
        logits, loss = model(x, y)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # prevent exploding gradients
        optimizer.step()

        if step % config.eval_interval == 0:
            model.eval()
            with torch.no_grad():
                _, val_loss = model(*get_batch(val_data, config))
            print(f"step {step:5d} | train loss {loss.item():.4f} | val loss {val_loss.item():.4f} | lr {lr:.2e}")

    torch.save(model.state_dict(), "nanogpt.pt")

if __name__ == "__main__":
    train()

The loss starts around 10.8. That's almost exactly ln(50257) — what you'd get from a model randomly guessing across 50,257 tokens. By step 5,000 it's around 1.5, which means the model has developed real preferences.

clip_grad_norm_ to 1.0 prevents training from blowing up. Without it, a bad batch occasionally produces enormous gradients that overwrite everything the model has learned. One line that matters a lot.

The cosine schedule decays the learning rate smoothly. Starting too high causes the loss to bounce. Staying too high at the end stops the model from settling. The cosine curve just works — I didn't tune it, I copied the standard warmup-plus-cosine recipe used for GPT-style training and it was fine.
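The schedule in get_lr is easy to sanity-check in isolation, using the same warmup_steps=100, max_steps=5000, and peak of 3e-4 as the config:

```python
import math

def get_lr(step: int, base_lr=3e-4, warmup_steps=100, max_steps=5_000) -> float:
    if step < warmup_steps:
        return base_lr * step / warmup_steps              # linear warmup
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

assert get_lr(50) == 1.5e-4      # halfway through warmup: half the peak rate
assert get_lr(100) == 3e-4       # peak, exactly as warmup ends
assert get_lr(5_000) < 1e-9      # cos(pi) == -1: decayed to ~0
```

Linear up for 100 steps, then half a cosine wave down to zero over the remaining 4,900.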

plaintext
step     0 | train loss 10.8231 | val loss 10.8189 | lr 3.00e-06
step   500 | train loss  4.2817 | val loss  4.3201 | lr 2.99e-04
step  1000 | train loss  3.1044 | val loss  3.2108 | lr 2.92e-04
step  2000 | train loss  2.3517 | val loss  2.5089 | lr 2.57e-04
step  3000 | train loss  1.9823 | val loss  2.1744 | lr 1.99e-04
step  4000 | train loss  1.7201 | val loss  1.9802 | lr 1.25e-04
step  5000 | train loss  1.5934 | val loss  1.8991 | lr 5.00e-05

The gap between train and val loss is normal — it shows the model memorised some training patterns. That's fine at this scale.


Step 6: Generating text

Once trained, the model generates text autoregressively — one token at a time, each new token fed back as input.

python
# generate.py
import torch
import torch.nn.functional as F
from config import GPTConfig
from model import NanoGPT
from tokenizer import encode, decode

def generate(
    model: NanoGPT,
    prompt: str,
    max_new_tokens: int = 200,
    temperature: float = 1.0,
    top_k: int = 40,
    device: str = "cpu",
) -> str:
    model.eval()
    config = model.config

    tokens = encode(prompt)
    x = torch.tensor(tokens, dtype=torch.long, device=device).unsqueeze(0)

    with torch.no_grad():
        for _ in range(max_new_tokens):
            # crop to context window if needed
            x_cond = x[:, -config.max_seq_len:]

            logits, _ = model(x_cond)
            logits = logits[:, -1, :] / temperature    # only care about last token

            # top-k: keep only the 40 most probable tokens
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float("-inf")

            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, next_token], dim=1)

    return decode(x[0].tolist())

if __name__ == "__main__":
    config = GPTConfig()
    model = NanoGPT(config)
    model.load_state_dict(torch.load("nanogpt.pt", map_location=config.device))

    output = generate(
        model,
        prompt="HAMLET:",
        max_new_tokens=200,
        temperature=0.8,
        top_k=40,
        device=config.device,
    )
    print(output)

Temperature controls randomness. At 1.0, sampling is proportional to the raw probabilities. At 0.5, the distribution sharpens — the model picks safer, more predictable tokens. At 1.5, it flattens — more variety, more surprises, more nonsense.
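Temperature's effect is easy to see on a toy distribution. This pure-Python sketch does the same logits/temperature division the generation code applies before sampling:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [l / temperature for l in logits]
    peak = max(scaled)                           # subtract max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 1.5):
    print(t, [round(p, 2) for p in softmax(logits, temperature=t)])
```

Low temperature concentrates probability mass on the top token; high temperature spreads it toward the alternatives — exactly the sharper/flatter behaviour described above.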

Top-k sampling removes the long tail. Without it, the model occasionally picks a very unlikely token and the output goes off the rails. Restricting to the top 40 keeps things coherent without killing variation.

Good defaults for generation

Start with temperature=0.8 and top_k=40. For more creative output try temperature=1.1. For more coherent output go lower, around 0.6. Never go below 0.3 — it starts repeating itself.


Scaling up to GPT-2

The only thing separating NanoGPT from GPT-2 Small is four numbers in the config:

python
# GPT-2 Small — change just these four values
@dataclass
class GPT2Config(GPTConfig):
    embed_dim: int = 768       # was 128
    num_heads: int = 12        # was 4
    num_layers: int = 12       # was 4
    max_seq_len: int = 1024    # was 256

That's it. The architecture is identical. The parameter count goes from ~10M to ~117M. The training goes from 30 minutes on CPU to days on GPU clusters. But the code structure doesn't change.

GPT-3 is just GPT-2 with embed_dim=12288, num_heads=96, num_layers=96, and a lot more data.


What actually made sense once I built it

Attention is not as magical as it sounds. Once you write the QKV matrix multiply by hand, it's dot products and softmax. That's it. The power comes from training — the model learns which queries and keys to produce such that the attention scores end up meaningful. The mechanism itself is just math.

Residual connections surprised me. x = x + sublayer(x) is one line of code. I actually removed them from my model to see what would happen. The loss barely budged for the first thousand steps, which made me think they didn't matter. Then I realised I was comparing to a model that was also broken — neither had converged. When I let both run to step 5,000, the one with residuals hit 1.59. The one without was stuck at 2.8. One line, roughly half the learning.

Watching the loss number drop is weirdly motivating. 10.8 means the model is guessing randomly. 1.5 means it's narrowed each prediction down to a few likely tokens. The number actually means something, and seeing it fall across 5,000 steps feels more like understanding than any explanation I'd read.

The scale thing hit me when I wrote the GPT-2 config. Four numbers. That's all that separates this toy from a model that was genuinely impressive in 2019. GPT-3 is the same again, just bigger. The architecture hasn't changed much since 2017. All the gains since then are scale, data, and training tricks.


Getting the code

The full implementation is on my GitHub. It includes a training script, a generation script, and a pre-tokenized version of TinyShakespeare so you can start training without waiting for downloads.

If you want to go deeper, Andrej Karpathy's nanoGPT repo and his Zero to Hero series cover this in much more detail — that's where I started before writing my own version from scratch.

Build it yourself. Reading about transformers is useful. Writing the causal mask by hand is something else.


Krunal Kanojiya

Technical Content Writer

Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation — previously Cromtek Solution and freelance.
