
I Built a Tiny GPT Model From Scratch — Here's Exactly How It Works

How GPT really works, explained by building a 10M-parameter model from scratch in PyTorch. Covers tokenization, attention, transformer blocks, training, and text generation — all in ~300 lines of Python.

Krunal Kanojiya

I've been writing about large language models for a few years now — blockchain and AI/ML are the two topics I cover most. And for a while, I could explain what GPT does reasonably well. But if you asked me exactly how it works underneath — how a string of text becomes a prediction — I'd start waving my hands.

That bothered me. So I built one.

Not GPT-4. Not even GPT-2. A tiny version — around 10 million parameters — that trains on Shakespeare in about 30 minutes on a laptop CPU. I called it NanoGPT (yes, inspired by Andrej Karpathy's work). Same architecture as the real thing, just smaller.

After 5,000 training steps it generates things like:

"HAMLET: The king hath sent me hither to speak with thee and thy father's ghost hath spoke to me"

Not poetry. But also not random noise. The model learned something real.

This post walks through every piece of how it works — the code, the math, and what I actually understood once I stopped reading about it and started building.


What we're building

NanoGPT is a decoder-only transformer — the same class of model as GPT-2, GPT-3, and GPT-4. The main differences are scale:

|                 | NanoGPT | GPT-2 Small | GPT-3   |
|-----------------|---------|-------------|---------|
| Parameters      | ~10M    | 117M        | 175B    |
| Embedding dim   | 128     | 768         | 12,288  |
| Attention heads | 4       | 12          | 96      |
| Layers          | 4       | 12          | 96      |
| Training data   | ~1MB    | ~40GB       | ~570GB  |

The project has eight files:

plaintext
nanogpt/
├── config.py       # all hyperparameters in one place
├── tokenizer.py    # text ↔ token IDs
├── dataset.py      # data loading and batching
├── model.py        # the GPT architecture
├── train.py        # training loop
├── generate.py     # text generation
├── utils.py        # helpers
└── README.md

Let's go through each piece.


Step 1: Tokenization — turning text into numbers

Computers don't read words. They read numbers. Tokenization converts text into a sequence of integer IDs that the model can process.

I used tiktoken with GPT-2's vocabulary — 50,257 unique tokens. The encoding is BPE (Byte Pair Encoding), which splits text into subword units. Common words get a single token; rarer words get split into pieces.

python
# tokenizer.py
import tiktoken

def get_tokenizer():
    return tiktoken.get_encoding("gpt2")

def encode(text: str) -> list[int]:
    enc = get_tokenizer()
    return enc.encode(text)

def decode(token_ids: list[int]) -> str:
    enc = get_tokenizer()
    return enc.decode(token_ids)

So "To be or not to be" becomes something like [2514, 307, 393, 407, 284, 307]. The model never sees the original text — only these IDs.
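The merging idea behind BPE can be sketched in a few lines of plain Python. This is a toy illustration of the mechanics only, not tiktoken's actual implementation — real tokenizers learn their merge rules from a huge corpus, not from a single word:

```python
from collections import Counter

def bpe_merges(word: str, num_merges: int) -> list[str]:
    """Toy BPE: repeatedly merge the most frequent adjacent pair of symbols."""
    symbols = list(word)
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            # replace every occurrence of the pair with one merged symbol
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_merges("banana", 2))   # ['ban', 'an', 'a']
```

Frequent pairs collapse into single units, which is why common words end up as one token while rare words stay split into pieces.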


Step 2: Centralising hyperparameters

Every hyperparameter lives in one dataclass. This makes experiments easy — you change numbers in one file, not scattered across ten.

python
# config.py
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50_257       # GPT-2 vocabulary
    embed_dim: int = 128           # what makes it "nano"
    num_heads: int = 4             # attention heads
    num_layers: int = 4            # transformer blocks
    max_seq_len: int = 256         # context window
    dropout: float = 0.1
    batch_size: int = 32
    learning_rate: float = 3e-4
    max_steps: int = 5_000
    eval_interval: int = 500
    device: str = "cpu"

That embed_dim = 128 is what keeps this model tiny. GPT-2 Small uses 768. That single number multiplies into every weight matrix in the model.
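You can see how embed_dim drives the size with back-of-envelope arithmetic. This rough count ignores LayerNorm weights and biases, and counts the token embedding once because it is shared with the output head:

```python
# Rough parameter count for the GPTConfig above.
vocab, d, n_layers, seq = 50_257, 128, 4, 256

embeddings = vocab * d + seq * d   # token + positional embeddings
attention = 4 * d * d              # qkv projection (3*d*d) + output projection (d*d)
feed_forward = 8 * d * d           # d -> 4d -> d
total = embeddings + n_layers * (attention + feed_forward)

print(f"{total:,}")                # about 7.3M — the "~10M" ballpark
```

Swap in d = 768, n_layers = 12, seq = 1024 and the same arithmetic lands in the same region as GPT-2 Small's ~117M. Almost every term is a multiple of d or d², which is why one number dominates the budget.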


Step 3: Loading and batching the data

The training data is TinyShakespeare — about 1MB of Shakespeare plays. Small enough to fit in memory, rich enough to train a basic language model.

python
# dataset.py
import torch
from tokenizer import encode

def load_data(path: str, config):
    with open(path, "r") as f:
        text = f.read()

    tokens = encode(text)
    data = torch.tensor(tokens, dtype=torch.long)

    # 90/10 train/validation split
    split = int(0.9 * len(data))
    return data[:split], data[split:]

def get_batch(data: torch.Tensor, config):
    """
    Returns a random batch of (input, target) pairs.
    Target is input shifted right by one position.
    """
    seq_len = config.max_seq_len
    batch_size = config.batch_size

    # random starting positions
    ix = torch.randint(len(data) - seq_len, (batch_size,))
    x = torch.stack([data[i : i + seq_len] for i in ix])
    y = torch.stack([data[i + 1 : i + seq_len + 1] for i in ix])
    return x.to(config.device), y.to(config.device)

Notice that the target y is just x shifted by one position. If the input is ["The", "cat", "sat"], the model should predict ["cat", "sat", "on"]. That's all language modeling is — predict the next token, over and over.
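The shift means every position in the sequence is its own training example. A quick illustration with plain lists, using words to stand in for integer token IDs:

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]
seq_len = 3

x = tokens[0:seq_len]        # ["The", "cat", "sat"]
y = tokens[1:seq_len + 1]    # ["cat", "sat", "on"]

# One sequence of length 3 yields 3 next-token prediction problems:
for t in range(seq_len):
    print(x[:t + 1], "->", y[t])
# ['The'] -> cat
# ['The', 'cat'] -> sat
# ['The', 'cat', 'sat'] -> on
```

This is why the causal mask matters later: position t must make its prediction from positions 0 through t only, or these training problems would be trivial.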


Step 4: The model architecture

This is the core. Four components build on top of each other.

Token and positional embeddings

Every token ID maps to a learned 128-dimensional vector. But the model also needs to know where in the sequence each token sits — "dog bites man" and "man bites dog" have the same tokens in different positions and mean different things.

python
# model.py (partial)
import torch
import torch.nn as nn
import torch.nn.functional as F
from config import GPTConfig

class NanoGPT(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config

        self.token_embedding = nn.Embedding(config.vocab_size, config.embed_dim)
        self.pos_embedding = nn.Embedding(config.max_seq_len, config.embed_dim)

        self.blocks = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.num_layers)
        ])
        self.ln_f = nn.LayerNorm(config.embed_dim)
        self.head = nn.Linear(config.embed_dim, config.vocab_size, bias=False)

        # weight tying: share parameters between input embedding and output head
        self.head.weight = self.token_embedding.weight

    def forward(self, idx: torch.Tensor, targets=None):
        B, T = idx.shape

        tok_emb = self.token_embedding(idx)               # (B, T, embed_dim)
        pos = torch.arange(T, device=idx.device)
        pos_emb = self.pos_embedding(pos)                 # (T, embed_dim)
        x = tok_emb + pos_emb                             # (B, T, embed_dim)

        for block in self.blocks:
            x = block(x)

        x = self.ln_f(x)
        logits = self.head(x)                             # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )

        return logits, loss

Weight tying (self.head.weight = self.token_embedding.weight) means the same matrix handles both encoding tokens into vectors and decoding vectors back to token probabilities. It reduces parameters and consistently improves performance — a trick borrowed from the original "Attention Is All You Need" paper.

Causal self-attention

Attention is the mechanism that lets each token look at other tokens in the sequence and decide which ones matter for predicting the next word.

Here's the intuition: when predicting the word after "The king sat on his ___", the model needs to look back and find "king" and "sat" — those are the relevant tokens. Attention is how it learns to do that.

python
class CausalSelfAttention(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        assert config.embed_dim % config.num_heads == 0

        self.num_heads = config.num_heads
        self.head_dim = config.embed_dim // config.num_heads

        # Query, Key, Value projections — all in one matrix
        self.qkv = nn.Linear(config.embed_dim, 3 * config.embed_dim, bias=False)
        self.proj = nn.Linear(config.embed_dim, config.embed_dim, bias=False)
        self.dropout = nn.Dropout(config.dropout)

        # causal mask: lower-triangular matrix of ones (1 = allowed to attend)
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.max_seq_len, config.max_seq_len))
            .view(1, 1, config.max_seq_len, config.max_seq_len)
        )

    def forward(self, x: torch.Tensor):
        B, T, C = x.shape

        # split into Q, K, V
        q, k, v = self.qkv(x).split(C, dim=2)

        # reshape for multi-head attention: (B, num_heads, T, head_dim)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        # attention scores
        scale = self.head_dim ** -0.5
        att = (q @ k.transpose(-2, -1)) * scale         # (B, heads, T, T)

        # apply causal mask — block attention to future positions
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        att = self.dropout(att)

        # weighted sum of values
        out = att @ v                                    # (B, heads, T, head_dim)
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)

The causal mask is what makes this a language model rather than just a sequence encoder. It fills future positions with -inf before softmax, which forces their attention weights to exactly zero. The token at position 5 can only attend to positions 0–5, never 6 onwards. This is how the model learns to predict — it's never allowed to cheat by looking ahead.
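The -inf trick is easy to verify numerically. Here's a minimal pure-Python sketch of one query's attention row (the real code does this for every head and position at once):

```python
import math

def masked_softmax(scores: list[float], last_visible: int) -> list[float]:
    """Softmax over one query's attention scores, with every position
    after `last_visible` masked to -inf, exactly like the causal mask."""
    masked = [s if i <= last_visible else float("-inf")
              for i, s in enumerate(scores)]
    peak = max(masked)                            # subtract max for stability
    exps = [math.exp(s - peak) for s in masked]   # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

# A query at position 1 sees scores for 4 positions but may only use 0 and 1:
weights = masked_softmax([2.0, 1.0, 0.5, 3.0], last_visible=1)
print(weights)   # positions 2 and 3 get exactly zero weight
```

Note that position 3 had the highest raw score — the mask still zeroes it out completely.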

Feed-forward network

After each attention layer, every token goes through a small neural network independently. It expands to 4× the embedding dimension, applies GELU activation, then compresses back.

python
class FeedForward(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.embed_dim, 4 * config.embed_dim),
            nn.GELU(),
            nn.Linear(4 * config.embed_dim, config.embed_dim),
            nn.Dropout(config.dropout),
        )

    def forward(self, x: torch.Tensor):
        return self.net(x)

The attention layer decides which tokens to look at. The feed-forward layer processes what it found. Different jobs, run in sequence.

The transformer block

Each transformer block wraps attention and feed-forward together with LayerNorm and residual connections.

python
class TransformerBlock(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.embed_dim)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.embed_dim)
        self.ff = FeedForward(config)

    def forward(self, x: torch.Tensor):
        x = x + self.attn(self.ln1(x))   # residual connection
        x = x + self.ff(self.ln2(x))     # residual connection
        return x

The x + ... is the residual connection. It looks trivial but it's not — without it, deep networks fail to train because gradients shrink to nothing on the way back through 4, 8, or 12 layers. Residual connections give gradients a direct path to early layers.
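A toy calculation shows the effect. Suppose each sublayer, on its own, scales gradients by 0.5 on the way back (an assumed number, purely for illustration). Without residuals the factors multiply away to nothing; with residuals the backward pass gets a derivative of 1 + f'(x), so the identity path keeps the signal alive:

```python
layers = 12
local_grad = 0.5   # assumed per-sublayer gradient factor, for illustration only

plain = local_grad ** layers            # gradient reaching layer 1, no residuals
with_residual = (1 + local_grad) ** layers   # the "+1" identity path added

print(plain)          # ~0.00024: almost nothing survives 12 layers
print(with_residual)  # the signal survives (here even grows)
```

This is a caricature — real training also relies on LayerNorm to keep scales in check — but it captures why that "+1" in the derivative matters.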

LayerNorm runs before the sublayer here (Pre-LN). This is a practical deviation from the original "Attention Is All You Need" paper (which used Post-LN) and it trains more stably.


Step 5: Training

The training loop is straightforward once the model exists.

python
# train.py
import torch
from config import GPTConfig
from dataset import load_data, get_batch
from model import NanoGPT
import math

def get_lr(step: int, config: GPTConfig) -> float:
    """Cosine learning rate schedule with warmup."""
    warmup_steps = 100
    if step < warmup_steps:
        return config.learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / (config.max_steps - warmup_steps)
    return config.learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))

def train():
    config = GPTConfig()
    train_data, val_data = load_data("data/tinyshakespeare.txt", config)

    model = NanoGPT(config).to(config.device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)

    print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

    for step in range(config.max_steps):
        # update learning rate
        lr = get_lr(step, config)
        for g in optimizer.param_groups:
            g["lr"] = lr

        model.train()
        x, y = get_batch(train_data, config)
        logits, loss = model(x, y)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # prevent exploding gradients
        optimizer.step()

        if step % config.eval_interval == 0:
            model.eval()
            with torch.no_grad():
                _, val_loss = model(*get_batch(val_data, config))
            print(f"step {step:5d} | train loss {loss.item():.4f} | val loss {val_loss.item():.4f} | lr {lr:.2e}")

    torch.save(model.state_dict(), "nanogpt.pt")

if __name__ == "__main__":
    train()

The loss starts around 10.8. That's almost exactly ln(50257) — what you'd get from a model randomly guessing across 50,257 tokens. By step 5,000 it's around 1.5, which means the model has developed real preferences.

clip_grad_norm_ to 1.0 prevents training from blowing up. Without it, a bad batch occasionally produces enormous gradients that overwrite everything the model has learned. One line that matters a lot.

The cosine schedule decays the learning rate smoothly. Starting too high causes the loss to bounce. Staying too high at the end stops the model from settling. The cosine curve just works — I didn't tune it, I copied the standard warmup-plus-cosine recipe used for GPT-style training and it was fine.
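The schedule in get_lr is easy to sanity-check in isolation, using the same warmup_steps=100, max_steps=5000, and peak of 3e-4 as the config:

```python
import math

def get_lr(step: int, base_lr=3e-4, warmup_steps=100, max_steps=5_000) -> float:
    if step < warmup_steps:
        return base_lr * step / warmup_steps              # linear warmup
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

assert get_lr(50) == 1.5e-4      # halfway through warmup: half the peak rate
assert get_lr(100) == 3e-4       # peak, exactly as warmup ends
assert get_lr(5_000) < 1e-9      # cos(pi) == -1: decayed to ~0
```

Linear up for 100 steps, then half a cosine wave down to zero over the remaining 4,900.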

plaintext
step     0 | train loss 10.8231 | val loss 10.8189 | lr 3.00e-06
step   500 | train loss  4.2817 | val loss  4.3201 | lr 2.99e-04
step  1000 | train loss  3.1044 | val loss  3.2108 | lr 2.92e-04
step  2000 | train loss  2.3517 | val loss  2.5089 | lr 2.57e-04
step  3000 | train loss  1.9823 | val loss  2.1744 | lr 1.99e-04
step  4000 | train loss  1.7201 | val loss  1.9802 | lr 1.25e-04
step  5000 | train loss  1.5934 | val loss  1.8991 | lr 5.00e-05

The gap between train and val loss is normal — it shows the model memorised some training patterns. That's fine at this scale.


Step 6: Generating text

Once trained, the model generates text autoregressively — one token at a time, each new token fed back as input.

python
# generate.py
import torch
import torch.nn.functional as F
from config import GPTConfig
from model import NanoGPT
from tokenizer import encode, decode

def generate(
    model: NanoGPT,
    prompt: str,
    max_new_tokens: int = 200,
    temperature: float = 1.0,
    top_k: int = 40,
    device: str = "cpu",
) -> str:
    model.eval()
    config = model.config

    tokens = encode(prompt)
    x = torch.tensor(tokens, dtype=torch.long, device=device).unsqueeze(0)

    with torch.no_grad():
        for _ in range(max_new_tokens):
            # crop to context window if needed
            x_cond = x[:, -config.max_seq_len:]

            logits, _ = model(x_cond)
            logits = logits[:, -1, :] / temperature    # only care about last token

            # top-k: keep only the 40 most probable tokens
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float("-inf")

            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, next_token], dim=1)

    return decode(x[0].tolist())

if __name__ == "__main__":
    config = GPTConfig()
    model = NanoGPT(config)
    model.load_state_dict(torch.load("nanogpt.pt", map_location=config.device))

    output = generate(
        model,
        prompt="HAMLET:",
        max_new_tokens=200,
        temperature=0.8,
        top_k=40,
        device=config.device,
    )
    print(output)

Temperature controls randomness. At 1.0, sampling is proportional to the raw probabilities. At 0.5, the distribution sharpens — the model picks safer, more predictable tokens. At 1.5, it flattens — more variety, more surprises, more nonsense.
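Temperature's effect is easy to see on a toy distribution. This pure-Python sketch does the same logits/temperature division the generation code applies before sampling:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [l / temperature for l in logits]
    peak = max(scaled)                           # subtract max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 1.5):
    print(t, [round(p, 2) for p in softmax(logits, temperature=t)])
```

Low temperature concentrates probability mass on the top token; high temperature spreads it toward the alternatives — exactly the sharper/flatter behaviour described above.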

Top-k sampling removes the long tail. Without it, the model occasionally picks a very unlikely token and the output goes off the rails. Restricting to the top 40 keeps things coherent without killing variation.

Good defaults for generation

Start with temperature=0.8 and top_k=40. For more creative output try temperature=1.1. For more coherent output go lower, around 0.6. Never go below 0.3 — it starts repeating itself.


Scaling up to GPT-2

The only thing separating NanoGPT from GPT-2 Small is four numbers in the config:

python
# GPT-2 Small — change just these four values
@dataclass
class GPT2Config(GPTConfig):
    embed_dim: int = 768       # was 128
    num_heads: int = 12        # was 4
    num_layers: int = 12       # was 4
    max_seq_len: int = 1024    # was 256

That's it. The architecture is identical. The parameter count goes from ~10M to ~117M. The training goes from 30 minutes on CPU to days on GPU clusters. But the code structure doesn't change.

GPT-3 is just GPT-2 with embed_dim=12288, num_heads=96, num_layers=96, and a lot more data.


What actually made sense once I built it

Attention is not as magical as it sounds. Once you write the QKV matrix multiply by hand, it's dot products and softmax. That's it. The power comes from training — the model learns which queries and keys to produce such that the attention scores end up meaningful. The mechanism itself is just math.

Residual connections surprised me. x = x + sublayer(x) is one line of code. I actually removed them from my model to see what would happen. The loss barely budged for the first thousand steps, which made me think they didn't matter. Then I realised I was comparing to a model that was also broken — neither had converged. When I let both run to step 5,000, the one with residuals hit 1.59. The one without was stuck at 2.8. One line, roughly half the learning.

Watching the loss number drop is weirdly motivating. 10.8 means the model is guessing randomly. 1.5 means it's narrowed each prediction down to a few likely tokens. The number actually means something, and seeing it fall across 5,000 steps feels more like understanding than any explanation I'd read.

The scale thing hit me when I wrote the GPT-2 config. Four numbers. That's all that separates this toy from a model that was genuinely impressive in 2019. GPT-3 is the same again, just bigger. The architecture hasn't changed much since 2017. All the gains since then are scale, data, and training tricks.


Getting the code

The full implementation is on my GitHub. It includes a training script, a generation script, and a pre-tokenized version of TinyShakespeare so you can start training without waiting for downloads.

If you want to go deeper, Andrej Karpathy's nanoGPT repo and his Zero to Hero series cover this in much more detail — that's where I started before writing my own version from scratch.

Build it yourself. Reading about transformers is useful. Writing the causal mask by hand is something else.


Krunal Kanojiya

Technical Content Writer

Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation — previously Cromtek Solution and freelance.
