Pre-training and Language Modeling: How a Transformer Learns to Predict Text
The transformer architecture from Article 6 starts as a random function. Pre-training is what turns it into a language model. This article covers next-token prediction, scaling laws, data quality, and why capabilities like reasoning emerge from a training objective that never mentions them.
Article 6 ended with a transformer that has random weights and a loss of about 10.82. That number means the model has no idea what word comes next. It assigns equal probability to all 50,257 tokens in the vocabulary, which is statistically the same as guessing blindly. Pre-training is the process of changing that.
The architecture does not change. The same decoder-only transformer from Article 6 runs through the same forward pass, the same causal mask, the same residual connections. What changes is that you run it on hundreds of billions or trillions of tokens of real text, updating the weights after every batch. Over enough steps, the model stops guessing. It starts knowing.
This is Article 7 in the series. Article 6 built the transformer. Article 8 covers fine-tuning and RLHF, where a pre-trained model gets shaped into something that follows instructions. You cannot fine-tune a model that was never pre-trained well. The quality of the pre-trained base sets a ceiling on everything that comes after.
The training objective: predict the next token
Pre-training is simpler than it sounds. The model gets a sequence of tokens. For every position in that sequence, it must predict which token comes next. The loss is cross-entropy between the predicted distribution and the actual next token.
That is the entire objective. No human annotations. No task labels. No reward signal beyond "how well did you predict the next word?"
import torch
import torch.nn.functional as F
# simulated forward pass of a pre-trained transformer
# (using the model from Article 6)
batch_size = 8
seq_len = 256
vocab_size = 50_257
# model produces logits: one distribution per token position
logits = torch.randn(batch_size, seq_len, vocab_size)
# targets: the actual next tokens (input shifted by 1 position)
# if input is [t0, t1, t2, ..., t255], target is [t1, t2, t3, ..., t256]
targets = torch.randint(0, vocab_size, (batch_size, seq_len))
# cross-entropy loss across all positions and all examples in the batch
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    targets.view(-1)
)
print(f"loss: {loss.item():.4f}")
# random logits: ~10.82 = ln(50257)
# well-trained model on language: ~1.5 to 2.5 depending on corpus

The targets tensor is just the input sequence shifted by one position. You already have the labels in the training data. You do not need to label anything. The next word in a book is the label for the current word.
This is why pre-training can scale to trillions of tokens cheaply. Every sentence on the internet is labeled training data. The label is just the next word.
What the loss actually measures
The cross-entropy loss during pre-training is not just a number you watch fall. It has a direct interpretation.
Loss of 10.82 means the model assigns each token a probability of roughly 1/50257. Total ignorance.
Loss of 2.5 means the model has narrowed the next token down to about e^2.5 ≈ 12 plausible options.
Loss of 1.5 means it has narrowed it to about e^1.5 ≈ 4 options.
The model never reaches zero loss because language is genuinely unpredictable. After "The weather in Mumbai is", any of dozens of words could reasonably follow. A model with a loss near zero would be memorizing, not learning language.
import math
print("Loss interpretation:")
for loss_val in [10.82, 4.0, 2.5, 1.8, 1.5]:
    effective_vocab = math.exp(loss_val)
    print(f" loss {loss_val:.2f} => model narrows next token to ~{effective_vocab:.0f} candidates")

Loss interpretation:
loss 10.82 => model narrows next token to ~50257 candidates (random guessing)
loss 4.00 => model narrows next token to ~55 candidates
loss 2.50 => model narrows next token to ~12 candidates
loss 1.80 => model narrows next token to ~6 candidates
loss 1.50 => model narrows next token to ~4 candidates

When I built NanoGPT and watched this number fall from 10.8 to 1.59 over 5,000 steps on Shakespeare, it clicked for me in a way reading about it never did. The model was not learning "Shakespeare." It was getting better at predicting the next character of Shakespeare. That is the whole thing.
A minimal pre-training loop
Here is what the actual training code looks like for a small language model. This is the same loop used in my NanoGPT project, cleaned up and annotated.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# assume model is the DecoderOnlyTransformer from Article 6
# assume train_data is a large 1D tensor of token IDs
def get_batch(data: torch.Tensor, seq_len: int, batch_size: int, device: str):
    """Sample random starting positions and return (inputs, targets)."""
    ix = torch.randint(len(data) - seq_len, (batch_size,))
    x = torch.stack([data[i : i + seq_len] for i in ix])
    y = torch.stack([data[i + 1 : i + seq_len + 1] for i in ix])
    return x.to(device), y.to(device)
def cosine_lr(step: int, max_steps: int, max_lr: float, min_lr: float, warmup: int):
    """Learning rate warmup then cosine decay."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
def train(model, train_data, val_data, config):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["max_lr"],
        betas=(0.9, 0.95),
        weight_decay=0.1
    )
    for step in range(config["max_steps"]):
        # update learning rate
        lr = cosine_lr(step, config["max_steps"], config["max_lr"], config["min_lr"], config["warmup"])
        for g in optimizer.param_groups:
            g["lr"] = lr
        # training step
        model.train()
        x, y = get_batch(train_data, config["seq_len"], config["batch_size"], config["device"])
        logits, loss = model(x, y)
        optimizer.zero_grad()
        loss.backward()
        # gradient clipping prevents occasional bad batches from corrupting weights
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        # periodic validation
        if step % config["eval_every"] == 0:
            model.eval()
            with torch.no_grad():
                _, val_loss = model(*get_batch(val_data, config["seq_len"], config["batch_size"], config["device"]))
            print(f"step {step:6d} | train {loss.item():.4f} | val {val_loss.item():.4f} | lr {lr:.2e}")
# example output (training on TinyShakespeare, NanoGPT scale):
# step 0 | train 10.8231 | val 10.8189 | lr 3.00e-06
# step 1000 | train 3.1044 | val 3.2108 | lr 2.92e-04
# step 3000 | train 1.9823 | val 2.1744 | lr 1.99e-04
# step 5000 | train 1.5934 | val 1.8991 | lr 5.00e-05

A few things worth stopping on here.
Weight decay of 0.1 is higher than the 0.01 we used in Article 3. At pre-training scale, with billions of parameters, stronger regularization helps. The AdamW betas of (0.9, 0.95) also differ from PyTorch's default of (0.9, 0.999). The second beta controls how quickly the optimizer's estimate of gradient variance adapts, and the shorter memory of 0.95 works better for large-scale language model training on diverse text.
Gradient clipping at 1.0 is non-negotiable. Pre-training runs for hundreds of thousands of steps. Any one bad batch that blows up the gradients can corrupt weeks of compute.
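What clipping does is easy to see in isolation. The sketch below mirrors the core of `torch.nn.utils.clip_grad_norm_` (a simplified version, ignoring its `norm_type` and error-handling options): compute the global L2 norm over all gradients, and rescale every gradient by the same factor if that norm exceeds the threshold.

```python
import torch

def clip_gradients(params, max_norm: float = 1.0) -> float:
    """Simplified sketch of clip_grad_norm_: global L2 norm, uniform rescale."""
    grads = [p.grad for p in params if p.grad is not None]
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > max_norm:
        # scale all gradients by the same factor so their direction is preserved
        scale = max_norm / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)
    return total_norm.item()

# usage: a parameter with a deliberately exploding gradient
p = torch.nn.Parameter(torch.ones(10))
p.grad = torch.full((10,), 100.0)          # norm = 100 * sqrt(10) ≈ 316
norm_before = clip_gradients([p], max_norm=1.0)
print(f"norm before: {norm_before:.1f}, after: {p.grad.norm().item():.4f}")
```

The key property is that clipping rescales rather than truncates: the gradient's direction is untouched, only its magnitude is capped.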
Scaling laws: how loss depends on compute, parameters, and data
In 2020, OpenAI published what became known as the Kaplan scaling laws. They showed that language model performance follows predictable power-law relationships with model size, dataset size, and compute. Bigger models trained on more data with more compute performed better, and the relationship was smooth enough to extrapolate.
There was a problem. Kaplan's laws suggested you should scale model size faster than data. GPT-3 followed this: 175 billion parameters trained on roughly 300 billion tokens, a ratio of about 1.7 tokens per parameter. Most large models of that era followed similar patterns.
In 2022, DeepMind published the Chinchilla paper. By training over 400 language models ranging from 70 million to 16 billion parameters on 5 to 500 billion tokens, they found that for compute-optimal training, model size and training tokens should scale equally. The optimal ratio was approximately 20 tokens per parameter.
The implication was sobering. GPT-3 had been trained on far less data than it could have used. A 70 billion parameter model trained on 1.4 trillion tokens would outperform a 280 billion parameter model trained on the same compute budget but fewer tokens. Chinchilla, their 70 billion parameter model, proved this by outperforming Gopher (280 billion parameters) on almost every benchmark while using the same compute.
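These numbers can be reproduced with the standard back-of-envelope approximation from the scaling-law literature: training compute is roughly C = 6·N·D FLOPs for N parameters and D tokens. Combining that with the ~20 tokens-per-parameter rule pins down the compute-optimal model for any budget (the budgets in the loop below are illustrative):

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6*N*D with D = ratio*N, giving N = sqrt(C / (6*ratio))."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Chinchilla's own budget: 6 * 70e9 * 1.4e12 ≈ 5.88e23 FLOPs
for budget in [1e21, 5.88e23, 1e25]:
    n, d = chinchilla_optimal(budget)
    print(f"C = {budget:.2e} FLOPs => ~{n/1e9:.1f}B params, ~{d/1e12:.2f}T tokens")
```

Plugging in Chinchilla's own compute budget recovers the 70B-parameter, 1.4T-token split, which is a quick sanity check on the approximation.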
# what the Chinchilla finding means in practice
models = [
("GPT-3", 175e9, 300e9), # 175B params, 300B tokens
("Chinchilla", 70e9, 1.4e12), # 70B params, 1.4T tokens
("LLaMA-3-8B", 8e9, 15e12), # 8B params, 15T tokens (2024)
("Qwen3-0.6B", 0.6e9, 36e12), # 0.6B params, 36T tokens (2025)
]
print(f"{'Model':<16} | {'Params':>10} | {'Tokens':>10} | {'Ratio':>10}")
print("-" * 54)
for name, params, tokens in models:
    ratio = tokens / params
    print(f"{name:<16} | {params/1e9:>8.1f}B | {tokens/1e12:>8.1f}T | {ratio:>10.0f}:1")

Model | Params | Tokens | Ratio
------------------------------------------------------
GPT-3 | 175.0B | 0.3T | 2:1
Chinchilla | 70.0B | 1.4T | 20:1
LLaMA-3-8B | 8.0B | 15.0T | 1875:1
Qwen3-0.6B | 0.6B | 36.0T | 60000:1

That last row is not a typo. Qwen3-0.6B, released by Alibaba in April 2025, was trained on 36 trillion tokens with 600 million parameters, a ratio of 60,000 to 1. That is the highest tokens-to-parameters ratio ever recorded for a text model.
The ratio of training data to active parameters in open-weight LLMs has grown 3.1 times per year since 2022, and recent models have been trained with 20 times more data per parameter than the optimal ratio suggested by the Chinchilla scaling laws.
The reason is inference economics. A smaller model trained for longer costs less per query when you serve it to millions of users. Chinchilla optimal means optimal for training compute. It does not mean optimal for total cost including deployment. Following Chinchilla strictly leads to what some researchers call the "Chinchilla trap": an overparameterized model that is expensive to serve.
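A back-of-envelope sketch makes the trap concrete. Training costs roughly 6·N·D FLOPs, and a forward pass at inference costs roughly 2·N FLOPs per generated token, so lifetime serving cost scales directly with parameter count. The 10 trillion served tokens below is an illustrative assumption, not a measured figure:

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float):
    train = 6 * n_params * train_tokens    # standard training-compute approximation
    serve = 2 * n_params * served_tokens   # forward pass only, per generated token
    return train, serve

served = 1e13  # assume 10T tokens generated over the model's deployed lifetime
for name, n, d in [("Chinchilla-optimal 70B", 70e9, 1.4e12),
                   ("overtrained 8B", 8e9, 15e12)]:
    train, serve = lifetime_flops(n, d, served)
    print(f"{name}: train {train:.2e} + serve {serve:.2e} = {train + serve:.2e} FLOPs")
```

Under these assumptions, the overtrained 8B model costs more to train but far less in total, which is exactly the trade the LLaMA-3 and Qwen3 ratios in the table above are making.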
Why data quality became a first-class concern
Scaling laws tell you how loss behaves as you add more tokens and parameters. For years, the community treated data quality as a preprocessing checkbox: deduplicate, filter obvious junk, move on. That changed in 2024 and 2025.
Research on FineWeb-Edu, HuggingFace's 1.3 trillion token educational subset of their web crawl, showed that a 1.82 billion parameter model trained on 350 billion tokens of FineWeb-Edu outperformed models trained on the full 15 trillion token FineWeb dataset. On the ARC reasoning benchmark, performance jumped from 46% to 57%. On MMLU, it improved from 33% to 37%.
Higher quality data substituted for both scale and compute.
A 2025 paper formalizing this relationship found that data quality is itself a variable in the loss function, not just a preprocessing step. Their key finding: loss scales predictably with data quality, and higher quality data can substantially reduce both model size and compute requirements.
# illustrating the data quality effect on effective compute
# rough model: effective_tokens = actual_tokens * quality_multiplier
# quality_multiplier = 1.0 for average web text
# quality_multiplier = 1.3 to 2.0 for curated educational data (estimated from FineWeb-Edu results)
datasets = [
("Raw Common Crawl (C4)", 156e9, 1.0),
("Dolma", 3e12, 1.1),
("FineWeb", 15e12, 1.2),
("FineWeb-Edu", 1.3e12, 1.8),
]
print(f"{'Dataset':<28} | {'Raw tokens':>12} | {'Effective':>12}")
print("-" * 58)
for name, tokens, quality in datasets:
    effective = tokens * quality
    print(f"{name:<28} | {tokens/1e12:>9.1f}T | {effective/1e12:>9.1f}T")

Dataset | Raw tokens | Effective
----------------------------------------------------------
Raw Common Crawl (C4) | 0.2T | 0.2T
Dolma | 3.0T | 3.3T
FineWeb | 15.0T | 18.0T
FineWeb-Edu | 1.3T | 2.3T

The FineWeb-Edu story is striking: 1.3 trillion carefully filtered tokens outperforming 15 trillion tokens of average web text on reasoning benchmarks. You cannot just count tokens anymore. You have to ask what those tokens are worth.
Modern curation pipelines for pre-training data now run multiple stages. The first stage collects text at scale from web crawls, books, code repositories, and scientific papers. The second stage applies heuristic filters: remove duplicate content, filter out short or low-information documents, remove text in unsupported languages. The third stage runs classifier-based quality scoring, sometimes using a smaller model trained on high-quality seeds to score the rest of the corpus. Data that scores below the threshold gets removed even if it would add to the raw token count.
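A toy version of the second and third stages makes the structure concrete. The thresholds and the keyword-based scorer are invented for illustration; real pipelines like FineWeb-Edu use a trained classifier for the scoring stage:

```python
def heuristic_filter(doc: str) -> bool:
    """Stage 2: cheap rules. Thresholds here are illustrative, not from a real pipeline."""
    words = doc.split()
    if len(words) < 50:                        # too short / low information
        return False
    if len(set(words)) / len(words) < 0.3:     # highly repetitive (spam, boilerplate)
        return False
    return True

def quality_score(doc: str) -> float:
    """Stage 3 stand-in: real pipelines use a classifier trained on high-quality seeds."""
    educational_markers = ("theorem", "because", "for example", "defined as")
    return sum(m in doc.lower() for m in educational_markers) / len(educational_markers)

def curate(corpus, threshold: float = 0.25):
    kept = [d for d in corpus if heuristic_filter(d)]
    return [d for d in kept if quality_score(d) >= threshold]

good = ("A prime number is defined as an integer greater than one whose only "
        "divisors are one and itself. For example, two, three, five and seven "
        "are prime because no smaller integer divides them evenly. " * 2)
spam = "buy now " * 100
print(len(curate([good, spam])))   # 1: the spam doc fails the repetition rule
```

The shape is what matters: each stage discards data that would have padded the raw token count but added little to the effective token count.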
Mixing also matters. Pure synthetic data consistently underperforms a mixture of real and synthetic data in 2024 and 2025 research. Models trained entirely on synthetically generated text tend to be narrower and less capable on out-of-distribution tasks. The current best practice is synthetic data as a supplement, not a replacement, for real web text.
What the model actually learns during pre-training
This is the part that I find genuinely strange and interesting.
Nobody tells the model what a sentence is. Nobody labels subject and verb. Nobody says "this token refers to a person" or "this paragraph is about economics." The training objective is just: predict the next token.
And yet, after pre-training on enough text, the model has learned:
Grammar, because sentences that violate grammar have lower probability under any reasonable language model. The model learns grammar as a side effect of trying to predict text correctly.
Facts, because a model that knows Paris is the capital of France assigns higher probability to "Paris" after "The capital of France is" than a model that does not. Factual knowledge improves next-token prediction.
Reasoning patterns, because step-by-step reasoning leads to more predictable text than random conclusions. A model that can follow a chain of reasoning predicts text better than one that cannot.
Code structure, because syntactically valid code is far more predictable than random character sequences. Learning to predict code means learning to understand it.
None of this was programmed. All of it emerges because it helps with next-token prediction. That is the insight that makes pre-training so powerful and so strange at the same time.
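The common mechanism behind all four is the chain rule: the model scores a whole sequence as a product of per-position next-token probabilities, so anything that sharpens individual predictions (grammar, facts, reasoning steps) raises the probability of coherent text over incoherent text. A toy illustration with hypothetical per-token probabilities (these numbers are invented, not from a real model):

```python
import math

# hypothetical probabilities a model might assign, token by token
coherent   = [0.4, 0.6, 0.8, 0.7]    # grammatical, factual continuation
incoherent = [0.4, 0.6, 0.01, 0.02]  # same prefix, then tokens that break grammar

def sequence_log_prob(token_probs):
    """log P(t1..tn) = sum_i log P(t_i | t_<i)"""
    return sum(math.log(p) for p in token_probs)

print(f"coherent:   {sequence_log_prob(coherent):.2f}")
print(f"incoherent: {sequence_log_prob(incoherent):.2f}")
```

A model that has learned grammar and facts concentrates probability on the coherent continuation, which is exactly what lowers its cross-entropy loss.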
A model that was pre-trained poorly cannot be fine-tuned into a capable assistant. RLHF, covered in Article 8, can shape the model's behavior, but it cannot install knowledge or reasoning ability that was never learned during pre-training. The pre-trained base sets a ceiling on what fine-tuning can achieve. This is why teams at major labs spend enormous effort on data quality and training runs before any alignment work begins.
The emergent capability question
One thing that surprised researchers as models scaled was the appearance of capabilities that were not present in smaller models at all, then suddenly appeared above a certain scale threshold.
Models below roughly 10 billion parameters would fail completely on certain multi-step reasoning tasks. Models above 100 billion parameters would solve them. The transition happened quickly, over a narrow range of scale. Researchers called this emergence.
The honest interpretation is still debated. Some researchers argue the capabilities were always there in seed form and became measurable only when models got large enough to be accurate enough on the component steps. Others believe genuine phase transitions occur where qualitatively new abilities appear.
What is not debated is the practical implication: you cannot always predict what a model will be able to do by extrapolating from smaller scale experiments. Something becomes possible at scale that was impossible at smaller scale, even with the same architecture and training objective.
This is what Article 8 deals with directly. After pre-training, the model has all of this learned capability, including some capabilities nobody explicitly trained it to have. Fine-tuning and RLHF shape how the model uses those capabilities in response to human instructions.
The connection between Articles 6 and 8
The transformer from Article 6 is a function. Pre-training is how you set the parameters of that function so it has learned something useful about language.
After pre-training, the model can predict text. It can complete sentences, generate code, summarize passages. But it is not a conversational assistant. Ask it a question and it will continue the text as if generating more of a document, not answer the question directly. It might generate "What is the capital of France? The capital of France is..." and keep going for pages.
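The reason is visible in the generation loop itself: sample a token from the model's distribution at the last position, append it, repeat. Nothing in the loop distinguishes a question from any other prefix. A sketch with a stand-in model returning random logits (the real decoder from Article 6 would slot into the same place):

```python
import torch

vocab_size = 50_257

def fake_model(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the Article 6 transformer: logits for each position."""
    return torch.randn(tokens.shape[0], vocab_size)

def generate(prompt_tokens: torch.Tensor, max_new_tokens: int = 20, temperature: float = 1.0):
    tokens = prompt_tokens.clone()
    for _ in range(max_new_tokens):
        logits = fake_model(tokens)[-1] / temperature   # distribution at the last position
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token])        # append and continue
    return tokens

out = generate(torch.tensor([42, 7, 1000]))
print(out.shape)   # torch.Size([23]): the 3 prompt tokens plus 20 continuation tokens
```

Fine-tuning changes the distribution this loop samples from, not the loop itself: an instruction-tuned model simply assigns high probability to answer-shaped continuations.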
Article 8 covers how you take this pre-trained model and make it useful as a product. That requires supervised fine-tuning on demonstrations of good behavior, then reinforcement learning from human feedback to align the model with what users actually want. The entire alignment stack depends on pre-training having done its job well.
Next in the series
Article 8 covers fine-tuning and RLHF. You will see how supervised fine-tuning shapes the model's response style, what reward modeling means and how it captures human preferences, and why PPO-based alignment changed how language models behave in practice. Everything that makes ChatGPT, Claude, or Gemini feel like an assistant rather than a text autocomplete engine comes from this stage.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.