Pre-training and Language Modeling: How a Transformer Learns to Predict Text
The transformer architecture from Article 6 starts as a random function. Pre-training is what turns it into a language model. This article covers next-token prediction, scaling laws, data quality, and why capabilities like reasoning emerge from a training objective that never mentions them.
Article 6 ended with a transformer that has random weights and a loss of about 10.82. That number means the model has no idea what word comes next. It assigns equal probability to all 50,257 tokens in the vocabulary, which is statistically the same as guessing blindly. Pre-training is the process of changing that.
The architecture does not change. The same decoder-only transformer from Article 6 runs through the same forward pass, the same causal mask, the same residual connections. What changes is that you run it on hundreds of billions or trillions of tokens of real text, updating the weights after every batch. Over enough steps, the model stops guessing. It starts knowing.
This is Article 7 in the series. Article 6 built the transformer. Article 8 covers fine-tuning and RLHF, where a pre-trained model gets shaped into something that follows instructions. You cannot fine-tune a model that was never pre-trained well. The quality of the pre-trained base sets a ceiling on everything that comes after.
The training objective: predict the next token
Pre-training is simpler than it sounds. The model gets a sequence of tokens. For every position in that sequence, it must predict which token comes next. The loss is cross-entropy between the predicted distribution and the actual next token.
That is the entire objective. No human annotations. No task labels. No reward signal beyond "how well did you predict the next word?"
import torch
import torch.nn.functional as F
# simulated forward pass of a pre-trained transformer
# (using the model from Article 6)
batch_size = 8
seq_len = 256
vocab_size = 50_257
# model produces logits: one distribution per token position
logits = torch.randn(batch_size, seq_len, vocab_size)
# targets: the actual next tokens (input shifted by 1 position)
# if input is [t0, t1, t2, ..., t255], target is [t1, t2, t3, ..., t256]
targets = torch.randint(0, vocab_size, (batch_size, seq_len))
# cross-entropy loss across all positions and all examples in the batch
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    targets.view(-1)
)
print(f"loss: {loss.item():.4f}")
# random logits: ~10.82 = ln(50257)
# well-trained model on language: ~1.5 to 2.5 depending on corpus

The targets tensor is just the input sequence shifted by one position. You already have the labels in the training data. You do not need to label anything. The next word in a book is the label for the current word.
This is why pre-training can scale to trillions of tokens cheaply. Every sentence on the internet is labeled training data. The label is just the next word.
What the loss actually measures
The cross-entropy loss during pre-training is not just a number you watch fall. It has a direct interpretation.
Loss of 10.82 means the model assigns each token a probability of roughly 1/50257. Total ignorance.
Loss of 2.5 means the model has narrowed the next token down to about e^2.5 ≈ 12 plausible options.
Loss of 1.5 means it has narrowed it to about e^1.5 ≈ 4 options.
The model never reaches zero loss because language is genuinely unpredictable. After "The weather in Mumbai is", any of dozens of words could reasonably follow. A model with a loss near zero would be memorizing, not learning language.
import math
print("Loss interpretation:")
for loss_val in [10.82, 4.0, 2.5, 1.8, 1.5]:
    effective_vocab = math.exp(loss_val)
    print(f" loss {loss_val:.2f} => model narrows next token to ~{effective_vocab:.0f} candidates")

Loss interpretation:
loss 10.82 => model narrows next token to ~50257 candidates (random guessing)
loss 4.00 => model narrows next token to ~55 candidates
loss 2.50 => model narrows next token to ~12 candidates
loss 1.80 => model narrows next token to ~6 candidates
loss 1.50 => model narrows next token to ~4 candidates

When I built NanoGPT and watched this number fall from 10.8 to 1.59 over 5,000 steps on Shakespeare, it clicked for me in a way reading about it never did. The model was not learning "Shakespeare." It was getting better at predicting the next character of Shakespeare. That is the whole thing.
A minimal pre-training loop
Here is what the actual training code looks like for a small language model. This is the same loop used in my NanoGPT project, cleaned up and annotated.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# assume model is the DecoderOnlyTransformer from Article 6
# assume train_data is a large 1D tensor of token IDs
def get_batch(data: torch.Tensor, seq_len: int, batch_size: int, device: str):
    """Sample random starting positions and return (inputs, targets)."""
    ix = torch.randint(len(data) - seq_len, (batch_size,))
    x = torch.stack([data[i : i + seq_len] for i in ix])
    y = torch.stack([data[i + 1 : i + seq_len + 1] for i in ix])
    return x.to(device), y.to(device)
def cosine_lr(step: int, max_steps: int, max_lr: float, min_lr: float, warmup: int):
    """Learning rate warmup then cosine decay."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
def train(model, train_data, val_data, config):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["max_lr"],
        betas=(0.9, 0.95),
        weight_decay=0.1
    )
    for step in range(config["max_steps"]):
        # update learning rate
        lr = cosine_lr(step, config["max_steps"], config["max_lr"], config["min_lr"], config["warmup"])
        for g in optimizer.param_groups:
            g["lr"] = lr
        # training step
        model.train()
        x, y = get_batch(train_data, config["seq_len"], config["batch_size"], config["device"])
        logits, loss = model(x, y)
        optimizer.zero_grad()
        loss.backward()
        # gradient clipping prevents occasional bad batches from corrupting weights
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        # periodic validation
        if step % config["eval_every"] == 0:
            model.eval()
            with torch.no_grad():
                _, val_loss = model(*get_batch(val_data, config["seq_len"], config["batch_size"], config["device"]))
            print(f"step {step:6d} | train {loss.item():.4f} | val {val_loss.item():.4f} | lr {lr:.2e}")
# example output (training on TinyShakespeare, NanoGPT scale):
# step 0 | train 10.8231 | val 10.8189 | lr 3.00e-06
# step 1000 | train 3.1044 | val 3.2108 | lr 2.92e-04
# step 3000 | train 1.9823 | val 2.1744 | lr 1.99e-04
# step 5000 | train 1.5934 | val 1.8991 | lr 5.00e-05

A few things worth stopping on here.
Weight decay of 0.1 is higher than the 0.01 we used in Article 3. At pre-training scale, with billions of parameters, stronger regularization helps. The AdamW betas of (0.9, 0.95) also differ from PyTorch's default of (0.9, 0.999). The second beta controls how quickly the optimizer's estimate of gradient variance adapts, and the shorter memory of 0.95 works better for large-scale language model training on diverse text.
Gradient clipping at 1.0 is non-negotiable. Pre-training runs for hundreds of thousands of steps. Any one bad batch that blows up the gradients can corrupt weeks of compute.
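What clipping does is easy to see in isolation. The sketch below mirrors the core of `torch.nn.utils.clip_grad_norm_` (a simplified version, ignoring its `norm_type` and error-handling options): compute the global L2 norm over all gradients, and rescale every gradient by the same factor if that norm exceeds the threshold.

```python
import torch

def clip_gradients(params, max_norm: float = 1.0) -> float:
    """Simplified sketch of clip_grad_norm_: global L2 norm, uniform rescale."""
    grads = [p.grad for p in params if p.grad is not None]
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > max_norm:
        # scale all gradients by the same factor so their direction is preserved
        scale = max_norm / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)
    return total_norm.item()

# usage: a parameter with a deliberately exploding gradient
p = torch.nn.Parameter(torch.ones(10))
p.grad = torch.full((10,), 100.0)          # norm = 100 * sqrt(10) ≈ 316
norm_before = clip_gradients([p], max_norm=1.0)
print(f"norm before: {norm_before:.1f}, after: {p.grad.norm().item():.4f}")
```

The key property is that clipping rescales rather than truncates: the gradient's direction is untouched, only its magnitude is capped.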
Scaling laws: how loss depends on compute, parameters, and data
In 2020, OpenAI published what became known as the Kaplan scaling laws. They showed that language model performance follows predictable power-law relationships with model size, dataset size, and compute. Bigger models trained on more data with more compute performed better, and the relationship was smooth enough to extrapolate.
There was a problem. Kaplan's laws suggested you should scale model size faster than data. GPT-3 followed this: 175 billion parameters trained on roughly 300 billion tokens, a ratio of about 1.7 tokens per parameter. Most large models of that era followed similar patterns.
In 2022, DeepMind published the Chinchilla paper. By training over 400 language models ranging from 70 million to 16 billion parameters on 5 to 500 billion tokens, they found that for compute-optimal training, model size and training tokens should scale equally. The optimal ratio was approximately 20 tokens per parameter.
The implication was sobering. GPT-3 had been trained on far less data than it could have used. A 70 billion parameter model trained on 1.4 trillion tokens would outperform a 280 billion parameter model trained on the same compute budget but fewer tokens. Chinchilla, their 70 billion parameter model, proved this by outperforming Gopher (280 billion parameters) on almost every benchmark while using the same compute.
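These numbers can be reproduced with the standard back-of-envelope approximation from the scaling-law literature: training compute is roughly C = 6·N·D FLOPs for N parameters and D tokens. Combining that with the ~20 tokens-per-parameter rule pins down the compute-optimal model for any budget (the budgets in the loop below are illustrative):

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Solve C = 6*N*D with D = ratio*N, giving N = sqrt(C / (6*ratio))."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Chinchilla's own budget: 6 * 70e9 * 1.4e12 ≈ 5.88e23 FLOPs
for budget in [1e21, 5.88e23, 1e25]:
    n, d = chinchilla_optimal(budget)
    print(f"C = {budget:.2e} FLOPs => ~{n/1e9:.1f}B params, ~{d/1e12:.2f}T tokens")
```

Plugging in Chinchilla's own compute budget recovers the 70B-parameter, 1.4T-token split, which is a quick sanity check on the approximation.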
# what the Chinchilla finding means in practice
models = [
("GPT-3", 175e9, 300e9), # 175B params, 300B tokens
("Chinchilla", 70e9, 1.4e12), # 70B params, 1.4T tokens
("LLaMA-3-8B", 8e9, 15e12), # 8B params, 15T tokens (2024)
("Qwen3-0.6B", 0.6e9, 36e12), # 0.6B params, 36T tokens (2025)
]
print(f"{'Model':<16} | {'Params':>10} | {'Tokens':>10} | {'Ratio':>10}")
print("-" * 54)
for name, params, tokens in models:
    ratio = tokens / params
    print(f"{name:<16} | {params/1e9:>8.1f}B | {tokens/1e12:>8.1f}T | {ratio:>10.0f}:1")

Model | Params | Tokens | Ratio
------------------------------------------------------
GPT-3 | 175.0B | 0.3T | 2:1
Chinchilla | 70.0B | 1.4T | 20:1
LLaMA-3-8B | 8.0B | 15.0T | 1875:1
Qwen3-0.6B | 0.6B | 36.0T | 60000:1

That last row is not a typo. Qwen3-0.6B, released by Alibaba in April 2025, was trained on 36 trillion tokens with 600 million parameters, a ratio of 60,000 to 1. That is the highest tokens-to-parameters ratio ever recorded for a text model.
The ratio of training data to active parameters in open-weight LLMs has grown 3.1 times per year since 2022, and recent models have been trained with 20 times more data per parameter than the optimal ratio suggested by the Chinchilla scaling laws.
The reason is inference economics. A smaller model trained for longer costs less per query when you serve it to millions of users. Chinchilla optimal means optimal for training compute. It does not mean optimal for total cost including deployment. Following Chinchilla strictly leads to what some researchers call the "Chinchilla trap": an overparameterized model that is expensive to serve.
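A back-of-envelope sketch makes the trap concrete. Training costs roughly 6·N·D FLOPs, and a forward pass at inference costs roughly 2·N FLOPs per generated token, so lifetime serving cost scales directly with parameter count. The 10 trillion served tokens below is an illustrative assumption, not a measured figure:

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float):
    train = 6 * n_params * train_tokens    # standard training-compute approximation
    serve = 2 * n_params * served_tokens   # forward pass only, per generated token
    return train, serve

served = 1e13  # assume 10T tokens generated over the model's deployed lifetime
for name, n, d in [("Chinchilla-optimal 70B", 70e9, 1.4e12),
                   ("overtrained 8B", 8e9, 15e12)]:
    train, serve = lifetime_flops(n, d, served)
    print(f"{name}: train {train:.2e} + serve {serve:.2e} = {train + serve:.2e} FLOPs")
```

Under these assumptions, the overtrained 8B model costs more to train but far less in total, which is exactly the trade the LLaMA-3 and Qwen3 ratios in the table above are making.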
Why data quality became a first-class concern
Scaling laws tell you how loss behaves as you add more tokens and parameters. For years, the community treated data quality as a preprocessing checkbox: deduplicate, filter obvious junk, move on. That changed in 2024 and 2025.
Research on FineWeb-Edu, HuggingFace's 1.3 trillion token educational subset of their web crawl, showed that a 1.82 billion parameter model trained on 350 billion tokens of FineWeb-Edu outperformed models trained on the full 15 trillion token FineWeb dataset. On the ARC reasoning benchmark, performance jumped from 46% to 57%. On MMLU, it improved from 33% to 37%.
Higher quality data substituted for both scale and compute.
A 2025 paper formalizing this relationship found that data quality is itself a variable in the loss function, not just a preprocessing step. Their key finding: loss scales predictably with data quality, and higher quality data can substantially reduce both model size and compute requirements.
# illustrating the data quality effect on effective compute
# rough model: effective_tokens = actual_tokens * quality_multiplier
# quality_multiplier = 1.0 for average web text
# quality_multiplier = 1.3 to 2.0 for curated educational data (estimated from FineWeb-Edu results)
datasets = [
("Raw Common Crawl (C4)", 156e9, 1.0),
("Dolma", 3e12, 1.1),
("FineWeb", 15e12, 1.2),
("FineWeb-Edu", 1.3e12, 1.8),
]
print(f"{'Dataset':<28} | {'Raw tokens':>12} | {'Effective':>12}")
print("-" * 58)
for name, tokens, quality in datasets:
    effective = tokens * quality
    print(f"{name:<28} | {tokens/1e12:>9.1f}T | {effective/1e12:>9.1f}T")

Dataset | Raw tokens | Effective
----------------------------------------------------------
Raw Common Crawl (C4) | 0.2T | 0.2T
Dolma | 3.0T | 3.3T
FineWeb | 15.0T | 18.0T
FineWeb-Edu | 1.3T | 2.3T

The FineWeb-Edu story is striking: 1.3 trillion carefully filtered tokens outperforming 15 trillion tokens of average web text on reasoning benchmarks. You cannot just count tokens anymore. You have to ask what those tokens are worth.
Modern curation pipelines for pre-training data now run multiple stages. The first stage collects text at scale from web crawls, books, code repositories, and scientific papers. The second stage applies heuristic filters: remove duplicate content, filter out short or low-information documents, remove text in unsupported languages. The third stage runs classifier-based quality scoring, sometimes using a smaller model trained on high-quality seeds to score the rest of the corpus. Data that scores below the threshold gets removed even if it would add to the raw token count.
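A toy version of the second and third stages makes the structure concrete. The thresholds and the keyword-based scorer are invented for illustration; real pipelines like FineWeb-Edu use a trained classifier for the scoring stage:

```python
def heuristic_filter(doc: str) -> bool:
    """Stage 2: cheap rules. Thresholds here are illustrative, not from a real pipeline."""
    words = doc.split()
    if len(words) < 50:                        # too short / low information
        return False
    if len(set(words)) / len(words) < 0.3:     # highly repetitive (spam, boilerplate)
        return False
    return True

def quality_score(doc: str) -> float:
    """Stage 3 stand-in: real pipelines use a classifier trained on high-quality seeds."""
    educational_markers = ("theorem", "because", "for example", "defined as")
    return sum(m in doc.lower() for m in educational_markers) / len(educational_markers)

def curate(corpus, threshold: float = 0.25):
    kept = [d for d in corpus if heuristic_filter(d)]
    return [d for d in kept if quality_score(d) >= threshold]

good = ("A prime number is defined as an integer greater than one whose only "
        "divisors are one and itself. For example, two, three, five and seven "
        "are prime because no smaller integer divides them evenly. " * 2)
spam = "buy now " * 100
print(len(curate([good, spam])))   # 1: the spam doc fails the repetition rule
```

The shape is what matters: each stage discards data that would have padded the raw token count but added little to the effective token count.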
Mixing also matters. Pure synthetic data consistently underperforms a mixture of real and synthetic data in 2024 and 2025 research. Models trained entirely on synthetically generated text tend to be narrower and less capable on out-of-distribution tasks. The current best practice is synthetic data as a supplement, not a replacement, for real web text.
What the model actually learns during pre-training
This is the part that I find genuinely strange and interesting.
Nobody tells the model what a sentence is. Nobody labels subject and verb. Nobody says "this token refers to a person" or "this paragraph is about economics." The training objective is just: predict the next token.
And yet, after pre-training on enough text, the model has learned:
Grammar, because sentences that violate grammar have lower probability under any reasonable language model. The model learns grammar as a side effect of trying to predict text correctly.
Facts, because a model that knows Paris is the capital of France assigns higher probability to "Paris" after "The capital of France is" than a model that does not. Factual knowledge improves next-token prediction.
Reasoning patterns, because step-by-step reasoning leads to more predictable text than random conclusions. A model that can follow a chain of reasoning predicts text better than one that cannot.
Code structure, because syntactically valid code is far more predictable than random character sequences. Learning to predict code means learning to understand it.
None of this was programmed. All of it emerges because it helps with next-token prediction. That is the insight that makes pre-training so powerful and so strange at the same time.
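The common mechanism behind all four is the chain rule: the model scores a whole sequence as a product of per-position next-token probabilities, so anything that sharpens individual predictions (grammar, facts, reasoning steps) raises the probability of coherent text over incoherent text. A toy illustration with hypothetical per-token probabilities (these numbers are invented, not from a real model):

```python
import math

# hypothetical probabilities a model might assign, token by token
coherent   = [0.4, 0.6, 0.8, 0.7]    # grammatical, factual continuation
incoherent = [0.4, 0.6, 0.01, 0.02]  # same prefix, then tokens that break grammar

def sequence_log_prob(token_probs):
    """log P(t1..tn) = sum_i log P(t_i | t_<i)"""
    return sum(math.log(p) for p in token_probs)

print(f"coherent:   {sequence_log_prob(coherent):.2f}")
print(f"incoherent: {sequence_log_prob(incoherent):.2f}")
```

A model that has learned grammar and facts concentrates probability on the coherent continuation, which is exactly what lowers its cross-entropy loss.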
A model that was pre-trained poorly cannot be fine-tuned into a capable assistant. RLHF, covered in Article 8, can shape the model's behavior, but it cannot install knowledge or reasoning ability that was never learned during pre-training. The pre-trained base sets a ceiling on what fine-tuning can achieve. This is why teams at major labs spend enormous effort on data quality and training runs before any alignment work begins.
The emergent capability question
One thing that surprised researchers as models scaled was the appearance of capabilities that were not present in smaller models at all, then suddenly appeared above a certain scale threshold.
Models below roughly 10 billion parameters would fail completely on certain multi-step reasoning tasks. Models above 100 billion parameters would solve them. The transition happened quickly, over a narrow range of scale. Researchers called this emergence.
The honest interpretation is still debated. Some researchers argue the capabilities were always there in seed form and became measurable only when models got large enough to be accurate enough on the component steps. Others believe genuine phase transitions occur where qualitatively new abilities appear.
What is not debated is the practical implication: you cannot always predict what a model will be able to do by extrapolating from smaller scale experiments. Something becomes possible at scale that was impossible at smaller scale, even with the same architecture and training objective.
This is what Article 8 deals with directly. After pre-training, the model has all of this learned capability, including some capabilities nobody explicitly trained it to have. Fine-tuning and RLHF shape how the model uses those capabilities in response to human instructions.
The connection between Articles 6 and 8
The transformer from Article 6 is a function. Pre-training is how you set the parameters of that function so it has learned something useful about language.
After pre-training, the model can predict text. It can complete sentences, generate code, summarize passages. But it is not a conversational assistant. Ask it a question and it will continue the text as if generating more of a document, not answer the question directly. It might generate "What is the capital of France? The capital of France is..." and keep going for pages.
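The reason is visible in the generation loop itself: sample a token from the model's distribution at the last position, append it, repeat. Nothing in the loop distinguishes a question from any other prefix. A sketch with a stand-in model returning random logits (the real decoder from Article 6 would slot into the same place):

```python
import torch

vocab_size = 50_257

def fake_model(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the Article 6 transformer: logits for each position."""
    return torch.randn(tokens.shape[0], vocab_size)

def generate(prompt_tokens: torch.Tensor, max_new_tokens: int = 20, temperature: float = 1.0):
    tokens = prompt_tokens.clone()
    for _ in range(max_new_tokens):
        logits = fake_model(tokens)[-1] / temperature   # distribution at the last position
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token])        # append and continue
    return tokens

out = generate(torch.tensor([42, 7, 1000]))
print(out.shape)   # torch.Size([23]): the 3 prompt tokens plus 20 continuation tokens
```

Fine-tuning changes the distribution this loop samples from, not the loop itself: an instruction-tuned model simply assigns high probability to answer-shaped continuations.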
Article 8 covers how you take this pre-trained model and make it useful as a product. That requires supervised fine-tuning on demonstrations of good behavior, then reinforcement learning from human feedback to align the model with what users actually want. The entire alignment stack depends on pre-training having done its job well.
Next in the series
Article 8 covers fine-tuning and RLHF. You will see how supervised fine-tuning shapes the model's response style, what reward modeling means and how it captures human preferences, and why PPO-based alignment changed how language models behave in practice. Everything that makes ChatGPT, Claude, or Gemini feel like an assistant rather than a text autocomplete engine comes from this stage.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.