
Fine-tuning and RLHF: How a Pre-trained Model Becomes a Useful Assistant

Pre-training gives a model language. Fine-tuning and RLHF give it behavior. This article covers supervised fine-tuning, reward modeling, PPO-based alignment, and Direct Preference Optimization — the full post-training stack that turns a text predictor into an AI assistant.


Krunal Kanojiya


Article 7 ended with a pre-trained model that can predict text but cannot hold a conversation. Ask it "What is the capital of France?" and it might respond with "What is the capital of Germany? What is the capital of Italy?" — continuing the text as if it were generating a trivia quiz. That is not a bug. That is what next-token prediction produces.

Fine-tuning and RLHF are what turn that text predictor into something that answers your question, follows your instruction, declines requests it should decline, and stays coherent across a conversation.

This is Article 8 in the series on AI and ML fundamentals. Article 7 covered pre-training. Article 9 covers prompting, RAG, and in-context learning, where you learn how to use a fully trained model effectively in real products. Everything in this article explains why models behave the way they do when you prompt them.


The problem with a raw pre-trained model

A pre-trained model has learned language. It has compressed an enormous amount of world knowledge into its weights through next-token prediction. But its behavior reflects the statistical patterns of its training data, not what you want it to do.

If you ask it to help with a harmful request, it might comply because the training data contained such content. If you ask it to be concise, it keeps writing because the training data contained long-form text. If you ask it a question, it generates more questions because that is what follows questions in many training documents.

The training data was the internet. The internet is not a helpful assistant.

python
# a raw pre-trained model completing text
# (illustrative — not actual model output)

prompt = "What is the capital of France?"

# what a pre-trained model might produce
raw_completion = """
What is the capital of France?
What is the capital of Germany?
What is the capital of Spain?
What is the capital of Italy?
These are common geography quiz questions.
Answer: France - Paris, Germany - Berlin...
"""

# what we want after fine-tuning
aligned_response = "The capital of France is Paris."

Supervised fine-tuning is the first fix. RLHF is the second. Together they shape how the model uses its pre-trained knowledge.


Step 1: Supervised fine-tuning

Supervised fine-tuning (SFT) trains the model on a curated dataset of (prompt, ideal response) pairs. The loss function is identical to pre-training cross-entropy, but computed only on the response tokens. The model learns to match the style and format of the demonstration responses.

python
import torch
import torch.nn.functional as F

def compute_sft_loss(model, input_ids, response_start_idx):
    """
    SFT loss: cross-entropy only on response tokens, not the prompt.
    input_ids: full sequence (prompt + response)
    response_start_idx: where the response begins
    """
    logits, _ = model(input_ids)

    # shift for next-token prediction
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()

    # mask prompt tokens — only learn from response
    loss_mask = torch.zeros_like(shift_labels, dtype=torch.float)
    loss_mask[:, response_start_idx - 1:] = 1.0

    # cross-entropy with masking
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none"
    )
    loss = (loss * loss_mask.view(-1)).sum() / loss_mask.sum()
    return loss

SFT datasets contain thousands to hundreds of thousands of examples. Each example shows the model what a good response looks like for a given prompt. The model learns to mimic that style.

SFT alone gets you surprisingly far. A well-curated SFT dataset with 10,000 high-quality examples can produce a model that behaves helpfully on most everyday prompts. The limitation is that SFT can only teach the model to match what humans wrote as demonstrations. It cannot teach relative preferences: that one response is better than another for subtle reasons that are hard to demonstrate but easy to judge.
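
Concretely, each SFT example is a prompt and a demonstration response concatenated into one token sequence, with the boundary recorded so the loss mask knows where the response begins. A minimal sketch of that formatting step, using a toy whitespace "tokenizer" and an illustrative chat template (both are assumptions, not any specific model's format):

```python
def format_sft_example(tokenizer, prompt, response):
    """Concatenate prompt and response; return ids plus the response start index."""
    prompt_text = f"User: {prompt}\nAssistant: "
    prompt_ids = tokenizer(prompt_text)
    response_ids = tokenizer(response)
    input_ids = prompt_ids + response_ids
    # the loss mask starts here: prompt tokens are never trained on
    response_start_idx = len(prompt_ids)
    return input_ids, response_start_idx

# toy whitespace "tokenizer" just to show the shapes involved
toy_tokenizer = lambda text: text.split()

ids, start = format_sft_example(
    toy_tokenizer,
    "What is the capital of France?",
    "The capital of France is Paris.",
)
# ids[start:] are the response tokens the loss is computed on
```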


Step 2: Reward modeling

The reward model captures human preferences in a learnable form. Annotators compare pairs of model responses and label which one they prefer. The reward model learns to predict those preferences.

python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """
    Takes a (prompt, response) pair and outputs a scalar reward.
    Built on top of a pre-trained LLM backbone.
    """
    def __init__(self, base_model, hidden_dim):
        super().__init__()
        self.backbone = base_model
        # scalar head: maps last hidden state to a reward score
        self.reward_head = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, input_ids):
        # backbone frozen here for simplicity; in practice the backbone
        # is usually fine-tuned together with the reward head
        with torch.no_grad():
            hidden = self.backbone(input_ids)
        # use the last token's hidden state as the sequence representation
        last_hidden = hidden[:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward.squeeze(-1)

def reward_model_loss(reward_chosen, reward_rejected):
    """
    Preference loss: chosen response should score higher than rejected.
    This is the Bradley-Terry model for pairwise preferences.
    """
    return -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()

The reward model is trained on the preference dataset until it reliably predicts which response humans preferred. It becomes a proxy for human judgment — a function you can call millions of times without actually involving a human.
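
The training loop itself is an ordinary supervised loop over (chosen, rejected) pairs. A minimal runnable sketch of the Bradley-Terry objective, using one learnable scalar per response as a stand-in for the full backbone plus reward head (an illustrative simplification):

```python
import torch

def reward_model_loss(reward_chosen, reward_rejected):
    # Bradley-Terry pairwise loss: chosen should score higher than rejected
    return -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()

# toy setup: four responses, each with a directly learnable reward score
rewards = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.SGD([rewards], lr=1.0)

# indices 0, 1 were preferred over indices 2, 3 by annotators
chosen_idx, rejected_idx = torch.tensor([0, 1]), torch.tensor([2, 3])

for _ in range(100):
    optimizer.zero_grad()
    loss = reward_model_loss(rewards[chosen_idx], rewards[rejected_idx])
    loss.backward()
    optimizer.step()

# after training, every chosen response scores strictly higher than its
# rejected counterpart — the model has internalized the preference ordering
```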

The quality ceiling of your entire RLHF pipeline is set here. In classic RLHF, the policy optimization (PPO) cannot exceed the quality of the reward model. A reward model that captures superficial signals (longer is better, sounds confident) will produce a policy that games those signals. This failure mode is called reward hacking.


Step 3: PPO alignment

Proximal Policy Optimization (PPO) is the reinforcement learning algorithm that fine-tunes the language model to produce responses the reward model scores highly, while a KL divergence penalty prevents it from drifting too far from the SFT baseline.

python
import torch
import torch.nn.functional as F

def ppo_step(policy_model, ref_model, reward_model, prompts, kl_coeff=0.1):
    """
    One simplified RLHF update step.
    This is a REINFORCE-style sketch of the objective; full PPO adds
    importance-ratio clipping and a learned critic for advantage estimates.
    policy_model: the model being fine-tuned
    ref_model:    the SFT baseline (frozen), used for KL penalty
    reward_model: trained reward model (frozen during policy updates)
    kl_coeff:     controls how far the policy can drift from the reference
    """
    with torch.no_grad():
        # generate responses from the current policy
        responses = policy_model.generate(prompts, max_new_tokens=200)

        # score responses with the frozen reward model
        rewards = reward_model(responses)

        # reference log-probs for the KL penalty
        ref_logprobs = ref_model.get_logprobs(responses)

    # policy log-probs — the only term gradients flow through
    policy_logprobs = policy_model.get_logprobs(responses)

    # per-sequence KL estimate: how far the policy has drifted from SFT
    kl_penalty = policy_logprobs - ref_logprobs

    # KL-shaped reward, detached so it acts as a fixed advantage signal
    shaped_reward = (rewards - kl_coeff * kl_penalty).detach()

    # maximize reward, minimize KL drift: raise log-prob of good responses
    loss = -(shaped_reward * policy_logprobs).mean()

    return loss, rewards.mean().item(), kl_penalty.mean().item()

The KL penalty is not optional. Without it, PPO quickly discovers that the reward model can be fooled with responses that are grammatically nonsensical but happen to score well. The KL term anchors the model to its SFT weights, preventing reward hacking while still allowing meaningful behavior change.

A comprehensive 2024 study comparing DPO and PPO found that well-implemented PPO outperforms DPO across dialogue and code generation benchmarks, achieving state-of-the-art results on code competition tasks. The catch is implementation complexity. PPO for LLMs requires four models in memory simultaneously: the policy being trained, the frozen SFT reference, a critic model that estimates value, and the reward model. Memory requirements are substantial.


DPO: the simpler alternative

Direct Preference Optimization, published in 2023, made an observation that simplified the whole pipeline significantly. The optimal policy under the RLHF objective can be expressed directly in terms of the policy model's own probabilities. This means you do not need a separate reward model or a PPO training loop. The language model is secretly a reward model.

python
import torch
import torch.nn.functional as F

def dpo_loss(policy_model, ref_model, chosen_ids, rejected_ids, beta=0.1):
    """
    DPO loss: directly optimize preferences without a reward model.

    beta controls the KL constraint strength (higher = policy stays
    closer to the reference).
    chosen_ids:   token IDs of the preferred response
    rejected_ids: token IDs of the less preferred response
    """
    # log probabilities under current policy
    policy_chosen_logprobs  = policy_model.get_sequence_logprob(chosen_ids)
    policy_rejected_logprobs = policy_model.get_sequence_logprob(rejected_ids)

    # log probabilities under frozen SFT reference
    with torch.no_grad():
        ref_chosen_logprobs  = ref_model.get_sequence_logprob(chosen_ids)
        ref_rejected_logprobs = ref_model.get_sequence_logprob(rejected_ids)

    # log ratio: how much more the policy prefers chosen vs reference
    chosen_ratio  = policy_chosen_logprobs  - ref_chosen_logprobs
    rejected_ratio = policy_rejected_logprobs - ref_rejected_logprobs

    # DPO classification loss
    loss = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    return loss

DPO eliminates the reward model, the PPO training loop, and the critic model. You train directly on (prompt, chosen, rejected) triplets. The loss function pushes the model to assign higher probability to chosen responses relative to the SFT reference, and lower probability to rejected responses.

The original DPO paper showed it matched or exceeded PPO on sentiment control and summarization tasks while being substantially simpler to implement. By 2025, DPO and its variants have become the standard approach for open-source fine-tuning pipelines.

A practical warning from current research: DPO with a small beta value (weak KL constraint) on a small preference dataset can produce significant alignment tax, degrading benchmark performance. Always monitor your evaluation benchmarks during DPO training and tune beta to find the right balance between alignment quality and capability retention.
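
The effect of beta is visible numerically. For a fixed gap between the chosen and rejected log-ratios, a small beta leaves the loss far from saturation, so the optimizer keeps widening the policy's log-ratios away from the reference; that is exactly why small beta amounts to a weak KL constraint. A quick sketch with made-up log-ratio values:

```python
import torch
import torch.nn.functional as F

def dpo_loss_from_ratios(chosen_ratio, rejected_ratio, beta):
    # the same classification loss as above, on precomputed log ratios
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))

# made-up log-ratio gap of 1.0 in favor of the chosen response
chosen, rejected = torch.tensor(0.5), torch.tensor(-0.5)

loss_weak   = dpo_loss_from_ratios(chosen, rejected, beta=0.05)  # ~0.67
loss_strong = dpo_loss_from_ratios(chosen, rejected, beta=0.5)   # ~0.47

# with beta=0.05 the loss is still close to -log(0.5) ≈ 0.69: training will
# keep pushing the gap wider, drifting the policy away from the SFT reference
```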


LoRA: making fine-tuning feasible

Full fine-tuning updates every weight in the model. For a 7-billion-parameter model in 32-bit precision with Adam, the weights, gradients, and optimizer states alone take roughly 112 GB of GPU memory, before counting activations. Most teams cannot afford that.
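
The 112 GB figure is straightforward arithmetic: with Adam, each parameter needs a 4-byte weight, a 4-byte gradient, and two 4-byte optimizer moments:

```python
params = 7e9          # 7B parameters
bytes_per_param = 4   # 32-bit precision

weights    = params * bytes_per_param        # 28 GB
gradients  = params * bytes_per_param        # 28 GB
adam_state = params * bytes_per_param * 2    # 56 GB: first and second moments

total_gb = (weights + gradients + adam_state) / 1e9  # 112.0 GB
```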

LoRA (Low-Rank Adaptation) freezes the original model weights and adds small trainable low-rank matrices to specific layers. A typical LoRA configuration for a 7B model trains around 20 to 80 million parameters instead of 7 billion.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """
    Replaces a linear layer with a LoRA-adapted version.
    Original weights are frozen. Only A and B are trained.
    """
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        in_features  = original_layer.in_features
        out_features = original_layer.out_features

        # freeze the original weights (bias omitted for brevity)
        self.weight = original_layer.weight
        self.weight.requires_grad = False

        # trainable low-rank matrices
        self.lora_A = nn.Parameter(torch.randn(rank, in_features)  * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        self.rank   = rank
        self.scale  = alpha / rank

    def forward(self, x):
        # original frozen computation + low-rank update
        base_output = F.linear(x, self.weight)
        lora_output = (x @ self.lora_A.T) @ self.lora_B.T
        return base_output + self.scale * lora_output

def count_trainable_params(model):
    total     = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# illustrative: what LoRA does to parameter count
# total_params: 7,000,000,000 (7B model)
# trainable_params with LoRA rank=8: ~20,000,000 (0.3% of total)

LoRA is now the default for domain-specific fine-tuning. It works well with both SFT and DPO. After training, the LoRA weights can be merged back into the original model for deployment, adding no inference latency.
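
Merging is a one-time weight update: the scaled low-rank product is added into the frozen matrix, after which the adapter can be discarded. A sketch consistent with the LoRALinear layer above, on toy shapes:

```python
import torch

def merge_lora(weight, lora_A, lora_B, alpha, rank):
    """Fold a LoRA update into the base weight: W' = W + (alpha/rank) * B @ A."""
    return weight + (alpha / rank) * (lora_B @ lora_A)

# toy shapes: out_features=4, in_features=3, rank=2
W = torch.randn(4, 3)
A = torch.randn(2, 3) * 0.01
B = torch.randn(4, 2)

merged = merge_lora(W, A, B, alpha=16, rank=2)

# the merged matrix computes exactly what base + adapter computed,
# with no extra matmul at inference time
x = torch.randn(5, 3)
base_plus_adapter = x @ W.T + (16 / 2) * (x @ A.T) @ B.T
assert torch.allclose(x @ merged.T, base_plus_adapter, atol=1e-5)
```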


What alignment actually changes

After SFT and RLHF, the model's behavior changes in ways that are hard to measure with standard benchmarks but obvious in use.

It follows instruction format. The pre-trained model generates whatever continuation has the highest probability. The aligned model generates a response to your prompt specifically.

It declines harmful requests. The reward model was trained on data where refusals received high scores for certain categories of prompts. The model learned this pattern.

It calibrates uncertainty. Aligned models are more likely to say they do not know something rather than generating a confident-sounding hallucination, because annotators consistently preferred honest uncertainty over confident errors.

It stays on topic. RLHF training on conversational data shapes the model to maintain dialogue context rather than drifting into unrelated continuations.

None of this capability exists in the pre-trained model. It is installed by post-training.


The connection to Article 9

Article 9 covers prompting, retrieval-augmented generation, and in-context learning. All of those techniques assume you are working with an aligned model. Prompt engineering works precisely because the aligned model follows instructions reliably. RAG works because the model can incorporate retrieved context and answer questions from it. Few-shot prompting works because the model generalizes from examples in the context.

If you tried these techniques on a raw pre-trained model, they would fail or produce unreliable results. The SFT and RLHF training is what makes prompting meaningful.

The practical decision you will face

For most product teams, the choice is between fine-tuning with LoRA plus DPO on your own data, or using a strong API model that was already aligned. Fine-tuning gives you control over behavior and often better performance on your specific domain. It requires data, compute, and evaluation infrastructure. The API model is faster to ship but you cannot control its post-training. Knowing what happened in Articles 7 and 8 helps you make that call with actual understanding of the tradeoffs.


Next in the series

Article 9 covers prompting, RAG, and in-context learning. These are the techniques you use every day when building on top of LLMs. Understanding why they work requires everything from Articles 1 through 8. The geometry of embedding space from Article 4, the attention mechanism from Article 6, and the aligned behavior from this article all come together in how you write a system prompt and structure a retrieval pipeline.
