
Fine-tuning and RLHF: How a Pre-trained Model Becomes a Useful Assistant

Pre-training gives a model language. Fine-tuning and RLHF give it behavior. This article covers supervised fine-tuning, reward modeling, PPO-based alignment, and Direct Preference Optimization — the full post-training stack that turns a text predictor into an AI assistant.


Krunal Kanojiya


Article 7 ended with a pre-trained model that can predict text but cannot hold a conversation. Ask it "What is the capital of France?" and it might respond with "What is the capital of Germany? What is the capital of Italy?" — continuing the text as if it were generating a trivia quiz. That is not a bug. That is what next-token prediction produces.

Fine-tuning and RLHF are what turn that text predictor into something that answers your question, follows your instruction, declines requests it should decline, and stays coherent across a conversation.

This is Article 8 in the series on AI and ML fundamentals. Article 7 covered pre-training. Article 9 covers prompting, RAG, and in-context learning, where you learn how to use a fully trained model effectively in real products. Everything in this article explains why models behave the way they do when you prompt them.


The problem with a raw pre-trained model

A pre-trained model has learned language. It has compressed an enormous amount of world knowledge into its weights through next-token prediction. But its behavior reflects the statistical patterns of its training data, not what you want it to do.

If you ask it to help with a harmful request, it might comply because the training data contained such content. If you ask it to be concise, it keeps writing because the training data contained long-form text. If you ask it a question, it generates more questions because that is what follows questions in many training documents.

The training data was the internet. The internet is not a helpful assistant.

python
# a raw pre-trained model completing text
# (illustrative — not actual model output)

prompt = "What is the capital of France?"

# what a pre-trained model might produce
raw_completion = """
What is the capital of France?
What is the capital of Germany?
What is the capital of Spain?
What is the capital of Italy?
These are common geography quiz questions.
Answer: France - Paris, Germany - Berlin...
"""

# what we want after fine-tuning
aligned_response = "The capital of France is Paris."

Supervised fine-tuning is the first fix. RLHF is the second. Together they shape how the model uses its pre-trained knowledge.


Step 1: Supervised fine-tuning

Supervised fine-tuning (SFT) trains the model on a curated dataset of (prompt, ideal response) pairs. The loss function is identical to pre-training cross-entropy, but computed only on the response tokens. The model learns to match the style and format of the demonstration responses.

python
import torch
import torch.nn.functional as F

def compute_sft_loss(model, input_ids, response_start_idx):
    """
    SFT loss: cross-entropy only on response tokens, not the prompt.
    input_ids: full sequence (prompt + response)
    response_start_idx: where the response begins
    """
    logits, _ = model(input_ids)

    # shift for next-token prediction
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()

    # mask prompt tokens — only learn from response
    loss_mask = torch.zeros_like(shift_labels, dtype=torch.float)
    loss_mask[:, response_start_idx - 1:] = 1.0

    # cross-entropy with masking
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none"
    )
    loss = (loss * loss_mask.view(-1)).sum() / loss_mask.sum()
    return loss

SFT datasets contain thousands to hundreds of thousands of examples. Each example shows the model what a good response looks like for a given prompt. The model learns to mimic that style.

SFT alone gets you surprisingly far. A well-curated SFT dataset with 10,000 high-quality examples can produce a model that behaves helpfully on most everyday prompts. The limitation is that SFT can only teach the model to match what humans wrote as demonstrations. It cannot teach relative preferences: that one response is better than another for subtle reasons that are hard to demonstrate but easy to judge.
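
Concretely, each SFT example is a prompt and a demonstration response concatenated into one token sequence, with the boundary recorded so the loss mask knows where the response begins. A minimal sketch of that formatting step, using a toy whitespace "tokenizer" and an illustrative chat template (both are assumptions, not any specific model's format):

```python
def format_sft_example(tokenizer, prompt, response):
    """Concatenate prompt and response; return ids plus the response start index."""
    prompt_text = f"User: {prompt}\nAssistant: "
    prompt_ids = tokenizer(prompt_text)
    response_ids = tokenizer(response)
    input_ids = prompt_ids + response_ids
    # the loss mask starts here: prompt tokens are never trained on
    response_start_idx = len(prompt_ids)
    return input_ids, response_start_idx

# toy whitespace "tokenizer" just to show the shapes involved
toy_tokenizer = lambda text: text.split()

ids, start = format_sft_example(
    toy_tokenizer,
    "What is the capital of France?",
    "The capital of France is Paris.",
)
# ids[start:] are the response tokens the loss is computed on
```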


Step 2: Reward modeling

The reward model captures human preferences in a learnable form. Annotators compare pairs of model responses and label which one they prefer. The reward model learns to predict those preferences.

python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """
    Takes a (prompt, response) pair and outputs a scalar reward.
    Built on top of a pre-trained LLM backbone.
    """
    def __init__(self, base_model, hidden_dim):
        super().__init__()
        self.backbone = base_model
        # scalar head: maps last hidden state to a reward score
        self.reward_head = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, input_ids):
        # backbone frozen here for simplicity; in practice the backbone
        # is usually fine-tuned together with the reward head
        with torch.no_grad():
            hidden = self.backbone(input_ids)
        # use the last token's hidden state as the sequence representation
        last_hidden = hidden[:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward.squeeze(-1)

def reward_model_loss(reward_chosen, reward_rejected):
    """
    Preference loss: chosen response should score higher than rejected.
    This is the Bradley-Terry model for pairwise preferences.
    """
    return -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()

The reward model is trained on the preference dataset until it reliably predicts which response humans preferred. It becomes a proxy for human judgment — a function you can call millions of times without actually involving a human.
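
The training loop itself is an ordinary supervised loop over (chosen, rejected) pairs. A minimal runnable sketch of the Bradley-Terry objective, using one learnable scalar per response as a stand-in for the full backbone plus reward head (an illustrative simplification):

```python
import torch

def reward_model_loss(reward_chosen, reward_rejected):
    # Bradley-Terry pairwise loss: chosen should score higher than rejected
    return -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()

# toy setup: four responses, each with a directly learnable reward score
rewards = torch.zeros(4, requires_grad=True)
optimizer = torch.optim.SGD([rewards], lr=1.0)

# indices 0, 1 were preferred over indices 2, 3 by annotators
chosen_idx, rejected_idx = torch.tensor([0, 1]), torch.tensor([2, 3])

for _ in range(100):
    optimizer.zero_grad()
    loss = reward_model_loss(rewards[chosen_idx], rewards[rejected_idx])
    loss.backward()
    optimizer.step()

# after training, every chosen response scores strictly higher than its
# rejected counterpart — the model has internalized the preference ordering
```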

The quality ceiling of your entire RLHF pipeline is set here. In classic RLHF, the policy optimization (PPO) cannot exceed the quality of the reward model. A reward model that captures superficial signals (longer is better, sounds confident) will produce a policy that games those signals. This failure mode is called reward hacking.


Step 3: PPO alignment

Proximal Policy Optimization (PPO) is the reinforcement learning algorithm that fine-tunes the language model to produce responses the reward model scores highly, while a KL divergence penalty prevents it from drifting too far from the SFT baseline.

python
import torch
import torch.nn.functional as F

def ppo_step(policy_model, ref_model, reward_model, prompts, kl_coeff=0.1):
    """
    One simplified RLHF update step.
    This is a REINFORCE-style sketch of the objective; full PPO adds
    importance-ratio clipping and a learned critic for advantage estimates.
    policy_model: the model being fine-tuned
    ref_model:    the SFT baseline (frozen), used for KL penalty
    reward_model: trained reward model (frozen during policy updates)
    kl_coeff:     controls how far the policy can drift from the reference
    """
    with torch.no_grad():
        # generate responses from the current policy
        responses = policy_model.generate(prompts, max_new_tokens=200)

        # score responses with the frozen reward model
        rewards = reward_model(responses)

        # reference log-probs for the KL penalty
        ref_logprobs = ref_model.get_logprobs(responses)

    # policy log-probs — the only term gradients flow through
    policy_logprobs = policy_model.get_logprobs(responses)

    # per-sequence KL estimate: how far the policy has drifted from SFT
    kl_penalty = policy_logprobs - ref_logprobs

    # KL-shaped reward, detached so it acts as a fixed advantage signal
    shaped_reward = (rewards - kl_coeff * kl_penalty).detach()

    # maximize reward, minimize KL drift: raise log-prob of good responses
    loss = -(shaped_reward * policy_logprobs).mean()

    return loss, rewards.mean().item(), kl_penalty.mean().item()

The KL penalty is not optional. Without it, PPO quickly discovers that the reward model can be fooled with responses that are grammatically nonsensical but happen to score well. The KL term anchors the model to its SFT weights, preventing reward hacking while still allowing meaningful behavior change.

A comprehensive 2024 study comparing DPO and PPO found that well-implemented PPO outperforms DPO across dialogue and code generation benchmarks, achieving state-of-the-art results on code competition tasks. The catch is implementation complexity. PPO for LLMs requires four models in memory simultaneously: the policy being trained, the frozen SFT reference, a critic model that estimates value, and the reward model. Memory requirements are substantial.


DPO: the simpler alternative

Direct Preference Optimization, published in 2023, made an observation that simplified the whole pipeline significantly. The optimal policy under the RLHF objective can be expressed directly in terms of the policy model's own probabilities. This means you do not need a separate reward model or a PPO training loop. The language model is secretly a reward model.

python
import torch
import torch.nn.functional as F

def dpo_loss(policy_model, ref_model, chosen_ids, rejected_ids, beta=0.1):
    """
    DPO loss: directly optimize preferences without a reward model.

    beta controls the KL constraint strength (higher = policy stays
    closer to the reference).
    chosen_ids:   token IDs of the preferred response
    rejected_ids: token IDs of the less preferred response
    """
    # log probabilities under current policy
    policy_chosen_logprobs  = policy_model.get_sequence_logprob(chosen_ids)
    policy_rejected_logprobs = policy_model.get_sequence_logprob(rejected_ids)

    # log probabilities under frozen SFT reference
    with torch.no_grad():
        ref_chosen_logprobs  = ref_model.get_sequence_logprob(chosen_ids)
        ref_rejected_logprobs = ref_model.get_sequence_logprob(rejected_ids)

    # log ratio: how much more the policy prefers chosen vs reference
    chosen_ratio  = policy_chosen_logprobs  - ref_chosen_logprobs
    rejected_ratio = policy_rejected_logprobs - ref_rejected_logprobs

    # DPO classification loss
    loss = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    return loss

DPO eliminates the reward model, the PPO training loop, and the critic model. You train directly on (prompt, chosen, rejected) triplets. The loss function pushes the model to assign higher probability to chosen responses relative to the SFT reference, and lower probability to rejected responses.

The original DPO paper showed it matched or exceeded PPO on sentiment control and summarization tasks while being substantially simpler to implement. By 2025, DPO and its variants have become the standard approach for open-source fine-tuning pipelines.

A practical warning from current research: DPO with a small beta value (weak KL constraint) on a small preference dataset can produce significant alignment tax, degrading benchmark performance. Always monitor your evaluation benchmarks during DPO training and tune beta to find the right balance between alignment quality and capability retention.
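
The effect of beta is visible numerically. For a fixed gap between the chosen and rejected log-ratios, a small beta leaves the loss far from saturation, so the optimizer keeps widening the policy's log-ratios away from the reference; that is exactly why small beta amounts to a weak KL constraint. A quick sketch with made-up log-ratio values:

```python
import torch
import torch.nn.functional as F

def dpo_loss_from_ratios(chosen_ratio, rejected_ratio, beta):
    # the same classification loss as above, on precomputed log ratios
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))

# made-up log-ratio gap of 1.0 in favor of the chosen response
chosen, rejected = torch.tensor(0.5), torch.tensor(-0.5)

loss_weak   = dpo_loss_from_ratios(chosen, rejected, beta=0.05)  # ~0.67
loss_strong = dpo_loss_from_ratios(chosen, rejected, beta=0.5)   # ~0.47

# with beta=0.05 the loss is still close to -log(0.5) ≈ 0.69: training will
# keep pushing the gap wider, drifting the policy away from the SFT reference
```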


LoRA: making fine-tuning feasible

Full fine-tuning updates every weight in the model. For a 7-billion-parameter model in 32-bit precision with Adam, the weights, gradients, and optimizer states alone take roughly 112 GB of GPU memory, before counting activations. Most teams cannot afford that.
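
The 112 GB figure is straightforward arithmetic: with Adam, each parameter needs a 4-byte weight, a 4-byte gradient, and two 4-byte optimizer moments:

```python
params = 7e9          # 7B parameters
bytes_per_param = 4   # 32-bit precision

weights    = params * bytes_per_param        # 28 GB
gradients  = params * bytes_per_param        # 28 GB
adam_state = params * bytes_per_param * 2    # 56 GB: first and second moments

total_gb = (weights + gradients + adam_state) / 1e9  # 112.0 GB
```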

LoRA (Low-Rank Adaptation) freezes the original model weights and adds small trainable low-rank matrices to specific layers. A typical LoRA configuration for a 7B model trains around 20 to 80 million parameters instead of 7 billion.

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """
    Replaces a linear layer with a LoRA-adapted version.
    Original weights are frozen. Only A and B are trained.
    """
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        in_features  = original_layer.in_features
        out_features = original_layer.out_features

        # freeze the original weights (bias omitted for brevity)
        self.weight = original_layer.weight
        self.weight.requires_grad = False

        # trainable low-rank matrices
        self.lora_A = nn.Parameter(torch.randn(rank, in_features)  * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        self.rank   = rank
        self.scale  = alpha / rank

    def forward(self, x):
        # original frozen computation + low-rank update
        base_output = F.linear(x, self.weight)
        lora_output = (x @ self.lora_A.T) @ self.lora_B.T
        return base_output + self.scale * lora_output

def count_trainable_params(model):
    total     = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# illustrative: what LoRA does to parameter count
# total_params: 7,000,000,000 (7B model)
# trainable_params with LoRA rank=8: ~20,000,000 (0.3% of total)

LoRA is now the default for domain-specific fine-tuning. It works well with both SFT and DPO. After training, the LoRA weights can be merged back into the original model for deployment, adding no inference latency.
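
Merging is a one-time weight update: the scaled low-rank product is added into the frozen matrix, after which the adapter can be discarded. A sketch consistent with the LoRALinear layer above, on toy shapes:

```python
import torch

def merge_lora(weight, lora_A, lora_B, alpha, rank):
    """Fold a LoRA update into the base weight: W' = W + (alpha/rank) * B @ A."""
    return weight + (alpha / rank) * (lora_B @ lora_A)

# toy shapes: out_features=4, in_features=3, rank=2
W = torch.randn(4, 3)
A = torch.randn(2, 3) * 0.01
B = torch.randn(4, 2)

merged = merge_lora(W, A, B, alpha=16, rank=2)

# the merged matrix computes exactly what base + adapter computed,
# with no extra matmul at inference time
x = torch.randn(5, 3)
base_plus_adapter = x @ W.T + (16 / 2) * (x @ A.T) @ B.T
assert torch.allclose(x @ merged.T, base_plus_adapter, atol=1e-5)
```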


What alignment actually changes

After SFT and RLHF, the model's behavior changes in ways that are hard to measure with standard benchmarks but obvious in use.

It follows instruction format. The pre-trained model generates whatever continuation has the highest probability. The aligned model generates a response to your prompt specifically.

It declines harmful requests. The reward model was trained on data where refusals received high scores for certain categories of prompts. The model learned this pattern.

It calibrates uncertainty. Aligned models are more likely to say they do not know something rather than generating a confident-sounding hallucination, because annotators consistently preferred honest uncertainty over confident errors.

It stays on topic. RLHF training on conversational data shapes the model to maintain dialogue context rather than drifting into unrelated continuations.

None of this capability exists in the pre-trained model. It is installed by post-training.


The connection to Article 9

Article 9 covers prompting, retrieval-augmented generation, and in-context learning. All of those techniques assume you are working with an aligned model. Prompt engineering works precisely because the aligned model follows instructions reliably. RAG works because the model can incorporate retrieved context and answer questions from it. Few-shot prompting works because the model generalizes from examples in the context.

If you tried these techniques on a raw pre-trained model, they would fail or produce unreliable results. The SFT and RLHF training is what makes prompting meaningful.

The practical decision you will face

For most product teams, the choice is between fine-tuning with LoRA plus DPO on your own data, or using a strong API model that was already aligned. Fine-tuning gives you control over behavior and often better performance on your specific domain. It requires data, compute, and evaluation infrastructure. The API model is faster to ship but you cannot control its post-training. Knowing what happened in Articles 7 and 8 helps you make that call with actual understanding of the tradeoffs.


Next in the series

Article 9 covers prompting, RAG, and in-context learning. These are the techniques you use every day when building on top of LLMs. Understanding why they work requires everything from Articles 1 through 8. The geometry of embedding space from Article 4, the attention mechanism from Article 6, and the aligned behavior from this article all come together in how you write a system prompt and structure a retrieval pipeline.
