RAG Evaluation as an Engineering Discipline: Build the Pipeline From Zero
57% of organizations have RAG agents in production. 32% cite quality as the top barrier. Systematic evaluation reduces post-deployment failures by 50 to 70%, but most teams still treat it as a one-time check. This is the practitioner's guide: what metrics matter, how to build a CI/CD quality gate, and how to wire production failures back into your test suite without buying another SaaS tool.
Most RAG systems do not fail on launch day. They fail quietly over the following six weeks as documents update, query patterns shift, and the retrieval layer starts returning chunks that were accurate when the index was built and are wrong now. Nobody notices because there is no metric moving on a dashboard. There is just a slow erosion in answer quality that surfaces first in user complaints and second in an engineering post-mortem.
According to LangChain's 2026 State of AI Agents report, 57% of organizations now have agents in production, with quality cited as the top barrier to deployment by 32% of respondents. That quality barrier is not primarily a model problem. It is an evaluation infrastructure problem. Teams ship RAG systems without measurement and then cannot distinguish retrieval failures from generation failures when things go wrong.
Enterprise implementations that build systematic evaluation from the start reduce post-deployment issues by 50 to 70%. But 70% of RAG systems in production still lack systematic evaluation frameworks, which makes quality regressions invisible until a user flags them.
This article builds the evaluation system from zero. No SaaS platform required.
Why RAG Evaluation Is a Different Engineering Problem
Traditional machine learning models have one failure mode: the model makes a wrong prediction. You measure prediction quality against labeled ground truth. The process is well understood and well tooled.
RAG pipelines have two distinct failure modes that require separate measurement systems, and standard ML metrics miss both of them.
Retrieval failure. The right document exists in your knowledge base, but the retrieval step does not surface it. Or it surfaces the wrong document, and the model generates an answer grounded in irrelevant context. The answer sounds confident and is factually wrong.
Generation failure. Retrieval worked correctly. The right chunks are in the context window. The model ignores them, misrepresents them, or supplements them with training data that contradicts what the retrieved documents actually say.
Classic metrics cannot distinguish these two failure modes. BLEU, ROUGE, and BERTScore measure surface text similarity between a generated answer and a reference string. A RAG system that retrieves the wrong document and generates a plausible wrong answer can still score well on ROUGE if the output text happens to overlap with the reference. These metrics were designed for translation and summarization, not for retrieval pipelines.
Poorly evaluated RAG systems hallucinate in up to 40% of responses even when the correct source documents were retrieved, according to the Stanford AI Lab. The documents are there. The model is not using them. ROUGE does not catch this.
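To make the failure concrete, here is a toy unigram-overlap score, a deliberately simplified stand-in for ROUGE-1 recall (not the real ROUGE implementation), showing how a factually wrong answer can score near-perfect on surface overlap:

```python
# Toy unigram-overlap score -- a simplified stand-in for ROUGE-1 recall,
# used here only to illustrate why surface overlap misses factual errors.
def unigram_overlap(candidate: str, reference: str) -> float:
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    return len(cand & ref) / len(ref)

reference = "Enterprise customers receive full refunds within 60 days of purchase."
# Factually wrong answer (30 days, not 60) with near-identical wording:
wrong = "Enterprise customers receive full refunds within 30 days of purchase."

score = unigram_overlap(wrong, reference)
print(f"overlap: {score:.2f}")  # 0.90 -- despite the wrong refund window
```

The answer gets the single fact wrong that a user would care about, yet nine of ten reference tokens overlap. This is exactly the blind spot that faithfulness-style, context-grounded evaluation closes.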
47% of enterprise AI users made at least one major business decision based on hallucinated content in 2024, according to the Suprmind AI Hallucination Report. Every one of those decisions traces back to a RAG pipeline that nobody properly benchmarked.
The Four Metrics That Actually Matter
The evaluation framework for a RAG pipeline needs exactly four metrics. Each one measures a different layer of the pipeline independently, so when a score drops you know immediately which component broke.
| Metric | What It Measures | Failure Mode It Catches | Production Threshold |
|---|---|---|---|
| Faithfulness | Every answer claim is supported by retrieved context | Model hallucinating beyond its context | Above 0.9 |
| Answer Relevancy | Answer addresses the actual question asked | Adjacent but off-topic answers | Above 0.85 |
| Context Precision | Relevant chunks rank high in the retrieved set | Retrieval is noisy, wrong chunks surface | Above 0.8 |
| Context Recall | All relevant chunks in the KB were retrieved | Retrieval misses documents it should find | Above 0.8 |
Source: MarsDevs production RAG targets (2026); Premai.io RAG evaluation guide
Faithfulness
Faithfulness is the hallucination metric for RAG. RAGAS computes it by extracting all factual claims from the generated answer using an LLM, then verifying each claim against the retrieved context using the same LLM as a judge. The score is the fraction of claims that are supported. A score of 0.6 means 40% of what the model said was not grounded in its context window.
A faithfulness score below 0.9 means your system prompt is not constraining the model to its retrieved context tightly enough, or retrieval is so poor that the model fills gaps from training data. Both are fixable, but they require different fixes.
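The scoring step itself reduces to a fraction. In this sketch the claim extraction and the supported/unsupported verdicts are hypothetical inputs; in RAGAS both are produced by an LLM judge:

```python
# Faithfulness is the fraction of answer claims that the judge marks as
# supported by the retrieved context. The verdicts below are hypothetical --
# in RAGAS an LLM extracts the claims and verifies each one.
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Five claims extracted from an answer; the judge supported three of them:
verdicts = [True, True, True, False, False]
print(faithfulness_score(verdicts))  # 0.6 -> 40% of the answer was ungrounded
```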
Answer Relevancy
Answer relevancy measures whether the answer actually helps the user rather than being technically accurate but adjacent to what they asked. RAGAS measures this by generating multiple question paraphrases from the answer text and computing similarity between those paraphrases and the original question. If the answer contains the right information, the generated questions should resemble the original closely.
Low answer relevancy combined with high faithfulness means retrieval is surfacing real content from the knowledge base that is not quite relevant to the specific query. This is usually a chunking or hybrid search problem, not a generation problem.
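The mechanics of the paraphrase-similarity check can be sketched with plain cosine similarity. The vectors below are toy stand-ins for real embedding-model output, and the paraphrase generation step (done by an LLM in RAGAS) is assumed to have already happened:

```python
import math

# Answer relevancy, RAGAS-style: generate question paraphrases from the
# answer, embed them, and average cosine similarity against the original
# question embedding. Vectors here are toy stand-ins for real embeddings.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def answer_relevancy(question_vec, paraphrase_vecs) -> float:
    sims = [cosine(question_vec, p) for p in paraphrase_vecs]
    return sum(sims) / len(sims)

question_vec = [1.0, 0.0, 0.0]
# Paraphrases regenerated from an on-topic answer should sit near the question:
paraphrases = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]
print(round(answer_relevancy(question_vec, paraphrases), 3))  # close to 1.0
```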
Context Precision
This is a retrieval layer metric, not a generation metric. Low context precision means the wrong chunks are reaching the model. The fix is better chunking strategy, improved embedding model fit for the domain, or adding a reranker after retrieval.
Context Recall
Context recall measures whether the retrieval step found all the relevant chunks that existed in the knowledge base. High precision with low recall means the chunks that were retrieved were all relevant, but the system missed other chunks that would have produced a more complete answer.
Together, context precision and context recall give a complete diagnostic picture of the retrieval layer. Low precision points to noise in retrieval. Low recall points to gaps in coverage.
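Both retrieval metrics can be sketched in a few lines. The rank-weighted precision below follows the RAGAS-style weighting (precision@k averaged at each relevant rank); the recall is simplified to exact chunk matching, whereas RAGAS attributes ground-truth statements to context with an LLM:

```python
# rel is 1 if the retrieved chunk at that rank is relevant, else 0.
def context_precision(rel: list[int]) -> float:
    # Mean of precision@k evaluated at each relevant rank, so relevant
    # chunks buried low in the list drag the score down.
    hits, total = 0, 0.0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0

def context_recall(retrieved: list[str], ground_truth: list[str]) -> float:
    # Simplified: fraction of ground-truth chunks found among the retrieved.
    found = sum(1 for g in ground_truth if g in retrieved)
    return found / len(ground_truth)

# Relevant chunks at ranks 1 and 3 out of four retrieved:
print(round(context_precision([1, 0, 1, 0]), 3))  # 0.833
print(context_recall(["a", "b"], ["a", "c"]))     # 0.5
```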
The Tools Available Without Buying a Platform
Three open-source tools cover the full evaluation lifecycle without requiring a paid platform.
| Tool | Primary Role | Strength | Limitation | License |
|---|---|---|---|---|
| RAGAS | Metric computation and synthetic data generation | Reference-free evaluation, LLM-as-judge | No CI/CD pass/fail gates built in | Apache 2.0 |
| DeepEval | CI/CD quality gates | Native pytest integration, hard thresholds | Requires LLM API for judge calls | Apache 2.0 |
| Arize Phoenix | Production observability and trace capture | Self-hostable, UMAP visualization, zero feature gates | No paid support on free tier | Apache 2.0 |
| Langfuse | Production tracing and session logging | Clean UI, easy self-host, wide framework support | Evaluation metrics require integration | MIT |
For most production teams: use RAGAS for metric exploration and synthetic dataset generation, DeepEval for CI/CD quality gates, and Arize Phoenix or Langfuse for production monitoring. Each tool does one job well. Combining them gives you the full evaluation pipeline without paying for a unified SaaS layer on top.
Step 1: Build the Golden Dataset
The golden dataset is the foundation of the entire evaluation pipeline. Everything else runs against it. Without it, there are no baselines, no regression tests, and no way to measure whether a pipeline change improved or degraded quality.
A production golden dataset contains between 100 and 300 highly diverse, mutually exclusive question-answer pairs. This size provides statistical significance for metric calculations without excessive computational overhead during CI/CD runs. Below 50 questions, the metrics are too noisy to trust at the individual score level.
What Each Record Contains
- Question: A real user query sampled from production logs or constructed to cover domain edge cases
- Expected answer: The correct answer, verified by a domain expert
- Ground truth chunks: The specific document chunks that contain the correct answer
- Category: `factual`, `procedural`, `comparative`, or `edge_case`
- Difficulty tier: `easy`, `medium`, or `hard`
- Source documents: The document names the chunks come from, for traceability
How to Source Questions
Do not generate questions synthetically from the start. Teams that skip human review often miss systemic issues in compliance-heavy use cases such as finance or healthcare. The sequence that works in practice is:
- Export 500 real queries from your production query logs or customer support tickets
- Deduplicate and cluster them by intent using an embedding similarity pass
- Select the 100 to 300 most representative queries across clusters, covering both common cases and known edge cases
- Have a domain expert write and verify the expected answer for each selected query
- Identify the exact document chunks that support each expected answer
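The deduplication step in that sequence can be sketched as a greedy similarity filter. The threshold and the toy vectors below are illustrative; in practice the vectors come from your embedding model:

```python
import math

# Greedy near-duplicate filter over query embeddings: keep a query only if
# its cosine similarity to every already-kept query is below the threshold.
# Vectors here are toy stand-ins for real embedding-model output.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedupe_queries(queries, embeddings, threshold=0.95):
    kept, kept_vecs = [], []
    for q, v in zip(queries, embeddings):
        if all(cosine(v, kv) < threshold for kv in kept_vecs):
            kept.append(q)
            kept_vecs.append(v)
    return kept

queries = ["refund policy?", "how do refunds work?", "migrate v1 to v2"]
vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(dedupe_queries(queries, vecs))  # near-duplicate second query dropped
```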
RAGAS can generate synthetic questions from your document corpus to cover gaps where real query data does not exist yet, using its TestsetGenerator. Use synthetic data to fill coverage gaps, not as the primary dataset source.
# golden_dataset_builder.py
from dataclasses import dataclass
from typing import Literal
@dataclass
class GoldenRecord:
question: str
expected_answer: str
ground_truth_chunks: list[str]
category: Literal["factual", "procedural", "comparative", "edge_case"]
difficulty: Literal["easy", "medium", "hard"]
source_documents: list[str]
last_verified: str
# Version your golden dataset as Python so it lives in version control
# alongside your pipeline code and changes are tracked in git history
GOLDEN_DATASET: list[GoldenRecord] = [
GoldenRecord(
question="What is the refund window for enterprise customers?",
expected_answer="Enterprise customers receive full refunds within 60 days of purchase.",
ground_truth_chunks=[
"Enterprise plan customers are eligible for a full refund within 60 days "
"of the original purchase date, no questions asked."
],
category="factual",
difficulty="easy",
source_documents=["refund_policy_v3.pdf"],
last_verified="2026-05-01"
),
GoldenRecord(
question="How do I migrate from API v1 to v2 without downtime?",
expected_answer=(
"Install the compatibility shim available in v1.9, run both versions in "
"parallel in staging, then cut over when v2 error rate drops below 0.1%."
),
ground_truth_chunks=[
"The v1 to v2 migration guide recommends installing the compatibility shim "
"in version 1.9. Run both API versions simultaneously in staging. When the "
"v2 error rate falls below 0.1% over a 24-hour window, proceed with production cutover."
],
category="procedural",
difficulty="hard",
source_documents=["api_migration_guide_v2.pdf"],
last_verified="2026-05-01"
),
# Add 98 to 298 more records covering your domain's full query distribution
]

Version your golden dataset explicitly. Many teams waste days tracking mysterious regressions that trace back to untracked changes in their evaluation inputs. Store the golden dataset as a Python file in the same repository as your pipeline code. Every change to a question, expected answer, or source document chunk is a git commit with a message explaining why the record changed. This makes it trivial to correlate metric changes with dataset changes.
Step 2: Compute Baseline Metrics With RAGAS
Before wiring evaluation into CI/CD, run RAGAS against the full golden dataset on your current pipeline to establish baseline scores. These baselines become the thresholds for the quality gate in the next step.
RAGAS was created by researchers Shahul Es and Jithin James, published in September 2023, and presented at EACL 2024. Backed by Y Combinator Winter 2024, it processes over 5 million evaluations monthly for companies including AWS, Microsoft, Databricks, and Moody's. It is the standard reference-free evaluation framework for RAG.
# evaluate_baseline.py
# Run this once before wiring into CI/CD.
# The output becomes your quality gate thresholds.
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
from datasets import Dataset
from golden_dataset import GOLDEN_DATASET
from your_rag_pipeline import run_rag # your RAG function
def build_ragas_dataset(golden_records, rag_fn) -> Dataset:
rows = []
for record in golden_records:
# Run the question through your current RAG pipeline
result = rag_fn(record.question)
rows.append({
"question": record.question,
"answer": result["answer"], # what your pipeline generated
"contexts": result["retrieved_chunks"], # what was actually retrieved
"ground_truth": record.expected_answer,
})
return Dataset.from_list(rows)
if __name__ == "__main__":
dataset = build_ragas_dataset(GOLDEN_DATASET, run_rag)
results = evaluate(
dataset=dataset,
metrics=[
faithfulness,
answer_relevancy,
context_precision,
context_recall,
],
)
print(results)
# Save these as your baseline thresholds:
# faithfulness: target >= 0.9
# answer_relevancy: target >= 0.85
# context_precision: target >= 0.8
# context_recall: target >= 0.8

When the baseline run completes, record every score in a baseline_metrics.json file committed to the repository. These numbers are the floor. Any future pipeline change that drops a metric below the baseline fails the quality gate.
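Persisting the baseline can be as simple as dumping the scores to JSON. The scores dict below is a plain mapping with illustrative values; adapt the extraction to however your evaluation run exposes its results:

```python
import json

# Persist baseline scores so the CI/CD gate and the weekly drift check
# can load them later. Values shown are illustrative.
def save_baseline(scores: dict[str, float], path: str = "baseline_metrics.json") -> None:
    with open(path, "w") as f:
        json.dump({k: round(v, 4) for k, v in scores.items()}, f, indent=2)

save_baseline({
    "faithfulness": 0.91,
    "answer_relevancy": 0.87,
    "context_precision": 0.84,
    "context_recall": 0.81,
})
```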
The per-metric diagnostic table below maps low scores to the pipeline component that needs fixing, which is more useful than the score alone:
| Failing Metric | First Place to Look | Likely Root Cause |
|---|---|---|
| Faithfulness below 0.9 | System prompt | Model generating beyond retrieved context |
| Answer relevancy below 0.85 | Chunking strategy | Chunks topically adjacent but not query-specific |
| Context precision below 0.8 | Retrieval and reranking | Wrong chunks surfacing above relevant ones |
| Context recall below 0.8 | Embedding model fit | Relevant chunks not matching query vectors |
| Both precision and recall low | Chunking and indexing | Fundamental indexing problem upstream |
Step 3: Wire the CI/CD Quality Gate With DeepEval
The goal of CI/CD evaluation integration is simple: fail the build when RAG quality drops below your thresholds before a PR gets merged. DeepEval integrates with pytest natively and is the strongest open-source option for this role. RAGAS is better for metric exploration. DeepEval is better for hard pass/fail gates with a testing-framework mindset.
The Test File
# tests/test_rag_regression.py
# This file runs on every PR that touches:
# - pipeline code
# - retrieval configuration
# - prompt templates
# - embedding model settings
# - chunking strategy
import pytest
from deepeval import assert_test
from deepeval.metrics import (
FaithfulnessMetric,
AnswerRelevancyMetric,
ContextualPrecisionMetric,
ContextualRecallMetric,
)
from deepeval.test_case import LLMTestCase
from golden_dataset import GOLDEN_DATASET
from your_rag_pipeline import run_rag
# Define hard thresholds -- any score below these fails the build
FAITHFULNESS_THRESHOLD = 0.9
ANSWER_RELEVANCY_THRESHOLD = 0.85
CONTEXT_PRECISION_THRESHOLD = 0.8
CONTEXT_RECALL_THRESHOLD = 0.8
# Initialize metrics once -- each uses gpt-4o-mini as judge to keep cost low
faithfulness_metric = FaithfulnessMetric(threshold=FAITHFULNESS_THRESHOLD, model="gpt-4o-mini")
answer_relevancy_metric = AnswerRelevancyMetric(threshold=ANSWER_RELEVANCY_THRESHOLD, model="gpt-4o-mini")
context_precision_metric = ContextualPrecisionMetric(threshold=CONTEXT_PRECISION_THRESHOLD, model="gpt-4o-mini")
context_recall_metric = ContextualRecallMetric(threshold=CONTEXT_RECALL_THRESHOLD, model="gpt-4o-mini")
@pytest.mark.parametrize("record", GOLDEN_DATASET, ids=lambda r: r.question[:60])
def test_rag_quality(record):
"""
Run each golden record through the RAG pipeline.
Assert all four metrics pass their thresholds.
A single failure blocks the merge.
"""
result = run_rag(record.question)
test_case = LLMTestCase(
input=record.question,
actual_output=result["answer"],
retrieval_context=result["retrieved_chunks"],
expected_output=record.expected_answer,
)
assert_test(
test_case,
metrics=[
faithfulness_metric,
answer_relevancy_metric,
context_precision_metric,
context_recall_metric,
]
)

The GitHub Actions Workflow
# .github/workflows/rag_quality_gate.yml
name: RAG Quality Gate
on:
pull_request:
paths:
- 'src/pipeline/**'
- 'src/retrieval/**'
- 'src/prompts/**'
- 'config/chunking.yaml'
- 'config/embedding.yaml'
- 'tests/test_rag_regression.py'
- 'golden_dataset.py'
jobs:
rag-evaluation:
runs-on: ubuntu-latest
timeout-minutes: 30
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install dependencies
run: pip install deepeval ragas pytest pytest-asyncio
- name: Run RAG quality gate
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
QDRANT_URL: ${{ secrets.QDRANT_URL }}
run: |
pytest tests/test_rag_regression.py \
--tb=short \
-v \
--timeout=600
- name: Upload evaluation results
if: always()
uses: actions/upload-artifact@v4
with:
name: rag-eval-results
path: .deepeval/

Run the quality gate only on PRs that touch pipeline-relevant paths. A PR that updates a README or changes a CI config should not trigger a 30-minute evaluation run that costs real API money. The paths filter in the workflow above restricts evaluation to changes that could actually affect RAG behavior. At roughly $0.002 per golden record evaluated with gpt-4o-mini as judge, a 200-record golden dataset costs about $0.40 per run, which is acceptable for every pipeline-related PR.
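The cost arithmetic is worth wiring into a helper so it can be re-checked as the dataset grows. This is a back-of-envelope sketch using the per-record judge cost quoted above:

```python
# Back-of-envelope cost of one quality-gate run, using the ~$0.002
# per-record gpt-4o-mini judge cost cited in the text.
def run_cost(n_records: int, cost_per_record: float = 0.002) -> float:
    return n_records * cost_per_record

print(f"${run_cost(200):.2f} per run")  # $0.40 for a 200-record dataset
```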
Step 4: Production Monitoring With Arize Phoenix
The CI/CD gate prevents known regressions from reaching production. It does not catch regressions caused by things the test suite does not cover: query distribution drift, document staleness, embedding model API changes, or edge cases that only appear at production query volume.
Research from Getmaxim shows that 60% of new RAG deployments now include systematic evaluation from day one, up from less than 30% in early 2025. The production monitoring layer catches the failure classes the CI/CD gate cannot.
Instrumenting the Pipeline
# rag_pipeline_with_tracing.py
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from ragas.metrics import faithfulness as faithfulness_metric
from ragas import evaluate
from datasets import Dataset
import threading
# Start Phoenix server (self-hosted, no API key)
# Run: python -m phoenix.server.main
# Phoenix UI available at http://localhost:6006
provider = TracerProvider()
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument() # auto-traces all OpenAI calls
tracer = trace.get_tracer(__name__)
def compute_faithfulness_async(question, answer, retrieved_chunks):
"""
Compute faithfulness in a background thread so it does not block response time.
Log the result to Phoenix for monitoring.
"""
def _compute():
dataset = Dataset.from_list([{
"question": question,
"answer": answer,
"contexts": retrieved_chunks,
"ground_truth": "" # not needed for faithfulness
}])
result = evaluate(dataset, metrics=[faithfulness_metric])
score = result["faithfulness"]
# Log to Phoenix via custom span attribute
with tracer.start_as_current_span("faithfulness_eval") as span:
span.set_attribute("eval.faithfulness", score)
span.set_attribute("eval.question", question[:200])
if score < 0.85:
span.set_attribute("eval.alert", "faithfulness_below_threshold")
thread = threading.Thread(target=_compute, daemon=True)
thread.start()
def run_rag_with_monitoring(question: str) -> dict:
with tracer.start_as_current_span("rag_query") as span:
span.set_attribute("query", question)
# Retrieval step
retrieved_chunks = retrieve(question)
span.set_attribute("retrieval.chunk_count", len(retrieved_chunks))
# Generation step
answer = generate(question, retrieved_chunks)
span.set_attribute("generation.answer_length", len(answer))
# Async faithfulness check -- does not block the response
compute_faithfulness_async(question, answer, retrieved_chunks)
return {"answer": answer, "retrieved_chunks": retrieved_chunks}

Weekly Drift Monitoring
Run a scheduled job every week that re-evaluates the frozen golden dataset against the live production pipeline and compares the scores against the baseline recorded at the last deployment.
# scripts/weekly_drift_check.py
# Run on a schedule: cron every Monday at 08:00 UTC
# Alert when any metric drops more than 0.05 (absolute) from deployment baseline
import json
from ragas import evaluate
from ragas.metrics import (
faithfulness, answer_relevancy,
context_precision, context_recall,
)
from datasets import Dataset
from golden_dataset import GOLDEN_DATASET
from your_rag_pipeline import run_rag
ALERT_THRESHOLD_DELTA = 0.05  # alert if any metric drops more than 0.05 absolute
def load_baseline() -> dict:
with open("baseline_metrics.json") as f:
return json.load(f)
def run_weekly_drift_check():
rows = []
for record in GOLDEN_DATASET:
result = run_rag(record.question)
rows.append({
"question": record.question,
"answer": result["answer"],
"contexts": result["retrieved_chunks"],
"ground_truth": record.expected_answer,
})
dataset = Dataset.from_list(rows)
current = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
baseline = load_baseline()
alerts = []
for metric_name in ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]:
delta = baseline[metric_name] - current[metric_name]
if delta > ALERT_THRESHOLD_DELTA:
alerts.append({
"metric": metric_name,
"baseline": baseline[metric_name],
"current": current[metric_name],
"delta": round(delta, 4),
})
if alerts:
send_alert(alerts) # your alerting function: Slack, PagerDuty, email
print("ALERT: RAG quality drift detected")
for a in alerts:
print(f" {a['metric']}: {a['baseline']:.3f} -> {a['current']:.3f} (delta -{a['delta']})")
else:
print("Drift check passed. All metrics within 0.05 of baseline.")
return current
if __name__ == "__main__":
run_weekly_drift_check()

Step 5: Wire Production Failures Back Into the Test Suite
This is the step most teams skip, and it is the one that makes evaluation compound over time rather than stay static.
Every time a production query produces a low faithfulness score (below 0.85) or receives a negative user signal (thumbs down, explicit complaint, escalation to a human), that query is a candidate for the golden dataset. Production failures represent the real distribution of hard cases that your synthetic or manually curated golden dataset does not yet cover.
# scripts/promote_failure_to_golden.py
# Run this whenever production monitoring flags a low-quality response.
# Requires human review before the record enters the golden dataset.
from golden_dataset import GoldenRecord
import json
from datetime import datetime
def create_golden_candidate_from_failure(
question: str,
low_quality_answer: str,
retrieved_chunks: list[str],
faithfulness_score: float,
failure_reason: str,
) -> dict:
"""
Package a production failure as a golden dataset candidate.
Outputs a dict for human review before adding to the dataset.
"""
return {
"status": "PENDING_HUMAN_REVIEW",
"question": question,
"failed_answer": low_quality_answer,
"faithfulness": faithfulness_score,
"failure_reason": failure_reason,
"retrieved_chunks": retrieved_chunks,
"instructions": (
"1. Write the correct expected_answer for this question.\n"
"2. Identify the ground_truth_chunks from the knowledge base.\n"
"3. Tag the category and difficulty.\n"
"4. Add to GOLDEN_DATASET in golden_dataset.py and commit."
),
"template": {
"question": question,
"expected_answer": "[HUMAN: fill in correct answer]",
"ground_truth_chunks": ["[HUMAN: find the relevant chunks]"],
"category": "[factual|procedural|comparative|edge_case]",
"difficulty": "[easy|medium|hard]",
"source_documents": ["[HUMAN: identify source doc]"],
"last_verified": datetime.utcnow().strftime("%Y-%m-%d"),
},
"flagged_at": datetime.utcnow().isoformat(),
}
# Example usage inside your monitoring pipeline
def handle_low_quality_response(query_trace: dict):
if query_trace["faithfulness"] < 0.85:
candidate = create_golden_candidate_from_failure(
question=query_trace["question"],
low_quality_answer=query_trace["answer"],
retrieved_chunks=query_trace["retrieved_chunks"],
faithfulness_score=query_trace["faithfulness"],
failure_reason="faithfulness_below_threshold",
)
# Write to a review queue (file, Notion, Linear ticket, etc.)
with open("golden_candidates_review_queue.jsonl", "a") as f:
f.write(json.dumps(candidate) + "\n")

The review queue is the bridge between production monitoring and the test suite. A domain expert reviews each candidate, writes the correct expected answer, identifies the ground truth chunks, and adds the record to the golden dataset. The next CI/CD run picks it up automatically.
Over six months, this loop produces a golden dataset that reflects the actual hard cases your system encounters in production, not just the cases you anticipated at build time.
The Three-Layer Architecture
The full evaluation system has three layers that operate at different frequencies and catch different types of failure.
| Layer | When It Runs | What It Catches | Tools |
|---|---|---|---|
| Offline test suite | On every PR, before merge | Regressions from code or config changes | RAGAS for baselines, DeepEval for gates |
| CI/CD quality gate | Blocking PR merge | Any metric drop below threshold | DeepEval + pytest + GitHub Actions |
| Production monitoring | Continuously, drift weekly | Query distribution shift, doc staleness, edge cases | Arize Phoenix or Langfuse |
The layers compound. The offline test suite gives you a stable baseline. The CI/CD gate enforces it. The production monitoring layer catches what the test suite does not cover. And the failure-to-golden pipeline makes the test suite better over time.
Teams that build this infrastructure before launch catch problems in code review. Teams that skip it catch problems in user complaints.
What You Do Not Need to Buy
The three-layer architecture described above is built entirely from open-source tools. Here is a cost comparison between the fully self-hosted approach and representative managed platforms.
| Approach | Setup Cost | Monthly Cost at 10K queries | Vendor Lock-in |
|---|---|---|---|
| RAGAS + DeepEval + Phoenix (self-hosted) | 3 to 5 days engineering | Under $50 (LLM judge API calls only) | None |
| LangSmith (managed) | Half a day | $39 per seat minimum | LangChain ecosystem |
| Maxim AI (managed) | Half a day | Custom pricing, demo required | Platform-specific |
| Braintrust (managed) | Half a day | Usage-based, CI/CD gates included | Platform-specific |
The managed platforms offer faster setup and team collaboration features. The self-hosted approach gives full control over data, no per-seat pricing, and no vendor dependency. For teams with GDPR or HIPAA data constraints that prevent sending queries to a third-party SaaS, the self-hosted approach is not optional.
The engineering investment in the self-hosted stack is roughly three to five days to set up properly. After that, maintenance is low. The golden dataset grows incrementally. The CI/CD workflow runs automatically. The Phoenix dashboard updates continuously.
Common Mistakes That Break Evaluation Pipelines
Teams building RAG evaluation for the first time make the same set of mistakes. Knowing them in advance saves weeks of debugging.
- Using the LLM that powers the RAG pipeline as the evaluation judge. The model will rate its own outputs highly regardless of quality. Use a separate model, ideally one from a different provider, as the judge: OpenAI's gpt-4o-mini to judge a Claude-powered RAG system, or vice versa.
- Never updating the golden dataset after launch. A static golden dataset from launch day does not reflect the query distribution six months later. New products ship. Policies change. New edge cases emerge. The failure-to-golden promotion pipeline exists specifically to prevent this.
- Running evaluation on a subset of the golden dataset to save money. Statistical significance requires consistency. Run the full golden dataset on every CI/CD evaluation or the metric comparison is not meaningful. At $0.40 per full run with gpt-4o-mini as judge, the cost is not the bottleneck.
- Setting thresholds higher than the current baseline. If your current faithfulness is 0.82, setting the CI/CD threshold at 0.9 will block every PR immediately. Set thresholds at or slightly below the current baseline to prevent regressions, then raise them as pipeline improvements move the baseline up.
- Treating all metrics equally regardless of failure mode. A faithfulness drop from 0.91 to 0.87 is a generation layer problem. A context precision drop from 0.83 to 0.72 is a retrieval layer problem. They need different investigations and different fixes. Read the metric diagnostics table before starting a debugging session.
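The threshold-setting advice above can be sketched as a small helper that derives gates from measured baselines rather than aspirational targets. The margin value is illustrative:

```python
# Derive CI/CD thresholds from measured baselines: gate slightly below
# today's score to catch regressions, then ratchet up as the pipeline
# improves. The 0.02 margin is an illustrative choice.
def derive_thresholds(baseline: dict[str, float], margin: float = 0.02) -> dict[str, float]:
    return {metric: round(score - margin, 3) for metric, score in baseline.items()}

baseline = {"faithfulness": 0.82, "answer_relevancy": 0.88}
print(derive_thresholds(baseline))  # {'faithfulness': 0.8, 'answer_relevancy': 0.86}
```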
Where This Fits in the Bigger RAG Picture
Evaluation infrastructure does not replace good pipeline engineering. A well-evaluated bad pipeline is just a bad pipeline you understand better. The evaluation system shows where to invest engineering effort; it is not a substitute for that effort.
If the faithfulness metric is consistently low, the fix is in the system prompt or retrieval validation, as covered in Why RAG Fails. If context precision is low, the fix is in chunking strategy, embedding model selection, or reranking, as covered in RAG Architecture Explained. If context recall is low, the fix is in hybrid search or embedding model domain fit, as covered in How Embeddings Work in RAG.
Evaluation tells you which metric is failing. The RAG engineering work tells you how to fix it. Both are required. Neither is sufficient alone.
The teams that ship reliable RAG systems in 2026 are not doing anything mysterious. They build the golden dataset before the first production deployment. They wire DeepEval into CI/CD before the second sprint. They stand up Phoenix monitoring before the third. Then they run the system, watch the metrics, promote production failures to the golden dataset, and raise the thresholds as the pipeline improves.
That is RAG evaluation as an engineering discipline. Not glamorous. Not optional.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.