
RAG Evaluation as an Engineering Discipline: Build the Pipeline From Zero

57% of organizations have RAG agents in production. 32% cite quality as the top barrier. Systematic evaluation reduces post-deployment failures by 50 to 70%, but most teams still treat it as a one-time check. This is the practitioner's guide: what metrics matter, how to build a CI/CD quality gate, and how to wire production failures back into your test suite without buying another SaaS tool.

Krunal Kanojiya

May 15, 2026
Tags: #rag #rag-evaluation #ragas #deepeval #arize-phoenix #langfuse #ci-cd #llm-testing #faithfulness #context-precision #golden-dataset #production-monitoring

Most RAG systems do not fail on launch day. They fail quietly over the following six weeks as documents update, query patterns shift, and the retrieval layer starts returning chunks that were accurate when the index was built but are wrong now. Nobody notices because there is no metric moving on a dashboard. There is just a slow erosion in answer quality that surfaces first in user complaints and second in an engineering post-mortem.

According to LangChain's 2026 State of AI Agents report, 57% of organizations now have agents in production, with quality cited as the top barrier to deployment by 32% of respondents. That quality barrier is not primarily a model problem. It is an evaluation infrastructure problem. Teams ship RAG systems without measurement and then cannot distinguish retrieval failures from generation failures when things go wrong.

Enterprise implementations that build systematic evaluation from the start reduce post-deployment issues by 50 to 70%. But 70% of RAG systems in production still lack systematic evaluation frameworks, which makes quality regressions invisible until a user flags them.

This article builds the evaluation system from zero. No SaaS platform required.

Why RAG Evaluation Is a Different Engineering Problem

Traditional machine learning models have one failure mode: the model makes a wrong prediction. You measure prediction quality against labeled ground truth. The process is well understood and well tooled.

RAG pipelines have two distinct failure modes that require separate measurement systems, and standard ML metrics miss both of them.

Retrieval failure. The right document exists in your knowledge base, but the retrieval step does not surface it. Or it surfaces the wrong document, and the model generates an answer grounded in irrelevant context. The answer sounds confident and is factually wrong.

Generation failure. Retrieval worked correctly. The right chunks are in the context window. The model ignores them, misrepresents them, or supplements them with training data that contradicts what the retrieved documents actually say.

BLEU, ROUGE, and BERTScore cannot tell these two failure modes apart because they only measure surface text similarity between a generated answer and a reference string. A RAG system that retrieves the wrong document and generates a plausible wrong answer can still score well on ROUGE if the output text happens to overlap with the reference. These metrics were designed for translation and summarization, not for retrieval pipelines.

Poorly evaluated RAG systems hallucinate in up to 40% of responses even when the correct source documents were retrieved, according to the Stanford AI Lab. The documents are there. The model is not using them. ROUGE does not catch this.

47% of enterprise AI users made at least one major business decision based on hallucinated content in 2024, according to the Suprmind AI Hallucination Report. Every one of those decisions traces back to a RAG pipeline that nobody properly benchmarked.

The Four Metrics That Actually Matter

The evaluation framework for a RAG pipeline needs exactly four metrics. Each one measures a different layer of the pipeline independently, so when a score drops you know immediately which component broke.

| Metric | What It Measures | Failure Mode It Catches | Production Threshold |
| --- | --- | --- | --- |
| Faithfulness | Every answer claim is supported by retrieved context | Model hallucinating beyond its context | Above 0.9 |
| Answer Relevancy | Answer addresses the actual question asked | Adjacent but off-topic answers | Above 0.85 |
| Context Precision | Relevant chunks rank high in the retrieved set | Retrieval is noisy, wrong chunks surface | Above 0.8 |
| Context Recall | All relevant chunks in the KB were retrieved | Retrieval misses documents it should find | Above 0.8 |

Source: MarsDevs production RAG targets, 2026, Premai.io RAG evaluation guide

Faithfulness

Faithfulness is the hallucination metric for RAG. RAGAS computes it by extracting all factual claims from the generated answer using an LLM, then verifying each claim against the retrieved context using the same LLM as a judge. The score is the fraction of claims that are supported. A score of 0.6 means 40% of what the model said was not grounded in its context window.

A faithfulness score below 0.9 means your system prompt is not constraining the model to its retrieved context tightly enough, or retrieval is so poor that the model fills gaps from training data. Both are fixable, but they require different fixes.
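
As a toy sketch of the arithmetic, assuming the claim extraction and per-claim verdicts have already been produced by an LLM judge (which is what RAGAS does with prompts under the hood):

python
# faithfulness = fraction of extracted claims the judge found supported
# by the retrieved context
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# 5 claims extracted, 3 supported -> 0.6, i.e. 40% of the answer unsupported
print(faithfulness_score([True, True, True, False, False]))  # 0.6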

Answer Relevancy

Answer relevancy measures whether the answer actually helps the user rather than being technically accurate but adjacent to what they asked. RAGAS measures this by generating multiple question paraphrases from the answer text and computing similarity between those paraphrases and the original question. If the answer contains the right information, the generated questions should resemble the original closely.

Low answer relevancy combined with high faithfulness means retrieval is surfacing real content from the knowledge base that is not quite relevant to the specific query. This is usually a chunking or hybrid search problem, not a generation problem.
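
A minimal sketch of the scoring step, assuming an LLM has already generated candidate questions from the answer and that an embedding helper (an assumption here, not the RAGAS API) has returned their vectors:

python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy(original_question_emb: np.ndarray,
                     generated_question_embs: list[np.ndarray]) -> float:
    # Mean similarity between the original question and the questions
    # reverse-engineered from the answer; closer to 1.0 means the answer
    # actually addresses what was asked
    return float(np.mean([cosine(original_question_emb, g)
                          for g in generated_question_embs]))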

Context Precision

Context precision measures whether the retriever ranks relevant chunks at the top of results. A retriever returning 10 chunks where only 2 are relevant has context precision of 0.2, and the downstream LLM is polluted by noise on every single query.

This is a retrieval layer metric, not a generation metric. Low context precision means the wrong chunks are reaching the model. The fix is better chunking strategy, improved embedding model fit for the domain, or adding a reranker after retrieval.

Context Recall

Context recall measures whether the retrieval step found all the relevant chunks that existed in the knowledge base. High precision with low recall means the chunks that were retrieved were all relevant, but the system missed other chunks that would have produced a more complete answer.

Together, context precision and context recall give a complete diagnostic picture of the retrieval layer. Low precision points to noise in retrieval. Low recall points to gaps in coverage.
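
An unweighted sketch of what the two numbers summarize, assuming you know which chunks are relevant (for example, the ground truth chunks in the golden dataset). RAGAS's actual implementations are rank-aware and LLM-judged, so treat this as an illustration only:

python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of retrieved chunks that are relevant.
    # 10 retrieved with only 2 relevant -> 0.2, the noisy-retriever case above.
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of the relevant chunks that actually made it into the retrieved set
    if not relevant:
        return 1.0
    return sum(1 for chunk in relevant if chunk in retrieved) / len(relevant)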

The Tools Available Without Buying a Platform

Three open-source tools cover the full evaluation lifecycle without requiring a paid platform.

| Tool | Primary Role | Strength | Limitation | License |
| --- | --- | --- | --- | --- |
| RAGAS | Metric computation and synthetic data generation | Reference-free evaluation, LLM-as-judge | No CI/CD pass/fail gates built in | Apache 2.0 |
| DeepEval | CI/CD quality gates | Native pytest integration, hard thresholds | Requires LLM API for judge calls | Apache 2.0 |
| Arize Phoenix | Production observability and trace capture | Self-hostable, UMAP visualization, zero feature gates | No paid support on free tier | Apache 2.0 |
| Langfuse | Production tracing and session logging | Clean UI, easy self-host, wide framework support | Evaluation metrics require integration | MIT |

For most production teams: use RAGAS for metric exploration and synthetic dataset generation, DeepEval for CI/CD quality gates, and Arize Phoenix or Langfuse for production monitoring. Each tool does one job well. Combining them gives you the full evaluation pipeline without paying for a unified SaaS layer on top.

Step 1: Build the Golden Dataset

The golden dataset is the foundation of the entire evaluation pipeline. Everything else runs against it. Without it, there are no baselines, no regression tests, and no way to measure whether a pipeline change improved or degraded quality.

A production golden dataset contains between 100 and 300 highly diverse, mutually exclusive question-answer pairs. This size provides statistical significance for metric calculations without excessive computational overhead during CI/CD runs. Below 50 questions, the metrics are too noisy to trust at the individual score level.

What Each Record Contains

  • Question: A real user query sampled from production logs or constructed to cover domain edge cases
  • Expected answer: The correct answer, verified by a domain expert
  • Ground truth chunks: The specific document chunks that contain the correct answer
  • Category: factual, procedural, comparative, or edge_case
  • Difficulty tier: easy, medium, or hard
  • Source documents: The document names the chunks come from, for traceability

How to Source Questions

Do not generate questions synthetically from the start. Teams that skip human review often miss systemic issues in compliance-heavy use cases such as finance or healthcare. The sequence that works in practice is:

  • Export 500 real queries from your production query logs or customer support tickets
  • Deduplicate and cluster them by intent using an embedding similarity pass
  • Select the 100 to 300 most representative queries across clusters, covering both common cases and known edge cases
  • Have a domain expert write and verify the expected answer for each selected query
  • Identify the exact document chunks that support each expected answer

RAGAS can generate synthetic questions from your document corpus to cover gaps where real query data does not exist yet, using its TestsetGenerator. Use synthetic data to fill coverage gaps, not as the primary dataset source.
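
One way to handle the deduplicate-and-cluster step is an embedding pass plus k-means, keeping the query nearest each centroid as that intent cluster's representative. The embed() helper and the cluster count are assumptions; swap in your own embedding model and tune k toward your target dataset size:

python
import numpy as np
from sklearn.cluster import KMeans

def select_representative_queries(queries: list[str], embed, k: int = 150) -> list[str]:
    vectors = np.asarray(embed(queries))          # shape: (n_queries, dim)
    km = KMeans(n_clusters=k, random_state=42).fit(vectors)
    representatives = []
    for cluster_id in range(k):
        members = np.where(km.labels_ == cluster_id)[0]
        # The query closest to the cluster centroid represents that intent cluster
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[cluster_id], axis=1)
        representatives.append(queries[members[np.argmin(dists)]])
    return representatives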

python
# golden_dataset_builder.py
from dataclasses import dataclass
from typing import Literal

@dataclass
class GoldenRecord:
    question: str
    expected_answer: str
    ground_truth_chunks: list[str]
    category: Literal["factual", "procedural", "comparative", "edge_case"]
    difficulty: Literal["easy", "medium", "hard"]
    source_documents: list[str]
    last_verified: str

# Version your golden dataset as Python so it lives in version control
# alongside your pipeline code and changes are tracked in git history
GOLDEN_DATASET: list[GoldenRecord] = [
    GoldenRecord(
        question="What is the refund window for enterprise customers?",
        expected_answer="Enterprise customers receive full refunds within 60 days of purchase.",
        ground_truth_chunks=[
            "Enterprise plan customers are eligible for a full refund within 60 days "
            "of the original purchase date, no questions asked."
        ],
        category="factual",
        difficulty="easy",
        source_documents=["refund_policy_v3.pdf"],
        last_verified="2026-05-01"
    ),
    GoldenRecord(
        question="How do I migrate from API v1 to v2 without downtime?",
        expected_answer=(
            "Install the compatibility shim available in v1.9, run both versions in "
            "parallel in staging, then cut over when v2 error rate drops below 0.1%."
        ),
        ground_truth_chunks=[
            "The v1 to v2 migration guide recommends installing the compatibility shim "
            "in version 1.9. Run both API versions simultaneously in staging. When the "
            "v2 error rate falls below 0.1% over a 24-hour window, proceed with production cutover."
        ],
        category="procedural",
        difficulty="hard",
        source_documents=["api_migration_guide_v2.pdf"],
        last_verified="2026-05-01"
    ),
    # Add 98 to 298 more records covering your domain's full query distribution
]

Version your golden dataset explicitly. Many teams waste days tracking mysterious regressions that trace back to untracked changes in their evaluation inputs. Store the golden dataset as a Python file in the same repository as your pipeline code. Every change to a question, expected answer, or source document chunk is a git commit with a message explaining why the record changed. This makes it trivial to correlate metric changes with dataset changes.

Step 2: Compute Baseline Metrics With RAGAS

Before wiring evaluation into CI/CD, run RAGAS against the full golden dataset on your current pipeline to establish baseline scores. These baselines become the thresholds for the quality gate in the next step.

RAGAS was created by researchers Shahul Es and Jithin James, published in September 2023, and presented at EACL 2024. Backed by Y Combinator Winter 2024, it processes over 5 million evaluations monthly for companies including AWS, Microsoft, Databricks, and Moody's. It is the standard reference-free evaluation framework for RAG.

python
# evaluate_baseline.py
# Run this once before wiring into CI/CD.
# The output becomes your quality gate thresholds.

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset
from golden_dataset import GOLDEN_DATASET
from your_rag_pipeline import run_rag  # your RAG function

def build_ragas_dataset(golden_records, rag_fn) -> Dataset:
    rows = []
    for record in golden_records:
        # Run the question through your current RAG pipeline
        result = rag_fn(record.question)

        rows.append({
            "question":     record.question,
            "answer":       result["answer"],        # what your pipeline generated
            "contexts":     result["retrieved_chunks"],  # what was actually retrieved
            "ground_truth": record.expected_answer,
        })

    return Dataset.from_list(rows)

if __name__ == "__main__":
    dataset = build_ragas_dataset(GOLDEN_DATASET, run_rag)

    results = evaluate(
        dataset=dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
    )

    print(results)
    # Save these as your baseline thresholds:
    # faithfulness:      target >= 0.9
    # answer_relevancy:  target >= 0.85
    # context_precision: target >= 0.8
    # context_recall:    target >= 0.8

When the baseline run completes, record every score in a baseline_metrics.json file committed to the repository. These numbers are the floor. Any future pipeline change that drops a metric below the baseline fails the quality gate.
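
One way to persist that file, assuming the RAGAS result object can be read as a metric-name-to-score mapping (adjust the access pattern to your RAGAS version), called with the `results` object from the baseline script above:

python
# save_baseline.py
import json

METRIC_NAMES = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]

def save_baseline(results, path: str = "baseline_metrics.json") -> None:
    # Round to 4 decimals so diffs in git history stay readable
    baseline = {name: round(float(results[name]), 4) for name in METRIC_NAMES}
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)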

The per-metric diagnostic table below maps low scores to the pipeline component that needs fixing, which is more useful than the score alone:

| Failing Metric | First Place to Look | Likely Root Cause |
| --- | --- | --- |
| Faithfulness below 0.9 | System prompt | Model generating beyond retrieved context |
| Answer relevancy below 0.85 | Chunking strategy | Chunks topically adjacent but not query-specific |
| Context precision below 0.8 | Retrieval and reranking | Wrong chunks surfacing above relevant ones |
| Context recall below 0.8 | Embedding model fit | Relevant chunks not matching query vectors |
| Both precision and recall low | Chunking and indexing | Fundamental indexing problem upstream |

Step 3: Wire the CI/CD Quality Gate With DeepEval

The goal of CI/CD evaluation integration is simple: fail the build when RAG quality drops below your thresholds before a PR gets merged. DeepEval integrates with pytest natively and is the strongest open-source option for this role. RAGAS is better for metric exploration. DeepEval is better for hard pass/fail gates with a testing-framework mindset.

The Test File

python
# tests/test_rag_regression.py
# This file runs on every PR that touches:
# - pipeline code
# - retrieval configuration
# - prompt templates
# - embedding model settings
# - chunking strategy

import pytest
from deepeval import assert_test
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
)
from deepeval.test_case import LLMTestCase
from golden_dataset import GOLDEN_DATASET
from your_rag_pipeline import run_rag

# Define hard thresholds -- any score below these fails the build
FAITHFULNESS_THRESHOLD      = 0.9
ANSWER_RELEVANCY_THRESHOLD  = 0.85
CONTEXT_PRECISION_THRESHOLD = 0.8
CONTEXT_RECALL_THRESHOLD    = 0.8

# Initialize metrics once -- each uses gpt-4o-mini as judge to keep cost low
faithfulness_metric       = FaithfulnessMetric(threshold=FAITHFULNESS_THRESHOLD, model="gpt-4o-mini")
answer_relevancy_metric   = AnswerRelevancyMetric(threshold=ANSWER_RELEVANCY_THRESHOLD, model="gpt-4o-mini")
context_precision_metric  = ContextualPrecisionMetric(threshold=CONTEXT_PRECISION_THRESHOLD, model="gpt-4o-mini")
context_recall_metric     = ContextualRecallMetric(threshold=CONTEXT_RECALL_THRESHOLD, model="gpt-4o-mini")

@pytest.mark.parametrize("record", GOLDEN_DATASET, ids=lambda r: r.question[:60])
def test_rag_quality(record):
    """
    Run each golden record through the RAG pipeline.
    Assert all four metrics pass their thresholds.
    A single failure blocks the merge.
    """
    result = run_rag(record.question)

    test_case = LLMTestCase(
        input=record.question,
        actual_output=result["answer"],
        retrieval_context=result["retrieved_chunks"],
        expected_output=record.expected_answer,
    )

    assert_test(
        test_case,
        metrics=[
            faithfulness_metric,
            answer_relevancy_metric,
            context_precision_metric,
            context_recall_metric,
        ]
    )

The GitHub Actions Workflow

yaml
# .github/workflows/rag_quality_gate.yml
name: RAG Quality Gate

on:
  pull_request:
    paths:
      - 'src/pipeline/**'
      - 'src/retrieval/**'
      - 'src/prompts/**'
      - 'config/chunking.yaml'
      - 'config/embedding.yaml'
      - 'tests/test_rag_regression.py'
      - 'golden_dataset.py'

jobs:
  rag-evaluation:
    runs-on: ubuntu-latest
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install deepeval ragas pytest pytest-asyncio pytest-timeout

      - name: Run RAG quality gate
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          QDRANT_URL: ${{ secrets.QDRANT_URL }}
        run: |
          pytest tests/test_rag_regression.py \
            --tb=short \
            -v \
            --timeout=600

      - name: Upload evaluation results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: rag-eval-results
          path: .deepeval/

Run the quality gate only on PRs that touch pipeline-relevant paths. A PR that updates a README or changes a CI config should not trigger a 30-minute evaluation run that costs real API money. The paths filter in the workflow above restricts evaluation to changes that could actually affect RAG behavior. At roughly $0.002 per golden record evaluated with gpt-4o-mini as judge, a 200-record golden dataset costs about $0.40 per run, which is acceptable for every pipeline-related PR.

Step 4: Production Monitoring With Arize Phoenix

The CI/CD gate prevents known regressions from reaching production. It does not catch regressions caused by things the test suite does not cover: query distribution drift, document staleness, embedding model API changes, or edge cases that only appear at production query volume.

Research from Getmaxim shows that 60% of new RAG deployments now include systematic evaluation from day one, up from less than 30% in early 2025. Production monitoring is the layer that catches what the CI/CD gate cannot: the drift, staleness, and volume-dependent edge cases listed above.

Arize Phoenix supports UMAP-based embedding visualization, which lets you visually cluster retrieval results to spot semantic gaps and drift. It is fully self-hostable with zero feature gates on the open-source tier.

Instrumenting the Pipeline

python
# rag_pipeline_with_tracing.py
from phoenix.otel import register   # helper from the arize-phoenix-otel package
from openinference.instrumentation.openai import OpenAIInstrumentor
from ragas.metrics import faithfulness as faithfulness_metric
from ragas import evaluate
from datasets import Dataset
import threading

# Start Phoenix server (self-hosted, no API key)
# Run: python -m phoenix.server.main
# Phoenix UI available at http://localhost:6006

# register() configures an OpenTelemetry TracerProvider that exports spans
# to the local Phoenix collector, so traced calls actually appear in the UI
tracer_provider = register(project_name="rag-pipeline")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)   # auto-traces all OpenAI calls

tracer = tracer_provider.get_tracer(__name__)

def compute_faithfulness_async(question, answer, retrieved_chunks):
    """
    Compute faithfulness in a background thread so it does not block response time.
    Log the result to Phoenix for monitoring.
    """
    def _compute():
        dataset = Dataset.from_list([{
            "question": question,
            "answer": answer,
            "contexts": retrieved_chunks,
            "ground_truth": ""   # not needed for faithfulness
        }])
        result = evaluate(dataset, metrics=[faithfulness_metric])
        score = result["faithfulness"]

        # Log to Phoenix via custom span attribute
        with tracer.start_as_current_span("faithfulness_eval") as span:
            span.set_attribute("eval.faithfulness", score)
            span.set_attribute("eval.question", question[:200])
            if score < 0.85:
                span.set_attribute("eval.alert", "faithfulness_below_threshold")

    thread = threading.Thread(target=_compute, daemon=True)
    thread.start()

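# Note: retrieve() and generate() below stand in for your own retrieval and
# generation functions; wire in whatever your pipeline already exposes.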
def run_rag_with_monitoring(question: str) -> dict:
    with tracer.start_as_current_span("rag_query") as span:
        span.set_attribute("query", question)

        # Retrieval step
        retrieved_chunks = retrieve(question)
        span.set_attribute("retrieval.chunk_count", len(retrieved_chunks))

        # Generation step
        answer = generate(question, retrieved_chunks)
        span.set_attribute("generation.answer_length", len(answer))

        # Async faithfulness check -- does not block the response
        compute_faithfulness_async(question, answer, retrieved_chunks)

        return {"answer": answer, "retrieved_chunks": retrieved_chunks}

Weekly Drift Monitoring

RAG evaluation is not a one-time audit. Production query distributions shift, source documents update, and embedding models improve. Enterprises that treat evaluation as a continuous process catch regressions in hours, not weeks.

Run a scheduled job every week that re-evaluates the frozen golden dataset against the live production pipeline and compares the scores against the baseline recorded at the last deployment.

python
# scripts/weekly_drift_check.py
# Run on a schedule: cron every Monday at 08:00 UTC
# Alert when any metric drops more than 0.05 (5 points) from the deployment baseline

import json
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall,
)
from datasets import Dataset
from golden_dataset import GOLDEN_DATASET
from your_rag_pipeline import run_rag

ALERT_THRESHOLD_DELTA = 0.05   # alert if any metric drops more than 5 points

def load_baseline() -> dict:
    with open("baseline_metrics.json") as f:
        return json.load(f)

def run_weekly_drift_check():
    rows = []
    for record in GOLDEN_DATASET:
        result = run_rag(record.question)
        rows.append({
            "question":     record.question,
            "answer":       result["answer"],
            "contexts":     result["retrieved_chunks"],
            "ground_truth": record.expected_answer,
        })

    dataset = Dataset.from_list(rows)
    current = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
    )

    baseline = load_baseline()
    alerts = []

    for metric_name in ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]:
        delta = baseline[metric_name] - current[metric_name]
        if delta > ALERT_THRESHOLD_DELTA:
            alerts.append({
                "metric":    metric_name,
                "baseline":  baseline[metric_name],
                "current":   current[metric_name],
                "delta":     round(delta, 4),
            })

    if alerts:
        send_alert(alerts)  # your alerting function: Slack, PagerDuty, email
        print("ALERT: RAG quality drift detected")
        for a in alerts:
            print(f"  {a['metric']}: {a['baseline']:.3f} -> {a['current']:.3f} (delta -{a['delta']})")
    else:
        print("Drift check passed. All metrics within 5% of baseline.")

    return current

if __name__ == "__main__":
    run_weekly_drift_check()

Step 5: Wire Production Failures Back Into the Test Suite

This is the step most teams skip, and it is the one that makes evaluation compound over time rather than stay static.

Every time a production query produces a low faithfulness score (below 0.85) or receives a negative user signal (thumbs down, explicit complaint, escalation to a human), that query is a candidate for the golden dataset. Production failures represent the real distribution of hard cases that your synthetic or manually curated golden dataset does not yet cover.

python
# scripts/promote_failure_to_golden.py
# Run this whenever production monitoring flags a low-quality response.
# Requires human review before the record enters the golden dataset.

from golden_dataset import GoldenRecord
import json
from datetime import datetime

def create_golden_candidate_from_failure(
    question: str,
    low_quality_answer: str,
    retrieved_chunks: list[str],
    faithfulness_score: float,
    failure_reason: str,
) -> dict:
    """
    Package a production failure as a golden dataset candidate.
    Outputs a dict for human review before adding to the dataset.
    """
    return {
        "status":            "PENDING_HUMAN_REVIEW",
        "question":          question,
        "failed_answer":     low_quality_answer,
        "faithfulness":      faithfulness_score,
        "failure_reason":    failure_reason,
        "retrieved_chunks":  retrieved_chunks,
        "instructions":      (
            "1. Write the correct expected_answer for this question.\n"
            "2. Identify the ground_truth_chunks from the knowledge base.\n"
            "3. Tag the category and difficulty.\n"
            "4. Add to GOLDEN_DATASET in golden_dataset.py and commit."
        ),
        "template": {
            "question":           question,
            "expected_answer":    "[HUMAN: fill in correct answer]",
            "ground_truth_chunks": ["[HUMAN: find the relevant chunks]"],
            "category":           "[factual|procedural|comparative|edge_case]",
            "difficulty":         "[easy|medium|hard]",
            "source_documents":   ["[HUMAN: identify source doc]"],
            "last_verified":      datetime.utcnow().strftime("%Y-%m-%d"),
        },
        "flagged_at": datetime.utcnow().isoformat(),
    }

# Example usage inside your monitoring pipeline
def handle_low_quality_response(query_trace: dict):
    if query_trace["faithfulness"] < 0.85:
        candidate = create_golden_candidate_from_failure(
            question=query_trace["question"],
            low_quality_answer=query_trace["answer"],
            retrieved_chunks=query_trace["retrieved_chunks"],
            faithfulness_score=query_trace["faithfulness"],
            failure_reason="faithfulness_below_threshold",
        )
        # Write to a review queue (file, Notion, Linear ticket, etc.)
        with open("golden_candidates_review_queue.jsonl", "a") as f:
            f.write(json.dumps(candidate) + "\n")

The review queue is the bridge between production monitoring and the test suite. A domain expert reviews each candidate, writes the correct expected answer, identifies the ground truth chunks, and adds the record to the golden dataset. The next CI/CD run picks it up automatically.

Over six months, this loop produces a golden dataset that reflects the actual hard cases your system encounters in production, not just the cases you anticipated at build time.

The Three-Layer Architecture

The full evaluation system has three layers that operate at different frequencies and catch different types of failure.

| Layer | When It Runs | What It Catches | Tools |
| --- | --- | --- | --- |
| Offline test suite | On every PR, before merge | Regressions from code or config changes | RAGAS for baselines, DeepEval for gates |
| CI/CD quality gate | Blocking PR merge | Any metric drop below threshold | DeepEval + pytest + GitHub Actions |
| Production monitoring | Continuously, with a weekly drift check | Query distribution shift, doc staleness, edge cases | Arize Phoenix or Langfuse |

The layers compound. The offline test suite gives you a stable baseline. The CI/CD gate enforces it. The production monitoring layer catches what the test suite does not cover. And the failure-to-golden pipeline makes the test suite better over time.

Teams that build this infrastructure before launch catch problems in code review. Teams that skip it catch problems in user complaints.

What You Do Not Need to Buy

The three-layer architecture described above is built entirely from open-source tools. Here is a cost comparison between the fully self-hosted approach and representative managed platforms.

| Approach | Setup Cost | Monthly Cost at 10K Queries | Vendor Lock-in |
| --- | --- | --- | --- |
| RAGAS + DeepEval + Phoenix (self-hosted) | 3 to 5 days of engineering | Under $50 (LLM judge API calls only) | None |
| LangSmith (managed) | Half a day | $39 per seat minimum | LangChain ecosystem |
| Maxim AI (managed) | Half a day | Custom pricing, demo required | Platform-specific |
| Braintrust (managed) | Half a day | Usage-based, CI/CD gates included | Platform-specific |

The managed platforms offer faster setup and team collaboration features. The self-hosted approach gives full control over data, no per-seat pricing, and no vendor dependency. For teams with GDPR or HIPAA data constraints that prevent sending queries to a third-party SaaS, the self-hosted approach is not optional.

The engineering investment in the self-hosted stack is roughly three to five days to set up properly. After that, maintenance is low. The golden dataset grows incrementally. The CI/CD workflow runs automatically. The Phoenix dashboard updates continuously.

Common Mistakes That Break Evaluation Pipelines

Teams building RAG evaluation for the first time make the same set of mistakes. Knowing them in advance saves weeks of debugging.

  • Using the LLM that powers the RAG pipeline as the evaluation judge. The model will rate its own outputs highly regardless of quality. Use a separate model, ideally one from a different provider, as the judge: OpenAI's gpt-4o-mini for a Claude-powered RAG system, or vice versa.

  • Never updating the golden dataset after launch. A static golden dataset from launch day does not reflect the query distribution six months later. New products ship. New policies change. New edge cases emerge. The failure-to-golden promotion pipeline exists specifically to prevent this.

  • Running evaluation on a subset of the golden dataset to save money. Statistical significance requires consistency. Run the full golden dataset on every CI/CD evaluation or the metric comparison is not meaningful. At $0.40 per full run with gpt-4o-mini as judge, the cost is not the bottleneck.

  • Setting thresholds higher than the current baseline. If your current faithfulness is 0.82, setting the CI/CD threshold at 0.9 will block every PR immediately. Set thresholds at or slightly below the current baseline to prevent regressions, then raise them as pipeline improvements move the baseline up; a sketch of deriving thresholds from the recorded baseline follows this list.

  • Treating all metrics equally regardless of failure mode. A faithfulness drop from 0.91 to 0.87 is a generation layer problem. A context precision drop from 0.83 to 0.72 is a retrieval layer problem. They need different investigations and different fixes. Read the metric diagnostics table before starting a debugging session.
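
A minimal sketch of deriving gate thresholds from the recorded baseline instead of hard-coding aspirational numbers. The 0.02 margin is an assumption to tune; the file name matches the baseline_metrics.json used earlier:

python
import json

def thresholds_from_baseline(path: str = "baseline_metrics.json", margin: float = 0.02) -> dict:
    with open(path) as f:
        baseline = json.load(f)
    # Gate slightly below today's baseline to block regressions;
    # raise the thresholds as pipeline improvements move the baseline up
    return {metric: round(score - margin, 3) for metric, score in baseline.items()}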

Where This Fits in the Bigger RAG Picture

Evaluation infrastructure does not replace good pipeline engineering. A well-evaluated bad pipeline is just a bad pipeline you understand better. The evaluation system shows you where to invest engineering effort; it is not a substitute for that effort.

If the faithfulness metric is consistently low, the fix is in the system prompt or retrieval validation, as covered in Why RAG Fails. If context precision is low, the fix is in chunking strategy, embedding model selection, or reranking, as covered in RAG Architecture Explained. If context recall is low, the fix is in hybrid search or embedding model domain fit, as covered in How Embeddings Work in RAG.

Evaluation tells you which metric is failing. The RAG engineering work tells you how to fix it. Both are required. Neither is sufficient alone.

The teams that ship reliable RAG systems in 2026 are not doing anything mysterious. They build the golden dataset before the first production deployment. They wire DeepEval into CI/CD before the second sprint. They stand up Phoenix monitoring before the third. Then they run the system, watch the metrics, promote production failures to the golden dataset, and raise the thresholds as the pipeline improves.

That is RAG evaluation as an engineering discipline. Not glamorous. Not optional.


Krunal Kanojiya

Technical Content Writer

Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.

GitHub · LinkedIn

Related Posts

Why RAG Fails: Every Failure Mode and How to Fix Each One (2026)

May 07, 2026 · 17 min read

RAG Architecture Explained: How Production Pipelines Actually Work (2026)

May 04, 2026 · 18 min read

RAG vs LangChain: What They Are, How They Relate, and Which One You Actually Need

May 09, 2026 · 16 min read