RAG Evaluation as an Engineering Discipline: Build the Pipeline From Zero
57% of organizations have RAG agents in production. 32% cite quality as the top barrier. Systematic evaluation reduces post-deployment failures by 50 to 70%, but most teams still treat it as a one-time check. This is the practitioner's guide: what metrics matter, how to build a CI/CD quality gate, and how to wire production failures back into your test suite without buying another SaaS tool.
Most RAG systems do not fail on launch day. They fail quietly over the following six weeks as documents update, query patterns shift, and the retrieval layer starts returning chunks that were accurate when the index was built and are wrong now. Nobody notices because there is no metric moving on a dashboard. There is just a slow erosion in answer quality that surfaces first in user complaints and second in an engineering post-mortem.
According to LangChain's 2026 State of AI Agents report, 57% of organizations now have agents in production, with quality cited as the top barrier to deployment by 32% of respondents. That quality barrier is not primarily a model problem. It is an evaluation infrastructure problem. Teams ship RAG systems without measurement and then cannot distinguish retrieval failures from generation failures when things go wrong.
Enterprise implementations that build systematic evaluation from the start reduce post-deployment issues by 50 to 70%. But 70% of RAG systems in production still lack systematic evaluation frameworks, which makes quality regressions invisible until a user flags them.
This article builds the evaluation system from zero. No SaaS platform required.
Why RAG Evaluation Is a Different Engineering Problem
Traditional machine learning models have one failure mode: the model makes a wrong prediction. You measure prediction quality against labeled ground truth. The process is well understood and well tooled.
RAG pipelines have two distinct failure modes that require separate measurement systems, and standard ML metrics miss both of them.
Retrieval failure. The right document exists in your knowledge base, but the retrieval step does not surface it. Or it surfaces the wrong document, and the model generates an answer grounded in irrelevant context. The answer sounds confident and is factually wrong.
Generation failure. Retrieval worked correctly. The right chunks are in the context window. The model ignores them, misrepresents them, or supplements them with training data that contradicts what the retrieved documents actually say.
Classic metrics cannot distinguish these two failure modes. BLEU, ROUGE, and BERTScore measure surface text similarity between a generated answer and a reference string. A RAG system that retrieves the wrong document and generates a plausible wrong answer can still score well on ROUGE if the output text happens to overlap with the reference. These metrics were designed for translation and summarization, not for retrieval pipelines.
Poorly evaluated RAG systems hallucinate in up to 40% of responses even when the correct source documents were retrieved, according to the Stanford AI Lab. The documents are there. The model is not using them. ROUGE does not catch this.
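To make the failure concrete, here is a toy unigram-overlap score, a deliberately simplified stand-in for ROUGE-1 recall (not the real ROUGE implementation), showing how a factually wrong answer can score near-perfect on surface overlap:

```python
# Toy unigram-overlap score -- a simplified stand-in for ROUGE-1 recall,
# used here only to illustrate why surface overlap misses factual errors.
def unigram_overlap(candidate: str, reference: str) -> float:
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    return len(cand & ref) / len(ref)

reference = "Enterprise customers receive full refunds within 60 days of purchase."
# Factually wrong answer (30 days, not 60) with near-identical wording:
wrong = "Enterprise customers receive full refunds within 30 days of purchase."

score = unigram_overlap(wrong, reference)
print(f"overlap: {score:.2f}")  # 0.90 -- despite the wrong refund window
```

The answer gets the single fact wrong that a user would care about, yet nine of ten reference tokens overlap. This is exactly the blind spot that faithfulness-style, context-grounded evaluation closes.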
47% of enterprise AI users made at least one major business decision based on hallucinated content in 2024, according to the Suprmind AI Hallucination Report. Every one of those decisions traces back to a RAG pipeline that nobody properly benchmarked.
The Four Metrics That Actually Matter
The evaluation framework for a RAG pipeline needs exactly four metrics. Each one measures a different layer of the pipeline independently, so when a score drops you know immediately which component broke.
| Metric | What It Measures | Failure Mode It Catches | Production Threshold |
|---|---|---|---|
| Faithfulness | Every answer claim is supported by retrieved context | Model hallucinating beyond its context | Above 0.9 |
| Answer Relevancy | Answer addresses the actual question asked | Adjacent but off-topic answers | Above 0.85 |
| Context Precision | Relevant chunks rank high in the retrieved set | Retrieval is noisy, wrong chunks surface | Above 0.8 |
| Context Recall | All relevant chunks in the KB were retrieved | Retrieval misses documents it should find | Above 0.8 |
Source: MarsDevs production RAG targets (2026); Premai.io RAG evaluation guide
Faithfulness
Faithfulness is the hallucination metric for RAG. RAGAS computes it by extracting all factual claims from the generated answer using an LLM, then verifying each claim against the retrieved context using the same LLM as a judge. The score is the fraction of claims that are supported. A score of 0.6 means 40% of what the model said was not grounded in its context window.
A faithfulness score below 0.9 means your system prompt is not constraining the model to its retrieved context tightly enough, or retrieval is so poor that the model fills gaps from training data. Both are fixable, but they require different fixes.
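The scoring step itself reduces to a fraction. In this sketch the claim extraction and the supported/unsupported verdicts are hypothetical inputs; in RAGAS both are produced by an LLM judge:

```python
# Faithfulness is the fraction of answer claims that the judge marks as
# supported by the retrieved context. The verdicts below are hypothetical --
# in RAGAS an LLM extracts the claims and verifies each one.
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Five claims extracted from an answer; the judge supported three of them:
verdicts = [True, True, True, False, False]
print(faithfulness_score(verdicts))  # 0.6 -> 40% of the answer was ungrounded
```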
Answer Relevancy
Answer relevancy measures whether the answer actually helps the user rather than being technically accurate but adjacent to what they asked. RAGAS measures this by generating multiple question paraphrases from the answer text and computing similarity between those paraphrases and the original question. If the answer contains the right information, the generated questions should resemble the original closely.
Low answer relevancy combined with high faithfulness means retrieval is surfacing real content from the knowledge base that is not quite relevant to the specific query. This is usually a chunking or hybrid search problem, not a generation problem.
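The mechanics of the paraphrase-similarity check can be sketched with plain cosine similarity. The vectors below are toy stand-ins for real embedding-model output, and the paraphrase generation step (done by an LLM in RAGAS) is assumed to have already happened:

```python
import math

# Answer relevancy, RAGAS-style: generate question paraphrases from the
# answer, embed them, and average cosine similarity against the original
# question embedding. Vectors here are toy stand-ins for real embeddings.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def answer_relevancy(question_vec, paraphrase_vecs) -> float:
    sims = [cosine(question_vec, p) for p in paraphrase_vecs]
    return sum(sims) / len(sims)

question_vec = [1.0, 0.0, 0.0]
# Paraphrases regenerated from an on-topic answer should sit near the question:
paraphrases = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]
print(round(answer_relevancy(question_vec, paraphrases), 3))  # close to 1.0
```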
Context Precision
This is a retrieval layer metric, not a generation metric. Low context precision means the wrong chunks are reaching the model. The fix is better chunking strategy, improved embedding model fit for the domain, or adding a reranker after retrieval.
Context Recall
Context recall measures whether the retrieval step found all the relevant chunks that existed in the knowledge base. High precision with low recall means the chunks that were retrieved were all relevant, but the system missed other chunks that would have produced a more complete answer.
Together, context precision and context recall give a complete diagnostic picture of the retrieval layer. Low precision points to noise in retrieval. Low recall points to gaps in coverage.
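Both retrieval metrics can be sketched in a few lines. The rank-weighted precision below follows the RAGAS-style weighting (precision@k averaged at each relevant rank); the recall is simplified to exact chunk matching, whereas RAGAS attributes ground-truth statements to context with an LLM:

```python
# rel is 1 if the retrieved chunk at that rank is relevant, else 0.
def context_precision(rel: list[int]) -> float:
    # Mean of precision@k evaluated at each relevant rank, so relevant
    # chunks buried low in the list drag the score down.
    hits, total = 0, 0.0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            total += hits / k  # precision@k at this relevant position
    return total / hits if hits else 0.0

def context_recall(retrieved: list[str], ground_truth: list[str]) -> float:
    # Simplified: fraction of ground-truth chunks found among the retrieved.
    found = sum(1 for g in ground_truth if g in retrieved)
    return found / len(ground_truth)

# Relevant chunks at ranks 1 and 3 out of four retrieved:
print(round(context_precision([1, 0, 1, 0]), 3))  # 0.833
print(context_recall(["a", "b"], ["a", "c"]))     # 0.5
```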
The Tools Available Without Buying a Platform
Three open-source tools cover the full evaluation lifecycle without requiring a paid platform.
| Tool | Primary Role | Strength | Limitation | License |
|---|---|---|---|---|
| RAGAS | Metric computation and synthetic data generation | Reference-free evaluation, LLM-as-judge | No CI/CD pass/fail gates built in | Apache 2.0 |
| DeepEval | CI/CD quality gates | Native pytest integration, hard thresholds | Requires LLM API for judge calls | Apache 2.0 |
| Arize Phoenix | Production observability and trace capture | Self-hostable, UMAP visualization, zero feature gates | No paid support on free tier | Apache 2.0 |
| Langfuse | Production tracing and session logging | Clean UI, easy self-host, wide framework support | Evaluation metrics require integration | MIT |
For most production teams: use RAGAS for metric exploration and synthetic dataset generation, DeepEval for CI/CD quality gates, and Arize Phoenix or Langfuse for production monitoring. Each tool does one job well. Combining them gives you the full evaluation pipeline without paying for a unified SaaS layer on top.
Step 1: Build the Golden Dataset
The golden dataset is the foundation of the entire evaluation pipeline. Everything else runs against it. Without it, there are no baselines, no regression tests, and no way to measure whether a pipeline change improved or degraded quality.
A production golden dataset contains between 100 and 300 highly diverse, mutually exclusive question-answer pairs. This size provides statistical significance for metric calculations without excessive computational overhead during CI/CD runs. Below 50 questions, the metrics are too noisy to trust at the individual score level.
What Each Record Contains
- Question: A real user query sampled from production logs or constructed to cover domain edge cases
- Expected answer: The correct answer, verified by a domain expert
- Ground truth chunks: The specific document chunks that contain the correct answer
- Category: `factual`, `procedural`, `comparative`, or `edge_case`
- Difficulty tier: `easy`, `medium`, or `hard`
- Source documents: The document names the chunks come from, for traceability
How to Source Questions
Do not generate questions synthetically from the start. Teams that skip human review often miss systemic issues in compliance-heavy use cases such as finance or healthcare. The sequence that works in practice is:
- Export 500 real queries from your production query logs or customer support tickets
- Deduplicate and cluster them by intent using an embedding similarity pass
- Select the 100 to 300 most representative queries across clusters, covering both common cases and known edge cases
- Have a domain expert write and verify the expected answer for each selected query
- Identify the exact document chunks that support each expected answer
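The deduplication step in that sequence can be sketched as a greedy similarity filter. The threshold and the toy vectors below are illustrative; in practice the vectors come from your embedding model:

```python
import math

# Greedy near-duplicate filter over query embeddings: keep a query only if
# its cosine similarity to every already-kept query is below the threshold.
# Vectors here are toy stand-ins for real embedding-model output.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedupe_queries(queries, embeddings, threshold=0.95):
    kept, kept_vecs = [], []
    for q, v in zip(queries, embeddings):
        if all(cosine(v, kv) < threshold for kv in kept_vecs):
            kept.append(q)
            kept_vecs.append(v)
    return kept

queries = ["refund policy?", "how do refunds work?", "migrate v1 to v2"]
vecs = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(dedupe_queries(queries, vecs))  # near-duplicate second query dropped
```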
RAGAS can generate synthetic questions from your document corpus to cover gaps where real query data does not exist yet, using its TestsetGenerator. Use synthetic data to fill coverage gaps, not as the primary dataset source.
# golden_dataset_builder.py
from dataclasses import dataclass
from typing import Literal
@dataclass
class GoldenRecord:
question: str
expected_answer: str
ground_truth_chunks: list[str]
category: Literal["factual", "procedural", "comparative", "edge_case"]
difficulty: Literal["easy", "medium", "hard"]
source_documents: list[str]
last_verified: str
# Version your golden dataset as Python so it lives in version control
# alongside your pipeline code and changes are tracked in git history
GOLDEN_DATASET: list[GoldenRecord] = [
GoldenRecord(
question="What is the refund window for enterprise customers?",
expected_answer="Enterprise customers receive full refunds within 60 days of purchase.",
ground_truth_chunks=[
"Enterprise plan customers are eligible for a full refund within 60 days "
"of the original purchase date, no questions asked."
],
category="factual",
difficulty="easy",
source_documents=["refund_policy_v3.pdf"],
last_verified="2026-05-01"
),
GoldenRecord(
question="How do I migrate from API v1 to v2 without downtime?",
expected_answer=(
"Install the compatibility shim available in v1.9, run both versions in "
"parallel in staging, then cut over when v2 error rate drops below 0.1%."
),
ground_truth_chunks=[
"The v1 to v2 migration guide recommends installing the compatibility shim "
"in version 1.9. Run both API versions simultaneously in staging. When the "
"v2 error rate falls below 0.1% over a 24-hour window, proceed with production cutover."
],
category="procedural",
difficulty="hard",
source_documents=["api_migration_guide_v2.pdf"],
last_verified="2026-05-01"
),
# Add 98 to 298 more records covering your domain's full query distribution
]

Version your golden dataset explicitly. Many teams waste days tracking mysterious regressions that trace back to untracked changes in their evaluation inputs. Store the golden dataset as a Python file in the same repository as your pipeline code. Every change to a question, expected answer, or source document chunk is a git commit with a message explaining why the record changed. This makes it trivial to correlate metric changes with dataset changes.
Step 2: Compute Baseline Metrics With RAGAS
Before wiring evaluation into CI/CD, run RAGAS against the full golden dataset on your current pipeline to establish baseline scores. These baselines become the thresholds for the quality gate in the next step.
RAGAS was created by researchers Shahul Es and Jithin James, published in September 2023, and presented at EACL 2024. Backed by Y Combinator Winter 2024, it processes over 5 million evaluations monthly for companies including AWS, Microsoft, Databricks, and Moody's. It is the standard reference-free evaluation framework for RAG.
# evaluate_baseline.py
# Run this once before wiring into CI/CD.
# The output becomes your quality gate thresholds.
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
from datasets import Dataset
from golden_dataset import GOLDEN_DATASET
from your_rag_pipeline import run_rag # your RAG function
def build_ragas_dataset(golden_records, rag_fn) -> Dataset:
rows = []
for record in golden_records:
# Run the question through your current RAG pipeline
result = rag_fn(record.question)
rows.append({
"question": record.question,
"answer": result["answer"], # what your pipeline generated
"contexts": result["retrieved_chunks"], # what was actually retrieved
"ground_truth": record.expected_answer,
})
return Dataset.from_list(rows)
if __name__ == "__main__":
dataset = build_ragas_dataset(GOLDEN_DATASET, run_rag)
results = evaluate(
dataset=dataset,
metrics=[
faithfulness,
answer_relevancy,
context_precision,
context_recall,
],
)
print(results)
# Save these as your baseline thresholds:
# faithfulness: target >= 0.9
# answer_relevancy: target >= 0.85
# context_precision: target >= 0.8
# context_recall: target >= 0.8

When the baseline run completes, record every score in a baseline_metrics.json file committed to the repository. These numbers are the floor. Any future pipeline change that drops a metric below the baseline fails the quality gate.
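Persisting the baseline can be as simple as dumping the scores to JSON. The scores dict below is a plain mapping with illustrative values; adapt the extraction to however your evaluation run exposes its results:

```python
import json

# Persist baseline scores so the CI/CD gate and the weekly drift check
# can load them later. Values shown are illustrative.
def save_baseline(scores: dict[str, float], path: str = "baseline_metrics.json") -> None:
    with open(path, "w") as f:
        json.dump({k: round(v, 4) for k, v in scores.items()}, f, indent=2)

save_baseline({
    "faithfulness": 0.91,
    "answer_relevancy": 0.87,
    "context_precision": 0.84,
    "context_recall": 0.81,
})
```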
The per-metric diagnostic table below maps low scores to the pipeline component that needs fixing, which is more useful than the score alone:
| Failing Metric | First Place to Look | Likely Root Cause |
|---|---|---|
| Faithfulness below 0.9 | System prompt | Model generating beyond retrieved context |
| Answer relevancy below 0.85 | Chunking strategy | Chunks topically adjacent but not query-specific |
| Context precision below 0.8 | Retrieval and reranking | Wrong chunks surfacing above relevant ones |
| Context recall below 0.8 | Embedding model fit | Relevant chunks not matching query vectors |
| Both precision and recall low | Chunking and indexing | Fundamental indexing problem upstream |
Step 3: Wire the CI/CD Quality Gate With DeepEval
The goal of CI/CD evaluation integration is simple: fail the build when RAG quality drops below your thresholds before a PR gets merged. DeepEval integrates with pytest natively and is the strongest open-source option for this role. RAGAS is better for metric exploration. DeepEval is better for hard pass/fail gates with a testing-framework mindset.
The Test File
# tests/test_rag_regression.py
# This file runs on every PR that touches:
# - pipeline code
# - retrieval configuration
# - prompt templates
# - embedding model settings
# - chunking strategy
import pytest
from deepeval import assert_test
from deepeval.metrics import (
FaithfulnessMetric,
AnswerRelevancyMetric,
ContextualPrecisionMetric,
ContextualRecallMetric,
)
from deepeval.test_case import LLMTestCase
from golden_dataset import GOLDEN_DATASET
from your_rag_pipeline import run_rag
# Define hard thresholds -- any score below these fails the build
FAITHFULNESS_THRESHOLD = 0.9
ANSWER_RELEVANCY_THRESHOLD = 0.85
CONTEXT_PRECISION_THRESHOLD = 0.8
CONTEXT_RECALL_THRESHOLD = 0.8
# Initialize metrics once -- each uses gpt-4o-mini as judge to keep cost low
faithfulness_metric = FaithfulnessMetric(threshold=FAITHFULNESS_THRESHOLD, model="gpt-4o-mini")
answer_relevancy_metric = AnswerRelevancyMetric(threshold=ANSWER_RELEVANCY_THRESHOLD, model="gpt-4o-mini")
context_precision_metric = ContextualPrecisionMetric(threshold=CONTEXT_PRECISION_THRESHOLD, model="gpt-4o-mini")
context_recall_metric = ContextualRecallMetric(threshold=CONTEXT_RECALL_THRESHOLD, model="gpt-4o-mini")
@pytest.mark.parametrize("record", GOLDEN_DATASET, ids=lambda r: r.question[:60])
def test_rag_quality(record):
"""
Run each golden record through the RAG pipeline.
Assert all four metrics pass their thresholds.
A single failure blocks the merge.
"""
result = run_rag(record.question)
test_case = LLMTestCase(
input=record.question,
actual_output=result["answer"],
retrieval_context=result["retrieved_chunks"],
expected_output=record.expected_answer,
)
assert_test(
test_case,
metrics=[
faithfulness_metric,
answer_relevancy_metric,
context_precision_metric,
context_recall_metric,
]
)

The GitHub Actions Workflow
# .github/workflows/rag_quality_gate.yml
name: RAG Quality Gate
on:
pull_request:
paths:
- 'src/pipeline/**'
- 'src/retrieval/**'
- 'src/prompts/**'
- 'config/chunking.yaml'
- 'config/embedding.yaml'
- 'tests/test_rag_regression.py'
- 'golden_dataset.py'
jobs:
rag-evaluation:
runs-on: ubuntu-latest
timeout-minutes: 30
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install dependencies
run: pip install deepeval ragas pytest pytest-asyncio
- name: Run RAG quality gate
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
QDRANT_URL: ${{ secrets.QDRANT_URL }}
run: |
pytest tests/test_rag_regression.py \
--tb=short \
-v \
--timeout=600
- name: Upload evaluation results
if: always()
uses: actions/upload-artifact@v4
with:
name: rag-eval-results
path: .deepeval/

Run the quality gate only on PRs that touch pipeline-relevant paths. A PR that updates a README or changes a CI config should not trigger a 30-minute evaluation run that costs real API money. The paths filter in the workflow above restricts evaluation to changes that could actually affect RAG behavior. At roughly $0.002 per golden record evaluated with gpt-4o-mini as judge, a 200-record golden dataset costs about $0.40 per run, which is acceptable for every pipeline-related PR.
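The cost arithmetic is worth wiring into a helper so it can be re-checked as the dataset grows. This is a back-of-envelope sketch using the per-record judge cost quoted above:

```python
# Back-of-envelope cost of one quality-gate run, using the ~$0.002
# per-record gpt-4o-mini judge cost cited in the text.
def run_cost(n_records: int, cost_per_record: float = 0.002) -> float:
    return n_records * cost_per_record

print(f"${run_cost(200):.2f} per run")  # $0.40 for a 200-record dataset
```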
Step 4: Production Monitoring With Arize Phoenix
The CI/CD gate prevents known regressions from reaching production. It does not catch regressions caused by things the test suite does not cover: query distribution drift, document staleness, embedding model API changes, or edge cases that only appear at production query volume.
Research from Getmaxim shows that 60% of new RAG deployments now include systematic evaluation from day one, up from less than 30% in early 2025. The production monitoring layer catches the failure classes the CI/CD gate cannot.
Instrumenting the Pipeline
# rag_pipeline_with_tracing.py
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from ragas.metrics import faithfulness as faithfulness_metric
from ragas import evaluate
from datasets import Dataset
import threading
# Start Phoenix server (self-hosted, no API key)
# Run: python -m phoenix.server.main
# Phoenix UI available at http://localhost:6006
provider = TracerProvider()
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument() # auto-traces all OpenAI calls
tracer = trace.get_tracer(__name__)
def compute_faithfulness_async(question, answer, retrieved_chunks):
"""
Compute faithfulness in a background thread so it does not block response time.
Log the result to Phoenix for monitoring.
"""
def _compute():
dataset = Dataset.from_list([{
"question": question,
"answer": answer,
"contexts": retrieved_chunks,
"ground_truth": "" # not needed for faithfulness
}])
result = evaluate(dataset, metrics=[faithfulness_metric])
score = result["faithfulness"]
# Log to Phoenix via custom span attribute
with tracer.start_as_current_span("faithfulness_eval") as span:
span.set_attribute("eval.faithfulness", score)
span.set_attribute("eval.question", question[:200])
if score < 0.85:
span.set_attribute("eval.alert", "faithfulness_below_threshold")
thread = threading.Thread(target=_compute, daemon=True)
thread.start()
def run_rag_with_monitoring(question: str) -> dict:
with tracer.start_as_current_span("rag_query") as span:
span.set_attribute("query", question)
# Retrieval step
retrieved_chunks = retrieve(question)
span.set_attribute("retrieval.chunk_count", len(retrieved_chunks))
# Generation step
answer = generate(question, retrieved_chunks)
span.set_attribute("generation.answer_length", len(answer))
# Async faithfulness check -- does not block the response
compute_faithfulness_async(question, answer, retrieved_chunks)
return {"answer": answer, "retrieved_chunks": retrieved_chunks}

Weekly Drift Monitoring
Run a scheduled job every week that re-evaluates the frozen golden dataset against the live production pipeline and compares the scores against the baseline recorded at the last deployment.
# scripts/weekly_drift_check.py
# Run on a schedule: cron every Monday at 08:00 UTC
# Alert when any metric drops more than 0.05 (absolute) from deployment baseline
import json
from ragas import evaluate
from ragas.metrics import (
faithfulness, answer_relevancy,
context_precision, context_recall,
)
from datasets import Dataset
from golden_dataset import GOLDEN_DATASET
from your_rag_pipeline import run_rag
ALERT_THRESHOLD_DELTA = 0.05  # alert if any metric drops more than 0.05 absolute
def load_baseline() -> dict:
with open("baseline_metrics.json") as f:
return json.load(f)
def run_weekly_drift_check():
rows = []
for record in GOLDEN_DATASET:
result = run_rag(record.question)
rows.append({
"question": record.question,
"answer": result["answer"],
"contexts": result["retrieved_chunks"],
"ground_truth": record.expected_answer,
})
dataset = Dataset.from_list(rows)
current = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
baseline = load_baseline()
alerts = []
for metric_name in ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]:
delta = baseline[metric_name] - current[metric_name]
if delta > ALERT_THRESHOLD_DELTA:
alerts.append({
"metric": metric_name,
"baseline": baseline[metric_name],
"current": current[metric_name],
"delta": round(delta, 4),
})
if alerts:
send_alert(alerts) # your alerting function: Slack, PagerDuty, email
print("ALERT: RAG quality drift detected")
for a in alerts:
print(f" {a['metric']}: {a['baseline']:.3f} -> {a['current']:.3f} (delta -{a['delta']})")
else:
print("Drift check passed. All metrics within 0.05 of baseline.")
return current
if __name__ == "__main__":
run_weekly_drift_check()

Step 5: Wire Production Failures Back Into the Test Suite
This is the step most teams skip, and it is the one that makes evaluation compound over time rather than stay static.
Every time a production query produces a low faithfulness score (below 0.85) or receives a negative user signal (thumbs down, explicit complaint, escalation to a human), that query is a candidate for the golden dataset. Production failures represent the real distribution of hard cases that your synthetic or manually curated golden dataset does not yet cover.
# scripts/promote_failure_to_golden.py
# Run this whenever production monitoring flags a low-quality response.
# Requires human review before the record enters the golden dataset.
from golden_dataset import GoldenRecord
import json
from datetime import datetime
def create_golden_candidate_from_failure(
question: str,
low_quality_answer: str,
retrieved_chunks: list[str],
faithfulness_score: float,
failure_reason: str,
) -> dict:
"""
Package a production failure as a golden dataset candidate.
Outputs a dict for human review before adding to the dataset.
"""
return {
"status": "PENDING_HUMAN_REVIEW",
"question": question,
"failed_answer": low_quality_answer,
"faithfulness": faithfulness_score,
"failure_reason": failure_reason,
"retrieved_chunks": retrieved_chunks,
"instructions": (
"1. Write the correct expected_answer for this question.\n"
"2. Identify the ground_truth_chunks from the knowledge base.\n"
"3. Tag the category and difficulty.\n"
"4. Add to GOLDEN_DATASET in golden_dataset.py and commit."
),
"template": {
"question": question,
"expected_answer": "[HUMAN: fill in correct answer]",
"ground_truth_chunks": ["[HUMAN: find the relevant chunks]"],
"category": "[factual|procedural|comparative|edge_case]",
"difficulty": "[easy|medium|hard]",
"source_documents": ["[HUMAN: identify source doc]"],
"last_verified": datetime.utcnow().strftime("%Y-%m-%d"),
},
"flagged_at": datetime.utcnow().isoformat(),
}
# Example usage inside your monitoring pipeline
def handle_low_quality_response(query_trace: dict):
if query_trace["faithfulness"] < 0.85:
candidate = create_golden_candidate_from_failure(
question=query_trace["question"],
low_quality_answer=query_trace["answer"],
retrieved_chunks=query_trace["retrieved_chunks"],
faithfulness_score=query_trace["faithfulness"],
failure_reason="faithfulness_below_threshold",
)
# Write to a review queue (file, Notion, Linear ticket, etc.)
with open("golden_candidates_review_queue.jsonl", "a") as f:
f.write(json.dumps(candidate) + "\n")

The review queue is the bridge between production monitoring and the test suite. A domain expert reviews each candidate, writes the correct expected answer, identifies the ground truth chunks, and adds the record to the golden dataset. The next CI/CD run picks it up automatically.
Over six months, this loop produces a golden dataset that reflects the actual hard cases your system encounters in production, not just the cases you anticipated at build time.
The Three-Layer Architecture
The full evaluation system has three layers that operate at different frequencies and catch different types of failure.
| Layer | When It Runs | What It Catches | Tools |
|---|---|---|---|
| Offline test suite | On every PR, before merge | Regressions from code or config changes | RAGAS for baselines, DeepEval for gates |
| CI/CD quality gate | Blocking PR merge | Any metric drop below threshold | DeepEval + pytest + GitHub Actions |
| Production monitoring | Continuously, drift weekly | Query distribution shift, doc staleness, edge cases | Arize Phoenix or Langfuse |
The layers compound. The offline test suite gives you a stable baseline. The CI/CD gate enforces it. The production monitoring layer catches what the test suite does not cover. And the failure-to-golden pipeline makes the test suite better over time.
Teams that build this infrastructure before launch catch problems in code review. Teams that skip it catch problems in user complaints.
What You Do Not Need to Buy
The three-layer architecture described above is built entirely from open-source tools. Here is a cost comparison between the fully self-hosted approach and representative managed platforms.
| Approach | Setup Cost | Monthly Cost at 10K queries | Vendor Lock-in |
|---|---|---|---|
| RAGAS + DeepEval + Phoenix (self-hosted) | 3 to 5 days engineering | Under $50 (LLM judge API calls only) | None |
| LangSmith (managed) | Half a day | $39 per seat minimum | LangChain ecosystem |
| Maxim AI (managed) | Half a day | Custom pricing, demo required | Platform-specific |
| Braintrust (managed) | Half a day | Usage-based, CI/CD gates included | Platform-specific |
The managed platforms offer faster setup and team collaboration features. The self-hosted approach gives full control over data, no per-seat pricing, and no vendor dependency. For teams with GDPR or HIPAA data constraints that prevent sending queries to a third-party SaaS, the self-hosted approach is not optional.
The engineering investment in the self-hosted stack is roughly three to five days to set up properly. After that, maintenance is low. The golden dataset grows incrementally. The CI/CD workflow runs automatically. The Phoenix dashboard updates continuously.
Common Mistakes That Break Evaluation Pipelines
Teams building RAG evaluation for the first time make the same set of mistakes. Knowing them in advance saves weeks of debugging.
- Using the LLM that powers the RAG pipeline as the evaluation judge. The model will rate its own outputs highly regardless of quality. Use a separate model, ideally one from a different provider, as the judge: OpenAI's gpt-4o-mini to judge a Claude-powered RAG system, or vice versa.
- Never updating the golden dataset after launch. A static golden dataset from launch day does not reflect the query distribution six months later. New products ship. Policies change. New edge cases emerge. The failure-to-golden promotion pipeline exists specifically to prevent this.
- Running evaluation on a subset of the golden dataset to save money. Statistical significance requires consistency. Run the full golden dataset on every CI/CD evaluation or the metric comparison is not meaningful. At $0.40 per full run with gpt-4o-mini as judge, the cost is not the bottleneck.
- Setting thresholds higher than the current baseline. If your current faithfulness is 0.82, setting the CI/CD threshold at 0.9 will block every PR immediately. Set thresholds at or slightly below the current baseline to prevent regressions, then raise them as pipeline improvements move the baseline up.
- Treating all metrics equally regardless of failure mode. A faithfulness drop from 0.91 to 0.87 is a generation layer problem. A context precision drop from 0.83 to 0.72 is a retrieval layer problem. They need different investigations and different fixes. Read the metric diagnostics table before starting a debugging session.
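The threshold-setting advice above can be sketched as a small helper that derives gates from measured baselines rather than aspirational targets. The margin value is illustrative:

```python
# Derive CI/CD thresholds from measured baselines: gate slightly below
# today's score to catch regressions, then ratchet up as the pipeline
# improves. The 0.02 margin is an illustrative choice.
def derive_thresholds(baseline: dict[str, float], margin: float = 0.02) -> dict[str, float]:
    return {metric: round(score - margin, 3) for metric, score in baseline.items()}

baseline = {"faithfulness": 0.82, "answer_relevancy": 0.88}
print(derive_thresholds(baseline))  # {'faithfulness': 0.8, 'answer_relevancy': 0.86}
```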
Where This Fits in the Bigger RAG Picture
Evaluation infrastructure does not replace good pipeline engineering. A well-evaluated bad pipeline is just a bad pipeline you understand better. The evaluation system shows where to invest engineering effort; it is not a substitute for that effort.
If the faithfulness metric is consistently low, the fix is in the system prompt or retrieval validation, as covered in Why RAG Fails. If context precision is low, the fix is in chunking strategy, embedding model selection, or reranking, as covered in RAG Architecture Explained. If context recall is low, the fix is in hybrid search or embedding model domain fit, as covered in How Embeddings Work in RAG.
Evaluation tells you which metric is failing. The RAG engineering work tells you how to fix it. Both are required. Neither is sufficient alone.
The teams that ship reliable RAG systems in 2026 are not doing anything mysterious. They build the golden dataset before the first production deployment. They wire DeepEval into CI/CD before the second sprint. They stand up Phoenix monitoring before the third. Then they run the system, watch the metrics, promote production failures to the golden dataset, and raise the thresholds as the pipeline improves.
That is RAG evaluation as an engineering discipline. Not glamorous. Not optional.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.