Why RAG Fails: Every Failure Mode and How to Fix Each One (2026)
RAG fails at retrieval 73% of the time, not generation. This guide covers every production failure mode — chunking artifacts, vocabulary mismatch, lost in the middle, missing reranking, stale indexes, and no evaluation layer — with specific fixes for each, backed by 2025 and 2026 production data.
Here is the thing nobody explains clearly: a RAG system can produce wrong answers through at least seven distinct mechanisms, and only two of them live in the generation layer. The other five happen before the LLM ever sees a single character of input.
The demo worked because the documents were clean, the questions were predictable, and the team was asking questions they already knew the answers to. Production is different. Real users ask vague questions. Documents are messy. The knowledge base grows and gets stale. And nobody is watching the retrieval step because everyone is focused on the model's output.
Industry analysis in 2026 consistently shows that when RAG fails, the failure point is retrieval 73% of the time, not generation. This article covers every failure mode with its specific diagnosis and fix.
The Failure Map
Before going deep on each problem, here is the full map. Each failure mode lives at a specific layer of the pipeline.
| Failure Mode | Where It Lives | Symptom |
|---|---|---|
| Fixed-size chunking artifacts | Indexing | Retrieved chunks miss the actual answer by a paragraph |
| Vocabulary mismatch | Retrieval | Semantically correct question, wrong documents returned |
| Missing reranking | Retrieval | Right chunk retrieved at position 5 — model ignores it |
| Lost in the middle | Context assembly | Correct chunk in context, model still gets it wrong |
| Stale index | Indexing | Model answers confidently from outdated documents |
| No context constraint | Generation | Model answers from training data when retrieval fails |
| No evaluation | All layers | Failures accumulate silently for weeks |
Failure 1: Fixed-Size Chunking Artifacts
This is the most common root cause and the hardest to notice until you look directly at retrieved chunks.
Fixed-size chunking splits documents at a character or token limit. It does not care what is at that boundary. A sentence gets cut in half. A table gets separated from its header row. A numbered list gets split between item 3 and item 4. The chunk about rate limiting contains the explanation but not the actual numbers, because those are in a table two paragraphs away from the text that was retrieved.
The retrieved chunk scores well on cosine similarity because it is topically related. The model reads it and cannot produce the specific answer. It either hedges or fabricates a plausible-sounding number.
The fix: Stop chunking by character count. Use semantic chunking that detects topic boundaries by measuring cosine similarity between consecutive sentences. When similarity drops below a threshold, that drop marks a natural split point. Semantic chunking is covered in more detail in RAG Architecture Explained.
For tables specifically, repeat the header row in every table chunk. Without headers, retrieved rows become ambiguous facts with no column context. With headers, they become usable evidence. For code, chunk at function or class level using AST parsing rather than character counts — a function boundary is a semantic boundary.
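As a minimal sketch of the header-repetition idea for markdown-style tables (the rows-per-chunk count and helper name are illustrative choices, not fixed rules):
def chunk_markdown_table(table: str, rows_per_chunk: int = 10) -> list[str]:
    """Split a markdown table into chunks, repeating the header in each one."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header, separator, rows = lines[0], lines[1], lines[2:]
    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        body = rows[i:i + rows_per_chunk]
        # Every chunk carries the header and separator row, so each row keeps its column context
        chunks.append("\n".join([header, separator] + body))
    return chunks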
Hierarchical chunking solves the precision-context tradeoff that forces a choice between small chunks (precise retrieval, missing context) and large chunks (complete context, imprecise retrieval). Store small chunks for retrieval precision. When a small chunk is retrieved, return its larger parent chunk to the LLM. Retrieve at paragraph granularity, generate with section-level context.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Parent splitter: larger chunks for context
parent_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=97  # higher percentile = fewer splits = larger chunks
)

# Child splitter: smaller chunks for retrieval precision
child_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90  # lower percentile = more splits = smaller chunks
)

def build_parent_child_chunks(text: str) -> list[dict]:
    parent_docs = parent_splitter.create_documents([text])
    all_chunks = []
    for parent_idx, parent in enumerate(parent_docs):
        # Split each parent into child chunks for retrieval
        child_docs = child_splitter.create_documents([parent.page_content])
        for child in child_docs:
            all_chunks.append({
                "child_text": child.page_content,    # indexed in the vector DB
                "parent_text": parent.page_content,  # returned to the LLM
                "parent_id": parent_idx
            })
    return all_chunks

Failure 2: Vocabulary Mismatch
The user asks about a "money back guarantee." Your documentation says "refund policy." Dense vector search finds chunks that are semantically close to "money back guarantee" — but if the embedding model did not place those two phrases close together in vector space, the refund policy document ranks low. A document about payment methods might rank higher because "payment" is closer to "money" than "policy" is.
This is not a failure of the embedding model. It is a fundamental property of how dense retrieval works. Embedding models compress meaning into vectors. Exact term matching is a separate problem from semantic similarity.
The same applies to error codes, product model numbers, person names, version strings, and any domain-specific terminology that the embedding model tokenizes differently from how users phrase it.
The fix: Implement hybrid search combining dense vector similarity with BM25 keyword matching, merged through Reciprocal Rank Fusion. For how to implement this on Qdrant and Weaviate, see RAG Architecture Explained. Do not treat vocabulary mismatch as an embedding model selection problem — no embedding model fixes this. It requires a structural change to the retrieval layer.
| Retrieval Type | Strengths | Weaknesses |
|---|---|---|
| Dense vector only | Semantic concepts, paraphrases, intent | Exact terms, product codes, proper nouns |
| BM25 only | Exact matches, rare terms, identifiers | Synonyms, conceptual similarity |
| Hybrid (dense + BM25) | Both | Slightly more complex to implement |
Benchmarks show hybrid search delivers roughly 17% recall improvement over pure vector search in production pipelines. For most production RAG systems, this is not optional.
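The merge step itself is small. As a minimal sketch of Reciprocal Rank Fusion over two ranked result lists (document IDs as inputs and the conventional k=60 constant are assumptions):
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) across every list it appears in.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse the dense-vector and BM25 result lists before reranking
# fused = reciprocal_rank_fusion([dense_results, bm25_results])
Because RRF works on rank positions only, the dense similarity scores and BM25 scores never need to be normalized against each other.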
Failure 3: Missing Reranking
Retrieval returns the top-k chunks ranked by cosine similarity. Cosine similarity is a good proxy for semantic relatedness. It is not the same as relevance to this specific question.
A chunk about "refund timelines for enterprise contracts" and a chunk about "refund policy overview" both have high cosine similarity to a query about enterprise refunds. The overview ranks higher because it uses more of the query's words. The enterprise-specific chunk is more relevant but ranks lower. The model generates an answer from the overview, misses the enterprise-specific detail, and the answer is technically correct but wrong for this user.
The fix: Add a cross-encoder reranker after retrieval. Retrieve 20 to 50 candidates. Run a reranker that scores each candidate jointly against the query — cross-encoders read query and chunk together rather than embedding them independently, which gives far more accurate relevance scores. Return only the top 3 to 5 after reranking.
Cohere Rerank v3.5 and Voyage AI rerank-2.5 are the leading managed options. BGE-Reranker from Hugging Face is the self-hosted choice.
import cohere

co = cohere.Client("your-cohere-api-key")

def retrieve_and_rerank(
    query: str,
    vector_db_results: list[str],
    top_n: int = 5
) -> list[dict]:
    # Rerank the candidates from vector retrieval
    reranked = co.rerank(
        query=query,
        documents=vector_db_results,
        model="rerank-v3.5",
        top_n=top_n
    )
    return [
        {
            "text": vector_db_results[r.index],
            "relevance_score": r.relevance_score
        }
        for r in reranked.results
    ]

Reranking adds roughly 50ms latency per query and costs between $0.001 and $0.01 per query depending on the number of candidates. In every production RAG system where retrieval quality is the product rather than an internal tool, the quality improvement justifies this cost without exception.
Failure 4: Lost in the Middle
This one is counterintuitive because it feels like a retrieval success. The right chunk was retrieved. It is in the context window. The model still misses it.
LLMs have lower recall for information placed in the middle of long contexts. Research on transformer attention patterns consistently shows this effect. Information at the beginning and end of the context receives more attention than information in the middle. If you pass 8 retrieved chunks to the model and the most relevant one lands at position 4 or 5, the model is statistically likely to under-attend to it and give a partial or incorrect answer.
You can have perfect embeddings and still get poor answers if the correct chunk is placed in the middle of a long prompt. Embeddings decide what gets retrieved. They do not control what gets attended to.
The fix: Keep fewer chunks in the context. Three to five high-quality chunks produce better answers than eight chunks of mixed relevance. After reranking, place the highest-scoring chunk first in the assembled context — not in the middle. If you need to pass multiple chunks, use the "sandwich" strategy: most relevant first, supporting chunks in the middle, second-most-relevant last.
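As a minimal sketch, the sandwich reorder could look like this, assuming the chunk list is already sorted by reranker score, highest first:
def sandwich_order(chunks: list[dict]) -> list[dict]:
    """Place the two strongest chunks at the edges of the context window.

    Assumes `chunks` is sorted by relevance, highest first.
    """
    if len(chunks) < 3:
        return chunks
    # Most relevant first, supporting chunks in the middle, second-most-relevant last
    return [chunks[0]] + chunks[2:] + [chunks[1]]
The assembly helper below uses the simpler ordering: highest-scoring chunk first, capped at five chunks, each labeled with its source index and relevance score.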
def assemble_context(reranked_chunks: list[dict]) -> str:
    """
    Anti-lost-in-middle context assembly.
    Most relevant chunk goes first.
    Keep to top 5 maximum.
    """
    # Already sorted by relevance from reranker, highest first
    top_chunks = reranked_chunks[:5]
    context_parts = []
    for idx, chunk in enumerate(top_chunks):
        context_parts.append(
            f"[Source {idx + 1} | Relevance: {chunk['relevance_score']:.3f}]\n"
            f"{chunk['text']}"
        )
    return "\n\n---\n\n".join(context_parts)

Failure 5: Stale Index
A knowledge base that does not update automatically will drift from reality. Documents change. Policies get updated. Features get deprecated. Prices change. The vector index does not know any of this unless you tell it.
The stale index failure produces a particularly bad outcome: the model answers confidently from a document that was accurate six months ago. The user has no indication the answer is outdated. They act on it.
RAG changes what the model can see right now. A document update that costs zero dollars in a RAG system costs between $500 and $5,000 with fine-tuning. But that zero-dollar advantage only holds if you actually update the document in the index.
The fix: Build an update pipeline before you go to production, not after you discover the problem. For most knowledge bases, a daily re-ingestion job that checks document hashes is sufficient. When a document hash changes, delete the old chunks for that document and reindex the new version. For real-time data sources, trigger reindexing via webhooks from the source system on every document change.
import hashlib
import datetime

def compute_document_hash(content: str) -> str:
    return hashlib.sha256(content.encode()).hexdigest()

def sync_document_to_index(
    doc_id: str,
    new_content: str,
    vector_db_client,
    embedding_fn,
    chunker
) -> dict:
    """
    Hash-based sync: only reindex if the document actually changed.
    Returns sync status for monitoring.
    """
    new_hash = compute_document_hash(new_content)
    # Check current hash stored in metadata
    existing = vector_db_client.get_document_metadata(doc_id)
    if existing and existing.get("hash") == new_hash:
        return {"doc_id": doc_id, "status": "unchanged", "action": "skipped"}
    # Document changed: delete old chunks and reindex
    vector_db_client.delete_by_filter({"doc_id": doc_id})
    new_chunks = chunker.split(new_content)
    new_embeddings = embedding_fn(new_chunks)
    vector_db_client.upsert_chunks(
        chunks=new_chunks,
        embeddings=new_embeddings,
        metadata={
            "doc_id": doc_id,
            "hash": new_hash,
            "indexed_at": datetime.datetime.utcnow().isoformat()
        }
    )
    return {"doc_id": doc_id, "status": "updated", "chunks": len(new_chunks)}

Failure 6: Missing Context Constraint
This failure happens at the generation layer, not retrieval. The system prompt does not tell the model to stay within its retrieved context. When retrieval returns the wrong chunks — or when retrieval is silently empty — the model falls back to training data and answers confidently from memory.
The answer can be wrong in two ways. It can be factually incorrect because the model's training data is outdated. Or it can be factually correct according to training data but incorrect for this specific organization, which has its own policies, processes, and configurations that differ from the general case.
Hallucination is a policy failure, not a model failure. If the model is not instructed to stay within retrieved context, it will not.
The fix: The system prompt must explicitly instruct the model to use only the provided context and to say so when the context is insufficient. Add a retrieval validation check that catches empty or low-confidence retrieval before passing anything to the model.
SYSTEM_PROMPT = """You are a helpful assistant answering questions about our product.

CRITICAL INSTRUCTIONS:
1. Answer ONLY using the context documents provided below.
2. If the context does not contain enough information to answer the question,
   say: "I don't have that information in the provided documentation."
3. Do not use your general knowledge or training data to fill gaps.
4. When you use information from a source, reference it as [Source N].
5. Never speculate or infer beyond what the documents explicitly state.
"""

def validate_retrieval(
    chunks: list[dict],
    min_relevance_score: float = 0.4,
    min_chunks: int = 1
) -> tuple[bool, str]:
    """
    Check retrieval quality before passing to LLM.
    Returns (is_valid, reason).
    """
    if not chunks:
        return False, "No chunks retrieved from knowledge base."
    high_confidence = [c for c in chunks if c["relevance_score"] >= min_relevance_score]
    if len(high_confidence) < min_chunks:
        return False, (
            f"Retrieved {len(chunks)} chunks but only {len(high_confidence)} exceeded "
            f"the minimum relevance threshold of {min_relevance_score}."
        )
    return True, "Retrieval passed validation."

def generate_with_guardrails(
    query: str,
    reranked_chunks: list[dict],
    llm_client
) -> str:
    is_valid, reason = validate_retrieval(reranked_chunks)
    if not is_valid:
        # Do not call the LLM when retrieval fails
        return (
            "I don't have relevant information in the documentation to answer that. "
            f"(Reason: {reason})"
        )
    context = assemble_context(reranked_chunks)
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Context documents:\n\n{context}\n\nQuestion: {query}"
            }
        ]
    )
    return response.choices[0].message.content

Most RAG implementations do not fail at the model layer. They fail earlier, when systems proceed without validating whether retrieved information is sufficient. The validation step above is the structural fix.
Failure 7: No Evaluation Layer
This is the failure that enables all the others to persist. Teams ship RAG systems with no automated quality measurement. Failures accumulate silently. Nobody knows the system has a problem until a user reports it or screenshots a bad answer.
Most of the RAG systems we inherited had no evaluation framework at all. This is the one that hurts most to admit. Without evaluation, there is no signal for which failure mode is active, how severe it is, or whether a fix actually helped.
The fix: Run RAGAS against a golden test set before every deployment. RAGAS measures four metrics that map directly to the failure modes above.
| RAGAS Metric | What It Measures | Failure Mode It Catches |
|---|---|---|
| Faithfulness | Whether answer claims are supported by retrieved context | Failure 6: missing context constraint |
| Answer relevancy | Whether answer addresses the actual question | Failure 3: reranking miss |
| Context precision | Whether retrieved chunks are actually relevant | Failure 2: vocabulary mismatch |
| Context recall | Whether the right chunks were retrieved at all | Failure 1: chunking, Failure 5: stale index |
Production targets from 2026 deployment data at MarsDevs: faithfulness above 0.9, answer relevancy above 0.85, context precision above 0.8.
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Build a golden test set of 50-100 representative questions
# before shipping anything to production
golden_set = {
    "question": [
        "What is the enterprise refund policy?",
        "How many API calls does the Pro plan include?",
        "Was the v1 API deprecated?",
    ],
    "contexts": [
        ["Enterprise customers receive full refunds within 60 days..."],
        ["Pro plan includes 1 million API calls per month..."],
        ["The v1 API was deprecated in March 2026. Migrate to v2..."],
    ],
    "answer": [
        "Enterprise customers get full refunds within 60 days of purchase.",
        "The Pro plan includes 1 million API calls per month.",
        "Yes, the v1 API was deprecated in March 2026.",
    ],
    "ground_truth": [
        "Enterprise customers receive full refunds within 60 days.",
        "Pro plan allows 1 million API calls per month.",
        "The v1 API was deprecated in March 2026 and users must migrate to v2.",
    ]
}

dataset = Dataset.from_dict(golden_set)
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

# Gate deployment on minimum thresholds
assert results["faithfulness"] >= 0.9, "Faithfulness below threshold — check context constraint"
assert results["answer_relevancy"] >= 0.85, "Relevancy below threshold — check reranking"
assert results["context_precision"] >= 0.8, "Precision below threshold — check retrieval"

Pair RAGAS with distributed tracing through Arize Phoenix or Langfuse so you can trace every query's path through the pipeline and identify which step produced the wrong result.
Diagnosing Which Failure Mode You Have
When a RAG system produces a wrong answer, the debugging process is straightforward if you inspect each layer directly.
| What to Check | What You Find | Failure Mode |
|---|---|---|
| Retrieved chunks for the failing query | Correct answer is absent from all chunks | Chunking (F1) or vocabulary mismatch (F2) |
| Rank position of correct chunk | Correct chunk is retrieved but ranked 4th or lower | Missing reranking (F3) |
| Position of correct chunk in assembled context | Correct chunk is in middle of long context | Lost in the middle (F4) |
| Document last-indexed timestamp | Answer was correct historically but wrong now | Stale index (F5) |
| Model output when correct context is provided manually | Model still uses training data, ignores context | Missing context constraint (F6) |
| RAGAS scores over time | Scores degrading without code changes | All of the above, silently |
The debugging sequence is always: check retrieved chunks first, then chunk rank, then context assembly, then the generation layer. Most teams go straight to the generation layer and waste time there.
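A minimal sketch of that sequence as a debugging helper; the inputs (a snippet of the known correct answer, the retrieved chunks with scores, and the assembled context string) are assumptions about what is on hand when a single query fails, and the thresholds are illustrative:
def diagnose_failure(gold_snippet: str, retrieved: list[dict], context: str) -> str:
    """Walk the debugging sequence for one failing query and suggest a failure mode."""
    # 1. Is the answer present in any retrieved chunk at all?
    hits = [i for i, c in enumerate(retrieved) if gold_snippet.lower() in c["text"].lower()]
    if not hits:
        return "Answer absent from retrieved chunks: suspect chunking (F1) or vocabulary mismatch (F2)."
    # 2. Is the correct chunk ranked 4th or lower?
    if min(hits) >= 3:
        return f"Correct chunk ranked at position {min(hits) + 1}: suspect missing reranking (F3)."
    # 3. Is the correct chunk missing from, or buried in the middle of, the assembled context?
    pos = context.lower().find(gold_snippet.lower())
    if pos == -1:
        return "Correct chunk retrieved but dropped during assembly: check the chunk cap (F4)."
    if pos > len(context) * 0.3 and len(context) > 8000:
        return "Correct chunk sits mid-context in a long prompt: suspect lost in the middle (F4)."
    # 4. Retrieval and assembly look healthy
    return "Retrieval looks healthy: suspect a stale source (F5) or missing context constraint (F6)."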
The Prevention Stack
The seven failure modes above are not independent. They compound. Bad chunking means vocabulary mismatch is more likely to miss the right documents. Missing reranking means the lost-in-the-middle effect hits harder because the wrong chunks rank high. No evaluation means none of this gets caught.
The prevention stack that addresses all seven:
| Layer | What to Add | Failure Modes Prevented |
|---|---|---|
| Indexing | Semantic chunking + hierarchical parent-child | F1 |
| Indexing | Document hash sync pipeline | F5 |
| Retrieval | Hybrid search: dense + BM25 + RRF | F2 |
| Retrieval | Cross-encoder reranker, top 5 only | F3 |
| Context | Max 5 chunks, highest ranked first | F4 |
| Generation | Context constraint in system prompt + retrieval validation | F6 |
| Evaluation | RAGAS on golden set before every deployment | F7 |
None of these are exotic. Each one is well-understood and has mature tooling. The teams that ship reliable RAG systems are not doing anything mysterious — they are systematically applying this stack and measuring each layer independently.
Where to Go From Here
If you are still deciding whether to build RAG at all, the first question to answer is whether you have a knowledge problem or a behavior problem. That decision is covered in RAG vs Fine-Tuning.
If you are building RAG and want to understand the full architecture before writing code, RAG Architecture Explained covers every layer including chunking, hybrid search, and reranking in detail. The storage layer — which vector database to use and how HNSW indexing works — is in Vector Database in RAG.
For the foundational explanation of how the whole system fits together, start with What Is RAG in AI. For how embeddings produce the vectors that make semantic retrieval possible, and why the embedding model choice determines the ceiling for retrieval quality, the next article in this series is How Embeddings Work in RAG.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.