Tech · 17 min read · 3,355 words

Why RAG Fails: Every Failure Mode and How to Fix Each One (2026)

RAG fails at retrieval 73% of the time, not generation. This guide covers every production failure mode — chunking artifacts, vocabulary mismatch, lost in the middle, missing reranking, stale indexes, and no evaluation layer — with specific fixes for each, backed by 2025 and 2026 production data.

Krunal Kanojiya

May 07, 2026
#rag #rag-failure #retrieval-augmented-generation #chunking #hallucination #vector-search #reranking #ragas #llm #ai

Here is the thing nobody explains clearly: a RAG system can produce wrong answers through at least seven distinct mechanisms, and only two of them live in the generation layer. The other five happen before the LLM ever sees a single character of input.

Most RAG systems struggle when they move from prototype to production. The interesting part is that the problem usually is not the language model. It is the retrieval architecture.

The demo worked because the documents were clean, the questions were predictable, and the team was asking questions they already knew the answers to. Production is different. Real users ask vague questions. Documents are messy. The knowledge base grows and gets stale. And nobody is watching the retrieval step because everyone is focused on the model's output.

Industry analysis in 2026 consistently shows that when RAG fails, the failure point is retrieval 73% of the time, not generation. This article covers every failure mode with its specific diagnosis and fix.

The Failure Map

Before going deep on each problem, here is the full map. Each failure mode lives at a specific layer of the pipeline.

| Failure Mode | Where It Lives | Symptom |
| --- | --- | --- |
| Fixed-size chunking artifacts | Indexing | Retrieved chunks miss the actual answer by a paragraph |
| Vocabulary mismatch | Retrieval | Semantically correct question, wrong documents returned |
| Missing reranking | Retrieval | Right chunk retrieved at position 5 — model ignores it |
| Lost in the middle | Context assembly | Correct chunk in context, model still gets it wrong |
| Stale index | Indexing | Model answers confidently from outdated documents |
| No context constraint | Generation | Model answers from training data when retrieval fails |
| No evaluation | All layers | Failures accumulate silently for weeks |

A production RAG system is not an AI feature. It is a knowledge access system that happens to use an LLM. Most failures are predictable, repeatable, and preventable.

Failure 1: Fixed-Size Chunking Artifacts

This is the most common root cause and the hardest to notice until you look directly at retrieved chunks.

Fixed-size chunking splits documents at a character or token limit. It does not care what is at that boundary. A sentence gets cut in half. A table gets separated from its header row. A numbered list gets split between item 3 and item 4. The chunk about rate limiting contains the explanation but not the actual numbers, because those are in a table two paragraphs away from the text that was retrieved.

A query about an API rate limiting policy surfaces a sentence from the rate limiting section, but the retrieved window is three sentences from the middle of a twelve-step configuration process. The model receives context that is technically about rate limiting but is missing the actual numbers.

The retrieved chunk scores well on cosine similarity because it is topically related. The model reads it and cannot produce the specific answer. It either hedges or fabricates a plausible-sounding number.
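To see the artifact concretely, here is a minimal, self-contained sketch of naive fixed-size chunking. The document text and the 80-character limit are invented for illustration:

```python
def fixed_size_split(text: str, chunk_size: int = 80) -> list[str]:
    """Naive fixed-size chunking: cut every chunk_size characters,
    with no regard for sentence or table boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

doc = (
    "Rate limiting applies to all API tiers. The limits are shown below. "
    "Tier | Requests/min\n"
    "Free | 60\n"
    "Pro  | 600"
)

chunks = fixed_size_split(doc)
# The first chunk contains the explanation and a severed header row; the
# actual numbers land in the second chunk, separated from the explanatory
# text that retrieval matches on.
```

A query about "rate limits" embeds close to the first chunk, which has no numbers; the chunk that does have the numbers carries none of the vocabulary the query matches on.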

The fix: Stop chunking by character count. Use semantic chunking that detects topic boundaries by measuring cosine similarity between consecutive sentences. When similarity drops below a threshold, that is a natural split. Semantic chunking is covered in more depth in RAG Architecture Explained.

For tables specifically, repeat the header row in every table chunk. Without headers, retrieved rows become ambiguous facts with no column context. With headers, they become usable evidence. For code, chunk at function or class level using AST parsing rather than character counts — a function boundary is a semantic boundary.
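A minimal sketch of the header-repetition idea, assuming a simple pipe-delimited table (the plan data is invented for illustration):

```python
def chunk_table_with_headers(table_text: str, rows_per_chunk: int = 2) -> list[str]:
    """Split a pipe-delimited table into row groups, repeating the header
    row in every chunk so each chunk stays self-describing."""
    lines = table_text.strip().splitlines()
    header, rows = lines[0], lines[1:]
    return [
        "\n".join([header] + rows[i:i + rows_per_chunk])
        for i in range(0, len(rows), rows_per_chunk)
    ]

table = """Plan | Requests/min | Burst
Free | 60 | 120
Pro | 600 | 1200
Enterprise | 6000 | 12000"""

for chunk in chunk_table_with_headers(table):
    print(chunk)
```

Every chunk now starts with `Plan | Requests/min | Burst`, so a retrieved row like `Enterprise | 6000 | 12000` arrives with its column meanings attached.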

Hierarchical chunking solves the precision-context tradeoff that forces a choice between small chunks (precise retrieval, missing context) and large chunks (complete context, imprecise retrieval). Store small chunks for retrieval precision. When a small chunk is retrieved, return its larger parent chunk to the LLM. Retrieve at paragraph granularity, generate with section-level context.

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Parent splitter: higher percentile threshold = fewer breakpoints = larger chunks
parent_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=97
)

# Child splitter: lower percentile threshold = more breakpoints = smaller chunks
child_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90
)

def build_parent_child_chunks(text: str) -> list[dict]:
    parent_docs = parent_splitter.create_documents([text])
    all_chunks = []

    for parent_idx, parent in enumerate(parent_docs):
        # Split each parent into child chunks for retrieval precision
        child_docs = child_splitter.create_documents([parent.page_content])

        for child in child_docs:
            all_chunks.append({
                "child_text": child.page_content,      # indexed in vector DB
                "parent_text": parent.page_content,    # returned to LLM
                "parent_id": parent_idx
            })

    return all_chunks
```

Failure 2: Vocabulary Mismatch

The user asks about a "money back guarantee." Your documentation says "refund policy." Dense vector search finds chunks that are semantically close to "money back guarantee" — but if the embedding model did not place those two phrases close together in vector space, the refund policy document ranks low. A document about payment methods might rank higher because "payment" is closer to "money" than "policy" is.

This is not a failure of the embedding model. It is a fundamental property of how dense retrieval works. Embedding models compress meaning into vectors. Exact term matching is a separate problem from semantic similarity.

A law firm notices its vector search returns semantically similar cases but misses cases containing a specific statute number. The fix is hybrid search — add a BM25 index so exact-match keyword searches run in parallel with the dense vector search, then fuse the results.

The same applies to error codes, product model numbers, person names, version strings, and any domain-specific terminology that the embedding model tokenizes differently from how users phrase it.

The fix: Implement hybrid search combining dense vector similarity with BM25 keyword matching, merged through Reciprocal Rank Fusion. For how to implement this on Qdrant and Weaviate, see RAG Architecture Explained. Do not treat vocabulary mismatch as an embedding model selection problem — no embedding model fixes this. It requires a structural change to the retrieval layer.

| Retrieval Type | Strengths | Weaknesses |
| --- | --- | --- |
| Dense vector only | Semantic concepts, paraphrases, intent | Exact terms, product codes, proper nouns |
| BM25 only | Exact matches, rare terms, identifiers | Synonyms, conceptual similarity |
| Hybrid (dense + BM25) | Both | Slightly more complex to implement |

Benchmarks show hybrid search delivers roughly 17% recall improvement over pure vector search in production pipelines. For most production RAG systems, this is not optional.
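Reciprocal Rank Fusion itself is only a few lines. Here is a minimal sketch that fuses two ranked lists of document IDs; the IDs are invented, and `k=60` is a commonly used smoothing constant:

```python
def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse multiple ranked lists of doc IDs. Each doc scores
    sum(1 / (k + rank)) across every list it appears in."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical results: dense search ranks the refund doc last,
# but BM25 finds it via the exact phrase "refund policy"
dense_results = ["doc_payments", "doc_billing", "doc_refunds"]
bm25_results = ["doc_refunds"]

fused = reciprocal_rank_fusion([dense_results, bm25_results])
```

Because `doc_refunds` earns a score from both lists, it rises to the top of the fused ranking even though dense search alone ranked it last.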

Failure 3: Missing Reranking

Retrieval returns the top-k chunks ranked by cosine similarity. Cosine similarity is a good proxy for semantic relatedness. It is not the same as relevance to this specific question.

Without a reranker, the top-k retrieved chunks are sorted by embedding similarity, which does not account for query-specific intent, negation, or specificity. We consistently saw the most relevant chunk sitting at position 4 or 5 in the retrieval output, behind noisier but closer matches.

A chunk about "refund timelines for enterprise contracts" and a chunk about "refund policy overview" both have high cosine similarity to a query about enterprise refunds. The overview ranks higher because it uses more of the query's words. The enterprise-specific chunk is more relevant but ranks lower. The model generates an answer from the overview, misses the enterprise-specific detail, and the answer is technically correct but wrong for this user.

The fix: Add a cross-encoder reranker after retrieval. Retrieve 20 to 50 candidates. Run a reranker that scores each candidate jointly against the query — cross-encoders read query and chunk together rather than embedding them independently, which gives far more accurate relevance scores. Return only the top 3 to 5 after reranking.

Cohere Rerank v3.5 and Voyage AI rerank-2.5 are the leading managed options. BGE-Reranker from Hugging Face is the self-hosted choice.

```python
import cohere

co = cohere.Client("your-cohere-api-key")

def retrieve_and_rerank(
    query: str,
    vector_db_results: list[str],
    top_n: int = 5
) -> list[dict]:

    # Rerank the candidates from vector retrieval
    reranked = co.rerank(
        query=query,
        documents=vector_db_results,
        model="rerank-v3.5",
        top_n=top_n
    )

    return [
        {
            "text": vector_db_results[r.index],
            "relevance_score": r.relevance_score
        }
        for r in reranked.results
    ]
```

Reranking adds roughly 50ms latency per query and costs between $0.001 and $0.01 per query depending on the number of candidates. In every production RAG system where retrieval quality is the product rather than an internal tool, the quality improvement justifies this cost without exception.

Failure 4: Lost in the Middle

This one is counterintuitive because it feels like a retrieval success. The right chunk was retrieved. It is in the context window. The model still misses it.

LLMs have lower recall for information placed in the middle of long contexts. Research on transformer attention patterns consistently shows this effect. Information at the beginning and end of the context receives more attention than information in the middle. If you pass 8 retrieved chunks to the model and the most relevant one lands at position 4 or 5, the model is statistically likely to under-attend to it and give a partial or incorrect answer.

You can have perfect embeddings and still get poor answers if the correct chunk is placed in the middle of a long prompt. Embeddings decide what gets retrieved. They do not control what gets attended to.

The fix: Keep fewer chunks in the context. Three to five high-quality chunks produce better answers than eight chunks of mixed relevance. After reranking, place the highest-scoring chunk first in the assembled context — not in the middle. If you need to pass multiple chunks, use the "sandwich" strategy: most relevant first, supporting chunks in the middle, second-most-relevant last.

```python
def assemble_context(reranked_chunks: list[dict]) -> str:
    """
    Anti-lost-in-middle context assembly ("sandwich" ordering):
    most relevant chunk first, second-most-relevant last,
    supporting chunks in the middle. Keep to top 5 maximum.
    """
    # Already sorted by relevance from reranker, highest first
    top = reranked_chunks[:5]

    # Sandwich: positions at the edges get the most attention
    if len(top) > 2:
        ordered = [top[0]] + top[2:] + [top[1]]
    else:
        ordered = top

    context_parts = []
    for idx, chunk in enumerate(ordered):
        context_parts.append(
            f"[Source {idx + 1} | Relevance: {chunk['relevance_score']:.3f}]\n"
            f"{chunk['text']}"
        )

    return "\n\n---\n\n".join(context_parts)
```

Failure 5: Stale Index

A knowledge base that does not update automatically will drift from reality. Documents change. Policies get updated. Features get deprecated. Prices change. The vector index does not know any of this unless you tell it.

The stale index failure produces a particularly bad outcome: the model answers confidently from a document that was accurate six months ago. The user has no indication the answer is outdated. They act on it.

RAG changes what the model can see right now. A document update that costs zero dollars in a RAG system costs between $500 and $5,000 with fine-tuning. But that zero-dollar advantage only holds if you actually update the document in the index.

The fix: Build an update pipeline before you go to production, not after you discover the problem. For most knowledge bases, a daily re-ingestion job that checks document hashes is sufficient. When a document hash changes, delete the old chunks for that document and reindex the new version. For real-time data sources, trigger reindexing via webhooks from the source system on every document change.

```python
import hashlib
from datetime import datetime, timezone

def compute_document_hash(content: str) -> str:
    return hashlib.sha256(content.encode()).hexdigest()

def sync_document_to_index(
    doc_id: str,
    new_content: str,
    vector_db_client,
    embedding_fn,
    chunker
) -> dict:
    """
    Hash-based sync: only reindex if the document actually changed.
    Returns sync status for monitoring.
    """
    new_hash = compute_document_hash(new_content)

    # Check current hash stored in metadata
    existing = vector_db_client.get_document_metadata(doc_id)

    if existing and existing.get("hash") == new_hash:
        return {"doc_id": doc_id, "status": "unchanged", "action": "skipped"}

    # Document changed: delete old chunks and reindex
    vector_db_client.delete_by_filter({"doc_id": doc_id})

    new_chunks = chunker.split(new_content)
    new_embeddings = embedding_fn(new_chunks)

    vector_db_client.upsert_chunks(
        chunks=new_chunks,
        embeddings=new_embeddings,
        metadata={
            "doc_id": doc_id,
            "hash": new_hash,
            "indexed_at": datetime.now(timezone.utc).isoformat()
        }
    )

    return {"doc_id": doc_id, "status": "updated", "chunks": len(new_chunks)}
```

Failure 6: Missing Context Constraint

This failure happens at the generation layer, not retrieval. The system prompt does not tell the model to stay within its retrieved context. When retrieval returns the wrong chunks — or when retrieval is silently empty — the model falls back to training data and answers confidently from memory.

The answer can be wrong in two ways. It can be factually incorrect because the model's training data is outdated. Or it can be factually correct according to training data but incorrect for this specific organization, which has its own policies, processes, and configurations that differ from the general case.

Hallucination is a policy failure, not a model failure. If the model is not instructed to stay within retrieved context, it will not.

The fix: The system prompt must explicitly instruct the model to use only the provided context and to say so when the context is insufficient. Add a retrieval validation check that catches empty or low-confidence retrieval before passing anything to the model.

```python
SYSTEM_PROMPT = """You are a helpful assistant answering questions about our product.

CRITICAL INSTRUCTIONS:
1. Answer ONLY using the context documents provided below.
2. If the context does not contain enough information to answer the question,
   say: "I don't have that information in the provided documentation."
3. Do not use your general knowledge or training data to fill gaps.
4. When you use information from a source, reference it as [Source N].
5. Never speculate or infer beyond what the documents explicitly state.
"""

def validate_retrieval(
    chunks: list[dict],
    min_relevance_score: float = 0.4,
    min_chunks: int = 1
) -> tuple[bool, str]:
    """
    Check retrieval quality before passing to LLM.
    Returns (is_valid, reason).
    """
    if not chunks:
        return False, "No chunks retrieved from knowledge base."

    high_confidence = [c for c in chunks if c["relevance_score"] >= min_relevance_score]

    if len(high_confidence) < min_chunks:
        return False, (
            f"Retrieved {len(chunks)} chunks but none exceeded "
            f"minimum relevance threshold of {min_relevance_score}."
        )

    return True, "Retrieval passed validation."

def generate_with_guardrails(
    query: str,
    reranked_chunks: list[dict],
    llm_client
) -> str:
    is_valid, reason = validate_retrieval(reranked_chunks)

    if not is_valid:
        # Do not call LLM when retrieval fails
        return (
            "I don't have relevant information in the documentation to answer that. "
            f"(Reason: {reason})"
        )

    context = assemble_context(reranked_chunks)

    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Context documents:\n\n{context}\n\nQuestion: {query}"
            }
        ]
    )
    return response.choices[0].message.content
```

Most RAG implementations do not fail at the model layer. They fail earlier, when systems proceed without validating whether retrieved information is sufficient. The validation step above is the structural fix.

Failure 7: No Evaluation Layer

This is the failure that enables all the others to persist. Teams ship RAG systems with no automated quality measurement. Failures accumulate silently. Nobody knows the system has a problem until a user reports it or screenshots a bad answer.

Most of the RAG systems we inherited had zero evaluation framework. This is the one that hurts most to admit. Without evaluation, there is no signal for which failure mode is active, how severe it is, or whether a fix actually helped.

The fix: Run RAGAS against a golden test set before every deployment. RAGAS measures four metrics that map directly to the failure modes above.

| RAGAS Metric | What It Measures | Failure Mode It Catches |
| --- | --- | --- |
| Faithfulness | Whether answer claims are supported by retrieved context | Failure 6: missing context constraint |
| Answer relevancy | Whether answer addresses the actual question | Failure 3: reranking miss |
| Context precision | Whether retrieved chunks are actually relevant | Failure 2: vocabulary mismatch |
| Context recall | Whether the right chunks were retrieved at all | Failure 1: chunking, Failure 5: stale index |

Production targets from 2026 deployment data at MarsDevs: faithfulness above 0.9, answer relevancy above 0.85, context precision above 0.8.

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Build a golden test set of 50-100 representative questions
# before shipping anything to production
golden_set = {
    "question": [
        "What is the enterprise refund policy?",
        "How many API calls does the Pro plan include?",
        "Was the v1 API deprecated?",
    ],
    # One list of retrieved context strings per question
    "contexts": [
        ["Enterprise customers receive full refunds within 60 days..."],
        ["Pro plan includes 1 million API calls per month..."],
        ["The v1 API was deprecated in March 2026. Migrate to v2..."],
    ],
    "answer": [
        "Enterprise customers get full refunds within 60 days of purchase.",
        "The Pro plan includes 1 million API calls per month.",
        "Yes, the v1 API was deprecated in March 2026.",
    ],
    "ground_truth": [
        "Enterprise customers receive full refunds within 60 days.",
        "Pro plan allows 1 million API calls per month.",
        "The v1 API was deprecated in March 2026 and users must migrate to v2.",
    ]
}

dataset = Dataset.from_dict(golden_set)
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

# Gate deployment on minimum thresholds
assert results["faithfulness"] >= 0.9, "Faithfulness below threshold — check context constraint"
assert results["answer_relevancy"] >= 0.85, "Relevancy below threshold — check reranking"
assert results["context_precision"] >= 0.8, "Precision below threshold — check retrieval"
```

Pair RAGAS with distributed tracing through Arize Phoenix or Langfuse so you can trace every query's path through the pipeline and identify which step produced the wrong result.

Diagnosing Which Failure Mode You Have

When a RAG system produces a wrong answer, the debugging process is straightforward if you inspect each layer directly.

| What to Check | What You Find | Failure Mode |
| --- | --- | --- |
| Retrieved chunks for the failing query | Correct answer is absent from all chunks | Chunking (F1) or vocabulary mismatch (F2) |
| Rank position of correct chunk | Correct chunk is retrieved but ranked 4th or lower | Missing reranking (F3) |
| Position of correct chunk in assembled context | Correct chunk is in middle of long context | Lost in the middle (F4) |
| Document last-indexed timestamp | Answer was correct historically but wrong now | Stale index (F5) |
| Model output when correct context is provided manually | Model still uses training data, ignores context | Missing context constraint (F6) |
| RAGAS scores over time | Scores degrading without code changes | All of the above, silently |

The debugging sequence is always: check retrieved chunks first, then chunk rank, then context assembly, then the generation layer. Most teams go straight to the generation layer and waste time there.
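That sequence can be encoded as a simple triage helper. This is a sketch, not a real API; the inspection flags are hypothetical names you would fill in from manual inspection of a failing query:

```python
def triage_rag_failure(checks: dict) -> str:
    """Walk the debugging sequence in order and return the first
    failure mode implicated. Keys are hypothetical inspection results."""
    # Step 1: is the answer present anywhere in the retrieved chunks?
    if not checks.get("answer_present_in_retrieved_chunks"):
        return "F1/F2: chunking artifact or vocabulary mismatch"
    # Step 2: was the correct chunk retrieved but ranked too low?
    if checks.get("correct_chunk_rank", 1) > 3:
        return "F3: missing reranking"
    # Step 3: did context assembly bury the correct chunk?
    if checks.get("correct_chunk_in_middle_of_context"):
        return "F4: lost in the middle"
    # Step 4: has the source document changed since indexing?
    if checks.get("source_doc_changed_since_indexing"):
        return "F5: stale index"
    # Only now is the generation layer the suspect
    return "F6: check the generation layer (context constraint)"

# Example: the correct chunk was retrieved but ranked 5th
print(triage_rag_failure({
    "answer_present_in_retrieved_chunks": True,
    "correct_chunk_rank": 5,
}))
```

The point of the ordering is the same as the prose: the generation layer is checked last, not first.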

The Prevention Stack

The seven failure modes above are not independent. They compound. Bad chunking means vocabulary mismatch is more likely to miss the right documents. Missing reranking means the lost-in-the-middle effect hits harder because the wrong chunks rank high. No evaluation means none of this gets caught.

The prevention stack that addresses all seven:

| Layer | What to Add | Failure Modes Prevented |
| --- | --- | --- |
| Indexing | Semantic chunking + hierarchical parent-child | F1 |
| Indexing | Document hash sync pipeline | F5 |
| Retrieval | Hybrid search: dense + BM25 + RRF | F2 |
| Retrieval | Cross-encoder reranker, top 5 only | F3 |
| Context | Max 5 chunks, highest ranked first | F4 |
| Generation | Context constraint in system prompt + retrieval validation | F6 |
| Evaluation | RAGAS on golden set before every deployment | F7 |

None of these are exotic. Each one is well-understood and has mature tooling. The teams that ship reliable RAG systems are not doing anything mysterious — they are systematically applying this stack and measuring each layer independently.

Where to Go From Here

If you are still deciding whether to build RAG at all, the first question to answer is whether you have a knowledge problem or a behavior problem. That decision is covered in RAG vs Fine-Tuning.

If you are building RAG and want to understand the full architecture before writing code, RAG Architecture Explained covers every layer including chunking, hybrid search, and reranking in detail. The storage layer — which vector database to use and how HNSW indexing works — is in Vector Database in RAG.

For the foundational explanation of how the whole system fits together, start with What Is RAG in AI. For how embeddings produce the vectors that make semantic retrieval possible, and why the embedding model choice determines the ceiling for retrieval quality, the next article in this series is How Embeddings Work in RAG.
