Why RAG Fails: Every Failure Mode and How to Fix Each One (2026)
RAG fails at retrieval 73% of the time, not generation. This guide covers every production failure mode — chunking artifacts, vocabulary mismatch, lost in the middle, missing reranking, stale indexes, and no evaluation layer — with specific fixes for each, backed by 2025 and 2026 production data.
Here is the thing nobody explains clearly: a RAG system can produce wrong answers through at least seven distinct mechanisms, and only two of them live in the generation layer. The other five happen before the LLM ever sees a single character of input.
The demo worked because the documents were clean, the questions were predictable, and the team was asking questions they already knew the answers to. Production is different. Real users ask vague questions. Documents are messy. The knowledge base grows and gets stale. And nobody is watching the retrieval step because everyone is focused on the model's output.
Industry analysis in 2026 consistently shows that when RAG fails, the failure point is retrieval 73% of the time, not generation. This article covers every failure mode with its specific diagnosis and fix.
The Failure Map
Before going deep on each problem, here is the full map. Each failure mode lives at a specific layer of the pipeline.
| Failure Mode | Where It Lives | Symptom |
|---|---|---|
| Fixed-size chunking artifacts | Indexing | Retrieved chunks miss the actual answer by a paragraph |
| Vocabulary mismatch | Retrieval | Semantically correct question, wrong documents returned |
| Missing reranking | Retrieval | Right chunk retrieved at position 5 — model ignores it |
| Lost in the middle | Context assembly | Correct chunk in context, model still gets it wrong |
| Stale index | Indexing | Model answers confidently from outdated documents |
| No context constraint | Generation | Model answers from training data when retrieval fails |
| No evaluation | All layers | Failures accumulate silently for weeks |
Failure 1: Fixed-Size Chunking Artifacts
This is the most common root cause and the hardest to notice until you look directly at retrieved chunks.
Fixed-size chunking splits documents at a character or token limit. It does not care what is at that boundary. A sentence gets cut in half. A table gets separated from its header row. A numbered list gets split between item 3 and item 4. The chunk about rate limiting contains the explanation but not the actual numbers, because those are in a table two paragraphs away from the text that was retrieved.
The retrieved chunk scores well on cosine similarity because it is topically related. The model reads it and cannot produce the specific answer. It either hedges or fabricates a plausible-sounding number.
The fix: Stop chunking by character count. Use semantic chunking that detects topic boundaries by measuring cosine similarity between consecutive sentences. When similarity drops below a threshold, that drop marks a natural split point. Semantic chunking is covered in more detail in RAG Architecture Explained.
For tables specifically, repeat the header row in every table chunk. Without headers, retrieved rows become ambiguous facts with no column context. With headers, they become usable evidence. For code, chunk at function or class level using AST parsing rather than character counts — a function boundary is a semantic boundary.
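As a minimal sketch of the header-repetition idea for markdown-style tables (the rows-per-chunk count and helper name are illustrative choices, not fixed rules):
def chunk_markdown_table(table: str, rows_per_chunk: int = 10) -> list[str]:
    """Split a markdown table into chunks, repeating the header in each one."""
    lines = [line for line in table.strip().splitlines() if line.strip()]
    header, separator, rows = lines[0], lines[1], lines[2:]
    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        body = rows[i:i + rows_per_chunk]
        # Every chunk carries the header and separator row, so each row keeps its column context
        chunks.append("\n".join([header, separator] + body))
    return chunks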
Hierarchical chunking solves the precision-context tradeoff that forces a choice between small chunks (precise retrieval, missing context) and large chunks (complete context, imprecise retrieval). Store small chunks for retrieval precision. When a small chunk is retrieved, return its larger parent chunk to the LLM. Retrieve at paragraph granularity, generate with section-level context.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Parent splitter: larger chunks for context
parent_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=97  # higher percentile = fewer splits = larger chunks
)

# Child splitter: smaller chunks for retrieval precision
child_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90  # lower percentile = more splits = smaller chunks
)

def build_parent_child_chunks(text: str) -> list[dict]:
    parent_docs = parent_splitter.create_documents([text])
    all_chunks = []
    for parent_idx, parent in enumerate(parent_docs):
        # Split each parent into child chunks for retrieval
        child_docs = child_splitter.create_documents([parent.page_content])
        for child in child_docs:
            all_chunks.append({
                "child_text": child.page_content,    # indexed in the vector DB
                "parent_text": parent.page_content,  # returned to the LLM
                "parent_id": parent_idx
            })
    return all_chunks

Failure 2: Vocabulary Mismatch
The user asks about a "money back guarantee." Your documentation says "refund policy." Dense vector search finds chunks that are semantically close to "money back guarantee" — but if the embedding model did not place those two phrases close together in vector space, the refund policy document ranks low. A document about payment methods might rank higher because "payment" is closer to "money" than "policy" is.
This is not a failure of the embedding model. It is a fundamental property of how dense retrieval works. Embedding models compress meaning into vectors. Exact term matching is a separate problem from semantic similarity.
The same applies to error codes, product model numbers, person names, version strings, and any domain-specific terminology that the embedding model tokenizes differently from how users phrase it.
The fix: Implement hybrid search combining dense vector similarity with BM25 keyword matching, merged through Reciprocal Rank Fusion. For how to implement this on Qdrant and Weaviate, see RAG Architecture Explained. Do not treat vocabulary mismatch as an embedding model selection problem — no embedding model fixes this. It requires a structural change to the retrieval layer.
| Retrieval Type | Strengths | Weaknesses |
|---|---|---|
| Dense vector only | Semantic concepts, paraphrases, intent | Exact terms, product codes, proper nouns |
| BM25 only | Exact matches, rare terms, identifiers | Synonyms, conceptual similarity |
| Hybrid (dense + BM25) | Both | Slightly more complex to implement |
Benchmarks show hybrid search delivers roughly 17% recall improvement over pure vector search in production pipelines. For most production RAG systems, this is not optional.
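The merge step itself is small. As a minimal sketch of Reciprocal Rank Fusion over two ranked result lists (document IDs as inputs and the conventional k=60 constant are assumptions):
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) across every list it appears in.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: fuse the dense-vector and BM25 result lists before reranking
# fused = reciprocal_rank_fusion([dense_results, bm25_results])
Because RRF works on rank positions only, the dense similarity scores and BM25 scores never need to be normalized against each other.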
Failure 3: Missing Reranking
Retrieval returns the top-k chunks ranked by cosine similarity. Cosine similarity is a good proxy for semantic relatedness. It is not the same as relevance to this specific question.
A chunk about "refund timelines for enterprise contracts" and a chunk about "refund policy overview" both have high cosine similarity to a query about enterprise refunds. The overview ranks higher because it uses more of the query's words. The enterprise-specific chunk is more relevant but ranks lower. The model generates an answer from the overview, misses the enterprise-specific detail, and the answer is technically correct but wrong for this user.
The fix: Add a cross-encoder reranker after retrieval. Retrieve 20 to 50 candidates. Run a reranker that scores each candidate jointly against the query — cross-encoders read query and chunk together rather than embedding them independently, which gives far more accurate relevance scores. Return only the top 3 to 5 after reranking.
Cohere Rerank v3.5 and Voyage AI rerank-2.5 are the leading managed options. BGE-Reranker from Hugging Face is the self-hosted choice.
import cohere

co = cohere.Client("your-cohere-api-key")

def retrieve_and_rerank(
    query: str,
    vector_db_results: list[str],
    top_n: int = 5
) -> list[dict]:
    # Rerank the candidates from vector retrieval
    reranked = co.rerank(
        query=query,
        documents=vector_db_results,
        model="rerank-v3.5",
        top_n=top_n
    )
    return [
        {
            "text": vector_db_results[r.index],
            "relevance_score": r.relevance_score
        }
        for r in reranked.results
    ]

Reranking adds roughly 50ms latency per query and costs between $0.001 and $0.01 per query depending on the number of candidates. In every production RAG system where retrieval quality is the product rather than an internal tool, the quality improvement justifies this cost without exception.
Failure 4: Lost in the Middle
This one is counterintuitive because it feels like a retrieval success. The right chunk was retrieved. It is in the context window. The model still misses it.
LLMs have lower recall for information placed in the middle of long contexts. Research on transformer attention patterns consistently shows this effect. Information at the beginning and end of the context receives more attention than information in the middle. If you pass 8 retrieved chunks to the model and the most relevant one lands at position 4 or 5, the model is statistically likely to under-attend to it and give a partial or incorrect answer.
You can have perfect embeddings and still get poor answers if the correct chunk is placed in the middle of a long prompt. Embeddings decide what gets retrieved. They do not control what gets attended to.
The fix: Keep fewer chunks in the context. Three to five high-quality chunks produce better answers than eight chunks of mixed relevance. After reranking, place the highest-scoring chunk first in the assembled context — not in the middle. If you need to pass multiple chunks, use the "sandwich" strategy: most relevant first, supporting chunks in the middle, second-most-relevant last.
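As a minimal sketch, the sandwich reorder could look like this, assuming the chunk list is already sorted by reranker score, highest first:
def sandwich_order(chunks: list[dict]) -> list[dict]:
    """Place the two strongest chunks at the edges of the context window.

    Assumes `chunks` is sorted by relevance, highest first.
    """
    if len(chunks) < 3:
        return chunks
    # Most relevant first, supporting chunks in the middle, second-most-relevant last
    return [chunks[0]] + chunks[2:] + [chunks[1]]
The assembly helper below uses the simpler ordering: highest-scoring chunk first, capped at five chunks, each labeled with its source index and relevance score.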
def assemble_context(reranked_chunks: list[dict]) -> str:
    """
    Anti-lost-in-middle context assembly.
    Most relevant chunk goes first.
    Keep to top 5 maximum.
    """
    # Already sorted by relevance from reranker, highest first
    top_chunks = reranked_chunks[:5]
    context_parts = []
    for idx, chunk in enumerate(top_chunks):
        context_parts.append(
            f"[Source {idx + 1} | Relevance: {chunk['relevance_score']:.3f}]\n"
            f"{chunk['text']}"
        )
    return "\n\n---\n\n".join(context_parts)

Failure 5: Stale Index
A knowledge base that does not update automatically will drift from reality. Documents change. Policies get updated. Features get deprecated. Prices change. The vector index does not know any of this unless you tell it.
The stale index failure produces a particularly bad outcome: the model answers confidently from a document that was accurate six months ago. The user has no indication the answer is outdated. They act on it.
RAG changes what the model can see right now. A document update that costs zero dollars in a RAG system costs between $500 and $5,000 with fine-tuning. But that zero-dollar advantage only holds if you actually update the document in the index.
The fix: Build an update pipeline before you go to production, not after you discover the problem. For most knowledge bases, a daily re-ingestion job that checks document hashes is sufficient. When a document hash changes, delete the old chunks for that document and reindex the new version. For real-time data sources, trigger reindexing via webhooks from the source system on every document change.
import hashlib
import datetime

def compute_document_hash(content: str) -> str:
    return hashlib.sha256(content.encode()).hexdigest()

def sync_document_to_index(
    doc_id: str,
    new_content: str,
    vector_db_client,
    embedding_fn,
    chunker
) -> dict:
    """
    Hash-based sync: only reindex if the document actually changed.
    Returns sync status for monitoring.
    """
    new_hash = compute_document_hash(new_content)
    # Check current hash stored in metadata
    existing = vector_db_client.get_document_metadata(doc_id)
    if existing and existing.get("hash") == new_hash:
        return {"doc_id": doc_id, "status": "unchanged", "action": "skipped"}
    # Document changed: delete old chunks and reindex
    vector_db_client.delete_by_filter({"doc_id": doc_id})
    new_chunks = chunker.split(new_content)
    new_embeddings = embedding_fn(new_chunks)
    vector_db_client.upsert_chunks(
        chunks=new_chunks,
        embeddings=new_embeddings,
        metadata={
            "doc_id": doc_id,
            "hash": new_hash,
            "indexed_at": datetime.datetime.utcnow().isoformat()
        }
    )
    return {"doc_id": doc_id, "status": "updated", "chunks": len(new_chunks)}

Failure 6: Missing Context Constraint
This failure happens at the generation layer, not retrieval. The system prompt does not tell the model to stay within its retrieved context. When retrieval returns the wrong chunks — or when retrieval is silently empty — the model falls back to training data and answers confidently from memory.
The answer can be wrong in two ways. It can be factually incorrect because the model's training data is outdated. Or it can be factually correct according to training data but incorrect for this specific organization, which has its own policies, processes, and configurations that differ from the general case.
Hallucination is a policy failure, not a model failure. If the model is not instructed to stay within retrieved context, it will not.
The fix: The system prompt must explicitly instruct the model to use only the provided context and to say so when the context is insufficient. Add a retrieval validation check that catches empty or low-confidence retrieval before passing anything to the model.
SYSTEM_PROMPT = """You are a helpful assistant answering questions about our product.

CRITICAL INSTRUCTIONS:
1. Answer ONLY using the context documents provided below.
2. If the context does not contain enough information to answer the question,
   say: "I don't have that information in the provided documentation."
3. Do not use your general knowledge or training data to fill gaps.
4. When you use information from a source, reference it as [Source N].
5. Never speculate or infer beyond what the documents explicitly state.
"""

def validate_retrieval(
    chunks: list[dict],
    min_relevance_score: float = 0.4,
    min_chunks: int = 1
) -> tuple[bool, str]:
    """
    Check retrieval quality before passing to LLM.
    Returns (is_valid, reason).
    """
    if not chunks:
        return False, "No chunks retrieved from knowledge base."
    high_confidence = [c for c in chunks if c["relevance_score"] >= min_relevance_score]
    if len(high_confidence) < min_chunks:
        return False, (
            f"Retrieved {len(chunks)} chunks but only {len(high_confidence)} exceeded "
            f"the minimum relevance threshold of {min_relevance_score}."
        )
    return True, "Retrieval passed validation."

def generate_with_guardrails(
    query: str,
    reranked_chunks: list[dict],
    llm_client
) -> str:
    is_valid, reason = validate_retrieval(reranked_chunks)
    if not is_valid:
        # Do not call the LLM when retrieval fails
        return (
            "I don't have relevant information in the documentation to answer that. "
            f"(Reason: {reason})"
        )
    context = assemble_context(reranked_chunks)
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Context documents:\n\n{context}\n\nQuestion: {query}"
            }
        ]
    )
    return response.choices[0].message.content

Most RAG implementations do not fail at the model layer. They fail earlier, when systems proceed without validating whether retrieved information is sufficient. The validation step above is the structural fix.
Failure 7: No Evaluation Layer
This is the failure that enables all the others to persist. Teams ship RAG systems with no automated quality measurement. Failures accumulate silently. Nobody knows the system has a problem until a user reports it or screenshots a bad answer.
Most of the RAG systems we inherited had no evaluation framework at all. This is the one that hurts most to admit. Without evaluation, there is no signal for which failure mode is active, how severe it is, or whether a fix actually helped.
The fix: Run RAGAS against a golden test set before every deployment. RAGAS measures four metrics that map directly to the failure modes above.
| RAGAS Metric | What It Measures | Failure Mode It Catches |
|---|---|---|
| Faithfulness | Whether answer claims are supported by retrieved context | Failure 6: missing context constraint |
| Answer relevancy | Whether answer addresses the actual question | Failure 3: reranking miss |
| Context precision | Whether retrieved chunks are actually relevant | Failure 2: vocabulary mismatch |
| Context recall | Whether the right chunks were retrieved at all | Failure 1: chunking, Failure 5: stale index |
Production targets from 2026 deployment data at MarsDevs: faithfulness above 0.9, answer relevancy above 0.85, context precision above 0.8.
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Build a golden test set of 50-100 representative questions
# before shipping anything to production
golden_set = {
    "question": [
        "What is the enterprise refund policy?",
        "How many API calls does the Pro plan include?",
        "Was the v1 API deprecated?",
    ],
    "contexts": [
        ["Enterprise customers receive full refunds within 60 days..."],
        ["Pro plan includes 1 million API calls per month..."],
        ["The v1 API was deprecated in March 2026. Migrate to v2..."],
    ],
    "answer": [
        "Enterprise customers get full refunds within 60 days of purchase.",
        "The Pro plan includes 1 million API calls per month.",
        "Yes, the v1 API was deprecated in March 2026.",
    ],
    "ground_truth": [
        "Enterprise customers receive full refunds within 60 days.",
        "Pro plan allows 1 million API calls per month.",
        "The v1 API was deprecated in March 2026 and users must migrate to v2.",
    ]
}

dataset = Dataset.from_dict(golden_set)
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

# Gate deployment on minimum thresholds
assert results["faithfulness"] >= 0.9, "Faithfulness below threshold — check context constraint"
assert results["answer_relevancy"] >= 0.85, "Relevancy below threshold — check reranking"
assert results["context_precision"] >= 0.8, "Precision below threshold — check retrieval"

Pair RAGAS with distributed tracing through Arize Phoenix or Langfuse so you can trace every query's path through the pipeline and identify which step produced the wrong result.
Diagnosing Which Failure Mode You Have
When a RAG system produces a wrong answer, the debugging process is straightforward if you inspect each layer directly.
| What to Check | What You Find | Failure Mode |
|---|---|---|
| Retrieved chunks for the failing query | Correct answer is absent from all chunks | Chunking (F1) or vocabulary mismatch (F2) |
| Rank position of correct chunk | Correct chunk is retrieved but ranked 4th or lower | Missing reranking (F3) |
| Position of correct chunk in assembled context | Correct chunk is in middle of long context | Lost in the middle (F4) |
| Document last-indexed timestamp | Answer was correct historically but wrong now | Stale index (F5) |
| Model output when correct context is provided manually | Model still uses training data, ignores context | Missing context constraint (F6) |
| RAGAS scores over time | Scores degrading without code changes | All of the above, silently |
The debugging sequence is always: check retrieved chunks first, then chunk rank, then context assembly, then the generation layer. Most teams go straight to the generation layer and waste time there.
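A minimal sketch of that sequence as a debugging helper; the inputs (a snippet of the known correct answer, the retrieved chunks with scores, and the assembled context string) are assumptions about what is on hand when a single query fails, and the thresholds are illustrative:
def diagnose_failure(gold_snippet: str, retrieved: list[dict], context: str) -> str:
    """Walk the debugging sequence for one failing query and suggest a failure mode."""
    # 1. Is the answer present in any retrieved chunk at all?
    hits = [i for i, c in enumerate(retrieved) if gold_snippet.lower() in c["text"].lower()]
    if not hits:
        return "Answer absent from retrieved chunks: suspect chunking (F1) or vocabulary mismatch (F2)."
    # 2. Is the correct chunk ranked 4th or lower?
    if min(hits) >= 3:
        return f"Correct chunk ranked at position {min(hits) + 1}: suspect missing reranking (F3)."
    # 3. Is the correct chunk missing from, or buried in the middle of, the assembled context?
    pos = context.lower().find(gold_snippet.lower())
    if pos == -1:
        return "Correct chunk retrieved but dropped during assembly: check the chunk cap (F4)."
    if pos > len(context) * 0.3 and len(context) > 8000:
        return "Correct chunk sits mid-context in a long prompt: suspect lost in the middle (F4)."
    # 4. Retrieval and assembly look healthy
    return "Retrieval looks healthy: suspect a stale source (F5) or missing context constraint (F6)."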
The Prevention Stack
The seven failure modes above are not independent. They compound. Bad chunking means vocabulary mismatch is more likely to miss the right documents. Missing reranking means the lost-in-the-middle effect hits harder because the wrong chunks rank high. No evaluation means none of this gets caught.
The prevention stack that addresses all seven:
| Layer | What to Add | Failure Modes Prevented |
|---|---|---|
| Indexing | Semantic chunking + hierarchical parent-child | F1 |
| Indexing | Document hash sync pipeline | F5 |
| Retrieval | Hybrid search: dense + BM25 + RRF | F2 |
| Retrieval | Cross-encoder reranker, top 5 only | F3 |
| Context | Max 5 chunks, highest ranked first | F4 |
| Generation | Context constraint in system prompt + retrieval validation | F6 |
| Evaluation | RAGAS on golden set before every deployment | F7 |
None of these are exotic. Each one is well-understood and has mature tooling. The teams that ship reliable RAG systems are not doing anything mysterious — they are systematically applying this stack and measuring each layer independently.
Where to Go From Here
If you are still deciding whether to build RAG at all, the first question to answer is whether you have a knowledge problem or a behavior problem. That decision is covered in RAG vs Fine-Tuning.
If you are building RAG and want to understand the full architecture before writing code, RAG Architecture Explained covers every layer including chunking, hybrid search, and reranking in detail. The storage layer — which vector database to use and how HNSW indexing works — is in Vector Database in RAG.
For the foundational explanation of how the whole system fits together, start with What Is RAG in AI. For how embeddings produce the vectors that make semantic retrieval possible, and why the embedding model choice determines the ceiling for retrieval quality, the next article in this series is How Embeddings Work in RAG.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.