RAG Architecture Explained: How Production Pipelines Actually Work (2026)
Naive RAG fails 40% of the time at retrieval. This guide breaks down every layer of a production RAG architecture — chunking strategies, embedding selection, hybrid search, reranking, query transformation, agentic RAG, and evaluation with RAGAS — with working Python code for each component.
The tutorial version of RAG is three steps. Index documents. Retrieve chunks. Generate an answer. Thirty lines of Python. Works in a demo.
Then you put it in front of real users, with real documents, asking real questions. And retrieval starts returning the wrong chunks. The model answers confidently. The answers are wrong. You look at the retrieved context and you cannot even tell why those chunks ranked highest.
That is not bad luck. That is what happens when you build naive RAG and call it done.
Naive RAG pipelines fail at retrieval roughly 40% of the time. The generation layer is not the problem. The retrieval layer is. And fixing retrieval requires understanding every component that sits between your raw documents and the answer the model produces.
This article covers all of them.
The Full Production Architecture
Before going layer by layer, here is the complete picture of what a production RAG system looks like in 2026.
INGESTION PIPELINE (runs offline)
+------------------------------------------------------------------+
| |
| Raw Documents |
| (PDF, DOCX, HTML, DB records, Markdown) |
| | |
| v |
| Document Parser |
| (Unstructured.io / Apache Tika / PyMuPDF) |
| | |
| v |
| Chunking Layer |
| (semantic / hierarchical / domain-specific) |
| | |
| v |
| Embedding Model |
| (text-embedding-3-large / BGE / Voyage AI) |
| | |
| v |
| Vector Database |
| + Metadata Store |
| (Qdrant / Pinecone / Weaviate) |
| |
+------------------------------------------------------------------+
QUERY PIPELINE (runs at inference time)
+------------------------------------------------------------------+
| |
| User Question |
| | |
| v |
| Query Transformation (optional) |
| (HyDE / query rewriting / sub-question decomposition) |
| | |
| v |
| Hybrid Retrieval |
| Dense (vector similarity) + Sparse (BM25) |
| | |
| v |
| Reciprocal Rank Fusion |
| (merge dense + sparse results) |
| | |
| v |
| Reranker |
| (Cohere Rerank v3.5 / Voyage rerank-2.5 / BGE-Reranker) |
| | |
| v |
| Context Assembly |
| (top-k chunks + metadata + citation markers) |
| | |
| v |
| LLM Generation |
| (GPT-4o / Claude / Gemini / Llama) |
| | |
| v |
| Answer with Citations |
| |
+------------------------------------------------------------------+
EVALUATION LAYER (runs continuously)
+------------------------------------------------------------------+
| RAGAS: faithfulness / answer relevancy / context precision |
| Arize Phoenix or Langfuse: trajectory tracing |
| Weekly drift monitoring on frozen golden set |
+------------------------------------------------------------------+

The reference architecture for 2026 production RAG uses Unstructured.io or Apache Tika for document parsing, semantic chunking via LangChain or LlamaIndex, self-hosted BGE or OpenAI text-embedding-3-small for embeddings, Qdrant or Pinecone with hybrid search enabled, Cohere Rerank for reranking, and RAGAS for automated evaluation. Each component is independently replaceable, which matters when your domain-specific benchmarks tell you something is underperforming.
Now, each layer in detail.
Layer 1: Document Ingestion and Parsing
Before you can chunk anything, you need clean text. This is the step most tutorials skip, and it creates problems that cascade through every layer downstream.
PDFs are the worst offenders. A PDF is a layout format, not a text format. Text extraction from a PDF can produce garbled character orders, broken tables, merged columns, and missing whitespace. Scanned PDFs have no extractable text at all and need OCR before anything else can happen.
Unstructured.io is the standard for production document parsing in 2026. It handles PDFs, DOCX, HTML, images, presentations, and emails. It applies layout detection to separate headers, body text, tables, and figures, and returns clean structured elements rather than raw character streams. For scanned documents, it runs Tesseract OCR automatically.
from unstructured.partition.auto import partition
from unstructured.staging.base import convert_to_dict
# Parse any document type automatically
elements = partition(filename="product_manual.pdf")
# Convert to structured dicts for downstream processing
structured = convert_to_dict(elements)
for item in structured:
    if item["type"] in ["NarrativeText", "Title", "ListItem"]:
        print(f"[{item['type']}] {item['text'][:100]}")

For web content, Firecrawl handles URL-to-Markdown conversion cleanly, stripping navigation, ads, and boilerplate. For large-scale scheduled crawling across thousands of pages, Apify is the production choice.
The output of the parsing step should be clean, normalized Markdown or plain text. Normalize to Markdown before chunking regardless of source format. It makes chunk boundaries predictable and makes it easier to identify structural elements like headers and lists that inform the chunking strategy.
Layer 2: Chunking
Chunking is where most RAG pipelines silently fail. 80% of RAG failures trace back to the ingestion and chunking layer, not the LLM. You can swap embedding models and rerankers indefinitely. If the chunks are semantically broken, retrieval will keep returning the wrong context.
The goal of chunking is to create pieces that are semantically complete. Each chunk should be able to answer a question on its own, without requiring the reader to have seen adjacent chunks.
Fixed-size chunking is the simplest approach and the worst-performing one. You set a character or token limit and split there, with some overlap. It is fast to implement and terrible in practice. A fixed split will cut a sentence in the middle of an explanation, separate a question from its answer, and merge unrelated paragraphs that happen to fall within the limit.
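For reference, the fixed-size baseline looks like this. A minimal sketch using LangChain's RecursiveCharacterTextSplitter; the file name and size values are illustrative, not recommendations.

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Fixed-size baseline: split on character count with overlap, regardless of meaning
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # hard character limit per chunk
    chunk_overlap=150    # overlap softens, but does not fix, boundary cuts
)

with open("product_manual.md", "r") as f:
    text = f.read()

fixed_chunks = splitter.split_text(text)
print(f"Created {len(fixed_chunks)} fixed-size chunks")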
Semantic chunking detects topic boundaries by measuring cosine similarity between consecutive sentences. When similarity drops below a threshold, a new chunk begins. A 2025 clinical decision support study found adaptive chunking reached 87% retrieval accuracy versus 13% for fixed-size baselines on the same corpus. That gap held across multiple embedding models and retrieval strategies. Chunking quality constrains retrieval accuracy more than embedding model choice does.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# SemanticChunker splits at cosine similarity breakpoints
chunker = SemanticChunker(
embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # split where sentence-to-sentence distance exceeds the 95th percentile
breakpoint_threshold_amount=95
)
with open("product_manual.md", "r") as f:
text = f.read()
chunks = chunker.create_documents([text])
print(f"Created {len(chunks)} semantic chunks")
for i, chunk in enumerate(chunks[:3]):
print(f"\nChunk {i+1} ({len(chunk.page_content)} chars):")
    print(chunk.page_content[:200])

Hierarchical chunking goes one level further. It stores both small chunks for precise retrieval and larger parent chunks that provide broader context. When a small chunk is retrieved, the system returns the larger parent chunk to the LLM. This gives the model enough surrounding context to understand the retrieved passage fully.
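A minimal sketch of that parent-child pattern using LangChain's ParentDocumentRetriever. The splitter sizes and the use of the in-memory stores are illustrative assumptions; in production the vector store would be Qdrant, Pinecone, or Weaviate.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Small child chunks are embedded and searched; large parent chunks are what the LLM sees
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

retriever = ParentDocumentRetriever(
    vectorstore=InMemoryVectorStore(OpenAIEmbeddings(model="text-embedding-3-small")),
    docstore=InMemoryStore(),        # holds the full parent chunks
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(chunks)      # `chunks` from the semantic chunker above

# Search matches a small child chunk, but the larger parent chunk is returned
parents = retriever.invoke("What is the refund policy?")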
Domain-specific chunk sizes matter. For general documentation and knowledge bases, 512 to 1024 tokens with 128-token overlap works well. Code should be chunked at function or class level using AST parsing, not character splits. Legal contracts need clause-level chunks with full paragraph context. Medical records chunk at the clinical note level. Conversational data chunks by turn with two-turn overlap.
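For the code case specifically, here is a minimal sketch of function- and class-level chunking with Python's built-in ast module. It handles Python source only; other languages need a real parser such as tree-sitter, and the file name is illustrative.

import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "type": type(node).__name__,
                # ast.get_source_segment recovers the exact source text of the node
                "text": ast.get_source_segment(source, node),
            })
    return chunks

with open("payments.py", "r") as f:
    code_chunks = chunk_python_source(f.read())
print(f"{len(code_chunks)} function/class chunks")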
Add metadata to every chunk at index time: the source document name, page number, section header, creation date, and document type. This metadata enables filtered retrieval later. A query about a product policy can be scoped to only the policy documents, not the entire knowledge base, which improves precision significantly.
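A sketch of what that filtered retrieval looks like in Qdrant, assuming each chunk was indexed with a doc_type field in its payload. The field name, collection name, and embed_query helper are illustrative.

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

# Scope the search to policy documents only, using metadata stored in the payload
results = client.query_points(
    collection_name="company_docs",
    query=embed_query("What is the refund policy?"),  # your embedding function
    query_filter=Filter(
        must=[FieldCondition(key="doc_type", match=MatchValue(value="policy"))]
    ),
    limit=10,
    with_payload=True,
)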
Layer 3: Embedding Models
The embedding model converts each chunk into a high-dimensional vector. The quality of that vector determines how well similarity search can find the right chunks. Retrieval quality has a ceiling set by embedding quality. A better reranker cannot compensate for an embedding model that puts the wrong chunks near the top of the results.
The MTEB leaderboard is the standard benchmark for embedding model comparison, with evaluations updated regularly across retrieval, clustering, classification, and semantic similarity tasks.
Leading embedding models in 2026:
OpenAI text-embedding-3-large is the default for teams using OpenAI's API. It produces 3072-dimensional vectors, performs consistently well across general English domains, and requires no infrastructure to run. The smaller text-embedding-3-small at 1536 dimensions offers 90% of the quality at lower cost and latency.
Voyage AI models are specialized per domain and consistently outperform general-purpose models on their target domains. Voyage-3-finance for financial documents, voyage-3-code for code retrieval, voyage-3-law for legal text. If your domain is narrow and well-defined, Voyage AI is worth benchmarking.
BGE-large and E5-mistral from Hugging Face are the go-to open source choices for teams that need self-hosted embeddings for data privacy reasons. Self-hosted BGE or GTE-Qwen2 is the standard for sensitive data pipelines.
The cardinal rule: use the same embedding model for both indexing and querying. If you index with text-embedding-3-large and query with text-embedding-3-small, the vector spaces will not align and similarity scores will be meaningless.
For a complete treatment of how embeddings work internally and why model selection has this much impact on retrieval quality, see How Embeddings Work in RAG.
Layer 4: Hybrid Search and Reciprocal Rank Fusion
Pure vector similarity search has a known failure mode. It is good at finding semantically similar content but poor at finding exact matches. A user searching for error code ECONNREFUSED or product model AX-7200-PRO will get back chunks that are vaguely similar in meaning but do not contain the exact identifier. Pure keyword search has the opposite problem — it matches exact terms but misses semantically equivalent phrases.
Hybrid search combines both. Dense retrieval handles semantic similarity. BM25 handles keyword precision. Benchmarks show hybrid search delivers roughly 17% recall improvement over pure vector search in production pipelines.
Reciprocal Rank Fusion (RRF) merges the two result lists. It assigns each chunk a score based on its position in both the dense list and the sparse list, then sums those scores. A chunk that ranks 3rd in dense and 5th in sparse gets a higher combined score than a chunk that ranks 1st in dense but 50th in sparse.
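RRF itself is only a few lines. Here is a toy version using the standard constant k = 60, just to make the scoring concrete; in practice the vector database computes this for you, as the query below shows.

def reciprocal_rank_fusion(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs: score = sum of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked_list in (dense_ids, sparse_ids):
        for rank, chunk_id in enumerate(ranked_list, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

print(reciprocal_rank_fusion(["a", "b", "c"], ["x", "c", "y", "z", "a"])[:3])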
from qdrant_client import QdrantClient
from qdrant_client.models import Prefetch, FusionQuery, Fusion, SparseVector

# Qdrant supports native hybrid search (dense + sparse in one query)
client = QdrantClient(url="http://localhost:6333")

# Dense query vector from embedding model
dense_vector = embed_query("What is the refund policy?")  # your embedding function

# Sparse vector from BM25 (via FastEmbed or custom tokenizer)
sparse_vector = bm25_encode("What is the refund policy?")  # your BM25 function

# Hybrid search with RRF fusion
results = client.query_points(
    collection_name="company_docs",
    prefetch=[
        # Dense retrieval: semantic similarity
        Prefetch(query=dense_vector, using="dense", limit=20),
        # Sparse retrieval: keyword matching
        Prefetch(
            query=SparseVector(indices=sparse_vector.indices,
                               values=sparse_vector.values),
            using="sparse",
            limit=20,
        ),
    ],
    # RRF fusion merges both result sets
    query=FusionQuery(fusion=Fusion.RRF),
    limit=10,
    with_payload=True,
)

Weaviate has hybrid search built in as a first-class feature. Qdrant supports it natively through prefetch and fusion queries. Pinecone requires implementing the sparse index separately alongside the dense index.
Layer 5: Reranking
Hybrid retrieval returns 20 to 50 candidate chunks. Most of them are relevant. Some are not. The order matters because the LLM pays more attention to context at the beginning and end of its prompt than to content buried in the middle.
A reranker is a cross-encoder model that takes the query and each candidate chunk together and scores their relevance jointly. Bi-encoders, which handle the retrieval step, embed the query and each chunk independently and compare the resulting vectors. A cross-encoder instead passes the query and chunk through the transformer side by side, so attention runs across every token of both. That makes its relevance judgments far more accurate and far more expensive to compute, which is why it runs over a few dozen candidates rather than the whole corpus.
The leading rerankers in 2026 are Cohere Rerank v3.5 and Voyage AI rerank-2.5, with Voyage rerank-2.5 outperforming Cohere by 10 to 12% on instruction-following benchmarks. For self-hosted deployments, BGE-Reranker from Hugging Face is the standard choice.
import cohere
co = cohere.Client("your-cohere-api-key")
# Candidates from hybrid retrieval (query_points returns a response with .points)
candidates = [point.payload["text"] for point in results.points]

# Rerank with Cohere
reranked = co.rerank(
    query="What is the refund policy?",
    documents=candidates,
    model="rerank-v3.5",
    top_n=5  # keep only top 5 after reranking
)

# Build final context from reranked results
context_chunks = []
for rank, result in enumerate(reranked.results, start=1):
    context_chunks.append({
        "text": candidates[result.index],        # result.index points back into candidates
        "relevance_score": result.relevance_score,
        "rank": rank                             # position after reranking
    })
context = "\n\n".join([c["text"] for c in context_chunks])

Reranking typically adds 50ms latency and costs between $0.001 and $0.01 per query depending on the number of candidates. For internal tools where approximate answers are acceptable, skip it. For customer-facing systems and any domain where answer accuracy is the product, reranking is the cheapest quality improvement available.
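For the self-hosted path mentioned above, a minimal sketch with sentence-transformers and a BGE reranker checkpoint. The model name is one common choice, not the only one, and candidates is the list built from hybrid retrieval above.

from sentence_transformers import CrossEncoder

# BGE reranker as a cross-encoder: scores each (query, chunk) pair jointly
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "What is the refund policy?"
pairs = [(query, text) for text in candidates]
scores = reranker.predict(pairs)

# Keep the top 5 chunks by cross-encoder score
top5 = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
context = "\n\n".join(text for text, _ in top5)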
Layer 6: Query Transformation
The user's question is not always a good retrieval query. Short conversational questions, vague references, and underspecified queries all produce poor retrieval results when used directly as the search vector. Query transformation fixes this before retrieval runs.
Query rewriting uses an LLM to expand a vague or ambiguous question into a cleaner, more specific version before embedding it. A user who asks "how does that thing work?" in a customer support context gets their question rewritten to "how does the product's two-factor authentication flow work?" before the vector search runs.
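A minimal query-rewriting sketch. The prompt and model choice are illustrative, and in production the conversation history would come from your chat state rather than a string argument.

import openai

client = openai.OpenAI(api_key="your-openai-api-key")

def rewrite_query(user_question: str, conversation_history: str) -> str:
    """Expand a vague conversational question into a standalone retrieval query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's question as a single, specific, standalone "
                "search query. Resolve pronouns and vague references using the "
                "conversation history. Return only the rewritten query."
            )},
            {"role": "user", "content": f"History:\n{conversation_history}\n\nQuestion: {user_question}"}
        ],
        max_tokens=100,
    )
    return response.choices[0].message.content.strip()

# "how does that thing work?" -> "how does the product's two-factor authentication flow work?"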
HyDE — Hypothetical Document Embeddings takes a different approach. Instead of searching for the question, it asks the LLM to generate a hypothetical answer to the question first. That hypothetical answer gets embedded and used as the search query. The distance between a hypothetical answer and real document chunks is smaller in semantic space than the distance between a short question and those same chunks. Retrieval recall improves, particularly for sparse or underspecified queries.
import openai
client = openai.OpenAI(api_key="your-openai-api-key")
def hyde_query(user_question: str, embedding_model: callable) -> list[float]:
"""
HyDE: generate a hypothetical answer, embed it, use that for retrieval.
"""
# Step 1: generate a hypothetical answer
response = client.chat.completions.create(
model="gpt-4o-mini", # fast, cheap model is fine for HyDE
messages=[
{
"role": "system",
"content": (
"Generate a concise, factual answer to the question as if "
"you were writing it for a product documentation page. "
"Do not hedge or qualify. Write the answer directly."
)
},
{"role": "user", "content": user_question}
],
max_tokens=200
)
hypothetical_answer = response.choices[0].message.content
# Step 2: embed the hypothetical answer, not the question
vector = embedding_model(hypothetical_answer)
return vector
# Usage
user_question = "what happens if my payment fails?"
hyde_vector = hyde_query(user_question, your_embed_function)
# Use hyde_vector for retrieval instead of embedding user_question directly
results = vector_db.search(hyde_vector, top_k=20)

Sub-question decomposition handles complex multi-hop questions that require information from multiple documents. A question like "how does our refund policy differ between enterprise and starter plan customers?" gets decomposed into "what is the refund policy for enterprise customers?" and "what is the refund policy for starter plan customers?" before retrieval. Both sub-questions run independently and the results get merged before generation.
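A sketch of that decomposition step, assuming the same OpenAI client as above and a retrieve() helper that wraps the hybrid search from Layer 4; both are assumptions, not fixed APIs.

import json

def decompose_question(user_question: str) -> list[str]:
    """Ask the LLM to split a multi-hop question into independent sub-questions."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Split the question into 2-4 self-contained sub-questions that can "
                "each be answered from a single document. Return a JSON array of strings only."
            )},
            {"role": "user", "content": user_question}
        ],
        max_tokens=200,
    )
    return json.loads(response.choices[0].message.content)

sub_questions = decompose_question(
    "How does our refund policy differ between enterprise and starter plan customers?"
)
# Retrieve for each sub-question independently, then merge before generation
merged_context = [chunk for q in sub_questions for chunk in retrieve(q, top_k=5)]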
Layer 7: Agentic RAG
Standard RAG is a single retrieval pass. The question goes in, chunks come back, the model generates an answer. That works well for straightforward factual questions with clear answers in the knowledge base.
Complex questions require multiple retrieval passes. A question that spans multiple documents, requires reasoning across retrieved facts, or cannot be answered by any single chunk does not map cleanly to one retrieve-then-generate loop.
Agentic RAG extends the static pipeline into a dynamic decision-making process. The LLM acts as an orchestrator. It decomposes the question into sub-questions, routes each to an appropriate tool, evaluates whether the retrieved context is sufficient to answer, and performs additional retrieval passes if it is not. It can check its own intermediate results before committing to a final answer.
Agentic RAG Loop
+------------------------------------------+
| User Question |
| | |
| v |
| Planner LLM |
| (decompose into sub-questions) |
| | |
| v |
| Route to retrieval tools: |
| - Vector search (semantic) |
| - Keyword search (BM25) |
| - SQL query (structured data) |
| - Web search (real-time data) |
| | |
| v |
| Retrieve + evaluate sufficiency |
| | |
| |-- Insufficient --> loop back |
| |-- Sufficient --> continue |
| | |
| v |
| Synthesize across retrieved context |
| | |
| v |
| Final Answer with Citations |
+------------------------------------------+

Agentic RAG costs 3 to 10 times more tokens and adds 2 to 5 times the latency versus one-pass RAG. It earns that cost on multi-hop questions, ambiguous queries, and high-stakes domains where a wrong answer from a single retrieval pass is worse than the extra latency of a multi-pass loop. It does not earn it on FAQ bots and single-fact lookups.
LangGraph and LlamaIndex Workflows are the two standard frameworks for implementing agentic RAG loops in 2026. LangGraph gives you explicit graph-based control over agent state transitions. LlamaIndex Workflows is more ergonomic for retrieval-heavy pipelines.
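A skeletal LangGraph loop, to show the shape of the control flow rather than a complete agent. The retrieve, sufficient, and answer_with_citations functions are placeholders you would implement against your own pipeline.

from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    context: list[str]
    attempts: int
    answer: str

def retrieve_node(state: AgentState) -> dict:
    # Placeholder: call your hybrid retrieval + reranking here
    new_chunks = retrieve(state["question"], top_k=5)
    return {"context": state["context"] + new_chunks, "attempts": state["attempts"] + 1}

def generate_node(state: AgentState) -> dict:
    # Placeholder: call your LLM with the assembled context
    return {"answer": answer_with_citations(state["question"], state["context"])}

def route(state: AgentState) -> str:
    # Loop back for another retrieval pass unless context is sufficient or the budget is spent
    if sufficient(state["question"], state["context"]) or state["attempts"] >= 3:
        return "generate"
    return "retrieve"

graph = StateGraph(AgentState)
graph.add_node("retrieve", retrieve_node)
graph.add_node("generate", generate_node)
graph.set_entry_point("retrieve")
graph.add_conditional_edges("retrieve", route, {"retrieve": "retrieve", "generate": "generate"})
graph.add_edge("generate", END)

app = graph.compile()
result = app.invoke({
    "question": "How does the refund policy differ between enterprise and starter plans?",
    "context": [], "attempts": 0, "answer": ""
})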
Layer 8: Evaluation with RAGAS
Most teams skip systematic evaluation until something breaks in production. This is how you end up with retrieval problems that have been accumulating for weeks before anyone notices.
RAGAS is an open-source framework for automated RAG evaluation. It measures four metrics that together give a complete picture of pipeline health.
Faithfulness measures whether every claim in the generated answer can be inferred from the retrieved context. A faithfulness score below 0.9 means the model is making claims that go beyond what the retrieved documents actually say — which is hallucination.
Answer relevancy measures whether the generated answer addresses the actual question. A high-relevancy answer stays focused on what was asked rather than providing adjacent but technically accurate information.
Context precision measures whether the retrieved chunks are actually relevant to the question, and whether more relevant chunks appear earlier in the context. Poor context precision means retrieval is returning low-quality chunks that the model either ignores or gets confused by.
Context recall measures whether the retrieved context contains all the information needed to answer the question fully. Low recall means the right document exists in the knowledge base but retrieval is not finding it.
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall
)
from datasets import Dataset
# Your evaluation dataset: questions, retrieved contexts, generated answers
eval_data = {
"question": [
"What is the refund policy?",
"How do I enable two-factor authentication?",
"What are the rate limits on the Pro plan?"
],
"contexts": [
        ["Refunds are processed within 30 days of the original purchase date..."],
        ["Two-factor authentication can be enabled in Account Settings > Security..."],
        ["Pro plan includes 10,000 API calls per minute and 1M calls per month..."]
],
"answer": [
"Refunds are processed within 30 days of purchase.",
"Go to Account Settings, then Security, and enable 2FA there.",
"The Pro plan allows 10,000 API calls per minute."
],
"ground_truth": [
"The refund policy allows returns within 30 days of purchase.",
"Two-factor authentication is enabled in Account Settings under Security.",
"Pro plan rate limits are 10,000 calls per minute and 1 million per month."
]
}
dataset = Dataset.from_dict(eval_data)
results = evaluate(
dataset=dataset,
metrics=[
faithfulness,
answer_relevancy,
context_precision,
context_recall
]
)
print(results)
# Expected output for a healthy pipeline:
# {'faithfulness': 0.95, 'answer_relevancy': 0.91,
#  'context_precision': 0.88, 'context_recall': 0.87}

Production targets from MarsDevs and other RAG production teams in 2026 are faithfulness above 0.9, answer relevancy above 0.85, and context precision above 0.8. Below those thresholds, the specific failing metric points to where in the pipeline the problem lives.
Low faithfulness means the LLM is generating beyond retrieved context. Tighten the system prompt to constrain the model to its context.
Low context precision means retrieval is surfacing irrelevant chunks. Improve chunking, add metadata filtering, or tune the hybrid search weights.
Low context recall means the right chunks exist but retrieval is not finding them. Examine query transformation, embedding model fit for your domain, and chunk size.
Pair RAGAS metrics with distributed tracing through Arize Phoenix or Langfuse to get full visibility into every step of each query's path through the pipeline. And run weekly evaluation on a frozen golden set of test questions so you can detect drift before users do.
How the Layers Compound
Each layer has its own failure mode, but the failures compound. Bad chunking produces semantically broken pieces that no embedding model can represent well. Poor embeddings mean hybrid retrieval starts with a bad candidate pool that no reranker can fully recover. A reranker that surfaces wrong chunks leads the LLM to produce confident answers with low faithfulness scores. And without RAGAS running continuously, you will not know any of this is happening.
The right sequence for building and debugging a RAG architecture is:
First, get chunking right. Evaluate retrieval precision on a small test set before building anything downstream (a sketch of that check appears after the five steps below). If the right chunks are not in your top-20 candidates, fix chunking before touching anything else.
Second, choose an embedding model that fits your domain. Benchmark on your actual data, not on MTEB general scores. For the reasoning behind embedding model selection, read How Embeddings Work in RAG.
Third, add hybrid search. Enable BM25 alongside dense retrieval. The recall improvement is consistent and the implementation cost is low with Weaviate or Qdrant.
Fourth, add reranking. It is cheap, it is fast to implement, and it consistently improves the quality of what goes into the LLM's context.
Fifth, evaluate with RAGAS from day one, not after you go to production.
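For the first step, here is a sketch of the retrieval-precision check: a tiny golden set of questions mapped to the chunk IDs that should be retrieved for each, scored as hit rate at 20. The golden-set entries and the retrieve() helper are illustrative stand-ins for your own data and retrieval stack.

# Tiny golden set: question -> IDs of chunks that contain the answer
golden_set = {
    "What is the refund policy?": {"policy_doc_chunk_12"},
    "How do I enable two-factor authentication?": {"security_doc_chunk_04"},
}

def hit_rate_at_k(golden: dict[str, set[str]], k: int = 20) -> float:
    """Fraction of questions whose expected chunk appears in the top-k retrieved."""
    hits = 0
    for question, expected_ids in golden.items():
        retrieved_ids = {chunk["id"] for chunk in retrieve(question, top_k=k)}
        if expected_ids & retrieved_ids:
            hits += 1
    return hits / len(golden)

print(f"hit rate @20: {hit_rate_at_k(golden_set):.2f}")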
Query transformation and agentic patterns come after those foundations are solid. Adding HyDE or a multi-hop agent loop to a pipeline with broken chunking makes the debugging problem harder, not easier.
For what happens when retrieval fails despite this architecture — and it will sometimes — read Why RAG Fails. For the full comparison of when this architecture is the right choice versus fine-tuning, see RAG vs Fine-Tuning. For the storage and indexing layer specifically, Vector Database in RAG covers HNSW indexing, vector database selection, and what the numbers look like under production load.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.