How Embeddings Work in RAG: The Complete Guide (2026)
Embeddings are the invisible foundation of every RAG retrieval pipeline. This guide explains what embeddings are, how transformers produce them, why cosine similarity works, the difference between bi-encoders and cross-encoders, how to choose an embedding model in 2026, and where retrieval quality silently breaks down.
Every time a user asks a question in a RAG system, the pipeline converts that question into a list of numbers before anything else happens. The retrieval step — the part that determines which document chunks get passed to the LLM — operates entirely on those numbers. Not on words. Not on meaning as humans understand it. On the geometry of vectors in a high-dimensional space.
If that geometry does not accurately reflect the semantic relationships in your domain, retrieval fails. And if retrieval fails, the LLM generates a confident answer from the wrong context.
This article explains how that geometry works, where it breaks, and which models to use to get it right in 2026.
What an Embedding Actually Is
An embedding is a fixed-length vector — a list of floating-point numbers — that represents the meaning of a piece of text. A short sentence might be represented as a vector with 768 numbers. A paragraph processed by OpenAI's text-embedding-3-large model becomes a vector with 3,072 numbers.
Those numbers are not arbitrary codes. They are coordinates in a high-dimensional space where meaning becomes geometry. Text that means similar things produces vectors that point in similar directions. Text that means different things produces vectors that point in different directions.
The classic illustration from Word2Vec is still useful: if you take the embedding for "king," subtract the embedding for "man," and add the embedding for "woman," you get a vector close to the embedding for "queen." The arithmetic works because the geometric relationships between words in the vector space reflect semantic relationships between concepts.
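The arithmetic is easy to check with off-the-shelf word vectors. Here is a minimal sketch using gensim's pretrained GloVe vectors; the dataset name and download step are gensim conventions, not part of any RAG pipeline.
import gensim.downloader as api
# Downloads ~130MB of pretrained GloVe word vectors on first run
vectors = api.load("glove-wiki-gigaword-100")
# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # top match is 'queen'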
Semantic space (simplified to 2D for illustration): "heart failure" and "cardiac arrest" sit close together, while "bank loan" and "river bank" land far from them and from each other.
Distance between "heart failure" and "cardiac arrest": small
Distance between "bank loan" and "river bank": large
Distance between "heart failure" and "bank loan": large
In a real embedding space, this plays out across 768 to 3,072 dimensions. The semantic relationships are the same. Just harder to draw.
Modern RAG systems use sentence or passage embeddings rather than word embeddings. The embedding model processes an entire chunk of text — a paragraph, a document section — and produces a single vector that represents the meaning of that whole piece. This is what gets stored in the vector database and compared against query vectors at retrieval time.
How Transformers Produce Embeddings
The embedding models used in production RAG systems today are based on the Transformer architecture, introduced in the 2017 paper "Attention is All You Need." Unlike earlier models that processed text sequentially, Transformers use self-attention mechanisms to weigh the importance of different words in relation to each other, regardless of their distance in the sequence.
A sentence goes into the model as a sequence of token IDs. Each token gets an initial vector representation. The self-attention mechanism then updates each token's vector based on its relationship to every other token in the sequence. After multiple attention layers, the model pools these token-level representations into a single fixed-length vector that represents the entire input.
That pooled vector is the embedding.
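Here is a minimal sketch of that tokenize, attend, pool sequence using a small open-source encoder via Hugging Face transformers. The model name is illustrative; any BERT-style sentence encoder follows the same path.
import torch
from transformers import AutoTokenizer, AutoModel
model_name = "sentence-transformers/all-MiniLM-L6-v2"  # small 384-dimension encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
def embed(text: str) -> torch.Tensor:
    # Text -> token IDs
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Self-attention layers produce one contextual vector per token
        token_vectors = model(**inputs).last_hidden_state  # shape (1, seq_len, 384)
    # Mean-pool the token vectors into one fixed-length vector for the whole input
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)
    # Normalize to unit length so dot product equals cosine similarity
    return torch.nn.functional.normalize(pooled, p=2, dim=1)[0]
print(embed("Refunds are processed within 30 days.").shape)  # torch.Size([384])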
The quality of the embedding depends on what the model was trained to do. A model trained to predict masked words (like BERT) learns different vector relationships than a model trained to produce semantically similar vectors for paraphrases (like Sentence Transformers), or a model trained specifically for retrieval tasks where short queries should match long document passages.
For RAG, retrieval-optimized training matters. A general-purpose language model embedding is not the same as a retrieval-optimized embedding.
Cosine Similarity: Why Angle Beats Distance
Once you have two vectors — a query vector and a document chunk vector — you need a way to measure how similar they are. Two metrics dominate RAG systems.
Cosine similarity measures the cosine of the angle between two vectors. It ranges from -1 (pointing in opposite directions) to +1 (pointing in the same direction). Two embedding vectors that point in nearly the same direction score close to 1, which signals semantic similarity.
import numpy as np
def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
"""
Measure semantic similarity between two embedding vectors.
Returns a value between -1 and 1. Higher = more similar.
"""
dot_product = np.dot(vec_a, vec_b)
norm_a = np.linalg.norm(vec_a)
norm_b = np.linalg.norm(vec_b)
return dot_product / (norm_a * norm_b)
# When vectors are normalized to unit length (standard practice),
# this simplifies to just the dot product:
def cosine_similarity_normalized(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
return np.dot(vec_a, vec_b)
# Example
query_vec = np.array([0.6, 0.8, 0.0]) # "refund policy"
doc_vec_1 = np.array([0.58, 0.81, 0.05]) # "money back guarantee" — similar direction
doc_vec_2 = np.array([0.9, 0.1, 0.4]) # "annual report filing" — different direction
print(cosine_similarity(query_vec, doc_vec_1)) # ~0.998 — high similarity
print(cosine_similarity(query_vec, doc_vec_2)) # ~0.626 — lower similarity
Why not Euclidean distance? Euclidean distance measures the straight-line distance between two points in the vector space. It is sensitive to the magnitude of vectors, not just their direction. Two documents that express the same idea with different levels of detail will have different-magnitude embeddings if the model is not normalized. Cosine similarity removes this sensitivity by focusing only on direction.
Most sentence embedding models normalize all vectors to unit length, which means cosine similarity reduces to the dot product — a faster computation. OpenAI normalizes all its embedding outputs. Voyage AI models normalize by default. For these models, cosine similarity and dot product are equivalent.
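A quick numerical check of that equivalence, using the same toy vectors as above:
import numpy as np
a = np.array([0.6, 0.8, 0.0])
b = np.array([0.58, 0.81, 0.05])
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a_unit = a / np.linalg.norm(a)   # normalize to unit length
b_unit = b / np.linalg.norm(b)
dot_of_normalized = np.dot(a_unit, b_unit)
print(np.isclose(cosine, dot_of_normalized))  # True: same score, cheaper computation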
Bi-Encoders vs Cross-Encoders
This is the architectural distinction that explains why RAG pipelines have two retrieval stages.
Bi-encoders encode the query and each document independently. The query goes through the model and produces a vector. Each document chunk goes through the same model and produces a vector. Similarity is computed after encoding by comparing the resulting vectors.
The critical property: document vectors can be pre-computed offline and stored in the vector database. At query time, only the query needs to be embedded — a single inference call. Similarity search across millions of stored vectors takes milliseconds with HNSW indexing.
Bi-encoders are particularly effective in scenarios where large-scale, real-time retrieval is necessary, such as in search engines or large knowledge bases. Speed is their defining advantage.
Their limitation: they compress each input into a fixed-size vector independently. The model cannot directly compare the query and document during encoding. It encodes what the query means, encodes what the document means, and then compares those summaries. Nuanced relevance relationships — like detecting that "$500/night" contradicts the query term "cheap" — can be lost in the compression.
Cross-encoders take the query and a candidate document together as a single input. Every token in the query attends to every token in the document through the full transformer attention mechanism. The output is a single relevance score, not two separate vectors.
Cross-encoders read the query and document together in a single pass, catching that "$500/night" contradicts "cheap" and ranking that chunk lower. They are far more accurate at relevance scoring than bi-encoders.
Their limitation: there is no pre-computation. Every query-document pair must run through the full model at query time. Across a knowledge base of one million chunks, this is computationally impossible in real time.
The production solution is a two-stage pipeline:
| Stage | Model Type | Input | Output | Speed | Accuracy |
|---|---|---|---|---|---|
| Retrieval (Stage 1) | Bi-encoder | Query vector vs pre-stored doc vectors | Top 20 to 50 candidates | Milliseconds | Good |
| Reranking (Stage 2) | Cross-encoder | Query + each candidate together | Relevance scores | 50 to 200ms added | High |
| Generation (Stage 3) | LLM | Top 3 to 5 reranked chunks | Answer | 500ms to 2s | Depends on retrieval |
The optimal RAG setup is: bi-encoder retrieval for scalable candidate selection, cross-encoder reranking for accurate relevance scoring, and LLM generation from the reranked top-k. This is the architecture that enterprise-grade RAG systems run in 2026.
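A minimal sketch of that two-stage pattern with the sentence-transformers library. The model names and the three-document corpus are illustrative, not a recommendation.
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
corpus = [
    "Rooms start at $500 per night during peak season.",
    "Budget rooms are available from $80 per night year-round.",
    "The hotel offers a free airport shuttle service.",
]
query = "cheap hotel room"
# Stage 1: bi-encoder retrieval against pre-computed, normalized document vectors
doc_vecs = bi_encoder.encode(corpus, normalize_embeddings=True)
query_vec = bi_encoder.encode(query, normalize_embeddings=True)
candidate_ids = np.argsort(-(doc_vecs @ query_vec))[:2]   # top candidates by cosine
# Stage 2: cross-encoder reranking of (query, candidate) pairs
pairs = [(query, corpus[i]) for i in candidate_ids]
rerank_scores = cross_encoder.predict(pairs)
best = candidate_ids[int(np.argmax(rerank_scores))]
print(corpus[best])   # expected: the $80/night passage ranks above $500/night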
Asymmetric Retrieval: The Silent Quality Killer
One of the most consistent failure patterns in production RAG pipelines is using an embedding model that was not designed for asymmetric retrieval.
Queries are short. Three to fifteen words. Conversational phrasing. Often a question. Documents are long. Hundreds or thousands of words. Formal or technical prose. They answer questions they were not necessarily written to answer. That mismatch is what asymmetric retrieval means: a retrieval-trained model encodes queries and documents differently so that a short question still lands near the long passage that answers it.
Voyage AI implements this explicitly. When you call their API, you specify input_type="query" or input_type="document". The same text produces a different vector depending on which input type you specify, because the model applies different internal processing to optimize each for its role in retrieval.
import voyageai
client = voyageai.Client(api_key="your-voyage-api-key")
# At indexing time: use document input type for all chunks
doc_embeddings = client.embed(
texts=["Refunds are processed within 30 days of the original purchase date."],
model="voyage-3-large",
input_type="document" # optimized for longer passage representation
)
# At query time: use query input type for user questions
query_embedding = client.embed(
texts=["Can I get a refund after 30 days?"],
model="voyage-3-large",
input_type="query" # optimized for short question representation
)
# These two vectors now exist in an aligned asymmetric space
# The model was trained on this distinction — retrieval quality is higher
Using the same model for both query and document is necessary but not sufficient. Retrieval quality also depends on whether the model was trained for asymmetric retrieval. A model trained on symmetric sentence pairs produces vectors where a short question and a long document passage are not optimally aligned, even if both are processed by the same weights.
BGE-M3 from BAAI handles this through its multi-granularity retrieval training — it was explicitly trained to match queries against passages of varying lengths and styles.
Matryoshka Representation Learning: Flexible Dimensions
One of the most practically useful developments in embedding models in the past two years is Matryoshka Representation Learning (MRL). Traditional embedding models produce a fixed-dimension vector. You get 1536 dimensions from OpenAI's text-embedding-3-small, or 3072 from text-embedding-3-large. There is no in-between.
MRL trains a model so that the first N dimensions of its output vector are already a good low-dimensional embedding of the text. You can truncate a 3072-dimension vector to 256 dimensions with only 2 to 3% quality loss. The full-dimension vector is the most accurate representation. The truncated vector is smaller, cheaper to store, and faster to search — at a very small precision cost.
This matters for production systems with large corpora. A knowledge base of 10 million chunks at 3072 dimensions requires roughly 115GB of raw vector storage. At 768 dimensions, that drops to about 29GB. At 256 dimensions, to under 10GB — with only a few percent accuracy loss.
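The arithmetic behind those figures is just chunk count times dimensions times four bytes per float32 value; a quick sketch:
def raw_vector_storage_gib(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 vector storage, ignoring index overhead and metadata."""
    return num_chunks * dims * bytes_per_value / (1024 ** 3)
for dims in (3072, 768, 256):
    print(f"{dims} dims: {raw_vector_storage_gib(10_000_000, dims):.1f} GiB")
# 3072 dims: 114.4 GiB, 768 dims: 28.6 GiB, 256 dims: 9.5 GiB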
import openai
import numpy as np
client = openai.OpenAI(api_key="your-openai-api-key")
text = "The refund policy allows returns within 30 days of purchase."
# Full precision embedding: 3072 dimensions
full_response = client.embeddings.create(
model="text-embedding-3-large",
input=text,
dimensions=3072
)
full_vec = np.array(full_response.data[0].embedding)
# Reduced to 256 dimensions via Matryoshka truncation
# Only 2-3% quality loss — OpenAI handles the truncation server-side
small_response = client.embeddings.create(
model="text-embedding-3-large",
input=text,
dimensions=256
)
small_vec = np.array(small_response.data[0].embedding)
print(f"Full vector shape: {full_vec.shape}") # (3072,)
print(f"Small vector shape: {small_vec.shape}") # (256,)Models supporting MRL in 2026: Gemini Embedding 2, Voyage 4, Cohere Embed v4, OpenAI text-embedding-3-*, Jina v5, Nomic v1.5. For any new production system indexing at scale, MRL-capable models are the default choice.
The 2026 Embedding Model Landscape
The MTEB leaderboard — the Massive Text Embedding Benchmark maintained by Hugging Face — is the standard for comparing embedding models across retrieval, clustering, classification, and semantic similarity tasks. Scores here reflect early 2026. Check the leaderboard directly before making a final architecture decision, as new models submit results monthly.
| Model | Provider | MTEB Score | Dimensions | Context Window | Cost | Best For |
|---|---|---|---|---|---|---|
| Gemini Embedding 001 | Google | 68.32 (retrieval: 67.71) | 3072 | 2,048 tokens | API pricing | Highest retrieval accuracy, GCP ecosystem |
| Qwen3-Embedding-8B | Alibaba | 70.58 (multilingual) | 32 to 4096 (flexible) | 32,768 tokens | Self-host (Apache 2.0) | Multilingual, long documents, self-hosted |
| Voyage-3-large | Voyage AI | 68.1+ retrieval | 1024 | 32,000 tokens | $0.06/M tokens | Strong quality, half OpenAI large cost |
| text-embedding-3-large | OpenAI | 64.6 | 256 to 3072 (MRL) | 8,192 tokens | $0.13/M tokens | OpenAI ecosystem teams |
| text-embedding-3-small | OpenAI | 62.26 | 512 to 1536 (MRL) | 8,192 tokens | $0.02/M tokens | Budget-sensitive, high-volume indexing |
| BGE-M3 | BAAI | 63.0 | 1024 | 8,192 tokens | Free (Apache 2.0) | Dense + sparse + multi-vector in one model |
| Cohere Embed v4 | Cohere | 65.2 | variable (MRL) | 128,000 tokens | $0.12/M tokens | Multimodal (text + images), long documents |
| NV-Embed-v2 | NVIDIA | ~68+ | 4096 | 32,768 tokens | Self-host (non-commercial) | Enterprise self-hosted, non-commercial |
Sources: MTEB leaderboard April 2026, Premai.io benchmark guide, Awesome Agents leaderboard.
MTEB scores on public datasets do not always translate to your corpus. A model that tops the leaderboard on Wikipedia and legal documents might perform differently on your internal ticketing system or product catalog. Run your own retrieval evaluation — measure precision@10 and recall@10 on 50 to 100 sample queries from your actual domain — before committing to an embedding model at scale.
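A minimal sketch of that evaluation loop. The search function and the annotated query set are placeholders for your own retrieval pipeline and labeled data.
def precision_recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> tuple[float, float]:
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
def evaluate(eval_set: list[tuple[str, set[str]]], search, k: int = 10) -> dict:
    """eval_set: (query, set of relevant chunk IDs) pairs from your own domain."""
    precisions, recalls = [], []
    for query, relevant_ids in eval_set:
        retrieved_ids = search(query, top_k=k)   # your retrieval pipeline
        p, r = precision_recall_at_k(retrieved_ids, relevant_ids, k)
        precisions.append(p)
        recalls.append(r)
    return {
        f"precision@{k}": sum(precisions) / len(precisions),
        f"recall@{k}": sum(recalls) / len(recalls),
    }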
A few selection notes grounded in 2026 production patterns.
Google's Gemini Embedding 001 holds the top spot on MTEB retrieval tasks in early 2026, scoring 67.71 on retrieval and 85.13 on pair classification. The case against it is Google ecosystem lock-in and a 2,048-token context window that constrains long document indexing.
Voyage AI was acquired by MongoDB for $220M in February 2025. Voyage-3-large at $0.06 per million tokens offers strong retrieval quality at roughly half the cost of OpenAI's text-embedding-3-large, with explicit query and document input type support for asymmetric retrieval.
BGE-M3 from BAAI is the standout open-source option because it handles dense retrieval, sparse retrieval, and multi-vector retrieval in a single model. Everything else requires a separate BM25 index alongside your dense vectors. BGE-M3 unifies both. For teams self-hosting with data privacy requirements, this eliminates significant infrastructure complexity.
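A sketch of what that unification looks like in code, using the FlagEmbedding package BAAI ships alongside the model; check the current README for exact parameter names before relying on this.
from FlagEmbedding import BGEM3FlagModel  # pip install FlagEmbedding
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
output = model.encode(
    ["Can I get a refund after 30 days?"],
    return_dense=True,    # 1024-dimension dense vector for semantic search
    return_sparse=True,   # per-token lexical weights, a BM25-like sparse signal
)
print(output["dense_vecs"].shape)    # (1, 1024)
print(output["lexical_weights"][0])  # {token_id: weight, ...}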
Qwen3-Embedding-8B from Alibaba leads the multilingual MTEB leaderboard with a 70.58 score, supports flexible dimensions from 32 to 4096, and carries a 32K token context window. For multilingual corpora and long documents, it is the current benchmark. Released under Apache 2.0, it is commercially usable on self-hosted infrastructure.
OpenAI text-embedding-3-small at $0.02 per million tokens is hard to beat on price-to-performance for budget-sensitive applications. Its MTEB score of 62.26 is adequate for many production RAG systems even if it trails newer models at the leaderboard frontier.
Domain-Specific Embedding Models
General-purpose embedding models perform well across a broad range of domains. For narrow, specialized domains, domain-specific models consistently outperform general ones, and fine-tuning can add another 10 to 30% retrieval improvement.
| Domain | Recommended Approach |
|---|---|
| Code and technical docs | Voyage-3-code or fine-tuned BGE on code corpora |
| Legal text | Fine-tuned BGE or Qwen3 on legal corpus |
| Medical and clinical | BioGPT embeddings or PubMedBERT-based models |
| Financial documents | Voyage-3-finance or fine-tuned models on SEC filings |
| Multilingual enterprise | Qwen3-Embedding-8B (70.58 multilingual MTEB) |
| Mixed text and images | Cohere Embed v4 or Voyage Multimodal 3.5 |
Sources: Ailog.fr embedding guide 2026, Premai.io 2026 model benchmark
Fine-tuning on your own corpus does not require millions of training examples. A few thousand query-document pairs from real production traffic, annotated for relevance, are enough to meaningfully shift the model's vector space toward your domain's terminology and structure.
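A minimal sketch of that fine-tuning recipe with sentence-transformers. The base model, batch size, and the two toy pairs are illustrative; MultipleNegativesRankingLoss treats the other passages in each batch as negatives, which is why real training data should contain thousands of distinct pairs.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# In practice: a few thousand (query, relevant passage) pairs mined from production traffic
train_examples = [
    InputExample(texts=["can I get a refund after 30 days",
                        "Refunds are processed within 30 days of the original purchase date."]),
    InputExample(texts=["reset account password",
                        "To reset your password, open Settings and choose Security."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("bge-base-finetuned-your-domain")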
How the Embedding Layer Connects to the Rest of the Pipeline
The embedding model is the first technical decision in a RAG system and the one with the most downstream consequences.
Chunking and embedding interact. If your chunks are semantically broken — cut mid-sentence or mid-argument — no embedding model can produce a coherent vector for them. The embedding model can only represent what is in the text it receives. Good chunking produces semantically complete chunks. Good embeddings produce accurate vectors for those chunks. The two must be designed together, not independently.
Retrieval quality has a ceiling set by embeddings. A reranker can reorder the candidates returned by bi-encoder retrieval. It cannot surface a chunk that the bi-encoder missed entirely. If the embedding model does not place the right chunks near the query vector, reranking cannot help. Fix the embedding model before adding a reranker.
Index and query with the same model, same input type. This sounds obvious. Teams violate it regularly during infrastructure migrations, during A/B tests that swap the query model without re-indexing, and when the embedding API changes its model defaults. Retrieval works only when documents and queries live in the same vector space — encoded by the same model version, with the appropriate input types, so their geometry is comparable. A version mismatch between index and query model produces garbage similarity scores that are hard to diagnose because they look like normal retrieval behavior.
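One cheap guard against that failure: record the embedding model and dimension count in the index metadata and assert the match before embedding any query. A sketch with a plain dict standing in for your vector database's collection metadata:
index_metadata = {"embedding_model": "text-embedding-3-small", "dimensions": 1536}
def assert_query_model_matches(query_model: str, query_dims: int) -> None:
    if (query_model != index_metadata["embedding_model"]
            or query_dims != index_metadata["dimensions"]):
        raise ValueError(
            f"Query embeddings ({query_model}, {query_dims} dims) do not match the index "
            f"({index_metadata['embedding_model']}, {index_metadata['dimensions']} dims). "
            "Re-index or switch the query model before serving traffic."
        )
assert_query_model_matches("text-embedding-3-small", 1536)  # passes silently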
Cost scales with corpus size, not query volume. Embedding is a one-time cost at indexing. Every chunk in your knowledge base gets embedded once and stored. Queries get embedded at runtime, but queries are small — one vector per query, not one per document. The expensive moment is the initial index build and every re-index when documents change. Budget accordingly.
from openai import OpenAI
import numpy as np
client = OpenAI(api_key="your-openai-api-key")
def embed_for_indexing(texts: list[str], batch_size: int = 100) -> list[list[float]]:
"""
Embed document chunks for indexing. Process in batches to stay within rate limits.
Cost: $0.02 per million tokens (text-embedding-3-small)
"""
all_embeddings = []
for i in range(0, len(texts), batch_size):
batch = texts[i : i + batch_size]
response = client.embeddings.create(
model="text-embedding-3-small",
input=batch
)
batch_embeddings = [item.embedding for item in response.data]
all_embeddings.extend(batch_embeddings)
print(f"Embedded {min(i + batch_size, len(texts))} / {len(texts)} chunks")
return all_embeddings
def embed_query(query: str) -> list[float]:
"""
Embed a single user query at request time.
Same model as indexing — this is non-negotiable.
"""
response = client.embeddings.create(
model="text-embedding-3-small", # Must match indexing model
input=query
)
return response.data[0].embedding
def retrieval_quality_check(
query: str,
retrieved_chunks: list[dict],
min_similarity: float = 0.70
) -> dict:
"""
Post-retrieval sanity check on similarity scores.
If top results are below threshold, retrieval is likely failing.
"""
query_vec = np.array(embed_query(query))
issues = []
for i, chunk in enumerate(retrieved_chunks[:5]):
chunk_vec = np.array(chunk["embedding"])
sim = float(np.dot(query_vec, chunk_vec)) # assumes normalized vectors
if sim < min_similarity:
issues.append({
"rank": i + 1,
"similarity": sim,
"warning": f"Low similarity at rank {i+1} — retrieval may be degraded"
})
return {
"query": query,
"top_similarity": float(np.dot(query_vec, np.array(retrieved_chunks[0]["embedding"]))),
"issues": issues,
"retrieval_healthy": len(issues) == 0
}
Where Embedding Quality Breaks Down
Knowing where embeddings fall short is as important as knowing what they do well.
Rare domain terms. A model trained on general web text has never seen your internal product codenames, proprietary terminology, or niche scientific nomenclature. These terms may embed near generic words that share surface-level similarity but not meaning. The fix is domain fine-tuning or hybrid search that adds BM25 to catch exact term matches.
Very short queries. A three-word query produces a vector with far less semantic information than a long document passage. The model has less to work with and the resulting vector is a weaker representation of the user's actual intent. Query expansion — using an LLM to rewrite a short query into a longer, more specific version before embedding — consistently improves retrieval on short queries.
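A minimal sketch of that expansion step with the OpenAI chat API used elsewhere in this article; the model name and prompt wording are illustrative.
from openai import OpenAI
client = OpenAI(api_key="your-openai-api-key")
def expand_query(short_query: str) -> str:
    """Rewrite a terse query into a fuller sentence before embedding it."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's search query as one specific, detailed question "
                "that preserves its intent. Return only the rewritten query."
            )},
            {"role": "user", "content": short_query},
        ],
    )
    return response.choices[0].message.content.strip()
expanded = expand_query("refund late")
# The expanded text, not the original three words, is what gets embedded and searched.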
Cross-lingual mismatch. If your documents are in English and your users ask questions in Hindi or Tamil, a monolingual English embedding model will produce poor cross-lingual similarity scores. For multilingual RAG, Qwen3-Embedding-8B and BGE-M3 both handle 100-plus languages with strong cross-lingual alignment. General-purpose English models are not a substitute.
Mixed modalities. Text embeddings cannot directly compare a user's text query against an image or a chart from a PDF. For knowledge bases that include figures, tables rendered as images, or other non-text content, multimodal embedding models like Cohere Embed v4 or Voyage Multimodal 3.5 are required. Otherwise, that content is invisible to retrieval entirely.
Where to Go From Here
This is the final article in the RAG series. At this point you have the complete picture.
What Is RAG in AI establishes the three-phase loop — index, retrieve, generate — that everything in this series builds on.
RAG vs Fine-Tuning answers the first architectural question: whether you need retrieval at all, or whether the problem is behavioral and calls for weight updates instead.
RAG Architecture Explained covers the full production pipeline from document parsing through agentic multi-hop retrieval and evaluation with RAGAS — with embedding model selection in the context of every other component it affects.
Vector Database in RAG goes deep on HNSW indexing, how vector databases store and retrieve the embedding vectors produced here, and the cost comparison across Pinecone, Qdrant, Weaviate, Milvus, and pgvector.
Why RAG Fails covers every failure mode in the production pipeline, including the embedding-related failures — domain mismatch, retrieval asymmetry, and vocabulary gaps — and how to fix them systematically.
RAG vs Traditional Search explains why BM25 is not dead, how it complements dense embedding-based retrieval in hybrid search, and where keyword matching handles what semantic vectors miss.
The embedding layer is where semantic understanding enters the pipeline. Every improvement downstream — better reranking, better generation, better evaluation — operates on the candidates that the embedding model makes retrievable. Get this layer right and the rest of the pipeline has something solid to work with. Get it wrong and no amount of reranking or prompt engineering compensates for the candidates that were never retrieved in the first place.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.