How Embeddings Work in RAG: The Complete Guide (2026)
Embeddings are the invisible foundation of every RAG retrieval pipeline. This guide explains what embeddings are, how transformers produce them, why cosine similarity works, the difference between bi-encoders and cross-encoders, how to choose an embedding model in 2026, and where retrieval quality silently breaks down.
Every time a user asks a question in a RAG system, the pipeline converts that question into a list of numbers before anything else happens. The retrieval step — the part that determines which document chunks get passed to the LLM — operates entirely on those numbers. Not on words. Not on meaning as humans understand it. On the geometry of vectors in a high-dimensional space.
If that geometry does not accurately reflect the semantic relationships in your domain, retrieval fails. And if retrieval fails, the LLM generates a confident answer from the wrong context.
This article explains how that geometry works, where it breaks, and which models to use to get it right in 2026.
What an Embedding Actually Is
An embedding is a fixed-length vector — a list of floating-point numbers — that represents the meaning of a piece of text. A short sentence might be represented as a vector with 768 numbers. A paragraph processed by OpenAI's text-embedding-3-large model becomes a vector with 3,072 numbers.
Those numbers are not arbitrary codes. They are coordinates in a high-dimensional space where meaning becomes geometry. Text that means similar things produces vectors that point in similar directions. Text that means different things produces vectors that point in different directions.
The classic illustration from Word2Vec is still useful: if you take the embedding for "king," subtract the embedding for "man," and add the embedding for "woman," you get a vector close to the embedding for "queen." The arithmetic works because the geometric relationships between words in the vector space reflect semantic relationships between concepts.
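The arithmetic is easy to check with off-the-shelf word vectors. Here is a minimal sketch using gensim's pretrained GloVe vectors; the dataset name and download step are gensim conventions, not part of any RAG pipeline.
import gensim.downloader as api
# Downloads ~130MB of pretrained GloVe word vectors on first run
vectors = api.load("glove-wiki-gigaword-100")
# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # top match is 'queen'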
Semantic space (simplified to 2D for illustration): "heart failure" and "cardiac arrest" sit close together, while "bank loan" and "river bank" land far from them and from each other.
Distance between "heart failure" and "cardiac arrest": small
Distance between "bank loan" and "river bank": large
Distance between "heart failure" and "bank loan": large
In a real embedding space, this plays out across 768 to 3,072 dimensions. The semantic relationships are the same. Just harder to draw.
Modern RAG systems use sentence or passage embeddings rather than word embeddings. The embedding model processes an entire chunk of text — a paragraph, a document section — and produces a single vector that represents the meaning of that whole piece. This is what gets stored in the vector database and compared against query vectors at retrieval time.
How Transformers Produce Embeddings
The embedding models used in production RAG systems today are based on the Transformer architecture, introduced in the 2017 paper "Attention is All You Need." Unlike earlier models that processed text sequentially, Transformers use self-attention mechanisms to weigh the importance of different words in relation to each other, regardless of their distance in the sequence.
A sentence goes into the model as a sequence of token IDs. Each token gets an initial vector representation. The self-attention mechanism then updates each token's vector based on its relationship to every other token in the sequence. After multiple attention layers, the model pools these token-level representations into a single fixed-length vector that represents the entire input.
That pooled vector is the embedding.
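Here is a minimal sketch of that tokenize, attend, pool sequence using a small open-source encoder via Hugging Face transformers. The model name is illustrative; any BERT-style sentence encoder follows the same path.
import torch
from transformers import AutoTokenizer, AutoModel
model_name = "sentence-transformers/all-MiniLM-L6-v2"  # small 384-dimension encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
def embed(text: str) -> torch.Tensor:
    # Text -> token IDs
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Self-attention layers produce one contextual vector per token
        token_vectors = model(**inputs).last_hidden_state  # shape (1, seq_len, 384)
    # Mean-pool the token vectors into one fixed-length vector for the whole input
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (token_vectors * mask).sum(dim=1) / mask.sum(dim=1)
    # Normalize to unit length so dot product equals cosine similarity
    return torch.nn.functional.normalize(pooled, p=2, dim=1)[0]
print(embed("Refunds are processed within 30 days.").shape)  # torch.Size([384])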
The quality of the embedding depends on what the model was trained to do. A model trained to predict masked words (like BERT) learns different vector relationships than a model trained to produce semantically similar vectors for paraphrases (like Sentence Transformers), or a model trained specifically for retrieval tasks where short queries should match long document passages.
For RAG, retrieval-optimized training matters. A general-purpose language model embedding is not the same as a retrieval-optimized embedding.
Cosine Similarity: Why Angle Beats Distance
Once you have two vectors — a query vector and a document chunk vector — you need a way to measure how similar they are. Two metrics dominate RAG systems.
Cosine similarity measures the cosine of the angle between two vectors. It ranges from -1 (pointing in opposite directions) to +1 (pointing in the same direction). Two embedding vectors that point in nearly the same direction score close to 1, which signals semantic similarity.
import numpy as np
def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
"""
Measure semantic similarity between two embedding vectors.
Returns a value between -1 and 1. Higher = more similar.
"""
dot_product = np.dot(vec_a, vec_b)
norm_a = np.linalg.norm(vec_a)
norm_b = np.linalg.norm(vec_b)
return dot_product / (norm_a * norm_b)
# When vectors are normalized to unit length (standard practice),
# this simplifies to just the dot product:
def cosine_similarity_normalized(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
return np.dot(vec_a, vec_b)
# Example
query_vec = np.array([0.6, 0.8, 0.0]) # "refund policy"
doc_vec_1 = np.array([0.58, 0.81, 0.05]) # "money back guarantee" — similar direction
doc_vec_2 = np.array([0.9, 0.1, 0.4]) # "annual report filing" — different direction
print(cosine_similarity(query_vec, doc_vec_1)) # ~0.998 — high similarity
print(cosine_similarity(query_vec, doc_vec_2)) # ~0.626 — lower similarity
Why not Euclidean distance? Euclidean distance measures the straight-line distance between two points in the vector space. It is sensitive to the magnitude of vectors, not just their direction. Two documents that express the same idea with different levels of detail will have different-magnitude embeddings if the model is not normalized. Cosine similarity removes this sensitivity by focusing only on direction.
Most sentence embedding models normalize all vectors to unit length, which means cosine similarity reduces to the dot product — a faster computation. OpenAI normalizes all its embedding outputs. Voyage AI models normalize by default. For these models, cosine similarity and dot product are equivalent.
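A quick numerical check of that equivalence, using the same toy vectors as above:
import numpy as np
a = np.array([0.6, 0.8, 0.0])
b = np.array([0.58, 0.81, 0.05])
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a_unit = a / np.linalg.norm(a)   # normalize to unit length
b_unit = b / np.linalg.norm(b)
dot_of_normalized = np.dot(a_unit, b_unit)
print(np.isclose(cosine, dot_of_normalized))  # True: same score, cheaper computation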
Bi-Encoders vs Cross-Encoders
This is the architectural distinction that explains why RAG pipelines have two retrieval stages.
Bi-encoders encode the query and each document independently. The query goes through the model and produces a vector. Each document chunk goes through the same model and produces a vector. Similarity is computed after encoding by comparing the resulting vectors.
The critical property: document vectors can be pre-computed offline and stored in the vector database. At query time, only the query needs to be embedded — a single inference call. Similarity search across millions of stored vectors takes milliseconds with HNSW indexing.
Bi-encoders are particularly effective in scenarios where large-scale, real-time retrieval is necessary, such as in search engines or large knowledge bases. Speed is their defining advantage.
Their limitation: they compress each input into a fixed-size vector independently. The model cannot directly compare the query and document during encoding. It encodes what the query means, encodes what the document means, and then compares those summaries. Nuanced relevance relationships — like detecting that "$500/night" contradicts the query term "cheap" — can be lost in the compression.
Cross-encoders take the query and a candidate document together as a single input. Every token in the query attends to every token in the document through the full transformer attention mechanism. The output is a single relevance score, not two separate vectors.
Cross-encoders read the query and document together in a single pass, catching that "$500/night" contradicts "cheap" and ranking that chunk lower. They are far more accurate at relevance scoring than bi-encoders.
Their limitation: there is no pre-computation. Every query-document pair must run through the full model at query time. Across a knowledge base of one million chunks, this is computationally impossible in real time.
The production solution is a two-stage pipeline:
| Stage | Model Type | Input | Output | Speed | Accuracy |
|---|---|---|---|---|---|
| Retrieval (Stage 1) | Bi-encoder | Query vector vs pre-stored doc vectors | Top 20 to 50 candidates | Milliseconds | Good |
| Reranking (Stage 2) | Cross-encoder | Query + each candidate together | Relevance scores | 50 to 200ms added | High |
| Generation (Stage 3) | LLM | Top 3 to 5 reranked chunks | Answer | 500ms to 2s | Depends on retrieval |
The optimal RAG setup is: bi-encoder retrieval for scalable candidate selection, cross-encoder reranking for accurate relevance scoring, and LLM generation from the reranked top-k. This is the architecture that enterprise-grade RAG systems run in 2026.
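A minimal sketch of that two-stage pattern with the sentence-transformers library. The model names and the three-document corpus are illustrative, not a recommendation.
from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
corpus = [
    "Rooms start at $500 per night during peak season.",
    "Budget rooms are available from $80 per night year-round.",
    "The hotel offers a free airport shuttle service.",
]
query = "cheap hotel room"
# Stage 1: bi-encoder retrieval against pre-computed, normalized document vectors
doc_vecs = bi_encoder.encode(corpus, normalize_embeddings=True)
query_vec = bi_encoder.encode(query, normalize_embeddings=True)
candidate_ids = np.argsort(-(doc_vecs @ query_vec))[:2]   # top candidates by cosine
# Stage 2: cross-encoder reranking of (query, candidate) pairs
pairs = [(query, corpus[i]) for i in candidate_ids]
rerank_scores = cross_encoder.predict(pairs)
best = candidate_ids[int(np.argmax(rerank_scores))]
print(corpus[best])   # expected: the $80/night passage ranks above $500/night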
Asymmetric Retrieval: The Silent Quality Killer
One of the most consistent failure patterns in production RAG pipelines is using an embedding model that was not designed for asymmetric retrieval.
Queries are short. Three to fifteen words. Conversational phrasing. Often a question. Documents are long. Hundreds or thousands of words. Formal or technical prose. They answer questions they were not necessarily written to answer. That mismatch is what asymmetric retrieval means: a retrieval-trained model encodes queries and documents differently so that a short question still lands near the long passage that answers it.
Voyage AI implements this explicitly. When you call their API, you specify input_type="query" or input_type="document". The same text produces a different vector depending on which input type you specify, because the model applies different internal processing to optimize each for its role in retrieval.
import voyageai
client = voyageai.Client(api_key="your-voyage-api-key")
# At indexing time: use document input type for all chunks
doc_embeddings = client.embed(
texts=["Refunds are processed within 30 days of the original purchase date."],
model="voyage-3-large",
input_type="document" # optimized for longer passage representation
)
# At query time: use query input type for user questions
query_embedding = client.embed(
texts=["Can I get a refund after 30 days?"],
model="voyage-3-large",
input_type="query" # optimized for short question representation
)
# These two vectors now exist in an aligned asymmetric space
# The model was trained on this distinction — retrieval quality is higher
Using the same model for both query and document is necessary but not sufficient. Retrieval quality also depends on whether the model was trained for asymmetric retrieval. A model trained on symmetric sentence pairs produces vectors where a short question and a long document passage are not optimally aligned, even if both are processed by the same weights.
BGE-M3 from BAAI handles this through its multi-granularity retrieval training — it was explicitly trained to match queries against passages of varying lengths and styles.
Matryoshka Representation Learning: Flexible Dimensions
One of the most practically useful developments in embedding models in the past two years is Matryoshka Representation Learning (MRL). Traditional embedding models produce a fixed-dimension vector. You get 1536 dimensions from OpenAI's text-embedding-3-small, or 3072 from text-embedding-3-large. There is no in-between.
MRL trains a model so that the first N dimensions of its output vector are already a good low-dimensional embedding of the text. You can truncate a 3072-dimension vector to 256 dimensions with only 2 to 3% quality loss. The full-dimension vector is the most accurate representation. The truncated vector is smaller, cheaper to store, and faster to search — at a very small precision cost.
This matters for production systems with large corpora. A knowledge base of 10 million chunks at 3072 dimensions requires roughly 115GB of raw vector storage. At 768 dimensions, that drops to about 29GB. At 256 dimensions, to under 10GB — with only a few percent accuracy loss.
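The arithmetic behind those figures is just chunk count times dimensions times four bytes per float32 value; a quick sketch:
def raw_vector_storage_gib(num_chunks: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw float32 vector storage, ignoring index overhead and metadata."""
    return num_chunks * dims * bytes_per_value / (1024 ** 3)
for dims in (3072, 768, 256):
    print(f"{dims} dims: {raw_vector_storage_gib(10_000_000, dims):.1f} GiB")
# 3072 dims: 114.4 GiB, 768 dims: 28.6 GiB, 256 dims: 9.5 GiB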
import openai
import numpy as np
client = openai.OpenAI(api_key="your-openai-api-key")
text = "The refund policy allows returns within 30 days of purchase."
# Full precision embedding: 3072 dimensions
full_response = client.embeddings.create(
model="text-embedding-3-large",
input=text,
dimensions=3072
)
full_vec = np.array(full_response.data[0].embedding)
# Reduced to 256 dimensions via Matryoshka truncation
# Only 2-3% quality loss — OpenAI handles the truncation server-side
small_response = client.embeddings.create(
model="text-embedding-3-large",
input=text,
dimensions=256
)
small_vec = np.array(small_response.data[0].embedding)
print(f"Full vector shape: {full_vec.shape}") # (3072,)
print(f"Small vector shape: {small_vec.shape}") # (256,)Models supporting MRL in 2026: Gemini Embedding 2, Voyage 4, Cohere Embed v4, OpenAI text-embedding-3-*, Jina v5, Nomic v1.5. For any new production system indexing at scale, MRL-capable models are the default choice.
The 2026 Embedding Model Landscape
The MTEB leaderboard — the Massive Text Embedding Benchmark maintained by Hugging Face — is the standard for comparing embedding models across retrieval, clustering, classification, and semantic similarity tasks. Scores here reflect early 2026. Check the leaderboard directly before making a final architecture decision, as new models submit results monthly.
| Model | Provider | MTEB Score | Dimensions | Context Window | Cost | Best For |
|---|---|---|---|---|---|---|
| Gemini Embedding 001 | Google | 68.32 (retrieval: 67.71) | 3072 | 2,048 tokens | API pricing | Highest retrieval accuracy, GCP ecosystem |
| Qwen3-Embedding-8B | Alibaba | 70.58 (multilingual) | 32 to 4096 (flexible) | 32,768 tokens | Self-host (Apache 2.0) | Multilingual, long documents, self-hosted |
| Voyage-3-large | Voyage AI | 68.1+ retrieval | 1024 | 32,000 tokens | $0.06/M tokens | Strong quality, half OpenAI large cost |
| text-embedding-3-large | OpenAI | 64.6 | 256 to 3072 (MRL) | 8,192 tokens | $0.13/M tokens | OpenAI ecosystem teams |
| text-embedding-3-small | OpenAI | 62.26 | 512 to 1536 (MRL) | 8,192 tokens | $0.02/M tokens | Budget-sensitive, high-volume indexing |
| BGE-M3 | BAAI | 63.0 | 1024 | 8,192 tokens | Free (Apache 2.0) | Dense + sparse + multi-vector in one model |
| Cohere Embed v4 | Cohere | 65.2 | variable (MRL) | 128,000 tokens | $0.12/M tokens | Multimodal (text + images), long documents |
| NV-Embed-v2 | NVIDIA | ~68+ | 4096 | 32,768 tokens | Self-host (non-commercial) | Enterprise self-hosted, non-commercial |
Sources: MTEB leaderboard April 2026, Premai.io benchmark guide, Awesome Agents leaderboard.
MTEB scores on public datasets do not always translate to your corpus. A model that tops the leaderboard on Wikipedia and legal documents might perform differently on your internal ticketing system or product catalog. Run your own retrieval evaluation — measure precision@10 and recall@10 on 50 to 100 sample queries from your actual domain — before committing to an embedding model at scale.
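A minimal sketch of that evaluation loop. The search function and the annotated query set are placeholders for your own retrieval pipeline and labeled data.
def precision_recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> tuple[float, float]:
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
def evaluate(eval_set: list[tuple[str, set[str]]], search, k: int = 10) -> dict:
    """eval_set: (query, set of relevant chunk IDs) pairs from your own domain."""
    precisions, recalls = [], []
    for query, relevant_ids in eval_set:
        retrieved_ids = search(query, top_k=k)   # your retrieval pipeline
        p, r = precision_recall_at_k(retrieved_ids, relevant_ids, k)
        precisions.append(p)
        recalls.append(r)
    return {
        f"precision@{k}": sum(precisions) / len(precisions),
        f"recall@{k}": sum(recalls) / len(recalls),
    }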
A few selection notes grounded in 2026 production patterns.
Google's Gemini Embedding 001 holds the top spot on MTEB retrieval tasks in early 2026, scoring 67.71 on retrieval and 85.13 on pair classification. The case against it is Google ecosystem lock-in and a 2,048-token context window that constrains long document indexing.
Voyage AI was acquired by MongoDB for $220M in February 2025. Voyage-3-large at $0.06 per million tokens offers strong retrieval quality at roughly half the cost of OpenAI's text-embedding-3-large, with explicit query and document input type support for asymmetric retrieval.
BGE-M3 from BAAI is the standout open-source option because it handles dense retrieval, sparse retrieval, and multi-vector retrieval in a single model. Everything else requires a separate BM25 index alongside your dense vectors. BGE-M3 unifies both. For teams self-hosting with data privacy requirements, this eliminates significant infrastructure complexity.
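A sketch of what that unification looks like in code, using the FlagEmbedding package BAAI ships alongside the model; check the current README for exact parameter names before relying on this.
from FlagEmbedding import BGEM3FlagModel  # pip install FlagEmbedding
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
output = model.encode(
    ["Can I get a refund after 30 days?"],
    return_dense=True,    # 1024-dimension dense vector for semantic search
    return_sparse=True,   # per-token lexical weights, a BM25-like sparse signal
)
print(output["dense_vecs"].shape)    # (1, 1024)
print(output["lexical_weights"][0])  # {token_id: weight, ...}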
Qwen3-Embedding-8B from Alibaba leads the multilingual MTEB leaderboard with a 70.58 score, supports flexible dimensions from 32 to 4096, and carries a 32K token context window. For multilingual corpora and long documents, it is the current benchmark. Released under Apache 2.0, it is commercially usable on self-hosted infrastructure.
OpenAI text-embedding-3-small at $0.02 per million tokens is hard to beat on price-to-performance for budget-sensitive applications. Its MTEB score of 62.26 is adequate for many production RAG systems even if it trails newer models at the leaderboard frontier.
Domain-Specific Embedding Models
General-purpose embedding models perform well across a broad range of domains. For narrow, specialized domains, domain-specific models consistently outperform general ones, and fine-tuning can add another 10 to 30% retrieval improvement.
| Domain | Recommended Approach |
|---|---|
| Code and technical docs | Voyage-3-code or fine-tuned BGE on code corpora |
| Legal text | Fine-tuned BGE or Qwen3 on legal corpus |
| Medical and clinical | BioGPT embeddings or PubMedBERT-based models |
| Financial documents | Voyage-3-finance or fine-tuned models on SEC filings |
| Multilingual enterprise | Qwen3-Embedding-8B (70.58 multilingual MTEB) |
| Mixed text and images | Cohere Embed v4 or Voyage Multimodal 3.5 |
Sources: Ailog.fr embedding guide 2026, Premai.io 2026 model benchmark
Fine-tuning on your own corpus does not require millions of training examples. A few thousand query-document pairs from real production traffic, annotated for relevance, are enough to meaningfully shift the model's vector space toward your domain's terminology and structure.
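A minimal sketch of that fine-tuning recipe with sentence-transformers. The base model, batch size, and the two toy pairs are illustrative; MultipleNegativesRankingLoss treats the other passages in each batch as negatives, which is why real training data should contain thousands of distinct pairs.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
# In practice: a few thousand (query, relevant passage) pairs mined from production traffic
train_examples = [
    InputExample(texts=["can I get a refund after 30 days",
                        "Refunds are processed within 30 days of the original purchase date."]),
    InputExample(texts=["reset account password",
                        "To reset your password, open Settings and choose Security."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
model.save("bge-base-finetuned-your-domain")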
How the Embedding Layer Connects to the Rest of the Pipeline
The embedding model is the first technical decision in a RAG system and the one with the most downstream consequences.
Chunking and embedding interact. If your chunks are semantically broken — cut mid-sentence or mid-argument — no embedding model can produce a coherent vector for them. The embedding model can only represent what is in the text it receives. Good chunking produces semantically complete chunks. Good embeddings produce accurate vectors for those chunks. The two must be designed together, not independently.
Retrieval quality has a ceiling set by embeddings. A reranker can reorder the candidates returned by bi-encoder retrieval. It cannot surface a chunk that the bi-encoder missed entirely. If the embedding model does not place the right chunks near the query vector, reranking cannot help. Fix the embedding model before adding a reranker.
Index and query with the same model, same input type. This sounds obvious. Teams violate it regularly during infrastructure migrations, during A/B tests that swap the query model without re-indexing, and when the embedding API changes its model defaults. Retrieval works only when documents and queries live in the same vector space — encoded by the same model version, with the appropriate input types, so their geometry is comparable. A version mismatch between index and query model produces garbage similarity scores that are hard to diagnose because they look like normal retrieval behavior.
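One cheap guard against that failure: record the embedding model and dimension count in the index metadata and assert the match before embedding any query. A sketch with a plain dict standing in for your vector database's collection metadata:
index_metadata = {"embedding_model": "text-embedding-3-small", "dimensions": 1536}
def assert_query_model_matches(query_model: str, query_dims: int) -> None:
    if (query_model != index_metadata["embedding_model"]
            or query_dims != index_metadata["dimensions"]):
        raise ValueError(
            f"Query embeddings ({query_model}, {query_dims} dims) do not match the index "
            f"({index_metadata['embedding_model']}, {index_metadata['dimensions']} dims). "
            "Re-index or switch the query model before serving traffic."
        )
assert_query_model_matches("text-embedding-3-small", 1536)  # passes silently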
Cost scales with corpus size, not query volume. Embedding is a one-time cost at indexing. Every chunk in your knowledge base gets embedded once and stored. Queries get embedded at runtime, but queries are small — one vector per query, not one per document. The expensive moment is the initial index build and every re-index when documents change. Budget accordingly.
from openai import OpenAI
import numpy as np
client = OpenAI(api_key="your-openai-api-key")
def embed_for_indexing(texts: list[str], batch_size: int = 100) -> list[list[float]]:
"""
Embed document chunks for indexing. Process in batches to stay within rate limits.
Cost: $0.02 per million tokens (text-embedding-3-small)
"""
all_embeddings = []
for i in range(0, len(texts), batch_size):
batch = texts[i : i + batch_size]
response = client.embeddings.create(
model="text-embedding-3-small",
input=batch
)
batch_embeddings = [item.embedding for item in response.data]
all_embeddings.extend(batch_embeddings)
print(f"Embedded {min(i + batch_size, len(texts))} / {len(texts)} chunks")
return all_embeddings
def embed_query(query: str) -> list[float]:
"""
Embed a single user query at request time.
Same model as indexing — this is non-negotiable.
"""
response = client.embeddings.create(
model="text-embedding-3-small", # Must match indexing model
input=query
)
return response.data[0].embedding
def retrieval_quality_check(
query: str,
retrieved_chunks: list[dict],
min_similarity: float = 0.70
) -> dict:
"""
Post-retrieval sanity check on similarity scores.
If top results are below threshold, retrieval is likely failing.
"""
query_vec = np.array(embed_query(query))
issues = []
for i, chunk in enumerate(retrieved_chunks[:5]):
chunk_vec = np.array(chunk["embedding"])
sim = float(np.dot(query_vec, chunk_vec)) # assumes normalized vectors
if sim < min_similarity:
issues.append({
"rank": i + 1,
"similarity": sim,
"warning": f"Low similarity at rank {i+1} — retrieval may be degraded"
})
return {
"query": query,
"top_similarity": float(np.dot(query_vec, np.array(retrieved_chunks[0]["embedding"]))),
"issues": issues,
"retrieval_healthy": len(issues) == 0
}
Where Embedding Quality Breaks Down
Knowing where embeddings fall short is as important as knowing what they do well.
Rare domain terms. A model trained on general web text has never seen your internal product codenames, proprietary terminology, or niche scientific nomenclature. These terms may embed near generic words that share surface-level similarity but not meaning. The fix is domain fine-tuning or hybrid search that adds BM25 to catch exact term matches.
Very short queries. A three-word query produces a vector with far less semantic information than a long document passage. The model has less to work with and the resulting vector is a weaker representation of the user's actual intent. Query expansion — using an LLM to rewrite a short query into a longer, more specific version before embedding — consistently improves retrieval on short queries.
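A minimal sketch of that expansion step with the OpenAI chat API used elsewhere in this article; the model name and prompt wording are illustrative.
from openai import OpenAI
client = OpenAI(api_key="your-openai-api-key")
def expand_query(short_query: str) -> str:
    """Rewrite a terse query into a fuller sentence before embedding it."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's search query as one specific, detailed question "
                "that preserves its intent. Return only the rewritten query."
            )},
            {"role": "user", "content": short_query},
        ],
    )
    return response.choices[0].message.content.strip()
expanded = expand_query("refund late")
# The expanded text, not the original three words, is what gets embedded and searched.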
Cross-lingual mismatch. If your documents are in English and your users ask questions in Hindi or Tamil, a monolingual English embedding model will produce poor cross-lingual similarity scores. For multilingual RAG, Qwen3-Embedding-8B and BGE-M3 both handle 100-plus languages with strong cross-lingual alignment. General-purpose English models are not a substitute.
Mixed modalities. Text embeddings cannot directly compare a user's text query against an image or a chart from a PDF. For knowledge bases that include figures, tables rendered as images, or other non-text content, multimodal embedding models like Cohere Embed v4 or Voyage Multimodal 3.5 are required. Otherwise, that content is invisible to retrieval entirely.
Where to Go From Here
This is the final article in the RAG series. At this point you have the complete picture.
What Is RAG in AI establishes the three-phase loop — index, retrieve, generate — that everything in this series builds on.
RAG vs Fine-Tuning answers the first architectural question: whether you need retrieval at all, or whether the problem is behavioral and calls for weight updates instead.
RAG Architecture Explained covers the full production pipeline from document parsing through agentic multi-hop retrieval and evaluation with RAGAS — with embedding model selection in the context of every other component it affects.
Vector Database in RAG goes deep on HNSW indexing, how vector databases store and retrieve the embedding vectors produced here, and the cost comparison across Pinecone, Qdrant, Weaviate, Milvus, and pgvector.
Why RAG Fails covers every failure mode in the production pipeline, including the embedding-related failures — domain mismatch, retrieval asymmetry, and vocabulary gaps — and how to fix them systematically.
RAG vs Traditional Search explains why BM25 is not dead, how it complements dense embedding-based retrieval in hybrid search, and where keyword matching handles what semantic vectors miss.
The embedding layer is where semantic understanding enters the pipeline. Every improvement downstream — better reranking, better generation, better evaluation — operates on the candidates that the embedding model makes retrievable. Get this layer right and the rest of the pipeline has something solid to work with. Get it wrong and no amount of reranking or prompt engineering compensates for the candidates that were never retrieved in the first place.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.