Why Your Vector Search Returns 'Close But Not Quite Right' Results
Most RAG pipelines stop at the dual encoder. That is the mistake. This guide explains why modern search systems chain dual encoders and cross-encoders together, how each one works, and how to implement the two-stage pipeline in Python.
I spent a week fine-tuning an embedding model for a client search project. The retrieval looked fast. The cosine scores looked decent. But when we tested it with real queries, the top results were always a little off. Not wrong enough to alarm anyone, but not useful enough to ship. The model kept surfacing documents that shared vocabulary with the query but missed what the user actually needed.
That experience sent me down a rabbit hole. The problem was not the embedding model. The problem was that we stopped too early in the pipeline. We were using one component when the architecture requires two.
This article is about that second component, why it exists, and why serious search systems never skip it.
The Gap Between Fast Search and Accurate Search
Every search system faces the same core tension. Speed scales. Accuracy does not, at least not in the same direction.
When a user types a query into a search bar, they expect a response in milliseconds. Running a deep computation against every single document in a corpus of one million items is not an option within that budget.
So the question becomes: how do you get both speed and accuracy?
The answer is not a single smarter model. The answer is two models that specialize in different jobs.
Most production RAG pipelines that feel "almost right" are single-stage pipelines. The dual encoder retrieves candidates. Nothing reranks them. The result is fast but imprecise.
What a Dual Encoder Actually Does
A dual encoder, often called a bi-encoder or a two-tower model, encodes the query and the document independently. The two towers can be separate transformer networks, although many models, including the one used below, share a single set of weights for both sides. Each side produces a fixed-size vector, commonly 384 or 768 dimensions depending on the model. Then the system computes the cosine similarity between those two vectors.
That is the entire comparison. A single number derived from two independent embeddings.
The reason this is fast comes down to precomputation. You can encode every document in your corpus ahead of time and store those vectors in a vector index like FAISS or Pinecone. When a query arrives, you encode only the query, which takes a few milliseconds, and then run an approximate nearest-neighbor search against the precomputed vectors. With a good index, the query-time cost grows far more slowly than the corpus does, so it stays low even at millions of documents.
Here is what that looks like in Python using Sentence Transformers:
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
# Precompute and store these at index time
documents = [
    "Python is a high-level programming language.",
    "Transformer models revolutionized NLP benchmarks.",
    "Cosine similarity measures the angle between two vectors.",
    "RAG systems combine retrieval with language model generation.",
]
doc_embeddings = model.encode(documents, convert_to_numpy=True)
# At query time, only this part runs
query = "How does vector similarity work in search?"
query_embedding = model.encode(query, convert_to_numpy=True)
# Compute cosine similarity against all stored vectors
scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
ranked = sorted(zip(scores, documents), reverse=True)
for score, doc in ranked[:3]:
    print(f"{score:.4f} | {doc}")
This runs in milliseconds on millions of documents with the right index. The tradeoff is that the query and document never actually interact. Each gets summarized into one vector independently, and then those summaries are compared. That independent compression is where meaning gets lost.
As the Towards Data Science deep dive on cross-encoders puts it: encoding a sentence is like pooling individual token embeddings into a single blurry vector. The word "quantum" and the word "speedup" both collapse into an average. The richness disappears.
A dual encoder cannot model the relationship between a query and a document. It models each one in isolation. This is why it misses nuanced relevance even when vocabulary overlaps.
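To make the "blurry vector" point concrete, here is a deliberately exaggerated toy sketch. The three-dimensional token embeddings and the `embed` helper are made up for illustration. Real encoders contextualize tokens and add positional information before pooling, so the collapse is never this total, but the direction of the information loss is the same.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    n = len(token_vectors)
    dims = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dims)]

# Hypothetical static token embeddings, invented for this example.
toy_embeddings = {
    "dog": [1.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0],
    "man": [0.0, 0.0, 1.0],
}

def embed(sentence):
    """Embed a sentence by mean-pooling its (toy) token vectors."""
    tokens = sentence.lower().split()
    return mean_pool([toy_embeddings[t] for t in tokens])

a = embed("dog bites man")
b = embed("man bites dog")

# Word order carries the meaning here, but pooling throws it away.
print(a == b)
```

Two sentences with opposite meanings end up with the same pooled vector in this toy setup, which is the failure mode a second-stage model is meant to catch.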
The Problem That Dual Encoders Cannot Solve
Imagine a user searches for "side effects of stopping medication suddenly." A dual encoder might retrieve documents about medication dosage, drug interactions, or medication schedules because those share terms. What it cannot reliably do is distinguish which document actually answers the precise safety question the user is asking.
That distinction requires the model to read both the query and the document at the same time and understand how one speaks to the other.
That is exactly what a cross-encoder does.
A cross-encoder takes the query and a candidate document as a single concatenated input, structured as [CLS] query [SEP] document [SEP]. The full transformer runs on this joint sequence. Every query token attends to every document token across all layers. The model outputs a single relevance score for the pair. Depending on the model and its configuration, that score is either a raw logit or a value squashed into the range 0 to 1.
Because every token in the query can attend to every token in the document, the model captures nuance that a dual encoder simply cannot. It understands negation. It catches context. It scores documents based on whether they actually answer the question, not whether they share vocabulary with it.
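The joint input format can be sketched without any model at all. The `build_pair_input` helper below is hypothetical, and whitespace splitting stands in for a real subword tokenizer. It only shows the BERT-style layout of one query-document pair and why the attention cost grows with the combined length.

```python
def build_pair_input(query, document):
    """Join a query and a document into one BERT-style token sequence.

    Whitespace splitting is a stand-in for a real subword tokenizer.
    """
    query_tokens = query.lower().split()
    doc_tokens = document.lower().split()
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    # Segment ids mark which side each token belongs to (0 = query, 1 = document).
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(
    "stopping medication suddenly",
    "Abrupt discontinuation can cause withdrawal symptoms",
)
print(tokens)
print(segment_ids)

# Self-attention compares every position with every other position,
# so the number of token pairs grows with the square of the joint length.
print(len(tokens) ** 2)
```

Because the document tokens sit in the same sequence as the query tokens, nothing about the document can be encoded before the query is known.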
Here is a practical cross-encoder example using Sentence Transformers:
from sentence_transformers.cross_encoder import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "side effects of stopping medication suddenly"
candidates = [
    "Medication dosage guidelines for common prescriptions.",
    "Abrupt discontinuation of certain medications can cause withdrawal symptoms.",
    "Drug interaction checkers and pharmacy tools.",
    "How to schedule medication reminders on your phone.",
]
# Cross-encoder scores each query-candidate pair jointly
pairs = [[query, doc] for doc in candidates]
scores = reranker.predict(pairs)
ranked = sorted(zip(scores, candidates), reverse=True)
for score, doc in ranked:
    print(f"{score:.4f} | {doc}")
The output reorders the candidate list so the most relevant document rises to the top, even if the dual encoder placed it lower based on vector distance alone.
The cost of this accuracy is real. Every cross-encoder call requires a full transformer forward pass on the combined query and document. You cannot precompute anything. At query time, the complexity scales with the number of candidates you pass through it.
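A back-of-the-envelope calculation shows how quickly that cost adds up. The batch size and per-batch latency below are hypothetical numbers chosen for illustration, not measurements of any particular model or GPU.

```python
import math

def rerank_latency_ms(num_pairs, batch_size, ms_per_batch):
    """Estimate total cross-encoder time when pairs are scored in batches."""
    return math.ceil(num_pairs / batch_size) * ms_per_batch

# Hypothetical figures: 32 pairs per batch, 20 ms per batch.
BATCH_SIZE = 32
MS_PER_BATCH = 20

rerank_100 = rerank_latency_ms(100, BATCH_SIZE, MS_PER_BATCH)
full_corpus = rerank_latency_ms(1_000_000, BATCH_SIZE, MS_PER_BATCH)

print(f"rerank 100 candidates: {rerank_100} ms")
print(f"score 1,000,000 documents: {full_corpus / 1000:.0f} s")
```

Under these assumptions, reranking 100 candidates fits inside an interactive latency budget, while scoring the whole corpus for a single query takes minutes.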
The Sentence Transformers library includes several pretrained cross-encoder models. The cross-encoder/ms-marco-MiniLM-L-6-v2 model is fine-tuned on the MS MARCO passage ranking dataset and works well as a general-purpose reranker for English text.
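One practical detail: reranker scores are not always probabilities. Some cross-encoder checkpoints return raw logits unless an activation function is applied, so it is worth printing a few scores for your specific model before comparing them to a threshold. The sketch below uses made-up scores and a plain sigmoid to show the conversion.

```python
import math

def sigmoid(logit):
    """Map a raw score (logit) to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical raw reranker scores, not real model output.
raw_scores = [8.6, 0.0, -4.3]
probabilities = [sigmoid(s) for s in raw_scores]

for raw, prob in zip(raw_scores, probabilities):
    print(f"{raw:+.1f} -> {prob:.4f}")
```

The sigmoid is monotonic, so it changes the scale of the scores without changing the ranking order.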
Why the Two-Stage Pipeline Solves Both Problems
The solution is not to pick one model over the other. The solution is to use them in sequence, each handling the job it was built for.
Stage one uses the dual encoder to retrieve the top 50 to 100 candidates from the full corpus. This is the recall stage. You want high recall here because any relevant document that misses this cut is gone forever.
Stage two uses the cross-encoder to rerank only those 50 to 100 candidates. The corpus is now small enough that deep joint attention is computationally viable. This is the precision stage. The reranker reads each candidate against the query carefully and reorders the list.
Only the top results from stage two go to the user or to the LLM in a RAG pipeline.
Here is the full two-stage pipeline in one script:
from sentence_transformers import SentenceTransformer
from sentence_transformers.cross_encoder import CrossEncoder
import numpy as np
# Stage 1 model: dual encoder for fast retrieval
retriever = SentenceTransformer("all-MiniLM-L6-v2")
# Stage 2 model: cross-encoder for precise reranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# A small corpus for demonstration
corpus = [
    "Stopping blood pressure medication without a doctor's guidance can be dangerous.",
    "Common blood pressure drugs include ACE inhibitors and beta blockers.",
    "Medication adherence improves outcomes in chronic disease management.",
    "Withdrawal effects vary depending on the type and duration of medication use.",
    "Pharmacists can review drug interactions and dosage schedules.",
    "Abrupt cessation of antidepressants can cause discontinuation syndrome.",
    "Always consult a physician before changing any medication regimen.",
    "Over-the-counter pain relievers are generally safe for short-term use.",
]
query = "is it dangerous to stop taking my medication without a doctor?"
# Stage 1: Retrieve top-k with the dual encoder
doc_embeddings = retriever.encode(corpus, convert_to_numpy=True)
query_embedding = retriever.encode(query, convert_to_numpy=True)
cosine_scores = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
top_k = 5
top_indices = np.argsort(cosine_scores)[::-1][:top_k]
candidates = [corpus[i] for i in top_indices]
print("Stage 1 - Dual Encoder Retrieval:")
for i, doc in enumerate(candidates):
    print(f" {i+1}. {doc}")
# Stage 2: Rerank candidates with the cross-encoder
pairs = [[query, doc] for doc in candidates]
rerank_scores = reranker.predict(pairs)
final_results = sorted(zip(rerank_scores, candidates), reverse=True)
print("\nStage 2 - Cross-Encoder Reranked:")
for score, doc in final_results:
    print(f" {score:.4f} | {doc}")
This pattern matches what Weaviate describes as the production-grade search architecture: the dual encoder casts a wide net, and the cross-encoder picks the best catch from that net.
How the Two Models Compare Side by Side
| Property | Dual Encoder | Cross-Encoder |
|---|---|---|
| Input | Query and document encoded separately | Query and document encoded jointly |
| Output | Two vectors compared via cosine | One relevance score per pair |
| Speed | Very fast, supports precomputation | Slow, no precomputation possible |
| Accuracy | Moderate, misses nuanced relevance | High, captures full query-document interaction |
| Scalability | Scales to millions of documents | Practical only on small candidate sets (50 to 200) |
| Use in pipeline | Stage 1 retrieval | Stage 2 reranking |
| Example models | all-MiniLM-L6-v2, text-embedding-ada-002 | ms-marco-MiniLM-L-6-v2, Cohere Rerank, BGE-Reranker |
| Precompute docs? | Yes | No |
| Token interaction | None | Full cross-attention across all layers |
The Models Worth Knowing for Reranking
Picking the right cross-encoder for your use case matters. Here are the most commonly used options in production:
- cross-encoder/ms-marco-MiniLM-L-6-v2 from Sentence Transformers is the default starting point for English. It is small, fast for a cross-encoder, and fine-tuned on MS MARCO passage ranking data.
- Cohere Rerank is a hosted API that handles the reranking step without managing your own model infrastructure. It works well in RAG pipelines where the dual encoder already handles retrieval.
- BGE-Reranker from BAAI is a strong open-source choice for multilingual workloads. It comes in base and large sizes.
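- cross-encoder/nli-deberta-v3-small is trained for natural language inference rather than passage ranking. It labels a sentence pair as contradiction, entailment, or neutral, which is useful when you need to check whether a passage supports a claim in addition to scoring relevance with a separate reranker.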
Where This Architecture Shows Up in the Real World
RAG Pipelines
LangChain and LlamaIndex both have built-in reranking steps. In a RAG system, the quality of what the LLM receives directly determines the quality of what it generates. Passing the top-k dual encoder results straight to the generator is a common reason RAG outputs feel slightly off. The cross-encoder filters out documents that look relevant on the surface but do not actually answer the question.
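The hand-off from reranker to generator can be sketched in a few lines. The `build_context` helper, its `top_n` and `max_chars` parameters, and the scores below are all hypothetical. A character limit is used here as a crude stand-in for a real token budget.

```python
def build_context(reranked, top_n=5, max_chars=2000):
    """Select the highest-scoring passages and join them into one context string.

    `reranked` is a list of (score, passage) pairs from the reranking stage.
    """
    ordered = sorted(reranked, key=lambda pair: pair[0], reverse=True)
    selected, used = [], 0
    for score, passage in ordered[:top_n]:
        # Stop once the next passage would exceed the size budget.
        if used + len(passage) > max_chars:
            break
        selected.append(passage)
        used += len(passage)
    return "\n\n".join(f"[{i}] {p}" for i, p in enumerate(selected, start=1))

# Hypothetical reranker output: (score, passage) pairs in arbitrary order.
reranked = [
    (0.2, "Pharmacists can review drug interactions."),
    (7.9, "Stopping blood pressure medication without guidance can be dangerous."),
    (5.1, "Abrupt cessation of antidepressants can cause discontinuation syndrome."),
]

context = build_context(reranked, top_n=2)
print(context)
```

Only the two highest-scoring passages reach the prompt, and the low-scoring passage that merely shares vocabulary with the query is left out.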
Semantic Search Products
Consumer search engines, legal document retrieval, medical knowledge bases, and enterprise internal search tools all benefit from this architecture. The dual encoder handles scale. The cross-encoder handles precision. Neither alone handles both.
E-commerce and Product Discovery
When a user searches "lightweight running shoes for flat feet," product descriptions that mention running shoes will surface via the dual encoder. The cross-encoder then evaluates which products actually address flat feet specifically, rather than just matching the general category.
ColBERT as a Middle Ground
If running a full cross-encoder on 100 candidates per query still feels expensive for your latency budget, there is a third option worth knowing.
ColBERT uses a late-interaction approach. It encodes query and document separately like a dual encoder, preserving the ability to precompute document representations. But instead of comparing two pooled vectors, it compares individual token embeddings using a MaxSim operation: for each query token, find the most similar document token across the full sequence.
This gives ColBERT much better accuracy than a dual encoder while keeping document precomputation intact. It is two orders of magnitude faster than a cross-encoder and uses four orders of magnitude fewer FLOPs per query, according to the original ColBERT paper from Stanford.
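The MaxSim operation itself is small enough to sketch in plain Python. The two-dimensional token vectors below are invented for illustration, and `maxsim_score` is a hypothetical helper name. A real ColBERT model produces one learned vector per token and runs this computation in batched tensor operations.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token vector, take its best
    match among the document token vectors, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical unit-length token vectors, one per token.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # matches both query tokens well
doc_b = [[1.0, 0.0], [0.8, 0.6]]              # matches only the first one well

score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)
print(score_a, score_b)
```

The document token vectors can be computed and stored ahead of time, and only the cheap max-and-sum step depends on the query. That is what keeps precomputation intact.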
| Approach | Precompute Docs | Token Interaction | Relative Speed | Relative Accuracy |
|---|---|---|---|---|
| Dual Encoder | Yes | None | Fastest | Lowest |
| ColBERT | Yes (per token) | Per-token MaxSim | Fast | High |
| Cross-Encoder | No | Full cross-attention | Slowest | Highest |
ColBERT works especially well as a reranker when you need better precision than a dual encoder but cannot afford cross-encoder latency. The PLAID engine built on ColBERTv2 reports latency reductions of up to about 7 times on GPU and 45 times on CPU compared with vanilla ColBERTv2 while maintaining retrieval quality.
If you are building a high-QPS service where tail latency matters and cross-encoder reranking is too slow, ColBERT-style late interaction is the architecture to explore. The RAGatouille library makes ColBERT accessible with minimal setup.
When to Use What
The pipeline you choose depends on your constraints. Here is a practical decision guide:
| Scenario | Recommended Approach |
|---|---|
| Small corpus, fewer than 500 docs | Cross-encoder directly, skip dual encoder |
| Large corpus, latency over accuracy | Dual encoder only |
| Large corpus, accuracy matters most | Dual encoder retrieval plus cross-encoder reranking |
| Very high QPS, tight latency budget | Dual encoder plus ColBERT reranking |
| Domain-specific content, general models underperform | Fine-tune the cross-encoder on your data |
| Multilingual content | BGE-Reranker or mGTE for multilingual reranking |
What the Pipeline Looks Like End to End
For a RAG pipeline, the sequence is straightforward:
1. At index time, encode all documents with a dual encoder and store vectors in a vector database.
2. At query time, encode the user query with the dual encoder.
3. Run an approximate nearest-neighbor search to retrieve the top 50 to 100 candidates.
4. Pass each candidate paired with the query to the cross-encoder.
5. Collect the relevance scores and sort the candidates.
6. Send only the top 5 to 10 reranked documents to the LLM as context.
7. Generate the response.
Skipping steps 4 through 6 is one of the most common reasons RAG pipelines return answers that are technically correct but miss the point.
The ceiling on a two-stage pipeline is always the recall of stage one. If the dual encoder misses a relevant document entirely, the cross-encoder never sees it. This is why tuning the top-k value in stage one matters. Retrieve generously, then rerank aggressively.
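One way to tune that top-k value is to measure recall at several cutoffs on a small labeled set. The document ids and relevance labels below are hypothetical. In practice they would come from hand-labeled queries against your own corpus.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical stage-one ranking and hand-labeled relevant documents.
ranked_ids = ["d7", "d2", "d9", "d4", "d1", "d5", "d3", "d8"]
relevant_ids = {"d2", "d1", "d8"}

for k in (2, 5, 8):
    print(k, round(recall_at_k(ranked_ids, relevant_ids, k), 2))
```

In this made-up ranking, a cutoff of 2 keeps one of the three relevant documents, a cutoff of 5 keeps two, and only a cutoff of 8 keeps all three. Any relevant document dropped at this stage cannot be recovered by the reranker.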
The Right Way to Think About This
A dual encoder and a cross-encoder are not competing solutions. They solve different parts of the same problem.
The dual encoder answers the question: "Which documents are in the neighborhood of this query?" It is a fast, scalable scan across a large space.
The cross-encoder answers the question: "Among these nearby documents, which ones actually answer this query?" It is a slow, careful read of a small set.
Building a search system that only does the first part is like hiring a librarian to narrow the search to one shelf, then handing the user every book on that shelf without checking whether any of them actually address the question. The retrieval succeeded. The precision failed.
The two-stage pipeline is how you get both.
Frequently Asked Questions
What is a dual encoder in search? A dual encoder encodes the query and each document separately into vectors and uses cosine similarity to find the nearest matches at scale.
What is a cross-encoder? A cross-encoder takes a query and a single document together as input and outputs a precise relevance score by reading both at the same time.
Why not use a cross-encoder for all retrieval? Cross-encoders cannot precompute document representations, so scoring millions of documents at query time means millions of transformer forward passes, which is far too slow for a real system.
What is the two-stage retrieval pipeline? The two-stage pipeline uses a dual encoder to quickly retrieve the top-k candidates, then a cross-encoder to rerank only that small candidate set for precision.
What is ColBERT and how does it relate to dual and cross-encoders? ColBERT is a late-interaction model that sits between the two, encoding query and document separately but comparing individual token vectors instead of a single pooled vector.
Which models are good for cross-encoder reranking? Popular choices include cross-encoder/ms-marco-MiniLM-L-6-v2 from Sentence Transformers, Cohere Rerank, and BGE-Reranker for multilingual workloads.
Does LangChain support cross-encoder reranking in RAG pipelines? Yes, both LangChain and LlamaIndex have built-in reranking steps that support cross-encoder models as a second-stage reranker.
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.