What is similarity search in a vector database?

Similarity search finds the stored vectors closest to a query vector according to a distance metric such as cosine similarity or Euclidean distance. Unlike exact search in a relational database which finds records matching a precise value, similarity search finds records whose vector representations are nearest to the query in a high-dimensional geometric space. The results are ranked by their similarity score rather than returned as a binary match or no-match.

How does a vector database find similar vectors quickly?

A vector database uses an approximate nearest neighbor (ANN) index to skip the vast majority of stored vectors. Instead of comparing the query against every vector (which would be O(n) and impractical at millions of records), the ANN index uses a graph structure (HNSW) or cluster structure (IVF) to navigate directly to the most likely nearest neighbors in O(log n) comparisons. The result is nearly identical to an exhaustive search but 100 to 1000 times faster.

What is the difference between similarity search and semantic search?

Similarity search is the technical mechanism: compare a query vector against stored vectors using a distance metric and return the nearest ones. Semantic search is the outcome: retrieve content that matches the meaning of a natural language query. Semantic search is implemented using similarity search. The text query is first converted to a vector by an embedding model, and then that vector is used to run similarity search. The terms are often used interchangeably in practice.

What does the similarity score mean in vector search results?

A similarity score is a numerical value indicating how close a result vector is to the query vector according to the distance metric used by the collection. For cosine similarity, scores range from 0 to 1 where 1 means identical direction and 0 means orthogonal (unrelated). For Euclidean distance, lower values mean higher similarity. Scores are relative within a result set: a score of 0.85 is meaningful only in comparison to other scores from the same query against the same embedding model.

Why do similarity search results sometimes seem wrong?

Three failure modes are common. First, the query and the target document were embedded with different models or different preprocessing, so they live in incompatible regions of the vector space. Second, the chunk containing the answer is too large and its embedding averages across multiple topics, placing it in a neighborhood far from the specific query. Third, the ANN index has low recall for your query type, meaning the approximate search missed the true nearest neighbors. Measuring ANN recall against exact search on your real data is the correct diagnostic.

What is the role of reranking after similarity search?

ANN similarity search uses a bi-encoder model that processes the query and each document independently. This is fast but shallow: the model cannot directly compare the query and document together. A reranker (cross-encoder) reads the query and each candidate document side by side and produces a more accurate relevance score. The common pattern is to retrieve 20 to 100 candidates from similarity search, rerank them with a cross-encoder, and return the top 3 to 10. This two-stage pipeline consistently improves precision without sacrificing recall.

How Similarity Search Works in Vector Databases

A user asks: "what is the penalty for late payment?" The system has 200,000 document chunks stored in a vector database. Somewhere in there is section 4.3 of the terms of service: "Overdue balances incur a 1.5 percent monthly fee." The query and the document share one word. They use completely different phrasing to describe the same concept.

Similarity search finds that document. Not because it matched keywords. Because the query vector and the document vector are geometrically close in the high-dimensional embedding space where both were placed when the embedding model processed them.

Understanding exactly how that happens, from the moment a query arrives to the moment results are returned, is what this article covers. Every step of the similarity search pipeline has engineering decisions that affect latency, recall, and result quality. Getting those decisions right is the difference between a search system that users trust and one they work around.

This article is part of the How Vector Databases Work Internally series. It covers the search side of the pipeline. The vector query lifecycle article covers the full end-to-end request flow including ingestion, storage, and response assembly. The distance metrics used in each search step are covered in depth in the cosine similarity vs Euclidean distance article.

What Similarity Search Is Doing Geometrically

Before covering the mechanics, the geometric intuition is worth establishing precisely.

Every embedding model maps text, images, or audio to a point in a high-dimensional vector space. The training process arranges those points so that items with similar meaning land close together and items with different meaning land far apart. When you search by similarity, you are asking: "find the stored points closest to this query point."

python

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Three stored documents
stored = {
    "doc_a": "Late payments incur a monthly fee of 1.5 percent.",
    "doc_b": "Shipping takes 3 to 5 business days.",
    "doc_c": "You may return items within 30 days of purchase.",
}

# One query
query = "what is the penalty for late payment"

# Embed everything
stored_vecs = {k: model.encode(v, normalize_embeddings=True) for k, v in stored.items()}
query_vec   = model.encode(query, normalize_embeddings=True)

# Compute cosine similarity (dot product after L2 normalization)
scores = {k: float(np.dot(query_vec, v)) for k, v in stored_vecs.items()}
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)

print("Similarity scores:")
for doc_id, score in ranked:
    print(f"  {doc_id}: {score:.4f}  '{stored[doc_id][:50]}'")

# Output:
# Similarity scores:
#   doc_a: 0.7841  'Late payments incur a monthly fee of 1.5 percent.'
#   doc_c: 0.3102  'You may return items within 30 days of purchase.'
#   doc_b: 0.1043  'Shipping takes 3 to 5 business days.'

"doc_a" scores 0.78 against the query even though "penalty" does not appear in the document. The embedding model learned during training that "late payment fee" and "penalty for late payment" occur in overlapping semantic contexts, so they land in similar regions of the embedding space. The math is just a dot product. The intelligence is in the learned geometry.

According to Couchbase's vector similarity guide, the index structure plays a crucial role in the similarity search process by guiding the search to relevant regions of the high-dimensional space, which helps narrow down the number of vector comparisons required.

The Two Phases of Similarity Search

Every production similarity search runs in two phases: retrieval and ranking.

Retrieval is the ANN phase. The index structure navigates to a candidate set of vectors that are approximately nearest to the query. This phase prioritizes speed. The goal is to return a superset of the true nearest neighbors without comparing against the entire collection.

Ranking is the scoring phase. Once candidates are retrieved, they are scored precisely by the distance metric, sorted, and filtered. This phase prioritizes accuracy.

plaintext

Phase 1: Retrieval (ANN search)
  Input:  query vector, k=10
  Output: candidate set of ~50 to 200 vector IDs (oversampled)
  Cost:   O(log n) comparisons using HNSW graph traversal
  Goal:   high recall — the true top-10 should be in this set

Phase 2: Ranking (exact scoring on candidates)
  Input:  candidate set of 50 to 200 vectors + query vector
  Output: sorted list of (id, score) pairs
  Cost:   O(candidates × dimensions) — fast, candidates << n
  Goal:   high precision — correct ordering of the candidate set

The separation between these phases is what makes similarity search at scale viable. Phase 1 discards 99.9 percent of the collection efficiently using the ANN index. Phase 2 applies exact scoring only to the small remaining candidate set.

Step 1: Query Vectorization

Before the search can begin, the query must be in the same vector space as the indexed documents. That means converting the raw text query to a float array using the same embedding model used during indexing.

python

import openai
import numpy as np

oai = openai.OpenAI(api_key="your-key")

def embed_query(text: str, model: str = "text-embedding-3-small") -> np.ndarray:
    """
    Convert a text query to a normalized embedding vector.
    Normalization to unit length converts cosine similarity to a dot product,
    which is faster and numerically more stable.
    """
    response = oai.embeddings.create(input=text, model=model)
    vec = np.array(response.data[0].embedding, dtype=np.float32)
    return vec / np.linalg.norm(vec)    # L2 normalize

query_vec = embed_query("what is the penalty for late payment")
print(f"Query vector shape: {query_vec.shape}")      # (1536,)
print(f"Query vector norm:  {np.linalg.norm(query_vec):.6f}")  # 1.000000

This step has two critical constraints. First, the embedding model must be identical to the model used during document indexing. Mixing models produces nonsensical similarity scores because different models learn different vector spaces with different geometries. Second, normalization matters: if documents were indexed as normalized vectors and the query is not normalized, cosine similarity scores will be incorrect.

Query vectorization typically takes 5 to 15ms for a sentence-length input on a modern CPU when calling an embedding API, or 1 to 3ms when running a local model like all-MiniLM-L6-v2.

Step 2: ANN Index Traversal

With the query vector ready, the database sends it to the ANN index. The specific traversal depends on the index type. HNSW traversal is the most common in production systems.

HNSW Traversal Step by Step

The HNSW graph has multiple layers. Each layer is a graph where nodes (vectors) are connected to their nearest neighbors. Higher layers have fewer nodes and longer-range connections. Layer 0 is the base layer containing all vectors.

plaintext

HNSW traversal for query vector q:

Layer 2 (coarse navigation):
  Enter at entry point node E
  Compute distance(E, q) = 0.42
  Check all neighbors of E: [A, F, G]
  distance(A, q) = 0.31 → move to A
  Check all neighbors of A: [B, E, H]
  No neighbor closer than A → descend to layer 1

Layer 1 (medium navigation):
  Enter at A
  Check neighbors of A: [B, C, E, D]
  distance(B, q) = 0.24 → move to B
  Check neighbors of B: [A, C, X, Y]
  distance(C, q) = 0.19 → move to C
  No neighbor closer than C → descend to layer 0

Layer 0 (precise local search):
  Enter at C
  Expand search to all neighbors and their neighbors
  using ef_search=64 candidate set
  Return top-K from explored candidates

The key parameter is ef_search (also written ef in some libraries), which controls how many candidates are tracked during the layer-0 search. A higher ef_search explores more of the local graph neighborhood, improving recall at the cost of more distance computations.

python

import faiss
import numpy as np

# Create HNSW index with ef_search tuning
d = 384   # dimension for all-MiniLM-L6-v2
index = faiss.IndexHNSWFlat(d, 16)     # M=16 connections per node
index.hnsw.efConstruction = 64         # quality during build
index.hnsw.efSearch = 64              # candidates during search (tune at query time)

# Add vectors
corpus = np.random.randn(100_000, d).astype(np.float32)
faiss.normalize_L2(corpus)
index.add(corpus)

query = np.random.randn(1, d).astype(np.float32)
faiss.normalize_L2(query)

# Search with default ef_search=64
distances, indices = index.search(query, k=10)

# Increase ef_search for higher recall (at cost of latency)
index.hnsw.efSearch = 128
distances_hq, indices_hq = index.search(query, k=10)

print(f"Result overlap: {len(set(indices[0]) & set(indices_hq[0]))} / 10")
# Result overlap: 9 / 10 — ef_search=128 found one extra true neighbor

The overlap between the two result sets tells you how much recall you gain from the higher ef_search. In practice, for most workloads, ef_search=64 gives 95 to 97 percent recall and ef_search=128 gives 97 to 99 percent recall. The right value depends on your latency budget and recall requirement.

According to Pinecone's similarity search guide, to reduce the computation complexity added by an exhaustive search, ANN search provides a massive performance boost on similarity search when dealing with large datasets by retrieving a close approximation of the nearest neighbor rather than the exact one.

Step 3: Distance Computation

During ANN traversal, distance computations happen constantly as the algorithm evaluates neighbors. Understanding the specific computation that runs at this step is important because it determines both correctness and performance.

Cosine Similarity (After Normalization)

For normalized vectors (L2 norm = 1), cosine similarity reduces to a dot product. This is the most common configuration for text embeddings.

python

import numpy as np

def cosine_similarity_normalized(a: np.ndarray, b: np.ndarray) -> float:
    """
    For L2-normalized vectors, cosine similarity = dot product.
    This is significantly faster than the full cosine formula because
    the norm computations (||a|| and ||b||) are both 1.0 and can be skipped.
    """
    return float(np.dot(a, b))

# Compare the two implementations on normalized vectors
a = np.random.randn(1536).astype(np.float32)
b = np.random.randn(1536).astype(np.float32)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

full_cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
fast_dot    = np.dot(a, b)

print(f"Full cosine:  {full_cosine:.8f}")
print(f"Fast dot:     {fast_dot:.8f}")
print(f"Difference:   {abs(full_cosine - fast_dot):.2e}")
# Difference: 0.00e+00   — identical when vectors are normalized

Modern CPUs and GPUs implement dot products using SIMD (Single Instruction, Multiple Data) instructions that process multiple float values simultaneously. A single dot product on a 1536-dimensional float32 vector runs in a few microseconds on a modern CPU.

Euclidean Distance (L2)

For image embeddings and some audio models where vector magnitude carries signal, Euclidean distance is the correct metric.

python

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    diff = a - b
    return float(np.sqrt(np.dot(diff, diff)))

# Euclidean distance is sensitive to magnitude
a_short = np.array([1.0, 0.0, 0.0])
a_long  = np.array([3.0, 0.0, 0.0])   # same direction, 3x longer
b       = np.array([0.9, 0.1, 0.0])

print(f"euclid(a_short, b) = {euclidean_distance(a_short, b):.4f}")  # 0.1414
print(f"euclid(a_long, b)  = {euclidean_distance(a_long, b):.4f}")   # 2.1024
print(f"cosine(a_short, b) = {cosine_similarity_normalized(a_short / np.linalg.norm(a_short), b / np.linalg.norm(b)):.4f}")  # 0.9950
print(f"cosine(a_long, b)  = {cosine_similarity_normalized(a_long / np.linalg.norm(a_long), b / np.linalg.norm(b)):.4f}")   # 0.9950

The output shows the critical difference: cosine similarity treats a_short and a_long as identical (they point in the same direction). Euclidean distance sees them as very different (they are at different positions in space). Choose the metric that matches the geometry of your embedding model.

The full mathematical comparison with geometric intuition is in the cosine similarity vs Euclidean distance article. The high-level takeaway: use cosine for text, use Euclidean for image and audio models where the embedding magnitude is meaningful.

Step 4: Candidate Collection and Merging Across Segments

A production vector database stores vectors in multiple segments (sealed segments with HNSW indexes plus an active segment with brute-force search). Every segment must be searched in parallel and results merged.

python

import concurrent.futures
from dataclasses import dataclass

@dataclass
class SearchResult:
    id: int
    score: float
    payload: dict

def search_segment(segment_id: int, query_vec, k: int) -> list[SearchResult]:
    """Search one segment and return top-k local results."""
    # In a real database this runs the HNSW traversal for that segment
    # Returns local top-k sorted by score descending
    ...

def similarity_search(query_vec, k: int = 10) -> list[SearchResult]:
    """
    Search all segments in parallel and merge results globally.
    This is how production vector databases handle multi-segment collections.
    """
    segments = get_active_segments()    # list of segment IDs

    # Search all segments concurrently
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = {
            executor.submit(search_segment, seg_id, query_vec, k): seg_id
            for seg_id in segments
        }
        per_segment_results = []
        for future in concurrent.futures.as_completed(futures):
            per_segment_results.extend(future.result())

    # Global merge: sort all candidates by score, return top-k
    all_candidates = sorted(per_segment_results, key=lambda r: r.score, reverse=True)

    # Remove soft-deleted IDs
    active_candidates = [c for c in all_candidates if not is_deleted(c.id)]

    return active_candidates[:k]

The merge step is a straightforward sort of all per-segment top-K results followed by a global top-K selection. If there are 5 segments each returning 10 candidates, the merge sorts 50 candidates and returns 10. This is O(s × k × log(s × k)) where s is the number of segments, which is negligible compared to the ANN traversal cost.

The active segment, which is still growing and does not have an HNSW index, uses brute-force search over its (relatively small) vector set. This is acceptable because the active segment typically contains far fewer vectors than sealed segments.

Step 5: Metadata Filtering

Once candidates are retrieved from the ANN index, the query's metadata filter is applied to discard candidates that do not meet the structured criteria.

python

from qdrant_client import QdrantClient, models
import numpy as np

client = QdrantClient(host="localhost", port=6333)

query_vec = embed_query("what is the penalty for late payment")

# Post-filtering: ANN returns top-100, then filter by category
results = client.search(
    collection_name="documents",
    query_vector=query_vec.tolist(),
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="category",
                match=models.MatchValue(value="terms-of-service")
            ),
            models.FieldCondition(
                key="active",
                match=models.MatchValue(value=True)
            )
        ]
    ),
    limit=10,
    with_payload=True,
    search_params=models.SearchParams(hnsw_ef=100),
)

for hit in results:
    print(f"Score {hit.score:.4f}: {hit.payload['text'][:60]}")

The filter category = "terms-of-service" AND active = True is evaluated against the metadata store for each candidate after the ANN search returns candidates. This is post-filtering: ANN search runs first, then filtering discards non-matching candidates.

For highly selective filters (where fewer than 5 percent of vectors match), post-filtering can produce fewer than K results. The solution is either to oversample (request more candidates than K from the ANN search) or switch to pre-filtering (apply the filter before ANN search to get the eligible ID set, then restrict ANN traversal to those IDs).

According to Instaclustr's vector similarity guide, similarity score computation is computationally intensive, especially for large datasets. Efficient algorithms and optimized libraries are essential to handle these computations.

Step 6: Score Normalization and Threshold Filtering

Raw similarity scores from different embedding models and different queries are not directly comparable. A score of 0.75 from one query may represent very high relevance. A score of 0.75 from a different query on a different collection may represent moderate relevance.

Score thresholds help filter out results that are statistically unlikely to be relevant. Setting a threshold requires understanding the score distribution for your specific model and collection.

python

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def profile_score_distribution(corpus_texts: list[str], n_random_queries: int = 100):
    """
    Sample random queries to understand the score distribution for this collection.
    This tells you what threshold separates relevant from random matches.
    """
    corpus_vecs = model.encode(corpus_texts, normalize_embeddings=True)

    random_queries = [
        "random query topic " + str(i) for i in range(n_random_queries)
    ]
    query_vecs = model.encode(random_queries, normalize_embeddings=True)

    all_scores = []
    for q_vec in query_vecs:
        scores = corpus_vecs @ q_vec   # dot product for all corpus vecs at once
        all_scores.extend(scores.tolist())

    scores_array = np.array(all_scores)
    print(f"Score distribution (random queries):")
    print(f"  Mean:   {scores_array.mean():.4f}")
    print(f"  Std:    {scores_array.std():.4f}")
    print(f"  P95:    {np.percentile(scores_array, 95):.4f}")
    print(f"  P99:    {np.percentile(scores_array, 99):.4f}")
    print()
    print(f"Recommended threshold: {np.percentile(scores_array, 99):.4f}")
    print("(scores above this are unlikely to be random coincidences)")

A practical rule: run 100 to 500 random or low-quality queries against your collection and record the maximum score each one achieves. The 99th percentile of those maximum scores is a reasonable starting threshold. Results below this score are probably coincidental matches rather than genuine semantic overlap.

For production RAG systems, the common pattern is to use a threshold of 0.3 to 0.5 for cosine similarity with all-MiniLM-L6-v2, and 0.5 to 0.75 for OpenAI's text-embedding-3-small. These ranges vary by collection and query type and must be calibrated on your actual data.

Step 7: Payload Fetch and Result Assembly

After scoring, filtering, and threshold application, the database fetches the full metadata payloads for the surviving result IDs. The payload fetch is deliberately deferred to this final step because most candidates are discarded before reaching it.

python

# Pseudocode: deferred payload fetch pattern used inside every vector database

def full_search(query_vec, k=10, filter_expr=None, score_threshold=0.3):

    # Step 1: ANN retrieval — returns (id, score) pairs only, no payloads
    candidates = ann_index.search(query_vec, k=k * 10)   # oversample

    # Step 2: Filter by metadata (using prebuilt metadata index, not full fetch)
    if filter_expr:
        candidates = apply_filter(candidates, filter_expr)

    # Step 3: Score threshold
    candidates = [c for c in candidates if c.score >= score_threshold]

    # Step 4: Take top-k survivors
    top_candidates = candidates[:k]

    # Step 5: NOW fetch full payloads — only for the final k results
    payloads = metadata_store.batch_get([c.id for c in top_candidates])

    return [
        {"id": c.id, "score": c.score, "payload": payloads[c.id]}
        for c in top_candidates
    ]

The deferred fetch pattern is important for performance. Fetching the full text payload for 1000 ANN candidates, each of which may be 500 to 2000 bytes, would add 0.5 to 2 MB of metadata reads per query. By deferring the fetch to only the final K results, the metadata I/O is bounded regardless of how many candidates the ANN phase produces.

Step 8: Reranking for Precision

The similarity search pipeline as described so far uses a bi-encoder: the embedding model processes the query and each document independently. The similarity score is computed by comparing their independently-produced vectors. This is fast but shallow.

A cross-encoder reranker processes the query and each candidate document together in a single forward pass through a transformer model. It can model the direct interaction between query and document words, which captures relevance signals that bi-encoder cosine similarity misses.

python

from sentence_transformers import CrossEncoder
import numpy as np

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def two_stage_search(
    query: str,
    vector_db_client,
    collection: str,
    first_stage_k: int = 50,
    final_k: int = 5,
) -> list[dict]:
    """
    Two-stage retrieval pipeline:
    Stage 1: Fast ANN similarity search (high recall, moderate precision)
    Stage 2: Cross-encoder reranking (high precision on the candidate set)
    """
    query_vec = embed_query(query)

    # Stage 1: retrieve generous candidate set
    raw_results = vector_db_client.search(
        collection_name=collection,
        query_vector=query_vec.tolist(),
        limit=first_stage_k,
        with_payload=True,
    )
    candidates = [
        {"text": hit.payload["text"], "score": hit.score, "id": hit.id}
        for hit in raw_results
    ]

    if not candidates:
        return []

    # Stage 2: cross-encoder reranking on candidates
    pairs  = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)

    for candidate, rerank_score in zip(candidates, scores):
        candidate["rerank_score"] = float(rerank_score)

    reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
    return reranked[:final_k]


query   = "what is the penalty for late payment"
results = two_stage_search(query, client, "documents", first_stage_k=50, final_k=5)

for r in results:
    print(f"Rerank score {r['rerank_score']:.3f} | ANN score {r['score']:.4f}")
    print(f"  {r['text'][:80]}")
    print()

The reranking step typically adds 50 to 200ms of latency depending on the model size and number of candidates. According to Qdrant's reranking documentation, unlike embedding models that compress everything into a single vector upfront, rerankers keep all the important details intact by using the full transformer output to calculate a similarity score. The tradeoff is precision gains at the cost of latency.

According to Shinrag's reranking analysis, a query like "Can managers approve their own expense reports?" may retrieve a high-scoring chunk about "expense reports must be approved by a direct manager" from bi-encoder search, while the chunk that actually answers the question ("self-approval of expense reports is prohibited") ranks lower. The cross-encoder, reading both together, correctly flips the ranking.

The pattern "retrieve 50, rerank to 5" consistently outperforms "retrieve 5" on precision benchmarks. The cost is the reranker inference on 50 candidates, which is bounded and predictable.

A Complete Similarity Search Implementation

This pulls all eight steps together into a single production-ready class.

python

import numpy as np
import openai
from sentence_transformers import SentenceTransformer, CrossEncoder
from qdrant_client import QdrantClient, models
from dataclasses import dataclass

@dataclass
class SearchHit:
    id: int
    ann_score: float
    rerank_score: float | None
    text: str
    metadata: dict

class SimilaritySearchPipeline:
    def __init__(
        self,
        embed_model_name: str = "all-MiniLM-L6-v2",
        rerank_model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
        qdrant_host: str = "localhost",
        qdrant_port: int = 6333,
        collection: str = "knowledge-base",
    ):
        self.embedder  = SentenceTransformer(embed_model_name)
        self.reranker  = CrossEncoder(rerank_model_name)
        self.db        = QdrantClient(host=qdrant_host, port=qdrant_port)
        self.collection = collection

    def embed(self, text: str) -> list[float]:
        vec = self.embedder.encode(text, normalize_embeddings=True)
        return vec.tolist()

    def search(
        self,
        query: str,
        k: int = 5,
        first_stage_k: int = 50,
        score_threshold: float = 0.30,
        filter_expr: models.Filter | None = None,
        use_reranker: bool = True,
    ) -> list[SearchHit]:

        # Step 1: embed query
        query_vec = self.embed(query)

        # Steps 2 to 6: ANN search with filtering
        raw = self.db.search(
            collection_name=self.collection,
            query_vector=query_vec,
            query_filter=filter_expr,
            limit=first_stage_k,
            with_payload=True,
            score_threshold=score_threshold,
            search_params=models.SearchParams(hnsw_ef=128),
        )

        if not raw:
            return []

        candidates = [
            SearchHit(
                id=hit.id,
                ann_score=hit.score,
                rerank_score=None,
                text=hit.payload.get("text", ""),
                metadata={k: v for k, v in hit.payload.items() if k != "text"},
            )
            for hit in raw
        ]

        # Step 7: reranking (optional but recommended)
        if use_reranker and candidates:
            pairs  = [(query, c.text) for c in candidates]
            scores = self.reranker.predict(pairs)
            for c, s in zip(candidates, scores):
                c.rerank_score = float(s)
            candidates.sort(key=lambda x: x.rerank_score, reverse=True)

        return candidates[:k]


pipeline = SimilaritySearchPipeline()

results = pipeline.search(
    query="what is the penalty for late payment",
    k=3,
    first_stage_k=30,
    score_threshold=0.35,
    filter_expr=models.Filter(
        must=[models.FieldCondition(
            key="category",
            match=models.MatchValue(value="terms-of-service")
        )]
    ),
    use_reranker=True,
)

for r in results:
    print(f"ANN: {r.ann_score:.4f}  Rerank: {r.rerank_score:.3f}")
    print(f"  {r.text[:100]}")

This pipeline covers all eight steps: query embedding, ANN traversal with HNSW, multi-segment candidate merge, metadata filter, score threshold, payload fetch, and cross-encoder reranking.

What the Similarity Score Tells You and What It Does Not

A similarity score of 0.82 from cosine search means the query vector and result vector are geometrically close in the embedding space. It does not necessarily mean the result is the best answer to the user's question.

The score is relative to a specific embedding model's learned geometry. A score of 0.82 from all-MiniLM-L6-v2 is not comparable to a score of 0.82 from text-embedding-3-large. The score is meaningful only within a single model's space and only in comparison to other scores from the same query.

This is why reranking adds value. The cross-encoder score is an absolute relevance estimate on a consistent scale (typically 0 to 1 for models trained on MS-MARCO), not a geometric distance in an embedding space. An ANN score of 0.72 can be more relevant than a score of 0.85 if the 0.72 result directly answers the query and the 0.85 result only incidentally mentions related words.

According to KX Systems' similarity search primer, it is common in similarity calculations for vector search to not use exactly 0, 90, or 180 degrees to determine similar, unrelated, or opposite vectors respectively, because looking for exact matches in a continuous geometric space is not practical. The search is always approximate, and the score is always a continuous-valued confidence estimate rather than a binary judgment.

Connecting Forward to the Technical Cluster Articles

Each step in the similarity search pipeline described here is covered in depth in the dedicated cluster articles.

The distance metrics at step 3, cosine similarity and Euclidean distance, are covered mathematically with geometric intuition in the cosine similarity vs Euclidean distance article. That article covers when each is appropriate and how the choice affects result quality.

The ANN index traversal at step 2, specifically why approximate search is necessary and what accuracy it sacrifices, is the subject of the exact vs approximate nearest neighbor article. It covers how recall is measured and how to set your ANN parameters for a target recall threshold.

The HNSW graph traversal at step 2 is covered in full detail, with diagrams of the layered graph structure and the greedy navigation algorithm, in the HNSW algorithm article.

The alternative IVF cluster-based retrieval, which runs instead of HNSW for memory-constrained and very large-scale deployments, is covered in the IVF index article.

The vector indexing discipline as a whole, including how different index types are chosen and how index quality is measured, is covered in the vector indexing article.

The full request lifecycle from API call through all internal components to response serialization is covered in the vector query lifecycle article.

Summary

Similarity search in a vector database is a pipeline of eight steps: query vectorization, ANN index traversal (HNSW or IVF), distance computation during traversal, candidate collection and merge across segments, metadata filtering, score threshold application, payload fetch, and optional cross-encoder reranking.

The performance of the pipeline is dominated by the ANN traversal step. Everything else, embedding, merging, filtering, payload fetch, is fast relative to the cost of navigating the index. Latency optimization therefore starts with ANN index tuning: the right ef_search for your recall requirement, the right segment size to minimize the number of active segments, and whether pre-filtering or post-filtering is appropriate for your filter selectivity.

Accuracy is dominated by the quality of the embedding model and the chunking strategy used during indexing. If the query and the relevant document land in the same geometric neighborhood, similarity search will find it. If they land far apart because the model does not understand your domain or because the chunk is too large and its embedding is diffuse, no amount of ANN tuning will recover the result. The How Vector Databases Work Internally pillar covers the full architecture that surrounds this search pipeline.

Sources and Further Reading

Pinecone. What Is Similarity Search? pinecone.io/learn/what-is-similarity-search
Couchbase. What Is Vector Similarity Search? Benefits and Applications. couchbase.com/blog/vector-similarity-search
Instaclustr. What Is Vector Similarity Search? Pros, Cons, and 5 Tips. instaclustr.com/education/vector-database/what-is-vector-similarity-search-pros-cons-and-5-tips-for-success
Oracle. Similarity Search: Why AI Speaking in Vectors Is a Win for Users. oracle.com/database/ai-vector-search/similarity-search
Redis. What Is Vector Similarity? Metrics and Algorithms Explained. redis.io/blog/vector-similarity
KX Systems. How Vector Databases Search by Similarity: A Comprehensive Primer. medium.com/kx-systems/how-vector-databases-search-by-similarity
Qdrant. Reranking for Better Search. qdrant.tech/documentation/search-precision/reranking-semantic-search
Shinrag. Reranking in RAG: Cross-Encoder Reranking for Better Retrieval. shinrag.com/blog/reranking-rag-retrieval-quality-cross-encoder
Superlinked. Optimizing RAG with Hybrid Search and Reranking. superlinked.com/vectorhub/articles/optimizing-rag-with-hybrid-search-reranking
Elastic. Ranking and Reranking Documentation. elastic.co/docs/solutions/search/ranking
Bishal Bose. Re-Ranking Algorithms in Vector Databases: An In-Depth Analysis. bishalbose294.medium.com/re-ranking-algorithms-in-vector-databases-in-depth-analysis
Weaviate. Vector Search Documentation. weaviate.io/developers/weaviate/search/similarity
Milvus. Similarity Metrics Documentation. milvus.io/docs/metric.md
FAISS. Getting Started Documentation. faiss.ai/index

How Similarity Search Works in Vector Databases

What Similarity Search Is Doing Geometrically

The Two Phases of Similarity Search

Step 1: Query Vectorization

Step 2: ANN Index Traversal

HNSW Traversal Step by Step

Step 3: Distance Computation

Cosine Similarity (After Normalization)

Euclidean Distance (L2)

Step 4: Candidate Collection and Merging Across Segments

Step 5: Metadata Filtering

Step 6: Score Normalization and Threshold Filtering

Step 7: Payload Fetch and Result Assembly

Step 8: Reranking for Precision

A Complete Similarity Search Implementation

What the Similarity Score Tells You and What It Does Not

Connecting Forward to the Technical Cluster Articles

Summary

Sources and Further Reading

Krunal Kanojiya

Related Posts