
Vector Query Lifecycle Explained Step by Step

A research-backed, step-by-step walkthrough of everything that happens inside a vector database from the moment a query request arrives to the moment results are returned. Covers API parsing, filter strategy selection, ANN index traversal, multi-segment merge, distributed shard coordination, payload fetch, scoring, and response serialization.

Krunal Kanojiya

#vector-database #query-lifecycle #HNSW #ANN #metadata-filtering #distributed-search #RAG #Qdrant #Milvus #latency

A user types a question into a customer support portal. The application converts it to a 1536-dimensional embedding and sends a search request to the vector database. 22 milliseconds later, the top 10 most relevant chunks are returned and passed to the LLM.

Those 22 milliseconds contain more engineering than most search systems perform in their entire request cycle. Seven distinct stages happen in sequence, some with internal parallelism, each with its own configuration surface and its own failure modes.

This article traces every stage of that 22-millisecond lifecycle. Every stage is a place where latency is added, recall is bounded, and configuration choices produce measurable consequences. Understanding the full sequence is what allows you to instrument the right place when latency spikes, diagnose missing results, and tune the right parameter when recall falls below target.

This is the final article in the How Vector Databases Work Internally series. Every stage of the lifecycle references concepts covered in the preceding cluster articles: similarity search mechanics, distance metrics, ANN vs exact search, HNSW traversal, IVF cluster search, PQ compression, and index selection. This article assembles all of those pieces into a single end-to-end walkthrough.

The Complete Lifecycle at a Glance

Before going step by step, the full sequence in one view:

plaintext
CLIENT REQUEST

┌─────────────────────────────────────────────────────────────────────┐
│  Stage 1: API reception and request parsing         (~0.5ms)        │
│  Parse JSON/protobuf, extract vector + filter + params              │
│  Validate vector dimension, filter schema, parameter bounds         │
└──────────────────────────────┬──────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│  Stage 2: Filter strategy selection                 (~0.5ms)        │
│  Estimate filter selectivity from metadata index statistics         │
│  Choose pre-filter, post-filter, or exact fallback                  │
└──────────────────────────────┬──────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│  Stage 3: ANN index traversal (per segment, parallel)  (~5-15ms)   │
│  HNSW: greedy layer descent → beam search at layer 0               │
│  IVF: centroid comparison → exhaustive search in top-nprobe lists   │
│  Active segment: flat (exact) search                                │
│  Each segment returns local top-(K * oversampling_factor)           │
└──────────────────────────────┬──────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│  Stage 4: Multi-segment result merge               (~0.3ms)         │
│  Collect per-segment candidate lists                                │
│  Global sort by score, deduplicate, discard soft-deleted IDs        │
└──────────────────────────────┬──────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│  Stage 5: Metadata filtering                       (~0.5-2ms)       │
│  Fetch payload fields needed for filter evaluation                  │
│  Apply filter predicates to discard non-matching candidates         │
│  If too few remain: expand search or return partial result          │
└──────────────────────────────┬──────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│  Stage 6: Scoring, threshold, and ranking          (~0.3ms)         │
│  Apply score threshold (discard low-confidence results)             │
│  Optional: cross-encoder reranking (+50 to 200ms)                  │
│  Select final top-K                                                 │
└──────────────────────────────┬──────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│  Stage 7: Payload fetch and response assembly      (~1-3ms)         │
│  Fetch full metadata payloads for top-K IDs                         │
│  Serialize result to JSON (REST) or protobuf (gRPC)                 │
│  Return to client                                                   │
└─────────────────────────────────────────────────────────────────────┘

CLIENT RESPONSE

The total time is dominated by Stage 3. All other stages combined typically take less than 5ms on a single-node system. The most impactful performance optimization in a vector database is almost always making ANN traversal faster, and the levers for that are the index algorithm, the parameter choices (ef_search or nprobe), the dimensionality of the vectors, and the number of segments being searched.

Stage 1: API Reception and Request Parsing

The lifecycle begins when the database server receives a query request. In most vector databases, queries arrive over one of two transport protocols: a REST API accepting JSON bodies, or a gRPC API accepting protobuf-encoded messages.

python
# The query a client sends (Qdrant REST API example)
import httpx
import numpy as np

query_vector = np.random.randn(1536).astype(np.float32)
query_vector /= np.linalg.norm(query_vector)

request_body = {
    "vector":  query_vector.tolist(),
    "filter": {
        "must": [
            {"key": "category", "match": {"value": "support"}},
            {"key": "language", "match": {"value": "en"}},
        ],
        "must_not": [
            {"key": "archived", "match": {"value": True}}
        ]
    },
    "limit":         10,
    "with_payload":  True,
    "score_threshold": 0.35,
    "params": {
        "hnsw_ef": 128,
        "exact":   False,
    }
}

response = httpx.post(
    "http://localhost:6333/collections/knowledge-base/points/search",
    json=request_body,
)
results = response.json()["result"]

Inside the server, the API layer performs:

Deserialization. The JSON body is parsed into an internal request struct. The vector field is parsed from a list of floats into a contiguous float32 array. The filter is parsed into an internal filter AST (abstract syntax tree) that will be evaluated against the metadata store.

Validation. The vector's length is checked against the collection's configured dimensionality. A 1538-element vector sent to a 1536-dimension collection is rejected with a clear error. The filter fields are checked for schema compatibility. The hnsw_ef parameter is validated against its allowed range.

Request normalization. If the collection uses cosine distance and the stored vectors are L2-normalized, the incoming query vector is also normalized. If the request specifies exact: true, the search is routed directly to the flat (brute-force) path, bypassing all ANN index logic.
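
Internally, validation and normalization amount to a few cheap checks performed before any index is touched. A minimal sketch of what that step might look like (function and parameter names are illustrative, not Qdrant's actual internals):

python
import numpy as np

def validate_and_normalize(raw_vector: list[float], collection_dim: int, metric: str) -> np.ndarray:
    """Illustrative server-side guard: dimension check, NaN check, optional normalization."""
    vector = np.asarray(raw_vector, dtype=np.float32)
    if vector.shape != (collection_dim,):
        raise ValueError(f"expected {collection_dim}-dimensional vector, got shape {vector.shape}")
    if not np.isfinite(vector).all():
        raise ValueError("vector contains NaN or Inf")
    if metric == "cosine":
        # Stored vectors are L2-normalized, so the query must be normalized too
        norm = np.linalg.norm(vector)
        if norm == 0.0:
            raise ValueError("zero vector cannot be normalized")
        vector = vector / norm
    return vector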

This stage completes in under 1 millisecond for typical requests. Protobuf deserialization is faster than JSON parsing and is preferred for high-throughput production deployments.

Stage 2: Filter Strategy Selection

A filter expression like category = "support" AND language = "en" AND NOT archived = true must be evaluated against the vector search. The question the database must answer before the ANN search begins: should filters be applied before, during, or after the vector index traversal?

The answer depends on how selective the filter is. Selectivity is defined as the fraction of the collection that satisfies the filter. A filter that matches 50 percent of vectors is low-selectivity. A filter that matches 0.5 percent is high-selectivity.

python
# Pseudocode: filter strategy decision in Qdrant
def select_filter_strategy(
    filter_expr,
    collection_metadata_index,
    n_total_vectors: int,
    post_filter_threshold: float = 0.05,
) -> str:
    """
    Estimate filter selectivity and choose a retrieval strategy.
    post_filter_threshold: if more than this fraction matches, use post-filter.
    """
    estimated_matches = collection_metadata_index.estimate_count(filter_expr)
    selectivity = estimated_matches / n_total_vectors

    if selectivity > post_filter_threshold:
        # More than 5% match: ANN search first, filter results after
        return "post_filter"

    elif selectivity > 0.001:
        # 0.1% to 5% match: use ANN with allowed-ID list (pre-filter)
        return "pre_filter"

    else:
        # Under 0.1% match: so few matching vectors that exact search
        # over the eligible set is faster than ANN
        return "exact_over_filtered_set"

According to Microsoft Azure AI Search's filter documentation, prefiltering guarantees that k results are returned if they exist in the index but can cause a significant portion of the graph to be traversed for highly selective filters, increasing computation cost and latency. Post-filtering applies filters after query execution, which narrows the search results, but you might receive fewer than k documents that match the filter.

The three strategies are:

Post-filter (default for low-selectivity filters). Run ANN search over the entire collection to get top-K * oversampling_factor candidates, then discard those that fail the filter. Fast because the ANN index runs on the full collection with its full graph connectivity. Risk: if the filter is highly selective, many candidates are discarded and fewer than K results survive.

Pre-filter (default for high-selectivity filters). Fetch the IDs of all matching vectors from the metadata index first, then run ANN search constrained to that allowed-ID set. Always returns K results. Risk: constraining the HNSW graph traversal to a small ID subset degrades recall because the graph was built for the full collection.

Exact over filtered set. When fewer than a few hundred vectors match the filter, brute-force comparison against all matching vectors is faster and more accurate than ANN traversal over a constrained subset.
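
When the eligible set is that small, the exact fallback is nothing more than a brute-force scan over the matching vectors. A minimal sketch with NumPy, assuming L2-normalized vectors so that inner product equals cosine similarity:

python
import numpy as np

def exact_over_filtered_set(query, vector_store, matching_ids, k: int):
    """Brute-force top-k restricted to the IDs that passed the metadata filter.
    matching_ids: np.ndarray of global IDs returned by the metadata index."""
    eligible = vector_store[matching_ids]      # (m, d) with m in the hundreds
    scores = eligible @ query                  # inner product == cosine for unit vectors
    order = np.argsort(-scores)[:k]            # top-k by descending score
    return matching_ids[order], scores[order]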

Adaptive strategies in modern databases like Qdrant use cardinality estimation from the metadata index (which maintains statistics about field value distributions) to choose automatically. According to research published on filtered ANN search, partition-based indexes (IVFFlat) outperform graph-based indexes (HNSW) for low-selectivity queries, while HNSW degrades more gracefully at moderate selectivity.

Stage 3: ANN Index Traversal

This is the computationally dominant stage. The ANN search happens per segment, with all segments processed in parallel across CPU cores.

HNSW Traversal (Sealed Segments)

For each sealed segment with an HNSW index:

plaintext
HNSW query for query vector q, ef_search=128:

1. Enter at global entry point (node in highest layer).

2. Upper layer descent (ef=1 at each upper layer):
   Layer L:  compare q against neighbors of current node.
             move to whichever neighbor is closer to q.
             stop when no neighbor is closer. descend to layer L-1.
   ...
   Layer 1:  arrive at P1, the closest node to q in layer 1.
   Descend to layer 0.

3. Layer 0 beam search (ef=ef_search=128):
   Maintain a priority queue of 128 candidates.
   Pop the closest unexplored candidate.
   Check all its neighbors.
   Add promising neighbors to the candidate queue.
   Terminate when the closest unexplored candidate is farther
   than the farthest node in the current top-128 result set.

4. Return top-K from the 128-candidate result set.
   (IDs and scores, no payloads yet)

The ef_search parameter (128 in this example) is the primary tuning lever for this stage. Higher ef_search explores more of the graph and finds more true nearest neighbors, but takes longer. Setting ef_search to 128 instead of 64 typically increases recall from 96 to 99 percent at the cost of roughly 80 percent more latency in Stage 3.

python
import faiss
import numpy as np

d = 384
n = 1_000_000

corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)

# Build HNSW index (simulating a sealed segment)
idx = faiss.IndexHNSWFlat(d, 16)
idx.hnsw.efConstruction = 200
idx.add(corpus)

query = np.random.randn(1, d).astype(np.float32)
faiss.normalize_L2(query)

import time

# The ef_search parameter is what Stage 3 tunes
for ef in [32, 64, 128, 200]:
    idx.hnsw.efSearch = ef
    t0 = time.perf_counter()
    for _ in range(100):   # average over 100 queries
        idx.search(query, k=10)
    avg_ms = (time.perf_counter() - t0) * 1000 / 100
    print(f"ef_search={ef:>4}:  {avg_ms:.2f}ms per query")

# ef_search=  32:  1.23ms
# ef_search=  64:  2.11ms
# ef_search= 128:  3.87ms
# ef_search= 200:  5.94ms

The active segment has not yet accumulated enough vectors to justify building an HNSW index. It uses flat (brute-force) search:

python
# Active segment: flat search over vectors written since the last seal
# (new_vectors stands in for that small, recently inserted batch)
new_vectors = np.random.randn(50_000, d).astype(np.float32)
faiss.normalize_L2(new_vectors)

flat_segment = faiss.IndexFlatIP(d)
flat_segment.add(new_vectors)

# Flat search: O(n_active * d) — fast because n_active is small
_, flat_ids = flat_segment.search(query, k=10)

If the active segment has 50,000 vectors, flat search takes approximately 3ms at 384 dimensions. This is acceptable because the active segment is always much smaller than sealed segments.
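
Because per-segment searches are independent, the engine dispatches them across a thread pool and gathers the local candidate lists for Stage 4. A hedged sketch of that fan-out, reusing the corpus and query from the HNSW example and using two flat segments that share the same metric so their scores remain comparable in the merge:

python
from concurrent.futures import ThreadPoolExecutor

# Two stand-in segments over disjoint halves of the corpus
seg_a, seg_b = faiss.IndexFlatIP(d), faiss.IndexFlatIP(d)
seg_a.add(corpus[:500_000])
seg_b.add(corpus[500_000:])
segments = [seg_a, seg_b]

def search_segment(segment, q, k, oversampling=3):
    # Each segment returns its local top-(k * oversampling) scores and IDs
    scores, ids = segment.search(q, k * oversampling)
    return list(zip(scores[0].tolist(), ids[0].tolist()))

# FAISS releases the GIL during search, so a thread pool gives real parallelism
with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    per_segment_results = list(pool.map(lambda s: search_segment(s, query, k=10), segments))
# per_segment_results is the input to the Stage 4 merge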

IVF Traversal (Alternative Algorithm)

For collections indexed with IVF, Stage 3 follows a two-phase structure instead of the graph traversal:

plaintext
IVF query for query vector q, nprobe=20, nlist=1024:

Phase 1: Coarse centroid comparison
  Compare q against all 1024 centroids: 1024 distance computations.
  Select the 20 closest centroids (nprobe=20).
  Cost: ~1024 * d multiply-adds = fast, sub-millisecond.

Phase 2: Fine search within selected clusters
  For each of the 20 selected clusters:
    Compare q against all vectors in the cluster's inverted list.
    ~976 vectors per cluster on average (1M / 1024).
    Cost per cluster: ~976 * d multiply-adds.
  Total fine search: 20 * 976 = ~19,520 comparisons
                     vs 1,000,000 for brute force.

Return top-K from the 19,520 comparisons.
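
This two-phase structure is what a FAISS IVF index executes directly. A hedged sketch reusing the corpus and query from the HNSW example above:

python
nlist = 1024
quantizer = faiss.IndexFlatL2(d)           # coarse quantizer holding the centroids
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(corpus)                          # k-means over the corpus learns the 1024 centroids
ivf.add(corpus)                            # each vector joins its nearest centroid's inverted list

ivf.nprobe = 20                            # Phase 1 keeps the 20 closest centroids;
distances, ids = ivf.search(query, k=10)   # Phase 2 scans only those 20 inverted lists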

Stage 3 is the only stage where the choice between HNSW and IVF produces meaningfully different execution paths. Every other stage is algorithm-agnostic.

Stage 4: Multi-Segment Result Merge

After all segments complete their individual ANN searches in parallel, the query coordinator collects per-segment result lists and performs a global merge.

python
from dataclasses import dataclass

@dataclass(order=True)
class Candidate:
    score: float
    segment_id: int
    local_id: int
    global_id: int   # globally unique vector ID

def merge_segment_results(
    per_segment_results: list[list[Candidate]],
    k: int,
    deleted_ids: set[int],
) -> list[Candidate]:
    """
    Merge K-best candidates from multiple segments.

    Each segment returns its local top-(K * oversampling) results.
    We merge all of them into a global top-K, skipping deleted IDs.
    """
    all_candidates = []
    for segment_results in per_segment_results:
        all_candidates.extend(segment_results)

    # Sort all candidates by score descending
    all_candidates.sort(key=lambda c: c.score, reverse=True)

    # Deduplicate and skip deleted IDs
    seen_ids = set()
    global_top_k = []

    for candidate in all_candidates:
        if candidate.global_id in deleted_ids:
            continue   # soft-deleted vector
        if candidate.global_id in seen_ids:
            continue   # duplicate (should not happen but guards against edge cases)
        seen_ids.add(candidate.global_id)
        global_top_k.append(candidate)
        if len(global_top_k) >= k:
            break

    return global_top_k

The oversampling factor is the key design decision in this stage. If you need the final top-10 across 5 segments, each segment returns more than 10 local candidates because the merged list still has to survive downstream attrition: soft-deleted IDs are dropped during the merge, Stage 5 may discard a large fraction of candidates that fail the filter, and each segment's ANN search is itself approximate. If every segment returned exactly 10 results, any candidate lost to those later steps would leave the final response short of 10.

The standard oversampling factor for N segments is roughly K * sqrt(N), though production systems tune this empirically based on observed score distributions across segments.
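
As a rough illustration of how the per-segment request size could be derived from that heuristic, widened further when a post-filter is expected to discard candidates (the exact formula is an assumption, not any specific engine's behavior):

python
import math

def per_segment_limit(k: int, n_segments: int, expected_filter_pass_rate: float = 1.0) -> int:
    """Heuristic local top-N each segment should return before the global merge."""
    oversampled = k * math.sqrt(n_segments)
    # Widen further if a post-filter is expected to drop a fraction of candidates
    return math.ceil(oversampled / max(expected_filter_pass_rate, 0.05))

# k=10, 5 segments, filter expected to pass ~50% of candidates -> each segment returns 45
print(per_segment_limit(10, 5, expected_filter_pass_rate=0.5))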

Stage 5: Metadata Filtering

With the global candidate list assembled, the database now applies the structured filter from Stage 2. If post-filtering was selected, this stage discards candidates that do not satisfy the filter expression.

python
class InsufficientResultsError(Exception):
    """Raised when fewer candidates pass the filter than the caller's required minimum."""
    def __init__(self, found: int, required: int):
        super().__init__(f"only {found} of the required {required} results passed the filter")
        self.found, self.required = found, required


def apply_filter(
    candidates: list[Candidate],
    filter_ast,
    metadata_index,
    k: int,
    min_results: int = None,
) -> list[Candidate]:
    """
    Apply a parsed filter expression to the candidate list.
    Fetches only the fields needed for filter evaluation, not full payloads.
    """
    passing = []

    for c in candidates:
        # Fetch only filter-relevant fields (not the full payload)
        filter_fields = metadata_index.get_fields(
            c.global_id,
            fields=filter_ast.required_fields()
        )

        if filter_ast.evaluate(filter_fields):
            passing.append(c)

        if len(passing) >= k:
            break

    # If too few candidates passed, the filter was more selective than expected
    if min_results and len(passing) < min_results:
        # Signal to the caller to expand the search or accept fewer results
        raise InsufficientResultsError(len(passing), min_results)

    return passing[:k]

A key optimization here: only the fields referenced in the filter expression are fetched during this stage. If the filter is category = "support", only the category field is retrieved from the metadata store per candidate. The full text payload (which may be hundreds of bytes) is not fetched until Stage 7. This deferred payload fetch pattern is what keeps Stage 5 fast even for large candidate sets.

If the filter is more selective than the selectivity estimate from Stage 2 (fewer candidates pass than expected), the database has two options: expand the ANN search to retrieve more candidates (adding latency) or return fewer than K results with a warning. Production systems typically expose this as a configuration choice, with a minimum_results parameter that triggers re-expansion when violated.
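
A hedged sketch of that re-expansion loop, built on apply_filter from above; run_ann_search is a hypothetical stand-in for Stage 3 that accepts a candidate budget, and the widening schedule is illustrative:

python
def search_with_expansion(query_vector, filter_ast, metadata_index, k: int, max_attempts: int = 3):
    """Retry with a wider candidate budget when too few results survive the filter."""
    candidate_budget = k * 4
    for _ in range(max_attempts):
        candidates = run_ann_search(query_vector, limit=candidate_budget)  # hypothetical helper
        try:
            return apply_filter(candidates, filter_ast, metadata_index, k=k, min_results=k)
        except InsufficientResultsError:
            candidate_budget *= 4          # widen the search and try again
    # Give up on re-expansion and accept fewer than k results
    candidates = run_ann_search(query_vector, limit=candidate_budget)
    return apply_filter(candidates, filter_ast, metadata_index, k=k)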

The engineering challenge of filter integration with vector search is significant. According to research on filtered ANN search, Milvus achieves superior recall stability through hybrid approximate/exact execution, while pgvector's cost-based query optimizer frequently selects suboptimal execution plans, favoring approximate index scans even when exact sequential scans would yield perfect recall at comparable latency.

Stage 6: Scoring, Threshold, and Optional Reranking

With the filtered candidate set assembled, the final scoring and ranking happens.

Score threshold. A score_threshold parameter discards candidates below a minimum similarity score. This prevents returning low-quality results when no strongly similar documents exist in the collection.

python
def apply_threshold_and_rank(
    candidates: list[Candidate],
    score_threshold: float,
    k: int,
) -> list[Candidate]:
    """
    Filter by score threshold, sort by score, return top-k.
    """
    above_threshold = [c for c in candidates if c.score >= score_threshold]
    above_threshold.sort(key=lambda c: c.score, reverse=True)
    return above_threshold[:k]

Cross-encoder reranking (optional but high-value). If a reranker is configured, the remaining candidates are scored by a cross-encoder model that processes the query and each candidate together. This produces a more accurate relevance score than the cosine similarity from Stage 3.

python
from sentence_transformers import CrossEncoder
import numpy as np

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_candidates(
    query_text: str,
    candidates: list[Candidate],
    metadata_store,
    top_n: int = 10,
) -> list[Candidate]:
    """
    Rerank candidates using a cross-encoder.
    Requires fetching candidate text for the cross-encoder input.
    """
    # Fetch text for each candidate (partial payload: text field only)
    texts = [metadata_store.get_text(c.global_id) for c in candidates]

    # Score all (query, text) pairs with the cross-encoder
    pairs  = [(query_text, t) for t in texts]
    scores = reranker.predict(pairs)

    for candidate, score in zip(candidates, scores):
        candidate.rerank_score = float(score)

    # Resort by reranker score
    candidates.sort(key=lambda c: c.rerank_score, reverse=True)
    return candidates[:top_n]

Reranking adds 50 to 200ms of latency depending on model size and candidate count. For RAG pipelines where the LLM generation step takes 1 to 5 seconds, this cost is proportionally small and the precision improvement is consistently worth it. The standard pattern is to retrieve 50 to 100 ANN candidates and rerank to the final 5 to 10.
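
In practice the two stages chain directly. A short usage sketch of the retrieve-then-rerank pattern using rerank_candidates from above; embed and run_ann_search are hypothetical stand-ins for the embedding call and Stage 3:

python
query_text = "How do I reset my account password?"
candidates = run_ann_search(embed(query_text), limit=100)   # wide ANN candidate set
final = rerank_candidates(query_text, candidates, metadata_store, top_n=10)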

Stage 7: Payload Fetch and Response Serialization

After the final top-K IDs are determined, the full metadata payloads are fetched. This is the deferred fetch pattern: all previous stages worked only with IDs and scores, never loading the full text content.

python
def fetch_payloads_and_serialize(
    top_candidates: list[Candidate],
    metadata_store,
    include_vector: bool = False,
    vector_store=None,   # only needed when include_vector=True (source of raw vectors)
) -> list[dict]:
    """
    Batch-fetch full payloads for the final top-K results.
    Serialize to the response format expected by the client.
    """
    # Batch fetch: one round trip to metadata store for all IDs
    ids       = [c.global_id for c in top_candidates]
    payloads  = metadata_store.batch_get(ids)

    results = []
    for candidate in top_candidates:
        result = {
            "id":    candidate.global_id,
            "score": candidate.score,
            "payload": payloads[candidate.global_id],
        }
        if include_vector:
            result["vector"] = vector_store.get(candidate.global_id).tolist()

        results.append(result)

    return results

Batch fetching all payloads in a single metadata store operation is far more efficient than fetching one payload per candidate in a loop. A batch fetch for 10 items from a key-value store takes roughly the same time as a single fetch (one round trip, larger response) versus 10 sequential fetches (10 round trips).

The response is then serialized and returned to the client. For REST endpoints, this is JSON encoding. For gRPC endpoints, this is protobuf encoding. Protobuf is typically 3 to 5 times smaller than JSON for the same data, reducing network transfer time for high-throughput deployments.

The Distributed Lifecycle: Queries Across Shards

For collections sharded across multiple nodes, Stages 3 through 7 involve inter-node coordination. The lifecycle expands:

plaintext
CLIENT REQUEST

Query Coordinator (single node)
    ├─ Stage 1: Parse request
    ├─ Stage 2: Select filter strategy

    ├─ Broadcast query to all N shards (parallel network calls)
    │     ↓                    ↓                    ↓
    │  Shard 0              Shard 1              Shard 2
    │  Stage 3: ANN        Stage 3: ANN         Stage 3: ANN
    │  Stage 4: Merge      Stage 4: Merge       Stage 4: Merge
    │  Stage 5: Filter     Stage 5: Filter      Stage 5: Filter
    │  Returns local        Returns local        Returns local
    │  top-(K * factor)     top-(K * factor)     top-(K * factor)
    │     ↓                    ↓                    ↓
    ├─ Receive shard results (parallel, wait for all)

    ├─ Stage 4 (global): Merge all shard results → global top-K
    ├─ Stage 5 (global): Apply any remaining filter logic
    ├─ Stage 6: Score threshold, optional reranking
    └─ Stage 7: Payload fetch (coordinator fetches from owning shards)

CLIENT RESPONSE

According to distributed vector database research, a distributed vector database must support search across all data shards. The query is broadcast to all workers, each worker performs an ANN search over its shards, and the partial results are then aggregated and the top results are returned.

The network round trip to shards adds 2 to 10ms depending on inter-node latency. The global merge in Stage 4 adds under 1ms regardless of shard count (sorting N * K results where N is the shard count and K is typically 10 to 100). Payload fetches in Stage 7 require a second round trip to the shard that owns each result's payload.

The key insight from distributed vector database architecture research is that distributed vector search is an optimization problem trying to balance three things: recall (which requires searching all shards broadly), latency (which conflicts with wide fan-out), and cost (fewer shards searched means lower resource use). Achieving the highest recall usually requires a wider search, which directly conflicts with the need for low latency and low cost.
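
A minimal sketch of the coordinator's scatter-gather step, assuming an async shard client with a search_shard RPC (the names are illustrative, not any particular engine's API):

python
import asyncio

async def scatter_gather_search(shard_clients, query_vector, k: int, oversampling: int = 3):
    """Broadcast the query to every shard, wait for all, merge into a global top-k."""
    per_shard_limit = k * oversampling
    # Fan out: one concurrent RPC per shard
    shard_results = await asyncio.gather(*[
        client.search_shard(query_vector, limit=per_shard_limit)   # hypothetical RPC
        for client in shard_clients
    ])
    # Global merge: flatten and sort N * per_shard_limit candidates by score
    merged = sorted(
        (c for result in shard_results for c in result),
        key=lambda c: c.score,
        reverse=True,
    )
    return merged[:k]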

Latency Attribution: Where Time Goes

Understanding which stage consumes most of the latency budget allows you to direct optimization effort correctly.

python
import time
import contextlib
from collections import defaultdict

class LatencyTracer:
    """
    Simple stage-level latency tracker for query profiling.
    """
    def __init__(self):
        self.stage_times_ms = defaultdict(list)

    @contextlib.contextmanager
    def trace(self, stage_name: str):
        t0 = time.perf_counter()
        yield
        elapsed_ms = (time.perf_counter() - t0) * 1000
        self.stage_times_ms[stage_name].append(elapsed_ms)

    def report(self) -> dict:
        return {
            stage: {
                "mean_ms": sum(times) / len(times),
                "p95_ms":  sorted(times)[int(0.95 * len(times))],
            }
            for stage, times in self.stage_times_ms.items()
        }


# Instrument a production query loop with this tracer
# (representative_queries and the stage functions below stand in for the real pipeline)
tracer = LatencyTracer()

for query in representative_queries:
    with tracer.trace("api_parse"):
        parsed = parse_request(query)

    with tracer.trace("filter_strategy"):
        strategy = select_filter_strategy(parsed.filter)

    with tracer.trace("ann_traversal"):
        candidates = run_ann_search(parsed.vector, strategy)

    with tracer.trace("segment_merge"):
        merged = merge_segment_results(candidates)

    with tracer.trace("metadata_filter"):
        filtered = apply_filter(merged, parsed.filter)

    with tracer.trace("scoring"):
        scored = apply_threshold_and_rank(filtered, parsed.score_threshold)

    with tracer.trace("payload_fetch"):
        results = fetch_payloads_and_serialize(scored)

report = tracer.report()
for stage, metrics in report.items():
    print(f"{stage:<20}: mean={metrics['mean_ms']:.2f}ms  p95={metrics['p95_ms']:.2f}ms")

# Typical output for a well-tuned single-node deployment:
# api_parse          : mean=0.31ms  p95=0.44ms
# filter_strategy    : mean=0.18ms  p95=0.27ms
# ann_traversal      : mean=8.42ms  p95=12.31ms   ← dominates
# segment_merge      : mean=0.22ms  p95=0.39ms
# metadata_filter    : mean=0.73ms  p95=1.12ms
# scoring            : mean=0.14ms  p95=0.21ms
# payload_fetch      : mean=1.31ms  p95=2.04ms
# ─────────────────────────────────────────────
# TOTAL              : mean=11.31ms p95=16.78ms

The ANN traversal stage dominates. Everything else combined takes less than 3ms. This measurement is the correct basis for optimization prioritization:

If total latency is too high, look at ef_search (HNSW) or nprobe (IVF) first. Reducing ef_search from 128 to 64 typically cuts Stage 3 latency by 40 to 50 percent at the cost of 2 to 4 percentage points of recall.

If payload_fetch is unexpectedly high, the metadata store may be under load from concurrent writes, the payload fields are very large, or the batch fetch is not being used (individual fetches are happening per candidate).

If metadata_filter is high and many candidates are being discarded, the filter selectivity is higher than estimated, the post-filter strategy is running when pre-filter would be more appropriate, or the filter cardinality statistics are stale.

Observability: What to Instrument in Production

A production vector database query pipeline should expose at minimum four metrics per query:

python
# Metrics to emit per query (in OpenTelemetry or Prometheus format)
def record_query_metrics(
    query_id: str,
    stage_latencies: dict[str, float],
    total_latency_ms: float,
    recall_estimate: float | None,   # if ground truth is sampled
    candidates_pre_filter: int,
    candidates_post_filter: int,
    final_results: int,
    shard_count: int,
):
    # Latency by stage (histogram); histogram/gauge stand for whatever metrics client is in use
    for stage, ms in stage_latencies.items():
        histogram.observe("vector_query_stage_latency_ms", ms, labels={"stage": stage})

    # Total query latency (histogram)
    histogram.observe("vector_query_total_latency_ms", total_latency_ms)

    # Filter effectiveness: what fraction of candidates passed the filter?
    filter_pass_rate = candidates_post_filter / max(candidates_pre_filter, 1)
    gauge.set("vector_query_filter_pass_rate", filter_pass_rate)

    # Result count: did the query return fewer than requested?
    gauge.set("vector_query_result_count", final_results)

    # Shard count (for distributed deployments)
    gauge.set("vector_query_shard_count", shard_count)

The filter_pass_rate metric is the most useful diagnostic. A filter_pass_rate below 0.2 (fewer than 20 percent of ANN candidates pass the filter) indicates that the filter is much more selective than the index knows about. This signals that either the metadata statistics are stale or the filter strategy selection threshold needs adjustment.

The result_count metric detects when queries return fewer results than requested. A system that consistently returns 7 results when K=10 is requested has a recall problem, not a latency problem, and the fix is at the filter strategy layer (switch to pre-filter or increase oversampling) rather than at the ANN layer.
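
A small sketch of how those two signals might be turned into automated checks over the recorded metrics (the thresholds are illustrative):

python
def diagnose_query_health(filter_pass_rate: float, final_results: int, requested_k: int) -> list[str]:
    """Map the two key diagnostics to the layer that needs tuning."""
    findings = []
    if filter_pass_rate < 0.2:
        findings.append(
            "Low filter pass rate: refresh metadata statistics or adjust the "
            "filter strategy selection threshold."
        )
    if final_results < requested_k:
        findings.append(
            "Fewer results than requested: switch to pre-filter or increase "
            "oversampling; this is a recall problem, not a latency problem."
        )
    return findings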

Connecting the Full Series

This lifecycle walkthrough assembles every article in the series into a single observable sequence.

Stage 1 connects to the how vector databases work internally pillar, which covers the API layer architecture and the storage layer that backs the metadata store used in Stages 5 and 7.

Stage 2 connects to what is vector indexing, which covers the filter strategy selection framework and the pre-filter vs post-filter tradeoffs in detail.

Stage 3 is where HNSW and IVF algorithms execute. The exact vs ANN tradeoff at the heart of Stage 3 is covered in exact vs approximate nearest neighbor. The distance metric used at each comparison in Stage 3 is covered in cosine similarity vs Euclidean distance.

Stage 3 also uses product quantization when IVF-PQ is the index type, replacing float32 distance computations with ADC table lookups.

Stage 6 is an expansion of what the similarity search article calls the two-stage retrieve-then-rerank pipeline, now placed precisely in the lifecycle context.

Stage 4 (multi-segment merge) and the distributed shard coordination are architectural elements covered in the how vector databases work internally pillar's storage and distributed sections.

Summary

A vector database query lifecycle has seven stages. API parsing validates and deserializes the request. Filter strategy selection estimates how many vectors satisfy the structured filter and chooses pre-filter, post-filter, or exact fallback accordingly. ANN index traversal (the dominant stage) runs HNSW graph search or IVF cluster search in parallel across all segments. Multi-segment merge collects per-segment results and assembles the global candidate list. Metadata filtering discards candidates that fail the structured filter. Scoring and optional reranking produce the final ranked list. Payload fetch loads the full metadata for the top-K and serializes the response.

Stage 3 (ANN traversal) dominates latency. Every other stage combined typically takes less than 5ms. Optimization effort should be directed at Stage 3 first: reduce ef_search or nprobe to gain latency at the cost of recall, or increase it to gain recall at the cost of latency.

Instrumentation should expose per-stage latencies, filter pass rates, and result counts. A consistently low filter pass rate indicates a filter strategy mismatch. Consistently fewer results than requested indicates insufficient oversampling or an unexpectedly selective filter.

This lifecycle repeats for every query that reaches the database. Understanding it fully is what makes you a better debugger of AI systems, not just a consumer of them.


Sources and Further Reading

  1. Pinecone. What Is a Vector Database and How Does It Work? pinecone.io/learn/vector-database
  2. Microsoft Azure. Vector Query Filters in Azure AI Search. learn.microsoft.com/en-us/azure/search/vector-search-filters
  3. Qdrant. Vector Search Documentation. qdrant.tech/documentation/concepts/search
  4. Milvus. Search Concepts and Query Pipeline. milvus.io/docs/single-vector-search.md
  5. Weaviate. Vector Index Concepts. weaviate.io/developers/weaviate/concepts/vector-index
  6. arXiv. Survey of Vector Database Management Systems. arxiv.org/abs/2310.14021
  7. arXiv. Filtered Approximate Nearest Neighbor Search: System Design and Performance Analysis. arxiv.org/abs/2602.11443
  8. arXiv. Exploring Distributed Vector Databases Performance on HPC Platforms. arxiv.org/abs/2509.12384
  9. Aakash Sharan. Distributed Vector Database Architecture: Sharding, Routing, and Scale. aakashsharan.com/distributed-vector-database-architecture-sharding-routing
  10. Bits and Backprops. The Achilles Heel of Vector Search: Filters. yudhiesh.github.io/2025/05/09/the-achilles-heel-of-vector-search-filters
  11. Databricks Blog. Decoupled by Design: Billion-Scale Vector Search. databricks.com/blog/decoupled-design-billion-scale-vector-search
  12. Redis. Common Challenges Working with Vector Databases. redis.io/blog/common-challenges-working-with-vector-databases
  13. arXiv. HAKES: Scalable Vector Database for Embedding Search Service. arxiv.org/abs/2505.12524
  14. Blockchain Council. Vector Database Performance Optimization. blockchain-council.org/ai/vector-database-performance-optimization-recall-latency-cost-indexing-quantization
  15. Instaclustr. How a Vector Index Works and Best Practices. instaclustr.com/education/vector-database/how-a-vector-index-works-and-5-critical-best-practices
