What Is Vector Indexing and Why It Matters
A research-backed guide to vector indexing: what it is, why it exists, how the three-way tradeoff between recall, latency, and memory shapes every index decision, how to select the right index type for your workload, how to measure index quality, and the operational factors that affect index performance in production.
Every database problem eventually reduces to: how do you find relevant data without reading everything?
For text, inverted indexes solved this in the 1960s. Map each word to the list of documents containing it, and a keyword query touches only the relevant posting lists rather than every document. For structured data, B-trees solved this by sorting values into a tree where any lookup takes O(log n) comparisons rather than O(n).
For high-dimensional vectors, neither of those structures works. The search question is different: "find the vectors most similar to this query" is not an exact match problem, and no linear sort order preserves the geometric closeness that makes similarity meaningful. A new class of data structures had to be invented from scratch.
Those structures are what vector indexing covers: the algorithms, data structures, and engineering decisions that make similarity search over millions of vectors fast enough to use in production applications.
This article is the discipline-level view of the full How Vector Databases Work Internally series. The specific algorithms have their own articles: HNSW explained, IVF explained, and Product Quantization explained. The accuracy metric used to evaluate indexes is covered in the exact vs ANN article. The distance metrics that indexes use are in the cosine similarity vs Euclidean distance article. This article provides the unifying framework for choosing between them.
What a Vector Index Is and What It Does
A vector index is a data structure built over a collection of float arrays (vectors) that supports fast approximate nearest neighbor queries.
Before defining it more precisely, here is the baseline it replaces:
import numpy as np
import time
def no_index_search(corpus: np.ndarray, query: np.ndarray, k: int = 10) -> np.ndarray:
    """
    Search without any index: compare query to every stored vector.
    Time complexity: O(n * d) — linear in corpus size.
    """
    scores = corpus @ query  # shape (n,) — one score per vector
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return top_k
d = 1536 # OpenAI text-embedding-3-small
n_values = [10_000, 100_000, 1_000_000, 5_000_000]
for n in n_values:
    corpus = np.random.randn(n, d).astype(np.float32)
    query = np.random.randn(d).astype(np.float32)
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize rows
    query_norm = query / np.linalg.norm(query)
    start = time.perf_counter()
    no_index_search(corpus_norm, query_norm)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"n={n:>10,}: {elapsed_ms:>8.1f}ms (no index)")
# n= 10,000: 6.2ms
# n= 100,000: 62.4ms
# n= 1,000,000: 624.1ms
# n= 5,000,000: 3,120.6ms
The linear relationship is clear. Every ten-fold increase in corpus size produces a ten-fold increase in query time. At 5 million vectors, a single query takes over 3 seconds, which is incompatible with interactive search latency requirements.
An index breaks the linear relationship by pre-computing structural information about where vectors are relative to each other, so that at query time the search can navigate directly to the relevant neighborhood rather than examining every vector. According to Milvus's indexing reference, indexing in vector databases serves to organize high-dimensional data to enable efficient similarity searches. Without an index, searching for the closest vectors to a query would require comparing the query against every vector in the database.
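To make that concrete, here is the same query answered through an index instead of a scan. This is a minimal sketch using FAISS's HNSW implementation, with the corpus shrunk to 100,000 vectors so it builds quickly; the M=16 parameter is illustrative, not prescriptive:
import faiss
import numpy as np

d, n = 1536, 100_000
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)  # unit-normalize so inner product equals cosine

index = faiss.IndexHNSWFlat(d, 16, faiss.METRIC_INNER_PRODUCT)
index.add(corpus)  # build phase: the graph structure is pre-computed once

query = np.random.randn(1, d).astype(np.float32)
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)  # graph traversal touches only a small
                                       # fraction of the corpus per query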
The Three-Way Tradeoff
Every vector index exists inside a three-dimensional constraint space. No index can simultaneously maximize recall, minimize latency, and minimize memory. Understanding the structure of this tradeoff is the foundational skill for index selection and tuning.
The three-way tradeoff triangle:
        Recall (accuracy)
              /\
             /  \
            /    \
           /      \
          /________\
      Latency    Memory
Moving toward any corner moves away from the other two.
Higher recall → larger ef_search (HNSW) or nprobe (IVF)
→ more comparisons per query
→ higher latency per query
→ lower throughput (fewer queries per second)
Lower memory → quantization (PQ or scalar quantization)
→ approximate distances computed from compressed codes
→ lower recall at the same ef_search or nprobe
Lower latency → smaller ef_search or nprobe
→ fewer comparisons
→ lower recall
According to Redis's vector database challenges analysis, both HNSW and IVF make the same fundamental compromise: higher recall costs more time. In benchmarks on the SIFT10M dataset, going from 0.8 to 0.95 recall increased HNSW latency by roughly 31 percent, while IVF latency roughly tripled over the same recall range. The higher you push accuracy, the steeper the cost.
This is not a bug. It is a fundamental property of approximate nearest neighbor search in high-dimensional spaces. Every index selection and tuning decision is a choice about where to sit on this surface.
The practical consequence: you must define your requirements before choosing an index. A RAG pipeline serving interactive user queries has different requirements than a batch recommendation pipeline running overnight. An enterprise knowledge base requiring 99 percent recall has different requirements than a product recommendation engine where 90 percent recall is commercially acceptable.
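The tradeoff surface can be traced directly. A sketch that sweeps ef_search on a small synthetic corpus and prints recall against per-query latency (absolute numbers vary by hardware; the shape of the curve is the point):
import faiss
import numpy as np
import time

d, n, n_queries, k = 384, 100_000, 1_000, 10
corpus = np.random.randn(n, d).astype(np.float32)
queries = np.random.randn(n_queries, d).astype(np.float32)
faiss.normalize_L2(corpus)
faiss.normalize_L2(queries)

flat = faiss.IndexFlatIP(d)
flat.add(corpus)
_, gt = flat.search(queries, k)  # exact ground truth for recall

hnsw = faiss.IndexHNSWFlat(d, 16, faiss.METRIC_INNER_PRODUCT)
hnsw.add(corpus)

for ef in [16, 32, 64, 128, 256]:
    hnsw.hnsw.efSearch = ef
    t0 = time.perf_counter()
    _, ann = hnsw.search(queries, k)
    ms = (time.perf_counter() - t0) * 1000 / n_queries
    recall = np.mean([len(set(gt[i]) & set(ann[i])) / k for i in range(n_queries)])
    print(f"ef_search={ef:>4}: recall={recall:.3f}, latency={ms:.3f}ms")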
What an Index Actually Stores
The internal representation of a vector index differs across algorithm families, but all of them require storing something beyond the raw vectors themselves.
HNSW: Graph Edges
HNSW stores the original float32 vectors plus a graph of edges: each vector has up to M neighbors per layer (2*M at layer 0), stored as integer IDs. The graph overhead at layer 0 is approximately 2*M * 8 bytes per vector (2*M neighbor IDs at 8 bytes each). For M=16 at 1 million vectors, that is 256 MB of graph data beyond the 6 GB of raw vector storage.
HNSW index on disk for 1 million 1536-dim vectors, M=16:
vectors.bin 6,144 MB (float32 arrays)
graph_layer0.bin 256 MB (2*M = 32 neighbor IDs per vector, 8 bytes each)
graph_upper.bin ~8 MB (upper layers, fewer nodes)
metadata.db variable (payloads stored separately)
────────────────────────────
Total index: ~6,408 MB
IVF: Centroid Table and Inverted Lists
IVF stores the centroid vectors (nlist centroids of dimension d) and the assignment of each vector ID to its cluster. The centroid storage is negligible (nlist=1024 at 1536 dimensions is 6 MB). The inverted lists can either store the vector IDs only (with vectors accessed separately) or the full vectors inline with each list.
IVF index on disk for 1 million 1536-dim vectors, nlist=1024:
centroids.bin 6 MB (1024 centroids of 1536 dimensions)
inverted_lists/:
list_0000.bin ~6 MB (vectors assigned to cluster 0)
list_0001.bin ~6 MB
...
list_1023.bin ~6 MB
────────────────────────────
Total: ~6,150 MB (comparable to raw storage + 6 MB overhead)
IVF-PQ: Compressed Inverted Lists
IVF-PQ replaces each full float32 vector in the inverted lists with its PQ code (M bytes instead of d*4 bytes). At M=96, d=1536, the compression is 64x.
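The arithmetic behind that 64x figure:
d, M = 1536, 96
full_vector_bytes = d * 4  # float32: 4 bytes per dimension
pq_code_bytes = M          # 1 byte per subquantizer (each picks one of 256 centroids)
print(full_vector_bytes / pq_code_bytes)  # 64.0, the per-vector compression ratio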
IVF-PQ index for 1 million 1536-dim vectors, nlist=1024, M=96:
centroids.bin 6 MB (uncompressed centroid vectors)
codebooks.bin ~2 MB (96 codebooks of 256 centroids × 16 dims, float32)
inverted_lists/: (compressed PQ codes only)
list_0000.bin ~94 KB (vectors encoded as 96-byte PQ codes)
...
list_1023.bin ~94 KB
────────────────────────────
Total: ~104 MB (compared to 6,150 MB for IVFFlat, a roughly 60x reduction)
The Build Phase: What Happens Before Queries Can Run
Every vector index has a build phase that must complete before queries can be served. The build phase characteristics determine operational flexibility.
HNSW Build
HNSW inserts vectors one at a time into a growing graph. Each insertion:
- Assigns a maximum layer by sampling from an exponentially decaying distribution.
- Descends from the top layer to the assigned max layer via greedy traversal.
- Runs beam search at each layer from max to layer 0, finding M nearest neighbors.
- Creates bidirectional edges to those neighbors.
- Prunes any existing node whose connection count exceeds M after the new edges are added.
The dominant cost is step 3: for each inserted vector, ef_construction distance computations happen at layer 0. Total build cost is approximately O(n * ef_construction * d) distance computations.
import faiss
import numpy as np
import time
d = 384
n = 1_000_000
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)
print(f"{'M':>4} | {'ef_constr':>9} | {'Build time':>11} | {'Index MB':>8}")
print("-" * 42)
for M, ef_c in [(8, 64), (16, 64), (16, 128), (16, 200), (32, 128)]:
    idx = faiss.IndexHNSWFlat(d, M)
    idx.hnsw.efConstruction = ef_c
    t0 = time.perf_counter()
    idx.add(corpus)
    build_s = time.perf_counter() - t0
    # Rough memory: raw vectors + graph edges
    raw_mb = n * d * 4 / 1e6
    graph_mb = n * (2 * M) * 8 / 1e6  # layer 0 stores 2*M neighbor IDs
    total_mb = raw_mb + graph_mb
    print(f"{M:>4} | {ef_c:>9} | {build_s:>9.1f}s | {total_mb:>6.0f}MB")
# Approximate output (1M vectors, 384-dim, single CPU):
#    M | ef_constr |  Build time | Index MB
# ------------------------------------------
#    8 |        64 |      31.2s |   1664MB
#   16 |        64 |      48.7s |   1792MB
#   16 |       128 |      87.4s |   1792MB
#   16 |       200 |     134.8s |   1792MB
#   32 |       128 |     149.3s |   2048MB
HNSW requires no training phase. Vectors can be inserted incrementally at any time. The index degrades slightly when vectors are inserted in a distribution-shifted order (a phenomenon documented in research showing up to 12 percent recall degradation), but for most practical workloads, incremental insertion is acceptable.
IVF Build
IVF build has two distinct stages: training (k-means) and assignment (vector-to-centroid matching).
The training stage is O(n_train * nlist * d * n_iterations). For nlist=1024, training on 100,000 vectors with 20 k-means iterations at d=1536 takes roughly 30 to 90 seconds on CPU depending on hardware.
The assignment stage is O(n * nlist * d): compare each of the n indexed vectors against all nlist centroids. For 1 million vectors, nlist=1024, d=1536, this takes approximately 5 to 15 minutes on a single CPU core.
IVF cannot be built incrementally in a meaningful way. Vectors inserted after training are assigned to the nearest existing centroid. If those vectors come from a different distribution than the training data, some clusters receive far more vectors than expected, degrading search quality. A periodic full rebuild is best practice for workloads with significant data drift.
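A minimal sketch of the two-stage build in FAISS, using a 100,000-vector training sample as in the timing estimates above (at these sizes the script needs several GB of RAM; shrink n for a quick test):
import faiss
import numpy as np

d, n, nlist = 1536, 1_000_000, 1024
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)

quantizer = faiss.IndexFlatIP(d)  # used to find each vector's nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

train_sample = corpus[np.random.choice(n, 100_000, replace=False)]
index.train(train_sample)  # stage 1: k-means learns nlist centroids
index.add(corpus)          # stage 2: every vector is assigned to a cluster

index.nprobe = 16          # clusters scanned per query at search time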
Index Quality: Measuring It Before You Ship
The definition of a high-quality vector index is precise: it consistently returns the true top-K nearest neighbors as determined by exact search, within the latency budget, across all query types in your workload distribution.
Measuring this requires three steps:
import numpy as np
import faiss
import time
from collections import defaultdict
def evaluate_index_quality(
    corpus: np.ndarray,
    hnsw_index,
    representative_queries: np.ndarray,
    query_labels: list[str],  # category label for each query
    k: int = 10,
    ef_search: int = 64,
) -> dict:
    """
    Evaluate ANN index quality against exact search ground truth.
    Groups results by query category to expose per-category recall.
    Returns dict with overall and per-category metrics.
    """
    d = corpus.shape[1]
    # Ground truth: exact search
    flat = faiss.IndexFlatIP(d)
    flat.add(corpus)
    faiss.normalize_L2(representative_queries)
    _, gt_ids = flat.search(representative_queries, k)
    # ANN search
    hnsw_index.hnsw.efSearch = ef_search
    t0 = time.perf_counter()
    _, ann_ids = hnsw_index.search(representative_queries, k)
    latency_ms = (time.perf_counter() - t0) * 1000 / len(representative_queries)
    # Per-query recall
    per_query_recall = []
    for i in range(len(representative_queries)):
        recall = len(set(gt_ids[i]) & set(ann_ids[i])) / k
        per_query_recall.append(recall)
    # Per-category aggregation
    category_recalls = defaultdict(list)
    for recall, label in zip(per_query_recall, query_labels):
        category_recalls[label].append(recall)
    recalls = np.array(per_query_recall)
    return {
        "mean_recall": float(recalls.mean()),
        "median_recall": float(np.median(recalls)),
        "p5_recall": float(np.percentile(recalls, 5)),
        "p1_recall": float(np.percentile(recalls, 1)),
        "mean_latency_ms": latency_ms,
        "per_category": {
            cat: {
                "mean": float(np.mean(v)),
                "p5": float(np.percentile(v, 5)),
                "n": len(v),
            }
            for cat, v in category_recalls.items()
        },
    }
The per-category breakdown is the critical part. Mean recall of 0.95 hides the possibility that one query category (say, queries about newly added topics not well-covered in training data) has recall of 0.70 while all other categories hit 0.98. The per-category view exposes exactly that.
According to Blockchain Council's performance optimization guide, in practice teams set a recall target first (for example, 99 percent Recall@10) and then tune ef_search until they reach that target at the lowest p95/p99 latency under expected concurrency.
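A sketch of that tuning loop, assuming exact-search ground-truth IDs computed as in evaluate_index_quality above; the ef_search ladder is illustrative:
import numpy as np

def tune_ef_for_recall(hnsw_index, queries, gt_ids, k=10, target=0.99):
    """Return the smallest ef_search in the sweep that meets the recall target."""
    recall = 0.0
    for ef in [32, 64, 128, 256, 512, 1024]:
        hnsw_index.hnsw.efSearch = ef
        _, ann_ids = hnsw_index.search(queries, k)
        recall = np.mean([len(set(gt_ids[i]) & set(ann_ids[i])) / k
                          for i in range(len(queries))])
        if recall >= target:
            return ef, recall
    return None, recall  # target unreachable; revisit M/ef_construction or index type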
Index Selection Framework
Choosing the right index type is a decision that follows from four questions asked in sequence.
Question 1: Does the full-precision index fit in RAM?
This is the first filter. If the answer is yes, prefer HNSW or IVFFlat without quantization. Both give higher recall than their compressed counterparts at the same ef/nprobe setting.
def estimate_index_memory_gb(
    n_vectors: int,
    d: int,
    index_type: str,
    M: int = 16,
    nlist: int = None,
    M_pq: int = None,
) -> float:
    """Estimate total RAM required for different index types."""
    raw_vector_gb = n_vectors * d * 4 / 1e9
    if index_type == "flat":
        return raw_vector_gb
    elif index_type == "hnsw":
        graph_gb = n_vectors * (2 * M) * 8 / 1e9
        return raw_vector_gb + graph_gb
    elif index_type == "ivfflat":
        centroid_gb = (nlist or 1024) * d * 4 / 1e9
        return raw_vector_gb + centroid_gb  # centroids are negligible
    elif index_type == "ivfpq":
        pq_bytes = n_vectors * (M_pq or 96)  # M_pq bytes per vector
        codebook_gb = (M_pq or 96) * 256 * (d // (M_pq or 96)) * 4 / 1e9
        return pq_bytes / 1e9 + codebook_gb
    return raw_vector_gb

# Decision table: which index fits in 256 GB RAM?
available_ram_gb = 256
d = 1536

def fits(gb):
    return "✓" if gb <= available_ram_gb else f"{gb:.0f}GB"

print(f"{'n_vectors':>12} | {'Flat':>8} | {'HNSW':>8} | {'IVFFlat':>8} | {'IVF-PQ':>8}")
print("-" * 60)
for n in [1_000_000, 10_000_000, 50_000_000, 100_000_000, 500_000_000]:
    flat_gb = estimate_index_memory_gb(n, d, "flat")
    hnsw_gb = estimate_index_memory_gb(n, d, "hnsw", M=16)
    ivfflat_gb = estimate_index_memory_gb(n, d, "ivfflat", nlist=int(n**0.5))
    ivfpq_gb = estimate_index_memory_gb(n, d, "ivfpq", M_pq=96)
    print(f"{n:>12,} | {fits(flat_gb):>8} | {fits(hnsw_gb):>8} | "
          f"{fits(ivfflat_gb):>8} | {fits(ivfpq_gb):>8}")
#    n_vectors |     Flat |     HNSW |  IVFFlat |   IVF-PQ
# ----------------------------------------------------------
#    1,000,000 |        ✓ |        ✓ |        ✓ |        ✓
#   10,000,000 |        ✓ |        ✓ |        ✓ |        ✓
#   50,000,000 |    307GB |    320GB |    307GB |        ✓
#  100,000,000 |    614GB |    640GB |    614GB |        ✓
#  500,000,000 |   3072GB |   3200GB |   3072GB |        ✓
For 50 million vectors at 1536 dimensions, neither flat, HNSW, nor IVFFlat fits in 256 GB. IVF-PQ with M=96 fits easily (about 4.8 GB). That makes the index selection for large corpora a constraint-driven decision, not a preference.
Question 2: Is the corpus static or frequently updated?
HNSW supports efficient incremental insertion. New vectors are added to the existing graph without rebuilding. Graph quality degrades slightly over time if newly inserted vectors drift away from the original data distribution, but for most workloads this is acceptable.
IVF requires that inserted vectors be assigned to existing centroids. If the data distribution shifts significantly, some clusters become overloaded and others become empty. A periodic full rebuild (retrain k-means, reassign all vectors) is required to restore search quality. For corpora that change less than 10 to 20 percent per month, this is fine. For rapidly evolving corpora, HNSW is more operationally convenient.
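One way to detect that drift on a FAISS IVF index is to watch how unevenly the inverted lists fill up. A sketch, assuming a trained faiss.IndexIVF:
import numpy as np

def ivf_imbalance(ivf_index) -> float:
    """Ratio of largest inverted list to mean list size (1.0 = perfectly balanced)."""
    sizes = np.array([ivf_index.invlists.list_size(i) for i in range(ivf_index.nlist)])
    return float(sizes.max() / max(sizes.mean(), 1.0))

# A ratio that keeps climbing as new vectors arrive is a signal that the
# k-means centroids no longer describe the data and a retrain is due.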
Question 3: Are metadata filters a primary access pattern?
When a query filter is highly selective (only 1 to 5 percent of vectors match), HNSW graph traversal degrades because the traversal must skip many non-matching nodes, breaking the navigable structure. IVF handles this more gracefully by allowing the filter to be applied independently within each searched cluster, without corrupting the global index structure.
If your application requires heavy metadata filtering (multi-tenant isolation, per-user access control, date range filters), benchmark both index types under your actual filter distribution before choosing.
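A sketch of such a benchmark, assuming a recent FAISS build that exposes search-time ID selectors. The ground truth is computed with the same selector on a flat index, so recall is measured against the filtered answer set:
import faiss
import numpy as np

def filtered_recall(ann_index, ann_params_cls, corpus, queries, allowed_ids, k=10):
    """Recall under a metadata filter; allowed_ids holds the matching vector IDs."""
    sel = faiss.IDSelectorBatch(allowed_ids)
    flat = faiss.IndexFlatIP(corpus.shape[1])
    flat.add(corpus)
    _, gt = flat.search(queries, k, params=faiss.SearchParameters(sel=sel))
    _, ann = ann_index.search(queries, k, params=ann_params_cls(sel=sel))
    return float(np.mean([len(set(gt[i]) & set(ann[i])) / k
                          for i in range(len(queries))]))

# Illustrative usage: pass faiss.SearchParametersHNSW for an HNSW index and
# faiss.SearchParametersIVF for an IVF index, then compare the two recalls
# at your real 1 to 5 percent filter selectivity.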
Question 4: Do you have GPU infrastructure?
GPU-accelerated IVF-PQ is available through FAISS GPU extensions and RAPIDS cuVS. On modern A100 GPUs, billion-scale IVF-PQ search runs at sub-millisecond latency per query with high throughput. If your workload justifies GPU infrastructure, IVF-PQ with GPU acceleration is the clear choice at extreme scale.
HNSW's graph traversal is less naturally parallelizable on GPU because the traversal path depends on sequential decisions (each step's direction depends on the current node's neighbors). GPU HNSW implementations exist but have not achieved the same speedup over CPU as GPU IVF-PQ.
The Index Selection Table
Scenario | Recommended Index | Reasoning
---------------------------------+------------------------+-------------------------------
Under 100K vectors | Flat (exact) | Brute force fast enough,
| | no accuracy loss
100K to 100M vectors, fits RAM | HNSW (M=16) | Best recall-latency,
| | incremental insert
100K to 100M vectors, low RAM | IVFFlat + nprobe tune | Same recall as HNSW at
| | lower memory
Over 100M, does not fit flat | IVF-PQ | Only option that fits
Already on PostgreSQL, under 10M | pgvector IVFFlat | No new infra needed
Rapidly growing, high recall | HNSW | Incremental insert,
| | high recall
Mostly static, heavy filtering | IVF-PQ or IVFFlat | Filters degrade HNSW less
Billion-scale, GPU available | IVF-PQ (GPU) | Sub-ms at scale
Operational Index Health
An index that was well-configured at deployment can degrade over time. Three operational factors affect ongoing index quality.
Insertion Order and Distribution Shift
HNSW graph quality depends partly on insertion order. Vectors inserted early accumulate many long-range connections that become navigation highways. If the distribution of vectors changes significantly over time (new topics, new document types, different languages), the early-inserted hub nodes may not be well-connected to the new region of the vector space, degrading recall for queries in that region.
The practical fix: if your corpus grows by more than 30 to 50 percent with significantly different content, rebuild the HNSW index from scratch on the combined corpus. A from-scratch build with high ef_construction consistently produces better graph quality than incremental insertion at equivalent M.
Soft Deletes and Compaction Lag
When vectors are deleted, most vector databases apply soft deletes: the vector is marked as deleted but its graph edges remain. This inflates the effective graph size and means deleted nodes are visited (and discarded) during traversal. At high delete rates, soft-deleted nodes can consume a meaningful fraction of traversal time without contributing to results.
Compaction removes soft-deleted nodes and rebuilds the affected segment's graph. If your workload has a high update or delete rate, monitor the fraction of live vectors per segment (live divided by live plus deleted) and ensure compaction runs frequently enough to keep that fraction above 0.8.
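A sketch of that health check; the live and deleted counts are hypothetical fields that your database's segment-stats API would need to expose:
def needs_compaction(live_count: int, deleted_count: int,
                     min_live_fraction: float = 0.8) -> bool:
    """Flag a segment for compaction when soft-deleted vectors pile up."""
    total = live_count + deleted_count
    live_fraction = live_count / total if total else 1.0
    return live_fraction < min_live_fraction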
Cold Index vs Warm Index
HNSW stores the graph structure in files on disk. When the process starts (or after a crash recovery), the index is loaded from disk into memory. The first queries after load access cold memory pages, producing higher latency than steady-state. Under concurrent load, a cold start can cause latency spikes while the working set pages are loaded into the CPU cache.
The standard mitigation is to warm the index before exposing it to production traffic: run a representative sample of queries against the loaded index before routing live traffic. This pre-loads the hot graph pages into RAM and CPU cache.
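A minimal warming sketch, assuming a FAISS-style index object and a held-out sample of real production queries:
import numpy as np

def warm_index(index, sample_queries: np.ndarray, k: int = 10, passes: int = 2) -> None:
    """Run representative queries so hot graph pages are resident before live traffic."""
    for _ in range(passes):
        index.search(sample_queries, k)  # results discarded; the page access is the point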
Vector Indexing as Infrastructure, Not Just Code
The view of vector indexing as a library call, index.add(vectors) followed by index.search(query, k), misses the operational reality.
A vector index in production is infrastructure with a lifecycle. It is created, populated, queried under concurrent load, updated as the corpus grows, and periodically maintained through compaction and rebuilding. Each phase has failure modes.
According to Weaviate's indexing documentation, indexing vector databases properly is crucial for performance, and different index types serve different purposes. The dynamic index can even start off as a flat index and then dynamically switch to the HNSW index as it scales past a threshold.
That progressive approach is exactly right for production systems. Start with flat (exact) search while the corpus is small. Transition to HNSW when the corpus grows past 50,000 to 100,000 vectors and latency becomes a constraint. Evaluate IVF-PQ when the corpus outgrows available RAM. Build measurement and monitoring into the system from the start so the transitions are data-driven rather than reactive.
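A sketch of that progressive switch with an illustrative 100,000-vector threshold. The wrapper below is hypothetical, not Weaviate's dynamic index, but it shows the shape of the mechanism:
import faiss
import numpy as np

class ProgressiveIndex:
    """Exact search while small; rebuild into HNSW past a corpus-size threshold."""
    def __init__(self, d: int, threshold: int = 100_000, M: int = 16):
        self.d, self.threshold, self.M = d, threshold, M
        self.buffer = []  # raw vectors kept for the one-time rebuild
        self.index = faiss.IndexFlatIP(d)

    def add(self, vectors: np.ndarray) -> None:
        self.buffer.append(vectors)
        self.index.add(vectors)
        if isinstance(self.index, faiss.IndexFlatIP) and self.index.ntotal > self.threshold:
            hnsw = faiss.IndexHNSWFlat(self.d, self.M, faiss.METRIC_INNER_PRODUCT)
            hnsw.add(np.vstack(self.buffer))  # rebuild once into the graph index
            self.index = hnsw

    def search(self, queries: np.ndarray, k: int = 10):
        return self.index.search(queries, k)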
The Connection to Every Other Article in This Series
The How Vector Databases Work Internally pillar covers the full system architecture: the ingestion pipeline, the segment model, the metadata filtering strategies, and horizontal distribution. Vector indexing is the component at the center of that architecture that determines query performance.
The similarity search article covers how ANN index traversal fits into the eight-step query pipeline from API request to ranked result. The distance metrics that the index uses at each comparison step are in the cosine similarity vs Euclidean distance article.
The exact vs ANN article establishes why approximate search is necessary and how recall is measured. The HNSW article and the IVF article cover the two dominant production algorithms in the detail that this article summarized. The Product Quantization article covers the memory compression layer that makes billion-scale IVF-PQ feasible.
The vector query lifecycle article traces a single search request through every internal component of the database, from API parsing through ANN traversal through metadata filtering to response serialization. That article is the final piece of the series, assembling everything covered here into one end-to-end walkthrough.
Summary
Vector indexing is the discipline of building data structures over collections of high-dimensional float arrays that support approximate nearest neighbor search faster than O(n). It exists because brute-force similarity search scales linearly with corpus size, making it too slow for production workloads above 100,000 vectors.
Every vector index makes tradeoffs across three dimensions: recall, latency, and memory. You cannot maximize all three. Index selection is the process of choosing a position on this tradeoff surface that satisfies the requirements of your specific application.
HNSW is the default for most workloads under a few hundred million vectors: high recall, no training phase, incremental insert support. IVF is better for highly selective metadata filters, memory-constrained environments, and batch-build workflows on static corpora. IVF-PQ with compression is the choice when the corpus exceeds available RAM, compressing storage by 32 to 64 times at a moderate recall cost.
Index quality degrades over time through soft delete accumulation, insertion-order skew, and data distribution shift. Measuring recall against exact search on a representative query sample before deployment and periodically in production is the only reliable way to know whether your index is meeting its accuracy targets.
Sources and Further Reading
- Milvus AI Reference. What Is the Purpose of Indexing in a Vector Database? milvus.io/ai-quick-reference/what-is-the-purpose-of-indexing-in-a-vector-database
- Weaviate. Vector Index Concepts. weaviate.io/developers/weaviate/concepts/vector-index
- Teradata. What Is a Vector Index? teradata.com/insights/ai-and-machine-learning/what-is-vector-index
- Instaclustr. How a Vector Index Works and 5 Critical Best Practices. instaclustr.com/education/vector-database/how-a-vector-index-works-and-5-critical-best-practices
- Yugabyte. What Is Vector Indexing? yugabyte.com/key-concepts/what-is-vector-indexing
- IBM. What Is a Vector Database? ibm.com/think/topics/vector-database
- PromptWire. Vector Databases 101: How Indexes Drive Speed and Accuracy. promptwire.co/articles/vector-databases-101-how-indexes-drive-speed-and-accuracy
- Redis. Common Challenges Working with Vector Databases. redis.io/blog/common-challenges-working-with-vector-databases
- Blockchain Council. Vector Database Performance Optimization: Recall, Latency, Cost. blockchain-council.org/ai/vector-database-performance-optimization-recall-latency-cost-indexing-quantization
- Medium / Beyond Localhost. Vector Search: The Latency Tax Nobody Warns You About. medium.com/beyond-localhost/vector-search-the-latency-tax-nobody-warns-you-about
- Medium / Sarthak AI. A VectorDB Does Not Actually Work the Way You Think It Does. sarthakai.substack.com/p/a-vectordb-doesnt-actually-work-the
- Nexumo. 8 Vector Indexes: Cost vs Recall Showdown. medium.com/@Nexumo_/8-vector-indexes-cost-vs-recall-showdown
- Dobson et al. The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in HNSW. arxiv.org/abs/2405.17813
- FAISS. Index Types and Guidelines. faiss.ai/guidelines.html
- Microsoft Learn. Vector Search in SQL Server 2025. learn.microsoft.com/en-us/sql/sql-server/ai/vectors