What Is Vector Indexing and Why It Matters
A research-backed guide to vector indexing: what it is, why it exists, how the three-way tradeoff between recall, latency, and memory shapes every index decision, how to select the right index type for your workload, how to measure index quality, and the operational factors that affect index performance in production.
Every database problem eventually reduces to: how do you find relevant data without reading everything?
For text, inverted indexes solved this in the 1960s. Map each word to the list of documents containing it, and a keyword query touches only the relevant posting lists rather than every document. For structured data, B-trees solved this by sorting values into a tree where any lookup takes O(log n) comparisons rather than O(n).
For high-dimensional vectors, neither of those structures works. The search question is different: "find the vectors most similar to this query" is not an exact match problem, and no linear sort order preserves the geometric closeness that makes similarity meaningful. A new class of data structures had to be invented from scratch.
Those structures are what vector indexing covers: the algorithms, data structures, and engineering decisions that make similarity search over millions of vectors fast enough to use in production applications.
This article is the discipline-level view of the full How Vector Databases Work Internally series. The specific algorithms have their own articles: HNSW explained, IVF explained, and Product Quantization explained. The accuracy metric used to evaluate indexes is covered in the exact vs ANN article. The distance metrics that indexes use are in the cosine similarity vs Euclidean distance article. This article provides the unifying framework for choosing between them.
What a Vector Index Is and What It Does
A vector index is a data structure built over a collection of float arrays (vectors) that supports fast approximate nearest neighbor queries.
Before defining it more precisely, here is the baseline it replaces:
import numpy as np
import time
def no_index_search(corpus: np.ndarray, query: np.ndarray, k: int = 10) -> np.ndarray:
    """
    Search without any index: compare query to every stored vector.
    Time complexity: O(n * d) — linear in corpus size.
    """
    scores = corpus @ query  # shape (n,) — one score per vector
    top_k = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return top_k
d = 1536 # OpenAI text-embedding-3-small
n_values = [10_000, 100_000, 1_000_000, 5_000_000]
for n in n_values:
    corpus = np.random.randn(n, d).astype(np.float32)
    query = np.random.randn(d).astype(np.float32)
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize rows
    query_norm = query / np.linalg.norm(query)
    start = time.perf_counter()
    no_index_search(corpus_norm, query_norm)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"n={n:>10,}: {elapsed_ms:>8.1f}ms (no index)")
# n= 10,000: 6.2ms
# n= 100,000: 62.4ms
# n= 1,000,000: 624.1ms
# n= 5,000,000: 3,120.6ms
The linear relationship is clear. Every ten-fold increase in corpus size produces a ten-fold increase in query time. At 5 million vectors, a single query takes over 3 seconds, which is incompatible with interactive search latency requirements.
An index breaks the linear relationship by pre-computing structural information about where vectors are relative to each other, so that at query time the search can navigate directly to the relevant neighborhood rather than examining every vector. According to Milvus's indexing reference, indexing in vector databases serves to organize high-dimensional data to enable efficient similarity searches. Without an index, searching for the closest vectors to a query would require comparing the query against every vector in the database.
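To make that concrete, here is the same query answered through an index instead of a scan. This is a minimal sketch using FAISS's HNSW implementation, with the corpus shrunk to 100,000 vectors so it builds quickly; the M=16 parameter is illustrative, not prescriptive:
import faiss
import numpy as np

d, n = 1536, 100_000
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)  # unit-normalize so inner product equals cosine

index = faiss.IndexHNSWFlat(d, 16, faiss.METRIC_INNER_PRODUCT)
index.add(corpus)  # build phase: the graph structure is pre-computed once

query = np.random.randn(1, d).astype(np.float32)
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)  # graph traversal touches only a small
                                       # fraction of the corpus per query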
The Three-Way Tradeoff
Every vector index exists inside a three-dimensional constraint space. No index can simultaneously maximize recall, minimize latency, and minimize memory. Understanding the structure of this tradeoff is the foundational skill for index selection and tuning.
The three-way tradeoff triangle:
        Recall (accuracy)
              /\
             /  \
            /    \
           /      \
          /________\
      Latency    Memory
Moving toward any corner moves away from the other two.
Higher recall → larger ef_search (HNSW) or nprobe (IVF)
→ more comparisons per query
→ higher latency per query
→ lower throughput (fewer queries per second)
Lower memory → quantization (PQ or scalar quantization)
→ approximate distances computed from compressed codes
→ lower recall at the same ef_search or nprobe
Lower latency → smaller ef_search or nprobe
→ fewer comparisons
→ lower recall
According to Redis's vector database challenges analysis, both HNSW and IVF make the same fundamental compromise: higher recall costs more time. In benchmarks on the SIFT10M dataset, going from 0.8 to 0.95 recall increased HNSW latency by roughly 31 percent, while IVF latency roughly tripled over the same recall range. The higher you push accuracy, the steeper the cost.
This is not a bug. It is a fundamental property of approximate nearest neighbor search in high-dimensional spaces. Every index selection and tuning decision is a choice about where to sit on this surface.
The practical consequence: you must define your requirements before choosing an index. A RAG pipeline serving interactive user queries has different requirements than a batch recommendation pipeline running overnight. An enterprise knowledge base requiring 99 percent recall has different requirements than a product recommendation engine where 90 percent recall is commercially acceptable.
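The tradeoff surface can be traced directly. A sketch that sweeps ef_search on a small synthetic corpus and prints recall against per-query latency (absolute numbers vary by hardware; the shape of the curve is the point):
import faiss
import numpy as np
import time

d, n, n_queries, k = 384, 100_000, 1_000, 10
corpus = np.random.randn(n, d).astype(np.float32)
queries = np.random.randn(n_queries, d).astype(np.float32)
faiss.normalize_L2(corpus)
faiss.normalize_L2(queries)

flat = faiss.IndexFlatIP(d)
flat.add(corpus)
_, gt = flat.search(queries, k)  # exact ground truth for recall

hnsw = faiss.IndexHNSWFlat(d, 16, faiss.METRIC_INNER_PRODUCT)
hnsw.add(corpus)

for ef in [16, 32, 64, 128, 256]:
    hnsw.hnsw.efSearch = ef
    t0 = time.perf_counter()
    _, ann = hnsw.search(queries, k)
    ms = (time.perf_counter() - t0) * 1000 / n_queries
    recall = np.mean([len(set(gt[i]) & set(ann[i])) / k for i in range(n_queries)])
    print(f"ef_search={ef:>4}: recall={recall:.3f}, latency={ms:.3f}ms")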
What an Index Actually Stores
The internal representation of a vector index differs across algorithm families, but all of them require storing something beyond the raw vectors themselves.
HNSW: Graph Edges
HNSW stores the original float32 vectors plus a graph of edges: each vector has up to M neighbors per layer (2*M at layer 0), stored as integer IDs. The graph overhead at layer 0 is approximately 2*M * 8 bytes per vector (2*M neighbor IDs at 8 bytes each). For M=16 at 1 million vectors, that is 256 MB of graph data beyond the 6 GB of raw vector storage.
HNSW index on disk for 1 million 1536-dim vectors, M=16:
vectors.bin 6,144 MB (float32 arrays)
graph_layer0.bin 256 MB (2*M = 32 neighbor IDs per vector, 8 bytes each)
graph_upper.bin ~8 MB (upper layers, fewer nodes)
metadata.db variable (payloads stored separately)
────────────────────────────
Total index: ~6,408 MB
IVF: Centroid Table and Inverted Lists
IVF stores the centroid vectors (nlist centroids of dimension d) and the assignment of each vector ID to its cluster. The centroid storage is negligible (nlist=1024 at 1536 dimensions is 6 MB). The inverted lists can either store the vector IDs only (with vectors accessed separately) or the full vectors inline with each list.
IVF index on disk for 1 million 1536-dim vectors, nlist=1024:
centroids.bin 6 MB (1024 centroids of 1536 dimensions)
inverted_lists/:
list_0000.bin ~6 MB (vectors assigned to cluster 0)
list_0001.bin ~6 MB
...
list_1023.bin ~6 MB
────────────────────────────
Total: ~6,150 MB (comparable to raw storage + 6 MB overhead)
IVF-PQ: Compressed Inverted Lists
IVF-PQ replaces each full float32 vector in the inverted lists with its PQ code (M bytes instead of d*4 bytes). At M=96, d=1536, the compression is 64x.
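The arithmetic behind that 64x figure:
d, M = 1536, 96
full_vector_bytes = d * 4  # float32: 4 bytes per dimension
pq_code_bytes = M          # 1 byte per subquantizer (each picks one of 256 centroids)
print(full_vector_bytes / pq_code_bytes)  # 64.0, the per-vector compression ratio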
IVF-PQ index for 1 million 1536-dim vectors, nlist=1024, M=96:
centroids.bin 6 MB (uncompressed centroid vectors)
codebooks.bin ~2 MB (96 codebooks of 256 centroids × 16 dims, float32)
inverted_lists/: (compressed PQ codes only)
list_0000.bin ~94 KB (vectors encoded as 96-byte PQ codes)
...
list_1023.bin ~94 KB
────────────────────────────
Total: ~104 MB (compared to 6,150 MB for IVFFlat, a roughly 60x reduction)
The Build Phase: What Happens Before Queries Can Run
Every vector index has a build phase that must complete before queries can be served. The build phase characteristics determine operational flexibility.
HNSW Build
HNSW inserts vectors one at a time into a growing graph. Each insertion:
- Assigns a maximum layer by sampling from an exponentially decaying distribution.
- Descends from the top layer to the assigned max layer via greedy traversal.
- Runs beam search at each layer from max to layer 0, finding M nearest neighbors.
- Creates bidirectional edges to those neighbors.
- Prunes any existing node whose connection count exceeds M after the new edges are added.
The dominant cost is step 3: for each inserted vector, ef_construction distance computations happen at layer 0. Total build cost is approximately O(n * ef_construction * d) distance computations.
import faiss
import numpy as np
import time
d = 384
n = 1_000_000
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)
print(f"{'M':>4} | {'ef_constr':>9} | {'Build time':>11} | {'Index MB':>8}")
print("-" * 42)
for M, ef_c in [(8, 64), (16, 64), (16, 128), (16, 200), (32, 128)]:
    idx = faiss.IndexHNSWFlat(d, M)
    idx.hnsw.efConstruction = ef_c
    t0 = time.perf_counter()
    idx.add(corpus)
    build_s = time.perf_counter() - t0
    # Rough memory: raw vectors + graph edges
    raw_mb = n * d * 4 / 1e6
    graph_mb = n * (2 * M) * 8 / 1e6  # layer 0 stores 2*M neighbor IDs
    total_mb = raw_mb + graph_mb
    print(f"{M:>4} | {ef_c:>9} | {build_s:>9.1f}s | {total_mb:>6.0f}MB")
# Approximate output (1M vectors, 384-dim, single CPU):
#    M | ef_constr |  Build time | Index MB
# ------------------------------------------
#    8 |        64 |      31.2s |   1664MB
#   16 |        64 |      48.7s |   1792MB
#   16 |       128 |      87.4s |   1792MB
#   16 |       200 |     134.8s |   1792MB
#   32 |       128 |     149.3s |   2048MB
HNSW requires no training phase. Vectors can be inserted incrementally at any time. The index degrades slightly when vectors are inserted in a distribution-shifted order (a phenomenon documented in research showing up to 12 percent recall degradation), but for most practical workloads, incremental insertion is acceptable.
IVF Build
IVF build has two distinct stages: training (k-means) and assignment (vector-to-centroid matching).
The training stage is O(n_train * nlist * d * n_iterations). For nlist=1024, training on 100,000 vectors with 20 k-means iterations at d=1536 takes roughly 30 to 90 seconds on CPU depending on hardware.
The assignment stage is O(n * nlist * d): compare each of the n indexed vectors against all nlist centroids. For 1 million vectors, nlist=1024, d=1536, this takes approximately 5 to 15 minutes on a single CPU core.
IVF cannot be built incrementally in a meaningful way. Vectors inserted after training are assigned to the nearest existing centroid. If those vectors come from a different distribution than the training data, some clusters receive far more vectors than expected, degrading search quality. A periodic full rebuild is best practice for workloads with significant data drift.
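A minimal sketch of the two-stage build in FAISS, using a 100,000-vector training sample as in the timing estimates above (at these sizes the script needs several GB of RAM; shrink n for a quick test):
import faiss
import numpy as np

d, n, nlist = 1536, 1_000_000, 1024
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)

quantizer = faiss.IndexFlatIP(d)  # used to find each vector's nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

train_sample = corpus[np.random.choice(n, 100_000, replace=False)]
index.train(train_sample)  # stage 1: k-means learns nlist centroids
index.add(corpus)          # stage 2: every vector is assigned to a cluster

index.nprobe = 16          # clusters scanned per query at search time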
Index Quality: Measuring It Before You Ship
The definition of a high-quality vector index is precise: it consistently returns the true top-K nearest neighbors as determined by exact search, within the latency budget, across all query types in your workload distribution.
Measuring this requires three steps:
import numpy as np
import faiss
import time
from collections import defaultdict
def evaluate_index_quality(
    corpus: np.ndarray,
    hnsw_index,
    representative_queries: np.ndarray,
    query_labels: list[str],  # category label for each query
    k: int = 10,
    ef_search: int = 64,
) -> dict:
    """
    Evaluate ANN index quality against exact search ground truth.
    Groups results by query category to expose per-category recall.
    Returns dict with overall and per-category metrics.
    """
    d = corpus.shape[1]
    # Ground truth: exact search
    flat = faiss.IndexFlatIP(d)
    flat.add(corpus)
    faiss.normalize_L2(representative_queries)
    _, gt_ids = flat.search(representative_queries, k)
    # ANN search
    hnsw_index.hnsw.efSearch = ef_search
    t0 = time.perf_counter()
    _, ann_ids = hnsw_index.search(representative_queries, k)
    latency_ms = (time.perf_counter() - t0) * 1000 / len(representative_queries)
    # Per-query recall
    per_query_recall = []
    for i in range(len(representative_queries)):
        recall = len(set(gt_ids[i]) & set(ann_ids[i])) / k
        per_query_recall.append(recall)
    # Per-category aggregation
    category_recalls = defaultdict(list)
    for recall, label in zip(per_query_recall, query_labels):
        category_recalls[label].append(recall)
    recalls = np.array(per_query_recall)
    return {
        "mean_recall": float(recalls.mean()),
        "median_recall": float(np.median(recalls)),
        "p5_recall": float(np.percentile(recalls, 5)),
        "p1_recall": float(np.percentile(recalls, 1)),
        "mean_latency_ms": latency_ms,
        "per_category": {
            cat: {
                "mean": float(np.mean(v)),
                "p5": float(np.percentile(v, 5)),
                "n": len(v),
            }
            for cat, v in category_recalls.items()
        },
    }
The per-category breakdown is the critical part. Mean recall of 0.95 hides the possibility that one query category (say, queries about newly added topics not well-covered in training data) has recall of 0.70 while all other categories hit 0.98. The per-category view exposes exactly that.
According to Blockchain Council's performance optimization guide, in practice teams set a recall target first (for example, 99 percent Recall@10) and then tune ef_search until they reach that target at the lowest p95/p99 latency under expected concurrency.
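A sketch of that tuning loop, assuming exact-search ground-truth IDs computed as in evaluate_index_quality above; the ef_search ladder is illustrative:
import numpy as np

def tune_ef_for_recall(hnsw_index, queries, gt_ids, k=10, target=0.99):
    """Return the smallest ef_search in the sweep that meets the recall target."""
    recall = 0.0
    for ef in [32, 64, 128, 256, 512, 1024]:
        hnsw_index.hnsw.efSearch = ef
        _, ann_ids = hnsw_index.search(queries, k)
        recall = np.mean([len(set(gt_ids[i]) & set(ann_ids[i])) / k
                          for i in range(len(queries))])
        if recall >= target:
            return ef, recall
    return None, recall  # target unreachable; revisit M/ef_construction or index type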
Index Selection Framework
Choosing the right index type is a decision that follows from four questions asked in sequence.
Question 1: Does the full-precision index fit in RAM?
This is the first filter. If the answer is yes, prefer HNSW or IVFFlat without quantization. Both give higher recall than their compressed counterparts at the same ef/nprobe setting.
def estimate_index_memory_gb(
    n_vectors: int,
    d: int,
    index_type: str,
    M: int = 16,
    nlist: int = None,
    M_pq: int = None,
) -> float:
    """Estimate total RAM required for different index types."""
    raw_vector_gb = n_vectors * d * 4 / 1e9
    if index_type == "flat":
        return raw_vector_gb
    elif index_type == "hnsw":
        graph_gb = n_vectors * (2 * M) * 8 / 1e9
        return raw_vector_gb + graph_gb
    elif index_type == "ivfflat":
        centroid_gb = (nlist or 1024) * d * 4 / 1e9
        return raw_vector_gb + centroid_gb  # centroids are negligible
    elif index_type == "ivfpq":
        pq_bytes = n_vectors * (M_pq or 96)  # M_pq bytes per vector
        codebook_gb = (M_pq or 96) * 256 * (d // (M_pq or 96)) * 4 / 1e9
        return pq_bytes / 1e9 + codebook_gb
    return raw_vector_gb

# Decision table: which index fits in 256 GB RAM?
available_ram_gb = 256
d = 1536

def fits(gb):
    return "✓" if gb <= available_ram_gb else f"{gb:.0f}GB"

print(f"{'n_vectors':>12} | {'Flat':>8} | {'HNSW':>8} | {'IVFFlat':>8} | {'IVF-PQ':>8}")
print("-" * 60)
for n in [1_000_000, 10_000_000, 50_000_000, 100_000_000, 500_000_000]:
    flat_gb = estimate_index_memory_gb(n, d, "flat")
    hnsw_gb = estimate_index_memory_gb(n, d, "hnsw", M=16)
    ivfflat_gb = estimate_index_memory_gb(n, d, "ivfflat", nlist=int(n**0.5))
    ivfpq_gb = estimate_index_memory_gb(n, d, "ivfpq", M_pq=96)
    print(f"{n:>12,} | {fits(flat_gb):>8} | {fits(hnsw_gb):>8} | "
          f"{fits(ivfflat_gb):>8} | {fits(ivfpq_gb):>8}")
#    n_vectors |     Flat |     HNSW |  IVFFlat |   IVF-PQ
# ----------------------------------------------------------
#    1,000,000 |        ✓ |        ✓ |        ✓ |        ✓
#   10,000,000 |        ✓ |        ✓ |        ✓ |        ✓
#   50,000,000 |    307GB |    320GB |    307GB |        ✓
#  100,000,000 |    614GB |    640GB |    614GB |        ✓
#  500,000,000 |   3072GB |   3200GB |   3072GB |        ✓
For 50 million vectors at 1536 dimensions, neither flat, HNSW, nor IVFFlat fits in 256 GB. IVF-PQ with M=96 fits easily (about 4.8 GB). That makes the index selection for large corpora a constraint-driven decision, not a preference.
Question 2: Is the corpus static or frequently updated?
HNSW supports efficient incremental insertion. New vectors are added to the existing graph without rebuilding. Graph quality degrades slightly over time if newly inserted vectors drift away from the original data distribution, but for most workloads this is acceptable.
IVF requires that inserted vectors be assigned to existing centroids. If the data distribution shifts significantly, some clusters become overloaded and others become empty. A periodic full rebuild (retrain k-means, reassign all vectors) is required to restore search quality. For corpora that change less than 10 to 20 percent per month, this is fine. For rapidly evolving corpora, HNSW is more operationally convenient.
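One way to detect that drift on a FAISS IVF index is to watch how unevenly the inverted lists fill up. A sketch, assuming a trained faiss.IndexIVF:
import numpy as np

def ivf_imbalance(ivf_index) -> float:
    """Ratio of largest inverted list to mean list size (1.0 = perfectly balanced)."""
    sizes = np.array([ivf_index.invlists.list_size(i) for i in range(ivf_index.nlist)])
    return float(sizes.max() / max(sizes.mean(), 1.0))

# A ratio that keeps climbing as new vectors arrive is a signal that the
# k-means centroids no longer describe the data and a retrain is due.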
Question 3: Are metadata filters a primary access pattern?
When a query filter is highly selective (only 1 to 5 percent of vectors match), HNSW graph traversal degrades because the traversal must skip many non-matching nodes, breaking the navigable structure. IVF handles this more gracefully by allowing the filter to be applied independently within each searched cluster, without corrupting the global index structure.
If your application requires heavy metadata filtering (multi-tenant isolation, per-user access control, date range filters), benchmark both index types under your actual filter distribution before choosing.
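A sketch of such a benchmark, assuming a recent FAISS build that exposes search-time ID selectors. The ground truth is computed with the same selector on a flat index, so recall is measured against the filtered answer set:
import faiss
import numpy as np

def filtered_recall(ann_index, ann_params_cls, corpus, queries, allowed_ids, k=10):
    """Recall under a metadata filter; allowed_ids holds the matching vector IDs."""
    sel = faiss.IDSelectorBatch(allowed_ids)
    flat = faiss.IndexFlatIP(corpus.shape[1])
    flat.add(corpus)
    _, gt = flat.search(queries, k, params=faiss.SearchParameters(sel=sel))
    _, ann = ann_index.search(queries, k, params=ann_params_cls(sel=sel))
    return float(np.mean([len(set(gt[i]) & set(ann[i])) / k
                          for i in range(len(queries))]))

# Illustrative usage: pass faiss.SearchParametersHNSW for an HNSW index and
# faiss.SearchParametersIVF for an IVF index, then compare the two recalls
# at your real 1 to 5 percent filter selectivity.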
Question 4: Do you have GPU infrastructure?
GPU-accelerated IVF-PQ is available through FAISS GPU extensions and RAPIDS cuVS. On modern A100 GPUs, billion-scale IVF-PQ search runs at sub-millisecond latency per query with high throughput. If your workload justifies GPU infrastructure, IVF-PQ with GPU acceleration is the clear choice at extreme scale.
HNSW's graph traversal is less naturally parallelizable on GPU because the traversal path depends on sequential decisions (each step's direction depends on the current node's neighbors). GPU HNSW implementations exist but have not achieved the same speedup over CPU as GPU IVF-PQ.
The Index Selection Table
Scenario | Recommended Index | Reasoning
---------------------------------+------------------------+-------------------------------
Under 100K vectors | Flat (exact) | Brute force fast enough,
| | no accuracy loss
100K to 100M vectors, fits RAM | HNSW (M=16) | Best recall-latency,
| | incremental insert
100K to 100M vectors, low RAM | IVFFlat + nprobe tune | Same recall as HNSW at
| | lower memory
Over 100M, does not fit flat | IVF-PQ | Only option that fits
Already on PostgreSQL, under 10M | pgvector IVFFlat | No new infra needed
Rapidly growing, high recall | HNSW | Incremental insert,
| | high recall
Mostly static, heavy filtering | IVF-PQ or IVFFlat | Filters degrade HNSW less
Billion-scale, GPU available | IVF-PQ (GPU) | Sub-ms at scale
Operational Index Health
An index that was well-configured at deployment can degrade over time. Three operational factors affect ongoing index quality.
Insertion Order and Distribution Shift
HNSW graph quality depends partly on insertion order. Vectors inserted early accumulate many long-range connections that become navigation highways. If the distribution of vectors changes significantly over time (new topics, new document types, different languages), the early-inserted hub nodes may not be well-connected to the new region of the vector space, degrading recall for queries in that region.
The practical fix: if your corpus grows by more than 30 to 50 percent with significantly different content, rebuild the HNSW index from scratch on the combined corpus. A from-scratch build with high ef_construction consistently produces better graph quality than incremental insertion at equivalent M.
Soft Deletes and Compaction Lag
When vectors are deleted, most vector databases apply soft deletes: the vector is marked as deleted but its graph edges remain. This inflates the effective graph size and means deleted nodes are visited (and discarded) during traversal. At high delete rates, soft-deleted nodes can consume a meaningful fraction of traversal time without contributing to results.
Compaction removes soft-deleted nodes and rebuilds the affected segment's graph. If your workload has a high update or delete rate, monitor the fraction of live vectors per segment (live divided by live plus deleted) and ensure compaction runs frequently enough to keep that fraction above 0.8.
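A sketch of that health check; the live and deleted counts are hypothetical fields that your database's segment-stats API would need to expose:
def needs_compaction(live_count: int, deleted_count: int,
                     min_live_fraction: float = 0.8) -> bool:
    """Flag a segment for compaction when soft-deleted vectors pile up."""
    total = live_count + deleted_count
    live_fraction = live_count / total if total else 1.0
    return live_fraction < min_live_fraction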
Cold Index vs Warm Index
HNSW stores the graph structure in files on disk. When the process starts (or after a crash recovery), the index is loaded from disk into memory. The first queries after load access cold memory pages, producing higher latency than steady-state. Under concurrent load, a cold start can cause latency spikes while the working set pages are loaded into the CPU cache.
The standard mitigation is to warm the index before exposing it to production traffic: run a representative sample of queries against the loaded index before routing live traffic. This pre-loads the hot graph pages into RAM and CPU cache.
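A minimal warming sketch, assuming a FAISS-style index object and a held-out sample of real production queries:
import numpy as np

def warm_index(index, sample_queries: np.ndarray, k: int = 10, passes: int = 2) -> None:
    """Run representative queries so hot graph pages are resident before live traffic."""
    for _ in range(passes):
        index.search(sample_queries, k)  # results discarded; the page access is the point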
Vector Indexing as Infrastructure, Not Just Code
The view of vector indexing as a library call, index.add(vectors) followed by index.search(query, k), misses the operational reality.
A vector index in production is infrastructure with a lifecycle. It is created, populated, queried under concurrent load, updated as the corpus grows, and periodically maintained through compaction and rebuilding. Each phase has failure modes.
According to Weaviate's indexing documentation, indexing vector databases properly is crucial for performance, and different index types serve different purposes. The dynamic index can even start off as a flat index and then dynamically switch to the HNSW index as it scales past a threshold.
That progressive approach is exactly right for production systems. Start with flat (exact) search while the corpus is small. Transition to HNSW when the corpus grows past 50,000 to 100,000 vectors and latency becomes a constraint. Evaluate IVF-PQ when the corpus outgrows available RAM. Build measurement and monitoring into the system from the start so the transitions are data-driven rather than reactive.
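A sketch of that progressive switch with an illustrative 100,000-vector threshold. The wrapper below is hypothetical, not Weaviate's dynamic index, but it shows the shape of the mechanism:
import faiss
import numpy as np

class ProgressiveIndex:
    """Exact search while small; rebuild into HNSW past a corpus-size threshold."""
    def __init__(self, d: int, threshold: int = 100_000, M: int = 16):
        self.d, self.threshold, self.M = d, threshold, M
        self.buffer = []  # raw vectors kept for the one-time rebuild
        self.index = faiss.IndexFlatIP(d)

    def add(self, vectors: np.ndarray) -> None:
        self.buffer.append(vectors)
        self.index.add(vectors)
        if isinstance(self.index, faiss.IndexFlatIP) and self.index.ntotal > self.threshold:
            hnsw = faiss.IndexHNSWFlat(self.d, self.M, faiss.METRIC_INNER_PRODUCT)
            hnsw.add(np.vstack(self.buffer))  # rebuild once into the graph index
            self.index = hnsw

    def search(self, queries: np.ndarray, k: int = 10):
        return self.index.search(queries, k)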
The Connection to Every Other Article in This Series
The How Vector Databases Work Internally pillar covers the full system architecture: the ingestion pipeline, the segment model, the metadata filtering strategies, and horizontal distribution. Vector indexing is the component at the center of that architecture that determines query performance.
The similarity search article covers how ANN index traversal fits into the eight-step query pipeline from API request to ranked result. The distance metrics that the index uses at each comparison step are in the cosine similarity vs Euclidean distance article.
The exact vs ANN article establishes why approximate search is necessary and how recall is measured. The HNSW article and the IVF article cover the two dominant production algorithms in the detail that this article summarized. The Product Quantization article covers the memory compression layer that makes billion-scale IVF-PQ feasible.
The vector query lifecycle article traces a single search request through every internal component of the database, from API parsing through ANN traversal through metadata filtering to response serialization. That article is the final piece of the series, assembling everything covered here into one end-to-end walkthrough.
Summary
Vector indexing is the discipline of building data structures over collections of high-dimensional float arrays that support approximate nearest neighbor search faster than O(n). It exists because brute-force similarity search scales linearly with corpus size, making it too slow for production workloads above 100,000 vectors.
Every vector index makes tradeoffs across three dimensions: recall, latency, and memory. You cannot maximize all three. Index selection is the process of choosing a position on this tradeoff surface that satisfies the requirements of your specific application.
HNSW is the default for most workloads under a few hundred million vectors: high recall, no training phase, incremental insert support. IVF is better for highly selective metadata filters, memory-constrained environments, and batch-build workflows on static corpora. IVF-PQ with compression is the choice when the corpus exceeds available RAM, compressing storage by 32 to 64 times at a moderate recall cost.
Index quality degrades over time through soft delete accumulation, insertion-order skew, and data distribution shift. Measuring recall against exact search on a representative query sample before deployment and periodically in production is the only reliable way to know whether your index is meeting its accuracy targets.
Sources and Further Reading
- Milvus AI Reference. What Is the Purpose of Indexing in a Vector Database? milvus.io/ai-quick-reference/what-is-the-purpose-of-indexing-in-a-vector-database
- Weaviate. Vector Index Concepts. weaviate.io/developers/weaviate/concepts/vector-index
- Teradata. What Is a Vector Index? teradata.com/insights/ai-and-machine-learning/what-is-vector-index
- Instaclustr. How a Vector Index Works and 5 Critical Best Practices. instaclustr.com/education/vector-database/how-a-vector-index-works-and-5-critical-best-practices
- Yugabyte. What Is Vector Indexing? yugabyte.com/key-concepts/what-is-vector-indexing
- IBM. What Is a Vector Database? ibm.com/think/topics/vector-database
- PromptWire. Vector Databases 101: How Indexes Drive Speed and Accuracy. promptwire.co/articles/vector-databases-101-how-indexes-drive-speed-and-accuracy
- Redis. Common Challenges Working with Vector Databases. redis.io/blog/common-challenges-working-with-vector-databases
- Blockchain Council. Vector Database Performance Optimization: Recall, Latency, Cost. blockchain-council.org/ai/vector-database-performance-optimization-recall-latency-cost-indexing-quantization
- Medium / Beyond Localhost. Vector Search: The Latency Tax Nobody Warns You About. medium.com/beyond-localhost/vector-search-the-latency-tax-nobody-warns-you-about
- Medium / Sarthak AI. A VectorDB Does Not Actually Work the Way You Think It Does. sarthakai.substack.com/p/a-vectordb-doesnt-actually-work-the
- Nexumo. 8 Vector Indexes: Cost vs Recall Showdown. medium.com/@Nexumo_/8-vector-indexes-cost-vs-recall-showdown
- Dobson et al. The Impacts of Data, Ordering, and Intrinsic Dimensionality on Recall in HNSW. arxiv.org/abs/2405.17813
- FAISS. Index Types and Guidelines. faiss.ai/guidelines.html
- Microsoft Learn. Vector Search in SQL Server 2025. learn.microsoft.com/en-us/sql/sql-server/ai/vectors