IVF Index Explained for Vector Search
A research-backed technical explanation of the Inverted File Index (IVF) for vector similarity search. Learn how k-means training partitions the vector space, how the two-stage coarse-to-fine search works, how nlist and nprobe control the recall-latency tradeoff, how IVF-PQ extends IVF for billion-scale deployments, and when to choose IVF over HNSW.
A reference librarian does not read every book in the library to find you the one you need. They know the library is organized by subject, and they narrow the search to the shelves most likely to contain what you are looking for before reading any specific entry.
IVF (Inverted File Index) applies this exact intuition to vector search. Before any fine-grained similarity comparison happens, a coarse stage identifies which region of the vector space is most relevant to the query. Only vectors in that region are compared. The rest of the collection is skipped entirely.
The result is a dramatic reduction in the number of distance computations per query: from O(n) in brute-force search to approximately O(nlist + n/nlist * nprobe) in IVF. At one million vectors with nlist=1024 and nprobe=10, that is roughly 11,000 comparisons instead of one million.
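That arithmetic can be checked with a toy cost model (illustrative only; it ignores constant factors and assumes perfectly balanced clusters):

```python
def ivf_comparisons(n: int, nlist: int, nprobe: int) -> int:
    # Stage 1: compare the query against every centroid (nlist comparisons).
    # Stage 2: scan nprobe inverted lists of roughly n / nlist vectors each.
    return nlist + (n // nlist) * nprobe

print(f"{ivf_comparisons(1_000_000, 1024, 10):,} comparisons vs 1,000,000 brute-force")
# 10,784 comparisons vs 1,000,000 brute-force
```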
This article covers the full IVF algorithm from k-means training through query traversal and parameter tuning, including the IVF-PQ variant that makes billion-scale deployment feasible. It is part of the How Vector Databases Work Internally series and connects directly to the HNSW algorithm article, which covers the graph-based alternative that dominates at moderate scale. The exact vs ANN article establishes the context of why approximate search is necessary, and the Product Quantization article covers the compression technique that IVF-PQ adds on top of the base IVF structure.
The Name and the Analogy
The term "inverted file" comes from information retrieval. A forward index maps each document to its list of words. An inverted index maps each word to the list of documents containing it. The inverted structure allows you to go from "I have a word" to "these are the documents that contain it" without scanning every document.
IVF in vector search uses the same inversion logic. Instead of words and documents, it uses cluster centroids and vectors. A forward assignment maps each vector to its centroid. The inverted list maps each centroid to all vectors assigned to it. At query time, "I have a query vector, which centroids is it near?" immediately narrows the search to the inverted lists of those centroids.
Forward (per-vector view):
vector_1 → assigned to centroid_7
vector_2 → assigned to centroid_2
vector_3 → assigned to centroid_7
vector_4 → assigned to centroid_15
...
Inverted (per-centroid view):
centroid_2: [vector_2, vector_8, vector_14, ...]
centroid_7: [vector_1, vector_3, vector_9, ...]
centroid_15: [vector_4, vector_11, ...]
...
Query: "Which centroids are nearest to q?"
→ centroid_7 and centroid_2 are the two nearest
→ search only inverted lists for centroid_7 and centroid_2
→ compare q against [vector_1, vector_3, vector_9, ...] and [vector_2, vector_8, ...]
→ skip all vectors in all other clusters
According to Zilliz's IVF index guide, clustering centroids act as proxies for narrowing the search space. When a query vector is received, the system first compares it to all centroids to identify the nearest clusters. Only the vectors in these selected clusters are then scanned for similarity against the query.
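The inversion itself can be sketched in a few lines of NumPy (a toy illustration, assuming normalized vectors and inner-product similarity):

```python
import numpy as np

def build_inverted_lists(vectors: np.ndarray, centroids: np.ndarray) -> dict:
    # Forward step: assign each vector to its most similar centroid.
    assignments = (vectors @ centroids.T).argmax(axis=1)
    # Inverted step: group vector ids by the centroid they were assigned to.
    return {c: np.flatnonzero(assignments == c) for c in range(len(centroids))}

centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
vectors = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
inv = build_inverted_lists(vectors, centroids)
print(inv)  # {0: array([0, 2]), 1: array([1])}
```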
Phase 1: Training the Index
Before any vector can be indexed, IVF must learn the cluster structure of the data. This training phase runs k-means clustering on a representative sample of the vectors you plan to index.
K-Means Clustering
K-means starts with nlist randomly chosen centroids. It then iterates:
- Assign each training vector to its nearest centroid.
- Recompute each centroid as the mean of all vectors assigned to it.
- Repeat until centroid positions stop changing (convergence).
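That loop can be sketched directly in NumPy (a minimal Lloyd's iteration for illustration; FAISS's k-means implementation is far more optimized):

```python
import numpy as np

def kmeans(x: np.ndarray, nlist: int, n_iter: int = 20, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Start from nlist randomly chosen training vectors.
    centroids = x[rng.choice(len(x), nlist, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared L2 distance.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        for c in range(nlist):
            members = x[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids, assign
```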
import numpy as np
import faiss
def train_ivf_index(
training_vectors: np.ndarray,
d: int,
nlist: int,
n_iterations: int = 20,
) -> faiss.IndexIVFFlat:
"""
Train an IVF-Flat index by running k-means on training_vectors.
The training phase learns nlist cluster centroids.
After training, new vectors can be added to the index.
"""
# The quantizer is a flat index used to find the nearest centroid
# for each vector during both training and query
quantizer = faiss.IndexFlatIP(d) # Inner product (cosine after L2 norm)
index = faiss.IndexIVFFlat(
quantizer,
d,
nlist,
faiss.METRIC_INNER_PRODUCT,
)
# Training: run k-means on the training data
# training_vectors should be at least 39 * nlist for stable clustering
print(f"Training on {len(training_vectors):,} vectors with nlist={nlist}...")
index.train(training_vectors)
print(f"Training complete. Index is_trained: {index.is_trained}")
return index
# How many training vectors are needed?
# Rule of thumb: at least 39 * nlist, ideally more
# FAISS will warn if you provide fewer than this minimum
nlist = 1024
min_training = 39 * nlist # ~40,000 vectors for nlist=1024
ideal_training = 100 * nlist # ~100,000 vectors for better cluster quality
print(f"nlist={nlist}: minimum {min_training:,}, ideal {ideal_training:,} training vectors")
# nlist=1024: minimum 40,000, ideal 100,000 training vectors
The training phase only runs once per index. Once centroids are learned, they are fixed. All subsequent vector additions use these pre-learned centroids to determine cluster assignment.
The training data must be representative of the full collection you plan to index. If your training data does not cover all semantic regions of the embedding space, some regions will have poorly placed centroids and recall will suffer for queries in those regions. A common practice is to train on a random sample of 5 to 10 percent of the full corpus, which is large enough to represent the distribution without making training impractically slow.
Visualizing the Partitioned Space
After training, the vector space is divided into nlist Voronoi cells. Each cell contains all points closer to its centroid than to any other centroid. The boundaries between cells are equidistant from neighboring centroids.
2D illustration of IVF partitioning (nlist=6):
┌─────────────────────────────────────────┐
│ C1 C2 C3 │
│ × . . . × . . × . . . │
│ . . . . . . . . . . . . . │
│ . . . . . . . . . . . . . │
│ Cell 1 Cell 2 Cell 3 │
├──────────────┬─────────────┬────────────┤
│ Cell 4 │ Cell 5 │ Cell 6 │
│ . . . . │ . . . . . │ . . . │
│ . . . │ . . . . │ . . │
│ C4 × │ C5 × │ C6 × │
└──────────────┴─────────────┴────────────┘
Query q: falls near the boundary of Cell 2 and Cell 5.
nprobe=1 → searches Cell 2 only → may miss neighbors in Cell 5.
nprobe=2 → searches Cell 2 and Cell 5 → recovers boundary neighbors.
This boundary effect is the fundamental limitation of IVF. According to VeloDB's IVF explanation, a very small nprobe (like 1) makes queries very fast but can miss neighbors that lie outside the single closest cluster. This is the cluster boundary problem: a query near a cluster boundary might actually be closer to points in an adjacent cluster.
Phase 2: Adding Vectors to the Index
After training, vectors are added to the index. Each vector is assigned to the centroid it is closest to and appended to that centroid's inverted list.
import numpy as np
import faiss
import time
d = 1536 # OpenAI text-embedding-3-small dimensions
n = 1_000_000
nlist = 1024
# Generate synthetic vectors for demonstration
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)
# Train and then add vectors
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
# Train on a representative sample
train_sample = corpus[:100_000]
t_train = time.perf_counter()
index.train(train_sample)
train_s = time.perf_counter() - t_train
print(f"Training time: {train_s:.1f}s on {len(train_sample):,} vectors")
# Add all vectors (assigns each to nearest centroid)
t_add = time.perf_counter()
index.add(corpus)
add_s = time.perf_counter() - t_add
print(f"Indexing time: {add_s:.1f}s for {n:,} vectors")
print(f"Index contains: {index.ntotal:,} vectors in {nlist} clusters")
print(f"Avg vectors per cluster: {n // nlist:,}")
# Training time: 12.3s on 100,000 vectors
# Indexing time: 4.8s for 1,000,000 vectors
# Index contains: 1,000,000 vectors in 1024 clusters
# Avg vectors per cluster: 976
The indexing phase (adding vectors after training) is fast because each vector needs only one pass of centroid comparisons: compare it against all nlist centroids and assign it to the nearest. This is O(n * nlist * d) in total, which for typical values is much faster than building an HNSW graph.
Phase 3: Query Execution
The query pipeline has two stages: coarse search and fine search.
Stage 1: Coarse Search (Centroid Comparison)
The query vector is compared against all nlist centroids. The nprobe closest centroids are selected. This stage is fast because there are only nlist centroids: typically 256 to 4096 comparisons, not millions.
Stage 1: Coarse centroid search
Query vector q:
Compare q to centroid_0: similarity = 0.31
Compare q to centroid_1: similarity = 0.72 ← 2nd nearest
Compare q to centroid_2: similarity = 0.89 ← nearest
Compare q to centroid_3: similarity = 0.45
...
Compare q to centroid_1023: similarity = 0.18
nprobe = 2 → search inverted lists for centroid_2 and centroid_1
Total centroid comparisons: nlist = 1024 (fast, one pass)
Stage 2: Fine Search (Within Selected Clusters)
The inverted lists of the nprobe selected centroids are searched exhaustively. Every vector in those lists is compared against the query vector. The top-K across all searched lists are returned.
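Both stages together can be sketched in plain NumPy (a toy model of the pipeline, not the FAISS implementation; it assumes inner-product similarity and an `inverted_lists` dict mapping each centroid id to an array of vector ids):

```python
import numpy as np

def ivf_search(q, centroids, inverted_lists, vectors, k=10, nprobe=2):
    # Stage 1 (coarse): rank all centroids by inner-product similarity to q.
    probe = np.argsort(centroids @ q)[::-1][:nprobe]
    # Stage 2 (fine): exhaustively scan only the selected inverted lists.
    candidates = np.concatenate([inverted_lists[int(c)] for c in probe])
    sims = vectors[candidates] @ q
    order = np.argsort(sims)[::-1][:k]
    return candidates[order], sims[order]

centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
vectors = np.array([[0.99, 0.1], [0.1, 0.99], [0.9, 0.3], [0.2, 0.95]])
inverted = {0: np.array([0, 2]), 1: np.array([1, 3])}
ids, sims = ivf_search(np.array([1.0, 0.2]), centroids, inverted, vectors, k=2, nprobe=1)
print(ids)  # [0 2] — only cluster 0's list was scanned
```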
import numpy as np
import faiss
def ivf_search_with_measurement(
index: faiss.IndexIVFFlat,
queries: np.ndarray,
k: int = 10,
nprobe: int = 10,
) -> tuple[np.ndarray, np.ndarray, float]:
"""
Run IVF search and return distances, indices, and latency.
nprobe can be changed at query time — no index rebuild needed.
"""
index.nprobe = nprobe
import time
t0 = time.perf_counter()
distances, indices = index.search(queries, k)
elapsed_ms = (time.perf_counter() - t0) * 1000 / len(queries)
return distances, indices, elapsed_ms
def compute_recall(ground_truth: np.ndarray, ann_results: np.ndarray) -> float:
k = ground_truth.shape[1]
recalls = [
len(set(ground_truth[i]) & set(ann_results[i])) / k
for i in range(len(ground_truth))
]
return float(np.mean(recalls))
# Build ground truth with flat (exact) search
d = 128
n = 500_000
n_queries = 500
k = 10
corpus = np.random.randn(n, d).astype(np.float32)
queries = np.random.randn(n_queries, d).astype(np.float32)
faiss.normalize_L2(corpus)
faiss.normalize_L2(queries)
flat_index = faiss.IndexFlatIP(d)
flat_index.add(corpus)
gt_distances, gt_indices = flat_index.search(queries, k)
# Build IVF index
nlist = 1024
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(corpus[:50_000])
ivf.add(corpus)
# Measure recall vs latency at different nprobe values
print(f"{'nprobe':>8} | {'Recall@10':>10} | {'Latency (ms)':>13} | {'Vectors searched':>17}")
print("-" * 60)
for nprobe in [1, 5, 10, 20, 40, 64, 100, 200, nlist]:
_, ann_ids, latency = ivf_search_with_measurement(ivf, queries, k, nprobe)
recall = compute_recall(gt_indices, ann_ids)
# Approximate number of vectors searched per query
avg_list_size = n // nlist
approx_searched = min(nprobe * avg_list_size, n)
print(
f"{nprobe:>8} | {recall:>10.4f} | {latency:>11.3f}ms | "
f"{approx_searched:>15,} ({100*approx_searched/n:.1f}%)"
)
# Approximate output (500K vectors, 128 dims, nlist=1024):
# nprobe | Recall@10 | Latency (ms) | Vectors searched
# ------------------------------------------------------------
# 1 | 0.4832 | 0.082ms | 488 (0.1%)
# 5 | 0.7813 | 0.218ms | 2,441 (0.5%)
# 10 | 0.8924 | 0.401ms | 4,883 (1.0%)
# 20 | 0.9448 | 0.782ms | 9,766 (2.0%)
# 40 | 0.9731 | 1.541ms | 19,531 (3.9%)
# 64 | 0.9856 | 2.438ms | 31,250 (6.3%)
# 100 | 0.9934 | 3.796ms | 48,828 (9.8%)
# 200 | 0.9978 | 7.512ms | 97,656 (19.5%)
# 1024 | 1.0000 | 38.241ms | 500,000 (100%) ← exact search
The output shows the tradeoff in concrete numbers. At nprobe=10, you search only 1 percent of the collection and achieve 89 percent recall. At nprobe=40, you search 4 percent and achieve 97 percent recall. At nprobe=1024 (equal to nlist), you search 100 percent and achieve 100 percent recall — equivalent to exact search, but slower than a flat index because of the centroid lookup overhead.
The Two Key Parameters: nlist and nprobe
nlist: How Many Clusters
nlist is set at index creation time and cannot be changed without rebuilding the index. It controls how many Voronoi cells partition the vector space.
Effect of nlist on cluster structure:
Small nlist (e.g., 256 for 1M vectors):
Each cluster contains ~3,906 vectors on average.
Centroid comparison: 256 operations (very fast).
Fine search per cluster: ~3,906 vectors examined.
Downside: Large clusters mean each nprobe searches many vectors.
Must use higher nprobe to achieve same recall.
Large nlist (e.g., 4096 for 1M vectors):
Each cluster contains ~244 vectors on average.
Centroid comparison: 4,096 operations (slower first stage).
Fine search per cluster: ~244 vectors examined.
Downside: Many small clusters → true neighbors spread across more clusters.
Must use higher nprobe to cover the same proportion of space.
Optimal nlist (heuristic: sqrt(n)):
For 1M vectors: sqrt(1,000,000) = 1,000 → use nlist=1,024 (nearest power of 2).
Balances centroid comparison cost against per-cluster search cost.
According to multiple sources including RAPIDS cuVS documentation and NVIDIA's technical blog, empirically the square root of the number of indexed vectors is a good starting point for nlist. The pgvector documentation provides a complementary heuristic: for datasets under one million rows, use nlist = rows / 1000; for larger datasets, use nlist = sqrt(rows).
def recommended_nlist(n_vectors: int) -> int:
"""Starting nlist recommendation based on dataset size."""
import math
if n_vectors < 1_000_000:
return max(1, n_vectors // 1000)
else:
raw = int(math.sqrt(n_vectors))
# Round to nearest power of 2 for memory alignment
return 2 ** round(math.log2(raw))
for n in [10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]:
nlist = recommended_nlist(n)
avg_per_cluster = n // nlist
print(f"n={n:>12,} → nlist={nlist:>6,} (avg {avg_per_cluster:>6,} vectors/cluster)")
# n= 10,000 → nlist= 10 (avg 1000 vectors/cluster)
# n= 100,000 → nlist= 100 (avg 1000 vectors/cluster)
# n= 1,000,000 → nlist= 1024 (avg 976 vectors/cluster)
# n= 10,000,000 → nlist= 4096 (avg 2441 vectors/cluster)
# n= 100,000,000 → nlist= 8,192 (avg 12,207 vectors/cluster)
Because nlist is fixed at creation time, changing it requires a full index rebuild. Test several values (ideally in powers of two such as 512, 1024, 2048) on your actual data before committing to a production configuration.
nprobe: How Many Clusters to Search
nprobe is a query-time parameter. It can be changed per query without touching the index. This makes it the primary lever for controlling the recall-latency tradeoff in production.
# nprobe can be set globally or per-query in FAISS
index.nprobe = 10 # applies to all subsequent searches
# Qdrant uses HNSW rather than IVF, so it does not expose nprobe;
# the analogous recall/latency knobs are set via search params
from qdrant_client import QdrantClient, models
client = QdrantClient(host="localhost", port=6333)
# Standard search uses default parameters
results = client.search(
collection_name="my_collection",
query_vector=query_vec,
limit=10,
)
# Higher-recall search via quantization rescoring and oversampling
results_high_recall = client.search(
collection_name="my_collection",
query_vector=query_vec,
limit=10,
search_params=models.SearchParams(
quantization=models.QuantizationSearchParams(
rescore=True,
oversampling=2.0,
),
exact=False,
),
)
A practical rule for nprobe starting points based on nlist:
nlist setting | Low recall target (90%) | High recall target (99%)
----------------+-------------------------+--------------------------
256 | nprobe = 5 to 10 | nprobe = 20 to 40
1024 | nprobe = 10 to 20 | nprobe = 40 to 100
4096 | nprobe = 20 to 50 | nprobe = 100 to 200
Always measure recall on your actual data rather than relying on these tables. The boundary between clusters depends on data distribution, which varies significantly across different embedding models and document types.
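One way to operationalize that measurement: sweep nprobe upward until a recall target is met on a representative query sample (a sketch; `measure_recall` is a placeholder callable you would implement against your own index and ground truth):

```python
def calibrate_nprobe(measure_recall, nlist: int, target: float = 0.95) -> int:
    """Return the smallest nprobe (doubling upward) whose measured recall
    meets the target. measure_recall(nprobe) -> recall@k is supplied by the
    caller and should run a representative query sample against ground truth."""
    nprobe = 1
    while nprobe < nlist:
        if measure_recall(nprobe) >= target:
            return nprobe
        nprobe *= 2
    return nlist  # searching every cluster is equivalent to exact search

# Example with a fake, monotonically increasing recall curve:
fake_recall = lambda p: min(1.0, 0.5 + 0.05 * p)
print(calibrate_nprobe(fake_recall, 1024, target=0.95))  # 16
```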
The Cluster Boundary Problem in Depth
The cluster boundary problem is the most important failure mode to understand when deploying an IVF index.
Cluster boundary scenario:
Cluster A centroid: [0.8, 0.2, ...]
Cluster B centroid: [0.7, 0.3, ...]
Query q: [0.75, 0.25, ...] ← exactly between both centroids
True nearest neighbor T: [0.72, 0.28, ...] ← in Cluster B's inverted list
Distance from q to A centroid: 0.071
Distance from q to B centroid: 0.054 ← B is slightly closer
With nprobe=1: search Cluster B only → finds T → correct
With nprobe=1: search Cluster A only → misses T → wrong
But what if the clustering placed q slightly inside Cluster A?
q at [0.76, 0.24, ...]:
Distance to A centroid: 0.057 ← A is now slightly closer
nprobe=1 → searches A → misses T in B
This is a miss probability that cannot be eliminated with nprobe=1.
It only goes away when nprobe is large enough to cover both A and B.
def demonstrate_boundary_problem(n_boundary_queries: int = 1000):
"""
Measure recall specifically for queries near cluster boundaries.
These queries require higher nprobe to achieve the same recall as
queries near cluster centroids.
"""
import numpy as np
import faiss
d = 64
n = 100_000
nlist = 256
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(corpus)
ivf.add(corpus)
flat = faiss.IndexFlatIP(d)
flat.add(corpus)
# Generate queries near cluster boundaries by averaging two centroid locations
centroids = ivf.quantizer.reconstruct_n(0, nlist)  # (nlist, d) centroid matrix
boundary_queries = []
for _ in range(n_boundary_queries):
i, j = np.random.choice(nlist, 2, replace=False)
midpoint = (centroids[i] + centroids[j]) / 2
midpoint /= np.linalg.norm(midpoint)
boundary_queries.append(midpoint)
boundary_queries = np.array(boundary_queries, dtype=np.float32)
_, gt = flat.search(boundary_queries, 10)
print("Recall for queries near cluster boundaries:")
print(f"{'nprobe':>8} | {'Recall@10':>10}")
print("-" * 24)
for nprobe in [1, 2, 5, 10, 20, 50]:
ivf.nprobe = nprobe
_, ann = ivf.search(boundary_queries, 10)
recall = np.mean([
len(set(gt[i]) & set(ann[i])) / 10
for i in range(len(boundary_queries))
])
print(f"{nprobe:>8} | {recall:>10.4f}")
# Boundary queries show lower recall than average at same nprobe,
# confirming that nprobe must be tuned against worst-case queries,
# not just mean recall across all queries.
The diagnostic implication is important: if your production search has systematically lower recall for a specific category of queries, those queries may be landing near cluster boundaries. The fix is to increase nprobe (accepting higher latency) or to re-run training with more nlist clusters (so boundaries are more precisely defined). According to the Milvus IVF vs HNSW guide, a smaller nprobe leads to faster queries but may reduce recall, while a larger nprobe covers more clusters and leads to higher recall but higher latency.
IVF Variants: From IVFFlat to IVF-PQ
The base IVF algorithm stores full float32 vectors in each inverted list. This is called IVFFlat. Several variants reduce memory usage or improve performance at scale.
IVFFlat
Stores full float32 vectors. Simple, fast, and exact within each searched cluster. Memory usage is approximately the same as raw vector storage plus nlist centroid vectors.
# IVFFlat: full float32 vectors, exact search within clusters
quantizer = faiss.IndexFlatIP(d)
index_flat = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index_flat.train(training_data)
index_flat.add(corpus)
# Memory: n * d * 4 bytes (raw vectors) + nlist * d * 4 bytes (centroids)
# For 1M vectors at 1536 dims: ~6 GB + ~6 MB centroid overhead
IVF-PQ: Product Quantization for Compressed Storage
IVF-PQ compresses the vectors in each inverted list using Product Quantization (PQ). PQ splits each vector into M subvectors, clusters each subspace into 256 centroids, and stores only the centroid index (one byte) per subvector. A 1536-dimensional float32 vector (6,144 bytes) compressed with M=96 becomes 96 bytes, a 64x reduction.
import faiss
import numpy as np
d = 1536
n = 10_000_000 # 10 million vectors — IVF-PQ territory
nlist = 4096
M_pq = 96 # number of PQ subvectors (d must be divisible by M_pq)
nbits = 8 # 8 bits per subvector = 256 centroids per subspace
quantizer = faiss.IndexFlatIP(d)
index_ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, M_pq, nbits)
# Train requires representative sample
training_data = np.random.randn(min(n, 200_000), d).astype(np.float32)
faiss.normalize_L2(training_data)
index_ivfpq.train(training_data)
# Add all vectors
corpus = np.random.randn(n, d).astype(np.float32)
faiss.normalize_L2(corpus)
index_ivfpq.add(corpus)
# Memory comparison
raw_bytes = n * d * 4
ivfflat_bytes = raw_bytes # IVFFlat stores full vectors
ivfpq_bytes = n * M_pq # IVF-PQ stores M_pq bytes per vector
centroid_bytes = nlist * d * 4 # centroid vectors are stored uncompressed
print(f"IVFFlat memory: {ivfflat_bytes / 1e9:.1f} GB")
print(f"IVF-PQ memory: {(ivfpq_bytes + centroid_bytes) / 1e9:.2f} GB")
print(f"Compression: {ivfflat_bytes / (ivfpq_bytes + centroid_bytes):.0f}x")
# For 10M vectors at 1536 dims:
# IVFFlat memory: 61.4 GB
# IVF-PQ memory: 0.99 GB
# Compression: 62x
The compression is dramatic. Sixty-one gigabytes of IVFFlat storage becomes under one gigabyte with IVF-PQ. For deployments where RAM is constrained or the corpus size exceeds available memory, IVF-PQ is often the only feasible option.
The cost is accuracy loss during the fine search stage. IVF-PQ computes approximate distances using compressed codes rather than exact dot products. This introduces an additional source of error on top of the cluster selection approximation. In practice, well-tuned IVF-PQ achieves 80 to 95 percent recall at the same nprobe as IVFFlat, depending on the compression ratio and the data distribution.
According to VeloDB's IVF guide, if the lists store compressed codes (IVFPQ or similar), the distance is approximated via those codes, possibly using precomputed lookup tables for efficiency.
The Product Quantization article covers the PQ compression algorithm in full detail, including subspace decomposition, codebook training, and asymmetric distance computation.
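As a toy illustration of what "one byte per subvector" means, here is PQ encoding and lossy decoding in NumPy (codebook training is out of scope here; the codebooks are assumed already trained, one 256-entry table per subspace):

```python
import numpy as np

def pq_encode(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    # codebooks has shape (M, 256, d // M): one centroid table per subspace.
    M, _, dsub = codebooks.shape
    subs = x.reshape(M, dsub)
    # One uint8 code per subvector: index of the nearest subspace centroid.
    return np.array(
        [((codebooks[m] - subs[m]) ** 2).sum(axis=1).argmin() for m in range(M)],
        dtype=np.uint8,
    )

def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    # Lossy reconstruction: concatenate the selected subspace centroids.
    return np.concatenate([codebooks[m][c] for m, c in enumerate(codes)])

rng = np.random.default_rng(0)
cb = rng.normal(size=(4, 256, 2))          # M=4 subspaces of 2 dims each
x = np.concatenate([cb[m, 7] for m in range(4)])  # a vector built from entry 7
codes = pq_encode(x, cb)
print(codes)  # [7 7 7 7] — 4 bytes instead of 8 float32 values
```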
Comparing IVF and HNSW
The HNSW article covers the graph-based algorithm that is the primary alternative to IVF. The two algorithms make different tradeoffs across every relevant dimension.
Dimension | IVF | HNSW
-----------------------+-----------------------------+----------------------------
Algorithm type | Cluster-based partitioning | Graph traversal
Training required | Yes (k-means phase) | No
Incremental insert | Yes (assign to nearest | Yes (add node + connect)
| centroid, no rebuild) |
Insert quality | Stable (fixed centroids) | Degrades if insertion order
| | is poorly distributed
Build time | Fast after training | Moderate
Memory overhead | Centroid storage only | Graph edges: M * 8 bytes
| (~1% of raw vectors) | per node (1.5 to 2x total)
Recall at same latency | Lower (cluster boundaries) | Higher (denser graph)
Filtered search | Stable (search by cluster) | Degrades with very selective
| | filters
GPU acceleration | Strong (IVF-PQ with GPU) | Limited
Billion-scale | Yes (IVF-PQ) | Difficult (memory)
Scale ceiling | Effectively unlimited (PQ) | ~200M to 500M in RAM
Parameter tuning | nlist (build), nprobe (query)| M (build), ef (build+query)
Default in | FAISS, Milvus (PQ variant) | Qdrant, Weaviate, pgvector
The guidance from Milvus's IVF vs HNSW comparison: IVF can first perform a coarse-grained filter at the centroid level to quickly narrow down the candidate set, then conduct fine-grained distance calculations within the selected clusters. This maintains stable and predictable performance even when a large portion of data is filtered out.
HNSW's filtered search degrades when a metadata filter is highly selective because the graph traversal must skip many nodes, breaking the navigable structure. IVF is less affected because each cluster is an independent inverted list: a filter can be applied independently within each searched cluster without affecting the centroid comparison stage.
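The clean composition can be sketched schematically (not any engine's actual implementation; `allowed_ids` stands in for the result of a metadata predicate, and `candidate_ids` for the contents of the probed inverted lists):

```python
import numpy as np

def filtered_fine_scan(q, candidate_ids, vectors, allowed_ids, k=10):
    # The filter prunes candidates inside the probed inverted lists before
    # any distance is computed; the coarse centroid stage is untouched.
    ids = np.array([i for i in candidate_ids if i in allowed_ids], dtype=int)
    if ids.size == 0:
        return ids
    sims = vectors[ids] @ q
    return ids[np.argsort(sims)[::-1][:k]]

vectors = np.array([[0.5, 0.5], [0.9, 0.1], [0.3, 0.7], [0.2, 0.9]])
top = filtered_fine_scan(np.array([1.0, 0.0]), [0, 1, 2, 3], vectors, {1, 3}, k=2)
print(top)  # [1 3] — vectors 0 and 2 were filtered out before scoring
```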
Choose IVF when:
- Your dataset is very large (hundreds of millions or billions of vectors) and you need the compression of IVF-PQ to fit in memory.
- You have a static or slowly changing corpus where the training phase cost is amortized over long index lifetimes.
- Your queries involve highly selective metadata filters.
- You have GPU infrastructure and can use GPU-accelerated IVF-PQ for sub-millisecond search on billion-scale datasets.
Choose HNSW when:
- Your corpus is actively growing and you need efficient incremental insertion.
- Your dataset size is in the range of millions to low hundreds of millions of vectors that fit in RAM.
- You need maximum recall at the lowest possible latency.
- Your embedding model is well-suited to graph traversal (dense, uniformly distributed embeddings).
Using IVFFlat with pgvector
For teams already running PostgreSQL, pgvector supports IVFFlat through a standard SQL index syntax. This brings approximate vector search into the same database that stores your structured application data.
-- Create a table with a vector column
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT NOT NULL,
category VARCHAR(50),
embedding vector(1536) -- pgvector type
);
-- Create IVFFlat index
-- lists = nlist, analogous to FAISS nlist parameter
CREATE INDEX documents_embedding_ivfflat
ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1024);
-- Set nprobe at query time via session variable
SET ivfflat.probes = 20;
-- Search with metadata filter + IVF approximate search
SELECT
id,
content,
embedding <=> '[0.41, -0.22, ...]'::vector AS distance
FROM documents
WHERE category = 'support'
ORDER BY embedding <=> '[0.41, -0.22, ...]'::vector
LIMIT 10;
The <=> operator is pgvector's cosine distance operator. With the IVFFlat index active and ivfflat.probes set, the query uses the IVF two-stage search rather than exhaustive comparison. The WHERE category = 'support' filter is applied during the fine search stage within each cluster.
According to Maurício Maia's pgvector IVFFlat guide, keep the number of lists as low as the application's recall requirements allow, or alternatively raise probes as high as its latency budget allows. For datasets up to one million rows, lists = rows / 1000. For datasets above one million rows, lists = sqrt(rows).
Training Data Requirements and Common Mistakes
The quality of an IVF index depends entirely on the quality of k-means clustering, which depends on the quality and representativeness of training data.
Training data must be representative. If your collection has ten semantic categories (customer support, engineering, legal, finance, etc.) but your training data only covers three, the centroids will be poorly placed for the other seven. Queries in underrepresented regions will have low recall because nearby vectors are assigned to distant centroids.
import numpy as np
import faiss
def check_cluster_balance(index: faiss.IndexIVFFlat, n_vectors: int) -> dict:
"""
Analyze cluster size distribution to detect imbalanced training.
Well-trained IVF indexes should have roughly similar cluster sizes.
Large variance indicates poor training data or non-uniform data distribution.
"""
# Get the size of each inverted list
list_sizes = [index.invlists.list_size(i) for i in range(index.nlist)]
sizes = np.array(list_sizes)
expected = n_vectors / index.nlist
return {
"expected_per_cluster": expected,
"mean_size": float(sizes.mean()),
"min_size": int(sizes.min()),
"max_size": int(sizes.max()),
"std_dev": float(sizes.std()),
"cv": float(sizes.std() / sizes.mean()), # coefficient of variation
"empty_clusters": int((sizes == 0).sum()),
"oversized": int((sizes > 3 * expected).sum()),
}
stats = check_cluster_balance(ivf, n_vectors=len(corpus))
print(f"Cluster balance stats:")
print(f" Expected per cluster: {stats['expected_per_cluster']:.0f}")
print(f" Mean: {stats['mean_size']:.0f}")
print(f" Min / Max: {stats['min_size']} / {stats['max_size']}")
print(f" Coefficient of variation: {stats['cv']:.3f}")
print(f" Empty clusters: {stats['empty_clusters']}")
print(f" Oversized clusters: {stats['oversized']}")
# Healthy index: CV < 0.5, zero empty clusters, no extreme outliers.
# Unhealthy index: CV > 1.0, empty clusters, or clusters 10x the expected size.
# Fix: retrain with more representative and larger training data.
FAISS minimum training requirement. FAISS warns when given fewer than 39 * nlist training vectors. In practice, 100 * nlist or more produces better cluster quality. For nlist=1024, that is roughly 40,000 minimum and 100,000 recommended.
Retraining after major data changes. If more than 20 to 30 percent of your corpus changes, the centroid positions learned on the old distribution become suboptimal for the new distribution. The practical fix is to schedule periodic retraining during low-traffic windows, rebuild the index on the new corpus, and swap atomically.
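The "20 to 30 percent" rule can be turned into simple bookkeeping (a hypothetical sketch; the threshold and what counts as a change are judgment calls for your deployment):

```python
class RetrainTracker:
    """Flag the index for retraining once cumulative corpus churn
    (adds plus deletes) crosses a fraction of the indexed corpus."""

    def __init__(self, corpus_size: int, threshold: float = 0.25):
        self.corpus_size = corpus_size
        self.changed = 0
        self.threshold = threshold

    def record(self, n_added: int = 0, n_deleted: int = 0) -> None:
        self.changed += n_added + n_deleted

    def needs_retrain(self) -> bool:
        # True once churn reaches the threshold fraction of the corpus.
        return self.changed >= self.threshold * max(self.corpus_size, 1)

tracker = RetrainTracker(1_000_000)
tracker.record(n_added=100_000)
print(tracker.needs_retrain())  # False — only 10% churn so far
```

When `needs_retrain()` fires, the safe pattern is the one described above: retrain on the new corpus in the background during a low-traffic window, then swap the index atomically.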
Summary
IVF partitions the vector space into nlist clusters using k-means. Each cluster has a centroid. Each vector is assigned to its nearest centroid and stored in that centroid's inverted list. At query time, the query is compared to all nlist centroids (coarse search) and the nprobe nearest clusters are searched exhaustively (fine search).
The two key parameters are nlist (set at build time, controls cluster granularity, recommended starting point is sqrt(n)) and nprobe (set at query time, controls how many clusters are searched, directly controls the recall-latency tradeoff).
The cluster boundary problem is the fundamental recall limitation: queries near cluster boundaries need higher nprobe to cover neighboring clusters. Production nprobe should be calibrated against recall measurements on a representative query sample, paying attention to the distribution of per-query recall rather than just the mean.
IVFFlat stores full float32 vectors. IVF-PQ adds Product Quantization compression for 32 to 64 times memory reduction, enabling billion-scale deployments. IVF as a family handles selective metadata filters more gracefully than HNSW and is better suited to GPU-accelerated workloads. HNSW achieves higher recall at the same latency for moderate-scale workloads that fit in RAM.
The complete algorithm comparison with HNSW is in the HNSW article. The PQ compression mechanism is in the Product Quantization article. The full database architecture that surrounds both indexes is the How Vector Databases Work Internally pillar.
Sources and Further Reading
- Milvus Blog. How to Choose Between IVF and HNSW for ANN Vector Search. milvus.io/blog/understanding-ivf-vector-index
- VeloDB. Inverted File Index (IVF) in Vector Similarity Search. velodb.io/glossary/inverted-file-index-in-vector-similarity-search
- Zilliz AI FAQ. How Do IVF Indexes Work in Vector Databases? zilliz.com/ai-faq/how-do-inverted-file-ivf-indexes-work
- Pinecone. Nearest Neighbor Indexes for Similarity Search. pinecone.io/learn/series/faiss/vector-indexes
- Pinecone. Product Quantization: Compressing High-Dimensional Vectors. pinecone.io/learn/series/faiss/product-quantization
- Apache Doris. IVF Index Documentation. doris.apache.org/docs/dev/ai/vector-search/ivf
- APXML. Inverted File Index (IVF) Variations. apxml.com/courses/advanced-vector-search-llms/chapter-1-ann-algorithms/ivf-variations
- NVIDIA Technical Blog. Accelerated Vector Search: Approximating with NVIDIA cuVS IVF-Flat. developer.nvidia.com/blog/accelerated-vector-search-approximating-with-nvidia-cuvs-ivf-flat
- RAPIDS cuVS. IVF-Flat Index Documentation. docs.rapids.ai/api/cuvs/stable/indexes/ivfflat
- Maurício Maia. Optimizing PostgreSQL pgvector with IVFFlat Indexing. medium.com/@mauricio/optimizing-ivfflat-indexing-with-pgvector-in-postgresql
- TigerData. What Are IVFFlat Indexes in pgvector and How Do They Work? tigerdata.com/blog/nearest-neighbor-indexes-what-are-ivfflat-indexes-in-pgvector
- Ahmad Jawabreh. Inverted File Indexing (IVF) in FAISS: A Comprehensive Guide. medium.com/@Jawabreh0/inverted-file-indexing-ivf-in-faiss
- Siddharth Jain. Deep Dive into FAISS IndexIVFPQ for Vector Search. blog.siddjain.com/2023/12/30/deep-dive-into-faiss-indexivfpq-for-vector-search
- arXiv. Incremental IVF Index Maintenance for Streaming Vector Search. arxiv.org/abs/2411.00970
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.