
Vector Database in RAG: How It Works, Which One to Pick (2026 Guide)

The vector database is the retrieval engine behind every RAG system. This guide covers HNSW indexing, how cosine similarity works, pgvector vs purpose-built databases, real cost numbers at scale, and a decision framework for picking between Pinecone, Qdrant, Weaviate, Milvus, and Chroma in 2026.

Krunal Kanojiya · May 06, 2026
#vector-database #rag #pinecone #qdrant #weaviate #milvus #hnsw #pgvector #embeddings #retrieval-augmented-generation

I once watched a team spend two weeks tuning their chunking strategy and embedding model on a RAG system that was returning slow, imprecise results at scale. They ran every benchmark they could find on chunk sizes and similarity thresholds. The system still underperformed.

The actual problem was the vector database. They had started with Chroma during development, which runs in-memory and is excellent for prototyping, and never swapped it out before going to production with 15 million vectors. Chroma was never built to operate at that scale. Every retrieval call was slow, and p99 latency was breaking their SLA.

Migrating to Qdrant took one afternoon. Problem solved.

The vector database is not an interchangeable commodity. It is the retrieval engine. Its indexing structure, its support for hybrid search, its filtering performance, and its cost curve at scale all directly determine whether the retrieval step in your RAG pipeline works.

Why a Regular Database Cannot Do This

A relational database like Postgres looks for exact matches. You give it a query, and it scans indexed columns for rows where the value equals what you asked for. It is phenomenally fast at this because it can use B-tree indexes to jump directly to the matching rows.

Vector search is a different problem entirely. You have a query vector — a list of 1,536 or 3,072 floating-point numbers representing the semantic meaning of a user's question. You want to find the stored vectors that are most similar to it. There is no exact match. There is no equality check. There is a distance calculation across a high-dimensional space.

A brute-force approach scans every vector in the database and computes the distance to the query vector. This works at small scale. At one million vectors with 1,536 dimensions each, a brute-force scan computes 1.536 billion floating-point operations per query. At ten million vectors, it is 15.36 billion. That does not fit inside any user-facing latency budget.
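
To see the shape of that cost, here is a minimal NumPy sketch of a brute-force scan. The corpus size is scaled down so it runs on a laptop; the arithmetic scales linearly from there.

python
import numpy as np

# Illustrative scale: 100k vectors at 1,536 dimensions.
# At 1M vectors this matrix alone occupies ~6 GB as float32.
N, D = 100_000, 1536
corpus = np.random.rand(N, D).astype(np.float32)
query = np.random.rand(D).astype(np.float32)

# One full scan computes N * D multiply-adds per query:
# ~1.5 billion at 1M vectors, ~15 billion at 10M.
scores = corpus @ query
top_10 = np.argsort(scores)[-10:][::-1]   # indices of the 10 most similar vectors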

Vector databases solve this with approximate nearest neighbor (ANN) indexes, and specifically with an algorithm called HNSW.

How HNSW Works

HNSW stands for Hierarchical Navigable Small World. It is the indexing algorithm used by Qdrant, Weaviate, Pinecone, and pgvector. Understanding it is not required to use a vector database, but it explains why retrieval is fast, what the tuning parameters mean, and where recall degrades.

The core idea is a layered proximity graph. Every vector in the database becomes a node. Nodes are connected to nearby vectors by edges. The graph has multiple layers.

plaintext
HNSW Layer Structure
+------------------------------------------------+
|  Layer 2 (sparse — fewest nodes, long edges)   |
|      A -------- E                              |
|                                                |
|  Layer 1 (medium density)                      |
|      A --- C --- E --- G                       |
|                                                |
|  Layer 0 (dense — all nodes, short edges)      |
|      A - B - C - D - E - F - G - H - I        |
+------------------------------------------------+

Search: start at Layer 2, navigate greedily toward query vector.
Drop to Layer 1 at local minimum. Repeat. Drop to Layer 0. Finish.

HNSW builds a multi-layered graph where each layer represents the dataset with varying degrees of abstraction. The top layer has the fewest nodes and long-range edges. A search starts at the top, navigates greedily toward the query vector by moving to whichever neighbor is closest, and then drops down to the next layer when it hits a local minimum. By the time it reaches the bottom layer, which contains all nodes and short-range edges, it has already narrowed its position in the graph significantly.

This delivers near-logarithmic search complexity — search time grows slowly as the number of vectors grows, not linearly. Exact brute-force search is O(N). HNSW is approximately O(log N). At 10 million vectors, that is the difference between milliseconds and seconds per query.

Two tuning parameters matter for production use. M controls how many edges each node has in the graph. Higher M improves recall but uses more memory and slows index builds. ef_construction controls how many candidates are evaluated when building the index. Higher values produce a better-quality graph at the cost of longer index build time. At query time, ef_search controls how many candidates are explored — a higher value trades query speed for recall accuracy.

The practical defaults work for most RAG systems. You only need to tune these if your evaluation shows recall below 0.95 or if memory costs are a constraint.
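
When tuning is warranted, the parameters map directly onto configuration. Here is a minimal sketch using the Qdrant Python client against a local instance; the parameter values are illustrative, not recommendations.

python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff, SearchParams

client = QdrantClient(url="http://localhost:6333")

# Build-time parameters: more edges per node (m) and more candidates
# during construction (ef_construct) raise recall at the cost of
# memory and index build time. Values here are illustrative.
client.create_collection(
    collection_name="tuned_collection",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=32, ef_construct=200),
)

# Query-time parameter: hnsw_ef widens the candidate pool explored
# per query, trading latency for recall.
results = client.search(
    collection_name="tuned_collection",
    query_vector=[0.0] * 1536,                 # placeholder query embedding
    search_params=SearchParams(hnsw_ef=128),
    limit=10,
)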

Cosine Similarity vs Dot Product

Every time a user asks a question, that question gets embedded into a vector, and the vector database computes a similarity score between that query vector and each candidate vector it considers. Two distance metrics dominate RAG applications.

Cosine similarity measures the angle between two vectors. It ignores the magnitude of the vectors and looks only at their direction in the embedding space. A score of 1.0 means the vectors point in exactly the same direction. A score of 0 means they are perpendicular. A score of -1.0 means they point in opposite directions. For text embeddings, what matters is the direction — two chunks about refund policies should point in a similar semantic direction regardless of how long they are.

Dot product combines angle and magnitude. For normalized unit vectors (where the length is always 1.0), cosine similarity and dot product produce identical results. OpenAI normalizes all its embedding outputs to unit length, which makes this distinction irrelevant for those models. For embedding models that do not normalize, cosine similarity is the safer default for semantic search.

L2 (Euclidean) distance measures the straight-line distance between two points. It is rarely the right choice for text RAG because it is sensitive to vector magnitude, and text embedding magnitude does not correlate with semantic relevance.
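
A small worked example makes the relationships concrete. The vectors below are toy 3-dimensional stand-ins for real embeddings:

python
import numpy as np

a = np.array([0.3, 0.8, 0.5])    # toy "embedding"
b = np.array([0.6, 1.6, 1.0])    # same direction as a, twice the magnitude

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
l2 = np.linalg.norm(a - b)

print(cosine)   # 1.0  -- identical direction, magnitude ignored
print(dot)      # 1.96 -- inflated by b's larger magnitude
print(l2)       # 0.99 -- nonzero despite identical direction

# Normalize to unit length and the distinction disappears:
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(a_n @ b_n)  # 1.0 -- dot product of unit vectors equals cosine similarity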

When creating collections in any vector database, specify the distance metric explicitly. Mixing models that produce normalized vectors with L2 distance, or using unnormalized vectors with dot product, produces meaningless similarity scores.

The Vector Database Landscape in 2026

The field consolidated to eight production-grade options by 2026, organized across four tiers: managed leaders, open-source primaries, embedded and Postgres-integrated options, and large-scale deployments.

plaintext
Tier             Databases                  When to use
-----------------------------------------------------------------------
Managed          Pinecone, Vertex AI        Zero ops overhead,
                 Vector Search              strict SLAs, GCP-native

Open-source      Qdrant, Weaviate, Milvus   Performance + control,
primary                                     self-hosted or managed

Embedded /       Chroma, pgvector           Prototyping, Postgres
Postgres                                    integration, under 10M vectors

Large-scale      Milvus, Vespa              100M to billions of vectors,
hybrid                                      GPU acceleration needed
-----------------------------------------------------------------------

Here is a detailed look at each primary option.

Pinecone

Pinecone is the fully managed vector database. You do not run infrastructure. You do not tune HNSW parameters. You call an API and vectors go in, queries come back.

Pinecone is the right choice when managed infrastructure is the priority and the team values not operating their own systems. It is particularly strong for teams building customer-facing AI with strict SLAs, and for organizations without platform engineering capacity.

The cost curve is the main limitation. At 10 million vectors, Pinecone Serverless costs roughly $70 per month, which is competitive. At 100 million vectors, the cost can reach $700 or more per month, while self-hosted alternatives stay under $100 per month. Teams that start on Pinecone for convenience frequently revisit the decision when they hit 50 million vectors.
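
The workflow itself is minimal. A sketch with the Pinecone Python SDK; the API key, index name, cloud, and region below are placeholders.

python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")   # placeholder key

# Create a serverless index: no capacity planning, no HNSW parameters exposed
pc.create_index(
    name="rag-knowledge-base",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("rag-knowledge-base")

# Upsert vectors with metadata attached
index.upsert(vectors=[
    {"id": "chunk-1", "values": [0.1] * 1536, "metadata": {"source": "handbook"}},
])

# Query with a metadata filter
results = index.query(
    vector=[0.1] * 1536,                     # placeholder query embedding
    top_k=10,
    filter={"source": {"$eq": "handbook"}},
    include_metadata=True,
)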

Qdrant

Qdrant is written in Rust and leads open-source vector databases on latency. At 10 million vectors, p99 latency typically lands around 12ms. Weaviate at the same scale runs around 16ms. Milvus runs around 18ms.

Beyond raw speed, Qdrant has strong payload filtering — you can filter by metadata at query time without a post-retrieval pass. It supports native hybrid search through named vector collections combining dense and sparse vectors. Quantization support (int8 and binary) reduces memory footprint significantly without much recall degradation.

Qdrant Cloud starts at $25 per month with a free tier of 1GB, no credit card required. Self-hosting is free under Apache 2.0. For most teams that need performance and data privacy without Pinecone's cost curve, Qdrant is the default choice.

python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)
import uuid

client = QdrantClient(url="http://localhost:6333")

# Create a collection with cosine similarity
client.create_collection(
    collection_name="rag_knowledge_base",
    vectors_config=VectorParams(
        size=1536,           # dimensions from text-embedding-3-small
        distance=Distance.COSINE
    )
)

# Index a batch of document chunks
def index_chunks(chunks: list[dict], embeddings: list[list[float]]):
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embedding,
            payload={
                "text": chunk["text"],
                "source": chunk["source"],
                "page": chunk["page"],
                "section": chunk["section"],
                "created_at": chunk["created_at"]
            }
        )
        for chunk, embedding in zip(chunks, embeddings)
    ]
    client.upsert(
        collection_name="rag_knowledge_base",
        points=points
    )

# Query with metadata filtering
def retrieve(query_vector: list[float], doc_type: str | None = None, top_k: int = 10):
    # Optional: filter by document type at query time
    query_filter = None
    if doc_type:
        query_filter = Filter(
            must=[FieldCondition(
                key="source",
                match=MatchValue(value=doc_type)
            )]
        )

    results = client.search(
        collection_name="rag_knowledge_base",
        query_vector=query_vector,
        query_filter=query_filter,
        limit=top_k,
        with_payload=True
    )
    return results

Weaviate

Weaviate has the most mature native hybrid search implementation in 2026. It combines BM25 with dense vector search through BlockMax WAND and Relative Score Fusion natively, without requiring a separate sparse index. The built-in vectorization modules let you insert raw text and have Weaviate call an embedding API automatically, though this trades control over the embedding pipeline for convenience.

For queries like "find cases semantically similar to this clause AND mentioning this specific statute," Weaviate's hybrid implementation is the strongest native option available in 2026. It is the right choice when hybrid search quality is the primary concern.

Weaviate consumes more memory and CPU than Qdrant at equivalent scale. Below 50 million vectors it runs efficiently. Above 100 million, capacity planning needs care. Weaviate Cloud starts at $25 per month after a 14-day trial — the shortest trial period among major options.

python
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()

# Create collection with hybrid search enabled
client.collections.create(
    name="CompanyDocs",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small"
    ),
    properties=[
        wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="source", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="section", data_type=wvc.config.DataType.TEXT),
    ]
)

collection = client.collections.get("CompanyDocs")

# Hybrid search: dense + BM25 combined natively
results = collection.query.hybrid(
    query="refund policy for enterprise customers",
    alpha=0.5,      # 0 = pure BM25, 1 = pure vector, 0.5 = balanced
    limit=10,
    return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True)
)

for obj in results.objects:
    print(f"Score: {obj.metadata.score:.4f}")
    print(f"Text: {obj.properties['text'][:150]}")
    print()

client.close()

The alpha parameter in Weaviate's hybrid search controls the balance between dense and sparse retrieval. Alpha of 0.5 gives equal weight to both. For queries involving product names, error codes, or specific identifiers, lower alpha toward 0.3 to weight BM25 more heavily. For broad conceptual questions, raise it toward 0.7 to weight semantic similarity more.

Milvus

Milvus is built for scale that makes other options struggle. Where Qdrant and Weaviate handle millions to tens of millions of vectors comfortably, Milvus is routinely deployed at hundreds of millions to billions of vectors by search companies, e-commerce platforms, and genomics research organizations.

Milvus 2.5, released in December 2024, introduced native full-text search with Sparse-BM25 technology. In internal benchmarks against Elasticsearch on 1 million vectors, Milvus 2.5 returned results in 6ms versus Elasticsearch's 200ms, a roughly 30-fold improvement, with unified storage eliminating the need for separate keyword and vector infrastructure. Milvus 2.6, generally available on Zilliz Cloud in early 2026, added hot and cold tiering for cost-efficient archival.

The trade-off is operational complexity. Running Milvus in production means running Kafka, MinIO, etcd, and the Milvus coordination plane together. The Kubernetes operator simplifies this, but it is still more moving parts than Qdrant or Pinecone. For most RAG systems under 100 million vectors, Milvus is over-engineered. For billion-scale deployments, it is the right answer.
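
For a sense of the developer-facing API, as opposed to the operational footprint, here is a minimal sketch using pymilvus's MilvusClient against a standalone instance. The URI, collection name, and data are placeholders.

python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")   # standalone instance

# Quick-start collection: Milvus creates a default schema with an id field
client.create_collection(
    collection_name="rag_chunks",
    dimension=1536,
    metric_type="COSINE",
)

# Insert a chunk; extra keys like "text" land in the dynamic field
client.insert(
    collection_name="rag_chunks",
    data=[{"id": 1, "vector": [0.1] * 1536, "text": "refund policy chunk"}],
)

results = client.search(
    collection_name="rag_chunks",
    data=[[0.1] * 1536],          # placeholder query embedding
    limit=10,
    output_fields=["text"],
)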

pgvector

pgvector is the right default for roughly 70% of AI-agent workloads under 10 million vectors. If the team already runs Postgres and the corpus is modest, adding a separate vector database system creates operational overhead that the scale does not justify.

pgvector stores embeddings as a column type in Postgres and adds HNSW indexing through an extension. Vectors and relational metadata live in the same transaction boundary. Queries can join vector similarity results with structured filters in a single SQL statement. No separate infrastructure to run.

Recent benchmarks show pgvectorscale achieving 471 QPS at 99% recall on 50 million vectors, which challenges the assumption that extensions always lose to purpose-built systems at mid-scale. Beyond 50 to 100 million vectors, purpose-built databases do pull ahead on throughput and latency ceilings.

sql
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table with a vector column
CREATE TABLE document_chunks (
    id          SERIAL PRIMARY KEY,
    content     TEXT NOT NULL,
    source      TEXT,
    page_num    INTEGER,
    section     TEXT,
    created_at  TIMESTAMP DEFAULT NOW(),
    embedding   vector(1536)   -- dimensions must match your embedding model
);

-- Create an HNSW index for approximate nearest neighbor search
-- m=16 is the number of edges per node (higher = better recall, more memory)
-- ef_construction=64 is candidates evaluated during index build
CREATE INDEX ON document_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Semantic search query
SELECT
    content,
    source,
    page_num,
    1 - (embedding <=> '[your_query_vector_here]') AS similarity
FROM document_chunks
WHERE source = 'product_manual'       -- metadata filter
ORDER BY embedding <=> '[your_query_vector_here]'
LIMIT 10;

Chroma

Chroma is the fastest path from zero to a working vector search for prototyping. It runs in-process as an embedded database — no server, no Docker, no configuration. It has a clean Python API and integrates directly with LangChain and LlamaIndex. For proof-of-concept work and development, nothing is faster to get running.

Chroma does not support native hybrid search. It has not proven itself at extreme scale. For any production system above a few hundred thousand vectors with real query volume, plan to migrate to Qdrant, Weaviate, or Pinecone before launch, not after.
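
A minimal sketch of that prototyping workflow; the collection name and documents are illustrative.

python
import chromadb

client = chromadb.Client()   # in-process, nothing to deploy

collection = client.create_collection(name="prototype_docs")

# Chroma embeds documents with a default local model if no embeddings are supplied
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "Refunds for enterprise customers are processed within 30 days.",
        "The free tier includes 1,000 API calls per month.",
    ],
    metadatas=[{"source": "policy"}, {"source": "pricing"}],
)

results = collection.query(
    query_texts=["enterprise refund policy"],
    n_results=2,
)
print(results["documents"])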

Real Cost Numbers at Scale

The gap between vendor pricing pages and actual production bills typically runs 2.5 to 4 times. Teams budget for storage and queries. They forget re-indexing during embedding model migrations, metadata storage costs, and the compute overhead of running hybrid search at production throughput.

plaintext
Monthly cost at 10M vectors
---------------------------------------------------------------------------
pgvector on AWS RDS          ~$45/month
Qdrant Cloud                 ~$65/month
Pinecone Serverless          ~$70/month
Weaviate Cloud               ~$135/month
Self-hosted Qdrant (EC2)     ~$30-50/month (just compute)
---------------------------------------------------------------------------

Monthly cost at 100M vectors
---------------------------------------------------------------------------
Self-hosted Milvus/pgvector  under $100/month
Self-hosted Qdrant           ~$100-250/month
Qdrant Cloud                 ~$250/month
Weaviate Cloud               ~$300/month
Pinecone Serverless          $700+/month
---------------------------------------------------------------------------

At 100 million vectors, Pinecone can cost 3 to 5 times what self-hosted Qdrant or Milvus costs. The break-even point on investing in a platform engineering team to run self-hosted infrastructure at that scale is usually within a few months.

One cost that most comparisons miss: when you update your embedding model, every vector in the database needs to be re-embedded and re-indexed. On a managed service, this often means running two indexes in parallel during migration, doubling storage costs for the migration period. Self-hosted databases allow rolling rebuilds, but the compute cost is still real. Budget for at least one full re-index per year.

The Decision Framework

Work through these in order.

plaintext
START
  |
  v
Does your team already run Postgres and have under 10M vectors?
  |-- Yes --> Use pgvector. Add a dedicated vector DB only if
  |           you hit scale or recall quality limits.
  |-- No  --> Continue.
  |
  v
Is managed infrastructure the priority over cost optimization?
  |-- Yes --> Use Pinecone. Watch the cost at 50M+ vectors.
  |-- No  --> Continue.
  |
  v
Is hybrid search quality the primary requirement?
  |-- Yes --> Use Weaviate (strongest native hybrid in 2026).
  |-- No  --> Continue.
  |
  v
Is scale above 100M vectors or GPU acceleration needed?
  |-- Yes --> Use Milvus / Zilliz Cloud.
  |-- No  --> Use Qdrant (best open-source performance-per-dollar).

For typical RAG use cases — company documentation, support knowledge bases, internal knowledge assistants — the answer is almost always pgvector under 10 million vectors, and Qdrant or Weaviate above that. Pinecone is right when the team cannot afford to operate infrastructure. Milvus is right when scale is genuinely at hundreds of millions of vectors.

Metadata Filtering Is Not Optional

Every chunk in your vector database should carry metadata alongside its embedding vector. Source document name, page number, section header, document type, creation date, and any domain-specific tags. This metadata enables filtered retrieval.

A query about "refund policy changes" scoped to document_type: policy retrieves a dramatically smaller and more relevant candidate set than searching the full knowledge base. A query about a specific product version scoped to product_version: 2.4 avoids returning chunks from earlier documentation that is no longer accurate.

Adding metadata filtering is the cheapest recall improvement available after hybrid search. Qdrant and Weaviate apply metadata filters at query time before the HNSW traversal, not after — which means filtering reduces the candidate pool before similarity is even computed, improving both speed and relevance.
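
As a sketch of the pattern in Qdrant, reusing the collection from the Qdrant example above (the payload keys document_type and product_version are hypothetical; any indexed payload field works):

python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

# Both conditions are applied during the HNSW traversal, not after it
scoped_filter = Filter(must=[
    FieldCondition(key="document_type", match=MatchValue(value="policy")),
    FieldCondition(key="product_version", match=MatchValue(value="2.4")),
])

results = client.search(
    collection_name="rag_knowledge_base",
    query_vector=[0.0] * 1536,   # placeholder query embedding
    query_filter=scoped_filter,
    limit=10,
)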

Where the Vector Database Sits in the Bigger Picture

The vector database is one layer of the RAG architecture. It sits between the embedding model and the reranker. Its job is to take a query vector and return the most relevant candidate chunks from your indexed corpus, fast enough to fit inside a user-facing latency budget.

Retrieval quality has a ceiling set by chunking and embedding quality. Even a perfectly tuned HNSW index cannot rescue poorly chunked documents or a mismatched embedding model. For how chunking and embedding selection interact with retrieval, see RAG Architecture Explained and How Embeddings Work in RAG.

For what happens when retrieval returns the wrong chunks despite a well-configured vector database — because it does happen — read Why RAG Fails. And for how the retrieval layer connects to the generation decision of whether to build RAG at all, see RAG vs Fine-Tuning.

If you are starting from scratch on a new RAG project, start with the overview in What Is RAG in AI and come back here once you are ready to choose the storage layer.
