What Is Semantic Search? How It Works Step by Step
A research-backed, step-by-step guide to semantic search. Learn how it differs from keyword search, how the full pipeline works from chunking to reranking, what makes it fail, and how to build a working semantic search system with Python code examples.
A user types "why does my app crash when the network is slow" into a support portal. The document that answers their question is titled "Handling connection timeouts in mobile applications." It contains the words "timeout," "mobile," and "connection" but not "crash," "app," or "slow."
A keyword search returns nothing useful. A semantic search returns the exact document as the top result.
That gap is what semantic search solves. It is not a minor improvement over keyword search. It is a different model of what retrieval should do: find content that satisfies the intent of the query, not content that happens to share words with it.
This article covers what semantic search is, how the entire pipeline works from document ingestion to ranked results, where it fails and why, and how to build a working semantic search system in Python. It sits at the center of the Vector Database Fundamentals series, connecting the foundational concepts in vectors, embeddings, and dense vs sparse representations to the practical architecture of vector databases and retrieval augmented generation.
What Semantic Search Actually Is
Semantic search is a retrieval method that finds content based on meaning and intent rather than lexical overlap. It achieves this by representing both the query and the documents as vectors in a shared embedding space, then returning documents whose vectors are closest to the query vector.
According to Google Cloud's semantic search documentation, semantic search is a data searching technique that focuses on understanding the contextual meaning and intent behind a user's search query, rather than only matching keywords.
The key word is contextual. Two sentences can share zero words and still describe the same thing. Semantic search captures that relationship because both sentences, when passed through the same embedding model, land at nearby coordinates in the vector space.
Query: "why does my app crash when the network is slow"
Document: "Handling connection timeouts in mobile applications"
Keyword overlap: 0 shared terms
Cosine similarity: 0.78 (highly similar)

That similarity score comes from the geometry of the learned embedding space, not from any explicit synonym mapping. The model learned from training data that "crash," "app," and "slow network" appear in the same context as "timeout," "mobile," and "connection."
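The similarity number is just the cosine of the angle between the two vectors. The following toy sketch uses hand-picked 3-dimensional vectors to show the computation; real embeddings have hundreds of dimensions, and the 0.78 above comes from an actual model, not these made-up values.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional stand-ins for real embeddings.
query_vec = [0.9, 0.3, 0.1]    # "app crash on slow network"
doc_vec = [0.8, 0.4, 0.2]      # "connection timeouts in mobile"
unrelated = [0.1, -0.2, 0.95]  # an unrelated document

print(round(cosine_similarity(query_vec, doc_vec), 3))   # 0.984
print(round(cosine_similarity(query_vec, unrelated), 3))  # 0.134
```

Vectors pointing in similar directions score near 1.0; unrelated directions score near 0.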
The Full Pipeline: From Raw Documents to Search Results
Semantic search is not a single operation. It is a pipeline with distinct offline and online phases. Understanding each step is what separates teams who build reliable systems from teams who struggle with poor retrieval quality.
Phase 1: Offline Indexing
This phase runs once (and reruns when new documents arrive). Its output is a populated vector database ready to receive queries.
Step 1.1: Document Ingestion and Cleaning
Raw documents arrive in various formats: PDFs, HTML pages, Markdown files, database rows, API responses. Before embedding, each document needs to be extracted into clean text.
```python
import PyPDF2
import re

def extract_text_from_pdf(pdf_path: str) -> str:
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        text = " ".join(page.extract_text() or "" for page in reader.pages)
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

raw_text = extract_text_from_pdf("product-manual.pdf")
```

According to Parallel.ai's semantic search implementation guide, your semantic search system will only be as accurate as the data you index. Removing duplicates, fixing formatting issues, and filtering out low-quality content is not optional housekeeping. It directly determines retrieval quality.
Step 1.2: Chunking
A 50-page document cannot be embedded as a single vector. The embedding model has a token limit (typically 512 to 8192 tokens depending on the model), and even if the document fits, a single vector for 50 pages would average together so many topics that the resulting embedding would be too diffuse to retrieve precisely.
Chunking splits documents into smaller segments, each of which gets its own embedding. The goal is segments that are small enough to embed precisely but large enough to contain a complete idea.
According to Elastic's chunking strategies documentation, LLMs have a limited number of tokens they can take as context, and it is much more efficient to send only relevant chunks rather than a whole document.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_document(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """
    Split text into overlapping chunks.
    chunk_size: target number of characters per chunk
    overlap: characters shared between adjacent chunks
             (prevents answers from being cut at a boundary)
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    return splitter.split_text(text)

chunks = chunk_document(raw_text)
print(f"Document split into {len(chunks)} chunks")
# Document split into 47 chunks
```

The overlap is important. If a key sentence straddles a chunk boundary, having it appear in both the preceding and following chunk ensures at least one embedding captures it intact.
Chunking strategies in practice:
Fixed-size chunking splits on character or token count with overlap. It is fast and predictable but may cut sentences mid-thought.
Sentence chunking splits at sentence boundaries. It preserves semantic units but produces variable-length chunks.
Semantic chunking uses an embedding model to detect where the topic shifts and splits at those boundaries. It produces the most coherent chunks but requires embedding computation during indexing.
For most teams starting out, recursive character splitting with 300 to 500 token chunks and 15 percent overlap is the right default. Tuning chunk size is one of the highest-leverage optimization steps once you have baseline retrieval working.
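Sentence chunking from the list above can be sketched in a few lines of plain Python. The regex boundary detection here is deliberately naive and stands in for a proper sentence tokenizer such as nltk or spaCy; the greedy packing logic is the part that matters.

```python
import re

def sentence_chunk(text: str, max_chars: int = 400) -> list[str]:
    """Greedy sentence chunking: split at sentence boundaries, then pack
    consecutive sentences into chunks of at most max_chars characters,
    so no sentence is ever cut mid-thought."""
    # Naive boundary detection: split after ., !, or ? followed by space.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

text = "Timeouts occur on slow networks. Set a retry policy. " * 20
chunks = sentence_chunk(text)
print(len(chunks), max(len(c) for c in chunks))
```

Unlike fixed-size splitting, every chunk here ends on a sentence boundary, at the cost of variable chunk lengths.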
Step 1.3: Embedding Each Chunk
Each chunk is passed through an embedding model to produce a dense vector. The choice of embedding model determines what "similarity" means for your system.
```python
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_chunks(chunks: list[str], batch_size: int = 64) -> list[list[float]]:
    embeddings = []
    for i in tqdm(range(0, len(chunks), batch_size)):
        batch = chunks[i : i + batch_size]
        batch_embeddings = model.encode(batch, normalize_embeddings=True)
        embeddings.extend(batch_embeddings.tolist())
    return embeddings

chunk_embeddings = embed_chunks(chunks)
print(f"Produced {len(chunk_embeddings)} embeddings of dimension {len(chunk_embeddings[0])}")
# Produced 47 embeddings of dimension 384
```

Batching is essential for performance. Embedding one chunk at a time is roughly 10 to 30 times slower than batching due to GPU underutilization.
The normalize_embeddings=True flag normalizes each vector to unit length (L2 norm = 1). This makes cosine similarity equivalent to a dot product, which is faster to compute and is the expected input format for most ANN indexes.
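A quick way to convince yourself of that equivalence, as a standalone numpy check rather than part of the pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Full cosine similarity on the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize to unit length first; then a plain dot product gives the same value
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
dot = a_unit @ b_unit

assert abs(cosine - dot) < 1e-9
print(f"cosine={cosine:.6f} dot={dot:.6f}")
```

Because normalization happens once at indexing time, every subsequent query saves the two norm computations per comparison.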
Step 1.4: Storing in a Vector Database
Each embedding is stored in a vector database alongside its original text and metadata. The database builds an ANN index over the stored vectors to enable fast similarity search at query time.
```python
from qdrant_client import QdrantClient, models
import uuid

client = QdrantClient(":memory:")  # Use host/port for persistent deployments

COLLECTION = "product-docs"
DIMENSIONS = 384  # matches all-MiniLM-L6-v2 output

client.create_collection(
    collection_name=COLLECTION,
    vectors_config=models.VectorParams(
        size=DIMENSIONS,
        distance=models.Distance.COSINE,
    ),
)

points = [
    models.PointStruct(
        id=str(uuid.uuid4()),
        vector=embedding,
        payload={
            "text": chunk,
            "source": "product-manual.pdf",
            "chunk_index": i,
        },
    )
    for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings))
]

client.upsert(collection_name=COLLECTION, points=points)
print(f"Indexed {len(points)} chunks into Qdrant")
```

The metadata payload (source file, chunk index, timestamps, document category) is what enables filtered search later. Without metadata, every query searches the entire index. With it, you can restrict searches to a specific document, date range, or category.
Phase 2: Online Query
This phase runs on every user query. Speed matters here. Users expect sub-second responses.
Step 2.1: Query Embedding
The user's query is passed through the same embedding model used during indexing. This is non-negotiable. Embeddings from different models live in incompatible spaces and cannot be compared.
```python
def embed_query(query: str) -> list[float]:
    vector = model.encode([query], normalize_embeddings=True)
    return vector[0].tolist()

query = "why does my app crash when the network is slow"
query_vector = embed_query(query)
```

The query embedding is typically computed in under 10ms for a sentence-length input on a modern CPU.
Step 2.2: ANN Search
The query vector is sent to the vector database, which uses its ANN index to find the top-K most similar stored vectors without comparing against every document.
```python
def semantic_search(query: str, top_k: int = 5) -> list[dict]:
    query_vector = embed_query(query)
    results = client.search(
        collection_name=COLLECTION,
        query_vector=query_vector,
        limit=top_k,
        with_payload=True,
        score_threshold=0.4,  # filter out low-confidence results
    )
    return [
        {
            "score": hit.score,
            "text": hit.payload["text"],
            "source": hit.payload["source"],
            "chunk_index": hit.payload["chunk_index"],
        }
        for hit in results
    ]

results = semantic_search("why does my app crash when the network is slow")
for r in results:
    print(f"Score {r['score']:.4f}: {r['text'][:80]}...")
```

The ANN index (HNSW in Qdrant's case) is what makes this fast at scale. Without it, finding the top-K vectors in a collection of one million documents would require one million cosine similarity computations. HNSW reduces that to a few hundred comparisons by traversing a layered graph structure. The mechanics are covered in the why traditional indexes fail for vector search article.
Step 2.3: Reranking (Optional but High-Impact)
ANN search optimizes for speed. The initial results are highly relevant on average but not perfectly ordered. A reranker takes the top 20 to 100 candidates from the first stage and rescores them using a more powerful model that reads the query and each candidate together.
Bi-encoders (used in ANN search) process query and document independently and compare their outputs. Cross-encoders read the query and document together and output a single relevance score that captures their interaction directly.
According to ZeroEntropy's reranking guide, Databricks research shows reranking can improve retrieval quality by up to 48 percent. The three-stage pipeline of BM25 plus dense retrieval plus reranking maximizes both recall and precision.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_results(query: str, candidates: list[dict], top_n: int = 3) -> list[dict]:
    """
    Rerank retrieval candidates using a cross-encoder.
    Cross-encoders read query and document together for higher accuracy.
    """
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    for candidate, score in zip(candidates, scores):
        candidate["rerank_score"] = float(score)
    reranked = sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)
    return reranked[:top_n]

# Get 20 candidates from ANN search, rerank to top 3
candidates = semantic_search(query, top_k=20)
final_results = rerank_results(query, candidates, top_n=3)
for r in final_results:
    print(f"Rerank score {r['rerank_score']:.4f}: {r['text'][:80]}...")
```

Reranking adds 50 to 200ms of latency depending on the model size and number of candidates. For most RAG applications where generation latency already exceeds one second, this cost is negligible relative to the precision gain.
Semantic Search vs Keyword Search: The Honest Comparison
The two approaches fail in opposite directions. Understanding those failure modes is more useful than declaring one superior.
Property              | Keyword Search             | Semantic Search
----------------------+----------------------------+---------------------------
Match basis           | Exact word overlap         | Vector similarity
Handles synonyms      | No (without synonym lists) | Yes
Handles paraphrases   | No                         | Yes
Handles exact strings | Yes (precisely)            | Poorly
Handles rare terms    | Yes (exact match)          | Poorly (OOV terms)
Index type            | Inverted index             | ANN vector index
Interpretability      | High (shows matches)       | Low (opaque scores)
Setup complexity      | Low                        | Higher
Latency               | Sub-millisecond            | 10 to 100ms per query

According to Redis's semantic vs keyword search comparison, semantic search excels when users express intent in natural language, while keyword search delivers precision for exact identifiers and compliance scenarios.
According to Unstructured's retrieval analysis, keyword search fails when the document uses different terminology than the query, when spelling varies, or when key information lives in images or tables that were not extracted into indexable text.
For most production knowledge bases and RAG systems, neither method alone is adequate. The dense vs sparse vectors article covers how hybrid search combines both into a single retrieval pipeline that consistently outperforms either method.
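A common way to combine the two result lists is Reciprocal Rank Fusion (RRF). Here is a minimal sketch with hypothetical document ids; the k=60 constant is the conventional default from the original RRF formulation.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids. Each document scores
    sum(1 / (k + rank)) over the lists it appears in, so documents ranked
    well by multiple retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: BM25 and dense retrieval disagree on order,
# but both rank "doc-b" highly, so fusion promotes it.
bm25_ranking = ["doc-a", "doc-b", "doc-c"]
dense_ranking = ["doc-b", "doc-d", "doc-a"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['doc-b', 'doc-a', 'doc-d', 'doc-c']
```

RRF needs only ranks, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.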
A Complete Semantic Search System
This is a full working example that ties every phase together: ingestion, chunking, embedding, indexing, search, and reranking.
```python
import uuid
from sentence_transformers import SentenceTransformer, CrossEncoder
from qdrant_client import QdrantClient, models

###############################################################################
# Configuration
###############################################################################
EMBED_MODEL = "all-MiniLM-L6-v2"
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
COLLECTION = "knowledge-base"
DIMENSIONS = 384
CHUNK_SIZE = 400
CHUNK_OVERLAP = 80

###############################################################################
# Models
###############################################################################
embedder = SentenceTransformer(EMBED_MODEL)
reranker = CrossEncoder(RERANK_MODEL)

db = QdrantClient(":memory:")
db.create_collection(
    collection_name=COLLECTION,
    vectors_config=models.VectorParams(size=DIMENSIONS, distance=models.Distance.COSINE),
)

###############################################################################
# Utility functions
###############################################################################
def chunk_text(text: str) -> list[str]:
    """Simple fixed-size chunking with overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + CHUNK_SIZE, len(text))
        chunks.append(text[start:end].strip())
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return [c for c in chunks if len(c) > 50]  # drop tiny trailing chunks

def embed(texts: list[str]) -> list[list[float]]:
    return embedder.encode(texts, normalize_embeddings=True).tolist()

###############################################################################
# Indexing
###############################################################################
def index_document(text: str, doc_id: str, metadata: dict | None = None) -> int:
    chunks = chunk_text(text)
    vectors = embed(chunks)
    points = [
        models.PointStruct(
            id=str(uuid.uuid4()),
            vector=vec,
            payload={"text": chunk, "doc_id": doc_id, **(metadata or {})},
        )
        for chunk, vec in zip(chunks, vectors)
    ]
    db.upsert(collection_name=COLLECTION, points=points)
    return len(points)

###############################################################################
# Search
###############################################################################
def search(query: str, top_k: int = 10, rerank_n: int = 3) -> list[dict]:
    query_vec = embed([query])[0]
    hits = db.search(
        collection_name=COLLECTION,
        query_vector=query_vec,
        limit=top_k,
        with_payload=True,
        score_threshold=0.30,
    )
    candidates = [
        {"text": h.payload["text"], "doc_id": h.payload["doc_id"], "score": h.score}
        for h in hits
    ]
    if not candidates:
        return []
    # Rerank candidates with the cross-encoder
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    candidates.sort(key=lambda x: x["rerank_score"], reverse=True)
    return candidates[:rerank_n]

###############################################################################
# Demo
###############################################################################
documents = {
    "refund-policy": """
        Our refund policy allows customers to return any product within 30 days
        of purchase for a full refund. Digital downloads are non-refundable.
        Refund requests must include the original order number.
        Processing takes 5 to 7 business days after approval.
    """,
    "shipping-guide": """
        Standard shipping takes 3 to 5 business days.
        Express shipping delivers within 1 to 2 business days at additional cost.
        Free shipping is available on orders above 50 dollars.
        International orders may be subject to customs delays.
    """,
    "timeout-errors": """
        Connection timeout errors occur when the server does not respond
        within the expected time window. This commonly happens on slow or
        unreliable network connections. Set a reasonable timeout value in your
        HTTP client and implement retry logic with exponential backoff.
    """,
}

for doc_id, text in documents.items():
    n = index_document(text, doc_id)
    print(f"Indexed '{doc_id}': {n} chunks")
print()

queries = [
    "how do I get my money back from a purchase",
    "why does my app crash when the network is slow",
    "how long does delivery take",
]

for query in queries:
    results = search(query, top_k=10, rerank_n=1)
    print(f"Query: {query}")
    if results:
        print(f"  Best match ({results[0]['doc_id']}): {results[0]['text'][:80].strip()}...")
    print()

# Output:
# Query: how do I get my money back from a purchase
# Best match (refund-policy): Our refund policy allows customers to return any product...
#
# Query: why does my app crash when the network is slow
# Best match (timeout-errors): Connection timeout errors occur when the server does not...
#
# Query: how long does delivery take
# Best match (shipping-guide): Standard shipping takes 3 to 5 business days...
```

Every query succeeds without a single shared keyword between query and the relevant document. "Money back" matches "refund policy." "App crash" and "network slow" match "timeout errors." "How long does delivery take" matches "shipping." The embedding geometry is doing all the work.
What Makes Semantic Search Fail
Semantic search is not universally better than keyword search. It has specific, predictable failure modes.
Chunk size too large. If a chunk contains three unrelated topics, its embedding averages the semantics of all three. The resulting vector is pulled in multiple directions and may not be close to any specific query. Smaller, focused chunks produce sharper embeddings.
Exact identifiers. The string "PROD-SKU-7842X" is not in any embedding model's training data. The model assigns it an embedding near similar-looking strings, which may be completely wrong. BM25 handles this correctly because it matches exact tokens. This is the core argument for hybrid search.
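For contrast, a minimal inverted index shows why exact-token retrieval handles such identifiers trivially. This is a toy lookup, not a real BM25 implementation; there is no term weighting, only token membership.

```python
from collections import defaultdict

# Minimal inverted index: token -> set of document ids. An exact string
# like "PROD-SKU-7842X" is just another token, so lookup is trivial.
docs = {
    "doc-1": "Replacement parts for PROD-SKU-7842X ship within two days",
    "doc-2": "General troubleshooting steps for all product lines",
}

index: dict[str, set[str]] = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

print(index["prod-sku-7842x"])  # {'doc-1'}: an exact hit, no embedding needed
```

The identifier either matches or it does not; there is no embedding geometry to get wrong.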
Very short queries. A single word gives the embedding model almost no context. "Python" could refer to the programming language, the snake, or Monty Python. The embedding lands somewhere in the middle of all three, which produces poor results. Query expansion or hybrid search helps here.
Out-of-vocabulary technical terms. New product names, internal acronyms, and recently coined terminology may appear rarely or never in training data. The embedding is unreliable. A combination of keyword search for exact matching and semantic search for meaning-based queries handles this correctly.
Model mismatch. Using a general-purpose embedding model for a highly specialized domain (medical literature, legal documents, financial instruments) produces lower-quality embeddings than a domain-fine-tuned model. According to Sparkco's sentence transformer guide, for domain-specific data, fine-tuning on your own corpus can significantly enhance embedding quality.
According to Unstructured's vector embeddings analysis, chunking determines the granularity of retrieval. Overly large chunks dilute meaning, and overly small chunks drop the context the model needs to disambiguate.
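The dilution effect from oversized chunks is easy to see geometrically. In this toy numpy sketch, three orthogonal unit vectors stand in for embeddings of three unrelated topics sharing one chunk:

```python
import numpy as np

# Orthogonal "topic" directions standing in for topic embeddings.
refunds = np.array([1.0, 0.0, 0.0])
shipping = np.array([0.0, 1.0, 0.0])
timeouts = np.array([0.0, 0.0, 1.0])

# An oversized chunk covering all three topics: its embedding is
# (roughly) the average of the topic directions, renormalized.
big_chunk = (refunds + shipping + timeouts) / 3
big_chunk /= np.linalg.norm(big_chunk)

query = refunds  # a query squarely about refunds
print(round(float(query @ refunds), 3))    # focused chunk: 1.0
print(round(float(query @ big_chunk), 3))  # diluted chunk: 0.577
```

The diluted chunk scores 0.577 against a query it fully answers, while a focused chunk scores 1.0. Real embeddings are not this clean, but the pull-in-multiple-directions effect is the same.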
Semantic Search in Production: Real Systems
Semantic search is no longer experimental. It powers retrieval in systems that operate at massive scale.
LinkedIn's job and people search runs a full semantic search stack combining GPU-accelerated exhaustive retrieval, LLM-supervised embedding models, and a small language model reranker under strict latency and throughput constraints. The system processes billions of queries against billion-scale indexes.
Azure AI Search benchmarked hybrid search plus semantic reranking against pure vector search across four customer datasets. Hybrid with reranking consistently outperformed pure vector search across all document types and industry verticals.
Elasticsearch's ELSER adds a learned sparse retriever on top of traditional BM25 infrastructure, giving teams the option to use semantic retrieval without replacing their existing Elasticsearch deployment.
Semantic Search as the Memory Layer for LLMs
The most important application of semantic search in 2025 and 2026 is as the retrieval component of RAG systems. Large language models have a knowledge cutoff and cannot access private data. Semantic search bridges the gap by finding relevant documents from a private knowledge base before the LLM generates a response.
According to RapidSearch's semantic search analysis, semantic search is the engine behind Retrieval-Augmented Generation (RAG), which solves one of the biggest problems with LLMs: hallucinations. In a RAG workflow, the semantic search system first retrieves the most relevant factual data from a private knowledge base. This retrieved context is fed into the LLM alongside the original question, grounding the response in accurate information.
User: "What is our policy on remote work in different time zones?"
Semantic Search:
Embed query → [0.41, -0.22, ..., 0.88]
ANN search in HR policy database
Retrieve: "Employees in different time zones are expected to maintain
a minimum 4-hour overlap with their team's core hours..."
Rerank top 5 candidates → top 2 selected
LLM receives:
Context: [retrieved policy text]
Question: "What is our policy on remote work in different time zones?"
→ Grounded, accurate, specific answer

The vector database article covers how this full RAG architecture is assembled, including indexing pipelines, metadata filtering, and the complete retrieval-generation loop.
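The final hand-off to the LLM is just string assembly. A sketch of the prompt-building step follows; the template wording is illustrative (the exact instructions are a design choice, not a standard).

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble the grounded prompt sent to the LLM: retrieved context
    first, then the user's question, with instructions to stay grounded."""
    context = "\n\n".join(
        f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

chunks = [
    "Employees in different time zones are expected to maintain a "
    "minimum 4-hour overlap with their team's core hours.",
]
prompt = build_rag_prompt(
    "What is our policy on remote work in different time zones?", chunks
)
print(prompt[:60])
```

The "say so" instruction is what lets the model decline instead of hallucinating when retrieval comes back empty or off-topic.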
The Query Pipeline in One Diagram
OFFLINE (runs once during indexing)
────────────────────────────────────────────────────────────────
Raw documents (PDF, HTML, Markdown, database)
↓
Text extraction and cleaning
↓
Chunking (300 to 500 tokens with 15% overlap)
↓
Embedding model (bi-encoder)
↓
Dense vector per chunk + Sparse vector per chunk (optional BM25)
↓
Vector database (HNSW index for dense, inverted index for sparse)
ONLINE (runs on every query, target < 200ms)
────────────────────────────────────────────────────────────────
User query
↓
Query embedding (same model as indexing, < 10ms)
↓
ANN search in vector database (HNSW traversal, < 50ms)
↓
Optional: BM25 keyword search in parallel
↓
Optional: RRF fusion of dense + sparse results
↓
Optional: Cross-encoder reranker on top 20 to 100 candidates (< 200ms)
↓
Top K most relevant chunks
↓
LLM context OR Search results UI

Each stage in this pipeline can be tuned independently. Swap the embedding model, change the chunk size, adjust the reranker model, or add metadata filters without redesigning the entire system.
Choosing an Embedding Model for Semantic Search
The embedding model is the single most important configuration decision in a semantic search system. It defines what similarity means. Two sentences close in the embedding space are retrieved together regardless of what they actually say.
For English text at moderate scale, all-MiniLM-L6-v2 from Sentence Transformers is a strong starting point: 384 dimensions, fast on CPU, good quality. For higher quality at the cost of more compute, all-mpnet-base-v2 (768 dimensions) is preferred. For production RAG at scale with an API budget, OpenAI's text-embedding-3-small (1536 dimensions) consistently ranks among the top performers on semantic benchmarks.
The embeddings article covers the full model comparison table including multilingual options, multimodal models, and domain-specific fine-tuned models.
Summary
Semantic search converts both queries and documents into embedding vectors and returns documents whose vectors are closest to the query vector. It handles synonyms, paraphrases, and natural language intent that keyword search cannot.
The pipeline has two phases. Offline indexing extracts text, splits it into chunks, embeds each chunk, and stores the vectors in a vector database. Online querying embeds the user's query, runs ANN search to find the nearest vectors, and optionally reranks the top candidates using a cross-encoder.
Semantic search fails on exact identifiers, very short queries, out-of-vocabulary terms, and oversized chunks. Hybrid search combining semantic retrieval with BM25 addresses the most critical of those failure modes, which is covered in detail in dense vs sparse vectors.
The downstream destination for the retrieved chunks is the LLM in a RAG pipeline. The infrastructure that makes this work at scale is the vector database, which stores millions of embeddings and returns the nearest ones in milliseconds using ANN indexing.
Sources and Further Reading
- Google Cloud. What Is Semantic Search and How Does It Work? cloud.google.com/discover/what-is-semantic-search
- Redis. Semantic Search vs. Keyword Search: When to Use Each. redis.io/blog/semantic-search-vs-keyword-search
- Elastic. Chunking Strategies for Semantic Search in Elasticsearch. elastic.co/search-labs/blog/chunking-strategies-elasticsearch
- Unstructured. Semantic Search vs. Keyword Search: Key Differences. unstructured.io/insights/semantic-vs-keyword-search-key-differences-for-ai-data
- Unstructured. How Vector Embeddings Improve Search Relevance. unstructured.io/insights/vector-embeddings-the-key-to-better-search-relevance
- Parallel.ai. What Is Semantic Search and How Does It Work? parallel.ai/articles/what-is-semantic-search
- Meilisearch. What Is Semantic Search and How Does It Work? meilisearch.com/blog/semantic-search
- RapidSearch. Semantic Search: How It Works, Why It Matters. rapidsearch.app/blog/semantic-search
- TechTarget. What Is Semantic Search and How Does It Work? techtarget.com/searchenterpriseai/definition/semantic-search
- Elastic. Semantic Reranking Documentation. elastic.co/docs/solutions/search/ranking/semantic-reranking
- ZeroEntropy. Ultimate Guide to Choosing the Best Reranking Model in 2026. zeroentropy.dev/articles/ultimate-guide-to-choosing-the-best-reranking-model-in-2025
- Microsoft Azure. Azure AI Search: Outperforming Vector Search with Hybrid Retrieval and Reranking. techcommunity.microsoft.com/blog/azure-ai-foundry-blog/azure-ai-search-outperforming-vector-search
- LinkedIn. Semantic Search at LinkedIn. arxiv.org/pdf/2602.07309
- Couchbase. Semantic Search vs. Keyword Search: Key Differences. couchbase.com/blog/semantic-search-vs-keyword-search
- Hugging Face. Sentence Transformers Library. huggingface.co/sentence-transformers
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.