What Are Embeddings? How AI Converts Text Into Numbers
A research-backed explanation of what embeddings are in machine learning. Learn how AI models convert text, images, and audio into numerical vectors, how transformer-based embedding models work, and how embeddings power semantic search, RAG pipelines, and vector databases.
When you type a question into a search engine and it returns a relevant result even though your exact words never appear on that page, something has to bridge the gap between your words and the document's words. That something is an embedding.
Embeddings are not a recent invention. The core idea of representing words as vectors goes back to distributional semantics research in the 1950s. What changed is the quality of the representations and the scale at which they can be produced. Modern transformer models produce embeddings where the full meaning of a 500-word paragraph fits into a list of 1536 numbers, and the geometry of those numbers encodes relationships that feel almost intuitive when you examine them.
This article explains what embeddings are, how they are generated, how the mathematics works, and where they show up in the AI applications you build or use every day. It connects directly to the foundational concepts in vectors in machine learning and is the bridge to understanding vector databases, semantic search, and the difference between dense and sparse representations.
What Is an Embedding?
According to Wikipedia's machine learning embedding entry, an embedding is a representation learning technique that maps complex, high-dimensional data into a lower-dimensional space of numerical vectors; the term also refers to the resulting representation itself, in which meaningful patterns and relationships are preserved.
In practice: an embedding is a list of floating-point numbers produced by a neural network model that captures the meaning or context of an input. Two inputs that mean similar things produce numerically similar embeddings. Two inputs that are unrelated produce numerically distant embeddings.
Input: "How do I reset my password?"
Output: [0.0231, -0.1420, 0.8832, 0.0045, -0.3310, ..., 0.1192]
↑ 1536 floating-point numbers representing the meaning of that question

The individual numbers in that list do not have a human-readable interpretation. Dimension 47 does not mean "this is a question." The meaning lives in the geometry — in the distances and angles between this vector and others in the same space.
The Problem That Embeddings Solve
Before embeddings became standard, the main way to represent text in machine learning models was one-hot encoding. The concept is simple. You define a vocabulary of every unique word you expect to encounter. Each word gets a unique index. Its representation is a vector where that index is 1 and every other position is 0.
# Vocabulary: ["cat", "dog", "run", "sleep"]
# Indices: 0 1 2 3
one_hot_cat = [1, 0, 0, 0]
one_hot_dog = [0, 1, 0, 0]
one_hot_run = [0, 0, 1, 0]
one_hot_sleep = [0, 0, 0, 1]

According to Google's machine learning course on embeddings, this approach has two fundamental problems. First, it creates enormous sparse vectors. A vocabulary of 50,000 words produces 50,000-dimensional vectors that are almost entirely zeros. Second, there is no meaningful relationship between any two vectors. The distance between "cat" and "dog" is mathematically identical to the distance between "cat" and "spaceship." The encoding contains zero information about meaning.
Embeddings solve both problems simultaneously. They produce dense, lower-dimensional vectors where the geometry encodes the semantic relationships that one-hot encoding loses entirely.
# Dense embeddings for the same words (simplified to 4D for illustration)
embed_cat = [ 0.82, 0.51, -0.14, 0.33]
embed_dog = [ 0.79, 0.48, -0.11, 0.31] # close to cat
embed_run = [-0.22, 0.91, 0.44, -0.67] # far from cat
embed_sleep = [-0.18, 0.88, 0.41, -0.72] # close to run, far from cat

"Cat" and "dog" are now numerically close. "Run" and "sleep" are now close to each other and far from "cat" and "dog." That is semantic information encoded into geometry.
How a Neural Network Learns Embeddings
Embeddings are not hand-coded. They are learned from data. The network does not start with human-defined relationships between words. It starts with random numbers and adjusts them until the geometry reflects actual semantic relationships, discovered purely from co-occurrence patterns in text.
The classic example is Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013. The core idea is the distributional hypothesis: words that appear in similar contexts carry similar meanings.
Word2Vec trains a shallow neural network on one of two tasks. In the Skip-gram architecture, given a word, predict the words surrounding it. In the CBOW (Continuous Bag of Words) architecture, given the surrounding words, predict the center word. The prediction task itself is not the goal. The goal is the learned internal representation that the network develops to make those predictions accurately.
Training sentence: "The cat sat on the mat"
Skip-gram task:
Input: "sat"
Predict: ["The", "cat", "on", "the"]
After training on millions of sentences, "cat" and "dog" both appear
next to words like "pet", "fur", "vet", "feed" — so their vectors
get pushed close together in the learned vector space.

According to Serokell's Word2Vec explainer, the basic idea behind Word2Vec is to represent each word as a multi-dimensional vector where the position of the vector in that high-dimensional space captures the meaning of the word. Word2Vec takes a large corpus of text as input and generates a vector space with hundreds of dimensions.
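To make the Skip-gram setup concrete, here is a minimal sketch in plain Python (not a real Word2Vec implementation) of how (center, context) training pairs are generated from a tokenized sentence with a context window:

```python
# Sketch: generating Skip-gram (center, context) training pairs from a
# tokenized sentence. A real Word2Vec run feeds millions of such pairs
# into a shallow network; this only shows the data-preparation step.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
for center, context in skipgram_pairs(sentence):
    print(center, "->", context)
# "sat" pairs with "the", "cat", "on", "the" — exactly the
# prediction targets shown in the example above
```

Each pair becomes one training example: predict the context word from the center word.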
A well-trained Word2Vec model produces a famous result: vector arithmetic that captures semantic analogy.
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
vector("Paris") - vector("France") + vector("Italy") ≈ vector("Rome")

This is not a trick or a cherry-picked result. It demonstrates that the learned geometry encodes relational meaning. The direction from "man" to "woman" in the vector space is the same direction as from "king" to "queen." Relationships become directions.
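The mechanics of analogy lookup can be shown with toy vectors. The four-word vocabulary below is hand-made for this example, not taken from any trained model; the point is only the arithmetic-plus-nearest-neighbor procedure that analogy evaluation performs:

```python
import numpy as np

# Toy, hand-constructed vectors (NOT from a real model), chosen so that
# dimension 0 roughly tracks "royalty" and dimension 1 roughly tracks
# "gender" — just enough structure to demonstrate the lookup.
vocab = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.8, 0.3]),
    "woman": np.array([0.1, -0.8, 0.3]),
    "apple": np.array([-0.5, 0.0, 0.9]),   # unrelated distractor
}

def analogy(a, b, c):
    """Find the vocab word closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = vocab[a] - vocab[b] + vocab[c]
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # exclude the three input words, as Word2Vec evaluation conventionally does
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

print(analogy("king", "man", "woman"))  # queen
```

With real trained embeddings the same procedure works across hundreds of dimensions, where no single dimension has a nameable meaning.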
From Word Embeddings to Sentence Embeddings
Word2Vec produces one vector per word. That creates a problem for sentence-level tasks. The sentence "I went to the bank to deposit money" and "I sat on the bank of the river" use the same word "bank" but in completely different meanings. Word2Vec assigns the same vector to "bank" regardless of context.
Transformer models, starting with the BERT architecture published by Google AI in 2018, solved this by producing context-aware embeddings. Every word's representation is influenced by every other word in the sentence through the self-attention mechanism.
The next step was sentence-level embeddings. According to Pinecone's sentence transformers guide, transformers work using word or token-level embeddings, not sentence-level embeddings. Before sentence transformers, the approach to calculating accurate sentence similarity with BERT was to use a cross-encoder structure, which required passing every pair of sentences through the model together — computationally impractical at scale.
Sentence Transformers, introduced in the paper "Sentence-BERT" (2019), solved this by fine-tuning BERT to produce a single fixed-length vector for an entire sentence. The fine-tuning uses contrastive learning: pairs of semantically similar sentences are pushed together, and pairs of unrelated sentences are pushed apart.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384) — 3 sentences, each represented as 384 floats
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])

The first two sentences about weather have similarity 0.6660. The third sentence about driving is nearly unrelated to both, scoring around 0.10. The model has encoded meaning into numbers, and the numbers reflect human semantic judgment. The code above is taken directly from the Sentence Transformers documentation at Hugging Face.
How Transformer Models Produce Embeddings: Step by Step
Modern embedding models follow a pipeline from raw text to a fixed-length vector. Understanding each step helps you reason about failure modes and model selection.
Step 1: Tokenization
The input text is broken into tokens. A token is not always a full word. Models use subword tokenization (typically BPE or WordPiece) so that unknown words can be represented as combinations of known subwords.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("embeddings are fascinating")
print(tokens)
# ['em', '##bed', '##dings', 'are', 'fascinating']

"Embeddings" is split into three subword tokens. This allows the model to handle words it has never seen during training by recognizing familiar subword patterns.
Step 2: Token Embeddings
Each token gets an initial embedding vector from a lookup table. This is the learned embedding layer — a matrix where each row corresponds to a token and contains its dense vector representation. A vocabulary of 30,000 tokens with 768-dimensional embeddings requires a matrix of shape (30000, 768).
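The lookup can be sketched directly with NumPy. The vocabulary size, dimension, and random initialization below are illustrative toy values, not taken from any real model:

```python
import numpy as np

# Step 2 as a lookup table: a (vocab_size, dim) matrix with one learned
# row per token id. Tiny illustrative sizes; a real model would use
# something like (30000, 768), with rows learned during training.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
embedding_matrix = rng.normal(size=(vocab_size, dim))

token_ids = [3, 7, 3]                            # ids from the tokenizer
token_embeddings = embedding_matrix[token_ids]   # plain row indexing
print(token_embeddings.shape)                    # (3, 4)
# the same token id always selects the same row at this stage —
# context-dependence only appears later, in the attention layers
```

This is why the embedding layer is just a matrix: lookup is indexing, and training adjusts the rows.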
Step 3: Self-Attention
The transformer's self-attention mechanism allows each token to incorporate information from every other token in the sequence. The word "bank" produces a different vector depending on whether it appears next to "money" or next to "river," because the surrounding context updates its representation at every attention layer.
According to Airbyte's OpenAI embeddings guide, OpenAI embeddings use transformer-based attention mechanisms to capture context-dependent meaning, so the same word is embedded differently based on surrounding context.
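A minimal single-head sketch of the idea, assuming the token vectors themselves serve as queries, keys, and values (real transformers first multiply the input by learned Q, K, and V projection matrices, and stack many heads and layers):

```python
import numpy as np

# Simplified single-head self-attention: each output vector is a
# softmax-weighted mixture of ALL token vectors, which is how context
# flows into every position.
def self_attention(X):
    """X: (seq_len, dim) token vectors -> (seq_len, dim) contextual vectors."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                  # pairwise attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ X                             # mix every token into each output

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # three toy token vectors
out = self_attention(X)
print(out.shape)  # (3, 2) — one contextual vector per token
```

Because every row of the output mixes all input rows, the vector for "bank" ends up different depending on whether "money" or "river" sits nearby.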
Step 4: Pooling
After all attention layers, the model produces one vector per token. To get a single vector for the whole input, those token vectors are pooled. Common approaches are mean pooling (average all token vectors) and CLS token pooling (use the special [CLS] token's output, which is trained to summarize the input).
Input: "How do I reset my password?"
Tokens: [CLS] how do i reset my password ? [SEP]
After layers: v_cls v1 v2 v3 v4 v5 v6 v7 v8
Mean pooling: average(v1, v2, v3, v4, v5, v6, v7) → single 768-dim vector
CLS pooling: v_cls → single 768-dim vector

The resulting single vector is the embedding for the entire sentence.
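Mean pooling is simple enough to sketch directly. The three token vectors below are made up for illustration; a real model would exclude or mask special tokens like [CLS] and [SEP] before averaging:

```python
import numpy as np

# Mean pooling: collapse the per-token vectors into one fixed-length
# sentence vector by averaging across the sequence dimension.
token_vectors = np.array([
    [0.2, 0.8, -0.1],   # vector for token 1
    [0.5, 0.1,  0.4],   # vector for token 2
    [0.3, 0.3,  0.0],   # vector for token 3
])
sentence_vector = token_vectors.mean(axis=0)
print(sentence_vector.shape)  # (3,) — one vector, regardless of sentence length
```

The output dimension equals the model's hidden size, not the number of tokens, which is what makes embeddings of different-length inputs directly comparable.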
Calling an Embedding Model via API
For most teams building production applications, calling an embedding API is more practical than hosting a model. OpenAI's Embeddings API produces high-quality embeddings with no infrastructure to manage.
import openai
import numpy as np
client = openai.OpenAI(api_key="your-key-here")
def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding
# Embed two sentences
sentence_a = "How do I get a refund?"
sentence_b = "What is the process for returning a product?"
sentence_c = "What is the capital of France?"
emb_a = np.array(get_embedding(sentence_a))
emb_b = np.array(get_embedding(sentence_b))
emb_c = np.array(get_embedding(sentence_c))
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(f"Refund vs Return: {cosine_similarity(emb_a, emb_b):.4f}")  # high, ~0.92
print(f"Refund vs France: {cosine_similarity(emb_a, emb_c):.4f}")  # low, ~0.21

The embeddings for "refund" and "return a product" are semantically close even though they share no keywords. A traditional keyword search would fail to connect them. This is the foundation of semantic search.
Choosing an Embedding Model
The model you choose determines the quality of your similarity search. Key dimensions to consider: dimensionality, context window, whether the model is open-source or API-based, and performance on the type of content you are indexing.
Model | Dimensions | Type | Best for
--------------------------------+------------+---------------+---------------------------
text-embedding-3-small (OpenAI) | 1536 | API | General RAG, English text
text-embedding-3-large (OpenAI) | 3072 | API | High precision tasks
all-MiniLM-L6-v2 (SBERT) | 384 | Open source | Low-latency, CPU-friendly
all-mpnet-base-v2 (SBERT) | 768 | Open source | Highest quality local model
paraphrase-multilingual-mpnet | 768 | Open source | 50+ language support
text-embedding-004 (Google) | 768 | API | Vertex AI integration
embed-english-v3.0 (Cohere) | 1024 | API | Enterprise search

According to Sparkco's sentence transformer guide, choosing an appropriate model is critical for accuracy and relevance. Models like all-MiniLM-L6-v2 are versatile but may not suffice for niche applications. For domain-specific data, fine-tuning on your own corpus can significantly enhance embedding quality.
One critical rule: embeddings from different models cannot be mixed. An embedding from OpenAI's model and an embedding from Cohere live in completely different vector spaces. Comparing them produces meaningless numbers. Every document in your vector database must be embedded with the same model as your query.
Types of Embeddings
Text is not the only data type that gets embedded. The same principle, encoding meaning into a dense vector, applies to images, audio, graphs, and combinations of data types.
Word Embeddings
Word embeddings assign one vector per word. Word2Vec and GloVe are the classic models. They are fast and lightweight but context-unaware. The word "bank" has one vector regardless of whether it means a financial institution or a riverbank.
Sentence and Document Embeddings
Sentence embeddings assign one vector per sentence or paragraph, capturing the meaning of the full sequence rather than individual words. Sentence Transformers and OpenAI's embedding API both produce sentence-level embeddings. These are what most RAG pipelines use.
Image Embeddings
Vision models such as ResNet (a convolutional network) and ViT (a Vision Transformer) convert images into dense vectors. According to Labelbox's AI foundations guide, models like AlexNet, VGG, and ResNet revolutionized image processing by creating image embeddings that preserve spatial hierarchies and semantic information.
Multimodal Embeddings
Models like CLIP by OpenAI produce embeddings where text and images share the same vector space. A photo of a dog and the sentence "a golden retriever playing outside" land at similar coordinates. This enables cross-modal search: query with text, retrieve images, or query with an image, retrieve related text.
Graph Embeddings
Graph embeddings represent nodes in a knowledge graph as vectors that encode both the node's attributes and its relationships to neighboring nodes. These are common in recommendation systems and fraud detection, where the network structure itself carries meaning.
Embedding Dimensionality: How Many Numbers Do You Need?
More dimensions allow the model to capture more nuance, but come at the cost of storage, computation, and the curse of dimensionality in nearest-neighbor search. The right dimensionality depends on the task.
According to Wikipedia's embedding article, for high-dimensional vector spaces, vectors tend to converge in distance, so Euclidean distance becomes less reliable for large embedding vectors. This is why cosine similarity, which measures angle rather than absolute distance, is preferred for high-dimensional text embeddings.
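The concentration effect can be observed directly with random vectors. This sketch measures the relative spread of Euclidean distances as the dimension grows (random Gaussian points, seeded for reproducibility):

```python
import numpy as np

# Distance concentration check: as dimensionality grows, pairwise
# Euclidean distances between random points bunch up around the same
# value, so "near" and "far" become harder to tell apart.
rng = np.random.default_rng(42)
ratios = {}
for dim in (2, 100, 10_000):
    points = rng.normal(size=(200, dim))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)  # distances to point 0
    ratios[dim] = dists.std() / dists.mean()                # relative spread
    print(f"dim={dim:>6}: spread/mean = {ratios[dim]:.3f}")
# the relative spread shrinks as dim grows — one reason angle-based
# cosine similarity is preferred over raw Euclidean distance for
# high-dimensional text embeddings
```

The absolute distances still differ, but their spread relative to their mean collapses, which degrades the discriminative power of Euclidean distance in high dimensions.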
OpenAI's text-embedding-3-small and text-embedding-3-large models also support dimension reduction through the Matryoshka Representation Learning technique. You can request 256-dimensional or 512-dimensional versions of a 1536-dimensional embedding with minimal quality loss, which is useful when storage and latency matter more than maximum recall.
# Request a smaller dimension from OpenAI's API
response = client.embeddings.create(
    input="What is a vector database?",
    model="text-embedding-3-small",
    dimensions=512  # reduced from 1536 to 512
)
embedding = response.data[0].embedding
print(len(embedding))  # 512

Contextual vs Static Embeddings
The distinction between static and contextual embeddings is important for understanding why modern models outperform older ones.
A static embedding model assigns the same vector to a word regardless of context. Word2Vec, GloVe, and FastText are static. "Bank" always maps to the same vector.
A contextual embedding model produces a different vector for the same word depending on what surrounds it. BERT, GPT, and the OpenAI embedding API are contextual. "Bank" near "money" and "bank" near "river" produce different vectors.
Contextual embeddings are strictly more powerful for tasks involving polysemous words (words with multiple meanings) and nuanced sentence-level comparison. They are also more expensive to compute because the entire input sequence must be processed through multiple attention layers.
How Embeddings Connect to Vector Databases
Once you produce embeddings for a large collection of documents, you need somewhere to store and search them efficiently. A vector database is purpose-built for this: it stores the embedding vectors alongside metadata and uses approximate nearest neighbor algorithms to find the closest vectors to any query in milliseconds.
The full pipeline works as follows:
Offline Indexing Phase
───────────────────────────────────────────────────────────
Document corpus
↓
Chunk into segments (500 tokens with overlap)
↓
Embedding model (text-embedding-3-small)
↓
1536-dimensional float vector per chunk
↓
Vector database (Pinecone / Weaviate / Milvus)
↓
Stored with metadata (source, chunk index, original text)
Online Query Phase
───────────────────────────────────────────────────────────
User query
↓
Same embedding model
↓
Query vector
↓
ANN search in vector database
↓
Top K most similar chunks
↓
Passed to LLM as context
↓
Grounded, accurate response

This is the RAG (Retrieval Augmented Generation) architecture. The vector database is what makes the retrieval step fast at scale. Without it, you would need to compute the similarity between the query and every stored document on every request, which is not viable for millions of documents.
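For intuition, here is what the ANN index inside a vector database replaces: a brute-force top-k scan over every stored vector. The random vectors below are stand-ins for pre-computed embeddings; a real pipeline would embed both documents and queries with the same model:

```python
import numpy as np

# Brute-force retrieval: score the query against every stored vector.
# We plant the query right next to document 42 so the expected result
# is known in advance.
rng = np.random.default_rng(7)
doc_vectors = rng.normal(size=(1000, 64))                      # stand-in embeddings
query_vector = doc_vectors[42] + 0.01 * rng.normal(size=64)    # query near doc 42

# normalize so a single dot product equals cosine similarity
docs_n = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
query_n = query_vector / np.linalg.norm(query_vector)

scores = docs_n @ query_n               # one similarity score per document
top_k = np.argsort(scores)[::-1][:5]    # indices of the 5 closest documents
print(top_k[0])                         # 42 — the planted nearest neighbor
```

This scan is O(number of documents) per query; ANN indexes such as HNSW trade a small amount of recall for sublinear search time, which is the entire value proposition of a vector database.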
The dense vectors used in this pipeline are discussed in detail in the dense vs sparse vectors article. The retrieval step relies on the similarity concepts covered in the semantic search article.
Why Embeddings From Different Models Cannot Be Mixed
This deserves its own section because it is a common source of bugs in production systems.
Every embedding model trains its own vector space from scratch. OpenAI's text-embedding-3-small learns a 1536-dimensional space. Cohere's embed-english-v3.0 learns a 1024-dimensional space. The orientation, scale, and geometry of those spaces are completely independent. There is no transformation that reliably maps one into the other.
If you store documents embedded with one model and then query with a different model, the similarity scores are meaningless. The vectors point in incompatible directions in incompatible spaces.
The practical rule: pick one model and use it for both indexing and querying. If you change the model, re-embed your entire document corpus.
Practical Example: Semantic Deduplication
One underused application of embeddings is finding near-duplicate content in large datasets. Two support tickets that say the same thing in different words will have high cosine similarity even though they share no keywords.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
tickets = [
    "My account is locked and I can't log in.",
    "I am unable to access my account. It seems locked.",
    "How do I enable two-factor authentication?",
    "How can I turn on 2FA for my account?",
    "I want to delete my account permanently.",
]
embeddings = model.encode(tickets)
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("Near-duplicate detection:")
for i in range(len(tickets)):
    for j in range(i + 1, len(tickets)):
        sim = cosine_sim(embeddings[i], embeddings[j])
        if sim > 0.70:
            print(f"Similarity {sim:.3f}:")
            print(f"  [{i}] {tickets[i]}")
            print(f"  [{j}] {tickets[j]}")
            print()
# Output:
# Similarity 0.891:
# [0] My account is locked and I can't log in.
# [1] I am unable to access my account. It seems locked.
#
# Similarity 0.823:
# [2] How do I enable two-factor authentication?
# [3] How can I turn on 2FA for my account?

Tickets 0 and 1 are duplicates, as are 2 and 3. The embedding model found both pairs without any keyword overlap between "locked"/"unable to access" and without knowing that "2FA" is an abbreviation of "two-factor authentication."
Embeddings and the Latent Space
When a model produces an embedding, it places the input somewhere in what researchers call the latent space. This is a high-dimensional mathematical space where the coordinates are not pixel values or word counts, but learned abstract features.
According to AWS's embedding explainer, embeddings convert real-world objects into complex mathematical representations that capture inherent properties and relationships between real-world data. The entire process is automated, with AI systems self-creating embeddings during training.
The geometry of this latent space is what makes embeddings powerful. Analogies become vector arithmetic. Categories become clusters. The direction from "positive sentiment" to "negative sentiment" is a direction you can apply to any review embedding to predict its tone. This is covered in depth in the latent space article.
Summary
An embedding is a dense numerical vector produced by a neural network that encodes the meaning of its input. Words, sentences, images, and audio can all be embedded. Similar inputs produce numerically similar embeddings, so finding related content becomes a geometry problem rather than a keyword matching problem.
The generation process goes through tokenization, learned token embeddings, self-attention across the full input, and pooling to a single fixed-length vector. Modern transformer-based embedding models like OpenAI's text-embedding-3-small and the Sentence Transformers library produce contextual embeddings that handle polysemy and sentence-level meaning far better than older static approaches like Word2Vec.
Embeddings are the input to vector databases, the engine behind semantic search, and the bridge between raw unstructured data and AI applications that understand what the data means. The vector database article covers what happens after embeddings are generated and how ANN search retrieves the right ones at scale.
Sources and Further Reading
- AWS. What Is Embedding in Machine Learning? aws.amazon.com/what-is/embeddings-in-machine-learning
- Cloudflare. What Are Embeddings? cloudflare.com/learning/ai/what-are-embeddings
- Google for Developers. Embeddings — Machine Learning Crash Course. developers.google.com/machine-learning/crash-course/embeddings
- IBM. What Is Embedding? ibm.com/think/topics/embedding
- Wikipedia. Embedding (Machine Learning). en.wikipedia.org/wiki/Embedding_(machine_learning)
- Wikipedia. Word2Vec. en.wikipedia.org/wiki/Word2vec
- Pinecone. Sentence Transformers: Meanings in Disguise. pinecone.io/learn/series/nlp/sentence-embeddings
- Hugging Face. Sentence Transformers Documentation. huggingface.co/sentence-transformers
- OpenAI. Embeddings API Guide. platform.openai.com/docs/guides/embeddings
- Airbyte. OpenAI Embeddings 101. airbyte.com/data-engineering-resources/openai-embeddings
- Serokell. Word2Vec: Explanation and Examples. serokell.io/blog/word2vec
- Labelbox. AI Foundations: Understanding Embeddings. labelbox.com/guides/ai-foundations-understanding-embeddings
- Lightly.ai. Embeddings in Machine Learning: An Overview. lightly.ai/blog/embeddings
- GeeksforGeeks. Text Embeddings Using OpenAI. geeksforgeeks.org/nlp/text-embeddings-using-openai
- Mikolov et al. Distributed Representations of Words and Phrases. arxiv.org/abs/1310.4546
Krunal Kanojiya
Technical Content Writer
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.