Embeddings and Representation Learning: How Models Turn Words Into Math
Embeddings are how neural networks turn raw tokens into something they can actually reason about. This article covers token embeddings, positional embeddings, the evolution from Word2Vec to RoPE, and why the geometry of the vector space matters for everything downstream.
When I first looked at how token embeddings work in a transformer, I expected something complicated. It is not.
nn.Embedding in PyTorch is a weight matrix. Shape: (vocab_size, embed_dim). A forward pass is a row lookup. Token ID 42 gives you row 42 of the matrix. That row is a vector of 128 or 768 or 4096 floating-point numbers, and it gets updated by backpropagation like any other weight in the network.
The reason embeddings feel magical is not the mechanism. The mechanism is trivial. The reason is what backpropagation does to that matrix over millions of training steps. Tokens that appear in similar contexts get pulled toward similar regions of the vector space. By the end of training, the geometry of that space encodes real semantic relationships — not because anyone programmed it to, but because the loss function pushed it there.
This is Article 4 in the series on AI and ML fundamentals. Article 3 covered how neural networks and backpropagation work. Everything there applies here directly — an embedding layer is trained the same way. Article 5 will cover sequence modeling and RNNs, where these embeddings become the input to a model that processes them one step at a time. Understanding what an embedding is and what makes a good one is what you need before that.
The lookup table that learns
Start with the mechanics before the intuition.
import torch
import torch.nn as nn
vocab_size = 50_257 # GPT-2 vocabulary
embed_dim = 128 # NanoGPT scale
embedding = nn.Embedding(vocab_size, embed_dim)
# token IDs — what the tokenizer produces
token_ids = torch.tensor([15496, 995, 11]) # "Hello world !"
# lookup: each ID becomes its corresponding row vector
vectors = embedding(token_ids)
print(vectors.shape) # torch.Size([3, 128])
print(vectors[0]) # the 128-dimensional vector for token 15496

Three token IDs go in. Three 128-dimensional vectors come out. That is the entire forward pass of an embedding layer.
Before training, those vectors are random noise — initialized from a normal distribution, typically N(0, 0.02). After training on enough text, they are not random anymore. The model has shaped them into a space where nearby vectors mean related things.
# weight matrix lives in embedding.weight
print(embedding.weight.shape) # torch.Size([50257, 128])
# the embedding is just a row lookup
manual_lookup = embedding.weight[15496]
assert torch.allclose(vectors[0], manual_lookup)
# True — nn.Embedding is literally indexing a matrix

Because it is just matrix rows, backpropagation only updates the rows that were actually used in the current batch. A token that appears once in the training data gets one gradient update. A common word like "the" gets millions. This is part of why rare token embeddings are worse than common ones — they have seen fewer gradient signals.
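A quick way to see the sparse-update behavior (a toy table with made-up token IDs, not a real vocabulary):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)

# token 2 appears twice in the batch, token 7 once, the rest not at all
ids = torch.tensor([2, 7, 2])
loss = embedding(ids).sum()
loss.backward()

grad = embedding.weight.grad
print(grad[0])  # unused row: all zeros
print(grad[7])  # used once: all ones
print(grad[2])  # used twice: gradients accumulate to 2.0
```

Unused rows get exactly zero gradient, and a token that appears twice in the batch accumulates twice the signal. Scale this up to billions of tokens and the frequency gap between "the" and a rare word becomes a quality gap in their embeddings.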
What the geometry looks like after training
The classic demonstration uses Word2Vec analogies. Train on enough text and the vector arithmetic works out:
# using a pre-trained embedding model to show the geometry
# (requires: pip install gensim)
import gensim.downloader as api
model = api.load("word2vec-google-news-300")
# vector("king") - vector("man") + vector("woman") ≈ vector("queen")
result = model.most_similar(
positive=["king", "woman"],
negative=["man"],
topn=3
)
print(result)
# [('queen', 0.7118), ('monarch', 0.6189), ('princess', 0.5902)]

"King minus man plus woman equals queen." That relationship was not programmed. The model learned it from patterns in billions of words of text: kings and queens appear in similar contexts, men and women appear in similar contexts, and the offset between them is consistent.
This is what representation learning means. The model does not store facts. It learns a geometric space where relationships between concepts are encoded as directions and distances.
For LLMs, the same thing happens at a much larger scale. Research comparing classical Word2Vec embeddings with LLM-induced embeddings found that LLMs cluster semantically related words more tightly and perform better on analogy tasks, though the underlying principle is the same — meaning emerges from the geometry of the space.
The three eras of text embeddings
Understanding where we are requires knowing where embeddings came from.
Static embeddings (2013 to 2018): Word2Vec, GloVe, FastText. One vector per word, fixed regardless of context. "Bank" always maps to the same vector whether it means a riverbank or a financial institution. These were groundbreaking when they appeared. They are inadequate for modern LLMs.
Contextual embeddings (2018 onward): BERT, GPT, and everything after. The embedding for "bank" depends on the surrounding sentence because attention layers mix token representations together. The same token in different contexts produces different vectors. This is what transformers do that earlier models could not.
LLM-infused embeddings (2023 onward): the third era. Rather than being trained from scratch, the strongest embedding models today are initialized from LLM weights and trained on data synthesized by LLMs. Models like E5-Mistral, BGE, and LLM2Vec fine-tune LLaMA or Mistral backbones specifically to produce high-quality sentence-level embeddings for retrieval tasks.
For building products, the third era is where you are operating. When you call an embedding API to power semantic search or RAG, you are using a model from this era.
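To make the static-versus-contextual distinction from the second era concrete, here is a toy sketch (random untrained weights, made-up token IDs, not a real model): the static embedding row for a token is identical in both sequences, but a single attention layer mixes in the surrounding context and the representations diverge.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(100, 16)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)

BANK = 42                              # hypothetical token ID for "bank"
ctx1 = torch.tensor([[10, BANK, 20]])  # stand-in for "river bank mud"
ctx2 = torch.tensor([[55, BANK, 70]])  # stand-in for "open bank account"

x1, x2 = emb(ctx1), emb(ctx2)
out1, _ = attn(x1, x1, x1)
out2, _ = attn(x2, x2, x2)

# static lookup: the exact same vector for "bank" in both contexts
print(torch.equal(x1[0, 1], x2[0, 1]))         # True
# after attention: two different, context-dependent representations
print(torch.allclose(out1[0, 1], out2[0, 1]))  # False
```

That divergence is the entire point of contextual embeddings: the representation of a token is a function of its neighbors, not just its ID.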
Cosine similarity: measuring meaning in vector space
Once you have embedding vectors, you need a way to compare them. Cosine similarity is the standard.
import torch
import torch.nn as nn
import torch.nn.functional as F
def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()
# using a small embedding table to demonstrate
embedding = nn.Embedding(10_000, 64)
nn.init.normal_(embedding.weight, mean=0, std=0.02)
# before training: random vectors, similarity is near zero
vec_cat = embedding(torch.tensor(4149)) # token for "cat"
vec_dog = embedding(torch.tensor(3360)) # token for "dog"
vec_car = embedding(torch.tensor(3116)) # token for "car"
print(f"cat vs dog (before training): {cosine_sim(vec_cat, vec_dog):.4f}")
print(f"cat vs car (before training): {cosine_sim(vec_cat, vec_car):.4f}")
# both close to 0 — random vectors are nearly orthogonal

After training on language data, the cosine similarity between "cat" and "dog" would be much higher than between "cat" and "car" — they appear in more similar contexts. That signal is what RAG systems use when matching a user query to a relevant document chunk.
# what RAG is doing under the hood
def find_most_relevant(query_vec, doc_vecs, doc_texts):
similarities = [
(cosine_sim(query_vec, dv), text)
for dv, text in zip(doc_vecs, doc_texts)
]
return sorted(similarities, reverse=True)[0]
# real systems use approximate nearest neighbor search
# (FAISS, Hnswlib) for million-scale document collections
# but the core operation is this cosine similarity

Article 5 will cover how sequences of these vectors are processed over time. The quality of the embeddings feeding into that sequence model directly determines what patterns the model can learn.
Positional embeddings: fixing a fundamental problem
Here is something that tripped me up early. Attention has no built-in notion of order.
If you give a transformer the tokens ["The", "cat", "sat"] in that order, or shuffle them to ["sat", "The", "cat"], each token ends up with exactly the same output vector, just in a different slot. The mechanism computes dot products between all pairs of tokens and is permutation-equivariant: it does not know which one came first.
import torch
import torch.nn.functional as F
# demonstrate permutation equivariance: same tokens, different order
torch.manual_seed(42)
seq1 = torch.randn(3, 8) # ["The", "cat", "sat"]
perm = torch.tensor([2, 0, 1])
seq2 = seq1[perm] # ["sat", "The", "cat"] — shuffled
Wq, Wk, Wv = torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8)
def attention(x):
    att = F.softmax((x @ Wq) @ (x @ Wk).T / 8**0.5, dim=-1)
    return att @ (x @ Wv)
out1 = attention(seq1)
out2 = attention(seq2)
# shuffling the input just shuffles the output rows the same way
# no token's representation depends on where it sits in the sequence
assert torch.allclose(out2, out1[perm], atol=1e-5)
print("attention sees content, not order")

Positional embeddings fix this by adding position information to each token's vector before the attention layers see it. The original transformer paper used sinusoidal functions — fixed mathematical waves of different frequencies.
import torch
import math
def sinusoidal_encoding(seq_len: int, embed_dim: int) -> torch.Tensor:
pe = torch.zeros(seq_len, embed_dim)
position = torch.arange(seq_len).unsqueeze(1).float()
div_term = torch.exp(
torch.arange(0, embed_dim, 2).float()
* (-math.log(10000.0) / embed_dim)
)
pe[:, 0::2] = torch.sin(position * div_term) # even dims: sine
pe[:, 1::2] = torch.cos(position * div_term) # odd dims: cosine
return pe
pe = sinusoidal_encoding(seq_len=10, embed_dim=16)
print(pe.shape) # torch.Size([10, 16])
# each row is a unique positional fingerprint
# position 0 and position 9 have very different patterns
print("pos 0:", pe[0, :4].tolist())
print("pos 9:", pe[9, :4].tolist())

This worked well enough for the original transformer. But it does not extrapolate: a sinusoidal encoding can be computed for any position, yet positions beyond the training length are out of distribution and quality degrades. Learned positional embeddings, which GPT-2 used, are stricter still. A model trained with a 1024-entry position table has no vector at all for position 1025.
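The learned-table failure mode is easy to demonstrate (a minimal sketch with untrained weights): indexing past the table is not gradual quality loss, it is an error.

```python
import torch
import torch.nn as nn

max_trained_len = 1024
pos_emb = nn.Embedding(max_trained_len, 16)  # learned position table

positions = torch.arange(1030)               # six positions too long
try:
    pos_emb(positions)
except IndexError:
    print("position 1024+ has no row in the table: hard failure")
```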
RoPE: what modern LLMs actually use
LLaMA, Mistral, Qwen, DeepSeek — almost every major open LLM uses RoPE (Rotary Position Embedding). Understanding why it replaced learned and sinusoidal encodings matters.
The core insight: instead of adding position information to the token vectors, RoPE rotates the query and key vectors by an angle that depends on their position. When you then compute the dot product q · k, the result depends on the relative distance between the two tokens — not their absolute positions.
import torch
import math
def apply_rope(x: torch.Tensor, position: int) -> torch.Tensor:
"""
Apply RoPE to a single vector at a given position.
x: (embed_dim,) — must have even dim
"""
d = x.shape[-1]
device = x.device
# rotation angles — different frequency for each pair of dimensions
theta = torch.tensor([
1.0 / (10000 ** (2 * i / d))
for i in range(d // 2)
], device=device)
angles = position * theta # scale by position
# rotate each (x[2i], x[2i+1]) pair by angle theta_i
cos = torch.cos(angles)
sin = torch.sin(angles)
x_pairs = x.view(-1, 2) # group into pairs
x1, x2 = x_pairs[:, 0], x_pairs[:, 1]
rotated = torch.stack([
x1 * cos - x2 * sin,
x1 * sin + x2 * cos
], dim=1).view(-1)
return rotated
# token at position 0 vs position 5 get different rotations
embed_dim = 8
vec = torch.randn(embed_dim)
at_pos_0 = apply_rope(vec, position=0)
at_pos_5 = apply_rope(vec, position=5)
# the angle between them encodes relative distance
similarity = torch.nn.functional.cosine_similarity(
at_pos_0.unsqueeze(0), at_pos_5.unsqueeze(0)
)
print(f"cosine similarity between pos 0 and pos 5: {similarity.item():.4f}")
# will be less than 1 — they have been rotated away from each other

Why does this matter? RoPE encodes absolute position as a rotation, and because rotations compose, the dot product between a rotated query and a rotated key depends only on the relative distance between the two tokens. That relative formulation generalizes better to longer sequences than learned positional embeddings, which memorize specific position indices and fail completely beyond their training length.
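You can check the relative-position property numerically. This sketch reimplements the pairwise rotation so it is self-contained, then shifts both positions by the same offset:

```python
import torch

def rope(x: torch.Tensor, pos: int) -> torch.Tensor:
    # rotate consecutive (even, odd) pairs by position-dependent angles
    d = x.shape[-1]
    theta = 10000.0 ** (-2 * torch.arange(d // 2).float() / d)
    ang = pos * theta
    x1, x2 = x.view(-1, 2).unbind(-1)
    return torch.stack([x1 * torch.cos(ang) - x2 * torch.sin(ang),
                        x1 * torch.sin(ang) + x2 * torch.cos(ang)],
                       dim=-1).view(-1)

torch.manual_seed(0)
q, k = torch.randn(8), torch.randn(8)

# query at position 3, key at position 7: relative distance 4
s1 = rope(q, 3) @ rope(k, 7)
# shift both by 100 (positions 103 and 107): same relative distance
s2 = rope(q, 103) @ rope(k, 107)
print(torch.allclose(s1, s2, atol=1e-4))  # True: only the gap matters
```

Shifting every position by the same amount leaves every attention score unchanged, which is exactly why RoPE-based models tolerate position offsets that learned tables cannot.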
Both RoPE and ALiBi share a design philosophy: positional information and semantic information are different things and should not be mixed into the same vectors. Both inject position inside the attention computation (RoPE by rotating queries and keys, ALiBi by biasing the scores) rather than adding it to the token embeddings.
Here is how RoPE looks in practice inside a transformer layer using PyTorch's built-in support:
import torch
import torch.nn as nn
class RoPEAttention(nn.Module):
"""Simplified single-head attention with RoPE positional encoding."""
def __init__(self, embed_dim: int, max_seq_len: int = 2048):
super().__init__()
self.embed_dim = embed_dim
self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)
self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
self.v_proj = nn.Linear(embed_dim, embed_dim, bias=False)
self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)
# precompute RoPE frequencies
theta = 1.0 / (10000 ** (
torch.arange(0, embed_dim, 2).float() / embed_dim
))
positions = torch.arange(max_seq_len).float()
freqs = torch.outer(positions, theta) # (max_seq_len, dim/2)
self.register_buffer("cos_cache", freqs.cos())
self.register_buffer("sin_cache", freqs.sin())
def rotate_half(self, x: torch.Tensor) -> torch.Tensor:
x1, x2 = x[..., : x.shape[-1] // 2], x[..., x.shape[-1] // 2 :]
return torch.cat([-x2, x1], dim=-1)
def apply_rope(self, x: torch.Tensor, seq_len: int) -> torch.Tensor:
cos = self.cos_cache[:seq_len].unsqueeze(0) # (1, T, dim/2)
sin = self.sin_cache[:seq_len].unsqueeze(0)
cos = torch.cat([cos, cos], dim=-1) # (1, T, dim)
sin = torch.cat([sin, sin], dim=-1)
return x * cos + self.rotate_half(x) * sin
def forward(self, x: torch.Tensor) -> torch.Tensor:
B, T, C = x.shape
q = self.q_proj(x)
k = self.k_proj(x)
v = self.v_proj(x)
# apply RoPE to queries and keys only — not values
q = self.apply_rope(q, T)
k = self.apply_rope(k, T)
att = (q @ k.transpose(-2, -1)) * (C ** -0.5)
att = torch.nn.functional.softmax(att, dim=-1)
return self.out_proj(att @ v)
# test
attn = RoPEAttention(embed_dim=64)
x = torch.randn(2, 16, 64) # batch=2, seq_len=16, dim=64
out = attn(x)
print(out.shape) # torch.Size([2, 16, 64])

The values do not get RoPE applied — only queries and keys. This is intentional. RoPE is about how tokens relate to each other (captured in attention scores), not about what information each token carries (captured in values).
Learned vs RoPE vs ALiBi: which to use
The three main approaches you will encounter when reading transformer code:
| Method | How it works | Used in | Context extrapolation |
|---|---|---|---|
| Learned (absolute) | Trainable vector per position, added to token embeddings | Original BERT, GPT-2 | Poor — fails beyond training length |
| Sinusoidal | Fixed sin/cos waves, added to token embeddings | Original transformer (2017) | Limited |
| RoPE | Rotates Q and K vectors by position-dependent angle | LLaMA, Mistral, Qwen, DeepSeek | Good, especially with YaRN scaling |
| ALiBi | Subtracts linear penalty from attention scores based on token distance | MPT, older Bloom-style models | Excellent extrapolation, fewer parameters |
Several modern LLMs, including Qwen and DeepSeek models, are fine-tuned with YaRN to expand their context length. YaRN is an extension of RoPE's frequency scaling that lets a model trained on 4K-token sequences handle 128K or more without retraining from scratch.
If you are reading the LLaMA or Mistral source code and see apply_rotary_emb, that is RoPE. If you see attention scores being modified with a distance-based penalty matrix, that is ALiBi.
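ALiBi's penalty is simple enough to write out in a few lines. A sketch for a single head with an illustrative slope (real models use a geometric series of slopes across heads):

```python
import torch

T, slope = 5, 0.5  # toy sequence length and an illustrative head slope
pos = torch.arange(T)
# distance from each query position to each earlier key position
dist = (pos.unsqueeze(1) - pos.unsqueeze(0)).clamp(min=0).float()
bias = -slope * dist  # farther keys get larger penalties

print(bias)
# attention becomes: softmax(q @ k.T / sqrt(d) + bias + causal_mask)
# no positional vectors anywhere; distance lives in the scores
```

Because the penalty is a fixed linear function of distance, it applies just as well at sequence lengths never seen in training, which is where ALiBi's extrapolation advantage comes from.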
The full embedding pipeline in a transformer
Putting it together: what actually happens between "token ID" and "the model's first attention layer."
import torch
import torch.nn as nn
class TransformerEmbeddings(nn.Module):
"""
The embedding block that sits at the front of every transformer.
Handles both token embeddings and positional information.
"""
def __init__(
self,
vocab_size: int,
embed_dim: int,
max_seq_len: int,
dropout: float = 0.1
):
super().__init__()
# token embedding: the big lookup table
self.token_emb = nn.Embedding(vocab_size, embed_dim)
# positional embedding: learned, one vector per position
# (in real LLMs this would be RoPE applied in attention instead)
self.pos_emb = nn.Embedding(max_seq_len, embed_dim)
self.dropout = nn.Dropout(dropout)
self.ln = nn.LayerNorm(embed_dim)
def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
B, T = token_ids.shape
# token embeddings: row lookup
tok = self.token_emb(token_ids) # (B, T, embed_dim)
# positional embeddings: position 0, 1, 2, ... T-1
pos = torch.arange(T, device=token_ids.device)
pos = self.pos_emb(pos) # (T, embed_dim)
# add them together — this is where "the token is at position N" gets encoded
x = tok + pos # broadcasting handles batch dim
return self.dropout(self.ln(x)) # normalize + dropout
# numbers from NanoGPT config
emb = TransformerEmbeddings(
vocab_size=50_257,
embed_dim=128,
max_seq_len=256
)
token_ids = torch.randint(0, 50_257, (4, 32)) # batch=4, seq_len=32
out = emb(token_ids)
print(out.shape) # torch.Size([4, 32, 128])
# parameter count
total = sum(p.numel() for p in emb.parameters())
print(f"embedding parameters: {total:,}")
# ≈ 50,257 * 128 + 256 * 128 + 2 * 128 (LayerNorm) = 6,465,920

About 6.5 million parameters just for the embeddings in a NanoGPT-scale model. In GPT-2 Small, with embed_dim=768 and vocab_size=50,257, the token embedding alone has 38.6 million parameters — about a third of the entire 124M-parameter model.
Weight tying — sharing weights between the token embedding and the final output projection — halves this cost and consistently improves performance. It is in every serious LLM codebase.
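In code, weight tying is one line. A sketch at NanoGPT scale, following the pattern used in nanoGPT-style codebases:

```python
import torch.nn as nn

vocab_size, embed_dim = 50_257, 128
token_emb = nn.Embedding(vocab_size, embed_dim)
lm_head = nn.Linear(embed_dim, vocab_size, bias=False)

# tie: the output projection reuses the embedding matrix
# (both are shape (vocab_size, embed_dim), so the same tensor fits both)
lm_head.weight = token_emb.weight
assert lm_head.weight is token_emb.weight

# one shared matrix instead of two
print(f"shared parameters: {token_emb.weight.numel():,}")  # 6,432,896
```

Gradients from both the input lookup and the output projection now flow into the same tensor, so the shared matrix is trained from both directions at once.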
Why embedding quality directly affects Article 5
Article 5 covers sequence modeling and RNNs. The specific problem it addresses is: how do you process a sequence of these embedding vectors over time, maintaining memory of what came earlier?
The quality of the embeddings feeding into that sequence model determines what the model can learn. If "cat" and "dog" are close in embedding space and "cat" and "skyscraper" are far, the sequence model has useful geometric structure to work with. If all embeddings are random noise, the sequence model has to learn meaning from scratch — it cannot.
This is why modern LLMs with good pre-trained embeddings transfer so well. The embedding space encodes general knowledge about how language works. Fine-tuning on a specific task then shapes the rest of the network around that already-structured space.
Autoregressive LLMs are not inherently optimized for producing high-quality representations because they are trained with a next-token prediction objective, not a sequence-level semantic objective. This is why specialized embedding models exist alongside generation models, and why you generally should not use a GPT-style model's hidden states directly for retrieval without some fine-tuning.
For text generation and reasoning tasks, use the LLM's own token embeddings — they are trained end-to-end with the rest of the model. For retrieval, semantic search, and RAG, use a dedicated embedding model (BGE, E5, Nomic, or an API like OpenAI's text-embedding-3-small). The objectives are different and the models are optimized differently.
Next in the series
Article 5 covers sequence modeling and RNNs. You will see how a sequence of embedding vectors gets processed one step at a time, why maintaining memory across long sequences is hard, and why the vanishing gradient problem that Article 3 introduced becomes a serious engineering problem at scale. The reason transformers with attention replaced RNNs — which Article 6 covers — only makes sense after you understand what RNNs were trying to do and where they broke down.
Krunal Kanojiya
Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.