Embeddings Demystified: Math, Meaning & Machines


“Embeddings are like whispers in a language machines can understand — quiet, dense, and surprisingly smart.”

What’s the Deal with Embeddings?

When you say “I love ice cream,” your friend gets the vibe. But a machine? Not so much.

That’s where embeddings come in. They transform human text into fixed-length numeric vectors that capture the meaning behind the words. It’s not just about words anymore — it’s about context, relationships, and even intent.

Think of embeddings as a way to place words, sentences, or documents on a giant 3D map — except this map has hundreds (or thousands) of dimensions.

"ice cream" → [0.21, -0.55, 0.88, 0.12, ...]

Every sentence gets its own unique “location.” And sentences that mean similar things? They land close together.
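Want to try it yourself? Here's a minimal sketch using the sentence-transformers library (the model name and example sentences are just illustrative choices; any embedding model works the same way):

from sentence_transformers import SentenceTransformer, util

# Load a small, general-purpose embedding model (illustrative choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I love ice cream",
    "Frozen desserts are my favorite",
    "The stock market fell today",
]
embeddings = model.encode(sentences)  # one fixed-length vector per sentence

# Sentences with similar meaning should score noticeably higher than unrelated ones.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))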

The Mathy Intuition

An embedding is just a list of numbers. But those numbers come from layers of transformation: tokens are looked up in an embedding matrix, then reshaped by the model's attention layers (the full pipeline is walked through below).

Each final embedding vector lives in a high-dimensional space (often 768–4096 dimensions). And in this space, closeness = semantic similarity.

⚙️ How It Works — Behind the Scenes

Let’s walk through how a sentence becomes an embedding:

Step 1: Tokenization

The sentence is broken into subword tokens:

"Tokyo is beautiful" → ["Tokyo", " is", " beautiful"]

Step 2: Mapping to IDs

Each token is mapped to an integer ID via a vocabulary:

["Tokyo", " is", " beautiful"] → [2031, 58, 1109]

Step 3: Embedding Lookup

Each ID is used to fetch a vector from an embedding matrix:

2031 → [0.2, -0.1, 0.5, ...]
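Under the hood, this is just an index into a big lookup table. A toy PyTorch sketch (the vocabulary size, dimension, and IDs are made up for illustration):

import torch

embedding_matrix = torch.nn.Embedding(num_embeddings=30000, embedding_dim=768)  # toy sizes
token_ids = torch.tensor([2031, 58, 1109])  # the illustrative IDs from above
vectors = embedding_matrix(token_ids)       # shape: (3, 768), one row per token
print(vectors.shape)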

Step 4: Contextualization via Transformer

These vectors pass through multiple self-attention layers. Tokens update themselves based on their neighbors. For instance, “beautiful” can learn to associate more strongly with “Tokyo.”

Of course, this isn’t always interpretable. These updates depend heavily on how the model was pre-trained. Think of this part as a black box that magically learns relationships — not with hard rules, but with statistical patterns over massive amounts of text.
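You rarely run this step by hand; the model does it internally. A hedged sketch with Hugging Face transformers, reusing the BERT example from above:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Tokyo is beautiful", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextualized vector per token (special tokens included), 768-dimensional for this model.
print(outputs.last_hidden_state.shape)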

Step 5: Aggregation

To get a single embedding for the whole sentence, we need to combine the contextualized token vectors into one fixed-length representation. This step matters because most downstream tasks (like search or classification) require just one vector.

Here are common aggregation strategies (a mean-pooling sketch follows below):

Mean pooling: average all the token vectors; a simple, strong default.
[CLS] / first-token pooling: use the vector of a special token prepended to the sequence.
Last-token pooling: common for decoder-only (GPT-style) models.
Max pooling: take the element-wise maximum across tokens.
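Mean pooling is the easiest to picture. A minimal PyTorch sketch, where the only subtle part is making sure padded positions don't drag the average down:

import torch

def mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 = real token
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # sum only over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    return summed / counts                          # (batch, dim): one vector per sentence

# Toy input: 1 sentence, 5 tokens, 768-dimensional contextualized vectors.
hidden = torch.randn(1, 5, 768)
mask = torch.ones(1, 5)
print(mean_pool(hidden, mask).shape)  # torch.Size([1, 768])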

How Do We Compare Embeddings?

Once you’ve got two embeddings, the most common similarity measure is cosine similarity:

"physician" vs. "doctor" → 0.98 (almost identical)
"banana" vs. "physician" → 0.02 (totally unrelated)

This works because embeddings “live” in a space where direction means meaning.

Let’s Talk Math (Just a Little)

Imagine two vectors:

A = [1, 2, 3], B = [2, 4, 6]

The cosine similarity is:

cos(θ) = (A · B) / (||A|| * ||B||)

Which comes out to:

(1*2 + 2*4 + 3*6) / (sqrt(14) * sqrt(56)) = 28 / sqrt(784) = 28 / 28 = 1

Meaning? B is just 2 × A, so the two vectors point in exactly the same direction. Cosine similarity ignores length and only looks at direction, so it treats them as having identical meaning.
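The same calculation in a few lines of NumPy, just to confirm the arithmetic:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([1, 2, 3])
B = np.array([2, 4, 6])
print(cosine_similarity(A, B))  # 1.0 (up to floating-point rounding)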

Why Do Embeddings Matter?

Embeddings are the foundation for a lot of smart behavior in AI systems:

Semantic search: retrieve documents by meaning instead of exact keywords.
Retrieval-augmented generation (RAG): pull in relevant context before an LLM answers.
Recommendations: suggest items whose descriptions sit near things a user already likes.
Clustering and deduplication: group or merge texts that say the same thing differently.
Classification: feed the vector into a lightweight downstream model.

And the best part? Embeddings make these tasks efficient and scalable.

Are Embeddings Learned?

Yes. During model training, the neural network tweaks its weights so that texts with similar meanings end up with vectors that sit close together, while unrelated texts get pushed far apart.

It’s not perfect. But over millions of examples, the model gets very good at encoding meaning.
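To make the idea concrete (this is a sketch of the general principle, not any particular model's training recipe), a contrastive-style objective rewards matching pairs for being close and mismatched pairs for being far apart:

import torch

# Toy 4-dimensional "embeddings" standing in for a real model's outputs.
anchor   = torch.tensor([[0.9, 0.1, 0.0, 0.2]])
positive = torch.tensor([[0.8, 0.2, 0.1, 0.1]])  # similar meaning: should stay close
negative = torch.tensor([[0.0, 0.9, 0.7, 0.0]])  # unrelated: should be pushed apart

loss_fn = torch.nn.CosineEmbeddingLoss()
pull = loss_fn(anchor, positive, torch.tensor([1.0]))   # target +1: make cosine similarity high
push = loss_fn(anchor, negative, torch.tensor([-1.0]))  # target -1: make cosine similarity low
loss = pull + push  # in real training, gradients of this loss nudge the model's weights
print(loss)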

Bonus: Dimensionality

Why are embeddings so long? (e.g. 1536 dimensions)

Because language is complex. You need space to capture tone, topic, syntax, semantics — all at once.

Each dimension might loosely track something abstract — like past/future tense, politeness, or even emotional intensity.

Final Thought

Embeddings are how machines “understand” language — not perfectly, but close enough to be useful. They enable smarter search, better chatbots, and semantic AI. And as LLMs evolve, so will the quality and utility of embeddings.