Embeddings Demystified: Math, Meaning & Machines


“Embeddings are like whispers in a language machines can understand — quiet, dense, and surprisingly smart.”

What’s the Deal with Embeddings?

When you say “I love ice cream,” your friend gets the vibe. But a machine? Not so much.

That’s where embeddings come in. They transform human text into fixed-length numeric vectors that capture the meaning behind the words. It’s not just about words anymore — it’s about context, relationships, and even intent.

Think of embeddings as a way to place words, sentences, or documents on a giant 3D map — except this map has hundreds (or thousands) of dimensions.

"ice cream" → [0.21, -0.55, 0.88, 0.12, ...]

Every sentence gets its own unique “location.” And sentences that mean similar things? They land close together.
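Want to try it yourself? Here's a minimal sketch using the sentence-transformers library (the model name and example sentences are just illustrative choices; any embedding model works the same way):

from sentence_transformers import SentenceTransformer, util

# Load a small, general-purpose embedding model (illustrative choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I love ice cream",
    "Frozen desserts are my favorite",
    "The stock market fell today",
]
embeddings = model.encode(sentences)  # one fixed-length vector per sentence

# Sentences with similar meaning should score noticeably higher than unrelated ones.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))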

The Mathy Intuition

An embedding is just a list of numbers. But those numbers come from layers of transformation: tokens are looked up in an embedding matrix, then reshaped by the model's attention layers (the full pipeline is walked through below).

Each final embedding vector lives in a high-dimensional space (often 768–4096 dimensions). And in this space, closeness = semantic similarity.

⚙️ How It Works — Behind the Scenes

Let’s walk through how a sentence becomes an embedding:

Step 1: Tokenization

The sentence is broken into subword tokens:

"Tokyo is beautiful" → ["Tokyo", " is", " beautiful"]

Step 2: Mapping to IDs

Each token is mapped to an integer ID via a vocabulary:

["Tokyo", " is", " beautiful"] → [2031, 58, 1109]

Step 3: Embedding Lookup

Each ID is used to fetch a vector from an embedding matrix:

2031 → [0.2, -0.1, 0.5, ...]
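Under the hood, this is just an index into a big lookup table. A toy PyTorch sketch (the vocabulary size, dimension, and IDs are made up for illustration):

import torch

embedding_matrix = torch.nn.Embedding(num_embeddings=30000, embedding_dim=768)  # toy sizes
token_ids = torch.tensor([2031, 58, 1109])  # the illustrative IDs from above
vectors = embedding_matrix(token_ids)       # shape: (3, 768), one row per token
print(vectors.shape)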

Step 4: Contextualization via Transformer

These vectors pass through multiple self-attention layers. Tokens update themselves based on their neighbors. For instance, “beautiful” can learn to associate more strongly with “Tokyo.”

Of course, this isn’t always interpretable. These updates depend heavily on how the model was pre-trained. Think of this part as a black box that magically learns relationships — not with hard rules, but with statistical patterns over massive amounts of text.
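You rarely run this step by hand; the model does it internally. A hedged sketch with Hugging Face transformers, reusing the BERT example from above:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Tokyo is beautiful", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextualized vector per token (special tokens included), 768-dimensional for this model.
print(outputs.last_hidden_state.shape)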

Step 5: Aggregation

To get a single embedding for the whole sentence, we need to combine the contextualized token vectors into one fixed-length representation. This step matters because most downstream tasks (like search or classification) require just one vector.

Here are common aggregation strategies (a mean-pooling sketch follows below):

Mean pooling: average all the token vectors; a simple, strong default.
[CLS] / first-token pooling: use the vector of a special token prepended to the sequence.
Last-token pooling: common for decoder-only (GPT-style) models.
Max pooling: take the element-wise maximum across tokens.
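Mean pooling is the easiest to picture. A minimal PyTorch sketch, where the only subtle part is making sure padded positions don't drag the average down:

import torch

def mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 = real token
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # sum only over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    return summed / counts                          # (batch, dim): one vector per sentence

# Toy input: 1 sentence, 5 tokens, 768-dimensional contextualized vectors.
hidden = torch.randn(1, 5, 768)
mask = torch.ones(1, 5)
print(mean_pool(hidden, mask).shape)  # torch.Size([1, 768])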

How Do We Compare Embeddings?

Once you’ve got two embeddings, the most common similarity measure is cosine similarity:

"physician" vs. "doctor" → 0.98 (almost identical)
"banana" vs. "physician" → 0.02 (totally unrelated)

This works because embeddings “live” in a space where direction means meaning.

Let’s Talk Math (Just a Little)

Imagine two vectors:

A = [1, 2, 3], B = [2, 4, 6]

The cosine similarity is:

cos(θ) = (A · B) / (||A|| * ||B||)

Which comes out to:

(1*2 + 2*4 + 3*6) / (sqrt(14) * sqrt(56)) = 28 / sqrt(784) = 28 / 28 = 1

Meaning? B is just 2 × A, so the two vectors point in exactly the same direction. Cosine similarity ignores length and only looks at direction, so it treats them as having identical meaning.
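The same calculation in a few lines of NumPy, just to confirm the arithmetic:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = np.array([1, 2, 3])
B = np.array([2, 4, 6])
print(cosine_similarity(A, B))  # 1.0 (up to floating-point rounding)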

Why Do Embeddings Matter?

Embeddings are the foundation for a lot of smart behavior in AI systems:

Semantic search: retrieve documents by meaning instead of exact keywords.
Retrieval-augmented generation (RAG): pull in relevant context before an LLM answers.
Recommendations: suggest items whose descriptions sit near things a user already likes.
Clustering and deduplication: group or merge texts that say the same thing differently.
Classification: feed the vector into a lightweight downstream model.

And the best part? Embeddings make these tasks efficient and scalable.

Are Embeddings Learned?

Yes. During model training, the neural network tweaks its weights so that texts with similar meanings end up with vectors that sit close together, while unrelated texts get pushed far apart.

It’s not perfect. But over millions of examples, the model gets very good at encoding meaning.
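To make the idea concrete (this is a sketch of the general principle, not any particular model's training recipe), a contrastive-style objective rewards matching pairs for being close and mismatched pairs for being far apart:

import torch

# Toy 4-dimensional "embeddings" standing in for a real model's outputs.
anchor   = torch.tensor([[0.9, 0.1, 0.0, 0.2]])
positive = torch.tensor([[0.8, 0.2, 0.1, 0.1]])  # similar meaning: should stay close
negative = torch.tensor([[0.0, 0.9, 0.7, 0.0]])  # unrelated: should be pushed apart

loss_fn = torch.nn.CosineEmbeddingLoss()
pull = loss_fn(anchor, positive, torch.tensor([1.0]))   # target +1: make cosine similarity high
push = loss_fn(anchor, negative, torch.tensor([-1.0]))  # target -1: make cosine similarity low
loss = pull + push  # in real training, gradients of this loss nudge the model's weights
print(loss)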

Bonus: Dimensionality

Why are embeddings so long? (e.g. 1536 dimensions)

Because language is complex. You need space to capture tone, topic, syntax, semantics — all at once.

Each dimension might loosely track something abstract — like past/future tense, politeness, or even emotional intensity.

Final Thought

Embeddings are how machines “understand” language — not perfectly, but close enough to be useful. They enable smarter search, better chatbots, and semantic AI. And as LLMs evolve, so will the quality and utility of embeddings.