Embeddings explained: how AI represents meaning as numbers
The mechanism behind semantic search, RAG, classification, and recommendations. What embeddings are, what they capture, what they miss (negation, exact match, fine-grained logic), and how to choose a model.
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings get embeddings that are close together in vector space; pieces with different meanings get embeddings that are far apart. This is the underlying mechanism behind semantic search, recommendation systems, RAG, and almost every "find similar things" feature in modern AI products.
You don't need to understand embeddings to use AI. But if you want to know why semantic search works, why RAG retrieves what it retrieves, or why your classifier returns weird matches, the answer is in how embeddings work.
What an embedding looks like
A modern embedding model outputs a fixed-length vector of floating-point numbers for each piece of input text. The length depends on the model, typically somewhere between a few hundred and 4,096 numbers.
For the sentence "The cat sat on the mat," the model might output something like:
[0.0231, -0.4521, 0.7821, ..., 0.0103] (4096 numbers total)
The individual numbers don't mean anything to a human. They're coordinates in a high-dimensional space. What matters is the relative position of one embedding to another.
How "meaning" gets encoded
The embedding model is trained on huge amounts of text with an objective like: "given a sentence and a paraphrase of that sentence, produce vectors that are close together; given two unrelated sentences, produce vectors that are far apart."
After enough training, the model learns that:
- "I love dogs" and "I'm a big fan of dogs" should land near each other
- "I love dogs" and "I hate dogs" should land somewhat near (they're about the same topic)
- "I love dogs" and "the stock market closed up 2%" should land far apart
The model isn't told what "meaning" is. It learns the patterns of which texts go together in similar contexts and represents that as geometric proximity.
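To make the training objective concrete, here is a minimal sketch of one common way to express it, a triplet loss. It is an illustration, not the objective any particular model uses. The three-dimensional vectors, the `margin` value, and the function name are all assumptions made for readability.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Penalty is zero once the paraphrase sits at least `margin` closer
    to the anchor than the unrelated text does (Euclidean distance)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)

# Hypothetical hand-picked vectors; real embeddings have hundreds of dimensions.
anchor = [1.0, 0.0, 0.0]     # "I love dogs"
positive = [0.9, 0.1, 0.0]   # "I'm a big fan of dogs"
negative = [0.0, 0.0, 1.0]   # "the stock market closed up 2%"

well_trained = triplet_loss(anchor, positive, negative)    # paraphrase is close
badly_trained = triplet_loss(anchor, negative, positive)   # roles swapped
print(well_trained, badly_trained)
```

Training nudges the model's weights to shrink this penalty across millions of such triples, which is what pulls paraphrases together and pushes unrelated text apart.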
This is why semantic search works. You embed the query, embed all the documents, and find the documents whose embedding is closest to the query's. No keyword matching required.
Measuring "close" in vector space
The most common similarity metric is cosine similarity: the cosine of the angle between two vectors, which is always a number between -1 and 1. It depends only on the direction of the vectors, not on their length.
- 1.0: identical direction (semantically very similar)
- 0.0: perpendicular (unrelated)
- -1.0: opposite direction (rarely seen with modern models; usually means very different topics)
In practice, you'll work with values like 0.85 (highly relevant), 0.65 (somewhat related), 0.3 (probably irrelevant). The exact thresholds depend on the model.
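Cosine similarity is short enough to write by hand. The sketch below uses only the standard library and tiny made-up vectors; in practice you would likely use a numerical library, but the arithmetic is the same.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 2-dimensional vectors chosen to hit the three reference points.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # parallel
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])     # right angle
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # reversed
print(same_direction, perpendicular, opposite)
```

Note that `[1.0, 2.0]` and `[2.0, 4.0]` score as identical even though one is twice as long: only direction counts.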
What embeddings are used for
1. Semantic search
Type "best Italian restaurants" — get results that mention "trattorias" and "pasta places." Keyword search would miss these. Embedding search finds them because the embeddings are nearby.
2. Retrieval-Augmented Generation (RAG)
The most common production use today. You have a knowledge base. A user asks a question. You embed the question, find the K most similar document chunks, paste them into the prompt as context, and ask the LLM to answer using them. The retrieval step is pure embedding similarity.
See "How RAG works, and when to use it" for a deeper dive into RAG.
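The retrieval step described above can be sketched in a few lines. This is a toy version under stated assumptions: the chunk vectors are hand-written stand-ins for what an embedding model would return, and `retrieve` and `build_prompt` are hypothetical helper names, not part of any library.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(query_vec, index, k=2):
    """Return the k (score, chunk_text) pairs most similar to the query."""
    scored = ((cosine(query_vec, vec), text) for text, vec in index)
    return heapq.nlargest(k, scored)

def build_prompt(question, chunks):
    """Paste the retrieved chunks into the prompt as context."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Hypothetical pre-computed chunk embeddings (3 dimensions for readability).
index = [
    ("Refunds are processed within 5 business days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("To request a refund, open a support ticket.", [0.8, 0.3, 0.1]),
]
question = "How do I get my money back?"
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded question

top = retrieve(query_vec, index, k=2)
prompt = build_prompt(question, [text for _, text in top])
print(prompt)
```

The prompt that reaches the LLM contains the two refund chunks and nothing about office hours, even though the question never uses the word "refund".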
3. Classification
Given examples of each category as labeled training data, classify a new input by embedding it and finding the most similar labeled examples. Cheaper and faster than fine-tuning a classifier for many use cases.
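One simple way to do this is a nearest-neighbor vote, sketched below. The vectors, labels, and the choice of `k=3` are illustrative assumptions; real embeddings would come from your model of choice.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def classify(vec, labeled, k=3):
    """Majority vote among the k labeled examples most similar to vec."""
    nearest = sorted(labeled, key=lambda item: cosine(vec, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled examples: (embedding, category).
labeled = [
    ([0.9, 0.1], "billing"),
    ([0.8, 0.2], "billing"),
    ([0.1, 0.9], "shipping"),
    ([0.2, 0.8], "shipping"),
    ([0.3, 0.7], "shipping"),
]

predicted = classify([0.85, 0.2], labeled, k=3)
print(predicted)
```

Adding a new category means adding a few labeled examples, with no retraining step.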
4. Deduplication
Find near-duplicate content (customer support tickets, news articles, code snippets) by clustering embeddings. Useful for "we keep getting the same question" detection.
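A minimal stand-in for clustering is to compare every pair and keep those above a similarity cutoff. The sketch below does exactly that; the `0.95` threshold and the ticket vectors are assumptions, and the pairwise loop is quadratic, so it only suits small batches.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def find_near_duplicates(items, threshold=0.95):
    """Return pairs of ids whose embeddings meet the similarity threshold."""
    pairs = []
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (id_a, vec_a), (id_b, vec_b) = items[i], items[j]
            if cosine(vec_a, vec_b) >= threshold:
                pairs.append((id_a, id_b))
    return pairs

# Hypothetical ticket embeddings.
tickets = [
    ("T1", [0.90, 0.10, 0.05]),  # "How do I reset my password?"
    ("T2", [0.88, 0.12, 0.06]),  # "I forgot my password, how can I reset it?"
    ("T3", [0.05, 0.20, 0.95]),  # "Where is my invoice?"
]

duplicates = find_near_duplicates(tickets)
print(duplicates)
```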
5. Recommendation systems
"Users who liked this article also liked..." can be solved with embedding similarity over content descriptions. Doesn't require user behavior data to bootstrap.
6. Anomaly detection
Embed everything. Find points far from any cluster. Those are anomalies. Useful for fraud detection, content moderation triage, and quality control.
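One way to read "far from any cluster" is "even the nearest neighbor is not very similar." The sketch below flags items on that basis. The `0.8` cutoff and the vectors are made up for illustration, and the function assumes at least two items.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def find_anomalies(items, min_neighbor_similarity=0.8):
    """Flag ids whose closest other item is still below the similarity cutoff."""
    anomalies = []
    for i, (item_id, vec) in enumerate(items):
        best = max(cosine(vec, other) for j, (_, other) in enumerate(items) if j != i)
        if best < min_neighbor_similarity:
            anomalies.append(item_id)
    return anomalies

# Hypothetical embeddings: two tight groups plus one item on its own.
items = [
    ("A", [0.90, 0.10, 0.00]),
    ("B", [0.85, 0.15, 0.00]),
    ("C", [0.10, 0.90, 0.00]),
    ("D", [0.15, 0.85, 0.00]),
    ("E", [0.05, 0.05, 0.95]),
]

anomalies = find_anomalies(items)
print(anomalies)
```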
What embeddings are not good at
- Exact matching: looking for the literal string "ORDER-12345" — use a database, not embeddings
- Distinguishing fine-grained differences: "the customer wanted a refund" and "the customer received a refund" can land very close. Embeddings capture topic, less so logic.
- Capturing negation reliably: "I love this product" and "I don't love this product" are often closer than you'd hope
- Numerical reasoning: $50 and $50,000 in financial documents might embed very similarly
For these failure modes, combine embeddings with other techniques (filters, structured data lookups, re-ranking with an LLM).
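As one example of combining techniques, you can route queries that contain an exact identifier to a plain lookup and send everything else to embedding search. In the sketch below, the `ORDER-<digits>` pattern is extrapolated from the example above, and `orders`, `route_query`, and `fake_semantic_search` are hypothetical names standing in for your own database and search function.

```python
import re

# Assumed ID format, based on the "ORDER-12345" example.
ORDER_ID = re.compile(r"\bORDER-\d+\b")

def route_query(query, orders, semantic_search):
    """Exact lookup when the query names an order ID; embeddings otherwise."""
    match = ORDER_ID.search(query)
    if match:
        # Returns None for the record if the ID is not in the store.
        return ("exact", orders.get(match.group()))
    return ("semantic", semantic_search(query))

# Stand-ins for a real database and a real embedding search.
orders = {"ORDER-12345": {"status": "shipped"}}

def fake_semantic_search(query):
    return ["(nearest-neighbor results would go here)"]

by_id = route_query("Where is ORDER-12345?", orders, fake_semantic_search)
by_meaning = route_query("how do refunds work", orders, fake_semantic_search)
print(by_id, by_meaning)
```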
Choosing an embedding model
Major providers offer embedding APIs:
- OpenAI: text-embedding-3-small (cheap, 1536 dims), text-embedding-3-large (better, 3072 dims)
- Anthropic: via Voyage AI partnership
- Google: Vertex AI text embeddings
- Cohere: their embed-v3 family
- Open-source: BAAI's BGE family, Mixedbread's mxbai-embed-large models, or the models available through the Sentence-Transformers library, all usable locally
Things that matter when choosing:
- Quality on your data: benchmarks lie. Test on a sample of your actual content + queries
- Cost: pricing per million tokens, just like LLMs
- Dimensions: bigger isn't always better. Bigger = more storage cost, slower search, marginally better recall
- Languages: some models are English-only; multilingual options exist
- Domain: medical or legal text benefits from a model fine-tuned on that domain
For most production use, start with one of the major commercial APIs. Move to self-hosted only when cost or latency justifies the engineering.
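The storage side of the dimensions trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats (4 bytes per number) and ignores any index overhead a vector database adds; both are assumptions to check against your own setup.

```python
def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw size of the vectors alone, in gigabytes (10**9 bytes)."""
    return num_vectors * dims * bytes_per_value / 1e9

# One million vectors at the two dimension counts listed above.
small = index_size_gb(1_000_000, 1536)
large = index_size_gb(1_000_000, 3072)
print(f"{small:.3f} GB vs {large:.3f} GB")
```

Doubling the dimensions doubles the raw storage, about 6 GB versus about 12 GB per million vectors here, so it is worth checking whether the larger model measurably improves results on your data first.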
Storing embeddings: vector databases
Once you have a million embeddings, "find the most similar to this query" needs to be fast. That's what vector databases solve.
Common options:
- Pinecone: managed, easy onboarding, higher cost at scale
- Weaviate, Qdrant: open-source, self-host or managed
- Postgres with pgvector: if you already have Postgres, this is often enough
- Chroma: simple, in-process, good for prototypes
For small datasets (under ~100K embeddings), you can skip the vector database entirely and do brute-force cosine similarity in memory. It's faster than you'd think.
How embeddings break in production
Things that look fine in dev but cause problems at scale:
- Chunking strategy: how you split documents before embedding matters a lot. Too short (single sentences) = lose context. Too long (whole pages) = embeddings become muddy averages.
- Embedding drift over time: when the underlying model is updated or replaced, vectors from the old and new versions live in different spaces and can't be compared with each other. Plan to re-embed your whole corpus when you upgrade.
- Out-of-distribution queries: queries about topics not covered in your indexed content will still return something (the model finds nearest neighbors). You need a relevance threshold or LLM re-ranking to catch this.
- Multilingual mismatches: if your docs are English and your queries are sometimes Japanese, an English-only model produces poor matches across languages.
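Two of the items above lend themselves to small sketches: splitting documents into overlapping chunks so context isn't cut off at a boundary, and dropping "nearest" neighbors that are still not similar enough to be useful. The word-based splitting, the default sizes, and the `0.5` cutoff are all assumptions; good values depend on your content and model.

```python
def chunk_words(text, size=50, overlap=10):
    """Split text into windows of `size` words, each sharing `overlap`
    words with the previous window."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

def filter_relevant(results, min_score=0.5):
    """Keep only (score, text) results at or above the relevance cutoff."""
    return [(score, text) for score, text in results if score >= min_score]

sample = " ".join(str(i) for i in range(10))
chunks = chunk_words(sample, size=4, overlap=1)
kept = filter_relevant([(0.82, "refund policy"), (0.31, "office hours")])
print(chunks, kept)
```

If `filter_relevant` returns an empty list, that is a signal to tell the user nothing relevant was found rather than passing weak matches to the LLM.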
What to read next
- How RAG works, and when to use it — the most common production use of embeddings
- Fine-tuning vs RAG vs prompting: which one fits your problem — where embeddings sit in the larger decision
- Transformers and attention: the architecture under every modern AI model — the architecture that makes good embedding models possible