How sentences become vectors — models, dimensions, similarity metrics, and fine-tuning strategies
Embeddings map text to dense vectors in a continuous space where semantic similarity corresponds to geometric proximity. A sentence becomes a point in 768- or 1024-dimensional space. Sentences that mean the same thing end up close together; unrelated sentences are far apart.
Use cases: semantic search (find similar documents), clustering (group documents by meaning), classification (label documents based on vector nearness), RAG (retrieval-augmented generation), cross-lingual matching (find equivalent meanings across languages).
Not a language model: Embeddings come from encoder-only architectures (BERT-style) or late-interaction models. They don't generate text; they compress it into a fixed-length vector. No autoregressive generation happens here.
The core insight: word relationships are algebraic. In classic word embeddings (word2vec), "king" - "man" + "woman" ≈ "queen". The vector space preserves semantic relationships, making arithmetic on embeddings meaningful.
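A toy sketch of this analogy arithmetic, using hand-picked 3-dimensional vectors (purely illustrative: real word vectors have hundreds of dimensions, and the `vocab`/`nearest` helpers below are hypothetical, not a library API):

```python
import numpy as np

# Hand-picked toy word vectors, chosen so the analogy works out.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(v, exclude=()):
    """Return the vocab word whose vector is most cosine-similar to v."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: cos(v, u) for w, u in vocab.items() if w not in exclude}
    return max(candidates, key=candidates.get)

# "king" - "man" + "woman" lands nearest to "queen"
# (excluding the input words, as is standard for analogy evaluation).
analogy = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(analogy, exclude=("king", "man", "woman")))  # queen
```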
The embedding model landscape has matured rapidly. Here's the state of major models as of early 2025:
| Model | Dims | Context | MTEB avg | Notes |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | 8191 | 62.3 | Cheap, fast, good for most tasks |
| text-embedding-3-large (OpenAI) | 3072 | 8191 | 64.6 | Best OpenAI option |
| BAAI/bge-large-en-v1.5 | 1024 | 512 | 64.2 | Best open-source for English |
| BAAI/bge-m3 | 1024 | 8192 | 64.3 | Multilingual, long context |
| Cohere embed-v3 | 1024 | 512 | 64.5 | Strong retrieval with input_type |
| E5-large-v2 | 1024 | 512 | 62.3 | Solid open-source option |
| GTE-Qwen2-7B | 3584 | 32768 | 72.1 | LLM-based, expensive |
| NV-Embed-v2 (NVIDIA) | 4096 | 32768 | 72.3 | SOTA but large |
MTEB (Massive Text Embedding Benchmark): A standard leaderboard with 56 datasets across 8 tasks. Check the MTEB leaderboard before choosing a model — it's the gold standard for ranking embeddings.
There are multiple ways to measure how close two embeddings are, each with tradeoffs:
| Metric | Formula | Range | When to use |
|---|---|---|---|
| Cosine | dot(a,b) / (|a| |b|) | -1 to 1 | Default for text; measures direction only |
| Dot product | sum(a_i * b_i) | Unbounded | When vectors are L2-normalized (= cosine) |
| L2 distance | sqrt(sum((a-b)^2)) | 0 to ∞ | Image similarity, low-dimensional spaces |
| Manhattan | sum(|a_i - b_i|) | 0 to ∞ | Rarely used in practice |
Cosine similarity is the most common. It measures the angle between vectors, not their magnitude. Two vectors pointing in the same direction have cosine similarity of 1, opposite directions give -1, perpendicular gives 0.
Dot product is equivalent to cosine when vectors are L2-normalized (unit length). Most production embedding models normalize automatically. Dot product is faster to compute, and both FAISS and pgvector's HNSW indexes support it natively for fast ANN search.
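A minimal numpy check of that equivalence, using random vectors as stand-ins for sentence embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)   # stand-ins for two sentence embeddings
b = rng.normal(size=768)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# L2-normalize to unit length, as most production embedding models do.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

# On normalized vectors, the cheaper dot product equals cosine similarity:
print(cosine(a, b))
print(float(np.dot(a_hat, b_hat)))   # the two printed values match
```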
Higher dimensions capture more nuance but cost more to store and search. Matryoshka Representation Learning (MRL) is a training technique that lets you keep only the useful dimensions.
The idea: Train embeddings so the first N dimensions are already a useful representation. You can truncate to 256, 512, or 1024 dimensions and retain most quality. OpenAI's text-embedding-3 supports this natively.
Storage and latency drop dramatically with smaller dimensions. The tradeoff is a small quality loss:
| Dims | Storage (1M docs, float32) | ANN latency | MTEB retrieval (illustrative) |
|---|---|---|---|
| 256 | 1 GB | ~1ms | 56.2 |
| 512 | 2 GB | ~2ms | 60.1 |
| 1024 | 4 GB | ~3ms | 64.2 |
| 3072 | 12 GB | ~8ms | 66.3 |
| 4096 | 16 GB | ~12ms | 67.1 |
For most RAG systems, 512–1024 dimensions is the sweet spot. You get 95–98% of the quality with a 4–16× storage savings.
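A sketch of MRL-style truncation in numpy. Note this only preserves quality if the model was trained with MRL (truncating an ordinary embedding loses much more); `truncate` is an illustrative helper, not a library function:

```python
import numpy as np

def truncate(v, dims):
    """Keep the first `dims` components and re-normalize (MRL-style truncation)."""
    t = v[:dims]
    return t / np.linalg.norm(t)

rng = np.random.default_rng(1)
full = rng.normal(size=3072)          # stand-in for a 3072-dim embedding
full /= np.linalg.norm(full)

small = truncate(full, 512)           # 6x less storage per vector
print(small.shape)                    # (512,)

# Similarity is now computed entirely in the truncated space:
other = truncate(rng.normal(size=3072), 512)
print(float(np.dot(small, other)))
```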
Sometimes the best open-source or commercial embeddings aren't domain-specific enough. Fine-tuning customizes them to your data.
When to fine-tune: Your domain has specialized terminology (medical, legal, code), queries are out-of-distribution, or you have labeled query-document pairs to train on.
Training signals: (1) Human relevance labels — annotators mark (query, document) pairs as relevant or not. (2) Synthetic pairs via LLM — GPT-4 generates questions from documents. (3) Hard negatives from BM25 — take high-ranking BM25 results that are not labeled relevant and use them as hard negatives, teaching the model to distinguish lexical overlap from true semantic match.
Loss functions: InfoNCE / contrastive loss (push positive, pull negative), triplet loss (anchor, positive, negative), multiple negatives ranking loss (most common with Sentence Transformers). The last one is simple and effective: for each query in a batch, its paired document is the positive, and the other queries' positives serve as in-batch negatives; the loss pulls the positive closer and pushes the negatives away.
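The multiple negatives ranking loss can be sketched in numpy to make the in-batch-negative structure concrete (an illustrative reimplementation, not the Sentence Transformers API; the `scale` value is an assumption):

```python
import numpy as np

def mnr_loss(q, p, scale=20.0):
    """Multiple negatives ranking loss (illustrative numpy version).

    q, p: (batch, dim) L2-normalized query / positive-doc embeddings.
    Row i of p is query i's positive; every other row of p acts as an
    in-batch negative for query i."""
    sim = scale * (q @ p.T)                       # (batch, batch) similarities
    sim -= sim.max(axis=1, keepdims=True)         # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # cross-entropy, labels on diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 64))
q /= np.linalg.norm(q, axis=1, keepdims=True)
p = q + 0.1 * rng.normal(size=(8, 64))           # positives close to their queries
p /= np.linalg.norm(p, axis=1, keepdims=True)

print(mnr_loss(q, p))                      # small: each query nearest its own positive
print(mnr_loss(q, np.roll(p, 1, axis=0)))  # large: positives misaligned
```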
Frameworks: Sentence Transformers (Python), LlamaIndex fine-tuning utilities, FlagEmbedding (BAAI).
A bi-encoder retrieves fast by embedding query and documents separately. But it can't capture fine-grained matching — no cross-attention between query and document. Cross-encoders fix this.
Bi-encoder: Embed query and doc separately, compute similarity. Fast (dot product on pre-computed vectors), but no query-document interaction. Typical recall: 95–99% of top documents.
Cross-encoder: Concatenate query + document, score them jointly in a single forward pass. Much higher quality because it sees the full context, but O(N) forward passes for N candidates. Typical improvement: +5–15% MRR (Mean Reciprocal Rank).
The two-stage pipeline: Retrieve top-100 with bi-encoder (fast ANN search) → rerank top-100 with cross-encoder (slower but higher quality) → return top-5 to user.
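A sketch of the two-stage pipeline with random vectors. Brute-force dot product stands in for an ANN index, and `cross_encoder_score` is a hypothetical placeholder for a real reranker forward pass:

```python
import numpy as np

rng = np.random.default_rng(2)
docs = rng.normal(size=(10_000, 256))              # precomputed doc embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42] + 0.02 * rng.normal(size=256)     # a query near doc 42
query /= np.linalg.norm(query)

# Stage 1: bi-encoder retrieval -- dot product against precomputed vectors
# (a brute-force stand-in for an ANN index like HNSW).
scores = docs @ query
top100 = np.argsort(-scores)[:100]

# Stage 2: rerank with a cross-encoder. In practice this is one model
# forward pass per (query, doc) pair; here a placeholder scoring function.
def cross_encoder_score(query, doc_id):
    return float(docs[doc_id] @ query)             # hypothetical stand-in

reranked = sorted(top100, key=lambda i: -cross_encoder_score(query, i))
top5 = reranked[:5]
print(top5[0])   # 42
```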
| Property | Bi-encoder | Cross-encoder |
|---|---|---|
| Speed | Fast (dot product) | Slow (full forward pass) |
| Quality | Good | Better (+5–15% MRR) |
| Scalable to millions | Yes (ANN index) | No (linear scan) |
| Use case | Retrieval (stage 1) | Reranking (stage 2) |
Popular cross-encoder models: BAAI/bge-reranker-v2-m3, Cohere rerank-v3, Jina reranker. Most score on a 0–1 scale or raw logits.
Moving embeddings from notebook to production introduces practical challenges:
Embedding index updates: Add-only is easy (append new vectors to the index). Deletion requires a soft-delete flag + periodic reindex. Many vector databases support incremental updates, but graph-based ANN indices (HNSW) are expensive to rebuild.
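A minimal sketch of the soft-delete pattern, assuming a brute-force numpy "index" in place of a real vector database (the `delete`/`search` helpers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
vectors = rng.normal(size=(1000, 64))              # the "index"
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
deleted = np.zeros(1000, dtype=bool)               # soft-delete flags alongside it

def delete(doc_id):
    deleted[doc_id] = True                         # no index rebuild needed

def search(query, k=5):
    scores = vectors @ query
    scores[deleted] = -np.inf                      # filter soft-deleted docs at query time
    return np.argsort(-scores)[:k]

query = vectors[7]
before = search(query)
delete(7)
after = search(query)
print(before[0], 7 in after)   # 7 False
```

A periodic reindex then compacts the flagged entries out of the real ANN structure.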
Chunking affects quality: Very long chunks (2000+ tokens) dilute the signal — the vector blurs across many concepts. Very short chunks (50 tokens) lose context. The sweet spot is 256–512 tokens per chunk with 50-token overlap. This is a strong baseline; test on your data.
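A rough sketch of chunking with overlap, using whitespace tokens as a stand-in for model tokens (`chunk` is an illustrative helper, not a library function):

```python
def chunk(tokens, size=512, overlap=50):
    """Split a token list into overlapping chunks of at most `size` tokens,
    with `overlap` tokens shared between consecutive chunks."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

text = "word " * 1200                      # a 1200-"token" document
chunks = chunk(text.split())
print(len(chunks), [len(c) for c in chunks])   # 3 [512, 512, 276]
```

Each chunk is then embedded separately; the 50-token overlap keeps sentences that straddle a boundary retrievable from both sides.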
Multilingual: bge-m3, multilingual-e5-large, Cohere embed-multilingual-v3. These handle 100+ languages, but retrieval quality varies. Test on your target languages using MTEB multilingual tasks before committing.
Modality mismatch: Don't use the same embedding model for code, multilingual text, and general English without testing. Code embeddings, legal documents, and user queries have very different optimal models. Cross-lingual or code-to-natural-language retrieval is especially risky.
Embeddings are foundational — they underpin RAG, semantic search, recommendations, and clustering. Start here before building any retrieval system.
Call openai.embeddings.create() or a SentenceTransformer model's .encode() on "The cat sat on the mat." Print the vector. It's just 1536 (or 384) floats. Demystify the abstraction.
Embed "cat on mat" and "feline on rug". Compute np.dot(a, b) / (norm(a)*norm(b)). See that it's ~0.93. Now embed "stock market". See ~0.45. Intuition locked in.
Take any dataset (Wikipedia articles, your own docs), embed them, store in ChromaDB, then query with a question. This is the seed of every RAG system.
Run MTEB benchmarks for retrieval (pip install mteb). BGE-M3 and E5-Mistral top the charts; text-embedding-3-large is the best hosted option. Pick based on your language and domain.