REPRESENTATIONS

Text Embeddings

How sentences become vectors — models, dimensions, similarity metrics, and fine-tuning strategies

Typical dimensions: 768–4096
Standard similarity metric: cosine
Where to check rankings: the MTEB leaderboard
Contents
  1. What are embeddings
  2. Model landscape
  3. Similarity metrics
  4. Dimensions & Matryoshka
  5. Fine-tuning
  6. Cross-encoder reranking
  7. Production considerations
01 — Definition

What Are Embeddings?

Embeddings map text to dense vectors in a continuous space where semantic similarity equals geometric proximity. A sentence becomes a point in 768- or 1024-dimensional space. Sentences that mean the same thing end up close together; unrelated sentences are far apart.

Use cases: semantic search (find similar documents), clustering (group documents by meaning), classification (label documents based on vector nearness), RAG (retrieval-augmented generation), cross-lingual matching (find equivalent meanings across languages).

Not a language model: Embeddings come from encoder-only architectures (BERT-style), late-interaction models, or, increasingly, adapted decoder LLMs (see GTE-Qwen2 and NV-Embed below). They don't generate text; they compress it into a fixed-length vector. No autoregressive generation happens here.

The core insight: semantic relationships are algebraic. In classic word embeddings, "king" - "man" + "woman" ≈ "queen". The vector space preserves semantic relationships (approximately), making arithmetic on embeddings meaningful.

Example: Computing Cosine Similarity

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('BAAI/bge-large-en-v1.5')

sentences = [
    "The transformer architecture revolutionized NLP",
    "Attention mechanisms changed how we process sequences",
    "I like pizza",
]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity (dot product of normalized vectors)
sim_01 = np.dot(embeddings[0], embeddings[1])  # ~0.87, very similar
sim_02 = np.dot(embeddings[0], embeddings[2])  # ~0.12, unrelated

print(f"Sentences 0-1: {sim_01:.3f}")
print(f"Sentences 0-2: {sim_02:.3f}")
02 — Comparison

Model Landscape: Which Embedding Model?

The embedding model landscape has matured rapidly. Here's the state of major models as of early 2025:

Model                           | Dims | Context | MTEB avg | Notes
text-embedding-3-small (OpenAI) | 1536 | 8191    | 62.3     | Cheap, fast, good for most tasks
text-embedding-3-large (OpenAI) | 3072 | 8191    | 64.6     | Best OpenAI option
BAAI/bge-large-en-v1.5          | 1024 | 512     | 64.2     | Best open-source for English
BAAI/bge-m3                     | 1024 | 8192    | 64.3     | Multilingual, long context
Cohere embed-v3                 | 1024 | 512     | 64.5     | Strong retrieval with input_type
E5-large-v2                     | 1024 | 512     | 62.3     | Solid open-source option
GTE-Qwen2-7B                    | 3584 | 32768   | 72.1     | LLM-based, expensive
NV-Embed-v2 (NVIDIA)            | 4096 | 32768   | 72.3     | SOTA but large

MTEB (Massive Text Embedding Benchmark): A standard leaderboard with 56 datasets across 8 tasks. Check the MTEB leaderboard before choosing a model — it's the gold standard for ranking embeddings.

⚠️ Important for RAG: Retrieval performance (MTEB Retrieval subset) often matters more than overall MTEB score. A model can rank 10th overall but be #1 for retrieval. Test on your specific task.
03 — Metrics

Similarity Metrics

Multiple ways to measure how close two embeddings are. Each has tradeoffs:

Metric      | Formula                  | Range     | When to use
Cosine      | dot(a,b) / (|a| |b|)     | -1 to 1   | Default for text; measures direction only
Dot product | sum(a_i * b_i)           | Unbounded | When vectors are L2-normalized (= cosine)
L2 distance | sqrt(sum((a_i - b_i)^2)) | 0 to ∞    | Image similarity, low-dimensional spaces
Manhattan   | sum(|a_i - b_i|)         | 0 to ∞    | Rarely used in practice
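All four metrics are one-liners in NumPy. A toy sketch showing the key difference: cosine ignores magnitude, the others don't (vectors here are illustrative):

```python
import numpy as np

def cosine(a, b):
    # Direction only: magnitude cancels out
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot(a, b):
    return float(np.dot(a, b))

def l2(a, b):
    return float(np.linalg.norm(a - b))

def manhattan(a, b):
    return float(np.sum(np.abs(a - b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude

print(cosine(a, b))     # ~1.0: identical direction despite different lengths
print(dot(a, b))        # 28.0: grows with magnitude
print(l2(a, b))         # ~3.74: nonzero even though the angle is zero
print(manhattan(a, b))  # 6.0
```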

Cosine similarity is the most common. It measures the angle between vectors, not their magnitude. Two vectors pointing in the same direction have cosine similarity of 1, opposite directions give -1, perpendicular gives 0.

Dot product is equivalent to cosine when vectors are L2-normalized (unit length). Most production embedding models normalize automatically. Dot product is faster to compute and FAISS/pgvector HNSW support it natively for fast ANN search.

Best practice: Always normalize embeddings at index time and query time. Dot product on normalized vectors equals cosine similarity and enables the fastest ANN algorithms.
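The equivalence is easy to verify: normalize once, and plain dot products reproduce cosine similarity exactly (up to float rounding). A minimal check with random stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1024)
b = rng.normal(size=1024)

# Normalize once, at index time and query time
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cosine_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_normalized = float(np.dot(a_n, b_n))

# Identical up to float rounding, so ANN indexes can use the cheaper dot product
print(abs(cosine_sim - dot_normalized) < 1e-9)  # True
```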
04 — Optimization

Dimensions and Matryoshka Embeddings

Higher dimensions capture more nuance but cost more to store and search. Matryoshka Representation Learning (MRL) is a training technique that lets you keep only the useful dimensions.

The idea: Train embeddings so the first N dimensions are already a useful representation. You can truncate to 256, 512, or 1024 dimensions and retain most quality. OpenAI's text-embedding-3 supports this natively.

Example: Truncating OpenAI Embeddings

# Truncate OpenAI embeddings
from openai import OpenAI

client = OpenAI()

# Full 3072 dims
full = client.embeddings.create(
    model="text-embedding-3-large",
    input="hello world")

# Truncated to 256 dims (Matryoshka)
small = client.embeddings.create(
    model="text-embedding-3-large",
    input="hello world",
    dimensions=256)  # 12x smaller, ~5% quality drop

Storage and latency drop dramatically with smaller dimensions. The tradeoff is a small quality loss:

Dims | Storage (1M docs, float32) | ANN latency | MTEB retrieval (representative)
256  | 1 GB                       | ~1ms        | 56.2
512  | 2 GB                       | ~2ms        | 60.1
1024 | 4 GB                       | ~3ms        | 64.2
3072 | 12 GB                      | ~8ms        | 66.3
4096 | 16 GB                      | ~12ms       | 67.1

For most RAG systems, 512–1024 dimensions is the sweet spot. You get 95–98% of the quality with a 4–16× storage savings.
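For open-source models trained with MRL, truncation is just slicing plus re-normalization. A sketch (the re-normalize step matters: similarity scores assume unit-length vectors, and truncating a model not trained with MRL degrades quality sharply):

```python
import numpy as np

def truncate(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize.

    Re-normalization is the step people forget: cosine/dot scores are
    only comparable if vectors stay unit-length after truncation."""
    v = embedding[:dims]
    return v / np.linalg.norm(v)

# Random stand-in for a full-size MRL embedding
full = np.random.default_rng(1).normal(size=1024)
full /= np.linalg.norm(full)

small = truncate(full, 256)
print(small.shape, float(np.linalg.norm(small)))
```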

05 — Customization

Fine-Tuning Embeddings

Sometimes the best open-source or commercial embeddings aren't domain-specific enough. Fine-tuning customizes them to your data.

When to fine-tune: Your domain has specialized terminology (medical, legal, code), queries are out-of-distribution, or you have labeled query-document pairs to train on.

Training signals: (1) Human relevance labels: annotators mark (query, document) pairs as relevant or not. (2) Synthetic pairs via LLM: a model like GPT-4 generates questions from your documents. (3) Hard negatives from BM25: take the top-10 BM25 results (minus the true positive) as negatives, so the model learns to separate lexical overlap from true relevance.

Loss functions: InfoNCE / contrastive loss (push positive, pull negative), triplet loss (anchor, positive, negative), multiple negatives ranking loss (most common with Sentence Transformers). The last one is simple and effective: for each query, you have one positive document and N-1 random negatives; the loss pulls the positive closer and pushes negatives away.
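Multiple negatives ranking loss is just softmax cross-entropy over in-batch similarities: for each query, its paired document is the correct class and every other document in the batch is a negative. A NumPy sketch (the scale of 20 mirrors the Sentence Transformers default; embeddings here are random stand-ins):

```python
import numpy as np

def mnr_loss(q, d, scale=20.0):
    """Multiple negatives ranking loss over a batch of (query, doc) pairs.

    q, d: (batch, dim) L2-normalized embeddings where q[i] matches d[i].
    Every other document in the batch acts as an in-batch negative."""
    scores = scale * (q @ d.T)                     # (batch, batch) similarities
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # diagonal = correct pairs

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = q + 0.1 * rng.normal(size=(4, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)

print(mnr_loss(q, d))                      # low: each query sits near its own doc
print(mnr_loss(q, np.roll(d, 1, axis=0)))  # high: pairs are scrambled
```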

Frameworks: Sentence Transformers (Python), FlagEmbedding (BAAI), LlamaIndex's embedding fine-tuning utilities.

Example: Fine-tuning with Sentence Transformers

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('BAAI/bge-base-en-v1.5')

train_examples = [
    InputExample(texts=["What is RAG?",
                        "Retrieval-Augmented Generation combines..."], label=1.0),
    InputExample(texts=["What is RAG?",
                        "The capital of France is Paris"], label=0.0),
]
train_dl = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dl, train_loss)], epochs=3)
model.save('my-domain-embeddings')
⚠️ Key lever: Hard negative mining is the highest-leverage step. Easy negatives (random docs) are too easy; the model learns nothing. Use BM25 top-10 results (minus the true positive) as hard negatives: they're semantically related but not the answer.
06 — Two-Stage Retrieval

Cross-Encoder Reranking

A bi-encoder retrieves fast by embedding query and documents separately. But it can't capture fine-grained matching — no cross-attention between query and document. Cross-encoders fix this.

Bi-encoder: Embed query and doc separately, compute similarity. Fast (dot product on pre-computed vectors), but no query-document interaction. Typically recalls 95–99% of the relevant documents within its top candidates.

Cross-encoder: Concatenate query + document, score them jointly in a single forward pass. Much higher quality because it sees the full context, but O(N) forward passes for N candidates. Typical improvement: +5–15% MRR (Mean Reciprocal Rank).

The two-stage pipeline: Retrieve top-100 with bi-encoder (fast ANN search) → rerank top-100 with cross-encoder (slower but higher quality) → return top-5 to user.
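The pipeline shape can be sketched in plain Python with stand-in components: a dot product over normalized vectors plays the bi-encoder, and cross_score is a hypothetical hook where a real cross-encoder (e.g. a bge-reranker) would plug in:

```python
import numpy as np

def two_stage_search(query_vec, doc_vecs, docs, cross_score, k1=100, k2=5):
    """Stage 1: fast candidate retrieval by dot product (bi-encoder stand-in).
    Stage 2: rerank the k1 candidates with a slower, higher-quality scorer."""
    sims = doc_vecs @ query_vec
    candidates = np.argsort(-sims)[: min(k1, len(docs))]
    reranked = sorted(candidates, key=lambda i: -cross_score(docs[i]))
    return [docs[i] for i in reranked[:k2]]

# Toy corpus; a real system would use model embeddings and an ANN index
rng = np.random.default_rng(0)
docs = [f"doc-{i}" for i in range(10)]
vecs = rng.normal(size=(10, 16)); vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
query = vecs[3]                     # query identical to doc-3's vector

# Hypothetical oracle standing in for a real cross-encoder score
oracle = lambda text: 1.0 if text == "doc-3" else 0.0
top = two_stage_search(query, vecs, docs, cross_score=oracle, k1=8, k2=3)
print(top[0])  # doc-3
```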

Property             | Bi-encoder              | Cross-encoder
Speed                | Fast (dot product)      | Slow (full forward pass)
Quality              | Good                    | Better (+5–15% MRR)
Scalable to millions | Yes (ANN index)         | No (linear scan)
Use case             | Retrieval (stage 1)     | Reranking (stage 2)

Popular cross-encoder models: BAAI/bge-reranker-v2-m3, Cohere rerank-v3, Jina reranker. Most score on a 0–1 scale or raw logits.

Example: Using Cohere Rerank API

# Pseudo-code for two-stage retrieval
scores, ids = faiss_index.search(query_embedding, 100)
documents = [corpus[i] for i in ids[0]]

# Rerank with cross-encoder
reranked = cohere_client.rerank(
    query=query,
    documents=documents,
    model="rerank-v3-compact",
    top_n=5,
)
final_results = [documents[r.index] for r in reranked.results]
07 — Deployment

Production Considerations

Moving embeddings from notebook to production introduces practical challenges:

Embedding index updates: Add-only is easy (append new vectors to the index). Deletion requires a soft-delete flag + periodic reindex. Many vector databases support incremental updates, but graph-based ANN indices (HNSW) are expensive to rebuild.
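The soft-delete pattern reduces to a small wrapper, sketched here with a brute-force NumPy search standing in for the ANN index (names are illustrative): keep a set of deleted ids, over-fetch, and filter at query time.

```python
import numpy as np

deleted: set[int] = set()   # soft-delete flags, checked at query time

def search(query_vec, index_vecs, k=5, overfetch=2):
    """Over-fetch from the index, then drop soft-deleted ids.

    Over-fetching compensates for hits lost to the filter; a periodic
    reindex eventually removes soft-deleted vectors for real."""
    sims = index_vecs @ query_vec
    candidates = np.argsort(-sims)[: k * overfetch]
    live = [int(i) for i in candidates if int(i) not in deleted]
    return live[:k]

rng = np.random.default_rng(0)
vecs = rng.normal(size=(20, 8)); vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

query = vecs[7]
deleted.add(7)              # "delete" the best match without touching the index
print(search(query, vecs, k=3))
```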

Chunking affects quality: Very long chunks (2000+ tokens) dilute the signal — the vector blurs across many concepts. Very short chunks (50 tokens) lose context. The sweet spot is 256–512 tokens per chunk with 50-token overlap. This is a strong baseline; test on your data.
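The sliding-window baseline above is a few lines of code; a sketch over pre-tokenized input (a real tokenizer would supply `tokens`, and the 384/50 defaults are just one point in the recommended range):

```python
def chunk_tokens(tokens, size=384, overlap=50):
    """Sliding-window chunking: fixed-size chunks, with `overlap` tokens
    shared between consecutive chunks so context isn't cut mid-thought."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = list(range(1000))       # stand-in for real tokenizer output
chunks = chunk_tokens(tokens)
print(len(chunks), len(chunks[0]))        # 3 384
print(chunks[0][-50:] == chunks[1][:50])  # True: 50-token overlap
```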

Multilingual: bge-m3, multilingual-e5-large, Cohere embed-multilingual-v3. These handle 100+ languages, but retrieval quality varies. Test on your target languages using MTEB multilingual tasks before committing.

Modality mismatch: Don't use the same embedding model for code, multilingual text, and general English without testing. Code embeddings, legal documents, and user queries have very different optimal models. Cross-lingual or code-to-natural-language retrieval is especially risky.

Tools & infrastructure:

Libraries
  sentence-transformers: Python library for embedding inference, fine-tuning, and utilities

APIs
  OpenAI Embeddings: text-embedding-3-small/large; managed, best-in-class
  Cohere embed: flexible, multilingual, supports input_type for task-specific tuning
  Voyage AI: high-performance, supports a dimensions parameter

Vector search
  FAISS: Facebook's library; IndexHNSW, IVF, PQ; the standard

Vector DBs
  pgvector: Postgres extension; HNSW + IVF, full SQL, integrated
  Qdrant: HNSW native, rich filtering, managed cloud option
  Weaviate: HNSW, GraphQL API, good ecosystem
⚠️ Multimodal risk: Don't reuse one embedding model across modalities. Code-to-code retrieval needs CodeBERT or UniXcoder. Cross-lingual needs XLM-RoBERTa or mBERT. General English needs bge or text-embedding-3. Test each separately.
LEARNING PATH

Learning Path

Embeddings are foundational — they underpin RAG, semantic search, recommendations, and clustering. Start here before building any retrieval system.

What is a Vector? (cosine similarity)
Embedding Models (BGE, text-embedding-3)
Vector Stores (Chroma, Weaviate)
Semantic Search (your first use case)
RAG / Clustering (advanced applications)
1. Embed one sentence, look at the numbers

Call OpenAI's embeddings.create() or a SentenceTransformer's encode() on "The cat sat on the mat." Print the vector. It's just 1536 (or 384) floats. Demystify the abstraction.

2. Compute cosine similarity by hand

Embed "cat on mat" and "feline on rug". Compute np.dot(a, b) / (norm(a)*norm(b)). See that it's ~0.93. Now embed "stock market". See ~0.45. Intuition locked in.

3. Build a 100-doc semantic search system

Take any dataset (Wikipedia articles, your own docs), embed them, store in ChromaDB, then query with a question. This is the seed of every RAG system.

4. Evaluate your embedding model

Run MTEB benchmarks for retrieval (pip install mteb). BGE-M3 and E5-Mistral top the charts; text-embedding-3-large is the best hosted option. Pick based on your language and domain.