How sentences become vectors — models, dimensions, similarity metrics, and fine-tuning strategies
Embeddings map text to dense vectors in a continuous space where semantic similarity corresponds to geometric proximity. A sentence becomes a point in 768- or 1024-dimensional space. Sentences that mean the same thing end up close together; unrelated sentences are far apart.
Use cases: semantic search (find similar documents), clustering (group documents by meaning), classification (label documents based on vector nearness), RAG (retrieval-augmented generation), cross-lingual matching (find equivalent meanings across languages).
Not a language model: Embeddings come from encoder-only architectures (BERT-style) or late-interaction models. They don't generate text; they compress it into a fixed-length vector. No autoregressive generation happens here.
The core insight: word relationships are algebraic. In classic word embeddings (word2vec), "king" - "man" + "woman" ≈ "queen". The vector space preserves semantic relationships, making arithmetic on embeddings meaningful.
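A toy sketch of this analogy arithmetic, using hand-picked 3-dimensional vectors (purely illustrative: real word vectors have hundreds of dimensions, and the `vocab`/`nearest` helpers below are hypothetical, not a library API):

```python
import numpy as np

# Hand-picked toy word vectors, chosen so the analogy works out.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(v, exclude=()):
    """Return the vocab word whose vector is most cosine-similar to v."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: cos(v, u) for w, u in vocab.items() if w not in exclude}
    return max(candidates, key=candidates.get)

# "king" - "man" + "woman" lands nearest to "queen"
# (excluding the input words, as is standard for analogy evaluation).
analogy = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(analogy, exclude=("king", "man", "woman")))  # queen
```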
The embedding model landscape has matured rapidly. Here's the state of major models as of early 2025:
| Model | Dims | Context | MTEB avg | Notes |
|---|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | 8191 | 62.3 | Cheap, fast, good for most tasks |
| text-embedding-3-large (OpenAI) | 3072 | 8191 | 64.6 | Best OpenAI option |
| BAAI/bge-large-en-v1.5 | 1024 | 512 | 64.2 | Best open-source for English |
| BAAI/bge-m3 | 1024 | 8192 | 64.3 | Multilingual, long context |
| Cohere embed-v3 | 1024 | 512 | 64.5 | Strong retrieval with input_type |
| E5-large-v2 | 1024 | 512 | 62.3 | Solid open-source option |
| GTE-Qwen2-7B | 3584 | 32768 | 72.1 | LLM-based, expensive |
| NV-Embed-v2 (NVIDIA) | 4096 | 32768 | 72.3 | SOTA but large |
MTEB (Massive Text Embedding Benchmark): A standard leaderboard with 56 datasets across 8 tasks. Check the MTEB leaderboard before choosing a model — it's the gold standard for ranking embeddings.
There are multiple ways to measure how close two embeddings are, each with tradeoffs:
| Metric | Formula | Range | When to use |
|---|---|---|---|
| Cosine | dot(a,b) / (|a| |b|) | -1 to 1 | Default for text; measures direction only |
| Dot product | sum(a_i * b_i) | Unbounded | When vectors are L2-normalized (= cosine) |
| L2 distance | sqrt(sum((a-b)^2)) | 0 to ∞ | Image similarity, low-dimensional spaces |
| Manhattan | sum(|a_i - b_i|) | 0 to ∞ | Rarely used in practice |
Cosine similarity is the most common. It measures the angle between vectors, not their magnitude. Two vectors pointing in the same direction have cosine similarity of 1, opposite directions give -1, perpendicular gives 0.
Dot product is equivalent to cosine when vectors are L2-normalized (unit length). Most production embedding models normalize automatically. Dot product is faster to compute, and both FAISS and pgvector's HNSW indexes support it natively for fast ANN search.
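A minimal numpy check of that equivalence, using random vectors as stand-ins for sentence embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=768)   # stand-ins for two sentence embeddings
b = rng.normal(size=768)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# L2-normalize to unit length, as most production embedding models do.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

# On normalized vectors, the cheaper dot product equals cosine similarity:
print(cosine(a, b))
print(float(np.dot(a_hat, b_hat)))   # the two printed values match
```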
Higher dimensions capture more nuance but cost more to store and search. Matryoshka Representation Learning (MRL) is a training technique that lets you keep only the useful dimensions.
The idea: Train embeddings so the first N dimensions are already a useful representation. You can truncate to 256, 512, or 1024 dimensions and retain most quality. OpenAI's text-embedding-3 supports this natively.
Storage and latency drop dramatically with smaller dimensions. The tradeoff is a small quality loss:
| Dims | Storage (1M docs, float32) | ANN latency | MTEB retrieval (illustrative) |
|---|---|---|---|
| 256 | 1 GB | ~1ms | 56.2 |
| 512 | 2 GB | ~2ms | 60.1 |
| 1024 | 4 GB | ~3ms | 64.2 |
| 3072 | 12 GB | ~8ms | 66.3 |
| 4096 | 16 GB | ~12ms | 67.1 |
For most RAG systems, 512–1024 dimensions is the sweet spot. You get 95–98% of the quality with a 4–16× storage savings.
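A sketch of MRL-style truncation in numpy. Note this only preserves quality if the model was trained with MRL (truncating an ordinary embedding loses much more); `truncate` is an illustrative helper, not a library function:

```python
import numpy as np

def truncate(v, dims):
    """Keep the first `dims` components and re-normalize (MRL-style truncation)."""
    t = v[:dims]
    return t / np.linalg.norm(t)

rng = np.random.default_rng(1)
full = rng.normal(size=3072)          # stand-in for a 3072-dim embedding
full /= np.linalg.norm(full)

small = truncate(full, 512)           # 6x less storage per vector
print(small.shape)                    # (512,)

# Similarity is now computed entirely in the truncated space:
other = truncate(rng.normal(size=3072), 512)
print(float(np.dot(small, other)))
```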
Sometimes the best open-source or commercial embeddings aren't domain-specific enough. Fine-tuning customizes them to your data.
When to fine-tune: Your domain has specialized terminology (medical, legal, code), queries are out-of-distribution, or you have labeled query-document pairs to train on.
Training signals: (1) Human relevance labels — annotators mark (query, document) pairs as relevant or not. (2) Synthetic pairs via LLM — GPT-4 generates questions from documents. (3) Hard negatives from BM25 — take high-ranking BM25 results that are not labeled relevant and use them as hard negatives, teaching the model to distinguish lexical overlap from true semantic match.
Loss functions: InfoNCE / contrastive loss (push positive, pull negative), triplet loss (anchor, positive, negative), multiple negatives ranking loss (most common with Sentence Transformers). The last one is simple and effective: for each query in a batch, its paired document is the positive, and the other queries' positives serve as in-batch negatives; the loss pulls the positive closer and pushes the negatives away.
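The multiple negatives ranking loss can be sketched in numpy to make the in-batch-negative structure concrete (an illustrative reimplementation, not the Sentence Transformers API; the `scale` value is an assumption):

```python
import numpy as np

def mnr_loss(q, p, scale=20.0):
    """Multiple negatives ranking loss (illustrative numpy version).

    q, p: (batch, dim) L2-normalized query / positive-doc embeddings.
    Row i of p is query i's positive; every other row of p acts as an
    in-batch negative for query i."""
    sim = scale * (q @ p.T)                       # (batch, batch) similarities
    sim -= sim.max(axis=1, keepdims=True)         # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # cross-entropy, labels on diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 64))
q /= np.linalg.norm(q, axis=1, keepdims=True)
p = q + 0.1 * rng.normal(size=(8, 64))           # positives close to their queries
p /= np.linalg.norm(p, axis=1, keepdims=True)

print(mnr_loss(q, p))                      # small: each query nearest its own positive
print(mnr_loss(q, np.roll(p, 1, axis=0)))  # large: positives misaligned
```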
Frameworks: Sentence Transformers (Python), LlamaIndex fine-tuning utilities, FlagEmbedding (BAAI).
A bi-encoder retrieves fast by embedding query and documents separately. But it can't capture fine-grained matching — no cross-attention between query and document. Cross-encoders fix this.
Bi-encoder: Embed query and doc separately, compute similarity. Fast (dot product on pre-computed vectors), but no query-document interaction. Typical recall: 95–99% of top documents.
Cross-encoder: Concatenate query + document, score them jointly in a single forward pass. Much higher quality because it sees the full context, but O(N) forward passes for N candidates. Typical improvement: +5–15% MRR (Mean Reciprocal Rank).
The two-stage pipeline: Retrieve top-100 with bi-encoder (fast ANN search) → rerank top-100 with cross-encoder (slower but higher quality) → return top-5 to user.
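A sketch of the two-stage pipeline with random vectors. Brute-force dot product stands in for an ANN index, and `cross_encoder_score` is a hypothetical placeholder for a real reranker forward pass:

```python
import numpy as np

rng = np.random.default_rng(2)
docs = rng.normal(size=(10_000, 256))              # precomputed doc embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42] + 0.02 * rng.normal(size=256)     # a query near doc 42
query /= np.linalg.norm(query)

# Stage 1: bi-encoder retrieval -- dot product against precomputed vectors
# (a brute-force stand-in for an ANN index like HNSW).
scores = docs @ query
top100 = np.argsort(-scores)[:100]

# Stage 2: rerank with a cross-encoder. In practice this is one model
# forward pass per (query, doc) pair; here a placeholder scoring function.
def cross_encoder_score(query, doc_id):
    return float(docs[doc_id] @ query)             # hypothetical stand-in

reranked = sorted(top100, key=lambda i: -cross_encoder_score(query, i))
top5 = reranked[:5]
print(top5[0])   # 42
```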
| Property | Bi-encoder | Cross-encoder |
|---|---|---|
| Speed | Fast (dot product) | Slow (full forward pass) |
| Quality | Good | Better (+5–15% MRR) |
| Scalable to millions | Yes (ANN index) | No (linear scan) |
| Use case | Retrieval (stage 1) | Reranking (stage 2) |
Popular cross-encoder models: BAAI/bge-reranker-v2-m3, Cohere rerank-v3, Jina reranker. Most score on a 0–1 scale or raw logits.
Moving embeddings from notebook to production introduces practical challenges:
Embedding index updates: Add-only is easy (append new vectors to the index). Deletion requires a soft-delete flag + periodic reindex. Many vector databases support incremental updates, but graph-based ANN indices (HNSW) are expensive to rebuild.
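A minimal sketch of the soft-delete pattern, assuming a brute-force numpy "index" in place of a real vector database (the `delete`/`search` helpers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
vectors = rng.normal(size=(1000, 64))              # the "index"
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
deleted = np.zeros(1000, dtype=bool)               # soft-delete flags alongside it

def delete(doc_id):
    deleted[doc_id] = True                         # no index rebuild needed

def search(query, k=5):
    scores = vectors @ query
    scores[deleted] = -np.inf                      # filter soft-deleted docs at query time
    return np.argsort(-scores)[:k]

query = vectors[7]
before = search(query)
delete(7)
after = search(query)
print(before[0], 7 in after)   # 7 False
```

A periodic reindex then compacts the flagged entries out of the real ANN structure.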
Chunking affects quality: Very long chunks (2000+ tokens) dilute the signal — the vector blurs across many concepts. Very short chunks (50 tokens) lose context. The sweet spot is 256–512 tokens per chunk with 50-token overlap. This is a strong baseline; test on your data.
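A rough sketch of chunking with overlap, using whitespace tokens as a stand-in for model tokens (`chunk` is an illustrative helper, not a library function):

```python
def chunk(tokens, size=512, overlap=50):
    """Split a token list into overlapping chunks of at most `size` tokens,
    with `overlap` tokens shared between consecutive chunks."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

text = "word " * 1200                      # a 1200-"token" document
chunks = chunk(text.split())
print(len(chunks), [len(c) for c in chunks])   # 3 [512, 512, 276]
```

Each chunk is then embedded separately; the 50-token overlap keeps sentences that straddle a boundary retrievable from both sides.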
Multilingual: bge-m3, multilingual-e5-large, Cohere embed-multilingual-v3. These handle 100+ languages, but retrieval quality varies. Test on your target languages using MTEB multilingual tasks before committing.
Modality mismatch: Don't use the same embedding model for code, multilingual text, and general English without testing. Code embeddings, legal documents, and user queries have very different optimal models. Cross-lingual or code-to-natural-language retrieval is especially risky.
Embeddings are foundational — they underpin RAG, semantic search, recommendations, and clustering. Start here before building any retrieval system.
Call openai.embeddings.create() or a SentenceTransformer model's .encode() on "The cat sat on the mat." Print the vector. It's just 1536 (or 384) floats. Demystify the abstraction.
Embed "cat on mat" and "feline on rug". Compute np.dot(a, b) / (norm(a)*norm(b)). See that it's ~0.93. Now embed "stock market". See ~0.45. Intuition locked in.
Take any dataset (Wikipedia articles, your own docs), embed them, store in ChromaDB, then query with a question. This is the seed of every RAG system.
Run MTEB benchmarks for retrieval (pip install mteb). BGE-M3 and E5-Mistral top the charts; text-embedding-3-large is the best hosted option. Pick based on your language and domain.