OpenAI's hosted embedding models (text-embedding-3-small and text-embedding-3-large) provide high-quality semantic representations via a simple API call.
Running your own embedding model means managing GPU infrastructure, model weights, batching logic, and version upgrades. Hosted embeddings let you offload all of that to an API call — you send text, you get a vector back, you pay per token.
The tradeoff: you take on network latency and API pricing. For high-volume, latency-sensitive applications, local models (like Sentence Transformers or BGE) often win on cost. For most applications, the simplicity of an API is worth it.
| Model | Max dims | Cost | Quality (MTEB) | Best for |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens | 62.3 | Cost-conscious production RAG |
| text-embedding-3-large | 3072 | $0.13/1M tokens | 64.6 | Highest accuracy requirements |
| text-embedding-ada-002 (legacy) | 1536 | $0.10/1M tokens | 61.0 | Avoid for new projects |
text-embedding-3-small is 5× cheaper than ada-002 and scores 1.3 points higher on MTEB. Use it by default.
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str, model: str = "text-embedding-3-small") -> list[float]:
    text = text.replace("\n", " ")  # newlines can degrade quality
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q = embed("How do I cancel my subscription?")
d1 = embed("To cancel, go to Account Settings > Subscription > Cancel Plan.")
d2 = embed("Our headquarters is located in San Francisco.")

print(cosine_similarity(q, d1))  # high ~0.85
print(cosine_similarity(q, d2))  # low ~0.15
```
text-embedding-3 models support Matryoshka Representation Learning — you can request fewer dimensions and still get high-quality embeddings. Lower dimensions = smaller storage and faster ANN search:
```python
response = client.embeddings.create(
    input=["OpenAI embeddings support dimension reduction"],
    model="text-embedding-3-large",
    dimensions=256,  # truncate from 3072 to 256
)

embedding_256 = response.data[0].embedding
print(len(embedding_256))  # 256
# Quality holds up well: on MTEB, 3-large at 256 dims still outperforms
# full-size ada-002
```
Use 256-dim for cost-sensitive, high-scale applications (>100M docs). Use full 1536/3072 when retrieval precision is paramount.
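If you already stored full-size vectors, you can derive smaller ones later without re-calling the API: slice the first k dimensions and re-normalize to unit length, which is what the dimensions parameter does server-side. A minimal sketch, where the synthetic vector stands in for a real API embedding:

```python
import numpy as np

def truncate_embedding(vec: list[float], dims: int) -> np.ndarray:
    """Slice a Matryoshka embedding to `dims` and re-normalize to unit length."""
    v = np.asarray(vec, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)

# Synthetic 3072-dim unit vector standing in for a real API embedding
rng = np.random.default_rng(0)
full = rng.normal(size=3072)
full /= np.linalg.norm(full)

small = truncate_embedding(list(full), 256)
print(small.shape)  # (256,)
print(round(float(np.linalg.norm(small)), 6))  # 1.0
```

This lets one stored corpus serve multiple dimension budgets (e.g. a cheap 256-dim first-pass index and a full-dim re-rank) without re-embedding.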
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed a list of texts, respecting the 2,048-item-per-request limit."""
    texts = [t.replace("\n", " ") for t in texts]
    all_embeddings = []
    BATCH_SIZE = 512  # safe batch size, well under the 2,048 cap
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i:i + BATCH_SIZE]
        response = client.embeddings.create(input=batch, model=model)
        # Sort by index to preserve input order
        sorted_data = sorted(response.data, key=lambda x: x.index)
        all_embeddings.extend([d.embedding for d in sorted_data])
    return np.array(all_embeddings)

# Embed 10,000 docs
docs = ["document text " + str(i) for i in range(10_000)]
embeddings = embed_batch(docs)
print(embeddings.shape)  # (10000, 1536)
```
Count tokens before embedding. Use tiktoken to estimate cost before a large batch:
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # used by all OpenAI embedding models
total_tokens = sum(len(enc.encode(t)) for t in texts)
cost_usd = total_tokens / 1_000_000 * 0.02  # $0.02/1M for text-embedding-3-small
print(f"Estimated cost: ${cost_usd:.4f} for {total_tokens:,} tokens")
```
Cache aggressively. Store embeddings in your vector DB or a file — recomputing is pure waste:
```python
import hashlib, json, os

def embed_cached(text: str, model: str = "text-embedding-3-small",
                 cache_dir: str = ".embed_cache") -> list[float]:
    os.makedirs(cache_dir, exist_ok=True)
    # Key on model + text so switching models doesn't serve stale vectors
    key = hashlib.md5(f"{model}:{text}".encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    emb = embed(text, model)
    with open(path, "w") as f:
        json.dump(emb, f)
    return emb
```
Newlines degrade quality. Always replace \n with a space before embedding. The models were trained on contiguous text, and extra whitespace can shift the embedding.
Token limit is 8,191 tokens per input. Longer inputs are rejected with an error rather than embedded, so chunk before embedding; don't pass full documents.
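A minimal chunker sketch: it splits text into overlapping token windows. The tokenizer pair is injected so the example is self-contained; in practice you would pass tiktoken's enc.encode and enc.decode (the whitespace tokenizer below is only a stand-in):

```python
def chunk_by_tokens(text: str, max_tokens: int = 512, overlap: int = 64,
                    encode=None, decode=None) -> list[str]:
    """Split text into overlapping windows of at most `max_tokens` tokens.

    Pass tiktoken's enc.encode / enc.decode as `encode` / `decode` in real use;
    the defaults here are a whitespace tokenizer so the sketch runs standalone.
    """
    if encode is None:
        encode, decode = str.split, " ".join
    tokens = encode(text)
    step = max_tokens - overlap
    return [decode(tokens[i:i + max_tokens]) for i in range(0, len(tokens), step)]

# Demo: a 1,200-"token" document becomes three overlapping chunks
doc = ("word " * 1200).strip()
chunks = chunk_by_tokens(doc, max_tokens=512, overlap=64)
print(len(chunks))             # 3
print(len(chunks[0].split()))  # 512
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.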
API rate limits. The default tier has 3,000 RPM and 1,000,000 TPM. For bulk ingestion, use exponential backoff or the async client:
```python
from openai import AsyncOpenAI
import asyncio

async_client = AsyncOpenAI()

async def embed_async(texts: list[str]) -> list[list[float]]:
    # One request per text for simplicity; for real workloads, batch inputs
    # and cap concurrency (e.g. asyncio.Semaphore) to stay under rate limits
    tasks = [
        async_client.embeddings.create(input=[t], model="text-embedding-3-small")
        for t in texts
    ]
    responses = await asyncio.gather(*tasks)
    return [r.data[0].embedding for r in responses]
```
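If you prefer not to pull in a retry library, a hand-rolled exponential-backoff wrapper covers the same ground. A sketch, where the flaky function and the generic except are stand-ins (in production, catch openai.RateLimitError specifically):

```python
import random
import time

def with_backoff(fn, max_retries: int = 6, base_delay: float = 1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Demo: a stand-in that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429: rate limit exceeded")
    return "ok"

result = with_backoff(flaky, base_delay=0.01)
print(result, "after", calls["n"], "calls")  # ok after 3 calls
```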
Model versions can change. Pin to the full model name (text-embedding-3-small) — OpenAI may update the default alias. Re-embed your corpus whenever you change the model.
Choosing between text-embedding-3-small and text-embedding-3-large means balancing retrieval quality against cost and latency. The large model outperforms small by about 2.3 points on MTEB (64.6 vs 62.3) but costs 6.5× more per token. At high volume the difference compounds: 100M tokens per month costs $2 with small versus $13 with large. Before committing, evaluate both models on a domain-representative retrieval benchmark to see whether the quality gap is meaningful for your use case.
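The arithmetic behind that comparison is worth keeping as a helper. A sketch using the prices from the table above (monthly_cost is a hypothetical helper, not part of the SDK):

```python
# Prices per 1M tokens, from the comparison table above
PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def monthly_cost(tokens_per_month: int, model: str) -> float:
    """Dollar cost of embedding `tokens_per_month` tokens with `model`."""
    return tokens_per_month / 1_000_000 * PRICE_PER_1M[model]

for model in PRICE_PER_1M:
    print(f"{model}: ${monthly_cost(100_000_000, model):.2f}/month")
# text-embedding-3-small: $2.00/month
# text-embedding-3-large: $13.00/month
```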
OpenAI embedding API calls should be batched to minimize latency and API overhead. Sending one request per text in a loop is 10–50× slower than packing up to 2,048 texts into a single call, which processes them all in one server round-trip. A retry library like tenacity with exponential backoff handles the rate-limit errors that appear when many batches are sent in parallel. For jobs requiring millions of embeddings, 5–10 concurrent batch requests typically saturate the available rate limit while degrading gracefully on errors.
OpenAI's text-embedding-3 models return unit-normalized vectors (L2 norm = 1), so cosine similarity equals the dot product. This simplifies computation: FAISS indexes can use fast inner-product search, and in Postgres pgvector the inner-product operator (`<#>`) produces the same ranking as cosine distance (`<=>`). Normalization also makes the standard metrics interchangeable: on unit vectors, cosine similarity, dot product, and Euclidean distance are monotone functions of one another, so swapping between them (useful for debugging or combining with other signals) preserves the ranking. One practical caveat: similarity scores for unrelated texts do not spread across the full [-1, 1] range but tend to cluster in a narrow band above zero, so calibrate any similarity threshold empirically rather than assuming a fixed cutoff. Understanding this property also prevents common mistakes like normalizing twice, or comparing these normalized vectors against non-normalized vectors from another model without re-normalizing.
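These identities are easy to verify without an API call; the random unit vectors below stand in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-ins for API output: random vectors normalized to unit length,
# mimicking the L2-normalized embeddings the API returns
a = rng.normal(size=1536); a /= np.linalg.norm(a)
b = rng.normal(size=1536); b /= np.linalg.norm(b)

cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot = float(np.dot(a, b))
print(abs(cos - dot) < 1e-12)  # True: cosine == dot for unit vectors

# Squared Euclidean distance is a monotone function of the dot product:
# ||a - b||^2 = 2 - 2*(a . b), so all three metrics rank results identically
dist_sq = float(np.sum((a - b) ** 2))
print(abs(dist_sq - (2 - 2 * dot)) < 1e-9)  # True
```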
OpenAI's embedding API accepts up to 2,048 texts per request, which cuts per-request overhead dramatically. For large-scale jobs (corpus indexing, historical re-embedding), batching is essential: embedding 1 million documents one at a time means 1 million API calls and network round-trips, while batching reduces this to roughly 500 calls with identical quality and orders-of-magnitude faster wall-clock time. The strategy requires careful orchestration: preserving ordering across batches, handling partial failures, and respecting rate limits (3,000 requests/minute on the default tier). Many production systems implement a batching queue that buffers incoming texts, flushes when a threshold is met (batch size >= 100 or age >= 10 seconds), and retries failed batches with exponential backoff. For teams embedding datasets continuously (document monitoring, user-generated content indexing), batch economics like throughput per dollar and latency SLA trade-offs become central to total cost of ownership.
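The buffering queue described above can be sketched in a few lines. EmbedBuffer is hypothetical, and embed_fn is injected so the demo runs without API calls (in production it would wrap client.embeddings.create):

```python
import time

class EmbedBuffer:
    """Buffers texts and flushes a batch when a size or age threshold is hit."""

    def __init__(self, embed_fn, max_batch: int = 100, max_age_s: float = 10.0):
        self.embed_fn = embed_fn          # batch -> list of embeddings
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.buffer: list[str] = []
        self.oldest = 0.0                 # arrival time of oldest buffered text
        self.results: list[list[float]] = []

    def add(self, text: str) -> None:
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(text)
        # Flush on size threshold or when the oldest item has waited too long
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.oldest >= self.max_age_s):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.results.extend(self.embed_fn(self.buffer))
            self.buffer = []

# Demo with a fake embedder returning dummy vectors
fake = lambda batch: [[0.0] for _ in batch]
buf = EmbedBuffer(fake, max_batch=3)
for t in ["a", "b", "c", "d"]:
    buf.add(t)
print(len(buf.results))  # 3 (one full batch flushed; "d" still buffered)
buf.flush()
print(len(buf.results))  # 4
```

A real version would also retry failed flushes with backoff, per the error-handling notes above.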
text-embedding-3-large produces 3,072-dimensional vectors; text-embedding-3-small produces 1,536. At float32 precision that is 12 KB per document for large and 6 KB for small, which becomes a prohibitive storage cost at billion-document scale. Quantization (float32 to int8 cuts size 4×) and dimensionality reduction (the API's native dimensions parameter, or PCA/random projections applied post hoc) preserve most retrieval quality while shrinking storage and memory. Per OpenAI's published MTEB numbers, truncating text-embedding-3-large to 256 dimensions via the dimensions parameter still outperforms full-size ada-002 while reducing storage 12×. Evaluate the trade-off empirically: index the corpus with reduced embeddings, run offline evaluations on sample queries, and measure NDCG/Recall@k degradation. For applications with strict latency SLAs (search must return in <50 ms), smaller vectors also speed up approximate nearest-neighbor search via algorithms like HNSW, often recovering more latency than the recall loss costs.
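A minimal int8 scalar-quantization sketch. quantize_int8 is a hypothetical helper using one global scale for brevity; real systems often use per-vector or per-dimension scales:

```python
import numpy as np

def quantize_int8(vecs: np.ndarray) -> tuple[np.ndarray, float]:
    """Scalar-quantize float32 embeddings to int8 (4x smaller)."""
    scale = float(np.abs(vecs).max()) / 127.0
    return np.round(vecs / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Synthetic unit vectors standing in for real embeddings
rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 1536)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

q, scale = quantize_int8(vecs)
print(q.nbytes, "vs", vecs.nbytes)  # 153600 vs 614400, a 4x reduction

# Cosine similarity survives the round-trip almost unchanged
approx = dequantize(q, scale)
approx /= np.linalg.norm(approx, axis=1, keepdims=True)
sims = np.sum(vecs * approx, axis=1)  # per-vector cosine vs the original
print(float(sims.min()) > 0.99)
```

Measure the same round-trip error on your real embeddings before committing; distributions differ by model and domain.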