OpenAI's hosted embedding models (text-embedding-3-small and text-embedding-3-large) provide high-quality semantic representations via a simple API call.
Running your own embedding model means managing GPU infrastructure, model weights, batching logic, and version upgrades. Hosted embeddings let you offload all of that to an API call — you send text, you get a vector back, you pay per token.
The tradeoff: you take on network latency and API pricing. For high-volume, latency-sensitive applications, local models (like Sentence Transformers or BGE) often win on cost. For most applications, the simplicity of an API is worth it.
| Model | Max dims | Cost | Quality (MTEB) | Best for |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens | 62.3 | Cost-conscious production RAG |
| text-embedding-3-large | 3072 | $0.13/1M tokens | 64.6 | Highest accuracy requirements |
| text-embedding-ada-002 (legacy) | 1536 | $0.10/1M tokens | 61.0 | Avoid for new projects |
text-embedding-3-small is 5× cheaper than ada-002 and scores 1.3 points higher on MTEB. Use it by default.
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str, model: str = "text-embedding-3-small") -> list[float]:
    text = text.replace("\n", " ")  # newlines can degrade quality
    return client.embeddings.create(input=[text], model=model).data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q = embed("How do I cancel my subscription?")
d1 = embed("To cancel, go to Account Settings > Subscription > Cancel Plan.")
d2 = embed("Our headquarters is located in San Francisco.")

print(cosine_similarity(q, d1))  # high ~0.85
print(cosine_similarity(q, d2))  # low ~0.15
```
text-embedding-3 models support Matryoshka Representation Learning — you can request fewer dimensions and still get high-quality embeddings. Lower dimensions = smaller storage and faster ANN search:
```python
response = client.embeddings.create(
    input=["OpenAI embeddings support dimension reduction"],
    model="text-embedding-3-large",
    dimensions=256,  # truncate from 3072 to 256
)

embedding_256 = response.data[0].embedding
print(len(embedding_256))  # 256
# Quality holds up well: on MTEB, 3-large at 256 dims still outperforms
# full-size ada-002
```
Use 256-dim for cost-sensitive, high-scale applications (>100M docs). Use full 1536/3072 when retrieval precision is paramount.
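If you already stored full-size vectors, you can derive smaller ones later without re-calling the API: slice the first k dimensions and re-normalize to unit length, which is what the dimensions parameter does server-side. A minimal sketch, where the synthetic vector stands in for a real API embedding:

```python
import numpy as np

def truncate_embedding(vec: list[float], dims: int) -> np.ndarray:
    """Slice a Matryoshka embedding to `dims` and re-normalize to unit length."""
    v = np.asarray(vec, dtype=np.float32)[:dims]
    return v / np.linalg.norm(v)

# Synthetic 3072-dim unit vector standing in for a real API embedding
rng = np.random.default_rng(0)
full = rng.normal(size=3072)
full /= np.linalg.norm(full)

small = truncate_embedding(list(full), 256)
print(small.shape)  # (256,)
print(round(float(np.linalg.norm(small)), 6))  # 1.0
```

This lets one stored corpus serve multiple dimension budgets (e.g. a cheap 256-dim first-pass index and a full-dim re-rank) without re-embedding.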
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> np.ndarray:
    """Embed a list of texts, respecting the 2,048-item-per-request limit."""
    texts = [t.replace("\n", " ") for t in texts]
    all_embeddings = []
    BATCH_SIZE = 512  # safe batch size, well under the 2,048 cap
    for i in range(0, len(texts), BATCH_SIZE):
        batch = texts[i:i + BATCH_SIZE]
        response = client.embeddings.create(input=batch, model=model)
        # Sort by index to preserve input order
        sorted_data = sorted(response.data, key=lambda x: x.index)
        all_embeddings.extend([d.embedding for d in sorted_data])
    return np.array(all_embeddings)

# Embed 10,000 docs
docs = ["document text " + str(i) for i in range(10_000)]
embeddings = embed_batch(docs)
print(embeddings.shape)  # (10000, 1536)
```
Count tokens before embedding. Use tiktoken to estimate cost before a large batch:
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # used by all OpenAI embedding models
total_tokens = sum(len(enc.encode(t)) for t in texts)
cost_usd = total_tokens / 1_000_000 * 0.02  # $0.02/1M for text-embedding-3-small
print(f"Estimated cost: ${cost_usd:.4f} for {total_tokens:,} tokens")
```
Cache aggressively. Store embeddings in your vector DB or a file — recomputing is pure waste:
```python
import hashlib, json, os

def embed_cached(text: str, model: str = "text-embedding-3-small",
                 cache_dir: str = ".embed_cache") -> list[float]:
    os.makedirs(cache_dir, exist_ok=True)
    # Key on model + text so switching models doesn't serve stale vectors
    key = hashlib.md5(f"{model}:{text}".encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    emb = embed(text, model)
    with open(path, "w") as f:
        json.dump(emb, f)
    return emb
```
Newlines degrade quality. Always replace \n with a space before embedding. The models were trained on contiguous text, and extra whitespace can shift the embedding.
Token limit is 8,191 tokens per input. Longer inputs are rejected with an error rather than embedded, so chunk before embedding; don't pass full documents.
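A minimal chunker sketch: it splits text into overlapping token windows. The tokenizer pair is injected so the example is self-contained; in practice you would pass tiktoken's enc.encode and enc.decode (the whitespace tokenizer below is only a stand-in):

```python
def chunk_by_tokens(text: str, max_tokens: int = 512, overlap: int = 64,
                    encode=None, decode=None) -> list[str]:
    """Split text into overlapping windows of at most `max_tokens` tokens.

    Pass tiktoken's enc.encode / enc.decode as `encode` / `decode` in real use;
    the defaults here are a whitespace tokenizer so the sketch runs standalone.
    """
    if encode is None:
        encode, decode = str.split, " ".join
    tokens = encode(text)
    step = max_tokens - overlap
    return [decode(tokens[i:i + max_tokens]) for i in range(0, len(tokens), step)]

# Demo: a 1,200-"token" document becomes three overlapping chunks
doc = ("word " * 1200).strip()
chunks = chunk_by_tokens(doc, max_tokens=512, overlap=64)
print(len(chunks))             # 3
print(len(chunks[0].split()))  # 512
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.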
API rate limits. The default tier has 3,000 RPM and 1,000,000 TPM. For bulk ingestion, use exponential backoff or the async client:
```python
from openai import AsyncOpenAI
import asyncio

async_client = AsyncOpenAI()

async def embed_async(texts: list[str]) -> list[list[float]]:
    # One request per text for simplicity; for real workloads, batch inputs
    # and cap concurrency (e.g. asyncio.Semaphore) to stay under rate limits
    tasks = [
        async_client.embeddings.create(input=[t], model="text-embedding-3-small")
        for t in texts
    ]
    responses = await asyncio.gather(*tasks)
    return [r.data[0].embedding for r in responses]
```
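If you prefer not to pull in a retry library, a hand-rolled exponential-backoff wrapper covers the same ground. A sketch, where the flaky function and the generic except are stand-ins (in production, catch openai.RateLimitError specifically):

```python
import random
import time

def with_backoff(fn, max_retries: int = 6, base_delay: float = 1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Demo: a stand-in that fails twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429: rate limit exceeded")
    return "ok"

result = with_backoff(flaky, base_delay=0.01)
print(result, "after", calls["n"], "calls")  # ok after 3 calls
```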
Model versions can change. Pin to the full model name (text-embedding-3-small) — OpenAI may update the default alias. Re-embed your corpus whenever you change the model.
Choosing between text-embedding-3-small and text-embedding-3-large means balancing retrieval quality against cost and latency. The large model outperforms small by about 2.3 points on MTEB (64.6 vs 62.3) but costs 6.5× more per token. At high volume the difference compounds: 100M tokens per month costs $2 with small versus $13 with large. Before committing, evaluate both models on a domain-representative retrieval benchmark to see whether the quality gap is meaningful for your use case.
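The arithmetic behind that comparison is worth keeping as a helper. A sketch using the prices from the table above (monthly_cost is a hypothetical helper, not part of the SDK):

```python
# Prices per 1M tokens, from the comparison table above
PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def monthly_cost(tokens_per_month: int, model: str) -> float:
    """Dollar cost of embedding `tokens_per_month` tokens with `model`."""
    return tokens_per_month / 1_000_000 * PRICE_PER_1M[model]

for model in PRICE_PER_1M:
    print(f"{model}: ${monthly_cost(100_000_000, model):.2f}/month")
# text-embedding-3-small: $2.00/month
# text-embedding-3-large: $13.00/month
```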
OpenAI embedding API calls should be batched to minimize latency and API overhead. Sending one request per text in a loop is 10–50× slower than packing up to 2,048 texts into a single call, which processes them all in one server round-trip. A retry library like tenacity with exponential backoff handles the rate-limit errors that appear when many batches are sent in parallel. For jobs requiring millions of embeddings, 5–10 concurrent batch requests typically saturate the available rate limit while degrading gracefully on errors.
OpenAI's text-embedding-3 models return unit-normalized vectors (L2 norm = 1), so cosine similarity equals the dot product. This simplifies computation: FAISS indexes can use fast inner-product search, and in Postgres pgvector the inner-product operator (`<#>`) produces the same ranking as cosine distance (`<=>`). Normalization also makes the standard metrics interchangeable: on unit vectors, cosine similarity, dot product, and Euclidean distance are monotone functions of one another, so swapping between them (useful for debugging or combining with other signals) preserves the ranking. One practical caveat: similarity scores for unrelated texts do not spread across the full [-1, 1] range but tend to cluster in a narrow band above zero, so calibrate any similarity threshold empirically rather than assuming a fixed cutoff. Understanding this property also prevents common mistakes like normalizing twice, or comparing these normalized vectors against non-normalized vectors from another model without re-normalizing.
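These identities are easy to verify without an API call; the random unit vectors below stand in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-ins for API output: random vectors normalized to unit length,
# mimicking the L2-normalized embeddings the API returns
a = rng.normal(size=1536); a /= np.linalg.norm(a)
b = rng.normal(size=1536); b /= np.linalg.norm(b)

cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot = float(np.dot(a, b))
print(abs(cos - dot) < 1e-12)  # True: cosine == dot for unit vectors

# Squared Euclidean distance is a monotone function of the dot product:
# ||a - b||^2 = 2 - 2*(a . b), so all three metrics rank results identically
dist_sq = float(np.sum((a - b) ** 2))
print(abs(dist_sq - (2 - 2 * dot)) < 1e-9)  # True
```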
OpenAI's embedding API accepts up to 2,048 texts per request, which cuts per-request overhead dramatically. For large-scale jobs (corpus indexing, historical re-embedding), batching is essential: embedding 1 million documents one at a time means 1 million API calls and network round-trips, while batching reduces this to roughly 500 calls with identical quality and orders-of-magnitude faster wall-clock time. The strategy requires careful orchestration: preserving ordering across batches, handling partial failures, and respecting rate limits (3,000 requests/minute on the default tier). Many production systems implement a batching queue that buffers incoming texts, flushes when a threshold is met (batch size >= 100 or age >= 10 seconds), and retries failed batches with exponential backoff. For teams embedding datasets continuously (document monitoring, user-generated content indexing), batch economics like throughput per dollar and latency SLA trade-offs become central to total cost of ownership.
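The buffering queue described above can be sketched in a few lines. EmbedBuffer is hypothetical, and embed_fn is injected so the demo runs without API calls (in production it would wrap client.embeddings.create):

```python
import time

class EmbedBuffer:
    """Buffers texts and flushes a batch when a size or age threshold is hit."""

    def __init__(self, embed_fn, max_batch: int = 100, max_age_s: float = 10.0):
        self.embed_fn = embed_fn          # batch -> list of embeddings
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.buffer: list[str] = []
        self.oldest = 0.0                 # arrival time of oldest buffered text
        self.results: list[list[float]] = []

    def add(self, text: str) -> None:
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(text)
        # Flush on size threshold or when the oldest item has waited too long
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.oldest >= self.max_age_s):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.results.extend(self.embed_fn(self.buffer))
            self.buffer = []

# Demo with a fake embedder returning dummy vectors
fake = lambda batch: [[0.0] for _ in batch]
buf = EmbedBuffer(fake, max_batch=3)
for t in ["a", "b", "c", "d"]:
    buf.add(t)
print(len(buf.results))  # 3 (one full batch flushed; "d" still buffered)
buf.flush()
print(len(buf.results))  # 4
```

A real version would also retry failed flushes with backoff, per the error-handling notes above.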
text-embedding-3-large produces 3,072-dimensional vectors; text-embedding-3-small produces 1,536. At float32 precision that is 12 KB per document for large and 6 KB for small, which becomes a prohibitive storage cost at billion-document scale. Quantization (float32 to int8 cuts size 4×) and dimensionality reduction (the API's native dimensions parameter, or PCA/random projections applied post hoc) preserve most retrieval quality while shrinking storage and memory. Per OpenAI's published MTEB numbers, truncating text-embedding-3-large to 256 dimensions via the dimensions parameter still outperforms full-size ada-002 while reducing storage 12×. Evaluate the trade-off empirically: index the corpus with reduced embeddings, run offline evaluations on sample queries, and measure NDCG/Recall@k degradation. For applications with strict latency SLAs (search must return in <50 ms), smaller vectors also speed up approximate nearest-neighbor search via algorithms like HNSW, often recovering more latency than the recall loss costs.
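A minimal int8 scalar-quantization sketch. quantize_int8 is a hypothetical helper using one global scale for brevity; real systems often use per-vector or per-dimension scales:

```python
import numpy as np

def quantize_int8(vecs: np.ndarray) -> tuple[np.ndarray, float]:
    """Scalar-quantize float32 embeddings to int8 (4x smaller)."""
    scale = float(np.abs(vecs).max()) / 127.0
    return np.round(vecs / scale).astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Synthetic unit vectors standing in for real embeddings
rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 1536)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

q, scale = quantize_int8(vecs)
print(q.nbytes, "vs", vecs.nbytes)  # 153600 vs 614400, a 4x reduction

# Cosine similarity survives the round-trip almost unchanged
approx = dequantize(q, scale)
approx /= np.linalg.norm(approx, axis=1, keepdims=True)
sims = np.sum(vecs * approx, axis=1)  # per-vector cosine vs the original
print(float(sims.min()) > 0.99)
```

Measure the same round-trip error on your real embeddings before committing; distributions differ by model and domain.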