Cohere's Embed API provides multilingual embeddings with input_type separation for queries vs documents, optimised for production RAG.
The single most important Cohere API feature for RAG: the input_type parameter explicitly tells the model whether you're embedding a search query or a document to be retrieved. Under the hood, this uses different learned prefixes — similar to BGE's instruction trick, but API-enforced so you can't accidentally forget it.
```python
import cohere

co = cohere.Client("your-api-key")

# Query embedding
query_emb = co.embed(
    texts=["How do transformers handle long sequences?"],
    model="embed-english-v3.0",
    input_type="search_query"
).embeddings[0]

# Document embedding
doc_emb = co.embed(
    texts=["Transformers use position encodings and attention windows for long-context tasks."],
    model="embed-english-v3.0",
    input_type="search_document"
).embeddings[0]
```
Always use search_query for user inputs and search_document for corpus passages. Mixing them up silently degrades retrieval accuracy.
| Model | Dims | Languages | Best for |
|---|---|---|---|
| embed-english-v3.0 | 1024 | English | English RAG, highest English quality |
| embed-multilingual-v3.0 | 1024 | 100+ | Multilingual semantic search |
| embed-english-light-v3.0 | 384 | English | Lower latency, lower cost |
| embed-multilingual-light-v3.0 | 384 | 100+ | Multilingual, cost-conscious |
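The table above reduces to a simple decision on two axes, language coverage and latency/cost. A minimal helper encoding that choice (the function name is illustrative, not part of the Cohere SDK):

```python
def choose_embed_model(multilingual: bool, low_latency: bool) -> str:
    """Pick a Cohere v3 embedding model from language and latency requirements."""
    if multilingual:
        return "embed-multilingual-light-v3.0" if low_latency else "embed-multilingual-v3.0"
    return "embed-english-light-v3.0" if low_latency else "embed-english-v3.0"

print(choose_embed_model(multilingual=False, low_latency=False))  # embed-english-v3.0
```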
```python
import cohere
import numpy as np

co = cohere.Client("your-api-key")

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Embed documents at index time
docs = [
    "Annual leave must be approved two weeks in advance.",
    "Remote work is permitted up to three days per week.",
    "Expense claims must be submitted within 60 days.",
]
doc_response = co.embed(
    texts=docs,
    model="embed-english-v3.0",
    input_type="search_document"
)
doc_embeddings = doc_response.embeddings

# Embed query at search time
query = "Can I work from home?"
q_response = co.embed(
    texts=[query],
    model="embed-english-v3.0",
    input_type="search_query"
)
q_emb = q_response.embeddings[0]

# Rank documents by similarity to the query
scores = [(cosine_similarity(q_emb, emb), docs[i]) for i, emb in enumerate(doc_embeddings)]
scores.sort(reverse=True)
print(scores[0][1])  # "Remote work is permitted up to three days per week."
```
```python
import cohere
import numpy as np

co = cohere.Client("your-api-key")

# English documents + French query — the model handles it
docs = [
    "The Eiffel Tower is located in Paris, France.",
    "Mount Fuji is the highest mountain in Japan.",
    "The Amazon River flows through Brazil.",
]
doc_embs = co.embed(
    texts=docs,
    model="embed-multilingual-v3.0",
    input_type="search_document"
).embeddings

# Query in French
query_fr = "Quelle est la hauteur de la tour Eiffel?"
q_emb = co.embed(
    texts=[query_fr],
    model="embed-multilingual-v3.0",
    input_type="search_query"
).embeddings[0]

scores = [np.dot(q_emb, d) / (np.linalg.norm(q_emb) * np.linalg.norm(d)) for d in doc_embs]
best = docs[int(np.argmax(scores))]
print(best)  # "The Eiffel Tower is located in Paris, France."
```
Cross-lingual retrieval works because the multilingual model maps semantically equivalent content from different languages to nearby regions in embedding space.
```python
import cohere
import time

co = cohere.Client("your-api-key")

def embed_documents_batched(
    texts: list[str],
    model: str = "embed-english-v3.0",
    batch_size: int = 96,  # Cohere max is 96 per request
) -> list[list[float]]:
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = co.embed(
            texts=batch,
            model=model,
            input_type="search_document"
        )
        embeddings.extend(response.embeddings)
        # Rate-limit buffer between requests
        if i + batch_size < len(texts):
            time.sleep(0.1)
    return embeddings
```
Embed v3 supports int8 and binary quantisation to reduce storage by 4× and 32× respectively with minimal quality loss — a game-changer for billion-scale corpora:
```python
response = co.embed(
    texts=docs,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["float", "int8", "binary"]  # get all three at once
)
float_embeddings = response.embeddings.float    # 1024 float32 = 4096 bytes
int8_embeddings = response.embeddings.int8      # 1024 int8 = 1024 bytes (4× smaller)
binary_embeddings = response.embeddings.binary  # 1024 bits = 128 bytes (32× smaller)
```
Use float embeddings for small corpora and maximum accuracy. Switch to int8 for large corpora with modest quality tradeoff. Binary is mainly for filtering/pre-ranking before float rescoring.
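The binary-then-float pattern can be sketched client-side with NumPy. This is an illustration only: real binary embeddings come pre-packed from the API, so here random float vectors stand in for API embeddings and are binarised by sign. The corpus, query, and shortlist size are all invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in corpus: in practice these would be float embeddings from the API
doc_vecs = rng.standard_normal((1000, 1024)).astype(np.float32)
query_vec = doc_vecs[42] + 0.1 * rng.standard_normal(1024).astype(np.float32)

# Stage 1: cheap pre-ranking via Hamming distance on packed sign bits
doc_bits = np.packbits(doc_vecs > 0, axis=1)   # (1000, 128) — 128 bytes per doc
query_bits = np.packbits(query_vec > 0)        # (128,)
hamming = np.unpackbits(doc_bits ^ query_bits, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:50]          # keep the 50 closest by Hamming

# Stage 2: exact float rescoring on the shortlist only
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rescored = sorted(candidates, key=lambda i: cosine(doc_vecs[i], query_vec), reverse=True)
print(rescored[0])  # 42 — the perturbed source document wins the rescore
```

The win is that stage 1 touches only 128 bytes per document, so the expensive float cosine runs on 50 candidates instead of 1,000.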
Always set input_type correctly. The API requires the parameter for v3 models, but nothing stops you from passing search_document for a user query — and doing so silently degrades retrieval accuracy. Make this a code review checklist item.
96-item batch limit. Cohere's API accepts at most 96 texts per request. Larger batches need to be chunked in your code (as shown above).
Token limit: 512 tokens. Text beyond 512 tokens is truncated. Chunk long documents before embedding — a sentence or paragraph is typically 50–100 tokens, well under the limit.
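A minimal chunking sketch, using word count as a rough proxy for tokens (the 350-word ceiling and the helper name are assumptions, chosen to stay comfortably under 512 tokens for typical English text):

```python
def chunk_text(text: str, max_words: int = 350) -> list[str]:
    """Greedy word-count chunking; ~350 words stays safely under 512 tokens."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

long_doc = ("word " * 1000).strip()  # a 1000-word stand-in document
chunks = chunk_text(long_doc)
print(len(chunks))  # 3 chunks: 350 + 350 + 300 words
```

A production pipeline would chunk on sentence or paragraph boundaries rather than raw word counts, but the safety margin below 512 tokens is the point.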
Different model versions aren't cross-compatible. Embeddings from embed-english-v2.0 can't be compared to those from embed-english-v3.0. If you upgrade models, you must re-embed your entire corpus.
Cohere's embedding API enforces input_type as a required parameter to ensure optimal embedding quality, requiring callers to specify whether they are embedding search queries ("search_query") or documents ("search_document"). This design forces correct usage — embeddings for the same text differ depending on input_type, as the model applies different transformations to optimize query-document asymmetry for retrieval. Applications must maintain separate embedding pipelines for query-time and index-time operations and cannot mix embeddings produced with different input_type values in the same similarity search.
Cohere Embed v3's binary quantization feature reduces embedding storage from 1,024 floats (4KB per vector) to 1,024 bits (128 bytes per vector), a 32x compression ratio, with retrieval quality degradation of only 2–5 percentage points on most benchmarks. The binary quantization is applied server-side by specifying embedding_types=["binary"] in the API request, returning integer arrays instead of float arrays. This storage efficiency makes Cohere Embed v3 with binary quantization the most cost-effective option for large-scale document retrieval systems where embedding storage costs are a significant operational expense.
Cohere's multilingual embedding capability supports 100+ languages with a unified embedding space, meaning documents in different languages with similar semantic content map to similar regions of the embedding space. This cross-lingual alignment enables multilingual retrieval without language detection or separate per-language indexes — a single embedding index can retrieve relevant documents regardless of the language of either the query or the document. Cross-lingual retrieval quality is highest for closely related language pairs (Spanish-Portuguese, German-Dutch) and lower for distant language pairs (English-Chinese), reflecting the distribution of multilingual training data.
Rate limit management is the primary operational challenge for high-throughput Cohere embedding deployments. Cohere's API enforces per-minute token limits that require exponential backoff, request batching, and potentially parallel API keys for applications exceeding the base rate limit. The maximize_batch_size pattern — packing as many texts as possible into each API call up to the 96-text batch limit — minimizes the number of API calls required and reduces the probability of hitting rate limits for a given throughput target. Applications with strict SLA requirements should maintain a queue-based batching architecture that absorbs traffic spikes without exceeding rate limits.
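The exponential-backoff half of that advice can be sketched as a generic retry wrapper. The wrapper and the flaky stub below are illustrative, not part of the Cohere SDK; in practice you would wrap the co.embed call and catch the SDK's rate-limit exception specifically rather than bare Exception.

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry fn() with exponential backoff plus jitter on any exception."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Demonstration with a stub that fails twice before succeeding
calls = {"n": 0}
def flaky_embed():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return [[0.1, 0.2]]

result = with_backoff(flaky_embed, base_delay=0.01)
print(calls["n"])  # 3 — succeeded on the third attempt
```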
Cohere's embed_job API enables asynchronous bulk embedding of large document corpora without blocking on individual API calls. Submitting an embed job returns a job ID that can be polled for completion, with results returned as a dataset file that can be downloaded when the job finishes. This async pattern eliminates the connection management overhead of synchronous bulk embedding for corpora of millions of documents, enabling cost-effective large-scale indexing jobs without maintaining long-running client processes or complex retry logic for the duration of the embedding run.
Cohere's embeddings-as-a-service model eliminates the operational overhead of hosting and maintaining embedding infrastructure but introduces API dependency and per-token costs that scale with usage volume. For applications embedding less than 1M tokens per month, the API cost is typically lower than the infrastructure cost of self-hosting an equivalent open-source model. Above approximately 50M tokens per month, self-hosting BGE or E5 models on GPU instances becomes cheaper than Cohere API fees, making cost modeling based on actual or projected monthly token volumes the first step in the build-vs-buy decision for embedding infrastructure.
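The cost-modeling step is simple arithmetic once you have real numbers. The figures below are made up purely to illustrate the crossover shape (they are chosen to land near the ~50M-token threshold mentioned above, not taken from any price list): substitute current API pricing and actual GPU plus operations quotes.

```python
def monthly_api_cost(tokens: float, price_per_1m: float) -> float:
    """API spend scales linearly with token volume."""
    return tokens / 1_000_000 * price_per_1m

def monthly_selfhost_cost(gpu_cost: float, ops_cost: float) -> float:
    """Self-hosting is roughly a fixed cost, independent of volume."""
    return gpu_cost + ops_cost

# Hypothetical inputs — replace with real quotes before deciding
tokens = 50_000_000
api = monthly_api_cost(tokens, price_per_1m=12.0)                  # $600
selfhost = monthly_selfhost_cost(gpu_cost=400.0, ops_cost=200.0)   # $600
print(api >= selfhost)  # True: at this volume the lines cross
```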
Cohere's v3 compression modes offer integer8 and binary quantization alongside the full float32 embeddings. Integer8 quantization reduces storage by 4x while maintaining near-full-precision quality for most retrieval tasks. Binary quantization reduces storage by 32x with approximately 3–5% quality reduction on English retrieval, making it viable for applications where storage cost is paramount. Testing both quantization levels on a representative query-document evaluation set before committing to a quantization configuration ensures that the quality tradeoff is acceptable for the specific application requirements.
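One way to run that test before touching the API is to simulate int8 quantisation client-side on float vectors and compare rankings. Note this is a stand-in: the API quantises server-side via embedding_types, and its exact scaling scheme may differ from the per-vector max-abs scaling assumed here; the random vectors merely substitute for a real evaluation set.

```python
import numpy as np

rng = np.random.default_rng(1)
docs = rng.standard_normal((500, 1024)).astype(np.float32)   # stand-in corpus
query = docs[7] + 0.2 * rng.standard_normal(1024).astype(np.float32)

def to_int8(v):
    """Per-vector max-abs scaling to [-127, 127] — an assumed scheme."""
    return np.clip(np.round(v / np.abs(v).max() * 127), -127, 127).astype(np.int8)

def rank(q, mat):
    """Indices of documents sorted by descending cosine similarity."""
    sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)

float_rank = rank(query, docs)
int8_rank = rank(to_int8(query).astype(np.float32),
                 np.stack([to_int8(d) for d in docs]).astype(np.float32))

# Agreement of the top-10 between full-precision and quantised rankings
overlap = len(set(float_rank[:10].tolist()) & set(int8_rank[:10].tolist()))
print(overlap)
```

On a real evaluation set you would compute recall@k against labeled relevant documents instead of overlap against the float ranking.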
Cohere's rerank-english-v3.0 model provides cross-encoder reranking as a managed API service, accepting a query and a list of documents and returning relevance scores without requiring model hosting. The reranker API supports up to 100 documents per request and returns scores within 200–500ms, making it viable for latency-tolerant production pipelines. Combining Cohere's embedding API for retrieval with the reranker API for reranking provides a fully managed two-stage retrieval pipeline without any self-hosted model infrastructure, at the cost of two API calls per query and the associated latency of each call.
Cohere Embed's integration with AWS Bedrock and Azure AI enables enterprise deployments that process all data within regional cloud boundaries without sending text to Cohere's own infrastructure. This data residency capability is important for regulated industries (healthcare, finance, legal) where data must remain within specific geographic or organizational boundaries. The AWS Bedrock integration exposes Cohere Embed via the standard Bedrock API, enabling cost allocation via AWS budgets and integration with AWS IAM for fine-grained access control — significant operational advantages for organizations already standardized on AWS infrastructure.