An open-source, developer-friendly vector database designed for embedding storage and retrieval, with in-memory and persistent modes.
Chroma is built for developers who want to go from "zero to working RAG" in minutes, not hours. It prioritizes simplicity: one pip install, no external service, runs in-process. The tradeoff is that it's not designed for billion-scale production deployments — for that, reach for Qdrant, Pinecone, or Milvus.
When to use Chroma: prototyping, personal projects, small-to-medium production RAG (<1M docs), local development without a running service.
```bash
pip install chromadb
```
```python
import chromadb

# Ephemeral in-memory client (data lost on restart)
client = chromadb.Client()

# Create a collection (like a table in SQL)
collection = client.create_collection(name="my_docs")

# Add documents — Chroma handles embedding with its default model
collection.add(
    documents=[
        "Python was created by Guido van Rossum in 1991.",
        "JavaScript is the language of the web browser.",
        "Rust provides memory safety without garbage collection.",
    ],
    ids=["doc-1", "doc-2", "doc-3"],  # unique IDs, you choose
)

# Query
results = collection.query(
    query_texts=["Who invented Python?"],
    n_results=2,
)

print(results["documents"])  # [["Python was created by...", "JavaScript is..."]]
print(results["distances"])  # distances, not similarities — lower means closer
```
```python
import chromadb

# Persistent client — saves to disk automatically
client = chromadb.PersistentClient(path="./chroma_db")

# Collections persist across restarts
collection = client.get_or_create_collection("my_docs")
collection.add(
    documents=["Document content here."],
    ids=["doc-001"],
)

# Later, in a new process:
client2 = chromadb.PersistentClient(path="./chroma_db")
collection2 = client2.get_collection("my_docs")
print(collection2.count())  # still 1
```
For a standalone server (useful for multi-process or client-server setups):
```bash
chroma run --path ./chroma_db --port 8000
```

```python
client = chromadb.HttpClient(host="localhost", port=8000)
```
By default, Chroma uses all-MiniLM-L6-v2 via sentence-transformers. Override with any embedding function:
```python
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Use OpenAI embeddings
openai_ef = OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small",
)
collection = client.create_collection(
    name="openai_collection",
    embedding_function=openai_ef,
)

# Or bring your own via sentence-transformers
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

st_ef = SentenceTransformerEmbeddingFunction(model_name="BAAI/bge-large-en-v1.5")
collection = client.create_collection(name="bge_collection", embedding_function=st_ef)
```
```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

# Add docs with metadata
collection.add(
    documents=[
        "FAQ: returns are free within 30 days.",
        "Blog: our story began in 2020.",
        "FAQ: shipping takes 3-5 days.",
    ],
    metadatas=[
        {"type": "faq", "year": 2023},
        {"type": "blog", "year": 2020},
        {"type": "faq", "year": 2023},
    ],
    ids=["faq-1", "blog-1", "faq-2"],
)

# Filter: only FAQ documents
results = collection.query(
    query_texts=["How do I return a product?"],
    n_results=2,
    where={"type": {"$eq": "faq"}},  # metadata filter
)

# Full-text filter (where_document)
results = collection.query(
    query_texts=["return policy"],
    n_results=2,
    where_document={"$contains": "30 days"},
)
```
```python
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create a vector store from documents
docs = [
    Document(page_content="Refunds within 30 days.", metadata={"source": "faq"}),
    Document(page_content="Free shipping over $50.", metadata={"source": "faq"}),
]
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# Use as a retriever in a RAG chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retrieved = retriever.invoke("What is the return policy?")
print(retrieved[0].page_content)
```
Embedding function must match at read time. If you created a collection with OpenAI embeddings and query it with the default Chroma embeddings, you'll get garbage results. The collection doesn't store which embedding function was used — you must remember and pass the same one every time.
Not designed for concurrent writes. SQLite-backed persistent Chroma has write locking issues under high concurrency. For multi-process production workloads, use the HTTP server mode or a different database.
IDs must be unique strings. Upserting with an existing ID replaces the old document. `collection.delete()` accepts explicit IDs, a `where` metadata filter, or both.
No built-in replication. The embedded/persistent mode is single-node. For HA, deploy the Chroma server behind your own load balancer, or use a database with native replication (Qdrant, Weaviate).
Chroma is optimized for developer experience and rapid prototyping rather than production scale. Its Python-first design, automatic embedding, and simple API make it the fastest vector database to get working in a new project. However, Chroma lacks the distributed scaling, fine-grained access control, and operational tooling (monitoring dashboards, backup/restore, high availability) that production deployments require. Most teams start with Chroma in development and migrate to Qdrant, Weaviate, or pgvector for production, using Chroma's consistent API as the development environment.
| Scenario | Recommended | Reason |
|---|---|---|
| Development/prototyping | Chroma | Zero setup, Pythonic API |
| Production, self-hosted | Qdrant or Weaviate | HA, monitoring, scaling |
| Already using Postgres | pgvector | Single database, simple ops |
| Serverless/managed | Pinecone or Qdrant Cloud | No infrastructure |
Chroma's client-server mode separates embedding storage from the application process, letting multiple application instances share the same vector store. Running `chroma run --path /chroma-data` starts a persistent Chroma server that clients connect to via `chromadb.HttpClient(host="localhost", port=8000)`. This mode suits small production deployments where multiple service replicas need shared vectors — for example, a containerized API service with three replicas talking to a single Chroma instance running in its own container.
Chroma's in-memory mode (the default `chromadb.Client()`) loses all vectors on process exit; `chromadb.PersistentClient(path="/data/chroma")` stores them on disk instead. Persistent mode trades some startup latency (index data is loaded from disk on first access) for durability across restarts. Since version 0.4 the persistent backend is SQLite plus on-disk HNSW index files (earlier releases used DuckDB and Parquet); active index data is held in RAM while the rest stays on disk. For production RAG systems persistence is essential: vectors survive container restarts and version upgrades. Memory grows with vector count, dimensionality, and the number of collections, so budget RAM per collection rather than assuming everything fits. Backups can use filesystem snapshots — AWS EBS snapshots of the /data/chroma volume enable rollback if corruption occurs, and Kubernetes StatefulSets with persistent volumes manage this automatically (quiesce writes first; snapshotting a live SQLite file can capture an inconsistent state). Migrating from in-memory to persistent means exporting vectors with `collection.get(include=["embeddings", "metadatas", "documents"])` and re-adding them to the new persistent instance — typically a one-time cost. Beyond what a single node can serve, a common strategy is to shard documents across multiple Chroma instances by a hash of the document ID.
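The hash-sharding idea can be sketched as a routing function. `NUM_SHARDS` and the shard layout are assumptions for illustration, not Chroma features — in practice each shard index would map to its own `PersistentClient` or `HttpClient`:

```python
import hashlib

NUM_SHARDS = 4  # assumed fleet size: one Chroma instance per shard


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable shard assignment: hash the document ID, mod the shard count.

    Uses a cryptographic hash rather than Python's built-in hash(),
    which is randomized per process and would route differently on
    every restart.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Every process routes the same ID to the same shard
assert shard_for("doc-42") == shard_for("doc-42")

# And the spread across shards is roughly uniform
counts = [0] * NUM_SHARDS
for i in range(10_000):
    counts[shard_for(f"doc-{i}")] += 1
print(counts)
```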
Chroma supports pluggable embedding providers: the default is all-MiniLM-L6-v2 via sentence-transformers, with built-in functions for OpenAI, Cohere, Hugging Face, and custom models. Passing `embedding_function=SentenceTransformerEmbeddingFunction(model_name="BAAI/bge-large-en-v1.5")` runs BGE embeddings locally — no API calls, no per-query cost, with latency depending on hardware and batch size. The sentence-transformers ecosystem offers models at different points on the speed/accuracy curve: larger BGE models for higher retrieval accuracy, all-MiniLM-L6-v2 for speed at lower accuracy, plus domain-fine-tuned models where they exist. The `EmbeddingFunction` interface also enables hybrid approaches: combine dense embeddings (Chroma's vector search) with sparse signals such as BM25 full-text scores for exact matches. A multi-embedding strategy stores documents under several embeddings optimized for different query types — a semantic embedding for conceptual search, a domain-specific embedding for specialized terminology, a sparse embedding for exact matching — then searches them in parallel and re-ranks by a weighted hybrid score (e.g. 0.5×semantic_score + 0.3×domain_score + 0.2×sparse_score). Embedding choice also drives cost: hosted APIs bill per token, while local models cost only amortized hardware, a difference that compounds at scale. Finally, watch for drift: if you switch or fine-tune the embedding model, old and new vectors are not comparable, so the whole corpus must be re-embedded — at ~100 ms per document across 8 parallel workers, 1M documents takes roughly 3.5 hours.
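The weighted hybrid re-rank described above can be sketched in plain Python; the document IDs, scores, and weights below are illustrative, with each score normalized to [0, 1] where higher is better:

```python
# Candidate scores from three parallel searches (illustrative numbers)
semantic = {"doc-1": 0.92, "doc-2": 0.75, "doc-3": 0.40}
domain   = {"doc-1": 0.50, "doc-2": 0.88, "doc-3": 0.30}
sparse   = {"doc-1": 0.10, "doc-2": 0.95, "doc-3": 0.85}

WEIGHTS = (0.5, 0.3, 0.2)  # semantic, domain, sparse — as in the text


def hybrid_score(doc_id: str) -> float:
    """Weighted sum; a doc missing from one search contributes 0 there."""
    w_sem, w_dom, w_sp = WEIGHTS
    return (
        w_sem * semantic.get(doc_id, 0.0)
        + w_dom * domain.get(doc_id, 0.0)
        + w_sp * sparse.get(doc_id, 0.0)
    )


candidates = set(semantic) | set(domain) | set(sparse)
ranked = sorted(candidates, key=hybrid_score, reverse=True)
print(ranked[0])  # doc-2: strong on domain and exact-match signals
```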
A single-node Chroma instance eventually hits a scale where query latency degrades; horizontal scaling then means sharding: hash each collection (e.g. `collection_id % num_shards`) to an independent Chroma instance, with a thin routing layer sending each request to the right shard. For multi-tenant RAG, give each customer isolated collections and spread tenants across shards — Tenant 1 → Chroma-shard0, Tenant 2 → Chroma-shard1, and so on. This prevents cross-tenant data leakage and enables per-tenant SLAs: high-paying tenants get dedicated shards tuned for recall (higher HNSW ef values), while low-tier tenants share cheaper shards. Metadata filtering alone (`where={"customer_id": "tenant_1"}`) is weak isolation — a single application bug exposes every tenant — so compliance-sensitive deployments should isolate tenants at the instance or network level. Chroma has no built-in replication, so high availability means building it yourself: for example, write through a replicated journal (Kafka or a persistent log) that a standby instance replays to maintain a warm replica for failover. Cost optimization: keep infrequently accessed collections on slower persistent storage and active collections hot in memory; with skewed access patterns (80% of queries hitting 20% of collections), this hybrid approach can cut infrastructure cost substantially.
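Tenant-to-shard routing with per-tier settings might look like the following sketch; the tier names, hostnames, and ef values are invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardConfig:
    host: str      # where this tenant's Chroma instance lives
    hnsw_ef: int   # query-time ef: higher = better recall, slower queries


# Invented routing table: premium tenants get dedicated shards,
# everyone else shares a pooled instance with cheaper settings.
ROUTES = {
    "tenant-acme": ShardConfig(host="chroma-shard-0.internal", hnsw_ef=200),
    "tenant-bigco": ShardConfig(host="chroma-shard-1.internal", hnsw_ef=200),
}
SHARED = ShardConfig(host="chroma-shared.internal", hnsw_ef=50)


def route(tenant_id: str) -> ShardConfig:
    """Isolation by instance, not by metadata filter: each tenant's
    requests only ever reach its own Chroma host."""
    return ROUTES.get(tenant_id, SHARED)


print(route("tenant-acme").host)  # dedicated shard
print(route("tenant-xyz").host)   # falls back to the shared pool
```

In a real deployment, `route()` would hand back a `chromadb.HttpClient` built from the shard's host rather than a config record.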