Vector Databases

Chroma

An open-source, developer-friendly vector database designed for embedding storage and retrieval, with in-memory and persistent modes.

Open-source (MIT license)
In-memory or persistent
Native LangChain integration

Table of Contents

SECTION 01

Chroma's design philosophy

Chroma is built for developers who want to go from "zero to working RAG" in minutes, not hours. It prioritises simplicity: one pip install, no external service, runs in-process. The tradeoff is that it's not designed for billion-scale production deployments — for that, reach for Qdrant, Pinecone, or Milvus.

When to use Chroma: prototyping, personal projects, small-to-medium production RAG (<1M docs), local development without a running service.

SECTION 02

In-memory quick start

pip install chromadb
import chromadb

# Ephemeral in-memory client (data lost on restart)
client = chromadb.Client()

# Create a collection (like a table in SQL)
collection = client.create_collection(name="my_docs")

# Add documents — Chroma handles embedding with its default model
collection.add(
    documents=[
        "Python was created by Guido van Rossum in 1991.",
        "JavaScript is the language of the web browser.",
        "Rust provides memory safety without garbage collection.",
    ],
    ids=["doc-1", "doc-2", "doc-3"]    # unique IDs, you choose
)

# Query
results = collection.query(
    query_texts=["Who invented Python?"],
    n_results=2
)
print(results["documents"])   # [["Python was created by...", "JavaScript is..."]]
print(results["distances"])   # distances, lower = more similar
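For reference, the query result is a dict of parallel lists, with one inner list per query text. A sketch of the shape (the distance values here are illustrative, not real model output):

```python
# Shape of collection.query(...) results, sketched as a plain dict so it
# can be inspected without running Chroma.
example_results = {
    "ids": [["doc-1", "doc-2"]],
    "documents": [[
        "Python was created by Guido van Rossum in 1991.",
        "JavaScript is the language of the web browser.",
    ]],
    "distances": [[0.31, 0.92]],  # one inner list per query text; lower = closer
}

# Index [0] selects the results for the first (here, only) query text.
top_hit = example_results["documents"][0][0]
```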
SECTION 03

Persistent storage

import chromadb

# Persistent client — saves to disk automatically
client = chromadb.PersistentClient(path="./chroma_db")

# Collections persist across restarts
collection = client.get_or_create_collection("my_docs")
collection.add(
    documents=["Document content here."],
    ids=["doc-001"]
)

# Later, in a new process:
client2 = chromadb.PersistentClient(path="./chroma_db")
collection2 = client2.get_collection("my_docs")
print(collection2.count())   # still 1

For a standalone server (useful for multi-process or client-server setups):

chroma run --path ./chroma_db --port 8000
client = chromadb.HttpClient(host="localhost", port=8000)
SECTION 04

Embedding functions

By default, Chroma uses all-MiniLM-L6-v2 via sentence-transformers. Override with any embedding function:

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Use OpenAI embeddings
openai_ef = OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)
collection = client.create_collection(
    name="openai_collection",
    embedding_function=openai_ef
)

# Or bring your own via sentence-transformers
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
st_ef = SentenceTransformerEmbeddingFunction(model_name="BAAI/bge-large-en-v1.5")
collection = client.create_collection(name="bge_collection", embedding_function=st_ef)
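Chroma's embedding-function contract is simply a callable that takes a list of documents and returns one float vector per document. A minimal sketch of a custom one, using a hash projection (deterministic and dependency-free, but NOT semantically meaningful; a stand-in for a real model, useful in tests where nothing should be downloaded):

```python
import hashlib
import math

# Hypothetical custom embedding function with the call shape Chroma
# expects. The hash projection is a placeholder for a real model.
class HashEmbeddingFunction:
    def __init__(self, dim: int = 16):
        self.dim = dim

    def __call__(self, input: list[str]) -> list[list[float]]:
        vectors = []
        for text in input:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Repeat digest bytes out to `dim` values in [0, 1].
            raw = [digest[i % len(digest)] / 255.0 for i in range(self.dim)]
            # L2-normalize so cosine and dot-product behave sensibly.
            norm = math.sqrt(sum(x * x for x in raw)) or 1.0
            vectors.append([x / norm for x in raw])
        return vectors

ef = HashEmbeddingFunction()
vecs = ef(["hello", "world"])
```

An instance like this can be passed as `embedding_function=` exactly like the built-in wrappers above.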
SECTION 05

Querying with filters

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

# Add docs with metadata
collection.add(
    documents=[
        "FAQ: returns are free within 30 days.",
        "Blog: our story began in 2020.",
        "FAQ: shipping takes 3-5 days.",
    ],
    metadatas=[
        {"type": "faq", "year": 2023},
        {"type": "blog", "year": 2020},
        {"type": "faq", "year": 2023},
    ],
    ids=["faq-1", "blog-1", "faq-2"],
)

# Filter: only FAQ documents
results = collection.query(
    query_texts=["How do I return a product?"],
    n_results=2,
    where={"type": {"$eq": "faq"}}       # metadata filter
)

# Full-text filter (where_document)
results = collection.query(
    query_texts=["return policy"],
    n_results=2,
    where_document={"$contains": "30 days"}
)
SECTION 06

LangChain integration

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create a vector store from documents
docs = [
    Document(page_content="Refunds within 30 days.", metadata={"source": "faq"}),
    Document(page_content="Free shipping over $50.", metadata={"source": "faq"}),
]
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Use as a retriever in a RAG chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
docs = retriever.invoke("What is the return policy?")
print(docs[0].page_content)
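Downstream, a RAG chain typically stuffs the retrieved page_content into the prompt before calling the LLM. A minimal sketch of that step (the prompt wording is an assumption, not LangChain's template):

```python
# Assemble a context-stuffed prompt from retrieved chunks.
def build_prompt(question, contexts):
    # Number each retrieved chunk so the model can cite its sources.
    context_block = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(contexts, 1)
    )
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What is the return policy?",
    ["Refunds within 30 days.", "Free shipping over $50."],
)
```

In the LangChain example above, `contexts` would be `[d.page_content for d in retriever.invoke(...)]`.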
SECTION 07

Gotchas

Embedding function must match at read time. If you created a collection with OpenAI embeddings and query it with the default Chroma embeddings, you'll get garbage results. The collection doesn't store which embedding function was used — you must remember and pass the same one every time.

Not designed for concurrent writes. SQLite-backed persistent Chroma has write locking issues under high concurrency. For multi-process production workloads, use the HTTP server mode or a different database.

IDs must be unique strings. Upserting with an existing ID replaces the old document. Deletion takes explicit IDs or a metadata filter (collection.delete(where={"type": "faq"})), so choose an ID scheme you can reconstruct later.

No built-in replication. The embedded/persistent mode has a single node. For HA, deploy the Chroma server behind your own load balancer, or use a database with native replication (Qdrant, Weaviate).

Chroma in production vs alternatives

Chroma is optimized for developer experience and rapid prototyping rather than production scale. Its Python-first design, automatic embedding, and simple API make it the fastest vector database to get working in a new project. However, Chroma lacks the distributed scaling, fine-grained access control, and operational tooling (monitoring dashboards, backup/restore, high availability) that production deployments require. Most teams start with Chroma in development and migrate to Qdrant, Weaviate, or pgvector for production, using Chroma's consistent API as the development environment.

Scenario                   Recommended                Reason
Development/prototyping    Chroma                     Zero setup, Pythonic API
Production, self-hosted    Qdrant or Weaviate         HA, monitoring, scaling
Already using Postgres     pgvector                   Single database, simple ops
Serverless/managed         Pinecone or Qdrant Cloud   No infrastructure

Chroma's client-server mode separates embedding storage from the application process, enabling multiple application instances to share the same vector store. Running chroma run --path /chroma-data --port 8000 starts a persistent Chroma server that clients connect to via chromadb.HttpClient(host="localhost", port=8000). This mode is appropriate for small production deployments where multiple service replicas need shared vectors, for example a containerized API service with three replicas sharing a single Chroma instance running in a separate container.

Chroma Persistent Mode, Vector Persistence, and Production Deployment

Chroma's in-memory mode (the default Client()) loses all vectors on process exit; chromadb.PersistentClient(path="/data/chroma") stores data on disk, giving durability across restarts. Since version 0.4 the persistent backend is SQLite for metadata plus HNSW index files for vectors (earlier releases used DuckDB and Parquet). Persistent mode trades some startup latency, since the index must be loaded from disk, for durability; for production RAG systems this is essential, because vectors survive container restarts and version upgrades. As a rough sizing guide, 1M 768-dimensional float32 vectors occupy about 3 GB before index overhead, so plan memory accordingly when many collections (for example, a multi-tenant setup) share one node. Beyond what a single node holds comfortably, shard documents across multiple Chroma instances by a hash of the document ID. Backups can use filesystem snapshots: an AWS EBS snapshot of the /data/chroma volume enables rollback if corruption occurs, and Kubernetes StatefulSets with persistent volumes manage this automatically. Migrating from in-memory to persistent means exporting vectors (collection.get(include=["embeddings", "documents", "metadatas"])) and re-adding them to the new persistent instance, typically a one-time setup cost.
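The document-ID-hash sharding mentioned above can be sketched as a stateless router; the shard count and the idea of one client per shard are assumptions for illustration:

```python
import hashlib

# Route a document ID to one of `num_shards` independent Chroma instances.
def shard_for(doc_id: str, num_shards: int) -> int:
    # Use a stable hash: Python's built-in hash() is salted per process,
    # which would route the same ID to different shards across restarts.
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# In practice each shard index would map to its own client, e.g.
# clients[shard_for(doc_id, num_shards)].get_collection(...).add(...)
```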

Embedding Functions, Custom Models, and Multi-Embedding Strategies

Chroma integrates with multiple embedding providers. The default is the local all-MiniLM-L6-v2 sentence-transformers model (as noted in Section 04), with built-in wrappers for OpenAI, Cohere, Hugging Face, and fully custom functions. Passing embedding_function=SentenceTransformerEmbeddingFunction(model_name="BAAI/bge-large-en-v1.5") runs BGE embeddings locally: no API calls, no per-query fees, and latency on the order of tens of milliseconds per document on typical hardware. The sentence-transformers ecosystem also offers domain-tuned models (legal, biomedical, code) that can meaningfully improve retrieval precision on specialized corpora, while small general models like all-MiniLM-L6-v2 trade accuracy for speed. Custom embeddings via the EmbeddingFunction interface enable hybrid approaches: combine dense embeddings (Chroma vector search) with sparse signals such as BM25 full-text search for exact matches. A multi-embedding strategy stores documents under several representations optimized for different query types: a semantic embedding for conceptual search, a domain-specific embedding for specialized terminology, and a sparse representation for exact matching. At retrieval time, search all representations in parallel and re-rank by a weighted hybrid score (for example, 0.5×semantic_score + 0.3×domain_score + 0.2×sparse_score). Model choice also drives cost: hosted embedding APIs bill per token, while a local model costs only amortized compute, which at high query volume can be a substantial saving. Finally, beware model drift: if you switch or fine-tune the embedding model, existing vectors become misaligned with new ones, so re-embed the whole corpus whenever the model changes (1M documents at 100 ms each across 8 parallel workers is roughly 3.5 hours).
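The weighted hybrid re-rank described above, as a sketch. The weights follow the 0.5/0.3/0.2 split from the text; it assumes the three scores have already been normalized to a comparable [0, 1] scale:

```python
# Mix three retrieval signals into one ranking score.
def hybrid_score(semantic, domain, sparse, weights=(0.5, 0.3, 0.2)):
    w_sem, w_dom, w_sp = weights
    return w_sem * semantic + w_dom * domain + w_sp * sparse

# Hypothetical candidates with per-signal scores already normalized.
candidates = {
    "doc-a": hybrid_score(0.9, 0.2, 0.1),  # strong semantic match only
    "doc-b": hybrid_score(0.5, 0.8, 0.9),  # strong domain + exact match
}
ranked = sorted(candidates, key=candidates.get, reverse=True)
```

Because raw distances from different retrievers live on different scales, a per-retriever min-max normalization step before mixing is usually required.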

Scalability Patterns, Sharding, and Multi-Tenant Isolation

Single-node Chroma query latency degrades well before billion scale, so plan for sharding once collections reach the tens of millions of vectors. Horizontal scaling uses sharding: hash each collection (collection_id % num_shards) to an independent Chroma instance, with a thin routing layer in front of the shards. For multi-tenant RAG, give each customer isolated collections sharded across the fleet: Tenant1 → Chroma-shard0, Tenant2 → Chroma-shard1, and so on. This prevents cross-tenant data leakage and enables per-tenant SLAs: high-paying tenants get dedicated shards tuned for better recall (higher HNSW ef values), while low-tier tenants share cheaper shards. Metadata filtering alone (where={"customer_id": "tenant_1"}) is weak isolation, since a single application bug can expose another tenant's data; for compliance, use separate collections or separate Chroma instances per tenant. Chroma has no built-in replication (see Gotchas), so high availability requires custom work, for example writing through a replicated journal (Kafka or a persistent log) that a standby instance replays to maintain a hot replica. Cost optimization: keep infrequently accessed collections on slower persistent storage and active collections in memory or on fast disks. For datasets with skewed access patterns, say 80% of queries hitting 20% of collections, this hybrid approach can cut infrastructure cost substantially.