Vector Databases

Milvus

A cloud-native vector database built for billion-scale similarity search, with distributed architecture, multiple index types, and GPU acceleration support.

Billion-scale vector capacity · GPU acceleration · IVF/HNSW/DiskANN index options


SECTION 01

When to choose Milvus

Milvus is designed for truly large-scale vector workloads: hundreds of millions to billions of vectors, distributed across multiple nodes, with high concurrent throughput. It's the database you graduate to when other solutions start showing scaling limits.

For smaller workloads (<10M vectors), Qdrant or Chroma are simpler to operate. Choose Milvus when: you need to shard across multiple nodes, you want GPU-accelerated indexing, or you're building a platform that will serve multiple large tenants.

SECTION 02

Quick start with Milvus Lite

pip install pymilvus

from pymilvus import MilvusClient

# Milvus Lite — embedded, no server needed (dev/prototyping)
client = MilvusClient("./milvus_demo.db")

# Full Milvus server
# client = MilvusClient(uri="http://localhost:19530")

# Zilliz Cloud (managed Milvus)
# client = MilvusClient(uri="https://your-cluster.zillizcloud.com", token="your-token")

SECTION 03

Collections and schemas

from pymilvus import MilvusClient, DataType

client = MilvusClient("./milvus_demo.db")

# Quick schema definition
client.create_collection(
    collection_name="my_docs",
    dimension=1536,    # embedding dimension
    metric_type="COSINE",
    id_type="string",  # use string IDs
    max_length=100     # max ID string length
)

# Advanced: explicit schema for richer metadata
schema = MilvusClient.create_schema(auto_id=False, enable_dynamic_field=True)
schema.add_field(field_name="id",        datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="embedding", datatype=DataType.FLOAT_VECTOR, dim=1536)
schema.add_field(field_name="category",  datatype=DataType.VARCHAR, max_length=50)
schema.add_field(field_name="text",      datatype=DataType.VARCHAR, max_length=65535)

client.create_collection(collection_name="typed_docs", schema=schema)

SECTION 04

Insert and search

from pymilvus import MilvusClient
from openai import OpenAI

client = MilvusClient("./milvus_demo.db")
oai = OpenAI()

def embed(text):
    return oai.embeddings.create(input=[text], model="text-embedding-3-small").data[0].embedding

# Insert
data = [
    {"id": "doc-1", "text": "Refunds processed in 5 business days.", "category": "policy"},
    {"id": "doc-2", "text": "Free shipping on orders above $50.", "category": "shipping"},
    {"id": "doc-3", "text": "Next-day delivery in major cities.", "category": "shipping"},
]
for d in data:
    d["embedding"] = embed(d["text"])

client.insert(collection_name="my_docs", data=data)

# Search
query_vectors = [embed("How long do returns take?")]
results = client.search(
    collection_name="my_docs",
    data=query_vectors,
    limit=3,
    output_fields=["text", "category"]
)
for hit in results[0]:
    print(f"Score {hit['distance']:.3f}: {hit['entity']['text']}")

SECTION 05

Index types

| Index        | Speed     | Recall       | Memory            | Best for                        |
| ------------ | --------- | ------------ | ----------------- | ------------------------------- |
| FLAT         | Slow      | 100% (exact) | High              | Tiny datasets, ground truth     |
| IVF_FLAT     | Fast      | High         | Medium            | Medium datasets, good recall    |
| IVF_SQ8      | Fast      | Good         | Low (8-bit quant) | Memory-constrained              |
| HNSW         | Very fast | Very high    | High              | General purpose, QPS priority   |
| DISKANN      | Fast      | High         | Disk-resident     | Billion-scale, memory-efficient |
| GPU_IVF_FLAT | Blazing   | High         | GPU               | High-throughput batch search    |
from pymilvus import MilvusClient

client = MilvusClient("./milvus_demo.db")

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="embedding",
    index_type="HNSW",
    metric_type="COSINE",
    params={"M": 16, "efConstruction": 200}
)
client.create_index(collection_name="my_docs", index_params=index_params)
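The trade-offs in the table above can be folded into a rough selection heuristic. This is an illustrative sketch — the `pick_index` helper and its thresholds are assumptions for this guide, not a Milvus API:

```python
def pick_index(n_vectors: int, memory_constrained: bool = False,
               need_exact: bool = False, have_gpu: bool = False) -> str:
    """Rough index-type heuristic mirroring the trade-off table (illustrative)."""
    if need_exact or n_vectors < 100_000:
        return "FLAT"            # exhaustive search, 100% recall
    if have_gpu:
        return "GPU_IVF_FLAT"    # high-throughput batch search
    if n_vectors >= 1_000_000_000:
        return "DISKANN"         # billion-scale, disk-resident
    if memory_constrained:
        return "IVF_SQ8"         # 8-bit quantization, low memory footprint
    return "HNSW"                # general-purpose default, QPS priority

print(pick_index(50_000))                                # FLAT
print(pick_index(10_000_000))                            # HNSW
print(pick_index(2_000_000_000))                         # DISKANN
print(pick_index(10_000_000, memory_constrained=True))   # IVF_SQ8
```

In practice, benchmark two or three candidates on a sample of your own data — recall and latency depend heavily on dimensionality and distribution.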

SECTION 06

Milvus with LangChain

from langchain_milvus import Milvus
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create and populate a Milvus vector store
docs = [
    Document(page_content="Return policy: 30 days.", metadata={"source": "faq"}),
    Document(page_content="Shipping: free over $50.", metadata={"source": "faq"}),
]
vectorstore = Milvus.from_documents(
    documents=docs,
    embedding=embeddings,
    connection_args={"uri": "./milvus_langchain.db"},
    collection_name="langchain_docs"
)

# Use as retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
results = retriever.invoke("What is the refund timeframe?")
print(results[0].page_content)

SECTION 07

Gotchas

Collection must be loaded before search. Milvus separates data loading from searching. If you get "collection not loaded" errors, call client.load_collection("name") first. With Milvus Lite this is automatic.

Flush before querying inserted data. Milvus uses log-structured storage — inserted data sits in a memory buffer before being sealed and indexed. Call client.flush("collection_name") to force persistence if you need the data to be immediately queryable.

enable_dynamic_field for flexible metadata. Without enable_dynamic_field=True in the schema, only explicitly declared fields can be stored. Turn it on during development; lock down the schema in production for performance.

Milvus Lite is single-process only. The embedded .db file mode doesn't support concurrent access from multiple processes. For production, run the full Milvus server.

Milvus performance tuning and scaling

Milvus's segment-based storage architecture groups vectors into segments of configurable size, with smaller segments providing better query parallelism and larger segments providing better compression efficiency. The default segment size of 512MB is optimized for general-purpose workloads, but high-throughput ingestion scenarios benefit from larger segments (1–2GB) that reduce the number of concurrent segment merges, while low-latency query workloads benefit from smaller segments (128–256MB) that enable finer-grained parallel search across available CPU cores.
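A back-of-the-envelope check makes the parallelism trade-off concrete: raw vector bytes divided by segment size gives the number of sealed segments a query can fan out over. A minimal sketch — the `estimate_segments` helper and the float32-vectors-only assumption are mine, not a Milvus API:

```python
def estimate_segments(n_vectors: int, dim: int, segment_mb: int = 512) -> int:
    """Approximate sealed-segment count: raw float32 vector bytes / segment size."""
    raw_bytes = n_vectors * dim * 4                 # float32 vectors only, no metadata
    segment_bytes = segment_mb * 1024 * 1024
    return max(1, -(-raw_bytes // segment_bytes))   # ceiling division

# 10M x 768-dim vectors at the default 512MB segment size
print(estimate_segments(10_000_000, 768))        # -> 58 segments to search in parallel
# Smaller segments mean finer-grained parallelism per query
print(estimate_segments(10_000_000, 768, 128))   # -> 229 segments
```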

Milvus Lite provides an embedded version of Milvus that runs within a Python process without requiring a separate server deployment, using the same API as the full Milvus server. This enables development and testing on local machines without Docker or Kubernetes, with the same code running unchanged against the full Milvus server in production. The storage format is compatible between Milvus Lite and the server, allowing migration from development to production without re-importing data.

Milvus's role-based access control (RBAC) enables multi-tenant deployments where different applications or users access separate collections with enforced isolation. Database-level namespacing allows multiple logical databases within a single Milvus instance, with each database having independent collections and access controls. This multi-tenancy model reduces infrastructure cost compared to running separate Milvus instances per tenant while maintaining data isolation through RBAC enforcement at the API level.

Milvus Indexing Algorithms and Index Type Selection

Milvus supports multiple indexing strategies optimized for different trade-offs: FLAT (exhaustive search, 100% recall, O(n) latency), IVF_FLAT (inverted file with clustering, ~95% recall, ~10× speedup), IVF_PQ (product quantization, ~90% recall, ~100× speedup), and HNSW (graph-based, ~98% recall, configurable speed). Index selection depends on recall SLA and throughput requirements: interactive search (sub-100ms latency) typically uses HNSW with ef=100, while batch retrieval can use IVF_PQ with nprobe=64 for a massive speedup. For 100M 768-dimensional vectors, FLAT requires ~300GB of memory; IVF_FLAT still stores the full vectors, so memory stays near that figure, but probing only a fraction of its 1024 clusters delivers the ~10× query speedup; IVF_PQ with 8-byte codes shrinks the index to a few GB while maintaining ~90% recall. Milvus automatically partitions large indexes across disks and machines; passing params={"nlist": 1024, "m": 8, "nbits": 8} when defining the index configures IVF_PQ for specific hardware. Index build is the dominant cost: building HNSW on 100M vectors takes 2–4 hours on a single machine, motivating asynchronous index building and serving from replicas during rebuilds.
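The memory figures above follow from simple arithmetic — float32 vectors cost 4 bytes per dimension, while PQ stores one code byte per sub-quantizer. A sketch of that estimate (the helper names are illustrative, not pymilvus APIs; codebook and graph overheads are ignored):

```python
def flat_memory_gb(n_vectors: int, dim: int) -> float:
    """Raw float32 vector storage: n * dim * 4 bytes."""
    return n_vectors * dim * 4 / 1e9

def pq_memory_gb(n_vectors: int, m: int = 8, nbits: int = 8) -> float:
    """PQ codes: m sub-quantizers * nbits bits per vector (codebooks ignored)."""
    return n_vectors * m * nbits / 8 / 1e9

print(f"FLAT, 100M x 768-dim: {flat_memory_gb(100_000_000, 768):.0f} GB")  # ~307 GB
print(f"IVF_PQ codes (m=8, 8-bit): {pq_memory_gb(100_000_000):.1f} GB")    # 0.8 GB
```

The real IVF_PQ footprint lands somewhat higher (a few GB at this scale) once cluster centroids, codebooks, and bookkeeping are added, which matches the ~3GB figure above.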

Partition Keys, Collection Routing, and Tenant Isolation

Milvus partitions allow logically separating data within a collection — useful for multi-tenant systems and time-windowed retention policies. Creating a partition (client.create_partition(collection_name="docs", partition_name="tenant_a")) and inserting tenant-specific data (client.insert(collection_name="docs", data=rows, partition_name="tenant_a")) enables efficient per-tenant search: client.search(..., partition_names=["tenant_a"]) searches only that tenant's vectors, avoiding cross-tenant data leakage and reducing search latency. For billion-scale deployments, partitioning by insertion timestamp (daily partitions) enables TTL-based deletion: old partitions are dropped wholesale (fast) instead of deleting individual vectors (slow). Partition pruning in Milvus 2.4+ can automatically skip partitions based on filter expressions; a query with filter='created_date > "2025-03-01"' skips partitions that cannot match, reducing I/O. In Kubernetes deployments, Milvus isolates tenants via separate collections with distinct service accounts and RBAC rules, enabling per-tenant SLA guarantees: high-priority tenants get dedicated index replicas with higher HNSW ef values, while batch tenants share cheaper indexes.
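The daily-partition retention scheme above can be sketched with a pure helper that names partitions by date and lists which ones have aged out. The p_YYYYMMDD naming convention and both helper names are assumptions for illustration, not Milvus conventions:

```python
from datetime import date, timedelta
from typing import List

def partition_name(d: date) -> str:
    """Daily partition naming convention (illustrative): p_YYYYMMDD."""
    return f"p_{d.strftime('%Y%m%d')}"

def expired_partitions(existing: List[str], today: date,
                       retention_days: int) -> List[str]:
    """Partitions older than the retention window; drop these wholesale."""
    cutoff = partition_name(today - timedelta(days=retention_days))
    # Lexicographic order matches chronological order for p_YYYYMMDD names
    return [p for p in existing if p < cutoff]

parts = [partition_name(date(2025, 3, 1) + timedelta(days=i)) for i in range(5)]
print(expired_partitions(parts, today=date(2025, 3, 5), retention_days=3))
# -> ['p_20250301']
```

With a live client, each expired name would then be passed to a drop-partition call, deleting that day's vectors in one cheap metadata operation.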

Consistency Levels and Tuning for Transactional Guarantees

Milvus offers configurable consistency levels: Strong (read-after-write, highest latency), Bounded (bounded staleness, the default), Session (read-your-own-writes within a connection), and Eventually (fastest, may return stale vectors). Most ML applications use Session consistency — users see their own recent inserts immediately, while other users' updates propagate with sub-second delay. For transactional guarantees (e.g., financial fraud detection requiring an up-to-date vector index), Strong consistency ensures all reads reflect the latest writes, at the cost of a 10–30% latency increase from synchronization overhead. Milvus achieves this via a message broker (Pulsar/Kafka) and write-ahead logging: all writes are timestamped, and reads wait until the required timestamp is durable before returning. In practice, Session consistency is sufficient for RAG and search applications; Strong is needed only when strict ordering is required (e.g., concurrent vector updates to the same entity must be sequentially consistent). The consistency level is set per collection at creation time (e.g., consistency_level="Strong") and can typically also be overridden on individual read requests. That enables hybrid tuning for time-sensitive applications: most queries run at Session (fast), while critical operations (regulatory checks) use Strong, improving tail latency without sacrificing compliance.

Metadata Management and Auxiliary Indexing

Beyond vector embeddings, Milvus stores and indexes rich scalar metadata — document IDs, source URLs, extraction timestamps, confidence scores. These scalar fields enable complex filtering alongside ANN search. For a 100M-document RAG system, scalar fields might include: chunk_id (hash), document_id (int64), source (string), extracted_date (timestamp), confidence (float). Indexing high-cardinality fields (document_id with millions of distinct values) via an INVERTED index enables fast lookups, while low-cardinality fields (source with 20 distinct values) benefit from bitmap indexes, filtering millions of vectors down to tens of thousands almost instantly. Recent Milvus releases can analyze scalar cardinality and pick an appropriate index type automatically. Collection-level TTL (time-to-live) enables automatic expiration: entities older than, say, 90 days are removed via background compaction, keeping data fresh without manual cleanup scripts.
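Milvus filter expressions over these scalar fields are plain strings, so a small builder keeps them consistent across call sites. The `build_filter` helper below is an illustrative sketch, not part of pymilvus — only the expression syntax (`==`, `>=`, `and`, double-quoted strings) is Milvus's:

```python
from typing import Optional

def build_filter(source: Optional[str] = None,
                 min_confidence: Optional[float] = None,
                 after_ts: Optional[int] = None) -> str:
    """Compose a Milvus boolean filter expression from optional scalar predicates."""
    clauses = []
    if source is not None:
        clauses.append(f'source == "{source}"')        # string equality
    if min_confidence is not None:
        clauses.append(f"confidence >= {min_confidence}")
    if after_ts is not None:
        clauses.append(f"extracted_date > {after_ts}")  # epoch-seconds timestamp
    return " and ".join(clauses)

expr = build_filter(source="faq", min_confidence=0.8)
print(expr)  # source == "faq" and confidence >= 0.8
# Then pass it to a search, e.g. client.search(..., filter=expr)
```

Pre-filtering this way runs before the ANN step, so a selective expression can cut both latency and the candidate set dramatically.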