Vector Databases

Qdrant

A high-performance, production-ready vector search engine written in Rust, with rich filtering, sparse vector support, and cloud/self-hosted options.

Rust-powered
High performance
Sparse + Dense
Hybrid search
Payload filtering
Rich metadata

SECTION 01

Why Qdrant for production

Qdrant is written in Rust, which translates to predictable low latency, low memory overhead, and no garbage-collection pauses. It handles both dense and sparse vectors natively — meaning you can do BM25-style keyword retrieval and semantic retrieval from the same database, enabling hybrid search without a separate Elasticsearch cluster.

It runs as a single binary (Docker or binary), as a distributed cluster, or as a managed cloud service. For teams that want the control of self-hosting without giving up production features, Qdrant is often the best choice.

SECTION 02

Collections and points

Qdrant organises data into collections (equivalent to tables). Each item in a collection is a point, consisting of:

An ID (an unsigned integer or a UUID)
One or more vectors (dense and/or sparse, under named vector spaces)
An optional JSON payload of arbitrary metadata

A collection can have multiple named vector spaces — for example, a "dense" space for semantic search and a "sparse" space for keyword search, searched jointly in a hybrid query.

SECTION 03

Getting started

# Docker (recommended for development); port 6334 is the gRPC port
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

# Install the Python client
pip install qdrant-client

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(host="localhost", port=6333)

# Create a collection
client.create_collection(
    collection_name="my_docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)
print(client.get_collection("my_docs"))

SECTION 04

Upserting and searching

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from openai import OpenAI

qdrant = QdrantClient(host="localhost", port=6333)
oai = OpenAI()

def embed(text: str) -> list[float]:
    return oai.embeddings.create(input=[text], model="text-embedding-3-small").data[0].embedding

# Upsert points
docs = [
    {"id": 1, "text": "Returns are free within 30 days.", "category": "policy"},
    {"id": 2, "text": "Free shipping on orders over $50.", "category": "shipping"},
    {"id": 3, "text": "Next-day delivery available in major cities.", "category": "shipping"},
]
points = [
    PointStruct(id=d["id"], vector=embed(d["text"]), payload={"text": d["text"], "category": d["category"]})
    for d in docs
]
qdrant.upsert(collection_name="my_docs", points=points)

# Search
query_vector = embed("What is the return policy?")
results = qdrant.search(
    collection_name="my_docs",
    query_vector=query_vector,
    limit=3,
    with_payload=True
)
for r in results:
    print(f"Score {r.score:.3f}: {r.payload['text']}")

SECTION 05

Payload filtering

from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# Exact match filter
results = qdrant.search(
    collection_name="my_docs",
    query_vector=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="shipping"))]
    ),
    limit=3,
    with_payload=True
)

# Numeric range filter
results = qdrant.search(
    collection_name="my_docs",
    query_vector=query_vector,
    query_filter=Filter(
        must=[FieldCondition(key="year", range=Range(gte=2022, lte=2024))]
    ),
    limit=5
)

Qdrant chooses its filtering strategy automatically from the estimated cardinality of the filter: a highly selective filter triggers an exact scan of the matching subset (pre-filtering), while a broad filter is applied during HNSW traversal itself, avoiding the truncated result sets that naive post-filtering can produce.

SECTION 06

Sparse vectors for hybrid search

from qdrant_client.models import Distance, VectorParams, SparseVectorParams, SparseIndexParams, SparseVector

# Create a collection with both dense and sparse vector spaces
qdrant.create_collection(
    collection_name="hybrid_docs",
    vectors_config={"dense": VectorParams(size=1536, distance=Distance.COSINE)},
    sparse_vectors_config={"sparse": SparseVectorParams(index=SparseIndexParams())}
)

# Generate a sparse vector (BM25-style). A production pipeline would use a
# dedicated sparse encoder (e.g. SPLADE or BM25); a simple TF-IDF stands in here.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=30000)
vectorizer.fit([d["text"] for d in docs])

text = docs[0]["text"]
row = vectorizer.transform([text]).tocoo()   # one sparse row: (column indices, values)
sparse_indices = row.col.tolist()
sparse_values = row.data.tolist()

# Upsert with both vector types
qdrant.upsert(
    collection_name="hybrid_docs",
    points=[PointStruct(
        id=1,
        vector={
            "dense": embed(text),
            "sparse": SparseVector(indices=sparse_indices, values=sparse_values)
        },
        payload={"text": text}
    )]
)

SECTION 07

Gotchas

Payload indexing is opt-in. You must explicitly create payload indexes for fields you want to filter on efficiently. Without an index, filters perform a full collection scan:

qdrant.create_payload_index(
    collection_name="my_docs",
    field_name="category",
    field_schema="keyword"   # or "integer", "float", "geo", "text"
)

Collection config is permanent. Vector dimensions and distance metric are fixed at collection creation. Changing them requires creating a new collection and migrating data.

Memory-mapped storage vs RAM. By default, Qdrant keeps original vectors in RAM for fastest access. For large collections, set on_disk=True in VectorParams to memory-map vectors from disk, trading some latency for much lower memory usage.

gRPC for high throughput. The Python client defaults to REST. For high-throughput ingestion, enable gRPC: QdrantClient(host="localhost", port=6334, prefer_grpc=True).

Qdrant deployment and production configuration

Qdrant supports three deployment modes: in-memory (QdrantClient(":memory:") for testing), on-disk persistence (QdrantClient(path=...)), and the distributed Qdrant server for production. The first two are local modes of the Python client: they run inside your process, require no setup, and are ideal for development and testing, though the in-memory variant loses data on process restart. Production deployments use the Qdrant server with persistent storage, accessed via the REST or gRPC API. Qdrant Cloud provides a managed version that eliminates infrastructure management while keeping the full Qdrant feature set behind the same Python client API.

Feature               | Qdrant                  | Pinecone         | Weaviate
Sparse+dense hybrid   | Native                  | Via sparse index | Via BM25 module
Self-hosted           | Yes (open source)       | No               | Yes (open source)
Payload filtering     | Pre/post filter         | Metadata filter  | GraphQL where
Quantization          | Scalar, product, binary | Limited          | PQ compression

Qdrant's quantization options — scalar (INT8), product quantization (PQ), and binary — reduce vector storage size and improve search speed at the cost of recall accuracy. Binary quantization produces the smallest index (32x compression from float32) with surprisingly competitive recall when used with the rescore option, which re-ranks binary search results using full-precision vectors. For high-throughput production deployments where query latency is critical, binary quantization with rescore is the recommended starting configuration as it typically achieves 95%+ recall at 5–10x search speedup.

Qdrant Collections, Payload Indexing, and Structured Search

Qdrant's collection abstraction provides isolation and independent configuration for different use cases. Each collection has its own vector store, payload schema, and indexing parameters. Within a collection, payload indexes enable fast filtering on metadata fields (e.g., doc_id, source, timestamp) before vector search, dramatically speeding up filtered queries. For example, indexing the "category" payload field with the keyword schema lets Qdrant restrict candidates to a given category before ANN search, avoiding full-collection scans. Payload indexes consume additional memory (on the order of 5–15% of the vector index size) but enable interactive retrieval latency. In RAG systems with millions of documents split into chunks, separating document sources into their own collections and payload-indexing identifier fields such as doc_id allows fast pre-filtering before semantic search, keeping end-to-end latency within interactive bounds. The Python client supports idempotent setup via collection_exists() and create_collection(); payload indexes are created separately, per field, with create_payload_index().

HNSW Algorithm Tuning: ef_construct and hnsw_ef

Qdrant's HNSW index exposes two key tuning parameters: ef_construct (graph-building effort, set in the collection's hnsw_config; the default is 100) and hnsw_ef (the query-time search budget, passed via search_params). ef_construct controls how many candidate neighbours are considered when inserting a vector into the HNSW graph; higher values (e.g. 256–512) improve recall at a 2–3× construction-time cost and must be fixed at index-build time. hnsw_ef is purely a query-time parameter: raising it from 50 to 200 typically lifts recall substantially (figures around 0.85 to 0.98 are common on benchmarks, though the exact curve depends on the data) while roughly doubling latency. In production, ef_construct is set once for the initial build, while hnsw_ef is tuned against recall SLAs: a moderate value (around 100) for interactive search, a larger one (around 200) for batch retrieval. For time-sensitive applications with sub-100ms budgets, profiling the recall-latency curve empirically reveals the optimal hnsw_ef; because search_params allows per-query override, adaptive strategies can raise hnsw_ef for ambiguous queries.

Vector Quantization and Compression for Scale

Qdrant supports scalar quantization and product quantization to reduce index size by 4–16× or more, critical for billion-scale deployments on budget-constrained infrastructure. Scalar quantization compresses float32 components to int8 (4× compression) with roughly 1–3% recall loss; product quantization further factorises vectors into small subspaces, reaching 16× compression and beyond at around 5% recall loss. Enabling quantization during collection creation (quantization_config=ScalarQuantization(...)) compresses vectors automatically on upload; the original float32 vectors are retained alongside, so results can be rescored at full precision. For a 1M-vector index with d=384 dimensions: the uncompressed vectors alone consume ~1.5GB in memory; scalar quantization reduces this to ~400MB; product quantization with 32 subspaces shrinks the quantized vectors to tens of megabytes, though the HNSW graph adds its own overhead. The trade-off is latency: quantized vectors must be unpacked before distance computation, adding roughly 5–15% overhead. In practice, scalar quantization is nearly always worthwhile at large scale; product quantization is reserved for deployments where memory dominates cost and the extra recall loss is acceptable.

Payload Indexing and Pre-Filtering for Complex Queries

Qdrant's payload indexes enable fast filtering on metadata fields before vector search, critical for multi-criteria retrieval. A RAG system with documents tagged by source, date, language, and confidence can express a query such as: source = "wiki" AND date >= "2023-01-01" AND language = "en" AND confidence >= 0.9, built from Filter, FieldCondition, MatchValue, and Range objects. Conceptually, Qdrant evaluates the exact-match conditions against the payload indexes, applies the range conditions, intersects the results into a candidate set, and runs ANN search over those candidates. For 10M documents where 5% match the filter, skipping the other 95% before ANN search can cut latency by an order of magnitude. Note that payload indexes are opt-in: create one explicitly for every field you filter on, or the filter falls back to scanning stored payloads. In production, monitoring filter selectivity (the share of documents matching each filter) reveals poorly selective filters that do not reduce the search space. Example: filtering on is_published = True in a corpus where 90% of documents are published provides minimal benefit; filtering on author = "alice" with 0.1% selectivity provides massive acceleration. Chaining filters enables complex business logic: a product search with (category AND price_range AND availability AND rating >= 4.5) is expressed as nested filter clauses while retaining interactive latency even on very large collections.

Qdrant Replication and High-Availability Setups

Single-node Qdrant is suitable for smaller or non-critical systems; production deployments that need availability run Qdrant in distributed mode. A Qdrant cluster coordinates metadata via the Raft consensus protocol, and each collection is split into shards that can be replicated across nodes: creating a collection with replication_factor=2 stores every shard on two nodes, so the cluster survives the loss of a node and reads and writes continue on the remaining replicas while the recovered node catches up. The companion setting write_consistency_factor controls how many replicas must acknowledge a write before it succeeds; lower values favour availability, higher values favour consistency. Strict global consistency is rarely required for vector search, since these workloads are read-heavy, so the defaults are sufficient for most RAG scenarios. For backups, Qdrant's snapshot API exports a collection (or the whole node) to disk for point-in-time restore; scheduling regular snapshots, for example to S3-compatible storage, keeps recovery time and data-loss windows small at modest cost.