A high-performance, production-ready vector search engine written in Rust, with rich filtering, sparse vector support, and cloud/self-hosted options.
Qdrant is written in Rust, which translates to predictable low latency, low memory overhead, and no garbage-collection pauses. It handles both dense and sparse vectors natively — meaning you can do BM25-style keyword retrieval and semantic retrieval from the same database, enabling hybrid search without a separate Elasticsearch cluster.
It runs as a single binary (Docker or binary), as a distributed cluster, or as a managed cloud service. For teams that want the control of self-hosting without giving up production features, Qdrant is often the best choice.
Qdrant organises data into collections (equivalent to tables). Each item in a collection is a point, consisting of:
- an ID (an integer or UUID),
- one or more vectors (dense and/or sparse, optionally under named vector spaces), and
- an optional payload: arbitrary JSON metadata used for filtering.
A collection can have multiple named vector spaces — for example, a "dense" space for semantic search and a "sparse" space for keyword search, searched jointly in a hybrid query.
# Docker (recommended for development)
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant
pip install qdrant-client
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
client = QdrantClient(host="localhost", port=6333)
# Create a collection
client.create_collection(
collection_name="my_docs",
vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)
print(client.get_collection("my_docs"))
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from openai import OpenAI
qdrant = QdrantClient(host="localhost", port=6333)
oai = OpenAI()
def embed(text: str) -> list[float]:
    return oai.embeddings.create(
        input=[text], model="text-embedding-3-small"
    ).data[0].embedding
# Upsert points
docs = [
{"id": 1, "text": "Returns are free within 30 days.", "category": "policy"},
{"id": 2, "text": "Free shipping on orders over $50.", "category": "shipping"},
{"id": 3, "text": "Next-day delivery available in major cities.", "category": "shipping"},
]
points = [
PointStruct(id=d["id"], vector=embed(d["text"]), payload={"text": d["text"], "category": d["category"]})
for d in docs
]
qdrant.upsert(collection_name="my_docs", points=points)
# Search
query_vector = embed("What is the return policy?")
results = qdrant.search(
collection_name="my_docs",
query_vector=query_vector,
limit=3,
with_payload=True
)
for r in results:
    print(f"Score {r.score:.3f}: {r.payload['text']}")
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
# Exact match filter
results = qdrant.search(
collection_name="my_docs",
query_vector=query_vector,
query_filter=Filter(
must=[FieldCondition(key="category", match=MatchValue(value="shipping"))]
),
limit=3,
with_payload=True
)
# Numeric range filter
results = qdrant.search(
collection_name="my_docs",
query_vector=query_vector,
query_filter=Filter(
must=[FieldCondition(key="year", range=Range(gte=2022, lte=2024))]
),
limit=5
)
Qdrant integrates filtering with ANN search rather than simply post-filtering results. Based on the estimated cardinality of the filter, it either scans the (small) filtered subset exactly or applies the filter during HNSW graph traversal ("filterable HNSW"), choosing the faster strategy automatically.
from qdrant_client.models import VectorParams, SparseVectorParams, SparseIndexParams, SparseVector
# Create a collection with both dense and sparse vector spaces
qdrant.create_collection(
collection_name="hybrid_docs",
vectors_config={"dense": VectorParams(size=1536, distance=Distance.COSINE)},
sparse_vectors_config={"sparse": SparseVectorParams(index=SparseIndexParams())}
)
# Generate a sparse vector (BM25-style) — requires a sparse encoder.
# Here a simple TF-IDF approximation for illustration:
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [d["text"] for d in docs]
vectorizer = TfidfVectorizer(max_features=30000)
sparse_matrix = vectorizer.fit_transform(texts)

# Extract the non-zero indices and values for the first document
row = sparse_matrix.getrow(0).tocoo()
sparse_indices = row.col.tolist()
sparse_values = row.data.tolist()
dense_embedding = embed(texts[0])

# Upsert with both vector types
qdrant.upsert(
    collection_name="hybrid_docs",
    points=[PointStruct(
        id=1,
        vector={
            "dense": dense_embedding,
            "sparse": SparseVector(indices=sparse_indices, values=sparse_values)
        },
        payload={"text": texts[0]}
    )]
)
Payload indexing is opt-in. You must explicitly create payload indexes for fields you want to filter on efficiently. Without an index, filters perform a full collection scan:
qdrant.create_payload_index(
collection_name="my_docs",
field_name="category",
field_schema="keyword" # or "integer", "float", "geo", "text"
)
Collection config is permanent. Vector dimensions and distance metric are fixed at collection creation. Changing them requires creating a new collection and migrating data.
Memory-mapped storage vs RAM. By default, Qdrant keeps vectors in RAM for maximum performance. Setting on_disk=True in the vector config memory-maps the original vectors to disk, trading some query latency for a much smaller memory footprint; the memmap_threshold setting can also move large segments to disk automatically.
gRPC for high throughput. The Python client defaults to REST. For high-throughput ingestion, enable gRPC: QdrantClient(host="localhost", port=6334, prefer_grpc=True).
Qdrant supports three deployment modes: in-memory (QdrantClient(":memory:") for testing), on-disk persistence (local path), and the distributed Qdrant server for production. The in-memory mode is ideal for development and testing — it is fast and requires no setup but loses data on process restart. Production deployments use the Qdrant server with persistent storage, accessed via the REST or gRPC API. Qdrant Cloud provides a managed version that eliminates infrastructure management while maintaining the full Qdrant feature set through the same Python client API.
| Feature | Qdrant | Pinecone | Weaviate |
|---|---|---|---|
| Sparse+dense hybrid | Native | Via sparse index | Via BM25 module |
| Self-hosted | Yes (open source) | No | Yes (open source) |
| Payload filtering | Pre/post filter | Metadata filter | GraphQL where |
| Quantization | Scalar, product, binary | Limited | PQ compression |
Qdrant's quantization options — scalar (INT8), product quantization (PQ), and binary — reduce vector storage size and improve search speed at the cost of recall accuracy. Binary quantization produces the smallest index (32x compression from float32) with surprisingly competitive recall when used with the rescore option, which re-ranks binary search results using full-precision vectors. For high-throughput production deployments where query latency is critical, binary quantization with rescore is the recommended starting configuration as it typically achieves 95%+ recall at 5–10x search speedup.
Qdrant's collection abstraction provides isolation and independent configuration for different use cases. Each collection has its own vector store, payload data, and indexing parameters. Within a collection, payload indexes enable fast filtering on metadata fields (e.g., doc_id, source, timestamp) before vector search, dramatically speeding up filtered queries. For example, indexing the "category" field with the keyword schema lets Qdrant restrict candidates to a matching category before ANN search instead of scanning the whole collection. Payload indexes consume some additional memory but keep filtered retrieval interactive. In RAG systems with millions of chunks, separating collections per document source and payload-indexing fields such as chunk_id allows fast pre-filtering before semantic search, helping keep end-to-end latency under 100 ms. The Python client exposes helpers such as collection_exists() and create_payload_index() for managing this setup.
Qdrant's HNSW index has two kinds of knobs: the build-time parameters m and ef_construct (defaults m=16 and ef_construct=100 in recent versions), and the query-time search budget hnsw_ef, passed via SearchParams. ef_construct controls how many candidate neighbours are evaluated when inserting a vector into the graph; higher values (e.g., 256-512) improve recall at the cost of 2-3x longer index construction. hnsw_ef is set per query: raising it from 50 to 200 typically improves recall substantially (for example from roughly 0.85 to 0.98 on some workloads) while increasing latency. In production, build-time parameters are fixed at collection creation, while hnsw_ef is tuned against a recall SLA; because Qdrant allows per-query overrides, adaptive strategies can raise hnsw_ef for ambiguous queries. For time-sensitive applications (sub-100 ms requirements), profile the recall-latency curve empirically to find the operating point.
Qdrant supports scalar and product quantization to reduce index size by 4-16x, which matters for large deployments on budget-constrained infrastructure. Scalar quantization compresses float32 embeddings to int8 (4x compression) with roughly 1% recall loss; product quantization factorizes vectors into smaller subspaces, reaching 16x compression at the cost of several points of recall. Quantization is enabled at collection creation (quantization_config=ScalarQuantization(...)) and vectors are compressed automatically on upload; query vectors remain float32. For a 1M-vector index with d=384 dimensions: uncompressed vectors consume ~1.5 GB in memory, scalar quantization reduces this to ~400 MB, and product quantization (32 subspaces) to ~100 MB. The trade-offs differ by method: int8 scalar quantization is SIMD-friendly and often speeds up distance computation with minimal recall loss, so it is nearly always worthwhile at scale, while product quantization costs more recall and decode time and is reserved for deployments where memory dominates cost.
Qdrant's payload_index feature enables indexing metadata fields for fast filtering before vector search, critical for multi-criteria retrieval. A RAG system with documents tagged by source, date, language, and confidence can express queries like: search(query_vector, filter={'source': 'wiki', 'date': {'gte': 2023-01-01}, 'language': 'en', 'confidence': {'gte': 0.9}}). Qdrant executes this as: (1) bitmap filter for exact matches (source='wiki'), (2) range filter for date and confidence, (3) intersection produces candidate set, (4) ANN search on candidates. For 10M documents with 5% matching filter criteria, skipping 95% of documents before ANN search reduces latency 10–20×. Payload indexing auto-completes: Qdrant infers indices on frequently-filtered fields from query logs, no manual configuration needed. In production, monitoring filter selectivity (% of documents matching each filter) reveals poorly-selective filters that don't reduce search space. Example: filter={'is_published': True} on a corpus where 90% documents are published provides minimal benefit; filter={'author': 'alice'} with 0.1% selectivity provides massive acceleration. Chaining filters enables complex business logic: product search with (category AND price_range AND availability AND rating>=4.5) executed as nested filter expressions, maintaining <100ms E2E latency even on 100M product vectors.
Single-node Qdrant is suitable for smaller or non-critical systems; for availability, production deployments run Qdrant in distributed mode, where collections are split into shards and each shard can be replicated across nodes (replication_factor >= 2). Cluster topology is coordinated via the Raft consensus protocol; if a node holding a replica fails, reads and writes continue on the surviving replicas, and a recovered node catches up automatically. For multi-region deployments, cross-region replication implies eventual consistency, which is usually acceptable for RAG workloads since vector databases are read-heavy; where needed, Qdrant exposes per-request write-ordering and read-consistency options to trade latency for stronger guarantees. For backup, Qdrant provides a snapshot API (per collection or for the full storage); periodic snapshots shipped to object storage such as S3 give a practical, cost-effective recovery strategy.