A PostgreSQL extension that adds vector similarity search to your existing Postgres database, enabling RAG without a separate infrastructure component.
The best database is often the one you already have. If your application already uses PostgreSQL for user data, documents, and metadata, adding vectors to the same database means: no extra service to operate, no data synchronisation between databases, joins between vector results and relational data, and full ACID transactions across both.
The limitation: pgvector is not as fast as dedicated vector databases at extreme scale (>10M vectors with high QPS). But for most applications — millions of documents, tens of thousands of queries per day — pgvector handles it comfortably.
```shell
# Using Docker (easiest)
docker run -e POSTGRES_PASSWORD=password -p 5432:5432 pgvector/pgvector:pg16

# Or install the extension in an existing Postgres 12+
# sudo apt install postgresql-16-pgvector
# Then in psql: CREATE EXTENSION vector;

pip install psycopg2-binary pgvector
```
```python
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://postgres:password@localhost/mydb")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
# register_vector must run after the extension exists; it looks up the
# vector type in the database and fails if the type is missing.
register_vector(conn)
```
```python
import psycopg2
from openai import OpenAI
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://postgres:password@localhost/mydb")
register_vector(conn)
cur = conn.cursor()

oai = OpenAI()

def embed(text):
    """Return a 1536-dim embedding for the given text."""
    resp = oai.embeddings.create(input=[text], model="text-embedding-3-small")
    return resp.data[0].embedding
```
```python
# Create a table with a vector column
cur.execute('''
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        content TEXT NOT NULL,
        category TEXT,
        embedding vector(1536)
    )
''')
conn.commit()

# Insert documents
docs = [
    ("Refunds processed in 5 business days.", "policy"),
    ("Free shipping on orders over $50.", "shipping"),
]
for text, cat in docs:
    cur.execute(
        "INSERT INTO documents (content, category, embedding) VALUES (%s, %s, %s)",
        (text, cat, embed(text)),
    )
conn.commit()
```
```python
# Cosine similarity search (operator: <=>)
query_emb = embed("What is the refund timeline?")
cur.execute('''
    SELECT content, 1 - (embedding <=> %s) AS similarity
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 3
''', (query_emb, query_emb))
for row in cur.fetchall():
    print(f"Sim {row[1]:.3f}: {row[0]}")
```
Distance operators: `<=>` is cosine distance, `<->` is L2 (Euclidean) distance, `<#>` is negative inner product.
```python
# IVFFlat — faster to build, good for static datasets
cur.execute('''
    CREATE INDEX ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)
''')
# Rule of thumb for lists: rows/1000 for up to 1M rows, sqrt(rows) for larger

# HNSW — better recall, faster queries, longer build time (pgvector >= 0.5.0)
cur.execute('''
    CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
''')
# m: connections per node (default 16); ef_construction: candidate list size during build
conn.commit()
```
HNSW is the default choice for new indexes — it has better recall at the same query speed. IVFFlat requires fewer resources to build and is better when you have a huge dataset and want to trade some recall for build time.
pgvector's killer feature: combining Postgres full-text search with vector similarity in a single SQL query. (Note that `ts_rank` is Postgres's built-in lexical ranking function, not true BM25, but it plays the same role in the hybrid query.)
```python
query_text = "What is the refund timeline?"
cur.execute('''
    SELECT
        content,
        ts_rank(to_tsvector('english', content), query) AS text_score,
        1 - (embedding <=> %s) AS vector_score,
        -- Reciprocal Rank Fusion (constant k = 60)
        1.0 / (60 + RANK() OVER (ORDER BY embedding <=> %s)) +
        1.0 / (60 + RANK() OVER (
            ORDER BY ts_rank(to_tsvector('english', content), query) DESC
        )) AS rrf_score
    FROM
        documents,
        plainto_tsquery('english', %s) AS query
    WHERE
        to_tsvector('english', content) @@ query
        OR (embedding <=> %s) < 0.5
    ORDER BY rrf_score DESC
    LIMIT 5
''', (query_emb, query_emb, query_text, query_emb))
```
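If you prefer to fuse rankings in application code instead of SQL, reciprocal rank fusion is only a few lines of Python. A minimal sketch, using the same k = 60 constant as the SQL above; the input lists are assumed to hold document ids ordered best-first:

```python
def rrf_fuse(rankings, k=60):
    """Combine several best-first ranked lists of ids via
    Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# id 2 ranks highly in both lists, so it wins overall.
fused = rrf_fuse([[1, 2, 3], [2, 4, 1]])
```

The constant k dampens the influence of any single list's top result; 60 is the value from the original RRF paper and is rarely worth tuning.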
```python
from sqlalchemy import create_engine, select, Column, Integer, Text
from sqlalchemy.orm import DeclarativeBase, Session
from pgvector.sqlalchemy import Vector

engine = create_engine("postgresql+psycopg2://postgres:password@localhost/mydb")

class Base(DeclarativeBase):
    pass

class Document(Base):
    __tablename__ = "documents"
    id = Column(Integer, primary_key=True)
    content = Column(Text, nullable=False)
    embedding = Column(Vector(1536))

Base.metadata.create_all(engine)

# Insert
with Session(engine) as session:
    doc = Document(content="Refunds take 5 days.", embedding=embed("Refunds take 5 days."))
    session.add(doc)
    session.commit()

# Query: nearest documents by cosine distance
with Session(engine) as session:
    results = session.scalars(
        select(Document)
        .order_by(Document.embedding.cosine_distance(query_emb))
        .limit(5)
    ).all()
```
- **Indexes must exist before they help.** Adding an HNSW index to a large existing table can take minutes to hours. Plan index creation for low-traffic windows, and use `CREATE INDEX CONCURRENTLY` in production so writes are not blocked during the build.
- **Exact search without an index.** Without an index, every query does a full sequential scan. For small tables (<10k rows) this is fine; for larger tables, always create an index.
- **HNSW recall vs `ef_search`.** At query time, `SET hnsw.ef_search = 100` (default 40) increases recall at the cost of latency. Tune this based on your recall requirements.
- **Dimension limits.** The `vector` type supports up to 16,000 dimensions, but HNSW and IVFFlat indexes only support vectors with up to 2,000 dimensions. Still sufficient for most current embedding models, but worth knowing.
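Both operational knobs above are one-liners. A sketch using hypothetical SQL-building helpers (the index name and defaults are illustrative); note that `CREATE INDEX CONCURRENTLY` cannot run inside a transaction block, so execute it on a psycopg2 connection with `conn.autocommit = True`:

```python
def hnsw_index_sql(table, column="embedding", opclass="vector_cosine_ops"):
    # Non-blocking HNSW index build. Run on an autocommit connection:
    # CREATE INDEX CONCURRENTLY is rejected inside a transaction block.
    return (
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {table}_{column}_hnsw "
        f"ON {table} USING hnsw ({column} {opclass})"
    )

def ef_search_sql(ef=100):
    # Per-session recall knob: widen the HNSW search beam (default 40).
    return f"SET hnsw.ef_search = {ef}"
```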
Choosing between IVFFlat and HNSW indexes in pgvector depends on the performance requirements and dataset size. IVFFlat partitions vectors into lists and searches only the nearest lists, providing faster index build times and lower memory usage but requiring careful tuning of the lists and probes parameters. HNSW builds a hierarchical graph index that provides better recall at comparable query latency but uses 2–3x more memory and takes longer to build. For most production deployments serving online queries, HNSW is the preferred choice due to its consistently higher recall at low latency.
| Index type | Build time | Memory | Recall quality | Best for |
|---|---|---|---|---|
| No index (exact) | N/A | Low | 100% (exact) | <100K rows, exact search |
| IVFFlat | Fast | Medium | Medium (tunable) | Large datasets, build-time sensitive |
| HNSW | Slow | High | High | Production, low latency required |
pgvector's HNSW index folds inserts into the graph as they arrive, while IVFFlat's cluster centroids are fixed at build time: IVFFlat still accepts new rows, but recall degrades as the data drifts from the original clustering until the index is rebuilt. This makes HNSW the safer option for applications with continuous document ingestion where the vector index must remain useful during updates. The m parameter (connections per node, default 16) and ef_construction parameter (candidate list size during build, default 64) control the index quality/build-time tradeoff; increasing them to 32 and 128 respectively produces higher recall with a 3–4x longer build time.
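As a concrete example, the higher-recall build from the paragraph above (the index name is illustrative; run the statement with `cur.execute(HIGH_RECALL_HNSW)`):

```python
# Trade roughly 3-4x build time for higher recall by widening the graph.
HIGH_RECALL_HNSW = '''
CREATE INDEX documents_embedding_hnsw_hq
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128)
'''
```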
Connection pooling is a critical operational consideration for pgvector deployments. Vector similarity queries are computationally more intensive than typical relational queries, holding database connections for longer than simple key lookups. PgBouncer in transaction mode (pooling connections at the transaction level) is the recommended pool configuration, as it handles the higher connection counts typical of horizontally scaled API services without requiring a proportionally large Postgres connection limit. The max_parallel_workers_per_gather Postgres parameter, which controls parallel query execution for vector scans, should be tuned based on available CPU cores to maximize throughput for concurrent vector queries.
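A minimal PgBouncer sketch for transaction-mode pooling; hostnames, pool sizes, and the database name are illustrative:

```ini
; pgbouncer.ini — transaction-mode pooling in front of Postgres
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction     ; release server connections at transaction end
max_client_conn = 1000      ; many API workers may connect
default_pool_size = 20      ; far fewer actual Postgres connections
```

One caveat: in transaction mode, session-level settings such as `SET hnsw.ef_search` do not persist across transactions; use `SET LOCAL` inside the transaction that runs the vector query instead.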
pgvector's metadata filtering enables combined vector and relational queries that filter on structured attributes before or after vector search. The WHERE clause approach (filtering before vector search) reduces the effective search space but may produce lower recall if the filter is highly selective and eliminates most of the relevant vectors. The post-filter approach (retrieve top-k vectors, then apply the WHERE clause in application code) maintains retrieval recall but requires retrieving more candidates than needed to ensure sufficient results after filtering. For high-selectivity filters (returning less than 10% of the corpus), pre-filtering with a larger top-k parameter balances recall and query efficiency.
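The two patterns side by side, as a sketch; the `category` column and the over-fetch factor are illustrative:

```python
# Pre-filter: the WHERE clause runs inside the vector query.
PRE_FILTER_SQL = '''
    SELECT content FROM documents
    WHERE category = %s
    ORDER BY embedding <=> %s
    LIMIT %s
'''

# Post-filter: over-fetch from the index, then filter client-side.
def post_filter(rows, predicate, k):
    """rows must already be ordered best-first (e.g. by cosine distance);
    callers should over-fetch, e.g. request 4*k rows for selective filters."""
    return [r for r in rows if predicate(r)][:k]

hits = [("doc-a", "policy"), ("doc-b", "shipping"), ("doc-c", "policy")]
top = post_filter(hits, lambda r: r[1] == "policy", 2)
# top == [("doc-a", "policy"), ("doc-c", "policy")]
```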
pg_embedding, the competing open-source extension from Neon, has since been deprecated in favor of pgvector, leaving pgvector as the de facto community standard for Postgres vector search. Cloud providers including AWS RDS, Google Cloud SQL, Azure Database for PostgreSQL, and Supabase all support pgvector natively, eliminating the need for self-managed vector database infrastructure for teams already running Postgres. The availability of pgvector on managed Postgres services is the strongest argument for choosing pgvector over standalone vector databases when operational simplicity and a small infrastructure surface area matter more than maximum throughput.
pgvector's support for partial indexes enables memory-efficient retrieval for multi-tenant applications where each tenant's vectors need to be searched independently. Creating a separate HNSW index per tenant using a WHERE clause filter (`CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops) WHERE tenant_id = 'tenant_A'`) ensures that tenant data is never mixed during search and that the index size scales with per-tenant data volume rather than total data volume. This approach provides tenant isolation without requiring separate tables or databases per tenant, simplifying the schema while maintaining search efficiency for per-tenant queries.
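A sketch of the per-tenant pattern, with table and column names following the example in the text and a hypothetical DDL helper. Note that the planner only uses a partial index when the query repeats the exact index predicate:

```python
def tenant_index_sql(tenant_id: str) -> str:
    # One partial HNSW index per tenant. tenant_id is interpolated into
    # DDL, so it must come from a trusted internal list, never user input.
    return (
        f"CREATE INDEX embeddings_{tenant_id}_hnsw ON embeddings "
        f"USING hnsw (embedding vector_cosine_ops) "
        f"WHERE tenant_id = '{tenant_id}'"
    )

# The query must include the same tenant_id predicate, or the planner
# falls back to a sequential scan.
TENANT_QUERY = '''
    SELECT content FROM embeddings
    WHERE tenant_id = %s
    ORDER BY embedding <=> %s
    LIMIT 10
'''
```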
pgvector's vector type supports multiple distance metrics: L2 distance (Euclidean), inner product, and cosine distance. The choice of distance metric should match the embedding model's training objective — models trained with cosine similarity loss (most sentence transformers) should use cosine distance; models trained to maximize inner product (some recommendation models) should use inner product. Using the wrong distance metric for an embedding model can silently degrade retrieval quality by 10–20 percentage points without triggering any errors, making distance metric documentation and validation a necessary step in pgvector integration.
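One way to make the metric a validated choice rather than an implicit one is a lookup table pairing each metric with its pgvector query operator and index operator class, so the index and the query can never disagree:

```python
# metric -> (query operator, index operator class), per the pgvector docs
PGVECTOR_METRICS = {
    "cosine":        ("<=>", "vector_cosine_ops"),
    "l2":            ("<->", "vector_l2_ops"),
    "inner_product": ("<#>", "vector_ip_ops"),  # <#> returns the NEGATIVE inner product
}

def order_by(metric: str) -> str:
    # Raises KeyError on an unknown metric instead of silently degrading.
    op, _opclass = PGVECTOR_METRICS[metric]
    return f"ORDER BY embedding {op} %s"
```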
pgvector performance can degrade under high concurrent write load, because HNSW index updates must lock the portions of the index graph they modify. Applications with high write/read ratios (real-time event streams, live document updates) may see query latency spikes during index update operations. Mitigation strategies include batching writes into low-traffic periods, using IVFFlat indexes (which support batch rebuild more efficiently than HNSW incremental updates), or maintaining a separate "staging" table with recent embeddings that is merged into the main table during off-peak maintenance windows.
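A sketch of the staging-table merge for the off-peak window; table names are illustrative and the statements assume identical schemas:

```python
# Writes land in a small, unindexed staging table during the day; the
# maintenance job moves them into the indexed main table in one transaction.
MERGE_STATEMENTS = [
    "BEGIN",
    "INSERT INTO documents (content, category, embedding) "
    "SELECT content, category, embedding FROM documents_staging",
    "TRUNCATE documents_staging",
    "COMMIT",
]
```

Until the merge runs, queries that must see fresh rows have to search both tables and merge results client-side, since staging rows are not in the main index.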