HNSW & IVFFlat Index Creation & Tuning for Production pgvector

Vector similarity search has moved from experimental ML workloads into core infrastructure for search, recommendation, and retrieval-augmented generation, and inside PostgreSQL the pgvector extension is where that search actually runs. Default index settings rarely survive production scale: recall drifts, build windows balloon, and query latency becomes unpredictable as embedding volumes grow. This guide is the reference for creating, tuning, and operating HNSW and IVFFlat indexes under real constraints — covering algorithm selection, parameter calibration, pipeline-aware build strategies, operational runbooks, and capacity planning for teams running vector search at scale.

Ingestion and query paths converge on a single shared ANN index. Build knobs (m, ef_construction, lists) are locked at CREATE INDEX; ef_search and probes are tuned per session at query time.

Core Concepts & Data Modeling

Every pgvector index sits on top of a vector, halfvec, or bit column stored inside an ordinary PostgreSQL table, which means index behaviour inherits the engine’s MVCC semantics, heap layout, and TOAST rules. A single-precision vector(d) occupies 4 * d + 8 bytes per row before index overhead, so a 1,536-dimension OpenAI embedding is roughly 6.2 KB of heap payload per tuple. That storage math drives nearly every downstream decision — memory budget, build time, and the amount of I/O each scan must touch — and is worth internalising before any index exists. The full accounting of column width, TOAST thresholds, and per-row amplification is covered in pgvector storage overhead analysis, and the trade-offs between the vector and halfvec types are laid out in vector data type selection.
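The per-row storage arithmetic above can be sketched in a few lines. This is a minimal illustration, not part of pgvector: the helper name `row_payload_bytes` is hypothetical, and the `halfvec` figure assumes the 2 * d + 8 layout given in the pgvector documentation.

```python
def row_payload_bytes(dims: int, column_type: str = "vector") -> int:
    """Approximate bytes for one embedding value, before any index overhead."""
    # vector stores 4-byte floats, halfvec stores 2-byte floats; both add 8 bytes.
    bytes_per_component = {"vector": 4, "halfvec": 2}[column_type]
    return bytes_per_component * dims + 8


for column_type in ("vector", "halfvec"):
    size = row_payload_bytes(1536, column_type)
    print(f"{column_type}(1536): {size} bytes (~{size / 1000:.1f} KB)")
```

Running the same helper over the dimensions of each candidate embedding model gives a quick first read on how much heap each choice costs per row.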

Data modeling decisions made before indexing constrain what the index can achieve. Three choices dominate:

Distance operator class. An index is built for exactly one operator: vector_cosine_ops, vector_l2_ops, or vector_ip_ops. A cosine index cannot serve an L2 query, and the planner will silently fall back to a sequential scan if the query operator does not match the index. The decision framework for this is in cosine vs L2 distance metrics; the practical rule is that cosine and inner-product only behave identically when vectors are unit-normalized.
Normalization contract. Cosine search is cheapest and most numerically stable when vectors arrive pre-normalized: ranking by inner product then matches ranking by cosine, so you can index with vector_ip_ops and query with the faster inner-product operator. Normalizing inside the ingestion pipeline rather than at query time keeps the index consistent; see normalizing embeddings before pgvector insertion.
Row shape and filtering columns. Metadata columns used for pre-filtering (tenant id, document type, timestamp) should live alongside the vector so that partial indexes and WHERE clauses can prune candidates. Schema drift in those columns quietly degrades filter selectivity over time.
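The claim that cosine and inner product agree only on unit-normalized vectors can be checked with a small pure-Python sketch. The vectors and helper names below are illustrative assumptions; `neg_inner_product` mirrors the fact that pgvector's `<#>` operator returns the negative inner product so that smaller means closer.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def neg_inner_product(a, b):
    return -sum(x * y for x, y in zip(a, b))


def rank(query, docs, distance):
    """Return document positions ordered nearest-first under the given distance."""
    return sorted(range(len(docs)), key=lambda i: distance(query, docs[i]))


raw_docs = [[1.0, 0.0], [5.0, 5.0], [0.0, 1.0]]
query = [1.0, 0.2]
unit_docs = [normalize(d) for d in raw_docs]
unit_query = normalize(query)

print("cosine, raw vectors:       ", rank(query, raw_docs, cosine_distance))
print("inner product, raw vectors:", rank(query, raw_docs, neg_inner_product))
print("inner product, normalized: ", rank(unit_query, unit_docs, neg_inner_product))
```

On the raw vectors the long second document wins under inner product purely because of its magnitude; after normalization the two orderings coincide.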

SQL

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
    id          bigserial PRIMARY KEY,
    tenant_id   integer      NOT NULL,
    doc_type    text         NOT NULL,
    created_at  timestamptz  NOT NULL DEFAULT now(),
    embedding   vector(1536) NOT NULL
);

Index Architecture & Algorithm Overview

pgvector ships two approximate-nearest-neighbor (ANN) structures, and the choice between them dictates memory footprint, query latency, and update overhead. HNSW (Hierarchical Navigable Small World) builds a multi-layer proximity graph: each vector is a node with bidirectional links to its nearest neighbors, arranged in logarithmically thinning layers so a search greedily descends from a sparse top layer to a dense base layer. It delivers high recall at low latency, which makes it the default for interactive search and real-time retrieval, but it costs more memory and far longer build time. IVFFlat (Inverted File with Flat storage) runs k-means to partition the vector space into Voronoi cells, as many as the lists parameter specifies, then at query time scans only the cells nearest the query, as many as ivfflat.probes allows. It builds quickly and uses less memory, but recall depends entirely on partition quality and degrades when the data distribution drifts away from the centroids learned at build time.

The full workload-driven decision matrix — recall SLAs, write rates, memory ceilings, and dataset size thresholds — is maintained in the HNSW vs IVFFlat algorithm selection framework. The table below is the summary teams should anchor on:

Dimension	HNSW	IVFFlat
Index structure	Layered proximity graph	k-means Voronoi partitions
Query complexity	~O(log N) graph traversal	Linear within probed lists
Recall at default settings	High (0.95+ typical)	Moderate; needs `probes` tuning
Build time	Slow (graph construction)	Fast (single k-means pass)
Memory footprint	High (neighbor lists per node)	Low (centroids + flat lists)
Handles distribution drift	Yes (no fixed centroids)	No (centroids fixed at build)
Update / insert cost	Higher (graph mutation)	Lower (assign to nearest centroid)
Best fit	Read-heavy, strict recall, flexible RAM	Write-heavy, RAM-constrained, fast rebuilds

HNSW greedily descends a layered graph (teal path) for logarithmic query cost at high memory; IVFFlat scans only the probed cells (teal) nearest the query, staying lean but tied to centroids fixed at build time.

A practical corollary: IVFFlat does not re-learn its centroids as data changes. New rows are assigned to the nearest existing centroid at insert time, so once the underlying distribution shifts — common when an embedding model is fine-tuned or swapped — the partitioning no longer reflects the data, and raising ivfflat.probes only partially compensates. HNSW, by contrast, mutates the graph on every insert and stays coherent, at the cost of write amplification and WAL volume.
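A toy model makes the fixed-centroid behaviour concrete. The sketch below is an assumption-laden illustration of the idea, not pgvector's implementation: centroids are given rather than learned, distances are plain L2, and all function names are hypothetical.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_lists(vectors, centroids):
    """Assign each vector id to its nearest fixed centroid, as happens at insert time."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, probes, k):
    """Scan only the `probes` cells whose centroids are nearest the query."""
    probed = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:probes]
    candidates = [vid for c in probed for vid in lists[c]]
    return sorted(candidates, key=lambda vid: l2(query, vectors[vid]))[:k]


centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = [[4.9, 0.0], [5.2, 0.0], [0.0, 1.0], [10.0, 1.0]]
lists = build_lists(vectors, centroids)
query = [4.95, 0.0]

print("probes=1:", ivf_search(query, vectors, centroids, lists, probes=1, k=2))
print("probes=2:", ivf_search(query, vectors, centroids, lists, probes=2, k=2))
```

Vectors 0 and 1 are near neighbors of the query but fall on opposite sides of the cell boundary, so a single probe misses one of them. Scanning both cells recovers it, at the cost of touching every list.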

Parameter Reference

The knobs below are the ones that move recall and latency in production. Build-time parameters (m, ef_construction, lists) are fixed at CREATE INDEX and can only be changed by rebuilding; query-time parameters (hnsw.ef_search, ivfflat.probes) are session-level and can be tuned live. Detailed calibration procedures for the HNSW construction parameters live in optimizing m and ef_construction parameters, and IVFFlat sizing in tuning IVFFlat lists for high-throughput similarity search.

Parameter	Scope	Default	Production recommendation	Notes
`m`	HNSW build	16	16–32 (48 for >768 dims / recall-critical)	Max neighbor links per node; drives memory linearly and cannot be altered in place.
`ef_construction`	HNSW build	64	128–256	Candidate-list width during build; raise with `m` to keep graph quality; increases build time and WAL.
`hnsw.ef_search`	HNSW query	40	40–200, tuned per traffic tier	Search-time candidate width; the primary latency/recall dial. Set per-session, not globally.
`lists`	IVFFlat build	100	`rows / 1000` up to ~1M rows, then ≈ `sqrt(rows)`	Number of Voronoi cells; too few → recall cliff, too many → centroid-scan overhead.
`ivfflat.probes`	IVFFlat query	1	`sqrt(lists)` as a starting point	Cells scanned per query; raise until recall plateaus, then stop.
`maintenance_work_mem`	Build session	64 MB	≥ index working-set size (often several GB)	Too low forces disk spills, checkpoint storms, and 10x+ build times.
`max_parallel_maintenance_workers`	Build session	2	4–8 on large boxes	Parallel HNSW build workers; scale with available cores and I/O headroom.
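As a starting point for the IVFFlat rows in the table, the sizing heuristics can be written down directly. This sketch follows the rule of thumb in the pgvector README (rows / 1000 up to about 1M rows, sqrt(rows) beyond that, and sqrt(lists) for probes); the helper names are hypothetical, and the outputs are first guesses to validate against measured recall.

```python
import math


def suggest_lists(rows: int) -> int:
    """First-guess `lists` value for an IVFFlat index."""
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return max(1, round(math.sqrt(rows)))


def suggest_probes(lists: int) -> int:
    """First-guess `ivfflat.probes` value for a given `lists`."""
    return max(1, round(math.sqrt(lists)))


for rows in (200_000, 1_000_000, 25_000_000):
    lists = suggest_lists(rows)
    print(f"rows={rows:>10,}  lists={lists:>5}  probes={suggest_probes(lists)}")
```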

Set query-time parameters per session so different services can trade latency for recall independently:

SQL

-- HNSW: widen the search frontier for a high-recall tier
SET hnsw.ef_search = 120;

-- IVFFlat: probe more cells for the same query
SET ivfflat.probes = 12;

HNSW memory during construction must be provisioned explicitly through maintenance_work_mem. When the graph outgrows that allocation, pgvector emits a notice and continues the build on disk, triggering checkpoint storms and stretching build windows by an order of magnitude or more. A useful first-pass estimate of the graph-structure overhead (separate from raw vector storage) is (rows * m * 4 * 1.1) / 1024^3 GiB: the graph cost is driven by per-node neighbor lists (m), not by dimensionality. Treat it as a floor, since base-layer nodes can hold up to 2 * m links (the rows * m * 8 figure used under capacity planning), and validate it by watching the build in pg_stat_progress_create_index and checking for pgvector's notice that the graph no longer fits into maintenance_work_mem.
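The first-pass graph estimate is easy to script for a few candidate values of m. The function below simply evaluates the formula from the paragraph above; its name and keyword defaults are hypothetical, and the result excludes raw vector storage.

```python
def hnsw_graph_gib(rows: int, m: int, link_bytes: int = 4, slack: float = 1.1) -> float:
    """Evaluate (rows * m * link_bytes * slack) / 1024^3, the graph-only estimate in GiB."""
    return rows * m * link_bytes * slack / 1024 ** 3


for m in (16, 24, 32):
    print(f"10M rows, m={m}: ~{hnsw_graph_gib(10_000_000, m):.2f} GiB of graph structure")
```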

SQL

CREATE INDEX ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 24, ef_construction = 200);

CREATE INDEX ON document_chunks
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 1000);

Pipeline Integration

An index is only ever as fresh and as well-distributed as the pipeline that feeds it. The upstream stages — chunking, embedding generation, type casting, normalization, and batched upsert — all shape index quality, so index tuning cannot be separated from ingestion design. The end-to-end ingestion architecture is covered in the embedding ingestion pipeline engineering section; three couplings matter most for indexing:

Batch geometry. Large COPY or multi-row INSERT ... ON CONFLICT batches amortize WAL and index-maintenance cost far better than row-at-a-time writes. The chunk and batch sizing that keeps embeddings coherent is in batch and chunking strategies for embeddings.
Build ordering. For a bulk backfill, load rows first and build the index afterward. Building HNSW incrementally during a multi-million-row load produces a worse graph and far more WAL than a single post-load build. For continuous streams, the index must stay live during writes — that is where concurrent builds and the async strategies below apply.
Concurrency and back-pressure. High-throughput ingestion runs through connection pools and async workers; the async processing with Python asyncio patterns keep insert concurrency bounded so index maintenance does not starve query traffic.

PYTHON

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

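def upsert_batch(conn, rows):
    # rows: list[tuple[tenant_id, doc_type, embedding_np]]
    register_vector(conn)  # safe to repeat, but cheaper to call once per connection
    with conn.cursor() as cur, cur.copy(
        "COPY document_chunks (tenant_id, doc_type, embedding) FROM STDIN WITH (FORMAT BINARY)"
    ) as copy:
        # Binary COPY needs explicit column types; psycopg cannot infer them
        copy.set_types(["int4", "text", "vector"])
        for tenant_id, doc_type, emb in rows:
            norm = np.linalg.norm(emb)
            if norm == 0:
                raise ValueError("cannot unit-normalize a zero vector")
            v = (emb / norm).astype(np.float32)    # unit-normalize for cosine/IP
            copy.write_row((tenant_id, doc_type, v))
    conn.commit()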
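To keep batch geometry bounded when rows arrive as a stream, a small chunking generator can sit in front of a loader like upsert_batch. This is a sketch under assumptions: the helper name `batched` and any batch size you pass are illustrative, and the right size depends on measured WAL and memory behaviour.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


print(list(batched(range(7), 3)))
```

In an ingestion loop, each yielded batch would be passed to the loader in turn, so one COPY covers one bounded chunk and a failure only forces a retry of that chunk.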

When the index must remain queryable during ingestion, build it without an exclusive lock. CREATE INDEX CONCURRENTLY is the mechanism, but on vector tables it introduces its own memory and checkpoint pressure; the orchestration patterns — background workers, off-peak scheduling, retry backoff — are detailed in asynchronous index build strategies, and a fully worked production build sequence is in step-by-step HNSW index creation for production workloads.

SQL

-- Non-blocking build; monitor progress and lock waits separately
CREATE INDEX CONCURRENTLY idx_chunks_hnsw ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 24, ef_construction = 200);

CREATE INDEX CONCURRENTLY takes a ShareUpdateExclusiveLock on the table and then waits for every transaction that predates each build phase to finish, so watch pg_stat_activity for the build session sitting in a lock wait. Long-running analytical queries or idle-in-transaction sessions will stall the build indefinitely, and PostgreSQL will not cancel it on client disconnect: orphaned workers linger and hold resources. Pipeline automation should run a pre-flight check for long-lived transactions, enforce statement_timeout and idle_in_transaction_session_timeout on the sessions that could block the build (not on the build session itself, where statement_timeout would abort it), and clean up any resulting invalid index before retrying.
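The retry-and-cleanup loop described above can be sketched independently of any database driver. Everything here is a hypothetical illustration: `build` and `drop_invalid` stand in for callables that would run the CREATE INDEX CONCURRENTLY and the invalid-index cleanup, and the attempt count and delays are placeholder values.

```python
import time
from typing import Callable


def build_with_retry(
    build: Callable[[], None],
    drop_invalid: Callable[[], None],
    max_attempts: int = 3,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `build`; on failure clean up, back off exponentially, and try again.

    Returns the attempt number that succeeded and re-raises the last error
    once `max_attempts` is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            build()
            return attempt
        except Exception:
            # A failed concurrent build leaves an invalid index behind.
            drop_invalid()
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Injecting `sleep` keeps the function testable and lets a scheduler substitute its own wait mechanism, for example deferring the retry to the next off-peak window instead of blocking a worker.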

Operational Runbook

Vector indexes need the same lifecycle discipline as any other PostgreSQL index — VACUUM, REINDEX, and continuous monitoring — plus a few vector-specific concerns. The verification and error-triage side of this is standardized in index validation and error categorization; the day-two operations are below.

Monitor build progress. During any build, pg_stat_progress_create_index reports the current phase and tuple counts:

SQL

SELECT phase,
       round(100.0 * blocks_done  / nullif(blocks_total, 0), 1)  AS pct_blocks,
       round(100.0 * tuples_done  / nullif(tuples_total, 0), 1)  AS pct_tuples
FROM pg_stat_progress_create_index;

Watch index usage and planner behaviour. The clearest signal that an index has stopped being used — usually a metric/operator mismatch or a planner cost miss — is a flat idx_scan count:

SQL

SELECT indexrelname,
       idx_scan,
       idx_tup_read,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'document_chunks'
ORDER BY idx_scan DESC;

VACUUM and bloat. MVCC leaves dead tuples behind on every update and delete; for HNSW, deletes also leave tombstoned graph nodes that are only reclaimed by VACUUM. Vector tables under heavy churn need autovacuum tuned more aggressively than the defaults (lower autovacuum_vacuum_scale_factor, higher cost limit) so dead-tuple accumulation does not inflate scan cost. Confirm health with pg_stat_user_tables:

SQL

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'document_chunks';
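To see why the default autovacuum settings are too lax for a large, high-churn vector table, it helps to compute the dead-tuple count at which autovacuum fires. PostgreSQL triggers a vacuum once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * table rows (defaults 50 and 0.2). The helper name and the 0.02 alternative below are illustrative assumptions.

```python
def autovacuum_trigger_point(
    live_tuples: int, threshold: int = 50, scale_factor: float = 0.2
) -> float:
    """Dead tuples that must accumulate before autovacuum processes the table."""
    return threshold + scale_factor * live_tuples


rows = 10_000_000
for scale_factor in (0.2, 0.02):
    trigger = autovacuum_trigger_point(rows, scale_factor=scale_factor)
    print(f"scale_factor={scale_factor}: vacuum after ~{trigger:,.0f} dead tuples")
```

At the default scale factor a 10M-row table accumulates about two million dead tuples before cleanup starts, which is the motivation for lowering the setting per table.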

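REINDEX and rebuilds. Because m, ef_construction, and lists are immutable, changing them means rebuilding. REINDEX INDEX CONCURRENTLY rebuilds an index in place with its stored parameters without blocking reads and writes, which suits bloat cleanup and IVFFlat centroid refreshes; schedule those IVFFlat rebuilds against embedding-model version cycles so centroids track the current distribution. REINDEX on its own does not change build parameters: the straightforward route is to build a replacement with CREATE INDEX CONCURRENTLY under the new settings and drop the old index once the new one is valid.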

SQL

REINDEX INDEX CONCURRENTLY idx_chunks_hnsw;

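Recover invalid indexes. A concurrent build that fails or is cancelled mid-flight leaves an index marked invalid; it serves no queries but still adds overhead to every write. Find these and remove them with DROP INDEX CONCURRENTLY before retrying: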

SQL

SELECT c.relname
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE i.indisvalid = false;

Security & Multi-Tenancy Considerations

Vectors are not opaque: an embedding can leak information about the source text, and in a shared table one tenant’s query must never surface another tenant’s rows. Two controls carry most of the load, and both are detailed in security boundaries for vector data and its companion how-to, securing pgvector tables with row-level security.

Row-level security (RLS). Enforce tenant isolation in the database, not just the application. An RLS policy keyed on a session variable guarantees that even a mis-scoped ORDER BY embedding <=> $1 cannot cross tenant boundaries. Note that RLS predicates run as filters on top of the ANN scan, so a tenant that owns only a small fraction of the rows still benefits from a partial or composite approach.
Partial and per-tenant indexes. For a small number of large tenants, partial indexes (WHERE tenant_id = N) keep each graph tight and recall high. For many small tenants, a single shared index with an RLS filter is more storage-efficient; the crossover depends on tenant-size skew.

SQL

ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON document_chunks
  USING (tenant_id = current_setting('app.tenant_id')::int);

-- Optional: keep a large tenant's graph isolated and dense
CREATE INDEX CONCURRENTLY idx_chunks_hnsw_t42 ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 24, ef_construction = 200)
  WHERE tenant_id = 42;

A subtle multi-tenancy trap: heavy filtering can push ANN recall down. The filter is applied to the candidates the index scan returns, so when few rows in the traversed graph region or the probed lists match it, the query can come back with fewer than LIMIT rows or miss true neighbors, and the planner may abandon the index for a sequential scan altogether. Validate recall with the tenant filter applied, not on the global dataset, and raise hnsw.ef_search or ivfflat.probes for heavily filtered tiers.
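The filtering trap can be reproduced with a deliberately simplified model in which the ANN scan hands back a fixed number of nearest-first candidates and the tenant filter runs afterwards. The ids, tenant assignments, and function name are invented for illustration; real scans are more nuanced, but the shortfall mechanism is the same.

```python
def post_filtered_top_k(ranked_candidates, tenant_of, tenant_id, ef_search, k):
    """Keep the first `ef_search` candidates, then apply the tenant filter and LIMIT k."""
    scanned = ranked_candidates[:ef_search]
    return [rid for rid in scanned if tenant_of[rid] == tenant_id][:k]


# Row ids in nearest-first order, as an index scan might return them.
ranked = [11, 7, 3, 19, 5, 2, 8, 13, 17, 23]
# Tenant 42 owns only rows 3 and 8; every other row belongs to tenant 7.
tenant_of = {rid: (42 if rid in (3, 8) else 7) for rid in ranked}

print("ef_search=5: ", post_filtered_top_k(ranked, tenant_of, 42, ef_search=5, k=2))
print("ef_search=10:", post_filtered_top_k(ranked, tenant_of, 42, ef_search=10, k=2))
```

With the narrow candidate window the tenant receives one row instead of two; widening it recovers the second match, which is why heavily filtered tiers need a larger ef_search or probes setting, or a partial index.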

Performance Benchmarks & Capacity Planning

Capacity planning for vector indexes reduces to three budgets — memory, storage, and build time — each of which has a workable formula. Treat these as starting points and confirm against live telemetry.

Memory (RAM to keep the index hot). For HNSW, the searchable structure should fit in RAM to avoid per-query disk reads. Approximate the resident set as vector storage plus graph overhead:

Vector bytes: rows * (4 * dims + 8)
HNSW graph bytes: rows * m * 8 (neighbor lists, both directions)

A 10M-row, 1,536-dim, m = 24 HNSW index therefore needs roughly 10e6 * (4*1536 + 8) ≈ 61.5 GB (57.3 GiB) of vectors plus 10e6 * 24 * 8 ≈ 1.9 GB (1.8 GiB) of graph, a strong argument for halfvec (halving the vector term) on high-dimensional models. A full worked example is in calculating pgvector storage requirements for 10M embeddings.
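The two memory formulas combine into a small calculator that makes the unit handling explicit. This sketch only evaluates the expressions given above; the function name is hypothetical, and the `halfvec` case assumes 2 bytes per component.

```python
def hnsw_resident_bytes(rows: int, dims: int, m: int, bytes_per_component: int = 4):
    """Return (vector_bytes, graph_bytes) using the capacity-planning formulas."""
    vector_bytes = rows * (bytes_per_component * dims + 8)
    graph_bytes = rows * m * 8
    return vector_bytes, graph_bytes


def describe(label: str, num_bytes: int) -> str:
    return f"{label}: {num_bytes / 1e9:.1f} GB ({num_bytes / 1024 ** 3:.1f} GiB)"


for type_name, width in (("vector", 4), ("halfvec", 2)):
    vectors, graph = hnsw_resident_bytes(10_000_000, 1536, 24, bytes_per_component=width)
    print(type_name, "|", describe("vectors", vectors), "|", describe("graph", graph))
```

Reporting both decimal GB and binary GiB avoids the common mistake of comparing a GiB estimate against a RAM figure quoted in GB, or the reverse.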

Build time and I/O. HNSW build cost scales with rows * ef_construction * log(rows); doubling ef_construction roughly doubles build time. IVFFlat build is dominated by a single k-means pass, near-linear in rows, which is why it is the pragmatic choice when rebuild frequency is high. Provision maintenance_work_mem to hold the working set and raise max_parallel_maintenance_workers on multi-core hosts to parallelize HNSW construction.
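The scaling relation for HNSW builds is most useful as a ratio between two configurations rather than as an absolute number. The sketch below evaluates rows * ef_construction * log(rows) from the paragraph above; the function names are hypothetical, and the output is a relative cost, not a time prediction.

```python
import math


def relative_hnsw_build_cost(rows: int, ef_construction: int) -> float:
    """Unitless build cost proportional to rows * ef_construction * log(rows)."""
    return rows * ef_construction * math.log(rows)


def cost_ratio(before: tuple, after: tuple) -> float:
    """How many times more expensive `after` is than `before`; each is (rows, ef_construction)."""
    return relative_hnsw_build_cost(*after) / relative_hnsw_build_cost(*before)


print(f"ef_construction 64 -> 128 at 10M rows: x{cost_ratio((10_000_000, 64), (10_000_000, 128)):.2f}")
print(f"1M -> 10M rows at ef_construction 128: x{cost_ratio((1_000_000, 128), (10_000_000, 128)):.2f}")
```

Growing the table tenfold costs somewhat more than ten times the build effort because of the logarithmic term, which is worth folding into maintenance-window estimates.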

Recall benchmarking. Recall cannot be assumed from parameters; measure it against exact brute-force results on a held-out sample. Run automated recall tests over a stratified sample of 10,000–50,000 vectors, and treat a drop beyond ~2% at the target ef_search/probes as a regression to investigate:

PYTHON

import numpy as np

def recall_at_k(exact_ids, approx_ids, k=10):
    hits = [len(set(a[:k]) & set(e[:k])) / k
            for a, e in zip(approx_ids, exact_ids)]
    return float(np.mean(hits))

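# exact_ids: ground truth from ORDER BY embedding <=> q with the index bypassed
#            (sequential scan, same operator as the index under test)
# approx_ids: results from the ANN index at the tier's ef_search / probes
# Tiny illustrative sample (two queries, k = 3); real runs use thousands of queries.
exact_ids = [[1, 2, 3], [4, 5, 6]]
approx_ids = [[1, 2, 9], [4, 5, 6]]
print(f"recall@3 = {recall_at_k(exact_ids, approx_ids, k=3):.4f}")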

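Pair this with EXPLAIN (ANALYZE, BUFFERS) to confirm the planner actually chose the index rather than falling back to a sequential scan. If bad row estimates on the filter columns are driving bad plans, raise their statistics target with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS and re-run ANALYZE; default_statistics_target is a server-wide setting rather than a per-column one.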
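The regression threshold mentioned above can be turned into a simple promotion gate. This sketch assumes the roughly 2% figure means an absolute drop of 0.02 in recall against a recorded baseline; that interpretation, the function name, and the sample values are assumptions.

```python
def recall_regressed(baseline: float, current: float, tolerance: float = 0.02) -> bool:
    """True when recall has dropped by more than `tolerance` (absolute) from the baseline."""
    return (baseline - current) > tolerance


for current in (0.96, 0.94):
    status = "REGRESSION" if recall_regressed(0.97, current) else "ok"
    print(f"baseline=0.97 current={current}: {status}")
```

A deployment pipeline could call this after each index rebuild or parameter change and block promotion when it returns True.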

Adoption Checklist

Teams standing up production vector search on pgvector should be able to answer each of these before routing live traffic:

Algorithm chosen against the workload, not the demo — recall SLA, write rate, RAM ceiling, and rebuild cadence all fed into the HNSW vs IVFFlat algorithm selection decision.
Operator class matches the query — cosine/L2/inner-product aligned end to end, vectors normalized in the pipeline where cosine is used.
Build parameters calibrated and recorded — m, ef_construction, or lists chosen from measured recall, with the values version-controlled alongside the schema.
maintenance_work_mem sized for the build — no disk spill, verified against pg_stat_progress_create_index.
Builds are non-blocking in production — concurrent builds, off-peak scheduling, and invalid-index recovery automated.
Query-time knobs are per-tier — ef_search/probes exposed as tunable runtime settings, not global constants.
Recall is monitored, not assumed — automated ground-truth benchmarking gates promotions and alerts on regression.
Isolation is enforced in the database — RLS and, where warranted, partial per-tenant indexes, validated with the tenant filter applied.
Lifecycle jobs scheduled — autovacuum tuned for churn, IVFFlat rebuilds aligned to embedding-model versions, REINDEX CONCURRENTLY for parameter changes.

Version compatibility and reference build patterns live in the pgvector repository; PostgreSQL’s own guidance on non-blocking builds is in the docs for concurrent index creation.

Related

Up: pgvector Architecture & Vector Fundamentals: storage mechanics and data-type foundations this tuning work sits on.
HNSW vs IVFFlat algorithm selection — the workload-driven decision framework.
Optimizing m and ef_construction parameters — HNSW build-parameter calibration.
Asynchronous index build strategies — zero-downtime concurrent builds for live tables.
Index validation & error categorization — recall benchmarking and failure triage.
Embedding ingestion pipeline engineering — the upstream pipeline that feeds these indexes.
