Batch Chunking Strategies for Embeddings

High-throughput embedding pipelines rarely collapse at the model inference layer; they fracture at the ingestion boundary. When raw documents overflow a model’s context window, trigger memory pressure during tokenization, or emit misaligned vector payloads, the entire retrieval stack degrades: recall drops silently, upserts deadlock under concurrency, and the ANN index bloats with near-duplicate fragments. Within the broader embedding ingestion pipeline engineering discipline, chunking is the primary control point for memory allocation, network egress, and vector-store synchronization. This page compares the three chunking strategies that matter in production, walks through a runnable PostgreSQL and Python implementation, and shows how to validate that boundary decisions preserve recall before you commit them at scale.

Architectural Divergence & Trade-offs

Three chunking strategies dominate production ingestion, and the correct choice is dictated by corpus homogeneity, the embedding model’s native context window, and the recall target you are willing to defend.

Strategy	How boundaries are chosen	Strengths	Costs	Best fit
Fixed-token windowing	Split at a hard `chunk_size` token count with a fixed `overlap`	Deterministic memory footprint, trivial to parallelize, predictable batch shapes	Cuts mid-sentence; boundary fragmentation lowers recall on prose	Logs, transcripts, uniform records
Recursive-character splitting	Descend a separator hierarchy (paragraph → sentence → word) until each fragment fits `max_chunk_tokens`	Preserves semantic units, minimal padding waste, resilient to heterogeneous input	Variable chunk sizes complicate batch packing	Mixed corpora: docs, HTML, code, markdown
Semantic (embedding-guided) chunking	Group adjacent sentences whose pairwise similarity stays above a threshold	Highest retrieval coherence, boundaries follow meaning	Requires a pre-embedding pass — 2-3x ingestion cost	High-value knowledge bases, RAG over dense reference text

Fixed-token windowing is the baseline because it makes memory allocation and batch shapes predictable — every fragment is the same size, so VRAM budgeting and buffer pre-allocation become arithmetic rather than guesswork. Recursive-character splitting is the pragmatic default for real corpora: it respects natural separators and collapses padding waste, at the price of variable-length fragments that the batching layer must repack. Semantic chunking buys the highest recall but doubles or triples ingestion cost, so reserve it for reference material where retrieval precision directly drives product value.
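As a rough sketch of that baseline, fixed-token windowing reduces to a strided slice over a token sequence. The function name `fixed_windows` and the use of a plain list of integers as stand-in token ids are illustrative assumptions, not part of any library:

```python
def fixed_windows(tokens, chunk_size=512, overlap=64):
    """Yield fixed-size token windows; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    for start in range(0, len(tokens), stride):
        yield tokens[start:start + chunk_size]
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document


tokens = list(range(1200))  # stand-in for real token ids
windows = list(fixed_windows(tokens, chunk_size=512, overlap=64))
print([len(w) for w in windows])  # [512, 512, 304]
```

Every window except the last has exactly `chunk_size` tokens, which is what makes buffer sizes predictable; the short final window is the only irregular shape the batching layer has to handle.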

A cross-cutting decision is the storage type of the resulting vectors, which interacts with chunk volume: at hundreds of millions of fragments, the choice covered in vector data type selection (vector vs halfvec) changes your index footprint more than any tuning knob. Quantify the blast radius first with pgvector storage overhead analysis so chunk granularity is chosen against a real byte budget, not a guess.
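To make the byte budget concrete, the sketch below estimates raw column storage only. It relies on the sizes pgvector documents for its types (`vector`: 4 × dimensions + 8 bytes, `halfvec`: 2 × dimensions + 8 bytes); the helper name `raw_vector_bytes` and the 200M fragment count are illustrative assumptions, and row headers, other columns, and the ANN index are deliberately excluded:

```python
def raw_vector_bytes(fragments, dims, bytes_per_dim):
    """Raw vector column storage: excludes row headers, other columns, and indexes."""
    return fragments * (bytes_per_dim * dims + 8)


GIB = 1024 ** 3
for type_name, width in (("vector", 4), ("halfvec", 2)):
    total = raw_vector_bytes(200_000_000, 768, width)
    print(f"{type_name:8s} {total / GIB:8.1f} GiB")
```

Because the total scales with fragment count, halving chunk size roughly doubles this figure before any index overhead is added.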

Parameter Space & Diagnostic Workflow

The knobs below govern the trade-off between semantic fidelity and ingestion throughput. Defaults are conservative; the production recommendation column reflects settings that survive contact with a 10M+ fragment corpus.

Parameter	Layer	Default	Production recommendation	Notes
`chunk_size`	Segmenter	512 tokens	512 for context ≤ 8k models; 768–1024 for large-context models	Match to the model’s native window, not the max
`overlap`	Segmenter	0	10–15% of `chunk_size`	Prevents boundary fragmentation; higher inflates fragment count
`max_chunk_tokens`	Segmenter	none	Hard cap = model window − 8	Enforce before serialization to avoid silent truncation
`batch_size`	Embedding	32	Power of 2 aligned to VRAM (64/128/256)	Drives tensor-core utilization; scale down on OOM
`insert_batch`	Upsert	1 row	512–2048 rows per statement	Amortizes round-trips; keep under `max_wal_size` pressure
`maintenance_work_mem`	Index build	64MB	1–2GB during bulk load	Prevents spill-to-disk on `CREATE INDEX`

Before committing a configuration, profile the fragment distribution the settings actually produce. Skewed chunk length distributions are the leading indicator of a mis-tuned segmenter, and they surface long before recall regressions do:

SQL

-- Fragment-length distribution for the current ingestion batch.
-- Wide spread or a spike at max_chunk_tokens signals truncation or a bad separator hierarchy.
SELECT
    width_bucket(token_count, 0, 1024, 16) AS bucket,
    count(*)                               AS fragments,
    round(avg(token_count))                AS avg_tokens,
    max(token_count)                       AS max_tokens
FROM embedding_staging
WHERE batch_id = current_setting('pipeline.batch_id')::uuid
GROUP BY bucket
ORDER BY bucket;

A healthy recursive-split run shows a smooth taper toward max_chunk_tokens; a hard column of fragments sitting exactly at the cap means the segmenter is truncating rather than splitting on a separator.
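The same check can run in the segmenter process before anything is staged. This is a minimal sketch: `cap_spike_ratio` is a hypothetical helper, and any alert threshold you compare it against is something to calibrate on your own corpus:

```python
def cap_spike_ratio(token_counts, max_chunk_tokens):
    """Fraction of fragments sitting at (or above) the hard cap."""
    if not token_counts:
        return 0.0
    at_cap = sum(1 for n in token_counts if n >= max_chunk_tokens)
    return at_cap / len(token_counts)


counts = [120, 340, 500, 500, 500, 410]  # illustrative fragment lengths
print(cap_spike_ratio(counts, max_chunk_tokens=500))  # 0.5
```

A ratio near zero is consistent with fragments ending on separators; a large ratio suggests the segmenter is hitting the cap rather than finding a boundary.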

Step-by-Step Implementation

The reference flow below stages fragments, embeds them in aligned batches, normalizes, and upserts idempotently. Every fragment carries a deterministic identity so retries never duplicate vectors.

1. Define the staging and production tables

Idempotent ingestion depends on a composite natural key. The staging table absorbs writes without contending on the ANN index; the production table carries the vector and its index.

SQL

CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector supplies the vector type

CREATE TABLE embedding_staging (
    batch_id      uuid        NOT NULL,
    doc_id        text        NOT NULL,
    chunk_index   int         NOT NULL,
    content_hash  bytea       NOT NULL,
    token_count   int         NOT NULL,
    body          text        NOT NULL,
    embedding     vector(768),
    status        text        NOT NULL DEFAULT 'pending',
    PRIMARY KEY (doc_id, chunk_index)
);

CREATE TABLE document_chunks (
    doc_id        text         NOT NULL,
    chunk_index   int          NOT NULL,
    content_hash  bytea        NOT NULL,
    embedding     vector(768)  NOT NULL,
    inserted_at   timestamptz  NOT NULL DEFAULT now(),
    PRIMARY KEY (doc_id, chunk_index)
);

2. Segment documents with a bounded recursive splitter

Enforce max_chunk_tokens before anything leaves the segmenter. The content_hash over the fragment body is what makes re-ingestion of an unchanged document a no-op.

PYTHON

import hashlib
from dataclasses import dataclass

SEPARATORS = ["\n\n", "\n", ". ", " "]  # paragraph -> line -> sentence -> word

@dataclass
class Chunk:
    doc_id: str
    chunk_index: int
    body: str
    token_count: int
    content_hash: bytes

def recursive_split(text, count_tokens, max_tokens, overlap_tokens, seps=SEPARATORS):
    """Split text so every fragment is <= max_tokens, descending the separator list.

    overlap_tokens is passed through but not applied here; adjacent fragments do not overlap.
    """
    if count_tokens(text) <= max_tokens:
        return [text]
    if not seps:
        # Separators exhausted: halve by characters so the hard cap still holds.
        if len(text) <= 1:
            return [text]
        mid = len(text) // 2
        return (recursive_split(text[:mid], count_tokens, max_tokens, overlap_tokens, seps)
                + recursive_split(text[mid:], count_tokens, max_tokens, overlap_tokens, seps))
    sep, rest = seps[0], seps[1:]
    parts, buf = [], ""
    for piece in text.split(sep):
        candidate = f"{buf}{sep}{piece}" if buf else piece
        if count_tokens(candidate) <= max_tokens:
            buf = candidate
            continue
        if buf:
            parts.append(buf)  # flush the accumulated fragment
        if count_tokens(piece) <= max_tokens:
            buf = piece
        else:  # single piece still too big -> recurse with finer separators
            parts.extend(recursive_split(piece, count_tokens, max_tokens, overlap_tokens, rest))
            buf = ""
    if buf:
        parts.append(buf)
    return parts

def chunk_document(doc_id, text, count_tokens, max_tokens=500, overlap_tokens=64):
    fragments = recursive_split(text, count_tokens, max_tokens, overlap_tokens)
    fragments = [f for f in fragments if f.strip()]  # drop empty fragments
    out = []
    for i, body in enumerate(fragments):
        digest = hashlib.sha256(body.encode("utf-8")).digest()
        out.append(Chunk(doc_id, i, body, count_tokens(body), digest))
    return out
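The splitter above accepts `overlap_tokens` but does not apply it. One way to add overlap as a post-processing pass is sketched below; `tail_within` and `add_overlap` are hypothetical helpers, and a whitespace word count stands in for a real tokenizer. The overlap is kept only when the combined fragment still fits under the cap:

```python
def tail_within(text, count_tokens, budget):
    """Longest run of trailing words from `text` that fits in `budget` tokens."""
    tail = []
    for word in reversed(text.split()):
        if count_tokens(" ".join([word] + tail)) > budget:
            break
        tail.insert(0, word)
    return " ".join(tail)


def add_overlap(fragments, count_tokens, max_tokens, overlap_tokens):
    """Prefix each fragment with the tail of its predecessor, never exceeding max_tokens."""
    out = []
    for i, body in enumerate(fragments):
        if i > 0 and overlap_tokens > 0:
            prefix = tail_within(fragments[i - 1], count_tokens, overlap_tokens)
            candidate = f"{prefix} {body}" if prefix else body
            if count_tokens(candidate) <= max_tokens:
                body = candidate  # keep the overlap only when the cap still holds
        out.append(body)
    return out


def count_words(s):
    # Whitespace word count stands in for a real tokenizer in this sketch.
    return len(s.split())


frags = ["alpha beta gamma delta", "epsilon zeta", "eta theta iota"]
print(add_overlap(frags, count_words, max_tokens=5, overlap_tokens=2))
# ['alpha beta gamma delta', 'gamma delta epsilon zeta', 'epsilon zeta eta theta iota']
```

In practice it helps to run the splitter with a budget of `max_tokens - overlap_tokens` so the prefix usually fits; hash the final body, including the overlap, so `content_hash` reflects what was actually embedded.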

3. Embed in GPU-aligned batches

Align batch_size to CUDA kernel dimensions and repack the variable-length fragments a recursive splitter produces. Because recursive splitting yields uneven lengths, sort by token count before batching so each batch pads to a tight maximum rather than the global one.

PYTHON

import torch

def embed_batches(chunks, model, batch_size=128):
    """Yield (chunk, vector) pairs; assumes a sentence-transformers style model.encode()."""
    ordered = sorted(chunks, key=lambda c: c.token_count)  # length-bucketed batching
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        try:
            # Pass batch_size through; encode() otherwise re-batches at its own default.
            vectors = model.encode([c.body for c in batch],
                                   batch_size=len(batch), convert_to_numpy=True)
        except torch.cuda.OutOfMemoryError:
            if len(batch) == 1:
                raise  # a single fragment does not fit; halving cannot help
            torch.cuda.empty_cache()
            half = max(1, len(batch) // 2)            # dynamic back-off on OOM
            yield from embed_batches(batch, model, batch_size=half)
            continue
        for chunk, vec in zip(batch, vectors):
            yield chunk, vec

The distance metric you index against — set out in cosine vs L2 distance metrics — determines whether normalization is mandatory. For cosine similarity via inner product, L2-normalize every vector before insertion; the full rationale and pitfalls live in type casting & vector normalization.
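A minimal normalization sketch with NumPy follows. The helper name `l2_normalize` and the `eps` guard for zero vectors are assumptions made for illustration:

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm; all-zero rows stay zero instead of dividing by 0."""
    arr = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(arr, axis=1, keepdims=True)
    return arr / np.maximum(norms, eps)


batch = np.array([[3.0, 4.0], [0.0, 0.0]])
unit = l2_normalize(batch)
print(unit[0])  # approximately [0.6 0.8]
```

After normalization the inner product of two rows equals their cosine similarity, which is why this step has to happen before insertion when the index uses an inner-product operator.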

4. Upsert idempotently

The composite key plus ON CONFLICT makes the write idempotent under parallel dispatch: a retried batch overwrites rather than duplicating, and an unchanged content_hash short-circuits the write. The table ends up in an exactly-once state even when a batch is delivered more than once.

SQL

INSERT INTO document_chunks (doc_id, chunk_index, content_hash, embedding)
VALUES ($1, $2, $3, $4)
ON CONFLICT (doc_id, chunk_index) DO UPDATE
    SET embedding    = EXCLUDED.embedding,
        content_hash = EXCLUDED.content_hash,
        inserted_at  = now()
WHERE document_chunks.content_hash IS DISTINCT FROM EXCLUDED.content_hash;
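To batch that upsert, one option is to build a multi-row VALUES statement per batch. The sketch below only constructs the SQL text and parameter list; `dedupe_last`, `batched`, and `build_upsert` are hypothetical helpers, the `%s` placeholders assume a psycopg-style driver, and the deduplication step is there because PostgreSQL rejects a single INSERT ... ON CONFLICT DO UPDATE that touches the same key twice:

```python
UPSERT_HEAD = (
    "INSERT INTO document_chunks (doc_id, chunk_index, content_hash, embedding) VALUES "
)
UPSERT_TAIL = (
    " ON CONFLICT (doc_id, chunk_index) DO UPDATE"
    " SET embedding = EXCLUDED.embedding,"
    " content_hash = EXCLUDED.content_hash,"
    " inserted_at = now()"
    " WHERE document_chunks.content_hash IS DISTINCT FROM EXCLUDED.content_hash"
)


def dedupe_last(rows):
    """Keep the last row per (doc_id, chunk_index); one statement cannot update a key twice."""
    latest = {}
    for row in rows:
        latest[(row[0], row[1])] = row
    return list(latest.values())


def batched(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_upsert(rows):
    """Return (sql, params) for one multi-row upsert; each row is a 4-tuple."""
    placeholders = ", ".join(["(%s, %s, %s, %s)"] * len(rows))
    params = [value for row in rows for value in row]
    return UPSERT_HEAD + placeholders + UPSERT_TAIL, params


rows = [("a", 0, b"h1", "v1"), ("a", 1, b"h2", "v2"), ("a", 0, b"h3", "v3")]
for batch in batched(dedupe_last(rows), 1024):
    sql, params = build_upsert(batch)
    # conn.execute(sql, params)  # assumed driver call; embeddings need a vector adapter
    print(len(batch), len(params))
```

At four parameters per row, 2048 rows is 8192 bind parameters, comfortably below the 65535-parameter limit of the PostgreSQL wire protocol.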

Aligning chunk identity with the schema is a shared concern with metadata mapping & schema design; keep the (doc_id, chunk_index) contract identical on both sides so partial re-indexing stays cheap. For distributed dispatch that saturates network I/O while awaiting GPU responses, drive the loop with async processing with Python AsyncIO, and for multi-worker fault tolerance — retry backoff, dead-letter routing, exactly-once delivery — see building a resilient Python embedding pipeline with Celery.

5. Promote and index

Bulk-load into staging, then promote in one atomic statement and build the ANN index without blocking writes to the table.

SQL

-- Atomic promotion of a validated batch.
INSERT INTO document_chunks (doc_id, chunk_index, content_hash, embedding)
SELECT doc_id, chunk_index, content_hash, embedding
FROM embedding_staging
WHERE batch_id = $1 AND status = 'ready' AND embedding IS NOT NULL
ON CONFLICT (doc_id, chunk_index) DO UPDATE
    SET embedding    = EXCLUDED.embedding,
        content_hash = EXCLUDED.content_hash,
        inserted_at  = now()
WHERE document_chunks.content_hash IS DISTINCT FROM EXCLUDED.content_hash;

-- Build the index concurrently so inserts and updates are not blocked.
-- CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
SET maintenance_work_mem = '2GB';
CREATE INDEX CONCURRENTLY idx_chunks_hnsw
    ON document_chunks USING hnsw (embedding vector_cosine_ops);

Validation & Recall Testing

Chunk granularity only pays off if retrieval quality holds. Validate two things before promoting a strategy: that queries hit the ANN index, and that recall against a ground-truth set clears your target.

Confirm the planner uses the index rather than falling back to a sequential scan:

SQL

EXPLAIN (ANALYZE, BUFFERS)
SELECT doc_id, chunk_index
FROM document_chunks
ORDER BY embedding <=> $1
LIMIT 10;
-- Expect: "Index Scan using idx_chunks_hnsw". A "Seq Scan" means the index
-- was skipped -- check that the query operator matches the index opclass.

Measure recall@K by comparing approximate results against an exact brute-force baseline over a sampled query set. This is the only signal that tells you whether a coarser or finer chunk boundary actually helped:

PYTHON

def recall_at_k(conn, queries, k=10):
    """Mean recall@k of the ANN index against an exact scan.

    Assumes a psycopg 3 style connection with a pgvector adapter registered.
    """
    sql = ("SELECT doc_id||':'||chunk_index FROM document_chunks "
           "ORDER BY embedding <=> %s LIMIT %s")
    total, n = 0.0, 0
    for q in queries:
        approx = {r[0] for r in conn.execute(sql, (q, k)).fetchall()}
        # Force an exact scan for ground truth, then restore the setting so the
        # next approximate query goes through the index again.
        conn.execute("SET enable_indexscan = off")
        try:
            exact = {r[0] for r in conn.execute(sql, (q, k)).fetchall()}
        finally:
            conn.execute("RESET enable_indexscan")
        total += len(approx & exact) / max(len(exact), 1)
        n += 1
    return total / n if n else 0.0

Sweep two or three chunk_size values and keep the smallest fragment count that holds recall@10 above target — smaller fragments raise index density and traversal latency for no gain once recall is saturated. The interaction between fragment count and index quality is calibrated in optimizing m and ef_construction parameters.
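The selection rule can be written down directly. In this sketch `pick_chunk_size` is a hypothetical helper, and the sweep rows are placeholder numbers chosen to exercise the logic, not measurements:

```python
def pick_chunk_size(sweep, recall_target):
    """sweep: iterable of (chunk_size, fragment_count, recall_at_10) tuples.

    Returns the chunk_size with the fewest fragments among configurations that
    meet the recall target, or None if none of them do.
    """
    passing = [row for row in sweep if row[2] >= recall_target]
    if not passing:
        return None
    return min(passing, key=lambda row: row[1])[0]


# Placeholder values for illustration only.
sweep = [(256, 40_000, 0.97), (512, 21_000, 0.96), (1024, 11_000, 0.91)]
print(pick_chunk_size(sweep, recall_target=0.95))  # 512
```

Returning None when nothing clears the target is deliberate: it signals that the sweep range, not the selection rule, needs to change.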

Failure Modes & Gotchas

Silent truncation at the model boundary. If max_chunk_tokens is not enforced before serialization, the embedding model truncates oversized fragments without error — the tail of the document is never embedded. The diagnostic is the length-distribution query above: a hard spike at the cap is the tell.
Sequential-scan fallback after bulk load. A freshly loaded table with stale statistics makes the planner ignore a valid HNSW index. Run ANALYZE document_chunks; immediately after promotion, and re-check the EXPLAIN plan.
WAL pressure from row-at-a-time upserts. Single-row INSERT ... ON CONFLICT in a tight loop floods the write-ahead log and stalls checkpoints. Batch 512–2048 rows per statement and size insert_batch so the WAL generated between checkpoints stays within max_wal_size.
Duplicate vectors from a missing natural key. Without the (doc_id, chunk_index) primary key, a retried batch appends near-duplicate fragments that inflate the index and skew similarity results. The composite key plus content_hash guard is non-negotiable.
VRAM fragmentation on variable-length batches. Unsorted recursive fragments force each batch to pad to its longest member, which is often close to the global maximum; the padding wastes compute and memory and invites OOM. Length-bucket before batching and dynamically halve batch_size on OutOfMemoryError.

Monitoring & Alerting Hooks

Instrument both the write path and the index so degradation is caught before it reaches retrieval SLAs. Track ingestion depth and staging backlog:

SQL

-- Staging backlog by status -- alert if 'pending' grows unbounded (dispatch stalled).
-- min(batch_id) identifies the oldest batch only if batch ids are time-ordered.
SELECT status, count(*) AS fragments, min(batch_id::text) AS oldest_batch
FROM embedding_staging
GROUP BY status;

-- Index scan health from the statistics views -- idx_scan should climb; idx_tup_read/idx_scan
-- rising fast means each query visits more index entries, e.g. after widening ef_search.
-- Note: indexrelname is the index name; relname is the table name.
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexrelname = 'idx_chunks_hnsw';

Export these as Prometheus gauges so backlog growth and index-scan ratios page before users notice. A minimal exporter query and alert threshold:

PYTHON

# Prometheus-compatible sample: staging backlog and index-scan ratio.
def collect_metrics(conn):
    pending = conn.execute(
        "SELECT count(*) FROM embedding_staging WHERE status = 'pending'").fetchone()[0]
    # pg_stat_user_indexes names the index in indexrelname (relname is the table).
    row = conn.execute(
        "SELECT idx_scan, idx_tup_read FROM pg_stat_user_indexes "
        "WHERE indexrelname = 'idx_chunks_hnsw'").fetchone()
    scans, reads = row if row else (0, 0)  # the index may not exist yet
    ratio = (reads / scans) if scans else 0
    # ALERT: staging_pending > 50000 for 10m  OR  index_tup_per_scan > 500
    return {"staging_pending": pending, "index_tup_per_scan": ratio}

For write-path auditing on multi-tenant ingestion, pgaudit logging on the document_chunks table captures who promoted which batch, which pairs cleanly with the isolation controls in security boundaries for vector data.

FAQ

What chunk size should I start with for a RAG pipeline?

Start at 512 tokens with 10-15% overlap for models with an 8k or smaller context window, and 768-1024 for large-context models. Match chunk_size to the model’s native window rather than its maximum, then sweep two or three values and keep the smallest fragment count that holds recall@10 above target.

How do I stop retries from creating duplicate vectors?

Give every fragment a composite natural key (doc_id, chunk_index) and upsert with INSERT ... ON CONFLICT (doc_id, chunk_index) DO UPDATE. Add a content_hash guard in the WHERE clause so an unchanged fragment short-circuits the write entirely. The write becomes idempotent, so the table holds each fragment exactly once even when parallel workers retry the same batch.

Does overlap improve recall or just inflate storage?

Modest overlap (10-15%) meaningfully reduces boundary fragmentation on prose, recovering answers that straddle a cut point. Beyond ~20% the marginal recall gain flattens while fragment count, and with it index size, keeps climbing: for fixed windows the count scales roughly as 1 / (1 − overlap ratio), so 50% overlap doubles it. Validate with recall@K rather than assuming more overlap is better.
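The fragment-count side of that trade-off is plain arithmetic for fixed windows. A small sketch, where `fragment_count` is a hypothetical helper and the document length is an arbitrary example:

```python
import math


def fragment_count(total_tokens, chunk_size, overlap):
    """Number of fixed windows needed to cover total_tokens with the given overlap."""
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)


for overlap in (0, 51, 102, 256):  # roughly 0%, 10%, 20%, 50% of 512
    print(overlap, fragment_count(1_000_000, 512, overlap))
```

The count depends on the stride `chunk_size - overlap`, so each additional point of overlap costs more fragments than the previous one.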

When is semantic chunking worth the extra cost?

Only when retrieval precision drives product value and the corpus is dense reference text. Semantic chunking runs a pre-embedding pass to place boundaries on meaning, costing 2-3x the ingestion budget. For logs, transcripts, and mixed document sets, recursive-character splitting delivers most of the benefit at a fraction of the cost.

Why did my query fall back to a sequential scan after bulk loading?

Stale planner statistics. A freshly loaded table has no row estimates, so the planner ignores the HNSW index. Run ANALYZE document_chunks; immediately after promoting a batch and re-check EXPLAIN (ANALYZE) for an Index Scan node.

Related

Metadata Mapping & Schema Design: keep the chunk identity contract identical across staging and production.
Type Casting & Vector Normalization: normalize embeddings correctly for your distance metric before insertion.
Async Processing with Python AsyncIO: saturate network I/O while awaiting GPU responses.
Building a Resilient Python Embedding Pipeline with Celery: fault-tolerant multi-worker dispatch with retries and dead-letter routing.
Optimizing m and ef_construction Parameters: tune the HNSW index that chunk granularity feeds.
