Async Processing with Python AsyncIO for Embedding Ingestion Pipelines

High-throughput vector ingestion breaks in a specific, predictable way: a synchronous loop issues one embedding request, blocks for 200–400 ms of network round-trip, writes one row, and repeats. At that rate a single worker tops out near 3–5 documents per second while the CPU sits idle and the PostgreSQL connection stays parked. Push concurrency naively — an unbounded asyncio.gather() over ten thousand coroutines — and you swing to the opposite failure: HTTP 429 storms from the embedding provider, exhausted connection pools, and out-of-memory kills as the whole corpus buffers in RAM. Getting this right is the core control problem of async ingestion within Embedding Ingestion Pipeline Engineering: saturate the network without overrunning the provider’s rate limit or the database’s connection budget, while keeping writes idempotent. This page covers the concurrency model, the parameters that govern it, a runnable producer–consumer implementation, and the diagnostics that tell you where the loop is stalling.


Architectural Divergence & Trade-offs

Three concurrency structures dominate async embedding ingestion, and choosing wrong is the difference between a pipeline that self-throttles and one that DDoSes your own provider.

Fan-out with asyncio.gather() is the simplest: build one coroutine per document and await them all. It is correct only when the total task count is small and bounded (a few hundred). At corpus scale it schedules every request simultaneously, so peak concurrency equals the number of documents. That guarantees rate-limit rejections and buffers every payload in memory at once. Use it for a single batch you know fits under the provider quota; never for open-ended ingestion.

Bounded fan-out with asyncio.Semaphore keeps the gather ergonomics but caps in-flight requests. Each coroutine acquires a permit before dispatch and releases it after. Concurrency is now a tunable constant aligned to the provider’s requests-per-minute (RPM) quota rather than the corpus size. This is the right default for a fixed input set that still fits in memory.
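As a minimal sketch of the bounded fan-out described above, the snippet below caps in-flight calls with a semaphore while keeping the gather call shape. The fake_embed coroutine and the embed_all helper are hypothetical stand-ins, not part of any provider SDK, and the sleep only simulates network latency.

```python
import asyncio

async def fake_embed(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding request."""
    await asyncio.sleep(0.01)          # simulate network latency
    return [float(len(text))]

async def embed_all(texts, limit: int):
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def bounded(text):
        nonlocal in_flight, peak
        async with sem:                # acquire a permit before dispatch
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_embed(text)
            finally:
                in_flight -= 1         # the permit is released on block exit

    # gather still creates one coroutine per input, so memory scales with
    # the batch; only the number of concurrent requests is capped.
    vectors = await asyncio.gather(*(bounded(t) for t in texts))
    return vectors, peak

vectors, peak = asyncio.run(embed_all([f"doc {i}" for i in range(50)], limit=5))
print(len(vectors), peak)
```

The peak counter is there only to make the cap observable; a real pipeline would drop it or export it as a metric.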

Producer–consumer with asyncio.Queue is the model that survives an unbounded stream. Producers read documents and push them into a bounded queue; a fixed pool of consumer tasks pulls, embeds, and upserts. The queue’s maxsize becomes an explicit memory ceiling and applies natural backpressure — when consumers fall behind, queue.put() blocks the producer instead of letting RAM grow without limit. This is the production choice for continuous ingestion and the one the rest of this page implements.
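The backpressure behaviour can be seen in isolation with a small simulation. This is a sketch under stated assumptions: the worker sleeps instead of embedding and upserting, and the sizes (40 items, 3 workers, maxsize 6) are arbitrary illustration values.

```python
import asyncio

async def producer(queue, items, depths):
    for item in items:
        await queue.put(item)          # suspends here while the queue is full
        depths.append(queue.qsize())   # record depth to show the ceiling holds

async def worker(queue, results):
    while True:
        item = await queue.get()
        try:
            await asyncio.sleep(0.005) # stand-in for embed + upsert
            results.append(item)
        finally:
            queue.task_done()

async def main(n_items=40, n_workers=3, maxsize=6):
    queue = asyncio.Queue(maxsize=maxsize)
    results, depths = [], []
    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(n_workers)
    ]
    await producer(queue, range(n_items), depths)
    await queue.join()                 # wait until every item is marked done
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results, max(depths)

results, max_depth = asyncio.run(main())
print(len(results), max_depth)
```

However fast the producer runs, the recorded depth never exceeds maxsize, which is the memory ceiling the paragraph above refers to.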

Model	Peak concurrency	Memory profile	Backpressure	When to use
`gather()` fan-out	= document count	Whole batch in RAM	None	Small, known batch under quota
`Semaphore`-bounded	= semaphore value	Whole batch in RAM	On dispatch only	Fixed input set, memory permitting
`Queue` worker pool	= consumer count	Bounded by `maxsize`	End-to-end	Streaming / large corpora

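Two cross-cutting decisions sit above all three. First, the event loop runs on one thread, so any CPU-bound step (tokenization, content_hash computation, the schema validation covered under Metadata Mapping & Schema Design) must move off the loop via asyncio.to_thread() or a ThreadPoolExecutor, or it will stall every in-flight coroutine. Threads keep the loop responsive, but pure-Python CPU work still contends for the GIL, so a heavy step may need a ProcessPoolExecutor instead. Second, on Linux, swapping the default loop for uvloop can raise throughput on I/O-bound workloads thanks to its Cython/libuv backend. The uvloop project reports 2–4× over the stock loop in its own benchmarks, so measure on your workload before relying on that figure. The Python asyncio documentation describes the loop_factory hook used to install it, which asyncio.run() accepts from Python 3.12.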
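A minimal sketch of moving a blocking step off the loop with asyncio.to_thread() (Python 3.9+). The choice of SHA-256 for content_hash is an assumption made for illustration; the page does not specify the hash function.

```python
import asyncio
import hashlib

def content_hash(text: str) -> str:
    """Blocking helper: SHA-256 of the chunk text (hash choice is assumed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

async def hash_chunks(chunks):
    # Each call runs in the default thread pool, so the event loop stays free
    # to service in-flight network coroutines while hashing proceeds.
    return await asyncio.gather(
        *(asyncio.to_thread(content_hash, c) for c in chunks)
    )

hashes = asyncio.run(hash_chunks(["alpha", "beta", "gamma"]))
print(hashes[0][:12])
```

The same shape works for any blocking callable; results come back in input order because gather preserves it.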

Parameter Space & Diagnostic Workflow

Async ingestion has two independent concurrency budgets that must be reconciled: the provider’s RPM and the database’s connection pool. The semaphore governs the first; the asyncpg pool governs the second. If the semaphore is wider than the pool, consumers pile up waiting for a database connection and the write side becomes the bottleneck. Size them together.

Parameter	Default	Production recommendation	Notes
`asyncio.Semaphore(n)`	none	RPM ÷ 60 × avg latency (s), rounded down	For 6000 RPM at 300 ms latency: ~30 permits keeps you under quota
`Queue(maxsize=)`	0 (unbounded)	2–5× consumer count	Explicit memory ceiling; unbounded is an OOM trap
`TCPConnector(limit=)`	100	≥ semaphore value	Below the semaphore, connections become the limiter
`TCPConnector(limit_per_host=)`	0 (unlimited)	50–100	Prevents socket exhaustion against one endpoint
`asyncpg` pool `max_size`	10	Match consumer count	Each consumer needs a connection during upsert
`asyncpg` pool `min_size`	10	5–10	Warm connections avoid cold-start latency spikes
`asyncio.timeout()` per request	none	2–3× p99 latency	Python 3.11+; fail fast on hung sockets rather than blocking a permit forever
`loop.slow_callback_duration`	0.1	0.05–0.1	Only active in asyncio debug mode; logs callbacks that hold the loop longer than this many seconds
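The sizing rules in the table can be written down as two small helpers. This is a sketch: the function names are hypothetical, latency is taken in whole milliseconds so the arithmetic stays in integers, and the floor of one permit is an added safeguard rather than something the table prescribes.

```python
def semaphore_permits(rpm: int, avg_latency_ms: int) -> int:
    """floor(RPM / 60 * latency_seconds), computed with integer math."""
    permits = (rpm * avg_latency_ms) // 60_000
    return max(1, permits)             # assumed floor so the pipeline can progress

def queue_maxsize(consumers: int, factor: int = 4) -> int:
    """Queue bound as a multiple of the consumer count (table suggests 2-5x)."""
    return consumers * factor

permits = semaphore_permits(6000, 300)
print(permits, queue_maxsize(permits))  # 30 120
```

These match the CONCURRENCY = 30 and QUEUE_MAX = 120 constants used in the implementation further down.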

The diagnostic workflow starts at the database, because an under-provisioned pool masquerades as a slow provider. Check how many ingestion connections are actually active and what they are doing:

SQL

SELECT state, wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE application_name = 'embedding_ingest'
GROUP BY state, wait_event_type, wait_event
ORDER BY count(*) DESC;

If most connections sit in idle in transaction, a consumer is holding a transaction open across an await on the embedding API, a classic async footgun that pins connections and inflates pool pressure. If they sit in state idle with wait_event = 'ClientRead', the server is waiting for the client to send its next command, so the database is not the limiter and the stall is upstream, in the application or at the provider. Cross-reference queue depth (exported from the application; see Monitoring & Alerting Hooks) with this view: a full queue plus idle DB connections means the provider is the bottleneck; a full queue plus saturated connections means the pool is.
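The cross-referencing rule above is simple enough to encode as a pure function, which can be handy in a health check. This is only a sketch: the Snapshot fields, the diagnose name, and the exact comparisons are hypothetical choices, and the inputs are assumed to come from the application's queue and the pg_stat_activity query.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    queue_depth: int
    queue_maxsize: int
    active_conns: int
    idle_in_txn_conns: int
    pool_max_size: int

def diagnose(s: Snapshot) -> str:
    # A transaction parked across an await is reported first, because it
    # distorts every other reading.
    if s.idle_in_txn_conns > 0:
        return "transaction held across an await"
    queue_full = s.queue_depth >= s.queue_maxsize
    pool_saturated = s.active_conns >= s.pool_max_size
    if queue_full and pool_saturated:
        return "pool"                  # writes cannot keep up
    if queue_full:
        return "provider"              # DB idle while items wait on embedding
    return "no backlog"

print(diagnose(Snapshot(120, 120, 3, 0, 30)))  # provider
```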

Step-by-Step Implementation

The following builds the producer–consumer pipeline end to end: an aiohttp client for the embedding API, a bounded queue, a fixed consumer pool guarded by a semaphore, and idempotent upserts through an asyncpg pool. Chunking is assumed upstream, per Batch Chunking Strategies for Embeddings.

Step 1 — Prepare the target table with a conflict key. Idempotency lives in the schema. A unique constraint on (doc_id, chunk_index) lets retries upsert instead of duplicating vectors.

SQL

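-- The vector column type is provided by the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS doc_embeddings (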
    doc_id       text        NOT NULL,
    chunk_index  int         NOT NULL,
    content_hash text        NOT NULL,
    embedding    vector(1536) NOT NULL,
    updated_at   timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (doc_id, chunk_index)
);

Step 2 — Create the connection pool and the HTTP session. Size the pool to the consumer count and set an application name so the pg_stat_activity diagnostics above can isolate ingestion traffic.

PYTHON

import asyncio
import aiohttp
import asyncpg

CONCURRENCY = 30          # semaphore width, aligned to provider RPM
QUEUE_MAX = 120           # ~4x consumers: bounded memory ceiling
DSN = "postgresql://ingest@db/vectors"

async def make_resources():
    pool = await asyncpg.create_pool(
        dsn=DSN, min_size=10, max_size=CONCURRENCY,
        max_inactive_connection_lifetime=300.0,
        server_settings={"application_name": "embedding_ingest"},
    )
    connector = aiohttp.TCPConnector(
        limit=CONCURRENCY, limit_per_host=CONCURRENCY, ttl_dns_cache=300,
    )
    session = aiohttp.ClientSession(connector=connector)
    return pool, session

Step 3 — Define the embedding call with a per-request timeout. Wrap the network call in asyncio.timeout() so a hung socket fails fast instead of holding a semaphore permit indefinitely. Transient failures are delegated to Implementing exponential backoff for embedding API calls, which distinguishes 429 from 5xx and applies jittered waits.

PYTHON

async def embed(session, text):
    async with asyncio.timeout(3.0):
        async with session.post(
            "https://api.provider.example/v1/embeddings",
            json={"model": "text-embedding-3-small", "input": text},
        ) as resp:
            resp.raise_for_status()
            payload = await resp.json()
            return payload["data"][0]["embedding"]
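For orientation, a jittered exponential backoff wrapper might look like the sketch below. It is not the logic from the linked page: ProviderError, is_retryable, and with_backoff are hypothetical names, both 429 and 5xx are simply treated as retryable here, and the delay parameters are illustrative. With aiohttp, the ClientResponseError raised by raise_for_status() carries a status attribute, so the same check could be applied to it.

```python
import asyncio
import random

class ProviderError(Exception):
    """Hypothetical error carrying the HTTP status of a failed call."""
    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status

def is_retryable(status: int) -> bool:
    return status == 429 or 500 <= status < 600

async def with_backoff(call, *, attempts=5, base=0.5, cap=30.0):
    """Await call() and retry retryable failures with full-jitter backoff."""
    for attempt in range(attempts):
        try:
            return await call()
        except ProviderError as exc:
            # Give up on non-retryable statuses and on the final attempt.
            if not is_retryable(exc.status) or attempt == attempts - 1:
                raise
            # Full jitter: sleep a random time up to the exponential ceiling.
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            await asyncio.sleep(delay)
```

A consumer would call something like with_backoff(lambda: embed(session, text)) after translating the HTTP error into whatever exception type the wrapper checks.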

Step 4 — Write the consumer. Each consumer acquires a semaphore permit around the embedding call only, then normalizes and upserts. Note the transaction is opened after the network call returns — never hold a database transaction across the await on the provider. Normalization details live in Normalizing embeddings before pgvector insertion.

PYTHON

UPSERT = """
INSERT INTO doc_embeddings (doc_id, chunk_index, content_hash, embedding)
VALUES ($1, $2, $3, $4)
ON CONFLICT (doc_id, chunk_index)
DO UPDATE SET embedding = EXCLUDED.embedding,
              content_hash = EXCLUDED.content_hash,
              updated_at = now()
WHERE doc_embeddings.content_hash IS DISTINCT FROM EXCLUDED.content_hash;
"""

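import logging

log = logging.getLogger("embedding_ingest")

async def dead_letter(item, exc):
    # Hypothetical placeholder: persist the item and the failure reason
    # (for example the HTTP status) somewhere durable for later replay.
    log.error("dead-letter %s#%s: %r", item["doc_id"], item["chunk_index"], exc)

async def consumer(queue, sem, session, pool):
    while True:
        item = await queue.get()
        try:
            # Hold the permit only for the provider call.
            async with sem:
                vector = await embed(session, item["text"])
            # pgvector accepts the text form '[x1,x2,...]'.
            vec_literal = "[" + ",".join(map(repr, vector)) + "]"
            # Acquire a connection only after the network call has returned.
            async with pool.acquire() as conn:
                await conn.execute(
                    UPSERT, item["doc_id"], item["chunk_index"],
                    item["content_hash"], vec_literal,
                )
        except Exception as exc:
            await dead_letter(item, exc)  # replay later; keep the failure reason
        finally:
            queue.task_done()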

Step 5 — Wire producers, consumers, and graceful drain. The producer fills the bounded queue; queue.join() waits for every item to be processed before cancelling the idle consumers.

PYTHON

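async def run(documents):
    # documents: iterable of dicts with doc_id, chunk_index, content_hash, text
    pool, session = await make_resources()
    queue = asyncio.Queue(maxsize=QUEUE_MAX)
    # With one consumer per permit the semaphore never contends; it starts to
    # matter once the consumer count exceeds the provider budget.
    sem = asyncio.Semaphore(CONCURRENCY)
    consumers = [
        asyncio.create_task(consumer(queue, sem, session, pool))
        for _ in range(CONCURRENCY)
    ]
    try:
        for doc in documents:          # producer: suspends when queue is full
            await queue.put(doc)
        await queue.join()             # drain
    finally:
        for c in consumers:
            c.cancel()
        # Let the cancellations land before closing shared resources.
        await asyncio.gather(*consumers, return_exceptions=True)
        await session.close()
        await pool.close()

# `documents` is produced by the upstream chunking stage.
asyncio.run(run(documents))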

For a multi-node version of this same pattern with durable retries and a broker-backed dead-letter queue, see Building a resilient Python embedding pipeline with Celery.

Validation & Recall Testing

An async pipeline can silently write the wrong vectors — truncated payloads, dropped retries, or a normalization mismatch — without ever raising. Validate on three axes: completeness, correctness, and retrieval quality.

Completeness. Every chunk that entered the queue must land as a row. Compare source counts against stored counts per document:

SQL

SELECT doc_id, count(*) AS chunks_stored
FROM doc_embeddings
GROUP BY doc_id
HAVING count(*) <> (SELECT expected_chunks FROM ingest_manifest m
                    WHERE m.doc_id = doc_embeddings.doc_id);

Rows returned here are documents where the async writer lost chunks — usually a consumer exception that routed to the dead-letter queue without a replay.

Correctness of the write path. Confirm the upsert is idempotent by running the same batch twice and asserting updated_at only moves for genuinely changed content (the content_hash IS DISTINCT FROM guard). A second run that rewrites every row means the guard is bypassed and you are churning WAL for nothing.

Retrieval quality. Compare async-ingested vectors against a synchronously computed ground-truth set to catch normalization or dimension drift. Store both and measure top-k overlap (recall@k) under cosine distance in Python:

PYTHON

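import asyncio
import asyncpg

# Assumes ground_truth(doc_id, embedding) holds exactly one row per probe id.
# <=> is pgvector's cosine-distance operator.
async def recall_check(pool, probe_ids, k=10):
    if not probe_ids:
        raise ValueError("probe_ids must not be empty")
    async with pool.acquire() as conn:
        hits = 0
        for pid in probe_ids:
            # Neighbours of the probe among the async-ingested vectors.
            rows = await conn.fetch(
                """SELECT doc_id FROM doc_embeddings
                   ORDER BY embedding <=> (
                     SELECT embedding FROM ground_truth WHERE doc_id = $1)
                   LIMIT $2""", pid, k)
            # Neighbours of the same probe within the ground-truth set.
            truth = await conn.fetch(
                """SELECT doc_id FROM ground_truth
                   ORDER BY embedding <=> (
                     SELECT embedding FROM ground_truth WHERE doc_id = $1)
                   LIMIT $2""", pid, k)
            # Sets deduplicate doc_ids, so several chunks of one document
            # count once.
            hits += len(set(r["doc_id"] for r in rows)
                        & set(r["doc_id"] for r in truth))
        return hits / (len(probe_ids) * k)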

A recall below ~0.95 against ground truth points to the ingestion layer, not the index — check dimension and normalization before touching the HNSW vs IVFFlat algorithm selection. Confirm the planner is actually using the index rather than falling back to a scan:

SQL

EXPLAIN (ANALYZE, BUFFERS)
SELECT doc_id FROM doc_embeddings
ORDER BY embedding <=> '[...]'::vector
LIMIT 10;

A Seq Scan in the plan during a recall test means the query is exact, not approximate. That is fine for ground truth and for small tables, but on a large production table it usually means the index is missing or the planner could not use it.
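The dimension and normalization checks mentioned above can be run on a sample of stored vectors before any index tuning. The sketch below assumes the pipeline stores unit-length vectors and uses the 1536 dimensions from the table definition; check_vector and its tolerance are hypothetical.

```python
import math

def check_vector(vec, expected_dim=1536, tol=1e-3):
    """Return a list of problems found; an empty list means the vector passes."""
    problems = []
    if len(vec) != expected_dim:
        problems.append(f"dimension {len(vec)} != {expected_dim}")
    if any(not math.isfinite(x) for x in vec):
        problems.append("non-finite component")
        return problems                # a norm is meaningless with NaN or inf
    norm = math.sqrt(sum(x * x for x in vec))
    if abs(norm - 1.0) > tol:
        problems.append(f"L2 norm {norm:.4f} is not 1")
    return problems

print(check_vector([3.0, 4.0], expected_dim=2))
```

If the provider already returns normalized vectors and the ingestion layer normalizes again, the check still passes; it flags only vectors that skipped normalization or lost dimensions.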

Failure Modes & Gotchas

Idle-in-transaction connection leaks. Opening a transaction before the await embed(...) call pins a database connection for the full network round-trip. Under load, all pool connections end up parked in idle in transaction and the remaining consumers stall in pool.acquire() until a provider call returns. Always compute the vector first, then open the shortest possible transaction to write it.
Unbounded gather() memory blowup. asyncio.gather(*[embed(d) for d in corpus]) materializes one coroutine and one buffered payload per document simultaneously. On a million-chunk corpus this is an immediate OOM. The bounded queue is not optional at scale.
CPU work on the event loop. Tokenization or content_hash computation done inline blocks every other coroutine until it returns. Throughput collapses even though most cores sit idle, because a single thread is doing all the work. Move it to asyncio.to_thread(), and run with asyncio debug mode enabled so that slow_callback_duration warnings are actually emitted.
Silent 429 swallowing. A bare except: pass around the embedding call turns rate-limit rejections into missing rows that surface only as a recall regression weeks later. Route every failure to the dead-letter queue with its status code preserved.
WAL pressure from per-row commits. Autocommitting one INSERT per chunk generates a WAL flush per row and throttles the whole cluster during bulk loads. Batch writes into multi-row upserts or a COPY staging step; the storage and I/O implications are quantified in pgvector storage overhead analysis.
Semaphore wider than the pool. If Semaphore(100) sits over an asyncpg pool of max_size=10, up to ninety consumers queue on pool.acquire() after embedding, so provider capacity is spent on vectors that then wait to be written and end-to-end latency inflates. Keep pool max_size ≥ consumer count.
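One way to cut per-row commits is to buffer embedded rows and send them with a single executemany call. The sketch below assumes an asyncpg connection (recent asyncpg versions run executemany atomically when no explicit transaction is open); to_row, flush, write_batched, and the batch size of 100 are hypothetical. It batches the same single-row statement rather than building a multi-row VALUES list or a COPY staging step.

```python
import asyncio

# Same statement as the single-row upsert in Step 4.
UPSERT = """
INSERT INTO doc_embeddings (doc_id, chunk_index, content_hash, embedding)
VALUES ($1, $2, $3, $4)
ON CONFLICT (doc_id, chunk_index)
DO UPDATE SET embedding = EXCLUDED.embedding,
              content_hash = EXCLUDED.content_hash,
              updated_at = now()
WHERE doc_embeddings.content_hash IS DISTINCT FROM EXCLUDED.content_hash;
"""

def to_row(item, vector):
    """Build the positional arguments for one upsert."""
    vec_literal = "[" + ",".join(map(repr, vector)) + "]"
    return (item["doc_id"], item["chunk_index"], item["content_hash"], vec_literal)

async def flush(conn, rows):
    """Write all buffered rows with one executemany call, then clear the buffer."""
    if rows:
        await conn.executemany(UPSERT, rows)
        rows.clear()

async def write_batched(conn, embedded, batch_size=100):
    """embedded: iterable of (item, vector) pairs that are already embedded."""
    rows = []
    for item, vector in embedded:
        rows.append(to_row(item, vector))
        if len(rows) >= batch_size:
            await flush(conn, rows)
    await flush(conn, rows)            # write the final partial batch
```

The trade-off is replay granularity: if a batch fails, every row in it has to be retried or dead-lettered together.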

Monitoring & Alerting Hooks

Instrument both halves of the pipeline — the in-process async runtime and the database — because a stall in either looks identical from the outside (throughput drops, latency climbs).

On the application side, export queue depth, in-flight permits, and per-stage latency as Prometheus gauges. Queue depth is the single most predictive signal: a queue pinned at maxsize means consumers are the bottleneck; a queue near zero means the producer is the bottleneck and consumers are starved.

PYTHON

import asyncio
from prometheus_client import Gauge, Histogram

QUEUE_DEPTH = Gauge("embed_queue_depth", "Items waiting in the ingest queue")
INFLIGHT = Gauge("embed_inflight", "Embedding requests in flight")
# Wrap the provider call in `with API_LATENCY.time():` to record latency.
API_LATENCY = Histogram("embed_api_seconds", "Embedding API latency",
                        buckets=(0.05, 0.1, 0.25, 0.5, 1, 2, 3))

async def sample_metrics(queue, sem, capacity):
    while True:
        QUEUE_DEPTH.set(queue.qsize())
        # sem._value is a private CPython attribute holding the remaining
        # permits, so in-flight requests = capacity - remaining.
        INFLIGHT.set(capacity - sem._value)
        await asyncio.sleep(1)
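Reading the private _value attribute works on CPython today but is not part of the documented Semaphore interface. A sketch of an alternative is a thin wrapper that counts holders explicitly; CountingSemaphore is a hypothetical helper, not an asyncio class.

```python
import asyncio

class CountingSemaphore:
    """asyncio.Semaphore wrapper that tracks in-flight holders explicitly."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.in_flight = 0
        self._sem = asyncio.Semaphore(capacity)

    async def __aenter__(self):
        await self._sem.acquire()
        self.in_flight += 1            # counted only once the permit is held
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.in_flight -= 1
        self._sem.release()

async def demo():
    sem = CountingSemaphore(3)
    seen = []

    async def job():
        async with sem:
            seen.append(sem.in_flight)
            await asyncio.sleep(0.01)

    await asyncio.gather(*(job() for _ in range(10)))
    return max(seen), sem.in_flight

peak, remaining = asyncio.run(demo())
print(peak, remaining)
```

A sampler can then export sem.in_flight directly instead of deriving it from the remaining permits.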

On the database side, alert on connection saturation and long-lived transactions from the ingestion role:

SQL

SELECT count(*) FILTER (WHERE state = 'active')            AS active,
       count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_txn,
       max(extract(epoch FROM now() - xact_start))         AS oldest_txn_s
FROM pg_stat_activity
WHERE application_name = 'embedding_ingest';

Page on idle_in_txn > 0 sustained for more than a few seconds (the connection-leak signature above) and on active approaching the pool max_size, which means the pool is the limiter: widen it or reduce the consumer count. Track write throughput with deltas of pg_stat_user_tables.n_tup_ins plus n_tup_upd over time, since an upsert that takes the conflict branch is counted as an update; a flat line while the queue is non-empty means writes are blocked, not slow.

FAQ

Q: Does uvloop change my code, or just the loop?

Only the loop. Install it via the loop_factory argument to asyncio.run() (available from Python 3.12) or the older policy API, and every asyncio primitive (Queue, Semaphore, gather) behaves identically, just faster on I/O. Your consumer and producer code is unchanged. The gain is largest for network-bound workloads like embedding ingestion and negligible for CPU-bound work, which you should have moved off the loop anyway.
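A sketch of an entry point that uses uvloop when it is installed and falls back to the stock loop otherwise. run_with_uvloop is a hypothetical helper; uvloop.new_event_loop and uvloop.install are the library's own functions, and the version check reflects that loop_factory was added to asyncio.run() in Python 3.12.

```python
import asyncio
import sys

async def main():
    await asyncio.sleep(0)
    # Report which loop implementation is actually running.
    return type(asyncio.get_running_loop()).__module__

def run_with_uvloop(coro_fn):
    try:
        import uvloop                  # optional third-party dependency
    except ImportError:
        return asyncio.run(coro_fn())  # fall back to the default loop
    if sys.version_info >= (3, 12):
        return asyncio.run(coro_fn(), loop_factory=uvloop.new_event_loop)
    uvloop.install()                   # older policy-based API
    return asyncio.run(coro_fn())

loop_module = run_with_uvloop(main)
print(loop_module)
```

Nothing inside main changes between the two paths, which is the point of the answer above.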

Q: How do I size the semaphore against a provider’s rate limit?

Convert the RPM quota to a concurrency ceiling with permits = (RPM / 60) × average_latency_seconds, rounded down. At 6000 RPM and 300 ms average latency that is 100 × 0.3 = 30 permits. This keeps steady-state throughput at or just under the quota as long as latency stays near the average; if the provider responds faster than that, the same 30 permits push more than 100 requests per second. Pair it with exponential backoff to absorb the bursts the average hides.

Q: Should I use asyncpg or psycopg3 for the write side?

Both support native async and pooling. asyncpg is faster for raw throughput and has first-class prepared-statement caching, which matters during high-volume upserts; psycopg3 offers a more familiar DB-API surface and easier interop with sync code paths. For a dedicated ingestion writer, asyncpg is the throughput choice. Whichever you pick, size the pool to the consumer count and never hold a transaction across an embedding-API await.

Q: Why is my throughput flat even though CPU and network are idle?

Almost always a synchronous call blocking the single event loop: a blocking DB driver, a requests call instead of aiohttp, or inline tokenization. Enable asyncio debug mode (asyncio.run(main(), debug=True) or PYTHONASYNCIODEBUG=1) so that callbacks exceeding loop.slow_callback_duration (0.1 s by default) are logged, and move any CPU-bound or blocking work into asyncio.to_thread() or a ThreadPoolExecutor.
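The detection step can be tried on a toy example. In this sketch a coroutine deliberately blocks the loop with time.sleep, and a logging handler captures what asyncio reports in debug mode; Capture and blocking_step are hypothetical names, and the 0.2 s block is arbitrary.

```python
import asyncio
import logging
import time

class Capture(logging.Handler):
    """Collect log messages so they can be inspected after the run."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

capture = Capture()
logging.getLogger("asyncio").addHandler(capture)

async def blocking_step():
    time.sleep(0.2)                    # deliberately blocks the event loop

async def main():
    asyncio.get_running_loop().slow_callback_duration = 0.1
    await asyncio.create_task(blocking_step())

asyncio.run(main(), debug=True)        # warnings are only emitted in debug mode
slow = [m for m in capture.messages if "took" in m]
print(slow)
```

The logged message names the task that held the loop, which is usually enough to find the blocking call.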

Q: How do I guarantee no duplicate vectors when retries fire?

Make the write idempotent at the schema level with a unique key on (doc_id, chunk_index) and an INSERT ... ON CONFLICT DO UPDATE. A retry then overwrites the same logical row instead of appending a duplicate. Add a content_hash guard so unchanged chunks are skipped entirely, avoiding needless WAL churn.

Related

Batch Chunking Strategies for Embeddings — the upstream stage that fills the ingest queue
Implementing exponential backoff for embedding API calls — retry logic for the 429/5xx failures this page delegates
Type Casting & Vector Normalization — normalizing and typing vectors before upsert
Metadata Mapping & Schema Design — the conflict keys and payload identifiers that make writes idempotent
Building a resilient Python embedding pipeline with Celery — the multi-node evolution of this producer–consumer pattern