Why use full jitter instead of a fixed exponential delay?

Without jitter, every concurrent worker that hit the same 429 wakes on the same doubling schedule and re-hammers the endpoint, turning one rate-limit event into a synchronized storm. Full jitter samples uniformly in [0, capped_delay], decorrelating the wake-ups and minimizing both contention and total completion time.

Should I honor the Retry-After header over my computed backoff?

Yes. When the provider sends Retry-After it is stating exactly when the window reopens; overriding the computed delay with it (clamped to max_delay) respects the rate-limit contract. Ignoring it and continuing exponential growth often triggers an IP-level block.

How many retries should I allow before dead-lettering?

Five to seven. Past roughly seven attempts the probability of success drops below about 12% on most commercial embedding providers, so further retries just burn the concurrency budget. Once the ceiling is hit, serialize the payload to a dead-letter queue and let a reconciliation worker replay it after the provider recovers.

How do I stop retries from creating duplicate vectors?

Make the write idempotent: a unique key on (doc_id, chunk_index) and INSERT ... ON CONFLICT DO UPDATE, plus a content_hash guard so unchanged chunks skip the write. A retry or a dead-letter replay then overwrites the same logical row instead of appending a duplicate embedding.

Implementing Exponential Backoff for Embedding API Calls in Async Python Pipelines

This page shows how to build a production retry wrapper that absorbs HTTP 429 and transient 5xx failures from an embedding provider without stalling the event loop or duplicating vectors. It scopes the problem tightly: given an httpx.AsyncClient posting batches to a vectorization endpoint, how to pace retries with jittered exponential backoff, honor a server Retry-After header, and hand exhausted requests to a dead-letter queue so a run over millions of documents never silently drops a chunk.


Backoff is the failure-handling half of async ingestion. The concurrency model — semaphores, bounded queues, connection pools — is covered in the parent Async Processing with Python AsyncIO guide, and it decides how many requests are in flight; backoff decides what happens when one of them is rejected. Get it wrong in an async context and the two classic failures appear: a blocking time.sleep() that freezes every coroutine on the loop, or an un-jittered retry that fires all in-flight workers at the provider on the same tick and turns one 429 into a self-inflicted rate-limit storm.

Prerequisites

Python 3.11+ (for asyncio.TaskGroup and except*; the retry loop itself runs on 3.10+, because its annotations use the float | None union syntax).
httpx >= 0.24 as the async HTTP client, or aiohttp — the pattern is identical.
A running vector store: PostgreSQL 15+ with pgvector 0.5+ (0.7+ if you store halfvec), reachable through an asyncpg pool sized to your consumer count.
A durable sink for exhausted payloads: a database table, Redis Stream, or SQS queue for the dead-letter path.
Provider limits on hand: the endpoint’s requests-per-minute (RPM) quota and whether it emits Retry-After on 429. These set max_delay and max_retries.

Jittered exponential backoff: each 429 grows the wait (capped) before the worker retries, until the request succeeds or exhausts its budget.

Step-by-step procedure

1. Classify errors before you retry

Retrying the wrong error wastes the whole budget and hides bugs. A 400 Bad Request or 422 from a malformed payload will never succeed on retry — it belongs in the dead-letter queue immediately, not after seven backoffs. Only rate limits, transient upstream faults, and network timeouts are retryable.

PYTHON

import httpx

RETRYABLE_STATUS = {429, 500, 502, 503, 504}

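def is_retryable(exc: Exception) -> bool:
    # Network-level faults: failed connects plus every httpx timeout
    # (connect, read, write, pool) via the shared TimeoutException base class.
    if isinstance(exc, (httpx.ConnectError, httpx.TimeoutException)):
        return True
    # raise_for_status() failures: retry only rate limits and transient 5xx.
    if isinstance(exc, httpx.HTTPStatusError):
        return exc.response.status_code in RETRYABLE_STATUS
    # Anything else (400/422 payload errors, bugs) goes straight to the DLQ.
    return False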
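The routing rule described above has three outcomes rather than two: success, retry, and immediate dead-lettering. A minimal stdlib-only sketch of that decision, keyed on the status code alone, might look like the following. The names Outcome and route_status are illustrative and do not appear in the pipeline code.

```python
from enum import Enum

# Same retryable set as in the classifier above.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class Outcome(Enum):  # hypothetical helper type, for illustration only
    SUCCESS = "success"
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


def route_status(status_code: int) -> Outcome:
    """Map an HTTP status code to the action the pipeline should take."""
    if 200 <= status_code < 300:
        return Outcome.SUCCESS
    if status_code in RETRYABLE_STATUS:
        return Outcome.RETRY
    # 400, 401, 422, ...: the same request would fail the same way again,
    # so it is dead-lettered without spending any of the retry budget.
    return Outcome.DEAD_LETTER


for code in (200, 429, 503, 400, 422):
    print(code, route_status(code).value)
```

Keeping the third outcome explicit makes it harder for a malformed payload to burn through the backoff schedule unnoticed.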

2. Compute the delay: full jitter, capped, Retry-After aware

The canonical formula is delay = min(base_delay * 2 ** attempt, max_delay). In an async pipeline with many concurrent workers, the exponential term alone is dangerous: every worker that hit the same 429 wakes on the same schedule. Full jitter — sampling uniformly in [0, capped_delay] — decorrelates the wake-ups and is the variant AWS measured as minimizing both contention and completion time. When the server sends a Retry-After header (per RFC 6585, which defines the 429 status), it overrides the computed value: the provider is telling you exactly when the window reopens, and ignoring it invites an IP-level block.

PYTHON

import random

import httpx

def compute_delay(attempt: int, retry_after: float | None,
                  base_delay: float = 1.0, max_delay: float = 30.0) -> float:
    # A server-provided window wins over the computed value, clamped to max_delay.
    if retry_after is not None:
        return min(retry_after, max_delay)
    capped = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0, capped)  # full jitter

def parse_retry_after(exc: Exception) -> float | None:
    # Handles the delta-seconds form only; an HTTP-date value returns None
    # and falls back to the computed backoff.
    if isinstance(exc, httpx.HTTPStatusError):
        raw = exc.response.headers.get("Retry-After")
        if raw and raw.isdigit():
            return float(raw)
    return None
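Retry-After may also arrive as an HTTP-date rather than a number of seconds. If your provider uses that form, a parser along the following lines could cover both. This is a sketch using only the standard library; retry_after_seconds and its now parameter (injected so the function is testable) are assumptions, not part of the code above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(raw: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait in seconds for a Retry-After value, or None if unusable."""
    if not raw:
        return None
    raw = raw.strip()
    # Form 1: delta-seconds, e.g. "120".
    if raw.isascii() and raw.isdigit():
        return float(raw)
    # Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(raw)
    except (TypeError, ValueError):  # unparseable header: ignore it
        return None
    if when.tzinfo is None:          # treat a zone-less date as UTC
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the window has already reopened.
    return max(0.0, (when - now).total_seconds())
```

The result would still be passed through min(retry_after, max_delay) before sleeping, so an unexpectedly distant date cannot park a worker indefinitely.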

3. Wrap the request in a non-blocking retry loop

The loop must yield to the event loop with await asyncio.sleep(), never time.sleep(), which would block every other coroutine, including the database writers and the metadata step described in Metadata Mapping & Schema Design. Attach a stable request_id (a hash of the chunk content plus model version) so every retry of a chunk carries the same key: a provider that supports an idempotency header can then deduplicate on its side, and the dead-letter table can key on the same value.

PYTHON

import asyncio
import logging
from typing import Any

logger = logging.getLogger(__name__)

async def embed_with_backoff(
    client: httpx.AsyncClient,
    payload: dict,
    request_id: str,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    max_retries: int = 5,
) -> dict[str, Any]:
    for attempt in range(max_retries + 1):
        try:
            resp = await client.post(
                "/v1/embeddings", json=payload,
                headers={"Idempotency-Key": request_id},
            )
            resp.raise_for_status()
            return resp.json()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_retries:
                logger.error("giving up id=%s attempt=%d: %s", request_id, attempt, exc)
                raise
            delay = compute_delay(attempt, parse_retry_after(exc), base_delay, max_delay)
            logger.warning("retry id=%s attempt=%d sleeping=%.2fs: %s",
                           request_id, attempt, delay, exc)
            await asyncio.sleep(delay)
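The request_id passed to embed_with_backoff has to come out identical on every retry and replay of the same chunk. One possible derivation, assuming the payload is JSON-serializable, hashes a canonical form of the payload together with the model version. The function name make_request_id and the model labels in the usage lines are hypothetical.

```python
import hashlib
import json


def make_request_id(payload: dict, model_version: str) -> str:
    """Derive a stable id from the chunk payload and the model version."""
    # Canonical JSON: sorted keys and fixed separators, so logically equal
    # payloads hash the same regardless of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    digest = hashlib.sha256()
    digest.update(model_version.encode("utf-8"))
    digest.update(b"\x00")  # separator so the two fields cannot run together
    digest.update(canonical.encode("utf-8"))
    return digest.hexdigest()


same_a = make_request_id({"input": "hello", "dims": 256}, "embed-v1")
same_b = make_request_id({"dims": 256, "input": "hello"}, "embed-v1")
other = make_request_id({"input": "hello", "dims": 256}, "embed-v2")
print(same_a == same_b, same_a == other)
```

Including the model version means a re-embedding run with a new model gets fresh ids instead of colliding with the old ones.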

4. Keep backoff inside the concurrency budget

A worker that is sleeping between retries should not hold a provider permit or a database connection it isn’t using. Acquire the semaphore around the request, and configure the client’s pool limits so a retry storm can’t exhaust sockets. Retry at the chunk level, not the whole batch — if a 512-document batch returns 429, splitting it and retrying the sub-chunks preserves the partial progress rather than replaying everything. Chunk sizing is set upstream in Batch Chunking Strategies for Embeddings.
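The batch-splitting idea can be sketched as a recursive bisect: when a batch is rejected, halve it and submit each half separately. The example below is only an illustration with a stand-in sender. RateLimited, embed_with_split and fake_send are hypothetical names, and backoff between attempts is deliberately left out to keep the splitting logic visible.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


class RateLimited(Exception):
    """Stand-in for the provider's 429 response (hypothetical)."""


SendFn = Callable[[Sequence[str]], Awaitable[List[str]]]


async def embed_with_split(items: Sequence[str], send: SendFn) -> List[str]:
    """Send a batch; if it is rejected, bisect it and send the halves."""
    try:
        return await send(items)
    except RateLimited:
        if len(items) == 1:
            raise  # cannot split further: leave it to the retry/DLQ path
        mid = len(items) // 2
        left = await embed_with_split(items[:mid], send)
        right = await embed_with_split(items[mid:], send)
        return left + right  # input order is preserved


async def fake_send(batch: Sequence[str]) -> List[str]:
    # Pretend the provider rejects any batch larger than two items.
    if len(batch) > 2:
        raise RateLimited(f"batch of {len(batch)} rejected")
    return [f"vec:{text}" for text in batch]


result = asyncio.run(embed_with_split(["a", "b", "c", "d", "e"], fake_send))
print(result)
```

In a real pipeline each send would go through the backoff wrapper, so a sub-batch that is still throttled waits before it is split again.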

PYTHON

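# Pass this to the client: httpx.AsyncClient(limits=limits, ...)
limits = httpx.Limits(max_connections=100, max_keepalive_connections=20)

async def worker(sem: asyncio.Semaphore, client: httpx.AsyncClient,
                 payload: dict, request_id: str) -> dict | None:
    # NOTE: as written, the permit spans the whole retry loop, backoff sleeps
    # included. To free it while sleeping, acquire `sem` around client.post()
    # inside embed_with_backoff instead.
    async with sem:
        try:
            return await embed_with_backoff(client, payload, request_id)
        except Exception:
            return None  # exhausted or non-retryable -> caller routes to DLQ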
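One way to make the permit cover only the in-flight request, as this step asks, is to acquire the semaphore inside the retry loop rather than around it. The sketch below uses stand-ins throughout: Retryable, call_with_permit_per_attempt and the injectable sleep parameter are assumptions made so the example runs without a network or a real clock.

```python
import asyncio
import random
from typing import Awaitable, Callable


class Retryable(Exception):
    """Stand-in for a 429, transient 5xx or timeout (hypothetical)."""


async def call_with_permit_per_attempt(
    sem: asyncio.Semaphore,
    send: Callable[[], Awaitable[str]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> str:
    for attempt in range(max_retries + 1):
        try:
            async with sem:  # permit covers only the in-flight request
                return await send()
        except Retryable:
            if attempt == max_retries:
                raise
        # The permit has been released here, so other workers can use it
        # while this one waits out its jittered delay.
        cap = min(base_delay * 2 ** attempt, max_delay)
        await sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


async def _demo():
    sem = asyncio.Semaphore(1)
    attempts = 0
    permit_held_while_sleeping = []

    async def send() -> str:
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise Retryable("429")
        return "ok"

    async def fake_sleep(delay: float) -> None:
        # Record whether the only permit is taken during the backoff wait.
        permit_held_while_sleeping.append(sem.locked())

    result = await call_with_permit_per_attempt(sem, send, sleep=fake_sleep)
    return result, attempts, permit_held_while_sleeping


result, attempts, held = asyncio.run(_demo())
print(result, attempts, held)
```

The trade-off is that a released permit lets another worker hit the provider during the wait, so this variant relies on jitter and Retry-After, not on the semaphore, to ease pressure during a throttling event.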

5. Route exhausted requests to a dead-letter queue

Backoff cannot fix a sustained outage. Once max_retries is spent, serialize the payload, its request_id, and the last error to a durable sink, then keep the pipeline moving. A separate reconciliation worker replays the dead-letter rows after the provider recovers. Because the vector write is idempotent (an INSERT ... ON CONFLICT (doc_id, chunk_index) DO UPDATE on the chunk's natural key), replaying a dead-lettered chunk overwrites its logical row instead of appending a duplicate vector.

PYTHON

import json
import time

async def dead_letter(pool, payload: dict, request_id: str, error: str) -> None:
    # Upsert on request_id so a chunk that fails repeatedly keeps one DLQ row.
    # failed_at is stored as epoch seconds (a float), not a timestamp.
    async with pool.acquire() as conn:
        await conn.execute(
            """INSERT INTO embedding_dlq (request_id, payload, last_error, failed_at)
               VALUES ($1, $2, $3, $4)
               ON CONFLICT (request_id) DO UPDATE
                 SET last_error = EXCLUDED.last_error, failed_at = EXCLUDED.failed_at""",
            request_id, json.dumps(payload), error, time.time(),
        )
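The idempotent vector write that makes replay safe can be tried out locally. The sketch below uses the standard-library sqlite3 module as a stand-in for PostgreSQL with pgvector, which is an assumption made purely so the example runs anywhere. It needs SQLite 3.24 or newer for ON CONFLICT, and the chunks table, its columns and upsert_chunk are illustrative names. The same statement shape, including the content_hash guard in the WHERE clause, is also valid PostgreSQL.

```python
import hashlib
import json
import sqlite3

# Upsert keyed on the natural key; the WHERE guard skips unchanged chunks.
UPSERT = """
INSERT INTO chunks (doc_id, chunk_index, content_hash, embedding)
VALUES (?, ?, ?, ?)
ON CONFLICT (doc_id, chunk_index) DO UPDATE
  SET content_hash = excluded.content_hash,
      embedding    = excluded.embedding
  WHERE chunks.content_hash <> excluded.content_hash
"""


def upsert_chunk(conn, doc_id, chunk_index, text, embedding):
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    conn.execute(UPSERT, (doc_id, chunk_index, content_hash,
                          json.dumps(embedding)))


conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE chunks (
    doc_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    content_hash TEXT NOT NULL,
    embedding TEXT NOT NULL,          -- JSON text stands in for a vector column
    PRIMARY KEY (doc_id, chunk_index))""")

upsert_chunk(conn, "doc-1", 0, "hello", [0.1, 0.2])
upsert_chunk(conn, "doc-1", 0, "hello", [0.1, 0.2])      # retry or DLQ replay
count_after_replay = conn.execute("SELECT count(*) FROM chunks").fetchone()[0]

upsert_chunk(conn, "doc-1", 0, "hello v2", [0.3, 0.4])   # content changed
rows = conn.execute(
    "SELECT doc_id, chunk_index, embedding FROM chunks").fetchall()
print(count_after_replay, rows)
```

The replay leaves a single row in place, and only a chunk whose content actually changed overwrites the stored embedding.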

Parameter reference

Name	Type	Default	Production recommendation	Notes
`base_delay`	float (s)	`1.0`	`0.5`–`1.0`	First backoff before jitter; too high wastes throughput on brief blips.
`max_delay`	float (s)	`30.0`	`30`–`60`	Cap the exponential term at your SLO timeout; a wait longer than that indicates an outage, not throttling.
`max_retries`	int	`5`	`5`–`7`	Past ~7 attempts success probability falls below ~12% on most commercial providers.
`jitter`	strategy	none	full jitter `uniform(0, cap)`	Decorrelates concurrent workers; the single most important knob at high concurrency.
`respect_retry_after`	bool	`False`	`True`	Server-provided window overrides the computed delay; ignoring it risks IP bans.
`RETRYABLE_STATUS`	set[int]	`{429,5xx}`	`{429,500,502,503,504}`	Never include `4xx` client errors except `429`; they will not recover.
`max_connections`	int	client default	`100` (pool-aligned)	Keep at or below the `asyncpg` pool + provider concurrency so retries can’t exhaust sockets.
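These parameters together bound how long a single request can spend sleeping before it is dead-lettered. A small helper, sketched here under the assumption that the loop sleeps once per failed attempt except the last and that no Retry-After header intervenes, makes the budget concrete. backoff_caps is a hypothetical name.

```python
def backoff_caps(base_delay: float, max_delay: float,
                 max_retries: int) -> list[float]:
    """Upper bound of each jittered sleep, one per retry."""
    return [min(base_delay * 2 ** attempt, max_delay)
            for attempt in range(max_retries)]


caps = backoff_caps(base_delay=1.0, max_delay=30.0, max_retries=5)
worst_case = sum(caps)     # every jitter draw lands on its cap
expected = sum(caps) / 2   # the mean of uniform(0, cap) is cap / 2
print(caps, worst_case, expected)

caps_7 = backoff_caps(base_delay=1.0, max_delay=30.0, max_retries=7)
print(caps_7, sum(caps_7))
```

With the defaults, a request sleeps at most 31 seconds in total across five retries, and about half that on average. Raising max_retries to 7 lets the cap take effect and pushes the worst case to 91 seconds, not counting time spent in the requests themselves.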

Verification

Confirm the loop retries the right number of times and sleeps the right amounts without touching the network or the clock. A mock transport returns two 429s then a 200; patching asyncio.sleep records the delays and keeps the test instant.

PYTHON

import asyncio, httpx, pytest
from unittest.mock import AsyncMock, patch

@pytest.mark.asyncio
async def test_backoff_retries_then_succeeds():
    responses = [
        httpx.Response(429, headers={"Retry-After": "2"}),
        httpx.Response(429),
        httpx.Response(200, json={"data": [{"embedding": [0.1, 0.2]}]}),
    ]
    transport = httpx.MockTransport(lambda req: responses.pop(0))
    client = httpx.AsyncClient(transport=transport, base_url="http://api")

    with patch("asyncio.sleep", new=AsyncMock()) as slept:
        result = await embed_with_backoff(client, {"input": "x"}, "req-1", max_retries=5)

    assert result["data"][0]["embedding"] == [0.1, 0.2]
    assert slept.await_count == 2            # exactly two retries
    assert slept.await_args_list[0].args[0] == 2.0   # Retry-After honored on attempt 0
    assert 0 <= slept.await_args_list[1].args[0] <= 2.0  # full jitter, cap = 1.0 * 2 ** 1

Under real load, confirm the pacing works by watching the ratio of 429s that resolve on retry. If the retry rate climbs but the 429 rate does not fall, the delays are too short or jitter is missing.

SQL

-- exhausted requests should trend to zero once the provider recovers
SELECT date_trunc('minute', to_timestamp(failed_at)) AS minute,
       count(*) AS dead_lettered
FROM embedding_dlq
GROUP BY 1 ORDER BY 1 DESC LIMIT 10;

Troubleshooting

Whole pipeline freezes during a rate-limit spike. A blocking time.sleep() (or a synchronous HTTP client like requests) slipped into the retry path and stalled the event loop. Run the loop in debug mode (asyncio.run(main(), debug=True) or PYTHONASYNCIODEBUG=1) so that any callback running longer than loop.slow_callback_duration (0.1 s by default) is logged, then replace the offender with await asyncio.sleep() on an async client.
429s get worse right after they start. No jitter: every worker retries on the same doubling schedule and re-hammers the endpoint. Switch to full jitter (random.uniform(0, cap)) and verify the sampled delays actually vary in your logs.
Sockets exhausted / PoolTimeout under retries. In-flight retries are holding connections. Set httpx.Limits(max_connections=..., max_keepalive_connections=...) at or below your asyncpg pool size, and hold the semaphore only around the active attempt.
Duplicate vectors appear after an outage. Retries or dead-letter replays are inserting instead of upserting. Enforce a unique key on (doc_id, chunk_index) and an INSERT ... ON CONFLICT DO UPDATE; add a content_hash guard so unchanged chunks skip the write. See Metadata Mapping & Schema Design for the conflict keys.
Retries never stop / batch never finishes. The provider returns a large Retry-After and it is being applied uncapped. Clamp it with min(retry_after, max_delay) and enforce a max_retries ceiling so a stuck request dead-letters instead of blocking a consumer forever.
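To see what asyncio's slow-callback detection reports for a frozen pipeline, the following sketch deliberately blocks the loop and captures what the asyncio logger emits in debug mode. The coroutine bad_backoff and the collecting handler are illustrative. The warning itself comes from the standard library's event loop, which logs any step that runs longer than loop.slow_callback_duration when debug mode is on.

```python
import asyncio
import logging
import time


class _Collect(logging.Handler):
    """Keep formatted log messages in a list so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


async def bad_backoff() -> None:
    time.sleep(0.2)  # deliberate bug: blocks the event loop for 200 ms


collector = _Collect()
asyncio_logger = logging.getLogger("asyncio")
asyncio_logger.addHandler(collector)
try:
    # debug=True turns on slow-callback detection (default threshold 0.1 s).
    asyncio.run(bad_backoff(), debug=True)
finally:
    asyncio_logger.removeHandler(collector)

for message in collector.messages:
    print(message)
```

The logged line names the task that overran, which is usually enough to locate the blocking call in the retry path.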

Related

Async Processing with Python AsyncIO: the semaphore/queue concurrency model this backoff plugs into
Building a resilient Python embedding pipeline with Celery: the multi-node evolution of retry + dead-letter handling
Handling metadata drift during vector ingestion: keeping replayed chunks schema-consistent