Webhook Delivery System

System Design · Onsite · Phone · Software Engineer · Reported Feb 2026 · Medium Frequency

Problem Statement

Design a highly scalable webhook delivery system that allows users to register callback URLs for specific events. When an event occurs, the system must reliably deliver HTTP POST requests to all registered webhooks with the event payload. The system must handle up to 1 billion events per day while ensuring reliable delivery, retry logic, security, and observability.

Comprehensive Solution Resources

For detailed architectural solutions and implementation approaches, these resources provide excellent technical depth:

Design a Webhook System

System Design School - Webhook Solution

Video Explanation

The content below focuses on real interview experiences from candidates who interviewed at OpenAI, showing what actually gets discussed and emphasized during the interview.

Real Interview Experiences

Experience 1: REST API, Caching, DB Design, and Retry Mechanisms

"The objective is to design a Webhook service which allows users to register their callback address and an eventId. Whenever eventId is triggered, the system should call the registered callback address with a specific payload. It is safe to assume that each eventId → callback address is unique and one eventId will trigger only one callback address. Assume a highly scalable webhook delivery system which can handle up to 1B events per day. Webhook service that serves URL, questions around caching, db design, focus on failure and retry mechanism in message queue. Implement the REST service with JSON body and query for GET and POST."

What the interviewer emphasized:

REST API Design

How to structure webhook registration endpoints (POST /webhooks)

How to implement GET endpoints to retrieve webhook status and delivery logs

JSON request/response format design

Query parameters for filtering and pagination

Caching Strategy

Where to apply caching in the system (webhook configurations vs. delivery logs)

What to cache: Active webhook registrations for fast event fan-out

Cache invalidation strategy when webhooks are updated/deleted

Trade-offs: Freshness vs. performance

Database Design

Schema for webhook registrations table (event_id, callback_url, user_id, is_active)

Schema for delivery logs/audit trail (delivery_id, webhook_id, status, attempt_count, timestamps)

Indexing strategy for efficient queries

How to handle the uniqueness constraint: one eventId → one callback per user

Failure and Retry Mechanism in Message Queue (Primary focus area)

How to implement retry logic using message queue features

Exponential backoff timing strategy

Which HTTP status codes should trigger retries (5xx, timeouts) vs. permanent failures (4xx)

Dead letter queue (DLQ) for messages that exceed max retries

How to track retry attempts (in message metadata vs. database)

Preventing retry storms when many webhooks fail simultaneously

Key insight: The interviewer focused heavily on the practical implementation details—not just high-level architecture, but actual REST endpoints, caching patterns, and how to use message queue primitives (visibility timeout, DLQ, message attributes) to implement robust retry logic. The simplified assumption (1:1 mapping of eventId to callback) meant less focus on fan-out complexity, but more depth on reliability mechanisms.

Reference solution

#30 Webhook Delivery System — Solution

AI-Generated Solution · System Design · Comprehensive

Reliable HTTP delivery of events to user-registered callback URLs at 1B events/day (~12K events/sec average; plan for 5–10× peak). The interview emphasis is on the REST API, caching, DB design, and the retry mechanism.

1. Requirements

Functional

Register/update/delete a webhook: (event_id → callback_url) (assume unique 1:1 mapping per the prompt).
On an event, POST the payload to the registered callback.
Query webhook status and delivery logs (GET, with filtering + pagination).

Non-functional

Reliable, at-least-once delivery with retries (consumers must be idempotent).
Durable — never drop an accepted event.
Secure — signed payloads, HTTPS, SSRF protection.
Observable — per-delivery status, attempts, latency.
Scale: 1B events/day.

2. Capacity

1B/day ÷ 86,400 s ≈ 11.6K events/sec average (call it ~12K); design for a 5–10× peak of ~60–120K/sec.
Delivery is I/O-bound (waiting on slow/cold customer endpoints) → many concurrent workers, not CPU-heavy.
Delivery logs dominate storage: ~1B rows/day → TTL + cold storage tiering.
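
The arithmetic above can be checked with a few lines of Python. This is a rough sketch: the 0.5 s mean callback latency is an assumption introduced here for illustration, not a figure from the prompt, and the in-flight estimate uses Little's law (concurrency = arrival rate × time in system).

```python
# Back-of-the-envelope capacity numbers for 1B events/day.
EVENTS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400

avg_rate = EVENTS_PER_DAY / SECONDS_PER_DAY          # events/sec, average
peak_low, peak_high = 5 * avg_rate, 10 * avg_rate    # 5-10x peak factor from the text

# Assumption (not from the source): mean time a worker waits on a callback.
ASSUMED_MEAN_LATENCY_S = 0.5


def in_flight(rate_per_s: float, latency_s: float) -> float:
    """Little's law: concurrent requests = arrival rate x time in system."""
    return rate_per_s * latency_s


print(f"average rate : {avg_rate:,.0f} events/s")
print(f"peak range   : {peak_low:,.0f} to {peak_high:,.0f} events/s")
print(f"in flight at 10x peak: {in_flight(peak_high, ASSUMED_MEAN_LATENCY_S):,.0f} requests")
```

Under that latency assumption, tens of thousands of requests are in flight at peak, which is why the workers need high I/O concurrency rather than more CPU.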

3. Architecture

Figure: Webhook delivery architecture

Decouple accept from deliver with a durable queue. Accepting an event is a fast, durable write; delivery (slow, retried) happens asynchronously in workers.
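
A minimal sketch of the accept/deliver split, assuming an in-process `queue.Queue` as a stand-in for the durable queue. The names `accept_event` and `drain` are hypothetical; a real system would use Kafka, SQS, or similar and persist the event before acknowledging.

```python
import queue
import uuid

# Stand-in for a durable queue (Kafka/SQS/RabbitMQ in a real deployment).
event_queue: "queue.Queue[dict]" = queue.Queue()


def accept_event(event_id: str, payload: dict) -> tuple[int, dict]:
    """Fast path: enqueue the event and acknowledge with 202 Accepted."""
    event_uid = str(uuid.uuid4())          # dedupe key carried through delivery
    event_queue.put({"event_uid": event_uid, "event_id": event_id, "payload": payload})
    return 202, {"event_uid": event_uid, "status": "queued"}


def drain(deliver) -> int:
    """Slow path: a worker pulls queued events and calls deliver(event) for each."""
    delivered = 0
    while True:
        try:
            event = event_queue.get_nowait()
        except queue.Empty:
            return delivered
        deliver(event)
        delivered += 1
```

The caller of `POST /v1/events` only waits for the enqueue; the slow HTTP call to the customer endpoint happens later inside `deliver`.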

4. API Design

POST   /v1/webhooks            { "event_id": "...", "callback_url": "https://...", "secret": "..." }
GET    /v1/webhooks/{id}       -> config + status
PATCH  /v1/webhooks/{id}       { "callback_url": "https://...", "is_active": false }   -> updated config
DELETE /v1/webhooks/{id}
GET    /v1/webhooks/{id}/deliveries?status=failed&cursor=...&limit=50   -> delivery log (paginated)
POST   /v1/events             { "event_id": "...", "payload": {...} }    -> 202 Accepted (queued)

Use cursor-based pagination for delivery logs (offset pagination is too expensive at this volume).
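
One way to build the cursor is to encode the sort key of the last row returned. The sketch below assumes rows are sorted newest-first by `(created_at, id)` and filters an in-memory list; `encode_cursor`, `decode_cursor`, and `list_deliveries` are illustrative names, not part of any real API.

```python
import base64
import json


def encode_cursor(created_at: str, delivery_id: str) -> str:
    """Pack the last row's sort key into an opaque, URL-safe token."""
    raw = json.dumps([created_at, delivery_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> tuple[str, str]:
    """Recover the (created_at, id) sort key from a cursor token."""
    created_at, delivery_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return created_at, delivery_id


def list_deliveries(rows, cursor=None, limit=50):
    """Keyset pagination over rows sorted newest-first by (created_at, id)."""
    ordered = sorted(rows, key=lambda r: (r["created_at"], r["id"]), reverse=True)
    if cursor is not None:
        key = decode_cursor(cursor)
        # Keep only rows strictly older than the last row of the previous page.
        ordered = [r for r in ordered if (r["created_at"], r["id"]) < key]
    page = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = encode_cursor(page[-1]["created_at"], page[-1]["id"]) if has_more else None
    return {"data": page, "next_cursor": next_cursor}
```

In a database the same filter becomes a range condition on the index columns, so each page costs roughly the same regardless of how deep the client has paged, unlike OFFSET, which rescans the skipped rows.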

5. Data Model

CREATE TABLE webhooks (
  id           UUID PRIMARY KEY,
  event_id     TEXT NOT NULL UNIQUE, -- 1:1 mapping per problem statement
  callback_url TEXT NOT NULL,
  secret       TEXT NOT NULL,       -- for HMAC signing
  is_active    BOOLEAN DEFAULT true,
  created_at   TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE deliveries (
  id            UUID PRIMARY KEY,
  webhook_id    UUID,
  event_uid     UUID,               -- dedupe key, also sent as header for consumer idempotency
  status        TEXT,               -- PENDING|SUCCESS|FAILED|DEAD
  attempt_count INT DEFAULT 0,
  next_retry_at TIMESTAMPTZ,
  last_status   INT,                -- last HTTP code
  created_at    TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_deliveries_lookup ON deliveries (webhook_id, status, created_at);
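
If a scheduler polls the deliveries table for due retries, it needs a cheap way to find them. The sketch below uses an in-memory SQLite database purely for illustration, with a trimmed copy of the table. The extra index `idx_deliveries_due`, the use of epoch seconds, and the reading of FAILED as "failed, awaiting retry" are assumptions made here, not part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE deliveries (
        id            TEXT PRIMARY KEY,
        webhook_id    TEXT,
        status        TEXT,
        attempt_count INTEGER DEFAULT 0,
        next_retry_at INTEGER          -- epoch seconds, for simplicity
    )
""")
# Assumed extra index so the scheduler scan does not touch finished rows.
conn.execute("CREATE INDEX idx_deliveries_due ON deliveries (status, next_retry_at)")

conn.executemany(
    "INSERT INTO deliveries (id, webhook_id, status, attempt_count, next_retry_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("d1", "w1", "FAILED", 1, 100),    # due
        ("d2", "w1", "FAILED", 2, 900),    # not due yet
        ("d3", "w2", "SUCCESS", 1, None),  # finished
    ],
)


def due_retries(now_epoch: int, batch: int = 100) -> list[str]:
    """Return ids of failed deliveries whose retry time has passed, oldest first."""
    rows = conn.execute(
        "SELECT id FROM deliveries "
        "WHERE status = 'FAILED' AND next_retry_at <= ? "
        "ORDER BY next_retry_at LIMIT ?",
        (now_epoch, batch),
    ).fetchall()
    return [row[0] for row in rows]
```

With several scheduler instances, each batch would also need to be claimed atomically so the same delivery is not re-enqueued twice; that part is omitted here.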

6. Caching

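Active webhook configs (the event_id → callback_url + secret map) are read on every event, so cache them in Redis for a fast lookup on the delivery path. On update/delete, change the database first and then either refresh the cached entry (write-through) or delete it (invalidation); a short TTL bounds staleness if an invalidation is lost. This is the hot read path the interviewer probes.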
Do not cache delivery logs (write-heavy, append-only) — index them instead.
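
A minimal sketch of the config cache, using plain dicts as stand-ins for Redis and the webhooks table. It shows the invalidation variant (delete the cached entry on write, repopulate on the next read); `WebhookConfigCache` is a hypothetical name.

```python
class WebhookConfigCache:
    """Cache-aside lookup of event_id -> config, with invalidation on writes."""

    def __init__(self, db: dict):
        self.db = db              # stand-in for the webhooks table
        self.cache = {}           # stand-in for Redis
        self.db_reads = 0         # counts cache misses that hit the database

    def get(self, event_id: str):
        if event_id in self.cache:
            return self.cache[event_id]
        self.db_reads += 1
        config = self.db.get(event_id)
        if config is not None:
            self.cache[event_id] = config
        return config

    def update(self, event_id: str, config: dict) -> None:
        self.db[event_id] = config          # write the source of truth first
        self.cache.pop(event_id, None)      # then drop the stale entry

    def delete(self, event_id: str) -> None:
        self.db.pop(event_id, None)
        self.cache.pop(event_id, None)
```

Repeated `get` calls for the same event_id hit the database once; after `update` or `delete`, the next read goes back to the database and sees the new state.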

7. Retry Mechanism (the primary focus)

Implemented with message-queue primitives:

Which failures retry: transient — 5xx, connection timeout, DNS failure → retry. Permanent — 4xx (except 429) → fail fast (don't retry a malformed/forbidden endpoint). 429 → honor Retry-After.
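Exponential backoff with jitter: delays grow roughly geometrically (for example 1s, 5s, 30s, 5m, 30m, 2h), capped at a maximum delay and a maximum number of attempts (e.g. 6–8), with random jitter added so that retries from many failed deliveries do not synchronize into a retry storm.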
Implementation: a delay queue / scheduled re-enqueue (Kafka with a delay topic, SQS visibility timeout, or RabbitMQ TTL+DLX). Store attempt_count and next_retry_at on the delivery; a scheduler re-enqueues due retries.
Dead Letter Queue: after max attempts → move to DLQ, mark DEAD, alert, and expose for manual replay.
Idempotency: send a stable X-Webhook-Id/event_uid header so consumers can dedupe (delivery is at-least-once).
Retry-storm protection: per-endpoint circuit breaker — if a callback is failing consistently, back off that endpoint specifically so one dead consumer doesn't starve workers.

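import random
from datetime import datetime, timedelta, timezone

MAX_ATTEMPTS, BASE_S, MAX_BACKOFF_S = 8, 1.0, 7200.0     # illustrative limits

def schedule_retry(delivery, delay_queue, dead_letter, permanent):
    """Re-enqueue a failed delivery with capped exponential backoff, or dead-letter it."""
    delivery.attempt_count += 1                     # counts failed attempts so far
    if delivery.attempt_count > MAX_ATTEMPTS or permanent(delivery.last_status):
        dead_letter(delivery)                       # -> DLQ, mark DEAD, alert
        return
    backoff = min(BASE_S * 2 ** (delivery.attempt_count - 1), MAX_BACKOFF_S)  # 1s, 2s, 4s, ...
    delay = backoff + random.uniform(0, 0.2 * backoff)                         # add up to 20% jitter
    delivery.next_retry_at = datetime.now(timezone.utc) + timedelta(seconds=delay)
    delay_queue.put(delivery, eta=delivery.next_retry_at)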
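
The retry classification rules in this section can be sketched as a small pure function. `classify` is a hypothetical helper; treating leftover 3xx responses as permanent failures is an assumption made here, and only the numeric form of Retry-After is handled (the header may also carry an HTTP date).

```python
from typing import Optional


def classify(status: Optional[int], retry_after: Optional[str] = None) -> tuple[str, Optional[float]]:
    """Map a delivery outcome to (action, delay_override_seconds).

    action is "success", "retry", or "fail". status is None when no HTTP
    response arrived at all (timeout, DNS failure, connection error).
    """
    if status is None:
        return "retry", None
    if 200 <= status < 300:
        return "success", None
    if status == 429:
        # Honor Retry-After when it is a plain number of seconds.
        if retry_after is not None and retry_after.strip().isdigit():
            return "retry", float(retry_after.strip())
        return "retry", None
    if status >= 500:
        return "retry", None
    return "fail", None   # remaining 3xx/4xx: treated as permanent in this sketch
```

When a delay override is returned, the worker would use it in place of the computed backoff for that attempt; otherwise the normal backoff schedule applies.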

8. Security

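Sign every payload: send X-Signature = HMAC_SHA256(secret, timestamp + "." + body) together with the timestamp in a header, so consumers can verify authenticity and integrity. Because the timestamp is covered by the signature, consumers can reject requests older than a few minutes to limit replay.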
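HTTPS only; validate callback URLs on registration to block SSRF (no private, loopback, or link-local IP ranges, no cloud metadata endpoints), and re-check the resolved IP at delivery time, since DNS can change after registration.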
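
A minimal sketch of both checks using only the Python standard library. Signing over `"<timestamp>.<body>"`, the 300-second tolerance, and the function names are assumptions for illustration. `is_safe_callback` expects the caller to pass in the IP addresses the hostname resolved to, so DNS lookup is left out of the sketch.

```python
import hashlib
import hmac
import ipaddress
from urllib.parse import urlparse


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over "<timestamp>.<body>" so the timestamp cannot be swapped."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: int, tolerance_s: int = 300) -> bool:
    """Consumer-side check: reject stale timestamps, then compare in constant time."""
    if abs(now - timestamp) > tolerance_s:
        return False
    return hmac.compare_digest(sign(secret, timestamp, body), signature)


def is_safe_callback(url: str, resolved_ips: list[str]) -> bool:
    """Require HTTPS and reject any resolved address that is not globally routable."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or not parsed.hostname:
        return False
    return bool(resolved_ips) and all(
        ipaddress.ip_address(ip).is_global for ip in resolved_ips
    )
```

`hmac.compare_digest` avoids leaking how many leading characters of the signature matched. The `is_global` check rejects loopback, private, and link-local addresses, which includes the 169.254.169.254 metadata address used by several cloud providers.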

9. Scaling & Observability

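Partition the queue by webhook_id for per-webhook ordering and cross-webhook parallelism; autoscale stateless delivery workers on queue depth. Ordering is best-effort once retries are involved, since a retried event can arrive after later events for the same webhook.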
Tier delivery logs: hot (recent, queryable) → cold (object storage) via TTL.
Metrics: delivery success rate, p99 latency, attempts distribution, DLQ size, per-endpoint failure rate (drives circuit breaking).
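
The per-endpoint failure rate can drive a simple circuit breaker. The sketch below opens after a number of consecutive failures and allows traffic again after a cooldown; the threshold, cooldown, and the `EndpointBreaker` name are illustrative assumptions, and time is passed in explicitly to keep the logic easy to test.

```python
class EndpointBreaker:
    """Per-endpoint circuit breaker: open after N consecutive failures, probe after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = {}      # endpoint -> consecutive failure count
        self.opened_at = {}     # endpoint -> time the breaker (re)opened

    def allow(self, endpoint: str, now: float) -> bool:
        opened = self.opened_at.get(endpoint)
        if opened is None:
            return True
        # After the cooldown, let requests through again as probes (half-open).
        return now - opened >= self.cooldown_s

    def record(self, endpoint: str, ok: bool, now: float) -> None:
        if ok:
            # Any success closes the breaker and resets the count.
            self.failures.pop(endpoint, None)
            self.opened_at.pop(endpoint, None)
            return
        count = self.failures.get(endpoint, 0) + 1
        self.failures[endpoint] = count
        if count >= self.failure_threshold:
            self.opened_at[endpoint] = now   # open, or re-open after a failed probe
```

While `allow` returns False, a worker would push the delivery back onto the delay queue instead of calling the endpoint, so one dead consumer does not tie up worker slots. A stricter version would admit only a single probe during the half-open phase.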

10. Summary

Concern	Decision
Accept vs deliver	Decouple via durable queue; `202 Accepted`
Retries	Exponential backoff + jitter, delay queue, max attempts → DLQ
Retry classification	`5xx`/timeout retry; `4xx` fail fast; honor `429 Retry-After`
Caching	Active webhook configs in Redis (write-through)
Idempotency	Stable event id header; at-least-once
Security	HMAC-signed payloads, HTTPS, SSRF-safe callback URL validation
Logs	Cursor pagination, indexed, TTL to cold storage
