Design a highly scalable webhook delivery system that allows users to register callback URLs for specific events. When an event occurs, the system must reliably deliver HTTP POST requests to all registered webhooks with the event payload. The system must handle up to 1 billion events per day while ensuring reliable delivery, retry logic, security, and observability.
The content below focuses on real interview experiences from candidates who interviewed at OpenAI, showing what actually gets discussed and emphasized during the interview.
"The objective is to design a Webhook service which allows users to register their callback address and an eventId. Whenever eventId is triggered, the system should call the registered callback address with a specific payload. It is safe to assume that each eventId → callback address is unique and one eventId will trigger only one callback address. Assume a highly scalable webhook delivery system which can handle up to 1B events per day. Webhook service that serves URL, questions around caching, db design, focus on failure and retry mechanism in message queue. Implement the REST service with JSON body and query for GET and POST."
What the interviewer emphasized:
REST API Design
How to structure webhook registration endpoints (POST /webhooks)
How to implement GET endpoints to retrieve webhook status and delivery logs
JSON request/response format design
Query parameters for filtering and pagination
Caching Strategy
Where to apply caching in the system (webhook configurations vs. delivery logs)
What to cache: Active webhook registrations for fast event fan-out
Cache invalidation strategy when webhooks are updated/deleted
Trade-offs: Freshness vs. performance
Database Design
Schema for webhook registrations table (event_id, callback_url, user_id, is_active)
Schema for delivery logs/audit trail (delivery_id, webhook_id, status, attempt_count, timestamps)
Indexing strategy for efficient queries
How to handle the uniqueness constraint: one eventId → one callback per user
Failure and Retry Mechanism in Message Queue (Primary focus area)
How to implement retry logic using message queue features
Exponential backoff timing strategy
Which HTTP status codes should trigger retries (5xx, timeouts) vs. permanent failures (4xx)
Dead letter queue (DLQ) for messages that exceed max retries
How to track retry attempts (in message metadata vs. database)
Preventing retry storms when many webhooks fail simultaneously
Key insight: The interviewer focused heavily on the practical implementation details—not just high-level architecture, but actual REST endpoints, caching patterns, and how to use message queue primitives (visibility timeout, DLQ, message attributes) to implement robust retry logic. The simplified assumption (1:1 mapping of eventId to callback) meant less focus on fan-out complexity, but more depth on reliability mechanisms.
AI-Generated Solution
Reliable HTTP delivery of events to user-registered callback URLs at 1B events/day (~12K events/sec average; plan for 5–10× peak). The interview emphasis is on REST API design, caching, DB design, and the retry mechanism.
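The throughput figures above follow from simple arithmetic. A quick sketch of the calculation (the function name is illustrative, not from the original):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(events_per_day):
    """Average events per second for a given daily volume."""
    return events_per_day / SECONDS_PER_DAY


avg = average_rate(1_000_000_000)
peak_low, peak_high = 5 * avg, 10 * avg  # the 5-10x peak planning range

print(f"average: {avg:,.0f} events/sec")
print(f"peak:    {peak_low:,.0f} to {peak_high:,.0f} events/sec")
```

This gives roughly 11.6K events/sec on average and a planning range of about 58K to 116K events/sec at peak.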
Functional
Register a callback URL for an event (event_id → callback_url); assume a unique 1:1 mapping per the prompt.
Non-functional

Decouple accept from deliver with a durable queue. Accepting an event is a fast, durable write; delivery (slow, retried) happens asynchronously in workers.
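A minimal in-process sketch of this split, assuming Python's `queue.Queue` as a stand-in for a durable queue (a real system would use a persistent broker). The names `accept_event` and `delivery_worker` and the response body shape are hypothetical:

```python
import queue
import threading
import uuid

event_queue = queue.Queue()  # stand-in for a durable message queue
delivered = []               # records what the worker "delivered"


def accept_event(event_id, payload):
    """Fast path: enqueue and acknowledge without calling the callback URL."""
    event_uid = str(uuid.uuid4())
    event_queue.put({"event_uid": event_uid, "event_id": event_id, "payload": payload})
    return 202, {"event_uid": event_uid, "status": "queued"}


def delivery_worker():
    """Slow path: drain the queue; a real worker would POST to the callback here."""
    while True:
        message = event_queue.get()
        if message is None:  # sentinel used to stop the worker
            event_queue.task_done()
            break
        delivered.append(message["event_uid"])  # placeholder for the HTTP POST
        event_queue.task_done()


worker = threading.Thread(target=delivery_worker, daemon=True)
worker.start()
status, body = accept_event("order.created", {"order": 42})
event_queue.join()     # wait until the worker has processed the event
event_queue.put(None)  # stop the worker
worker.join()
```

The caller gets its 202 as soon as the message is enqueued; how long the callback takes, or how often it is retried, never affects the accept latency.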
POST /v1/webhooks { "event_id": "...", "callback_url": "https://...", "secret": "..." }
GET /v1/webhooks/{id} -> config + status
DELETE /v1/webhooks/{id}
GET /v1/webhooks/{id}/deliveries?status=failed&cursor=...&limit=50 -> delivery log (paginated)
POST /v1/events { "event_id": "...", "payload": {...} } -> 202 Accepted (queued)
Use cursor-based pagination for delivery logs (offset pagination is too expensive at this volume).
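One way to sketch cursor (keyset) pagination, using an in-memory list in place of the deliveries table. The cursor format (base64-encoded JSON of the last row's `created_at` and `id`) and the function names are assumptions for illustration:

```python
import base64
import json


def encode_cursor(created_at, delivery_id):
    """Pack the sort key of the last returned row into an opaque token."""
    raw = json.dumps([created_at, delivery_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor):
    created_at, delivery_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return created_at, delivery_id


def list_deliveries(rows, status=None, cursor=None, limit=50):
    """Return (page, next_cursor); rows are dicts with id, status, created_at."""
    ordered = sorted(rows, key=lambda r: (r["created_at"], r["id"]), reverse=True)
    if status is not None:
        ordered = [r for r in ordered if r["status"] == status]
    if cursor is not None:
        last_key = decode_cursor(cursor)
        # Keyset condition: only rows strictly older than the last row returned.
        ordered = [r for r in ordered if (r["created_at"], r["id"]) < last_key]
    page = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = encode_cursor(page[-1]["created_at"], page[-1]["id"]) if has_more else None
    return page, next_cursor


rows = [
    {"id": f"d{i}", "status": "failed", "created_at": f"2024-01-01T00:00:0{i}Z"}
    for i in range(5)
]
page1, cursor1 = list_deliveries(rows, status="failed", limit=2)
page2, cursor2 = list_deliveries(rows, status="failed", cursor=cursor1, limit=2)
page3, cursor3 = list_deliveries(rows, status="failed", cursor=cursor2, limit=2)
```

Against a real database the same idea becomes a `WHERE (created_at, id) < (:created_at, :id) ORDER BY created_at DESC, id DESC LIMIT :n` query, so each page costs an index seek rather than a scan over all skipped rows.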
CREATE TABLE webhooks (
    id           UUID PRIMARY KEY,
    event_id     TEXT UNIQUE,               -- 1:1 mapping per problem statement
    callback_url TEXT NOT NULL,
    secret       TEXT NOT NULL,             -- for HMAC signing
    is_active    BOOLEAN DEFAULT true,
    created_at   TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE deliveries (
    id            UUID PRIMARY KEY,
    webhook_id    UUID,
    event_uid     UUID,                     -- dedupe key, also sent as header for consumer idempotency
    status        TEXT,                     -- PENDING|SUCCESS|FAILED|DEAD
    attempt_count INT DEFAULT 0,
    next_retry_at TIMESTAMPTZ,
    last_status   INT,                      -- last HTTP code
    created_at    TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX idx_deliveries_lookup ON deliveries (webhook_id, status, created_at);
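To see how the `UNIQUE` constraint on `event_id` enforces the 1:1 mapping, here is a small sketch using SQLite in memory. It is an adaptation, not the schema above verbatim: SQLite has no `UUID` or `TIMESTAMPTZ` types, so `TEXT` and `INTEGER` are substituted. The `register_webhook` helper and the 201/409 status codes are assumptions:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE webhooks (
        id           TEXT PRIMARY KEY,
        event_id     TEXT UNIQUE,      -- 1:1 mapping per problem statement
        callback_url TEXT NOT NULL,
        secret       TEXT NOT NULL,
        is_active    INTEGER DEFAULT 1
    )
    """
)


def register_webhook(event_id, callback_url, secret):
    """Insert a registration; return (201, id), or (409, None) if event_id is taken."""
    webhook_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO webhooks (id, event_id, callback_url, secret) "
                "VALUES (?, ?, ?, ?)",
                (webhook_id, event_id, callback_url, secret),
            )
    except sqlite3.IntegrityError:
        return 409, None  # the database rejected a duplicate event_id
    return 201, webhook_id


first = register_webhook("order.created", "https://example.com/a", "s1")
second = register_webhook("order.created", "https://example.com/b", "s2")
row_count = conn.execute("SELECT COUNT(*) FROM webhooks").fetchone()[0]
```

Letting the database enforce uniqueness avoids the race that a check-then-insert in application code would have under concurrent registrations.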
Webhook configs (the event_id → callback_url + secret map) are read on every event, so cache them in Redis for fast fan-out; update or invalidate the cached entry on every update/delete (write-through). This is the hot read path the interviewer probes.
The retry mechanism is implemented with message-queue primitives:
Transient: 5xx, connection timeout, DNS failure → retry. Permanent: 4xx (except 429) → fail fast (don't retry a malformed/forbidden endpoint). 429 → honor Retry-After.
Backoff schedule: 1s, 5s, 30s, 5m, 30m, 2h … up to a max (e.g. 6–8 attempts), with random jitter to avoid synchronized retry storms.
Retry state: track attempt_count and next_retry_at on the delivery; a scheduler re-enqueues due retries.
Dead-lettering: once max attempts are exceeded, move the delivery to the DLQ, mark it DEAD, alert, and expose it for manual replay.
Idempotency: send a stable X-Webhook-Id/event_uid header so consumers can dedupe (delivery is at-least-once).
def schedule_retry(delivery):
    # MAX_ATTEMPTS, BASE, MAX_BACKOFF and the helpers used below are defined elsewhere.
    delivery.attempt_count += 1
    if delivery.attempt_count > MAX_ATTEMPTS or permanent(delivery.last_status):
        dead_letter(delivery)  # -> DLQ, alert
        return
    # Exponential backoff, capped at MAX_BACKOFF, plus jitter to spread retries out.
    backoff = min(BASE * 2 ** delivery.attempt_count, MAX_BACKOFF)
    delivery.next_retry_at = now() + backoff + random_jitter()
    delay_queue.put(delivery, eta=delivery.next_retry_at)
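The classification and jitter rules in this section could be factored into two small helpers. This is a sketch under stated assumptions: the constants, the function names, the choice of "full jitter" (a uniform delay between 0 and the exponential ceiling), treating `Retry-After` as a number of seconds, and treating any status outside 2xx/429/5xx as permanent are all illustrative rather than taken from the original:

```python
import random

BASE_SECONDS = 1.0          # assumed base delay
MAX_BACKOFF_SECONDS = 7200  # assumed cap: 2h, the largest delay listed above


def classify(status_code, retry_after=None):
    """Map a delivery outcome to (decision, retry_after_hint).

    decision is "success", "retry", or "dead". status_code is None for
    network-level failures such as a connection timeout or DNS failure.
    """
    if status_code is None:
        return "retry", None
    if 200 <= status_code < 300:
        return "success", None
    if status_code == 429:
        return "retry", retry_after  # honor the server's Retry-After if present
    if 500 <= status_code < 600:
        return "retry", None
    return "dead", None  # remaining 4xx (and any other code): permanent failure


def backoff_with_jitter(attempt, retry_after=None, rng=random):
    """Delay in seconds before the given retry; attempt is 1 for the first retry."""
    if retry_after is not None:
        return float(retry_after)
    ceiling = min(BASE_SECONDS * 2 ** (attempt - 1), MAX_BACKOFF_SECONDS)
    return rng.uniform(0, ceiling)  # full jitter: anywhere in [0, ceiling]


decision, hint = classify(503)
delay = backoff_with_jitter(attempt=3)  # somewhere between 0 and 4 seconds
```

Randomizing over the whole interval, rather than adding a small offset to a fixed delay, spreads retries out when many deliveries to the same endpoint fail at the same moment.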
Signing: send X-Signature = HMAC_SHA256(secret, body) so consumers can verify authenticity and integrity. Include a timestamp in the signed content to prevent replay.
Scaling: partition the queue by webhook_id for ordering + parallelism; autoscale stateless delivery workers on queue depth.
| Concern | Decision |
|---|---|
| Accept vs deliver | Decouple via durable queue; 202 Accepted |
| Retries | Exponential backoff + jitter, delay queue, max attempts → DLQ |
| Retry classification | 5xx/timeout retry; 4xx fail fast; honor 429 Retry-After |
| Caching | Active webhook configs in Redis (write-through) |
| Idempotency | Stable event id header; at-least-once |
| Security | HMAC-signed payloads, HTTPS, SSRF allowlist |
| Logs | Cursor pagination, indexed, TTL to cold storage |
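The HMAC signing decision in the table can be sketched with Python's standard `hmac` module. The signed-message layout (`<timestamp>.<body>`), the five-minute tolerance, and the function names are assumptions for illustration; the original specifies only HMAC-SHA256 over the body plus a timestamp:

```python
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 300  # assumed replay window


def sign(secret, body, timestamp):
    """Sender side: hex HMAC-SHA256 over b"<timestamp>.<body>"."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret, body, timestamp, signature, now=None):
    """Consumer side: require a fresh timestamp and a matching signature."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # too old, or too far in the future: possible replay
    expected = sign(secret, body, timestamp)
    return hmac.compare_digest(expected, signature)  # constant-time comparison


secret = b"whsec_example"
body = b'{"event_id": "order.created"}'
signature = sign(secret, body, timestamp=1_700_000_000)
ok = verify(secret, body, 1_700_000_000, signature, now=1_700_000_010)
```

Because the timestamp is part of the signed message, an attacker who captures a request cannot simply update the timestamp header to replay it later; the signature would no longer match.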