Inference API System Design

System Design | Onsite | Phone | Software Engineer | Reported May, 2026 | High Frequency

Problem Statement

Design a high-concurrency inference API system that can handle massive concurrent requests efficiently. The inference API endpoint is provided and cannot be modified—your focus is on designing the surrounding infrastructure, particularly the batch service that manages requests to GPU workers.

Given Constraints

Fixed inference API: You are given an inference API that you cannot modify

Client-facing synchronous requests: Clients make synchronous HTTP requests and wait for responses, but internal processing can be asynchronous through queues

Near real-time latency requirement: Despite internal asynchronous processing, responses must be fast enough to feel near real-time (typically < 500ms-1s)

Traffic requirements: Must handle high concurrent requests with predictable latency

GPU resources: Limited GPU resources that need efficient utilization

Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.

Phase 1: Requirements

Functional Requirements

Submit inference requests: Users can submit prompts and receive model responses

Request batching: Aggregate individual requests into batches for efficient GPU processing

Multi-tier support: Handle requests from different user tiers (free, paid, enterprise) with different priorities

Rate limiting: Protect system from overload with tiered rate limits

Non-Functional Requirements

Scale: 1,000 RPS initially, plan for 10,000+ RPS

Latency: P95 < 500ms end-to-end (including queue wait + inference)

Availability: 99.9% uptime, graceful degradation under load

Cost efficiency: Maximize GPU utilization (target 70-80%) while maintaining latency SLA

Capacity Estimation

Traffic:

Target: 1,000 RPS (plan for 10x growth to 10,000 RPS)
Peak: 3x average (3,000 RPS at peak)
Distribution: 70% free tier, 25% paid, 5% enterprise

GPU Throughput (given):

Batch size: 32 requests
Inference time per batch: 50ms
Batching delay (average): ~20ms
Data transfer overhead: ~5ms

Total time per batch: 50ms + 20ms + 5ms = 75ms

Batches per GPU per second: 1000ms / 75ms = 13.3

Requests per GPU per second: 13.3 × 32 = 426 RPS

GPUs needed (raw): 1000 / 426 = 2.35

GPUs needed (at 70% utilization): 2.35 / 0.7 ≈ 4 GPUs

For 10,000 RPS: 10,000 / 426 = 23.5 raw; 23.5 / 0.7 = 33.5, so 34 GPUs after rounding up

For 3x peak on that target (30,000 RPS): 3 × 34 ≈ 100 GPUs

Always calculate GPU needs at your target utilization (70-80%), not 100%. This provides buffer for traffic spikes and prevents latency from exploding as utilization approaches capacity.
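The arithmetic above can be checked with a short script. This is a minimal sketch that takes the given figures (batch size 32, 75ms per batch cycle, 70% target utilization) as defaults; the function name gpus_needed is an illustrative choice, not part of the original solution. It uses the unrounded 426.7 RPS per GPU, which gives the same GPU counts as the rounded 426 used above.

```python
import math

def gpus_needed(target_rps, batch_size=32, batch_cycle_ms=75, utilization=0.7):
    """Return (rps_per_gpu, gpu_count) for a target request rate."""
    batches_per_sec = 1000 / batch_cycle_ms      # ~13.3 batches per second
    rps_per_gpu = batches_per_sec * batch_size   # ~426.7 requests per second
    raw_gpus = target_rps / rps_per_gpu          # GPUs at 100% utilization
    return rps_per_gpu, math.ceil(raw_gpus / utilization)

# Average and 3x-peak load for the initial and the 10x target.
for rps in (1_000, 3_000, 10_000, 30_000):
    _, count = gpus_needed(rps)
    print(f"{rps:>6} RPS -> {count} GPUs")   # 4, 11, 34, 101
```

Note that sizing the initial 1,000 RPS target for its own 3x peak gives 11 GPUs rather than 4, which is worth stating explicitly in the interview.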

Phase 2: Data Model

Core Entities

-- Request tracking (PostgreSQL for durability)
Request {
  id: UUID (PK)
  user_id: UUID (FK)
  tier: Enum (free, paid, enterprise)
  model: String
  status: Enum (queued, processing, completed, failed)
  created_at: Timestamp
  completed_at: Timestamp
  latency_ms: Integer
}

-- GPU worker registry (Redis for real-time status)
-- Populated via heartbeats; NOT used for dispatch (GPUs self-balance via pull).
-- Purpose: feed healthy_gpu_count to the dynamic rate limiter (Phase 5)
-- and surface fleet health to ops dashboards.
GPUWorker {
  worker_id: String
  status: Enum (healthy, degraded, offline)
  current_utilization: Float
  in_flight: Integer           // 0 or 1 with single-batch-per-GPU workers
  last_heartbeat: Timestamp
}

-- Batch metadata (Batcher internal state)

Batch {
  batch_id: UUID
  request_ids: [UUID]
  created_at: Timestamp
  sent_to_gpu_at: Timestamp
  gpu_worker_id: String
}

Request Queue Structure (Redis)

Key: queue:{tier}  (e.g., queue:free, queue:paid, queue:enterprise)

Type: List (FIFO)

Key: inflight:{tier}:{batcher_id}
Type: List (in-flight requests for crash recovery)
Value: {
  request_id: "req_abc123",
  user_id: "user_456",
  input: "...",
  parameters: {...},
  enqueued_at: timestamp,
  timeout_at: timestamp
}

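Use separate queues per tier to enable priority processing: the batcher pulls from the enterprise queue first, then paid, then free. Producers LPUSH onto queue:{tier}; the batcher uses RPOPLPUSH (or the blocking BRPOPLPUSH) to atomically move each item into its inflight list, and a reaper returns stale inflight items to the queue. On Redis 6.2 and later, LMOVE and BLMOVE are the non-deprecated equivalents of these two commands.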
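The claim, ack, and reap cycle can be sketched without a Redis server. The class below is an in-memory stand-in, assuming one queue and one inflight collection per tier; ReliableQueue and its method names are hypothetical, and a real deployment would issue the Redis list commands named above instead.

```python
import time
from collections import deque

class ReliableQueue:
    """In-memory stand-in for queue:{tier} plus one inflight:{tier}:{batcher_id}."""

    def __init__(self):
        self.queue = deque()   # producers add on the left, consumers pop on the right
        self.inflight = {}     # request_id -> (request, claimed_at)

    def enqueue(self, request):
        self.queue.appendleft(request)            # LPUSH equivalent

    def claim(self, now=None):
        """Move one request from the queue into inflight (RPOPLPUSH equivalent)."""
        if not self.queue:
            return None
        now = time.monotonic() if now is None else now
        request = self.queue.pop()
        self.inflight[request["request_id"]] = (request, now)
        return request

    def ack(self, request_id):
        """Drop a request from inflight once its response has been published."""
        self.inflight.pop(request_id, None)

    def reap(self, max_age_s, now=None):
        """Return inflight requests older than max_age_s to the consumer end."""
        now = time.monotonic() if now is None else now
        stale = [rid for rid, (_, claimed) in self.inflight.items()
                 if now - claimed > max_age_s]
        for rid in stale:
            request, _ = self.inflight.pop(rid)
            self.queue.append(request)            # retried before newer work
        return len(stale)
```

If a batcher crashes after claiming but before acking, its requests stay in inflight until the reaper moves them back, so no request is silently lost.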

Phase 3: External API Contract (Fixed)

Protocol (given)

The protocol and public endpoints are fixed by the provided inference API and cannot be changed. The examples below assume HTTP/REST for illustration.

Given Endpoint (example)

POST /v1/inference

Request:

{
  "model": "claude-3",
  "input": "What is the capital of France?",
  "parameters": {
    "temperature": 0.7,
    "max_tokens": 100
  }
}

Response:

{
  "request_id": "req_abc123",
  "output": "The capital of France is Paris.",
  "metadata": {
    "tokens_used": 15,
    "latency_ms": 127,
    "model_version": "claude-3-2024"
  }
}

Clients send a single request and wait for the response on the same connection. If the fixed API already exposes a status or polling endpoint, you can use it for client retries; otherwise treat request_id as a trace id for internal debugging only.

Internal/Ops Endpoints

GET /internal/health

Response:

{
  "status": "healthy",
  "gpu_utilization": 0.72,
  "queue_depth": 15,
  "requests_per_second": 850
}

Optional debug (not part of the public contract):
GET /internal/requests/{request_id}

Response:

{
  "status": "completed",  // queued, processing, completed, failed
  "latency_ms": 127,
  "error": null
}

Authentication

API key-based: Authorization: Bearer <api_key>

Rate limits tied to API key and user tier

Internal APIs use service tokens
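One common way to tie rate limits to an API key and tier is a token bucket per key. The sketch below is an in-process illustration only: the per-tier rates and burst sizes in TIER_LIMITS are hypothetical numbers, and in this design the counters would live in Redis (the Gateway's rate-limit check is a Redis lookup) rather than in a Python dict.

```python
import time

# Hypothetical limits per API key: (refill rate in requests/second, burst capacity).
TIER_LIMITS = {"free": (5, 10), "paid": (50, 100), "enterprise": (500, 1000)}

class TokenBucket:
    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)   # start full so short bursts are allowed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, never beyond the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class TieredRateLimiter:
    def __init__(self):
        self.buckets = {}   # api_key -> TokenBucket

    def allow(self, api_key, tier, now=None):
        bucket = self.buckets.get(api_key)
        if bucket is None:
            rate, capacity = TIER_LIMITS[tier]
            bucket = self.buckets[api_key] = TokenBucket(rate, capacity, now)
        return bucket.allow(now)
```

With these example numbers, a free-tier key can burst 10 requests at once and then sustains 5 requests per second.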

Phase 4: High-Level Design

Architecture Diagram

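Layers and components (top to bottom):

Clients: Client 1, Client 2, ... Client N
Edge Layer: Load Balancer
Application Layer: API Gateway (Rate Limiter)
Queue Layer: Enterprise Queue, Paid Queue, Free Queue, Batch Queue
Batching Layer: Request Batcher
GPU Cluster: GPU Worker 1, GPU Worker 2, ... GPU Worker N
Data Layer: Redis (Queues + State + Pub/Sub), PostgreSQL (Persistence)

Edges: API Gateway → tier queues (enqueue); Request Batcher → tier queues (poll priority); Request Batcher → Batch Queue (push batch); GPU Workers → Batch Queue (BLPOP); GPU Workers → Redis (publish responses); Redis → API Gateway (notify gateway)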

Request Flow

Client → Load Balancer (2ms): Client sends POST /v1/inference

Load Balancer → API Gateway (1ms): Routes to healthy gateway instance

API Gateway (5ms):

Authenticates API key

Checks rate limit (Redis lookup)

Determines user tier

Writes request to appropriate tier queue

Stores request_id -> gateway_instance + response_channel in Redis

Keeps HTTP request open with async I/O (timeout)

Request waits in queue (0-40ms): Average ~20ms for uniform traffic

Batch dispatch and inference (55ms):

Batcher polls tier queues in priority order and forms batch

Batcher pushes batch onto shared batch_queue (Redis List): ~1ms

Idle GPU worker claims batch via BLPOP: ~4ms

GPU inference: ~50ms

Response path (8ms):

GPU Worker publishes each response to the owning Gateway's pub/sub channel (responses:{gateway_id}), with request_id in the payload

Gateway receives response (pub/sub) and resolves the pending connection by request_id

Returns response to waiting client

Total latency: ~70-110ms from the step budgets above (about 90ms at the average queue wait), well under the 500ms SLA

Connection routing is critical: Keep the live socket in the Gateway process (pending map). Store request_id -> gateway_instance + response_channel in Redis with a TTL matching your request timeout. When the GPU worker publishes results, only the owning Gateway delivers the response.
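The pending map can be sketched with asyncio futures. This is an illustrative sketch, not the original implementation: PendingResponses and its methods are hypothetical names, and the pub/sub listener that would call resolve() is replaced by a direct call in the demo.

```python
import asyncio

class PendingResponses:
    """Per-gateway map of request_id -> Future for the open client connection."""

    def __init__(self):
        self._pending = {}

    async def wait_for(self, request_id, timeout_s):
        """Called by the HTTP handler; returns the output or raises on timeout."""
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        try:
            return await asyncio.wait_for(future, timeout_s)
        finally:
            # Always clean up, so late responses for timed-out requests are dropped.
            self._pending.pop(request_id, None)

    def resolve(self, request_id, output):
        """Called by the pub/sub listener; returns False for unknown or expired ids."""
        future = self._pending.get(request_id)
        if future is None or future.done():
            return False
        future.set_result(output)
        return True

async def demo():
    pending = PendingResponses()
    waiter = asyncio.create_task(pending.wait_for("req_abc123", timeout_s=1.0))
    await asyncio.sleep(0)   # let the handler register its future
    delivered = pending.resolve("req_abc123", "The capital of France is Paris.")
    return delivered, await waiter
```

Because the future lives only in the Gateway process that accepted the connection, a response published to any other Gateway's channel could never be delivered, which is why the request_id to gateway mapping matters.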

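Sequence diagram (participants: Client, API Gateway, Redis, Batcher, GPU Worker):

Client → API Gateway: POST /v1/inference (Gateway holds connection open)
API Gateway: Auth + Rate Limit
API Gateway → Redis: Enqueue request
API Gateway → Redis: Store request_id -> gateway instance + response channel
Loop, batching (every ~40ms):
Batcher → Redis: Poll tier queues (priority order)
Redis → Batcher: Return requests
Batcher → Redis: Push batch to batch_queue
GPU Worker → Redis: BLPOP batch_queue
Redis → GPU Worker: Return batch (32 requests)
GPU Worker: Run inference (~50ms)
GPU Worker → Redis: Publish each response to responses:{gateway_id}
Redis → API Gateway: Notify (pub/sub)
API Gateway: Resolve pending connection by request_id
API Gateway → Client: Return response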

Component Deep Dive

Request Batcher

The batcher is the core component balancing throughput vs latency:

import asyncio
import time

TIER_PRIORITY = ('enterprise', 'paid', 'free')

class RequestBatcher:
    def __init__(self, redis_client, consumer_id):
        # redis_client: an asyncio Redis client (e.g. redis.asyncio.Redis from redis-py)
        self.redis = redis_client
        self.consumer_id = consumer_id
        self.batch_size = 32
        self.timeout_ms = 40  # Max wait time before sending a partial batch

    async def collect_batch(self):
        batch = []
        deadline = time.monotonic() + self.timeout_ms / 1000

        while len(batch) < self.batch_size and time.monotonic() < deadline:
            # Pull from queues in priority order
            found_request = False
            for tier in TIER_PRIORITY:
                queue_key = f'queue:{tier}'
                inflight_key = f'inflight:{tier}:{self.consumer_id}'
                # Atomically move to inflight for crash recovery
                request = await self.redis.rpoplpush(queue_key, inflight_key)
                if request:
                    batch.append(request)
                    found_request = True
                    # Rescan from the top tier so higher tiers always drain
                    # first (a sustained backlog there can starve lower tiers).
                    break

            # Small sleep if no requests found this iteration
            if not found_request:
                await asyncio.sleep(0.001)  # 1ms - avoid busy spinning

        # Caller pushes batch to batch_queue (RPUSH, so BLPOP workers see
        # FIFO order) and removes from inflight after the GPU publishes an
        # ack for every request_id.
        return batch

Batching trade-off: Larger batches = better GPU utilization but higher latency. Timeout-based batching (send when full OR after 40ms) balances both—full utilization during high traffic, acceptable latency during low traffic.
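The trade-off can be made concrete with a small offline simulation. This is a sketch under simplifying assumptions: arrivals are given as sorted millisecond timestamps, and the timeout clock starts at the first request of each batch (the batcher above starts it when collection begins). form_batches and mean_wait_ms are illustrative names.

```python
def form_batches(arrivals_ms, batch_size=32, timeout_ms=40):
    """Group sorted arrival timestamps into batches.

    A batch is flushed when it reaches batch_size, or when timeout_ms has
    passed since its first request arrived, whichever comes first.
    Returns a list of (flush_time_ms, [arrival_ms, ...]).
    """
    batches, current = [], []
    for t in arrivals_ms:
        if current and t - current[0] >= timeout_ms:
            batches.append((current[0] + timeout_ms, current))   # timeout flush
            current = []
        current.append(t)
        if len(current) == batch_size:
            batches.append((t, current))                         # size flush
            current = []
    if current:
        batches.append((current[0] + timeout_ms, current))
    return batches

def mean_wait_ms(batches):
    """Average time a request spends waiting for its batch to be flushed."""
    waits = [flush - t for flush, batch in batches for t in batch]
    return sum(waits) / len(waits)

high_traffic = list(range(64))           # one request per ms (1,000 RPS)
low_traffic = list(range(0, 80, 10))     # one request per 10 ms (100 RPS)
print(mean_wait_ms(form_batches(high_traffic)))   # 15.5: batches fill in 32 ms
print(mean_wait_ms(form_batches(low_traffic)))    # 25.0: timeout sends batches of 4
```

At 1,000 RPS every batch is full and the timeout never fires; at 100 RPS the timeout caps the wait at 40ms but each batch carries only 4 requests, so GPU efficiency drops.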

GPU Dispatch

With pull-based dispatch, GPU workers self-balance: whichever worker is idle claims the next batch. There is no centralized scheduler, no stale queue_depth view, and no thundering herd when multiple batchers are running. Workers block on the shared batch_queue via BLPOP and publish responses directly to Redis pub/sub.

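import json
import time
from types import SimpleNamespace

def gpu_worker_loop(worker_id):
    # `redis` is a synchronous redis-py client; one process per GPU runs this loop.
    while True:
        # Atomic claim across workers. The 1s timeout (instead of blocking
        # forever) lets an idle worker keep refreshing its heartbeat.
        item = redis.blpop('batch_queue', timeout=1)
        if item is not None:
            _key, payload = item  # BLPOP returns a (key, value) pair
            # Assumes batches are stored as JSON; objects become attribute bags.
            batch = json.loads(payload, object_hook=lambda d: SimpleNamespace(**d))
            try:
                outputs = run_inference([r.input for r in batch.requests])  # ~50ms
                # A batch may contain requests from many Gateway instances,
                # so we publish per-request to the owning Gateway's channel.
                # The gateway_id was stamped onto each request at enqueue time.
                for req, output in zip(batch.requests, outputs):
                    redis.publish(f'responses:{req.gateway_id}', json.dumps({
                        'request_id': req.id,
                        'output': output,
                    }))
            except Exception as e:
                # Requeue for another worker; DLQ after retry budget exhausted.
                # See "Request Timeouts and Retries" below for handle_gpu_failure.
                handle_gpu_failure(batch, e)

        redis.hset(f'gpu:{worker_id}', 'last_heartbeat', time.time())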

Why pull, not push? "Least-loaded" push uses queue_depth as a lagging signal. With multiple batcher instances, they can all pick the same "least loaded" GPU at the same tick and create a thundering herd. Pull (BLPOP) is atomic and race-free by construction. Push only earns its complexity when you need global scheduling intelligence (heterogeneous GPU pools, model affinity, canary rollouts), which this design does not require.

The GPU registry still exists (see Phase 2) but serves a different master: it feeds healthy_gpu_count into the dynamic rate limiter's capacity calculation and powers ops dashboards. Dispatch does not consult it.
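Deriving healthy_gpu_count from the registry can be sketched as a filter over heartbeat records. The field names follow the GPUWorker entity from Phase 2; the 5-second staleness threshold and the function name are assumptions made for illustration.

```python
def healthy_gpu_count(workers, now, max_heartbeat_age_s=5.0):
    """Count workers that report healthy and have a recent heartbeat.

    workers: iterable of dicts with 'status' and 'last_heartbeat' (epoch seconds).
    A worker whose heartbeat is older than max_heartbeat_age_s is treated as
    offline even if its last reported status was 'healthy'.
    """
    return sum(
        1
        for w in workers
        if w["status"] == "healthy"
        and now - w["last_heartbeat"] <= max_heartbeat_age_s
    )

fleet = [
    {"worker_id": "gpu-1", "status": "healthy", "last_heartbeat": 99.0},
    {"worker_id": "gpu-2", "status": "healthy", "last_heartbeat": 90.0},   # stale
    {"worker_id": "gpu-3", "status": "degraded", "last_heartbeat": 100.0},
]
print(healthy_gpu_count(fleet, now=100.0))   # 1
```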

Phase 5: Scaling & Trade-offs

Deep Dive: Dynamic Rate Limiting

Problem: What happens when half your GPU cluster goes down?

The key insight is that rate limits must adjust dynamically based on available capacity:

class DynamicRateLimiter:
    def calculate_capacity(self):
        # redis.get returns a string/bytes value or None, so convert before doing math
        active_gpus = int(redis.get('healthy_gpu_count') or 0)
        rps_per_gpu = 426  # From estimation
        target_utilization = 0.7

        return active_gpus * rps_per_gpu * target_utilization

    def should_throttle(self, tier, current_rps):
        capacity = self.calculate_capacity()
        queue_depth = int(redis.get('total_queue_depth') or 0)

        # Throttling thresholds
        if queue_depth > 500:
            # Critical: Only accept enterprise
            return tier != 'enterprise'
        elif queue_depth > 100:
            # Warning: Only accept paid+
            return tier == 'free'
        elif current_rps > capacity:
            # At capacity: Throttle free tier
            return tier == 'free'

        return False

Feedback loop: GPU capacity information must flow back to the rate limiter. When GPUs fail, the system should tighten rate limits within seconds, not minutes. Use Redis pub/sub or a metrics system for this feedback.
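To put numbers on the "half the cluster goes down" question, the sketch below admits traffic tier by tier until the capacity budget is spent. It is a simplified model, not the limiter above: it admits a fraction of a tier rather than throttling the whole tier, the tier shares come from the Capacity Estimation section, and admitted_rps is a hypothetical helper.

```python
RPS_PER_GPU = 426            # from the capacity estimation
TARGET_UTILIZATION = 0.7
TIER_SHARE_PCT = {"enterprise": 5, "paid": 25, "free": 70}

def admitted_rps(total_rps, healthy_gpus):
    """Admit tiers in priority order until the capacity budget is used up."""
    remaining = healthy_gpus * RPS_PER_GPU * TARGET_UTILIZATION
    admitted = {}
    for tier in ("enterprise", "paid", "free"):
        demand = total_rps * TIER_SHARE_PCT[tier] / 100
        admitted[tier] = min(demand, max(remaining, 0.0))
        remaining -= admitted[tier]
    return admitted

print(admitted_rps(1_000, healthy_gpus=4))   # all tiers fully admitted
print(admitted_rps(1_000, healthy_gpus=2))   # free tier cut to roughly 296 RPS
```

With 4 GPUs the budget is about 1,193 RPS and nothing is shed. With 2 GPUs it falls to about 596 RPS, so enterprise and paid traffic (300 RPS) still fits and only the free tier is throttled.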

Deep Dive: Handling Traffic Spikes

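When traffic suddenly doubles, the 70% utilization target is what buys you time while new GPUs spin up: the spare ~30% absorbs the first part of the surge, and the queues and tiered throttling described above hold the rest. Running at 95% utilization leaves no buffer for spikes.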

Deep Dive: Request Timeouts and Retries

# Timeout strategy
# Note: These are hard caps; P95 target remains < 500ms.
TIMEOUTS = {
    'queue_age': 2000,      # Drop if in queue > 2s
    'gpu_processing': 3000,  # Terminate if GPU takes > 3s
    'total_request': 4000    # End-to-end timeout
}

# Retry strategy
# With pull-based dispatch, "retry on a different GPU" is automatic:
# the failed batch is simply pushed back onto batch_queue and the
# next idle worker (which may be a different one) claims it.
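import json

def handle_gpu_failure(batch, error):
    if batch.retry_count >= 2:
        return send_to_dlq(batch)

    batch.retry_count += 1
    # Re-enqueue for another worker to BLPOP. LPUSH targets the head that
    # BLPOP reads, so the already-delayed batch is claimed next.
    # Serialized as JSON; default=vars assumes batch is a plain attribute object.
    redis.lpush('batch_queue', json.dumps(batch, default=vars))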

Retry storms: Unlimited retries during an outage can amplify load 2-3x. Limit to 1-2 retries with exponential backoff, and use a dead letter queue for persistent failures.
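A bounded retry policy with exponential backoff can be sketched as two small functions. The base delay, cap, and use of full jitter are assumptions for illustration; the retry limit of 2 matches handle_gpu_failure above. In practice the delays also have to fit inside the total_request timeout.

```python
import random

def retry_delay_ms(retry_count, base_ms=50, cap_ms=500, rng=random.random):
    """Full-jitter backoff: a random delay below min(cap_ms, base_ms * 2**retry_count)."""
    ceiling = min(cap_ms, base_ms * (2 ** retry_count))
    return rng() * ceiling

def next_action(retry_count, max_retries=2):
    """Decide whether a failed batch is retried or sent to the dead letter queue."""
    return "dlq" if retry_count >= max_retries else "retry"

for attempt in range(3):
    print(attempt, next_action(attempt), round(retry_delay_ms(attempt), 1))
```

The jitter spreads retries out in time, so batches that failed together during an outage do not all return to the queue at the same instant.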

Auto-Scaling Triggers

Important rules:

Scale up aggressively, scale down conservatively

Cooldown period: 5-10 min between scaling operations

Remove at most 10-20% of fleet per scale-down
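These rules can be combined into a single decision function. This is a sketch under stated assumptions: the utilization thresholds (scale up above 80%, scale down below 50%) are hypothetical, since only the rules themselves are given here, and scaling_decision is an illustrative name. The cooldown (5 minutes) and scale-down cap (10% of the fleet) use the low end of the ranges above.

```python
import math

def scaling_decision(current_gpus, utilization, seconds_since_last_scale,
                     target=0.7, scale_up_above=0.8, scale_down_below=0.5,
                     cooldown_s=300, max_scale_down_fraction=0.1):
    """Return the desired GPU count given the current fleet utilization."""
    if seconds_since_last_scale < cooldown_s:
        return current_gpus                       # still cooling down
    # Fleet size that would bring utilization back to the target.
    desired = math.ceil(current_gpus * utilization / target)
    if utilization > scale_up_above:
        # Scale up aggressively: jump straight to the target size.
        return max(desired, current_gpus + 1)
    if utilization < scale_down_below:
        # Scale down conservatively: remove at most a fraction of the fleet.
        max_removed = max(1, math.floor(current_gpus * max_scale_down_fraction))
        return max(1, desired, current_gpus - max_removed)
    return current_gpus

print(scaling_decision(10, 0.95, 600))   # 14: add 4 GPUs in one step
print(scaling_decision(10, 0.30, 600))   # 9: remove only 1 GPU per step
```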

Trade-offs Discussion

Interview Checklist

Clarified functional requirements (batching, multi-tier, rate limiting)

Discussed scale (1K → 10K RPS) and latency SLA (500ms)

Calculated GPU capacity with utilization buffer

Explained batching strategy (timeout-based)

Designed priority queues for user tiers

Explained response routing (request_id -> gateway_id + pending map)

Covered dynamic rate limiting based on GPU capacity

Addressed traffic spike handling and auto-scaling

Discussed timeout and retry strategy

Identified single points of failure (batcher, GPU workers)

Summary
