Design a high-concurrency inference API system that can handle massive concurrent requests efficiently. The inference API endpoint is provided and cannot be modified—your focus is on designing the surrounding infrastructure, particularly the batch service that manages requests to GPU workers.
Fixed inference API: You are given an inference API that you cannot modify
Client-facing synchronous requests: Clients make synchronous HTTP requests and wait for responses, but internal processing can be asynchronous through queues
Near real-time latency requirement: Despite internal asynchronous processing, responses must be fast enough to feel near real-time (typically within 500ms to 1s)
Traffic requirements: Must handle high concurrent requests with predictable latency
GPU resources: Limited GPU resources that need efficient utilization
Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.
Submit inference requests: Users can submit prompts and receive model responses
Request batching: Aggregate individual requests into batches for efficient GPU processing
Multi-tier support: Handle requests from different user tiers (free, paid, enterprise) with different priorities
Rate limiting: Protect system from overload with tiered rate limits
Scale: 1,000 RPS initially, plan for 10,000+ RPS
Latency: P95 < 500ms end-to-end (including queue wait + inference)
Availability: 99.9% uptime, graceful degradation under load
Cost efficiency: Maximize GPU utilization (target 70-80%) while maintaining latency SLA
Traffic:
Target: 1,000 RPS (plan for 10x growth to 10,000 RPS)
Peak: 3x average (3,000 RPS at peak)
Distribution: 70% free tier, 25% paid, 5% enterprise
GPU Throughput (given):
Batch size: 32 requests
Inference time per batch: 50ms
Batching delay (average): ~20ms
Data transfer overhead: ~5ms
Total time per batch: 50ms + 20ms + 5ms = 75ms
Batches per GPU per second: 1000ms / 75ms = 13.3
Requests per GPU per second: 13.3 × 32 = 426 RPS
GPUs needed (raw): 1000 / 426 = 2.35
GPUs needed (at 70% utilization): 2.35 / 0.7 ≈ 4 GPUs
For 10,000 RPS: 10,000 / 426 = 23.5 GPUs raw; 23.5 / 0.7 ≈ 34 GPUs (rounded up for headroom)
For 3x peak on that target (30,000 RPS): 30,000 / 426 / 0.7 ≈ 100 GPUs
Always calculate GPU needs at your target utilization (70-80%), not 100%. This provides buffer for traffic spikes and prevents latency from exploding as utilization approaches capacity.
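The arithmetic above can be wrapped in a small helper so the estimate is easy to redo when an input changes. This is a minimal sketch: the function name and defaults are illustrative, and the 75ms per-batch time is the figure from the estimate above.

```python
import math


def gpus_needed(target_rps, batch_size=32, batch_time_ms=75.0, utilization=0.7):
    """Return the GPU count needed to serve target_rps at the given utilization."""
    batches_per_sec = 1000.0 / batch_time_ms      # ~13.3 batches/s per GPU
    rps_per_gpu = batches_per_sec * batch_size    # ~426 requests/s per GPU
    raw_gpus = target_rps / rps_per_gpu           # GPUs needed at 100% utilization
    return math.ceil(raw_gpus / utilization)      # add headroom, then round up


if __name__ == "__main__":
    for rps in (1_000, 10_000, 30_000):
        print(rps, "RPS ->", gpus_needed(rps), "GPUs")
```

With the defaults this gives 4, 34 and 101 GPUs for 1,000, 10,000 and 30,000 RPS, in line with the figures above once each result is rounded up.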
-- Request tracking (PostgreSQL for durability)
Request {
  id: UUID (PK)
  user_id: UUID (FK)
  tier: Enum (free, paid, enterprise)
  model: String
  status: Enum (queued, processing, completed, failed)
  created_at: Timestamp
  completed_at: Timestamp
  latency_ms: Integer
}

-- GPU worker registry (Redis for real-time status)
-- Populated via heartbeats; NOT used for dispatch (GPUs self-balance via pull).
-- Purpose: feed healthy_gpu_count to the dynamic rate limiter (Phase 5)
-- and surface fleet health to ops dashboards.
GPUWorker {
  worker_id: String
  status: Enum (healthy, degraded, offline)
  current_utilization: Float
  in_flight: Integer  -- 0 or 1 with single-batch-per-GPU workers
  last_heartbeat: Timestamp
}

-- Batch metadata (Batcher internal state)
Batch {
  batch_id: UUID
  request_ids: [UUID]
  created_at: Timestamp
  sent_to_gpu_at: Timestamp
  gpu_worker_id: String
}
Key: queue:{tier} (e.g., queue:free, queue:paid, queue:enterprise)
Type: List (FIFO)

Key: inflight:{tier}:{batcher_id}
Type: List (in-flight requests for crash recovery)

Value: {
  request_id: "req_abc123",
  user_id: "user_456",
  input: "...",
  parameters: {...},
  enqueued_at: timestamp,
  timeout_at: timestamp
}
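Use separate queues per tier to enable priority processing. The batcher can pull from the enterprise queue first, then paid, then free. The gateway enqueues with LPUSH; the batcher uses RPOPLPUSH or BRPOPLPUSH (LMOVE or BLMOVE on Redis 6.2+, where the older commands are deprecated) to atomically move each item into an inflight list, and a reaper returns stale inflight items to the queue.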
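The queue-plus-inflight pattern can be illustrated without a Redis server. The sketch below models one tier queue and one batcher's inflight list with in-memory deques; the class, the `claimed_at` field and the `max_inflight_s` threshold are hypothetical names introduced for illustration, and in Redis the claim step would be a single atomic command.

```python
from collections import deque


class ReliableQueue:
    """In-memory model of queue:{tier} plus inflight:{tier}:{batcher_id}."""

    def __init__(self):
        self.queue = deque()     # stands in for the tier queue
        self.inflight = deque()  # stands in for the batcher's inflight list

    def enqueue(self, request):
        # LPUSH: new requests enter at the head (left side).
        self.queue.appendleft(request)

    def claim(self, now):
        # RPOPLPUSH: take the oldest request from the tail and record it as
        # inflight. `claimed_at` is a hypothetical field used by the reaper.
        if not self.queue:
            return None
        request = self.queue.pop()
        request["claimed_at"] = now
        self.inflight.appendleft(request)
        return request

    def ack(self, request_id):
        # LREM equivalent: forget the request once its response was delivered.
        self.inflight = deque(
            r for r in self.inflight if r["request_id"] != request_id
        )

    def reap(self, now, max_inflight_s):
        # Return requests stuck in inflight too long to the tail of the queue,
        # so they are the next ones claimed.
        stale = [r for r in self.inflight if now - r["claimed_at"] > max_inflight_s]
        for r in stale:
            self.inflight.remove(r)
            self.queue.append(r)
        return len(stale)
```

If a batcher crashes after claiming a request but before acknowledging it, the request stays in the inflight list until the reaper moves it back, so it is retried rather than lost.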
The protocol and public endpoints are fixed by the provided inference API and cannot be changed. The examples below assume HTTP/REST for illustration.
POST /v1/inference

Request:
{
  "model": "claude-3",
  "input": "What is the capital of France?",
  "parameters": {
    "temperature": 0.7,
    "max_tokens": 100
  }
}

Response:
{
  "request_id": "req_abc123",
  "output": "The capital of France is Paris.",
  "metadata": {
    "tokens_used": 15,
    "latency_ms": 127,
    "model_version": "claude-3-2024"
  }
}
Clients send a single request and wait for the response on the same connection. If the fixed API already exposes a status or polling endpoint, you can use it for client retries; otherwise treat request_id as a trace id for internal debugging only.
GET /internal/health

Response:
{
  "status": "healthy",
  "gpu_utilization": 0.72,
  "queue_depth": 15,
  "requests_per_second": 850
}

Optional debug (not part of the public contract):

GET /internal/requests/{request_id}

Response:
{
  "status": "completed", // queued, processing, completed, failed
  "latency_ms": 127,
  "error": null
}
API key-based: Authorization: Bearer <api_key>
Rate limits tied to API key and user tier
Internal APIs use service tokens
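One common way to tie rate limits to an API key and tier is a token bucket per key. The sketch below is an assumption, not something the fixed API prescribes: the per-tier rates and burst sizes are made-up numbers, and the state lives in process memory, whereas this design would keep it in Redis so all gateway instances share it.

```python
import time

# Hypothetical per-tier limits as (requests per second, burst size).
TIER_LIMITS = {
    "free": (2.0, 5),
    "paid": (20.0, 40),
    "enterprise": (200.0, 400),
}


class TokenBucket:
    def __init__(self, rate, burst, now=None):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)  # start full so short bursts are allowed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class TieredRateLimiter:
    def __init__(self):
        self.buckets = {}  # api_key -> TokenBucket

    def allow(self, api_key, tier, now=None):
        bucket = self.buckets.get(api_key)
        if bucket is None:
            rate, burst = TIER_LIMITS[tier]
            bucket = self.buckets[api_key] = TokenBucket(rate, burst, now)
        return bucket.allow(now)
```

A gateway would call `limiter.allow(api_key, tier)` after authentication and return HTTP 429 when it yields `False`. The optional `now` argument exists only to make the behaviour deterministic in tests.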
Architecture diagram (components by layer):
Clients: Client 1, Client 2, ..., Client N
Edge Layer: Load Balancer
Application Layer: API Gateway (Rate Limiter)
Queue Layer: Enterprise Queue, Paid Queue, Free Queue, Batch Queue
Batching Layer: Request Batcher
GPU Cluster: GPU Worker 1, GPU Worker 2, ..., GPU Worker N
Data Layer: Redis (Queues + State + Pub/Sub), PostgreSQL (Persistence)

Architecture diagram (flows):
API Gateway → tier queues: enqueue
Request Batcher → tier queues: poll priority
Request Batcher → Batch Queue: push batch
GPU Workers → Batch Queue: BLPOP
GPU Workers → Redis: publish responses
Redis → API Gateway: notify gateway
Client → Load Balancer (2ms): Client sends POST /v1/inference
Load Balancer → API Gateway (1ms): Routes to healthy gateway instance
API Gateway (5ms):
  Authenticates API key
  Checks rate limit (Redis lookup)
  Determines user tier
  Writes request to appropriate tier queue
  Stores request_id -> gateway_instance + response_channel in Redis
  Keeps the HTTP request open with async I/O, bounded by the request timeout
Request waits in queue (0-40ms): Average ~20ms for uniform traffic
Batch dispatch and inference (55ms):
  Batcher polls tier queues in priority order and forms batch
  Batcher pushes batch onto shared batch_queue (Redis List): ~1ms
  Idle GPU worker claims batch via BLPOP: ~4ms
  GPU inference: ~50ms
Response path (8ms):
  GPU Worker publishes each response to the owning Gateway's pub/sub channel (responses:{gateway_id}), with request_id in the payload
  Gateway receives response (pub/sub) and resolves the pending connection by request_id
  Returns response to waiting client
Total latency: ~70-110ms, about 90ms with the average 20ms queue wait (well under the 500ms SLA)
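As a quick check, the per-stage figures from the walkthrough can be summed directly. The stage names below are illustrative labels; the numbers are the ones listed above, with the queue wait as the only stage that has a range.

```python
# Per-stage latency in milliseconds as (best case, worst case).
STAGES = {
    "client_to_load_balancer": (2, 2),
    "load_balancer_to_gateway": (1, 1),
    "api_gateway": (5, 5),
    "queue_wait": (0, 40),
    "dispatch_and_inference": (55, 55),
    "response_path": (8, 8),
}


def latency_budget(stages):
    """Return (best_ms, worst_ms) by summing each stage's bounds."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


if __name__ == "__main__":
    best, worst = latency_budget(STAGES)
    print(f"end-to-end: {best}-{worst} ms")
```

The stages add up to 71ms in the best case and 111ms in the worst, or 91ms with the average 20ms queue wait. That leaves most of the 500ms budget as margin for network jitter and retries.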
Connection routing is critical: Keep the live socket in the Gateway process (pending map). Store request_id -> gateway_instance + response_channel in Redis with a TTL matching your request timeout. When the GPU worker publishes results, only the owning Gateway delivers the response.
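The pending map inside a Gateway process could look roughly like the following. This is a sketch using asyncio futures: the class and method names are hypothetical, and `on_message` would be called by whatever pub/sub listener the Gateway runs on its `responses:{gateway_id}` channel.

```python
import asyncio
import json


class PendingResponses:
    """Maps request_id -> Future for connections held open by this gateway."""

    def __init__(self):
        self._pending = {}

    async def wait_for(self, request_id, timeout_s):
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        try:
            # Raises asyncio.TimeoutError if no response arrives in time.
            return await asyncio.wait_for(future, timeout_s)
        finally:
            # Always clean up, whether we got a result or timed out.
            self._pending.pop(request_id, None)

    def on_message(self, raw):
        """Resolve the waiting connection for one published response."""
        msg = json.loads(raw)
        future = self._pending.get(msg["request_id"])
        if future is not None and not future.done():
            future.set_result(msg["output"])
            return True
        return False  # unknown or already timed-out request: drop it


async def demo():
    pending = PendingResponses()
    waiter = asyncio.create_task(pending.wait_for("req_abc123", timeout_s=1.0))
    await asyncio.sleep(0)  # let the waiter register its future
    pending.on_message(json.dumps({"request_id": "req_abc123", "output": "Paris"}))
    return await waiter
```

Responses that arrive after the client has timed out find no entry in the map and are dropped, which is why the Redis mapping should expire on the same timeout.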
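Sequence diagram (participants: Client, API Gateway, Redis, Batcher, GPU Worker):
Client → API Gateway: POST /v1/inference
API Gateway: Auth + Rate Limit
API Gateway → Redis: Enqueue request
API Gateway → Redis: Store request_id -> gateway instance + response channel
Note: Gateway holds connection open
Loop, batching (every ~40ms):
  Batcher → Redis: Poll tier queues (priority order)
  Redis → Batcher: Return requests
  Batcher → Redis: Push batch to batch_queue
GPU Worker → Redis: BLPOP batch_queue
Redis → GPU Worker: Return batch (32 requests)
GPU Worker: Run inference (~50ms)
GPU Worker → Redis: Publish each response to responses:{gateway_id}
Redis → API Gateway: Notify (pub/sub)
API Gateway: Resolve pending connection by request_id
API Gateway → Client: Return response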
Request Batcher
The batcher is the core component balancing throughput vs latency:
import asyncio
import time

# `redis` is assumed to be an already-connected Redis client.

class RequestBatcher:
    def __init__(self, consumer_id):
        self.consumer_id = consumer_id
        self.batch_size = 32
        self.timeout_ms = 40  # Max wait before dispatching a partial batch

    async def collect_batch(self):
        batch = []
        start_time = time.monotonic()
        while len(batch) < self.batch_size:
            # Stop collecting once the batching window has elapsed
            if (time.monotonic() - start_time) * 1000 > self.timeout_ms:
                break
            # Pull from queues in priority order
            found_request = False
            for queue in ['enterprise', 'paid', 'free']:
                queue_key = f'queue:{queue}'
                inflight_key = f'inflight:{queue}:{self.consumer_id}'
                # Atomically move to inflight for crash recovery
                request = redis.rpoplpush(queue_key, inflight_key)
                if request:
                    batch.append(request)
                    found_request = True
                    # Restart from the enterprise queue, so a lower tier is
                    # only drained when every higher tier is empty
                    break
            # Small sleep if no requests found this iteration
            if not found_request:
                await asyncio.sleep(0.001)  # 1ms - avoid busy spinning
        # Caller pushes batch to the tail of batch_queue (RPUSH, so BLPOP
        # workers see batches oldest-first) and removes from inflight after
        # the GPU publishes an ack for every request_id.
        return batch
Batching trade-off: Larger batches = better GPU utilization but higher latency. Timeout-based batching (send when full OR after 40ms) balances both—full utilization during high traffic, acceptable latency during low traffic.
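To make the trade-off concrete, the sketch below estimates what a single batcher dispatches at a given arrival rate, assuming perfectly uniform arrivals. That is a simplification: real traffic is bursty, and with several batchers each one sees only its share of the traffic. The function name is illustrative.

```python
def batch_at_dispatch(arrival_rps, batch_size=32, timeout_ms=40.0):
    """Return (requests_in_batch, wait_ms) for one batcher under uniform arrivals.

    wait_ms is how long the first request in the batch waits before dispatch.
    """
    if arrival_rps <= 0:
        return 0, timeout_ms
    fill_time_ms = batch_size * 1000.0 / arrival_rps  # time to collect a full batch
    if fill_time_ms <= timeout_ms:
        return batch_size, fill_time_ms               # dispatched because it is full
    partial = int(arrival_rps * timeout_ms / 1000.0)  # dispatched on timeout
    return partial, timeout_ms


if __name__ == "__main__":
    for rps in (100, 500, 1000, 3000):
        size, wait = batch_at_dispatch(rps)
        print(f"{rps} RPS -> batch of {size} after {wait:.1f} ms")
```

At 1,000 RPS a batch fills in 32ms, so the 40ms timeout rarely fires and batches go out full. At 500 RPS the timeout fires first and batches carry about 20 requests: the GPU is used less efficiently, but no request waits longer than 40ms.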
GPU Dispatch
With pull-based dispatch, GPU workers self-balance: whichever worker is idle claims the next batch. There is no centralized scheduler, no stale queue_depth view, and no thundering herd when multiple batchers are running. Workers block on the shared batch_queue via BLPOP and publish responses directly to Redis pub/sub.
import json
import time

# `redis` is assumed to be an already-connected Redis client, and
# parse_batch an assumed helper that deserializes a queued batch.

async def gpu_worker_loop(worker_id):
    while True:
        # Blocks up to 1s for a batch; atomic claim across workers.
        # The finite timeout lets an idle worker keep its heartbeat fresh.
        item = redis.blpop('batch_queue', timeout=1)
        if item is not None:
            batch = parse_batch(item[1])  # item is a (key, payload) pair
            try:
                outputs = run_inference([r.input for r in batch.requests])  # ~50ms
                # A batch may contain requests from many Gateway instances,
                # so we publish per-request to the owning Gateway's channel.
                # The gateway_id was stamped onto each request at enqueue time.
                for req, output in zip(batch.requests, outputs):
                    redis.publish(f'responses:{req.gateway_id}', json.dumps({
                        'request_id': req.id,
                        'output': output,
                    }))
            except Exception as e:
                # Requeue for another worker; DLQ after retry budget exhausted.
                # See "Request Timeouts and Retries" below for handle_gpu_failure.
                handle_gpu_failure(batch, e)
        redis.hset(f'gpu:{worker_id}', 'last_heartbeat', time.time())
Why pull, not push? "Least-loaded" push uses queue_depth as a lagging signal. With multiple batcher instances, they can all pick the same "least loaded" GPU at the same tick and create a thundering herd. Pull (BLPOP) is atomic and race-free by construction. Push only earns its complexity when you need global scheduling intelligence (heterogeneous GPU pools, model affinity, canary rollouts), which this design does not require.
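The claim-exactly-once property of pull-based dispatch can be shown with a small simulation. Here a thread-safe `queue.Queue` stands in for the Redis `batch_queue` and threads stand in for GPU workers; this only models the semantics and does not talk to Redis.

```python
import queue
import threading


def run_workers(batch_ids, num_workers=3):
    """Return a list of (batch_id, worker_id) claims made by competing workers."""
    batch_queue = queue.Queue()  # stands in for the shared Redis list
    claims = []
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            batch_id = batch_queue.get()  # blocks like BLPOP; one consumer per item
            if batch_id is None:          # sentinel: no more work, shut down
                return
            with lock:
                claims.append((batch_id, worker_id))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for batch_id in batch_ids:
        batch_queue.put(batch_id)
    for _ in threads:
        batch_queue.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return claims
```

Every batch appears in the result exactly once, whichever worker happened to be idle. No scheduler had to pick a worker, so there is no decision for two batchers to make simultaneously.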
The GPU registry still exists (see Phase 2) but serves a different master: it feeds healthy_gpu_count into the dynamic rate limiter's capacity calculation and powers ops dashboards. Dispatch does not consult it.
Problem: What happens when half your GPU cluster goes down?
The key insight is that rate limits must adjust dynamically based on available capacity:
class DynamicRateLimiter:
    def calculate_capacity(self):
        # Redis returns strings (or None for a missing key), so convert
        # explicitly; a missing key counts as zero capacity.
        active_gpus = int(redis.get('healthy_gpu_count') or 0)
        rps_per_gpu = 426  # From estimation
        target_utilization = 0.7
        return active_gpus * rps_per_gpu * target_utilization

    def should_throttle(self, tier, current_rps):
        capacity = self.calculate_capacity()
        queue_depth = int(redis.get('total_queue_depth') or 0)
        # Throttling thresholds
        if queue_depth > 500:
            # Critical: Only accept enterprise
            return tier != 'enterprise'
        elif queue_depth > 100:
            # Warning: Only accept paid+
            return tier == 'free'
        elif current_rps > capacity:
            # At capacity: Throttle free tier
            return tier == 'free'
        return False
Feedback loop: GPU capacity information must flow back to the rate limiter. When GPUs fail, the system should tighten rate limits within seconds, not minutes. Use Redis pub/sub or a metrics system for this feedback.
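One way to derive `healthy_gpu_count` is to count registry entries whose heartbeat is recent. In the sketch below, the 5-second staleness threshold and the function names are assumptions; the registry fields (`status`, `last_heartbeat`) and the 426 RPS per GPU figure come from earlier in this design.

```python
def healthy_gpu_count(workers, now, heartbeat_ttl_s=5.0):
    """Count workers that report healthy and have a fresh heartbeat."""
    return sum(
        1
        for w in workers
        if w["status"] == "healthy" and now - w["last_heartbeat"] <= heartbeat_ttl_s
    )


def capacity_rps(healthy_gpus, rps_per_gpu=426, target_utilization=0.7):
    """Admission capacity the rate limiter should enforce right now."""
    return healthy_gpus * rps_per_gpu * target_utilization


if __name__ == "__main__":
    fleet = [
        {"worker_id": "gpu-1", "status": "healthy", "last_heartbeat": 99.0},
        {"worker_id": "gpu-2", "status": "healthy", "last_heartbeat": 80.0},  # stale
        {"worker_id": "gpu-3", "status": "degraded", "last_heartbeat": 99.5},
        {"worker_id": "gpu-4", "status": "healthy", "last_heartbeat": 98.0},
    ]
    count = healthy_gpu_count(fleet, now=100.0)
    print(count, "healthy GPUs ->", capacity_rps(count), "RPS")
```

Recomputing this every second or so and writing the result to Redis keeps the limiter within a few seconds of the fleet's real state. A crashed worker drops out of the count as soon as its heartbeat goes stale.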
When traffic suddenly doubles, the 70% utilization target is critical: it buys you time while new GPUs spin up. Running at 95% utilization leaves no buffer for spikes.
# Timeout strategy (all values in milliseconds)
# Note: These are hard caps; P95 target remains < 500ms.
TIMEOUTS = {
    'queue_age': 2000,       # Drop if in queue > 2s
    'gpu_processing': 3000,  # Terminate if GPU takes > 3s
    'total_request': 4000,   # End-to-end timeout
}

# Retry strategy
# With pull-based dispatch, "retry on a different GPU" is automatic:
# the failed batch is simply pushed back onto batch_queue and the
# next idle worker (which may be a different one) claims it.
def handle_gpu_failure(batch, error):
    # Retry budget: at most 2 retries, then dead-letter the batch.
    if batch.retry_count >= 2:
        return send_to_dlq(batch)
    batch.retry_count += 1
    # Re-enqueue for another worker to BLPOP. LPUSH puts the batch at the
    # head of the list, so the next BLPOP claims it first.
    redis.lpush('batch_queue', batch)
Retry storms: Unlimited retries during an outage can amplify load 2-3x. Limit to 1-2 retries with exponential backoff, and use a dead letter queue for persistent failures.
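A bounded retry policy with exponential backoff could be sketched as follows. The base delay, cap and the "full jitter" variant are assumptions chosen for illustration; only the retry limit of 2 and the dead letter queue come from the text above.

```python
import random


def backoff_delay_ms(retry_count, base_ms=50, cap_ms=1000, rng=random.random):
    """Full-jitter exponential backoff: a random delay below min(cap, base * 2**n)."""
    ceiling = min(cap_ms, base_ms * (2 ** retry_count))
    return rng() * ceiling


def next_action(retry_count, max_retries=2):
    """Return ('retry', delay_ms) or ('dlq', None) for a failed batch."""
    if retry_count >= max_retries:
        return ("dlq", None)
    return ("retry", backoff_delay_ms(retry_count))


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, next_action(attempt))
```

The jitter spreads retries out so that many batches failing together do not all return at the same instant. Delays need to stay small relative to the 4s end-to-end timeout, or a retried request will expire before it runs.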
Important rules:
Scale up aggressively, scale down conservatively
Cooldown period: 5-10 min between scaling operations
Remove at most 10-20% of fleet per scale-down
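These rules can be expressed as a small decision function. This is a sketch under stated assumptions: the cooldown is applied only to scale-down (so scale-up stays aggressive), the cooldown is 5 minutes, and at most 20% of the fleet is removed per step. The function and parameter names are illustrative.

```python
import math


def scaling_decision(current_gpus, desired_gpus, seconds_since_last_scale,
                     cooldown_s=300, max_scale_down_fraction=0.2):
    """Return the GPU count to run next, given the current and desired sizes."""
    if desired_gpus > current_gpus:
        # Scale up aggressively: jump straight to the desired size.
        return desired_gpus
    if desired_gpus < current_gpus:
        # Scale down conservatively: wait out the cooldown first.
        if seconds_since_last_scale < cooldown_s:
            return current_gpus
        # Remove at most a fixed fraction of the fleet (but at least one GPU).
        max_removal = max(1, math.floor(current_gpus * max_scale_down_fraction))
        return max(desired_gpus, current_gpus - max_removal)
    return current_gpus


if __name__ == "__main__":
    print(scaling_decision(10, 20, seconds_since_last_scale=0))    # scale up now
    print(scaling_decision(10, 4, seconds_since_last_scale=100))   # still cooling down
    print(scaling_decision(10, 4, seconds_since_last_scale=600))   # remove 20% only
```

Shrinking in small steps means a brief lull followed by a spike never finds the fleet at half size, at the cost of paying for idle GPUs a little longer.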
Clarified functional requirements (batching, multi-tier, rate limiting)
Discussed scale (1K → 10K RPS) and latency SLA (500ms)
Calculated GPU capacity with utilization buffer
Explained batching strategy (timeout-based)
Designed priority queues for user tiers
Explained response routing (request_id -> gateway_id + pending map)
Covered dynamic rate limiting based on GPU capacity
Addressed traffic spike handling and auto-scaling
Discussed timeout and retry strategy
Identified single points of failure (batcher, GPU workers)