Design an HTTP API that exposes a batch processing function for large language model inference. Individual users make single synchronous requests, but internally the system must batch these requests together for efficient GPU processing.
You are provided with a fixed backend function that you cannot modify:
def batchstring(inputs: list[str]) -> list[str]:
    """
    Processes a batch of string inputs and returns string outputs.

    Constraints:
    - Input size: 1-100 strings per batch
    - Output size: 1-100 strings (one per input)
    - Latency: ~100ms per batch (fixed, regardless of batch size within limits)
    - Concurrency: Each GPU instance can only process ONE batch at a time
    """
    # Fixed implementation - you cannot modify this
    pass
How do you design a service that:
Accepts individual synchronous HTTP requests from users
Aggregates them into batches internally
Routes batches to available GPU workers
Maps responses back to the original requesters
Maintains low latency while maximizing throughput
Related question: Inference API System Design. That question gives you an existing API and focuses on operational infrastructure — priority queues, rate limiting, and auto-scaling. This question gives you only a bare function and focuses on the core mechanics — how to collect individual HTTP requests into batches and route GPU responses back to the correct waiting connections.
Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.
Frame requirements as user capabilities:
Submit inference requests — Users should be able to send a single string input via HTTP and receive a processed string output
Synchronous response — Users should receive responses in the same HTTP connection (no polling or callbacks)
High concurrency — System should handle thousands of simultaneous users without degradation
Keep functional requirements minimal for this problem. The complexity lies in the internal batching mechanism, not user-facing features.
Against a P95 latency target of 200ms, the fixed 100ms GPU processing time is the irreducible minimum. All other latency (batching, routing, network) must fit within the remaining ~100ms budget.
GPU count calculation:
Given:
Target: 1,000 RPS
GPU processing: 100ms per batch (10 batches/sec/GPU)
Target batch size: 32 requests
Throughput per GPU = 10 batches/sec × 32 requests/batch = 320 RPS
Raw GPUs needed = 1,000 RPS / 320 RPS = 3.125 → 4 GPUs
With 70% utilization headroom = 4 / 0.7 ≈ 6 GPUs
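The sizing arithmetic above can be captured in a small helper, which makes it easy to re-run with a different batch size or utilization target. This is a minimal sketch; the function name `gpus_needed` and its parameter names are illustrative and not part of the original problem.

```python
import math


def gpus_needed(target_rps: float, batch_size: int = 32,
                batch_latency_ms: float = 100, utilization: float = 0.7) -> int:
    """Estimate the GPU count using the same steps as the calculation above."""
    batches_per_sec = 1000 / batch_latency_ms        # 10 batches/sec per GPU
    rps_per_gpu = batches_per_sec * batch_size       # 320 RPS per GPU
    raw_gpus = math.ceil(target_rps / rps_per_gpu)   # 3.125 rounds up to 4
    return math.ceil(raw_gpus / utilization)         # 4 / 0.7 rounds up to 6
```

Calling `gpus_needed(1000)` reproduces the 6 GPUs derived above.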
Latency breakdown:
Average batching delay: ~16ms (half of 32ms to fill batch at 1000 RPS)
Network overhead: ~10ms
GPU processing: 100ms
───────────────────────
Total average: ~126ms ✓ (under 200ms target)
Concurrent connections:
Connections = RPS × Average latency = 1,000 × 0.126s = 126 concurrent
Well within default OS limits (can support 10K+ with tuning)
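The latency and connection estimates can be expressed as two short functions. This sketch assumes steady arrivals and that the size trigger fires before any timeout (true at 1,000 RPS); the function names and default values are illustrative.

```python
def estimate_latency_ms(rps: float, batch_size: int = 32,
                        network_ms: float = 10.0, gpu_ms: float = 100.0) -> float:
    """Average latency: half the batch fill time, plus network, plus GPU time."""
    fill_ms = batch_size * 1000 / rps   # time to collect one full batch
    return fill_ms / 2 + network_ms + gpu_ms


def concurrent_connections(rps: float, latency_ms: float) -> float:
    """Little's law: in-flight requests = arrival rate x time in system."""
    return rps * latency_ms / 1000
```

For example, `estimate_latency_ms(1000)` gives 126.0, and `concurrent_connections(1000, 126.0)` gives 126.0 open connections.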
InferenceRequest
├── request_id: UUID (unique identifier)
├── input: string (user's input text)
├── timestamp: datetime (arrival time)
├── return_to: string (API instance identifier)
└── status: enum (pending, processing, completed, failed)
Batch
├── batch_id: UUID
├── requests: list[InferenceRequest] (1-100 items)
├── created_at: datetime
└── gpu_id: string (assigned GPU)
InferenceResponse
├── request_id: UUID (maps to original request)
├── output: string (processed result)
└── latency_ms: int (total processing time)
This system is largely stateless—no durable database needed. All state is transient and tied to in-flight requests.
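One way to write the three entities down is as plain dataclasses. This is only a sketch of the shapes listed above; the `Status` enum name, the default factories, and the size check in `Batch` are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class Status(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class InferenceRequest:
    input: str
    return_to: str                                  # API instance identifier
    request_id: UUID = field(default_factory=uuid4)
    timestamp: datetime = field(default_factory=_now)
    status: Status = Status.PENDING


@dataclass
class Batch:
    requests: list[InferenceRequest]
    batch_id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=_now)
    gpu_id: Optional[str] = None                    # set once a GPU claims the batch

    def __post_init__(self):
        # Mirror the backend constraint of 1-100 inputs per batch
        if not 1 <= len(self.requests) <= 100:
            raise ValueError("a batch must hold 1-100 requests")


@dataclass
class InferenceResponse:
    request_id: UUID
    output: str
    latency_ms: int
```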
REST is appropriate because:
Simple request-response model
Standard HTTP semantics
Easy to integrate with any client
Submit Inference Request
POST /api/inference
Content-Type: application/json
Request:
{
"input": "E equals "
}
Response (200 OK):
{
"output": "E equals mc^2",
"request_id": "req_abc123",
"latency_ms": 142
}
Response (503 Service Unavailable):
{
"error": "SERVICE_OVERLOADED",
"message": "Too many pending requests",
"retry_after_ms": 5000
}
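The 503 response implies some admission check before a request is queued. A minimal sketch of such a check follows; the original does not say how overload is detected, so the `MAX_PENDING` cap and the helper name `overload_response` are assumptions for illustration.

```python
from typing import Optional

MAX_PENDING = 1000  # assumed cap on in-flight requests per API server


def overload_response(pending_count: int,
                      max_pending: int = MAX_PENDING) -> Optional[dict]:
    """Return the 503 body when the server is saturated, else None."""
    if pending_count >= max_pending:
        return {
            "error": "SERVICE_OVERLOADED",
            "message": "Too many pending requests",
            "retry_after_ms": 5000,
        }
    return None
```

An API handler would call this before enqueueing and send the body with status 503 when it is not `None`.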
[Architecture diagram: Users (1...N) → Load Balancer → API Servers (1...N) → Redis Request Queue → Batching Service → Redis Batch Queue → GPU Workers (1...N) → Redis Pub/Sub → API Servers (response routing) → Users (HTTP response)]
Step 1: Request Arrival
User → Load Balancer → API Server 2
API Server 2:
1. Generate unique request_id: "req_xyz"
2. Store connection: pending_requests["req_xyz"] = http_connection
3. Push to Redis queue: {request_id, input, return_to: "api-2"}
4. Wait for response (keep connection open)
Step 2: Batch Formation
Batching Service:
BLPOP from Redis request queue (blocks until available)
Accumulate requests into current_batch
Trigger when: batch.size == 32 OR elapsed_time > 50ms
Push formed batch to Redis batch queue
Step 3: GPU Processing (Pull-Based)
GPU Worker (runs in a loop):
BLPOP from batch queue (blocks until batch available)
Execute batchstring(inputs) → 100ms
Publish results directly to Redis Pub/Sub
Why pull-based? GPUs claim batches atomically via BLPOP—no race conditions, no need to track GPU availability. When a GPU finishes, it simply pulls the next batch.
Step 4: Response Routing
GPU Worker (after processing):
For each (request_id, output) in results:
Publish to Redis channel "responses:api-2":
{request_id: "req_xyz", output: "..."}
API Server 2:
Subscribed to "responses:api-2"
Receive message for request_id "req_xyz"
Lookup: connection = pending_requests["req_xyz"]
Send HTTP response through connection
Delete from pending_requests
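The API-server side of Step 4 can be sketched as a message handler that resolves the waiting request. The sketch below leaves out the Redis subscription itself and assumes each pending request is represented by an `asyncio.Future`; the handler name `on_response_message` is hypothetical.

```python
import asyncio
import json

# request_id -> future awaited by the HTTP handler holding the connection
pending_requests: dict[str, asyncio.Future] = {}


def on_response_message(raw: str) -> bool:
    """Resolve the waiter for one Pub/Sub message; return False if none is waiting."""
    msg = json.loads(raw)
    future = pending_requests.pop(msg["request_id"], None)
    if future is None or future.done():
        return False  # unknown id, or the request already timed out
    future.set_result(msg["output"])
    return True
```

Popping the entry in the same step as the lookup covers the "delete from pending_requests" action, so a duplicate message for the same id is ignored.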
Timeout-based batching is critical. Pure size-based batching causes unacceptable latency during low traffic periods.
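import time

class BatchingService:
    def __init__(self, batch_size=32, timeout_ms=50):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.current_batch = []
        # Monotonic time at which the first request joined the current batch
        self.batch_start_time = None

    def elapsed_ms(self):
        # Milliseconds since the current batch received its first request
        return (time.monotonic() - self.batch_start_time) * 1000

    def should_send_batch(self):
        if len(self.current_batch) >= self.batch_size:
            return True  # Size trigger
        if self.batch_start_time is not None and self.elapsed_ms() > self.timeout_ms:
            return True  # Timeout trigger
        return False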
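The size and timeout triggers can also be combined into a single collection loop. The sketch below uses an in-memory `asyncio.Queue` as a stand-in for the Redis request queue, which is an assumption made to keep the example self-contained; `collect_batch` is a hypothetical name.

```python
import asyncio
import time


async def collect_batch(queue: asyncio.Queue, batch_size: int = 32,
                        timeout_ms: float = 50) -> list:
    """Wait for the first item, then fill until the size or timeout trigger fires."""
    batch = [await queue.get()]                     # the timer starts at the first request
    deadline = time.monotonic() + timeout_ms / 1000
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                   # timeout trigger
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break                                   # timeout trigger while waiting
    return batch


async def demo():
    queue = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(f"req-{i}")
    first = await collect_batch(queue, batch_size=3)   # filled by size
    second = await collect_batch(queue, batch_size=3)  # flushed by timeout
    return first, second
```

Running `asyncio.run(demo())` returns a full batch of three followed by a partial batch of two, which is the low-traffic case the timeout exists for.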
Trade-off: Timeout Selection
At 1000 RPS, requests arrive every 1ms. A batch of 32 fills in ~32ms, so the size trigger fires before the timeout. The timeout only matters during low traffic.
Rule of thumb: Set timeout to ~50% of your latency budget after GPU processing. With 100ms remaining (200ms target - 100ms GPU), a 50ms timeout leaves headroom for network overhead.
This is where most candidates fail. The HTTP connection exists between User ↔ API Server. The Batching Service cannot directly send responses through that connection.
Why is this hard?
User connects to API Server 1
Request is batched by the Batching Service and processed by a GPU worker (different process)
GPU worker publishes the result to a routing channel
How does the result get back to API Server 1's HTTP connection?
Solution: Redis Pub/Sub for Response Routing
Each API instance subscribes to its own channel:
API-1 subscribes to "responses:api-1"
API-2 subscribes to "responses:api-2"
Request includes "return_to" field identifying origin API instance
GPU workers publish each response to the correct channel
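On the GPU-worker side, the routing step amounts to grouping outputs by the `return_to` field. The sketch below only builds the channel-to-payload mapping and leaves the actual Redis PUBLISH call out; the request dictionaries and the function name `route_results` are assumed shapes for illustration.

```python
import json
from collections import defaultdict


def route_results(requests: list[dict], outputs: list[str]) -> dict[str, list[str]]:
    """Group JSON payloads by the Pub/Sub channel of the originating API instance."""
    if len(requests) != len(outputs):
        raise ValueError("expected one output per input")
    by_channel = defaultdict(list)
    for req, output in zip(requests, outputs):
        channel = f"responses:{req['return_to']}"
        by_channel[channel].append(
            json.dumps({"request_id": req["request_id"], "output": output})
        )
    return dict(by_channel)
```

A worker would then publish each payload to its channel, so a single batch can fan out to several API instances.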
Alternative: Collocated Architecture
For simpler deployments, run the batching logic within each API server:
┌────────────────────────────────┐ ┌─────────────────┐
│ API Server Instance │ │ Batch Queue │
│ ┌───────────────────────────┐ │ │ (Redis) │
│ │ HTTP Handler │ │ └────────┬────────┘
│ │ - Keeps connection open │ │ │
│ └───────────┬───────────────┘ │ v
│ │ │ ┌─────────────────┐
│ ┌───────────▼───────────────┐ │ │ GPU Workers │
│ │ Local Batcher │──┼────>│ (shared pool) │
│ │ - In-memory batch │ │ └─────────────────┘
│ │ - Direct connection ref │<─┼──── Response via Pub/Sub
│ └───────────────────────────┘ │
└────────────────────────────────┘
Why "Medium" GPU efficiency? With 3 API instances, each forms batches independently. At 1000 RPS split evenly (~333 RPS each), an instance needs ~96ms to collect 32 requests, so the 50ms timeout fires first and batches go out roughly half full (~17 requests). The centralized approach aggregates all 1000 RPS and fills a batch of 32 in ~32ms, so it sends far fewer partially filled batches.
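The fill-time comparison can be checked with a short calculation. This sketch assumes perfectly even arrivals and the 32-request, 50ms batching parameters used elsewhere in this design; both function names are illustrative.

```python
def fill_time_ms(rps: float, batch_size: int = 32) -> float:
    """Time needed to collect one full batch at a steady arrival rate."""
    return batch_size * 1000 / rps


def expected_batch_size(rps: float, batch_size: int = 32,
                        timeout_ms: float = 50) -> float:
    """Requests collected before either the size or the timeout trigger fires."""
    return min(batch_size, rps * timeout_ms / 1000)
```

With these definitions, `fill_time_ms(1000)` is 32.0 while `fill_time_ms(1000 / 3)` is about 96, and `expected_batch_size(1000 / 3)` comes out near 17 against a full 32 for the centralized batcher.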
Trade-off comparison:
When to use collocated: Start with collocated for simplicity. Migrate to separate batching service when you observe GPU under-utilization due to small batch sizes across instances.
1. Latency (P95 < 200ms)
Timeout-based batching caps waiting time at 50ms
Co-locate components in same availability zone
Use connection pooling to Redis
Monitor and alert when P95 approaches 180ms
2. Throughput (1,000 RPS)
With 6 GPUs at 70% utilization:
Theoretical max: 6 × 320 = 1,920 RPS
Sustainable: 1,920 × 0.7 = 1,344 RPS ✓
3. Availability (99.9%)
Batching Service resilience: Multiple batching instances can safely pull from the same request queue (BLPOP is atomic). If one crashes, the others continue. Partially formed batches held in memory by the crashed instance are lost, but those requests time out at the API layer and users retry.
Bottleneck 1: GPU Processing (100ms fixed)
This is the irreducible bottleneck. The only solution is horizontal scaling (more GPUs).
RPS needed → GPUs required
500 RPS → 3 GPUs
1,000 RPS → 6 GPUs
5,000 RPS → 25 GPUs
10,000 RPS → 50 GPUs
Bottleneck 2: Connection Limits (C10K Problem)
At 10,000 RPS with 150ms latency:
Concurrent connections = 10,000 × 0.15 = 1,500 per LB
Solutions:
Increase OS file descriptor limits (ulimit -n 65536)
Horizontal scale API instances
Consider HTTP/2 multiplexing
Bottleneck 3: Redis Throughput
Operations per request:
1 RPUSH (API → request queue)
1 BLPOP (batching service dequeues)
1 RPUSH per batch (batching → batch queue, amortized: 1/32 per request)
1 BLPOP per batch (GPU claims batch, amortized: 1/32 per request)
1 PUBLISH (GPU → response routing)
Producers append with RPUSH and consumers take from the head with BLPOP, which keeps both queues FIFO (LPUSH paired with BLPOP would serve the newest request first).
At 10,000 RPS: ~30,000 Redis ops/sec (~3 per request)
Redis easily handles 100K+ ops/sec on modest hardware ✓
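The operation count above can be reproduced with a one-line formula. This is a sketch of the arithmetic only; `redis_ops_per_sec` is an illustrative name.

```python
def redis_ops_per_sec(rps: float, batch_size: int = 32) -> float:
    """Redis operations per second implied by the per-request tally above."""
    per_request = 3   # request-queue push, request-queue pop, PUBLISH
    per_batch = 2     # batch-queue push and batch-queue pop
    return rps * (per_request + per_batch / batch_size)
```

At 10,000 RPS this gives 30,625 ops/sec, which the text rounds to ~30,000.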
GPU crash mid-batch — If a GPU fails while processing, 32 user requests fail simultaneously. This requires explicit handling.
GPU Worker with timeout and retry:
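import asyncio
import json

import redis.asyncio as aioredis

redis = aioredis.Redis()

# publish_results, publish_errors, and report_unhealthy are helpers assumed to
# exist elsewhere. The batch is assumed to be stored as JSON of the form
# {"requests": [{"request_id": ..., "input": ..., "return_to": ...}, ...]}.
async def gpu_worker_loop():
    while True:
        # BLPOP returns a (key, value) pair; timeout=0 blocks until a batch arrives
        _, raw = await redis.blpop("batch_queue", timeout=0)
        batch = json.loads(raw)
        inputs = [r["input"] for r in batch["requests"]]
        try:
            outputs = await asyncio.wait_for(
                asyncio.to_thread(batchstring, inputs),
                timeout=0.3,  # 3x expected (300ms)
            )
            await publish_results(batch, outputs)
        except asyncio.TimeoutError:
            # GPU hung - requeue at the head so another GPU picks it up next.
            # wait_for cannot stop the stuck thread, so this worker must exit.
            await redis.lpush("batch_queue", raw)
            await report_unhealthy()
            break  # Exit and let orchestrator restart this worker
        except Exception as e:
            # Processing failed - notify all waiters with error
            await publish_errors(batch, str(e))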
Timeout handling at API layer:
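import asyncio

from fastapi import HTTPException  # assumes a FastAPI-style framework

pending_requests: dict[str, asyncio.Future] = {}

# generate_id and enqueue_request are helpers assumed to exist elsewhere
async def handle_request(input: str):
    request_id = generate_id()
    future = asyncio.get_running_loop().create_future()
    pending_requests[request_id] = future
    try:
        await enqueue_request(request_id, input)
        return await asyncio.wait_for(future, timeout=5.0)  # 5 second deadline
    except asyncio.TimeoutError:
        raise HTTPException(504, "Request timeout")
    finally:
        # Remove the entry on success as well as on timeout to avoid a leak
        pending_requests.pop(request_id, None)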
Aggressive batching (larger batches, longer timeout):
Pro: Higher GPU utilization, lower cost per request
Con: Higher user-perceived latency
Conservative batching (smaller batches, shorter timeout):
Pro: Lower latency, better user experience
Con: More GPU overhead, higher cost
Adaptive approach:
from dataclasses import dataclass

@dataclass
class BatchParams:
    size: int
    timeout_ms: int

def calculate_batch_params(queue_depth, gpu_utilization):
    if gpu_utilization < 0.5 and queue_depth < 10:
        # Under-utilized: prioritize latency
        return BatchParams(size=16, timeout_ms=30)
    elif queue_depth > 100 or gpu_utilization > 0.85:
        # Overloaded: prioritize throughput
        return BatchParams(size=64, timeout_ms=100)
    else:
        # Normal: balanced
        return BatchParams(size=32, timeout_ms=50)
GPU cold start time (1-5 minutes) means aggressive scaling-up is necessary. Scale down conservatively to avoid thrashing.
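The asymmetry between scaling up and scaling down can be sketched as a decision function. The thresholds here (70% target utilization, 40% scale-down threshold, 15 minutes of sustained low load) are assumptions chosen for illustration, as is the function name `desired_gpu_count`; the original does not specify them.

```python
import math


def desired_gpu_count(current: int, utilization: float, low_util_minutes: int,
                      target_util: float = 0.7, scale_down_below: float = 0.4,
                      scale_down_after_minutes: int = 15) -> int:
    """Scale up in one jump, but scale down one GPU at a time after sustained low load."""
    if utilization > target_util:
        # Aggressive: add enough GPUs at once to return to the target utilization,
        # since each new GPU takes minutes to become useful
        return max(current + 1, math.ceil(current * utilization / target_util))
    if utilization < scale_down_below and low_util_minutes >= scale_down_after_minutes:
        # Conservative: remove a single GPU to avoid thrashing
        return max(1, current - 1)
    return current
```

For example, 6 GPUs at 95% utilization would request 9, while 6 GPUs at 30% utilization drop to 5 only after the low-load window has elapsed.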
Requirements Phase:
Clarified latency SLA (P95 target)
Confirmed throughput requirements (RPS)
Asked about user priority tiers
Verified GPU processing constraints
Data Model Phase:
Identified transient vs persistent data
Request tracking structure defined
Batch formation structure defined
API Phase:
Single endpoint with clear contract
Error codes for all failure modes
Timeout behavior specified
High-Level Design Phase:
Complete request flow drawn
Response routing mechanism explained
Batching strategy with timeout
GPU assignment logic
Scaling Phase:
GPU count calculation shown
Connection limits addressed
Failure handling for GPU crashes
Auto-scaling triggers defined
This design handles 1,000 RPS with an estimated ~126ms average latency using 6 GPUs, meeting all functional and non-functional requirements while remaining operationally simple.