Batch Processing Service System Design

System Design | Phone | Software Engineer | Reported Nov 2025

Problem Statement

Design an HTTP API that exposes a batch processing function for large language model inference. Individual users make single synchronous requests, but internally the system must batch these requests together for efficient GPU processing.

Given Function Signature

You are provided with a fixed backend function that you cannot modify:

def batchstring(inputs: list[str]) -> list[str]:
    """
    Processes a batch of string inputs and returns string outputs.

    Constraints:
    - Input size: 1-100 strings per batch
    - Output size: 1-100 strings (one per input)
    - Latency: ~100ms per batch (fixed, regardless of batch size within limits)
    - Concurrency: Each GPU instance can only process ONE batch at a time
    """
    # Fixed implementation - you cannot modify this
    pass
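For experimenting locally, a stand-in that enforces the stated constraints can be useful. The sketch below is a hypothetical `fake_batchstring`, not the real backend: the echo-style output and the lock that models "one batch at a time" are assumptions made for testing only.

```python
import threading
import time

# Models the "one batch at a time" constraint of a single GPU instance.
_gpu_lock = threading.Lock()


def fake_batchstring(inputs: list[str]) -> list[str]:
    """Hypothetical stand-in for batchstring, for local testing only."""
    if not 1 <= len(inputs) <= 100:
        raise ValueError("batch must contain 1-100 strings")
    # A second caller blocks here until the first batch has finished.
    with _gpu_lock:
        time.sleep(0.1)  # ~100ms regardless of batch size
        # Assumed behavior: echo each input with a marker (the real model differs).
        return [f"{text} [processed]" for text in inputs]
```

The function returns exactly one output per input, in input order, which is the property the response-mapping logic later relies on.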

Core Challenge

How do you design a service that:

Accepts individual synchronous HTTP requests from users

Aggregates them into batches internally

Routes batches to available GPU workers

Maps responses back to the original requesters

Maintains low latency while maximizing throughput

Related question: Inference API System Design. That question gives you an existing API and focuses on operational infrastructure — priority queues, rate limiting, and auto-scaling. This question gives you only a bare function and focuses on the core mechanics — how to collect individual HTTP requests into batches and route GPU responses back to the correct waiting connections.

Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.

Phase 1: Requirements

Functional Requirements

Frame requirements as user capabilities:

Submit inference requests — Users should be able to send a single string input via HTTP and receive a processed string output

Synchronous response — Users should receive responses in the same HTTP connection (no polling or callbacks)

High concurrency — System should handle thousands of simultaneous users without degradation

Keep functional requirements minimal for this problem. The complexity lies in the internal batching mechanism, not user-facing features.

Non-Functional Requirements

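The targets used throughout this design are P95 latency under 200ms, 1,000 RPS throughput, and 99.9% availability. The 100ms fixed GPU processing time is the irreducible minimum, so all other latency (batching, routing, network) must fit within the remaining ~100ms budget.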

Capacity Estimation

GPU count calculation:

Given:

Target: 1,000 RPS
GPU processing: 100ms per batch (10 batches/sec/GPU)
Target batch size: 32 requests

Throughput per GPU = 10 batches/sec × 32 requests/batch = 320 RPS

Raw GPUs needed = 1,000 RPS / 320 RPS = 3.125 → 4 GPUs

With 70% utilization headroom = 4 / 0.7 ≈ 6 GPUs

Latency breakdown:

Average batching delay: ~16ms (half of 32ms to fill batch at 1000 RPS)

Network overhead: ~10ms

GPU processing: 100ms

───────────────────────

Total average: ~126ms ✓ (under 200ms target)

Concurrent connections:

Connections = RPS × Average latency = 1,000 × 0.126s = 126 concurrent

Well within default OS limits (can support 10K+ with tuning)
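The arithmetic above can be reproduced with a few helper functions. This is a minimal sketch: the function names are illustrative, and the default values (batch size 32, 100ms per batch, 70% utilization, 10ms network overhead) are the ones assumed in this section.

```python
import math


def gpus_needed(target_rps: float, batch_size: int = 32,
                batch_latency_s: float = 0.1, utilization: float = 0.7) -> int:
    """Round up the raw GPU count, then add utilization headroom."""
    per_gpu_rps = batch_size / batch_latency_s   # 10 batches/sec x 32 requests
    raw = math.ceil(target_rps / per_gpu_rps)    # 3.125 rounds up to 4
    return math.ceil(raw / utilization)          # 4 / 0.7 rounds up to 6


def average_latency_ms(target_rps: float, batch_size: int = 32,
                       network_ms: float = 10.0, gpu_ms: float = 100.0) -> float:
    """Average batching delay (half the fill time) plus network and GPU time."""
    fill_ms = batch_size / target_rps * 1000
    return fill_ms / 2 + network_ms + gpu_ms


def concurrent_connections(target_rps: float, latency_ms: float) -> float:
    """Little's law: in-flight requests = arrival rate x time in system."""
    return target_rps * latency_ms / 1000
```

Calling `gpus_needed(1000)` gives 6, matching the estimate above, and changing `batch_size` or `utilization` shows how sensitive the GPU count is to those assumptions.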

Phase 2: Data Model

Core Entities

InferenceRequest

├── request_id: UUID (unique identifier)

├── input: string (user's input text)

├── timestamp: datetime (arrival time)

├── return_to: string (API instance identifier)

└── status: enum (pending, processing, completed, failed)

Batch

├── batch_id: UUID

├── requests: list[InferenceRequest] (1-100 items)

├── created_at: datetime

└── gpu_id: string (assigned GPU)

InferenceResponse

├── request_id: UUID (maps to original request)

├── output: string (processed result)

└── latency_ms: int (total processing time)
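One way to express these entities in code is with dataclasses. The sketch below mirrors the fields listed above; the default factories, the `RequestStatus` enum name, and the size check in `Batch` are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class RequestStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class InferenceRequest:
    input: str                      # user's input text
    return_to: str                  # API instance identifier, e.g. "api-2"
    request_id: UUID = field(default_factory=uuid4)
    timestamp: datetime = field(default_factory=_now)
    status: RequestStatus = RequestStatus.PENDING


@dataclass
class Batch:
    requests: list[InferenceRequest]
    gpu_id: Optional[str] = None    # unset until a GPU worker claims the batch
    batch_id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=_now)

    def __post_init__(self) -> None:
        # Enforce the backend's 1-100 strings-per-batch constraint.
        if not 1 <= len(self.requests) <= 100:
            raise ValueError("a batch must hold 1-100 requests")


@dataclass
class InferenceResponse:
    request_id: UUID                # maps back to the original request
    output: str
    latency_ms: int
```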

Data Locality

This system is largely stateless—no durable database needed. All state is transient and tied to in-flight requests.

Phase 3: API Design

Protocol Choice: REST

REST is appropriate because:

Simple request-response model

Standard HTTP semantics

Easy to integrate with any client

Endpoints

Submit Inference Request

POST /api/inference

Content-Type: application/json

Request:

{
  "input": "E equals "
}

Response (200 OK):

{
  "output": "E equals mc^2",
  "request_id": "req_abc123",
  "latency_ms": 142
}

Response (503 Service Unavailable):

{
  "error": "SERVICE_OVERLOADED",
  "message": "Too many pending requests",
  "retry_after_ms": 5000
}
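The two response shapes above could be produced by small helpers like the ones below. This is a sketch only: the helper names and the `admit` check with its `max_pending` threshold are hypothetical, since the source does not say how overload is detected.

```python
import json


def success_response(request_id: str, output: str, latency_ms: int) -> tuple[int, str]:
    """Build the 200 OK status and JSON body."""
    body = {"output": output, "request_id": request_id, "latency_ms": latency_ms}
    return 200, json.dumps(body)


def overloaded_response(retry_after_ms: int = 5000) -> tuple[int, str]:
    """Build the 503 status and JSON body returned when the service sheds load."""
    body = {
        "error": "SERVICE_OVERLOADED",
        "message": "Too many pending requests",
        "retry_after_ms": retry_after_ms,
    }
    return 503, json.dumps(body)


def admit(pending_count: int, max_pending: int) -> bool:
    """Hypothetical admission check: reject once too many requests are in flight."""
    return pending_count < max_pending
```

Rejecting early with a 503 keeps the queue short, so requests that are admitted still have a chance of meeting the latency target.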

Error Codes

Phase 4: High-Level Design

Architecture

[Architecture diagram: Users (1...N) → Load Balancer → API Servers (1...N) → Redis Request Queue → Batching Service → Redis Batch Queue → GPU Workers (1...N) → Redis Pub/Sub → API Servers (response routing) → Users (HTTP response)]

Request Flow

Step 1: Request Arrival

User → Load Balancer → API Server 2

API Server 2:

1. Generate unique request_id: "req_xyz"
2. Store connection: pending_requests["req_xyz"] = http_connection
3. Push to Redis queue: {request_id, input, return_to: "api-2"}
4. Wait for response (keep connection open)

Step 2: Batch Formation

Batching Service:

1. BLPOP from Redis request queue (blocks until available)
2. Accumulate requests into current_batch
3. Trigger when: batch.size == 32 OR elapsed_time > 50ms
4. Push formed batch to Redis batch queue

Step 3: GPU Processing (Pull-Based)

GPU Worker (runs in a loop):

1. BLPOP from batch queue (blocks until batch available)
2. Execute batchstring(inputs) → 100ms
3. Publish results directly to Redis Pub/Sub

Why pull-based? GPUs claim batches atomically via BLPOP—no race conditions, no need to track GPU availability. When a GPU finishes, it simply pulls the next batch.
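The pull model can be tried out without Redis. In the sketch below an `asyncio.Queue` stands in for the Redis batch queue and a short sleep stands in for the GPU call; both substitutions, along with the names `gpu_worker` and `run_demo`, are assumptions for illustration.

```python
import asyncio


async def gpu_worker(name: str, batch_queue: asyncio.Queue, claimed: dict) -> None:
    """Pull loop: block for a batch, process it, then pull the next one."""
    while True:
        batch_id, _inputs = await batch_queue.get()   # stand-in for BLPOP
        claimed.setdefault(batch_id, []).append(name)  # record who claimed it
        await asyncio.sleep(0.01)                      # stand-in for the GPU call
        batch_queue.task_done()


async def run_demo(num_workers: int = 3, num_batches: int = 10) -> dict:
    batch_queue: asyncio.Queue = asyncio.Queue()
    claimed: dict = {}
    workers = [
        asyncio.create_task(gpu_worker(f"gpu-{i}", batch_queue, claimed))
        for i in range(num_workers)
    ]
    for batch_id in range(num_batches):
        batch_queue.put_nowait((batch_id, [f"input-{batch_id}"]))
    await batch_queue.join()             # wait until every batch is processed
    for task in workers:
        task.cancel()                    # workers loop forever, so stop them
    await asyncio.gather(*workers, return_exceptions=True)
    return claimed
```

After `asyncio.run(run_demo())`, every batch id maps to exactly one worker name: no dispatcher has to track which GPU is free, because an idle worker is simply the one blocked on the queue.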

Step 4: Response Routing

GPU Worker (after processing):

For each (request_id, output) in results:

  Publish to Redis channel "responses:api-2":
    {request_id: "req_xyz", output: "..."}

API Server 2:

1. Subscribed to "responses:api-2"
2. Receive message for request_id "req_xyz"
3. Lookup: connection = pending_requests["req_xyz"]
4. Send HTTP response through connection
5. Delete from pending_requests

Batching Strategy

Timeout-based batching is critical. Pure size-based batching causes unacceptable latency during low traffic periods.

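import time

class BatchingService:
    def __init__(self, batch_size=32, timeout_ms=50):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.current_batch = []
        self.batch_start_time = None  # Set when the first request of a batch arrives

    def add(self, request):
        if not self.current_batch:
            self.batch_start_time = time.monotonic()  # Start the timeout clock
        self.current_batch.append(request)

    def should_send_batch(self):
        if len(self.current_batch) >= self.batch_size:
            return True  # Size trigger
        if self.batch_start_time is None:
            return False  # Nothing buffered yet
        elapsed_ms = (time.monotonic() - self.batch_start_time) * 1000
        return elapsed_ms > self.timeout_ms  # Timeout trigger

    def take_batch(self):
        batch, self.current_batch = self.current_batch, []
        self.batch_start_time = None  # Reset so the next batch gets a fresh clock
        return batch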
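The size and timeout triggers can also be written as a single coroutine that returns one batch per call. This is a minimal sketch, assuming an `asyncio.Queue` as a stand-in for the Redis request queue; `collect_batch` is an illustrative name, not part of the original design.

```python
import asyncio


async def collect_batch(queue: asyncio.Queue, batch_size: int = 32,
                        timeout_s: float = 0.05) -> list:
    """Return up to batch_size items, waiting at most timeout_s after the first."""
    loop = asyncio.get_running_loop()
    batch = [await queue.get()]           # block until the first request arrives
    deadline = loop.time() + timeout_s    # the timeout clock starts at the first item
    while len(batch) < batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break                         # timeout trigger
        try:
            item = await asyncio.wait_for(queue.get(), timeout=remaining)
        except asyncio.TimeoutError:
            break                         # timeout trigger
        batch.append(item)
    return batch                          # full batch means the size trigger fired
```

Starting the clock at the first request, rather than when the collector begins waiting, means an idle period never counts against the next batch's latency budget.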

Trade-off: Timeout Selection

At 1000 RPS, requests arrive every 1ms. A batch of 32 fills in ~32ms, so the size trigger fires before the timeout. The timeout only matters during low traffic.

Rule of thumb: Set timeout to ~50% of your latency budget after GPU processing. With 100ms remaining (200ms target - 100ms GPU), a 50ms timeout leaves headroom for network overhead.

Response Mapping: The Critical Design Decision

This is where most candidates fail. The HTTP connection exists between User ↔ API Server. The Batching Service cannot directly send responses through that connection.

Why is this hard?

User connects to API Server 1

Request is batched by the Batching Service and processed by a GPU worker (different process)

GPU worker publishes the result to a routing channel

How does the result get back to API Server 1's HTTP connection?

Solution: Redis Pub/Sub for Response Routing

Each API instance subscribes to its own channel:

  API-1 subscribes to "responses:api-1"
  API-2 subscribes to "responses:api-2"

Request includes "return_to" field identifying origin API instance

GPU workers publish each response to the correct channel
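On the API server side, the routing boils down to a map from request_id to whatever the HTTP handler is waiting on. The sketch below uses asyncio futures for that; the `ResponseRouter` class, its method names, and the JSON message shape are assumptions, and the Redis subscription itself is left out so the example stays self-contained.

```python
import asyncio
import json


class ResponseRouter:
    """Per-API-server map from request_id to the future its handler awaits."""

    def __init__(self, instance_id: str):
        self.channel = f"responses:{instance_id}"   # e.g. "responses:api-2"
        self.pending: dict[str, asyncio.Future] = {}

    def register(self, request_id: str) -> asyncio.Future:
        """Called by the HTTP handler before it enqueues the request."""
        future = asyncio.get_running_loop().create_future()
        self.pending[request_id] = future
        return future

    def on_message(self, raw: str) -> None:
        """Called for each message received on this server's channel."""
        message = json.loads(raw)
        future = self.pending.pop(message["request_id"], None)
        if future is not None and not future.done():
            future.set_result(message["output"])
        # Unknown ids (for example, requests that already timed out) are dropped.


def channel_for(return_to: str) -> str:
    """Channel a GPU worker would publish to for a given return_to value."""
    return f"responses:{return_to}"
```

Because Pub/Sub delivers only to current subscribers, a response published while an API server is down is lost, which is one reason the handler also needs its own deadline.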

Alternative: Collocated Architecture

For simpler deployments, run the batching logic within each API server:

┌────────────────────────────────┐ ┌─────────────────┐

│ API Server Instance │ │ Batch Queue │

│ ┌───────────────────────────┐ │ │ (Redis) │

│ │ HTTP Handler │ │ └────────┬────────┘

│ │ - Keeps connection open │ │ │

│ └───────────┬───────────────┘ │ v

│ │ │ ┌─────────────────┐

│ ┌───────────▼───────────────┐ │ │ GPU Workers │

│ │ Local Batcher │──┼────>│ (shared pool) │

│ │ - In-memory batch │ │ └─────────────────┘

│ │ - Direct connection ref │<─┼──── Response via Pub/Sub

│ └───────────────────────────┘ │

└────────────────────────────────┘

Why "Medium" GPU efficiency? With 3 API instances, each forms batches independently. At 1000 RPS split evenly (~333 RPS each), each instance would need ~96ms to fill a batch of 32, so the 50ms timeout fires first and each batch carries only about 17 requests (333 RPS × 0.05 s). The centralized approach aggregates all 1000 RPS and fills a batch in ~32ms, so it produces fewer partially filled batches.
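The comparison can be checked numerically. The helpers below are a rough sketch that assumes evenly spaced arrivals and the 32-request, 50ms batching parameters used in this design; the function names are illustrative.

```python
def fill_time_ms(rps: float, batch_size: int = 32) -> float:
    """Time needed for batch_size requests to arrive at a steady rate."""
    return batch_size / rps * 1000


def expected_batch_size(rps: float, batch_size: int = 32,
                        timeout_ms: float = 50.0) -> float:
    """Approximate requests per batch once the size or timeout trigger fires."""
    arrivals_before_timeout = rps * timeout_ms / 1000
    return min(float(batch_size), arrivals_before_timeout)


centralized = expected_batch_size(1000)        # one batcher sees all traffic
collocated = expected_batch_size(1000 / 3)     # each of 3 instances sees a third
```

Here `centralized` reaches the full batch size of 32, while `collocated` stays at roughly half of that, so the same traffic is spread over about twice as many GPU calls.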

Trade-off comparison:

When to use collocated: Start with collocated for simplicity. Migrate to separate batching service when you observe GPU under-utilization due to small batch sizes across instances.

Phase 5: Scaling & Trade-offs

Addressing Non-Functional Requirements

1. Latency (P95 < 200ms)

Timeout-based batching caps waiting time at 50ms

Co-locate components in same availability zone

Use connection pooling to Redis

Monitor and alert when P95 approaches 180ms

2. Throughput (1,000 RPS)

With 6 GPUs at 70% utilization:

Theoretical max: 6 × 320 = 1,920 RPS

Sustainable: 1,920 × 0.7 = 1,344 RPS ✓

3. Availability (99.9%)

Batching Service resilience: Multiple batching instances can safely pull from the same request queue (BLPOP is atomic). If one crashes, the others continue. Requests in the crashed instance's partially formed batch are lost; they time out at the API layer and users retry.

Identifying Bottlenecks

Bottleneck 1: GPU Processing (100ms fixed)

This is the irreducible bottleneck. The only solution is horizontal scaling (more GPUs).

RPS needed → GPUs required (batch size 32, 70% utilization, same calculation as in Capacity Estimation)

500 RPS → 3 GPUs

1,000 RPS → 6 GPUs

5,000 RPS → 23 GPUs

10,000 RPS → 46 GPUs

Bottleneck 2: Connection Limits (C10K Problem)

At 10,000 RPS with 150ms latency:

Concurrent connections = 10,000 × 0.15 = 1,500 per LB

Solutions:

Increase OS file descriptor limits (ulimit -n 65536)
Horizontal scale API instances
Consider HTTP/2 multiplexing

Bottleneck 3: Redis Throughput

Operations per request:

1 RPUSH (API → request queue)
1 BLPOP (batching service dequeues; pushing on the right and popping on the left keeps the queue FIFO)
1 RPUSH per batch (batching → batch queue, amortized: 1/32 per request)
1 BLPOP per batch (GPU claims batch, amortized: 1/32 per request)
1 PUBLISH (GPU → response routing)

At 10,000 RPS: ~30,000 Redis ops/sec (3 per request)

Redis easily handles 100K+ ops/sec on modest hardware ✓
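The estimate follows from three per-request operations (enqueue, dequeue, publish) plus two per-batch operations (push batch, claim batch). A small sketch of that arithmetic, with illustrative function names and the batch size of 32 assumed above:

```python
def redis_ops_per_request(batch_size: int = 32) -> float:
    """Per-request Redis operations, with per-batch operations amortized."""
    per_request_ops = 3   # enqueue request, dequeue request, publish response
    per_batch_ops = 2     # push formed batch, claim batch
    return per_request_ops + per_batch_ops / batch_size


def redis_ops_per_second(rps: float, batch_size: int = 32) -> float:
    return rps * redis_ops_per_request(batch_size)
```

The per-batch operations add only 2/32 of an operation per request, which is why the total stays close to 3 per request. Smaller batches, such as those produced by timeout-triggered flushes at low traffic, raise that amortized share.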

Deep Dive: Failure Handling

GPU crash mid-batch — If a GPU fails while processing, 32 user requests fail simultaneously. This requires explicit handling.

GPU Worker with timeout and retry:

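import asyncio
import json

# Assumptions: `redis` is a redis.asyncio.Redis client, each batch is stored as a
# JSON object with an "inputs" list, and publish_results, publish_errors, and
# report_unhealthy are helpers defined elsewhere.

async def gpu_worker_loop():
    while True:
        # BLPOP returns a (key, value) pair; timeout=0 blocks until a batch arrives
        _, raw_batch = await redis.blpop("batch_queue", timeout=0)
        batch = json.loads(raw_batch)
        try:
            # The timeout stops the wait; it cannot interrupt the worker thread
            result = await asyncio.wait_for(
                asyncio.to_thread(batchstring, batch["inputs"]),
                timeout=0.3  # 3x expected (300ms)
            )
            await publish_results(batch, result)
        except asyncio.TimeoutError:
            # GPU hung - requeue at the head so another GPU picks it up next
            await redis.lpush("batch_queue", raw_batch)
            await report_unhealthy()
            break  # Exit and let orchestrator restart this worker
        except Exception as e:
            # Processing failed - notify all waiters with error
            await publish_errors(batch, str(e))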

Timeout handling at API layer:

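import asyncio

# Assumptions: HTTPException is the web framework's exception type (FastAPI-style),
# and generate_id and enqueue_request are helpers defined elsewhere.
pending_requests = {}  # request_id -> Future resolved by the response router

async def handle_request(input_text: str):
    request_id = generate_id()
    future = asyncio.get_running_loop().create_future()
    pending_requests[request_id] = future

    try:
        await enqueue_request(request_id, input_text)
        # 5 second deadline
        return await asyncio.wait_for(future, timeout=5.0)
    except asyncio.TimeoutError:
        raise HTTPException(504, "Request timeout")
    finally:
        # Always drop the entry, on success as well as on timeout or error
        pending_requests.pop(request_id, None)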

Trade-off Discussion: Latency vs Throughput

Aggressive batching (larger batches, longer timeout):

Pro: Higher GPU utilization, lower cost per request

Con: Higher user-perceived latency

Conservative batching (smaller batches, shorter timeout):

Pro: Lower latency, better user experience

Con: More GPU overhead, higher cost

Adaptive approach:

from dataclasses import dataclass

@dataclass
class BatchParams:
    size: int
    timeout_ms: int

def calculate_batch_params(queue_depth, gpu_utilization):
    if gpu_utilization < 0.5 and queue_depth < 10:
        # Under-utilized: prioritize latency
        return BatchParams(size=16, timeout_ms=30)
    elif queue_depth > 100 or gpu_utilization > 0.85:
        # Overloaded: prioritize throughput
        return BatchParams(size=64, timeout_ms=100)
    else:
        # Normal: balanced
        return BatchParams(size=32, timeout_ms=50)

Auto-Scaling Strategy

GPU cold start time (1-5 minutes) means aggressive scaling-up is necessary. Scale down conservatively to avoid thrashing.
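One way to encode "scale up aggressively, scale down conservatively" is an asymmetric policy: jump straight to the required GPU count on the way up, but remove one GPU at a time and only after a cooldown. The sketch below is illustrative only. The `ScalingPolicy` fields and the 10-minute cooldown are hypothetical values, not figures from this design.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    per_gpu_rps: float = 320.0             # 10 batches/sec x 32 requests
    target_utilization: float = 0.7
    scale_down_cooldown_s: float = 600.0   # hypothetical: 10 minutes


DEFAULT_POLICY = ScalingPolicy()


def desired_gpus(current: int, observed_rps: float,
                 seconds_since_last_change: float,
                 policy: ScalingPolicy = DEFAULT_POLICY) -> int:
    """Return the GPU count to run next, given current load."""
    raw = max(1, math.ceil(observed_rps / policy.per_gpu_rps))
    needed = math.ceil(raw / policy.target_utilization)
    if needed > current:
        return needed          # scale up in one step; cold starts are slow
    if needed < current and seconds_since_last_change >= policy.scale_down_cooldown_s:
        return current - 1     # scale down one GPU at a time, after the cooldown
    return current
```

Scaling up in one step matters because new GPUs take minutes to become useful, while the cooldown on the way down avoids releasing capacity that a short dip in traffic would soon need again.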

Interview Checklist

Requirements Phase:

Clarified latency SLA (P95 target)

Confirmed throughput requirements (RPS)

Asked about user priority tiers

Verified GPU processing constraints

Data Model Phase:

Identified transient vs persistent data

Request tracking structure defined

Batch formation structure defined

API Phase:

Single endpoint with clear contract

Error codes for all failure modes

Timeout behavior specified

High-Level Design Phase:

Complete request flow drawn

Response routing mechanism explained

Batching strategy with timeout

GPU assignment logic

Scaling Phase:

GPU count calculation shown

Connection limits addressed

Failure handling for GPU crashes

Auto-scaling triggers defined

Summary

This design handles 1,000 RPS with an estimated average latency of ~126ms using 6 GPUs, meeting all functional and non-functional requirements while remaining operationally simple.
