Design an online chess platform similar to Chess.com. Users should quickly find opponents, play in real time, and finish games with correct chess clock behavior (each player has limited time, with optional increment). This problem primarily tests whether you can design matchmaking, authoritative move processing, and server-side countdown logic under low-latency constraints.
This walkthrough follows the Interview Framework. Use it as a guide, not a script.
Users should be able to join matchmaking with preferences (time control, rated/casual)
Users should be able to get matched with another player and start a game
Users should be able to submit moves and see opponent moves in near real time
Users should be able to play with a chess clock (including timeout and optional increment)
Users should be able to handle game lifecycle actions (resign, draw offer/accept, reconnect)
Assume standard head-to-head chess only (no tournaments, spectators, bots, or variants like Chess960) unless the interviewer asks to extend.
Clock state must be authoritative on the server. Client-side countdown is display only.
Assumptions:
2M daily active players
8% peak concurrency = 160K concurrent players
2 players per game => 80K concurrent games
Average one move every 8 seconds per game
Write/read throughput:
Move writes: 80K / 8 = 10K moves/sec
Per move fan-out to both players: ~20K real-time outbound events/sec
At ~300 bytes/event, outbound is ~6 MB/sec (excluding protocol overhead)
Matchmaking queue:
If 10% of active players are waiting at peak: 16K queue entries
Queue ops must support rapid insert/remove and efficient nearest-rating lookup
In interviews, state that move traffic is modest; strict latency and correctness are harder than raw throughput.
Player
├── id: UUID
├── username: string
├── rating_blitz: int
├── rating_rapid: int
└── created_at: timestamp
QueueEntry
├── id: UUID
├── player_id: UUID (FK)
├── mode: enum (rated, casual)
├── time_control: string (e.g., "5+0", "10+5")
├── rating: int
├── region: string
├── joined_at: timestamp
└── status: enum (waiting, matched, cancelled, expired)
Game
├── id: UUID
├── white_player_id: UUID (FK)
├── black_player_id: UUID (FK)
├── mode: enum (rated, casual)
├── time_control_base_ms: bigint
├── increment_ms: int
├── status: enum (active, white_won, black_won, draw, aborted)
├── result_reason: enum (checkmate, resignation, timeout, draw, disconnect_forfeit)
├── current_fen: string
├── move_count: int
├── turn: enum (white, black)
├── started_at: timestamp
└── ended_at: timestamp
MoveEvent
├── id: UUID
├── game_id: UUID (FK)
├── move_number: int
├── player_id: UUID
├── uci: string (e2e4)
├── san: string (optional)
├── fen_after: string
├── remaining_white_ms: bigint
├── remaining_black_ms: bigint
├── server_received_at: timestamp
└── is_legal: boolean
ClockState
├── game_id: UUID (PK/FK)
├── white_remaining_ms: bigint
├── black_remaining_ms: bigint
├── active_side: enum (white, black)
├── turn_started_server_ms: bigint (authoritative server timestamp)
└── version: bigint
Player 1:N QueueEntry
Player 1:N Game (as white/black)
Game 1:N MoveEvent
Game 1:1 ClockState
Store both event history (MoveEvent) and latest snapshot (Game.current_fen, ClockState) for fast reads plus replayability.
POST /api/chess/queue Join queue
DELETE /api/chess/queue/{entry_id} Leave queue
GET /api/chess/queue/status Check queue wait status
GET /api/chess/games/{game_id} Get current game state
POST /api/chess/games/{game_id}/move Submit move
POST /api/chess/games/{game_id}/resign Resign game
POST /api/chess/games/{game_id}/draw Offer/accept/decline draw
POST /api/chess/games/{game_id}/abort Abort (early game only)
GET /api/chess/games/{game_id}/moves Paginated move history
Submit move request:
{
"move_number": 17,
"uci": "e2e4",
"client_sent_at_ms": 1730000000000,
"idempotency_key": "4f8d7a3f"
}
Submit move response (200):
{
"accepted": true,
"game_id": "game-123",
"move_number": 17,
"fen_after": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1",
"turn": "black",
"remaining_white_ms": 178450,
"remaining_black_ms": 180000,
"game_status": "active"
}
WSS /ws/chess?token=<signed_jwt>
{ "type": "game.matched", "game_id": "game-123", "color": "white", "opponent": { "id": "u2", "rating": 1830 } }
{ "type": "game.move", "game_id": "game-123", "move_number": 17, "uci": "e2e4", "fen_after": "...", "turn": "black" }
{ "type": "game.clock_sync", "game_id": "game-123", "white_remaining_ms": 178450, "black_remaining_ms": 180000, "server_now_ms": 1730000001234 }
{ "type": "game.ended", "game_id": "game-123", "result": "white_won", "reason": "timeout" }
Every move should include a monotonic move_number or version; reject stale/out-of-order commands to prevent double-apply.
State / Data / Async (Below App)
App
Edge
Client
Async
Data
Cache
HTTPS/WSS
HTTPS/WSS
Player Client A
Player Client B
Load Balancer
API Service
WebSocket Gateway
Matchmaking Service
Game Service
Chess Rules Engine
Redis Queue + Hot Game State
PostgreSQL Game + Move History
Kafka / Event Bus
Anti-cheat / Analytics Workers
Matchmaking Service
Maintains waiting pools by time_control + mode + region
Uses rating-based candidate search with widening tolerance over wait time
Creates game assignment atomically and removes both players from queue
Game Service
Authoritative command handler for move, resign, and draw actions
Validates move legality via chess rules engine
Computes clock deltas using authoritative server time
Schedules and revalidates timeout deadlines for the active side
Persists move event and updates snapshot atomically
WebSocket Gateway
Maintains per-player real-time session
Pushes match/game/clock updates
Supports reconnect and state resync
WebSocketRedisMatchmaking ServiceAPI ServiceUserWebSocketRedisMatchmaking ServiceAPI ServiceUseralt[candidate found]loop[every 100ms]POST /api/chess/queue (5+0, rated, 1800)enqueue(player)ZADD queue:5+0:rated rating=1800find nearest rating candidateatomic remove both (Lua/transaction)create game_id + colorsemit game.matched to both users
Player BPostgreSQLRedisRules EngineGame ServicePlayer APlayer BPostgreSQLRedisRules EngineGame ServicePlayer Amove(game_id, move_number=17, uci=e2e4)load GameSnapshot + ClockStatevalidate turn/versionvalidate legal move on current FENlegal + next FENdeduct elapsed time for active sideapply increment and switch turnappend MoveEvent (transaction)update snapshot + clockrefresh timeout trigger for next sideack move acceptedpush game.move + game.clock_sync
Use an event-based clock model, not per-second DB writes.
At move commit time:
elapsed = now_server_ms - turn_started_server_ms
remaining_active = remaining_active - elapsed
if remaining_active <= 0 => timeout loss
remaining_active += increment_ms # for Fischer increment controls
active_side = opposite(active_side)
turn_started_server_ms = now_server_ms
Why this works:
Exact countdown is derived from (remaining_ms, turn_started_server_ms, active_side)
No ticking writes every second
Server remains source of truth; clients render smooth local countdown between syncs
Persisted timestamps stay meaningful during failover or game ownership handoff
Timeout can be enforced even when no new move arrives
function applyMoveAndClock(state: ClockState, nowMs: number, incrementMs: number): ClockState {
const elapsed = nowMs - state.turn_started_server_ms;
if (state.active_side === 'white') {
state.white_remaining_ms -= elapsed;
if (state.white_remaining_ms <= 0) throw new Error('white_timeout');
state.white_remaining_ms += incrementMs;
state.active_side = 'black';
} else {
state.black_remaining_ms -= elapsed;
if (state.black_remaining_ms <= 0) throw new Error('black_timeout');
state.black_remaining_ms += incrementMs;
state.active_side = 'white';
}
state.turn_started_server_ms = nowMs;
state.version += 1;
return state;
}
Timeout enforcement pattern:
After each committed move, enqueue a delayed timeout check for now + remaining_active_ms of the next side.
When the delayed job fires, reload ClockState and verify version/active_side before ending game.
If version changed, discard the stale timeout job.
Do not trust client timestamps to deduct clock time. Network jitter and clock spoofing will create unfair losses.
Keep latest snapshot in Redis: FEN, move number, clock state, game status
Persist full move history in PostgreSQL for durable replay
On reconnect, client requests GET /api/chess/games/{id} and resumes via snapshot + delta events
Clocks continue while disconnected. If a player disconnects and never returns, timeout remains a valid game result.
Low match latency
Separate pools by time control and mode to avoid incompatible matches
Start with narrow rating band (for example +/-50), widen over wait time
Use in-memory/Redis structures for fast nearest-neighbor retrieval
Low move latency
Keep active game state in Redis close to game servers
Use sticky routing by game_id to reduce cross-node coordination
Push over WebSocket directly; avoid polling for active matches
High correctness
Single authoritative game processor per game_id partition
Strict optimistic concurrency with version/move_number
Idempotency keys to dedupe retries
1. Hot partitions for popular time controls
Problem: Blitz pools (for example 3+0, 5+0) dominate traffic.
Mitigation:
Shard queues by region and rating bucket
Run multiple match workers with distributed lock ownership
2. Redis memory pressure from many active games
Problem: 80K active games with snapshots + session metadata can grow quickly.
Mitigation:
Keep only hot fields in Redis
Periodically compact/evict inactive games to DB-backed reads
3. Duplicate move submissions on retry
Problem: Mobile networks cause client retries for the same move.
Mitigation:
Require idempotency_key and (game_id, player_id, move_number) uniqueness
Return previous success for duplicate keys
Interview recommendation:
Start with strict window
Widen every few seconds
Cap max gap by mode (rated tighter than casual)
For interview, choose hybrid:
Redis for active state
PostgreSQL for durable move events
Rebuild snapshot from events if cache is lost
Applying moves on the client first and treating server as eventual consistency. In chess, this causes illegal or divergent board states.
Implementing chess clocks with periodic client pings. This is vulnerable to latency spikes and manipulation.
Using one global queue without mode/time-control partitioning. It creates poor matches and complex filtering overhead.
Clarified rated vs casual scope and time control variants
Set concrete latency and clock correctness targets
Defined reconnect behavior and game-ending rules
Explained queue partitioning + rating widening strategy
Designed authoritative move validation pipeline
Explained event-based server clock math
Covered reconnect from snapshot + move history
Addressed queue hotspots and shard strategy
Mentioned idempotency/version checks for duplicate moves
Discussed cache + durable store hybrid trade-off
The most important insight is that this system is less about raw scale and more about fairness under latency. If matchmaking, move order, and countdown timing are not authoritative and deterministic, users quickly lose trust in the platform.