Back

Design Chess.com (Online Chess Game)

System DesignSystem DesignOnsitePhoneSoftware EngineerReported Apr, 2026Medium Frequency

Design an online chess platform similar to Chess.com. Users should quickly find opponents, play in real time, and finish games with correct chess clock behavior (each player has limited time, with optional increment). This problem primarily tests whether you can design matchmaking, authoritative move processing, and server-side countdown logic under low-latency constraints.

This walkthrough follows the Interview Framework. Use it as a guide, not a script.

Phase 1: Requirements

Functional Requirements

Users should be able to join matchmaking with preferences (time control, rated/casual)

Users should be able to get matched with another player and start a game

Users should be able to submit moves and see opponent moves in near real time

Users should be able to play with a chess clock (including timeout and optional increment)

Users should be able to handle game lifecycle actions (resign, draw offer/accept, reconnect)

Assume standard head-to-head chess only (no tournaments, spectators, bots, or variants like Chess960) unless the interviewer asks to extend.

Non-Functional Requirements

Clock state must be authoritative on the server. Client-side countdown is display only.

Capacity Estimation

Assumptions:

2M daily active players

8% peak concurrency = 160K concurrent players

2 players per game => 80K concurrent games

Average one move every 8 seconds per game

Write/read throughput:

Move writes: 80K / 8 = 10K moves/sec

Per move fan-out to both players: ~20K real-time outbound events/sec

At ~300 bytes/event, outbound is ~6 MB/sec (excluding protocol overhead)

Matchmaking queue:

If 10% of active players are waiting at peak: 16K queue entries

Queue ops must support rapid insert/remove and efficient nearest-rating lookup

In interviews, state that move traffic is modest; strict latency and correctness are harder than raw throughput.

Phase 2: Data Model

Core Entities

Player

├── id: UUID

├── username: string

├── rating_blitz: int

├── rating_rapid: int

└── created_at: timestamp

QueueEntry

├── id: UUID

├── player_id: UUID (FK)

├── mode: enum (rated, casual)

├── time_control: string (e.g., "5+0", "10+5")

├── rating: int

├── region: string

├── joined_at: timestamp

└── status: enum (waiting, matched, cancelled, expired)

Game

├── id: UUID

├── white_player_id: UUID (FK)

├── black_player_id: UUID (FK)

├── mode: enum (rated, casual)

├── time_control_base_ms: bigint

├── increment_ms: int

├── status: enum (active, white_won, black_won, draw, aborted)

├── result_reason: enum (checkmate, resignation, timeout, draw, disconnect_forfeit)

├── current_fen: string

├── move_count: int

├── turn: enum (white, black)

├── started_at: timestamp

└── ended_at: timestamp

MoveEvent

├── id: UUID

├── game_id: UUID (FK)

├── move_number: int

├── player_id: UUID

├── uci: string (e2e4)

├── san: string (optional)

├── fen_after: string

├── remaining_white_ms: bigint

├── remaining_black_ms: bigint

├── server_received_at: timestamp

└── is_legal: boolean

ClockState

├── game_id: UUID (PK/FK)

├── white_remaining_ms: bigint

├── black_remaining_ms: bigint

├── active_side: enum (white, black)

├── turn_started_server_ms: bigint (authoritative server timestamp)

└── version: bigint

Entity Relationships

Player 1:N QueueEntry

Player 1:N Game (as white/black)

Game 1:N MoveEvent

Game 1:1 ClockState

Store both event history (MoveEvent) and latest snapshot (Game.current_fen, ClockState) for fast reads plus replayability.

Phase 3: API Design

Protocol Choices

REST Endpoints

Matchmaking

POST /api/chess/queue Join queue

DELETE /api/chess/queue/{entry_id} Leave queue

GET /api/chess/queue/status Check queue wait status

Game lifecycle

GET /api/chess/games/{game_id} Get current game state

POST /api/chess/games/{game_id}/move Submit move

POST /api/chess/games/{game_id}/resign Resign game

POST /api/chess/games/{game_id}/draw Offer/accept/decline draw

POST /api/chess/games/{game_id}/abort Abort (early game only)

GET /api/chess/games/{game_id}/moves Paginated move history

Submit move request:

{

"move_number": 17,

"uci": "e2e4",

"client_sent_at_ms": 1730000000000,

"idempotency_key": "4f8d7a3f"

}

Submit move response (200):

{

"accepted": true,

"game_id": "game-123",

"move_number": 17,

"fen_after": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1",

"turn": "black",

"remaining_white_ms": 178450,

"remaining_black_ms": 180000,

"game_status": "active"

}

WebSocket Events

WSS /ws/chess?token=<signed_jwt>

{ "type": "game.matched", "game_id": "game-123", "color": "white", "opponent": { "id": "u2", "rating": 1830 } }

{ "type": "game.move", "game_id": "game-123", "move_number": 17, "uci": "e2e4", "fen_after": "...", "turn": "black" }

{ "type": "game.clock_sync", "game_id": "game-123", "white_remaining_ms": 178450, "black_remaining_ms": 180000, "server_now_ms": 1730000001234 }

{ "type": "game.ended", "game_id": "game-123", "result": "white_won", "reason": "timeout" }

Every move should include a monotonic move_number or version; reject stale/out-of-order commands to prevent double-apply.

Phase 4: High-Level Design

Architecture Overview

State / Data / Async (Below App)

App

Edge

Client

Async

Data

Cache

HTTPS/WSS

HTTPS/WSS

Player Client A

Player Client B

Load Balancer

API Service

WebSocket Gateway

Matchmaking Service

Game Service

Chess Rules Engine

Redis Queue + Hot Game State

PostgreSQL Game + Move History

Kafka / Event Bus

Anti-cheat / Analytics Workers

Component Responsibilities

Matchmaking Service

Maintains waiting pools by time_control + mode + region

Uses rating-based candidate search with widening tolerance over wait time

Creates game assignment atomically and removes both players from queue

Game Service

Authoritative command handler for move, resign, and draw actions

Validates move legality via chess rules engine

Computes clock deltas using authoritative server time

Schedules and revalidates timeout deadlines for the active side

Persists move event and updates snapshot atomically

WebSocket Gateway

Maintains per-player real-time session

Pushes match/game/clock updates

Supports reconnect and state resync

Data Flow: Matchmaking

WebSocketRedisMatchmaking ServiceAPI ServiceUserWebSocketRedisMatchmaking ServiceAPI ServiceUseralt[candidate found]loop[every 100ms]POST /api/chess/queue (5+0, rated, 1800)enqueue(player)ZADD queue:5+0:rated rating=1800find nearest rating candidateatomic remove both (Lua/transaction)create game_id + colorsemit game.matched to both users

Data Flow: Move + Clock Update

Player BPostgreSQLRedisRules EngineGame ServicePlayer APlayer BPostgreSQLRedisRules EngineGame ServicePlayer Amove(game_id, move_number=17, uci=e2e4)load GameSnapshot + ClockStatevalidate turn/versionvalidate legal move on current FENlegal + next FENdeduct elapsed time for active sideapply increment and switch turnappend MoveEvent (transaction)update snapshot + clockrefresh timeout trigger for next sideack move acceptedpush game.move + game.clock_sync

Deep Dive: Chess Clock Correctness

Use an event-based clock model, not per-second DB writes.

At move commit time:

elapsed = now_server_ms - turn_started_server_ms

remaining_active = remaining_active - elapsed

if remaining_active <= 0 => timeout loss

remaining_active += increment_ms # for Fischer increment controls

active_side = opposite(active_side)

turn_started_server_ms = now_server_ms

Why this works:

Exact countdown is derived from (remaining_ms, turn_started_server_ms, active_side)

No ticking writes every second

Server remains source of truth; clients render smooth local countdown between syncs

Persisted timestamps stay meaningful during failover or game ownership handoff

Timeout can be enforced even when no new move arrives

function applyMoveAndClock(state: ClockState, nowMs: number, incrementMs: number): ClockState {

const elapsed = nowMs - state.turn_started_server_ms;

if (state.active_side === 'white') {

state.white_remaining_ms -= elapsed;

if (state.white_remaining_ms <= 0) throw new Error('white_timeout');

state.white_remaining_ms += incrementMs;

state.active_side = 'black';

} else {

state.black_remaining_ms -= elapsed;

if (state.black_remaining_ms <= 0) throw new Error('black_timeout');

state.black_remaining_ms += incrementMs;

state.active_side = 'white';

}

state.turn_started_server_ms = nowMs;

state.version += 1;

return state;

}

Timeout enforcement pattern:

After each committed move, enqueue a delayed timeout check for now + remaining_active_ms of the next side.

When the delayed job fires, reload ClockState and verify version/active_side before ending game.

If version changed, discard the stale timeout job.

Do not trust client timestamps to deduct clock time. Network jitter and clock spoofing will create unfair losses.

Reconnection Strategy

Keep latest snapshot in Redis: FEN, move number, clock state, game status

Persist full move history in PostgreSQL for durable replay

On reconnect, client requests GET /api/chess/games/{id} and resumes via snapshot + delta events

Clocks continue while disconnected. If a player disconnects and never returns, timeout remains a valid game result.

Phase 5: Scaling & Trade-offs

Addressing Non-Functional Requirements

Low match latency

Separate pools by time control and mode to avoid incompatible matches

Start with narrow rating band (for example +/-50), widen over wait time

Use in-memory/Redis structures for fast nearest-neighbor retrieval

Low move latency

Keep active game state in Redis close to game servers

Use sticky routing by game_id to reduce cross-node coordination

Push over WebSocket directly; avoid polling for active matches

High correctness

Single authoritative game processor per game_id partition

Strict optimistic concurrency with version/move_number

Idempotency keys to dedupe retries

Bottlenecks and Mitigations

1. Hot partitions for popular time controls

Problem: Blitz pools (for example 3+0, 5+0) dominate traffic.

Mitigation:

Shard queues by region and rating bucket

Run multiple match workers with distributed lock ownership

2. Redis memory pressure from many active games

Problem: 80K active games with snapshots + session metadata can grow quickly.

Mitigation:

Keep only hot fields in Redis

Periodically compact/evict inactive games to DB-backed reads

3. Duplicate move submissions on retry

Problem: Mobile networks cause client retries for the same move.

Mitigation:

Require idempotency_key and (game_id, player_id, move_number) uniqueness

Return previous success for duplicate keys

Trade-off: Match Quality vs Wait Time

Interview recommendation:

Start with strict window

Widen every few seconds

Cap max gap by mode (rated tighter than casual)

Trade-off: State Model

For interview, choose hybrid:

Redis for active state

PostgreSQL for durable move events

Rebuild snapshot from events if cache is lost

Common Pitfalls

Applying moves on the client first and treating server as eventual consistency. In chess, this causes illegal or divergent board states.

Implementing chess clocks with periodic client pings. This is vulnerable to latency spikes and manipulation.

Using one global queue without mode/time-control partitioning. It creates poor matches and complex filtering overhead.

Interview Checklist

Requirements Phase

Clarified rated vs casual scope and time control variants

Set concrete latency and clock correctness targets

Defined reconnect behavior and game-ending rules

Design Phase

Explained queue partitioning + rating widening strategy

Designed authoritative move validation pipeline

Explained event-based server clock math

Covered reconnect from snapshot + move history

Scaling Phase

Addressed queue hotspots and shard strategy

Mentioned idempotency/version checks for duplicate moves

Discussed cache + durable store hybrid trade-off

Summary

The most important insight is that this system is less about raw scale and more about fairness under latency. If matchmaking, move order, and countdown timing are not authoritative and deterministic, users quickly lose trust in the platform.


WhiteboardAuto-save enabled
Loading whiteboard…