Design a Crossword Puzzle Solver

System DesignSystem DesignOnsitePhoneSoftware EngineerReported Apr, 2026Medium Frequency

Design a service that solves crossword puzzles. Given a board with word slots (positions, directions, lengths) and a dictionary of ~1 million words, find any valid assignment of words that satisfies all intersection constraints.

The board is medium-sized (~50x50) with ~100 word slots to fill. This is NOT an algorithm optimization problem—the interviewer wants to see you recognize that single-machine brute force is intractable (search space ~10^300) and design a distributed job scheduling system to parallelize the search.

Common mistake: Don't spend too much time optimizing the algorithm itself. The key insight is proving single-machine won't work, then designing how to distribute the search across workers, handle dead ends quickly, and dynamically split/redistribute tasks when some workers finish early or get stuck.

Real Interview Experiences

Experience #1: Stochastic Optimization Approach

Board size not specified initially; interviewer clarified "medium-sized" (~50x50) with ~100 word slots and 1M word dictionary. Only need to find ANY valid solution.

Proved single-machine brute force is intractable, then proposed a stochastic optimization approach (simulated annealing style). Spent the entire interview explaining the E2E flow—no deep dives. Interviewer said: "First time seeing this approach, need to verify it later."

The 'correct' answer is distributed DFS, but I've never seen anyone answer that. Most people use simulation methods—split tasks and handle dead ends by redistributing.

Experience #2: Algorithm Trap

Treated this as a pure algorithm problem and spent too much time trying to find the "optimal" algorithm. Realized mid-interview it's about distributed systems, not algorithm optimization.

Tried to recover by applying the standard system design template (functional requirements, non-functional requirements, high-level design)—this was also a mistake for this problem type.

This is essentially a job scheduler problem. When the board is large, single-machine brute force won't finish in time. You need to split the problem across workers, handle dead ends quickly, and dynamically redistribute when some tasks are too large or hit dead ends early.

Solution Walkthrough

A crossword puzzle solver is a service that takes a puzzle board with empty word slots and a dictionary, then finds a valid assignment of words that satisfies all constraints. The core challenge is solving a constraint satisfaction problem (CSP) at scale—when brute force on a single machine won't finish in time.

This walkthrough follows the Interview Framework. Use it as a guide, not a script—adapt based on interviewer cues.

Phase 1: Requirements

Functional Requirements

Solve puzzles — Given a board with word slots and a dictionary, find any valid word assignment

Handle constraints — Words must fit their slots and share letters at intersections

Return solution — Output the solved puzzle with words and their positions/directions

Out of Scope

Puzzle generation (creating puzzles from scratch)

Clue matching (we ignore clue semantics; clues may be present but are not used by the solver)

Multiple solutions (finding one valid solution is sufficient)

User interface or interactive editing

Non-Functional Requirements

The interviewer will likely tell you the board is "medium-sized" (around 50×50). This is intentional—it's large enough that single-machine brute force won't work, forcing you to design a distributed solution.

Capacity Estimation

Why single-machine brute force fails:

100 word slots to fill

Average of ~1,000 words in dictionary matching each slot's length

Naive search space: 1000^100 = 10^300 combinations

Even with pruning, this is intractable on one machine

The insight: While the total search space is astronomical, constraint propagation dramatically prunes it. The challenge is exploring the pruned tree fast enough—which requires distributed computation.

Common mistake: Don't spend too much time optimizing the single-machine algorithm. The interviewer wants to see you recognize when to distribute the problem. Prove single-machine won't work, then move on.

Phase 2: Data Model

Core Entities

Puzzle

├── id: UUID

├── width: integer

├── height: integer

├── slots: Slot[]

└── status: "pending" | "solving" | "solved" | "failed"

Slot

├── id: integer

├── start_row: integer

├── start_col: integer

├── direction: "across" | "down"

├── length: integer

├── clue: string | null // Optional metadata, ignored by solver

└── assigned_word: string | null

Dictionary

├── words: string[]

└── by_length: Map<integer, string[]> // Pre-indexed by word length

SearchState

├── id: UUID

├── puzzle_id: UUID

├── assignments: Map<slot_id, word> // Current partial solution

├── remaining_slots: slot_id[] // Unassigned slots

├── depth: integer // How deep in the search tree

└── parent_state_id: UUID | null // For backtracking

Task

├── id: UUID

├── puzzle_id: UUID

├── state_id: UUID

├── priority: integer

├── status: "pending" | "running" | "completed" | "dead_end"

└── assigned_worker: string | null

Key Design Decisions

Why track SearchState separately?

Enables work distribution—each state is an independent subtree to explore

Supports resumption after worker failures

Allows dynamic task splitting

Why index dictionary by length?

Slots have fixed lengths; only words of matching length are candidates

Reduces candidate set from 1M to ~1-10K per slot

Phase 3: API Design

This is a batch processing system, not a real-time API. We need endpoints for job submission and result retrieval.

REST Endpoints

POST /puzzles

Request: {

"width": 50,

"height": 50,

"slots": [

{ "start_row": 0, "start_col": 0, "direction": "across", "length": 5, "clue": "A greeting" }, // Optional, ignored

...

"dictionary_id": "en-1m" // Reference stored dictionary; inline list only for toy puzzles

}

Response: { "puzzle_id": "abc123", "status": "pending" }

GET /puzzles/{puzzle_id}

Response: {

"puzzle_id": "abc123",

"status": "solving",

"progress": { "explored_states": 150000, "active_workers": 8 }

}

GET /puzzles/{puzzle_id}/solution

Response: {

"puzzle_id": "abc123",

"status": "solved",

"solution": [

{ "slot_id": 1, "word": "HELLO", "direction": "across", "position": "1-Across" },

...

]

}

Phase 4: High-Level Design

Architecture Overview

Storage Layer

Worker Pool

Task Queue

Coordination Layer

API Layer

Client

Submit puzzle

User

API Gateway

Coordinator Service

ZooKeeper

Message Queue Redis

Worker 1

Worker 2

Worker N

State Store PostgreSQL

Dictionary In-Memory

Component Walkthrough

1. Submitting a Puzzle

Task QueueState StoreCoordinatorAPI GatewayUserTask QueueState StoreCoordinatorAPI GatewayUserPOST /puzzlesSubmit puzzleStore puzzle + initial stateEnqueue initial task (full puzzle)puzzle_id{ puzzle_id, status: "pending" }

2. Worker Processing Loop

Each worker runs a continuous loop:

while true:

task = queue.dequeue()

state = load_state(task.state_id)

if is_solved(state):

mark_puzzle_solved(state)

return

if is_dead_end(state):

mark_task_dead_end(task)

continue

Pick next slot to fill (use heuristics)

slot = pick_most_constrained_slot(state)

candidates = get_valid_words(slot, state)

if len(candidates) == 0:

mark_task_dead_end(task)

continue

if len(candidates) > SPLIT_THRESHOLD:

Too many candidates - split into subtasks

for word in candidates:

new_state = apply_assignment(state, slot, word)

save_state(new_state)

queue.enqueue(new_task(new_state))

else:

Few candidates - explore sequentially

for word in candidates:

new_state = apply_assignment(state, slot, word)

if explore_dfs(new_state):

return # Found solution

The key insight: This is distributed DFS with dynamic task splitting and a shared queue. When a worker finds a state with many candidates, it splits the work. When candidates are few, it explores sequentially to avoid queue overhead.

3. Constraint Propagation

When assigning a word to a slot, propagate constraints to intersecting slots:

After: Slot 1 = HELLO

Slot 1: HELLO ✓

Slot 2 (intersects at 'L') Candidates: ELITE ✓ EMBER ✗, ENTER ✗

Before Assignment

Slot 1 (Across) Candidates: HELLO, HELPS, HELIX

Slot 2 (Down) Candidates: ELITE, EMBER, ENTER

Constraint propagation is crucial—it eliminates invalid candidates early, dramatically pruning the search space.

The Core Algorithm: Distributed DFS

Why DFS over BFS?

For crossword solving:

We only need any valid solution, not the optimal one

DFS memory is bounded (just the current path)

DFS can find solutions faster by going deep quickly

Distributed DFS Strategy

Parallel Exploration

First Split (3 candidates)

Initial State

Dead end

Solution found!

Root: Empty Puzzle

Task 1: Slot1=HELLO

Task 2: Slot1=HELPS

Task 3: Slot1=HELIX

Worker 1 Explores T1 subtree

Worker 2 Explores T2 subtree

Worker 3 Explores T3 subtree

Backtrack

Notify all workers

Key mechanisms:

Task splitting — When a state has many candidates, split into subtasks

Queue-based work distribution — Idle workers pull tasks from the shared queue

Early termination — When any worker finds a solution, broadcast cancellation

Dead-end detection — When no valid candidates exist, backtrack immediately

Phase 5: Scaling & Deep Dives

Deep Dive 1: Variable Ordering Heuristics

Problem: Which slot should we fill next? Random ordering leads to wasted exploration.

Solution: Most Constrained Variable (MCV) heuristic—fill the slot with fewest valid candidates first.

function pick_most_constrained_slot(state):

min_candidates = infinity

best_slot = null

for slot in state.remaining_slots:

candidates = count_valid_words(slot, state)

if candidates < min_candidates:

min_candidates = candidates

best_slot = slot

return best_slot

Why MCV works:

Slots with few candidates are "bottlenecks"

Filling them first detects dead ends earlier

Dramatically reduces average tree depth explored

Interview gold: Mention that this is a classic CSP heuristic (also called Minimum Remaining Values or MRV). If the interviewer asks for alternatives, discuss the degree heuristic (choose the slot that constrains the most other slots) or least constraining value (for value ordering).

Deep Dive 2: Dynamic Task Splitting

Problem: Some tasks explore huge subtrees; others hit dead ends instantly. Fixed task sizes cause load imbalance.

Solution: Dynamic splitting based on estimated subtree size.

SPLIT_THRESHOLD = 10 # Tune based on queue latency vs. parallelism

function should_split(candidates, depth):

Split if many candidates AND not too deep

(Deep splits create too many small tasks)

return len(candidates) > SPLIT_THRESHOLD and depth < MAX_SPLIT_DEPTH

Adaptive splitting:

Track average task completion time

If workers are idle → split more aggressively

If queue is backed up → split less, do more local work

Coordinator Monitors

Adjust Strategy

High idle % → Lower SPLIT_THRESHOLD

Queue backup → Raise SPLIT_THRESHOLD

Queue Depth

Worker Idle %

Deep Dive 3: Handling Dead Ends Efficiently

Problem: Many paths lead to dead ends. How do we detect and abandon them quickly?

Solution: Forward checking before exploring.

Before assigning a word, check if the assignment leaves any valid word for intersecting slots. This is a lightweight consistency check (not full AC-3):

function is_consistent(state, slot, word):

Apply tentative assignment

new_state = apply_assignment(state, slot, word)

Check all intersecting slots still have candidates

for intersecting_slot in get_intersections(slot):

if count_valid_words(intersecting_slot, new_state) == 0:

return false # Dead end guaranteed

return true

This prunes obviously-bad branches without full exploration.

Deep Dive 4: Stochastic/Heuristic Approaches

Alternative to exhaustive DFS: Use stochastic optimization as a time-bounded pre-pass when deterministic search is too slow. If it fails, fall back to DFS to preserve completeness.

One interviewee mentioned using "stochastic optimization" and the interviewer said it was a valid approach. This is worth knowing as an alternative.

Simulated Annealing approach:

function solve_stochastic(puzzle):

Start with random (possibly invalid) assignment

state = random_assignment(puzzle)

temperature = INITIAL_TEMP

while temperature > MIN_TEMP:

Pick a random slot, try a different word

slot = random_slot(state)

new_word = random_valid_word(slot)

Count constraint violations

old_violations = count_violations(state)

new_state = swap_word(state, slot, new_word)

new_violations = count_violations(new_state)

Accept if better, or probabilistically if worse

if new_violations < old_violations:

state = new_state

elif random() < exp((old_violations - new_violations) / temperature):

state = new_state

temperature *= COOLING_RATE

return state if count_violations(state) == 0 else null

Trade-offs:

Faster for some puzzles

No guarantee of finding solution even if one exists

Can be parallelized trivially (run multiple independent searches)

Deep Dive 5: Early Termination and Cancellation

Problem: When one worker finds a solution, how do we stop all others?

Solution: Broadcast cancellation with generation numbers.

Coordinator:

on solution_found(puzzle_id, solution):

store_solution(puzzle_id, solution)

increment_generation(puzzle_id)

broadcast("cancel", puzzle_id, generation)

Worker:

before processing task:

if task.generation < current_generation(puzzle_id):

skip task # Stale work

periodically:

check for cancellation messages

if cancelled:

abandon current work

Worker 3Worker 2CoordinatorWorker 1Worker 3Worker 2CoordinatorWorker 1Solution found!Store solutionCancel (generation=2)Cancel (generation=2)Check generation, abandonCheck generation, abandon

Deep Dive 6: Fault Tolerance

Worker failure:

Coordinator tracks task assignments with heartbeats

If worker dies, re-enqueue its active task

State is stored in database, not just worker memory

Coordinator failure:

Use ZooKeeper for leader election

New coordinator reconstructs state from database

Workers reconnect automatically

Deep Dive 7: Detecting "No Solution Exists"

Problem: What if the puzzle has no valid solution? How do we know when to stop?

Solution: Track task completion—when all tasks reach dead ends, no solution exists.

Coordinator tracks:

├── total_tasks_created: counter

├── tasks_completed: counter (dead ends + solution found)

└── active_tasks: set

When tasks_completed == total_tasks_created AND active_tasks is empty:

if no solution found:

mark puzzle as "unsolvable"

Key insight: Distributed DFS is complete—it explores the entire search space. If every branch hits a dead end and no tasks remain, we've proven no solution exists.

This is an important edge case to mention in interviews. Most candidates focus on the happy path (solution found) and forget about completeness guarantees.

Interview Checklist

Before finishing, verify you've covered:

Proved single-machine brute force won't work (10^300 search space)

Explained distributed DFS as the core approach

Described constraint propagation for pruning

Discussed variable ordering heuristics (MCV)

Covered dynamic task splitting and load balancing

Explained dead-end detection and backtracking

Addressed early termination when solution found

Mentioned fault tolerance for workers

Discussed how to detect "no solution exists"

Summary

Key takeaway: This problem is about recognizing when a brute-force approach needs to be distributed, and designing the job scheduling infrastructure to parallelize a fundamentally sequential algorithm (DFS) while handling load imbalance and failures.

WhiteboardAuto-save enabled

Loading whiteboard…