Design a chat system that supports only 1-to-1 messaging between users. Group chats, channels, and other multi-party features are explicitly out of scope.
This walkthrough follows the Interview Framework and focuses on what you'd actually present in a 45-60 minute interview.
Anthropic interviewers are known to dig deep into implementation details. Don't hand-wave any component—be prepared to explain the "how" behind every "what." If you mention WebSockets, be ready to discuss connection management. If you mention message ordering, be ready to discuss clock synchronization.
Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.
A 1-to-1 chat system has fewer features than WhatsApp or Slack, but clarifying scope quickly prevents over-engineering.
Frame these as user capabilities:
Send messages — Users can send text messages to another user
Receive messages — Users receive messages in near real-time if online
Conversation history — Users can scroll back through past messages
Read receipts — Sender sees when recipient has read the message
Offline delivery — Messages sent to offline users are delivered when they come online
Keep it to 4-5 core features. Presence is a common nice-to-have if you have time. Since this is 1-to-1 only, you don't need group management, channels, or complex permission systems. Acknowledge these are out of scope if asked.
Ask clarifying questions:
"Should we prioritize availability or consistency?" — Messages should arrive in order even if it takes slightly longer.
"How long do we store messages?" — Permanent storage for this design (can adjust if privacy is a concern).
"Do we need multi-device support?" — Yes, users may have phone + desktop.
Do a quick back-of-envelope calculation:
Users:
DAU: 100 million
Average messages sent per user per day: 10
Total messages per day: 1 billion
Traffic:
Messages per second (average): 1B / 86,400 ≈ 11,500 QPS
Peak (3x average): ~35,000 QPS
Connections:
Concurrent WebSocket connections: ~10 million (roughly 10% of DAU online at once)
Storage:
Average message size: 200 bytes (text + metadata)
Daily storage: 1B × 200 bytes = 200 GB/day
Yearly storage: ~73 TB/year (before replication)
At 35K QPS and 10M concurrent connections, we clearly need a distributed system with horizontal scaling, efficient connection management, and smart message routing. However, this is simpler than WhatsApp (no group fan-out).
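The arithmetic above can be reproduced in a few lines. This is only a sanity-check sketch: every input is taken from the estimate itself, and the variable names are illustrative.

```python
# Back-of-envelope inputs, all taken from the estimate above.
DAU = 100_000_000
MESSAGES_PER_USER_PER_DAY = 10
AVG_MESSAGE_BYTES = 200
PEAK_FACTOR = 3
SECONDS_PER_DAY = 86_400

messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY
avg_qps = messages_per_day / SECONDS_PER_DAY
peak_qps = avg_qps * PEAK_FACTOR
daily_storage_gb = messages_per_day * AVG_MESSAGE_BYTES / 1e9
yearly_storage_tb = daily_storage_gb * 365 / 1e3

print(f"messages/day:   {messages_per_day:,}")
print(f"average QPS:    {avg_qps:,.0f}")
print(f"peak QPS:       {peak_qps:,.0f}")
print(f"storage/day:    {daily_storage_gb:,.0f} GB")
print(f"storage/year:   {yearly_storage_tb:,.0f} TB (before replication)")
```

The rounded figures in the text (11,500 and 35,000 QPS) are these values rounded for easy mental math.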
Identify key entities before jumping into APIs. This establishes shared vocabulary with your interviewer.
User
├── user_id (UUID)
├── username
├── last_seen_at
└── created_at
Device (for multi-device support)
├── device_id (UUID)
├── user_id
├── device_type (phone, desktop, web)
├── push_token (for push notifications)
└── last_active_at
Conversation
├── conversation_id (UUID)
├── participant_1 (user_id)
├── participant_2 (user_id)
├── created_at
└── updated_at (last message time)
-- Constraint: participant_1 < participant_2 (canonical ordering)
-- Unique index on (participant_1, participant_2)
Message
├── message_id (UUID)
├── conversation_id
├── sender_id
├── client_message_id (string, per sender per conversation)
├── content (text)
├── sequence_number (per-conversation)
└── created_at
MessageStatus (per device)
├── message_id
├── device_id
├── status (sent, delivered)
└── updated_at
-- Primary key: (message_id, device_id)
Canonical ordering for 1-to-1 conversations: Always store the smaller user_id as participant_1. This ensures you can find "the conversation between User A and User B" with a single lookup regardless of who initiated it.
def get_or_create_conversation(user_a, user_b):
    # Canonical ordering: the smaller user_id is always participant_1
    p1, p2 = min(user_a, user_b), max(user_a, user_b)
    # db.upsert is a placeholder for "insert, or return the existing row for this pair"
    return db.upsert(participant_1=p1, participant_2=p2)
1. Sequence Numbers for Ordering
Using timestamps alone for message ordering is problematic due to clock skew. Use a server-assigned, per-conversation sequence number instead, stored in the sequence_number field of Message.
2. Message Status Tracking
Track two states with different mechanisms:
Delivered: Tracked per-device in MessageStatus—each device ACKs independently
Read: Use a high-water mark—store last_read_sequence_number per user per conversation. All messages with sequence_number <= that value are implicitly read
This keeps delivery precise while read receipts remain scalable.
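To make the two mechanisms concrete, here is a minimal in-memory sketch of deriving the status a sender would see. `ConversationState`, `ack_delivery`, and `status_of` are hypothetical names. It keys delivery ACKs by sequence number rather than message_id for brevity, and it treats a message as delivered once any one device has ACKed it, which is an assumption rather than something the design above specifies.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationState:
    # device_id -> sequence numbers that device has ACKed (per-device delivery)
    delivered_by_device: dict[str, set[int]] = field(default_factory=dict)
    # Recipient's read high-water mark for this conversation
    last_read_sequence: int = 0

    def ack_delivery(self, device_id: str, sequence_number: int) -> None:
        self.delivered_by_device.setdefault(device_id, set()).add(sequence_number)

    def status_of(self, sequence_number: int) -> str:
        # Read is implicit: anything at or below the high-water mark
        if sequence_number <= self.last_read_sequence:
            return "read"
        # Assumption: one ACKing device is enough to show "delivered"
        if any(sequence_number in acked for acked in self.delivered_by_device.values()):
            return "delivered"
        return "sent"


state = ConversationState()
state.ack_delivery("phone", 1)
state.ack_delivery("phone", 2)
state.last_read_sequence = 1
print([state.status_of(seq) for seq in (1, 2, 3)])
```

Note that marking a message as read costs one integer update, no matter how many messages sit below the mark.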
For a chat system, we need bidirectional real-time communication. This is a perfect use case for WebSockets.
For messaging, both parties need to push data: the client sends messages, the server pushes incoming messages. WebSockets allow this over a single persistent connection.
Client → Server:
// Send a message
{
"action": "send_message",
"conversation_id": "conv_123",
"content": "Hello!",
"client_message_id": "local_456" // For idempotency
}
// Mark messages as read (send highest sequence number viewed)
{
"action": "read_receipt",
"conversation_id": "conv_123",
"last_read_sequence": 42
}
// Heartbeat
{ "action": "ping" }
Server → Client:
// New message received
{
"event": "new_message",
"message_id": "msg_789",
"conversation_id": "conv_123",
"sender_id": "user_456",
"content": "Hello!",
"sequence_number": 42,
"timestamp": "2024-01-15T10:30:00Z"
}
// Message delivered to recipient's device
{
"event": "delivered",
"message_id": "msg_789",
"conversation_id": "conv_123",
"timestamp": "2024-01-15T10:30:01Z"
}
// Recipient read messages up to sequence N
{
"event": "read",
"conversation_id": "conv_123",
"reader_id": "user_456",
"last_read_sequence": 42
}
// Presence change
{
"event": "presence",
"user_id": "user_456",
"online": false
}
Delivered vs Read: "Delivered" means the message reached the recipient's device (single checkmark → double checkmark in WhatsApp). "Read" means the recipient opened the conversation (double checkmark turns blue). These are separate events.
The client_message_id is crucial for idempotency. If the connection drops during a send, the client can retry with the same ID, and the server can deduplicate.
Implementation detail: persist client_message_id with a unique constraint per sender/conversation. On retry, return the existing message_id and sequence_number instead of creating a duplicate.
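A minimal sketch of that deduplication, using an in-memory SQLite database so it runs standalone. The table is a trimmed-down version of the messages entity, `send_once` is a hypothetical helper, and assigning the sequence number with MAX + 1 is a simplification for the demo only.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        message_id TEXT PRIMARY KEY,
        conversation_id TEXT NOT NULL,
        sender_id TEXT NOT NULL,
        client_message_id TEXT NOT NULL,
        content TEXT NOT NULL,
        sequence_number INTEGER NOT NULL,
        UNIQUE (conversation_id, sender_id, client_message_id)
    )
""")


def send_once(conversation_id, sender_id, client_message_id, content):
    """Insert the message, or return the existing row when this is a retry."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                """
                INSERT INTO messages (message_id, conversation_id, sender_id,
                                      client_message_id, content, sequence_number)
                SELECT ?, ?, ?, ?, ?, COALESCE(MAX(sequence_number), 0) + 1
                FROM messages WHERE conversation_id = ?
                """,
                (str(uuid.uuid4()), conversation_id, sender_id,
                 client_message_id, content, conversation_id),
            )
    except sqlite3.IntegrityError:
        pass  # unique constraint fired: this client_message_id was already stored
    return conn.execute(
        """
        SELECT message_id, sequence_number FROM messages
        WHERE conversation_id = ? AND sender_id = ? AND client_message_id = ?
        """,
        (conversation_id, sender_id, client_message_id),
    ).fetchone()


first = send_once("conv_123", "user_a", "local_456", "Hello!")
retry = send_once("conv_123", "user_a", "local_456", "Hello!")
print(first == retry)
```

Letting the unique constraint reject the duplicate, rather than checking first and inserting second, avoids a race between two concurrent retries.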
Some operations work better as REST:
// Get user's conversations (inbox)
GET /conversations?cursor=...&limit=20
// Get messages in a conversation (history/pagination)
GET /conversations/{id}/messages?before_sequence=12345&limit=50
// Start a new conversation (or get existing)
POST /conversations
Request: { "recipient_user_id": "..." }
// Get presence for contacts
GET /users/presence?user_ids=id1,id2,id3
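The history endpoint pages backwards by sequence number rather than by offset. A rough sketch of the server-side logic over an in-memory list follows; `page_messages` and the returned `next_cursor` are illustrative and not part of the API above.

```python
def page_messages(messages, before_sequence=None, limit=50):
    """Return up to `limit` messages older than `before_sequence`, newest first,
    plus a cursor for the next (older) page, or None when history is exhausted."""
    older = [
        m for m in messages
        if before_sequence is None or m["sequence_number"] < before_sequence
    ]
    older.sort(key=lambda m: m["sequence_number"], reverse=True)
    page = older[:limit]
    # Only hand out a cursor if something older than this page exists
    next_cursor = page[-1]["sequence_number"] if len(older) > limit else None
    return page, next_cursor


history = [{"sequence_number": n} for n in range(1, 8)]
page, cursor = page_messages(history, limit=3)
print([m["sequence_number"] for m in page], cursor)
```

Because the cursor is a sequence number, new messages arriving while the user scrolls back do not shift the pages, which is the usual weakness of offset-based pagination.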
This is the core of your interview. Start with a working design, then evolve it.
Components (from the architecture diagram):
Clients: Phone App, Desktop App
Load Balancer (L4)
WebSocket Servers 1 through N
Redis Cluster: Pub/Sub + Presence
Message Service
PostgreSQL: Messages + Users
Walk through the data flow as you draw:
When User A sends a message to User B, the message hits A's WebSocket server and is forwarded to the Message Service. The service generates a sequence number, stores the message, enqueues it for B, then publishes to Redis Pub/Sub. B's WebSocket server receives the publish, atomically moves the message to an in-flight queue, and delivers it. When B's client acknowledges, we delete it from in-flight.
Sequence diagram (participants: User A, WebSocket Server 1, Message Service, PostgreSQL, Redis, WebSocket Server 2, User B). Steps in order:
1. send_message (content)
2. Store message
3. INCR seq:conv_123
4. INSERT message with sequence_number
5. LPUSH inbox:device_B (message)
6. PUBLISH user:B (notification)
7. message_id, sequence_number
8. sent confirmation
9. Subscribe notification
10. BRPOPLPUSH inbox:device_B inflight:device_B
11. new_message event
12. ACK (delivered)
13. LREM inflight:device_B (message)
14. Update status: delivered
15. PUBLISH user:A (delivered receipt)
16. Delivered notification
17. delivered status update
Use an inbox + in-flight queue (or Redis Streams with ACKs) for guaranteed delivery. Move a message to in-flight before delivery; delete only after ACK. This handles offline users, network failures, and server crashes gracefully.
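The inbox + in-flight pattern can be sketched without Redis. The `DeviceQueue` class below is a hypothetical in-memory stand-in: two deques play the role of the inbox:{device_id} and inflight:{device_id} lists, and each method notes the Redis command it imitates. In current Redis versions, BRPOPLPUSH is deprecated in favor of BLMOVE, which performs the same atomic move.

```python
from collections import deque


class DeviceQueue:
    def __init__(self):
        self.inbox = deque()     # stands in for inbox:{device_id}
        self.inflight = deque()  # stands in for inflight:{device_id}

    def enqueue(self, message):
        # Like LPUSH inbox: new messages enter at the head
        self.inbox.appendleft(message)

    def take(self):
        # Like RPOPLPUSH inbox inflight: oldest message moves to in-flight
        if not self.inbox:
            return None
        message = self.inbox.pop()
        self.inflight.appendleft(message)
        return message

    def ack(self, message):
        # Like LREM inflight 1 message: forget it only after the client ACKs
        self.inflight.remove(message)

    def requeue_inflight(self):
        # After a crash or reconnect, unacknowledged messages go back to the
        # tail of the inbox so the oldest one is redelivered first
        while self.inflight:
            self.inbox.append(self.inflight.popleft())


q = DeviceQueue()
for m in ["m1", "m2", "m3"]:
    q.enqueue(m)
q.take()               # delivers m1
q.take()               # delivers m2
q.ack("m1")            # client confirmed m1 only
q.requeue_inflight()   # connection dropped before m2 was ACKed
print(q.take())        # m2 is delivered again
```

Redelivery means the client can see the same message twice, so it should deduplicate on message_id or sequence_number.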
With dozens of WebSocket servers, how does WS1 (sender's server) route a message to WS2 (receiver's server)?
Option 1: Connection Registry
Maintain a mapping of user_id → server_id in Redis. Look up the user's server and send directly.
Cons: Registry lookups add latency; stale entries if servers crash.
Option 2: Pub/Sub (Recommended)
Each WebSocket server subscribes to channels for its connected users. When User B connects to WS2, WS2 subscribes to channel user:B. To send a message to B, publish to that channel.
WS1 publishes: PUBLISH user:B "{message}"
WS2 (subscribed to user:B) receives and delivers
Pros: No registry to maintain; naturally handles server failures (re-subscribe on reconnect).
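Pub/Sub is the recommended approach for interviews: it is simpler to explain, needs no registry cleanup when a server dies, and is a common pattern in real chat systems. Keep in mind that Redis Pub/Sub is fire-and-forget (a publish with no subscriber is simply dropped), so treat it as a wake-up signal only; the inbox queue is what guarantees delivery.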
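A toy model of the routing may help when explaining it. `Broker` is an in-process stand-in for Redis Pub/Sub and `WebSocketServer` is a hypothetical class; neither is a real client library. As with the real PUBLISH command, `publish` returns how many subscribers received the payload.

```python
from collections import defaultdict


class Broker:
    """In-process stand-in for Redis Pub/Sub: channel -> subscriber callbacks."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, payload):
        receivers = list(self.subscribers.get(channel, []))
        for callback in receivers:
            callback(payload)
        return len(receivers)  # 0 means nobody was listening


class WebSocketServer:
    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.delivered = []  # (user_id, payload) pairs pushed to local sockets

    def on_connect(self, user_id):
        # Subscribe to the user's channel when their socket lands on this server
        self.broker.subscribe(
            f"user:{user_id}",
            lambda payload: self.delivered.append((user_id, payload)),
        )


broker = Broker()
ws1, ws2 = WebSocketServer("WS1", broker), WebSocketServer("WS2", broker)
ws2.on_connect("B")
print(broker.publish("user:B", "hello"))  # reaches WS2 only
print(broker.publish("user:C", "hello"))  # user C is offline: dropped
```

The second publish reaching zero subscribers is exactly the case the inbox queue exists to cover.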
Messages for offline users accumulate in their inbox. Delivery uses an in-flight queue for ACKs. When they reconnect:
Client connects and provides last known sequence number per conversation
Server drains the inbox (undelivered messages) in order
Server backfills any gaps from durable storage using the sequence number
Client ACKs after persisting locally
Server clears in-flight entries on ACK
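One way to combine the drain and backfill steps for a single conversation is sketched below. `messages_to_send_on_reconnect` is a hypothetical helper, and both inputs are plain lists of dicts standing in for the Redis inbox and the durable store.

```python
def messages_to_send_on_reconnect(last_known_sequence, inbox, stored):
    """Merge undelivered inbox messages with durable history, keeping only
    messages newer than what the client already has, in sequence order."""
    by_sequence = {}
    for message in stored + inbox:
        seq = message["sequence_number"]
        if seq > last_known_sequence:
            by_sequence[seq] = message  # the same message from both sources collapses
    return [by_sequence[seq] for seq in sorted(by_sequence)]


stored = [{"sequence_number": n, "content": f"msg {n}"} for n in range(1, 6)]
inbox = [stored[4], stored[3]]  # messages 5 and 4 were queued while offline
catch_up = messages_to_send_on_reconnect(2, inbox, stored)
print([m["sequence_number"] for m in catch_up])
```

Keying on sequence_number removes duplicates between the two sources, and message 3, which is missing from the inbox, is recovered from durable storage.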
-- Users table
CREATE TABLE users (
user_id UUID PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
last_seen_at TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW()
);
-- Devices table (for multi-device support)
CREATE TABLE devices (
device_id UUID PRIMARY KEY,
user_id UUID NOT NULL REFERENCES users,
device_type VARCHAR(20) NOT NULL, -- phone, desktop, web
push_token VARCHAR(255),
last_active_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_devices_user ON devices(user_id);
-- Conversations table
CREATE TABLE conversations (
conversation_id UUID PRIMARY KEY,
participant_1 UUID NOT NULL REFERENCES users,
participant_2 UUID NOT NULL REFERENCES users,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
CONSTRAINT participant_order CHECK (participant_1 < participant_2),
UNIQUE (participant_1, participant_2)
);
-- Messages table
CREATE TABLE messages (
message_id UUID PRIMARY KEY,
conversation_id UUID NOT NULL REFERENCES conversations,
sender_id UUID NOT NULL REFERENCES users,
client_message_id VARCHAR(64) NOT NULL,
content TEXT NOT NULL,
sequence_number BIGINT NOT NULL,
created_at TIMESTAMP DEFAULT NOW(),
UNIQUE (conversation_id, sequence_number),
UNIQUE (conversation_id, sender_id, client_message_id)
);
CREATE INDEX idx_messages_conversation ON messages(conversation_id, sequence_number DESC);
-- Message delivery status (per device)
CREATE TABLE message_status (
message_id UUID NOT NULL REFERENCES messages,
device_id UUID NOT NULL REFERENCES devices,
status VARCHAR(20) NOT NULL DEFAULT 'sent', -- sent, delivered
updated_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (message_id, device_id)
);
-- Read receipts (high-water mark per user per conversation)
CREATE TABLE read_receipts (
user_id UUID NOT NULL REFERENCES users,
conversation_id UUID NOT NULL REFERENCES conversations,
last_read_sequence BIGINT NOT NULL DEFAULT 0,
last_read_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (user_id, conversation_id)
);
For 100M+ users, shard by conversation_id:
Shard key: hash(conversation_id) % num_shards
Benefits:
All messages in a conversation are co-located
No cross-shard queries for conversation history
Even distribution (UUID is random)
Trade-off:
"Get all conversations for user X" requires scatter-gather
Solved with a separate user_conversations index table
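A small sketch of the shard lookup follows. `NUM_SHARDS` and `shard_for` are illustrative names and 64 is an arbitrary shard count. It uses a cryptographic digest instead of Python's built-in hash(), because hash() of a string is randomized per process and would route the same conversation to different shards on different servers.

```python
import hashlib
import uuid

NUM_SHARDS = 64  # hypothetical shard count


def shard_for(conversation_id: uuid.UUID, num_shards: int = NUM_SHARDS) -> int:
    # A stable digest gives the same shard on every server and every restart
    digest = hashlib.sha256(conversation_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


conversation_id = uuid.uuid4()
print(shard_for(conversation_id))
```

Plain modulo sharding remaps most keys when num_shards changes, so it is worth mentioning consistent hashing or a fixed number of logical shards as the way to keep resharding manageable.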
With a working design in place, address the non-functional requirements and potential bottlenecks.
Problem: How do we ensure messages appear in the same order for both users?
Challenges:
Network delays can cause out-of-order delivery
Clock skew makes timestamps unreliable
Concurrent sends from both users
Solution: Per-conversation sequence numbers with server-side ordering
Option A: Redis INCR (Recommended for high throughput)
import uuid

def send_message(conversation_id, sender_id, content, client_message_id):
    # Atomic increment in Redis
    seq = redis.incr(f"seq:{conversation_id}")
    # message_id has no default in the schema, so generate it here
    message_id = uuid.uuid4()
    # Insert with assigned sequence number
    message = db.execute("""
        INSERT INTO messages (message_id, conversation_id, sender_id, client_message_id, content, sequence_number)
        VALUES ($1, $2, $3, $4, $5, $6)
        RETURNING *
    """, message_id, conversation_id, sender_id, client_message_id, content, seq)
    return message
Option B: PostgreSQL advisory locks (Lower throughput, simpler)
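-- Use advisory lock per conversation for serialization.
-- Both statements must run in the same transaction: the lock is released at commit.
SELECT pg_advisory_xact_lock(hashtext($1::text)); -- Lock on conversation_id
INSERT INTO messages (message_id, conversation_id, sender_id, client_message_id, content, sequence_number)
VALUES (
    gen_random_uuid(), -- built in since PostgreSQL 13; message_id has no default in the schema
    $1, $2, $3, $4,
    (SELECT COALESCE(MAX(sequence_number), 0) + 1 FROM messages WHERE conversation_id = $1)
)
RETURNING *;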
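Redis INCR is preferred because it is fast (a single in-memory operation, typically sub-millisecond) and doesn't serialize writers on a database lock. The trade-off is that sequence numbers may have gaps if a message insert fails after incrementing Redis, so clients must tolerate missing numbers. The counter also has to survive Redis restarts or failover (or be re-seeded from MAX(sequence_number)); otherwise a reused number will violate the UNIQUE (conversation_id, sequence_number) constraint.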
Client-side handling:
Display messages ordered by sequence_number
If message arrives with gap (seq 5, then seq 7), request missing messages
Optimistic UI: show sent message immediately, reorder if needed
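The client-side rules can be sketched as a small view model. `ConversationView` is a hypothetical class: it keeps messages sorted by sequence number, ignores duplicates, and reports which sequence numbers to request when a gap appears. It does not hold messages back while waiting, since numbers allocated with INCR can leave permanent gaps and the server may answer that a requested number does not exist.

```python
import bisect


class ConversationView:
    def __init__(self, last_known_sequence=0):
        self.last_known_sequence = last_known_sequence
        self.sequence_numbers = []  # kept sorted
        self.messages = {}          # sequence_number -> message

    def on_message(self, message):
        """Store the message and return the sequence numbers missing just below it."""
        seq = message["sequence_number"]
        if seq in self.messages or seq <= self.last_known_sequence:
            return []  # duplicate delivery: nothing to do
        self.messages[seq] = message
        index = bisect.bisect_left(self.sequence_numbers, seq)
        self.sequence_numbers.insert(index, seq)
        previous = (
            self.sequence_numbers[index - 1] if index > 0 else self.last_known_sequence
        )
        return list(range(previous + 1, seq))

    def ordered(self):
        return [self.messages[seq] for seq in self.sequence_numbers]


view = ConversationView(last_known_sequence=4)
view.on_message({"sequence_number": 5, "content": "a"})
missing = view.on_message({"sequence_number": 7, "content": "c"})
print(missing)  # the client should request these from the server
```

When the missing message arrives later, it is inserted at its sorted position, which is the "reorder if needed" step.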
Naive approach: Update a read_receipt row every time user scrolls.
Problem: User scrolling through history could generate hundreds of writes/second.
Optimized approach: Debounced, batched updates
// Client-side debouncing
let pendingReadReceipt = null;
let debounceTimer = null;
function onMessageViewed(sequenceNumber) {
  // Keep the highest sequence seen so scrolling back up cannot lower it
  pendingReadReceipt = Math.max(pendingReadReceipt ?? 0, sequenceNumber);
  clearTimeout(debounceTimer);
  debounceTimer = setTimeout(() => {
    sendReadReceipt(pendingReadReceipt);
  }, 2000); // 2 second debounce
}
Server-side: Only ever advance the high-water mark. Apply the update only if the incoming last_read_sequence is higher than the stored value, so late or out-of-order receipts never move it backwards.
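The server-side rule fits in a single conditional upsert. The sketch below uses an in-memory SQLite database so it runs standalone, and `mark_read` is a hypothetical helper. PostgreSQL accepts the same ON CONFLICT ... DO UPDATE ... WHERE form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE read_receipts (
        user_id TEXT NOT NULL,
        conversation_id TEXT NOT NULL,
        last_read_sequence INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (user_id, conversation_id)
    )
""")


def mark_read(user_id, conversation_id, last_read_sequence):
    """Advance the high-water mark if the new value is higher; return the stored value."""
    with conn:
        conn.execute(
            """
            INSERT INTO read_receipts (user_id, conversation_id, last_read_sequence)
            VALUES (?, ?, ?)
            ON CONFLICT (user_id, conversation_id) DO UPDATE
            SET last_read_sequence = excluded.last_read_sequence
            WHERE excluded.last_read_sequence > read_receipts.last_read_sequence
            """,
            (user_id, conversation_id, last_read_sequence),
        )
    return conn.execute(
        "SELECT last_read_sequence FROM read_receipts "
        "WHERE user_id = ? AND conversation_id = ?",
        (user_id, conversation_id),
    ).fetchone()[0]


print(mark_read("user_456", "conv_123", 42))
print(mark_read("user_456", "conv_123", 40))  # stale receipt: mark stays put
```

Putting the comparison inside the statement avoids a read-then-write race between two devices reporting at the same time.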
Presence is a nice-to-have; include it if time allows.
Challenge: Track online/offline status for millions of users efficiently.
Approach: Heartbeat-based presence with Redis
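HEARTBEAT_INTERVAL = 30  # seconds
PRESENCE_TTL = 60  # seconds (2x heartbeat)
def set_online(user_id):
    redis.setex(f"presence:{user_id}", PRESENCE_TTL, "online")
    publish_presence_change(user_id, online=True)
def heartbeat(user_id):
    # SETEX rather than EXPIRE: EXPIRE does nothing once the key has already
    # expired, which would leave a still-connected user stuck offline
    redis.setex(f"presence:{user_id}", PRESENCE_TTL, "online")
def is_online(user_id):
    # EXISTS returns a count (0 or 1), so convert it to a bool
    return bool(redis.exists(f"presence:{user_id}"))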
Presence subscription: For 1-to-1 chat, only notify users who have an active conversation. Maintain a Redis set of "presence subscribers" per user.
Challenge: User has phone and laptop. How to keep them in sync?
Approach: Each device has its own inbox and delivery tracking
Device registration: When a user logs in on a new device, create a device record
Inbox per device: Redis maintains inbox:{device_id} (and inflight:{device_id} for delivery tracking)
Fan-out on send: When sending to User B, add message to inbox of all B's devices
Independent ACKs: Each device ACKs delivery independently, updating message_status
Read sync: When user reads on one device, the read_receipts table is updated. Other devices fetch this on next sync
def send_to_user(user_id, message):
    devices = db.query("SELECT device_id FROM devices WHERE user_id = $1", user_id)
    # Fan out: one inbox entry per device
    for device in devices:
        redis.lpush(f"inbox:{device.device_id}", message.to_json())
    # One wake-up notification per user, after all inboxes are filled
    redis.publish(f"user:{user_id}", "new_message")
For 1-to-1 chat, the fan-out is small (typically 2-5 devices per user). This is much simpler than group chat where you'd fan out to hundreds of users.
Consistency vs Availability:
Recommendation: Favor consistency within a conversation—users notice out-of-order messages.
Latency vs Durability:
Recommendation: Synchronous write to durable storage before confirming to sender. The extra latency is acceptable; losing messages is not.
Handling WebSocket server failure: If a server crashes, clients reconnect to another server via the load balancer. They fetch pending messages from inbox/in-flight on reconnect.
Before wrapping up, verify you've covered:
Requirements Phase
Scope clarified (1-to-1 only, no groups)
4-5 functional requirements identified
Scale, latency, availability discussed
Quick capacity estimation completed
Data Model
Key entities: User, Device, Conversation, Message, MessageStatus
Canonical conversation ordering explained
Sequence numbers for message ordering
Delivered vs read status tracking
API Design
WebSocket choice justified (bidirectional, real-time)
Key commands defined (send, receive, read_receipt)
Idempotency via client_message_id
High-Level Design
Architecture diagram with data flow
Message routing explained (Pub/Sub)
Offline delivery handled (Inbox + in-flight or Streams)
Database schema and sharding
Scaling & Trade-offs
Message ordering deep dive
Read receipts optimization
Consistency vs availability discussed
At least one bottleneck identified
The 1-to-1 chat design is simpler than WhatsApp (no group fan-out), but interviewers will dig into details. Focus on message ordering, routing via Pub/Sub, and offline delivery via inbox + in-flight—these are the areas where depth matters.