
1-to-1 Chat System Design

System Design | Onsite | Phone | Software Engineer | Reported Apr 2026

Design a chat system that supports only 1-to-1 messaging between users. Group chats, channels, and other multi-party features are explicitly out of scope.

This walkthrough follows the Interview Framework and focuses on what you'd actually present in a 45-60 minute interview.

Anthropic interviewers are known to dig deep into implementation details. Don't hand-wave any component—be prepared to explain the "how" behind every "what." If you mention WebSockets, be ready to discuss connection management. If you mention message ordering, be ready to discuss clock synchronization.

Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.

Phase 1: Requirements (~5 minutes)

A 1-to-1 chat system has fewer features than WhatsApp or Slack, but clarifying scope quickly prevents over-engineering.

Functional Requirements

Frame these as user capabilities:

Send messages — Users can send text messages to another user

Receive messages — Users receive messages in near real-time if online

Conversation history — Users can scroll back through past messages

Read receipts — Sender sees when recipient has read the message

Offline delivery — Messages sent to offline users are delivered when they come online

Keep it to 4-5 core features. Presence is a common nice-to-have if you have time. Since this is 1-to-1 only, you don't need group management, channels, or complex permission systems. Acknowledge these are out of scope if asked.

Non-Functional Requirements

Ask clarifying questions:

"Should we prioritize availability or consistency?" — Messages should arrive in order even if it takes slightly longer.

"How long do we store messages?" — Permanent storage for this design (can adjust if privacy is a concern).

"Do we need multi-device support?" — Yes, users may have phone + desktop.

Capacity Estimation

Do a quick back-of-envelope calculation:

Users:

  • DAU: 100 million

  • Average messages sent per user per day: 10

  • Total messages per day: 1 billion

Traffic:

  • Messages per second (average): 1B / 86,400 ≈ 11,500 QPS

  • Peak (3x average): ~35,000 QPS

Connections:

  • Concurrent users at peak: 10% of DAU = 10M WebSocket connections

Storage:

  • Average message size: 200 bytes (text + metadata)

  • Daily storage: 1B × 200 bytes = 200 GB/day

  • Yearly storage: ~73 TB/year (before replication)
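The arithmetic above can be double-checked with a few lines of Python. The inputs are this section's estimates, including the 3x peak factor and 10% concurrency assumptions. Nothing here is measured data.

```python
# Back-of-envelope numbers from the estimate above (all inputs are assumptions)
SECONDS_PER_DAY = 86_400

dau = 100_000_000
messages_per_user_per_day = 10
peak_factor = 3               # assumed peak-to-average ratio
concurrent_fraction = 0.10    # assumed share of DAU online at peak
message_bytes = 200           # text + metadata

messages_per_day = dau * messages_per_user_per_day
avg_mps = messages_per_day / SECONDS_PER_DAY
peak_mps = avg_mps * peak_factor
peak_connections = round(dau * concurrent_fraction)
daily_bytes = messages_per_day * message_bytes
yearly_bytes = daily_bytes * 365

print(f"messages/day:      {messages_per_day:,}")
print(f"avg messages/sec:  {avg_mps:,.0f}")
print(f"peak messages/sec: {peak_mps:,.0f}")
print(f"peak connections:  {peak_connections:,}")
print(f"storage/day:       {daily_bytes / 1e9:,.0f} GB")
print(f"storage/year:      {yearly_bytes / 1e12:,.0f} TB (before replication)")
```

The unrounded average is about 11,574 messages/second and the peak about 34,722, which the estimate rounds to ~11,500 and ~35,000.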

At 35K QPS and 10M concurrent connections, we clearly need a distributed system with horizontal scaling, efficient connection management, and smart message routing. However, this is simpler than WhatsApp (no group fan-out).

Phase 2: Data Model (~5 minutes)

Identify key entities before jumping into APIs. This establishes shared vocabulary with your interviewer.

Core Entities

User

├── user_id (UUID)

├── username

├── email

├── last_seen_at

└── created_at

Device (for multi-device support)

├── device_id (UUID)

├── user_id

├── device_type (phone, desktop, web)

├── push_token (for push notifications)

└── last_active_at

Conversation

├── conversation_id (UUID)

├── participant_1 (user_id)

├── participant_2 (user_id)

├── created_at

└── updated_at (last message time)

-- Constraint: participant_1 < participant_2 (canonical ordering)

-- Unique index on (participant_1, participant_2)

Message

├── message_id (UUID)

├── conversation_id

├── sender_id

├── client_message_id (string, per sender per conversation)

├── content (text)

├── sequence_number (per-conversation)

└── created_at

MessageStatus (per device)

├── message_id

├── device_id

├── status (sent, delivered)

└── updated_at

-- Primary key: (message_id, device_id)

Canonical ordering for 1-to-1 conversations: Always store the smaller user_id as participant_1. This ensures you can find "the conversation between User A and User B" with a single lookup regardless of who initiated it.

Key Design Decisions

1. Sequence Numbers for Ordering

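Using timestamps alone for message ordering is problematic due to clock skew. Instead, the server assigns each message a per-conversation sequence number when it accepts the message (see the Message Ordering deep dive in Phase 5). A per-conversation counter only works if each pair of users maps to exactly one conversation. The canonical ordering above guarantees that:

def get_or_create_conversation(user_a, user_b):
    # Canonical ordering: the smaller user_id is always participant_1,
    # so (A, B) and (B, A) resolve to the same row
    p1, p2 = min(user_a, user_b), max(user_a, user_b)
    # Upsert against UNIQUE (participant_1, participant_2); `db` is the service's DB handle
    return db.upsert(participant_1=p1, participant_2=p2)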
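Here is a minimal in-memory sketch of a per-conversation sequence number, assuming a single process. The canonical pair is the counter key. A lock makes each increment atomic, so concurrent sends from both users never get the same number. `SequenceAllocator` and `conversation_key` are hypothetical names. In the actual design the counter lives in Redis or PostgreSQL (Phase 5).

```python
import threading
from collections import defaultdict


def conversation_key(user_a: str, user_b: str) -> tuple[str, str]:
    """Canonical (participant_1, participant_2): smaller user_id first."""
    return (user_a, user_b) if user_a < user_b else (user_b, user_a)


class SequenceAllocator:
    """Hypothetical in-memory stand-in for a server-side, per-conversation counter."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counters: dict[tuple[str, str], int] = defaultdict(int)

    def next_sequence(self, sender: str, recipient: str) -> int:
        key = conversation_key(sender, recipient)
        # Holding the lock makes read-increment-return atomic, like Redis INCR
        with self._lock:
            self._counters[key] += 1
            return self._counters[key]


alloc = SequenceAllocator()
print(alloc.next_sequence("alice", "bob"))    # 1
print(alloc.next_sequence("bob", "alice"))    # 2 (same conversation)
print(alloc.next_sequence("alice", "carol"))  # 1 (different conversation)
```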

2. Message Status Tracking

Track two states with different mechanisms:

Delivered: Tracked per-device in MessageStatus—each device ACKs independently

Read: Use a high-water mark—store last_read_sequence_number per user per conversation. All messages with sequence_number <= that value are implicitly read

This keeps delivery precise while read receipts remain scalable.
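The two mechanisms combine into the status the sender sees. For brevity, this sketch keys delivery state by sequence_number, while the schema keys message_status by message_id. It also assumes a message counts as delivered once any one of the recipient's devices has ACKed. Both choices are illustrative, not prescribed by the design, and `ConversationState` is a hypothetical name.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationState:
    # Recipient's read high-water mark (read_receipts.last_read_sequence)
    last_read_sequence: int = 0
    # sequence_number -> recipient device_ids that ACKed delivery
    delivered_devices: dict[int, set[str]] = field(default_factory=dict)

    def ack_delivery(self, sequence_number: int, device_id: str) -> None:
        # Each device ACKs independently; a repeated ACK is a no-op
        self.delivered_devices.setdefault(sequence_number, set()).add(device_id)

    def mark_read(self, last_read_sequence: int) -> None:
        # The high-water mark only moves forward
        self.last_read_sequence = max(self.last_read_sequence, last_read_sequence)

    def status_for_sender(self, sequence_number: int) -> str:
        if sequence_number <= self.last_read_sequence:
            return "read"       # everything at or below the mark is implicitly read
        if self.delivered_devices.get(sequence_number):
            return "delivered"  # assumption: any one recipient device ACKed
        return "sent"


state = ConversationState()
state.ack_delivery(1, "bob-phone")
state.ack_delivery(2, "bob-laptop")
state.mark_read(1)
print([state.status_for_sender(s) for s in (1, 2, 3)])  # ['read', 'delivered', 'sent']
```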

Phase 3: API Design (~5 minutes)

For a chat system, we need bidirectional real-time communication. This is a perfect use case for WebSockets.

Why WebSockets?

For messaging, both parties need to push data: the client sends messages, the server pushes incoming messages. WebSockets allow this over a single persistent connection.

WebSocket Commands

Client → Server:

// Send a message
{
  "action": "send_message",
  "conversation_id": "conv_123",
  "content": "Hello!",
  "client_message_id": "local_456"  // For idempotency
}

// Mark messages as read (send highest sequence number viewed)
{
  "action": "read_receipt",
  "conversation_id": "conv_123",
  "last_read_sequence": 42
}

// Heartbeat
{ "action": "ping" }

Server → Client:

// New message received
{
  "event": "new_message",
  "message_id": "msg_789",
  "conversation_id": "conv_123",
  "sender_id": "user_456",
  "content": "Hello!",
  "sequence_number": 42,
  "timestamp": "2024-01-15T10:30:00Z"
}

// Message delivered to recipient's device
{
  "event": "delivered",
  "message_id": "msg_789",
  "conversation_id": "conv_123",
  "timestamp": "2024-01-15T10:30:01Z"
}

// Recipient read messages up to sequence N
{
  "event": "read",
  "conversation_id": "conv_123",
  "reader_id": "user_456",
  "last_read_sequence": 42
}

// Presence change
{
  "event": "presence",
  "user_id": "user_456",
  "online": false
}

Delivered vs Read: "Delivered" means the message reached the recipient's device (single checkmark → double checkmark in WhatsApp). "Read" means the recipient opened the conversation (double checkmark turns blue). These are separate events.

The client_message_id is crucial for idempotency. If the connection drops during a send, the client can retry with the same ID, and the server can deduplicate.

Implementation detail: persist client_message_id with a unique constraint per sender/conversation. On retry, return the existing message_id and sequence_number instead of creating a duplicate.
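Here is a runnable sketch of that deduplication. Python's built-in sqlite3 stands in for PostgreSQL. The table keeps only the relevant columns, with the same two UNIQUE constraints as the messages schema in this design. `store_message` is a hypothetical helper. Computing the sequence as MAX + 1 is only safe because a single SQLite connection serializes writes. Phase 5 covers assigning sequence numbers under concurrency.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        message_id TEXT PRIMARY KEY,
        conversation_id TEXT NOT NULL,
        sender_id TEXT NOT NULL,
        client_message_id TEXT NOT NULL,
        content TEXT NOT NULL,
        sequence_number INTEGER NOT NULL,
        UNIQUE (conversation_id, sequence_number),
        UNIQUE (conversation_id, sender_id, client_message_id)
    )
""")


def store_message(conversation_id, sender_id, client_message_id, content):
    """Return (message_id, sequence_number); a retry returns the original row."""
    with conn:  # one transaction per send
        (seq,) = conn.execute(
            "SELECT COALESCE(MAX(sequence_number), 0) + 1 FROM messages "
            "WHERE conversation_id = ?",
            (conversation_id,),
        ).fetchone()
        # On a retry the UNIQUE constraint fires and DO NOTHING skips the insert
        conn.execute(
            "INSERT INTO messages VALUES (?, ?, ?, ?, ?, ?) "
            "ON CONFLICT (conversation_id, sender_id, client_message_id) DO NOTHING",
            (str(uuid.uuid4()), conversation_id, sender_id, client_message_id, content, seq),
        )
        # Either the row just inserted or the original from an earlier attempt
        return conn.execute(
            "SELECT message_id, sequence_number FROM messages "
            "WHERE conversation_id = ? AND sender_id = ? AND client_message_id = ?",
            (conversation_id, sender_id, client_message_id),
        ).fetchone()


first = store_message("conv_123", "alice", "local_456", "Hello!")
retry = store_message("conv_123", "alice", "local_456", "Hello!")  # client retried
print(first == retry)  # True
print(store_message("conv_123", "alice", "local_457", "Still there?")[1])  # 2
```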

REST Endpoints (Supporting APIs)

Some operations work better as REST:

// Get user's conversations (inbox)
GET /conversations?cursor=...&limit=20

// Get messages in a conversation (history/pagination)
GET /conversations/{id}/messages?before_sequence=12345&limit=50

// Start a new conversation (or get existing)
POST /conversations
Request: { "recipient_user_id": "..." }

// Get presence for contacts
GET /users/presence?user_ids=id1,id2,id3
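The before_sequence parameter suggests keyset pagination. Each page is "the next N messages with a smaller sequence number", which the (conversation_id, sequence_number DESC) index can serve directly, unlike a large OFFSET. The sketch below runs that logic over an in-memory list. `Message` and `get_messages_page` are hypothetical names, and the SQL in the docstring is an assumed equivalent query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sequence_number: int
    content: str


def get_messages_page(messages, before_sequence=None, limit=50):
    """Keyset pagination over one conversation, newest first.

    Assumed SQL equivalent:
      WHERE conversation_id = $1 AND sequence_number < $2
      ORDER BY sequence_number DESC LIMIT $3
    """
    eligible = [
        m for m in messages
        if before_sequence is None or m.sequence_number < before_sequence
    ]
    eligible.sort(key=lambda m: m.sequence_number, reverse=True)
    page = eligible[:limit]
    # The oldest sequence on this page is the next cursor; None means no older messages
    next_before = page[-1].sequence_number if len(eligible) > limit else None
    return page, next_before


history = [Message(s, f"msg {s}") for s in range(1, 121)]
page, cursor = get_messages_page(history, limit=50)
print(page[0].sequence_number, page[-1].sequence_number, cursor)  # 120 71 71
page, cursor = get_messages_page(history, before_sequence=cursor, limit=50)
print(page[0].sequence_number, page[-1].sequence_number, cursor)  # 70 21 21
page, cursor = get_messages_page(history, before_sequence=cursor, limit=50)
print(len(page), cursor)  # 20 None
```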

Phase 4: High-Level Design (~15-25 minutes)

This is the core of your interview. Start with a working design, then evolve it.

Architecture Diagram

Diagram components:

  • Clients: Phone App, Desktop App

  • Load Balancer (L4)

  • WebSocket Servers 1 to N

  • Redis Cluster: Pub/Sub + Presence, Inbox + In-flight queues

  • Message Service

  • PostgreSQL: Messages + Users

Message Flow: Sending a Message

Walk through the data flow as you draw:

When User A sends a message to User B, the message hits A's WebSocket server and is forwarded to the Message Service. The service generates a sequence number, stores the message, enqueues it for B, then publishes to Redis Pub/Sub. B's WebSocket server receives the publish, atomically moves the message to an in-flight queue, and delivers it. When B's client acknowledges, we delete it from in-flight.

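Sequence diagram (participants: User A, WebSocket Server 1, Message Service, PostgreSQL, Redis, WebSocket Server 2, User B):

  1. User A → WebSocket Server 1: send_message (content)

  2. WebSocket Server 1 → Message Service: store message

  3. Message Service → Redis: INCR seq:conv_123

  4. Message Service → PostgreSQL: INSERT message with sequence_number

  5. Message Service → Redis: LPUSH inbox:device_B (message)

  6. Message Service → Redis: PUBLISH user:B (notification)

  7. Message Service → WebSocket Server 1: message_id, sequence_number

  8. WebSocket Server 1 → User A: sent confirmation

  9. Redis → WebSocket Server 2: notification on subscribed channel user:B

  10. WebSocket Server 2 → Redis: BRPOPLPUSH inbox:device_B inflight:device_B (BLMOVE in Redis 6.2+)

  11. WebSocket Server 2 → User B: new_message event

  12. User B → WebSocket Server 2: ACK (delivered)

  13. WebSocket Server 2 → Redis: LREM inflight:device_B (message)

  14. Message status updated to delivered

  15. PUBLISH user:A (delivered receipt)

  16. Redis → WebSocket Server 1: delivered notification

  17. WebSocket Server 1 → User A: delivered status update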

Use an inbox + in-flight queue (or Redis Streams with ACKs) for guaranteed delivery. Move a message to in-flight before delivery; delete only after ACK. This handles offline users, network failures, and server crashes gracefully.
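Here is a minimal in-memory model of the inbox + in-flight pattern for one device, assuming at-least-once delivery. A message moves to in-flight before it is sent and is removed only on ACK. Anything still in flight when a connection drops goes back to the front of the inbox. `DeviceQueue` and its methods are hypothetical. In the flow above, the steps map to LPUSH, BRPOPLPUSH (BLMOVE in Redis 6.2+), and LREM.

```python
from collections import deque


class DeviceQueue:
    """Hypothetical in-memory model of one device's inbox + in-flight queues."""

    def __init__(self):
        self.inbox = deque()  # (message_id, payload), oldest first
        self.inflight = {}    # message_id -> payload, awaiting ACK (insertion-ordered)

    def enqueue(self, message_id, payload):
        self.inbox.append((message_id, payload))

    def take_for_delivery(self):
        """Move the oldest message to in-flight *before* sending it."""
        if not self.inbox:
            return None
        message_id, payload = self.inbox.popleft()
        self.inflight[message_id] = payload
        return message_id, payload

    def ack(self, message_id):
        # Delete only after the client confirms; unknown or duplicate ACKs are no-ops
        self.inflight.pop(message_id, None)

    def requeue_inflight(self):
        """On reconnect (or after a server crash), put unACKed messages back
        at the front of the inbox in their original order."""
        for message_id, payload in reversed(list(self.inflight.items())):
            self.inbox.appendleft((message_id, payload))
        self.inflight.clear()


q = DeviceQueue()
for i in (1, 2, 3):
    q.enqueue(f"msg_{i}", f"payload {i}")
q.take_for_delivery()  # msg_1 in flight
q.take_for_delivery()  # msg_2 in flight
q.ack("msg_1")         # client confirmed msg_1
q.requeue_inflight()   # connection dropped before msg_2 was ACKed
print([mid for mid, _ in q.inbox])  # ['msg_2', 'msg_3']
```

Because a message can be redelivered after a reconnect, the client should deduplicate by message_id (or sequence_number) before persisting it.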

The Routing Problem

With dozens of WebSocket servers, how does WS1 (sender's server) route a message to WS2 (receiver's server)?

Option 1: Connection Registry

Maintain a mapping of user_id → server_id in Redis. Look up the user's server and send directly.

Cons: Registry lookups add latency; stale entries if servers crash.

Option 2: Pub/Sub (Recommended)

Each WebSocket server subscribes to channels for its connected users. When User B connects to WS2, WS2 subscribes to channel user:B. To send a message to B, publish to that channel.

WS1 publishes: PUBLISH user:B "{message}"

WS2 (subscribed to user:B) receives and delivers

Pros: No registry to maintain; naturally handles server failures (re-subscribe on reconnect).

Pub/Sub is the recommended approach for interviews. It's simpler to explain, handles failures gracefully, and is used by real chat systems.
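The routing idea can be modeled with an in-process bus. `PubSub` and `WebSocketServer` are hypothetical stand-ins, not the redis-py API. Like Redis PUBLISH, `publish` returns how many subscribers received the message and stores nothing for later.

```python
from collections import defaultdict


class PubSub:
    """Hypothetical in-memory stand-in for Redis Pub/Sub: fire-and-forget."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # channel -> servers subscribed to it

    def subscribe(self, channel, server):
        self.subscribers[channel].add(server)

    def unsubscribe(self, channel, server):
        self.subscribers[channel].discard(server)

    def publish(self, channel, message):
        # Returns the number of receivers, like Redis PUBLISH; nothing is persisted
        receivers = list(self.subscribers.get(channel, ()))
        for server in receivers:
            server.on_message(channel, message)
        return len(receivers)


class WebSocketServer:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.delivered = []  # (user_id, message) pushed to local sockets

    def connect(self, user_id):
        # Subscribe to user:<id> while this server holds the user's connection
        self.bus.subscribe(f"user:{user_id}", self)

    def disconnect(self, user_id):
        self.bus.unsubscribe(f"user:{user_id}", self)

    def on_message(self, channel, message):
        user_id = channel.split(":", 1)[1]
        self.delivered.append((user_id, message))


bus = PubSub()
ws1, ws2 = WebSocketServer("ws1", bus), WebSocketServer("ws2", bus)
ws1.connect("A")
ws2.connect("B")
print(bus.publish("user:B", "hello from A"))    # 1
print(ws2.delivered)                            # [('B', 'hello from A')]
ws2.disconnect("B")
print(bus.publish("user:B", "are you there?"))  # 0
```

A return value of 0 means no server currently holds a connection for that user. The per-device inbox covers exactly that case.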

Handling Offline Users

Messages for offline users accumulate in their inbox. Delivery uses an in-flight queue for ACKs. When they reconnect:

  1. Client connects and provides last known sequence number per conversation

  2. Server drains the inbox (undelivered messages) in order

  3. Server backfills any gaps from durable storage using the sequence number

  4. Client ACKs after persisting locally

  5. Server clears in-flight entries on ACK
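Steps 2 and 3 can be sketched as two small functions. This assumes the client reports its last persisted sequence number per conversation and the server knows the latest sequence per conversation. `plan_backfill` and `merge_for_delivery` are hypothetical helpers. The merge matters because the inbox and the backfill can both contain the same message. A real client would probably cap the backfill for conversations it has never seen rather than fetch from sequence 1.

```python
def plan_backfill(client_last_seq, server_latest_seq):
    """Return {conversation_id: (first_missing, last_missing)} ranges to fetch.

    client_last_seq:   last sequence the client persisted, per conversation
    server_latest_seq: latest sequence in durable storage, per conversation
    """
    ranges = {}
    for conversation_id, latest in server_latest_seq.items():
        have = client_last_seq.get(conversation_id, 0)  # unknown conversation: from start
        if latest > have:
            ranges[conversation_id] = (have + 1, latest)
    return ranges


def merge_for_delivery(inbox_messages, backfilled):
    """Combine inbox + backfill, dropping duplicates, ordered per conversation."""
    by_key = {}
    for message in [*inbox_messages, *backfilled]:
        by_key[(message["conversation_id"], message["sequence_number"])] = message
    return [by_key[key] for key in sorted(by_key)]


client = {"conv_1": 40, "conv_2": 7}
server = {"conv_1": 42, "conv_2": 7, "conv_3": 3}
print(plan_backfill(client, server))  # {'conv_1': (41, 42), 'conv_3': (1, 3)}

inbox = [{"conversation_id": "conv_1", "sequence_number": 41, "content": "hi"}]
backfill = [
    {"conversation_id": "conv_1", "sequence_number": 41, "content": "hi"},
    {"conversation_id": "conv_1", "sequence_number": 42, "content": "you there?"},
]
print([m["sequence_number"] for m in merge_for_delivery(inbox, backfill)])  # [41, 42]
```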

Database Schema (Detailed)

-- Users table
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    last_seen_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Devices table (for multi-device support)
CREATE TABLE devices (
    device_id UUID PRIMARY KEY,
    user_id UUID NOT NULL REFERENCES users,
    device_type VARCHAR(20) NOT NULL,  -- phone, desktop, web
    push_token VARCHAR(255),
    last_active_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_devices_user ON devices(user_id);

-- Conversations table
CREATE TABLE conversations (
    conversation_id UUID PRIMARY KEY,
    participant_1 UUID NOT NULL REFERENCES users,
    participant_2 UUID NOT NULL REFERENCES users,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    CONSTRAINT participant_order CHECK (participant_1 < participant_2),
    UNIQUE (participant_1, participant_2)
);

-- Messages table
CREATE TABLE messages (
    message_id UUID PRIMARY KEY,
    conversation_id UUID NOT NULL REFERENCES conversations,
    sender_id UUID NOT NULL REFERENCES users,
    client_message_id VARCHAR(64) NOT NULL,
    content TEXT NOT NULL,
    sequence_number BIGINT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW(),
    UNIQUE (conversation_id, sequence_number),
    UNIQUE (conversation_id, sender_id, client_message_id)
);
CREATE INDEX idx_messages_conversation ON messages(conversation_id, sequence_number DESC);

-- Message delivery status (per device)
CREATE TABLE message_status (
    message_id UUID NOT NULL REFERENCES messages,
    device_id UUID NOT NULL REFERENCES devices,
    status VARCHAR(20) NOT NULL DEFAULT 'sent',  -- sent, delivered
    updated_at TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (message_id, device_id)
);

-- Read receipts (high-water mark per user per conversation)
CREATE TABLE read_receipts (
    user_id UUID NOT NULL REFERENCES users,
    conversation_id UUID NOT NULL REFERENCES conversations,
    last_read_sequence BIGINT NOT NULL DEFAULT 0,
    last_read_at TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (user_id, conversation_id)
);

Sharding Strategy

For 100M+ users, shard by conversation_id:

Shard key: hash(conversation_id) % num_shards

Benefits:

  • All messages in a conversation are co-located

  • No cross-shard queries for conversation history

  • Even distribution (UUID is random)

Trade-off:

  • "Get all conversations for user X" requires scatter-gather

  • Solved with a separate user_conversations index table
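Here is a sketch of shard routing and the user_conversations index, assuming conversation_id is a UUID string and an illustrative shard count of 16. `shard_for_conversation`, `add_conversation`, and the in-memory `user_conversations` dict are hypothetical. The hash must be stable across processes, which rules out Python's built-in hash() for strings because it is randomized per process.

```python
import uuid

NUM_SHARDS = 16  # illustrative shard count


def shard_for_conversation(conversation_id: str, num_shards: int = NUM_SHARDS) -> int:
    # uuid.UUID(...).int is identical in every process, unlike hash(str),
    # which is salted per process (PYTHONHASHSEED)
    return uuid.UUID(conversation_id).int % num_shards


# Hypothetical user_conversations index: user_id -> conversation_ids.
# As a real table it would be partitioned by user_id, so "all conversations
# for user X" reads one partition instead of scatter-gathering every shard.
user_conversations: dict[str, list[str]] = {}


def add_conversation(conversation_id: str, user_a: str, user_b: str) -> None:
    # Written for both participants when the conversation is created
    for user_id in (user_a, user_b):
        user_conversations.setdefault(user_id, []).append(conversation_id)


conv = "00000000-0000-0000-0000-000000000011"
add_conversation(conv, "alice", "bob")
print(shard_for_conversation(conv))  # 1  (0x11 = 17, 17 % 16 = 1)
print(user_conversations["bob"])     # ['00000000-0000-0000-0000-000000000011']
```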

Phase 5: Scaling & Trade-offs (~15-20 minutes)

With a working design in place, address the non-functional requirements and potential bottlenecks.

Deep Dive: Message Ordering

Problem: How do we ensure messages appear in the same order for both users?

Challenges:

Network delays can cause out-of-order delivery

Clock skew makes timestamps unreliable

Concurrent sends from both users

Solution: Per-conversation sequence numbers with server-side ordering

Option A: Redis INCR (Recommended for high throughput)

import uuid

def send_message(conversation_id, sender_id, content, client_message_id):
    # `redis` (Redis client) and `db` (PostgreSQL handle) are service-level clients
    # Atomic increment in Redis; INCR creates the key at 1 if it doesn't exist yet
    seq = redis.incr(f"seq:{conversation_id}")

    # Insert with assigned sequence number; message_id has no DB default, so generate it here
    message = db.execute("""
        INSERT INTO messages (message_id, conversation_id, sender_id, client_message_id, content, sequence_number)
        VALUES ($1, $2, $3, $4, $5, $6)
        RETURNING *
    """, uuid.uuid4(), conversation_id, sender_id, client_message_id, content, seq)

    return message

Option B: PostgreSQL advisory locks (Lower throughput, simpler)

-- Use advisory lock per conversation for serialization.
-- Run both statements in one transaction; the xact lock is released at COMMIT.
SELECT pg_advisory_xact_lock(hashtext($1::text)); -- Lock on conversation_id

INSERT INTO messages (message_id, conversation_id, sender_id, client_message_id, content, sequence_number)
VALUES (
    gen_random_uuid(),  -- built in since PostgreSQL 13
    $1, $2, $3, $4,
    (SELECT COALESCE(MAX(sequence_number), 0) + 1 FROM messages WHERE conversation_id = $1)
)
RETURNING *;

Redis INCR is preferred because it's fast (~0.1ms) and doesn't block database writes. The trade-off is that sequence numbers may have gaps if a message insert fails after incrementing Redis, so clients must accept that some gaps can never be filled from history.

Client-side handling:

Display messages ordered by sequence_number

If message arrives with gap (seq 5, then seq 7), request missing messages

Optimistic UI: show sent message immediately, reorder if needed
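The client-side rules can be sketched as a small reorder buffer per conversation. Messages that arrive early wait in `pending` until the gap below them fills, duplicates below the next expected number are dropped, and missing numbers are returned so the client can request them. `ConversationView` is hypothetical. `mark_missing` handles the server reporting that no message exists at a number, such as a gap left by a failed insert after Redis INCR.

```python
class ConversationView:
    """Hypothetical client-side ordering buffer for one conversation."""

    def __init__(self, last_seen_sequence=0):
        self.next_expected = last_seen_sequence + 1
        self.displayed = []  # contiguous messages, in sequence order
        self.pending = {}    # sequence_number -> message that arrived early

    def receive(self, sequence_number, message):
        """Buffer a message; returns sequence numbers that still need fetching."""
        if sequence_number >= self.next_expected:  # below that = duplicate, ignore
            self.pending.setdefault(sequence_number, message)
        return self._flush()

    def mark_missing(self, sequence_number):
        # Server says nothing exists at this number (e.g. a Redis INCR gap)
        return self.receive(sequence_number, None)

    def _flush(self):
        # Display every message that is now contiguous with what is shown
        while self.next_expected in self.pending:
            message = self.pending.pop(self.next_expected)
            if message is not None:
                self.displayed.append(message)
            self.next_expected += 1
        if not self.pending:
            return []
        # Anything still pending sits above a gap; report the missing numbers
        return [s for s in range(self.next_expected, max(self.pending))
                if s not in self.pending]


view = ConversationView(last_seen_sequence=4)
print(view.receive(5, "a"))  # []
print(view.receive(7, "c"))  # [6]  (gap: request seq 6)
print(view.receive(6, "b"))  # []
print(view.displayed)        # ['a', 'b', 'c']
```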

Deep Dive: Read Receipts at Scale

Naive approach: Update a read_receipt row every time user scrolls.

Problem: User scrolling through history could generate hundreds of writes/second.

Optimized approach: Debounced, batched updates

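// Client-side debouncing (sendReadReceipt sends the read_receipt command over the WebSocket)
let pendingReadReceipt = 0;
let debounceTimer = null;

function onMessageViewed(sequenceNumber) {
    // Keep the highest sequence viewed; scrolling back up must not lower it
    pendingReadReceipt = Math.max(pendingReadReceipt, sequenceNumber);
    clearTimeout(debounceTimer);
    debounceTimer = setTimeout(() => {
        sendReadReceipt(pendingReadReceipt);
    }, 2000);  // 2 second debounce
}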

Server-side: only update when the incoming last_read_sequence is higher than the stored high-water mark, so stale or duplicate receipts become harmless no-ops.
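On the server, the "only move forward" rule fits in a single upsert, so concurrent or out-of-order receipts cannot lower the mark. This sketch uses sqlite3 as a stand-in for PostgreSQL. In PostgreSQL, the same statement would use GREATEST(read_receipts.last_read_sequence, EXCLUDED.last_read_sequence) instead of SQLite's two-argument MAX. `record_read` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE read_receipts (
        user_id TEXT NOT NULL,
        conversation_id TEXT NOT NULL,
        last_read_sequence INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (user_id, conversation_id)
    )
""")


def record_read(user_id, conversation_id, last_read_sequence):
    """Upsert that only ever moves the high-water mark forward; returns the stored mark."""
    with conn:
        conn.execute(
            """
            INSERT INTO read_receipts (user_id, conversation_id, last_read_sequence)
            VALUES (?, ?, ?)
            ON CONFLICT (user_id, conversation_id) DO UPDATE
            SET last_read_sequence = MAX(last_read_sequence, excluded.last_read_sequence)
            """,
            (user_id, conversation_id, last_read_sequence),
        )
    (current,) = conn.execute(
        "SELECT last_read_sequence FROM read_receipts WHERE user_id = ? AND conversation_id = ?",
        (user_id, conversation_id),
    ).fetchone()
    return current


print(record_read("bob", "conv_123", 42))  # 42
print(record_read("bob", "conv_123", 17))  # 42 (stale receipt ignored)
print(record_read("bob", "conv_123", 50))  # 50
```

Returning the stored value lets the caller publish a read event to the sender only when the mark actually advanced.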

Deep Dive: Presence System

Presence is a nice-to-have; include it if time allows.

Challenge: Track online/offline status for millions of users efficiently.

Approach: Heartbeat-based presence with Redis

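# `redis` is a redis-py client instance, e.g. redis.Redis()
HEARTBEAT_INTERVAL = 30  # seconds

PRESENCE_TTL = 60  # seconds (2x heartbeat)

def set_online(user_id):
    redis.setex(f"presence:{user_id}", PRESENCE_TTL, "online")
    publish_presence_change(user_id, online=True)  # hypothetical helper: notifies presence subscribers

def heartbeat(user_id):
    # SETEX rather than EXPIRE: EXPIRE on an already-expired key is a no-op,
    # so a heartbeat arriving just after expiry would never restore presence
    redis.setex(f"presence:{user_id}", PRESENCE_TTL, "online")

def is_online(user_id):
    # EXISTS returns the number of matching keys (0 or 1), not a bool
    return redis.exists(f"presence:{user_id}") == 1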

Presence subscription: For 1-to-1 chat, only notify users who have an active conversation. Maintain a Redis set of "presence subscribers" per user.
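Here is a minimal in-memory model of TTL presence with per-user subscriber sets. It assumes an injectable clock so expiry can be tested without sleeping. Redis expires keys on its own. Here, `sweep` stands in for whatever mechanism detects expired keys, and Redis keyspace notifications are one option. `PresenceTracker` and its methods are hypothetical.

```python
import time
from collections import defaultdict


class PresenceTracker:
    """Hypothetical model of TTL-based presence plus per-user subscriber sets."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.expires_at = {}                 # user_id -> deadline (like SETEX)
        self.subscribers = defaultdict(set)  # user_id -> users watching them

    def subscribe(self, watcher, target):
        # e.g. when `watcher` has an active conversation with `target`
        self.subscribers[target].add(watcher)

    def is_online(self, user_id):
        deadline = self.expires_at.get(user_id)
        return deadline is not None and self.clock() < deadline

    def heartbeat(self, user_id):
        """Refresh the TTL; returns who to notify on an offline -> online change."""
        was_online = self.is_online(user_id)
        self.expires_at[user_id] = self.clock() + self.ttl
        return set() if was_online else set(self.subscribers[user_id])

    def sweep(self):
        """Drop users whose TTL lapsed; returns {user_id: watchers to notify}."""
        now = self.clock()
        expired = [u for u, deadline in self.expires_at.items() if deadline <= now]
        for u in expired:
            del self.expires_at[u]
        return {u: set(self.subscribers[u]) for u in expired}


now = [0.0]
tracker = PresenceTracker(ttl_seconds=60, clock=lambda: now[0])
tracker.subscribe("alice", "bob")
print(tracker.heartbeat("bob"))  # {'alice'} (bob came online)
now[0] = 30
print(tracker.heartbeat("bob"))  # set() (already online, TTL refreshed)
now[0] = 95
print(tracker.is_online("bob"))  # False (no heartbeat for 65s)
print(tracker.sweep())           # {'bob': {'alice'}}
```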

Deep Dive: Multi-Device Sync

Challenge: User has phone and laptop. How to keep them in sync?

Approach: Each device has its own inbox and delivery tracking

Device registration: When a user logs in on a new device, create a device record

Inbox per device: Redis maintains inbox:{device_id} (and inflight:{device_id} for delivery tracking)

Fan-out on send: When sending to User B, add message to inbox of all B's devices

Independent ACKs: Each device ACKs delivery independently, updating message_status

Read sync: When user reads on one device, the read_receipts table is updated. Other devices fetch this on next sync

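def send_to_user(user_id, message):
    # `db` and `redis` are the service's PostgreSQL handle and Redis client
    devices = db.query("SELECT device_id FROM devices WHERE user_id = $1", user_id)
    payload = message.to_json()
    for device in devices:
        # One inbox per device; LPUSH here plus a right-side pop on delivery keeps FIFO order
        redis.lpush(f"inbox:{device.device_id}", payload)
    # One notification per user; every server holding one of the user's connections is subscribed
    redis.publish(f"user:{user_id}", "new_message")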

For 1-to-1 chat, the fan-out is small (typically 2-5 devices per user). This is much simpler than group chat where you'd fan out to hundreds of users.

Trade-offs Discussion

Consistency vs Availability:

Recommendation: Favor consistency within a conversation—users notice out-of-order messages.

Latency vs Durability:

Recommendation: Synchronous write to durable storage before confirming to sender. The extra latency is acceptable; losing messages is not.

High Availability

Handling WebSocket server failure: If a server crashes, clients reconnect to another server via the load balancer. They fetch pending messages from inbox/in-flight on reconnect.

Interview Checklist

Before wrapping up, verify you've covered:

Requirements Phase

Scope clarified (1-to-1 only, no groups)

4-5 functional requirements identified

Scale, latency, availability discussed

Quick capacity estimation completed

Data Model

Key entities: User, Device, Conversation, Message, MessageStatus

Canonical conversation ordering explained

Sequence numbers for message ordering

Delivered vs read status tracking

API Design

WebSocket choice justified (bidirectional, real-time)

Key commands defined (send, receive, read_receipt)

Idempotency via client_message_id

High-Level Design

Architecture diagram with data flow

Message routing explained (Pub/Sub)

Offline delivery handled (Inbox + in-flight or Streams)

Database schema and sharding

Scaling & Trade-offs

Message ordering deep dive

Read receipts optimization

Consistency vs availability discussed

At least one bottleneck identified

Summary

The 1-to-1 chat design is simpler than WhatsApp (no group fan-out), but interviewers will dig into details. Focus on message ordering, routing via Pub/Sub, and offline delivery via inbox + in-flight—these are the areas where depth matters.

