ho Designing a Chat System Like WhatsApp — SystemCraft

Designing a Chat System Like WhatsApp

Difficulty: Intermediate 📋 Prerequisites: System Design Fundamentals — especially Message Queues and Caching ⏱️ Reading time: 20 min


TL;DR

A chat system delivers messages in real-time using WebSockets for online users and stores-then-forwards for offline users.

flowchart LR
    SENDER["Sender"]:::client
    WS["WebSocket Servers"]:::service
    CHAT["Chat Service"]:::service
    STORE[("Message Store<br/>Cassandra")]:::data
    K["Kafka<br/>fan-out"]:::async
    PUSH["Push Notifications<br/>FCM APNs"]:::external
    RECEIVER["Receiver"]:::client

    SENDER --> WS
    WS --> CHAT
    CHAT --> STORE
    CHAT --> K
    K --> WS
    CHAT --> PUSH
    WS --> RECEIVER

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef async fill:#AB47BC,stroke:#4A148C,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000
    classDef external fill:#EC407A,stroke:#880E4F,color:#fff

In 3 sentences: Clients maintain a persistent WebSocket connection to the server. When a message is sent, the server persists it, looks up which server the receiver is connected to, and pushes it down their WebSocket. If the receiver is offline, the message waits in a queue and a push notification is sent.


Understanding the Problem

💬 What is a chat system? A real-time messaging platform that lets users send text, images, and files to individuals or groups. Messages must be delivered reliably (even if the recipient is offline), ordered correctly, and displayed in real-time. Think WhatsApp, Telegram, Facebook Messenger, or Slack. The hard parts: guaranteed delivery across flaky mobile networks, real-time push without polling, group fan-out at scale, and end-to-end encryption.

Naive First Cut

flowchart LR
    SENDER["Sender"]:::client
    API["Chat API"]:::service
    DB[("Messages DB<br/>one table")]:::data
    RECEIVER["Receiver<br/>polls every 5s"]:::client

    SENDER --> API
    API --> DB
    RECEIVER --> API

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

Sender POSTs message to an API, stored in a DB. Receiver polls the API every 5 seconds for new messages.

Why this breaks:

Prior Art

Technology Choices

Tier Purpose Primary pick Alternatives
Real-time transport Push messages to online clients WebSocket (long-lived) SSE, MQTT (IoT/mobile-optimized), gRPC streaming
Connection management Track who’s online on which server Redis Pub/Sub + connection registry Kafka, custom session store
Message storage Durable, ordered message log Cassandra (partition per conversation) ScyllaDB, DynamoDB, TiDB
Message queue Decouple sender from fan-out Kafka (per-user topic or partition) SQS, RabbitMQ, Pulsar
Offline delivery Store messages until recipient connects Redis sorted set per user SQS per user, Cassandra unread table
Presence Who’s online Redis with TTL per user Dedicated presence service
Media storage Images, files, voice notes S3 / GCS with CDN MinIO, Azure Blob
Push notifications Offline users FCM + APNs OneSignal, SNS
E2E encryption Message privacy Signal Protocol (Double Ratchet) Custom, Noise Protocol

Functional Requirements

Core:

  1. Users can send messages (text) to another user in real-time (1:1 chat).
  2. Users can create groups and send messages to all group members.
  3. Messages are delivered reliably even if the recipient is offline (store-and-forward).

Below the line:

Non-Functional Requirements

Core:

Below the line:

Core Entities

API / System Interface

WebSocket: wss://chat.example.com/ws
  → Client authenticates on connect (JWT)
  → Bidirectional: send messages, receive messages, typing, presence

REST (fallback + media):
POST /v1/messages         → send a message (fallback if WS down)
GET  /v1/conversations/:id/messages?after=<seqNo>  → sync history
POST /v1/media/upload     → upload image/file, get a mediaUrl
POST /v1/groups           → create group

Wire format (over WebSocket):

{"type": "message", "to": "conv_123", "text": "hello", "clientMsgId": "uuid"}
{"type": "ack", "messageId": "msg_456", "status": "delivered"}
{"type": "typing", "conversationId": "conv_123", "userId": "u_789"}

Security: WebSocket authenticated via JWT on handshake. clientMsgId is for client-side dedupe (idempotency). Server generates the authoritative messageId and timestamp.

High-Level Design

1) User sends a 1:1 message (both online)

flowchart LR
    SENDER["Sender"]:::client
    WS1["WebSocket Server A"]:::service
    CHAT["Chat Service"]:::service
    STORE[("Message Store<br/>Cassandra")]:::data
    ROUTE["Connection Registry<br/>Redis"]:::data
    WS2["WebSocket Server B"]:::service
    RECEIVER["Receiver"]:::client

    SENDER --> WS1
    WS1 --> CHAT
    CHAT --> STORE
    CHAT --> ROUTE
    ROUTE --> WS2
    WS2 --> RECEIVER

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

Flow:

  1. Sender’s app sends the message over its WebSocket connection to Server A.
  2. Server A forwards to the Chat Service.
  3. Chat Service persists the message to Cassandra (partition key = conversationId, clustering key = messageId for ordering).
  4. Chat Service looks up receiver’s WebSocket server from Redis connection registry: ws_conn:{receiverId} → serverB.
  5. Chat Service pushes the message to Server B (internal pub/sub or direct gRPC).
  6. Server B pushes the message down the receiver’s WebSocket.
  7. Receiver’s app sends back an ack with status: delivered.
  8. Server sends delivered receipt back to sender.

2) Receiver is offline — store and forward

flowchart LR
    SENDER["Sender"]:::client
    CHAT["Chat Service"]:::service
    STORE[("Message Store")]:::data
    OFFLINE[("Offline Queue<br/>Redis sorted set")]:::data
    PUSH["Push Service"]:::service
    FCM["FCM and APNs"]:::external

    SENDER --> CHAT
    CHAT --> STORE
    CHAT --> OFFLINE
    CHAT --> PUSH
    PUSH --> FCM

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000
    classDef external fill:#EC407A,stroke:#880E4F,color:#fff

Flow:

  1. Chat Service checks connection registry → receiver not online.
  2. Message persisted to Cassandra (same as before — always store).
  3. Message ID added to receiver’s offline queue (Redis sorted set, scored by sequence number).
  4. Push notification sent via FCM/APNs: “You have a new message from X.”
  5. When receiver opens the app and reconnects via WebSocket, the server drains the offline queue: sends all pending messages in order.
  6. Receiver acks each; server removes from offline queue.

3) Group message fan-out

flowchart LR
    SENDER["Sender"]:::client
    CHAT["Chat Service"]:::service
    STORE[("Message Store")]:::data
    K["Kafka<br/>fan-out topic"]:::async
    FAN["Fan-out Workers"]:::service
    WS["WebSocket Servers"]:::service
    MEMBERS["Group Members"]:::client

    SENDER --> CHAT
    CHAT --> STORE
    CHAT --> K
    K --> FAN
    FAN --> WS
    WS --> MEMBERS

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef async fill:#AB47BC,stroke:#4A148C,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000

Flow:

  1. Sender sends to group conv_123 (256 members).
  2. Chat Service stores ONE copy of the message (partition key = conv_123).
  3. Publishes a fan-out event to Kafka: {messageId, conversationId, members: [u1..u256]}.
  4. Fan-out workers consume, look up each member’s connection, and push individually.
  5. Online members get real-time delivery. Offline members get offline queue + push notification.

Why write-once, fan-out-on-read:

Potential Deep Dives

Deep Dive 1 — How to handle 2M WebSocket connections per server

Bad — one thread per connection (Java BIO). 2M threads = impossible. OOM at ~10K threads.

Good — NIO event loop (Netty, Node.js). Netty handles millions of connections on a single event loop group. Each connection is just a file descriptor + a small buffer. Memory = ~10KB per idle connection. 2M connections ≈ 20GB RAM. Doable on a 64GB box.

Great — tiered connection handling.

Deep Dive 2 — Message ordering in distributed systems

Problem: Sender sends “Hello” then “How are you?” but they arrive at different servers or are processed out of order.

Bad — rely on server timestamps. Clock skew between servers means timestamps can be out of order. NTP gives ~10ms accuracy at best.

Good — per-conversation monotonic sequence number. Chat Service assigns a seqNo per conversation using Redis INCR conv_seq:{convId}. Messages display in seqNo order. Client sorts locally.

Great — combine sequence number + client-side vector clock for offline conflict resolution.

Deep Dive 3 — Reliable delivery with at-least-once + client dedupe

Problem: Network is unreliable. Message might be delivered twice if the ack is lost.

Flow:

Sender → Server: message (clientMsgId: "abc")
Server → Sender: ack (messageId: "msg_1", clientMsgId: "abc")
Server → Receiver: message (messageId: "msg_1")
Receiver → Server: delivered ack (messageId: "msg_1")

What if receiver’s ack is lost? Server retries delivery. Receiver sees msg_1 twice. Client dedupes by messageId — if already in local DB, ignore.

What if sender’s send is retried? Server checks clientMsgId: "abc" against a short-lived dedupe cache. If seen, returns the same messageId without re-storing.

Result: at-least-once from server side, exactly-once from user’s perspective (client dedupe).

Deep Dive 4 — How to sync message history across devices

Problem: User has phone + web + desktop. All three must show the same messages.

Solution: pull-based sync with sequence numbers.

This is the “ordered log” model (Facebook Iris). The server is the source of truth; clients are materialized views with a cursor.

Deep Dive 5 — Group fan-out: write amplification vs read amplification

Write amplification (push model):

Read amplification (pull model):

Hybrid (what WhatsApp/Discord do):

Final Architecture

flowchart LR
    CLIENTS["Mobile and Web Clients"]:::client
    LB["Load Balancer<br/>sticky by userId"]:::edge
    WS["WebSocket Servers<br/>Netty edge tier"]:::service
    CHAT["Chat Service"]:::service
    REG[("Connection Registry<br/>Redis")]:::data
    STORE[("Message Store<br/>Cassandra")]:::data
    OFFLINE[("Offline Queue<br/>Redis sorted set")]:::data
    K["Kafka<br/>fan-out and events"]:::async
    FAN["Fan-out Workers"]:::service
    PUSH["Push Service"]:::service
    MEDIA[("S3 and CDN<br/>media")]:::data
    FCM["FCM and APNs"]:::external

    CLIENTS --> LB
    LB --> WS
    WS --> CHAT
    CHAT --> REG
    CHAT --> STORE
    CHAT --> OFFLINE
    CHAT --> K
    K --> FAN
    FAN --> WS
    CHAT --> PUSH
    PUSH --> FCM

    classDef client fill:#FF7043,stroke:#BF360C,color:#fff
    classDef edge fill:#42A5F5,stroke:#0D47A1,color:#fff
    classDef service fill:#66BB6A,stroke:#1B5E20,color:#fff
    classDef async fill:#AB47BC,stroke:#4A148C,color:#fff
    classDef data fill:#FFCA28,stroke:#F57F17,color:#000
    classDef external fill:#EC407A,stroke:#880E4F,color:#fff

Summary

Decision Choice Why
Transport WebSocket Real-time bidirectional, sub-second delivery
Message store Cassandra Partition per conversation, append-only, handles billions
Connection registry Redis Sub-ms lookup of “which server has user X”
Offline delivery Redis sorted set + push notification Ordered drain on reconnect
Group fan-out Kafka → workers Async, retryable, doesn’t block sender
Ordering Per-conversation sequence number Simple, no clock dependency
Delivery guarantee At-least-once + client dedupe Zero message loss, no duplicates visible to user
Multi-device Pull sync with seqNo cursor Ordered log model (Facebook Iris)

Key Technologies Mentioned

Term What it is
WebSocket A persistent two-way connection between client and server. Unlike HTTP (request → response → done), WebSocket stays open so the server can push messages to the client anytime.
Cassandra A distributed NoSQL database optimized for fast writes. Stores data across many machines. Perfect for append-only message logs.
Kafka A distributed event streaming platform. Producers write events, consumers read them. Used here to decouple message sending from delivery fan-out.
Redis In-memory key-value store (< 1ms reads). Used here for connection registry (which user is on which server) and offline message queues.
FCM / APNs Firebase Cloud Messaging (Android) and Apple Push Notification service (iOS). How you send push notifications to phones when the app is closed.
Sequence Number A monotonically increasing integer per conversation. Guarantees message ordering regardless of clock differences between servers.
Store-and-Forward Pattern where the server stores a message durably first, then delivers it when the recipient is available. Ensures zero message loss.
Fan-out Delivering one message to multiple recipients (group chat). “Fan-out on write” = copy to each inbox. “Fan-out on read” = store once, each client fetches.

Related: Rate Limiter · Notification System · System Design Fundamentals

Designing a Food Deliv... →

💬 Comments