Home › Design Google Docs — Collaborative Editing System Design Interview

Designing a Collaborative Editing Platform (Google Docs)

⚡ Difficulty: Advanced 🏷️ Topics: Real-Time Collaboration, OT/CRDT, WebSocket, Conflict Resolution, Presence Awareness 🏢 Asked at: Google, Notion, Figma, Microsoft, Amazon

1. Understanding the Problem

A collaborative editing platform lets multiple users simultaneously edit the same document in real-time — seeing each other’s cursors, changes appearing character-by-character as they type, without any user overwriting another’s work. The hard part? When two users type at the same position in the same millisecond, you need a deterministic way to merge both changes without data loss, all while maintaining sub-100ms latency so typing feels instant.

1.5. Naive First Cut

flowchart LR
    User1["User 1 Browser"]:::client
    User2["User 2 Browser"]:::client
    API["API Server"]:::service
    DB["Postgres DB"]:::data

    User1 --> API
    User2 --> API
    API --> DB

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0

How this breaks:

Last-write-wins in Postgres means User 1’s edits silently disappear when User 2 saves — data loss
No way to push changes to other users in real-time — polling every second is too slow and wasteful
Storing the full document on every keystroke creates massive write amplification (100 chars/min × millions of docs)
No conflict resolution — concurrent edits at the same position produce garbled text
Single API server can’t maintain persistent connections for millions of active editors
No presence awareness — users have no idea others are editing, leading to conflicting sections

The rest of the doc evolves this into a production-grade real-time collaborative editing system using operational transforms and persistent WebSocket connections.

1.7. Prior Art We’re Drawing From

Google Wave OT — Pioneered Operational Transformation for real-time collaborative editing. Jupiter protocol with a central server that transforms concurrent operations to maintain consistency. (Google Research)
Figma CRDT — Uses a custom CRDT (Conflict-free Replicated Data Type) for multiplayer design editing with no central coordination for conflict resolution. Demonstrated CRDTs can work at scale with structured documents. (Figma Engineering Blog)
Yjs — Open-source CRDT framework used by Notion, JupyterLab, and others. YATA algorithm for text sequences with O(1) amortized insertion and efficient encoding. (Yjs GitHub)
Automerge — Research CRDT library that treats documents as mergeable JSON structures. Demonstrates how CRDTs handle offline editing and eventual convergence. (Ink and Switch)
Google Docs Jupiter Protocol — Server-mediated OT where each client sends operations to a central server that transforms and broadcasts them. Single point of serialization ensures total ordering. (Operational Transformation FAQ)

2. Functional Requirements

Core (Top 3)

Real-time collaborative editing — multiple users type simultaneously in the same document with changes appearing within 100ms for all participants
Document storage and retrieval — create, open, and save documents; persist all content durably
Version history — view previous versions of a document, restore any earlier state, see who made what changes

Below the Line

Rich text formatting (bold, italic, headings, lists)
Comments and suggestions (track changes)
Offline editing with sync on reconnect
Access control and sharing permissions
Real-time presence indicators (cursors, selections)
Document templates and search

3. Non-Functional Requirements

Core

NFR	Target
Edit propagation latency	< 100ms from one user’s keystroke to appearing on another user’s screen
Consistency	Eventual consistency — all users converge to the same document state regardless of operation order
Availability	99.99% — document editing must never be “down” during work hours
Scale	Support 100M+ documents with up to 50 concurrent editors per document

Below the Line

Version history retained for 30+ days
Document size up to 1M characters
Support 10M+ daily active users across all documents
Sub-second document open time (cold start)

Technology Choices

Tier	Purpose	Stores	Access Pattern	Primary	Alternatives
Document Store	Persistent document content	Full document snapshots + metadata	Read/write by docId	Postgres (or Spanner)	CockroachDB, TiDB
Operation Log	Ordered stream of edit operations	Insert/delete ops with position and version	Append-only, read by docId + version range	Cassandra (or DynamoDB)	ScyllaDB, FoundationDB
Real-time Relay	Push operations to connected editors	WebSocket messages	Fan-out per document session	WebSocket Gateway + Redis Pub/Sub	SSE, gRPC streaming
Presence Cache	Active cursors and selections	userId, cursor position, color	High-QPS reads/writes, TTL-based	Redis Cluster	Memcached
Event Bus	Async events (save, snapshot, version)	Document lifecycle events	Pub/sub per document	Kafka or Redpanda	Kinesis, Pub/Sub
Object Store	Version snapshots and exports	Document snapshots, PDF exports	Batch writes, occasional reads	S3	GCS, MinIO
Search Index	Full-text document search	Document titles and content	Text search by user	Elasticsearch	Typesense, Meilisearch

Why a central OT server, not pure CRDT? Pure CRDTs (like Yjs or Automerge) work great for peer-to-peer scenarios and offline editing. But for a Google Docs-style product where we need a canonical server-side version, fine-grained access control, and version history, server-mediated OT gives us a single serialization point. This makes snapshotting, permissions, and undo/redo simpler. The tradeoff: the server is on the critical path for every operation. We mitigate this with per-document sharding.

Why Cassandra for the operation log? Operations are append-only and partitioned by documentId. We need fast sequential reads (replay ops from version X to Y) and high write throughput. Cassandra’s partition-key based access and LSM storage handle this naturally. We never update or delete individual operations.

4. Core Entities

Document — id, title, owner, permissions, current version number, created/updated timestamps
Operation — documentId, version number, userId, type (insert/delete), position, content, timestamp
Session — documentId, userId, WebSocket connection reference, cursor position, selection range
Version Snapshot — documentId, version number, full document content, timestamp (periodic checkpoint)
User — id, name, email, avatar, color assignment for presence

5. API / System Interface

POST /api/v1/documents
  Body: { title, content?: "" }
  Response: { documentId, version: 0, createdAt }
  Auth: JWT Bearer token
  Note: Creates an empty document

GET /api/v1/documents/{docId}
  Response: { documentId, title, content, version, collaborators[] }
  Auth: JWT Bearer token
  Note: Returns latest snapshot + any pending ops after snapshot

WebSocket /ws/v1/documents/{docId}/edit
  Client sends: { type: "op", ops: [{ type: "insert", pos: 12, content: "hello" }], version: 42 }
  Server broadcasts: { type: "op", ops: [...transformed...], userId, version: 43, serverTimestamp }
  Client sends: { type: "cursor", pos: 17, selectionEnd?: 25 }
  Server broadcasts: { type: "presence", userId, name, color, pos, selectionEnd }
  Auth: JWT ticket in connection params

GET /api/v1/documents/{docId}/history
  Query: ?fromVersion=1&toVersion=100
  Response: { versions: [{ version, userId, timestamp, summary }] }
  Auth: JWT Bearer token

POST /api/v1/documents/{docId}/restore
  Body: { targetVersion: 42 }
  Response: { documentId, newVersion: 101, content }
  Auth: JWT Bearer token (owner/editor only)

6. High-Level Design

FR1: Real-Time Collaborative Editing

The core challenge: when User A inserts “hello” at position 5 and User B simultaneously deletes character at position 3, the positions shift. Without transformation, User A’s insert lands at the wrong spot. We need Operational Transformation (OT) to adjust positions based on concurrent operations.

New components we need:

API Gateway — Entry point for HTTP requests (document CRUD). Handles auth and routing.
WebSocket Gateway — Maintains persistent connections with all active editors. One connection per user per document.
Collaboration Service — The brain. Receives operations from clients, transforms them against concurrent ops using OT, assigns version numbers, and broadcasts to all participants. 💡 Operational Transformation (OT) = an algorithm that adjusts the position/content of an operation based on other operations that happened concurrently. If User B deletes char at pos 3, User A’s insert at pos 5 becomes an insert at pos 4.
Operation Log (Cassandra) — Append-only log of every operation, partitioned by documentId. Used for replaying history and conflict resolution.
Redis Pub/Sub — Routes transformed operations to the correct WebSocket Gateway instance holding each collaborator’s connection.

flowchart LR
    U1["User 1 Browser"]:::client
    U2["User 2 Browser"]:::client
    WSG["WebSocket Gateway"]:::edge
    CS["Collaboration Service"]:::service
    OL["Operation Log Cassandra"]:::data
    RPS["Redis Pub/Sub"]:::async

    U1 --> WSG
    U2 --> WSG
    WSG --> CS
    CS --> OL
    CS --> RPS
    RPS --> WSG

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef edge fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0
    classDef async fill:#3b1f5e,stroke:#c084fc,color:#e2e8f0

Color	Meaning
🟠 Purple-Orange	Client apps
🔵 Blue	Edge / Gateway
🟢 Green	Backend services
🟡 Yellow	Data stores
🟣 Purple	Async (Kafka / Pub/Sub)
🔴 Pink	External services

Step-by-step flow:

User 1 types “hello” at position 12 → client generates operation {type: "insert", pos: 12, content: "hello", clientVersion: 42}
Operation sent over WebSocket to WebSocket Gateway → forwarded to Collaboration Service
Collaboration Service checks: is clientVersion == server’s current version for this doc? If yes, operation applies directly. If not (concurrent edits happened), transform the operation against all ops between clientVersion and current server version.
Transformed operation is assigned the next server version (43), appended to Operation Log in Cassandra
Collaboration Service publishes transformed op to Redis Pub/Sub channel doc:{docId}:ops
WebSocket Gateway instances subscribed to that channel receive the op and push it to all connected clients (except the author, who gets an ACK instead)
Each client applies the transformed operation to their local document state

Why not just broadcast the raw operation?

If User 1 is at version 42 and User 2 sends an op based on version 41, User 1 needs to see User 2’s op transformed against the operation that took version 41 → 42. The server does this centrally so each client always receives operations that can be applied directly to their current state.

FR2: Document Storage and Retrieval

Documents need to be persisted durably so users can close the browser and come back later. But we can’t write the full document to disk on every keystroke — that’s 5-10 writes per second per active document. Instead, we store operations and periodically create snapshots.

New components we need:

Document Service — Handles document CRUD (create, open, list, delete). Responsible for assembling the current document state from snapshot + recent operations.
Snapshot Service — Periodically compacts the operation log into a full document snapshot. Without snapshots, opening a document with 100K operations would require replaying all of them.
Document DB (Postgres) — Stores document metadata (title, owner, permissions, current version, latest snapshot version).

flowchart LR
    User["User Browser"]:::client
    GW["API Gateway"]:::edge
    DS["Document Service"]:::service
    DB["Document DB Postgres"]:::data
    OL["Operation Log Cassandra"]:::data
    SS["Snapshot Service"]:::service
    S3["Object Store S3"]:::data

    User --> GW
    GW --> DS
    DS --> DB
    DS --> OL
    SS --> OL
    SS --> S3
    DS --> S3

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef edge fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0

Step-by-step flow:

User opens a document → GET request hits Document Service
Document Service reads metadata from Postgres: {docId, title, latestSnapshotVersion: 950, currentVersion: 987}
Fetches the snapshot at version 950 from S3 (full document content at that point)
Reads operations 951-987 from Cassandra (only 37 ops to replay, not the entire history)
Applies those 37 operations to the snapshot → produces current document state
Returns assembled document to user, establishes WebSocket connection for live editing
Meanwhile, Snapshot Service runs periodically (every 100 operations or 5 minutes): reads all ops since last snapshot, applies them, writes new snapshot to S3, updates Postgres with new snapshot version

Why snapshot + operation log, not just store the full document?

Two reasons. First, writing the full document on every keystroke would be 5-10 Postgres writes/sec per active document — unsustainable at scale. Operations are tiny (insert 3 chars at pos 12) and append-only, which Cassandra handles trivially. Second, the operation log IS the version history. Every keystroke is preserved, enabling fine-grained undo and time-travel.

FR3: Version History

Users need to see what the document looked like at any point in the past, who made changes, and restore earlier versions. This is built directly on top of the operation log.

New components we need:

History Service — Reads the operation log and reconstructs document state at any historical version. Groups operations into logical “sessions” for a human-readable change timeline.
Change Summarizer — Groups individual character operations into meaningful change descriptions (“User A added paragraph about pricing” rather than 847 individual insert operations).

flowchart LR
    User["User Browser"]:::client
    GW["API Gateway"]:::edge
    HS["History Service"]:::service
    OL["Operation Log Cassandra"]:::data
    S3["Snapshots S3"]:::data
    CS["Change Summarizer"]:::service
    KF["Kafka"]:::async

    User --> GW
    GW --> HS
    HS --> OL
    HS --> S3
    KF --> CS
    CS --> DB["History DB Postgres"]:::data

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef edge fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0
    classDef async fill:#3b1f5e,stroke:#c084fc,color:#e2e8f0

Step-by-step flow:

User clicks “Version History” → GET /documents/{docId}/history
History Service reads the pre-computed change summaries from History DB: grouped by user session (a session = continuous editing by one user without 5+ minutes of inactivity)
User selects a specific version (e.g., version 450) → History Service fetches nearest snapshot before 450 (say version 400 from S3)
Replays operations 401-450 from Cassandra → produces exact document state at version 450
Returns this reconstructed document to the UI for preview
If user clicks “Restore”: a new operation is created that replaces current content with the historical state, increments version, broadcasts to all active editors

Change Summarizer (async):

Listens to Kafka topic doc.operations
Batches operations by (userId + 5-minute inactivity gap) into “change sessions”
Produces summaries: “User A edited lines 45-52” or “User B deleted 3 paragraphs”
Stores in History DB for fast timeline rendering

6.5. Core Flows

Flow 1: Concurrent Edit with OT Resolution

sequenceDiagram
    participant U1 as User 1
    participant U2 as User 2
    participant WSG as WebSocket Gateway
    participant CS as Collaboration Service
    participant OL as Operation Log

    Note over U1,U2: Both at document version 10
    U1->>WSG: op: insert "A" at pos 5 (baseVersion: 10)
    U2->>WSG: op: delete pos 3 (baseVersion: 10)
    WSG->>CS: User1 op (baseVersion 10)
    CS->>CS: No concurrent ops. Assign version 11.
    CS->>OL: Append op (version 11)
    CS->>WSG: Broadcast to User 2
    WSG->>U2: Transformed op: insert "A" at pos 5 (version 11)
    WSG->>CS: User2 op (baseVersion 10)
    CS->>CS: Transform against version 11. Delete at pos 3 stays pos 3.
    Note over CS: Insert at pos 5 does not affect delete at pos 3
    CS->>OL: Append transformed op (version 12)
    CS->>WSG: Broadcast to User 1
    WSG->>U1: Transformed op: delete pos 3 (version 12)
    Note over U1,U2: Both converge to same state

Walk-through:

Both users start at version 10 of the document
User 1 inserts “A” at position 5 — arrives at server first, gets version 11 directly (no transformation needed since it’s based on the current version)
User 2’s delete at position 3 arrives based on version 10, but server is now at version 11. Server transforms it against version 11’s operation (insert at pos 5). Since the delete is at pos 3 and the insert was at pos 5, the delete position is unaffected — stays at pos 3
Both users converge: the document has “A” inserted at pos 5 AND character at pos 3 deleted

Non-obvious failure path: What if the Collaboration Service crashes mid-transform? The operation was never assigned a version number, so the client will timeout and retry. Operations are idempotent (same content + same base version = same transform). The client retries with the same baseVersion, and the server re-processes it.

Flow 2: Document Open with Snapshot Reconstruction

sequenceDiagram
    participant User
    participant GW as API Gateway
    participant DS as Document Service
    participant PG as Postgres
    participant S3 as Object Store
    participant CAS as Cassandra Op Log
    participant WSG as WebSocket Gateway

    User->>GW: GET /documents/{docId}
    GW->>DS: Fetch document
    DS->>PG: Get metadata + latest snapshot version
    PG-->>DS: snapshotVersion: 950 currentVersion: 987
    DS->>S3: Fetch snapshot at version 950
    S3-->>DS: Full document content at v950
    DS->>CAS: Read operations 951 to 987
    CAS-->>DS: 37 operations
    DS->>DS: Apply 37 ops to snapshot
    DS-->>User: Document content + version 987
    User->>WSG: Open WebSocket for docId
    WSG->>WSG: Register session
    WSG-->>User: Connected. Listening for ops from v988+

Non-obvious failure path: What if a snapshot is corrupted in S3? The Document Service falls back to the previous snapshot (version 850) and replays 100 more operations (851-987). Slower, but guarantees correctness. If ALL snapshots are lost, the service can rebuild from operation 0 — the op log is the source of truth.

Document Session State Machine

stateDiagram-v2
    [*] --> LOADING : User opens document
    LOADING --> CONNECTED : Snapshot loaded + WebSocket open
    CONNECTED --> EDITING : User types
    EDITING --> CONNECTED : Idle timeout 30s
    CONNECTED --> RECONNECTING : WebSocket drops
    RECONNECTING --> CONNECTED : Reconnected within 30s
    RECONNECTING --> OFFLINE : Disconnected over 30s
    OFFLINE --> SYNCING : Connection restored
    SYNCING --> CONNECTED : All buffered ops sent and ACKed
    CONNECTED --> CLOSED : User closes document
    CLOSED --> [*]

Each transition: EDITING buffers operations locally and sends over WebSocket. RECONNECTING queues new operations client-side. SYNCING replays all buffered operations to the server with proper base versions for OT resolution.

7. Deep Dives

Deep Dive 1: OT vs CRDT — Conflict Resolution Strategy

Problem: Two users edit the same position simultaneously. Without conflict resolution, one user’s changes are lost or the document becomes garbled.

Bad: Last-write-wins (LWW). Overwrite the document with whoever saved last. Simple, but any concurrent edit is lost. Completely unacceptable for collaborative editing.

Good: Lock-based editing. Lock paragraphs or sections while a user is editing them. Prevents conflicts but destroys the real-time collaboration experience — users see “section locked by User B” instead of fluid typing.

Great: Server-mediated Operational Transformation using the Jupiter/dOPT protocol. (Borrowing from Google Wave and Google Docs.)

How OT works:

The core OT algorithm maintains a server version counter and transforms each incoming operation against all operations that happened between the client’s base version and the server’s current version.

Transform rules for text operations:
- insert(pos, char) vs insert(pos2, char2):
    if pos < pos2: insert stays at pos (the other insert is after us)
    if pos > pos2: insert shifts to pos+1 (the other insert pushed us right)
    if pos == pos2: break tie by userId (deterministic)

- insert(pos, char) vs delete(pos2):
    if pos <= pos2: insert stays at pos
    if pos > pos2: insert shifts to pos-1 (the deleted char pulled us left)

- delete(pos) vs insert(pos2, char):
    if pos < pos2: delete stays at pos
    if pos >= pos2: delete shifts to pos+1

- delete(pos) vs delete(pos2):
    if pos < pos2: delete stays at pos
    if pos > pos2: delete shifts to pos-1
    if pos == pos2: delete becomes no-op (already deleted)

flowchart LR
    C1["Client 1"]:::client
    C2["Client 2"]:::client
    OT["OT Engine"]:::service
    VER["Version Manager"]:::service
    OL["Op Log"]:::data

    C1 -->|"op + baseVersion"| OT
    C2 -->|"op + baseVersion"| OT
    OT --> VER
    VER --> OL
    OT -->|"transformed op"| C1
    OT -->|"transformed op"| C2

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0

Why OT over CRDT for this use case?

Factor	OT (Server-mediated)	CRDT (Decentralized)
Server involvement	Required (single serialization point)	Optional (peers can sync directly)
Version history	Trivial (linear version chain)	Complex (DAG of causal versions)
Undo/redo	Simple (reverse the operation)	Hard (tombstones, causal dependencies)
Memory overhead	Low (operations are tiny)	Higher (tombstones never deleted)
Offline support	Harder (need server to transform)	Native (merge on reconnect)
Correctness proof	Well-understood (25+ years)	Newer (still evolving for rich text)

For a Google Docs-like product with always-online users, centralized permissions, and strong version history needs, OT wins on simplicity. If we needed peer-to-peer or offline-first (like a note-taking app), CRDT would be the choice.

Deep Dive 2: Real-Time Sync via WebSocket at Scale

Problem: 10M concurrent users editing documents. Each user has a persistent WebSocket connection. Operations must be routed from the Collaboration Service to the exact WebSocket Gateway instance holding each user’s connection.

Bad: Single WebSocket server for everyone. Can’t scale past ~100K connections per instance (file descriptors, RAM for buffers). Single point of failure.

Good: Multiple WebSocket Gateway instances behind a load balancer with sticky sessions. Works, but: when the Collaboration Service produces a transformed operation, how does it know which gateway instance holds User 2’s connection?

Great: WebSocket Gateway fleet + Redis Pub/Sub for last-mile routing + connection registry.

flowchart LR
    CS["Collaboration Service"]:::service
    RPS["Redis Pub/Sub"]:::async
    WSG1["WS Gateway 1"]:::edge
    WSG2["WS Gateway 2"]:::edge
    WSG3["WS Gateway 3"]:::edge
    U1["User 1"]:::client
    U2["User 2"]:::client
    U3["User 3"]:::client

    CS --> RPS
    RPS --> WSG1
    RPS --> WSG2
    RPS --> WSG3
    WSG1 --> U1
    WSG2 --> U2
    WSG3 --> U3

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef edge fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef async fill:#3b1f5e,stroke:#c084fc,color:#e2e8f0

Mechanism:

When a user connects via WebSocket, the Gateway registers (docId, userId) → gatewayInstanceId in Redis
Gateway subscribes to Redis Pub/Sub channel doc:{docId}:ops
When Collaboration Service produces a transformed op, it publishes to doc:{docId}:ops channel
Only gateways with users in that document are subscribed → they receive the op and push to connected users
Each gateway instance handles ~100K connections. 100 instances = 10M concurrent users

Connection handling:

Heartbeat: Client sends ping every 30 seconds. If server receives no ping for 60 seconds, connection is considered dead — clean up session.
Reconnection: Client stores last received version. On reconnect, sends {resumeFrom: lastVersion}. Server replays any missed operations from the op log.
Graceful shutdown: When a gateway instance is being drained (deploy), it sends a REDIRECT message to all connected clients with a new gateway URL. Clients reconnect within 5 seconds — zero downtime deploys.

Scaling math: At 10M concurrent users editing 2M documents, average 5 users per document. Each document channel has 5 subscribers. Redis Pub/Sub handles ~1M messages/sec per node. With 500 ops/sec per document × 2M documents = 1B messages/sec total → need 1000 Redis Pub/Sub shards partitioned by docId. In practice, only 10% of documents are actively being edited at any moment, so ~100 shards suffice.

Deep Dive 3: Version History and Efficient Undo

Problem: A document with 100K edits over 6 months. User wants to see “what did this look like last Tuesday?” Replaying 100K operations from scratch takes seconds — too slow for an interactive timeline.

Bad: Store the full document on every save (Ctrl+S). Massive storage waste (1MB doc × 1000 saves = 1GB per doc). Also misses all the changes between saves.

Good: Store every operation and replay from the start. Accurate but slow for old documents. Opening version 500 of a 100K-operation doc requires replaying 500 ops from version 0.

Great: Periodic snapshots + operation segments. (Borrowing from event sourcing best practices.)

Mechanism:

Every 100 operations (or every 5 minutes of activity), create a snapshot: serialize the full document state, store in S3 with version tag
To reconstruct any version V: find the nearest snapshot before V, replay only the operations between that snapshot and V
Worst case: replay 99 operations (not 100K)
Snapshots are immutable — old ones are never deleted (they serve as checkpoints in version history)

Undo implementation:

Local undo (before ACK): Client reverses its own pending operation locally. Trivial.
Collaborative undo (after ACK): Cannot simply reverse the operation because other users’ ops may have been applied since. Instead: generate an inverse operation and send it as a new operation through OT. Example: undo of insert("hello", pos 5) = delete(5 chars at pos 5). This inverse op goes through the full OT pipeline, getting transformed against any concurrent operations.
Undo stack per user: Each user has their own undo stack. Undoing User A’s last edit doesn’t affect User B’s edits.

Storage efficiency:

Operations are tiny: average 20-50 bytes each (type + position + 1-5 chars)
100K operations for a 6-month document ≈ 5MB in Cassandra
Snapshots: one every 100 ops × 100K ops = 1000 snapshots × 1MB each = 1GB in S3
Total per document: ~1GB for a heavily edited doc over 6 months. With S3 tiering (Glacier after 30 days), cost is negligible.

Deep Dive 4: Scaling to Millions of Concurrent Documents

Problem: 2M documents being actively edited simultaneously. Each document needs its own OT state (current version, pending transforms). How do we shard the Collaboration Service?

Bad: Single Collaboration Service instance handling all documents. Memory for 2M document states × OT buffers = too much. Single point of failure.

Good: Shard by docId using consistent hashing across N Collaboration Service instances. Each instance owns a subset of documents. Works until an instance crashes — those documents become unavailable.

Great: Consistent hash ring with virtual nodes + stateless OT workers backed by Redis for document state.

flowchart LR
    WSG["WebSocket Gateway"]:::edge
    LB["Load Balancer"]:::edge
    CS1["Collab Worker 1"]:::service
    CS2["Collab Worker 2"]:::service
    CS3["Collab Worker 3"]:::service
    RED["Redis Doc State"]:::data
    OL["Operation Log"]:::data

    WSG --> LB
    LB -->|"docId hash"| CS1
    LB -->|"docId hash"| CS2
    LB -->|"docId hash"| CS3
    CS1 --> RED
    CS2 --> RED
    CS3 --> RED
    CS1 --> OL

    classDef edge fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0

Mechanism:

Each document is assigned to a Collaboration Worker via consistent hashing on docId
The worker’s OT state for each document (current version, last N ops for transform buffer) lives in Redis, not in-process memory
When a worker receives an operation: reads doc state from Redis, transforms, writes new version + op atomically (Redis transaction / Lua script), publishes to Pub/Sub
If a worker crashes: any other worker can pick up the document because state is in Redis, not local memory
Rebalancing: add a new worker node to the hash ring → it takes ownership of some docIds → reads their state from Redis → starts processing immediately

Why Redis for OT state and not just Cassandra?

The OT transform operation needs atomic read-modify-write: “read current version, transform against recent ops, increment version, write new op.” This must be serialized per document. Redis single-threaded execution + Lua scripts give us this atomicity. Cassandra doesn’t support atomic read-then-write.

Handling hot documents (50+ concurrent editors):

A viral document with 50 editors generates 50 ops/sec. Each op requires transform against the last ~10 operations = manageable.
If a single document exceeds 200 ops/sec (unlikely in text editing), the worker can batch operations in 50ms windows and transform them together before broadcasting.

Deep Dive 5: Presence and Cursor Tracking

Problem: Users need to see each other’s cursors and selections in real-time. With 50 users in a document, each moving their cursor on every keystroke, that’s 50 × 5 cursor updates/sec = 250 messages/sec just for presence.

Bad: Broadcast every cursor movement to all users immediately. 250 messages/sec × 50 recipients = 12,500 messages/sec for one document. Wasteful — cursor positions between keystrokes are irrelevant.

Good: Throttle cursor updates to 1 per 100ms per user. Reduces to 50 × 10/sec × 50 = 25K messages/sec per hot document. Still heavy.

Great: Throttled cursor updates + position transformation + server-side aggregation.

Mechanism:

Client throttles cursor position sends to every 200ms (5 updates/sec per user max)
Cursor positions are sent as document positions (character offset), not screen coordinates
When other users’ operations shift text, the server transforms all active cursor positions using the same OT rules (cursor at pos 10 shifts to pos 11 after an insert at pos 5)
Server batches all cursor positions for a document every 200ms and sends one aggregated presence update to all users: {cursors: [{userId: "A", pos: 12, color: "#ff6b6b"}, {userId: "B", pos: 45, selection: [45,60], color: "#4ecdc4"}]}
Presence data stored in Redis with 10-second TTL. If no update in 10 seconds, cursor disappears (user left or went idle)

Color assignment: Each user gets a deterministic color based on a hash of their userId from a predefined palette of 12 high-contrast colors. Consistent across sessions.

7.5. Design Self-Audit

Question	Answer
Dedicated search index?	Yes — Elasticsearch for full-text document search by title and content. Indexed asynchronously via Kafka on document save events
Stale reads after writes?	After typing, your own changes appear instantly (optimistic local apply). Other users see changes within 100ms via WebSocket push. Acceptable.
Single points of failure?	Redis doc state has replicas with auto-failover. Collaboration workers are stateless (state in Redis). Cassandra op log uses RF=3.
Dead-letter / reconciliation?	Failed snapshot jobs go to DLQ and retry. If a client’s op is rejected (version mismatch), client rebases and retries automatically. Orphaned sessions (no heartbeat > 60s) are cleaned by a sweeper.
Data freshness across caches?	Cursor presence TTL = 10s (stale cursors auto-disappear). Document metadata cache TTL = 30s. Op log is source of truth — no cache invalidation needed.
Cost at scale?	Redis (OT state + Pub/Sub): 20 shards × r6g.large ≈ $4000/month. Cassandra op log (RF=3): 12 nodes ≈ $6000/month. WebSocket Gateways (100 instances): $10K/month. S3 snapshots: $500/month. Total hot-path: ~$21K/month for 10M DAU.

8. Final Architecture

flowchart LR
    subgraph Clients
        U1["User Browsers"]:::client
    end

    subgraph Edge
        LB["Load Balancer"]:::edge
        GW["API Gateway"]:::edge
        WSG["WebSocket Gateway Fleet"]:::edge
    end

    subgraph Services
        DS["Document Service"]:::service
        CS["Collaboration Service OT"]:::service
        HS["History Service"]:::service
        SS["Snapshot Service"]:::service
        SUMM["Change Summarizer"]:::service
        PRES["Presence Aggregator"]:::service
    end

    subgraph Async
        KF["Kafka"]:::async
        RPS["Redis Pub/Sub"]:::async
    end

    subgraph Data
        PG["Postgres Metadata"]:::data
        CAS["Cassandra Op Log"]:::data
        RED["Redis Doc State + Presence"]:::data
        S3["S3 Snapshots"]:::data
        ES["Elasticsearch"]:::data
    end

    U1 --> LB
    LB --> GW
    LB --> WSG
    GW --> DS
    WSG --> CS
    CS --> RED
    CS --> CAS
    CS --> RPS
    RPS --> WSG
    DS --> PG
    DS --> S3
    DS --> CAS
    SS --> CAS
    SS --> S3
    KF --> SUMM
    KF --> ES
    CS --> KF
    HS --> CAS
    HS --> S3
    PRES --> RED
    PRES --> RPS

    classDef client fill:#4c3a5e,stroke:#818cf8,color:#e2e8f0
    classDef edge fill:#1e3a5f,stroke:#60a5fa,color:#e2e8f0
    classDef service fill:#1a3a2a,stroke:#4ade80,color:#e2e8f0
    classDef async fill:#3b1f5e,stroke:#c084fc,color:#e2e8f0
    classDef data fill:#3b3520,stroke:#fbbf24,color:#e2e8f0
    classDef external fill:#4a1942,stroke:#f472b6,color:#e2e8f0

Want a deep dive on rich text OT (formatting operations), offline editing with CRDT fallback, or access control for shared documents? Drop a comment below 👇

🎯 Key Takeaways

Operational Transform (OT) resolves concurrent edits by transforming positions
Server-mediated OT gives linear version history — simpler than CRDT for online editing
Snapshot + operation log avoids writing full document on every keystroke
Redis Pub/Sub routes operations to the correct WebSocket gateway

Chat System (WhatsApp) — similar WebSocket fan-out, presence tracking
Notification System — multi-channel push delivery
Stock Broker (Robinhood) — event sourcing, ordered operations

Designing a Collaborative Editing Platform (Google Docs)

1. Understanding the Problem

1.5. Naive First Cut

1.7. Prior Art We’re Drawing From

2. Functional Requirements

Core (Top 3)

Below the Line

3. Non-Functional Requirements

Core

Below the Line

Technology Choices

4. Core Entities

5. API / System Interface

6. High-Level Design

FR1: Real-Time Collaborative Editing

FR2: Document Storage and Retrieval

FR3: Version History

6.5. Core Flows

Flow 1: Concurrent Edit with OT Resolution

Flow 2: Document Open with Snapshot Reconstruction

Document Session State Machine

7. Deep Dives

Deep Dive 1: OT vs CRDT — Conflict Resolution Strategy

Deep Dive 2: Real-Time Sync via WebSocket at Scale

Deep Dive 3: Version History and Efficient Undo

Deep Dive 4: Scaling to Millions of Concurrent Documents

Deep Dive 5: Presence and Cursor Tracking

7.5. Design Self-Audit

8. Final Architecture

🎯 Key Takeaways

Related Designs

💬 Comments