Lesson 42 of 105 23 minFlagship

System Design: Designing a Collaborative Editor (Google Docs)

How does Google Docs handle real-time simultaneous edits? A technical deep dive into Operational Transformation (OT) and Conflict-free Replicated Data Types (CRDTs).

Reading Mode

Hide the curriculum rail and keep the lesson centered for focused reading.

Key Takeaways

  • **Real-time Collaboration:** Multiple users can edit the same document at the same time.
  • **Low Latency:** Changes from one user should appear on others' screens in milliseconds.
  • **Consistency:** Everyone must eventually see the same final document state.
Recommended Prerequisites
System Design Interview FrameworkSystem Design: Designing a Distributed Task Scheduler

Premium outcome

From vague architecture answers to staff-level trade-off thinking.

Backend engineers preparing for senior, staff, and architecture rounds.

What you unlock

  • A reusable system design answer framework for ambiguous prompts
  • Clear language for consistency, scaling, and reliability trade-offs
  • Case-study depth across feeds, payments, storage, and messaging systems

Designing a real-time collaborative document editor like Google Docs, Notion, or Figma represents one of the most complex challenges in distributed systems. When multiple users edit the same line of text simultaneously, the system must synchronize their state instantly without losing edits or corrupting the document, even during high network latencies or offline operations.

This case study provides an exhaustive, production-grade analysis of the synchronization algorithms, WebSockets infrastructures, data models, and scaling architectures required to build a collaborative editor supporting millions of active co-editors at scale.


1. Requirements & Core Constraints

To architect a highly concurrent, stateful synchronization platform, we must define functional boundaries, high-performance SLAs, and back-of-the-envelope calculations.

Functional Constraints

  • Real-Time Multi-User Editing: Multiple co-editors must be able to view and edit the exact same document simultaneously, seeing characters render in real-time.
  • Conflict Resolution: The system must automatically resolve typing conflicts (e.g., two users inserting or deleting characters at the exact same location) without manual user intervention.
  • User Presence & Cursors: Clients must receive low-latency updates of co-editors' live cursor positions, text selections, and active user lists.
  • Offline Workspace Sync: Users must be able to type or delete content while offline. The platform must cleanly merge their offline edits with the latest server state upon reconnection without losing local changes or corrupting the server document.
  • Infinite Revision History: The system must maintain a comprehensive, immutable audit log of every individual keystroke, enabling complete version rollback capabilities.

Non-Functional SLAs

  • Ultra-Low Local Latency: Local keystrokes must be rendered instantly ($< 10\text{ms}$) on the typing user's screen. Remote keystrokes from other editors must render within $\le 100\text{ms}$ (p99 latency) on active co-editors' screens under standard network conditions.
  • Strong Eventual Consistency (Convergence): When all users stop typing, every connected client's document must converge to show the exact same character sequence, character-for-character.
  • High Concurrency & Durability: The architecture must scale to support $10,000,000$ concurrent active editing sessions and $100,000,000$ Daily Active Users (DAUs). Documents must have zero chance of data loss or state corruption ($99.999999999%$ durability).
  • High Availability: High availability ($99.99%$ uptime SLA) for connection gateways and document rendering engines.

Back-of-the-Envelope Estimates

To size our network, memory, and database tiers, we calculate the peak throughput of a globally active collaborative workspace:

1. Ingest Operation Throughput

  • Daily Active Users (DAUs): $100,000,000$.
  • Concurrent Active Editors (Peak): Assuming $10%$ of DAUs are actively editing at the peak hour: $$\text{Concurrent Active Users} = 10,000,000$$
  • Keystroke Velocity: An average user types at approximately $40$ words per minute. Assuming an average of $5$ characters per word, this translates to $200$ characters per minute, or: $$\text{Typing Frequency} \approx 3.3 \text{ operations/second per user}$$
  • Total Write Ingest QPS: $$\text{Ingress QPS} = 10,000,000 \text{ users} \times 3.3 \text{ ops/sec} = 33,000,000 \text{ writes/sec}$$

2. Network Ingress Bandwidth

Each edit operation is packaged into a lightweight JSON or binary payload containing metadata (document ID, client ID, operation types, position index, character value, and vector clocks). Assume each raw WebSocket edit frame is $200 \text{ bytes}$: $$\text{Ingress Bandwidth} = 33,000,000 \text{ ops/sec} \times 200 \text{ bytes} = 6.6 \text{ GB/s} = 52.8 \text{ Gbps}$$

3. Network Egress Bandwidth (The Fan-Out Challenge)

  • Assume an average of $5$ active co-editors are viewing and collaborating on each active document simultaneously.
  • When one user types, the server must broadcast the operation to the other $4$ co-editors: $$\text{Total Egress Broadcast QPS} = 33,000,000 \text{ writes/sec} \times 4 = 132,000,000 \text{ updates/sec}$$
  • Standardized broadcast payloads are stripped of excess client headers, averaging $150 \text{ bytes}$: $$\text{Egress Bandwidth} = 132,000,000 \text{ updates/sec} \times 150 \text{ bytes} = 19.8 \text{ GB/s} = 158.4 \text{ Gbps}$$ This requires a highly optimized WebSocket gateway fleet to manage massive, high-throughput network streaming.

4. Storage Calculations (5-Year Horizon)

  • New Documents Created: Assume $100,000,000$ new documents are created annually. Over $5$ years, total documents = $500,000,000$.
  • Compressed Snapshots (AWS S3): Average size of a fully compiled document snapshot (containing rich text, formatting, and structural JSON) = $100 \text{ KB}$. $$\text{Snapshot Storage} = 500,000,000 \text{ docs} \times 100 \text{ KB} = 50 \text{ TB}$$
  • Keystroke Operation Logs (Cassandra): Suppose each document has an average lifespan of $10,000$ keystroke operations. $$\text{Total Global Operations} = 500,000,000 \text{ docs} \times 10,000 \text{ ops} = 5 \times 10^{12} \text{ (5 Trillion Operations)}$$
  • If each wide-column row stores an operation at an optimized $50 \text{ bytes}$: $$\text{Raw Cassandra Storage} = 5 \times 10^{12} \times 50 \text{ bytes} = 250 \text{ TB}$$
  • With a Cassandra Replication Factor of $3$ to ensure maximum partition availability and write durability: $$\text{Total Operations Storage} = 250 \text{ TB} \times 3 = 750 \text{ TB}$$

2. API Design & Core Contracts

Collaborative editing utilizes bidirectional WebSockets to stream low-overhead character modifications, supported by a REST lifecycle API.

1. Document Creation & Lifecycle (REST)

POST /api/v1/documents Creates a brand new document with custom metadata.

Request Headers:

Content-Type: application/json
Authorization: Bearer <JWT_TOKEN>

Request Payload:

{
  "title": "Distributed Ledgers Architecture",
  "owner_id": "usr_88a91b2c",
  "workspace_id": "wsp_99f8e7d6"
}

Response Payload (201 Created):

{
  "doc_id": "doc_33e0892a",
  "title": "Distributed Ledgers Architecture",
  "version": 0,
  "s3_snapshot_url": null,
  "created_at": 1782236500,
  "updated_at": 1782236500
}

2. WebSocket Handshake & Session Upgrades

GET /api/v1/editor/connect Upgrades the standard client connection to a stateful, persistent WebSocket session.

Request Headers:

Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Protocol: csp-collab-v1

Query Parameters:

doc_id=doc_33e0892a
client_id=cli_55b44c3d
auth_token=<JWT_TOKEN>

3. WebSocket Real-Time Messages (JSON Schema Format)

A. Client-to-Server Keystroke Operation (document_edit)

Sent by an editing client whenever a character is inserted or deleted.

{
  "event": "document_edit",
  "doc_id": "doc_33e0892a",
  "client_id": "cli_55b44c3d",
  "client_seq": 45,
  "server_version": 142,
  "operation": {
    "type": "INSERT",
    "position": 1054,
    "text": "a",
    "attributes": {
      "bold": true,
      "italic": false
    }
  },
  "vector_clock": {
    "cli_55b44c3d": 45,
    "cli_92f11b8e": 88
  },
  "timestamp": 1782236505
}

B. Server-to-Client Broadcast Operation (document_broadcast)

Dispatched by the Collaboration Server to all other active co-editors connected to the document.

{
  "event": "document_broadcast",
  "doc_id": "doc_33e0892a",
  "origin_client_id": "cli_55b44c3d",
  "server_version": 143,
  "operation": {
    "type": "INSERT",
    "position": 1054,
    "text": "a",
    "attributes": {
      "bold": true,
      "italic": false
    }
  },
  "vector_clock": {
    "cli_55b44c3d": 45,
    "cli_92f11b8e": 88
  },
  "timestamp": 1782236506
}

C. Live Presence & Cursor Coordinate Update (presence_update)

Sent at regular intervals (or on movement) to distribute cursor coordinates and active text highlights.

{
  "event": "presence_update",
  "doc_id": "doc_33e0892a",
  "client_id": "cli_55b44c3d",
  "cursor_position": 1055,
  "selection_range": {
    "start": 1050,
    "end": 1055
  },
  "user_profile": {
    "name": "Jane Doe",
    "color": "#FF5733"
  },
  "timestamp": 1782236507
}

D. Out-of-Sync Recovery Message (sync_error)

Dispatched by the server when a client's lag or network partition exceeds the Operation Buffer boundary, forcing a full state resynchronization.

{
  "event": "sync_error",
  "error_code": "OUT_OF_SYNC_ERROR",
  "message": "Client version is too stale to transform. Re-fetching latest document snapshot.",
  "latest_server_version": 450,
  "fallback_snapshot_url": "https://csp-snapshots.s3.amazonaws.com/doc_33e0892a/v400.json"
}

3. High-Level Design (HLD)

To handle massive concurrency, the collaborative editor's architecture relies on stateful connection routing, decoupling persistent gateway fleets from transactional persistence engines.

graph TD
    %% Clients Layer
    subgraph Clients["Active Editing Clients"]
        ClientA["Editing Client A (User A)"]
        ClientB["Editing Client B (User B)"]
        ClientC["Editing Client C (User C)"]
    end

    %% Routing Layer
    LB["Consistent Hash Load Balancer (Anycast IP)"]
    
    %% Stateful WebSocket Fleet
    subgraph WSFleet["WebSocket Gateway Fleet (Netty/epoll Engine)"]
        WSNode1["WS Gateway Node 1"]
        WSNode2["WS Gateway Node 2"]
    end

    %% Collaboration Coordination
    subgraph CollabCluster["Stateful Collaboration Cluster"]
        CollabServer1["Collab Server 1 (Host: Doc A, B)"]
        CollabServer2["Collab Server 2 (Host: Doc C)"]
        OTEngine1["In-Memory OT / CRDT Engine"]
        
        CollabServer1 --- OTEngine1
    end

    %% Memory & Caching
    RedisCluster[("Redis Cluster (Presence & IP Session Directory)")]
    
    %% Decoupling Queue
    KafkaCluster{{"Apache Kafka Operation Pipeline"}}

    %% Persistence Layer
    subgraph Storage["Distributed Persistence Tier"]
        CassandraCluster[("ScyllaDB/Cassandra (Append-Only Op Logs)")]
        DBPrimary[("PostgreSQL Cluster (Metadata, Permissions, ACLs)")]
        S3Store[("AWS S3 (Compressed JSON Document Snapshots)")]
    end

    %% Compaction Layer
    CompactionWorkers["Snapshot Compaction Workers"]

    %% Flow Paths
    ClientA -->|1. Keystroke WebSockets stream| LB
    ClientB -->|1. Keystroke WebSockets stream| LB
    ClientC -->|1. Keystroke WebSockets stream| LB
    
    LB -->|Sticky doc_id route| WSNode1
    LB -->|Sticky doc_id route| WSNode2
    
    WSNode1 -->|2. Stream Client Operations| CollabServer1
    WSNode2 -->|2. Stream Client Operations| CollabServer1
    
    CollabServer1 -->|3. Register Active Session| RedisCluster
    CollabServer1 -->|4. Push to Queue| KafkaCluster
    
    KafkaCluster -->|5. Multi-Partition Commit| CassandraCluster
    KafkaCluster -->|6. Batch Ingress| CompactionWorkers
    
    CompactionWorkers -->|7. Read raw logs / Generate state JSON| CassandraCluster
    CompactionWorkers -->|8. Upload latest S3 Snapshot| S3Store
    CompactionWorkers -->|9. Update snapshot pointer| DBPrimary

    classDef database fill:#0d3b66,stroke:#f4d35e,stroke-width:2px,color:#fff;
    classDef cluster fill:#2e0f38,stroke:#f4d35e,stroke-width:2px,color:#fff;
    classDef client fill:#3d5a80,stroke:#293241,stroke-width:2px,color:#fff;
    classDef loadbalancer fill:#ee6c4d,stroke:#293241,stroke-width:2px,color:#fff;
    class CassandraCluster,DBPrimary,S3Store,RedisCluster database;
    class CollabCluster,WSFleet cluster;
    class ClientA,ClientB,ClientC client;
    class LB loadbalancer;

End-to-End Architectural Workflows

1. Keystroke Ingress & Synchronization Loop

  1. Local Optimistic Rendering: When User A presses a key, the client-side editor instantly renders the character locally to achieve a zero-latency typing feel.
  2. WebSocket Ingest: Simultaneously, the client packages the edit event into a document_edit frame and transmits it over an active WebSocket connection.
  3. Sticky LB Routing: The Anycast Load Balancer analyzes the incoming WebSocket handshake query parameters. It uses Consistent Hashing on the doc_id to route all clients collaborating on that specific document to the exact same physical Collaboration Server Node (Collab Server 1).
  4. Operation Sequencing: Collab Server 1 receives the edit frame. The in-memory OT Synchronization Engine checks if the client's server_version matches the server's current version index. If a version gap is identified, the server transforms the index values, commits the finalized operation to its local queue, and increments the server version.
  5. Real-time Fan-out Egress: The Collaboration Server immediately streams the transformed operation back down the WebSocket Gateway Nodes to all other co-editors viewing that document.
  6. Async Ledger Streaming: The server asynchronously writes the operation frame to an Apache Kafka topic keyed by doc_id to guarantee zero-loss message ingestion.

2. The Persistent Storage Loop & Compaction

  1. Wide-Column Op logging: Operational records are pulled from Kafka partitions and logged into ScyllaDB/Cassandra to form the immutable chronological revision log.
  2. Compaction Queue: Every $100$ operations written for a document, a Snapshot Compaction Worker is triggered.
  3. Compactor Processing: The worker performs a wide-column range read from Cassandra to compile all raw changes since the last S3 JSON snapshot.
  4. S3 Snapshot Upload: The worker merges changes, compresses the compiled JSON document state using Brotli, uploads it to AWS S3, and updates the primary PostgreSQL Metadata DB with the new s3_snapshot_url and corresponding version pointer.
  5. Log Truncation: Cassandra logs preceding this updated snapshot version are marked for automated TTL deletion.

4. Low-Level Design (LLD) & Data Models

Database Selection Rationale

Database Model Type Core Purpose Rationale
PostgreSQL Relational SQL Metadata, Ownership, and ACLs Document creation, directory nesting, sharing permissions, and organization groups require strict foreign key constraints, ACID guarantees, and secondary relational indices.
ScyllaDB / Cassandra Wide-Column NoSQL Append-Only Operation Logs Character-level edits are extremely write-heavy. Cassandra's LSM-Tree storage engine supports exceptionally fast sequential append writes at scale.
AWS S3 Object Store Compressed JSON Snapshots Large, static JSON document states are stored cheaply, durably, and can be retrieved using high-speed edge CDNs when a document first loads.
Redis Cluster In-Memory Key-Value Live Cursor Presence & Sticky Directories Temporary cursor indices and coordinates change multiple times a second. Redis allows sub-1ms read/write times and supports TTL expirations.

SQL DDL Database Schemas (PostgreSQL)

-- Core Users Table
CREATE TABLE users (
    id VARCHAR(64) PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    display_name VARCHAR(128) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Document Metadata Table
CREATE TABLE documents (
    id VARCHAR(64) PRIMARY KEY,
    owner_id VARCHAR(64) NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
    title VARCHAR(255) NOT NULL DEFAULT 'Untitled Document',
    latest_version INT NOT NULL DEFAULT 0,
    s3_snapshot_url VARCHAR(512) NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Access Control Lists (ACLs) for Collaboration Permissions
CREATE TABLE document_permissions (
    doc_id VARCHAR(64) NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    user_id VARCHAR(64) NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    permission_role VARCHAR(32) NOT NULL DEFAULT 'VIEWER', -- 'OWNER', 'EDITOR', 'VIEWER'
    granted_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (doc_id, user_id)
);

-- Multi-Region Server Mapping Index (Session Directory)
CREATE TABLE active_document_sessions (
    session_id VARCHAR(64) PRIMARY KEY,
    doc_id VARCHAR(64) NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    user_id VARCHAR(64) NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    gateway_node_ip VARCHAR(64) NOT NULL,
    collab_server_ip VARCHAR(64) NOT NULL,
    connected_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_doc_permissions_user ON document_permissions(user_id);
CREATE INDEX idx_sessions_doc_collab ON active_document_sessions(doc_id, collab_server_ip);

Cassandra Data Model Schema (Wide-Column Log)

The operation log database uses doc_id as the partition key to distribute write workloads across the Cassandra cluster. The clustering column version_number is sorted in descending order to allow rapid range retrieval of recent edits.

CREATE KEYSPACE csp_document_store 
WITH replication = {
    'class': 'NetworkTopologyStrategy', 
    'us-east-1': 3, 
    'eu-west-1': 3
};

CREATE TABLE csp_document_store.operation_logs (
    doc_id text,
    version_number int,
    client_id text,
    client_sequence int,
    operation_type text, -- 'INSERT', 'DELETE'
    char_index int,
    character_data text,
    attributes_json text, -- Formatting (bold, italic, etc.)
    vector_clock_map map<text, bigint>,
    created_at timestamp,
    PRIMARY KEY (doc_id, version_number)
) WITH CLUSTERING ORDER BY (version_number DESC);

5. Synchronization Algorithms (OT vs. CRDT)

A collaborative system must guarantee mathematical convergence. When multiple co-editors write concurrently, the server resolves conflicts using one of two primary synchronization engines.

Comparative Architectural Analysis

graph TD
    subgraph OTMath["Operational Transformation (OT)"]
        O1["Op 1: Insert('s', pos: 3)"]
        O2["Op 2: Delete(pos: 0)"]
        ServerOT["Authoritative Server Node"]
        
        O1 & O2 --> ServerOT
        ServerOT -->|T(O1, O2)| TransformedO1["Insert('s', pos: 2)"]
        ServerOT -->|T(O2, O1)| TransformedO2["Delete(pos: 0)"]
    end

    subgraph CRDTMath["Conflict-Free Replicated Data Types (CRDTs)"]
        NodeTree["Tree of Immutable Characters"]
        CharA["Char 'H' (ID: 0.1)"]
        CharB["Char 'i' (ID: 0.2)"]
        CharC["Char '!' (ID: 0.15)"]
        
        CharA --> CharC
        CharC --> CharB
    end
Dimension Operational Transformation (OT) Conflict-Free Replicated Data Types (CRDT)
Concept Transmit lightweight indexing edits; central server transforms index coordinates based on race order. Represent document as a decentralized tree of unique, immutable character elements with fractional indexing.
Coordination Stateful/Centralized: Requires an authoritative server to assign and serialize operation orders. Decentralized/Peer-to-Peer: Supports serverless multi-master synchronization and P2P mesh architectures.
Memory Footprint Minimal: Only stores the raw text buffer and a small circular buffer of recent history events. Substantial: Every single character ever typed maintains unique metadata, generating up to $100\times$ storage overhead.
Complexity Very High: Corner cases for $N$-way concurrent transformations are incredibly difficult to design. Moderate: Concept is mathematically proven and self-balancing, but requires indexing garbage-collection algorithms.
Production Use Google Docs, Etherpad, Apache Wave. Figma (using custom CRDT), Notion, Yjs, Automerge.

Operational Transformation (OT) Implementation

Step-by-Step Transform Scenario

  1. Initial Document State: "Hello"
  2. User A Concurrent Edit: Inserts "!" at position 5 ($O_A = \text{INSERT}(5, \text{"!"})$).
  3. User B Concurrent Edit: Deletes character "o" at position 4 ($O_B = \text{DELETE}(4)$).
  4. Server Receives Operations:
    • The server decides $O_A$ arrives first, setting its execution index first.
    • When User B's operation ($O_B$) is evaluated, its target index does not change because the insert happened after the delete position. Transformed $T(O_B, O_A) = \text{DELETE}(4)$.
    • When User A's operation ($O_A$) is processed for User B's client, the delete at position 4 shifted the end of the word left by 1 index. Thus, $O_A$'s insertion index must shift left by 1. Transformed $T(O_A, O_B) = \text{INSERT}(4, \text{"!"})$.
  5. Final Convergence: Both clients evaluate and render "Hell!".

Production-Grade TypeScript OT Engine Implementation

Below is a complete, runnable TypeScript implementation of a stateful Collaboration OT synchronization manager:

export interface Operation {
  type: 'INSERT' | 'DELETE';
  position: number;
  text: string; // The character or string acted upon
}

export class OTEngine {
  /**
   * Transforms operational parameters relative to concurrent sibling actions
   * @param opA Operation to transform
   * @param opB Concurrent operation already applied on the server
   */
  public static transform(opA: Operation, opB: Operation): Operation {
    const transformed: Operation = { ...opA };

    if (opA.type === 'INSERT' && opB.type === 'INSERT') {
      if (opA.position > opB.position) {
        transformed.position += opB.text.length;
      } else if (opA.position === opB.position) {
        // Break index ties deterministically using string character comparison
        if (opA.text > opB.text) {
          transformed.position += opB.text.length;
        }
      }
    } else if (opA.type === 'INSERT' && opB.type === 'DELETE') {
      if (opA.position > opB.position) {
        // Shift position index left if concurrent deletion occurred prior
        const shift = Math.min(opA.position - opB.position, opB.text.length);
        transformed.position -= shift;
      }
    } else if (opA.type === 'DELETE' && opB.type === 'INSERT') {
      if (opA.position >= opB.position) {
        transformed.position += opB.text.length;
      }
    } else if (opA.type === 'DELETE' && opB.type === 'DELETE') {
      if (opA.position > opB.position) {
        if (opA.position >= opB.position + opB.text.length) {
          transformed.position -= opB.text.length;
        } else {
          // Partial overlapping deletion boundaries
          const remainingLength = opA.position - opB.position;
          transformed.position = opB.position;
          transformed.text = opA.text.substring(remainingLength);
        }
      }
    }

    return transformed;
  }
}

// In-Memory Collaboration Server State Manager for a Document Session
export class CollaborationSession {
  private documentBuffer: string;
  private serverVersion: number;
  private historyLog: Array<{ version: number; op: Operation }>;

  constructor(initialText: string) {
    this.documentBuffer = initialText;
    this.serverVersion = 0;
    this.historyLog = [];
  }

  public getDocumentState(): string {
    return this.documentBuffer;
  }

  public getVersion(): number {
    return this.serverVersion;
  }

  /**
   * Applies an incoming client edit to the document, transforming it against historical edits if needed
   * @param clientOp The operation submitted by the client
   * @param clientBaseVersion The server version the client started editing from
   */
  public applyOperation(clientOp: Operation, clientBaseVersion: number): number {
    let transformedOp = { ...clientOp };

    // Traverse the history log to catch up and transform the operation
    for (let i = clientBaseVersion; i < this.historyLog.length; i++) {
      const historicalEntry = this.historyLog[i];
      transformedOp = OTEngine.transform(transformedOp, historicalEntry.op);
    }

    // Apply the transformed operation to the in-memory document state
    this.executeOp(transformedOp);
    
    // Commit the transformed operation to the revision log
    this.serverVersion++;
    this.historyLog.push({ version: this.serverVersion, op: transformedOp });

    return this.serverVersion;
  }

  private executeOp(op: Operation): void {
    if (op.type === 'INSERT') {
      this.documentBuffer = 
        this.documentBuffer.substring(0, op.position) + 
        op.text + 
        this.documentBuffer.substring(op.position);
    } else if (op.type === 'DELETE') {
      this.documentBuffer = 
        this.documentBuffer.substring(0, op.position) + 
        this.documentBuffer.substring(op.position + op.text.length);
    }
  }
}

6. Scaling Challenges & System Bottlenecks

1. Stateful WebSocket Gateway Fleet Scaling

Managing $10,000,000$ concurrent WebSocket connections is memory-bound. If we use standard JVM thread-per-connection patterns, the thread stack memory footprint will quickly exhaust system resources.

  • The Mitigation: Non-Blocking epoll Networking Engine. The WebSocket Fleet is built using asynchronous, event-driven, non-blocking I/O frameworks (e.g., Netty, Project Loom, or Rust Tokio).
  • Socket Pruning & Heartbeats: Sockets send a low-payload ping frame every $30$ seconds. Sockets that fail to respond within a $5$-second grace window are aggressively closed to free up system descriptors.

2. High-Traffic Document Hotspots (The "Mega-Doc" Bottleneck)

When a high-traffic live document (such as a breaking news release or a celebrity sharing a document with thousands of viewers) is edited, consistent hashing will route all users to a single Collaboration Server node. This node will quickly experience CPU bottlenecking and network card saturation.

  • The Mitigation: Reader-Writer Tier Separation.
    • Editors connect directly to the primary, write-authorized Collaboration Server.
    • Standard viewers connect to a highly distributed WebSocket Read Fleet.
    • The write-authorized node aggregates and batch-broadcasts operations downstream to a distributed Redis Pub/Sub Mesh or Apache Kafka partition cluster.
    • Read nodes subscribe to this mesh, distributing downstream network traffic away from the primary editing server.
graph TD
    Editors[Editing Users] -->|Stateful WS Writes| CollabWriteNode[Primary Stateful Write Node]
    CollabWriteNode -->|1. Stream Transformed Ops| RedisPubSub[(Redis Pub/Sub Mesh)]
    
    RedisPubSub -->|2. Event Broadcast| WSReadFleet1[Read Node Fleet 1]
    RedisPubSub -->|2. Event Broadcast| WSReadFleet2[Read Node Fleet 2]
    
    WSReadFleet1 --> ViewersGroup1[Viewer Group 1]
    WSReadFleet2 --> ViewersGroup2[Viewer Group 2]

3. Log Compaction & Recovery Latency

Replaying hundreds of thousands of historical operations from Cassandra to load a document would saturate database connections and delay page rendering.

  • The Mitigation: Compaction Workers & S3 Snapshots.
    • A background compaction worker processes document changes in batches.
    • The worker merges the oldest operations into a single static JSON document structure, applies gzip compression, and uploads the snapshot to AWS S3.
    • The PostgreSQL metadata table is updated to reference this latest S3 URL.
    • Operation logs in Cassandra prior to the compacted version are pruned. When a user opens a document, the client loads the latest compressed snapshot from S3 in a single point lookup, and replays only the recent operations since that snapshot.

7. Technical Trade-offs & Consistency Models

1. Sticky Consistent Hashing vs. Multi-Region Active-Active Coordination


2. Floating-Point Precision Blowout in CRDTs


3. epoll Egress Optimization


8. Resilience & Failure Scenarios

1. Collaboration Server Node Crashes

If a stateful Collaboration Server node crashes mid-session, all active WebSocket connections are terminated, and the in-memory operations queue is lost.

  • Failover Recovery Plan:
    • The Load Balancer detects the node crash via periodic health checks and re-maps the consistent hashing ring.
    • When disconnected clients attempt to reconnect, their requests are routed to a healthy backup Collaboration Server node.
    • The backup node queries PostgreSQL to retrieve the latest s3_snapshot_url and corresponding version number.
    • It pulls the snapshot from S3 and fetches any outstanding operations from Cassandra since that snapshot version.
    • Rebuilding the active in-memory state is completed in under 2 seconds, and synchronization resumes with minimal disruption to the user.

2. Late Offline Operation Sync (Slow Clients)

If a user edits a document offline on a long flight and reconnects hours later, their operations will arrive at the server late, with stale version numbers.

  • The Mitigation: Operation History Buffers.
    • Under the OT model, the Collaboration Server maintains a circular Operation History Buffer storing the past $1000$ operations.
    • If a late client reconnects and submits changes based on a stale version (e.g., version $150$ when the server is currently at version $180$), the server transforms the late edits against the missing operations in the history buffer.
    • If the client's base version is too old and falls outside the history buffer, the server rejects the operations.
    • It sends a sync_error frame back to the client, triggering a Git-style three-way diff merge that displays conflict-resolution prompts to the user.

9. Candidate Verbal Script (Interview Guide)

A high-fidelity transcript of how an L6/L7 candidate navigates the stateful real-time editor design during a system design interview:

Interviewer: "How would you design a highly scalable, real-time collaborative editor like Google Docs that handles concurrent edits without document corruption?"

Candidate: "I will architect a stateful collaboration system that leverages two primary layers: a stateful, persistent WebSocket Gateway Fleet to handle concurrent client connections, and a cluster of Collaboration Servers that manage the document state.

To guarantee that all concurrent edits converge on the exact same state without conflicts, I will use the Operational Transformation (OT) algorithm. Instead of sending full document payloads over the wire, clients will stream lightweight document_edit frames containing the specific operation—INSERT or DELETE—and its target index.

To avoid expensive distributed locks, I will use Consistent Hashing at the Load Balancer, hashing on the doc_id. This guarantees that all active co-editors working on a specific document are routed to the exact same physical Collaboration Server Node. This node sequences and applies the incoming edits in its local queue."

Interviewer: "What happens if User A inserts a character at position 5, and User B simultaneously deletes the character at position 4? How does the server transform these operations?"

Candidate: "If the server decides User A's insert operation ($O_A$) executes first, it applies the insertion of the new character at index 5. The document version is incremented.

When User B's delete operation ($O_B$) arrives, the server transforms it. Because User A's insert happened after User B's delete target (index 4), User B's delete position remains unaffected. The server applies $T(O_B, O_A) = \text{DELETE}(4)$.

However, when User A's insert is broadcasted to User B's client, the client has already deleted the character at position 4. This deletion shifted all subsequent characters left by 1 index.

Therefore, the server transforms User A's operation before broadcasting it: the insertion index is shifted left by 1. Transformed $T(O_A, O_B) = \text{INSERT}(4)$. Both clients apply the transformed operations and converge on the exact same character sequence."

Interviewer: "If a document is edited millions of times, won't replaying the entire history of operations to load the document become extremely slow?"

Candidate: "Yes, loading a document by replaying millions of historical character edits from Cassandra would take minutes and saturate server memory.

To prevent this, we implement Snapshot Compaction. We run background compaction workers that periodically read raw operation logs from Apache Cassandra, merge them into a single compiled document state, and save a static JSON snapshot to AWS S3 every 100 operations.

When a traveler opens a document, the server loads the latest S3 snapshot in a single point-lookup and replays only the recent operations since that snapshot. This keeps document load times under $200\text{ms}$ globally."

Want to track your progress?

Sign in to save your progress, track completed lessons, and pick up where you left off.