System Design: Designing a Distributed File System (HDFS/GCS Style)

Mental Model

Scaling storage to petabytes across thousands of commodity servers requires decoupling control operations from high-throughput data streams. A master-worker distributed file system isolates metadata organization onto a highly available, memory-optimized NameNode cluster while streaming raw, block-fragmented data payloads directly to and from rack-aware, autonomous DataNodes.

Requirements and System Goals

To design a distributed file system capable of driving high-throughput analytics pipelines (e.g. MapReduce, Spark, or Large Language Model training ingestion), we establish clear capacity limits and latency bounds.

1. Functional Requirements

Hierarchical File System Namespace: Support standard directory structures, file creation, deletion, rename, and listing.
Large Block-Based Ingestion: Automatically partition files into massive, fixed-size contiguous blocks.
Rack-Aware State Replication: Automatically distribute block copies across distinct hardware racks to survive raw network switch or node failures.

2. Non-Functional Requirements & Performance Budgets

High Read/Write Throughput: Maximize sequential data transfer bandwidth. Reading a 100 Gigabyte (GB) file must stream at a composite rate of greater than 10 Gbps over clustered interfaces.
Fault Tolerance & Durability: Survive the concurrent failure of entire server racks without data loss or pipeline interruption.
Metadata Lookup Latency: The NameNode must resolve file-to-block routing queries inside a P99 budget of less than 10ms.
Write Pipeline Latency: Acknowledging a block write to the client must return in less than 50ms once successfully pipelined to replicas.

3. Back-of-the-Envelope Estimation: Metadata Memory Footprint

Unlike raw data, the HDFS NameNode holds the entire namespace structure and block-location map directly in its JVM heap to keep lookups sub-millisecond. We calculate the NameNode memory limits under massive scale:

Target Cluster Capacity: 100 Petabytes (PB).
Block Configuration: Fixed size of 128 Megabytes (MB) per block.
- Larger block sizes (e.g., 128MB or 256MB) are chosen specifically to minimize the metadata records stored in the NameNode memory, preventing GC pauses.
Total Blocks Count: $100 \text{ PB} \div 128 \text{ MB} = 100,000,000 \text{ Gigabytes} \div 128 \text{ Megabytes} \approx 819,200 \text{ blocks}$.
Replication Factor: $3\times$ replication.
- Total physical blocks tracked in the cluster = $819,200 \times 3 = 2,457,600 \text{ physical blocks}$.
Metadata Footprint Sizing:
- On average, each block metadata record (Block ID, list of hosting DataNodes, generation stamp) requires 150 bytes of memory.
- Directory and file namespace objects require 150 bytes per inode.
- Assume our file system stores $10,000,000$ files across directories.
- Memory required for blocks: $2,457,600 \times 150 \text{ bytes} \approx 368 \text{ Megabytes (MB)}$.
- Memory required for namespace inodes: $10,000,000 \times 150 \text{ bytes} \approx 1.5 \text{ Gigabytes (GB)}$.
- Total NameNode JVM Heap footprint: $\approx 1.87 \text{ GB}$. This scales linearly; storing 1 Billion files requires a dedicated, robust physical master host equipped with at least 150 GB of highly optimized ECC RAM to prevent JVM garbage collection thrashing.

API Interfaces and Service Contracts

Distributed file system operations separate metadata leases from direct DataNode block pipelines.

1. Initiate File Write & Obtain Lease

POST /api/v1/files/open

Request Payload:

{
  "filePath": "/analytics/logs/2026-05-31/raw_events.csv",
  "blockSizeBytes": 134217728,
  "replicationFactor": 3,
  "clientHost": "client-worker-99ab"
}

Response Payload (201 Created):

{
  "filePath": "/analytics/logs/2026-05-31/raw_events.csv",
  "leaseToken": "lease_token_019a-88cd",
  "blockSizeBytes": 134217728,
  "firstBlock": {
    "blockId": "blk_9981a-uuid7",
    "generationStamp": 1,
    "targetDataNodes": [
      {
        "hostName": "datanode-rack1-01.corp.internal",
        "ipAddress": "10.0.1.12"
      },
      {
        "hostName": "datanode-rack1-02.corp.internal",
        "ipAddress": "10.0.1.13"
      },
      {
        "hostName": "datanode-rack2-01.corp.internal",
        "ipAddress": "10.0.2.12"
      }
    ]
  }
}

2. Register Block Completion

POST /api/v1/blocks/complete

Request Payload:

{
  "sagaId": "blk_9981a-uuid7",
  "leaseToken": "lease_token_019a-88cd",
  "bytesWritten": 134217728,
  "checksum": "sha256-a1928bc7...",
  "confirmedNodes": [
    "10.0.1.12",
    "10.0.1.13",
    "10.0.2.12"
  ]
}

Response Payload (200 OK):

{
  "blockState": "FINALIZED",
  "nextBlockReady": true
}

High-Level Design and Visualizations

Decoupling NameNode control flows from the direct DataNode client streams prevents master networking starvation under heavy bulk reads.

1. Active-Standby NameNode Cluster Layout

To ensure continuous metadata service availability and prevent Split-Brain states, we utilize ZooKeeper active election alongside shared EditLog journals.

graph TD
    subgraph Client
        C[HDFS Client]
    end

    subgraph NameNode High Availability Cluster
        C -->|1. Get Block Metadata| NN_Active[Active NameNode]
        NN_Standby[Standby NameNode]
        
        %% ZooKeeper Fencing
        ZKFC_A[ZooKeeper Failover A] <-->|Monitor Heartbeat| NN_Active
        ZKFC_B[ZooKeeper Failover B] <-->|Monitor Heartbeat| NN_Standby
        
        ZKFC_A <-->|Active Lock| ZK[(ZooKeeper Consensus)]
        ZKFC_B <-->|Active Lock| ZK
        
        %% Shared Journal
        NN_Active -->|2. Write EditLogs| JN[(JournalNode Cluster)]
        NN_Standby -.->|3. Tail EditLogs| JN
    end
    
    subgraph Storage Rack 1
        C -->|4. Read/Write Block Direct| DN_A[(DataNode A)]
        C -->|4. Read/Write Block Direct| DN_B[(DataNode B)]
    end
    
    subgraph Storage Rack 2
        C -->|4. Read/Write Block Direct| DN_C[(DataNode C)]
    end
    
    DN_A <-->|Heartbeats & Block Reports| NN_Active
    DN_B <-->|Heartbeats & Block Reports| NN_Active
    DN_C <-->|Heartbeats & Block Reports| NN_Active

2. Block Write Replication Pipeline Sequence

HDFS clients stream block payloads directly to DataNodes in a sequential, pipelined packet queue.

sequenceDiagram
    autonumber
    participant Client as HDFS Client
    participant DN1 as DataNode A (Rack 1)
    participant DN2 as DataNode B (Rack 1)
    participant DN3 as DataNode C (Rack 2)

    Client->>DN1: Stream Packet (64 KB Chunk)
    DN1->>DN2: Pipeline Packet (Forward)
    DN2->>DN3: Pipeline Packet (Forward)
    
    Note over DN3: Verify Packet Checksum
    DN3-->>DN2: Acknowledge Packet (ACK)
    DN2-->>DN1: Acknowledge Packet (ACK)
    DN1-->>Client: Acknowledge Packet (ACK)

Low-Level Design and Schema Strategies

To support sub-millisecond indexing, we represent NameNode in-memory tables and block allocations clearly.

1. NameNode Memory Struct Representation

Internally, the NameNode maps files to blocks using time-sortable memory hierarchies.

{
  "inodes": {
    "/analytics/logs/2026-05-31/raw_events.csv": {
      "inodeId": "inod_019a-99cd",
      "owner": "hadoop_analytics",
      "permissions": "rwxr-xr-x",
      "fileSizeBytes": 268435456,
      "blocksList": [
        {
          "blockId": "blk_9981a-uuid7",
          "blockSize": 134217728,
          "generationStamp": 1,
          "replicas": ["10.0.1.12", "10.0.1.13", "10.0.2.12"]
        },
        {
          "blockId": "blk_9982b-uuid7",
          "blockSize": 134217728,
          "generationStamp": 1,
          "replicas": ["10.0.1.12", "10.0.2.12", "10.0.2.13"]
        }
      ]
    }
  }
}

2. EditLog Journal Persistent Layout

Because in-memory state is volatile, every metadata transaction (creating a file, adding a block, renaming a folder) is appended to a sequential persistent journal (EditLog) on disk.

# EditLog transactional representation
OP_START_LOG_SEGMENT txid=1029881
OP_ADD filepath="/analytics/logs/2026-05-31/raw_events.csv" owner="hadoop" client="10.0.5.21" txid=1029882
OP_ALLOCATE_BLOCK blockid="blk_9981a-uuid7" txid=1029883
OP_CLOSE filepath="/analytics/logs/2026-05-31/raw_events.csv" length=268435456 txid=1029884

During startup, the NameNode loads a snapshot (FSImage) and replays the outstanding EditLog transaction logs to reconstruct the entire namespace in RAM, ensuring zero transaction loss.

Scaling and Operational Challenges

1. NameNode JVM Garbage Collection Exhaustion

As the file count grows to hundreds of millions, the sheer density of metadata objects in JVM heap creates massive garbage collection overhead.

The Danger: When a Stop-the-World GC pause triggers, the NameNode freezes execution for up to 30 seconds. DataNodes stop receiving master ping responses, assume the NameNode is dead, and begin a chaotic failover sequence.
Staff Mitigation:
1. Enforce a strict Minimum Block Size (e.g. 128MB or 256MB) to limit the total block metadata count.
2. Implement Namespace Federation. Instead of one NameNode, partition the file system directory trees across multiple independent NameNodes (e.g., /user managed by NameNode A, /analytics managed by NameNode B), utilizing ViewFS client mount tables to present a unified directory mapping.

2. Rack-Aware Replication Placement Math

To achieve maximum fault tolerance, HDFS distributes block replicas across different physical racks using a rack-aware algorithm:

The Placement Strategy:
- Replica 1: Placed on a local DataNode in the same rack as the client.
- Replica 2: Placed on a DataNode in a physically distinct remote rack.
- Replica 3: Placed on a different DataNode in the same remote rack as Replica 2.
The Mathematical Benefit: Placing replicas across two racks rather than three preserves network switch bandwidth (since we only cross the core datacenter WAN switch once during the write pipeline) while guaranteeing that the block survives a total rack power loss or top-of-rack (TOR) switch blowout.

Architectural Trade-offs and Replication Decisions

Choosing how to protect data against node failure dictates storage costs and sequential read throughput budgets.

Operational Dimension	$3\times$ Block Replication	Erasure Coding (Reed-Solomon 6+3)
Storage Overhead	Massive ($200%$ storage cost premium)	Low ($50%$ storage cost premium)
Write Pipeline Complexity	Low (Simple sequential TCP forwarding)	High (Heavy mathematical encoding computations)
CPU Saturation	Negligible	High (Constant parity bit calculation)
Read Recovery Speed	Fast (Direct read from active healthy replica)	Slow (Must read 6 remaining fragments to rebuild)

Failure Modes and Fault Tolerance Strategies

1. ZooKeeper Active-Fencing & Split-Brain Mitigation

Under a network partition where the Active NameNode and Standby NameNode become isolated, both might assume they are the only active leader. If both accept client writes, the EditLog journals will permanently diverge, destroying file system integrity.

The Failover Battle: ZooKeeper Failover Controllers (ZKFC) monitor active states. If the WAN sever splits, ZKFC B detects that ZKFC A's ZooKeeper active lease has expired, and attempts to promote Standby NameNode B to Active.
Staff Mitigation: Enforce Strict Active-Fencing. Before Standby NameNode B is promoted to Active, ZKFC B must execute an SSH fencing command to log into the original Active NameNode A and execute a hard kill -9 process command, or trigger a physical IPMI power fencing command to cut power to Node A's motherboard, ensuring only one Active NameNode can possibly exist in the datacenter.

2. DataNode Checksum Corruption & Bit Rot

Under prolonged storage, silent disk corruption (bit rot) can alter blocks on DataNodes without throwing hard OS hardware exceptions.

Mitigation: DataNodes store a block-checksum file (.meta) for every block. When a client reads a block, it validates the payload checksum against the meta file. If a mismatch is detected, the client rejects the payload, fetches the block from a healthy replica, and alerts the NameNode. The NameNode schedules a background replication job to overwrite the corrupt DataNode block from a healthy copy, self-healing the cluster.

Staff Engineer Perspective

Production Readiness Checklist

Before moving a distributed file cluster into production:

IPMI Power Fencing configured: Hardware fencing controllers are verified to cut power to partitioned NameNode hosts automatically during failovers.
Federated directory ViewFS configured: Large directory mounts are split across multiple namespace NameNodes.
DataNode Disk Health thresholds active: DataNodes are configured to take disks offline automatically if local write checksum failures exceed 1%.
Rack Topology mappings updated: Datacenter node configurations reflect correct network switches to ensure rack-aware replication works.

Verbal Script

Interviewer: "How would you design a distributed file system to store petabytes of data, and how would you protect the system from master name node split-brain failures?"

Candidate: "To design a petabyte-scale distributed file system, I would adopt a decoupled master-worker architecture, similar to HDFS.

The control operations reside in a centralized, memory-optimized NameNode, which tracks the hierarchical namespace, folder inodes, and file-to-block routing maps directly in its JVM heap for sub-millisecond lookups. The raw file payloads are divided into massive, fixed-size blocks—such as 128MB—to keep metadata counts low, and stored directly on thousands of commodity DataNodes.

To read or write data, the client performs a fast handshake with the NameNode to retrieve block locations, and then streams the block data directly to the DataNodes in a pipelined packet queue. This prevents the NameNode from becoming a network or disk bottleneck.

To ensure high availability and prevent a catastrophic Split-Brain state where two NameNodes concurrently act as Active leaders, I would construct a ZooKeeper-backed High-Availability cluster. We run ZooKeeper Failover Controllers on both master hosts.

If a network partition occurs and the Standby NameNode is promoted, the failover controller must enforce strict active-fencing before the Standby is allowed to accept writes. It issues a remote SSH fencing command to kill the original Active NameNode process, or triggers a physical IPMI motherboard power fencing command to cut power to the host.

This guarantees that only a single NameNode is active, protecting our transaction journal EditLogs and shared storage folders from data corruption."