Cloud-Native Databases: Why the Log is the Database

Introduction & The Cloud-Native Database Shift

In traditional database systems (like standard PostgreSQL or MySQL), the database engine runs on a single server where CPU, memory, and disk storage are tightly coupled. In this monolithic paradigm, the database manages page files (typically 8 KB to 16 KB blocks of data) on local block storage. To guarantee durability (the "D" in ACID), every transaction write must write to two distinct places: the Write-Ahead Log (WAL) for crash recovery, and the modified data pages themselves (often flushed asynchronously via checkpointing).

When moved to cloud environments, this traditional model encounters a massive bottleneck: Write Amplification and Network I/O. Because cloud block storage (such as Amazon EBS) is attached over a virtual network rather than a physical bus, sending entire 8 KB pages over the network for every minor modification is highly inefficient.

Cloud-native databases, such as Amazon Aurora, solve this bottleneck by decoupling the compute layer (which runs query processing and transaction management) from the storage layer (which is implemented as a smart, distributed, log-structured service). In this architecture, the compute node does not write full data pages to storage. Instead, it writes only the Redo Log Records. The storage service receives these logs and materializes them into data pages in the background. In short: The Log is the Database.

Requirements and System Goals

Functional Requirements

Decoupled Compute and Storage Layers: The compute engine should handle query parsing, optimization, transaction coordination, and caching. The storage layer must be a distributed, self-healing, multi-tenant service dedicated strictly to log durability and page materialization.
Single-Writer, Multi-Reader Architecture: Support a single primary writer node that accepts read-write traffic, and up to 15 read-only replica nodes that share the same underlying storage volume.
Point-in-Time Recovery (PITR): The storage layer must retain a continuous log history, allowing the database to be restored to any arbitrary microsecond within the retention window.
Instantaneous Crash Recovery: Recovery from a compute node crash must be near-instantaneous. The system should avoid the traditional database recovery process of reading the entire WAL from disk to rebuild dirty pages.
Auto-scaling Storage Volume: The storage layer must automatically scale in chunks (e.g., 10 GB segments) up to 128 TB without manual provisioning.

Non-Functional Requirements

High Write Throughput: The system must minimize network egress from the compute node to storage. Network write amplification must be reduced by at least 90 percent compared to traditional databases.
Low Replication Lag: Read replicas must have a replication lag of less than 15 milliseconds by sharing the same physical storage and streaming log updates directly from the writer.
Fault Tolerance and Durability (High Availability): Storage must survive the complete loss of an entire Availability Zone (AZ) plus one additional storage node (an AZ+1 failure) without losing data, and survive the loss of an AZ without affecting write availability.
Strong Consistency for Writes: The writer node must maintain strict linearizable consistency for writes, while replicas must guarantee monotonic read consistency.

API Interfaces and Service Contracts

To enable decoupled compute-to-storage operations, we define gRPC-based interfaces for log shipping, page reading, and storage node administration.

1. Log Shipping API (Compute to Storage Node)

When a transaction commits, the compute node packages the redo logs and ships them to the storage nodes.

syntax = "proto3";

package database.storage.v1;

service StorageService {
  // Ship a batch of redo log records from Compute to a Storage Node
  rpc WriteRecords(WriteRecordsRequest) returns (WriteRecordsResponse);
  
  // Read a materialized page at a specific Log Sequence Number (LSN)
  rpc ReadPage(ReadPageRequest) returns (ReadPageResponse);
}

message RedoLogRecord {
  uint64 lsn = 1;             // Monotonically increasing Log Sequence Number
  uint64 transaction_id = 2;   // Transaction identifier
  uint32 page_id = 3;         // Target database page identifier
  uint32 offset = 4;          // Offset within the page
  bytes redo_data = 5;        // Physical or logical delta bytes
  uint32 record_type = 6;     // Operation type (e.g., INSERT, UPDATE, DELETE)
}

message WriteRecordsRequest {
  string volume_id = 1;
  string segment_id = 2;
  repeated RedoLogRecord records = 3;
  uint64 last_committed_lsn = 4; // High-water mark for replication tracking
}

message WriteRecordsResponse {
  bool success = 1;
  uint64 acknowledged_lsn = 2;   // Highest LSN stored durably by this node
  string error_message = 3;
}

2. Page Read API (Compute requesting from Storage Node)

When the compute node suffers a cache miss in its local buffer pool, it requests the specific data page from the storage layer.

message ReadPageRequest {
  string volume_id = 1;
  uint32 page_id = 2;
  uint64 requested_lsn = 3;   // Read version. Storage node must apply logs up to this LSN
}

message ReadPageResponse {
  uint32 page_id = 1;
  uint64 materialized_lsn = 2; // Actual LSN of the materialized page returned
  bytes page_data = 3;         // Complete 8KB or 16KB data page
}

High-Level Design and Visualizations

Decoupling compute from storage shifts the network and architectural topology. The following diagrams compare the traditional page-based architecture with the cloud-native log-based architecture.

Traditional vs. Cloud-Native Storage Architectures

graph TD
    subgraph Traditional Architecture (EBS / Block Storage)
        ComputeA[Compute Node: Query Engine + Buffer Pool]
        EBS_Primary[(Primary EBS Volume)]
        EBS_Replica[(Replica EBS Volume)]
        
        ComputeA -->|1. Write WAL Log | EBS_Primary
        ComputeA -->|2. Write 8KB Dirty Pages| EBS_Primary
        EBS_Primary -->|3. Mirror Block-level I/O| EBS_Replica
    end

    subgraph Cloud-Native Architecture (Aurora-Style)
        ComputeB[Compute Node: Query Engine + Buffer Pool]
        
        subgraph Decoupled Storage Layer (6-Way Replication across 3 AZs)
            subgraph AZ 1
                SN1[(Storage Node 1)]
                SN2[(Storage Node 2)]
            end
            subgraph AZ 2
                SN3[(Storage Node 3)]
                SN4[(Storage Node 4)]
            end
            subgraph AZ 3
                SN5[(Storage Node 5)]
                SN6[(Storage Node 6)]
            end
        end

        ComputeB -->|1. Ship Redo Log Records ONLY| SN1
        ComputeB -->|1. Ship Redo Log Records ONLY| SN2
        ComputeB -->|1. Ship Redo Log Records ONLY| SN3
        ComputeB -->|1. Ship Redo Log Records ONLY| SN4
        ComputeB -->|1. Ship Redo Log Records ONLY| SN5
        ComputeB -->|1. Ship Redo Log Records ONLY| SN6
    end

    style Traditional text-align:left
    style Cloud-Native text-align:left

Inside the Smart Storage Node

The storage node is not a dumb block device. It runs an asynchronous queue-based processing loop to receive logs, organize them, materialize pages, and garbage-collect old log records.

sequenceDiagram
    participant C as Compute Node
    participant R as Log Receiver (Storage Node)
    participant Q as In-Memory Update Queue
    participant C_Loop as Background Coalescing Loop
    participant D as Disk Storage (Segment Pages)

    C->>R: WriteRecordsRequest (Redo Log Records)
    R->>Q: Insert raw log records (ordered by LSN)
    R-->>C: WriteRecordsResponse (Ack LSN)
    
    Note over R,Q: Fast Path Complete (Write is Durable)

    loop Asynchronous Coalescing
        C_Loop->>Q: Pull records for Page X
        C_Loop->>D: Read old Page X base version
        C_Loop->>C_Loop: Apply redo log records to Page X
        C_Loop->>D: Write updated Page X version + New Base
    end

Low-Level Design and Schema Strategies

To support background page materialization, the storage node must track the relationships between log sequence numbers (LSNs), page IDs, and physical disk offsets.

Redo Log Record Structure

struct RedoLogHeader {
    uint64_t lsn;             // Unique monotonically increasing sequence number
    uint64_t tx_id;           // Transaction ID
    uint32_t page_id;         // Database page target (e.g., table block ID)
    uint16_t offset;          // Start offset of modification in page
    uint16_t data_len;        // Length of modified bytes
    uint8_t  op_type;         // Operation byte: 0x01 (Insert), 0x02 (Update), etc.
};

Storage Node Memory Structures

The storage node maintains an in-memory map of pages to their list of uncoalesced log records:

Page ID	Base Page LSN	Pending Log Records LSN List	Coalescing Status
`0x4092`	`100204`	`[100220, 100235, 100290]`	PENDING
`0x981A`	`100195`	`[100210, 100250]`	COALESCING
`0x00FF`	`100300`	`[]`	CLEAN

Compute Node Buffer Pool & Page Fetch Strategy

When the compute node requests page 0x4092 at LSN 100290:

The compute node checks its local buffer pool cache.
On a cache miss, it sends a ReadPageRequest with requested_lsn = 100290 to the storage layer.
The storage node retrieves the base page 0x4092 (representing state at LSN 100204), reads pending logs [100220, 100235, 100290] from disk/memory, applies the deltas to reconstruct the page image at LSN 100290, and returns the materialized page to the compute node.

Scaling and Operational Challenges: Calculations & Formulations

The primary performance victory of this cloud-native database design is the elimination of page-based write amplification. Let us calculate this difference mathematically.

Traditional Database Write Amplification (MySQL on EBS)

In a traditional database, modifying a single row requires writing the entire dirty database page (typically 16 KB) to the block storage subsystem, along with the write-ahead log (WAL) record (typically ~200 bytes). If we write at a rate of 1,000 transactions per second (tps), with each transaction modifying a single row on a distinct page:

$$\text{Total Write Egress} = (\text{Page Size} \times \text{TPS}) + (\text{WAL Size} \times \text{TPS})$$

$$\text{Total Write Egress} = (16,384 \text{ bytes} \times 1,000) + (200 \text{ bytes} \times 1,000) = 16.58 \text{ MB/sec}$$

If we account for mirroring and high availability (e.g., dual-AZ block replication), this write traffic is doubled:

$$\text{Total Network Write Traffic} = 16.58 \text{ MB/sec} \times 2 = 33.16 \text{ MB/sec}$$

Cloud-Native Database Write Amplification (Aurora-Style)

In the cloud-native decoupled database, the compute node does not write data pages. It only ships the redo log records to the storage nodes. The compute node ships these records to 6 storage nodes spread across 3 AZs. For the same workload of 1,000 transactions per second:

$$\text{Redo Record Size} \approx 250 \text{ bytes (including gRPC overhead)}$$

$$\text{Egress per Storage Node} = 250 \text{ bytes} \times 1,000 \text{ tps} = 250 \text{ KB/sec}$$

For 6-way replication, the total network write egress from the compute node is:

$$\text{Total Network Write Traffic} = 250 \text{ KB/sec} \times 6 = 1.50 \text{ MB/sec}$$

Network Bandwidth Reduction Percentage

Comparing the two systems:

$$\text{Traffic Reduction} = \left( 1 - \frac{\text{Cloud-Native Traffic}}{\text{Traditional Traffic}} \right) \times 100$$

$$\text{Traffic Reduction} = \left( 1 - \frac{1.50 \text{ MB/sec}}{33.16 \text{ MB/sec}} \right) \times 100 \approx 95.48%$$

By shipping logs instead of full dirty database pages, the cloud-native database reduces network write egress by over 95 percent. This dramatic reduction mitigates virtual network bandwidth bottlenecks in hypervisors and permits significantly higher write throughput.

Trade-offs and Architectural Alternatives

Decoupling compute and storage is not a silver bullet. It introduces key trade-offs in read latency, complexity, and network dependency.

Architectural Comparison Table

Feature / Dimension	Decoupled Shared-Storage (e.g., Amazon Aurora, Neon)	Shared-Disk (e.g., Oracle RAC)	Shared-Nothing Sharding (e.g., CockroachDB, Spanner)
Write Path Bottleneck	Compute network egress limit	Distributed lock manager limits	WAN latency during 2PC and Consensus
Read Latency (Cache Miss)	High (Requires network fetch from storage service)	Medium (SAN/NAS fiber channel latency)	Low to Medium (Local NVMe or remote partition fetch)
Storage Efficiency	Excellent (Replicas share the same volume via virtual views)	Excellent (Shared block storage)	Poor (Each shard replica maintains full data block duplicates)
Cross-Node Transactions	Trivial (Single primary writer coordinates locks in-memory)	Complex (Distributed Lock Manager coordinates across nodes)	Very Complex (Requires 2-Phase Commit over Consensus groups)
Replication Lag	Extremely Low (less than 15ms)	High (Page replication over block layer)	Zero lag (Consistent quorum reads)
Operational Complexity	High (Requires custom smart storage fabric management)	High (Requires physical SAN setup and clusterware)	Medium (Unified cluster binary, self-contained)

Key Trade-offs

Read Path Latency Tax: A page cache miss on a decoupled compute node requires an RPC to the storage service. In a traditional database, block storage page reads are highly optimized. Decoupled databases rely heavily on large compute memory pools to keep page cache hit ratios high (typically greater than 95 percent).
Compute-Scale Asymmetry: While we can scale read replicas to 15 nodes easily, write operations are still coordinated by a single primary compute writer node. If the application write workload exceeds the capacity of a single large compute instance, sharding must be introduced at the application layer.

Failure Modes and Fault Tolerance Strategies

Decoupling compute from storage shifts failure recovery dynamics from local disk repair to distributed network quorum resolution.

1. Compute Node Crashes (Fast Recovery)

In a traditional database, recovering from a crash requires reading the write-ahead log (WAL) from the last checkpoint to the end of the log, and re-applying all modifications to the data pages. This process can take minutes to hours.

Cloud-Native Strategy: When a compute writer node crashes, a read replica is promoted to writer near-instantaneously. The new writer node does not need to perform crash recovery because the smart storage nodes have already been applying redo logs to pages in the background. The database is available to accept writes in less than 30 seconds.

2. Storage Node Degradation and Quorums

Individual storage disks or network links fail frequently. The system uses a Write Quorum of 4-out-of-6 and a Read Quorum of 3-out-of-6.

AZ Outage Protection: The system replicates data 6 ways across 3 AZs (2 copies per AZ).
Write Availability: If an entire AZ goes offline (2 storage nodes lost), the writer can still achieve a quorum of 4-out-of-6 on the remaining 4 storage nodes in the other two AZs. Write availability is preserved.
Durability (AZ+1 Failure): If an entire AZ is lost (2 nodes) and a third storage node fails in another AZ, we have 3 surviving nodes. The read quorum of 3-out-of-6 allows the system to reconstruct the most recent state by querying the remaining 3 nodes, finding the highest LSN, and performing background repairs.

3. Epoch-Based Split-Brain Mitigation

If a network partition isolates the primary writer compute node, a replica node might be promoted to primary writer, resulting in two active writer nodes.

Mitigation: The storage nodes enforce an Epoch Ticket pattern. Every time a compute node registers as the primary writer, it obtains an epoch ticket from a consensus service (e.g., a Paxos-backed configuration store). Every write RPC to a storage node must include the current epoch number. If a storage node receives a write request with an epoch number less than the highest registered epoch, it rejects the write, immediately neutralizing the old writer.

Verbal Script

Interviewer: "Explain the architecture of a cloud-native database like Amazon Aurora, focusing on how it achieves high write throughput and instant crash recovery."

Candidate: "In traditional databases, the major bottleneck is network I/O and write amplification because the engine writes both the write-ahead log and full 8 KB to 16 KB data pages over the network to block storage.

To overcome this, a cloud-native architecture decouples compute from storage and shifts the paradigm to 'the log is the database.' The compute node handles query execution and transaction management, but writes only the redo log records over the network to a smart, distributed storage layer. It never writes dirty data pages.

The storage layer replicates these log records 6 ways across 3 availability zones, acknowledging writes as soon as 4 out of 6 nodes write the log to their in-memory update queue. In the background, storage nodes asynchronously coalesce these logs with old base pages to materialize modern page versions.

This architecture achieves two major goals: First, write throughput is high because network traffic from the compute node is reduced by over 90 percent. Second, crash recovery is near-instantaneous. Because storage nodes materialize pages in the background, a newly promoted compute writer node does not need to scan and replay logs from a checkpoint. It simply mounts the shared storage volume and immediately starts processing transactions."

Interviewer: "What happens if a page read request hits a storage node that has not finished coalescing the latest log records for that page?"

Candidate: "That scenario is handled by the versioned read protocol. When the compute node requests a page, it sends a ReadPageRequest containing the specific Page ID and the target LSN it needs.

If the storage node has pending log records in its queue for that page, it performs an on-the-fly coalescing operation. It reads the last materialized base page from its disk, applies the pending log records up to the requested LSN in memory, and returns the constructed page to the compute node.

In the background, the coalescing worker will eventually write this new materialized version to disk, updating the base page pointer and garbage-collecting the applied log records. This ensures that the compute node always receives a transactionally consistent view of the page, even if background coalescing is still catch-up processing."