System Design: Designing an Online Judge (LeetCode/HackerRank)

Designing a secure and scalable Online Judge (like LeetCode, HackerRank, or Codeforces) is one of the most high-risk systems engineering tasks. The core challenge lies in running untrusted, arbitrary user code on your infrastructure.

A single malicious submission (such as import os; os.system("rm -rf /") or a fork bomb) can crash compilation hosts, steal proprietary test cases, or turn your compute nodes into a botnet.

This case study analyzes the architecture of a secure, highly scalable online judge platform, detailing asynchronous worker queue patterns, AWS Firecracker micro-VM isolation layers, cgroups resource enforcement, and fast test-case distribution.

1. Requirements & Core Constraints

Architecting an online judge requires balancing rapid feedback for users with strong security isolation boundaries.

Functional Constraints

Multi-Language Code Compilation: Support compiling and executing code written in compiled (C++, Java) and interpreted (Python, JavaScript) languages.
Test Case Verification: Compare user execution outputs against extensive pre-configured test cases, verifying for correctness (Accepted, Wrong Answer).
Execution Resource Gating: Enforce strict execution limits: Time Limit Exceeded (TLE) and Memory Limit Exceeded (MLE).
Real-Time Progress Tracking: Provide users with real-time feedback (e.g., "Running Test Case 15/100") during execution.

Non-Functional SLAs

High Throughput Compilation: Support processing $10,000,000$ (10 Million) code submissions per day, managing sudden bursts during active coding contests.
Asynchronous Processing Latency: Return compilation and run outcomes to the user within $\le 3\text{ seconds}$ of submission for standard runs.
Strong Code Isolation: Sandboxed code must have zero access to the host's file system, internal private network, or parent system processes.
Highly Concurrent Queue Management: Ensure contest submissions do not suffer from head-of-line blocking caused by long-running or slow submissions.

Back-of-the-Envelope Estimation

1. Ingestion and Execution Throughput

Daily Submission Count: $10,000,000$ submissions per day.
Average QPS: $$\text{Average Ingest QPS} = \frac{10,000,000 \text{ submissions}}{86,400 \text{ seconds}} \approx 115 \text{ submissions/sec}$$
Peak Scaling QPS (Contests): System must scale to handle a $10\times$ surge during contest starts: $$\text{Peak Ingest QPS} = 115 \times 10 \approx 1,150 \text{ submissions/sec}$$
Test Case Execution Load: Assume an average problem has 100 test cases: $$\text{Peak Test Executions} = 1,150 \text{ submissions/sec} \times 100 \approx 115,000 \text{ runs/sec}$$

2. Compute Node Allocation Sizing

Execution Duration: Each test case execution runs for an average of $200\text{ms}$ in the sandbox.
Concurrent Executions at Peak: $$\text{Concurrent Executions} = 115,000 \text{ runs/sec} \times 0.2 \text{ seconds} = 23,000 \text{ active sandboxes}$$
Sandbox CPU/Memory Allocation: Each sandbox is allocated $0.5 \text{ vCPU}$ and $256 \text{ MB}$ of RAM.
Total Peak Server Capacity: $$\text{Total Peak vCPUs} = 23,000 \times 0.5 = 11,500 \text{ vCPUs}$$ $$\text{Total Peak RAM} = 23,000 \times 256 \text{ MB} \approx 5.75 \text{ TB of RAM}$$ (This compute fleet is managed dynamically using autoscaling Kubernetes clusters of memory-optimized EC2 instances).

2. API Design & Core Contracts

The online judge API provides endpoints to submit code and receive real-time execution feedback.

1. Submit Code for Evaluation

POST /api/v1/submissions Dispatched by the web app when a user clicks the "Submit" button.

Request Headers:

Content-Type: application/json
Authorization: Bearer <JWT_TOKEN>

Request Payload:

{
  "problem_id": "prob_992",
  "language": "python3",
  "source_code": "def twoSum(nums, target):\n    dct = {}\n    for i, num in enumerate(nums):\n        diff = target - num\n        if diff in dct:\n            return [dct[diff], i]\n        dct[num] = i\n    return []",
  "compiler_flags": "-O3",
  "submission_timeout_ms": 2000
}

Response Payload (201 Created):

{
  "submission_id": "sub_44c3d2e1",
  "status": "QUEUED",
  "submitted_at": 1782236500000,
  "queue_position": 14
}

2. Stream Submission Run Status (WebSocket Connection)

WS /api/v1/submissions/{submission_id}/status Establishes a real-time WebSocket connection to receive live compilation progress.

Message Frame emitted by Server:

{
  "submission_id": "sub_44c3d2e1",
  "status": "RUNNING",
  "completed_test_cases": 45,
  "total_test_cases": 100,
  "current_runtime_ms": 12,
  "current_memory_bytes": 14200000
}

3. High-Level Design (HLD)

To run submissions safely and handle traffic spikes, the architecture separates submission ingestion from code execution using message brokers and worker fleets.

graph TD
    %% Ingestion
    User[User Browser] -->|1. Submit Code| Gateway[API Gateway]
    Gateway -->|2. Ingest Submission| WebServer[Web & Coordination Server]
    
    %% Queue & Buffering
    WebServer -->|3. Publish Job| QueueSystem[("Priority Queue Broker (RabbitMQ)")]
    
    %% Execution Fleet
    QueueSystem -->|4. Pull Job| WorkerFleet[stateless Judge Worker Fleet]
    
    %% Sandbox & Data
    subgraph SandboxWorker["Judge Worker Node (Bare Metal)"]
        WorkerFleet -->|5. Spin up| SandboxEngine["AWS Firecracker micro-VM Sandbox"]
        SandboxEngine <-->|6. Cache Test Cases| LocalCache[("Local SSD Test Cache")]
        S3Store[("S3 Test Case Store")] -->|7. Load Test Cases| LocalCache
    end
    
    %% Progress Tracking
    WorkerFleet -->|8. Push Live Updates| RedisPubSub[("Redis Pub/Sub")]
    RedisPubSub -->|9. WebSocket Push| WebServer
    WebServer -->|10. Render Progress| User
    
    %% Primary storage
    WorkerFleet -->|11. Write Outcome| DBPrimary[("PostgreSQL Submissions DB")]

    classDef database fill:#0d3b66,stroke:#f4d35e,stroke-width:2px,color:#fff;
    classDef cluster fill:#2e0f38,stroke:#f4d35e,stroke-width:2px,color:#fff;
    classDef client fill:#3d5a80,stroke:#293241,stroke-width:2px,color:#fff;
    
    class S3Store,DBPrimary,RedisPubSub,LocalCache database;
    class Gateway,WebServer,QueueSystem,WorkerFleet,SandboxEngine cluster;
    class User client;

End-to-End Architectural Workflows

1. Submission Ingestion and Priority Queue Buffering

Submission Entry: A user submits their code. The request is processed by the API Gateway and handled by the Web & Coordination Server.
Buffering: The server writes the submission metadata to the primary database and publishes a compilation job to the Priority Queue Broker (RabbitMQ).
Contest Prioritization: Jobs are assigned priority scores based on source context. Code submissions for live contests are assigned the highest priority to keep latency low under load.

2. Sandboxed Execution and Verification

Pulling Workload: An idle Judge Worker pulls the submission from the queue.
Sandbox Initialization: The worker spins up a fresh AWS Firecracker micro-VM in under $5\text{ms}$.
Local Test Case Cache: The sandbox retrieves the problem's inputs and expected outputs. To avoid network lag, the worker checks its Local SSD Test Cache. If there is a cache miss, it retrieves them from AWS S3 and caches them locally.
Execution and Monitoring: The worker compiles the code inside the micro-VM. It executes the compiled binary against the test inputs, monitoring runtimes and memory usage using Linux cgroups.
Progress Broadcast: As test cases complete, the worker broadcasts progress updates (e.g., "Run 30/100 Successful") to Redis Pub/Sub, which pushes updates to the client via WebSockets.
Persistence: Once execution completes, the final run results (Accepted, TLE, MLE, WA, Compile Error) are persisted to the PostgreSQL Submissions Database. The sandbox is then torn down, freeing all allocated resources.

4. Low-Level Design (LLD) & Data Models

Database Selection Rationale

An online judge requires strict data schemas for problems and test cases, alongside transactional guarantees for submissions.

Database	Architecture Model	Primary Role	System Justification
PostgreSQL	Relational SQL	Submission Ledger & Schema Store	Problem definitions, test case paths, and user submission histories require ACID compliance and foreign key safety.
AWS S3	Distributed Object Store	Test Case Storage	Large input and output files (up to several megabytes) are stored cheaply and securely in S3.

SQL Database Schema (Online Judge Metadata)

-- Core Problems Registry
CREATE TABLE problems (
    problem_id VARCHAR(64) PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    difficulty VARCHAR(32) NOT NULL, -- Easy, Medium, Hard
    time_limit_ms INT NOT NULL DEFAULT 2000, -- Hard limit
    memory_limit_bytes BIGINT NOT NULL DEFAULT 268435456, -- 256 MB default
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

-- Test Case Registry
CREATE TABLE problem_test_cases (
    test_case_id VARCHAR(64) PRIMARY KEY,
    problem_id VARCHAR(64) NOT NULL REFERENCES problems(problem_id) ON DELETE CASCADE,
    input_file_s3_path VARCHAR(512) NOT NULL, -- S3 URL to input payload
    expected_output_s3_path VARCHAR(512) NOT NULL -- S3 URL to expected outcome
);

-- Submissions Ledger
CREATE TABLE submissions (
    submission_id VARCHAR(64) PRIMARY KEY,
    user_id VARCHAR(64) NOT NULL,
    problem_id VARCHAR(64) NOT NULL REFERENCES problems(problem_id),
    language VARCHAR(32) NOT NULL,
    source_code TEXT NOT NULL,
    status VARCHAR(32) NOT NULL, -- QUEUED, COMPILING, RUNNING, COMPLETED
    verdict VARCHAR(32), -- AC, WA, TLE, MLE, RE, CE
    execution_time_ms INT,
    execution_memory_bytes BIGINT,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_submissions_user_problem ON submissions(user_id, problem_id);

5. Sandbox Security & Code Isolation

Running untrusted user code on your servers is highly risky. We must establish multiple defense layers to protect host infrastructure.

+--------------------------------------------------------+
|                 Bare Metal Host OS                     |
|  +--------------------------------------------------+  |
|  |           AWS Firecracker Micro-VM               |  |
|  |  +--------------------------------------------+  |  |
|  |  |            User Sandbox Environment        |  |  |
|  |  |  +--------------------------------------+  |  |  |
|  |  |  | - Linux Namespaces (Network Blocked) |  |  |  |
|  |  |  | - Linux Cgroups (RAM capped to 256MB)|  |  |  |
|  |  |  | - Seccomp Filter (Block sys_write)   |  |  |  |
|  |  |  +--------------------------------------+  |  |  |
|  |  +--------------------------------------------+  |  |
|  +--------------------------------------------------+  |
+--------------------------------------------------------+

1. AWS Firecracker Micro-VM Isolation

The Sandbox: Firecracker is an open-source virtualization technology designed for serverless architectures (like AWS Lambda). It combines the security and isolation properties of traditional virtual machines with the speed and resource efficiency of containers.
The Execution: Submissions run inside a dedicated micro-VM running a minimal Linux kernel. It boots in $< 5\text{ms}$ with zero network access and an isolated root file system mounted as read-only.

2. Resource Enforcements via Linux cgroups (Control Groups)

The cgroup: Linux cgroups allow us to limit, prioritize, and isolate the resource usage (CPU, memory, disk I/O, process count) of a collection of processes.
Memory Limit (MLE): We set a hard memory limit on the sandbox cgroup (e.g., memory.limit_in_bytes = 268435456 for 256MB). If the user code attempts to allocate memory beyond this, the OS kernel triggers the Out-Of-Memory (OOM) killer to terminate the process instantly, returning a Memory Limit Exceeded (MLE) verdict.
Time Limit (TLE): The worker registers a high-resolution OS timer (such as setitimer in Linux). If the process is still running when the timer expires, the kernel fires a SIGKILL signal to terminate the process, returning a Time Limit Exceeded (TLE) verdict.
Fork Bomb Mitigation: We set the maximum process count to a low threshold (e.g., pids.max = 1). If the user code tries to spawn a subprocess or execute a fork bomb (while(1) { fork(); }), the system call is blocked, returning a Runtime Error (RE) verdict.

3. System Call Gating using Seccomp Filters

The seccomp: Seccomp (secure computing mode) allows us to restrict the system calls a process can make.
The Rules: The sandbox is configured with a strict seccomp profile that blocks dangerous system calls (such as socket to access the network, mount, reboot, or ptrace). If the user code attempts to invoke a blocked system call, the kernel terminates the process instantly.

Compilable Python Sandbox Process Executor

Below is a Python process execution module illustrating sandbox initialization, cgroups resource gating, and time limits:

import subprocess
import os
import signal
import sys
import resource

class SandboxExecutor:
    def __init__(self, execution_timeout_sec: float, memory_limit_mb: int):
        self.timeout_sec = execution_timeout_sec
        self.memory_limit_bytes = memory_limit_mb * 1024 * 1024

    def limit_resources(self):
        # Configure CPU execution time limit (SIGXCPU will fire if exceeded)
        resource.setrlimit(
            resource.RLIMIT_CPU, 
            (int(self.timeout_sec) + 1, int(self.timeout_sec) + 2)
        )
        
        # Configure Virtual Memory Address space limit (MLE)
        resource.setrlimit(
            resource.RLIMIT_AS, 
            (self.memory_limit_bytes, self.memory_limit_bytes)
        )
        
        # Disable subprocess spawning to block fork bombs
        resource.setrlimit(resource.RLIMIT_NPROC, (1, 1))

    def execute_user_code(self, run_command: list) -> dict:
        try:
            # Execute compilation or runtime binary inside sandbox
            process = subprocess.Popen(
                run_command,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                preexec_fn=self.limit_resources, # Run constraints prior to exec
                text=True
            )
            
            try:
                stdout, stderr = process.communicate(timeout=self.timeout_sec)
                exit_code = process.returncode
                
                if exit_code == 0:
                    return {"verdict": "SUCCESS", "output": stdout}
                elif exit_code == -signal.SIGKILL or exit_code == -signal.SIGXCPU:
                    return {"verdict": "TLE", "reason": "Time Limit Exceeded"}
                else:
                    return {"verdict": "RE", "reason": stderr}
                    
            except subprocess.TimeoutExpired:
                process.kill()
                stdout, stderr = process.communicate()
                return {"verdict": "TLE", "reason": "Execution Timeout"}
                
        except Exception as e:
            return {"verdict": "CE", "reason": str(e)}

if __name__ == "__main__":
    # Test executor sandboxing
    executor = SandboxExecutor(execution_timeout_sec=1.5, memory_limit_mb=128)
    
    # Run a simple Python expression
    result = executor.execute_user_code([sys.executable, "-c", "print('Hello, CodeSprintPro Sandbox')"])
    print(f"Outcome: {result}")

6. Scaling Challenges & System Bottlenecks

Operating a public online judge under heavy load introduces severe bottlenecks. Here is how we mitigate them:

1. Head-of-Line Blocking during Active Contests

The Bottleneck: When a live contest starts, thousands of users submit code simultaneously. If several users submit highly inefficient code that runs for the full time limit (e.g., 2 seconds), these submissions can block the judge workers. This creates a bottleneck in the message queue, delaying other quick submissions and frustrating participants.
The Mitigation: Decoupled Multi-Priority Queues. We isolate contest traffic by using decoupled queues with distinct priorities in RabbitMQ:
- High-Priority Queue: Dedicated exclusively to active, live contests.
- Medium-Priority Queue: Processes standard, everyday practice submissions.
- Low-Priority Queue: Reserved for long-running, multi-file projects or offline test suites.
- The worker fleet is configured to poll queues dynamically, dedicating $80%$ of its worker threads to the high-priority queue. This ensures contest submissions are processed in sub-seconds, even under heavy practice load.

2. Disk I/O Saturation during Large Test-Case Scanning

The Bottleneck: Some complex problems require testing code against hundreds of megabytes of inputs. If hundreds of workers pull these large test cases from S3 concurrently, the S3 network path and local disk I/O on the workers can saturate rapidly, causing severe execution delays.
The Mitigation: Shared SSD Storage & Local Cache Layers.
- Each judge worker host has a fast, dedicated NVMe SSD cache that stores popular test cases locally.
- Test case metadata is stored with a unique SHA-256 hash. When a worker pulls a job, it checks the local cache. If the hash matches, it bypasses the network download entirely.
- The local cache is managed using a Least Recently Used (LRU) eviction policy to balance storage space with download efficiency.

7. Technical Trade-offs & Consistency Models

1. Docker Container Isolation vs. Micro-VM Isolation

2. Parallel vs. Sequential Test-Case Execution

8. Resilience & Failure Scenarios

1. sandbox Host Compromise (Worker Panic)

If a user exploits a zero-day kernel bug and successfully escapes the Firecracker micro-VM, they gain root access to the parent judge host.

Recovery Protocol: Each judge host node runs a security agent that monitors the host OS for anomalies (such as unauthorized outbound connections or file system writes). If an anomaly is detected, the agent triggers a node-level panic.
It immediately halts all active sandboxes, updates its status in the gateway to offline, and shuts down the host.
The coordination layer detects the node's offline status, marks the active submissions for re-evaluation in another isolated zone, and alerts the engineering team.

2. Queue Broker Saturation and Backpressure

If the RabbitMQ queue broker becomes saturated during a massive contest, submissions can accumulate rapidly, causing severe delays.

Recovery Protocol: The API Gateway implements an adaptive rate limiting strategy. If the queue length exceeds a set safety threshold, the gateway dynamically returns HTTP 429 (Too Many Requests) backpressure responses to practice queue users, prioritizing active contest submissions and keeping queue lengths manageable.

9. Candidate Verbal Script (Interview Guide)

Below is an exhaustive, verbatim transcript showing how a Staff Engineer candidate navigates the design of an online judge:

Interviewer: "How would you design a highly secure and scalable Online Judge system like LeetCode?"

Candidate: "I will architect a highly secure, asynchronously processed online judge platform designed around a Decoupled Worker-Sandbox Architecture.

To handle traffic bursts and live contests, the frontend API gateway will buffer all code submissions into a Priority Queue System using RabbitMQ.

Stateless Judge Workers will poll this queue, dedicating their resources to high-priority contest lanes.

To run user code safely, each worker will execute the submission inside a dedicated, isolated AWS Firecracker micro-VM running a minimal guest kernel with zero network access and a read-only root file system.

Inside the VM, we will enforce strict resource limits using Linux cgroups for memory and CPU time, and apply seccomp filters to block dangerous system calls. Once execution completes, the results are stored in PostgreSQL and progress is pushed to the client via WebSockets."

Interviewer: "What if a user submits a fork bomb designed to crash the system? How do you prevent it from affecting the host?"

Candidate: "We prevent fork bombs using multiple layers of resource limits inside the sandbox:

Linux cgroup Process Limits: We configure the sandbox cgroup with a hard process limit (pids.max = 1), preventing the user code from spawning any subprocesses or threads.
Seccomp Filters: We block the fork, vfork, and clone system calls entirely. If the user code tries to spawn a process, the system call is blocked and the kernel terminates the process instantly.
VM Isolation: Even if the user code manages to crash the guest kernel inside the micro-VM, it has zero impact on the host or other sandboxes, since each submission operates inside its own hardware-isolated memory space."