Mental Model
Sharding is like splitting a massive, unmanageable phonebook into 26 smaller phonebooks, one for each letter of the alphabet. As long as you know the starting letter (the Shard Key), finding a number is incredibly fast.
Read Replicas can scale your read traffic almost indefinitely. But what happens when your Write Traffic exceeds the capacity of your largest Primary server, or your dataset grows beyond what a single machine can hold, say 10 Terabytes? You have hit the vertical ceiling. You must Shard.
1. What is Sharding?
Sharding is the process of horizontally partitioning your data across multiple independent MySQL instances.
Instead of one massive users table on Server A, you put users 1-10,000 on Server A, users 10,001-20,000 on Server B, and so on.
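The range-based placement above can be sketched as a simple lookup. This is a minimal illustration, not a production router; the shard names and boundaries are hypothetical:

```python
# Range-based shard routing sketch. Each tuple is
# (inclusive upper bound of the user_id range, shard name).
SHARD_RANGES = [
    (10_000, "shard-a"),
    (20_000, "shard-b"),
    (30_000, "shard-c"),
]

def shard_for_user(user_id: int) -> str:
    """Return the shard whose range contains user_id."""
    for upper_bound, shard in SHARD_RANGES:
        if user_id <= upper_bound:
            return shard
    raise ValueError(f"user_id {user_id} is beyond the last range")

print(shard_for_user(9_500))   # shard-a
print(shard_for_user(10_001))  # shard-b
```

In practice the range table lives in configuration or a metadata service, so new ranges can be appended as the user base grows.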
2. Choosing the Shard Key
The Shard Key determines which server a specific row lives on. If you choose poorly, your system will fail.
Bad Shard Key: created_at
If you shard by date, all traffic for "today" will hit the same shard. You have created a massive Hotspot, and the other shards will sit idle.
Good Shard Key: tenant_id or user_id
If you shard by tenant_id (e.g., Slack sharding by workspace, or Shopify sharding by store), the data for a specific customer is co-located on a single shard. This keeps complex JOINs and single-tenant transactions working, because all the necessary data lives on the same physical machine.
3. Sharding Algorithms
Once you have a Shard Key, how do you map it to a specific server?
Algorithmic (Modulo) Sharding
Take the user_id, hash it, and apply a modulo operation: hash(user_id) % 4.
Pros: Simple, no lookup table required.
Cons: Adding a 5th server changes the modulo math for everyone. A key keeps its shard only when hash % 4 == hash % 5, which holds for roughly 20% of uniformly hashed keys, so you must migrate about 80% of your data to rebalance the cluster. (This is solved by Consistent Hashing.)
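The rebalancing cost can be checked empirically. This sketch counts how many of 100,000 uniformly hashed keys change shards when the cluster grows from 4 to 5 nodes:

```python
import hashlib

def shard_for(user_id: int, num_shards: int) -> int:
    # Use a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# Count keys whose shard assignment differs between 4 and 5 shards.
moved = sum(
    shard_for(uid, 4) != shard_for(uid, 5)
    for uid in range(100_000)
)
print(f"{moved / 100_000:.0%} of keys changed shards")  # roughly 80%
```

Consistent hashing shrinks this to roughly 1/N of the keys per added node, which is why it is the standard fix for modulo sharding.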
Directory-Based Sharding
Maintain a lookup table (e.g., in Redis or Zookeeper) that maps exactly which tenant_id lives on which shard.
Pros: You can move specific high-traffic "whale" customers to their own dedicated isolated servers.
Cons: The lookup table becomes a single point of failure and adds latency to every query.
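A directory-based router reduces to a key-value lookup. In this sketch a plain dict stands in for the real lookup store (Redis, Zookeeper, etc.), and the tenant IDs and shard names are hypothetical:

```python
# Directory-based routing sketch: an explicit tenant -> shard map.
DIRECTORY = {
    "acme-corp": "shard-1",
    "globex": "shard-2",
    "whale-customer": "shard-dedicated-7",  # big tenant pinned to its own node
}
DEFAULT_SHARD = "shard-1"

def route(tenant_id: str) -> str:
    """One extra lookup per query; the directory must be highly available."""
    return DIRECTORY.get(tenant_id, DEFAULT_SHARD)

print(route("whale-customer"))  # shard-dedicated-7
```

Moving a "whale" customer is just an update to this map plus a data copy, which is exactly the flexibility modulo sharding lacks.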
4. The Scatter-Gather Problem
If you shard by user_id, what happens when you need to run an admin query like SELECT * FROM users WHERE status = 'BANNED'?
Because the query doesn't include the Shard Key (user_id), the application doesn't know which server holds the data. It must send the query to all shards in parallel (Scatter), wait for all of them to respond, and merge the results in memory (Gather). This is incredibly slow and destroys the benefits of sharding.
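The scatter-gather flow can be sketched with in-memory lists standing in for the per-shard databases; the shard names and rows are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins for three independent shard databases.
SHARDS = {
    "shard-a": [{"id": 1, "status": "ACTIVE"}, {"id": 2, "status": "BANNED"}],
    "shard-b": [{"id": 3, "status": "BANNED"}],
    "shard-c": [{"id": 4, "status": "ACTIVE"}],
}

def query_shard(name: str, status: str) -> list[dict]:
    """Stand-in for running SELECT ... WHERE status = %s on one shard."""
    return [row for row in SHARDS[name] if row["status"] == status]

def scatter_gather(status: str) -> list[dict]:
    # Scatter: query every shard in parallel. Gather: merge in memory.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        results = pool.map(lambda name: query_shard(name, status), SHARDS)
    return [row for shard_rows in results for row in shard_rows]

print(scatter_gather("BANNED"))  # rows with id 2 and 3, from two shards
```

Note that the slowest shard gates the whole query, and the merge (plus any sorting or pagination) now happens in application memory.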
Practice Question
Scenario: You are sharding Twitter's backend. You shard the tweets table by user_id so that fetching a user's timeline is fast.
Question: How do you implement the "Global Search" feature where users can search for a hashtag across all tweets?
View Answer
You cannot use the sharded MySQL database for this without triggering a massive scatter-gather operation. For cross-shard querying and full-text search, the standard answer is the **CQRS pattern**. The sharded MySQL cluster acts as the Write store. You stream changes via CDC (Kafka/Debezium) into a dedicated search engine like **Elasticsearch**, which is explicitly designed to handle global indexes and distributed text search efficiently.
Technical Trade-offs: Architectural Decision
| Strategy | Scalability | Complexity | Operational Cost | Performance |
|---|---|---|---|---|
| Monolithic | Low | Low | Low | Fast (Local) |
| Microservices | Very High | High | High | Slower (Network) |
| Serverless | Elastic | Medium | Variable | Variable (Cold Starts) |
Production Readiness Checklist
Before deploying this architecture to a production environment, ensure the following Staff-level criteria are met:
- High Availability: Have we eliminated single points of failure across all layers?
- Observability: Are we exporting structured JSON logs, custom Prometheus metrics, and OpenTelemetry traces?
- Circuit Breaking: Do all synchronous service-to-service calls have timeouts and fallbacks (e.g., via Resilience4j)?
- Idempotency: Can our APIs handle retries safely without causing duplicate side effects?
- Backpressure: Does the system gracefully degrade or return HTTP 429 when resources are saturated?
Verbal Interview Script
Interviewer: "We are reaching the storage limit on our primary Postgres instance. How should we partition our multi-tenant SaaS application?"
Candidate: "Since it's a multi-tenant SaaS, the optimal strategy is tenant-based sharding. We use the tenant_id as the Shard Key. This ensures Data Locality: all the users, orders, and products for a specific tenant reside on the same physical shard. This is crucial because it allows us to continue using standard SQL JOINs and ACID transactions within the boundaries of a single tenant. To route the traffic, we implement a Directory-Based architecture using an API Gateway or a sharding-aware proxy (Vitess plays this role for MySQL; Citus is the closest Postgres equivalent) that inspects the JWT, extracts the tenant_id, looks up the shard mapping in a fast cache, and forwards the query to the correct database node."
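The routing path described in the script can be sketched end to end. Token verification is elided here: `claims` is assumed to be the already-verified JWT payload, and the shard map and DSNs are hypothetical:

```python
# Request-routing sketch: verified JWT claims -> cached shard map -> DSN.
SHARD_MAP_CACHE = {"tenant-42": "db-node-3"}  # e.g. warmed from Redis
DSNS = {"db-node-3": "mysql://db-node-3.internal:3306/app"}

def dsn_for_request(claims: dict) -> str:
    tenant_id = claims["tenant_id"]    # extracted from the JWT payload
    node = SHARD_MAP_CACHE[tenant_id]  # fast in-memory/cache lookup
    return DSNS[node]                  # forward the query to this node

print(dsn_for_request({"sub": "user-7", "tenant_id": "tenant-42"}))
```

In production the cache lookup falls back to the authoritative directory on a miss, and the proxy holds a connection pool per database node rather than building DSN strings per request.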