Mental Model
Sharding is like splitting a massive, unmanageable phonebook into 26 smaller phonebooks, one for each letter of the alphabet. As long as you know the starting letter (the Shard Key), finding a number is incredibly fast.
Read Replicas can scale your read traffic almost indefinitely. But what happens when your Write Traffic exceeds the capacity of your largest Primary server, or your dataset grows beyond what a single machine can hold, say 10 Terabytes? You have hit the vertical ceiling. You must Shard.
1. What is Sharding?
Sharding is the process of horizontally partitioning your data across multiple independent MySQL instances.
Instead of one massive users table on Server A, you put users 1-10,000 on Server A, users 10,001-20,000 on Server B, and so on.
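The range-based placement above can be sketched as a simple lookup. This is a minimal illustration, not a production router; the shard names and boundaries are hypothetical:

```python
# Range-based shard routing sketch. Each tuple is
# (inclusive upper bound of the user_id range, shard name).
SHARD_RANGES = [
    (10_000, "shard-a"),
    (20_000, "shard-b"),
    (30_000, "shard-c"),
]

def shard_for_user(user_id: int) -> str:
    """Return the shard whose range contains user_id."""
    for upper_bound, shard in SHARD_RANGES:
        if user_id <= upper_bound:
            return shard
    raise ValueError(f"user_id {user_id} is beyond the last range")

print(shard_for_user(9_500))   # shard-a
print(shard_for_user(10_001))  # shard-b
```

In practice the range table lives in configuration or a metadata service, so new ranges can be appended as the user base grows.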
2. Choosing the Shard Key
The Shard Key determines which server a specific row lives on. If you choose poorly, your system will fail.
Bad Shard Key: created_at
If you shard by date, all traffic for "today" will hit the same shard. You have created a massive Hotspot, and the other shards will sit idle.
Good Shard Key: tenant_id or user_id
If you shard by tenant_id (e.g., Slack sharding by workspace, or Shopify sharding by store), the data for a specific customer is co-located on a single shard. This keeps complex JOINs and single-tenant transactions working, because all the necessary data lives on the same physical machine.
3. Sharding Algorithms
Once you have a Shard Key, how do you map it to a specific server?
Algorithmic (Modulo) Sharding
Take the user_id, hash it, and apply a modulo operation: hash(user_id) % 4.
Pros: Simple, no lookup table required.
Cons: Adding a 5th server changes the modulo math for everyone. A key keeps its shard only when hash % 4 == hash % 5, which holds for roughly 20% of uniformly hashed keys, so you must migrate about 80% of your data to rebalance the cluster. (This is solved by Consistent Hashing.)
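The rebalancing cost can be checked empirically. This sketch counts how many of 100,000 uniformly hashed keys change shards when the cluster grows from 4 to 5 nodes:

```python
import hashlib

def shard_for(user_id: int, num_shards: int) -> int:
    # Use a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# Count keys whose shard assignment differs between 4 and 5 shards.
moved = sum(
    shard_for(uid, 4) != shard_for(uid, 5)
    for uid in range(100_000)
)
print(f"{moved / 100_000:.0%} of keys changed shards")  # roughly 80%
```

Consistent hashing shrinks this to roughly 1/N of the keys per added node, which is why it is the standard fix for modulo sharding.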
Directory-Based Sharding
Maintain a lookup table (e.g., in Redis or Zookeeper) that maps exactly which tenant_id lives on which shard.
Pros: You can move specific high-traffic "whale" customers to their own dedicated isolated servers.
Cons: The lookup table becomes a single point of failure and adds latency to every query.
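A directory-based router reduces to a key-value lookup. In this sketch a plain dict stands in for the real lookup store (Redis, Zookeeper, etc.), and the tenant IDs and shard names are hypothetical:

```python
# Directory-based routing sketch: an explicit tenant -> shard map.
DIRECTORY = {
    "acme-corp": "shard-1",
    "globex": "shard-2",
    "whale-customer": "shard-dedicated-7",  # big tenant pinned to its own node
}
DEFAULT_SHARD = "shard-1"

def route(tenant_id: str) -> str:
    """One extra lookup per query; the directory must be highly available."""
    return DIRECTORY.get(tenant_id, DEFAULT_SHARD)

print(route("whale-customer"))  # shard-dedicated-7
```

Moving a "whale" customer is just an update to this map plus a data copy, which is exactly the flexibility modulo sharding lacks.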
4. The Scatter-Gather Problem
If you shard by user_id, what happens when you need to run an admin query like SELECT * FROM users WHERE status = 'BANNED'?
Because the query doesn't include the Shard Key (user_id), the application doesn't know which server holds the data. It must send the query to all shards in parallel (Scatter), wait for all of them to respond, and merge the results in memory (Gather). This is incredibly slow and destroys the benefits of sharding.
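The scatter-gather flow can be sketched with in-memory lists standing in for the per-shard databases; the shard names and rows are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins for three independent shard databases.
SHARDS = {
    "shard-a": [{"id": 1, "status": "ACTIVE"}, {"id": 2, "status": "BANNED"}],
    "shard-b": [{"id": 3, "status": "BANNED"}],
    "shard-c": [{"id": 4, "status": "ACTIVE"}],
}

def query_shard(name: str, status: str) -> list[dict]:
    """Stand-in for running SELECT ... WHERE status = %s on one shard."""
    return [row for row in SHARDS[name] if row["status"] == status]

def scatter_gather(status: str) -> list[dict]:
    # Scatter: query every shard in parallel. Gather: merge in memory.
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        results = pool.map(lambda name: query_shard(name, status), SHARDS)
    return [row for shard_rows in results for row in shard_rows]

print(scatter_gather("BANNED"))  # rows with id 2 and 3, from two shards
```

Note that the slowest shard gates the whole query, and the merge (plus any sorting or pagination) now happens in application memory.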
Practice Question
Scenario: You are sharding Twitter's backend. You shard the tweets table by user_id so that fetching a user's timeline is fast.
Question: How do you implement the "Global Search" feature where users can search for a hashtag across all tweets?
View Answer
You cannot use the sharded MySQL database for this without triggering a massive scatter-gather operation. For cross-shard querying and full-text search, the standard answer is the **CQRS pattern**. The sharded MySQL cluster acts as the Write store. You stream changes via CDC (Kafka/Debezium) into a dedicated search engine like **Elasticsearch**, which is explicitly designed to handle global indexes and distributed text search efficiently.
Technical Trade-offs: Architectural Decision
| Strategy | Scalability | Complexity | Operational Cost | Performance |
|---|---|---|---|---|
| Monolithic | Low | Low | Low | Fast (Local) |
| Microservices | Very High | High | High | Slower (Network) |
| Serverless | Elastic | Medium | Variable | Variable (Cold Starts) |
Production Readiness Checklist
Before deploying this architecture to a production environment, ensure the following Staff-level criteria are met:
- High Availability: Have we eliminated single points of failure across all layers?
- Observability: Are we exporting structured JSON logs, custom Prometheus metrics, and OpenTelemetry traces?
- Circuit Breaking: Do all synchronous service-to-service calls have timeouts and fallbacks (e.g., via Resilience4j)?
- Idempotency: Can our APIs handle retries safely without causing duplicate side effects?
- Backpressure: Does the system gracefully degrade or return HTTP 429 when resources are saturated?
Verbal Interview Script
Interviewer: "We are reaching the storage limit on our primary Postgres instance. How should we partition our multi-tenant SaaS application?"
Candidate: "Since it's a multi-tenant SaaS, the optimal strategy is tenant-based sharding. We use the tenant_id as the Shard Key. This ensures Data Locality: all the users, orders, and products for a specific tenant reside on the same physical shard. This is crucial because it allows us to continue using standard SQL JOINs and ACID transactions within the boundaries of a single tenant. To route the traffic, we implement a Directory-Based architecture using an API Gateway or a sharding-aware proxy (Vitess plays this role for MySQL; Citus is the closest Postgres equivalent) that inspects the JWT, extracts the tenant_id, looks up the shard mapping in a fast cache, and forwards the query to the correct database node."
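The routing path described in the script can be sketched end to end. Token verification is elided here: `claims` is assumed to be the already-verified JWT payload, and the shard map and DSNs are hypothetical:

```python
# Request-routing sketch: verified JWT claims -> cached shard map -> DSN.
SHARD_MAP_CACHE = {"tenant-42": "db-node-3"}  # e.g. warmed from Redis
DSNS = {"db-node-3": "mysql://db-node-3.internal:3306/app"}

def dsn_for_request(claims: dict) -> str:
    tenant_id = claims["tenant_id"]    # extracted from the JWT payload
    node = SHARD_MAP_CACHE[tenant_id]  # fast in-memory/cache lookup
    return DSNS[node]                  # forward the query to this node

print(dsn_for_request({"sub": "user-7", "tenant_id": "tenant-42"}))
```

In production the cache lookup falls back to the authoritative directory on a miss, and the proxy holds a connection pool per database node rather than building DSN strings per request.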