Mental Model
A database incident is not a mystery; it is a sequence of events with a root cause that leaves evidence. Your job as an engineer is to follow the evidence trail systematically before touching anything.
The worst thing you can do during a MySQL incident is start changing things without a diagnosis. Random configuration changes in a degraded system often make things worse. This playbook gives you a systematic investigation protocol for the four most common MySQL production incidents.
Incident Type 1: Sudden Slowdown / High Latency
Symptoms: Query latency spikes from 10ms to 2000ms. Error rates increase. CPU is high or I/O wait is high.
Step 1: See What's Running Right Now
-- Show all currently executing queries
SHOW FULL PROCESSLIST;
-- Or with more detail (requires performance_schema)
SELECT
processlist_id,
processlist_user,
processlist_db,
processlist_command,
processlist_time,
processlist_state,
LEFT(processlist_info, 200) AS query_preview
FROM performance_schema.processlist
WHERE processlist_command != 'Sleep'
ORDER BY processlist_time DESC;
What to look for:
- Queries with
processlist_time> 10 seconds - State:
Waiting for table metadata lock— something holds a metadata lock (often an uncommitted DDL or long transaction) - State:
Sending datawith very high time — data volume problem or missing index - State:
Locked— row-level or table-level lock contention
Step 2: Kill the Blocking Query
-- Identify the blocking thread
SELECT
r.trx_id AS waiting_trx_id,
r.trx_mysql_thread_id AS waiting_thread,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx_id,
b.trx_mysql_thread_id AS blocking_thread,
b.trx_query AS blocking_query,
b.trx_started AS blocking_started,
TIMESTAMPDIFF(SECOND, b.trx_started, NOW()) AS blocking_seconds
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
-- Kill the blocking thread (CAREFUL: confirm this is safe first)
KILL 12345; -- blocking_thread ID
Warning: Never kill a thread without understanding why it's there. A long-running UPDATE might be legitimate — a data migration, for example. Killing it causes a rollback that might take as long to complete as the original query.
Step 3: EXPLAIN the Slow Query
-- Run EXPLAIN on the slow query (without ANALYZE — it's non-destructive)
EXPLAIN SELECT * FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.status = 'pending'
AND o.created_at > '2026-01-01';
Red flags in EXPLAIN output:
| Column | Red Flag | Meaning |
|---|---|---|
type |
ALL |
Full table scan — no index used |
type |
index |
Full index scan — better, but still slow on large tables |
rows |
> 100,000 | MySQL will examine that many rows |
Extra |
Using filesort |
Sort can't use an index — expensive on large result sets |
Extra |
Using temporary |
MySQL needs a temp table — very expensive |
key |
NULL |
No index selected |
Step 4: EXPLAIN ANALYZE for Confirmed Queries
Once you identify the slow query, run EXPLAIN ANALYZE in a non-production environment (it actually executes the query):
-- Shows actual vs estimated row counts — reveals optimizer lies
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at;
-- Output shows:
-- -> Sort: orders.created_at (actual time=2847.123..2847.456 rows=89234 loops=1)
-- -> Filter: (orders.status = 'pending')
-- -> Table scan on orders (cost=23847.00 rows=234891 actual time=0.023..2201.342 rows=89234 loops=1)
The actual vs estimated rows gap tells you if the optimizer is working with stale statistics — if so, run ANALYZE TABLE orders;.
Incident Type 2: Deadlock
Symptoms: Application receives ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction.
Deadlocks are deterministic: they happen when two transactions acquire locks in opposite orders. Understanding this is essential before you can fix them.
Step 1: Read the InnoDB Deadlock Info
-- Show the last deadlock (stored since last restart)
SHOW ENGINE INNODB STATUS\G
-- Look for the LATEST DETECTED DEADLOCK section
-- It shows:
-- 1. Transaction A: what it held, what it waited for
-- 2. Transaction B: what it held, what it waited for
-- 3. Which transaction was rolled back (the "victim")
Reading the deadlock output:
TRANSACTION 1, ACTIVE 2 sec starting index read
MySQL thread id 156, OS thread handle 140234891776
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
RECORD LOCKS space id 89 page no 5 n bits 72 index PRIMARY of table `db`.`orders`
trx id 421897 lock_mode X locks rec but not gap waiting
TRANSACTION 2, ACTIVE 1 sec
MySQL thread id 157
1 lock struct(s), heap size 1136, 1 row lock(s)
RECORD LOCKS space id 89 page no 5 n bits 72 index PRIMARY of table `db`.`orders`
trx id 421896 lock_mode X locks rec but not gap
Step 2: Identify the Lock Order Problem
Most deadlocks have this pattern:
Time →
Transaction A Transaction B
──────────────────────────────────────────────────
LOCK orders WHERE id = 100
LOCK orders WHERE id = 200
WAIT for orders WHERE id = 200
WAIT for orders WHERE id = 100 ← DEADLOCK
The universal fix: Ensure all code paths that lock multiple rows do so in the same order — typically by ascending primary key.
Step 3: Fix the Deadlock in Code
// BEFORE (deadlock-prone): Different threads might acquire in different orders
public void transferFunds(long fromId, long toId, BigDecimal amount) {
Account from = accountRepo.findByIdForUpdate(fromId); // LOCK id=from
Account to = accountRepo.findByIdForUpdate(toId); // LOCK id=to
// Thread 1: locks 100 then 200
// Thread 2: locks 200 then 100 → DEADLOCK
}
// AFTER (deadlock-safe): Always lock in ascending ID order
public void transferFunds(long fromId, long toId, BigDecimal amount) {
long firstId = Math.min(fromId, toId);
long secondId = Math.max(fromId, toId);
Account first = accountRepo.findByIdForUpdate(firstId);
Account second = accountRepo.findByIdForUpdate(secondId);
Account from = (fromId == firstId) ? first : second;
Account to = (toId == firstId) ? first : second;
// Now: Both threads always lock the smaller ID first → no deadlock possible
}
Step 4: Add Deadlock Monitoring
-- Track deadlock frequency over time
SELECT
variable_value AS deadlocks_since_restart
FROM performance_schema.global_status
WHERE variable_name = 'Innodb_deadlocks';
-- Set up a periodic check in your monitoring
-- Alert if deadlock rate > 10/minute
Incident Type 3: Replication Lag
Symptoms: Read replica is returning stale data. Seconds_Behind_Master is growing. Replica queries are reading data that was written minutes or hours ago.
Step 1: Check Current Lag
-- On the replica
SHOW REPLICA STATUS\G
-- Key fields:
-- Seconds_Behind_Master: Current lag in seconds
-- 0 = in sync
-- NULL = replication is stopped or not connected
-- Large number = replica is catching up
-- Replica_SQL_Running: YES = applying binlogs
-- Replica_IO_Running: YES = receiving binlogs from primary
Step 2: Diagnose the Bottleneck
-- Is the replica IO thread stuck? (Can't receive binlogs)
-- Check: Last_IO_Error, Last_IO_Errno
-- Is the replica SQL thread stuck? (Can't apply binlogs)
-- Check: Last_SQL_Error, Last_SQL_Errno
-- Is a single long transaction causing lag?
SELECT
trx_started,
TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS running_seconds,
trx_query
FROM information_schema.innodb_trx
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 30
ORDER BY running_seconds DESC;
Step 3: Common Causes and Fixes
Cause 1: Long-running primary transaction
A single 30-second UPDATE on the primary causes a 30-second gap on the replica (single-threaded applier is blocked until the full transaction replays).
-- Fix: Break large updates into batches on the primary
-- BEFORE (blocks replica for 60 seconds):
UPDATE orders SET status = 'archived' WHERE created_at < '2025-01-01';
-- This might affect 10 million rows
-- AFTER (replica lag stays < 1 second):
-- Run in a loop until done:
UPDATE orders SET status = 'archived'
WHERE created_at < '2025-01-01'
AND status != 'archived'
LIMIT 1000;
-- Sleep 100ms between batches
Cause 2: Missing index on replica
If the replica has a different schema than the primary (e.g., a migration ran on primary but not replica), row-based replication must scan the full table to find the rows to update.
-- Check indexes match
-- On primary:
SHOW INDEX FROM orders;
-- On replica:
SHOW INDEX FROM orders;
-- Fix: Add missing index to replica
Cause 3: Replica I/O bound
The replica disk can't keep up with the primary's write rate.
-- Enable parallel replication (MySQL 5.7+)
-- On replica:
SET GLOBAL slave_parallel_workers = 8;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
-- Persist in my.cnf:
slave_parallel_workers = 8
slave_parallel_type = LOGICAL_CLOCK
Incident Type 4: Connection Pool Exhaustion
Symptoms: Application returns "Too many connections" or connection pool timeout errors. Database max_connections is hit.
Step 1: Check Current Connection State
-- Current connections vs maximum
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Threads_running'; -- actively executing (not sleeping)
-- See connection distribution by user and host
SELECT user, host, COUNT(*) as connections
FROM information_schema.processlist
GROUP BY user, host
ORDER BY connections DESC;
-- Find sleeping connections holding locks (common problem)
SELECT id, user, time, state, info
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 30
ORDER BY time DESC;
Step 2: Identify the Leak
-- Check if connections aren't being returned properly
-- High Threads_connected + low Threads_running = connection leak
-- Check connection wait time
SHOW STATUS LIKE 'Connection_errors_max_connections';
-- Non-zero = connections are being rejected
-- Check for connections from unexpected sources
SELECT host, COUNT(*) FROM information_schema.processlist
GROUP BY host ORDER BY 2 DESC;
Step 3: Immediate Relief
-- Kill all idle connections older than 60 seconds (emergency only)
SELECT CONCAT('KILL ', id, ';')
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 60;
-- Copy and execute the output carefully
Step 4: Root Cause Fix — HikariCP Configuration
// Spring Boot HikariCP configuration
spring:
datasource:
hikari:
maximum-pool-size: 20 # (CPU cores * 2) + disk spindles is a good starting point
minimum-idle: 5
connection-timeout: 30000 # 30 seconds max wait for a connection
idle-timeout: 600000 # Return idle connections after 10 minutes
max-lifetime: 1800000 # Recycle connections every 30 minutes
keepalive-time: 60000 # Test idle connections every 60 seconds
connection-test-query: SELECT 1 # Validate connection is alive
leak-detection-threshold: 60000 # Warn if a connection is held > 60 seconds
The critical insight on pool sizing: max-pool-size should equal the number of connections your database can effectively serve. For most workloads: (CPU_cores × 2) + effective_disk_spindles. A pool size of 200 on a 4-core database server will cause resource contention. 20–50 connections per application instance is usually correct.
Setting Up Proactive Monitoring
Don't wait for incidents — monitor these metrics continuously:
-- Key metrics to track every 60 seconds:
SELECT
(SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Threads_connected') AS connections,
(SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Threads_running') AS active_queries,
(SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_row_lock_waits') AS lock_waits,
(SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_deadlocks') AS deadlocks,
(SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Slow_queries') AS slow_queries;
Alert thresholds:
Threads_connected> 80% ofmax_connections→ WarningThreads_running> 50 → Warning (too many concurrent queries)Seconds_Behind_Master> 30 → Warning on replicaInnodb_deadlocksrate > 1/minute → Investigation neededSlow_queriesrate > 10/minute → Checkslow_query_log
Key Takeaways
- SHOW PROCESSLIST + EXPLAIN ANALYZE are your first tools in any MySQL performance incident — never guess without data.
- Deadlocks are always reproducible if you understand the lock acquisition order; the fix is always about consistent ordering.
- Replication lag > 10 seconds means your read replicas are serving stale data — always monitor and alert on
Seconds_Behind_Master.