MySQL Production Incident Playbook: Diagnosing Slow Queries, Deadlocks, and Replication Lag

Mental Model

A database incident is not a mystery; it is a sequence of events with a root cause that leaves evidence. Your job as an engineer is to follow the evidence trail systematically before touching anything.

The worst thing you can do during a MySQL incident is start changing things without a diagnosis. Random configuration changes in a degraded system often make things worse. This playbook gives you a systematic investigation protocol for the four most common MySQL production incidents.

Incident Type 1: Sudden Slowdown / High Latency

Symptoms: Query latency spikes from 10ms to 2000ms. Error rates increase. CPU is high or I/O wait is high.

Step 1: See What's Running Right Now

-- Show all currently executing queries
SHOW FULL PROCESSLIST;

-- Or with more detail (requires performance_schema)
SELECT 
  processlist_id,
  processlist_user,
  processlist_db,
  processlist_command,
  processlist_time,
  processlist_state,
  LEFT(processlist_info, 200) AS query_preview
FROM performance_schema.processlist
WHERE processlist_command != 'Sleep'
ORDER BY processlist_time DESC;

What to look for:

Queries with processlist_time > 10 seconds
State: Waiting for table metadata lock — something holds a metadata lock (often an uncommitted DDL or long transaction)
State: Sending data with very high time — data volume problem or missing index
State: Locked — row-level or table-level lock contention

Step 2: Kill the Blocking Query

-- Identify the blocking thread
SELECT 
  r.trx_id AS waiting_trx_id,
  r.trx_mysql_thread_id AS waiting_thread,
  r.trx_query AS waiting_query,
  b.trx_id AS blocking_trx_id,
  b.trx_mysql_thread_id AS blocking_thread,
  b.trx_query AS blocking_query,
  b.trx_started AS blocking_started,
  TIMESTAMPDIFF(SECOND, b.trx_started, NOW()) AS blocking_seconds
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;

-- Kill the blocking thread (CAREFUL: confirm this is safe first)
KILL 12345; -- blocking_thread ID

Warning: Never kill a thread without understanding why it's there. A long-running UPDATE might be legitimate — a data migration, for example. Killing it causes a rollback that might take as long to complete as the original query.

Step 3: EXPLAIN the Slow Query

-- Run EXPLAIN on the slow query (without ANALYZE — it's non-destructive)
EXPLAIN SELECT * FROM orders o 
  JOIN users u ON o.user_id = u.id
  WHERE o.status = 'pending' 
  AND o.created_at > '2026-01-01';

Red flags in EXPLAIN output:

Column	Red Flag	Meaning
`type`	`ALL`	Full table scan — no index used
`type`	`index`	Full index scan — better, but still slow on large tables
`rows`	> 100,000	MySQL will examine that many rows
`Extra`	`Using filesort`	Sort can't use an index — expensive on large result sets
`Extra`	`Using temporary`	MySQL needs a temp table — very expensive
`key`	`NULL`	No index selected

Step 4: EXPLAIN ANALYZE for Confirmed Queries

Once you identify the slow query, run EXPLAIN ANALYZE in a non-production environment (it actually executes the query):

-- Shows actual vs estimated row counts — reveals optimizer lies
EXPLAIN ANALYZE 
  SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at;

-- Output shows:
-- -> Sort: orders.created_at  (actual time=2847.123..2847.456 rows=89234 loops=1)
--     -> Filter: (orders.status = 'pending')  
--         -> Table scan on orders  (cost=23847.00 rows=234891 actual time=0.023..2201.342 rows=89234 loops=1)

The actual vs estimated rows gap tells you if the optimizer is working with stale statistics — if so, run ANALYZE TABLE orders;.

Incident Type 2: Deadlock

Symptoms: Application receives ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction.

Deadlocks are deterministic: they happen when two transactions acquire locks in opposite orders. Understanding this is essential before you can fix them.

Step 1: Read the InnoDB Deadlock Info

-- Show the last deadlock (stored since last restart)
SHOW ENGINE INNODB STATUS\G

-- Look for the LATEST DETECTED DEADLOCK section
-- It shows:
-- 1. Transaction A: what it held, what it waited for
-- 2. Transaction B: what it held, what it waited for
-- 3. Which transaction was rolled back (the "victim")

Reading the deadlock output:

TRANSACTION 1, ACTIVE 2 sec starting index read
MySQL thread id 156, OS thread handle 140234891776
LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)

RECORD LOCKS space id 89 page no 5 n bits 72 index PRIMARY of table `db`.`orders`
trx id 421897 lock_mode X locks rec but not gap waiting

TRANSACTION 2, ACTIVE 1 sec
MySQL thread id 157
1 lock struct(s), heap size 1136, 1 row lock(s)
RECORD LOCKS space id 89 page no 5 n bits 72 index PRIMARY of table `db`.`orders`
trx id 421896 lock_mode X locks rec but not gap

Step 2: Identify the Lock Order Problem

Most deadlocks have this pattern:

Time →

Transaction A                     Transaction B
──────────────────────────────────────────────────
LOCK orders WHERE id = 100        
                                  LOCK orders WHERE id = 200
WAIT for orders WHERE id = 200   
                                  WAIT for orders WHERE id = 100  ← DEADLOCK

The universal fix: Ensure all code paths that lock multiple rows do so in the same order — typically by ascending primary key.

Step 3: Fix the Deadlock in Code

// BEFORE (deadlock-prone): Different threads might acquire in different orders
public void transferFunds(long fromId, long toId, BigDecimal amount) {
    Account from = accountRepo.findByIdForUpdate(fromId); // LOCK id=from
    Account to = accountRepo.findByIdForUpdate(toId);     // LOCK id=to
    // Thread 1: locks 100 then 200
    // Thread 2: locks 200 then 100 → DEADLOCK
}

// AFTER (deadlock-safe): Always lock in ascending ID order
public void transferFunds(long fromId, long toId, BigDecimal amount) {
    long firstId = Math.min(fromId, toId);
    long secondId = Math.max(fromId, toId);
    
    Account first = accountRepo.findByIdForUpdate(firstId);
    Account second = accountRepo.findByIdForUpdate(secondId);
    
    Account from = (fromId == firstId) ? first : second;
    Account to = (toId == firstId) ? first : second;
    // Now: Both threads always lock the smaller ID first → no deadlock possible
}

Step 4: Add Deadlock Monitoring

-- Track deadlock frequency over time
SELECT 
  variable_value AS deadlocks_since_restart
FROM performance_schema.global_status
WHERE variable_name = 'Innodb_deadlocks';

-- Set up a periodic check in your monitoring
-- Alert if deadlock rate > 10/minute

Incident Type 3: Replication Lag

Symptoms: Read replica is returning stale data. Seconds_Behind_Master is growing. Replica queries are reading data that was written minutes or hours ago.

Step 1: Check Current Lag

-- On the replica
SHOW REPLICA STATUS\G

-- Key fields:
-- Seconds_Behind_Master: Current lag in seconds
--   0 = in sync
--   NULL = replication is stopped or not connected
--   Large number = replica is catching up

-- Replica_SQL_Running: YES = applying binlogs
-- Replica_IO_Running: YES = receiving binlogs from primary

Step 2: Diagnose the Bottleneck

-- Is the replica IO thread stuck? (Can't receive binlogs)
-- Check: Last_IO_Error, Last_IO_Errno

-- Is the replica SQL thread stuck? (Can't apply binlogs)
-- Check: Last_SQL_Error, Last_SQL_Errno

-- Is a single long transaction causing lag?
SELECT 
  trx_started,
  TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS running_seconds,
  trx_query
FROM information_schema.innodb_trx
WHERE TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 30
ORDER BY running_seconds DESC;

Step 3: Common Causes and Fixes

Cause 1: Long-running primary transaction

A single 30-second UPDATE on the primary causes a 30-second gap on the replica (single-threaded applier is blocked until the full transaction replays).

-- Fix: Break large updates into batches on the primary
-- BEFORE (blocks replica for 60 seconds):
UPDATE orders SET status = 'archived' WHERE created_at < '2025-01-01';
-- This might affect 10 million rows

-- AFTER (replica lag stays < 1 second):
-- Run in a loop until done:
UPDATE orders SET status = 'archived' 
WHERE created_at < '2025-01-01' 
AND status != 'archived'
LIMIT 1000;
-- Sleep 100ms between batches

Cause 2: Missing index on replica

If the replica has a different schema than the primary (e.g., a migration ran on primary but not replica), row-based replication must scan the full table to find the rows to update.

-- Check indexes match
-- On primary:
SHOW INDEX FROM orders;
-- On replica:
SHOW INDEX FROM orders;
-- Fix: Add missing index to replica

Cause 3: Replica I/O bound

The replica disk can't keep up with the primary's write rate.

-- Enable parallel replication (MySQL 5.7+)
-- On replica:
SET GLOBAL slave_parallel_workers = 8;
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';

-- Persist in my.cnf:
slave_parallel_workers = 8
slave_parallel_type = LOGICAL_CLOCK

Incident Type 4: Connection Pool Exhaustion

Symptoms: Application returns "Too many connections" or connection pool timeout errors. Database max_connections is hit.

Step 1: Check Current Connection State

-- Current connections vs maximum
SHOW VARIABLES LIKE 'max_connections';
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Threads_running'; -- actively executing (not sleeping)

-- See connection distribution by user and host
SELECT user, host, COUNT(*) as connections 
FROM information_schema.processlist 
GROUP BY user, host
ORDER BY connections DESC;

-- Find sleeping connections holding locks (common problem)
SELECT id, user, time, state, info
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 30
ORDER BY time DESC;

Step 2: Identify the Leak

-- Check if connections aren't being returned properly
-- High Threads_connected + low Threads_running = connection leak

-- Check connection wait time
SHOW STATUS LIKE 'Connection_errors_max_connections';
-- Non-zero = connections are being rejected

-- Check for connections from unexpected sources
SELECT host, COUNT(*) FROM information_schema.processlist 
GROUP BY host ORDER BY 2 DESC;

Step 3: Immediate Relief

-- Kill all idle connections older than 60 seconds (emergency only)
SELECT CONCAT('KILL ', id, ';')
FROM information_schema.processlist
WHERE command = 'Sleep' AND time > 60;
-- Copy and execute the output carefully

Step 4: Root Cause Fix — HikariCP Configuration

// Spring Boot HikariCP configuration
spring:
  datasource:
    hikari:
      maximum-pool-size: 20          # (CPU cores * 2) + disk spindles is a good starting point
      minimum-idle: 5
      connection-timeout: 30000      # 30 seconds max wait for a connection
      idle-timeout: 600000           # Return idle connections after 10 minutes
      max-lifetime: 1800000          # Recycle connections every 30 minutes
      keepalive-time: 60000          # Test idle connections every 60 seconds
      connection-test-query: SELECT 1 # Validate connection is alive
      leak-detection-threshold: 60000 # Warn if a connection is held > 60 seconds

The critical insight on pool sizing: max-pool-size should equal the number of connections your database can effectively serve. For most workloads: (CPU_cores × 2) + effective_disk_spindles. A pool size of 200 on a 4-core database server will cause resource contention. 20–50 connections per application instance is usually correct.

Setting Up Proactive Monitoring

Don't wait for incidents — monitor these metrics continuously:

-- Key metrics to track every 60 seconds:
SELECT
  (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Threads_connected') AS connections,
  (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Threads_running') AS active_queries,
  (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_row_lock_waits') AS lock_waits,
  (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_deadlocks') AS deadlocks,
  (SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Slow_queries') AS slow_queries;

Alert thresholds:

Threads_connected > 80% of max_connections → Warning
Threads_running > 50 → Warning (too many concurrent queries)
Seconds_Behind_Master > 30 → Warning on replica
Innodb_deadlocks rate > 1/minute → Investigation needed
Slow_queries rate > 10/minute → Check slow_query_log

Key Takeaways

SHOW PROCESSLIST + EXPLAIN ANALYZE are your first tools in any MySQL performance incident — never guess without data.
Deadlocks are always reproducible if you understand the lock acquisition order; the fix is always about consistent ordering.
Replication lag > 10 seconds means your read replicas are serving stale data — always monitor and alert on Seconds_Behind_Master.