Kafka Consumer Lag: Causes, Diagnosis, and Production Fixes

Kafka consumer lag is not a root cause. It is a symptom.

Lag means consumers are not keeping up with producers for one or more partitions. The fix depends on why: slow processing, too few consumers, partition skew, rebalances, downstream failures, poison messages, or broker problems.

The dangerous move is blindly adding consumers. That works only when there are enough partitions and the bottleneck is consumer parallelism.

What Lag Actually Means

Mental Model

An immutable, high-speed transaction log used to decouple producers from consumers.

For a consumer group:

lag = latest_offset - committed_offset

If topic partition orders-3 has latest offset 10,000 and the consumer group committed offset 9,200, lag is 800 for that partition.

Check lag:

kafka-consumer-groups.sh \
  --bootstrap-server broker:9092 \
  --describe \
  --group payment-consumer

Look at lag per partition, not only total lag.

First Diagnosis

Ask:

Is lag on all partitions or only a few?
Did producer traffic increase?
Did consumer processing time increase?
Are consumers rebalancing repeatedly?
Is a downstream dependency slow?
Are poison messages causing repeated failures?
Are there enough partitions for the desired parallelism?

Partition-level lag tells you the shape of the problem.

All partitions lagging evenly -> consumers are globally too slow
One partition lagging heavily -> hot key or poison message
Lag sawtooth pattern -> rebalances or batch commits
Lag grows during dependency outage -> downstream bottleneck

Slow Processing

If each message takes longer, lag grows even with the same traffic.

Instrument processing duration:

@KafkaListener(topics = "orders")
public void consume(OrderEvent event, Acknowledgment ack) {
    Timer.Sample sample = Timer.start(meterRegistry);
    try {
        orderProcessor.process(event);
        ack.acknowledge();
    } finally {
        sample.stop(meterRegistry.timer("kafka.message.processing.duration"));
    }
}

Watch:

kafka.message.processing.duration p95
consumer.records.consumed.rate
consumer.records.lag.max
downstream.http.latency
db.query.duration

If processing time rose after a deploy, inspect code. If it rose with downstream latency, protect the consumer with timeouts, circuit breakers, or pause consumption.

Too Few Partitions

Kafka consumer parallelism is capped by partitions.

Topic partitions: 6
Consumer instances: 10
Active consumers: 6
Idle consumers: 4

Adding more consumers than partitions does not increase parallelism. If lag is evenly distributed and each partition is busy, you may need more partitions.

But increasing partitions changes key distribution. For ordered workflows, understand the impact before changing partition count.

Hot Partitions

If one partition has most of the lag, the producer key may be skewed.

Example bad key:

producer.send(new ProducerRecord<>("orders", merchantId, event));

If one merchant is huge, one partition becomes hot. Better key choice depends on ordering requirements. If strict per-merchant order is not required, shard the key:

String key = merchantId + ":" + (event.orderId().hashCode() % 16);

This spreads one large merchant across multiple partitions, but sacrifices total ordering for that merchant. That is a business decision, not only an engineering choice.

Rebalance Storms

Frequent rebalances stop consumption and increase lag.

Common causes:

processing takes longer than max.poll.interval.ms
consumers crash and restart
autoscaler rapidly changes replica count
network instability
too many partitions assigned per consumer

Key settings:

max.poll.records=100
max.poll.interval.ms=300000
session.timeout.ms=45000
heartbeat.interval.ms=15000

If processing a batch can take 10 minutes, but max.poll.interval.ms is 5 minutes, Kafka considers the consumer dead. Reduce batch size or increase the interval.

Poison Messages

A poison message fails every time and can block progress if the consumer keeps retrying it.

Use a dead-letter topic after bounded retries:

try {
    process(event);
    ack.acknowledge();
} catch (RetryableException ex) {
    throw ex; // let retry policy handle it
} catch (Exception ex) {
    deadLetterPublisher.publish(event, ex);
    ack.acknowledge(); // skip after preserving the failed message
}

The DLQ event should include:

{
  "originalTopic": "orders",
  "partition": 3,
  "offset": 91234,
  "error": "Invalid currency code",
  "payload": { }
}

Never silently skip messages. Preserve them for replay.

Autoscaling Consumers

Autoscaling on CPU is often wrong. Scale on lag and processing rate:

desired_consumers = lag / target_lag_per_consumer

But cap by partition count:

desired_consumers = min(topic_partitions, computed_consumers)

Avoid aggressive scale-in. Removing consumers causes rebalances, which can make lag worse.

Safe Recovery

When lag is huge:

Stop the bleeding: fix producer spike or downstream outage
Increase consumers up to partition count
Reduce per-message processing time
Temporarily disable non-critical work in the consumer
Consider replaying to a separate catch-up consumer group
Do not reset offsets unless you deliberately want to skip data

Offset reset is a serious operation:

kafka-consumer-groups.sh \
  --bootstrap-server broker:9092 \
  --group payment-consumer \
  --topic orders \
  --reset-offsets \
  --to-latest \
  --execute

This skips unprocessed messages. Use only when the business accepts data loss or the data can be rebuilt elsewhere.

Production Checklist

Monitor lag per partition, not only total lag
Track processing duration
Alert on rebalance rate
Use bounded retries and DLQ
Choose producer keys deliberately
Scale consumers only up to partition count
Tune max.poll.records and max.poll.interval.ms together
Protect downstream dependencies with timeouts
Avoid offset reset as a first response
Document replay procedures

Consumer lag is a symptom with many possible causes. The best Kafka teams do not just add consumers. They read the lag shape, identify the bottleneck, and choose a fix that preserves correctness.

Technical Trade-offs: Messaging Systems

Pattern	Ordering	Durability	Throughput	Complexity
Log-based (Kafka)	Strict (per partition)	High	Very High	High
Memory-based (Redis Pub/Sub)	None	Low	High	Very Low
Push-based (RabbitMQ)	Fair	Medium	Medium	Medium

Key Takeaways

processing takes longer than max.poll.interval.ms
consumers crash and restart
autoscaler rapidly changes replica count