About Kafka integration pre-built alerts

The Kafka integration provides a variety of pre-built alerts that you can use right away to begin troubleshooting issues. In this milestone, you’ll become familiar with these pre-built alerts and learn how to use them to address various problems.

Did you know? If your Kafka cluster is functioning properly, you won’t receive any alerts. No news is good news!

The following alerts are included with the Kafka integration:

KafkaOfflinePartitionCount

Description: Kafka has offline partitions.

What this means: One or more partitions are offline, meaning they have no active leader and cannot be read from or written to. This indicates a critical issue affecting data availability.

What to do: Check broker health and logs to identify why partitions went offline. Verify that the cluster has sufficient healthy brokers and that replication is functioning properly.

KafkaUnderReplicatedPartitions

Description: Kafka has under-replicated partitions.

What this means: Some partitions have fewer in-sync replicas than their configured replication factor, so data durability is at risk. If the leader broker fails before the lagging replicas catch up, data loss could occur.

What to do: Investigate broker performance and network connectivity. Check if any brokers are down or experiencing high load that prevents replication from keeping up.
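To make the difference between an offline partition and an under-replicated partition concrete, here is a minimal Python sketch that classifies partitions from their leader, replica list, and ISR list. The `PartitionState` type, the `classify` function, and the sample data are hypothetical and only illustrate the definitions above. They are not part of the Kafka integration or of any Kafka client library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PartitionState:
    """Hypothetical snapshot of one partition's replication state."""
    topic: str
    partition: int
    leader: Optional[int]  # broker ID of the leader, or None if there is no leader
    replicas: List[int]    # broker IDs assigned to hold a copy of the partition
    isr: List[int]         # broker IDs currently in sync with the leader


def classify(p: PartitionState) -> str:
    """Return 'offline', 'under-replicated', or 'healthy' for one partition."""
    if p.leader is None:
        # No active leader: the partition cannot be read from or written to.
        return "offline"
    if len(p.isr) < len(p.replicas):
        # A leader exists, but some assigned replicas are not in sync.
        return "under-replicated"
    return "healthy"


# Made-up example data for a three-broker cluster.
partitions = [
    PartitionState("orders", 0, leader=1, replicas=[1, 2, 3], isr=[1, 2, 3]),
    PartitionState("orders", 1, leader=2, replicas=[2, 3, 1], isr=[2]),
    PartitionState("orders", 2, leader=None, replicas=[3, 1, 2], isr=[]),
]

for p in partitions:
    print(f"{p.topic}-{p.partition}: {classify(p)}")
```

In this sample, partition 1 still serves traffic but has only one in-sync copy, while partition 2 is unavailable until a leader is elected.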

KafkaActiveController

Description: No active Kafka controller detected or multiple controllers detected.

What this means: The cluster either has no active controller (preventing partition leadership changes) or is in a split-brain state with multiple controllers. Both are critical issues.

What to do: Check ZooKeeper connectivity and health. Examine broker logs for controller election issues. Ensure network connectivity between brokers and ZooKeeper is stable.
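The condition behind this alert can be sketched as a simple sum. Each broker reports whether it is currently the active controller (1) or not (0), and a healthy cluster has exactly one. The following Python sketch assumes you have already collected those values per broker. The `controller_status` function and the broker names are hypothetical.

```python
from typing import Dict


def controller_status(active_controller_count: Dict[str, int]) -> str:
    """Summarize controller health from per-broker 0/1 controller flags."""
    total = sum(active_controller_count.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no active controller"
    return "multiple active controllers"


# Made-up readings from a three-broker cluster.
print(controller_status({"broker-1": 0, "broker-2": 1, "broker-3": 0}))
print(controller_status({"broker-1": 0, "broker-2": 0, "broker-3": 0}))
print(controller_status({"broker-1": 1, "broker-2": 1, "broker-3": 0}))
```

A short-lived total of zero can occur during a controller election, so it is reasonable to treat the condition as a problem only when it persists.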

KafkaUncleanLeaderElection

Description: Unclean leader elections are occurring.

What this means: Kafka is electing partition leaders from replicas that were not in sync, which can result in message loss. This indicates the cluster is prioritizing availability over data consistency.

What to do: Investigate why in-sync replicas are unavailable. Review broker health and replication lag. Consider adjusting replication factors or min.insync.replicas settings.

KafkaISRExpandRate

Description: In-Sync Replica (ISR) expansion rate is high.

What this means: Replicas are frequently joining the ISR set, which may indicate intermittent broker or network issues causing replicas to fall behind and then catch up.

What to do: Monitor broker performance and network stability. Check for broker restarts or network partitions. Review replication lag metrics to identify problematic brokers.

KafkaISRShrinkRate

Description: In-Sync Replica (ISR) shrink rate is high.

What this means: Replicas are frequently being removed from the ISR set because they’re falling behind the leader, indicating replication performance issues.

What to do: Investigate broker resource utilization (CPU, disk I/O, network). Check for slow disks or network issues. Review replication lag and broker logs for errors.

KafkaBrokerCount

Description: Kafka broker count has changed.

What this means: The number of active brokers in the cluster has decreased, which may indicate broker failures or planned maintenance.

What to do: Verify if the broker loss was intentional. If unplanned, investigate why brokers went offline and restore them to maintain cluster capacity and fault tolerance.

KafkaZookeeperSyncConnect

Description: ZooKeeper sync connection issues detected.

What this means: Kafka brokers are experiencing problems maintaining connections to ZooKeeper, which can affect cluster metadata operations and coordination.

What to do: Check ZooKeeper cluster health and network connectivity between Kafka brokers and ZooKeeper nodes. Review ZooKeeper logs for errors.

KafkaLagIsTooHigh

Description: Consumer lag is too high.

What this means: Consumer groups are falling significantly behind in processing messages, indicating consumers can’t keep up with the message production rate.

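What to do: Add consumer instances (within a consumer group, each partition is read by at most one consumer, so instances beyond the partition count sit idle), optimize consumer processing logic, or investigate performance bottlenecks in consumer applications.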

KafkaLagKeepsIncreasing

Description: Consumer lag keeps increasing.

What this means: Consumer lag is continuously growing over time, indicating a persistent problem where consumers cannot keep pace with producers.

What to do: Urgently investigate consumer performance issues. Consider increasing consumer parallelism, optimizing consumer code, or adjusting partition assignments.
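The two lag alerts describe different conditions: one looks at how large the lag is, the other at whether it keeps growing. The Python sketch below shows one way to express both, assuming you already have per-partition offsets and a series of lag samples. The function names and sample values are hypothetical, and the integration's actual alert rules may use different thresholds and time windows.

```python
from typing import Dict, List


def total_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum of (log end offset - committed offset) across partitions.

    Assumption: a partition with no committed offset is counted from offset 0.
    """
    return sum(
        max(end_offsets[p] - committed.get(p, 0), 0)
        for p in end_offsets
    )


def keeps_increasing(lag_samples: List[int]) -> bool:
    """True if every lag sample is strictly larger than the previous one."""
    return len(lag_samples) >= 2 and all(
        later > earlier
        for earlier, later in zip(lag_samples, lag_samples[1:])
    )


# Made-up offsets for a two-partition topic.
end_offsets = {0: 1500, 1: 1200}
committed = {0: 1400, 1: 900}
print(total_lag(end_offsets, committed))

# Made-up lag samples taken at regular intervals.
print(keeps_increasing([120, 180, 260, 400]))
print(keeps_increasing([120, 90, 150]))
```

High but stable lag usually means consumers are keeping pace from a backlog, whereas steadily rising lag means they are losing ground and the backlog will not clear on its own.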

JvmMemoryFillingUp

Description: JVM memory filling up for Kafka broker.

What this means: Kafka broker JVM heap memory usage is high and trending upward, which may indicate a memory leak or insufficient heap size configuration.

What to do: Monitor for garbage collection issues. Consider increasing JVM heap size if appropriate, or investigate potential memory leaks in custom components or configurations.
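"Trending upward" can be made concrete by fitting a straight line to recent heap usage samples and projecting when the heap would be full. The Python sketch below does this with an ordinary least-squares slope. The helper names and sample values are hypothetical, and this is an illustration of the idea rather than the expression the alert itself uses.

```python
from typing import List, Optional, Tuple

# (minutes since the first sample, heap used as a fraction between 0.0 and 1.0)
Sample = Tuple[float, float]


def heap_slope(samples: List[Sample]) -> float:
    """Least-squares slope of heap usage, in fraction of heap per minute."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must be taken at distinct times")
    numerator = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    return numerator / denominator


def minutes_until_full(samples: List[Sample]) -> Optional[float]:
    """Estimate minutes until the heap is full, or None if usage is not rising."""
    slope = heap_slope(samples)
    if slope <= 0:
        return None
    latest_usage = samples[-1][1]
    return (1.0 - latest_usage) / slope


# Made-up samples: usage climbing from 60% to 75% over 30 minutes.
samples = [(0, 0.60), (10, 0.65), (20, 0.70), (30, 0.75)]
print(heap_slope(samples))
print(minutes_until_full(samples))
```

Heap usage normally rises and falls with each garbage collection cycle, so a projection like this is more meaningful when the samples span several collections, for example by sampling usage measured just after collection.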

JvmThreadsDeadlocked

Description: JVM threads are deadlocked in Kafka broker.

What this means: The Kafka broker JVM has detected threads that are stuck in a deadlock, which can cause the broker to become unresponsive.

What to do: Collect thread dumps and analyze them to identify the deadlocked threads. Recovery typically requires restarting the affected broker, and the deadlock may indicate a bug that needs investigation.

In the next milestone, you’ll explore the Kafka metrics displayed in your dashboards.

