About Kafka integration pre-built alerts
The Kafka integration provides a variety of pre-built alerts that you can use right away to begin troubleshooting issues. In this milestone, you’ll become familiar with these pre-built alerts and learn how to use them to address various problems.
Did you know? If your Kafka cluster is functioning properly, you won’t receive any alerts. No news is good news!
The following alerts are included with the Kafka integration:
KafkaOfflinePartitionCount
Description: Kafka has offline partitions.
What this means: One or more partitions are offline, meaning they have no active leader and cannot be read from or written to. This indicates a critical issue affecting data availability.
What to do: Check broker health and logs to identify why partitions went offline. Verify that the cluster has sufficient healthy brokers and that replication is functioning properly.
KafkaUnderReplicatedPartitions
Description: Kafka has under-replicated partitions.
What this means: Some partitions don’t have the configured number of in-sync replicas, meaning data durability is at risk. If the leader broker fails, data loss could occur.
What to do: Investigate broker performance and network connectivity. Check if any brokers are down or experiencing high load that prevents replication from keeping up.
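To make the difference between an offline partition and an under-replicated one concrete, here is a minimal sketch of the two checks. The `Partition` record and its field names are hypothetical and are not part of the integration; the sketch only assumes you have a snapshot of each partition's leader, assigned replicas, and in-sync replicas.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    """Hypothetical snapshot of one partition's metadata (illustration only)."""
    topic: str
    index: int
    leader: Optional[int]  # broker ID of the current leader, or None if there is none
    replicas: List[int]    # broker IDs assigned to hold a copy of the partition
    isr: List[int]         # broker IDs currently in sync with the leader


def classify(partition: Partition) -> str:
    """Return 'offline', 'under-replicated', or 'healthy' for one partition."""
    if partition.leader is None:
        # No active leader: the partition cannot be read from or written to.
        return "offline"
    if len(partition.isr) < len(partition.replicas):
        # Fewer in-sync replicas than assigned replicas: durability is at risk.
        return "under-replicated"
    return "healthy"


partitions = [
    Partition("orders", 0, leader=1, replicas=[1, 2, 3], isr=[1, 2, 3]),
    Partition("orders", 1, leader=2, replicas=[1, 2, 3], isr=[2]),
    Partition("orders", 2, leader=None, replicas=[1, 2, 3], isr=[]),
]

for p in partitions:
    print(f"{p.topic}-{p.index}: {classify(p)}")
```

The offline check comes first because a partition without a leader is the more severe condition, even though its in-sync replica list is also short.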
KafkaActiveController
Description: No active Kafka controller detected or multiple controllers detected.
What this means: The cluster either has no active controller, which prevents partition leadership changes, or is in a split-brain state with multiple controllers. Both are critical issues.
What to do: Check ZooKeeper connectivity and health. Examine broker logs for controller election issues. Ensure network connectivity between brokers and ZooKeeper is stable.
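A healthy cluster has exactly one active controller. Kafka brokers each report an ActiveControllerCount value of 0 or 1, so one way to reason about this alert is to sum that value across brokers. The sketch below assumes you have already collected the per-broker values into a dictionary; the function name and the input shape are hypothetical, and the integration's actual alert rule may be defined differently.

```python
from typing import Dict


def controller_state(active_controller_counts: Dict[int, int]) -> str:
    """Classify the cluster from per-broker ActiveControllerCount values.

    Keys are broker IDs; values are the 0 or 1 reported by each broker.
    """
    total = sum(active_controller_counts.values())
    if total == 0:
        return "no active controller"
    if total > 1:
        return "multiple controllers"
    return "ok"


# Hypothetical samples from a three-broker cluster.
print(controller_state({1: 0, 2: 1, 3: 0}))
print(controller_state({1: 0, 2: 0, 3: 0}))
print(controller_state({1: 1, 2: 1, 3: 0}))
```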
KafkaUncleanLeaderElection
Description: Unclean leader elections are occurring.
What this means: Kafka is electing partition leaders from replicas that were not in sync with the previous leader, which can result in message loss. This indicates the cluster is prioritizing availability over data consistency.
What to do: Investigate why in-sync replicas are unavailable. Review broker health and replication lag. Consider adjusting replication factors or min.insync.replicas settings.
KafkaISRExpandRate
Description: In-Sync Replica (ISR) expansion rate is high.
What this means: Replicas are frequently joining the ISR set, which may indicate intermittent broker or network issues causing replicas to fall behind and then catch up.
What to do: Monitor broker performance and network stability. Check for broker restarts or network partitions. Review replication lag metrics to identify problematic brokers.
KafkaISRShrinkRate
Description: In-Sync Replica (ISR) shrink rate is high.
What this means: Replicas are frequently being removed from the ISR set because they’re falling behind the leader, indicating replication performance issues.
What to do: Investigate broker resource utilization (CPU, disk I/O, network). Check for slow disks or network issues. Review replication lag and broker logs for errors.
KafkaBrokerCount
Description: Kafka broker count has changed.
What this means: The number of active brokers in the cluster has decreased, which may indicate broker failures or planned maintenance.
What to do: Verify if the broker loss was intentional. If unplanned, investigate why brokers went offline and restore them to maintain cluster capacity and fault tolerance.
KafkaZookeeperSyncConnect
Description: ZooKeeper sync connection issues detected.
What this means: Kafka brokers are experiencing problems maintaining connections to ZooKeeper, which can affect cluster metadata operations and coordination.
What to do: Check ZooKeeper cluster health and network connectivity between Kafka brokers and ZooKeeper nodes. Review ZooKeeper logs for errors.
KafkaLagIsTooHigh
Description: Consumer lag is too high.
What this means: Consumer groups are falling significantly behind in processing messages, indicating consumers can’t keep up with the message production rate.
What to do: Scale up consumer instances, optimize consumer processing logic, or investigate performance bottlenecks in consumer applications.
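Consumer lag for a partition is the gap between the latest offset written to the log and the offset the consumer group has committed. The following sketch shows that arithmetic under stated assumptions: the offsets are supplied as plain dictionaries, a partition with no committed offset is treated as starting from 0, and the function name is hypothetical.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]  # (topic name, partition index)


def consumer_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return the lag per partition: log end offset minus committed offset."""
    lag = {}
    for tp, end in end_offsets.items():
        # Assumption: a partition with no commit yet is treated as offset 0.
        committed = committed_offsets.get(tp, 0)
        # Clamp at 0 in case the two snapshots were taken at slightly different times.
        lag[tp] = max(end - committed, 0)
    return lag


# Made-up offsets for a two-partition topic.
end = {("orders", 0): 1500, ("orders", 1): 900}
committed = {("orders", 0): 1200, ("orders", 1): 900}

lag = consumer_lag(end, committed)
print(lag)
print("total lag:", sum(lag.values()))
```

Looking at lag per partition, rather than only the total, helps you tell a single slow consumer apart from a group that is under-provisioned as a whole.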
KafkaLagKeepsIncreasing
Description: Consumer lag keeps increasing.
What this means: Consumer lag is continuously growing over time, indicating a persistent problem where consumers cannot keep pace with producers.
What to do: Urgently investigate consumer performance issues. Consider increasing consumer parallelism, optimizing consumer code, or adjusting partition assignments.
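The difference between lag that is high and lag that keeps increasing is the trend across successive samples. As a rough illustration, the sketch below treats lag as "increasing" when every sample in a window is larger than the one before it. This is one possible definition chosen for clarity, not the integration's actual alert rule, and the function name is hypothetical.

```python
from typing import Sequence


def keeps_increasing(samples: Sequence[int]) -> bool:
    """Return True if each lag sample is strictly greater than the previous one."""
    if len(samples) < 2:
        # A single sample cannot show a trend.
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


# Made-up lag samples taken at regular intervals.
print(keeps_increasing([100, 250, 400, 900]))  # grows at every step
print(keeps_increasing([100, 250, 200, 900]))  # the consumer caught up once
```

A burst of traffic usually produces lag that rises and then falls again, so it fails this check, while a consumer that is persistently too slow passes it.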
JvmMemoryFillingUp
Description: JVM memory filling up for Kafka broker.
What this means: Kafka broker JVM heap memory usage is high and trending upward, which may indicate a memory leak or insufficient heap size configuration.
What to do: Monitor for garbage collection issues. Consider increasing JVM heap size if appropriate, or investigate potential memory leaks in custom components or configurations.
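"High and trending upward" combines two conditions: the most recent heap usage is above some level, and usage is rising over time. The sketch below estimates the trend with a least-squares slope over evenly spaced samples. The 0.8 threshold, the function names, and the sample values are hypothetical and chosen only for illustration; they are not the integration's actual alert settings.

```python
from typing import Sequence


def heap_trend(used_fractions: Sequence[float]) -> float:
    """Least-squares slope of heap usage (fraction of max heap) per sample interval."""
    n = len(used_fractions)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(used_fractions) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(used_fractions))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def is_filling_up(used_fractions: Sequence[float], high: float = 0.8) -> bool:
    """Return True if the latest sample is at or above `high` and the trend is upward.

    `high` is a hypothetical threshold, not a value taken from the integration.
    """
    if len(used_fractions) < 2:
        return False
    return used_fractions[-1] >= high and heap_trend(used_fractions) > 0


# Made-up heap usage samples, as a fraction of the maximum heap size.
print(is_filling_up([0.70, 0.78, 0.85, 0.91]))
print(is_filling_up([0.91, 0.85, 0.78, 0.70]))
```

Heap usage normally rises and falls with garbage collection, so a slope over several samples is a steadier signal than comparing only two points.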
JvmThreadsDeadlocked
Description: JVM threads are deadlocked in Kafka broker.
What this means: The Kafka broker JVM has detected threads that are stuck in a deadlock, which can cause the broker to become unresponsive.
What to do: Collect thread dumps and analyze them to identify the deadlocked threads. Recovering from a deadlock typically requires restarting the affected broker, and the deadlock may indicate a bug that needs investigation.
In the next milestone, you’ll explore the Kafka metrics displayed in your dashboards.
