About Linux server integration pre-built alerts
The Linux server integration provides a variety of pre-built alerts that you can use right away to begin troubleshooting issues. In this step of the journey, you’ll become familiar with these pre-built alerts and learn how to use them to address various problems.
Did you know? If your machine is functioning properly, you won’t receive any alerts. No news is good news!
Node exporter alerts
NodeCPUHighUsage
Description: High CPU usage
What this means: This alert could mean that a process has failed or that a single node is overloaded.
What to do: Check that workloads are evenly distributed among all nodes and that all processes are operating as expected.
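To confirm the reading from the node itself, you can sample the kernel's CPU counters directly. The following is a minimal sketch, not the alert's own rule: it assumes a Linux host with /proc/stat, treats everything except the idle column as busy time (the alert rule may define usage differently), and the function names are illustrative only.

```python
import os
import time


def parse_cpu_line(stat_text):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Column order: user nice system idle iowait irq softirq steal guest guest_nice
            idle = values[3]
            # Guest time is already included in user/nice, so sum only the first eight columns.
            total = sum(values[:8])
            return idle, total
    raise ValueError("no aggregate 'cpu' line found")


def cpu_usage_percent(before, after):
    """Percentage of non-idle CPU time between two (idle, total) samples."""
    idle_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (total_delta - idle_delta) / total_delta


if __name__ == "__main__" and os.path.exists("/proc/stat"):
    with open("/proc/stat") as f:
        first = parse_cpu_line(f.read())
    time.sleep(1)
    with open("/proc/stat") as f:
        second = parse_cpu_line(f.read())
    print(f"CPU usage over the last second: {cpu_usage_percent(first, second):.1f}%")
```

If the figure stays high, a tool such as top can show which processes account for it.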
NodeClockNotSynchronising
Description: Clock not synchronising
What this means: The system is currently unable to synchronize its internal clock with an external time source. This could lead to time discrepancies if the system’s clock drifts.
What to do: Check the Network Time Protocol (NTP) configuration and confirm that the node can reach the specified time server.
NodeClockSkewDetected
Description: Clock skew detected
What this means: The system’s internal clock is inaccurate and hasn’t self-corrected.
What to do: Check the Network Time Protocol (NTP) service to ensure it’s working correctly and the node can communicate with the designated time server.
NodeDiskIOSaturation
Description: Disk IO queue is high
What this means: The system is currently handling a high volume of disk input/output (I/O) operations. This level of disk activity can slow down the system’s performance.
What to do: Check that all processes are running as expected and consider spreading disk-intensive tasks across multiple nodes.
NodeFileDescriptorLimit
Description: Kernel is predicted to exhaust file descriptors limit soon
What this means: The kernel, the core component of the operating system, can only manage a limited number of open files simultaneously. The system is nearing this limit, which may cause problems with opening new files.
What to do: Look for a process that’s opening many files and failing to close them properly, which is often the cause.
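One way to find the process holding the most files open is to compare the system-wide handle count with per-process counts. This is a rough sketch under the assumption of a Linux host exposing /proc; the function names are hypothetical, and processes you lack permission to inspect are skipped, so run it as root for a complete picture.

```python
import os


def parse_file_nr(text):
    """Return (allocated, maximum) from the contents of /proc/sys/fs/file-nr."""
    # The file holds three numbers: allocated handles, unused handles, system maximum.
    allocated, _unused, maximum = (int(v) for v in text.split())
    return allocated, maximum


def count_open_fds(proc_root="/proc"):
    """Map PID -> number of open file descriptors for every readable process."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            counts[int(entry)] = len(os.listdir(os.path.join(proc_root, entry, "fd")))
        except OSError:
            # The process exited in the meantime, or we may not inspect it.
            continue
    return counts


if __name__ == "__main__" and os.path.exists("/proc/sys/fs/file-nr"):
    with open("/proc/sys/fs/file-nr") as f:
        allocated, maximum = parse_file_nr(f.read())
    print(f"{allocated} of {maximum} file handles allocated")
    busiest = sorted(count_open_fds().items(), key=lambda item: item[1], reverse=True)
    for pid, fds in busiest[:5]:
        print(f"PID {pid}: {fds} open file descriptors")
```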
NodeHighNumberConntrackEntriesUsed
Description: Number of conntrack are getting close to the limit
What this means: Conntrack is a component of the Linux firewall that keeps track of active network connections. The system is currently tracking a high number of connections, which could be a sign of a network problem or a potential security threat.
What to do: Analyze network traffic to determine the root cause.
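Before analyzing traffic, it can help to see how close the table is to its limit. The sketch below assumes the connection-tracking module is loaded and exposes its counters under /proc/sys/net/netfilter (if the files are absent, the script prints nothing); the helper name is illustrative.

```python
import os

COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"
MAX_PATH = "/proc/sys/net/netfilter/nf_conntrack_max"


def conntrack_usage_percent(count_text, max_text):
    """Percentage of the conntrack table in use, given the two files' contents."""
    count = int(count_text.strip())
    limit = int(max_text.strip())
    if limit <= 0:
        raise ValueError("conntrack limit must be positive")
    return 100.0 * count / limit


if __name__ == "__main__" and os.path.exists(COUNT_PATH) and os.path.exists(MAX_PATH):
    with open(COUNT_PATH) as count_file, open(MAX_PATH) as max_file:
        usage = conntrack_usage_percent(count_file.read(), max_file.read())
    print(f"conntrack table is {usage:.1f}% full")
```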
NodeMemoryHighUtilization
Description: Host is running out of memory
What this means: A memory leak in a program could be causing high memory consumption.
What to do: Check that all processes are operating as expected. If applicable, fix the memory leak and restart the affected process, or distribute memory-intensive tasks across multiple nodes.
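To check memory pressure on the node directly, you can read /proc/meminfo. This minimal sketch assumes a Linux kernel that reports MemAvailable and defines utilization as the share of memory that is not available; the alert rule may use a different formula, and the function names are illustrative.

```python
import os


def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo content (most values are in kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_utilization_percent(meminfo):
    """Share of total memory that is not available for new allocations."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print(f"Memory utilization: {memory_utilization_percent(info):.1f}%")
```

A figure that climbs steadily while the workload stays constant is consistent with a leak in one of the running processes.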
NodeMemoryMajorPagesFaults
Description: Memory major page faults are occurring at very high rate
What this means: The system is heavily relying on disk swapping, which means it’s using more memory than is physically available. This significantly degrades performance.
What to do: Investigate and resolve the underlying cause, which is often a memory leak.
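The kernel keeps a running count of major page faults in /proc/vmstat, so you can measure the current rate yourself. This is a sketch under stated assumptions: a Linux host, the pgmajfault counter name, and an arbitrary five-second sampling window; the helper names are hypothetical.

```python
import os
import time


def read_counter(vmstat_text, name="pgmajfault"):
    """Return the value of one counter from /proc/vmstat content."""
    for line in vmstat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == name:
            return int(value)
    raise KeyError(name)


def rate_per_second(before, after, seconds):
    """Average increase per second of a counter between two samples."""
    return (after - before) / seconds


if __name__ == "__main__" and os.path.exists("/proc/vmstat"):
    interval = 5  # seconds between samples; an arbitrary choice
    with open("/proc/vmstat") as f:
        first = read_counter(f.read())
    time.sleep(interval)
    with open("/proc/vmstat") as f:
        second = read_counter(f.read())
    print(f"Major page faults per second: {rate_per_second(first, second, interval):.1f}")
```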
NodeNetworkReceiveErrs
Description: Network interface is reporting many receive errors
What this means: Network connectivity issues have been detected.
What to do: Check for hardware malfunctions or malicious attacks, either of which could cause these errors.
NodeNetworkTransmitErrs
Description: Network interface is reporting many transmit errors
What this means: Network connectivity issues have been detected.
What to do: Check for hardware malfunctions or incorrect network settings, either of which could cause these errors.
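For both the receive and transmit alerts, a first step is to find out which interface is reporting errors. The sketch below reads the per-interface counters from /proc/net/dev; it assumes the standard Linux column layout (receive columns first, then transmit), and the function name is illustrative.

```python
import os


def parse_net_dev(text):
    """Return {interface: (rx_errs, tx_errs)} from /proc/net/dev content."""
    errors = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, sep, stats = line.partition(":")
        if not sep:
            continue
        fields = stats.split()
        # Receive columns: bytes packets errs ... (8 columns), then the transmit columns.
        errors[name.strip()] = (int(fields[2]), int(fields[10]))
    return errors


if __name__ == "__main__" and os.path.exists("/proc/net/dev"):
    with open("/proc/net/dev") as f:
        for interface, (rx_errs, tx_errs) in parse_net_dev(f.read()).items():
            if rx_errs or tx_errs:
                print(f"{interface}: {rx_errs} receive errors, {tx_errs} transmit errors")
```

The counters are cumulative since boot, so run the script twice a few minutes apart to see whether errors are still accumulating.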
NodeRAIDDegraded
Description: RAID Array is degraded
What this means: The RAID array is in a critical state and there’s a high risk of data loss.
What to do: To prevent data loss, repair or replace the failed disks and rebuild the RAID array as soon as possible.
NodeRAIDDiskFailure
Description: Failed device in RAID array
What this means: One of the disks in the RAID array has failed.
What to do: Replace the faulty disk. Although the array is still operational, leaving the failed disk in place risks data loss.
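For Linux software RAID, you can identify the affected array from /proc/mdstat. The following sketch assumes md (mdadm) arrays; hardware RAID controllers don't appear in this file and need the vendor's tooling instead. The function and pattern names are illustrative.

```python
import os
import re

ARRAY_RE = re.compile(r"^(md\S*)\s*:")
# Status looks like "[2/1] [_U]": expected devices / active devices, then one flag per device.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array: (expected_devices, active_devices)} for degraded md arrays."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = (expected, active)
            current = None
    return degraded


if __name__ == "__main__" and os.path.exists("/proc/mdstat"):
    with open("/proc/mdstat") as f:
        for name, (expected, active) in degraded_arrays(f.read()).items():
            print(f"{name}: {active} of {expected} devices active")
```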
NodeSystemSaturation
Description: System saturated, load per core is very high
What this means: All CPU cores are operating at maximum capacity, indicating excessive workload.
What to do: Consider distributing tasks across multiple servers.
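Load per core is the load average divided by the number of CPU cores, which you can compute on the node as a quick cross-check. This is a minimal sketch that assumes the one-minute load average is the figure of interest; the alert rule may use a different window or threshold.

```python
import os


def load_per_core(load_average, cores):
    """Load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


if __name__ == "__main__" and hasattr(os, "getloadavg"):
    one_minute, _five_minutes, _fifteen_minutes = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"Load per core (1 min): {load_per_core(one_minute, cores):.2f}")
```

A value that stays well above 1 means more tasks are waiting to run than the cores can serve.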
NodeSystemdServiceCrashlooping
Description: Systemd service keeps restarting, possibly crash looping
What this means: A particular service is experiencing repeated crashes.
What to do: Investigate and resolve the issue to ensure service stability.
NodeSystemdServiceFailed
Description: Systemd service has entered failed state
What this means: A specific service has failed and hasn’t restarted automatically.
What to do: Investigate and resolve the issue to restore service functionality.
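For either systemd alert, start by listing the units that systemd itself reports as failed. The sketch below shells out to systemctl and parses the first column of its output; it assumes a systemd-based host, and the function names are hypothetical.

```python
import shutil
import subprocess


def parse_failed_units(output):
    """Extract unit names from 'systemctl list-units --no-legend' output."""
    units = []
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0] == "●":
            fields = fields[1:]  # some versions prefix failed units with a bullet
        if fields:
            units.append(fields[0])
    return units


def list_failed_services():
    """Return the names of failed systemd services, or [] if systemctl is missing."""
    if shutil.which("systemctl") is None:
        return []
    result = subprocess.run(
        ["systemctl", "list-units", "--type=service", "--state=failed",
         "--no-legend", "--plain"],
        capture_output=True, text=True, check=False,
    )
    return parse_failed_units(result.stdout)


if __name__ == "__main__":
    for unit in list_failed_services():
        print(f"{unit} has failed; inspect its logs with: journalctl -u {unit}")
```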
NodeTextFileCollectorScrapeError
Description: Node Exporter text file collector failed to scrape
What this means: A log file or status indicator that is typically used to gather data for a particular metric is currently unavailable. This is preventing the system from collecting and reporting the necessary data.
What to do: Consult the Grafana Alloy and system logs to determine which specific file is inaccessible.
Node exporter filesystem alerts
NodeFilesystemAlmostOutOfSpace 5%
Description: Filesystem has less than 5% space left
What this means: The filesystem is almost out of storage space.
What to do: Add storage capacity or remove unnecessary files to free up space.
NodeFilesystemAlmostOutOfSpace 3%
Description: Filesystem has less than 3% space left
What this means: The filesystem is almost out of storage space.
What to do: Add storage capacity or remove unnecessary files to free up space.
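To see how much space a filesystem has left from the node itself, you can use Python's standard library. This is a rough sketch: the 5% default mirrors the first alert's threshold, the path "/" is only an example, and the function names are illustrative.

```python
import shutil


def percent_free(total, free):
    """Free space as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * free / total


def check_path(path, threshold=5.0):
    """Return (percent_free, is_below_threshold) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    pct = percent_free(usage.total, usage.free)
    return pct, pct < threshold


if __name__ == "__main__":
    pct, low = check_path("/")
    print(f"/ has {pct:.1f}% free space" + (" (below threshold)" if low else ""))
```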
NodeFilesystemFilesFillingUp 24 hrs
Description: Filesystem is predicted to run out of inodes within the next 24 hours
What this means: Although the device may still have free space, it has almost reached the maximum number of files it can store.
What to do: Look for a process that’s creating numerous small files without proper cleanup, particularly in the /tmp directory. This is often the cause.
NodeFilesystemFilesFillingUp 4 hrs
Description: Filesystem is predicted to run out of inodes within the next 4 hours
What this means: Although the device may still have free space, it has almost reached the maximum number of files it can store.
What to do: Look for a process that’s creating numerous small files without proper cleanup, particularly in the /tmp directory. This is often the cause.
NodeFilesystemAlmostOutOfFiles 5% inodes left
Description: Filesystem has less than 5% inodes left
What this means: Although the device may still have free space, it has almost reached the maximum number of files it can store.
What to do: Look for a process that’s creating numerous small files without proper cleanup, particularly in the /tmp directory. This is often the cause.
NodeFilesystemAlmostOutOfFiles 3% inodes left
Description: Filesystem has less than 3% inodes left
What this means: Although the device may still have free space, it has almost reached the maximum number of files it can store.
What to do: Look for a process that’s creating numerous small files without proper cleanup, particularly in the /tmp directory. This is often the cause.
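For any of the inode alerts, you can check inode headroom and find the directories holding the most entries. The sketch below assumes a POSIX host; /tmp is used only because it's a common culprit, and the function names are hypothetical. Walking a large directory tree can take a while.

```python
import os
from collections import Counter


def inode_percent_free(path):
    """Percentage of free inodes on the filesystem holding path, or None if unreported."""
    stats = os.statvfs(path)
    if stats.f_files == 0:
        return None  # some filesystems don't report a fixed inode count
    return 100.0 * stats.f_ffree / stats.f_files


def busiest_directories(walk, top=5):
    """Rank directories by entry count, given (dirpath, dirnames, filenames) triples."""
    counts = Counter()
    for dirpath, dirnames, filenames in walk:
        # Every file and every subdirectory consumes one inode.
        counts[dirpath] = len(dirnames) + len(filenames)
    return counts.most_common(top)


if __name__ == "__main__" and os.path.isdir("/tmp"):
    free = inode_percent_free("/tmp")
    if free is not None:
        print(f"/tmp filesystem has {free:.1f}% of its inodes free")
    for directory, entries in busiest_directories(os.walk("/tmp")):
        print(f"{directory}: {entries} entries")
```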
