About Linux server integration pre-built alerts

The Linux server integration provides a variety of pre-built alerts that you can use right away to begin troubleshooting issues. In this step of the journey, you’ll become familiar with these pre-built alerts and learn how to use them to address various problems.

Did you know? If your machine is functioning properly, you won’t receive any alerts. No news is good news!

Node exporter alerts

NodeCPUHighUsage

Description: High CPU usage

What this means: CPU usage on the node is high. A process may be misbehaving and consuming excessive CPU, or the node may be carrying more work than it can handle.

What to do: Check that workloads are evenly distributed among all nodes and that all processes are operating as expected.

NodeClockNotSynchronising

Description: Clock not synchronising

What this means: The system is currently unable to synchronize its internal clock with an external time source. This could lead to time discrepancies if the system’s clock drifts.

What to do: Check the Network Time Protocol (NTP) configuration and confirm that the node can reach the specified time server.

NodeClockSkewDetected

Description: Clock skew detected

What this means: The system’s internal clock is inaccurate and hasn’t self-corrected.

What to do: Check the Network Time Protocol (NTP) service to ensure it’s working correctly and the node can communicate with the designated time server.

NodeDiskIOSaturation

Description: Disk IO queue is high

What this means: The system is currently experiencing a significant amount of disk input/output operations. This high level of disk activity can slow down the system’s performance.

What to do: Check that all processes are running as expected and consider spreading disk-intensive tasks across multiple nodes.

NodeFileDescriptorLimit

Description: Kernel is predicted to exhaust file descriptors limit soon

What this means: The kernel, the core component of the operating system, can only manage a limited number of open files simultaneously. The system is nearing this limit, which may cause problems with opening new files.

What to do: Identify the process that’s opening many files and failing to close them properly, then fix or restart it. This is the most common cause.
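To see how close a node is to the limit, you can read the kernel’s own counters. The following is a minimal sketch, assuming a Linux host that exposes /proc/sys/fs/file-nr (three fields: allocated handles, unused handles, and the maximum). The function name file_handle_usage is hypothetical and not part of the integration.

```python
from pathlib import Path

FILE_NR = Path("/proc/sys/fs/file-nr")  # standard procfs location on Linux


def file_handle_usage(file_nr_text: str) -> float:
    """Return the fraction of the kernel file-handle limit currently allocated."""
    # The file holds three whitespace-separated integers:
    # allocated handles, allocated-but-unused handles, maximum handles.
    allocated, _unused, maximum = (int(field) for field in file_nr_text.split())
    return allocated / maximum


if __name__ == "__main__":
    if FILE_NR.exists():
        print(f"File handles in use: {file_handle_usage(FILE_NR.read_text()):.1%}")
    else:
        print(f"{FILE_NR} not found; is this a Linux host?")
```

A value that keeps climbing between runs suggests a process that opens files without closing them.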

NodeHighNumberConntrackEntriesUsed

Description: Number of conntrack are getting close to the limit

What this means: Conntrack is a component of the Linux firewall that keeps track of active network connections. The system is currently tracking a high number of connections, which could be a sign of a network problem or a potential security threat.

What to do: Analyze network traffic to determine the root cause.
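Before analyzing traffic, it can help to confirm how full the connection tracking table is. This sketch assumes the nf_conntrack kernel module is loaded, in which case Linux exposes the current entry count and the limit under /proc/sys/net/netfilter. The function name read_conntrack_usage is hypothetical.

```python
from pathlib import Path

# Present only when the nf_conntrack kernel module is loaded (assumption).
CONNTRACK_DIR = Path("/proc/sys/net/netfilter")


def read_conntrack_usage(base: Path = CONNTRACK_DIR):
    """Return tracked connections as a fraction of the limit, or None if unavailable."""
    try:
        count = int((base / "nf_conntrack_count").read_text())
        maximum = int((base / "nf_conntrack_max").read_text())
    except OSError:
        return None  # files missing or unreadable: conntrack is probably not in use
    return count / maximum


if __name__ == "__main__":
    usage = read_conntrack_usage()
    if usage is None:
        print("conntrack counters not available on this host")
    else:
        print(f"conntrack table usage: {usage:.1%}")
```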

NodeMemoryHighUtilization

Description: Host is running out of memory

What this means: The node is using most of its available memory. A memory leak in a program could be causing the high memory consumption.

What to do: Check that all processes are operating as expected. If applicable, fix the memory leak and restart the affected process, or distribute memory-intensive tasks across multiple nodes.
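As a quick check on the node itself, you can compute memory utilization from /proc/meminfo. The sketch below assumes a Linux kernel that reports the MemTotal and MemAvailable fields, and the function name memory_utilization is hypothetical. The alert rule itself may calculate utilization differently.

```python
from pathlib import Path

MEMINFO = Path("/proc/meminfo")


def memory_utilization(meminfo_text: str) -> float:
    """Return the fraction of memory in use, based on MemAvailable and MemTotal."""
    values = {}
    for line in meminfo_text.splitlines():
        # Lines look like "MemTotal:       16000000 kB".
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return 1 - values["MemAvailable"] / values["MemTotal"]


if __name__ == "__main__":
    if MEMINFO.exists():
        print(f"Memory in use: {memory_utilization(MEMINFO.read_text()):.1%}")
```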

NodeMemoryMajorPagesFaults

Description: Memory major page faults are occurring at very high rate

What this means: The system is heavily relying on disk swapping, which means it’s using more memory than is physically available. This significantly degrades performance.

What to do: Find out which processes are consuming the most memory. A potential cause is a memory leak, which you should investigate and resolve.
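To confirm the fault rate on the node, you can sample the kernel’s cumulative major page fault counter twice and divide by the interval. This is a sketch under the assumption that /proc/vmstat contains a pgmajfault line, as it does on typical Linux kernels. The function names are hypothetical.

```python
import time
from pathlib import Path

VMSTAT = Path("/proc/vmstat")


def read_pgmajfault(vmstat_text: str) -> int:
    """Extract the cumulative major page fault counter from /proc/vmstat text."""
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name == "pgmajfault":
            return int(value)
    raise ValueError("pgmajfault counter not found")


def faults_per_second(first: int, second: int, interval_seconds: float) -> float:
    """Convert two counter samples into a rate."""
    return (second - first) / interval_seconds


if __name__ == "__main__":
    if VMSTAT.exists():
        before = read_pgmajfault(VMSTAT.read_text())
        time.sleep(1)
        after = read_pgmajfault(VMSTAT.read_text())
        print(f"Major page faults per second: {faults_per_second(before, after, 1):.0f}")
```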

NodeNetworkReceiveErrs

Description: Network interface is reporting many receive errors

What this means: Network connectivity issues have been detected.

What to do: Investigate the possible causes, which include hardware malfunctions and malicious attacks.

NodeNetworkTransmitErrs

Description: Network interface is reporting many transmit errors

What this means: Network connectivity issues have been detected.

What to do: Investigate the possible causes, which include hardware malfunctions and incorrect network settings.
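For both the receive and transmit error alerts, a first step is to find out which interface is reporting errors. The sketch below parses the per-interface counters in /proc/net/dev, assuming the standard Linux layout of eight receive columns followed by eight transmit columns. The function name parse_net_dev is hypothetical.

```python
from pathlib import Path

NET_DEV = Path("/proc/net/dev")


def parse_net_dev(text: str) -> dict:
    """Return packet and error counters per interface from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, _, counters = line.partition(":")
        fields = counters.split()
        if len(fields) < 16:
            continue  # skip blank or malformed lines
        stats[name.strip()] = {
            "rx_packets": int(fields[1]),
            "rx_errs": int(fields[2]),
            "tx_packets": int(fields[9]),
            "tx_errs": int(fields[10]),
        }
    return stats


if __name__ == "__main__":
    if NET_DEV.exists():
        for interface, counters in parse_net_dev(NET_DEV.read_text()).items():
            if counters["rx_errs"] or counters["tx_errs"]:
                print(interface, counters)
```

These counters are cumulative since boot, so compare two readings taken some time apart to see whether errors are still occurring.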

NodeRAIDDegraded

Description: RAID Array is degraded

What this means: The RAID array is in a critical state and there’s a high risk of data loss.

What to do: To prevent data loss, repair or replace the failed disks and rebuild the RAID array as soon as possible.

NodeRAIDDiskFailure

Description: Failed device in RAID array

What this means: One of the disks in the RAID array has failed.

What to do: While the array is currently operational, replacing the faulty disk is crucial to prevent potential data loss.
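For Linux software RAID (md), you can check which arrays are degraded by reading /proc/mdstat. The sketch below is an assumption-laden example: it applies only to md arrays, not to hardware RAID controllers, and it relies on the usual status format in which "[2/1]" means two devices expected and one active. The function name degraded_arrays is hypothetical.

```python
import re
from pathlib import Path

MDSTAT = Path("/proc/mdstat")  # exists only when the md driver is in use


def degraded_arrays(mdstat_text: str) -> dict:
    """Map each degraded md array to (expected_devices, active_devices)."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)  # e.g. "md0"
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                result[current] = (expected, active)
            current = None  # only the first status line per array is relevant
    return result


if __name__ == "__main__":
    if MDSTAT.exists():
        print(degraded_arrays(MDSTAT.read_text()) or "No degraded md arrays found")
```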

NodeSystemSaturation

Description: System saturated, load per core is very high

What this means: The system load per CPU core is very high. More tasks are waiting to run than the cores can handle, indicating excessive workload.

What to do: Consider distributing tasks across multiple servers.
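Load per core is the load average divided by the number of CPU cores. The following sketch shows that calculation using the Python standard library. The function name load_per_core is hypothetical, and the alert rule may use a different averaging window than the one-minute value read here.

```python
import os


def load_per_core(load_average: float, cpu_count: int) -> float:
    """Divide a load average by the number of CPU cores."""
    return load_average / cpu_count


if __name__ == "__main__":
    one_minute, _five_minute, _fifteen_minute = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() can return None
    print(f"1-minute load per core: {load_per_core(one_minute, cores):.2f}")
```

A value well above 1 means that, on average, more tasks are runnable than there are cores to run them.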

NodeSystemdServiceCrashlooping

Description: Systemd service keeps restarting, possibly crash looping

What this means: A particular service is experiencing repeated crashes.

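What to do: Check the service’s status and logs (for example, with systemctl status and journalctl) to find out why it keeps crashing, then resolve the issue to ensure service stability.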

NodeSystemdServiceFailed

Description: Systemd service has entered failed state

What this means: A specific service has failed and hasn’t restarted automatically.

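What to do: Check the service’s status and logs (for example, with systemctl status and journalctl) to find out why it failed, then resolve the issue to restore service functionality.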

NodeTextFileCollectorScrapeError

Description: Node Exporter text file collector failed to scrape

What this means: The Node Exporter text file collector couldn’t read one of the files it uses to gather metrics. Until this is fixed, the metrics defined in that file aren’t collected or reported.

What to do: Consult the Grafana Alloy and system logs to determine which specific file is inaccessible.

Node exporter filesystem alerts

NodeFilesystemAlmostOutOfSpace 5%

Description: Filesystem has less than 5% space left

What this means: The filesystem is almost full, with less than 5% of its space remaining.

What to do: Add storage capacity or remove unnecessary files to free up space.

NodeFilesystemAlmostOutOfSpace 3%

Description: Filesystem has less than 3% space left

What this means: The filesystem is almost full, with less than 3% of its space remaining.

What to do: Add storage capacity or remove unnecessary files to free up space.
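To check remaining space on a given mount point from the node, you can use the Python standard library. This is a minimal sketch: the function name free_space_fraction is hypothetical, and the "/" path is only an example mount point.

```python
import shutil


def free_space_fraction(path: str) -> float:
    """Return the fraction of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / usage.total


if __name__ == "__main__":
    mount_point = "/"  # example only; use the mount point named in the alert
    print(f"Free space on {mount_point}: {free_space_fraction(mount_point):.1%}")
```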

NodeFilesystemFilesFillingUp 24 hrs

Description: Filesystem is predicted to run out of inodes within the next 24 hours

What this means: At the current rate of file creation, the filesystem is predicted to reach the maximum number of files it can store (its inode limit) within 24 hours, even if the device still has free space.

What to do: Find and remove unneeded files. Inode exhaustion is often caused by numerous small files, which can be a symptom of a process that’s creating files without proper cleanup, particularly in the /tmp directory.

NodeFilesystemFilesFillingUp 4 hrs

Description: Filesystem is predicted to run out of inodes within the next 4 hours

What this means: At the current rate of file creation, the filesystem is predicted to reach the maximum number of files it can store (its inode limit) within 4 hours, even if the device still has free space.

What to do: Find and remove unneeded files. Inode exhaustion is often caused by numerous small files, which can be a symptom of a process that’s creating files without proper cleanup, particularly in the /tmp directory.
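The 24-hour and 4-hour alerts are predictions based on how quickly free inodes are being consumed. The sketch below illustrates the idea with a straight line through two samples. This is a deliberate simplification: the actual alert rule may fit a trend over many samples, and the function name hours_until_exhausted is hypothetical.

```python
def hours_until_exhausted(free_earlier: int, free_now: int, hours_between: float):
    """Estimate hours until free inodes reach zero, from two samples.

    Returns None when free inodes are stable or increasing.
    """
    consumed_per_hour = (free_earlier - free_now) / hours_between
    if consumed_per_hour <= 0:
        return None
    return free_now / consumed_per_hour


if __name__ == "__main__":
    # Example: 100,000 free inodes an hour ago, 90,000 now.
    print(hours_until_exhausted(100_000, 90_000, 1))
```

In the example, 10,000 inodes are consumed per hour, so the remaining 90,000 would last about 9 hours if the rate stays constant.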

NodeFilesystemAlmostOutOfFiles 5% inodes left

Description: Filesystem has less than 5% inodes left

What this means: Although the device may still have free space, the filesystem has almost reached the maximum number of files it can store. Fewer than 5% of its inodes remain.

What to do: Find and remove unneeded files. Inode exhaustion is often caused by numerous small files, which can be a symptom of a process that’s creating files without proper cleanup, particularly in the /tmp directory.

NodeFilesystemAlmostOutOfFiles 3% inodes left

Description: Filesystem has less than 3% inodes left

What this means: Although the device may still have free space, the filesystem has almost reached the maximum number of files it can store. Fewer than 3% of its inodes remain.

What to do: Find and remove unneeded files. Inode exhaustion is often caused by numerous small files, which can be a symptom of a process that’s creating files without proper cleanup, particularly in the /tmp directory.
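To check inode usage and locate the directories holding the most files, you can use the Python standard library. This is a sketch, not part of the integration: the function names are hypothetical, and /tmp is used only because it’s the directory mentioned above.

```python
import os
from collections import Counter


def inode_free_fraction(path: str):
    """Return the fraction of inodes still free on the filesystem containing `path`."""
    stats = os.statvfs(path)
    if stats.f_files == 0:
        return None  # some filesystems don't report an inode total
    return stats.f_ffree / stats.f_files


def files_per_directory(root: str) -> Counter:
    """Count the files directly inside each directory under `root`."""
    counts = Counter()
    # os.walk skips directories it can't read instead of raising an error.
    for directory, _subdirectories, filenames in os.walk(root):
        counts[directory] = len(filenames)
    return counts


if __name__ == "__main__":
    print(f"Free inodes on /: {inode_free_fraction('/')}")
    for directory, count in files_per_directory("/tmp").most_common(5):
        print(f"{count:>8}  {directory}")
```

Directories with unexpectedly high counts are good candidates for cleanup, and point to the process that’s creating the files.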

