Plan your alert rules

Before you create an alert rule, take a moment to plan. Good alerts are specific, actionable, and appropriately urgent. Poor alerts create noise, which leads to alert fatigue and, eventually, ignored pages.

Effective alert planning answers three questions: What should I monitor? When should it fire? Who should be notified?

To plan your alert rules, consider the following:

  1. Choose what to monitor. Start with metrics or logs that directly indicate user impact or system health.

    | Data type | What to monitor | Why it matters |
    |-----------|-----------------|----------------|
    | Metrics | CPU, memory, disk, network | Resource exhaustion affects all services |
    | Logs | Error patterns, exceptions, failed requests | Application health and user impact |

  2. Define meaningful thresholds. Base thresholds on what “normal” looks like in your environment, not arbitrary numbers.

    | Data type | Example threshold | Reasoning |
    |-----------|-------------------|-----------|
    | Metrics | CPU > 80% | Normal is 40-60%, gives time to respond |
    | Logs | Errors > 10/min | Normal is 1-2/min, catches real spikes |

  3. Set appropriate urgency. Not every alert needs to page someone at 3 AM.

    | Alert type | Metrics example | Logs example | Urgency |
    |------------|-----------------|--------------|---------|
    | Critical | Disk 95% full | FATAL or panic logs | Page immediately |
    | Warning | CPU elevated 15 min | Error rate 5x normal | Slack notification |
    | Info | Memory trending up | Unusual log pattern | Email digest |

  4. Identify the responders. Who should receive this alert: the platform team, the database team, or the on-call engineer?

  5. Consider the “for” duration. How long should the condition persist before firing? Brief spikes during deployments shouldn’t page anyone.
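
To make the threshold, urgency, and "for" duration decisions concrete, the following Python sketch models a planned rule and checks whether a breach has lasted long enough to fire. It is a simplified illustration, not Grafana's evaluation logic: `AlertPlan`, `should_fire`, `route`, and the `ROUTES` table are hypothetical names, and the sample data is invented for the example.

```python
from dataclasses import dataclass

# Hypothetical routing table mirroring the urgency levels in step 3.
ROUTES = {"critical": "page", "warning": "slack", "info": "email-digest"}


@dataclass(frozen=True)
class AlertPlan:
    name: str
    threshold: float   # condition is breached when a value is strictly above this
    for_seconds: int   # how long the breach must persist before firing
    severity: str      # key into ROUTES


def should_fire(samples, plan):
    """Return True if the latest run of samples has stayed above the
    threshold for at least plan.for_seconds.

    samples: list of (timestamp_seconds, value) tuples, oldest first.
    """
    breach_start = None
    for ts, value in samples:
        if value > plan.threshold:
            if breach_start is None:
                breach_start = ts  # a new breach begins here
        else:
            breach_start = None    # any recovery resets the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= plan.for_seconds


def route(plan):
    """Look up the notification channel for the plan's severity."""
    return ROUTES[plan.severity]


# Example plan: CPU above 80% for 5 minutes, sent as a warning.
cpu_plan = AlertPlan(name="High CPU", threshold=80.0, for_seconds=300, severity="warning")

# A brief deployment spike that recovers within a minute.
spike = [(0, 50), (60, 95), (120, 55), (180, 50)]

# A sustained breach lasting the full 5 minutes.
sustained = [(0, 85), (60, 90), (120, 88), (180, 91), (240, 86), (300, 92)]

print(should_fire(spike, cpu_plan))      # False
print(should_fire(sustained, cpu_plan))  # True
print(route(cpu_plan))                   # slack
```

In this sketch the brief spike never fires because the condition recovers before the duration elapses, while the sustained breach fires and is routed to the channel matching its severity.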

In the next milestone, you’ll use Grafana’s exploration tools to find the specific metrics or logs you want to alert on.



