Alerts

Updated Nov 20, 2022

Overview

Prometheus alerting allows you to define conditions for monitoring metrics and send notifications when thresholds are breached. Alerts are configured in rule files and can be sent to external systems like Slack, email, or PagerDuty using Alertmanager.

Each rule specifies the condition, duration, and labels for the alert.
Note that Prometheus only evaluates rules and triggers alerts; it does not send notifications itself.
Alertmanager processes alerts and routes them to defined notification channels.
One Alertmanager can support multiple Prometheus servers.
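
As a rough sketch of how these pieces connect, the Prometheus configuration below loads a rule file and points Prometheus at one Alertmanager instance. The file name alert_rules.yml and the address alertmanager.example.internal:9093 are placeholders chosen for illustration, not values from this guide; 9093 is Alertmanager's default port.

```yaml
# prometheus.yml (sketch; file name and target address are assumed placeholders)
global:
  evaluation_interval: 1m   # how often rule groups are evaluated

rule_files:
  - "alert_rules.yml"       # hypothetical file containing the rule groups

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager.example.internal:9093"   # hypothetical host
```

Several Prometheus servers can list the same target under alerting, which is how one Alertmanager ends up serving multiple servers.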

Alerting Rules

Alerting Rules in Prometheus are similar to Recording Rules but are specifically designed to trigger notifications when certain conditions are met. Both rule types can be defined in the same YAML file, either in a single rule group or in separate groups, depending on your configuration needs.

groups:
  - name: example-rules
    rules:
      # Recording Rule
      - record: job:cpu_usage:avg_rate
        expr: avg(rate(cpu_usage_seconds_total[5m])) by (job)

      # Alerting Rule
      - alert: HighCPUUsage
        expr: avg(rate(cpu_usage_seconds_total[5m])) by (instance) > 0.8
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected"
          description: "Instance {{ $labels.instance }} has exceeded 80% CPU usage for over 10 minutes."

Explanation:

Recording Rule:
- Computes the average CPU usage rate grouped by job over the last 5 minutes.
- Stores the result as job:cpu_usage:avg_rate.
Alerting Rule:
- Triggers an alert when average CPU usage exceeds 80% for a specific instance for more than 10 minutes.
- Includes labels and annotations for alert categorization and detailed information.

The `for` Clause

The for clause specifies the duration a condition must be true before an alert is triggered. This helps prevent false alarms by ensuring the issue persists over the defined period.

groups:
  - name: example-rules
    rules:
      # Recording Rule
      - record: job:cpu_usage:avg_rate
        expr: avg(rate(cpu_usage_seconds_total[5m])) by (job)

      # Alerting Rule
      - alert: HighCPUUsage
        expr: avg(rate(cpu_usage_seconds_total[5m])) by (instance) > 0.8
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected"
          description: "Instance {{ $labels.instance }} has exceeded 80% CPU usage for over 10 minutes."

In the example above, the HighCPUUsage alert fires only if the average CPU usage stays above 80% for 10 continuous minutes. Temporary spikes in CPU usage therefore do not trigger unnecessary alerts, which makes notifications more reliable and meaningful.

Prevent race conditions

Including the for clause in alerting rules prevents race conditions and mitigates the impact of transient issues, such as scrape timeouts or temporary network disruptions. These problems might cause individual scrapes to fail, leading to false-positive alerts.
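
A minimal sketch of this idea uses the built-in up metric, which Prometheus sets to 1 when a scrape succeeds and 0 when it fails. The group name, alert name, and the 5m duration are illustrative choices, not values prescribed by this guide.

```yaml
groups:
  - name: availability-rules   # hypothetical group name
    rules:
      - alert: InstanceDown
        # up is 0 when the most recent scrape of the target failed.
        expr: up == 0
        # Require the target to stay down for 5 minutes before firing.
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 5 minutes."
```

With this rule, a single failed scrape only moves the alert to pending. If the next scrapes succeed, the alert returns to inactive and no notification is sent.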

Alert States

Alerts in Prometheus can exist in one of three states, depending on their evaluation results over time. These states help track the lifecycle of an alert.

Inactive: The alert condition is not met, and the alert remains dormant.
Pending: The alert condition is met, but the for clause duration has not yet elapsed.
Firing: The alert condition has been met continuously for the for clause duration, and the alert is now active.
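
Prometheus also exposes pending and firing alerts as a synthetic time series named ALERTS, carrying the alertname and alertstate labels; inactive alerts have no series. The recording rule below is a small sketch that counts alerts per state. The group name and the recorded metric name are assumptions made for illustration.

```yaml
groups:
  - name: alert-state-rules          # hypothetical group name
    rules:
      # One output series per state, e.g. alertstate="pending" and alertstate="firing".
      - record: alertstate:alerts:count   # hypothetical metric name
        expr: count by (alertstate) (ALERTS)
```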

Labels

Labels can be added to alerts to classify and provide context, making them easier to organize and filter. They are also used by Alertmanager to group, match, and route specific alerts for notification or silencing.

groups:
  - name: example-rules
    rules:
      - alert: HighCPUUsage
        expr: avg(rate(cpu_usage_seconds_total[5m])) by (instance) > 0.8
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High CPU usage detected"
          description: "Instance {{ .Labels.instance }} has exceeded 80% CPU usage for over 10 minutes."
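
To illustrate how Alertmanager can use the severity label for routing, here is a minimal sketch of an Alertmanager configuration. The receiver names and webhook URLs are hypothetical, and the matchers syntax assumes Alertmanager 0.22 or later.

```yaml
# alertmanager.yml (sketch; receiver names and URLs are hypothetical)
route:
  receiver: default-receiver            # fallback for alerts that match no child route
  group_by: ["alertname", "instance"]   # alerts sharing these labels are batched together
  routes:
    - matchers:
        - severity="critical"           # matches the label set in the alerting rule
      receiver: oncall-receiver

receivers:
  - name: default-receiver
    webhook_configs:
      - url: "http://alert-hooks.example.internal/default"
  - name: oncall-receiver
    webhook_configs:
      - url: "http://alert-hooks.example.internal/oncall"
```

Alerts labeled severity: critical go to oncall-receiver; everything else falls through to default-receiver.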

Annotations

Annotations can be used to provide additional context for alerts, such as descriptions or helpful links. These are primarily used to include detailed information in alert notifications sent by Alertmanager. Unlike labels, annotations cannot be used to classify or match alerts for triggering notifications.

Annotations are templated using the Go templating language. The variables $labels and $value are shorthand for .Labels and .Value:

To get all labels of the alert:
```
{{.Labels}}
```
To get instance label:
```
{{.Labels.instance}}
```
To get the firing sample value:
```
{{.Value}}
```

In the example below, the rule triggers an alert called HighMemoryUsage when the average memory usage (memory_usage_bytes) over the last 5 minutes exceeds 80 bytes for an instance and the condition persists for 10 minutes.

groups:
  - name: example-rules
    rules:
      - alert: HighMemoryUsage
        expr: avg by (instance) (avg_over_time(memory_usage_bytes[5m])) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High Memory Usage Alert"
          description: "Instance {{ .Labels.instance }} is using {{ .Value }} bytes of memory, which exceeds the threshold."
          details: "Check logs and performance metrics for {{ .Labels.instance }} to diagnose the issue."
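
Annotations can also format values and carry links. The variant below is a sketch that uses the built-in humanize1024 template function to print the value with binary prefixes and adds a runbook link. The runbook URL is hypothetical, and the expression assumes memory_usage_bytes is a gauge.

```yaml
groups:
  - name: example-rules
    rules:
      - alert: HighMemoryUsage
        # Assumes memory_usage_bytes is a gauge, so avg_over_time is used here.
        expr: avg by (instance) (avg_over_time(memory_usage_bytes[5m])) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          # humanize1024 formats the number with binary prefixes (Ki, Mi, Gi).
          description: "Memory usage is {{ $value | humanize1024 }}B."
          # Hypothetical link; replace with a real runbook location.
          runbook_url: "https://runbooks.example.internal/high-memory-usage"
```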
