Testing Alerts

Updated Nov 20, 2022

Overview

In this lab, we will test an alert by setting up the rules file and Alertmanager configuration file. The lab environment includes the following nodes:

Node	Name	Role
Node1	project-web	Web Server
Node2	project-app	App Server
Node3	project-db	Database Server
Node4	prometheus	Prometheus Server

On Node4, the following components are installed:

Prometheus
Alertmanager

Pre-requisites

Test Prometheus Components

Make sure that you have set up the Prometheus components on Node4.

Prometheus:

Alertmanager:

On Nodes 1, 2, and 3, make sure that the Node Exporter is set up and running. To confirm this, open a browser and navigate to the URL below, replacing node1-ip with the IP address of each node in turn.

http://node1-ip:9100/

Alternatively, open the Prometheus console and go to Status > Targets. The three Node Exporter targets should be listed as UP.
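The same check can be scripted against the Prometheus HTTP API, which exposes target health at /api/v1/targets. The sketch below is only an illustration: it assumes Prometheus listens on its default port 9090 on Node4 and that curl is installed, and the helper name count_healthy is made up for this example.

```shell
# count_healthy: read the JSON from /api/v1/targets on stdin and
# print how many targets report health "up".
count_healthy() {
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Usage on Node4 (host and port are assumptions, adjust to your setup):
#   curl -s http://localhost:9090/api/v1/targets | count_healthy
```

With all three Node Exporters running, the count should be at least 3 (more if Prometheus also scrapes itself).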

Create the Disk Space Rule

Create the rules directory.

sudo mkdir -p /etc/prometheus/rules/
sudo chown -R prometheus:prometheus /etc/prometheus/rules/

Next, create the file /etc/prometheus/rules/rules-diskspace.yml. This rule fires when any filesystem has less than 50% free space. The threshold is deliberately high so that the alert can be triggered immediately later in the lab.

groups:
  - name: node
    rules:
      - alert: LowDiskSpace
        expr: 100 * node_filesystem_free_bytes{job="node_exporter"} / node_filesystem_size_bytes{job="node_exporter"} < 50
        labels:
          severity: warning
          environment: prod

Note: The job name must match the job name configured for the Node Exporter targets in the prometheus.yml file.

Reference the rules directory under rule_files in /etc/prometheus/prometheus.yml:

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - /etc/prometheus/rules/*.yml
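Before restarting, it can help to validate the configuration with promtool, which ships with Prometheus. Its check config subcommand also checks the rule files referenced by rule_files. The wrapper below is a minimal sketch: the function name check_config is made up for this example, and it assumes promtool is on the PATH of Node4.

```shell
# check_config FILE: validate a Prometheus config and the rule files it references.
# Returns 127 if promtool is not on PATH, otherwise promtool's own exit status.
check_config() {
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found in PATH" >&2
    return 127
  fi
  promtool check config "$1"
}

# Usage on Node4 (path taken from this lab):
#   check_config /etc/prometheus/prometheus.yml
```

A non-zero exit status means the restart would likely fail, so fix the reported error first.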

Restart the Prometheus service.

sudo systemctl restart prometheus
sudo systemctl status prometheus

Go back to the Prometheus console and go to Status > Rules. The LowDiskSpace rule should be listed.

Click Alerts. At this point the alert is listed but is still Inactive.

Trigger the Disk Space Rule

First, check the current disk usage:

$ df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           383M     0  383M   0% /dev/shm
tmpfs           154M  4.4M  149M   3% /run
/dev/xvda4      8.8G  1.6G  7.2G  19% /
/dev/xvda3      960M  168M  793M  18% /boot
/dev/xvda2      200M  7.1M  193M   4% /boot/efi
tmpfs            77M     0   77M   0% /run/user/1000 

To trigger the alert, use dd to generate a large dummy file. For example, to create a dummy file of about 5 GB:

dd if=/dev/zero of=/tmp/dummyfile bs=1M count=5000
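The file size follows directly from the dd arguments: block size times block count. The small helper below (dd_size_bytes is a made-up name for this example) works it out for bs=1M, where dd treats 1M as 1024 × 1024 bytes.

```shell
# dd_size_bytes BS_MIB COUNT: bytes written by dd with bs=<BS_MIB>M count=<COUNT>
dd_size_bytes() {
  echo $(( $1 * 1024 * 1024 * $2 ))
}

dd_size_bytes 1 5000   # prints 5242880000
```

This matches the 5242880000-byte dummyfile shown in the /tmp listing later in this lab.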

Now check the available disk space again.

$ df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           383M     0  383M   0% /dev/shm
tmpfs           154M  4.4M  149M   3% /run
/dev/xvda4      8.8G  6.5G  2.3G  74% /
/dev/xvda3      960M  168M  793M  18% /boot
/dev/xvda2      200M  7.1M  193M   4% /boot/efi
tmpfs            77M     0   77M   0% /run/user/1000 

Now go back to the Prometheus console and check the Alerts again. The alert should now change from Inactive to Firing.
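To see why the alert fires, the rule's arithmetic can be reproduced from the df output above. The sketch below uses the Size and Avail columns of the root filesystem as a rough stand-in for node_filesystem_size_bytes and node_filesystem_free_bytes. The two are not guaranteed to be identical on every filesystem, and the function names are made up for this example.

```shell
# free_pct SIZE AVAIL: print free space as a percentage of SIZE (one decimal place)
free_pct() {
  awk -v size="$1" -v avail="$2" 'BEGIN { printf "%.1f\n", 100 * avail / size }'
}

# would_fire SIZE AVAIL: exit 0 when free space is below the rule's 50% threshold
would_fire() {
  awk -v size="$1" -v avail="$2" 'BEGIN { exit !(100 * avail / size < 50) }'
}

free_pct 8.8 7.2   # before dd: prints 81.8
free_pct 8.8 2.3   # after dd:  prints 26.1
would_fire 8.8 2.3 && echo "LowDiskSpace condition is true"
```

Only the root filesystem drops below 50% free, so only that series should appear under the firing alert.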

Clear the Disk Space Alert

To clear the alert, we need to resolve the underlying issue. The dd command from the previous section generated a file of about 5 GB in the /tmp directory.

$ ll /tmp/

total 5120000
-rw-r--r--. 1 root root 5242880000 Dec 14 13:30 dummyfile
drwx------. 3 root root         17 Dec 14 11:33 systemd-private-b9a8aa9cc4e34627990ad1928bec10a3-chronyd.service-bQtk6d
drwx------. 3 root root         17 Dec 14 11:33 systemd-private-b9a8aa9cc4e34627990ad1928bec10a3-dbus-broker.service-FBCMcE
drwx------. 3 root root         17 Dec 14 11:33 systemd-private-b9a8aa9cc4e34627990ad1928bec10a3-kdump.service-NjWS7M
drwx------. 3 root root         17 Dec 14 11:33 systemd-private-b9a8aa9cc4e34627990ad1928bec10a3-systemd-logind.service-BKZu0J 

To free up disk space, delete the large dummy file and recheck the disk usage:

rm -f /tmp/dummyfile

Once deleted, the disk space usage should decrease significantly.

$ df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        4.0M     0  4.0M   0% /dev
tmpfs           383M     0  383M   0% /dev/shm
tmpfs           154M  4.4M  149M   3% /run
/dev/xvda4      8.8G  1.6G  7.2G  19% /
/dev/xvda3      960M  168M  793M  18% /boot
/dev/xvda2      200M  7.1M  193M   4% /boot/efi
tmpfs            77M     0   77M   0% /run/user/1000 

Check the alerts again in the Prometheus console. The LowDiskSpace alert should return to Inactive.

Verify Alerting Configuration

After you install Alertmanager on the Prometheus server, make sure you also set the alerting configuration in /etc/prometheus/prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093 

Restart Prometheus:

sudo systemctl restart prometheus
sudo systemctl status prometheus

Create the Uptime Rules

Create another rule file: /etc/prometheus/rules/rules-uptime.yml.

groups:
  - name: node
    rules:
      - alert: NodeDown
        expr: up{job="node_exporter"} == 0
        for: 10s
        labels:
          severity: critical  
        annotations:
          message: "node {{.Labels.instance}} is down"

Note: The job name must match the job name configured for the Node Exporter targets in the prometheus.yml file.

Restart Prometheus so that it loads the new rule file, then go to the Prometheus console > Status > Rules.

Click Alerts. The NodeDown alert should be listed as Inactive.

Trigger the Uptime Rule

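Turn off Node1 and Node3. Prometheus will then be unable to scrape these two targets.

Then check the alerts again in the Prometheus console. Because the rule sets for: 10s, each NodeDown alert first shows as Pending and changes to Firing once the target has stayed down for about 10 seconds.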

Check the Alertmanager console. You might need to refresh it a few times.
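The alerts that Alertmanager has received can also be listed from the command line through its v2 API. The sketch below assumes Alertmanager is reachable at localhost:9093 (the target configured earlier in this lab) and that curl is installed; the helper name alert_names is made up for this example.

```shell
# alert_names: read the JSON from Alertmanager's /api/v2/alerts on stdin
# and print each distinct alertname once.
alert_names() {
  grep -o '"alertname":"[^"]*"' | cut -d'"' -f4 | sort -u
}

# Usage on Node4:
#   curl -s http://localhost:9093/api/v2/alerts | alert_names
```

With Node1 and Node3 powered off, NodeDown should appear in the output once the alerts have been forwarded.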
