Testing Alerts

Updated Nov 20, 2022

Overview

In this lab, we will test an alert by setting up the rules file and Alertmanager configuration file. The lab environment includes the following nodes:

Node    Name           Role
Node1   project-web    Web Server
Node2   project-app    App Server
Node3   project-db     Database Server
Node4   prometheus     Prometheus Server

On Node4, the following components are installed:

  • Prometheus
  • Alertmanager

Prerequisites

Test Prometheus Components

Make sure that you have set up the Prometheus components on Node4 and that both the Prometheus and Alertmanager services are running.
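
A quick way to confirm both are running is to check their systemd units; the sketch below assumes the services were installed as units named prometheus and alertmanager, as in this lab setup:

# assumes the services are installed as systemd units named "prometheus" and "alertmanager"
sudo systemctl status prometheus
sudo systemctl status alertmanager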

On Nodes 1, 2, and 3, make sure that the Node Exporter is set up and running. To confirm this, open a browser and navigate to the URL below for each node, substituting that node's IP address. Do this for all three nodes.

http://node1-ip:9100/

Another way is to access the Prometheus console and go to Status > Targets.
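
You can also check each exporter from the command line; in the sketch below, node1-ip is a placeholder for the node's actual address (repeat for the other two nodes):

# node1-ip is a placeholder; repeat for node2 and node3
curl -s http://node1-ip:9100/metrics | head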

Create the Disk Space Rule

Create the rules directory.

mkdir /etc/prometheus/rules/ 
chown -R prometheus:prometheus /etc/prometheus/rules/

Next, create the /etc/prometheus/rules/rules-diskspace.yml file. This rule fires when any filesystem has less than 50% free space. We set the threshold deliberately high so that it is easy to trigger later.

groups:
  - name: node
    rules:
      - alert: LowDiskSpace
        expr: 100 * node_filesystem_free_bytes{job="node_exporter"} / node_filesystem_size_bytes{job="node_exporter"} < 50
        labels:
          severity: warning
          environment: prod

The job name in the expression must match the job_name defined under scrape_configs in the prometheus.yml file.
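
Optionally, you can validate the rule file before loading it, assuming promtool was installed alongside Prometheus:

promtool check rules /etc/prometheus/rules/rules-diskspace.yml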

Reference the rules directory in the /etc/prometheus/prometheus.yml:

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - /etc/prometheus/rules/*.yml
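
Before restarting, it can also be worth validating the whole configuration, including the referenced rule files (again assuming promtool is available):

promtool check config /etc/prometheus/prometheus.yml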

Restart the Prometheus service.

sudo systemctl restart prometheus 
sudo systemctl status prometheus

Go back to the Prometheus console and go to Status > Rules.

Click Alerts. At this point, no alerts are firing yet.
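
If you want to see the values the rule evaluates, you can paste the alert expression (without the threshold) into the Graph tab of the Prometheus console; it returns the percentage of free space per filesystem:

100 * node_filesystem_free_bytes{job="node_exporter"} / node_filesystem_size_bytes{job="node_exporter"}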

Trigger the Disk Space Rule

Login to node1 and check the filesystem:

$ df -h

Filesystem Size Used Avail Use% Mounted on
devtmpfs 4.0M 0 4.0M 0% /dev
tmpfs 383M 0 383M 0% /dev/shm
tmpfs 154M 4.4M 149M 3% /run
/dev/xvda4 8.8G 1.6G 7.2G 19% /
/dev/xvda3 960M 168M 793M 18% /boot
/dev/xvda2 200M 7.1M 193M 4% /boot/efi
tmpfs 77M 0 77M 0% /run/user/1000

To trigger the alert, use dd to generate a large dummy file. For example, to create a 5GB file:

dd if=/dev/zero of=/tmp/dummyfile bs=1M count=5000
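
If dd is slow in your environment, fallocate can usually reserve a file of the same size almost instantly, assuming the filesystem supports it:

# alternative to dd; requires filesystem support for fallocate
fallocate -l 5G /tmp/dummyfile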

Now check the available disk space again.

$ df -h

Filesystem Size Used Avail Use% Mounted on
devtmpfs 4.0M 0 4.0M 0% /dev
tmpfs 383M 0 383M 0% /dev/shm
tmpfs 154M 4.4M 149M 3% /run
/dev/xvda4 8.8G 6.5G 2.3G 74% /
/dev/xvda3 960M 168M 793M 18% /boot
/dev/xvda2 200M 7.1M 193M 4% /boot/efi
tmpfs 77M 0 77M 0% /run/user/1000

Now go back to the Prometheus console and check the Alerts again. The alert should now change from Inactive to Firing.
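
You can also list the active alerts from the command line via the Prometheus HTTP API; the example below assumes Prometheus is listening on the default port 9090 on the local host:

curl -s http://localhost:9090/api/v1/alerts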

Clear the Disk Space Alert

To clear the alert, we need to resolve the underlying issue. The dd command from the previous section generated a 5GB file in the /tmp directory.

$ ls -l /tmp/

total 5120000
-rw-r--r--. 1 root root 5242880000 Dec 14 13:30 dummyfile
drwx------. 3 root root 17 Dec 14 11:33 systemd-private-b9a8aa9cc4e34627990ad1928bec10a3-chronyd.service-bQtk6d
drwx------. 3 root root 17 Dec 14 11:33 systemd-private-b9a8aa9cc4e34627990ad1928bec10a3-dbus-broker.service-FBCMcE
drwx------. 3 root root 17 Dec 14 11:33 systemd-private-b9a8aa9cc4e34627990ad1928bec10a3-kdump.service-NjWS7M
drwx------. 3 root root 17 Dec 14 11:33 systemd-private-b9a8aa9cc4e34627990ad1928bec10a3-systemd-logind.service-BKZu0J

To free up disk space, delete the large dummy file and recheck the disk usage:

rm -f /tmp/dummyfile

Once deleted, the disk space usage should decrease significantly.

$ df -h

Filesystem Size Used Avail Use% Mounted on
devtmpfs 4.0M 0 4.0M 0% /dev
tmpfs 383M 0 383M 0% /dev/shm
tmpfs 154M 4.4M 149M 3% /run
/dev/xvda4 8.8G 1.6G 7.2G 19% /
/dev/xvda3 960M 168M 793M 18% /boot
/dev/xvda2 200M 7.1M 193M 4% /boot/efi
tmpfs 77M 0 77M 0% /run/user/1000

Check the alerts again in the Prometheus console.

Verify Alerting Configuration

After you install Alertmanager on the Prometheus server, make sure you also set the alerting configuration in /etc/prometheus/prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

Restart Prometheus:

sudo systemctl restart prometheus
systemctl status prometheus

Create the Uptime Rules

Create another rule file: /etc/prometheus/rules/rules-uptime.yml.

groups:
  - name: node
    rules:
      - alert: NodeDown
        expr: up{job="node_exporter"} == 0
        for: 10s
        labels:
          severity: critical
        annotations:
          message: "node {{ $labels.instance }} is down"

The job name in the expression must match the job_name defined under scrape_configs in the prometheus.yml file.

Go to the Prometheus console > Status > Rules.

Then click Alerts. So far so good: the new NodeDown alert is still Inactive.

Trigger the Uptime Rule

Turn off Node1 and Node3. Prometheus should no longer be able to reach these two targets.

Then check the alerts again in the Prometheus console.

Check the Alertmanager console. You might need to refresh it a few times.
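
If amtool is installed alongside Alertmanager, you can also query the alerts it has received from the command line; the sketch below assumes the default Alertmanager URL:

amtool alert query --alertmanager.url=http://localhost:9093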

