Make Prometheus Alert Rules Work Better and Organize Them Better #19

Merged
june merged 5 commits from adjust_prometheus_alert_rules into main 2025-02-07 13:57:48 +01:00
Showing only changes of commit c4e35c1adf - Show all commits

grafana: pull out prom. net. rec. err. alerts for OPNs. to ex. wg int.
All checks were successful
/ Ansible Lint (push) Successful in 1m32s
/ Ansible Lint (pull_request) Successful in 1m30s

Pull out prometheus network receive error alerts for OPNsense to exclude
its WireGuard interfaces, which like to throw errors, but which aren't
of importance.
June 2025-02-06 01:34:45 +01:00
Signed by: june
SSH key fingerprint: SHA256:o9EAq4Y9N9K0pBQeBTqhSDrND5E7oB+60ZNx0U1yPe0

View file

@ -79,14 +79,26 @@ groups:
annotations:
summary: Host unusual network throughput out (instance {{ $labels.instance }})
description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n VALUE = {{ $value }}"
# General network receive error alerts.
# Excluding: OPNsense hosts
- alert: HostNetworkReceiveErrors
expr: (rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
expr: (rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+", nodename!="OPNsense"}
for: 2m
labels:
severity: warning
annotations:
summary: Host Network Receive Errors (instance {{ $labels.instance }})
description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n VALUE = {{ $value }}"
# OPNsense network receive error alerts.
# This is the same as the regular network receive error alerts, but excluding the WireGuard interfaces as they like to throw errors, but which aren't of importance.
- alert: OPNsenseHostNetworkReceiveErrors
expr: (rate(node_network_receive_errs_total{device!~"wg.+"}[2m]) / rate(node_network_receive_packets_total{device!~"wg.+"}[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename="OPNsense"}
for: 2m
labels:
severity: warning
annotations:
summary: OPNsense host Network Receive Errors (instance {{ $labels.instance }})
description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n VALUE = {{ $value }}"
- alert: HostNetworkTransmitErrors
expr: (rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
for: 2m