grafana: account for long backup jobs in Prom. hyperv. disk rw rate al.

grafana: pull out prom. net. rec. err. alerts for OPNs. to ex. wg int.
Pull out the Prometheus network receive error alerts for OPNsense to exclude its WireGuard interfaces, which tend to throw errors that are of no importance.
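One way to check that the WireGuard exclusion behaves as described is a `promtool` unit test on the expression. The following is only a sketch and not part of this change: the instance label `fw:9100`, the interface names `wg0` and `em0`, and all sample values are made up for illustration, and the expression is copied from the `OPNsenseHostNetworkReceiveErrors` rule in the diff.

```yaml
# Hypothetical test file, e.g. run with: promtool test rules <this-file>.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # WireGuard interface with a 50% receive error ratio (made-up numbers).
      - series: 'node_network_receive_errs_total{instance="fw:9100", device="wg0"}'
        values: '0+30x10'
      - series: 'node_network_receive_packets_total{instance="fw:9100", device="wg0"}'
        values: '0+60x10'
      # Non-WireGuard interface with the same error ratio.
      - series: 'node_network_receive_errs_total{instance="fw:9100", device="em0"}'
        values: '0+30x10'
      - series: 'node_network_receive_packets_total{instance="fw:9100", device="em0"}'
        values: '0+60x10'
      - series: 'node_uname_info{instance="fw:9100", nodename="OPNsense"}'
        values: '1+0x10'
    promql_expr_test:
      # Only em0 should remain; wg0 is dropped by the device!~"wg.+" matchers.
      - expr: (rate(node_network_receive_errs_total{device!~"wg.+"}[2m]) / rate(node_network_receive_packets_total{device!~"wg.+"}[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename="OPNsense"}
        eval_time: 5m
        exp_samples:
          - labels: '{device="em0", instance="fw:9100", nodename="OPNsense"}'
            value: 0.5
```

Prometheus anchors label regexes at both ends, so `wg.+` matches names such as `wg0` or `wg1` but not a device named exactly `wg`.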
2025-02-06 19:17:21 +01:00 · 2025-02-06 01:34:45 +01:00 · 2025-02-06 01:13:10 +01:00 · 2025-02-06 01:05:05 +01:00 · 2025-02-06 00:12:50 +01:00
1 changed file with 174 additions and 137 deletions
--- a/resources/chaosknoten/grafana/docker_compose/prometheus_alerts.rules.yaml
+++ b/resources/chaosknoten/grafana/docker_compose/prometheus_alerts.rules.yaml
@@ -1,7 +1,7 @@
 # Links & Resources:
 # - https://samber.github.io/awesome-prometheus-alerts/rules
 groups:
-  - name: node-exporter
+  - name: node-exporter-memory
    rules:
      - alert: HostOutOfMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
@@ -28,6 +28,41 @@ groups:
        annotations:
          summary: Host Memory is underutilized (instance {{ $labels.instance }})
          description: "Node memory is < 10% for 1 week. Consider reducing memory space. (instance {{ $labels.instance }})\n  VALUE = {{ $value }}"
+      - alert: HostSwapIsFillingUp
+        expr: ((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: Host swap is filling up (instance {{ $labels.instance }})
+          description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}"
+      - alert: HostOomKillDetected
+        expr: (increase(node_vmstat_oom_kill[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
+        for: 0m
+        labels:
+          severity: warning
+        annotations:
+          summary: Host OOM kill detected (instance {{ $labels.instance }})
+          description: "OOM kill detected\n  VALUE = {{ $value }}"
+      - alert: HostEdacCorrectableErrorsDetected
+        expr: (increase(node_edac_correctable_errors_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
+        for: 0m
+        labels:
+          severity: info
+        annotations:
+          summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
+          description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}"
+      - alert: HostEdacUncorrectableErrorsDetected
+        expr: (node_edac_uncorrectable_errors_total > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
+        for: 0m
+        labels:
+          severity: warning
+        annotations:
+          summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
+          description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}"
+
+  - name: node-exporter-network
+    rules:
      - alert: HostUnusualNetworkThroughputIn
        expr: (sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
        for: 5m
@@ -44,56 +79,107 @@ groups:
        annotations:
          summary: Host unusual network throughput out (instance {{ $labels.instance }})
          description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}"
-      # Have different disk read and write rate alerts for VMs and physical machines.
-      - alert: VirtualHostUnusualDiskReadRate
-        expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{ype="virtual_machine", nodename=~".+", nodename!="forgejo-actions-runner", nodename!="woodpecker"}
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: Virtual host unusual disk read rate (instance {{ $labels.instance }})
-          description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}"
-      - alert: VirtualHostUnusualDiskWriteRate
-        expr: (sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{type="virtual_machine", nodename=~".+", nodename!="forgejo-actions-runner", nodename!="woodpecker"}
+      # General network receive error alerts.
+      # Excluding: OPNsense hosts
+      - alert: HostNetworkReceiveErrors
+        expr: (rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+", nodename!="OPNsense"}
        for: 2m
        labels:
          severity: warning
        annotations:
-          summary: Virtual host unusual disk write rate (instance {{ $labels.instance }})
+          summary: Host Network Receive Errors (instance {{ $labels.instance }})
+          description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n  VALUE = {{ $value }}"
+      # OPNsense network receive error alerts.
+      # Same as the regular network receive error alert, but excludes the WireGuard interfaces, which tend to throw errors that are of no importance.
+      - alert: OPNsenseHostNetworkReceiveErrors
+        expr: (rate(node_network_receive_errs_total{device!~"wg.+"}[2m]) / rate(node_network_receive_packets_total{device!~"wg.+"}[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename="OPNsense"}
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: OPNsense host Network Receive Errors (instance {{ $labels.instance }})
+          description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n  VALUE = {{ $value }}"
+      - alert: HostNetworkTransmitErrors
+        expr: (rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: Host Network Transmit Errors (instance {{ $labels.instance }})
+          description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes.\n  VALUE = {{ $value }}"
+      - alert: HostNetworkBondDegraded
+        expr: ((node_bonding_active - node_bonding_slaves) != 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: Host Network Bond Degraded (instance {{ $labels.instance }})
+          description: "Bond \"{{ $labels.device }}\" degraded on \"{{ $labels.instance }}\".\n  VALUE = {{ $value }}"
+      - alert: HostConntrackLimit
+        expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: Host conntrack limit (instance {{ $labels.instance }})
+          description: "The number of conntrack is approaching limit\n  VALUE = {{ $value }}"
+
+  - name: node-exporter-disk
+    rules:
+      # General high disk read and write rate alerts.
+      # Excluding: hypervisor hosts, CI hosts
+      - alert: HostUnusualDiskReadRate
+        expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~".+", nodename!="forgejo-actions-runner", nodename!="woodpecker", nodename!="chaosknoten"}
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: Host unusual disk read rate (instance {{ $labels.instance }})
+          description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}"
+      - alert: HostUnusualDiskWriteRate
+        expr: (sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~".+", nodename!="forgejo-actions-runner", nodename!="woodpecker", nodename!="chaosknoten"}
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: Host unusual disk write rate (instance {{ $labels.instance }})
          description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}"
-      # Some VMs are expected to have high read/write rates, e.g. CI servers
-      - alert: VirtualHostUnusualDiskReadRate
-        expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{ype="virtual_machine", nodename="forgejo-actions-runner", nodename="woodpecker"}
+      # CI hosts high disk read and write alerts.
+      # Longer intervals to account for disk intensive CI tasks.
+      - alert: CIHostUnusualDiskReadRate
+        expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~"forgejo-actions-runner|woodpecker"}
        for: 10m
        labels:
          severity: warning
        annotations:
-          summary: Virtual host unusual disk read rate for 10 min (instance {{ $labels.instance }})
+          summary: CI host unusual disk read rate for 10 min (instance {{ $labels.instance }})
          description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}"
      - alert: VirtualHostUnusualDiskWriteRate
-        expr: (sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{type="virtual_machine", nodename="forgejo-actions-runner", nodename="woodpecker"}
+        expr: (sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~"forgejo-actions-runner|woodpecker"}
        for: 4m
        labels:
          severity: warning
        annotations:
-          summary: Virtual host unusual disk write rate for 4 min (instance {{ $labels.instance }})
+          summary: CI host unusual disk write rate for 4 min (instance {{ $labels.instance }})
          description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}"
-      - alert: PhysicalHostUnusualDiskReadRate
-        expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename) node_uname_info{type="physical_machine", nodename=~".+"}
-        for: 20m
+      # Hypervisor host high disk read and write alerts.
+      # Longer intervals to account for disk intensive hypervisor tasks (backups, moving VMs, etc.).
+      - alert: HypervisorHostUnusualDiskReadRate
+        expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename="chaosknoten"}
+        for: 60m
        labels:
          severity: warning
        annotations:
-          summary: Physical host unusual disk read rate (instance {{ $labels.instance }})
-          description: "Disk is probably reading too much data (> 100 MB/s)\n  VALUE = {{ $value }}"
-      - alert: PhysicalHostUnusualDiskWriteRate
-        expr: (sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename) node_uname_info{type="physical_machine", nodename=~".+"}
-        for: 15m
+          summary: Hypervisor host unusual disk read rate (instance {{ $labels.instance }})
+          description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}"
+      - alert: HypervisorHostUnusualDiskWriteRate
+        expr: (sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename="chaosknoten"}
+        for: 60m
        labels:
          severity: warning
        annotations:
-          summary: Physical host unusual disk write rate (instance {{ $labels.instance }})
-          description: "Disk is probably writing too much data (> 100 MB/s)\n  VALUE = {{ $value }}"
+          summary: Hypervisor host unusual disk write rate (instance {{ $labels.instance }})
+          description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}"
      # Please add ignored mountpoints in node_exporter parameters like
      # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
      # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
@@ -156,6 +242,55 @@ groups:
        annotations:
          summary: Host unusual disk write latency (instance {{ $labels.instance }})
          description: "Disk latency is growing (write operations > 100ms)\n  VALUE = {{ $value }}"
+      # General unusual disk io alerts.
+      # Excluding: hypervisor hosts
+      - alert: HostUnusualDiskIo
+        expr: (rate(node_disk_io_time_seconds_total[1m]) > 0.5) * on(instance) group_left (nodename) node_uname_info{nodename=~".+", nodename!="chaosknoten"}
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: Host unusual disk IO (instance {{ $labels.instance }})
+          description: "Time spent in IO is too high on {{ $labels.instance }}. Check storage for issues.\n  VALUE = {{ $value }}"
+      # Hypervisor host unusual hard disk io alerts.
+      # Since hard disks on the hypervisor can easily have their IO saturated by hypervisor tasks (backups, moving VMs, etc.), alert when the IO is above the regular threshold for a very long time.
+      - alert: HypervisorHostUnusualHardDiskIo
+        expr: (rate(node_disk_io_time_seconds_total{device=~"s.+"}[1m]) > 0.5) * on(instance) group_left (nodename) node_uname_info{nodename="chaosknoten"}
+        for: 50m
+        labels:
+          severity: warning
+        annotations:
+          summary: Hypervisor host unusual hard disk IO (instance {{ $labels.instance }})
+          description: "Time spent in IO is too high on {{ $labels.instance }}. Check storage for issues.\n  VALUE = {{ $value }}"
+      # Hypervisor host unusual other (non-hard) disk io alerts.
+      # This is the same as the regular unusual disk io alerts.
+      - alert: HypervisorHostUnusualOtherDiskIo
+        expr: (rate(node_disk_io_time_seconds_total{device!~"s.+"}[1m]) > 0.5) * on(instance) group_left (nodename) node_uname_info{nodename="chaosknoten"}
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: Hypervisor host unusual other (non-hard) disk IO (instance {{ $labels.instance }})
+          description: "Time spent in IO is too high on {{ $labels.instance }}. Check storage for issues.\n  VALUE = {{ $value }}"
+      - alert: HostRaidArrayGotInactive
+        expr: (node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
+        for: 0m
+        labels:
+          severity: critical
+        annotations:
+          summary: Host RAID array got inactive (instance {{ $labels.instance }})
+          description: "RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.\n  VALUE = {{ $value }}"
+      - alert: HostRaidDiskFailure
+        expr: (node_md_disks{state="failed"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
+        for: 2m
+        labels:
+          severity: warning
+        annotations:
+          summary: Host RAID disk failure (instance {{ $labels.instance }})
+          description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}"
+
+  - name: node-exporter-cpu
+    rules:
      - alert: HostHighCpuLoad
        expr: (sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) > 0.8) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
        for: 10m
@@ -190,31 +325,6 @@ groups:
        annotations:
          summary: Host CPU high iowait (instance {{ $labels.instance }})
          description: "CPU iowait > 10%. A high iowait means that you are disk or network bound.\n  VALUE = {{ $value }}"
-      # Have different disk IO alerts for VMs and physical machines and for physical machines different ones for hard and other disks.
-      - alert: PhysicalHostUnusualHardDiskIo
-        expr: (rate(node_disk_io_time_seconds_total{device=~"s.+"}[1m]) > 0.75) * on(instance) group_left (nodename) node_uname_info{type="physical_machine", nodename=~".+"}
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: Physical host unusual hard disk IO (instance {{ $labels.instance }})
-          description: "Time spent in IO is too high on {{ $labels.instance }}. Check storage for issues.\n  VALUE = {{ $value }}"
-      - alert: PhysicalHostUnusualOtherDiskIo
-        expr: (rate(node_disk_io_time_seconds_total{device!~"s.+"}[1m]) > 0.5) * on(instance) group_left (nodename) node_uname_info{type="physical_machine", nodename=~".+"}
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: Physical host unusual other (non-hard) disk IO (instance {{ $labels.instance }})
-          description: "Time spent in IO is too high on {{ $labels.instance }}. Check storage for issues.\n  VALUE = {{ $value }}"
-      - alert: VirtualHostUnusualDiskIo
-        expr: (rate(node_disk_io_time_seconds_total[1m]) > 0.5) * on(instance) group_left (nodename) node_uname_info{type="virtual_machine", nodename=~".+"}
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: Virtual host unusual disk IO (instance {{ $labels.instance }})
-          description: "Time spent in IO is too high on {{ $labels.instance }}. Check storage for issues.\n  VALUE = {{ $value }}"
      # # x2 context switches is an arbitrary number.
      # # The alert threshold depends on the nature of the application.
      # # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
@@ -226,14 +336,28 @@ groups:
      #   annotations:
      #     summary: Host context switching high (instance {{ $labels.instance }})
      #     description: "Context switching is growing on the node (twice the daily average during the last 15m)\n  VALUE = {{ $value }}"
-      - alert: HostSwapIsFillingUp
-        expr: ((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 2m
+
+  - name: node-exporter-physical
+    rules:
+      - alert: HostNodeOvertemperatureAlarm
+        expr: ((node_hwmon_temp_crit_alarm_celsius == 1) or (node_hwmon_temp_alarm == 1)) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
+        for: 0m
+        labels:
+          severity: critical
+        annotations:
+          summary: Host node overtemperature alarm (instance {{ $labels.instance }})
+          description: "Physical node temperature alarm triggered\n  VALUE = {{ $value }}"
+      - alert: HostKernelVersionDeviations
+        expr: (count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
+        for: 6h
        labels:
          severity: warning
        annotations:
-          summary: Host swap is filling up (instance {{ $labels.instance }})
-          description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}"
+          summary: Host kernel version deviations (instance {{ $labels.instance }})
+          description: "Different kernel versions are running\n  VALUE = {{ $value }}"
+
+  - name: node-exporter-misc
+    rules:
      - alert: HostSystemdServiceCrashed
        expr: (node_systemd_unit_state{state="failed"} == 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
        for: 0m
@@ -250,94 +374,6 @@ groups:
        annotations:
          summary: Host physical component too hot (instance {{ $labels.instance }})
          description: "Physical hardware component too hot\n  VALUE = {{ $value }}"
-      - alert: HostNodeOvertemperatureAlarm
-        expr: ((node_hwmon_temp_crit_alarm_celsius == 1) or (node_hwmon_temp_alarm == 1)) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 0m
-        labels:
-          severity: critical
-        annotations:
-          summary: Host node overtemperature alarm (instance {{ $labels.instance }})
-          description: "Physical node temperature alarm triggered\n  VALUE = {{ $value }}"
-      - alert: HostRaidArrayGotInactive
-        expr: (node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 0m
-        labels:
-          severity: critical
-        annotations:
-          summary: Host RAID array got inactive (instance {{ $labels.instance }})
-          description: "RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.\n  VALUE = {{ $value }}"
-      - alert: HostRaidDiskFailure
-        expr: (node_md_disks{state="failed"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 2m
-        labels:
-          severity: warning
-        annotations:
-          summary: Host RAID disk failure (instance {{ $labels.instance }})
-          description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}"
-      - alert: HostKernelVersionDeviations
-        expr: (count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 6h
-        labels:
-          severity: warning
-        annotations:
-          summary: Host kernel version deviations (instance {{ $labels.instance }})
-          description: "Different kernel versions are running\n  VALUE = {{ $value }}"
-      - alert: HostOomKillDetected
-        expr: (increase(node_vmstat_oom_kill[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 0m
-        labels:
-          severity: warning
-        annotations:
-          summary: Host OOM kill detected (instance {{ $labels.instance }})
-          description: "OOM kill detected\n  VALUE = {{ $value }}"
-      - alert: HostEdacCorrectableErrorsDetected
-        expr: (increase(node_edac_correctable_errors_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 0m
-        labels:
-          severity: info
-        annotations:
-          summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
-          description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}"
-      - alert: HostEdacUncorrectableErrorsDetected
-        expr: (node_edac_uncorrectable_errors_total > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 0m
-        labels:
-          severity: warning
-        annotations:
-          summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
-          description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}"
-      - alert: HostNetworkReceiveErrors
-        expr: (rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 2m
-        labels:
-          severity: warning
-        annotations:
-          summary: Host Network Receive Errors (instance {{ $labels.instance }})
-          description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n  VALUE = {{ $value }}"
-      - alert: HostNetworkTransmitErrors
-        expr: (rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 2m
-        labels:
-          severity: warning
-        annotations:
-          summary: Host Network Transmit Errors (instance {{ $labels.instance }})
-          description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes.\n  VALUE = {{ $value }}"
-      - alert: HostNetworkBondDegraded
-        expr: ((node_bonding_active - node_bonding_slaves) != 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 2m
-        labels:
-          severity: warning
-        annotations:
-          summary: Host Network Bond Degraded (instance {{ $labels.instance }})
-          description: "Bond \"{{ $labels.device }}\" degraded on \"{{ $labels.instance }}\".\n  VALUE = {{ $value }}"
-      - alert: HostConntrackLimit
-        expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
-        for: 5m
-        labels:
-          severity: warning
-        annotations:
-          summary: Host conntrack limit (instance {{ $labels.instance }})
-          description: "The number of conntrack is approaching limit\n  VALUE = {{ $value }}"
      - alert: HostClockSkew
        expr: ((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
        for: 10m
@@ -362,6 +398,7 @@ groups:
        annotations:
          summary: Host requires reboot (instance {{ $labels.instance }})
          description: "{{ $labels.instance }} requires a reboot.\n  VALUE = {{ $value }}"
+
  - name: prometheus
    rules:
      - alert: PrometheusJobMissing
Author	SHA1	Message	Date
June	40cddb67b4	grafana: account for long backup jobs in Prom. hyperv. disk rw rate al.	2025-02-06 19:17:21 +01:00
June	c4e35c1adf	grafana: pull out prom. net. rec. err. alerts for OPNs. to ex. wg int. Pull out the Prometheus network receive error alerts for OPNsense to exclude its WireGuard interfaces, which tend to throw errors that are of no importance.	2025-02-06 01:34:45 +01:00
June	ee66631c2d	grafana: diff. prometheus disk io alerts by host task and disk type. Differentiate by host task (hypervisor or not) and disk type (hard disk or not), not by whether the host is physical or virtual and then by disk type. This is in line with the disk rate alert changes and allows for fine-grained adjustments based on the host task type, which is what actually matters for these alerts.	2025-02-06 01:13:10 +01:00
June	9e77a41e3c	grafana: differentiate prometheus disk rate alerts by host task type. Not by a mix of host task type (CI server or not) and whether the host is virtual or physical. Also differentiate only on the duration, not the rate, so as not to accidentally exclude slow hard disks.	2025-02-06 01:05:05 +01:00
June	5016407cef	grafana: group prometheus alert rules for better organization	2025-02-06 00:12:50 +01:00
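
The commits above split the alerts by `nodename`. PromQL combines all matchers in one selector with AND, so two equality matchers on the same label can never both hold; selecting a set of hosts takes a single regex matcher, and excluding that set takes its negation. A minimal sketch, assuming the host names from the diff; the group and alert names are hypothetical and the thresholds are only copied for illustration:

```yaml
groups:
  - name: example-disk-rate-by-host-task  # hypothetical group name
    rules:
      # Selects either CI host. Prometheus anchors the regex at both ends,
      # so this matches exactly these two nodenames.
      - alert: ExampleCIHostUnusualDiskReadRate  # hypothetical alert name
        expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~"forgejo-actions-runner|woodpecker"}
        for: 10m
        labels:
          severity: warning
      # Complement: every named host except the CI hosts and the hypervisor.
      - alert: ExampleHostUnusualDiskReadRate  # hypothetical alert name
        expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~".+", nodename!~"forgejo-actions-runner|woodpecker|chaosknoten"}
        for: 5m
        labels:
          severity: warning
```

Keeping the include list and the exclude list as the same alternation makes it easier to see that every host falls into exactly one rule.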