User Alerts¶
This document describes how to configure and manage alerts for the ayedo SDP (ayedo Software Delivery Platform). As a Platform Administrator, you configure alerting for the platform components here. Application developers can configure their own alerts for their applications (see User Guide: Alerts).
Overview¶
The ayedo SDP uses the following alert stack:
VictoriaMetrics (Metrics)
↓
VictoriaMetrics Alert (Alert Rules)
↓
Grafana Alerting (Routing & Notification)
↓
Notification Channels (Slack, E-Mail, PagerDuty, etc.)
Components¶
| Component | Purpose | Namespace |
|---|---|---|
| VictoriaMetrics | Metrics collection | monitoring |
| VictoriaMetrics Alert (vmalert) | Alert rule evaluation | monitoring |
| Grafana Alerting | Alert routing and notification | monitoring |
| Grafana | Visualization and alert dashboards | monitoring |
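To verify that these components are running, listing the pods in the monitoring namespace is usually enough; the label selectors below are assumptions and may differ per installation:
# List the monitoring stack pods
kubectl get pods -n monitoring

# Check individual components (label values are assumptions)
kubectl get pods -n monitoring -l app.kubernetes.io/name=vmalert
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana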
Alert Configuration¶
Defining Alert Rules¶
Alert rules are defined as VMRule CRDs of the VictoriaMetrics operator:
# platform-alerts.yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: platform-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-nodes
      interval: 30s
      rules:
        - alert: KubernetesNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
            component: kubernetes
          annotations:
            summary: "Node {{ $labels.node }} is not ready"
            description: "Node {{ $labels.node }} has been NotReady for more than 5 minutes."
            runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#node-notready"
        - alert: KubernetesNodeHighCPU
          expr: (100 - (avg by (node) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
          for: 10m
          labels:
            severity: warning
            component: kubernetes
          annotations:
            summary: "Node {{ $labels.node }} has high CPU usage"
            description: "Node {{ $labels.node }} CPU usage is {{ $value | humanize }}%."
        - alert: KubernetesNodeHighMemory
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
          for: 10m
          labels:
            severity: warning
            component: kubernetes
          annotations:
            summary: "Node {{ $labels.node }} has high memory usage"
            description: "Node {{ $labels.node }} memory usage is {{ $value | humanize }}%."
        - alert: KubernetesPodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
            component: kubernetes
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes."
            runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#pod-crashloopbackoff"
Apply:
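For example, with the VMRule above saved as platform-alerts.yaml:
kubectl apply -f platform-alerts.yaml

# Confirm the VMRule was accepted (resource name per the VictoriaMetrics operator)
kubectl get vmrules -n monitoring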
Alert Labels¶
Labels are essential for alert routing:
| Label | Purpose | Values |
|---|---|---|
| severity | Alert severity | critical, warning, info |
| component | Affected component | kubernetes, opensearch, harbor, etc. |
| team | Responsible team | platform, backend, frontend, etc. |
| environment | Environment | production, staging, development |
Alert Annotations¶
Annotations provide context for alerts:
| Annotation | Purpose |
|---|---|
| summary | Short description (one line) |
| description | Detailed description with variables |
| runbook_url | Link to the runbook |
| dashboard_url | Link to the Grafana dashboard |
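Putting labels and annotations together, a rule could look like the following sketch; the alert name, job selector, team/environment values, and dashboard URL are placeholders:
- alert: HarborRegistryDown
  expr: up{job="harbor-exporter"} == 0   # job name is an assumption
  for: 5m
  labels:
    severity: critical
    component: harbor
    team: platform
    environment: production
  annotations:
    summary: "Harbor registry is not reachable"
    description: "No metrics from Harbor ({{ $labels.instance }}) for 5 minutes."
    runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/"
    dashboard_url: "https://grafana.example.com/d/harbor"   # placeholder dashboard path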
Grafana Alerting Configuration¶
Grafana Alerting Config¶
# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

    route:
      receiver: 'default'
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # Critical alerts → PagerDuty + Slack
        - match:
            severity: critical
          receiver: pagerduty-critical
          continue: true
        - match:
            severity: critical
          receiver: slack-critical
        # Warning alerts → Slack
        - match:
            severity: warning
          receiver: slack-warnings
        # Info alerts → Slack (tags only)
        - match:
            severity: info
          receiver: slack-info

    receivers:
      - name: 'default'
        slack_configs:
          - channel: '#alerts-default'
            title: 'Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'pagerduty-critical'
        pagerduty_configs:
          - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
            description: '{{ .GroupLabels.alertname }}'
            details:
              firing: '{{ .Alerts.Firing | len }}'
              resolved: '{{ .Alerts.Resolved | len }}'
      - name: 'slack-critical'
        slack_configs:
          - channel: '#alerts-critical'
            title: ':fire: Critical Alert'
            text: |
              *Alert:* {{ .GroupLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              {{ range .Alerts }}
              *Summary:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              {{ if .Annotations.runbook_url }}*Runbook:* <{{ .Annotations.runbook_url }}|View>{{ end }}
              {{ end }}
            send_resolved: true
      - name: 'slack-warnings'
        slack_configs:
          - channel: '#alerts-warnings'
            title: ':warning: Warning'
            text: |
              *Alert:* {{ .GroupLabels.alertname }}
              {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}
            send_resolved: true
      - name: 'slack-info'
        slack_configs:
          - channel: '#alerts-info'
            title: ':information_source: Info'
            text: '{{ .GroupLabels.alertname }}'

    inhibit_rules:
      # Suppress warnings while a matching critical alert is firing
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'cluster', 'service']
      # Suppress other node alerts while the node is NotReady
      - source_match:
          alertname: 'KubernetesNodeNotReady'
        target_match_re:
          alertname: 'Kubernetes.*'
        equal: ['node']
Apply:
kubectl apply -f alertmanager-config.yaml
kubectl rollout restart deployment -n monitoring alertmanager
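To check that the new configuration was picked up (assuming the standard Alertmanager startup log message):
# Pods should come back Ready after the rollout
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager

# The configuration reload is logged on startup
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50 | grep -i "loading configuration"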
Notification Channels¶
Slack¶
Setup:
- Create a Slack incoming webhook: https://api.slack.com/messaging/webhooks
- Add the webhook URL to the Grafana Alerting config
Example:
slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
    channel: '#alerts'
    title: '{{ .GroupLabels.alertname }}'
    text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    send_resolved: true
E-Mail¶
Setup:
email_configs:
  - to: 'alerts@example.com'
    from: 'alertmanager@example.com'
    smarthost: 'smtp.example.com:587'
    auth_username: 'alertmanager@example.com'
    auth_password: '<smtp-password>'
    headers:
      Subject: 'Alert: {{ .GroupLabels.alertname }}'
    html: |
      <h2>Alert: {{ .GroupLabels.alertname }}</h2>
      <ul>
      {{ range .Alerts }}
      <li>{{ .Annotations.summary }}</li>
      {{ end }}
      </ul>
PagerDuty¶
Setup:
- Create a PagerDuty integration key: https://support.pagerduty.com/docs/services-and-integrations
- Add the integration key to the Grafana Alerting config
Example:
pagerduty_configs:
  - service_key: '<pagerduty-integration-key>'
    description: '{{ .GroupLabels.alertname }}'
    severity: '{{ .CommonLabels.severity }}'
    details:
      firing: '{{ .Alerts.Firing | len }}'
      resolved: '{{ .Alerts.Resolved | len }}'
    links:
      - href: '{{ .CommonAnnotations.runbook_url }}'
        text: 'Runbook'
Webhook (Generic)¶
Example:
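A minimal sketch of a generic webhook receiver in the same Alertmanager-style configuration; the endpoint URL is a placeholder:
webhook_configs:
  - url: 'https://webhook.example.com/alerts'   # placeholder endpoint
    send_resolved: true
The receiver sends the grouped alerts as a JSON payload via HTTP POST to the configured URL.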
Silences¶
Silences temporarily suppress alerts (e.g. during maintenance).
Creating a Silence (via CLI)¶
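amtool needs to reach the Alertmanager API; if it is not exposed externally, a port-forward such as the following makes it available on localhost:9093 (the service name is an assumption):
kubectl port-forward -n monitoring svc/alertmanager 9093:9093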
# Silence by alert name
amtool --alertmanager.url=http://localhost:9093 silence add \
alertname=KubernetesNodeNotReady \
--duration=2h \
--author="ops-team" \
--comment="Planned node maintenance"
# Silence by labels
amtool silence add \
severity=warning \
component=kubernetes \
--duration=1h \
--author="ops-team" \
--comment="Kubernetes upgrade"
Creating a Silence (via UI)¶
- Open the Grafana Alerting UI: https://grafana.example.com/alerting
- "Silences" → "New Silence"
- Add a matcher (e.g. alertname="KubernetesNodeNotReady")
- Set a duration (e.g. 2h)
- Add a comment
- "Create"
Listing Silences¶
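Via amtool, against the same Alertmanager URL as above:
amtool --alertmanager.url=http://localhost:9093 silence query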
Deleting a Silence¶
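Silences are removed by expiring them; the ID comes from the silence query output above:
amtool --alertmanager.url=http://localhost:9093 silence expire <silence-id>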
Alert Testing¶
Triggering an Alert Manually¶
# Send a test alert to Grafana Alerting
curl -X POST http://grafana.example.com/alerting/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning"
      },
      "annotations": {
        "summary": "This is a test alert"
      },
      "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
    }
  ]'
Testing an Alert Rule¶
# Check the vmalert logs
kubectl logs -n monitoring -l app.kubernetes.io/name=vmalert --tail=50
# Check the alert rule status
kubectl port-forward -n monitoring svc/vmalert 8880:8880
curl http://localhost:8880/api/v1/rules | jq
Testing a Notification Channel¶
# Check the Grafana Alerting logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50
# Check alerts in the Grafana Alerting UI
# → https://grafana.example.com/alerting/#/alerts
Best Practices¶
1. Use Alert Severities Correctly¶
| Severity | When to use | Action |
|---|---|---|
| critical | Production outage, risk of data loss | Immediate response (PagerDuty) |
| warning | Potential problems, capacity limits | Response within 24h |
| info | Informational events | No action required |
2. Alerts Must Be Actionable¶
Bad:
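For illustration, a hypothetical rule with no for duration, no description, and no runbook, which leaves the recipient guessing what to do:
alert: NodeHighLoad
expr: node_load1 > 10
annotations:
  summary: "High load"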
Good:
alert: NodeHighLoad
expr: node_load1 / on(instance) count(node_cpu_seconds_total{mode="idle"}) by (instance) > 0.8
annotations:
  description: "Node {{ $labels.instance }} load is {{ $value }} (threshold: 0.8)"
  runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#node-high-load"
3. Avoid Alert Fatigue¶
- Use for: to ignore transient spikes
- Use inhibit rules to suppress redundant alerts
- Use grouping to bundle similar alerts (see the routing sketch below)
Example:
alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m  # Waits 5 minutes before the alert fires
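Grouping and inhibit rules are configured in the Alertmanager-style config shown above; a minimal grouping sketch (the values are examples):
route:
  group_by: ['alertname', 'namespace']   # bundle similar alerts into one notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h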
4. Link Runbooks¶
Every alert should have a runbook:
annotations:
  runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#pod-crashloopbackoff"
5. Document Alerts¶
annotations:
  summary: "Short description (one line)"
  description: "Detailed description with variables: {{ $labels.pod }}"
  impact: "Users cannot use service X"
  action: "1. Check the logs 2. Restart the pod 3. Escalate to team Y"
Monitoring the Alert System¶
Grafana Alerting healthy?¶
# Grafana Alerting pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager
# Grafana Alerting HTTP check
curl http://grafana.example.com/alerting/-/healthy
curl http://grafana.example.com/alerting/-/ready
# List current alerts
curl http://grafana.example.com/alerting/api/v1/alerts | jq
vmalert healthy?¶
# vmalert pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=vmalert
# vmalert HTTP check
kubectl port-forward -n monitoring svc/vmalert 8880:8880
curl http://localhost:8880/-/healthy
curl http://localhost:8880/api/v1/rules | jq
Notification Delivery¶
# Search the Grafana Alerting logs for delivery errors
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager | grep -i error
# Grafana Alerting status
curl http://grafana.example.com/alerting/api/v1/status | jq
Compliance Mapping¶
The alert mechanisms described here address the following compliance requirements:
| Standard | Control | Coverage |
|---|---|---|
| ISO 27001 | Annex A 5.24 | ✅ Incident Management Planning |
| ISO 27001 | Annex A 5.25 | ✅ Assessment and Decision |
| BSI IT-Grundschutz | DER.1 | ✅ Detection of events |
| NIS2 | Requirement (b) | ✅ Incident Handling |
Further Documentation¶
- Runbooks - Alert response procedures
- Troubleshooting - Error diagnosis and resolution
- Capacity Planning - Capacity alerts
- Kubernetes Features - VictoriaMetrics
Support¶
For questions about User Alerts:
- E-Mail: support@ayedo.de
- Website: ayedo.de
- Discord: ayedo Discord