
User Alerts

This document describes how to configure and manage alerts on the ayedo SDP (ayedo Software Delivery Platform). As a Platform Administrator, you configure alerting for platform components here. Application developers can configure their own alerts for their applications (see User Guide: Alerts).

Overview

The ayedo SDP uses the following alert stack:

VictoriaMetrics (metrics)
VictoriaMetrics Alert (alert rules)
Grafana Alerting (routing & notification)
Notification Channels (Slack, e-mail, PagerDuty, etc.)

Components

Component | Purpose | Namespace
VictoriaMetrics | Metrics collection | monitoring
VictoriaMetrics Alert (vmalert) | Alert rule evaluation | monitoring
Grafana Alerting | Alert routing and notification | monitoring
Grafana | Visualization and alert dashboards | monitoring
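
To verify that these components are running, a quick look at the monitoring namespace is usually sufficient. The label selector below matches the one used later in this document and may differ in your installation:

# all monitoring components
kubectl get pods -n monitoring

# vmalert specifically
kubectl get pods -n monitoring -l app.kubernetes.io/name=vmalert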

Alert Configuration

Defining Alert Rules

Alert rules are defined as VMRule CRDs (VictoriaMetrics Operator):

# platform-alerts.yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: platform-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-nodes
      interval: 30s
      rules:
        - alert: KubernetesNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
            component: kubernetes
          annotations:
            summary: "Node {{ $labels.node }} is not ready"
            description: "Node {{ $labels.node }} has been NotReady for more than 5 minutes."
            runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#node-notready"

        - alert: KubernetesNodeHighCPU
          expr: (100 - (avg by (node) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
          for: 10m
          labels:
            severity: warning
            component: kubernetes
          annotations:
            summary: "Node {{ $labels.node }} has high CPU usage"
            description: "Node {{ $labels.node }} CPU usage is {{ $value | humanize }}%."

        - alert: KubernetesNodeHighMemory
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
          for: 10m
          labels:
            severity: warning
            component: kubernetes
          annotations:
            summary: "Node {{ $labels.node }} has high memory usage"
            description: "Node {{ $labels.node }} memory usage is {{ $value | humanize }}%."

        - alert: KubernetesPodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
            component: kubernetes
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes."
            runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#pod-crashloopbackoff"

Apply:

kubectl apply -f platform-alerts.yaml
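
After applying, you can check that the rule was accepted by the VictoriaMetrics operator. The commands below are a sketch; kubectl get/describe works for any CRD, and the resource name follows the VMRule kind used above:

# list VMRule objects
kubectl get vmrules -n monitoring

# inspect the rule for errors reported by the operator
kubectl describe vmrule -n monitoring platform-alerts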

Alert Labels

Labels are essential for alert routing:

Label | Purpose | Values
severity | Alert severity | critical, warning, info
component | Affected component | kubernetes, opensearch, harbor, etc.
team | Responsible team | platform, backend, frontend, etc.
environment | Environment | production, staging, development
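
A rule that carries all routing-relevant labels might look like this (the values are illustrative):

labels:
  severity: warning
  component: opensearch
  team: platform
  environment: production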

Alert Annotations

Annotations provide context for alerts:

Annotation | Purpose
summary | Short description (one line)
description | Detailed description with variables
runbook_url | Link to the runbook
dashboard_url | Link to the Grafana dashboard
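
A complete annotations block could look like this; the runbook anchor and dashboard URL are placeholders:

annotations:
  summary: "OpenSearch cluster status is yellow"
  description: "Cluster {{ $labels.cluster }} has been yellow for more than 10 minutes."
  runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#opensearch-yellow"
  dashboard_url: "https://grafana.example.com/d/opensearch-overview"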

Grafana Alerting Configuration

Alertmanager Config

# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

    route:
      receiver: 'default'
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # Critical Alerts → PagerDuty + Slack
        - match:
            severity: critical
          receiver: pagerduty-critical
          continue: true
        - match:
            severity: critical
          receiver: slack-critical

        # Warning Alerts → Slack
        - match:
            severity: warning
          receiver: slack-warnings

        # Info Alerts → Slack (informational only)
        - match:
            severity: info
          receiver: slack-info

    receivers:
      - name: 'default'
        slack_configs:
          - channel: '#alerts-default'
            title: 'Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

      - name: 'pagerduty-critical'
        pagerduty_configs:
          - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
            description: '{{ .GroupLabels.alertname }}'
            details:
              firing: '{{ .Alerts.Firing | len }}'
              resolved: '{{ .Alerts.Resolved | len }}'

      - name: 'slack-critical'
        slack_configs:
          - channel: '#alerts-critical'
            title: ':fire: Critical Alert'
            text: |
              *Alert:* {{ .GroupLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              {{ range .Alerts }}
              *Summary:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              {{ if .Annotations.runbook_url }}*Runbook:* <{{ .Annotations.runbook_url }}|View>{{ end }}
              {{ end }}
            send_resolved: true

      - name: 'slack-warnings'
        slack_configs:
          - channel: '#alerts-warnings'
            title: ':warning: Warning'
            text: |
              *Alert:* {{ .GroupLabels.alertname }}
              {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}
            send_resolved: true

      - name: 'slack-info'
        slack_configs:
          - channel: '#alerts-info'
            title: ':information_source: Info'
            text: '{{ .GroupLabels.alertname }}'

    inhibit_rules:
      # Suppress warnings while a critical alert is firing
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'cluster', 'service']

      # Suppress node alerts while the node is NotReady
      - source_match:
          alertname: 'KubernetesNodeNotReady'
        target_match_re:
          alertname: 'Kubernetes.*'
        equal: ['node']

Apply:

kubectl apply -f alertmanager-config.yaml
kubectl rollout restart deployment -n monitoring alertmanager
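
The configuration can also be validated with amtool before it is applied. This is a sketch and assumes amtool is installed locally and the alertmanager.yaml payload is available as a plain file:

# check syntax and the routing tree
amtool check-config alertmanager.yaml

# show which receiver a given label set would be routed to
amtool config routes test --config.file=alertmanager.yaml severity=critical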

Notification Channels

Slack

Setup:

  1. Create a Slack incoming webhook: https://api.slack.com/messaging/webhooks
  2. Add the webhook URL to the Grafana Alerting config

Example:

slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
    channel: '#alerts'
    title: '{{ .GroupLabels.alertname }}'
    text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    send_resolved: true
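
The webhook itself can be tested independently of the alerting stack with a plain curl call (replace the URL with your own webhook):

curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Test notification from the alerting setup"}' \
  'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'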

E-Mail

Setup:

email_configs:
  - to: 'alerts@example.com'
    from: 'alertmanager@example.com'
    smarthost: 'smtp.example.com:587'
    auth_username: 'alertmanager@example.com'
    auth_password: '<smtp-password>'
    headers:
      Subject: 'Alert: {{ .GroupLabels.alertname }}'
    html: |
      <h2>Alert: {{ .GroupLabels.alertname }}</h2>
      <ul>
      {{ range .Alerts }}
        <li>{{ .Annotations.summary }}</li>
      {{ end }}
      </ul>
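
To restrict e-mail to critical alerts only, an additional route can point at the e-mail receiver; the receiver name below is an example:

route:
  routes:
    - match:
        severity: critical
      receiver: email-oncall
      continue: true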

PagerDuty

Setup:

  1. Create a PagerDuty integration key: https://support.pagerduty.com/docs/services-and-integrations
  2. Add the integration key to the Grafana Alerting config

Example:

pagerduty_configs:
  - service_key: '<pagerduty-integration-key>'
    description: '{{ .GroupLabels.alertname }}'
    severity: '{{ .CommonLabels.severity }}'
    details:
      firing: '{{ .Alerts.Firing | len }}'
      resolved: '{{ .Alerts.Resolved | len }}'
    links:
      - href: '{{ .CommonAnnotations.runbook_url }}'
        text: 'Runbook'
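
For PagerDuty Events API v2 integrations, routing_key is used instead of service_key (sketch):

pagerduty_configs:
  - routing_key: '<pagerduty-events-v2-routing-key>'
    description: '{{ .GroupLabels.alertname }}'
    severity: '{{ .CommonLabels.severity }}'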

Webhook (Generic)

Example:

webhook_configs:
  - url: 'https://monitoring.example.com/webhook'
    send_resolved: true
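
The receiving endpoint can be exercised without a firing alert by posting a trimmed-down version of the Alertmanager webhook payload; the real payload contains additional fields such as groupKey and externalURL:

curl -X POST https://monitoring.example.com/webhook \
  -H 'Content-Type: application/json' \
  -d '{
    "version": "4",
    "status": "firing",
    "receiver": "webhook",
    "groupLabels": {"alertname": "TestAlert"},
    "commonLabels": {"alertname": "TestAlert", "severity": "warning"},
    "commonAnnotations": {"summary": "This is a test alert"},
    "alerts": [
      {
        "status": "firing",
        "labels": {"alertname": "TestAlert", "severity": "warning"},
        "annotations": {"summary": "This is a test alert"},
        "startsAt": "2024-01-01T00:00:00Z"
      }
    ]
  }'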

Silences

Silences suppress alerts temporarily (e.g. during maintenance).

Creating a Silence (via CLI)

# Silence by alert name
amtool --alertmanager.url=http://localhost:9093 silence add \
  alertname=KubernetesNodeNotReady \
  --duration=2h \
  --author="ops-team" \
  --comment="Planned node maintenance"

# Silence by labels
amtool silence add \
  severity=warning \
  component=kubernetes \
  --duration=1h \
  --author="ops-team" \
  --comment="Kubernetes upgrade"

Creating a Silence (via UI)

  1. Open the Grafana Alerting UI: https://grafana.example.com/alerting
  2. "Silences" → "New Silence"
  3. Add a matcher (e.g. alertname="KubernetesNodeNotReady")
  4. Set the duration (e.g. 2h)
  5. Add a comment
  6. "Create"

Listing Silences

amtool silence query

Deleting a Silence

amtool silence expire <silence-id>
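
Several silences can be expired at once by combining query and expire; the -q flag prints only silence IDs:

# expire all silences that match a given label
amtool silence expire $(amtool silence query -q component=kubernetes)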

Alert Testing

Triggering an Alert Manually

# Send a test alert to Grafana Alerting
curl -X POST http://grafana.example.com/alerting/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning"
      },
      "annotations": {
        "summary": "This is a test alert"
      },
      "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
    }
  ]'

Testing an Alert Rule

# Check the vmalert logs
kubectl logs -n monitoring -l app.kubernetes.io/name=vmalert --tail=50

# Check the alert rule status
kubectl port-forward -n monitoring svc/vmalert 8880:8880
curl http://localhost:8880/api/v1/rules | jq

Testing a Notification Channel

# Check the Grafana Alerting logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50

# Check alerts in the Grafana Alerting UI
# → https://grafana.example.com/alerting/#/alerts

Best Practices

1. Use Alert Severities Correctly

Severity | When to use | Action
critical | Production outage, risk of data loss | Immediate response (PagerDuty)
warning | Potential problems, capacity limits | Response within 24h
info | Informational events | No action required

2. Alerts Must Be Actionable

Bad:

alert: HighLoad
expr: node_load1 > 5

Good:

alert: NodeHighLoad
expr: node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"}) > 0.8
annotations:
  description: "Node {{ $labels.instance }} load is {{ $value }} (threshold: 0.8)"
  runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#node-high-load"

3. Avoid Alert Fatigue

  • Use for: to ignore transient spikes
  • Use inhibit rules to suppress redundant alerts
  • Use grouping to bundle similar alerts

Example:

alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m  # waits 5 minutes before the alert fires

4. Link Runbooks

Every alert should have a runbook:

annotations:
  runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#pod-crashloopbackoff"

5. Document Alerts

annotations:
  summary: "Kurze Beschreibung (1 Zeile)"
  description: "Detaillierte Beschreibung mit Variablen: {{ $labels.pod }}"
  impact: "User können Service X nicht nutzen"
  action: "1. Prüfe Logs 2. Restart Pod 3. Eskaliere an Team Y"

Monitoring the Alert System

Grafana Alerting healthy?

# Grafana Alerting Pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager

# Grafana Alerting HTTP check
curl http://grafana.example.com/alerting/-/healthy
curl http://grafana.example.com/alerting/-/ready

# List alerts
curl http://grafana.example.com/alerting/api/v1/alerts | jq

vmalert healthy?

# vmalert pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=vmalert

# vmalert HTTP check
kubectl port-forward -n monitoring svc/vmalert 8880:8880
curl http://localhost:8880/-/healthy
curl http://localhost:8880/api/v1/rules | jq

Notification Delivery

# Search the Grafana Alerting logs for delivery errors
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager | grep -i error

# Grafana Alerting status
curl http://grafana.example.com/alerting/api/v1/status | jq
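
Active alerts can also be listed from the CLI to cross-check what the UI shows. The base URL below follows the pattern used elsewhere in this document and may differ in your setup:

# list currently firing alerts
amtool --alertmanager.url=http://grafana.example.com/alerting alert query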

Compliance Mapping

The alerting mechanisms described here address the following compliance requirements:

Standard | Control | Fulfillment
ISO 27001 | Annex A 5.24 | ✅ Incident Management Planning
ISO 27001 | Annex A 5.25 | ✅ Assessment and Decision
BSI IT-Grundschutz | DER.1 | ✅ Detection of Events
NIS2 | Requirement (b) | ✅ Incident Handling

Further Documentation

Support

For questions about User Alerts: