
User Alerts

This document describes how to configure and manage alerts on the ayedo SDP (ayedo Software Delivery Platform). As a Platform Administrator, you configure alerting for platform components here. Application developers can configure their own alerts for their applications (see User Guide: Alerts).

Overview

The ayedo SDP uses the following alert stack:

VictoriaMetrics (metrics)
VictoriaMetrics Alert (alert rules)
Grafana Alerting (routing & notification)
Notification Channels (Slack, e-mail, PagerDuty, etc.)

Components

Component | Purpose | Namespace
VictoriaMetrics | Metrics collection | monitoring
VictoriaMetrics Alert (vmalert) | Alert rule evaluation | monitoring
Grafana Alerting | Alert routing and notification | monitoring
Grafana | Visualization and alert dashboards | monitoring
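
To verify that these components are running, a quick look at the monitoring namespace is usually sufficient. The label selector below matches the one used later in this document and may differ in your installation:

# all monitoring components
kubectl get pods -n monitoring

# vmalert specifically
kubectl get pods -n monitoring -l app.kubernetes.io/name=vmalert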

Alert Configuration

Defining Alert Rules

Alert rules are defined as VMRule CRDs (VictoriaMetrics Operator):

# platform-alerts.yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: platform-alerts
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-nodes
      interval: 30s
      rules:
        - alert: KubernetesNodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
            component: kubernetes
          annotations:
            summary: "Node {{ $labels.node }} is not ready"
            description: "Node {{ $labels.node }} has been NotReady for more than 5 minutes."
            runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#node-notready"

        - alert: KubernetesNodeHighCPU
          expr: (100 - (avg by (node) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
          for: 10m
          labels:
            severity: warning
            component: kubernetes
          annotations:
            summary: "Node {{ $labels.node }} has high CPU usage"
            description: "Node {{ $labels.node }} CPU usage is {{ $value | humanize }}%."

        - alert: KubernetesNodeHighMemory
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
          for: 10m
          labels:
            severity: warning
            component: kubernetes
          annotations:
            summary: "Node {{ $labels.node }} has high memory usage"
            description: "Node {{ $labels.node }} memory usage is {{ $value | humanize }}%."

        - alert: KubernetesPodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: warning
            component: kubernetes
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes."
            runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#pod-crashloopbackoff"

Apply:

kubectl apply -f platform-alerts.yaml
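
After applying, you can check that the rule was accepted by the VictoriaMetrics operator. The commands below are a sketch; kubectl get/describe works for any CRD, and the resource name follows the VMRule kind used above:

# list VMRule objects
kubectl get vmrules -n monitoring

# inspect the rule for errors reported by the operator
kubectl describe vmrule -n monitoring platform-alerts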

Alert Labels

Labels are essential for alert routing:

Label | Purpose | Values
severity | Alert severity | critical, warning, info
component | Affected component | kubernetes, opensearch, harbor, etc.
team | Responsible team | platform, backend, frontend, etc.
environment | Environment | production, staging, development
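
A rule that carries all routing-relevant labels might look like this (the values are illustrative):

labels:
  severity: warning
  component: opensearch
  team: platform
  environment: production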

Alert Annotations

Annotations provide context for alerts:

Annotation | Purpose
summary | Short description (one line)
description | Detailed description with variables
runbook_url | Link to the runbook
dashboard_url | Link to the Grafana dashboard
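
A complete annotations block could look like this; the runbook anchor and dashboard URL are placeholders:

annotations:
  summary: "OpenSearch cluster status is yellow"
  description: "Cluster {{ $labels.cluster }} has been yellow for more than 10 minutes."
  runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#opensearch-yellow"
  dashboard_url: "https://grafana.example.com/d/opensearch-overview"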

Grafana Alerting Configuration

Alertmanager Config

# alertmanager-config.yaml
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-config
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'

    route:
      receiver: 'default'
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # Critical Alerts → PagerDuty + Slack
        - match:
            severity: critical
          receiver: pagerduty-critical
          continue: true
        - match:
            severity: critical
          receiver: slack-critical

        # Warning Alerts → Slack
        - match:
            severity: warning
          receiver: slack-warnings

        # Info Alerts → Slack (informational only)
        - match:
            severity: info
          receiver: slack-info

    receivers:
      - name: 'default'
        slack_configs:
          - channel: '#alerts-default'
            title: 'Alert: {{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

      - name: 'pagerduty-critical'
        pagerduty_configs:
          - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
            description: '{{ .GroupLabels.alertname }}'
            details:
              firing: '{{ .Alerts.Firing | len }}'
              resolved: '{{ .Alerts.Resolved | len }}'

      - name: 'slack-critical'
        slack_configs:
          - channel: '#alerts-critical'
            title: ':fire: Critical Alert'
            text: |
              *Alert:* {{ .GroupLabels.alertname }}
              *Severity:* {{ .CommonLabels.severity }}
              {{ range .Alerts }}
              *Summary:* {{ .Annotations.summary }}
              *Description:* {{ .Annotations.description }}
              {{ if .Annotations.runbook_url }}*Runbook:* <{{ .Annotations.runbook_url }}|View>{{ end }}
              {{ end }}
            send_resolved: true

      - name: 'slack-warnings'
        slack_configs:
          - channel: '#alerts-warnings'
            title: ':warning: Warning'
            text: |
              *Alert:* {{ .GroupLabels.alertname }}
              {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}
            send_resolved: true

      - name: 'slack-info'
        slack_configs:
          - channel: '#alerts-info'
            title: ':information_source: Info'
            text: '{{ .GroupLabels.alertname }}'

    inhibit_rules:
      # Suppress warnings while a critical alert is firing
      - source_match:
          severity: 'critical'
        target_match:
          severity: 'warning'
        equal: ['alertname', 'cluster', 'service']

      # Suppress node alerts while the node is NotReady
      - source_match:
          alertname: 'KubernetesNodeNotReady'
        target_match_re:
          alertname: 'Kubernetes.*'
        equal: ['node']

Apply:

kubectl apply -f alertmanager-config.yaml
kubectl rollout restart deployment -n monitoring alertmanager
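
The configuration can also be validated with amtool before it is applied. This is a sketch and assumes amtool is installed locally and the alertmanager.yaml payload is available as a plain file:

# check syntax and the routing tree
amtool check-config alertmanager.yaml

# show which receiver a given label set would be routed to
amtool config routes test --config.file=alertmanager.yaml severity=critical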

Notification Channels

Slack

Setup:

  1. Create a Slack incoming webhook: https://api.slack.com/messaging/webhooks
  2. Add the webhook URL to the Grafana Alerting config

Example:

slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'
    channel: '#alerts'
    title: '{{ .GroupLabels.alertname }}'
    text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    send_resolved: true
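
The webhook itself can be tested independently of the alerting stack with a plain curl call (replace the URL with your own webhook):

curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Test notification from the alerting setup"}' \
  'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXX'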

E-Mail

Setup:

email_configs:
  - to: 'alerts@example.com'
    from: 'alertmanager@example.com'
    smarthost: 'smtp.example.com:587'
    auth_username: 'alertmanager@example.com'
    auth_password: '<smtp-password>'
    headers:
      Subject: 'Alert: {{ .GroupLabels.alertname }}'
    html: |
      <h2>Alert: {{ .GroupLabels.alertname }}</h2>
      <ul>
      {{ range .Alerts }}
        <li>{{ .Annotations.summary }}</li>
      {{ end }}
      </ul>
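
To restrict e-mail to critical alerts only, an additional route can point at the e-mail receiver; the receiver name below is an example:

route:
  routes:
    - match:
        severity: critical
      receiver: email-oncall
      continue: true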

PagerDuty

Setup:

  1. Create a PagerDuty integration key: https://support.pagerduty.com/docs/services-and-integrations
  2. Add the integration key to the Grafana Alerting config

Example:

pagerduty_configs:
  - service_key: '<pagerduty-integration-key>'
    description: '{{ .GroupLabels.alertname }}'
    severity: '{{ .CommonLabels.severity }}'
    details:
      firing: '{{ .Alerts.Firing | len }}'
      resolved: '{{ .Alerts.Resolved | len }}'
    links:
      - href: '{{ .CommonAnnotations.runbook_url }}'
        text: 'Runbook'
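
For PagerDuty Events API v2 integrations, routing_key is used instead of service_key (sketch):

pagerduty_configs:
  - routing_key: '<pagerduty-events-v2-routing-key>'
    description: '{{ .GroupLabels.alertname }}'
    severity: '{{ .CommonLabels.severity }}'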

Webhook (Generic)

Example:

webhook_configs:
  - url: 'https://monitoring.example.com/webhook'
    send_resolved: true
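
The receiving endpoint can be exercised without a firing alert by posting a trimmed-down version of the Alertmanager webhook payload; the real payload contains additional fields such as groupKey and externalURL:

curl -X POST https://monitoring.example.com/webhook \
  -H 'Content-Type: application/json' \
  -d '{
    "version": "4",
    "status": "firing",
    "receiver": "webhook",
    "groupLabels": {"alertname": "TestAlert"},
    "commonLabels": {"alertname": "TestAlert", "severity": "warning"},
    "commonAnnotations": {"summary": "This is a test alert"},
    "alerts": [
      {
        "status": "firing",
        "labels": {"alertname": "TestAlert", "severity": "warning"},
        "annotations": {"summary": "This is a test alert"},
        "startsAt": "2024-01-01T00:00:00Z"
      }
    ]
  }'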

Silences

Silences suppress alerts temporarily (e.g. during maintenance).

Creating a Silence (via CLI)

# Silence by alert name
amtool --alertmanager.url=http://localhost:9093 silence add \
  alertname=KubernetesNodeNotReady \
  --duration=2h \
  --author="ops-team" \
  --comment="Planned node maintenance"

# Silence by labels
amtool silence add \
  severity=warning \
  component=kubernetes \
  --duration=1h \
  --author="ops-team" \
  --comment="Kubernetes upgrade"

Creating a Silence (via UI)

  1. Open the Grafana Alerting UI: https://grafana.example.com/alerting
  2. "Silences" → "New Silence"
  3. Add a matcher (e.g. alertname="KubernetesNodeNotReady")
  4. Set the duration (e.g. 2h)
  5. Add a comment
  6. "Create"

Listing Silences

amtool silence query

Deleting a Silence

amtool silence expire <silence-id>
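
Several silences can be expired at once by combining query and expire; the -q flag prints only silence IDs:

# expire all silences that match a given label
amtool silence expire $(amtool silence query -q component=kubernetes)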

Alert Testing

Triggering an Alert Manually

# Send a test alert to Grafana Alerting
curl -X POST http://grafana.example.com/alerting/api/v1/alerts \
  -H 'Content-Type: application/json' \
  -d '[
    {
      "labels": {
        "alertname": "TestAlert",
        "severity": "warning"
      },
      "annotations": {
        "summary": "This is a test alert"
      },
      "startsAt": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
    }
  ]'

Testing an Alert Rule

# Check the vmalert logs
kubectl logs -n monitoring -l app.kubernetes.io/name=vmalert --tail=50

# Check the alert rule status
kubectl port-forward -n monitoring svc/vmalert 8880:8880
curl http://localhost:8880/api/v1/rules | jq

Testing a Notification Channel

# Check the Grafana Alerting logs
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager --tail=50

# Check alerts in the Grafana Alerting UI
# → https://grafana.example.com/alerting/#/alerts

Best Practices

1. Use Alert Severities Correctly

Severity | When to use | Action
critical | Production outage, risk of data loss | Immediate response (PagerDuty)
warning | Potential problems, capacity limits | Response within 24h
info | Informational events | No action required

2. Alerts Must Be Actionable

Bad:

alert: HighLoad
expr: node_load1 > 5

Good:

alert: NodeHighLoad
expr: node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"}) > 0.8
annotations:
  description: "Node {{ $labels.instance }} load is {{ $value }} (threshold: 0.8)"
  runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#node-high-load"

3. Avoid Alert Fatigue

  • Use for: to ignore transient spikes
  • Use inhibit rules to suppress redundant alerts
  • Use grouping to bundle similar alerts

Example:

alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m  # waits 5 minutes before the alert fires

4. Link Runbooks

Every alert should have a runbook:

annotations:
  runbook_url: "https://docs.ayedo.de/platform/operations/runbooks/#pod-crashloopbackoff"

5. Document Alerts

annotations:
  summary: "Kurze Beschreibung (1 Zeile)"
  description: "Detaillierte Beschreibung mit Variablen: {{ $labels.pod }}"
  impact: "User können Service X nicht nutzen"
  action: "1. Prüfe Logs 2. Restart Pod 3. Eskaliere an Team Y"

Monitoring the Alert System

Grafana Alerting healthy?

# Grafana Alerting Pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=alertmanager

# Grafana Alerting HTTP check
curl http://grafana.example.com/alerting/-/healthy
curl http://grafana.example.com/alerting/-/ready

# List alerts
curl http://grafana.example.com/alerting/api/v1/alerts | jq

vmalert healthy?

# vmalert pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=vmalert

# vmalert HTTP check
kubectl port-forward -n monitoring svc/vmalert 8880:8880
curl http://localhost:8880/-/healthy
curl http://localhost:8880/api/v1/rules | jq

Notification Delivery

# Search the Grafana Alerting logs for delivery errors
kubectl logs -n monitoring -l app.kubernetes.io/name=alertmanager | grep -i error

# Grafana Alerting status
curl http://grafana.example.com/alerting/api/v1/status | jq
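
Active alerts can also be listed from the CLI to cross-check what the UI shows. The base URL below follows the pattern used elsewhere in this document and may differ in your setup:

# list currently firing alerts
amtool --alertmanager.url=http://grafana.example.com/alerting alert query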

Compliance Mapping

The alerting mechanisms described here address the following compliance requirements:

Standard | Control | Fulfillment
ISO 27001 | Annex A 5.24 | ✅ Incident Management Planning
ISO 27001 | Annex A 5.25 | ✅ Assessment and Decision
BSI IT-Grundschutz | DER.1 | ✅ Detection of Events
NIS2 | Requirement (b) | ✅ Incident Handling

Further Documentation

Support

For questions about User Alerts: