Runbooks

This document describes operational runbooks for platform administrators of the ayedo Software Delivery Platform. Runbooks document step-by-step procedures to ensure consistency, efficiency, and traceability in operations.

Target audience

These runbooks are intended for platform administrators. Application developers will find relevant guides in the Delivery documentation.

Purpose of runbooks

Runbooks serve several purposes:

Consistency: The same task is always carried out the same way
Onboarding: New team members become productive quickly
Incident response: Fast reaction to alerts and incidents
Compliance: Documented procedures for audits (ISO 27001, BSI IT-Grundschutz)
Knowledge management: Less reliance on tribal knowledge

Runbook categories

1. Alert-Response Runbooks

Responses to monitoring alerts (Grafana, Grafana Alerting)

Examples:

Node NotReady
Pod CrashLoopBackOff
PVC Full
Certificate Expiration
High CPU/Memory

2. Change-Request Runbooks

Handling change requests (e.g. from application developers)

Examples:

User Onboarding
Namespace Creation
Certificate Issuance
Ingress Configuration
Resource Quota Adjustment

3. Maintenance Runbooks

Recurring maintenance tasks

Examples:

Node Patching
Kubernetes Upgrade
Certificate Renewal
Log Rotation
Backup Verification

Alert-Response Runbooks

Node NotReady

Alert: KubernetesNodeNotReady

Severity: Critical

Impact:

Pods on the node are evicted
Workloads may fail (if they have no redundancy)
Cluster capacity is reduced

Diagnosis:

Check the node status:

kubectl get nodes
kubectl describe node <node-name>

Check node events:

kubectl get events --field-selector involvedObject.name=<node-name> --sort-by='.lastTimestamp'

Check the kubelet status:

ssh <node>
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 100 --no-pager

Mitigation:

Restart the kubelet:

ssh <node>
sudo systemctl restart kubelet

Make the node schedulable again (only needed if it was cordoned):
```
kubectl uncordon <node-name>
```

Check the status of the pods on the node:

kubectl get pods -A --field-selector spec.nodeName=<node-name>

Escalation: If the node does not come back → Disaster Recovery
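When several nodes are affected, it can help to filter the node list instead of reading it by eye. The following is a minimal sketch, not part of the platform tooling: `not_ready_nodes` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes --no-headers` (NAME first, STATUS second).

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get nodes --no-headers` output on stdin
# and prints the name of every node whose STATUS does not start with "Ready".
not_ready_nodes() {
  # Column 1 is NAME, column 2 is STATUS, e.g. "Ready", "NotReady"
  # or "Ready,SchedulingDisabled". Lines with fewer than two fields are skipped.
  awk 'NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}
```

Possible usage: `kubectl get nodes --no-headers | not_ready_nodes`. Cordoned but healthy nodes (`Ready,SchedulingDisabled`) are deliberately not reported.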

Pod CrashLoopBackOff

Alert: KubePodCrashLooping

Severity: Warning

Impact:

The application does not work
Users cannot use the service
Potential data loss

Diagnosis:

Check the pod status:

kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

Check the pod logs:

# Current logs
kubectl logs <pod-name> -n <namespace>

# Logs of the previous container (before the restart)
kubectl logs <pod-name> -n <namespace> --previous

Check events:

kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'

Mitigation:

ImagePullBackOff:

# Does the image exist?
docker pull <image>

# Is the image pull secret correct?
kubectl get secret <image-pull-secret> -n <namespace> -o yaml

ConfigMap/Secret missing:

kubectl get configmap,secret -n <namespace>

Resource limits too low:

kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits:"

Liveness/readiness probe failing:

kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Liveness:"

Escalation: If the cause remains unclear → contact the application developer
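In a busy namespace, the diagnosis usually starts by finding which pods are actually crash-looping. The sketch below assumes the default columns of `kubectl get pods -n <namespace> --no-headers` (NAME, READY, STATUS, RESTARTS, AGE). `crashlooping_pods` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods -n <namespace> --no-headers`
# output on stdin and prints "<pod-name><TAB><restart-count>" for every pod
# whose STATUS column is CrashLoopBackOff.
crashlooping_pods() {
  # Column 3 is STATUS, column 4 is the restart count. A trailing
  # "(30s ago)" after the count lands in later fields and is ignored.
  awk '$3 == "CrashLoopBackOff" { print $1 "\t" $4 }'
}
```

Possible usage: `kubectl get pods -n <namespace> --no-headers | crashlooping_pods`. With `-A` the namespace becomes the first column, so the field numbers would have to be shifted by one.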

PVC Full

Alert: KubePersistentVolumeFillingUp

Severity: Warning

Impact:

The application can no longer write data
Potential data loss
The application may crash

Diagnosis:

Check PVC usage:

kubectl get pvc -A
kubectl describe pvc <pvc-name> -n <namespace>

Check disk usage inside the pod:

kubectl exec -n <namespace> <pod-name> -- df -h

Identify the largest directories:

# sh -c makes the glob expand inside the container instead of in the local shell
kubectl exec -n <namespace> <pod-name> -- sh -c 'du -sh /* 2>/dev/null' | sort -rh | head -10
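To turn the `df` output into a quick pass/fail check, the usage column can be compared against a threshold. This is a minimal sketch under two assumptions: the container's `df` supports the POSIX `-P` output format, and mount points contain no spaces. `full_filesystems` and the default threshold of 80 percent are illustrative choices, not platform settings.

```shell
#!/bin/sh
# Hypothetical helper: reads `df -P` output on stdin and prints
# "<mount-point><TAB><usage>" for every filesystem at or above the
# given usage threshold in percent (default: 80).
full_filesystems() {
  threshold="${1:-80}"
  # `df -P` columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on.
  # NR > 1 skips the header line; the "%" sign is stripped before comparing.
  awk -v limit="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6 "\t" $5
  }'
}
```

Possible usage: `kubectl exec -n <namespace> <pod-name> -- df -P | full_filesystems 80`.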

Mitigation:

Option 1: Expand the PVC (if the StorageClass sets allowVolumeExpansion: true)

# Increase the PVC size
kubectl edit pvc <pvc-name> -n <namespace>

# spec.resources.requests.storage: 50Gi  # from 20Gi to 50Gi

# Wait until the expansion has completed
kubectl get pvc <pvc-name> -n <namespace> --watch

Option 2: Delete old data

# Delete logs (example); sh -c makes the glob expand inside the container
kubectl exec -n <namespace> <pod-name> -- sh -c 'rm -f /var/log/*.log'

# Delete temporary files
kubectl exec -n <namespace> <pod-name> -- sh -c 'rm -rf /tmp/*'

Option 3: Enable log rotation

# logrotate-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: logrotate-config
  namespace: <namespace>
data:
  logrotate.conf: |
    /var/log/*.log {
      daily
      rotate 7
      compress
      missingok
      notifempty
    }

Certificate Expiration

Alert: CertManagerCertificateExpiringSoon

Severity: Warning

Impact:

HTTPS access fails once the certificate has expired
Users cannot use the services
Browser warnings

Diagnosis:

Check the certificate status:

kubectl get certificates -A
kubectl describe certificate <cert-name> -n <namespace>

Check the renewal status:

kubectl get certificaterequest -n <namespace>
kubectl logs -n cert-manager -l app=cert-manager

Check Let's Encrypt rate limits:

# Search the logs for "rate limit"
kubectl logs -n cert-manager -l app=cert-manager | grep -i "rate limit"

Mitigation:

Option 1: Trigger the renewal manually

# Delete the stale CertificateRequest so that cert-manager creates a new one
kubectl delete certificaterequest -n <namespace> <cert-request-name>

# Force a renewal (requires the cmctl CLI)
cmctl renew <cert-name> -n <namespace>

Option 2: Switch the Let's Encrypt issuer

# From production to staging (when rate-limited); staging certificates are not trusted by browsers
kubectl edit certificate <cert-name> -n <namespace>

# spec.issuerRef.name: letsencrypt-staging

Option 3: Use a DNS challenge (instead of an HTTP challenge)

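# DNS-01 solvers are configured on the (Cluster)Issuer, not on the Certificate
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key  # example name
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com
  namespace: production
spec:
  secretName: example-com-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - example.com
    - "*.example.com"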

Change-Request Runbooks

User Onboarding

Request: A new application developer needs access to the platform

Prerequisites:

An OIDC account exists (Azure AD, Keycloak)
The user has MFA enabled
Manager approval has been granted

Procedure:

Grant RBAC access:

# user-rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-john-doe
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - kind: User
    name: john.doe@example.com
    apiGroup: rbac.authorization.k8s.io

kubectl apply -f user-rolebinding.yaml

Grant OpenSearch access:

# opensearch-role-mapping.yaml
opensearch:
  extraRoleMappings:
    - mapping_name: kubernetes_log_reader
      definition:
        users:
          - john.doe@example.com

# Update and apply the ayedo config
polycrate workflows run update-opensearch

Grant Grafana access:

The user first signs in via OIDC; an admin then promotes the account to the Editor role:

# Grafana admin login
# → https://grafana.example.com/admin/users
# → Find user "john.doe@example.com"
# → Assign role "Editor"

Grant Harbor access:

# Harbor admin login
# → https://harbor.example.com/harbor/projects
# → Open project "production"
# → Members → Add → john.doe@example.com → Role: Developer

Send documentation:

Send an email with the following information:

Kubeconfig instructions: Access Control
Grafana URL: https://grafana.example.com
Harbor URL: https://harbor.example.com

Namespace Creation

Request: An application developer needs a new namespace

Prerequisites:

Namespace name: <team>-<env> (e.g. backend-prod)
Resource quotas defined
Network policies defined
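The naming convention above can be checked before any resource is created. The sketch below is an illustration only. The allowed environment suffixes (`dev`, `staging`, `prod`) are an assumption, because the source only shows `backend-prod`, and `valid_namespace_name` is a hypothetical helper name. The length limit of 63 characters and the lowercase alphanumeric rule come from the DNS label format that Kubernetes requires for namespace names.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the name follows <team>-<env> and is a
# valid Kubernetes namespace name, 1 otherwise.
# Assumption: the allowed <env> values are dev, staging and prod.
valid_namespace_name() {
  name="$1"
  # Namespace names are DNS labels: at most 63 characters.
  [ "${#name}" -le 63 ] || return 1
  # <team>: lowercase alphanumerics and "-", starting and ending alphanumeric.
  printf '%s\n' "$name" |
    grep -Eq '^[a-z0-9]([a-z0-9-]*[a-z0-9])?-(dev|staging|prod)$'
}
```

Possible usage: `valid_namespace_name backend-prod && kubectl apply -f namespace.yaml`.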

Procedure:

Create the namespace:

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: backend-prod
  labels:
    team: backend
    environment: production
    istio-injection: enabled

kubectl apply -f namespace.yaml

Create resource quotas:

# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: backend-prod-quota
  namespace: backend-prod
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    persistentvolumeclaims: "10"

kubectl apply -f resource-quota.yaml

Create network policies:

# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-from-other-namespaces
  namespace: backend-prod
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}

kubectl apply -f network-policy.yaml

Create the LimitRange:

# limit-range.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: backend-prod-limits
  namespace: backend-prod
spec:
  limits:
    - max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 10m
        memory: 10Mi
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      type: Container

kubectl apply -f limit-range.yaml

Grant RBAC access:

# rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: backend-team-edit
  namespace: backend-prod
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - kind: Group
    name: backend-team
    apiGroup: rbac.authorization.k8s.io

kubectl apply -f rolebinding.yaml

Maintenance Runbooks

Certificate Renewal

Frequency: Monthly (automatic via cert-manager)

Prerequisites:

cert-manager installed
Let's Encrypt issuer configured

Procedure:

Identify certificates that expire within the next 30 days:

kubectl get certificates -A -o json | \
  jq -r '.items[] | select(.status.notAfter != null) |
  [.metadata.namespace, .metadata.name, .status.notAfter] | @tsv' | \
  awk -v date="$(date -d '+30 days' -u +%Y-%m-%dT%H:%M:%SZ)" '$3 < date'

Manual renewal (if the automatic renewal fails):

# Delete the CertificateRequest (cert-manager creates a new one)
kubectl delete certificaterequest -n <namespace> <cert-request-name>

# Wait for the renewal
kubectl get certificate -n <namespace> <cert-name> --watch

Validation:

# Check the certificate status
kubectl describe certificate <cert-name> -n <namespace>

# Test the TLS handshake
openssl s_client -connect example.com:443 -servername example.com < /dev/null 2>/dev/null | \
  openssl x509 -noout -dates
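The `notAfter=` line printed by `openssl x509` can be converted into a number of remaining days, which is easier to compare against a renewal window. This is a minimal sketch that assumes GNU `date` (the `-d` option), as the commands above already do. `days_until_expiry` is a hypothetical helper, and its optional second argument exists only to make the calculation reproducible.

```shell
#!/bin/sh
# Hypothetical helper: prints the whole number of days between "now" and the
# expiry date given as an openssl "notAfter=..." line.
# $1: e.g. "notAfter=Jan 11 00:00:00 2030 GMT"
# $2: optional reference time as Unix epoch seconds (default: current time)
days_until_expiry() {
  not_after="${1#notAfter=}"          # strip the "notAfter=" prefix if present
  now="${2:-$(date -u +%s)}"
  # Assumption: GNU date, which parses the openssl date format via -d.
  end=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (end - now) / 86400 ))
}
```

Possible usage: `days_until_expiry "$(openssl s_client -connect example.com:443 -servername example.com < /dev/null 2>/dev/null | openssl x509 -noout -enddate)"`. A negative result means the certificate has already expired.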

Backup Verification

Frequency: Weekly

Prerequisites:

Velero installed
Backup schedule configured

Procedure:

Check the backup status:

velero backup get | grep Completed

Details of the latest backup:

LAST_BACKUP=$(velero backup get -o json | jq -r '.items | sort_by(.status.startTimestamp) | last | .metadata.name')
velero backup describe "$LAST_BACKUP" --details

Run a test restore:

# Create the test namespace
kubectl create namespace backup-test

# Restore only the production namespace into the test namespace
velero restore create backup-test-restore \
  --from-backup "$LAST_BACKUP" \
  --include-namespaces production \
  --namespace-mappings production:backup-test \
  --wait

# Validation
kubectl get all -n backup-test

# Cleanup
kubectl delete namespace backup-test
velero restore delete backup-test-restore

Check the off-site backup:

# Rclone job status
kubectl get jobs -n velero | grep rclone

# Rclone job logs
kubectl logs -n velero -l job-name=rclone-offsite-backup-<timestamp>

Runbook Template

Please use the following template for new runbooks:

### <Runbook title>

**Alert/Request:** <Alert name or request type>

**Severity:** <Critical/Warning/Info>

**Impact:**

- <Impact item 1>
- <Impact item 2>

**Diagnosis:**


1. **Step 1:**
   \```bash
   <command>
   \```

2. **Step 2:**
   \```bash
   <command>
   \```

**Mitigation:**


**Option 1: <Solution 1>**

\```bash
<command>
\```

**Option 2: <Solution 2>**

\```bash
<command>
\```

**Escalation:** <When to escalate? To whom?>

Further Documentation

Troubleshooting - Detailed troubleshooting
Maintenance - Maintenance procedures
Disaster Recovery - Backup and restore
User Alerts - Alert configuration

Support

For questions about runbooks:

Email: support@ayedo.de
Website: ayedo.de
Discord: ayedo Discord