Disaster Recovery¶

This document describes the disaster recovery procedures for the ayedo Software Delivery Platform. The procedures must be carried out by platform administrators and assume knowledge of Kubernetes, Velero, and the ayedo architecture.

Compliance Requirements¶

Disaster recovery is mandated by several regulations and standards:

Standard	Control	Requirement
ISO 27001	Annex A 5.30	ICT Readiness for Business Continuity
ISO 27001	Annex A 8.13	Information Backup
ISO 27001	Annex A 17.1.1	Planning Information Security Continuity
BSI IT-Grundschutz	OPS.1.1.5	Datensicherung (Data Backup)
GDPR	Art. 32	Availability and Resilience
NIS2	Requirement (c)	Disaster Recovery

Backup Strategy¶

The ayedo SDP uses a multi-layer backup strategy:

1. Kubernetes Resource Backups¶

Tool: Velero

Backup scope:

Kubernetes manifests (Deployments, Services, ConfigMaps, Secrets)
Persistent Volume Claims (PVCs)
Persistent Volumes (PVs), including their data

Backup frequency:

Daily: automatic backups of all critical namespaces
On-demand: before upgrades and maintenance
Pre-deployment: before critical changes

2. Database Backups¶

Tools:

PostgreSQL: pg_dump via CronJob
Harbor: database dump via CronJob

Backup target:

Object storage (S3, Azure Blob, MinIO)

3. Application Data Backups¶

Tool: Velero with Restic/Kopia integration

Backup scope:

Application-specific Persistent Volumes
StatefulSet data
User-generated data

4. Off-Site Backups¶

Tool: Rclone via CronJob

Backup target:

Secondary object storage (geographically separated)

Encryption:

Optional via Rclone Crypt

Velero Backup and Restore¶

Velero is the primary tool for Kubernetes backups in the ayedo SDP.

Prerequisites¶

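# Install the Velero CLI. Release archives carry the version in their file name;
# v1.13.0 is only an example, adjust it to match your Velero server version.
VELERO_VERSION=v1.13.0
wget "https://github.com/vmware-tanzu/velero/releases/download/${VELERO_VERSION}/velero-${VELERO_VERSION}-linux-amd64.tar.gz"
tar -xvf "velero-${VELERO_VERSION}-linux-amd64.tar.gz"
sudo mv "velero-${VELERO_VERSION}-linux-amd64/velero" /usr/local/bin/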

Creating a Backup¶

Daily Backup (Automatic)¶

Velero is configured to create backups automatically every day:

# velero-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 02:00 UTC
  template:
    includedNamespaces:
      - production
      - staging
      - monitoring
    excludedResources:
      - events
      - events.events.k8s.io
    ttl: 720h  # 30-day retention
    storageLocation: default
    volumeSnapshotLocations:
      - default

On-Demand Backup¶

Full cluster backup:

velero backup create full-backup-$(date +%Y%m%d) \
  --include-namespaces='*' \
  --exclude-namespaces=kube-system,kube-public,kube-node-lease \
  --wait

Namespace-specific backup:

velero backup create production-backup-$(date +%Y%m%d) \
  --include-namespaces=production \
  --wait

Backup including Persistent Volumes:

velero backup create app-backup-$(date +%Y%m%d) \
  --include-namespaces=production \
  --default-volumes-to-fs-backup \
  --wait
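
Backup names must be unique within the velero namespace, so a second on-demand backup on the same day would be rejected when the name carries only a date suffix. The following is a minimal sketch that adds a UTC time component; `backup_name` is a hypothetical helper and not part of Velero.

```shell
#!/bin/sh
# Hypothetical helper: builds a backup name from a prefix plus a UTC date and
# time suffix, so repeated runs on the same day produce distinct names.
backup_name() {
  prefix="$1"
  printf '%s-%s\n' "$prefix" "$(date -u +%Y%m%d-%H%M%S)"
}

# Example (requires a configured velero CLI):
#   velero backup create "$(backup_name production-backup)" \
#     --include-namespaces=production --wait
```

The suffix contains only digits and hyphens, which keeps the result valid as a Kubernetes object name as long as the prefix is valid.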

Checking Backup Status¶

# List all backups
velero backup get

# Show backup details
velero backup describe <backup-name> --details

# Show backup logs
velero backup logs <backup-name>

# Check backup status via kubectl
kubectl get backups -n velero
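
For routine checks it can help to print only the backups that need attention. The sketch below filters the table printed by `velero backup get`; it assumes that NAME is the first column and STATUS the second, and `incomplete_backups` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads the `velero backup get` table on stdin and prints
# the name and status of every backup whose status is not Completed.
# Assumption: column 1 is NAME and column 2 is STATUS; the header row is skipped.
incomplete_backups() {
  awk 'NR > 1 && $2 != "Completed" { print $1, $2 }'
}

# Example (requires a configured velero CLI):
#   velero backup get | incomplete_backups
```

Backups that are still running are listed as well, because their status is not yet Completed.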

Performing a Restore¶

Important

By default, Velero does not overwrite existing resources. Delete the affected resources before running the restore.

Full Cluster Restore¶

# Step 1: Install Velero (if this is a new cluster)
velero install \
  --provider aws \
  --bucket ayedo-velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=eu-central-1 \
  --use-volume-snapshots=false \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --wait

# Step 2: Verify the backup location
velero backup-location get

# Step 3: List the backups synced from the backup location
velero backup get

# Step 4: Create the restore
velero restore create full-restore-$(date +%Y%m%d) \
  --from-backup full-backup-20250115 \
  --wait

Namespace-Restore¶

velero restore create production-restore-$(date +%Y%m%d) \
  --from-backup production-backup-20250115 \
  --include-namespaces production \
  --wait

Restoring Individual Resources¶

# Inspect the backup contents
velero backup describe full-backup-20250115 --details

# Restore specific resources
velero restore create selective-restore-$(date +%Y%m%d) \
  --from-backup full-backup-20250115 \
  --include-namespaces production \
  --include-resources deployment,service,configmap \
  --selector app=myapp \
  --wait

Restoring Persistent Volumes¶

PVs with WaitForFirstConsumer¶

PVs from a StorageClass with volumeBindingMode: WaitForFirstConsumer are only bound once a pod uses the claim:

Check for pending PVCs:
```
kubectl get pvc -A | grep Pending
```
Create a helper pod:
```
# pvc-binder-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pvc-binder
  namespace: production
spec:
  containers:
    - name: sleeper
      image: busybox:stable
      command: ["sleep", "infinity"]
      volumeMounts:
        - mountPath: /data
          name: restore-volume
      resources:
        requests:
          cpu: 10m
          memory: 10Mi
        limits:
          cpu: 10m
          memory: 10Mi
      securityContext:
        allowPrivilegeEscalation: false
        runAsNonRoot: true
        runAsUser: 10000
        capabilities:
          drop: ["ALL"]
        seccompProfile:
          type: RuntimeDefault
  volumes:
    - name: restore-volume
      persistentVolumeClaim:
        claimName: my-app-data  # adjust the PVC name
```

Create the pod and wait for the restore:

kubectl apply -f pvc-binder-pod.yaml

# Wait until Velero has restored the volume
velero restore get

# Delete the pod
kubectl delete -f pvc-binder-pod.yaml

Harbor Disaster Recovery¶

Harbor stores container images in object storage and metadata in a PostgreSQL database.

Backup¶

Automatic Database Backup¶

Harbor is configured so that the database is backed up daily via a CronJob:

# Check the CronJob status
kubectl get cronjob -n harbor harbor-backup-cronjob

# Check the most recent backup job
kubectl get jobs -n harbor | grep harbor-backup

On-Demand Backup¶

# Create a manual backup job
kubectl create job -n harbor \
  --from=cronjob/harbor-backup-cronjob \
  harbor-manual-backup-$(date +%Y%m%d)

# Follow the logs of the job created above
kubectl logs -n harbor -f job/harbor-manual-backup-$(date +%Y%m%d)

Restore¶

A Harbor restore consists of two steps:

Restore the database
Restore the image registry (object storage)

# Step 1: Scale down the Harbor pods
kubectl scale deployment -n harbor harbor-core --replicas=0
kubectl scale deployment -n harbor harbor-registry --replicas=0

# Step 2: Create the database restore job
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: harbor-restore-$(date +%Y%m%d)
  namespace: harbor
spec:
  template:
    spec:
      containers:
        - name: restore
          image: postgres:14
          command:
            - bash
            - -c
            - |
              export PGPASSWORD=\$POSTGRES_PASSWORD
              psql -h harbor-database -U postgres -c "DROP DATABASE IF EXISTS registry;"
              psql -h harbor-database -U postgres -c "CREATE DATABASE registry;"
              pg_restore -h harbor-database -U postgres -d registry /backup/harbor-db-backup-20250115.dump
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: harbor-database
                  key: password
          volumeMounts:
            - name: backup
              mountPath: /backup
      volumes:
        - name: backup
          persistentVolumeClaim:
            claimName: harbor-backup-pvc
      restartPolicy: OnFailure
EOF

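# Step 3: Wait for the restore job to complete (the 30m timeout is an example value),
# then scale the Harbor pods back up
kubectl wait -n harbor --for=condition=complete --timeout=30m \
  job/harbor-restore-$(date +%Y%m%d)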
kubectl scale deployment -n harbor harbor-core --replicas=1
kubectl scale deployment -n harbor harbor-registry --replicas=1

# Step 4: Check the Harbor status
kubectl get pods -n harbor
curl -I https://harbor.example.com

Grafana Disaster Recovery¶

Grafana dashboards are backed up by Velero (PVC backup).

Backup¶

# Automatic via the Velero daily backup

# Manual backup (pods are included because Velero's file system backup
# finds the volumes through the pods that mount them)
velero backup create grafana-backup-$(date +%Y%m%d) \
  --include-namespaces monitoring \
  --include-resources pods,deployment,pvc,pv,configmap,secret \
  --selector app.kubernetes.io/name=grafana \
  --default-volumes-to-fs-backup \
  --wait

Restore¶

# Scale down Grafana
kubectl scale deployment -n monitoring grafana --replicas=0

# Delete the PVC and PV
kubectl delete pvc -n monitoring grafana
kubectl delete pv <grafana-pv-name>

# Velero restore
velero restore create grafana-restore-$(date +%Y%m%d) \
  --from-backup grafana-backup-20250115 \
  --include-namespaces monitoring \
  --wait

# Scale Grafana back up
kubectl scale deployment -n monitoring grafana --replicas=1

# Check the status
kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
curl -I https://grafana.example.com

GitLab Disaster Recovery¶

GitLab stores Git repositories, CI/CD artifacts, and metadata. A complete backup covers all databases (PostgreSQL), object storage, and the GitLab secrets.

Backup¶

Automatic Backup via GitLab Toolbox¶

GitLab provides an integrated backup tool through the Toolbox:

# Create a backup via gitlab-toolbox
kubectl exec -n gitlab gitlab-toolbox-<pod> -- \
  backup-utility --skip registry,uploads,builds,artifacts,lfs,packages

# Check the backup status
kubectl logs -n gitlab gitlab-toolbox-<pod> | grep -i backup

On-Demand Backup¶

# Create a manual backup
kubectl exec -n gitlab gitlab-toolbox-<pod> -- \
  gitlab-backup create SKIP=registry,uploads,builds,artifacts,lfs,packages

# List the backups
kubectl exec -n gitlab gitlab-toolbox-<pod> -- \
  gitlab-backup list

Object Storage

When object storage (S3, GCS, etc.) is used, repository data and artifacts should be backed up via Velero plus object storage snapshots.

Restore¶

A GitLab restore requires access to the backup and the GitLab secrets:

# Step 1: Scale down the GitLab pods (except the Toolbox)
kubectl scale deployment -n gitlab gitlab-webservice --replicas=0
kubectl scale deployment -n gitlab gitlab-sidekiq --replicas=0

# Step 2: Restore the database via the Toolbox
kubectl exec -n gitlab gitlab-toolbox-<pod> -- \
  gitlab-backup restore BACKUP=<timestamp> force=yes

# Step 3: Scale the GitLab pods back up
kubectl scale deployment -n gitlab gitlab-webservice --replicas=2
kubectl scale deployment -n gitlab gitlab-sidekiq --replicas=1

# Step 4: Check the GitLab status
kubectl get pods -n gitlab
kubectl exec -n gitlab gitlab-toolbox-<pod> -- gitlab-rake gitlab:check SANITIZE=true
curl -I https://gitlab.example.com

ArgoCD Disaster Recovery¶

ArgoCD stores application definitions, cluster credentials, and sync status as Kubernetes resources (Application and AppProject custom resources, Secrets, and ConfigMaps) in its namespace; it does not use a separate database such as PostgreSQL.

Backup¶

ArgoCD backups are captured automatically by Velero:

# Manual backup (includes the Application and AppProject custom resources)
velero backup create argocd-backup-$(date +%Y%m%d) \
  --include-namespaces argocd \
  --include-resources deployment,statefulset,configmap,secret,pvc,pv,applications.argoproj.io,appprojects.argoproj.io \
  --default-volumes-to-fs-backup \
  --wait

# Check the backup status
velero backup describe argocd-backup-$(date +%Y%m%d)

Exporting ArgoCD Applications (Declarative)¶

Alternatively, Applications can be exported as YAML:

# Export all Applications
kubectl get applications -n argocd -o yaml > argocd-applications-backup.yaml

# Export the cluster Secrets
kubectl get secrets -n argocd -l argocd.argoproj.io/secret-type=cluster -o yaml > argocd-clusters-backup.yaml

Restore¶

Option 1: Velero restore (recommended)

# Restore the ArgoCD namespace and resources
velero restore create argocd-restore-$(date +%Y%m%d) \
  --from-backup argocd-backup-20250115 \
  --include-namespaces argocd \
  --wait

# Check the ArgoCD status
kubectl get pods -n argocd
kubectl get applications -n argocd
argocd app list

Option 2: Declarative restore

# Restore the Applications
kubectl apply -f argocd-applications-backup.yaml

# Restore the cluster credentials
kubectl apply -f argocd-clusters-backup.yaml

# Restart the ArgoCD server
kubectl rollout restart deployment -n argocd argocd-server

HashiCorp Vault Disaster Recovery¶

Vault stores secrets in an encrypted storage backend. DR for Vault requires both storage backups and the unseal keys.

Backup¶

Raft Storage Snapshot (Recommended)¶

If Vault uses Raft as its storage backend:

# Create a snapshot
kubectl exec -n vault vault-0 -- vault operator raft snapshot save /tmp/vault-snapshot-$(date +%Y%m%d).snap

# Copy the snapshot out of the pod
kubectl cp vault/vault-0:/tmp/vault-snapshot-$(date +%Y%m%d).snap ./vault-snapshot-$(date +%Y%m%d).snap

# Move the snapshot to a secure location (S3, GCS, etc.)
aws s3 cp vault-snapshot-$(date +%Y%m%d).snap s3://backups/vault/

Automatic Snapshot via CronJob¶

apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-snapshot
  namespace: vault
spec:
  schedule: "0 2 * * *"  # Daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: vault
          containers:
            - name: snapshot
              image: hashicorp/vault:1.18.0
              command:
                - /bin/sh
                - -c
                - |
                  vault operator raft snapshot save /backup/vault-snapshot-$(date +%Y%m%d).snap
                  # Upload to S3/GCS
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: vault-backup-pvc
          restartPolicy: OnFailure

Store Unseal Keys Securely

Vault snapshots are encrypted. Without the unseal keys, you cannot unseal Vault after a restore. Keep the unseal keys in a secure, separate location (e.g. hardware token, password manager, paper backup in a safe).

Restore¶

Raft Snapshot Restore¶

# Step 1: Scale down the Vault pods (except vault-0)
kubectl scale statefulset -n vault vault --replicas=1

# Step 2: Copy the snapshot into the pod
kubectl cp ./vault-snapshot-20250115.snap vault/vault-0:/tmp/vault-snapshot.snap

# Step 3: Restore the snapshot
kubectl exec -n vault vault-0 -- vault operator raft snapshot restore -force /tmp/vault-snapshot.snap

# Step 4: Restart Vault
kubectl rollout restart statefulset -n vault vault

# Step 5: Unseal Vault
kubectl exec -n vault vault-0 -- vault operator unseal <unseal-key-1>
kubectl exec -n vault vault-0 -- vault operator unseal <unseal-key-2>
kubectl exec -n vault vault-0 -- vault operator unseal <unseal-key-3>

# Step 6: Check the Vault status
kubectl exec -n vault vault-0 -- vault status

Exporting Vault Secrets (Alternative)¶

For smaller deployments or migration scenarios:

# Export all secrets at the top level of the KV engine (nested folders are not traversed)
vault kv list -format=json secret/ | jq -r '.[]' | while read -r path; do
  vault kv get -format=json "secret/$path" > "backup-$path.json"
done
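
Entries returned by `vault kv list` that end with "/" are folders and cannot be read with `vault kv get`, and a path containing "/" does not map cleanly to a single local file name. The following sketch shows two small helpers that an export loop could use; `export_filename` and `is_folder` are hypothetical names introduced for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: maps a KV path to a flat backup file name by replacing
# "/" with "_", so nested paths do not need matching local directories.
export_filename() {
  printf 'backup-%s.json\n' "$(printf '%s' "$1" | tr '/' '_')"
}

# Hypothetical helper: succeeds when a listed entry is a folder, i.e. ends with "/".
is_folder() {
  case "$1" in
    */) return 0 ;;
    *)  return 1 ;;
  esac
}

# Example use inside an export loop (requires an authenticated vault CLI):
#   is_folder "$path" && continue
#   vault kv get -format=json "secret/$path" > "$(export_filename "$path")"
```

The exported files contain secrets in plain text, so they should be stored at least as carefully as the unseal keys.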

Off-Site Backup and Restore¶

Off-site backups are replicated via Rclone to a geographically separate object storage.

Off-Site Backup Configuration¶

# rclone-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rclone-offsite-backup
  namespace: velero
spec:
  schedule: "0 4 * * *"  # 04:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: rclone
              image: rclone/rclone:latest
              command:
                - /bin/sh
                - -c
                - |
                  rclone sync \
                    --config /config/rclone.conf \
                    --transfers 4 \
                    --checkers 8 \
                    --verbose \
                    primary:ayedo-velero-backups offsite:ayedo-velero-backups-offsite
              volumeMounts:
                - name: rclone-config
                  mountPath: /config
          volumes:
            - name: rclone-config
              secret:
                secretName: rclone-config
          restartPolicy: OnFailure

Restoring from an Off-Site Backup¶

Step 1: Configure the off-site backup location

# S3-based off-site backup
velero backup-location create offsite \
  --provider aws \
  --bucket ayedo-velero-backups-offsite \
  --config region=eu-west-1 \
  --access-mode ReadOnly \
  --credential=velero-offsite-credentials=cloud

# Verify the backup location
velero backup-location get

Step 2: Sync the backups from the off-site location

# Retrieve the backups from the off-site location
velero backup get --backup-location offsite

Step 3: Perform the restore

velero restore create offsite-restore-$(date +%Y%m%d) \
  --from-backup <backup-name> \
  --wait

Disaster Recovery Testing¶

Recommended Test Frequency¶

Test type	Frequency	Scope
Namespace restore	Monthly	Single namespace (non-prod)
Full cluster restore	Quarterly	Entire staging cluster
Off-site restore	Semi-annually	Off-site → production cluster

DR Test Procedure¶

Validate the backup status:
```
velero backup get | grep Completed
```
Create a test namespace:
```
kubectl create namespace dr-test
```

Restore into the test namespace:

velero restore create dr-test-restore \
  --from-backup production-backup-20250115 \
  --namespace-mappings production:dr-test \
  --wait

Validation:
All pods are running: kubectl get pods -n dr-test
Services are reachable: curl https://app.dr-test.example.com
Data is intact: run database queries

Cleanup:

kubectl delete namespace dr-test
velero restore delete dr-test-restore
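
The "all pods are running" check in the validation step can be scripted instead of read by eye. The sketch below parses the default `kubectl get pods` table; it assumes the columns NAME, READY, and STATUS appear in that order, and `not_ready_pods` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads the `kubectl get pods` table on stdin and prints
# every pod that is not Running or whose READY count is incomplete (e.g. 1/2).
# Pods in status Completed (finished jobs) are ignored.
not_ready_pods() {
  awk 'NR > 1 && $3 != "Completed" {
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Example (requires cluster access):
#   kubectl get pods -n dr-test | not_ready_pods
```

Empty output means that every pod in the restored namespace is running with all containers ready.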

RTO and RPO¶

Recovery Time Objective (RTO)¶

Component	Target RTO	Actual RTO
Kubernetes resources	< 30 min	~15 min
Persistent Volumes	< 1 hour	~30 min
Harbor	< 1 hour	~30 min
Entire platform	< 4 hours	~2 hours

Recovery Point Objective (RPO)¶

Backup type	Target RPO	Backup frequency
Velero Daily Backup	24 hours	Daily at 02:00 UTC
Harbor DB Backup	24 hours	Daily at 03:00 UTC
Off-Site Replication	24 hours	Daily at 04:00 UTC
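
When a recovery has to rely on the off-site copy alone, the age of the newest replicated backup also depends on the gap between the backup run and the replication run. The following is a rough worked example, assuming the backup completes before replication starts and ignoring transfer time; `worst_case_offsite_rpo_hours` is a hypothetical helper.

```shell
#!/bin/sh
# Worst-case age in hours of the newest backup available off-site.
# $1: hours between two backups (24 for a daily schedule)
# $2: hours between the backup start and the off-site sync (02:00 -> 04:00 = 2)
# Just before the next sync, the newest off-site copy is one full interval
# plus the sync offset old.
worst_case_offsite_rpo_hours() {
  echo $(( $1 + $2 ))
}

worst_case_offsite_rpo_hours 24 2   # prints 26
```

Under these assumptions the off-site copy can lag slightly beyond 24 hours, which is worth keeping in mind when comparing test results against the target RPO.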

Troubleshooting¶

Problem: Velero Backup Fails¶

Symptom: velero backup get shows PartiallyFailed or Failed

Solution:

# Check the backup logs
velero backup logs <backup-name>

# Check the Velero pod logs
kubectl logs -n velero -l app.kubernetes.io/name=velero

# Test access to the object storage
velero backup-location get

# Restore from a partially failed backup with --allow-partially-failed
velero restore create <restore-name> \
  --from-backup <backup-name> \
  --allow-partially-failed \
  --wait

Symptom: Snapshot remains in IN_PROGRESS status

Solution:

# Check the snapshot status
curl -u "${OS_USER}:${OS_PASS}" \
  "${OS_URL}/_snapshot/${SNAPSHOT_REPO}/${SNAPSHOT_NAME}/_status?pretty"

# Abort the snapshot
curl -u "${OS_USER}:${OS_PASS}" -X DELETE \
  "${OS_URL}/_snapshot/${SNAPSHOT_REPO}/${SNAPSHOT_NAME}"

curl -u "${OS_USER}:${OS_PASS}" "${OS_URL}/_cluster/health?pretty"

Problem: PVC Remains Pending After Restore¶

Symptom: kubectl get pvc shows status Pending

Solution:

Check the PVC events:

kubectl describe pvc <pvc-name> -n <namespace>

Check the StorageClass mapping (see the Velero documentation)
Create a helper pod (see section "PVs with WaitForFirstConsumer")
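
To see which claims are affected across the cluster, the `kubectl get pvc -A` table can be reduced to namespace/name pairs. This is a minimal sketch that assumes the columns NAMESPACE, NAME, and STATUS appear in that order; `pending_pvcs` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads the `kubectl get pvc -A` table on stdin and prints
# each Pending claim as namespace/name.
# Assumption: column 1 is NAMESPACE, column 2 is NAME, column 3 is STATUS.
pending_pvcs() {
  awk 'NR > 1 && $3 == "Pending" { print $1 "/" $2 }'
}

# Example (requires cluster access):
#   kubectl get pvc -A | pending_pvcs
```

Each printed pair can be passed on to `kubectl describe pvc` to inspect the events of that claim.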

Further Documentation¶

Backup & Recovery (Compliance) - Compliance requirements
Maintenance - Regular maintenance
Runbooks - Operational procedures
Troubleshooting - Troubleshooting

Support¶

For questions about disaster recovery:

Email: support@ayedo.de
Website: ayedo.de
Discord: ayedo Discord