Endpoint Monitoring

Overview

Endpoint monitoring in the Polycrate API continuously tracks the availability and performance of your services. The Polycrate Operator runs the checks directly inside the Kubernetes cluster.

Polycrate API vs. Polycrate Hub

The Polycrate API and the Polycrate Hub are two independent components of the Polycrate ecosystem:

Polycrate API: web interface and backend for infrastructure management, monitoring, and alerting
Polycrate Hub: block registry and marketplace for Polycrate Blocks

Endpoint monitoring is a feature of the Polycrate API.

Architecture

Endpoint monitoring is fully integrated into the Polycrate Operator. The operator acts as the monitoring agent and runs checks directly inside the Kubernetes cluster.

graph TB
    subgraph cluster[Kubernetes Cluster]
        INGRESS[Ingress objects]

        subgraph operator[Polycrate Operator]
            DISCOVERY[Endpoint Discovery<br/>Controller]
            MONITORING[Endpoint Monitoring<br/>Controller]
            API_AGENT[API Agent<br/>Controller]
        end

        ENDPOINT_CRD[(Endpoint CRs)]
        PROM[Prometheus metrics]
    end

    subgraph api[Polycrate API]
        CONFIG[Endpoint configuration]
        RESULTS[Check results]
    end

    INGRESS -->|1. Discovery| DISCOVERY
    DISCOVERY -->|2. Creates| ENDPOINT_CRD
    API_AGENT <-->|3. Sync| CONFIG
    API_AGENT -->|4. Creates API endpoints| ENDPOINT_CRD
    MONITORING -->|5. Reads| ENDPOINT_CRD
    MONITORING -->|6. Runs checks| INGRESS
    MONITORING -->|7. Status update| ENDPOINT_CRD
    MONITORING -->|8. Exports| PROM

Operator-Based Monitoring

The Polycrate Operator contains three controllers for endpoint monitoring:

Controller	Function
Endpoint Discovery	Detects Ingress objects and creates Endpoint CRs
Endpoint Monitoring	Runs HTTP checks and updates the status
API Agent	Synchronizes endpoints with the Polycrate API

Advantages of the operator-based approach:

✅ No separate agent installation required
✅ Automatic endpoint detection via Ingress
✅ Kubernetes-native persistence in Custom Resources
✅ Prometheus metrics out of the box
✅ Local monitoring even without an API connection

Endpoint Sources

Endpoints can originate from three sources:

Source	Label	Description
Discovery	`operator.polycrate.io/source=discovery`	Detected automatically from Ingress objects
API	`operator.polycrate.io/source=api`	Assigned by the Polycrate API
Manual	`operator.polycrate.io/source=manual`	Manually created Endpoint CRs

Automatic Endpoint Detection (Discovery)

The operator detects Ingress objects in the cluster and automatically creates Endpoint Custom Resources for them. Annotations on the Ingress control how each endpoint is checked:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    # Health check path (instead of the root URL)
    endpoints.polycrate.io/path: "/health"
    # Check interval in seconds
    endpoints.polycrate.io/interval: "30"
    # Timeout in seconds
    endpoints.polycrate.io/timeout: "10"
    # Expected HTTP status
    endpoints.polycrate.io/expected-status: "200"
spec:
  rules:
    - host: app.example.com
      # ...

Annotation	Default	Description
`endpoints.polycrate.io/path`	`/`	HTTP path for the health check
`endpoints.polycrate.io/interval`	`60`	Check interval in seconds
`endpoints.polycrate.io/timeout`	`30`	Timeout in seconds
`endpoints.polycrate.io/expected-status`	`200`	Expected HTTP status
`endpoints.polycrate.io/ignore`	`false`	Exclude the Ingress from monitoring
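
To make the defaults in the table concrete, the following Python sketch overlays a set of Ingress annotations on those defaults. It is only an illustration of the documented behaviour, not the operator's implementation; `CheckSettings` and `settings_from_annotations` are hypothetical names introduced here.

```python
from dataclasses import dataclass

PREFIX = "endpoints.polycrate.io/"


@dataclass
class CheckSettings:
    # Defaults taken from the annotation table above.
    path: str = "/"
    interval: int = 60           # seconds
    timeout: int = 30            # seconds
    expected_status: int = 200
    ignore: bool = False


def settings_from_annotations(annotations: dict[str, str]) -> CheckSettings:
    """Overlay endpoints.polycrate.io/* annotations on the documented defaults."""
    settings = CheckSettings()
    for key, raw in annotations.items():
        if not key.startswith(PREFIX):
            continue  # unrelated annotation, e.g. from an ingress controller
        name = key[len(PREFIX):].replace("-", "_")
        if name == "path":
            settings.path = raw
        elif name in ("interval", "timeout", "expected_status"):
            setattr(settings, name, int(raw))  # annotation values are always strings
        elif name == "ignore":
            settings.ignore = raw.strip().lower() == "true"
    return settings


example = settings_from_annotations({
    "endpoints.polycrate.io/path": "/health",
    "endpoints.polycrate.io/interval": "30",
    "kubernetes.io/ingress.class": "nginx",
})
print(example)
```

Annotations that are not set keep their default, so the example above still uses a timeout of 30 seconds and an expected status of 200.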

API Integration

When API integration is enabled, the operator synchronizes with the Polycrate API:

Registration: the operator registers with the API as an agent
Endpoint sync: the API assigns additional endpoints (e.g. for multi-region monitoring)
Health reporting: the operator sends periodic health reports

Configuration in OperatorConfig:

apiVersion: polycrate.io/v1alpha1
kind: OperatorConfig
metadata:
  name: default
  namespace: polycrate
spec:
  api_sync:
    enabled: true
    api_url: "https://api.polycrate.io"
    credentials_ref:
      secret_name: polycrate-api-creds
      token_key: token
    workspace_id: "my-workspace"

  endpoint_discovery:
    enabled: true
    watch_namespaces: []
    ignore_namespaces:
      - kube-system
    default_check_interval: 60

  agent:
    enabled: true
    max_concurrent_checks: 50
    check_timeout: "10s"
    metrics:
      enabled: true
      port: 9090

Endpoint Custom Resource

The operator stores the endpoint configuration and the check results in Custom Resources:

apiVersion: polycrate.io/v1alpha1
kind: Endpoint
metadata:
  name: myapp-example-com
  labels:
    operator.polycrate.io/source: "discovery"
spec:
  remote_address: "myapp.example.com"
  remote_port: 443
  protocol: "https"
  path: "/health"
  interval: 60
  timeout: 10
  expected_status_codes: [200, 201]

  check_config:
    method: "GET"
    follow_redirects: true
    ignore_tls_errors: false
    retry_enabled: true
    retry_count: 3

status:
  phase: "Synced"
  monitoring:
    last_check_at: "2025-01-18T10:35:00Z"
    last_check_success: true
    last_check_duration_ms: 150
    last_result:
      status_code: 200
      response_time_ms: 145
      tls_version: "TLS 1.3"
      certificate_expiry: "2025-06-15T00:00:00Z"
    uptime_percent: 99.65
    next_check_at: "2025-01-18T10:36:00Z"
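
As a rough illustration of how the `spec` fields fit together, the Python sketch below derives the check URL from a spec and decides whether a response status counts as a success. How the operator actually composes the URL is not documented here, so `build_url` and `is_success` are hypothetical helpers written under that assumption.

```python
def build_url(spec: dict) -> str:
    """Compose the check URL from protocol, remote_address, remote_port and path."""
    path = spec.get("path", "/")
    if not path.startswith("/"):
        path = "/" + path
    return f"{spec['protocol']}://{spec['remote_address']}:{spec['remote_port']}{path}"


def is_success(spec: dict, status_code: int) -> bool:
    """A check passes when the status code is one of expected_status_codes."""
    # Falling back to [200] is an assumption that mirrors the annotation default.
    return status_code in spec.get("expected_status_codes", [200])


# Values copied from the Endpoint example above.
spec = {
    "remote_address": "myapp.example.com",
    "remote_port": 443,
    "protocol": "https",
    "path": "/health",
    "expected_status_codes": [200, 201],
}

print(build_url(spec))
print(is_success(spec, 201), is_success(spec, 503))
```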

Prometheus Metrics

The operator exports check results as Prometheus metrics:

# Endpoint availability (1=up, 0=down)
polycrate_io_endpoint_up{
  endpoint_id="uuid",
  hostname="myapp.example.com",
  organization="my-org",
  workspace="my-workspace"
} 1

# Response time in milliseconds
polycrate_io_endpoint_response_time_ms{...} 145

# HTTP status code
polycrate_io_endpoint_http_status_code{...} 200

# TLS certificate expiry (Unix timestamp)
polycrate_io_endpoint_certificate_expiry_timestamp{...} 1750000000

# Check duration in milliseconds
polycrate_io_endpoint_check_duration_ms{...} 150
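
The samples above follow the Prometheus text exposition format: a metric name, a brace-enclosed label set, and a value. The Python sketch below renders one such line from a check result. It only illustrates the format and is not the operator's exporter; `render_sample` and `escape_label_value` are hypothetical helpers.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render a single sample line: name{label="value",...} value."""
    pairs = ",".join(f'{key}="{escape_label_value(val)}"' for key, val in labels.items())
    return f"{name}{{{pairs}}} {value}"


line = render_sample(
    "polycrate_io_endpoint_up",
    {"hostname": "myapp.example.com", "workspace": "my-workspace"},
    1,
)
print(line)
```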

Endpoint Types

HTTP Endpoints

Monitoring of web services:

Field	Description	Example
URL	Full URL	`https://api.example.com/health`
Method	HTTP method	`GET`, `HEAD`
Expected Status	Expected status code	`200`, `2xx`
Timeout	Maximum wait time	`30s`
Interval	Check frequency	`60s`

ICMP Endpoints

Ping-based reachability check:

Field	Description	Example
Host	Hostname or IP address	`server.example.com`
Timeout	Maximum wait time	`10s`
Interval	Check frequency	`30s`

ICMP in the operator

ICMP checks require the CAP_NET_RAW capability in the operator container. HTTP checks are preferred.

Multi-Region Monitoring

For global monitoring, multiple operators can act as monitoring agents:

graph TB
    subgraph eu[EU Cluster]
        OP_EU[Operator EU<br/>global_monitor=true]
    end

    subgraph us[US Cluster]
        OP_US[Operator US<br/>global_monitor=true]
    end

    subgraph apac[APAC Cluster]
        OP_APAC[Operator APAC<br/>global_monitor=true]
    end

    API[Polycrate API]

    OP_EU <--> API
    OP_US <--> API
    OP_APAC <--> API

Configuration for global monitoring:

In the Polycrate API, a workspace can be configured as a Global Endpoint Monitor:

Setting	Description
`operator_global_endpoint_monitor = false`	Operator monitors only its own endpoints (default)
`operator_global_endpoint_monitor = true`	Operator additionally receives endpoints from other workspaces

Endpoint distribution modes:

Mode	Description
`auto`	Endpoints can be monitored by all global monitors
`organization_only`	Only operators in the same organization
`workspace_only`	Only the workspace's own operator (for air-gapped environments)
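
The interplay between the global-monitor setting and the distribution modes can be read as a simple eligibility rule. The Python sketch below encodes one plausible interpretation of the two tables above; the actual assignment logic lives in the Polycrate API and may differ. `Monitor`, `Endpoint`, and `may_monitor` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monitor:
    organization: str
    workspace: str
    global_monitor: bool = False   # operator_global_endpoint_monitor


@dataclass(frozen=True)
class Endpoint:
    organization: str
    workspace: str
    mode: str = "auto"             # auto | organization_only | workspace_only


def may_monitor(monitor: Monitor, endpoint: Endpoint) -> bool:
    """Return True if this operator may be assigned the endpoint (assumed semantics)."""
    own = (monitor.organization, monitor.workspace) == (endpoint.organization, endpoint.workspace)
    if own:
        return True                # an operator always checks its own endpoints
    if endpoint.mode == "workspace_only" or not monitor.global_monitor:
        return False               # foreign endpoints require a global monitor
    if endpoint.mode == "organization_only":
        return monitor.organization == endpoint.organization
    return endpoint.mode == "auto"


eu = Monitor("my-org", "eu-workspace", global_monitor=True)
other_org = Monitor("other-org", "us-workspace", global_monitor=True)
shop = Endpoint("my-org", "shop-workspace", mode="organization_only")

print(may_monitor(eu, shop), may_monitor(other_org, shop))
```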

SSL Monitoring

HTTP endpoints that use HTTPS automatically receive SSL monitoring:

Check	Alerts when
Validity	Certificate has expired
Expiry warning	Fewer than 30 days until expiry
Chain validation	Certificate chain is invalid
Hostname match	Hostname does not match
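
The validity and expiry-warning checks can be reproduced from the `polycrate_io_endpoint_certificate_expiry_timestamp` metric. The Python sketch below classifies a certificate by its expiry timestamp using the 30-day window from the table; the state names and the `certificate_state` function are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

WARNING_WINDOW = timedelta(days=30)   # threshold from the table above


def certificate_state(expiry_ts: float, now: datetime) -> str:
    """Classify a certificate by its expiry time (Unix timestamp, UTC)."""
    expiry = datetime.fromtimestamp(expiry_ts, tz=timezone.utc)
    if expiry <= now:
        return "expired"
    if expiry - now < WARNING_WINDOW:
        return "expiring_soon"
    return "ok"


# Fixed reference time so the example is reproducible.
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
expiry_ts = datetime(2025, 6, 15, tzinfo=timezone.utc).timestamp()
print(certificate_state(expiry_ts, now))
```

Passing `now` explicitly keeps the function deterministic; in a live check it would be `datetime.now(timezone.utc)`.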

Standalone Agent (Non-Kubernetes)

For environments without Kubernetes, the CLI agent remains available:

polycrate api agent --agent-token "your-agent-token"

The CLI agent offers the same functionality but persists check results locally instead of in Kubernetes CRs.

Recommendation

For Kubernetes environments, the operator-based approach is recommended because it requires no separate installation and provides automatic Ingress discovery.

Best Practices

Health Endpoints

Create dedicated health endpoints that verify the dependencies your service needs:

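# Django REST framework example
from rest_framework.decorators import api_view
from rest_framework.response import Response


@api_view(['GET'])
def health_check(request):
    # check_database, check_redis and check_external_api are application-specific
    # helpers (not shown here) expected to return True when the dependency is healthy.
    checks = {
        'database': check_database(),
        'cache': check_redis(),
        'external_api': check_external_api()
    }

    # Return 503 if any dependency is unhealthy so the check does not match the expected status
    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503

    return Response(checks, status=status_code)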

Interval Recommendations

Endpoint type	Recommended interval
Critical APIs	30s
Standard services	60s
Internal services	120s
Batch jobs	300s

Multi-Region Strategy

Run at least two operators as global monitors in different regions
Alert only after multiple failures (avoids false positives)
Define regional latency baselines
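
The second recommendation can be sketched in Python as two small guards: one that fires only after several consecutive failures from a single monitor, and one that requires several regions to agree. The threshold and quorum values are illustrative assumptions, and `FailureGate` and `regions_agree_down` are hypothetical names, not part of Polycrate.

```python
class FailureGate:
    """Signal an alert only after `threshold` consecutive failed checks."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def observe(self, success: bool) -> bool:
        # A single successful check resets the streak.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        return self.consecutive_failures >= self.threshold


def regions_agree_down(latest: dict[str, bool], quorum: int = 2) -> bool:
    """True when at least `quorum` regions report the endpoint as down."""
    return sum(1 for up in latest.values() if not up) >= quorum


gate = FailureGate(threshold=3)
alerts = [gate.observe(ok) for ok in (False, False, True, False, False, False)]
print(alerts)
print(regions_agree_down({"eu": False, "us": True, "apac": False}))
```

A single failed check, or a failure seen from only one region, does not trigger an alert in this sketch, which is the behaviour the recommendation aims for.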