Scaling GPU Clusters without melting down

Baseline

Scenario: High number of LIST calls against large Secrets
Problem: The API server gets OOM-killed because of its cache
Fix: API Priority and Fairness (only allow two concurrent LIST calls, queue the rest)
Result: Fewer OOM crashes
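
A minimal sketch of the effect of that fix, not of API Priority and Fairness itself: in a real cluster the limit is configured through FlowSchema and PriorityLevelConfiguration objects. The semaphore, the function names and the sleep standing in for a large LIST response are all assumptions made for illustration.

```python
import asyncio

MAX_CONCURRENT_LISTS = 2  # limit from the notes: two LIST calls in flight at once


async def handle_list(request_id, limiter, stats):
    # Requests beyond the limit wait here (queued) instead of being served at once.
    async with limiter:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # stand-in for building a large LIST response
        stats["in_flight"] -= 1
        stats["served"].append(request_id)


async def simulate(num_requests=10):
    limiter = asyncio.Semaphore(MAX_CONCURRENT_LISTS)
    stats = {"in_flight": 0, "peak": 0, "served": []}
    # All requests arrive at the same time, as in a burst of LIST calls.
    await asyncio.gather(
        *(handle_list(i, limiter, stats) for i in range(num_requests))
    )
    return stats


def run_simulation(num_requests=10):
    return asyncio.run(simulate(num_requests))


if __name__ == "__main__":
    result = run_simulation()
    print("peak concurrent LIST calls:", result["peak"])
    print("requests served:", len(result["served"]))
```

Every request is still served, but memory use is bounded by two responses being built at a time rather than by the size of the burst.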

Scenario: Skew in memory utilization across the API server pods
Problem: If a pod with high utilization gets hit with a LIST, that API server will OOM -> The LB redirects to the other 2 -> Those OOM too
Observation: The LB in front of the API server pods also shows some skew -> Explains the memory skew
Root cause: The LB holds long-lived TCP connections to the servers and balances based on connections, not requests
Idea: Switch up the LB configuration -> Not quite the right angle
Fix: --goaway-chance flag on the API server - an HTTP/2 GOAWAY frame gets sent on a random fraction of requests -> Tears down the connection gracefully, the client reconnects through the LB
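
A toy simulation of why connection-based balancing produces request skew and why random GOAWAYs even it out. Everything here is an assumption for illustration: three servers, 30 clients, every third client being "heavy", round-robin placement of the initial connections, and a random server on reconnect. The value 0.02 is the maximum the flag accepts; the flag's help text suggests 0.001 as a starting point.

```python
import random

NUM_SERVERS = 3


def simulate(goaway_chance, ticks=2000, num_clients=30, seed=0):
    rng = random.Random(seed)
    # Each client holds one long-lived connection; the LB spreads
    # connections round-robin, so every server gets 10 connections.
    connection = [client % NUM_SERVERS for client in range(num_clients)]
    # Assumed workload: every third client is heavy (10 requests per tick).
    # With round-robin placement all heavy clients start on server 0.
    rate = [10 if client % NUM_SERVERS == 0 else 1 for client in range(num_clients)]
    served = [0] * NUM_SERVERS
    for _tick in range(ticks):
        for client in range(num_clients):
            for _request in range(rate[client]):
                served[connection[client]] += 1
                # GOAWAY: the connection is closed gracefully and the client
                # reconnects through the LB, landing on a random server.
                if rng.random() < goaway_chance:
                    connection[client] = rng.randrange(NUM_SERVERS)
    return served


def skew(served):
    """Busiest server's load relative to the mean (1.0 means perfectly even)."""
    return max(served) / (sum(served) / len(served))


if __name__ == "__main__":
    print("skew without GOAWAY:", skew(simulate(goaway_chance=0.0)))
    print("skew with GOAWAY:   ", skew(simulate(goaway_chance=0.02)))
```

Without GOAWAY the connection counts are perfectly even, yet server 0 handles 2.5 times the mean request load. With GOAWAY enabled the skew should come out close to 1, because heavy clients keep getting moved between servers.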

Currently:
- CP: API server, controller manager, scheduler and topology-aware scheduler
- Worker: Device plugin, NFD topology updater
Future:
- CP: API server, controller manager, scheduler
- Worker: Device plugin

Tool: KWOK (Kubernetes WithOut Kubelet) - used to simulate GPU workloads
Env: Kubernetes 1.32, scaling from 0 to 4000 workloads
Metrics:
- Scheduling latency: The topology-aware setup had much higher latency
- Scheduler memory utilization: 30% of memory saved with DRA (Dynamic Resource Allocation)
- API server memory: Another 20% of memory saved
Result: They are confident that DRA will be stable and will even reduce memory and CPU utilization
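
For reference, a hedged sketch of how simulated GPU nodes for a KWOK test like the one above could be generated. The annotation `kwok.x-k8s.io/node: fake` follows the example node in the KWOK documentation. The node names, resource sizes and the device-plugin-style `nvidia.com/gpu` resource are assumptions and not the speakers' actual setup.

```python
import json


def fake_gpu_node(index, gpus=8):
    """Build a Node manifest that KWOK can manage as a simulated node."""
    # Hypothetical sizes; GPUs are advertised as an extended resource here.
    resources = {
        "cpu": "64",
        "memory": "512Gi",
        "pods": "110",
        "nvidia.com/gpu": str(gpus),
    }
    return {
        "apiVersion": "v1",
        "kind": "Node",
        "metadata": {
            "name": f"kwok-gpu-node-{index}",  # hypothetical naming scheme
            "annotations": {"kwok.x-k8s.io/node": "fake"},
            "labels": {"type": "kwok"},
        },
        "status": {"allocatable": resources, "capacity": dict(resources)},
    }


def node_list(count):
    """Wrap the nodes in a v1 List so they can be applied in one go."""
    items = [fake_gpu_node(i) for i in range(count)]
    return {"apiVersion": "v1", "kind": "List", "items": items}


if __name__ == "__main__":
    # The JSON output could be piped into `kubectl apply -f -`.
    print(json.dumps(node_list(3), indent=2))
```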