Scaling GPU Clusters without melting down
Slides
Baseline
- We need more and more GPUs -> the control plane needs to keep track of more objects
- Goal: scale workers without scaling the control plane
Current problems
Secret list calls go up and the control plane goes down
- Scenario: a high number of list calls against large secrets
- Problem: the apiserver OOMs because of its in-memory cache
- Fix: API Priority & Fairness (only allow two concurrent list calls, queue the rest); see the manifest sketch after this list
- Result: decreased number of OOM crashes
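A minimal sketch of what such an APF configuration could look like. The object names and queue numbers are illustrative, not the speakers' exact config, and note that `nominalConcurrencyShares` is a relative share of the apiserver's concurrency budget, not a literal cap of two:

```yaml
# Hedged sketch: route list calls on secrets into a tightly limited priority level.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: limit-secret-lists        # hypothetical name
spec:
  type: Limited
  limited:
    # Small relative share ~= "only a couple of concurrent list calls".
    nominalConcurrencyShares: 2
    limitResponse:
      type: Queue                 # queue excess requests instead of rejecting them
      queuing:
        queues: 8
        handSize: 2
        queueLengthLimit: 50
---
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: secret-list-calls         # hypothetical name
spec:
  priorityLevelConfiguration:
    name: limit-secret-lists
  matchingPrecedence: 500
  distinguisherMethod:
    type: ByUser
  rules:
    - resourceRules:
        - verbs: ["list"]
          apiGroups: [""]
          resources: ["secrets"]
          namespaces: ["*"]
      subjects:
        - kind: Group
          group:
            name: system:authenticated
```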
High memory usage until we restart the apiserver
- Scenario: the apiserver frees up to 40% of its memory utilization when restarted
- Main suspect: garbage collection
- Idea: tune GOGC (the Go runtime's garbage-collection target, an env var) -> they lowered the default of 100 to 50; see the manifest sketch after this list
- Result: lower memory utilization, and it no longer grows over time
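GOGC is the standard Go runtime variable and 50 is the value from the talk; the static-pod manifest shape and path below are assumptions (kubeadm-style layout):

```yaml
# Sketch: excerpt of a static-pod manifest such as
# /etc/kubernetes/manifests/kube-apiserver.yaml (path assumed).
spec:
  containers:
    - name: kube-apiserver
      env:
        - name: GOGC    # Go garbage-collection target percentage
          value: "50"   # default is 100; 50 makes GC run roughly twice as
                        # often, trading some CPU for a smaller steady-state heap
```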
Large skew in memory utilization
- Scenario: skew in memory utilization across the apiserver pods
- Problem: if a pod with high utilization gets hit with a list call, that apiserver OOMs -> the LB redirects traffic to the other two -> those OOM as well
- Observation: the LB in front of the apiserver pods also shows some skew -> explains the memory skew
- Root cause: the LB holds long-lived TCP connections to the servers and balances on connections, not requests
- Idea: change the LB configuration -> not quite the right angle
- Fix: the --goaway-chance flag on the apiserver -> a random HTTP/2 GOAWAY message gets sent, tearing the connection down gracefully so the client reconnects and the load re-balances; see the flag sketch after this list
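--goaway-chance is a real kube-apiserver flag (a probability, allowed range 0 to 0.02); the concrete value below is illustrative:

```yaml
# Sketch: kube-apiserver flags in a static-pod manifest. With a small
# goaway-chance, the server sends an HTTP/2 GOAWAY on a random ~0.1% of
# requests, so clients periodically reconnect and the LB can re-balance.
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --goaway-chance=0.001   # illustrative value
```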
Architectural mistakes
- Large number of secrets per workload -> list and encode/decode overhead; see the sketch after this list
- No client-side caching -> too many list calls
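To make the first mistake concrete, a hedged sketch with made-up names: every extra secret a pod references is another object the control plane must list, encode, and decode on each call:

```yaml
# Anti-pattern sketch: one pod pulling in many small secrets.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker                                # hypothetical
spec:
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest  # hypothetical image
      envFrom:
        - secretRef: { name: db-creds }           # many per-concern secrets...
        - secretRef: { name: s3-creds }
        - secretRef: { name: wandb-creds }
        - secretRef: { name: registry-creds }
      # Preferable: consolidate into a single secret per workload, e.g.:
      # envFrom:
      #   - secretRef: { name: trainer-creds }
```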
Preview
- There are a number of SIG API Machinery improvements planned
The future
- The switch from NUMA-aware GPU device plugins to DRA (Dynamic Resource Allocation)
- DRA is powerful enough to get rid of the custom NUMA tooling; see the claim sketch below
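A minimal sketch of the DRA shape (the resource.k8s.io group is at v1beta1 in Kubernetes 1.32); the device class and all names are placeholders a real DRA driver would define:

```yaml
# Sketch: request one GPU via DRA instead of a device plugin.
# "gpu.example.com" stands in for a DeviceClass installed by a DRA driver.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: one-gpu                   # hypothetical
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: gpu.example.com
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-pod               # hypothetical
spec:
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest   # hypothetical image
      resources:
        claims:
          - name: gpu             # wire the claim into this container
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: one-gpu
```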
The stack
- Currently:
  - Control plane: apiserver, controller manager, scheduler, and a topology-aware scheduler
  - Worker: device plugin, NFD topology updater
- Future:
  - Control plane: apiserver, controller manager, scheduler
  - Worker: device plugin
Testing scaling
- Tool: KWOK (Kubernetes WithOut Kubelet), used to simulate GPU nodes and workloads; see the fake-node sketch after this list
- Env: Kubernetes 1.32, scaling from 0 to 4,000 workloads
- Metrics:
  - Scheduling latency: the topology-aware setup was far more latency-affected
  - Scheduler memory utilization: 30% of memory saved with DRA
  - API server memory: another 20% saved
- Result: they are confident that DRA will be stable and even save memory and CPU utilization
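A hedged sketch of a fake GPU node as KWOK manages it; the resource sizes and GPU resource name are placeholders:

```yaml
# Sketch: a fake node managed by KWOK. Applying many of these lets you scale
# scheduling tests without real machines; the kwok controller maintains
# heartbeats and pod lifecycle for anything scheduled here.
apiVersion: v1
kind: Node
metadata:
  name: kwok-gpu-node-0            # hypothetical
  annotations:
    kwok.x-k8s.io/node: fake       # tells kwok to manage this node
  labels:
    type: kwok
spec:
  taints:
    - key: kwok.x-k8s.io/node      # keep real workloads off fake nodes
      value: fake
      effect: NoSchedule
status:
  allocatable:
    cpu: "32"
    memory: 256Gi
    pods: "110"
    nvidia.com/gpu: "8"            # placeholder extended resource
  capacity:
    cpu: "32"
    memory: 256Gi
    pods: "110"
    nvidia.com/gpu: "8"
```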