The GPUs on the bus go round and round

Background

  • They are the GeForce NOW folks
  • Large fleet of clusters all over the world (60,000+ GPUs)
  • They use KubeVirt to pass through GPUs (via the vfio driver) or vGPUs
  • Devices fail from time to time
  • Sometimes failures need restarts

Failure discovery

  • Goal: Maintain capacity
  • Failure reasons: Overheating, insufficient power, driver issues, hardware faults, …
  • Problem: Failures were only detected indirectly, through decreasing capacity or not being able to switch drivers
  • Fix: First detect the failure, then remediate
    • GPU problem detector as part of their internal device plugin
    • Node Problem Detector -> triggers remediation through maintenance (see the check sketch below)
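
A minimal sketch of how such a check could plug into Node Problem Detector's custom plugin monitor, which treats exit code 0 as healthy and 1 as a problem found. This is not their internal detector; the nvidia-smi query, the expected-gpus flag, and the two-GPU default are illustrative assumptions.

    // gpu_check.go - hypothetical NPD custom-plugin check (exit 0 = OK, 1 = problem).
    package main

    import (
        "flag"
        "fmt"
        "os"
        "os/exec"
        "strings"
    )

    func main() {
        expected := flag.Int("expected-gpus", 2, "GPUs this node should expose")
        flag.Parse()

        // Ask the driver for its visible GPUs; a hung or unloaded driver makes this fail.
        out, err := exec.Command("nvidia-smi",
            "--query-gpu=index,temperature.gpu", "--format=csv,noheader").Output()
        if err != nil {
            fmt.Println("nvidia-smi failed:", err) // stdout becomes the NPD condition message
            os.Exit(1)
        }

        // Count the GPUs the driver can still see.
        visible := 0
        for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
            if strings.TrimSpace(line) != "" {
                visible++
            }
        }

        if visible < *expected {
            fmt.Printf("only %d of %d GPUs visible\n", visible, *expected)
            os.Exit(1)
        }
        fmt.Printf("%d GPUs healthy\n", visible)
    }

Node Problem Detector can then surface the failure as a node condition, which the maintenance tooling uses to trigger remediation.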

Remediation approaches

  • Reboot: Works every time, but has workload-related downsides -> a legitimate solution, although draining can take very long
  • Detection of remediation loops -> too many reboots indicate that something is not quite right
  • Optimized drain: Prioritize draining nodes with failed devices over other maintenance
  • The current workflow is: Reboot (automated) -> Power cycle (automated) -> Rebuild node (automated) -> Manual intervention / RMA (sketched below)
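
A sketch of how the escalation ladder and the reboot-loop detection could be encoded; the step names mirror the workflow above, but the types, the per-step retry threshold, and the attempts bookkeeping are hypothetical rather than their actual policy.

    // remediation.go - sketch of the escalation ladder:
    // reboot -> power cycle -> rebuild node -> manual intervention / RMA.
    package main

    import "fmt"

    type Step string

    const (
        Reboot     Step = "reboot"
        PowerCycle Step = "power-cycle"
        Rebuild    Step = "rebuild-node"
        Manual     Step = "manual-intervention" // RMA / hands-on debugging
    )

    var ladder = []Step{Reboot, PowerCycle, Rebuild, Manual}

    // nextStep picks the remediation action for a node given how many times each
    // automated step was already tried. Repeating a step beyond maxRetries is
    // treated as a remediation loop and escalates to the next rung.
    func nextStep(attempts map[Step]int, maxRetries int) Step {
        for _, s := range ladder {
            if attempts[s] < maxRetries {
                return s
            }
        }
        return Manual // all automated steps exhausted; keep it with humans
    }

    func main() {
        attempts := map[Step]int{}
        for round := 1; round <= 7; round++ {
            s := nextStep(attempts, 2) // hypothetical threshold: 2 tries per step
            attempts[s]++
            fmt.Printf("remediation round %d: %s\n", round, s)
        }
    }

The point of the ladder is that a node caught in a reboot loop does not stay on the reboot rung forever: once the retry budget is spent, it moves toward rebuild and, ultimately, manual intervention / RMA.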

Prevention

Problems should not affect workloads

  • Health checks with alerts
  • Firmware & driver updates
  • Thermal & power management

Future Challenges

  • What if a high-density node with 8 GPUs has one failure?
  • What is an acceptable ratio of working to broken GPUs per node?
  • If a problematic node has to be rebooted every couple of days, should the scheduler avoid that node?

Q&A

  • Are there any plans to open-source the GPU problem detection? We could certainly do it, but it is not on the roadmap right now
  • Are the failure rates representative, and what counts as a failure?
    • A failure means not being able to run a workload on a node (could be a hardware or driver failure)
    • The failure rate is 0.6%, but the affected capacity is 1.2% (with 2 GPUs per node)
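    • Rough arithmetic: with 2 GPUs per node, one failed GPU sidelines the whole node, so affected capacity ≈ 2 × 0.6% = 1.2%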