How to get visibility and isolation for AI in Kubernetes

The perennial problem with Kubernetes metrics is having so many data points available that it’s hard to filter out the noise.

Now, with AI, and especially agents, you also have to balance speed and security against observability – and the ability to make more accurate utilisation and provisioning decisions.

GPUs and memory are expensive and frequently in short supply. They’re also often used inefficiently. Memory utilisation in typical Kubernetes clusters is as low as 20%, according to a recent report by Cast AI, and overprovisioning to avoid throttling and out-of-memory evictions is almost routine, at 79%.

Provisioned for peak loads and inefficiently scheduled, enterprise AI clusters run at a shockingly low average of 5% active GPU utilisation; the rest of the time they’re waiting for something.

Google’s Goodput metric counts only the tokens that arrive within an enterprise’s latency budget – so a cluster can score poorly on that measure even when it is working away busily. There are techniques to improve the real AI throughput of a cluster, but having the right metrics dictates which to use.
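To make the gap between a busy cluster and a useful one concrete, here is a minimal Python sketch. It is not Google’s formula: it simply assumes that a request’s tokens count towards goodput only if the whole request finished within the latency budget. The Request type, the sample figures and the two-second budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tokens: int       # tokens generated for this request (hypothetical data)
    latency_s: float  # end-to-end latency seen by the caller, in seconds


def throughput_and_goodput(requests, window_s, latency_budget_s):
    """Return (raw tokens/s, in-budget tokens/s) over one measurement window."""
    total = sum(r.tokens for r in requests)
    # Assumption: tokens only count if the request met the latency budget.
    in_budget = sum(r.tokens for r in requests if r.latency_s <= latency_budget_s)
    return total / window_s, in_budget / window_s


# Four illustrative requests observed during a 10-second window.
sample = [
    Request(tokens=500, latency_s=1.2),
    Request(tokens=800, latency_s=4.5),
    Request(tokens=300, latency_s=0.9),
    Request(tokens=400, latency_s=6.0),
]

raw, good = throughput_and_goodput(sample, window_s=10.0, latency_budget_s=2.0)
print(f"raw throughput: {raw:.0f} tokens/s, goodput: {good:.0f} tokens/s")
```

In this made-up window the cluster produces 200 tokens a second, but only 80 of them arrive in time to be useful, which is the kind of gap a utilisation graph alone would not show.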

Know about NUMA

AI runs on Kubernetes. According to the CNCF’s most recent annual report, 66% of orgs hosting generative AI models use Kubernetes to manage part or all of their inference workloads.

But as a platform, Kubernetes is still evolving features to support the scale of AI workloads alongside more granular control to run them efficiently and securely.

Associating zones of memory, and even PCIe I/O devices, with specific CPUs is a key part of how x86 systems have scaled performance since clock speeds stopped increasing. The downside of this Non-Uniform Memory Access (NUMA) design is that memory not local to a CPU is significantly slower.

That’s an extra hit when a GPU transfers data to or from RAM. If the system is not using memory on the same PCIe root as the GPU, then the transfer has higher latency or lower bandwidth because it has to cross the interconnect bus that joins all the memory zones rather than taking the direct local route.
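One way to see whether a workload is on the local route is to compare the NUMA node Linux reports for the GPU’s PCI device with the CPUs the process is allowed to run on. The sketch below reads the standard sysfs files for that; it assumes a Linux host, checks CPU affinity only (not where memory is actually allocated), and the helper names are illustrative rather than part of any tool mentioned here.

```python
import os
from pathlib import Path

SYSFS = Path("/sys")  # Linux only


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-9' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def device_numa_node(pci_address, sysfs=SYSFS):
    """NUMA node reported for a PCI device; -1 means the kernel reports none."""
    path = sysfs / "bus/pci/devices" / pci_address / "numa_node"
    return int(path.read_text())


def node_cpus(node, sysfs=SYSFS):
    """CPUs that are local to the given NUMA node."""
    path = sysfs / f"devices/system/node/node{node}/cpulist"
    return parse_cpulist(path.read_text())


def is_aligned(allowed_cpus, local_cpus):
    """True when every CPU the process may run on is local to the device's node."""
    return bool(allowed_cpus) and set(allowed_cpus) <= set(local_cpus)


def report(pci_address):
    """Describe whether the current process is pinned to the GPU's NUMA node."""
    node = device_numa_node(pci_address)
    if node < 0:
        return f"{pci_address}: no NUMA affinity reported"
    aligned = is_aligned(os.sched_getaffinity(0), node_cpus(node))
    return f"{pci_address}: node {node}, CPU affinity aligned: {aligned}"
```

Calling report() with a GPU’s PCI address (for example one listed by lspci) from inside the workload shows whether the process can be scheduled onto CPUs on the far side of the interconnect.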

“NUMA is a big deal for GPUs,” explains Alex Zenla, founder and CTO of Edera, a vendor-agnostic control plane for GPU infrastructure. A lot of the current interest in NUMA is for AI, where the design can help eke out every ounce of performance from GPUs.

“In some of our benchmarks we see 25 to 30% performance improvements when you have NUMA alignment, and it can be even better depending on the memory situation,” Zenla tells The Stack.
