

In today’s cloud-native world, enterprises need AI applications that are efficient, performant and built to scale. But meeting those expectations often comes down to one thing: infrastructure. AI inference may be a specialized workload, but at its core, it’s still a compute job — so what tools can help scale it reliably across an enterprise?
Google’s Poonam Lamba and Eddie Villalba discuss GKE for AI inference with theCUBE.
That’s the question Google Cloud’s Poonam Lamba and Eddie Villalba explored in a conversation with theCUBE, SiliconANGLE Media’s livestreaming studio, diving into how Google Kubernetes Engine is purpose-built to handle the demands of modern inference. With runtime flexibility, rich libraries and a configuration style familiar to web developers, Kubernetes is evolving from container orchestrator to AI operations backbone.
“[Kubernetes] solved a lot of the problems that organizations were faced with at the time,” said Eddie Villalba (pictured, middle), outbound product manager at Google Cloud. “Now, if you think about AI … AI is just another workload. It is a workload, but specialized. Then there’s a couple of different sides of AI, but we want to talk about serving inferencing, where the end users actually use the product.”
Villalba and Poonam Lamba (left), senior product manager of GKE AI inference and stateful workloads at Google Cloud, spoke with theCUBE’s Savannah Peterson (right) for the “Google Cloud: Passport to Containers” interview series, during an exclusive broadcast on theCUBE. They discussed GKE as a powerful ally in the enterprise-grade deployment of gen AI. (* Disclosure below.)
While Kubernetes was initially seen as a general-purpose container orchestration tool, it’s now firmly entrenched as a foundational layer for AI inference at scale. In the same way that a student absorbs information during the semester and applies it in subsequent exams, inference applies a trained AI model to generate outputs from new data, and Kubernetes’ toolset is built to run such operations at scale.
“Let’s say you have trained a model, now you will take that model, the configuration that you need to run that model — the libraries, the runtime environment, like TensorFlow or PyTorch or JAX — you will package all of these things into a container, and now this becomes a portable unit that you will take from your testing to production,” Lamba said.
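In practice, that portable unit is an image containing the serving code, the framework runtime and the trained weights. Here is a minimal sketch of what such serving code might look like; the model file, request schema and port are illustrative assumptions rather than anything specific to GKE:

```python
# Minimal sketch of inference-serving code that would be packaged into a
# container image together with its runtime (here PyTorch). The model path,
# request schema and port are illustrative assumptions, not GKE specifics.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

import torch

# Load the trained model once at startup; "model.pt" is a hypothetical
# TorchScript artifact copied into the image at build time.
model = torch.jit.load("model.pt")
model.eval()

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect a JSON body like {"inputs": [[...]]}
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        with torch.no_grad():
            outputs = model(torch.tensor(body["inputs"])).tolist()
        payload = json.dumps({"outputs": outputs}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # The container's entrypoint would start this server; the resulting image
    # is the portable unit that moves unchanged from testing to production.
    HTTPServer(("0.0.0.0", 8080), InferenceHandler).serve_forever()
```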
GKE stands out for its ability to handle complex and bursty workloads such as AI inference. It does all that with the versatility of a fine dining kitchen — capable of producing simple dishes or complex meals with ease, according to Villalba. Just as chefs need access to specialized tools, AI inference demands access to specialized accelerators such as GPUs and TPUs.
“If you think about what GKE is, it’s a very complicated, very organized kitchen that has all the equipment you need,” he said. “But when I need to create that Beef Wellington, I can. When I need to create just a bunch of salad, I can. When I need to just serve web services, it’s easy; GKE was already built for that. Now, with all those primitives in the APIs … the accelerator is just another resource, and it’s another API. Kubernetes was always good at assigning resources to your compute, memory and CPU. Now, this is just another resource that we optimize for that workload.”
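Villalba’s point that the accelerator is “just another resource” shows up directly in the pod specification. As a hedged sketch using the official Kubernetes Python client, a GPU is requested the same way CPU and memory are; the container image and accelerator type below are placeholders:

```python
# Sketch: declaring an accelerator as "just another resource" with the
# official Kubernetes Python client. The image name and accelerator type are
# assumptions; on GKE, nvidia.com/gpu is exposed by GPU node pools.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="inference-server"),
    spec=client.V1PodSpec(
        # Hypothetical node selector pinning the pod to T4 GPU nodes.
        node_selector={"cloud.google.com/gke-accelerator": "nvidia-tesla-t4"},
        containers=[
            client.V1Container(
                name="server",
                image="us-docker.pkg.dev/example/inference:latest",  # placeholder
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "2", "memory": "8Gi"},
                    # The GPU is requested exactly like CPU and memory.
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```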
Traditional load balancers weren’t designed for AI. That’s why Google created the GKE Inference Gateway, a model-aware, accelerator-aware load balancer tailored specifically for inference. Unlike conventional stateless routing, the Inference Gateway considers real-time signals such as model versioning, request priority, KV cache utilization and queuing depth, according to Lamba.
“What it does is when you are sending requests to Inference Gateway, you can specify the model name,” she said. “If you have different models or you have multiple versions of the same model, you can specify all of that in the request body. You can also specify if the incoming request is critical, standard or something that you can drop. So, depending on all that data, Inference Gateway can decide to route your request, but there’s more. It is also collecting real-time metrics from the KV-Cache utilization and the queuing that is happening at the model server level.”
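From the client’s side, that routing behavior is driven by the request body itself. The following is a rough sketch of a call to an Inference Gateway endpoint, assuming an OpenAI-style completions schema of the kind model servers such as vLLM expose; the gateway URL and model name are hypothetical:

```python
# Sketch of a client call routed through an inference gateway. The gateway
# URL and model name are placeholders; the body follows the common
# OpenAI-style completions schema exposed by model servers such as vLLM.
import requests

GATEWAY_URL = "http://inference-gateway.example.internal/v1/completions"  # placeholder

response = requests.post(
    GATEWAY_URL,
    json={
        # The model name in the request body is what a model-aware gateway
        # can use to pick the backend, e.g. a specific version or variant.
        "model": "gemma-2b-finetune-v2",
        "prompt": "Summarize the quarterly report in three bullet points.",
        "max_tokens": 256,
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```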
To further address the unique needs of AI inference, GKE has introduced custom compute classes and the Dynamic Workload Scheduler. These features empower customers to define their desired performance and cost profiles, Villalba added.
“When I’m serving up something, I’m hitting an end user, and I need to make their experience happy,” he said. “I need to make sure that the resources needed are available at all times. Custom compute classes are a way for our customers to get the capacity they need when they need it in a priority order that they decide, but also sometimes in the most equitable fashion.”
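Custom compute classes express that priority order declaratively. The sketch below creates one through the Kubernetes Python client; the ComputeClass group, version, plural and field names are assumptions based on the documented CRD and should be checked against the current GKE schema:

```python
# Rough sketch of defining a GKE custom compute class via the Kubernetes
# Python client. The resource group/version, plural and field names are
# assumptions about the ComputeClass CRD; treat them as illustrative.
from kubernetes import client, config

config.load_kube_config()

compute_class = {
    "apiVersion": "cloud.google.com/v1",
    "kind": "ComputeClass",
    "metadata": {"name": "inference-serving"},
    "spec": {
        # Priority-ordered capacity preferences: try GPU-ready G2 machines
        # first, then fall back to a broader machine family.
        "priorities": [
            {"machineFamily": "g2", "spot": False},
            {"machineFamily": "n2"},
        ],
    },
}

client.CustomObjectsApi().create_cluster_custom_object(
    group="cloud.google.com",
    version="v1",
    plural="computeclasses",
    body=compute_class,
)
```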
Here’s the complete video interview, part of SiliconANGLE’s and theCUBE Research’s coverage of the “Google Cloud: Passport to Containers” interview series:
(* Disclosure: TheCUBE is a paid media partner for the “Google Cloud: Passport to Containers” series. Neither Google Cloud, the sponsor of theCUBE’s event coverage, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)