Loading #LLMs into NVIDIA GPUs on Google Cloud just got simpler. NVIDIA Run:ai Model Streamer now supports GKE with Workload Identity for Google Cloud Storage—unlocking faster and more secure AI/ML inference. Learn more below. ⬇️
⚡ Native GKE Integration is Live: Load 100+ GB of model weights from GCS in seconds ⚡

Run:ai Model Streamer now has native Google Cloud Storage support on GKE with Workload Identity - something we've been working on with the Google team for months.

Why does this matter? Cold starts are brutal for LLM inference. You spin up a GPU, and it just... sits there waiting for a 70B+ model to load. That's expensive idle time and very slow autoscaling.

The Model Streamer changes that by streaming tensors directly from object storage to GPU memory concurrently. For multi-GPU setups with model parallelism, it coordinates loading over NVLink: each process fetches its share of the weights and distributes it to the others.

The result: loading the 141 GB Llama 3.3 70B model goes from minutes to seconds with a single flag: --load-format=runai_streamer

Really proud of the results we achieved here. Huge shoutout to everyone who made it happen - Peter Schuurman, Noa Neria, Omer Dayan, Brian Kaufman, Nishtha Jain, Jason Messer, Ronen Dar.

(Blog and Quickstart Guide link in the comments 👇)
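P.S. To give a feel for what this looks like in practice, here's a rough sketch of a vLLM launch command. The gs:// bucket path and the tensor-parallel setting are placeholders, not the exact syntax - check the Quickstart Guide in the comments for the real thing:

# Stream weights straight from a GCS bucket into GPU memory
vllm serve gs://your-bucket/llama-3.3-70b-instruct \
  --load-format runai_streamer \
  --tensor-parallel-size 8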