The Art and Science of Deploying LLMs in Production
Abstract / TL;DR
Behind every smooth LLM API call or conversation lies an intricate ballet of computational optimization. Large Language Models demand extraordinary resources because of their sequential, autoregressive generation process, creating unique challenges when deploying them at scale.
This blog unveils the dual personality of Transformer inference: the parallelizable "prefill" phase, which is compute-intensive, and the sequential "decode" phase, where memory bandwidth becomes the limiting factor. At the heart of this process is the Key-Value cache, essential for performance yet paradoxically the primary bottleneck, since it grows linearly with context length.
We explore the solutions that have emerged to tame these challenges: PagedAttention's virtual-memory approach, which conquers fragmentation; continuous batching's dynamic request management, which keeps GPUs humming at peak efficiency; quantization techniques that shrink models without sacrificing quality; and speculative decoding's clever parallel token generation strategy.
The journey doesn't end with algorithmic optimization. Deploying these systems on Kubernetes introduces a new dimension of complexity: managing heterogeneous GPU resources, reconciling dynamic memory needs with static allocation limits, implementing sophisticated autoscaling based on queue metrics, designing cache-aware load balancing, ensuring safe state persistence, orchestrating zero-downtime updates, and optimizing costs through efficient resource utilization.
Success in production LLM deployment ultimately depends on understanding this intricate dance between model architecture, optimization algorithms, and infrastructure constraints.
1. The Anatomy of Large Language Model Inference
Large Language Models derive their power from the Transformer architecture and massive training datasets. Their autoregressive nature—generating text one token at a time where each new token depends on all previous tokens—creates unique performance challenges during inference.
The Two Phases of Inference
Prefill Stage:
In this initial phase, the model processes the user's input prompt by computing the Query, Key, and Value vectors for all prompt tokens simultaneously. This highly parallelizable process involves efficient matrix-matrix multiplications and is typically compute-bound. The stage produces two critical outputs: the initial Key-Value (KV) cache tensors and the logits for the first output token. For very long prompts, optimization techniques like "chunked prefill" break the computation into manageable portions.
Decode Stage:
Once prompt processing completes, the model enters a strictly sequential phase, generating one token at a time. For each step (a toy sketch follows the list below):
The model processes the most recently generated token using the existing KV cache.
New K and V vectors are appended to the cache.
The Query vector performs attention over the entire growing KV cache.
Logits are computed and the next token is sampled.
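The sketch below makes this loop concrete using a toy single-head attention layer in NumPy; the dimensions, random weights, and the shortcut of feeding the attention output back in as the next token's embedding are purely illustrative assumptions, not any serving framework's API.

```python
import numpy as np

# Toy single-head attention with random weights (illustrative only).
rng = np.random.default_rng(0)
D = 64                                         # hidden / head dimension
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))

def attend(q, K, V):
    """One query vector attending over all cached keys and values."""
    scores = K @ q / np.sqrt(D)                # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                         # (D,)

# --- Prefill: all prompt tokens processed in one batched matmul pass ---
prompt = rng.standard_normal((10, D))          # 10 prompt-token embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv    # initial KV cache
last_hidden = attend(prompt[-1] @ Wq, K_cache, V_cache)

# --- Decode: strictly sequential, one token per step ---
for step in range(5):
    x = last_hidden                            # stand-in for the newly sampled token's embedding
    K_cache = np.vstack([K_cache, x @ Wk])     # append the new K and V to the cache
    V_cache = np.vstack([V_cache, x @ Wv])
    last_hidden = attend(x @ Wq, K_cache, V_cache)  # attend over the whole growing cache
    print(f"step {step}: KV cache now holds {len(K_cache)} tokens")
```

Note how every decode step touches all cached K and V entries, which is why this phase is dominated by memory bandwidth rather than compute.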
The KV Cache: Essential yet problematic
The KV cache solves a crucial problem: it avoids recomputing the Key and Value vectors for every previous token at each generation step. However, it also becomes the primary performance bottleneck in LLM inference. This bottleneck manifests in two ways.
Memory Consumption: The KV cache grows linearly with sequence length and batch size, quickly becoming the largest consumer of GPU memory. Its size is approximately 2 × batch_size × num_layers × seq_len × hidden_dim × bytes_per_element, where the factor of 2 accounts for storing both Keys and Values (a sizing sketch follows this list).
Memory Fragmentation: Traditional memory allocation approaches lead to both internal fragmentation (unused portions of pre-allocated blocks) and external fragmentation (non-contiguous free memory). This fragmentation significantly wastes GPU memory, limiting batch sizes and sequence lengths.
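As a rough illustration of the first point, the helper below estimates KV cache size from the formula above; the configuration values are assumptions loosely modeled on a 7B-parameter, 32-layer model served in FP16, not measurements from any particular deployment.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):          # 2 bytes per element for FP16/BF16
    # Factor of 2: one Key tensor plus one Value tensor per layer.
    # num_kv_heads * head_dim equals hidden_dim for standard multi-head attention.
    return 2 * batch_size * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_elem

gb = kv_cache_bytes(batch_size=32, seq_len=4096,
                    num_layers=32, num_kv_heads=32, head_dim=128) / 1e9
print(f"~{gb:.0f} GB of KV cache")             # ~69 GB for this configuration alone
```

At 32 concurrent 4K-token sequences, the cache in this sketch dwarfs the roughly 14 GB an FP16 7B model needs for its weights.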
The KV cache therefore represents the primary driver of memory pressure during inference, directly conflicting with finite GPU memory and static limits imposed by orchestration systems. This bottleneck has motivated innovations like PagedAttention that fundamentally reimagine KV cache memory management.
2. Optimization Techniques
PagedAttention: Virtual Memory for KV Cache
PagedAttention revolutionizes memory management for LLM inference by applying operating system concepts to the Key-Value cache bottleneck. Rather than allocating contiguous memory blocks for entire sequences, it divides the KV cache into smaller, fixed-size logical blocks mapped to potentially non-contiguous physical memory blocks via a "block table." Memory is allocated dynamically, only when needed.
This approach effectively eliminates both external fragmentation (by removing the need for large contiguous spaces) and internal fragmentation (by allocating memory incrementally). The result: dramatically improved memory utilization enabling larger batch sizes and longer sequences within existing GPU constraints.
However, PagedAttention introduces complexity through modified attention kernels and user-space memory management. Its non-contiguous storage requires specialized kernels that use the block table for data access rather than simple index-based lookups, potentially creating barriers to adopting new attention implementations.
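To make the block-table idea concrete, here is a minimal sketch in plain Python (not vLLM's actual implementation): each sequence maps fixed-size logical blocks to whatever physical blocks happen to be free, so growing a sequence never requires a large contiguous region.

```python
BLOCK_SIZE = 16                       # tokens per KV block (illustrative)

class BlockAllocator:
    """Hands out physical KV-cache blocks from a free list."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV cache blocks")
        return self.free.pop()

    def release(self, blocks):        # called when a sequence finishes
        self.free.extend(blocks)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []         # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the current one fills up,
        # so internal fragmentation is bounded by one partially used block.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_idx):
        # The attention kernel consults the block table to find each token's K/V.
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

allocator = BlockAllocator(num_physical_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):                   # a 40-token sequence spans 3 blocks
    seq.append_token()
print(seq.block_table, seq.physical_slot(33))
```

The indirection in physical_slot is exactly what forces the specialized attention kernels mentioned above.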
Continuous Batching:
Continuous Batching tackles GPU underutilization caused by traditional static batching approaches. Rather than waiting for all sequences in a batch to complete (which creates "bubbles" of GPU inactivity), it maintains a dynamic batch processed iteratively:
At each iteration, the model processes all active sequences
Completed sequences are immediately removed and their resources freed
New waiting requests fill the freed slots without delay
This technique dramatically increases hardware utilization, with frameworks like vLLM reporting up to 23x throughput improvements over static batching. It also reduces median latency since incoming requests spend less time waiting in the queue.
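The toy scheduler below illustrates the iteration-level mechanics; the request tuples (id, tokens still to generate) and the MAX_BATCH limit are illustrative assumptions, not any real engine's scheduler.

```python
from collections import deque

MAX_BATCH = 4                                   # illustrative batch capacity

def continuous_batching(requests):
    waiting = deque(requests)                   # FIFO of (request_id, tokens_to_generate)
    running, step = [], 0
    while waiting or running:
        # Admit waiting requests into any freed slots before the next model step.
        while waiting and len(running) < MAX_BATCH:
            rid, remaining = waiting.popleft()
            running.append([rid, remaining])
        # One "forward pass": every running sequence emits one token.
        for req in running:
            req[1] -= 1
        finished = [rid for rid, left in running if left == 0]
        running = [req for req in running if req[1] > 0]
        step += 1
        if finished:
            print(f"step {step}: finished {finished}")
    return step

steps = continuous_batching([("a", 3), ("b", 8), ("c", 2), ("d", 5), ("e", 4)])
print(f"total iterations: {steps}")             # 8 here, versus 12 with static batching
```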
Quantization:
Quantization reduces numerical precision of model weights and sometimes activations, mapping 32-bit or 16-bit values to lower-precision formats like INT8, INT4, or FP8. This compression typically uses scaling factors and zero-points to minimize error.
Key approaches include:
Post-Training Quantization (PTQ): The most common approach for LLMs, applied after training without expensive retraining
Weight-Only Quantization: Reduces only model weights while keeping activations in higher precision (e.g., W4A16)
Activation-Aware Weight Quantization (AWQ): Identifies and protects "salient" weight channels by examining activation magnitudes rather than weight values themselves
GPTQ: Quantizes weights sequentially while using approximate second-order information (Hessian) to compensate for introduced errors
FP8 Quantization: Uses specialized 8-bit floating-point formats supported by newer NVIDIA architectures, often doubling throughput versus FP16/BF16
Quantization provides three key benefits: reduced memory footprint, decreased memory bandwidth requirements, and faster computation through specialized hardware units. However, aggressive quantization below 8 bits can degrade model accuracy, requiring careful validation.
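For intuition about the scale/zero-point mechanics mentioned earlier, here is a minimal per-tensor asymmetric INT8 round trip in NumPy; it is a sketch rather than a production kernel (real LLM quantizers such as AWQ or GPTQ work per-channel or per-group and treat outliers far more carefully).

```python
import numpy as np

def quantize_int8(w):
    qmin, qmax = -128, 127
    scale = (w.max() - w.min()) / (qmax - qmin)         # step size of the INT8 grid
    zero_point = int(np.round(qmin - w.min() / scale))  # which int8 value maps to 0.0
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.default_rng(0).standard_normal((4, 8)).astype(np.float32)
q, scale, zp = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale, zp)).max())  # small, but non-zero
```

The small reconstruction error visible here is exactly what grows at lower bit widths, which is why sub-8-bit schemes lean on the more careful calibration the methods above provide.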
Speculative Decoding:
Speculative Decoding accelerates LLM inference through a "draft-then-verify" approach. A smaller, faster model or mechanism generates candidate tokens that the large target model verifies in parallel:
Drafting: A faster mechanism generates a sequence of K candidate tokens
Verification: The target model processes the original sequence plus all drafted tokens in one parallel pass
Acceptance/Correction: Drafted tokens are sequentially compared against what the target model would have chosen; tokens are accepted up to the first mismatch, where the target model's own choice is substituted and the remaining drafts are discarded.
This approach replaces multiple sequential, memory-bound steps with a single parallel verification step, significantly reducing generation time while guaranteeing identical output to standard autoregressive decoding. Effectiveness depends on draft model latency and token acceptance rate, with research showing minimizing draft latency is often more important than maximizing its accuracy.
Unlike memory-focused optimizations like PagedAttention, Speculative Decoding tackles the decode bottleneck by reducing sequential steps and shifting workload characteristics toward being compute-bound during parallel verification.
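The sketch below shows a simplified greedy-verification variant of draft-then-verify; real systems verify all drafted positions in a single batched forward pass and usually apply rejection sampling rather than exact matching, so the toy lookup "models" and the k=4 draft length are assumptions for illustration only.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    # 1. Draft: the cheap model proposes k tokens autoregressively.
    ctx, drafted = list(prefix), []
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Verify: score prefix + drafted tokens with the target model.
    #    (Done position by position here; on a GPU this is one parallel pass.)
    ctx, accepted = list(prefix), []
    for tok in drafted:
        target_tok = target_next(ctx)
        if target_tok == tok:
            accepted.append(tok)          # 3. Accept tokens the target agrees with...
            ctx.append(tok)
        else:
            accepted.append(target_tok)   # ...and substitute the target's own token
            break                         #    at the first mismatch, dropping the rest.
    else:
        accepted.append(target_next(ctx)) # bonus token when every draft is accepted
    return accepted

# Toy "models": the target walks the alphabet; the draft is wrong every 4th token.
target = lambda ctx: chr(ord("a") + len(ctx) % 26)
draft = lambda ctx: "z" if len(ctx) % 4 == 3 else target(ctx)
print(speculative_step(list("ab"), draft, target))   # -> ['c', 'd']: two tokens per target pass
```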
3. The Kubernetes Gauntlet: Infrastructure Challenges for LLM Inference
GPU Resource Management & Heterogeneity
Effectively managing GPU resources in Kubernetes requires specialized configuration beyond default settings. The NVIDIA device plugin must be installed to make GPUs schedulable resources, and heterogeneous GPU types (L40S, A100, H100) necessitate careful node labeling and pod scheduling rules.
The performance gap between GPU generations significantly impacts deployment strategy. H100s outperform A100s with higher memory bandwidth (3.35 TB/s vs 2 TB/s), specialized Transformer Engine with FP8 support (providing 2x throughput), greater compute capacity, and faster interconnects. These advantages yield 1.5-4.6x higher inference throughput, though at increased acquisition and power costs.
Dynamic Memory Needs vs Static Limits
A fundamental tension exists between Kubernetes' static resource management model and LLM inference's dynamic memory patterns. While model weights consume fixed memory, the KV cache grows with batch size and sequence length. Setting static GPU memory limits creates a difficult choice:
Too high: Significant resource waste during typical operation
Too low: Out-of-memory errors during bursts of long sequences
PagedAttention helps by reducing memory fragmentation, but cannot eliminate the underlying mismatch between static limits and dynamic workloads.
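A rough way to reason about where to set that limit is to budget the KV cache headroom left after the weights; the numbers below (an 80 GB GPU, ~14 GB of FP16 weights, ~512 KB of KV cache per token, 10% reserved for activations and overhead) are illustrative assumptions, not measurements.

```python
def kv_token_budget(gpu_mem_gb, weight_gb, per_token_kv_kb, reserve_frac=0.10):
    # Memory left for the KV cache once weights and a safety reserve are set aside.
    usable_gb = gpu_mem_gb * (1 - reserve_frac) - weight_gb
    return int(usable_gb * 1e6 / per_token_kv_kb)          # GB -> KB, then per-token

tokens = kv_token_budget(gpu_mem_gb=80, weight_gb=14, per_token_kv_kb=512)
print(f"~{tokens:,} cached tokens shared across all concurrent sequences")
```

Whatever value the limit takes, a burst of unusually long sequences can still exceed this budget, which is why serving engines typically pair it with admission control or preemption.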
Autoscaling Complexity
Standard Kubernetes autoscaling metrics (CPU/memory utilization) fail for LLM inference workloads. More effective metrics include:
Request queue length: Directly reflects workload pressure
In-flight requests/batch size: Indicates parallelism capacity
KEDA (Kubernetes Event-driven Autoscaling) enables scaling based on these application-level metrics through various "Scalers," including Prometheus queries or HTTP traffic patterns. While scale-to-zero capabilities offer cost savings for expensive GPU resources, significant cold-start latency creates a direct tradeoff between cost and consistent performance.
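Conceptually, a queue-driven autoscaler boils down to replica math like the sketch below; the target of eight queued-plus-in-flight requests per replica is an assumption you would tune, with KEDA or the HPA performing the equivalent calculation from your exported metric.

```python
import math

def desired_replicas(queue_length, in_flight, target_per_replica=8,
                     min_replicas=0, max_replicas=16):
    pressure = queue_length + in_flight
    if pressure == 0:
        return min_replicas        # scale to zero when idle (and pay a cold start later)
    return max(min_replicas, min(max_replicas,
                                 math.ceil(pressure / target_per_replica)))

print(desired_replicas(queue_length=37, in_flight=12))   # -> 7 replicas
```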
Cache-Aware Load Balancing
Standard Kubernetes load balancing (random/round-robin) severely undermines LLM performance by destroying KV cache locality. Advanced strategies include:
Consistent hashing with prefix-based routing: Maps similar requests to the same replica
CHWBL (Consistent Hashing with Bounded Loads): Balances load while maintaining cache affinity
Gateway API Inference Extension: Enables AI-aware routing based on queue times or adapter availability
Implementation typically requires specialized ingress controllers, service meshes, or AI-aware proxies that understand the application's internal cache state.
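As a concrete illustration, the compact sketch below implements consistent hashing with bounded loads keyed on a prompt-prefix hash; it is inspired by, but not identical to, the CHWBL approach referenced above, and the load factor, virtual-node count, and pod names are illustrative assumptions.

```python
import bisect, hashlib, math
from collections import defaultdict

class PrefixRouter:
    """Consistent hashing with bounded loads over a set of model replicas."""
    def __init__(self, replicas, load_factor=1.25, vnodes=64):
        self.n, self.load_factor = len(replicas), load_factor
        self.loads = defaultdict(int)                      # in-flight requests per replica
        self.ring = sorted((self._hash(f"{r}#{i}"), r)
                           for r in replicas for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def route(self, prefix_key):
        # Bounded loads: no replica may hold more than ceil(c * (total + 1) / n) requests.
        bound = math.ceil(self.load_factor * (sum(self.loads.values()) + 1) / self.n)
        idx = bisect.bisect(self.keys, self._hash(prefix_key))
        for i in range(len(self.ring)):                    # walk the ring past overloaded pods
            replica = self.ring[(idx + i) % len(self.ring)][1]
            if self.loads[replica] < bound:
                self.loads[replica] += 1
                return replica
        raise RuntimeError("all replicas are at the load bound")

    def release(self, replica):
        self.loads[replica] -= 1                           # call when a request completes

router = PrefixRouter(["pod-a", "pod-b", "pod-c"])
first = router.route("shared-system-prompt")
router.release(first)
assert router.route("shared-system-prompt") == first      # same prefix -> same pod, cache stays warm
```

The load bound is what keeps a single wildly popular prefix from overwhelming one replica while the others sit idle.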
State Management & Deployment Challenges
The massive size of LLM artifacts (5-150GB) creates significant challenges:
State management options: Shared network volumes (NFS/cloud equivalents), pre-baked images, host path mounts, or advanced model streaming techniques
Rolling updates: LLM pods require minutes to initialize, necessitating sophisticated readiness probes that verify complete model loading and canary deployment strategies
Cost Optimization
Managing costs for GPU-intensive workloads requires multiple strategies:
Maximizing hardware utilization through batching and kernel optimizations
Implementing intelligent autoscaling (including scale-to-zero when appropriate)
Applying model optimization techniques like quantization
Implementing robust monitoring and cost allocation mechanisms
Effectively deploying LLMs on Kubernetes requires understanding this complex interplay between model architecture, optimization techniques, and infrastructure constraints. Standard practices for stateless applications must be adapted to accommodate the unique resource patterns and performance characteristics of these sophisticated AI workloads.
4. Conclusion: The Road Ahead
The deployment of Large Language Models in production environments represents a fascinating convergence of cutting-edge AI research and practical engineering challenges. As we've explored throughout this article, successfully serving these complex models requires innovations at multiple levels of the stack—from model-level optimizations like quantization and speculative decoding to infrastructure adaptations for efficient resource utilization.
What makes LLM deployment particularly interesting is how it forces us to reconsider established patterns in cloud-native architecture. The autoregressive nature of these models, with their unique memory growth patterns and computational profiles, challenges conventional approaches to resource allocation, scaling, and traffic management. Standard Kubernetes practices designed for stateless microservices simply don't account for the dynamic memory requirements and cache locality needs of LLM inference.
Looking forward, we can expect continued evolution in both model architectures and deployment techniques. Newer models may incorporate architectural changes specifically designed to ease deployment constraints, while infrastructure solutions will become increasingly "AI-aware," with specialized tooling for managing the unique requirements of these workloads. The rapid innovation in areas like speculative decoding and quantization suggests there's still significant room for improvement in inference efficiency.
For practitioners entering this field, success requires developing expertise across traditionally separate domains—understanding the intricacies of Transformer architecture and attention mechanisms while simultaneously mastering the nuances of Kubernetes resource management and networking configurations. This cross-disciplinary approach, combining ML engineering with infrastructure expertise, will increasingly define the ML platforms of tomorrow.
The art and science of LLM deployment remain in their early stages, with best practices still emerging. By understanding the fundamental challenges and embracing the innovative solutions discussed here, organizations can build robust, efficient, and cost-effective LLM serving systems that unlock the full potential of these powerful AI models.
References
Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23).
Lin, J., et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. arXiv:2306.00978.
Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. International Conference on Learning Representations (ICLR 2023).
Xia, H., et al. (2024). Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding. arXiv:2401.07851.
NVIDIA. (2023). TensorRT-LLM: An Open Source Library for High Performance LLM Inference. NVIDIA Developer Documentation.
NVIDIA Developer Blog. (2024). Achieving High Mixtral 8x7B Performance with NVIDIA H100 Tensor Core GPUs and TensorRT-LLM. NVIDIA Technical Blog.
Google Cloud. (2024). Best practices for autoscaling LLM inference workloads with GPUs on GKE. Google Cloud Documentation.
KubeAI. (2024). LLM Load Balancing at Scale: CHWBL. KubeAI Blog.