FusIOnX: Unlocking Cost-Efficient LLM Inference without High-End Hardware
Eshcar Hillel, Principal Research Scientist, June 2025
As large language models (LLMs) become essential to a wide range of AI applications, serving them efficiently at scale remains a major challenge, particularly for organizations relying on mid-range hardware. Model quantization offers a way to reduce model size and memory usage, but it introduces runtime overhead that limits its practical impact. This blog post shows how Pliops FusIOnX, a disaggregated KV-cache offloading solution, enables efficient inference of quantized models without sacrificing responsiveness or system throughput. When applied to a quantized DeepSeek-V3 model on 8x NVIDIA H20 GPUs, FusIOnX delivers up to 4.2x faster prompt processing, 3-10x higher prompt token throughput, and 2.5x higher end-to-end application-level efficiency compared to the baseline inference engine. With FusIOnX, it becomes feasible to run production-scale LLM inference with quantized models on affordable infrastructure, reducing total cost of ownership (TCO) without compromising user experience.
Recent Pliops FusIOnX technical reports
1. FusIOnX Leveraging NVIDIA Dynamo, May 27, 2025
2. FusIOnX Optimizes LLM Inference with NVIDIA Dynamo (StorageReview), May 20, 2025
The Memory Wall: When Model Size Outpaces Hardware
Deploying large LLMs like DeepSeek-V3 on mid-range infrastructure presents a serious memory challenge. Even on a setup with 8x NVIDIA H20 GPUs, total high-bandwidth memory (HBM) is quickly consumed by model weights and activation buffers, leaving limited capacity for the KV-cache, the structure that stores the attention keys and values of previously processed tokens.
To illustrate this, consider a common configuration running DeepSeek-V3-0324 (FP8) with continuous batching. The system starts with roughly 768 GB of total HBM (8x 96 GB), but after accounting for the model weights and framework overhead (both PyTorch and system-level memory), less than 25 GB remains for the KV-cache. With a 6K-token prompt and a 150-token response per request, each request's KV-cache consumes roughly 3.5 GB. This restricts the system to a maximum of 5-6 concurrent users, beyond which responsiveness and throughput degrade as the system struggles to allocate memory for additional KV-cache blocks.
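A rough back-of-the-envelope check of this budget (a sketch; the framework overhead figure is an assumption chosen for illustration, not a measured value):

```python
# Rough KV-cache capacity estimate for DeepSeek-V3 (FP8) on 8x NVIDIA H20.
total_hbm_gb = 8 * 96                 # 8 GPUs x 96 GB HBM each
weights_gb = 671                      # ~671B parameters at FP8 (~1 byte per parameter)
framework_overhead_gb = 75            # assumed PyTorch + system-level reserves
kv_budget_gb = total_hbm_gb - weights_gb - framework_overhead_gb   # ~22 GB

kv_gb_per_request = 3.5               # 6K-token prompt + 150-token response (from the text)
max_concurrent_users = int(kv_budget_gb // kv_gb_per_request)
print(kv_budget_gb, max_concurrent_users)   # ~22 GB of KV-cache headroom -> ~6 users
```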
Table 1 shows system behaviour across a varying number of users. As the user count increases, Time-to-First-Token (TTFT), Time-per-Output-Token (TPOT), and Requests-per-Second (RPS) rise, while token throughput per user declines. The TTFT service-level agreement (SLA) is set to 1 s and the TPOT SLA to 90 ms. The data demonstrates a clear bottleneck: while GPU compute may remain underutilized, the Memory Wall prevents serving more users efficiently. At batch size 6, new prefills cannot start because there is not enough HBM left for decoding.
Quantized Models: Memory Relief with a Compute Tradeoff
Quantization has emerged as a practical approach to overcoming the memory wall. By reducing the precision of model weights—e.g., from FP16 to FP8 or FP4—quantization significantly shrinks the memory footprint of LLMs. This not only frees up HBM for more concurrent users but also alleviates memory bandwidth pressure during the decode phase, accelerating token generation.
However, quantization introduces a new compute bottleneck, particularly during the prefill phase (when the model processes the prompt and generates the initial KV-cache). On mid-range GPUs like the NVIDIA H20, which do not natively support FP4 arithmetic, quantized weights must be dequantized to a higher-precision format (e.g., FP16) at runtime. This dequantization step is computationally expensive and disproportionately impacts prefill latency.
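A minimal sketch of what that runtime cost looks like, assuming AWQ-style 4-bit weights that have already been unpacked to integer values (the shapes, group size, and function name are illustrative, not the kernels vLLM actually uses):

```python
import torch

def dequant_gemm(x_fp16, qweight, scales, zeros, group_size=128):
    """Dequantize 4-bit weights to FP16, then run the GEMM in FP16.

    qweight: (out_features, in_features) int8 tensor holding unpacked 4-bit values in [0, 15]
    scales, zeros: (out_features, in_features // group_size) FP16 group-wise parameters
    """
    w = qweight.to(torch.float16)
    w = w.view(w.shape[0], -1, group_size)                  # group the input channels
    w = (w - zeros.unsqueeze(-1)) * scales.unsqueeze(-1)    # per-group dequantization
    w = w.reshape(w.shape[0], -1)
    return x_fp16 @ w.t()                                   # the matmul itself runs in FP16

# Real kernels fuse the unpack/scale steps, but on a GPU without native FP4 arithmetic
# this extra work is still paid on every matmul, and prefill performs the most matmul work.
```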
To illustrate the impact, we ran DeepSeek-V3-0324-AWQ (FP4) on the same 8x H20 GPU setup described earlier. While memory is no longer a limiting factor (enabling up to 16 concurrent users), the computational overhead of dequantization increases TTFT and TPOT, pushing response times beyond the customer-defined service-level agreements (SLAs).
Compared to its FP8 counterpart, the quantized model suffers up to 40% lower responsiveness (TPS/user) and offers no throughput gain, despite freeing up memory. The very technique used to unlock memory headroom ends up trading it for compute inefficiency, a trade-off that makes production deployment of quantized models on non-high-end GPUs difficult without further architectural optimization.
This is where Pliops FusIOnX makes the difference.
Pliops FusIOnX to the Rescue: Turning Quantization Tradeoffs into Gains
FusIOnX is a disaggregated KV-cache offloading solution that enables the reuse of KV-cache across LLM sessions, dramatically reducing the need to recompute the prompt during inference. This is especially powerful with quantized models, where the prefill phase becomes a compute bottleneck due to dequantization overhead on GPUs without native FP4 support.
By offloading and reusing previously computed KV-cache entries, FusIOnX minimizes redundant computation in the prefill stage. This not only restores responsiveness lost to low-bit quantization but also unlocks significant improvements in throughput and overall efficiency—even on mid-range hardware.
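Conceptually, the flow looks like the sketch below: the prompt is hashed in fixed-size blocks, cached blocks are fetched from the offload target, and only the uncached tail is prefilled. The `kv_store` and `model` interfaces here are hypothetical placeholders, not the actual FusIOnX or vLLM APIs.

```python
import hashlib

BLOCK = 256  # tokens per KV-cache block (illustrative)

def block_keys(token_ids):
    """Chain-hash the prompt in fixed-size blocks so identical prefixes map to identical keys."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode())
        keys.append(h.hexdigest())
    return keys

def prefill_with_offload(token_ids, kv_store, model):
    """Reuse offloaded KV blocks for the longest cached prefix; prefill only the tail."""
    keys = block_keys(token_ids)
    hits = 0
    for k in keys:                                   # longest contiguous cached prefix
        if not kv_store.contains(k):
            break
        hits += 1
    cached_kv = kv_store.fetch_many(keys[:hits])     # pulled from the offload target, no recompute
    tail_kv = model.prefill(token_ids[hits * BLOCK:], past_kv=cached_kv)
    kv_store.put_many(keys[hits:], tail_kv)          # offload the newly computed blocks for reuse
    return cached_kv + tail_kv
```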
Recognizing this need, several customers asked us to evaluate how quantized models perform when combined with our KV-cache offloading solution. In this study, we benchmark DeepSeek-V3-0324-AWQ (FP4) on 8x H20 GPUs, with and without FusIOnX. The results are impressive: up to 4.2x lower prefill latency (TTFT), 3x-10x higher prompt token throughput, and a 2.5x improvement in application-level efficiency for multi-turn chat scenarios. These gains effectively neutralize the drawbacks of dequantization during prefill, allowing quantized models to perform like their higher-precision counterparts, but with far lower memory usage and compute cost.
Under the Hood: The XDP FusIOnX Stack
To support high-throughput, low-latency inference with disaggregated KV-cache offloading, we deployed FusIOnX using a two-tier architecture:
System Configuration
Initiator: 8x NVIDIA H20 GPUs (96 GB HBM each), dual-socket Intel Xeon Platinum 8468V, 1 TB DRAM, 1x NVIDIA ConnectX-7 (400 Gb/s)
Target: dual-socket Intel Xeon Platinum 8458P, 1 TB DRAM, 1x Pliops XDP Pro1 card connected to 6x 7.68 TB Intel P5520 SSDs, 1x NVIDIA ConnectX-7 (400 Gb/s)
This setup ensures high-throughput, low-latency communication between initiator and target nodes, enabling rapid offload and retrieval of KV-cache with minimal overhead.
Test Plan and Evaluation Metrics
To evaluate FusIOnX, we compared it against a vanilla vLLM baseline across multiple production-relevant scenarios. Our goal was to assess how KV-cache offloading impacts user experience and system efficiency, especially when paired with quantized models.
Benchmarking Methodology
We ran two types of tests to reflect real-world inference scenarios:
Static Batching Benchmark
Simulates multi-turn conversation sessions where 90% of the KV-cache has already been computed in earlier turns but no longer fits in GPU memory. Retrieving this KV-cache from FusIOnX significantly improves performance without the need to recompute.
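A quick back-of-the-envelope comparison of fetching versus recomputing that 90% (the per-request KV size comes from the earlier example and the link rate from the ConnectX-7 spec; the baseline prefill rate is purely an illustrative assumption):

```python
# Illustrative only: fetch time vs. recompute time for the cached 90% of a 6K-token prompt.
prompt_tokens = 6000
hit_ratio = 0.9
kv_gb_per_request = 3.5                       # per-request KV footprint from the earlier example
fetch_gb = kv_gb_per_request * hit_ratio      # ~3.15 GB pulled from the target node
link_gb_per_s = 400 / 8                       # 400 Gb/s ConnectX-7 ~= 50 GB/s, best case
fetch_ms = fetch_gb / link_gb_per_s * 1000    # ~63 ms of wire time

baseline_prefill_tok_per_s = 9000             # assumed prefill rate with dequantization overhead
recompute_ms = prompt_tokens * hit_ratio / baseline_prefill_tok_per_s * 1000   # ~600 ms
print(round(fetch_ms), round(recompute_ms))   # fetching the cache is far cheaper than recomputing it
```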
Continuous Batching Benchmark
Reflects production workloads with dynamic request arrival. New requests are dynamically batched for efficiency, and each response is returned as soon as that request completes, without waiting for the rest of the batch. This maximizes system utilization and highlights latency-responsiveness trade-offs. We use prompt distributions derived from public datasets like ShareGPT.
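For reference, a minimal sketch of the kind of load generator this implies: requests arrive with exponential inter-arrival times (a Poisson process) and each one is measured independently. The endpoint URL and model name are assumptions for a vLLM OpenAI-compatible server, not our exact benchmark harness.

```python
import asyncio, random, time
import aiohttp

URL = "http://localhost:8000/v1/completions"      # assumed vLLM OpenAI-compatible endpoint

async def send(session, prompt, latencies):
    t0 = time.perf_counter()
    payload = {"model": "deepseek-v3-awq", "prompt": prompt, "max_tokens": 150}
    async with session.post(URL, json=payload) as resp:
        await resp.json()                          # each response returns as soon as its request finishes
    latencies.append(time.perf_counter() - t0)

async def run(prompts, request_rate=2.0):
    """Issue requests with Poisson arrivals and record per-request end-to-end latency."""
    latencies, tasks = [], []
    async with aiohttp.ClientSession() as session:
        for p in prompts:
            tasks.append(asyncio.create_task(send(session, p, latencies)))
            await asyncio.sleep(random.expovariate(request_rate))
        await asyncio.gather(*tasks)
    return latencies
```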
Metrics That Matter
We focused on four key metrics that reflect both system-level efficiency and end-user experience: Time-to-First-Token (TTFT), Time-per-Output-Token (TPOT), throughput (prompt tokens per second and RPS per GPU), and per-user generation rate (TPS/user).
Results
We begin with the static batching tests, which showcase the raw gain of LightningAI-based KV-cache offloading. Overall, the results demonstrate that FusIOnX decreases TTFT by up to 4.2x, improves prompt prefill throughput by up to 10x, and increases end-to-end inference efficiency by up to 2.5x.
Prefill Acceleration
Figure 2 demonstrates up to a 4.2x prefill speedup for FusIOnX vs. the baseline running DS-V3-AWQ on 8x H20 GPUs across different prompt (context) lengths at batch size 1. Figure 3 shows that prefill latency with LightningAI-based KV-cache offloading closely matches the upper bound of retaining the full cache in HBM via automatic prefix caching (APC).
Throughput Gain
Figure 4 shows the prompt token throughput gain of FusIOnX vs. the baseline when running the static benchmark for different prompt lengths, with TTFT SLAs of 500 ms and 600 ms. With the dequantization overhead, vanilla vLLM struggles to meet the SLA; red markers indicate runs where it fails to do so even at batch size 1. Because KV-cache processing is significantly shorter with FusIOnX, the batch size can be increased without violating the TTFT SLA, resulting in higher prompt token throughput. At the 500 ms SLA the gain is 2x-8x, and when the SLA is relaxed to 600 ms, FusIOnX delivers up to 10x the token throughput.
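A simplified way to read this result: under a TTFT SLA, the admissible batch size is bounded by how many prompts' prefill fits inside the SLA window. The sketch below assumes prefill latency grows roughly linearly with the number of batched prompt tokens; the absolute latencies are illustrative, only the ~4.2x ratio comes from the measurements above.

```python
def max_batch_under_ttft_sla(per_prompt_prefill_ms: float, sla_ms: float = 500.0) -> int:
    """Largest batch whose combined prefill still returns first tokens within the TTFT SLA.
    Assumes prefill time scales roughly linearly with the number of batched prompt tokens."""
    return int(sla_ms // per_prompt_prefill_ms)

# Illustrative: if the baseline needs ~600 ms to prefill one long prompt (dequantization included)
# and FusIOnX needs ~145 ms (the ~4.2x reduction), then under a 500 ms TTFT SLA:
print(max_batch_under_ttft_sla(600.0))   # 0 -> the baseline misses the SLA even at batch size 1
print(max_batch_under_ttft_sla(145.0))   # 3 -> FusIOnX admits 3 prompts, multiplying prompt throughput
```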
Application-Level Gain
We next describe the results of the continuous batching tests, which simulate a more realistic production environment. We ran the benchmark with different numbers of clients. Figure 5 explores the tradeoff between system efficiency, captured by RPS/GPU, and user experience, measured by TPS/user. At the typical 90-100 ms TPOT SLA (roughly 10 TPS/user), FusIOnX demonstrates 2.5x higher efficiency than vanilla vLLM.
Summary
This evaluation demonstrates that disaggregated KV-cache offloading, as implemented by Pliops FusIOnX, effectively addresses the computational overhead introduced by quantized LLMs during prompt processing. By enabling reuse of intermediate computations across sessions, FusIOnX substantially reduces latency and improves throughput, unlocking the practical benefits of low-bit quantization in production settings.
Looking forward, KV-cache offloading presents a promising approach to overcoming memory and compute constraints that limit scalable deployment of large, quantized models. It enables serving more users per GPU while maintaining responsiveness and reducing energy consumption and total cost of ownership. As models grow larger and context windows expand, architectural solutions like FusIOnX will be critical to sustaining cost-effective LLM inference at scale.