FusIOnX: Unlocking Cost-Efficient LLM Inference without High-End Hardware
Eshcar Hillel, Principal Research Scientist, June 2025
As large language models (LLMs) become essential to a wide range of AI applications, serving them efficiently at scale remains a major challenge, particularly for organizations relying on mid-range hardware. Model quantization offers a way to reduce model size and memory usage, but it introduces runtime overhead that limits its practical impact. This blog post shows how Pliops FusIOnX, a disaggregated KV-cache offloading solution, enables efficient inference of quantized models without sacrificing responsiveness or system throughput. When applied to a quantized DeepSeek-V3 model on 8x NVIDIA H20 GPUs, FusIOnX delivers up to 4.2x faster prompt processing, 3-10x higher prompt token throughput, and 2.5x higher end-to-end application-level efficiency compared to the baseline inference engine. With FusIOnX, it becomes feasible to run production-scale LLM inference with quantized models on affordable infrastructure, reducing total cost of ownership (TCO) without compromising user experience.
Recent Pliops FusIOnX technical reports
1. FusIOnX Leveraging NVIDIA Dynamo, May 27, 2025
2. FusIOnX Optimizes LLM Inference with NVIDIA Dynamo (StorageReview), May 20, 2025
The Memory Wall: When Model Size Outpaces Hardware
Deploying large LLMs like DeepSeek-V3 on mid-range infrastructure presents a serious memory challenge. Even on a setup with 8x NVIDIA H20 GPUs, total high-bandwidth memory (HBM) is quickly consumed by model weights and activation buffers, leaving limited capacity for the KV-cache, the structure that stores the attention keys and values of previously processed tokens.
To illustrate this, consider a common configuration running DeepSeek-V3-0324 (FP8) with continuous batching. The system starts with roughly 768 GB of total HBM (8x 96 GB), but after accounting for the model weights and framework overhead (both PyTorch and system-level memory), less than 25 GB remains for the KV-cache. With a 6K-token prompt and a 150-token response per request, each request's KV-cache consumes roughly 3.5 GB. This restricts the system to a maximum of 5-6 concurrent users, beyond which responsiveness and throughput degrade as the system struggles to allocate memory for additional KV-cache blocks.
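A rough back-of-the-envelope check of this budget (a sketch; the framework overhead figure is an assumption chosen for illustration, not a measured value):

```python
# Rough KV-cache capacity estimate for DeepSeek-V3 (FP8) on 8x NVIDIA H20.
total_hbm_gb = 8 * 96                 # 8 GPUs x 96 GB HBM each
weights_gb = 671                      # ~671B parameters at FP8 (~1 byte per parameter)
framework_overhead_gb = 75            # assumed PyTorch + system-level reserves
kv_budget_gb = total_hbm_gb - weights_gb - framework_overhead_gb   # ~22 GB

kv_gb_per_request = 3.5               # 6K-token prompt + 150-token response (from the text)
max_concurrent_users = int(kv_budget_gb // kv_gb_per_request)
print(kv_budget_gb, max_concurrent_users)   # ~22 GB of KV-cache headroom -> ~6 users
```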
Table 1 shows system behaviour across a varying number of users. As the user count increases, Time-to-First-Token (TTFT), Time-per-Output-Token (TPOT), and Requests-per-Second (RPS) rise, while token throughput per user declines. The TTFT service-level agreement (SLA) is set to 1 s and the TPOT SLA to 90 ms. The data demonstrates a clear bottleneck: while GPU compute may remain underutilized, the Memory Wall prevents serving more users efficiently. At batch size 6, new prefills cannot start because there is not enough HBM left for decoding.
Quantized Models: Memory Relief with a Compute Tradeoff
Quantization has emerged as a practical approach to overcoming the memory wall. By reducing the precision of model weights—e.g., from FP16 to FP8 or FP4—quantization significantly shrinks the memory footprint of LLMs. This not only frees up HBM for more concurrent users but also alleviates memory bandwidth pressure during the decode phase, accelerating token generation.
However, quantization introduces a new compute bottleneck, particularly during the prefill phase (when the model processes the prompt and generates the initial KV-cache). On mid-range GPUs like the NVIDIA H20, which do not natively support FP4 arithmetic, quantized weights must be dequantized to a higher-precision format (e.g., FP16) at runtime. This dequantization step is computationally expensive and disproportionately impacts prefill latency.
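A minimal sketch of what that runtime cost looks like, assuming AWQ-style 4-bit weights that have already been unpacked to integer values (the shapes, group size, and function name are illustrative, not the kernels vLLM actually uses):

```python
import torch

def dequant_gemm(x_fp16, qweight, scales, zeros, group_size=128):
    """Dequantize 4-bit weights to FP16, then run the GEMM in FP16.

    qweight: (out_features, in_features) int8 tensor holding unpacked 4-bit values in [0, 15]
    scales, zeros: (out_features, in_features // group_size) FP16 group-wise parameters
    """
    w = qweight.to(torch.float16)
    w = w.view(w.shape[0], -1, group_size)                  # group the input channels
    w = (w - zeros.unsqueeze(-1)) * scales.unsqueeze(-1)    # per-group dequantization
    w = w.reshape(w.shape[0], -1)
    return x_fp16 @ w.t()                                   # the matmul itself runs in FP16

# Real kernels fuse the unpack/scale steps, but on a GPU without native FP4 arithmetic
# this extra work is still paid on every matmul, and prefill performs the most matmul work.
```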
To illustrate the impact, we ran DeepSeek-V3-0324-AWQ (FP4) on the same 8x H20 GPU setup described earlier. While memory is no longer a limiting factor (enabling up to 16 concurrent users), the computational overhead of dequantization increases TTFT and TPOT, pushing response times beyond the customer-defined service-level agreements (SLAs).
Compared to its FP8 counterpart, the quantized model suffers up to 40% lower responsiveness (TPS/user) and offers no throughput gain, despite freeing up memory. The very technique used to unlock memory headroom ends up trading it for compute inefficiency, a trade-off that makes production deployment of quantized models on non-high-end GPUs difficult without further architectural optimization.
This is where Pliops FusIOnX makes the difference.
Pliops FusIOnX to the Rescue: Turning Quantization Tradeoffs into Gains
FusIOnX is a disaggregated KV-cache offloading solution that enables the reuse of KV-cache across LLM sessions, dramatically reducing the need to recompute the prompt during inference. This is especially powerful with quantized models, where the prefill phase becomes a compute bottleneck due to dequantization overhead on GPUs without native FP4 support.
By offloading and reusing previously computed KV-cache entries, FusIOnX minimizes redundant computation in the prefill stage. This not only restores responsiveness lost to low-bit quantization but also unlocks significant improvements in throughput and overall efficiency—even on mid-range hardware.
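Conceptually, the flow looks like the sketch below: the prompt is hashed in fixed-size blocks, cached blocks are fetched from the offload target, and only the uncached tail is prefilled. The `kv_store` and `model` interfaces here are hypothetical placeholders, not the actual FusIOnX or vLLM APIs.

```python
import hashlib

BLOCK = 256  # tokens per KV-cache block (illustrative)

def block_keys(token_ids):
    """Chain-hash the prompt in fixed-size blocks so identical prefixes map to identical keys."""
    keys, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode())
        keys.append(h.hexdigest())
    return keys

def prefill_with_offload(token_ids, kv_store, model):
    """Reuse offloaded KV blocks for the longest cached prefix; prefill only the tail."""
    keys = block_keys(token_ids)
    hits = 0
    for k in keys:                                   # longest contiguous cached prefix
        if not kv_store.contains(k):
            break
        hits += 1
    cached_kv = kv_store.fetch_many(keys[:hits])     # pulled from the offload target, no recompute
    tail_kv = model.prefill(token_ids[hits * BLOCK:], past_kv=cached_kv)
    kv_store.put_many(keys[hits:], tail_kv)          # offload the newly computed blocks for reuse
    return cached_kv + tail_kv
```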
Recognizing this need, several customers asked us to evaluate how quantized models perform when combined with our KV-cache offloading solution. In this study, we benchmark DeepSeek-V3-0324-AWQ (FP4) on 8x H20 GPUs, with and without FusIOnX. The results are impressive: up to 4.2x lower prefill latency (TTFT), 3x-10x higher prompt token throughput, and a 2.5x improvement in application-level efficiency for multi-turn chat scenarios. These gains effectively neutralize the drawbacks of dequantization during prefill, allowing quantized models to perform like their higher-precision counterparts, but with far lower memory usage and compute cost.
Under the Hood: The XDP FusIOnX Stack
To support high-throughput, low-latency inference with disaggregated KV-cache offloading, we deployed FusIOnX using a two-tier architecture:
System Configuration
Initiator: 8x NVIDIA H20 GPUs (96 GB HBM each), dual-socket Intel Xeon Platinum 8468V, 1 TB DRAM, 1x NVIDIA ConnectX-7 (400 Gb/s)
Target: dual-socket Intel Xeon Platinum 8458P, 1 TB DRAM, 1x Pliops XDP Pro1 card connected to 6x 7.68 TB Intel P5520 SSDs, 1x NVIDIA ConnectX-7 (400 Gb/s)
This setup ensures high-throughput, low-latency communication between initiator and target nodes, enabling rapid offload and retrieval of KV-cache with minimal overhead.
Test Plan and Evaluation Metrics
To evaluate FusIOnX, we compared it against a vanilla vLLM baseline across multiple production-relevant scenarios. Our goal was to assess how KV-cache offloading impacts user experience and system efficiency, especially when paired with quantized models.
Benchmarking Methodology
We ran two types of tests to reflect real-world inference scenarios:
Static Batching Benchmark
Simulates multi-turn conversation sessions where 90% of the KV-cache has already been computed in earlier turns but no longer fits in GPU memory. Retrieving this KV-cache from FusIOnX significantly improves performance without the need to recompute.
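A quick back-of-the-envelope comparison of fetching versus recomputing that 90% (the per-request KV size comes from the earlier example and the link rate from the ConnectX-7 spec; the baseline prefill rate is purely an illustrative assumption):

```python
# Illustrative only: fetch time vs. recompute time for the cached 90% of a 6K-token prompt.
prompt_tokens = 6000
hit_ratio = 0.9
kv_gb_per_request = 3.5                       # per-request KV footprint from the earlier example
fetch_gb = kv_gb_per_request * hit_ratio      # ~3.15 GB pulled from the target node
link_gb_per_s = 400 / 8                       # 400 Gb/s ConnectX-7 ~= 50 GB/s, best case
fetch_ms = fetch_gb / link_gb_per_s * 1000    # ~63 ms of wire time

baseline_prefill_tok_per_s = 9000             # assumed prefill rate with dequantization overhead
recompute_ms = prompt_tokens * hit_ratio / baseline_prefill_tok_per_s * 1000   # ~600 ms
print(round(fetch_ms), round(recompute_ms))   # fetching the cache is far cheaper than recomputing it
```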
Continuous Batching Benchmark
Reflects production workloads with dynamic request arrival. New requests are dynamically batched for efficiency, and each response is returned as soon as that request completes, without waiting for the rest of the batch. This maximizes system utilization and highlights latency-responsiveness trade-offs. We use prompt distributions derived from public datasets like ShareGPT.
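For reference, a minimal sketch of the kind of load generator this implies: requests arrive with exponential inter-arrival times (a Poisson process) and each one is measured independently. The endpoint URL and model name are assumptions for a vLLM OpenAI-compatible server, not our exact benchmark harness.

```python
import asyncio, random, time
import aiohttp

URL = "http://localhost:8000/v1/completions"      # assumed vLLM OpenAI-compatible endpoint

async def send(session, prompt, latencies):
    t0 = time.perf_counter()
    payload = {"model": "deepseek-v3-awq", "prompt": prompt, "max_tokens": 150}
    async with session.post(URL, json=payload) as resp:
        await resp.json()                          # each response returns as soon as its request finishes
    latencies.append(time.perf_counter() - t0)

async def run(prompts, request_rate=2.0):
    """Issue requests with Poisson arrivals and record per-request end-to-end latency."""
    latencies, tasks = [], []
    async with aiohttp.ClientSession() as session:
        for p in prompts:
            tasks.append(asyncio.create_task(send(session, p, latencies)))
            await asyncio.sleep(random.expovariate(request_rate))
        await asyncio.gather(*tasks)
    return latencies
```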
Metrics That Matter
We focused on four key metrics that reflect both system-level efficiency and end-user experience: Time-to-First-Token (TTFT), Time-per-Output-Token (TPOT), throughput (prompt tokens per second and RPS per GPU), and per-user generation rate (TPS/user).
Results
We begin with the static batching tests, which showcase the raw gain of LightningAI-based KV-cache offloading. Overall, the results demonstrate that FusIOnX decreases TTFT by up to 4.2x, improves prompt prefill throughput by up to 10x, and increases end-to-end inference efficiency by up to 2.5x.
Prefill Acceleration
Figure 2 demonstrates up to a 4.2x prefill speedup for FusIOnX vs. the baseline running DS-V3-AWQ on 8x H20 GPUs across different prompt (context) lengths at batch size 1. Figure 3 shows that prefill latency with LightningAI-based KV-cache offloading closely matches the upper bound of retaining the full cache in HBM via automatic prefix caching (APC).
Throughput Gain
Figure 4 shows the prompt token throughput gain of FusIOnX vs. the baseline when running the static benchmark for different prompt lengths, with TTFT SLAs of 500 ms and 600 ms. With the dequantization overhead, vanilla vLLM struggles to meet the SLA; red markers indicate runs where it fails to do so even at batch size 1. Because KV-cache processing is significantly shorter with FusIOnX, the batch size can be increased without violating the TTFT SLA, resulting in higher prompt token throughput. At the 500 ms SLA the gain is 2x-8x, and when the SLA is relaxed to 600 ms, FusIOnX delivers up to 10x the token throughput.
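A simplified way to read this result: under a TTFT SLA, the admissible batch size is bounded by how many prompts' prefill fits inside the SLA window. The sketch below assumes prefill latency grows roughly linearly with the number of batched prompt tokens; the absolute latencies are illustrative, only the ~4.2x ratio comes from the measurements above.

```python
def max_batch_under_ttft_sla(per_prompt_prefill_ms: float, sla_ms: float = 500.0) -> int:
    """Largest batch whose combined prefill still returns first tokens within the TTFT SLA.
    Assumes prefill time scales roughly linearly with the number of batched prompt tokens."""
    return int(sla_ms // per_prompt_prefill_ms)

# Illustrative: if the baseline needs ~600 ms to prefill one long prompt (dequantization included)
# and FusIOnX needs ~145 ms (the ~4.2x reduction), then under a 500 ms TTFT SLA:
print(max_batch_under_ttft_sla(600.0))   # 0 -> the baseline misses the SLA even at batch size 1
print(max_batch_under_ttft_sla(145.0))   # 3 -> FusIOnX admits 3 prompts, multiplying prompt throughput
```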
Application-Level Gain
We next describe the results of the continuous batching tests, which simulate a more realistic production environment. We ran the benchmark with different numbers of clients. Figure 5 explores the tradeoff between system efficiency, captured by RPS/GPU, and user experience, measured by TPS/user. At the typical 90-100 ms TPOT SLA (roughly 10 TPS/user), FusIOnX demonstrates 2.5x higher efficiency than vanilla vLLM.
Summary
This evaluation demonstrates that disaggregated KV-cache offloading, as implemented by Pliops FusIOnX, effectively addresses the computational overhead introduced by quantized LLMs during prompt processing. By enabling reuse of intermediate computations across sessions, FusIOnX substantially reduces latency and improves throughput, unlocking the practical benefits of low-bit quantization in production settings.
Looking forward, KV-cache offloading presents a promising approach to overcoming memory and compute constraints that limit scalable deployment of large, quantized models. It enables serving more users per GPU while maintaining responsiveness and reducing energy consumption and total cost of ownership. As models grow larger and context windows expand, architectural solutions like FusIOnX will be critical to sustaining cost-effective LLM inference at scale.