What is the next big leap in #mathematicaloptimization: quantum approaches, AI/ML/RL, GPUs? There are many trends to follow and evaluate, so which one to bet on? Since early this year it has become clear that for some large (especially gigantically large) linear optimization problems, GPU acceleration can help tremendously, thanks to a proper mathematical foundation (first-order methods) combined with the strengths of GPUs over CPUs in memory bandwidth and parallelism. With the October 2025 release, #Xpress 9.8 now incorporates GPU acceleration of PDHG that makes your large-scale linear programming solutions fly! 🔥

What's got us excited:
• 30x speedups in single precision and 25x in double precision!
• A full-algorithm GPU implementation, not just the matrix operations
• Great for problems with over 100,000 nonzeros; even better for problems with over 10,000,000 nonzeros

Thanks to our partners who submitted instances for testing and evaluation. 💡 Read more here: https://lnkd.in/eUq69x8G

Don't get addicted purely to GPUs yet! There are still many instances on which a barrier or dual simplex method outperforms current GPU implementations, while being less sensitive to numerical tolerances. Let's keep researching, and start enjoying!
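For the curious: PDHG is a first-order method, so each iteration is just matrix-vector products and projections, exactly the operations GPUs are built for. Below is a minimal NumPy sketch of the vanilla PDHG iteration for min cᵀx s.t. Ax = b, x ≥ 0. It is a toy illustration only, not the Xpress implementation (which adds restarts, preconditioning, adaptive step sizes, and a full GPU port); swapping numpy for cupy would run the same loop on a GPU.

```python
import numpy as np

def pdhg_lp(c, A, b, iters=5000):
    """Vanilla PDHG for  min c^T x  s.t.  A x = b,  x >= 0.

    Toy sketch only: production first-order LP solvers add restarts,
    diagonal preconditioning, and adaptive step sizes on top of this.
    """
    m, n = A.shape
    # Step sizes must satisfy tau * sigma * ||A||_2^2 <= 1 for convergence.
    norm_A = np.linalg.norm(A, 2)
    tau = sigma = 0.9 / norm_A
    x = np.zeros(n)
    y = np.zeros(m)
    for _ in range(iters):
        # Primal step: move along c - A^T y, then project onto x >= 0.
        x_new = np.maximum(0.0, x - tau * (c - A.T @ y))
        # Dual step: ascent using the extrapolated point 2*x_new - x.
        y = y + sigma * (b - A @ (2.0 * x_new - x))
        x = x_new
    return x, y

# Tiny example:  min x1 + 2*x2  s.t.  x1 + x2 = 1,  x >= 0  (optimum [1, 0]).
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
c = np.array([1.0, 2.0])
x, y = pdhg_lp(c, A, b)
print(x)  # approximately [1, 0]
```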
This is a great example of how a hybrid compute architecture can drive efficiency for inference, which is key to the best TCO and lower energy usage. The setup combines GPU systems like NVIDIA DGX with CPUs built by Apple. Here's a technical breakdown:

1. Workload Specialization
• GPU (DGX Spark): ideal for parallel-heavy tasks like prefill (matrix multiplications, attention blocks).
• CPU (M3 Ultra): excels at sequential, low-latency tasks like token-by-token decode, helped by unified memory and fast single-core performance.
By splitting inference stages, you reduce contention and maximize hardware utilization (a toy sketch follows below).

2. TCO (Total Cost of Ownership) Benefits
• Lower hardware costs: DGX systems are expensive and power-hungry. Offloading decode to a cheaper, efficient CPU system reduces the number of DGX units needed, and the M3 Ultra offers high performance per watt at a fraction of the cost.
• Reduced GPU overprovisioning: decode is typically bottlenecked by latency, not throughput, so running it on a GPU wastes parallelism. Offloading decode frees up the GPU and improves throughput per dollar.
• Scalable deployment: CPU nodes can be scaled independently for decode-heavy workloads, allowing elastic scaling based on workload profiles.

3. Power Efficiency Gains
• Energy-optimized decode: CPUs like the M3 Ultra consume far less power than GPUs for sequential tasks.
• Thermal and cooling savings: DGX systems require active cooling and high-density power delivery; offloading decode reduces the GPU duty cycle, lowering thermal output and cooling costs.
• Idle power reduction: decode often involves waiting for token generation. CPUs can idle efficiently, while GPUs consume power even when underutilized.

With the announcement of NVIDIA NVLink integration with Intel CPU architectures, we may witness more such deployments, a win-win for AI optimization and for CPUs across AI inference!
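To make the prefill/decode split concrete, here is a toy NumPy sketch of one attention layer: prefill is a single big batched matmul over the whole prompt (GPU-friendly), while decode is a per-token matvec loop over the cached K/V (CPU-friendly). All names and shapes are illustrative only, not any vendor's API.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # toy model width

# Hypothetical weights for one attention layer; in a real hybrid deployment
# the prefill weights would live on the GPU and the decode path in the
# CPU's unified memory.
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def prefill(prompt_embeds):
    """Parallel-heavy stage: one big batched matmul over the whole prompt.
    This is the part that benefits from GPU throughput."""
    K = prompt_embeds @ Wk  # (T, D)
    V = prompt_embeds @ Wv  # (T, D)
    return K, V             # KV cache handed over to the decode stage

def decode_step(x, K, V):
    """Sequential stage: a single token attends over the cached K/V.
    Latency-bound, small matvecs: a good fit for a fast CPU."""
    q = x @ Wq                          # (D,)
    scores = (q @ K.T) / np.sqrt(D)     # attention over cached positions
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ V                        # next hidden state

prompt = rng.standard_normal((128, D))  # a 128-token prompt
K, V = prefill(prompt)                  # "GPU" stage
x = prompt[-1]
for _ in range(16):                     # "CPU" stage, token by token
    x = decode_step(x, K, V)
    # a real decoder would also append the new token's K/V rows to the cache
```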
Accelerate large-scale vector search with the NVIDIA cuVS integration in Faiss, delivering up to 12x faster index builds and 8x lower search latency on GPUs. Effortlessly scale and deploy across CPU and GPU to power real-time AI and retrieval applications. https://lnkd.in/g6QHaziT
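As a starting point, here is a minimal sketch using the classic Faiss Python API: build an exact index on the CPU, move it to the GPU, and search. Whether the GPU path dispatches to the cuVS kernels depends on how your Faiss build was configured, so treat that part of the comment as an assumption rather than a guarantee.

```python
import numpy as np
import faiss  # requires a GPU-enabled Faiss build (e.g. faiss-gpu)

d, nb, nq = 128, 100_000, 10
rng = np.random.default_rng(0)
xb = rng.random((nb, d), dtype=np.float32)  # database vectors
xq = rng.random((nq, d), dtype=np.float32)  # query vectors

# Build a flat (exact) L2 index on the CPU...
cpu_index = faiss.IndexFlatL2(d)
cpu_index.add(xb)

# ...then move it to GPU 0 for accelerated search. In a cuVS-enabled
# Faiss build, the GPU index build and search paths are the ones that
# benefit from the accelerated backends.
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)

distances, ids = gpu_index.search(xq, 5)  # 5 nearest neighbours per query
print(ids[0])
```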
🎥 New video: Performance-portable GPUs with SYCL (oneAPI & AdaptiveCpp) vs CUDA/HIP

We've just published a short explainer on our JPDC paper, where we stress-test SYCL on real HPC workloads (single- and multi-GPU, even mixed NVIDIA+AMD) from one codebase.

What's inside:
• Why SYCL (modern C++) is a pragmatic path to "write once, run fast (almost) everywhere."
• A real application (UVaFTLE for flow analysis), with both memory-bound and compute-light kernels.
• Results: SYCL ≈ HIP on AMD; competitive on NVIDIA (sometimes wins on lighter kernels); mixed-vendor multi-GPU works in practice.
• Practical guidance: prefer device-resident data (USM-device or buffers); avoid shared/managed memory for performance-critical paths; oneAPI tends to shine on NVIDIA, AdaptiveCpp on AMD.

Why it matters: teams juggling heterogeneous clusters and evolving hardware roadmaps can maintain a single well-performing codebase without vendor lock-in, trading a sliver of peak performance for portability and longevity.

▶ Watch this short video!
📄 Preprint (JPDC, 2025): https://lnkd.in/dDPRiPeE
💻 Code (UVaFTLE: CUDA/HIP/SYCL): https://lnkd.in/dxjqC_wK

If you're working on GPU portability, we'd love your feedback!
Can your CUDA code run anywhere? For nearly two decades, NVIDIA's CUDA has set the standard for GPU programming: a powerful framework that turned GPUs into the engine of the AI revolution. Now, as demand for compute soars, others are following in its footsteps. From AMD's HIP to Spectral's SCALE and even new Chinese GPU stacks, developers are reimagining what "CUDA compatibility" means in a more open, multi-vendor world. Our latest deep dive retraces the path from the first GeForce 256 to today's emerging efforts to make CUDA code run agnostically, and what this shift could mean for developers everywhere. 👉 Read the full story, written by Antoine Radet & Cédric Courtaud, Ph.D.: https://lnkd.in/egKSbBFM
A very interesting NVIDIA blog post on floating-point emulation in cuBLAS. The latest cuBLAS update in NVIDIA CUDA Toolkit 13.0 Update 2 introduces new APIs and implementations that significantly boost the performance of double-precision (FP64) matrix multiplications through floating-point emulation on the Tensor Cores of GPU architectures such as NVIDIA GB200 NVL72. The cuBLAS library includes an automatic dynamic precision (ADP) framework that analyzes inputs to determine whether emulation can be safely leveraged for increased performance, and automatically configures emulation parameters to achieve accuracy equal to or better than native FP64 matrix multiplication. Applications such as ecTrans, BerkeleyGW, and Quantum Espresso have seen significant performance improvements from FP emulation, with speedups ranging from 1.5x to 3x while maintaining accuracy within acceptable ranges. https://lnkd.in/dvFbfuq7
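The underlying idea is easier to see in miniature. The toy NumPy sketch below splits each FP64 operand into two FP32 pieces and combines four partial products with high-precision accumulation (low-precision inputs, high-precision accumulate, mirroring the Tensor Core design). It illustrates the principle only; the actual cuBLAS emulation uses an Ozaki-style mantissa-slicing scheme plus the ADP accuracy analysis described in the blog.

```python
import numpy as np

def split32(a64):
    """Split an FP64 matrix into high + low FP32 parts, a64 ~= hi + lo."""
    hi = a64.astype(np.float32)
    lo = (a64 - hi.astype(np.float64)).astype(np.float32)
    return hi, lo

def emulated_fp64_matmul(a, b):
    """Approximate an FP64 matmul from FP32-representable pieces.

    Toy illustration of the principle only: the inputs to each partial
    product are FP32-representable, but accumulation happens in FP64,
    as Tensor Cores accumulate low-precision products in higher precision.
    """
    a_hi, a_lo = split32(a)
    b_hi, b_lo = split32(b)
    f64 = np.float64
    return (a_hi.astype(f64) @ b_hi.astype(f64)
            + a_hi.astype(f64) @ b_lo.astype(f64)
            + a_lo.astype(f64) @ b_hi.astype(f64)
            + a_lo.astype(f64) @ b_lo.astype(f64))

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256))
b = rng.standard_normal((256, 256))
exact = a @ b
err_fp32 = np.abs((a.astype(np.float32) @ b.astype(np.float32)) - exact).max()
err_emul = np.abs(emulated_fp64_matmul(a, b) - exact).max()
print(f"plain FP32 error: {err_fp32:.1e}  vs  split-FP32 error: {err_emul:.1e}")
```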
Learn when to use CPUs vs. GPUs for AI inference. Compare performance, cost, and energy efficiency to choose the right hardware for your AI workloads. Read more. #CloudComputing https://ow.ly/4leu50X84pz
AI Inference Hardware Decisions: When to Choose CPUs vs. GPUs
🔥 GPU health monitoring just got native in Kubernetes 1.34! No more “Running” Pods with dead GPUs. Kubernetes now tracks per-resource health, surfacing GPU or accelerator failures directly in Pod status. ✅ Detect GPU faults in real time ✅ Automate recovery with controllers ✅ Stop wasting compute on broken devices Read the full blog 👇 👉 https://zurl.co/mA50F #Kubernetes #AI #GPU #ML #CloudNative
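A hedged sketch of how you might read the new per-device health from Pod status with the official Kubernetes Python client. The pod name and namespace are hypothetical, the allocatedResourcesStatus field only appears when the resource-health feature is enabled on your cluster, and the snake_case field names follow the usual client conventions but may vary by client version, hence the defensive getattr calls.

```python
# Poll a Pod and print per-device health from
# status.containerStatuses[].allocatedResourcesStatus.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster
v1 = client.CoreV1Api()

# "train-job-0" / "ml" are hypothetical placeholders.
pod = v1.read_namespaced_pod(name="train-job-0", namespace="ml")
for cs in pod.status.container_statuses or []:
    # Field absent on older clusters/clients; getattr keeps this safe.
    for res_status in getattr(cs, "allocated_resources_status", None) or []:
        # res_status.name is the resource (e.g. nvidia.com/gpu); each entry
        # in res_status.resources carries a device ID and its health.
        for dev in getattr(res_status, "resources", None) or []:
            print(cs.name, res_status.name, dev.resource_id, dev.health)
```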