"Exploring Extreme Quantization for LLMs: A New Framework"

Ramtin Zand

Principal Investigator of the iCAS Lab, Assistant Professor of Computer Science and Engineering at the University of South Carolina

One of the challenges in a field full of excitement and hype is staying grounded and pursuing fundamental insights. This is what Jinendra Malekar and I tried to do in our recent paper, now available in Transactions on Machine Learning Research (TMLR): "Amdahl's Law for LLMs: A Throughput-Centric Analysis of Extreme LLM Quantization."

Link to the paper: https://lnkd.in/eTPCewfQ

🔍 The key contributions:

• A throughput-centric analysis of mixed-precision LLMs, where projection layers are aggressively quantized (<4-bit) while attention heads remain at higher precision (INT8/FP16) to preserve accuracy.
• An adaptation of Amdahl's Law for LLMs, providing a quantitative framework to reason about throughput ceilings under extreme quantization.
• Extensive experiments across diverse LLM architectures (GPT, OPT, LLaMA) and hardware backends (EdgeTPU, CloudTPU, and GPU).

💡 Our findings show that while extreme quantization can significantly boost LLM throughput, the gains are ultimately limited by the most constrained parts of the model, which depend heavily on both model hyperparameters (e.g., context length, embedding dimensions) and hardware architecture. In other words, there's no one-size-fits-all solution! This can provide a roadmap for designing more holistic quantization strategies that push LLM performance further. (A back-of-the-envelope sketch of this Amdahl-style ceiling appears below.)

We'd love to hear your thoughts on where extreme LLM quantization should head next: Is the bigger priority developing custom hardware to fully capitalize on it, or advancing algorithmic innovations to make attention computation more efficient?

#LLM #Quantization #EfficientAI
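To make the throughput-ceiling argument concrete, here is a minimal sketch using the textbook Amdahl formulation; the paper's exact adaptation may differ, and the names proj_fraction and proj_speedup are illustrative placeholders, not taken from the paper.

```python
# Minimal Amdahl-style sketch (textbook formulation, not the paper's exact model).
# proj_fraction: fraction of baseline runtime spent in the projection layers
#                (the part accelerated by <4-bit quantization).
# proj_speedup:  local speedup of that part; attention (INT8/FP16) is untouched.

def amdahl_speedup(proj_fraction: float, proj_speedup: float) -> float:
    """Overall throughput gain when only the projection layers are accelerated."""
    return 1.0 / ((1.0 - proj_fraction) + proj_fraction / proj_speedup)

def amdahl_ceiling(proj_fraction: float) -> float:
    """Upper bound on the overall gain as the quantized portion becomes free."""
    return 1.0 / (1.0 - proj_fraction)

if __name__ == "__main__":
    # Hypothetical example: projection layers account for 70% of baseline
    # runtime and extreme quantization makes them 8x faster.
    print(f"overall speedup: {amdahl_speedup(0.7, 8.0):.2f}x")  # ~2.58x
    print(f"ceiling:         {amdahl_ceiling(0.7):.2f}x")       # ~3.33x
```

The fraction of runtime spent in the projection layers is itself a moving target: it shifts with context length, embedding dimensions, and the hardware backend, which is why the achievable ceiling differs so much across models and devices.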
