"Exploring Extreme Quantization for LLMs: A New Framework"

Ramtin Zand

Principal Investigator of the iCAS Lab, Assistant Professor of Computer Science and Engineering at the University of South Carolina

One of the challenges in a field full of excitement and hype is staying grounded and pursuing fundamental insights. This is what Jinendra Malekar and I tried to do in our recent paper, now available in Transactions on Machine Learning Research (TMLR): "Amdahl's Law for LLMs: A Throughput-Centric Analysis of Extreme LLM Quantization."

Link to the paper: https://lnkd.in/eTPCewfQ

🔍 The key contributions:

• A throughput-centric analysis of mixed-precision LLMs, where projection layers are aggressively quantized (<4-bit) while attention heads remain at higher precision (INT8/FP16) to preserve accuracy.
• An adaptation of Amdahl's Law for LLMs, providing a quantitative framework to reason about throughput ceilings under extreme quantization.
• Extensive experiments across diverse LLM architectures (GPT, OPT, LLaMA) and hardware backends (EdgeTPU, CloudTPU, and GPU).

💡 Our findings show that while extreme quantization can significantly boost LLM throughput, the gains are ultimately limited by the most constrained parts of the model, which depend heavily on both model hyperparameters (e.g., context length, embedding dimensions) and hardware architecture. In other words, there's no one-size-fits-all solution! This can provide a roadmap for designing more holistic quantization strategies that push LLM performance further. (A back-of-the-envelope sketch of this Amdahl-style ceiling appears below.)

We'd love to hear your thoughts on where extreme LLM quantization should head next: Is the bigger priority developing custom hardware to fully capitalize on it, or advancing algorithmic innovations to make attention computation more efficient?

#LLM #Quantization #EfficientAI
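To make the throughput-ceiling argument concrete, here is a minimal sketch using the textbook Amdahl formulation; the paper's exact adaptation may differ, and the names proj_fraction and proj_speedup are illustrative placeholders, not taken from the paper.

```python
# Minimal Amdahl-style sketch (textbook formulation, not the paper's exact model).
# proj_fraction: fraction of baseline runtime spent in the projection layers
#                (the part accelerated by <4-bit quantization).
# proj_speedup:  local speedup of that part; attention (INT8/FP16) is untouched.

def amdahl_speedup(proj_fraction: float, proj_speedup: float) -> float:
    """Overall throughput gain when only the projection layers are accelerated."""
    return 1.0 / ((1.0 - proj_fraction) + proj_fraction / proj_speedup)

def amdahl_ceiling(proj_fraction: float) -> float:
    """Upper bound on the overall gain as the quantized portion becomes free."""
    return 1.0 / (1.0 - proj_fraction)

if __name__ == "__main__":
    # Hypothetical example: projection layers account for 70% of baseline
    # runtime and extreme quantization makes them 8x faster.
    print(f"overall speedup: {amdahl_speedup(0.7, 8.0):.2f}x")  # ~2.58x
    print(f"ceiling:         {amdahl_ceiling(0.7):.2f}x")       # ~3.33x
```

The fraction of runtime spent in the projection layers is itself a moving target: it shifts with context length, embedding dimensions, and the hardware backend, which is why the achievable ceiling differs so much across models and devices.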
