Intel and Weizmann Institute Join Forces to Revolutionize LLM Performance for Everyone


About a week ago, the International Conference on Machine Learning (ICML), one of the most important research events in the AI world, took place. There, a team of researchers from Intel Labs, in collaboration with the Weizmann Institute of Science, published a fascinating study showing how to significantly speed up the performance of Large Language Models (LLMs).

What's really beautiful is that the research earned the honor of being selected for an Oral Presentation - a category reserved for the top 1% of the roughly 15,000 papers submitted to the conference.

Okay, a little background:

Large Language Models (LLMs) like ChatGPT are incredible tools, but they have a major Achilles' heel: they're slow and consume a lot of resources. Every answer they generate requires immense computing power, which causes delays (latency) for the user and dramatically drives up operating costs for companies. The need to accelerate these models is one of the biggest challenges in the AI industry today.

To tackle this, the industry uses a method most of you have probably heard of called "Speculative Decoding". The idea is that a small, nimble "assistant" model quickly guesses the next part of the text, and the large, powerful model only has to verify the guess instead of generating everything from scratch. For anyone not familiar with it, let me explain with the following example.

Let's say we want the model to complete the sentence: "The capital of France is..."

Without Speculative Decoding (the slow way):

The large model works step-by-step.

* Step 1: It generates the word "Paris" (requiring a lot of computational power).

* Step 2: It reads "The capital of France is Paris" and generates the next word: "the" (again, a lot of computational power).

* Step 3: It reads the entire new sentence and generates the word "City" (again, a full computational effort).

The result: To generate 3 words, the large model needed 3 separate and expensive "generation cycles."
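
To make the "slow way" concrete, here is a minimal Python sketch. The `large_model_next_token` function is a hypothetical stand-in for one full, expensive forward pass of the large model; a toy lookup table keeps the example runnable.

```python
# Minimal sketch of plain autoregressive decoding (the "slow way").
# large_model_next_token is a hypothetical stand-in for one full,
# expensive forward pass of the large model.

def large_model_next_token(text: str) -> str:
    # Toy lookup so the example runs; a real model would do a full
    # forward pass over the whole text here.
    canned = {
        "The capital of France is": "Paris",
        "The capital of France is Paris": "the",
        "The capital of France is Paris the": "City",
    }
    return canned.get(text, "<eos>")

def generate_slow(prompt: str, num_words: int) -> str:
    text = prompt
    for _ in range(num_words):                 # one expensive cycle per word
        text = f"{text} {large_model_next_token(text)}"
    return text

print(generate_slow("The capital of France is", 3))
# The capital of France is Paris the City  -> 3 expensive cycles
```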

With Speculative Decoding (the fast way):

The models work together.

* Step 1 (Quick Guess): The small, nimble "assistant" model reads the sentence and instantly guesses a three-word draft: "Paris," "the," "City" (it does this very quickly).

* Step 2 (One-Shot Verification): The large, powerful model gets this entire draft and checks it all at once. It asks itself: "Is this the right guess?" In this case, the answer is "Yes."

The result: To generate the same 3 words, the large model only needed one efficient "thinking cycle" to approve the whole draft.
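
And here is the same example with speculative decoding, again as a minimal sketch with hypothetical helper functions: the small model proposes a multi-word draft, and the large model checks the whole draft in a single pass.

```python
# Minimal sketch of speculative decoding (the "fast way").
# Both helper functions are hypothetical stand-ins.

def small_model_draft(prompt: str, k: int) -> list[str]:
    # Cheap drafter: guesses the next k words almost instantly.
    return ["Paris", "the", "City"][:k]

def large_model_verify(prompt: str, draft: list[str]) -> list[str]:
    # One expensive pass of the large model over the whole draft.
    # In this toy example it agrees with every drafted word, so the
    # entire draft is accepted; in general only the agreeing prefix
    # would be kept and the rest re-drafted.
    return draft

def generate_fast(prompt: str, k: int = 3) -> str:
    draft = small_model_draft(prompt, k)           # fast guess
    accepted = large_model_verify(prompt, draft)   # one expensive cycle
    return prompt + " " + " ".join(accepted)

print(generate_fast("The capital of France is"))
# The capital of France is Paris the City  -> 1 expensive cycle
```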

Now, I'm about to say something that probably every data scientist knows, but bear with me. AI models don't understand words like we do. Each model learns and builds its own unique "digital language" - a private dictionary of tokens that only it understands. The word "apple" in one model might be represented by token #123, while in another, it’s token #987.
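
You can see this for yourself with off-the-shelf tokenizers (the snippet below uses Hugging Face `transformers`; the two model names are just convenient examples): the same word gets different token IDs in different vocabularies.

```python
# Same word, different "digital languages": each tokenizer maps "apple"
# to IDs from its own vocabulary, so the numbers don't match.
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

word = "apple"
print(gpt2_tok.encode(word, add_special_tokens=False))  # IDs from GPT-2's vocabulary
print(bert_tok.encode(word, add_special_tokens=False))  # different IDs from BERT's vocabulary
```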

Until now, speculative decoding only worked if both models, the large and the small one, spoke exactly the same digital language. They had to be graduates of the "same language academy" (in practice, that usually means they were developed by the same company and use the same token dictionary).

You couldn’t just take a model that spoke one company’s “tech‑speak” and pair it with another that used a completely different style of language.

This created a "lock-in" that severely limited the industry. Developers couldn't just pick the fastest small model on the market; they had to find one that was "language-compatible" with their large model, making the whole process cumbersome, expensive, and often simply impossible.

The research presented at ICML shatters that lock-in. Using three novel algorithms, the researchers (Nadav Timor, the paper's first author and a PhD student with Prof. David Harel at the Weizmann Institute, along with Intel Labs' Jonathan Mamou, Moshe Berchansky, Daniel Korat, Oren Pereg, and Moshe Wasserblat) developed a method that, for the first time, allows complete freedom in pairing models.

Now, developers can "mix and match" any small model with any large model, even if they were developed by different companies, are based on different architectures, and use entirely different vocabularies.

The implication is a revolution in accessibility and efficiency. Any developer can now choose the fastest small model and the most accurate large model for their task, combining them to get maximum performance at a minimum cost.
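
To give a feel for how mixing vocabularies can work at all, here is a heavily simplified sketch. It is my own illustration of one plausible route, using plain text as the shared interface between the two tokenizers, and not the paper's actual three algorithms; all functions and token IDs below are made up.

```python
# Conceptual sketch only: drafter and verifier use different tokenizers,
# so the draft is passed between them as plain text. All helpers and
# token IDs below are hypothetical.

TARGET_VOCAB = {"Paris": 901, "the": 12, "City": 344}      # made-up IDs
TARGET_VOCAB_INV = {v: k for k, v in TARGET_VOCAB.items()}

def small_model_draft_text(prompt: str, k: int) -> str:
    # The drafter generates k tokens in ITS OWN vocabulary, then decodes
    # them back into a plain string.
    return "Paris the City"

def target_encode(text: str) -> list[int]:
    # Re-encode the draft text with the LARGE model's tokenizer.
    return [TARGET_VOCAB[w] for w in text.split()]

def large_model_verify(prompt: str, draft_ids: list[int]) -> list[int]:
    # One pass of the large model over tokens from its own vocabulary;
    # here we pretend it accepts the whole draft.
    return draft_ids

def generate_cross_vocab(prompt: str, k: int = 3) -> str:
    draft_text = small_model_draft_text(prompt, k)    # drafter's language
    draft_ids = target_encode(draft_text)             # translated via text
    accepted = large_model_verify(prompt, draft_ids)  # verified by the target
    return prompt + " " + " ".join(TARGET_VOCAB_INV[i] for i in accepted)

print(generate_cross_vocab("The capital of France is"))
# The capital of France is Paris the City
```

In this toy version, the speedup still comes from the large model verifying the whole draft in one go, exactly as before, even though the two models never share a token dictionary.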

Basically, what they accomplished (and I want to be clear, they did the work, not me) is solving a fundamental problem that has hampered the flexibility and efficiency of generative AI systems. Their research shows how to turn speculative decoding into a universal tool.

The algorithms they developed take acceleration methods that were previously available only to organizations able to train their own custom small models, and make them accessible to everyone.

Check out the paper for more details. The researchers are shown in the photos. They deserve all the credit.

