Intel and Weizmann Institute Join Forces to Revolutionize LLM Performance for Everyone


About a week ago, the International Conference on Machine Learning (ICML), one of the most important research events in the AI world, took place. There, a team of researchers from Intel Labs, in collaboration with the Weizmann Institute of Science, published a fascinating study showing how to significantly speed up the performance of Large Language Models (LLMs).

What's really beautiful is that the research earned the honor of being selected for an Oral Presentation - a category reserved for the top 1% of the roughly 15,000 papers submitted to the conference.

Okay, a little background:

Large Language Models (LLMs) like ChatGPT are incredible tools, but they have a major Achilles' heel: they're slow and consume a lot of resources. Every answer they generate requires immense computing power, which causes delays (latency) for the user and dramatically drives up operating costs for companies. The need to accelerate these models is one of the biggest challenges in the AI industry today.

To tackle this, the industry uses a method most of you have probably heard of called "Speculative Decoding". The idea is that a small, nimble "assistant" model quickly guesses the next part of the text, and the large, powerful model only has to verify the guess instead of generating everything from scratch. For anyone not familiar with it, let me explain with the following example.

Let's say we want the model to complete the sentence: "The capital of France is..."

Without Speculative Decoding (the slow way):

The large model works step-by-step.

* Step 1: It generates the word "Paris" (requiring a lot of computational power).

* Step 2: It reads "The capital of France is Paris" and generates the next word: "the" (again, a lot of computational power).

* Step 3: It reads the entire new sentence and generates the word "City" (again, a full computational effort).

The result: To generate 3 words, the large model needed 3 separate and expensive "generation cycles."
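
To make the "slow way" concrete, here is a minimal Python sketch. The `large_model_next_token` function is a hypothetical stand-in for one full, expensive forward pass of the large model; a toy lookup table keeps the example runnable.

```python
# Minimal sketch of plain autoregressive decoding (the "slow way").
# large_model_next_token is a hypothetical stand-in for one full,
# expensive forward pass of the large model.

def large_model_next_token(text: str) -> str:
    # Toy lookup so the example runs; a real model would do a full
    # forward pass over the whole text here.
    canned = {
        "The capital of France is": "Paris",
        "The capital of France is Paris": "the",
        "The capital of France is Paris the": "City",
    }
    return canned.get(text, "<eos>")

def generate_slow(prompt: str, num_words: int) -> str:
    text = prompt
    for _ in range(num_words):                 # one expensive cycle per word
        text = f"{text} {large_model_next_token(text)}"
    return text

print(generate_slow("The capital of France is", 3))
# The capital of France is Paris the City  -> 3 expensive cycles
```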

With Speculative Decoding (the fast way):

The models work together.

* Step 1 (Quick Guess): The small, nimble "assistant" model reads the sentence and instantly guesses a three-word draft: "Paris," "the," "City" (it does this very quickly).

* Step 2 (One-Shot Verification): The large, powerful model gets this entire draft and checks it all at once. It asks itself: "Is this the right guess?" In this case, the answer is "Yes."

The result: To generate the same 3 words, the large model only needed one efficient "thinking cycle" to approve the whole draft.
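
And here is the same example with speculative decoding, again as a minimal sketch with hypothetical helper functions: the small model proposes a multi-word draft, and the large model checks the whole draft in a single pass.

```python
# Minimal sketch of speculative decoding (the "fast way").
# Both helper functions are hypothetical stand-ins.

def small_model_draft(prompt: str, k: int) -> list[str]:
    # Cheap drafter: guesses the next k words almost instantly.
    return ["Paris", "the", "City"][:k]

def large_model_verify(prompt: str, draft: list[str]) -> list[str]:
    # One expensive pass of the large model over the whole draft.
    # In this toy example it agrees with every drafted word, so the
    # entire draft is accepted; in general only the agreeing prefix
    # would be kept and the rest re-drafted.
    return draft

def generate_fast(prompt: str, k: int = 3) -> str:
    draft = small_model_draft(prompt, k)           # fast guess
    accepted = large_model_verify(prompt, draft)   # one expensive cycle
    return prompt + " " + " ".join(accepted)

print(generate_fast("The capital of France is"))
# The capital of France is Paris the City  -> 1 expensive cycle
```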

Now, I'm about to say something that probably every data scientist knows, but bear with me. AI models don't understand words like we do. Each model learns and builds its own unique "digital language" - a private dictionary of tokens that only it understands. The word "apple" in one model might be represented by token #123, while in another, it’s token #987.
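
You can see this for yourself with off-the-shelf tokenizers (the snippet below uses Hugging Face `transformers`; the two model names are just convenient examples): the same word gets different token IDs in different vocabularies.

```python
# Same word, different "digital languages": each tokenizer maps "apple"
# to IDs from its own vocabulary, so the numbers don't match.
from transformers import AutoTokenizer

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

word = "apple"
print(gpt2_tok.encode(word, add_special_tokens=False))  # IDs from GPT-2's vocabulary
print(bert_tok.encode(word, add_special_tokens=False))  # different IDs from BERT's vocabulary
```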

Until now, speculative decoding only worked if both models, the large and the small one, spoke exactly the same digital language. They had to be graduates of the "same language academy" (in practice, that usually means they were developed by the same company and use the same token dictionary).

You couldn’t just take a model that spoke one company’s “tech‑speak” and pair it with another that used a completely different style of language.

This created a "lock-in" that severely limited the industry. Developers couldn't just pick the fastest small model on the market; they had to find one that was "language-compatible" with their large model, making the whole process cumbersome, expensive, and often simply impossible.

The research presented at ICML shatters that lock-in. Using three novel algorithms, the researchers (Nadav Timor, the paper's first author and a PhD student with Prof. David Harel at the Weizmann Institute, along with Intel Labs' Jonathan Mamou, Moshe Berchansky, Daniel Korat, Oren Pereg, and Moshe Wasserblat) developed a method that, for the first time, allows complete freedom in pairing models.

Now, developers can "mix and match" any small model with any large model, even if they were developed by different companies, are based on different architectures, and use entirely different vocabularies.

The implication is a revolution in accessibility and efficiency. Any developer can now choose the fastest small model and the most accurate large model for their task, combining them to get maximum performance at a minimum cost.
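
To give a feel for how mixing vocabularies can work at all, here is a heavily simplified sketch. It is my own illustration of one plausible route, using plain text as the shared interface between the two tokenizers, and not the paper's actual three algorithms; all functions and token IDs below are made up.

```python
# Conceptual sketch only: drafter and verifier use different tokenizers,
# so the draft is passed between them as plain text. All helpers and
# token IDs below are hypothetical.

TARGET_VOCAB = {"Paris": 901, "the": 12, "City": 344}      # made-up IDs
TARGET_VOCAB_INV = {v: k for k, v in TARGET_VOCAB.items()}

def small_model_draft_text(prompt: str, k: int) -> str:
    # The drafter generates k tokens in ITS OWN vocabulary, then decodes
    # them back into a plain string.
    return "Paris the City"

def target_encode(text: str) -> list[int]:
    # Re-encode the draft text with the LARGE model's tokenizer.
    return [TARGET_VOCAB[w] for w in text.split()]

def large_model_verify(prompt: str, draft_ids: list[int]) -> list[int]:
    # One pass of the large model over tokens from its own vocabulary;
    # here we pretend it accepts the whole draft.
    return draft_ids

def generate_cross_vocab(prompt: str, k: int = 3) -> str:
    draft_text = small_model_draft_text(prompt, k)    # drafter's language
    draft_ids = target_encode(draft_text)             # translated via text
    accepted = large_model_verify(prompt, draft_ids)  # verified by the target
    return prompt + " " + " ".join(TARGET_VOCAB_INV[i] for i in accepted)

print(generate_cross_vocab("The capital of France is"))
# The capital of France is Paris the City
```

In this toy version, the speedup still comes from the large model verifying the whole draft in one go, exactly as before, even though the two models never share a token dictionary.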

Basically, what they accomplished (and I want to be clear, they did the work, not me) is solving a fundamental problem that has hampered the flexibility and efficiency of generative AI systems. Their research shows how to turn speculative decoding into a universal tool.

The algorithms they developed take acceleration methods that were previously available only to organizations able to train their own custom small models, and make them accessible to everyone.

Check out the paper for more details. The researchers are shown in the photos. They deserve all the credit.

