Unlocking the Power of Small Language Models with Agent Distillation

As large language models (LLMs) continue to push the boundaries of what’s possible—excelling at complex reasoning, multi-step problem solving, and real-world applications—their sheer size and computational demands present significant barriers to widespread adoption. Running inference on a 32-billion-parameter model requires extensive hardware resources and incurs high latency, making it impractical for many real-world deployments. But what if we could capture the reasoning prowess of these giant models in far smaller, more efficient ones?

That’s precisely the challenge addressed by the recent paper “Distilling LLM Agent into Small Models with Retrieval and Code Tools” (Kang et al., May 2025). In this work, the authors introduce Agent Distillation, a novel framework for teaching small language models (sLMs) to not only reason but also take actions—using external tools like retrieval systems and code interpreters—just as their large LLM “teachers” do. Below, I’ll break down the core ideas, key innovations, and practical implications of this approach.

Why Chain-of-Thought Distillation Falls Short

Traditionally, the go-to method for transferring reasoning ability from a large LLM to a smaller one has been Chain-of-Thought (CoT) Distillation. In CoT distillation, a powerful teacher model (e.g., a 32B-parameter LLM) is prompted to produce a step-by-step rationale (a “chain of thought”) for solving a problem—say, a multi-hop question or a math puzzle. The small model is then fine-tuned to mimic these rationales via next-token prediction.
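
As a rough illustration of what CoT distillation looks like in practice, here is a minimal sketch of assembling one training example; the prompt wording and field names are my own assumptions for illustration, not the paper's actual templates:

# Minimal sketch of building one CoT-distillation training example.
# Prompt wording and field names are illustrative assumptions.
def build_cot_example(question: str, teacher_rationale: str, answer: str) -> dict:
    prompt = f"Question: {question}\nThink step by step, then give the final answer."
    target = f"{teacher_rationale}\nFinal answer: {answer}"
    # The student is fine-tuned with plain next-token prediction on `target`
    # given `prompt`; no tools are involved at any point.
    return {"prompt": prompt, "completion": target}

example = build_cot_example(
    question="What is 17 * 24?",
    teacher_rationale="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    answer="408",
)
print(example["completion"])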

While CoT distillation can boost the performance of sLMs on certain benchmarks, it has two key limitations:

  1. Hallucination & Calculation Errors: Small models often hallucinate facts or make arithmetic mistakes when faced with questions requiring up-to-date knowledge or precise computation. Simply replaying a teacher’s written reasoning does not guarantee that the sLM can independently handle unseen facts or complex numerical steps.
  2. Static Traces, No Interaction: CoT distillation treats teacher reasoning as static text. There’s no way for the student to “look things up” or “run code” at inference time. In real life, many reasoning tasks require dynamic information retrieval (e.g., fetching a fact or browsing Wikipedia) or executing code for heavy calculations.

Agent Distillation addresses both issues by teaching small models how to interact with tools—just like a savvy human might switch to a calculator for arithmetic or search a knowledge base for missing details.

Introducing Agent Distillation: Reason ⇄ Act ⇄ Observe

At its heart, Agent Distillation revolves around reason-act-observe trajectories. Instead of collecting only step-by-step reasoning text, the teacher LLM runs as an “agent” that alternates between:

  1. Thought: Generating intermediate reasoning steps in natural language.
  2. Action: Calling an external tool (e.g., a code snippet to perform arithmetic, or a retrieval query to fetch facts).
  3. Observation: Incorporating results returned by the tool (execution outputs or search results) back into the reasoning process.

By logging complete trajectories—Thought → Action → Observation → Thought → Action, and so on—the student model sees not just “what to think” but “what to do” at each step. During distillation, the student is fine-tuned to reproduce the teacher’s sequence of thought/action pairs, learning to:

  • Formulate retrieval queries
  • Invoke code execution (e.g., using a Python interpreter to calculate large or tricky numerical expressions)
  • Handle and correct for execution errors (e.g., syntax mistakes or runtime exceptions)
  • Integrate retrieved knowledge into its reasoning

This interactive framework enables sLMs to defer heavy computation or rare fact retrieval to specialized tools, sidestepping hallucination and arithmetic slip-ups.
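
To make the data format concrete, here is a minimal sketch of how one logged teacher trajectory could be flattened into a fine-tuning target; the Thought/Action/Observation tags and step layout are illustrative assumptions, not the paper's exact schema:

# Minimal sketch: flatten one reason-act-observe trajectory into a training string.
# Tag names and step structure are illustrative assumptions.
def serialize_trajectory(question: str, steps: list) -> str:
    lines = [f"Question: {question}"]
    for step in steps:
        lines.append(f"Thought: {step['thought']}")
        lines.append(f"Action: {step['action']}")                # a search query or code snippet
        if "observation" in step:
            lines.append(f"Observation: {step['observation']}")  # tool output fed back to the model
    return "\n".join(lines)

steps = [
    {"thought": "I need the boiling point of ethanol.",
     "action": 'search("boiling point of ethanol in Celsius")',
     "observation": "Ethanol boils at about 78.37 C."},
    {"thought": "Convert 78.37 C to Fahrenheit with code.",
     "action": "print(78.37 * 9 / 5 + 32)",
     "observation": "173.066"},
]
print(serialize_trajectory("What is the boiling point of ethanol in Fahrenheit?", steps))

A common convention when fine-tuning on such traces, and the one assumed here, is to mask the tool-returned observation tokens from the loss so the student learns to produce thoughts and actions rather than to imitate tool outputs.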

Two Key Innovations: First-Thought Prefix & Self-Consistent Actions

Agent Distillation alone is a powerful idea, but pulling it off in practice—especially when distilling from a 32B model into a mere 0.5–3 billion-parameter sLM—requires careful attention to trajectory quality and student robustness. Kang et al. introduce two complementary techniques:

  1. First-Thought Prefix (FTP): the teacher agent's trajectory generation is seeded with the opening thought of a standard CoT rationale, so each trajectory begins with a high-level plan rather than an immediate tool call. This raises the quality of the teacher trajectories the student learns from.
  2. Self-Consistent Action Generation (SAG): at inference time, the student samples several candidate actions and filters out invalid ones, such as code that fails to parse or execute, keeping a consistent, runnable action (see the sketch below).
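
As a rough sketch of the SAG idea under my own assumptions (the paper's exact sampling and selection criteria may differ), filtering sampled code actions by executability might look like this:

# Rough sketch of self-consistent action generation: sample several candidate
# code actions, discard any that fail to run, and keep the action whose output
# agrees with the majority. `sample_action` is a hypothetical stand-in for the
# student model's sampler; the selection rule is my assumption.
from collections import Counter
import contextlib, io

def run_code(code: str):
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})                        # sandboxing omitted for brevity
        return buf.getvalue().strip()
    except Exception:
        return None                               # invalid action: parse or runtime failure

def select_action(sample_action, n_samples: int = 5):
    outcomes = []                                 # (output, code) for candidates that ran
    for _ in range(n_samples):
        code = sample_action()                    # one sampled Action from the student
        output = run_code(code)
        if output is not None:
            outcomes.append((output, code))
    if not outcomes:
        return None                               # every sample failed; caller can retry
    majority_output = Counter(out for out, _ in outcomes).most_common(1)[0][0]
    return next((out, code) for out, code in outcomes if out == majority_output)

The executability filter is the part the article emphasizes (filtering out invalid code or retrieval steps); the majority vote over outputs is an extra tie-breaking assumption on my part.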

Putting It to the Test: Factual & Mathematical Benchmarks

The authors evaluate Agent Distillation on eight standard reasoning benchmarks, spanning two categories:

  • Factual reasoning (multi-hop question answering)
  • Mathematical reasoning (math word problems)

Metrics focus on exact match accuracy for math problems and an LLM-as-judge for factual QA.
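
For the math tasks, the exact-match check is conceptually simple; a minimal, illustrative version (the normalization rules are my assumption, not the paper's evaluation script) could be:

# Illustrative exact-match check for numeric answers; normalization is an assumption.
def exact_match(prediction: str, gold: str) -> bool:
    normalize = lambda s: s.strip().lower().replace(",", "").replace("$", "")
    return normalize(prediction) == normalize(gold)

print(exact_match("$1,950", "1950"))   # True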

Baseline Comparison

  • A 32B teacher LLM (Qwen 2.5-32B-Instruct) demonstrates strong performance when prompted with CoT or run as an agent.
  • Student models range from 0.5B to 7B parameters (all Qwen 2.5-Instruct variants, already instruction-tuned before distillation).
  • Baselines include classic CoT distillation (with or without retrieval-augmented generation) and zero-shot agent prompting (no distillation).

Key Findings

  1. Substantial Accuracy Gains: Distilled student agents consistently outperform CoT-distilled students of the same size on both the factual and the mathematical benchmarks.
  2. Out-of-Domain Robustness: The gains are largest on benchmarks outside the training distribution, where replaying static CoT traces generalizes poorly.
  3. Better Generalization with Tools: Because retrieval and code execution offload memorization and arithmetic, even the smallest students punch well above their weight against much larger CoT-distilled models.
  4. FTP Matters for Harder Problems: Prefixing the teacher's first CoT thought yields higher-quality trajectories, which translates into the biggest improvements on complex, multi-step questions.
  5. SAG Reduces Code Errors: Filtering sampled actions at inference time cuts parse and execution failures, stabilizing student performance.

How Small Agents Think & Act

Putting it all together, what does a distilled agent actually do at inference time? Imagine the student model encountering a problem like:

“What would $300 invested in NVIDIA stock at the start of 2015 be worth by the end of 2020?”

A CoT-only sLM might try to recall historical prices (and likely hallucinate) or attempt mental arithmetic (prone to mistakes). The distilled agent, by contrast, proceeds as follows:

  1. Thought (Step 1) “First, I need the price of NVIDIA stock in January 2015 and December 2020, along with any stock splits in between.”
  2. Action (Step 1) Calls the retrieval tool:

search("NVIDIA stock price history 2015 to 2020 including splits")        

  3. Observation (Step 1) The search results come back showing a price of roughly $20 per share in January 2015 and about $130 in December 2020, with no stock splits in between.
  4. Thought (Step 2) “With $300, I can buy 300/20 = 15 shares in Jan 2015. No splits occurred, so I still have 15 shares by Dec 2020. At $130 each, total = 15 × 130 = $1,950.”
  5. Action (Step 2) Executes a short Python snippet:

initial, p2015, p2020 = 300, 20, 130   # dollars invested, Jan-2015 price, Dec-2020 price
shares = initial / p2015               # 15 shares purchased
final_value = shares * p2020           # value of those shares at the end of 2020
print(final_value)                     # 1950.0

  6. Observation (Step 2) The interpreter returns 1950.0.
  7. Thought (Final) “Therefore, $300 invested in NVIDIA at the start of 2015 would be worth $1,950 by the end of 2020.”

This back-and-forth process allows a small model to handle both factual lookup and precise numeric computation, without memorizing entire stock histories or risking arithmetic mistakes.
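
Tying the example together, a minimal driver loop for such a distilled agent might look like the following; `generate`, `search`, and `run_python` are hypothetical stand-ins for the student's decoder, a retrieval backend, and a sandboxed interpreter, and the stop tokens and parsing scheme are assumptions for illustration rather than the paper's actual interface:

# Minimal sketch of the inference-time reason-act-observe loop for a distilled student.
def agent_loop(question: str, generate, search, run_python, max_steps: int = 6) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(context, stop=["Observation:"])    # model emits Thought + Action
        context += step
        if "Final answer:" in step:                        # the agent decided to stop
            return step.split("Final answer:")[-1].strip()
        if 'search("' in step:                             # retrieval action
            query = step.split('search("')[-1].split('")')[0]
            observation = search(query)
        else:                                              # otherwise treat the action as code
            code = step.split("Action:")[-1].strip()
            observation = run_python(code)
        context += f"\nObservation: {observation}\n"       # feed the tool result back in
    return "No final answer within the step budget."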

Practical Takeaways and Impact

  1. Efficiency Without Sacrificing Accuracy: A distilled 0.5B–3B agent runs at a fraction of the cost and latency of a 32B teacher while retaining much of its problem-solving ability.
  2. Better Out-of-Domain Generalization: Because the student defers to tools instead of relying on memorized text, it degrades far more gracefully on unfamiliar domains.
  3. Guidance on Distillation Strategy: The results argue for investing in teacher trajectory quality (FTP) and inference-time robustness (SAG) rather than simply scaling up the student.
  4. Future Directions: Extending the approach to richer tool sets and to longer-horizon agentic tasks remains an open direction.

Conclusion

Agent Distillation represents a significant leap toward democratizing intelligent language agents. By combining structured reasoning (via CoT prefacing) and dynamic tool use (retrieval + code execution), this framework allows a tiny 0.5 billion-parameter model to punch well above its weight—solving multi-hop QA and Olympiad-style math at levels previously reserved for much larger LLMs.

For practitioners and researchers alike, Agent Distillation offers a compelling blueprint:

  1. Teach small models not just “what to think,” but “what to do” and when to defer to specialized tools.
  2. Invest in high-quality teacher trajectories (FTP) to guide student reasoning.
  3. Ensure robust inference (SAG) by filtering out invalid code or retrieval steps.

If you’re exploring how to bring advanced reasoning capabilities into resource-constrained environments—whether inside a mobile app, a real-time analytics dashboard, or a corporate chatbot—Agent Distillation is a methodology worth examining. It paves the way for truly interactive, cost-effective language agents that can “think + act + learn” on the fly, all within the footprint of a model a fraction of the size.

References (for those who want to dive deeper)

  • Kang, M., Jeong, J., Lee, S., Cho, J., & Hwang, S. J. (2025). Distilling LLM Agent into Small Models with Retrieval and Code Tools. arXiv:2505.17612.
  • ReAct: Yao, S. et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR.
  • CodeAct: Wang, X. et al. (2024). Executable Code Actions Elicit Better LLM Agents. ICML.
