Unlocking the Power of Small Language Models with Agent Distillation

As large language models (LLMs) continue to push the boundaries of what’s possible—excelling at complex reasoning, multi-step problem solving, and real-world applications—their sheer size and computational demands present significant barriers to widespread adoption. Running inference on a 32-billion-parameter model requires extensive hardware resources and incurs high latency, making it impractical for many real-world deployments. But what if we could capture the reasoning prowess of these giant models in far smaller, more efficient ones?

That’s precisely the challenge addressed by the recent paper “Distilling LLM Agent into Small Models with Retrieval and Code Tools” (Kang et al., May 2025). In this work, the authors introduce Agent Distillation, a novel framework for teaching small language models (sLMs) to not only reason but also take actions—using external tools like retrieval systems and code interpreters—just as their large LLM “teachers” do. Below, I’ll break down the core ideas, key innovations, and practical implications of this approach.

Why Chain-of-Thought Distillation Falls Short

Traditionally, the go-to method for transferring reasoning ability from a large LLM to a smaller one has been Chain-of-Thought (CoT) Distillation. In CoT distillation, a powerful teacher model (e.g., a 32B-parameter LLM) is prompted to produce a step-by-step rationale (a “chain of thought”) for solving a problem—say, a multi-hop question or a math puzzle. The small model is then fine-tuned to mimic these rationales via next-token prediction.
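
As a rough illustration of what CoT distillation looks like in practice, here is a minimal sketch of assembling one training example; the prompt wording and field names are my own assumptions for illustration, not the paper's actual templates:

# Minimal sketch of building one CoT-distillation training example.
# Prompt wording and field names are illustrative assumptions.
def build_cot_example(question: str, teacher_rationale: str, answer: str) -> dict:
    prompt = f"Question: {question}\nThink step by step, then give the final answer."
    target = f"{teacher_rationale}\nFinal answer: {answer}"
    # The student is fine-tuned with plain next-token prediction on `target`
    # given `prompt`; no tools are involved at any point.
    return {"prompt": prompt, "completion": target}

example = build_cot_example(
    question="What is 17 * 24?",
    teacher_rationale="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    answer="408",
)
print(example["completion"])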

While CoT distillation can boost the performance of sLMs on certain benchmarks, it has two key limitations:

  1. Hallucination & Calculation Errors: Small models often hallucinate facts or make arithmetic mistakes when faced with questions requiring up-to-date knowledge or precise computation. Simply replaying a teacher’s written reasoning does not guarantee that the sLM can independently handle unseen facts or complex numerical steps.
  2. Static Traces, No Interaction: CoT distillation treats teacher reasoning as static text. There’s no way for the student to “look things up” or “run code” at inference time. In real life, many reasoning tasks require dynamic information retrieval (e.g., fetching a fact or browsing Wikipedia) or executing code for heavy calculations.

Agent Distillation addresses both issues by teaching small models how to interact with tools—just like a savvy human might switch to a calculator for arithmetic or search a knowledge base for missing details.

Introducing Agent Distillation: Reason ⇄ Act ⇄ Observe

At its heart, Agent Distillation revolves around reason-act-observe trajectories. Instead of collecting only step-by-step reasoning text, the teacher LLM runs as an “agent” that alternates between:

  1. Thought: Generating intermediate reasoning steps in natural language.
  2. Action: Calling an external tool (e.g., a code snippet to perform arithmetic, or a retrieval query to fetch facts).
  3. Observation: Incorporating results returned by the tool (execution outputs or search results) back into the reasoning process.

By logging complete trajectories—Thought → Action → Observation → Thought → Action, and so on—the student model sees not just “what to think” but “what to do” at each step. During distillation, the student is fine-tuned to reproduce the teacher’s sequence of thought/action pairs, learning to:

  • Formulate retrieval queries
  • Invoke code execution (e.g., using a Python interpreter to calculate large or tricky numerical expressions)
  • Handle and correct for execution errors (e.g., syntax mistakes or runtime exceptions)
  • Integrate retrieved knowledge into its reasoning

This interactive framework enables sLMs to defer heavy computation or rare fact retrieval to specialized tools, sidestepping hallucination and arithmetic slip-ups.
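
To make the data format concrete, here is a minimal sketch of how one logged teacher trajectory could be flattened into a fine-tuning target; the Thought/Action/Observation tags and step layout are illustrative assumptions, not the paper's exact schema:

# Minimal sketch: flatten one reason-act-observe trajectory into a training string.
# Tag names and step structure are illustrative assumptions.
def serialize_trajectory(question: str, steps: list) -> str:
    lines = [f"Question: {question}"]
    for step in steps:
        lines.append(f"Thought: {step['thought']}")
        lines.append(f"Action: {step['action']}")                # a search query or code snippet
        if "observation" in step:
            lines.append(f"Observation: {step['observation']}")  # tool output fed back to the model
    return "\n".join(lines)

steps = [
    {"thought": "I need the boiling point of ethanol.",
     "action": 'search("boiling point of ethanol in Celsius")',
     "observation": "Ethanol boils at about 78.37 C."},
    {"thought": "Convert 78.37 C to Fahrenheit with code.",
     "action": "print(78.37 * 9 / 5 + 32)",
     "observation": "173.066"},
]
print(serialize_trajectory("What is the boiling point of ethanol in Fahrenheit?", steps))

A common convention when fine-tuning on such traces, and the one assumed here, is to mask the tool-returned observation tokens from the loss so the student learns to produce thoughts and actions rather than to imitate tool outputs.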

Two Key Innovations: First-Thought Prefix & Self-Consistent Actions

Agent Distillation alone is a powerful idea, but pulling it off in practice—especially when distilling from a 32B model into a mere 0.5–3 billion-parameter sLM—requires careful attention to trajectory quality and student robustness. Kang et al. introduce two complementary techniques:

  1. First-Thought Prefix (FTP): the teacher agent's trajectory generation is seeded with the opening thought of a standard CoT rationale, so each trajectory begins with a high-level plan rather than an immediate tool call. This raises the quality of the teacher trajectories the student learns from.
  2. Self-Consistent Action Generation (SAG): at inference time, the student samples several candidate actions and filters out invalid ones, such as code that fails to parse or execute, keeping a consistent, runnable action (see the sketch below).
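
As a rough sketch of the SAG idea under my own assumptions (the paper's exact sampling and selection criteria may differ), filtering sampled code actions by executability might look like this:

# Rough sketch of self-consistent action generation: sample several candidate
# code actions, discard any that fail to run, and keep the action whose output
# agrees with the majority. `sample_action` is a hypothetical stand-in for the
# student model's sampler; the selection rule is my assumption.
from collections import Counter
import contextlib, io

def run_code(code: str):
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})                        # sandboxing omitted for brevity
        return buf.getvalue().strip()
    except Exception:
        return None                               # invalid action: parse or runtime failure

def select_action(sample_action, n_samples: int = 5):
    outcomes = []                                 # (output, code) for candidates that ran
    for _ in range(n_samples):
        code = sample_action()                    # one sampled Action from the student
        output = run_code(code)
        if output is not None:
            outcomes.append((output, code))
    if not outcomes:
        return None                               # every sample failed; caller can retry
    majority_output = Counter(out for out, _ in outcomes).most_common(1)[0][0]
    return next((out, code) for out, code in outcomes if out == majority_output)

The executability filter is the part the article emphasizes (filtering out invalid code or retrieval steps); the majority vote over outputs is an extra tie-breaking assumption on my part.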

Putting It to the Test: Factual & Mathematical Benchmarks

The authors evaluate Agent Distillation on eight standard reasoning benchmarks, spanning two categories:

  • Factual reasoning (multi-hop question answering)
  • Mathematical reasoning (math word problems)

Metrics focus on exact match accuracy for math problems and an LLM-as-judge for factual QA.
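
For the math tasks, the exact-match check is conceptually simple; a minimal, illustrative version (the normalization rules are my assumption, not the paper's evaluation script) could be:

# Illustrative exact-match check for numeric answers; normalization is an assumption.
def exact_match(prediction: str, gold: str) -> bool:
    normalize = lambda s: s.strip().lower().replace(",", "").replace("$", "")
    return normalize(prediction) == normalize(gold)

print(exact_match("$1,950", "1950"))   # True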

Baseline Comparison

  • A 32B teacher LLM (Qwen 2.5-32B-Instruct) demonstrates strong performance when prompted with CoT or run as an agent.
  • Student models range from 0.5B to 7B parameters (all Qwen 2.5-Instruct variants, already instruction-tuned before distillation).
  • Baselines include classic CoT distillation (with or without retrieval-augmented generation) and zero-shot agent prompting (no distillation).

Key Findings

  1. Substantial Accuracy Gains: Distilled student agents consistently outperform CoT-distilled students of the same size on both the factual and the mathematical benchmarks.
  2. Out-of-Domain Robustness: The gains are largest on benchmarks outside the training distribution, where replaying static CoT traces generalizes poorly.
  3. Better Generalization with Tools: Because retrieval and code execution offload memorization and arithmetic, even the smallest students punch well above their weight against much larger CoT-distilled models.
  4. FTP Matters for Harder Problems: Prefixing the teacher's first CoT thought yields higher-quality trajectories, which translates into the biggest improvements on complex, multi-step questions.
  5. SAG Reduces Code Errors: Filtering sampled actions at inference time cuts parse and execution failures, stabilizing student performance.

How Small Agents Think & Act

Putting it all together, what does a distilled agent actually do at inference time? Imagine the student model encountering a problem like:

“What would $300 invested in NVIDIA stock at the start of 2015 be worth by the end of 2020?”

A CoT-only sLM might try to recall historical prices (and likely hallucinate) or attempt mental arithmetic (prone to mistakes). The distilled agent, by contrast, proceeds as follows:

  1. Thought (Step 1) “First, I need the price of NVIDIA stock in January 2015 and December 2020, along with any stock splits in between.”
  2. Action (Step 1) Calls the retrieval tool:

search("NVIDIA stock price history 2015 to 2020 including splits")        

  3. Observation (Step 1) The search results come back showing a price of roughly $20 per share in January 2015 and about $130 in December 2020, with no stock splits in between.
  4. Thought (Step 2) “With $300, I can buy 300/20 = 15 shares in Jan 2015. No splits occurred, so I still have 15 shares by Dec 2020. At $130 each, total = 15 × 130 = $1,950.”
  5. Action (Step 2) Executes a short Python snippet:

initial, p2015, p2020 = 300, 20, 130   # dollars invested, Jan-2015 price, Dec-2020 price
shares = initial / p2015               # 15 shares purchased
final_value = shares * p2020           # value of those shares at the end of 2020
print(final_value)                     # 1950.0

  6. Observation (Step 2) The interpreter returns 1950.0.
  7. Thought (Final) “Therefore, $300 invested in NVIDIA at the start of 2015 would be worth $1,950 by the end of 2020.”

This back-and-forth process allows a small model to handle both factual lookup and precise numeric computation, without memorizing entire stock histories or risking arithmetic mistakes.
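
Tying the example together, a minimal driver loop for such a distilled agent might look like the following; `generate`, `search`, and `run_python` are hypothetical stand-ins for the student's decoder, a retrieval backend, and a sandboxed interpreter, and the stop tokens and parsing scheme are assumptions for illustration rather than the paper's actual interface:

# Minimal sketch of the inference-time reason-act-observe loop for a distilled student.
def agent_loop(question: str, generate, search, run_python, max_steps: int = 6) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(context, stop=["Observation:"])    # model emits Thought + Action
        context += step
        if "Final answer:" in step:                        # the agent decided to stop
            return step.split("Final answer:")[-1].strip()
        if 'search("' in step:                             # retrieval action
            query = step.split('search("')[-1].split('")')[0]
            observation = search(query)
        else:                                              # otherwise treat the action as code
            code = step.split("Action:")[-1].strip()
            observation = run_python(code)
        context += f"\nObservation: {observation}\n"       # feed the tool result back in
    return "No final answer within the step budget."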

Practical Takeaways and Impact

  1. Efficiency Without Sacrificing Accuracy: A distilled 0.5B–3B agent runs at a fraction of the cost and latency of a 32B teacher while retaining much of its problem-solving ability.
  2. Better Out-of-Domain Generalization: Because the student defers to tools instead of relying on memorized text, it degrades far more gracefully on unfamiliar domains.
  3. Guidance on Distillation Strategy: The results argue for investing in teacher trajectory quality (FTP) and inference-time robustness (SAG) rather than simply scaling up the student.
  4. Future Directions: Extending the approach to richer tool sets and to longer-horizon agentic tasks remains an open direction.

Conclusion

Agent Distillation represents a significant leap toward democratizing intelligent language agents. By combining structured reasoning (via CoT prefacing) and dynamic tool use (retrieval + code execution), this framework allows a tiny 0.5 billion-parameter model to punch well above its weight—solving multi-hop QA and Olympiad-style math at levels previously reserved for much larger LLMs.

For practitioners and researchers alike, Agent Distillation offers a compelling blueprint:

  1. Teach small models not just “what to think,” but “what to do” and when to defer to specialized tools.
  2. Invest in high-quality teacher trajectories (FTP) to guide student reasoning.
  3. Ensure robust inference (SAG) by filtering out invalid code or retrieval steps.

If you’re exploring how to bring advanced reasoning capabilities into resource-constrained environments—whether inside a mobile app, a real-time analytics dashboard, or a corporate chatbot—Agent Distillation is a methodology worth examining. It paves the way for truly interactive, cost-effective language agents that can “think + act + learn” on the fly, all within the footprint of a model a fraction of the size.

References (for those who want to dive deeper)

  • Kang, M., Jeong, J., Lee, S., Cho, J., & Hwang, S. J. (2025). Distilling LLM Agent into Small Models with Retrieval and Code Tools. arXiv:2505.17612.
  • ReAct: Yao, S. et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR.
  • CodeAct: Wang, X. et al. (2024). Executable Code Actions Elicit Better LLM Agents. ICML.
