SWiRL: Advancing Multi-Step Reasoning and Tool Use in Large Language Models

By Nick Gupta


Why SWiRL Matters

Large Language Models (LLMs) have evolved into powerful engines for natural language processing, but when it comes to complex, multi-step reasoning—think multi-hop question answering, math problem solving, or orchestrating external tools—they often stumble.

The culprit? Most fine-tuning approaches like RLHF (Reinforcement Learning from Human Feedback) and RLAIF (RL from AI Feedback) optimize for single-step outputs, giving the model feedback only after the final answer. This leaves a massive gap: errors made early in a reasoning chain are left uncorrected until it’s too late.

Enter SWiRL (Step-Wise Reinforcement Learning) — a method that shifts the focus from “Was the final answer correct?” to “Was each step along the way sound and useful?”


What is SWiRL?

SWiRL is a two-stage, offline RL framework designed to improve reasoning across multiple steps of thought and tool usage. Instead of waiting until the end of a task to score the model, SWiRL evaluates each intermediate action in the reasoning chain, reinforcing good reasoning habits throughout the process.

Stage 1 — Multi-Step Synthetic Data Generation

  • An open-source LLM (e.g., Gemma 2) is augmented with tools like a search engine or calculator.

  • It generates multi-step reasoning trajectories—each a sequence of intermediate thoughts, tool calls, and results.

  • These trajectories are split into sub-trajectories (one per action) and then filtered, either by final-answer correctness (outcome filtering) or by a judge model's assessment of each step (process filtering); a minimal sketch of this stage follows the list.

  • Surprisingly, SWiRL performs best with process-only filtering—even incorrect final answers can contain valuable reasoning steps.
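
To make the data-generation stage concrete, here is a minimal Python sketch. It assumes hypothetical callables `propose_step` (the tool-augmented LLM choosing its next action) and `run_tool` (a search engine or calculator backend), plus a simple `Step` container; none of these names come from the paper, and the rendering format is illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Step:
    """One action in a trajectory: a thought plus an optional tool call."""
    thought: str
    tool_call: Optional[str] = None      # e.g., a search query or calculator expression
    tool_result: Optional[str] = None
    is_final: bool = False

    def render(self) -> str:
        text = self.thought
        if self.tool_call is not None:
            text += f"\nCALL: {self.tool_call} -> {self.tool_result}"
        return text + "\n"

def generate_trajectory(question: str,
                        propose_step: Callable[[str], Step],
                        run_tool: Callable[[str], str],
                        max_steps: int = 5) -> List[Step]:
    """Roll out thought -> tool call -> observation until a final answer."""
    context, trajectory = question + "\n", []
    for _ in range(max_steps):
        step = propose_step(context)             # tool-augmented LLM proposes the next action
        if step.tool_call is not None:
            step.tool_result = run_tool(step.tool_call)
        trajectory.append(step)
        context += step.render()                 # feed the step back into the running context
        if step.is_final:
            break
    return trajectory

def split_into_subtrajectories(question: str, trajectory: List[Step]) -> List[dict]:
    """One training example per action: (everything seen so far, the action taken).
    Filtering (outcome- or process-based) is applied to these examples afterward."""
    examples, prefix = [], question + "\n"
    for step in trajectory:
        examples.append({"context": prefix, "action": step.render()})
        prefix += step.render()
    return examples
```

Each sub-trajectory pairs a growing context with the single action taken next, which is exactly the unit that gets scored and reinforced in Stage 2.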

Stage 2 — Step-Wise RL Optimization

  • A reward model scores each action in context, without using golden labels.

  • Policy gradient methods optimize the base LLM to maximize per-step rewards.

  • This granular reinforcement encourages better local decision-making and overall plan quality; a simplified sketch of the update follows this list.
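
Below is a simplified, REINFORCE-style training step, not the paper's exact recipe. It assumes `policy` and `tokenizer` are a Hugging Face-style causal LM and its tokenizer, `reward_model(context, action)` returns a scalar score for an action in context, and `batch` holds the per-step examples produced in Stage 1.

```python
import torch

def step_wise_rl_update(policy, tokenizer, optimizer, reward_model, batch):
    """One gradient step that pushes up the log-probability of each action
    in proportion to its per-step reward (REINFORCE-style objective)."""
    losses = []
    for example in batch:
        context, action = example["context"], example["action"]
        reward = reward_model(context, action)     # score this action in context, no golden labels

        inputs = tokenizer(context + action, return_tensors="pt")
        ctx_len = tokenizer(context, return_tensors="pt")["input_ids"].shape[1]

        logits = policy(**inputs).logits[:, :-1, :]            # predictions for tokens 1..T
        targets = inputs["input_ids"][:, 1:]
        logprobs = torch.log_softmax(logits, dim=-1).gather(
            -1, targets.unsqueeze(-1)).squeeze(-1)

        # Sum log-probs over the action tokens only (token-boundary effects ignored for brevity).
        action_logprob = logprobs[:, ctx_len - 1:].sum()
        losses.append(-reward * action_logprob)

    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because every action carries its own reward, a weak step in the middle of an otherwise good trajectory is penalized directly instead of being blamed on (or hidden by) the final answer.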


How SWiRL Differs from Traditional RLHF

  • Feedback timing – RLHF gives feedback only after the final answer, while SWiRL provides it after every step.

  • Labels needed – RLHF relies on human or golden labels; SWiRL uses model-based judgment and requires none.

  • Tool use – RLHF doesn’t explicitly train tool usage; SWiRL includes built-in multi-step tool orchestration.

  • Error handling – In RLHF, early mistakes often derail the final output; SWiRL can recover mid-trajectory.


Key Results

In rigorous tests across multi-hop QA and math reasoning benchmarks, SWiRL delivered double-digit gains over baselines:

  • +21.5% on GSM8K (math)

  • +15.3% on BeerQA

  • +14.8% on CofCA

  • +11.1% on MuSiQue

  • +12.3% on HotPotQA

Even more exciting: cross-domain generalization works.

  • Training only on HotPotQA improved GSM8K math reasoning by +16.9%.

  • Training only on GSM8K improved HotPotQA QA performance by +9.2%.


Why Process Filtering Wins

One surprising finding: Outcome filtering (keeping only correct answers) often hurt performance. Models trained solely on “perfect” examples became less robust, likely due to overfitting.

By contrast, process filtering—judging each reasoning step—yielded higher accuracy across tasks. Incorrect answers often still contained valuable reasoning segments that taught the model how to recover from mistakes.
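
The contrast is easy to state in code. This is a sketch under my own naming: `is_correct` stands in for a golden-label check and `judge_step` for an LLM judge, and neither signature comes from the paper.

```python
from typing import Callable, List, Sequence

Trajectory = List[str]   # each element is one rendered reasoning or tool step

def outcome_filter(trajectories: Sequence[Trajectory],
                   is_correct: Callable[[str], bool]) -> List[Trajectory]:
    """Keep only trajectories whose final step contains a correct answer."""
    return [t for t in trajectories if is_correct(t[-1])]

def process_filter(trajectories: Sequence[Trajectory],
                   judge_step: Callable[[str, str], bool]) -> List[Trajectory]:
    """Keep trajectories in which every step is judged reasonable given the
    text that precedes it, regardless of whether the final answer is right."""
    kept = []
    for t in trajectories:
        if all(judge_step("".join(t[:i]), step) for i, step in enumerate(t)):
            kept.append(t)
    return kept
```

The process filter keeps the kind of imperfect-but-instructive data the outcome filter throws away, which is where the recovery behavior appears to come from.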


Scaling & Model Size Impact

  • Dataset Size: Performance improves significantly with more trajectories. Gains appear even with just 1,000 trajectories; 10,000+ continues to help.

  • Model Size: Larger models (e.g., Gemma-2-27b) generalize better across domains, while smaller models benefit mostly in-domain.


Beyond the Numbers: Why SWiRL Is a Big Deal

  1. No Human Labels Needed — Cuts cost and speeds iteration.

  2. Error-Tolerant Learning — Learns from imperfect data.

  3. Tool-Aware Reasoning — Works seamlessly with calculators, retrievers, APIs.

  4. Cross-Task Transfer — Improves reasoning even in unseen domains.

  5. Robust Process Rewards — Optimizes reasoning quality, not just outcomes.


Implications for Industry

For enterprise AI deployments—whether it’s financial analysis, scientific research, customer service automation, or agentic task orchestration—SWiRL offers:

  • Higher accuracy on complex workflows

  • Better reliability in decision-making chains

  • Faster adaptation to new problem domains


Final Thoughts

Multi-step reasoning is the next frontier for LLMs, and SWiRL is a leap forward. By reinforcing good reasoning at every step, it closes the gap between current LLMs and the agents we want them to be—capable, reliable, and adaptable across diverse, tool-rich environments.

In a world where AI is expected to “think before it speaks,” SWiRL ensures it also thinks before every step.


Curious to learn more? The full research is available on arXiv: Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use (SWiRL).

