One of the biggest barriers to deploying LLM-based agents in real workflows is their poor performance on long-horizon reasoning. Agents often generate coherent short responses but struggle when a task requires planning, tool use, or multi-step decision-making. The issue is not just accuracy at the end, but the inability to reason through the middle. Without knowing which intermediate steps helped or hurt, agents cannot learn to improve. This makes long-horizon reasoning one of the hardest and most unsolved problems for LLM generalization.

It is relatively easy for a model to retrieve a document, answer a factual question, or summarize a short email. It is much harder to solve a billing dispute that requires searching, interpreting policy rules, applying edge cases, and adjusting the recommendation based on prior steps. Today's agents can generate answers, but they often fail to reflect, backtrack, or reconsider earlier assumptions.

A new paper from Google DeepMind and Stanford addresses this gap with a method called SWiRL: Step-Wise Reinforcement Learning. Rather than training a model to get the final answer right, SWiRL trains the model to improve each step in a reasoning chain. It does this by generating synthetic multi-step problem-solving traces, scoring every individual step using a reward model (Gemini 1.5 Pro), and fine-tuning the base model to favor higher-quality intermediate steps.

This approach fundamentally changes the way we train reasoning agents. Instead of optimizing for final outcomes, the model is updated based on how good each reasoning step was in context. For example, if the model generates a search query or a math step that is useful, even if the final answer is wrong, that step is rewarded and reinforced. Over time, the agent learns not just to answer, but to reason more reliably. This is a major departure from standard RLHF, which only gives feedback at the end.

SWiRL improves performance by 9.2 percent on HotPotQA, 16.9 percent on GSM8K when trained on HotPotQA, and 11 to 15 percent on other multi-hop and math datasets like MuSiQue, BeerQA, and CofCA. It generalizes across domains, works without golden labels, and outperforms both supervised fine-tuning and single-step RL methods.

The implications are substantial: we can now train models to reason better by scoring and optimizing their intermediate steps. Better reward models, iterative reflection, tool-assisted reasoning, and trajectory-level training will lead to more robust performance in multi-step tasks. This is not about mere performance improvement. It shows how we can begin to train agents not to mimic outputs, but to improve the quality of their thought process. That's essential if we want to build agents that work through problems, adapt to new tasks, and operate autonomously in open-ended environments.
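To make the step-wise idea concrete, here is a minimal sketch of how synthetic traces could be filtered by a step-level reward model before fine-tuning. This is my own illustration of the concept, not the paper's implementation: `generate_trajectory`, `reward_model_score`, and the keep-threshold are all assumptions standing in for the paper's trace generator, judge model, and filtering criteria.

```python
# Minimal sketch of step-wise reward filtering in the spirit of SWiRL (not the paper's code).
# Assumptions: `generate_trajectory` emits reasoning/tool-use steps from a base model, and
# `reward_model_score` is a judge model rating one step given the steps before it.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    context: str   # the question plus all prior steps
    action: str    # the step itself: a search query, a sub-answer, a math operation
    score: float = 0.0

def build_stepwise_dataset(
    questions: List[str],
    generate_trajectory: Callable[[str], List[Step]],
    reward_model_score: Callable[[str, str], float],
    threshold: float = 0.7,
) -> List[Step]:
    """Generate synthetic multi-step traces and keep only steps the judge rates highly.

    Steps are kept on their own merit, even when the trajectory's final answer turns
    out to be wrong -- that is the key difference from outcome-only feedback.
    """
    kept: List[Step] = []
    for question in questions:
        for step in generate_trajectory(question):
            step.score = reward_model_score(step.context, step.action)
            if step.score >= threshold:
                kept.append(step)
    return kept

# The kept (context, action) pairs would then be used to fine-tune the base model,
# optionally weighting each example by its step-level score.
```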
Stanford Method for Improving Open LLM Performance
Summary
The "Stanford Method for Improving Open LLM Performance" refers to several techniques developed at Stanford to help open-source large language models (LLMs) improve their reasoning, accuracy, and adaptability on complex tasks. These methods change how models learn and interact: refining each step of problem solving, using feedback expressed in natural language, and evolving the context around the model rather than adjusting its internal weights.
- Try step-wise feedback: Focus on helping models learn by giving feedback for each stage in a problem, so they can build better reasoning skills over time, not just aim for the correct final answer.
- Use textual critique: Encourage models to update their solutions based on constructive text-based feedback, rather than relying solely on traditional numeric error signals.
- Evolve context continually: Allow models to rewrite and reflect on their own prompts and instructions, building a richer memory and context for smarter decision-making without the need for retraining.
Stanford researchers just introduced a new way to optimize AI models using text-based feedback instead of traditional backpropagation. Deep learning has long relied on numerical gradients to fine-tune neural networks, but optimizing generative AI systems has been much harder because they interact through natural language, not numbers. TextGrad is the first framework to backpropagate language model feedback, enabling AI to iteratively refine its outputs across diverse tasks. Highlights:

1. Improved AI performance in PhD-level science Q&A, raising accuracy from 51.0% to 55.0% on GPQA and from 91.2% to 95.1% on MMLU physics.
2. Optimized medical treatment plans, outperforming human-designed radiotherapy plans by better balancing tumor targeting and organ protection.
3. Enhanced AI-driven drug discovery by iteratively refining molecular structures, generating high-affinity compounds faster than traditional methods.
4. Boosted complex AI agents like Chameleon, increasing multimodal reasoning accuracy by 7.7% through iterative feedback refinement.

The use of "textual gradients" instead of numerical gradients is pretty darn cool. TextGrad treats LLM feedback as textual gradients collected from every use of a variable in the system. By aggregating critiques from different contexts and iteratively updating variables (a process analogous to numerical gradient descent), the method smooths out individual inconsistencies.

I'm curious whether methods to validate and constrain textual gradients, beyond the paper's formalization of the propagation and update process, could be developed to improve robustness. Perhaps training secondary models to evaluate the quality and consistency of textual gradients, or an ensemble approach that generates multiple textual gradients from different LLMs or prompts? Just throwing some ideas out there; this stuff is pretty cool.

Here's the work: https://lnkd.in/gX8ABsdM Congrats to Mert Yuksekgonul, Federico Bianchi, Joseph Boen, James Zou, and co!
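The loop below is a generic sketch of the textual-gradient idea, not the TextGrad library's actual API: an output is critiqued in natural language, and the critique is fed back to revise it, much as a numerical gradient updates a weight. The `call_llm` helper and the prompt wording are assumptions standing in for whatever LLM client and prompts you would use.

```python
# Generic sketch of a textual-gradient loop (not the TextGrad library's API).
# `call_llm` is a hypothetical stand-in for any chat-completion client.
from typing import Callable

def textual_gradient_step(
    call_llm: Callable[[str], str],
    variable: str,     # the text being optimized, e.g. a prompt or an answer
    objective: str,    # natural-language description of what "good" means
) -> str:
    # "Backward pass": ask the model to criticize the current value w.r.t. the objective.
    critique = call_llm(
        f"Objective: {objective}\n\nCurrent text:\n{variable}\n\n"
        "Give specific, actionable criticism of the current text."
    )
    # "Update step": apply the critique, analogous to a gradient-descent update.
    return call_llm(
        f"Objective: {objective}\n\nCurrent text:\n{variable}\n\n"
        f"Criticism:\n{critique}\n\nRewrite the text to address the criticism."
    )

def optimize(call_llm: Callable[[str], str], variable: str, objective: str, steps: int = 3) -> str:
    # Repeated critique-and-revise iterations play the role of gradient descent steps.
    for _ in range(steps):
        variable = textual_gradient_step(call_llm, variable, objective)
    return variable
```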
Did Stanford just kill LLM fine-tuning? A new paper from Stanford, called Agentic Context Engineering (ACE), shows something wild: you can make models smarter without changing a single weight.

Here's how it works: instead of retraining the model, ACE evolves the context itself. The model writes its own prompt, reflects on what worked and what didn't, then rewrites it. Over and over. It becomes a self-improving system. Think of it like the model keeping a living notebook where every failure becomes a lesson and every success becomes a rule.

The results are impressive:
- 10.6% better than GPT-4-powered agents on AppWorld
- 8.6% improvement on financial reasoning tasks
- 86.9% lower cost and latency

No labeled data required, just feedback loops.

Here's the counterintuitive part: everyone's chasing short, clean prompts. ACE does the opposite. It builds dense, evolving playbooks that compound over time. Turns out LLMs don't need simplicity; they need context density.

The open question is how to manage all this accumulated information and experience. This is where a real-time memory layer for agents, like Zep AI (YC W24), can be a great solution and an active area of research going forward.
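As a rough illustration of the "living notebook" idea, here is a sketch of an evolving-context loop. It is my own simplification, not the ACE paper's algorithm: `call_llm`, `get_feedback`, and the prompt wording are assumptions for whatever model client and environment signal are available.

```python
# Rough sketch of an evolving-context loop in the spirit of ACE (not the paper's code).
# `call_llm` is a hypothetical chat-completion client; `get_feedback` stands in for
# whatever outcome signal exists (test results, tool errors, user reaction, etc.).
from typing import Callable, List

def evolve_playbook(
    call_llm: Callable[[str], str],
    tasks: List[str],
    get_feedback: Callable[[str, str], str],
    playbook: str = "",
) -> str:
    for task in tasks:
        # Act using the current playbook as extra context -- no weight updates anywhere.
        answer = call_llm(f"Playbook of lessons so far:\n{playbook}\n\nTask: {task}")
        outcome = get_feedback(task, answer)
        # Reflect: turn the outcome into lessons and rules, and fold them into the playbook.
        playbook = call_llm(
            f"Playbook:\n{playbook}\n\nTask: {task}\nAnswer: {answer}\nFeedback: {outcome}\n\n"
            "Rewrite the playbook, adding what worked as a rule and what failed as a lesson. "
            "Keep it dense but organized."
        )
    return playbook
```

The design choice worth noting is that all learning lives in the playbook text itself, which is why no labeled data or weight updates are needed.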