The AI Advice Everyone's Using Might Be Getting Less Useful
That "think step by step" trick with ChatGPT? New research suggests it's not the magic bullet it once was.
If you've used ChatGPT, Claude, or any AI chatbot in the past year, you've probably seen this advice everywhere: When you want better answers, tell the AI to "think step by step." It's called Chain-of-Thought prompting, and it's been the go-to trick for getting AI to show its work and reason through problems more carefully.
But here's the thing: new research from the University of Pennsylvania suggests this widely shared advice is becoming less valuable, and in some cases it might actually make things worse.
What the Researchers Found
A team led by researchers at Wharton tested this "think step by step" approach across eight different AI models, from ChatGPT to Claude to Google's Gemini. They used 198 PhD-level questions in biology, physics, and chemistry—the kind of questions so difficult that actual PhDs only get them right 65% of the time, and they're "Google-proof" (meaning even 30 minutes of web searching doesn't help much).
Here's what they discovered:
For newer AI models, Chain-of-Thought prompting often provides only tiny improvements. In some cases, it made performance worse. The reasoning models (like OpenAI's o3-mini and o4-mini) saw average improvements of just 2.9-3.1%, statistically significant but practically small. Among the non-reasoning models, GPT-4o-mini showed the smallest gains (4.4%), and those weren't even statistically significant.
It significantly slows things down and costs more. Chain-of-Thought requests took 35-600% longer than direct answers, so a 2-second ChatGPT response could stretch to anywhere from roughly 3 to 14 seconds. For reasoning models, the slowdown was 20-80%. Token usage can double or triple, meaning your API bills jump accordingly.
Modern AI already "thinks" without being asked. When researchers let models respond naturally (without forcing direct answers), they found many already performed some form of step-by-step reasoning by default. Explicitly asking for it provided much smaller benefits.
The researchers tested this by comparing three different approaches: asking for direct answers, asking models to "think step by step," and letting them respond naturally in chat mode. Many modern models already show their work somewhat when given conversational freedom, making explicit Chain-of-Thought prompts less valuable than earlier research suggested.
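If you want to try this comparison yourself, here is a minimal sketch of the three conditions using the OpenAI Python client. The prompt wording, placeholder question, and model name are illustrative assumptions, not the paper's exact materials.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

QUESTION = "A benchmark question would go here."

# Illustrative versions of the three conditions; the paper's exact wording differs.
CONDITIONS = {
    "direct": f"Answer with only the letter of the correct option.\n\n{QUESTION}",
    "chain_of_thought": f"Think step by step, then give your final answer.\n\n{QUESTION}",
    "default": QUESTION,  # no formatting constraints: let the model respond naturally
}

def run_condition(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Send one prompt and return the raw model response."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

for name, prompt in CONDITIONS.items():
    print(f"--- {name} ---")
    print(run_condition(prompt))
```

Running the same question through all three conditions (ideally many times each) is what lets you see whether "think step by step" actually changes anything for the model you use.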
The Catch-22 of Perfect Accuracy
Perhaps most surprisingly, Chain-of-Thought prompting created a strange trade-off. While it improved average performance, it actually hurt what researchers called "perfect accuracy"—getting questions right every single time across multiple attempts.
For three out of five non-reasoning models tested, Chain-of-Thought introduced more variability in answers. Models would sometimes get "easy" questions wrong that they would have answered correctly with direct prompting, even as they improved on harder questions overall.
Gemini Flash 2.0, for example, saw its perfect accuracy drop by 13.1 percentage points when using Chain-of-Thought, despite improvements in average performance. It's like having a student who suddenly starts making careless errors on simple problems while getting better at complex ones.
Why This Matters for Real-World Use
This research challenges one of the most common pieces of AI advice circulating online. If you're using AI for tasks where you need consistent, reliable answers—rather than just generally better average performance—Chain-of-Thought prompting might not be your best bet.
The researchers tested this rigorously: 25 trials per question across 198 questions per model (that's 4,950 tests per condition). They used multiple accuracy thresholds: 100% correct (all 25 attempts right), 90% correct (23 out of 25), and simple majority (13 out of 25). The pattern held across different measurement approaches.
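To make those thresholds concrete, here is a small sketch of how one question might be scored across 25 repeated attempts. The attempt data is made up, and the paper's actual scoring pipeline is more involved.

```python
def score_question(attempts: list[bool]) -> dict[str, bool]:
    """Score one question given 25 True/False results from repeated trials."""
    n = len(attempts)          # 25 trials per question in the study
    correct = sum(attempts)
    return {
        "perfect_100": correct == n,      # all 25 attempts correct
        "threshold_90": correct >= 23,    # at least 23 of 25 correct
        "majority": correct >= 13,        # at least 13 of 25 correct
    }

# Hypothetical example: a model gets the question right 21 times out of 25.
attempts = [True] * 21 + [False] * 4
print(score_question(attempts))
# {'perfect_100': False, 'threshold_90': False, 'majority': True}
```

This is where the trade-off shows up: a model can look better on the majority measure while getting worse on the perfect-accuracy measure.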
The Evolution of AI Reasoning
What's happening reflects how quickly AI is evolving. Chain-of-Thought prompting emerged from influential 2022 research that showed dramatic improvements on reasoning tasks. But today's models are fundamentally different from those early systems.
"We found that many non-reasoning models perform a version of Chain-of-Thought even if unprompted," the researchers note. When you remove formatting constraints and let modern AI respond naturally, it often shows its work automatically.
This creates diminishing returns for explicit prompting. It's like reminding someone to breathe—if they're already doing it naturally, the reminder doesn't help much.
So What Should You Do?
The research doesn't suggest abandoning Chain-of-Thought entirely, but it does recommend being more strategic:
For older or smaller AI models that don't naturally show their reasoning, "think step by step" can still provide meaningful improvements.
For cutting-edge models, consider whether you really need the step-by-step breakdown. If you're just looking for quick, accurate answers, direct prompting might be faster and just as good.
Weigh the trade-offs. If you're willing to wait longer and pay more (through token usage) for potentially more detailed reasoning, Chain-of-Thought still has value. But don't assume it's always better. A rough cost sketch follows this list.
Consider your consistency needs. If you need the same answer every time you ask the same question, Chain-of-Thought's increased variability might work against you.
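Here is the promised back-of-the-envelope cost sketch. The per-token price, request volume, and token counts are all assumed placeholders; substitute your provider's actual pricing and your own workload numbers.

```python
# Back-of-the-envelope comparison: direct answers vs. Chain-of-Thought output.
# The price below is a placeholder, not any provider's real rate.
PRICE_PER_1M_OUTPUT_TOKENS = 0.60  # assumed, in USD

def monthly_cost(requests: int, output_tokens_per_request: int) -> float:
    """Estimate monthly output-token spend for a given request volume."""
    total_tokens = requests * output_tokens_per_request
    return total_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT_TOKENS

REQUESTS_PER_MONTH = 100_000       # hypothetical workload
DIRECT_TOKENS = 150                # short, direct answer
COT_TOKENS = DIRECT_TOKENS * 3     # the study saw token usage double or triple

print(f"Direct: ${monthly_cost(REQUESTS_PER_MONTH, DIRECT_TOKENS):.2f}/month")
print(f"CoT:    ${monthly_cost(REQUESTS_PER_MONTH, COT_TOKENS):.2f}/month")
```

The absolute numbers don't matter; the point is that whatever your baseline spend is, Chain-of-Thought multiplies it.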
The Bigger Picture
This research highlights a broader truth about working with AI: what works today might not work tomorrow. As models become more sophisticated, our prompting strategies need to evolve too.
The researchers acknowledge limitations in their study—they tested one benchmark with relatively simple Chain-of-Thought prompts—but their findings align with what many heavy AI users have started noticing: the dramatic improvements from basic prompting techniques are becoming less dramatic.
It's not that Chain-of-Thought prompting is bad; it's that AI has gotten good enough that it often does this kind of thinking automatically. In a way, that's progress. The technique worked so well that it's now baked into how these systems operate.
The challenge for users is staying current with what actually helps versus what we think should help based on advice that was true six months ago. In the rapidly evolving world of AI, even the best practices have expiration dates.
This article is based on "Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting" by Meincke, Mollick, Mollick, and Shapiro from the University of Pennsylvania's Wharton School.
Listen to a podcast version of this article.
#ai #machinelearning #promptengineering #chainofthought #llms #mlops #airesearch #aiexplained #aitrends #aicosts #wharton #mollickpaper #ethanmollick #deeplearning #deeplearningwiththewolf