The AI Advice Everyone's Using Might Be Getting Less Useful
That "think step by step" trick with ChatGPT? New research suggests it's not the magic bullet it once was.
If you've used ChatGPT, Claude, or any AI chatbot in the past year, you've probably seen this advice everywhere: When you want better answers, tell the AI to "think step by step." It's called Chain-of-Thought prompting, and it's been the go-to trick for getting AI to show its work and reason through problems more carefully.
But here's the thing: new research from the University of Pennsylvania suggests this widely shared advice is becoming less valuable, and in some cases it might actually make things worse.
What the Researchers Found
A team led by researchers at Wharton tested this "think step by step" approach across eight different AI models, from ChatGPT to Claude to Google's Gemini. They used 198 PhD-level questions in biology, physics, and chemistry—the kind of questions so difficult that actual PhDs only get them right 65% of the time, and they're "Google-proof" (meaning even 30 minutes of web searching doesn't help much).
Here's what they discovered:
For newer AI models, Chain-of-Thought prompting often provides only tiny improvements. In some cases, it made performance worse. The reasoning models (like OpenAI's o3-mini and o4-mini) saw average improvements of just 2.9-3.1%, statistically significant but practically small. Among the non-reasoning models, GPT-4o-mini showed the smallest gains (4.4%), and those weren't even statistically significant.
It significantly slows things down and costs more. Chain-of-Thought requests took 35-600% longer than direct answers, so a 2-second ChatGPT response could stretch to anywhere from roughly 3 to 14 seconds. For reasoning models, the slowdown was 20-80%. Token usage can double or triple, meaning your API bills jump accordingly.
Modern AI already "thinks" without being asked. When researchers let models respond naturally (without forcing direct answers), they found many already performed some form of step-by-step reasoning by default. Explicitly asking for it provided much smaller benefits.
The researchers tested this by comparing three different approaches: asking for direct answers, asking models to "think step by step," and letting them respond naturally in chat mode. Many modern models already show their work somewhat when given conversational freedom, making explicit Chain-of-Thought prompts less valuable than earlier research suggested.
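If you want to try this comparison yourself, here is a minimal sketch of the three conditions using the OpenAI Python client. The prompt wording, placeholder question, and model name are illustrative assumptions, not the paper's exact materials.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

QUESTION = "A benchmark question would go here."

# Illustrative versions of the three conditions; the paper's exact wording differs.
CONDITIONS = {
    "direct": f"Answer with only the letter of the correct option.\n\n{QUESTION}",
    "chain_of_thought": f"Think step by step, then give your final answer.\n\n{QUESTION}",
    "default": QUESTION,  # no formatting constraints: let the model respond naturally
}

def run_condition(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Send one prompt and return the raw model response."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

for name, prompt in CONDITIONS.items():
    print(f"--- {name} ---")
    print(run_condition(prompt))
```

Running the same question through all three conditions (ideally many times each) is what lets you see whether "think step by step" actually changes anything for the model you use.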
The Catch-22 of Perfect Accuracy
Perhaps most surprisingly, Chain-of-Thought prompting created a strange trade-off. While it improved average performance, it actually hurt what researchers called "perfect accuracy"—getting questions right every single time across multiple attempts.
For three out of five non-reasoning models tested, Chain-of-Thought introduced more variability in answers. Models would sometimes get "easy" questions wrong that they would have answered correctly with direct prompting, even as they improved on harder questions overall.
Gemini Flash 2.0, for example, saw its perfect accuracy drop by 13.1 percentage points when using Chain-of-Thought, despite improvements in average performance. It's like having a student who suddenly starts making careless errors on simple problems while getting better at complex ones.
Why This Matters for Real-World Use
This research challenges one of the most common pieces of AI advice circulating online. If you're using AI for tasks where you need consistent, reliable answers—rather than just generally better average performance—Chain-of-Thought prompting might not be your best bet.
The researchers tested this rigorously: 25 trials per question across 198 questions per model (that's 4,950 tests per condition). They used multiple accuracy thresholds: 100% correct (all 25 attempts right), 90% correct (23 out of 25), and simple majority (13 out of 25). The pattern held across different measurement approaches.
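To make those thresholds concrete, here is a small sketch of how one question might be scored across 25 repeated attempts. The attempt data is made up, and the paper's actual scoring pipeline is more involved.

```python
def score_question(attempts: list[bool]) -> dict[str, bool]:
    """Score one question given 25 True/False results from repeated trials."""
    n = len(attempts)          # 25 trials per question in the study
    correct = sum(attempts)
    return {
        "perfect_100": correct == n,      # all 25 attempts correct
        "threshold_90": correct >= 23,    # at least 23 of 25 correct
        "majority": correct >= 13,        # at least 13 of 25 correct
    }

# Hypothetical example: a model gets the question right 21 times out of 25.
attempts = [True] * 21 + [False] * 4
print(score_question(attempts))
# {'perfect_100': False, 'threshold_90': False, 'majority': True}
```

This is where the trade-off shows up: a model can look better on the majority measure while getting worse on the perfect-accuracy measure.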
The Evolution of AI Reasoning
What's happening reflects how quickly AI is evolving. Chain-of-Thought prompting emerged from influential 2022 research that showed dramatic improvements on reasoning tasks. But today's models are fundamentally different from those early systems.
"We found that many non-reasoning models perform a version of Chain-of-Thought even if unprompted," the researchers note. When you remove formatting constraints and let modern AI respond naturally, it often shows its work automatically.
This creates diminishing returns for explicit prompting. It's like reminding someone to breathe—if they're already doing it naturally, the reminder doesn't help much.
So What Should You Do?
The research doesn't suggest abandoning Chain-of-Thought entirely, but it does recommend being more strategic:
For older or smaller AI models that don't naturally show their reasoning, "think step by step" can still provide meaningful improvements.
For cutting-edge models, consider whether you really need the step-by-step breakdown. If you're just looking for quick, accurate answers, direct prompting might be faster and just as good.
Weigh the trade-offs. If you're willing to wait longer and pay more (through token usage) for potentially more detailed reasoning, Chain-of-Thought still has value. But don't assume it's always better. A rough cost sketch follows this list.
Consider your consistency needs. If you need the same answer every time you ask the same question, Chain-of-Thought's increased variability might work against you.
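Here is the promised back-of-the-envelope cost sketch. The per-token price, request volume, and token counts are all assumed placeholders; substitute your provider's actual pricing and your own workload numbers.

```python
# Back-of-the-envelope comparison: direct answers vs. Chain-of-Thought output.
# The price below is a placeholder, not any provider's real rate.
PRICE_PER_1M_OUTPUT_TOKENS = 0.60  # assumed, in USD

def monthly_cost(requests: int, output_tokens_per_request: int) -> float:
    """Estimate monthly output-token spend for a given request volume."""
    total_tokens = requests * output_tokens_per_request
    return total_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT_TOKENS

REQUESTS_PER_MONTH = 100_000       # hypothetical workload
DIRECT_TOKENS = 150                # short, direct answer
COT_TOKENS = DIRECT_TOKENS * 3     # the study saw token usage double or triple

print(f"Direct: ${monthly_cost(REQUESTS_PER_MONTH, DIRECT_TOKENS):.2f}/month")
print(f"CoT:    ${monthly_cost(REQUESTS_PER_MONTH, COT_TOKENS):.2f}/month")
```

The absolute numbers don't matter; the point is that whatever your baseline spend is, Chain-of-Thought multiplies it.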
The Bigger Picture
This research highlights a broader truth about working with AI: what works today might not work tomorrow. As models become more sophisticated, our prompting strategies need to evolve too.
The researchers acknowledge limitations in their study—they tested one benchmark with relatively simple Chain-of-Thought prompts—but their findings align with what many heavy AI users have started noticing: the dramatic improvements from basic prompting techniques are becoming less dramatic.
It's not that Chain-of-Thought prompting is bad; it's that AI has gotten good enough that it often does this kind of thinking automatically. In a way, that's progress. The technique worked so well that it's now baked into how these systems operate.
The challenge for users is staying current with what actually helps versus what we think should help based on advice that was true six months ago. In the rapidly evolving world of AI, even the best practices have expiration dates.
This article is based on "Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting" by Meincke, Mollick, Mollick, and Shapiro from the University of Pennsylvania's Wharton School.
Listen to a podcast version of this article.
#ai #machinelearning #promptengineering #chainofthought #llms #mlops #airesearch #aiexplained #aitrends #aicosts #wharton #mollickpaper #ethanmollick #deeplearning #deeplearningwiththewolf