Fallibility

From time to time, I’m going to point out important research. So much of the conversation is driven by hype. The hype is driven by people who have been driven by hype. The echo chamber is so intense that it makes LLMs look like objective deliverers of fact.

While I am astonished by the things I can do with the latest tools, the potential for error is high. The likelihood that an LLM will do something really dumb grows with each inflated claim about the future, with each unreviewed publication of LLM output.

Take the claim that Artificial General Intelligence is right around the corner and a logical extension of the machine learning/LLM model. There are very smart voices (like Gary Marcus) who try hard to keep the hype in check. 

In the years I spent building ethics boards for AI companies, my teams looked hard at the downsides: bias, discrimination, unreviewed use of the output, security, role-based permissions, data cleaning, management, and disposal. These are fundamental safety issues. Our view of the tools we use has to be grounded in something other than VC-driven hype.

Thinking soberly about AI usage is incredibly challenging. The tools are often designed to get the user to overlook the flaws (even if that’s an unintentional consequence). Coupled with a well-trained human tendency to believe the output of a machine, it’s easy to overestimate their utility.

In early June, Apple published a remarkable paper in its machine learning section (link in first comment): “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”. The paper is a study of what happens when the latest ‘reasoning models’ try to tackle complex questions.

Large Reasoning Models (LRMs) are LLMs that show their work. They provide an ongoing description of the decisions they are making. Most of the major chat providers offer this capability. The underlying idea is that the user can watch (oversee) the LLM as it works.

The paper is the result of a detailed study of that ‘reasoning’ process. The authors looked for both clear ‘thinking’ and accuracy in the answer. (Note: That means they worked with problems of controllable complexity that have a verifiably accurate answer, like solving a puzzle; a sketch of that kind of check follows the three findings below.) They found three different scenarios:

(1) Low-complexity tasks, where standard models surprisingly outperform LRMs. It is fascinating that the introduction of reasoning makes LLMs under-perform on simple tasks. That’s the opposite of what is claimed.

(2) Medium-complexity tasks, where the additional thinking in LRMs demonstrates an advantage. This is the core claim vendors are making.

(3) High-complexity tasks, where both kinds of models experience complete collapse. This is likely caused by failure at the edges of word predictability. What is troubling about this finding is that the LRMs can’t tell you when they have failed. The user is left to wonder how to spot it.
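
The puzzle framing matters because a proposed answer can be checked mechanically, with no judgment call about whether the ‘reasoning’ trace looked plausible. As an illustration only (a minimal sketch, not the paper’s actual evaluation harness), here is how a model’s proposed solution to Tower of Hanoi, one of the puzzles the study uses, can be verified in Python:

```python
# Minimal sketch (illustration only, not the paper's evaluation harness):
# mechanically verify a model's proposed Tower of Hanoi solution.

def verify_hanoi(n_disks, moves, source=0, target=2):
    """Return True if `moves` (a list of (from_peg, to_peg) pairs) legally
    transfers all n_disks from the source peg to the target peg."""
    pegs = [[], [], []]
    pegs[source] = list(range(n_disks, 0, -1))  # largest disk at the bottom
    for frm, to in moves:
        if not pegs[frm]:
            return False              # illegal: moving from an empty peg
        if pegs[to] and pegs[to][-1] < pegs[frm][-1]:
            return False              # illegal: larger disk onto a smaller one
        pegs[to].append(pegs[frm].pop())
    return pegs[target] == list(range(n_disks, 0, -1))

# A correct 3-disk solution passes; drop or swap any move and it fails.
proposed = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
print(verify_hanoi(3, proposed))  # True
```

Scaling the same check to harder instances (more disks) is what lets the researchers pinpoint where accuracy collapses, even though the model itself gives no signal that it has failed.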

This flaw quickly grows big enough to create massive uncertainty about answer quality. It is one of those gray-area ethical questions. You might ask, ‘Should LLMs be trotted out to an unsuspecting public when they haven’t been thoroughly tested?’ You might also ask, ‘Who holds the liability when the LLM fails at sophisticated problem solving?’

As I implied, I am very optimistic about the ultimate utility of our new AI tools. But at the very least, they currently suffer from this sort of quality problem and are dangerous as a result.

We are all part of a vast beta test.

The paper is dense but worth a scan. The link points to a summary (the paper’s conclusions).

Gabriel Gheorghiu

AI skeptic, tech critic, ERP expert, manufacturing enthusiast, and strong believer in sustainability and inclusion

3mo

Why are companies spending billions to be part of a beta test?

Tim McAllister

Senior Director, Digital Trust at DigiCert | IoT & PKI Security Leader | Enabling Secure, Scalable Cybersecurity for Connected Devices

3mo

Great analysis, John!🙌 The Apple #LLM research paper cuts through a lot of the noise: so-called “reasoning” models perform well on medium-complexity problems, but when tasks get deep or require outside information, they simply collapse—and don’t warn the user. That’s a huge blind spot, especially as most users can’t tell when things go sideways. Nate B. Jones (#Substack) made a solid point on this recently: LLMs hit a wall when they need to “phone a friend”—meaning, when they need external data or tools to get the job done. If a model can’t reach beyond its training, it fakes it. Until we build real-time retrieval and tool use directly into these systems, all the scale and “reasoning” in the world won’t solve these edge cases. Bottom line: we need hybrid solutions, strong validation, and real guardrails. Otherwise, we’re just beta-testing AI’s limitations in public. Thanks for keeping the discussion grounded. #MCP #RAG #apple #AI #ReasoningModel

Laurie Ruettimann

Board Chair @ Income To Support All Foundation | LinkedIn Learning Instructor

3mo

I did not consent!

Bonnie Duncan Tinder

Founder & CEO of Raven Intelligence. Amplifying the Voice of the Customer in Enterprise Software. Named Top 100 Influencer by HR Executive Magazine (2025, 2024, 2023, 2022)

3mo

John Sumser - 100%. The reason I suspect we haven't heard more about the failures is that AI usage hasn't been fully tested at scale on the 'complex problems' you mention. I think about FSD (full self-driving) five years ago--lots of tragedies and lawsuits from drivers who were beta testing the complexity. It's vastly improved today, but not without a lot of heartburn along the way.
