4. Energy, water, and land consumption
Big-tech hype machine
Resources consumed in manufacturing
Postcolonial exploitation
Surveillance, manipulation
Non-consensual use of data
Potential for abuse
Deskilling, Ironies of Automation
Loss of reality & Fake Everything
All possible kinds of bias
6. Human: Solve a riddle for me.
ChatGPT: Sure, happy to help! What riddle should I solve?
Human: A man and a goat are on one side of a river. They have a boat. How can both of them get across?
ChatGPT: The man first takes the goat across the river and leaves the boat on the original side. Then the man returns alone with the boat. He leaves the goat on the other side and takes the boat back to the original side. Finally, he takes the cabbage across the river.
Source: Douglas Hofstadter, author of Gödel, Escher, Bach, 2024
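The point of the exchange: the puzzle as stated has a one-step solution, but ChatGPT pattern-matches it onto the classic wolf-goat-cabbage riddle and invents a cabbage. A tiny Python sketch (my own illustration, not from the slides) makes this explicit by brute-forcing the actual state space of the stated puzzle:

```python
from collections import deque

# Brute-force search over the puzzle as stated: only a man, a goat, and a boat.
# State = (man, goat), each 0 = starting bank, 1 = far bank; the boat travels with the man.
def solve():
    start, goal = (0, 0), (1, 1)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (man, goat), path = queue.popleft()
        if (man, goat) == goal:
            return path
        # Each crossing: the man rows either alone or together with the goat.
        for take_goat in (False, True):
            if take_goat and goat != man:
                continue  # the goat can only come along if it is on the man's bank
            nxt = (1 - man, 1 - goat if take_goat else goat)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + ["man + goat" if take_goat else "man alone"]))

print(solve())  # ['man + goat']: a single crossing, and no cabbage anywhere
```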
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Parshin Shojaee∗†, Iman Mirzadeh∗, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar
Apple
Abstract
Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces’ structure and quality. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs “think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths, limitations, and ultimately raising crucial questions about their true reasoning capabilities.
1 Introduction
Large Language Models (LLMs) have recently evolved to include specialized variants explicitly designed for reasoning tasks—Large Reasoning Models (LRMs) such as OpenAI’s o1/o3 [1, 2], DeepSeek-R1 [3], Claude 3.7 Sonnet Thinking [4], and Gemini Thinking [5]. These models are new artifacts, characterized by their “thinking” mechanisms such as long Chain-of-Thought (CoT) with self-reflection, and have demonstrated promising results across various reasoning benchmarks. Their […]
∗ Equal contribution.
»[…] fail to develop generalizable reasoning capabilities beyond certain complexity thresholds […]«
»These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning.«
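To make the abstract's notion of a "controllable puzzle environment" concrete, here is a minimal Python sketch (my own illustration under that assumption, not the paper's evaluation harness) using a Tower-of-Hanoi-style task: the rules stay fixed while a single parameter, the number of disks, scales the required solution length exponentially, and a checker validates a proposed move sequence at the answer level.

```python
# Tower-of-Hanoi-style task: fixed rules, with the number of disks n as the single
# complexity knob (the optimal solution length grows as 2**n - 1).
def hanoi_reference(n, src=0, dst=2, aux=1):
    """Optimal move list for n disks, moving the stack from peg src to peg dst."""
    if n == 0:
        return []
    return (hanoi_reference(n - 1, src, aux, dst)
            + [(src, dst)]
            + hanoi_reference(n - 1, aux, dst, src))

def is_valid_solution(n, moves):
    """Answer-level check of a proposed move list against the fixed rules."""
    pegs = [list(range(n, 0, -1)), [], []]    # largest disk n at the bottom of peg 0
    for src, dst in moves:
        if not pegs[src]:
            return False                      # moving from an empty peg
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                      # a larger disk placed on a smaller one
        pegs[dst].append(disk)
    return pegs[2] == list(range(n, 0, -1))   # everything ends up on the target peg

for n in (3, 7, 10):
    print(n, len(hanoi_reference(n)), is_valid_solution(n, hanoi_reference(n)))
# 3 7 True / 7 127 True / 10 1023 True: identical rules, rapidly growing solution length
```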
9. »There is no difference within the technology itself, no functional and no operational difference between hallucinations and non-hallucinations.«
»What is surprising is not that chatbots sometimes produce nonsense, but that they answer correctly so often. We suggest that this is better understood as a side effect of the fact that true statements are more plausible than false ones.«
10. That’s not too hard. I asked ChatGPT for some citations of papers that “indicate the importance of silicon for the gut microbiome”, for example, and it happily gave me results.
Liao, Y., Luo, Z., Liu, J., Luo, H., Wu, Q., Wang, T., & Cheng, W. (2024). Silicon-enriched functional meat enhances colonic barrier function by inhibiting inflammatory response and oxidative stress in diabetic rats. Journal of Functional Foods, 112, 105241. https://doi.org/10.1016/j.jff.2024.105241
Kolodziejczak, M., Sienkiewicz, K., Kulik, D., & Szybinski, P. (2020). Postoperative supplementation of propionate and silicon dioxide in patients after total gastrectomy—Preliminary report. Przegląd Gastroenterologiczny, 15(3), 218–223. https://doi.org/10.5114/pg.2020.100058
Ravaioli, S., Tentor, F., Caputo, F., Tomasi, A., Merolle, L., Faldini, C., & Papi, M. (2024). Amorphous silica nanoparticles and the human gut microbiota: A relationship with multiple implications. Journal of Nanobiotechnology, 22(1), 1–20. https://doi.org/10.1186/s12951-024-02305-x […]
https://westreich.substack.com/p/why-chatgpt-creates-scientific-citations
12. »In the Make America Healthy Again Commission’s report,
they had a number of false (hallucinated) studies.
[…]
Some of them simply don’t exist.
And yet, as described by media outlet NOTUS, these
studies are claimed to be ›broadly illustrative‹ of how
America’s children are getting too many ADHD
prescriptions.«
https://westreich.substack.com/p/why-chatgpt-creates-scientific-citations
22. … concerns about potential decrease in programmers’ agency and productivity with Generative AI are justified …
… misleading and irresponsible …
… reliance on ChatGPT is linked to procrastination, memory loss, and a decline in academic performance.
… higher confidence in GenAI is associated with less critical thinking, while higher self-confidence is associated with more critical thinking.
These results highlight the potential cognitive costs of AI tool reliance …
24. My personal guidelines
› remember: it is not science!
› always think for yourself first
› never address it as a person
› distrust all results
› use it as little as possible
25. Human: decode L PZ AOL TVZA MYLXBLUA SLAALY PU AOPZ ZLUALUJL
[…]
ChatGPT: Using a Caesar cipher with a forward shift of 7 on the given text "L PZ AOL TVZA MYLXBLUA SLAALY PU AOPZ ZLUALUJL" results in the decoded message:
"I AM THE KING OF WISHFUL THINKING IN THE LAND OF DREAMS"
Actual decoding: »E IS THE MOST FREQUENT LETTER IN THIS SENTENCE«
Source: personal experience
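For comparison, the cipher can be decoded mechanically in a few lines. The following Python sketch (my own illustration, not part of the slide) shifts every letter back by 7 and counts letter frequencies, which reproduces the actual plaintext and confirms the claim it makes about itself.

```python
from collections import Counter

# Shift every letter of the ciphertext back by 7 and count letter frequencies.
def caesar_shift(text, shift):
    return "".join(
        chr((ord(c) - ord("A") + shift) % 26 + ord("A")) if c.isalpha() else c
        for c in text.upper()
    )

ciphertext = "L PZ AOL TVZA MYLXBLUA SLAALY PU AOPZ ZLUALUJL"
plaintext = caesar_shift(ciphertext, -7)
print(plaintext)  # E IS THE MOST FREQUENT LETTER IN THIS SENTENCE
print(Counter(c for c in plaintext if c.isalpha()).most_common(1))  # [('E', 9)]
```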