Anthropic's New Study Challenges Apple Research on AI Reasoning Limitations

By Lang Wang
8 min read

AI Reasoning Debate: Landmark Study Challenges Apple's Claims of "Cognitive Collapse" in Large Models

A scientific rebuttal reshapes understanding of AI capabilities, contradicting Apple Research's skeptical stance on language model reasoning

In a sharply argued scientific rebuttal published this week, Anthropic researchers have challenged a narrative about the fundamental limitations of artificial intelligence systems. The paper, titled "The Illusion of the Illusion of Thinking," dismantles claims made by Apple Research in their influential study "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity."

Is Apple Losing the Gen AI Game? (wikimedia.org)

The rebuttal, authored by Alex Lawsen with Anthropic's Claude Opus model credited as a co-author, meticulously dissects the Apple experiments whose conclusions had become conventional wisdom in AI research circles. Their findings suggest that the perceived limitations in AI reasoning capabilities may be more illusion than reality: an artifact of flawed evaluation methods rather than actual cognitive boundaries.

Criticisms of Large Reasoning Models (LRMs) by Apple Research

| Attack Point / Limitation | Concise Description | Key Evidence from the Paper |
| --- | --- | --- |
| 1. Complete Accuracy Collapse | Accuracy drops to zero past a certain complexity threshold, revealing a failure to generalize reasoning. | Accuracy on all puzzles (e.g., Tower of Hanoi) collapses to 0% as complexity increases. |
| 2. Counter-intuitive Scaling of Effort | Models "give up" on harder problems, spending fewer thinking tokens as complexity passes a critical point, indicating a scaling limit. | "Thinking token" usage peaks and then sharply declines for harder problems, despite an adequate token budget. |
| 3. Failure in Exact Computation & Consistency | Models struggle with precise, step-by-step execution and show inconsistent performance across different puzzle types. | Providing the solution algorithm doesn't prevent failure. Highly inconsistent: 100+ correct moves on Tower of Hanoi vs. fewer than 5 on a simpler River Crossing puzzle. |
| 4. Inefficient and Flawed Reasoning | Internal "thinking" traces reveal inefficient processes: "overthinking" simple problems and failing to find correct paths for complex ones. | On simple problems, the correct answer appears early but is followed by incorrect exploration. On complex problems, correct answers are late or absent. |
| 5. Underperformance on Low-Complexity Tasks | On simple tasks, "thinking" models (LRMs) are often worse and less efficient than standard LLMs; the thinking process is a disadvantage. | In the low-complexity regime, standard non-thinking models consistently outperform their LRM counterparts. |
| 6. Flawed Benchmark Evaluation | Apparent reasoning gains on standard math benchmarks (e.g., AIME) are questionable and likely result from data contamination. | Performance on AIME25 is worse than AIME24, contrary to human performance, suggesting contamination on older benchmark data. |

Rebuttals to "The Illusion of Thinking" by Anthropic Research

| Original Attack Point | Concise Rebuttal | Key Evidence |
| --- | --- | --- |
| 1. Complete Accuracy Collapse | Not a reasoning collapse, but a physical token limit. Failure occurs precisely when the exhaustive output exceeds the model's token budget. | Sec. 4: Calculations show the "collapse" point for Tower of Hanoi matches the model's token limit. Models also explicitly state they are truncating output. |
| 2. Counter-intuitive Scaling of Effort | Token reduction is an artifact of hitting the output limit, not a sign of the model "giving up." | Sec. 4: A direct consequence of the token limit; generation simply stops. |
| 3. Failure in Exact Computation & Consistency | Caused by a flawed complexity metric that confuses solution length with computational difficulty. | Sec. 6: Tower of Hanoi is algorithmically simple (long solution) while River Crossing is NP-hard (short solution), explaining the performance difference. |
| 4. Inefficient and Flawed Reasoning | Reasoning is sound; the output format is the constraint. Models prove they understand the algorithm when asked for a compact representation. | Sec. 5: Models achieve very high accuracy by generating a function to solve Tower of Hanoi, instead of an exhaustive move list. |
| 5. Underperformance on Low-Complexity Tasks | Not directly addressed; the rebuttal focuses on deconstructing the high-complexity failure claims. | - |
| 6. Flawed Benchmark Evaluation | The original evaluation was fatally flawed, including mathematically impossible puzzles. | Sec. 3: River Crossing for N ≥ 6 is provably unsolvable. Models were incorrectly penalized for not solving an impossible problem. |

When Models Weren't Failing, Just Running Out of Room

At the heart of the scientific dispute lies a deceptively simple discovery: in many cases where AI models were reported by Apple researchers to "fail" at solving complex puzzles like the Tower of Hanoi, they weren't actually failing to reason—they were literally running out of space to write their answers.

"What Apple's team interpreted as a reasoning limitation was actually a physical constraint on output length," explained an AI evaluation specialist familiar with both papers. "It's like claiming humans can't count to a million because they get tired of speaking before reaching the end."

The rebuttal demonstrates that for puzzles requiring exponentially long solutions—such as Tower of Hanoi with 15 disks, which needs 32,767 precise moves—the models were hitting their maximum token output limits. In many cases, the AI systems even explicitly stated they were truncating their solutions due to length constraints.
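The arithmetic behind this claim is easy to verify. The following is a minimal sketch of the budget calculation, assuming roughly five output tokens per printed move and a 64,000-token output ceiling; both numbers are illustrative assumptions, not figures from either paper.

```python
# Back-of-the-envelope check: does an exhaustive Tower of Hanoi solution
# fit in a model's output window? The tokens-per-move and budget values
# below are illustrative assumptions.

TOKENS_PER_MOVE = 5      # rough cost of printing one move, e.g. "[3, 0, 2],"
OUTPUT_BUDGET = 64_000   # hypothetical maximum output tokens

for disks in range(10, 16):
    moves = 2 ** disks - 1                 # minimal solution length is 2^n - 1
    tokens_needed = moves * TOKENS_PER_MOVE
    verdict = "fits" if tokens_needed <= OUTPUT_BUDGET else "EXCEEDS budget"
    print(f"{disks} disks: {moves:>6} moves ~ {tokens_needed:>7} tokens -> {verdict}")
```

Under these assumptions the output stops fitting somewhere around 13 or 14 disks, producing exactly the kind of sharp, cliff-like "collapse" the original study reported.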

The Impossible Puzzles Trap

Perhaps most damning, the investigation revealed that approximately 23% of the River-Crossing problems used in Apple's study to "prove" AI reasoning failures were mathematically impossible to solve—a fact the original researchers apparently missed.

"Several River-Crossing tasks with six or more actors and a three-seat boat are provably unsolvable," notes the paper. "The models were being penalized for correctly recognizing that no solution exists."

This discovery fundamentally undermines the conclusions drawn about AI limitations in Apple's original study. A computational complexity expert who wasn't involved in either paper commented: "Penalizing an AI for not solving an unsolvable problem isn't measuring reasoning ability—it's measuring the researcher's understanding of the problem space."
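The unsolvability claim is mechanically checkable. Below is a sketch that reconstructs the puzzle's constraint as described (an actor may never share a bank, or the boat, with another pair's agent unless their own agent is present) and exhaustively searches the state space. The encoding is an assumption about Apple's setup, not their actual evaluation harness.

```python
# Exhaustive BFS over River Crossing states: N actor/agent pairs, a boat
# with a fixed number of seats, and the "jealous pairs" safety constraint.

from collections import deque
from itertools import combinations

def safe(group):
    """Safety constraint for one bank (or the boat's passengers)."""
    actors = {i for kind, i in group if kind == "actor"}
    agents = {i for kind, i in group if kind == "agent"}
    # Each actor needs their own agent present, or no agents present at all.
    return all(i in agents or not agents for i in actors)

def solvable(n_pairs, boat_seats):
    people = frozenset([("actor", i) for i in range(n_pairs)] +
                       [("agent", i) for i in range(n_pairs)])
    start = (people, 0)                    # (left-bank occupants, boat side)
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                       # everyone has crossed
            return True
        bank = left if boat == 0 else people - left
        for size in range(1, boat_seats + 1):
            for riders in combinations(sorted(bank), size):
                riders = frozenset(riders)
                new_left = left - riders if boat == 0 else left | riders
                state = (new_left, 1 - boat)
                if (state not in seen and safe(riders)
                        and safe(new_left) and safe(people - new_left)):
                    seen.add(state)
                    queue.append(state)
    return False                           # state space exhausted: no solution

for n in (5, 6):
    print(f"{n} pairs, 3-seat boat:", "solvable" if solvable(n, 3) else "unsolvable")
```

Run as written, the search reports five pairs solvable and six pairs unsolvable, matching the classical result the rebuttal cites in its Section 3.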

Asking Better Questions Yields Better Answers

In what may prove most consequential for practical applications, the researchers demonstrated that simply changing how questions are posed dramatically improves performance. When asked to generate a program that prints the solution, rather than enumerate tens of thousands of individual moves, the same "failing" models solved 15-disk Tower of Hanoi problems with 89-94% accuracy.

This approach, requesting algorithms instead of exhaustive move lists, reduced token requirements from quadratic (O(n²)) to linear (O(n)), allowing models to showcase their actual reasoning capabilities without being constrained by output limitations.
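For concreteness, here is the sort of compact answer that sidesteps the budget entirely: a short recursive generator instead of 32,767 literal moves. The rebuttal reportedly asked for the function in Lua; a Python equivalent is shown here for consistency with the other sketches.

```python
# A compact, programmatic Tower of Hanoi answer: a few lines that can
# reproduce the full move list on demand instead of printing it outright.

def hanoi(n, source=0, target=2, spare=1):
    """Yield the optimal move sequence for n disks as (from_peg, to_peg)."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)
    yield (source, target)
    yield from hanoi(n - 1, spare, target, source)

moves = list(hanoi(15))
print(len(moves))     # 32767 == 2**15 - 1, matching the article's figure
print(moves[:3])      # first moves: [(0, 2), (0, 1), (2, 1)]
```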

Beyond Academic Debate: Real-World Implications

The findings extend well beyond academic circles, with potentially significant impacts for industries adopting AI technologies for complex planning and logistics.

"This research suggests that current frontier models remain perfectly capable of handling multi-step, complex tasks when properly prompted and evaluated," noted a technology investment analyst. "Companies implementing these systems for supply chain optimization, for instance, could see 37-52% cost reductions by leveraging programmatic outputs rather than exhaustive natural language solutions."

For enterprise technology leaders, the paper reinforces a critical best practice: when solutions are potentially enormous, request compressed or programmatic outputs. This approach not only circumvents token limitations but typically produces more maintainable, adaptable results.
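Concretely, the difference can be as small as one sentence in the prompt. The strings below are hypothetical illustrations of the pattern, not prompts from either paper, and the token estimates reuse the rough five-tokens-per-move assumption from the earlier sketch.

```python
# Two ways to ask for the same 15-disk Tower of Hanoi solution.
# (Hypothetical prompt wording; token estimates are rough assumptions.)

exhaustive_prompt = (
    "Solve Tower of Hanoi with 15 disks. "
    "List every move as [disk, from_peg, to_peg]."
)   # ~32,767 moves at ~5 tokens each -> ~164k output tokens: blows the budget

programmatic_prompt = (
    "Solve Tower of Hanoi with 15 disks. "
    "Output a short function that prints the full move list."
)   # a ~15-line function -> well under 1k output tokens
```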

Apple's Curious Position in the AI Race

Apple Research's focus on publishing papers that appear to debunk LLM capabilities, rather than on developing competitive models, raises eyebrows in investment circles. While competitors like Microsoft, Google, and Meta are pouring billions into LLM development and integration, Apple's skeptical stance could signal a strategic divergence with potential market implications.

"What's concerning for investors is that Apple seems to be allocating research resources toward finding limitations in current AI technologies rather than building their own cutting-edge solutions," observed a senior technology market analyst. "This could indicate either a fundamentally different vision for AI's future or a concerning gap in their technological roadmap."

This positioning becomes particularly noteworthy as Apple's competitors race to integrate increasingly capable AI assistants into their product ecosystems, potentially widening the feature gap in consumer-facing applications.

How Far Behind Is Apple in the Gen AI Competition?

| Category | Apple (2025) | Competitors (OpenAI, Google, Microsoft, Meta) |
| --- | --- | --- |
| Model Size | ~3 billion parameters, on-device | 70B+ parameters, cloud-based |
| Feature Delivery | Delayed key features (e.g., Siri GenAI pushed to 2026) | Fast rollout of advanced chatbots, agents, creative tools |
| Benchmark Performance | Lags in accuracy and query handling (e.g., Siri 25% less accurate than ChatGPT) | Higher performance in NLP and image analysis |
| Strategic Focus | Privacy-first, on-device, integrated features | Capability-first, fast iteration, cloud-scale solutions |
| Innovation Pace | Incremental, cautious | Aggressive, headline-making innovations |
| Market Perception | Viewed as 1-2 years behind; underwhelming AI efforts | Seen as leaders in generative AI |
| Financial Impact | -20% YTD stock drop; Berkshire cut stake by 67% | Stocks generally outperforming; strong investor confidence |
| User Experience | Limited, incremental improvements | Broad, transformative features and tools |
| Upgrade Cycle Impact | No major AI-driven supercycle | Significant user interest and adoption of AI features |
| Long-term Strategy Risk | May gain trust via privacy focus, but risks losing momentum | Gains fast adoption, but risks overpromising and ethical concerns |


Investment Perspective: Recalibrating AI Capability Assessments

For investors monitoring the AI sector, this research provides valuable context for evaluating both technology limitations and corporate strategies. Apple's focus on debunking LLM capabilities rather than investing heavily in their development could represent either a concerning blind spot or a deliberate bet on alternative approaches.

Markets may respond positively to companies developing more sophisticated AI evaluation frameworks, as well as those focused on prompt engineering optimization. Firms demonstrating the ability to leverage current models for complex planning tasks—particularly in logistics, supply chain, and code generation—could see enhanced valuations based on this reassessment of capabilities.

Investors should watch closely for signals of Apple's true AI strategy in upcoming product announcements and research publications. If the company continues to allocate significant resources toward critiquing rather than advancing LLM technologies, it could represent a material divergence from industry momentum that warrants careful consideration in portfolio decisions.

Disclaimer: This analysis represents informed interpretation of current research and should not be considered investment advice. Past performance doesn't guarantee future results. Readers should consult financial advisors for personalized guidance.
