Anthropic's New Study Challenges Apple Research on AI Reasoning Limitations

By Lang Wang
8 min read

AI Reasoning Debate: Landmark Study Challenges Apple's Claims of "Cognitive Collapse" in Large Models

A scientific rebuttal reshapes understanding of AI capabilities, contradicting Apple Research's skeptical stance on language model reasoning

In a sharply argued scientific rebuttal published this week, Anthropic researchers have challenged a narrative about the fundamental limitations of artificial intelligence systems. The paper, titled "The Illusion of the Illusion of Thinking," dismantles claims made by Apple Research in their influential study "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity."

Is Apple Losing the Gen AI Game? (wikimedia.org)

The rebuttal, authored by Alex Lawsen with Anthropic's Claude Opus model credited as a co-author, meticulously dissects the Apple experiments whose conclusions had become conventional wisdom in AI research circles. Their findings suggest that the perceived limitations in AI reasoning capabilities may be more illusion than reality: an artifact of flawed evaluation methods rather than actual cognitive boundaries.

Criticisms of Large Reasoning Models (LRMs) by Apple Research

| Attack Point / Limitation | Concise Description | Key Evidence from the Paper |
| --- | --- | --- |
| 1. Complete Accuracy Collapse | Accuracy drops to zero past a certain complexity threshold, revealing a failure to generalize reasoning. | Accuracy on all puzzles (e.g., Tower of Hanoi) collapses to 0% as complexity increases. |
| 2. Counter-intuitive Scaling of Effort | Models "give up" on harder problems, spending fewer thinking tokens as complexity passes a critical point, indicating a scaling limit. | "Thinking token" usage peaks and then sharply declines for harder problems, despite an adequate token budget. |
| 3. Failure in Exact Computation & Consistency | Models struggle with precise, step-by-step execution and show inconsistent performance across different puzzle types. | Providing the solution algorithm doesn't prevent failure. Highly inconsistent: 100+ correct moves on Tower of Hanoi vs. fewer than 5 on a simpler River Crossing puzzle. |
| 4. Inefficient and Flawed Reasoning | Internal "thinking" traces reveal inefficient processes: "overthinking" simple problems and failing to find correct paths for complex ones. | On simple problems, the correct answer appears early but is followed by incorrect exploration. On complex problems, correct answers are late or absent. |
| 5. Underperformance on Low-Complexity Tasks | On simple tasks, "thinking" models (LRMs) are often worse and less efficient than standard LLMs; the thinking process is a disadvantage. | In the low-complexity regime, standard non-thinking models consistently outperform their LRM counterparts. |
| 6. Flawed Benchmark Evaluation | Apparent reasoning gains on standard math benchmarks (e.g., AIME) are questionable and likely result from data contamination. | Performance on AIME25 is worse than AIME24, contrary to human performance, suggesting contamination on older benchmark data. |

Rebuttals to "The Illusion of Thinking" by Anthropic Research

| Original Attack Point | Concise Rebuttal | Key Evidence |
| --- | --- | --- |
| 1. Complete Accuracy Collapse | Not a reasoning collapse, but a physical token limit. Failure occurs precisely when the exhaustive output exceeds the model's token budget. | Sec. 4: Calculations show the "collapse" point for Tower of Hanoi matches the model's token limit. Models also explicitly state they are truncating output. |
| 2. Counter-intuitive Scaling of Effort | Token reduction is an artifact of hitting the output limit, not a sign of the model "giving up." | Sec. 4: A direct consequence of the token limit; generation simply stops. |
| 3. Failure in Exact Computation & Consistency | Caused by a flawed complexity metric that confuses solution length with computational difficulty. | Sec. 6: Tower of Hanoi is algorithmically simple (long solution) while River Crossing is NP-hard (short solution), explaining the performance difference. |
| 4. Inefficient and Flawed Reasoning | Reasoning is sound; the output format is the constraint. Models prove they understand the algorithm when asked for a compact representation. | Sec. 5: Models achieve very high accuracy by generating a function to solve Tower of Hanoi, instead of an exhaustive move list. |
| 5. Underperformance on Low-Complexity Tasks | Not directly addressed; the rebuttal focuses on deconstructing the high-complexity failure claims. | - |
| 6. Flawed Benchmark Evaluation | The original evaluation was fatally flawed, including mathematically impossible puzzles. | Sec. 3: River Crossing for N ≥ 6 is provably unsolvable. Models were incorrectly penalized for not solving an impossible problem. |

When Models Weren't Failing, Just Running Out of Room

At the heart of the scientific dispute lies a deceptively simple discovery: in many cases where AI models were reported by Apple researchers to "fail" at solving complex puzzles like the Tower of Hanoi, they weren't actually failing to reason—they were literally running out of space to write their answers.

"What Apple's team interpreted as a reasoning limitation was actually a physical constraint on output length," explained an AI evaluation specialist familiar with both papers. "It's like claiming humans can't count to a million because they get tired of speaking before reaching the end."

The rebuttal demonstrates that for puzzles requiring exponentially long solutions—such as Tower of Hanoi with 15 disks, which needs 32,767 precise moves—the models were hitting their maximum token output limits. In many cases, the AI systems even explicitly stated they were truncating their solutions due to length constraints.
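The arithmetic behind this claim is easy to verify. The following is a minimal sketch of the budget calculation, assuming roughly five output tokens per printed move and a 64,000-token output ceiling; both numbers are illustrative assumptions, not figures from either paper.

```python
# Back-of-the-envelope check: does an exhaustive Tower of Hanoi solution
# fit in a model's output window? The tokens-per-move and budget values
# below are illustrative assumptions.

TOKENS_PER_MOVE = 5      # rough cost of printing one move, e.g. "[3, 0, 2],"
OUTPUT_BUDGET = 64_000   # hypothetical maximum output tokens

for disks in range(10, 16):
    moves = 2 ** disks - 1                 # minimal solution length is 2^n - 1
    tokens_needed = moves * TOKENS_PER_MOVE
    verdict = "fits" if tokens_needed <= OUTPUT_BUDGET else "EXCEEDS budget"
    print(f"{disks} disks: {moves:>6} moves ~ {tokens_needed:>7} tokens -> {verdict}")
```

Under these assumptions the output stops fitting somewhere around 13 or 14 disks, producing exactly the kind of sharp, cliff-like "collapse" the original study reported.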

The Impossible Puzzles Trap

Perhaps most damning, the investigation revealed that approximately 23% of the River-Crossing problems used in Apple's study to "prove" AI reasoning failures were mathematically impossible to solve—a fact the original researchers apparently missed.

"Several River-Crossing tasks with six or more actors and a three-seat boat are provably unsolvable," notes the paper. "The models were being penalized for correctly recognizing that no solution exists."

This discovery fundamentally undermines the conclusions drawn about AI limitations in Apple's original study. A computational complexity expert who wasn't involved in either paper commented: "Penalizing an AI for not solving an unsolvable problem isn't measuring reasoning ability—it's measuring the researcher's understanding of the problem space."
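The unsolvability claim is mechanically checkable. Below is a sketch that reconstructs the puzzle's constraint as described (an actor may never share a bank, or the boat, with another pair's agent unless their own agent is present) and exhaustively searches the state space. The encoding is an assumption about Apple's setup, not their actual evaluation harness.

```python
# Exhaustive BFS over River Crossing states: N actor/agent pairs, a boat
# with a fixed number of seats, and the "jealous pairs" safety constraint.

from collections import deque
from itertools import combinations

def safe(group):
    """Safety constraint for one bank (or the boat's passengers)."""
    actors = {i for kind, i in group if kind == "actor"}
    agents = {i for kind, i in group if kind == "agent"}
    # Each actor needs their own agent present, or no agents present at all.
    return all(i in agents or not agents for i in actors)

def solvable(n_pairs, boat_seats):
    people = frozenset([("actor", i) for i in range(n_pairs)] +
                       [("agent", i) for i in range(n_pairs)])
    start = (people, 0)                    # (left-bank occupants, boat side)
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                       # everyone has crossed
            return True
        bank = left if boat == 0 else people - left
        for size in range(1, boat_seats + 1):
            for riders in combinations(sorted(bank), size):
                riders = frozenset(riders)
                new_left = left - riders if boat == 0 else left | riders
                state = (new_left, 1 - boat)
                if (state not in seen and safe(riders)
                        and safe(new_left) and safe(people - new_left)):
                    seen.add(state)
                    queue.append(state)
    return False                           # state space exhausted: no solution

for n in (5, 6):
    print(f"{n} pairs, 3-seat boat:", "solvable" if solvable(n, 3) else "unsolvable")
```

Run as written, the search reports five pairs solvable and six pairs unsolvable, matching the classical result the rebuttal cites in its Section 3.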

Asking Better Questions Yields Better Answers

In what may prove most consequential for practical applications, the researchers demonstrated that simply changing how questions are posed dramatically improves performance. When asked to generate a program that prints the solution, rather than enumerate tens of thousands of individual moves, the same "failing" models solved 15-disk Tower of Hanoi problems with 89-94% accuracy.

This approach, requesting algorithms instead of exhaustive move lists, reduced token requirements from quadratic (O(n²)) to linear (O(n)), allowing models to showcase their actual reasoning capabilities without being constrained by output limitations.
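For concreteness, here is the sort of compact answer that sidesteps the budget entirely: a short recursive generator instead of 32,767 literal moves. The rebuttal reportedly asked for the function in Lua; a Python equivalent is shown here for consistency with the other sketches.

```python
# A compact, programmatic Tower of Hanoi answer: a few lines that can
# reproduce the full move list on demand instead of printing it outright.

def hanoi(n, source=0, target=2, spare=1):
    """Yield the optimal move sequence for n disks as (from_peg, to_peg)."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)
    yield (source, target)
    yield from hanoi(n - 1, spare, target, source)

moves = list(hanoi(15))
print(len(moves))     # 32767 == 2**15 - 1, matching the article's figure
print(moves[:3])      # first moves: [(0, 2), (0, 1), (2, 1)]
```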

Beyond Academic Debate: Real-World Implications

The findings extend well beyond academic circles, with potentially significant impacts for industries adopting AI technologies for complex planning and logistics.

"This research suggests that current frontier models remain perfectly capable of handling multi-step, complex tasks when properly prompted and evaluated," noted a technology investment analyst. "Companies implementing these systems for supply chain optimization, for instance, could see 37-52% cost reductions by leveraging programmatic outputs rather than exhaustive natural language solutions."

For enterprise technology leaders, the paper reinforces a critical best practice: when solutions are potentially enormous, request compressed or programmatic outputs. This approach not only circumvents token limitations but typically produces more maintainable, adaptable results.
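Concretely, the difference can be as small as one sentence in the prompt. The strings below are hypothetical illustrations of the pattern, not prompts from either paper, and the token estimates reuse the rough five-tokens-per-move assumption from the earlier sketch.

```python
# Two ways to ask for the same 15-disk Tower of Hanoi solution.
# (Hypothetical prompt wording; token estimates are rough assumptions.)

exhaustive_prompt = (
    "Solve Tower of Hanoi with 15 disks. "
    "List every move as [disk, from_peg, to_peg]."
)   # ~32,767 moves at ~5 tokens each -> ~164k output tokens: blows the budget

programmatic_prompt = (
    "Solve Tower of Hanoi with 15 disks. "
    "Output a short function that prints the full move list."
)   # a ~15-line function -> well under 1k output tokens
```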

Apple's Curious Position in the AI Race

Apple Research's focus on publishing papers that appear to debunk LLM capabilities, rather than on developing competitive models, raises eyebrows in investment circles. While competitors like Microsoft, Google, and Meta are pouring billions into LLM development and integration, Apple's skeptical stance could signal a strategic divergence with potential market implications.

"What's concerning for investors is that Apple seems to be allocating research resources toward finding limitations in current AI technologies rather than building their own cutting-edge solutions," observed a senior technology market analyst. "This could indicate either a fundamentally different vision for AI's future or a concerning gap in their technological roadmap."

This positioning becomes particularly noteworthy as Apple's competitors race to integrate increasingly capable AI assistants into their product ecosystems, potentially widening the feature gap in consumer-facing applications.

How Far Behind Is Apple in the Gen AI Competition?

| Category | Apple (2025) | Competitors (OpenAI, Google, Microsoft, Meta) |
| --- | --- | --- |
| Model Size | ~3 billion parameters, on-device | 70B+ parameters, cloud-based |
| Feature Delivery | Delayed key features (e.g., Siri GenAI pushed to 2026) | Fast rollout of advanced chatbots, agents, creative tools |
| Benchmark Performance | Lags in accuracy and query handling (e.g., Siri 25% less accurate than ChatGPT) | Higher performance in NLP and image analysis |
| Strategic Focus | Privacy-first, on-device, integrated features | Capability-first, fast iteration, cloud-scale solutions |
| Innovation Pace | Incremental, cautious | Aggressive, headline-making innovations |
| Market Perception | Viewed as 1-2 years behind; underwhelming AI efforts | Seen as leaders in generative AI |
| Financial Impact | -20% YTD stock drop; Berkshire cut stake by 67% | Stocks generally outperforming; strong investor confidence |
| User Experience | Limited, incremental improvements | Broad, transformative features and tools |
| Upgrade Cycle Impact | No major AI-driven supercycle | Significant user interest and adoption of AI features |
| Long-term Strategy Risk | May gain trust via privacy focus, but risks losing momentum | Gains fast adoption, but risks overpromising and ethical concerns |


Investment Perspective: Recalibrating AI Capability Assessments

For investors monitoring the AI sector, this research provides valuable context for evaluating both technology limitations and corporate strategies. Apple's focus on debunking LLM capabilities rather than investing heavily in their development could represent either a concerning blind spot or a deliberate bet on alternative approaches.

Markets may respond positively to companies developing more sophisticated AI evaluation frameworks, as well as those focused on prompt engineering optimization. Firms demonstrating the ability to leverage current models for complex planning tasks—particularly in logistics, supply chain, and code generation—could see enhanced valuations based on this reassessment of capabilities.

Investors should watch closely for signals of Apple's true AI strategy in upcoming product announcements and research publications. If the company continues to allocate significant resources toward critiquing rather than advancing LLM technologies, it could represent a material divergence from industry momentum that warrants careful consideration in portfolio decisions.

Disclaimer: This analysis represents informed interpretation of current research and should not be considered investment advice. Past performance doesn't guarantee future results. Readers should consult financial advisors for personalized guidance.
