AI: Apple's new viral research paper on AI Reasoning. RTZ #752
...the debate on the roadmap to AGI continues
Even as AI companies like OpenAI, Anthropic, and Google, along with DeepSeek/Manus/Qwen in China, accelerate from Level 1 chatbots to Level 2 AI Reasoning on OpenAI’s ‘roadmap to AGI’, new research papers are highlighting the speed bumps ahead.
The latest, by Apple researchers, is aptly titled “The Illusion of Thinking”.
The WSJ outlines it all well in “Why Superintelligent AI Isn’t Taking Over Anytime Soon”:
"The title of a fresh paper from Apple AAPL says it all: “The Illusion of Thinking.” In it, a half-dozen top researchers probed reasoning models—large language models that “think” about problems longer, across many steps—from the leading AI labs, including OpenAI, DeepSeek and Anthropic. They found little evidence that these are capable of reasoning anywhere close to the level their makers claim.”
“Generative AI can be quite useful in specific applications, and a boon to worker productivity. OpenAI claims 500 million monthly active ChatGPT users—astonishingly far reach and fast growth for a service released just 2½ years ago. But these critics argue there is a significant hazard in overestimating what it can do, and making business plans, policy decisions and investments based on pronouncements that seem increasingly disconnected from the products themselves.”
Indeed, if anything, OpenAI founder/CEO Sam Altman doubled down on AI with a long essay on the proximity of AGI (aka artificial general intelligence). His “The Gentle Singularity” sets the near-term stage for the AI Tech Wave:
“We are past the event horizon; the takeoff has started. Humanity is close to building digital superintelligence, and at least so far it’s much less weird than it seems like it should be.”
“Robots are not yet walking the streets, nor are most of us talking to AI all day. People still die of disease, we still can’t easily go to space, and there is a lot about the universe we don’t understand.”
“And yet, we have recently built systems that are smarter than people in many ways, and are able to significantly amplify the output of people using them. The least-likely part of the work is behind us; the scientific insights that got us to systems like GPT-4 and o3 were hard-won, but will take us very far.”
The Apple paper, worth reading in full, points out the immediate challenges in the AI Reasoning field:
“Apple’s paper builds on previous work from many of the same engineers, as well as notable research from both academia and other big tech companies, including Salesforce. These experiments show that today’s “reasoning” AIs—hailed as the next step toward autonomous AI agents and, ultimately, superhuman intelligence—are in some cases worse at solving problems than the plain-vanilla AI chatbots that preceded them. This work also shows that whether you’re using an AI chatbot or a reasoning model, all systems fail utterly at more complex tasks.”
“Apple’s researchers found “fundamental limitations” in the models. When taking on tasks beyond a certain level of complexity, these AIs suffered “complete accuracy collapse.” Similarly, engineers at Salesforce AI Research concluded that their results “underscore a significant gap between current LLM capabilities and real-world enterprise demands.”
“Importantly, the problems these state-of-the-art AIs couldn’t handle are logic puzzles that even a precocious child could solve, with a little instruction. What’s more, when you give these AIs that same kind of instruction, they can’t follow it.”
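For context on what “logic puzzles that even a precocious child could solve” means here: one of the benchmark puzzles in the Apple paper is the classic Tower of Hanoi, which the researchers scale up by adding disks. The sketch below is the textbook recursive solution, illustrating the kind of explicit, step-by-step algorithm the paper reports the models still failed to execute reliably at higher disk counts; the function and variable names are mine, not from the paper.

```python
# Classic recursive Tower of Hanoi solver (illustrative sketch, not from
# the Apple paper): moves n disks from `src` to `dst` using `aux` as spare.
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    """Return the list of (from_peg, to_peg) moves solving n-disk Tower of Hanoi."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, aux, dst, moves)   # park the top n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest remaining disk
    hanoi(n - 1, aux, dst, src, moves)   # restack the n-1 disks on top of it
    return moves

if __name__ == "__main__":
    print(hanoi(3))            # 7 moves: 2**3 - 1
    print(len(hanoi(10)))      # 1023 moves: 2**10 - 1
```

The optimal solution always takes 2^n − 1 moves, so difficulty grows exponentially with each added disk, which is how the paper dials up complexity until the models hit the “accuracy collapse” described above.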
Ignacio de Gregorio sums up the Apple paper thus in “Apple’s Viral AI Paper, Reality or Fraud”:
“This is not Apple’s first critique of modern AI models, to the point where one might believe Apple is notoriously skeptical of modern AI progress.”
“That’s understandable, but that doesn’t mean it can’t evolve with it.”
“Take Meta, for example. Few AI incumbents on planet Earth are harsher critics of modern frontier models than Yann LeCun, Chief AI Scientist at Meta.”
I’ve written both about Yann LeCun’s views on LLM AIs and Meta’s recent $15 billion investment in Scale.ai. Ignacio continues:
“However, that has not prevented Meta from investing heavily in this vision and actually trying to build powerful AI models. Even if you disagree with the portrayal they receive from profit-incentivized incumbents, you can’t deny progress.”
The overall result of course is a vigorous debate in the industry:
“Apple’s paper has set off a debate in tech’s halls of power—Signal chats, Substack posts and X threads—pitting AI maximalists against skeptics.”
“People could say it’s sour grapes, that Apple is just complaining because they don’t have a cutting-edge model,” says Josh Wolfe, co-founder of venture firm Lux Capital. “But I don’t think it’s a criticism so much as an empirical observation.”
OpenAI itself shipped an impressive AI Reasoning upgrade with o3 Pro a few days ago, which I highlighted.
“The reasoning methods in OpenAI’s models are “already laying the foundation for agents that can use tools, make decisions, and solve harder problems,” says an OpenAI spokesman. “We’re continuing to push those capabilities forward.”
“The debate over this research begins with the implication that today’s AIs aren’t thinking, but instead are creating a kind of spaghetti of simple rules to follow in every situation covered by their training data.”
“Gary Marcus, a cognitive scientist who sold an AI startup to Uber in 2016, argued in an essay that Apple’s paper, along with related work, exposes flaws in today’s reasoning models, suggesting they’re not the dawn of human-level ability but rather a dead end. “Part of the reason the Apple study landed so strongly is that Apple did it,” he says. “And I think they did it at a moment in time when people have finally started to understand this for themselves.”
Gary goes on to make deeper arguments against AI Reasoning in “Seven replies to the viral Apple reasoning paper – and why they fall short”:
“Bottom line? None of the rejoinders are compelling. If people like Sam Altman are sweating, it’s because they should. The Apple paper is yet another clear sign that scaling is not the answer; for once, people are paying attention.”
“The kicker? A Salesforce paper also just posted, that many people missed:”
“In the “multi-turn” condition, which presumably would require reasoning and algorithmic precision, performance was only 35%.”
“Talk about convergence evidence. Taking the SalesForce report together with the Apple paper, it’s clear the current tech is not to be trusted.”
AI researchers will continue to study and debate this issue in depth. But one thing is clear thus far in this AI Tech Wave.
No one is throwing in the towel on either side of this perennial debate. At this point, it’s ‘AI or Bust’, even if it takes a bit longer than expected. Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here)