AI: A Tale of Three 'Deep AI Reasonings'. RTZ #622
...evaluating the 'state of the art' (SOTA) from OpenAI, DeepSeek and Google
I’ve discussed the flurry of AI Reasoning products in recent days, from DeepSeek (R1) and Google (Flash Thinking Experimental) to OpenAI’s double punch of ‘Operator’ and ‘Deep Research’.
All of these reasoning products hold possibilities on their way from Level 2 AI Reasoning to Level 3 agent functionality, and then on to AGI someday.
Now OpenAI’s Sam Altman has apparently put down a timeline of two to three years, on stage in Tokyo this week alongside his new investment partner, SoftBank’s Masayoshi Son.
Regardless of when we get to AGI, the one thing we can do at this point in the AI Tech Wave is evaluate the LLM AI Reasoning products themselves.
And for that I thought it’d be helpful to showcase the spectrum of reviews, from worst, to kinda good, to pretty good. These products involve a lot of work to evaluate, both in setting up robust, detailed queries and then in painstakingly evaluating the results, which can be voluminous depending on the task at hand.
So these reviews are a snapshot in time of the products’ capabilities at this early point, and as one of the authors points out below, ‘they’re the worst they’ll ever be.’
Let’s start with Gary Marcus, who has for some time been the most skeptical critic of this AI Tech Wave. His posts are useful to keep in mind to understand the pros and cons of AI overall as it attempts to Scale to amazing heights. He is the most critical of the current crop of AI Reasoning products, with the following reasoning (pun intended):
“Time will tell, but I still feel that one of my most prescient essays was the one posted almost exactly two years ago to the day in which I warned that Generative AI might undermine Google’s business model by polluting the internet with garbage, entitled What Google Should Really be Worried About:”
“The basic conjecture then — which has already been confirmed to some degree – was that LLMs would be used to write garbage content undermining the value of the internet itself.”
“Deep Research, because it works faster and expands the reach of what can be automatically generated, is going to make that problem worse. Scammers will use it, for example, to write “medical reports” that are half-baked (in order to sell ads on the websites that report such “research”).”
“But it is not just scammers; the larger problem may be naive people who believe that the outputs are legit, or who use it as a shortcut for writing journal articles, missing the errors it produces.”
His whole piece is worth reading to understand the downside case he makes with examples.
More positive is this evaluation by Nate Jones in “A Deep Dive into OpenAI Deep Research: Can you Feel the AGI?”:
“Below is a comparative review and grading of the three reports. Each write-up addresses the same prompt—AI and automation trends—but they differ notably in length, level of detail, structure, and the breadth of research/data they include. I’ve evaluated them along several dimensions:”
He goes on to give comprehensive prompts to OpenAI o3 with Deep Research, DeepSeek R1 Deep Thinking, and Deep Research by Google.
Again, the methodology and results are notable for the current state of the art in AI Reasoning, which will only get better. His overall takeaways are as follows:
• “The common ground among all three is that AI-driven automation is accelerating, certain core industries will see the largest transformations, there’s an urgent need for upskilling and policy intervention, and governments must craft frameworks balancing growth and ethics.”
• “The key distinctions lie primarily in how in-depth each response is, how region-specific they get, and how many data points they provide.”
• “In sum, they largely agree on the big picture” of his core prompts and queries.
His core takeaway is memorable in its pithiness:
“I think the biggest thing I walk away with is the visceral sense that the world changed yet again. This thing is very good, very smart, and the thought that this is the dumbest AI is ever going to be is kind of blowing my mind.”
Last, and the one I found most interesting, is this experiment by Ben Thompson of Stratechery:
“Instead of trying to describe OpenAI’s Deep Research, I thought it would be better to show you. I did two prompts: the first one asked Deep Research to give me a report about Apple’s earnings in my style and voice.”
He goes on to describe his detailed prompts and the results, and they are fascinating for one particular reason. The test involves asking OpenAI’s Deep Research to write “a research report about Apple’s latest earnings in the style and voice of Stratechery that is in line with my previous analysis.”
In other words, learn from his own work and write a draft with that analytical rigor and voice.
It’s a detailed experiment, and the results are notable for the specific reason that they can be evaluated by the author for any hallucinations and incorrect facts, BECAUSE he is very familiar with his own work and analysis.
Again, the whole piece is worth reading, and his conclusions are worth noting:
“Pretty impressive, right?”
“Again — and this may be motivated thinking — I don’t think there is anything novel for me specifically. It’s also the case that the reason why the second answer is so good and insightful is precisely because I gave the model specific talking points, instead of asking it to come up with insight on its own. In other words, the model does a much better job of filling out a thesis than coming up with one. And, I would add, I think it did a pretty terrible job of speaking in my voice or style (and fell down badly when it tried to do so explicitly).”
“This is exactly what that second report in particular is demonstrating. The ideas are mine, but the substantiation is AI, and not only is it pretty damn good, it’s also the worst it is ever going to be.”
And that’s why I wanted to outline this tale of three AI Reasonings at this stage of the AI Tech Wave.
There is a LOT of room for improvement in the products from OpenAI, Google and DeepSeek, and in others beyond them.
But they’re likely going to get better, perhaps not as fast as we’d all like, in the AI Tech Wave ahead. Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here)