A few days ago I did a fairly detailed piece on how even the scientists who are building today’s Foundation LLM AI models don’t know how the models do what they do, when they do it. Or how to make them reliably repeat what they do, after they do it.
And even how to explain how they do it (aka ‘Explainability’). Or even how to interpret their results (aka ‘Interpretability’). And the need for better ‘Steerability’ of the models. All critical elements for users, regulators (looking at you, Senator Schumer, on ‘Explainability’), and society to have better confidence in what AI can ultimately do for us all.
So it doesn’t help that recent days have seen a spate of anecdotal reports, and some new AI research, indicating that the top LLM AI model of them all, OpenAI’s GPT-4, which powers ChatGPT, may be losing its mojo in its current incarnation. Tom’s Hardware explains:
“In recent months there has been a groundswell of anecdotal evidence and general murmurings concerning a decline in the quality of ChatGPT responses.”
“A team of researchers from Stanford and UC Berkeley decided to determine whether there was indeed degradation and come up with metrics to quantify the scale of detrimental change. To cut a long story short, the dive in performance certainly wasn't imagined.”
Specifically, the researchers ran and measured a number of tests, and the results
“startlingly highlighted that "GPT-4's success rate on 'is this number prime? think step by step' fell from 97.6% to 2.4% from March to June."”
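To make that kind of test concrete, here is a minimal sketch of how such a primality check might be scored against the API. It is purely illustrative: it assumes the pre-v1.0 `openai` Python client, a hypothetical handful of test numbers, and a simple last-line answer check, none of which are the researchers' actual harness.

```python
# Hypothetical sketch: scoring GPT-4 on "is this number prime?" prompts.
# Assumes the pre-v1.0 `openai` Python client and an OPENAI_API_KEY in the environment.
import openai
from sympy import isprime  # ground truth for primality

TEST_NUMBERS = [17077, 20023, 3419, 9973]  # illustrative only, not the paper's dataset

def ask_is_prime(n: int) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # reduces (but does not eliminate) run-to-run variation
        messages=[{"role": "user",
                   "content": f"Is {n} a prime number? Think step by step, "
                              f"then answer with a final line of Yes or No."}],
    )
    return response["choices"][0]["message"]["content"]

correct = 0
for n in TEST_NUMBERS:
    answer = ask_is_prime(n)
    said_yes = "yes" in answer.strip().splitlines()[-1].lower()
    correct += (said_yes == isprime(n))

print(f"Success rate: {correct / len(TEST_NUMBERS):.1%}")
```

Even with the temperature set to 0, answers can vary between runs and between model snapshots, which is part of why quantifying this kind of change is so tricky.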
There is a lot more detail on the results in the piece above, and researchers there and elsewhere are digging further into them. And it may be too early to raise alarms. As this other piece by Princeton computer scientist Arvind Narayanan explains:
“A new paper making the rounds is being interpreted as saying that GPT-4 has gotten worse since its release. Unfortunately, this is a vast oversimplification of what the paper found.”
“One important concept to understand about chatbots is that there is a big difference between capability and behavior. A model that has a capability may or may not display that capability in response to a particular prompt.”
“For the last couple of months, many AI enthusiasts have been convinced, based on their own usage, that GPT-4’s performance has degraded.”
“When GPT-4’s architecture was (allegedly) leaked, there was a widely viewed claim that OpenAI degraded performance to save computation time and cost.”
“OpenAI, for its part, issued a clear denial that they degraded performance, which was interpreted by this community as gaslighting. So when the paper came out, it seemed to confirm these longstanding suspicions.”
The piece goes into the details of how the models do their thing (worth understanding):
“Chatbots acquire their capabilities through pre-training. It is an expensive process that takes months for the largest models, so it is never repeated. On the other hand, their behavior is heavily affected by fine tuning, which happens after pre-training. Fine tuning is much cheaper and is done regularly. Note that the base model, after pre-training, is just a fancy autocomplete: It doesn’t chat with the user.”
“The chatting behavior arises through fine tuning. Another important goal of fine tuning is to prevent undesirable outputs. In other words, fine tuning can both elicit and suppress capabilities.”
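To make the author’s distinction concrete, here is a rough sketch (again assuming the pre-v1.0 `openai` Python client, with illustrative model names): a base completion model simply continues the text it is given, while a chat model has been fine-tuned to respond as an assistant.

```python
# Rough illustration of "base model = fancy autocomplete" vs. fine-tuned chat behavior.
# Assumes the pre-v1.0 `openai` Python client; model names are illustrative.
import openai

# A base model just continues the text -- it does not "chat".
completion = openai.Completion.create(
    model="davinci",                      # base GPT-3 model, no chat fine-tuning
    prompt="Is 17 a prime number?",
    max_tokens=30,
)
print(completion["choices"][0]["text"])   # often just more text in the same vein

# A chat model has been fine-tuned to answer as an assistant.
chat = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Is 17 a prime number?"}],
)
print(chat["choices"][0]["message"]["content"])  # a direct, conversational answer
```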
Note that the author, when discussing ‘chatting behavior’, is referring to the reinforcement learning loops which, as I’ve outlined in an earlier post, are critical to improving the reliability and results of LLM AI models. He goes on:
“Knowing all this, we should expect a model’s capabilities to stay largely the same over time, while its behavior can vary substantially.”
And then he goes on to make a key point (Bolding mine):
“Behavior drift makes it hard to build reliable products on top of LLM AI APIs.”
“The user impact of behavior change and capability degradation can be very similar. Users tend to have specific workflows and prompting strategies that work well for their use cases. Given the nondeterministic nature of LLMs, it takes a lot of work to discover these strategies and arrive at a workflow that is well suited for a particular application. So when there is a behavior drift, those workflows might stop working.”
“It is little comfort to a frustrated ChatGPT user to be told that the capabilities they need still exist, but now require new prompting strategies to elicit. This is especially true for applications built on top of the GPT API. Code that is deployed to users might simply break if the model underneath changes its behavior.”
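As a purely hypothetical illustration of that last point, consider downstream code that bakes in assumptions about the exact shape of the model’s reply. A shift in output style alone, with no loss of capability, can break it. A minimal sketch, assuming the pre-v1.0 `openai` Python client:

```python
# Hypothetical example of downstream code that is brittle to behavior drift.
# Assumes the pre-v1.0 `openai` Python client.
import openai

def classify_sentiment(review: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-4",   # an unpinned model alias can change behavior under the hood
        temperature=0,
        messages=[{"role": "user",
                   "content": f"Classify the sentiment of this review as exactly one "
                              f"word, POSITIVE or NEGATIVE:\n\n{review}"}],
    )
    answer = response["choices"][0]["message"]["content"].strip()

    # Brittle: if a later fine-tune makes the model reply "The sentiment is POSITIVE."
    # instead of the bare word, this exact-match check starts failing in production.
    if answer not in ("POSITIVE", "NEGATIVE"):
        raise ValueError(f"Unexpected model output: {answer!r}")
    return answer
```

Pinning a dated model snapshot where the API offers one, and parsing less rigidly, are the usual mitigations, but neither fully removes the drift problem the author describes.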
All this is to say “your actual results may vary”, given the way LLM AI models currently do their thing. Frustrating to say the least.
The piece goes on to the punchline:
“In short, the new paper doesn’t show that GPT-4 capabilities have degraded. But it is a valuable reminder that the kind of fine tuning that LLMs regularly undergo can have unintended effects, including drastic behavior changes on some tasks. Finally, the pitfalls we uncovered are a reminder of how hard it is to quantitatively evaluate language models.”
And that makes ‘Explainability’ and ‘Interpretability’ much harder to do.
So the discussions around the variability of results in GPT-4 and other LLM AI models will continue. But this current kerfuffle is a reminder that we are in the early, early days of these technologies.
And as a result, users are going to have to live through the teething pains as AI technology gets sorted out, and as the behavior of these models and chatbots is made to drift less and become far more dependable. Stay tuned.
=====
UPDATE: Fast response by OpenAI on the ‘behavior drift’ issues raised in this post:
“We’re introducing custom instructions so that you can tailor ChatGPT to better meet your needs. This feature will be available in beta starting with the Plus plan today, expanding to all users in the coming weeks. Custom instructions allow you to add preferences or requirements that you’d like ChatGPT to consider when generating its responses.
We’ve heard your feedback about the friction of starting each ChatGPT conversation afresh. Through our conversations with users across 22 countries, we’ve deepened our understanding of the essential role steerability plays in enabling our models to effectively reflect the diverse contexts and unique needs of each person.
ChatGPT will consider your custom instructions for every conversation going forward. The model will consider the instructions every time it responds, so you won’t have to repeat your preferences or information in every conversation.”
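Custom instructions are a ChatGPT app feature, but the underlying idea, standing preferences applied to every conversation, maps roughly onto what API users already do with a system message. A minimal sketch, assuming the pre-v1.0 `openai` Python client and purely hypothetical preferences:

```python
# Rough analogue of "custom instructions" for API users: a standing system message
# prepended to every conversation. Assumes the pre-v1.0 `openai` Python client;
# the preferences below are purely hypothetical.
import openai

CUSTOM_INSTRUCTIONS = (
    "I am a Python developer. Keep answers concise, show code before prose, "
    "and assume I already know basic programming concepts."
)

def chat(user_message: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": CUSTOM_INSTRUCTIONS},  # applied every time
            {"role": "user", "content": user_message},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(chat("How do I read a CSV file?"))
```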
OpenAI’s announcement indicates they’re focused on user concerns and are being proactive. The response begins to address some of the issues, not least the ‘Goldfish Memory’ of LLM AI interactions today, which I discussed a while ago. There’s a lot more for OpenAI and the industry to do on this front, and to get better at ‘Explainability’, ‘Interpretability’, and ‘Steerability’. Stay tuned indeed.