In this Sunday’s ‘AI: In My View’, I want to tackle an increasing fear voiced by the leading Foundation LLM AI companies, and their users, supporters, and developers: that we may run out of Data to feed the ever-growing models. As we’ve discussed before, the industry needs ‘Extractive Data’ to feed those models, and of course data engagement for the ‘reinforcement learning’ feedback loops that are so important to making LLM AI results more relevant and accurate.
Even though the obvious sources increasingly look as if they’ve all been tapped, and many of their owners are asking for payment before further use or going to court, I’ve maintained that we’ve barely scratched the tip of the data iceberg. One source of my optimism lies in our ‘Data Exhaust’: the data generated as information trails and byproducts of our online activities. Let me explain.
First, let’s address the fears about running out of Data for the growing Foundation LLM AI models. As this recent piece in VentureBeat asks: “What happens when we run out of data for AI models?”:
“However, according to recent research done by Epoch, we might soon need more data for training AI models. The team has investigated the amount of high-quality data available on the internet. (“High quality” indicated resources like Wikipedia, as opposed to low-quality data, such as social media posts.) “
“The analysis shows that high-quality data will be exhausted soon, likely before 2026. While the sources for low-quality data will be exhausted only decades later, it’s clear that the current trend of endlessly scaling models to improve results might slow down soon.”
There is a global scramble for Data to plug into the ever-larger LLM AI models, as this Economist piece highlights:
“Demand for data is growing so fast that the stock of high-quality text available for training may be exhausted by 2026. The latest AI models from Google and Meta, two tech giants, are likely trained on over 1 trillion words. By comparison, the sum total of English words on Wikipedia, an online encyclopedia, is about 4 billion.”
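To put the gap the Economist describes in perspective, here is a minimal back-of-envelope sketch using the article’s own round numbers (roughly 1 trillion training words versus roughly 4 billion English words on Wikipedia; both are the quote’s approximations, not precise counts):

```python
# Back-of-envelope check of the scale gap described above.
# Both figures are the article's rough approximations.
training_words = 1_000_000_000_000   # ~1 trillion words in a frontier training set
wikipedia_words = 4_000_000_000      # ~4 billion English words on Wikipedia

ratio = training_words / wikipedia_words
print(f"A 1-trillion-word corpus is ~{ratio:.0f}x all of English Wikipedia")
# → A 1-trillion-word corpus is ~250x all of English Wikipedia
```

In other words, one training run can consume hundreds of Wikipedias’ worth of text, which is why high-quality sources deplete so quickly.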
Of course, there are also big pockets of Data within the firewalls of companies:
“There is, however, one source of data that remains largely untapped: the information that exists within the walls of the tech firms’ corporate customers. Many businesses possess, often unwittingly, vast amounts of useful data, from call-centre transcripts to customer spending records. Such information is especially valuable because it can be used to fine-tune models for specific business purposes, such as helping call-centre workers answer queries or analysts spot ways to boost sales.”
“Yet making use of that rich resource is not always straightforward. Roy Singh of Bain, a consultancy, notes that most firms have historically paid little attention to the types of vast but unstructured datasets that would prove most useful for training AI tools. Often these are spread across various systems, buried in company servers rather than in the cloud.”
In fact, a recent survey of business executives trying to harness LLM AI technologies on their own pools of data found the task so daunting that it may actually take years to use all the data they have available for AI-leveraged products and services. And of course it will require a massive amount of forward investment.
Another possible source is ‘Synthetic Data’, which has its potential uses and pitfalls:
“Microsoft, OpenAI and Cohere are among the groups testing the use of so-called synthetic data — computer-generated information to train their AI systems known as large language models (LLMs) — as they reach the limits of human-made data that can further improve the cutting-edge technology.”
The experts aren’t sure we’re anywhere close to the end of our Data rope, as outlined in this piece I penned recently. Dario Amodei, co-founder and CEO of Anthropic, a leading Foundation LLM AI company, replied when asked whether Data might be a long-term constraint:
“My guess is that this will not be a blocker. Maybe it would be better if it was, but it won’t be.”
The thing is that our online activity, by over five billion of us daily, generates more data than we know what to do with. As this G2 piece outlines:
“When you think about data, follow the zeros.”
“As in 18 zeroes, as in more than 2,500,000,000,000,000,000 bytes (or 2.5 quintillions) of data are created each day. As in, holy guacamole.”
“This exorbitant amount of data relates to all things: our preferences, interactions, and everything else we do on the internet and connected devices. Companies take heaping servings of this data – millions of scraps at a time – and turn them into actionable insights using big data analytics software.“
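The G2 figure above can be made a bit more concrete with a quick, hedged calculation: 2.5 quintillion bytes per day is 2.5 exabytes (in decimal units), and dividing by the roughly five billion of us online daily gives an illustrative per-person share. Both inputs are the article’s round numbers, so treat the result as an order-of-magnitude sketch:

```python
# Rough arithmetic on the G2 figure: 2.5 quintillion bytes of data per day.
# The five-billion-user figure is the article's; the per-user share is illustrative.
bytes_per_day = 2_500_000_000_000_000_000   # 2.5 quintillion bytes
online_users = 5_000_000_000                # "over five billion of us daily"

exabytes_per_day = bytes_per_day / 1e18     # decimal exabytes (1 EB = 10^18 bytes)
mb_per_user = bytes_per_day / online_users / 1e6  # decimal megabytes per person

print(f"{exabytes_per_day:.1f} EB/day, roughly {mb_per_user:.0f} MB per person per day")
# → 2.5 EB/day, roughly 500 MB per person per day
```

Half a gigabyte per person per day, every day, gives a sense of why so much of this exhaust has historically gone unanalyzed.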
Much of this ‘data exhaust’ simply goes unused. Until, of course, LLM AI came along with a possible use for it.
To make the numbers more comprehensible, just zoom into the much smaller amount of data we generate on our phones in our camera rolls via photos, selfies, and screenshots every day, as this Wired piece poignantly tries to capture (links mine):
“Our camera roll is full of digital footprints, this may simply be evidence that life online is moving faster than your offline existence—that the need to shape chaos into a coherent narrative feels more urgent in the realm of infinite scrolls than it does in the clearly marked hours you experience IRL (in real life).”
This type of Data ‘at the Edge’ of course is one of the prime opportunities for companies like Apple, with over two billion devices in the hands of consumers, collecting ‘data exhaust’ every day. Apple remains one of the best positioned companies on the AI front in my view, even before the upcoming Vision Pro platform.
So overall, Data exhaust is of course much larger, almost unimaginably so, in both our personal and professional lives. And that’s before we even begin to leverage video and voice for AI applications going forward, where Google remains one of the first to benefit at scale.
The key is going to be figuring out creative, out-of-the-box uses for all the Data exhaust generated daily in our online activities for both personal and business use. Just plugging that stuff into LLM AI models doesn’t do the trick.
It’s a start, for the early current generation of models, but to take things further is going to require a LOT more work, and time. And of course mounds of investment. Let’s adjust our expectation clocks. But there will likely be light at the end of inexhaustible Data tunnels. And no, we won’t run out of Data to use. Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here).