The last few posts laid out how the emerging AI Tech Value Stack is building new value in ways similar to, yet different from, the PC and Internet waves that created trillions in value for public and private investors over forty years.
We saw how this current AI iteration builds upon the work done by Jeff Bezos’ Amazon Flywheel. And how one could envision the poster-child of LLM AI today, OpenAI, leveraging the same.
And then explained how the ‘Flywheel Loops’ this time also uniquely leverage Reinforcement Learning from Human Feedback and from AI Feedback (RLHF and RLAIF), based on human and AI interactions.
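For the technically minded, a toy sketch of that loop in Python may help. Everything here is a hypothetical stand-in, the canned responses, the scoring function, all of it; a real RLHF system would use a reward model trained on human preference labels and update the model’s weights with an RL algorithm like PPO, while RLAIF swaps in another AI model as the judge.

```python
# A minimal, illustrative sketch of the RLHF feedback loop; not production code.
# The 'model', prompts, and scoring function are all toy stand-ins.

import random

def generate_response(prompt: str) -> str:
    """Toy 'policy model': picks one of a few canned completions."""
    return random.choice(["short answer", "detailed, helpful answer", "off-topic reply"])

def preference_score(prompt: str, response: str) -> float:
    """Stand-in for a reward model trained on human preference labels (RLHF).
    In RLAIF, another AI model would supply this score instead."""
    return 1.0 if "helpful" in response else 0.0

# The loop: generate, score, and (in a real system) update the model weights
# so that higher-scoring responses become more likely over time.
for step in range(3):
    prompt = "Explain flywheels simply."
    response = generate_response(prompt)
    reward = preference_score(prompt, response)
    print(f"step {step}: reward={reward} for {response!r}")
    # real systems: policy_update(model, prompt, response, reward)
```

The point is the shape of the loop: generate, score, improve, repeat.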
And yesterday, we talked about the one other ingredient that makes the AI stack very different from the PC and Internet stacks.
That being, of course, Data. It’s Box 4 in the AI Tech Value Stack below.
Data is the foundation upon which Foundation LLM AI models are built by OpenAI, Google, Meta, and everyone else going into the LLM AI business worldwide.
Today, almost everyone uses pretty much the same pile of digital data available online as the initial raw material for their LLM AI models. Chart 2 below shows the Data box at a process level, while Chart 3 shows some representative sources.
Everyone uses the Internet sources at hand: Wikipedia, Reddit (more on that soon), Common Crawl web scrapes, and others. The Data, once ‘extracted’, is then processed and optimized for use by users and businesses. This area is seeing particularly intense innovation and will be a driver of the ultimate utility and reliability of AI.
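To make that ‘extract, then process’ step concrete, here is a minimal, hypothetical sketch of the kind of cleaning and de-duplication these pipelines do. Real pipelines over sources like Common Crawl add language detection, quality scoring, near-duplicate detection and much more; the sample pages below are made up for illustration.

```python
# A minimal, hypothetical sketch of the 'extract, then process' step
# for web training data.

import hashlib
import re

def clean(text: str) -> str:
    """Strip leftover HTML tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML remnants
    return re.sub(r"\s+", " ", text).strip()

def dedupe(docs: list[str]) -> list[str]:
    """Drop exact duplicates by hashing each cleaned document."""
    seen, out = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(doc)
    return out

raw_pages = [
    "<p>An example web page about flywheels.</p>",
    "<p>An example web page about flywheels.</p>",  # exact duplicate
    "<div>Another page, short.</div>",
]

cleaned = [clean(p) for p in raw_pages]
# keep de-duplicated docs, filtering out very short ones
corpus = [doc for doc in dedupe(cleaned) if len(doc.split()) >= 3]
print(corpus)
```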
Many observers are increasingly concerned that all of the obvious online Data sources may have been tapped out already.
And as I stated yesterday, today’s online data troves are but ‘the tip of the tip of the iceberg’. And far from being a ‘Data Moat’ for the current crop of LLM AI leaders.
This week we see Google potentially tapping into a uniquely rich and massive trove of new data to train its LLM AI models on. As this report highlights:
“YouTube, which Google owns, is the single biggest and richest source of imagery, audio and text transcripts on the internet. And Google's researchers have been using YouTube to develop its next large-language model, Gemini.”
This is not about using AI to generate new video, which ‘AI Native’ companies like Runway and others are doing. But rather using YouTube videos to train the next phase of LLM AI models.
YouTube is a huge part of what we do on the Internet, as this report highlighted in numbers and history. The punchline:
“Almost 5 billion videos are watched on Youtube every single day.”
Remember, there are 8 billion people on the planet, and only around two-thirds of them are on the internet via mobile or PC. That works out to nearly one YouTube video watched per internet user, every single day, on average.
Meta and TikTok, with their own massive video libraries, are similarly focused.
Indeed, this week’s report that TikTok’s parent ByteDance has placed orders for a billion dollars of Nvidia’s much-sought-after AI GPU chips is a sign of these efforts and investments. Note that Meta and Microsoft are also among Nvidia’s largest GPU customers.
Google is also cleverly using other AI models and tools, like its Visual Language Model (VLM) Flamingo, to generate text descriptions for the huge number of YouTube Shorts, which typically carry sparse labeling, further feeding its Bard, PaLM 2 and other LLM AI models.
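Flamingo itself is not publicly available, but the general technique can be sketched with open tools. The sketch below uses BLIP, an open image-captioning model on Hugging Face, as a stand-in, sampling frames from a video and generating text descriptions; the video filename is a placeholder.

```python
# A hypothetical sketch of the general technique: auto-captioning video
# frames with an open visual language model. BLIP stands in for Flamingo,
# and 'shorts_clip.mp4' is a placeholder file.

import cv2  # pip install opencv-python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_frame(frame) -> str:
    """Turn one BGR video frame into a text description."""
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)

video = cv2.VideoCapture("shorts_clip.mp4")
captions = []
frame_index = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if frame_index % 150 == 0:  # sample roughly one frame every ~5s at 30fps
        captions.append(caption_frame(frame))
    frame_index += 1
video.release()

print(captions)  # text labels that could feed a text-based training corpus
```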
The opportunity to use YouTube as training data has not been lost on competitors. As this report highlights, OpenAI may have been secretly scraping YouTube data to train its GPT models already.
Separately, others beyond the big tech companies are hard at work on this approach as well. As this AI development explains, there is another novel, out-of-the-box way that YouTube data can be used for video training of LLM AI models: “Robots learn to perform chores by watching YouTube”. So the robots can learn from TV, just as kids learn from Nickelodeon.
And others are also working along these lines, as this piece explains:
“Imagine a self-learning robot that can enrich its knowledge about fine-grained manipulation actions (such as preparing food) simply by “watching” demo videos.”
All this highlights again how we are at the ‘tip of the tip’ for new sources of Data to train LLM AI models. I’ll have more on that subject in future posts.
For now, just recognize that it’s not just billions of people watching and learning from YouTube, TikTok and Meta videos. The AI machines are as well. For our benefit, of course. Stay tuned.