It’s hard to believe, but barely ten months have flown by since OpenAI’s ‘ChatGPT moment’ flicked on the light bulb for the world about the possibilities of LLM AI (large language model AI). At least a dozen companies, large and small, are staking tens of billions on the next, bigger versions of Foundation LLM AI models, trained on ever-bigger pools of data, especially video, to give us what we can call ‘Unified Multimodal’ AI ‘chatbots’ and AI-powered Search boxes.
The front-runners in the race are of course Google and OpenAI/Microsoft (with Meta not far behind). As The Information outlines in “OpenAI Hustles to Beat Google to Launch ‘Multimodal’ LLM,” featuring OpenAI and Google AI head honchos Greg Brockman and Demis Hassabis:
“As fall approaches, Google and OpenAI are locked in a good ol’ fashioned software race, aiming to launch the next generation of large-language models: multimodal. These models can work with images and text alike, producing code for a website just by seeing a sketch of what a user wants the site to look like, for instance, or spitting out a text analysis of visual charts so you don’t have to ask your engineer friend what these ones mean.”
“Google’s getting close. It has shared its upcoming Gemini multimodal LLM with a small group of outside companies (as I scooped last week), but OpenAI wants to beat Google to the punch. The Microsoft-backed startup is racing to integrate GPT-4, its most advanced LLM, with multimodal features akin to what Gemini will offer, according to a person with knowledge of the situation. OpenAI previewed those features when it launched GPT-4 in March but didn’t make them available except to one company, Be My Eyes, that created technology for people who were blind or had low vision. Six months later, the company is preparing to roll out the features, known as GPT-Vision, more broadly.”
We’ve already been able to use text to create reasoned answers to prompts and queries via OpenAI’s ChatGPT, based on its GPT-3, 3.5, and 4 LLM AI models. And do similar things with Google’s competing Bard and PaLM 2-based AI-enhanced Search (aka Search Generative Experience, SGE). And then go to separate sites and apps to run similar text queries to generate code, photos, and videos. Separate products and companies like Copilot from Microsoft, DALL-E from OpenAI, Midjourney, Runway, and many more are already generating hundreds of millions in revenues and are valued in the billions. Multimodal, text-driven AI queries.
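To make that ‘separate places’ point concrete, here is a minimal sketch of what two of those standalone, text-driven queries look like to a developer today via OpenAI’s Python client: one call for text answers, and a completely separate call (and model) for images. The model names and prompts are illustrative assumptions, not details from the piece.

```python
# A minimal sketch: today, text-to-text and text-to-image are separate calls
# to separate models (and often entirely separate products).
# Assumes the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Text query -> text answer (the familiar ChatGPT-style call).
answer = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[{"role": "user", "content": "Explain what 'unified multimodal' AI means."}],
)
print(answer.choices[0].message.content)

# The same kind of text prompt, but sent to a different endpoint and model
# to get an image back -- a separate product, a separate 'place.'
image = client.images.generate(
    model="dall-e-3",  # illustrative model name
    prompt="A single search box handling text, images, and voice",
)
print(image.data[0].url)
```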
But each time, millions of users have had to go to a separate place for each of those tasks and capabilities. ‘Unified Multimodal’ means doing it all from the same place. One search/chatbot box to do it all.
We got a tantalizing glimpse of this back in March from OpenAI’s President Greg Brockman, when he used a smartphone to take a photo of some notes describing a website that told jokes, and then showed how GPT-4 converted that picture of the notes into a working website that told jokes. It was breathtaking to watch. Here is a link to the video, which takes under a minute to watch.
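For a sense of how that kind of unified, image-plus-text query might look to developers once GPT-Vision and Gemini roll out, here is a hedged sketch using the image-input message format in OpenAI’s Python client. The model name and image URL are placeholders and assumptions; the multimodal endpoints discussed above were not broadly available at the time of writing, so treat this as a sketch of the pattern, not the final API.

```python
# A speculative sketch of a unified multimodal query: a photo of handwritten
# notes goes in, working website code comes out, all through one chat call.
# The model name and image URL below are placeholders; the exact API shape
# may differ once GPT-Vision / Gemini are generally available.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Turn this sketch of a joke website into working HTML and JavaScript."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/napkin-sketch.jpg"}},  # placeholder photo
            ],
        }
    ],
    max_tokens=1500,
)

print(response.choices[0].message.content)  # the generated HTML/JS
```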
The next race for all the big AI companies is to give us all this capability as quickly, and at as large a scale, as possible. The holdups have been the additional work to make sure this technology works safely and reliably, and that it can scale to tens of millions of users and beyond.
That last bit has been tough due to the relative scarcity of AI GPUs from companies like Nvidia and their manufacturers at TSMC in Taiwan and elsewhere. Google has its own AI chips, Tensor Processing Units (TPUs), also made by the same folks at TSMC. Tens of billions in investment by them all in this AI Tech Wave gold rush is going to make sure that problem gets solved in the next couple of years. The world will have millions of these GPUs, which cost tens of thousands of dollars each. They’ll be available to most customers, except those in China. As I’ve said before, GPU chips are like potato chips: they can always make more. In this case, with a bit more time and a lot more money.
As the examples above indicate, the underlying capability of multimodal LLM AI is visual. Not only images as input for instructions, but increasingly visual data as new input to train the much larger Foundation LLM AI models, on ever-larger clusters of faster GPU chips, both in the cloud and locally on our devices.
Here, Google has a bit of an advantage with its vast pools of video data in YouTube, something I’ve talked about before. Again, as this other piece by The Information, titled “Why YouTube Could Give Google an Edge in AI,” explains:
“Google last month upgraded its Bard chatbot with a new machine-learning model that can better understand conversational language and compete with OpenAI’s ChatGPT. As Google develops a sequel to that model, it may hold a trump card: YouTube.”
“The video site, which Google owns, is the single biggest and richest source of imagery, audio and text transcripts on the internet. And Google’s researchers have been using YouTube to develop its next large-language model, Gemini, according to a person with knowledge of the situation. The value of YouTube hasn’t been lost on OpenAI, either: The startup has secretly used data from the site to train some of its artificial intelligence models, said one person with direct knowledge of the effort.”
Of course, other companies like TikTok by ByteDance, Meta with its billions of ‘Reels’ video users across Instagram, Facebook, and other apps, and Elon’s Twitter/X and xAI are also contenders to have their next-generation LLM AI models go to school on video as well. It’s going to take a huge amount of compute infrastructure to train and scale these LLM AI models, not to mention billions more in capital expenditures (capex).
It’ll get even more mainstream when voice and translation are added to that multimodality, giving us the ability to query these models with our voices. Again, Amazon with Alexa and Echo, Google with Nest and Google Assistant, Apple with Siri, and many others are also investing billions to upgrade their ‘Voice Assistants’ with LLM AI capabilities. That will be a concurrent addition to ‘Multimodal’ AI in the coming months.
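To illustrate the voice side of that multimodality, here is a rough sketch of the common pattern today: speech in, transcription, an LLM call on the transcript, and synthesized speech back out. It uses OpenAI’s Whisper transcription and text-to-speech endpoints purely as stand-ins; the file names and model choices are illustrative assumptions, and the assistant upgrades from Amazon, Google, and Apple will of course run on their own stacks.

```python
# A rough sketch of the voice-assistant pattern: transcribe speech, query an
# LLM with the transcript, and read the answer back. File names and model
# choices are illustrative; real voice assistants use their own pipelines.
from openai import OpenAI

client = OpenAI()

# 1. Speech in -> text (Whisper transcription endpoint).
with open("question.mp3", "rb") as audio_file:  # placeholder recording
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Transcript -> LLM answer.
reply = client.chat.completions.create(
    model="gpt-4",  # illustrative model name
    messages=[{"role": "user", "content": transcript.text}],
)
answer_text = reply.choices[0].message.content

# 3. Text answer -> spoken audio (text-to-speech endpoint).
speech = client.audio.speech.create(
    model="tts-1",   # illustrative TTS model name
    voice="alloy",
    input=answer_text,
)
speech.stream_to_file("answer.mp3")
```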
Hacker developers aren’t waiting for the official products and releases. For a glimpse into this near future, check out this video and report on a demo built by developer Justin Alvey using off-the-shelf Google Nest Mini hardware. It’s truly a peek into the future through a keyhole.
All these capabilities in multimodal LLM AI models, along with the coming ‘Smart Agent’-driven AI, are likely to be the next big things that get more mainstream users to try this ‘AI thing’ for real in the next few years.
But up first in the coming weeks are Google with Gemini and OpenAI with ChatGPT, both with visual multimodal capabilities, all without users having to go to multiple sites and apps. It’s just the first race in a long series of races to come.
Remember, OpenAI has its one-day Developer conference coming up November 6. But they may not wait even that long given the rumblings on Google Gemini. Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here)