AI: OpenAI's text to video model Sora out and about. RTZ #565
...text to video and interactive world foundation models accelerating, far beyond video games & Hollywood
OpenAI has released its long-awaited text to video app Sora, rolling it out to wide availability over the coming weeks and leaving behind the ‘test phase’ with selected parties and partners. It’s part of its ‘12 Days of Shipmas’ I discussed this Saturday. And the reviews are coming in on the positive side.
Most notable is this YouTube review by MKBHD, who got early access to the full Sora service over the past few days. The 17-minute video is worth a watch for the ‘good, the bad, and the ugly’ of the product thus far, including the fact that for now it has no sound generation.
As TechCrunch describes the review of the service:
“Videos on the Sora homepage can be bookmarked for later viewing to a “Saved” tab, organized into folders, and clicked on to see which text prompts were used to create them. Sora can generate videos from uploaded images as well as prompts, according to Brownlee, and can edit existing Sora-originated videos.”
“Using the “Re-mix” feature, users can describe changes they want to see in a video and Sora will attempt to incorporate these in a newly generated clip. Re-mix has a “strength” setting that lets users specify how drastically they want Sora to change the target video, with higher values yielding videos that take more artistic liberties.”
“Sora can generate up to 1080p footage, Brownlee said — but the higher the resolution, the longer videos take to generate. 1080p footage takes 8x longer than 480p, the fastest option, while 720p takes 4x longer.”
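Those multipliers imply generation cost grows faster than linearly with pixel count. Here is a quick back-of-the-envelope sketch in Python, using only the relative timings from the review; the 60-second 480p baseline is a hypothetical placeholder, not a measured number:

```python
# Back-of-the-envelope estimate of Sora generation times, using only the
# relative multipliers from MKBHD's review: 480p = 1x, 720p = 4x, 1080p = 8x.
# The 60-second 480p baseline is a made-up placeholder, not a measured number.

RELATIVE_COST = {"480p": 1, "720p": 4, "1080p": 8}
BASELINE_480P_SECONDS = 60  # hypothetical baseline, for illustration only

def estimated_seconds(resolution: str) -> int:
    """Scale the hypothetical 480p baseline by the reported multiplier."""
    return BASELINE_480P_SECONDS * RELATIVE_COST[resolution]

for res in ("480p", "720p", "1080p"):
    print(f"{res}: ~{estimated_seconds(res)}s ({RELATIVE_COST[res]}x the 480p time)")
```

One aside: if 480p here means 854×480, then 1080p has roughly 5x the pixels but reportedly costs 8x the time, which would suggest super-linear scaling with resolution. That inference is mine, not Brownlee’s.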
The conclusion is that it’s impressive, though early in its development:
“So, what’s Sora good for? Brownlee found it to be useful for things like title slides in a certain style, animations, abstracts, and stop-motion footage. But he stopped short of endorsing it for anything photorealistic.”
“It’s impressive that it’s AI-generated video, but you can tell pretty quickly that it’s AI-generated video,” he said of the majority of Sora’s clips. “Things just get really wonky.”
Mashable adds its own take on the review:
“Sora is very much an ongoing work, as OpenAI shared during the launch. While it may offer a step up from other AI video generators, it's clear that there are just some areas that all AI video models are going to find challenging.”
And there are technical problems to fix going forward, like ‘object permanence’ and the models’ lack of physics context:
“The first thing Brownlee mentions is object permanence. Sora has issues with displaying, say, a specific object in an individual's hand throughout the runtime of the video. Sometimes the object will move or just suddenly disappear. Just like with AI text, Sora's AI video suffers from hallucinations.”
“Which brings Brownlee to Sora's biggest problem: Physics in general. Photorealistic video seems to be quite challenging for Sora because it just can't seem to get movement down right. A person simply walking will start slowing down or speeding up in unnatural ways. Body parts or objects will suddenly warp into something completely different at times as well.”
These are issues that can likely be mitigated over time with technical approaches and ‘hybrid’ computing frameworks that pair the models with external checks and constraints, as the sketch below illustrates.
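What such a ‘hybrid’ check might look like is not something OpenAI has detailed. As one illustrative possibility only, here is a minimal sketch of an object-permanence validator that takes per-frame detections from some off-the-shelf object detector (the detections are hypothetical inputs) and flags frames where a tracked object vanishes mid-clip:

```python
# Illustrative sketch of a post-hoc "object permanence" check on generated
# video. Assumes some upstream object detector has already produced, per
# frame, the set of labels it found; those detections are hypothetical
# inputs here, not part of any published Sora API.

def permanence_violations(frame_detections: list[set[str]]) -> list[tuple[int, str]]:
    """Return (frame_index, label) pairs where an object that appears both
    earlier and later in the clip is missing in between, i.e. it 'blinked out'."""
    violations = []
    all_labels = set().union(*frame_detections) if frame_detections else set()
    for label in sorted(all_labels):
        present = [label in frame for frame in frame_detections]
        first = present.index(True)
        last = len(present) - 1 - present[::-1].index(True)
        violations += [(i, label) for i in range(first, last + 1) if not present[i]]
    return violations

# Example: a coffee cup vanishes in frame 2 and reappears in frame 3.
frames = [{"person", "cup"}, {"person", "cup"}, {"person"}, {"person", "cup"}]
print(permanence_violations(frames))  # -> [(2, 'cup')]
```

A production pipeline might feed flagged segments back to the model for regeneration, or score candidate clips and keep the most consistent one. That is speculation about one possible design, not a description of OpenAI’s roadmap.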
Competitors are of course also active in the text to video space, including companies in China, as I’ve discussed.
In an adjacent space, Google announced Genie 2, a large-scale foundation world model designed to go from text and/or images to ‘interactive 3D worlds’:
“We introduce Genie 2, a foundation world model capable of generating an endless variety of action-controllable, playable 3D environments for training and evaluating embodied agents. Based on a single prompt image, it can be played by a human or AI agent using keyboard and mouse inputs.”
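DeepMind hasn’t published a programmatic interface for Genie 2, but ‘action-controllable’ and ‘keyboard and mouse inputs’ map naturally onto the classic agent-environment loop from reinforcement learning. Here is a minimal hypothetical sketch of that loop; every class, method, and action encoding in it is invented for illustration:

```python
# Hypothetical agent-environment loop for an action-controllable world model
# in the spirit of Genie 2. All names below are invented for illustration;
# DeepMind has not published a public Genie 2 API.

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    keys: frozenset[str] = frozenset()  # e.g. frozenset({"W"}) to move forward
    mouse_dx: float = 0.0               # horizontal look
    mouse_dy: float = 0.0               # vertical look

class WorldModelEnv:
    """Stand-in for a playable 3D environment seeded from one prompt image."""

    def __init__(self, prompt_image_path: str):
        # A real world model would render an initial frame from the image;
        # here we just keep a placeholder string.
        self.frame = f"<first frame generated from {prompt_image_path}>"

    def step(self, action: Action) -> str:
        # A real model would autoregressively predict the next frame
        # conditioned on the keyboard/mouse action; we echo a placeholder.
        self.frame = f"<next frame after keys={sorted(action.keys)}>"
        return self.frame

env = WorldModelEnv("cute_robot_in_the_woods.png")
for _ in range(3):  # a human player or an AI agent would pick these actions
    print(env.step(Action(keys=frozenset({"W"}))))
```

The key point the sketch makes is that the environment itself is model-generated: each `step` is a prediction conditioned on the action, not a lookup into hand-built game assets.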
There are some very impressive examples of the capabilities of this new tech, expanding foundation models into generating interactive worlds.
TechCrunch describes it further in “DeepMind’s Genie 2 can generate interactive worlds that look like video games”:
“DeepMind, Google’s AI research org, has unveiled a model that can generate an “endless” variety of playable 3D worlds.”
“Called Genie 2, the model — the successor to DeepMind’s Genie, which was released earlier this year — can generate an interactive, real-time scene from a single image and text description (e.g. “A cute humanoid robot in the woods”). In this way, it’s similar to models under development by Fei-Fei Li’s company, World Labs, and Israeli startup Decart.”
Other AI startups like World Labs are introducing their own versions of similar technologies, as Axios describes in “Fei-Fei Li's startup turns photos into 3D worlds”:
“Stanford professor Fei-Fei Li, dubbed the "godmother" of AI, on Monday released an early peek at what her AI startup has been working on: an engine that can turn still images into realistic three-dimensional worlds.”
“Fei-Fei Li co-founded World Labs with Ben Mildenhall, Justin Johnson and Christoph Lassner.”
“Why it matters: Li is focused on the intersection of generative AI and the physical world, a challenging but potentially lucrative area.”
“Driving the news: In an interactive blog post, Li's World Labs demonstrated its approach, sharing examples of realistic photos and fantasy images being turned into 3D worlds, including one based on a Van Gogh painting.”
The Observer describes how “Fei-Fei Li’s Startup Allows You to Walk in the 3D World of Edward Hopper Paintings”, which is also worth a read.
All these developments point to the accelerating pace of text to video AI models with a variety of approaches.
Potential applications abound. Everything from turning images of a loved one from the past into family movies, to taking images of favorite locations and turning them into interactive worlds to explore. And far beyond, of course. We’ll likely be positively surprised by applications here well beyond Hollywood and video gaming.
It’s early days for them all in this AI Tech Wave, but the near-term potential of these technologies, with all their current shortcomings, is truly fascinating for the possible consumer and enterprise applications and services to come. At Scale.
As MKBHD puts it in his review, these services, at this moment, ‘are the worst they’ll ever be.’ Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here)