How Sora generates realistic videos from text

Sora, the OpenAI video generator that has filled headlines across the media over the last few hours, has potential uses in sectors of all types.

Today I want to talk about its application in the world of video games, because Sora is not only capable of generating hyperrealistic videos but also shows an unprecedented ability to create and simulate video game worlds, something that until now seemed reserved for human developers and the engines they program by hand.

According to OpenAI, Sora can “simulate artificial processes,” which includes controlling a player in Minecraft and rendering the world and its dynamics in great detail, all autonomously. This opens a universe of possibilities for the design and development of video games.

What sets Sora apart from other video generators is its focus on simulating real-world physics, acting more like a “data-driven physics engine” than a mere image generator. In effect, the model predicts how objects interact with their environment, building what is known as a “world model.” This ability makes Sora a promising tool for the generation of video games, something that Nvidia senior researcher Dr. Jim Fan highlighted.

Does Sora learn physics?

The objection that “Sora isn’t learning physics, it’s just manipulating 2D pixels” is one I have heard often and one I respectfully disagree with, because it underestimates the complexity and potential of the technology we are discussing. This criticism strikes me as reductionist: it does not capture how emerging technologies like Sora work, nor does it recognize the real advance they represent.

To put it in context, as Jim Fan did, let’s consider the evolution and capabilities of GPT-4, a language model that has been shown to be able to generate executable Python code. GPT-4 does not store Python syntax trees explicitly; instead, it learns implicit forms of syntax, semantics, and data structures to generate code. This learning process is not trivial: it is the result of manipulating sequences of integers (token identifiers) on a massive scale, allowing the model to capture and replicate complex patterns of language and logic.
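To make the "sequences of integers" point concrete, here is a deliberately tiny sketch in Python. It is an illustration only and not how GPT-4 is built: the regex tokenizer, the two-line corpus, and the bigram counter (`tokenize`, `build_vocab`, `train_bigram`, `predict_next`) are all hypothetical stand-ins chosen for brevity. The point is only that a model which sees nothing but token IDs can still pick up syntactic regularities, such as a colon following the closing parenthesis of a function signature.

```python
import re
from collections import Counter, defaultdict

def tokenize(source):
    """Split source into crude tokens: identifiers, numbers, or single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)

def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def train_bigram(ids):
    """Count which token ID follows which; the model never sees the text itself."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(ids, ids[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token_id):
    """Return the most frequent follower of token_id, or None if it was never seen."""
    followers = counts.get(token_id)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy corpus (an assumption for illustration): two small Python functions.
corpus = "def add(a, b): return a + b\ndef sub(a, b): return a - b\n"
tokens = tokenize(corpus)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]           # the integer sequence the model trains on
model = train_bigram(ids)
id_to_token = {i: t for t, i in vocab.items()}

# The model was never told Python's grammar, yet it has picked up that
# ")" is followed by ":" in this corpus.
print(id_to_token[predict_next(model, vocab[")"])])
```

A real language model replaces the bigram table with a large neural network and a far richer tokenizer, but the training signal is the same kind of thing: predict the next integer.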

Similarly, Sora must learn implicit forms of text-to-3D, 3D transformations, ray-traced rendering, and physics rules to model video pixels as accurately as possible. This means that Sora is learning the concepts of a game engine to satisfy its training objective, not through explicit programming of these rules, but through observation and analysis of enormous amounts of data. This learning capability is an emergent property of massively scaling text-to-video training.
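The idea of learning dynamics from observation rather than from hand-coded rules can be sketched with a toy example in Python. Everything here is an assumption made for illustration and has nothing to do with Sora's actual architecture: the "world" is a single ball in free fall, the "video" is a list of heights, and the learner is handed a strong structural hint (that the next height depends on the previous two plus a constant). What it is not given is the value of gravity; it estimates that constant from the observed frames and then predicts frames it has never seen.

```python
import random

def simulate_fall(y0, v0, g=-9.81, dt=0.05, steps=40):
    """Ground truth: heights of a ball in free fall (semi-implicit Euler, no floor)."""
    ys, y, v = [], y0, v0
    for _ in range(steps):
        ys.append(y)
        v += g * dt
        y += v * dt
    return ys

def fit_second_difference(trajectories):
    """Estimate c in y[t+1] = 2*y[t] - y[t-1] + c by averaging over observed frames."""
    diffs = [
        ys[t + 1] - 2 * ys[t] + ys[t - 1]
        for ys in trajectories
        for t in range(1, len(ys) - 1)
    ]
    return sum(diffs) / len(diffs)

def rollout(y_prev, y_curr, c, steps):
    """Predict future heights from two observed frames using the learned constant."""
    predicted = []
    for _ in range(steps):
        y_next = 2 * y_curr - y_prev + c
        predicted.append(y_next)
        y_prev, y_curr = y_curr, y_next
    return predicted

# "Training videos": five falls with random starting heights and velocities.
random.seed(0)
training_data = [
    simulate_fall(random.uniform(10, 100), random.uniform(-5, 5))
    for _ in range(5)
]
learned_c = fit_second_difference(training_data)

# Predict an unseen fall from its first two frames only.
unseen = simulate_fall(50.0, 3.0)
predicted = rollout(unseen[0], unseen[1], learned_c, len(unseen) - 2)
worst_error = max(abs(p - t) for p, t in zip(predicted, unseen[2:]))
print(learned_c, worst_error)
```

The learner never sees `g` or `dt`, only frames, yet the constant it recovers encodes both. A video model faces a vastly harder version of the same problem, with pixels instead of heights and no structural hint, which is one reason its grasp of physics remains fragile.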

It is important to recognize that Sora is not intended to replace game engine developers. Its emergent understanding of physics is fragile and far from perfect: it still produces hallucinations and errors that do not match our common-sense intuitions about physics. So although Sora is a powerful tool, it still has significant limitations in its ability to consistently simulate complex physical interactions.

However, Sora’s arrival, like the GPT-3 moment in 2020, is an indicator of what could be possible in the future. GPT-3, despite its imperfections, was a compelling demonstration of in-context learning as an emergent property. Instead of focusing on the current imperfections of these models, we should consider what they could achieve as they continue to evolve. Just as GPT-3 pointed the way to GPT-4, extrapolating from today’s Sora to its successors offers an exciting glimpse of how artificial intelligence could transform fields like video game development.

Sora’s limits

As with any emerging technology, Sora is not without limitations. Although its early tests have shown great potential, OpenAI admits that the model still does not accurately model the physics of many basic interactions, which has resulted in some strange and sometimes hilarious videos. Even so, Sora has overcome challenges that other video generators have not, such as “object permanence” and better camera movement dynamics.

Speculation about what Sora was trained on is rife, with rumors suggesting the use of video game engines such as Unreal Engine 5 to aid in its training. Although OpenAI has not confirmed these rumors, the idea that Sora may have learned from existing digital worlds is fascinating and raises questions about intellectual property and proper attribution, especially considering the lawsuits already filed against OpenAI for training earlier models on copyrighted material without compensation.

What is clear is that Sora has the potential to be a game-changer in video game development, significantly lowering the barrier to entry for developers and allowing for faster, more efficient content creation. However, it also raises significant challenges, from respect for intellectual property to the impact on employment within the video game industry, which has already suffered numerous layoffs in the last year.
