Advisor

AI’s Next Frontier: Advancing Large World Models for Robotics & AVs

Posted February 19, 2025 | Technology | Large World Models

This Advisor series examines the concept of large world models (LWMs) and their importance in building the next generation of robots and autonomous vehicles (AVs). In Part I, we introduced LWM technology and examined its potential uses. Here in Part II, we discuss leading companies’ efforts to develop LWMs for various applications, including robotics and AVs.

LWM Recap

LWMs are a type of generative AI (GenAI) designed to model and understand the dynamics of the real world. The goal is to build a virtual, 3D representation of the actual world that AI developers can use to train, test, and deploy physical AI systems. LWMs are trained on huge multimodal datasets (text, photos, video, and audio), which enables them to build accurate internal representations of the complex physical interactions and spatial properties of the world. This is key because it allows LWMs to represent and predict dynamics like motion, force, gravity, and friction, as well as spatial relationships. Moreover, LWMs can reason about the consequences of actions taken within an environment. These qualities enable LWMs to generate highly detailed and accurate simulations of real-world scenarios, including virtual environments in which robots and AVs can learn to perform key tasks like manipulation, navigation, and obstacle avoidance.
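
To make the idea of predicting physical dynamics concrete, here is a toy sketch (not any vendor’s actual model or API) of what a world model does at its core: it maintains an internal state and rolls it forward in time to forecast future world states. The gravity and friction constants, the state layout, and the function names are all illustrative assumptions.

```python
# Toy illustration of world-model rollout: predict how a falling
# object's state (height, velocity) evolves under gravity and friction.
GRAVITY = -9.81   # m/s^2 (assumed constant)
FRICTION = 0.05   # fraction of velocity lost per step (assumed)
DT = 0.1          # simulation timestep in seconds

def predict_next(state):
    """Predict the next (height, velocity) from the current state."""
    height, velocity = state
    velocity = velocity * (1.0 - FRICTION) + GRAVITY * DT
    height = max(0.0, height + velocity * DT)  # clamp at the ground
    return (height, velocity)

def rollout(state, steps):
    """Roll the model forward to forecast a trajectory of future states."""
    trajectory = [state]
    for _ in range(steps):
        state = predict_next(state)
        trajectory.append(state)
    return trajectory

# Forecast 2 simulated seconds for an object dropped from 10 m.
traj = rollout((10.0, 0.0), steps=20)
print(traj[-1])  # final predicted (height, velocity)
```

A real LWM replaces these hand-written physics rules with dynamics learned from multimodal data, but the rollout structure (state in, predicted future states out) is the same.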

AI developers see LWMs as a way to build advanced robots and AVs with the ability to: (1) better understand their surroundings, (2) accurately predict future states and actions, and (3) make more informed decisions, ultimately improving their performance and safety. For more on the benefits that LWMs offer robots and AVs, see my previous Advisor.

Companies Pioneering LWM Development

A number of companies are developing LWMs for various applications and industries, including, to varying degrees, to support robotics and AV development. They range from start-ups like Decart, Odyssey, and World Labs to Big Tech players like Google/DeepMind and Nvidia. Let’s explore their efforts, keeping in mind that LWMs are still a cutting-edge technology and that developments are moving fast.

Decart

Decart’s LWM efforts focus on creating interactive GenAI experiences and optimizing AI infrastructure for enterprise applications. Its Oasis platform features advanced AI infrastructure capable of training generative interactive models and making them accessible in real time, including for real-time video generation and adaptive virtual worlds that users can create and explore. Oasis is frequently compared to Minecraft (a hugely popular video game in which players create and explore their own virtual environments). Basically, Oasis is a consumer offering that enables dynamic audio-visual interactions that evolve based on user input. It has received considerable attention and engagement, with millions of users interacting with it.

Decart also offers a graphics processing unit (GPU) optimization tool for enterprise AI development. This is a systems-level solution designed to maximize GPU efficiency during the training and inference phases of AI model deployment. This technology supports advanced AI model development by accelerating training processes and optimizing deployment for enterprise clients. It’s designed to significantly reduce operational expenses, making AI application development more affordable and scalable. A number of companies reportedly use it to reduce the operational costs associated with building and running AI models.

Support for Robotics & AVs

Decart is not currently focused directly on supporting robotics and AVs; however, its ability to train highly interactive models and generate real-time video could support model development in these domains. For example, its AI infrastructure platform and generative modeling tools could be applied to enhance the capabilities of AVs, providing real-time adaptive responses and improving the overall efficiency of autonomous systems.

Google/DeepMind

Google’s DeepMind AI group has developed Genie 2, a foundation world model (FWM)1 that can generate a variety of playable 3D environments — including for training and evaluating AI models and systems. Genie 2 generates complex 3D environments with support for features like physics simulation, character animation, lighting effects, and complex object interactions, all from a single prompt image. DeepMind claims that Genie 2 can support unlimited diverse training environments for AI systems, enabling researchers to test and develop more general embodied AI systems. This includes virtual worlds that more accurately model physical elements and interactions occurring in the real world (e.g., the movement of grass or the flow of water).

Support for Robotics & AVs

Google/DeepMind targets its technology at various applications and industries, including gaming and entertainment, healthcare and life sciences, and enterprise AI. When it comes to supporting robotics and AV development efforts, it is necessary to consider the company’s broader work in these areas, including other platforms and generative models that can be applied to enhance the capabilities of AVs and robots. Two stand out in particular: AutoRT (a system that harnesses large foundation models and large visual models to better train robots, enabling them to understand practical human goals and perform diverse tasks in real-world environments) and ALOHA Unleashed (an AI system that helps robots learn to perform complex tasks requiring dexterous movement, such as tying shoelaces or hanging shirts).

Nvidia

Of all the companies developing LWMs, Nvidia is the furthest along when it comes to releasing a product designed to support robotics and AV development. Last month, the company introduced its Cosmos world foundation models (WFMs). These models have been trained on millions of hours of driving and robotics video data, enabling the creation of high-quality, physics-aware videos from multimodal inputs. The models can also predict future world states, offering a useful tool for training and evaluating AI systems built to operate in complex environments.

Support for Robotics & AVs

Cosmos WFMs are designed to work with Nvidia’s Omniverse 3D development platform for integrating physical AI systems into existing software tools and simulation workflows, including for industrial and robotic use cases. Supplemental tools include Nvidia’s Isaac robot development platform, consisting of software libraries, application frameworks, and AI models. Isaac facilitates the development of AI robots, such as autonomous mobile robots, arms and manipulators, and humanoids. This includes simulation tools (for Omniverse) that allow developers to design, simulate, test, and train AI-based robots and autonomous machines in a physically based virtual environment, reducing the need for extensive hardware testing.

Odyssey

Odyssey introduced its Explorer generative LWM last December. It bills Explorer as an “image-to-world model” that can convert “any” 2D images into highly detailed 3D virtual worlds. Explorer’s ability to generate highly photorealistic virtual worlds, complete with live-action motion, stems from Odyssey’s use of Gaussian splatting, a radiance field rendering technique. Radiance field representations allow for highly detailed scene reconstruction, enabling generative modeling to more closely approximate photorealism. Odyssey is focused on enhancing Explorer’s capabilities to support 3D content creation and film production (live-action film), hyperrealistic game development, architectural design, and new forms of entertainment its technology could make possible.
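
For readers curious about the mechanics, here is a heavily simplified sketch of the core idea behind Gaussian splatting (not Odyssey’s actual implementation): a scene is represented as a set of Gaussian blobs, each with a position, spread, color, and opacity, and a pixel’s color is the front-to-back alpha composite of every splat covering it. The scene data, grayscale color model, and function names are illustrative assumptions.

```python
import math

def gaussian_weight(px, py, splat):
    """Falloff of a splat's influence at pixel (px, py)."""
    dx, dy = px - splat["x"], py - splat["y"]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * splat["sigma"] ** 2))

def render_pixel(px, py, splats):
    """Alpha-composite splats (assumed sorted near-to-far) at one pixel."""
    color, transmittance = 0.0, 1.0
    for s in splats:
        alpha = s["opacity"] * gaussian_weight(px, py, s)
        color += transmittance * alpha * s["color"]  # light reaching the eye
        transmittance *= 1.0 - alpha                 # light blocked so far
    return color

# Two overlapping splats; the nearer, brighter one dominates the pixel.
splats = [
    {"x": 0.0, "y": 0.0, "sigma": 1.0, "color": 1.0, "opacity": 0.8},
    {"x": 0.5, "y": 0.0, "sigma": 2.0, "color": 0.3, "opacity": 0.5},
]
print(render_pixel(0.0, 0.0, splats))
```

Real implementations work with millions of anisotropic 3D Gaussians projected onto the image plane and optimized from photographs, but this compositing loop is the essence of why the technique renders detailed scenes so quickly.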

Support for Robotics & AVs

To the best of my knowledge, Odyssey has not made any formal announcements regarding the use of its technology for robotics and AVs.

OpenAI

Although OpenAI is not at the forefront of companies developing LWMs (at least from what we currently know), its huge influence on AI development means we must consider possible future efforts. OpenAI is famous for its cutting-edge large language models, but specific details about any work on LWMs are not readily available. Still, given its ongoing research in GenAI and machine learning (ML), it’s certainly possible that the company is exploring LWMs. Additionally, OpenAI has extensive experience with multimodal GenAI models, notably its Sora text-to-video generator, which can create virtual environments.

Sora can accurately simulate some physical aspects of the real world. It is designed to understand and simulate the physical world in motion, which allows it to generate realistic and imaginative scenes from a user’s text instructions. But while Sora’s text-to-video model is impressive, it has limitations when it comes to simulating physics. For example, it can’t accurately model the physics of many basic interactions, like glass shattering.

Support for Robotics & AVs

Regarding robotics, OpenAI is actively pursuing development efforts. This is apparent from recent job postings seeking a senior research engineer (and other positions) to head a robotics group. The group’s mission, as stated, is to “expand the capabilities of foundational models to support general-purpose robotics in dynamic, [real-world environments], ensuring reliable and safe operation. These capabilities include … [action generation], [motion planning], [world modeling], and real-time communication through voice and emotions.” This certainly suggests that OpenAI seeks to develop models that can represent and understand the dynamics of the real world, capabilities necessary to support advanced robotics systems.

World Labs

World Labs was founded by AI pioneer Fei-Fei Li, a world-renowned expert in AI, computer vision, ML, and graphics. Li is famous for creating ImageNet, a massive image database (more than 14 million images annotated and categorized into more than 21,000 object categories) that has been instrumental in advancing computer vision and deep learning research.

World Labs is initially focusing on developing spatially intelligent LWMs that can understand and reason about the 3D world from images and other modalities. Like Decart’s Oasis, Google/DeepMind’s Genie 2, and Odyssey’s Explorer, World Labs’s technology generates highly interactive virtual worlds from a single input photograph. Where the company’s technology appears to stand out is in its support for highly functional end-user interactions via controllable cameras and adjustable depth-of-field effects from a Web browser. Some even speculate that this functionality could lead to fundamental changes in user interaction in general.

World Labs only officially launched last September. Yet it quickly reached a valuation of US $1 billion, after raising $230 million from venture capital firms Andreessen Horowitz, NEA (New Enterprise Associates), and Radical Ventures. World Labs’s technology is particularly attractive for automating content creation in applications and industries like gaming and movie making.

Support for Robotics & AVs

To the best of my knowledge, World Labs has not made any formal announcements specifically regarding robotics and AVs. The company plans to formally release its product sometime in 2025.

Conclusion

The companies discussed in this Advisor are at the forefront of developing LWMs. Some, like World Labs, are barely six months old. Others are well-established Big Tech players like Google/DeepMind and Nvidia. All are seeking to advance AI modeling capabilities that will radically transform applications and industries like entertainment, gaming, filmmaking, healthcare, virtual reality, and engineering, and facilitate the development of the next generation of sophisticated robots and advanced AVs. In short, these companies are striving to capitalize on the potential of LWMs to revolutionize how AI applications are trained and developed. That said, LWMs are still an emerging technology, and we can expect developments to move fast.

Finally, I’d like to get your opinion on the development of LWMs. What impact do you think they will have on AI application development, especially for robotics and AVs? As always, your comments will be held in strict confidence. You can email me at experts@cutter.com or call +1 510 356 7299 with your comments.

Note

1. FWMs and LWMs are related, but they emphasize different aspects of AI model development. FWMs are designed as comprehensive, pretrained models that serve as the basis for developing more specialized applications. They are intended to provide a solid foundation for understanding and generating interactive environments and can be fine-tuned for various tasks and applications. LWMs focus on generating detailed and expansive simulations of complex, interactive environments, emphasizing the scale and complexity of the environments they can create. These models are tailored to simulate vast and intricate scenarios that autonomous agents can interact with and are used for training and evaluating agents in diverse, realistic conditions. In short, while both FWMs and LWMs are geared toward generating interactive environments, FWMs emphasize their role as a base for further development and adaptation, whereas LWMs focus on the scale and complexity of the environments they generate.

About The Author
Curt Hall
Curt Hall is a Cutter Expert and a member of Arthur D. Little’s AMP open consulting network. He has extensive experience as an IT analyst covering technology and application development trends, markets, software, and services. Mr. Hall's expertise includes artificial intelligence (AI), machine learning (ML), intelligent process automation (IPA), natural language processing (NLP) and conversational computing, blockchain for business, and customer…