Applied AI Thinking for Operators

Open World Models, Physical AI, and the Road to AGI

Reflections from the frontier of AI that understands the physical world

Tags: World Models · Physical AI · Robotics · AGI · AI Research

I recently had the privilege of working alongside the research team at Genmo (fun fact: their CTO, Ajay Jain, is a co-author on the foundational denoising diffusion probabilistic models paper that underpins nearly every modern image and video generative model in production today). What follows is my attempt to consolidate what I learned, correct some of my own earlier misconceptions, and share a point of view on where this frontier is heading. The views expressed here are my own.

The Remaining Frontier

The AI "Cambrian explosion" is a phrase that has been used so often it risks losing its force, but it remains the most apt metaphor for what has happened since 2022. It started with language: GPT-3, then GPT-4, then a cascade of frontier LLMs from Anthropic, Google, and a growing roster of open-source challengers. Then came image generation (DALL-E, Midjourney, Stable Diffusion), video (Sora, Runway, Kling), speech and music (ElevenLabs, Suno). Each modality followed a roughly similar arc: an initial breakthrough, rapid scaling, and then commoditisation.

The remaining frontier is the physical modality, and it is the hardest. Not because any single aspect of it is more technically complex than language or vision in isolation, but because it combines all of the other modalities and adds something fundamentally new: interaction with the real, physical world. It requires pairing software with physical embodiment, which makes it simultaneously a hardware challenge, a software challenge, and a systems-integration challenge.

This is the domain of open world models and physical AI. These terms are often used interchangeably, but they are meaningfully different, and understanding the distinction is essential to understanding where the field is going.

What Is an Open World Model?

An open world model (OWM) takes in a current state and an action, and predicts the next state. In the video-generation context, states are frames. The model answers the question: what will the world look like if I do X? The "open" qualifier means it generalises across diverse physical environments rather than being constrained to a single narrow domain.

Think of an OWM as an AI's internal simulation of how the world works. It learns cause and effect. It learns that dropped objects fall, that rigid objects don't deform on contact, that a cup of water tips when pushed past its centre of gravity. It attempts to achieve spatial and temporal consistency within its simulated environment, much as a human being develops an intuitive physics engine through years of interacting with the real world.
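To make the state-action-state loop concrete, here is a toy, hand-coded "world model" for a single scenario (a ball under gravity); every name in it is illustrative. The point is what an OWM replaces: instead of authoring this physics per scenario, the model learns the (state, action) → next-state mapping from data, across arbitrary scenes.

```python
from dataclasses import dataclass

@dataclass
class State:
    """Toy world state: a ball's height and vertical velocity."""
    height: float
    velocity: float

def world_model_step(state: State, thrust: float,
                     dt: float = 0.1, g: float = 9.81) -> State:
    """Hand-coded next-state prediction for one narrow scenario. An OWM's
    job is to *learn* this mapping from data, for any scene, rather than
    having it authored rule by rule."""
    velocity = state.velocity + (thrust - g) * dt  # thrust = upward action
    height = max(0.0, state.height + velocity * dt)
    return State(height, velocity)

# A dropped ball (thrust = 0) falls to the ground: exactly the kind of
# regularity an OWM must internalise from video rather than from equations.
s = State(height=1.0, velocity=0.0)
for _ in range(5):
    s = world_model_step(s, thrust=0.0)
```

A real OWM operates on frames rather than a two-number state, but the contract is the same: current state plus action in, predicted next state out.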

OWMs are often framed as a precursor to artificial general intelligence (AGI). The reasoning is straightforward: a system that can accurately simulate the physical world, predicting cause and effect and maintaining coherence over time, possesses something that looks like understanding, not just pattern-matching. Nearly every frontier AI lab lists some version of "build a model that understands the world" as its north-star ambition.

What can OWMs actually do today?

Beyond their role as the "brain" of a future robot, open world models have immediate, practical applications. They can create digital twins: synthetic replicas of real-world environments used to run simulations. A manufacturing plant trying to optimise its production line could simulate thousands of plant reconfigurations without shutting down a single assembly station. They can serve as training environments for robotic policies, testing how a robot would behave in scenarios that are expensive, dangerous, or simply impossible to recreate in real life (deep-sea navigation, space robotics, bomb disposal). They are finding early traction in video gaming, architecture, engineering, and construction. In general, they are most valuable wherever obtaining real-life training data or recreating a real-life scenario is either very hard or impossible.

The space is attracting serious capital. World Labs, co-founded by Fei-Fei Li, raised $1 billion in February 2026 to advance spatial intelligence. NVIDIA's DreamDojo, released in February 2026, is an open-source robot world model trained on 44,711 hours of real-world human video, with simulated success rates showing a Pearson correlation of 0.995 with real-world results. And AMI Labs, Yann LeCun's new venture, raised $1.03 billion to build world models grounded in his Joint Embedding Predictive Architecture (JEPA), a fundamentally different architectural bet that I'll return to later.

What Is Physical AI?

Physical AI is a physical embodiment (usually a robot) powered by an AI model. The AI acts as the brain; the robot is the body. The AI model does not have to be an open world model. In fact, most current physical AI systems are powered by Vision-Language-Action models (VLAs), not OWMs. This is an important distinction that is frequently missed.

[Figure: Open World Models vs. Robot Foundation Models (VLAs)]
· Open World Model (OWM) — input: current state + action; output: predicted next state. Answers: "What will happen if I do X?"
· Robot Foundation Model (VLA) — input: images, joint states, language; output: motor torques, joint angles. Answers: "What should I do next?"
· The dependency chain: OWMs generate physically plausible, action-conditioned training data that makes VLAs more generalisable. NVIDIA (DreamDojo → GR00T) and DeepMind (Veo → Gemini Robotics) both run this loop internally.
The two key building blocks of Physical AI are often conflated but answer fundamentally different questions. World models and VLAs sit in a dependency chain; one generates the training data the other consumes.

A VLA takes in observations (images, joint states, language instructions) and outputs actions: motor torques, joint angles, gripper commands. The "foundation" part means it is pretrained across diverse robots and tasks so it can generalise. It answers a different question from the OWM: not "what will the world look like?" but "what should I do next?"
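The contrast between the two models reduces to two type signatures. This is an illustrative sketch (the Protocol names and method signatures are mine, not any lab's API):

```python
from typing import Protocol, Sequence, runtime_checkable

Frame = Sequence[float]    # stand-in for an image
Action = Sequence[float]   # stand-in for motor commands

@runtime_checkable
class OpenWorldModel(Protocol):
    def predict(self, state: Frame, action: Action) -> Frame:
        """Answers: what will the world look like if I do X?"""

@runtime_checkable
class VLAPolicy(Protocol):
    def act(self, image: Frame, joint_states: Sequence[float],
            instruction: str) -> Action:
        """Answers: what should I do next?"""

class DummyPolicy:
    """Trivial VLA stand-in: holds every joint at its current position."""
    def act(self, image, joint_states, instruction):
        return [0.0] * len(joint_states)

# Structural typing: DummyPolicy satisfies the VLA contract, not the OWM one.
assert isinstance(DummyPolicy(), VLAPolicy)
assert not isinstance(DummyPolicy(), OpenWorldModel)
```

One maps (state, action) forward in time; the other maps observations to commands. Keeping the two signatures separate is what makes the dependency chain between them legible.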

The relationship between the two is a training data dependency: world models generate the action-conditioned, physically plausible video data that makes robot foundation models more generalisable. Companies like Physical Intelligence (whose π0 model was trained on 7 robotic platforms and 68 unique tasks), Sunday (building household humanoid robots), Google DeepMind (Gemini Robotics), and Figure AI are all pushing the VLA frontier.

Why Are They So Often Conflated?

Both OWMs and physical AI are associated with AGI. And AGI, in the popular imagination, looks like The Terminator or Ex Machina or Steven Spielberg's A.I. (a physically embodied intelligence). This is the "robot apocalypse" version of AGI, and it naturally fuses the software (world model) and the hardware (robot) into a single concept.

But AGI does not have to be physical. Think of Rehoboam in Westworld (a vast predictive model that simulates human civilisation without having a body) or "The Entity" in Mission: Impossible: Dead Reckoning (an intelligence that exists entirely in digital infrastructure). These are depictions of AGI that are purely cognitive, with no physical embodiment at all.

I don't think there is a settled definition of AGI yet. But most working definitions would converge on an AI system that is: (a) capable of continuous learning, updating itself from new experience rather than being frozen after training; (b) able to perceive and model the world, with an understanding of cause and effect and an awareness of space and time; and (c) able to generalise to genuinely novel situations it was never trained on.

Open world models are seen as the "brain" of an AGI, or at least a precursor to it, because they attempt to satisfy condition (b): predicting cause and effect and maintaining spatial and temporal consistency. They should, in theory, also contribute to (c), because a sufficiently rich world model should be able to simulate novel scenarios that go beyond its training data.

A distinction worth making

Continuous learning and generalisation are related but separable problems. Generalisation is the ability to perform well on novel scenarios at inference time; LLMs already demonstrate this impressively. Continuous learning (also called lifelong or online learning) is the ability to update the model's weights from new experience without catastrophically forgetting prior knowledge. Existing OWMs attempt the former; the latter remains largely unsolved. The recent discussion around recurrent LLM architectures points toward models that can improve over time, not just respond to novel input, and that is a different, harder problem.

Why Can't We Just Use the LLM Playbook?

An astute observer might ask: current LLMs already generalise to novel inputs. That's precisely why they're so widely used. Hasn't this problem already been solved? Why can't we simply adopt the same underlying architecture for world models?

The answer lies in the data primitive. LLMs are built on the primitive of language, specifically word embeddings derived from tokenised text. There is an emergent property that arises from using language as a data primitive: text inherently encodes compositional semantic structure (syntax, grammar, hierarchical meaning) in a way that other data primitives (pixels, joint angles, force vectors) do not.

When an LLM responds coherently to a prompt it has never seen before, it appears to "understand" because the training objective (next-token prediction) and the data primitive (language tokens) were co-designed: each makes sense only because of the other. The transformer architecture leverages this structure to build representations that are hierarchically organised, and this is what enables generalisation.
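As a deliberately crude illustration of how much structure the next-token objective extracts from text alone, consider a bigram predictor; a transformer is this idea scaled by many orders of magnitude, but the objective is the same:

```python
from collections import Counter, defaultdict

corpus = "the ball fell off the table . the cup fell off the shelf .".split()

# The simplest possible next-token predictor: bigram counts. Even this
# crude objective pulls sequential structure out of text for free.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent continuation seen in training."""
    return counts[token].most_common(1)[0][0]
```

Even bigram counts recover a fragment of grammar: `predict_next("fell")` returns `"off"` purely from co-occurrence statistics. This only works because the data primitive (tokens) carries that structure to begin with; the next two paragraphs are about why pixels do not.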

This does not map straightforwardly to the physical world. Pixels don't inherently encode cause and effect. A sequence of image frames showing a ball rolling off a table and falling contains the visual evidence of gravity, but the pixel representation alone doesn't distinguish between "the ball fell because it was pushed" and "the ball fell because the table was tilted" unless the model has learned a causal representation, and that's much harder to learn from pixels alone than syntactic structure is to learn from text.

The trillion-dollar question: What is the data primitive that encodes physical semantics most accurately, in the same way that text tokens encode grammatical semantics? Whoever answers this will have done for robotics what the transformer and next-token prediction did for language.

The State of Physical AI: VLAs and Their Limits

Most current physical AI systems are powered by VLAs, not by open world models. Understanding how VLAs work and where they break is essential context for understanding why the field believes OWMs are necessary for the next leap.

VLAs are trained primarily via imitation learning. There are two main sources of training data. The first is human task mimicry: video recordings of humans carrying out tasks, which the model learns to reproduce. The second is teleoperation: human operators physically control robotic embodiments (via exoskeletons, haptic gloves, or remote joysticks), carrying out specific tasks, and the recordings become training data.

The training data typically involves three primitives: the video recording itself (pixels), language annotations (annotators labelling each video segment with the task or action being performed), and action data (coordinates of end-effectors, joint angles, gripper states). Different VLA architectures train on different combinations: at minimum pixels and actions; at best, all three.
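A single training sample therefore bundles all three primitives. A minimal sketch, with illustrative field names (no production schema looks exactly like this):

```python
from dataclasses import dataclass

@dataclass
class DemoSegment:
    """One annotated segment of a teleoperation recording.
    Field names are illustrative, not any lab's training schema."""
    frames: list            # pixels: raw video frames for the segment
    instruction: str        # language: annotator label for the task
    joint_angles: list      # action: per-timestep joint configuration
    gripper_open: list      # action: gripper state per timestep

seg = DemoSegment(
    frames=[[0.1, 0.2], [0.1, 0.3]],          # two tiny dummy "frames"
    instruction="pick up the egg from the blue cup",
    joint_angles=[[0.0, 1.2], [0.1, 1.1]],
    gripper_open=[True, False],
)
```

Architectures that drop a field train on the remaining primitives; the text of the article argues that keeping all three is what buys cross-embodiment transfer.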

From Gemini Robotics 1.5 (Google DeepMind, 2025): The model is trained on "thousands of hours of real-world expert robot demonstrations containing thousands of diverse tasks, covering scenarios with varied manipulation skills, objects, task difficulties, episode horizons, and dexterity requirements" across ALOHA, Bi-arm Franka, and Apollo humanoid platforms. Its Motion Transfer mechanism enables zero-shot cross-embodiment skill transfer: tasks trained on one robot transfer directly to a different robot without retraining. Over 90% of evaluation episodes during development were conducted in simulation. → Full paper

VLAs have made real progress on generalisation. Gemini Robotics 1.5 achieves a progress score of 0.83 on in-distribution tasks and demonstrates meaningful instruction, action, and visual generalisation. Its "Thinking VLA" architecture enables the robot to reason in natural language before acting, dramatically improving multi-step task performance.

But even with these advances, VLAs face fundamental constraints that stem from the nature of their training data.

The brittleness of teleoperated data

Teleoperated datasets are controlled by embodiment type, task, objects, and environment. Consider a seemingly simple task: pick up an egg from a blue cup and place it on a green plate, on a flat tabletop in bright lighting. A policy trained on this specific scenario might fail if you change any single variable: switch to warm yellow lighting, replace the green plate with a red bowl, move the cup to a different position on the table. That's how brittle current policies can be.

This brittleness creates a logistical impossibility: you cannot create teleoperated data for every single scenario, action, object, and environment combination in the world. Even with a million operators working around the clock, the combinatorial space of the real world dwarfs any dataset you could collect. The data also needs to be created separately for each embodiment type (quadrupeds, bipeds/humanoids, dual-arm and single-arm platforms), each with their own kinematic configurations, grippers, and actuators.

Training on all three data primitives (pixels, language, actions) helps. When the model can semantically link pixel observations to action sequences to language commands, it gains an embodiment-agnostic anchor: the word "grasp" maps to a class of contact-and-lift action sequences across different objects, robots, and visual contexts. This enables better cross-embodiment transfer. But even with all three primitives and high-quality annotations, the model's generalisation remains bounded by the coverage of the training corpus.

This is why even companies that have worked on robotics for years (Amazon being the canonical example) have only deployed robots in extremely controlled settings: windowless warehouses with uniform lighting, guided floor beams, and rigidly constrained workflows. And it's why I remain sceptical of consumer robotics companies making bold claims about what their robots can do in unstructured home environments.

Why OWMs Are the Key to Physical AI's Next Frontier

Given these constraints, open world models offer two critical capabilities that the current VLA paradigm lacks.

1. Simulated testing environments

A robot policy is a set of actions a robot takes given a particular set of observations. A policy rollout is when that policy is tested in an environment. These rollouts can be carried out in a physical environment (say, a manufacturing site), but this is often costly, difficult to reproduce, or simply impossible. Consider policy rollouts for robots doing deep-sea navigation, space exploration, or bomb disposal. You cannot test-fail-iterate in these environments the way you can in a warehouse.
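A rollout is the same loop regardless of whether the environment is the real world, a hand-coded simulator, or a learned world model, which is exactly why a sufficiently faithful OWM can be slotted in. A minimal sketch with toy stand-ins (the names and the 1-D "environment" are mine, for illustration only):

```python
def rollout(policy, environment_step, obs, horizon=50):
    """Roll out a policy: repeatedly query it for an action and advance
    the environment. `environment_step` can be real hardware, a physics
    simulator, or a learned world model; the loop does not change."""
    trajectory = []
    for _ in range(horizon):
        action = policy(obs)              # "what should I do next?"
        obs = environment_step(obs, action)  # "what happens if I do that?"
        trajectory.append((obs, action))
    return trajectory

# Toy stand-ins: a 1-D state that the policy nudges towards zero.
traj = rollout(policy=lambda o: -0.1 * o,
               environment_step=lambda o, a: o + a,
               obs=1.0,
               horizon=10)
```

Swapping `environment_step` from hardware to a model is the entire value proposition of simulated testing: the policy under test is untouched.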

This is where OWMs become essential: they provide a simulated digital environment in which to test robotic policies before deploying them in the real world. The Gemini Robotics 1.5 team reports that "over 90% of the evaluation episodes during development were conducted in simulation," using the open-source MuJoCo physics simulator. This dramatically reduces the volume of tests on real hardware, allowing much faster iteration.

2. Simulated training environments

The bigger prize is using world models not just for testing but for training. If your world model can accurately simulate real-world cause and effect, a robot policy can be trained inside this simulated environment by generating arbitrary situations, far more diverse than any teleoperation dataset could provide.

Important caveat

This is a big "if." No world model in production today is accurate enough to fully substitute for real-world training. If you train a policy via reinforcement learning inside an imperfectly simulated environment, the policy will exploit the model's errors, learning behaviours that look optimal in simulation but fail catastrophically in reality. In the real world, physical laws ground the reward signal in a way that an imperfect simulation cannot. This failure mode, called model exploitation, is the central challenge of model-based reinforcement learning.
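Model exploitation can be shown in one dimension. In this toy example (the numbers and functions are entirely illustrative), the learned reward matches the truth almost everywhere, yet a policy optimised against it lands squarely on the model's one error:

```python
def true_reward(x: float) -> float:
    """Ground-truth objective: the best outcome is at x = 0."""
    return -x * x

def learned_reward(x: float) -> float:
    """An imperfect world model: accurate almost everywhere, but with a
    spurious high-reward bump where its physics are wrong."""
    bump = 30.0 if abs(x - 5.0) <= 0.1 else 0.0
    return -x * x + bump

# Optimising against the learned model homes in on the model's error,
# not on the true optimum...
candidates = [i / 100 for i in range(-1000, 1001)]
x_sim = max(candidates, key=learned_reward)

# ...so the "optimal" behaviour looks great in simulation, terrible in reality.
gap = true_reward(0.0) - true_reward(x_sim)
```

The optimiser settles near x ≈ 4.9, inside the model's error region, rather than at the true optimum x = 0. Scale the search space up to high-dimensional robot behaviour and this is exactly the failure mode model-based RL has to guard against.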

World Models vs. Physical Simulators vs. Video Generators

This brings me to an important three-way distinction that is frequently collapsed in popular discussion.

[Figure: Three Approaches to Simulating the Physical World]
· Physical simulators (MuJoCo, Isaac Sim, Revit, FEM tools) — hard-code physical rules. Strengths: reliable, accurate, physically correct. Weaknesses: brittle and not generalisable; cover a small subset of the world; expensive to author.
· Video generators (Sora, Runway, Kling, Seedance, Wan) — learn visual dynamics from data. Strengths: hyper-realistic visuals; diverse and generalisable. Weaknesses: not physically plausible; no action grounding; cannot condition on actions.
· Open world models (DreamDojo, Cosmos, Genmo, AMI Labs) — learn physics from data plus actions. Strengths: physically plausible, action-conditioned, generalisable. Weaknesses: physics still imperfect; an active research frontier.
· Key insight: physical plausibility is a local property (frame t+1 follows from frame t). Long-horizon temporal consistency (frame t+100 is coherent with all prior state) is a separate, harder problem.
Three distinct approaches to simulating physical environments, each with different trade-offs. Current video generators excel at visual realism but fail at physical plausibility in ways that matter for robot training.

Physical simulators (MuJoCo, NVIDIA Isaac Sim, various structural engineering and FEM tools) have been around for decades. They hard-code physical rules, which makes them reliable and accurate within their domain. Their weakness is the flip side of their strength: they can only simulate scenarios for which rules have been explicitly authored, making them brittle and constrained to a small subset of the physical world.

Video generation models (Sora, Runway, Kling, ByteDance's Seedance) can produce hyper-realistic video, but video realism does not equate to physical plausibility. Generated video content can look indistinguishable from reality and yet depict physically impossible events, including objects that float without support, items that pass through solid surfaces, or manipulations that occur without visible contact between the robot and the object.

From RBench (ByteDance Seed & Peking University, 2026): The benchmark evaluated 25 representative video generation models on physical robot video generation. Even the highest-scoring model (Wan2.6) achieved only a 0.607 average score across all metrics. Specific failure modes include floating/penetration (robot parts not grounded or interpenetrating solid objects), spontaneous emergence (objects appearing or disappearing without causal motion), and non-contact attachment (objects moving with the robot without visible contact). The benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations. → Full paper

These failure modes matter enormously for robotics. A human viewer tolerates minor physics violations if the video is visually smooth. A robot policy trained on that video will internalise the wrong physical priors and fail in deployment.

A useful distinction: physical plausibility is a local property (frame t+1 follows correctly from frame t), while long-horizon temporal consistency is a global property (frame t+100 is coherent with all prior state). These require separate architectural solutions, and current models struggle with both, albeit to different degrees.

The Next Frontiers

I believe frontier research in this space will be tackling two interconnected bottlenecks.

1. Scaling physically plausible data

The RBench paper's contribution extends beyond evaluation. Its companion dataset, RoVid-X, is the largest open-source robotic video dataset for video generation: 4 million annotated video clips covering 1,300+ skills, enriched with optical flow, depth maps, and physical property annotations. This is the kind of infrastructure investment that preceded the LLM scaling era. First you need the data, then you need the benchmarks to measure whether your models are learning the right things from it.

Companies are attacking this from multiple angles. NVIDIA with DreamDojo trains world models on massive corpora of real-world human video (44,000+ hours) and uses those models as "data factories," generating action-conditioned synthetic trajectories that downstream robot policies train on. A homegrown company worth watching is Bifrost, which generates synthetic training datasets with pixel-perfect semantic labels: each pixel in a scene is annotated with what it represents (a specific object class, surface type, or entity). This approach addresses the perception layer of physical AI, giving models richer structured signal about what is in the world, not just what it looks like. Worth noting that Bifrost's strongest production use cases currently skew toward geospatial and aerial imagery, so how far this pixel-semantic approach transfers to contact-rich robotic manipulation is an open empirical question.

2. Finding the right data primitive for physical semantics

This is the deeper, harder question. The token-prediction paradigm worked for language not just because of the transformer architecture, but because the training objective (next-token prediction) and the data primitive (language tokens) were co-designed. What is the equivalent for the physical world?

I believe the most promising current answer is the pixel-text-action triplet. Each training example contains three views of the same physical interaction at three levels of abstraction:

[Figure: The Pixel-Text-Action Triplet]
· Pixels — raw perceptual signal; rich but unstructured. [frame_t ... frame_t+n]
· Language — semantic interpretation; hierarchical structure. "pick up red cube, place on shelf"
· Actions — causal intervention; causally grounded. [joint_angles, gripper_force]
· Together: visually grounded + causally grounded + semantically structured.
The pixel-text-action triplet is the strongest current answer to encoding physical semantics. Each primitive provides a different layer of constraint, forcing the model to learn representations that are coherent across all three simultaneously.

Text is the critical addition. It provides the compositional semantic structure that pixel prediction and action sequences alone cannot supply. The word "grasp" labels not just what happened but the intent, the object, the relational structure, and the task context. It plays for physical interaction the same role that grammatical structure plays in language: it provides the scaffold that makes generalisation possible.

LeCun's JEPA bet

This is where Yann LeCun's AMI Labs makes its contrarian wager. LeCun has argued that pixel-level video prediction is the wrong inductive bias for physical understanding. His reasoning: reconstruction-based objectives (including most video world models) still have a pixel-level loss somewhere in the training loop. This forces representations to encode visually detailed but causally irrelevant features (exact textures, lighting gradients, surface reflections) because the model needs to reconstruct them to minimise loss. This is wasted representational capacity that could be devoted to learning causal structure.

JEPA (Joint Embedding Predictive Architecture) removes the decoder entirely. Instead of predicting pixels, it predicts representations, forcing the model to learn features organised purely around what is predictively useful, which should be the causally relevant features of the physical world. AMI Labs raised $1.03 billion at a $3.5 billion valuation in March 2026, betting that this architectural choice will prove decisive.
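The difference between the two objectives can be sketched in a few lines. This is a caricature, not JEPA's actual architecture: the "encoder" here just averages, standing in for a learned network that keeps predictively useful structure and discards texture.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def encode(obs):
    """Stand-in encoder: keeps 'structure' (the mean) and discards
    per-element 'texture' noise. Real JEPA encoders are learned."""
    return [sum(obs) / len(obs)]

def predict_embedding(z_context):
    """Stand-in predictor from context embedding to target embedding."""
    return z_context

frame_t  = [1.00, 1.10, 0.90, 1.00]   # structure (mean 1.0) + texture noise
frame_t1 = [1.00, 0.95, 1.05, 1.00]   # same structure, fresh texture noise

pixel_loss = mse(frame_t, frame_t1)   # reconstruction-style objective
jepa_loss = mse(predict_embedding(encode(frame_t)), encode(frame_t1))
# The embedding-space loss is ~0: the causally irrelevant texture noise
# that dominates the pixel loss never enters the objective at all.
```

The reconstruction objective spends its capacity fighting noise that carries no causal information; the embedding-space objective never sees it. That, in miniature, is LeCun's argument.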

The counter-bet, held by much of the industry including DreamDojo (NVIDIA), Genmo, and DeepMind, is the connectionist position: that with sufficient scale and the right conditioning signals (action-conditioning, inverse dynamics models, physics-conditioned generation), physical understanding will emerge in the representations, the same way grammatical structure emerged in LLM embeddings without being explicitly designed.

Where I'm Placing My Attention

If I had to identify the developments most worth watching over the next 12 to 18 months, here is where I'd focus.

First, benchmarking that measures physical plausibility, not just visual realism. The RBench framework is exactly the kind of infrastructure the field needs. The next step is running the research that connects the dots: training robotic policies on video from models with different RBench scores, and measuring whether higher physical plausibility scores actually translate into better real-world policy performance. That study, when it comes, will be one of the most consequential papers in the field.

Second, latent-space training. Most current pipelines follow a sequential workflow: train a world model, use it to generate synthetic video, train a policy on that video via behavioural cloning. Dreamer 4 argues this is the wrong order of operations. The agent should train directly inside the world model's latent representation space, never generating pixels during the training loop. This is more compute-efficient, produces denser learning signals, and avoids information loss from pixel generation. It is worth noting that this is related to but distinct from LeCun's JEPA argument. Both advocate for learning in abstract representation space rather than pixel space. But they differ in mechanism: JEPA removes the reconstruction objective from the training signal entirely and applies to self-supervised learning broadly, while Dreamer 4 retains a reconstruction component in the world model training but moves the policy learning loop into latent space via RL. They are complementary ideas pointing in the same direction, not the same proposal. Dreamer 4 has so far been demonstrated at small scale (Minecraft and early robotics experiments), but if it scales to real-world manipulation, it collapses the current pipeline significantly.
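A Dreamer-style imagination rollout never leaves latent space: the policy acts on latent states and the dynamics model predicts the next latent directly. A sketch with illustrative stand-ins (my names, not the Dreamer 4 API):

```python
def imagine(policy, latent_dynamics, z0, horizon=15):
    """Imagination rollout entirely in the world model's latent space.
    No pixels are decoded anywhere inside the loop; the policy's RL
    update trains on these latent trajectories."""
    z, imagined = z0, []
    for _ in range(horizon):
        a = policy(z)                  # act on the latent state directly
        z = latent_dynamics(z, a)      # predict the next latent, never pixels
        imagined.append((z, a))
    return imagined

# Toy stand-ins: a 2-D latent whose first coordinate the policy damps.
rollouts = imagine(policy=lambda z: -0.5 * z[0],
                   latent_dynamics=lambda z, a: (z[0] + a, z[1]),
                   z0=(2.0, 1.0))
```

Compare this with the sequential pipeline described above: generating synthetic video and behaviour-cloning on it puts a lossy pixel round-trip between the world model and the policy, which the latent loop simply skips.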

Third, cross-embodiment transfer. Gemini Robotics 1.5's zero-shot skill transfer across ALOHA, Bi-arm Franka, and Apollo humanoid robots is the strongest benchmark-validated evidence that a single model can learn representations useful across very different physical embodiments. NVIDIA's DreamDojo also demonstrates multi-robot capability, producing action-conditioned rollouts across four distinct platforms (GR-1, G1, AgiBot, YAM), and explicitly designed its latent action representation for cross-embodiment generalisation. The honest caveat is that DreamDojo's reported headline result (0.995 Pearson correlation with real-world outcomes) speaks to sim-to-real fidelity rather than cross-embodiment transfer benchmarks specifically, so I would not draw a direct equivalence with Gemini Robotics 1.5's zero-shot transfer numbers. But the directional signal from both labs is the same: the field is converging on shared representations that span embodiments, and if that capability continues to mature, it fundamentally changes the economics of robotics. You no longer need to collect separate training data for every robot form factor.


The frontier of AI is no longer about making models that can talk, or even models that can see. It is about making models that understand how the physical world works: the ability to predict what happens when you push, pull, lift, pour, assemble, and navigate. That understanding is what bridges the gap between a chatbot and a system that can actually do things in the real world.

We are not there yet. The best video models still score 0.6 on physical plausibility. The best VLAs still fail when the lighting changes. The field's foundational debate (learn from pixels or learn from abstract representations?) remains unresolved. But the pace of progress is extraordinary, and the capital and talent flowing into this space suggest that the next 18 months will be more consequential than the last 18.

The Cambrian explosion isn't over. It's just reaching the hardest part.

About the author: I'm Linus, a Singaporean Product Manager currently based in San Francisco. I write about AI systems and what they actually mean for people building with them. The views in this piece are my own.

Email: seah.linus@gmail.com
GitHub: linusseah

References:
· Denoising Diffusion Probabilistic Models — Ho, Jain, Abbeel (2020)
· Gemini Robotics 1.5 — Google DeepMind (2025)
· RBench: Rethinking Video Generation for the Embodied World — Peking University & ByteDance Seed (2026)
· NVIDIA DreamDojo — Open-Source Robot World Model (2026)
· π0: A Vision-Language-Action Flow Model — Physical Intelligence (2024)
· AMI Labs — Yann LeCun's World Model Startup (TechCrunch, 2026)
· Dreamer 4 — Danijar Hafner (2025)