From Digital Twins to World Models
Screenshot from FlexSim, a popular 3D Discrete Event Simulation Software


The concept of “digital twins” traces back to NASA’s simulations for the Apollo missions in the 1960s, where physical replicas of the spacecraft were used to study and test solutions, effectively mirroring the physical world in a virtual environment. In the intervening sixty years, improvements in physics and graphics engines have increased the promise of the concept as a tool for future prediction and counterfactual reasoning, yet its use has remained niche for two reasons:

1. Difficulty in digitizing the initial state: It is incredibly complex to mirror a real environment in the virtual world. One needs to digitize the 3D geometry and physics models of equipment, tools, and humans, and then explicitly program plans and behaviors into the simulation, which takes months or even years.

2. Low-fidelity bridge from the real world: Sensors that connect the real world to the simulation have remained low resolution, e.g. capturing the activity at a manufacturing assembly station through torque drivers, light curtains, and RFID readers. These sensors cannot digitize real-world activities in full detail, including all the motions of humans, tools, parts, and automation, and if you cannot digitize it, you cannot apply computation to it. That is why, even after significant effort is spent digitizing a complex system such as a factory as it is being commissioned, the simulation is abandoned once the real-world system comes into being: it is easier to test alternative design choices in the real world than to keep adapting the simulation.

At the same time, even ordinary commercial LLMs and video generation models are able to answer detailed questions about the physical world in text and video form, because they have learned a latent model of the world. Researchers are working on approaches to utilize these latent world models [1] as simulators, so that an agent can use them to predict future states, simulate outcomes, and make decisions without interacting with the physical world.
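To make that idea concrete, here is a minimal sketch, in Python, of how an agent might use a learned latent world model as a simulator for planning: candidate action sequences are rolled out entirely inside the model and scored, and only the best first action would be sent to the real world. The class and method names (and the trivial stand-in dynamics) are illustrative assumptions, not any specific library's API.

```python
import numpy as np

class LatentWorldModel:
    """Hypothetical learned world model: encodes an observation into a latent
    state and predicts how that state evolves under candidate actions."""

    def encode(self, observation: np.ndarray) -> np.ndarray:
        # In a real system this would be a trained encoder network.
        return observation

    def predict_next(self, latent: np.ndarray, action: np.ndarray) -> np.ndarray:
        # Trained latent dynamics would go here; this is just a stand-in.
        return latent + action

    def score(self, latent: np.ndarray, goal: np.ndarray) -> float:
        # Task-specific value estimate, e.g. negative distance to a goal state.
        return -float(np.linalg.norm(latent - goal))


def plan(model: LatentWorldModel, observation, goal, horizon=5, n_candidates=64):
    """Random-shooting planner: roll out candidate action sequences inside the
    world model (never the real world) and return the best first action."""
    latent0 = model.encode(observation)
    best_score, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1, 1, size=(horizon, latent0.shape[0]))
        latent = latent0
        for a in actions:
            latent = model.predict_next(latent, a)  # simulated step, no real-world interaction
        s = model.score(latent, goal)
        if s > best_score:
            best_score, best_first_action = s, actions[0]
    return best_first_action

# Usage with toy states:
# action = plan(LatentWorldModel(), observation=np.zeros(3), goal=np.ones(3))
```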

I am seeing three ways of accessing these latent simulation capabilities for predictive and counterfactual modeling.

Vision-Language-Action (VLA) Models

π0, a state-of-the-art VLA model from Physical Intelligence

VLA models are trained on visual, textual, and action data to learn a joint representation that can reason across these modalities. One can ask such a model to perform an activity at a high level, e.g. tell a robot or a human avatar to “Make a sandwich” or “Clean a room”, either in simulation or in the real world. While the idea sounds neat, in practice it is not trivial to train VLAs or acquire the right kind of data, and they are 5+ years behind commercial LLMs. To make them useful today, we need two things:

1. Hierarchical models and clever training: With limited training datasets, VLA models can only learn to perform simple actions corresponding to individual steps of a complex activity. Approaches such as Hi Robot [2] use two VLMs: one breaks a high-level request down into a step-by-step plan, and the other actually executes the individual steps (see the sketch after this list). Naive VLAs also tend to ignore language instructions and decide the next action from the current visual state alone. Tricks such as dataset augmentation [3], where step names in the training set are rephrased by standard LLMs, and pre-training with next-token prediction followed by post-training with diffusion [4], help retain the VLM’s knowledge.

2. Acquire large amounts of training data: Well-funded research labs are beginning to capture data for specific domains, e.g. household tasks such as cleaning, making a bed, loading dishes, and setting tables, recorded by tele-operating a variety of robots in hundreds of homes. Within these limited, familiar environments and tasks, VLAs are beginning to show impressive generalization across robots in new, unseen homes [5]. However, they do not yet extend to more diverse environments such as factories, where the range of objects and motions can be significantly larger.
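As a concrete illustration of the hierarchical structure described in item 1, here is a minimal sketch of a two-level control loop: a high-level model turns a request like “Make a sandwich” into steps, and a low-level VLA policy executes each step from the current camera image. All objects here (`vlm`, `vla_policy`, `robot`) are hypothetical stand-ins for trained models and robot drivers, not any particular framework’s API.

```python
from typing import List

def high_level_plan(vlm, instruction: str, image) -> List[str]:
    """Ask a vision-language model to break a high-level request into steps.
    `vlm` is a placeholder for any chat-style VLM client that returns text."""
    prompt = (
        f"Instruction: {instruction}\n"
        "Break this into short, executable robot steps, one per line."
    )
    response = vlm.generate(prompt=prompt, image=image)
    return [line.strip() for line in response.splitlines() if line.strip()]

def execute(vla_policy, robot, steps: List[str], max_ticks_per_step=200):
    """Low-level loop: the VLA policy maps (current image, step text) to an
    action, which the robot applies until the step is judged done."""
    for step in steps:
        for _ in range(max_ticks_per_step):
            image = robot.get_camera_image()
            action = vla_policy.predict(image=image, instruction=step)
            robot.apply(action)
            if vla_policy.step_done(image=image, instruction=step):
                break

# Intended usage (all objects are stand-ins):
# steps = high_level_plan(vlm, "Make a sandwich", robot.get_camera_image())
# execute(vla_policy, robot, steps)
```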

Conditional Video Generation / Reasoning Models

Cosmos Reason generates natural language decisions and explanations using physical common sense and embodied reasoning from an input video.

Perhaps you’ve heard of, or even played with, video generation models such as Google Veo 3, OpenAI Sora, or Runway Gen-3 Alpha. These models create a plausible video from a text caption. Now projects such as Nvidia’s Cosmos [5] are using “conditional” video generation and textual reasoning as an alternative to physical simulation. The Nvidia Cosmos ecosystem includes three models:

(i) Cosmos Predict: One provides an image and a text caption, and the model generates the next 30 seconds of video. This can be seen as a simulation of the future or of a counterfactual: show it an image of a robotic gripper approaching a cup at an awkward grasping angle, and the generated video acts as an alternative to a physics-based simulation, showing whether the liquid in the cup will spill or not.

(ii) Cosmos Transfer: Steerable video generation, where one can use different levers to finely control the output, e.g. provide vehicle and road bounding boxes to generate video with the same road layout, or constrain a robot arm’s activity to generate video of that arm performing a specific task.

(iii) Cosmos Reason: Takes a video and a textual question as input and performs physical reasoning to answer future-prediction and counterfactual questions in text form. It is interesting that Nvidia maintains a very powerful digital twin platform, Omniverse, while simultaneously building the Cosmos world models. So far Omniverse is used extensively to generate training data for Cosmos, but I have not yet seen AI-backed simulation enhancements on the Omniverse side, which could be really powerful.
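To show the shape of these three interfaces side by side, here is a hypothetical client sketch. The class and method names below are assumptions for illustration, not Nvidia’s published Cosmos API; they only mirror the inputs and outputs described above.

```python
import numpy as np
from typing import List

class CosmosStyleClient:
    """Illustrative interface only; not the actual Cosmos SDK."""

    def predict(self, image: np.ndarray, caption: str) -> List[np.ndarray]:
        """Predict-style call: extrapolate plausible future video frames
        from a single image, conditioned on a text caption."""
        raise NotImplementedError("placeholder for a trained video world model")

    def transfer(self, caption: str, controls: dict) -> List[np.ndarray]:
        """Transfer-style call: generate video constrained by control inputs
        such as bounding boxes, depth maps, or segmentation layouts."""
        raise NotImplementedError("placeholder for a controllable generator")

    def reason(self, frames: List[np.ndarray], question: str) -> str:
        """Reason-style call: answer a prediction or counterfactual question
        about a video in natural language."""
        raise NotImplementedError("placeholder for a video reasoning model")

# Intended usage for the gripper-and-cup example above (pseudo-usage):
# client = CosmosStyleClient()
# future = client.predict(image=gripper_frame,
#                         caption="gripper closes on the cup at this angle")
# answer = client.reason(future, "Does the liquid in the cup spill?")
```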

Explicit Physical Grounding

DRAWER [6] transforms an ordinary smartphone video into an interactable Digital Twin utilizing semantic object detections and LLMs.

The two approaches above perform their “reasoning” inside neural network activations, and the individual steps of that reasoning are opaque to the user, except when natural-language reasoning traces are generated.

There are a number of approaches in computer vision and AI that take a real scene and reconstruct it in a 3D representation grounded in physics, so that traditional physics engines can be applied to simulate the future and counterfactuals. Traditional approaches such as camera-geometry or depth-camera reconstruction, as well as Gaussian splats and NeRFs, focus on mimicking the visual appearance of the scene but lack affordances, a gap that is now being filled by the appropriate application of AI. Researchers are going beyond 3D reconstruction toward understanding object articulations [6], parts [7], and material properties [8], benefiting from semantic computer vision models as well as LLMs. Some examples of this fascinating line of work include:

  • DRAWER [6] uses a dual geometric representation combining Gaussian splatting (for appearance) and neural SDFs (for geometry), identifies interactable and actionable objects using object detectors, and queries an LLM to reason about revolute vs. prismatic joints (see the sketch after this list).
  • PartCrafter [7] jointly reconstructs the individual objects and parts in a scene from a single image while maintaining their compositional relationships, and allows the user to prompt for a different number of parts.
  • In “Birth and Death of a Rose” [8], researchers put a physical simulator in the loop, essentially “de-rendering” a video to extract shape, material, form, light, camera, and motion, then “re-rendering” it and using the difference as a self-supervision signal for the neural network.
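To illustrate the DRAWER-style pattern of combining object detections with an LLM to make a reconstruction physically interactable, here is a hedged sketch. The prompt, the `llm.generate` call, and `physics_scene.add_joint` are hypothetical placeholders, not the actual DRAWER code or any specific engine’s API.

```python
import json

# Illustrative sketch only: the general pattern of attaching physically
# meaningful joints to a reconstructed scene using an LLM's judgment.

JOINT_PROMPT = """You are labeling articulated parts in a reconstructed 3D scene.
Part: {label}
Answer with JSON: {{"joint": "revolute" | "prismatic", "axis_hint": "<short text>"}}"""

def infer_joint(llm, part_label: str) -> dict:
    """Ask an LLM whether a detected part rotates (revolute, like a cabinet door)
    or slides (prismatic, like a drawer). `llm` is any text-generation client."""
    response = llm.generate(JOINT_PROMPT.format(label=part_label))
    return json.loads(response)

def build_articulated_scene(llm, physics_scene, detections):
    """Attach a joint to each detected interactable part so that a standard
    physics engine can simulate opening or closing it."""
    for det in detections:  # e.g. [{"label": "drawer", "mesh": ...}, ...]
        joint = infer_joint(llm, det["label"])
        physics_scene.add_joint(          # hypothetical physics-engine call
            body=det["mesh"],
            joint_type=joint["joint"],    # "revolute" or "prismatic"
            note=joint.get("axis_hint", ""),
        )
    return physics_scene
```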

Together, these lines of work point at a fusion layer where symbols and neural activations cooperate. Digital twins will no longer be brittle CAD scenes frozen at commissioning; they will be self-healing, data-driven world models that stay in sync with reality and answer what-if questions on demand. When that happens, simulation will no longer be an offline exercise.

[1] A Path Towards Autonomous Machine Intelligence, Yann LeCun, 2022

[2] Hi Robot: Open-ended instruction following with hierarchical vision-language-action models, Shi et al., 2025

[3] Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success, Kim et al., 2025

[4] π0: A Vision-Language-Action Flow Model for General Robot Control, Black et al., 2024

[5] Physical AI with World Foundation Models, Nvidia Cosmos

[6] DRAWER: Digital Reconstruction and Articulation With Environment Realism, Xia et al., 2025

[7] PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers, Lin et al., 2025

[8] Birth and Death of a Rose, Geng et al., 2025
