
I tried running NVIDIA Cosmos 3 on DGX Spark
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
Lately I have felt strongly that NVIDIA is at the forefront of Physical AI (AI with a physical embodiment). As the foundation for robot control and factory simulation, World Foundation Models are becoming increasingly important.
I had the opportunity to run "Cosmos 3," the next generation of the NVIDIA Cosmos series, on DGX Spark™, so I'd like to share what kind of model it is. Cosmos 3 handles robot observation, predictive video generation, and control command generation in a single model, so work that previously required chaining two or three pipeline stages is being absorbed into the world foundation model itself.
Cosmos 3 Architecture — MoT 2-Tower
In previous Cosmos releases, models were provided separately for each use case, such as Cosmos Predict 2.5 for video generation and Cosmos Reason 2 as a VLM for video understanding. Cosmos 3 changes this configuration significantly. Its core is the Omni model, in which two towers, the Reasoner Tower and the Generator Tower, run in parallel within a single MoT (Mixture-of-Transformers) architecture.
The Reasoner Tower is the understanding-focused VLM responsible for "reading and judging" video and text, while the Generator Tower is the generation-focused diffusion expert responsible for "creating and operating" images, video, audio, and actions. The key point is that the two are not separate models: they are connected through shared latent representations, so the Generator Tower can take the intermediate representations produced by the Reasoner Tower directly as conditioning. Text is decoded autoregressively by predicting the next token in sequence, while images, video, audio, and actions are generated through iterative denoising, so each modality uses the generation method best suited to it within a single framework.
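To make the contrast between the two generation styles concrete, here is a toy sketch in plain Python. It only illustrates the control flow of "append one token at a time" versus "refine a whole sample over several steps." The function names, the stub models, and the simple blending schedule are all assumptions made for illustration; this is not Cosmos 3's actual decoder or sampler.

```python
def decode_autoregressive(next_token, prompt, max_new_tokens, eos):
    """Autoregressive decoding: predict and append one token per iteration."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # depends on everything generated so far
        if token == eos:
            break
        tokens.append(token)
    return tokens


def generate_by_denoising(predict_clean, noise, steps):
    """Iterative denoising: refine the whole sample a little at every step."""
    sample = list(noise)
    for step in range(steps):
        estimate = predict_clean(sample)
        # Blend toward the current clean estimate; the weight grows each
        # step so that the final step lands on the estimate itself.
        weight = 1.0 / (steps - step)
        sample = [s + weight * (e - s) for s, e in zip(sample, estimate)]
    return sample


# Hypothetical stand-ins for the two towers' predictors.
def toy_next_token(tokens):
    return "<eos>" if len(tokens) >= 4 else f"tok{len(tokens)}"


def toy_predict_clean(sample):
    return [1.0, -1.0, 0.5]


text = decode_autoregressive(toy_next_token, ["move", "arm"], 8, "<eos>")
latent = generate_by_denoising(toy_predict_clean, [0.0, 0.0, 0.0], steps=4)
print(text)
print(latent)
```

The point of the sketch is that text grows in length one step at a time, while a continuous output such as a video latent or an action sequence keeps a fixed shape and is refined in place.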
Cosmos 3 provides this MoT-based Omni model in two sizes: Nano (15.17B) and Super (63.99B). At inference time you can also extract just the Reasoner Tower and run it as a VLM, so the same model can serve as a Reasoner for understanding-oriented tasks or as the full Omni model when generation is also needed. This article focuses on Nano for validation.
Validation Environment on DGX Spark
Validation was performed on NVIDIA DGX Spark (GB10 / ARM64 / 128 GB unified memory, CUDA 13.0, Ubuntu 24.04). The model is Cosmos3-Nano (full Omni configuration, approximately 30 GB in BF16).
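The "approximately 30 GB" figure can be checked with simple arithmetic, since BF16 stores each parameter in 2 bytes. The sketch below covers weights only; activations, the VAE, and other runtime buffers are not included, and the helper names are my own.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 uses 16 bits per parameter


def bf16_weight_gb(params_billion: float) -> float:
    """Weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM_BF16 / 1e9


def gb_to_gib(gb: float) -> float:
    """Convert decimal gigabytes to binary gibibytes (1 GiB = 2**30 bytes)."""
    return gb * 1e9 / 2**30


# Parameter counts as quoted in the article.
for name, params_billion in {"Nano": 15.17, "Super": 63.99}.items():
    gb = bf16_weight_gb(params_billion)
    print(f"{name}: {gb:.2f} GB ({gb_to_gib(gb):.2f} GiB) of BF16 weights")
```

Nano comes out at about 30.3 GB, which matches the measured figure. By the same arithmetic, Super's weights alone would be roughly 128 GB in BF16, which is about the whole of DGX Spark's unified memory.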
The official inference code is set up so that a single uv sync command prepares the environment, installing torch 2.10.0+cu130, natten (Blackwell wheel), lerobot, and the other required packages. Cosmos 3 uses Alibaba's Wan 2.2 VAE as its visual tokenizer, which is fetched automatically from Hugging Face on the first inference run.
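After uv sync finishes, a quick sanity check of the installed PyTorch build can save time before loading a 30 GB model. The following is a minimal sketch: the helper names are hypothetical, the expected values are simply the ones quoted above, and the torch import is guarded so the script also runs where torch is absent.

```python
def parse_torch_version(version: str):
    """Split a version string such as '2.10.0+cu130' into (release, local tag)."""
    release, _, local = version.partition("+")
    return release, (local or None)


def matches_expected(version: str, release: str = "2.10.0", cuda_tag: str = "cu130") -> bool:
    """True when the installed build matches the release and CUDA tag we expect."""
    return parse_torch_version(version) == (release, cuda_tag)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch version:", torch.__version__)
        print("matches 2.10.0+cu130:", matches_expected(str(torch.__version__)))
        print("CUDA available:", torch.cuda.is_available())
```

Run it inside the project environment (for example with uv run python) so that it checks the packages uv sync installed rather than the system Python.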
Running 4 Use Cases
Now for the main part. I ran four modes (text-to-image, text-to-video, image-to-video, and Policy Model) with Cosmos 3's Omni model on DGX Spark. Each run is configured simply by specifying one of the official sample JSON files.
Generating Commercial-Quality Robotics Scenes from Text
Text-to-image generates images of robotics scenes from long prompts. I gave it a prompt along the lines of "a modern laboratory with white walls and gray floors, a metal-finished robot arm mounted on a white workbench." At 960×960 and 35 steps, generation on DGX Spark took 22 seconds after model loading and used approximately 30 GB of GPU memory, producing an image that contained most of the elements described in the prompt. At 22 seconds per image, it is quite moving to see an open-source world foundation model running on a single DGX Spark. Because the model is built on training data from the Physical AI domain, it seems well suited to producing synthetic scene material for VSS and manufacturing, as well as data augmentation for PPE training.
Generating Video of Grasping Movements from Text
For text-to-video, I tested the prompt "a gripper grabs a red cube and slowly lifts it." With a light setting of 256p / 24 frames / 12 fps, inference took 22 seconds. In the generated video, the robot arm's structure remained consistent over time, and the action sequence "descend → contact → grasp → lift" unfolded in a physically valid order. That the structure does not break down even at a low resolution seems characteristic of a model trained on the Physical AI domain.
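To put these settings in perspective, the clip length and the ratio of generation time to playback time follow directly from the numbers above. A small sketch (the function names are my own):

```python
def clip_seconds(frames: int, fps: float) -> float:
    """Playback duration of a clip in seconds."""
    return frames / fps


def generation_to_playback_ratio(inference_seconds: float, frames: int, fps: float) -> float:
    """How many seconds of compute are spent per second of generated video."""
    return inference_seconds / clip_seconds(frames, fps)


duration = clip_seconds(frames=24, fps=12)
ratio = generation_to_playback_ratio(inference_seconds=22, frames=24, fps=12)
print(f"clip length: {duration:.1f} s, generation/playback ratio: {ratio:.1f}x")
```

In other words, 24 frames at 12 fps is a 2-second clip, and 22 seconds of inference is about 11 times the playback time at this setting.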
Generating Physically Conservative Video from Existing Images
Image-to-video generates video starting from a conditioning image. I gave the official sample's conditioning image (robot arms side by side with a wooden board) the prompt "the right arm slowly reaches over the board in the center and returns to its original position." The right arm moved as instructed, and both arms were preserved. Inference took 17 seconds, shorter than text-to-video, which may indicate that image conditioning stabilizes diffusion convergence.
What I personally found interesting is how strongly the model respects the physical state of the conditioning image. It clearly takes the stance of "preserve what is in the image, do not introduce what is not." This faithfulness in not generating non-existent objects seems useful for applications such as surveillance footage, where you want to predict "what would happen if this state were left unattended."
Simultaneously Generating Video and Control Commands with the Policy Model
The centerpiece of this article is the Policy Model. This is the flagship mode of Cosmos 3: from an observation video and a natural language task instruction, it simultaneously outputs a predicted video and a robot action sequence. What previously required connecting separate pipelines for "observation," "planning," "generation," and "control" is now completed in a single inference.
For validation, I used the official sample as-is. The observation video comes from the Bridge dataset in LeRobot v3 format (WidowX kitchen robot), and the prompt, in English, is "Put the pot to the left of the purple item." On DGX Spark, the model output a 17-frame 640×480 predicted video and a 16-step × 10-dimensional action sequence in 21 seconds after model loading.

The prompt instruction is properly reflected in the video: "pot" is interpreted as the stainless bowl in the scene (the "portable container" present there), and the video shows it being grasped and moved to the left.
Here's the highlight. Along with the video, the Policy Model outputs numerical values for "how to move the robot arm." It produces a numerical sequence covering 16 steps of arm movement (combining values for end-effector position, orientation, gripper open/close, and so on) at a level of precision that can be passed directly to the arm. The sample includes "reference movements" to compare against your own output, and an official pass/fail threshold (0.05) is defined, so it is unambiguous whether a result passes.

The overall error was 0.013194, roughly a quarter (about 26%) of the passing threshold of 0.05. Of the 16 steps, 14 matched the reference almost perfectly, with slight deviation only at the moments when the gripper opens and closes. Generating a predicted video and passing-level movement commands together in 21 seconds, from only an observation video and a natural language instruction, gives a hands-on sense that this could be practically viable as a foundation for "giving verbal instructions to an arm" on small robots such as Reachy Mini or SO-ARM101.
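The comparison against the reference can be sketched as follows. I am assuming here that the overall error is a mean squared error over all 16 × 10 action values, as the MSE wording in the summary suggests; the exact metric in the official sample may differ. The arrays below are synthetic stand-ins, not the actual outputs, and treating the last dimension as the gripper channel is also an assumption.

```python
import random

PASS_THRESHOLD = 0.05  # pass/fail threshold quoted in the article


def mse(pred, ref):
    """Mean squared error over every step and every action dimension."""
    squared = [(p - r) ** 2 for ps, rs in zip(pred, ref) for p, r in zip(ps, rs)]
    return sum(squared) / len(squared)


def per_step_mse(pred, ref):
    """One error value per time step, useful for locating where deviations occur."""
    return [sum((p - r) ** 2 for p, r in zip(ps, rs)) / len(rs)
            for ps, rs in zip(pred, ref)]


# Synthetic 16-step x 10-dimensional data (NOT the article's real actions).
rng = random.Random(0)
reference = [[rng.uniform(-1, 1) for _ in range(10)] for _ in range(16)]
predicted = [row[:] for row in reference]
predicted[7][9] += 0.5  # deviate at two steps in the assumed gripper channel
predicted[8][9] -= 0.5

overall = mse(predicted, reference)
deviating = [i for i, e in enumerate(per_step_mse(predicted, reference)) if e > 1e-12]
print(f"overall MSE: {overall:.6f}, passes: {overall < PASS_THRESHOLD}")
print("steps that deviate from the reference:", deviating)

# Ratio of the article's measured error to the threshold.
print(f"measured / threshold: {0.013194 / PASS_THRESHOLD:.3f}")
```

A per-step breakdown like this is what makes it possible to say that the deviation is concentrated in a few steps rather than spread evenly across the sequence.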
Differences from Before and Summary
Finally, let me organize what has changed compared to the previous Cosmos series.
Previously, when building Physical AI applications, the observation part and the generation part had to be built with separate models. The pipeline connected two models: Cosmos Reason 2 as a VLM for observation and Cosmos Predict 2.5 as a diffusion model for generation. Since each consumed approximately 17 to 40 GB in BF16, having the two coexist required careful resource management. With Cosmos 3, observation and generation are completed in a single inference. In this validation, the Policy Model output a predicted video and a 16-step × 10-dimensional action sequence together in 21 seconds.
The scope for on-site applications also seems to be expanding. For small robots like Reachy Mini or SO-ARM101, the Policy Model's ability to generate "language → video + action" end-to-end gives the impression that it can handle in a single model what previously required separately training GR00T or ACT. For factory footage, use cases like visualizing as video "what would happen if this state were left unattended" starting from anomaly events extracted with VSS are also coming into view.
To summarize, the flow looks like this:
- Roles that were previously individual, such as video generation and video understanding, are consolidated in Cosmos 3 into a 2-tower MoT of Reasoner + Generator
- text-to-image / text-to-video / image-to-video produced practical-quality output in roughly 17 to 22 seconds each (text-to-image at 960×960 with 35 steps took 22 seconds)
- image-to-video behaves conservatively by prioritizing the physical state of the conditioning image, which looks promising for uses such as predicting how a scene in surveillance footage will evolve
- The Policy Model simultaneously generates predicted video and action sequences from observation video and task instructions, clearing the official pass threshold of 0.05 with an MSE of 0.013
The biggest change I felt this time is that the concept of completing "observation → planning → generation → control" with a single world foundation model has become achievable at a size that runs on a single DGX Spark.
Cosmos 3 was officially released (GA) at Computex. I plan to continue diving deeper into topics like Cosmos 3 Reasoner's Physical AI inference, detailed environment setup, and further in-depth validation of the Policy Model in separate articles.
Reference Links
- NVIDIA Cosmos Platform Overview
- Wan 2.2 VAE (Wan-AI/Wan2.2-TI2V-5B)
- Qwen3-VL technical report (arXiv:2511.21631)
- LeRobot v3 dataset format
- Related Cosmos series by the author
- Related VSS series
