
I tried running NVIDIA Cosmos 3 on DGX Spark
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
I've been strongly feeling lately that NVIDIA is at the forefront of Physical AI (embodied AI). As the foundation supporting robot control and factory simulation, the importance of World Foundation Models has risen significantly.
I had the opportunity to run the next-generation "Cosmos 3" from the NVIDIA Cosmos series on a DGX Spark™, so I'd like to share what kind of model it is. Cosmos 3 handles everything from robot observation and predictive video generation to control command generation in a single model, so the parts that previously required connecting 2 to 3 pipeline stages are now absorbed into the world foundation model itself.
Cosmos 3 Architecture — MoT 2 Tower
In previous Cosmos series models, models were provided individually by use case, such as Cosmos Predict 2.5 for video generation and Cosmos Reason 2 as a VLM for video understanding. This configuration changed significantly with Cosmos 3. The core of Cosmos 3 is the Omni model, which has a structure where two towers — the Reasoner Tower and the Generator Tower — run in parallel within the same MoT (Mixture-of-Transformers) architecture.
The Reasoner Tower is a VLM responsible for understanding, that is, "reading and judging" video and text, while the Generator Tower is a diffusion expert responsible for generation, that is, "creating and moving" images, video, audio, and actions. The key point is that the two are not separate models placed side by side: they are connected via shared latent representations, so the generation-side tower can be conditioned directly on the intermediate representations produced by the understanding-side tower. Both towers are initialized from Qwen3-VL (8B for Nano, 32B for Super), a design intended to add generation capability while retaining language and visual understanding.
Cosmos 3 offers the full MoT Omni model in two sizes — Nano (15.17B) and Super (63.99B) — as well as lightweight versions that extract only the Reasoner Tower (Nano-Reasoner 8.77B / Super-Reasoner 33.36B). The Reasoner-only version is for understanding-oriented use cases, while Omni is for when generation is also needed. In this article, we focus on Omni for our validation.
Validation Environment on DGX Spark
Validation was performed on an NVIDIA DGX Spark (GB10 / ARM64 / 128 GB unified memory, CUDA 13.0, Ubuntu 24.04). The model used is Cosmos3-Nano (full Omni configuration, BF16 approximately 30 GB).
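As a rough sanity check on the "approximately 30 GB" figure, the weight-only footprint can be estimated from the parameter counts quoted above. This is a minimal sketch that assumes 2 bytes per parameter in BF16 and decimal gigabytes (1 GB = 1e9 bytes); the function name is made up for illustration, and the estimate ignores activations, the VAE, and framework overhead, so real usage will differ.

```python
# Parameter counts (in billions) as quoted in this article.
PARAMS_B = {
    "Nano (Omni)": 15.17,
    "Super (Omni)": 63.99,
    "Nano-Reasoner": 8.77,
    "Super-Reasoner": 33.36,
}

BYTES_PER_PARAM_BF16 = 2  # BF16 stores each weight in 16 bits


def bf16_weight_gb(params_billion: float) -> float:
    """Weight-only footprint in decimal GB (assumption: 1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM_BF16
    return total_bytes / 1e9


# Print the estimate for each variant; weights only, no runtime overhead.
for name, params in PARAMS_B.items():
    print(f"{name:15s} {params:6.2f}B params -> {bf16_weight_gb(params):7.2f} GB")
```

For Nano this gives about 30.3 GB, which lines up with the measured figure for weights alone.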
The official inference code sets up the environment with a single uv sync command, installing torch 2.10.0+cu130, natten (Blackwell wheel), lerobot, and more. The visual tokenizer for Cosmos 3 uses Alibaba's Wan 2.2 VAE, which is automatically fetched from Hugging Face on the first inference.
Running 4 Use Cases
Now for the main topic. I'll run the 4 modes of Cosmos 3's Omni model (text-to-image, text-to-video, image-to-video, and Policy Model) on DGX Spark. In every case, running a mode only requires specifying an official sample JSON.
Generating Commercial-Quality Robotics Scenes from Text
Text-to-image generates images of robotics scenes from long prompts. When given content such as "a modern laboratory with white walls and gray floor, a metal-finished robot arm mounted on a white workbench," the DGX Spark produced an image containing most of the elements described in the prompt. The measured results were 960×960 at 35 steps, 22 seconds after model loading, and approximately 30 GB of GPU memory. At 22 seconds per image, it is quite impressive that an open-source world foundation model can run on a single DGX Spark. Because it is trained on data from the Physical AI domain, it seems well suited for VSS, synthetic scene material for manufacturing, and data augmentation for PPE training.
Generating Grasping Motion Videos from Text
For text-to-video, I verified operation with the prompt "a gripper grabs a red cube and slowly lifts it." With the lightweight setting of 256p / 24 frames / 12 fps, the inference time was 22 seconds. In the generated video, the robot arm's structure was consistent over time, and the motion sequence of "descent → contact → grasp → lift" was arranged in a physically plausible order. The fact that the structure does not break down even at low resolution seems characteristic of a model trained in the Physical AI domain.
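To put the 22-second figure in context, the clip length follows directly from the frame count and frame rate. The sketch below uses only the numbers quoted above. The helper names are hypothetical, and the result is from a single run, not a benchmark.

```python
def clip_seconds(frames: int, fps: float) -> float:
    """Playback length of a clip in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps


def compute_seconds_per_video_second(inference_s: float, frames: int, fps: float) -> float:
    """How many seconds of inference were spent per second of generated video."""
    return inference_s / clip_seconds(frames, fps)


# Settings and timing quoted in this article for text-to-video.
duration = clip_seconds(frames=24, fps=12)
ratio = compute_seconds_per_video_second(inference_s=22, frames=24, fps=12)
print(f"clip length: {duration:.1f} s, inference per video second: {ratio:.1f} s")
```

With these settings the clip is 2 seconds long, so generation takes about 11 seconds of compute per second of video.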
Generating Physically Conservative Videos from Existing Images
Image-to-video generates video starting from a condition image. When given the prompt "the right arm slowly reaches over the central board and returns to its original position" with the official sample condition image (robot arms on both sides and a wooden board), the resulting video cleanly showed only the right arm moving while both arms were preserved. The inference time was 17 seconds, shorter than text-to-video, suggesting that image conditioning stabilizes diffusion convergence.
What I found personally interesting was how strongly the model respects the physical state of the condition image. The stance of "preserve what's in the image, don't generate what isn't" is clear. For use cases that predict "what would happen if this state were left unattended," such as surveillance footage, this tendency not to spontaneously generate non-existent objects seems reliable.
Simultaneously Generating Video and Control Commands with Policy Model
The centerpiece of this article is the Policy Model. This is the flagship mode of Cosmos 3, which simultaneously outputs predicted video and robot action sequences from observation video and natural language task instructions. Stages that previously had to be connected as separate pipelines ("observation," "planning," "generation," and "control") are now handled in a single inference.
I used the official sample as-is for validation. The observation video is from the Bridge dataset in LeRobot v3 format (WidowX kitchen robot), and the English prompt is "Put the pot to the left of the purple item." Running on DGX Spark, it output a 17-frame 640×480 predicted video and an action sequence of 16 steps × 10 dimensions in 21 seconds after model loading.

The prompt's instructions are properly reproduced in the video, with the pot being interpreted as the stainless bowl — the "portable container" in the scene — following a flow of grasping and moving it to the left.
Here's where it gets interesting. Alongside the video, the Policy Model outputs numbers describing "how to move the robot arm." It produces a numerical sequence covering the arm's movement over 16 steps (hand position, orientation, gripper open/close, and so on) at a level of precision that could be passed directly to the arm for execution. The sample includes a "reference movement" to compare against your own output, together with an official acceptance threshold (0.05), meaning "pass if the error is smaller than this," which makes pass/fail easy to determine.

The overall error was 0.013194, roughly one-quarter (about 26%) of the passing threshold of 0.05. 14 of the 16 steps matched the reference almost exactly, with slight deviation only at the moment the gripper opens or closes. Generating both a predicted video and passing-level movements in 21 seconds, from nothing more than an observation video and a natural language instruction, gives a tangible sense that it could serve as a practical foundation for "instructing arms with words" on small robots like Reachy Mini or SO-ARM101.
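The comparison against the reference movement can be sketched as a mean squared error over the 16 × 10 action array, checked against the 0.05 threshold. This is a minimal illustration, not the official scoring script: it assumes the error is a plain MSE over all elements, and the function names and synthetic data (including which dimension stands in for the gripper) are hypothetical.

```python
from typing import List, Sequence

THRESHOLD = 0.05  # acceptance threshold quoted in this article

Actions = Sequence[Sequence[float]]  # steps x dimensions


def per_step_mse(pred: Actions, ref: Actions) -> List[float]:
    """MSE of each step (row), averaged over that step's dimensions."""
    if len(pred) != len(ref) or len(pred) == 0:
        raise ValueError("pred and ref must have the same non-zero number of steps")
    errors = []
    for p_row, r_row in zip(pred, ref):
        if len(p_row) != len(r_row) or len(p_row) == 0:
            raise ValueError("each step must have the same non-zero number of dimensions")
        errors.append(sum((p - r) ** 2 for p, r in zip(p_row, r_row)) / len(p_row))
    return errors


def overall_mse(pred: Actions, ref: Actions) -> float:
    """MSE over all elements (equals the mean of per-step MSEs for equal-width rows)."""
    steps = per_step_mse(pred, ref)
    return sum(steps) / len(steps)


# Synthetic data: 16 steps x 10 dims, identical except for the last
# dimension (a stand-in for the gripper) at steps 7 and 8.
ref = [[0.0] * 10 for _ in range(16)]
pred = [row[:] for row in ref]
pred[7][9] = 0.5
pred[8][9] = 0.5

error = overall_mse(pred, ref)
deviating = [i for i, e in enumerate(per_step_mse(pred, ref)) if e > 1e-9]
print(f"MSE={error:.6f} pass={error < THRESHOLD} deviating_steps={deviating}")
```

Looking at the per-step values as well as the overall error makes it easy to see where a deviation is concentrated, as with the gripper open/close moment here.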
Differences from Previous Versions and Summary
Finally, let me summarize what has changed compared to the previous Cosmos series.
Previously, when building Physical AI applications, it was necessary to construct the observation and generation parts with separate models. This meant connecting two pipelines: Cosmos Reason 2 as a VLM for observation, and Cosmos Predict 2.5 diffusion for generation. Each consumed approximately 17–40 GB in BF16, so resource management required attention when co-hosting two models. With Cosmos 3, observation and generation are completed in a single inference. In this validation, the Policy Model output a predicted video and a 16-step × 10-dimension action sequence together in 21 seconds.
The range of practical applications also looks promising. For small robots like Reachy Mini or SO-ARM101, there's a tangible sense that the Policy Model's ability to generate "language → video + action" end-to-end could handle in a single model what previously required separately training GR00T or ACT. For factory footage, use cases become visible such as using anomalous events extracted by VSS as a starting point to visualize "what would happen if this state were left unattended" as video.
To summarize the key points:
- Roles that were previously separate — such as video generation and video understanding — are consolidated in Cosmos 3 into a 2-tower MoT of Reasoner + Generator
- Text-to-image / text-to-video / image-to-video produced practical-quality output in roughly 17 to 22 seconds (text-to-image used 35 steps)
- Image-to-video behaved conservatively, prioritizing the physical state of the condition image, making it potentially reliable for safety-critical applications
- The Policy Model simultaneously generated predicted video and action sequences from observation video and task instructions, clearing the official acceptance threshold (0.05) with an MSE of about 0.013
The biggest change I felt this time was that the concept of completing "observation → planning → generation → control" within a single world foundation model is now achievable at a size that fits on a single DGX Spark.
Cosmos 3 was officially released (GA) at Computex. I plan to continue exploring in separate articles topics such as Physical AI reasoning with Cosmos 3 Reasoner, detailed environment setup, and further in-depth validation of the Policy Model.
Reference Links
- NVIDIA Cosmos Platform Overview
- Wan 2.2 VAE (Wan-AI/Wan2.2-TI2V-5B)
- Qwen3-VL technical report (arXiv:2511.21631)
- LeRobot v3 dataset format
- Related Cosmos series articles by the author
- Related VSS series articles
