I tried running NVIDIA Cosmos 3 on DGX Spark

2026.06.01

Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

Lately I have felt strongly that NVIDIA is at the forefront of Physical AI (embodied AI). World Foundation Models have become significantly more important as the foundation for robot control and factory simulation.

I had the opportunity to run the next-generation "Cosmos 3" from the NVIDIA Cosmos series on a DGX Spark™, so I'd like to share what kind of model it is. Cosmos 3 handles robot observation, predictive video generation, and control command generation in a single model. Work that previously required connecting two or three pipeline stages is now absorbed into the world foundation model itself.

https://blogs.nvidia.com/blog/cosmos-3-physical-ai-open-world-foundation-model/

Cosmos 3 Architecture — MoT 2 Tower

In the previous Cosmos series, separate models were provided for each use case, such as Cosmos Predict 2.5 for video generation and Cosmos Reason 2 as a VLM for video understanding. Cosmos 3 changes this configuration significantly. Its core is the Omni model, in which two towers, the Reasoner Tower and the Generator Tower, run in parallel within a single MoT (Mixture-of-Transformers) architecture.

The Reasoner Tower is a VLM responsible for understanding, that is, "reading and judging" video and text. The Generator Tower is a diffusion expert responsible for generation, that is, "creating and moving" images, video, audio, and actions. The key point is that the two are not separate models placed side by side: they are connected via shared latent representations, so the Generator Tower can be conditioned directly on the intermediate representations produced by the Reasoner Tower. Both towers are initialized from Qwen3-VL (8B for Nano, 32B for Super), a design intended to add generation capability while retaining language and visual understanding.

Cosmos 3 offers the full MoT Omni model in two sizes — Nano (15.17B) and Super (63.99B) — as well as lightweight versions that extract only the Reasoner Tower (Nano-Reasoner 8.77B / Super-Reasoner 33.36B). The Reasoner-only version is for understanding-oriented use cases, while Omni is for when generation is also needed. In this article, we focus on Omni for our validation.

Validation Environment on DGX Spark

Validation was performed on an NVIDIA DGX Spark (GB10 / ARM64 / 128 GB unified memory, CUDA 13.0, Ubuntu 24.04). The model used is Cosmos3-Nano (full Omni configuration, BF16 approximately 30 GB).
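The "approximately 30 GB" figure is consistent with a simple estimate of 2 bytes per parameter in BF16. The following is a minimal sketch of that arithmetic using the parameter counts quoted above. The function and variable names are my own, and the estimate covers weights only, not activations, the VAE, or other runtime buffers.

```python
# Back-of-the-envelope weight footprint: parameters x bytes per parameter.
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each weight in 16 bits
UNIFIED_MEMORY_GB = 128   # DGX Spark unified memory, as stated in this article

# Parameter counts in billions, as quoted in this article.
PARAMS_BILLIONS = {
    "Nano (Omni)": 15.17,
    "Super (Omni)": 63.99,
    "Nano-Reasoner": 8.77,
    "Super-Reasoner": 33.36,
}


def bf16_weight_gb(params_billions: float) -> float:
    """Return the weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM_BF16 / 1e9


for name, params in PARAMS_BILLIONS.items():
    size = bf16_weight_gb(params)
    share = size / UNIFIED_MEMORY_GB
    print(f"{name}: ~{size:.1f} GB of weights ({share:.0%} of {UNIFIED_MEMORY_GB} GB)")
```

By this estimate Nano (Omni) comes to about 30.3 GB, while the BF16 weights of Super (Omni) alone would come to roughly 128 GB.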

The official inference code sets up the environment with a single uv sync command, installing torch 2.10.0+cu130, natten (Blackwell wheel), lerobot, and more. The visual tokenizer for Cosmos 3 uses Alibaba's Wan 2.2 VAE, which is automatically fetched from Hugging Face on the first inference.

Running 4 Use Cases

Now for the main topic. I'll run the 4 modes of Cosmos 3's Omni model (text-to-image, text-to-video, image-to-video, and Policy Model) on DGX Spark. In every case, running a mode only requires specifying one of the official sample JSON files.

Generating Commercial-Quality Robotics Scenes from Text

Text-to-image generates images of robotics scenes from long prompts. Given a prompt such as "a modern laboratory with white walls and gray floor, a metal-finished robot arm mounted on a white workbench," the DGX Spark produced an image containing most of the elements described. The measured results were 960×960 at 35 steps, 22 seconds after model loading, with approximately 30 GB of GPU memory in use. It is quite moving that an open-source world foundation model can produce an image in 22 seconds on a single DGX Spark. Because it is trained on data from the Physical AI domain, it seems well-suited for VSS, synthetic scene material for manufacturing, and data augmentation for PPE training.

Generating Grasping Motion Videos from Text

For text-to-video, I verified operation with the prompt "a gripper grabs a red cube and slowly lifts it." With the lightweight setting of 256p / 24 frames / 12 fps, the inference time was 22 seconds. In the generated video, the robot arm's structure was consistent over time, and the motion sequence of "descent → contact → grasp → lift" was arranged in a physically plausible order. The fact that it doesn't break down structurally even at low resolution is a behavior characteristic of a model trained in the Physical AI domain.
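To put the lightweight setting in perspective, the clip length and the ratio of inference time to clip length follow directly from the numbers above. This is a small sketch of that arithmetic; the helper names are hypothetical and not part of the Cosmos 3 tooling.

```python
def clip_seconds(num_frames: int, fps: float) -> float:
    """Playback length of a clip in seconds."""
    return num_frames / fps


def slowdown_vs_realtime(inference_seconds: float, num_frames: int, fps: float) -> float:
    """How many seconds of compute were spent per second of generated video."""
    return inference_seconds / clip_seconds(num_frames, fps)


# Values measured in this article: 24 frames at 12 fps, 22 seconds of inference.
duration = clip_seconds(24, 12)
factor = slowdown_vs_realtime(22, 24, 12)
print(f"clip length: {duration:.1f} s, compute per second of video: {factor:.1f} s")
```

In other words, the 22 seconds of inference produced a 2-second clip, about 11 seconds of compute per second of video at this setting.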

Generating Physically Conservative Videos from Existing Images

Image-to-video generates video starting from a condition image. I gave it the prompt "the right arm slowly reaches over the central board and returns to its original position" together with the official sample condition image (robot arms on both sides and a wooden board). In the resulting video, only the right arm moved, and both arms kept their appearance from the condition image. The inference time was 17 seconds, shorter than the 22 seconds for text-to-video, which may indicate that image conditioning stabilizes diffusion convergence.

What I found personally interesting was the strong respect for the physical state of the condition image. The stance of "preserve what's in the image, don't generate what isn't" is clear, so for use cases predicting "what would happen if this state were left unattended" — like with surveillance footage — the faithfulness of not spontaneously generating non-existent objects seems reliable.

Simultaneously Generating Video and Control Commands with Policy Model

The centerpiece of this article is the Policy Model. This is the flagship mode of Cosmos 3: from an observation video and a natural language task instruction, it simultaneously outputs a predicted video and a robot action sequence. "Observation," "planning," "generation," and "control," which previously had to be connected as separate pipelines, are now completed in a single inference.

I used the official sample as-is for validation. The observation video is from the Bridge dataset in LeRobot v3 format (WidowX kitchen robot), and the English prompt is "Put the pot to the left of the purple item." On DGX Spark, the model output a 640×480, 17-frame predicted video and a 16-step × 10-dimension action sequence in 21 seconds after model loading.
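When passing this output to downstream code, it can help to check the shapes first. The following is a minimal sketch under my own assumptions: the `PolicyOutput` container, the `validate` helper, and the (frames, height, width, channels) ordering are hypothetical and do not reflect the actual return type of the official inference code. Only the sizes (17 frames, 640×480, 16 steps, 10 dimensions) come from the run above.

```python
from dataclasses import dataclass
from typing import List, Tuple

EXPECTED_FRAMES = 17
EXPECTED_HEIGHT, EXPECTED_WIDTH = 480, 640
EXPECTED_STEPS, EXPECTED_ACTION_DIM = 16, 10


@dataclass
class PolicyOutput:
    """Hypothetical container for one Policy Model result."""
    video_shape: Tuple[int, int, int, int]  # assumed order: frames, height, width, channels
    actions: List[List[float]]              # one row per step, one column per action dimension


def validate(output: PolicyOutput) -> None:
    """Raise ValueError if the output does not match the sizes reported in this article."""
    frames, height, width, _channels = output.video_shape
    if (frames, height, width) != (EXPECTED_FRAMES, EXPECTED_HEIGHT, EXPECTED_WIDTH):
        raise ValueError(f"unexpected video shape: {output.video_shape}")
    if len(output.actions) != EXPECTED_STEPS:
        raise ValueError(f"expected {EXPECTED_STEPS} steps, got {len(output.actions)}")
    for index, step in enumerate(output.actions):
        if len(step) != EXPECTED_ACTION_DIM:
            raise ValueError(f"step {index} has {len(step)} dimensions")


# Dummy data with the reported sizes (all-zero actions, for illustration only).
dummy = PolicyOutput(
    video_shape=(17, 480, 640, 3),
    actions=[[0.0] * EXPECTED_ACTION_DIM for _ in range(EXPECTED_STEPS)],
)
validate(dummy)
print(sum(len(step) for step in dummy.actions), "action values in total")
```

A 16 × 10 action sequence is only 160 numbers, which is tiny compared with the predicted video it accompanies.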

Figure: Policy Model output. Frame 0 of the condition video (top left) shows a stainless bowl on the left and a purple object in the center. Generated Frame 0 (top right) matches the condition. Frame 8 (bottom left) shows the robot arm approaching the bowl with the gripper grasping. Frame 16 (bottom right) shows the bowl being lifted and moved toward the left of the purple object.

The video follows the prompt's instruction. The model interpreted "the pot" as the stainless bowl, the one portable container in the scene, then grasped it and moved it to the left.

Here's where it gets interesting. Together with the video, the Policy Model outputs numbers describing "how to move the robot arm." It produces a numerical sequence covering the arm's movement over 16 steps (including hand position, orientation, gripper open/close, etc.) at a level of precision that could be passed directly to the arm for execution. The sample includes a "reference movement" for comparison with your own output. The official sample also defines an acceptance threshold of 0.05 ("pass if the error is smaller than this"), so pass or fail is easy to determine.

Figure: Bar chart of per-step MSE for the Cosmos 3 Policy Model over 16 steps. Steps 0-5 and 8-15 are blue, with MSE below 0.01. Only steps 6-7 are red, with MSE just under 0.1. A red dotted line shows the official threshold of 0.05, and a green dotted line shows the overall average MSE of 0.0132.

The overall error was 0.013194, roughly one-quarter (about 26%) of the acceptance threshold of 0.05. Fourteen of the 16 steps nearly matched the reference exactly, and the deviation was concentrated at the moment of gripper open/close. Being able to generate a predicted video and passing-level movements together in 21 seconds, from just an observation video and a natural language instruction, gives a tangible sense that this could serve as a practical foundation for "instructing arms with words" on small robots like Reachy Mini or SO-ARM101.
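The comparison against the reference movement can be sketched as a per-step mean squared error followed by a threshold check. This is a minimal illustration, not the official evaluation script: the function names are my own, the data is synthetic, and I am assuming that the overall error is the average of the per-step errors.

```python
from statistics import fmean
from typing import List, Sequence

ACCEPTANCE_THRESHOLD = 0.05  # pass/fail threshold quoted in this article


def per_step_mse(predicted: Sequence[Sequence[float]],
                 reference: Sequence[Sequence[float]]) -> List[float]:
    """Mean squared error over the action dimensions, one value per step."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same number of steps")
    errors = []
    for pred_step, ref_step in zip(predicted, reference):
        if len(pred_step) != len(ref_step):
            raise ValueError("each step must have the same number of dimensions")
        errors.append(fmean([(p - r) ** 2 for p, r in zip(pred_step, ref_step)]))
    return errors


def overall_mse(step_errors: Sequence[float]) -> float:
    """Average of the per-step errors (equal to the MSE over all values
    when every step has the same number of dimensions)."""
    return fmean(step_errors)


# Synthetic data for illustration only: 16 steps x 10 dimensions,
# identical to the reference except for an offset at steps 6 and 7.
reference = [[0.0] * 10 for _ in range(16)]
predicted = [[0.3 if step in (6, 7) else 0.0] * 10 for step in range(16)]

step_errors = per_step_mse(predicted, reference)
total = overall_mse(step_errors)
print([round(e, 4) for e in step_errors])
print(f"overall MSE = {total:.5f}, pass = {total < ACCEPTANCE_THRESHOLD}")
```

The synthetic offsets only mimic the shape of the chart (two steps standing out, the rest near zero) and are not the measured values. The point is that a couple of steps can exceed 0.05 individually while the overall average still passes.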

Differences from Previous Versions and Summary

Finally, let me summarize what has changed compared to the previous Cosmos series.

Previously, building a Physical AI application meant using separate models for observation and generation, connecting two pipelines: Cosmos Reason 2 as a VLM for observation and Cosmos Predict 2.5 as a diffusion model for generation. Each consumed approximately 17–40 GB in BF16, so co-hosting the two required careful resource management. With Cosmos 3, observation and generation are completed in a single inference. In this validation, the Policy Model output a predicted video and a 16-step × 10-dimension action sequence together in 21 seconds.

The range of practical applications also looks promising. For small robots like Reachy Mini or SO-ARM101, the Policy Model's end-to-end "language → video + action" generation suggests that a single model could handle what previously required separately training GR00T or ACT. For factory footage, one visible use case is taking anomalous events extracted by VSS as a starting point and visualizing "what would happen if this state were left unattended" as video.

To summarize the key points:

  • Roles that were previously separate — such as video generation and video understanding — are consolidated in Cosmos 3 into a 2-tower MoT of Reasoner + Generator
  • Text-to-image, text-to-video, and image-to-video each produced practical-quality output in roughly 17 to 22 seconds after model loading (text-to-image: 960×960 at 35 steps in 22 seconds)
  • Image-to-video behaved conservatively, prioritizing the physical state of the condition image, making it potentially reliable for safety-critical applications
  • The Policy Model simultaneously generated predicted video and action sequences from observation video and task instructions, clearing the official acceptance threshold of 0.05 with an overall MSE of 0.013

The biggest change I felt this time was that the concept of completing "observation → planning → generation → control" within a single world foundation model is now achievable at a size that fits on a single DGX Spark.

Cosmos 3 was officially released (GA) at Computex. I plan to continue exploring in separate articles topics such as Physical AI reasoning with Cosmos 3 Reasoner, detailed environment setup, and further in-depth validation of the Policy Model.

