I tried running NVIDIA Cosmos 3 on DGX Spark

2026.06.01

Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

Lately I've felt strongly that NVIDIA is at the forefront of Physical AI (AI with a physical body). World Foundation Models are becoming increasingly important as the foundation for robot control and factory simulation.

I had the opportunity to run "Cosmos 3," the next generation of the NVIDIA Cosmos series, on DGX Spark™, so I'd like to share what kind of model it is. Cosmos 3 handles robot observation, predictive video generation, and control command generation in a single model, so work that previously required chaining two or three pipeline stages is now absorbed into the world foundation model itself.

https://blogs.nvidia.com/blog/cosmos-3-physical-ai-open-world-foundation-model/

https://dev.classmethod.jp/articles/dgx-spark-cosmos3-family-usecase-map/

Cosmos 3 Architecture — MoT 2-Tower

In the previous Cosmos series, a separate model was provided for each use case, such as Cosmos Predict 2.5 for video generation and Cosmos Reason 2 as a VLM for video understanding. Cosmos 3 changes this configuration significantly. Its core is the Omni model, in which two towers, the Reasoner Tower and the Generator Tower, run in parallel within a single MoT (Mixture-of-Transformers) architecture.

The Reasoner Tower is the understanding-focused VLM responsible for "reading and judging" video and text, while the Generator Tower is the generation-focused diffusion expert responsible for "creating and operating" images, video, audio, and actions. The key point is that the two are not separate models: they are connected through shared latent representations, so the Generator Tower can directly receive the intermediate representations derived by the Reasoner Tower as conditions. Text is decoded autoregressively by predicting the next token in sequence, while images, video, audio, and actions are generated through iterative denoising. This design lets each modality use the generation method that suits it best within a single framework.

Cosmos 3 provides this MoT-based Omni model in two sizes: Nano (15.17B) and Super (63.99B). At inference time you can also extract just the Reasoner Tower and run it as a VLM, so the same model can serve as a Reasoner for understanding-oriented tasks or as the full Omni model when generation is also needed. This article uses Nano for most of the validation.

Validation Environment on DGX Spark

Validation was performed on an NVIDIA DGX Spark (GB10, ARM64, 128 GB unified memory, CUDA 13.0, Ubuntu 24.04). The model is Cosmos3-Nano in the full Omni configuration, approximately 30 GB in BF16.
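A quick sanity check on the roughly 30 GB figure: BF16 stores each weight in 2 bytes, so the weight footprint is about the parameter count times two. The sketch below applies that arithmetic to the two published sizes. It counts weights only, and the helper name weight_footprint_gb is made up for this illustration.

```python
# Back-of-the-envelope weight footprint: parameters x bytes per parameter.
# Weights only: activations, the VAE, and framework overhead are NOT included.
BYTES_PER_PARAM_BF16 = 2   # bfloat16 stores each weight in 16 bits
UNIFIED_MEMORY_GB = 128    # DGX Spark unified memory, from the article


def weight_footprint_gb(params_billion: float,
                        bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Return the approximate weight size in GB (10^9 bytes)."""
    return params_billion * 1e9 * bytes_per_param / 1e9


nano_gb = weight_footprint_gb(15.17)    # Nano parameter count from the article
super_gb = weight_footprint_gb(63.99)   # Super parameter count from the article

for name, gb in (("Nano", nano_gb), ("Super", super_gb)):
    share = gb / UNIFIED_MEMORY_GB
    print(f"{name}: ~{gb:.1f} GB of BF16 weights ({share:.0%} of unified memory)")
```

Nano works out to about 30.3 GB, which lines up with the measured figure. By the same arithmetic, the Super weights alone come to about 128 GB in BF16, roughly the entire unified memory of one DGX Spark.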

The official inference code is set up so that a single uv sync command prepares the environment, installing torch 2.10.0+cu130, natten (Blackwell wheel), lerobot, and the other required packages. Cosmos 3 uses the Wan 2.2 VAE from Alibaba as its visual tokenizer, and it is fetched automatically from Hugging Face on the first inference.

Running 4 Use Cases

This is where the main content begins. I'll run four modes (text-to-image, text-to-video, image-to-video, and Policy Model) with Cosmos 3's Omni model on DGX Spark. Each run is configured simply by specifying one of the official sample JSON files.

Generating Commercial-Quality Robotics Scenes from Text

Text-to-image generates images of robotics scenes from long prompts. I gave it a prompt along the lines of "a modern laboratory with white walls and gray floors, a metal-finished robot arm mounted on a white workbench." Measured on DGX Spark at 960×960 and 35 steps, generation took 22 seconds after model loading and used approximately 30 GB of GPU memory, and the image contained most of the elements described in the prompt. It is quite moving that an open-source world foundation model can run on a single DGX Spark at 22 seconds per image. Because the model is trained on data from the Physical AI domain, it seems well suited to producing synthetic scene material for VSS and manufacturing, as well as data augmentation for PPE training.

Generating Video of Grasping Movements from Text

For text-to-video, I tested with the prompt "a gripper grabs a red cube and slowly lifts it." With a light setting of 256p / 24 frames / 12 fps, inference took 22 seconds. In the generated video, the robot arm's structure stayed consistent over time, and the action sequence "descend → contact → grasp → lift" played out in a physically valid order. The structure holding together even at low resolution seems characteristic of a model trained on the Physical AI domain.
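To put the 22 seconds in perspective, it helps to convert the settings into clip length and per-frame cost. The sketch below only does that arithmetic on the numbers quoted above; the helper clip_stats is a name invented for this illustration.

```python
def clip_stats(frames: int, fps: float, inference_s: float) -> dict:
    """Derive clip length and generation cost from frame count, fps, and wall time."""
    if frames <= 0 or fps <= 0:
        raise ValueError("frames and fps must be positive")
    duration_s = frames / fps
    return {
        "duration_s": duration_s,                          # length of the generated clip
        "seconds_per_frame": inference_s / frames,         # generation cost per frame
        "slowdown_vs_realtime": inference_s / duration_s,  # 1.0 would be real time
    }


# Text-to-video settings from the article: 24 frames at 12 fps, 22 s of inference.
t2v = clip_stats(frames=24, fps=12, inference_s=22)
for key, value in t2v.items():
    print(f"{key}: {value:.2f}")
```

In other words, the run produces a 2-second clip at a little under one second of compute per frame, about 11 times slower than real time at this light setting.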

Generating Physically Conservative Video from Existing Images

Image-to-video generates video starting from a conditioning image. I gave the official sample's conditioning image (two robot arms side by side with a wooden board) the prompt "the right arm slowly reaches over the board in the center and returns to its original position," and the right arm moved exactly as instructed while both arms were preserved. Inference took 17 seconds, shorter than text-to-video, which may suggest that image conditioning stabilizes diffusion convergence.

What I personally found interesting is how strongly the model respects the physical state of the conditioning image. It clearly takes the stance of "preserve what's in the image, don't introduce what's not in the image." This faithfulness, never generating objects that don't exist, seems reliable for applications such as surveillance footage, where you want to predict "what would happen if this state were left unattended."

Simultaneously Generating Video and Control Commands with the Policy Model

The centerpiece of this article is the Policy Model. This is the flagship mode of Cosmos 3: from an observation video and a natural language task instruction, it simultaneously outputs a predicted video and a robot action sequence. What previously required connecting separate pipelines for "observation," "planning," "generation," and "control" is now completed in a single inference.

For validation, I used the official sample as-is. The observation video comes from the Bridge dataset in LeRobot v3 format (WidowX kitchen robot), and the prompt is the English sentence "Put the pot to the left of the purple item." On DGX Spark, the model output a 17-frame predicted video at 640×480 and a 16-step × 10-dimensional action sequence in 21 seconds after model loading.

Policy Model output. Frame 0 of the conditional video (top left) shows a stainless bowl on the left and a purple accessory in the center. Generated Frame 0 (top right) matches the condition. Frame 8 (bottom left) shows the robot arm approaching the bowl with gripper grasping. Frame 16 (bottom right) shows the bowl being lifted and moving to the left of the purple item.

The video follows the prompt faithfully: "pot" is interpreted as the stainless bowl, the portable container in the scene, and the arm grasps it and moves it to the left of the purple item.

Here's the highlight. Together with the video, the Policy Model outputs numerical values for "how to move the robot arm." The result is a numerical sequence covering 16 steps of arm movement (combining values for end-effector position, orientation, gripper open/close, etc.), output at a precision that can be passed directly to the arm. The sample includes "reference movements" to compare against your own output, and an official pass/fail threshold (0.05) is defined, so it is clear whether a result passes or fails.
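To make the 16 × 10 output more concrete, here is a minimal sketch of how such an action chunk could be validated and split into named parts. The 3 + 6 + 1 split (position, orientation, gripper) is purely a guess for illustration: the text above only lists the kinds of values involved, and the names LAYOUT, ActionStep, and split_chunk are all hypothetical. Check the official sample for the real layout before sending anything to hardware.

```python
from dataclasses import dataclass

STEPS, ACTION_DIM = 16, 10  # chunk size reported in the article

# HYPOTHETICAL layout, an illustrative guess and not the documented format.
LAYOUT = {"position": 3, "orientation": 6, "gripper": 1}


@dataclass
class ActionStep:
    position: list      # end-effector position values (assumed 3)
    orientation: list   # orientation values (assumed 6)
    gripper: float      # gripper open/close value (assumed 1)


def split_chunk(chunk):
    """Check the chunk is STEPS x ACTION_DIM, then split each row by LAYOUT."""
    if sum(LAYOUT.values()) != ACTION_DIM:
        raise ValueError("LAYOUT does not add up to ACTION_DIM")
    if len(chunk) != STEPS or any(len(row) != ACTION_DIM for row in chunk):
        raise ValueError(f"expected a {STEPS} x {ACTION_DIM} action chunk")
    p = LAYOUT["position"]
    o = p + LAYOUT["orientation"]
    return [ActionStep(list(row[:p]), list(row[p:o]), float(row[o]))
            for row in chunk]


# Synthetic placeholder data (NOT model output): step i is filled with the value i.
chunk = [[float(step)] * ACTION_DIM for step in range(STEPS)]
steps = split_chunk(chunk)
print(len(steps), steps[3])
```

Rejecting a chunk with the wrong shape up front is a cheap guard to have in place before any values reach an actual arm.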

Bar graph of per-step MSE for the Cosmos 3 Policy Model over 16 steps. Steps 0-5 and 8-15 have MSE below 0.01 shown in blue; only steps 6-7 have MSE around 0.1 shown in red. A red dotted line shows the official threshold of 0.05, and a green dotted line shows the overall average MSE of 0.0132.

The overall error was 0.013194, roughly a quarter (about 26%) of the passing threshold of 0.05. Of the 16 steps, 14 matched the reference almost perfectly, with deviation only at the moment when the gripper opens and closes. Generating both a predicted video and passing-level movement commands in 21 seconds, from nothing but an observation video and a natural language instruction, gives a hands-on sense that this could be practically viable as a foundation for "giving verbal instructions to an arm" with small robots like Reachy Mini or SO-ARM101.
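The comparison against the reference can be sketched as a per-step mean squared error followed by a threshold check. This is a minimal illustration, not the official evaluation script: it assumes the overall figure is the mean of the per-step MSEs, and the data is synthetic, built only to mimic the pattern of two deviating steps.

```python
from statistics import fmean

THRESHOLD = 0.05  # pass/fail threshold quoted in the article


def per_step_mse(pred, ref):
    """Mean squared error of each time step, averaged over action dimensions."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same number of steps")
    return [fmean((p - r) ** 2 for p, r in zip(p_step, r_step))
            for p_step, r_step in zip(pred, ref)]


def evaluate(pred, ref, threshold=THRESHOLD):
    """Summarize per-step and overall error (assumed: overall = mean of steps)."""
    step_mse = per_step_mse(pred, ref)
    overall = fmean(step_mse)
    return {
        "per_step": step_mse,
        "overall": overall,
        "passed": overall < threshold,
        "worst_steps": [i for i, m in enumerate(step_mse) if m > threshold],
    }


# Synthetic 16 x 10 example (NOT the article's data): a zero reference, and a
# prediction that is off by 0.3 in every dimension at steps 6 and 7 only.
ref = [[0.0] * 10 for _ in range(16)]
pred = [[0.3 if step in (6, 7) else 0.0] * 10 for step in range(16)]

result = evaluate(pred, ref)
print(f"overall MSE = {result['overall']:.5f}, passed = {result['passed']}")
print("steps above threshold:", result["worst_steps"])
```

The synthetic case shows the same shape as the measured result: individual steps can exceed the threshold while the averaged error still passes, which is why the per-step view is worth checking alongside the overall number.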

Differences from Before and Summary

Finally, let me organize what has changed compared to the previous Cosmos series.

Previously, building a Physical AI application meant constructing the observation part and the generation part with separate models. The pipeline connected two models: Cosmos Reason 2 as a VLM for observation and Cosmos Predict 2.5 as a diffusion model for generation. Since each consumed approximately 17 to 40 GB in BF16, running both side by side required careful resource management. With Cosmos 3, observation and generation are completed in a single inference. In this validation, the Policy Model output a predicted video and a 16-step × 10-dimensional action sequence together in 21 seconds.

The scope for on-site applications also seems to be expanding. For small robots like Reachy Mini or SO-ARM101, the Policy Model generates "language → video + action" end-to-end, which gives the impression that a single model can cover what previously required separately training GR00T or ACT. For factory footage, use cases are also coming into view, such as starting from anomaly events extracted with VSS and visualizing as video "what would happen if this state were left unattended."

To summarize:

  • Roles that were previously handled by individual models, such as video generation and video understanding, are consolidated in Cosmos 3 into a 2-tower MoT of Reasoner + Generator
  • text-to-image, text-to-video, and image-to-video each produced practical-quality output in roughly 17 to 22 seconds of inference (text-to-image at 960×960 with 35 steps)
  • image-to-video behaves conservatively by prioritizing the physical state of the conditioning image, which seems promising for predictive uses such as surveillance footage
  • The Policy Model simultaneously generates predicted video and action sequences from observation video and task instructions, clearing the official threshold of 0.05 with an overall MSE of 0.013

The biggest change I felt this time is that the concept of completing "observation → planning → generation → control" with a single world foundation model has become achievable at a size that runs on a single DGX Spark.

Cosmos 3 was officially released (GA) at Computex. I plan to continue diving deeper into topics like Cosmos 3 Reasoner's Physical AI inference, detailed environment setup, and further in-depth validation of the Policy Model in separate articles.

https://dev.classmethod.jp/articles/dgx-spark-cosmos3-family-usecase-map/

