I organized a usage map for the NVIDIA Cosmos 3 family

NVIDIA Cosmos 3's practical use case differentiation will be organized, incorporating actual measurements from the Reasoner Tower Japanese visual reasoning benchmark. We will specifically examine accuracy comparisons with Cosmos Reason 2, in which situations you should switch over, and when to deploy it as an omnimodel.

森茂洋 / Hiroshi Morishige

2026.06.02

This page has been translated by machine translation. View original

 IntroductionHello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
In Running NVIDIA Cosmos 3 on DGX Spark, I covered NVIDIA Cosmos 3 as an omnimodel that combines the Reasoner Tower and Generator Tower into a single model. In this article, which serves as the conclusion of the series, I organize Cosmos 3 from the perspective of "how to use it in practice," while also running actual Japanese visual reasoning benchmarks with the Reasoner Tower — responsible for understanding — operating as a VLM.
https://dev.classmethod.jp/articles/dgx-spark-cosmos3-omni-world-model-policy/
Cosmos 3 centers around two omnimodels, Nano and Super, and also includes generation-specialized derivative models and models for robot control. Furthermore, given that the existing Cosmos Reason 2 is widely used as the VLM component in VSS (NVIDIA Video Search and Summarization), the practical decision axis becomes: "How far do we keep leveraging the current Reason 2, and at what point do we switch to Cosmos 3?"
!Some results in this article are from pre-release validation builds. Please be aware that behavior and APIs may differ partially in the release version.
 The Cosmos 3 Family and Cosmos Reason 2First, let me lay out the models released under Cosmos 3 alongside the existing Cosmos Reason 2 on a single map.
The key point to grasp here is that in Cosmos 3, the Reasoner Tower responsible for understanding and the Generator Tower responsible for generation are integrated into a single omnimodel. In the previous generation, models were separated by purpose — Cosmos Reason for video understanding, Cosmos Predict for video generation — but Cosmos 3 consolidates these into omnimodels called Nano and Super. When you only want to use the understanding capability, you can extract just the Reasoner Tower at inference time and run it as a VLM.
Here is a quick-reference table of capabilities and scale.


Model
Parameters
Memory estimate
Main role
DGX Spark standalone


Cosmos Reason 2
8B
~17 GB
VLM only (understanding, structured output)
✅

Cosmos 3 Nano
16B
~30 GB
omnimodel (understanding + generation + control). Can run as standalone VLM by extracting Reasoner Tower
✅

Cosmos 3 Super
64B
~120 GB+
omnimodel large version
⚠️ (KV cache pressure)

Super-Image2Video
64B deriv.
~120 GB+
Image → Video generation specialized
⚠️

Super-Text2Image
64B deriv.
~120 GB+
Text → Image generation specialized
⚠️

Nano-Policy-DROID
16B
~30 GB
Robot control policy for the DROID platform
✅

Cosmos 3 Edge
4B
~8 GB
Lightweight edge version (coming soon)
✅ (planned)

Nano fits comfortably within DGX Spark™'s 128 GB unified memory, while the Super family — both the omnimodel and its generation-specialized derivatives — is tight on a single DGX Spark. If you plan to run Super in production, it is safer to assume an environment at the level of a Brev H100 or DGX Station. The upcoming Cosmos 3 Edge, at around 4B, looks like an option targeting edge GPUs like Jetson.
 Combinations That Can Run on DGX SparkLet me think about what to keep resident, and in what combination, within the budget of DGX Spark's 128 GB unified memory.
From my hands-on experience, extracting only the Reasoner Tower of Cosmos 3 Nano loads in about 17 GB, so placing it alongside Cosmos Reason 2 (~17 GB) keeps total usage around 35 GB. Serving the Reasoner Tower involves overriding the architecture at startup like vllm serve nvidia/Cosmos3-Nano --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}', and by reducing --gpu-memory-utilization, it can co-reside on a single machine even including the KV cache. Running the full omnimodel including generation requires about ~30 GB, which is sufficient on a single DGX Spark.
Conversely, Cosmos 3 Super — both the omnimodel and its generation-specialized derivatives — nearly exhausts DGX Spark's budget on its own, so running it seriously means delegating to a separate node. My conclusion for choosing DGX Spark is that "a single-node environment where Nano plays the lead role and can be switched with Cosmos Reason 2" is the most comfortable fit. It seems wise to design Super into a separate offloaded environment from the start, to avoid headaches later.
 Benchmarking Cosmos 3 Nano's Reasoner in PracticeBefore discussing how to differentiate usage, let me measure the actual capability of the Reasoner Tower. I served Cosmos 3 Nano's Reasoner Tower on a GB10 DGX Spark and ran Japanese visual reasoning benchmarks — Heron-Bench and JMMMU — as well as robot trajectory CoT (Chain-of-Thought) and Embodied Reasoning tasks. The comparison target is Cosmos Reason 2, the VSS default.
First, here are the Heron-Bench results (a benchmark where an LLM scores free-form responses), broken down by question category and image category.
The average score for Cosmos 3 Nano was 2.777, nearly overlapping with Cosmos Reason 2. Across both question categories (conversation / detail / complex) and image categories (7 types including anime / art / culture), neither model dominated the other — the results showed close competition with small fluctuations by category. In terms of Japanese free-form responses, both appear to be at the same level.
Next are the JMMMU results (exact match scoring across 28 fields of multiple-choice questions).
The overall exact match for Cosmos 3 Nano was 0.498, slightly above Cosmos Reason 2. Looking at individual fields, both models are strong in areas like World History, while both struggle in fields like Music and Mechanical Engineering — their strengths and weaknesses follow similar patterns. The conclusion from the measurements is that both models achieve nearly equal accuracy on multiple-choice tasks as well.
I also verified structured output. In the robot trajectory CoT task (having the model output waypoints as a sequence of points for grasping and placing objects from an image), the rate of returning valid JSON for 8 inputs was 0.62, with an average inference time of about 8.3 seconds per image. If trajectory generation stays under 10 seconds, it becomes feasible to integrate into small arms like the SO-ARM101 or Reachy Mini for demo and interaction purposes.
In the Embodied Reasoning probe, I tested whether reasoning could be elicited inside <think>...</think> tags by varying prompt styles. With bare prompts, thinking tags rarely appeared, but with formulations that explicitly requested CoT or specified safety considerations, all 4 cases stably produced substantive reasoning inside <think> tags.
In summary, the accuracy of the Reasoner Tower is on par with Cosmos Reason 2, and the decision axis for switching is not "the difference in accuracy" but rather "properties that only exist in Cosmos 3" — that is the impression these measurements convey.
 Scenarios Where You Should Keep Using Cosmos Reason 2First, let's look at "cases where you should continue using Cosmos Reason 2." As seen in the previous section, Cosmos 3 Nano's Reasoner Tower and Cosmos Reason 2 are at nearly the same accuracy level on benchmarks, so there are several scenarios where switching is not necessary.
The top case is when it is already integrated into an existing VSS pipeline. In VSS 3.1.0 EA, cosmos-reason2-8b is included as standard in the hw env for DGX Spark, and compose files built with Cosmos Reason 2 as the prerequisite are provided for VLM-as-Verifier, Event Reviewer, and Alert Bridge. As mentioned in the previous VSS 3.1.0 EA validation article, replacing this with a Cosmos 3 model is best done by waiting for NVIDIA's roadmap to mature.
Next is when running on edge environments like Jetson Orin Nano or AGX Thor. Cosmos Reason 2 has a proven track record with ARM64 and quantization (FP8 / NVFP4), and edge deployment recipes are well established. The Cosmos 3 family is currently centered on BF16, and the lightweight Cosmos 3 Edge for edge use is not yet available. For Jetson deployments, Cosmos Reason 2 remains the safe choice for now.
The third case is workloads that need to scale stream count, such as parallel multi-camera streaming surveillance. VSS Blueprint benchmarks show that Cosmos Reason 2 can handle 14 parallel streams with a DGX Spark and AGX Thor combination, and for high-density surveillance use cases, proven track record takes priority.
The fourth case is event detection that completes with structured JSON, such as PPE detection, anomaly detection, and action classification. Cosmos 3 Nano also returns the same JSON structure with the same prompts, so "accuracy doesn't change after switching" — but conversely, "there's little to gain from switching." If an existing pipeline is running smoothly, there's not much reason to swap out the model.
Overall, Cosmos Reason 2 retains its strengths for the time being in the position of lightweight, low-latency, proven track record. Until NVIDIA signals a momentum to switch the standard VSS set to Cosmos 3, this is, in my personal view, the optimal solution for stable operation.
 Scenarios Where Switching to Cosmos 3 Nano Is WorthwhileConversely, there are also clear scenarios where switching from Cosmos Reason 2 to Cosmos 3 Nano is worthwhile. Since accuracy itself is at an error-level tie, the decision axis is not "the difference in accuracy" but "properties that only exist in Cosmos 3 Nano."
The first is scenarios requiring causal reasoning or CoT. When you want step-by-step explanations of "why this anomaly occurred" or "what to do next," the Reasoner Tower's reasoning chain is effective. As seen in the Embodied Reasoning section above, specifying concrete risk perspectives or procedures on the prompt side reliably draws out substantive reasoning inside <think> tags.
The second is robotics trajectory planning. The trajectory generation of around 8 seconds measured in Robot CoT is a usable speed for building interactive workflows of "image input → trajectory generation → bridging to ACT / GR00T" with small arms like the SO-ARM101 or Reachy Mini.
The third is simultaneous multi-modal reasoning. In addition to a 256K long context, Cosmos 3 Nano can handle image, video, audio, and action all within a single model as an omnimodel. It becomes effective in scenarios requiring cross-modal processing across multiple images, videos, and text — such as long-form report generation or daily video summaries of a manufacturing line.
The fourth is when world foundation model-derived physical reasoning is needed. The Reasoner Tower of Cosmos 3 Nano is connected to the Generator Tower through shared latent representations, so the physics learning from the generation side indirectly permeates into it. This fits the niche requirement of "I don't need generation, but I want it to have a sense of physics."
From an implementation practicality standpoint, since the JSON output structure is compatible with Cosmos Reason 2, as long as VLM-as-Verifier Sidecar compatibility and production operation stability are addressed, replacing an existing Cosmos Reason 2 pipeline with Cosmos 3 Nano should require minimal rework.
 Scenarios to Deploy Cosmos 3 Nano as an OmnimodelUp to this point, I've been discussing using the Reasoner Tower as a VLM, but Cosmos 3 Nano is an omnimodel with understanding and generation co-existing in a single model. Its true value is likely to shine in use cases that need understanding and generation to operate simultaneously.
The representative case is an end-to-end workflow of observation → planning → generation → control. The Policy Model seen in Running NVIDIA Cosmos 3 on DGX Spark is exactly this — it simultaneously outputs a predicted video and a robot action sequence from an observed video and a natural language task instruction. Considering that Article 1 passed the official golden standard (MSE 0.05) with MSE 0.013, there is a sense that it is within practical range as a VLA (Vision-Language-Action) backbone for lightweight robot arms. I plan to continue covering integration into small robots like Reachy Mini and SO-ARM101 in future robotics series installments.
Synthetic data generation is also a use case unique to the omnimodel. Since it can visualize "what would happen if this state were left unattended" or "what an anomaly event would look like as it progresses" as video from observed footage, it seems well-suited for automatically generating manufacturing compliance materials or safety training content. The combination of passing anomaly events extracted by VSS Event Reviewer to Cosmos 3 Nano's generation mode looks practically feasible.
Data augmentation for Sim2Real robotics is another scenario that leverages the generation side. If a workflow can be established to mass-produce synthetic episodes in LeRobot v3 format with Cosmos 3 and feed them into policy training, it could greatly alleviate the bottleneck of collecting data through real-machine teleoperation. For running on DGX Spark, Cosmos 3 Nano is the practical solution, while Super requires a separate GPU environment in terms of both parameter scale and memory requirements.
 Scenarios Where Generator Functionality Takes Center StageThere are also scenarios where you want to place emphasis on the generation side. Video, image, audio, and action generation were previously split across separate models such as Cosmos Predict 2.5 and Cosmos Transfer. Cosmos 3 integrates these into an omnimodel, and also provides generation-specialized derivatives like Super-Image2Video for image-to-video and Super-Text2Image for text-to-image. The use cases of the previous Cosmos Predict 2.5 line appear to be moving toward general coverage by Cosmos 3.
Particularly for Sim2Real data augmentation, the two-stage pipeline that was previously built with Cosmos Predict 2.5 for video generation → policy for action inference can be consolidated into a single inference with Cosmos 3 Nano. When I previously ran Predict 2.5, generating a 1280×704 video in 36 steps with the 2B model took about 30 minutes. In contrast, the Cosmos 3 Nano Policy Model measured in Article 1, while on a smaller output scale (640×480 × 17 frames), outputs video and action simultaneously in 21 seconds. A direct comparison is not possible due to the difference in output scale, but the feel of the overall workflow is nearly a different experience.
However, ultra-high-resolution long-form video generation (cinematic use cases like 4K × 30 seconds) is outside Cosmos 3's scope. This is a domain to delegate to cloud-based dedicated video generation models like Veo or Wan, so going into Cosmos 3 expecting to "create anything" will lead to a mismatch of use cases. Cosmos 3 is designed as a backbone for Physical AI, so it is best suited for the resolutions and frame counts oriented toward robotics, autonomous driving, and industrial simulation.
 SummaryI organized a usage map for NVIDIA Cosmos 3, incorporating actual measurements of the Reasoner Tower. Here is how the story unfolded.
First, Cosmos 3 centers around omnimodels Nano and Super, with generation-specialized derivatives and robot control models, and will soon be joined by Cosmos 3 Edge. On DGX Spark, Nano is the practical workhorse, while the Super family requires offloading to a separate environment. Next, the Reasoner Tower's Japanese visual reasoning — Heron-Bench 2.777, JMMMU 0.498 — is on par with Cosmos Reason 2, and the decision axis for switching is not accuracy but "properties only in Cosmos 3." Specifically, the moments to switch are when causal reasoning, CoT, multi-modal input unique to the omnimodel, and physical reasoning come into play. Furthermore, for robotics requiring an end-to-end pipeline of observation → planning → generation → control, and for synthetic data generation, deploying Cosmos 3 Nano as an omnimodel becomes worthwhile.
Given the premise of a single-node environment with DGX Spark's 128 GB unified memory, my sense is that the current practical solution is to differentiate by scenario — whether to continue with Cosmos Reason 2 or switch to Cosmos 3 Nano — and operate them with co-residency and switchover. An approach of keeping Reason 2 as the main axis while gradually incorporating new use cases where the Reasoner Tower's reasoning chain or Policy Model is effective seems the safest path.
 Reference LinksNVIDIA Cosmos Platform Overview
NVIDIA Cosmos 3 Official Announcement (NVIDIA Blog)
Cosmos 3 License (OpenMDW 1.1 / Linux Foundation)
Running NVIDIA Cosmos 3 on DGX Spark
Running Cosmos World Foundation Models on DGX Spark (Predict 2.5 + Reason 2 Introduction)
Trying Structured Analysis of Images and Videos with Cosmos-Reason2 on DGX Spark
Fine-tuning Cosmos-Reason2-8B for PPE Detection on DGX Spark
Investigating VSS 3.1.0 EA and the Current State of Manufacturing VSS as Seen at Hannover Messe
Thinking About Where NVIDIA VSS + AI Agents + Skills Fit in Everyday Worksites

I organized a usage map for the NVIDIA Cosmos 3 family

Introduction

The Cosmos 3 Family and Cosmos Reason 2

Combinations That Can Run on DGX Spark

Benchmarking Cosmos 3 Nano's Reasoner in Practice

Scenarios Where You Should Keep Using Cosmos Reason 2

Scenarios Where Switching to Cosmos 3 Nano Is Worthwhile

Scenarios to Deploy Cosmos 3 Nano as an Omnimodel

Scenarios Where Generator Functionality Takes Center Stage

Summary

Reference Links

AI白書2026 配布中

AWS Topics

Trending Topics

Products & Services

Features and Series

Model	Parameters	Memory estimate	Main role	DGX Spark standalone
Cosmos Reason 2	8B	~17 GB	VLM only (understanding, structured output)	✅
Cosmos 3 Nano	16B	~30 GB	omnimodel (understanding + generation + control). Can run as standalone VLM by extracting Reasoner Tower	✅
Cosmos 3 Super	64B	~120 GB+	omnimodel large version	⚠️ (KV cache pressure)
Super-Image2Video	64B deriv.	~120 GB+	Image → Video generation specialized	⚠️
Super-Text2Image	64B deriv.	~120 GB+	Text → Image generation specialized	⚠️
Nano-Policy-DROID	16B	~30 GB	Robot control policy for the DROID platform	✅
Cosmos 3 Edge	4B	~8 GB	Lightweight edge version (coming soon)	✅ (planned)