
I organized a usage map for the NVIDIA Cosmos 3 family
This page has been translated by machine translation. View original
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
In Running NVIDIA Cosmos 3 on DGX Spark, I covered NVIDIA Cosmos 3 as an omnimodel that consolidates both the Reasoner Tower and Generator Tower into a single model. In this article, which serves as a summary of the series, I'll organize how to use Cosmos 3 in practice from a "which to use when" perspective, and also run actual Japanese visual reasoning benchmarks with the Reasoner Tower—responsible for understanding—operating as a VLM.
Cosmos 3 centers around two omnimodels called Nano and Super, along with generation-specialized derivative models and models for robot control. Furthermore, given that the existing Cosmos Reason 2 is widely used as the VLM component in VSS (NVIDIA Video Search and Summarization), the practical decision axis becomes "how far to leverage the existing Reason 2, and from where to switch to Cosmos 3."
The Cosmos 3 Family and Cosmos Reason 2
First, let me lay out the models published under Cosmos 3 and the existing Cosmos Reason 2 on a single map.
The key point to understand here is that in Cosmos 3, the Reasoner Tower responsible for understanding and the Generator Tower responsible for generation are integrated into a single omnimodel. In the previous generation, models were separated by use case—Cosmos Reason for video understanding, Cosmos Predict for video generation—but Cosmos 3 consolidates these into the omnimodel Nano / Super. When you only need understanding, you can extract just the Reasoner Tower at inference time and run it as a standalone VLM.
As a quick reference table for capabilities and scale, the organization looks something like this:
| Model | Parameters | Memory estimate | Primary role | DGX Spark standalone |
|---|---|---|---|---|
| Cosmos Reason 2 | 8B | ~17 GB | VLM only (understanding, structured output) | ✅ |
| Cosmos 3 Nano | 16B | ~30 GB | omnimodel (understanding + generation + control). Also runs as standalone VLM with Reasoner Tower extraction | ✅ |
| Cosmos 3 Super | 64B | ~120 GB+ | omnimodel large version | ⚠️ (KV cache pressure) |
| Super-Image2Video | 64B derived | ~120 GB+ | Image → Video generation specialized | ⚠️ |
| Super-Text2Image | 64B derived | ~120 GB+ | Text → Image generation specialized | ⚠️ |
| Nano-Policy-DROID | 16B | ~30 GB | Robot control policy for the DROID platform | ✅ |
| Cosmos 3 Edge | 4B | ~8 GB | Lightweight edge version (coming soon) | ✅ (planned) |
Nano fits comfortably within DGX Spark™'s 128 GB unified memory, while the Super series—both the omnimodel and generation-specialized derivatives—are cramped on a single DGX Spark. If you're planning to run Super in production, it's safer to assume you'll need an environment on the scale of a Brev H100 or DGX Station. The soon-to-be-available Cosmos 3 Edge is small at around 4B, and looks like a candidate aimed at edge GPUs like Jetson.
Combinations That Can Run on DGX Spark
Let me think about what to combine and keep resident within the budget of DGX Spark's 128 GB unified memory.
From my hands-on experience running it, extracting only the Reasoner Tower of Cosmos 3 Nano loads it at around 17 GB, so placing it alongside Cosmos Reason 2 (~17 GB) keeps things to about 35 GB. Serving the Reasoner Tower is done by overriding the architecture at startup like vllm serve nvidia/Cosmos3-Nano --hf-overrides '{"architectures": ["Cosmos3ReasonerForConditionalGeneration"]}', and by capping --gpu-memory-utilization, it can co-reside on a single machine including KV cache. When running the full omnimodel including generation, ~30 GB is sufficient for a single DGX Spark.
Conversely, Cosmos 3 Super—whether omnimodel or generation-specialized derivatives—nearly exhausts the DGX Spark's budget by itself, so running it seriously means delegating it to a separate node by design. As a decision criterion when choosing DGX Spark, the conclusion is that "a single-node environment where Nano is the main player and can be switched with Cosmos Reason 2" has the best fit. For Super series, building in an offload-to-separate-environment design from the start should save headaches later.
Benchmarking Cosmos 3 Nano's Reasoner in Practice
Before discussing usage decisions, let me take actual measurements of how capable the Reasoner Tower is for understanding. I served Cosmos 3 Nano's Reasoner Tower on a GB10 DGX Spark and ran the Japanese visual reasoning benchmarks Heron-Bench and JMMMU, as well as Robot Trajectory CoT (Chain-of-Thought) and Embodied Reasoning. The comparison target is Cosmos Reason 2, the VSS default.
First, the Heron-Bench results (a benchmark where a LLM scores free-form responses). I lined them up by question category and image category.

The average score for Cosmos 3 Nano was 2.777, nearly overlapping with Cosmos Reason 2. Across both question categories (conversation / detail / complex) and image categories (7 types including anime / art / culture), neither model won decisively—the results were closely contested, with small fluctuations by category. From the standpoint of free-form Japanese responses, both appear to be at the same level.
Next, the JMMMU results (multiple-choice across 28 fields scored by exact match).

Overall exact match for Cosmos 3 Nano was 0.498, slightly above Cosmos Reason 2. Looking at individual fields, some like World History are strong for both, while others like Music and Mechanical Engineering are weak for both—the pattern of strengths and weaknesses is quite similar. The conclusion from actual measurements is that both models are nearly on par even in multiple-choice accuracy.
I also verified structured output. In the Robot Trajectory CoT task (having the model output waypoints as a sequence of points for grasping and placing objects from images), the proportion of valid JSON returned across 8 inputs was 0.62, with an average inference time of approximately 8.3 seconds per image. If trajectory generation comes in under 10 seconds, there's realistic potential for embedding this in small arms like SO-ARM101 or Reachy Mini for demo/interaction purposes.
In the Embodied Reasoning probe, I varied the prompt format to see if reasoning could be elicited inside <think>...</think> tags. Bare prompts rarely produced thinking tags, but when using phrasing that explicitly requests CoT or specifies safety considerations, all 4 cases stably produced substantive reasoning inside <think>.
In summary, the Reasoner Tower's accuracy is on par with Cosmos Reason 2, and the decision axis for switching is not "the accuracy gap" but rather "properties that only exist in Cosmos 3"—that's the impression this benchmarking leaves.
Cases Where You Should Keep Using Cosmos Reason 2
First, let's look at "cases where Cosmos Reason 2 should continue to be used." As we saw in the previous section, since Cosmos 3 Nano's Reasoner Tower and Cosmos Reason 2 are essentially at the same accuracy level on benchmarks, there are several scenarios where there's no need to force a migration.
The top case is when it's already embedded in an existing VSS pipeline. In VSS 3.1.0 EA, cosmos-reason2-8b is included as the standard hw env for DGX Spark, and compose setups built on Cosmos Reason 2 are provided for VLM-as-Verifier, Event Reviewer, and Alert Bridge alike. As I also mentioned in the earlier VSS 3.1.0 EA verification article, waiting for NVIDIA's roadmap to mature before replacing this with Cosmos 3 is the prudent approach.
Next, cases where it's running on edge environments like Jetson Orin Nano or AGX Thor. Cosmos Reason 2 has a track record with ARM64 and quantization (FP8 / NVFP4), and the recipes for edge deployment are well established. The Cosmos 3 series is currently centered on BF16, and the lightweight Cosmos 3 Edge for edge use isn't available yet. If Jetson deployment is the premise, Cosmos Reason 2 remains the safe choice for now.
Third, applications aimed at achieving high stream density, such as parallel streaming surveillance from multiple cameras. The VSS Blueprint benchmark also shows that 14 parallel streams can be handled by Cosmos Reason 2 with the combination of DGX Spark and AGX Thor, making it a track-record-first choice for dense monitoring applications.
Fourth, event detection that completes with structured JSON, such as PPE detection, anomaly detection, and motion classification. Cosmos 3 Nano returns the same JSON structure with the same prompts, so "accuracy won't change if you switch"—but by the same token, "there's little to gain from switching." If the existing pipeline is running fine, there's not much reason to swap out the model.
Overall, Cosmos Reason 2 retains its strengths in the near term in its position of being lightweight, low-latency, and proven. My personal view is that until NVIDIA shows momentum toward switching the standard VSS set to Cosmos 3, this remains the optimal solution for stable operation.
Cases Where Switching to Cosmos 3 Nano Is Worthwhile
Conversely, there are clear cases where switching from Cosmos Reason 2 to Cosmos 3 Nano is worthwhile. Since accuracy is essentially at the margin of error, the decision axis is not "the accuracy gap" but "properties that only exist in Cosmos 3 Nano."
First, scenarios requiring causal reasoning or CoT. When you want to explain step by step "why did this anomaly occur" or "what should be done next," the Reasoner Tower's reasoning chain comes into play. As seen in the Embodied Reasoning section, specifying concrete risk perspectives or procedures on the prompt side can elicit substantive reasoning inside <think>.
Second, robotics trajectory planning. The trajectory generation time of around 8 seconds measured in Robot CoT is a usable speed for building an interactive flow of "image input → trajectory generation → bridging to ACT / GR00T" for small arms like SO-ARM101 or Reachy Mini.
Third, multi-modal simultaneous reasoning. In addition to a 256K long context, Cosmos 3 Nano can handle images, video, audio, and actions all within a single model as an omnimodel. This comes into play for scenarios like long-form report generation that crosses multiple images, video, and text, or processing a full day's worth of manufacturing line footage into a summary across modalities.
Fourth, cases requiring physical reasoning derived from a world foundation model. Cosmos 3 Nano's Reasoner Tower is connected to the Generator Tower through shared latent representations, so the physical learning from the generation side indirectly permeates. This fits the niche requirement of "I don't need generation, but I want it to have a sense of physics."
In terms of implementation practicality, since the JSON output structure is compatible with Cosmos Reason 2, as long as you work out sidecar compatibility with VLM-as-Verifier and production operation stability, migrating an existing Cosmos Reason 2 pipeline to Cosmos 3 Nano should require minimal changes.
Cases Where Cosmos 3 Nano Should Be Deployed as an Omnimodel
So far we've been talking about using the Reasoner Tower as a VLM, but Cosmos 3 Nano is an omnimodel that has both understanding and generation co-residing in a single model. It should truly shine in use cases that need understanding and generation running simultaneously.
The representative case is an end-to-end workflow of observation → planning → generation → control. The Policy Model seen in Running NVIDIA Cosmos 3 on DGX Spark is exactly this—it simultaneously outputs a predicted video and a robot action sequence from an observed video and natural language task instructions. Considering that it passed the official golden standard (MSE 0.05) with MSE 0.013 in Article 1, there's a tangible sense that it's within practical range as a VLA (Vision-Language-Action) backbone for lightweight robot arms. Embedding this in small robots like Reachy Mini and SO-ARM101 is a direction I'd like to continue covering in future robotics articles.
Synthetic data generation is also a use case unique to the omnimodel. From observed footage, it can visualize "what would happen if this state were left unattended" or "how an anomalous event would appear as it progresses" as video, so it looks promising for automatic generation of compliance training materials and safety education materials for manufacturing. The combination of passing anomaly events extracted by VSS Event Reviewer to Cosmos 3 Nano's generation mode seems realistic.
Data augmentation for Sim2Real robotics is another scenario that leverages the generation side. If a pipeline can be built to mass-produce synthetic episodes in LeRobot v3 format with Cosmos 3 and feed them into policy training, the bottleneck of data collection via real-machine teleoperation could be greatly alleviated. When running on DGX Spark, Cosmos 3 Nano is the realistic solution, with Super requiring a separate GPU environment given both its parameter scale and memory requirements.
Cases Where Generator Functionality Takes Center Stage
There are also cases where the focus shifts to the generation side. Until now, generating video, images, audio, and actions was split across separate models like Cosmos Predict 2.5 and Cosmos Transfer. Cosmos 3 integrates this into the omnimodel, and also provides generation-specialized derivatives like Super-Image2Video for image-to-video and Super-Text2Image for text-to-image. The use cases previously covered by Cosmos Predict 2.5 series seem to be broadly moving toward being covered by Cosmos 3.
Particularly for Sim2Real data augmentation, where a two-stage pipeline of video generation with Cosmos Predict 2.5 → action inference with policy was being used, Cosmos 3 Nano can consolidate this into a single inference pass. When I ran Predict 2.5 previously, generating 1280×704 video in 36 steps with the 2B model took about 30 minutes. In contrast, the Cosmos 3 Nano Policy Model measured in Article 1 has a smaller output scale (640×480 × 17 frames), but outputs both video and actions simultaneously in 21 seconds—quite a different story. A direct comparison isn't possible given the different output scales, but the overall feel of the workflow is almost unrecognizably different.
However, ultra-high-resolution long-form video generation (cinematic use cases like 4K × 30 seconds) is outside the scope of Cosmos 3. This is a domain to leave to cloud-based dedicated video generation models like Veo and Wan, so going into Cosmos 3 thinking "I can create anything" would lead to a mismatch of use cases. Cosmos 3 is designed as a backbone for Physical AI, so it's suited for operation at the resolutions and frame counts appropriate for robotics, autonomous driving, and industrial simulation.
Summary
I've organized the usage map for NVIDIA Cosmos 3, incorporating actual measurements of the Reasoner Tower. Here's how it all flows together.
First, Cosmos 3 centers around the omnimodels Nano / Super, with generation-specialized derivatives and robot control models, plus Cosmos 3 Edge joining soon. On DGX Spark, Nano is the practical center, with Super series requiring a separate environment for offloading. Next, the Reasoner Tower's Japanese visual reasoning—Heron-Bench 2.777 and JMMMU 0.498—is on par with Cosmos Reason 2, and the decision axis for switching is not accuracy but "properties unique to Cosmos 3." Specifically, causal reasoning, CoT, multi-modal input unique to omnimodels, and physical reasoning are the scenarios where switching makes sense. Furthermore, for robotics requiring an end-to-end observation → planning → generation → control flow, and for synthetic data generation, Cosmos 3 Nano as an omnimodel delivers clear value.
Given a single-node environment with DGX Spark's 128 GB unified memory as the premise, my sense is that the current practical solution is to co-reside and switch between continuing with Cosmos Reason 2 or switching to Cosmos 3 Nano depending on the scenario. The approach of keeping Reason 2 as the main axis while progressively incorporating Reasoner Tower reasoning chains and Policy Models for new use cases where they're effective seems like the safe path.
Reference Links
- NVIDIA Cosmos Platform Overview
- NVIDIA Cosmos 3 Official Announcement (NVIDIA Blog)
- Cosmos 3 License (OpenMDW 1.1 / Linux Foundation)
- Running NVIDIA Cosmos 3 on DGX Spark
- Running Cosmos World Foundation Models on DGX Spark (Predict 2.5 + Reason 2 Introduction)
- Trying Structured Analysis of Images and Video with Cosmos-Reason2 on DGX Spark
- Fine-tuning Cosmos-Reason2-8B for PPE Detection on DGX Spark
- Investigating the Current State of Manufacturing VSS as Seen at VSS 3.1.0 EA and Hannover Messe
- Thinking Through Practical Use Cases for NVIDIA VSS + AI Agents + Skills in Everyday Workplaces
