
Reading NVIDIA's Physical AI, as Updated by Cosmos 3, Through the Three-Layer Structure of Factory AI
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Division.
I wrote this article because the official release of Cosmos 3 at GTC Taipei felt like a good moment to step back and take a bird's-eye view of the individual articles I've accumulated throughout this DGX Spark series.
In the time series foundation model series, I covered a comparison of Chronos-2 and TimesFM 2.5, an experiment connecting Chronos-2 to a PLC-like simulator, and anomaly detection with the SKAB dataset — so the "layer dealing with numerical time series" has been yielding decent results. On the other hand, video-related VSS Agent and Cosmos Reason, simulation-related Cosmos Predict, and the omnimodel integrated in Cosmos 3 are all running on the same DGX Spark, and I hadn't properly sorted out how these actually contribute to factory AI and in what ways.
With Cosmos 3 being made publicly available at the GTC Taipei keynote, the models covering each layer are now mostly in place. This article is a bird's-eye overview with zero hands-on verification, reading factory AI through a three-layer structure of numerical, visual, and simulation layers, and organizing the roles and division of responsibilities among NVIDIA's model groups corresponding to each layer (time series foundation models / Cosmos Reason / world models).
What Makes the 3 Systems Different in the First Place
When you break down "learning-based models" used in factory AI by primary input and output, they naturally fall into 3 systems.
| Perspective | Time Series Foundation Models | Cosmos Reason (Reasoning VLM) | World Models (Cosmos Predict / Cosmos 3) |
|---|---|---|---|
| Primary Input | Numerical time series + covariates | Images/video + natural language prompts | Images/video + prompts |
| Primary Output | Future values / quantile / anomaly scores for numerical time series | Natural language situation understanding / bbox / Yes-No judgment | Video frame generation / synthetic data |
| Layer Addressed | Numerical layer (extension of sensor values) | Visual/language layer (meaning of situations) | Simulation layer (generation of physical phenomena) |
| Latency | Milliseconds to hundreds of milliseconds | Seconds to tens of seconds | Tens of seconds to tens of minutes |
| Memory | 100MB to 1GB | Several GB to tens of GB | Tens of GB to 100GB+ |
| Real-time Capability | Can be placed in control loops | Fits in advisory loops | Batch/offline only |
When laid out like this, I think it becomes clear that the 3 systems are not in conflict — they simply have different roles in terms of latency, memory, and real-time capability. The numerical layer engages with control at the millisecond level, the visual layer returns semantic interpretation at the second level, and the simulation layer produces "plausible futures" as video at the minute level.
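As a rough sketch of how the latency row of the table can drive a design decision, the snippet below picks the layers that fit a given response-time budget. The numeric upper bounds (0.5 s, 60 s, 3600 s) are my own illustrative readings of "hundreds of milliseconds", "tens of seconds", and "tens of minutes", not measured values, and the `Layer` class and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    worst_case_latency_s: float  # assumed upper end of the range in the table
    loop: str


# Upper bounds are illustrative readings of the table's ranges, not measurements.
LAYERS = [
    Layer("numerical (time series FM)", 0.5, "control loop"),
    Layer("visual (Cosmos Reason)", 60.0, "advisory loop"),
    Layer("simulation (world model)", 3600.0, "batch/offline"),
]


def layers_within_budget(budget_s: float) -> list[str]:
    """Return the layers whose assumed worst-case latency fits the budget."""
    return [layer.name for layer in LAYERS if layer.worst_case_latency_s <= budget_s]


print(layers_within_budget(1.0))    # only the numerical layer fits a 1 s budget
print(layers_within_budget(120.0))  # numerical and visual layers fit
```

With these assumed bounds, a sub-second budget leaves only the numerical layer, which is the same conclusion the table reaches in words.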
Plotting the 3 systems on two axes of latency and memory makes their operational images even clearer.
From here, I'll look at what each layer can do by referencing the articles covered in the series.
Numerical Layer — The Domain of Time Series Foundation Models
This is the layer that directly feeds numerical time series data — temperature, flow rate, pressure, current, etc. — from factory PLCs and sensors into models to produce future values and anomaly scores. In the series, we lined up 3 models in the Chronos-2 / TimesFM 2.5 / NV-Tesseract comparison article.
Listing the characteristics of the 3 main models in order: Chronos-2 has an encoder + 1 forward pass structure, so latency remains nearly constant even as the horizon extends (96→720 steps: 6.5ms → 6.8ms). The 28M model runs at 4ms / 84MB, a size that fits even on Jetson-class edge devices — that's its strong point. TimesFM 2.5, on the other hand, has an autoregressive structure that gets heavier as the horizon grows (h=96→192: 86→229ms), but it benefits from context scaling — extending context from c=512 → 15,360 improves accuracy from MASE 1.106 → 0.770. NV-Tesseract is an industrial time series-specialized model whose collaboration with Cognite and Celanese was announced at GTC 2026. It's evaluation-license-based, and hands-on verification remains a task for a separate article for now.
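To make the scaling difference concrete, the figures quoted above can be turned into ratios. This is plain arithmetic on the numbers in this article; only the helper name `growth` is introduced here.

```python
def growth(before: float, after: float) -> float:
    """Multiplicative change from `before` to `after`."""
    return after / before


# Figures quoted in the paragraph above.
chronos_latency = growth(6.5, 6.8)              # horizon 96 -> 720 steps
timesfm_latency = growth(86.0, 229.0)           # horizon 96 -> 192 steps
timesfm_mase_gain = 1.0 - growth(1.106, 0.770)  # context 512 -> 15,360

print(f"Chronos-2 latency: x{chronos_latency:.2f} for a x{720 / 96:.1f} longer horizon")
print(f"TimesFM 2.5 latency: x{timesfm_latency:.2f} for a x{192 / 96:.1f} longer horizon")
print(f"TimesFM 2.5 MASE reduction: {timesfm_mase_gain:.1%}")
```

In other words, Chronos-2 pays roughly 5% more latency for a 7.5x longer horizon, TimesFM 2.5 pays roughly 2.7x for a 2x longer horizon, and the longer context buys TimesFM 2.5 about a 30% reduction in MASE.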
In the SKAB follow-up article on anomaly detection, we saw an asymmetry: TimesFM mean outperforms Chronos-2 by 18 to 22 AUC points on some datasets, while running Chronos-2 in multivariate mode is 7 to 8 times faster. The practical division is to use Chronos-2 28M for edge deployment and TimesFM 2.5 mean when accuracy is the priority.
In the experiment connecting Chronos-2 to a PLC-like simulator, we achieved AUC 0.999 / F1 0.83 for spike detection across 72h × 16,177 windows. In contrast, gradual drift in wear+spike mode reached only AUC ≈ 0.51, which is close to chance level. The measurements showed that slow-progressing degradation is hard to catch with the numerical layer alone. The visual and simulation layers described next are meant to cover this gap.
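The contrast between spike detection and drift detection can be illustrated with a toy example. The sketch below is not the scoring method used in the experiment; it assumes a simple score (distance outside a forecast quantile band) and a fixed band, and it computes ROC AUC from ranks. It shows one plausible mechanism: a spike leaves the band and ranks cleanly, while a slow drift that stays inside the band produces identical scores and an AUC of 0.5.

```python
def band_score(actual, q_low, q_high):
    """Distance outside the forecast quantile band, measured in band widths."""
    scores = []
    for y, lo, hi in zip(actual, q_low, q_high):
        width = max(hi - lo, 1e-9)  # avoid division by zero
        outside = max(lo - y, 0.0) + max(y - hi, 0.0)
        scores.append(outside / width)
    return scores


def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic, with average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in order[i:j + 1]:
            ranks[k] = (i + j) / 2 + 1  # average 1-based rank of the tie group
        i = j + 1
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy windows with a fixed band of [9, 11] (an assumption for illustration).
q_low, q_high = [9.0] * 6, [11.0] * 6
spike = band_score([10, 10, 15, 10, 10, 10], q_low, q_high)
drift = band_score([10.0, 10.1, 10.2, 10.3, 10.4, 10.5], q_low, q_high)

print(roc_auc(spike, [0, 0, 1, 0, 0, 0]))  # spike is ranked above every normal window
print(roc_auc(drift, [0, 0, 0, 1, 1, 1]))  # drift never leaves the band: no ranking signal
```

A real forecaster would also adapt its band to the recent trend, which makes slow drift even harder to separate from normal behavior.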
Visual Layer — The Domain of Cosmos Reason / VSS Agent
This is the layer that reads "what is happening" from surveillance cameras, manufacturing line cameras, and still images for visual inspection. In NVIDIA's stack, this layer has two systems: standalone VLM inference (Cosmos Reason) and video search with an agent loop (VSS Agent).
Cosmos Reason 2 and the Cosmos 3 Nano Reasoner Tower are VLMs with 4 structured output capabilities (2D Grounding / Robot CoT / Embodied Reasoning / Temporal Localization). They suit one-shot use cases such as returning JSON for "Is PPE being worn in this image?" or "Which process does this motion belong to?" In the series, the article testing Cosmos Reason 2's structured reasoning measured 6 functions + PPE detection + video benchmarks in practice.
VSS Agent + Skills, on the other hand, is the NVIDIA Video Search and Summarization Blueprint. It is a larger package that uses Cosmos Reason as its backend VLM and also includes video search, summarization, an agent loop, and MCP. This is covered in the VSS 3.1.0 EA re-verification article and in the overview article on VSS in the Agents + Skills era.
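For the one-shot JSON use case, downstream code usually needs to validate the model's reply before acting on it. The sketch below assumes a hypothetical reply schema (`ppe_worn`, `missing_items`) that I made up for illustration; it is not a schema defined by Cosmos Reason, and the actual format depends on the prompt you write.

```python
import json
from dataclasses import dataclass


@dataclass
class PpeVerdict:
    ppe_worn: bool
    missing_items: list[str]


def parse_ppe_reply(raw: str) -> PpeVerdict:
    """Parse a VLM reply into a typed verdict, rejecting malformed output.

    The field names are hypothetical; adapt them to your own prompt.
    """
    # Models often wrap JSON in extra text, so cut out the outermost object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model reply")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError
    if not isinstance(data.get("ppe_worn"), bool):
        raise ValueError("ppe_worn must be true or false")
    missing = data.get("missing_items", [])
    if not isinstance(missing, list) or not all(isinstance(m, str) for m in missing):
        raise ValueError("missing_items must be a list of strings")
    return PpeVerdict(data["ppe_worn"], list(missing))


reply = 'Answer: {"ppe_worn": false, "missing_items": ["helmet"]}'
print(parse_ppe_reply(reply))
```

Raising on malformed output keeps a bad generation from being silently treated as "PPE is worn".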
The division is simple: use Cosmos Reason for one-shot VLM inference, and use VSS when you need video search, summarization, and agent loops. Since the Reasoner Tower in Cosmos 3 now performs comparably to Cosmos Reason 2, it's natural to expect the Cosmos 3 lineage to be adopted as the VLM for VSS going forward. I'll revisit this in more detail in the later section "Three Layers Updated by Cosmos 3."
While the numerical layer answers "is the current value abnormal?" in sub-second time, the visual layer returns "how do we interpret what we're currently seeing?" in a matter of seconds. In real projects, a pattern where the visual layer adds meaning to alerts from the numerical layer tends to be effective.
Simulation Layer — The Domain of World Models
This is the layer that generates "plausible next videos" from past footage or a single image. In terms of articles, I covered world foundation models in the Cosmos Predict 2.5 + Reason2 verification article.
World models are effective in factory AI in three main scenarios. The first is synthetic data generation: generating, as video, abnormal patterns that could not be collected in sufficient quantity from actual equipment, and using them to pre-train the visual-layer Cosmos Reason. The second is the Sim2Real bridge: augmenting robot arm motion data through simulation and transferring it to real-world ACT/VLA. The third is the Digital Twin: reproducing the physical phenomena of a process as video, for impact prediction before equipment changes or for educational use.
In actual measurements on DGX Spark, the Cosmos Predict 2.5 2B model took approximately 30 minutes to generate a 1280×704 video at 36 steps. This is not a layer for real-time control; the practical approach is to use it upstream to create data or downstream to create review materials.
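The 30-minute figure translates directly into batch planning. The sketch below is simple arithmetic that assumes clips are generated one at a time on a single DGX Spark and that every clip takes the same time as the measured one; the function names are mine.

```python
import math

MINUTES_PER_CLIP = 30  # approximate figure reported above for the 2B model
STEPS_PER_CLIP = 36


def clips_per_window(window_hours: float) -> int:
    """How many clips fit into a batch window when generated sequentially."""
    return math.floor(window_hours * 60 / MINUTES_PER_CLIP)


def hours_for_dataset(n_clips: int) -> float:
    """Wall-clock hours needed to generate n_clips sequentially."""
    return n_clips * MINUTES_PER_CLIP / 60


seconds_per_step = MINUTES_PER_CLIP * 60 / STEPS_PER_CLIP

print(f"{seconds_per_step:.0f} s per step")
print(f"{clips_per_window(8)} clips in an 8-hour overnight window")
print(f"{hours_for_dataset(500):.0f} hours for a 500-clip synthetic dataset")
```

At roughly 50 seconds per step and 16 clips per night, a synthetic dataset of a few hundred clips is a multi-week batch job, which is why this layer sits upstream of operations.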
And with Cosmos 3, the world model has been restructured into the Cosmos 3 Nano / Super omnimodel, evolving into a form that handles text / image / video / audio / action input and output in a single model.
Collaboration Scenarios for the 3 Systems
Mapping the 3 systems to manufacturing use cases naturally reveals collaboration patterns.
| Scenario | Role of Time Series FM | Role of Cosmos Reason | Role of World Model |
|---|---|---|---|
| A: Integrated Anomaly Detection | Immediate anomaly score from PLC sensors | Supplement visual anomaly judgment from surveillance camera footage | Pre-train Cosmos Reason with synthetic anomaly video |
| B: Quality Digital Twin | Cross-referencing actual measurements and predictions | Explain root cause of anomalies in natural language | Reproduce physical phenomena of processes as video |
| C: New Employee Training / SOP Compliance | Real-time warning of deviations from set values | Judge procedure compliance from worker movements | Generate video of failure patterns for training materials |
| D: Traceability Enhancement | Numerical log of manufacturing history | Extract process events from video logs | Digital Twin reconstruction of past processes |
For example, writing out the collaboration in Scenario A as a timeline, the flow looks like this:
Looking at this timeline, I think it's easy to grasp the structure of the 3 systems "looking at the same event from different angles." It's a layered design where the numerical layer responds immediately, the visual layer supplements, and the LLM verbalizes. The Cosmos world model doesn't enter this direct operational loop, but contributes to improving overall accuracy by mass-producing anomaly footage in the pre-training phase.
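The layered flow described above can be sketched as a small escalation function. This is a minimal illustration, not the actual pipeline from the series: the visual layer and the LLM are passed in as plain callables, and the stub functions and the threshold value are hypothetical.

```python
from typing import Callable, Optional


def handle_window(
    anomaly_score: float,
    threshold: float,
    ask_visual_layer: Callable[[], str],
    summarize: Callable[[float, str], str],
) -> Optional[str]:
    """Escalate one sensor window through the online stages of Scenario A."""
    if anomaly_score < threshold:
        return None  # numerical layer sees nothing unusual: no camera query
    observation = ask_visual_layer()  # seconds-scale call, made only on alert
    return summarize(anomaly_score, observation)  # LLM verbalizes the result


# Hypothetical stand-ins for Cosmos Reason and an LLM.
def fake_camera() -> str:
    return "coolant leaking near pump 2"


def fake_llm(score: float, observation: str) -> str:
    return f"Anomaly score {score:.2f}; camera reports: {observation}."


print(handle_window(0.12, 0.80, fake_camera, fake_llm))
print(handle_window(0.93, 0.80, fake_camera, fake_llm))
```

The design point is that the slow, expensive layers are only invoked when the fast layer raises an alert, so the millisecond path is never blocked by a seconds-scale call.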
Three Layers Through NVIDIA's Strategy Map
NVIDIA has intentionally positioned the 3 systems as complementary, and the strategic picture became much clearer at GTC 2026 / GTC Taipei. NV-Tesseract (launched in collaboration with Cognite and Celanese) handles the numerical sensor prediction and anomaly detection space; VSS + Cosmos Reason (expanding through partner cases such as Invisible AI / Tulip / Fogsphere / Pegatron) is central for video-based factory visualization; and world models — restructured from Cosmos Predict into Cosmos 3 omnimodel — handle the domains of synthetic data, Digital Twin, and Policy Models. This is the three-pillar structure.
At the GTC Taipei keynote, Cosmos 3 was announced as a Physical AI foundation model alongside Alpamayo 2 (a reasoning VLA for autonomous driving) and Isaac GR00T (a reference for humanoid robots). NVIDIA has organized Physical AI into 3 domains — "General AI / Autonomous Driving / Humanoid" — and is presenting foundation models and reference designs for each. In the context of factory AI, this can be read as NVIDIA intentionally assigning models to the three layers of numerical, visual, and simulation.
Three Layers Updated by Cosmos 3
With the release of Cosmos 3 at the GTC Taipei keynote, both the visual layer and simulation layer were updated. Here's a summary of the key points.
| Use Case | Previous Generation (Current) | Cosmos 3 (New) |
|---|---|---|
| Visual inspection / status description | Cosmos Reason 2-8B | Cosmos 3 Nano Reasoner Tower (comparable to Cosmos Reason 2) |
| Pre-manufacturing simulation | Cosmos Predict 2.5 | Cosmos 3 Nano (omnimodel, 4-modality integration including audio and process sounds) |
| Policy Model (robot integration) | N/A | Cosmos 3 Nano (omnimodel, action generation for integration with SO-101, etc.) |
Cosmos 3 is released as the Nano (16B) and Super (64B) omnimodel, with Edge (4B) coming soon. The license is OpenMDW 1.1 (Linux Foundation, commercially usable). The Reasoner Tower for understanding and the Generator Tower for generation, which were separate in the previous generation, are integrated into a single omnimodel, and the Reasoner Tower alone can be extracted and used as a VLM at inference time. For actual behavior and benchmark values of each model on real hardware, please also refer to the Cosmos-related articles in the reference links at the end.
On DGX Spark's 128GB of unified memory, Cosmos Reason 2, Cosmos 3 Nano, and Chronos-2 fit together within about 50GB, so running the visual layer and the numerical layer on a single machine is increasingly realistic. When world model generation is run in earnest, scheduling it in a separate time slot remains the practical approach.
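As a sanity check on the "about 50GB" figure, here is a rough weights-only estimate. It assumes 16-bit weights (2 bytes per parameter) for the two Cosmos models, which is my assumption rather than a measured configuration, and it ignores KV cache, activations, and runtime overhead. The Chronos-2 figure is the 84MB quoted earlier in this article.

```python
BYTES_PER_PARAM = 2  # assumption: 16-bit (BF16/FP16) weights
UNIFIED_MEMORY_GB = 128


def weights_gb(params_billion: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM


footprint_gb = {
    "Cosmos Reason 2-8B": weights_gb(8),
    "Cosmos 3 Nano (16B)": weights_gb(16),
    "Chronos-2 28M": 0.084,  # 84MB, as quoted earlier in this article
}
total_gb = sum(footprint_gb.values())
headroom_gb = UNIFIED_MEMORY_GB - total_gb

for name, gb in footprint_gb.items():
    print(f"{name}: {gb:.1f} GB")
print(f"total: {total_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

Under these assumptions the weights come to roughly 48GB, which is consistent in order of magnitude with the figure above and leaves the remaining memory for caches and other processes.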
Manufacturing 5-Layer Organization — Rule-Based and Learning-Based Layers
So far I've been focusing on the "3 systems of learning-based models," but in actual manufacturing settings, these sit on top of existing rule-based layers like PLC / SCADA / MES. Organizing both into 5 layers makes the division of roles on the AI side easier to see.
| Layer | Nature | Examples |
|---|---|---|
| PLC | Rules for equipment control | Stop when sensor value exceeds threshold, open valve, run motor |
| SCADA | Rules for monitoring, alarms, and operations | Alarm when temperature limit is exceeded, display on screen, record history |
| MES | Rules for manufacturing operations and process management | This lot flows in this process order, if inspection result is NG don't advance to next process |
| Prediction Models | Estimate future states and signs of anomaly from past and present data | Given this temperature/vibration/quality trend, there's a high chance of anomaly in a few hours |
| Optimization Models | Choose better actions within multiple constraints | Propose optimal conditions factoring in quality, yield, delivery, and power costs |
Viewing it as "rule-based layer = world of existing equipment" versus "learning-based layer = today's three layers + LLM assistance" makes it easier to organize what kind of factory AI proposals are being discussed.
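The difference between the rule-based and learning-based layers can be sketched in a few lines. The rule fires only after the limit is crossed, while a prediction estimates how long until it will be. A least-squares line is used here purely as a stand-in for a real prediction model such as a time series foundation model, and all names and values are hypothetical.

```python
from typing import Optional


def plc_rule(value: float, limit: float) -> bool:
    """Rule-based layer: trips only once the limit is already exceeded."""
    return value > limit


def steps_until_limit(history: list[float], limit: float) -> Optional[float]:
    """Learning-based stand-in: extrapolate a fitted line to the limit.

    Returns the estimated number of future steps, or None if the trend
    is flat or falling. A linear fit is a placeholder for a real forecaster.
    """
    n = len(history)
    if n < 2:
        return None
    mean_t = (n - 1) / 2
    mean_y = sum(history) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(history))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    if slope <= 0:
        return None
    fitted_now = mean_y + slope * ((n - 1) - mean_t)
    return max((limit - fitted_now) / slope, 0.0)


temperature = [70.0, 71.0, 72.0, 73.0, 74.0]
print(plc_rule(temperature[-1], limit=80.0))     # rule: nothing to report yet
print(steps_until_limit(temperature, limit=80.0))  # prediction: limit reached in ~6 steps
```

The two are complementary: the rule remains the safety net, and the prediction buys lead time before the rule would ever fire.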
The decision materials to incorporate when designing a quality stabilization model for manufacturing can also be organized using the same three layers + LLM assistance.
| Decision Material | Content | Responsible Layer |
|---|---|---|
| Equipment signals | Time series data such as temperature, flow rate, speed, pressure, and current | Numerical layer (Chronos-2 / TimesFM) |
| Subtle anomaly signs | Gradients, fluctuations, slight differences, combinations of multiple signals | Numerical layer (NV-Tesseract's forte) |
| Images / video | Appearance, color unevenness, chips, cracks, misalignment, work conditions | Visual layer (Cosmos Reason) |
| Reference information / ambient conditions | Raw material specifications, quality standards, work instructions, temperature/humidity, outdoor conditions | LLM + RAG (reference system, knowledge) |
| Human expertise | Veteran perspectives, correction sequences, startup intuitions, and other tacit knowledge | LLM formalization (prompt / FT material) |
The basic mapping is that equipment signals and subtle anomaly signs go to the numerical layer, images and video go to the visual layer, and reference information and human expertise go to LLM assistance. World models don't directly correspond to any of these 5 items, but they work behind the scenes to fill data gaps by generating synthetic equipment signals and video footage after the fact.
When a real project brings in a request like "please build a quality stabilization model," going through these 5 items as a checklist makes it easier to discuss what to place on which layer. That is the practical use of this organization.
Summary
I've broken down the learning-based layers of factory AI into 3 systems and lined them up at the current state after the release of Cosmos 3. There are 4 key takeaways from this organization.
- First, the 3 systems have different roles in terms of latency, memory, and real-time capability. The numerical layer engages with control at the millisecond level, the visual layer returns semantic interpretation at the second level, and the simulation layer handles data generation at the minute level.
- Second, the 3 systems are stronger in combination. The visual layer adds meaning to anomaly scores from the numerical layer, and the simulation layer acts as a behind-the-scenes contributor supplying pre-training data.
- Third, Cosmos 3 has significantly updated both the visual and simulation layers. The Cosmos 3 Nano Reasoner Tower is comparable to Cosmos Reason 2, and as an omnimodel, the world model domain has advanced significantly with 4-modality integration and Policy Model capabilities.
- Fourth, the learning-based layer (prediction / optimization) sits on top of the rule-based layer (PLC / SCADA / MES), which provides a useful reference framework when designing quality stabilization models for manufacturing.
Reference Links
Time Series Foundation Model Series
- Running and Comparing Time Series Foundation Models on DGX Spark
- Predicting PLC-like Time Series Data with Chronos-2 and Generating Maintenance Comments with Nemotron
- Trying Industrial Sensor Anomaly Detection with SKAB and Time Series Foundation Models
VSS-Related Articles
- Investigating the Current State of Manufacturing VSS as Seen in VSS 3.1.0 EA and Hannover Messe
- Thinking About Everyday Use Cases for NVIDIA VSS + AI Agents + Skills in the Field
Cosmos-Related Articles
- Trying Structured Analysis of Images and Video with Cosmos-Reason2 on DGX Spark
- Running NVIDIA Cosmos 3 on DGX Spark
- Organizing the NVIDIA Cosmos 3 Family Usage Map on DGX Spark
