Analyzing NVIDIA Physical AI, updated with Cosmos 3, through the three-layer structure of factory AI

Looking back at the numerical, visual, and simulation layers built up over the DGX Spark series, this article takes a bird's-eye view of factory AI now that Cosmos 3 has been released. It covers the division of roles among the layers, how they coexist in terms of latency and memory, and how they collaborate in manufacturing scenarios.

森茂洋 / Hiroshi Morishige

2026.06.08


Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Division.
I wrote this article to take a bird's-eye view of the individual articles I've accumulated throughout the DGX Spark series, now that Cosmos 3 has been officially released at GTC Taipei.
In the time series foundation model series, I compared Chronos-2 and TimesFM 2.5, connected Chronos-2 to a PLC-style simulator, and ran anomaly detection on the SKAB dataset, so the "numerical time series handling layer" was starting to feel quite solid. On the other hand, the video-related VSS Agent and Cosmos Reason, the simulation-related Cosmos Predict, and the omnimodel integrated in Cosmos 3 all run on the same DGX Spark, and I hadn't properly taken stock of how each of them actually contributes to factory AI.
With Cosmos 3 becoming publicly available at the GTC Taipei keynote, models covering each layer are now essentially all available. This article is a bird's-eye overview with no hands-on verification: it interprets factory AI through a three-layer structure of numerical, visual, and simulation, and organizes the division of roles among NVIDIA's model groups corresponding to each layer (time series foundation models / Cosmos Reason / world models).

What's Different About the 3 Systems in the First Place

When you categorize the "learning-based models" used in factory AI by their primary input and output, they naturally fall into 3 systems.


| Perspective | Time Series Foundation Models | Cosmos Reason (Reasoning VLM) | World Models (Cosmos Predict / Cosmos 3) |
| --- | --- | --- | --- |
| Primary Input | Numerical time series + covariates | Images/video + natural language prompts | Images/video + prompts |
| Primary Output | Future values / quantiles / anomaly scores of numerical time series | Natural language situation understanding / bbox / Yes-No judgment | Video frame generation / synthetic data |
| Layer Addressed | Numerical layer (extension of sensor values) | Visual/language layer (contextual meaning-making) | Simulation layer (generation of physical phenomena) |
| Latency | Milliseconds to hundreds of milliseconds | Several seconds to tens of seconds | Tens of seconds to tens of minutes |
| Memory | 100MB to 1GB | Several GB to tens of GB | Tens of GB to 100GB+ |
| Real-time Capability | Can be incorporated into control loops | Fits into advisory loops | Batch/offline only |

When laid out side by side, it becomes clear that the 3 systems are not in opposition—they simply have different roles in terms of latency, memory, and real-time capability. The numerical layer handles control at the millisecond level, the visual layer returns semantic interpretation at the second level, and the simulation layer produces "plausible futures" as video at the minute level.
Plotting the 3 systems on two axes of latency and memory makes the operational image even clearer.
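As a rough way to make the two axes concrete, the ranges in the table can be written down as data and queried against a latency and memory budget. This is only a sketch: the numeric bounds below are my own order-of-magnitude readings of phrases like "hundreds of milliseconds" and "tens of GB" (assumptions, not measurements), and `LayerProfile` and `layers_within` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerProfile:
    name: str
    latency_s: tuple[float, float]  # (best case, worst case) in seconds
    memory_gb: tuple[float, float]  # (low, high) in GB
    loop: str


# Assumed numeric readings of the table's verbal ranges.
PROFILES = [
    LayerProfile("numerical", (0.001, 0.5), (0.1, 1.0), "control loop"),
    LayerProfile("visual", (2.0, 60.0), (4.0, 40.0), "advisory loop"),
    LayerProfile("simulation", (30.0, 1800.0), (30.0, 100.0), "batch/offline"),
]


def layers_within(latency_budget_s: float, memory_budget_gb: float) -> list[str]:
    """Return layers whose worst-case latency and memory fit both budgets."""
    return [
        p.name
        for p in PROFILES
        if p.latency_s[1] <= latency_budget_s and p.memory_gb[1] <= memory_budget_gb
    ]


print(layers_within(1.0, 8.0))       # sub-second budget on a small device
print(layers_within(60.0, 64.0))     # advisory budget
print(layers_within(3600.0, 128.0))  # overnight batch on a large machine
```

Under these assumed bounds, only the numerical layer survives a sub-second budget, the visual layer joins once a minute of latency is acceptable, and the simulation layer only fits a batch-style budget.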
Let's look at what each layer can do, referencing articles covered in the series.

Numerical Layer: The Domain of Time Series Foundation Models

This layer directly feeds numerical time series such as temperature, flow rate, pressure, and current from factory PLCs and sensors into models to produce future values and anomaly scores. In the series, we lined up 3 models in the comparison article for Chronos-2 / TimesFM 2.5 / NV-Tesseract.
Looking at the characteristics of the 3 main models in order: Chronos-2 has an encoder + 1 forward pass structure, so latency remains nearly constant even as the horizon extends (6.5ms → 6.8ms for 96 → 720 steps). The 28M model runs at 4ms / 84MB, small enough for edge devices like Jetson, which is its sweet spot. TimesFM 2.5, on the other hand, has an autoregressive structure that gets heavier with longer horizons (86ms → 229ms for h=96 → 192), but it benefits from context scaling: accuracy improves from MASE 1.106 → 0.770 as context extends from c=512 → 15,360. NV-Tesseract is an industrial time series-specialized model announced in partnership with Cognite and Celanese at GTC 2026. It is offered under an evaluation license, and hands-on verification remains homework for a separate article.
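The figures quoted above are easier to compare as ratios. The short calculation below uses only the numbers from this section; `growth` is a helper name introduced for illustration.

```python
def growth(before: float, after: float) -> float:
    """Return the ratio after / before."""
    return after / before


chronos_latency = growth(6.5, 6.8)           # horizon 96 -> 720 steps
timesfm_latency = growth(86.0, 229.0)        # horizon 96 -> 192 steps
mase_reduction = 1.0 - growth(1.106, 0.770)  # context 512 -> 15,360

print(f"Chronos-2 latency:   x{chronos_latency:.2f} for a x{720 / 96:.1f} longer horizon")
print(f"TimesFM 2.5 latency: x{timesfm_latency:.2f} for a x{192 / 96:.1f} longer horizon")
print(f"TimesFM 2.5 MASE:    {mase_reduction:.1%} lower with the longer context")
```

In other words, Chronos-2's latency grows by about 5% while the horizon grows 7.5x, whereas TimesFM 2.5's latency grows by about 2.7x when the horizon doubles, and the longer context cuts its MASE by roughly 30%.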
In the SKAB follow-up article focused on anomaly detection, we found an asymmetry: TimesFM mean outperforms Chronos-2 by +18 to 22 AUC points on some datasets, while running Chronos-2 in multivariate mode makes it 7-8x faster. A practical rule of thumb is Chronos-2 28M for edge deployment and TimesFM 2.5 mean when accuracy is the priority.
In the experiment connecting Chronos-2 to a PLC-style simulator, spike detection across 72h × 16,177 windows achieved AUC 0.999 / F1 0.83. On the other hand, gradual drift in wear+spike mode only reached AUC ≈ 0.51, showing through actual measurement that slow-progressing degradation is difficult to catch with the numerical layer alone. The visual layer and simulation layer described next are meant to fill this gap.
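A toy example can show how a forecast-residual score may behave this way. The sketch below is not the experiment from the article: it uses a rolling mean as a deliberately simple stand-in forecaster (not Chronos-2), a synthetic signal, and hypothetical names throughout. It only illustrates one plausible mechanism, namely that a forecaster conditioned on a recent window adapts to slow drift, so the residual stays small.

```python
import math
from statistics import mean, pstdev


def residual_scores(series: list[float], context: int = 20) -> list[float]:
    """Score each point by |actual - forecast| scaled by the context window's std.

    The forecast is a rolling mean, a simple stand-in for a foundation model;
    it is NOT how Chronos-2 forecasts.
    """
    scores = []
    for t in range(context, len(series)):
        window = series[t - context:t]
        scale = pstdev(window) or 1e-9  # avoid division by zero on flat windows
        scores.append(abs(series[t] - mean(window)) / scale)
    return scores


# Synthetic sensor signal: constant level plus a small periodic ripple.
base = [50.0 + 0.5 * math.sin(0.7 * t) for t in range(200)]

spike = list(base)
spike[120] += 10.0  # sudden one-sample spike

drift = [v + 0.01 * t for t, v in enumerate(base)]  # slow wear-like drift

print(f"max score with spike: {max(residual_scores(spike)):.1f}")
print(f"max score with drift: {max(residual_scores(drift)):.1f}")
print(f"total drift by the end: {drift[-1] - base[-1]:.2f}")
```

The spike stands out by an order of magnitude, while the drift, although it ends up several times larger than the ripple amplitude, never produces a score far from the normal range because each window has already absorbed most of it.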

Visual Layer: The Domain of Cosmos Reason / VSS Agent

This layer reads "what is happening" from surveillance cameras, manufacturing line camera footage, and still images for visual inspection. In NVIDIA's stack, this layer has two systems: standalone VLM inference (Cosmos Reason) and video search with an Agent loop (VSS Agent).
Cosmos Reason 2 and the Reasoner Tower of Cosmos 3 Nano are VLMs with 4 structured output functions (2D Grounding / Robot CoT / Embodied Reasoning / Temporal Localization). They suit one-shot use cases such as returning JSON answers to "Is PPE compliance maintained in this image?" or "Which process does this action belong to?" In the series, the article testing Cosmos Reason 2's structured reasoning covered 6 functions + PPE detection + video benchmarks with actual measurements. VSS Agent + Skills, on the other hand, is the NVIDIA Video Search and Summarization Blueprint, a larger package that uses Cosmos Reason as the VLM under the hood and adds video search + summary + Agent loop + MCP. It is covered in the VSS 3.1.0 EA re-verification article and in the reading piece that organizes VSS for the Agents + Skills era.
The differentiation is simple: use Cosmos Reason for standalone VLM inference, use VSS when you need video search, summary, and Agent loops. Since the Reasoner Tower in Cosmos 3 is now achieving performance on par with Cosmos Reason 2, it's natural to expect the Cosmos 3 series to be adopted as the VLM for VSS going forward. I'll revisit this in the later section "Three Layers Updated by Cosmos 3."
While the numerical layer answers "is the value off right now" in under a second, the visual layer returns "how to interpret the currently visible situation" in a few seconds. The structure where the visual layer adds meaning to alerts from the numerical layer is the setup that tends to work well in real projects.
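For the one-shot JSON use case mentioned in this section, the receiving side usually needs to validate the answer before acting on it. The sketch below assumes a hypothetical response schema (`compliant`, `reason`, `boxes`); it is not the actual output format of Cosmos Reason, and the sample string is hand-written.

```python
import json
from dataclasses import dataclass


@dataclass
class PpeAnswer:
    compliant: bool
    reason: str
    boxes: list  # one [x1, y1, x2, y2] list per flagged region


def parse_ppe_answer(raw: str) -> PpeAnswer:
    """Validate a JSON answer from a VLM; raise ValueError on malformed output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("compliant"), bool):
        raise ValueError("'compliant' must be true or false")
    boxes = data.get("boxes", [])
    if not isinstance(boxes, list) or not all(
        isinstance(b, list) and len(b) == 4 for b in boxes
    ):
        raise ValueError("each box must be [x1, y1, x2, y2]")
    return PpeAnswer(data["compliant"], str(data.get("reason", "")), boxes)


# Hypothetical model output, hand-written for this example.
sample = '{"compliant": false, "reason": "no helmet", "boxes": [[120, 40, 210, 160]]}'
answer = parse_ppe_answer(sample)
print(answer.compliant, answer.reason, len(answer.boxes))
```

Rejecting malformed answers early keeps free-form model text from leaking into whatever consumes the alert downstream.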

Simulation Layer: The Domain of World Models

This layer generates "plausible next videos" from past footage or a single image. The article verifying Cosmos Predict 2.5 + Reason2 covered world foundation models.
There are mainly 3 scenarios where world models are effective in factory AI. The first is synthetic data generation: abnormal patterns that can't be collected in sufficient quantity on real equipment are generated as video and used to pre-train the visual layer's Cosmos Reason models. The second is the Sim2Real bridge: robot arm motion data is augmented through simulation and transferred to real-equipment ACT/VLA. The third is the Digital Twin: physical phenomena in a process are reproduced as video, for impact prediction before equipment changes or for training materials.
In an actual measurement on DGX Spark, Cosmos Predict 2.5 (2B model) took about 30 minutes to generate a 1280×704 video in 36 steps. Rather than a layer for real-time control, it is realistically used upstream to create data or downstream to create review materials.
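A quick back-of-the-envelope calculation shows why this is a batch workload. The 30 minutes and 36 steps come from the measurement above; the 8-hour overnight window is an assumption introduced only for illustration.

```python
total_minutes = 30  # approximate wall-clock time per clip (from the measurement)
steps = 36          # generation steps per clip

seconds_per_step = total_minutes * 60 / steps
clips_per_window = (8 * 60) // total_minutes  # assumed 8-hour overnight window

print(f"{seconds_per_step:.0f} s per step")
print(f"{clips_per_window} clips per 8-hour window")
```

At roughly 50 seconds per step and about 16 clips per night on a single machine, generation has to be scheduled ahead of time rather than triggered by a live alert.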
And with Cosmos 3, the world model was restructured into omnimodels called Cosmos 3 Nano / Super, evolving to handle text / image / video / audio / action with a single model.

Collaboration Scenarios for the 3 Systems

Applying the 3 systems to manufacturing use cases naturally reveals collaboration patterns.


| Scenario | Role of Time Series FM | Role of Cosmos Reason | Role of World Model |
| --- | --- | --- | --- |
| A: Integrated Anomaly Detection | Immediate anomaly score from PLC sensors | Supplementary visual anomaly judgment from surveillance camera footage | Pre-train Cosmos Reason with synthetic abnormal video |
| B: Quality Digital Twin | Cross-referencing actual measurements with predictions | Explain anomaly grounds in natural language | Reproduce physical phenomena of processes as video |
| C: New Employee Training / SOP Compliance | Real-time warning of setpoint deviations | Assess procedure compliance from worker movements | Generate failure pattern video for training materials |
| D: Traceability Enhancement | Numerical log of manufacturing history | Extract process events from video logs | Digital Twin reproduction of past processes |

For example, writing out the Scenario A collaboration as a timeline looks like this:
This timeline makes it easy to grasp the structure where the 3 systems "view the same event from different angles." It's a layered design where the numerical layer responds immediately, the visual layer supplements, and the LLM verbalizes. Cosmos-series world models don't enter this direct operational loop, but they contribute to raising overall accuracy by mass-producing anomaly footage in the pre-training stage.
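The layered flow described above (the numerical layer responds first, the visual layer supplements, the LLM verbalizes) can be sketched as a small orchestration function. Everything here is hypothetical: `Event`, `handle_alert`, the 0.8 threshold, and the two stand-in callables are illustrative names and values, not part of any NVIDIA API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Event:
    sensor: str
    anomaly_score: float
    notes: list = field(default_factory=list)


def handle_alert(
    event: Event,
    visual_check: Callable[[str], str],
    summarize: Callable[[list], str],
    threshold: float = 0.8,  # assumed escalation threshold
) -> Optional[str]:
    """Numerical layer gates the flow; visual layer supplements; LLM verbalizes."""
    if event.anomaly_score < threshold:
        return None  # nothing to escalate
    event.notes.append(f"{event.sensor}: score {event.anomaly_score:.2f}")
    event.notes.append(visual_check(event.sensor))  # seconds-scale, advisory
    return summarize(event.notes)  # operator-facing text


# Stand-in callables; a real system would call the deployed models here.
def fake_visual_check(sensor: str) -> str:
    return f"camera near {sensor}: coolant leak visible"


def fake_summarize(notes: list) -> str:
    return " | ".join(notes)


print(handle_alert(Event("pump_3_pressure", 0.93), fake_visual_check, fake_summarize))
print(handle_alert(Event("pump_3_pressure", 0.12), fake_visual_check, fake_summarize))
```

The slower layers are only invoked after the fast numerical score crosses the threshold, which keeps the seconds-scale calls out of the control loop. The world model does not appear here because it contributes offline, at the pre-training stage.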

Three Layers Viewed Through NVIDIA's Strategy Map

NVIDIA has intentionally positioned the 3 systems as complementary, and the strategic outline became clear at GTC 2026 / GTC Taipei. It is a three-pillar structure: NV-Tesseract (launched in partnership with Cognite and Celanese) handles the numerical sensor prediction and anomaly detection domain; VSS + Cosmos Reason (growing through partner case studies with Invisible AI / Tulip / Fogsphere / Pegatron, etc.) is central to video-based factory visualization; and world models, restructured from Cosmos Predict into the Cosmos 3 omnimodel, handle synthetic data, Digital Twins, and Policy Models.
At the GTC Taipei keynote, Cosmos 3 was announced as a foundational model for Physical AI alongside Alpamayo 2 (a reasoning VLA for autonomous driving) and Isaac GR00T (a reference for humanoid robots). NVIDIA organizes Physical AI into 3 domains ("general AI / autonomous driving / humanoid") and positions foundation models and reference designs for each. In the context of factory AI, one can read this as NVIDIA intentionally assigning models to the three layers of numerical, visual, and simulation.

Three Layers Updated by Cosmos 3

With Cosmos 3 released at the GTC Taipei keynote, both the visual layer and simulation layer have been updated. Here's a summary of the key points.


| Use Case | Previous Generation (Current) | Cosmos 3 (New) |
| --- | --- | --- |
| Visual inspection / status description | Cosmos Reason 2-8B | Cosmos 3 Nano's Reasoner Tower (on par with Cosmos Reason 2) |
| Pre-manufacturing simulation | Cosmos Predict 2.5 | Cosmos 3 Nano (omnimodel, 4-modality integration including sound and process audio) |
| Policy Model (robot integration) | N/A | Cosmos 3 Nano (omnimodel, action generation for SO-101, etc.) |

Cosmos 3 is released as the omnimodels Nano (16B) and Super (64B), with Edge (4B) coming soon. The license is OpenMDW 1.1 (Linux Foundation, commercial use permitted). The Reasoner Tower responsible for understanding and the Generator Tower responsible for generation, which were separate in the previous generation, are integrated into a single omnimodel, with the option to extract just the Reasoner Tower as a VLM during inference. For actual behavior and benchmark values of each model on real hardware, please also refer to the following articles.
https://dev.classmethod.jp/articles/dgx-spark-cosmos3-omni-world-model-policy/
https://dev.classmethod.jp/articles/dgx-spark-cosmos3-family-usecase-map/
Loaded together onto DGX Spark's 128GB unified memory, Cosmos Reason 2, Cosmos 3 Nano, and Chronos-2 fit within about 50GB, so a configuration where the visual layer + numerical layer are complete on a single machine is realistic. For serious world model generation, it still seems best to run it at a separate time.
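The memory reasoning can be checked with a few lines of arithmetic. The 128GB and roughly 50GB figures come from the text above; the 8GB reserve for the OS and runtime buffers, and the two example job sizes taken from the "tens of GB to 100GB+" range, are assumptions for illustration.

```python
UNIFIED_MEMORY_GB = 128  # DGX Spark unified memory
RESIDENT_GB = 50         # Cosmos Reason 2 + Cosmos 3 Nano + Chronos-2 (approx.)


def fits_alongside(extra_gb: float, reserve_gb: float = 8.0) -> bool:
    """True if an extra workload fits next to the resident models.

    reserve_gb is an assumed margin for the OS and runtime buffers.
    """
    return RESIDENT_GB + extra_gb + reserve_gb <= UNIFIED_MEMORY_GB


headroom = UNIFIED_MEMORY_GB - RESIDENT_GB
print(f"headroom before reserve: {headroom} GB")
print(fits_alongside(30.0))   # lower end of "tens of GB"
print(fits_alongside(100.0))  # "100GB+" class generation job
```

A smaller generation job could in principle share the machine, but a 100GB-class job cannot, which matches the suggestion to run world model generation at a separate time.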

Manufacturing 5-Layer Organization: Rule-Based and Learning-Based Layers

So far we've focused on the "3 systems of learning-based models," but in manufacturing environments, these sit on top of existing rule-based layers like PLC / SCADA / MES. Organizing both in 5 layers gives a more three-dimensional view of the AI side's role division.


| Layer | Nature | Examples |
| --- | --- | --- |
| PLC | Rules for equipment control | Stop when sensor value exceeds threshold, open valve, run motor |
| SCADA | Rules for monitoring, alarms, and operations | Alert when temperature exceeds upper limit, display on screen, record history |
| MES | Rules for manufacturing operations and process management | This lot flows in this process order, don't advance to next process if inspection result is NG |
| Predictive Models | Estimate future states or early signs of anomalies from past/current data | Given this temperature/vibration/quality trend, there's a high likelihood of anomaly in a few hours |
| Optimization Models | Select better actions within multiple constraints | Suggest optimal conditions considering quality, yield, delivery, and power cost |

Viewing it as the contrast between "rule-based layer = world of existing equipment" and "learning-based layer = today's three layers + LLM assistance" makes it easier to organize what kind of proposals are being discussed for manufacturing AI.
The decision materials that should be incorporated when designing a quality stabilization model for manufacturing can also be organized using the same three layers + LLM assistance.


| Decision Material | Content | Responsible Layer |
| --- | --- | --- |
| Equipment signals | Time series data such as temperature, flow rate, speed, pressure, current | Numerical layer (Chronos-2 / TimesFM) |
| Subtle anomaly signs | Gradients, fluctuations, micro-differences, combinations of multiple signals | Numerical layer (NV-Tesseract's forte) |
| Images/video | Appearance, color unevenness, chipping, cracking, misalignment, work status | Visual layer (Cosmos Reason) |
| Reference information / peripheral conditions | Raw material specifications, quality standards, procedure manuals, temperature/humidity, outdoor conditions | LLM + RAG (reference, knowledge) |
| Human expertise | Veterans' perspectives, correction sequences, startup intuition and other tacit knowledge | LLM formalization (prompts / FT material) |

The basic mapping is: equipment signals and subtle anomaly signs go to the numerical layer, images and video go to the visual layer, reference information and human expertise go to LLM assistance. World models don't directly correspond to these 5 items, but they function as a behind-the-scenes player filling data gaps by retroactively generating synthetic data for equipment signals and image/video.
When a real project brings the request to "build a quality stabilization model," taking stock of these 5 items first makes it easier to discuss what should go on which layer. That is the practical use of this organization.
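As a sketch of what "taking stock" could look like in practice, the mapping from the table can be held as a lookup and applied to whatever materials a project actually has. The dictionary keys are shortened labels from the table, and `take_stock` and `missing` are hypothetical helper names.

```python
ROUTING = {
    "equipment signals": "numerical layer",
    "subtle anomaly signs": "numerical layer",
    "images/video": "visual layer",
    "reference information": "LLM + RAG",
    "human expertise": "LLM formalization",
}


def take_stock(available: list) -> dict:
    """Group the decision materials a project has by responsible layer."""
    plan = {}
    for item in available:
        layer = ROUTING.get(item.lower())
        if layer is None:
            raise KeyError(f"unknown decision material: {item!r}")
        plan.setdefault(layer, []).append(item)
    return plan


def missing(available: list) -> list:
    """List the decision materials from the table that the project lacks."""
    have = {item.lower() for item in available}
    return [item for item in ROUTING if item not in have]


project = ["Equipment signals", "Images/video", "Human expertise"]
print(take_stock(project))
print(missing(project))
```

The list of missing items is where the discussion usually starts, for example whether synthetic data from a world model could fill a gap in images/video.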

Summary

We've broken down the learning-based layers of factory AI into 3 systems and reorganized them as they stand after the Cosmos 3 release. There are 4 key takeaways.
First, the 3 systems have different roles in terms of latency, memory, and real-time capability. The numerical layer handles control at the millisecond level, the visual layer returns semantic interpretation at the second level, and the simulation layer handles data generation at the minute level.
Second, the 3 systems are strongest when working together. The visual layer adds meaning to anomaly scores from the numerical layer, and the simulation layer works behind the scenes to supply pre-training data.
Third, with Cosmos 3, the visual layer and simulation layer have been significantly updated: the Reasoner Tower of Cosmos 3 Nano is on par with Cosmos Reason 2, and as an omnimodel, the world model domain made major advances with 4-modality integration and Policy Model.
Fourth, the structure of learning-based layers (prediction / optimization) sitting on top of rule-based layers (PLC / SCADA / MES) provides a useful reference framework when designing quality stabilization models for manufacturing.

Reference Links

Time Series Foundation Model Series:
Running and Comparing Time Series Foundation Models on DGX Spark
Predicting PLC-Style Time Series Data with Chronos-2 and Generating Maintenance Comments with Nemotron
Trying Industrial Sensor Anomaly Detection with SKAB and Time Series Foundation Models

VSS-Related Articles:
Investigating the Current State of Manufacturing VSS as Seen Through VSS 3.1.0 EA and Hannover Messe
Thinking Through Practical Use Cases for NVIDIA VSS + AI Agents + Skills in Everyday Field Settings

Cosmos-Related Articles:
Trying Structured Analysis of Images and Video with Cosmos-Reason2 on DGX Spark
Running NVIDIA Cosmos 3 on DGX Spark
Organizing the NVIDIA Cosmos 3 Family Usage Map on DGX Spark

NeMo Framework Overview:
A Bird's-Eye View of NVIDIA NeMo Framework: Spring 2026 Ecosystem Map and DGX Spark Series Index

