
I tried predicting PLC-style time series data with Chronos-2 and generating maintenance comments with Nemotron
This page has been translated by machine translation. View original
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
This time, I'll combine Amazon Chronos-2 with a local LLM on DGX Spark (Nemotron 3 Nano 30B-A3B-NVFP4) to create a single reference implementation for the following workflow on PLC-like data:
- Generate current, temperature, vibration, and ambient temperature at 1Hz for 72 hours using a custom PLC-like simulator
- Run multivariate prediction with Chronos-2 and calculate anomaly scores from residuals
- Extract high-score windows and have the local LLM write "maintenance comments" in Japanese
While structuring this as a hands-on that can go from 0 to 1 without a physical PLC, the "Production Extension Mapping" section at the end of each chapter shows the correspondence for replacing each component with real-environment services (AWS IoT SiteWise / Timestream / SageMaker / Bedrock, etc.).
Note that the scope of this article covers a 0-to-1 hands-on with Chronos-2 alone and LLM integration. Connection paths to physical PLCs and comparisons with other time series foundation models such as TimesFM 2.5 are out of scope and planned for separate articles.
All validation is completed on a single DGX Spark (128GB).
Validation Configuration
The validation configuration is as follows.
Each block can be directly replaced with AWS IoT SiteWise (PLC data collection) → Timestream / TimescaleDB (time series DB) → SageMaker Endpoint (time series model) → Bedrock / same LLM (interpretation layer), making this a reference implementation for production predictive maintenance.
Design Philosophy of the PLC-like Simulator
For this validation, I prepared a PLC-like simulator.
- Readers can reproduce the same series on their own machines (an article that requires a physical PLC to proceed would leave you with nothing running after finishing it)
- Since we can control the timing of degradation and sudden spikes as we like, it's easier to isolate what Chronos-2 picks up and what it misses
- The narrative of "normal interval → degradation progression → sudden anomaly" can be compressed into 72 hours
The connection path to a physical PLC (OPC UA / Modbus TCP / MQTT) is outside the scope of this article, so we'll proceed with the simulator playing the role of the PLC.
The design uses the following 6-block structure.
FactoryState Operation mode / Shift / Ambient temperature
LoadGenerator Day/night + sine wave + noise + switching dip
EquipmentPhysics Load → Current → Temperature → Vibration chain
SensorGenerator Gaussian noise + rare dropout
FailureInjector Linear degradation + nonlinear acceleration + sudden spikes
StreamPublisher CSV output + deque of recent window
The key point is that EquipmentPhysics builds "Load → Current → Temperature → Vibration" as a multivariate causal chain. Time series foundation models like Chronos-2 that can handle covariates should leverage these inter-sensor correlations to improve prediction accuracy, so the intent is to provide data with the same properties as real-world data.
Simulator Implementation
EquipmentPhysics aimed for "simple linear + wear-dependent without overfitting." To prevent Chronos-2 from memorizing the context and perfectly predicting the future, the equations are kept straightforward. Current, temperature, and vibration are calculated as follows.
current = line_speed * 0.10 + 2.0 * wear # Current increases with wear
bearing_temp = ambient + load * 0.30 + current * 0.20 # Temperature rises with load + current
vibration = 1.0 + 0.5 * wear # Vibration increases with wear
FailureInjector adds linear degradation (wear_per_step = 5e-6 / step), plus a latter phase where progression accelerates 2.5x when wear exceeds 1.0. Sudden spikes occur with probability 0.0008/step, implemented as a state machine that lasts 5 steps and then naturally subsides.
Generating 72 hours (259,200 rows) at 1Hz produced the following output.

Current, temperature, and vibration all gradually rise over time. Temperature oscillates sinusoidally due to the 24-hour ambient temperature cycle layered on top. Vibration can be seen spiking up to 5–6 mm/s during sudden spikes. Wear (true value) exceeds 0.5 (label switching threshold) around the 28-hour mark, passes 1.0 (nonlinear threshold) around the 56-hour mark and accelerates, reaching approximately 1.8 by the 72-hour point.
The label distribution is as follows.
| Label | Count | Ratio |
|---|---|---|
| normal | 99,640 | 38.4% |
| wear | 158,635 | 61.2% |
| spike | 925 | 0.4% |
Prediction with Chronos-2
Chronos-2 is a foundation model that can perform zero-shot multivariate time series prediction, capable of predicting all sensors simultaneously with a 3D tensor input of predict_quantiles((1, n_var, ctx)). In this article, I wrote a minimal wrapper myself.
from chronos import BaseChronosPipeline
pipeline = BaseChronosPipeline.from_pretrained(
"autogluon/chronos-2-small", # 28M
device_map="cuda",
dtype=torch.bfloat16,
)
# X: (T, V) = (3600, 4) 3600 rows × 4 sensors
ctx = torch.tensor(X.T[None, :, :], dtype=torch.float32) # (1, V, T)
quantiles_list, _ = pipeline.predict_quantiles(
ctx,
prediction_length=16,
quantile_levels=[0.1, 0.5, 0.9],
)
preds = quantiles_list[0].cpu().numpy()[:, :, 1].T # (h, V) median
Running 72 hours worth of data (16,177 windows, ctx=256 / h=16 / stride=16) on DGX Spark produced the following latencies. The numbers are measured in spike-only mode; latency is virtually the same in wear+spike mode.
| Model | Parameters | Latency p50 | p95 | Wall (16,177 win) |
|---|---|---|---|---|
Chronos-2 28M (autogluon/chronos-2-small) |
28M | 4.07 ms | 4.16 ms | 66.3 s |
Chronos-2 120M (amazon/chronos-2) |
120M | 7.06 ms | 7.29 ms | 114.9 s |
The 28M model processes approximately 4ms per window, comfortably fitting within a typical PLC scan cycle (100ms to 1 second). The 120M is approximately 1.7x slower, but still under 10ms, which is well within the budget for near-real-time monitoring.
Overlaying the prediction curve for a single window looks like this.

For all four channels—current, temperature, vibration, and ambient temperature—the median prediction gently extends as a continuation. Chronos-2 has a tendency to straightforwardly extend the direction of the context, so gradual drift is directly reflected in the predicted values.
Anomaly Score and Threshold
The difference between prediction and actual is treated as a residual, converted to per-sensor residuals in z-score space, and then aggregated. The z-score baseline is fit from the first 6 hours (the normal interval before degradation).
diff = (pred - truth) / std # (N, h, V)
per_sensor = np.mean(np.abs(diff), axis=1) # (N, V)
score_mean = per_sensor.mean(axis=1) # Aggregated (N,)
Three aggregation strategies were tested: mean, max, and pca. Mean averages residuals across all sensors for stability, max is sensitive to outliers in a single sensor, and pca projects per-sensor residuals onto the principal component direction to capture breakdowns in inter-sensor correlation structure.
Let me clarify the validation approach here. Since Chronos-2 is a foundation model that straightforwardly predicts the continuation of a time series, gradual degradation (wear) is reflected directly in the predicted values as "a continuation of the current trend." As a result, the residual between prediction and actual remains small even during degradation, and only spikes up significantly during sudden spikes.
Without accounting for this, when AUC was measured with "wear + spike both as positive labels," both 28M and 120M produced nearly random scores of approximately 0.51. Detecting wear from residuals alone is difficult for Chronos-2 by itself. Detecting gradual drift can be delegated to a separate layer such as moving averages, and this article focuses on sudden spikes where Chronos-2 excels.
[metrics:mean] AUC=0.51 F1=0.56 FAR=0.49 MAR=0.49
[metrics: max] AUC=0.51 F1=0.55 FAR=0.51 MAR=0.50
[metrics: pca] AUC=0.52 F1=0.56 FAR=0.49 MAR=0.49
When "spike only as positive label" is used, the results change dramatically.
| Model | Label | AUC (best) | F1 | FAR | MAR |
|---|---|---|---|---|---|
| 28M | wear + spike | 0.517 (pca) | 0.56 | 0.49 | 0.49 |
| 28M | spike-only | 0.999 (pca) | 0.83 | 0.006 | 0.000 |
| 120M | wear + spike | 0.518 (pca) | 0.56 | 0.49 | 0.49 |
| 120M | spike-only | 0.994 (mean) | 0.75 | 0.007 | 0.11 |
In spike-only mode, the 28M achieved MAR=0.000, meaning it missed none of the 232 spikes. The 120M also achieves AUC=0.99, nearly equivalent, but F1 drops from 0.83 to 0.75 and MAR rises to 0.11.
With the combination of this article's simulator and spike-only labels, a larger model is not necessarily advantageous for anomaly detection. A plausible explanation is that the 120M picks up too much of the subtle fluctuation just before a spike from the context and ends up predicting the incoming spike as a continuation of the trend, making residuals harder to rise. The benefits of multivariation and the compatibility between parameter scale and residual-based detection will be explored in a future article using a public dataset.
The relationship between anomaly scores and thresholds over time looks like this.

You can see the score spiking sharply only at the moments when spikes occur. The gradual wear progression stays in the background and barely appears in the score. This figure clearly demonstrates Chronos-2's characteristic of being strong on sudden events but weak on drift.
Going forward, if you want to capture the kind of degradation that actually matters on-site (the kind that quietly progresses over several days), rather than Chronos-2 alone, you'd want to run a moving average-based monitor in parallel as a separate layer that watches whether the absolute z-score values are gradually creeping up over time. I think it's realistic to treat models that are strong on sudden changes and techniques that are strong on slow changes as tools with fundamentally different intended uses and use them accordingly.
Generating Maintenance Comments with Nemotron LLM
Once the anomaly scores and residuals are available, they can be passed to the LLM to generate human-readable maintenance comments. Since we want everything to run on a single DGX Spark machine, I launched Nemotron 3 Nano 30B-A3B-NVFP4 as a local LLM via vLLM. With an Active 3B MoE architecture, inference is fast, and NVFP4 quantization brings it down to approximately 18GB.
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--host 0.0.0.0 --port 8001 \
--served-model-name nemotron-3-nano-nvfp4-local \
--max-model-len 8192 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.5 \
--enforce-eager \
--moe-backend flashinfer_cutlass \
--trust-remote-code
--moe-backend flashinfer_cutlass is required for NVFP4 quantization; forgetting this will cause loading to fail on the MoE portion of the Nemotron 3 series. --enforce-eager is an option that disables CUDA graph capture to stabilize loading.
The prompt was structured as follows.
SYSTEM = (
"You are a maintenance engineer for industrial equipment."
"Looking at anomaly scores and sensor values from a time series model,"
"please return a short maintenance comment (1-2 sentences) that would be useful on-site."
"Base your comments only on observed numerical values, not speculation."
"Avoid citing non-existent past cases or making definitive statements without grounds (such as 'will definitely fail')."
)
Observation information is passed as JSON in the user message. An important note: per_sensor_residual_zscore is a dimensionless anomaly measure (z-score), while recent_sensor_values contains raw sensor values in physical units (A / °C / mm/s), and the two must be treated as separate things.
{
"window_index": 228,
"anomaly_score_mean_zscore": 3.7965,
"per_sensor_residual_zscore": {
"motor_current_a": 0.1805,
"bearing_temp_c": 0.1083,
"vibration_mm_s": 14.8165,
"ambient_temp_c": 0.0809
},
"recent_sensor_values": {
"motor_current_a": 8.1126,
"bearing_temp_c": 42.5778,
"vibration_mm_s": 5.4951,
"ambient_temp_c": 22.4882
},
"scaler_baseline_anomaly_free_6h": {
"motor_current_a": { "mean": 8.0515, "std": 0.4946 },
"vibration_mm_s": { "mean": 1.0383, "std": 0.218 }
}
}
With temperature 0.2, Nemotron's thinking mode OFF (the reason is discussed in detail in the next section), and max_tokens=200, submitting 6 windows (3 true positives + 3 false positives) produced the following output.
True Positive (spike) Examples
[window 228] anomaly_score=3.80 / vibration residual z=14.82
The anomaly score is high at 3.7965, and in particular vibration
at 14.8 mm/s significantly exceeds the baseline (mean 1.04 ± 0.22),
suggesting maintenance is warranted.
However, motor current at 8.11 A and bearing temperature at 42.58 °C
remain near their normative average values, and the basis for the anomaly
is limited to vibration data.
[window 6557] anomaly_score=3.54 / vibration residual z=13.80
The mean Z-score of the anomaly score is high at 3.54, and vibration
has risen to approximately 22.5 times the mean (4.98 mm/s).
While motor current and bearing temperature are within normal range,
the vibration anomaly is pronounced.
The comments identify vibration as the cause while also noting that other sensors are normal—resulting in text that maintenance staff can read directly to understand the situation.
False Positive (false alarm) Example
[window 5828] label=0 (normal) / anomaly_score=4.90 / vibration residual z=19.28
The recent vibration has risen significantly from the mean value (1.04 mm/s),
with the residual Z-score at approximately 19.3, which is abnormally high.
On the other hand, motor current, bearing temperature, and ambient temperature
remain near their average values, and no particular anomaly can be identified.
This is the most interesting response in this article. Chronos-2 judges this as "anomalous" with a score of 4.90, yet Nemotron writes "no particular anomaly can be identified" based on the fact that the other sensors are normal. This follows the same reasoning process a maintenance engineer on-site would use when combining multiple pieces of evidence.
Response Quality and Caveats
Up to this point, output was fairly stable at temperature 0.2, but running at temperature 0.7 with 3 repetitions (same window × 3 requests) shows clear variation.
Failure Pattern 1: Confusion between z-score and physical units
[window 4784, rep 1] T=0.7
Vibration significantly exceeds the mean value (approximately 19.8 mm/s),
and an anomaly has been detected.
The actual vibration value is 5.8 mm/s, and 19.8 is the z-score residual. The LLM confused per_sensor_residual_zscore: 19.8 and recent_sensor_values: 5.82 in the prompt and wrote "19.8 mm/s."
Failure Pattern 2: The primary sensor differs across responses for the same window
[window 228, rep 1]
Motor current (8.11A) and bearing temperature (42.58°C) show high
anomaly scores (3.79) compared to normal data from the past 6 hours
and have been detected as outliers.
[window 228, rep 2]
Vibration is significantly elevated at 5.50 mm/s compared to the
mean of 1.04 mm/s, and a residual-Z score of 14.8 is the primary
driver of the anomaly score.
In reality, only vibration is anomalous; motor current and temperature are nearly within normal range. Rep 1 is a misreading, rep 2 is correct. Two responses to the same input point to completely different primary causes.
Failure Pattern 3: Logical contradiction
[window 5828, rep 0] T=0.7
The anomaly score (Z-score mean 4.9) and vibration residual (approximately 19.3)
are notable, and in particular an anomalous trend was observed where vibration
significantly exceeds the mean value.
However, since the recent measured value of vibration (approximately 5.8 mm/s)
does not significantly exceed the baseline mean (approximately 1.0),
please note that the current anomaly score is based on past residuals and does not
indicate an immediate risk of failure.
5.8 mm/s is approximately 5.6 times the baseline of 1.0 mm/s, which is clearly "significantly exceeding" it. The LLM cites the numbers but fails to compare them correctly.
Mitigations
Operating with temperature 0.2 + thinking mode OFF significantly reduces these failures.
max_tokens was set to 200 based on the assumption of "1-2 sentences of Japanese maintenance comments," and the actual measured completion_tokens also falls within 80-150. There is also an intent to avoid latency increases from excessive generation.
For hallucinations that persist despite this (particularly the confusion between z-score and physical units), it would be worth explicitly stating in the prompt that "z-score is dimensionless and recent_sensor_values are in physical units," or supplementing with retrieval augmentation that includes "reference values for the same sensor in the past."
There are three advantages unique to local LLMs: the ability to return maintenance comments offline even when the network is down; latency of 2-4 seconds is well within a minute-scale judgment loop; and since sensitive sensor data doesn't need to be sent to the cloud, it's easier to adopt in environments with data export restrictions.
Production Extension Mapping
At this point, we have a minimal reference implementation running on a single DGX Spark. The table below shows how each component maps to real-environment services when moving to production predictive maintenance. Please note that these are reference correspondences that have not been validated within this article.
| Minimal Configuration (this article, single DGX Spark) | Production Extension (AWS) | Production Extension (Azure / On-premises) |
|---|---|---|
| Custom PLC-like simulator | Physical PLC + OPC UA / Modbus TCP / MQTT | Same + Azure IoT Hub |
| CSV / deque for recent window | AWS IoT SiteWise / Timestream | InfluxDB / TimescaleDB / Azure Data Explorer |
| Chronos-2 local single-machine inference | SageMaker Endpoint / Bedrock Marketplace | Azure ML Endpoint / On-premises Triton |
| Nemotron Nano 30B-A3B-NVFP4 local | Bedrock Claude / Nova, or same Nemotron via cloud inference | Azure OpenAI / On-premises vLLM |
| matplotlib timeline | Amazon Managed Grafana / QuickSight | Azure Managed Grafana / Power BI |
| Single node | IoT Greengrass edge + AWS cloud aggregation | Azure IoT Edge + Azure ML |
| Manual prompting | Bedrock Agent / MCP Server integration | Azure AI Agent Service |
The data collection path for connecting to a physical PLC (OPC UA / Modbus TCP / MQTT / time series DB selection) is outside the scope of this article, and we focused specifically on the Chronos-2 + LLM combination under the premise that "the simulator is playing the role of the PLC."
Summary and Future Work
On a single DGX Spark, I connected "PLC-like data generation → Chronos-2 multivariate prediction → residual scoring → local LLM writing Japanese maintenance comments" into a single reference implementation. The main findings that emerged during validation are as follows.
- Chronos-2 is extremely strong on sudden spikes (spike-only AUC=0.999, MAR=0.000)
- Gradual drift is absorbed by the prediction model as "a continuation of the trend," making it difficult to detect from residuals alone (AUC≈0.51)
- A larger model is not necessarily advantageous for anomaly detection (28M F1=0.83 vs. 120M F1=0.75)
- Local LLM maintenance comments are fairly stable at temperature 0.2 + thinking OFF, but confusion between z-score and physical units and discrepancies in identified primary causes are noticeable at temperature 0.7
All validation was completed on a single DGX Spark (128GB). The validation code is available at himorishige/dgx-spark-blog/chronos2-plc-sim/. It is structured so that you can reproduce the same results on your own DGX Spark or any CUDA-compatible GPU by running uv sync && python run_simulation.py --hours 72 && python predict_pipeline.py --model chronos2-28m --positive-kinds spike && python comment_pipeline.py.
Future items I'd like to try include the following.
- Replacing the data collection side with AWS IoT SiteWise + Timestream
- Running Chronos-2 on a SageMaker Endpoint and having Bedrock Claude write maintenance comments via Lambda
- Making it callable from Claude Code via MCP Server so that development agents can directly interact with the simulator
- Exploring layering with Cosmos Reason (image/video understanding model), i.e., a three-tier Physical AI configuration
- Displaying anomaly scores and LLM comments side by side on a Grafana dashboard
What this validation confirmed is that the division of "using Chronos-2 to catch sudden events, and using separate techniques to monitor slow degradation" is practical for real-world use. Rather than trying to handle everything with a single model, the idea of leveraging each tool's strengths through combination seems like a solid foundation when bringing time series foundation models into real projects.
I hope this serves as a starting point for those who want to try Chronos-2 from scratch, or those who want to test a pipeline that returns "on-site language" using a local LLM.

