I tried predicting PLC-style time series data with Chronos-2 and generating maintenance comments with Nemotron

I tried predicting PLC-style time series data with Chronos-2 and generating maintenance comments with Nemotron

2026.05.26

This page has been translated by machine translation. View original

Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

This time, I'll combine Amazon Chronos-2 with a local LLM on DGX Spark (Nemotron 3 Nano 30B-A3B-NVFP4) to create a single reference implementation for PLC-like data, covering the following flow:

  • Generate current, temperature, vibration, and ambient temperature at 1Hz for 72 hours using a custom PLC-like simulator
  • Run multivariate prediction with Chronos-2 and calculate anomaly scores from residuals
  • Extract high-score windows and have the local LLM write "maintenance comments" in Japanese

The configuration is designed to allow a hands-on experience from 0 to 1 without actual PLC hardware, while the "Production Extension Mapping" section at the end shows how to replace each component with real-environment services (AWS IoT SiteWise / Timestream / SageMaker / Bedrock, etc.).

Note that the scope of this article covers the Chronos-2 hands-on from 0 to 1 and LLM integration only. Connection paths to actual PLCs and comparisons with other time-series foundation models such as TimesFM 2.5 are out of scope and will be covered in separate articles.

All verification is completed on a single DGX Spark (128GB) unit.

Verification Configuration

The verification configuration is as follows.

Each block can be directly replaced with AWS IoT SiteWise (PLC data collection) → Timestream / TimescaleDB (time-series DB) → SageMaker Endpoint (time-series model) → Bedrock / same LLM (interpretation layer), making this a reference implementation for production predictive maintenance.

Design Philosophy of the PLC-like Simulator

I prepared a PLC-like simulator for verification for the following reasons:

  • Readers can reproduce the same series on their own machines (an article that requires actual PLC hardware to proceed would leave readers with nothing running after finishing it)
  • Since we can control the timing of degradation and sudden spikes as we wish, it's easier to isolate what Chronos-2 picks up and what it misses
  • The story of "normal period → degradation progression → sudden anomaly" can be compressed into 72 hours

The connection path to actual PLCs (OPC UA / Modbus TCP / MQTT) is outside the scope of this article, and we proceed with the image of the simulator playing the role of the PLC.

The design uses the following 6-block structure:

FactoryState        Operation mode / shift / ambient temperature
LoadGenerator       Day/night + sine + noise + switching dip
EquipmentPhysics    Load → Current → Temperature → Vibration chain
SensorGenerator     Gaussian noise + rare dropout
FailureInjector     Linear degradation + nonlinear acceleration + sudden spikes
StreamPublisher     CSV output + deque of recent window

The key point is that EquipmentPhysics builds "load → current → temperature → vibration" as a multivariate causal chain. Time-series foundation models like Chronos-2 that can handle covariates should improve prediction accuracy by leveraging the correlations between sensors, so the intention is to provide data with the same characteristics as the real world.

Simulator Implementation

EquipmentPhysics aimed for "simple linear + wear-dependent that doesn't overfit." The equations are kept straightforward so that Chronos-2 doesn't memorize the context and perfectly predict the future. Current, temperature, and vibration were calculated as follows:

current = line_speed * 0.10 + 2.0 * wear                  # current increases with wear
bearing_temp = ambient + load * 0.30 + current * 0.20     # temperature rises with load + current
vibration = 1.0 + 0.5 * wear                              # vibration increases with wear

FailureInjector adds linear degradation (wear_per_step = 5e-6 / step), plus a second-phase region where progression accelerates 2.5x when wear exceeds 1.0. Sudden spikes occur with probability 0.0008/step and are implemented as a state machine that persists for 5 steps then naturally subsides.

Generating 72 hours (259,200 rows) at 1 Hz produced the following output:

72-hour timeline of simulator output

Current, temperature, and vibration all rise over time. Temperature oscillates sinusoidally due to the 24-hour cycle of ambient temperature on top. You can see vibration jumping up to 5–6 mm/s during sudden spikes. The wear (true value) exceeds 0.5 (label switch threshold) around hour 28, passes 1.0 (nonlinear threshold) around hour 56 and accelerates, reaching approximately 1.8 by hour 72.

The label distribution is as follows:

Label Count Ratio
normal 99,640 38.4%
wear 158,635 61.2%
spike 925 0.4%

Prediction with Chronos-2

Chronos-2 is a foundation model that can predict multivariate time series zero-shot, accepting a 3D tensor input predict_quantiles((1, n_var, ctx)) to predict all sensors simultaneously. In this article, I wrote a minimal wrapper myself.

from chronos import BaseChronosPipeline

pipeline = BaseChronosPipeline.from_pretrained(
    "autogluon/chronos-2-small",   # 28M
    device_map="cuda",
    dtype=torch.bfloat16,
)

# X: (T, V) = (3600, 4)  3600 rows × 4 sensors
ctx = torch.tensor(X.T[None, :, :], dtype=torch.float32)  # (1, V, T)
quantiles_list, _ = pipeline.predict_quantiles(
    ctx,
    prediction_length=16,
    quantile_levels=[0.1, 0.5, 0.9],
)
preds = quantiles_list[0].cpu().numpy()[:, :, 1].T   # (h, V) median

Running 72 hours worth of data (16,177 windows, ctx=256 / h=16 / stride=16) on DGX Spark yielded the following latencies. Values are measured in spike-only mode; latency is nearly identical in wear+spike mode.

Model Parameters Latency p50 p95 wall (16,177 win)
Chronos-2 28M (autogluon/chronos-2-small) 28M 4.07 ms 4.16 ms 66.3 s
Chronos-2 120M (amazon/chronos-2) 120M 7.06 ms 7.29 ms 114.9 s

The 28M model takes about 4 ms per window, easily fitting within a typical PLC scan cycle (100 ms to 1 second). The 120M model is about 1.7x slower, but still under 10 ms, which is well within the budget for near-real-time monitoring.

Overlaying the prediction curve for a single window looks like this:

Chronos-2 28M prediction vs actual (4 sensors, horizon=16)

For all 4 channels—current, temperature, vibration, and ambient temperature—the median prediction extends as a gentle continuation. Chronos-2 has a tendency to straightforwardly extend the direction of the context, so gradual drift is reflected directly in the predicted values.

Anomaly Score and Threshold

The deviation between prediction and actual is treated as residual, converted to per-sensor residuals in z-score space, then aggregated. The z-score baseline is fit from the first 6 hours (normal period before degradation).

diff = (pred - truth) / std            # (N, h, V)
per_sensor = np.mean(np.abs(diff), axis=1)   # (N, V)
score_mean = per_sensor.mean(axis=1)         # aggregated (N,)

Three aggregation strategies were tested: mean, max, and pca. Mean averages residuals across all sensors for stability, max is sensitive to outliers in a single sensor, and pca projects per-sensor residuals onto the principal component direction to capture breakdowns in inter-sensor correlation structure.

Let me first clarify the verification approach. Chronos-2 is a foundation model that straightforwardly predicts the continuation of a time series, so gradual degradation (wear) is reflected directly in the predicted values as "a continuation of the current trend." As a result, residuals between prediction and actual remain small even during degradation, and only jump significantly during sudden spikes.

Without accounting for this, when measuring AUC with "wear + spike both labeled as positive," both 28M and 120M came out to nearly random 0.51. Detecting wear from residuals alone is difficult for Chronos-2 by itself. Detecting gradual drift should be delegated to a separate layer such as moving averages, and this article focuses on sudden spikes where Chronos-2 excels.

[metrics:mean] AUC=0.51  F1=0.56  FAR=0.49  MAR=0.49
[metrics: max] AUC=0.51  F1=0.55  FAR=0.51  MAR=0.50
[metrics: pca] AUC=0.52  F1=0.56  FAR=0.49  MAR=0.49

When switching to "spike-only as positive label," the results change dramatically:

Model Label AUC (best) F1 FAR MAR
28M wear + spike 0.517 (pca) 0.56 0.49 0.49
28M spike-only 0.999 (pca) 0.83 0.006 0.000
120M wear + spike 0.518 (pca) 0.56 0.49 0.49
120M spike-only 0.994 (mean) 0.75 0.007 0.11

The 28M model achieved MAR=0.000 in spike-only mode, meaning it missed not a single one of the 232 spikes. The 120M model also achieved nearly equivalent AUC=0.99, but F1 dropped from 0.83 to 0.75 and MAR rose to 0.11.

With this article's simulator and spike-only label combination, a larger model is not necessarily advantageous for anomaly detection. The 120M model seems to pick up the subtle fluctuations just before a spike from the context too well, "predicting" the incoming spike as an extension of the prediction, making it harder for residuals to emerge. The relationship between the benefits of multivariatization and the compatibility of parameter scale with residual-based detection will be explored further in a separate article using public datasets.

The relationship between anomaly score and threshold over time looks like this:

Anomaly score and threshold (spike-only, Chronos-2 28M)

You can see the score spiking sharply only at the timing of spike occurrences. The gradual wear progression remains in the background and barely appears in the score. This clearly illustrates Chronos-2's character of being strong on sudden anomalies but weak on drift.

Going forward, to capture the degradation that's really important in the field (the kind that quietly progresses over several days), rather than using Chronos-2 alone, you'd want to run a parallel moving-average-based monitor that watches whether the absolute z-score value gradually rises over time. I think it's realistic to treat models strong on sudden changes and methods strong on slow changes as having different intended purposes and use them accordingly.

Maintenance Comment Generation with Nemotron LLM

Once we have the anomaly scores and residuals, we pass them to the LLM to turn them into maintenance comments that humans can read. To keep everything self-contained on a single DGX Spark, I launched Nemotron 3 Nano 30B-A3B-NVFP4 with vLLM as the local LLM. Being an Active 3B MoE, inference is fast, and NVFP4 quantization keeps it to around 18GB.

vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --host 0.0.0.0 --port 8001 \
  --served-model-name nemotron-3-nano-nvfp4-local \
  --max-model-len 8192 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.5 \
  --enforce-eager \
  --moe-backend flashinfer_cutlass \
  --trust-remote-code

--moe-backend flashinfer_cutlass is mandatory for NVFP4 quantization; forgetting this causes load failures in the MoE part of Nemotron 3 series. --enforce-eager disables CUDA graph capture to stabilize loading.

The prompt was structured as follows:

SYSTEM = (
    "You are a maintenance engineer for industrial equipment."
    "Look at the anomaly scores and sensor values from a time-series model and"
    "provide a short maintenance comment (1-2 sentences) useful for field personnel."
    "Base your comments only on observed numbers, not speculation."
    "Avoid fabricated past cases or definitive expressions without basis (e.g., 'will definitely fail')."
)

Observation information is passed in JSON in the user message. Note that per_sensor_residual_zscore is the dimensionless z-score anomaly degree, while recent_sensor_values is the raw sensor value in physical units (A / °C / mm/s), and the two must be treated as separate things.

{
  "window_index": 228,
  "anomaly_score_mean_zscore": 3.7965,
  "per_sensor_residual_zscore": {
    "motor_current_a": 0.1805,
    "bearing_temp_c": 0.1083,
    "vibration_mm_s": 14.8165,
    "ambient_temp_c": 0.0809
  },
  "recent_sensor_values": {
    "motor_current_a": 8.1126,
    "bearing_temp_c": 42.5778,
    "vibration_mm_s": 5.4951,
    "ambient_temp_c": 22.4882
  },
  "scaler_baseline_anomaly_free_6h": {
    "motor_current_a": { "mean": 8.0515, "std": 0.4946 },
    "vibration_mm_s": { "mean": 1.0383, "std": 0.218 }
  }
}

With temperature 0.2, Nemotron's thinking mode OFF (reasons detailed in the next section), and max_tokens=200, sending to 6 windows (3 true positives + 3 false positives) produced the following outputs:

True Positive (spike) Examples

[window 228] anomaly_score=3.80 / vibration residual z=14.82

The anomaly score is high at 3.7965, and vibration in particular
at 14.8 mm/s far exceeds the baseline (mean 1.04 ± 0.22),
suggesting maintenance is warranted.
However, motor current at 8.11 A and bearing temperature at 42.58 °C
are within normal mean ranges, so the anomaly basis is
limited to the vibration data.
[window 6557] anomaly_score=3.54 / vibration residual z=13.80

The mean Z-score of the anomaly score is high at 3.54, and vibration
has risen to approximately 22.5 times the mean
(4.98 mm/s).
Motor current and bearing temperature are within normal range, but
the vibration anomaly is prominent.

The responses identify vibration as the cause while also noting that other sensors are normal—text that maintenance personnel can read directly to understand the situation.

False Positive (false alarm) Example

[window 5828] label=0 (normal) / anomaly_score=4.90 / vibration residual z=19.28

Recent vibration has risen significantly from the mean (1.04 mm/s),
with the residual Z-score at approximately 19.3, which is abnormally high.
On the other hand, motor current, bearing temperature, and ambient temperature
are near their mean values, and can be judged as showing no particular anomaly.

This is the most interesting response in this article. While Chronos-2 judges a score of 4.90 as "anomalous," Nemotron writes "can be judged as showing no particular anomaly" based on the other sensors being normal. This mirrors how field maintenance personnel combine multiple pieces of evidence to make judgments.

Response Quality and Caveats

Up to this point, output was fairly stable at temperature 0.2, but running 3 repetitions at temperature 0.7 (same window × 3 requests) shows clear variability.

Failure Pattern 1: Confusion of z-score and physical units

[window 4784, rep 1] T=0.7

Vibration far exceeds the mean (approximately 19.8 mm/s),
so an anomaly has been detected.

The actual vibration value is 5.8 mm/s; 19.8 is the z-score residual. The LLM confused per_sensor_residual_zscore: 19.8 with recent_sensor_values: 5.82 in the prompt and wrote "19.8 mm/s."

Failure Pattern 2: Differing primary sensor across repetitions for the same window

[window 228, rep 1]
Motor current (8.11A) and bearing temperature (42.58°C),
compared to normal data from the past 6 hours, have high anomaly scores (3.79)
and have been detected as outliers.

[window 228, rep 2]
Vibration compared to mean 1.04 mm/s is now 5.50 mm/s,
significantly higher, with residual Z-score 14.8 being the primary driver
of the anomaly score.

In reality, only vibration is anomalous; motor current and temperature are within normal range. Rep 1 misreads, rep 2 is correct. Two responses to the same input point to completely different primary causes.

Failure Pattern 3: Logical contradiction

[window 5828, rep 0] T=0.7

The anomaly score (Z-score mean 4.9) and vibration residual (approximately 19.3)
stand out, and particularly an anomalous trend of vibration far exceeding the mean
was observed.
However, since the recent measured value of vibration (approximately 5.8 mm/s)
does not far exceed the baseline mean (approximately 1.0),
please note that the current anomaly score is based on past residuals
and does not indicate an immediate risk of failure.

5.8 mm/s is approximately 5.6 times the baseline of 1.0 mm/s, clearly "far exceeding" it. The LLM cites the numbers but fails to compare them correctly.

Mitigation

Running with temperature 0.2 + thinking mode OFF significantly reduces these failures.

max_tokens was set to 200 because the intent was "1–2 sentence Japanese maintenance comments," and measured completion_tokens also fell within 80–150. There's also the intention to avoid latency increases from over-generation.

For hallucinations that remain even then (especially confusion of z-score and physical units), improvements such as explicitly stating in the prompt that "z-score is dimensionless, recent_sensor_values are in physical units," or adding retrieval augmentation with "reference values for the same sensor in the past" seem necessary.

There are three advantages unique to local LLM. Maintenance comments can be returned offline even when the network is down; latency of 2–4 seconds is well within the minute-scale judgment loop; and since sensitive sensor data doesn't need to be sent to the cloud, it's easy to adopt in environments with data export restrictions.

Production Extension Mapping

At this point, we have a minimal reference implementation running on a single DGX Spark. When moving to production predictive maintenance, the following table shows how to replace each component with real-environment services. Please note that this is a reference mapping not verified within this article.

Minimal configuration (this article, single DGX Spark) Production extension (AWS) Production extension (Azure / on-premises)
Custom PLC-like simulator Actual PLC + OPC UA / Modbus TCP / MQTT Same + Azure IoT Hub
CSV / deque for recent window AWS IoT SiteWise / Timestream InfluxDB / TimescaleDB / Azure Data Explorer
Chronos-2 local single-machine inference SageMaker Endpoint / Bedrock Marketplace Azure ML Endpoint / on-premises Triton
Nemotron Nano 30B-A3B-NVFP4 local Bedrock Claude / Nova, or same Nemotron via cloud inference Azure OpenAI / on-premises vLLM
matplotlib timeline Amazon Managed Grafana / QuickSight Azure Managed Grafana / Power BI
Single node IoT Greengrass edge + AWS cloud aggregation Azure IoT Edge + Azure ML
Manual prompt Bedrock Agent / MCP Server integration Azure AI Agent Service

The collection path for connecting to actual PLCs (OPC UA / Modbus TCP / MQTT / time-series DB selection) is outside the scope of this article, and the focus remains on the Chronos-2 + LLM combination under the premise that "the simulator is playing the role of the PLC."

Summary and Next Steps

On a single DGX Spark, I connected "PLC-like data generation → Chronos-2 multivariate prediction → residual scoring → local LLM writes Japanese maintenance comments" into a single reference implementation. The main findings that emerged during verification are as follows:

  • Chronos-2 is very strong on sudden spikes (spike-only AUC=0.999, MAR=0.000)
  • Gradual drift is absorbed by the prediction model as "an extension of the prediction," making it difficult to detect with residual-based methods (AUC≈0.51)
  • A larger model is not necessarily advantageous for anomaly detection (28M F1=0.83 vs 120M F1=0.75)
  • Local LLM maintenance comments are fairly stable at temperature 0.2 + thinking OFF, but confusion of z-score and physical units and disagreement on primary cause become prominent at temperature 0.7

All verification was completed on a single DGX Spark (128GB). The verification code is available at himorishige/dgx-spark-blog/chronos2-plc-sim/. The configuration is set up to reproduce the same results on your own DGX Spark or any CUDA-compatible GPU with uv sync && python run_simulation.py --hours 72 && python predict_pipeline.py --model chronos2-28m --positive-kinds spike && python comment_pipeline.py.

Future items I'd like to try include:

  • Replacing the data collection side with AWS IoT SiteWise + Timestream
  • Running Chronos-2 on a SageMaker Endpoint and having Bedrock Claude write maintenance comments via Lambda
  • Setting up a configuration where Claude Code can call it through an MCP Server so development agents can directly interact with the simulator
  • Exploring hierarchical integration with Cosmos Reason (image/video understanding model), i.e., a three-layer Physical AI configuration
  • Displaying anomaly scores and LLM comments side by side on a Grafana dashboard

What this verification confirmed is that the division of "using Chronos-2 to catch sudden events and monitoring gradual degradation with a separate method" is practical for real-world use. Rather than trying to handle everything with a single model, the idea of combining tools by leveraging their strengths looks like it will be the foundation for bringing time-series foundation models into real projects.

I hope this serves as a starting point for those who want to try Chronos-2 from scratch, or for those who want to test a pipeline where a local LLM returns responses "in the language of the field."


生成AI活用はクラスメソッドにお任せ

過去に支援してきた生成AIの支援実績100+を元にホワイトペーパーを作成しました。御社が抱えている課題のうち、どれが解決できて、どのようなサービスが受けられるのか?4つのフェーズに分けてまとめています。どうぞお気軽にご覧ください。

生成AI資料イメージ

無料でダウンロードする

Share this article

AWSのお困り事はクラスメソッドへ