I tried anomaly detection of industrial sensors with SKAB and time series foundation models

2026.06.01

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

In a previous article, "Comparing Time Series Foundation Models Running on DGX Spark," I compared Chronos-2 and TimesFM 2.5 on prediction accuracy, latency, and context scaling using the ETTh1 benchmark. Near the end of that article, I spent a single paragraph on whether time series foundation models could be used for anomaly detection on PLC data in manufacturing. This article digs into that question with real data.

https://dev.classmethod.jp/articles/dgx-spark-timeseries-fm-3-bench/

I chose SKAB (Skoltech Anomaly Benchmark) as the subject. It contains 8-sensor time series recorded on an industrial pump testbed, with manually labeled real anomaly intervals. In the previous anomaly detection simulation, Spike / Level shift / Noise burst were artificially injected, but this time I observe how the models respond to anomaly patterns originating from real equipment.

The evaluation was conducted by running Chronos-2 (28M / 120M) and TimesFM 2.5 (200M) on DGX Spark, measuring ROC AUC / F1 / FAR / MAR across all 34 datasets.

Overview of the SKAB Dataset

SKAB is an open dataset published on GitHub by Skoltech (Skolkovo Institute of Science and Technology). The license is GPL-3.0, and the data structure is as follows.

  • 34 labeled CSV files (valve1/ 16 files, valve2/ 4 files, other/ 14 files) + 1 anomaly-free.csv
  • Each CSV is sampled at 1 Hz with approximately 1,000 rows (about 16 minutes), totaling 37,401 rows
  • 8 sensor columns (vibration acceleration ×2, current, voltage, pressure, body temperature, fluid temperature, flow rate) + anomaly label + changepoint marker
  • Overall anomaly rate is 34.9%, with 129 changepoints

For reference, here is what the 8-sensor time series from valve1/0.csv looks like. The red bands indicate labeled anomaly intervals.

SKAB valve1/0 8-sensor time series

You can see that Voltage is on a ~230V scale while Accelerometer is on a ~0.03 scale, meaning the scale differs by about 4 orders of magnitude between sensors. As I will discuss later, how to absorb this scale difference is directly tied to anomaly detection accuracy.

During the anomaly interval (rows 573–974), Temperature and Thermocouple gradually rise, Volume Flow Rate becomes choppy, and Voltage remains nearly normal. "Complex behavioral changes that are hard to notice by looking at a single sensor alone" is a pattern commonly seen in SKAB anomalies.

Evaluation Approach

The anomaly detection workflow was structured as follows.

  1. Learn a per-sensor z-score scaler from the anomaly-free section (the consecutive interval where anomaly=0 at the beginning) of each dataset
  2. Slice context and horizon using sliding_windows(context_len=256, horizon=16, stride=16), and assign the ground-truth label corresponding to the horizon interval as 1 if there is at least one anomaly within the interval
  3. Compute predictions for the horizon interval with each model, and calculate residuals per sensor in z-score space
  4. Aggregate per-sensor residuals into a single anomaly score using one of mean / max / pca
  5. Evaluate ranking capability with ROC AUC, and compute F1 / FAR / MAR at the best-F1 threshold
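Step 2 of the workflow can be sketched as follows. This is a minimal illustration rather than the code used for the measurements: the function name and the context_len / horizon / stride parameters come from the step above, while the values / labels arguments and the synthetic 1,000-row series are assumptions made for the example.

```python
import numpy as np

def sliding_windows(values, labels, context_len=256, horizon=16, stride=16):
    """Yield (context, target, window_label) triples.

    values: (T, n_sensors) array, labels: (T,) array of 0/1.
    window_label is 1 if at least one row of the horizon is anomalous.
    """
    n_rows = len(values)
    for start in range(0, n_rows - context_len - horizon + 1, stride):
        split = start + context_len
        end = split + horizon
        yield values[start:split], values[split:end], int(labels[split:end].any())

# Synthetic stand-in for one SKAB file (assumption): 1,000 rows, 8 sensors,
# with an anomaly interval placed at rows 573-974.
rng = np.random.default_rng(0)
values = rng.normal(size=(1000, 8))
labels = np.zeros(1000, dtype=int)
labels[573:975] = 1

windows = list(sliding_windows(values, labels))
n_positive = sum(label for _, _, label in windows)
print(len(windows), windows[0][0].shape, windows[0][1].shape, n_positive)
```

With these settings a 1,000-row file yields only a few dozen windows, which matters later when per-dataset statistics are estimated from them.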

I report four metrics because, in manufacturing anomaly detection, the balance between the missed alarm rate (MAR) and the false alarm rate (FAR) is what matters operationally. F1 summarizes that trade-off in a single value, while ROC AUC measures threshold-independent ranking performance, which makes results easy to compare across models.
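To make the four metrics concrete, here is a small NumPy sketch. The definitions FAR = FP / (FP + TN) and MAR = FN / (FN + TP) are my assumption of the usual convention, the pairwise ROC AUC is a plain reference implementation rather than the library call used in the evaluation, and the scores and labels are made-up toy values.

```python
import numpy as np

def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # undefined when only one class is present
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def metrics_at(scores, labels, threshold):
    """F1, FAR, and MAR when every window with score >= threshold is flagged."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pred = scores >= threshold
    tp = int((pred & labels).sum())
    fp = int((pred & ~labels).sum())
    fn = int((~pred & labels).sum())
    tn = int((~pred & ~labels).sum())
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    mar = fn / (fn + tp) if (fn + tp) else 0.0
    return {"threshold": float(threshold), "f1": f1, "far": far, "mar": mar}

def best_f1(scores, labels):
    """Try every observed score as a threshold and keep the one with the highest F1."""
    return max((metrics_at(scores, labels, t) for t in np.unique(scores)),
               key=lambda m: m["f1"])

# Toy per-window scores and labels (illustrative values only)
scores = np.array([0.10, 0.20, 0.15, 0.40, 0.80, 0.90, 0.35, 0.70])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
auc = roc_auc(scores, labels)
best = best_f1(scores, labels)
print(auc, best)
```

Note that a dataset with no positive windows returns NaN for AUC, which is why such a dataset has to be dropped from an AUC comparison.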

Z-score normalization was introduced to counter Voltage dominance. When residuals are aggregated as raw MSE, the Voltage MSE alone is on the order of ~100 while the other sensors are on the order of ~0.01, so Voltage swamps the score and anomalies in other sensors become invisible. Standardizing per sensor puts the residuals of every sensor on the same scale.
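The scaler fit can be sketched like this. It is an illustration under assumptions: the helper names are hypothetical, the two synthetic columns only mimic the Voltage and Accelerometer scales, and the small eps added to the standard deviation is my own guard against constant sensors.

```python
import numpy as np

def fit_scaler_on_leading_normal(values, labels, eps=1e-8):
    """Fit per-sensor mean/std on the run of anomaly=0 rows at the start of the file."""
    labels = np.asarray(labels)
    anomalous = np.flatnonzero(labels != 0)
    n_normal = anomalous[0] if anomalous.size else len(labels)
    if n_normal < 2:
        raise ValueError("need at least two leading normal rows to fit the scaler")
    head = np.asarray(values, dtype=float)[:n_normal]
    return head.mean(axis=0), head.std(axis=0) + eps

def zscore(values, mean, std):
    """Apply the pre-fitted per-sensor statistics."""
    return (np.asarray(values, dtype=float) - mean) / std

# Synthetic two-sensor example with a SKAB-like scale gap (assumed values)
rng = np.random.default_rng(0)
voltage = 230 + rng.normal(scale=10.0, size=1000)
accel = 0.03 + rng.normal(scale=0.001, size=1000)
values = np.column_stack([voltage, accel])
labels = np.zeros(1000, dtype=int)
labels[573:975] = 1

mean, std = fit_scaler_on_leading_normal(values, labels)
z = zscore(values, mean, std)

# Squared deviation per sensor over the normal section, before and after scaling
raw_mse = ((values[:573] - mean) ** 2).mean(axis=0)
z_mse = (z[:573] ** 2).mean(axis=0)
print(raw_mse, z_mse)
```

Before scaling, the first column dominates by many orders of magnitude. After scaling, both columns sit near 1, so a mean or max across sensors treats them equally.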

To ensure reproducibility, np.random.seed(0) and torch.manual_seed(0) were fixed, and each cell was run twice independently to obtain the standard deviation of AUC. Any dataset whose AUC std exceeded 0.05 was set to receive a third run. In practice, across all 4 cells × 34 datasets, there were 0 cases with std > 0.05, and both Chronos-2 and TimesFM 2.5 produced very stable results.
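The rerun rule is simple enough to express in a few lines. This sketch assumes the population standard deviation (NumPy's default, ddof=0); the dataset names and AUC values are placeholders, not measured results.

```python
import numpy as np

np.random.seed(0)  # same seed as in the text; torch.manual_seed(0) would sit next to it

STD_LIMIT = 0.05  # AUC std above this triggers a third run

def needs_third_run(auc_runs, limit=STD_LIMIT):
    """auc_runs maps dataset name -> list of AUCs from independent runs."""
    flagged = {}
    for name, aucs in auc_runs.items():
        std = float(np.std(aucs))  # ddof=0 is an assumption
        if std > limit:
            flagged[name] = std
    return flagged

# Placeholder values for two hypothetical datasets
auc_runs = {"A": [0.61, 0.62], "B": [0.40, 0.62]}
flagged = needs_third_run(auc_runs)
print(flagged)
```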

Overview of the Evaluation Pipeline

From data ingestion through aggregation and evaluation, the overall flow looks like this.

Anomaly Detection with Chronos-2

Chronos-2 by AWS returns predictions for multiple sensors at once in the form pipeline.predict_quantiles((1, n_var, ctx)). I measured latency and accuracy across all 34 datasets for both the "multivariate mode" where all 8 sensors are submitted together, and the "univariate 8-parallel mode" where each sensor is submitted one at a time, 8 times.

| Model × Mode | AUC median | AUC min | AUC max | Warm latency | GPU memory |
|---|---|---|---|---|---|
| 28M univariate | 0.6103 | 0.234 | 0.942 | 3.7 ms / sensor | 86 MB |
| 28M multivariate | 0.6376 | 0.255 | 0.945 | 4.1 ms / 8 sensors | 88 MB |
| 120M univariate | 0.6351 | 0.244 | 0.952 | 6.5 ms / sensor | 260 MB |
| 120M multivariate | 0.6234 | 0.247 | 0.960 | 7.3 ms / 8 sensors | 264 MB |

What catches the eye when looking at these numbers is that going multivariate with 28M raises AUC by +2.7 points, while with 120M it slightly degrades by -1.2 points. In the previous ETTh1 benchmark, "120M shows -7% OT MASE with multivariate" was also observed, so this is consistent as a trend — it appears that 120M with sufficient capacity is fine with univariate, while the smaller 28M gets a boost from cross-channel attention.

The latency side is even clearer. When processing 8 sensors per window, the 28M goes from 30 ms in univariate 8-parallel to 4.1 ms in multivariate, and the 120M goes from 52 ms to 7.3 ms, roughly a 7x speedup for both. Since the mechanism of returning predictions for 8 sensors in a single forward pass works directly in its favor, going multivariate appears to be not just cost-free but actually a net positive.
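The speedup figures follow directly from the warm latencies in the table above. As a quick check, assuming that univariate mode means 8 sequential calls per window:

```python
N_SENSORS = 8

def multivariate_speedup(per_sensor_ms, multivariate_ms, n_sensors=N_SENSORS):
    """Return (total ms for sequential univariate calls, speedup of one multivariate call)."""
    sequential_ms = per_sensor_ms * n_sensors
    return sequential_ms, sequential_ms / multivariate_ms

# Warm latencies taken from the table above
for model, per_sensor, multi in [("28M", 3.7, 4.1), ("120M", 6.5, 7.3)]:
    total, speedup = multivariate_speedup(per_sensor, multi)
    print(f"{model}: {total:.1f} ms -> {multi} ms ({speedup:.1f}x)")
```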

Chronos-2 28M / 120M univariate vs multivariate

Against a PLC scan cycle budget (typically 100 ms to 1 second), the 4.1 ms of Chronos-2 28M multivariate leaves roughly 24x headroom even at the 100 ms end. GPU memory is also around 88 MB, small enough to fit comfortably on edge devices in the Jetson Orin Nano class. Among the 34 datasets, 28M multivariate outperformed 120M multivariate on 18, so there seems to be no reason to dismiss 28M if edge deployment is a consideration.

Anomaly Detection with TimesFM 2.5

TimesFM 2.5's forecast(inputs=[1Darray]) is a univariate-only API that accepts a list of 1D inputs. To handle multivariate data, residuals from independently predicting 8 sensors must be aggregated into one score in a downstream step. Three aggregation strategies — mean / max / pca — were compared.

| Aggregation strategy | AUC median | AUC min | AUC max | Scenarios where it works best |
|---|---|---|---|---|
| mean | 0.7723 | 0.328 | 0.898 | Situations where all sensors show anomaly uniformly |
| max | 0.6923 | 0.364 | 0.830 | Situations where a single sensor spike is the cause |
| pca | 0.5538 | 0.216 | 0.875 | Situations where periodic correlation structure breaks (in theory) |

Mean is the clear winner with a median of 0.77. Max is good at catching sudden spikes in a single sensor but proved weak against noise. PCA aims, in theory, to capture "anomalies that deviate from the normal correlation structure," but with SKAB's dataset size (roughly 30–60 windows per dataset) the estimate of the first principal component tends to be unstable, and its AUC ended up even lower than max.
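For reference, the three aggregation strategies can be sketched as below. The mean and max branches are straightforward. The pca branch is only one plausible reading (distance from the first principal component of the residuals) and may differ from what was actually run. The residual matrix is synthetic.

```python
import numpy as np

def aggregate(residuals, method="mean"):
    """Collapse z-score residuals of shape (n_windows, n_sensors) to one score per window."""
    r = np.asarray(residuals, dtype=float)
    if method == "mean":
        return np.abs(r).mean(axis=1)
    if method == "max":
        return np.abs(r).max(axis=1)
    if method == "pca":
        # Assumed variant: reconstruction error after projecting onto the first PC
        centered = r - r.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        pc1 = vt[0]
        recon = np.outer(centered @ pc1, pc1)
        return np.linalg.norm(centered - recon, axis=1)
    raise ValueError(f"unknown method: {method}")

# Synthetic residuals: 40 windows x 8 sensors of unit noise
rng = np.random.default_rng(0)
residuals = rng.normal(size=(40, 8))
residuals[10, 2] = 8.0   # single-sensor spike
residuals[20, :] = 3.0   # moderate deviation on every sensor

top_mean = int(np.argmax(aggregate(residuals, "mean")))
top_max = int(np.argmax(aggregate(residuals, "max")))
pca_scores = aggregate(residuals, "pca")
print(top_mean, top_max, pca_scores.shape)
```

In this toy setup mean ranks the all-sensor deviation first while max ranks the single-sensor spike first, which mirrors the "works best" column of the table. With only 40 windows, the first principal component is estimated from very little data, which is the instability mentioned above.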

TimesFM 2.5 (c=512) aggregation strategy comparison

However, looking at the per-dataset distribution, other/1 shows pca = 0.875, the top among the 3 strategies, and other/3 also sees pca = 0.678 performing respectably. The pattern here is that mean wins overall on average, while other strategies can hit the mark in specific cases — a common case-by-case situation in manufacturing anomaly detection.

The next interesting question is how far back to look, i.e., context scaling. One SKAB dataset is approximately 1,000 rows (about 16 minutes). I compared running TimesFM 2.5 with context of 128 (about 2 minutes) versus 512 (about 8 minutes). In the previous parent article, extending to c=15,360 improved MASE by 30%, so I expected "longer context should always be advantageous for anomaly detection too," but as it turned out, it was not that simple.

| Context | AUC median (mean aggregation) | std | Number of evaluable datasets |
|---|---|---|---|
| c=128 | 0.6822 | 0.162 | 12 |
| c=512 | 0.7723 | 0.182 | 11 |

Overall, c=512 wins by about +0.09 in median AUC, but at the per-dataset level, some pairs show reversals.

| Dataset | c=128 | c=512 | Δ | Interpretation |
|---|---|---|---|---|
| valve1/0 | 0.5426 | 0.3284 | -0.214 | Short context is advantageous |
| valve2/3 | 0.7324 | 0.5520 | -0.180 | Short context is advantageous |
| other/1 | 0.4231 | 0.7917 | +0.369 | Long context is essential |
| valve1/2 | 0.6523 | 0.7902 | +0.138 | Long context is advantageous |

TimesFM 2.5 c=128 vs c=512 per-dataset comparison

Translating this into a manufacturing context, short context is suited for instantaneous anomalies such as sudden valve clogging or sharp shifts. On the other hand, for long-term trend anomalies such as disrupted operation cycles or collapsed periodicity, long context is essential — with c=128, these get treated as "normal for this equipment" and missed. Running short-context and long-context models in parallel and using both anomaly scores in an ensemble is a natural option that comes to mind.
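One way such an ensemble could look is sketched below. This is not something measured in this article: rank normalization and the max / average combination are my own choices, the scores are toy values, and both score arrays are assumed to refer to the same horizon windows.

```python
import numpy as np

def rank_normalize(scores):
    """Map scores to (0, 1] by rank so that differently scaled scores can be combined."""
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort().argsort() + 1  # 1 = lowest score; ties broken by position
    return ranks / len(scores)

def ensemble(short_scores, long_scores, how="max"):
    """Combine short-context and long-context anomaly scores per window."""
    a, b = rank_normalize(short_scores), rank_normalize(long_scores)
    return np.maximum(a, b) if how == "max" else (a + b) / 2

# Toy scores: the short-context model flags window 2, the long-context model flags window 5
short_scores = [0.10, 0.20, 0.90, 0.30, 0.15, 0.25]
long_scores = [1.0, 1.2, 1.1, 0.9, 1.3, 4.0]
combined = ensemble(short_scores, long_scores, how="max")
print(combined)
```

Taking the max keeps an alert alive when either model is confident, at the cost of more false alarms; averaging is the more conservative choice.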

Note that c=2048 was also tried, but even the longest SKAB dataset has 1,328 rows, and the constraint context + horizon ≤ T could not be satisfied, making all datasets unevaluable. Within the scale of SKAB, comparing c=128 and c=512 is the practical range.
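The constraint is easy to check numerically. The sketch below counts how many windows fit for a given context length, assuming the same horizon=16 and stride=16 as in the evaluation workflow (whether the TimesFM runs used exactly these values is an assumption).

```python
def n_windows(n_rows, context_len, horizon=16, stride=16):
    """Number of (context, horizon) windows that fit in a series of n_rows rows."""
    usable = n_rows - context_len - horizon
    return 0 if usable < 0 else usable // stride + 1

# Typical file (~1,000 rows) and the longest file (1,328 rows)
for c in (128, 512, 2048):
    print(c, n_windows(1000, c), n_windows(1328, c))
```

For a typical 1,000-row file this gives 54 windows at c=128 and 30 at c=512, and zero at c=2048 even for the 1,328-row file.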

Choosing Among the 3 Models

Having looked at Chronos-2 and TimesFM 2.5 separately, I will now line them up on the same 11-dataset subset. The TimesFM side was sampled down to 12 datasets because of inference cost, and other/2 was then excluded because no positive labels appeared, leaving 11. The AUC median for Chronos-2 was recomputed on the same 11 datasets.

| Cell | AUC median | std | AUC min | AUC max | Warm latency | GPU memory |
|---|---|---|---|---|---|---|
| chronos2-28m univariate | 0.5648 | 0.196 | 0.251 | 0.942 | 3.7 ms / sensor | 86 MB |
| chronos2-28m multivariate | 0.5602 | 0.191 | 0.255 | 0.937 | 4.1 ms / 8 sensors | 88 MB |
| chronos2-120m univariate | 0.5926 | 0.192 | 0.244 | 0.921 | 6.5 ms / sensor | 260 MB |
| chronos2-120m multivariate | 0.5557 | 0.189 | 0.272 | 0.921 | 7.3 ms / 8 sensors | 264 MB |
| TimesFM 2.5 mean (c=512) | 0.7723 | 0.182 | 0.328 | 0.898 | 229 ms / sensor | 944 MB |
| TimesFM 2.5 max (c=512) | 0.6923 | 0.131 | 0.364 | 0.830 | Same as above | Same as above |

SKAB Anomaly Detection ROC AUC Distribution

TimesFM 2.5 mean outperforms all Chronos-2 variants by +18 to +22 AUC points (in relative terms, +30 to +39%). However, when tracking per-dataset results, the two models differ quite a bit in how they capture anomalies.

| Dataset | C28 multi | C120 multi | TFM mean | Pattern |
|---|---|---|---|---|
| valve2/1 | 0.43 | 0.42 | 0.82 | Hard case that TimesFM picks up |
| valve2/3 | 0.26 | 0.27 | 0.55 | Same as above |
| valve1/0 | 0.52 | 0.56 | 0.33 | Only Chronos-2 picks it up |
| other/4 | 0.79 | 0.81 | 0.60 | Same as above |
| valve2/2 | 0.94 | 0.92 | 0.90 | Both models perform well |

To summarize broadly, TimesFM roughly doubles the AUC on hard cases where Chronos-2 fails entirely (valve2/1, valve2/3), while Chronos-2 stays mid-range or above on valve1/0 and other/4, where TimesFM falls behind. This is consistent with the hypothesis that "long-term trend anomalies suit TimesFM, short-term spike anomalies suit Chronos-2," and an ensemble combining both models emerges as the next logical step.

Looking at the latency side, the picture is entirely different.

Latency comparison for inferring 8 sensors in 1 window

The gap between Chronos-2 28M multivariate (4.1 ms) and TimesFM 2.5 (approximately 1,832 ms, i.e. 229 ms × 8 sensors) is about 450x. TimesFM wins on accuracy, but it fits neither the 100 ms PLC scan cycle budget nor the 1,000 ms upper bound for real-time judgment at 1 Hz.

The usage guidelines that can be drawn from this are as follows.

| Scenario | Recommendation |
|---|---|
| Edge deployment (PLC-direct IPC, Jetson, etc.) + real-time judgment | Chronos-2 28M multivariate (4 ms / 88 MB) |
| Factory server (aggregating multiple lines) + medium accuracy | Chronos-2 120M (uni or multi) |
| Server aggregation + accuracy-first (batch judgment OK) | TimesFM 2.5 mean (c=512) (229 ms / 944 MB) |
| Don't want to miss either type of anomaly | Ensemble of Chronos-2 28M multi + TimesFM 2.5 mean |

In manufacturing settings, a two-stage judgment approach — where the edge performs a primary judgment and only suspicious items are re-evaluated on the server side — is a standard practice. A natural arrangement would be to place Chronos-2 28M at the edge and TimesFM 2.5 at the aggregation layer.
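The two-stage arrangement can be sketched as a small cascade. Everything here is illustrative: the class name is hypothetical, and the two scoring callables are simple stand-ins (max and mean of absolute z-scores) for where Chronos-2 28M and TimesFM 2.5 would be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Window = Sequence[Sequence[float]]  # rows of per-sensor z-score residuals

@dataclass
class TwoStageDetector:
    edge_score: Callable[[Window], float]    # fast model at the edge
    server_score: Callable[[Window], float]  # slower, more accurate model on the server
    edge_threshold: float                    # kept low so anything suspicious is escalated
    server_threshold: float

    def judge(self, window):
        s_edge = self.edge_score(window)
        if s_edge < self.edge_threshold:
            return {"anomaly": False, "stage": "edge", "score": s_edge}
        s_server = self.server_score(window)
        return {"anomaly": s_server >= self.server_threshold,
                "stage": "server", "score": s_server}

# Stand-in scorers (assumptions, not the real models)
def edge_stub(window):
    return max(abs(x) for row in window for x in row)

def server_stub(window):
    flat = [abs(x) for row in window for x in row]
    return sum(flat) / len(flat)

detector = TwoStageDetector(edge_stub, server_stub, edge_threshold=1.0, server_threshold=1.0)
quiet = [[0.1, -0.2], [0.0, 0.3]]
spiky = [[0.1, 2.5], [0.0, 0.2]]
broad = [[1.5, -1.8], [2.0, 1.2]]
for name, window in [("quiet", quiet), ("spiky", spiky), ("broad", broad)]:
    print(name, detector.judge(window))
```

Only windows that clear the edge threshold pay the server-side latency, so the slow model runs on a small fraction of the traffic.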

Detection Example

The numbers alone can be hard to visualize, so let's take a specific look at how Chronos-2 28M multivariate detects anomalies in valve1/4 (a dataset with AUC 0.87).

valve1/4 Current sensor anomaly interval + anomaly score + detection result

The top panel shows the raw Current sensor signal (red band indicates the true anomaly interval), the middle panel shows the per-window anomaly score aggregated by mean, and the bottom panel shows the 0/1 detection result at the best-F1 threshold overlaid with the ground truth. The numbers — AUC 0.868 / F1 0.792 / threshold 0.750 — are not bad for manufacturing anomaly detection.

What is worth noting is that while the anomaly score forms a clear peak in the true anomaly interval (around rows 600 to 950), there are also small peaks around 0.5 in the normal intervals before it. In reality, it is hard to completely suppress false positives in noisy normal states, and even after optimizing the threshold at best-F1, the values come out to FAR 3.4% and MAR 30.4%. If erring on the side of safety, the threshold would be lowered to increase FAR while suppressing MAR — this kind of trade-off is the key discussion point when moving from PoC to production.
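To make this trade-off tangible, the sketch below picks the highest threshold that keeps MAR under a chosen cap and reports the FAR that results. The helper names, the FAR / MAR definitions (FP over negatives, FN over positives), and the toy scores are assumptions for illustration, not the valve1/4 data.

```python
import numpy as np

def far_mar(scores, labels, threshold):
    """FAR and MAR when every window with score >= threshold is flagged."""
    pred = scores >= threshold
    pos = labels == 1
    far = float((pred & ~pos).sum() / max((~pos).sum(), 1))
    mar = float((~pred & pos).sum() / max(pos.sum(), 1))
    return far, mar

def threshold_for_max_mar(scores, labels, max_mar):
    """Highest threshold whose miss rate stays at or below max_mar."""
    # MAR can only fall as the threshold drops, so the first hit has the lowest FAR
    for t in np.sort(np.unique(scores))[::-1]:
        far, mar = far_mar(scores, labels, t)
        if mar <= max_mar:
            return float(t), far, mar
    raise ValueError("no threshold satisfies the MAR cap")

# Toy scores: one anomalous window (0.40) scores below several normal ones
scores = np.array([0.10, 0.20, 0.30, 0.50, 0.45, 0.60, 0.75, 0.80, 0.90, 0.40])
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

relaxed = threshold_for_max_mar(scores, labels, max_mar=0.30)
strict = threshold_for_max_mar(scores, labels, max_mar=0.10)
print(relaxed, strict)
```

In this toy case, tightening the MAR cap from 0.30 to 0.10 moves the threshold from 0.75 down to 0.40 and raises FAR from 0 to 0.5.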

How to Set Up PLC Real-Time Integration

As a way to reproduce the "PLC → time series model" loop without having actual PLC hardware, a combination of OpenPLC (an open-source PLC emulator) and Node-RED is available. A configuration that streams SKAB CSV via Modbus TCP and returns an anomaly score from Chronos-2 looks like this.

For the latency budget: PLC scan (1 Hz, 1,000 ms) → InfluxDB write + latest window retrieval (5–20 ms) → Chronos-2 28M multivariate inference (4 ms) → threshold judgment + notification (1–5 ms), bringing the entire inference pipeline to 10–30 ms. Even if the scan cycle shortens to 100 ms, Chronos-2 28M still fits comfortably, whereas TimesFM 2.5 (1,832 ms) would not finish in time.
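The budget arithmetic can be written out as a small check. The stage ranges are the figures from the paragraph above; the function name and dictionary layout are just for illustration.

```python
# Stage latencies as (min_ms, max_ms), taken from the paragraph above
CHRONOS_STAGES = {
    "InfluxDB write + window fetch": (5, 20),
    "Chronos-2 28M multivariate inference": (4, 4),
    "threshold judgment + notification": (1, 5),
}

def pipeline_budget(stages, scan_cycle_ms):
    """Return (best case ms, worst case ms, whether the worst case fits in one scan cycle)."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst, worst <= scan_cycle_ms

# Same pipeline with TimesFM 2.5 swapped in (8 sensors x 229 ms)
TIMESFM_STAGES = dict(CHRONOS_STAGES)
del TIMESFM_STAGES["Chronos-2 28M multivariate inference"]
TIMESFM_STAGES["TimesFM 2.5 inference"] = (1832, 1832)

for cycle_ms in (1000, 100):
    print(cycle_ms, pipeline_budget(CHRONOS_STAGES, cycle_ms),
          pipeline_budget(TIMESFM_STAGES, cycle_ms))
```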

On the implementation side, three points help keep operation close to the 4 ms measured in this article: keep a deque on the Node-RED side instead of fetching 256 rows from InfluxDB every time; save the z-score scaler learned from the anomaly-free section in advance so that inference only has to apply it; and remember that Chronos-2's multivariate input expects a 3D tensor of shape (1, n_var, ctx).
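These three points can be combined in one small buffer class. This is a Python-side sketch of the same idea (the text places the deque in Node-RED); the class name is hypothetical, the scaler values are placeholders, and the model call itself is left out so that only the buffering, scaling, and reshaping are shown.

```python
from collections import deque
import numpy as np

class RollingContext:
    """Keeps the latest context_len scaled samples and exposes them as (1, n_var, ctx)."""

    def __init__(self, mean, std, context_len=256):
        # mean/std are the pre-saved z-score statistics from the anomaly-free section
        self.mean = np.asarray(mean, dtype=np.float32)
        self.std = np.asarray(std, dtype=np.float32)
        self.buffer = deque(maxlen=context_len)  # oldest sample drops out automatically

    def push(self, sample):
        """Append one scan's readings of shape (n_var,), scaled with the saved statistics."""
        self.buffer.append((np.asarray(sample, dtype=np.float32) - self.mean) / self.std)

    def ready(self):
        return len(self.buffer) == self.buffer.maxlen

    def as_model_input(self):
        ctx = np.stack(list(self.buffer), axis=0)  # (ctx, n_var)
        return ctx.T[np.newaxis, ...]              # (1, n_var, ctx)

# Placeholder scaler: first sensor on a Voltage-like scale, the rest unscaled
mean = [230.0] + [0.0] * 7
std = [10.0] + [1.0] * 7
context = RollingContext(mean, std, context_len=256)

for i in range(300):  # simulate 300 scans at 1 Hz
    context.push([230.0 + i] + [float(i)] * 7)

x = context.as_model_input()
print(context.ready(), x.shape)
```

Each new scan costs one append, and the array handed to the model always has the fixed (1, 8, 256) shape once the buffer is full.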

Three-Layer Architecture Combined with SCADA / MES

In actual manufacturing plants, PLCs sit under SCADA, which in turn sits under MES, and where to place the time series foundation model depends on the scale of the line. For small scale, run edge inference directly connected to the PLC. For medium scale, let SCADA aggregate the data, pass it to the model, and return anomaly scores to the HMI. For large scale, combine SCADA tags with MES process instructions for multivariate prediction, and feed the results back to MES for per-lot quality traceability. This is the natural progression.

Applying the model selection guidelines here, a three-layer architecture naturally takes shape.

The roles are cleanly separated: L1 for immediate alerting, L2 for detailed evaluation using both models in an ensemble, and L3 for daily and per-lot roll-up analysis. The Cognite × NVIDIA × Celanese case study introduced in Chapter 7 of the previous parent article can also be read as a setup where NV-Tesseract plays an active role at L2/L3, so reading it alongside this article's Chronos-2 / TimesFM 2.5 results should give a clearer picture of the overall manufacturing AI stack.

Summary

Evaluating Chronos-2 and TimesFM 2.5 on SKAB's real-equipment anomaly data across all 34 datasets, to see whether they "work" and how to choose between them, revealed the following key points.

  • Going multivariate is not just cost-free but actually beneficial. With Chronos-2 28M, going multivariate raises AUC by +2.7 points and speeds up latency by about 7x.
  • Models show clear strengths and weaknesses for different anomaly types. While TimesFM 2.5 mean ranks first overall with an AUC median of 0.77, Chronos-2 outperforms it on cases like valve1/0 where TimesFM drops — a complementary relationship.
  • Context scaling is not "longer is always better." Switching between c=128 and c=512 in TimesFM 2.5 results in reversals of up to ±0.37 depending on the dataset.
  • Balancing PLC scan cycle budget against accuracy requirements, the practical solution is a three-layer architecture with edge 28M / server 120M + TimesFM, and Chronos-2 28M multivariate (4 ms / 88 MB) can be used as-is at the edge layer.

Once NV-Tesseract (currently at evaluation license stage) becomes available to run locally, I would like to run another comparison against TimesFM 2.5 at the L2/L3 layers. Reading the Cognite × NVIDIA × Celanese case study's 4-sensor state prediction together with this article's 8-sensor anomaly detection should help paint a picture of the 2026 edition of the manufacturing AI stack.

