I tried running and comparing time series foundation models on DGX Spark

2026.05.25

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Division.

Starting in the fall of 2025, foundation models for time-series forecasting received major updates in rapid succession. Google's TimesFM 2.5, NVIDIA's NV-Tesseract (currently in evaluation license stage), and AWS's Chronos-2 — three leading models from three companies emerged within just two months. Each targets slightly different use cases such as demand forecasting, anomaly detection, and manufacturing sensor monitoring, and there's ongoing debate about which one comes out on top in benchmarks.

This article benchmarks Chronos-2 and TimesFM 2.5, both of which can be run locally, on actual hardware. NV-Tesseract is currently accessible only via DGX Cloud NIM, so it is covered as a read-through section based on official information and Cognite's case study. The verification environment is a DGX Spark I had on hand.

https://dev.classmethod.jp/articles/dgx-spark-chronos2-plc-sim-llm-maintenance/

Positioning and Comparison of the 3 Models

Organizing the three models by developer, distribution method, and intended use shows how each one sits inside its vendor's ecosystem. AWS integrates via SageMaker JumpStart and Bedrock Marketplace, Google integrates via BigQuery's AI.FORECAST, and NVIDIA provides its model as a NIM on DGX Cloud. Each company's platform strategy shows through clearly.

Placing the main specs side by side makes the commonalities and differences clearly visible.

| Item | Chronos-2 | TimesFM 2.5 | NV-Tesseract |
|---|---|---|---|
| Provider | AWS (Amazon Science) | Google Research | NVIDIA |
| Release | 2025-10-20 | 2025-09 | 2025 (Evaluation License) |
| Architecture | T5 encoder + Group Attention | Decoder-only + Quantile head | Transformer (details undisclosed) |
| Parameters | 120M / 28M | 200M | Undisclosed |
| Context Limit | 8,192 | 16,384 | Undisclosed |
| Multivariate Forecasting | Native support | Supported | Supported |
| Anomaly Detection | - (Forecast-focused) | AI.DETECT_ANOMALIES GA | Native (NV-Tesseract-AD) |
| Fine-Tuning | Publicly available | Publicly available | Via NIM only |
| License | Apache 2.0 | Apache 2.0 | Evaluation license |
| Availability | HF / GitHub | HF / JAX version available | DGX Cloud NIM only |
| Commercial Use | Available | Available | Conditional during evaluation |

Both Chronos-2 and TimesFM 2.5 are available under Apache 2.0 and can run locally, whereas NV-Tesseract is in the evaluation license stage and can only be accessed via DGX Cloud NIM, a notable difference in availability. As of May 2026, the general release timeline has not been announced.

Verification Environment

Both models are fully operable via Python library calls. Since dependency versions conflict, separate uv venvs were created for each.

| Item | Value |
|---|---|
| Hardware | DGX Spark (NVIDIA GB10, aarch64, 128GB UMA) |
| OS / Driver | Ubuntu 24.04 / NVIDIA driver 580.142 |
| Python | 3.13.13 |
| PyTorch | 2.12.0+cu130 (common to both venvs) |
| For Chronos-2 | chronos-forecasting==2.2.2, transformers==4.57.6 |
| For TimesFM 2.5 | timesfm==2.0.0 (git head d720daa6), safetensors==0.7.0 |

Trying Chronos-2

Install chronos-forecasting>=2.1.0 via pip (2.1.0 and later includes a bug fix for past-value covariates). For PyTorch, specify the cu130 wheel index explicitly to get a build for ARM64 + Blackwell GB10.

```bash
uv pip install --index-url https://download.pytorch.org/whl/cu130 \
  --extra-index-url https://pypi.org/simple/ torch
uv pip install "chronos-forecasting>=2.1.0"
```

The minimal sample looks like this. It runs end-to-end from BaseChronosPipeline.from_pretrained to predict_quantiles. After loading, a single inference pass yields 21 quantile levels; the sample requests three levels through quantile_levels.

```python
import numpy as np
import torch
from chronos import BaseChronosPipeline

# Load Chronos-2 onto the GPU in bfloat16
pipeline = BaseChronosPipeline.from_pretrained(
    "amazon/chronos-2", device_map="cuda", dtype=torch.bfloat16
)
# Input is a 3D tensor of shape (n_series, n_variates, history_length)
context = torch.tensor(np.arange(512, dtype=np.float32)[None, None, :])
# Returns two lists (quantiles and means) with one tensor per input series
quantiles_list, mean_list = pipeline.predict_quantiles(
    context, prediction_length=96, quantile_levels=[0.1, 0.5, 0.9]
)
print(quantiles_list[0].shape)  # torch.Size([1, 96, 3])
```

One thing to be careful about: predict_quantiles in chronos-forecasting>=2.2 has completely revamped argument names, input shapes, and return values, so directly copying from older articles or official examples won't work.

Trying TimesFM 2.5

The one thing to watch out for with TimesFM 2.5 is that the timesfm==1.3.0 distributed on PyPI does not yet include the TimesFM 2.5 API. The Hugging Face model card also explicitly states "pip install from PyPI coming soon. At this point, please git clone," meaning a source install from the GitHub HEAD is the only option for now. Since it's installed with --no-deps, safetensors must be specified separately.

```bash
uv pip install --index-url https://download.pytorch.org/whl/cu130 \
  --extra-index-url https://pypi.org/simple/ torch safetensors
uv pip install --force-reinstall --no-deps \
  "timesfm @ git+https://github.com/google-research/timesfm.git"
```

Here is the minimal sample. The pattern is to set the context and horizon limits using ForecastConfig and compile, then pass the actual prediction length to forecast.

```python
import numpy as np
import timesfm

model = timesfm.TimesFM_2p5_200M_torch.from_pretrained(
    "google/timesfm-2.5-200m-pytorch"
)
# compile() fixes the upper bounds for context and horizon
model.compile(timesfm.ForecastConfig(
    max_context=2048, max_horizon=256,
    normalize_inputs=True, use_continuous_quantile_head=True,
    fix_quantile_crossing=True,
))
# The actual prediction length is passed per call and must fit within max_horizon
point_forecast, quantile_forecast = model.forecast(
    horizon=96, inputs=[np.arange(512, dtype=np.float32)],
)
print(point_forecast.shape)  # (1, 96)
```

Benchmark Results

From here, the three model variants (Chronos-2 120M, Chronos-2 28M, and TimesFM 2.5 200M) are evaluated on the ETTh1 electricity dataset along three axes: latency, accuracy, and memory efficiency.

The OT column (oil temperature) from ETTh1 (Electricity Transformer Temperature, Hourly, 7 variables, 17,420 rows) was used. This is a dataset from a Chinese electric power company monitoring distribution transformers over 2 years (July 2016 to July 2018), aimed at predicting signs of overheating accidents from the transformer's internal oil temperature (OT) and 6 types of load variables (high/medium/low active/reactive power). Its structure closely resembles industrial equipment monitoring data in manufacturing, and since the Informer paper (AAAI 2021 Best Paper), it has been used as a standard benchmark for time-series forecasting.

Evaluation used 64 rolling-window splits taken from the tail of the series at step=24. Accuracy metrics are MASE (Mean Absolute Scaled Error) and RMSE. Inference latency is reported as the warm median, with cold and warm runs measured separately. Peak GPU memory is captured with torch.cuda.max_memory_allocated().
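The evaluation protocol above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the benchmark script itself: the helper names (rolling_windows, mase, seasonal_naive) are hypothetical, the series is a synthetic stand-in for the OT column, and the MASE scale is assumed to be the in-sample seasonal-naive error with a 24-hour period.

```python
import numpy as np

SEASON = 24  # assumed seasonal period: hourly data with a daily cycle


def rolling_windows(n_rows, context, horizon, n_windows=64, step=24):
    """Return (ctx_start, split, tgt_end) index triples taken from the tail."""
    windows = []
    for k in range(n_windows):
        tgt_end = n_rows - k * step
        split = tgt_end - horizon
        ctx_start = split - context
        if ctx_start < 0:
            raise ValueError("series too short for the requested windows")
        windows.append((ctx_start, split, tgt_end))
    return windows[::-1]  # oldest window first


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mase(y_true, y_pred, history, season=SEASON):
    """Forecast MAE divided by the in-sample seasonal-naive MAE."""
    scale = np.mean(np.abs(history[season:] - history[:-season]))
    return float(np.mean(np.abs(y_true - y_pred)) / scale)


def seasonal_naive(history, horizon, season=SEASON):
    """Repeat the last full season of the history across the horizon."""
    reps = int(np.ceil(horizon / season))
    return np.tile(history[-season:], reps)[:horizon]


# Synthetic stand-in for the OT column: a noisy daily cycle.
rng = np.random.default_rng(0)
t = np.arange(17_420)
series = 10.0 + 5.0 * np.sin(2 * np.pi * t / SEASON) + rng.normal(0.0, 0.5, t.size)

mase_scores, rmse_scores = [], []
for start, split, end in rolling_windows(len(series), context=512, horizon=96):
    history, target = series[start:split], series[split:end]
    forecast = seasonal_naive(history, horizon=96)  # swap in a model forecast here
    mase_scores.append(mase(target, forecast, history))
    rmse_scores.append(rmse(target, forecast))

mean_mase = float(np.mean(mase_scores))
mean_rmse = float(np.mean(rmse_scores))
print(f"MASE={mean_mase:.3f} RMSE={mean_rmse:.3f}")
```

With the seasonal-naive forecast plugged in, the mean MASE lands close to 1.0 by construction, which is the reference point the model scores are compared against.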

Short-term Forecasting (horizon=96, context=512)

| Model | Warm Median | Warm p95 | Peak GPU mem | MASE | RMSE |
|---|---|---|---|---|---|
| Chronos-2 120M | 6.5 ms | 6.7 ms | 0.255 GB | 1.149 | 2.228 |
| Chronos-2 28M | 3.7 ms | 3.9 ms | 0.084 GB | 1.154 | 2.263 |
| TimesFM 2.5 200M | 86.0 ms | 89.3 ms | 0.917 GB | 1.106 | 2.169 |

Figure: Latency comparison of the 3 models. Both Chronos-2 variants are in the single-digit ms range, while TimesFM 2.5 is at 86 ms.

Figure: Peak GPU memory comparison of the 3 models. The 28M model is extremely lightweight at 84 MB.

Figure: Side-by-side bar chart of MASE and RMSE. All 3 models score a MASE slightly above the seasonal naive baseline of 1.0.

In terms of accuracy, TimesFM 2.5 achieves the best MASE and RMSE. Note that all 3 models score a MASE slightly above 1.0, meaning none of them beats seasonal naive (MASE=1.0, which simply returns the value from 24 hours ago) at this context length. Since ETTh1 OT has a strong diurnal cycle, the seasonal baseline is a tough competitor even for zero-shot foundation models.

The latency difference is quite pronounced: Chronos-2 28M (3.7ms) and Chronos-2 120M (6.5ms) are 13 to 23 times faster than TimesFM 2.5 200M (86ms). This directly reflects the architectural difference between Chronos-2, which outputs quantiles for the full horizon in a single encoder forward pass, and TimesFM 2.5, which generates autoregressively with a decoder-only architecture.
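For reference, warm median and p95 figures like the ones above are typically obtained by discarding the first few (cold) calls and summarizing the rest. The following is a minimal sketch of such a harness, not the script used for this article: measure_latency and its parameters are hypothetical names, and the matrix multiply is only a stand-in workload.

```python
import time

import numpy as np


def measure_latency(fn, n_cold=3, n_warm=50, sync=None):
    """Time fn() and report cold and warm runs separately, in milliseconds.

    sync is an optional callable invoked around each clock read. On a GPU
    one would typically pass torch.cuda.synchronize so that queued kernels
    are included in the measurement.
    """
    def timed_call():
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        return (time.perf_counter() - start) * 1000.0

    cold = [timed_call() for _ in range(n_cold)]  # warm-up, reported separately
    warm = np.array([timed_call() for _ in range(n_warm)])
    return {
        "cold_first_ms": cold[0],
        "warm_median_ms": float(np.median(warm)),
        "warm_p95_ms": float(np.percentile(warm, 95)),
    }


# Stand-in workload; replace with a closure around the model's forecast call.
x = np.random.default_rng(0).normal(size=(256, 256))
stats = measure_latency(lambda: x @ x)
print(stats)
```

Reporting the median rather than the mean keeps a single slow outlier from skewing the headline number, while the p95 shows how stable the warm path is.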

In terms of GPU memory, the 28M model uses 1/3 of the 120M and 1/11 of TimesFM 2.5, while delivering nearly equivalent accuracy.

What Happens When the Horizon Is Extended

The forecasting target was extended from 96 hours (4 days) to 720 hours (30 days) to observe how latency and accuracy change.

Figure: Latency trend as the horizon is extended. Chronos-2 stays nearly flat, while TimesFM 2.5 jumps sharply from h=96 to h=192.

| Horizon | Chronos-2 120M (latency / MASE) | Chronos-2 28M (latency / MASE) | TimesFM 2.5 200M (latency / MASE) |
|---|---|---|---|
| 96 | 6.5 ms / 1.149 | 3.7 ms / 1.154 | 86 ms / 1.106 |
| 192 | 6.7 ms / 1.196 | 3.8 ms / 1.221 | 229 ms / 1.241 |
| 336 | 6.8 ms / 1.259 | 3.9 ms / 1.330 | 228 ms / 1.214 |
| 720 | 6.8 ms / 1.453 | 3.8 ms / 1.543 | 228 ms / 1.411 |

Chronos-2's latency is nearly identical between horizon=96 and horizon=720 (6.5ms → 6.8ms) — quietly impressive. TimesFM 2.5 increased by 2.7x from h=96 to h=192, then plateaued as the internal horizon_len hit its ceiling.

In terms of accuracy, all models degrade as the horizon extends. TimesFM 2.5 falls behind Chronos-2 120M at h=192 (1.241 vs. 1.196) but is ahead at h=336 (1.214 vs. 1.259) and at h=720 (1.411 vs. 1.453). Holding a small accuracy edge at the longest horizons is one of TimesFM 2.5's strengths.

TimesFM 2.5 Shows Dramatic Accuracy Improvement When Context Is Extended

Given that TimesFM 2.5 advertises "16K context," testing whether accuracy actually improves with longer context was an important point of investigation. Since Chronos-2 has a context limit of 8,192, this was evaluated with TimesFM 2.5 alone.

Figure: Context scaling of TimesFM 2.5. MASE improves dramatically from c=512 → 8K → 15K.

| Context | Warm Median | Peak GPU mem | MASE | vs. c=512 |
|---|---|---|---|---|
| 512 | 86.0 ms | 0.917 GB | 1.106 | baseline |
| 8,192 | 331.2 ms | 0.993 GB | 0.883 | 20% improvement |
| 15,360 | 447.8 ms | 1.074 GB | 0.770 | 30% improvement |

At c=15,360, MASE reaches 0.770 — that's 23% better than seasonal naive. The benefit of "16K context" is clearly reflected in the numbers, reaffirming that this model truly leverages long context. One important note: the full 16,384 context limit of the model cannot be fully utilized, as there is an architectural constraint of max_context + max_horizon ≤ 16384. If horizon=1024 is reserved, the practical context limit is up to 15,360. The long-context verification in this article was also measured at c=15,360.
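The context budget described above is simple arithmetic, sketched here as a small helper. The function name max_usable_context is hypothetical, and the sketch only models the stated max_context + max_horizon ≤ 16384 constraint; any additional rounding the library may apply to either value is not modeled.

```python
TIMESFM_25_TOTAL_LIMIT = 16_384  # ceiling on max_context + max_horizon stated above


def max_usable_context(max_horizon, total_limit=TIMESFM_25_TOTAL_LIMIT):
    """Largest max_context satisfying max_context + max_horizon <= total_limit."""
    if not 0 < max_horizon < total_limit:
        raise ValueError("max_horizon must be between 1 and total_limit - 1")
    return total_limit - max_horizon


print(max_usable_context(1024))  # 15360, the setting used in this article
```

In practice this means the reserved horizon should be chosen first, since every step reserved for the forecast is a step taken away from the usable history.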

A Look at Multivariate Forecasting

Since ETTh1 has 6 electrical load columns — HUFL / HULL / MUFL / MULL / LUFL / LULL — in addition to OT, we tested whether simultaneously forecasting all 7 variables in multivariate mode with Chronos-2 would change the OT accuracy.

Figure: Per-variable MASE for 7-variable simultaneous forecasting. LULL is the best, while OT degrades slightly.

As a result, looking at OT accuracy with Chronos-2 120M, it worsened by 7% from univariate MASE 1.149 to multivariate MASE 1.226. On the other hand, the other 6 variables were predicted well with MASE values of 0.87–1.09, and LULL in particular achieved 0.90, outperforming seasonal naive by 10%.

This is likely the same phenomenon commonly seen in machine learning where "mixing in covariates with weak correlation to the target can be counterproductive." Testing the same with Chronos-2 28M showed OT MASE remaining nearly unchanged at 1.149, suggesting that for dashboard use cases where "all metrics are returned in a single forward pass," this model could be a good fit.
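Switching between the univariate and multivariate runs mostly comes down to how the history is shaped before it is handed to Chronos-2, which expects (n_series, n_variates, history_length). The sketch below shows that reshaping with NumPy only. The helper to_3d_input is a hypothetical name, and the data is a synthetic stand-in for the last 512 rows of ETTh1.

```python
import numpy as np


def to_3d_input(values):
    """Reshape a history array to (n_series, n_variates, history_length).

    1D input is treated as one univariate series; 2D input is treated as
    one series laid out as (n_variates, history_length).
    """
    arr = np.asarray(values, dtype=np.float32)
    if arr.ndim == 1:
        return arr[None, None, :]
    if arr.ndim == 2:
        return arr[None, :, :]
    if arr.ndim == 3:
        return arr
    raise ValueError(f"expected 1D, 2D or 3D input, got {arr.ndim}D")


columns = ["HUFL", "HULL", "MUFL", "MULL", "LUFL", "LULL", "OT"]
rng = np.random.default_rng(0)
# Synthetic stand-in: one array of 512 values per ETTh1 column.
history = {name: rng.normal(size=512) for name in columns}

univariate = to_3d_input(history["OT"])
multivariate = to_3d_input(np.stack([history[c] for c in columns]))
print(univariate.shape, multivariate.shape)  # (1, 1, 512) (1, 7, 512)
```

The resulting arrays can be wrapped with torch.tensor before being passed to the pipeline, as in the minimal Chronos-2 sample earlier in the article.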

Does the Ranking Hold in a Different Domain?

To check whether the ranking established with ETTh1 (electricity data) holds in other domains, we ran the same evaluation on the T column (room temperature in Rome) from the UCI Air Quality dataset.

Figure: Ranking reversal between ETTh1 and AirQuality. MASE rankings swap depending on the dataset.

| Dataset | 1st Place | 2nd Place | 3rd Place |
|---|---|---|---|
| ETTh1 (Electricity OT) | TimesFM 2.5 (1.106) | Chronos-2 120M (1.149) | Chronos-2 28M (1.154) |
| AirQuality (Room Temp T) | Chronos-2 120M (1.206) | Chronos-2 28M (1.254) | TimesFM 2.5 (1.364) |

The rankings were reversed. AirQuality is noisier as a time series than ETTh1 and lacks a strong diurnal cycle (while day-night differences exist, seasonal variation is also large), so it seems that TimesFM 2.5's strength with long context doesn't come into play with short context (c=512). This result shows that "TimesFM 2.5 is not always the best," reinforcing the importance of comparing models on the actual production domain.

Anomaly Detection Simulation

Hearing that NV-Tesseract is strong at anomaly detection naturally raises the question of how useful prediction-based anomaly detection with Chronos-2 / TimesFM 2.5 can be. Three types of anomalies — spikes, level shifts, and noise bursts — were artificially injected into the ETTh1 test period, and the absolute value of prediction residuals was used as an anomaly score to calculate per-step ROC AUC.

Figure: ROC AUC for 3 models × 3 anomaly patterns. Spikes are 0.96+ for all models, while noise bursts are around 0.66 for all.

| Pattern | Chronos-2 120M | Chronos-2 28M | TimesFM 2.5 200M |
|---|---|---|---|
| Spike (instantaneous anomaly) | 0.965 | 0.968 | 0.981 |
| Level shift (persistent anomaly) | 0.719 | 0.706 | 0.685 |
| Noise burst (short-term) | 0.658 | 0.642 | 0.663 |

All 3 models achieved AUC above 0.96 for spikes, meaning instantaneous outliers can be detected almost certainly. The chart below shows a qualitative example, where the prediction residuals spike sharply only at the injected spike points (red circles), making it visually clear as well.

Figure: Spike detection example with Chronos-2 120M. Prediction residuals stand out prominently at the injection points.

Level shifts scored a moderate AUC of 0.69–0.72, meaning gradual drift-type anomalies similar to equipment degradation can be detected to some extent. Noise bursts scored lower at AUC 0.64–0.67, suggesting that short-term, high-frequency noise is easy to miss with prediction-based FMs alone. A dedicated anomaly detection model like NV-Tesseract should excel in this area, and it would be worth comparing once it's publicly available.
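The residual-based scoring used in this section can be sketched as follows: inject anomalies, take the absolute difference between observation and forecast as the score, and compute a per-step ROC AUC. This is an illustration under assumptions, not the article's evaluation code. The spike magnitude, the synthetic series, and the noise-free stand-in forecast are all illustrative choices, and roc_auc is a small hypothetical helper written with NumPy only.

```python
import numpy as np


def roc_auc(labels, scores):
    """ROC AUC as the probability that an anomalous step outscores a normal one.

    Ties between an anomalous and a normal score count as one half.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


rng = np.random.default_rng(0)
t = np.arange(24 * 30)
daily_cycle = 10.0 + 5.0 * np.sin(2 * np.pi * t / 24)
clean = daily_cycle + rng.normal(0.0, 0.3, t.size)

# Inject spikes at random steps (count and magnitude are illustrative).
labels = np.zeros(t.size, dtype=bool)
labels[rng.choice(t.size, size=15, replace=False)] = True
observed = clean.copy()
observed[labels] += 5.0

# Stand-in forecast: the noise-free daily cycle. A real run would use the
# model's point or median forecast for the same steps.
forecast = daily_cycle
score = np.abs(observed - forecast)  # absolute residual as the anomaly score
auc = roc_auc(labels, score)
print(f"spike AUC = {auc:.3f}")
```

The same loop can be reused for level shifts and noise bursts by changing only the injection step, which makes it easy to compare patterns under identical scoring.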

NV-Tesseract

According to the official NVIDIA Developer Blog (New NVIDIA NV-Tesseract Time-Series Models and Advancing Anomaly Detection for Industry Applications), NV-Tesseract is a Transformer-based model supporting three tasks: forecasting, anomaly detection, and classification. In particular, anomaly detection (NV-Tesseract-AD) uses a mechanism called segmented / multi-scale adaptive thresholding, which appears to be designed for detecting subtle outliers in industrial sensors.

One striking real-world manufacturing example is the partnership with Cognite, announced on the first day of GTC 2026 (March 16, 2026). A case study has been published about running NV-Tesseract NIM for state prediction from reactor water level sensors at Celanese's chemical plant in Clear Lake, Texas. The challenge from Celanese's side was that "every time manual sampling occurred, conventional prediction models would cause bias jumps that disrupted continuous operation," and the goal is to bridge this gap with NV-Tesseract's real-time forecasting.

What's interesting about the delivery model is that it connects NVIDIA's NV-Tesseract NIM to Cognite's Industrial Knowledge Graph (a graph DB of equipment, sensors, and operational knowledge), providing a package where data context and the model are delivered together. The target sectors are stated to extend beyond chemicals to energy, manufacturing, power and renewable energy, and heavy industry, and it's clear that the intent is to deploy this through industrial OT data platform providers as a gateway to optimize entire production lines. Combined with video-based foundation models like VSS and Cosmos, there seems to be a trajectory toward factory visualization on two fronts: "sensor time series + video."

The timeline for when it will be available to run locally on DGX Spark or Jetson is currently undetermined, but it's something to look forward to.

Comparison of Ecosystem and Operational Aspects

Why did the results in the Benchmark Results section turn out the way they did? Looking at the internal structures of the 3 models side by side reveals that differences in design philosophy directly translate into differences in operational characteristics.

Chronos-2 outputs the full horizon in a single forward pass via the encoder + quantile head, so latency doesn't increase much as the horizon grows. TimesFM 2.5 generates patch-by-patch autoregressively with a decoder, so latency grows with the horizon, but in exchange it can leverage long context — that's the design tradeoff.

Here are recommendations by use case:

| Intended Use | Likely Best Choice |
|---|---|
| AWS-centric demand forecasting (rich covariates) | Chronos-2 + SageMaker JumpStart / Bedrock Marketplace |
| One-shot SQL forecasting on a Google Cloud data lake | TimesFM 2.5 + BigQuery AI.FORECAST |
| IoT sensor anomaly detection (high accuracy) | NV-Tesseract |
| Lightweight edge deployment | Chronos-2 28M |
| Long-term trend forecasting (thousands of steps) | TimesFM 2.5 (long context mode) |
| Massively parallel short-term forecasting | Chronos-2 120M |
| Probabilistic forecasting for finance, etc. | Chronos-2 |

Application Considerations for Real Projects

For a project I'm actually involved in, I'm considering a use case where real-time data acquired from PLCs (industrial controllers that manage sensors and set values on the manufacturing floor) is fed into a time-series model to continuously judge two things: "are normal products likely to be produced under the current operating conditions?" and "are there signs of heading toward a failure as the set values change?" Passing set values as future covariates makes it possible to forecast future values of quality indicators, and the prediction residuals can double as an anomaly score. As the benchmark in this article shows, Chronos-2 28M achieves a warm latency of 3.7ms, well within PLC scan cycles (100ms to 1 second), which makes real-time integration in a near-edge configuration realistic. The Cognite × Celanese case mentioned earlier has the same structure, implemented by connecting an OT platform provider to NIM. I plan to cover the technical verification with actual PLC data in a separate article.

Implementation Pitfalls to Be Aware Of

Here are two points where it was easy to get stuck due to transitional library conditions rather than the time-series models themselves.

The predict_quantiles API for Chronos-2 was completely revamped in version 2.2, so directly copying from older articles or official examples won't work. The argument is inputs=tensor, the input must be 3-dimensional (n_series, n_variates, history_length) (even for univariate: tensor[None, None, :]), and the return value is tuple[list[Tensor], list[Tensor]] where the shape of quantiles_list[0] is (n_variates, horizon, num_quantiles).

The practical context limit for TimesFM 2.5 is up to 15,360. Due to the architectural constraint of max_context + max_horizon ≤ 16384, reserving horizon=1024 limits the context to 15,360. The long-context verification in this article was also measured at c=15,360.

Summary

This article organized the major updates to time-series foundation models that have been rolling out since fall 2025, benchmarked AWS Chronos-2 and Google TimesFM 2.5 on actual hardware, and introduced NV-Tesseract based on official information and the Cognite case study.

To put it simply: Chronos-2 for latency-sensitive, massively parallel workloads; TimesFM 2.5 when you want to leverage long context for better accuracy; and NV-Tesseract for dedicated anomaly detection in manufacturing. Chronos-2 28M is lightweight in both memory and inference time, with accuracy nearly on par with the 120M model — an interesting option for near-edge deployments.

