I tried running and comparing time series foundation models on DGX Spark

2026.05.25

Hello, I'm Shigeru Mori from the Classmethod Manufacturing Business Technology Department.

Starting in the fall of 2025, foundation models for time-series forecasting received major updates in rapid succession. Google's TimesFM 2.5, NVIDIA's NV-Tesseract (in evaluation license stage), and AWS's Chronos-2 — three major models from three companies lined up within just two months. Each has slightly different target areas, such as demand forecasting, anomaly detection, and sensor monitoring in manufacturing, and there's ongoing debate about which one comes out on top in benchmarks.

In this article, I benchmark Chronos-2 and TimesFM 2.5, which can be run locally, and cover NV-Tesseract, which can currently only be accessed via DGX Cloud NIM, as a read-only overview based on official information and Cognite's case study. The tests ran on my own DGX Spark.

Positioning and Comparison of the 3 Models

Organizing the three models by developer, distribution format, and intended use at the ecosystem level shows each vendor tying its model to its own platform: AWS places Chronos-2 on SageMaker JumpStart and Bedrock Marketplace, Google exposes TimesFM through BigQuery's AI.FORECAST, and NVIDIA provides NV-Tesseract as a DGX Cloud NIM. Each company's platform strategy is clearly visible in these choices.

Placing the key specs side by side, the commonalities and differences become clear.

| Item | Chronos-2 | TimesFM 2.5 | NV-Tesseract |
|---|---|---|---|
| Provider | AWS (Amazon Science) | Google Research | NVIDIA |
| Release | 2025-10-20 | 2025-09 | 2025 (Evaluation License) |
| Architecture | T5 encoder + Group Attention | Decoder-only + Quantile Head | Transformer (details undisclosed) |
| Parameters | 120M / 28M | 200M | Undisclosed |
| Context Limit | 8,192 | 16,384 | Undisclosed |
| Multivariate Forecasting | Native support | Supported | Supported |
| Anomaly Detection | - (forecasting-focused) | AI.DETECT_ANOMALIES GA | Native (NV-Tesseract-AD) |
| Fine-Tuning | Available | Available | Via NIM only |
| License | Apache 2.0 | Apache 2.0 | Evaluation License |
| Availability | HF / GitHub | HF / JAX version available | DGX Cloud NIM only |
| Commercial Use | Available | Available | Conditional during evaluation period |

Both Chronos-2 and TimesFM 2.5 are Apache 2.0 and can run locally, while NV-Tesseract is in the evaluation license stage and can only be accessed via DGX Cloud NIM, a clear difference in availability. As of May 2026, no general release date has been announced.

Test Environment

Both models are fully operable via Python library calls. Since dependency versions conflict, I set up separate uv venv environments for each.

| Item | Value |
|---|---|
| Hardware | DGX Spark (NVIDIA GB10, aarch64, 128GB UMA) |
| OS / Driver | Ubuntu 24.04 / NVIDIA driver 580.142 |
| Python | 3.13.13 |
| PyTorch | 2.12.0+cu130 (shared across both venvs) |
| For Chronos-2 | chronos-forecasting==2.2.2, transformers==4.57.6 |
| For TimesFM 2.5 | timesfm==2.0.0 (git head d720daa6), safetensors==0.7.0 |

Trying Chronos-2

Just install chronos-forecasting>=2.1.0 (2.1.0 and later includes a bug fix for past covariate handling) via pip. The cu130 wheel index is explicitly specified for ARM64 + Blackwell GB10.

uv pip install --index-url https://download.pytorch.org/whl/cu130 \
  --extra-index-url https://pypi.org/simple/ torch
uv pip install "chronos-forecasting>=2.1.0"

The minimal sample looks like this. It runs end-to-end from BaseChronosPipeline.from_pretrained to predict_quantiles. After loading, a single inference produces the model's 21 quantiles; this sample requests three levels (0.1, 0.5, 0.9) through quantile_levels.

import numpy as np
import torch
from chronos import BaseChronosPipeline

pipeline = BaseChronosPipeline.from_pretrained(
    "amazon/chronos-2", device_map="cuda", dtype=torch.bfloat16
)
# Input is a 3D tensor of shape (n_series, n_variates, history_length)
context = torch.tensor(np.arange(512, dtype=np.float32)[None, None, :])
quantiles_list, mean_list = pipeline.predict_quantiles(
    context, prediction_length=96, quantile_levels=[0.1, 0.5, 0.9]
)
print(quantiles_list[0].shape)  # torch.Size([1, 96, 3])

One thing to watch out for: chronos-forecasting>=2.2 has completely revamped the argument names, input shape, and return values of predict_quantiles, so copy-pasting from older articles or official examples may not work as-is.

Trying TimesFM 2.5

The one thing to be careful about with TimesFM 2.5 is that the timesfm==1.3.0 distributed on PyPI does not yet include the TimesFM 2.5 API. Even the Hugging Face model card explicitly states "pip install from PyPI coming soon. At this point, please git clone," so currently the only option is to source install from the GitHub HEAD. Since we install with --no-deps, safetensors needs to be specified separately.

uv pip install --index-url https://download.pytorch.org/whl/cu130 \
  --extra-index-url https://pypi.org/simple/ torch safetensors
uv pip install --force-reinstall --no-deps \
  "timesfm @ git+https://github.com/google-research/timesfm.git"

Here is the minimal sample. The pattern is to set the context and horizon limits with ForecastConfig and compile, then pass the actual prediction length to forecast.

import numpy as np
import timesfm

model = timesfm.TimesFM_2p5_200M_torch.from_pretrained(
    "google/timesfm-2.5-200m-pytorch"
)
model.compile(timesfm.ForecastConfig(
    max_context=2048, max_horizon=256,
    normalize_inputs=True, use_continuous_quantile_head=True,
    fix_quantile_crossing=True,
))
point_forecast, quantile_forecast = model.forecast(
    horizon=96, inputs=[np.arange(512, dtype=np.float32)],
)
print(point_forecast.shape)  # (1, 96)

Benchmark Results

From here, I'll evaluate the three model variants (Chronos-2 120M, Chronos-2 28M, and TimesFM 2.5 200M) on ETTh1 (an electricity dataset) across three axes: latency, accuracy, and memory efficiency.

For the evaluation, I used the OT column (oil temperature) from ETTh1 (Electricity Transformer Temperature, Hourly, 7 variables, 17,420 rows). This is a dataset from a Chinese power company that monitored distribution transformers for two years (July 2016 to July 2018), aimed at predicting signs of overheating accidents from the internal oil temperature (OT) and six types of load variables (high/medium/low active/reactive power). Its structure is close to equipment monitoring data in manufacturing, and it has been used as a standard benchmark for time-series forecasting since the Informer paper (AAAI 2021 Best Paper).

The evaluation uses 64 rolling-window splits taken from the end of the series with step=24. The metrics are MASE (Mean Absolute Scaled Error) and RMSE. Inference latency is reported as the warm median, with the cold first call measured separately, and peak GPU memory is measured with torch.cuda.max_memory_allocated().
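The article does not show its evaluation script, so the following is only a minimal sketch of one common way to set up such a protocol. The function names (`rolling_windows`, `mase`, `rmse`), the exact window alignment, and the choice to scale MASE by the in-sample seasonal-naive error with a 24-step season are all assumptions made for illustration.

```python
import math


def rolling_windows(n_rows, context, horizon, n_windows=64, step=24):
    """Return (ctx_start, ctx_end, fc_end) index triples, newest window first."""
    windows = []
    for k in range(n_windows):
        fc_end = n_rows - k * step          # forecast span ends here (exclusive)
        ctx_end = fc_end - horizon          # context ends where the forecast begins
        ctx_start = ctx_end - context
        if ctx_start < 0:
            break                           # not enough history for another window
        windows.append((ctx_start, ctx_end, fc_end))
    return windows


def mase(history, actual, forecast, season=24):
    """Forecast MAE divided by the in-sample MAE of the seasonal-naive forecast."""
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_errors = [abs(history[i] - history[i - season])
                    for i in range(season, len(history))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae / scale                      # ZeroDivisionError if history is perfectly periodic


def rmse(actual, forecast):
    """Root mean squared error over the forecast span."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


# ETTh1 has 17,420 rows; context=512 and horizon=96 match the short-term setting.
windows = rolling_windows(n_rows=17_420, context=512, horizon=96)
print(len(windows), windows[0], windows[-1])
```

Each triple slices one context/forecast pair out of the series. Averaging `mase` and `rmse` over all windows gives one score per model.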

Short-term Forecasting (horizon=96, context=512)

| Model | Warm Median | Warm p95 | Peak GPU mem | MASE | RMSE |
|---|---|---|---|---|---|
| Chronos-2 120M | 6.5 ms | 6.7 ms | 0.255 GB | 1.149 | 2.228 |
| Chronos-2 28M | 3.7 ms | 3.9 ms | 0.084 GB | 1.154 | 2.263 |
| TimesFM 2.5 200M | 86.0 ms | 89.3 ms | 0.917 GB | 1.106 | 2.169 |

Figure: Latency comparison of the three models. The Chronos-2 variants take a few ms, TimesFM 2.5 takes 86 ms.

Figure: Peak GPU memory comparison of the three models. The 28M model is very lightweight at 84 MB.

Figure: MASE and RMSE side by side. All three models score a MASE slightly above 1.0, that is, slightly behind seasonal naive.

In terms of accuracy, TimesFM 2.5 achieves the best MASE/RMSE, but all three models land at MASE 1.11–1.15, which is slightly worse than seasonal naive (MASE=1.0, which simply repeats the value from 24 hours earlier). Since ETTh1 OT has a strong daily cycle, even zero-shot foundation models face a reasonably tough baseline.

The latency difference is quite pronounced: Chronos-2 28M (3.7ms) and Chronos-2 120M (6.5ms) are 13 to 23 times faster than TimesFM 2.5 200M (86ms). This directly reflects the architectural difference between Chronos-2, which outputs all quantiles in a single encoder forward pass, and TimesFM 2.5, which generates autoregressively with a decoder-only architecture.
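For readers who want to reproduce this kind of cold/warm latency figure, here is a minimal timing-harness sketch. The helper name `measure_latency`, the warm-up count, and the nearest-rank p95 are illustrative choices, not the article's actual measurement code.

```python
import math
import statistics
import time


def measure_latency(fn, n_warmup=3, n_runs=50, sync=None):
    """Time `fn` and report cold (first call) and warm latency in milliseconds.

    `sync` is an optional callable invoked around each clock read; on a GPU
    one would pass torch.cuda.synchronize so queued kernels are included.
    """
    def timed_call():
        if sync:
            sync()
        start = time.perf_counter()
        fn()
        if sync:
            sync()
        return (time.perf_counter() - start) * 1000.0

    cold_ms = timed_call()                  # first call: lazy init, compilation, caches
    for _ in range(n_warmup):               # discard a few more calls before measuring
        timed_call()
    warm = sorted(timed_call() for _ in range(n_runs))
    p95_index = max(0, math.ceil(0.95 * len(warm)) - 1)   # nearest-rank percentile
    return {
        "cold_ms": cold_ms,
        "warm_median_ms": statistics.median(warm),
        "warm_p95_ms": warm[p95_index],
    }


# Stand-in workload; in practice `fn` would wrap a single forecast call.
stats = measure_latency(lambda: sum(range(10_000)), n_runs=20)
print(stats)
```

Keeping the first call out of the warm statistics matters because model loading and kernel compilation can dominate that call.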

In terms of GPU memory, the 28M model uses 1/3 of the 120M and 1/11 of TimesFM 2.5, while achieving nearly equivalent accuracy.

What Happens as Horizon Increases

I extended the forecast target from 96 hours (4 days) to 720 hours (30 days) to see how latency and accuracy change.

Figure: Latency trend as horizon increases. Chronos-2 stays nearly flat, while TimesFM 2.5 jumps sharply from h=96 to h=192.

| horizon | Chronos-2 120M (latency / MASE) | Chronos-2 28M (latency / MASE) | TimesFM 2.5 200M (latency / MASE) |
|---|---|---|---|
| 96 | 6.5 ms / 1.149 | 3.7 ms / 1.154 | 86 ms / 1.106 |
| 192 | 6.7 ms / 1.196 | 3.8 ms / 1.221 | 229 ms / 1.241 |
| 336 | 6.8 ms / 1.259 | 3.9 ms / 1.330 | 228 ms / 1.214 |
| 720 | 6.8 ms / 1.453 | 3.8 ms / 1.543 | 228 ms / 1.411 |

Chronos-2's latency is nearly identical between horizon=96 and horizon=720 (6.5ms → 6.8ms) — quietly impressive. TimesFM 2.5 increased 2.7x from h=96 to h=192, then plateaued as the internal horizon_len hit its ceiling.

Accuracy naturally degrades for all models as horizon increases. TimesFM 2.5 does not lead at every horizon: Chronos-2 120M is ahead at h=192 (1.196 vs. 1.241), while TimesFM 2.5 is ahead at h=336 (1.214 vs. 1.259) and at h=720 (1.411 vs. 1.453). Holding a small accuracy edge at the longest horizons is a strength of TimesFM 2.5.
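The horizon table can be checked mechanically. The sketch below copies the latency and MASE figures from that table into a dictionary and derives, per horizon, the model with the lowest MASE and each model's latency growth relative to h=96. The variable names are illustrative.

```python
# (warm median latency in ms, MASE) per horizon, copied from the horizon table
table = {
    96:  {"Chronos-2 120M": (6.5, 1.149), "Chronos-2 28M": (3.7, 1.154), "TimesFM 2.5 200M": (86.0, 1.106)},
    192: {"Chronos-2 120M": (6.7, 1.196), "Chronos-2 28M": (3.8, 1.221), "TimesFM 2.5 200M": (229.0, 1.241)},
    336: {"Chronos-2 120M": (6.8, 1.259), "Chronos-2 28M": (3.9, 1.330), "TimesFM 2.5 200M": (228.0, 1.214)},
    720: {"Chronos-2 120M": (6.8, 1.453), "Chronos-2 28M": (3.8, 1.543), "TimesFM 2.5 200M": (228.0, 1.411)},
}

base = table[96]
best_by_horizon = {}
latency_growth = {}
for horizon, row in table.items():
    # Lowest MASE wins at this horizon.
    best_by_horizon[horizon] = min(row, key=lambda model: row[model][1])
    # Latency relative to the same model's h=96 latency.
    latency_growth[horizon] = {
        model: round(row[model][0] / base[model][0], 2) for model in row
    }

for horizon in table:
    print(horizon, best_by_horizon[horizon], latency_growth[horizon])
```

The growth factors show the Chronos-2 variants staying within a few percent of their h=96 latency, while TimesFM 2.5 sits at roughly 2.7x from h=192 onward.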

TimesFM 2.5 Shows Dramatic Accuracy Improvement with Longer Context

Since TimesFM 2.5 touts "16K context," whether accuracy actually improves when context is extended was a key question worth testing. Chronos-2 has a context limit of 8,192, so this evaluation was done with TimesFM 2.5 alone.

Figure: Context scaling for TimesFM 2.5. MASE improves markedly from c=512 to c=8K to c=15K.

| context | Warm Median | Peak GPU | MASE | vs c=512 |
|---|---|---|---|---|
| 512 | 86.0 ms | 0.917 GB | 1.106 | baseline |
| 8,192 | 331.2 ms | 0.993 GB | 0.883 | 20% improvement |
| 15,360 | 447.8 ms | 1.074 GB | 0.770 | 30% improvement |

At c=15,360, MASE reaches 0.770 — 23% better than seasonal naive. The benefit of "16K context" is clearly reflected in the numbers, reinforcing the impression that this model truly benefits from long context. One important caveat: you cannot use the full 16,384 context limit of the model. Due to the architectural constraint that max_context + max_horizon ≤ 16384, if you reserve horizon=1024, the practical context limit is 15,360.
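The context budget and the percentages quoted here are simple arithmetic, sketched below. The helper names `max_usable_context` and `relative_improvement` are made up for illustration; the 16,384 total is the constraint stated in the article.

```python
TOTAL_LIMIT = 16_384  # from the article: max_context + max_horizon <= 16384


def max_usable_context(max_horizon, total_limit=TOTAL_LIMIT):
    """Largest max_context that still leaves room for the reserved horizon."""
    if not 0 < max_horizon < total_limit:
        raise ValueError("max_horizon must be between 1 and total_limit - 1")
    return total_limit - max_horizon


def relative_improvement(baseline, value):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - value) / baseline


context_limit = max_usable_context(1024)
gain_8k = relative_improvement(1.106, 0.883)       # c=8,192 vs c=512
gain_15k = relative_improvement(1.106, 0.770)      # c=15,360 vs c=512
gain_vs_naive = relative_improvement(1.0, 0.770)   # vs seasonal naive (MASE=1.0)

print(context_limit)
print(f"{gain_8k:.1%} {gain_15k:.1%} {gain_vs_naive:.1%}")
```

Reserving a smaller horizon frees up more context, so the budget is worth sizing to the forecast length you actually need.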

Looking at Multivariate Forecasting

ETTh1 also has six power load columns beyond OT — HUFL / HULL / MUFL / MULL / LUFL / LULL — so I tested how predicting all 7 variables simultaneously with Chronos-2 multivariate mode affects OT-only accuracy.

Figure: Per-variable MASE for simultaneous 7-variable forecasting. LULL is best, OT degrades slightly.

The result: for Chronos-2 120M, OT accuracy degraded by about 7%, from MASE 1.149 (univariate) to 1.226 (within multivariate). The other six variables came in at MASE 0.87–1.09, with LULL achieving 0.90, 10% better than seasonal naive.

This is likely the phenomenon commonly seen in machine learning: "mixing covariates with weak correlation to the target can backfire." Running the same test with Chronos-2 28M yields OT MASE of about 1.149, essentially unchanged, suggesting it could be useful for dashboard use cases by leveraging the efficiency of "returning all variables in a single forward pass."

Does the Ranking Hold Across Different Domains?

To check whether the ranking from ETTh1 (electricity data) holds in other domains, I ran the same conditions on the T column (temperature, recorded in an Italian city) from the UCI Air Quality dataset.

Figure: Ranking reversal between ETTh1 and AirQuality. The MASE rankings flip between datasets.

| Dataset | 1st Place | 2nd Place | 3rd Place |
|---|---|---|---|
| ETTh1 (Electricity OT) | TimesFM 2.5 (1.106) | Chronos-2 120M (1.149) | Chronos-2 28M (1.154) |
| AirQuality (Temperature T) | Chronos-2 120M (1.206) | Chronos-2 28M (1.254) | TimesFM 2.5 (1.364) |

The rankings reversed. AirQuality has noisier time series than ETTh1 and lacks as strong a daily cycle (there's a day/night difference, but seasonal variation is also large), so with short context (c=512), TimesFM 2.5's long-context advantage doesn't seem to help. This result confirms that "TimesFM 2.5 isn't always best," and that actually comparing models on your target domain is important.
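When comparing models on your own domain, it helps to keep per-dataset scores in one structure and derive the rankings from it. The sketch below does that with the MASE values reported above; the helper name `rank_by_mase` and the dictionary layout are illustrative.

```python
def rank_by_mase(scores):
    """Return model names ordered from best (lowest MASE) to worst."""
    return sorted(scores, key=scores.get)


# MASE values copied from the ranking table (horizon=96, context=512)
results = {
    "ETTh1 (OT)": {"Chronos-2 120M": 1.149, "Chronos-2 28M": 1.154, "TimesFM 2.5": 1.106},
    "AirQuality (T)": {"Chronos-2 120M": 1.206, "Chronos-2 28M": 1.254, "TimesFM 2.5": 1.364},
}

rankings = {dataset: rank_by_mase(scores) for dataset, scores in results.items()}
for dataset, order in rankings.items():
    print(dataset, "->", " > ".join(order))

# A single model winning everywhere would leave exactly one name in this set.
winners = {order[0] for order in rankings.values()}
print("same winner on every dataset:", len(winners) == 1)
```

Adding a new dataset is then a matter of adding one more entry to `results`.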

Anomaly Detection Simulation

Hearing that NV-Tesseract excels at anomaly detection naturally raises the question: how useful are Chronos-2 / TimesFM 2.5's prediction-based anomaly detection approaches? I artificially injected three types of anomalies — spikes, level shifts, and noise bursts — into the test section of ETTh1, and computed per-step ROC AUC using the absolute value of prediction residuals as anomaly scores.
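The scoring idea can be sketched in plain Python. This is not the article's script: the sine-wave series, the noisy stand-in forecast, the spike count and magnitude, and the helper names `inject_spikes` and `roc_auc` are all hypothetical, and only the spike pattern is shown. The AUC is computed with the rank-sum (Mann-Whitney U) identity.

```python
import math
import random


def inject_spikes(values, n_spikes, magnitude, rng):
    """Return a copy of `values` with spikes added, plus per-step 0/1 labels."""
    out = list(values)
    labels = [0] * len(values)
    for idx in rng.sample(range(len(values)), n_spikes):
        out[idx] += magnitude * rng.choice([-1.0, 1.0])
        labels[idx] = 1
    return out, labels


def roc_auc(labels, scores):
    """ROC AUC via the rank-sum identity, using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1              # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = random.Random(0)
clean = [math.sin(2 * math.pi * t / 24) for t in range(240)]   # toy daily cycle
forecast = [c + rng.gauss(0.0, 0.1) for c in clean]            # stand-in for a model forecast
observed, labels = inject_spikes(clean, n_spikes=5, magnitude=3.0, rng=rng)
scores = [abs(o - f) for o, f in zip(observed, forecast)]      # absolute residual = anomaly score
print(round(roc_auc(labels, scores), 3))
```

Level shifts and noise bursts would be injected the same way, by changing a contiguous block of points instead of isolated ones.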

Figure: ROC AUC for 3 models × 3 anomaly patterns. Spikes score 0.96+ for all models, noise bursts around 0.66 for all models.

| Pattern | Chronos-2 120M | Chronos-2 28M | TimesFM 2.5 200M |
|---|---|---|---|
| Spike (instantaneous anomaly) | 0.965 | 0.968 | 0.981 |
| Level Shift (persistent anomaly) | 0.719 | 0.706 | 0.685 |
| Noise Burst (short-term) | 0.658 | 0.642 | 0.663 |

All three models achieve AUC above 0.96 for spikes, meaning instantaneous outliers can be detected with near certainty. The chart below shows a qualitative example: the prediction residual spikes sharply only at the injected spike points (red circles), which is visually easy to understand.

Figure: Spike detection example with Chronos-2 120M. The prediction residual stands out at the injection points.

Level shifts score AUC 0.69–0.72, a moderate result. This suggests that persistent, drift-type equipment anomalies can be detected to a reasonable degree. Noise bursts score lower at AUC 0.64–0.66, suggesting that short-term high-frequency noise is likely to be missed by forecasting foundation models alone. This is exactly where dedicated anomaly detection models like NV-Tesseract should shine, and I'd like to compare them again once they're publicly available.

NV-Tesseract

According to the official NVIDIA Developer Blog (New NVIDIA NV-Tesseract Time-Series Models and Advancing Anomaly Detection for Industry Applications), NV-Tesseract is a Transformer-based model supporting three tasks: forecasting, anomaly detection, and classification. In particular, the anomaly detection variant (NV-Tesseract-AD) appears to be designed around a mechanism called segmented / multi-scale adaptive thresholding, targeting precise detection of subtle outliers in industrial sensors.

A particularly notable real-world manufacturing example is the partnership with Cognite announced to coincide with the first day of GTC 2026 (March 16, 2026). A case study has been published of NV-Tesseract NIM running state predictions from reactor water level sensors at Celanese's chemical plant in Clear Lake, Texas. Celanese's pain point was that "traditional prediction models would produce bias jumps (stepped discontinuities in bias) every time manual sampling occurred, disrupting continuous operations" — the goal being to close this gap with real-time forecasting by NV-Tesseract.

What's interesting about the delivery format is that it connects Cognite's Industrial Knowledge Graph (a graph DB of equipment, sensors, and operational knowledge) to the NV-Tesseract NIM, making data context and model available together as a package. The target sectors extend beyond chemicals to energy, manufacturing, power and renewables, and heavy industry — suggesting a trajectory where industrial OT data platform providers serve as the gateway to optimize entire industrial lines. Combined with video foundation models like VSS and Cosmos, there seems to be a growing possibility of factory visualization progressing along both axes of "sensor time series + video."

The timeline for when it will be runnable locally on DGX Spark or Jetson is currently unannounced, but I look forward to it.

Ecosystem and Operational Comparison

Why did the benchmark results above turn out the way they did? Looking at the internal architectures of the two benchmarked models side by side, it becomes clear that differences in design philosophy show up directly as operational characteristics.

Chronos-2 outputs all horizons at once in a single encoder + quantile head forward pass, so latency doesn't grow much as horizon increases. TimesFM 2.5 generates patch units autoregressively with a decoder, so latency grows with horizon length — but in exchange, it can leverage long context. These are their respective design trade-offs.

Recommended choices by use case might look like this:

| Intended Use | Likely Best Choice |
|---|---|
| AWS-centric demand forecasting (rich covariates) | Chronos-2 + SageMaker JumpStart / Bedrock Marketplace |
| One-shot SQL forecasting on a Google Cloud data lake | TimesFM 2.5 + BigQuery AI.FORECAST |
| IoT sensor anomaly detection (high accuracy requirements) | NV-Tesseract |
| Lightweight edge-side deployment | Chronos-2 28M |
| Long-term trend forecasting (thousands of steps) | TimesFM 2.5 (long context mode) |
| High-volume parallel short-term forecasting | Chronos-2 120M |
| Probabilistic forecasting emphasis (e.g., finance) | Chronos-2 |

Application Considerations in Real Projects

In projects I'm currently involved in, I'm thinking about a use case where real-time data acquired from PLCs (industrial controllers that manage sensors and setpoints on the factory floor) is fed into a time-series model to continuously determine "are we likely to produce products as usual given current operating conditions?" and "are there signs of heading toward failure as setpoints change?" If setpoints are passed as future covariates, future values of quality indicators can be predicted, and prediction residuals can also serve as anomaly scores. Based on the benchmark in this article, Chronos-2 28M at 3.7ms warm latency comfortably fits within PLC scan cycles (100ms–1 second), making real-time edge integration realistically feasible. The Cognite × Celanese case mentioned earlier follows exactly the same structure, implementing this by connecting an OT platform with NIM. I plan to cover the technical verification of feeding actual PLC data in a separate article.

Implementation Pitfalls to Watch Out For

Here are two points where I got stuck. Both stem from the transitional state of the libraries rather than from the time-series models themselves.

The predict_quantiles API in Chronos-2 was completely revamped in the 2.2 series, so copy-pasting from older articles or official examples won't work as-is. The series are passed as inputs=tensor, the input must be 3D with shape (n_series, n_variates, history_length) (even for univariate data, use tensor[None, None, :]), and the return value is tuple[list[Tensor], list[Tensor]], where quantiles_list[0] has shape (n_variates, horizon, num_quantiles).

The effective context limit for TimesFM 2.5 is 15,360. Due to the architectural constraint max_context + max_horizon ≤ 16384, reserving horizon=1024 limits context to 15,360. The long-context evaluation in this article was also measured at c=15,360.

Summary

In this article, I organized the major updates to time-series foundation models that started in the fall of 2025, benchmarked AWS Chronos-2 and Google TimesFM 2.5 on real hardware, and covered NV-Tesseract based on official information and the Cognite case study.

To put it simply: Chronos-2 for latency-critical high-volume parallel processing, TimesFM 2.5 when long context is needed for better accuracy, and NV-Tesseract when you want a dedicated model for anomaly detection in manufacturing. Chronos-2 28M is lightweight in both memory and inference time with accuracy nearly matching the 120M model — an interesting option for edge-side deployments as well.


