
I tried running and comparing time series foundation models on DGX Spark
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Division.
Starting in the fall of 2025, foundation models for time-series forecasting received major updates in rapid succession. Google's TimesFM 2.5, NVIDIA's NV-Tesseract (currently in evaluation license stage), and AWS's Chronos-2 — three leading models from three companies emerged within just two months. Each targets slightly different use cases such as demand forecasting, anomaly detection, and manufacturing sensor monitoring, and there's ongoing debate about which one comes out on top in benchmarks.
This article benchmarks Chronos-2 and TimesFM 2.5, both of which can be run locally, on actual hardware. NV-Tesseract, currently accessible only via DGX Cloud NIM, is covered in a reading section based on official information and Cognite's case study. All measurements were taken on a DGX Spark I had on hand.
Positioning and Comparison of the 3 Models
Organizing the three models by developer, distribution channel, and intended use shows how each one fits into its vendor's ecosystem. AWS integrates via SageMaker JumpStart and Bedrock Marketplace, Google integrates via BigQuery's AI.FORECAST, and NVIDIA provides its model as a NIM on DGX Cloud, so each company's platform strategy shows through clearly.
Placing the main specs side by side makes the commonalities and differences clearly visible.
| Item | Chronos-2 | TimesFM 2.5 | NV-Tesseract |
|---|---|---|---|
| Provider | AWS (Amazon Science) | Google Research | NVIDIA |
| Release | 2025-10-20 | 2025-09 | 2025 (Evaluation License) |
| Architecture | T5 encoder + Group Attention | Decoder-only + Quantile head | Transformer (details undisclosed) |
| Parameters | 120M / 28M | 200M | Undisclosed |
| Context Limit | 8,192 | 16,384 | Undisclosed |
| Multivariate Forecasting | Native support | Supported | Supported |
| Anomaly Detection | - (Forecast-focused) | AI.DETECT_ANOMALIES GA | Native (NV-Tesseract-AD) |
| Fine-Tuning | Publicly available | Publicly available | Via NIM only |
| License | Apache 2.0 | Apache 2.0 | Evaluation license |
| Availability | HF / GitHub | HF / JAX version available | DGX Cloud NIM only |
| Commercial Use | Available | Available | Conditional during evaluation |
While both Chronos-2 and TimesFM 2.5 are available under Apache 2.0 and can run locally, NV-Tesseract is in the evaluation license stage and can only be accessed via DGX Cloud NIM — a notable difference in availability. The general release timeline has not been announced as of now. (As of May 2026)
Verification Environment
Both models are fully operable via Python library calls. Since dependency versions conflict, separate uv venvs were created for each.
| Item | Value |
|---|---|
| Hardware | DGX Spark (NVIDIA GB10, aarch64, 128GB UMA) |
| OS / Driver | Ubuntu 24.04 / NVIDIA driver 580.142 |
| Python | 3.13.13 |
| PyTorch | 2.12.0+cu130 (common to both venvs) |
| For Chronos-2 | chronos-forecasting==2.2.2, transformers==4.57.6 |
| For TimesFM 2.5 | timesfm==2.0.0 (git head d720daa6), safetensors==0.7.0 |
Trying Chronos-2
Just install chronos-forecasting>=2.1.0 via pip (2.1.0 and later include a bug fix for past-value covariates). For PyTorch, specify the cu130 wheel index explicitly to target ARM64 + Blackwell GB10.
uv pip install --index-url https://download.pytorch.org/whl/cu130 \
--extra-index-url https://pypi.org/simple/ torch
uv pip install "chronos-forecasting>=2.1.0"
The minimal sample looks like this. It goes end to end from BaseChronosPipeline.from_pretrained to predict_quantiles. The model produces 21 quantile levels in a single inference after loading; this sample requests three of them through quantile_levels.
import numpy as np
import torch
from chronos import BaseChronosPipeline
pipeline = BaseChronosPipeline.from_pretrained(
    "amazon/chronos-2", device_map="cuda", dtype=torch.bfloat16
)
# Input is a 3D tensor of shape (n_series, n_variates, history_length)
context = torch.tensor(np.arange(512, dtype=np.float32)[None, None, :])
quantiles_list, mean_list = pipeline.predict_quantiles(
    context, prediction_length=96, quantile_levels=[0.1, 0.5, 0.9]
)
print(quantiles_list[0].shape)  # torch.Size([1, 96, 3])
One thing to be careful about: predict_quantiles in chronos-forecasting>=2.2 has completely revamped argument names, input shapes, and return values, so directly copying from older articles or official examples won't work.
Trying TimesFM 2.5
The one thing to watch out for with TimesFM 2.5 is that the timesfm==1.3.0 distributed on PyPI does not yet include the TimesFM 2.5 API. The Hugging Face model card also explicitly states "pip install from PyPI coming soon. At this point, please git clone," meaning a source install from the GitHub HEAD is the only option for now. Since it's installed with --no-deps, safetensors must be specified separately.
uv pip install --index-url https://download.pytorch.org/whl/cu130 \
--extra-index-url https://pypi.org/simple/ torch safetensors
uv pip install --force-reinstall --no-deps \
"timesfm @ git+https://github.com/google-research/timesfm.git"
Here is the minimal sample. The pattern is to set the context and horizon limits using ForecastConfig and compile, then pass the actual prediction length to forecast.
import numpy as np
import timesfm
model = timesfm.TimesFM_2p5_200M_torch.from_pretrained(
    "google/timesfm-2.5-200m-pytorch"
)
model.compile(timesfm.ForecastConfig(
    max_context=2048, max_horizon=256,
    normalize_inputs=True, use_continuous_quantile_head=True,
    fix_quantile_crossing=True,
))
point_forecast, quantile_forecast = model.forecast(
    horizon=96, inputs=[np.arange(512, dtype=np.float32)],
)
print(point_forecast.shape)  # (1, 96)
Benchmark Results
From here, using ETTh1 (an electricity dataset), the three model variants (Chronos-2 120M, Chronos-2 28M, and TimesFM 2.5 200M) are evaluated along three axes: latency, accuracy, and memory efficiency.
The OT column (oil temperature) from ETTh1 (Electricity Transformer Temperature, Hourly, 7 variables, 17,420 rows) was used. This is a dataset from a Chinese electric power company monitoring distribution transformers over 2 years (July 2016 to July 2018), aimed at predicting signs of overheating accidents from the transformer's internal oil temperature (OT) and 6 types of load variables (high/medium/low active/reactive power). Its structure closely resembles industrial equipment monitoring data in manufacturing, and since the Informer paper (AAAI 2021 Best Paper), it has been used as a standard benchmark for time-series forecasting.
Evaluation used 64 rolling-window splits taken from the tail of the series at step=24. The accuracy metrics are MASE (Mean Absolute Scaled Error) and RMSE. Inference latency is reported as the warm median, with the cold first call measured separately from the warm calls, and peak GPU memory is captured with torch.cuda.max_memory_allocated().
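The split scheme can be sketched as follows. This is a minimal illustration, not the article's actual evaluation script: the function name `rolling_window_splits` is hypothetical, and the exact alignment (last window ending at the final row, earlier windows shifted back by `step`) is an assumption, since only "64 splits from the tail at step=24" is stated.

```python
def rolling_window_splits(n_rows, context, horizon, n_windows=64, step=24):
    """Return (context_start, context_end, target_end) index triples.

    Assumed layout: the newest window's target ends at the last row, and
    each earlier window is shifted back by `step` rows.
    """
    splits = []
    for k in range(n_windows):
        target_end = n_rows - k * step
        context_end = target_end - horizon      # forecast starts here
        context_start = context_end - context   # history fed to the model
        if context_start < 0:
            raise ValueError("series too short for the requested windows")
        splits.append((context_start, context_end, target_end))
    # Oldest window first, so results can be read in time order.
    return splits[::-1]


# ETTh1 has 17,420 rows; context=512 and horizon=96 as in the short-term test.
splits = rolling_window_splits(n_rows=17420, context=512, horizon=96)
print(len(splits))   # 64
print(splits[-1])    # (16812, 17324, 17420)
print(splits[0])     # (15300, 15812, 15908)
```

With step=24 on hourly data, each window starts at the same hour of the day, so the splits sample different days rather than different phases of the diurnal cycle.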
Short-term Forecasting (horizon=96, context=512)
| Model | Warm Median | Warm p95 | Peak GPU mem | MASE | RMSE |
|---|---|---|---|---|---|
| Chronos-2 120M | 6.5 ms | 6.7 ms | 0.255 GB | 1.149 | 2.228 |
| Chronos-2 28M | 3.7 ms | 3.9 ms | 0.084 GB | 1.154 | 2.263 |
| TimesFM 2.5 200M | 86.0 ms | 89.3 ms | 0.917 GB | 1.106 | 2.169 |



In terms of accuracy, TimesFM 2.5 achieves the best MASE and RMSE, but all 3 models score a MASE above 1.0, which means they fall slightly short of seasonal naive (MASE=1.0, which simply returns the value from 24 hours ago). Since ETTh1 OT has a strong diurnal cycle, the seasonal baseline is a tough competitor even for zero-shot foundation models.
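For readers who want to see how a seasonal-naive reference enters the metric, here is a minimal sketch of MASE and RMSE in plain Python. It assumes the common definition in which the scale is the in-sample MAE of the seasonal naive forecast over the history (season=24 for hourly data); the article does not spell out its exact scaling, and the helper names are illustrative.

```python
import math


def mase(history, actual, forecast, season=24):
    """Mean Absolute Scaled Error with a seasonal-naive scale.

    The scale is the in-sample MAE of predicting y[t] by y[t - season],
    computed on `history` (an assumed, commonly used definition).
    """
    scale = sum(
        abs(history[t] - history[t - season])
        for t in range(season, len(history))
    ) / (len(history) - season)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


def rmse(actual, forecast):
    """Root mean squared error over the forecast window."""
    return math.sqrt(
        sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    )


def seasonal_naive(history, horizon, season=24):
    """Repeat the last full season of `history` over the horizon."""
    last_season = history[-season:]
    return [last_season[h % season] for h in range(horizon)]


# Toy series: a 24-step cycle plus a slow upward trend.
history = [(t % 24) + 0.1 * t for t in range(240)]
actual = [(t % 24) + 0.1 * t for t in range(240, 264)]
naive = seasonal_naive(history, horizon=24)

print(round(mase(history, actual, naive), 6))   # 1.0
print(mase(history, actual, actual))            # 0.0 for a perfect forecast
```

Under this definition a value below 1.0 means the forecast beats seasonal naive on average, and a value above 1.0 means it does worse.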
The latency difference is quite pronounced: Chronos-2 28M (3.7ms) and Chronos-2 120M (6.5ms) are 13 to 23 times faster than TimesFM 2.5 200M (86ms). This directly reflects the architectural difference between Chronos-2, which outputs quantiles for the full horizon in a single encoder forward pass, and TimesFM 2.5, which generates autoregressively with a decoder-only architecture.
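Warm median and p95 figures like these are typically collected with a small timing harness. The sketch below is an assumed setup, not the article's script: `measure_latency` and its parameters are hypothetical names, and on a GPU the `sync` argument would be something like `torch.cuda.synchronize` so that asynchronous kernels are included in the measurement.

```python
import math
import statistics
import time


def measure_latency(fn, n_warmup=3, n_runs=20, sync=None):
    """Time `fn` and report cold and warm latency in milliseconds.

    `sync` is an optional callable that blocks until the device is idle;
    without it, asynchronous GPU work would be under-measured.
    """
    def timed_call():
        if sync is not None:
            sync()                      # nothing left in flight before timing
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()                      # wait for the work launched by fn
        return (time.perf_counter() - start) * 1000.0

    cold_ms = timed_call()              # first call: lazy init, cache misses
    for _ in range(n_warmup):           # warm-up runs are discarded
        timed_call()
    warm = sorted(timed_call() for _ in range(n_runs))
    p95_index = math.ceil(0.95 * len(warm)) - 1   # nearest-rank percentile
    return {
        "cold_ms": cold_ms,
        "warm_median_ms": statistics.median(warm),
        "warm_p95_ms": warm[p95_index],
    }


# CPU-only stand-in for a model call, just to exercise the harness.
stats = measure_latency(lambda: sum(range(100_000)))
print(sorted(stats))   # ['cold_ms', 'warm_median_ms', 'warm_p95_ms']
```

Reporting the cold call separately keeps one-off costs such as weight loading or kernel compilation from distorting the steady-state numbers.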
In terms of GPU memory, the 28M model uses 1/3 of the 120M and 1/11 of TimesFM 2.5, while delivering nearly equivalent accuracy.
What Happens When the Horizon Is Extended
The forecasting target was extended from 96 hours (4 days) to 720 hours (30 days) to observe how latency and accuracy change.

| horizon | Chronos-2 120M | Chronos-2 28M | TimesFM 2.5 200M |
|---|---|---|---|
| 96 | 6.5 ms / MASE 1.149 | 3.7 ms / MASE 1.154 | 86 ms / MASE 1.106 |
| 192 | 6.7 ms / 1.196 | 3.8 ms / 1.221 | 229 ms / 1.241 |
| 336 | 6.8 ms / 1.259 | 3.9 ms / 1.330 | 228 ms / 1.214 |
| 720 | 6.8 ms / 1.453 | 3.8 ms / 1.543 | 228 ms / 1.411 |
Chronos-2's latency is nearly identical between horizon=96 and horizon=720 (6.5ms → 6.8ms) — quietly impressive. TimesFM 2.5 increased by 2.7x from h=96 to h=192, then plateaued as the internal horizon_len hit its ceiling.
In terms of accuracy, all models naturally degrade as the horizon extends. TimesFM 2.5 is not ahead at every horizon (Chronos-2 120M leads at h=192, 1.196 vs 1.241), but it is the most accurate at h=96, h=336, and h=720, where its MASE of 1.411 slightly outperforms Chronos-2 120M (1.453). Holding its accuracy at long horizons is one of TimesFM 2.5's strengths.
TimesFM 2.5 Shows Dramatic Accuracy Improvement When Context Is Extended
Given that TimesFM 2.5 advertises "16K context," testing whether accuracy actually improves with longer context was an important point of investigation. Since Chronos-2 has a context limit of 8,192, this was evaluated with TimesFM 2.5 alone.

| context | Warm Median | Peak GPU | MASE | vs c=512 |
|---|---|---|---|---|
| 512 | 86.0 ms | 0.917 GB | 1.106 | baseline |
| 8,192 | 331.2 ms | 0.993 GB | 0.883 | 20% improvement |
| 15,360 | 447.8 ms | 1.074 GB | 0.770 | 30% improvement |
At c=15,360, MASE reaches 0.770 — that's 23% better than seasonal naive. The benefit of "16K context" is clearly reflected in the numbers, reaffirming that this model truly leverages long context. One important note: the full 16,384 context limit of the model cannot be fully utilized, as there is an architectural constraint of max_context + max_horizon ≤ 16384. If horizon=1024 is reserved, the practical context limit is up to 15,360. The long-context verification in this article was also measured at c=15,360.
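The arithmetic behind the context budget and the improvement percentages can be checked with a few lines. The 16,384 ceiling is the constraint stated above; the helper names are illustrative only and are not part of the timesfm API.

```python
TIMESFM_25_TOTAL_LIMIT = 16_384  # max_context + max_horizon ceiling from the text


def max_usable_context(max_horizon, total_limit=TIMESFM_25_TOTAL_LIMIT):
    """Largest max_context that still satisfies context + horizon <= limit."""
    if not 0 < max_horizon < total_limit:
        raise ValueError("max_horizon must be between 1 and total_limit - 1")
    return total_limit - max_horizon


def relative_improvement_pct(baseline, value):
    """Percentage reduction of an error metric relative to a baseline."""
    return (baseline - value) / baseline * 100.0


print(max_usable_context(1024))                         # 15360
print(max_usable_context(256))                          # 16128
print(round(relative_improvement_pct(1.106, 0.883), 1)) # 20.2
print(round(relative_improvement_pct(1.106, 0.770), 1)) # 30.4
print(round(relative_improvement_pct(1.0, 0.770), 1))   # 23.0 vs seasonal naive
```

Reserving a smaller max_horizon frees more of the budget for context, so the horizon passed to ForecastConfig is worth sizing to the actual need.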
A Look at Multivariate Forecasting
Since ETTh1 has 6 electrical load columns — HUFL / HULL / MUFL / MULL / LUFL / LULL — in addition to OT, we tested whether simultaneously forecasting all 7 variables in multivariate mode with Chronos-2 would change the OT accuracy.

As a result, looking at OT accuracy with Chronos-2 120M, it worsened by 7% from univariate MASE 1.149 to multivariate MASE 1.226. On the other hand, the other 6 variables were predicted well with MASE values of 0.87–1.09, and LULL in particular achieved 0.90, outperforming seasonal naive by 10%.
This is likely the same phenomenon commonly seen in machine learning, where mixing in covariates that correlate only weakly with the target can be counterproductive. Running the same test with Chronos-2 28M left OT MASE nearly unchanged at 1.149 (univariate: 1.154), suggesting that for dashboard use cases where all metrics are returned in a single forward pass, this model could be a good fit.
Does the Ranking Hold in a Different Domain?
To check whether the ranking established with ETTh1 (electricity data) holds in other domains, we ran the same evaluation on the T column (room temperature in Rome) from the UCI Air Quality dataset.

| Dataset | 1st Place | 2nd Place | 3rd Place |
|---|---|---|---|
| ETTh1 (Electricity OT) | TimesFM 2.5 (1.106) | Chronos-2 120M (1.149) | Chronos-2 28M (1.154) |
| AirQuality (Room Temp T) | Chronos-2 120M (1.206) | Chronos-2 28M (1.254) | TimesFM 2.5 (1.364) |
The rankings were reversed. AirQuality is noisier as a time series than ETTh1 and lacks a strong diurnal cycle (while day-night differences exist, seasonal variation is also large), so it seems that TimesFM 2.5's strength with long context doesn't come into play with short context (c=512). This result shows that "TimesFM 2.5 is not always the best," reinforcing the importance of comparing models on the actual production domain.
Anomaly Detection Simulation
Hearing that NV-Tesseract is strong at anomaly detection naturally raises the question of how useful prediction-based anomaly detection with Chronos-2 / TimesFM 2.5 can be. Three types of anomalies — spikes, level shifts, and noise bursts — were artificially injected into the ETTh1 test period, and the absolute value of prediction residuals was used as an anomaly score to calculate per-step ROC AUC.
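The scoring procedure can be sketched in plain Python as follows. This is an illustrative reconstruction under assumptions, not the article's code: the constant `forecast` stands in for a real model output, the spike magnitude and positions are made up, and `roc_auc` uses the rank-sum (Mann-Whitney) formulation rather than any particular library.

```python
import random


def roc_auc(scores, labels):
    """Per-step ROC AUC via average ranks (handles tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both normal and anomalous steps")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def inject_spikes(series, positions, magnitude):
    """Add `magnitude` at the given steps and return (series, labels)."""
    out, labels = list(series), [0] * len(series)
    for p in positions:
        out[p] += magnitude
        labels[p] = 1
    return out, labels


rng = random.Random(0)
clean = [10.0 + rng.gauss(0, 0.5) for _ in range(96)]   # synthetic test window
forecast = [10.0] * 96                                  # stand-in for a model forecast
observed, labels = inject_spikes(clean, positions=[20, 50, 80], magnitude=5.0)

# Anomaly score: absolute prediction residual at each step.
scores = [abs(o - f) for o, f in zip(observed, forecast)]
auc = roc_auc(scores, labels)
```

Level shifts and noise bursts would follow the same pattern with a different injection function, for example adding a constant offset over a range of steps or scaling up the noise inside a short window.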

| Pattern | Chronos-2 120M | Chronos-2 28M | TimesFM 2.5 200M |
|---|---|---|---|
| Spike (instantaneous anomaly) | 0.965 | 0.968 | 0.981 |
| Level shift (persistent anomaly) | 0.719 | 0.706 | 0.685 |
| Noise burst (short-term) | 0.658 | 0.642 | 0.663 |
All 3 models achieved AUC above 0.96 for spikes, meaning instantaneous outliers are detected almost without fail. The chart below shows a qualitative example, where the prediction residuals spike sharply only at the injected spike points (red circles), which makes the detection visually clear as well.

Level shifts scored a moderate AUC of 0.69–0.72, meaning gradual drift-type anomalies similar to equipment degradation can be detected to some extent. Noise bursts scored lower at AUC 0.64–0.67, suggesting that short-term, high-frequency noise is easy to miss with prediction-based FMs alone. A dedicated anomaly detection model like NV-Tesseract should excel in this area, and it would be worth comparing once it's publicly available.
NV-Tesseract
According to the official NVIDIA Developer Blog (New NVIDIA NV-Tesseract Time-Series Models and Advancing Anomaly Detection for Industry Applications), NV-Tesseract is a Transformer-based model supporting three tasks: forecasting, anomaly detection, and classification. In particular, anomaly detection (NV-Tesseract-AD) uses a mechanism called segmented / multi-scale adaptive thresholding, which appears to be designed for detecting subtle outliers in industrial sensors.
One striking real-world manufacturing example is the partnership with Cognite, announced on the first day of GTC 2026 (March 16, 2026). A case study has been published about running NV-Tesseract NIM for state prediction from reactor water level sensors at Celanese's chemical plant in Clear Lake, Texas. The challenge from Celanese's side was that "every time manual sampling occurred, conventional prediction models would cause bias jumps that disrupted continuous operation," and the goal is to bridge this gap with NV-Tesseract's real-time forecasting.
What's interesting about the delivery model is that it connects NVIDIA's NV-Tesseract NIM to Cognite's Industrial Knowledge Graph (a graph DB of equipment, sensors, and operational knowledge), providing a package where data context and the model are delivered together. The target sectors are stated to extend beyond chemicals to energy, manufacturing, power and renewable energy, and heavy industry, and it's clear that the intent is to deploy this through industrial OT data platform providers as a gateway to optimize entire production lines. Combined with video-based foundation models like VSS and Cosmos, there seems to be a trajectory toward factory visualization on two fronts: "sensor time series + video."
The timeline for when it will be available to run locally on DGX Spark or Jetson is currently undetermined, but it's something to look forward to.
Comparison of Ecosystem and Operational Aspects
Why did the results in the Benchmark Results section turn out the way they did? Looking at the internal structures of the 3 models side by side reveals that differences in design philosophy directly translate into differences in operational characteristics.
Chronos-2 outputs the full horizon in a single forward pass via the encoder + quantile head, so latency doesn't increase much as the horizon grows. TimesFM 2.5 generates patch-by-patch autoregressively with a decoder, so latency grows with the horizon, but in exchange it can leverage long context — that's the design tradeoff.
Here are recommendations by use case:
| Intended Use | Likely Best Choice |
|---|---|
| AWS-centric demand forecasting (rich covariates) | Chronos-2 + SageMaker JumpStart / Bedrock Marketplace |
| One-shot SQL forecasting on a Google Cloud data lake | TimesFM 2.5 + BigQuery AI.FORECAST |
| IoT sensor anomaly detection (high accuracy) | NV-Tesseract |
| Lightweight edge deployment | Chronos-2 28M |
| Long-term trend forecasting (thousands of steps) | TimesFM 2.5 (long context mode) |
| Massively parallel short-term forecasting | Chronos-2 120M |
| Probabilistic forecasting for finance, etc. | Chronos-2 |
Application Considerations for Real Projects
What I'm thinking about for a project I'm actually involved in is a use case where real-time data acquired from PLCs (industrial controllers that control sensors and set values on the manufacturing floor) is fed into a time-series model to sequentially determine "is it likely that normal products will be produced under current operating conditions?" and "are there signs of heading toward a failure as the set values change?" Passing set values as future covariates enables forecasting future values of quality indicators, and looking at prediction residuals can also serve as an anomaly score. As shown in the benchmark in this article, Chronos-2 28M achieves a warm latency of 3.7ms, well within PLC scan cycles (100ms to 1 second), making real-time integration in an edge-close configuration realistically feasible. The Cognite × Celanese case mentioned earlier follows exactly the same structure, implementing this by connecting an OT platform provider to NIM. I'm planning to cover the technical verification of feeding actual PLC data in a separate article.
Implementation Pitfalls to Be Aware Of
Here are two points where it was easy to get stuck due to transitional library conditions rather than the time-series models themselves.
The predict_quantiles API for Chronos-2 was completely revamped in version 2.2, so directly copying from older articles or official examples won't work. The argument is inputs=tensor, the input must be 3-dimensional (n_series, n_variates, history_length) (even for univariate: tensor[None, None, :]), and the return value is tuple[list[Tensor], list[Tensor]] where the shape of quantiles_list[0] is (n_variates, horizon, num_quantiles).
The practical context limit for TimesFM 2.5 is up to 15,360. Due to the architectural constraint of max_context + max_horizon ≤ 16384, reserving horizon=1024 limits the context to 15,360. The long-context verification in this article was also measured at c=15,360.
Summary
This article organized the major updates to time-series foundation models that have been rolling out since fall 2025, benchmarked AWS Chronos-2 and Google TimesFM 2.5 on actual hardware, and introduced NV-Tesseract based on official information and the Cognite case study.
To put it simply: Chronos-2 for latency-sensitive, massively parallel workloads; TimesFM 2.5 when you want to leverage long context for better accuracy; and NV-Tesseract for dedicated anomaly detection in manufacturing. Chronos-2 28M is lightweight in both memory and inference time, with accuracy nearly on par with the 120M model — an interesting option for near-edge deployments.
Reference Links
Chronos-2
- Amazon Science Blog - Introducing Chronos-2
- GitHub - amazon-science/chronos-forecasting
- arXiv 2510.15821 - Chronos-2: From Univariate to Universal Forecasting
TimesFM 2.5
- HuggingFace - google/timesfm-2.5-200m-pytorch
- GitHub - google-research/timesfm
- BigQuery AI.FORECAST documentation
NV-Tesseract
- NVIDIA Developer Blog - NV-Tesseract Time-Series Models
- NVIDIA Developer Blog - NV-Tesseract-AD for Industry Applications
- Cognite × NVIDIA Partnership
