
I tried running and comparing time series foundation models on DGX Spark
Hello, I'm Shigeru Mori from the Classmethod Manufacturing Business Technology Department.
Starting in the fall of 2025, foundation models for time-series forecasting received major updates in rapid succession. Google's TimesFM 2.5, NVIDIA's NV-Tesseract (in evaluation license stage), and AWS's Chronos-2 — three major models from three companies lined up within just two months. Each has slightly different target areas, such as demand forecasting, anomaly detection, and sensor monitoring in manufacturing, and there's ongoing debate about which one comes out on top in benchmarks.
In this article, I benchmark Chronos-2 and TimesFM 2.5, both of which can be run locally, and cover NV-Tesseract, which is currently accessible only via DGX Cloud NIM, as a read-only section based on official information and Cognite's case study. All tests ran on my own DGX Spark.
Positioning and Comparison of the 3 Models
Organizing the three models by developer, distribution format, and intended use at the ecosystem level gives the following picture. AWS offers Chronos-2 through SageMaker JumpStart and Bedrock Marketplace, Google exposes TimesFM through BigQuery's AI.FORECAST, and NVIDIA provides NV-Tesseract as a DGX Cloud NIM. It's interesting how clearly each company's platform strategy shows through these choices.
Placing the key specs side by side, the commonalities and differences become clear.
| Item | Chronos-2 | TimesFM 2.5 | NV-Tesseract |
|---|---|---|---|
| Provider | AWS (Amazon Science) | Google Research | NVIDIA |
| Release | 2025-10-20 | 2025-09 | 2025 (Evaluation License) |
| Architecture | T5 encoder + Group Attention | Decoder-only + Quantile Head | Transformer (details undisclosed) |
| Parameters | 120M / 28M | 200M | Undisclosed |
| Context Limit | 8,192 | 16,384 | Undisclosed |
| Multivariate Forecasting | Native support | Supported | Supported |
| Anomaly Detection | - (forecasting-focused) | AI.DETECT_ANOMALIES GA | Native (NV-Tesseract-AD) |
| Fine-Tuning | Available | Available | Via NIM only |
| License | Apache 2.0 | Apache 2.0 | Evaluation License |
| Availability | HF / GitHub | HF / JAX version available | DGX Cloud NIM only |
| Commercial Use | Available | Available | Conditional during evaluation period |
Both Chronos-2 and TimesFM 2.5 are Apache 2.0 and can run locally, while NV-Tesseract is in the evaluation license stage and can only be accessed via DGX Cloud NIM — a clear difference in availability. The general release date has not been announced as of now. (As of May 2026)
Test Environment
Both models are fully operable via Python library calls. Since dependency versions conflict, I set up separate uv venv environments for each.
| Item | Value |
|---|---|
| Hardware | DGX Spark (NVIDIA GB10, aarch64, 128GB UMA) |
| OS / Driver | Ubuntu 24.04 / NVIDIA driver 580.142 |
| Python | 3.13.13 |
| PyTorch | 2.12.0+cu130 (shared across both venvs) |
| For Chronos-2 | chronos-forecasting==2.2.2, transformers==4.57.6 |
| For TimesFM 2.5 | timesfm==2.0.0 (git head d720daa6), safetensors==0.7.0 |
Trying Chronos-2
Just install chronos-forecasting>=2.1.0 (2.1.0 and later includes a bug fix for past covariate handling) via pip. The cu130 wheel index is explicitly specified for ARM64 + Blackwell GB10.
uv pip install --index-url https://download.pytorch.org/whl/cu130 \
    --extra-index-url https://pypi.org/simple/ torch
uv pip install "chronos-forecasting>=2.1.0"
The minimal sample looks like this. It runs end-to-end from BaseChronosPipeline.from_pretrained to predict_quantiles, returning 21 quantiles in a single inference after loading.
import numpy as np
import torch
from chronos import BaseChronosPipeline
pipeline = BaseChronosPipeline.from_pretrained(
    "amazon/chronos-2", device_map="cuda", dtype=torch.bfloat16
)
# Input is a 3D tensor of shape (n_series, n_variates, history_length)
context = torch.tensor(np.arange(512, dtype=np.float32)[None, None, :])
quantiles_list, mean_list = pipeline.predict_quantiles(
    context, prediction_length=96, quantile_levels=[0.1, 0.5, 0.9]
)
print(quantiles_list[0].shape)  # torch.Size([1, 96, 3])
One thing to watch out for: chronos-forecasting>=2.2 has completely revamped the argument names, input shape, and return values of predict_quantiles, so copy-pasting from older articles or official examples may not work as-is.
Trying TimesFM 2.5
The one thing to be careful about with TimesFM 2.5 is that the timesfm==1.3.0 distributed on PyPI does not yet include the TimesFM 2.5 API. Even the Hugging Face model card explicitly states "pip install from PyPI coming soon. At this point, please git clone," so currently the only option is to source install from the GitHub HEAD. Since we install with --no-deps, safetensors needs to be specified separately.
uv pip install --index-url https://download.pytorch.org/whl/cu130 \
    --extra-index-url https://pypi.org/simple/ torch safetensors
uv pip install --force-reinstall --no-deps \
    "timesfm @ git+https://github.com/google-research/timesfm.git"
Here is the minimal sample. The pattern is to set the context and horizon limits with ForecastConfig and compile, then pass the actual prediction length to forecast.
import numpy as np
import timesfm
model = timesfm.TimesFM_2p5_200M_torch.from_pretrained(
    "google/timesfm-2.5-200m-pytorch"
)
# Compile once with the upper limits for context and horizon
model.compile(timesfm.ForecastConfig(
    max_context=2048, max_horizon=256,
    normalize_inputs=True, use_continuous_quantile_head=True,
    fix_quantile_crossing=True,
))
# The actual prediction length is passed at forecast time
point_forecast, quantile_forecast = model.forecast(
    horizon=96, inputs=[np.arange(512, dtype=np.float32)],
)
print(point_forecast.shape)  # (1, 96)
Benchmark Results
From here, I'll evaluate the three locally runnable model variants (Chronos-2 120M, Chronos-2 28M, and TimesFM 2.5 200M) on ETTh1 (an electricity dataset) across three axes: latency, accuracy, and memory efficiency.
For the evaluation, I used the OT column (oil temperature) from ETTh1 (Electricity Transformer Temperature, Hourly, 7 variables, 17,420 rows). This is a dataset from a Chinese power company that monitored distribution transformers for two years (July 2016 to July 2018), aimed at predicting signs of overheating accidents from the internal oil temperature (OT) and six types of load variables (high/medium/low active/reactive power). Its structure is close to equipment monitoring data in manufacturing, and it has been used as a standard benchmark for time-series forecasting since the Informer paper (AAAI 2021 Best Paper).
The evaluation uses 64 rolling-window splits taken from the end of the series with step=24. The metrics are MASE (Mean Absolute Scaled Error) and RMSE. Inference latency is reported as the warm-run median, with cold and warm runs measured separately, and peak GPU memory is measured with torch.cuda.max_memory_allocated().
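As a rough illustration of the split scheme described above, the sketch below generates 64 rolling windows anchored at the end of the series with a step of 24. The function name `rolling_splits` and the choice to return index ranges are assumptions made for illustration, not the actual evaluation script; the row count, context, and horizon are the values quoted in this article.

```python
from typing import Iterator, Tuple


def rolling_splits(
    n_rows: int,
    context: int,
    horizon: int,
    n_windows: int = 64,
    step: int = 24,
) -> Iterator[Tuple[range, range]]:
    """Yield (context_range, target_range) index pairs, oldest window first.

    The newest window ends exactly at the last row; each earlier window
    is shifted back by `step` rows.
    """
    for k in range(n_windows - 1, -1, -1):
        target_end = n_rows - k * step
        target_start = target_end - horizon
        context_start = target_start - context
        if context_start < 0:
            raise ValueError("series is too short for the requested windows")
        yield range(context_start, target_start), range(target_start, target_end)


# Values from the article: ETTh1 has 17,420 rows; short-term setup is c=512, h=96.
splits = list(rolling_splits(n_rows=17_420, context=512, horizon=96))
first_ctx, first_tgt = splits[0]
last_ctx, last_tgt = splits[-1]
print(len(splits), first_tgt.start, last_tgt.stop)  # 64 15812 17420
```

Each pair can then be used to slice the OT column into model input and ground truth, so that every model sees exactly the same windows.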
Short-term Forecasting (horizon=96, context=512)
| Model | Warm Median | Warm p95 | Peak GPU mem | MASE | RMSE |
|---|---|---|---|---|---|
| Chronos-2 120M | 6.5 ms | 6.7 ms | 0.255 GB | 1.149 | 2.228 |
| Chronos-2 28M | 3.7 ms | 3.9 ms | 0.084 GB | 1.154 | 2.263 |
| TimesFM 2.5 200M | 86.0 ms | 89.3 ms | 0.917 GB | 1.106 | 2.169 |



In terms of accuracy, TimesFM 2.5 achieves the best MASE and RMSE, but all three models score a MASE slightly above 1.0, which means they land marginally behind seasonal naive (MASE=1.0, the baseline that simply returns the value from 24 hours earlier). Since ETTh1 OT has a strong daily cycle, even zero-shot foundation models face a reasonably tough baseline.
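To make the MASE=1.0 reference point concrete, here is a minimal sketch of both metrics. It assumes the common definition in which the forecast MAE is divided by the in-sample MAE of a seasonal naive forecast with a season of 24 steps; the article does not spell out its exact scaling, so treat the helper names and this definition as assumptions. The toy series is synthetic and exists only to show that repeating the previous day scores exactly 1.0 on it.

```python
import math
from typing import Sequence


def seasonal_naive_mae(history: Sequence[float], season: int = 24) -> float:
    """In-sample MAE of predicting each value with the value one season earlier."""
    errors = [abs(history[t] - history[t - season]) for t in range(season, len(history))]
    return sum(errors) / len(errors)


def mase(actual: Sequence[float], forecast: Sequence[float],
         history: Sequence[float], season: int = 24) -> float:
    """Forecast MAE scaled by the seasonal naive MAE on the history window."""
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / seasonal_naive_mae(history, season)


def rmse(actual: Sequence[float], forecast: Sequence[float]) -> float:
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


# Toy series: a daily sawtooth that drifts up by 1 each day.
series = [(t % 24) + (t // 24) for t in range(96)]
history, actual = series[:72], series[72:]

naive_forecast = history[-24:]             # repeat the last observed day
better_forecast = [a - 0.5 for a in actual]  # hypothetical model with half the error

print(mase(actual, naive_forecast, history))   # 1.0
print(mase(actual, better_forecast, history))  # 0.5
print(rmse(actual, naive_forecast))            # 1.0
```

Values below 1.0 indicate a forecast that beats the seasonal naive scale, and values above 1.0 indicate one that falls behind it.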
The latency difference is quite pronounced: Chronos-2 28M (3.7ms) and Chronos-2 120M (6.5ms) are 13 to 23 times faster than TimesFM 2.5 200M (86ms). This directly reflects the architectural difference between Chronos-2, which outputs all quantiles in a single encoder forward pass, and TimesFM 2.5, which generates autoregressively with a decoder-only architecture.
In terms of GPU memory, the 28M model uses 1/3 of the 120M and 1/11 of TimesFM 2.5, while achieving nearly equivalent accuracy.
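For reference, warm median and p95 figures like the ones in the table above could be collected with a harness along the following lines. This is a sketch under assumptions, not the script used for this article: `measure_latency` is a hypothetical helper, the warm-up and run counts are arbitrary, and p95 uses the nearest-rank method. When timing GPU inference, the callable would also need to call torch.cuda.synchronize() before returning so that asynchronous kernels are included in the measurement.

```python
import math
import statistics
import time
from typing import Callable, Dict


def measure_latency(fn: Callable[[], object], n_warmup: int = 3, n_runs: int = 30) -> Dict[str, float]:
    """Time fn() in milliseconds, keeping the cold first call separate from warm runs."""
    t0 = time.perf_counter()
    fn()  # cold call: includes lazy initialization and cache fills
    cold_ms = (time.perf_counter() - t0) * 1e3

    for _ in range(n_warmup):  # discarded warm-up calls
        fn()

    warm = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        warm.append((time.perf_counter() - t0) * 1e3)

    warm.sort()
    p95_index = math.ceil(0.95 * len(warm)) - 1  # nearest-rank percentile
    return {
        "cold_ms": cold_ms,
        "warm_median_ms": statistics.median(warm),
        "warm_p95_ms": warm[p95_index],
    }


# Stand-in workload; in practice this would wrap a single model inference call.
stats = measure_latency(lambda: sum(range(10_000)), n_warmup=2, n_runs=20)
print({name: round(value, 3) for name, value in stats.items()})
```

Reporting the median rather than the mean keeps a single slow run from distorting the headline number, while p95 still exposes tail latency.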
What Happens as Horizon Increases
I extended the forecast target from 96 hours (4 days) to 720 hours (30 days) to see how latency and accuracy change.

| horizon | Chronos-2 120M | Chronos-2 28M | TimesFM 2.5 200M |
|---|---|---|---|
| 96 | 6.5 ms / MASE 1.149 | 3.7 ms / MASE 1.154 | 86 ms / MASE 1.106 |
| 192 | 6.7 ms / 1.196 | 3.8 ms / 1.221 | 229 ms / 1.241 |
| 336 | 6.8 ms / 1.259 | 3.9 ms / 1.330 | 228 ms / 1.214 |
| 720 | 6.8 ms / 1.453 | 3.8 ms / 1.543 | 228 ms / 1.411 |
Chronos-2's latency is nearly identical between horizon=96 and horizon=720 (6.5ms → 6.8ms) — quietly impressive. TimesFM 2.5 increased 2.7x from h=96 to h=192, then plateaued as the internal horizon_len hit its ceiling.
Accuracy naturally degrades for all models as the horizon increases, and the ranking is not uniform: Chronos-2 120M is ahead at h=192 (1.196 vs. 1.241), while TimesFM 2.5 leads at h=96, h=336, and h=720, where its MASE of 1.411 still slightly beats Chronos-2 120M (1.453). Holding a small accuracy edge at the longest horizons is a strength of TimesFM 2.5.
TimesFM 2.5 Shows Dramatic Accuracy Improvement with Longer Context
Since TimesFM 2.5 touts "16K context," whether accuracy actually improves when context is extended was a key question worth testing. Chronos-2 has a context limit of 8,192, so this evaluation was done with TimesFM 2.5 alone.

| context | Warm Median | Peak GPU | MASE | vs c=512 |
|---|---|---|---|---|
| 512 | 86.0 ms | 0.917 GB | 1.106 | baseline |
| 8,192 | 331.2 ms | 0.993 GB | 0.883 | 20% improvement |
| 15,360 | 447.8 ms | 1.074 GB | 0.770 | 30% improvement |
At c=15,360, MASE reaches 0.770 — 23% better than seasonal naive. The benefit of "16K context" is clearly reflected in the numbers, reinforcing the impression that this model truly benefits from long context. One important caveat: you cannot use the full 16,384 context limit of the model. Due to the architectural constraint that max_context + max_horizon ≤ 16384, if you reserve horizon=1024, the practical context limit is 15,360.
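The budget arithmetic and the improvement percentages in the table can be checked with a few lines. This is only a worked calculation based on the constraint quoted above; `max_usable_context` and `relative_improvement` are hypothetical helper names, not part of the timesfm API.

```python
TOTAL_BUDGET = 16_384  # max_context + max_horizon limit quoted in this article


def max_usable_context(max_horizon: int, total_budget: int = TOTAL_BUDGET) -> int:
    """Largest max_context that still leaves room for the reserved horizon."""
    if not 0 < max_horizon < total_budget:
        raise ValueError("max_horizon must be between 1 and total_budget - 1")
    return total_budget - max_horizon


def relative_improvement(baseline_mase: float, new_mase: float) -> float:
    """Fractional MASE reduction relative to the baseline (positive is better)."""
    return (baseline_mase - new_mase) / baseline_mase


print(max_usable_context(1024))                             # 15360
print(round(relative_improvement(1.106, 0.883) * 100, 1))   # 20.2
print(round(relative_improvement(1.106, 0.770) * 100, 1))   # 30.4
```

Reserving a smaller horizon frees up more context, so the split between the two is worth deciding per use case rather than fixing it once.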
Looking at Multivariate Forecasting
ETTh1 also has six power load columns beyond OT — HUFL / HULL / MUFL / MULL / LUFL / LULL — so I tested how predicting all 7 variables simultaneously with Chronos-2 multivariate mode affects OT-only accuracy.

The result: for Chronos-2 120M, OT accuracy degraded 7% from MASE 1.149 (univariate) to 1.226 (within multivariate). On the other hand, the other 6 variables were forecasted well with MASE 0.87–1.09, with LULL achieving 0.90 — 10% better than seasonal naive.
This is likely the phenomenon commonly seen in machine learning: "mixing covariates with weak correlation to the target can backfire." Running the same test with Chronos-2 28M yields OT MASE of about 1.149, essentially unchanged, suggesting it could be useful for dashboard use cases by leveraging the efficiency of "returning all variables in a single forward pass."
Does the Ranking Hold Across Different Domains?
To check whether the ranking from ETTh1 (electricity data) holds in other domains, I ran the same conditions on the T column (room temperature in Rome) from the UCI Air Quality dataset.

| Dataset | 1st Place | 2nd Place | 3rd Place |
|---|---|---|---|
| ETTh1 (Electricity OT) | TimesFM 2.5 (1.106) | Chronos-2 120M (1.149) | Chronos-2 28M (1.154) |
| AirQuality (Room Temperature T) | Chronos-2 120M (1.206) | Chronos-2 28M (1.254) | TimesFM 2.5 (1.364) |
The rankings reversed. AirQuality has noisier time series than ETTh1 and lacks as strong a daily cycle (there's a day/night difference, but seasonal variation is also large), so with short context (c=512), TimesFM 2.5's long-context advantage doesn't seem to help. This result confirms that "TimesFM 2.5 isn't always best," and that actually comparing models on your target domain is important.
Anomaly Detection Simulation
Hearing that NV-Tesseract excels at anomaly detection naturally raises the question: how useful are Chronos-2 / TimesFM 2.5's prediction-based anomaly detection approaches? I artificially injected three types of anomalies — spikes, level shifts, and noise bursts — into the test section of ETTh1, and computed per-step ROC AUC using the absolute value of prediction residuals as anomaly scores.

| Pattern | Chronos-2 120M | Chronos-2 28M | TimesFM 2.5 200M |
|---|---|---|---|
| Spike (instantaneous anomaly) | 0.965 | 0.968 | 0.981 |
| Level Shift (persistent anomaly) | 0.719 | 0.706 | 0.685 |
| Noise Burst (short-term) | 0.658 | 0.642 | 0.663 |
All three models achieve AUC above 0.96 for spikes, meaning instantaneous outliers can be detected with near certainty. The chart below shows a qualitative example: the prediction residual spikes sharply only at the injected spike points (red circles), which is visually easy to understand.

Level shifts score AUC 0.69–0.72, a moderate result. This suggests gradual equipment drift-type anomalies can be detected to a reasonable degree. Noise bursts score lower at AUC 0.64–0.67, suggesting that short-term high-frequency noise is likely to be missed by prediction FMs alone. This is exactly where dedicated anomaly detection models like NV-Tesseract should shine, and I'd like to compare them again once they're publicly available.
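The scoring procedure used in this simulation (inject an anomaly, take the absolute prediction residual as the anomaly score, compute per-step ROC AUC) can be sketched as follows. Everything here is a toy stand-in: a clean sine wave plays the role of the model forecast, a noisy copy plays the role of ETTh1, and the anomaly positions and magnitudes are arbitrary assumptions. The resulting AUC values are therefore not comparable to the table above. ROC AUC is computed with the rank-sum (Mann-Whitney) formulation to keep the example free of dependencies.

```python
import math
import random
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC via the Mann-Whitney rank-sum statistic (ties get average ranks)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # 1-based average rank of the tie group
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def inject(series: Sequence[float], kind: str, rng: random.Random) -> Tuple[List[float], List[int]]:
    """Return (corrupted_series, per-step labels) for one anomaly pattern."""
    out = list(series)
    labels = [0] * len(out)
    if kind == "spike":            # isolated single-step outliers
        steps = [40, 95, 150, 201]
        for t in steps:
            out[t] += 5.0
    elif kind == "level_shift":    # constant offset held for one day
        steps = range(100, 124)
        for t in steps:
            out[t] += 1.0
    elif kind == "noise_burst":    # extra zero-mean noise for one day
        steps = range(100, 124)
        for t in steps:
            out[t] += rng.gauss(0.0, 1.0)
    else:
        raise ValueError(f"unknown anomaly kind: {kind}")
    for t in steps:
        labels[t] = 1
    return out, labels


rng = random.Random(0)
n_steps = 240
# Stand-in for a model forecast: a clean daily cycle.
forecast = [10 + 3 * math.sin(2 * math.pi * t / 24) for t in range(n_steps)]
# Stand-in for the observed signal: the same cycle plus small measurement noise.
clean = [f + rng.gauss(0.0, 0.3) for f in forecast]

auc = {}
for kind in ("spike", "level_shift", "noise_burst"):
    observed, labels = inject(clean, kind, rng)
    scores = [abs(o - f) for o, f in zip(observed, forecast)]  # |residual| as anomaly score
    auc[kind] = roc_auc(labels, scores)
    print(f"{kind:12s} AUC = {auc[kind]:.3f}")
```

One reason noise bursts are hard for residual-based scoring is visible even in this toy: zero-mean noise often produces small residuals at individual steps, so many labeled steps look normal when scored one step at a time.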
NV-Tesseract
According to the official NVIDIA Developer Blog (New NVIDIA NV-Tesseract Time-Series Models and Advancing Anomaly Detection for Industry Applications), NV-Tesseract is a Transformer-based model supporting three tasks: forecasting, anomaly detection, and classification. In particular, the anomaly detection variant (NV-Tesseract-AD) appears to be designed around a mechanism called segmented / multi-scale adaptive thresholding, targeting precise detection of subtle outliers in industrial sensors.
A particularly notable real-world manufacturing example is the partnership with Cognite announced to coincide with the first day of GTC 2026 (March 16, 2026). A case study has been published of NV-Tesseract NIM running state predictions from reactor water level sensors at Celanese's chemical plant in Clear Lake, Texas. Celanese's pain point was that "traditional prediction models would produce bias jumps (stepped discontinuities in bias) every time manual sampling occurred, disrupting continuous operations" — the goal being to close this gap with real-time forecasting by NV-Tesseract.
What's interesting about the delivery format is that it connects Cognite's Industrial Knowledge Graph (a graph DB of equipment, sensors, and operational knowledge) to the NV-Tesseract NIM, making data context and model available together as a package. The target sectors extend beyond chemicals to energy, manufacturing, power and renewables, and heavy industry — suggesting a trajectory where industrial OT data platform providers serve as the gateway to optimize entire industrial lines. Combined with video foundation models like VSS and Cosmos, there seems to be a growing possibility of factory visualization progressing along both axes of "sensor time series + video."
The timeline for when it will be runnable locally on DGX Spark or Jetson is currently unannounced, but I look forward to it.
Ecosystem and Operational Comparison
Why did the results in the Benchmark Results section turn out the way they did? Looking at the internal architectures side by side, it becomes clear that differences in design philosophy show up directly as operational characteristics.
Chronos-2 outputs all horizons at once in a single encoder + quantile head forward pass, so latency doesn't grow much as horizon increases. TimesFM 2.5 generates patch units autoregressively with a decoder, so latency grows with horizon length — but in exchange, it can leverage long context. These are their respective design trade-offs.
Recommended choices by use case might look like this:
| Intended Use | Likely Best Choice |
|---|---|
| AWS-centric demand forecasting (rich covariates) | Chronos-2 + SageMaker JumpStart / Bedrock Marketplace |
| One-shot SQL forecasting on a Google Cloud data lake | TimesFM 2.5 + BigQuery AI.FORECAST |
| IoT sensor anomaly detection (high accuracy requirements) | NV-Tesseract |
| Lightweight edge-side deployment | Chronos-2 28M |
| Long-term trend forecasting (thousands of steps) | TimesFM 2.5 (long context mode) |
| High-volume parallel short-term forecasting | Chronos-2 120M |
| Probabilistic forecasting emphasis (e.g., finance) | Chronos-2 |
Application Considerations in Real Projects
In projects I'm currently involved in, I'm thinking about a use case where real-time data acquired from PLCs (industrial controllers that manage sensors and setpoints on the factory floor) is fed into a time-series model to continuously determine "are we likely to produce products as usual given current operating conditions?" and "are there signs of heading toward failure as setpoints change?" If setpoints are passed as future covariates, future values of quality indicators can be predicted, and prediction residuals can also serve as anomaly scores. Based on the benchmark in this article, Chronos-2 28M at 3.7ms warm latency comfortably fits within PLC scan cycles (100ms–1 second), making real-time edge integration realistically feasible. The Cognite × Celanese case mentioned earlier follows exactly the same structure, implementing this by connecting an OT platform with NIM. I plan to cover the technical verification of feeding actual PLC data in a separate article.
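A per-scan-cycle check for this kind of setup might look like the sketch below. It is purely illustrative: `StepResult` and `evaluate_step` are hypothetical names, the sensor and quantile values are made up, and flagging observations that fall outside the 10%-90% forecast band is an assumption added here alongside the residual score mentioned above. The 3.7 ms latency and the 100 ms scan cycle are the figures quoted in this article.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    residual: float       # |observed - median forecast|, usable as an anomaly score
    outside_band: bool    # observation fell outside the forecast quantile band
    within_budget: bool   # inference finished inside the PLC scan cycle


def evaluate_step(
    observed: float,
    q10: float,
    q50: float,
    q90: float,
    latency_ms: float,
    scan_cycle_ms: float = 100.0,
) -> StepResult:
    """Compare one observation with its forecast quantiles and check the time budget."""
    return StepResult(
        residual=abs(observed - q50),
        outside_band=not (q10 <= observed <= q90),
        within_budget=latency_ms < scan_cycle_ms,
    )


# Made-up reading and forecast quantiles; latency is the Chronos-2 28M warm median.
result = evaluate_step(observed=42.7, q10=39.0, q50=40.5, q90=42.0, latency_ms=3.7)
print(result)
```

Keeping the budget check next to the anomaly score makes it easy to notice when a larger model or a longer context starts to eat into the scan cycle.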
Implementation Pitfalls to Watch Out For
Here are two points where I got stuck, both caused by the transitional state of the libraries rather than by the time-series models themselves.
The predict_quantiles API in Chronos-2 was completely revamped in the 2.2 series, so copy-pasting from older articles or official examples won't work as-is. The tensor is passed as inputs=tensor, the input must be 3D with shape (n_series, n_variates, history_length) (even a univariate series needs tensor[None, None, :]), and the return value is a tuple[list[Tensor], list[Tensor]] in which quantiles_list[0] has shape (n_variates, horizon, num_quantiles).
The effective context limit for TimesFM 2.5 is 15,360. Due to the architectural constraint max_context + max_horizon ≤ 16384, reserving horizon=1024 limits context to 15,360. The long-context evaluation in this article was also measured at c=15,360.
Summary
In this article, I organized the major updates to time-series foundation models that started in the fall of 2025, benchmarked AWS Chronos-2 and Google TimesFM 2.5 on real hardware, and covered NV-Tesseract based on official information and the Cognite case study.
To put it simply: Chronos-2 for latency-critical high-volume parallel processing, TimesFM 2.5 when long context is needed for better accuracy, and NV-Tesseract when you want a dedicated model for anomaly detection in manufacturing. Chronos-2 28M is lightweight in both memory and inference time with accuracy nearly matching the 120M model — an interesting option for edge-side deployments as well.
Reference Links
Chronos-2
- Amazon Science Blog - Introducing Chronos-2
- GitHub - amazon-science/chronos-forecasting
- arXiv 2510.15821 - Chronos-2: From Univariate to Universal Forecasting
TimesFM 2.5
- HuggingFace - google/timesfm-2.5-200m-pytorch
- GitHub - google-research/timesfm
- BigQuery AI.FORECAST documentation
NV-Tesseract
- NVIDIA Developer Blog - NV-Tesseract Time-Series Models
- NVIDIA Developer Blog - NV-Tesseract-AD for Industry Applications
- Cognite × NVIDIA Partnership
