
I tried NVIDIA Nemotron 3 Ultra
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
NVIDIA's Nemotron 3 family now has a top-tier model, Ultra. It has 550B parameters and was released on June 4, 2026.
Up to now, I have been working through the Nemotron series on DGX Spark, starting from the smallest models. Everything up to the 120B-class Super ran locally, including a small model that I fine-tuned for Japanese, but Ultra is simply on a different scale. The weights alone do not fit in the DGX Spark's 128GB.
NVIDIA itself offers Nemotron 3 through a free API, so even models you can't run locally can be tried from a browser. In this article, I'll check what kind of environment Ultra requires, test it using that free API, and finally sort out "which model to use for what purpose," including Nano and Super.
What kind of model is Nemotron 3 Ultra
It's a MoE with 550B total parameters and 55B active per token. The architecture is called LatentMoE, a hybrid configuration combining Mamba-2, Transformer, and MoE.
NVIDIA is targeting long-running agents with this model. The technical blog describes the shift as "single-turn chatbots evolving into long-running agents." An agent makes plans, calls tools, launches sub-agents, and keeps feeding its history and reasoning back into the model. The longer the task, the more tokens accumulate in this back-and-forth, and the more both cost and the risk of goal drift increase. Ultra struck me as a design that tackles this problem through efficiency. NVIDIA's intended use cases are likewise agents that run for hours: complex coding, long-running research, and automation of internal workflows.
I personally found the division of roles in the hybrid configuration the most interesting part. Mamba layers handle long contexts efficiently, while Transformer layers take on the role of "precisely recalling specific facts from within a large context." In addition, MTP (Multi-Token Prediction) predicts multiple tokens at once, boosting throughput for long outputs. Combined with NVFP4 quantization, NVIDIA claims up to 5x the inference throughput of other open models in the same class and up to 30% lower operational costs when running agents. This is efficiency that pays off more the longer you run.
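To make the token accumulation concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified agent loop in which every turn re-sends the full history and the history grows by a fixed amount per turn. The function name and the per-turn figures are hypothetical, and real agents that use prompt caching or history compaction will behave differently.

```python
def cumulative_input_tokens(turns: int, system_tokens: int, tokens_per_turn: int) -> int:
    """Total input tokens processed when each turn re-sends the whole history.

    Simplifying assumption: the history grows by a fixed `tokens_per_turn`
    (tool output plus model reply) after every turn, and nothing is cached
    or summarized away.
    """
    total = 0
    history = system_tokens
    for _ in range(turns):
        total += history            # the model reads the entire history this turn
        history += tokens_per_turn  # this turn's output is appended for the next one
    return total


# Hypothetical figures: a 2,000-token system prompt and 3,000 new tokens per turn.
for turns in (10, 100, 300):
    print(turns, cumulative_input_tokens(turns, 2_000, 3_000))
```

Under these assumptions the processed input grows roughly with the square of the number of turns: 10 turns read 155,000 tokens in total, while 100 turns read about 15 million. By 300 turns the history itself is around 900,000 tokens, which is the regime a 1M context is meant for.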
One note of caution here: Ultra is text-only. Since Nano Omni in the same family is multimodal, you might assume images are supported too, but Ultra does not accept images or audio. Its highlights are the 1M token long context and reasoning capability. The license is OpenMDW 1.1, and commercial use is permitted.
In official benchmarks, figures include 91% on PinchBench, which measures agent productivity, 82% on IFBench for instruction following, and 95% on RULER at 1M context. Even with NVFP4, degradation from BF16 is kept to 2–3 points at most, and some items actually exceed BF16, which is reassuring for those of us who use the quantized version as a baseline.
What specs are needed to run Ultra
As you would expect, 550B does not fit on DGX Spark, so let's look at what kind of environment NVIDIA actually has in mind.
Even with NVFP4 quantization, the weight size is approximately 335GB. NVIDIA's minimum configuration is 4× B200 or 8× H100 80GB in a single node. With GB200 or GB300, you need at least 4 cards. That's over 600GB of VRAM, so this is a model that assumes data center-grade GPUs. The weights alone are 2.6× the DGX Spark's 128GB, so it's simply a different playing field.
| Model | Size | Approximate requirement |
|---|---|---|
| Nemotron 3 Nano | 30B-A3B | Easily fits on DGX Spark |
| Nemotron 3 Super | 120B-A12B | At the ceiling of DGX Spark (official deployment guide available) |
| Nemotron 3 Ultra | 550B-A55B | Servers with 600GB+ VRAM such as 4×B200 |
DGX Spark can handle up to the Super class. Ultra is one level above that, and running it requires a multi-GPU server.
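The sizing figures above can be checked with a few lines of arithmetic. This is only a sketch using numbers quoted in this article, treating 1 GB as 10^9 bytes; the variable and function names are my own, and it ignores KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS = 550e9       # total parameters (550B)
NVFP4_WEIGHTS_GB = 335     # approximate NVFP4 weight size quoted above
SPARK_MEMORY_GB = 128      # DGX Spark memory
H100_MEMORY_GB = 80        # per-GPU memory in the 8x H100 configuration


def weights_gb(params: float, bits_per_param: float) -> float:
    """Raw weight size in GB (1 GB = 1e9 bytes), ignoring any metadata."""
    return params * bits_per_param / 8 / 1e9


bf16_gb = weights_gb(TOTAL_PARAMS, 16)        # unquantized BF16 weights
pure_4bit_gb = weights_gb(TOTAL_PARAMS, 4)    # if every weight were exactly 4 bits
effective_bits = NVFP4_WEIGHTS_GB * 1e9 * 8 / TOTAL_PARAMS
spark_ratio = NVFP4_WEIGHTS_GB / SPARK_MEMORY_GB
h100_node_gb = 8 * H100_MEMORY_GB

print(f"BF16 weights:            {bf16_gb:.0f} GB")
print(f"Pure 4-bit weights:      {pure_4bit_gb:.0f} GB")
print(f"Effective bits/param:    {effective_bits:.2f}")
print(f"Weights vs. DGX Spark:   {spark_ratio:.2f}x")
print(f"8x H100 80GB total:      {h100_node_gb} GB")
print(f"Left after weights:      {h100_node_gb - NVFP4_WEIGHTS_GB} GB")
```

The quoted 335GB works out to a little under 5 bits per parameter rather than exactly 4. My assumption is that the difference comes from quantization scale factors and from tensors kept at higher precision, but I have not verified this. The roughly 300GB left over on an 8× H100 node is what remains for the KV cache and everything else.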
However, this barrier is also the flip side of Ultra's unique appeal. NVIDIA has published everything for Ultra (weights, training data, and recipes), so as long as you have a server of this class, you can run this 550B frontier model entirely within your own environment. You can use it without worrying about per-token metered billing, and you are free to do additional training or customization with your internal data. Cloud APIs are convenient, but they mean sending your input data to an external party. If you can instead operate the model confined to your own company's or country's data center, even fields such as manufacturing, healthcare, and government, which cannot expose confidential data, can use a top-tier model with peace of mind. This is what is called sovereign AI: running models while keeping data under your own control. I have been stressing that Ultra is too large to run locally, but the flip side is that if you have the hardware, you can keep it entirely in your own hands, and I think that is quite significant.
So where do you run it
If running locally is out of the question, the only option is to use Ultra somewhere that someone else is hosting it. Nemotron 3 Ultra became available through quite a wide range of channels at the same time as its release.
| Channel | Cost | Notes |
|---|---|---|
| build.nvidia.com API | Free (rate limits only) | OpenAI compatible. Used here |
| Nous Portal (Hermes Agent) | Free for 2 weeks (June 4–18) | Nebius partnership. Just select nvidia/nemotron-3-ultra:free |
| NVIDIA NIM container | NVIDIA AI Enterprise (90-day free evaluation) | Deploy in your own environment |
| OpenRouter | Free tier available | Provider dependent |
| Together AI | $0.60 input / $3.60 output per 1M tokens | OpenAI compatible |
| Various cloud providers | Each provider's pricing | NVIDIA mentions SageMaker JumpStart, Google Cloud, Microsoft Foundry, and Oracle Cloud |
NVIDIA's technical blog names clouds including SageMaker JumpStart, Google Cloud, Microsoft Foundry, and Oracle Cloud as providers. However, I couldn't track down Ultra-specific pages in each provider's catalog myself, so here I'll use the most accessible option: the free API at build.nvidia.com.
There is also another interesting channel for agent use cases. Nous Research, the developer of Hermes Agent, has joined NVIDIA's Nemotron Coalition and partnered with Nebius to offer Nemotron 3 Ultra for free on Nous Portal for 2 weeks (June 4–18). In Hermes Agent, whether on Desktop or CLI, you just select nvidia/nemotron-3-ultra:free as the model. I previously wrote an article about running Hermes Agent on DGX Spark's OpenShell with NemoHermes, and with this option you can drive the 550B model directly from Hermes. Being able to try Ultra, which is designed for long-running agents, straight from an agent execution platform seems like a natural fit.
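For the paid channels, the cost of a request follows directly from the per-token prices in the table. As a minimal sketch, the snippet below applies the Together AI prices listed above to two hypothetical requests; the function name and the token counts are my own, and actual billing may differ (for example with caching or discounts).

```python
INPUT_USD_PER_M = 0.60   # Together AI input price per 1M tokens (from the table above)
OUTPUT_USD_PER_M = 3.60  # Together AI output price per 1M tokens (from the table above)


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the listed per-1M-token prices."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Hypothetical requests: a short chat turn and a single long-context prompt.
print(f"short turn:   ${request_cost_usd(500, 256):.4f}")
print(f"long context: ${request_cost_usd(600_000, 256):.4f}")
```

At these prices a single 600,000-token prompt costs about $0.36, almost all of it input, so an agent that re-reads a long history on every turn is dominated by input cost.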
Trying it out with the free API at build.nvidia.com
Since the API is OpenAI-compatible, you can call it just by swapping the base URL and model ID in the openai client.
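import os

from openai import OpenAI

# The endpoint is OpenAI-compatible, so only the base URL and model ID change.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # API key read from the environment
)

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-a55b",
    # Prompt (Japanese): "Answer with your model name and developer in one sentence."
    messages=[{"role": "user", "content": "あなたのモデル名と開発元を1文で答えてください。"}],
    max_tokens=256,
    temperature=0.2,
    # Disable reasoning mode for a short, direct answer.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)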
This returns "I am the language model Nemotron 3 Ultra, developed by NVIDIA researchers." The response came back in under a second. You can call a 550B model from your own MacBook with just a few lines of code.
What's interesting is the behavior of reasoning mode. When you set enable_thinking to True, the thinking content is returned separately in a field called reasoning_content. For example, when I posed the classic question "A lily pad doubles every day and covers a pond on day 48. On which day was the pond half covered?" the reasoning_content field contained this:
If it doubles every day, then one day prior to being full,
it must have been half full. So, Day 47.
The main response then came back as "Day 47" in Japanese. In this case the model reasoned in English and switched to Japanese only for the final answer, which is very characteristic of a multilingual model and which I found charming.
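Reading the two parts of a reasoning-mode response might look like the sketch below. The `reasoning_content` field name and the `enable_thinking` flag are the ones described above; the helper names and the `max_tokens` value are my own assumptions. The `openai` import sits inside the function so that the parsing helper can be tried offline with a stand-in message.

```python
import os
from types import SimpleNamespace


def split_reasoning(message) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat completion message.

    `reasoning_content` is read defensively because a response produced
    without thinking may not carry that field at all.
    """
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = getattr(message, "content", None) or ""
    return reasoning.strip(), answer.strip()


def ask_with_thinking(question: str) -> tuple[str, str]:
    """Call the API with thinking enabled (needs NVIDIA_API_KEY and network access)."""
    from openai import OpenAI  # imported here so split_reasoning works without the SDK

    client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key=os.environ["NVIDIA_API_KEY"],
    )
    resp = client.chat.completions.create(
        model="nvidia/nemotron-3-ultra-550b-a55b",
        messages=[{"role": "user", "content": question}],
        max_tokens=1024,  # assumption: leave room for the thinking tokens
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    return split_reasoning(resp.choices[0].message)


# Offline check with a stand-in shaped like the response described above.
fake = SimpleNamespace(
    reasoning_content="If it doubles every day, then one day prior to being full, "
    "it must have been half full. So, Day 47.",
    content="Day 47",
)
reasoning, answer = split_reasoning(fake)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the reasoning and the answer separate makes it easy to log the thinking for debugging while showing only the final answer to users.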
Comparing Nano, Super, and Ultra on the same benchmark
Since Nano (30B-A3B) and Super (120B-A12B) can be called through the same API, I lined up all 3 models side by side on Japanese reasoning problems. I started with 8 basic questions, such as inclusion-exclusion and chicken-and-rabbit problems, but all 3 models solved them without any trouble, so they did not differentiate the models at all.
So I re-tested with 8 slightly more challenging problems—larger inclusion-exclusion, combinations, and counting perfect squares and cubes. Here are the results.
| Model | Correct answers (8 hard problems) | Average latency |
|---|---|---|
| Nemotron 3 Nano (30B-A3B) | 7/8 | ~4 seconds |
| Nemotron 3 Super (120B-A12B) | 7/8 | ~16 seconds |
| Nemotron 3 Ultra (550B-A55B) | 7/8 | ~17 seconds |
The 1 missed question was one whose wording was ambiguous: my phrasing, "how many common solutions are there," could also be read as asking for "the number of solutions," and all 3 models answered it the same way. In practice, this was essentially a perfect score across the board.

Figure: Average latency per question for the easy 8 and hard 8 sets. In both sets, Nano is fastest, with time increasing as the model gets larger.
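A comparison like the one above can be scripted with a small harness. The sketch below is not the script I used; it is a minimal illustration in which `ask` is any function that sends a question to one model and returns its reply, and grading simply compares the last integer in the reply with the expected answer. The two sample problems and the stand-in model are hypothetical so that the block runs offline.

```python
import re
import time
from typing import Callable, Optional


def last_integer(text: str) -> Optional[int]:
    """Pull the last integer out of a free-form answer, or None if there is none."""
    matches = re.findall(r"-?\d+", text.replace(",", ""))
    return int(matches[-1]) if matches else None


def evaluate(ask: Callable[[str], str], problems: list[tuple[str, int]]) -> tuple[int, float]:
    """Return (number correct, average latency in seconds) for one model."""
    correct = 0
    elapsed = 0.0
    for question, expected in problems:
        start = time.perf_counter()
        reply = ask(question)
        elapsed += time.perf_counter() - start
        if last_integer(reply) == expected:
            correct += 1
    return correct, elapsed / len(problems)


# Stand-in for a real API call; it gets the second problem wrong on purpose.
def fake_model(question: str) -> str:
    return "The answer is 38." if "square" in question else "I think it is 66."


# Hypothetical problems in the same spirit (inclusion-exclusion counting).
problems = [
    ("How many integers from 1 to 1000 are perfect squares or perfect cubes?", 38),
    ("How many integers from 1 to 100 are divisible by 2 or 3?", 67),
]
correct, avg_latency = evaluate(fake_model, problems)
print(f"{correct}/{len(problems)} correct, {avg_latency:.3f}s average")
```

Grading on the last integer is crude and will misjudge replies that end with a different number, so for a small set like 8 problems it is worth reading the misses by hand, as with the ambiguous question above.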
Ultra's true strength is long context
I embedded a single passphrase somewhere in a long Japanese text and asked for it at the end (a needle-in-a-haystack test), varying the position of the passphrase (beginning, middle, end) and the length of the text (approximately 6,000 to 600,000 tokens).

Figure: Results of whether the passphrase could be retrieved when placed at the beginning, middle, or end across 5 different lengths. Green indicates success, and the number in each cell is the response time in seconds.
The result was 14 successes out of 15 cells. Even with text of around 600,000 tokens, Ultra accurately retrieves the passphrase from anywhere (beginning, middle, or end) in about 35 seconds. The one miss, "near the beginning × ~6,000 tokens," succeeded consistently when I re-ran it, so I treat it as noise. When I pushed the input all the way to 1M tokens, the API returned an input limit error, which suggests the usable input tops out at approximately 1 million tokens. While many hosted services cap the default at 256K, the API catalog version offers the full 1M from the start.
On quality, NVIDIA publishes its own results. On the 1M version of RULER, which measures the ability to extract target information from long text, Ultra scores 95%. This matches my hands-on experience of reliably retrieving needles up to 600,000 tokens, and this capacity and stable extraction should prove very effective for agents that carry hundreds of thousands of tokens of history. Score details are in the NVIDIA technical blog listed in the reference links at the end of this article.
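The needle-in-a-haystack setup described above can be sketched as follows. This is an illustration rather than my actual test script: lengths are measured in characters as a rough proxy for tokens, the filler text and passphrase are hypothetical, and `ask` stands for any function that sends a prompt to the model and returns its reply.

```python
import re
from typing import Callable


def build_haystack(filler: str, total_chars: int, needle: str, position: float) -> str:
    """Repeat `filler` up to `total_chars` and splice `needle` in at `position` (0.0 to 1.0)."""
    repeats = total_chars // len(filler) + 1
    body = (filler * repeats)[:total_chars]
    cut = int(len(body) * position)
    return body[:cut] + "\n" + needle + "\n" + body[cut:]


def run_grid(
    ask: Callable[[str], str],
    passphrase: str,
    lengths: list[int],
    positions: dict[str, float],
    filler: str,
) -> dict[tuple[str, int], bool]:
    """Return {(position name, length): whether the reply contained the passphrase}."""
    needle = f"The passphrase is {passphrase}."
    question = "\n\nWhat is the passphrase mentioned in the text above?"
    results = {}
    for total_chars in lengths:
        for name, position in positions.items():
            prompt = build_haystack(filler, total_chars, needle, position) + question
            results[(name, total_chars)] = passphrase in ask(prompt)
    return results


# Stand-in model that "retrieves" the needle with a regex, so the sketch runs offline.
def fake_model(prompt: str) -> str:
    found = re.search(r"The passphrase is (\S+)\.", prompt)
    return found.group(1) if found else "unknown"


grid = run_grid(
    fake_model,
    passphrase="blue-otter-42",
    lengths=[1_000, 10_000],
    positions={"beginning": 0.05, "middle": 0.5, "end": 0.95},
    filler="The weather report was unremarkable that day. ",
)
for cell, ok in grid.items():
    print(cell, "ok" if ok else "MISS")
```

To run this against the real API, `fake_model` would be replaced by a function that calls the chat completions endpoint, and the lengths would be scaled up. A real tokenizer would be needed to hit specific token counts such as 600,000.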
What is Ultra meant for
After all this hands-on exploration, my impression is that Ultra is not simply "a bigger version of Nano or Super"; its positioning is fundamentally different. Nano and Super are everyday models you run locally on DGX Spark, and as we saw, even the 30B class handles basic Japanese reasoning just fine. For everyday smarts, small models are already sufficient these days.
Ultra takes on what comes after that. It is a 550B model that does not fit locally and assumes data center-grade GPUs. It is aimed at autonomous agents that run for hours without stopping, handling complex coding, long-running research, and automation of internal workflows. It carries a long context and a chain of tool calls and runs to the finish without losing sight of the goal. That is the playing field it was built for.
Seen from this positioning, the numbers from the verification make sense. With its 1M long context, Ultra retrieved a passphrase from text of around 600,000 tokens, whether it was placed at the beginning or the end. NVIDIA claims inference up to 5x faster than other open models in the same class and agent operational costs up to 30% lower. That efficiency pays off more the longer you run, which is exactly what long-running agents with multiplying tokens need. Furthermore, since weights, data, and recipes are all published, if you have the hardware you can run it confined to your own data center: no worrying about token billing, no sending confidential data outside, and a top-tier model kept under your own control. That is Ultra's strength.
Summary
I have been working through Nemotron 3 on DGX Spark from Nano onward, and with Ultra I have gone past what can run locally. What I learned is that Ultra is not "the largest model that can't be run locally," but rather "a model for running long-running autonomous agents continuously: fast, cheaply, and under your own organization's control." Now that basic intelligence is achievable even in small models, Ultra's role is to withstand token explosions and keep running without stopping. Even if you can't run it locally, you can try it right away with the free API. If you're curious, start by trying the 550B model at build.nvidia.com.
Reference links
- Nemotron 3 Ultra model card (Hugging Face): https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4
- NVIDIA technical blog: https://developer.nvidia.com/blog/nvidia-nemotron-3-ultra-powers-faster-more-efficient-reasoning-for-long-running-agents/
- Ultra page on build.nvidia.com: https://build.nvidia.com/nvidia/nemotron-3-ultra-550b-a55b
- Nemotron 3 Super DGX Spark deployment guide: https://docs.nvidia.com/nemotron/nightly/usage-cookbook/Nemotron-3-Super/SparkDeploymentGuide/README.html
