
I tried running NVIDIA Nemotron 3 Super on DGX Spark
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
On March 11, 2026, NVIDIA released Nemotron 3 Super. It is a reasoning model for agents that adopts a hybrid architecture with 120B total parameters and 12B active parameters. For details on the architecture and a trial run on Cloudflare Workers AI, please refer to Oguri-san's article linked below.
This article focuses on local execution. In a previous article, we ran Nano (30B-A3B), the lightweight model in the same Nemotron 3 family, on DGX Spark; here we scale up to the 120B Super. To see how the two differ in performance, we also compared them on the JCommonsenseQA benchmark.
Overview of Nemotron 3 Super
Nemotron 3 is a new generation model family that NVIDIA trained from scratch with its own original architecture. It belongs to a different lineage from the Nemotron Nano 9B v2 Japanese (a Llama-based fine-tune) covered in a previous series.
| Model | Total Parameters | Active | Context Length | Positioning |
|---|---|---|---|---|
| Nemotron 3 Nano | 30B | 3B | 128K | Lightweight / Edge-oriented |
| Nemotron 3 Super | 120B | 12B | 256K to 1M | Core / Agent-oriented |
| Nemotron 3 Ultra | Undisclosed | — | — | Large-scale (planned H1 2026) |
Super uses a hybrid configuration that alternates three types of blocks: Mamba-2, Transformer Attention, and Latent MoE (Mixture of Experts). Mamba-2 handles long-context processing, Attention handles accurate reference retrieval, and Latent MoE provides 4x expert utilization at the same cost. The model also has a built-in Multi-Token Prediction (MTP) head, which supports acceleration via native Speculative Decoding without an external draft model.
An important point when considering execution on DGX Spark is that it was pre-trained from scratch in NVFP4 (4-bit floating point). Rather than quantizing after the fact, it was trained from the start under 4-bit precision constraints, making it natively compatible with Blackwell architecture's NVFP4 optimizations. However, as described later, as of March 2026, there were issues on the inference engine side in my environment, so I verified using the GGUF version on Ollama.
Setting Up on DGX Spark
Verification Environment
| Item | Value |
|---|---|
| Hardware | NVIDIA DGX Spark (GB10 Superchip) |
| Memory | 128GB Unified Memory |
| Driver | 580.126.09 |
| CUDA | 13.0 |
| OS | Ubuntu 24.04 (aarch64) |
| Inference Engine | Ollama 0.17.2 |
| Model | nemotron-3-super:latest (Q4_K_M GGUF) |
The NVFP4 Version Didn't Work
NVIDIA's official cookbook provides instructions for launching with vLLM 0.17.1 + NVFP4 checkpoints. However, DGX Spark (GB10) has CUDA Compute Capability 12.1 (sm_121), and torch 2.10.0+cu128, which vLLM 0.17.1 depends on, only supports up to sm_120. Since the CUTLASS kernel crashes at runtime, it was not possible to run the NVFP4 variant with vLLM at this time.
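The mismatch described above can be sketched as a simple comparison of compute capability tags. This is a simplified model, not how PyTorch actually decides: real builds ship kernels for a specific list of architectures, and the helper names `parse_sm` and `is_covered` are hypothetical ones introduced for illustration.

```python
def parse_sm(tag: str) -> tuple[int, int]:
    """Convert an 'sm_XY' style tag into (major, minor), e.g. 'sm_121' -> (12, 1)."""
    digits = tag.removeprefix("sm_")
    return int(digits[:-1]), int(digits[-1])


def is_covered(device: str, highest_supported: str) -> bool:
    """True if the device capability does not exceed the highest supported one.

    Assumption: support is modeled as "everything up to a maximum", which
    mirrors the wording in the article rather than the real arch list.
    """
    return parse_sm(device) <= parse_sm(highest_supported)


GB10 = "sm_121"       # DGX Spark (from the article)
TORCH_MAX = "sm_120"  # highest target of torch 2.10.0+cu128 (from the article)

print(is_covered(GB10, TORCH_MAX))  # False
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` are the places I would look for the actual values.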
I also tried NGC containers (26.01, 26.02), but 26.01 uses vLLM 0.13.x and does not support the Nemotron 3 Super architecture (nemotron_h), while 26.02 uses vLLM 0.15.x and fails to interpret the MIXED_PRECISION quantization.
TRT-LLM has an official configuration for DGX Spark (Config C), but it reportedly requires building from the main branch rather than a release version, so I passed on it this time. I hope this will be resolved in NGC 26.03 or later, or with an official TRT-LLM release.
Running with Ollama
The GGUF format allows us to avoid CUDA kernel compatibility issues. Ollama has published nemotron-3-super with Q4_K_M quantization, and it ran without issues on DGX Spark.
# Pull the model with Ollama (approx. 87GB)
ollama pull nemotron-3-super
You can also pull Nano in the same way.
# Pull Nano for comparison (approx. 24GB)
ollama pull nemotron-3-nano
Once Ollama is running, all you need to do is call the API. The initial model load takes about 1–2 minutes, but once loaded, responses come back smoothly.
Memory Usage
Since DGX Spark's unified memory is shared between CPU and GPU, it's difficult to check exact usage with nvidia-smi. Checking with Ollama's ps command, Super was loading approximately 87GB and Nano approximately 24GB. With 128GB of unified memory, Super alone has plenty of room, but loading both models simultaneously totals around 111GB, so care is needed when balancing with other processes.
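The headroom arithmetic above can be written out as a small sketch. The sizes are the approximate values reported by Ollama's ps command; the function name `headroom_gb` is hypothetical, and the calculation ignores the KV cache, the OS, and other processes.

```python
UNIFIED_MEMORY_GB = 128  # DGX Spark unified memory

# Approximate loaded sizes taken from the article (Q4_K_M GGUF).
MODEL_SIZES_GB = {
    "nemotron-3-super": 87,
    "nemotron-3-nano": 24,
}


def headroom_gb(loaded: list[str], total_gb: int = UNIFIED_MEMORY_GB) -> int:
    """Memory left after loading the given models (weights only)."""
    return total_gb - sum(MODEL_SIZES_GB[name] for name in loaded)


print(headroom_gb(["nemotron-3-super"]))                     # 41
print(headroom_gb(["nemotron-3-super", "nemotron-3-nano"]))  # 17
```

With both models resident, only about 17GB remains for everything else, which is why the balance with other processes matters.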
Running Inference
Reasoning Mode (with thinking)
Nemotron 3 Super has a Reasoning mode that outputs the thought process during inference, similar to DeepSeek-R1. In Ollama, this can be controlled with the think parameter.
curl -s http://localhost:11434/api/chat -d '{
  "model": "nemotron-3-super",
  "messages": [
    {"role": "user", "content": "Explain the key innovations of Mixture of Experts architecture in 3 sentences."}
  ],
  "stream": false
}' | jq '{model, eval_count, eval_duration_ns: .eval_duration,
          tok_per_sec: (.eval_count / (.eval_duration / 1e9))}'
By default, it operates with Reasoning ON (with thinking), generating a response after internally elaborating the thought process. While latency increases due to the higher token count generated, accuracy improves on complex reasoning tasks.
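For reference, the tok/s calculation done with jq above can be sketched in Python as follows. It relies on the `eval_count` / `eval_duration` fields used in the jq filter and on the matching `prompt_eval_count` / `prompt_eval_duration` fields that Ollama returns for prompt processing, with durations in nanoseconds. The function name `throughput` is hypothetical, and the sample response uses made-up numbers, not measurements.

```python
NS_PER_SEC = 1e9  # Ollama reports durations in nanoseconds


def throughput(resp: dict) -> dict:
    """Compute prompt-processing and generation speed from a non-streaming response."""
    return {
        "prompt_tok_per_sec": resp["prompt_eval_count"]
        / (resp["prompt_eval_duration"] / NS_PER_SEC),
        "gen_tok_per_sec": resp["eval_count"]
        / (resp["eval_duration"] / NS_PER_SEC),
    }


# Illustrative values only (not measured).
sample = {
    "prompt_eval_count": 50,
    "prompt_eval_duration": 500_000_000,   # 0.5 s
    "eval_count": 200,
    "eval_duration": 10_000_000_000,       # 10 s
}

print(throughput(sample))
# {'prompt_tok_per_sec': 100.0, 'gen_tok_per_sec': 20.0}
```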
No-thinking (nothink) Mode
To skip the thought process and get a direct answer, specify think: false.
curl -s http://localhost:11434/api/chat -d '{
  "model": "nemotron-3-super",
  "messages": [
    {"role": "user", "content": "What is 2+2?"}
  ],
  "think": false,
  "stream": false
}'
This mode is suited for situations requiring short answers, such as benchmarks and classification tasks.
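When switching between the two modes from a script, a small helper keeps the request body consistent. The following is a minimal sketch using only the standard library; `build_chat_payload` and `chat` are hypothetical helper names, and the endpoint is the local Ollama URL used in the curl examples.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_payload(prompt, think=None, model="nemotron-3-super"):
    """Build the JSON body; leaving think=None keeps the model's default behavior."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    if think is not None:
        payload["think"] = think
    return payload


def chat(payload, url=OLLAMA_URL):
    """POST the payload to a running Ollama server and return the parsed JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage (requires a running Ollama server):
#   reply = chat(build_chat_payload("What is 2+2?", think=False))
#   print(reply["message"]["content"])
```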
Throughput Measurement
We compared the throughput of Nano and Super using the same prompt.
| Model | Parameters | Quantization | Model Size | Prompt Processing | Generation Speed | Generated Tokens |
|---|---|---|---|---|---|---|
| Nemotron 3 Nano | 30B (3B) | Q4_K_M | 24GB | 361.0 tok/s | 72.3 tok/s | 170 |
| Nemotron 3 Super | 120B (12B) | Q4_K_M | 87GB | 112.4 tok/s | 17.9 tok/s | 424 |
Nano's generation speed is approximately 4x that of Super, which closely matches the 4x difference in active parameters (3B vs. 12B). On the other hand, Super generated more tokens (424 vs. 170) because its thought process in Reasoning mode was longer, and this showed up as a difference in answer quality.
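The ratios in the table can be checked with a few lines of arithmetic. This sketch uses only the numbers from the table above; the estimated generation time is simply generated tokens divided by generation speed, so it excludes model load and prompt processing.

```python
# Values from the throughput table.
RESULTS = {
    "nano":  {"active_b": 3,  "gen_tok_s": 72.3, "tokens": 170},
    "super": {"active_b": 12, "gen_tok_s": 17.9, "tokens": 424},
}

speed_ratio = RESULTS["nano"]["gen_tok_s"] / RESULTS["super"]["gen_tok_s"]
param_ratio = RESULTS["super"]["active_b"] / RESULTS["nano"]["active_b"]


def generation_seconds(name: str) -> float:
    """Time spent generating: tokens divided by tok/s."""
    r = RESULTS[name]
    return r["tokens"] / r["gen_tok_s"]


print(round(speed_ratio, 2), param_ratio)      # 4.04 4.0
print(round(generation_seconds("nano"), 1))    # 2.4
print(round(generation_seconds("super"), 1))   # 23.7
```

Because Super is both slower per token and generates more tokens, the wall-clock gap for this prompt is closer to 10x than 4x.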
JCommonsenseQA Benchmark
Why This Benchmark?
JCommonsenseQA is a dataset of 5-choice questions testing Japanese commonsense reasoning. We previously verified Nemotron Nano 9B v2 Japanese (previous generation) on the same DGX Spark in this series, providing a comparison baseline. Since the Nemotron 3 generation was trained mainly on English training data, it also serves as an indicator of how well it can handle Japanese tasks.
Measurement Conditions
| Item | Value |
|---|---|
| Dataset | JCommonsenseQA v1.1 (validation set, 1,119 questions) |
| Prompt | 3-shot, answer is a single letter |
| Thinking Mode | nothink (no thinking) |
| temperature | 0 |
| Backend | Ollama (DGX Spark local) |
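To make the conditions above concrete, here is a minimal sketch of a 3-shot prompt builder, answer parser, and accuracy calculation. It is not the script used for the measurements: the prompt wording and the helper names are assumptions, the sample item is made up, and the field names (`question`, `choice0` to `choice4`, `label`) are what I understand the JGLUE release of JCommonsenseQA to use.

```python
import re

LETTERS = "ABCDE"  # five choices, one letter each


def format_question(item: dict, with_answer: bool = False) -> str:
    """Render one question; few-shot examples also show the gold letter."""
    lines = ["Question: " + item["question"]]
    for i, letter in enumerate(LETTERS):
        lines.append(letter + ". " + item["choice" + str(i)])
    lines.append(("Answer: " + LETTERS[item["label"]]) if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(shots: list, target: dict) -> str:
    """Concatenate solved examples followed by the unsolved target question."""
    blocks = [format_question(s, with_answer=True) for s in shots]
    blocks.append(format_question(target))
    return "\n\n".join(blocks)


def parse_answer(text: str):
    """Return the choice index if the reply starts with a single letter A-E, else None."""
    m = re.match(r"\s*([A-E])\b", text)
    return LETTERS.index(m.group(1)) if m else None


def accuracy(preds: list, labels: list) -> float:
    """Fraction of predictions that match the gold labels (None counts as wrong)."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


# Made-up item for illustration (not from the dataset).
example = {
    "question": "What do you use to cut paper?",
    "choice0": "Scissors", "choice1": "Spoon", "choice2": "Pillow",
    "choice3": "Cup", "choice4": "Blanket", "label": 0,
}

print(build_prompt([example] * 3, example))
```

Parsing only a leading letter is deliberately strict: with nothink mode and temperature 0 the model is expected to reply with just the letter, and anything else is counted as incorrect.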
Results
| Model | Generation | Quantization | Accuracy | Avg. Latency | Avg. Generation Speed |
|---|---|---|---|---|---|
| Nemotron Nano 9B v2 JP (ref.) | Previous | BF16 | 91.2% | 0.98s/q | — |
| Nemotron 3 Nano (30B-A3B) | Nemotron 3 | Q4_K_M | 87.0% | 0.30s/q | 118.5 tok/s |
| Nemotron 3 Super (120B-A12B) | Nemotron 3 | Q4_K_M | 94.4% | 0.92s/q | 35.6 tok/s |
Nemotron 3 Super recorded the highest accuracy at 94.4%. The previous generation 9B v2 Japanese (fine-tuned for Japanese) achieved 91.2%, so surpassing it without any Japanese-specific training is a bit surprising. It gives the impression that the reasoning power of the 12B active parameters is directly reflected.
Nano also performed well at 87.0%. In terms of latency, Nano is far faster at 0.30s/question versus Super's 0.92s/question, roughly 3x. However, since Super also stays under 1 second per question, it felt comfortable to use in practice.
Super for accuracy, Nano for speed. The division by use case is clear.
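To put the table in more tangible terms, the sketch below converts accuracy and average latency into an approximate number of correct answers and a total run time over the 1,119 questions. The correct-answer counts are back-calculated from the rounded percentages, so treat them as estimates rather than reported figures.

```python
NUM_QUESTIONS = 1119  # JCommonsenseQA v1.1 validation set

# Accuracy and average latency from the results table.
RESULTS = {
    "nano":  {"accuracy": 0.870, "latency_s": 0.30},
    "super": {"accuracy": 0.944, "latency_s": 0.92},
}


def approx_correct(name: str) -> int:
    """Estimated number of correct answers (rounded from the percentage)."""
    return round(RESULTS[name]["accuracy"] * NUM_QUESTIONS)


def total_minutes(name: str) -> float:
    """Estimated time for one full pass over the validation set."""
    return RESULTS[name]["latency_s"] * NUM_QUESTIONS / 60


for name in RESULTS:
    print(name, approx_correct(name), round(total_minutes(name), 1))
# nano 974 5.6
# super 1056 17.2
```

In other words, Super answers roughly 80 more questions correctly, at the cost of about 12 extra minutes per full run.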
Expectations for vLLM + NVFP4 and MTP
This time we verified with Ollama + GGUF, but NVIDIA's officially recommended configuration is vLLM 0.17.1 + NVFP4 checkpoints. TRT-LLM Config C is also available for DGX Spark.
Furthermore, Nemotron 3 Super has a Multi-Token Prediction (MTP) head built into the checkpoint, enabling native Speculative Decoding without an external draft model. In vLLM, it can be enabled simply by adding the --speculative-config option, and NVIDIA officially claims up to 3x speedup on structured generation tasks.
# To enable MTP (vLLM)
vllm serve $MODEL_CKPT \
--speculative-config '{"method": "nemotron_h_mtp", "num_speculative_tokens": 5}'
Once vLLM + NVFP4 becomes operational on DGX Spark, we can expect higher throughput than the GGUF version, and we'll also be able to try out MTP-based acceleration. I look forward to the next NGC container version and the official TRT-LLM release.
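Since the --speculative-config value is a JSON string embedded in a shell command, quoting mistakes are easy to make. One way to avoid them is to generate the command from a script, as in the sketch below. The helper name `vllm_serve_command` and the checkpoint path are hypothetical; the method name and token count are the ones shown in the command above.

```python
import json
import shlex


def vllm_serve_command(model: str, method: str, num_speculative_tokens: int) -> str:
    """Build a shell-safe `vllm serve` command line with an MTP speculative config."""
    spec = json.dumps(
        {"method": method, "num_speculative_tokens": num_speculative_tokens}
    )
    return shlex.join(["vllm", "serve", model, "--speculative-config", spec])


# "path/to/checkpoint" is a placeholder, not a real path.
print(vllm_serve_command("path/to/checkpoint", "nemotron_h_mtp", 5))
```

`shlex.join` wraps the JSON in single quotes, so the output can be pasted into a shell as-is.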
June 2026 Update: NVFP4 Now at Practical Speed with Official NGC Container
Back in March, I ended by expressing hope that "things would improve once NGC official containers and vLLM 0.17+ gain sm_121 support." About three months later, the situation has changed considerably. When I revisited DGX Spark for an upcoming event, NVFP4 was working directly with the official NGC vLLM container, so I'm adding the results here.
I used nvcr.io/nvidia/vllm:26.03.post1-py3. One of the causes of the failure in March—the MIXED_PRECISION quantization interpretation failure in NGC 26.02—was resolved in 26.03, and I was able to load nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 directly.
docker run --rm --gpus all --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  nvcr.io/nvidia/vllm:26.03.post1-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --trust-remote-code --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 --kv-cache-dtype fp8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
What reassured me when watching the startup log was that FLASHINFER_CUTLASS was selected as the MoE backend for NVFP4. In March, it fell back to Marlin weight-only and dropped to 4.8 tok/s, but this time FlashInfer CUTLASS was selected even though Marlin was among the candidates. The mamba_ssm_cache_dtype float32 setting is now handled automatically by 26.03, so the option I had to pass manually in March is no longer needed.
Measured Throughput
Results re-measured in a warm state.
| Model | Configuration | Decode Speed |
|---|---|---|
| Nemotron 3 Nano (30B-A3B) | NVFP4 / vLLM 26.03 | 57.7 tok/s |
| Nemotron 3 Super (120B-A12B) | NVFP4 / No MTP | 14.8 tok/s |
| Nemotron 3 Super (120B-A12B) | NVFP4 / MTP num_spec=3 | 18.6 tok/s |
Compared to 4.8 tok/s via Marlin in March, Super is about 3x faster without MTP and nearly 4x faster with it. Enabling MTP (Multi-Token Prediction) increased decode speed from 14.8 to 18.6 tok/s, an improvement of about 26%. Since Super has the MTP head built into the checkpoint, you can use it simply by passing --speculative-config.
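The speedup figures can be reproduced from the measured values with the following sketch. It uses only numbers stated in this section; the function name `speedup` is hypothetical.

```python
MARCH_MARLIN = 4.8   # tok/s, Marlin weight-only fallback in March
NO_MTP = 14.8        # tok/s, NVFP4 without MTP
WITH_MTP = 18.6      # tok/s, NVFP4 with MTP (num_spec=3)


def speedup(new: float, old: float) -> float:
    """Ratio of new throughput to old throughput."""
    return new / old


mtp_gain_pct = (speedup(WITH_MTP, NO_MTP) - 1) * 100

print(round(speedup(NO_MTP, MARCH_MARLIN), 1))    # 3.1
print(round(speedup(WITH_MTP, MARCH_MARLIN), 1))  # 3.9
print(round(mtp_gain_pct, 1))                     # 25.7
```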
Summary
Running a 120B parameter model on a single desktop machine is still impressive. Achieving 94.4% on JCommonsenseQA without Japanese-specific fine-tuning speaks to the model's strong baseline Japanese capability.
It's also interesting that the division is clear: Nano for speed, Super for accuracy. With DGX Spark, you can load both and switch between them depending on your use case.
In the initial March verification, the NVFP4 version did not work due to sm_121 compatibility issues, but with Ollama + GGUF, inference ran at a practically usable speed. As native NVFP4 support matures, further performance improvements, including MTP-based acceleration, can be expected.
I'll also be keeping an eye on further updates about the Nemotron 3 family at GTC 2026 (March 16-19).
Reference Links
- NVIDIA announced its latest open model Nemotron 3 Super, so I tried it on Cloudflare Workers AI (DevelopersIO)
- I tried Japanese fine-tuning of Nemotron 3 Nano on DGX Spark (DevelopersIO)
- Introducing Nemotron 3 Super (official blog)
- Inside Nemotron 3: Techniques, Tools, and Data (technical overview)
- Nemotron 3 Super technical report
- NVIDIA-NeMo/Nemotron (GitHub, cookbook)
- Hugging Face: NVFP4 model
