
I tried running NVIDIA Nemotron 3 Super on DGX Spark

I tried running NVIDIA's latest model Nemotron 3 Super (120B) locally on DGX Spark. I compared the performance difference with Nano using the JCommonsenseQA benchmark and will introduce a practical setup method.

森茂洋 / Hiroshi Morishige

2026.03.12


Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
On March 11, 2026, NVIDIA released Nemotron 3 Super. It is a reasoning model for agents that adopts a hybrid architecture with 120B total parameters and 12B active parameters. For details on the architecture and how to try it on Cloudflare Workers AI, please refer to Oguri-san's article.
https://dev.classmethod.jp/articles/nvidia-nemotron-3-super-cloudflare-workers-ai/
This article focuses on local execution. In a previous article, I ran Nano (30B-A3B), a lightweight model in the same Nemotron 3 family, on DGX Spark. This time I scale up to the 120B Super. Since I was also curious about the performance difference from Nano, I compared the two using the JCommonsenseQA benchmark.

Overview of Nemotron 3 Super

Nemotron 3 is a new-generation model family that NVIDIA trained from scratch on its own architecture. It is a separate lineage from Nemotron Nano 9B v2 Japanese (a Llama-based fine-tune), which I covered in a previous series.


Model	Total Parameters	Active Parameters	Context Length	Positioning
Nemotron 3 Nano	30B	3B	128K	Lightweight, edge-oriented
Nemotron 3 Super	120B	12B	256K to 1M	Core, agent-oriented
Nemotron 3 Ultra	Undisclosed	-	-	Large-scale (planned H1 2026)

Super uses a hybrid configuration that alternates three types of blocks: Mamba-2, Transformer Attention, and Latent MoE (Mixture of Experts). In this design, Mamba-2 handles long-context processing, Attention handles accurate reference retrieval, and Latent MoE provides 4x expert utilization at the same cost. The checkpoint also has a built-in multi-token prediction (MTP) head, which supports acceleration through native Speculative Decoding without an external draft model.
What matters most for running it on DGX Spark is that the model was pre-trained from scratch in NVFP4 (4-bit floating point). Rather than being quantized after training, it was trained under 4-bit precision constraints from the start, which makes it natively compatible with the Blackwell architecture's NVFP4 optimizations. However, as described later, in my environment as of March 2026 there were issues on the inference engine side, so I ran the verification with the GGUF version on Ollama.
Setting Up on DGX Spark

Verification Environment

Item	Value
Hardware	NVIDIA DGX Spark (GB10 Superchip)
Memory	128GB Unified Memory
Driver	580.126.09
CUDA	13.0
OS	Ubuntu 24.04 (aarch64)
Inference Engine	Ollama 0.17.2
Model	nemotron-3-super:latest (Q4_K_M GGUF)

NVFP4 Version Did Not Work

NVIDIA's official cookbook describes how to launch the model with vLLM 0.17.1 and the NVFP4 checkpoint. However, DGX Spark (GB10) has CUDA Compute Capability 12.1 (sm_121), while torch 2.10.0+cu128, on which vLLM 0.17.1 depends, only supports up to sm_120. Because the CUTLASS kernels crash at runtime, I could not run the NVFP4 variant with vLLM at that time.
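As a rough diagnostic for this kind of mismatch, the sketch below compares the GPU's compute capability with the architecture list a PyTorch build was compiled for. The helper names `arch_tag` and `is_exactly_supported` are hypothetical, and the check is deliberately conservative: it only looks for an exact `sm_XY` entry and does not model CUDA's binary-compatibility rules, so treat a negative result as a hint rather than proof that kernels will fail.

```python
from typing import List, Tuple


def arch_tag(capability: Tuple[int, int]) -> str:
    """Convert a (major, minor) compute capability into a tag such as 'sm_121'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def is_exactly_supported(capability: Tuple[int, int], arch_list: List[str]) -> bool:
    """Return True only if the build lists a kernel target for exactly this GPU."""
    return arch_tag(capability) in arch_list


if __name__ == "__main__":
    try:
        import torch  # optional: only needed for the live check on a GPU machine
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print(arch_tag(cap), "listed in build:", is_exactly_supported(cap, archs))
```

With the values described above, `is_exactly_supported((12, 1), [..., "sm_120"])` returns False, which is consistent with a build that stops at sm_120.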
I also tried NGC containers (26.01, 26.02), but 26.01 uses vLLM 0.13.x and doesn't support the Nemotron 3 Super architecture (nemotron_h), while 26.02 uses vLLM 0.15.x and fails to interpret MIXED_PRECISION quantization.
TRT-LLM has an official DGX Spark configuration (Config C), but building from the main branch rather than a release version is required, so I skipped it this time. I hope this will be resolved in NGC 26.03 or later, or with the official release of TRT-LLM.
Running with Ollama

With the GGUF format, the CUDA kernel compatibility issue can be avoided. A Q4_K_M quantized nemotron-3-super is available on Ollama and ran without issues on DGX Spark.
# Pull the model with Ollama (approx. 87GB)
ollama pull nemotron-3-super
Nano can be pulled similarly.
# Pull Nano for comparison (approx. 24GB)
ollama pull nemotron-3-nano
Once Ollama is running, all you need to do is call the API. The first model load takes about 1–2 minutes, but once loaded, responses come back smoothly.
Memory Usage

Because DGX Spark's unified memory is shared between the CPU and GPU, it is difficult to get accurate usage figures from nvidia-smi. According to Ollama's ps command, Super occupied approximately 87GB when loaded and Nano approximately 24GB. Against the 128GB of unified memory, Super alone leaves room to spare, but loading both models simultaneously totals around 111GB, so you need to watch what other processes are using.
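The headroom arithmetic can be written down as a small sketch. The sizes are the approximate figures reported in this article; the function name `headroom_gb` and the dictionary are illustrative only, and the result is a simple subtraction that ignores how the OS and other processes actually share unified memory.

```python
from typing import List

TOTAL_UNIFIED_GB = 128  # DGX Spark unified memory, from the environment table

# Approximate loaded sizes reported by `ollama ps` in this article.
LOADED_GB = {"nemotron-3-super": 87, "nemotron-3-nano": 24}


def headroom_gb(models: List[str], total_gb: int = TOTAL_UNIFIED_GB) -> int:
    """Memory left for the OS and other processes once the given models are loaded."""
    return total_gb - sum(LOADED_GB[name] for name in models)


print(headroom_gb(["nemotron-3-super"]))                     # 41
print(headroom_gb(["nemotron-3-super", "nemotron-3-nano"]))  # 17
```

Loading both models leaves roughly 17GB, which is why other processes need attention in that configuration.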
Running Inference

Reasoning Mode (with thinking)

Nemotron 3 Super has a Reasoning mode that outputs the thinking process during inference, similar to DeepSeek-R1. In Ollama, this can be controlled with the think parameter.
curl -s http://localhost:11434/api/chat -d '{
  "model": "nemotron-3-super",
  "messages": [
    {"role": "user", "content": "Explain the key innovations of Mixture of Experts architecture in 3 sentences."}
  ],
  "stream": false
}' | jq '{model, eval_count, eval_duration_ns: .eval_duration,
         tok_per_sec: (.eval_count / (.eval_duration / 1e9))}'
By default, it operates with Reasoning ON (with thinking), expanding the thinking process internally before generating a response. Because more tokens are generated, latency increases, but accuracy improves for complex reasoning tasks.
No-Think (nothink) Mode

If you want to skip the thinking process and get a direct answer, specify think: false.
curl -s http://localhost:11434/api/chat -d '{
  "model": "nemotron-3-super",
  "messages": [
    {"role": "user", "content": "What is 2+2?"}
  ],
  "think": false,
  "stream": false
}'
This mode is suited for scenarios where short answers are needed, such as benchmarks or classification tasks.
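For scripted use, the two curl calls above can be expressed as one helper that toggles the think field. This is a minimal sketch using only the standard library; `build_chat_payload` and `chat` are hypothetical names, and the endpoint is the same local Ollama URL used in the curl examples.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl examples


def build_chat_payload(prompt: str, think: Optional[bool] = None,
                       model: str = "nemotron-3-super") -> dict:
    """Build a non-streaming chat request; think=None keeps the model's default."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    if think is not None:
        payload["think"] = think
    return payload


def chat(payload: dict, url: str = OLLAMA_CHAT_URL) -> dict:
    """POST the payload to a running Ollama server and return the decoded JSON."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the request body; call chat(...) when an Ollama server is running.
    print(json.dumps(build_chat_payload("What is 2+2?", think=False), indent=2))
```

Leaving `think` unset reproduces the default Reasoning behavior, while `think=False` matches the nothink request.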
Throughput Measurement

I compared the throughput of Nano and Super using the same prompt.

Model	Parameters (Active)	Quantization	Model Size	Prompt Processing	Generation Speed	Generated Tokens
Nemotron 3 Nano	30B (3B)	Q4_K_M	24GB	361.0 tok/s	72.3 tok/s	170
Nemotron 3 Super	120B (12B)	Q4_K_M	87GB	112.4 tok/s	17.9 tok/s	424

Nano's roughly 4x faster generation speed (72.3 vs. 17.9 tok/s) closely matches the 4x difference in active parameters (3B vs. 12B). On the other hand, Super includes a long thinking process in Reasoning mode, which results in more generated tokens (424 vs. 170) and a difference in answer quality.
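The tok/s figures can be derived from the timing fields that Ollama returns with each response (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds). The sketch below assumes a non-streaming response that contains all four fields; the sample values are hypothetical and exist only to exercise the arithmetic.

```python
def ollama_throughput(response: dict) -> dict:
    """Derive tokens per second from the nanosecond timing fields of an Ollama response."""
    prompt_seconds = response["prompt_eval_duration"] / 1e9
    gen_seconds = response["eval_duration"] / 1e9
    return {
        "prompt_tok_per_sec": response["prompt_eval_count"] / prompt_seconds,
        "gen_tok_per_sec": response["eval_count"] / gen_seconds,
        "generated_tokens": response["eval_count"],
    }


# Hypothetical response values, not measurements from this article.
sample = {
    "prompt_eval_count": 40, "prompt_eval_duration": 500_000_000,  # 0.5 s
    "eval_count": 200, "eval_duration": 10_000_000_000,            # 10 s
}
print(ollama_throughput(sample))

# Ratios from the table above.
print(round(72.3 / 17.9, 2))  # generation speed, Nano / Super -> 4.04
print(12 / 3)                 # active parameters, Super / Nano -> 4.0
```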
JCommonsenseQA Benchmark

Why This Benchmark?

JCommonsenseQA is a dataset of 5-choice questions that tests Japanese commonsense reasoning. In a previous series, I evaluated Nemotron Nano 9B v2 Japanese (previous generation) on the same DGX Spark, which gives me a comparison baseline. Since the Nemotron 3 generation was trained primarily on English data, the benchmark also indicates how well it can handle Japanese tasks.
Measurement Conditions

Item	Value
Dataset	JCommonsenseQA v1.1 (validation set, 1,119 questions)
Prompt	3-shot, answer is a single alphabetic character
Thinking Mode	nothink (no thinking)
temperature	0
Backend	Ollama (DGX Spark local)
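To make these conditions concrete, here is a minimal sketch of how a 3-shot, single-letter evaluation could be assembled and scored. It is not the script used for this article: the prompt wording is hypothetical, and the field names (`question`, `choice0` to `choice4`, `label`) are assumed to follow the JGLUE release of JCommonsenseQA.

```python
import re
from typing import List, Optional

LETTERS = "ABCDE"  # choice0..choice4 mapped to single letters


def format_item(item: dict, with_answer: bool) -> str:
    """Render one 5-choice question; few-shot examples also show the gold letter."""
    lines = ["Question: " + item["question"]]
    lines += [LETTERS[i] + ". " + item["choice" + str(i)] for i in range(5)]
    lines.append("Answer: " + LETTERS[item["label"]] if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(shots: List[dict], target: dict) -> str:
    """Concatenate the answered few-shot examples and the unanswered target."""
    blocks = [format_item(shot, with_answer=True) for shot in shots]
    blocks.append(format_item(target, with_answer=False))
    return "\n\n".join(blocks)


def parse_answer(text: str) -> Optional[int]:
    """Return the index of the first standalone A-E letter, or None if absent."""
    match = re.search(r"\b([A-E])\b", text)
    return LETTERS.index(match.group(1)) if match else None


def accuracy(predictions: List[Optional[int]], labels: List[int]) -> float:
    """Fraction of questions whose parsed answer equals the gold label."""
    correct = sum(pred == gold for pred, gold in zip(predictions, labels))
    return correct / len(labels)


# Toy item for illustration only; real items come from the validation set.
toy = {"question": "Which is a fruit?", "choice0": "desk", "choice1": "car",
       "choice2": "apple", "choice3": "shoe", "choice4": "pen", "label": 2}
print(build_prompt([toy, toy, toy], toy))
```

Each prompt would then be sent with think set to false and temperature 0, and the reply passed through `parse_answer` before scoring.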

Results

Model	Generation	Quantization	Accuracy	Avg. Latency	Avg. Generation Speed
Nemotron Nano 9B v2 JP (ref.)	Previous	BF16	91.2%	0.98s/q	-
Nemotron 3 Nano (30B-A3B)	Nemotron 3	Q4_K_M	87.0%	0.30s/q	118.5 tok/s
Nemotron 3 Super (120B-A12B)	Nemotron 3	Q4_K_M	94.4%	0.92s/q	35.6 tok/s

Nemotron 3 Super achieved the highest accuracy at 94.4%. The previous-generation 9B v2 Japanese, which has Japanese-specific fine-tuning, scored 91.2%, so it is somewhat surprising that Super surpassed it without any Japanese-specific training. My impression is that the reasoning capacity of the 12B active parameters shows up directly in the score.
Nano also performed well at 87.0%. In latency, Nano is about 3x faster, at 0.30s/question versus Super's 0.92s/question. Still, Super stays under 1 second per question, so it felt comfortable to use in practice.
The division of roles is clear: Super for accuracy, Nano for speed.
Expectations for vLLM + NVFP4 and MTP

This time I verified using Ollama + GGUF, but NVIDIA's officially recommended configuration is vLLM 0.17.1 + NVFP4 checkpoint. TRT-LLM Config C is also available for DGX Spark.
Furthermore, Nemotron 3 Super has a multi-token prediction (MTP) head built into the checkpoint, enabling native Speculative Decoding without an external draft model. In vLLM, it can be enabled by simply adding the --speculative-config option, and the official documentation claims up to 3x speedup for structured generation tasks.
# When enabling MTP (vLLM)
vllm serve $MODEL_CKPT \
  --speculative-config '{"method": "nemotron_h_mtp", "num_speculative_tokens": 5}'
Once vLLM + NVFP4 becomes operational on DGX Spark, higher throughput than the GGUF version can be expected, and MTP-based acceleration can also be tested. I look forward to the next NGC container version and the official release of TRT-LLM.
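To get a feel for what num_speculative_tokens can buy, the sketch below evaluates the standard idealized estimate for speculative decoding: if each drafted token is accepted independently with probability a, one verification pass with k drafted tokens yields (1 - a^(k+1)) / (1 - a) tokens on average. The acceptance rates used here are hypothetical, since the article does not report them, and the estimate ignores the cost of drafting and verification, so it is an upper bound on the speedup rather than a prediction.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per verification pass in idealized speculative decoding.

    Assumes every drafted token is accepted independently with the same
    probability; each pass always yields at least one token.
    """
    a, k = acceptance_rate, num_speculative_tokens
    if a >= 1.0:
        return float(k + 1)  # every drafted token accepted, plus one extra token
    return (1.0 - a ** (k + 1)) / (1.0 - a)


# Hypothetical acceptance rates; k=3 and k=5 are the settings that appear in this article.
for k in (3, 5):
    for a in (0.3, 0.5, 0.7):
        print(f"k={k} a={a}: {expected_tokens_per_step(a, k):.2f} tokens/step")
```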
June 2026 Update: NVFP4 Reaches Practical Speed with Official NGC Container

Back in March, I ended by expressing hope that the official NGC containers and sm_121 support in vLLM 0.17+ would improve. About 3 months later, the situation has moved considerably. When I revisited DGX Spark for an upcoming event, I found that the official NGC vLLM container could run the NVFP4 version as-is, so I'm adding the results here.
I used nvcr.io/nvidia/vllm:26.03.post1-py3. One of the causes of failure in March — the MIXED_PRECISION quantization interpretation failure in NGC 26.02 — was resolved in 26.03, and nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 could be loaded directly.
docker run --rm --gpus all --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  nvcr.io/nvidia/vllm:26.03.post1-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --trust-remote-code --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 --kv-cache-dtype fp8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'
What reassured me when looking at the startup log was that FLASHINFER_CUTLASS was selected as the MoE backend for NVFP4. In March, it fell back to Marlin weight-only and dropped to 4.8 tok/s, but this time FlashInfer CUTLASS was selected even though Marlin was among the candidates. The mamba_ssm_cache_dtype float32 setting is now handled automatically by 26.03, and the options I had to pass manually in March are no longer needed.
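Once the container is serving, decode speed can be estimated from a single request against vLLM's OpenAI-compatible endpoint. The sketch below is one possible way to do it, not the measurement script behind the numbers in this article: port 8000 is assumed to be the default, the helper names are hypothetical, and wall-clock time includes prompt processing, so the result slightly understates pure decode speed.

```python
import json
import time
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed default vLLM port
MODEL = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4"


def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def tok_per_sec(completion_tokens: int, elapsed_seconds: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    return completion_tokens / elapsed_seconds


def measure(prompt: str) -> float:
    """Send one request to a running server and return the observed tokens per second."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        VLLM_URL, data=body, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    elapsed = time.perf_counter() - start
    return tok_per_sec(result["usage"]["completion_tokens"], elapsed)
```

Calling `measure(...)` a few times after a warm-up request gives figures comparable in spirit to a warm-state measurement.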
Measured Throughput

Results measured again in a warm state.

Model	Configuration	Decode Speed
Nemotron 3 Nano (30B-A3B)	NVFP4 / vLLM 26.03	57.7 tok/s
Nemotron 3 Super (120B-A12B)	NVFP4 / No MTP	14.8 tok/s
Nemotron 3 Super (120B-A12B)	NVFP4 / MTP num_spec=3	18.6 tok/s

Compared to 4.8 tok/s via Marlin in March, Super is now about 3x faster without MTP (14.8 tok/s) and about 4x faster with MTP (18.6 tok/s). Enabling MTP (multi-token prediction) improved decode speed from 14.8 to 18.6 tok/s, roughly a 26% gain. Since Super has the MTP head built into the checkpoint, it can be used simply by passing --speculative-config.
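The ratios follow directly from the measured decode speeds; a short sketch of the arithmetic, using only the figures reported in this article:

```python
def speedup(new_tok_per_sec: float, old_tok_per_sec: float) -> float:
    """How many times faster the new configuration decodes."""
    return new_tok_per_sec / old_tok_per_sec


MARLIN_MARCH = 4.8  # tok/s, Marlin fallback in March
NO_MTP = 14.8       # tok/s, NVFP4 without MTP
WITH_MTP = 18.6     # tok/s, NVFP4 with MTP (num_speculative_tokens=3)

print(f"{speedup(NO_MTP, MARLIN_MARCH):.1f}x")          # 3.1x
print(f"{speedup(WITH_MTP, MARLIN_MARCH):.1f}x")        # 3.9x
print(f"{(speedup(WITH_MTP, NO_MTP) - 1) * 100:.0f}%")  # 26%
```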
Summary

Running a 120B-parameter model on a single desktop machine is still quite striking. Scoring 94.4% on JCommonsenseQA without any Japanese-specific fine-tuning speaks to the model's strong base Japanese capability.
It's also interesting that the division of roles is clear: Nano for speed, Super for accuracy. With DGX Spark, you can load both and switch between them depending on the use case.
In the original March verification, the NVFP4 version did not work due to sm_121 compatibility issues, but with Ollama + GGUF, inference was possible at a sufficiently practical speed. As native NVFP4 support progresses, further performance improvements can be expected, including MTP-based acceleration.
I'll also be keeping an eye on updates to the Nemotron 3 family at GTC 2026 (3/16-19).
Reference Links

NVIDIA announced its latest open model Nemotron 3 Super, so I tried it on Cloudflare Workers AI (DevelopersIO)
I tried Japanese fine-tuning of Nemotron 3 Nano on DGX Spark (DevelopersIO)
Introducing Nemotron 3 Super (official blog)
Inside Nemotron 3: Techniques, Tools, and Data (technical explainer)
Nemotron 3 Super technical report
NVIDIA-NeMo/Nemotron (GitHub, cookbook)
HuggingFace: NVFP4 model

