I tried running NVIDIA Nemotron 3 Super on DGX Spark

2026.03.12

Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

On March 11, 2026, NVIDIA released Nemotron 3 Super. It is a reasoning model for agents that adopts a hybrid architecture with 120B total parameters and 12B active parameters. For details on the architecture and a trial run on Cloudflare Workers AI, please refer to Oguri-san's article linked below.

https://dev.classmethod.jp/articles/nvidia-nemotron-3-super-cloudflare-workers-ai/

This article focuses on local execution. In a previous article, I ran Nano (30B-A3B), a lightweight model from the same Nemotron 3 family, on DGX Spark. Here I scale up to the 120B Super and, to see how it differs from Nano, compare the two on the JCommonsenseQA benchmark.

Overview of Nemotron 3 Super

Nemotron 3 is a new generation model family that NVIDIA trained from scratch with its own original architecture. It belongs to a different lineage from the Nemotron Nano 9B v2 Japanese (a Llama-based fine-tune) covered in a previous series.

| Model | Total Parameters | Active Parameters | Context Length | Positioning |
|---|---|---|---|---|
| Nemotron 3 Nano | 30B | 3B | 128K | Lightweight / Edge-oriented |
| Nemotron 3 Super | 120B | 12B | 256K-1M | Core / Agent-oriented |
| Nemotron 3 Ultra | Undisclosed | | | Large-scale (planned H1 2026) |

Super uses a hybrid configuration that alternates three types of blocks: Mamba-2, Transformer Attention, and Latent MoE (Mixture of Experts). These are designed, respectively, for long-context processing, accurate reference retrieval, and using 4x as many experts at the same cost. It also has a built-in Multi-Token Prediction (MTP) head, which supports acceleration via native Speculative Decoding without an external draft model.

An important point when considering execution on DGX Spark is that it was pre-trained from scratch in NVFP4 (4-bit floating point). Rather than quantizing after the fact, it was trained from the start under 4-bit precision constraints, making it natively compatible with Blackwell architecture's NVFP4 optimizations. However, as described later, as of March 2026, there were issues on the inference engine side in my environment, so I verified using the GGUF version on Ollama.

Setting Up on DGX Spark

Verification Environment

| Item | Value |
|---|---|
| Hardware | NVIDIA DGX Spark (GB10 Superchip) |
| Memory | 128GB Unified Memory |
| Driver | 580.126.09 |
| CUDA | 13.0 |
| OS | Ubuntu 24.04 (aarch64) |
| Inference Engine | Ollama 0.17.2 |
| Model | nemotron-3-super:latest (Q4_K_M GGUF) |

The NVFP4 Version Didn't Work

NVIDIA's official cookbook provides instructions for launching with vLLM 0.17.1 + NVFP4 checkpoints. However, DGX Spark (GB10) has CUDA Compute Capability 12.1 (sm_121), and torch 2.10.0+cu128, which vLLM 0.17.1 depends on, only supports up to sm_120. Because the CUTLASS kernel crashed at runtime, I could not run the NVFP4 variant with vLLM at the time.
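The mismatch described above can be checked in a few lines. The following is only a minimal sketch, not the dispatch logic of vLLM or PyTorch: `arch_tag` and `arch_listed` are hypothetical helpers that test whether a device's compute capability appears in a list of compiled architectures written in the `sm_XY` form. On a real machine the two inputs could come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`; the list used here is made up for illustration.

```python
def arch_tag(capability: tuple) -> str:
    """Format a (major, minor) compute capability as an sm_XY tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def arch_listed(capability: tuple, arch_list: list) -> bool:
    """Return True if the device's exact architecture is in the compiled list."""
    return arch_tag(capability) in arch_list


# Illustrative values only: GB10 reports 12.1 per the article,
# and the compiled list below is hypothetical.
gb10 = (12, 1)
compiled = ["sm_90", "sm_100", "sm_120"]

print(arch_tag(gb10), "listed:", arch_listed(gb10, compiled))
```

An exact-match miss does not by itself prove that a kernel will fail, since some builds also ship PTX that can be JIT-compiled for newer devices. It only flags a combination worth investigating before a long model download.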

I also tried NGC containers (26.01, 26.02), but 26.01 uses vLLM 0.13.x and does not support the Nemotron 3 Super architecture (nemotron_h), while 26.02 uses vLLM 0.15.x and fails to interpret the MIXED_PRECISION quantization.

TRT-LLM has an official configuration for DGX Spark (Config C), but it reportedly requires building from the main branch rather than a release version, so I passed on it this time. I hope this will be resolved in NGC 26.03 or later, or with an official TRT-LLM release.

Running with Ollama

The GGUF format allows us to avoid CUDA kernel compatibility issues. Ollama has published nemotron-3-super with Q4_K_M quantization, and it ran without issues on DGX Spark.

# Pull the model with Ollama (approx. 87GB)
ollama pull nemotron-3-super

You can also pull Nano in the same way.

# Pull Nano for comparison (approx. 24GB)
ollama pull nemotron-3-nano

Once Ollama is running, all you need to do is call the API. The initial model load takes about 1–2 minutes, but once loaded, responses come back smoothly.

Memory Usage

Since DGX Spark's unified memory is shared between CPU and GPU, it's difficult to check exact usage with nvidia-smi. Checking with Ollama's ps command, Super was loading approximately 87GB and Nano approximately 24GB. With 128GB of unified memory, Super alone has plenty of room, but loading both models simultaneously totals around 111GB, so care is needed when balancing with other processes.
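The headroom arithmetic in this paragraph can be written out as a small sketch. The sizes are the approximate figures reported in the article; `headroom_gb` is a hypothetical helper name, and the calculation ignores whatever the OS and other processes already occupy.

```python
UNIFIED_MEMORY_GB = 128  # DGX Spark unified memory, from the environment table

# Approximate loaded sizes reported by `ollama ps` in the article.
loaded_gb = {"nemotron-3-super": 87, "nemotron-3-nano": 24}


def headroom_gb(models: list, total_gb: int = UNIFIED_MEMORY_GB) -> int:
    """Memory left over for the OS and other processes after loading the models."""
    return total_gb - sum(loaded_gb[m] for m in models)


print("Super only:", headroom_gb(["nemotron-3-super"]), "GB free")
print("Super + Nano:", headroom_gb(["nemotron-3-super", "nemotron-3-nano"]), "GB free")
```

With both models resident, roughly 17GB remains, which is why other memory-hungry processes need attention.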

Running Inference

Reasoning Mode (with thinking)

Nemotron 3 Super has a Reasoning mode that outputs the thought process during inference, similar to DeepSeek-R1. In Ollama, this can be controlled with the think parameter.

curl -s http://localhost:11434/api/chat -d '{
  "model": "nemotron-3-super",
  "messages": [
    {"role": "user", "content": "Explain the key innovations of Mixture of Experts architecture in 3 sentences."}
  ],
  "stream": false
}' | jq '{model, eval_count, eval_duration_ns: .eval_duration,
         tok_per_sec: (.eval_count / (.eval_duration / 1e9))}'

By default, it operates with Reasoning ON (with thinking) and works through its thought process internally before generating a response. Latency increases because more tokens are generated, but accuracy improves on complex reasoning tasks.
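The jq expression in the curl example derives tokens per second from `eval_count` and `eval_duration` (reported in nanoseconds). The same calculation in Python might look like the sketch below, which also covers prompt processing via `prompt_eval_count` and `prompt_eval_duration`. The helper names and the sample response are illustrative, not measured values, and the sketch assumes all four fields are present in the response.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens/second."""
    if duration_ns <= 0:
        raise ValueError("duration_ns must be positive")
    return count / (duration_ns / 1e9)


def summarize(response: dict) -> dict:
    """Extract prompt-processing and generation speed from an Ollama chat response."""
    return {
        "prompt_tok_per_sec": tokens_per_second(
            response["prompt_eval_count"], response["prompt_eval_duration"]
        ),
        "gen_tok_per_sec": tokens_per_second(
            response["eval_count"], response["eval_duration"]
        ),
    }


# Made-up response fragment for illustration, not a measured result.
sample = {
    "prompt_eval_count": 40,
    "prompt_eval_duration": 200_000_000,  # 0.2 s
    "eval_count": 100,
    "eval_duration": 5_000_000_000,       # 5 s
}
print(summarize(sample))
```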

No-thinking (nothink) Mode

To skip the thought process and get a direct answer, specify think: false.

curl -s http://localhost:11434/api/chat -d '{
  "model": "nemotron-3-super",
  "messages": [
    {"role": "user", "content": "What is 2+2?"}
  ],
  "think": false,
  "stream": false
}'

This mode is suited for situations requiring short answers, such as benchmarks and classification tasks.
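For scripted use such as a benchmark loop, the same request can be sent from Python. This is a minimal sketch using only the standard library, assuming an Ollama server on the default local port as in the curl examples; `build_chat_payload` and `chat` are hypothetical helper names, and the timeout value is an arbitrary choice.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # same endpoint as the curl examples


def build_chat_payload(model: str, prompt: str, think: bool = True) -> dict:
    """Build a non-streaming /api/chat request body with the think flag set explicitly."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "think": think,
        "stream": False,
    }


def chat(payload: dict, url: str = OLLAMA_URL, timeout: float = 600.0) -> dict:
    """POST the payload to a running Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    payload = build_chat_payload("nemotron-3-super", "What is 2+2?", think=False)
    print(json.dumps(payload, indent=2))
    # Uncomment when an Ollama server is running locally:
    # print(chat(payload)["message"]["content"])
```

Keeping payload construction separate from the network call makes it easy to switch between thinking and no-thinking runs without touching the transport code.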

Throughput Measurement

We compared the throughput of Nano and Super using the same prompt.

| Model | Parameters (active) | Quantization | Model Size | Prompt Processing | Generation Speed | Generated Tokens |
|---|---|---|---|---|---|---|
| Nemotron 3 Nano | 30B (3B) | Q4_K_M | 24GB | 361.0 tok/s | 72.3 tok/s | 170 |
| Nemotron 3 Super | 120B (12B) | Q4_K_M | 87GB | 112.4 tok/s | 17.9 tok/s | 424 |

Nano's roughly 4x advantage in generation speed closely matches the 4x difference in active parameters (3B vs. 12B). Super, on the other hand, generated more tokens because its thought process in Reasoning mode was longer, which also shows up as a difference in answer quality.
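The ratios can be checked directly from the table. The sketch below uses only the figures reported above; the generation-time estimate is a rough one that ignores prompt processing and model load time.

```python
# Figures from the throughput table above.
nano = {"active_b": 3, "prompt_tps": 361.0, "gen_tps": 72.3, "gen_tokens": 170}
super_ = {"active_b": 12, "prompt_tps": 112.4, "gen_tps": 17.9, "gen_tokens": 424}

gen_ratio = nano["gen_tps"] / super_["gen_tps"]           # generation speed ratio
prompt_ratio = nano["prompt_tps"] / super_["prompt_tps"]  # prompt processing ratio
param_ratio = super_["active_b"] / nano["active_b"]       # active parameter ratio

# Time spent generating the answer, ignoring prompt processing and model load.
nano_seconds = nano["gen_tokens"] / nano["gen_tps"]
super_seconds = super_["gen_tokens"] / super_["gen_tps"]

print(f"generation speed ratio: {gen_ratio:.2f}x (active params: {param_ratio:.0f}x)")
print(f"prompt processing ratio: {prompt_ratio:.2f}x")
print(f"generation time: Nano {nano_seconds:.1f}s, Super {super_seconds:.1f}s")
```

Because Super both generates more slowly and emitted about 2.5x as many tokens (424 vs. 170), the gap in generation time for this prompt is closer to 10x than 4x.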

JCommonsenseQA Benchmark

Why This Benchmark?

JCommonsenseQA is a dataset of five-choice questions that test Japanese commonsense reasoning. Earlier in this series, I evaluated Nemotron Nano 9B v2 Japanese (previous generation) on the same DGX Spark, which gives a baseline for comparison. Since the Nemotron 3 generation was trained mainly on English data, the benchmark also indicates how well it handles Japanese tasks.

Measurement Conditions

| Item | Value |
|---|---|
| Dataset | JCommonsenseQA v1.1 (validation set, 1,119 questions) |
| Prompt | 3-shot, answer is a single letter |
| Thinking Mode | nothink (no thinking) |
| temperature | 0 |
| Backend | Ollama (DGX Spark local) |
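The evaluation loop implied by these conditions can be sketched as follows. This is not the script used for the measurements: the prompt wording, the dictionary layout, and the helper names are all assumptions made for illustration. It shows one way to assemble a few-shot prompt, pull a single answer letter out of the model output, and compute accuracy.

```python
import re
from typing import Optional

LETTERS = "ABCDE"  # five choices per question


def format_question(question: str, choices: list, answer: Optional[str] = None) -> str:
    """Render one question; include the answer letter for few-shot examples."""
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_prompt(shots: list, target: dict) -> str:
    """Concatenate the solved few-shot examples followed by the unsolved target."""
    blocks = [format_question(s["question"], s["choices"], s["answer"]) for s in shots]
    blocks.append(format_question(target["question"], target["choices"]))
    return "\n\n".join(blocks)


def parse_answer(text: str) -> Optional[str]:
    """Return the first uppercase A-E not adjacent to other Latin letters, if any."""
    match = re.search(r"(?<![A-Za-z])([A-E])(?![A-Za-z])", text)
    return match.group(1) if match else None


def accuracy(predictions: list, gold: list) -> float:
    """Fraction of predictions that exactly match the gold letters."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Made-up model outputs to exercise the parser, not benchmark data.
outputs = ["B", " C\n", "答えはAです", "unsure"]
gold = ["B", "C", "D", "E"]
preds = [parse_answer(o) for o in outputs]
print(preds, accuracy(preds, gold))
```

Treating unparseable outputs as wrong, as `accuracy` does here, is a conservative choice; a different scoring rule would change the reported number.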

Results

| Model | Generation | Quantization | Accuracy | Avg. Latency | Avg. Generation Speed |
|---|---|---|---|---|---|
| Nemotron Nano 9B v2 JP (ref.) | Previous | BF16 | 91.2% | 0.98s/q | |
| Nemotron 3 Nano (30B-A3B) | Nemotron 3 | Q4_K_M | 87.0% | 0.30s/q | 118.5 tok/s |
| Nemotron 3 Super (120B-A12B) | Nemotron 3 | Q4_K_M | 94.4% | 0.92s/q | 35.6 tok/s |

Nemotron 3 Super recorded the highest accuracy at 94.4%. The previous-generation 9B v2 Japanese, which was fine-tuned for Japanese, achieved 91.2%, so surpassing it without any Japanese-specific training is a bit surprising. The reasoning capacity of the 12B active parameters appears to be reflected directly in the score.

Nano also performed well at 87.0%. In latency, Nano is about 3x faster at 0.30s per question versus Super's 0.92s. Even so, Super stays under 1 second per question, so it felt comfortable to use in practice.
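To put the per-question latency in terms of a full benchmark run, the sketch below multiplies the reported averages by the 1,119 validation questions. It assumes questions are processed one at a time with no batching or parallel requests, which the article does not state explicitly.

```python
QUESTIONS = 1119  # JCommonsenseQA v1.1 validation set size, from the conditions table

avg_latency_s = {"Nemotron 3 Nano": 0.30, "Nemotron 3 Super": 0.92}

for model, latency in avg_latency_s.items():
    total_s = QUESTIONS * latency
    print(f"{model}: {total_s:.0f} s total (~{total_s / 60:.1f} min)")

ratio = avg_latency_s["Nemotron 3 Super"] / avg_latency_s["Nemotron 3 Nano"]
print(f"Super takes {ratio:.2f}x as long per question")
```

Under that assumption, a full pass takes on the order of 6 minutes for Nano and 17 minutes for Super.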

Super for accuracy, Nano for speed. The division by use case is clear.

Expectations for vLLM + NVFP4 and MTP

This time we verified with Ollama + GGUF, but NVIDIA's officially recommended configuration is vLLM 0.17.1 + NVFP4 checkpoints. TRT-LLM Config C is also available for DGX Spark.

Furthermore, Nemotron 3 Super has a Multi-Token Prediction (MTP) head built into the checkpoint, enabling native Speculative Decoding without an external draft model. In vLLM, it can be enabled simply by adding the --speculative-config option, and NVIDIA officially claims up to 3x speedup on structured generation tasks.

# To enable MTP (vLLM)
vllm serve $MODEL_CKPT \
  --speculative-config '{"method": "nemotron_h_mtp", "num_speculative_tokens": 5}'

Once vLLM + NVFP4 becomes operational on DGX Spark, we can expect higher throughput than the GGUF version, and we'll also be able to try out MTP-based acceleration. I look forward to the next NGC container version and the official TRT-LLM release.
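As a rough intuition for why speculative decoding helps, a common simplification treats each drafted token as accepted independently with the same probability. Under that assumption, the expected number of tokens emitted per verification pass is (1 - a^(k+1)) / (1 - a) for acceptance rate a and k speculative tokens. The sketch below evaluates that expression; the acceptance rates are hypothetical, not measurements of Nemotron 3 Super's MTP head, and the result is an upper bound on speedup because drafting and verification overhead are ignored.

```python
def expected_tokens_per_step(acceptance: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per verification pass under i.i.d. acceptance."""
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    if acceptance == 1.0:
        # Every drafted token is accepted, plus the one token from the verifier.
        return float(num_speculative_tokens + 1)
    return (1 - acceptance ** (num_speculative_tokens + 1)) / (1 - acceptance)


# Hypothetical acceptance rates with 5 speculative tokens, as in the command above.
for a in (0.5, 0.7, 0.9):
    print(f"acceptance={a:.1f}: {expected_tokens_per_step(a, 5):.2f} tokens/step")
```

The gain depends heavily on the acceptance rate, which is consistent with the largest speedups being claimed for structured generation, where the next tokens are easier to predict.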

June 2026 Update: NVFP4 Now at Practical Speed with Official NGC Container

Back in March, I ended by expressing hope that "things would improve once NGC official containers and vLLM 0.17+ gain sm_121 support." About three months later, the situation has changed considerably. When I revisited DGX Spark for an upcoming event, NVFP4 was working directly with the official NGC vLLM container, so I'm adding the results here.

I used nvcr.io/nvidia/vllm:26.03.post1-py3. One cause of the March failure, NGC 26.02 failing to interpret the MIXED_PRECISION quantization, was resolved in 26.03, and I was able to load nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 directly.

docker run --rm --gpus all --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_FLASHINFER_MOE_BACKEND=latency \
  nvcr.io/nvidia/vllm:26.03.post1-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --trust-remote-code --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.85 --kv-cache-dtype fp8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

What reassured me when watching the startup log was that FLASHINFER_CUTLASS was selected as the MoE backend for NVFP4. In March, it fell back to Marlin weight-only and dropped to 4.8 tok/s, but this time FlashInfer CUTLASS was selected even though Marlin was among the candidates. The mamba_ssm_cache_dtype float32 setting is now handled automatically by 26.03, so the option I had to pass manually in March is no longer needed.

Measured Throughput

These results were re-measured with the model already warmed up.

| Model | Configuration | Decode Speed |
|---|---|---|
| Nemotron 3 Nano (30B-A3B) | NVFP4 / vLLM 26.03 | 57.7 tok/s |
| Nemotron 3 Super (120B-A12B) | NVFP4 / No MTP | 14.8 tok/s |
| Nemotron 3 Super (120B-A12B) | NVFP4 / MTP num_spec=3 | 18.6 tok/s |

Compared to 4.8 tok/s via Marlin in March, Super is about 3x faster without MTP (14.8 tok/s) and nearly 4x faster with it (18.6 tok/s). Enabling MTP (Multi-Token Prediction) increased throughput from 14.8 to 18.6 tok/s, an improvement of about 26%. Since Super has the MTP head built into the checkpoint, you can use it simply by passing --speculative-config.
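The speedup figures follow directly from the measured decode speeds. A short sketch using only the numbers reported in this update:

```python
marlin_march = 4.8  # tok/s, March fallback via Marlin
no_mtp = 14.8       # tok/s, NVFP4 without MTP
with_mtp = 18.6     # tok/s, NVFP4 with MTP (num_spec=3)

mtp_gain_pct = (with_mtp / no_mtp - 1) * 100

print(f"NVFP4 vs. March Marlin (no MTP):   {no_mtp / marlin_march:.2f}x")
print(f"NVFP4 vs. March Marlin (with MTP): {with_mtp / marlin_march:.2f}x")
print(f"MTP gain: {mtp_gain_pct:.1f}%")
```

The comparison with March comes out at roughly 3x without MTP and close to 4x with it.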

Summary

Running a 120B-parameter model on a single desktop machine is still impressive. Achieving 94.4% on JCommonsenseQA without Japanese-specific fine-tuning speaks to the model's strong baseline Japanese capability.

It's also interesting that the division is clear: Nano for speed, Super for accuracy. With DGX Spark, you can load both and switch between them depending on your use case.

In the original March verification, the NVFP4 version did not work due to sm_121 compatibility issues, but Ollama + GGUF delivered practically usable inference speed. As the June update shows, native NVFP4 support is maturing, and further performance improvements, including MTP-based acceleration, can be expected.

I'll also be keeping an eye on further updates about the Nemotron 3 family at GTC 2026 (3/16-19).

