I ran Nemotron-Labs-Diffusion on DGX Spark and actually measured tri-mode generation

2026.05.21

Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

In May 2026, NVIDIA released a new model family called "Nemotron-Labs-Diffusion." Some of you might think this is about image generation because of the word "diffusion" in the name, but this is actually a "diffusion language model" that generates text. Keeping that in mind from the start will make the rest of this article easier to follow.

https://research.nvidia.com/publication/2026-05_nemotron-labs-diffusion-tri-mode-language-model-unifying-autoregressive

The LLMs we normally use write text one token at a time, from left to right. Because the next token can't be determined until the previous one is fixed, generation is inherently sequential. Diffusion language models change this: they generate multiple tokens in parallel, one block at a time, and fill in the uncertain positions in later steps.

What makes Nemotron-Labs-Diffusion interesting is that a single model can switch between three decoding modes (AR / Diffusion / Linear Self-Speculation). NVIDIA calls this "tri-mode." Both autoregressive (AR) and diffusion modes coexist within the same checkpoint rather than as separate models.

While reading through the published information, I personally had three things I wanted to know:

  • Do all three modes run smoothly in the local DGX Spark environment?
  • What does the diffusion mode's behavior of "generating in parallel and filling in later" actually look like when you run it?
  • How do the numbers the official documentation cites — "2.7x compared to AR" and "6× tokens per forward" — hold up in a bare run without quantization?

So I loaded the Nemotron-Labs-Diffusion 8B model onto a DGX Spark and measured all three modes. To give the conclusion upfront: all three modes ran on DGX Spark. In terms of speed, Linear Self-Speculation was 1.75–1.98x faster than AR. However, the impressive numbers in the official documentation come with conditions, and when generating longform content in BF16 without quantization, a different picture emerges. I'll cover those differences as well.

As a note, I also previously wrote an article about measuring speculative decoding with an inference engine on DGX Spark using Gemma 4 MTP. Gemma 4 MTP uses a separate draft model, while today's Linear Self-Speculation completes everything with the same model — but the basic structure of "quickly producing a draft that the main model then verifies" is common to both. Reading them together should help you understand where diffusion language models fit in.

https://dev.classmethod.jp/articles/dgx-spark-gemma4-mtp-multi-token-prediction-bench/

What is Nemotron-Labs-Diffusion's tri-mode?

Model Family and Background

Nemotron-Labs-Diffusion has text generation models in three sizes — 3B / 8B / 14B — each with a pre-training-only Base version and an instruction-tuned Instruct version. There is also a VLM-8B that can handle images. This article focuses on the instruction-tuned 8B (nvidia/Nemotron-Labs-Diffusion-8B).

The architecture is a straightforward Transformer, not a Mamba hybrid. According to the technical report, training started from the existing Ministral3-8B checkpoint, followed by continued pre-training with autoregression only on 1T tokens, then continued pre-training on 300B tokens with an objective combining autoregression and diffusion, and finally instruction tuning on 45B tokens. The license is the NVIDIA Nemotron Open Model License.

One subtly important point here is that adding diffusion training does not degrade AR accuracy. The technical report states that compared to a model trained under the same conditions without the diffusion loss, the AR accuracy after joint training actually improved by 0.14–0.43%. The fact that autoregressive capability is not sacrificed to add diffusion capability is one of the pleasing aspects of the design.

Three Decoding Modes

Let me organize what's inside tri-mode. The three generation methods are switched at inference time by changing how attention is applied, all on the same weights.

AR mode is conventional autoregressive decoding: each forward pass confirms one token, from left to right. It is called via ar_generate().

Diffusion mode divides the generation range into blocks of block_length tokens, masks every position in a block, then progressively fills in the positions with high confidence. It is called via generate() with the diffusion-specific parameters block_length and threshold. threshold controls how many tokens are confirmed in each step: a higher value is more cautious, and a lower value fills in more positions at once.
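As a rough illustration of this fill-in loop, here is a toy sketch in plain Python. It is not the model's implementation: the confidences come from a random stand-in function, and the rule used here (confirm every masked position whose confidence reaches threshold, or the single most confident one if none does) is an assumption borrowed from common threshold-based parallel decoding schemes. The names fill_block and make_random_confidence are hypothetical.

```python
import random


def fill_block(block_length, threshold, confidence_fn):
    """Toy confidence-threshold unmasking of one block.

    Returns a list of steps; each step is the sorted list of positions
    confirmed in that step. Confirmed positions are never masked again.
    """
    masked = set(range(block_length))
    steps = []
    while masked:
        # Score every position that is still masked.
        conf = {pos: confidence_fn(pos) for pos in sorted(masked)}
        chosen = [pos for pos, c in conf.items() if c >= threshold]
        if not chosen:
            # Assumed fallback: always confirm at least the best position,
            # so every step makes progress.
            chosen = [max(conf, key=conf.get)]
        masked -= set(chosen)
        steps.append(sorted(chosen))
    return steps


def make_random_confidence(seed):
    """Stand-in for the model: returns a random confidence in [0, 1)."""
    rng = random.Random(seed)
    return lambda pos: rng.random()


steps = fill_block(block_length=32, threshold=0.9,
                   confidence_fn=make_random_confidence(seed=0))
print(len(steps), "steps to fill 32 positions")
```

The number of steps plays the role of the forward-pass count for the block: a threshold that nothing can reach degenerates to one position per step, while a threshold that everything reaches fills the block in a single step.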

Linear Self-Speculation mode is a self-speculative decoding approach that combines diffusion and autoregression. Diffusion generates candidate tokens in parallel as a draft, and autoregression then verifies the draft and confirms it up to the last correct token. The idea is the same as speculative decoding, but there's no need to prepare a separate small draft model, because the same model's diffusion mode serves as the drafter. It is called via linear_spec_generate().
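The draft-then-verify idea can be sketched with a toy example. This shows generic greedy speculative verification, not the internals of linear_spec_generate(), which this sketch does not inspect. The names verify_draft and target_next are hypothetical, and the "verifier" is just a lookup into a fixed target sequence. In a real model, the verifier's predictions for all draft positions come out of a single forward pass; the loop here only imitates that result.

```python
def verify_draft(context, draft, target_next):
    """Return the tokens confirmed in one draft-and-verify round.

    Draft tokens are accepted while they match what the verifier would have
    produced. At the first mismatch, the verifier's own token is used instead
    and the rest of the draft is discarded.
    """
    accepted = []
    for token in draft:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # correction from the verifier
            break
        accepted.append(token)
    return accepted


# Toy verifier: the "correct" continuation is a fixed sequence.
TARGET = list("abcdefgh")


def target_next(prefix):
    return TARGET[len(prefix)]


# The draft gets three tokens right and the fourth wrong.
confirmed = verify_draft([], list("abcx"), target_next)
print(confirmed)
```

Each round confirms at least one token, so the result is never slower than plain AR in terms of forward-pass count, and a good draft confirms several tokens at once.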

Because it needs no separate weight file for the drafter, it also resembles DeepSeek's MTP, which embeds a dedicated drafting module in the main model. The difference is that Nemotron-Labs-Diffusion adds no such module and instead reuses its existing diffusion mode directly as the drafter. The previously mentioned Gemma 4 MTP distributes a separate draft model, so lining up all three ("separate model," "built-in module," and "mode repurposing") shows that the draft comes from a subtly different place in each.

The three methods return nfe (number of forward evaluations, i.e. how many times the forward pass was run) along with the generated tokens. Producing text of the same length with fewer forward passes means faster generation. Throughout this article, I use nfe, and the tokens per forward derived from it (how many tokens are confirmed per forward pass), as speed metrics.
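The metric itself is simple arithmetic, sketched below. The helper names are hypothetical. How to get the number of new tokens from out_ids depends on whether the returned ids include the prompt, which is not stated here, so the helpers take plain counts.

```python
def tokens_per_forward(new_tokens: int, nfe: int) -> float:
    """Average number of tokens confirmed per forward pass."""
    if nfe <= 0:
        raise ValueError("nfe must be positive")
    return new_tokens / nfe


def forward_pass_reduction(new_tokens: int, nfe: int) -> float:
    """Fraction of forward passes saved versus one-token-per-pass AR decoding."""
    if new_tokens <= 0:
        raise ValueError("new_tokens must be positive")
    return 1 - nfe / new_tokens


# AR decoding: one forward pass per token.
print(tokens_per_forward(512, 512))
# A decoder that needs only 256 passes for the same 512 tokens.
print(tokens_per_forward(512, 256), forward_pass_reduction(512, 256))
```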

Test Environment

I used one DGX Spark for testing. The hardware and software configuration is as follows:

Item           Details
Machine        NVIDIA DGX Spark (GB10, Blackwell SM121)
Memory         128GB unified memory (UMA), bandwidth 273 GB/s
Architecture   aarch64 (ARM64)
Python         3.13.13 (built with uv)
PyTorch        2.12.0+cu130
transformers   5.9.0
Model          nvidia/Nemotron-Labs-Diffusion-8B (BF16, ~16GB)

The three modes of tri-mode are switched through model-specific methods: ar_generate() / generate() / linear_spec_generate(). These are implemented in the custom modeling code (modeling_*.py) bundled with the model repository and cannot easily be called from vLLM's standard serving functionality. For this testing, I loaded the model through transformers with trust_remote_code=True and called the methods directly.

Setup and Model Loading

I created a virtual environment for testing with uv and installed the necessary libraries. PyTorch is specified with the cu130 build to match the DGX Spark's CUDA 13 environment.

cd ~/works/nemotron-labs-diffusion
uv venv --python 3.13 .venv
uv pip install --python .venv/bin/python \
    torch 'transformers>=5.0' peft accelerate datasets \
    --torch-backend=cu130

Model loading follows the model card sample exactly, and trust_remote_code=True is required to load the custom modeling.

from transformers import AutoModel, AutoTokenizer
import torch

repo = "nvidia/Nemotron-Labs-Diffusion-8B"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
model = model.cuda().to(torch.bfloat16).eval()

The three modes are called as follows. All of them share the same first step: applying the chat template to create prompt_ids.

history = [{"role": "user", "content": "Please explain diffusion language models in one sentence."}]
text = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
prompt_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
eos = tokenizer.eos_token_id

# AR mode
out_ids, nfe = model.ar_generate(prompt_ids, max_new_tokens=512)

# Diffusion mode
out_ids, nfe = model.generate(
    prompt_ids, max_new_tokens=512, block_length=32, threshold=0.9, eos_token_id=eos,
)

# Linear Self-Speculation mode
out_ids, nfe = model.linear_spec_generate(
    prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=eos,
)

For Linear Self-Speculation, the model repository also bundles a LoRA adapter (linear_spec_lora) that further increases how much of the draft is accepted. To use it, attach it with PeftModel, then unwrap the base model and call the method on it.

from peft import PeftModel

model = PeftModel.from_pretrained(model, repo, subfolder="linear_spec_lora").eval()
base = model.model  # Unwrap before calling linear_spec_generate
out_ids, nfe = base.linear_spec_generate(
    prompt_ids, max_new_tokens=512, block_length=32, eos_token_id=eos,
)

In practice, following these steps as written does not work right away: with transformers 5.9, diffusion mode and Linear Self-Speculation mode initially stop with an exception. The cause and workaround are explained in the "Common Pitfalls on DGX Spark" section. For now, I'll just note that adding one shim makes all three modes work.

Measuring the Speed of tri-mode's Three Modes

Now for the main topic. I ran the same 12 longform prompts (requests to explain technical topics, in a mix of Japanese and English) through four configurations and averaged the results, excluding warmup runs. Generation used max_new_tokens=512 with greedy decoding (temperature 0).
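A measurement loop of this kind can be sketched as follows. This is a hypothetical harness, not the script used for the numbers in this article: benchmark and fake_generate are made-up names, and fake_generate only imitates a generation call so the loop runs without a GPU. With a real model, the wrapped call would also need to wait for the GPU to finish (for example with torch.cuda.synchronize()) before the clock is read.

```python
import time
from statistics import mean


def benchmark(generate_fn, prompts, warmup_runs=1):
    """Average speed metrics over prompts, discarding warm-up runs.

    generate_fn(prompt) must return (num_new_tokens, nfe).
    """
    for prompt in prompts[:warmup_runs]:
        generate_fn(prompt)  # warm-up: result and timing are thrown away

    rows = []
    for prompt in prompts:
        start = time.perf_counter()
        num_new_tokens, nfe = generate_fn(prompt)
        elapsed = max(time.perf_counter() - start, 1e-9)
        rows.append({
            "tok_s": num_new_tokens / elapsed,
            "tokens_per_forward": num_new_tokens / nfe,
            "nfe": nfe,
        })
    # Average each metric across prompts.
    return {key: mean(row[key] for row in rows) for key in rows[0]}


def fake_generate(prompt):
    """Stand-in for a real call: pretends 512 tokens took 256 forward passes."""
    time.sleep(0.001)
    return 512, 256


summary = benchmark(fake_generate, ["prompt A", "prompt B", "prompt C"])
print(summary)
```

Note that this averages per-prompt ratios, so the averaged tokens per forward need not equal total tokens divided by the averaged nfe.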

Nemotron-Labs-Diffusion 8B tri-mode speed (DGX Spark / BF16)
Warm tok/s for 4 configurations. Using AR as baseline: Diffusion 1.20x, Linear Self-Speculation 1.75x, with LoRA 1.98x. All using the same 8B model with the same weights, only switching the decoding method.

The numerical results were as follows:

Configuration             tok/s   vs AR   tokens/forward   mean nfe
AR (baseline)             12.6    1.00x   1.00             512
Diffusion                 15.1    1.20x   1.23             420
Linear Self-Spec          22.2    1.75x   1.81             287
Linear Self-Spec + LoRA   25.0    1.98x   2.08             252

A clean staircase pattern emerged. What I want to highlight is tokens/forward and nfe. Since AR generates 1 token per forward, it runs the forward pass 512 times to produce 512 tokens. Linear Self-Speculation + LoRA confirms an average of 2.08 tokens per forward, completing roughly the same length of text in 252 forward passes. Being able to cut the number of forward passes in half directly translates into the speed difference.

Here, let me clarify the relationship to the official numbers. The technical report includes measurements on DGX Spark, showing that the 8B diffusion mode achieves 77.5 tok/s with FP8 quantization (3.14x vs AR) and 112.5 tok/s with INT4 quantization (2.69x vs AR). My numbers look much lower, but that is because I ran BF16 without quantization, directly from transformers, with no inference-engine optimizations. Absolute values change considerably with the runtime and quantization. What this article examines is not the absolute speed but the relative difference between AR and the diffusion variants under identical conditions. In that sense, Linear Self-Speculation being 1.75–1.98x faster is a fair reflection of the structural benefit of diffusion language models.

I also looked at measurement variance. AR stayed within 12.6–12.7 tok/s across all 12 prompts with virtually no measurement noise. The tokens/forward staircase (AR 1.00 → Diffusion 1.23 → Linear Spec 1.81 → +LoRA 2.08) also reproduced stably.

Confirmed tokens per forward and number of forward passes
Left shows tokens per forward, right shows the total number of forward passes required for the same generation. Diffusion variants confirm more tokens per forward and accordingly reduce the total number of forward passes.

All three modes ran on DGX Spark, and the quality of the output text was comparable to AR. Here is a summary matrix of whether each mode works:

tri-mode availability matrix on DGX Spark
tri-mode operation status with 8B / BF16 / transformers 5.9. All three modes work, with Linear Self-Speculation being the fastest.

Visualizing Parallel Generation in Diffusion Mode

The speed improvement is clear. But what order does diffusion mode actually use to fill in tokens? This was personally the part I most wanted to see.

Internally, diffusion mode's generate() repeats one process per block: among the positions that are still masked, confirm those with high confidence, a few at a time. So each time this confirmation process was called, I recorded which positions were still masked and which positions were confirmed in that step, and tracked how one block fills in.
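The bookkeeping described above can be sketched as a small post-processing step. This assumes the mask state of one block has already been captured after each step as a list of booleans (True means still masked). How those snapshots are captured from the model is not shown, and confirmation_trace is a hypothetical name.

```python
def confirmation_trace(mask_snapshots):
    """Turn per-step mask snapshots of one block into a fill-in trace.

    mask_snapshots[i][pos] is True while position pos is still masked after
    step i. Returns (per_step, confirmed_at, remasked):
      per_step[i]       positions newly confirmed in step i
      confirmed_at[pos] step at which pos was (last) confirmed, or None
      remasked          True if any confirmed position became masked again
    """
    block_length = len(mask_snapshots[0])
    previous = [True] * block_length  # before step 0 everything is masked
    confirmed_at = [None] * block_length
    per_step = []
    remasked = False
    for step, snapshot in enumerate(mask_snapshots):
        newly = [p for p in range(block_length) if previous[p] and not snapshot[p]]
        if any(snapshot[p] and not previous[p] for p in range(block_length)):
            remasked = True
        for p in newly:
            confirmed_at[p] = step
        per_step.append(newly)
        previous = snapshot
    return per_step, confirmed_at, remasked


# Toy block of 4 positions: two confirmed in step 0, then one per step.
snapshots = [
    [False, True, True, False],
    [False, True, False, False],
    [False, False, False, False],
]
per_step, confirmed_at, remasked = confirmation_trace(snapshots)
print(per_step, confirmed_at, remasked)
```

per_step corresponds to a step-by-position heatmap, confirmed_at to the confirmation order of each position, and remasked is a direct check of whether any confirmation was ever undone.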

Diffusion block filling in parallel
How a block_length=32 block fills in across denoising steps. Horizontal axis is the token position within the block, vertical axis is the step. Dark purple indicates positions confirmed in that step, light purple indicates already confirmed, gray indicates masked. With threshold=0.9, mainly 1 token is confirmed per step, but in the first step multiple positions are confirmed at once.

Looking at the heatmap, you can see that multiple positions are confirmed at once in the first step of a block, then positions fill in one by one from those with the highest confidence. Since threshold=0.9 is a cautious setting, many steps confirm just 1 token at a time. Still, the difference from autoregression is that it doesn't always proceed from left to right like "position 0 → 1 → 2...", but instead fills in from highest-confidence positions first.

The next figure shows the order in which positions within a block were confirmed.

Order in which each token position is confirmed
At which denoising step (vertical axis) each token position (horizontal axis) within the block was confirmed. You can see it fills in sporadically from high-confidence positions rather than from left to right.

Here I want to be precise about one thing. NVIDIA's explanation describes one characteristic of diffusion language models as "not permanently committing tokens, being able to revise as they go." From what I traced inside the 8B model's generate(), however, once a position was confirmed it never became a mask target again in the remaining steps of that block; in other words, confirmation itself was irreversible. The essential difference from autoregression is therefore not "revision" but two things: the order of confirmation follows confidence rather than a fixed left-to-right order, and multiple positions can be confirmed in a single step. That was the most natural way for me to understand the actual behavior. This parallelism is what produces the reduction in forward passes we saw earlier.

Sweeping threshold and block_length

Diffusion mode has two adjustment knobs: threshold and block_length. I ran a grid to see how speed and quality change when these are varied. threshold had 4 values (0.7 / 0.8 / 0.9 / 0.95), block_length had 3 values (8 / 16 / 32), for 12 combinations total, run on 8 longform prompts.

Choosing a quality metric took some thought. I initially tried to measure accuracy using a multiple-choice QA benchmark (JCommonsenseQA), but when the answer is a single character, changing threshold barely moved the results, making it useless as a metric. In the end, I used the repetition rate in the output (the rate at which the same 4-gram appears more than once) as a proxy for quality. The technical report also notes that diffusion mode limits generation length to avoid repetition and hallucination, and repetition is the typical failure mode when diffusion mode breaks down.
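One way to compute such a metric is sketched below. The exact definition used for the figures in this article is not spelled out, so this is an assumed variant: the fraction of 4-gram occurrences that repeat an earlier 4-gram. The function name ngram_repetition_rate is hypothetical. It works on any token sequence; for Japanese text, passing token IDs from the tokenizer is more sensible than splitting on whitespace.

```python
from collections import Counter


def ngram_repetition_rate(tokens, n=4):
    """Fraction of n-gram occurrences that duplicate an earlier n-gram.

    0.0 means every n-gram is unique; values near 1.0 mean the text is
    mostly the same n-grams over and over.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # sequence shorter than n
    counts = Counter(ngrams)
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


clean = "a b c d e f".split()
looping = "a b c d a b c d".split()
print(ngram_repetition_rate(clean), ngram_repetition_rate(looping))
```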

diffusion mode threshold / block_length sweep
Left shows threshold vs. parallelism (tokens per forward), right shows threshold vs. repetition rate. Lowering threshold increases parallelism and makes generation faster. The repetition rate is actually lower at lower threshold values.

The results were simple. Lowering threshold consistently speeds things up. With block_length=32, tokens per forward increases from 1.14 to 1.32 when lowering threshold from 0.95 to 0.70, reducing the total number of forward passes accordingly.

What was surprising was the quality side. I expected that confirming tokens more aggressively by lowering threshold would be sloppier and increase repetition, but in practice it was the opposite. The repetition rate was lower at lower threshold values, and there was even a slight tendency for it to increase when raising threshold to 0.9 or 0.95. The values themselves were also low at 0.05–0.10, not at a level where text breaks down.

The technical report describes threshold as "the knob that determines the tradeoff between speed and token error rate." However, at least within the range of running these longform explanation tasks with 8B BF16, I found no compelling reason to set threshold high. There's also a tendency for the impact of threshold to diminish as block_length increases, so in practice it seems natural to start with a somewhat larger block_length and a lower threshold. Since this will vary by task and model size, it's worth sweeping for your own use case.
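For anyone who wants to repeat a sweep like this on their own task, the grid loop can be sketched as follows. The grid values are the ones used in this section. Everything else is hypothetical: run_one stands for whatever function generates text for one prompt and returns (tokens_per_forward, repetition_rate), and stub_run returns placeholder numbers only so the loop can run without a model.

```python
from itertools import product
from statistics import mean

THRESHOLDS = (0.7, 0.8, 0.9, 0.95)
BLOCK_LENGTHS = (8, 16, 32)


def sweep(run_one, prompts):
    """Run every (threshold, block_length) pair on every prompt.

    run_one(prompt, threshold, block_length) must return
    (tokens_per_forward, repetition_rate) for one generation.
    """
    results = []
    for threshold, block_length in product(THRESHOLDS, BLOCK_LENGTHS):
        rows = [run_one(p, threshold, block_length) for p in prompts]
        results.append({
            "threshold": threshold,
            "block_length": block_length,
            "tokens_per_forward": mean(r[0] for r in rows),
            "repetition_rate": mean(r[1] for r in rows),
        })
    return results


def stub_run(prompt, threshold, block_length):
    """Placeholder values, not measurements."""
    return 1.0 + (1.0 - threshold), 0.05


grid = sweep(stub_run, ["prompt 1", "prompt 2"])
print(len(grid), "configurations")
```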

Common Pitfalls on DGX Spark

The trickiest issue during testing was the incompatibility between transformers 5.9 and the bundled custom modeling. The Nemotron-Labs-Diffusion model repository includes a Python file called modeling_ministral.py, which is executed when loading with trust_remote_code=True. However, this file calls create_causal_mask() with slightly older argument names (input_embeds= and cache_position=), and since argument names changed in transformers 5.9.0, it stops with a TypeError. Diffusion mode and Linear Self-Speculation mode go through this code path (AR mode takes a different path and was not affected).

Directly modifying files in the Hugging Face cache is undesirable because it means tampering with what the library manages. Instead, I prepared a shim on the script side that thinly wraps the mask generation function in transformers.masking_utils to absorb the old argument names. Applying this wrap before from_pretrained() loads the custom modeling makes everything work without touching any dependencies or the cache.

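def apply_transformers_compat() -> None:
    # Call this before from_pretrained() so the custom modeling picks up the wrapped functions.
    import transformers.masking_utils as mu

    def _wrap(fn):
        def wrapped(*args, **kwargs):
            # Map the old keyword name to the one transformers 5.9 expects.
            if "input_embeds" in kwargs and "inputs_embeds" not in kwargs:
                kwargs["inputs_embeds"] = kwargs.pop("input_embeds")
            # Drop the other old keyword so the call no longer raises TypeError.
            kwargs.pop("cache_position", None)
            return fn(*args, **kwargs)
        return wrapped

    # Replace the mask helpers with the wrapped versions.
    for name in ("create_causal_mask", "create_sliding_window_causal_mask"):
        setattr(mu, name, _wrap(getattr(mu, name)))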

This kind of version mismatch is fairly common when running a freshly released model with the latest transformers. It's the kind of thing that becomes unnecessary once the custom modeling side is updated, so it makes sense to treat the shim as a temporary workaround and keep it self-contained within your own script without touching any dependencies.
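To see what the wrapper does without loading transformers, the same renaming logic can be exercised on a stand-in function. In this sketch, wrap_legacy_kwargs and new_style_mask are hypothetical names, and new_style_mask only imitates a function that accepts the new argument name; it is not the real create_causal_mask.

```python
def wrap_legacy_kwargs(fn):
    """Same keyword handling as the shim, applied to any function."""
    def wrapped(*args, **kwargs):
        if "input_embeds" in kwargs and "inputs_embeds" not in kwargs:
            kwargs["inputs_embeds"] = kwargs.pop("input_embeds")
        kwargs.pop("cache_position", None)
        return fn(*args, **kwargs)
    return wrapped


def new_style_mask(config, inputs_embeds, attention_mask=None):
    """Stand-in for a helper that only knows the new argument names."""
    return ("mask for", inputs_embeds)


# Calling with the old names fails on the unwrapped function ...
try:
    new_style_mask("cfg", input_embeds="x", cache_position=0)
    old_call_failed = False
except TypeError:
    old_call_failed = True

# ... and succeeds once the wrapper translates the keywords.
patched = wrap_legacy_kwargs(new_style_mask)
result = patched("cfg", input_embeds="x", cache_position=0)
print(old_call_failed, result)
```

In the real script, the only ordering requirement is the one described above: apply_transformers_compat() has to run before from_pretrained() loads the custom modeling.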

Incidentally, while the model card doesn't mention context length, looking at config.json shows max_position_embeddings is 262144 (256K). This is useful to know when planning to feed in long documents locally.

Summary

I ran Nemotron-Labs-Diffusion 8B on DGX Spark and measured all three modes of tri-mode. Here's a summary of what I found:

  • All three modes — AR / Diffusion / Linear Self-Speculation — ran in the DGX Spark BF16 environment (transformers 5.9 requires one shim)
  • In terms of speed, Linear Self-Speculation was 1.75x faster than AR, and 1.98x with LoRA. In a bare configuration, self-speculation, where diffusion drafts tokens in parallel and autoregression verifies them, was the most effective
  • Diffusion mode fills in positions from highest confidence first. Confirmation itself is irreversible, and the essential difference from autoregression lies in "parallelism and confirmation order" rather than "revision"
  • Lowering threshold speeds things up and actually reduces the repetition rate. In this task, there was no visible reason to set threshold high
  • The official figures of "6× tokens per forward" and "2.7x vs AR with INT4" include quantization and runtime, and in unquantized BF16 longform generation, tokens per forward was 1.2–2.1 with a speedup of just under 2x

Diffusion language models clearly show a direction where self-speculation makes things naturally faster without needing a separate draft model. They also seem well-suited for environments like DGX Spark that run locally on a single node. There's still a lot I haven't tried — quantized versions, the VLM variant, behavior with longer contexts, and more — so I hope to cover those in future articles.

The technical report and public resources for Nemotron-Labs-Diffusion:

https://research.nvidia.com/publication/2026-05_nemotron-labs-diffusion-tri-mode-language-model-unifying-autoregressive

https://huggingface.co/collections/nvidia/nemotron-labs-diffusion

https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B

Related article measuring speculative decoding on DGX Spark:

https://dev.classmethod.jp/articles/dgx-spark-gemma4-mtp-multi-token-prediction-bench/

