Tried running DeepSeek V4 Flash-DSpark on 2 DGX Spark units

DeepSeek's official team optimized `DeepSeek-V4-Flash-DSpark` for DGX Spark, and I tried running it with vLLM across two DGX Spark units. For short code generation tasks, we achieved a decode speed of 55 tok/s and a 40% MTP acceptance rate, and it became clear that disabling thinking can further speed up the same model.

森茂洋 / Hiroshi Morishige

2026.06.30

This page has been translated by machine translation. View original

 IntroductionHello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
In late June 2026, DeepSeek released a model with a somewhat unusual name, DeepSeek-V4-Flash-DSpark, on HuggingFace. The base is the same checkpoint as DeepSeek-V4-Flash (284B total / 13B active MoE) released in April, but the filename and config contain dspark as-is.
https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark
DeepSeek officially released a model explicitly named and optimized for specific hardware called DGX Spark. This alone was personally a somewhat notable piece of news for me, but the difference from the base version turned out to be a DGX Spark-targeted version with an MTP (Multi-Token Prediction) speculative decoding module attached — a deeper difference touching on inference acceleration internals — which made me want to try it even more.
There's also a personal reason behind why I tried it this time. I've been using NVIDIA LLM Router v3 regularly lately, and observing the routing results across a 9 model pool, I noticed that most of my everyday tasks get routed to around V4 Flash. The usability via Hermes Agent / Codex / Claude Code is good enough quality-wise. Then the DSpark version appeared at just the right time, so I thought "if I can run the same V4 Flash on my own DGX Spark without worrying about API costs, it could become a realistic everyday environment," and that's the flow of how I brought it in.
https://dev.classmethod.jp/articles/dgx-spark-nvidia-llm-router-v3/
The prerequisite of two DGX Sparks is certainly not light, so I won't say it's a realistic option for everyone. However, if you're in an environment where owning two is acceptable, given the recent local LLM landscape and LLM Router behavior, I think it's a combination well worth considering.
Initially I was gathering materials with the plan of running nvidia/DeepSeek-V4-Flash-NVFP4 across 2 nodes. The DSpark version appeared midway, so I swapped it in as the main subject. I've decided to split the parallel comparison with NVFP4 into a separate article, and this article focuses solely on getting DSpark running on "2 DGX Sparks + vLLM."
To state the conclusion upfront: with reasoning tasks and thinking disabled, I got decode 55.17 tok/s / MTP acceptance 40.4%, and was able to run a 13B active MoE at a reasonably practical speed on 2 DGX Sparks. On the other hand, I also saw real-world limitations: throughput drops by 1.32x the moment thinking is enabled, and extending context to 900K causes decode to hit a ceiling due to attention computation.
!The numbers in this article are actual measurements from my own 2-node DGX Spark environment (GB10, unified memory 128 GB × 2, QSFP direct connection). Since I applied modification patches to tonyd2wild's official Recipe for our environment, there are differences from the Recipe's original quoted values. If the implementation doesn't work or you find errors, I'd be happy if you could provide feedback in the comments.
 Reading DSpark's Identity from the ConfigFirst, what I wanted to know was "what's different between the base V4 Flash and the DSpark version?" Looking at the HuggingFace file list, the main body is 48 shards / 167.57 GB total, in the same size range as the base V4 Flash. However, there are directories called encoding/ and inference/, containing a DSML (a tool-call markup defined by DeepSeek) encoder and a minimal reference implementation.
The core of the difference is in the last 2 lines of config.json.
excerpt
{
  "model_type": "deepseek_v4",
  "num_hidden_layers": 43,
  "n_routed_experts": 256,
  "num_experts_per_tok": 6,
  "max_position_embeddings": 1048576,
  "rope_scaling": {"type": "yarn", "factor": 16},
  "quantization_config": {
    "quant_method": "fp8",
    "fmt": "e4m3",
    "scale_fmt": "ue8m0",
    "weight_block_size": [128, 128]
  },
  "num_nextn_predict_layers": 1,
  "dspark_target_layer_ids": [40, 41, 42]
}
The key point is the combination of dspark_target_layer_ids: [40, 41, 42] and num_nextn_predict_layers: 1. This is a declaration that among the 43-layer MoE, the last 3 layers are treated as speculation layers targeting DGX Spark. MTP doesn't hold a separate draft model; instead, it's designed to repurpose the tail layers of the main network directly as speculative decoding prediction layers, and the draft weights are already integrated into the main body's 48 shards rather than a separate shard.
Quantization is FP8 (E4M3) base + FP4 only for the MoE portion mixed, weight_block_size is [128, 128], and scale_fmt uses the unusual ue8m0 format. This reads as tuning to compress expert weights down to FP4 to fit the 13B active MoE into the DGX Spark's unified memory of 128 GB. In practice, with 2 nodes (TP=2) and --gpu-memory-utilization 0.82 specified, everything including KV cache fit comfortably.
The base V4 Flash itself uses a new structure that DeepSeek has published in papers: Hybrid Attention (a combination of Compressed Sparse Attention + Heavily Compressed Attention) + Manifold-Constrained Hyper-Connections. The design upper limit extends to 1M context (max_position_embeddings: 1048576, yarn factor 16), and reasoning_effort to switch inference depth is also available. The model card notation is 3 levels: non-think / high / max, but the vLLM API accepts the OpenAI standard none / low / medium / high / xhigh / max, so note that passing non-think directly returns a 400 error. The DSpark version inherits all these features, and since it uses a custom tokenizer extension called encoding_dsv4, --trust-remote-code is required when starting vLLM.
The license is MIT. The DSpark version's position is that a 13B active MoE with tool calling and reasoning is officially distributed by DeepSeek explicitly targeting DGX Spark.
 Bringing in the 2-Node Environment and tonyd2wild RecipeIf you try to load DSpark's 167 GB weights onto a single DGX Spark, the 128 GB unified memory isn't enough. The practical solution is to distribute across 2 machines and start with TP=2, and the inter-node communication determines the cluster's speed.
This time's configuration is a simple setup with just one QSFP direct-connect cable between 2 DGX Sparks (node1 / node2). The QSFP spec is 200 Gbps, but actual measurements with iperf3 showed an upper limit of around 14.9 Gbps for a single stream and 16.4 Gbps for 4 parallel streams. The kernel's streaming capacity is the bottleneck, and the full 200 Gbps can't be utilized — that's the premise. Even so, it's 2 orders of magnitude faster than via Wi-Fi (hundreds of Mbps) or Tailscale (~90 Mbps), so it's more than sufficient for TP all-reduce communication. The ping RTT is 0.5~0.7 ms, giving responses typical of a direct link.
Green is QSFP direct (the route I want vLLM's TP all-reduce to use), purple is Wi-Fi, and yellow is Tailscale. Since the default route points to the Wi-Fi side, vLLM will try to grab that if nothing is specified. This becomes a trap later, but for now the physical layer is set up this way.
 The Story of Going Around in Circles Choosing a RecipeOn the software side, I started with Aiden's Recipe (aidendle94/sparkrun-vllm-ds4-gb10:production-ready). This is a widely-used image for running the base DeepSeek-V4-Flash across 2 nodes, built so that specifying --speculative-config '{"method":"mtp", ...}' enables MTP.
However, when I fed it DSpark weights and started with method=mtp, the vLLM load process crashed like this:
File ".../vllm/models/deepseek_v4/nvidia/mtp.py", line 448, in load_weights
    params_dict[name]
KeyError: ...
It seems the MTP weight naming in the DSpark version and the MTP loader bundled in the Aiden image don't match. The plain method=mtp route can't read DSpark's MTP module as-is.
So I switched to tonyd2wild's Recipe (tonyd2wild/DeepSeek-v4-Flash-DSpark-60-tok-s-900K-ctx-2x-DGX-Spark). The repository name aggressively goes with the string "60 tok/s / 900K context / 2x DGX Spark," but the content was a solid build: a runtime overlay including Rafael Caricio's vLLM PR (rafaelcaricio/vllm#1) layered on top of a base image (ghcr.io/bjk110/vllm-spark:unholy-fusion-prod-ready). With this overlay-included vLLM, you can choose the DSpark-specific method --speculative-config '{"method":"dspark", "num_speculative_tokens":5}', and it can read DSpark's MTP weights.
Running build-dspark-vllm-runtime.sh produces vllm-dspark-runtime:clean (22.7 GB) on both nodes. This part went smoothly.
 The Wi-Fi Trap vLLM Falls Into and 3 Essential PatchesAfter building, starting vLLM caused it to crash like this when the TP all-reduce began:
RuntimeError: ifa != nullptr. Unable to find address for: enP7s7
The Gloo backend is looking for an interface called enP7s7 that doesn't exist and failing. Tracing the cause, vLLM's multiproc_executor.py was selecting the Wi-Fi side IP (192.168.64.198) of the default route as mq_connect_ip, and from there was guessing the interface name to reach, which was the problem.
As shown in the Mermaid diagram at the beginning, the DGX Spark's interface configuration means I want to use enp1s0f1np1 (QSFP), but without explicitly telling vLLM this, it goes to the Wi-Fi side. I applied the following 3 patches to the Recipe.
3 Essential Patches (click to expand)A. Add to .env.dspark
WORKER_IP=192.168.0.14    # worker QSFP IP (head reuses MASTER_ADDR)
NCCL_IB_GID_INDEX=3         # default 0 stalls with RoCEv2, so set to 3
B. environment: section in docker-compose.dspark.yml
GLOO_SOCKET_IFNAME: '${GLOO_SOCKET_IFNAME:-${NCCL_SOCKET_IFNAME}}'
TP_SOCKET_IFNAME: '${TP_SOCKET_IFNAME:-${NCCL_SOCKET_IFNAME}}'
MN_IF_NAME: '${MN_IF_NAME:-${NCCL_SOCKET_IFNAME}}'
OMPI_MCA_btl_tcp_if_include: '${OMPI_MCA_btl_tcp_if_include:-${NCCL_SOCKET_IFNAME}}'
WORKER_IP: '${WORKER_IP:-}'
C. NODE_RANK-based VLLM_HOST_IP branching at the start of command:
if [ "${NODE_RANK:-0}" = "0" ]; then
  export VLLM_HOST_IP="${MASTER_ADDR}"
else
  export VLLM_HOST_IP="${WORKER_IP:-}"
fi;
What this does, in the end, is simply "fix the interface used for inter-node communication to the QSFP side." Using the 4 variables MN_IF_NAME / GLOO_SOCKET_IFNAME / TP_SOCKET_IFNAME / OMPI_MCA_btl_tcp_if_include, it pins the routes for vLLM's multiproc, Gloo, TP, and OpenMPI respectively, and branches VLLM_HOST_IP based on NODE_RANK to head=MASTER_ADDR / worker=WORKER_IP.
After applying these and going back through the build → start flow, it stopped stalling at the TP initialization step.
 Getting a Rough Sense of the Capability with Official BenchmarksBefore measuring speed on actual hardware, I'll get a sense of V4 Flash's baseline capability from official numbers. Since the DSpark version uses the same checkpoint as the base V4 Flash, the benchmark quality numbers from V4 Flash apply directly.
The coding and long-context comprehension numbers for V4 Flash Max (Think Max mode) published by DeepSeek are as follows:


Benchmark
V4-Flash Max
Gemini 3.1 Pro
Opus 4.6 Max


LiveCodeBench
91.6
91.7
88.8

Codeforces Rating
3052
3052
—

SWE Verified
79.0
—
—

MRCR 1M
78.7
—
—

Terminal Bench 2.0
56.9
—
—

Source: deepseek-ai/DeepSeek-V4-Flash model card
LiveCodeBench is on par with Gemini 3.1 Pro, and the Codeforces rating of 3052 matches as well. Looking at SWE Verified 79.0 and MRCR 1M 78.7 together, these look like remarkably aggressive numbers for a model spec of Active 13B MoE with weights distributed under MIT.
The DSpark acceleration is added on top of this. According to the paper §5.4 and the official README, in DeepSeek's official serving environment with DSpark-5 (γ=5 + Markov head) enabled, the claim is that per-user generation speed is +60~85% faster than the MTP-1 baseline. One purpose of running this on 2 DGX Spark nodes is to verify how much of this "faster with the same weights" can actually be felt locally.
 Breakdown of Why It Took 19 Minutes to Start in vLLMApplying the patches and running docker in order of worker → head, it took about 19 minutes until the API responded on port 8888. Tracing the logs revealed the wait time broke down as follows:
Loading 48 shards at 162 seconds and mHC kernel warmup at 24 seconds were within expected range, and the remaining 15+ minutes were almost entirely consumed by sparse MLA autotune, TileLang JIT, and FlashInfer autotune. These are steps that search for the fastest parameters on the actual hardware targeting DGX Spark's sm_121a (GB10), and the first startup inevitably takes a long time.
Fortunately, the results are persisted under ~/.cache/huggingface on the host side, so subsequent startups take around 7 minutes. For operations where you're constantly hitting the API from Hermes Agent or Codex, the basic approach is to start once and keep it running, so as long as you brace for the first time, it won't be an operational burden.
Here, curl http://127.0.0.1:8888/v1/models finally returned DeepSeek-V4-Flash-DSpark, and a light math smoke test also passed with a <think> block. The max_model_len is tonyd2wild's default of 262,144 (262K), which is just right for Hermes's 256K context regular use.
 Measuring Decode tok/s for Short to Medium ContextFrom here I'll look at actual hardware speeds. To compare MTP effectiveness, I prepared 3 types of prompts:
medium: a natural language question of about 200 characters (general chat / knowledge questions)
long: a natural language explanation request of about 500 characters
reasoning_code: a short 42-token code generation request "write Python code to solve a maze with BFS"
For each, I measured decode tok/s for 2 variations: using the <think> block (thinking ON) / not using it (thinking OFF). The context is the prod 262K setting (max_num_seqs=1, single user).


Prompt
thinking ON
thinking OFF
speed-up


medium (natural language 200c)
35.44
35.00
0.99x

long (natural language 500c)
37.81
39.05
1.03x

reasoning_code (BFS Python)
41.90
55.17
1.32x

The unit of the numbers is tok/s (output token speed during the decode phase). For medium / long, there's almost no difference with or without thinking. 35~39 tok/s is the straightforward capability when running a 13B active MoE on DGX Sparks with unified memory 128 GB × 2 nodes, and it falls within the practical range for a local LLM. For use cases like sending ordinary Q&A from Hermes Agent or Codex, this becomes the baseline.
On the other hand, reasoning_code is different. It jumps from 41.90 tok/s with thinking ON to 55.17 tok/s when thinking is turned OFF. With the same prompt, just switching reasoning_effort= makes this difference. TTFT (time until the first token returns) was around 0.6~2.1 seconds for short prompts and 4.7~8.8 seconds for medium prompts, which moved in proportion to the weight of prefilling prompt tokens all at once.
What's interesting here is "why does only reasoning_code improve so much with thinking off?" If you think about it normally, it might seem like thinking ON should just be slower because of the longer output with the <think> block, but this is directly tied to the MTP acceptance rate story, which becomes the core of the next chapter.
Note that tonyd2wild's README shows 62.48 tok/s using code_completion-type 512-token prompts under the same DSpark configuration. Since the conditions aren't the same, direct comparison isn't possible, but there seems to be sufficient potential to reach this range if the prompt structure is aligned toward draft-friendly patterns. I'll leave this as a topic for a separate article.
For reference, the median throughput (aggregated over the past 30 minutes) for major providers pulling the same V4 Flash via OpenRouter lines up at around Baidu 66 / Fireworks 63 / Alibaba 62 / DeepSeek official 61 / SiliconFlow 61 tok/s. However, OpenRouter's V4 Flash defaults reasoning_effort to high, and low / off specifications are not supported. Comparing thinking ON to thinking ON, DSpark's 41 tok/s vs. cloud top-tier's ~60+ tok/s still shows a gap.
Note that the 55 tok/s with thinking OFF is a number based on directly calling curl with reasoning_effort=none. vLLM's DSpark implementation in this mode sets content to null and puts the final output in the reasoning field, so it looks like an empty response to OpenAI-compatible clients (those that read message.content like Hermes Agent / Codex / Claude Code etc.). For everyday use via Agent, the current situation means thinking ON's 41 tok/s is the effective value, so it's safer to think of it that way.
 Looking at How MTP Works with Thinking On / OffHere's the core of this article. Why does decode speed increase by 1.32x for reasoning_code when thinking is disabled? Digging deeper reveals that DSpark's MTP (Multi-Token Prediction) has a property of being "difficult to draft" for reasoning traces containing <think> blocks.
vLLM's /metrics endpoint outputs DSpark-specific speculative decoding stats (num_drafts_total / num_draft_tokens_total / num_accepted_tokens_total / num_accepted_tokens_per_pos_total). I ran reasoning_code with thinking ON / OFF one sample each, and took the difference between before (start) and after (completion) states.



drafts
accepted/draft (mean)
per-token acceptance


thinking ON
2124
1.21
24.2%

thinking OFF
1731
2.02
40.4%

Turning off thinking improves the average number of tokens accepted per draft from 1.21 → 2.02, a +67% improvement, and per-token acceptance jumps from 24.2% → 40.4%. This directly shows that DSpark's draft (the last 3 layers of the main network + Markov head) is poor at predicting "text where what comes next fluctuates every step" like reasoning traces. Conversely, for strings with fixed structure like actual code, drafts are easier to get right, and acceptance can be captured significantly even with γ=5 static draft.
The DSpark paper §1 also states "Math/code naturally sustain higher acceptance rates," and the reasoning trace portion is not that way — which is what I confirmed with my own hands this time. "For tasks that don't truly need reasoning, turning off thinking actually yields greater benefits from MTP" is a fairly useful operational distinction.
Note that the DSpark running in vLLM is not the complete version from the paper; the dynamic verification scheduler proposed in paper §3.2 is off (VLLM_DSPARK_CONFIDENCE_SCHEDULER=off), and only the Markov head + γ=5 static draft portion is in effect. DeepSeek's official serving's "+60~85% vs MTP-1 baseline" is a number including this scheduler, so if the scheduler gets ported to vLLM, there's still room to grow from here.
 What Happens When You Simply Extend Long ContextDSpark is a model designed to handle up to 1M tokens. Changing the input context to 64K / 128K / 256K while keeping the prod 262K setting and measuring decode tok/s results in the following shape:


Input context
decode tok/s (estimated)


64K
8.7

128K
8.4

256K
3.2

Compared to the 30~55 tok/s from short prompts in Chapter 6, decode falls in two stages just from expanding the input. 64K and 128K are roughly flat, but 256K drops another level.
From here, increasing max_model_len to 900K and restarting, then sending a long text of about 247K tokens with "write the next Python code" for 500 token output took about 4 minutes 32 seconds total, with decode around 2.4 tok/s. Even so, DSpark's accept rate was 46.8%, not dropping from the short-text case (40.4%), and the output code incorporated function names and signatures from the input context, confirming that draft quality and output quality are maintained even for long context. The reason decode drops is because attention computation gets heavier proportional to context length — it's not a case of quality breaking down to the point of unusability.
So personally, for use cases like keeping Hermes Agent or Codex running without cutting context, I think the most realistic operation is to set max_model_len=262144 (262K) while keeping the actual context per turn to a few K to tens of K. The impression is that use cases where 900K is beneficial are confined to long-form one-shot summarization or code reading.
 SummaryI brought DeepSeek-V4-Flash-DSpark, DeepSeek's official DGX Spark-named model, to 2 DGX Sparks using the tonyd2wild Recipe with custom patches, and was able to see 55.17 tok/s / MTP acceptance 40.4% on actual hardware for reasoning code generation.
As mentioned at the beginning, my recent hands-on experience with LLM Router v3 was that many everyday tasks get routed to around V4 Flash. As mentioned at the end of Chapter 6, the speed range seen this time is getting close to a level where comparison with OpenRouter's top providers (60 tok/s range with thinking ON) is possible. The realistic value for Agent-based operation is 41 tok/s on the thinking ON side, and while 55 tok/s with thinking OFF is achievable by directly calling curl, it's still just short of what I'd want for everyday client use. Even so, the idea of replacing the route being directed to V4 Flash by LLM Router v3 directly with the DSpark side is well within realistic range.
Coming from being used to cloud API response speeds, there are moments where the speed feels a bit lacking, though it's at a level where regular use isn't impossible. The hurdle of owning two units is certainly not trivial, so it's not a configuration I'd recommend to everyone, but if the environment is available, it's quite an interesting position to be in.
 Reference Linksdeepseek-ai/DeepSeek-V4-Flash-DSpark (HF)
DSpark paper arXiv 2606.19348
tonyd2wild/DeepSeek-v4-Flash-DSpark-60-tok-s-900K-ctx-2x-DGX-Spark
rafaelcaricio/vllm#1 (DSpark method PR)
NVIDIA DGX Spark Forum #374742 (2x DGX Spark Recipe)
Flowtivity blog (DeepSeek V4 Flash 1M context dual DGX Spark)

Tried running DeepSeek V4 Flash-DSpark on 2 DGX Spark units

Introduction

Reading DSpark's Identity from the Config

Bringing in the 2-Node Environment and tonyd2wild Recipe

The Story of Going Around in Circles Choosing a Recipe

The Wi-Fi Trap vLLM Falls Into and 3 Essential Patches

Getting a Rough Sense of the Capability with Official Benchmarks

Breakdown of Why It Took 19 Minutes to Start in vLLM

Measuring Decode tok/s for Short to Medium Context

Looking at How MTP Works with Thinking On / Off

What Happens When You Simply Extend Long Context

Summary

Reference Links

AI白書2026 配布中

AWS Topics

Trending Topics

Products & Services

Features and Series

Benchmark	V4-Flash Max	Gemini 3.1 Pro	Opus 4.6 Max
LiveCodeBench	91.6	91.7	88.8
Codeforces Rating	3052	3052	—
SWE Verified	79.0	—	—
MRCR 1M	78.7	—	—
Terminal Bench 2.0	56.9	—	—

Prompt	thinking ON	thinking OFF	speed-up
medium (natural language 200c)	35.44	35.00	0.99x
long (natural language 500c)	37.81	39.05	1.03x
reasoning_code (BFS Python)	41.90	55.17	1.32x

	drafts	accepted/draft (mean)	per-token acceptance
thinking ON	2124	1.21	24.2%
thinking OFF	1731	2.02	40.4%