Tried running DeepSeek V4 Flash-DSpark on 2 DGX Spark units

Tried running DeepSeek V4 Flash-DSpark on 2 DGX Spark units

2026.06.30

This page has been translated by machine translation. View original

Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

In late June 2026, DeepSeek released a model with an unusual name, DeepSeek-V4-Flash-DSpark, on HuggingFace. The base is the same checkpoint as DeepSeek-V4-Flash (284B total / 13B active MoE) released in April, but the filename and config contain dspark as-is.

https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash-DSpark

DeepSeek officially released a model named and optimized for specific hardware called DGX Spark. That alone was personally a bit of interesting news, but the difference from the base version is that it's a DGX Spark-targeted version with an MTP (Multi-Token Prediction) speculative decoding module attached, making it a difference that delves into the internals of inference acceleration, which made me want to try it out even more.

There's also a personal reason behind why I tried it out this time. I've been regularly using NVIDIA LLM Router v3 recently, and when observing routing results across a 9 model pool, I noticed that most of my everyday tasks get routed to around V4 Flash. The experience via Hermes Agent / Codex / Claude Code is running at a quality level that doesn't bother me. Then the DSpark version appeared at just the right time, so I thought "if I can run the same V4 Flash on my local DGX Spark, I won't have to worry about API costs, and it could become a realistic everyday environment," which led me to bring it in right away.

https://dev.classmethod.jp/articles/dgx-spark-nvidia-llm-router-v3/

The prerequisite of two DGX Sparks is certainly not a light one, so I won't say it's a realistic option for everyone. However, for environments where having two units is acceptable, given recent local LLM trends and LLM Router behavior, I think it's a combination worth considering.

Initially I was gathering materials with the plan to run nvidia/DeepSeek-V4-Flash-NVFP4 on 2 nodes. The DSpark version appeared midway, so I swapped it in as the main subject. I've decided to separate the parallel comparison with NVFP4 into a different article, and this article follows through to getting DSpark running on "2 DGX Sparks + vLLM" on its own.

To give the conclusion upfront: when I turned off thinking for a reasoning task, I got decode 55.17 tok/s / MTP acceptance 40.4%, running a 13B active MoE at a reasonably practical speed on 2 DGX Sparks. On the other hand, the realities also emerged: throughput drops by a factor of 1.32x the moment thinking is enabled, and stretching the context to 900K causes decode to hit a ceiling due to attention computation.

Reading DSpark's True Nature from the Config

The first thing I was curious about was "what's different between the base V4 Flash and the DSpark version?" Looking at the file list on HuggingFace, the main body is 48 shards / 167.57 GB total, in the same size range as the base V4 Flash. However, there are directories called encoding/ and inference/, which bundle a DSML (the tool-call markup defined by DeepSeek) encoder and a minimal reference implementation.

The body of the difference is in the last 2 lines of config.json.

excerpt
{
  "model_type": "deepseek_v4",
  "num_hidden_layers": 43,
  "n_routed_experts": 256,
  "num_experts_per_tok": 6,
  "max_position_embeddings": 1048576,
  "rope_scaling": {"type": "yarn", "factor": 16},
  "quantization_config": {
    "quant_method": "fp8",
    "fmt": "e4m3",
    "scale_fmt": "ue8m0",
    "weight_block_size": [128, 128]
  },
  "num_nextn_predict_layers": 1,
  "dspark_target_layer_ids": [40, 41, 42]
}

The key points are the combination of dspark_target_layer_ids: [40, 41, 42] and num_nextn_predict_layers: 1. This is a declaration that the last 3 layers of the 43-layer MoE are to be treated for speculation purposes targeting DGX Spark. MTP doesn't hold a separate draft model but uses the tail layers of the main network directly as prediction layers for speculative decoding, and the draft weights are integrated into the main 48 shards rather than a separate shard.

Quantization is FP8 (E4M3) based + mixed FP4 only for the MoE part, weight_block_size is [128, 128], and scale_fmt is ue8m0, an unusual format. Compressing expert weights down to FP4 reads as tuning to fit the 13B active MoE onto the DGX Spark's unified memory of 128 GB. In practice, specifying --gpu-memory-utilization 0.82 on 2 nodes (TP=2) fit comfortably including KV cache.

The base V4 Flash itself uses a new structure from DeepSeek papers: Hybrid Attention (a combination of Compressed Sparse Attention + Heavily Compressed Attention) + Manifold-Constrained Hyper-Connections. The design limit extends to 1M context (max_position_embeddings: 1048576, yarn factor 16), and reasoning_effort to switch inference depth is also available. The model card notation is 3 levels: non-think / high / max, but via the vLLM API it accepts OpenAI standard none / low / medium / high / xhigh / max, so be aware that sending non-think as-is will return a 400 error. The DSpark version inherits all these features, and since it uses a custom tokenizer extension called encoding_dsv4, --trust-remote-code is required when launching vLLM.

The license is MIT. The position of the DSpark version is that a 13B active MoE with tool calling and reasoning is officially distributed by DeepSeek with DGX Spark explicitly named.

Bringing in the 2-Node Environment and tonyd2wild Recipe

If you try to naively load DSpark's 167 GB weights onto a single DGX Spark, unified memory of 128 GB isn't enough. The practical solution is to distribute across 2 units and launch with TP=2, where inter-node communication determines the cluster's speed.

The configuration this time is a simple setup connecting 2 DGX Sparks (node1 / node2) with a single direct-connect cable through the QSFP port. The QSFP spec is 200 Gbps, but actual measurements with iperf3 showed a ceiling of around 14.9 Gbps for a single stream and 16.4 Gbps for 4 parallel streams. The premise is that we're hitting a ceiling due to kernel streaming capacity and can't utilize the full 200 Gbps. Even so, it's 2 orders of magnitude faster compared to Wi-Fi (hundreds of Mbps) or Tailscale (~90 Mbps), so it's sufficient for TP all-reduce communication. Ping RTT is 0.5~0.7 ms, returning the response you'd expect from a direct link.

Green is QSFP direct connect (the path I want to use for vLLM's TP all-reduce), purple is Wi-Fi, yellow is Tailscale. Since the default route points to the Wi-Fi side, vLLM will try to grab that if nothing is specified. This becomes a trap later, but for now the physical layer is set up this way.

The Story of Going Around in Circles Choosing a Recipe

On the software side, I started with Aiden's Recipe (aidendle94/sparkrun-vllm-ds4-gb10:production-ready). This is an image widely used for running the base DeepSeek-V4-Flash on 2 nodes, with a setup where specifying --speculative-config '{"method":"mtp", ...}' enables MTP.

However, when I fed it the DSpark weights and started with method=mtp, vLLM's load process crashed as follows.

File ".../vllm/models/deepseek_v4/nvidia/mtp.py", line 448, in load_weights
    params_dict[name]
KeyError: ...

It seems the MTP weight naming in the DSpark version doesn't match the MTP loader bundled in the Aiden image. The plain method=mtp path can't read DSpark's MTP module as-is.

So I switched to tonyd2wild's Recipe (tonyd2wild/DeepSeek-v4-Flash-DSpark-60-tok-s-900K-ctx-2x-DGX-Spark). The repository name attacks with the energetic string "60 tok/s / 900K context / 2x DGX Spark," but the contents are a solid build that layers a runtime overlay containing Rafael Caricio's vLLM PR (rafaelcaricio/vllm#1) on top of the base image (ghcr.io/bjk110/vllm-spark:unholy-fusion-prod-ready). With this overlay-included vLLM, you can select a DSpark-specific method --speculative-config '{"method":"dspark", "num_speculative_tokens":5}', and DSpark's MTP weights can be read.

Running build-dspark-vllm-runtime.sh produces vllm-dspark-runtime:clean (22.7 GB) on both nodes. This part went smoothly.

The Trap of vLLM Grabbing Wi-Fi and 3 Essential Patches

Even though the build succeeded, when I started it, vLLM crashed with this error right where TP all-reduce begins.

RuntimeError: ifa != nullptr. Unable to find address for: enP7s7

The Gloo backend is going to look for a non-existent interface called enP7s7 and failing. Tracing the cause, vLLM's multiproc_executor.py was selecting the Wi-Fi side IP (192.168.64.198) of the default route as mq_connect_ip, and then guessing the interface name reachable from there by name.

The DGX Spark interface configuration is as shown in the Mermaid diagram at the beginning, and while I want to use enp1s0f1np1 (QSFP), without explicitly telling vLLM this, it goes to the Wi-Fi side. I applied the following 3 patches to the Recipe.

3 Essential Patches (click to expand)

A. Add to .env.dspark

WORKER_IP=192.168.0.14    # worker QSFP IP (head reuses MASTER_ADDR)
NCCL_IB_GID_INDEX=3         # set to 3 because default 0 stalls on RoCEv2

B. environment: section of docker-compose.dspark.yml

GLOO_SOCKET_IFNAME: '${GLOO_SOCKET_IFNAME:-${NCCL_SOCKET_IFNAME}}'
TP_SOCKET_IFNAME: '${TP_SOCKET_IFNAME:-${NCCL_SOCKET_IFNAME}}'
MN_IF_NAME: '${MN_IF_NAME:-${NCCL_SOCKET_IFNAME}}'
OMPI_MCA_btl_tcp_if_include: '${OMPI_MCA_btl_tcp_if_include:-${NCCL_SOCKET_IFNAME}}'
WORKER_IP: '${WORKER_IP:-}'

C. NODE_RANK-based VLLM_HOST_IP branching at the beginning of command:

if [ "${NODE_RANK:-0}" = "0" ]; then
  export VLLM_HOST_IP="${MASTER_ADDR}"
else
  export VLLM_HOST_IP="${WORKER_IP:-}"
fi;

What's being done here is simply "fixing the interface used for inter-node communication to the QSFP side." Using the 4 variables MN_IF_NAME / GLOO_SOCKET_IFNAME / TP_SOCKET_IFNAME / OMPI_MCA_btl_tcp_if_include, the paths for vLLM's multiproc, Gloo, TP, and OpenMPI are pinned, and VLLM_HOST_IP is branched according to NODE_RANK: head=MASTER_ADDR / worker=WORKER_IP.

After applying this and returning to the build → launch flow, it no longer stalled at TP initialization.

Roughly Grasping Performance with Official Benchmarks

Before measuring speed on actual hardware, I'll get a grasp of V4 Flash's actual capability from the official numbers. Since the DSpark version shares the same checkpoint as the base V4 Flash, the benchmark quality numbers for V4 Flash apply directly.

The numbers for coding and long-context understanding for V4 Flash Max (Think Max mode) published by DeepSeek are as follows.

Benchmark V4-Flash Max Gemini 3.1 Pro Opus 4.6 Max
LiveCodeBench 91.6 91.7 88.8
Codeforces Rating 3052 3052
SWE Verified 79.0
MRCR 1M 78.7
Terminal Bench 2.0 56.9

Source: deepseek-ai/DeepSeek-V4-Flash model card

LiveCodeBench is on par with Gemini 3.1 Pro, and Codeforces rating 3052 is also aligned. Combined with SWE Verified 79.0 and MRCR 1M 78.7, these look like quite aggressive numbers for a model spec of Active 13B MoE with weights distributed under MIT.

The DSpark acceleration adds on top of this. According to paper §5.4 and the official README, in DeepSeek's official serving environment with DSpark-5 (γ=5 + Markov head) enabled, the claim is that per-user generation speed improves by +60~85% relative to the MTP-1 baseline. One purpose of running it on 2 DGX Spark nodes this time is to verify how much of this "faster with the same weights" can be felt in practice locally.

Breakdown of Why It Took 19 Minutes to Start with vLLM

After applying the patches and running docker run in the order worker → head, it took about 19 minutes for the API to respond on port 8888. Tracing the logs, the breakdown of wait time was as follows.

The 48 shards load of 162 seconds and mHC kernel warmup of 24 seconds were within the expected range, and the remaining 15+ minutes were almost entirely consumed by sparse MLA autotune, TileLang JIT, and FlashInfer autotune. This is the step where optimal parameters are searched for on actual hardware for DGX Spark's sm_121a (GB10), which unavoidably makes the first launch heavy.

Fortunately the results are persisted under ~/.cache/huggingface on the host side, so subsequent launches take about 7 minutes. Since the basic operation for use cases like continuously hitting the API from Hermes Agent or Codex is to start once and keep it running, you just need to brace yourself for the first time only, and it won't become an operational burden.

Here, finally, curl http://127.0.0.1:8888/v1/models returns DeepSeek-V4-Flash-DSpark, and a light math smoke test also passed with a <think> block. The max_model_len is tonyd2wild's default of 262,144 (262K), which is just right for Hermes' constant 256K context usage.

Measuring Decode tok/s on Short to Medium Context

From here I'll look at actual hardware speed. To compare MTP effectiveness, I prepared 3 types of prompts.

  • medium: A natural language question of about 200 characters (general conversation/knowledge questions)
  • long: A natural language explanation request of about 500 characters
  • reasoning_code: A short 42-token code generation request saying "Write Python code to solve a maze using BFS"

For each, I measured decode tok/s in 2 modes: using (thinking ON) / not using (thinking OFF) the <think> block. The context is the prod 262K setting (max_num_seqs=1, single user).

Prompt thinking ON thinking OFF speed-up
medium (natural language 200 chars) 35.44 35.00 0.99x
long (natural language 500 chars) 37.81 39.05 1.03x
reasoning_code (BFS Python) 41.90 55.17 1.32x

The unit for numbers is tok/s (output token speed during the decode phase). For medium / long, there's almost no difference between thinking ON and OFF. 35~39 tok/s is the straightforward capability when running a 13B active MoE on a 2-node DGX Spark with unified memory 128 GB × 2, which falls into practical speed range as a local LLM. For use cases like normal question answering from Hermes Agent or Codex, this becomes the baseline.

However, reasoning_code behaves differently. It jumps from thinking ON's 41.90 tok/s to 55.17 tok/s when thinking is turned OFF. This is the same prompt, just switching reasoning_effort=. TTFT (time to first token) was about 0.6~2.1 seconds for short prompts and 4.7~8.8 seconds for medium-length prompts, moving proportionally to the weight of prefilling prompt tokens all at once.

What's interesting here is "why does only reasoning_code improve so much when thinking is turned off?" Normally you might think that since thinking ON adds a <think> block the output length increases and it slows down, but this directly relates to MTP acceptance rate, which is the core of the next chapter.

Note that tonyd2wild's README shows 62.48 tok/s using code_completion-type 512-token prompts in the same DSpark configuration. Since the conditions aren't aligned, a direct comparison isn't possible, but it seems quite possible to reach this range if the prompt structure is tuned to be draft-friendly. I'll leave this as a topic for a separate article.

For reference, the median throughput (aggregated over the past 30 minutes) for major providers pulling the same V4 Flash via OpenRouter is around Baidu 66 / Fireworks 63 / Alibaba 62 / DeepSeek official 61 / SiliconFlow 61 tok/s. However, V4 Flash on OpenRouter has reasoning_effort defaulting to high, and specifying low / off is not supported. Comparing thinking ON vs. thinking ON: DSpark's 41 tok/s vs. the cloud top tier at just over 60 tok/s, so there's still a gap there.

Note that the thinking OFF 55 tok/s figure is based on the premise of directly hitting curl with reasoning_effort=none. The vLLM DSpark implementation in this mode sets content to null and puts the final output in the reasoning field, so it appears as an empty response to OpenAI-compatible clients (Hermes Agent / Codex / Claude Code etc. that read message.content). When using it regularly via Agent, it's currently safest to consider the effective value to be thinking ON's 41 tok/s.

Looking at How MTP Works with Thinking On / Off

This is the core of the article. Why does decode jump 1.32x when reasoning_code turns off thinking? Digging in, it becomes clear that DSpark's MTP (Multi-Token Prediction) has the property of being "hard to predict drafts for" when it comes to reasoning traces containing <think> blocks.

vLLM's /metrics endpoint outputs speculative decoding stats for DSpark (num_drafts_total / num_draft_tokens_total / num_accepted_tokens_total / num_accepted_tokens_per_pos_total). I ran reasoning_code one sample at a time with thinking ON / OFF and took the difference between before and after (right after launch vs. after completion).

drafts accepted/draft (mean) per-token acceptance
thinking ON 2124 1.21 24.2%
thinking OFF 1731 2.02 40.4%

Turning off thinking improves the average number of accepted tokens per draft by +67% from 1.21 → 2.02, and per-token acceptance jumps from 24.2% → 40.4%. This directly shows that DSpark's draft (the tail 3 layers of the main network + Markov head) is poor at predicting text like reasoning traces where "what comes next shifts every step." Conversely, for structured text like code itself, drafts are easier to hit, and even with γ=5 static draft, a large acceptance can be obtained.

DSpark paper §1 also states "Math/code naturally sustain higher acceptance rates," and reasoning trace portions are not included in that — which is a fact confirmed in practice this time. "For tasks where reasoning is not genuinely needed, turning off thinking actually lets you benefit more from MTP" is a fairly useful guideline for operational decision-making.

Note that the DSpark running in vLLM is not the complete version from the paper; the dynamic verification scheduler proposed in paper §3.2 is off (VLLM_DSPARK_CONFIDENCE_SCHEDULER=off), and only the Markov head + γ=5 static draft portion is effective. DeepSeek's official serving claim of "+60~85% vs. MTP-1 baseline" includes this scheduler, so if the scheduler is ported to vLLM, there is room for further improvement from here.

What Happens When You Naively Extend Long Context

DSpark is designed to handle up to 1M tokens. Changing the input context to 64K / 128K / 256K while keeping the prod 262K setting, measuring decode tok/s gives the following shape.

Input context decode tok/s (estimated)
64K 8.7
128K 8.4
256K 3.2

Compared to the 30~55 tok/s from short prompts in chapter 6, decode drops in two stages just from expanding the input. 64K and 128K are almost flat, but at 256K it drops a level.

From here, after increasing max_model_len to 900K and restarting, loading a long text of about 247K tokens with "write the next Python code" and generating 500 tokens, the total elapsed time was about 4 minutes 32 seconds with decode around 2.4 tok/s. Even so, DSpark's accept rate was 46.8%, not dropping from the short-text value (40.4%), and the generated code reflected the function names and signatures from the input context, confirming that draft quality and output quality are maintained even in long context. The reason decode drops is that attention computation gets heavier with context length — it's not a matter of quality breaking down and becoming unusable.

So personally, for use cases like Hermes Agent or Codex where context is kept continuously without cutting, I think the most practical operation is to set max_model_len=262144 (262K) while keeping the actual context per turn to a few K to a few tens of K. The impression is that cases where using 900K is beneficial are limited to long single-shot summarization or code reading.

Summary

I brought DeepSeek's officially DGX Spark-named model DeepSeek-V4-Flash-DSpark onto 2 DGX Sparks using the tonyd2wild Recipe combined with custom patches, and was able to see on actual hardware 55.17 tok/s / MTP acceptance 40.4% for reasoning code generation.

As mentioned at the beginning, the recent personal feeling when regularly using LLM Router v3 is that many everyday tasks get routed to around V4 Flash. The speed range seen this time, as also mentioned at the end of chapter 6, is approaching a line that can be compared to OpenRouter's top providers (around 60 tok/s for thinking ON). The realistic value for Agent-based operation is 41 tok/s on the thinking ON side, and while you can draw out 55 tok/s thinking OFF by directly hitting curl, it's perhaps just one step short of being a go-to everyday client. Still, the idea of replacing the path routed to V4 Flash by LLM Router v3 with the DSpark side is close enough to be fully within consideration.

There are moments where the speed feels a bit slower compared to being accustomed to cloud API responses, but it's bearable for regular use. There's no question that the hurdle of having 2 units isn't light, so it's not a configuration I can recommend to everyone, but if the environment is in place, it occupies a quite interesting position.


国内企業 AI活用実態調査2026 配布中

クラスメソッドが独自に行なったAI診断調査をもとに、企業のAI活用の現在地を調査レポートとしてまとめました。企業規模別の活用度傾向に加え、規模を超えてAI活用を進める企業に共通する取り組みまで、自社の現在地を捉えるためのヒントにぜひ。

国内企業 AI活用実態調査2026

無料でダウンロードする

Share this article