I tried retraining the NVIDIA LLM Router to match my own persona (Training Edition)

I've been using Claude Codex for everything from heavy design discussions to light brainstorming, and I built my own checkpoint for NVIDIA LLM Router v3 from scratch. Here's how I achieved up to 99.3% cost reduction on light-to-medium questions while maintaining quality, using the new 9-model pool.

森茂洋 / Hiroshi Morishige

2026.06.21

This page has been translated by machine translation. View original

 IntroductionHello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
After trying out NVIDIA LLM Router v3 in the basics article, I've been using the default checkpoint for my own work for a while. The standard distribution includes a 9-model general-purpose pool for Nemotron / GPT-OSS / Qwen / GPT-5 / Opus series (v1-9models-qwen08b.yaml), and routing judgment works fine with just this.
https://dev.classmethod.jp/articles/dgx-spark-nvidia-llm-router-v3/
After tinkering with it for a while, what started to bother me was that this default pool was "slightly off from my preferences." The latest generation Opus 4.8 / Sonnet 4.6 / Gemini 3.5 Flash weren't included yet, and the lineup didn't align well with the emerging models I typically want to call (DeepSeek V4 / Qwen 3.7 / Kimi K2.6 / GLM 4.7, etc.). The MLP training data also seems to be built for general-purpose use, so it didn't appear to be tuned for my usage patterns (which tend to split between heavy design discussions and light brainstorming at both extremes).
Fortunately, NVIDIA LLM Router v3 officially provides resources to rebuild your own checkpoint through the three steps of collect → train → evaluate. By reorganizing the pool YAML to your liking and converting your usual questions into training data, you can build routing from scratch tailored to that person. Since the framework is already in place, let's rebuild it once to match my own persona — that's the starting point for this article.
Specifically, I took my persona (how I normally use Claude Code / Codex, blog articles I'm writing, design discussions around Hermes Agent and NemoClaw, etc.) and distilled it into 480 training questions, then rebuilt the checkpoint from scratch for a new 9-model pool (including the latest Opus 4.8 / Sonnet 4.6 / Gemini 3.5 Flash). By designing toward quality-preserving routing (heavy questions properly go to Opus, light questions go to cheaper models), I arrived at a configuration that raises Opus 4.8's adoption rate to 43.1% while demonstrating 98–99% cost reduction (up to 99.3%) for light-to-medium questions.
On the topic of LLM cost optimization, there are approaches like task-type-based rule routing as covered in MindStudio's "Run Local AI Models with Claude Code" article. The reason I chose NVIDIA LLM Router v3 this time is that I wanted to train a Qwen3.5-0.8B encoder + PCA + MLP on the quality×price tradeoff within the same task type (even for the same code generation, deep dependency tasks go to Opus while light one-liners go to gpt-oss-120b). If you just want to offload auxiliary tasks locally, rule routing is sufficient, so I think using them based on purpose is the pragmatic approach. This article focuses specifically on the use case of "someone who has been using Claude Code with Opus as the default, and wants to switch to routing without sacrificing quality."
 Pool Redesign Policy Lessons from the Failed 5-Model ConfigurationIn the initial 5-model pool I built (Local + gpt-oss-120b + Kimi K2.7 + GLM 5.2 + Opus 4.6), feeding in 5 use cases caused all 4 non-lightweight cases to concentrate on gpt-oss-120b. Looking at the cause, the pricing ladder was uneven — there was a 14x cost cliff between gpt-oss-120b ($0.05 / $0.21) and the next most expensive kimi ($0.74 / $3.50). The AUC difference versus Opus was only about 5 points, and the MLP straightforwardly learned that "gpt-oss-120b is sufficient for mid-range workloads" when accounting for economics.
This is a subtly impactful trap when building an LLM Router — when the cost difference between adjacent models is too large, the MLP tends to favor the cheaper side. I learned from experience here that the basic practice for building quality-preserving routing is to maintain continuity with increments of about 1.5–3x.
 The 9-Model LadderThe final configuration I rebuilt is as follows.


Slot
Model
OpenRouter slug
output $/M
Adjacent ratio
Role


1
Nemotron 3 Nano 30B-A3B (Local) ⁽*⁾
openrouter/nvidia/nemotron-3-nano-30b-a3b
0 ⁽*⁾
—
Local free anchor, lightweight reasoning

2
DeepSeek V4-Flash
deepseek/deepseek-v4-flash
0.18
—
Lightweight general (emerging 1 / code / long ctx 1M)

3
GLM 4.7 Flash
z-ai/glm-4.7-flash
0.40
2.22x
Lightweight (emerging 2 / Chinese accuracy / 202k ctx)

4
DeepSeek V4-Pro
deepseek/deepseek-v4-pro
0.87
2.18x
Mid-weight reasoning (emerging 3, cost-effective / long ctx 1M)

5
Qwen 3.7-Plus
qwen/qwen3.7-plus
1.28
1.47x
Mid-weight general (emerging 4 / latest gen / long ctx 1M)

6
Kimi K2.6
moonshotai/kimi-k2.6
3.50
2.73x
Mid-heavy general (emerging 5, high versatility)

7
Gemini 3.5 Flash
google/gemini-3.5-flash
9.00
2.57x
Mid-heavy reasoning + multimodal dedicated path

8
Claude Sonnet 4.6
anthropic/claude-sonnet-4.6
15.00
1.67x
Heavy quality middle, bridge to Opus

9
Claude Opus 4.8
anthropic/claude-opus-4.8
25.00
1.67x
Heaviest, quality anchor

⁽*⁾ Slot 1 is running on OpenRouter Nano paid version ($0.05 / $0.20) to improve reproducibility across this series, but the cost_per_m_*_tokens in the pool YAML is kept at 0. Those who have an environment where they can run models locally can switch Slot 1 to local operation by replacing litellm_model with openai/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 + api_base: http://localhost:8000/v1. The MLP label stays as local-nemotron-3-nano, and routing judgment can be reused as-is without re-training.
Note that slug notation varies by provider (Anthropic / Google / Moonshot / Z.AI use dot format, DeepSeek / NVIDIA use hyphen format) due to OpenRouter's specifications. This table directly reproduces slugs verified against OpenRouter's /api/v1/models. In the default pool table in the basics article, we listed hyphen-format Anthropic slugs (anthropic/claude-opus-4-6), which is because the default config v1-9models-qwen08b.yaml retained the notation from when Part 1 was published.
The key design point is keeping the adjacent ratio to a maximum of 2.73x. By reducing from the 14x jump in the old 5-model configuration by about a factor of 5, the aim is to prevent the MLP from exhibiting the phenomenon of "pushing down to one tier lower due to a cost cliff."
※ Display within Mermaid shows output unit price only. Refer to the table above for input unit prices.
 The Reality of 2026 Generation BenchmarksBefore finalizing the configuration, I wanted to organize the actual capabilities of each of the 9 models, but it appeared that 2026-generation frontier models have removed the legacy MMLU / HumanEval / MBPP / Aider polyglot from their official benchmarks. Anthropic / DeepMind / Z.AI / Moonshot / Qwen have all essentially dropped them from official reporting. The reality seems to be that the metrics have converged around 7 axes: GPQA Diamond / AIME / HLE / SWE-bench Verified / MMMU-Pro / LiveCodeBench v6 / Arena Elo.
The following is a reorganization focused on GPQA Diamond and Arena Elo for practical verification (as of writing, June 2026; sources are each company's official June 2026 releases and lmarena.ai).


Model
GPQA Diamond
SWE-bench Verified
MMMU-Pro
Arena Elo


Nemotron 3 Nano 30B-A3B
71.9
38.8
N/A
N/A

DeepSeek V4-Flash
87.4
~74
(text)
N/A

GLM 4.7-Flash
75.2
59.2
(text)
N/A

DeepSeek V4-Pro
90.1
80.6
(text)
1467

Qwen 3.7-Plus
90.3 (vendor)
~78
N/A
N/A

Kimi K2.6
90.5
80.2
79.4
1466

Gemini 3.5 Flash
90.4 (parent)
78 (internal)
83.6
1480

Claude Sonnet 4.6
89.9
79.6
74.5
1467

Claude Opus 4.8
93.6
88.6
N/A
1512

※ (vendor) = vendor self-reported value, (parent) = equivalent value from parent model (Gemini 3 Pro series), (internal) = internal company evaluation.
What surprised me when laying this out was that the top 3 emerging models (V4-Pro / Qwen 3.7-Plus / Kimi K2.6) are nearly tied with Sonnet 4.6 in GPQA / Arena. The price difference is 17.2x (V4-Pro $0.87 vs. Sonnet $15), so there was a case for "removing Sonnet from the pool," but I kept it due to operational reasons (running with 9 slots) and wanting to capture any differences that emerge in long-form coherence or safety nuances specific to Anthropic models. As I'll describe later, Sonnet 4.6 ended up solidly filling the role of "mid-heavy quality middle" with a 23.1% adoption rate.
 Gemini 3.5 Flash's Role and Securing the Multimodal PathAnother challenge in pool design was how to handle Gemini 3.5 Flash. While its multimodal capability (MMMU-Pro 83.6) is particularly strong, NVIDIA LLM Router v3's Prefill Router is text-only by design (it works by taking the hidden states of the prompt using the Qwen3.5-0.8B encoder), and image / audio / video blocks don't affect the judgment. With short prompts like "Please analyze the following image," the moment routing selects DeepSeek or Local Nemotron, OpenRouter returns 404 No endpoints found that support input image.
The approach here was to ensure Gemini's text-based roles (Google grounding / diagram structuring / Arena Elo 1480 general quality) through training data, while adding a thin shim to the litellm adapter layer to handle the multimodal path at a separate layer. It runs on a static capability lookup for the 9-model pool (image → {gemini-3.5-flash, sonnet-4.6, opus-4.8} / audio & video → {gemini-3.5-flash}), and in actual testing, 10×10 PNG / 1-second WAV / 1-second MP4 all route to Gemini 3.5 Flash, returning image, audio, and video analysis. I've submitted this shim as a proposal to upstream as PR #34, so check that out if you're interested. If you want to build full-fledged multimodal routing based on CLIP (NVIDIA AI Blueprint v2's Auto-Router), that's a different story, but I think this thin shim sufficiently covers the practical need to "properly route image / audio / video requests to a capable upstream."
 Collect Question Data Design Overall StructureThe training data consists of 480 questions total, broken down as follows.


Category
Count
Primary Model


Personal persona (curated heavy-leaning 40 + new heavy 60)
100
Heavy 60 skews toward Sonnet / Opus

Opus-favoring questions (heavy signal reinforcement)
150
Opus 4.8 dominant

Gemini-favoring questions (text-based only)
30
Gemini 3.5 Flash exclusive

Lightweight/mid-weight questions (curated light 60 + generated 40)
100
Nemotron / V4-Flash / GLM

Public datasets (MMLU 30 + HumanEval 15 + GSM8K 15 + DollyJA 40)
100
Bias avoidance

The Opus-favoring 150 questions are the core of this configuration, with the intent to "densely include heavy signals that only Opus can answer correctly, so the MLP doesn't lose out to the economic rationality of the emerging models." The Gemini-favoring 30 questions reinforce learning of text-based uniqueness (grounding / diagrams / structuring). Sonnet-favoring questions are intentionally excluded (since the top emerging models are nearly tied on GPQA, making differentiation questions hard to predict, the decision was to leave it to natural distribution). This results in a ratio of heavy-leaning 210 questions (44%) + Gemini-leaning 30 questions (6%).
 How to Create the 100 Personal Persona QuestionsFirst articulating the patterns of questions you typically ask Claude Code makes it easier to balance against the Opus-favoring questions.


Subcategory
Count
Content examples


Existing (article writing / Codex / tailscale / Hermes / NemoClaw, etc.)
40
Selected 40 heavy-leaning ones from old curated set

Large-scale refactoring plans
15
Multi-axis topics like "DDD + async migration for 50kLoC monolith," with longer code context

Infrastructure build plans
15
"Multi-region Kubernetes + IaC + monitoring design," "DGX Spark + on-prem GPU cluster build," etc.

PR reviews
15
Actual code diffs (50-200 lines) + review perspectives (security / performance / maintainability)

Issue triage
15
"Isolate a flaky test with unknown repro conditions," "root cause analysis for production incident," etc.

The latter 60 questions (refactoring / infrastructure / PR / issues) are clearly heavy questions, so they work strongly as signals for "heavy = Opus."
 How to Create the 150 Opus-Favoring QuestionsI organized the scenarios where Opus 4.8 clearly outperforms top emerging models by characteristic and structured the breakdown of 150 questions.


Characteristic
Count
Examples


Philosophy/ethics dilemmas
30
Trolley / Frankfurt / euthanasia — collisions of multiple principles

Long-form coherence
25
Identifying contradictions between character statements and past settings in 4000-5000 character context

Multi-step reasoning
25
Bayesian updates, causal inference, dual of linear programming

Refactoring
20
Redesigning a 50kLoC monolith with DDD, including migration risk and rollback strategy

Creative/persona imitation
20
Maintaining Natsume Soseki style or a specific character's tone over long passages

Constrained judgment
15
Code/design that must satisfy 10+ constraints

Cultural sensitivity/keigo
15
Politely disagreeing with a superior in English following Japanese business norms

Detailed question examples will be placed in nvidia-llm-router-v3-training/data/questions-opus-favor-150.txt on GitHub. One thing I was careful about when creating questions was "not making them too abstract" — phrasing like "Please explain ○○" doesn't produce much difference from top emerging models, but including constraints like "Discuss the X aspect of ○○ under Y and Z constraints, divided into N stages in 600 characters" tends to bring out Opus's true value.
 How to Create the 30 Gemini-Favoring Questions (Text-Based Only)This is the part where I want the MLP to learn Gemini 3.5 Flash's uniqueness. Since the multimodal path itself was separated into a different layer via the adapter shim mentioned earlier, the training data focuses on 30 text-based signal questions (grounding / diagrams / structuring) that I want the MLP to learn.


Category
Count
Content examples


Google search grounding required
10
"Quote the on-demand pricing for AWS Bedrock Claude Sonnet 4.5 as of June 2026"

Mermaid diagram output
8
"Draw the NVIDIA LLM Router v3 request flow as a Mermaid sequenceDiagram"

Multi-level structured markdown
7
"Explain the Kubernetes Operator pattern in a 5-chapter structure, combining tables and callouts"

Citation-based summary
5
"Summarize Anthropic Constitutional AI in [1][2][3] citation format in 500 characters"

The key is explicitly including keywords like "as of June 2026 or later" and "the latest" that indicate grounding is needed. While the Prefill Router itself can only judge from the text before the call, prompts carrying these keywords have feature values indicating "text that requires Google search grounding," so they work as signals that indirectly make the MLP more likely to select Gemini. The actual grounding then runs on the Gemini side once it's called.
 Sampling the 100 Public Dataset QuestionsFor bias avoidance, 100 questions are sampled from a combination of MMLU 30 + HumanEval 15 + GSM8K 15 + DollyJA 40. They are deterministically extracted (with fixed seed) from a pre-built pool of 300 questions, ensuring reproducibility.
 Running the Training Pipeline Step 1. Environment Prerequisites and Connectivity CheckThe configuration is DGX Spark (aarch64, GB10, 128GB UMA) + vLLM Nemotron 3 Nano 30B-A3B NVFP4 (local) + OpenRouter (remaining 8 models). Use probe-models.py to send Reply with the single word 'pong' to each model and verify HTTP 200 / latency / cost.
# Please adapt the verification scripts in this article to your working directory
cd workspace/blog/scripts/nvidia-llm-router-v3-training
uv run --with httpx --with pyyaml scripts/probe-models.py \
  --config configs/my-pool-9models.yaml
Once you get 9/9 OK, proceed to the next step.
 Step 2. dry-run (10 questions × 9 models)Before the actual collect, run a dry-run with 90 calls to estimate cost and latency. I hit about 2 thinking-related traps here, but by the 3rd attempt the configuration of cutting thinking with extra_body.reasoning.enabled: false + making Gemini 3.5 Flash an exception with max_tokens: 2048, a 9/9 fully OK configuration was finalized. The actual cost was 1/4 of the preliminary estimate of $45-55, with a projection of being able to run all 480 questions for $11.27.
dry-run 3 attempts and thinking trap detailsWhen I first ran it with thinking on, GLM 4.7-Flash / Kimi K2.6 / Gemini 3.5 Flash / DeepSeek V4-Pro all consumed their max_tokens=1024 on thinking with nothing left for the actual response, which occurred frequently. Records with reply_len=0 appeared in large numbers, and these can't be used as quality signals.
So I adopted the policy of uniformly applying extra_body.reasoning.enabled: false to the 6 OpenRouter models, which dramatically changed latency.


Model
thinking on
thinking off
Ratio


GLM 4.7-Flash
29.75s
5.69s
5.2x

Qwen 3.7-Plus
34.03s
2.94s
11.6x

Kimi K2.6
13.40s
3.51s
3.8x

But another trap was waiting here. Gemini 3.5 Flash has mandatory reasoning, and passing reasoning: { enabled: false } returns HTTP 400 ("Reasoning is mandatory for this endpoint and cannot be disabled."). The exception handling was to keep thinking on for Gemini only and raise max_tokens: 2048 to secure tokens for the actual response.
The cost progression across 3 attempts is as follows, and I was pleasantly surprised at how cost-effective the max_tokens=1024 + thinking off combination turned out to be.


Attempt
Cost (90 calls)
480q estimate


1st (thinking on)
$0.265
$12.70

2nd (thinking off all models)
$0.148
$7.12

3rd (thinking off + Gemini exception)
$0.235
$11.27

 Step 3. The judge=vote Trap and Switching to judge=llmAfter completing the dry-run, I ran the actual collect with judge=vote and got eyebrow-raising numbers: per-model accuracy of "Local Nemotron 99.8% / all other models 0-0.2%". The cause was in the _judge_vote logic in collect.py — when each model returns different natural language responses, the Counter ends up with ties and always treats the head of the pool (Local Nemotron) as "correct" based on insertion order. This works for MMLU-style choice matching, but cannot be used for generation tasks.
So I switched to judge=llm + Sonnet 4.6 as judge and re-collected (7h15m). While self-scoring bias remains since Sonnet itself is the judge, a reasonable distribution emerged with Opus at the top, the top 3 emerging models clustered together, and Local / GLM at the bottom — making it usable as a quality signal.


Model
accuracy (Sonnet judge)
avg tokens


Opus 4.8
69.17%
738

Sonnet 4.6 ⁽*⁾
65.2%
709

DeepSeek V4-Flash
63.1%
724

DeepSeek V4-Pro
61.0%
718

Kimi K2.6
60.0%
946

Qwen 3.7-Plus
56.5%
741

Gemini 3.5 Flash
54.6%
1,719

GLM 4.7-Flash
39.0%
742

Local Nemotron
38.3%
940

⁽*⁾ Sonnet 4.6 also serves as judge, so self-scoring bias is present
judge=vote bug implementation details and re-collect commandThe initial judge=vote results were as follows — with the free local model beating all 8 other top models at 99.8% victory, this was clearly a problem with the logic.


Model
accuracy


Local Nemotron
99.8%

DeepSeek V4-Flash
0.0%

GLM 4.7-Flash
0.2%

DeepSeek V4-Pro
0.2%

Qwen 3.7-Plus
0.2%

Kimi K2.6
0.0%

Gemini 3.5 Flash
0.0%

Claude Sonnet 4.6
0.0%

Claude Opus 4.8
0.0%

Reading through _judge_vote in collect.py, the cause became immediately clear.
def _normalize(text: str) -> str:
    return " ".join(re.split(r"\s+", text.strip().lower()))

def _judge_vote(outputs: list[str]) -> str:
    normalized = [_normalize(o) for o in outputs]
    counts = Counter(normalized)
    return counts.most_common(1)[0][0]
When each model returns different natural language responses, all strings remain unique even after normalization. Counter respects insertion order for ties, so the first model to appear (Local Nemotron at Slot 1 in the pool YAML) always becomes the majority. Only models whose responses match Local's response are judged as "correct," and everything else is 0%.
The re-collect command is as follows (with the judge parallelization patch applied, max_workers=5, targeting the upper limit that avoids hitting OpenRouter's rate limit).
model-router collect \
  --config configs/my-pool-9models.yaml \
  --questions data/questions-9models.txt \
  --judge llm \
  --judge-model openrouter/anthropic/claude-sonnet-4.6 \
  --output data/collected-my-pool-9models.csv
If you're creating your own routing for other use cases, be careful of the same pitfall. If you're using MMLU-style selections (where the choice is returned as A/B/C/D characters), you can use it as-is, but for generation tasks, using judge=llm is the safe choice.
 Step 4. train (Completed in 5 minutes)model-router train \
  --config configs/my-pool-9models.yaml \
  --data data/collected-my-pool-9models.csv \
  --output-dir checkpoints/my-router-9models/
Training runs with accelerate, 9-dim MLP, Qwen/Qwen3.5-0.8B encoder. It completed Stages 1–6 (Load → Extract → PCA → MLP ensemble → Calibrate → Save) in 5 minutes and 1 second.
The per-model AUC for the completed checkpoint is as follows.


Model
AUC (Shared trunk ensemble)


Qwen 3.7-Plus
0.9430 (highest among 9 models)

Gemini 3.5 Flash
0.9250

Claude Sonnet 4.6
0.9162

Claude Opus 4.8
0.9103

DeepSeek V4-Flash
0.9065

GLM 4.7-Flash
0.8937

Local Nemotron
0.8837

DeepSeek V4-Pro
0.8757

Kimi K2.6
0.8517

All models achieved AUC above 0.85, exceeding the target values I had assumed (per-model AUC 0.5-0.9). Compared to the default checkpoint where p_max was 0.07-0.10, just preparing training data tightens the MLP predictions this much — which was a quietly pleasant discovery.
 Step 5. evaluate (Quality Improvement by the Numbers)model-router evaluate \
  --config configs/my-pool-9models.yaml \
  --checkpoint checkpoints/my-router-9models/prefill_router.pt \
  --data data/collected-my-pool-9models.csv \
  --output results/eval-my-router-9models.json


Metric
Value


Oracle accuracy
0.9250

Best single model
0.6917 (Opus 4.8 standalone)

Router accuracy (argmax)
0.7937 (+10.20pp vs Opus standalone)

Headroom captured
43.7%

The quantitative takeaway is: "compared to 69.17% accuracy when solving 480 questions with Opus alone, the router achieves 79.37%—an improvement of over 10 percentage points." Of the gap between this and the Oracle accuracy (the upper bound when the best model is always chosen) of 92.5%, the router closes 43.7%.
Let's also take a look at the argmax distribution from the evaluation.


Model
Times Selected
Ratio
Accuracy When Selected


Claude Opus 4.8
207
43.1%
64.7%

Claude Sonnet 4.6
111
23.1%
99.1%

Gemini 3.5 Flash
59
12.3%
98.3%

DeepSeek V4-Pro
33
6.9%
69.7%

Qwen 3.7-Plus
31
6.5%
90.3%

DeepSeek V4-Flash
23
4.8%
82.6%

Kimi K2.6
16
3.3%
56.3%

Local Nemotron
0
0.0%
—

GLM 4.7-Flash
0
0.0%
—

Opus 4.8 leads with 43.1% selection, followed by Sonnet 4.6 at 23.1% with a remarkable 99.1% accuracy. Gemini 3.5 Flash comes in at 12.3% with 98.3%. During design, I had said "Sonnet might get selected occasionally," but when the results came in, it was surprisingly stable as the main player in the mid-weight quality middle tier.
It may seem counterintuitive that Opus at 64.7% is lower than Sonnet at 99.1%, but this simply reflects the fact that "the difficulty of the sets selected by argmax differs." The 207 questions for which Opus is selected by argmax are a cluster of hard questions where the MLP determined that P(correct) won't be high unless Opus is used, so even Opus faces a higher bar to be judged correct, resulting in 64.7%. The 99.1% / 98.3% for Sonnet / Gemini are the result of comparatively easy questions being routed there—ones the MLP determined could be answered without Opus. Just to note: the previously mentioned 69.17% for Opus standalone is aggregated across all 480 questions, so it uses a different set than the 64.7% here (a subset of 207 questions).
Conversely, Local Nemotron and GLM 4.7-Flash are at 0% in the argmax results. Their accuracy is around 38–39%, but under argmax with tolerance=0, they always lose because "other models produce higher P(correct) for the same question." In this configuration where training data is skewed 44% toward heavy models, they never get a turn. Raising the tolerance to 0.20 would allow them to be properly picked up for lightweight questions.
 Behavior of the Persona-Optimized Checkpoint Routing Demonstration by Tolerance (bench_persona_3tol)bench_persona_3tol.py is a script that submits 5 representative question types (casual chat / code generation / technical explanation / math proof / philosophy) at 3 tolerance levels (0.05 / 0.10 / 0.20) and compares the routing distribution and costs for each. I launched the new checkpoint with model-router serve --port 8204 and ran it.


Tolerance
Main Models Selected
Total Cost (5 questions)
Reduction Rate (vs. All-Opus)


0.05
deepseek-v4-flash dominant
$0.00055
99.3%

0.10
deepseek-v4-flash + glm-4.7-flash
$0.00080
99.0%

0.20
lightweight + Local Nemotron + glm
$0.00133
98.3%

(All-Opus)
Opus 4.6 direct (fixed in bench script)
$0.07664
Baseline

The (All-Opus) row uses anthropic/claude-opus-4-6 as fixed in the bench script's comparison baseline, so it is the old Opus 4.6 rather than the new pool's Slot 9 (Opus 4.8). The output price per token is close enough that it works as a comparison baseline, but if you want exact figures, change OPUS_MODEL in the bench script to claude-opus-4.8.
The 5 questions in the bench are all "lightweight to mid-weight," so the absence of Opus is as expected. The fact that even at tol=0.05 deepseek-v4-flash dominates may seem like a gap compared to V4-Flash being selected only 4.8% of the time in evaluate (480 questions / argmax), but this is because the 5 bench questions are skewed toward the left tail (lightweight side) of the difficulty distribution in evaluate. Raising the tolerance to 0.20 causes Local Nemotron to be picked up as well, making lightweight questions effectively free.
 Interpreting the Routing Distribution and Sonnet's RoleLet me organize the point that routing behavior looks completely different between evaluate (480 questions / argmax) and bench (5 light-to-mid-weight questions / with tolerance).


Evaluation Axis
Environment
Main Models Selected
Interpretation


evaluate
480 questions / argmax (tol=0)
Opus 43.1% + Sonnet 23.1% + Gemini 12.3%
Quality-preserving routing

bench_persona_3tol
5 light-mid questions / tol 0.05-0.20
deepseek-v4-flash + glm-4.7-flash + Local
Cost-reduction routing

This is the ideal design realized: routing automatically switches based on question difficulty and tolerance. Heavy questions are handled by Opus / Sonnet to preserve quality, while lightweight questions are handled by lightweight models to cut costs—a clear and well-defined contrast.
During the design phase, my stance was: "Sonnet ties with the emerging upper tier (V4-Pro / Qwen 3.7-Plus / Kimi K2.6) on GPQA, so I won't include differentiating questions in the training data—it might get selected occasionally." But evaluate returned 23.1% selection / 99.1% accuracy. Since argmax (tol=0) doesn't factor in cost, "it was selected because it's cheap" doesn't explain it. The possibilities are: (1) using Sonnet 4.6 as the judge introduced a self-scoring bias that inflated Sonnet's correct labels and caused the MLP to learn a higher P(correct) for Sonnet, or (2) for the 111 questions it was selected for, the MLP's Sonnet predictions genuinely exceeded Opus. Without re-collecting with a different judge (GPT-5.4 or Gemini 3.5 Pro), it's hard to distinguish between the two—but the fact that Sonnet is functioning stably as a mid-weight quality middle tier remains unchanged, and the design ideal of "maintaining quality while also reducing costs" is working in a quantitatively effective way.
 Increasing Local Nemotron's Selection RateThe result was 0% for Local Nemotron in evaluate's argmax, and only 2/5 in bench at tol=0.20. If you want to "route more casual/brainstorming-type requests to Local," there are 3 levels of adjustment: (1) add extra_body.routing.tolerance: 0.20 to the request body on a per-request basis to raise the tolerance (easiest, no restart required), (2) launch a separate server on a different port with --models restricted to lightweight models only, or (3) add 50–100 questions of casual chat / short translation / simple memo organization to the training data and re-collect / re-train. In practice, (1) is the most convenient, and by differentiating per Hermes Agent profile—"tolerance=0.20 for casual chat, 0.05 for code generation via Claude Code"—you can run multiple use cases from a single routing service. The story of integrating this with flows running on the Hermes side is planned for a practical implementation article.
 The Cost Cliff Visible in Comparison with the Old 5-Model CheckpointFor reference, here is the routing distribution for the old 5-model pool (Local Nemotron / gpt-oss-120b / Kimi K2.7 Code / GLM 5.2 / Opus 4.6) mentioned at the beginning of the pool redesign. When 5 use cases were submitted, all 4 non-lightweight cases were consolidated into gpt-oss-120b with p_max of 0.79–0.94 (high confidence), but Opus / Kimi / GLM were never called—a structure where they had no role to play.


Use Case
Model Selected
p_max


Casual chat (lightweight)
local-nemotron-3-nano
0.94

Code generation (mid-weight)
gpt-oss-120b
0.87

Technical explanation (mid-weight)
gpt-oss-120b
0.83

Math proof (mid-heavy)
gpt-oss-120b
0.79

Philosophy (heavy)
gpt-oss-120b
0.81

Placing this alongside the new 9-model ladder makes the cause—the 14x jump—visually obvious.
※ All prices within Mermaid are output prices. For input prices, please refer to the respective tables above.
As a basic principle when building your own routing setup: if you keep in mind from the start that "the cost multiplier between adjacent models should be within 3x, ideally around 2x," you'll find it easier to avoid cost cliffs like this.
 SummaryStarting from the feeling of "this might be slightly different from my preferences" after exploring the NVIDIA LLM Router v3 default checkpoint, I pushed all the way through rebuilding a 9-model pool from scratch and running persona training over 480 questions. The result: with Opus 4.8 selected at 43.1% to preserve quality, lightweight and mid-weight questions at tolerance 0.05–0.20 achieved cost reductions in the 98–99% range (up to 99.3%), giving a tangible feel for quality-preserving routing. The pitfalls I hit along the way (thinking on consuming all max_tokens, insertion order bug in judge=vote, Gemini's mandatory reasoning, the 14x cost cliff jump) are written out in detail in hopes they'll be useful to others building their own routing for different use cases. In particular, the pool design principle of "keep the cost multiplier between adjacent models within 3x, ideally around 2x" is something I think is worth being aware of from the very beginning.
To avoid "building and calling it done," observability after putting the checkpoint into operation is also important. The routing service used in this article has a thin Langfuse callback and routing decision stdout logging layered in, allowing me to check after the fact how many times each model was called, what the costs looked like, and how the distribution shifts when tolerance changes. I'm thinking of covering this observability topic in more depth in a separate article.
Even though it says "reference implementation only," v3 is this much fun to play with—and the feeling of training gradually molding it to fit your own persona is quietly enjoyable.
 Reference Links NVIDIA LLM RouterNVIDIA-AI-Blueprints/llm-router (GitHub) — The repository including the v3 branch covered in this article
NVIDIA LLM Router で LLM の用途別使い分け環境を構築してみた（基礎編） — Series Part 1, covering the default checkpoint behavior and OpenAI-compatible endpoint basics
 Upstream PRs Submitted During Verification for This ArticlePR #32: forward all OpenAI-compatible fields to upstream — Proposed fix for the issue where OpenAI-compatible fields such as tools were being dropped
PR #33: translate upstream errors to OpenAI-compatible responses — Proposed fix for the issue where upstream 4xx errors were being returned as 500
PR #34: route image/audio/video requests to capability-matched models — Proposed adapter shim to route multimodal content to capability-matched models
 Related TopicsOpenRouter — API gateway used as the upstream for this article's pool
Langfuse — Integrated via LiteLLM callback for observability
lmarena.ai — Source of Arena Elo referenced in the benchmark tables in this article

I tried retraining the NVIDIA LLM Router to match my own persona (Training Edition)

Introduction

Pool Redesign Policy

Lessons from the Failed 5-Model Configuration

The 9-Model Ladder

The Reality of 2026 Generation Benchmarks

Gemini 3.5 Flash's Role and Securing the Multimodal Path

Collect Question Data Design

Overall Structure

How to Create the 100 Personal Persona Questions

How to Create the 150 Opus-Favoring Questions

How to Create the 30 Gemini-Favoring Questions (Text-Based Only)

Sampling the 100 Public Dataset Questions

Running the Training Pipeline

Step 1. Environment Prerequisites and Connectivity Check

Step 2. dry-run (10 questions × 9 models)

Step 3. The judge=vote Trap and Switching to judge=llm

Step 4. train (Completed in 5 minutes)

Step 5. evaluate (Quality Improvement by the Numbers)

Behavior of the Persona-Optimized Checkpoint

Routing Demonstration by Tolerance (bench_persona_3tol)

Interpreting the Routing Distribution and Sonnet's Role

Increasing Local Nemotron's Selection Rate

The Cost Cliff Visible in Comparison with the Old 5-Model Checkpoint

Summary

Reference Links

NVIDIA LLM Router

Upstream PRs Submitted During Verification for This Article

AI白書2026 配布中

AWS Topics

Trending Topics

Products & Services

Features and Series

Slot	Model	OpenRouter slug	output $/M	Adjacent ratio	Role
1	Nemotron 3 Nano 30B-A3B (Local) ⁽*⁾	`openrouter/nvidia/nemotron-3-nano-30b-a3b`	0 ⁽*⁾	—	Local free anchor, lightweight reasoning
2	DeepSeek V4-Flash	`deepseek/deepseek-v4-flash`	0.18	—	Lightweight general (emerging 1 / code / long ctx 1M)
3	GLM 4.7 Flash	`z-ai/glm-4.7-flash`	0.40	2.22x	Lightweight (emerging 2 / Chinese accuracy / 202k ctx)
4	DeepSeek V4-Pro	`deepseek/deepseek-v4-pro`	0.87	2.18x	Mid-weight reasoning (emerging 3, cost-effective / long ctx 1M)
5	Qwen 3.7-Plus	`qwen/qwen3.7-plus`	1.28	1.47x	Mid-weight general (emerging 4 / latest gen / long ctx 1M)
6	Kimi K2.6	`moonshotai/kimi-k2.6`	3.50	2.73x	Mid-heavy general (emerging 5, high versatility)
7	Gemini 3.5 Flash	`google/gemini-3.5-flash`	9.00	2.57x	Mid-heavy reasoning + multimodal dedicated path
8	Claude Sonnet 4.6	`anthropic/claude-sonnet-4.6`	15.00	1.67x	Heavy quality middle, bridge to Opus
9	Claude Opus 4.8	`anthropic/claude-opus-4.8`	25.00	1.67x	Heaviest, quality anchor

Model	GPQA Diamond	SWE-bench Verified	MMMU-Pro	Arena Elo
Nemotron 3 Nano 30B-A3B	71.9	38.8	N/A	N/A
DeepSeek V4-Flash	87.4	~74	(text)	N/A
GLM 4.7-Flash	75.2	59.2	(text)	N/A
DeepSeek V4-Pro	90.1	80.6	(text)	1467
Qwen 3.7-Plus	90.3 (vendor)	~78	N/A	N/A
Kimi K2.6	90.5	80.2	79.4	1466
Gemini 3.5 Flash	90.4 (parent)	78 (internal)	83.6	1480
Claude Sonnet 4.6	89.9	79.6	74.5	1467
Claude Opus 4.8	93.6	88.6	N/A	1512

Category	Count	Primary Model
Personal persona (curated heavy-leaning 40 + new heavy 60)	100	Heavy 60 skews toward Sonnet / Opus
Opus-favoring questions (heavy signal reinforcement)	150	Opus 4.8 dominant
Gemini-favoring questions (text-based only)	30	Gemini 3.5 Flash exclusive
Lightweight/mid-weight questions (curated light 60 + generated 40)	100	Nemotron / V4-Flash / GLM
Public datasets (MMLU 30 + HumanEval 15 + GSM8K 15 + DollyJA 40)	100	Bias avoidance

Subcategory	Count	Content examples
Existing (article writing / Codex / tailscale / Hermes / NemoClaw, etc.)	40	Selected 40 heavy-leaning ones from old curated set
Large-scale refactoring plans	15	Multi-axis topics like "DDD + async migration for 50kLoC monolith," with longer code context
Infrastructure build plans	15	"Multi-region Kubernetes + IaC + monitoring design," "DGX Spark + on-prem GPU cluster build," etc.
PR reviews	15	Actual code diffs (50-200 lines) + review perspectives (security / performance / maintainability)
Issue triage	15	"Isolate a flaky test with unknown repro conditions," "root cause analysis for production incident," etc.

Characteristic	Count	Examples
Philosophy/ethics dilemmas	30	Trolley / Frankfurt / euthanasia — collisions of multiple principles
Long-form coherence	25	Identifying contradictions between character statements and past settings in 4000-5000 character context
Multi-step reasoning	25	Bayesian updates, causal inference, dual of linear programming
Refactoring	20	Redesigning a 50kLoC monolith with DDD, including migration risk and rollback strategy
Creative/persona imitation	20	Maintaining Natsume Soseki style or a specific character's tone over long passages
Constrained judgment	15	Code/design that must satisfy 10+ constraints
Cultural sensitivity/keigo	15	Politely disagreeing with a superior in English following Japanese business norms

Category	Count	Content examples
Google search grounding required	10	"Quote the on-demand pricing for AWS Bedrock Claude Sonnet 4.5 as of June 2026"
Mermaid diagram output	8	"Draw the NVIDIA LLM Router v3 request flow as a Mermaid sequenceDiagram"
Multi-level structured markdown	7	"Explain the Kubernetes Operator pattern in a 5-chapter structure, combining tables and callouts"
Citation-based summary	5	"Summarize Anthropic Constitutional AI in [1][2][3] citation format in 500 characters"

Model	thinking on	thinking off	Ratio
GLM 4.7-Flash	29.75s	5.69s	5.2x
Qwen 3.7-Plus	34.03s	2.94s	11.6x
Kimi K2.6	13.40s	3.51s	3.8x

Attempt	Cost (90 calls)	480q estimate
1st (thinking on)	$0.265	$12.70
2nd (thinking off all models)	$0.148	$7.12
3rd (thinking off + Gemini exception)	$0.235	$11.27

Model	accuracy (Sonnet judge)	avg tokens
Opus 4.8	69.17%	738
Sonnet 4.6 ⁽*⁾	65.2%	709
DeepSeek V4-Flash	63.1%	724
DeepSeek V4-Pro	61.0%	718
Kimi K2.6	60.0%	946
Qwen 3.7-Plus	56.5%	741
Gemini 3.5 Flash	54.6%	1,719
GLM 4.7-Flash	39.0%	742
Local Nemotron	38.3%	940

Model	accuracy
Local Nemotron	99.8%
DeepSeek V4-Flash	0.0%
GLM 4.7-Flash	0.2%
DeepSeek V4-Pro	0.2%
Qwen 3.7-Plus	0.2%
Kimi K2.6	0.0%
Gemini 3.5 Flash	0.0%
Claude Sonnet 4.6	0.0%
Claude Opus 4.8	0.0%

Model	AUC (Shared trunk ensemble)
Qwen 3.7-Plus	0.9430 (highest among 9 models)
Gemini 3.5 Flash	0.9250
Claude Sonnet 4.6	0.9162
Claude Opus 4.8	0.9103
DeepSeek V4-Flash	0.9065
GLM 4.7-Flash	0.8937
Local Nemotron	0.8837
DeepSeek V4-Pro	0.8757
Kimi K2.6	0.8517

Metric	Value
Oracle accuracy	0.9250
Best single model	0.6917 (Opus 4.8 standalone)
Router accuracy (argmax)	0.7937 (+10.20pp vs Opus standalone)
Headroom captured	43.7%

Model	Times Selected	Ratio	Accuracy When Selected
Claude Opus 4.8	207	43.1%	64.7%
Claude Sonnet 4.6	111	23.1%	99.1%
Gemini 3.5 Flash	59	12.3%	98.3%
DeepSeek V4-Pro	33	6.9%	69.7%
Qwen 3.7-Plus	31	6.5%	90.3%
DeepSeek V4-Flash	23	4.8%	82.6%
Kimi K2.6	16	3.3%	56.3%
Local Nemotron	0	0.0%	—
GLM 4.7-Flash	0	0.0%	—

Tolerance	Main Models Selected	Total Cost (5 questions)	Reduction Rate (vs. All-Opus)
0.05	deepseek-v4-flash dominant	$0.00055	99.3%
0.10	deepseek-v4-flash + glm-4.7-flash	$0.00080	99.0%
0.20	lightweight + Local Nemotron + glm	$0.00133	98.3%
(All-Opus)	Opus 4.6 direct (fixed in bench script)	$0.07664	Baseline

Evaluation Axis	Environment	Main Models Selected	Interpretation
evaluate	480 questions / argmax (tol=0)	Opus 43.1% + Sonnet 23.1% + Gemini 12.3%	Quality-preserving routing
bench_persona_3tol	5 light-mid questions / tol 0.05-0.20	deepseek-v4-flash + glm-4.7-flash + Local	Cost-reduction routing

Use Case	Model Selected	p_max
Casual chat (lightweight)	local-nemotron-3-nano	0.94
Code generation (mid-weight)	gpt-oss-120b	0.87
Technical explanation (mid-weight)	gpt-oss-120b	0.83
Math proof (mid-heavy)	gpt-oss-120b	0.79
Philosophy (heavy)	gpt-oss-120b	0.81