I tried retraining the NVIDIA LLM Router to match my own persona (Training Edition)

I tried retraining the NVIDIA LLM Router to match my own persona (Training Edition)

2026.06.21

This page has been translated by machine translation. View original

Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

After running NVIDIA LLM Router v3 end-to-end in the basics article, I used the default checkpoint for my own work for a while. A 9-model general-purpose pool of Nemotron / GPT-OSS / Qwen / GPT-5 / Opus variants (v1-9models-qwen08b.yaml) is distributed as standard, and routing judgment works just fine with this alone.

https://dev.classmethod.jp/articles/dgx-spark-nvidia-llm-router-v3/

After tinkering with it for a while, what started to bother me was that this default pool was "slightly off from my preferences." The latest-generation Opus 4.8 / Sonnet 4.6 / Gemini 3.5 Flash weren't included yet, and the lineup didn't match the emerging model group I typically want to call (DeepSeek V4 / Qwen 3.7 / Kimi K2.6 / GLM 4.7, etc.). The MLP training data also seemed to be organized with general-purpose use in mind, so it didn't appear to be tuned to my usage patterns (which tend to split to the extremes: heavy design discussions and light brainstorming sessions).

Fortunately, NVIDIA LLM Router v3 officially provides all the resources needed to rebuild a custom checkpoint in 3 steps: collect → train → evaluate. By reorganizing the pool YAML to your liking and converting your everyday questions into training data, you can build routing from scratch tailored to that individual. Since the framework was already in place, the starting point of this article is the decision to reorganize it once to suit my own persona.

Specifically, I took my personal persona (my typical usage of Claude Code / Codex, blog articles I'm writing, design discussions around Hermes Agent and NemoClaw, etc.) and converted it into 480 training questions, then rebuilt a checkpoint from scratch for a new 9-model pool (including the latest Opus 4.8 / Sonnet 4.6 / Gemini 3.5 Flash). By designing with quality-preserving routing in mind (heavy questions properly go to Opus, light questions go to cheaper models), I arrived at a configuration that raises Opus 4.8's adoption rate to 43.1% while demonstrating 98–99% cost reduction (up to 99.3%) for light-to-medium questions.

As for LLM cost optimization, there's also the approach of handling it with rule-based routing by task type, as in MindStudio's "Run Local AI Models with Claude Code" explanation. The reason I chose NVIDIA LLM Router v3 this time was that I wanted to train the Qwen3.5-0.8B encoder + PCA + MLP on the quality × price tradeoff within the same task type (even for the same code generation, deep dependency cases go to Opus, light one-liners go to gpt-oss-120b). If you just want to offload auxiliary tasks locally, rule-based routing is sufficient, so I think using them based on the use case is the practical approach. In this article, I'll focus on the use case of "someone who has been using Claude Code with Opus as the default and wants to switch to routing without degrading quality."

Pool Redesign Policy

Lessons from the Failed 5-Model Configuration

In the first 5-model pool I built (Local + gpt-oss-120b + Kimi K2.7 + GLM 5.2 + Opus 4.6), with 5 use cases input, 4 out of 5 non-lightweight cases all ended up consolidated on gpt-oss-120b. Looking at the cause, the pricing ladder was distorted—there was a 14x cost cliff between gpt-oss-120b ($0.05 / $0.21) and the next most expensive kimi ($0.74 / $3.50). The AUC difference with Opus was only about 5 points, and the MLP honestly learned that "gpt-oss-120b is sufficient for the mid-range tier," factoring in cost-effectiveness.

This is a trap that subtly affects LLM Router configurations—when the cost difference between adjacent models is too large, the MLP tends to lean toward the cheaper side. I experienced firsthand here that keeping a continuous gradient of roughly 1.5–3x steps is the basic practice for building quality-preserving routing.

9-Model Ladder

The final configuration I rebuilt is as follows.

Slot Model OpenRouter slug output $/M Adjacent multiplier Role
1 Nemotron 3 Nano 30B-A3B (Local) ⁽*⁾ openrouter/nvidia/nemotron-3-nano-30b-a3b 0 ⁽*⁾ Local free anchor, lightweight reasoning
2 DeepSeek V4-Flash deepseek/deepseek-v4-flash 0.18 Lightweight general-purpose (emerging 1 / code / long ctx 1M)
3 GLM 4.7 Flash z-ai/glm-4.7-flash 0.40 2.22x Lightweight (emerging 2 / Chinese accuracy / 202k ctx)
4 DeepSeek V4-Pro deepseek/deepseek-v4-pro 0.87 2.18x Mid-range inference (emerging 3, cost-effective / long ctx 1M)
5 Qwen 3.7-Plus qwen/qwen3.7-plus 1.28 1.47x Mid-range general-purpose (emerging 4 / latest generation / long ctx 1M)
6 Kimi K2.6 moonshotai/kimi-k2.6 3.50 2.73x Mid-heavy general-purpose (emerging 5, high versatility)
7 Gemini 3.5 Flash google/gemini-3.5-flash 9.00 2.57x Mid-heavy reasoning + multimodal unique pathway
8 Claude Sonnet 4.6 anthropic/claude-sonnet-4.6 15.00 1.67x Heavy quality middle, bridge to Opus
9 Claude Opus 4.8 anthropic/claude-opus-4.8 25.00 1.67x Heaviest, quality anchor

⁽*⁾ Slot 1 is running on the OpenRouter Nano paid version ($0.05 / $0.20) to improve reproducibility in the series, but cost_per_m_*_tokens in the pool YAML is maintained at 0. If you have an environment where you can run models locally, you can switch Slot 1 to local operation by replacing litellm_model with openai/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 + api_base: http://localhost:8000/v1. The MLP label remains local-nemotron-3-nano, and the routing judgment can be used as-is without retraining.

Note that the slug notation varies by provider (Anthropic / Google / Moonshot / Z.AI use dot format, DeepSeek / NVIDIA use hyphen format)—this is OpenRouter's specification. This table reproduces slugs confirmed against OpenRouter /api/v1/models. In the default pool table from the basics article, hyphen-format Anthropic slugs (anthropic/claude-opus-4-6) were listed, as this matched the default config v1-9models-qwen08b.yaml exactly as it appeared at the time of Part 1 publication.

The key design point is keeping the adjacent multiplier to a maximum of 2.73x. The goal is to reduce it by about 5x from the 14x jump in the old 5-model configuration and prevent the phenomenon where the MLP leans toward a lower-tier model due to cost cliffs.

※ The display in Mermaid shows output unit price only. Refer to the table above for input unit prices.

The Reality of 2026-Generation Benchmarks

Before finalizing the configuration, I wanted to organize the actual capabilities of each of the 9 models, but it appeared that 2026-generation frontier models have removed the traditional MMLU / HumanEval / MBPP / Aider polyglot from their official benchmarks. Almost all of Anthropic / DeepMind / Z.AI / Moonshot / Qwen have dropped them from official declarations. The reality seems to be that metrics are now converging on about 7 axes: GPQA Diamond / AIME / HLE / SWE-bench Verified / MMMU-Pro / LiveCodeBench v6 / Arena Elo.

The following is a reorganization centered on GPQA Diamond and Arena Elo for practical verification (as of writing, June 2026; sources are each company's official June 2026 releases and lmarena.ai).

Model GPQA Diamond SWE-bench Verified MMMU-Pro Arena Elo
Nemotron 3 Nano 30B-A3B 71.9 38.8 N/A N/A
DeepSeek V4-Flash 87.4 ~74 (text) N/A
GLM 4.7-Flash 75.2 59.2 (text) N/A
DeepSeek V4-Pro 90.1 80.6 (text) 1467
Qwen 3.7-Plus 90.3 (vendor) ~78 N/A N/A
Kimi K2.6 90.5 80.2 79.4 1466
Gemini 3.5 Flash 90.4 (parent) 78 (internal) 83.6 1480
Claude Sonnet 4.6 89.9 79.6 74.5 1467
Claude Opus 4.8 93.6 88.6 N/A 1512

※ (vendor) = vendor self-reported value, (parent) = equivalent value from parent model (Gemini 3 Pro series) used as proxy, (internal) = internal evaluation.

What surprised me when laying it out was that the top 3 emerging models (V4-Pro / Qwen 3.7-Plus / Kimi K2.6) are nearly tied with Sonnet 4.6 on GPQA / Arena. The price difference is 17.2x at V4-Pro $0.87 vs Sonnet $15, so there was an argument for "removing Sonnet from the pool," but I kept it in for operational reasons with a 9-slot setup and to capture any differences in Anthropic's long-context coherence and safety nuances. As described later, Sonnet 4.6 ends up fulfilling its role as "mid-heavy quality middle" with a 23.1% adoption rate.

Gemini 3.5 Flash's Role and Securing the Multimodal Pathway

Another tricky aspect of pool design was how to handle Gemini 3.5 Flash. While it excels notably in multimodal capabilities (MMMU-Pro 83.6), NVIDIA LLM Router v3's Prefill Router is text-only by design (taking hidden states from the prompt via the Qwen3.5-0.8B encoder), meaning image / audio / video blocks don't factor into the judgment. With a short prompt like "Please analyze the image below," the moment routing selects DeepSeek or Local Nemotron, OpenRouter returns 404 No endpoints found that support input image.

The approach here was to ensure Gemini's text-based role (Google grounding / diagram structuring / Arena Elo 1480 general quality) through training data, while adding a thin shim to the litellm adapter layer to handle the multimodal pathway at a separate layer. It runs with a static capability lookup for the 9-model pool (image → {gemini-3.5-flash, sonnet-4.6, opus-4.8} / audio/video → {gemini-3.5-flash}), and in real-machine probes, 10×10 PNG / 1-second WAV / 1-second MP4 were all routed to Gemini 3.5 Flash with image, audio, and video analysis returned. This shim has been submitted as a proposal upstream as PR #34, so interested parties are welcome to check it out. Building full-fledged multimodal routing based on CLIP (NVIDIA AI Blueprint v2's Auto-Router) is a different story, but I think this thin shim adequately covers the practical need of "properly routing image / audio / video requests to a capable upstream model."

Collect Question Data Design

Overall Structure

The training data consists of 480 questions in total, broken down as follows.

Category Count Primary model
Personal persona (curated heavy-leaning 40 + new heavy 60) 100 Heavy 60 leans toward Sonnet / Opus
Opus-favorable questions (heavy signal reinforcement) 150 Opus 4.8 dominant
Gemini-favorable questions (text-only) 30 Gemini 3.5 Flash exclusive
Lightweight/mid-range questions (curated lightweight 60 + generated 40) 100 Nemotron / V4-Flash / GLM
Public datasets (MMLU 30 + HumanEval 15 + GSM8K 15 + DollyJA 40) 100 Bias avoidance

The 150 Opus-favorable questions are the core of this configuration, with the goal of "feeding in heavy signals where only Opus can answer correctly, so the MLP doesn't get outweighed by the cost-effectiveness of emerging models." The 30 Gemini-favorable questions reinforce Gemini's text-based uniqueness (grounding / diagrams / structuring). No Sonnet-favorable questions were intentionally included (since the top emerging models are tied with it on GPQA, creating differentiating questions would be hard to predict, so the approach was to leave it to the natural distribution). The ratio comes out to 210 heavy-leaning questions (44%) + 30 Gemini-leaning questions (6%).

How to Create the 100 Personal Persona Questions

It helps to articulate your own typical tendencies when prompting Claude Code before creating the Opus-favorable questions, to make it easier to balance them.

Subcategory Count Contents
Existing (article writing / Codex / Tailscale / Hermes / NemoClaw, etc.) 40 Select 40 heavy-leaning ones from previous curated set
Large-scale refactoring plans 15 Multi-axis cases like "50 kLoC monolith to DDD + async," longer code context
Infrastructure construction plans 15 "Multi-region Kubernetes + IaC + monitoring design," "DGX Spark + on-prem GPU cluster setup," etc.
PR reviews 15 Actual code diff (50-200 lines) + review perspectives (security / performance / maintainability)
Issue triage 15 "Isolating a flaky test with unknown reproduction conditions," "Root cause analysis of production incident," etc.

The latter 60 questions (refactoring / infrastructure / PR / issue) are clearly heavy questions, so they work strongly as a "heavy = Opus" signal.

How to Create the 150 Opus-Favorable Questions

I organized the scenarios where Opus 4.8 clearly outperforms top emerging models by characteristic, and structured the 150-question breakdown.

Characteristic Count Example
Philosophy & ethical dilemmas 30 Trolley / Frankfurt / euthanasia cases with multiple conflicting principles
Long-context coherence 25 Pointing out contradictions between character statements and past settings within 4000-5000 character context
Multi-step reasoning 25 Bayesian updating, causal inference, dual of linear programming
Refactoring 20 Redesigning a 50 kLoC monolith with DDD, including migration risk and rollback strategy
Creative writing & persona imitation 20 Maintaining Natsume Soseki style or a specific character's tone over long output
Constrained judgment 15 Code/design that must satisfy 10+ constraints simultaneously
Cultural sensitivity & honorifics 15 Politely disagreeing with a superior in English using Japanese business customs

Detailed question examples are planned to be placed in GitHub's nvidia-llm-router-v3-training/data/questions-opus-favor-150.txt. One thing to be careful of when creating the questions is "not being too abstract"—questions like "please explain ○○" don't produce much differentiation from top emerging models, whereas including constraints like "discuss ○○ aspect of X under constraints Y and Z, divided into N stages in 600 characters" tends to bring out Opus's true value.

How to Create the 30 Gemini-Favorable Questions (Text-Only)

This is the part where I want the MLP to learn Gemini 3.5 Flash's uniqueness. Since the multimodal pathway itself is separated to another layer via the adapter shim mentioned earlier, the training data is limited to 30 questions focused on text-based signals I want the MLP to learn (grounding / diagrams / structuring).

Category Count Example
Google search grounding required 10 "What is the on-demand pricing for AWS Bedrock Claude Sonnet 4.5 as of June 2026, with citations?"
Mermaid diagram output 8 "Show the NVIDIA LLM Router v3 request flow as a Mermaid sequenceDiagram"
Multi-level structured markdown 7 "Describe the Kubernetes Operator pattern in 5 chapters, using tables and callouts"
Citation-based summary 5 "Summarize Anthropic Constitutional AI in 500 characters in [1][2][3] citation format"

The key is explicitly including keywords that require grounding, such as "as of June 2026 or later" and "latest." While the Prefill Router itself can only judge based on text before the call is made, prompts with these keywords carry features of "text that requires Google search grounding," which acts as a signal that indirectly makes the MLP more likely to select Gemini. When the called Gemini then actually runs grounding—that's the flow.

Sampling the 100 Public Dataset Questions

100 questions are sampled from MMLU 30 + HumanEval 15 + GSM8K 15 + DollyJA 40 for bias avoidance. They are deterministically extracted (with fixed seed) from a pre-built pool of 300 questions, in a form that ensures reproducibility.

Training Pipeline Execution

Step 1. Environment Prerequisites and Connectivity Check

The configuration is DGX Spark (aarch64, GB10, 128GB UMA) + vLLM Nemotron 3 Nano 30B-A3B NVFP4 (local) + OpenRouter (remaining 8 models). Use probe-models.py to send Reply with the single word 'pong' to each model and confirm HTTP 200 / latency / cost.

# Please adapt the verification scripts in this article to your own working directory
cd workspace/blog/scripts/nvidia-llm-router-v3-training
uv run --with httpx --with pyyaml scripts/probe-models.py \
  --config configs/my-pool-9models.yaml

Once 9/9 OK, proceed to the next step.

Step 2. Dry-run (10 Questions × 9 Models)

Before the full collect, run a dry-run with 90 calls to estimate cost and latency. Two thinking-related pitfalls were encountered here, but after the 3rd attempt, a fully OK configuration for all 9 models was settled on—with the policy of disabling thinking via extra_body.reasoning.enabled: false plus an exception for Gemini 3.5 Flash with max_tokens: 2048. The actual cost was 1/4 of the preliminary estimate of $45-55, with a prospect of completing all 480 questions for $11.27.

Dry-run 3 Attempts and Thinking Pitfall Details

Running initially with thinking enabled, a phenomenon occurred frequently where GLM 4.7-Flash / Kimi K2.6 / Gemini 3.5 Flash / DeepSeek V4-Pro consumed all max_tokens=1024 with thinking, leaving the actual response empty. A large number of reply_len=0 records appeared, making them unusable as quality signals.

Setting extra_body.reasoning.enabled: false uniformly across the 6 OpenRouter models dramatically changed latency.

Model thinking on thinking off Multiplier
GLM 4.7-Flash 29.75s 5.69s 5.2x
Qwen 3.7-Plus 34.03s 2.94s 11.6x
Kimi K2.6 13.40s 3.51s 3.8x

However, another pitfall awaited here. Gemini 3.5 Flash has mandatory reasoning, and passing reasoning: { enabled: false } returns HTTP 400 ("Reasoning is mandatory for this endpoint and cannot be disabled."). An exception was made for Gemini only—keeping thinking on with max_tokens: 2048 to secure tokens for the actual response.

The cost progression across the 3 attempts was as follows, and it was a quietly pleasant discovery that the combination of max_tokens=1024 + thinking off would be this cost-efficient.

Attempt Cost (90 calls) 480q estimate
1st (thinking on) $0.265 $12.70
2nd (thinking off for all models) $0.148 $7.12
3rd (thinking off + Gemini exception) $0.235 $11.27

Step 3. The judge=vote Pitfall and Switching to judge=llm

After the dry-run was complete, submitting the full collect with judge=vote resulted in disbelief-inducing numbers—per-model accuracy of "Local Nemotron 99.8% / all other models 0-0.2%." The cause was the logic in _judge_vote in collect.py: when each model returns a different natural language response, the Counter ties on count and Local Nemotron—first in insertion order at the head of the pool—is always treated as "correct." This works for MMLU-style option matching, but can't be used for generative tasks.

Switching to judge=llm + Sonnet 4.6 as judge and re-collecting (7h 15m), a self-scoring bias remains since Sonnet is both judge and participant, but a reasonable distribution emerged—Opus at the top, the top 3 emerging models clustered together, Local / GLM at the bottom—making it usable as a quality signal.

Model accuracy (Sonnet judge) avg tokens
Opus 4.8 69.17% 738
Sonnet 4.6 ⁽*⁾ 65.2% 709
DeepSeek V4-Flash 63.1% 724
DeepSeek V4-Pro 61.0% 718
Kimi K2.6 60.0% 946
Qwen 3.7-Plus 56.5% 741
Gemini 3.5 Flash 54.6% 1,719
GLM 4.7-Flash 39.0% 742
Local Nemotron 38.3% 940

⁽*⁾ Sonnet 4.6 also serves as judge, so there is a self-scoring bias

Implementation Details of the judge=vote Bug and Re-collect Command

The initial judge=vote results were as follows, clearly indicating a logic-side issue where the local free model defeats all 8 of the other most powerful models with a 99.8% win rate.

Model accuracy
Local Nemotron 99.8%
DeepSeek V4-Flash 0.0%
GLM 4.7-Flash 0.2%
DeepSeek V4-Pro 0.2%
Qwen 3.7-Plus 0.2%
Kimi K2.6 0.0%
Gemini 3.5 Flash 0.0%
Claude Sonnet 4.6 0.0%
Claude Opus 4.8 0.0%

Reading _judge_vote in collect.py immediately revealed the cause.

def _normalize(text: str) -> str:
    return " ".join(re.split(r"\s+", text.strip().lower()))

def _judge_vote(outputs: list[str]) -> str:
    normalized = [_normalize(o) for o in outputs]
    counts = Counter(normalized)
    return counts.most_common(1)[0][0]

When each model returns a different natural language response, all strings remain unique even after normalization. Since Counter respects insertion order for ties, the response from the first model (Local Nemotron, at Slot 1 in the pool YAML) always becomes the majority. Only models whose response matches Local's are judged as "correct," and everything else gets 0%.

The re-collect command is as follows (with the judge parallelization patch applied, max_workers=5, targeting the upper limit that doesn't hit OpenRouter's rate limit).

model-router collect \
  --config configs/my-pool-9models.yaml \
  --questions data/questions-9models.txt \
  --judge llm \
  --judge-model openrouter/anthropic/claude-sonnet-4.6 \
  --output data/collected-my-pool-9models.csv

Those building their own routing for other use cases should watch out for the same pitfall. If you're using MMLU-style tasks (where A/B/C/D options are returned as strings), it works as-is, but for generative tasks, choosing judge=llm is the safe option.

Step 4. Train (Completed in 5 Minutes)

model-router train \
  --config configs/my-pool-9models.yaml \
  --data data/collected-my-pool-9models.csv \
  --output-dir checkpoints/my-router-9models/

Training runs with accelerate enabled, 9-dim MLP, and Qwen/Qwen3.5-0.8B encoder. It completed through Stages 1-6 (Load → Extract → PCA → MLP ensemble → Calibrate → Save) in 5 minutes and 1 second.

The per-model AUC of the completed checkpoint is as follows.

Model AUC (Shared trunk ensemble)
Qwen 3.7-Plus 0.9430 (highest among 9 models)
Gemini 3.5 Flash 0.9250
Claude Sonnet 4.6 0.9162
Claude Opus 4.8 0.9103
DeepSeek V4-Flash 0.9065
GLM 4.7-Flash 0.8937
Local Nemotron 0.8837
DeepSeek V4-Pro 0.8757
Kimi K2.6 0.8517

All models achieved AUC above 0.85, exceeding the anticipated target (per-model AUC 0.5-0.9). Compared to the default checkpoint where p_max was 0.07-0.10, it's a quiet revelation that just organizing the training data tightens up the MLP's predictions this much.

Step 5. evaluate (Quality Improvement by the Numbers)

model-router evaluate \
  --config configs/my-pool-9models.yaml \
  --checkpoint checkpoints/my-router-9models/prefill_router.pt \
  --data data/collected-my-pool-9models.csv \
  --output results/eval-my-router-9models.json
Metric Value
Oracle accuracy 0.9250
Best single model 0.6917 (Opus 4.8 alone)
Router accuracy (argmax) 0.7937 (+10.20pp vs Opus alone)
Headroom captured 43.7%

The quantitative picture here is: "Compared to the 69.17% accuracy when solving 480 questions with Opus alone, the router achieves 79.37%, an improvement of over 10 percentage points." Of the gap from the Oracle accuracy (the upper bound when always choosing the best model) of 92.5%, the router closes 43.7% of it.

Let's also take a look at the argmax distribution from evaluate.

Model Times selected Ratio Accuracy when selected
Claude Opus 4.8 207 43.1% 64.7%
Claude Sonnet 4.6 111 23.1% 99.1%
Gemini 3.5 Flash 59 12.3% 98.3%
DeepSeek V4-Pro 33 6.9% 69.7%
Qwen 3.7-Plus 31 6.5% 90.3%
DeepSeek V4-Flash 23 4.8% 82.6%
Kimi K2.6 16 3.3% 56.3%
Local Nemotron 0 0.0%
GLM 4.7-Flash 0 0.0%

Opus 4.8 leads with 43.1% selection rate, followed by Sonnet 4.6 at 23.1% with a remarkable 99.1% accuracy. Gemini 3.5 Flash also comes in at 12.3% with 98.3%. At design time I said "Sonnet might get selected occasionally," but when the lid came off, it turned out to be consistently selected as the mid-weight quality middle player — which was a pleasant surprise.

It may seem counterintuitive that Opus at 64.7% is lower than Sonnet at 99.1%, but this is simply because "the difficulty of the questions selected by argmax differs." The 207 questions where Opus is selected by argmax are a cluster of hard questions where the MLP judged that "P(correct) won't be high unless it's Opus," so even Opus faces a higher bar to be judged correct by the judge, dropping to 64.7%. Sonnet/Gemini's 99.1%/98.3% result from being assigned relatively easy questions where the MLP judged "these can be solved without Opus." I'll just add a note that the earlier figure of 69.17% for Opus alone is aggregated across all 480 questions, so it comes from a different set than the 64.7% here (207-question subset).

Conversely, Local Nemotron and GLM 4.7-Flash have 0% in argmax. Their accuracy itself is around 38–39%, but under argmax judgment with tolerance=0, they always lose because "other models produce a higher P(correct) for the same question," so in this configuration where training data skews 44% toward heavy models, they never get a turn. If you raise the tolerance to 0.20, they will properly be picked up for lightweight questions.

Behavior of the Persona-Optimized Checkpoint

Routing demonstration by tolerance (bench_persona_3tol)

bench_persona_3tol.py is a script that submits 5 representative question types (casual chat / code generation / technical explanation / math proof / philosophy) at 3 tolerance levels (0.05 / 0.10 / 0.20) and compares the resulting routing distribution and cost for each. I launched the new checkpoint with model-router serve --port 8204 and ran it through.

tolerance Mainly selected models Total cost (5 questions) Reduction rate (vs All-Opus)
0.05 deepseek-v4-flash dominant $0.00055 99.3%
0.10 deepseek-v4-flash + glm-4.7-flash $0.00080 99.0%
0.20 Lightweight + Local Nemotron + glm $0.00133 98.3%
(All-Opus) Opus 4.6 direct call (bench script fixed) $0.07664 Baseline

The (All-Opus) row uses anthropic/claude-opus-4-6 as the comparison baseline fixed in the bench script, so it's the old Opus 4.6 rather than Slot 9 (Opus 4.8) in the new pool. The output prices are close enough that this is fine as a comparison baseline, but if you want to align the numbers precisely, rewrite OPUS_MODEL in the bench script to claude-opus-4.8.

The 5 bench questions are all "lightweight to mid-weight," so Opus not appearing is as expected. The fact that even at tol=0.05 deepseek-v4-flash dominates is a gap compared to V4-Flash being selected only 4.8% in evaluate (480 questions / argmax), but this is because the 5 bench questions are skewed toward the left tail (lightweight side) of the difficulty distribution in evaluate. Raising tolerance to 0.20 brings Local Nemotron into the mix, making lightweight questions effectively zero-cost.

Interpreting the routing distribution and Sonnet's role

Let me organize the point that routing behavior looks completely different between evaluate (480 questions / argmax) and bench (5 light/mid-weight questions / with tolerance).

Evaluation axis Environment Mainly selected models Interpretation
evaluate 480 questions / argmax (tol=0) Opus 43.1% + Sonnet 23.1% + Gemini 12.3% Quality-preserving routing
bench_persona_3tol 5 light/mid-weight questions / tol 0.05–0.20 deepseek-v4-flash + glm-4.7-flash + Local Cost-reduction routing

This is the ideal form of the design — routing automatically switching based on question difficulty and tolerance — realized as intended. The balance is effective: heavy questions are handled by Opus/Sonnet to preserve quality, while lightweight questions use lightweight models to cut costs.

At the design stage, my stance was "Sonnet is tied with emerging upper-tier models (V4-Pro / Qwen 3.7-Plus / Kimi K2.6) on GPQA, so I won't add differentiation questions in the training data — it might get selected occasionally." But evaluate showed a 23.1% selection rate and 99.1% accuracy. Since cost isn't factored into argmax (tol=0), this can't be explained by "selected because it's cheap." The possible explanations are: (1) using Sonnet 4.6 as the judge introduced a self-scoring bias that boosted Sonnet's correct labels, causing the MLP to learn a higher P(correct) for Sonnet; or (2) for the 111 selected questions, the MLP's Sonnet predictions genuinely exceeded Opus. Without re-collecting with a different judge (GPT-5.4 or Gemini 3.5 Pro), we can't separate these. But the fact that Sonnet is stably functioning as a mid-weight quality middle player remains unchanged, and the design ideal of "maintaining quality while also reducing cost" is working in a quantitatively effective way.

If you want to increase Local Nemotron's selection rate

The result was 0% for Local Nemotron in evaluate's argmax, and only 2/5 selected at bench tol=0.20. If you want to "push more brainstorming-type tasks to Local," you can adjust in 3 steps: (1) add extra_body.routing.tolerance: 0.20 to the request body to raise tolerance per request (easiest, no restart needed); (2) spin up a separate server on a different port with --models limited to just the lightweight pool; (3) add 50–100 casual chat / short translation / simple memo-organizing questions to the training data and re-collect / re-train. In practice, (1) is the most straightforward, and you can run multiple use cases through a single routing service by using per-profile settings in Hermes Agent like "tolerance=0.20 for casual chat, 0.05 for code generation via Claude Code." The topic of integrating this with flows running on the Hermes side is planned for coverage in the practical guide.

The Cost Cliff Visible in Comparison with the Old 5-Model Checkpoint

For reference, here is the routing distribution of the old 5-model pool mentioned at the start of the Pool redesign section (Local Nemotron / gpt-oss-120b / Kimi K2.7 Code / GLM 5.2 / Opus 4.6). When 5 use cases were submitted, all 4 non-lightweight cases were routed to gpt-oss-120b, with high confidence (p_max 0.79–0.94), but Opus / Kimi / GLM were never called in a structure where they'd never have a turn.

Use case Selected model p_max
Casual chat (lightweight) local-nemotron-3-nano 0.94
Code generation (mid) gpt-oss-120b 0.87
Technical explanation (mid) gpt-oss-120b 0.83
Math proof (mid-heavy) gpt-oss-120b 0.79
Philosophy discussion (heavy) gpt-oss-120b 0.81

When placed side by side with the new 9-model ladder, the 14x jump that caused the problem is visually striking.

※ All prices inside Mermaid are standardized to output prices. For input prices, refer to the individual tables shown earlier.

As a fundamental practice when building your own routing, if you keep in mind from the start that "the cost multiplier between adjacent models should be within 3x, ideally around 2x," you can more easily avoid cost cliffs like this one.

Summary

Starting from a feeling of "this might not quite match my preferences" after going through the NVIDIA LLM Router v3 default checkpoint, I went all the way through rebuilding a 9-model pool from scratch and running persona training on 480 questions. The result showed a 43.1% Opus 4.8 selection rate maintaining quality, while lightweight/mid-weight questions with tolerance 0.05–0.20 achieved 98–99%+ cost reduction (up to 99.3%), giving a concrete sense of what quality-preserving routing can deliver. The pitfalls I hit along the way (thinking on consuming all max_tokens, the insertion order bug in judge=vote, Gemini's mandatory reasoning, the 14x cost cliff jump) are written down in concrete detail in hopes they'll be useful to others building their own routing for different use cases. In particular, I think "keep the cost multiplier between adjacent models within 3x, ideally around 2x" is a good foundational practice to be aware of from the start when designing a pool.

To avoid letting it end at "built and done," observability after putting the checkpoint into production also becomes important. The routing service used in this article has a thin Langfuse callback and routing decision stdout logging layered in, allowing after-the-fact confirmation of which models were called how many times, what the resulting costs were, and how the distribution shifts when tolerance is changed. I'm thinking of covering this observability topic more thoroughly in a separate article.

Even labeled "reference implementation only," v3 is this much fun to play with — the feeling of training it and having it quietly conform to your own persona is quietly enjoyable.

NVIDIA LLM Router

Upstream PRs submitted during verification for this article

  • OpenRouter — API gateway used as the upstream for the pool in this article
  • Langfuse — Integrated via LiteLLM callback for observability
  • lmarena.ai — Source of the Arena Elo figures referenced in the benchmark tables in this article

国内企業 AI活用実態調査2026 配布中

クラスメソッドが独自に行なったAI診断調査をもとに、企業のAI活用の現在地を調査レポートとしてまとめました。企業規模別の活用度傾向に加え、規模を超えてAI活用を進める企業に共通する取り組みまで、自社の現在地を捉えるためのヒントにぜひ。

国内企業 AI活用実態調査2026

無料でダウンロードする

Share this article