
I tried retraining the NVIDIA LLM Router to match my own persona (Training Edition)
This page has been translated by machine translation. View original
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
After running NVIDIA LLM Router v3 end-to-end in the basics article, I used the default checkpoint for my own work for a while. A 9-model general-purpose pool of Nemotron / GPT-OSS / Qwen / GPT-5 / Opus variants (v1-9models-qwen08b.yaml) is distributed as standard, and routing judgment works just fine with this alone.
After tinkering with it for a while, what started to bother me was that this default pool was "slightly off from my preferences." The latest-generation Opus 4.8 / Sonnet 4.6 / Gemini 3.5 Flash weren't included yet, and the lineup didn't match the emerging model group I typically want to call (DeepSeek V4 / Qwen 3.7 / Kimi K2.6 / GLM 4.7, etc.). The MLP training data also seemed to be organized with general-purpose use in mind, so it didn't appear to be tuned to my usage patterns (which tend to split to the extremes: heavy design discussions and light brainstorming sessions).
Fortunately, NVIDIA LLM Router v3 officially provides all the resources needed to rebuild a custom checkpoint in 3 steps: collect → train → evaluate. By reorganizing the pool YAML to your liking and converting your everyday questions into training data, you can build routing from scratch tailored to that individual. Since the framework was already in place, the starting point of this article is the decision to reorganize it once to suit my own persona.
Specifically, I took my personal persona (my typical usage of Claude Code / Codex, blog articles I'm writing, design discussions around Hermes Agent and NemoClaw, etc.) and converted it into 480 training questions, then rebuilt a checkpoint from scratch for a new 9-model pool (including the latest Opus 4.8 / Sonnet 4.6 / Gemini 3.5 Flash). By designing with quality-preserving routing in mind (heavy questions properly go to Opus, light questions go to cheaper models), I arrived at a configuration that raises Opus 4.8's adoption rate to 43.1% while demonstrating 98–99% cost reduction (up to 99.3%) for light-to-medium questions.
As for LLM cost optimization, there's also the approach of handling it with rule-based routing by task type, as in MindStudio's "Run Local AI Models with Claude Code" explanation. The reason I chose NVIDIA LLM Router v3 this time was that I wanted to train the Qwen3.5-0.8B encoder + PCA + MLP on the quality × price tradeoff within the same task type (even for the same code generation, deep dependency cases go to Opus, light one-liners go to gpt-oss-120b). If you just want to offload auxiliary tasks locally, rule-based routing is sufficient, so I think using them based on the use case is the practical approach. In this article, I'll focus on the use case of "someone who has been using Claude Code with Opus as the default and wants to switch to routing without degrading quality."
Pool Redesign Policy
Lessons from the Failed 5-Model Configuration
In the first 5-model pool I built (Local + gpt-oss-120b + Kimi K2.7 + GLM 5.2 + Opus 4.6), with 5 use cases input, 4 out of 5 non-lightweight cases all ended up consolidated on gpt-oss-120b. Looking at the cause, the pricing ladder was distorted—there was a 14x cost cliff between gpt-oss-120b ($0.05 / $0.21) and the next most expensive kimi ($0.74 / $3.50). The AUC difference with Opus was only about 5 points, and the MLP honestly learned that "gpt-oss-120b is sufficient for the mid-range tier," factoring in cost-effectiveness.
This is a trap that subtly affects LLM Router configurations—when the cost difference between adjacent models is too large, the MLP tends to lean toward the cheaper side. I experienced firsthand here that keeping a continuous gradient of roughly 1.5–3x steps is the basic practice for building quality-preserving routing.
9-Model Ladder
The final configuration I rebuilt is as follows.
| Slot | Model | OpenRouter slug | output $/M | Adjacent multiplier | Role |
|---|---|---|---|---|---|
| 1 | Nemotron 3 Nano 30B-A3B (Local) ⁽*⁾ | openrouter/nvidia/nemotron-3-nano-30b-a3b |
0 ⁽*⁾ | — | Local free anchor, lightweight reasoning |
| 2 | DeepSeek V4-Flash | deepseek/deepseek-v4-flash |
0.18 | — | Lightweight general-purpose (emerging 1 / code / long ctx 1M) |
| 3 | GLM 4.7 Flash | z-ai/glm-4.7-flash |
0.40 | 2.22x | Lightweight (emerging 2 / Chinese accuracy / 202k ctx) |
| 4 | DeepSeek V4-Pro | deepseek/deepseek-v4-pro |
0.87 | 2.18x | Mid-range inference (emerging 3, cost-effective / long ctx 1M) |
| 5 | Qwen 3.7-Plus | qwen/qwen3.7-plus |
1.28 | 1.47x | Mid-range general-purpose (emerging 4 / latest generation / long ctx 1M) |
| 6 | Kimi K2.6 | moonshotai/kimi-k2.6 |
3.50 | 2.73x | Mid-heavy general-purpose (emerging 5, high versatility) |
| 7 | Gemini 3.5 Flash | google/gemini-3.5-flash |
9.00 | 2.57x | Mid-heavy reasoning + multimodal unique pathway |
| 8 | Claude Sonnet 4.6 | anthropic/claude-sonnet-4.6 |
15.00 | 1.67x | Heavy quality middle, bridge to Opus |
| 9 | Claude Opus 4.8 | anthropic/claude-opus-4.8 |
25.00 | 1.67x | Heaviest, quality anchor |
⁽*⁾ Slot 1 is running on the OpenRouter Nano paid version ($0.05 / $0.20) to improve reproducibility in the series, but cost_per_m_*_tokens in the pool YAML is maintained at 0. If you have an environment where you can run models locally, you can switch Slot 1 to local operation by replacing litellm_model with openai/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 + api_base: http://localhost:8000/v1. The MLP label remains local-nemotron-3-nano, and the routing judgment can be used as-is without retraining.
Note that the slug notation varies by provider (Anthropic / Google / Moonshot / Z.AI use dot format, DeepSeek / NVIDIA use hyphen format)—this is OpenRouter's specification. This table reproduces slugs confirmed against OpenRouter /api/v1/models. In the default pool table from the basics article, hyphen-format Anthropic slugs (anthropic/claude-opus-4-6) were listed, as this matched the default config v1-9models-qwen08b.yaml exactly as it appeared at the time of Part 1 publication.
The key design point is keeping the adjacent multiplier to a maximum of 2.73x. The goal is to reduce it by about 5x from the 14x jump in the old 5-model configuration and prevent the phenomenon where the MLP leans toward a lower-tier model due to cost cliffs.
※ The display in Mermaid shows output unit price only. Refer to the table above for input unit prices.
The Reality of 2026-Generation Benchmarks
Before finalizing the configuration, I wanted to organize the actual capabilities of each of the 9 models, but it appeared that 2026-generation frontier models have removed the traditional MMLU / HumanEval / MBPP / Aider polyglot from their official benchmarks. Almost all of Anthropic / DeepMind / Z.AI / Moonshot / Qwen have dropped them from official declarations. The reality seems to be that metrics are now converging on about 7 axes: GPQA Diamond / AIME / HLE / SWE-bench Verified / MMMU-Pro / LiveCodeBench v6 / Arena Elo.
The following is a reorganization centered on GPQA Diamond and Arena Elo for practical verification (as of writing, June 2026; sources are each company's official June 2026 releases and lmarena.ai).
| Model | GPQA Diamond | SWE-bench Verified | MMMU-Pro | Arena Elo |
|---|---|---|---|---|
| Nemotron 3 Nano 30B-A3B | 71.9 | 38.8 | N/A | N/A |
| DeepSeek V4-Flash | 87.4 | ~74 | (text) | N/A |
| GLM 4.7-Flash | 75.2 | 59.2 | (text) | N/A |
| DeepSeek V4-Pro | 90.1 | 80.6 | (text) | 1467 |
| Qwen 3.7-Plus | 90.3 (vendor) | ~78 | N/A | N/A |
| Kimi K2.6 | 90.5 | 80.2 | 79.4 | 1466 |
| Gemini 3.5 Flash | 90.4 (parent) | 78 (internal) | 83.6 | 1480 |
| Claude Sonnet 4.6 | 89.9 | 79.6 | 74.5 | 1467 |
| Claude Opus 4.8 | 93.6 | 88.6 | N/A | 1512 |
※ (vendor) = vendor self-reported value, (parent) = equivalent value from parent model (Gemini 3 Pro series) used as proxy, (internal) = internal evaluation.
What surprised me when laying it out was that the top 3 emerging models (V4-Pro / Qwen 3.7-Plus / Kimi K2.6) are nearly tied with Sonnet 4.6 on GPQA / Arena. The price difference is 17.2x at V4-Pro $0.87 vs Sonnet $15, so there was an argument for "removing Sonnet from the pool," but I kept it in for operational reasons with a 9-slot setup and to capture any differences in Anthropic's long-context coherence and safety nuances. As described later, Sonnet 4.6 ends up fulfilling its role as "mid-heavy quality middle" with a 23.1% adoption rate.
Gemini 3.5 Flash's Role and Securing the Multimodal Pathway
Another tricky aspect of pool design was how to handle Gemini 3.5 Flash. While it excels notably in multimodal capabilities (MMMU-Pro 83.6), NVIDIA LLM Router v3's Prefill Router is text-only by design (taking hidden states from the prompt via the Qwen3.5-0.8B encoder), meaning image / audio / video blocks don't factor into the judgment. With a short prompt like "Please analyze the image below," the moment routing selects DeepSeek or Local Nemotron, OpenRouter returns 404 No endpoints found that support input image.
The approach here was to ensure Gemini's text-based role (Google grounding / diagram structuring / Arena Elo 1480 general quality) through training data, while adding a thin shim to the litellm adapter layer to handle the multimodal pathway at a separate layer. It runs with a static capability lookup for the 9-model pool (image → {gemini-3.5-flash, sonnet-4.6, opus-4.8} / audio/video → {gemini-3.5-flash}), and in real-machine probes, 10×10 PNG / 1-second WAV / 1-second MP4 were all routed to Gemini 3.5 Flash with image, audio, and video analysis returned. This shim has been submitted as a proposal upstream as PR #34, so interested parties are welcome to check it out. Building full-fledged multimodal routing based on CLIP (NVIDIA AI Blueprint v2's Auto-Router) is a different story, but I think this thin shim adequately covers the practical need of "properly routing image / audio / video requests to a capable upstream model."
Collect Question Data Design
Overall Structure
The training data consists of 480 questions in total, broken down as follows.
| Category | Count | Primary model |
|---|---|---|
| Personal persona (curated heavy-leaning 40 + new heavy 60) | 100 | Heavy 60 leans toward Sonnet / Opus |
| Opus-favorable questions (heavy signal reinforcement) | 150 | Opus 4.8 dominant |
| Gemini-favorable questions (text-only) | 30 | Gemini 3.5 Flash exclusive |
| Lightweight/mid-range questions (curated lightweight 60 + generated 40) | 100 | Nemotron / V4-Flash / GLM |
| Public datasets (MMLU 30 + HumanEval 15 + GSM8K 15 + DollyJA 40) | 100 | Bias avoidance |
The 150 Opus-favorable questions are the core of this configuration, with the goal of "feeding in heavy signals where only Opus can answer correctly, so the MLP doesn't get outweighed by the cost-effectiveness of emerging models." The 30 Gemini-favorable questions reinforce Gemini's text-based uniqueness (grounding / diagrams / structuring). No Sonnet-favorable questions were intentionally included (since the top emerging models are tied with it on GPQA, creating differentiating questions would be hard to predict, so the approach was to leave it to the natural distribution). The ratio comes out to 210 heavy-leaning questions (44%) + 30 Gemini-leaning questions (6%).
How to Create the 100 Personal Persona Questions
It helps to articulate your own typical tendencies when prompting Claude Code before creating the Opus-favorable questions, to make it easier to balance them.
| Subcategory | Count | Contents |
|---|---|---|
| Existing (article writing / Codex / Tailscale / Hermes / NemoClaw, etc.) | 40 | Select 40 heavy-leaning ones from previous curated set |
| Large-scale refactoring plans | 15 | Multi-axis cases like "50 kLoC monolith to DDD + async," longer code context |
| Infrastructure construction plans | 15 | "Multi-region Kubernetes + IaC + monitoring design," "DGX Spark + on-prem GPU cluster setup," etc. |
| PR reviews | 15 | Actual code diff (50-200 lines) + review perspectives (security / performance / maintainability) |
| Issue triage | 15 | "Isolating a flaky test with unknown reproduction conditions," "Root cause analysis of production incident," etc. |
The latter 60 questions (refactoring / infrastructure / PR / issue) are clearly heavy questions, so they work strongly as a "heavy = Opus" signal.
How to Create the 150 Opus-Favorable Questions
I organized the scenarios where Opus 4.8 clearly outperforms top emerging models by characteristic, and structured the 150-question breakdown.
| Characteristic | Count | Example |
|---|---|---|
| Philosophy & ethical dilemmas | 30 | Trolley / Frankfurt / euthanasia cases with multiple conflicting principles |
| Long-context coherence | 25 | Pointing out contradictions between character statements and past settings within 4000-5000 character context |
| Multi-step reasoning | 25 | Bayesian updating, causal inference, dual of linear programming |
| Refactoring | 20 | Redesigning a 50 kLoC monolith with DDD, including migration risk and rollback strategy |
| Creative writing & persona imitation | 20 | Maintaining Natsume Soseki style or a specific character's tone over long output |
| Constrained judgment | 15 | Code/design that must satisfy 10+ constraints simultaneously |
| Cultural sensitivity & honorifics | 15 | Politely disagreeing with a superior in English using Japanese business customs |
Detailed question examples are planned to be placed in GitHub's nvidia-llm-router-v3-training/data/questions-opus-favor-150.txt. One thing to be careful of when creating the questions is "not being too abstract"—questions like "please explain ○○" don't produce much differentiation from top emerging models, whereas including constraints like "discuss ○○ aspect of X under constraints Y and Z, divided into N stages in 600 characters" tends to bring out Opus's true value.
How to Create the 30 Gemini-Favorable Questions (Text-Only)
This is the part where I want the MLP to learn Gemini 3.5 Flash's uniqueness. Since the multimodal pathway itself is separated to another layer via the adapter shim mentioned earlier, the training data is limited to 30 questions focused on text-based signals I want the MLP to learn (grounding / diagrams / structuring).
| Category | Count | Example |
|---|---|---|
| Google search grounding required | 10 | "What is the on-demand pricing for AWS Bedrock Claude Sonnet 4.5 as of June 2026, with citations?" |
| Mermaid diagram output | 8 | "Show the NVIDIA LLM Router v3 request flow as a Mermaid sequenceDiagram" |
| Multi-level structured markdown | 7 | "Describe the Kubernetes Operator pattern in 5 chapters, using tables and callouts" |
| Citation-based summary | 5 | "Summarize Anthropic Constitutional AI in 500 characters in [1][2][3] citation format" |
The key is explicitly including keywords that require grounding, such as "as of June 2026 or later" and "latest." While the Prefill Router itself can only judge based on text before the call is made, prompts with these keywords carry features of "text that requires Google search grounding," which acts as a signal that indirectly makes the MLP more likely to select Gemini. When the called Gemini then actually runs grounding—that's the flow.
Sampling the 100 Public Dataset Questions
100 questions are sampled from MMLU 30 + HumanEval 15 + GSM8K 15 + DollyJA 40 for bias avoidance. They are deterministically extracted (with fixed seed) from a pre-built pool of 300 questions, in a form that ensures reproducibility.
Training Pipeline Execution
Step 1. Environment Prerequisites and Connectivity Check
The configuration is DGX Spark (aarch64, GB10, 128GB UMA) + vLLM Nemotron 3 Nano 30B-A3B NVFP4 (local) + OpenRouter (remaining 8 models). Use probe-models.py to send Reply with the single word 'pong' to each model and confirm HTTP 200 / latency / cost.
# Please adapt the verification scripts in this article to your own working directory
cd workspace/blog/scripts/nvidia-llm-router-v3-training
uv run --with httpx --with pyyaml scripts/probe-models.py \
--config configs/my-pool-9models.yaml
Once 9/9 OK, proceed to the next step.
Step 2. Dry-run (10 Questions × 9 Models)
Before the full collect, run a dry-run with 90 calls to estimate cost and latency. Two thinking-related pitfalls were encountered here, but after the 3rd attempt, a fully OK configuration for all 9 models was settled on—with the policy of disabling thinking via extra_body.reasoning.enabled: false plus an exception for Gemini 3.5 Flash with max_tokens: 2048. The actual cost was 1/4 of the preliminary estimate of $45-55, with a prospect of completing all 480 questions for $11.27.
Dry-run 3 Attempts and Thinking Pitfall Details
Running initially with thinking enabled, a phenomenon occurred frequently where GLM 4.7-Flash / Kimi K2.6 / Gemini 3.5 Flash / DeepSeek V4-Pro consumed all max_tokens=1024 with thinking, leaving the actual response empty. A large number of reply_len=0 records appeared, making them unusable as quality signals.
Setting extra_body.reasoning.enabled: false uniformly across the 6 OpenRouter models dramatically changed latency.
| Model | thinking on | thinking off | Multiplier |
|---|---|---|---|
| GLM 4.7-Flash | 29.75s | 5.69s | 5.2x |
| Qwen 3.7-Plus | 34.03s | 2.94s | 11.6x |
| Kimi K2.6 | 13.40s | 3.51s | 3.8x |
However, another pitfall awaited here. Gemini 3.5 Flash has mandatory reasoning, and passing reasoning: { enabled: false } returns HTTP 400 ("Reasoning is mandatory for this endpoint and cannot be disabled."). An exception was made for Gemini only—keeping thinking on with max_tokens: 2048 to secure tokens for the actual response.
The cost progression across the 3 attempts was as follows, and it was a quietly pleasant discovery that the combination of max_tokens=1024 + thinking off would be this cost-efficient.
| Attempt | Cost (90 calls) | 480q estimate |
|---|---|---|
| 1st (thinking on) | $0.265 | $12.70 |
| 2nd (thinking off for all models) | $0.148 | $7.12 |
| 3rd (thinking off + Gemini exception) | $0.235 | $11.27 |
Step 3. The judge=vote Pitfall and Switching to judge=llm
After the dry-run was complete, submitting the full collect with judge=vote resulted in disbelief-inducing numbers—per-model accuracy of "Local Nemotron 99.8% / all other models 0-0.2%." The cause was the logic in _judge_vote in collect.py: when each model returns a different natural language response, the Counter ties on count and Local Nemotron—first in insertion order at the head of the pool—is always treated as "correct." This works for MMLU-style option matching, but can't be used for generative tasks.
Switching to judge=llm + Sonnet 4.6 as judge and re-collecting (7h 15m), a self-scoring bias remains since Sonnet is both judge and participant, but a reasonable distribution emerged—Opus at the top, the top 3 emerging models clustered together, Local / GLM at the bottom—making it usable as a quality signal.
| Model | accuracy (Sonnet judge) | avg tokens |
|---|---|---|
| Opus 4.8 | 69.17% | 738 |
| Sonnet 4.6 ⁽*⁾ | 65.2% | 709 |
| DeepSeek V4-Flash | 63.1% | 724 |
| DeepSeek V4-Pro | 61.0% | 718 |
| Kimi K2.6 | 60.0% | 946 |
| Qwen 3.7-Plus | 56.5% | 741 |
| Gemini 3.5 Flash | 54.6% | 1,719 |
| GLM 4.7-Flash | 39.0% | 742 |
| Local Nemotron | 38.3% | 940 |
⁽*⁾ Sonnet 4.6 also serves as judge, so there is a self-scoring bias
Implementation Details of the judge=vote Bug and Re-collect Command
The initial judge=vote results were as follows, clearly indicating a logic-side issue where the local free model defeats all 8 of the other most powerful models with a 99.8% win rate.
| Model | accuracy |
|---|---|
| Local Nemotron | 99.8% |
| DeepSeek V4-Flash | 0.0% |
| GLM 4.7-Flash | 0.2% |
| DeepSeek V4-Pro | 0.2% |
| Qwen 3.7-Plus | 0.2% |
| Kimi K2.6 | 0.0% |
| Gemini 3.5 Flash | 0.0% |
| Claude Sonnet 4.6 | 0.0% |
| Claude Opus 4.8 | 0.0% |
Reading _judge_vote in collect.py immediately revealed the cause.
def _normalize(text: str) -> str:
return " ".join(re.split(r"\s+", text.strip().lower()))
def _judge_vote(outputs: list[str]) -> str:
normalized = [_normalize(o) for o in outputs]
counts = Counter(normalized)
return counts.most_common(1)[0][0]
When each model returns a different natural language response, all strings remain unique even after normalization. Since Counter respects insertion order for ties, the response from the first model (Local Nemotron, at Slot 1 in the pool YAML) always becomes the majority. Only models whose response matches Local's are judged as "correct," and everything else gets 0%.
The re-collect command is as follows (with the judge parallelization patch applied, max_workers=5, targeting the upper limit that doesn't hit OpenRouter's rate limit).
model-router collect \
--config configs/my-pool-9models.yaml \
--questions data/questions-9models.txt \
--judge llm \
--judge-model openrouter/anthropic/claude-sonnet-4.6 \
--output data/collected-my-pool-9models.csv
Those building their own routing for other use cases should watch out for the same pitfall. If you're using MMLU-style tasks (where A/B/C/D options are returned as strings), it works as-is, but for generative tasks, choosing judge=llm is the safe option.
Step 4. Train (Completed in 5 Minutes)
model-router train \
--config configs/my-pool-9models.yaml \
--data data/collected-my-pool-9models.csv \
--output-dir checkpoints/my-router-9models/
Training runs with accelerate enabled, 9-dim MLP, and Qwen/Qwen3.5-0.8B encoder. It completed through Stages 1-6 (Load → Extract → PCA → MLP ensemble → Calibrate → Save) in 5 minutes and 1 second.
The per-model AUC of the completed checkpoint is as follows.
| Model | AUC (Shared trunk ensemble) |
|---|---|
| Qwen 3.7-Plus | 0.9430 (highest among 9 models) |
| Gemini 3.5 Flash | 0.9250 |
| Claude Sonnet 4.6 | 0.9162 |
| Claude Opus 4.8 | 0.9103 |
| DeepSeek V4-Flash | 0.9065 |
| GLM 4.7-Flash | 0.8937 |
| Local Nemotron | 0.8837 |
| DeepSeek V4-Pro | 0.8757 |
| Kimi K2.6 | 0.8517 |
All models achieved AUC above 0.85, exceeding the anticipated target (per-model AUC 0.5-0.9). Compared to the default checkpoint where p_max was 0.07-0.10, it's a quiet revelation that just organizing the training data tightens up the MLP's predictions this much.
Step 5. evaluate (Quality Improvement by the Numbers)
model-router evaluate \
--config configs/my-pool-9models.yaml \
--checkpoint checkpoints/my-router-9models/prefill_router.pt \
--data data/collected-my-pool-9models.csv \
--output results/eval-my-router-9models.json
| Metric | Value |
|---|---|
| Oracle accuracy | 0.9250 |
| Best single model | 0.6917 (Opus 4.8 alone) |
| Router accuracy (argmax) | 0.7937 (+10.20pp vs Opus alone) |
| Headroom captured | 43.7% |
The quantitative picture here is: "Compared to the 69.17% accuracy when solving 480 questions with Opus alone, the router achieves 79.37%, an improvement of over 10 percentage points." Of the gap from the Oracle accuracy (the upper bound when always choosing the best model) of 92.5%, the router closes 43.7% of it.
Let's also take a look at the argmax distribution from evaluate.
| Model | Times selected | Ratio | Accuracy when selected |
|---|---|---|---|
| Claude Opus 4.8 | 207 | 43.1% | 64.7% |
| Claude Sonnet 4.6 | 111 | 23.1% | 99.1% |
| Gemini 3.5 Flash | 59 | 12.3% | 98.3% |
| DeepSeek V4-Pro | 33 | 6.9% | 69.7% |
| Qwen 3.7-Plus | 31 | 6.5% | 90.3% |
| DeepSeek V4-Flash | 23 | 4.8% | 82.6% |
| Kimi K2.6 | 16 | 3.3% | 56.3% |
| Local Nemotron | 0 | 0.0% | — |
| GLM 4.7-Flash | 0 | 0.0% | — |
Opus 4.8 leads with 43.1% selection rate, followed by Sonnet 4.6 at 23.1% with a remarkable 99.1% accuracy. Gemini 3.5 Flash also comes in at 12.3% with 98.3%. At design time I said "Sonnet might get selected occasionally," but when the lid came off, it turned out to be consistently selected as the mid-weight quality middle player — which was a pleasant surprise.
It may seem counterintuitive that Opus at 64.7% is lower than Sonnet at 99.1%, but this is simply because "the difficulty of the questions selected by argmax differs." The 207 questions where Opus is selected by argmax are a cluster of hard questions where the MLP judged that "P(correct) won't be high unless it's Opus," so even Opus faces a higher bar to be judged correct by the judge, dropping to 64.7%. Sonnet/Gemini's 99.1%/98.3% result from being assigned relatively easy questions where the MLP judged "these can be solved without Opus." I'll just add a note that the earlier figure of 69.17% for Opus alone is aggregated across all 480 questions, so it comes from a different set than the 64.7% here (207-question subset).
Conversely, Local Nemotron and GLM 4.7-Flash have 0% in argmax. Their accuracy itself is around 38–39%, but under argmax judgment with tolerance=0, they always lose because "other models produce a higher P(correct) for the same question," so in this configuration where training data skews 44% toward heavy models, they never get a turn. If you raise the tolerance to 0.20, they will properly be picked up for lightweight questions.
Behavior of the Persona-Optimized Checkpoint
Routing demonstration by tolerance (bench_persona_3tol)
bench_persona_3tol.py is a script that submits 5 representative question types (casual chat / code generation / technical explanation / math proof / philosophy) at 3 tolerance levels (0.05 / 0.10 / 0.20) and compares the resulting routing distribution and cost for each. I launched the new checkpoint with model-router serve --port 8204 and ran it through.
| tolerance | Mainly selected models | Total cost (5 questions) | Reduction rate (vs All-Opus) |
|---|---|---|---|
| 0.05 | deepseek-v4-flash dominant | $0.00055 | 99.3% |
| 0.10 | deepseek-v4-flash + glm-4.7-flash | $0.00080 | 99.0% |
| 0.20 | Lightweight + Local Nemotron + glm | $0.00133 | 98.3% |
| (All-Opus) | Opus 4.6 direct call (bench script fixed) | $0.07664 | Baseline |
The (All-Opus) row uses anthropic/claude-opus-4-6 as the comparison baseline fixed in the bench script, so it's the old Opus 4.6 rather than Slot 9 (Opus 4.8) in the new pool. The output prices are close enough that this is fine as a comparison baseline, but if you want to align the numbers precisely, rewrite OPUS_MODEL in the bench script to claude-opus-4.8.
The 5 bench questions are all "lightweight to mid-weight," so Opus not appearing is as expected. The fact that even at tol=0.05 deepseek-v4-flash dominates is a gap compared to V4-Flash being selected only 4.8% in evaluate (480 questions / argmax), but this is because the 5 bench questions are skewed toward the left tail (lightweight side) of the difficulty distribution in evaluate. Raising tolerance to 0.20 brings Local Nemotron into the mix, making lightweight questions effectively zero-cost.
Interpreting the routing distribution and Sonnet's role
Let me organize the point that routing behavior looks completely different between evaluate (480 questions / argmax) and bench (5 light/mid-weight questions / with tolerance).
| Evaluation axis | Environment | Mainly selected models | Interpretation |
|---|---|---|---|
| evaluate | 480 questions / argmax (tol=0) | Opus 43.1% + Sonnet 23.1% + Gemini 12.3% | Quality-preserving routing |
| bench_persona_3tol | 5 light/mid-weight questions / tol 0.05–0.20 | deepseek-v4-flash + glm-4.7-flash + Local | Cost-reduction routing |
This is the ideal form of the design — routing automatically switching based on question difficulty and tolerance — realized as intended. The balance is effective: heavy questions are handled by Opus/Sonnet to preserve quality, while lightweight questions use lightweight models to cut costs.
At the design stage, my stance was "Sonnet is tied with emerging upper-tier models (V4-Pro / Qwen 3.7-Plus / Kimi K2.6) on GPQA, so I won't add differentiation questions in the training data — it might get selected occasionally." But evaluate showed a 23.1% selection rate and 99.1% accuracy. Since cost isn't factored into argmax (tol=0), this can't be explained by "selected because it's cheap." The possible explanations are: (1) using Sonnet 4.6 as the judge introduced a self-scoring bias that boosted Sonnet's correct labels, causing the MLP to learn a higher P(correct) for Sonnet; or (2) for the 111 selected questions, the MLP's Sonnet predictions genuinely exceeded Opus. Without re-collecting with a different judge (GPT-5.4 or Gemini 3.5 Pro), we can't separate these. But the fact that Sonnet is stably functioning as a mid-weight quality middle player remains unchanged, and the design ideal of "maintaining quality while also reducing cost" is working in a quantitatively effective way.
If you want to increase Local Nemotron's selection rate
The result was 0% for Local Nemotron in evaluate's argmax, and only 2/5 selected at bench tol=0.20. If you want to "push more brainstorming-type tasks to Local," you can adjust in 3 steps: (1) add extra_body.routing.tolerance: 0.20 to the request body to raise tolerance per request (easiest, no restart needed); (2) spin up a separate server on a different port with --models limited to just the lightweight pool; (3) add 50–100 casual chat / short translation / simple memo-organizing questions to the training data and re-collect / re-train. In practice, (1) is the most straightforward, and you can run multiple use cases through a single routing service by using per-profile settings in Hermes Agent like "tolerance=0.20 for casual chat, 0.05 for code generation via Claude Code." The topic of integrating this with flows running on the Hermes side is planned for coverage in the practical guide.
The Cost Cliff Visible in Comparison with the Old 5-Model Checkpoint
For reference, here is the routing distribution of the old 5-model pool mentioned at the start of the Pool redesign section (Local Nemotron / gpt-oss-120b / Kimi K2.7 Code / GLM 5.2 / Opus 4.6). When 5 use cases were submitted, all 4 non-lightweight cases were routed to gpt-oss-120b, with high confidence (p_max 0.79–0.94), but Opus / Kimi / GLM were never called in a structure where they'd never have a turn.
| Use case | Selected model | p_max |
|---|---|---|
| Casual chat (lightweight) | local-nemotron-3-nano | 0.94 |
| Code generation (mid) | gpt-oss-120b | 0.87 |
| Technical explanation (mid) | gpt-oss-120b | 0.83 |
| Math proof (mid-heavy) | gpt-oss-120b | 0.79 |
| Philosophy discussion (heavy) | gpt-oss-120b | 0.81 |
When placed side by side with the new 9-model ladder, the 14x jump that caused the problem is visually striking.
※ All prices inside Mermaid are standardized to output prices. For input prices, refer to the individual tables shown earlier.
As a fundamental practice when building your own routing, if you keep in mind from the start that "the cost multiplier between adjacent models should be within 3x, ideally around 2x," you can more easily avoid cost cliffs like this one.
Summary
Starting from a feeling of "this might not quite match my preferences" after going through the NVIDIA LLM Router v3 default checkpoint, I went all the way through rebuilding a 9-model pool from scratch and running persona training on 480 questions. The result showed a 43.1% Opus 4.8 selection rate maintaining quality, while lightweight/mid-weight questions with tolerance 0.05–0.20 achieved 98–99%+ cost reduction (up to 99.3%), giving a concrete sense of what quality-preserving routing can deliver. The pitfalls I hit along the way (thinking on consuming all max_tokens, the insertion order bug in judge=vote, Gemini's mandatory reasoning, the 14x cost cliff jump) are written down in concrete detail in hopes they'll be useful to others building their own routing for different use cases. In particular, I think "keep the cost multiplier between adjacent models within 3x, ideally around 2x" is a good foundational practice to be aware of from the start when designing a pool.
To avoid letting it end at "built and done," observability after putting the checkpoint into production also becomes important. The routing service used in this article has a thin Langfuse callback and routing decision stdout logging layered in, allowing after-the-fact confirmation of which models were called how many times, what the resulting costs were, and how the distribution shifts when tolerance is changed. I'm thinking of covering this observability topic more thoroughly in a separate article.
Even labeled "reference implementation only," v3 is this much fun to play with — the feeling of training it and having it quietly conform to your own persona is quietly enjoyable.
Reference Links
NVIDIA LLM Router
- NVIDIA-AI-Blueprints/llm-router (GitHub) — The repository including the v3 branch covered in this article
- NVIDIA LLM Router で LLM の用途別使い分け環境を構築してみた(基礎編) — Series Part 1, covering the default checkpoint behavior and basics around the OpenAI-compatible endpoint
Upstream PRs submitted during verification for this article
- PR #32: forward all OpenAI-compatible fields to upstream — Proposed fix for the issue where OpenAI-compatible fields such as
toolswere being dropped - PR #33: translate upstream errors to OpenAI-compatible responses — Proposed fix for the issue where upstream 4xx errors were being returned as 500
- PR #34: route image/audio/video requests to capability-matched models — Proposed adapter shim to route multimodal content to capability-matched models
Related Topics
- OpenRouter — API gateway used as the upstream for the pool in this article
- Langfuse — Integrated via LiteLLM callback for observability
- lmarena.ai — Source of the Arena Elo figures referenced in the benchmark tables in this article
