I Tried Building an Environment That Uses Different LLMs by Purpose with NVIDIA LLM Router (Basics Edition)

2026.06.21

Introduction

Hello, I'm Morishige from the Classmethod Manufacturing Business Technology Department.

Model usage fees have risen significantly over the past few months. Claude Opus 4.8 output costs $25 per 1M tokens, and GPT-5.5 and Gemini 3.1 Pro have moved into the same price range. For individual use, the cost can be absorbed by a Claude Code or Codex subscription, but at organizational scale, "all Opus" becomes a hard sell for the monthly budget.

Many people probably share the frustration of thinking "it feels like Opus is being called even for simple inquiries, but it's scary to assign small models to difficult problems."

That's where an LLM Router comes in handy: a mechanism that automatically switches models based on the content of the question. Features like OpenRouter's Auto Router already exist, but this time I ran NVIDIA LLM Router v3, which you can build and host yourself, on a DGX Spark. This article covers not only cloud models called via API but also routing that mixes local LLMs running on the DGX Spark into the same pool.

Background: Why Mixing Models Has Become a Realistic Option

For light personal use, a subscription costing a few thousand yen is enough. Once you start using Claude Code or Codex regularly for work, that no longer holds. A single user making 300 requests per day with an average output of 1,500 tokens generates 6,600 requests per month (300 × 22 days), or about 9.9M output tokens, which comes to nearly $250/month with all Opus 4.8. That already exceeds the $200/month Claude Code Max plan.

Scale this to a team of 5 where each person is running agents continuously, and you're up another level. The estimates below assume 5 people, each making 33,000 requests/month with an average output of 1,500 tokens (165,000 requests/month in total).

| Configuration | Unit Price | Estimated Monthly |
| --- | --- | --- |
| All Opus 4.8 | output $25 / 1M | ~$6,200 |
| All Sonnet 4.6 | output $15 / 1M | ~$3,700 |
| Opus 30% / Sonnet 50% / Haiku 20% mix | weighted avg $16 / 1M | ~$4,000 |

The figures use the latest-generation Claude Opus 4.8 / Sonnet 4.6 / Haiku 4.5 output rates as of 2026 (the v3 default pool includes Opus 4.6, but since the price range and behavior are similar, the monthly estimates use the latest-generation rates).
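
The arithmetic behind these estimates can be reproduced with a few lines of Python. This is only a sketch: the function names are my own, and the Haiku 4.5 price of $5 / 1M is an assumption inferred from the $16 weighted average in the table.

```python
# Output prices in USD per 1M tokens. Opus and Sonnet come from the table;
# the Haiku price is an assumption inferred from the $16 weighted average.
PRICE_PER_M_OUTPUT = {"opus": 25.0, "sonnet": 15.0, "haiku": 5.0}


def monthly_output_cost(requests_per_month, avg_output_tokens, price_per_m):
    """Return the monthly output-token cost in USD."""
    total_tokens = requests_per_month * avg_output_tokens
    return total_tokens / 1_000_000 * price_per_m


def blended_price(mix, prices=PRICE_PER_M_OUTPUT):
    """Weighted-average price for a traffic mix whose shares sum to 1."""
    return sum(share * prices[model] for model, share in mix.items())


team_requests = 5 * 33_000  # 165,000 requests/month
mix_price = blended_price({"opus": 0.3, "sonnet": 0.5, "haiku": 0.2})

for label, price in [("All Opus", 25.0), ("All Sonnet", 15.0), ("Mix", mix_price)]:
    cost = monthly_output_cost(team_requests, 1_500, price)
    print(f"{label}: ${cost:,.0f}/month at ${price:.2f} / 1M")
```

Only output tokens are counted here, as in the table; input tokens would add to every row.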

Looking at the numbers alone tends to lead to "Sonnet is enough," but since these are averaged on token counts, in practice difficult questions tend to produce longer outputs, increasing the Opus ratio and pushing costs a bit higher. Conversely, many simple questions could be much shorter and handled fine by Haiku. There's room for one more tier of model differentiation between "all Opus" and "all Sonnet."

When implementing this differentiation at the organizational level, there are three things I personally want to keep in mind: cost predictability — keeping the monthly budget foreseeable; auditability — logging both the reasoning behind routing decisions and the actual calls made; and data sovereignty — drawing a boundary where sensitive queries go on-premise while everything else goes to the cloud.

When you scale to an organization, all of these hit at once. That's where the main discussion begins.

Reviewing the Router Options

When you think "I want a router," the first candidate that comes to mind is OpenRouter. By simply specifying the openrouter/auto model slug, a NotDiamond-based router looks at the prompt and selects a model. Recently, separate router types like Fusion Router and Pareto Router have been added, expanding the choices.

https://openrouter.ai/

Here's how they compare with NVIDIA LLM Router v3:

| Aspect | OpenRouter Auto | OpenRouter Fusion | OpenRouter Pareto | NVIDIA LLM Router v3 |
| --- | --- | --- | --- | --- |
| Purpose | Cost optimization | Quality improvement (multi-model consensus) | Coding-specific | Cost optimization, self-host |
| Decision logic | NotDiamond (closed) | Judge model comparison JSON | AA percentile (external) | Encoder + MLP to estimate P(correct) |
| Model pool | Curated by the operator | 1–8 panel + 1 judge | Tier-based curated | Can retrain with self-hosted, on-prem mix |
| Cost structure | Single model pricing | 4–5× single | Single model pricing | Self-host inference + single model pricing |
| Retrain on your data | Not possible | Not possible | Not possible | Possible |

Fusion Router is a separate concept — "consensus via multiple models" — oriented toward quality improvement rather than cost optimization. Since it runs 3 panel models + 1 judge model per request, the cost is 4–5× that of a single model. Pareto Router uses Artificial Analysis coding percentile and picks the cheapest from 3 tiers using min_coding_score, a philosophy somewhat similar to v3's tolerance.

This raises the question: if SaaS OpenRouter is sufficient, what's the point of self-hosting NVIDIA LLM Router v3? My breakdown is: the ability to visualize the reasoning behind routing decisions, freezing a shortlist for reproducibility, mixing on-prem models into the pool, and retraining on your own data — these four points are the differentiators. If you have enterprise requirements like "routing decisions can't be a black box for audits" or "sensitive queries must go on-prem," then forking and running v3 becomes a viable option.

Where NVIDIA LLM Router v3 Stands Today

When you visit the LLM Router repository, several branches are listed, which is a bit confusing at first. It has actually gone through generations from v1 to v2 to v3, and what's currently active is the v3 branch (based on the repository description).

https://github.com/NVIDIA-AI-Blueprints/llm-router/tree/experimental

Here's the status of each branch's latest commits:

| Branch | Latest Commit | Implementation Status |
| --- | --- | --- |
| main (v1) | 2026-04-29 | Implementation stopped 2025-12-19, only Helm example updates since |
| experimental (v2) | 2026-04-14 | Implementation stopped 2025-12-31, only CI config since |
| v3 | 2026-05-07 | Ongoing, LiteLLM Proxy external sidecar hook added |

v1 and v2 have effectively entered maintenance mode, with v3 being the de facto active continuation. The default branch is still experimental, which might cause a moment of confusion about which one to look at, but following the README straightforwardly leads you to v3.

The v3 README opens with this:

Reference implementation only.
This branch is a reference implementation demonstrating prefill-based LLM routing.
For production deployment, please fork this repository.

"Reference implementation only" — in other words, if you're taking this to production, fork it and integrate it at your own responsibility. The fact that the "recommended for production deployment" Blueprint designation has been removed in v3 is something users don't want to miss. Conversely, because it's designed with the expectation that you'll fork it and adapt it to your own use case, there's significant room for customization.

The README also includes a self-comparison table of v1 / v2 / v3. v3 restores the proxying functionality from v1 while dropping v2's multimodal support to focus on text-only. The pool includes 9 models with a price difference of about 500× between the cheapest and most expensive, and a pre-trained routing model is bundled right in. The fastest approach is probably to just run it and get a feel for how it behaves.

One structural point worth noting: v3 is the layer that decides "which model to call," while the actual model invocation is delegated to OpenRouter or LiteLLM. The default pool's cloud models are curated assuming OpenRouter's prefix, so routing results can be sent directly through OpenRouter, or you can insert routing decisions into an existing LiteLLM Proxy. The selection algorithm covered in the next section is the "choosing side" of this division of responsibilities.

How Does It Select a Model?

The v3 routing decision works roughly as follows:

The query is passed through an Encoder (Qwen3.5-0.8B, ~100ms on GPU, ~5 seconds on CPU) to extract hidden states that represent the "semantic vector" of the query. These are dimensionality-reduced via PCA, then fed through a decision MLP (a small multi-layer neural network), which produces P(correct) — "the probability of answering this question correctly" — for each model in the pool.

This is where the tolerance parameter comes into play. It sets a threshold as threshold = max(P) − tolerance, and the cheapest model above that threshold is selected. With tolerance = 0, the smartest model is always chosen; with tolerance = 1.0, the cheapest is always chosen. The default of 0.20 is positioned as the balance point between maintaining quality and reducing cost.

From the README example:

P(correct):                Cost:
  nemotron-nano:  0.92       $0.05/M
  gpt-oss-120b:   0.95       $0.43/M
  claude-opus:    0.97       $25.78/M

tolerance = 0.20
→ threshold = 0.97 − 0.20 = 0.77

Models above threshold:
  nemotron-nano  ✓  → cheapest, so selected
  gpt-oss-120b   ✓
  claude-opus    ✓
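
The selection rule can be sketched in a few lines of Python using the numbers from the README example. This is my own minimal illustration, not the v3 implementation: select_model is a hypothetical name, and treating a model that sits exactly on the threshold as eligible (>=) is an assumption.

```python
def select_model(p_correct, cost_per_m, tolerance=0.20):
    """Pick the cheapest model whose P(correct) clears max(P) - tolerance."""
    threshold = max(p_correct.values()) - tolerance
    # Assumption: a model exactly at the threshold still counts as eligible.
    eligible = [name for name, p in p_correct.items() if p >= threshold]
    # Cheapest eligible model wins; ties on cost go to the higher P(correct).
    chosen = min(eligible, key=lambda name: (cost_per_m[name], -p_correct[name]))
    return chosen, threshold


# Values copied from the README example above.
p_correct = {"nemotron-nano": 0.92, "gpt-oss-120b": 0.95, "claude-opus": 0.97}
cost_per_m = {"nemotron-nano": 0.05, "gpt-oss-120b": 0.43, "claude-opus": 25.78}

for tol in (0.0, 0.03, 0.20):
    chosen, threshold = select_model(p_correct, cost_per_m, tol)
    print(f"tolerance={tol:.2f} threshold={threshold:.2f} -> {chosen}")
```

With tolerance = 0.20 all three models clear the 0.77 threshold, so the cheapest one (nemotron-nano) is returned, matching the README example. With tolerance = 0 only the top-scoring model remains eligible.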

This mechanism is exactly what gets you out of "calling Opus even for easy questions." When I actually submitted a simple query like "What is 2 + 2?" on real hardware, the confidence for all 9 models stayed low, in the range of 0.03–0.26. P(correct) is scaled relative to the judge used in training, so a score as low as 0.17 for simple addition is by design. The key is to read it as a relative value across models for the same query, not as an absolute measure.

The implementation has a 4-layer structure: Core covers BaseRouter / PrefillRouter / PoolConfig / RoutingResult; Training covers collect / train / evaluate CLIs; Adapters includes 6 types such as LiteLLM Strategy, Standalone Server, LiteLLM Proxy, and Sidecar; and Plugins come with the NemoHermes OpenClaw plugin bundled by default.

First, Run Just the Routing Decision

From here, let's try it on real hardware. The v3 default pool is configured to call cloud models via OpenRouter, but the routing decision itself is self-contained with the encoder and checkpoint, so you can get a feel for the behavior without an OpenRouter API key. Let's go through it step by step, starting with cloning the repository and checking out the v3 branch.

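# Clone the repository and switch to the v3 branch
git clone https://github.com/NVIDIA-AI-Blueprints/llm-router
cd llm-router
git checkout v3
# Pull the large files tracked with Git LFS
git lfs install
git lfs pull
# Editable install with the prefill and litellm extras
pip install -e '.[prefill,litellm]'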

Once setup is complete, call it directly from Python:

from model_router_toolkit.config import load_config, build_router_from_config

config = load_config("configs/v1-9models-qwen08b.yaml")
router = build_router_from_config(config)

result = router.route("What is the capital of France?", tolerance=0.20)
print(f"Selected: {result.selected_model}")
print(f"Confidence: {result.selected_confidence:.3f}")

The v1- prefix in configs/v1-9models-qwen08b.yaml is confusing, but it means "pool configuration version v1" and is unrelated to the router's v1 (main) branch. Inside, a 9-model pool is defined, ranging from Nemotron 3 Nano to GPT-5.4 and Claude Opus 4.6. Since Claude Opus is included in the default pool, routing to Anthropic works out of the box with a single OpenRouter API key.

| Slot | Model | OpenRouter slug | Output $/M |
| --- | --- | --- | --- |
| 1 | Nemotron 3 Nano (Reasoning) | nvidia/nemotron-3-nano-30b-a3b | 0.20 |
| 2 | GPT-OSS 20B High | openai/gpt-oss-20b | 0.25 |
| 3 | Nemotron 3 Super (free) | nvidia/nemotron-3-super-120b-a12b:free | 0.00 |
| 4 | GPT-OSS 120B High | openai/gpt-oss-120b | 0.43 |
| 5 | Qwen 3.5 35B | qwen/qwen3.5-35b-a3b | 1.30 |
| 6 | Qwen 3.5 122B | qwen/qwen3.5-122b-a10b | 2.08 |
| 7 | GPT-5.2 High | openai/gpt-5.2 | 14.00 |
| 8 | GPT-5.4 High | openai/gpt-5.4 | 15.00 |
| 9 | Claude Opus 4.6 High | anthropic/claude-opus-4-6 | 25.78 |

The nine slots range from the free-tier Nemotron 3 Super to the highest-priced Opus 4.6. Even among the paid slots, the listed output prices span roughly two orders of magnitude ($0.20/M to $25.78/M, about 130×). One thing to keep in mind: reasoning-type models (those with reasoning_effort: high or enable_thinking: true in their configs) are mixed in, so thinking tokens will add a bit to the actual cost even for simple queries.

To view the playground UI, start the Standalone Server:

model-router serve --config configs/v1-9models-qwen08b.yaml --port 8100

Opening http://localhost:8100 brings up the UI where you can submit a question and see the routing decision, each model's confidence, and cost estimates all at once. The fastest way to check the default pool contents is to hit /api/models.

Model Router Toolkit Playground home screen. Query input field and sample question buttons on the left, right sidebar shows tolerance slider, 9-model pool, Auto Review toggle, and Session Stats

Mixing Claude and GPT into the Pool via OpenRouter

To connect through to actual calls, you'll need one OpenRouter API key. Since the default pool is built assuming OpenRouter's prefix, just exporting the key gets Claude / GPT / Gemini / DeepSeek and others lined up in the pool.

export OPENROUTER_API_KEY=your-key
model-router serve --config configs/v1-9models-qwen08b.yaml --port 8100

Submitting "What is 2 + 2?" in the UI with the default tolerance=0.05 gives this screen:

Result of submitting "What is 2 + 2?" with tolerance=0.05. p_max 0.260 / threshold 0.210 — GPT-5.4 High is selected as the only model exceeding it, with 9 model confidences ranging from 0.035 to 0.260. Session Stats shows 1 query with 42% savings

Only GPT-5.4 exceeded the threshold of 0.210, so it was selected. The earlier note that "confidence absolute values stay low even for simple questions" is confirmed here with the same numbers (0.035–0.260) visible in the UI.

What you're probably wondering now is: "with the default tolerance, which models actually get called?" I tried 20 questions of varying natures across 5 tolerance levels (0.00 / 0.05 / 0.10 / 0.15 / 0.20). Pulling out just the routing decisions via /v1/route, the distribution of selected_model moved like this:

| tolerance | claude-opus | gpt-5-4 | gpt-5-2 | qwen-3-5-122b | nemotron-3-super | nemotron-3-nano |
| --- | --- | --- | --- | --- | --- | --- |
| 0.00 | 6 | 14 | 0 | 0 | 0 | 0 |
| 0.05 | 0 | 14 | 4 | 0 | 0 | 2 |
| 0.10 | 0 | 3 | 12 | 1 | 0 | 4 |
| 0.15 | 0 | 1 | 2 | 1 | 2 | 14 |
| 0.20 | 0 | 0 | 1 | 1 | 0 | 18 |

Even in max-quality mode with tolerance = 0, only the 6 questions judged as difficult were sent to Claude Opus, with the remaining 14 concentrated on gpt-5-4. Moving tolerance from 0.05 to 0.20 continuously descends from gpt-5-4 → gpt-5-2 → nemotron-3-nano, and at 0.20 almost everything is handled by Nemotron.
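
A sweep like this can also be scripted against the Python API shown earlier instead of the /v1/route endpoint. The sketch below assumes only what appeared in that snippet, namely that router.route(query, tolerance=...) returns an object with a selected_model attribute; sweep_tolerances is a hypothetical helper name.

```python
from collections import Counter


def sweep_tolerances(router, queries, tolerances):
    """Count which model is selected for each tolerance level.

    `router` is any object with route(query, tolerance=...) that returns
    a result exposing `selected_model`, as in the earlier snippet.
    """
    distribution = {}
    for tol in tolerances:
        picks = Counter(
            router.route(query, tolerance=tol).selected_model for query in queries
        )
        distribution[tol] = dict(picks)
    return distribution


# Usage with a real router (requires the v3 setup from the earlier section):
#   dist = sweep_tolerances(router, my_questions, [0.00, 0.05, 0.10, 0.15, 0.20])
#   for tol, counts in dist.items():
#       print(tol, counts)
```

Because only the routing decision is invoked, no model is actually called and no API cost is incurred during the sweep.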

Tracking a single question through the UI makes this intuitive. Submitting the same "What is 2 + 2?" with tol=0.05 selects GPT-5.4 ($15/M output), and resubmitting with tol=0.20 drops the threshold to 0.060, switching to the cheapest-tier Nemotron 3 Super ($0/M output).

Results of submitting "What is 2 + 2?" with tolerance=0.05 and tolerance=0.20 shown vertically. With tol=0.05, threshold 0.210 selects GPT-5.4 High ($15/M); with tol=0.20, threshold drops to 0.060 and switches to Nemotron 3 Super ($0/M), and Session Stats savings increase from 42% to 71%

The jump in Session Stats savings from 42% to 71% is the effect of raising tolerance from 0.05 to 0.20. The behavior seen in the 20-question sweep is tangible even when tracking just one question.

At this point, some of you might be thinking "can't LiteLLM handle this?" In fact, LiteLLM does have an automatic routing mechanism. The AutoRouter added in 2025 offers a Semantic Router, which routes based on embedding similarity to user-written examples, and a Complexity Router, which classifies queries into 4 tiers using rule-based logic over token counts and keywords such as "step by step". Beyond that, there are strategies for each routing dimension: cost-based-routing, latency-based-routing, usage-based-routing-v2, and provider-budget-limiting. Honestly, if your goal is purely cost optimization or load balancing, LiteLLM alone gets you quite far.

https://www.litellm.ai/

Where v3 goes one step further is in using ML to estimate, for every candidate model at once, the probability of a correct answer. LiteLLM's Semantic Router requires manually written examples, and the Complexity Router relies on rule-based English keyword matching, so prompts containing Japanese or mathematical expressions, or otherwise unexpected questions, fall back to the default. v3 uses the Qwen3.5-0.8B encoder plus an MLP head to output P(correct) for all 9 models and selects the cheapest one while keeping a quality floor via tolerance, which addresses both failure modes: "calling Opus for simple questions" and "assigning Nano to a hard math problem." The ability to run collect → train → evaluate on your own data and retrain to match your question distribution is also a strength unique to v3. This training side will be covered in Part 2.

The roles are complementary: v3 uses ML to decide "which model to call," while LiteLLM handles the abstraction layer for "actually making the call" along with guardrails and budget management. Virtual keys, provider-budget-limiting, and guardrails integrations with Aporia, Presidio, etc. are outside v3's scope, so for enterprise operations, stacking these two layers is actually the realistic approach. If you just want to reduce cost and latency, LiteLLM alone is sufficient; if you want to protect quality while reducing costs based on query content, or preserve model-level confidence scores in routing decision logs, then adding v3 on top is how I'd frame it.

Measuring Savings with a Persona of a Regular Opus User

This is probably the part most relevant when considering organizational adoption: "how much does it actually save?" I set up a persona and measured it.

The persona: "a user who always calls Claude Opus." I prepared 5 questions of varying natures — light chitchat, writing a code decorator, technical explanation, mathematical proof, and a philosophical question — and compared the total cost of sending everything to Opus versus routing through v3 (using the default pool as-is).

| Route | Total Cost | Reduction vs All-Opus |
| --- | --- | --- |
| All Opus 4.6 (baseline) | $0.0913 | - |
| Routing (tol=0.05) | $0.0193 | 78.8% |
| Routing (tol=0.10) | $0.0198 | 78.3% |
| Routing (tol=0.20) | $0.0014 | 98.5% |

The interesting finding is that tol=0.05 and tol=0.10 landed at nearly the same reduction rate (78.8% and 78.3%). With the default pool, at intermediate tolerance levels, calls concentrate on gpt-5.2 / gpt-5.4 cloud models, so moving tolerance by 0.05 barely changes where calls land.

Pushing to tol=0.20, almost all 5 questions consolidated onto nemotron-3-nano-reasoning (slot 1, the cheapest paid model in the pool), reaching 98.5% savings compared to Opus. The cloud call cost on the Nemotron side isn't quite zero, but it's practically negligible compared to Opus.

Applying this to the 5-person team assumption from the introduction (165,000 requests/month × 1,500 tokens output): All-Opus runs about $6,200/month, tol=0.10 routing around $1,350/month, and tol=0.20 around $90/month. The realistic approach from here is tuning to determine "how aggressively can we push while maintaining quality."
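
For reference, here is how the reduction rates and the monthly projections follow from the measured totals. It is a rough sketch with hypothetical names, and it assumes the 5-question sample is representative of the monthly workload, which is a strong assumption.

```python
def reduction_rate(routed_cost, baseline_cost):
    """Fraction of the baseline cost saved by routing."""
    return 1.0 - routed_cost / baseline_cost


BASELINE_USD = 0.0913  # all Opus 4.6, 5 questions
MEASURED_USD = {0.05: 0.0193, 0.10: 0.0198, 0.20: 0.0014}
TEAM_ALL_OPUS_MONTHLY_USD = 6_200  # 5-person estimate from the Background section

for tol, cost in MEASURED_USD.items():
    rate = reduction_rate(cost, BASELINE_USD)
    projected = TEAM_ALL_OPUS_MONTHLY_USD * (1.0 - rate)
    print(f"tol={tol:.2f}: reduction {rate:.1%}, projected ~${projected:,.0f}/month")
```

The projection simply scales the all-Opus monthly estimate by the remaining cost fraction, so it inherits any bias in the question sample.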

Honestly, being able to get this far with the default checkpoint that's labeled "Reference implementation only" exceeded my expectations. That said, the stagnation in savings between tol=0.05 and tol=0.10 shows there's still room in pool design. Let's see in the next section how the local mix affects intermediate tolerance behavior.

Mixing DGX Spark Local Models into the Pool

This is where the difference between OpenRouter's Auto / Fusion / Pareto and LLM Router v3 really shows. By simply writing a locally running vLLM endpoint into a pool slot, it becomes a routing target on equal footing with cloud models.

With vLLM running Nemotron 3 Nano 30B-A3B in NVFP4 quantization on the DGX Spark with --gpu-memory-utilization 0.4, create one YAML file in configs/ for the tier-design:

models:
  - name: nemotron-3-nano-reasoning # Replace slot 1 with local
    api_base: http://localhost:8000/v1
    api_key_env: LOCAL_KEY
    model_name: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
    cost_per_m_input_tokens: 0
    cost_per_m_output_tokens: 0
  - name: gpt-5-2-high # Replace slot 7 with Kimi K2.7 Code
    api_base: https://openrouter.ai/api/v1
    api_key_env: OPENROUTER_API_KEY
    model_name: moonshotai/kimi-k2.7-code-20260612
    # ...
  - name: gpt-5-4-high # Replace slot 8 with GLM 5.2
    model_name: z-ai/glm-5.2-20260616
    # ...

The slots remain as default, so the trained MLP's selection logic works unchanged — only the actual call destinations are swapped. Looking at the /v1/chat/completions response, the model field correctly records nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 or z-ai/glm-5.2-20260616, making it visually clear that "only the actual entity is swapped while keeping the same slot name."

The ability to design data sovereignty in a single pool config — sending sensitive light queries to local, externally shareable queries to emerging cloud models, and only difficult problems to Opus — is where v3 is well-suited for enterprise requirements. OpenRouter Auto and Pareto pools are confined to models within OpenRouter, so this "on-prem + cloud hybrid" is the exclusive domain of v3 (and self-hosted routers like v3).

Running the same 5 questions from the previous section against this mixed pool, the savings changed as follows:

| Route | Total Cost | Reduction vs All-Opus |
| --- | --- | --- |
| All Opus 4.6 (baseline) | $0.0859 | - |
| Routing (tol=0.05, mixed) | $0.0211 | 75.4% |
| Routing (tol=0.10, mixed) | $0.0110 | 87.2% |
| Routing (tol=0.20, mixed) | $0.0000 | 100% |

With the default pool in the previous section, both tol=0.05 and tol=0.10 plateaued around 78%, but with the mixed pool, tol=0.10 stretched to 87.2%. The swapped-in Kimi K2.7 Code and GLM 5.2 began handling the mid-weight queries, and calls that had been concentrated on gpt-5.2 / gpt-5.4 flowed to the more cost-effective emerging models. At tol=0.20, almost everything consolidates onto the local Nemotron, bringing cloud-side costs literally to $0.

Latency is also a concern, so I measured 3 runs each for a lightweight chitchat query and a heavy philosophy query (each median reported):

| Query | Claude Opus 4.6 | Local Nemotron |
| --- | --- | --- |
| chitchat_ja (lightweight) | 3.87 sec | 3.37 sec |
| philosophy (heavy) | 6.08 sec | 30.02 sec |

On the lightweight side the two are nearly equivalent, with local actually slightly faster. On the heavy side the tables turn: Nemotron takes about 5× longer. For the prompt "Discuss the Chinese Room argument in 100 words.", Opus returned a concise response of roughly 138 tokens, while Nemotron returned a long-form response of 1,600–2,100 tokens. Whether a model's instruction-following respects the 100-word constraint translates directly into perceived latency, which is worth keeping in mind when designing the pool.
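
A measurement like this can be taken with a small timing helper, and dividing token counts by elapsed time gives a rough end-to-end throughput. The helper names are hypothetical, and pairing the reported token counts with the median latencies is my own approximation, since they may not come from the same run.

```python
import statistics
import time


def median_latency(call, runs=3):
    """Run call() `runs` times and return the median wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def tokens_per_second(output_tokens, seconds):
    """End-to-end throughput, including network and time-to-first-token."""
    return output_tokens / seconds


# Figures reported above: Opus ~138 tokens in 6.08 s,
# Nemotron 1,600 to 2,100 tokens in 30.02 s.
print(f"Opus:     {tokens_per_second(138, 6.08):.0f} tokens/s")
print(
    f"Nemotron: {tokens_per_second(1600, 30.02):.0f} "
    f"to {tokens_per_second(2100, 30.02):.0f} tokens/s"
)
```

By this rough measure the local model is not slower per token; the arithmetic suggests the extra time comes from the much longer response.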

How to Audit Routing Decisions

When considering enterprise adoption, you'll inevitably be asked about routing decision auditability. Whether you can trace "which query was routed to which model and why" through logs matters greatly for compliance and post-incident analysis.

The RoutingResult returned by v3 includes not just the selected model, but also the confidence for the entire pool, cost estimates, and latency — all of it. Streaming this directly to an observability platform lets you retroactively detect incidents like "Opus was being called even for simple questions."

To stream to Langfuse, call router.route() inside langfuse.start_span() and put selected_model and confidences in metadata — then you can filter in the dashboard afterward. Being able to view at the granularity of distribution per tolerance, monthly cost drift, and Opus ratio per persona enables faster operational decisions.
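
Whichever observability platform you use, the payload is essentially one record per routing decision. The sketch below builds such a record with the standard library only; the field names and helper names are hypothetical, and the per-model confidences are passed in as a plain dict because I have not confirmed the exact attribute layout of RoutingResult.

```python
import json
import time


def routing_audit_record(query, selected_model, confidences, tolerance):
    """Build one audit record for a routing decision.

    `confidences` maps every pool model name to its P(correct).
    """
    p_max = max(confidences.values())
    return {
        "timestamp": time.time(),
        "query": query,
        "tolerance": tolerance,
        "threshold": p_max - tolerance,
        "selected_model": selected_model,
        "selected_confidence": confidences[selected_model],
        "confidences": confidences,
    }


def append_jsonl(path, record):
    """Append the record as one JSON line; ensure_ascii=False keeps Japanese readable."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The same dict can be attached as span metadata, so the dashboard filters and a local JSON Lines archive stay in sync.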

Compared to OpenRouter Auto: since Auto's NotDiamond decision logic is a black box, "why was that model selected" cannot be reproduced after the fact. v3 can record P(correct) for all 9 models, so if your enterprise requirements include "obligation to preserve routing decision rationale," that's your motivation to choose v3 here. Quietly useful.

Going one step further, viewing logs over time reveals indicators that inform when to retrain: routing drift — "the selected_model distribution for the same question group has shifted over time" — and cost drift — "the cost ratio per pool model has deviated from the initial estimate." These operational considerations will be discussed in the follow-up Part 2.
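
One simple way to put a number on routing drift is the total variation distance between the selected_model distributions of two periods. This is a metric I am swapping in for illustration, not something v3 provides, and the function names are hypothetical.

```python
from collections import Counter


def selection_shares(selected_models):
    """Convert a list of selected model names into per-model shares."""
    counts = Counter(selected_models)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one routing decision is required")
    return {model: count / total for model, count in counts.items()}


def routing_drift(baseline, current):
    """Total variation distance between two selection distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    p = selection_shares(baseline)
    q = selection_shares(current)
    models = set(p) | set(q)
    return 0.5 * sum(abs(p.get(m, 0.0) - q.get(m, 0.0)) for m in models)
```

Replaying a fixed question set every month and alerting when routing_drift exceeds a threshold you choose would be one way to decide when retraining is worth considering.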

Calling from an App as an OpenAI-Compatible Endpoint

So far we've looked at the routing service behavior and auditing, but let me briefly touch on how to connect when actually calling from an app or agent.

The /v1/chat/completions exposed by v3's model-router serve is OpenAI-compatible. It accepts the messages array as-is and returns responses in OpenAI style with { id, choices, model, usage }. For the model in the request, you simply specify the policy name written in the pool config (for the default checkpoint it's default, or whatever name you gave your custom mixed pool), and internally MLP runs and forwards to the actual model.

curl -X POST http://<DGX_SPARK_IP>:8202/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "user", "content": "What is 2+2?"}
    ],
    "tolerance": 0.10
  }'

The model field in the response contains the actual routed destination (for example, nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 or anthropic/claude-opus-4-6). tolerance is a custom extension parameter that v3 adds to the OpenAI-compatible interface; if omitted, the default value is used.

For OpenAI-compatible clients such as the OpenAI SDK, LiteLLM, or coding agents like OpenCode, simply switching the BASE_URL to http://<DGX_SPARK_IP>:8202/v1 sends your usual tools through the routing service as-is. The client does not need to know what is running inside the pool; the router handles the distribution behind the scenes, which is the benefit of being OpenAI-compatible.
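
The curl call above translates directly into Python. The sketch below uses only the standard library so that it has no dependencies; the helper names are hypothetical, and reading choices[0].message.content assumes the response follows the usual OpenAI chat completion shape.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, tolerance=None, model="default"):
    """Build a POST request for the router's OpenAI-compatible endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if tolerance is not None:
        payload["tolerance"] = tolerance  # v3's custom extension parameter
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def routed_chat(base_url, prompt, tolerance=None, timeout=120):
    """Send the prompt and return (routed model name, answer text)."""
    request = build_chat_request(base_url, prompt, tolerance)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: standard OpenAI response shape.
    return body["model"], body["choices"][0]["message"]["content"]


# Usage (replace the placeholder host with your DGX Spark address):
#   model, answer = routed_chat("http://<DGX_SPARK_IP>:8202/v1", "What is 2+2?", 0.10)
```

With the OpenAI Python SDK, the equivalent is to set base_url on the client and pass tolerance through extra_body, since it is not a standard OpenAI parameter.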

One caveat: Claude Code sends requests in /v1/messages (Anthropic format), so pointing it directly at /v1/chat/completions won't connect. You'll need to insert an Anthropic-compatible shim (such as LiteLLM proxy's passthrough feature).

Summary

As model prices have risen, using different models for different purposes at organizational scale has become a realistic consideration. SaaS options like OpenRouter Auto / Fusion / Pareto are convenient, but when requirements include visibility into routing decision rationale, freezing the shortlist, mixing in on-premises models, and retraining on your own data, self-hosted NVIDIA LLM Router v3 becomes a candidate.

Running it on actual hardware, even with the default checkpoint, produced 75–87% cost reduction assuming Opus-exclusive users. Pushing tolerance = 0.20 and consolidating to local Nemotron achieves 100% — meaning you can create a configuration with $0 in cloud API payments. Since the cost-quality balance shifts smoothly with how you tune tolerance, being able to gradually push according to your organization's risk tolerance is what makes it practical from an implementation perspective.

However, the default checkpoint has its limits: its behavior is governed by an MLP trained on slot-order-based signals from a 9-model pool. Even if you add newer models (Claude Opus 4.8 / GPT-5.5 / Kimi K2.7 Code / DeepSeek V4 / Qwen 3.7 / GLM 5.2, etc.) to the pool, they were not present in the training signals, so there's no guarantee that the model best suited to your use case will be selected. To solve this fundamentally, you can build your own checkpoint through the collect → train → evaluate training pipeline. At a scale of 500 questions, the estimate is $5–25 and half a day, which I expect to be recovered within one month from the per-persona monthly savings.

Also, as the README explicitly states: "Reference implementation only. For production deployment, please fork this repository" — v3 is meant to be used by forking it and adapting it to your own use case. During the verification for this article, I found several areas for improvement around the OpenAI-compatible endpoint, so in the follow-up Part 2, I'll apply minimal improvements to the serve code after forking, then move into the persona training section.

It says reference implementation only, yet you can play with v3 this much — the direction it's heading is quite interesting.

https://dev.classmethod.jp/articles/dgx-spark-nvidia-llm-router-v3-training/

NVIDIA LLM Router

OpenRouter Routers

LiteLLM

  • LiteLLM Routing — List of strategies including cost-based-routing, latency-based-routing, usage-based-routing-v2, and others
  • LiteLLM Auto Routing — Documentation for AutoRouter (Semantic + Complexity) added in 2025-07
  • aurelio-labs/semantic-router — The semantic routing library used internally by LiteLLM AutoRouter (Semantic)
