
I tried Hermes Agent's Mixture of Agents (MoA)
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
A feature called Mixture of Agents (MoA) has been added to Nous Research's Hermes Agent. The idea of "throwing the same question to multiple LLMs and having another LLM synthesize their opinions" has apparently existed as a paper since 2024, but I'd like to explore why it's attracting renewed attention now by actually getting hands-on with it.
To state the conclusion upfront: Nous claims a +7.82% quality improvement over Claude Opus 4.8 alone on HermesBench, its internal benchmark, while in my own measurement the actual cost on OpenRouter came to roughly 80 times that of Opus alone. I'll gather hands-on data and organize it so readers can judge for themselves whether the gain is worth paying for.
Why MoA Is Attracting Renewed Attention Now
The concept of MoA itself is not new. The starting point was Together.ai's paper Mixture-of-Agents Enhances Large Language Model Capabilities published in June 2024, known as the pattern where "N reference models provide opinions, and one aggregator model synthesizes them."
The reason this topic is resurfacing in 2026 is a structural change: access to frontier models is being gradually restricted. Most recently, on June 12, 2026, Anthropic's Fable 5 and Mythos 5 were suspended for all customers due to U.S. government export regulations. It was a heavy-handed response: every deployment method, API and web interface alike, was cut off at once, regardless of whether users were inside or outside the U.S. or whether they were foreign nationals. Add the tighter rate limits and price hikes for Claude Opus 4.8 and GPT-5.5, plus the emergence of closed orchestration systems like Sakana Fugu that bundle multiple models behind the scenes, and the strongest models are increasingly becoming something only "a select few" can access.
In that atmosphere, Nous Research sent out the following message on their official X:
The strongest models are gated and access is granted only to a select few. Hermes Agent now exposes MoA presets as virtual models, giving you capabilities beyond the publicly available frontier: 8% higher than Opus 4.8 and 11% higher than GPT 5.5 on our upcoming benchmark.
This reads as a statement that maps the essence of MoA ("if only certain people can touch the strongest models, bundle multiple models to exceed the frontier") onto the current state of the industry. Teknium also followed up, saying he "wants to aim for Opus-equivalent quality with a combination of open source models," and MoA seems to have become a realistic option again as a consensus pattern.
MoA's Position in Hermes Agent
MoA is integrated into Hermes Agent as a "virtual model provider" and can be invoked with a single call using the /moa command or --provider moa specification.
There are several other patterns for bundling multiple LLMs, each with different philosophies. Comparing them with what I've been working with recently, here's how they break down:
| Category | Representative Examples | Mechanism |
|---|---|---|
| Router type | NVIDIA LLM Router v3 / OpenRouter Auto | MLP classifier or heuristic selects and routes to one model |
| Coordinator LLM type | Sakana Fugu | A small trained LLM calls multiple models in the background and synthesizes |
| Aggregator type | Hermes MoA | N reference models provide opinions, one aggregator synthesizes them |
Hermes MoA falls into this third category, and what I personally found most interesting was the transparency that allows users to directly edit the configuration of references and aggregator.
How MoA Works (Reference + Aggregator)
The official docs describe MoA as follows:
Use MoA when a hard task benefits from multiple model perspectives but still needs Hermes' normal agent loop: tool calls, follow-up iterations, interrupts, transcript persistence, and the same session context as any other message.
In other words, MoA is meant for difficult tasks where multiple model perspectives help, while Hermes' normal agent loop (tool calls and session continuity) is kept as-is.
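In outline, the flow is: the prompt goes to each of the N reference models, their opinions are handed to the aggregator, and the aggregator produces the final response.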
One key point is that only the aggregator side ultimately calls tools. Tool state is also passed to the reference side, and the flow involves references returning opinions that take the tool context into account, after which the aggregator bundles them and calls the tool. This impacts the latency I observed later when testing tool calling.
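The reference-plus-aggregator flow described above can be sketched in a few lines of Python. This is a minimal illustration of the general pattern, not Hermes Agent's implementation: the `mixture_of_agents` function, the `Model` type alias, and the toy models are all hypothetical names, and running the references concurrently is my assumption rather than something the docs state.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# A "model" here is just a function from prompt text to response text.
# These stand-ins are hypothetical; real code would call an LLM API.
Model = Callable[[str], str]


def mixture_of_agents(prompt: str, references: List[Model], aggregator: Model) -> str:
    """Ask every reference model, then let one aggregator synthesize (N+1 calls)."""
    # Assumption: reference calls are independent, so they run concurrently.
    with ThreadPoolExecutor(max_workers=len(references)) as pool:
        opinions = list(pool.map(lambda model: model(prompt), references))

    # The aggregator sees the original prompt plus every opinion,
    # which is why its input grows with the number of references.
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(opinions, start=1))
    aggregator_prompt = f"Question:\n{prompt}\n\nReference opinions:\n{numbered}"
    return aggregator(aggregator_prompt)


# Toy stand-ins so the sketch runs without any API key.
def ref_a(prompt: str) -> str:
    return "opinion A"


def ref_b(prompt: str) -> str:
    return "opinion B"


def toy_aggregator(prompt: str) -> str:
    # Count the numbered opinion lines that were injected into the prompt.
    count = sum(1 for line in prompt.splitlines() if line.startswith("["))
    return f"synthesis of {count} opinions"


result = mixture_of_agents("What is MoA?", [ref_a, ref_b], toy_aggregator)
print(result)  # synthesis of 2 opinions
```

In the real feature only the aggregator issues tool calls; the sketch leaves tools out entirely and shows only how the opinions end up in the aggregator's context.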
Claimed Values on HermesBench
Regarding the basis for the numbers mentioned at the beginning, the official docs show a table like this. HermesBench is Nous's internal benchmark, and the leaderboard itself has not yet been made public.
| Model | HermesBench Score |
|---|---|
| MoA (Opus aggregator) | 0.8202 |
| Claude Opus 4.8 alone | 0.7607 |
| GPT-5.5 alone | 0.7412 |
That's +0.0595 in absolute value, +7.82% compared to Opus 4.8 alone, and +10.66% compared to GPT-5.5 alone. This is consistent with the official X post's claim of "8% higher than Opus 4.8 and 11% higher than GPT 5.5," so these three numbers serve as the provisional reference point.
Once the leaderboard is made public, we should be able to see more detailed evaluation axes.
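The percentages above are relative gains, not absolute score differences. A quick way to recheck them from the three published scores (the `relative_gain` helper is just an illustrative name):

```python
# HermesBench scores as quoted from the official docs table.
scores = {
    "MoA (Opus aggregator)": 0.8202,
    "Claude Opus 4.8 alone": 0.7607,
    "GPT-5.5 alone": 0.7412,
}


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100


moa = scores["MoA (Opus aggregator)"]
for baseline in ("Claude Opus 4.8 alone", "GPT-5.5 alone"):
    gain = relative_gain(moa, scores[baseline])
    print(f"vs {baseline}: +{moa - scores[baseline]:.4f} absolute, +{gain:.2f}% relative")
```

The relative figures come out to +7.82% and +10.66%, which is where the rounded "8%" and "11%" in the X post come from.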
Environment Setup and Prerequisites
The version of Hermes Agent used during verification is v0.17.0 (2026.6.19) · upstream 88b3d863.
hermes --version
# Hermes Agent v0.17.0 (2026.6.19) · upstream 88b3d863
hermes moa list
One thing to be careful about is that there's a gap of about one week between the released v0.17.0 tag and the official docs. Running hermes update --check returns "1003 commits behind origin/main," which reveals that the core MoA functionality (the hermes moa subcommand and billing routes for moa://local) was all added in commits after the tag. Since the docs are written in sync with the latest origin/main, if you installed right after the tag, you'll encounter a situation where "things don't work even if you follow the docs."
hermes update --check
# ⚕ Update available: 1003 commits behind origin/main.
hermes update --yes --backup
# git pull + dependency rebuild + skill sync + config v29→v30 migrate
hermes update internally runs git pull origin main, so running this will give you behavior that matches the docs. The version string in hermes --version will remain v0.17.0 until the next tag bump, but you can verify sync with the latest main by checking the upstream <commit-hash> portion.
Running with the Default Preset
First, I'll verify operation with the same default preset as the official docs. Looking at the contents with hermes moa list, the configuration is as follows:
* default
Reference models:
1. openai-codex:gpt-5.5
2. openrouter:deepseek/deepseek-v4-pro
Aggregator: openrouter:anthropic/claude-opus-4.8
This matches the YAML example in the official docs: what the docs present as a "configuration example" is the actual default. The Active in config: (off) line in the output means that the moa: section is not written to ~/.hermes/config.yaml until you create it with hermes moa configure.
The preset can be invoked as a one-shot run with hermes -z PROMPT --provider moa --model default. I ran 2 prompts (a light question: summarize MoA in 80 characters or less; and a light reasoning question: a fair way to divide 3 apples and 4 oranges) across 4 configurations, then used compare.py to collate the subprocess wall-clock time with the usage data from hermes sessions export.
| config | prompt | latency (s) | input_tok | output_tok | api_calls | billing | cost_status |
|---|---|---|---|---|---|---|---|
| hermes-default | light_question | 9.87 | 24,402 | 83 | 1 | custom | unknown |
| moa-default | light_question | 40.60 | 41,351 | 46 | 1 | moa | unknown |
| opus-4.8-single | light_question | 6.84 | 2 | 83 | 1 | openrouter | estimated |
| gpt-5.5-single | light_question | 11.94 | 24,873 | 179 | 2 | openai-codex | included |
| moa-default | light_reasoning | 9.26 | 41,462 | 110 | 1 | moa | unknown |
| opus-4.8-single | light_reasoning | 7.11 | 2 | 153 | 1 | openrouter | estimated |
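The latency column is subprocess wall-clock time. A harness along the following lines would produce that kind of number; this is a sketch of the general approach, not the actual compare.py, and `time_command` and `hermes_oneshot` are hypothetical helper names. Only the `hermes -z PROMPT --provider ... --model ...` form is taken from the command quoted above.

```python
import subprocess
import sys
import time
from typing import List, Tuple


def time_command(cmd: List[str]) -> Tuple[float, int]:
    """Run a command to completion and return (wall-clock seconds, exit code)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return time.perf_counter() - start, completed.returncode


def hermes_oneshot(prompt: str, provider: str, model: str) -> List[str]:
    """Build the one-shot invocation quoted in the text as an argument list."""
    return ["hermes", "-z", prompt, "--provider", provider, "--model", model]


# Demonstrate the timer on a command that exists wherever Python does,
# so the sketch runs even on a machine without Hermes installed.
elapsed, code = time_command([sys.executable, "-c", "print('ok')"])
print(f"{elapsed:.2f}s exit={code}")

# On a machine with Hermes installed, the MoA run would be timed like this:
# elapsed, code = time_command(hermes_oneshot("Summarize MoA", "moa", "default"))
```

Timing the whole subprocess includes CLI startup, so the figures are end-to-end latency as a user would experience it rather than pure model time.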
Looking at the numbers, I noticed three things.
First, latency stretched to roughly 6 times that of Opus alone. Even for a light question like "summarize the concept of MoA in 80 characters," moa-default took 40.60 seconds, against 6.84 seconds for Opus alone and 11.94 seconds for GPT-5.5 alone, so I felt firsthand that this is a design where the N+1 calls land directly on wall-clock time. The reasoning question, by contrast, came back in 9.26 seconds, so the variance is large and appears to depend on how long the reference responses are.
Second, input_tokens consistently hovered around 41K. Since the design injects the opinions of the 2 reference models into the aggregator's context, input expansion on this scale is expected behavior. The opus-4.8-single rows show only 2 input tokens because calling the openrouter provider directly does not add the agent loop context, which makes them an extreme baseline. Setting that aside, MoA still processes roughly 1.7 times the input of the standard hermes-default configuration (24K) on every call.
Third, it returns with cost_status: unknown. This means actual MoA costs are not visible from the Hermes side. It's processed internally with a proprietary schema of billing_provider: moa / billing_base_url: moa://local, and both actual_cost_usd and estimated_cost_usd return as null. Tracking "how much was spent" requires a separate route.
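The two ratios quoted in these observations follow directly from the light_question rows of the table. A small recheck, with the figures copied from the table and variable names of my own choosing:

```python
# (latency in seconds, input tokens) for the light_question prompt.
rows = {
    "hermes-default": (9.87, 24_402),
    "moa-default": (40.60, 41_351),
    "opus-4.8-single": (6.84, 2),
}

moa_latency, moa_input = rows["moa-default"]
opus_latency, _ = rows["opus-4.8-single"]
_, hermes_input = rows["hermes-default"]

# MoA latency relative to calling Opus directly.
latency_ratio = moa_latency / opus_latency
# MoA input relative to the standard agent loop (Opus direct has no loop context).
input_ratio = moa_input / hermes_input

print(f"latency: {latency_ratio:.1f}x Opus alone")
print(f"input tokens: {input_ratio:.1f}x hermes-default")
```

The input comparison uses hermes-default rather than opus-4.8-single as the baseline, since the 2-token figure reflects a call path without agent context.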
On the response quality side, there was a difference between MoA and GPT-5.5 alone. For the reasoning-oriented question, MoA carefully showed the procedure: "For apples, first distribute 1 each from 2, then cut the remaining 1 in half for 1/2 each. For oranges, distribute 2 each. Result: each person gets 1.5 apples and 2 oranges, which is fair." Meanwhile, GPT-5.5 alone went with the shortest answer: "Give 1 apple each and cut the remaining 1 in half. Give 2 oranges each. This is fair." Which is preferable may depend on the person.
Creating a Custom Preset to Peek at Actual Costs
To get around the constraint I hit earlier, that costs aren't visible from the Hermes side, I'll measure from the outside using the OpenRouter Management API. If every reference is routed through OpenRouter, the amount consumed during testing can be read externally from the change in total_usage returned by /api/v1/credits.
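The before-and-after snapshot can be scripted roughly as follows. This is a sketch under assumptions: I assume the credits endpoint returns JSON shaped like `{"data": {"total_usage": ...}}` and accepts a bearer key, so check the OpenRouter API reference before relying on it. `fetch_total_usage` and `usage_delta` are hypothetical helper names, and the "before" value in the demo is a made-up placeholder.

```python
import json
import urllib.request

CREDITS_URL = "https://openrouter.ai/api/v1/credits"


def fetch_total_usage(api_key: str) -> float:
    """Return cumulative spend in USD as reported by the credits endpoint."""
    request = urllib.request.Request(
        CREDITS_URL, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Assumed response shape: {"data": {"total_credits": ..., "total_usage": ...}}
    return float(payload["data"]["total_usage"])


def usage_delta(before: float, after: float) -> float:
    """Spend between two snapshots, rounded to the 6 decimals the article reports."""
    return round(after - before, 6)


# Offline demo with placeholder snapshots (no network call is made here).
# In a real run: before = fetch_total_usage(key); run the tests; after = fetch_total_usage(key)
before_snapshot = 10.0       # placeholder starting value, not a measured figure
after_snapshot = 10.483842   # placeholder chosen to reproduce the article's delta
print(usage_delta(before_snapshot, after_snapshot))  # 0.483842
```

Because total_usage is account-wide, the delta is only attributable to the test if nothing else is spending on the same account during the measurement window.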
Custom presets are registered by appending to the end of ~/.hermes/config.yaml.
moa:
  presets:
    openrouter-only:
      reference_models:
        - provider: openrouter
          model: openai/gpt-5.5
        - provider: openrouter
          model: anthropic/claude-sonnet-4.6
        - provider: openrouter
          model: deepseek/deepseek-v4-pro
      aggregator:
        provider: openrouter
        model: anthropic/claude-opus-4.8
      reference_temperature: 0.6
      aggregator_temperature: 0.4
      max_tokens: 4096
      enabled: true
After confirming that openrouter-only is loaded with hermes moa list, I take a snapshot of OpenRouter credits just before testing, run hermes -z PROMPT --provider moa --model openrouter-only for 2 prompts, then take credits again afterward to calculate the difference.
| config | prompt | latency (s) | input_tok | output_tok |
|---|---|---|---|---|
| moa-openrouter-only | light_question | 27.18 | 41,467 | 53 |
| opus-4.8-single | light_question | 7.19 | 2 | 83 |
| moa-openrouter-only | light_reasoning | 23.41 | 41,564 | 140 |
| opus-4.8-single | light_reasoning | 7.94 | 2 | 133 |
Latency stabilized at 23–27 seconds. Input remained consistently at 41K. I thought increasing references from 2 to 3 would add a bit more time, but it wasn't as much as expected.
The key OpenRouter-side costs came out like this. The total_usage difference for these 4 requests (2 minutes) was $0.483842. Estimating 2 requests of Opus alone based on OpenRouter's official pricing (Opus 4.8 at $5/M input + $25/M output) comes to roughly $0.006, meaning the remaining $0.478 was consumed by the 2 MoA openrouter-only requests.
| Metric | Amount (approximate) |
|---|---|
| Total difference for 4 requests | $0.483842 |
| Opus single 1 call | $0.003 |
| MoA openrouter-only 1 call | $0.24 |
| MoA / Opus single ratio | approximately 80x |
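The arithmetic behind this table can be retraced from the token counts and the pricing quoted above. The sketch below uses only figures from the article; the function and variable names are mine.

```python
# OpenRouter list pricing for Opus 4.8 as quoted in the text (USD per million tokens).
INPUT_USD_PER_M = 5.0
OUTPUT_USD_PER_M = 25.0


def opus_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one direct Opus call at list price."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# (input, output) tokens of the two opus-4.8-single runs in the table above.
opus_calls = [(2, 83), (2, 133)]
opus_total = sum(opus_cost(i, o) for i, o in opus_calls)

total_delta = 0.483842                  # total_usage difference for all 4 requests
moa_total = total_delta - opus_total    # what the 2 MoA requests must have cost
moa_per_call = moa_total / 2
opus_per_call = opus_total / 2

print(f"Opus total:   ${opus_total:.5f}")
print(f"MoA per call: ${moa_per_call:.3f}")
print(f"ratio (unrounded): {moa_per_call / opus_per_call:.0f}x")
print(f"ratio (rounded $0.24 / $0.003): {round(0.24 / 0.003)}x")
```

With unrounded inputs the ratio lands somewhat above 80x, because the Opus estimate is nearer $0.0027 per call than $0.003; either way the order of magnitude is the same.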
While the claim is "+7.82% quality improvement" on HermesBench, peeking at OpenRouter's credit meter reveals you're paying approximately 80 times the cost just to run the same prompt. This is a structural cost unavoidable given the mechanism of running 3 reference models, feeding all their opinions into the aggregator's context, and synthesizing with Opus 4.8.
Whether you would pay 80 times the cost for +7.82% quality depends, I think, on the use case. For drafting legal documents or sounding out important decisions it might be worth paying, but for a chatbot that runs 100 times a day it is honestly a tough number. I hope readers can keep this 80x figure in the back of their minds when drawing their own line between "I can pay here" and "not here" for their workloads.
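To make the "100 times a day" case concrete, the per-call figures from the table scale as follows. This is a back-of-the-envelope sketch: it assumes every call costs the same as the short test prompts, which real workloads will not match exactly.

```python
def daily_cost(cost_per_call_usd: float, calls_per_day: int) -> float:
    """Linear extrapolation of a per-call cost to a daily volume."""
    return cost_per_call_usd * calls_per_day


CALLS_PER_DAY = 100          # the chatbot scenario mentioned in the text
MOA_PER_CALL = 0.24          # approximate, from the OpenRouter measurement
OPUS_PER_CALL = 0.003        # approximate, from list pricing

print(f"MoA:  ${daily_cost(MOA_PER_CALL, CALLS_PER_DAY):.2f} per day")
print(f"Opus: ${daily_cost(OPUS_PER_CALL, CALLS_PER_DAY):.2f} per day")
```

That is roughly $24 a day against $0.30 a day for the same prompt volume.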
Checking Tool Calling Behavior
I also wanted to confirm "whether tool calling works properly with MoA." I specified the code_execution toolset with -t code_execution and ran the prompt "Calculate 17 × 23 in Python and return only the result" across 4 configurations.
| config | latency (s) | input_tok | output_tok | tool_calls | Result |
|---|---|---|---|---|---|
| hermes-default | 8.58 | 11,878 | 104 | 1 | 391 |
| moa-default | 28.73 | 18,303 | 58 | 1 | 391 |
| moa-openrouter-only | 30.80 | 18,762 | 58 | 1 | 391 |
| opus-4.8-single | 6.75 | 104 | 58 | 1 | 391 |
All configurations returned the correct answer of 391 with tool_call_count: 1, confirming that tool calling works fine with MoA as well. What's noteworthy is MoA's latency—it stretched to 28–30 seconds compared to Opus alone (6.75 seconds), more than 4 times longer. This is because each reference reads the tool context and returns its opinion, which then feeds directly into wall-clock time. The input_tokens side also ballooned to 18K for MoA, compared to 104 tokens for Opus alone—a difference of more than 170 times.
The api_call_count: 2 structure itself is the same as Opus alone, and the flow of the aggregator calling the tool based on consensus opinions that incorporate tool context worked as-is.
I also noticed that on the tool calling route, MoA's input more than halved, from 41K to 18K. This appears to be the result of restricting the toolset with -t code_execution, which stripped away the profile's system prompt and irrelevant context.
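The ratios in this section can be rechecked from the tool-calling table, together with the 41,351-token input of the earlier moa-default run. The variable names below are my own.

```python
# (latency in seconds, input tokens) from the tool-calling table.
tool_rows = {
    "moa-default": (28.73, 18_303),
    "moa-openrouter-only": (30.80, 18_762),
    "opus-4.8-single": (6.75, 104),
}
MOA_INPUT_WITHOUT_TOOLSET = 41_351  # moa-default, light_question, earlier table

opus_latency, opus_input = tool_rows["opus-4.8-single"]
moa_latency, moa_input = tool_rows["moa-default"]

latency_ratio = moa_latency / opus_latency                      # MoA vs Opus alone
input_ratio = moa_input / opus_input                            # MoA vs Opus alone
fraction_of_earlier_input = moa_input / MOA_INPUT_WITHOUT_TOOLSET

print(f"latency: {latency_ratio:.1f}x")
print(f"input tokens: {input_ratio:.0f}x")
print(f"input vs earlier MoA run: {fraction_of_earlier_input:.0%}")
```

The last figure shows the tool-calling run carrying a bit under half of the earlier input, which fits the reading that restricting the toolset trimmed the context.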
The official docs also explicitly state other specifications such as "when reference authentication fails, errors are injected into context" and "recursive MoA (specifying another moa preset as aggregator) is prohibited to prevent infinite loops," but I didn't perform that level of destructive testing this time. For those who are curious, please back up ~/.hermes/config.yaml and try tripping over it yourself.
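For a sense of what the recursion rule guards against, here is a small check one could run over a presets mapping before trying it. This is purely illustrative and not how Hermes validates its config: `find_recursive_presets` is a hypothetical helper, the dictionary mirrors the preset layout shown earlier, and the `loopy` preset is invented for the example.

```python
from typing import Any, Dict, List


def find_recursive_presets(presets: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return names of presets whose aggregator points back at the moa provider."""
    offenders = []
    for name, preset in presets.items():
        aggregator = preset.get("aggregator", {})
        # An aggregator served by "moa" would trigger another MoA round, and so on.
        if aggregator.get("provider") == "moa":
            offenders.append(name)
    return offenders


presets = {
    "openrouter-only": {
        "aggregator": {"provider": "openrouter", "model": "anthropic/claude-opus-4.8"},
    },
    # Hypothetical bad preset: its aggregator is itself an MoA preset.
    "loopy": {
        "aggregator": {"provider": "moa", "model": "openrouter-only"},
    },
}

print(find_recursive_presets(presets))  # ['loopy']
```

Each nested MoA layer would multiply the N+1 calls again, so beyond the infinite-loop risk the cost picture measured above would compound quickly.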
Comparison with Existing Consensus Patterns
Based on the hands-on data so far, I've laid out the approaches for bundling multiple LLMs side by side. Since I'm often asked internally "which should we actually use," this table doubles as a way to organize my own thoughts. The distinctive point is that Hermes MoA, in the aggregator category, is the only pattern whose costs cannot be seen from the tool's own side.
| Pattern | Examples | Consensus Mechanism | Calls per Request | Cost Transparency | Self-Editing Configuration |
|---|---|---|---|---|---|
| Router type | LLM Router v3 / OpenRouter Auto | MLP / heuristic selects 1 model | 1 | ◯ Handled entirely on provider side | Requires pool retraining |
| Coordinator LLM type | Sakana Fugu | Trained LLM synthesizes multiple models in background | 1 req = multiple behind the scenes | △ Partially exposed via orchestration_tokens | Preset editing not possible |
| Aggregator type | Hermes MoA | 1 aggregator synthesizes N reference opinions | N+1 | Unknown from Hermes side | Free editing via preset YAML |
| MCP route type | OpenRouter MCP Server | Via MCP with bearer authentication | 1 | ◯ Detailed retrieval via generation-get | Provider list is fixed |
For practical usage differentiation, I think the following breakdown makes sense: router-type LLM Router v3 for scenarios where you want to run fast and cheap; coordinator-type Sakana Fugu when you want to delegate backend logic and just get smart results; aggregator-type Hermes MoA when you want to tune reference configurations yourself and push past the frontier on benchmarks; and MCP route-type OpenRouter MCP when you want to cross-call multiple providers from an agent.
For more details on LLM Router v3 and Sakana Fugu, I've written separate articles, so please check the linked articles for deeper coverage.
Conclusion
After spending a day hands-on with Hermes Agent's MoA, I was able to confirm several things.
On the benchmark side, the claim is +7.82% over Opus 4.8 alone and +10.66% over GPT-5.5 alone on HermesBench, while the actual cost measured via the OpenRouter Management API came out to approximately 80 times that of Opus alone. That is a stark result, with an extreme tradeoff between quality improvement and cost. Since the leaderboard doesn't seem to be publicly available yet, I hope to dig deeper in a follow-up.
Tool calling worked fine with MoA too—references returned opinions including tool context, and the aggregator called the tool and synthesized. However, the internal billing in Hermes remains cost_status: unknown, so the practical solution is to separately track costs on the OpenRouter side. The configuration of incorporating Nous Portal credits via hermes proxy into references—which I didn't get to try this time—remains a candidate to test in a follow-up.
On the environment side, while the official docs are written in sync with the latest origin/main, release tags are cut with a delay. Since Hermes itself was being actively updated during my testing in this article, I strongly recommend confirming you have the latest with hermes update --check before getting started when reproducing on your own machine.
If Teknium's statement about "wanting to target Opus-equivalent quality with a combination of open source models" becomes a realistic solution, the 80x cost might come down to around 10x, making it much more practical for everyday use. If the trend of increasingly restricted access to frontier models continues, I have a feeling the entire aggregator-type consensus pattern will mature another step, so I hope to verify again after the HermesBench leaderboard is published.
Reference Links
- Mixture of Agents — Hermes Agent docs
- Mixture-of-Agents Enhances Large Language Model Capabilities (arXiv, June 2024)
- Statement on the US government directive to suspend access to Fable 5 and Mythos 5 — Anthropic
- Trying Sakana Fugu (GA) on a Subscription Plan
- Trying PLaMo 3.0 Prime
- Trying OpenRouter's MCP Server
- Building a Use-Case-Specific LLM Environment with NVIDIA LLM Router (Basics)
- Retraining NVIDIA LLM Router to Match Your Own Persona (Training)
