I tried Hermes Agent's Mixture of Agents (MoA)

I tested the Mixture of Agents (MoA) feature added to Nous Research's Hermes Agent by actually running it. Nous claims a +7.82% quality improvement on its benchmark, but the measured cost came out at roughly 80 times that of a standalone Opus call, an extreme result. This article lays out the real data so you can judge whether it is worth paying for.

森茂洋 / Hiroshi Morishige

2026.06.29


Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
A feature called Mixture of Agents (MoA) has been added to Nous Research's Hermes Agent. The idea of "sending the same question to multiple LLMs and having another LLM synthesize their opinions" has existed as a paper since 2024, but I wanted to get hands-on with a real implementation to understand why it is attracting renewed attention right now.
To state the conclusion upfront: Nous claims a +7.82% quality improvement over standalone Claude Opus 4.8 on HermesBench, an internal Nous benchmark, while my own measurement on OpenRouter put the actual cost at roughly 80 times that of standalone Opus, an extreme figure. I'll compile the real-world data so readers have what they need to judge for themselves whether it's worth paying for.

Why MoA Is Attracting Renewed Attention Now

The concept of MoA itself is not new. The starting point was Together.ai's paper Mixture-of-Agents Enhances Large Language Model Capabilities, published in June 2024, which established the pattern of "N reference models provide opinions, and one aggregator model synthesizes them."
The reason it is back in the spotlight in 2026 is a structural change: access to frontier models is being gradually restricted. Most recently, on June 12, 2026, Anthropic's Fable 5 and Mythos 5 were suspended for all customers due to U.S. government export regulations. It was a fairly heavy-handed response in which every deployment method, API and web interface alike, was suddenly cut off regardless of whether the user was inside or outside the U.S. or a foreign national. Combined with tighter rate limits and price hikes for Claude Opus 4.8 and GPT-5.5, and the emergence of "closed orchestration systems that bundle multiple models behind the scenes" like Sakana Fugu, it feels like the most powerful models are increasingly becoming the domain of a "select few."
Against this backdrop, Nous Research posted the following on official X:
The strongest models are gated and access is granted only to a select few. Hermes Agent now exposes MoA presets as virtual models, giving you capabilities beyond the publicly available frontier: 8% higher than Opus 4.8 and 11% higher than GPT 5.5 on our upcoming benchmark.
The message maps the essence of MoA ("if the strongest models can only be accessed by certain people, bundle multiple models together to surpass the frontier") onto the current state of the industry. Teknium also followed up with "I want to aim for Opus-level performance using combinations of open source models," and it seems fair to say that MoA has once again become a realistic option as a consensus pattern.

MoA's Position in Hermes Agent

MoA is built into Hermes Agent as a "virtual model provider" and can be invoked in one shot using the /moa command or the --provider moa flag.
There are several other patterns for bundling multiple LLMs, each with a different philosophy. Comparing them with what I've been working with recently, the breakdown looks like this:


Category	Representative Examples	Mechanism
Router-based	NVIDIA LLM Router v3 / OpenRouter Auto	An MLP classifier or heuristic selects and sends to one model
Coordinator LLM-based	Sakana Fugu	A trained small LLM calls and synthesizes multiple models behind the scenes
Aggregator-based	Hermes MoA	N reference models provide opinions, one aggregator synthesizes them

Hermes MoA falls into this third category, and personally I found the transparency of being able to directly edit the reference and aggregator configuration the most interesting part.

How MoA Works (reference + aggregator)

The official docs describe MoA as follows:
Use MoA when a hard task benefits from multiple model perspectives but still needs Hermes' normal agent loop: tool calls, follow-up iterations, interrupts, transcript persistence, and the same session context as any other message.
In other words: use it for hard tasks where multiple model perspectives help, while Hermes' normal agent loop (tool calls, session continuity) stays as it is.
The flow runs as follows. The tool state is passed to the reference models as well, so each reference returns an opinion informed by the tool context; the aggregator then bundles those opinions and makes the tool call. The key point is that only the aggregator actually calls tools, which affects the latency I observe later when testing tool calling.
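The reference + aggregator flow described above can be sketched in a few lines of Python. This is a minimal illustration of the general pattern, not Hermes Agent's actual implementation: the `Model` type, the prompt layout, and the stub models are all hypothetical stand-ins for real LLM calls.

```python
from typing import Callable, Sequence

# Hypothetical type: a "model" is any callable that takes a prompt and returns text.
Model = Callable[[str], str]


def mixture_of_agents(prompt: str, tool_context: str,
                      references: Sequence[Model], aggregator: Model) -> str:
    """Run N reference models, then let one aggregator synthesize their opinions."""
    # Each reference sees the question plus the tool context, but never calls a tool.
    reference_prompt = f"{prompt}\n\n[Available tools]\n{tool_context}"
    opinions = [ref(reference_prompt) for ref in references]

    # The aggregator receives every opinion in its own context (N+1 calls in total).
    numbered = "\n".join(f"Opinion {i}: {text}" for i, text in enumerate(opinions, 1))
    aggregator_prompt = (f"{prompt}\n\n[Available tools]\n{tool_context}\n\n"
                         f"[Reference opinions]\n{numbered}")
    return aggregator(aggregator_prompt)


# Stub models standing in for real LLM calls (assumptions for illustration only).
def ref_a(prompt: str) -> str:
    return "Use Python to multiply."


def ref_b(prompt: str) -> str:
    return "The answer should be 391."


def agg(prompt: str) -> str:
    return f"Synthesized from {prompt.count('Opinion ')} opinions"


print(mixture_of_agents("What is 17 x 23?", "code_execution", [ref_a, ref_b], agg))
```

Because every reference opinion is pasted into the aggregator's prompt, both the number of calls and the aggregator's input size grow with N. The measurements later in this article reflect that.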

Claimed Values from HermesBench

Regarding the basis for the numbers mentioned at the outset, the official docs show a table like this. HermesBench is an internal Nous benchmark, and the leaderboard itself has not yet been made public.


Model	HermesBench Score
MoA (Opus aggregator)	0.8202
Claude Opus 4.8 standalone	0.7607
GPT-5.5 standalone	0.7412

Relative to standalone Opus 4.8 the gain is +0.0595 in absolute terms, or +7.82%; relative to standalone GPT-5.5 it is +10.66%. This is consistent with the "8% higher than Opus 4.8 and 11% higher than GPT 5.5" posted on the official X account, so these three scores serve as the primary reference point for now.
Once the leaderboard is published, we should be able to see more detailed evaluation criteria.
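The percentages can be reproduced directly from the three scores in the table. The short check below only restates that arithmetic; the function name `relative_gain` is my own.

```python
# HermesBench scores as listed in the official docs table above.
scores = {
    "moa_opus_aggregator": 0.8202,
    "opus_4_8_single": 0.7607,
    "gpt_5_5_single": 0.7412,
}


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100


moa = scores["moa_opus_aggregator"]
for name in ("opus_4_8_single", "gpt_5_5_single"):
    base = scores[name]
    print(f"vs {name}: +{moa - base:.4f} absolute, +{relative_gain(moa, base):.2f}%")
```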

Environment Setup & Prerequisites

The version of Hermes Agent used during verification is v0.17.0 (2026.6.19) · upstream 88b3d863.
hermes --version
# Hermes Agent v0.17.0 (2026.6.19) · upstream 88b3d863

hermes moa list
One thing to be aware of is that there is about a one-week gap between the released v0.17.0 tag and the official docs. Running hermes update --check returns "1003 commits behind origin/main," indicating that the core MoA functionality (the hermes moa subcommand and the moa://local billing path) was added in commits bundled after the tag. Since the docs are written in sync with the latest origin/main, if you installed right after the tag, you'll encounter a situation where "following the docs doesn't work."
hermes update --check
# ⚕ Update available: 1003 commits behind origin/main.

hermes update --yes --backup
# git pull + dependency rebuild + skill sync + config v29→v30 migrate
hermes update runs git pull origin main internally, so running this will bring behavior in line with the docs. The hermes --version display string will remain v0.17.0 until the next tag bump, but you can verify sync with the latest main by checking the upstream <commit-hash> portion.
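Since the tag string stays at v0.17.0, the upstream commit hash is the part worth comparing before and after an update. A small sketch for pulling it out of the `hermes --version` line follows. The regular expression is inferred from the single output line shown above, so the format may differ in other builds.

```python
import re

# Pattern inferred from one sample line; treat the exact format as an assumption.
VERSION_RE = re.compile(
    r"v(?P<version>[\d.]+) \((?P<date>[\d.]+)\) · upstream (?P<commit>[0-9a-f]+)"
)


def parse_hermes_version(line: str) -> dict:
    """Split a `hermes --version` line into version, date and upstream commit."""
    match = VERSION_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized version string: {line!r}")
    return match.groupdict()


info = parse_hermes_version("Hermes Agent v0.17.0 (2026.6.19) · upstream 88b3d863")
print(info["version"], info["commit"])
```

Comparing `info["commit"]` before and after `hermes update` shows whether the checkout actually moved, even though the version string did not.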

Running with the Default Preset

First, let's confirm operation with the same default preset as in the official docs. Looking at the contents with hermes moa list, the configuration is as follows:
* default
  Reference models:
    1. openai-codex:gpt-5.5
    2. openrouter:deepseek/deepseek-v4-pro
  Aggregator: openrouter:anthropic/claude-opus-4.8
This matches the YAML example in the official docs: what was listed there as a "configuration example" turned out to be identical to the actual default. The display Active in config: (off) appears to mean that the moa: section isn't written to ~/.hermes/config.yaml until you generate it with hermes moa configure.
You can invoke it as a one-shot execution with hermes -z PROMPT --provider moa --model default. I ran this with 2 prompts (a light question: summarize MoA in 80 characters, and light reasoning: fair way to divide 3 apples and 4 oranges) × 4 configurations, then used compare.py to compile the subprocess wall-clock times and usage from hermes sessions export.
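The wall-clock side of that measurement can be sketched as below. This is not the author's compare.py, which is not shown in the article. It is a minimal stand-in that times a subprocess, and the helper names are hypothetical. The `-z`, `--provider` and `--model` flags are the ones quoted above. The demo at the end times a harmless Python one-liner so the sketch runs without Hermes installed.

```python
import subprocess
import sys
import time


def hermes_argv(prompt, provider=None, model=None):
    """Build a one-shot hermes command line from the flags quoted in the article."""
    argv = ["hermes", "-z", prompt]
    if provider:
        argv += ["--provider", provider]
    if model:
        argv += ["--model", model]
    return argv


def time_command(argv):
    """Run argv as a subprocess; return (elapsed seconds, return code, stdout)."""
    start = time.perf_counter()
    result = subprocess.run(argv, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, result.returncode, result.stdout


# Demo with a harmless command instead of a real (billable) hermes call.
elapsed, code, out = time_command([sys.executable, "-c", "print('ok')"])
print(f"{elapsed:.2f}s rc={code} out={out.strip()}")
```

To time MoA itself, pass `hermes_argv(prompt, "moa", "default")` to `time_command`. Token usage still has to come from hermes sessions export, whose output format is not covered here.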


config	prompt	latency (s)	input_tok	output_tok	api_calls	billing	cost_status
hermes-default	light_question	9.87	24,402	83	1	custom	unknown
moa-default	light_question	40.60	41,351	46	1	moa	unknown
opus-4.8-single	light_question	6.84	2	83	1	openrouter	estimated
gpt-5.5-single	light_question	11.94	24,873	179	2	openai-codex	included
moa-default	light_reasoning	9.26	41,462	110	1	moa	unknown
opus-4.8-single	light_reasoning	7.11	2	153	1	openrouter	estimated

Three things stood out when looking at the numbers.
First, latency stretched to about 6 times that of the standalone model. Even for a light question like "please summarize the concept of MoA in 80 characters," moa-default took 40.60 seconds. Opus standalone was 6.84 seconds and GPT-5.5 standalone was 11.94 seconds, so you can feel firsthand that the N+1 call design hits directly on wall-clock time. For reasoning-type questions, there was a case where it returned in 9.26 seconds, and there's significant variability depending on the length of the reference responses.
Second, input_tokens consistently hovered around 41K. Since the design injects the opinions of the 2 reference models into the aggregator's context, input ballooning to this scale is consistent with the documented design. opus-4.8-single counts only 2 input tokens because the path that calls the openrouter provider directly doesn't carry the agent loop context, which makes it an extreme comparison baseline. Even setting that aside, routing through MoA consumes roughly 1.7 times the input of the normal hermes-default (24K) on every call.
Third, it returns with cost_status: unknown. This means the actual cost of MoA isn't visible from the Hermes side. It's processed internally with a proprietary schema of billing_provider: moa / billing_base_url: moa://local, and both actual_cost_usd and estimated_cost_usd come back as null. Tracking "how much was spent" requires a separate path.
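The "about 6 times" and "roughly 1.7 times" figures follow from the table rows. The snippet below recomputes them from the measured values, with nothing added beyond the numbers already shown.

```python
# (latency in seconds, input tokens) taken from the results table above.
light_question = {
    "hermes-default": (9.87, 24_402),
    "moa-default": (40.60, 41_351),
    "opus-4.8-single": (6.84, 2),
}
light_reasoning = {
    "moa-default": (9.26, 41_462),
    "opus-4.8-single": (7.11, 2),
}

latency_vs_opus = light_question["moa-default"][0] / light_question["opus-4.8-single"][0]
input_vs_hermes = light_question["moa-default"][1] / light_question["hermes-default"][1]
reasoning_vs_opus = light_reasoning["moa-default"][0] / light_reasoning["opus-4.8-single"][0]

print(f"light_question latency, MoA / Opus:   {latency_vs_opus:.1f}x")
print(f"light_question input, MoA / default:  {input_vs_hermes:.1f}x")
print(f"light_reasoning latency, MoA / Opus:  {reasoning_vs_opus:.1f}x")
```

The third ratio shows how wide the spread is. With a single run per prompt, the latency numbers are best read as indicative rather than as averages.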
On response quality, there was a difference between MoA and GPT-5.5 standalone. For the reasoning question, MoA carefully showed the procedure: "First distribute 2 apples, 1 each; cut the remaining 1 in half, giving 1/2 each. Distribute 4 oranges 2 each. Result: each person gets 1.5 apples and 2 oranges—fair." Meanwhile, GPT-5.5 standalone answered as briefly as possible: "Give 1 apple each, cut the remaining 1 in half. Give 2 oranges each. This is fair." Preference between the two may vary by person.

Creating a Custom Preset to Peek at Actual Costs

I'll work around the constraint I noticed earlier ("cost isn't visible from the Hermes side") by approaching it from the outside with the OpenRouter Management API. If all references are aligned to go through OpenRouter, I can read how much was consumed on OpenRouter during verification by taking the difference in total_usage from /api/v1/credits before and after.
A custom preset is registered by appending it to the end of ~/.hermes/config.yaml.
moa:
  presets:
    openrouter-only:
      reference_models:
        - provider: openrouter
          model: openai/gpt-5.5
        - provider: openrouter
          model: anthropic/claude-sonnet-4.6
        - provider: openrouter
          model: deepseek/deepseek-v4-pro
      aggregator:
        provider: openrouter
        model: anthropic/claude-opus-4.8
      reference_temperature: 0.6
      aggregator_temperature: 0.4
      max_tokens: 4096
      enabled: true
After confirming that openrouter-only is loaded by running hermes moa list, I snapshot the OpenRouter credit immediately before verification, run hermes -z PROMPT --provider moa --model openrouter-only with 2 prompts, then take the credit reading again afterward to calculate the difference.
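The before/after snapshot can be sketched as follows. Some details are assumptions on my part and should be checked against the OpenRouter API reference: the host `openrouter.ai`, bearer-token authentication, the `data.total_usage` field in the JSON response, and the `OPENROUTER_API_KEY` variable name in the usage note. The helper names are hypothetical, and no network call happens unless `fetch_total_usage` is invoked.

```python
import json
import urllib.request

# Assumed full URL for the /api/v1/credits endpoint mentioned in the article.
CREDITS_URL = "https://openrouter.ai/api/v1/credits"


def fetch_total_usage(api_key: str, url: str = CREDITS_URL) -> float:
    """Read the account's cumulative usage in USD (response shape is assumed)."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return float(payload["data"]["total_usage"])


def measure_spend(run, fetch) -> float:
    """Snapshot usage, execute `run`, snapshot again, and return the difference."""
    before = fetch()
    run()
    after = fetch()
    return round(after - before, 6)


# Offline demo: canned readings stand in for the two real API snapshots.
readings = iter([1.000000, 1.483842])
print(measure_spend(lambda: None, lambda: next(readings)))
```

In a real run, `fetch` would be `lambda: fetch_total_usage(os.environ["OPENROUTER_API_KEY"])` and `run` would execute the hermes commands. Any other traffic on the same key during the window ends up in the difference, so the account should be otherwise idle.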


config	prompt	latency (s)	input_tok	output_tok
moa-openrouter-only	light_question	27.18	41,467	53
opus-4.8-single	light_question	7.19	2	83
moa-openrouter-only	light_reasoning	23.41	41,564	140
opus-4.8-single	light_reasoning	7.94	2	133

Latency stabilized at 23–27 seconds. Input remained fixed at 41K, same as with default. I thought increasing references from 2 to 3 might add a bit more time, but it wasn't as much as expected.
The crucial OpenRouter-side cost came out like this. Over these 4 requests (2 minutes), the total_usage difference was $0.483842. Using OpenRouter's official pricing (Opus 4.8 at $5/M input + $25/M output) to estimate 2 Opus standalone requests gives roughly $0.006, so the remaining $0.478 was consumed by the 2 MoA openrouter-only requests.


Item	Amount (approximate)
Total difference for 4 requests	$0.483842
Opus standalone 1 call	$0.003
MoA openrouter-only 1 call	$0.24
MoA / Opus standalone ratio	approximately 80x

While they claim "+7.82% quality improvement" on HermesBench, peeking at the OpenRouter credit meter shows a cost of approximately 80 times that of Opus standalone for the same prompts. This is a structural cost that's unavoidable given the design: running 3 reference models, feeding all their opinions into the aggregator's context, then synthesizing with Opus 4.8.
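The arithmetic behind the 80x figure can be reproduced from the numbers above: the Opus 4.8 price of $5/M input and $25/M output, the token counts of the two standalone Opus calls, and the $0.483842 credit difference. One thing the recomputation shows is that the unrounded ratio comes out closer to 88x. The article's "approximately 80x" results from dividing the rounded per-call figures ($0.24 / $0.003).

```python
INPUT_USD_PER_M = 5.0    # Opus 4.8 input price quoted in the article
OUTPUT_USD_PER_M = 25.0  # Opus 4.8 output price quoted in the article


def opus_cost(input_tok: int, output_tok: int) -> float:
    """Estimated USD cost of one standalone Opus 4.8 call."""
    return (input_tok * INPUT_USD_PER_M + output_tok * OUTPUT_USD_PER_M) / 1_000_000


total_delta = 0.483842               # total_usage difference over the 4 requests
opus_calls = [(2, 83), (2, 133)]     # (input_tok, output_tok) of the 2 standalone calls

opus_total = sum(opus_cost(i, o) for i, o in opus_calls)
opus_per_call = opus_total / len(opus_calls)
moa_per_call = (total_delta - opus_total) / 2   # remainder split over 2 MoA requests
ratio = moa_per_call / opus_per_call

print(f"Opus per call: ${opus_per_call:.5f}")
print(f"MoA per call:  ${moa_per_call:.5f}")
print(f"ratio (unrounded): {ratio:.1f}x, ratio (rounded figures): {0.24 / 0.003:.0f}x")
```

Either way the order of magnitude is the same. Bear in mind that the standalone baseline carries only 2 input tokens, so a baseline that included the agent loop context would narrow the gap.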
Asked "would you pay 80x for a +7.82% quality improvement," I think the answer varies completely by use case. For drafting legal documents or thinking through important decisions, it might be worth paying. But for a chatbot that gets called 100 times a day, honestly it's a tough number. I'd like readers to keep this "80x" figure in mind as material to draw their own line of "here I'd pay, here I wouldn't" for their own workloads.

Checking Tool Calling Behavior

I also want to confirm a point I was curious about: "does tool calling work properly with MoA?" I specified the code_execution toolset with -t code_execution and ran 4 configurations with the prompt "calculate 17 × 23 in Python and just return the result."


config	latency (s)	input_tok	output_tok	tool_calls	Result
hermes-default	8.58	11,878	104	1	391
moa-default	28.73	18,303	58	1	391
moa-openrouter-only	30.80	18,762	58	1	391
opus-4.8-single	6.75	104	58	1	391

All configs returned the correct answer of 391 with tool_call_count: 1, confirming that tool calling works fine with MoA as well. What's notable is MoA's latency: stretched to more than 4x compared to Opus standalone (6.75 seconds) at 28–30 seconds. The reference models read the tool context and return their opinions, which feeds directly into wall-clock time. The input_tokens side also ballooned to 18K for MoA, over 170 times the 104 tokens for Opus standalone.
The api_call_count: 2 structure itself is the same as standalone Opus, and the flow where the aggregator calls the tool after deliberation informed by tool context worked successfully.
I also noticed that in the tool calling path, MoA's input fell by more than half, from 41K to 18K. This appears to be the result of the system prompt from the profile and irrelevant context being stripped away because the toolset was narrowed with -t code_execution.
The official docs also explicitly specify behaviors like "authentication failures in references get injected as errors into the context" and "recursive MoA (specifying another moa preset as the aggregator) is prohibited to prevent infinite loops," but I didn't do destructive testing to that extent this time. If you're curious, make a backup of ~/.hermes/config.yaml and try it out yourself.
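Before editing ~/.hermes/config.yaml by hand, a preset can be sanity-checked as plain data. The sketch below is a hypothetical pre-flight check written for this article, not Hermes' own validation. The key names come from the openrouter-only preset shown earlier, and the recursion rule comes from the docs' statement that another moa preset cannot be used as the aggregator. The preset is given as a Python dict so that no YAML library is required.

```python
def validate_preset(preset: dict) -> list:
    """Return a list of human-readable problems; an empty list means it looks fine."""
    problems = []

    references = preset.get("reference_models") or []
    if not references:
        problems.append("no reference_models defined")
    for index, ref in enumerate(references, 1):
        for key in ("provider", "model"):
            if not ref.get(key):
                problems.append(f"reference {index} is missing '{key}'")

    aggregator = preset.get("aggregator") or {}
    for key in ("provider", "model"):
        if not aggregator.get(key):
            problems.append(f"aggregator is missing '{key}'")
    # The docs prohibit recursive MoA: the aggregator must not be another moa preset.
    if aggregator.get("provider") == "moa":
        problems.append("aggregator must not be another moa preset (recursive MoA)")

    return problems


openrouter_only = {
    "reference_models": [
        {"provider": "openrouter", "model": "openai/gpt-5.5"},
        {"provider": "openrouter", "model": "anthropic/claude-sonnet-4.6"},
        {"provider": "openrouter", "model": "deepseek/deepseek-v4-pro"},
    ],
    "aggregator": {"provider": "openrouter", "model": "anthropic/claude-opus-4.8"},
}
print(validate_preset(openrouter_only))
```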

Comparison with Existing Consensus Patterns

Drawing on the hands-on data so far, I've organized the approaches for bundling multiple LLMs side by side. Since "so which one should we actually use?" comes up often internally, this table also serves to organize my own thinking. Notably, the aggregator-based Hermes MoA is the only pattern here whose cost cannot be seen from within the tool itself.


Pattern	Example	Consensus Mechanism	Calls per Request	Cost Transparency	Edit Configuration Yourself
Router-based	LLM Router v3 / OpenRouter Auto	MLP / heuristic selects 1 model	1	◯ Completed on provider side	Requires pool retraining
Coordinator LLM-based	Sakana Fugu	Trained LLM synthesizes multiple models behind the scenes	1 request = multiple behind the scenes	△ Partial exposure via orchestration_tokens	Cannot edit presets
Aggregator-based	Hermes MoA	1 aggregator synthesizes N reference opinions	N+1	Unknown from Hermes side	Free editing via preset YAML
MCP route-based	OpenRouter MCP Server	Via MCP with bearer authentication	1	◯ Detailed retrieval via generation-get	Provider list is fixed

For practical usage differentiation, I think something like the following breakdown is realistic. Router-based LLM Router v3 for situations where you want to run things cheaply and quickly; Coordinator-based Sakana Fugu when you want to delegate the logic and just need intelligence; Aggregator-based Hermes MoA when you want to tune the reference configuration yourself and aim to surpass the frontier on benchmarks; MCP route-based OpenRouter MCP when you want to call multiple providers cross-functionally from an agent.
I've written detailed articles on LLM Router v3 and Sakana Fugu separately, so please check those links for deeper dives.

Conclusion

After spending a day hands-on with Hermes Agent's MoA, I was able to confirm several things.
Nous claims +7.82% vs standalone Opus 4.8 and +10.66% vs standalone GPT-5.5 on HermesBench, while the actual cost measured via the OpenRouter Management API was roughly 80x that of standalone Opus: a stark tradeoff between quality improvement and cost. The leaderboard doesn't appear to be public yet, so I'd like to dig deeper in a follow-up once it's available.
Tool calling worked fine with MoA too, with the flow being: references return opinions incorporating tool context, the aggregator calls the tool and synthesizes. However, since Hermes internal billing stays at cost_status: unknown, the practical solution is to separately track costs on the OpenRouter side. Incorporating Nous Portal credits through the hermes proxy path into references—something I didn't get to try this time—remains a candidate for a follow-up.
On the environment side, official docs are written in sync with the latest origin/main while release tags are cut with a delay. Hermes itself was actively being updated during my verification, so I strongly recommend running hermes update --check to confirm you have the latest before trying to reproduce anything in this article.
If Teknium's vision of "aiming for Opus-level performance using combinations of open source models" becomes reality, the 80x cost might come down to around 10x, making it much more practical for everyday use. If the trend of increasingly restricted access to frontier models continues, I have a feeling the aggregator-type consensus pattern as a whole will mature another notch, so I'm hoping to verify again after the HermesBench leaderboard is published.

Reference Links

Mixture of Agents (Hermes Agent docs)
Mixture-of-Agents Enhances Large Language Model Capabilities (arXiv, 2024-06)
Statement on the US government directive to suspend access to Fable 5 and Mythos 5 (Anthropic)
Trying Sakana Fugu (GA) with a subscription plan
Trying PLaMo 3.0 Prime
Trying OpenRouter's MCP Server
Building an LLM usage differentiation environment by purpose with NVIDIA LLM Router (Fundamentals)
Retraining NVIDIA LLM Router to match my own persona (Training)
