I tried Hermes Agent's Mixture of Agents (MoA)

2026.06.29

Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

A feature called Mixture of Agents (MoA) has been added to Nous Research's Hermes Agent. The idea of "throwing the same question to multiple LLMs and having another LLM synthesize their opinions" has apparently existed as a paper since 2024, but I'd like to explore why it's attracting renewed attention now by actually getting hands-on with it.

To state the conclusion upfront: Nous claims a +7.82% quality improvement over Claude Opus 4.8 alone on HermesBench, its internal benchmark, but in my own test the actual cost on OpenRouter came to roughly 80 times that of Opus alone. Below I collect the hands-on data so that readers can judge for themselves whether it is worth paying for.

Why MoA Is Attracting Renewed Attention Now

The concept of MoA itself is not new. The starting point was Together.ai's paper Mixture-of-Agents Enhances Large Language Model Capabilities published in June 2024, known as the pattern where "N reference models provide opinions, and one aggregator model synthesizes them."

The topic is resurfacing in 2026 against the backdrop of a structural change: access to frontier models is being gradually restricted. Most recently, on June 12, 2026, Anthropic's Fable 5 and Mythos 5 were suspended for all customers due to U.S. government export regulations. It was a drastic measure: every deployment route, API and web interface alike, was cut off abruptly, regardless of whether users were inside or outside the U.S. or whether they were foreign nationals. Add to that tighter rate limits and price hikes for Claude Opus 4.8 and GPT-5.5, plus the emergence of closed orchestration systems such as Sakana Fugu that bundle multiple models behind the scenes, and the strongest models are increasingly becoming something only "a select few" can access.

In that atmosphere, Nous Research sent out the following message on their official X:

The strongest models are gated and access is granted only to a select few. Hermes Agent now exposes MoA presets as virtual models, giving you capabilities beyond the publicly available frontier: 8% higher than Opus 4.8 and 11% higher than GPT 5.5 on our upcoming benchmark.

This statement maps the essence of MoA onto the current industry situation: if the strongest models can only be touched by certain people, bundle multiple models to exceed the frontier. Teknium also followed up saying he "wants to aim for Opus-equivalent quality with a combination of open source models," and MoA seems to have become a realistic option again as a consensus pattern.

MoA's Position in Hermes Agent

MoA is integrated into Hermes Agent as a "virtual model provider" and can be invoked with a single call using the /moa command or --provider moa specification.

There are several other patterns for bundling multiple LLMs, each with different philosophies. Comparing them with what I've been working with recently, here's how they break down:

| Category | Representative Examples | Mechanism |
| --- | --- | --- |
| Router type | NVIDIA LLM Router v3 / OpenRouter Auto | MLP classifier or heuristic selects and routes to one model |
| Coordinator LLM type | Sakana Fugu | A small trained LLM calls multiple models in the background and synthesizes |
| Aggregator type | Hermes MoA | N reference models provide opinions, one aggregator synthesizes them |

Hermes MoA falls into this third category, and what I personally found most interesting was the transparency that allows users to directly edit the configuration of references and aggregator.

How MoA Works (Reference + Aggregator)

The official docs describe MoA as follows:

Use MoA when a hard task benefits from multiple model perspectives but still needs Hermes' normal agent loop: tool calls, follow-up iterations, interrupts, transcript persistence, and the same session context as any other message.

In other words, the guidance is to use MoA for difficult tasks where multiple model perspectives help, while Hermes' normal agent loop (tool calls and session continuity) stays intact.

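In outline, the flow looks like this: the prompt goes to each reference model, the references return their opinions, and the aggregator synthesizes those opinions into the final response.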

One key point is that only the aggregator side ultimately calls tools. Tool state is also passed to the reference side, and the flow involves references returning opinions that take the tool context into account, after which the aggregator bundles them and calls the tool. This impacts the latency I observed later when testing tool calling.
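The reference-plus-aggregator flow described above can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not Hermes Agent's implementation: the `mixture_of_agents` function, the stub models, and the prompt layout are all hypothetical, and running the references in parallel is an assumption rather than something the docs confirm.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# A "model" here is just a function from prompt text to response text.
Model = Callable[[str], str]


def mixture_of_agents(prompt: str, references: List[Model], aggregator: Model) -> str:
    """Fan the prompt out to every reference, then let the aggregator synthesize."""
    # Assumption: references are independent, so they are run in parallel here.
    with ThreadPoolExecutor(max_workers=max(1, len(references))) as pool:
        opinions = list(pool.map(lambda model: model(prompt), references))

    # The aggregator receives the original prompt plus every reference opinion,
    # which is why its input grows with the number of references.
    numbered = "\n".join(
        f"[reference {i + 1}] {text}" for i, text in enumerate(opinions)
    )
    return aggregator(f"Question:\n{prompt}\n\nReference opinions:\n{numbered}")


# Stub models so the sketch runs without any API access.
def ref_a(prompt: str) -> str:
    return "17 * 23 = 391"


def ref_b(prompt: str) -> str:
    return "The product is 391."


def agg(prompt: str) -> str:
    return f"synthesized from {prompt.count('[reference')} opinions"


print(mixture_of_agents("What is 17 * 23?", [ref_a, ref_b], agg))
# prints: synthesized from 2 opinions
```

In a real setup each stub would be replaced by an API call, and only the aggregator would be given the ability to call tools.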

Claimed Values on HermesBench

Regarding the basis for the numbers mentioned at the beginning, the official docs show a table like this. HermesBench is Nous's internal benchmark, and the leaderboard itself has not yet been made public.

| Model | HermesBench Score |
| --- | --- |
| MoA (Opus aggregator) | 0.8202 |
| Claude Opus 4.8 alone | 0.7607 |
| GPT-5.5 alone | 0.7412 |

That's +0.0595 in absolute value, +7.82% compared to Opus 4.8 alone, and +10.66% compared to GPT-5.5 alone. This is consistent with the official X post's claim of "8% higher than Opus 4.8 and 11% higher than GPT 5.5," so these three numbers serve as the provisional reference point.
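The percentages can be re-derived from the three scores. The short check below uses only the numbers in the table; the helper name `relative_gain_pct` is just for illustration.

```python
# HermesBench scores as quoted in the table above.
scores = {
    "moa_opus_aggregator": 0.8202,
    "opus_4_8": 0.7607,
    "gpt_5_5": 0.7412,
}


def relative_gain_pct(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


moa = scores["moa_opus_aggregator"]
print(round(moa - scores["opus_4_8"], 4))                     # 0.0595
print(round(relative_gain_pct(moa, scores["opus_4_8"]), 2))   # 7.82
print(round(relative_gain_pct(moa, scores["gpt_5_5"]), 2))    # 10.66
```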

Once the leaderboard is made public, we should be able to see more detailed evaluation axes.

Environment Setup and Prerequisites

The version of Hermes Agent used during verification is v0.17.0 (2026.6.19) · upstream 88b3d863.

hermes --version
# Hermes Agent v0.17.0 (2026.6.19) · upstream 88b3d863

hermes moa list

One thing to be careful about is that there's a gap of about one week between the released v0.17.0 tag and the official docs. Running hermes update --check returns "1003 commits behind origin/main," which reveals that the core MoA functionality (the hermes moa subcommand and billing routes for moa://local) was all added in commits after the tag. Since the docs are written in sync with the latest origin/main, if you installed right after the tag, you'll encounter a situation where "things don't work even if you follow the docs."

hermes update --check
# ⚕ Update available: 1003 commits behind origin/main.

hermes update --yes --backup
# git pull + dependency rebuild + skill sync + config v29→v30 migrate

hermes update internally runs git pull origin main, so running this will give you behavior that matches the docs. The version string in hermes --version will remain v0.17.0 until the next tag bump, but you can verify sync with the latest main by checking the upstream <commit-hash> portion.
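If you want to check the upstream commit in a script rather than by eye, the version line can be parsed with a regular expression. This is a sketch that assumes the output keeps the shape of the single sample line shown above; the function name and field names are hypothetical.

```python
import re

# Pattern modeled on the one sample line from `hermes --version` shown above.
VERSION_RE = re.compile(
    r"v(?P<version>\d+\.\d+\.\d+)\s+\((?P<date>[\d.]+)\).*?upstream\s+(?P<commit>[0-9a-f]{7,40})"
)


def parse_hermes_version(line: str) -> dict:
    """Split a version line into its tag, date, and upstream commit hash."""
    match = VERSION_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized version line: {line!r}")
    return match.groupdict()


info = parse_hermes_version("Hermes Agent v0.17.0 (2026.6.19) · upstream 88b3d863")
print(info["version"], info["commit"])
# prints: 0.17.0 88b3d863
```

Comparing `info["commit"]` before and after an update is enough to tell whether the checkout moved, even while the tag string stays at v0.17.0.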

Running with the Default Preset

First, I'll verify operation with the same default preset as the official docs. Looking at the contents with hermes moa list, the configuration is as follows:

* default
  Reference models:
    1. openai-codex:gpt-5.5
    2. openrouter:deepseek/deepseek-v4-pro
  Aggregator: openrouter:anthropic/claude-opus-4.8

This matches the YAML example in the official docs: what the docs present as a "configuration example" is identical to the actual default. The Active in config: (off) display means that the moa: section is not written to ~/.hermes/config.yaml until you generate it with hermes moa configure.

MoA can be invoked as a one-shot execution with hermes -z PROMPT --provider moa --model default. I ran 2 prompts (a light question: summarize MoA in 80 characters or less, and a light reasoning question: a fair way to divide 3 apples and 4 oranges) across 4 configurations, and used a script, compare.py, to collect the subprocess wall-clock time together with the usage figures from hermes sessions export.

| config | prompt | latency (s) | input_tok | output_tok | api_calls | billing | cost_status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| hermes-default | light_question | 9.87 | 24,402 | 83 | 1 | custom | unknown |
| moa-default | light_question | 40.60 | 41,351 | 46 | 1 | moa | unknown |
| opus-4.8-single | light_question | 6.84 | 2 | 83 | 1 | openrouter | estimated |
| gpt-5.5-single | light_question | 11.94 | 24,873 | 179 | 2 | openai-codex | included |
| moa-default | light_reasoning | 9.26 | 41,462 | 110 | 1 | moa | unknown |
| opus-4.8-single | light_reasoning | 7.11 | 2 | 153 | 1 | openrouter | estimated |
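The wall-clock side of a measurement like this can be sketched as follows. This is not the actual compare.py, which is not shown in this article; `time_command` and `hermes_oneshot_argv` are hypothetical helpers, and only the flags already quoted above (-z, --provider, --model) are used. The demo times a stand-in Python command so that it runs without Hermes installed.

```python
import subprocess
import sys
import time


def time_command(argv, timeout_s=300.0):
    """Run a command and return its wall-clock duration and captured output."""
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    elapsed = time.perf_counter() - start
    return {
        "argv": argv,
        "seconds": round(elapsed, 2),
        "returncode": proc.returncode,
        "stdout": proc.stdout,
    }


def hermes_oneshot_argv(prompt, provider, model):
    """Build the one-shot invocation quoted in the text above."""
    return ["hermes", "-z", prompt, "--provider", provider, "--model", model]


# Stand-in command so the sketch runs anywhere; swap in
# hermes_oneshot_argv("...", "moa", "default") on a machine with Hermes.
result = time_command([sys.executable, "-c", "print('ok')"])
print(result["seconds"], result["stdout"].strip())
```

Timing the whole subprocess includes process start-up, so the numbers are end-to-end latency as a user would experience it rather than pure model time.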

Looking at the numbers, I noticed three things.

First, latency stretched to roughly 6 times that of Opus alone. Even for a light question like "summarize the concept of MoA in 80 characters," moa-default took 40.60 seconds. Opus alone took 6.84 seconds and GPT-5.5 alone took 11.94 seconds, so I could feel firsthand that this is a design where N+1 calls hit wall-clock directly. For the reasoning-oriented question it returned in 9.26 seconds, so the variance is large and seems to depend on the response length of the references.

Second, input_tokens consistently hovered around 41K. Since the design injects the opinions of 2 reference models into the aggregator's context, input expansion on this scale is in line with the official spec. opus-4.8-single shows only 2 input tokens because calling the openrouter provider directly does not add the agent loop context, which makes it an extreme baseline for comparison. Even accounting for that, MoA processes an input roughly 1.7 times larger than the standard hermes-default (24K) every time.

Third, it returns with cost_status: unknown. This means actual MoA costs are not visible from the Hermes side. It's processed internally with a proprietary schema of billing_provider: moa / billing_base_url: moa://local, and both actual_cost_usd and estimated_cost_usd return as null. Tracking "how much was spent" requires a separate route.
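The "roughly 6 times" and "roughly 1.7 times" figures above can be reproduced from the light_question rows of the table. The dictionary layout and the `ratio` helper are just for illustration; the values are copied from the measurements.

```python
# light_question rows from the measurement table.
rows = {
    "hermes-default": {"latency_s": 9.87, "input_tok": 24402},
    "moa-default": {"latency_s": 40.60, "input_tok": 41351},
    "opus-4.8-single": {"latency_s": 6.84, "input_tok": 2},
    "gpt-5.5-single": {"latency_s": 11.94, "input_tok": 24873},
}


def ratio(metric: str, config: str, baseline: str) -> float:
    """How many times larger `config` is than `baseline` for one metric."""
    return rows[config][metric] / rows[baseline][metric]


print(round(ratio("latency_s", "moa-default", "opus-4.8-single"), 1))  # 5.9
print(round(ratio("latency_s", "moa-default", "gpt-5.5-single"), 1))   # 3.4
print(round(ratio("input_tok", "moa-default", "hermes-default"), 2))   # 1.69
```

Against GPT-5.5 alone the latency gap is closer to 3.4 times, so the headline multiplier depends on which single model is taken as the baseline.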

On the response quality side, there was a difference between MoA and GPT-5.5 alone. For the reasoning-oriented question, MoA carefully showed the procedure: "For apples, first distribute 1 each from 2, then cut the remaining 1 in half for 1/2 each. For oranges, distribute 2 each. Result: each person gets 1.5 apples and 2 oranges, which is fair." Meanwhile, GPT-5.5 alone went with the shortest answer: "Give 1 apple each and cut the remaining 1 in half. Give 2 oranges each. This is fair." Which is preferable may depend on the person.

Creating a Custom Preset to Peek at Actual Costs

Next, I'll work around the constraint I hit earlier, that costs aren't visible from the Hermes side, by measuring from the outside with the OpenRouter Management API. If I route all references through OpenRouter, I can read how much was consumed on OpenRouter during testing from the difference in total_usage returned by /api/v1/credits.

Custom presets are registered by appending to the end of ~/.hermes/config.yaml.

moa:
  presets:
    openrouter-only:
      reference_models:
        - provider: openrouter
          model: openai/gpt-5.5
        - provider: openrouter
          model: anthropic/claude-sonnet-4.6
        - provider: openrouter
          model: deepseek/deepseek-v4-pro
      aggregator:
        provider: openrouter
        model: anthropic/claude-opus-4.8
      reference_temperature: 0.6
      aggregator_temperature: 0.4
      max_tokens: 4096
      enabled: true

After confirming that openrouter-only is loaded with hermes moa list, I take a snapshot of OpenRouter credits just before testing, run hermes -z PROMPT --provider moa --model openrouter-only for 2 prompts, then take credits again afterward to calculate the difference.

| config | prompt | latency (s) | input_tok | output_tok |
| --- | --- | --- | --- | --- |
| moa-openrouter-only | light_question | 27.18 | 41,467 | 53 |
| opus-4.8-single | light_question | 7.19 | 2 | 83 |
| moa-openrouter-only | light_reasoning | 23.41 | 41,564 | 140 |
| opus-4.8-single | light_reasoning | 7.94 | 2 | 133 |

Latency stabilized at 23–27 seconds. Input remained consistently at 41K. I thought increasing references from 2 to 3 would add a bit more time, but it wasn't as much as expected.
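The before-and-after credit snapshot can be sketched with the standard library alone. Treat this as an outline under assumptions: it assumes the /api/v1/credits response has the shape {"data": {"total_usage": ...}}, that a key with access to that endpoint is supplied by the caller, and that the usage meter is up to date when it is read. The function names are hypothetical.

```python
import json
import urllib.request

CREDITS_URL = "https://openrouter.ai/api/v1/credits"


def extract_total_usage(payload: dict) -> float:
    """Pull total_usage out of a credits response (assumed response shape)."""
    return float(payload["data"]["total_usage"])


def fetch_total_usage(api_key: str) -> float:
    """Read the account's cumulative usage in USD from OpenRouter."""
    request = urllib.request.Request(
        CREDITS_URL, headers={"Authorization": f"Bearer {api_key}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_total_usage(json.load(response))


def usage_delta(before: float, after: float) -> float:
    """USD consumed between two snapshots."""
    return round(after - before, 6)


def measure_spend(api_key: str, run_workload) -> float:
    """Snapshot usage, run the workload, snapshot again, return the difference."""
    before = fetch_total_usage(api_key)
    run_workload()
    after = fetch_total_usage(api_key)
    return usage_delta(before, after)


# Offline demonstration with made-up snapshot values.
sample = {"data": {"total_credits": 50.0, "total_usage": 12.5}}
print(usage_delta(extract_total_usage(sample), 12.983842))
```

Because the difference covers everything billed to the account in that window, the measurement is only meaningful if nothing else is using the same OpenRouter account at the time.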

The key OpenRouter-side costs came out like this. The total_usage difference for these 4 requests (2 minutes) was $0.483842. Estimating 2 requests of Opus alone based on OpenRouter's official pricing (Opus 4.8 at $5/M input + $25/M output) comes to roughly $0.006, meaning the remaining $0.478 was consumed by the 2 MoA openrouter-only requests.

| Metric | Amount (approximate) |
| --- | --- |
| Total difference for 4 requests | $0.483842 |
| Opus single, 1 call | $0.003 |
| MoA openrouter-only, 1 call | $0.24 |
| MoA / Opus single ratio | approximately 80x |
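The breakdown above can be reproduced step by step. The sketch below takes the Opus token counts from the openrouter-only run (2 input tokens each, 83 and 133 output tokens) and the quoted pricing, then attributes the rest of the measured difference to the two MoA calls. Whether OpenRouter actually billed those token counts is an assumption, since they are what Hermes reported.

```python
# Pricing quoted in the text: Opus 4.8 at $5/M input and $25/M output tokens.
OPUS_INPUT_USD_PER_MTOK = 5.0
OPUS_OUTPUT_USD_PER_MTOK = 25.0


def opus_cost_usd(input_tok: int, output_tok: int) -> float:
    """Estimated cost of one Opus call from its token counts."""
    return (
        input_tok * OPUS_INPUT_USD_PER_MTOK + output_tok * OPUS_OUTPUT_USD_PER_MTOK
    ) / 1_000_000


total_delta_usd = 0.483842            # total_usage difference for all 4 requests
opus_runs = [(2, 83), (2, 133)]       # (input_tok, output_tok) per Opus call

opus_total_usd = sum(opus_cost_usd(i, o) for i, o in opus_runs)
moa_per_call_usd = (total_delta_usd - opus_total_usd) / 2

print(f"Opus, 2 calls: ${opus_total_usd:.4f}")   # Opus, 2 calls: $0.0054
print(f"MoA, per call: ${moa_per_call_usd:.3f}")  # MoA, per call: $0.239
print(f"ratio vs $0.003 per Opus call: {moa_per_call_usd / 0.003:.0f}x")  # 80x
```

The token-based Opus estimate lands on the same order as the roughly $0.006 quoted above. Dividing by the unrounded per-call Opus estimate instead of $0.003 gives a somewhat higher ratio, so 80x is best read as an order-of-magnitude figure.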

While the claim is "+7.82% quality improvement" on HermesBench, peeking at OpenRouter's credit meter reveals you're paying approximately 80 times the cost just to run the same prompt. This is a structural cost unavoidable given the mechanism of running 3 reference models, feeding all their opinions into the aggregator's context, and synthesizing with Opus 4.8.

Whether you would pay 80 times the cost for +7.82% quality depends on the use case. For drafting legal documents or sounding out important decisions, it might be worth paying, but for a chatbot use case running 100 times a day, it's honestly a tough number. I hope readers can keep this 80x figure in the back of their minds as material for drawing the line of "I can pay here, but not here" for their own workloads.
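To put the 100-runs-a-day scenario into numbers, the per-call figures from the table can be scaled up. This is a rough projection that assumes every call costs the same as the light prompts measured here, which real workloads will not match exactly; the 30-day window is an arbitrary choice for illustration.

```python
# Approximate per-call costs from the measurements above.
USD_PER_CALL = {"MoA openrouter-only": 0.24, "Opus single": 0.003}


def projected_cost_usd(calls_per_day: int, days: int, usd_per_call: float) -> float:
    """Linear projection: calls per day x days x cost per call."""
    return round(calls_per_day * days * usd_per_call, 2)


for label, price in USD_PER_CALL.items():
    daily = projected_cost_usd(100, 1, price)
    monthly = projected_cost_usd(100, 30, price)
    print(f"{label}: ${daily:.2f}/day, ${monthly:.2f}/30 days")
# MoA openrouter-only: $24.00/day, $720.00/30 days
# Opus single: $0.30/day, $9.00/30 days
```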

Checking Tool Calling Behavior

I also wanted to confirm "whether tool calling works properly with MoA." I specified the code_execution toolset with -t code_execution and ran the prompt "Calculate 17 × 23 in Python and return only the result" across 4 configurations.

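| config | latency (s) | input_tok | output_tok | tool_calls | Result |
| --- | --- | --- | --- | --- | --- |
| hermes-default | 8.58 | 11,878 | 104 | 1 | 391 |
| moa-default | 28.73 | 18,303 | 58 | 1 | 391 |
| moa-openrouter-only | 30.80 | 18,762 | 58 | 1 | 391 |
| opus-4.8-single | 6.75 | 104 | 58 | 1 | 391 |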

All configurations returned the correct answer of 391 with tool_call_count: 1, confirming that tool calling works fine with MoA as well. What's noteworthy is MoA's latency—it stretched to 28–30 seconds compared to Opus alone (6.75 seconds), more than 4 times longer. This is because each reference reads the tool context and returns its opinion, which then feeds directly into wall-clock time. The input_tokens side also ballooned to 18K for MoA, compared to 104 tokens for Opus alone—a difference of more than 170 times.

The api_call_count: 2 structure itself is the same as Opus alone, and the flow of the aggregator calling the tool based on consensus opinions that incorporate tool context worked as-is.

I also noticed that with the tool calling route, MoA's input nearly halved from 41K to 18K. This appears to be the result of restricting the toolset with -t code_execution, which stripped away the profile's system prompt and irrelevant context.

The official docs also explicitly state other specifications such as "when reference authentication fails, errors are injected into context" and "recursive MoA (specifying another moa preset as aggregator) is prohibited to prevent infinite loops," but I didn't perform that level of destructive testing this time. For those who are curious, please back up ~/.hermes/config.yaml and try tripping over it yourself.
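For a non-destructive alternative, a preset can be sanity-checked before it goes into the config file. The function below is a hypothetical pre-flight check written for this article, not Hermes' own validation. It works on a plain dict mirroring the preset YAML shown earlier, and it assumes that a recursive preset would show up as an aggregator with provider set to moa.

```python
def preset_problems(preset: dict) -> list:
    """Return human-readable problems found in one MoA preset dict."""
    problems = []

    references = preset.get("reference_models") or []
    if not references:
        problems.append("preset has no reference models")
    for index, ref in enumerate(references, start=1):
        if not ref.get("provider") or not ref.get("model"):
            problems.append(f"reference {index} is missing provider or model")

    aggregator = preset.get("aggregator") or {}
    if not aggregator.get("provider") or not aggregator.get("model"):
        problems.append("aggregator is missing provider or model")
    # Assumption: recursion would be expressed as provider "moa" on the aggregator.
    if aggregator.get("provider") == "moa":
        problems.append("aggregator points at another MoA preset (recursive MoA)")

    return problems


openrouter_only = {
    "reference_models": [
        {"provider": "openrouter", "model": "openai/gpt-5.5"},
        {"provider": "openrouter", "model": "anthropic/claude-sonnet-4.6"},
        {"provider": "openrouter", "model": "deepseek/deepseek-v4-pro"},
    ],
    "aggregator": {"provider": "openrouter", "model": "anthropic/claude-opus-4.8"},
}
recursive = dict(openrouter_only, aggregator={"provider": "moa", "model": "default"})

print(preset_problems(openrouter_only))  # []
print(preset_problems(recursive))
```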

Comparison with Existing Consensus Patterns

Based on the hands-on data so far, I've organized the approaches for bundling multiple LLMs side by side. Since I'm often asked internally "which should we actually use," this table doubles as a way of organizing my own thoughts. One distinctive point is that Hermes MoA, the aggregator type, is the only pattern here whose costs cannot be seen from within the tool itself.

| Pattern | Examples | Consensus Mechanism | Calls per Request | Cost Transparency | Self-Editing Configuration |
| --- | --- | --- | --- | --- | --- |
| Router type | LLM Router v3 / OpenRouter Auto | MLP / heuristic selects 1 model | 1 | ◯ Handled entirely on provider side | Requires pool retraining |
| Coordinator LLM type | Sakana Fugu | Trained LLM synthesizes multiple models in background | 1 req = multiple behind the scenes | △ Partially exposed via orchestration_tokens | Preset editing not possible |
| Aggregator type | Hermes MoA | 1 aggregator synthesizes N reference opinions | N+1 | Unknown from Hermes side | Free editing via preset YAML |
| MCP route type | OpenRouter MCP Server | Via MCP with bearer authentication | 1 | ◯ Detailed retrieval via generation-get | Provider list is fixed |

For practical usage differentiation, I think the following breakdown makes sense: router-type LLM Router v3 for scenarios where you want to run fast and cheap; coordinator-type Sakana Fugu when you want to delegate backend logic and just get smart results; aggregator-type Hermes MoA when you want to tune reference configurations yourself and push past the frontier on benchmarks; and MCP route-type OpenRouter MCP when you want to cross-call multiple providers from an agent.

For more details on LLM Router v3 and Sakana Fugu, I've written separate articles, so please check the linked articles for deeper coverage.

Conclusion

After spending a day hands-on with Hermes Agent's MoA, I was able to confirm several things.

On the benchmark side, the claim is +7.82% over Opus 4.8 alone and +10.66% over GPT-5.5 alone on HermesBench, while the actual cost measured via the OpenRouter Management API came out to approximately 80 times that of Opus alone. The result is a stark tradeoff between quality improvement and cost. Since the leaderboard doesn't seem to be publicly available yet, I hope to dig deeper in a follow-up.

Tool calling worked fine with MoA too—references returned opinions including tool context, and the aggregator called the tool and synthesized. However, the internal billing in Hermes remains cost_status: unknown, so the practical solution is to separately track costs on the OpenRouter side. The configuration of incorporating Nous Portal credits via hermes proxy into references—which I didn't get to try this time—remains a candidate to test in a follow-up.

On the environment side, while the official docs are written in sync with the latest origin/main, release tags are cut with a delay. Since Hermes itself was being actively updated during my testing in this article, I strongly recommend confirming you have the latest with hermes update --check before getting started when reproducing on your own machine.

If Teknium's statement about "wanting to target Opus-equivalent quality with a combination of open source models" becomes a realistic solution, the 80x cost might come down to around 10x, making it much more practical for everyday use. If the trend of increasingly restricted access to frontier models continues, I have a feeling the entire aggregator-type consensus pattern will mature another step, so I hope to verify again after the HermesBench leaderboard is published.

