I ran Ornith 1.0 on DGX Spark and compared its Japanese language performance against Gemma 4 / Nemotron

I ran Ornith 1.0 on DGX Spark and compared its Japanese language performance against Gemma 4 / Nemotron

2026.06.28

This page has been translated by machine translation. View original

Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

On June 25, 2026, DeepReinforce released a new open-source LLM family, Ornith 1.0. It comes in three sizes — 9B Dense / 35B MoE / 397B MoE — with an MIT license, 262K context, and a sharp design focused on "agentic coding." Official benchmarks claim scores approaching Claude Opus 4.7 on Terminal-Bench 2.1 / SWE-Bench Verified / ClawEval and others, and download numbers on HuggingFace have been climbing since release.

https://deep-reinforce.com/ornith_1_0.html

However, all of these official benchmarks are English-language tasks. There are no claims about Japanese performance anywhere on the LP or model cards.

So, as always with these articles, I decided to verify "whether the English agentic-specialized model Ornith 1.0, when run through Japanese benchmarks on a DGX Spark, can hold its own alongside existing Japanese-capable LLMs (Gemma 4 / Nemotron / Qwen3.6, etc.)."

The verification proceeded across the following 5 phases:

  1. Phase A Light Qualitative — Compare light questions and light reasoning with <think> block ON / OFF
  2. Phase B JCQ — Measure Japanese common sense with a 300-question subset of JCommonsenseQA v1.1
  3. Phase C ELYZA-tasks-100 — Score 100 free-text tasks using LLM-as-a-Judge
  4. Phase D Performance Measurement — Measure tok/s and GPU memory, then cross-reference against wesche.com Spark-Bench v2 public values
  5. Phase E BFCL v4 — Measure agentic suitability with the standard Tool Calling / Function Calling benchmark

"How well can an agentic-specialized model perform on Japanese benchmarks, or is the strength shown in official benchmarks limited to English?" To give the conclusion upfront: Ornith 1.0 9B came out on top among 5 models in Japanese free-text writing — a fairly encouraging result.

The Outline of Ornith 1.0

DeepReinforce is a team of 5 researchers who have previously published papers such as GrandCode (an RL training method for competitive programming) and CUDA-L2 (GPU kernel optimization). Following that lineage, Ornith 1.0 is a model family trained with the idea of "self-scaffolding" — where the model itself continuously generates the scaffold (the framework of steps and prerequisites) used in agentic tasks via RL.

Organizing the size and quantization lineup from the official HuggingFace collection:

Model Parameters Quantization Approx. Size Use Case
Ornith-1.0-9B 9B Dense bf16 ~18 GB Edge / single-machine
Ornith-1.0-9B-GGUF 9B Dense Q4_K_M / Q5_K_M ~5-7 GB llama.cpp systems
Ornith-1.0-35B 35B MoE (Active 3B) bf16 ~70 GB Mid-scale server
Ornith-1.0-35B-FP8 35B MoE (Active 3B) FP8 (compressed-tensors) ~36 GB When you want to run on a single GPU
Ornith-1.0-35B-GGUF 35B MoE (Active 3B) GGUF various llama.cpp systems
Ornith-1.0-397B 397B MoE bf16 ~800 GB Multi-node
Ornith-1.0-397B-FP8 397B MoE FP8 ~400 GB Multi-node

The 9B uses a qwen3_5-based architecture, while the 35B / 397B use a qwen3_5_moe-based MoE architecture. The official LP states the design is based on Gemma 4 and Qwen 3.5 with post-training applied.

What catches the eye in the official LP's comparison table is the 35B class positioning. Even within the same 35B-A3B class, it outpaces other vendors' Qwen3.5-35B / Qwen3.6-35B on ClawEval Avg and Terminal-Bench, and puts up numbers approaching Qwen3.5-397B (11 times the parameter count), the flagship in the 35B class.

Ornith-1.0-35B achieves 64.2 on Terminal-Bench 2.1 (Terminus-2) and 69.8 on ClawEval Avg, matching Qwen3.5-397B (53.5 / 70.7).
( Source: deep-reinforce.com/ornith_1_0.html )

What I always wonder here is: "Is Japanese support even intended?" Reading the LP and model cards all the way through, there are no claims about Japanese. The language ratio of training data is also not disclosed.

In other words, whether Ornith 1.0 can be used in Japanese is something we can only find out by actually testing it on real hardware.

Running on DGX Spark

I thought through how to run Ornith 1.0 on an NVIDIA DGX Spark (GB10 architecture, unified memory 128 GB, aarch64).

Size DGX Spark Single Node Expected Configuration
9B Dense (bf16) ◎ Comfortable ~18 GB, fits within ~30+ GB including KV cache
35B MoE (bf16) △ Tight ~70 GB, fits in unified memory but ~100 GB including KV cache, little headroom
35B MoE (FP8) ◎ Works ~36 GB, ~50-60 GB including KV cache, a realistic line
35B MoE (GGUF Q4_K_M) ◎ Light ~20-25 GB via llama.cpp, treated as reference
397B MoE (FP8) × Single node impossible ~400 GB, requires multi-node of 2+ nodes

For straightforward operation on a single DGX Spark, the two choices are 9B bf16 or 35B FP8. I included both in this verification. Since 397B cannot run on a single node, I limited it to citing official scores.

Setting Up the Verification Environment

This verification was conducted on a DGX Spark. Starting with the setup of Docker, HuggingFace CLI, and uv, I fetched the weights for Ornith 1.0 9B and 35B-FP8 from HuggingFace. The 9B consists of 4 shards at approximately 18 GB, and the 35B-FP8 is 30 files at approximately 36 GB. Both completed via hf download parallel fetching in approximately 7 minutes and 8 minutes, respectively.

For this run, I used vllm/vllm-openai:latest from Docker Hub (vLLM 0.23.0).

The vLLM startup command looks like this:

scripts/start-vllm-ornith.sh
docker run -d \
  --name ornith-vllm-9b \
  --gpus all --shm-size 16g --ipc host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 \
  vllm/vllm-openai:latest \
  --model deepreinforce-ai/Ornith-1.0-9B \
  --served-model-name ornith-9b \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --dtype bfloat16 \
  --reasoning-parser qwen3

For 9B, specify --dtype bfloat16; for 35B-FP8, explicitly specify --quantization compressed-tensors.

Time from startup to readiness was approximately 5 minutes for 9B, approximately 7-8 minutes for 35B-FP8, and approximately 8 minutes and 4 minutes for Qwen3.6-35B-A3B-FP8 and Nemotron Nano 30B-A3B-NVFP4 respectively. The breakdown is roughly 1.5-3 minutes for weight loading, 25-30 seconds for torch.compile, and 40-60 seconds for CUDAGraph capture (51+35 types of sizes from 1 to 512).

Here is a table summarizing the comparison models:

Model Quantization Size Runtime Source
Ornith 1.0 9B bf16 ~18 GB vLLM 0.23.0 deepreinforce-ai/Ornith-1.0-9B
Ornith 1.0 35B-FP8 FP8 (compressed-tensors) ~36 GB vLLM 0.23.0 deepreinforce-ai/Ornith-1.0-35B-FP8
Qwen3.6-35B-A3B-FP8 FP8 (auto-detected from config) ~35 GB vLLM 0.23.0 Qwen/Qwen3.6-35B-A3B-FP8
Nemotron 9B-v2-Japanese bf16 ~18 GB vLLM 0.23.0 nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese
Nemotron 3 Nano 30B-A3B-NVFP4 NVFP4 ~17 GB vLLM 0.23.0 nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
Gemma 4 26B-A4B-NVFP4 NVFP4 ~13 GB vLLM 0.23.0 nvidia/Gemma-4-26B-A4B-NVFP4

Ornith's 9B and 35B-FP8 serve as the main subjects, with 9B-class (Nemotron 9B-v2-JP), 30B-class (Nemotron Nano, Gemma 4 26B-A4B, Qwen3.6) placed alongside for comparison.

Phase A — Light Qualitative Testing Reveals <think> Always ON

First, I sent just two types of inputs to Ornith 9B: a light question and a light reasoning task. The prompts were as follows:

  • Light question: "Please introduce the model called Ornith 1.0 to readers in about 50 characters"
  • Light reasoning: "We split 3 apples and 4 oranges between 2 people. Please answer within 100 characters how many each person gets and suggest a fair way to divide them"

I sent each under two conditions: with the system prompt instructing "please do not output the <think> block" and instructing "please show your thought process and then give your final answer," comparing response length / latency / presence of <think>.

The results are summarized in the table below:

Prompt think instruction latency completion_tokens Final answer length finish_reason
Light question off 26.9 sec 356 45 chars stop
Light question on 56.1 sec 508 60 chars stop
Light reasoning off 13.0 sec 171 56 chars stop
Light reasoning on 38.8 sec 512 88 chars length

What I noticed here was that even when instructed "please do not output <think>," Ornith 9B internally generates <think>...</think> blocks. Looking at the raw responses, even in the think=off case, the thought process text always appears in the first half of the response, followed by the final answer after </think>.

The evidence is that completion_tokens reached 356 even for the light question with think=off. The actual final answer is 45 characters (a few dozen tokens), yet more than 10 times that number of tokens was consumed "before the final answer."

Checking the HuggingFace model card to be sure, it reads:

By default the assistant turn opens with a <think> … </think>. There is no documented switch to disable it.

In other words, Ornith 1.0 is a model where <think> always ON is a design specification that cannot be disabled via system prompt or API parameters. There is no reasoning_effort toggle like in Sakana Fugu or PLaMo 3.0 Prime.

However, there is a workaround. Enabling vLLM's --reasoning-parser qwen3 separates the API response into content (final answer) and reasoning_content (thought process). Testing this actually produces a response like:

{
  "content": "\n\nA",
  "reasoning_content": "Thinking Process:\n1. **Analyze the Request:** ...\n2. **Determine the Correct Answer:** ...\n"
}

With this, for tasks like JCQ where you want to parse a single character from choices, you can directly parse content for one character. I decided to always attach this flag when running subsequent benchmarks.

Here is a summary of Ornith 1.0 9B's raw behavior discovered up to this point:

  • <think> is always ON, cannot be disabled
  • Even light questions take 26-56 seconds (depending on thinking length)
  • If max_tokens is small, <think> consumes it all and the final answer becomes empty
  • --reasoning-parser qwen3 can mechanically separate content and reasoning

This is an important prerequisite that directly affects max_tokens design in subsequent JCQ / ELYZA / BFCL phases. Without setting max_tokens to 1024 or higher, you get an error pattern of "cut off mid-thought with zero answer" even for multiple-choice tasks.

Phase B — 6 Models Side-by-Side on JCommonsenseQA

To examine basic Japanese common sense capability, I took a 300-question subset (seed-fixed) from the validation split (1,119 questions) of leemeng/jcommonsenseqa-v1.1 and ran a side-by-side benchmark across 6 models. The setup was 3-shot, temperature=0, max_tokens=1024, system prompt instructing "please return only a single alphabet character A through E," with reasoning-parser qwen3 to parse a single character from the content side.

Results:

Model Quantization Accuracy avg latency avg completion tokens length cutoffs
Gemma 4 26B-A4B-NVFP4 NVFP4 97.7% (293/300) 0.1 sec 2 0
Nemotron 3 Nano 30B-A3B-NVFP4 NVFP4 94.0% (282/300) 3.3 sec 199 4
Ornith 1.0 9B bf16 93.0% (279/300) 28.1 sec 359 10
Qwen3.6-35B-A3B-FP8 FP8 93.0% (279/300) 10.3 sec 360 17
Ornith 1.0 35B-FP8 FP8 92.0% (276/300) 14.8 sec 570 19
Nemotron 9B-v2-Japanese bf16 87.3% (262/300) 30.0 sec 402 26

Reading the Ornith-related takeaways:

The biggest personal discovery was Ornith 1.0 9B scoring 93.0%, surpassing the Japanese-specialized Nemotron 9B-v2-Japanese by 5.7 percentage points. Honestly, it was unexpected that an agentic coding-specialized model would beat a "model SFT'd specifically for Japanese" on Japanese common sense questions. The natural interpretation is that the base Qwen 3.5 / Gemma 4 included substantial Japanese in pretraining, and that foundation persisted even through Ornith's self-scaffolding RL. As a 9B Dense model, 93.0% is solidly in the practical range.

Ornith 1.0 35B-FP8 also put up 92.0%, on par with 9B. For a 35B class Japanese common sense task with MoE Active 3B + FP8, those numbers are sufficient. For questions where fitting the answer within 1024 <think> tokens is feasible, both 9B and 35B proved stable.

For reference, the reason Gemma 4 26B-A4B-NVFP4 delivers 97.7% in 0.1 seconds with 2 tokens is that Gemma 4 has no thinking component and is designed to directly output minimal responses like A. This direct style dominates in multiple-choice tasks, but as we'll see in later chapters, free-text generation is a different story.

Within the JCQ framework, the first answer has emerged: Ornith 1.0, in both 9B and 35B, falls within the practical range for Japanese common sense. "Despite being English agentic-specialized, it works fine in Japanese." That alone makes it a viable candidate for local deployment on DGX Spark.

However, since JCQ is a multiple-choice task, quality in long-form generation is a separate axis. In the next chapter where I ran ELYZA-tasks-100, Ornith 1.0 9B produced even more interesting results.

Phase C — Ornith 9B Tops Japanese Free-Text Writing on ELYZA-tasks-100

Next up is ELYZA-tasks-100. This is the standard benchmark for Japanese LLM free-text evaluation: scoring 100 tasks of free-form Japanese responses from 1 to 5 points using LLM-as-a-Judge. This time I used Anthropic's claude-haiku-4.5 (via API) as the judge and lined up 4 comparison models. max_tokens is uniformly set to 1,024 including thinking.

Model Quantization avg score Count of 5-point scores
Ornith 1.0 9B bf16 3.89 47
Nemotron 3 Nano 30B-A3B-NVFP4 NVFP4 3.29 14
Nemotron 9B-v2-Japanese bf16 3.08 33
Ornith 1.0 35B-FP8 FP8 2.61 24
Qwen3.6-35B-A3B-FP8 FP8 1.72 10

Ornith 1.0 9B topped the 5 models with avg 3.89. With 47 five-point scores, and 67 combined 4-point and 5-point scores, this is solidly practical. For reference, ELYZA-Llama-3-8B's public value is around avg 3.0, and GPT-3.5-class models around 3.5, so this exceeds those and comes close to, but doesn't quite reach, the GPT-4-class range of 4.5.

This frankly exceeded expectations. Ornith is an English-based 9B model focused on agentic coding that has not received Japanese-specialized SFT. The natural reading of these numbers is that the base Qwen 3.5 / Gemma 4 included substantial Japanese in pretraining, and Ornith's self-scaffolding RL extended the model's capabilities toward agentic tasks while retaining that knowledge. This is a subtly important observation for operational design: "the Japanese foundation of Qwen 3.5 / Gemma 4-based models persists even through post-training."

Note that Ornith 35B-FP8 scored lower in avg than its own 9B, but this is the result of comparison under a single uniform condition of max_tokens=1,024. Ornith 35B-FP8 still produced 24 five-point scores, showing its generation quality itself is stable. If you're seriously using Ornith 35B-FP8 for long-form tasks, the right operational design would be to set max_tokens to 2,048-4,096. When deploying a reasoning model with always-ON <think> for long-form content — whether in code or infrastructure — setting a higher token budget is the safe approach.

The practical takeaway from these benchmark numbers is simple: Ornith 1.0 9B is the most straightforward choice for bringing Japanese free-text writing to the practical range on DGX Spark. MIT license, 18 GB in bf16, it thinks carefully with <think> before answering, and delivers an average 3.89 on ELYZA. It's a sufficient candidate for an open-source Japanese LLM to keep resident on your local DGX Spark.

Phase D — tok/s Measurement and Cross-Reference with wesche.com Official Values

Beyond quality, let me organize the speed and memory picture. I ran a simple benchmark of warmup 1 + 5 runs across 5 models using a longer prompt (~100 characters) + max_tokens=512.

Model Quantization Architecture tok/s (avg) latency (avg)
Nemotron 3 Nano 30B-A3B NVFP4 MoE Active 3B 59.81 8.6 sec
Qwen3.6-35B-A3B FP8 MoE Active 3B 51.61 9.9 sec
Ornith 1.0 35B FP8 (compressed-tensors) MoE Active 3B 37.50 13.7 sec
Nemotron 9B-v2-JP bf16 Dense 13.37 38.3 sec
Ornith 1.0 9B bf16 Dense 12.65 40.5 sec

Here are the key points related to Ornith:

Ornith 1.0 35B-FP8 leveraged the efficiency of MoE Active 3B to deliver 37.5 tok/s for the 35B class. Compared to bf16 Dense 9B (12.65 tok/s), that's 3x faster, and a gen latency of 22 seconds for 100 ELYZA tasks is in the practical range for a reasoning model at the 35B class. Ornith is provided officially on HF as FP8 (compressed-tensors), and it runs as-is with --quantization compressed-tensors on vLLM 0.23.0 + vllm/vllm-openai:latest. The practicality of "fits within 35-40 GB on a single DGX Spark with the officially provided quantization, delivering 37 tok/s" is a welcome figure for operations looking to host a 35B class locally.

Ornith 1.0 9B is Dense, so it comes in at 1/3 the speed of 35B-FP8 at 12.65 tok/s. Being the type that reasons carefully via long-form thinking, interactive use will involve response waits, but in exchange it delivers ELYZA 3.89 / JCQ 93.0% / BFCL simple_python 90.25% quality.

Aligning both quality and speed for "so which do you actually use?" looks like this:

Model ELYZA avg tok/s ELYZA avg latency per task Quantization Expected Use Case
Ornith 1.0 9B 3.89 12.65 42.8 sec bf16 Batch processing prioritizing quality, overnight automated reports, dialogue where quality cannot be sacrificed
Ornith 1.0 35B-FP8 2.61 37.5 22.2 sec FP8 Interactive apps, UX where wait time matters, streaming responses

Simply put, Ornith 1.0 9B is the "sweet spot when you want quality," and Ornith 1.0 35B-FP8 is the "sweet spot when you want speed." Being able to choose honestly between quality and speed within the same family is a straightforwardly appealing setup for operational design.

One note: the 35B-FP8 ELYZA score of 2.61 was measured under the single condition of max_tokens=1,024. Ornith 35B-FP8 still produced 24 five-point scores, and its generation quality itself is stable. If you're seriously using 35B-FP8 for long-form tasks, designing with max_tokens set to 2,048-4,096 to let <think> fully play out should improve quality while preserving the speed advantage. This is a topic I want to revisit while running it in actual production.

Note that wesche.com Spark-Bench v2's published value of 66.9 tok/s for Ornith-1.0-35B was measured with NVFP4 quantization ("Local models: served via llama.cpp (Q4_K_M quantization) or vLLM (NVFP4)", source: wesche.com/dgx/). Since the official Ornith HuggingFace collection currently has no NVFP4 version, wesche measured their own NVFP4-quantized version. If Ornith officially releases an NVFP4 version, reaching the 60 tok/s range for the 35B class becomes a realistic prospect. Separately measuring Nemotron 3 Nano 30B-A3B-NVFP4 on my local DGX Spark yielded 59.81 tok/s (same 30B class MoE Active 3B + NVFP4), consistent with wesche's order of magnitude.

At this point, the "Japanese performance" and "speed/balance" picture for Ornith 1.0 9B / 35B is complete. Finally, let's look at agentic task suitability from both my own BFCL v4 evaluation and the official LP's Terminal-Bench / SWE-Bench / ClawEval scores.

Phase E — Measuring Ornith 9B's Tool Calling Aptitude with BFCL v4

BFCL (Berkeley Function Calling Leaderboard) is the industry de-facto evaluation benchmark for measuring Tool Use / Function Calling capability. The latest v4 covers over 20 categories, from single-turn simple variants to parallel, irrelevance reject, live (real-world queries), multi_turn, and web_search.

Since Ornith is not registered in the BFCL catalog, I rewrote vLLM's --served-model-name to Qwen/Qwen3-8B and reused BFCL's Qwen/Qwen3-8B-FC handler. Since it's based on Qwen 3.5, the chat template is the same type, and the handler's _format_prompt works without issues. I used --skip-server-setup + REMOTE_OPENAI_BASE_URL to point directly to the external vLLM endpoint.

Running 8 categories — simple_python / multiple / parallel / parallel_multiple / irrelevance / live_simple / live_relevance / live_parallel — with 1,514 items in parallel using num_threads=4, it completed in 1 hour 42 minutes.

Here are the results for Ornith 1.0 9B:

Category Count Ornith 9B Accuracy
multiple (function selection) 199 92.00%
simple_python (basic FC) 399 90.25%
live_relevance (real-world relevance) 16 87.50%
parallel_multiple (parallel + selection) 199 87.00%
irrelevance (reject) 239 83.33%
parallel (parallel call) 199 78.50%
live_simple (real-world single) 257 74.42%
live_parallel (real-world parallel) 16 68.75%

Three things I can read from this:

First, Non-Live categories (human-written) are consistently 78-92%. simple_python at 90%, multiple at 92%, parallel_multiple at 87%. The most basic agentic operation — reading a tool spec, selecting the right function, and filling in the argument JSON — works at a practical level (80%+) even with a 9B Dense model.

Second, Live categories (real-world queries) drop 10-16 percentage points across all categories. simple_python at 90.25% versus live_simple at 74.42%. There's a clear degradation going from "grammatically clean, human-crafted function specs" to "rough specs that might appear in actual products." This is a trend seen across the Live categories in BFCL generally, not just for Ornith, but the fact that this gap is not small even for a 9B Dense model is a number worth knowing for operational design.

Third, parallel and live_parallel are on the lower end (78.50% and 68.75%). Tasks requiring multiple tool calls to be issued in parallel showed a tendency for the structure to break down as <think> grew longer. This is an industry-wide phenomenon where reasoning models with always-ON <think> occasionally falter on structured output — and Ornith is no exception.

While I wanted to run parallel evaluation for the other models, running all 4 models would take another 16+ hours, which didn't fit this article's publication timeline. For comparison targets, I referenced official scores for Qwen3-8B-FC and Qwen3-30B-A3B-Instruct-2507-FC from the BFCL official leaderboard (gorilla.cs.berkeley.edu/leaderboard.html). Ornith 9B's results here are roughly on par with the official scores for Qwen3-8B-FC — my honest impression is that rather than being exceptionally strong for an agentic coding-specialized model, it's "standard for the Qwen 3.5 8B class."

This might be the natural read: that Ornith's true agentic value lies not in the base 8B class, but in the 397B flagship design that, as the official LP claims, aims to match Claude Opus 4.7 / 4.8.

Citing Official Agentic Scores (Terminal-Bench / SWE-Bench / ClawEval / NL2Repo)

I'll pull out some of the particularly strong numbers from the size-by-size comparison table published on the Ornith official LP.

The 397B flagship model outperforms Claude Opus 4.7 in several categories.

Bench Ornith-1.0-397B Claude Opus 4.7 Claude Opus 4.8
Terminal-Bench 2.1 (Terminus-2) 77.5 70.3 85.0
Terminal-Bench 2.1 (Claude Code) 78.2 69.7 78.9
SWE-Bench Verified 82.4 80.8 87.6
SWE-Bench Pro 62.2 64.3 69.2
ClawEval Avg 77.1 78.2
NL2Repo 48.2 69.7

Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, surpassing Claude Opus 4.7 on both benchmarks.
( Source: deep-reinforce.com/ornith_1_0.html )

The 35B-MoE also puts up exceptional numbers for its class.

Bench Ornith-1.0-35B Qwen3.5-35B Qwen3.6-35B Gemma4-31B Qwen3.5-397B
Terminal-Bench 2.1 (Terminus-2) 64.2 41.4 52.5 42.1 53.5
SWE-Bench Verified 75.6 70.0 73.4 52.0 76.4
ClawEval Avg 69.8 65.4 68.7 48.5 70.7

The picture here is that within the same 35B class, it pulls 11.7 points ahead of Qwen3.6-35B and closes in on flagship Qwen3.5-397B (which has 11× the parameters).

The 9B (Dense) is positioned for edge use, yet it still leaves the 35B-class Qwen3.5-9B (Dense) far behind.

Bench Ornith-1.0-9B Qwen3.5-9B Gemma4-12B Gemma4-31B
Terminal-Bench 2.1 (Terminus-2) 43.1 21.3 21.0 42.1
SWE-Bench Verified 69.4 53.2 44.2 52.0
ClawEval Avg 63.1 53.2 32.5 48.5

It's genuinely interesting that the 9B Dense nearly ties the 31B Dense (Gemma 4-31B) on Terminal-Bench. Given the large gap it opens up over Qwen 3.5 9B, one could hypothesize that the effect of self-scaffolding RL is particularly pronounced in smaller-scale models.

Summary — What Makes Ornith 1.0 Appealing

Here are my takeaways after running Ornith 1.0 — released as an English agentic coding-specialized model — through 4 Japanese axes plus 1 agentic axis on a DGX Spark.

The things I personally find appealing about Ornith 1.0 are as follows.

  • Even the 9B Dense reaches practical territory for Japanese common sense, open-ended writing, and tool calling. JCQ 93.0%, ELYZA-tasks-100 avg 3.89 (top among 5 models), BFCL v4 simple_python 90.25%. These numbers from a 9B positioned as edge-oriented are more than sufficient as a candidate for a local LLM to run resident on a DGX Spark.
  • The 35B-FP8 is provided under MIT from the official HF, and delivers 37 tok/s straight out of the box on a single DGX Spark. It spins up easily with vllm/vllm-openai:latest + --quantization compressed-tensors, and the MoE Active 3B efficiency contributes as well.
  • <think> can be split into content / reasoning by vLLM's --reasoning-parser qwen3. The handling when incorporating a reasoning model into structured tasks is standardized, requiring no extra ingenuity in operational design.
  • The official LP claims the 397B rivals Claude Opus 4.7. Terminal-Bench 2.1 at 77.5, SWE-Bench Verified at 82.4, ClawEval Avg at 77.1. Even in the 35B class, with ClawEval 69.8 / Terminal-Bench 64.2, it closes in on Qwen3.5-397B (11× the parameters).
  • MIT license + Qwen 3.5 / Gemma 4 base, with the Japanese foundation inherited from the base and the agentic capability from self-scaffolding RL working well in tandem. It's also easy to use as a base for derivative models or fine-tuning going forward.

When running on DGX Spark, the choice between quality and speed depends on use case.

Use Case Recommended Reason
Quality-focused batch processing / nightly automated reports Ornith 1.0 9B (bf16) Top among 5 models at ELYZA avg 3.89, JCQ 93.0%, stable on long-form tasks
Interactive UX / streaming responses Ornith 1.0 35B-FP8 37.5 tok/s, 3× faster than 9B, ELYZA 1 task in 22.2 sec, official agentic scores also higher than 9B
Serious agentic workloads 35B-FP8 + expanded max_tokens Carries official ClawEval 69.8 / Terminal-Bench 64.2, also practical in terms of speed
Future flagship operation on a cluster Ornith 1.0 397B Official scores rivaling Claude Opus 4.7, assumes multiple DGX Sparks or cloud GPUs

"9B Dense for quality, 35B-FP8 for speed" — the straightforward division of roles within the same family is one of Ornith 1.0's strong points, and the fact that you don't need to switch models during the operational design phase is genuinely convenient. For tasks that require the highest final quality in Japanese open-ended writing, route to 9B; for tasks with a lot of interactive back-and-forth, route to 35B-FP8 at the backend service layer — and you can get the best of both worlds.

Having <think> always ON does require a design that allocates max_tokens of 2,048–4,096 for long-form tasks, but this is a premise shared by all reasoning models in the Qwen 3.5 / Gemma 4 / DeepSeek family, so it's not a burden unique to Ornith. What's interesting about Ornith is rather that scores commensurate with that thinking length are produced even from the 9B.

Topics I'd Like to Continue Investigating

  • Increase max_tokens to 4,096 on Ornith 35B-FP8 and examine its true quality on long-form tasks
  • Run Ornith 9B resident in Hermes Agent or Claude Code Router and write up the hands-on feel of operating a reasoning model in production based on a week's routing logs
  • When an NVFP4 edition is officially released, run 35B-NVFP4 / 397B-NVFP4 on DGX Spark to confirm speed and quality gains
  • Set up Terminal-Bench 2.1 as a permanent fixture on DGX Spark and turn it into a validation platform that regularly runs Ornith 9B / 35B and other OSS models

国内企業 AI活用実態調査2026 配布中

クラスメソッドが独自に行なったAI診断調査をもとに、企業のAI活用の現在地を調査レポートとしてまとめました。企業規模別の活用度傾向に加え、規模を超えてAI活用を進める企業に共通する取り組みまで、自社の現在地を捉えるためのヒントにぜひ。

国内企業 AI活用実態調査2026

無料でダウンロードする

Share this article