I ran Ornith 1.0 on DGX Spark and compared its Japanese language performance with Gemma 4 / Nemotron

I tested Ornith 1.0, a new open-source LLM from DeepReinforce, on Japanese benchmarks using DGX Spark and compared it against existing models. Here I present the results of measuring how well this English model specialized for agentic coding performs in Japanese, evaluated across 5 phases.

森茂洋 / Hiroshi Morishige

2026.06.28

This page has been translated by machine translation. View original

 IntroductionHello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
On June 25, 2026, DeepReinforce released a new open-source LLM family, Ornith 1.0. It comes in three sizes — 9B Dense / 35B MoE / 397B MoE — with an MIT license, a 262K context, and a sharp design focused on "agentic coding". The official benchmarks claim scores approaching Claude Opus 4.7 on Terminal-Bench 2.1 / SWE-Bench Verified / ClawEval and others, and the number of downloads on HuggingFace has been growing since the release.
https://deep-reinforce.com/ornith_1_0.html
However, all of these official benchmarks are English-language tasks. There are no claims about Japanese performance anywhere on the LP or model cards.
So, as usual with this blog, I decided to verify: "If we run the English agentic-specialized model Ornith 1.0 through Japanese benchmarks on a DGX Spark, can it hold its own alongside existing Japanese-capable LLMs (Gemma 4 / Nemotron / Qwen3.6, etc.)?"
The verification was carried out across the following 5 phases.
Phase A Light Qualitative — Compare light questions and light reasoning with <think> block ON / OFF
Phase B JCQ — Measure Japanese common sense on a 300-question subset of JCommonsenseQA v1.1
Phase C ELYZA-tasks-100 — Score 100 free-form tasks using LLM-as-a-Judge
Phase D Performance Measurement — Measure actual tok/s and GPU memory, then cross-check with published values from wesche.com Spark-Bench v2
Phase E BFCL v4 — Measure agentic suitability with the standard benchmark for Tool Calling / Function Calling
"How well can a model specialized for agentic tasks hold up on Japanese benchmarks, or is the strength shown in official benchmarks limited to English?" To state the conclusion upfront: Ornith 1.0 9B came out on top in Japanese free-form writing among the 5 models — a surprisingly encouraging result.
!The evaluation in this article is my own independent verification, with all models aligned under the same conditions (max_tokens=1024 / temperature=0, etc.), so the numbers do not fully draw out each model's recommended settings and may differ from official values. If you find any errors or oversights, I'd appreciate feedback in the comments.
 The Outline of Ornith 1.0DeepReinforce is a team of 5 researchers who have previously published papers such as GrandCode (an RL training method for competitive programming) and CUDA-L2 (GPU kernel optimization). Ornith 1.0 follows from that work, trained with the concept of "self-scaffolding" — where the model continuously generates the scaffold (the foundation of steps and premises) it uses for agentic tasks via RL.
Organizing the size and quantization lineup from the official HuggingFace collection gives the following:


Model
Parameters
Quantization
Approx. Size
Use Case


Ornith-1.0-9B
9B Dense
bf16
~18 GB
Edge / single-machine

Ornith-1.0-9B-GGUF
9B Dense
Q4_K_M / Q5_K_M
~5-7 GB
llama.cpp systems

Ornith-1.0-35B
35B MoE (Active 3B)
bf16
~70 GB
Mid-scale server

Ornith-1.0-35B-FP8
35B MoE (Active 3B)
FP8 (compressed-tensors)
~36 GB
When you want to run on a single GPU

Ornith-1.0-35B-GGUF
35B MoE (Active 3B)
Various GGUF
—
llama.cpp systems

Ornith-1.0-397B
397B MoE
bf16
~800 GB
Multi-node

Ornith-1.0-397B-FP8
397B MoE
FP8
~400 GB
Multi-node

The 9B uses a qwen3_5-based architecture, while the 35B / 397B use a qwen3_5_moe-based MoE architecture. The official LP states the design builds post-training on top of Gemma 4 and Qwen 3.5 as base models.
What stands out in the official LP's comparison table is the positioning of the 35B class. Even within the same 35B-A3B class, it surpasses other companies' Qwen3.5-35B / Qwen3.6-35B on ClawEval Avg and Terminal-Bench, and posts numbers approaching the 35B-class flagship Qwen3.5-397B (which has 11x the parameter count).
Ornith-1.0-35B achieves 64.2 on Terminal-Bench 2.1 (Terminus-2) and 69.8 on ClawEval Avg, matching Qwen3.5-397B (53.5 / 70.7).

( Source: deep-reinforce.com/ornith_1_0.html )
What always concerns me here is the question: "Is Japanese task performance even a consideration?" Even after reading the LP and model cards to the end, there are no claims about Japanese. The language ratio of training data is not disclosed either.
In other words, whether Ornith 1.0 can be used in Japanese is something you can only find out by trying it on actual hardware.
 Running It on DGX SparkI thought about how to run Ornith 1.0 on an NVIDIA DGX Spark (GB10 architecture, unified memory 128 GB, aarch64).


Size
DGX Spark Single Node
Expected Configuration


9B Dense (bf16)
◎ Comfortable
~18 GB, fits within ~30+ GB including KV cache

35B MoE (bf16)
△ Tight
~70 GB; fits in unified memory, but ~100 GB including KV cache — little headroom

35B MoE (FP8)
◎ Works
~36 GB, ~50-60 GB including KV cache — a realistic line

35B MoE (GGUF Q4_K_M)
◎ Light
~20-25 GB via llama.cpp — treated as reference

397B MoE (FP8)
× Single node infeasible
~400 GB, requires multi-node setup of 2+ nodes

For a straightforward single-node DGX Spark setup, the choice comes down to either 9B bf16 or 35B FP8. I included both in this verification. The 397B can't run on a single node, so I'll limit it to citing official scores.
 Setting Up the Verification EnvironmentThis verification was conducted on a DGX Spark. Starting with Docker, HuggingFace CLI, and uv setup, I pulled the weights for Ornith 1.0 9B and 35B-FP8 from HuggingFace. The 9B is a 4-shard configuration at ~18 GB, and the 35B-FP8 is 30 files at ~36 GB. Both were downloaded in parallel using hf download, completing in approximately 7 and 8 minutes respectively.
This time I used Docker Hub's vllm/vllm-openai:latest (vLLM 0.23.0).
The vLLM launch command looks like this:
scripts/start-vllm-ornith.sh
docker run -d \
  --name ornith-vllm-9b \
  --gpus all --shm-size 16g --ipc host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_USE_FLASHINFER_SAMPLER=0 \
  vllm/vllm-openai:latest \
  --model deepreinforce-ai/Ornith-1.0-9B \
  --served-model-name ornith-9b \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --dtype bfloat16 \
  --reasoning-parser qwen3
For the 9B, --dtype bfloat16 is specified; for the 35B-FP8, --quantization compressed-tensors is explicitly set.
Time from launch to readiness was approximately 5 minutes for the 9B, approximately 7-8 minutes for the 35B-FP8, and approximately 8 and 4 minutes for Qwen3.6-35B-A3B-FP8 and Nemotron Nano 30B-A3B-NVFP4 respectively. The breakdown is roughly: weight loading takes 1.5-3 minutes, torch.compile takes 25-30 seconds, and CUDAGraph capture (51+35 sizes from 1 to 512) takes 40-60 seconds.
The comparison models are summarized in the table below.


Model
Quantization
Size
Runtime
Source


Ornith 1.0 9B
bf16
~18 GB
vLLM 0.23.0
deepreinforce-ai/Ornith-1.0-9B

Ornith 1.0 35B-FP8
FP8 (compressed-tensors)
~36 GB
vLLM 0.23.0
deepreinforce-ai/Ornith-1.0-35B-FP8

Qwen3.6-35B-A3B-FP8
FP8 (auto config)
~35 GB
vLLM 0.23.0
Qwen/Qwen3.6-35B-A3B-FP8

Nemotron 9B-v2-Japanese
bf16
~18 GB
vLLM 0.23.0
nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese

Nemotron 3 Nano 30B-A3B-NVFP4
NVFP4
~17 GB
vLLM 0.23.0
nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

Gemma 4 26B-A4B-NVFP4
NVFP4
~13 GB
vLLM 0.23.0
nvidia/Gemma-4-26B-A4B-NVFP4

Ornith's 9B and 35B-FP8 are the main subjects, with 9B-class (Nemotron 9B-v2-JP), 30B-class (Nemotron Nano, Gemma 4 26B-A4B, Qwen3.6) lined up as comparison targets.
 Phase A — Light Qualitative Test Reveals <think> Always ONFirst, I sent two types of inputs to Ornith 9B: a light question and a light reasoning task. The prompts were as follows:
Light question: "Please introduce what the model called Ornith 1.0 is to readers in about 50 characters"
Light reasoning: "We divide 3 apples and 4 oranges between 2 people. Please answer within 100 characters how many each person gets and suggest a fair way to split them"
I tested each under two conditions: a system prompt instructing "do not output the <think> block" and "output your thought process before giving your final answer", comparing response character count / latency / presence of <think>.
The results in table form are as follows:


Prompt
think instruction
latency
completion_tokens
Final answer chars
finish_reason


Light question
off
26.9 s
356
45 chars
stop

Light question
on
56.1 s
508
60 chars
stop

Light reasoning
off
13.0 s
171
56 chars
stop

Light reasoning
on
38.8 s
512
88 chars
length

What I noticed here was that even when instructed "please do not output <think>," Ornith 9B still generates <think>...</think> blocks internally. Looking at the raw response, even in the think=off case, the thought process text always appears in the first half of the response, followed by the final answer after </think>.
The 356 completion tokens for the light question with think=off is proof of this. The actual final answer is 45 characters (a few dozen tokens), yet more than 10 times that number of tokens were consumed "before the final answer."
Checking the model card on HuggingFace, it says:
By default the assistant turn opens with a <think> … </think>. There is no documented switch to disable it.
In other words, Ornith 1.0 is a model where <think> always ON is a design specification and cannot be disabled via system prompt or API parameters. There is no switching via reasoning_effort like with Sakana Fugu or PLaMo 3.0 Prime.
However, there is a workaround. Enabling vLLM's --reasoning-parser qwen3 separates the API response into content (final answer) and reasoning_content (thought process). Testing this in practice returns something like:
{
  "content": "\n\nA",
  "reasoning_content": "Thinking Process:\n1. **Analyze the Request:** ...\n2. **Determine the Correct Answer:** ...\n"
}
With this, for tasks like JCQ where you want a single character from multiple choices, you can directly parse one character from content. I decided to always apply this flag when running subsequent benchmarks.
The raw behavior of Ornith 1.0 9B discovered up to this point is as follows:
<think> is always ON, cannot be disabled
Even light questions take 26-56 seconds (depends on thinking length)
If max_tokens is small, <think> consumes the budget and the final answer ends up empty
--reasoning-parser qwen3 can mechanically separate content from reasoning
This is an important premise that directly affects the max_tokens design in the subsequent JCQ / ELYZA / BFCL phases. If max_tokens isn't set to 1024 or more, even choice-based tasks produce error patterns where "thinking gets cut off mid-way and the answer is zero."
 Phase B — 6-Model Side-by-Side on JCommonsenseQATo check the baseline Japanese common sense capability, I took a seed-fixed 300-question subset from the validation split (1,119 questions) of leemeng/jcommonsenseqa-v1.1 and ran a side-by-side benchmark across 6 models. The setup used 3-shot, temperature=0, max_tokens=1024, a system prompt instructing "return only a single alphabet character A through E," and parsed a single character from the content side using reasoning-parser qwen3.
The results are as follows:


Model
Quantization
Accuracy
avg latency
avg completion tokens
length cutoffs


Gemma 4 26B-A4B-NVFP4
NVFP4
97.7% (293/300)
0.1 s
2
0

Nemotron 3 Nano 30B-A3B-NVFP4
NVFP4
94.0% (282/300)
3.3 s
199
4

Ornith 1.0 9B
bf16
93.0% (279/300)
28.1 s
359
10

Qwen3.6-35B-A3B-FP8
FP8
93.0% (279/300)
10.3 s
360
17

Ornith 1.0 35B-FP8
FP8
92.0% (276/300)
14.8 s
570
19

Nemotron 9B-v2-Japanese
bf16
87.3% (262/300)
30.0 s
402
26

The key takeaways regarding Ornith are as follows:
Ornith 1.0 9B scoring 93.0%, outperforming the Japanese-specialized Nemotron 9B-v2-Japanese by 5.7 percentage points, was personally the biggest discovery. The fact that a model specialized for agentic coding beats a "model SFT'd specifically for Japanese" on Japanese common sense questions is, frankly, unexpected. The natural interpretation is that Qwen 3.5 / Gemma 4's base pretraining included a considerable amount of Japanese, and that knowledge remained intact through Ornith's self-scaffolding RL. For a 9B Dense model, 93.0% is well within the practical range.
Ornith 1.0 35B-FP8 also scored 92.0%, on par with the 9B. For the 35B class, this is a sufficient number for Japanese common sense tasks, achieved with MoE Active 3B + FP8. It was confirmed that both 9B and 35B perform stably for problems that can be answered within 1,024 <think> tokens.
For reference, Gemma 4 26B-A4B-NVFP4 scoring 97.7% in 0.1 seconds with 2 tokens is because Gemma 4 doesn't have thinking and is designed to directly output minimal responses like A. In choice-based tasks, this direct style is the strongest, but as we'll see in later sections, it's a different story for free-form generation.
Within the JCQ framework, the first answer is now in: Ornith 1.0, both 9B and 35B, is within the practical range for Japanese common sense. "Despite being an English agentic-specialized model, it works fine in Japanese." That alone makes it a passing grade as a candidate for local deployment on a DGX Spark.
However, since JCQ is a multiple-choice task, quality in long-form generation is a separate axis. In the next section, when I ran ELYZA-tasks-100, Ornith 1.0 9B produced even more interesting results.
 Phase C — Ornith 9B Takes the Top Spot in Japanese Free-Form Writing on ELYZA-tasks-100Next is ELYZA-tasks-100. This is a standard benchmark for Japanese LLM free-form writing evaluation, scoring 100 free-form Japanese response tasks on a 1-to-5 scale using LLM-as-a-Judge. This time I used Anthropic's claude-haiku-4.5 (via API) as the judge, comparing against 4 reference models. max_tokens is set to 1,024 including thinking across all models.


Model
Quantization
avg score
Count of 5-point scores


Ornith 1.0 9B
bf16
3.89
47

Nemotron 3 Nano 30B-A3B-NVFP4
NVFP4
3.29
14

Nemotron 9B-v2-Japanese
bf16
3.08
33

Ornith 1.0 35B-FP8
FP8
2.61
24

Qwen3.6-35B-A3B-FP8
FP8
1.72
10

Ornith 1.0 9B came first out of 5 models with an average of 3.89. With 47 five-point scores, and 67 tasks scoring 4 or 5 points combined, it's well within practical range. For reference, the published value for ELYZA-Llama-3-8B is around avg 3.0 and GPT-3.5-class is around 3.5, so this model surpasses those while falling slightly short of GPT-4-class at 4.5.
This honestly exceeded expectations. Ornith is an English-based 9B model claiming to specialize in agentic coding, and it hasn't undergone Japanese-specialized SFT. The fact that these numbers emerged suggests that Qwen 3.5 / Gemma 4's base pretraining included a considerable amount of Japanese, and that the self-scaffolding RL preserved that knowledge while extending it toward agentic tasks. This is a subtly important operational observation: "The Japanese foundation of Qwen 3.5 / Gemma 4 systems persists even after post-training."
Note that Ornith 35B-FP8 scored lower than its own 9B, but this is the result of comparing under a single condition of max_tokens=1,024. The 35B-FP8 still produced 24 five-point responses, and its raw generation quality is stable. For serious use of Ornith 35B-FP8 on long-form tasks, the operational design should assume setting max_tokens to 2,048-4,096. When deploying reasoning models with always-on <think> for long-form tasks — whether in code or infrastructure — bumping up the token budget one notch is the safe approach.
The simple takeaway from the benchmark numbers is that Ornith 1.0 9B is the most straightforward choice for "bringing Japanese free-form writing to a practical level on DGX Spark." MIT license, ~18 GB at bf16, it thinks carefully with <think> before answering, and scores a 3.89 average on ELYZA. It's a sufficient candidate as an open-source Japanese LLM to keep running locally on a DGX Spark.
 Phase D — Actual tok/s Measurement and Cross-Check with wesche.com Official ValuesBeyond quality, let me also organize the speed and memory story. I ran a simple benchmark of 5 models with a longer prompt (~100 characters) + max_tokens=512, using warmup 1 + 5 runs.


Model
Quantization
Architecture
tok/s (avg)
latency (avg)


Nemotron 3 Nano 30B-A3B
NVFP4
MoE Active 3B
59.81
8.6 s

Qwen3.6-35B-A3B
FP8
MoE Active 3B
51.61
9.9 s

Ornith 1.0 35B
FP8 (compressed-tensors)
MoE Active 3B
37.50
13.7 s

Nemotron 9B-v2-JP
bf16
Dense
13.37
38.3 s

Ornith 1.0 9B
bf16
Dense
12.65
40.5 s

Let me organize the key points relating to Ornith.
Ornith 1.0 35B-FP8 leverages MoE Active 3B efficiency to achieve 37.5 tok/s for the 35B class. Compared to the bf16 Dense 9B (12.65 tok/s), that's 3x faster; a gen latency of 22 seconds for 100 ELYZA tasks is within practical range for a reasoning model + 35B class. Ornith's official HF provides it as FP8 (compressed-tensors), and it can be run as-is with --quantization compressed-tensors on vLLM 0.23.0 + vllm/vllm-openai:latest. The practicality of "officially provided quantization, fits in 35-40 GB on a single DGX Spark, achieves 37 tok/s" is a welcome proposition for operations wanting to keep a 35B-class model locally.
Ornith 1.0 9B is Dense, so it's 12.65 tok/s — one third of 35B-FP8. Being a type that thinks carefully with long-form reasoning, interactive use will have response wait times, but in exchange it delivers ELYZA 3.89 / JCQ 93.0% / BFCL simple_python 90.25% quality.
Laying out "which one to use" from both quality and speed perspectives gives the following:


Model
ELYZA avg
tok/s
ELYZA avg latency per task
Quantization
Intended Use


Ornith 1.0 9B
3.89
12.65
42.8 s
bf16
Batch processing prioritizing quality, overnight automated reports, dialogues where quality can't be sacrificed

Ornith 1.0 35B-FP8
2.61
37.5
22.2 s
FP8
Interactive applications, UX with reduced wait times, streaming responses

Simply put, Ornith 1.0 9B is the sweet spot "when you want quality," and Ornith 1.0 35B-FP8 is the sweet spot "when you want speed." Being able to choose quality vs. speed honestly within the same family is a genuinely convenient setup for operational design.
One supplementary note: the 35B-FP8's ELYZA score of 2.61 was measured under the single condition of max_tokens=1,024. The 35B-FP8 still produced 24 five-point scores, and its raw generation quality is stable. If you seriously use 35B-FP8 for long-form tasks and set max_tokens to 2,048-4,096 to let <think> fully complete, the quality should improve while retaining the speed advantage. This is a theme I'd like to confirm through actual operational use.
Also, the 66.9 tok/s for Ornith-1.0-35B published by wesche.com Spark-Bench v2 was measured with NVFP4 quantization ("Local models: served via llama.cpp (Q4_K_M quantization) or vLLM (NVFP4)", source: wesche.com/dgx/). Since there is currently no NVFP4 version in the official Ornith HuggingFace collection, it appears wesche did their own NVFP4 quantization for measurement. If an official NVFP4 version comes out for Ornith, it would be on track to reach the 60 tok/s range for the 35B class. Separately measuring Nemotron 3 Nano 30B-A3B-NVFP4 on my local DGX Spark gives 59.81 tok/s (same 30B class MoE Active 3B + NVFP4), which is consistent with wesche's order of magnitude.
By this point, both "Japanese performance" and "speed / balance" for Ornith 1.0 9B / 35B have been covered. Finally, let's look at agentic task suitability from both my own BFCL v4 evaluation and the Terminal-Bench / SWE-Bench / ClawEval scores from the official LP.
 Phase E — Measuring Ornith 9B's Tool Calling Suitability with BFCL v4BFCL (Berkeley Function Calling Leaderboard) is the industry de facto evaluation benchmark for measuring Tool Use / Function Calling capability. The latest v4 covers more than 20 categories, from single-turn simple types to parallel, irrelevance reject, live (real-world queries), multi_turn, and web_search.
Since Ornith is not registered in the BFCL catalog, I rewrote vLLM's --served-model-name to Qwen/Qwen3-8B and borrowed the Qwen/Qwen3-8B-FC handler from BFCL. Since it's Qwen 3.5-based, the chat template is identical, and the handler's _format_prompt works without issue. I used --skip-server-setup + REMOTE_OPENAI_BASE_URL to point directly at the external vLLM endpoint.
I ran 8 categories and 1,514 items — simple_python / multiple / parallel / parallel_multiple / irrelevance / live_simple / live_relevance / live_parallel — in parallel with num_threads=4, completing the run in 1 hour and 42 minutes.
Here are the results for Ornith 1.0 9B:


Category
Count
Ornith 9B Accuracy


multiple (function selection)
199
92.00%

simple_python (basic FC)
399
90.25%

live_relevance (real-world relevance)
16
87.50%

parallel_multiple (parallel + selection)
199
87.00%

irrelevance (reject)
239
83.33%

parallel (parallel call)
199
78.50%

live_simple (real-world single)
257
74.42%

live_parallel (real-world parallel)
16
68.75%

Three things I can read from this:
First, the Non-Live categories (human-crafted) are consistently 78-92%. simple_python at 90%, multiple at 92%, parallel_multiple at 87%. The most basic agentic behavior — "reading the tool spec, selecting the right function, and filling in the argument JSON" — works at a practical level (80%+) even with a 9B Dense model.
Second, the Live categories (real-world queries) drop 10-16 points across all categories. simple_python at 90.25% vs. live_simple at 74.42%. This reveals that performance degrades significantly between "grammatically clean human-crafted function specs" and "rough specs likely used in real products." This is a tendency seen across BFCL's Live categories in general and not unique to Ornith, but the fact that this gap is not small even for a 9B Dense model is a number worth knowing for operational design.
Third, parallel and live_parallel are on the lower end (78.50% and 68.75%). Tasks requiring multiple parallel tool calls showed a tendency for <think> to grow long and structured output to break down. This is consistent with the industry-wide phenomenon of reasoning models with always-on <think> occasionally stumbling on structured output — and Ornith is no exception.
I wanted to run parallel evaluations across all models, but running all 4 models would take an additional ~16+ hours, which doesn't fit this publication timeline. For comparison targets, I referenced official scores from the BFCL leaderboard (gorilla.cs.berkeley.edu/leaderboard.html) for Qwen3-8B-FC and Qwen3-30B-A3B-Instruct-2507-FC. Ornith 9B's results here are roughly on par with Qwen3-8B-FC's official scores, and the honest impression is that rather than being exceptionally strong for a model claiming to specialize in agentic coding, it's "standard for Qwen 3.5 8B class."
The natural reading here might be that the true value of Ornith in the agentic dimension is not in the base 8B class, but rather — as the official LP claims — designed to compete with Claude Opus 4.7 / 4.8 at the 397B flagship level.
 Citing Official Agentic Scores (Terminal-Bench / SWE-Bench / ClawEval / NL2Repo)Pulling out a few particularly strong numbers from the size-by-size comparison table on the official Ornith landing page.
The 397B flagship model surpasses Claude Opus 4.7 in several categories.


Bench
Ornith-1.0-397B
Claude Opus 4.7
Claude Opus 4.8


Terminal-Bench 2.1 (Terminus-2)
77.5
70.3
85.0

Terminal-Bench 2.1 (Claude Code)
78.2
69.7
78.9

SWE-Bench Verified
82.4
80.8
87.6

SWE-Bench Pro
62.2
64.3
69.2

ClawEval Avg
77.1
78.2
—

NL2Repo
48.2
—
69.7

Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, surpassing Claude Opus 4.7 on both benchmarks.

( Source: deep-reinforce.com/ornith_1_0.html )
The 35B-MoE also puts up exceptional numbers for its class.


Bench
Ornith-1.0-35B
Qwen3.5-35B
Qwen3.6-35B
Gemma4-31B
Qwen3.5-397B


Terminal-Bench 2.1 (Terminus-2)
64.2
41.4
52.5
42.1
53.5

SWE-Bench Verified
75.6
70.0
73.4
52.0
76.4

ClawEval Avg
69.8
65.4
68.7
48.5
70.7

The picture here is that within the same 35B class, it leads Qwen3.6-35B by 11.7 points and closes in on the flagship Qwen3.5-397B (11× the parameters).
The 9B (Dense) is positioned for edge use, yet it still pulls far ahead of the 35B-class Qwen3.5-9B (Dense).


Bench
Ornith-1.0-9B
Qwen3.5-9B
Gemma4-12B
Gemma4-31B


Terminal-Bench 2.1 (Terminus-2)
43.1
21.3
21.0
42.1

SWE-Bench Verified
69.4
53.2
44.2
52.0

ClawEval Avg
63.1
53.2
32.5
48.5

The fact that the 9B Dense nearly matches the 31B Dense (Gemma 4-31B) on Terminal-Bench is genuinely interesting. Given how much it extends beyond Qwen 3.5 9B, one hypothesis worth entertaining is that the effects of self-scaffolding RL tend to be especially pronounced in smaller models.
 Summary — What Makes Ornith 1.0 CompellingHere is a summary of my conclusions after running Ornith 1.0—released as an English agentic coding specialist model—through four Japanese axes plus one agentic axis on the DGX Spark.
These are the points I personally find most appealing about Ornith 1.0.
Even the 9B Dense reaches practical territory for Japanese common sense, open-ended generation, and tool calling. JCQ 93.0%, ELYZA-tasks-100 avg 3.89 (top among 5 models), BFCL v4 simple_python 90.25%. Given that the 9B is positioned as an edge model, these numbers are more than sufficient as a candidate local LLM to keep resident on a DGX Spark.
The 35B-FP8 is provided officially from HF under MIT, and runs at 37 tok/s out of the box on a single DGX Spark. It spins up easily with vllm/vllm-openai:latest + --quantization compressed-tensors, and the MoE Active 3B efficiency works in its favor.
The <think> block can be split into content / reasoning via vLLM's --reasoning-parser qwen3. The handling when incorporating a reasoning model into structured tasks is standardized, requiring no extra ingenuity in operational design.
The official LP claims the 397B stands alongside Claude Opus 4.7. Terminal-Bench 2.1 at 77.5, SWE-Bench Verified at 82.4, ClawEval Avg at 77.1. Even in the 35B class, ClawEval 69.8 / Terminal-Bench 64.2 puts it within reach of Qwen3.5-397B (11× the parameters).
MIT license + Qwen 3.5 / Gemma 4 base means the Japanese foundation inherited from the base and the agentic capability from self-scaffolding RL coexist well. It is also easy to work with as a base for derivative models and fine-tuning going forward.
When running on a DGX Spark, the choice between quality and speed comes down to use case.


Use Case
Recommendation
Rationale


Quality-focused batch processing / nightly auto reports
Ornith 1.0 9B (bf16)
ELYZA avg 3.89 — top among 5 models, JCQ 93.0%, stable on long-form tasks

Interactive UX / streaming responses
Ornith 1.0 35B-FP8
37.5 tok/s — 3× faster than 9B, ELYZA 1 task in 22.2 s, official agentic scores also higher than 9B

Serious agentic workloads
35B-FP8 + max_tokens bump
Carries official ClawEval 69.8 / Terminal-Bench 64.2, and is also practical on the speed side

Future flagship deployment on a cluster
Ornith 1.0 397B
Official scores on par with Claude Opus 4.7; assumes multiple DGX Sparks or cloud GPUs

"9B Dense for quality, 35B-FP8 for speed" — the clean division of roles within the same family is one of Ornith 1.0's strengths, and it is genuinely convenient not to have to swap models at the operational design stage. For tasks where final Japanese open-ended generation quality matters, route to the 9B; for tasks with a lot of back-and-forth dialogue, route to the 35B-FP8 at the backend service layer, and you can have the best of both worlds.
Always-on <think> requires a design that allocates max_tokens at 2,048–4,096 for long-form tasks, but this is a prerequisite shared by all reasoning models in the Qwen 3.5 / Gemma 4 / DeepSeek family, so it is not a burden unique to Ornith. What makes Ornith interesting is, rather, that scores commensurate with that thinking length emerge even from the 9B.
 Topics I Want to Continue InvestigatingIncrease max_tokens to 4,096 on Ornith 35B-FP8 and examine its true quality on long-form tasks
Keep Ornith 9B resident in Hermes Agent or Claude Code Router and write up the practical feel of running a reasoning model in production based on one week of routing logs
When an official NVFP4 build is released, run 35B-NVFP4 / 397B-NVFP4 on a DGX Spark and confirm the speed and quality gains
Set up Terminal-Bench 2.1 as a permanent fixture on a DGX Spark and use it as a recurring evaluation platform for Ornith 9B / 35B / other OSS models
 Reference LinksOrnith-1.0: Self-Scaffolding LLMs for Agentic Coding | DeepReinforce Blog
HuggingFace deepreinforce-ai/Ornith-1.0-9B
HuggingFace deepreinforce-ai/Ornith-1.0-35B-FP8
wesche.com Spark-Bench v2
Berkeley Function Calling Leaderboard (BFCL)
ELYZA-tasks-100
JCommonsenseQA v1.1

I ran Ornith 1.0 on DGX Spark and compared its Japanese language performance with Gemma 4 / Nemotron

Introduction

The Outline of Ornith 1.0

Running It on DGX Spark

Setting Up the Verification Environment

Phase A — Light Qualitative Test Reveals `<think>` Always ON

Phase B — 6-Model Side-by-Side on JCommonsenseQA

Phase C — Ornith 9B Takes the Top Spot in Japanese Free-Form Writing on ELYZA-tasks-100

Phase D — Actual tok/s Measurement and Cross-Check with wesche.com Official Values

Phase E — Measuring Ornith 9B's Tool Calling Suitability with BFCL v4

Citing Official Agentic Scores (Terminal-Bench / SWE-Bench / ClawEval / NL2Repo)

Summary — What Makes Ornith 1.0 Compelling

Topics I Want to Continue Investigating

Reference Links

AI白書2026 配布中

AWS Topics

Trending Topics

Products & Services

Features and Series

Model	Parameters	Quantization	Approx. Size	Use Case
Ornith-1.0-9B	9B Dense	bf16	~18 GB	Edge / single-machine
Ornith-1.0-9B-GGUF	9B Dense	Q4_K_M / Q5_K_M	~5-7 GB	llama.cpp systems
Ornith-1.0-35B	35B MoE (Active 3B)	bf16	~70 GB	Mid-scale server
Ornith-1.0-35B-FP8	35B MoE (Active 3B)	FP8 (compressed-tensors)	~36 GB	When you want to run on a single GPU
Ornith-1.0-35B-GGUF	35B MoE (Active 3B)	Various GGUF	—	llama.cpp systems
Ornith-1.0-397B	397B MoE	bf16	~800 GB	Multi-node
Ornith-1.0-397B-FP8	397B MoE	FP8	~400 GB	Multi-node

Size	DGX Spark Single Node	Expected Configuration
9B Dense (bf16)	◎ Comfortable	~18 GB, fits within ~30+ GB including KV cache
35B MoE (bf16)	△ Tight	~70 GB; fits in unified memory, but ~100 GB including KV cache — little headroom
35B MoE (FP8)	◎ Works	~36 GB, ~50-60 GB including KV cache — a realistic line
35B MoE (GGUF Q4_K_M)	◎ Light	~20-25 GB via llama.cpp — treated as reference
397B MoE (FP8)	× Single node infeasible	~400 GB, requires multi-node setup of 2+ nodes

Model	Quantization	Size	Runtime	Source
Ornith 1.0 9B	bf16	~18 GB	vLLM 0.23.0	`deepreinforce-ai/Ornith-1.0-9B`
Ornith 1.0 35B-FP8	FP8 (compressed-tensors)	~36 GB	vLLM 0.23.0	`deepreinforce-ai/Ornith-1.0-35B-FP8`
Qwen3.6-35B-A3B-FP8	FP8 (auto config)	~35 GB	vLLM 0.23.0	`Qwen/Qwen3.6-35B-A3B-FP8`
Nemotron 9B-v2-Japanese	bf16	~18 GB	vLLM 0.23.0	`nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese`
Nemotron 3 Nano 30B-A3B-NVFP4	NVFP4	~17 GB	vLLM 0.23.0	`nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4`
Gemma 4 26B-A4B-NVFP4	NVFP4	~13 GB	vLLM 0.23.0	`nvidia/Gemma-4-26B-A4B-NVFP4`

Prompt	think instruction	latency	completion_tokens	Final answer chars	finish_reason
Light question	off	26.9 s	356	45 chars	stop
Light question	on	56.1 s	508	60 chars	stop
Light reasoning	off	13.0 s	171	56 chars	stop
Light reasoning	on	38.8 s	512	88 chars	length

Model	Quantization	Accuracy	avg latency	avg completion tokens	length cutoffs
Gemma 4 26B-A4B-NVFP4	NVFP4	97.7% (293/300)	0.1 s	2	0
Nemotron 3 Nano 30B-A3B-NVFP4	NVFP4	94.0% (282/300)	3.3 s	199	4
Ornith 1.0 9B	bf16	93.0% (279/300)	28.1 s	359	10
Qwen3.6-35B-A3B-FP8	FP8	93.0% (279/300)	10.3 s	360	17
Ornith 1.0 35B-FP8	FP8	92.0% (276/300)	14.8 s	570	19
Nemotron 9B-v2-Japanese	bf16	87.3% (262/300)	30.0 s	402	26

Model	Quantization	avg score	Count of 5-point scores
Ornith 1.0 9B	bf16	3.89	47
Nemotron 3 Nano 30B-A3B-NVFP4	NVFP4	3.29	14
Nemotron 9B-v2-Japanese	bf16	3.08	33
Ornith 1.0 35B-FP8	FP8	2.61	24
Qwen3.6-35B-A3B-FP8	FP8	1.72	10

Model	Quantization	Architecture	tok/s (avg)	latency (avg)
Nemotron 3 Nano 30B-A3B	NVFP4	MoE Active 3B	59.81	8.6 s
Qwen3.6-35B-A3B	FP8	MoE Active 3B	51.61	9.9 s
Ornith 1.0 35B	FP8 (compressed-tensors)	MoE Active 3B	37.50	13.7 s
Nemotron 9B-v2-JP	bf16	Dense	13.37	38.3 s
Ornith 1.0 9B	bf16	Dense	12.65	40.5 s

Category	Count	Ornith 9B Accuracy
multiple (function selection)	199	92.00%
simple_python (basic FC)	399	90.25%
live_relevance (real-world relevance)	16	87.50%
parallel_multiple (parallel + selection)	199	87.00%
irrelevance (reject)	239	83.33%
parallel (parallel call)	199	78.50%
live_simple (real-world single)	257	74.42%
live_parallel (real-world parallel)	16	68.75%

Bench	Ornith-1.0-397B	Claude Opus 4.7	Claude Opus 4.8
Terminal-Bench 2.1 (Terminus-2)	77.5	70.3	85.0
Terminal-Bench 2.1 (Claude Code)	78.2	69.7	78.9
SWE-Bench Verified	82.4	80.8	87.6
SWE-Bench Pro	62.2	64.3	69.2
ClawEval Avg	77.1	78.2	—
NL2Repo	48.2	—	69.7

Bench	Ornith-1.0-35B	Qwen3.5-35B	Qwen3.6-35B	Gemma4-31B	Qwen3.5-397B
Terminal-Bench 2.1 (Terminus-2)	64.2	41.4	52.5	42.1	53.5
SWE-Bench Verified	75.6	70.0	73.4	52.0	76.4
ClawEval Avg	69.8	65.4	68.7	48.5	70.7

Bench	Ornith-1.0-9B	Qwen3.5-9B	Gemma4-12B	Gemma4-31B
Terminal-Bench 2.1 (Terminus-2)	43.1	21.3	21.0	42.1
SWE-Bench Verified	69.4	53.2	44.2	52.0
ClawEval Avg	63.1	53.2	32.5	48.5

Use Case	Recommendation	Rationale
Quality-focused batch processing / nightly auto reports	Ornith 1.0 9B (bf16)	ELYZA avg 3.89 — top among 5 models, JCQ 93.0%, stable on long-form tasks
Interactive UX / streaming responses	Ornith 1.0 35B-FP8	37.5 tok/s — 3× faster than 9B, ELYZA 1 task in 22.2 s, official agentic scores also higher than 9B
Serious agentic workloads	35B-FP8 + `max_tokens` bump	Carries official ClawEval 69.8 / Terminal-Bench 64.2, and is also practical on the speed side
Future flagship deployment on a cluster	Ornith 1.0 397B	Official scores on par with Claude Opus 4.7; assumes multiple DGX Sparks or cloud GPUs