
I ran Ornith 1.0 on DGX Spark and compared its Japanese language performance against Gemma 4 / Nemotron
This page has been translated by machine translation. View original
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
On June 25, 2026, DeepReinforce released a new open-source LLM family, Ornith 1.0. It comes in three sizes — 9B Dense / 35B MoE / 397B MoE — with an MIT license, 262K context, and a sharp design focused on "agentic coding." Official benchmarks claim scores approaching Claude Opus 4.7 on Terminal-Bench 2.1 / SWE-Bench Verified / ClawEval and others, and download numbers on HuggingFace have been climbing since release.
However, all of these official benchmarks are English-language tasks. There are no claims about Japanese performance anywhere on the LP or model cards.
So, as always with these articles, I decided to verify "whether the English agentic-specialized model Ornith 1.0, when run through Japanese benchmarks on a DGX Spark, can hold its own alongside existing Japanese-capable LLMs (Gemma 4 / Nemotron / Qwen3.6, etc.)."
The verification proceeded across the following 5 phases:
- Phase A Light Qualitative — Compare light questions and light reasoning with
<think>block ON / OFF - Phase B JCQ — Measure Japanese common sense with a 300-question subset of JCommonsenseQA v1.1
- Phase C ELYZA-tasks-100 — Score 100 free-text tasks using LLM-as-a-Judge
- Phase D Performance Measurement — Measure tok/s and GPU memory, then cross-reference against wesche.com Spark-Bench v2 public values
- Phase E BFCL v4 — Measure agentic suitability with the standard Tool Calling / Function Calling benchmark
"How well can an agentic-specialized model perform on Japanese benchmarks, or is the strength shown in official benchmarks limited to English?" To give the conclusion upfront: Ornith 1.0 9B came out on top among 5 models in Japanese free-text writing — a fairly encouraging result.
The Outline of Ornith 1.0
DeepReinforce is a team of 5 researchers who have previously published papers such as GrandCode (an RL training method for competitive programming) and CUDA-L2 (GPU kernel optimization). Following that lineage, Ornith 1.0 is a model family trained with the idea of "self-scaffolding" — where the model itself continuously generates the scaffold (the framework of steps and prerequisites) used in agentic tasks via RL.
Organizing the size and quantization lineup from the official HuggingFace collection:
| Model | Parameters | Quantization | Approx. Size | Use Case |
|---|---|---|---|---|
| Ornith-1.0-9B | 9B Dense | bf16 | ~18 GB | Edge / single-machine |
| Ornith-1.0-9B-GGUF | 9B Dense | Q4_K_M / Q5_K_M | ~5-7 GB | llama.cpp systems |
| Ornith-1.0-35B | 35B MoE (Active 3B) | bf16 | ~70 GB | Mid-scale server |
| Ornith-1.0-35B-FP8 | 35B MoE (Active 3B) | FP8 (compressed-tensors) | ~36 GB | When you want to run on a single GPU |
| Ornith-1.0-35B-GGUF | 35B MoE (Active 3B) | GGUF various | — | llama.cpp systems |
| Ornith-1.0-397B | 397B MoE | bf16 | ~800 GB | Multi-node |
| Ornith-1.0-397B-FP8 | 397B MoE | FP8 | ~400 GB | Multi-node |
The 9B uses a qwen3_5-based architecture, while the 35B / 397B use a qwen3_5_moe-based MoE architecture. The official LP states the design is based on Gemma 4 and Qwen 3.5 with post-training applied.
What catches the eye in the official LP's comparison table is the 35B class positioning. Even within the same 35B-A3B class, it outpaces other vendors' Qwen3.5-35B / Qwen3.6-35B on ClawEval Avg and Terminal-Bench, and puts up numbers approaching Qwen3.5-397B (11 times the parameter count), the flagship in the 35B class.
Ornith-1.0-35B achieves 64.2 on Terminal-Bench 2.1 (Terminus-2) and 69.8 on ClawEval Avg, matching Qwen3.5-397B (53.5 / 70.7).
( Source: deep-reinforce.com/ornith_1_0.html )
What I always wonder here is: "Is Japanese support even intended?" Reading the LP and model cards all the way through, there are no claims about Japanese. The language ratio of training data is also not disclosed.
In other words, whether Ornith 1.0 can be used in Japanese is something we can only find out by actually testing it on real hardware.
Running on DGX Spark
I thought through how to run Ornith 1.0 on an NVIDIA DGX Spark (GB10 architecture, unified memory 128 GB, aarch64).
| Size | DGX Spark Single Node | Expected Configuration |
|---|---|---|
| 9B Dense (bf16) | ◎ Comfortable | ~18 GB, fits within ~30+ GB including KV cache |
| 35B MoE (bf16) | △ Tight | ~70 GB, fits in unified memory but ~100 GB including KV cache, little headroom |
| 35B MoE (FP8) | ◎ Works | ~36 GB, ~50-60 GB including KV cache, a realistic line |
| 35B MoE (GGUF Q4_K_M) | ◎ Light | ~20-25 GB via llama.cpp, treated as reference |
| 397B MoE (FP8) | × Single node impossible | ~400 GB, requires multi-node of 2+ nodes |
For straightforward operation on a single DGX Spark, the two choices are 9B bf16 or 35B FP8. I included both in this verification. Since 397B cannot run on a single node, I limited it to citing official scores.
Setting Up the Verification Environment
This verification was conducted on a DGX Spark. Starting with the setup of Docker, HuggingFace CLI, and uv, I fetched the weights for Ornith 1.0 9B and 35B-FP8 from HuggingFace. The 9B consists of 4 shards at approximately 18 GB, and the 35B-FP8 is 30 files at approximately 36 GB. Both completed via hf download parallel fetching in approximately 7 minutes and 8 minutes, respectively.
For this run, I used vllm/vllm-openai:latest from Docker Hub (vLLM 0.23.0).
The vLLM startup command looks like this:
docker run -d \
--name ornith-vllm-9b \
--gpus all --shm-size 16g --ipc host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_USE_FLASHINFER_SAMPLER=0 \
vllm/vllm-openai:latest \
--model deepreinforce-ai/Ornith-1.0-9B \
--served-model-name ornith-9b \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--gpu-memory-utilization 0.85 \
--dtype bfloat16 \
--reasoning-parser qwen3
For 9B, specify --dtype bfloat16; for 35B-FP8, explicitly specify --quantization compressed-tensors.
Time from startup to readiness was approximately 5 minutes for 9B, approximately 7-8 minutes for 35B-FP8, and approximately 8 minutes and 4 minutes for Qwen3.6-35B-A3B-FP8 and Nemotron Nano 30B-A3B-NVFP4 respectively. The breakdown is roughly 1.5-3 minutes for weight loading, 25-30 seconds for torch.compile, and 40-60 seconds for CUDAGraph capture (51+35 types of sizes from 1 to 512).
Here is a table summarizing the comparison models:
| Model | Quantization | Size | Runtime | Source |
|---|---|---|---|---|
| Ornith 1.0 9B | bf16 | ~18 GB | vLLM 0.23.0 | deepreinforce-ai/Ornith-1.0-9B |
| Ornith 1.0 35B-FP8 | FP8 (compressed-tensors) | ~36 GB | vLLM 0.23.0 | deepreinforce-ai/Ornith-1.0-35B-FP8 |
| Qwen3.6-35B-A3B-FP8 | FP8 (auto-detected from config) | ~35 GB | vLLM 0.23.0 | Qwen/Qwen3.6-35B-A3B-FP8 |
| Nemotron 9B-v2-Japanese | bf16 | ~18 GB | vLLM 0.23.0 | nvidia/NVIDIA-Nemotron-Nano-9B-v2-Japanese |
| Nemotron 3 Nano 30B-A3B-NVFP4 | NVFP4 | ~17 GB | vLLM 0.23.0 | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 |
| Gemma 4 26B-A4B-NVFP4 | NVFP4 | ~13 GB | vLLM 0.23.0 | nvidia/Gemma-4-26B-A4B-NVFP4 |
Ornith's 9B and 35B-FP8 serve as the main subjects, with 9B-class (Nemotron 9B-v2-JP), 30B-class (Nemotron Nano, Gemma 4 26B-A4B, Qwen3.6) placed alongside for comparison.
Phase A — Light Qualitative Testing Reveals <think> Always ON
First, I sent just two types of inputs to Ornith 9B: a light question and a light reasoning task. The prompts were as follows:
- Light question: "Please introduce the model called Ornith 1.0 to readers in about 50 characters"
- Light reasoning: "We split 3 apples and 4 oranges between 2 people. Please answer within 100 characters how many each person gets and suggest a fair way to divide them"
I sent each under two conditions: with the system prompt instructing "please do not output the <think> block" and instructing "please show your thought process and then give your final answer," comparing response length / latency / presence of <think>.
The results are summarized in the table below:
| Prompt | think instruction | latency | completion_tokens | Final answer length | finish_reason |
|---|---|---|---|---|---|
| Light question | off | 26.9 sec | 356 | 45 chars | stop |
| Light question | on | 56.1 sec | 508 | 60 chars | stop |
| Light reasoning | off | 13.0 sec | 171 | 56 chars | stop |
| Light reasoning | on | 38.8 sec | 512 | 88 chars | length |
What I noticed here was that even when instructed "please do not output <think>," Ornith 9B internally generates <think>...</think> blocks. Looking at the raw responses, even in the think=off case, the thought process text always appears in the first half of the response, followed by the final answer after </think>.
The evidence is that completion_tokens reached 356 even for the light question with think=off. The actual final answer is 45 characters (a few dozen tokens), yet more than 10 times that number of tokens was consumed "before the final answer."
Checking the HuggingFace model card to be sure, it reads:
By default the assistant turn opens with a
<think> … </think>. There is no documented switch to disable it.
In other words, Ornith 1.0 is a model where <think> always ON is a design specification that cannot be disabled via system prompt or API parameters. There is no reasoning_effort toggle like in Sakana Fugu or PLaMo 3.0 Prime.
However, there is a workaround. Enabling vLLM's --reasoning-parser qwen3 separates the API response into content (final answer) and reasoning_content (thought process). Testing this actually produces a response like:
{
"content": "\n\nA",
"reasoning_content": "Thinking Process:\n1. **Analyze the Request:** ...\n2. **Determine the Correct Answer:** ...\n"
}
With this, for tasks like JCQ where you want to parse a single character from choices, you can directly parse content for one character. I decided to always attach this flag when running subsequent benchmarks.
Here is a summary of Ornith 1.0 9B's raw behavior discovered up to this point:
<think>is always ON, cannot be disabled- Even light questions take 26-56 seconds (depending on thinking length)
- If max_tokens is small,
<think>consumes it all and the final answer becomes empty --reasoning-parser qwen3can mechanically separate content and reasoning
This is an important prerequisite that directly affects max_tokens design in subsequent JCQ / ELYZA / BFCL phases. Without setting max_tokens to 1024 or higher, you get an error pattern of "cut off mid-thought with zero answer" even for multiple-choice tasks.
Phase B — 6 Models Side-by-Side on JCommonsenseQA
To examine basic Japanese common sense capability, I took a 300-question subset (seed-fixed) from the validation split (1,119 questions) of leemeng/jcommonsenseqa-v1.1 and ran a side-by-side benchmark across 6 models. The setup was 3-shot, temperature=0, max_tokens=1024, system prompt instructing "please return only a single alphabet character A through E," with reasoning-parser qwen3 to parse a single character from the content side.
Results:
| Model | Quantization | Accuracy | avg latency | avg completion tokens | length cutoffs |
|---|---|---|---|---|---|
| Gemma 4 26B-A4B-NVFP4 | NVFP4 | 97.7% (293/300) | 0.1 sec | 2 | 0 |
| Nemotron 3 Nano 30B-A3B-NVFP4 | NVFP4 | 94.0% (282/300) | 3.3 sec | 199 | 4 |
| Ornith 1.0 9B | bf16 | 93.0% (279/300) | 28.1 sec | 359 | 10 |
| Qwen3.6-35B-A3B-FP8 | FP8 | 93.0% (279/300) | 10.3 sec | 360 | 17 |
| Ornith 1.0 35B-FP8 | FP8 | 92.0% (276/300) | 14.8 sec | 570 | 19 |
| Nemotron 9B-v2-Japanese | bf16 | 87.3% (262/300) | 30.0 sec | 402 | 26 |
Reading the Ornith-related takeaways:
The biggest personal discovery was Ornith 1.0 9B scoring 93.0%, surpassing the Japanese-specialized Nemotron 9B-v2-Japanese by 5.7 percentage points. Honestly, it was unexpected that an agentic coding-specialized model would beat a "model SFT'd specifically for Japanese" on Japanese common sense questions. The natural interpretation is that the base Qwen 3.5 / Gemma 4 included substantial Japanese in pretraining, and that foundation persisted even through Ornith's self-scaffolding RL. As a 9B Dense model, 93.0% is solidly in the practical range.
Ornith 1.0 35B-FP8 also put up 92.0%, on par with 9B. For a 35B class Japanese common sense task with MoE Active 3B + FP8, those numbers are sufficient. For questions where fitting the answer within 1024 <think> tokens is feasible, both 9B and 35B proved stable.
For reference, the reason Gemma 4 26B-A4B-NVFP4 delivers 97.7% in 0.1 seconds with 2 tokens is that Gemma 4 has no thinking component and is designed to directly output minimal responses like A. This direct style dominates in multiple-choice tasks, but as we'll see in later chapters, free-text generation is a different story.
Within the JCQ framework, the first answer has emerged: Ornith 1.0, in both 9B and 35B, falls within the practical range for Japanese common sense. "Despite being English agentic-specialized, it works fine in Japanese." That alone makes it a viable candidate for local deployment on DGX Spark.
However, since JCQ is a multiple-choice task, quality in long-form generation is a separate axis. In the next chapter where I ran ELYZA-tasks-100, Ornith 1.0 9B produced even more interesting results.
Phase C — Ornith 9B Tops Japanese Free-Text Writing on ELYZA-tasks-100
Next up is ELYZA-tasks-100. This is the standard benchmark for Japanese LLM free-text evaluation: scoring 100 tasks of free-form Japanese responses from 1 to 5 points using LLM-as-a-Judge. This time I used Anthropic's claude-haiku-4.5 (via API) as the judge and lined up 4 comparison models. max_tokens is uniformly set to 1,024 including thinking.
| Model | Quantization | avg score | Count of 5-point scores |
|---|---|---|---|
| Ornith 1.0 9B | bf16 | 3.89 | 47 |
| Nemotron 3 Nano 30B-A3B-NVFP4 | NVFP4 | 3.29 | 14 |
| Nemotron 9B-v2-Japanese | bf16 | 3.08 | 33 |
| Ornith 1.0 35B-FP8 | FP8 | 2.61 | 24 |
| Qwen3.6-35B-A3B-FP8 | FP8 | 1.72 | 10 |
Ornith 1.0 9B topped the 5 models with avg 3.89. With 47 five-point scores, and 67 combined 4-point and 5-point scores, this is solidly practical. For reference, ELYZA-Llama-3-8B's public value is around avg 3.0, and GPT-3.5-class models around 3.5, so this exceeds those and comes close to, but doesn't quite reach, the GPT-4-class range of 4.5.
This frankly exceeded expectations. Ornith is an English-based 9B model focused on agentic coding that has not received Japanese-specialized SFT. The natural reading of these numbers is that the base Qwen 3.5 / Gemma 4 included substantial Japanese in pretraining, and Ornith's self-scaffolding RL extended the model's capabilities toward agentic tasks while retaining that knowledge. This is a subtly important observation for operational design: "the Japanese foundation of Qwen 3.5 / Gemma 4-based models persists even through post-training."
Note that Ornith 35B-FP8 scored lower in avg than its own 9B, but this is the result of comparison under a single uniform condition of max_tokens=1,024. Ornith 35B-FP8 still produced 24 five-point scores, showing its generation quality itself is stable. If you're seriously using Ornith 35B-FP8 for long-form tasks, the right operational design would be to set max_tokens to 2,048-4,096. When deploying a reasoning model with always-ON <think> for long-form content — whether in code or infrastructure — setting a higher token budget is the safe approach.
The practical takeaway from these benchmark numbers is simple: Ornith 1.0 9B is the most straightforward choice for bringing Japanese free-text writing to the practical range on DGX Spark. MIT license, 18 GB in bf16, it thinks carefully with <think> before answering, and delivers an average 3.89 on ELYZA. It's a sufficient candidate for an open-source Japanese LLM to keep resident on your local DGX Spark.
Phase D — tok/s Measurement and Cross-Reference with wesche.com Official Values
Beyond quality, let me organize the speed and memory picture. I ran a simple benchmark of warmup 1 + 5 runs across 5 models using a longer prompt (~100 characters) + max_tokens=512.
| Model | Quantization | Architecture | tok/s (avg) | latency (avg) |
|---|---|---|---|---|
| Nemotron 3 Nano 30B-A3B | NVFP4 | MoE Active 3B | 59.81 | 8.6 sec |
| Qwen3.6-35B-A3B | FP8 | MoE Active 3B | 51.61 | 9.9 sec |
| Ornith 1.0 35B | FP8 (compressed-tensors) | MoE Active 3B | 37.50 | 13.7 sec |
| Nemotron 9B-v2-JP | bf16 | Dense | 13.37 | 38.3 sec |
| Ornith 1.0 9B | bf16 | Dense | 12.65 | 40.5 sec |
Here are the key points related to Ornith:
Ornith 1.0 35B-FP8 leveraged the efficiency of MoE Active 3B to deliver 37.5 tok/s for the 35B class. Compared to bf16 Dense 9B (12.65 tok/s), that's 3x faster, and a gen latency of 22 seconds for 100 ELYZA tasks is in the practical range for a reasoning model at the 35B class. Ornith is provided officially on HF as FP8 (compressed-tensors), and it runs as-is with --quantization compressed-tensors on vLLM 0.23.0 + vllm/vllm-openai:latest. The practicality of "fits within 35-40 GB on a single DGX Spark with the officially provided quantization, delivering 37 tok/s" is a welcome figure for operations looking to host a 35B class locally.
Ornith 1.0 9B is Dense, so it comes in at 1/3 the speed of 35B-FP8 at 12.65 tok/s. Being the type that reasons carefully via long-form thinking, interactive use will involve response waits, but in exchange it delivers ELYZA 3.89 / JCQ 93.0% / BFCL simple_python 90.25% quality.
Aligning both quality and speed for "so which do you actually use?" looks like this:
| Model | ELYZA avg | tok/s | ELYZA avg latency per task | Quantization | Expected Use Case |
|---|---|---|---|---|---|
| Ornith 1.0 9B | 3.89 | 12.65 | 42.8 sec | bf16 | Batch processing prioritizing quality, overnight automated reports, dialogue where quality cannot be sacrificed |
| Ornith 1.0 35B-FP8 | 2.61 | 37.5 | 22.2 sec | FP8 | Interactive apps, UX where wait time matters, streaming responses |
Simply put, Ornith 1.0 9B is the "sweet spot when you want quality," and Ornith 1.0 35B-FP8 is the "sweet spot when you want speed." Being able to choose honestly between quality and speed within the same family is a straightforwardly appealing setup for operational design.
One note: the 35B-FP8 ELYZA score of 2.61 was measured under the single condition of max_tokens=1,024. Ornith 35B-FP8 still produced 24 five-point scores, and its generation quality itself is stable. If you're seriously using 35B-FP8 for long-form tasks, designing with max_tokens set to 2,048-4,096 to let <think> fully play out should improve quality while preserving the speed advantage. This is a topic I want to revisit while running it in actual production.
Note that wesche.com Spark-Bench v2's published value of 66.9 tok/s for Ornith-1.0-35B was measured with NVFP4 quantization ("Local models: served via llama.cpp (Q4_K_M quantization) or vLLM (NVFP4)", source: wesche.com/dgx/). Since the official Ornith HuggingFace collection currently has no NVFP4 version, wesche measured their own NVFP4-quantized version. If Ornith officially releases an NVFP4 version, reaching the 60 tok/s range for the 35B class becomes a realistic prospect. Separately measuring Nemotron 3 Nano 30B-A3B-NVFP4 on my local DGX Spark yielded 59.81 tok/s (same 30B class MoE Active 3B + NVFP4), consistent with wesche's order of magnitude.
At this point, the "Japanese performance" and "speed/balance" picture for Ornith 1.0 9B / 35B is complete. Finally, let's look at agentic task suitability from both my own BFCL v4 evaluation and the official LP's Terminal-Bench / SWE-Bench / ClawEval scores.
Phase E — Measuring Ornith 9B's Tool Calling Aptitude with BFCL v4
BFCL (Berkeley Function Calling Leaderboard) is the industry de-facto evaluation benchmark for measuring Tool Use / Function Calling capability. The latest v4 covers over 20 categories, from single-turn simple variants to parallel, irrelevance reject, live (real-world queries), multi_turn, and web_search.
Since Ornith is not registered in the BFCL catalog, I rewrote vLLM's --served-model-name to Qwen/Qwen3-8B and reused BFCL's Qwen/Qwen3-8B-FC handler. Since it's based on Qwen 3.5, the chat template is the same type, and the handler's _format_prompt works without issues. I used --skip-server-setup + REMOTE_OPENAI_BASE_URL to point directly to the external vLLM endpoint.
Running 8 categories — simple_python / multiple / parallel / parallel_multiple / irrelevance / live_simple / live_relevance / live_parallel — with 1,514 items in parallel using num_threads=4, it completed in 1 hour 42 minutes.
Here are the results for Ornith 1.0 9B:
| Category | Count | Ornith 9B Accuracy |
|---|---|---|
| multiple (function selection) | 199 | 92.00% |
| simple_python (basic FC) | 399 | 90.25% |
| live_relevance (real-world relevance) | 16 | 87.50% |
| parallel_multiple (parallel + selection) | 199 | 87.00% |
| irrelevance (reject) | 239 | 83.33% |
| parallel (parallel call) | 199 | 78.50% |
| live_simple (real-world single) | 257 | 74.42% |
| live_parallel (real-world parallel) | 16 | 68.75% |
Three things I can read from this:
First, Non-Live categories (human-written) are consistently 78-92%. simple_python at 90%, multiple at 92%, parallel_multiple at 87%. The most basic agentic operation — reading a tool spec, selecting the right function, and filling in the argument JSON — works at a practical level (80%+) even with a 9B Dense model.
Second, Live categories (real-world queries) drop 10-16 percentage points across all categories. simple_python at 90.25% versus live_simple at 74.42%. There's a clear degradation going from "grammatically clean, human-crafted function specs" to "rough specs that might appear in actual products." This is a trend seen across the Live categories in BFCL generally, not just for Ornith, but the fact that this gap is not small even for a 9B Dense model is a number worth knowing for operational design.
Third, parallel and live_parallel are on the lower end (78.50% and 68.75%). Tasks requiring multiple tool calls to be issued in parallel showed a tendency for the structure to break down as <think> grew longer. This is an industry-wide phenomenon where reasoning models with always-ON <think> occasionally falter on structured output — and Ornith is no exception.
While I wanted to run parallel evaluation for the other models, running all 4 models would take another 16+ hours, which didn't fit this article's publication timeline. For comparison targets, I referenced official scores for Qwen3-8B-FC and Qwen3-30B-A3B-Instruct-2507-FC from the BFCL official leaderboard (gorilla.cs.berkeley.edu/leaderboard.html). Ornith 9B's results here are roughly on par with the official scores for Qwen3-8B-FC — my honest impression is that rather than being exceptionally strong for an agentic coding-specialized model, it's "standard for the Qwen 3.5 8B class."
This might be the natural read: that Ornith's true agentic value lies not in the base 8B class, but in the 397B flagship design that, as the official LP claims, aims to match Claude Opus 4.7 / 4.8.
Citing Official Agentic Scores (Terminal-Bench / SWE-Bench / ClawEval / NL2Repo)
I'll pull out some of the particularly strong numbers from the size-by-size comparison table published on the Ornith official LP.
The 397B flagship model outperforms Claude Opus 4.7 in several categories.
| Bench | Ornith-1.0-397B | Claude Opus 4.7 | Claude Opus 4.8 |
|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 77.5 | 70.3 | 85.0 |
| Terminal-Bench 2.1 (Claude Code) | 78.2 | 69.7 | 78.9 |
| SWE-Bench Verified | 82.4 | 80.8 | 87.6 |
| SWE-Bench Pro | 62.2 | 64.3 | 69.2 |
| ClawEval Avg | 77.1 | 78.2 | — |
| NL2Repo | 48.2 | — | 69.7 |
Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, surpassing Claude Opus 4.7 on both benchmarks.
( Source: deep-reinforce.com/ornith_1_0.html )
The 35B-MoE also puts up exceptional numbers for its class.
| Bench | Ornith-1.0-35B | Qwen3.5-35B | Qwen3.6-35B | Gemma4-31B | Qwen3.5-397B |
|---|---|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 64.2 | 41.4 | 52.5 | 42.1 | 53.5 |
| SWE-Bench Verified | 75.6 | 70.0 | 73.4 | 52.0 | 76.4 |
| ClawEval Avg | 69.8 | 65.4 | 68.7 | 48.5 | 70.7 |
The picture here is that within the same 35B class, it pulls 11.7 points ahead of Qwen3.6-35B and closes in on flagship Qwen3.5-397B (which has 11× the parameters).
The 9B (Dense) is positioned for edge use, yet it still leaves the 35B-class Qwen3.5-9B (Dense) far behind.
| Bench | Ornith-1.0-9B | Qwen3.5-9B | Gemma4-12B | Gemma4-31B |
|---|---|---|---|---|
| Terminal-Bench 2.1 (Terminus-2) | 43.1 | 21.3 | 21.0 | 42.1 |
| SWE-Bench Verified | 69.4 | 53.2 | 44.2 | 52.0 |
| ClawEval Avg | 63.1 | 53.2 | 32.5 | 48.5 |
It's genuinely interesting that the 9B Dense nearly ties the 31B Dense (Gemma 4-31B) on Terminal-Bench. Given the large gap it opens up over Qwen 3.5 9B, one could hypothesize that the effect of self-scaffolding RL is particularly pronounced in smaller-scale models.
Summary — What Makes Ornith 1.0 Appealing
Here are my takeaways after running Ornith 1.0 — released as an English agentic coding-specialized model — through 4 Japanese axes plus 1 agentic axis on a DGX Spark.
The things I personally find appealing about Ornith 1.0 are as follows.
- Even the 9B Dense reaches practical territory for Japanese common sense, open-ended writing, and tool calling. JCQ 93.0%, ELYZA-tasks-100 avg 3.89 (top among 5 models), BFCL v4 simple_python 90.25%. These numbers from a 9B positioned as edge-oriented are more than sufficient as a candidate for a local LLM to run resident on a DGX Spark.
- The 35B-FP8 is provided under MIT from the official HF, and delivers 37 tok/s straight out of the box on a single DGX Spark. It spins up easily with
vllm/vllm-openai:latest+--quantization compressed-tensors, and the MoE Active 3B efficiency contributes as well. <think>can be split into content / reasoning by vLLM's--reasoning-parser qwen3. The handling when incorporating a reasoning model into structured tasks is standardized, requiring no extra ingenuity in operational design.- The official LP claims the 397B rivals Claude Opus 4.7. Terminal-Bench 2.1 at 77.5, SWE-Bench Verified at 82.4, ClawEval Avg at 77.1. Even in the 35B class, with ClawEval 69.8 / Terminal-Bench 64.2, it closes in on Qwen3.5-397B (11× the parameters).
- MIT license + Qwen 3.5 / Gemma 4 base, with the Japanese foundation inherited from the base and the agentic capability from self-scaffolding RL working well in tandem. It's also easy to use as a base for derivative models or fine-tuning going forward.
When running on DGX Spark, the choice between quality and speed depends on use case.
| Use Case | Recommended | Reason |
|---|---|---|
| Quality-focused batch processing / nightly automated reports | Ornith 1.0 9B (bf16) | Top among 5 models at ELYZA avg 3.89, JCQ 93.0%, stable on long-form tasks |
| Interactive UX / streaming responses | Ornith 1.0 35B-FP8 | 37.5 tok/s, 3× faster than 9B, ELYZA 1 task in 22.2 sec, official agentic scores also higher than 9B |
| Serious agentic workloads | 35B-FP8 + expanded max_tokens |
Carries official ClawEval 69.8 / Terminal-Bench 64.2, also practical in terms of speed |
| Future flagship operation on a cluster | Ornith 1.0 397B | Official scores rivaling Claude Opus 4.7, assumes multiple DGX Sparks or cloud GPUs |
"9B Dense for quality, 35B-FP8 for speed" — the straightforward division of roles within the same family is one of Ornith 1.0's strong points, and the fact that you don't need to switch models during the operational design phase is genuinely convenient. For tasks that require the highest final quality in Japanese open-ended writing, route to 9B; for tasks with a lot of interactive back-and-forth, route to 35B-FP8 at the backend service layer — and you can get the best of both worlds.
Having <think> always ON does require a design that allocates max_tokens of 2,048–4,096 for long-form tasks, but this is a premise shared by all reasoning models in the Qwen 3.5 / Gemma 4 / DeepSeek family, so it's not a burden unique to Ornith. What's interesting about Ornith is rather that scores commensurate with that thinking length are produced even from the 9B.
Topics I'd Like to Continue Investigating
- Increase
max_tokensto 4,096 on Ornith 35B-FP8 and examine its true quality on long-form tasks - Run Ornith 9B resident in Hermes Agent or Claude Code Router and write up the hands-on feel of operating a reasoning model in production based on a week's routing logs
- When an NVFP4 edition is officially released, run 35B-NVFP4 / 397B-NVFP4 on DGX Spark to confirm speed and quality gains
- Set up Terminal-Bench 2.1 as a permanent fixture on DGX Spark and turn it into a validation platform that regularly runs Ornith 9B / 35B and other OSS models
