I tried PLaMo 3.0 Prime

PLaMo 3.0 Prime, a Japanese LLM built from scratch, has been officially released with a Reasoning mode. This is a one-day record of signing up for the Standard plan and testing the model hands-on through its OpenAI-compatible API. I tracked in detail how behavior differs between Reasoning ON and OFF, covering simple questions, code generation, tool calling, and integration with Claude Code.
森茂洋 / Hiroshi Morishige
2026.06.22
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
On 2026-06-22, Preferred Networks (PFN) announced the official release of PLaMo 3.0 Prime, a large language model developed in Japan and built from scratch. As if by coincidence, Sakana AI's Fugu also reached GA on the same day, so two Japanese LLMs came out at once.
PLaMo 3.0 Prime is introduced as the first Japanese from-scratch model to include a Reasoning (deep thinking) mode. With an OpenAI-compatible API, a 256K context window, Tool calling support, and claimed top scores on medical and legal benchmarks, it is a very intriguing package.
This article records everything I did in a single day: signing up for the Standard plan, issuing an API key, observing Reasoning OFF/ON behavior against the live API with the OpenAI-compatible SDK, and finally getting it working from Claude Code.
What is PLaMo 3.0 Prime?
PLaMo 3.0 Prime is a domestically developed generative AI foundation model independently created by Preferred Networks (PFN). It is the latest in the PLaMo series that PFN has been continuously releasing, and it is designed to switch between a "deep thinking mode" with Reasoning and a Non-Reasoning mode prioritizing responsiveness.
The official blog lists a wide range of benchmarks spanning both English and Japanese, including IFBench / JFBench / MT-bench / Japanese MT-bench / BFCL v4 (tool use) / BrowseComp-Plus / LongBench v1, v2 / AIME 2024 / GPQA-Diamond / LiveCodeBench / lawqa_jp / MedRECT / National Medical Licensing Examination / HELM Safety, as evaluation axes. The comparison targets in the same price range are Qwen3.6-27B / gpt-oss-120b / GPT-5.4 Mini / Claude Haiku 4.5, and in the higher tier, DeepSeek V4 Pro / GPT-5.5 Pro.
The delivery format is either a cloud API via the PLaMo Platform or on-premises deployment. The weights of PLaMo 3.0 Prime itself are not publicly available.
Model Specifications and Pricing Plans
The specifications you need when using the model via the API are summarized below. The context and output limits are values confirmed in the official API reference.

| Item | Value |
| --- | --- |
| Model ID | `plamo-3.0-prime` |
| Context | 256K (expanded from 64K in beta) |
| Output `max_tokens` cap | 20,000 |
| Reasoning toggle | `none` or `medium` via `reasoning_effort` |
| Tool calling | Supported (evaluated with BFCL v4) |
| Delivery format | Cloud API / On-premises |

As of 2026-06-22, three pricing plans are available, and I chose Standard this time.

| Plan | Input | Output |
| --- | --- | --- |
| Standard | ¥60 / 1M tok | ¥250 / 1M tok |
| Free | Usage limits apply | Usage limits apply |
| Provider | Custom quote | Custom quote |

The unit price of ¥250 per 1M output tokens feels quite accessible even compared to overseas models in the same price range. Furthermore, as a GA release campaign running through 2026-07-31, new registrants receive credits equivalent to 10 million tokens just by signing up. Since this verification was essentially running within that quota, I think anyone looking to try it out can experiment freely for the first few days. I'll cover how the usage quota appears under the Standard plan later.
Positioning Against Same-Price-Range Models
The same-price-range models listed in the official blog (Qwen3.6-27B / gpt-oss-120b / GPT-5.4 Mini / Claude Haiku 4.5) are all primarily overseas offerings. PFN's own claim is that it "matches or exceeds the same price range in instruction following, dialogue, tool use, and medical domains" and that "HELM Safety is on par with or better than overseas models," with specific numbers displayed in graphs within the article.
Since I didn't re-run the benchmarks myself, I've chosen not to reproduce the figures here. This article takes a focused scope of simply hitting the API with the OpenAI-compatible SDK and observing how Reasoning mode behaves. For objective numerical comparisons, please refer to the official blog directly.
In the context of domestic full-scratch LLMs, I thought the top-ranking claims on Japanese benchmarks for medicine (National Medical Licensing Examination / MedRECT) and law (lawqa_jp) were an angle that overseas models don't have.
Trying It Out: From Console Sign-Up to API Key Issuance
Accessing the PLaMo Platform Console (console.platform.preferredai.jp) brings up a sign-up screen. Proceeding with the Standard plan contract takes you to the API Key issuance screen. Since the full key is only visible immediately after issuance, I copied it on the spot and wrote it into my local .envrc.
The Playground can be accessed from plamo.preferredai.jp/chat. You can toggle Reasoning ON/OFF in the chat UI, so it's smoother to get a rough sense of the behavior there before hitting the API.
The verification below assumes that PLAMO_API_KEY is set as an environment variable.
Hitting It with the OpenAI-Compatible API
Since PLaMo Platform exposes an OpenAI-compatible API, you can call it from the Python openai package simply by replacing the base_url. To observe behavior against the live API, I wrote a small script that sends one light question and one light reasoning task, each with Reasoning OFF and Reasoning ON.
import os

from openai import OpenAI

# Point the standard OpenAI client at the PLaMo Platform endpoint.
client = OpenAI(
    base_url="https://api.platform.preferredai.jp/v1",
    api_key=os.environ["PLAMO_API_KEY"],
)

# Reasoning OFF (default)
resp = client.chat.completions.create(
    model="plamo-3.0-prime",
    messages=[{"role": "user", "content": "..."}],
)

# Reasoning ON (long-think); reasoning_effort is passed through extra_body
resp = client.chat.completions.create(
    model="plamo-3.0-prime",
    messages=[{"role": "user", "content": "..."}],
    extra_body={"reasoning_effort": "medium"},
)
I sent the following two types of prompts:
light_question: "Please introduce the model called PLaMo 3.0 Prime to readers in about 50 characters."
light_reasoning: "We split 3 apples and 4 mandarin oranges between 2 people. Please answer how many each person gets and suggest a fair way to divide them within 100 characters."
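The measurement loop itself is not shown in this article, so here is a minimal sketch of how the four runs could be timed. The helper names (measure, run_all, make_sender) are my own, and passing reasoning_effort "none" explicitly for the OFF case is an assumption based on the documented values. The network call is isolated in make_sender so the timing logic can be exercised without an API key.

```python
import time
from typing import Callable, Dict, List

# The two prompts quoted above (English renderings).
PROMPTS: Dict[str, str] = {
    "light_question": (
        "Please introduce the model called PLaMo 3.0 Prime to readers "
        "in about 50 characters."
    ),
    "light_reasoning": (
        "We split 3 apples and 4 mandarin oranges between 2 people. "
        "Please answer how many each person gets and suggest a fair way "
        "to divide them within 100 characters."
    ),
}

# Variant name -> reasoning_effort value.
VARIANTS: Dict[str, str] = {"reasoning_off": "none", "reasoning_medium": "medium"}


def measure(send: Callable[[str, str], int], prompt: str, effort: str) -> dict:
    """Time one request; `send` must return completion_tokens."""
    start = time.perf_counter()
    completion_tokens = send(prompt, effort)
    latency = time.perf_counter() - start
    return {
        "reasoning_effort": effort,
        "latency_s": round(latency, 2),
        "completion_tokens": completion_tokens,
    }


def run_all(send: Callable[[str, str], int]) -> List[dict]:
    """Run every prompt with every variant and collect one row per run."""
    rows = []
    for prompt_name, prompt in PROMPTS.items():
        for variant, effort in VARIANTS.items():
            row = measure(send, prompt, effort)
            row.update(variant=variant, prompt=prompt_name)
            rows.append(row)
    return rows


def make_sender(client, model: str = "plamo-3.0-prime") -> Callable[[str, str], int]:
    """Wrap an openai.OpenAI client configured as in the snippet above."""
    def send(prompt: str, effort: str) -> int:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            extra_body={"reasoning_effort": effort},
        )
        return resp.usage.completion_tokens
    return send
```

With a real client, run_all(make_sender(client)) would return four rows shaped like the results table. A single run per cell is noisy, so repeating each request a few times would give steadier numbers.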
Regarding the reasoning_effort values, the official documentation only describes two values: none and medium. Out of curiosity, I tried high and low in the familiar low / medium / high sense, and both returned HTTP 422.
{
  "detail": [
    {
      "type": "literal_error",
      "loc": ["body", "reasoning_effort"],
      "msg": "Input should be 'none' or 'medium'",
      "ctx": { "expected": "'none' or 'medium'" },
      "input": "high"
    }
  ]
}
If you're used to the three levels low / medium / high from OpenAI or Anthropic, you'll naturally reach for high first, but PLaMo strictly validates against just two values, none and medium. Both low and high are rejected with the same error, and because the message spells out the allowed values, there is little room for confusion.
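Since the server rejects anything other than the two documented values, it can be convenient to catch a bad value before sending the request and to turn the 422 body into a short message. The following is a small sketch of that idea. The function names are hypothetical, and the error shape is assumed to match the single sample shown above.

```python
import json
from typing import List

# The only values the API accepted in my tests.
ALLOWED_REASONING_EFFORT = ("none", "medium")


def check_reasoning_effort(value: str) -> str:
    """Fail fast on the client side instead of waiting for HTTP 422."""
    if value not in ALLOWED_REASONING_EFFORT:
        raise ValueError(
            f"reasoning_effort must be one of {ALLOWED_REASONING_EFFORT}, got {value!r}"
        )
    return value


def summarize_422(body: str) -> List[str]:
    """Turn a 422 body like the one above into 'field: message' strings."""
    detail = json.loads(body).get("detail", [])
    return [
        ".".join(str(part) for part in item["loc"]) + ": " + item["msg"]
        for item in detail
    ]


sample_body = json.dumps({
    "detail": [{
        "type": "literal_error",
        "loc": ["body", "reasoning_effort"],
        "msg": "Input should be 'none' or 'medium'",
        "ctx": {"expected": "'none' or 'medium'"},
        "input": "high",
    }]
})
print(summarize_422(sample_body))
```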
The results of running both Reasoning OFF and ON against the live API are shown in the table below.

| variant | reasoning_effort | prompt | latency (s) | completion_tokens |
| --- | --- | --- | --- | --- |
| reasoning_off | none | light_question | 3.77 | 60 |
| reasoning_medium | medium | light_question | 22.69 | 754 |
| reasoning_off | none | light_reasoning | 2.49 | 42 |
| reasoning_medium | medium | light_reasoning | 43.11 | 1454 |

For light_question, completion_tokens ballooned from 60 to 754 (about 12.6x), and for light_reasoning from 42 to 1454 (about 35x). In other words, with Reasoning ON, that many additional tokens are consumed internally beyond the visible response text.
Latency also grew by roughly 6x to 17x with Reasoning ON. Time spent on deep thinking translates directly into waiting time.
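For anyone who wants to check the multipliers, the arithmetic can be reproduced directly from the measured values in the table. This is just a worked calculation; the dictionary layout is my own.

```python
# (latency in seconds, completion_tokens) per prompt, taken from the table above.
measurements = {
    "light_question": {"off": (3.77, 60), "on": (22.69, 754)},
    "light_reasoning": {"off": (2.49, 42), "on": (43.11, 1454)},
}


def ratios(off, on):
    """Return (latency ratio, token ratio) of Reasoning ON relative to OFF."""
    return round(on[0] / off[0], 1), round(on[1] / off[1], 1)


for name, m in measurements.items():
    latency_x, tokens_x = ratios(m["off"], m["on"])
    print(f"{name}: latency x{latency_x}, completion_tokens x{tokens_x}")
# light_question: latency x6.0, completion_tokens x12.6
# light_reasoning: latency x17.3, completion_tokens x34.6
```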
What caught my attention here was the structure of the usage field. With OpenAI's Chat Completions API, you can see how many tokens were spent on deep thinking via usage.completion_tokens_details.reasoning_tokens (the Responses API exposes the equivalent under usage.output_tokens_details), but in PLaMo's case, completion_tokens_details is returned as null, as shown below.
{
  "completion_tokens": 754,
  "prompt_tokens": 33,
  "total_tokens": 787,
  "completion_tokens_details": null,
  "prompt_tokens_details": null
}
The deep thinking portion is lumped into completion_tokens, and the breakdown is not visible. Since the Standard plan charges ¥250 / 1M tok for the entire output, it's worth being aware that Reasoning ON will deplete the output quota faster.
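Because only the totals are visible, a cost estimate has to be built from prompt_tokens and completion_tokens alone. Below is a rough sketch using the Standard plan rates from the pricing table. It ignores campaign credits and any rounding the billing system may apply, which I have not verified.

```python
# Standard plan rates from the pricing table (yen per 1M tokens).
INPUT_YEN_PER_MTOK = 60
OUTPUT_YEN_PER_MTOK = 250


def estimate_cost_yen(usage: dict) -> float:
    """Estimate the cost of one request from its usage totals."""
    input_cost = usage["prompt_tokens"] * INPUT_YEN_PER_MTOK
    output_cost = usage["completion_tokens"] * OUTPUT_YEN_PER_MTOK
    return (input_cost + output_cost) / 1_000_000


# The usage object shown above (Reasoning ON, light_question).
usage_on = {"prompt_tokens": 33, "completion_tokens": 754}
print(f"{estimate_cost_yen(usage_on):.5f} yen")  # 0.19048 yen

# Output-side cost only, OFF (60 tokens) versus ON (754 tokens).
print(60 * OUTPUT_YEN_PER_MTOK / 1_000_000)   # 0.015
print(754 * OUTPUT_YEN_PER_MTOK / 1_000_000)  # 0.1885
```

A single call costs a fraction of a yen either way, but the roughly 12x gap on the output side is what accumulates in agent loops that issue many requests.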
Let me also touch on the content of the responses. For light_question, Reasoning OFF returned approximately 135 characters, well over the 50-character guideline. Reasoning ON returned 38 characters in the original Japanese (rendered in English as "PLaMo 3.0 Prime is a state-of-the-art large language model that achieves advanced natural language processing."), correctly adhering to the constraint. It was nice to see instruction-following accuracy improve as a result of deep thinking.
For light_reasoning, Reasoning OFF diligently listed each element as "1.5 apples per person, 2 mandarins each, 3.5 total," while Reasoning ON stated the total first — "7 total, 3.5 per person" — then added a twist to the fair division proposal: "Option A: 2 apples + 1 mandarin, Option B: 1 apple + 3 mandarins, and the remaining 1 can be cut in half or shared by turns." There's a sense that the response style itself changes.
Diving Deeper with Deep Thinking Mode
With just light questions, it's hard to see the effect of Reasoning ON clearly, so I tried pushing into code generation. The subject was asking for "a Python script that evolves the string HELLO PLAMO using an evolutionary algorithm (genetic algorithm)." I sent the same prompt once each with Reasoning OFF and ON and compared the generated code side by side.

| variant | reasoning_effort | latency (s) | completion_tokens |
| --- | --- | --- | --- |
| reasoning_off | none | 26.54 | 727 |
| reasoning_medium | medium | 90.84 | 3,243 |

Latency stretched about 3.4x and output tokens about 4.5x. What caught my attention more was the structure of the generated code.
Reasoning OFF returns something compact that roughly works. It uses roulette-wheel (fitness-proportionate) parent selection, single-point crossover, a mutation rate of 0.05, and elite preservation. The explanation is concise and readable at a glance.
However, on closer inspection, evolve() has a name collision between a function named population and a local variable of the same name that receives its return value. Under Python's scoping rules, an assignment population = ... anywhere inside a function makes population a local variable for the whole function body. When Python evaluates the right-hand side population(POPULATION_SIZE, target_len), that local has not been assigned yet, so the very first call raises UnboundLocalError. The code therefore does not run as generated, which is exactly the kind of thing you'd want to flag in a code review.
def population(size, length):
    return [random_individual(length) for _ in range(size)]

def evolve():
    target_len = len(TARGET)
    population = population(POPULATION_SIZE, target_len)  # UnboundLocalError: the local name shadows the function
    ...
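To make the failure concrete, here is a self-contained reproduction together with one possible fix (renaming the local variable). TARGET comes from the prompt. ALPHABET, POPULATION_SIZE, and the body of random_individual are my assumptions, since the generated script is only excerpted above.

```python
import random

TARGET = "HELLO PLAMO"
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # assumed character set
POPULATION_SIZE = 5                       # assumed; small for the demo


def random_individual(length):
    return "".join(random.choice(ALPHABET) for _ in range(length))


def population(size, length):
    return [random_individual(length) for _ in range(size)]


def evolve_buggy():
    # The assignment makes `population` local to this function, so the
    # call on the right-hand side reads a local that is not bound yet.
    population = population(POPULATION_SIZE, len(TARGET))
    return population


def evolve_fixed():
    # A different local name leaves the module-level function visible.
    individuals = population(POPULATION_SIZE, len(TARGET))
    return individuals


try:
    evolve_buggy()
except UnboundLocalError as exc:
    print("buggy version:", type(exc).__name__)

print("fixed version:", len(evolve_fixed()), "individuals")
```

Renaming the function instead (for example to a verb such as create_population) would work equally well. The point is only that the two names must differ.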
Reasoning ON takes the time to reconstruct the structure more carefully. Instead of roulette, it uses tournament selection (sampling with k=2, adopting the one with higher fitness), returns two children from single-point crossover, writes Japanese docstrings for each function, and outputs progress logs every 20 generations. The population function/variable name collision from before does not occur.
def create_individual():
    """Generate a random individual (string)"""
    return ''.join(random.choice(alphabet) for _ in range(len(target)))

def tournament_selection(pop, fit, k=2):
    """Select 2 parents using tournament method"""
    selected = random.sample(list(zip(pop, fit)), k)
    selected.sort(key=lambda x: x[1], reverse=True)
    return selected[0][0], selected[1][0]
Spending the deep thinking cost doesn't suddenly transform the code into something completely different. Rather, the model appears to pick the safer option among the available design choices.
I also gave Tool calling a quick try. I defined a mock weather retrieval tool called get_current_weather and sent "Please tell me the current weather in Tokyo in Celsius" with both Reasoning OFF and ON.

| variant | reasoning_effort | latency (s) | finish_reason | arguments |
| --- | --- | --- | --- | --- |
| reasoning_off | none | 3.93 | tool_calls | `{"city": "Tokyo", "unit": "celsius"}` |
| reasoning_on | medium | 5.72 | tool_calls | `{"city": "Tokyo", "unit": "celsius"}` |

Both OFF and ON correctly came back with finish_reason: tool_calls and invoked the tool, with identical arguments: {"city": "Tokyo", "unit": "celsius"}. This confirmed against the live API that the OpenAI-compatible tools / tool_choice parameters work as-is. The Reasoning ON side had completion_tokens inflate about 2.2x (58 → 128), so it seems deep thinking runs even for a single tool call.
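The tool definition itself is not reproduced in this article, so the schema below is a hypothetical reconstruction in the standard OpenAI tools format, inferred only from the argument names that came back. The mock return values are placeholders, not real weather data.

```python
import json

# Hypothetical schema: inferred from the returned arguments, not the original.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Return the current weather for a city (mock).",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]


def get_current_weather(city, unit="celsius"):
    """Mock tool: returns canned placeholder data, no real lookup."""
    return {"city": city, "unit": unit, "temperature": 20, "condition": "sunny"}


def dispatch(tool_name, arguments_json):
    """Decode the JSON argument string from a tool call and run the tool."""
    if tool_name != "get_current_weather":
        raise ValueError(f"unknown tool: {tool_name}")
    return get_current_weather(**json.loads(arguments_json))


# The argument string observed in both runs.
print(dispatch("get_current_weather", '{"city": "Tokyo", "unit": "celsius"}'))
```

In a real call, TOOLS would be passed as tools=TOOLS to client.chat.completions.create, and the name and argument string would be read from resp.choices[0].message.tool_calls[0].function.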
Calling PLaMo from Claude Code
Now that I've confirmed PLaMo works cleanly as an OpenAI-compatible API, I wanted to plug it into an agent workflow as well. The tool I used is claude-code-router (CCR).
CCR is an OpenAI-compatible router for Claude Code that receives requests in Anthropic format, internally converts them to OpenAI format, and forwards them to any provider. All it takes is adding PLaMo to the Providers array in ~/.claude-code-router/config.json.
{
  "Providers": [
    {
      "name": "plamo",
      "api_base_url": "https://api.platform.preferredai.jp/v1/chat/completions",
      "api_key": "${PLAMO_API_KEY}",
      "models": ["plamo-3.0-prime"]
    }
  ]
}
It worked without the transformer section. CCR appears to auto-detect OpenAI-compatible APIs. If you already have other OpenAI-compatible providers like Sakana coexisting, just add PLaMo alongside them.
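If the config file already lists other providers, hand-editing the array is an easy place to introduce a duplicate entry or a syntax slip. As one way to avoid that, the sketch below merges the PLaMo entry into an existing config dictionary and emits strict JSON. The "example-provider" entry and its URL are placeholders, and reading or writing the real file is left out on purpose.

```python
import json

PLAMO_PROVIDER = {
    "name": "plamo",
    "api_base_url": "https://api.platform.preferredai.jp/v1/chat/completions",
    "api_key": "${PLAMO_API_KEY}",
    "models": ["plamo-3.0-prime"],
}


def upsert_provider(config, provider):
    """Return a copy of config with provider added or replaced by name."""
    others = [
        p for p in config.get("Providers", [])
        if p.get("name") != provider["name"]
    ]
    return {**config, "Providers": others + [provider]}


# Placeholder for whatever is already configured.
existing = {
    "Providers": [{
        "name": "example-provider",
        "api_base_url": "https://example.invalid/v1/chat/completions",
        "api_key": "${EXAMPLE_API_KEY}",
        "models": ["example-model"],
    }]
}

merged = upsert_provider(existing, PLAMO_PROVIDER)
print(json.dumps(merged, indent=2))
```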
To apply the configuration, run ccr restart. If you see plamo provider registered in the logs as shown below, you're good to go.
{"level":30,"msg":"plamo provider registered"}
A curl command is sufficient to verify connectivity. Hitting the Anthropic-format /v1/messages endpoint that Claude Code normally uses, CCR converts it to OpenAI format and sends it to PLaMo, then converts the returned OpenAI response back to Anthropic format.
curl -s -X POST http://127.0.0.1:3456/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "plamo,plamo-3.0-prime",
    "max_tokens": 200,
    "messages": [{"role": "user", "content": "Please introduce PLaMo 3.0 Prime in 30 characters or less."}]
  }'
The following response is returned (excerpt):
{
  "id": "chatcmpl-...",
  "type": "message",
  "model": "plamo-3.0-prime",
  "role": "assistant",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 24,
    "output_tokens": 12,
    "cache_read_input_tokens": 0
  },
  "content": ["高性能AIアシスタント、PLaMo3.0Prime。"]
}
An 18-character response that properly respects the character constraint. Since CCR returns usage aligned to Anthropic format, Claude Code itself sees nearly the same interface as a regular Claude model. If you launch Claude Code with ccr code and fix the model with /model plamo,plamo-3.0-prime, you can use it directly for agent-driven tasks.
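To illustrate the kind of translation a router has to perform on the way back, here is a simplified sketch that maps an OpenAI-style chat completion onto the Anthropic message shape. This is my own illustration, not CCR's actual code, and it handles only plain text responses. In the Anthropic Messages API, content is a list of blocks with type and text fields, which the excerpt above appears to abbreviate.

```python
# OpenAI finish_reason -> Anthropic stop_reason
STOP_REASON_MAP = {
    "stop": "end_turn",
    "length": "max_tokens",
    "tool_calls": "tool_use",
}


def openai_to_anthropic(resp: dict) -> dict:
    """Convert a text-only chat completion dict to an Anthropic-style message."""
    choice = resp["choices"][0]
    usage = resp.get("usage") or {}
    text = choice["message"].get("content") or ""
    return {
        "id": resp.get("id"),
        "type": "message",
        "role": "assistant",
        "model": resp.get("model"),
        "content": [{"type": "text", "text": text}],
        "stop_reason": STOP_REASON_MAP.get(choice.get("finish_reason"), "end_turn"),
        "usage": {
            "input_tokens": usage.get("prompt_tokens", 0),
            "output_tokens": usage.get("completion_tokens", 0),
        },
    }


# Sample shaped after the response above (token counts taken from it).
sample = {
    "id": "chatcmpl-...",
    "model": "plamo-3.0-prime",
    "choices": [{
        "message": {"role": "assistant", "content": "高性能AIアシスタント、PLaMo3.0Prime。"},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 24, "completion_tokens": 12, "total_tokens": 36},
}
converted = openai_to_anthropic(sample)
print(converted["stop_reason"], converted["usage"])
```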
Briefly touching on Codex CLI and Hermes Agent as well: for Hermes, I was able to connect directly to PLaMo by creating one profile with provider: custom and writing base_url and api_key directly. For Codex CLI, version 0.141.0 only supports wire_api = "responses", so PLaMo, which only offers Chat Completions, couldn't be connected directly. Details are summarized in the "Things I'm Currently Curious About" section.
Things I'm Currently Curious About
Here are some things that came to mind after trying it out, in bullet form.
- The spec where usage.completion_tokens_details returns null. Since the breakdown of deep thinking tokens isn't visible, cost estimation can only be done by looking at the total completion_tokens. If an equivalent of reasoning_tokens is exposed in a future API expansion, billing transparency for Reasoning mode would improve significantly.
- Latency variation with Reasoning ON. This time I saw 22.69s for a light question, 43.11s for a light reasoning task, and 90.84s for code generation, scaling straightforwardly with prompt complexity. I'd like to separately examine the effective latency when the full 256K context is in use.
- The Console usage display not being real-time. During this verification, I wanted to track how many tokens I had consumed so far, but the Console numbers appear to update with a delay and weren't reflected immediately after the session. I hope the billing screen becomes closer to real-time going forward.
- Verifying the official benchmark claims independently. The medicine (National Medical Licensing Examination / MedRECT) and law (lawqa_jp) benchmarks are subjects I'd like to lightly verify on my own.
- Running it permanently via Hermes Agent. The connection itself went through with provider: custom + api_mode: chat_completions, so setting up a persistent agent that calls PLaMo via Slack or cron is a topic I'd like to write up in a separate article.
- Connecting via Codex CLI. With 0.141.0 restricted to wire_api = "responses", PLaMo, which only offers Chat Completions, couldn't be connected directly. Whether to downgrade to the 0.140 series or route around it with a LiteLLM proxy is another topic I'd like to pursue in a separate article.
Conclusion
I tried PLaMo 3.0 Prime on the Standard plan. Within a single day of testing, I was able to confirm that it works cleanly as an OpenAI-compatible API, that Reasoning ON changes instruction compliance and code quality, and that Tool calling works properly. The Console experience was also smooth, and as a developer it's genuinely appreciated to see a domestic full-scratch LLM come out in such a "just works" state.
On the agent-driven side, mounting it on Claude Code via CCR and adding it to a Hermes Agent profile both went through with minimal configuration. Codex CLI hit the wall of being restricted to the responses API in 0.141.0, so that one is carried over as a sequel candidate. The cost feel of Reasoning mode and the effective latency with the 256K context are also topics I'd like to dig into in a separate article.
2026-06-22, the day two domestic LLMs were released simultaneously, may well look like a meaningful milestone for the community when we look back on it.
Reference Links
We Have Released PLaMo 3.0 Prime (Preferred Networks Tech Blog)
Official Release of PLaMo 3.0 Prime, a Domestically Developed Generative AI Foundation Model (Preferred Networks press release)
PLaMo Platform Console
PLaMo Playground
PLaMo Platform API Reference
claude-code-router (GitHub)