I tried PLaMo 3.0 Prime

2026.06.22

Introduction

Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.

On 2026-06-22, Preferred Networks (PFN) announced the official release of PLaMo 3.0 Prime, a domestically developed large language model built from scratch. Coincidentally, Sakana AI's Fugu reached GA on the same day, so two domestic LLMs became generally available at once.

Among these, PLaMo 3.0 Prime is introduced as the first domestically built, from-scratch model to include a Reasoning (deep thinking) mode. It has a lot of appealing features: an OpenAI-compatible API, a long context of 256K, Tool calling support, and claims of top scores on medical and legal domain benchmarks.

This article documents everything I did in a single day: signing up for the Standard plan and issuing an API key, observing the behavior of Reasoning OFF / ON with the OpenAI-compatible SDK, and getting it to work from Claude Code.

What Is PLaMo 3.0 Prime?

PLaMo 3.0 Prime is a domestically developed generative AI foundation model independently created by Preferred Networks (PFN). It is the latest in the PLaMo series that PFN has been continuously releasing, and it is designed to switch between a "deep thinking mode" with Reasoning enabled and a Non-Reasoning mode that prioritizes responsiveness.

The official blog lists a broad range of benchmarks mixing English and Japanese as evaluation axes: IFBench / JFBench / MT-bench / Japanese MT-bench / BFCL v4 (tool use) / BrowseComp-Plus / LongBench v1, v2 / AIME 2024 / GPQA-Diamond / LiveCodeBench / lawqa_jp / MedRECT / National Medical Licensing Examination / HELM Safety. The comparison targets are Qwen3.6-27B / gpt-oss-120b / GPT-5.4 Mini / Claude Haiku 4.5 in the same price range, and DeepSeek V4 Pro / GPT-5.5 Pro in the upper tier.

The delivery format is either a cloud API via PLaMo Platform or on-premises deployment. The weights of PLaMo 3.0 Prime itself are not publicly available.

Model Specs and Pricing Plans

The specs needed when using it via API are summarized as follows. Context and output limits are values confirmed in the official API reference.

Item                      Value
Model ID                  plamo-3.0-prime
Context                   256K (expanded from the beta's 64K)
Output max_tokens limit   20,000
Reasoning toggle          reasoning_effort: none or medium
Tool calling              Supported (evaluated with BFCL v4)
Delivery format           Cloud API / On-premises

As of 2026-06-22, three pricing plans are available. I chose Standard for this exploration.

Plan       Input           Output
Standard   ¥60 / 1M tok    ¥250 / 1M tok
Free       Usage limited   Usage limited
Provider   Custom quote    Custom quote

The unit price of ¥250 per 1M output tokens feels quite accessible even compared to overseas models in the same price range. Additionally, as a GA release campaign through 2026-07-31, simply registering as a new user grants credits equivalent to 10 million tokens. Since this verification was essentially running on that quota, I think anyone who wants to try it out can do so casually for the first few days. I'll describe how the Standard plan usage quota appears later.
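
To get a feel for what these unit prices mean per request, here is a minimal sketch of a cost estimator. It assumes billing is a simple linear function of token counts at the Standard plan prices in the table above, and it ignores campaign credits and any rounding the platform may apply. The function name estimate_cost_yen is my own.

```python
from decimal import Decimal

# Standard plan unit prices from the table above (yen per 1M tokens).
INPUT_YEN_PER_MTOK = Decimal("60")
OUTPUT_YEN_PER_MTOK = Decimal("250")


def estimate_cost_yen(prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Estimate the Standard plan charge in yen for a single request.

    Assumption: cost is linear in token counts, with no rounding or credits.
    """
    million = Decimal(1_000_000)
    input_cost = INPUT_YEN_PER_MTOK * prompt_tokens / million
    output_cost = OUTPUT_YEN_PER_MTOK * completion_tokens / million
    return input_cost + output_cost


# Example: 1,000 prompt tokens and 2,000 completion tokens.
print(estimate_cost_yen(1_000, 2_000))  # prints 0.56
```

Decimal is used here only to avoid floating-point noise in very small yen amounts; plain floats would work for rough estimates.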

Positioning Against Same-Price-Range Models

The same-price-range models listed in the official blog (Qwen3.6-27B / gpt-oss-120b / GPT-5.4 Mini / Claude Haiku 4.5) are mostly overseas players. PFN's own claim is that it "matches or exceeds the same price range in instruction following, dialogue, tool use, and the medical domain" and that "in HELM Safety it is on par with or better than overseas competitors," with specific numbers displayed as graphs in the article.

Since I didn't re-run the benchmarks myself, I chose not to reproduce the figures here. This article is scoped to simply calling the API with the OpenAI-compatible SDK and observing how Reasoning mode behaves. If you want objective numerical comparisons, please refer to the official blog directly.

In the context of domestically built, from-scratch LLMs, I thought the claims of top rankings on Japanese benchmarks in medicine (National Medical Licensing Examination / MedRECT) and law (lawqa_jp) were an angle that overseas competitors don't have.

Getting Started: From Console Contract to API Key Issuance

Accessing the PLaMo Platform Console (console.platform.preferredai.jp) brings up a sign-up screen. Proceeding with the Standard plan contract takes you to a screen for issuing an API key. Since the full key is only visible immediately after issuance, I copied it on the spot and wrote it into my local .envrc.

The Playground can be opened from plamo.preferredai.jp/chat. You can toggle Reasoning ON / OFF in the chat UI, so checking the behavior there before calling the API makes subsequent verification smoother.

The verification from here on assumes that PLAMO_API_KEY is available as an environment variable.

Calling with the OpenAI-Compatible API

The PLaMo Platform exposes an OpenAI-compatible API, so you can call it from Python's openai package simply by swapping base_url. To observe its behavior against the live API, I wrote a small script that sends one light question and one light reasoning task, each with Reasoning OFF and Reasoning ON.

import os

from openai import OpenAI

# Point the OpenAI SDK at the PLaMo Platform endpoint.
client = OpenAI(
    base_url="https://api.platform.preferredai.jp/v1",
    api_key=os.environ["PLAMO_API_KEY"],
)

# Reasoning OFF (default)
resp = client.chat.completions.create(
    model="plamo-3.0-prime",
    messages=[{"role": "user", "content": "..."}],
)

# Reasoning ON (deep thinking); reasoning_effort is passed via extra_body
resp = client.chat.completions.create(
    model="plamo-3.0-prime",
    messages=[{"role": "user", "content": "..."}],
    extra_body={"reasoning_effort": "medium"},
)

I used the following two prompts:

  • light_question: "Please introduce the model PLaMo 3.0 Prime to readers in approximately 50 characters."
  • light_reasoning: "You divide 3 apples and 4 oranges between 2 people. Please answer how many each person gets and propose a fair way to divide them in 100 characters or less."

Regarding values for reasoning_effort, the official documentation only lists two values: none and medium. Out of curiosity, I tried high and low as well, following the common low / medium / high convention — both came back with HTTP 422.

{
  "detail": [
    {
      "type": "literal_error",
      "loc": ["body", "reasoning_effort"],
      "msg": "Input should be 'none' or 'medium'",
      "ctx": { "expected": "'none' or 'medium'" },
      "input": "high"
    }
  ]
}

If you're used to OpenAI or Anthropic using three levels of low / medium / high, you'll instinctively reach for high, but PLaMo strictly validates only two values: none and medium. Both low and high are rejected with the same error message. Because the error itself spells out the valid values, there's actually little room for confusion.
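
Since the API rejects anything other than these two values, a small client-side guard can surface the mistake before a request is sent. The following is only a sketch: the helper name build_extra_body and the ALLOWED_REASONING_EFFORT constant are my own, and the allowed values are taken from the 422 error message above.

```python
# Values accepted by the API according to the 422 error message above.
ALLOWED_REASONING_EFFORT = ("none", "medium")


def build_extra_body(reasoning_effort: str) -> dict:
    """Return the extra_body dict, rejecting values the API would answer with HTTP 422."""
    if reasoning_effort not in ALLOWED_REASONING_EFFORT:
        raise ValueError(
            f"reasoning_effort must be one of {ALLOWED_REASONING_EFFORT}, "
            f"got {reasoning_effort!r}"
        )
    return {"reasoning_effort": reasoning_effort}


print(build_extra_body("medium"))
```

The result can be passed as extra_body=build_extra_body("medium") to client.chat.completions.create. If PFN adds more levels later, only the tuple needs updating.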

Here are the results from running both Reasoning OFF and ON against the live API:

variant            reasoning_effort   prompt            latency (s)   completion_tokens
reasoning_off      none               light_question    3.77          60
reasoning_medium   medium             light_question    22.69         754
reasoning_off      none               light_reasoning   2.49          42
reasoning_medium   medium             light_reasoning   43.11         1,454

For light_question, completion_tokens ballooned from 60 → 754, roughly 12.6x. For light_reasoning, it went from 42 → 1,454, roughly 35x. In other words, with Reasoning ON, far more tokens are consumed internally than appear in the visible response text.

Latency also scaled to 6–17x with Reasoning ON. The deep thinking portion naturally becomes waiting time.
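
The multipliers quoted above can be recomputed directly from the table. This is just arithmetic on the measured values; the dictionary layout and the multipliers helper are my own.

```python
# (latency in seconds, completion_tokens) copied from the results table above.
results = {
    "light_question": {"none": (3.77, 60), "medium": (22.69, 754)},
    "light_reasoning": {"none": (2.49, 42), "medium": (43.11, 1454)},
}


def multipliers(off: tuple, on: tuple) -> tuple:
    """Return (latency multiplier, token multiplier), each rounded to one decimal."""
    return round(on[0] / off[0], 1), round(on[1] / off[1], 1)


for prompt, runs in results.items():
    latency_x, tokens_x = multipliers(runs["none"], runs["medium"])
    print(f"{prompt}: latency x{latency_x}, tokens x{tokens_x}")
```

Note that these come from a single run per condition, so they are better read as rough orders of magnitude than as stable ratios.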

What caught my attention here was the structure of the usage field. In OpenAI's Chat Completions API, you can see how many tokens were spent on deep thinking through usage.completion_tokens_details.reasoning_tokens (the Responses API exposes the equivalent under usage.output_tokens_details), but in PLaMo's case, completion_tokens_details comes back as null:

{
  "completion_tokens": 754,
  "prompt_tokens": 33,
  "total_tokens": 787,
  "completion_tokens_details": null,
  "prompt_tokens_details": null
}

The deep thinking portion is lumped together inside completion_tokens with no breakdown visible. Since what's charged under the Standard plan is the full output at ¥250 / 1M tok, it's worth being aware that output quota consumption accelerates with Reasoning ON.
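
If you share client code between providers, it helps to read the reasoning breakdown defensively so that a null completion_tokens_details does not raise. A minimal sketch, assuming the usage object has already been converted to a plain dict (with the openai SDK, resp.usage.model_dump() should give one); the helper name reasoning_tokens is my own.

```python
from typing import Optional


def reasoning_tokens(usage: dict) -> Optional[int]:
    """Return the reasoning token count if the API reports it, else None."""
    # completion_tokens_details may be missing or null, so fall back to {}.
    details = usage.get("completion_tokens_details") or {}
    return details.get("reasoning_tokens")


# The usage payload shown above, as a Python dict.
plamo_usage = {
    "completion_tokens": 754,
    "prompt_tokens": 33,
    "total_tokens": 787,
    "completion_tokens_details": None,
    "prompt_tokens_details": None,
}

print(reasoning_tokens(plamo_usage))  # None, so budget on completion_tokens instead
```

When the helper returns None, the only number available for cost tracking is the completion_tokens aggregate.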

I'll also touch on the content of the responses. For light_question, Reasoning OFF returned about 135 characters, exceeding the ~50 character target. Reasoning ON, by contrast, returned 38 characters: "PLaMo 3.0 Prime は、最新の大規模言語モデルで、高度な自然言語処理を実現します。" ("PLaMo 3.0 Prime is a state-of-the-art large language model that delivers advanced natural language processing."), which respects the character constraint. It was good to see instruction adherence improve as a result of deep thinking.

For light_reasoning, Reasoning OFF dutifully listed the counts per element: "1.5 apples per person, 2 oranges, 3.5 total," while Reasoning ON said "7 total, 3.5 each. Fair proposal: give A 2 apples + 1 orange, give B 1 apple + 3 oranges, and split or alternate the remaining 1" — stating the total first, then adding a thoughtful twist to the fair division proposal. You can sense that the response style itself changes.

Diving Deeper with Deep Thinking Mode

With just light questions, the effect of Reasoning ON is hard to see clearly, so I pushed further with code generation. The task was to write "a Python script that evolves the string HELLO PLAMO using an evolutionary algorithm (genetic algorithm)." I sent the same prompt once each with Reasoning OFF and ON, then compared the generated code side by side.

variant            reasoning_effort   latency (s)   completion_tokens
reasoning_off      none               26.54         727
reasoning_medium   medium             90.84         3,243

Latency increased about 3.4x, and output tokens about 4.5x. What caught my attention more was the structure of the generated code.

Reasoning OFF returns something compact that roughly works. It uses roulette-wheel (fitness-proportional) parent selection, single-point crossover, a mutation rate of 0.05, and elitism. The explanatory text is also brief and easy to read at first glance.

However, on closer inspection there's a naming collision inside evolve() between the function population and a local variable of the same name that receives the function's return value. Under Python's name resolution rules, a single assignment such as population = ... anywhere inside a function makes population a local variable throughout that function. When Python evaluates population(POPULATION_SIZE, target_len) on the right-hand side, the local name has not been assigned yet, so running the script immediately raises UnboundLocalError. The script therefore fails before the first generation, which is exactly the kind of thing you'd want to flag in a code review.

def population(size, length):
    return [random_individual(length) for _ in range(size)]

def evolve():
    target_len = len(TARGET)
    population = population(POPULATION_SIZE, target_len)  # <- function gets shadowed by variable here
    ...
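
The failure can be reproduced in isolation. The sketch below is a stripped-down stand-in, not the generated script: the helper bodies are placeholders, and evolve_buggy / evolve_fixed are names I chose to contrast the collision with one possible fix (renaming the local variable).

```python
def population(size):
    """Stand-in for the generated helper: build a list of `size` placeholders."""
    return ["?"] * size


def evolve_buggy():
    # Assigning to `population` makes the name local to this function, so the
    # call on the right-hand side reads a local that has no value yet.
    population = population(3)
    return population


def evolve_fixed():
    # A different local name leaves the module-level function reachable.
    individuals = population(3)
    return individuals


try:
    evolve_buggy()
except UnboundLocalError as exc:
    print("buggy:", type(exc).__name__)

print("fixed:", evolve_fixed())
```

Renaming the function instead (for example to a verb such as create_population) would work equally well; the point is only that the two names must differ.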

Reasoning ON reorganizes the structure one step more carefully. Instead of roulette, it uses tournament selection (sampling with k=2 and taking the higher fitness), single-point crossover returns two children, each function gets a Japanese docstring, and progress is logged every 20 generations. The population function/variable naming collision from before doesn't occur.

def create_individual():
    """Generate a random individual (string)"""
    return ''.join(random.choice(alphabet) for _ in range(len(target)))

def tournament_selection(pop, fit, k=2):
    """Select 2 parents using tournament method"""
    selected = random.sample(list(zip(pop, fit)), k)
    selected.sort(key=lambda x: x[1], reverse=True)
    return selected[0][0], selected[1][0]

Paying the deep thinking cost doesn't transform the code into something entirely different. It looks more like the model picking the safer option among structural alternatives.
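
For readers unfamiliar with the two parent-selection schemes mentioned above, here is a generic textbook-style sketch of each. This is not the code PLaMo generated; the function signatures, the rng parameter, and the small epsilon added to the weights are my own choices for illustration.

```python
import random


def roulette_selection(pop, fit, rng=random):
    """Fitness-proportional selection: higher fitness means a higher pick chance."""
    # A small epsilon keeps the weights valid when every fitness is zero.
    weights = [f + 1e-9 for f in fit]
    return rng.choices(pop, weights=weights, k=1)[0]


def tournament_selection(pop, fit, k=2, rng=random):
    """Sample k distinct individuals at random and return the fittest of them."""
    contenders = rng.sample(range(len(pop)), k)
    winner = max(contenders, key=lambda i: fit[i])
    return pop[winner]


candidates = ["HELLO PLAMX", "XXXXX XXXXX", "HELLO XXXXX"]
fitness = [10, 1, 6]  # e.g. number of characters matching the target
print(roulette_selection(candidates, fitness))
print(tournament_selection(candidates, fitness))
```

Tournament selection only compares fitness values within a small sample, so it is less sensitive to the scale of the fitness function than roulette selection, which is one reason it is often considered the more robust default.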

I also did a quick test of Tool calling. I defined a weather retrieval mock tool called get_current_weather and sent "Please tell me the current weather in Tokyo in Celsius" with both Reasoning OFF and ON.

variant         reasoning_effort   latency (s)   finish_reason   arguments
reasoning_off   none               3.93          tool_calls      {"city": "Tokyo", "unit": "celsius"}
reasoning_on    medium             5.72          tool_calls      {"city": "Tokyo", "unit": "celsius"}

Both OFF and ON correctly came back with finish_reason: tool_calls and called the tool with identical arguments: {"city": "Tokyo", "unit": "celsius"}. The takeaway from the live API is that the OpenAI-compatible tools / tool_choice parameters work as-is. With Reasoning ON, completion_tokens was about 2.2x larger (58 → 128), so it seems deep thinking runs even for a single tool call.
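
For reference, a tool definition in the OpenAI tools format and a matching dispatcher might look like the sketch below. Only the tool name get_current_weather and the city / unit argument names come from the results above. The JSON schema details, the mock return value, and the run_tool_call helper are assumptions for illustration.

```python
import json

# Hypothetical schema for the mock tool; only the tool name and the
# "city" / "unit" argument names come from the results above.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]


def get_current_weather(city: str, unit: str = "celsius") -> dict:
    """Mock implementation returning a fixed reading instead of calling a weather service."""
    return {"city": city, "unit": unit, "temperature": 25}


def run_tool_call(name: str, arguments_json: str) -> dict:
    """Dispatch a tool call whose arguments arrive as a JSON string."""
    if name != "get_current_weather":
        raise ValueError(f"unknown tool: {name}")
    return get_current_weather(**json.loads(arguments_json))


print(run_tool_call("get_current_weather", '{"city": "Tokyo", "unit": "celsius"}'))
```

With the openai SDK, the list is passed as tools=tools to client.chat.completions.create, and the name and JSON argument string are read from resp.choices[0].message.tool_calls[0].function.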

Calling PLaMo from Claude Code

Now that it's clear PLaMo works straightforwardly as an OpenAI-compatible API, let's also try it as the model behind a coding agent. The tool for this is claude-code-router (CCR).

CCR is an OpenAI-compatible router for Claude Code that receives Anthropic-format requests, internally converts them to OpenAI format, and forwards them to any provider. You just add PLaMo to the Providers array in ~/.claude-code-router/config.json and you're done.

{
  "Providers": [
    {
      "name": "plamo",
      "api_base_url": "https://api.platform.preferredai.jp/v1/chat/completions",
      "api_key": "${PLAMO_API_KEY}",
      "models": ["plamo-3.0-prime"]
    }
  ]
}

It worked without a transformer section. It seems CCR auto-detects OpenAI-compatible APIs. If other OpenAI-compatible providers like Sakana are already coexisting, just add this alongside them.
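
If you already have other providers configured, the merge can be scripted instead of edited by hand. The sketch below works on an in-memory dict only. The add_provider helper and the placeholder "other" provider are hypothetical, and reading or writing ~/.claude-code-router/config.json is left out on purpose.

```python
import json


def add_provider(config: dict, provider: dict) -> dict:
    """Return a copy of config with provider added, replacing any same-named entry."""
    others = [p for p in config.get("Providers", []) if p.get("name") != provider["name"]]
    return {**config, "Providers": others + [provider]}


# The PLaMo provider entry from the config shown above.
plamo = {
    "name": "plamo",
    "api_base_url": "https://api.platform.preferredai.jp/v1/chat/completions",
    "api_key": "${PLAMO_API_KEY}",
    "models": ["plamo-3.0-prime"],
}

# Hypothetical existing config with one placeholder provider.
existing = {"Providers": [{"name": "other", "models": ["other-model"]}]}
print(json.dumps(add_provider(existing, plamo), indent=2))
```

Replacing an entry with the same name keeps the operation idempotent, so running the script twice does not register PLaMo twice.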

To apply the settings, run ccr restart. If the log shows plamo provider registered as follows, it's a success.

{"level":30,"msg":"plamo provider registered"}

A curl command is sufficient for connectivity verification. Calling the Anthropic-format /v1/messages that Claude Code normally uses will have CCR convert it to OpenAI format, send it to PLaMo, and return the PLaMo response converted back to Anthropic format.

curl -s -X POST http://127.0.0.1:3456/v1/messages \
  -H "Content-Type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "plamo,plamo-3.0-prime",
    "max_tokens": 200,
    "messages": [{"role": "user", "content": "PLaMo 3.0 Prime を 30 字以内で紹介してください。"}]
  }'

The following response is returned (excerpt):

{
  "id": "chatcmpl-...",
  "type": "message",
  "model": "plamo-3.0-prime",
  "role": "assistant",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 24,
    "output_tokens": 12,
    "cache_read_input_tokens": 0
  },
  "content": ["高性能AIアシスタント、PLaMo3.0Prime。"]
}

An 18-character response that properly respects the character constraint. The usage is returned by CCR in Anthropic format, so from Claude Code's perspective it looks nearly identical to a regular Claude model. If you start Claude Code with ccr code and fix the model with /model plamo,plamo-3.0-prime, you can use it directly for agent-driven tasks.
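
The same connectivity check can be scripted in Python with only the standard library. This is a sketch that mirrors the curl command above: the endpoint, headers, and payload are copied from it, while the build_request and send helper names and the 60-second timeout are my own. Actually sending the request requires CCR to be running locally.

```python
import json
import urllib.request

CCR_URL = "http://127.0.0.1:3456/v1/messages"  # same local endpoint as the curl example


def build_request(prompt: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build the Anthropic-format POST request without sending it."""
    payload = {
        "model": "plamo,plamo-3.0-prime",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        CCR_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body (requires CCR to be running)."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("PLaMo 3.0 Prime を 30 字以内で紹介してください。")
print(request.get_method(), request.full_url)
```

Calling send(request) returns the decoded response, from which fields such as stop_reason and usage can be inspected.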

Briefly touching on Codex CLI and Hermes Agent: for Hermes, creating a single profile and writing base_url and api_key directly in the provider: custom format was enough to connect to PLaMo. Codex CLI v0.141.0 supports only wire_api = "responses", so PLaMo, which offers only Chat Completions, cannot be connected directly. Details are summarized in the "Things I'm Curious About Right Now" section.

Things I'm Curious About Right Now

Here are some things I noticed while playing around, written out as bullet points.

  • The spec where usage.completion_tokens_details comes back as null. Since the breakdown of deep thinking tokens is not visible, cost estimation has to be done by looking at the completion_tokens aggregate. If a reasoning_tokens equivalent gets exposed in future API expansions, the billing transparency for Reasoning mode would improve significantly
  • Latency variance with Reasoning ON. This time it scaled straightforwardly with prompt weight: 22.69s for a light question, 43.11s for light reasoning, 90.84s for code generation. I'd like to separately check the effective latency when using the full 256K context
  • The fact that Console usage display is not real-time. During this verification I wanted to track "how many tokens have I consumed so far" on the spot, but the console numbers seem to update with a delay and weren't reflected immediately after the verification session. I'd like to see improvements to billing screen immediacy going forward
  • Reproducing the official benchmark claims. The medicine (National Medical Licensing Examination / MedRECT) and law (lawqa_jp) categories are topics I'd like to lightly verify on my own
  • Persistent operation via Hermes Agent. Since the connection itself worked with provider: custom + api_mode: chat_completions, running PLaMo as a persistent agent invoked via Slack or cron is a topic I'd like to write about in a separate article
  • Connection via Codex CLI. As of 0.141.0 it supports only wire_api: responses, so PLaMo, which offers only Chat Completions, couldn't connect directly. Whether to downgrade to the 0.140.x series or route through a LiteLLM proxy is something I'd like to pursue in a separate article

Closing

I tried out PLaMo 3.0 Prime on the Standard plan. In one day of exploration, I confirmed that it works straightforwardly as an OpenAI-compatible API, that instruction adherence and code quality change with Reasoning ON, and that Tool calling works properly. The Console experience was also smooth, and as a developer, it's reassuring to see a domestically built, from-scratch LLM come out into the world in such a "normally usable" state.

On the agent side, hooking it up to Claude Code via CCR and to a Hermes Agent profile each went smoothly with brief configuration. Codex CLI hit the 0.141.0 restriction to the responses API, so that one is left as a candidate for a follow-up. The cost profile of Reasoning mode and the effective latency with 256K context are also themes I'd like to dig into in future articles.

2026-06-22, when two domestic LLMs were released on the same day, may well be remembered as a good milestone for the community looking back.

