
Gemma 4 Gets a 12B: Testing Japanese Performance, Audio Input, and MTP on the DGX Spark
Introduction
Hello, I'm Morishige from Classmethod's Manufacturing Business Technology Department.
A new size "12B" has been added to Gemma 4. Until now, Gemma 4 came in small sizes (E2B / E4B) and large sizes (26B-A4B (MoE) / 31B (Dense)), leaving a gap right in the middle. The new 12B fills that gap as a mid-range model, and Google positions it as a "bridge between E4B and 26B MoE."
I've worked with Gemma 4 on the DGX Spark twice before. The first article benchmarked all sizes in Japanese and multimodal, and the second covered speeding up generation with MTP (Multi-Token Prediction). Since I already have a measuring stick on hand, I wanted to put the new 12B on the same playing field to see where it stands.
There are three noteworthy new features in the 12B: a 256K token context, native audio input (a first for mid-range models), and a built-in drafter for MTP. According to the official documentation, it can also run on just 16GB of memory, which is an attractive convenience. This time, I tested everything hands-on with actual DGX Spark hardware, covering Japanese text performance, the new audio input feature, and MTP acceleration.
To give you the conclusion upfront: the 12B's score on the Japanese commonsense test was nearly on par with E4B. Looking at text alone, the mid-range advantage isn't very apparent, but I think the highlights of the 12B are its ability to handle audio with a single model and the convenience of running on 16GB.
What Kind of Model Is Gemma 4 12B?
First, let me clarify where the 12B fits within the Gemma 4 family.
| Size | Type | Features |
|---|---|---|
| E2B (2.3B) | Dense | Smallest, edge-oriented |
| E4B (4.5B) | Dense | Small, practical for local use |
| 12B | Dense | New, mid-range, with audio input support |
| 26B-A4B | MoE (Active 3.8B) | Mid-to-large, efficiency-focused |
| 31B | Dense | Largest, highest accuracy |
What I personally found interesting about the 12B is that its architecture is encoder-free. Conventional multimodal models typically convert images or audio into vectors using dedicated encoders before passing them to the language model. Gemma 4 12B feeds both images and audio directly into the LLM body through lightweight embedding layers. On the image side, a 35M parameter embedding layer replaces what would have been 27 layers of an image encoder, projecting 48×48 pixel patches with a single matrix multiplication. On the audio side, the design is remarkably lean: 16kHz waveforms are simply sliced into 40ms frames (640 values each) and linearly transformed. Since images, audio, and text all share the same weights, fine-tuning can be done all at once with a single model.
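To make the audio path concrete, here is a minimal sketch of the framing step as described above: a 16kHz waveform cut into 40ms frames of 640 samples, each mapped to a vector by one linear transform. This is not the model's actual implementation. The zero-padding of the last frame, the tiny `EMBED_DIM`, and the random weight matrix are all assumptions made for illustration.

```python
import numpy as np

SAMPLE_RATE = 16_000                          # Hz, from the article
FRAME_MS = 40                                 # frame length in ms, from the article
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 640 samples per frame
EMBED_DIM = 8                                 # hypothetical; the real hidden size is far larger


def frame_waveform(wave: np.ndarray) -> np.ndarray:
    """Cut a mono waveform into rows of FRAME_LEN samples.

    The tail is zero-padded to a whole frame (an assumption; the article
    does not say how a partial final frame is handled).
    """
    pad = (-len(wave)) % FRAME_LEN
    padded = np.pad(wave, (0, pad))
    return padded.reshape(-1, FRAME_LEN)


def embed_frames(frames: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Project each frame to an embedding with a single matrix multiplication."""
    return frames @ weight


rng = np.random.default_rng(0)
wave = rng.standard_normal(SAMPLE_RATE).astype(np.float32)   # 1 second of noise
weight = rng.standard_normal((FRAME_LEN, EMBED_DIM)).astype(np.float32)

frames = frame_waveform(wave)
tokens = embed_frames(frames, weight)
print(frames.shape, tokens.shape)
```

One second of audio becomes 25 frames, so under this scheme the number of audio positions fed to the LLM grows at 25 per second of input.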
Here are the main official benchmarks:
| Benchmark | Score |
|---|---|
| MMLU Pro | 77.2 |
| AIME 2026 (no tools) | 77.5 |
| LiveCodeBench v6 | 72.0 |
| GPQA Diamond | 78.8 |
| MMMU Pro (image) | 69.1 |
| CoVoST (audio) | 38.5 |
| MRCR v2 (long context) | 43.4 |
Test Environment and Pitfalls
I immediately ran into trouble when trying to run the 12B. At the time I started testing, the stable releases I had on hand (official release versions of transformers and vLLM) stopped with this error:
model type `gemma4_unified` but Transformers does not recognize this architecture
The 12B's encoder-free architecture is registered as a new model type, gemma4_unified, which the stable versions at the time didn't recognize yet. So for my testing, I followed Google's instructions and installed transformers from the main branch on GitHub and vLLM from nightly.
For testing, I took a two-track approach: calling transformers directly to measure text, image, and audio accuracy, and using vLLM to measure MTP speed. After creating a new venv and installing transformers, I was able to load the 12B in bfloat16 via AutoModelForMultimodalLM. The pitfalls below are from the time of testing, but the causes and workarounds are still useful, so I'm leaving them here.
3 issues I encountered getting the 12B running with transformers
- torchvision is required: The 12B's image processing depends on `torchvision.transforms.v2`, so if it's not installed, you'll get `ModuleNotFoundError: No module named 'torchvision'`.
- Audio dataset decoding: When loading FLEURS for evaluation using `datasets`, you'll get `To support decoding audio data, please install 'torchcodec'`. Since torchcodec adds more ffmpeg-related dependencies, I worked around this by using `Audio(decode=False)` to get the raw bytes and reading them with soundfile.
- vLLM's flashinfer sampler does JIT builds: When I tried serving the 12B with vLLM nightly, it failed to start with `FileNotFoundError: 'ninja'`. The cause was flashinfer trying to compile sampling kernels on the fly and not finding the build tools on PATH. I avoided the JIT entirely with `VLLM_USE_FLASHINFER_SAMPLER=0` and added the venv and CUDA to PATH as a precaution, which got it running. This seems like something you'll keep running into when using nightly, so it's worth keeping in mind if you're trying MTP with vLLM.
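The third workaround can be sketched as a small helper that prepares the environment for a vLLM process. The variable name `VLLM_USE_FLASHINFER_SAMPLER` comes from the text above. The helper name `build_vllm_env` and the two directory paths are hypothetical placeholders, so substitute the locations on your own machine.

```python
import os


def build_vllm_env(venv_bin, cuda_bin, base_env=None):
    """Return a copy of the environment with the flashinfer sampler disabled
    and the venv and CUDA bin directories placed at the front of PATH."""
    env = dict(os.environ if base_env is None else base_env)
    # Skip flashinfer's JIT-compiled sampling kernels (avoids the 'ninja' lookup).
    env["VLLM_USE_FLASHINFER_SAMPLER"] = "0"
    old_path = env.get("PATH", "")
    parts = [venv_bin, cuda_bin] + ([old_path] if old_path else [])
    env["PATH"] = os.pathsep.join(parts)
    return env


# Hypothetical paths, for illustration only.
env = build_vllm_env("/path/to/venv/bin", "/usr/local/cuda/bin", {"PATH": "/usr/bin"})
print(env["VLLM_USE_FLASHINFER_SAMPLER"], env["PATH"])
```

The returned dict could then be passed as `env=` to `subprocess.run` when launching `vllm serve`. The serve arguments themselves are left out here because they depend on the nightly build in use.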
Text Is Nearly on Par with E4B — Japanese Commonsense Test
First up is text. I used the same JCommonsenseQA (Japanese commonsense reasoning, leemeng/jcommonsenseqa-v1.1) as in the first article, with the same conditions: 3-shot, same seed, 1,116 questions. Since the previous numbers used a different backend, I re-measured the comparison sizes (E2B / E4B / 31B) with transformers this time to put them on the same playing field as the 12B.
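As a rough illustration of what "3-shot, same seed" means in practice, the sketch below builds a few-shot prompt from a fixed-seed sample and scores predictions. It is not the harness used for the numbers in this article. The field names (`question`, `choice0` to `choice4`, `label`), the Japanese prompt wording, and the toy examples are assumptions.

```python
import random

# Assumed field names for a five-choice question record.
CHOICE_KEYS = ["choice0", "choice1", "choice2", "choice3", "choice4"]


def format_example(ex, with_answer):
    """Render one question; the answer slot is left empty for the target question."""
    lines = [f"質問: {ex['question']}"]
    for i, key in enumerate(CHOICE_KEYS):
        lines.append(f"{i}. {ex[key]}")
    lines.append("答え: " + (str(ex["label"]) if with_answer else ""))
    return "\n".join(lines)


def build_prompt(train_pool, target, n_shots=3, seed=0):
    """Pick n_shots solved examples with a fixed seed and append the target question."""
    shots = random.Random(seed).sample(train_pool, n_shots)
    blocks = [format_example(s, True) for s in shots]
    blocks.append(format_example(target, False))
    return "\n\n".join(blocks)


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(int(p == g) for p, g in zip(predictions, labels))
    return correct / len(labels)


# Toy records, invented for illustration.
pool = [
    {"question": f"質問{n}", "choice0": "あ", "choice1": "い", "choice2": "う",
     "choice3": "え", "choice4": "お", "label": n % 5}
    for n in range(4)
]
target = dict(pool[0], question="対象の質問")
prompt = build_prompt(pool, target, n_shots=3, seed=42)
print(prompt)
```

Fixing the seed keeps the three demonstration examples identical across model sizes, so differences in accuracy come from the model rather than from the prompt.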

| Size | JCQ Accuracy | Median per question |
|---|---|---|
| E2B | 87.7% | 0.08 sec |
| E4B | 94.1% | 0.15 sec |
| 12B | 94.6% | 0.32 sec |
| 31B | 97.7% | 4.84 sec |
The 12B scored 94.6%, barely different from E4B's 94.1%. It falls about 3 points short of 31B's 97.7%. The 26B-A4B I measured in my first article scored 96.4%, so the rough ranking is E4B ≈ 12B < 26B < 31B.
When you hear "bridge between E4B and 26B," you might expect a score somewhere in between, but this isn't a weakness of the 12B. It's more that JCommonsenseQA was already at 94% with E4B, which is close to a ceiling. On the Japanese commonsense reasoning benchmark, the benefits of going mid-range simply don't show up well in the scores.
In other words, the value of the 12B lies elsewhere, not in commonsense text tests. Let's look at images and audio next.
Images Test the True Potential at University Level — JMMMU
For images, I used JMMMU, a Japanese version of MMMU that collects university-level chart and diagram questions from 28 subject areas. The official MMMU Pro (image) score of 69.1 is for English, so I had the model answer 300 questions in Japanese.
The 12B achieved an accuracy of 45.7%. That might seem underwhelming at first glance, but JMMMU is a difficult benchmark where even humans struggle in some subject areas. The trends became very clear when looking at it by subject.
| Strong subjects | Accuracy | Weak subjects | Accuracy |
|---|---|---|---|
| Computer Science | 75% | Mechanical Engineering | 0% |
| Psychology | 75% | Architecture & Engineering | 0% |
| World History | 71% | Energy | 0% |
| Agriculture | 70% | Mathematics | 20% |
Text-heavy subjects were handled reliably, while engineering subjects that require precise reading of diagrams, circuits, and mechanics were almost entirely missed. This is where the limits of a mid-range model show plainly. For applications like analyzing manufacturing blueprints, it's best not to expect too much just yet.
Testing the Standout New Feature: Audio Input in Japanese
This is what I was most eager to try this time. The 12B is the first mid-range model to support audio input.
As I mentioned earlier, the mechanism doesn't use a dedicated audio encoder. Instead, 16kHz waveforms are sliced into 40ms frames and passed directly to the LLM. In transformers, you simply pass the audio array in the message, and it works just like working with images.
content = [
{"type": "audio", "audio": arr}, # numpy array at 16kHz
{"type": "text", "text": "この音声を日本語で正確に書き起こしてください。"},
]
I transcribed 100 Japanese audio clips from FLEURS and evaluated them using Character Error Rate (CER). The results were a median CER of 16.1% and a mean of 23.1%. The mean is higher than the median because a few difficult clips pull the average up, so the median of around 16% better represents typical quality.
The official model card also lists a FLEURS CER of 6.9%, but that's a multilingual average excluding Chinese, whereas my measurement was Japanese only. Japanese is a particularly challenging language for CER evaluation due to kanji and homophones, so it's natural for the score to be larger than the multilingual average. With that in mind, a median CER of 16% for Japanese alone is not bad.
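For readers unfamiliar with the metric, CER is the character-level edit distance between reference and transcription, divided by the reference length. Below is a minimal sketch in plain Python. The `normalize` step (dropping whitespace and the punctuation marks 、 and 。) is an assumption and may differ from the normalization behind the numbers above, and the `scores` list is invented purely to show how a single outlier separates mean from median.

```python
from statistics import mean, median


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    """Assumed normalization: remove whitespace and common Japanese punctuation."""
    return "".join(ch for ch in text if not ch.isspace() and ch not in "、。")


def cer(ref, hyp):
    """Character Error Rate of hyp against ref (ref must be non-empty after normalization)."""
    ref_n, hyp_n = normalize(ref), normalize(hyp)
    return edit_distance(ref_n, hyp_n) / len(ref_n)


# Invented per-clip scores: one hard clip drags the mean above the median.
scores = [0.05, 0.10, 0.15, 0.20, 0.90]
print(median(scores), mean(scores))
```

Because the error count is divided by the reference length, a single wrong kanji in a short sentence costs far more than the same slip in a long one, which is one reason per-clip scores spread out so much.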

Numbers alone can be hard to grasp, so here's one actual transcription example:
Reference: インターネットで 敵対的環境コース について検索すると おそらく現地企業の住所が出てくるでしょう
12B: インターネットで 適体的な環境公社 について検索すると、おそらく現地企業の住所が出てくるでしょう
Apart from "敵対的環境コース" being transcribed as "適体的な環境公社," the rest is nearly accurate. The way it confuses homophones looks more like a language model mistake than an acoustic model mistake, which is interesting to observe.
According to the official documentation, the 12B's audio capabilities go beyond simple transcription to include speaker diarization (distinguishing who spoke) and video understanding. I focused on Japanese transcription for this test, but the range of use cases seems broader than I expected.
For clean read-aloud audio like FLEURS, dedicated speech recognition models (such as Whisper) achieve single-digit CER for Japanese. On accuracy alone, the 12B takes a step back. But what's important here is the integration: being able to handle text, images, and audio with a single model. Use cases like directly summarizing transcribed content or passing both images and audio in a single query are all possible within one model.
MTP Speeds Up Generation by About 2.8x
Finally, let's look at speed. The 12B comes with a built-in drafter for MTP (google/gemma-4-12B-it-assistant). This means I can apply the same speculative decoding approach from my second article directly to the 12B.
Using vLLM nightly, I compared long-form generation between a baseline (no drafter) and MTP with a drafter (speculative tokens = 4) under the same conditions.

| Condition | Long-form generation speed |
|---|---|
| baseline | 7.7 tok/s |
| MTP (spec=4) | 21.5 tok/s |
For 256-token long-form generation, speed went from 7.7 tok/s to 21.5 tok/s, about 2.8x faster. The output itself is unchanged from the baseline, so the improvement comes without any quality loss.
One note: the baseline here was measured with the flashinfer sampler disabled to avoid the ninja issue mentioned earlier, so the absolute tok/s figures are on the conservative side. The multiplier also depends on how the baseline is measured, so these numbers can't be compared directly with the E4B figures from my second article (2.1x, baseline 18.5 tok/s) to say which model benefits more. Please treat this simply as confirmation that MTP works effectively on the 12B for long-form generation.
MTP works this well because the DGX Spark's memory bandwidth of 273GB/s is quite modest compared to datacenter GPUs. Generating one token at a time means re-reading the model weights from memory for every token, so this bandwidth becomes the bottleneck. With MTP, the drafter proposes several tokens ahead and the main model verifies them in a single batched pass, so each read of the weights can yield more than one token, which eases the bandwidth constraint. The narrower the bandwidth, as on the DGX Spark, the greater the benefit.
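The bandwidth argument can be checked with back-of-the-envelope arithmetic. The sketch below is a simplified model, not a measurement. It assumes "12B" means 12e9 parameters stored in bfloat16, that every generated token requires one full read of the weights, that each draft token is accepted independently with the same probability, and that the drafter itself costs nothing. The acceptance rates in the loop are illustrative values, not measured ones.

```python
BANDWIDTH_GB_S = 273.0    # DGX Spark memory bandwidth, from the article
PARAMS_BILLION = 12.0     # assumption: "12B" taken as 12e9 parameters
BYTES_PER_PARAM = 2       # bfloat16


def decode_ceiling_tok_s(bandwidth_gb_s, params_billion, bytes_per_param):
    """Upper bound on tokens/s if each token needs one full read of the weights."""
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb


def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens produced per verification pass of the main model,
    assuming each draft token is accepted independently with accept_rate."""
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, PARAMS_BILLION, BYTES_PER_PARAM)
print(f"bandwidth-bound ceiling without MTP: {ceiling:.1f} tok/s")

for rate in (0.5, 0.7, 0.8, 0.9):   # illustrative acceptance rates
    gain = expected_tokens_per_pass(rate, draft_len=4)
    print(f"accept_rate={rate:.1f}: about {gain:.2f} tokens per main-model pass")
```

Under these assumptions the ceiling without MTP is roughly 11.4 tok/s, which is consistent with the measured 7.7 tok/s baseline sitting below it. With 4 speculative tokens, the gain per pass ranges from 1x to 5x depending on the acceptance rate, and the measured 2.8x falls inside that range.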
The Position of a Mid-Range Model That Runs on 16GB
Based on the results so far, let me think about how to use the 12B.
On the Japanese commonsense text benchmark, the 12B was nearly tied with E4B. So there will be many situations where "E4B is sufficient for Japanese text processing alone." On the other hand, the 12B has a weapon that E4B and 26B lack: audio input in a mid-range package. If you want to run multimodal workloads spanning text, images, and audio with a single model, the 12B becomes a realistic option.
The convenience of running on 16GB is also worth noting. While 31B offers higher accuracy, my tests showed a median of 4.8 seconds per question, making it a heavy model to generate from as well. From the perspective of casually running multimodal tasks on a local machine, the 12B's size feels like just the right balance.
| Use case | Recommended |
|---|---|
| Japanese text, lightweight and fast | E4B |
| Handle audio with a single model | 12B |
| Prioritize accuracy above all | 31B / 26B-A4B |
Summary
On the Japanese commonsense test, the 12B was nearly on par with E4B, and the mid-range advantage was modest when looking at text alone. That said, getting to try native audio input in a mid-range model for the first time on actual hardware was the real payoff of this test. Japanese transcription came in at a median CER of 16%, landing in practical territory, and the ability to handle text, images, and audio with a single model is what sets the 12B apart. Apply MTP and generation gets about 2.8x faster, and combined with the convenience of running on 16GB, I think it's a well-balanced choice as an everyday multimodal model.
Reference Links
- Google Blog: Introducing Gemma 4 12B
- Google Developers Blog: Gemma 4 12B — The Developer Guide
- HuggingFace: google/gemma-4-12B-it
- HuggingFace: google/gemma-4-12B-it-assistant (MTP drafter)
- vLLM Recipes: Gemma 4 12B-it
- JCommonsenseQA v1.1
- JMMMU (Japanese version of MMMU)
- FLEURS
- Previous article: Benchmarking Gemma 4 on DGX Spark for Japanese and Multimodal Performance
- Previous article: Running Gemma 4 MTP on DGX Spark and Measuring Japanese Generation Speed Improvements
