It seems Ollama has added MLX support, so I investigated Apple MLX and ran some benchmarks on M2 Max.
Hey there! I'm Nishimura Yuji from the Operations Department!
Many people use Ollama to run LLMs locally. I read a blog post saying that Ollama has adopted Apple's machine learning framework MLX as its internal engine on Apple Silicon.
I was curious why they chose MLX over the previously used llama.cpp. Since Ollama is widely used in the local LLM community, I wanted to understand what's happening under the hood.
So this time, I looked into what Apple MLX is, why it runs fast on Apple Silicon, and what makes it different from the conventional GGUF (llama.cpp).
At the end, I also ran a simple benchmark on my MacBook Pro (M2 Max / 32GB) to measure speed and memory for both GGUF and MLX.
What This Article Covers and What It Doesn't
- Scope of investigation: The Apple MLX framework itself (design philosophy & ecosystem) and Ollama's adoption of MLX
- What I verified: MLX official repository/documentation, Ollama's official blog and release notes, differences in quantization and philosophy compared to GGUF (llama.cpp), and actual benchmarks on M2 Max with gemma4 12B
- What I did not verify: Running fine-tuning with MLX (inference only was benchmarked), Intel Mac or x86 environments, benchmarks on M3 or newer chips, cross-comparisons with mlx-lm standalone or LM Studio
What Ollama Changed
First, let me organize the "Ollama changes" that prompted this investigation in chronological order. Getting this right makes the MLX discussion in the second half easier to follow.
Ollama adopted MLX as its engine on Apple Silicon not in v0.30, but earlier — in the v0.19 preview. The Ollama official blog "Ollama is now powered by MLX on Apple Silicon (in preview)" states:
Ollama on Apple silicon is now built on top of Apple's machine learning framework, MLX, to take advantage of its unified memory architecture.
In other words, this isn't just an optional addition: the inference engine's foundation on Apple Silicon has been swapped out.
The subsequent v0.30.0 (2026-05-13) release notes were not about newly adding MLX, but rather about supplementing MLX.
Ollama 0.30 is now available, with improved compatibility and performance using llama.cpp. This augments the MLX engine on Apple Silicon, bringing support to a wider range of hardware.
What we can read from this is that Ollama did not "fully migrate to MLX and abandon llama.cpp." It uses the MLX engine on Apple Silicon while also integrating llama.cpp. This creates a dual-stack configuration that supports a wider range of hardware and models (Hugging Face GGUFs, custom fine-tuned models).
Indeed, tracking the point releases in the v0.30 series, MLX stabilization fixes have been continuously merged. Even at the latest v0.30.10, improvements such as MLX embedding layer enhancements and expanded model support are ongoing.
The officially claimed benefits of adopting MLX are mainly these three:
- Speed: Improved inference speed on Apple Silicon. Especially on M5-series chips, it uses the new GPU Neural Accelerators to improve both time-to-first-token (TTFT) and generation speed (tokens/sec)
- Quantization format: Leverages NVFP4, an NVIDIA-derived 4-bit quantization (a method to compress models), to reduce memory usage and file size while maintaining accuracy (see the Ollama performance improvement blog for details)
- Memory/cache: Reuses cache across conversations to reduce memory usage
At this point, the terms "prefill" and "decode" come up. Prefill is the stage of reading the input text (prompt), and decode is the stage of generating the answer one token (a unit like a word fragment) at a time. Going forward, we'll look at speed broken down into these two parts.
In the official blog benchmark (M5, Qwen3.5-35B-A3B), decode improved from 58→112 tokens/sec and prefill from 1,154→1,810 tokens/sec. However, this is a comparison between the old implementation (0.18 / Q4_K_M) and the new implementation (0.19 / NVFP4), and the quantization formats are not aligned. It cannot be read as simply "2x faster across generations." In the hands-on benchmarks later, I'll try to align these conditions.
What Is Apple MLX?
Let me summarize what MLX — which Ollama chose as its foundation — actually is.
MLX is an array framework (a library that serves as the foundation for numerical computation) for Apple Silicon, developed by Apple's machine learning research team (ml-explore). The README definition is simple:
MLX is an array framework for machine learning on Apple silicon, brought to you by Apple machine learning research.
Released in December 2023 under the MIT license, it provides Python / C++ / C / Swift APIs, and its design is heavily influenced by NumPy, PyTorch, and JAX. Its positioning is not as an "inference-only tool," but as a general-purpose machine learning framework aimed at training, fine-tuning, and research.
Among its design features, the following four are the key ones to understand from the perspective of running local LLMs.
Unified Memory Model
The biggest difference between MLX and other frameworks is this unified memory model.
A notable difference from MLX and other frameworks is the unified memory model. Arrays in MLX live in shared memory.
On a typical PC with a discrete GPU, you need to copy data (model weights) to the GPU's dedicated memory before computation, and then write it back after computation. On Apple Silicon, however, the CPU and GPU share a single memory pool. MLX data lives in this shared memory from the start, so there's no need to "move" it by copying between CPU and GPU. As the official documentation puts it:
rather than moving arrays to devices, you specify the device when you run the operation.
The idea is not to "move data to a device," but to "specify where (CPU or GPU) each operation runs." Since LLM inference tends to be bottlenecked by memory bandwidth, this design — which eliminates copies — is well-suited for LLM inference.
Lazy Evaluation
Computations in MLX are lazy. Arrays are only materialized when needed.
MLX uses lazy evaluation. Even when you write out a computation, it doesn't execute immediately — it just remembers the steps. The actual computation only runs when the result is needed (such as when eval() is called, or when an array is printed or converted to NumPy). Computations that don't feed into the output are skipped, which helps reduce unnecessary memory consumption.
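To make the idea concrete, here is a toy model of lazy evaluation in plain Python. It is not MLX's implementation or API; the `Lazy` class and the `traced` helper are hypothetical names used only for illustration. It shows the two behaviors described above: nothing runs until a result is requested, and work that doesn't feed into the requested result never runs.

```python
# Toy model of lazy evaluation; this is NOT how MLX is implemented.
class Lazy:
    """A deferred computation that runs only when eval() is called."""

    def __init__(self, fn, *deps):
        self.fn = fn        # how to compute this node's value
        self.deps = deps    # upstream Lazy nodes this node depends on
        self.value = None
        self.done = False

    def eval(self):
        # Evaluate dependencies first, then this node, at most once.
        if not self.done:
            args = [d.eval() for d in self.deps]
            self.value = self.fn(*args)
            self.done = True
        return self.value


calls = []  # records which computations actually ran


def traced(name, fn):
    """Wrap fn so that each execution is recorded in `calls`."""
    def wrapper(*args):
        calls.append(name)
        return fn(*args)
    return wrapper


a = Lazy(traced("a", lambda: 2))
b = Lazy(traced("b", lambda x: x * 10), a)              # feeds into the output
unused = Lazy(traced("unused", lambda x: x ** 100), a)  # never requested

print(calls)     # nothing has run yet
print(b.eval())  # triggers a, then b
print(calls)     # "unused" never ran
```

In MLX itself the graph is built from array operations rather than explicit wrapper objects, but the evaluation order follows the same principle.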
Composable Function Transformations
MLX supports composable function transformations for automatic differentiation, automatic vectorization, and computation graph optimization.
It includes computational mechanisms needed for machine learning training, such as automatic differentiation and automatic vectorization, and these can be composed together (a design inherited from JAX). While you don't need to think about this directly for inference alone, the fact that it has a foundation capable of running training is a strength. Another notable feature is that even when input data shapes change, you're less likely to be held up by recompilation.
NumPy/PyTorch-like API
The core API closely resembles NumPy, while the higher-level mlx.nn and mlx.optimizers follow a PyTorch-like design. It's built to make it easy for existing PyTorch users to transition.
The MLX Ecosystem
Since MLX itself is an "array framework," you use packages built on top of it to run LLMs. Ollama can be seen as one such user.
| Package | Role |
|---|---|
| mlx-lm | Runs LLM text generation and fine-tuning on Apple Silicon. Supports memory-efficient training methods like LoRA/QLoRA, and also enables quantization and uploading to Hugging Face Hub |
| mlx-vlm | Inference and fine-tuning for Vision-Language Models (VLMs) (community-built) |
| MLX Swift | Swift bindings for embedding into native Apple apps |
| Hugging Face mlx-community | Distribution hub for pre-quantized MLX models (4-bit/8-bit, etc.). Loadable directly from mlx-lm |
As a recent development, a CUDA backend for Linux + NVIDIA GPU has been added (in preview since 2025). However, this is positioned as a way to develop and port MLX models in non-Apple environments — not as a direct performance competitor to the NVIDIA ecosystem.
Additionally, on M5-generation chips, Neural Accelerators integrated into GPU cores have been designed for MLX, showing that Apple is co-optimizing chip design and MLX together.
Differences from GGUF (llama.cpp)
One of the key questions about MLX is how it differs from the GGUF (llama.cpp) many people are already using. I'll organize this along two main axes.
1. Difference in philosophy
llama.cpp is an inference-focused engine — a mature project that runs broadly across Windows / macOS / Linux and CPU / various GPUs. It serves as the foundation for many tools including Ollama, LM Studio, and Jan. MLX, on the other hand, is a general-purpose machine learning framework with automatic differentiation and function transformations — Apple-native and tightly coupled to the Metal GPU + unified memory. The fundamental difference from inference-only llama.cpp is that MLX aims to complete the entire workflow — including training and fine-tuning — on a Mac.
2. Difference in quantization methods
First, quantization is the compression of a model's weights (parameters) using fewer bits, making file sizes and memory footprints smaller. Even within "4-bit quantization," the internals differ by method. GGUF's K-quant (such as Q4_K_M) varies the number of bits used even within a single layer — quantizing most weights at 4 bits while using 6 bits for parts that have a greater impact on quality. MLX's quantization, by contrast, applies a uniform bit-width within each layer. In general, at the same bit-width, GGUF's K-quant tends to have a slight quality advantage, while MLX tends to produce slightly smaller file sizes.
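As a rough illustration of what "a uniform bit-width" means, the sketch below quantizes one small group of weights to 4 bits using a per-group scale and offset, then restores them. This is a simplified pure-Python model of affine group quantization, not MLX's or llama.cpp's actual kernels or storage layout. The function names and sample weights are made up, and real implementations use larger groups than the 8 values shown here.

```python
def quantize_group(weights, bits=4):
    """Affine-quantize one group of floats to `bits`-bit integer codes."""
    levels = (1 << bits) - 1           # 15 steps for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Map integer codes back to approximate float weights."""
    return [c * scale + bias for c in codes]


# Made-up weights, for illustration only.
group = [-0.42, -0.10, 0.03, 0.27, 0.51, 0.08, -0.33, 0.19]
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(w - r) for w, r in zip(group, restored))

print(codes)
print(f"scale={scale:.4f}, max error={max_err:.4f}")
```

Every weight in the group gets the same number of bits, and the rounding error is bounded by half a quantization step. A mixed scheme like K-quant spends extra bits on selected tensors or blocks to shrink that error where it matters most.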
A newer addition to this picture is NVFP4 — an NVIDIA-derived 4-bit format that aims to reduce quality degradation from compression by capturing the range of weight values more precisely per location. As mentioned earlier, the large speedups in Ollama's official benchmark appeared in the cases that combined MLX with NVFP4. The questions of "is it faster because of MLX?" and "is it faster because of NVFP4?" need to be considered separately.
Verifying on M2 Max
Now I'll run the same gemma4 12B model using both GGUF and MLX on my MacBook Pro (M2 Max / 32GB) and compare. Since the official benchmark conditions were tilted toward M5 + NVFP4, I'll align the conditions as much as possible (except for quantization) on my M2 Max and see how much of the effect carries over.
Test Environment
- MacBook Pro / Apple M2 Max / 32GB unified memory / macOS 15.7.4
- Ollama 0.30.10
- Benchmark models (both gemma4 12B class)
  - GGUF: gemma4:12b-it-q4_K_M (Q4_K_M quantization, llama.cpp backend)
  - MLX: gemma4:12b-mlx (NVFP4 quantization, MLX engine)
Measurement Procedure
First, I fetched three models for comparison: one GGUF version and two MLX variants (-mlx and -nvfp4).
ollama pull gemma4:12b-it-q4_K_M # GGUF version
ollama pull gemma4:12b-mlx # MLX version
ollama pull gemma4:12b-nvfp4 # NVFP4 version (for comparison)
Speed was measured using Ollama's HTTP API. Sending a request to /api/generate with stream: false returns a response that includes a breakdown of processing time. I calculated the following two metrics from this (time is in nanoseconds, so divide by 1 billion to get seconds):
- Prefill: prompt_eval_count (number of input tokens) ÷ prompt_eval_duration (time spent reading input)
- Decode: eval_count (number of generated tokens) ÷ eval_duration (time spent generating)
A single request looks like this. Since gemma4 is a model that supports thinking, I added think: false to normalize generation volume.
curl -s http://127.0.0.1:11434/api/generate -d '{
"model": "gemma4:12b-mlx",
"prompt": "<benchmark prompt>",
"stream": false,
"think": false,
"options": { "num_predict": 300, "temperature": 0.0 }
}'
I ran this with two types of prompts so that prefill (reading input) and decode (generating output) speeds could be observed separately.
- Decode-focused: Send a short question and generate ~300 tokens via num_predict
- Prefill-focused: Feed in a ~7,000-token long text and keep generation minimal (num_predict: 8)
Two things I was careful about during measurement. First, I ran one warmup pass for each model before the actual benchmark to exclude model loading time. Second, for prefill, I prepended a unique string to the beginning of the prompt each time and took the median of 3 runs. If you send the same input repeatedly, the prompt cache kicks in and reading time drops to nearly zero, making it impossible to measure correctly. For decode, I took 3 measurements under the same conditions, and since the values were stable across all runs, I used the median.
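The procedure above (one warmup pass, a unique prefix per run, median of 3) can be sketched roughly as follows. This is not the exact script used for the article. `bench_prefill` and `fake_generate` are hypothetical names, the warmup prompt is an assumption, and the `generate` argument is assumed to be any callable that takes `(model, prompt, num_predict)` and returns `(prefill, decode)` in tokens/sec. A stand-in is used here so the sketch runs without an Ollama server.

```python
import statistics
import uuid


def bench_prefill(generate, model, long_text, runs=3):
    """Median prefill speed over `runs` requests with cache-busting prefixes."""
    generate(model, "warmup", 8)  # warmup pass so model loading is excluded
    speeds = []
    for _ in range(runs):
        # A unique prefix changes the start of the prompt, so a prefix-based
        # prompt cache cannot reuse the work from the previous run.
        prompt = f"[run {uuid.uuid4().hex}]\n{long_text}"
        prefill, _decode = generate(model, prompt, 8)
        speeds.append(prefill)
    return statistics.median(speeds)


# Stand-in for a real API call; returns made-up (prefill, decode) speeds.
def fake_generate(model, prompt, num_predict):
    fake_generate.prompts.append(prompt)
    return 100.0 + len(fake_generate.prompts), 14.0


fake_generate.prompts = []

median_speed = bench_prefill(fake_generate, "gemma4:12b-mlx", "lorem ipsum " * 10)
print(median_speed)
```

Passing the request function in as an argument keeps the measurement logic separate from the HTTP call, which makes it easy to dry-run the harness before pointing it at a real model.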
The repeated requests were organized into a Python script. The core part that calls the API and calculates speed from the timing breakdown looks like this:
import json
import urllib.request


def generate(model, prompt, num_predict):
    """Send one non-streaming request and return (prefill, decode) in tokens/sec."""
    body = {
        "model": model, "prompt": prompt, "stream": False, "think": False,
        "options": {"num_predict": num_predict, "temperature": 0.0},
    }
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=600) as r:
        d = json.load(r)
    # Durations are reported in nanoseconds, so divide by 1e9 to get seconds.
    prefill = d["prompt_eval_count"] / (d["prompt_eval_duration"] / 1e9)
    decode = d["eval_count"] / (d["eval_duration"] / 1e9)
    return prefill, decode
Note that in production use, prompt_eval_duration can return 0 on cache hits, causing a ZeroDivisionError. In this article, I'm inserting a unique string at the beginning of each prompt to disable caching, so this didn't occur — but if you generalize this script, it's safer to add a zero-division guard (e.g., return None if prompt_eval_duration is 0).
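A minimal sketch of such a guard might look like this. The helper names `tokens_per_second` and `rates_from_response` are hypothetical, and the sample response values are invented for illustration. Only the four field names come from the Ollama response described above.

```python
def tokens_per_second(count, duration_ns):
    """Return tokens/sec, or None when the duration is missing or zero."""
    if not duration_ns:  # covers both 0 and None (e.g. a full cache hit)
        return None
    return count / (duration_ns / 1e9)


def rates_from_response(d):
    """Compute (prefill, decode) speeds from a parsed response dict."""
    # .get() is used in case a field is absent from the response.
    prefill = tokens_per_second(d.get("prompt_eval_count", 0),
                                d.get("prompt_eval_duration"))
    decode = tokens_per_second(d.get("eval_count", 0),
                               d.get("eval_duration"))
    return prefill, decode


# Invented values: a cache hit on the prompt, 300 tokens generated in 20 s.
sample = {"prompt_eval_count": 7000, "prompt_eval_duration": 0,
          "eval_count": 300, "eval_duration": 20_000_000_000}
print(rates_from_response(sample))
```

Returning `None` rather than 0 or infinity keeps a cache hit distinguishable from a genuinely slow or fast run when the results are aggregated later.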
Finding 1: The -mlx Tag Is Actually NVFP4
After fetching the models and listing them with ollama list, I found that gemma4:12b-mlx and gemma4:12b-nvfp4 shared the same ID (the same underlying artifact).
gemma4:12b-mlx 197a75677efb 6.8 GB
gemma4:12b-nvfp4 197a75677efb 6.8 GB
gemma4:12b-it-q4_K_M 4eb23ef187e2 7.6 GB
Furthermore, running ollama show gemma4:12b-mlx showed that the quantization was nvfp4.
Model
architecture gemma4_unified
parameters 12.0B
quantization nvfp4
In other words, for this model, the "MLX version" and the "NVFP4 version" are identical in content, and the -mlx tag is actually an NVFP4-quantized model. The question raised earlier — "is it faster because of MLX, or because of NVFP4?" — cannot be disentangled when using Ollama's default MLX model. This is because NVFP4 is currently the format chosen for the -mlx models distributed by Ollama (while the MLX engine itself can handle other quantization formats, NVFP4 is currently the standard for Ollama's official tags).
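If you want to check for this kind of aliasing across many tags, the comparison of IDs can be automated. The sketch below parses text in the same three-column shape as the listing above and groups tag names by ID. `group_tags_by_id` is a hypothetical helper, and skipping the header line assumes the header row starts with NAME.

```python
from collections import defaultdict


def group_tags_by_id(listing):
    """Map each model ID to the list of tags that point at it."""
    groups = defaultdict(list)
    for line in listing.strip().splitlines():
        parts = line.split()
        # Skip blank lines and the header row (assumed to start with NAME).
        if len(parts) < 2 or parts[0] == "NAME":
            continue
        name, model_id = parts[0], parts[1]
        groups[model_id].append(name)
    return dict(groups)


# The listing shown in this article.
listing = """
gemma4:12b-mlx 197a75677efb 6.8 GB
gemma4:12b-nvfp4 197a75677efb 6.8 GB
gemma4:12b-it-q4_K_M 4eb23ef187e2 7.6 GB
"""

for model_id, names in group_tags_by_id(listing).items():
    if len(names) > 1:
        print(model_id, "is shared by", names)
```

Any ID that maps to more than one tag is a single artifact published under several names, so benchmarking those tags against each other would only measure noise.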
Finding 2: Memory and Default Context
Looking at ollama ps while both models were running, the MLX version used less memory and had a larger default context length.
NAME ID SIZE PROCESSOR CONTEXT
gemma4:12b-mlx 197a75677efb 6.9 GB 100% GPU 262144
gemma4:12b-it-q4_K_M 4eb23ef187e2 8.1 GB 100% GPU 65536
| Item | GGUF Q4_K_M | MLX NVFP4 |
|---|---|---|
| Memory usage (ollama ps) | 8.1 GB | 6.9 GB |
| Default context length | 65,536 | 262,144 |
However, the GGUF version (gemma4:12b-it-q4_K_M) is a multimodal build that includes vision (CLIP-based Projector) and audio (confirmed via ollama show with projector: clip), so some of that memory usage should be factored out. Even so, the fact that the MLX version stays lighter thanks to NVFP4 compression is a welcome difference when running large models on a 32GB machine.
Finding 3: Decode Is Nearly Equal, GGUF Is Slightly Faster on Prefill
Here are the speed measurement results. Ollama has a mechanism that remembers previously read input (prompt cache), so if you send the same input again, the reading step is skipped and you can't measure correctly. For prefill, I disabled the cache by prepending a different string each time, and used the median of 3 runs on ~7,000-token input. Decode was measured by generating ~300 tokens.
| Metric | GGUF Q4_K_M | MLX NVFP4 |
|---|---|---|
| Prefill (~7,000 token input, median) | ~156 tok/s (146–159, stable) | ~114 tok/s (112–117, stable) |
| Decode (300 token generation, median) | ~13.7 tok/s (13.56–13.75) | ~14.0 tok/s (13.70–14.10) |
Let me walk through the results.
Decode is nearly equal, with MLX only marginally faster — 14.0 tok/s vs. 13.7 tok/s, a difference of just over 2%. In decoding, a large number of weights are repeatedly read out for each generated token, making it heavily influenced by memory bandwidth. While NVFP4 has a smaller read volume per token than q4_K_M, the difference between the GGUF version (7.6 GB) and the MLX version (6.8 GB) isn't that large, and it didn't translate into a significant decode speed difference on the M2 Max. The Ollama performance improvement blog reports that "NVFP4 generates ~20% faster than q4_K_M," but that figure is for M5-series chips — the difference appears to shrink on M2-generation hardware.
Prefill is about 1.4x faster on GGUF. For long-input processing, GGUF was stable at ~156 tok/s while MLX came in at ~114 tok/s. What's notable is that MLX's prefill showed very little variation across the three runs, staying within 112–117 tok/s. MLX tends to be slower on first runs due to JIT (just-in-time) compilation and initial Metal kernel preparation, but since I loaded and ran the model once as a warmup before measuring, this first-run penalty is excluded from the numbers above.
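The ratios quoted in the two paragraphs above can be reproduced from the reported medians and file sizes. The size ratio is included as a rough yardstick only. Treating decode as limited purely by weight reads is a simplifying assumption of this sketch, and the GGUF file also bundles the multimodal projector, so the ratio overstates the real gap in bytes read per token.

```python
# Medians and file sizes reported in this article.
gguf = {"prefill": 156.0, "decode": 13.7, "size_gb": 7.6}
mlx = {"prefill": 114.0, "decode": 14.0, "size_gb": 6.8}

decode_gain = mlx["decode"] / gguf["decode"] - 1  # MLX relative to GGUF
prefill_ratio = gguf["prefill"] / mlx["prefill"]  # GGUF relative to MLX
size_ratio = gguf["size_gb"] / mlx["size_gb"]     # rough proxy for bytes read per token

print(f"decode:  MLX is {decode_gain:.1%} faster")
print(f"prefill: GGUF is {prefill_ratio:.2f}x faster")
print(f"weights: GGUF file is {size_ratio:.2f}x larger")
```

Under that simplified model, the smaller NVFP4 file could account for a decode gain of up to roughly 12%, while the measured gain was about 2%. That fits the reading that file size alone does not determine decode speed on this hardware.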
The Ollama performance improvement blog mentions Metal kernel fusion via MLX's JIT compiler (original text: "several operations are now fused into single Metal kernels via MLX's just-in-time compiler features"). MLX stabilization fixes have been continuously merged in the v0.30 point releases, and my impression from testing on Ollama 0.30.10 is that MLX prefill is closing the gap with GGUF, even though GGUF was still about 1.4x faster in my runs.
One thing to keep in mind: the generation speed (tok/s) that appears prominently in UIs and logs typically refers to eval_count / eval_duration, i.e., decode speed. For use cases involving long input, you should also look at prompt_eval_duration (prefill time) — otherwise, you might overlook the source of perceived slowness.
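To see where the time actually goes for a long-input request, the two durations can be compared directly. In this sketch `latency_breakdown` is a hypothetical helper and the sample timings are invented. They are chosen to resemble ~7,000 input tokens at ~114 tok/s and 8 output tokens at ~14 tok/s, and are not measured values.

```python
def latency_breakdown(d):
    """Split a response's processing time into prefill and decode seconds."""
    prefill_s = d["prompt_eval_duration"] / 1e9  # nanoseconds to seconds
    decode_s = d["eval_duration"] / 1e9
    total = prefill_s + decode_s
    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        # Share of processing time spent reading the prompt.
        "prefill_share": prefill_s / total if total else None,
    }


# Invented timings for a long-input, short-output request.
sample = {"prompt_eval_duration": 61_400_000_000, "eval_duration": 570_000_000}
print(latency_breakdown(sample))
```

In a case like this almost all of the wait is prefill, even though the decode tok/s figure shown in a UI would look perfectly healthy.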
As a note: the official "~2x decode" benchmark was a comparison of old Q4_K_M (0.18) vs. NVFP4 (0.19) on an M5 chip. When I aligned the conditions (except quantization) on M2 Max, the decode difference was essentially zero. The large numbers from the official benchmark likely include significant contributions from the M5's GPU Neural Accelerators. The takeaway from my hands-on testing is that the benefits of MLX depend heavily on chip generation.
How to Use MLX with Ollama
On Apple Silicon, all you need to do is specify an MLX-format model (one with a tag like -mlx or -nvfp4). No special flags or environment variables are required.
ollama run gemma4:12b-mlx
Standard models without these tags run on GGUF (llama.cpp backend). Whether a given model has an MLX version available can be checked by looking for -mlx / -nvfp4 in the tag list on the model page at ollama.com. The two prerequisites are:
- A Mac with Apple Silicon (MLX does not run on Intel Macs, and macOS 13.5 or later is required)
- An Ollama version with Apple Silicon support (v0.19 or later)
Summary
I came to understand that MLX is not an inference-only layer, but an entire machine learning framework designed together with Apple Silicon's unified memory. In the sense that it has a foundation capable of handling training and fine-tuning entirely on a Mac, it feels less like a successor to llama.cpp and more like a different layer with a different purpose.
Running the benchmarks on M2 Max with Ollama 0.30.10, the difference between MLX (NVFP4) and GGUF (Q4_K_M) was much smaller than I had expected: decode was nearly on par, and GGUF led on prefill by about 1.4x. Ollama continues to push MLX stabilization updates, and at this point my impression is that either choice works well.
That said, if you want to reduce memory usage or want a wider default context window, MLX is the better pick. If you care about the breadth of available models or need to target non-Apple environments, GGUF is the safer choice. If you have an M5-series chip, you might be able to unlock the kind of MLX performance gains shown in the official blog — but on M2 Max-era hardware, the two options feel very close in practice. That's my honest impression from this experiment.
I hope this is useful to someone.
References: