Testing Gemma 4 Local Execution on MacBook (16GB) — From Model Selection with Ollama to UI Building and Practical Project Implementation

How to run Google Gemma 4 locally using Ollama on a MacBook (16GB). After the 26B MoE model froze my machine, I explain, step by step, how to get smooth operation with the E4B model, build a chat UI with Open WebUI, and put it to work in a Text-to-SQL project.
2026.04.09

Introduction

Gemma 4, released by Google in 2026, is attracting attention as an open-weight LLM that can run locally. Having your own AI endpoint with zero API costs is appealing, but the first hurdle is "Will it run on my MacBook?"

In this article, I'll walk through the process of running Gemma 4 locally on a MacBook (Apple M5) with 16GB of memory. The first model I chose froze my screen, but I eventually reached the point where I could use it in an actual project.

SCR-20260409-ipue

Overview of the Gemma 4 Family

There are four variations of Gemma 4.

| Model | Parameters | Architecture | Disk Space (Q4) | Estimated Memory Required | Features |
|---|---|---|---|---|---|
| E2B | 2.3B | Dense | ~2 GB | ~3-4 GB | Ultra-light, for edge devices |
| E4B | 4.5B | Dense | ~3 GB | ~5-6 GB | Balanced, comfortable on most Macs |
| 26B (MoE) | 25.2B (Active: 3.8B) | Mixture of Experts | ~17 GB | ~18-19 GB | High quality but consumes lots of memory |
| 31B | 31B | Dense | ~20 GB | ~20+ GB | Highest quality, 32GB+ recommended |

The important point here is that disk space and required memory are different things: disk space is the storage the model files occupy, while required memory is what the GPU/CPU needs when the model is actually loaded and running inference.
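
As a back-of-the-envelope illustration of the difference (the constants below are rough assumptions, not measurements): at Q4 quantization the weights take roughly half a byte per parameter, and at inference time you additionally need room for the KV cache, activations, and the runtime itself.

# Rough estimate only. Assumptions: ~0.6 bytes per parameter at Q4 (including
# embeddings and file-format overhead), plus ~2 GB for KV cache, activations,
# and the runtime. Real numbers vary with context length and quantization.
def rough_memory_gb(params_billion: float, bytes_per_param: float = 0.6,
                    overhead_gb: float = 2.0) -> float:
    weights_gb = params_billion * bytes_per_param   # roughly what you see on disk
    return weights_gb + overhead_gb                 # roughly what inference needs

print(f"E4B (4.5B):      ~{rough_memory_gb(4.5):.1f} GB")   # ballpark of the ~5-6 GB above
print(f"26B MoE (25.2B): ~{rough_memory_gb(25.2):.1f} GB")  # all experts must be resident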

Step 1: Check MacBook Specifications

First, check the available memory on your machine.

sysctl hw.memsize
hw.memsize: 17179869184

It's a 16GB MacBook. On Apple Silicon, CPU and GPU share the same Unified Memory. There is no separate "VRAM" area; the GPU uses a portion of the physical memory.

When Ollama starts, its log shows how much memory is actually available for models.

inference compute  id=0  library=Metal  name=Metal  description="Apple M5"
  total="11.8 GiB"  available="11.8 GiB"

Out of 16GB, after excluding what the OS and other system processes use, about 11.8GB can be allocated to models.

Step 2: Install Ollama

Ollama is a runtime environment for local LLMs. You can install it with Homebrew.

brew install ollama

Start the server.

ollama serve

Use a separate terminal tab to pull and run models.

Step 3: The Story of Trying 26B MoE and Failing

Looking at the specifications alone, you might think the 26B MoE would be light since it only has 3.8B active parameters. A Mixture of Experts model doesn't use all of its parameters during inference; only some experts are activated for each token.

However, all of the model's parameters still need to be loaded into memory. Having fewer active parameters reduces the computational load, not the memory usage.

ollama pull gemma4:26b

SCR-20260406-iyic

After completing the 17GB download, when I tried to run it...

model weights  device=Metal  size="10.0 GiB"
model weights  device=CPU    size="7.3 GiB"    ← Overflowing to CPU because it doesn't fit in GPU
offloaded 20/31 layers to GPU                   ← Only 20 out of 31 layers fit on GPU
total memory  size="18.8 GiB"                   ← Required memory 18.8GB > Available 11.8GB

The screen completely froze. After 13 minutes of the machine being unresponsive, the request finally failed with an HTTP 500 error.

SCR-20260406-jdiw

The layers that didn't fit in GPU memory were offloaded to CPU memory, and the constant data transfer between GPU and CPU became a bottleneck, making the entire machine unresponsive.
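
Besides the server log, Ollama exposes a list of currently loaded models over its API, which shows how much of each model sits on the GPU. A minimal sketch; the field names size and size_vram are my reading of the API documentation, so treat them as assumptions and adjust for your version:

import requests

# GET /api/ps lists the models currently loaded by the Ollama server.
# Assumed fields: "size" (total bytes in memory) and "size_vram" (bytes on the GPU).
for m in requests.get("http://localhost:11434/api/ps").json().get("models", []):
    total = m.get("size", 0)
    on_gpu = m.get("size_vram", 0)
    print(f"{m.get('name')}: {on_gpu / 1e9:.1f} GB on GPU out of {total / 1e9:.1f} GB")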

Let's remove the model that's no longer needed to save disk space.

ollama rm gemma4:26b

Step 4: Comfortable Operation with E4B

Let's switch to E4B and try again.

ollama pull gemma4:e4b

After downloading, let's test it.

curl -s http://localhost:11434/api/generate \
  -d '{"model":"gemma4:e4b","prompt":"Say hello in one sentence.","stream":false}'
{
  "model": "gemma4:e4b",
  "response": "Hello!",
  "total_duration": 12755806500,
  "load_duration": 12189128000,
  "eval_count": 3,
  "eval_duration": 60049084
}

The first time takes about 12 seconds to load the model, but once loaded, it stays in memory. Let's check the response speed for subsequent requests.

# Test in chat format
curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "gemma4:e4b",
    "messages": [{"role": "user", "content": "日本の首都は?"}],
    "stream": false
  }'
| Test | Response Time |
|---|---|
| Short question ("What is Japan's capital?") | 0.5 seconds |
| Creative (write a haiku) | 0.9 seconds |
| Long answer (explain APIs in 2-3 sentences) | 13.5 seconds |

It operated comfortably without screen freezing. E4B fits in about 5-6GB of memory, so there's plenty of room within the 11.8GB constraint.
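
If you want to reproduce these timings, a small script against the same /api/generate endpoint works. This is a sketch, assuming the requests package is installed; keep_alive asks Ollama to keep the model loaded between calls (by default it unloads after a few minutes of inactivity).

import time
import requests

def timed_generate(prompt: str) -> float:
    """Send one non-streaming request and return the wall-clock time in seconds."""
    start = time.time()
    requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gemma4:e4b", "prompt": prompt, "stream": False,
              "keep_alive": "30m"},   # keep the model in memory between tests
        timeout=120,
    )
    return time.time() - start

for prompt in ["What is Japan's capital?", "Write a haiku",
               "Explain APIs in 2-3 sentences"]:
    print(f"{timed_generate(prompt):5.1f}s  {prompt}")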

Step 5: Build a Chat UI with Open WebUI

Ollama itself is an API server, so we need to set up a separate chat UI for browser use. Open WebUI is the most popular choice.

Preparing a Docker Runtime Environment

There are several options for using Docker on macOS.

| Tool | Price | Features |
|---|---|---|
| Docker Desktop | Free for personal use; paid for commercial use (250+ employees or $10M+ revenue) | Official GUI |
| OrbStack | Free for personal use; $8/month for commercial use ($10K+ revenue) | Lightweight, fast, optimized for Apple Silicon |
| Colima | Completely free (MIT License) | CLI-based, no license issues for commercial use |
| Podman Desktop | Completely free (Apache 2.0) | Developed by Red Hat, includes a GUI |

Both Docker Desktop and OrbStack require paid licenses for commercial use, so be careful if you're using them for business. For this article, I chose Colima, which can be used without licensing concerns.

brew install colima docker
colima start

SCR-20260409-jhma

colima start launches a Linux VM with Docker Engine available. From there, you can use the normal docker commands.

Starting Open WebUI

docker run -d --network=host \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  ghcr.io/open-webui/open-webui:main

With --network=host, the container shares the host machine's network directly, so it can reach Ollama. In this setup, Open WebUI listens on port 8080.

SCR-20260409-jhyd

Opening http://localhost:8080 in your browser displays a ChatGPT-like interface. Create an account on your first access.

SCR-20260409-invu

Configuring Ollama Connection

On first launch, you might see "No models available" in the model selection. This means Open WebUI can't find the Ollama endpoint.

SCR-20260409-ioun

  1. Click "Manage Connections" at the top of the screen
  2. Set the Ollama API URL to http://host.docker.internal:11434
  3. Once the connection is confirmed, gemma4:e4b will appear in the model list

SCR-20260409-ipmt

Once connected, you can interact with Gemma 4 through a ChatGPT-like interface.

SCR-20260409-ipue

Step 6: Example of Use in a Real Project — Text-to-SQL Agent

The true value of local LLMs becomes apparent when used in actual projects. Here's an example of using it for a Text-to-SQL agent that generates SQL from natural language in Japanese.

Project Overview

When asked a natural-language question about Japanese financial data (per-area P&L, sales by product category), this system generates the corresponding SQL, executes it, and returns the result.

User: "What's the cumulative sales for Kanto region in 2025?"

Gemma 4 (E4B) generates SQL

SELECT SUM(売上_実績) FROM area_pl WHERE エリア = '関東' AND 年度 = 2025

Result: 12,345 million yen

Integrating the Ollama Python Client

Calling Ollama from Python is very simple.

pip install ollama

import ollama

MODEL = "gemma4:e4b"

response = ollama.chat(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "関東の売上は?"},
    ],
)
sql = response.message.content.strip()

The ollama library connects to localhost:11434 by default, so no environment variables or URL settings are needed.
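
If Ollama is running somewhere other than localhost (another machine, or inside a container), you can create an explicit client instead of relying on the default. A minimal sketch; the host address below is just a placeholder:

import ollama

# Explicit client for a non-default endpoint (placeholder address)
client = ollama.Client(host="http://192.168.1.20:11434")

response = client.chat(
    model="gemma4:e4b",
    messages=[{"role": "user", "content": "関東の売上は?"}],
)
print(response.message.content)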

Including Schema Information in Prompts

To have the LLM generate accurate SQL, we dynamically incorporate table definitions and column names into the system prompt.

from functools import lru_cache

# schema_definitions: table and column definitions assembled elsewhere and injected into the prompt
@lru_cache
def build_system_prompt():
    return f"""You are an SQL expert.
Please generate only DuckDB-compatible SELECT statements based on the following table definitions in response to user questions.

{schema_definitions}

Rules:
- Generate only SELECT statements (INSERT, UPDATE, DELETE are prohibited)
- Use table and column names exactly as given (they are in Japanese)
"""

Why We Chose a Local LLM

| Aspect | Local (Ollama) | Cloud API |
|---|---|---|
| Cost | Free | Token-based billing |
| Latency | No network round-trip | API round-trip required |
| Privacy | Data never leaves the machine | Data must be sent externally |
| Quality | E4B class | Claude/GPT-4 class |
| Setup | Just start Ollama | API key management needed |

For the PoC phase, a local LLM that allows fast iteration at zero cost was the best fit. For production, the design allows switching to a cloud API (such as Claude), so migration only requires replacing the ollama.chat() call with the Anthropic SDK.
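
One way to keep that migration down to a single change is to hide the LLM call behind one small function. A sketch of the idea; the USE_CLAUDE flag, the generate_sql helper, and the Claude model name are illustrative, and the Anthropic path expects ANTHROPIC_API_KEY in the environment:

import os
import ollama
from anthropic import Anthropic

USE_CLAUDE = os.getenv("USE_CLAUDE") == "1"   # illustrative feature flag

def generate_sql(system_prompt: str, question: str) -> str:
    if USE_CLAUDE:
        client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        msg = client.messages.create(
            model="claude-sonnet-4-20250514",   # example model name
            max_tokens=512,
            system=system_prompt,
            messages=[{"role": "user", "content": question}],
        )
        return msg.content[0].text.strip()
    response = ollama.chat(
        model="gemma4:e4b",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.message.content.strip()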

Summary

Here's a summary of the process of running Gemma 4 locally on a MacBook (16GB).

  • About 70-75% of Apple Silicon's unified memory can be used for models (about 11.8GB out of 16GB)
  • MoE Active parameter count ≠ Required memory. 26B MoE requires 18.8GB and wouldn't run on a 16GB machine
  • Gemma 4 E4B is the sweet spot for 16GB machines. Fits in 5-6GB and responds quickly
  • Ollama + Open WebUI lets you build a chat UI without writing code
  • Easy integration into real projects via the ollama Python library

Local LLMs are ideal for PoCs and early development prototyping. They require no API key management or token billing, allowing quick iteration while maintaining privacy. Even with a 16GB MacBook, choosing the right model can create a sufficiently practical environment.
