Running Gemma 4 locally on MacBook (16GB) — From model selection to UI building and practical project implementation using Ollama

How to run Google Gemma 4 locally with Ollama on a MacBook (16GB). After the 26B MoE model froze my machine, I'll walk through, step by step, how I got smooth operation with the E4B model, built a chat UI with Open WebUI, and applied it to a real Text-to-SQL project.
2026.04.09

Introduction

Gemma 4, released by Google in 2026, is gaining attention as an open-weight LLM that can run locally. Having your own AI endpoint with zero API cost is attractive, but the first hurdle is "Will it run on my MacBook?"

In this article, I'll introduce the process of running Gemma 4 locally on a MacBook (Apple M5) with 16GB of memory. After experiencing a screen freeze with the first model I chose, I eventually reached a point where it could be utilized in real projects.

Overview of the Gemma 4 Family

Gemma 4 has four variations.

| Model | Parameters | Architecture | Disk (Q4) | Est. memory required | Features |
|---|---|---|---|---|---|
| E2B | 2.3B | Dense | ~2 GB | ~3-4 GB | Ultra-lightweight, for edge devices |
| E4B | 4.5B | Dense | ~3 GB | ~5-6 GB | Balanced, comfortable on most Macs |
| 26B (MoE) | 25.2B (active: 3.8B) | Mixture of Experts | ~17 GB | ~18-19 GB | High quality but heavy memory consumption |
| 31B | 31B | Dense | ~20 GB | ~20+ GB | Highest quality, 32GB+ recommended |

The important point here is that disk space and required memory are different things. Disk space is the storage needed for model files, while required memory is the RAM used by GPU/CPU when loading the model for inference.
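
As a rough sanity check before downloading, you can estimate weight size from the parameter count and quantization bit width. This is a back-of-the-envelope sketch of my own (real GGUF files mix precisions and the runtime adds KV cache and overhead, so actual figures run higher, as the table shows):

```python
def est_weights_gib(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Rough lower bound on quantized weight size in GiB: params * bits / 8."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# 4-bit lower bound for each model in the table above
for name, params in [("E2B", 2.3), ("E4B", 4.5), ("26B MoE", 25.2), ("31B", 31.0)]:
    print(f"{name}: >= {est_weights_gib(params):.1f} GiB of weights")
```

The estimate for the 26B MoE comes out around 11.7 GiB for the weights alone, already at the limit of what my machine can give the GPU, before any runtime overhead.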

Step 1: Check MacBook Specifications

First, check the available memory on your machine.

sysctl hw.memsize
hw.memsize: 17179869184

This is a 16GB MacBook. On Apple Silicon, CPU and GPU share the same unified memory. There is no separate "VRAM" area; the GPU uses part of the physical memory.

When launching Ollama, it displays the amount of memory actually available.

inference compute  id=0  library=Metal  name=Metal  description="Apple M5"
  total="11.8 GiB"  available="11.8 GiB"

I found that about 11.8GB can be allocated to the model out of 16GB, after subtracting what the OS and system use.
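
If you want this check in a script, the sysctl output is easy to parse. A small helper of my own (not part of Ollama) that converts the byte count to GiB and applies the roughly 74% usable fraction observed on my machine (11.8/16):

```python
def parse_memsize(sysctl_line: str) -> int:
    """Extract the byte count from `sysctl hw.memsize` output."""
    return int(sysctl_line.split(":")[1])

def model_budget_gib(total_bytes: int, usable_fraction: float = 0.74) -> float:
    """GiB the GPU can realistically claim; 0.74 matches the ratio Ollama reported above."""
    return total_bytes / 2**30 * usable_fraction

total = parse_memsize("hw.memsize: 17179869184")
print(f"{total / 2**30:.0f} GiB total, ~{model_budget_gib(total):.1f} GiB for models")
```

The exact fraction will vary with macOS version and what else is running, so treat it as a planning number, not a hard limit.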

Step 2: Install Ollama

Ollama is a runtime environment for local LLMs. You can install it with Homebrew.

brew install ollama

Start the server.

ollama serve

Pull or run models in a separate terminal tab.

Step 3: The Story of Trying the 26B MoE and Failing

Looking at the specifications alone, you might think the 26B MoE is light because "it only has 3.8B active parameters." A Mixture of Experts model doesn't use all of its parameters during inference; only some experts are activated for each token.

However, all of the parameters still have to be loaded into memory. The small active-parameter count reduces the compute per token, not the memory footprint.

ollama pull gemma4:26b

After completing the 17GB download, when I tried to run it...

model weights  device=Metal  size="10.0 GiB"
model weights  device=CPU    size="7.3 GiB"    ← Overflowing to CPU because it doesn't fit on GPU
offloaded 20/31 layers to GPU                   ← Only 20 of 31 layers fit on GPU
total memory  size="18.8 GiB"                   ← Required memory 18.8GB > Available 11.8GB

The screen froze completely. After about 13 minutes of an unresponsive machine, the request finally failed with an HTTP 500 error.

The layers that didn't fit in GPU memory were offloaded to CPU memory, and the constant GPU↔CPU data transfer became a bottleneck that made the entire machine unresponsive.

I deleted the unnecessary model to free up disk space.

ollama rm gemma4:26b

Step 4: Running Smoothly with E4B

Taking a fresh approach, I switched to E4B.

ollama pull gemma4:e4b

After downloading, I tested it.

curl -s http://localhost:11434/api/generate \
  -d '{"model":"gemma4:e4b","prompt":"Say hello in one sentence.","stream":false}'
{
  "model": "gemma4:e4b",
  "response": "Hello!",
  "total_duration": 12755806500,
  "load_duration": 12189128000,
  "eval_count": 3,
  "eval_duration": 60049084
}

The first time takes about 12 seconds to load the model, but once loaded, it stays in memory. Let's check the response speed for subsequent requests.
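
The durations in the response are nanoseconds, so generation speed can be derived directly from the payload. A tiny helper (my own naming) applied to the JSON above:

```python
def tokens_per_second(resp: dict) -> float:
    """Generation speed from an Ollama /api/generate response (durations are in ns)."""
    return resp["eval_count"] / resp["eval_duration"] * 1e9

resp = {
    "eval_count": 3,
    "eval_duration": 60049084,       # values from the response above
    "load_duration": 12189128000,
}
print(f"load: {resp['load_duration'] / 1e9:.1f}s, "
      f"speed: {tokens_per_second(resp):.1f} tok/s")
```

Three tokens is too small a sample to trust, but the same calculation on longer answers gives a stable tokens-per-second figure for comparing models.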

# Testing in chat format
curl -s http://localhost:11434/api/chat \
  -d '{
    "model": "gemma4:e4b",
    "messages": [{"role": "user", "content": "日本の首都は?"}],
    "stream": false
  }'

| Test | Response time |
|---|---|
| Short question ("What is the capital of Japan?") | 0.5 s |
| Creative task (write a haiku) | 0.9 s |
| Long answer (explain APIs in 2-3 sentences) | 13.5 s |

It ran smoothly with no screen freezes. E4B fits in about 5-6GB of memory, well within the 11.8GB constraint.

Step 5: Building a Chat UI with Open WebUI

Ollama itself is an API server, so we need a separate chat UI for browser use. Open WebUI is the most popular choice.

Preparing Docker Runtime Environment

There are several options for using Docker on macOS.

| Tool | Price | Features |
|---|---|---|
| Docker Desktop | Free for personal use; paid for commercial use (250+ employees or $10M+ revenue) | Official GUI |
| OrbStack | Free for personal use; $8/month for commercial use ($10K+ revenue) | Lightweight, fast, optimized for Apple Silicon |
| Colima | Completely free (MIT License) | CLI-based, no licensing issues for commercial use |
| Podman Desktop | Completely free (Apache 2.0) | Developed by Red Hat, has a GUI |

Both Docker Desktop and OrbStack incur license fees for commercial use. Be careful if using them for work. I chose Colima for this project because it can be used without licensing concerns.

brew install colima docker
colima start

colima start launches a Linux VM where Docker Engine can be used. From here, regular docker commands work as usual.

Starting Open WebUI

docker run -d --network=host \
  -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
  ghcr.io/open-webui/open-webui:main

Using --network=host lets the container share the host machine's network directly, giving it a reliable connection to Ollama. With this setup, Open WebUI listens on port 8080.

Open http://localhost:8080 in your browser to see a ChatGPT-like interface. Create an account on first access.

Configuring Ollama Connection

On first launch, you might see "No models available" in the model selection. This means Open WebUI can't find the Ollama endpoint.

1. Click "Manage Connections" at the top of the screen
2. Set the Ollama API URL to http://host.docker.internal:11434
3. Once the connection is confirmed, gemma4:e4b will appear in the model list

Once connected, you can interact with Gemma 4 through a ChatGPT-like interface.

Step 6: Real Project Example — Text-to-SQL Agent

The true value of local LLMs becomes apparent when they're used in actual projects. Here's an example of a Text-to-SQL agent that generates SQL from questions written in natural Japanese.

Project Overview

A system that takes natural-language questions about Japanese financial data (area-based P&L, product category sales), generates SQL, executes it, and returns the results.

User: "What are the cumulative sales for Kanto in 2025?"

Gemma 4 (E4B) generates SQL

SELECT SUM(売上_実績) FROM area_pl WHERE エリア = '関東' AND 年度 = 2025

Result: 12,345 million yen

Integrating the Ollama Python Client

Calling Ollama from Python is very simple.

pip install ollama

Then, from Python:

import ollama

MODEL = "gemma4:e4b"

response = ollama.chat(
    model=MODEL,
    messages=[
        {"role": "system", "content": system_prompt},  # schema-aware prompt, built below
        {"role": "user", "content": "関東の売上は?"},  # "What are the sales for Kanto?"
    ],
)
sql = response.message.content.strip()

The ollama library connects to localhost:11434 by default, so no environment variables or URL configuration is needed.

Including Schema Information in the Prompt

To help the LLM generate accurate SQL, table definitions and column names are dynamically incorporated into the system prompt.

from functools import lru_cache

@lru_cache
def build_system_prompt():
    # schema_definitions: table/column definitions assembled elsewhere in the project
    return f"""You are an SQL expert.
Generate only a DuckDB-compatible SELECT statement for the user's question, based on the table definitions below.

{schema_definitions}

Rules:
- Generate only SELECT statements (INSERT, UPDATE, DELETE are prohibited)
- Use table and column names as-is in Japanese
"""
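
The "SELECT only" rule in the prompt is a request, not a guarantee, so it's worth enforcing before executing anything. A minimal guard of my own (not from the project code) that rejects everything except a single SELECT statement:

```python
import re

# Write/DDL keywords that must never appear in generated SQL
FORBIDDEN = re.compile(r"\b(insert|update|delete|drop|alter|create|attach|copy)\b", re.I)

def is_safe_select(sql: str) -> bool:
    """Accept a single SELECT statement; reject write/DDL keywords and stacked queries."""
    stmt = sql.strip().rstrip(";")
    if ";" in stmt:  # stacked statements like "SELECT 1; DROP TABLE x"
        return False
    if not stmt.lower().startswith("select"):
        return False
    return not FORBIDDEN.search(stmt)

print(is_safe_select("SELECT SUM(売上_実績) FROM area_pl WHERE エリア = '関東'"))  # True
print(is_safe_select("DELETE FROM area_pl"))  # False
```

A keyword blocklist like this is a coarse filter; for production use, a read-only database connection is the more reliable line of defense.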

Why Choose a Local LLM

| Aspect | Local (Ollama) | Cloud API |
|---|---|---|
| Cost | Free | Token-based billing |
| Latency | No network needed | API round trip required |
| Privacy | Data stays internal | Transmission required |
| Quality | E4B class | Claude/GPT-4 class |
| Setup | Only Ollama startup | API key management needed |

For the PoC phase, a local LLM was optimal because it allowed zero-cost, fast iteration. The design allows switching to cloud APIs (like Claude) in production by simply replacing ollama.chat() with the Anthropic SDK.
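
One way to keep that swap cheap is to route all completions through a single function that takes the backend as a parameter. A sketch of that seam (the names are mine, not the project's), shown with a stub standing in for the real LLM call:

```python
from typing import Callable

Backend = Callable[[list[dict]], str]  # chat messages in, assistant text out

def generate_sql(question: str, system_prompt: str, backend: Backend) -> str:
    """Provider-agnostic entry point: only `backend` knows which LLM is used."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    return backend(messages).strip()

def ollama_backend(messages: list[dict]) -> str:
    # Local development path; requires `pip install ollama` and a running server
    import ollama
    return ollama.chat(model="gemma4:e4b", messages=messages).message.content

# In production, an Anthropic-based backend with the same signature replaces this.
fake_backend: Backend = lambda messages: "SELECT 1"  # stand-in for testing
print(generate_sql("関東の売上は?", "You are an SQL expert.", fake_backend))
```

Because the rest of the agent only ever calls generate_sql, switching providers touches one function instead of every call site.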

Conclusion

To summarize the process of running Gemma 4 locally on a MacBook (16GB):

  • About 70-75% of Apple Silicon's unified memory can be used for models (about 11.8GB for a 16GB machine)
  • For MoE models, active parameter count ≠ required memory. The 26B MoE needed 18.8GB and wouldn't run on a 16GB machine
  • Gemma 4 E4B is the sweet spot for 16GB machines. It fits in 5-6GB and responds quickly
  • Ollama + Open WebUI lets you build a chat UI without writing code
  • Real projects can easily integrate via the ollama Python library

Local LLMs are ideal for PoCs and early development prototyping. No API key management or token billing is needed, allowing quick iteration while maintaining privacy. With the right model, even a 16GB MacBook can provide a practical environment.
