How Far Can a Gaming PC Run Local LLMs? I Tried It with Ollama

How Far Can a Gaming PC Run Local LLMs? I Tried It with Ollama

I tried running a local LLM on my gaming PC. How well the model fits into VRAM greatly affects the speed, so the key is to reduce the model size through quantization.
2026.07.04

This page has been translated by machine translation. View original

Introduction

Hello everyone, I'm Akaike.
Recently, I thought it would be a good idea to get myself familiar with running local LLMs in preparation for what might be the upcoming "era of local LLMs" (though opinions vary on this).

So I started looking into machines to run them on, but it turns out that running local LLMs properly requires a fairly powerful GPU, and getting everything set up together easily costs over 1 million yen... That's not something I can just casually spend.

※ For example, in the case of the NVIDIA DGX Spark, the price may vary depending on where you buy it, but it looks like it would cost at least 800,000 yen or more at minimum
https://www.nvidia.com/ja-jp/products/workstations/dgx-spark/

That's when I suddenly remembered my gaming PC.
It has a decent GPU, so I thought maybe it could run local LLMs??
So I decided to try running local LLMs on my gaming PC.

In this blog, I've put together everything from the steps to run it with Ollama, to how to use local LLMs from VSCode / Claude Code, and how to choose a model that suits your hardware.

I hope this serves as a useful reference for those who want to try local LLMs on their own PC.

About the Environment

I'll share the environment I used for this verification.
Since the required specs for local LLMs vary greatly depending on the model you run, it would help to keep this in mind as background context when reading the article.

The key factor is GPU VRAM (video memory) capacity, which is 16 GB in my environment.
As I'll touch on later, this VRAM capacity greatly affects "how large a model can be run comfortably."

My Environment

  • OS: Windows 11 Pro
  • CPU: AMD Ryzen 5 5600X 6-Core Processor (3.70 GHz)
  • Memory: DDR4 128 GB
    • CRUCIAL CP32G4DFRA32A PC4-25600 32GB × 4
  • GPU: AMD Radeon RX 7800 XT (16 GB)
  • Storage: 5.5TB total

Side Note: Does it work with AMD GPU (Radeon)?

My GPU is the AMD Radeon RX 7800 XT, and
I had a slight worry about whether AMD would be fine, given the strong impression that "NVIDIA is the go-to for local LLMs."

To get straight to the point, Ollama officially supports AMD GPUs via a library called ROCm.
On the official GPU compatibility list, the 7800 XT is explicitly listed for both Linux and Windows.

https://docs.ollama.com/gpu

So let's trust this information for now and give it a try.

Using Ollama

Installation

First, let's install Ollama itself.
The Windows version can be downloaded as an installer from the official website.

https://ollama.com/download/windows

This time, I'll try running the official installation script from PowerShell.

irm https://ollama.com/install.ps1 | iex

Once installation is complete, let's verify the version with --version.
If the version is displayed, the installation is successful.

ollama --version
ollama version is 0.31.1

Installing a Model

Now that Ollama is installed, let's choose a model to run.
You can find available models in the official library.

https://ollama.com/search

For this example, I'll try using a model called ornith.

https://ollama.com/library/ornith

(No particular deep reason, but it looked good based on the following blog)

https://dev.classmethod.jp/articles/ornith-1-0-dgx-spark-japanese-benchmark/

Running ollama run will automatically download (pull) the model if you don't have it, and then launch the interactive mode right away.
Let's start by running the lightweight 9b version.

ollama run ornith:9b

When the Send a message prompt appears, the model is up and running.

pulling manifest
pulling af63361d2ac3: 100% ▕████████████████████████████████████████▏ 5.6 GB
pulling 151be780af0c: 100% ▕████████████████████████████████████████▏  236 B
pulling b1bca6ec8117: 100% ▕████████████████████████████████████████▏ 1.1 KB
pulling d74f445d3c6b: 100% ▕████████████████████████████████████████▏   76 B
pulling df2252e2f5bd: 100% ▕████████████████████████████████████████▏  548 B
verifying sha256 digest
writing manifest
success
>>> Send a message (/? for help)

Let's try chatting.

Q. Please tell me how to keep an Aldabra giant tortoise.
>>> アルダブラゾウガメの飼い方を教えて。
Thinking...
ユーザーはアルダブラゾウガメ(アフリカゾウガメ)の飼い方について質問しています。これは一般的なペッ
ト飼育に関する情報提供であり、特定のファイル操作やツール使用を必要とするものではありません。

ただし、私は agentic coding assistant として設計されています。この質問に対しては、直接的な回答を提
供するのが適切です。アルダブラゾウガメ(アフリカゾウガメ)の基本的な飼い方を説明します。
...done thinking.

# アフリカゾウガメ(アルダブラゾウガメ)の飼い方

## 基本情報
- **寿命**: 50〜80年(非常に長生き)
- **最大サイズ**: 成人で約1.2m、体重10kg以上
- **性格**: 穏やかだが、個体差あり

## 飼育環境の準備

### 必要なスペース
- 大人用として最小でも **3×4m** の広さが必要(子供用ならもう少し小さくても可)
- 外飼い(屋外)が最も一般的で推奨される方法

### ケージ/囲い方
- 金属製のフェンスまたはネットで作る
- ガラス窓を閉めると熱がこもって危険
- 風通しをよくする設計に

## 餌やり

### 主食
- **牧草**: タイムothy, アルファルファ(最も重要)
- **野菜**: カボチャ、ニンジン、セロリ(少量ずつ)
- **果物**: オレンジ、バナナ(たまに与える程度)

### 注意点
- 人工飼料は少量のみ
- 水はいつでも新鮮に保つ
- 葉っぱを食べるタイプなので、大量の牧草が必要

## 健康管理

### 基本的なケア
- 定期的な爪の切り
- 日焼け止め(UVケア)
- バラエティのある餌を与える
- 定期的に獣医に健康チェックしてもらう

### 病気サイン
- 食欲減退
- 目や口の異常
- 皮膚の問題

## 注意点
- **長生き**なので、将来の責任を考慮する
- 大人になるまでに数年かかる
- 子ガメの時から飼育するのが一般的
- 法律で飼育制限がある地域もある(確認が必要)

何か特定の点について詳しく知りたい場合は、さらに質問してください!

There are some questionable points in the information, but the response was surprisingly solid for something running locally.
It seems capable of handling everyday conversation without any issues.

Also, when you download Ollama, a desktop app is also available, so you can use it via a GUI as shown below.

スクリーンショット 2026-07-04 020217

You can check the resources being consumed by the model using ollama ps.
Looking at the PROCESSOR column, it shows 100% GPU, meaning the entire model is loaded onto the GPU (VRAM).
I'll cover how to read this in more detail later.

> ollama ps
NAME         ID              SIZE      PROCESSOR    CONTEXT    UNTIL
ornith:9b    a75697c14589    5.3 GB    100% GPU     4096       4 minutes from now

Using from an IDE

Using from VSCode (on Windows)

Now that we can run it from the terminal, let's try calling it from VSCode, which I normally use.
Since Ollama runs a local server, you connect to it via a compatible extension.

Here I tried two options: GitHub Copilot Chat and Continue.

GitHub Copilot Chat

First, let's try GitHub Copilot Chat.
Ollama has official documentation for VSCode integration, so this seems like a good place to start.

https://docs.ollama.com/integrations/vscode
https://marketplace.visualstudio.com/items?itemName=GitHub.copilot-chat

Install the extension and go through GitHub authentication.

スクリーンショット 2026-07-04 021449
スクリーンショット 2026-07-04 021529

After that, run the ollama launch vscode command from the terminal to select the Ollama LLM you want to use.

スクリーンショット 2026-07-04 023023

Once the local Ollama is recognized, the ornith:9b you selected earlier becomes available as an option.
After selecting the model, you can chat with the local model just like you normally would with Copilot Chat.

スクリーンショット (6)

Continue

Let me also try Continue, which is said to work well with local LLMs.
It's an open-source AI code assistant extension that also supports Ollama.

https://github.com/continuedev/continue

However, it seems it was recently acquired by Cursor, and based on the GitHub description, it looks like v2.0.0 will be the final version of the OSS release. 😭😭😭
So I'm not sure if I'll keep using it going forward, but I'll try it out in this blog for now.

https://www.continue.dev/

First, install the extension,

スクリーンショット 2026-07-04 023200

then write the model settings and other configurations in config.yaml.
The official reference has documentation on how to write it.

https://docs.continue.dev/reference

スクリーンショット (7)

This time I'll keep it simple and just specify the model.
The following configuration specifies ornith:9b on local Ollama.

config.yaml
name: Local Config
version: 1.0.0
schema: v1
models:
  - name: Ornith 9B
    provider: ollama
    model: ornith:9b

Once you save the settings, ornith:9b becomes selectable in Continue as well.

スクリーンショット 2026-07-04 024350

Using from VSCode (on WSL)

Since it worked on Windows, let's now try to use it from VSCode on Ubuntu in WSL, which is my usual development environment.

First, add the Continue extension to WSL as well,

スクリーンショット (8)

then try copying the config.yaml I created on the Windows side to the home directory on the WSL side.

$ mkdir -p ~/.continue
cp /mnt/c/Users/kanik/.continue/config.yaml ~/.continue/config.yaml

However, when I tried to connect from Continue, an error appeared.
It seems it cannot connect to the local Ollama.

Unable to connect to local Ollama instance.
Ollama may not be installed or may not running.

スクリーンショット 2026-07-04 032159

Since Ollama is running on the Windows side, it seems WSL is not connecting to it.
When I tried hitting it with curl from within WSL, I was indeed unable to connect to Ollama.

$ curl http://localhost:11434
curl: (7) Failed to connect to localhost port 11434 after 0 ms: Could not connect to server

To find the cause, I checked the WSL network mode and found it was in nat mode.

$ wslinfo --networking-mode
nat

Since WSL2 defaults to nat mode, WSL and Windows are treated as separate networks, and localhost won't connect to Ollama on the Windows side.
There are various ways to solve this, but this time I'll simply switch to mirrored mode, where WSL and Windows share the same network.

For details on WSL network modes, please refer to the official documentation.

https://learn.microsoft.com/ja-jp/windows/wsl/networking

So, I'll add networkingMode=mirrored to .wslconfig in the user folder.
※ Settings other than networkingMode are default values

[wsl2]
memory=8GB
swap=0
processors=2
networkingMode=mirrored

Shut down WSL once to apply the settings, then start it back up.

wsl --shutdown
wsl

Checking again with curl, a response came back.

$ curl http://localhost:11434
Ollama is running

Also, Continue on WSL was now able to connect to Ollama on the Windows side.

スクリーンショット 2026-07-04 032722

Using from Claude Code (VSCode on WSL)

Continue is great, but since we're at it, wouldn't it be nice to use a local model with Claude Code, which I normally use??
Actually, Ollama officially supports Claude Code too — how great is that!!

https://docs.ollama.com/integrations/claude-code

To use a local LLM, you need to point Claude Code's connection target to the local Ollama, but the following command requires Ollama and Claude Code to be in the same environment.

ollama launch claude

In this case, the setup is "Claude Code on WSL" and "Ollama on Windows," so we need to use an environment variable to specify the API base URL.

Specifically, set ANTHROPIC_BASE_URL to Ollama's address as shown below, pass the model you want to use with --model, and start it up.
※ Since authentication is local, the token can be empty

ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model ornith:9b

After running the above command, Claude Code started with the local ornith:9b.
ornith:9b is also now selectable in the model options.

スクリーンショット 2026-07-04 033817

However, when I actually tried using it, it didn't behave as expected in response to prompts, and kept returning nonsensical responses.

スクリーンショット 2026-07-04 033859

Looking into the cause, there seemed to be two main factors involved.
This topic is also mentioned in the following blog.

https://dev.classmethod.jp/articles/claude-code-ollama-local/

Insufficient Context Length

Claude Code passes a very large amount of text to the model as system prompts and tool definitions.
On the other hand, in this configuration, Ollama's default context length is 4096 tokens.
(As seen in the ollama ps output earlier, CONTEXT was 4096)

https://docs.ollama.com/context-length

In other words, the input from Claude Code greatly exceeds the context length, the overflow is silently discarded, and as a result the model cannot receive the instructions and tool definitions in full, making it unable to function properly.

Note that the context length can be expanded with OLLAMA_CONTEXT_LENGTH and similar settings, but expanding it consumes more memory (KV cache) accordingly.

Model Performance Itself

Since Claude Code is premised on an agentic usage pattern that heavily relies on tool calls, it's also conceivable that smaller local models simply struggle to handle this complex workflow reliably.

For now, let's also try the larger 35b (35 billion parameter) version to see the effect of model performance.

ollama run ornith:35b
ANTHROPIC_AUTH_TOKEN=ollama ANTHROPIC_BASE_URL=http://localhost:11434 ANTHROPIC_API_KEY="" claude --model ornith:35b

However, even in this state, accurate responses didn't come back, just like with 9b.

スクリーンショット 2026-07-04 042409

I thought "maybe it's because the effort is set to High?" and tried with Low as well, but the result was the same — it didn't work.

スクリーンショット 2026-07-04 042541

Next, I set the context length to 256k, the maximum supported by Ollama, and tried 35b again.
I'll set this in the desktop app this time.

スクリーンショット 2026-07-04 175931

> ollama ps
NAME          ID              SIZE     PROCESSOR          CONTEXT    UNTIL
ornith:35b    5a470e0f652c    25 GB    37%/63% CPU/GPU    262144     4 minutes from now

スクリーンショット 2026-07-04 180719

As a result, a proper response came back!!
However, this question took 2m45s to answer, so it seems difficult to use practically...
(The self-identification as Claude Code may be influenced by Claude Code's own system prompt)

スクリーンショット 2026-07-04 181127

So for now, using it with Claude Code seems tough, and I'll stick with Continue for the time being.
(Maybe things would change with a higher-performance model)

Choosing a Model That Runs as Powerfully and Quickly as Possible Given Your Hardware

The speed difference between 9B and 35B is obvious...

So far, I've tried running ornith:9b and ornith:35b.
However, despite running on the same PC, the perceived speed was completely different.

So let's launch with the --verbose option and measure the generation speed.

ollama run ornith:9b --verbose

total duration:       5.9517138s
load duration:        205.6493ms
prompt eval count:    77 token(s)
prompt eval duration: 136.167ms
prompt eval rate:     565.48 tokens/s
eval count:           358 token(s)
eval duration:        5.600524s
eval rate:            63.92 tokens/s
ollama run ornith:35b --verbose

total duration:       47.048687s
load duration:        212.5973ms
prompt eval count:    77 token(s)
prompt eval duration: 1.436679s
prompt eval rate:     53.60 tokens/s
eval count:           1371 token(s)
eval duration:        45.394293s
eval rate:            30.20 tokens/s

Here's what the main displayed metrics mean:

  • total duration: Total time taken for the entire request
  • load duration: Time taken to load the model into memory (on second run onwards, it's nearly zero if the model is resident)
  • prompt eval count / rate: Number of tokens processed in the input prompt and processing speed
  • eval count: Number of output tokens generated
  • eval rate: Generation speed (tokens/s) of output tokens. The most important metric for daily use

https://docs.ollama.com/api/usage

Of these, the first thing to focus on is eval rate, which represents generation speed.

  • ornith:9b: 63.92 tokens/s
  • ornith:35b: 30.20 tokens/s

9b can generate tokens at approximately twice the speed of 35b, and it was noticeably snappier in feel as well.

On the other hand, looking at total duration (overall elapsed time), 9b took about 6 seconds and 35b about 47 seconds, which looks like nearly an 8x difference.
However, this is because 35b happened to generate more tokens (eval count of 358 vs 1371), so it can't be used directly as a speed comparison.
To compare pure speed per token, it's best to look at eval rate.

The difference was even more pronounced in prompt eval rate, the speed at which the prompt is read.
9b was 565.48 tokens/s compared to 35b's 53.60 tokens/s — roughly a 10x gap here.

So why did 35b slow down so much?
The answer lies in the PROCESSOR column visible in the ollama ps command.

> ollama ps
NAME         ID              SIZE      PROCESSOR          CONTEXT    UNTIL
ornith:9b    a75697c14589    5.3 GB    100% GPU           4096       4 minutes from now
> ollama ps
ornith:35b   5a470e0f652c    21 GB     27%/73% CPU/GPU    4096       4 minutes from now

PROCESSOR indicates where the model is running, and while 9b shows 100% GPU, 35b shows 27%/73% CPU/GPU.
This means 35b couldn't fit entirely in VRAM and was partially split to the CPU side, which acted as a bottleneck and slowed it down.

"So in the end, what size model should I choose for my PC?" — that's probably what you're most curious about.
So from here, I'll summarize how to think about choosing a model that fits your hardware's capabilities.

Note: How to read the ollama ps PROCESSOR display

The PROCESSOR column seen earlier indicates where the model's memory is loaded.

  • 100% GPU: The entire model is loaded on the GPU (VRAM)
  • 100% CPU: The entire model is loaded in system RAM
  • 48%/52% CPU/GPU: Part is on the GPU and part is split to system RAM

https://docs.ollama.com/faq#how-can-i-tell-if-my-model-was-loaded-onto-the-gpu

The fact that ornith:9b (5.3 GB) showed 100% GPU and ornith:35b (21 GB) showed 27%/73% CPU/GPU was exactly this difference.
Since 35b at 21 GB couldn't fit within the 16 GB VRAM, the overflow was offloaded to system RAM.

Prerequisites: Basically, choose a model that fits in VRAM

First and foremost, the basic principle is to load the model entirely into GPU VRAM if at all possible.

The reason is that LLM token generation (the process of outputting text one token at a time) is a "memory-bound" process where memory bandwidth is the bottleneck rather than computational speed.
Each generation step requires repeatedly reading the model weights from memory, and this read speed directly affects generation speed.

https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
https://huggingface.co/docs/transformers/llm_tutorial_optimization

Also, since GPU VRAM has dramatically higher memory bandwidth than system RAM, models loaded in VRAM run at high speed.
Conversely, if the model doesn't fit in VRAM and part of it overflows to system RAM, that becomes a bottleneck and speed drops sharply.

How to estimate required VRAM

I couldn't find a clear formula for calculating the required VRAM, but based on the official Hugging Face documentation, here are three things to consider when estimating.

Roughly speaking, it seems best to think of it as "model weights" + "KV cache" + "inference overhead."

The first factor, weights, is determined by the number of parameters and precision (bytes per parameter).

  • FP32 (32-bit): approximately 4 × number of parameters [B] GB
  • FP16/BF16 (16-bit): approximately 2 × number of parameters [B] GB

Loading the weights of a model having X billion parameters requires roughly 2 * X GB of VRAM in bfloat16/float16 precision

https://huggingface.co/docs/transformers/llm_tutorial_optimization

For example, running an 8B model in FP16 requires about 16 GB just for the weights alone, which would already fill up 16 GB of VRAM...

The second factor, KV cache, increases proportionally to context length as mentioned earlier, and the same documentation shows examples where it can reach about half the size of the weights depending on the configuration.

Storing 8 billion float values in float16 precision requires around 15 GB of RAM which is circa half as much as the model weights themselves!

https://huggingface.co/docs/transformers/llm_tutorial_optimization

The third factor, overhead, is the memory consumed during inference beyond the weights themselves.
According to EleutherAI, this overshoot is generally about 20% margin of error.

However, this is only a rough rule of thumb, and EleutherAI itself doesn't go into the breakdown of overhead. So be aware that with long contexts, the KV cache growth above this 20% becomes the dominant factor.

In addition to the memory needed to store the model weights, there is also a small amount of additional overhead during the actual forward pass. In our experience this overhead is ≤ 20%

https://huggingface.co/docs/accelerate/usage_guides/model_size_estimator
https://blog.eleuther.ai/transformer-math/


So taking all of the above into account, it's clearly not feasible to run a rich LLM on my PC's specs... 😭😭😭

That's where quantization, explained next, comes in.

About Quantization

There are limits to how large a model can be loaded into VRAM.
That's why "quantization" is a technique for compressing models to make them smaller and faster to run.

Here I'll introduce two types of quantization.
LLM quantization, which compresses the model weights themselves, and KV cache quantization, which saves memory during inference.

LLM Quantization

Quantization in the context of LLMs is a technique that converts model weights from high-precision floating point (FP16/BF16 = 16-bit) to lower-bit integers (4-bit, 5-bit, 8-bit, etc.) to reduce model size and memory usage.

Notations like Q4_K_M that appear in Ollama model names represent this quantization.
This is the naming convention for "K-quants" used in the GGUF format, and the meaning is as follows:

  • The number after Q: target bit count per weight (Q4 means approximately 4-bit, Q8 means approximately 8-bit)
  • _K: a quantization method called k-quants (a method that compresses in units of "superblocks" of 256 weights)
  • _S / _M / _L: the difference between Small / Medium / Large within the same bit range (larger means more quality-focused)
  • _0 / _1: legacy format predating k-quants (Q8_0 etc. are still commonly used)

https://github.com/ggml-org/llama.cpp/pull/1684
https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md

Naturally, the lower the bit count, the smaller the size, but quality degrades a little at a time accordingly.
The llama.cpp official (PR #1684) includes actual measured comparisons on LLaMA 7B, which I'll quote here.

Type Perplexity (lower is better) File Size
F16 (non-quantized) 5.9066 13.0 GB
Q2_K 6.7764 2.67 GB
Q3_K_M 6.1503 3.06 GB
Q4_K_S 6.0215 3.56 GB
Q4_K_M 5.9601 3.80 GB
Q5_K_M 5.9208 4.45 GB
Q6_K 5.9110 5.15 GB

https://github.com/ggml-org/llama.cpp/pull/1684

Looking at the table, you can see that Q4_K_M compresses the size from 13 GB to 3.8 GB — down to about 1/3 or less — while keeping the perplexity degradation to just +0.05 compared to F16.
Perhaps because of this good balance, when you pull from Ollama without specifying a tag, you often get the Q4_K_M version, and indeed ornith pulled without a tag as latest (=9b) was the Q4_K_M version. (Details in the verification later)

KV Cache Quantization

The other type of quantization is "KV cache" quantization.

The KV cache is a mechanism that holds the Key/Value computed during Transformer inference in memory, so they can be reused each time a token is generated without recalculating them.
This speeds up generation, but has the property that as context length grows longer, KV cache memory consumption also increases.

Ollama's official documentation also explicitly states that expanding context length increases memory requirements.

https://docs.ollama.com/context-length

This KV cache can also be quantized to save memory.
In Ollama, you specify this with the environment variable OLLAMA_KV_CACHE_TYPE.

  • f16: Default. High precision, high memory usage
  • q8_0: 8-bit quantization. Approximately 1/2 the memory usage of f16, with minimal precision loss
  • q4_0: 4-bit quantization. Approximately 1/4 the memory usage of f16, with small to moderate precision loss

https://docs.ollama.com/faq#how-can-i-set-the-quantization-type-for-the-k/v-cache

However, note that this KV cache quantization only takes effect when Flash Attention is enabled.
Flash Attention is a mechanism that optimizes the Attention computation, and it currently appears that Flash Attention is used automatically when the backend and device support it.

https://docs.ollama.com/faq#how-can-i-enable-flash-attention

If you want to handle long contexts but VRAM is tight, q8_0 seems like a good balance between quality and memory savings.

Comparing Various Models

So far, I've covered the factors involved in model selection (VRAM, quantization, KV cache).
Finally, taking all of this into account, I'll run various models at different quantization levels in my environment and actually verify how they load into VRAM.

Context Length is 256K

I measured with context expanded to 256K so that the impact of the KV cache would be easy to see.
As mentioned earlier, extending the context length increases KV cache memory consumption, so whether or not the same model fits in VRAM can change.
It's important to consider not just file size, but whether everything including the KV cache fits in VRAM.

KV Cache quantized to q8_0

With a long context of 256K, keeping the KV cache at the default f16 leads to quite high memory consumption.
So this time, I decided to measure with the KV cache quantization described above enabled.

I'll choose q8_0 for the quantization level. I'll set this with environment variables this time.

setx OLLAMA_KV_CACHE_TYPE "q8_0"
setx OLLAMA_FLASH_ATTENTION "1"

Note that KV cache quantization (q8_0) only takes effect when Flash Attention is enabled, so I'm also setting OLLAMA_FLASH_ATTENTION to 1 at the same time.
Also, since environment variables are server-side settings for Ollama, I'm restarting Ollama after setting them before measuring.

Verification Results

The models verified are as follows.
I will record the PROCESSOR column from ollama ps and my impressions from actually using them.

https://ollama.com/library/ornith/tags

Model Quantization File Size Context ollama ps(PROCESSOR) Impressions
ornith:latest
ornith:9b
ornith:9b-q4_K_M
Q4_K_M 5.6 GB 256K 100% GPU Fully loaded on GPU, fastest. Quite snappy
ornith:9b-q8_0 Q8_0 9.5 GB 256K 100% GPU Also fully loaded on GPU. Slightly slower than q4 but comfortable enough
ornith:9b-bf16 BF16 18 GB 256K 37%/63% CPU/GPU Extremely slow. The slowest of this batch and impractical…
ornith:35b
ornith:35b-q4_K_M
Q4_K_M 21 GB 256K 37%/63% CPU/GPU Partially spills out of GPU, but surprisingly practical speed (~30 t/s)
ornith:35b-q8_0 Q8_0 37 GB 256K 62%/38% CPU/GPU Increased CPU ratio leads to speed reduction
ornith:35b-bf16 BF16 69 GB 256K 79%/21% CPU/GPU Almost entirely on CPU and quite slow, not practical for everyday use…

[!NOTE]
ornith:latest, ornith:9b, and ornith:9b-q4_K_M all share the same digest (a75697c14589) and are actually the same Q4_K_M version.
Similarly, ornith:35b and ornith:35b-q4_K_M are also identical (5a470e0f652c), so the table above consolidates duplicates.

The results of ollama ps and the --verbose option for each model are as follows.

List of Verification Results

ornith:9b-q4_K_M(5.6 GB)

NAME         ID              SIZE     PROCESSOR    CONTEXT    UNTIL
ornith:9b    a75697c14589    11 GB    100% GPU     262144     4 minutes from now
total duration:       3.1323146s
load duration:        208.3475ms
prompt eval count:    75 token(s)
prompt eval duration: 153.479ms
prompt eval rate:     488.67 tokens/s
eval count:           173 token(s)
eval duration:        2.756458s
eval rate:            62.76 tokens/s

ornith:9b-q8_0(9.5 GB)

NAME              ID              SIZE     PROCESSOR    CONTEXT    UNTIL
ornith:9b-q8_0    b1edce684170    14 GB    100% GPU     262144     4 minutes from now
total duration:       3.9848905s
load duration:        211.4153ms
prompt eval count:    75 token(s)
prompt eval duration: 112.603ms
prompt eval rate:     666.06 tokens/s
eval count:           190 token(s)
eval duration:        3.658139s
eval rate:            51.94 tokens/s

ornith:9b-bf16(18 GB)

NAME              ID              SIZE     PROCESSOR          CONTEXT    UNTIL
ornith:9b-bf16    0ed5c778f8a9    24 GB    37%/63% CPU/GPU    262144     4 minutes from now
total duration:       42.7021961s
load duration:        210.77ms
prompt eval count:    75 token(s)
prompt eval duration: 1.76954s
prompt eval rate:     42.38 tokens/s
eval count:           243 token(s)
eval duration:        40.719391s
eval rate:            5.97 tokens/s

ornith:35b-q4_K_M(21 GB)

NAME          ID              SIZE     PROCESSOR          CONTEXT    UNTIL
ornith:35b    5a470e0f652c    25 GB    37%/63% CPU/GPU    262144     4 minutes from now
total duration:       13.0198449s
load duration:        211.5483ms
prompt eval count:    75 token(s)
prompt eval duration: 2.055351s
prompt eval rate:     36.49 tokens/s
eval count:           326 token(s)
eval duration:        10.750258s
eval rate:            30.32 tokens/s

ornith:35b-q8_0(37 GB)

NAME               ID              SIZE     PROCESSOR          CONTEXT    UNTIL
ornith:35b-q8_0    44788b2fdab4    40 GB    62%/38% CPU/GPU    262144     4 minutes from now
total duration:       22.9267956s
load duration:        211.295ms
prompt eval count:    75 token(s)
prompt eval duration: 4.532222s
prompt eval rate:     16.55 tokens/s
eval count:           349 token(s)
eval duration:        18.179703s
eval rate:            19.20 tokens/s

ornith:35b-bf16(69 GB)

NAME               ID              SIZE     PROCESSOR          CONTEXT    UNTIL
ornith:35b-bf16    cfc852542826    73 GB    79%/21% CPU/GPU    262144     4 minutes from now
total duration:       38.7281788s
load duration:        210.656ms
prompt eval count:    75 token(s)
prompt eval duration: 10.298072s
prompt eval rate:     7.28 tokens/s
eval count:           333 token(s)
eval duration:        28.216238s
eval rate:            11.80 tokens/s

Analysis of Results

Whether it fits in VRAM is the most important factor…

The 9b models that ran at 100% GPU were both perfectly comfortable. However, as soon as even a little spills over to the CPU, the speed drops dramatically.
Even among the same 9b models, the bf16 version that didn't fit entirely in VRAM dropped as low as 5.97 t/s.

I mentioned earlier that "fitting in VRAM is the basic principle," but I didn't expect the numbers to show it this blatantly.

It was surprising that 9b-bf16 was slower than the 35b models

I had simply assumed that more parameters meant slower speed, but that wasn't quite the case.
Specifically, 9b-bf16 (5.97 t/s) was significantly slower than 35b-q4_K_M (30.32 t/s), which has nearly 4 times the parameter count.

I think this is because LLM token generation is a memory-bound process — determined more by "memory bandwidth" than "computational speed." (This ties back to the earlier point about "fitting in VRAM being the basic principle.")

When the weights that spilled over to CPU are in bf16 (unquantized, 16-bit), the amount of data read per token is simply about 4 times greater compared to quantized versions like q4_K_M (approximately 4-bit), which consumes that much more memory bandwidth.

9b seems to be the best choice for 16GB VRAM

In this environment, 9b in q4_K_M / q8_0 ran comfortably at 100% GPU even with the quite long context of 256K.
On the other hand, the 35B class and BF16 versions didn't fit within 16GB and fell back to the CPU, causing a significant speed drop.

That said, 35b-q4_K_M was still achieving around 30 t/s even with some CPU spillover, and in practice it still felt fast enough to be usable.
If you want smarter responses, I think it's perfectly reasonable to overlook the slight speed reduction and go with 35b-q4_K_M.

Closing Thoughts

That wraps up my record of running a local LLM on a gaming PC.

Even without an expensive dedicated machine, I was able to run a local LLM surprisingly easily using a gaming PC I already had (AMD / 16GB VRAM) and Ollama.
At the same time, the key takeaway was that VRAM capacity is the biggest factor affecting speed, and the basic principle is to keep the model size down through quantization within the range that fits.

It's much easier to try than I expected, so I think it's totally worth giving it a shot on whatever PC you already have.

That said, this time I only got to the "got it running" stage, so going forward I plan to use it more seriously for actual coding, and if I find new discoveries or useful ways to use it, I'll write another article about it.

I hope this article serves as a helpful reference for anyone looking to start experimenting with local LLMs.

Share this article