Building a Private LLM Environment (Qwen 3.5-4B) on AWS That Can Be Used from Both Smartphones and PCs

I have self-hosted a local LLM (Qwen 3.5-4B) on an AWS GPU instance as a backup for when Claude or ChatGPT experience outages. With Tailscale, I can access it privately from both my smartphone and PC at the same URL. This article provides the complete Terraform code and shares impressions from actual use.
2026.03.05

Introduction

Cloud LLM services like Claude and ChatGPT occasionally go down. For engineers who use LLMs daily, these outages can significantly hurt productivity. I imagine many of you would like to have your own LLM environment as a backup.

Running a local LLM on your home PC isn't particularly difficult. However, that approach assumes the PC is running continuously, and you can't access it when you're away from home.

In this article, we'll self-host an open model—a so-called local LLM—on an AWS GPU instance, and use Tailscale to create a private environment that can be accessed from both PC and smartphone regardless of location. When not in use, it automatically stops to minimize costs.

[Demo GIF: qwen-on-ec2]

For the LLM, I used Qwen 3.5-4B, which had just been released on March 2, 2026. I verified that the 4B-parameter model runs on a g4dn.xlarge with an NVIDIA T4 GPU (16 GB VRAM).

What is Qwen?

Qwen is a family of open-source large language models developed by Alibaba Cloud's Qwen team. The Qwen 3.5 series released in March 2026 comes in a wide range of sizes from 0.8B to 397B parameters, all available under the Apache 2.0 license. It supports a "thinking mode" that outputs the reasoning process during inference, and I used the 4B model, which is the easiest to run.

Test Environment

  • Project setup: WSL2 Ubuntu 24.04
    • Terraform: v1.12.2
    • AWS CLI: v2.31.31
    • AWS Region: us-east-1
    • EC2 instance type: g4dn.xlarge
  • Browser testing (PC): Windows 11
    • Tailscale: v1.94.2
  • Browser testing (smartphone): Android OS 16
    • Tailscale: v1.94.2

Target Audience

  • Engineers who want to have their own LLM environment as a backup
  • People who want to access their personal LLM from both smartphone and PC, regardless of location
  • People interested in local LLMs but don't want to keep their home PC running constantly

Overall Architecture

Here is an overview of the architecture.

  • Tailscale
    A service that creates a private network simply by installing client apps on your PC and smartphone. In this setup, we don't expose any inbound ports, accessing http://qwen-llm:8080 only through Tailscale. It's free for personal use.

  • Open WebUI
    A chat UI that integrates with Ollama. It features a responsive design for comfortable operation on both PCs and smartphones.

  • Ollama
    A runtime that allows easy execution of LLMs in Docker containers. You can download and start using models with just an ollama pull command.

  • Qwen 3.5-4B
    A 4B parameter LLM developed by Alibaba Cloud. Its thinking mode allows you to see the reasoning process while getting answers.

  • EC2 Spot Instance (g4dn.xlarge)
    A GPU instance equipped with an NVIDIA T4 GPU. Using it as a spot instance reduces costs by 60-70% compared to on-demand pricing.

  • Auto-shutdown Script
    A shell script that runs every 5 minutes via cron. It automatically stops the instance if it remains idle for over an hour.
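To see how the pieces fit together: Open WebUI simply calls Ollama's HTTP API. Once the stack is up and your device is on the tailnet, you can exercise Ollama directly, bypassing the UI. A sketch, assuming the hostname and model tag from this article's setup:

```shell
# Direct call to Ollama's generate endpoint (port 11434 is published on the host).
curl -s http://qwen-llm:11434/api/generate \
  -d '{"model": "qwen3.5:4b", "prompt": "Say hello in one sentence.", "stream": false}'
```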

Environment Setup

From here, I'll explain the steps to actually build the environment. I'll provide the full Terraform code, so you can use it as is.

Prerequisites

You'll need the following tools and accounts:

  • An AWS account (with credentials configured for the AWS CLI)
  • Terraform (v1.5 or later, per the required_version constraint below)
  • AWS CLI v2
  • A Tailscale account and an auth key

Directory Structure

Create files with the following directory structure:

qwen-on-ec2/
└── infra/
    ├── main.tf
    ├── variables.tf
    ├── outputs.tf
    ├── versions.tf
    └── userdata.sh

versions.tf

AWS provider configuration:

terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
}

variables.tf

Variable definitions. Since default values are set, only tailscale_auth_key needs to be specified.

variable "aws_region" {
  description = "AWS region to deploy resources"
  type        = string
  default     = "us-east-1"
}

variable "instance_type" {
  description = "EC2 instance type for GPU inference"
  type        = string
  default     = "g4dn.xlarge"
}

variable "ami_id" {
  description = "AMI ID to use. Leave empty to auto-detect the Deep Learning Base AMI (Ubuntu 22.04)"
  type        = string
  default     = ""
}

variable "tailscale_auth_key" {
  description = "Tailscale auth key for private network access"
  type        = string
  sensitive   = true
}

variable "spot_max_price" {
  description = "Maximum spot price in USD per hour"
  type        = string
  default     = "0.25"
}

variable "root_volume_size" {
  description = "Root EBS volume size in GB (Deep Learning AMI requires >= 75)"
  type        = number
  default     = 75
}

variable "open_webui_port" {
  description = "Port number for Open WebUI"
  type        = number
  default     = 8080
}

Leaving ami_id empty will automatically detect the Deep Learning Base AMI. This AMI has NVIDIA drivers pre-installed, reducing setup time.
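If you want to preview which AMI the data source will resolve before running apply, the equivalent AWS CLI query looks like this (a sketch, assuming your AWS CLI credentials are configured):

```shell
# Mirrors the filters in the aws_ami data source: the newest matching image wins.
aws ec2 describe-images \
  --owners amazon \
  --filters 'Name=name,Values=Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04) *' \
            'Name=architecture,Values=x86_64' \
  --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
  --output text
```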

main.tf

This is the core infrastructure. The file is long, so I've collapsed it.

Full main.tf
# --- AMI ---

data "aws_ami" "deep_learning_base" {
  count       = var.ami_id == "" ? 1 : 0
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04) *"]
  }

  filter {
    name   = "architecture"
    values = ["x86_64"]
  }
}

locals {
  ami_id = var.ami_id != "" ? var.ami_id : data.aws_ami.deep_learning_base[0].id
}

# --- Network ---

data "aws_vpc" "default" {
  default = true
}

data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }

  filter {
    name   = "default-for-az"
    values = ["true"]
  }
}

resource "aws_security_group" "llm" {
  name_prefix = "qwen-llm-"
  description = "Qwen LLM - inbound denied, outbound allowed"
  vpc_id      = data.aws_vpc.default.id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
    description = "Allow all outbound traffic"
  }

  tags = {
    Name = "qwen-llm"
  }

  lifecycle {
    create_before_destroy = true
  }
}

# --- IAM Role (for self-stop via AWS API) ---

resource "aws_iam_role" "llm" {
  name = "qwen-llm-ec2"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })

  tags = {
    Name = "qwen-llm"
  }
}

resource "aws_iam_role_policy" "llm_self_stop" {
  name = "self-stop"
  role = aws_iam_role.llm.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "ec2:StopInstances"
      Resource = "*"
      Condition = {
        StringEquals = {
          "ec2:ResourceTag/Name" = "qwen-llm"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "llm_ssm" {
  role       = aws_iam_role.llm.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_instance_profile" "llm" {
  name = "qwen-llm-ec2"
  role = aws_iam_role.llm.name
}

# --- EC2 Spot Instance ---

resource "aws_instance" "llm" {
  ami           = local.ami_id
  instance_type = var.instance_type

  instance_market_options {
    market_type = "spot"
    spot_options {
      max_price                      = var.spot_max_price
      spot_instance_type             = "persistent"
      instance_interruption_behavior = "stop"
    }
  }

  iam_instance_profile = aws_iam_instance_profile.llm.name

  vpc_security_group_ids      = [aws_security_group.llm.id]
  subnet_id                   = tolist(data.aws_subnets.default.ids)[0]
  associate_public_ip_address = true

  root_block_device {
    volume_size           = var.root_volume_size
    volume_type           = "gp3"
    delete_on_termination = true
  }

  user_data_base64 = base64encode(templatefile("${path.module}/userdata.sh", {
    tailscale_auth_key = var.tailscale_auth_key
    open_webui_port    = var.open_webui_port
  }))

  tags = {
    Name = "qwen-llm"
  }
}

  • Security Group
    No inbound rules are defined. All access is via Tailscale, so there's no need to expose ports. Only outbound is allowed.

  • IAM Role
    Grants ec2:StopInstances permission. This is needed for the auto-shutdown script to stop the instance itself. A Condition with tag Name = qwen-llm ensures it doesn't affect other instances.

  • Spot Instance
    Set with spot_instance_type = "persistent" and instance_interruption_behavior = "stop". If AWS needs to interrupt the instance, it will be stopped rather than terminated. Data on EBS is preserved, and you can continue using it after restarting.

outputs.tf

Output definitions to check after deployment:

output "instance_id" {
  description = "EC2 instance ID"
  value       = aws_instance.llm.id
}

output "public_ip" {
  description = "Public IP address (for debugging only - use Tailscale IP for access)"
  value       = aws_instance.llm.public_ip
}

output "ami_id" {
  description = "AMI ID used"
  value       = local.ami_id
}

output "access_url" {
  description = "Open WebUI URL (via Tailscale)"
  value       = "http://qwen-llm:${var.open_webui_port}"
}

userdata.sh

A script that automatically runs during the instance's first boot. It automates everything from installing Docker and Tailscale to downloading the LLM model.

Full userdata.sh
#!/bin/bash
set -euxo pipefail

exec > >(tee /var/log/userdata.log) 2>&1
echo "=== User Data: started at $(date) ==="

# Skip if already initialized (for stop/start cycles)
if [ -f /opt/qwen-llm/.initialized ]; then
  echo "Already initialized. Exiting."
  exit 0
fi

# ---- Install AWS CLI (for auto-shutdown via API) ----
apt-get update
apt-get install -y ca-certificates curl gnupg unzip conntrack
if ! command -v aws &> /dev/null; then
  curl -fsSL "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "/tmp/awscliv2.zip"
  unzip -q /tmp/awscliv2.zip -d /tmp
  /tmp/aws/install
  rm -rf /tmp/aws /tmp/awscliv2.zip
fi

# ---- Install Docker (skip if pre-installed on Deep Learning AMI) ----
if ! command -v docker &> /dev/null; then
  install -m 0755 -d /etc/apt/keyrings
  curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
    | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
  chmod a+r /etc/apt/keyrings/docker.gpg

  echo \
    "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
    $(. /etc/os-release && echo "$VERSION_CODENAME") stable" \
    | tee /etc/apt/sources.list.d/docker.list > /dev/null

  apt-get update
  apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
else
  echo "Docker already installed, skipping."
fi

# ---- Install NVIDIA Container Toolkit (skip if pre-installed) ----
if ! dpkg -l nvidia-container-toolkit &> /dev/null; then
  rm -f /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
    | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

  curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
    | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
    | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

  apt-get update
  apt-get install -y nvidia-container-toolkit
else
  echo "NVIDIA Container Toolkit already installed, skipping."
fi

# Ensure Docker is configured for NVIDIA runtime
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

# ---- Install Tailscale ----
curl -fsSL https://tailscale.com/install.sh | sh
set +x  # Don't output auth key to logs
tailscale up --auth-key=${tailscale_auth_key} --hostname=qwen-llm
set -x

# ---- Create Docker Compose file ----
mkdir -p /opt/qwen-llm
cat > /opt/qwen-llm/docker-compose.yml << 'COMPOSE_EOF'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=-1
      - OLLAMA_NUM_PARALLEL=4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "${open_webui_port}:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - ENABLE_TITLE_GENERATION=false
      - ENABLE_TAGS_GENERATION=false
      - ENABLE_SEARCH_QUERY_GENERATION=false
    volumes:
      - open_webui_data:/app/backend/data
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:
COMPOSE_EOF

# ---- Start services ----
cd /opt/qwen-llm
docker compose up -d

# ---- Wait for Ollama and pull model ----
echo "Waiting for Ollama to be ready..."
for i in $(seq 1 60); do
  if curl -sf http://localhost:11434/api/tags > /dev/null 2>&1; then
    echo "Ollama is ready!"
    break
  fi
  echo "Waiting... ($i/60)"
  sleep 5
done

echo "Pulling Qwen 3.5-4B model..."
docker exec $(docker ps -qf "ancestor=ollama/ollama") ollama pull qwen3.5:4b

# ---- Auto-shutdown script ----
cat > /opt/qwen-llm/auto-shutdown.sh << 'SHUTDOWN_EOF'
#!/bin/bash
# Monitor TCP connections to Open WebUI with conntrack,
# and stop the instance if it remains idle for a certain period.
# Using conntrack because connections through Docker NAT aren't visible with ss.
IDLE_THRESHOLD=3600
TIMESTAMP_FILE="/tmp/last-activity"
WEBUI_PORT=${open_webui_port}

# grep -c always prints a count (even "0"); "|| true" just absorbs its non-zero exit
ACTIVE_CONNECTIONS=$(conntrack -L -p tcp --dport "$WEBUI_PORT" 2>/dev/null \
  | grep -c ESTABLISHED || true)

if [ "$ACTIVE_CONNECTIONS" -gt 0 ]; then
  date +%s > "$TIMESTAMP_FILE"
  exit 0
fi

if [ ! -f "$TIMESTAMP_FILE" ]; then
  date +%s > "$TIMESTAMP_FILE"
  exit 0
fi

LAST_ACTIVITY=$(cat "$TIMESTAMP_FILE")
CURRENT_TIME=$(date +%s)
IDLE_TIME=$((CURRENT_TIME - LAST_ACTIVITY))

if [ "$IDLE_TIME" -ge "$IDLE_THRESHOLD" ]; then
  INSTANCE_ID=$(curl -sf http://169.254.169.254/latest/meta-data/instance-id)
  REGION=$(curl -sf http://169.254.169.254/latest/meta-data/placement/region)
  logger "auto-shutdown: Idle for $${IDLE_TIME}s. Stopping instance $${INSTANCE_ID}."
  aws ec2 stop-instances --instance-ids "$INSTANCE_ID" --region "$REGION"
fi
SHUTDOWN_EOF

chmod +x /opt/qwen-llm/auto-shutdown.sh

# Cron job: run every 5 minutes + initialize timestamp at reboot
printf '%s\n%s\n' \
  '@reboot root date +\%s > /tmp/last-activity' \
  '*/5 * * * * root /opt/qwen-llm/auto-shutdown.sh >> /var/log/auto-shutdown.log 2>&1' \
  > /etc/cron.d/auto-shutdown
chmod 644 /etc/cron.d/auto-shutdown

# Initialize timestamp
date +%s > /tmp/last-activity

# Mark as initialized
touch /opt/qwen-llm/.initialized

echo "=== User Data: completed at $(date) ==="

This script is processed by Terraform's templatefile function, so ${tailscale_auth_key} and ${open_webui_port} will be replaced with Terraform variable values. The script checks for /opt/qwen-llm/.initialized at the beginning, so it won't run anything if the instance is stopped and restarted. It's designed to run only the first time.
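One subtlety worth noting: inside the template, `${...}` is consumed by Terraform, while `$${...}` escapes to a literal `${...}` that survives for the shell to expand (this is why the auto-shutdown script writes `$${IDLE_TIME}`). A minimal illustration, with a hypothetical template file name:

```hcl
# Hypothetical sketch of how templatefile treats the two forms.
# In example.sh.tpl:
#   PORT=${open_webui_port}        -> Terraform substitutes: PORT=8080
#   echo "idle $${IDLE_TIME}s"     -> rendered literally as ${IDLE_TIME} for bash
locals {
  rendered = templatefile("${path.module}/example.sh.tpl", {
    open_webui_port = 8080
  })
}
```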

Running the Deployment

Create terraform.tfvars and set your Tailscale auth key:

tailscale_auth_key = "tskey-auth-xxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Run the deployment:

cd infra
terraform init
terraform apply

When apply completes, the access URL will be output to the terminal:

Outputs:

access_url  = "http://qwen-llm:8080"
instance_id = "i-0xxxxxxxxxxxx"

Automatic Setup After Launch

After terraform apply completes, the User Data script runs in the background:

  1. Install AWS CLI
  2. Install Docker
  3. Install NVIDIA Container Toolkit
  4. Install and authenticate Tailscale
  5. Start Ollama + Open WebUI containers
  6. Download Qwen 3.5-4B model (about 3 GB)
  7. Register auto-shutdown cron job

The whole process takes about 5-10 minutes. When qwen-llm appears in the Tailscale admin panel, steps 1-4 are complete.
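To follow the setup as it runs, you can watch the User Data log (the script tees its output to /var/log/userdata.log). Since the AmazonSSMManagedInstanceCore policy is attached, an SSM session works without SSH. A sketch:

```shell
# Open a shell on the instance via SSM, using the Terraform output for the ID.
aws ssm start-session --target "$(terraform -chdir=infra output -raw instance_id)"
# ...then, inside the session, follow the setup log:
sudo tail -f /var/log/userdata.log
```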

Testing: Access from Both PC and Smartphone

Verifying Tailscale Connection

Open the Tailscale admin panel and confirm that qwen-llm appears in the device list.

[Screenshot: Tailscale device list]

Accessing Open WebUI and Initial Setup

Open http://qwen-llm:8080 in your PC browser. You'll see the account creation screen on first access.

  1. Click Sign up
  2. Enter Name, Email, and Password
  3. Click Create Account

Chatting from PC

Select the qwen3.5:4b model at the top of the chat screen and send a message.

[Screenshot: test chat on PC]

Chatting from Smartphone

Install the Tailscale app on your Android smartphone and sign in with the same account as your PC.

Open http://qwen-llm:8080 in your smartphone's browser to access the same Open WebUI. Log in with the account you created earlier, and you'll see the chat history from your PC.

[Screenshot: Open WebUI in the smartphone browser]

Impressions from Use

Beyond just the setup process, I wanted to see if this could serve as a practical backup for cloud LLMs.

Response Speed

Qwen 3.5-4B has thinking mode enabled by default, and for technical questions it worked effectively. When I asked it to optimize a C for loop, it spent 101 seconds analyzing memory access patterns and returned accurate optimization suggestions that take cache locality into account.

[Demo GIF: qwen-on-ec2]

Prompt and full response
Please speed up the following for loop.
```c
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    a[j][i]++;
  }
}
```

----

The most effective optimization for a 2-D (multidimensional) array in C is improving the **memory access pattern (cache locality)**.

Here are the concrete steps and example code.

### 1. The problem with the memory access pattern
Multidimensional arrays in C are normally laid out in memory in **row-major order**, meaning elements in the same row sit next to each other in memory.

*   **Original code (before optimization):**
    ```c
    // i in the outer loop, j in the inner loop
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            a[j][i]++; // the row index j is not fixed; the column index i changes
        }
    }
    ```
*   **Problem:** with `a[j][i]`, the outer loop advances `i` (column) while the inner loop changes `j` (row), so memory is traversed **"down a column (vertically)"**. The accesses are not contiguous, cache misses occur, and a large number of memory accesses are issued.

### 2. Optimized code
**Simply making `j` the outer loop and `i` the inner loop** changes the traversal to "across a row (horizontally)". Consecutive memory elements are then accessed, and the cache hit rate improves dramatically.

```c
// Optimized Code: Loop Reorder (Row-Major)
int n = /* your size */;
// Assuming a is declared as int a[n][n] or similar contiguous structure.
for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
        // Memory access order: a[j][0], a[j][1], ..., a[j][n-1]
        // This is ROW-MAJORITY traversal.
        a[j][i]++; 
    }
}
```

### 3. Why it is faster: cache efficiency
*   **Cache line:** CPU cache lines are typically 64 bytes (8 words).
*   **Row-major access:** walking `a[j][0]...` in order means the next element is fetched straight from cache (prefetching also helps).
*   **Column-major access:** `a[0][i], a[1][i]...` scatters the accesses; whenever the data sits on a different cache line, a trip out to RAM becomes necessary.

### 4. Further optimization options
On top of the loop reorder, combining the following can bring further speedups.

#### A. Compiler flags
Let the compiler apply its own optimizations.
*   `-O3`: advanced code generation (e.g. inlining, inner-loop unrolling).
*   `-march=native`: use processor-specific instruction sets (enables vector operations via AVX2, AVX-512, SSE, etc.).

#### B. SIMD / parallelization (OpenMP)
Multiple CPU cores can be used at once, but the `++` operator would require an **atomic operation**.

```c
#pragma omp parallel for reduction(+:sum_variable) if (n > 100) // reduction written explicitly rather than omitted
for (j = 0; j < n; j++) {
    #pragma omp simd // request vectorization (effective if the CPU supports it)
    for (i = 0; i < n; i++) {
        a[j][i]++; // different cores write these, so OpenMP parallelization would need atomic behavior per core.
        // But is atomicity really required for a plain a[j][i]++?
        // Note: a[j][i]++ is a read-modify-write operation.
        // If multiple threads could touch the same a[j][i], atomic handling is needed.
    }
}
```
* Normally, parallelizing with OpenMP without `reduction` can corrupt the result (atomicity must be ensured).
In this code, however, we are only "incrementing everything", so no `sum` computation is needed. If the whole array is simply incremented and there is **data independence between the threads the loop is split across**, it can be parallelized.

### Summary: changes to prioritize
1. **Swap the loop order**: `for(i...) for(j...)` → `for(j...) for(i...)`. This is the most direct and effective optimization.
2. **Apply compiler options**: use `-O3 -march=native`.
3. **Memory alignment (extra):** when accessing `a[j][i]`, `a[j]` can be treated as a pointer; for dynamic arrays, explicit pointer arithmetic such as `int *ptr = &a[j];` may further improve the cache hit rate. The same holds for a static `int a[n][n]`.

The final code is as follows (loop-order swap only).

```c
// Final Optimized Code
for (j = 0; j < n; j++) { // Outer loop for Rows (Cache Locality)
    for (i = 0; i < n; i++) { // Inner loop for Columns (Sequential Access)
        a[j][i]++;
    }
}
```

On the other hand, for everyday conversational questions like self-introductions, responses come back in just a few seconds. There was no noticeable difference in response speed when accessing from the smartphone.

Response Quality

For everyday conversations and technical questions, it responds in natural Japanese. While there is a noticeable quality difference compared to Claude or ChatGPT, it felt sufficient to serve as a backup during outages.

However, despite having no web search capability, it sometimes responded with phrases like "I cannot search for reliable public information sources" or "To further improve search accuracy...". This is a case of the model failing to describe its own capabilities accurately: in reality it has no search functionality at all, so these phrasings are misleading.

Prompt and full response
Please tell me about the DevelopersIO author 越井琢巳.

----

I'm sorry, but I cannot find reliable public information sources (an official profile, articles, established data from media, etc.) about the individual "越井琢巳 of DevelopersIO".

The possibilities are:
- **A misspelling of the name or organization**: developer communities and technical organizations often have similar names, so I recommend double-checking the exact name (e.g. "Devs.io" or some other name rather than "DevelopersIO").
- **Limited public information**: the person may not be publicly active, or information may exist only on unofficial blogs or social media.

To provide accurate information, could you supply the correct spelling of the name or affiliation (in romaji or Latin script), or a concrete role (e.g. technical writer, project lead)? To further improve search accuracy, I can help you confirm the following points.

In thinking mode, the reasoning process is displayed in real-time. It allows you to peek into how a 4B parameter model thinks.

I also observed issues with inference processes in areas where the model's knowledge is vague. When asked "Please tell me the names of all Tokugawa shoguns," it couldn't recall the names of the 4th and subsequent shoguns, falling into an infinite loop of self-questioning dozens of times.

* Actually, the list of Tokugawa Shoguns includes:
1. Ieyasu
2. Hidetada
3. Iemitsu
4. **Iesada**? (Wait, no). It was **Yoshimune**. Wait, no.
* Okay, let's try to recall the names again:
1. Ieyasu
2. Hidetata
3. Iemitsu
4. **Iesada**? (Wait, no). It was **Yoshimune**. Wait, no.
* Actually, there is a specific name "Matsudaira Toshisuke".
* Okay, let's try to find the correct names:
1. Ieyasu
2. Hidetada
3. Iemitsu
4. **Iesada**? (Wait, no). It was **Yoshimune**. Wait, no.
* Actually, there is a common misconception about the count. Let's check if it is 15 or 16.
* Actually, the list of Tokugawa Shoguns includes:
1. Ieyasu
2. Hidetada
3. Iemitsu
4. **Iesada**? (Wait, no). It was **Yoshimune**. Wait, no.
* Actually, there is a specific name "Matsudaira Toshisuke".
* Okay, let's try to find the correct names:
1. Ieyasu
2. Hidetada
3. Iemitsu
4. **Iesada**? (Wait, no). It was **Yoshimune**. Wait, no.

Eventually, it may have reached the token limit, as the process ended with an empty response.

Verification: Auto-shutdown

To avoid costs when not in use, I've implemented a mechanism to automatically stop the instance after 1 hour of idle time.

A shell script run by cron every 5 minutes monitors active TCP connections to Open WebUI using the conntrack command. If no connections are detected for 1 hour, it retrieves its own instance ID from the instance metadata and stops itself using aws ec2 stop-instances.

For TCP connection monitoring, I use conntrack rather than ss. Open WebUI runs in a Docker container, and access via Tailscale is forwarded through iptables DNAT. The ss command on the host side doesn't show connections through Docker NAT, so I chose conntrack which references the Linux kernel's connection tracking table.

Additionally, a @reboot cron entry initializes the timestamp file. On EC2 stop/start, /tmp is preserved because it's on the EBS root volume. If an old timestamp remains when the instance starts, it might be judged as exceeding the threshold immediately and shut down.

I tested with a temporarily reduced threshold of 300 seconds and confirmed that about 10 minutes after closing the browser, the instance state transitioned from stopping to stopped in the AWS console.
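The decision the script makes reduces to simple epoch arithmetic. Here is the same check as a standalone sketch, using the lowered 300-second threshold from the test above and a simulated last-activity timestamp:

```shell
# Standalone sketch of auto-shutdown.sh's idle decision.
IDLE_THRESHOLD=300                          # lowered threshold, as in the test
LAST_ACTIVITY=$(( $(date +%s) - 600 ))      # simulate: last activity 10 minutes ago
CURRENT_TIME=$(date +%s)
IDLE_TIME=$((CURRENT_TIME - LAST_ACTIVITY))
if [ "$IDLE_TIME" -ge "$IDLE_THRESHOLD" ]; then
  echo "idle for ${IDLE_TIME}s - would stop the instance"
else
  echo "still active"
fi
```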

Restarting and Recovery After Shutdown

You can restart a stopped instance with the following command:

aws ec2 start-instances --instance-ids <instance-id> --region us-east-1

After restart, Docker containers automatically recover due to the restart: unless-stopped setting. Tailscale also reconnects automatically, so you can access it again at the same URL after a few minutes. Models are retained on the EBS volume, so no re-download is necessary.
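A small helper for this restart, sketched with the instance_id Terraform output; `aws ec2 wait` polls until the instance reports the desired state:

```shell
# Start the stopped instance and block until it reports "running".
INSTANCE_ID=$(terraform -chdir=infra output -raw instance_id)
aws ec2 start-instances --instance-ids "$INSTANCE_ID" --region us-east-1
aws ec2 wait instance-running --instance-ids "$INSTANCE_ID" --region us-east-1
echo "Instance is running; Tailscale and the containers come back on their own."
```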

Notes

Here are issues I encountered during testing, which may help if you try the same setup:

Initial Response Delay Due to VRAM Unloading

By default, Ollama unloads models from VRAM after 5 minutes of inactivity. After unloading, the first request requires reloading the model, which took 1-2 minutes with this setup.

To address this, I set the OLLAMA_KEEP_ALIVE=-1 environment variable in Docker Compose to keep the model in VRAM permanently. For a single-model configuration, there's no benefit to unloading.
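To confirm the model actually stays resident, Ollama's /api/ps endpoint lists the models currently loaded; with OLLAMA_KEEP_ALIVE=-1 no expiry time should be reported. A sketch, run on the instance or from a device on the tailnet:

```shell
# List models currently loaded in VRAM.
curl -s http://qwen-llm:11434/api/ps
```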

Open WebUI's Auto-generation Tasks Block Responses

After chat responses, Open WebUI sends title generation, tag generation, and related question generation task requests to the same LLM. Since Qwen 3.5 has a thinking mode, these tasks also execute the thinking phase and take time. Other chat requests sent during this time get blocked and remain in a loading state. Setting OLLAMA_NUM_PARALLEL=4 didn't help.

To address this, I disabled all auto-generation features using environment variables in Docker Compose:

  • ENABLE_TITLE_GENERATION=false
  • ENABLE_TAGS_GENERATION=false
  • ENABLE_SEARCH_QUERY_GENERATION=false

When title auto-generation is disabled, the initial prompt text is used as the chat title. Note that if containers are already running, adding environment variables alone won't take effect. Disable these features from the admin panel or personal settings.

Open WebUI Environment Variables and PersistentConfig

Some Open WebUI settings like ENABLE_TITLE_GENERATION are managed by a PersistentConfig mechanism. Environment variables are only read on first startup; on subsequent starts, values saved in the database take precedence. Therefore, adding environment variables after containers are already running won't take effect. To change settings, use the admin panel or personal settings.

Tailscale Auth Key Expiration

Tailscale auth keys have a default expiration of 90 days. While already joined devices aren't immediately disconnected when an auth key expires (device node keys have a separate 180-day limit), you won't be able to join the Tailscale network when rebuilding the instance after termination. For long-term operation, either periodically update the auth key and modify terraform.tfvars, or consider using Tailscale's OAuth client for automatic renewal.

Spot Instance Interruption Risk

Spot instances can be interrupted by AWS. If interrupted, you'll need to manually restart them.

Waiting for Spot Instance Restart

When restarting a stopped spot instance, start-instances may fail until the spot request state transitions to instance-stopped-by-user. If it's in a marked-for-stop state, wait a few minutes before trying again.
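You can check whether the persistent spot request has settled before retrying start-instances. A sketch using the standard instance-id filter:

```shell
# Inspect the spot request status for this instance; retry start-instances
# once the status code reaches instance-stopped-by-user.
INSTANCE_ID=$(terraform -chdir=infra output -raw instance_id)
aws ec2 describe-spot-instance-requests \
  --filters "Name=instance-id,Values=${INSTANCE_ID}" \
  --query 'SpotInstanceRequests[0].Status.Code' \
  --output text
```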

Conclusion

I've set up a self-hosted Qwen 3.5-4B on an AWS GPU instance, creating a private LLM environment accessible from both PC and smartphone via Tailscale. Despite being just a 4B parameter model, it provided practical answers including optimization suggestions considering cache locality for technical questions. While there are challenges like thinking mode infinite loops and hallucinations, it seems sufficient as a backup when Claude or ChatGPT are unavailable. I hope this helps those considering building their own private LLM environment.
