The story of implementing knowledge base search by connecting Vertex AI RAG Engine to Google Chat Bot

I implemented a knowledge base search for internal use by integrating Vertex AI RAG Engine into a Google Chat Bot. This article covers the entire process: identifying the cause of catastrophically poor search quality with text-embedding-005 on Japanese content, migrating to text-multilingual-embedding-002, and debugging OOM errors.
2026.06.19

This page has been translated by machine translation.

Introduction

In Part 1, we built a Google Chat Bot with Cloud Functions + Python + uv, and in Part 2, we implemented a progressive update UX with cardsV2.

This time we finally get to the main topic: integrating knowledge base search (RAG) into the bot. We'll cover the whole process of organizing approximately 300 internal knowledge entries, feeding them into Vertex AI RAG Engine, and answering user questions automatically.

To cut to the chase, the RAG Engine setup itself was straightforward, but there were Japanese-specific pitfalls in selecting the embedding model, and the process of identifying the root cause was the biggest learning experience.

Architecture

| Item | Choice |
| --- | --- |
| RAG Backend | Vertex AI RAG Engine (managed) |
| Embedding Model | text-multilingual-embedding-002 |
| Generation Model | Gemini 2.5 Flash |
| Knowledge Base | Approximately 300 QA entries (Markdown) |
| Storage | Google Cloud Storage |
| Fallback | Fixed message (directing to the responsible department) |

Data Organization: QA Data → KB Entries

Quality Assessment of Source Data

In this case, the source data consisted of approximately 600 QA data entries (question and answer pairs) accumulated internally. The source data was in the format of staff work notes, not user-facing procedure documents.

Quality Filtering

First, we excluded items clearly unsuitable as knowledge entries.

| Filter | Condition | Example |
| --- | --- | --- |
| Non-questions | Items not categorized as inquiries | Work requests, item handovers |
| No answer | Answer field is empty | |
| Insufficient information | Answer is too short | Less than 15 characters |

As a result, roughly half of the entries remained, each a pair of a question and a meaningful answer.
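The three filters above can be sketched as a single predicate. This is a minimal illustration, not the script used in the article: the `QARecord` fields and the `"inquiry"` category label are hypothetical, and only the 15-character threshold comes from the table.

```python
from dataclasses import dataclass

MIN_ANSWER_CHARS = 15  # answers shorter than this count as "insufficient information"


@dataclass
class QARecord:
    category: str  # hypothetical field, e.g. "inquiry" or "work_request"
    question: str
    answer: str


def is_usable(record: QARecord) -> bool:
    """Return True when the record passes all three quality filters."""
    if record.category != "inquiry":      # non-questions
        return False
    answer = record.answer.strip()
    if not answer:                        # no answer
        return False
    if len(answer) < MIN_ANSWER_CHARS:    # insufficient information
        return False
    return True


records = [
    QARecord("inquiry", "Excel macros cannot be executed.",
             "Please change the Trust Center settings."),
    QARecord("work_request", "Please hand over the spare laptop.",
             "Handed over at the front desk."),
    QARecord("inquiry", "The printer is offline.", ""),
    QARecord("inquiry", "Wi-Fi keeps dropping.", "Rebooted."),
]
kept = [r for r in records if is_usable(r)]
print(f"{len(kept)} of {len(records)} records kept")
```

Keeping the rules in one small function makes it easy to rerun the same filter when new QA data arrives.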

Policy of Keeping Individual Entries

We adopted a policy of keeping QA pairs as individual KB entries. Initially, we considered an approach of grouping by solution pattern and creating integrated articles, but we switched to individual retention for the following reasons.

  1. Preserving variations — Even with the same symptom, solutions differ by situation. Merging loses individual context
  2. Simplifying the pipeline — Eliminates the need for rule-based matching or manual article creation, making it easy to add new data
  3. Synthesis by Gemini — When multiple entries of the same pattern are retrieved in search results, Gemini follows system prompt instructions to summarize common response methods

Handling Source Data in Note Format

When source data is in the format of staff work notes (e.g., "resolved by changing the XX setting"), rather than showing them to users as-is, you can instruct Gemini via system prompt to convert them.

## About the Knowledge Base Format
- The knowledge base contains response records (staff notes)
- For records in the format "resolved by doing XX", please rephrase as easy-to-understand steps for the user
- If multiple similar records are included in the search results, summarize the common response methods in your answer

Answer quality directly depends on source data quality. The most effective way to get better answers is to improve the descriptions in the source data itself.

KB Entry Format

Each entry is saved as a Markdown file.

# Excel macros cannot be executed.

**Category**: office

## Response

Please change the Trust Center settings.

The question is placed in the title (= the target for matching in vector search), and the answer in the response section. The reason for keeping the question is that without it, the lexical match with the search query becomes weak (e.g., the answer may not contain the word "macro").
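As a rough sketch, one QA pair could be rendered into this layout as follows. The function name `render_kb_entry` is hypothetical; the layout itself follows the entry shown above.

```python
def render_kb_entry(question: str, category: str, answer: str) -> str:
    """Render one QA pair as a Markdown KB entry (question as the title)."""
    return (
        f"# {question.strip()}\n\n"
        f"**Category**: {category.strip()}\n\n"
        "## Response\n\n"
        f"{answer.strip()}\n"
    )


entry = render_kb_entry(
    question="Excel macros cannot be executed.",
    category="office",
    answer="Please change the Trust Center settings.",
)
print(entry)
```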

Ultimately, approximately 600 entries were organized into approximately 300 KB entries.

Vertex AI RAG Engine Setup

Why We Chose RAG Engine

With over 300 entries, "context stuffing" — packing the entire text into the prompt — is not practical. We adopted Vertex AI RAG Engine (managed RAG) for the following reasons.

  • Scales without configuration changes even as the KB grows
  • Chunking, embedding, and search are managed with no maintenance required
  • Stays within the Vertex AI ecosystem with no additional external services needed

What Is a Corpus?

A corpus in RAG Engine is a container for documents that are the target of search. Uploaded files are chunked and vectorized within the corpus and kept in a searchable state. It is equivalent to an index in OpenSearch, and you can create multiple corpora for different purposes (e.g., for IT support, HR, technical documentation).

With OpenSearch, you need to build the embedding pipeline and k-NN configuration yourself, but with RAG Engine, you simply specify the files and choose an embedding model — chunking, vectorization, and search index construction are all handled by the managed service.

Creating a GCS Bucket and Uploading

First, upload the KB articles to GCS.

# Create GCS bucket
gcloud storage buckets create gs://YOUR_BUCKET_NAME \
  --location=asia-northeast1

# Script to upload KB articles
uv run python scripts/upload_kb.py \
  --project YOUR_PROJECT_ID \
  --bucket YOUR_BUCKET_NAME

upload_kb.py is a script that uploads KB article .md files to GCS, creates a RAG corpus, and imports the files.

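import argparse
from pathlib import Path

from google.cloud import aiplatform, storage
from vertexai import rag

LOCATION = "asia-northeast1"
KB_DIR = Path("kb_articles")  # Assumed local directory holding the .md entries


def main():
    # Matches the --project / --bucket options used in the command above
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--bucket", required=True)
    args = parser.parse_args()
    project_id, bucket_name = args.project, args.bucket

    # Initialize the SDK once at the start, with an explicit location
    aiplatform.init(project=project_id, location=LOCATION)

    # 1. Upload to GCS
    client = storage.Client(project=project_id)
    bucket = client.bucket(bucket_name)
    for md_file in KB_DIR.glob("*.md"):
        blob = bucket.blob(f"kb_articles/{md_file.name}")
        blob.upload_from_filename(str(md_file))

    # 2. Create corpus (first time only)
    corpus = rag.create_corpus(
        display_name="my-it-support-kb",
        description="IT Support Knowledge Base",
    )

    # 3. Import every file under the GCS prefix into the corpus
    rag.import_files(corpus_name=corpus.name,
                     paths=[f"gs://{bucket_name}/kb_articles/"])


if __name__ == "__main__":
    main()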

If you omit location in aiplatform.init(), us-central1 is used by default. If the corpus is in asia-northeast1, you will get the following error when calling rag.import_files():

FAILED_PRECONDITION: Request resource location asia-northeast1
does not match service location us-central1.

Since rag.list_corpora() and rag.create_corpus() work fine but only import_files() fails, it takes time to identify the cause. Always run aiplatform.init(location=...) before calling rag.*.

Searching the corpus is done with rag.retrieval_query().

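import os

from vertexai import rag

# Full corpus resource name, built from the environment variables described
# later in this article. Assumes RAG_CORPUS_ID holds only the trailing ID.
corpus_name = (
    f"projects/{os.environ['GCP_PROJECT_ID']}"
    f"/locations/{os.environ['GCP_LOCATION']}"
    f"/ragCorpora/{os.environ['RAG_CORPUS_ID']}"
)


def retrieve_context(query: str) -> list[dict]:
    response = rag.retrieval_query(
        text=query,
        rag_resources=[rag.RagResource(rag_corpus=corpus_name)],
        rag_retrieval_config=rag.RagRetrievalConfig(
            top_k=5,
            filter=rag.Filter(vector_distance_threshold=0.6),
        ),
    )
    # Flatten each retrieved chunk into a plain dict
    return [
        {
            "text": context.text,
            "score": context.score,  # cosine distance: smaller is better
            "source": context.source_uri,
        }
        for context in response.contexts.contexts
    ]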

vector_distance_threshold is a threshold on cosine distance, where a smaller distance indicates higher relevance. We use 0.6 and treat anything above it as "low confidence."

These two metrics are easy to confuse. Cosine similarity ranges from -1 to 1, and values closer to 1 mean closer meaning. Cosine distance is its complement, calculated as 1 - similarity; it ranges from 0 to 2, and values closer to 0 mean closer meaning. The score returned by RAG Engine is a cosine distance, so a smaller value is a better match.

| Metric | Range | Better Direction |
| --- | --- | --- |
| Cosine Similarity | -1 to 1 | Higher (1 = identical) |
| Cosine Distance | 0 to 2 | Lower (0 = identical) |
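To make the relationship concrete, here is a small pure-Python sketch. It uses toy two-dimensional vectors rather than real embeddings, and the function names are only for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance is 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])   # similarity 1
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])       # similarity 0
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])        # similarity -1
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way give a distance of 0, unrelated (orthogonal) vectors give 1, and opposite vectors give 2, which is why a threshold of 0.6 sits on the "fairly similar" side of the scale.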

Answer Generation and Fallback Strategy

We use two strategies depending on the confidence of the search results.

def query(question: str) -> str:
    # 1. Search KB
    contexts = retrieve_context(question)

    # 2. Confidence check
    if not contexts or all(c["score"] > 0.6 for c in contexts):
        # Low confidence → Honestly return that no information was found in KB
        return "No relevant information was found. Please contact the responsible department."

    # 3. Generate answer using KB information
    return generate_answer(question, contexts)

| Search Score | Strategy | Reason |
| --- | --- | --- |
| ≤ 0.6 (good match) | Gemini answers using KB information | Provides accurate procedures specific to the organization |
| > 0.6 (low confidence) | Fixed message | Don't speculate with information not in the KB. Direct to the responsible department |

There is an important design decision here. An architecture that falls back to Google Search grounding (a feature where Gemini answers based on Google search results) in the case of low confidence is also technically possible. However, we adopted the policy of not answering with information not in the knowledge base. For an internal bot, "I don't know, please ask the person in charge" is safer than inaccurate general information.

Obstacle: Japanese Search Quality Was Catastrophic

The setup went smoothly, but when we tested it, search results were completely off-target for some queries.

Symptoms

=== "VBA macro cannot be executed" ===
  0.26 | office_macro_blocked.md  ✅ Correct

=== "Cannot log in to the internal system" ===
  0.18 | system_user_lock.md  ✅ Correct

=== "I received a suspicious email" ===
  0.44 | smartphone_google_photos.md  ❌ Completely wrong

Queries containing English or katakana technical terms such as VBA and Google Drive were retrieved accurately, but queries written in plain Japanese, such as "suspicious email," failed completely.

What's more, the title of the correct article was "Handling a Suspicious Email," which is almost identical to the query.

Isolating the Cause: Chunking or Embedding?

There were two possible causes.

  1. Chunking problem — RAG Engine is splitting files inappropriately, separating the title from the body
  2. Embedding model problem — The default text-embedding-005 is not correctly capturing the semantic similarity of Japanese text

Verifying Chunking

We ran a search targeting only a specific file and checked the chunks created by RAG Engine.

response = rag.retrieval_query(
    text="I received a suspicious email",
    rag_resources=[rag.RagResource(
        rag_corpus=corpus,
        rag_file_ids=["<suspicious_email_file_id>"],
    )],
    rag_retrieval_config=rag.RagRetrievalConfig(
        top_k=5,
        filter=rag.Filter(vector_distance_threshold=1.0),
    ),
)

The result showed that the chunk contained the full article text. Because the file is small (about 30 lines), it was not split: 1 file = 1 chunk. Chunking was therefore not the problem.

Verifying the Embedding Model

Next, we manually embedded the same text and calculated the cosine distance.

from vertexai.language_models import TextEmbeddingModel, TextEmbeddingInput

model = TextEmbeddingModel.from_pretrained("text-embedding-005")

query = "I received a suspicious email"
article = "Handling a suspicious email..."
unrelated = "Smartphone Google Photos sync settings..."

# Embed without task type (default)
embeddings = model.get_embeddings([query, article, unrelated])
# → query vs article: 0.27 (Good!)
# → query vs unrelated: 0.46

With the default task type, the model separates the two correctly: the correct article sits at 0.27 and the unrelated one at 0.46. So why does the correct article come out at 0.46 in RAG Engine?

Root Cause: Asymmetric Task Type Pairing

text-embedding-005 has a concept called task type, where different vectors are generated for the same text depending on the task type.

How Task Types Work

Task types do not switch the model's structure or weights. Internally, a prefix (instruction text) corresponding to the task type is simply prepended to the text. Conceptually, it works like this:

RETRIEVAL_QUERY    + "suspicious email"  →  Vector A
RETRIEVAL_DOCUMENT + "suspicious email"  →  Vector B
(no prefix)        + "suspicious email"  →  Vector C

Same model, same weights, same Transformer — but because the prefix differs, the attention pattern changes, ultimately generating different vectors. It's the same principle as getting different outputs from an LLM with "summarize this" versus "translate this."

This asymmetric pairing (using different task types for queries and documents) is designed to absorb the difference in nature between short search queries and long documents. The query-side prefix is trained to "expand the intent to move closer to the document space," while the document side is trained to "be positioned where relevant queries can easily reach."

However, this training depends on training data. Since text-embedding-005 was primarily trained on English query-document pairs, it can correctly "expand" for English, but for Japanese, the prefix influence distorts rather than helps the vectors — that is the essence of the problem this time.

When Task Types Are Applied

Importantly, task types are applied at both ingestion time and query time.

  • At ingestion time (when importing files into the corpus): Each chunk is embedded with RETRIEVAL_DOCUMENT and saved to the index
  • At query time (when calling retrieval_query()): The search query is embedded with RETRIEVAL_QUERY and compared to the saved document vectors

In other words, the asymmetry is baked into the index. To change the document-side task type, you need to recreate the corpus and re-import the files.

RAG Engine internally uses the following pairing.

  • Query: RETRIEVAL_QUERY
  • Document: RETRIEVAL_DOCUMENT

When we manually tested with this combination:

q_input = TextEmbeddingInput(text=query, task_type="RETRIEVAL_QUERY")
d_input = TextEmbeddingInput(text=article, task_type="RETRIEVAL_DOCUMENT")
# → Cosine distance: 0.4605  ← Exact match with RAG Engine results!

| Query-side Task Type | Document-side Task Type | Cosine Distance |
| --- | --- | --- |
| RETRIEVAL_QUERY | RETRIEVAL_DOCUMENT | 0.46 (indistinguishable) |
| RETRIEVAL_DOCUMENT | RETRIEVAL_DOCUMENT | 0.26 (good) |
| SEMANTIC_SIMILARITY | SEMANTIC_SIMILARITY | 0.30 (good) |

It became clear that the RETRIEVAL_QUERY / RETRIEVAL_DOCUMENT pair of text-embedding-005 cannot correctly capture the semantic similarity of pure Japanese text. Since queries containing English or katakana technical terms work fine, this is an easy problem to miss.
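A comparison like the one in the table can be scripted along these lines. This is a sketch under assumptions, not the article's actual test code: `compare_task_type_pairs` and `vertex_embed` are hypothetical helper names, and `vertex_embed` needs credentials and an initialized SDK, so the embedder is passed in as a parameter to keep the comparison logic testable offline.

```python
import math
from typing import Callable, Sequence

# An embedder takes (text, task_type) and returns a vector.
Embedder = Callable[[str, str], Sequence[float]]

PAIRS = [
    ("RETRIEVAL_QUERY", "RETRIEVAL_DOCUMENT"),
    ("RETRIEVAL_DOCUMENT", "RETRIEVAL_DOCUMENT"),
    ("SEMANTIC_SIMILARITY", "SEMANTIC_SIMILARITY"),
]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def vertex_embed(text: str, task_type: str,
                 model_name: str = "text-embedding-005") -> list[float]:
    """Embed one text with Vertex AI (requires credentials and SDK init)."""
    from vertexai.language_models import TextEmbeddingInput, TextEmbeddingModel

    model = TextEmbeddingModel.from_pretrained(model_name)
    [embedding] = model.get_embeddings(
        [TextEmbeddingInput(text=text, task_type=task_type)]
    )
    return embedding.values


def compare_task_type_pairs(embed: Embedder, query: str, document: str) -> dict:
    """Return the cosine distance for each (query type, document type) pair."""
    return {
        (q_type, d_type): cosine_distance(embed(query, q_type),
                                          embed(document, d_type))
        for q_type, d_type in PAIRS
    }
```

With real credentials, `compare_task_type_pairs(vertex_embed, query, article)` would return one distance per pairing, which is a quick way to check a new embedding model before rebuilding a corpus.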

Comparison of Vertex AI Embedding Models

Before getting to the solution, let's organize the embedding models provided by Vertex AI.

| Model | Dimensions | Languages | Features |
| --- | --- | --- | --- |
| text-embedding-005 | 768 | English-optimized | Latest English model. Supports task types. Has weaknesses with Japanese RETRIEVAL pairs |
| text-multilingual-embedding-002 | 768 | 100+ languages (strong in CJK) | Multilingual-specialized. Supports task types. Recommended for Japanese RAG |
| text-embedding-004 | 768 | English-centric | Previous generation of 005 |
| textembedding-gecko@003 | 768 | English | Old generation. No reason to choose for new projects |
| textembedding-gecko-multilingual@001 | 768 | Multilingual | Old generation multilingual model |

Both 005 and multilingual-002 support output dimension reduction, allowing cost-speed tradeoffs. For new projects, you'll basically choose between these two.

Solution: Migrating to text-multilingual-embedding-002

We ran the same test with text-multilingual-embedding-002.

model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")

q_input = TextEmbeddingInput(text=query, task_type="RETRIEVAL_QUERY")
d_input = TextEmbeddingInput(text=article, task_type="RETRIEVAL_DOCUMENT")
# → Cosine distance: 0.2090  ← Dramatically improved!

| Model | Query → Correct Article | Query → Unrelated Article | Discrimination Gap |
| --- | --- | --- | --- |
| text-embedding-005 | 0.46 | 0.44 | 0.02 (indistinguishable) |
| text-multilingual-embedding-002 | 0.21 | 0.48 | 0.27 (clearly distinguishable) |
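The gap can also be read as a signed margin: the distance to the unrelated article minus the distance to the correct one. The sketch below recomputes it from the measured distances above; the helper name `discrimination_margin` is hypothetical.

```python
def discrimination_margin(correct: float, unrelated: float) -> float:
    """Signed margin in cosine distance.

    Positive: the correct article is closer than the unrelated one.
    Negative: the unrelated article would be ranked first.
    """
    return round(unrelated - correct, 2)


# (query → correct article, query → unrelated article)
measurements = {
    "text-embedding-005": (0.46, 0.44),
    "text-multilingual-embedding-002": (0.21, 0.48),
}
margins = {
    model: discrimination_margin(correct, unrelated)
    for model, (correct, unrelated) in measurements.items()
}
print(margins)
```

Seen this way, the 0.02 for text-embedding-005 is actually a margin of -0.02: the unrelated article is slightly closer than the correct one, which matches the wrong top result in the symptoms.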

We recreated the RAG Engine corpus with text-multilingual-embedding-002.

corpus = rag.create_corpus(
    display_name="my-it-support-kb-v2",
    backend_config=rag.RagVectorDbConfig(
        rag_embedding_model_config=rag.RagEmbeddingModelConfig(
            vertex_prediction_endpoint=rag.VertexPredictionEndpoint(
                publisher_model="publishers/google/models/text-multilingual-embedding-002",
            ),
        ),
    ),
)

Search results after recreation:

=== "I received a suspicious email" ===
  0.21 | suspicious_email.md  ✅

=== "VBA macro cannot be executed" ===
  0.19 | office_macro_blocked.md  ✅

=== "My computer is running slowly" ===
  0.18 | pc_slow_troubleshooting.md  ✅

=== "Cannot log in to Salesforce" ===
  0.18 | salesforce_login.md  ✅

The correct article now comes up as top-1 for all queries.

Integrating the RAG Pipeline into worker.py

We incorporated the actual RAG processing into the 4-step progressive card built in the previous article.

def process_message(space_name, user_text, sender, ...):
    # Step 1: Analyzing inquiry
    _advance_step(state, "analyze", patcher, message_name)

    # Step 2: Building search query
    _advance_step(state, "build_query", patcher, message_name)

    # Step 3: Searching knowledge base
    contexts = retrieve_context(user_text)

    # Step 4: Generating answer
    if not contexts or all(c["score"] > 0.6 for c in contexts):
        answer = NO_RESULT_MESSAGE
    else:
        answer = generate_answer(user_text, contexts)

    # Add the answer to the card paragraph by paragraph, skipping empty ones
    for para in answer.split("\n\n"):
        if not para.strip():
            continue
        state.content_paragraphs.append(para.strip())
        patcher.patch(build_progressive_card(state))

Deployment Considerations: Out of Memory

512Mi → 1Gi (Initial Deployment)

After the initial deployment, a problem occurred where the card would freeze at "Analyzing inquiry."

Checking the logs:

Memory limit of 512 MiB exceeded with 529 MiB used.

The google-cloud-aiplatform SDK is heavy, and at 512Mi it was being killed by OOM. Increasing to 1Gi temporarily resolved it.

1Gi → 2Gi (After Production Operation)

However, 1Gi was still not sufficient. After going into production, a situation occurred where no answer was returned when queries were sent in rapid succession.

Symptoms and Difficulty

The symptom was vague: "sometimes no answer is returned when queries are sent rapidly." The fact that it was "sometimes" rather than always was tricky. Error handling on the application side was working normally, so it didn't seem to be a code bug.

Debugging Steps

First, we checked error-level logs in Cloud Logging.

gcloud logging read \
  'resource.type="cloud_run_revision"
   AND resource.labels.service_name="YOUR_SERVICE_NAME"
   AND severity>=ERROR' \
  --limit=50 \
  --format='table(timestamp,severity,textPayload)'

Two types of errors were found.

1. OOM (Memory Limit Exceeded)

Memory limit of 1024 MiB exceeded with 1033 MiB used.
Memory limit of 1024 MiB exceeded with 1029 MiB used.

This is the direct cause of the forced instance termination.

2. Connection Errors to Chat API

ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number
http.client.IncompleteRead: IncompleteRead(290 bytes read, 400 more expected)
Failed to create initial card

After an instance is killed by OOM, a new instance starts up, but in some cases the HTTP connection pool appeared to be left in a corrupted state, causing SSL errors and incomplete-read errors. In other words, the failures cascade: OOM → connection pool corruption → Chat API call failure → initial card cannot be created → no response.

3. Checking the Timeline

We checked all logs (including INFO) before and after the errors to understand the sequence of failures chronologically.

12:06:49  ERROR  Memory limit of 1024 MiB exceeded with 1029 MiB used
12:06:50  INFO   Starting new instance (AUTOSCALING)
  ...(normal operation)...
12:26:33  ERROR  Failed to create initial card (SSLError)
12:26:33  ERROR  Failed to create initial card (IncompleteRead)
  ...
12:29:56  ERROR  Memory limit of 1024 MiB exceeded with 1033 MiB used
12:29:57  INFO   Starting new instance (AUTOSCALING)

It was repeating the pattern: OOM → recovery → normal for a while → OOM again. This matched the reports of "sometimes no answer is returned."
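When there are many log lines, a small parser helps pull out just the OOM events and how far over the limit each one went. This is a sketch that assumes the message format shown in the logs above; the names `OOM_PATTERN` and `parse_oom` are hypothetical.

```python
import re

OOM_PATTERN = re.compile(
    r"Memory limit of (\d+) MiB exceeded with (\d+) MiB used"
)


def parse_oom(line: str):
    """Return (limit_mib, used_mib) for an OOM log line, otherwise None."""
    match = OOM_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


log_lines = [
    "12:06:49  ERROR  Memory limit of 1024 MiB exceeded with 1029 MiB used",
    "12:06:50  INFO   Starting new instance (AUTOSCALING)",
    "12:26:33  ERROR  Failed to create initial card (SSLError)",
    "12:29:56  ERROR  Memory limit of 1024 MiB exceeded with 1033 MiB used",
]
oom_events = [parsed for line in log_lines if (parsed := parse_oom(line))]
for limit_mib, used_mib in oom_events:
    print(f"limit={limit_mib} MiB, used={used_mib} MiB, over by {used_mib - limit_mib} MiB")
```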

Visualizing Memory Usage

Now that OOM was identified as the cause, we used the Cloud Monitoring API to retrieve memory usage trends to determine the appropriate memory limit.

curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries?filter=..." \
  | python3 -c "..."  # Extract averages from distribution data

Results:

| Phase | Memory Usage | Percentage of 1Gi |
| --- | --- | --- |
| Immediately after cold start | ~89 MiB | 9% |
| After SDK initialization (idle) | ~700 MiB | 70% |
| During query processing | ~880 MiB | 86% |
| Peak (maximum 1-minute average) | ~1011 MiB | 98.8% |

Even in idle state, it was already consuming 70%, and only a slight increase in memory during query processing was enough to trigger OOM.

Why It Becomes a Silent Failure

This bot processes requests in a background thread and returns {} as the HTTP response immediately. When an instance is killed by OOM, the background thread disappears with it, but from Google Chat's perspective, the HTTP request succeeded (200 OK), so no error like "app is not responding" is displayed. Users receive no notification; the answer simply doesn't come back.
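The pattern described above can be illustrated with plain threads. This is a simplified sketch, not the bot's actual handler: the `handle_chat` signature and the `slow_worker` stand-in are hypothetical, and there is no real HTTP or Chat API involved.

```python
import threading


def handle_chat(event: dict, worker) -> dict:
    """Start the slow work in a background thread and acknowledge at once."""
    thread = threading.Thread(target=worker, args=(event,), daemon=True)
    thread.start()
    # Google Chat only sees this response. If the instance is killed while
    # the thread is still running, nothing reports the lost work.
    return {}


# Demo: the handler returns while the worker is still blocked.
release = threading.Event()
finished = threading.Event()


def slow_worker(event: dict) -> None:
    release.wait(timeout=5)  # stands in for retrieval and answer generation
    finished.set()


response = handle_chat({"text": "hello"}, slow_worker)
returned_before_work_finished = not finished.is_set()
release.set()              # let the background work complete
finished.wait(timeout=5)
print(response, returned_before_work_finished)
```

The caller gets `{}` before the work is done, so any failure after that point has to be surfaced by the worker itself, and a killed process cannot do that.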

Thinking About Memory Sizing

Google Cloud best practices recommend keeping peak memory usage within 50–80% of the allocated limit. Below 50% is excessive cost, above 80% risks OOM during spikes.

| Memory Limit | Idle | Peak | Verdict |
| --- | --- | --- | --- |
| 512Mi | 137% | 197% | OOM |
| 1Gi | 70% | 98.8% | OOM occurring |
| 1.5Gi | 46% | 67% | Marginal (OOM risk from sub-second spikes) |
| 2Gi | 34% | 49% | Within recommended range |
| 4Gi | 17% | 25% | Excessive |

By increasing to 2Gi, peak usage was kept around 50%, providing sufficient headroom.
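The percentages in the table follow from dividing the measured usage by each candidate limit. The sketch below redoes that arithmetic with the approximate figures from the measurements (about 700 MiB idle, about 1011 MiB peak), so individual values can differ from the table by a point or so.

```python
IDLE_MIB = 700    # approximate idle usage after SDK initialization
PEAK_MIB = 1011   # approximate peak (maximum 1-minute average)

LIMITS_MIB = {
    "512Mi": 512,
    "1Gi": 1024,
    "1.5Gi": 1536,
    "2Gi": 2048,
    "4Gi": 4096,
}


def utilization(used_mib: float, limit_mib: float) -> float:
    """Memory usage as a percentage of the configured limit."""
    return 100.0 * used_mib / limit_mib


for label, limit in LIMITS_MIB.items():
    idle = utilization(IDLE_MIB, limit)
    peak = utilization(PEAK_MIB, limit)
    print(f"{label:>6}: idle {idle:5.1f}%  peak {peak:5.1f}%")
```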

CPython's internal memory allocator (pymalloc) keeps freed blocks in its own pools and often does not return them to the OS. As a result, even after processing completes and threads terminate, the process RSS (Resident Set Size) tends to stay near its high-water mark. Explicitly calling gc.collect() frees Python objects, but the memory usage seen by the OS may barely change. For this reason, peak usage effectively becomes the resident footprint, and the memory limit must be sized for the peak.

Deployment Command

gcloud functions deploy YOUR_FUNCTION_NAME \
  --gen2 --runtime=python312 --region=asia-northeast1 \
  --source=. --entry-point=handle_chat \
  --trigger-http --no-allow-unauthenticated \
  --memory=2Gi --cpu=1

# First time only: Disable CPU throttling (for background threads)
gcloud run services update YOUR_FUNCTION_NAME \
  --region=asia-northeast1 --no-cpu-throttling

The Cloud Run annotation run.googleapis.com/cpu-throttling: 'false' is maintained in subsequent deployments once set. You only need to run gcloud run services update --no-cpu-throttling once during initial setup.

Environment Variable Management

RAG configuration values are managed as environment variables. Once set with --set-env-vars in Cloud Functions, they are carried over in subsequent deployments.

# First time only
gcloud functions deploy YOUR_FUNCTION_NAME \
  ... \
  --set-env-vars="GCP_PROJECT_ID=your-project,GCP_LOCATION=asia-northeast1,RAG_CORPUS_ID=your-corpus-id,GEMINI_MODEL_ID=gemini-2.5-flash"

| Variable | Purpose |
| --- | --- |
| GCP_PROJECT_ID | Project ID used for Vertex AI SDK initialization |
| GCP_LOCATION | Region specification (e.g., asia-northeast1) |
| RAG_CORPUS_ID | RAG corpus ID to search |
| GEMINI_MODEL_ID | Gemini model used for answer generation |

Summary

We implemented knowledge base search by connecting the Google Chat Bot to the Vertex AI RAG Engine.

| Step | Content |
| --- | --- |
| Data organization | QA data → organized into approximately 300 KB entries |
| RAG Engine setup | GCS → corpus creation → file import |
| Embedding model selection | Changed from text-embedding-005 → text-multilingual-embedding-002 |
| Pipeline integration | Search → confidence judgment → generation (fixed message for low confidence) |
| Deployment | Memory incrementally increased from 512Mi → 1Gi → 2Gi |

The biggest lesson learned is that when building RAG with Japanese content, you should use text-multilingual-embedding-002 rather than text-embedding-005. Since text-embedding-005 works fine with English and katakana technical terms, this is a trap that can go unnoticed depending on your test cases.

Another lesson is that the quality of KB data determines the upper limit of answer quality. Given that the source data is in a staff note format, there are limits to how much you can instruct Gemini to convert it via system prompts. The most effective way to get better answers is to improve the quality of the source data itself.
