
The story of implementing knowledge base search by connecting Vertex AI RAG Engine to Google Chat Bot
Introduction
In Part 1, we built a Google Chat Bot with Cloud Functions + Python + uv, and in Part 2, we implemented a progressive update UX with cardsV2.
This time, we finally get to the main topic — integrating a knowledge base search (RAG) into the bot. We'll cover the entire process of organizing approximately 300 internal knowledge data entries, feeding them into the Vertex AI RAG Engine, and enabling automatic responses to user questions.
To cut to the chase, the RAG Engine setup itself was straightforward, but there were Japanese-specific pitfalls in selecting the embedding model, and the process of identifying the root cause was the biggest learning experience.
Architecture
| Item | Choice |
|---|---|
| RAG Backend | Vertex AI RAG Engine (managed) |
| Embedding Model | text-multilingual-embedding-002 |
| Generation Model | Gemini 2.5 Flash |
| Knowledge Base | Approximately 300 QA entries (Markdown) |
| Storage | Google Cloud Storage |
| Fallback | Fixed message (directing to the responsible department) |
Data Organization: QA Data → KB Entries
Quality Assessment of Source Data
In this case, the source data consisted of approximately 600 QA data entries (question and answer pairs) accumulated internally. The source data was in the format of staff work notes, not user-facing procedure documents.
Quality Filtering
First, we excluded items clearly unsuitable as knowledge entries.
| Filter | Condition | Example |
|---|---|---|
| Non-questions | Items not categorized as inquiries | Work requests, item handovers |
| No answer | Answer field is empty | — |
| Insufficient information | Answer is too short | Less than 15 characters |
As a result, approximately half of the entries remained as pairs of a question and a meaningful answer.
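The three filters above can be sketched as a single predicate. This is a minimal illustration, not the actual pipeline: the field names `kind` and `answer` and the sample records are assumptions made for the example, and only the 15-character minimum comes from the table.

```python
def is_usable_entry(entry: dict, min_answer_chars: int = 15) -> bool:
    """Return True if a QA record passes the three quality filters."""
    # Filter 1: keep only records categorized as inquiries (hypothetical "kind" field).
    if entry.get("kind") != "inquiry":
        return False
    answer = (entry.get("answer") or "").strip()
    # Filter 2: drop records whose answer field is empty.
    if not answer:
        return False
    # Filter 3: drop answers shorter than the minimum length.
    return len(answer) >= min_answer_chars


# Hypothetical sample records, one per filter outcome.
records = [
    {"kind": "inquiry", "question": "Excel macros cannot be executed.",
     "answer": "Please change the Trust Center settings."},
    {"kind": "work_request", "question": "Please replace the toner.",
     "answer": "Replaced the toner on the 3rd floor."},
    {"kind": "inquiry", "question": "Cannot print.", "answer": ""},
    {"kind": "inquiry", "question": "Wi-Fi is slow.", "answer": "Rebooted."},
]
usable = [r for r in records if is_usable_entry(r)]
```

Keeping the threshold as a parameter makes it easy to re-run the filter with a different minimum length and compare how many entries survive.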
Policy of Keeping Individual Entries
We adopted a policy of keeping QA pairs as individual KB entries. Initially, we considered an approach of grouping by solution pattern and creating integrated articles, but we switched to individual retention for the following reasons.
- Preserving variations — Even with the same symptom, solutions differ by situation. Merging loses individual context
- Simplifying the pipeline — Eliminates the need for rule-based matching or manual article creation, making it easy to add new data
- Synthesis by Gemini — When multiple entries of the same pattern are retrieved in search results, Gemini follows system prompt instructions to summarize common response methods
Handling Source Data in Note Format
When source data is in the format of staff work notes (e.g., "resolved by changing the XX setting"), rather than showing them to users as-is, you can instruct Gemini via system prompt to convert them.
## About the Knowledge Base Format
- The knowledge base contains response records (staff notes)
- For records in the format "resolved by doing XX", please rephrase as easy-to-understand steps for the user
- If multiple similar records are included in the search results, summarize the common response methods in your answer
Answer quality directly depends on source data quality. The most effective way to get better answers is to improve the descriptions in the source data itself.
KB Entry Format
Each entry is saved as a Markdown file.
# Excel macros cannot be executed.
**Category**: office
## Response
Please change the Trust Center settings.
The question is placed in the title (= the target for matching in vector search), and the answer in the response section. The reason for keeping the question is that without it, the lexical match with the search query becomes weak (e.g., the answer may not contain the word "macro").
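As a rough sketch of how a QA pair could be turned into this format, the function below renders one entry as Markdown. The function name and the blank lines between sections are assumptions for illustration; only the three-part layout (title, category, response) comes from the example above.

```python
def render_kb_entry(question: str, category: str, answer: str) -> str:
    """Render one QA pair as a Markdown KB entry (title, category, response)."""
    return "\n\n".join([
        f"# {question.strip()}",          # the question becomes the title
        f"**Category**: {category.strip()}",
        "## Response",
        answer.strip(),
    ]) + "\n"


entry_md = render_kb_entry(
    "Excel macros cannot be executed.",
    "office",
    "Please change the Trust Center settings.",
)
```

Because the question is always the first line, the title and the answer end up in the same chunk for small files, which is what the vector search relies on.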
Ultimately, approximately 600 entries were organized into approximately 300 KB entries.
Vertex AI RAG Engine Setup
Why We Chose RAG Engine
With over 300 entries, "context stuffing" — packing the entire text into the prompt — is not practical. We adopted Vertex AI RAG Engine (managed RAG) for the following reasons.
- Scales without configuration changes even as the KB grows
- Chunking, embedding, and search are managed with no maintenance required
- Stays within the Vertex AI ecosystem with no additional external services needed
What Is a Corpus?
A corpus in RAG Engine is a container for documents that are the target of search. Uploaded files are chunked and vectorized within the corpus and kept in a searchable state. It is equivalent to an index in OpenSearch, and you can create multiple corpora for different purposes (e.g., for IT support, HR, technical documentation).
With OpenSearch, you need to build the embedding pipeline and k-NN configuration yourself, but with RAG Engine, you simply specify the files and choose an embedding model — chunking, vectorization, and search index construction are all handled by the managed service.
Creating a GCS Bucket and Uploading
First, upload the KB articles to GCS.
# Create GCS bucket
gcloud storage buckets create gs://YOUR_BUCKET_NAME \
--location=asia-northeast1
# Script to upload KB articles
uv run python scripts/upload_kb.py \
--project YOUR_PROJECT_ID \
--bucket YOUR_BUCKET_NAME
upload_kb.py is a script that uploads KB article .md files to GCS, creates a RAG corpus, and imports the files.
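import argparse
from pathlib import Path

from google.cloud import aiplatform, storage
from vertexai import rag

LOCATION = "asia-northeast1"
KB_DIR = Path("kb_articles")  # local directory holding the KB .md files (assumed path)

def main():
    # Read the --project and --bucket options passed on the command line
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--bucket", required=True)
    args = parser.parse_args()
    project_id, bucket_name = args.project, args.bucket

    # SDK initialization only once at the start
    aiplatform.init(project=project_id, location=LOCATION)

    # 1. Upload to GCS
    client = storage.Client(project=project_id)
    bucket = client.bucket(bucket_name)
    for md_file in KB_DIR.glob("*.md"):
        blob = bucket.blob(f"kb_articles/{md_file.name}")
        blob.upload_from_filename(str(md_file))

    # 2. Create corpus (first time only)
    corpus = rag.create_corpus(
        display_name="my-it-support-kb",
        description="IT Support Knowledge Base",
    )

    # 3. Import files
    rag.import_files(corpus_name=corpus.name,
                     paths=[f"gs://{bucket_name}/kb_articles/"])

if __name__ == "__main__":
    main()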
If you omit location in aiplatform.init(), us-central1 is used by default. If the corpus is in asia-northeast1, you will get the following error when calling rag.import_files(): FAILED_PRECONDITION: Request resource location asia-northeast1 does not match service location us-central1.
Since rag.list_corpora() and rag.create_corpus() work fine but only import_files() fails, it takes time to identify the cause. Always run aiplatform.init(location=...) before calling rag.*.
Implementing Search
Search from RAG Engine is performed with rag.retrieval_query().
from vertexai import rag
def retrieve_context(query: str) -> list[dict]:
    # corpus_name is assumed to be defined at module level
    # (the resource name of the RAG corpus to search)
    response = rag.retrieval_query(
        text=query,
        rag_resources=[rag.RagResource(rag_corpus=corpus_name)],
        rag_retrieval_config=rag.RagRetrievalConfig(
            top_k=5,
            filter=rag.Filter(vector_distance_threshold=0.6),
        ),
    )
    # Flatten the response into plain dicts for the rest of the pipeline
    results = []
    for context in response.contexts.contexts:
        results.append({
            "text": context.text,
            "score": context.score,
            "source": context.source_uri,
        })
    return results
vector_distance_threshold is a threshold on cosine distance, where a smaller value indicates higher relevance. We use 0.6 as the threshold, treating results above that as "low confidence."
These are easy to confuse, so let's clarify. Cosine similarity ranges from -1 to 1, where closer to 1 indicates closer meaning. Cosine distance is the inverse, calculated as 1 - similarity, ranging from 0 to 2, where closer to 0 means closer meaning. The score returned by RAG Engine is cosine distance, so a smaller value = a better match.
| Metric | Range | Better Direction |
|---|---|---|
| Cosine Similarity | -1 to 1 | Higher (1 = identical) |
| Cosine Distance | 0 to 2 | Lower (0 = identical) |
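The relationship between the two metrics can be checked with a few lines of standard-library Python. This is only an illustration on toy 2-dimensional vectors; real embeddings have hundreds of dimensions, and the helper names are made up for this sketch.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - similarity (range 0 to 2); smaller means closer."""
    return 1.0 - cosine_similarity(a, b)


same = cosine_distance([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])  # unrelated direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
```

The three toy cases land on the two ends and the midpoint of the distance range, and vector length does not matter, only direction.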
Answer Generation and Fallback Strategy
We use two strategies depending on the confidence of the search results.
def query(question: str) -> str:
    # 1. Search KB
    contexts = retrieve_context(question)
    # 2. Confidence check
    if not contexts or all(c["score"] > 0.6 for c in contexts):
        # Low confidence → Honestly return that no information was found in KB
        return "No relevant information was found. Please contact the responsible department."
    # 3. Generate answer using KB information
    return generate_answer(question, contexts)
| Search Score | Strategy | Reason |
|---|---|---|
| < 0.6 (good match) | Gemini answers using KB information | Provides accurate procedures specific to the organization |
| ≥ 0.6 (low confidence) | Fixed message | Don't speculate with information not in the KB. Direct to the responsible department |
There is an important design decision here. An architecture that falls back to Google Search grounding (a feature where Gemini answers based on Google search results) in the case of low confidence is also technically possible. However, we adopted the policy of not answering with information not in the knowledge base. For an internal bot, "I don't know, please ask the person in charge" is safer than inaccurate general information.
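The body of generate_answer() is not shown in this article. As a hedged sketch of the part that does not depend on the Gemini SDK, the function below assembles a prompt from the retrieved contexts, closest match first. The function name, the prompt wording, and the sample bucket path are all assumptions for illustration, and the actual model call is intentionally left out.

```python
def build_prompt(question: str, contexts: list[dict], max_contexts: int = 5) -> str:
    """Assemble a KB-only prompt from retrieved contexts (hypothetical helper)."""
    # Smallest cosine distance first, so the best match leads the prompt.
    ranked = sorted(contexts, key=lambda c: c["score"])[:max_contexts]
    records = "\n\n".join(
        f"[Record {i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(ranked, start=1)
    )
    return (
        "Answer the question using only the knowledge base records below.\n"
        "If the records do not contain the answer, say so.\n\n"
        f"## Records\n{records}\n\n"
        f"## Question\n{question}"
    )


# Hypothetical retrieval results in the shape returned by retrieve_context().
example_contexts = [
    {"text": "Reset the password from the portal.", "score": 0.35,
     "source": "gs://example-bucket/kb_articles/password_reset.md"},
    {"text": "Unlock the account in the admin console.", "score": 0.18,
     "source": "gs://example-bucket/kb_articles/system_user_lock.md"},
]
prompt = build_prompt("Cannot log in to the internal system", example_contexts)
```

Restricting the prompt to retrieved records mirrors the policy above: the model is asked to work from the knowledge base rather than from general knowledge.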
Obstacle: Japanese Search Quality Was Catastrophic
The setup went smoothly, but when we tested it, search results were completely off-target for some queries.
Symptoms
=== "VBA macro cannot be executed" ===
0.26 | office_macro_blocked.md ✅ Correct
=== "Cannot log in to the internal system" ===
0.18 | system_user_lock.md ✅ Correct
=== "I received a suspicious email" ===
0.44 | smartphone_google_photos.md ❌ Completely wrong
Queries containing English or katakana technical terms such as VBA and Google Drive were retrieved accurately, but pure-Japanese queries such as "suspicious email" failed completely.
Furthermore, the article title was "Handling a Suspicious Email" — text almost identical to the query.
Isolating the Cause: Chunking or Embedding?
There were two possible causes.
- Chunking problem: RAG Engine is splitting files inappropriately, separating the title from the body
- Embedding model problem: The default text-embedding-005 is not correctly capturing the semantic similarity of Japanese text
Verifying Chunking
We ran a search targeting only a specific file and checked the chunks created by RAG Engine.
response = rag.retrieval_query(
    text="I received a suspicious email",
    rag_resources=[rag.RagResource(
        rag_corpus=corpus,
        rag_file_ids=["<suspicious_email_file_id>"],
    )],
    rag_retrieval_config=rag.RagRetrievalConfig(
        top_k=5,
        filter=rag.Filter(vector_distance_threshold=1.0),
    ),
)
The result showed that the chunk contained the full article text. Because the file is small (about 30 lines), it was not split: 1 file = 1 chunk. Chunking was therefore not the problem.
Verifying the Embedding Model
Next, we manually embedded the same text and calculated the cosine distance.
from vertexai.language_models import TextEmbeddingModel, TextEmbeddingInput
model = TextEmbeddingModel.from_pretrained("text-embedding-005")
query = "I received a suspicious email"
article = "Handling a suspicious email..."
unrelated = "Smartphone Google Photos sync settings..."
# Embed without task type (default)
embeddings = model.get_embeddings([query, article, unrelated])
# → query vs article: 0.27 (Good!)
# → query vs unrelated: 0.46
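With the default task type, the model correctly places the matching article closer to the query (0.27, versus 0.46 for the unrelated one). So why does the matching article come out at 0.46 in RAG Engine?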
Root Cause: Asymmetric Task Type Pairing
text-embedding-005 has a concept called task type, where different vectors are generated for the same text depending on the task type.
How Task Types Work
Task types do not switch the model's structure or weights. Internally, a prefix (instruction text) corresponding to the task type is simply prepended to the text. Conceptually, it works like this:
RETRIEVAL_QUERY + "suspicious email" → Vector A
RETRIEVAL_DOCUMENT + "suspicious email" → Vector B
(no prefix) + "suspicious email" → Vector C
Same model, same weights, same Transformer — but because the prefix differs, the attention pattern changes, ultimately generating different vectors. It's the same principle as getting different outputs from an LLM with "summarize this" versus "translate this."
This asymmetric pairing (using different task types for queries and documents) is designed to absorb the difference in nature between short search queries and long documents. The query-side prefix is trained to "expand the intent to move closer to the document space," while the document side is trained to "be positioned where relevant queries can easily reach."
However, this training depends on training data. Since text-embedding-005 was primarily trained on English query-document pairs, it can correctly "expand" for English, but for Japanese, the prefix influence distorts rather than helps the vectors — that is the essence of the problem this time.

When Task Types Are Applied
Importantly, task types are applied at both ingestion time and query time.
- At ingestion time (when importing files into the corpus): Each chunk is embedded with RETRIEVAL_DOCUMENT and saved to the index
- At query time (when calling retrieval_query()): The search query is embedded with RETRIEVAL_QUERY and compared to the saved document vectors
In other words, the asymmetry is baked into the index. To change the document-side task type, you need to recreate the corpus and re-import the files.
RAG Engine internally uses the following pairing.
- Query: RETRIEVAL_QUERY
- Document: RETRIEVAL_DOCUMENT
When we manually tested with this combination:
q_input = TextEmbeddingInput(text=query, task_type="RETRIEVAL_QUERY")
d_input = TextEmbeddingInput(text=article, task_type="RETRIEVAL_DOCUMENT")
# → Cosine distance: 0.4605 ← Exact match with RAG Engine results!
| Query-side Task Type | Document-side Task Type | Cosine Distance |
|---|---|---|
| RETRIEVAL_QUERY | RETRIEVAL_DOCUMENT | 0.46 (indistinguishable) |
| RETRIEVAL_DOCUMENT | RETRIEVAL_DOCUMENT | 0.26 (good) |
| SEMANTIC_SIMILARITY | SEMANTIC_SIMILARITY | 0.30 (good) |
It became clear that the RETRIEVAL_QUERY / RETRIEVAL_DOCUMENT pair of text-embedding-005 cannot correctly capture the semantic similarity of pure Japanese text. Since queries containing English or katakana technical terms work fine, this is an easy problem to miss.
Comparison of Vertex AI Embedding Models
Before getting to the solution, let's organize the embedding models provided by Vertex AI.
| Model | Dimensions | Languages | Features |
|---|---|---|---|
| text-embedding-005 | 768 | English-optimized | Latest English model. Supports task types. Has weaknesses with Japanese RETRIEVAL pairs |
| text-multilingual-embedding-002 | 768 | 100+ languages (strong in CJK) | Multilingual-specialized. Supports task types. Recommended for Japanese RAG |
| text-embedding-004 | 768 | English-centric | Previous generation of 005 |
| textembedding-gecko@003 | 768 | English | Old generation. No reason to choose for new projects |
| textembedding-gecko-multilingual@001 | 768 | Multilingual | Old generation multilingual model |
Both 005 and multilingual-002 support output dimension reduction, allowing cost-speed tradeoffs. For new projects, you'll basically choose between these two.
Solution: Migrating to text-multilingual-embedding-002
We ran the same test with text-multilingual-embedding-002.
model = TextEmbeddingModel.from_pretrained("text-multilingual-embedding-002")
q_input = TextEmbeddingInput(text=query, task_type="RETRIEVAL_QUERY")
d_input = TextEmbeddingInput(text=article, task_type="RETRIEVAL_DOCUMENT")
# → Cosine distance: 0.2090 ← Dramatically improved!
| Model | Query → Correct Article | Query → Unrelated Article | Discrimination Gap |
|---|---|---|---|
| text-embedding-005 | 0.46 | 0.44 | 0.02 (indistinguishable) |
| text-multilingual-embedding-002 | 0.21 | 0.48 | 0.27 (clearly distinguishable) |
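The discrimination gap is simply the distance to the unrelated article minus the distance to the correct one. A small sketch using the measured distances from the table (the function name is made up for illustration):

```python
def discrimination_gap(correct_distance: float, unrelated_distance: float) -> float:
    """Signed gap: positive means the correct article is closer than the unrelated one."""
    return round(unrelated_distance - correct_distance, 2)


# (query -> correct article, query -> unrelated article), taken from the table.
measured = {
    "text-embedding-005": (0.46, 0.44),
    "text-multilingual-embedding-002": (0.21, 0.48),
}
gaps = {model: discrimination_gap(*distances) for model, distances in measured.items()}
```

Computed with its sign, the gap for text-embedding-005 is slightly negative: the unrelated article is marginally closer than the correct one, which is why it surfaced as the top hit. The table reports the magnitude of that gap.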
We recreated the RAG Engine corpus with text-multilingual-embedding-002.
corpus = rag.create_corpus(
    display_name="my-it-support-kb-v2",
    backend_config=rag.RagVectorDbConfig(
        rag_embedding_model_config=rag.RagEmbeddingModelConfig(
            vertex_prediction_endpoint=rag.VertexPredictionEndpoint(
                publisher_model="publishers/google/models/text-multilingual-embedding-002",
            ),
        ),
    ),
)
Search results after recreation:
=== "I received a suspicious email" ===
0.21 | suspicious_email.md ✅
=== "VBA macro cannot be executed" ===
0.19 | office_macro_blocked.md ✅
=== "My computer is running slowly" ===
0.18 | pc_slow_troubleshooting.md ✅
=== "Cannot log in to Salesforce" ===
0.18 | salesforce_login.md ✅
The correct article now comes up as top-1 for all queries.
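A check like this is easy to automate so that a pure-Japanese regression is caught before deployment. The sketch below is an assumption about how such a check could look, not the script actually used: the retriever is passed in as a function so the example runs without Vertex AI, and fake_retrieve with its canned scores stands in for retrieve_context().

```python
from typing import Callable


def top1_hits(cases: dict[str, str],
              retrieve: Callable[[str], list[dict]]) -> dict[str, bool]:
    """For each query, report whether the closest result is the expected file."""
    results = {}
    for query, expected_file in cases.items():
        contexts = retrieve(query)
        # Smallest cosine distance = best match; None if nothing was retrieved.
        best = min(contexts, key=lambda c: c["score"], default=None)
        results[query] = best is not None and best["source"].endswith(expected_file)
    return results


def fake_retrieve(query: str) -> list[dict]:
    """Stand-in retriever with canned results (hypothetical data)."""
    canned = {
        "I received a suspicious email": [
            {"score": 0.48, "source": "gs://example-bucket/kb_articles/smartphone_google_photos.md"},
            {"score": 0.21, "source": "gs://example-bucket/kb_articles/suspicious_email.md"},
        ],
        "VBA macro cannot be executed": [
            {"score": 0.19, "source": "gs://example-bucket/kb_articles/office_macro_blocked.md"},
        ],
    }
    return canned.get(query, [])


cases = {
    "I received a suspicious email": "suspicious_email.md",
    "VBA macro cannot be executed": "office_macro_blocked.md",
}
hits = top1_hits(cases, fake_retrieve)
```

In real use, retrieve_context would be passed instead of fake_retrieve, and the test cases should deliberately include pure-Japanese queries, since those are the ones that exposed the problem.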
Integrating the RAG Pipeline into worker.py
We incorporated the actual RAG processing into the 4-step progressive card built in the previous article.
def process_message(space_name, user_text, sender, ...):
    # Step 1: Analyzing inquiry
    _advance_step(state, "analyze", patcher, message_name)
    # Step 2: Building search query
    _advance_step(state, "build_query", patcher, message_name)
    # Step 3: Searching knowledge base
    contexts = retrieve_context(user_text)
    # Step 4: Generating answer
    if not contexts or all(c["score"] > 0.6 for c in contexts):
        answer = NO_RESULT_MESSAGE
    else:
        answer = generate_answer(user_text, contexts)
    # Add answer to card paragraph by paragraph
    for para in answer.split("\n\n"):
        state.content_paragraphs.append(para.strip())
        patcher.patch(build_progressive_card(state))
Deployment Considerations: Out of Memory
512Mi → 1Gi (Initial Deployment)
After the initial deployment, a problem occurred where the card would freeze at "Analyzing inquiry."
Checking the logs:
Memory limit of 512 MiB exceeded with 529 MiB used.
The google-cloud-aiplatform SDK is heavy, and at 512Mi it was being killed by OOM. Increasing to 1Gi temporarily resolved it.
1Gi → 2Gi (After Production Operation)
However, 1Gi was still not sufficient. After going into production, a situation occurred where no answer was returned when queries were sent in rapid succession.
Symptoms and Difficulty
The symptom was vague: "sometimes no answer is returned when queries are sent rapidly." The fact that it was "sometimes" rather than always was tricky. Error handling on the application side was working normally, so it didn't seem to be a code bug.
Debugging Steps
First, we checked error-level logs in Cloud Logging.
gcloud logging read \
'resource.type="cloud_run_revision"
AND resource.labels.service_name="YOUR_SERVICE_NAME"
AND severity>=ERROR' \
--limit=50 \
--format='table(timestamp,severity,textPayload)'
Two types of errors were found.
1. OOM (Memory Limit Exceeded)
Memory limit of 1024 MiB exceeded with 1033 MiB used.
Memory limit of 1024 MiB exceeded with 1029 MiB used.
This is the direct cause of the forced instance termination.
2. Connection Errors to Chat API
ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number
http.client.IncompleteRead: IncompleteRead(290 bytes read, 400 more expected)
Failed to create initial card
After an instance was killed by OOM, a new instance starts up, but the HTTP connection pool from the previous instance was sometimes inherited in a corrupted state, causing cascading SSL errors and incomplete read errors. In other words: OOM → connection pool corruption → Chat API call failure → unable to even create initial card → no response — a cascade of failures.

3. Checking the Timeline
We checked all logs (including INFO) before and after the errors to understand the sequence of failures chronologically.
12:06:49 ERROR Memory limit of 1024 MiB exceeded with 1029 MiB used
12:06:50 INFO Starting new instance (AUTOSCALING)
...(normal operation)...
12:26:33 ERROR Failed to create initial card (SSLError)
12:26:33 ERROR Failed to create initial card (IncompleteRead)
...
12:29:56 ERROR Memory limit of 1024 MiB exceeded with 1033 MiB used
12:29:57 INFO Starting new instance (AUTOSCALING)
It was repeating the pattern: OOM → recovery → normal for a while → OOM again. This matched the reports of "sometimes no answer is returned."
Visualizing Memory Usage
Now that OOM was identified as the cause, we used the Cloud Monitoring API to retrieve memory usage trends to determine the appropriate memory limit.
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries?filter=..." \
| python3 -c "..." # Extract averages from distribution data
Results:
| Phase | Memory Usage | Percentage of 1Gi |
|---|---|---|
| Immediately after cold start | ~89 MiB | 9% |
| After SDK initialization (idle) | ~700 MiB | 70% |
| During query processing | ~880 MiB | 86% |
| Peak (maximum 1-minute average) | ~1011 MiB | 98.8% |
Even in idle state, it was already consuming 70%, and only a slight increase in memory during query processing was enough to trigger OOM.
Why It Becomes a Silent Failure
This bot processes requests in a background thread and returns {} as the HTTP response immediately. When an instance is killed by OOM, the background thread disappears with it, but from Google Chat's perspective, the HTTP request succeeded (200 OK), so no error like "app is not responding" is displayed. Users receive no notification; the answer simply doesn't come back.
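The pattern described here can be reduced to a few lines. This is a simplified sketch, not the bot's actual handler: the real entry point receives an HTTP request object, and slow_worker is a stand-in for the RAG search and answer generation.

```python
import threading


def handle_chat(event: dict, worker) -> dict:
    """Acknowledge the event immediately and run the slow work in the background."""
    thread = threading.Thread(target=worker, args=(event,), daemon=True)
    thread.start()
    # The caller gets a successful, empty response no matter what
    # happens to the background thread afterwards.
    return {}


done = threading.Event()


def slow_worker(event: dict) -> None:
    """Stand-in for RAG search + answer generation (hypothetical)."""
    done.set()


response = handle_chat({"text": "hello"}, slow_worker)
```

If the process is killed between the return and the end of the worker, nothing in this flow reports the failure, which is why the OOM only showed up as a missing answer.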
Thinking About Memory Sizing
Google Cloud best practices recommend keeping peak memory usage within 50–80% of the allocated limit. Below 50% is excessive cost, above 80% risks OOM during spikes.
| Memory Limit | Idle | Peak | Verdict |
|---|---|---|---|
| 512Mi | 137% | 197% | OOM |
| 1Gi | 70% | 98.8% | OOM occurring |
| 1.5Gi | 46% | 67% | Marginal (OOM risk from sub-second spikes) |
| 2Gi | 34% | 49% | Within recommended range |
| 4Gi | 17% | 25% | Excessive |
By increasing to 2Gi, peak usage was kept around 50%, providing sufficient headroom.
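The percentages in the table follow directly from the measured idle and peak values. A minimal sketch of the arithmetic, assuming the approximate measurements above (about 700 MiB idle, about 1011 MiB peak):

```python
IDLE_MIB = 700    # after SDK initialization (approximate measurement)
PEAK_MIB = 1011   # maximum 1-minute average (approximate measurement)


def utilization(limit_mib: int) -> tuple[float, float]:
    """Return (idle %, peak %) of the given memory limit."""
    return (
        round(IDLE_MIB / limit_mib * 100, 1),
        round(PEAK_MIB / limit_mib * 100, 1),
    )


limits = {"512Mi": 512, "1Gi": 1024, "1.5Gi": 1536, "2Gi": 2048, "4Gi": 4096}
sizing = {name: utilization(mib) for name, mib in limits.items()}
```

Because the inputs are rounded measurements, the results can differ from the table by a point or two (for example, idle at 1Gi computes to about 68% rather than 70%), but the ordering of the candidates is the same.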
Python's (CPython) internal memory allocator (pymalloc) retains a pool of memory for released objects and rarely returns it to the OS. In other words, even after processing completes and threads terminate, the process RSS (Resident Set Size) does not decrease. Even if you explicitly call gc.collect(), Python's internal objects are released, but the memory usage as seen from the OS does not change. For this reason, peak memory usage becomes the resident memory, and the memory limit must be set according to the peak.
Deployment Command
gcloud functions deploy YOUR_FUNCTION_NAME \
--gen2 --runtime=python312 --region=asia-northeast1 \
--source=. --entry-point=handle_chat \
--trigger-http --no-allow-unauthenticated \
--memory=2Gi --cpu=1
# First time only: Disable CPU throttling (for background threads)
gcloud run services update YOUR_FUNCTION_NAME \
--region=asia-northeast1 --no-cpu-throttling
The Cloud Run annotation run.googleapis.com/cpu-throttling: 'false' is maintained in subsequent deployments once set. You only need to run gcloud run services update --no-cpu-throttling once during initial setup.
Environment Variable Management
RAG configuration values are managed as environment variables. Once set with --set-env-vars in Cloud Functions, they are carried over in subsequent deployments.
# First time only
gcloud functions deploy YOUR_FUNCTION_NAME \
... \
--set-env-vars="GCP_PROJECT_ID=your-project,GCP_LOCATION=asia-northeast1,RAG_CORPUS_ID=your-corpus-id,GEMINI_MODEL_ID=gemini-2.5-flash"
| Variable | Purpose |
|---|---|
| GCP_PROJECT_ID | Project ID used for Vertex AI SDK initialization |
| GCP_LOCATION | Region specification (e.g., asia-northeast1) |
| RAG_CORPUS_ID | RAG corpus ID to search |
| GEMINI_MODEL_ID | Gemini model used for answer generation |
Summary
We implemented knowledge base search by connecting the Google Chat Bot to the Vertex AI RAG Engine.
| Step | Content |
|---|---|
| Data organization | QA data → organized into approximately 300 KB entries |
| RAG Engine setup | GCS → corpus creation → file import |
| Embedding model selection | Changed from text-embedding-005 → text-multilingual-embedding-002 |
| Pipeline integration | Search → confidence judgment → generation (fixed message for low confidence) |
| Deployment | Memory incrementally increased from 512Mi → 1Gi → 2Gi |
The biggest lesson learned is that when building RAG with Japanese content, you should use text-multilingual-embedding-002 rather than text-embedding-005. Since text-embedding-005 works fine with English and katakana technical terms, this is a trap that can go unnoticed depending on your test cases.
Another lesson is that the quality of KB data determines the upper limit of answer quality. Given that the source data is in a staff note format, there are limits to how much you can instruct Gemini to convert it via system prompts. The most effective way to get better answers is to improve the quality of the source data itself.
References
- Vertex AI RAG Engine overview | Google Cloud
- Choose an embeddings task type | Google Cloud
- text-multilingual-embedding-002 | Google Cloud
- Part 1: Building a Google Chat Bot with Cloud Functions + Python + uv in a Minimal Configuration
- Part 2: The Story of Hitting Wall After Wall When Implementing Progressive UX with cardsV2 in Google Chat Bot
