
The story of hitting walls everywhere when implementing progressive UX with cardsV2 in Google Chat Bot
Introduction
In the previous article, we built a Google Chat Bot as a minimal echo bot using Cloud Functions + Python + uv.
This time, I'll share the trial and error involved in trying to turn that bot into a RAG (Retrieval-Augmented Generation) pipeline, hitting one Google Chat limitation after another, and ultimately arriving at a progressive update UX using cardsV2.

Configuration
| Item | Choice |
|---|---|
| Runtime | Cloud Functions 2nd generation |
| Language | Python 3.14 |
| Package manager | uv |
| Region | asia-northeast1 (Tokyo) |
| CPU | 1 vCPU (--no-cpu-throttling) |
| Memory | 512Mi |
Obstacle 1: Google Chat's Message API Is Synchronous
This was the first wall I hit when trying to build a RAG bot.
In typical AI application development, streaming generated text to the user in real time is taken for granted. With Discord bots, you can easily add a reaction (👀) to a message to indicate "processing" and then edit the message when done.
However, with Google Chat's HTTP endpoint approach:
- Streaming is not supported: you simply return a single response to an HTTP request
- Messages can only be created via synchronous response: the HTTP response is the bot's reply
- Bots cannot add reactions: the Chat API's reactions.create only supports user authentication (OAuth) and cannot be called with bot authentication (the chat.bot scope). The Discord-style pattern of adding 👀 to indicate processing is not available
I actually tried adding the chat.messages.reactions.create scope, but got a 403 error with ACCESS_TOKEN_SCOPE_INSUFFICIENT. The documentation clearly states that reactions require user authentication. For a bot to add reactions, either the user's OAuth credentials or domain-wide delegation by a Workspace administrator is required.
In other words, for processes that take several seconds to tens of seconds like a RAG pipeline, users are left waiting with no feedback at all during that time.
User: "Tell me about internal regulations"
↓
(5–10 seconds of silence)
↓
Bot: "Response text"
This makes for a poor UX. I started looking into whether there was a way around this.
Obstacle 2: The cardsV2 and Chat API "Patch" Pattern
After looking into it, I found that Google Chat has a rich UI component called cardsV2, and that using the Chat API, you can patch (update) a message after it's been created.
In other words, the following flow becomes possible:
1. Return {} in the HTTP response (return immediately to avoid timeout)
2. Call the Chat API in a background thread to create a "Processing..." card
3. Patch the card to show progress as the pipeline advances
4. Patch the card with the final result upon completion
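The four steps above can be sketched as follows. This is a minimal illustration rather than the bot's actual code: `FakeChatClient`, its `create_message` and `patch_message` methods, `card_for`, and `handle_event` are hypothetical names, and the fake client stands in for the real Chat API so the flow can be run locally. Throttling is left out here on purpose.

```python
import threading


class FakeChatClient:
    """Stand-in for a Chat API wrapper (hypothetical); records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def create_message(self, space_name, body):
        self.calls.append(("create", body))
        return {"name": f"{space_name}/messages/demo"}

    def patch_message(self, message_name, body, update_mask):
        self.calls.append(("patch", body))


def card_for(text):
    """Build a one-widget cardsV2 body showing `text` (illustrative layout)."""
    return {
        "cardsV2": [{
            "cardId": "my-card",
            "card": {"sections": [{"widgets": [{"textParagraph": {"text": text}}]}]},
        }]
    }


def run_pipeline(chat_client, space_name, steps):
    """Steps 2-4: create the card, patch it per step, then patch the final result."""
    message = chat_client.create_message(space_name, card_for("Processing..."))
    for step in steps:
        chat_client.patch_message(message["name"], card_for(step), "cardsV2")
    chat_client.patch_message(message["name"], card_for("Done"), "cardsV2")


def handle_event(chat_client, space_name, steps):
    """Step 1: start the background work and return an empty body right away."""
    worker = threading.Thread(target=run_pipeline, args=(chat_client, space_name, steps))
    worker.start()
    return {}, worker


client = FakeChatClient()
response, worker = handle_event(client, "spaces/SPACE_ID", ["Searching", "Generating"])
worker.join()  # only for this demo; a real handler would not wait for the thread
```

In a real handler the thread is not joined, which is exactly why the CPU allocation after the response matters later in this article.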

Calling the Chat API from Code
To use the Chat API from Python, use build() from google-api-python-client.
from googleapiclient.discovery import build
import google.auth
SCOPES = ["https://www.googleapis.com/auth/chat.bot"]
credentials, _ = google.auth.default(scopes=SCOPES)
service = build("chat", "v1", credentials=credentials)
This allows you to create and update messages.
# Create a message
response = service.spaces().messages().create(
    parent="spaces/SPACE_ID",
    body={"cardsV2": [{"cardId": "my-card", "card": {...}}]}
).execute()

# Update (patch) a message
service.spaces().messages().patch(
    name=response["name"],
    updateMask="cardsV2",
    body={"cardsV2": [{"cardId": "my-card", "card": {...}}]}
).execute()
Collapsible Sections in cardsV2
cardsV2 has a collapsible property that allows widgets within a section to be collapsible. Using this, you can create a UI that displays pipeline step history in an accordion while always showing the current status.
{
    "collapsible": True,
    "uncollapsibleWidgetsCount": 1,  # The first one is always shown
    "widgets": [
        # ↓ Always shown (status)
        {"decoratedText": {"text": "Generating response..."}},
        # ↓ Inside collapsed section (step history)
        {"decoratedText": {"text": '<font color="#00C853">✅ Analyzing inquiry</font>'}},
        {"decoratedText": {"text": '<font color="#00C853">✅ Creating search query</font>'}},
        {"decoratedText": {"text": '<font color="#2979FF">⏳ Generating response</font>'}},
    ]
}
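A section like this is easier to keep consistent when it is generated from the step list. The helper below is a sketch under assumptions: the function names, the string status values, and the styling for pending steps are illustrative and not taken from the bot's cards.py. The two colors are the ones used in the example above.

```python
DONE_COLOR = "#00C853"     # green, as in the example above
RUNNING_COLOR = "#2979FF"  # blue, as in the example above


def step_widget(label, status):
    """Render one history line; `status` is "done", "running" or "pending"."""
    if status == "done":
        text = f'<font color="{DONE_COLOR}">✅ {label}</font>'
    elif status == "running":
        text = f'<font color="{RUNNING_COLOR}">⏳ {label}</font>'
    else:
        text = f"⬜ {label}"  # assumed styling for steps that have not started
    return {"decoratedText": {"text": text}}


def build_progress_section(status_text, steps):
    """Return a collapsible section: status line on top, step history folded below."""
    widgets = [{"decoratedText": {"text": status_text}}]
    widgets += [step_widget(label, status) for label, status in steps]
    return {
        "collapsible": True,
        "uncollapsibleWidgetsCount": 1,  # only the status line stays visible
        "widgets": widgets,
    }


section = build_progress_section(
    "Generating response...",
    [
        ("Analyzing inquiry", "done"),
        ("Creating search query", "done"),
        ("Generating response", "running"),
    ],
)
```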

Rate Limit: 1 write/sec/space
There is an important caveat here. The Google Chat API has a rate limit of 1 write per second per space. Both create and patch share this quota.
This means you cannot send a patch for every token like LLM token streaming. You need to patch at appropriate granularity, such as at step transitions or paragraph boundaries.
To handle this, I created a class called ThrottledPatcher.
import time


class ThrottledPatcher:
    def __init__(self, chat_client, message_name, min_interval=1.0):
        self._chat_client = chat_client
        self._message_name = message_name
        self._min_interval = min_interval
        self._last_patch_time = 0.0
        self._buffered_body = None

    def patch(self, body, force=False):
        now = time.monotonic()
        elapsed = now - self._last_patch_time
        if force or elapsed >= self._min_interval:
            if force and elapsed < self._min_interval:
                # Forced patches still wait out the rest of the interval
                time.sleep(self._min_interval - elapsed)
            self._chat_client.patch_message(
                self._message_name, body, "cardsV2"
            )
            self._last_patch_time = time.monotonic()
            self._buffered_body = None
        else:
            self._buffered_body = body  # latest-wins buffer

    def flush(self):
        if self._buffered_body is not None:
            remaining = self._min_interval - (
                time.monotonic() - self._last_patch_time
            )
            if remaining > 0:
                time.sleep(remaining)
            self._chat_client.patch_message(
                self._message_name, self._buffered_body, "cardsV2"
            )
            # Record the send so a following patch() also respects the interval
            self._last_patch_time = time.monotonic()
            self._buffered_body = None
The key point is the "latest-wins" buffer strategy. When several patches arrive within the rate-limit window, only the most recent state is kept; it is sent by flush() once the interval has elapsed, or superseded by the next patch() call that is allowed through. There is no need to send every intermediate state, because the user only needs to see the latest progress.
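To watch the latest-wins behavior in isolation without waiting on real time, the same idea can be written with an injectable clock. This is an illustrative variant, not the class above: `LatestWinsThrottle`, `offer`, and `FakeClock` are hypothetical names, and `send` is any callable that delivers one item.

```python
import time


class LatestWinsThrottle:
    """Send at most one item per `min_interval`; keep only the newest unsent item."""

    def __init__(self, send, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self._send = send
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_sent = None  # None means nothing has been sent yet
        self._pending = None

    def offer(self, item):
        """Send `item` now if the interval has passed; otherwise buffer it."""
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self._min_interval:
            self._send(item)
            self._last_sent = now
            self._pending = None
            return True
        self._pending = item  # overwrites any older buffered item
        return False

    def flush(self):
        """Send the buffered item, waiting out the rest of the interval first."""
        if self._pending is None:
            return
        remaining = self._min_interval - (self._clock() - self._last_sent)
        if remaining > 0:
            self._sleep(remaining)
        self._send(self._pending)
        self._last_sent = self._clock()
        self._pending = None


class FakeClock:
    """Manually advanced clock so the demo runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
sent = []
throttle = LatestWinsThrottle(sent.append, clock=clock, sleep=clock.sleep)
throttle.offer("step 1")  # nothing sent before, so this goes out immediately
clock.now = 0.3
throttle.offer("step 2")  # inside the interval: buffered
clock.now = 0.6
throttle.offer("step 3")  # replaces "step 2" in the buffer
throttle.flush()          # waits out the interval, then sends "step 3"
```

Only "step 1" and "step 3" are delivered; "step 2" is dropped because a newer state replaced it before the interval elapsed.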
Obstacle 3: Cold Start Is Slow Due to Discovery Document Download
Once the cardsV2 + patch pattern was working and I happily deployed it, I noticed that during a cold start, it took about 2 minutes for the first card to appear.

When returning plain text messages in the previous article, the response came back almost instantly even on a cold start. What was different?

The cause was build("chat", "v1").
google-api-python-client's build() downloads the API definition (Discovery Document) over the network from Google's servers. This file is about 410KB, and downloading it takes time in Cloud Functions' low-spec environment (default 0.17 vCPU).
Return HTTP response (immediately) → Start background thread
→ Download Discovery Document with build("chat", "v1") (~2 minutes)
→ Call card creation API
→ Card appears for user
First Attempt: Synchronous Message + Async Card
The first thing I tried was returning a plain text synchronous response on cold start, then creating a cardsV2 message once the Chat API was ready.
import threading

import functions_framework


# is_warm(), process_message() and create_message() are defined elsewhere in the bot
@functions_framework.http
def handle_chat(request):
    body = request.get_json(silent=True)
    # ...
    if not is_warm():
        # Cold start: return a text message synchronously
        thread = threading.Thread(target=process_message, args=(...,))
        thread.start()
        return create_message("Processing started. Please wait a moment...")
    else:
        # Warm: create card in the background
        thread = threading.Thread(target=process_message, args=(...,))
        thread.start()
        return {}
However, this didn't work either.

Obstacle 4: CPU Throttling in Cloud Functions Gen 2
After deploying and testing, the synchronous message returned immediately, but the background thread processing never executed at all. No errors. Just silence.
Checking the Cloud Run logs, the request was being received, but there were no logs at all for the Chat API calls that followed.
After investigating, the cause was CPU throttling in Cloud Functions 2nd generation (Cloud Run-based).
gcloud run services describe google-chat-bot --region=asia-northeast1 \
  --format="yaml(spec.template.metadata.annotations)"
annotations:
  run.googleapis.com/cpu-throttling: "true" # ← This was the cause
Cloud Functions Gen 2 = Cloud Run
An important prerequisite here: Cloud Functions 2nd generation is Cloud Run itself. In AWS terms, it's not a proprietary runtime like Lambda but rather something closer to Fargate. When you deploy with gcloud functions deploy, it's internally deployed as a Cloud Run service.
# Even though it's a Cloud Function, it shows up as a Cloud Run service
gcloud run services list --region=asia-northeast1
# NAME: google-chat-bot ← A Cloud Run service with the same name exists
And Cloud Run has CPU throttling enabled by default. This is a mechanism that allocates CPU resources only while processing HTTP requests and reduces the CPU to nearly zero after returning the response.
Why Does This Affect Background Threads?
Recall the bot's architecture:
1. Receive HTTP request
2. Launch background thread
3. Return HTTP response {} immediately ← CPU allocation drops sharply here
4. Call Chat API in background thread ← Almost no CPU available
CPU throttling doesn't "freeze" threads. Threads stay alive, but the allocated CPU drops dramatically. From the default 0.17 vCPU it goes to nearly zero, making network I/O and response parsing extremely slow.
In fact, the simple pattern implemented in the previous article (create thinking card → 1 patch) only made 2 API calls, so it somehow managed to work despite this constraint. But with progressive cards, more than 10 patches occur over several seconds, causing extreme delay from CPU starvation and making it effectively appear unresponsive.
| Pattern | Number of API calls | Behavior under CPU throttling |
|---|---|---|
| thinking → patch (v0.1.0) | 2 | Slow but eventually completes |
| Progressive card | 10+ | Extremely delayed, effectively unresponsive |
Solution: Always-On CPU Allocation
Setting --no-cpu-throttling keeps CPU allocated even after the HTTP response is returned, so background threads keep running with the instance's full CPU allocation.
However, --no-cpu-throttling requires a minimum of 1 vCPU. --cpu=1 is not needed for performance — it's a prerequisite for enabling --no-cpu-throttling.
# Step 1: Deploy with increased CPU and memory
gcloud functions deploy google-chat-bot \
  --gen2 \
  --runtime=python314 \
  --region=asia-northeast1 \
  --source=. \
  --entry-point=handle_chat \
  --trigger-http \
  --no-allow-unauthenticated \
  --memory=512Mi \
  --cpu=1
# Step 2: Disable CPU throttling
gcloud run services update google-chat-bot \
  --region=asia-northeast1 \
  --no-cpu-throttling
Note: --no-cpu-throttling is not supported in the gcloud functions deploy command, so it needs to be set separately with gcloud run services update. It's precisely because Cloud Functions Gen 2 is Cloud Run itself that you can configure it directly with gcloud run commands.
Cost Considerations
Raising the CPU from 0.17 vCPU to 1 vCPU and switching to always-on allocation will increase costs. However, if min-instances=0 (default) is kept, instances scale to zero when there are no requests, so actual costs remain very low.
| Setting | min-instances | Monthly cost (Tokyo region) |
|---|---|---|
| 0.17 vCPU, 256Mi (default) | 0 | Nearly free |
| 1 vCPU, 512Mi, --no-cpu-throttling | 0 | ~$0.55 USD/month (billed only during requests) |
| 1 vCPU, 512Mi, --no-cpu-throttling | 1 | ~$50 USD/month (always running) |
This time I'm using min-instances=0. With roughly 100 requests per day at about 10 seconds of processing per request, costs come to around $0.55 USD/month. The decision is to tolerate cold start latency (a few seconds) and minimize costs.
For production environments requiring consistently low latency, setting min-instances=1 solves the problem, but incurs a cost of around $50 USD/month.
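The rough size of the min-instances=0 figure can be checked with simple arithmetic. The sketch below assumes, as the table does, that time is billed only while requests are being processed, and it ignores any idle time an instance stays up afterwards. The two per-second rates are assumed values for illustration only, so check the current Cloud Run price list before relying on them; the function names are hypothetical.

```python
def monthly_billable_seconds(requests_per_day, seconds_per_request, days=30):
    """Instance-seconds billed per month if billing stops when each request ends."""
    return requests_per_day * seconds_per_request * days


def monthly_cost_usd(billable_seconds, vcpu, memory_gib, vcpu_rate, gib_rate):
    """Cost = seconds x (vCPU x per-vCPU-second rate + GiB x per-GiB-second rate)."""
    return billable_seconds * (vcpu * vcpu_rate + memory_gib * gib_rate)


# ASSUMED rates for illustration only; look up the current Cloud Run price list.
ASSUMED_VCPU_RATE = 0.000018  # USD per vCPU-second
ASSUMED_GIB_RATE = 0.000002   # USD per GiB-second

seconds = monthly_billable_seconds(requests_per_day=100, seconds_per_request=10)
cost = monthly_cost_usd(
    seconds,
    vcpu=1,
    memory_gib=0.5,
    vcpu_rate=ASSUMED_VCPU_RATE,
    gib_rate=ASSUMED_GIB_RATE,
)
print(f"{seconds} billable seconds, about ${cost:.2f}/month")
```

With these assumed rates the result is in the same range as the figure quoted above, which is why scaling to zero keeps the always-on CPU setting cheap at this traffic level.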
Resolving Obstacle 3: Bundling the Discovery Document
Even after resolving CPU throttling to get background threads working, the problem of the 2-minute Discovery Document download on cold start remained.
The solution was simple: bundle the Discovery Document as a static file in the project.
# Download the Discovery Document
curl -o chat_discovery.json \
'https://chat.googleapis.com/$discovery/rest?version=v1'
import json
from pathlib import Path

import google.auth
from googleapiclient.discovery import build_from_document

# SCOPES is the chat.bot scope list defined earlier
_DISCOVERY_DOC_PATH = Path(__file__).parent / "chat_discovery.json"

def _get_default_service():
    credentials, _ = google.auth.default(scopes=SCOPES)
    # Parse the bundled file instead of fetching the document over the network
    doc = json.loads(_DISCOVERY_DOC_PATH.read_text())
    return build_from_document(doc, credentials=credentials)
By using build_from_document() instead of build(), network access is completely eliminated. The initial card now appears within a few seconds even on cold start.
This file is about 410KB. Note that upgrading to --cpu=1 also reduces the build() download from 2 minutes to a few seconds, but that's still far from the < 100ms of build_from_document(). This optimization remains effective separately from the CPU increase.
What If the Discovery Document Gets Stale?
The Discovery Document is like a "map" describing API URL paths, parameter names, and request/response schemas. Since Google's REST APIs maintain strict backward compatibility, the signatures of existing methods (spaces.messages.create, spaces.messages.patch) essentially never change.
Existing functionality continues to work with an older Discovery Document; you simply won't be able to use new API features. It should work fine for several months to a year, but it is worth re-downloading it at the following times:
- When upgrading google-api-python-client
- When you want to use new Chat API features
- As part of periodic maintenance every few months
# Re-download periodically
curl -o chat_discovery.json \
  'https://chat.googleapis.com/$discovery/rest?version=v1'
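One way to make the "every few months" guideline checkable is to look at the age of the bundled file. The sketch below assumes the Discovery Document carries a top-level `revision` field in YYYYMMDD form; verify this against your own chat_discovery.json. The function names, the 180-day threshold, and the sample document are made up for illustration.

```python
from datetime import date, datetime


def discovery_doc_age_days(doc, today=None):
    """Days since the bundled document's `revision` date (assumed YYYYMMDD)."""
    revision = datetime.strptime(doc["revision"], "%Y%m%d").date()
    return ((today or date.today()) - revision).days


def is_stale(doc, max_age_days=180, today=None):
    """True when the bundled document is older than `max_age_days`."""
    return discovery_doc_age_days(doc, today) > max_age_days


sample_doc = {"name": "chat", "version": "v1", "revision": "20250101"}  # made-up sample
age = discovery_doc_age_days(sample_doc, today=date(2025, 7, 1))
```

A check like this could run in CI or at deploy time and print a reminder, rather than failing requests at runtime.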
Obstacle 5: The Feedback Button Trap (A Triple Trap)
With progressive cards working, I added buttons for users to give feedback on response quality. However, getting these buttons to work involved stepping into 3 traps.
Trap 1: action.function Must Be the Full Endpoint URL
The button definition I first wrote looked like this:
# ❌ Doesn't work: specifying a function name in function
{
    "onClick": {
        "action": {
            "function": "feedback",
            "parameters": [
                {"key": "vote", "value": "up"},
            ]
        }
    }
}
Clicking the button displayed the error "[bot name] cannot process the request". No logs arrived at the endpoint either.
The cause was that with HTTP endpoint-style Workspace Add-ons, action.function must contain the bot's endpoint URL itself. Google Chat POSTs a CARD_CLICKED event to this URL.
# ✅ Correct: specify the full HTTPS URL in function
{
    "onClick": {
        "action": {
            "function": "https://asia-northeast1-PROJECT.cloudfunctions.net/google-chat-bot",
            "parameters": [
                {"key": "action", "value": "feedback"},
                {"key": "vote", "value": "up"},
            ]
        }
    }
}
With the Apps Script or Dialogflow approaches you specify a function name in function, but with the HTTP endpoint approach you specify a URL. This distinction is hard to find in the documentation.
Note that since this URL comes through as-is in invokedFunction, routing is determined by adding an action key to parameters.
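Routing on the action parameter can be as small as a lookup table. The sketch below assumes the button parameters have already been extracted from the CARD_CLICKED event into a plain dict; where they sit in the payload depends on the event format and is not shown here. `HANDLERS`, `handle_feedback`, and `route_card_click` are hypothetical names.

```python
def handle_feedback(params):
    """Hypothetical handler: report which vote was clicked."""
    return f"feedback:{params.get('vote', 'unknown')}"


# Map each "action" value to its handler (only one in this sketch).
HANDLERS = {"feedback": handle_feedback}


def route_card_click(params):
    """Dispatch on the "action" parameter; return None for unknown actions."""
    handler = HANDLERS.get(params.get("action"))
    return handler(params) if handler else None


result = route_card_click({"action": "feedback", "vote": "up"})
```

Returning None for unknown actions lets the caller decide how to respond instead of raising inside the HTTP handler.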
Trap 2: Constructing the Correct URL from Inside Cloud Functions
To avoid hardcoding the endpoint URL, I tried building it dynamically from the request.
# ❌ request.base_url returns the internal URL
endpoint_url = request.base_url
# → "http://localhost:8080/" (URL of Cloud Run's internal proxy)
Cloud Functions Gen 2 runs on top of Cloud Run, and the request that Flask receives is forwarded from an internal proxy. request.base_url returns the internal localhost:8080 rather than the external URL.
You can get the hostname using X-Forwarded-Host and X-Forwarded-Proto headers, but there's another trap:
# ❌ request.path returns "/"
host = request.headers.get("X-Forwarded-Host") # "asia-northeast1-PROJECT.cloudfunctions.net"
scheme = request.headers.get("X-Forwarded-Proto") # "https"
path = request.path # "/" ← Not "/google-chat-bot"!
The Cloud Functions runtime strips the function name path prefix before passing the request to Flask. Even if the external URL has /google-chat-bot, request.path as seen from Flask is /.
The solution is to use the K_SERVICE environment variable. In Cloud Functions Gen 2 (= Cloud Run), this environment variable is automatically set to the service name (= function name).
import os
host = request.headers.get("X-Forwarded-Host") or request.headers.get("Host", "")
scheme = request.headers.get("X-Forwarded-Proto", "https")
service = os.environ.get("K_SERVICE", "")
endpoint_url = f"{scheme}://{host}/{service}" if host else ""
# → "https://asia-northeast1-PROJECT.cloudfunctions.net/google-chat-bot"
Trap 3: The Response Format Is actionResponse
Even after getting the URL right and events arriving at the endpoint, things won't work if the response format is wrong.
// ❌ renderActions is for dialogs
{"renderActions": {"action": {"navigations": [{"updateCard": {...}}]}}}
// ❌ updateMessageAction is for synchronous response message updates
{"hostAppDataAction": {"chatDataAction": {"updateMessageAction": {"message": {...}}}}}
The correct response format for CARD_CLICKED events is actionResponse:
// ✅ Correct response to CARD_CLICKED
{
  "actionResponse": {"type": "UPDATE_MESSAGE"},
  "cardsV2": [{
    "cardId": "progressive-card",
    "card": {
      "sections": [{
        "widgets": [{
          "textParagraph": {"text": "Thank you for your feedback!"}
        }]
      }]
    }
  }]
}
To summarize:
| Operation | Response format |
|---|---|
| Synchronous message creation | hostAppDataAction.chatDataAction.createMessageAction |
| Message update on CARD_CLICKED | actionResponse: {type: "UPDATE_MESSAGE"} + cardsV2 |
| Show dialog | renderActions.action.navigations[].pushCard |
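The CARD_CLICKED row of the table can be produced by a small helper that mirrors the JSON shown earlier. This is a sketch: `build_feedback_ack` is a hypothetical name, and only the fields that appear in the example above are used.

```python
import json


def build_feedback_ack(card_id, text):
    """Build an UPDATE_MESSAGE response that replaces the card with `text`."""
    return {
        "actionResponse": {"type": "UPDATE_MESSAGE"},
        "cardsV2": [{
            "cardId": card_id,
            "card": {
                "sections": [{
                    "widgets": [{"textParagraph": {"text": text}}]
                }]
            },
        }],
    }


ack = build_feedback_ack("progressive-card", "Thank you for your feedback!")
payload = json.dumps(ack)  # the string a handler would return as the HTTP body
```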
Final Architecture
Here is the final configuration after overcoming all the obstacles.
main.py → HTTP handler (returns {}, starts thread, routes CARD_CLICKED)
worker.py → Pipeline orchestration (with step tracking)
cards.py → cardsV2 builder (progressive card)
models.py → Pipeline data models (StepStatus, PipelineState)
throttle.py → Rate-limit-aware patcher (1 write/sec/space)
feedback.py → CARD_CLICKED event handler
chat_api.py → Chat API wrapper (static Discovery Document)
chat_discovery.json → Bundled Chat API v1 Discovery Document
Pipeline Flow
1. Receive HTTP request → Return {} immediately
2. Start pipeline in background thread
3. Create initial card (showing 4 steps, all PENDING)
4. Patch card as each step progresses
   - Analyzing inquiry → ✅
   - Creating search query → ✅
   - Searching knowledge base → ✅
   - Generating response → ✅
5. Patch while adding response paragraphs
6. Show feedback buttons on completion
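The state that drives this flow can be modeled in a few lines. The names StepStatus and PipelineState come from the file list above, but everything inside them (the enum values, the fields, and the methods) is an assumption made for illustration, not the contents of the actual models.py.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"


def _initial_steps():
    """All four steps start as PENDING, matching the initial card."""
    return {
        "Analyzing inquiry": StepStatus.PENDING,
        "Creating search query": StepStatus.PENDING,
        "Searching knowledge base": StepStatus.PENDING,
        "Generating response": StepStatus.PENDING,
    }


@dataclass
class PipelineState:
    """Tracks the four pipeline steps and the response paragraphs so far."""

    steps: dict = field(default_factory=_initial_steps)
    paragraphs: list = field(default_factory=list)

    def start(self, label):
        self.steps[label] = StepStatus.RUNNING

    def finish(self, label):
        self.steps[label] = StepStatus.DONE

    @property
    def is_complete(self):
        """True once every step is DONE (the point where feedback buttons appear)."""
        return all(status is StepStatus.DONE for status in self.steps.values())


state = PipelineState()
for label in list(state.steps):
    state.start(label)
    # ... do the work for this step and patch the card here ...
    state.finish(label)
state.paragraphs.append("First paragraph of the answer.")
```

Keeping the state separate from the card builder means each patch can simply re-render the whole card from the current state.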
Deployment
# Step 1: Deploy the function
gcloud functions deploy google-chat-bot \
  --gen2 \
  --runtime=python314 \
  --region=asia-northeast1 \
  --source=. \
  --entry-point=handle_chat \
  --trigger-http \
  --no-allow-unauthenticated \
  --memory=512Mi \
  --cpu=1
# Step 2: Disable CPU throttling (required for background threads)
gcloud run services update google-chat-bot \
  --region=asia-northeast1 \
  --no-cpu-throttling
Summary
Implementing a progressive UX for a RAG pipeline in Google Chat Bot involved hitting 5 obstacles.
| Obstacle | Cause | Solution |
|---|---|---|
| Message API is synchronous | Streaming not supported | cardsV2 + Chat API patch |
| Cold start is slow | build() downloads Discovery Document | Use build_from_document() with static file |
| Background threads don't run | CPU throttling in Cloud Functions Gen 2 | --cpu=1 + --no-cpu-throttling |
| Rate limit | 1 write/sec/space | ThrottledPatcher (latest-wins buffer) |
| Feedback button errors (triple trap) | Function name in action.function / URL construction error / Wrong response format | Full URL + K_SERVICE + actionResponse |
Honestly, compared to Discord bots or Slack bots, the Google Chat Bot development experience is still rough around the edges. There are many places where documentation hasn't caught up with the Workspace Add-ons format, and many things can only be discovered through trial and error — such as how to use different response formats (createMessageAction / actionResponse / renderActions) and the fact that a full URL is required in action.function.
On the other hand, progressive updates with cardsV2 provide quite a good UX when they work. Users can see pipeline progress in real time, and feedback can be collected. For organizations using Google Workspace, I think this investment is well worth it.
Next Steps: Migration to Cloud Tasks
The current architecture works with background threads + --no-cpu-throttling, but when scaling to a full-fledged RAG pipeline, I'm considering migrating to Cloud Tasks.
Current: HTTP request → return {} → execute background thread in same instance
Future: HTTP request → enqueue to Cloud Tasks → worker processes as separate request
Migrating to Cloud Tasks offers the following benefits:
| Aspect | Current (background thread) | Cloud Tasks |
|---|---|---|
| CPU settings | --cpu=1 + --no-cpu-throttling required | Works with default settings |
| Concurrency | Threads compete for CPU within one instance | Each task runs in a separate instance |
| On failure | Fails silently, no retry | Automatic retry + Dead Letter Queue |
| Timeout | Constrained by HTTP timeout | Up to 30 minutes per task |
Particularly with RAG pipelines that include heavy processing like LLM calls and vector searches, resource contention under concurrent requests becomes a problem. With Cloud Tasks, each task arrives as its own HTTP request, so Cloud Run's autoscaling can spread the work across instances instead of having threads compete inside one.
Note that the Discovery Document (chat_discovery.json) is a static file bundled in the container image, so it's available as-is in worker instances launched from the same deployment, requiring no additional configuration.