
The story of hitting wall after wall when implementing progressive UX with cardsV2 in Google Chat Bot
Introduction
In the previous article, we built a Google Chat Bot as a minimal echo bot using Cloud Functions + Python + uv.
This time, when we tried to turn that bot into a RAG (Retrieval-Augmented Generation) pipeline, we ran into Google Chat's limitations one after another. This article introduces the trial and error that ultimately led us to a progressive update UX with cardsV2.

Configuration
| Item | Choice |
|---|---|
| Runtime | Cloud Functions 2nd gen |
| Language | Python 3.14 |
| Package manager | uv |
| Region | asia-northeast1 (Tokyo) |
| CPU | 1 vCPU (--no-cpu-throttling) |
| Memory | 512Mi |
Obstacle 1: Google Chat's Message API Is Synchronous
This was the first obstacle we hit when trying to build a RAG bot.
In typical AI application development, the standard UX is a streaming response: a "generating..." indicator appears and the text streams in as it is produced. With a Discord bot, you can easily attach a reaction (👀) to a message to indicate "processing" and then edit the message when done.
However, with Google Chat's HTTP endpoint method:
- Streaming is not supported: you return exactly one response per HTTP request
- Messages can only be created via the synchronous response: the HTTP response is the bot's reply
- Bots cannot add reactions: the Chat API's reactions.create only supports user authentication (OAuth) and cannot be called with bot authentication (the chat.bot scope). The Discord-style pattern of attaching 👀 to indicate processing is not available
We actually tried adding the chat.messages.reactions.create scope, but got a 403 error with ACCESS_TOKEN_SCOPE_INSUFFICIENT. Checking the documentation, reactions are clearly stated to "Require user authentication." For a bot to add reactions, you need either the user's OAuth credentials or domain-wide delegation by a Workspace admin.
In other words, for processes that take several seconds to tens of seconds like a RAG pipeline, the user has to wait the entire time with no feedback whatsoever.
User: "Tell me about company policies"
↓
(5-10 seconds of silence)
↓
Bot: "Response text"
This makes for a poor UX. We started looking for alternatives.
Obstacle 2: The cardsV2 and Chat API "Patch" Pattern
Upon investigation, we found that Google Chat has a rich UI component called cardsV2, and that the Chat API allows patching (updating) messages after creation.
This means the following flow can be achieved:
1. Return {} via HTTP response (return response immediately to avoid timeout)
2. Call the Chat API in a background thread to create a "processing..." card
3. Patch the card to show progress as the pipeline advances
4. Patch the card with the final result upon completion
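The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the bot's actual code: `FakeChatClient`, `text_card`, `run_pipeline`, and `handle_event` are hypothetical names, and the fake client records calls where a real bot would call the Chat API. It also ignores the write rate limit discussed later.

```python
import threading


class FakeChatClient:
    """Hypothetical stand-in for a Chat API wrapper; records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def create_message(self, space, body):
        self.calls.append(("create", space, body))
        return {"name": f"{space}/messages/demo"}

    def patch_message(self, name, body):
        self.calls.append(("patch", name, body))


def text_card(text):
    """Build a minimal cardsV2 body with a single text widget."""
    widget = {"textParagraph": {"text": text}}
    card = {"sections": [{"widgets": [widget]}]}
    return {"cardsV2": [{"cardId": "progress", "card": card}]}


def run_pipeline(client, space, steps):
    """Steps 2-4: create a placeholder card, patch once per step, then patch the result."""
    message = client.create_message(space, text_card("Processing..."))
    for step in steps:
        client.patch_message(message["name"], text_card(f"{step}..."))
    client.patch_message(message["name"], text_card("Done"))


def handle_event(client, space, steps):
    """Step 1: start the background work and return an empty response right away."""
    worker = threading.Thread(target=run_pipeline, args=(client, space, steps))
    worker.start()
    return {}, worker
```

Returning the thread alongside the empty response is only for the sake of the sketch, so a caller can `join()` it and inspect the recorded calls.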

Calling the Chat API from Code
To use the Chat API from Python, use build() from google-api-python-client.
from googleapiclient.discovery import build
import google.auth
SCOPES = ["https://www.googleapis.com/auth/chat.bot"]
credentials, _ = google.auth.default(scopes=SCOPES)
service = build("chat", "v1", credentials=credentials)
This enables creating and updating messages.
# Create message
response = service.spaces().messages().create(
    parent="spaces/SPACE_ID",
    body={"cardsV2": [{"cardId": "my-card", "card": {...}}]}
).execute()

# Update message (patch)
service.spaces().messages().patch(
    name=response["name"],
    updateMask="cardsV2",
    body={"cardsV2": [{"cardId": "my-card", "card": {...}}]}
).execute()
collapsible Sections in cardsV2
cardsV2 has a property called collapsible that allows widgets within a section to be collapsible. Using this, you can display pipeline step history as an accordion while keeping only the current status always visible.
{
    "collapsible": True,
    "uncollapsibleWidgetsCount": 1,  # First one is always shown
    "widgets": [
        # ↓ Always visible (status)
        {"decoratedText": {"text": "Generating answer..."}},
        # ↓ Inside collapse (step history)
        {"decoratedText": {"text": '<font color="#00C853">✅ Analyzing query</font>'}},
        {"decoratedText": {"text": '<font color="#00C853">✅ Creating search query</font>'}},
        {"decoratedText": {"text": '<font color="#2979FF">⏳ Generating answer</font>'}},
    ]
}
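A section like this is easier to keep consistent if it is generated from the pipeline's step list. The following is a minimal sketch under that assumption; `step_widget` and `build_progress_section` are hypothetical helper names, and the colors are the ones used in the example above.

```python
DONE_COLOR = "#00C853"     # green, as in the example above
RUNNING_COLOR = "#2979FF"  # blue, as in the example above


def step_widget(label, done):
    """Render one history line as a decoratedText widget with a colored status icon."""
    icon, color = ("✅", DONE_COLOR) if done else ("⏳", RUNNING_COLOR)
    return {"decoratedText": {"text": f'<font color="{color}">{icon} {label}</font>'}}


def build_progress_section(status_text, steps):
    """Build a collapsible section: the status line stays visible, the history collapses.

    `steps` is a list of (label, done) tuples in pipeline order.
    """
    widgets = [{"decoratedText": {"text": status_text}}]
    widgets += [step_widget(label, done) for label, done in steps]
    return {
        "collapsible": True,
        "uncollapsibleWidgetsCount": 1,  # only the status widget is always shown
        "widgets": widgets,
    }
```

Because the status widget is always first, `uncollapsibleWidgetsCount` can stay fixed at 1 no matter how many history lines accumulate.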

Rate Limit: 1 write/sec/space
There is an important caveat here. The Google Chat API has a rate limit of 1 write per second per space. Both create and patch share this quota.
This means you cannot send a patch for every token like LLM token streaming. You need to patch at an appropriate granularity, such as at step transitions or paragraph boundaries.
To handle this, we created a class called ThrottledPatcher.
import time


class ThrottledPatcher:
    """Rate-limit-aware patcher that keeps only the latest pending card state."""

    def __init__(self, chat_client, message_name, min_interval=1.0):
        self._chat_client = chat_client
        self._message_name = message_name
        self._min_interval = min_interval
        self._last_patch_time = 0.0
        self._buffered_body = None

    def patch(self, body, force=False):
        now = time.monotonic()
        elapsed = now - self._last_patch_time
        if force or elapsed >= self._min_interval:
            if force and elapsed < self._min_interval:
                # A forced patch still waits out the rest of the interval
                time.sleep(self._min_interval - elapsed)
            self._chat_client.patch_message(
                self._message_name, body, "cardsV2"
            )
            self._last_patch_time = time.monotonic()
            self._buffered_body = None
        else:
            self._buffered_body = body  # latest-wins buffer

    def flush(self):
        if self._buffered_body is not None:
            remaining = self._min_interval - (
                time.monotonic() - self._last_patch_time
            )
            if remaining > 0:
                time.sleep(remaining)
            self._chat_client.patch_message(
                self._message_name, self._buffered_body, "cardsV2"
            )
            # Record the send so a patch right after flush() is throttled too
            self._last_patch_time = time.monotonic()
            self._buffered_body = None
The key is the "latest-wins" buffer strategy. When multiple patches occur within the rate limit, only the latest state is retained and sent at the next available patch opportunity. There's no need to send every intermediate state — the user just needs to always see the latest progress.
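To see which states actually reach the API under a latest-wins policy, here is a small simulation. It is a simplified, clock-injected variant written for illustration, not the ThrottledPatcher class itself: `LatestWinsBuffer` and `simulate` are hypothetical names, timestamps are supplied by the caller, and nothing sleeps.

```python
class LatestWinsBuffer:
    """Simplified throttle with an injected clock (illustrative, never sleeps)."""

    def __init__(self, send, clock, min_interval=1.0):
        self._send = send
        self._clock = clock
        self._min_interval = min_interval
        self._last_sent = None
        self._pending = None

    def submit(self, body):
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self._min_interval:
            self._send(body)
            self._last_sent = now
            self._pending = None
        else:
            self._pending = body  # overwrite: only the newest state is kept

    def flush(self):
        """Send the buffered state, if any. A real patcher would first wait out the interval."""
        if self._pending is not None:
            self._send(self._pending)
            self._last_sent = self._clock()
            self._pending = None


def simulate(events, min_interval=1.0):
    """Feed (timestamp, body) pairs through the buffer and return what was sent."""
    sent = []
    current = {"t": 0.0}
    buffer = LatestWinsBuffer(sent.append, lambda: current["t"], min_interval)
    for timestamp, body in events:
        current["t"] = timestamp
        buffer.submit(body)
    buffer.flush()
    return sent
```

For example, `simulate([(0.0, "a"), (0.2, "b"), (0.5, "c"), (1.1, "d"), (1.3, "e")])` sends only `"a"`, `"d"`, and `"e"`: `"b"` and `"c"` are superseded before the interval elapses, and `"e"` goes out on the final flush.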
Obstacle 3: Cold Start Is Slow Due to Discovery Document Download
After cardsV2 + patch pattern was working and we happily deployed, we noticed that during cold starts, it took about 2 minutes for the first card to appear.

When returning plain text messages as in the previous article, responses were returned almost instantly even on cold starts. What was different?

The cause was build("chat", "v1").
google-api-python-client's build() downloads the API definition (Discovery Document) over the network from Google's servers. This file is about 410KB, and in Cloud Functions' low-spec environment (default 0.17 vCPU), the download takes a long time.
HTTP response returned (immediately) → Background thread starts
→ Download Discovery Document with build("chat", "v1") (~2 minutes)
→ Call card creation API
→ Card appears for user
First Attempt: Synchronous Message + Async Card
The first thing we tried was to return a plain text message synchronously on cold start, then create a cardsV2 message once the Chat API is ready.
@functions_framework.http
def handle_chat(request):
    body = request.get_json(silent=True)
    # ...
    if not is_warm():
        # Cold start: return text message synchronously
        thread = threading.Thread(target=process_message, args=(...,))
        thread.start()
        return create_message("Processing started. Please wait...")
    else:
        # Warm: create card in background
        thread = threading.Thread(target=process_message, args=(...,))
        thread.start()
        return {}
However, this didn't work either.

Obstacle 4: CPU Throttling in Cloud Functions Gen 2
After deploying and testing, we found that while the synchronous message returned instantly, the background thread processing never executed at all. No errors either. Just silence.
Checking the Cloud Run logs, the request was received, but there were no logs at all from the subsequent Chat API calls.
After investigating the cause, it turned out to be CPU throttling in Cloud Functions 2nd generation (Cloud Run-based).
gcloud run services describe google-chat-bot --region=asia-northeast1 \
--format="yaml(spec.template.metadata.annotations)"
annotations:
run.googleapis.com/cpu-throttling: "true" # ← This was the cause
Cloud Functions Gen 2 = Cloud Run
An important prerequisite here is that Cloud Functions 2nd generation is Cloud Run itself. In AWS terms, it's not a proprietary runtime like Lambda but rather closer to Fargate. Deploying with gcloud functions deploy internally deploys as a Cloud Run service.
# It's a Cloud Function, but it shows up as a Cloud Run service
gcloud run services list --region=asia-northeast1
# NAME: google-chat-bot ← Exists as a Cloud Run service with the same name
And Cloud Run has CPU throttling enabled by default. This is a mechanism that allocates CPU resources only while processing HTTP requests and reduces the CPU to nearly zero after returning a response.
Why Does This Affect Background Threads?
Recall the bot's architecture:
1. HTTP request received
2. Background thread started
3. HTTP response {} returned immediately ← CPU allocation drops sharply here
4. Background thread calls Chat API ← Almost no CPU available
CPU throttling doesn't "freeze" threads. Threads stay alive, but the allocated CPU drops dramatically. Starting from the default 0.17 vCPU and dropping to nearly zero, operations like network I/O and response parsing become extremely slow.
In fact, the simple pattern implemented in the previous article (create thinking card → 1 patch) only made 2 API calls, so it managed to work under this constraint. However, with progressive cards, 10+ patches occur over several seconds, causing extreme delays due to CPU shortage and making the bot appear essentially unresponsive.
| Pattern | API call count | Behavior under CPU throttling |
|---|---|---|
| thinking → patch (v0.1.0) | 2 | Slow but completes |
| Progressive card | 10+ | Extreme delay, essentially unresponsive |
Solution: Always-On CPU Allocation
Setting --no-cpu-throttling keeps CPU allocated even after the HTTP response is returned. This ensures background threads reliably run with full CPU.
However, --no-cpu-throttling requires at least 1 vCPU. --cpu=1 is not needed for performance — it's a prerequisite for enabling --no-cpu-throttling.
# Step 1: Deploy with increased CPU and memory
gcloud functions deploy google-chat-bot \
--gen2 \
--runtime=python314 \
--region=asia-northeast1 \
--source=. \
--entry-point=handle_chat \
--trigger-http \
--no-allow-unauthenticated \
--memory=512Mi \
--cpu=1
# Step 2: Disable CPU throttling
gcloud run services update google-chat-bot \
--region=asia-northeast1 \
--no-cpu-throttling
Note: --no-cpu-throttling is not supported in the gcloud functions deploy command, so it must be set separately with gcloud run services update. Since Cloud Functions Gen 2 is Cloud Run itself, you can configure it directly with the gcloud run command.
Cost Considerations
Increasing CPU from 0.17 vCPU to 1 vCPU and enabling always-on allocation will increase costs. However, since min-instances=0 (default) means instances scale to zero when there are no requests, the actual cost remains very low.
| Setting | min-instances | Monthly cost (Tokyo region) |
|---|---|---|
| 0.17 vCPU, 256Mi (default) | 0 | Nearly free |
| 1 vCPU, 512Mi, --no-cpu-throttling | 0 | ~$0.55 USD/month (charged only during requests) |
| 1 vCPU, 512Mi, --no-cpu-throttling | 1 | ~$50 USD/month (always running) |
We've chosen min-instances=0 here. With approximately 10 seconds of processing time per request and around 100 requests per day, costs would be around $0.55 USD per month. We've accepted the cold start latency (a few seconds) and chosen to minimize costs.
If you need consistently low latency in production, setting min-instances=1 solves that, but at a cost of approximately $50 USD/month.
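The two estimates can be reproduced with a short calculation. The per-second rates below are assumptions (approximate Tier 1 rates for always-allocated CPU), not figures from this article, so check them against the current Cloud Run pricing page. The sketch counts only request-processing seconds and ignores any free tier.

```python
# Assumed Tier 1 rates for always-allocated CPU (USD); verify against current pricing.
VCPU_SECOND_USD = 0.000018
GIB_SECOND_USD = 0.000002


def monthly_cost(active_seconds, vcpu=1.0, memory_gib=0.5):
    """Cost of keeping an instance allocated for `active_seconds` in one month."""
    return active_seconds * (vcpu * VCPU_SECOND_USD + memory_gib * GIB_SECOND_USD)


DAYS = 30
on_demand_seconds = 100 * 10 * DAYS      # 100 requests/day x 10 s each
always_on_seconds = 24 * 60 * 60 * DAYS  # min-instances=1, running all month

print(f"min-instances=0: ${monthly_cost(on_demand_seconds):.2f}")
print(f"min-instances=1: ${monthly_cost(always_on_seconds):.2f}")
```

With these assumed rates the script prints about $0.57 and $49.25, in line with the approximate figures in the table.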
Solving Obstacle 3: Bundling the Discovery Document
Even after solving CPU throttling and getting background threads to work, the problem of the Discovery Document download taking 2 minutes on cold starts remained.
The solution was simple: bundle the Discovery Document as a static file in the project.
# Download Discovery Document
curl -o chat_discovery.json \
'https://chat.googleapis.com/$discovery/rest?version=v1'
import json
from pathlib import Path

import google.auth
from googleapiclient.discovery import build_from_document

SCOPES = ["https://www.googleapis.com/auth/chat.bot"]
_DISCOVERY_DOC_PATH = Path(__file__).parent / "chat_discovery.json"


def _get_default_service():
    # Build the client from the bundled file instead of fetching the Discovery Document
    credentials, _ = google.auth.default(scopes=SCOPES)
    doc = json.loads(_DISCOVERY_DOC_PATH.read_text())
    return build_from_document(doc, credentials=credentials)
By using build_from_document() instead of build(), network access is completely eliminated. Even on cold starts, the first card now appears within a few seconds.
This file is about 410KB. Note that upgrading to --cpu=1 also reduces the build() download time from 2 minutes to a few seconds, but that's still far from the < 100ms of build_from_document(). This optimization remains effective independently of the CPU increase.
What If the Discovery Document Becomes Outdated?
The Discovery Document is like a "map" describing the API's URL paths, parameter names, and request/response schemas. Since Google's REST APIs maintain strict backward compatibility, the signatures of existing methods (like spaces.messages.create and spaces.messages.patch) almost never change.
Even with an older Discovery Document, existing functionality continues to work; you simply won't have access to new API features. A bundled copy works fine for several months to a year. Re-download it with the same command:
# Re-download periodically
curl -o chat_discovery.json \
  'https://chat.googleapis.com/$discovery/rest?version=v1'
Good times to refresh it:
- When upgrading google-api-python-client
- When you want to use new Chat API features
- As regular maintenance every few months
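One way to make the "every few months" maintenance concrete is to check the age of the bundled file. This is a rough sketch, assuming the document carries a top-level `revision` field in YYYYMMDD form; the helper names and the 180-day threshold are arbitrary choices for illustration.

```python
import json
from datetime import date, datetime
from pathlib import Path


def discovery_revision_date(doc):
    """Parse the document's `revision` field (assumed YYYYMMDD) into a date."""
    return datetime.strptime(doc["revision"], "%Y%m%d").date()


def is_stale(doc, today, max_age_days=180):
    """Return True when the bundled document is older than `max_age_days`."""
    return (today - discovery_revision_date(doc)).days > max_age_days


def check_bundled_doc(path, today=None):
    """Load the bundled file and report whether it is due for a refresh."""
    doc = json.loads(Path(path).read_text())
    return is_stale(doc, today or date.today())
```

A check like `check_bundled_doc("chat_discovery.json")` could run in CI and emit a warning rather than fail the build, since an old document keeps working.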
Obstacle 5: The Feedback Button Trap (A Triple Trap)
With the progressive card working, we added buttons for users to provide feedback on answer quality. However, getting these buttons to work meant falling into 3 traps.
Trap 1: action.function Must Be the Full Endpoint URL
The first button definition we wrote was:
# ❌ Doesn't work — specifying a function name in function
{
    "onClick": {
        "action": {
            "function": "feedback",
            "parameters": [
                {"key": "vote", "value": "up"},
            ]
        }
    }
}
Clicking the button displayed an error along the lines of "(bot name) cannot process the request", and no logs arrived at the endpoint.
The cause was that with the HTTP endpoint method for Workspace Add-ons, action.function must specify the bot's endpoint URL itself. Google Chat POSTs a CARD_CLICKED event to this URL.
# ✅ Correct — specify the full HTTPS URL in function
{
    "onClick": {
        "action": {
            "function": "https://asia-northeast1-PROJECT.cloudfunctions.net/google-chat-bot",
            "parameters": [
                {"key": "action", "value": "feedback"},
                {"key": "vote", "value": "up"},
            ]
        }
    }
}
With Apps Script or Dialogflow methods, you specify a function name in function, but with the HTTP endpoint method, you specify a URL. This distinction is a difficult point to find in the documentation.
Note that invokedFunction receives this URL as-is, so routing is handled by adding an action key to parameters.
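Routing on the action parameter can be as small as a dictionary lookup. The sketch below is illustrative only: `handle_feedback`, `HANDLERS`, and `route_click` are hypothetical names, and because the location of the parameters inside the event payload depends on the event format, it starts from the key/value list exactly as written in the button definition.

```python
def handle_feedback(params):
    """Hypothetical handler: acknowledge the vote carried in the button parameters."""
    return f"feedback:{params.get('vote', 'unknown')}"


# Maps the value of the "action" parameter to a handler function.
HANDLERS = {"feedback": handle_feedback}


def parameters_to_dict(parameters):
    """Convert the button's [{"key": ..., "value": ...}] list into a plain dict."""
    return {item["key"]: item["value"] for item in parameters}


def route_click(parameters):
    """Dispatch on "action", since the function URL is the same for every button."""
    params = parameters_to_dict(parameters)
    handler = HANDLERS.get(params.get("action"))
    if handler is None:
        raise ValueError(f"unknown action: {params.get('action')!r}")
    return handler(params)
```

Adding a new button type then only requires a new entry in `HANDLERS`, with no change to the endpoint URL.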
Trap 2: Building the Correct URL from Inside Cloud Functions
To avoid hardcoding the endpoint URL, we tried to assemble it dynamically from the request.
# ❌ request.base_url returns the internal URL
endpoint_url = request.base_url
# → "http://localhost:8080/" (URL of Cloud Run's internal proxy)
Cloud Functions Gen 2 runs on Cloud Run, and the request Flask receives is forwarded from an internal proxy. request.base_url returns the internal localhost:8080, not the external URL.
You can get the hostname using the X-Forwarded-Host and X-Forwarded-Proto headers, but there's another trap:
# ❌ request.path returns "/"
host = request.headers.get("X-Forwarded-Host") # "asia-northeast1-PROJECT.cloudfunctions.net"
scheme = request.headers.get("X-Forwarded-Proto") # "https"
path = request.path # "/" ← Not "/google-chat-bot"!
The Cloud Functions runtime strips the function name path prefix before passing the request to Flask. Even if the external URL is /google-chat-bot, the request.path visible to Flask is /.
The solution is to use the K_SERVICE environment variable. In Cloud Functions Gen 2 (= Cloud Run), this environment variable is automatically set to the service name (= function name).
import os
host = request.headers.get("X-Forwarded-Host") or request.headers.get("Host", "")
scheme = request.headers.get("X-Forwarded-Proto", "https")
service = os.environ.get("K_SERVICE", "")
endpoint_url = f"{scheme}://{host}/{service}" if host else ""
# → "https://asia-northeast1-PROJECT.cloudfunctions.net/google-chat-bot"
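Wrapping the same logic in a function that takes the headers and environment as arguments makes it testable outside Cloud Functions. This is a sketch: `build_endpoint_url` is a hypothetical helper, and unlike the snippet above it returns an empty string when K_SERVICE is missing rather than a URL with a trailing slash.

```python
def build_endpoint_url(headers, environ):
    """Assemble the external function URL from proxy headers and K_SERVICE.

    `headers` and `environ` can be any mappings with a .get() method,
    for example Flask's request.headers and os.environ.
    """
    host = headers.get("X-Forwarded-Host") or headers.get("Host", "")
    scheme = headers.get("X-Forwarded-Proto", "https")
    service = environ.get("K_SERVICE", "")
    if not host or not service:
        return ""  # caller should fall back to a configured URL
    return f"{scheme}://{host}/{service}"
```

In the handler this would be called as `build_endpoint_url(request.headers, os.environ)`.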
Trap 3: Response Format Must Be actionResponse
Even after the URL was correct and events started arriving at the endpoint, the button still did not work because the response format was wrong.
// ❌ renderActions is for dialogs
{"renderActions": {"action": {"navigations": [{"updateCard": {...}}]}}}
// ❌ updateMessageAction is for synchronous response message updates
{"hostAppDataAction": {"chatDataAction": {"updateMessageAction": {"message": {...}}}}}
The correct response format for CARD_CLICKED events is actionResponse:
// ✅ Correct response for CARD_CLICKED
{
  "actionResponse": {"type": "UPDATE_MESSAGE"},
  "cardsV2": [{
    "cardId": "progressive-card",
    "card": {
      "sections": [{
        "widgets": [{
          "textParagraph": {"text": "Thank you for your feedback!"}
        }]
      }]
    }
  }]
}
To summarize:
| Operation | Response format |
|---|---|
| Synchronous message creation | hostAppDataAction.chatDataAction.createMessageAction |
| Update message on CARD_CLICKED | actionResponse: {type: "UPDATE_MESSAGE"} + cardsV2 |
| Show dialog | renderActions.action.navigations[].pushCard |
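For the CARD_CLICKED row, a small builder keeps the response shape in one place. A minimal sketch, with `update_message_response` as a hypothetical helper name and the vote wording invented for illustration:

```python
import json


def update_message_response(card_id, vote):
    """Build an UPDATE_MESSAGE response that replaces the card with a thank-you note."""
    text = f"Thank you for your feedback! (vote: {vote})"
    return {
        "actionResponse": {"type": "UPDATE_MESSAGE"},
        "cardsV2": [
            {
                "cardId": card_id,
                "card": {"sections": [{"widgets": [{"textParagraph": {"text": text}}]}]},
            }
        ],
    }


def to_http_body(response):
    """Serialize the response dict as the JSON body of the HTTP reply."""
    return json.dumps(response, ensure_ascii=False)
```

The handler would return `to_http_body(update_message_response("progressive-card", "up"))` with a JSON content type.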
Obstacle 6: Thread Replies Become New Threads
After the progressive card was complete and the feedback buttons were working, the next problem surfaced. When a user replied within a thread to the bot's answer, the bot's response was created as a new top-level message instead of within the same thread.
User: "Tell me about company policies"
Bot: [Answer with progressive card] ← Thread ①
User: "Can you explain in more detail?" ← Reply within Thread ①
Bot: [Answer with new card] ← Thread ② (new!) ← This is the problem
This breaks the flow of conversation.
Cause: Default Value of messageReplyOption
Google Chat API's spaces.messages.create has a parameter called messageReplyOption. When this parameter is not specified, the default MESSAGE_REPLY_OPTION_UNSPECIFIED is applied, and messages are always created as new threads.
In other words, we needed to extract the thread information from the received message and pass it when replying.
Solution: Carrying Over thread.name
The received event's message object contains thread.name (the thread's resource name). It was as simple as extracting this and passing it to spaces.messages.create.
# Extract thread information from received event
thread_name = message.get("thread", {}).get("name")

# Attach thread information when creating message
kwargs = {"parent": space_name, "body": body}
if thread_name:
    kwargs["body"] = {**body, "thread": {"name": thread_name}}
    kwargs["messageReplyOption"] = "REPLY_MESSAGE_FALLBACK_TO_NEW_THREAD"
service.spaces().messages().create(**kwargs).execute()
REPLY_MESSAGE_FALLBACK_TO_NEW_THREAD replies within the specified thread if it exists, or creates a new thread if it doesn't — a safe fallback behavior. This single option works correctly for DMs, new messages, and thread replies alike.
Final Architecture
Here is the final configuration after overcoming all the obstacles.
main.py → HTTP handler (returns {}, starts thread, CARD_CLICKED routing)
worker.py → Pipeline orchestration (with step tracking)
cards.py → cardsV2 builder (progressive card)
models.py → Pipeline data models (StepStatus, PipelineState)
throttle.py → Rate-limit-aware patcher (1 write/sec/space)
feedback.py → CARD_CLICKED event handler
chat_api.py → Chat API wrapper (static Discovery Document)
chat_discovery.json → Bundled Chat API v1 Discovery Document
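The article does not show models.py, so the following is only a guess at what a step-tracking model could look like. The class names `StepStatus` and `PipelineState` come from the listing above; every field and method is a hypothetical choice for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"


@dataclass
class PipelineState:
    """Tracks the status of each pipeline step, in order."""

    labels: list
    statuses: list = field(default_factory=list)

    def __post_init__(self):
        # Every step starts as PENDING unless statuses were supplied explicitly
        if not self.statuses:
            self.statuses = [StepStatus.PENDING] * len(self.labels)

    def start(self, index):
        self.statuses[index] = StepStatus.RUNNING

    def finish(self, index):
        self.statuses[index] = StepStatus.DONE

    def current_label(self):
        """Return the label of the running step, or None if nothing is running."""
        for label, status in zip(self.labels, self.statuses):
            if status is StepStatus.RUNNING:
                return label
        return None

    def is_complete(self):
        return all(status is StepStatus.DONE for status in self.statuses)
```

A card builder can then derive both the always-visible status line (`current_label()`) and the collapsible history from a single object.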
Pipeline Flow
1. HTTP request received → Return {} immediately
2. Start pipeline in background thread
3. Create initial card (show 4 steps, all PENDING)
4. Patch card as each step progresses
- Analyzing query → ✅
- Creating search query → ✅
- Searching knowledge base → ✅
- Generating answer → ✅
5. Patch while adding response paragraphs
6. Show feedback buttons upon completion
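Step 5 depends on turning a token stream into paragraph-sized updates, so that each patch carries a complete paragraph instead of a partial sentence. A minimal sketch, assuming paragraphs are separated by blank lines; `paragraph_snapshots` is a hypothetical helper, not part of the bot described here.

```python
def paragraph_snapshots(tokens):
    """Yield the cumulative answer each time a paragraph completes, then the final text."""
    text = ""
    boundaries_seen = 0
    last = None
    for token in tokens:
        text += token
        boundaries = text.count("\n\n")
        if boundaries > boundaries_seen:
            boundaries_seen = boundaries
            # Cut at the last boundary so a partial paragraph is never shown
            last = text[: text.rfind("\n\n")]
            yield last
    final = text.strip()
    if final and final != last:
        yield final
```

Each yielded snapshot would be handed to the throttled patcher, which may still drop intermediate snapshots if they arrive faster than the rate limit allows.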
Deployment
# Step 1: Deploy function
gcloud functions deploy google-chat-bot \
--gen2 \
--runtime=python314 \
--region=asia-northeast1 \
--source=. \
--entry-point=handle_chat \
--trigger-http \
--no-allow-unauthenticated \
--memory=512Mi \
--cpu=1
# Step 2: Disable CPU throttling (required for background threads)
gcloud run services update google-chat-bot \
--region=asia-northeast1 \
--no-cpu-throttling
Summary
We hit 6 obstacles while implementing a progressive UX for a RAG pipeline in Google Chat Bot.
| Obstacle | Cause | Solution |
|---|---|---|
| Message API is synchronous | Streaming not supported | cardsV2 + Chat API patch |
| Slow cold start | build() downloads Discovery Document | Use build_from_document() with static file |
| Background thread doesn't run | CPU throttling in Cloud Functions Gen 2 | --cpu=1 + --no-cpu-throttling |
| Rate limit | 1 write/sec/space | ThrottledPatcher (latest-wins buffer) |
| Feedback button errors (triple trap) | Function name in action.function / URL assembly mistake / Wrong response format | Full URL + K_SERVICE + actionResponse |
| Thread replies become new threads | Always new thread when messageReplyOption not specified | Carry over thread.name + REPLY_MESSAGE_FALLBACK_TO_NEW_THREAD |
Honestly, compared to Discord Bot or Slack Bot, the Google Chat Bot development experience is still rough around the edges. There are many areas where the documentation hasn't caught up with the Workspace Add-ons format, and many things like the distinction between response formats (createMessageAction / actionResponse / renderActions) and the requirement for a full URL in action.function can only be discovered through trial and error.
On the other hand, cardsV2 progressive updates provide quite a good UX once they work. Users can see the pipeline progress in real-time and feedback can be collected. For organizations using Google Workspace, this investment is well worth it.
Next Steps: Migration to Cloud Tasks
The current architecture works with background threads + --no-cpu-throttling, but when scaling to a full-fledged RAG pipeline, we're considering migrating to Cloud Tasks.
Current: HTTP request → return {} → background thread runs in the same instance
Future: HTTP request → enqueue in Cloud Tasks → worker processes as a separate request
Migrating to Cloud Tasks provides the following benefits:
| Aspect | Current (background thread) | Cloud Tasks |
|---|---|---|
| CPU configuration | --cpu=1 + --no-cpu-throttling required | Works with default settings |
| Concurrency | Threads compete for CPU within one instance | Runs in separate instance per task |
| On failure | Silent failure, no retry | Automatic retry + Dead Letter Queue |
| Timeout | Constrained by HTTP timeout | Up to 30 minutes per task |
Especially for RAG pipelines with heavy operations like LLM calls and vector searches, resource contention under concurrent requests becomes a problem. Cloud Tasks naturally leverages Cloud Run's autoscaling, eliminating scalability concerns.
Note that the Discovery Document (chat_discovery.json) is a static file bundled in the container image, so it can be used as-is in worker instances started from the same deployment, requiring no additional configuration.
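As a rough idea of the enqueue step in the future flow, the sketch below builds the JSON body for a Cloud Tasks REST `tasks.create` call with an HTTP target. It is an assumption-laden illustration: `build_task_request`, the worker URL, and the service account are hypothetical, and the actual call (authentication, queue path) is left out.

```python
import base64
import json

MAX_DEADLINE_SECONDS = 30 * 60  # the 30-minute per-task limit mentioned above


def build_task_request(worker_url, payload, service_account_email, deadline_seconds=600):
    """Build the request body for creating an HTTP-target task."""
    if not 0 < deadline_seconds <= MAX_DEADLINE_SECONDS:
        raise ValueError("deadline_seconds must be between 1 and 1800")
    # The REST API expects the HTTP body as a base64-encoded string
    body = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return {
        "task": {
            "dispatchDeadline": f"{deadline_seconds}s",
            "httpRequest": {
                "url": worker_url,
                "httpMethod": "POST",
                "headers": {"Content-Type": "application/json"},
                "body": body,
                # OIDC token so the worker can stay unauthenticated-access-disabled
                "oidcToken": {"serviceAccountEmail": service_account_email},
            },
        }
    }
```

The payload would carry whatever the worker needs to resume the conversation, such as the space name, the thread name, and the user's question.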