The story of hitting walls everywhere when implementing progressive UX with cardsV2 in Google Chat Bot

This article walks through five walls I hit while implementing a progressive UX for a RAG pipeline in a Google Chat bot (synchronous API constraints, the Discovery Document cold start, CPU throttling, rate limits, and a triple trap with feedback buttons), along with the fixes built around the cardsV2 patch pattern.
2026.05.29

Introduction

In the previous article, we built a Google Chat Bot as a minimal echo bot using Cloud Functions + Python + uv.

This time, I'll share the trial and error involved in trying to turn that bot into a RAG (Retrieval-Augmented Generation) pipeline, hitting one Google Chat limitation after another, and ultimately arriving at a progressive update UX using cardsV2.

Configuration

Item | Choice
Runtime | Cloud Functions 2nd generation
Language | Python 3.14
Package manager | uv
Region | asia-northeast1 (Tokyo)
CPU | 1 vCPU (--no-cpu-throttling)
Memory | 512Mi

Obstacle 1: Google Chat's Message API Is Synchronous

This was the first wall I hit when trying to build a RAG bot.

In typical AI application development, streaming generated text to the user in real time is taken for granted. With a Discord bot, you can also add a reaction (👀) to a message to signal "processing" and then edit the message when the work is done.

However, with Google Chat's HTTP endpoint approach:

  • Streaming is not supported — you simply return a single response to an HTTP request
  • Messages can only be created via synchronous response — the HTTP response is the bot's reply
  • Bots cannot add reactions — Chat API's reactions.create only supports user authentication (OAuth) and cannot be called with bot authentication (chat.bot scope). The pattern of "add 👀 to indicate processing" like in Discord is not available

I actually tried adding the chat.messages.reactions.create scope, but got a 403 error with ACCESS_TOKEN_SCOPE_INSUFFICIENT. Checking the documentation, reactions are clearly stated as "Requires user authentication." For a bot to add reactions, either the user's OAuth credentials or domain-wide delegation by a Workspace administrator is required.

In other words, for processes that take several seconds to tens of seconds like a RAG pipeline, users are left waiting with no feedback at all during that time.

User: "Tell me about internal regulations"

(5–10 seconds of silence)

Bot: "Response text"

This makes for a poor UX. I started looking into whether there was a way around this.

Obstacle 2: The cardsV2 and Chat API "Patch" Pattern

After looking into it, I found that Google Chat has a rich UI component called cardsV2, and that using the Chat API, you can patch (update) a message after it's been created.

In other words, the following flow becomes possible:

1. Return {} in the HTTP response (return immediately to avoid timeout)
2. Call the Chat API in a background thread to create a "Processing..." card
3. Patch the card to show progress as the pipeline advances
4. Patch the card with the final result upon completion
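
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the `chat_client` object with `create_message` / `patch_message` methods, the `text_card` helper, the `answer_question` callable, and the simplified `event` dict are all hypothetical. The handler also returns the thread next to `{}` only so that a caller can wait for it in a test.

```python
import threading


def text_card(text):
    """Build a one-widget cardsV2 body (hypothetical helper)."""
    return {
        "cardsV2": [{
            "cardId": "progress-card",
            "card": {"sections": [{"widgets": [{"textParagraph": {"text": text}}]}]},
        }]
    }


def run_pipeline(chat_client, space_name, question, answer_question):
    """Steps 2-4: create a placeholder card, then patch it as work advances."""
    # Step 2: create the "Processing..." card and remember its resource name.
    message_name = chat_client.create_message(space_name, text_card("Processing..."))
    # Step 3: report progress before the slow part starts.
    chat_client.patch_message(message_name, text_card("Generating response..."), "cardsV2")
    # Step 4: replace the card content with the final result.
    answer = answer_question(question)
    chat_client.patch_message(message_name, text_card(answer), "cardsV2")


def handle_event(event, chat_client, answer_question):
    """Step 1: start the background work and return {} right away."""
    thread = threading.Thread(
        target=run_pipeline,
        args=(chat_client, event["space"], event["text"], answer_question),
    )
    thread.start()
    return {}, thread
```

Keeping the pipeline behind a small client interface like this also makes it possible to exercise the flow with a fake client, without calling the real Chat API.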

[Screenshot: card showing the progress status]

Calling the Chat API from Code

To use the Chat API from Python, use build() from google-api-python-client.

from googleapiclient.discovery import build
import google.auth

SCOPES = ["https://www.googleapis.com/auth/chat.bot"]
credentials, _ = google.auth.default(scopes=SCOPES)
service = build("chat", "v1", credentials=credentials)

This allows you to create and update messages.

# Create a message
response = service.spaces().messages().create(
    parent="spaces/SPACE_ID",
    body={"cardsV2": [{"cardId": "my-card", "card": {...}}]}
).execute()

# Update (patch) a message
service.spaces().messages().patch(
    name=response["name"],
    updateMask="cardsV2",
    body={"cardsV2": [{"cardId": "my-card", "card": {...}}]}
).execute()

Collapsible Sections in cardsV2

cardsV2 has a collapsible property that allows widgets within a section to be collapsible. Using this, you can create a UI that displays pipeline step history in an accordion while always showing the current status.

{
    "collapsible": True,
    "uncollapsibleWidgetsCount": 1,  # The first one is always shown
    "widgets": [
        # ↓ Always shown (status)
        {"decoratedText": {"text": "Generating response..."}},
        # ↓ Inside collapsed section (step history)
        {"decoratedText": {"text": '<font color="#00C853">✅ Analyzing inquiry</font>'}},
        {"decoratedText": {"text": '<font color="#00C853">✅ Creating search query</font>'}},
        {"decoratedText": {"text": '<font color="#2979FF">⏳ Generating response</font>'}},
    ]
}
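
A section like the one above could be generated from a list of steps instead of being written by hand. The sketch below assumes a hypothetical `build_progress_section` helper that takes `(label, done)` tuples; the colors and icons simply mirror the example.

```python
GREEN = "#00C853"
BLUE = "#2979FF"


def step_widget(label, done):
    """Render one step line: green check when done, blue hourglass otherwise."""
    color, icon = (GREEN, "✅") if done else (BLUE, "⏳")
    return {"decoratedText": {"text": f'<font color="{color}">{icon} {label}</font>'}}


def build_progress_section(status_text, steps):
    """Build a collapsible section: status always visible, step history folded.

    `steps` is a list of (label, done) tuples in pipeline order.
    """
    widgets = [{"decoratedText": {"text": status_text}}]
    widgets += [step_widget(label, done) for label, done in steps]
    return {
        "collapsible": True,
        "uncollapsibleWidgetsCount": 1,  # keep only the status line expanded
        "widgets": widgets,
    }
```

Calling `build_progress_section("Generating response...", [("Analyzing inquiry", True), ("Generating response", False)])` yields one status widget followed by one widget per step.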

[Screenshot: bot card showing the pipeline steps]

Rate Limit: 1 write/sec/space

There is an important caveat here. The Google Chat API has a rate limit of 1 write per second per space. Both create and patch share this quota.

This means you cannot send a patch for every token the way LLM token streaming would. You need to patch at a coarser granularity, such as at step transitions or paragraph boundaries.

To handle this, I created a class called ThrottledPatcher.

import time


class ThrottledPatcher:
    """Rate-limit-aware patcher that keeps only the latest pending card body."""

    def __init__(self, chat_client, message_name, min_interval=1.0):
        self._chat_client = chat_client
        self._message_name = message_name
        self._min_interval = min_interval
        self._last_patch_time = 0.0
        self._buffered_body = None

    def patch(self, body, force=False):
        now = time.monotonic()
        elapsed = now - self._last_patch_time
        if force or elapsed >= self._min_interval:
            if force and elapsed < self._min_interval:
                # A forced patch still waits out the rest of the interval
                time.sleep(self._min_interval - elapsed)
            self._chat_client.patch_message(
                self._message_name, body, "cardsV2"
            )
            self._last_patch_time = time.monotonic()
            self._buffered_body = None
        else:
            self._buffered_body = body  # latest-wins buffer

    def flush(self):
        # Send the buffered body, if any, once the interval has passed
        if self._buffered_body is not None:
            remaining = self._min_interval - (
                time.monotonic() - self._last_patch_time
            )
            if remaining > 0:
                time.sleep(remaining)
            self._chat_client.patch_message(
                self._message_name, self._buffered_body, "cardsV2"
            )
            # Record the send time so later patches stay throttled
            self._last_patch_time = time.monotonic()
            self._buffered_body = None

The key point is the "latest-wins" buffer strategy. When multiple patches occur within the rate limit, only the latest state is retained and sent at the next patchable timing. There's no need to send every intermediate state — the user just needs to always see the latest progress.
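
Patching at paragraph boundaries, as mentioned earlier, needs something that turns a token stream into coarser updates. One possible sketch is a generator that yields the accumulated text each time a paragraph completes. The `paragraph_snapshots` name and the choice of a blank line as the boundary are assumptions for illustration.

```python
def paragraph_snapshots(tokens):
    """Yield the accumulated text each time a paragraph is completed.

    A paragraph boundary is assumed to be a blank line (two newlines in a row).
    """
    text = ""
    seen = 0     # number of boundaries already reported
    last = None  # last snapshot that was yielded
    for token in tokens:
        text += token
        boundaries = text.count("\n\n")
        if boundaries > seen:
            seen = boundaries
            last = text
            yield text
    # Emit the trailing partial paragraph, if any text arrived after the last yield.
    if text and text != last:
        yield text
```

Each yielded snapshot is a full card text, so it can be handed to a throttled patcher directly. Because the count is taken on the accumulated text, a boundary that is split across two tokens is still detected.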

Obstacle 3: Cold Start Is Slow Due to Discovery Document Download

Once the cardsV2 + patch pattern was working and I happily deployed it, I noticed that during a cold start, it took about 2 minutes for the first card to appear.

When returning plain text messages in the previous article, the response came back almost instantly even on a cold start. What was different?

The cause was build("chat", "v1").

google-api-python-client's build() downloads the API definition (Discovery Document) over the network from Google's servers. This file is about 410KB, and downloading it takes time in Cloud Functions' low-spec environment (default 0.17 vCPU).

Return HTTP response (immediately) → Start background thread
     → Download Discovery Document with build("chat", "v1") (~2 minutes)
     → Call card creation API
     → Card appears for user

First Attempt: Synchronous Message + Async Card

The first thing I tried was returning a plain text synchronous response on cold start, then creating a cardsV2 message once the Chat API was ready.

@functions_framework.http
def handle_chat(request):
    body = request.get_json(silent=True)
    # ...
    if not is_warm():
        # Cold start: return a text message synchronously
        thread = threading.Thread(target=process_message, args=(...,))
        thread.start()
        return create_message("Processing started. Please wait a moment...")
    else:
        # Warm: create card in the background
        thread = threading.Thread(target=process_message, args=(...,))
        thread.start()
        return {}
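
The handler above relies on an `is_warm()` check that is not shown. One way such a check might work is to remember whether the Chat API client has already been built in the current instance. The `ServiceHolder` class below is a hypothetical sketch of that idea, not the original implementation; `factory` stands for any zero-argument callable that builds the client.

```python
import threading


class ServiceHolder:
    """Build a client lazily, once per instance, and report whether it exists."""

    def __init__(self, factory):
        self._factory = factory
        self._service = None
        self._built = False
        self._lock = threading.Lock()

    def is_warm(self):
        """Return True once the client has been built in this instance."""
        return self._built

    def get(self):
        """Return the cached client, building it on first use."""
        with self._lock:
            if not self._built:
                self._service = self._factory()
                self._built = True
            return self._service
```

The lock only ensures that concurrent background threads do not build the client twice.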

However, this didn't work either.

Obstacle 4: CPU Throttling in Cloud Functions Gen 2

After deploying and testing, the synchronous message returned immediately, but the background thread processing never executed at all. No errors. Just silence.

Checking the Cloud Run logs, the request was being received, but there were no logs at all for the Chat API calls that followed.

After investigating, the cause was CPU throttling in Cloud Functions 2nd generation (Cloud Run-based).

gcloud run services describe google-chat-bot --region=asia-northeast1 \
  --format="yaml(spec.template.metadata.annotations)"
annotations:
  run.googleapis.com/cpu-throttling: "true"  # ← This was the cause

Cloud Functions Gen 2 = Cloud Run

An important prerequisite here: Cloud Functions 2nd generation is Cloud Run itself. In AWS terms, it's not a proprietary runtime like Lambda but rather something closer to Fargate. When you deploy with gcloud functions deploy, it's internally deployed as a Cloud Run service.

# Even though it's a Cloud Function, it shows up as a Cloud Run service
gcloud run services list --region=asia-northeast1
# NAME: google-chat-bot  ← A Cloud Run service with the same name exists

And Cloud Run has CPU throttling enabled by default. This is a mechanism that allocates CPU resources only while processing HTTP requests and reduces the CPU to nearly zero after returning the response.

Why Does This Affect Background Threads?

Recall the bot's architecture:

1. Receive HTTP request
2. Launch background thread
3. Return HTTP response {} immediately  ← CPU allocation drops sharply here
4. Call Chat API in background thread  ← Almost no CPU available

CPU throttling doesn't "freeze" threads. Threads stay alive, but the allocated CPU drops dramatically. From the default 0.17 vCPU it goes to nearly zero, making network I/O and response parsing extremely slow.

In fact, the simple pattern implemented in the previous article (create thinking card → 1 patch) only made 2 API calls, so it somehow managed to work despite this constraint. But with progressive cards, more than 10 patches occur over several seconds, causing extreme delay from CPU starvation and making it effectively appear unresponsive.

Pattern | Number of API calls | Behavior under CPU throttling
thinking → patch (v0.1.0) | 2 | Slow but eventually completes
Progressive card | 10+ | Extremely delayed, effectively unresponsive

Solution: Always-On CPU Allocation

Setting --no-cpu-throttling keeps CPU allocated even after the HTTP response is returned, so background threads keep running with the instance's full CPU allocation.

However, --no-cpu-throttling requires a minimum of 1 vCPU. --cpu=1 is not needed for performance — it's a prerequisite for enabling --no-cpu-throttling.

# Step 1: Deploy with increased CPU and memory
gcloud functions deploy google-chat-bot \
  --gen2 \
  --runtime=python314 \
  --region=asia-northeast1 \
  --source=. \
  --entry-point=handle_chat \
  --trigger-http \
  --no-allow-unauthenticated \
  --memory=512Mi \
  --cpu=1

# Step 2: Disable CPU throttling
gcloud run services update google-chat-bot \
  --region=asia-northeast1 \
  --no-cpu-throttling

Note: --no-cpu-throttling is not supported in the gcloud functions deploy command, so it needs to be set separately with gcloud run services update. It's precisely because Cloud Functions Gen 2 is Cloud Run itself that you can configure it directly with gcloud run commands.

Cost Considerations

Raising the CPU from 0.17 vCPU to 1 vCPU and switching to always-on allocation will increase costs. However, if min-instances=0 (default) is kept, instances scale to zero when there are no requests, so actual costs remain very low.

Setting | min-instances | Monthly cost (Tokyo region)
0.17 vCPU, 256Mi (default) | 0 | Nearly free
1 vCPU, 512Mi, --no-cpu-throttling | 0 | ~$0.55 USD/month (scales to zero when idle)
1 vCPU, 512Mi, --no-cpu-throttling | 1 | ~$50 USD/month (always running)

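This time I'm using min-instances=0. With roughly 100 requests per day at about 10 seconds of processing per request, costs come to around $0.55 USD/month if only processing time is counted. Note that with always-on CPU allocation an instance is billed for as long as it stays up, including the idle period before it scales back to zero, so the real figure can be noticeably higher. I chose to tolerate cold start latency (a few seconds) in exchange for minimal cost.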

For production environments requiring consistently low latency, setting min-instances=1 solves the problem, but incurs a cost of around $50 USD/month.
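
The two estimates can be reproduced with some back-of-the-envelope arithmetic. The per-second rates below are assumptions (check the current Cloud Run pricing page for the region), the free tier is ignored, and the on-demand case counts only the processing time of each request.

```python
VCPU_PER_SECOND = 0.000018  # USD per vCPU-second (assumed rate)
GIB_PER_SECOND = 0.000002   # USD per GiB-second (assumed rate)


def monthly_cost(billable_seconds, vcpu=1.0, memory_gib=0.5):
    """Cost of keeping one instance allocated for `billable_seconds` in a month."""
    return billable_seconds * (vcpu * VCPU_PER_SECOND + memory_gib * GIB_PER_SECOND)


# min-instances=0: 100 requests/day x 10 s x 30 days of processing time.
on_demand = monthly_cost(100 * 10 * 30)
# min-instances=1: one instance allocated for the whole 30-day month.
always_on = monthly_cost(30 * 24 * 60 * 60)

print(f"on demand: ${on_demand:.2f}, always on: ${always_on:.2f}")
```

Under these assumptions the results land close to the figures quoted above: a little over half a dollar per month on demand, and just under $50 per month for an always-running instance.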

Resolving Obstacle 3: Bundling the Discovery Document

Even after resolving CPU throttling to get background threads working, the problem of the 2-minute Discovery Document download on cold start remained.

The solution was simple: bundle the Discovery Document as a static file in the project.

# Download the Discovery Document
curl -o chat_discovery.json \
  'https://chat.googleapis.com/$discovery/rest?version=v1'

import json
from pathlib import Path

import google.auth
from googleapiclient.discovery import build_from_document

SCOPES = ["https://www.googleapis.com/auth/chat.bot"]
_DISCOVERY_DOC_PATH = Path(__file__).parent / "chat_discovery.json"

def _get_default_service():
    credentials, _ = google.auth.default(scopes=SCOPES)
    # Load the bundled API definition instead of downloading it
    doc = json.loads(_DISCOVERY_DOC_PATH.read_text())
    return build_from_document(doc, credentials=credentials)

By using build_from_document() instead of build(), the Discovery Document download is eliminated entirely. The initial card now appears within a few seconds even on cold start.

Note that upgrading to --cpu=1 also shortens the build() download from about 2 minutes to a few seconds, but that is still far slower than build_from_document(), which takes under 100 ms. Bundling the roughly 410KB file therefore pays off independently of the CPU increase.

What If the Discovery Document Gets Stale?

The Discovery Document is like a "map" describing API URL paths, parameter names, and request/response schemas. Since Google's REST APIs aim to stay backward compatible within a version, the signatures of existing methods (spaces.messages.create, spaces.messages.patch) rarely change.

Existing functionality continues to work with an older Discovery Document; you simply won't be able to use new API features. It should work fine for several months to a year, but it's reassuring to re-download it at the following times:

  • When upgrading google-api-python-client
  • When you want to use new Chat API features
  • As part of periodic maintenance every few months

# Re-download periodically
curl -o chat_discovery.json \
  'https://chat.googleapis.com/$discovery/rest?version=v1'
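
A periodic maintenance job could decide whether the bundled copy needs refreshing by comparing it with a freshly downloaded one. The sketch below assumes that the Discovery Document carries a top-level `revision` field and that comparing it is a good enough staleness signal. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_doc(path):
    """Read a Discovery Document from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def discovery_revision(doc):
    """Return the `revision` string of a Discovery Document, or "" if absent."""
    return str(doc.get("revision", ""))


def is_stale(bundled_doc, latest_doc):
    """True when the freshly downloaded document reports a different revision."""
    return discovery_revision(bundled_doc) != discovery_revision(latest_doc)
```

For example, `is_stale(load_doc("chat_discovery.json"), load_doc("latest.json"))` would compare the bundled file with a new download saved under the illustrative name `latest.json`.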

Obstacle 5: The Feedback Button Trap (A Triple Trap)

With progressive cards working, I added buttons for users to give feedback on response quality. However, getting these buttons to work involved stepping into 3 traps.

Trap 1: action.function Must Be the Full Endpoint URL

The button definition I first wrote looked like this:

# ❌ Doesn't work — specifying a function name in function
{
    "onClick": {
        "action": {
            "function": "feedback",
            "parameters": [
                {"key": "vote", "value": "up"},
            ]
        }
    }
}

Clicking the button displayed the error "<bot name> cannot process the request". No logs arrived at the endpoint either.

The cause was that with HTTP endpoint-style Workspace Add-ons, action.function must contain the bot's endpoint URL itself. Google Chat POSTs a CARD_CLICKED event to this URL.

# ✅ Correct — specify the full HTTPS URL in function
{
    "onClick": {
        "action": {
            "function": "https://asia-northeast1-PROJECT.cloudfunctions.net/google-chat-bot",
            "parameters": [
                {"key": "action", "value": "feedback"},
                {"key": "vote", "value": "up"},
            ]
        }
    }
}

With the Apps Script or Dialogflow approaches you specify a function name in function, but with the HTTP endpoint approach you specify a URL. This distinction is hard to find in the documentation.

Because this URL arrives as-is in invokedFunction, it cannot be used to tell buttons apart, so routing relies on an extra action key in parameters.
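
Routing on the action key might look like the sketch below. It assumes the button parameters have already been extracted from the CARD_CLICKED event as the same list of key/value pairs used in the button definition; where they sit in the event payload is not covered here. `route_card_click` and the `handlers` mapping are hypothetical names.

```python
def parameters_to_dict(parameters):
    """Convert [{"key": k, "value": v}, ...] into a plain dict."""
    return {item["key"]: item["value"] for item in parameters}


def route_card_click(parameters, handlers):
    """Dispatch on the "action" parameter; return None for unknown actions."""
    params = parameters_to_dict(parameters)
    handler = handlers.get(params.get("action"))
    return handler(params) if handler else None
```

A call such as `route_card_click(params, {"feedback": handle_feedback})` would then pass the remaining parameters (for example `vote`) to a hypothetical `handle_feedback` function.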

Trap 2: Constructing the Correct URL from Inside Cloud Functions

To avoid hardcoding the endpoint URL, I tried building it dynamically from the request.

# ❌ request.base_url returns the internal URL
endpoint_url = request.base_url
# → "http://localhost:8080/" (URL of Cloud Run's internal proxy)

Cloud Functions Gen 2 runs on top of Cloud Run, and the request that Flask receives is forwarded from an internal proxy. request.base_url returns the internal localhost:8080 rather than the external URL.

You can get the hostname using X-Forwarded-Host and X-Forwarded-Proto headers, but there's another trap:

# ❌ request.path returns "/"
host = request.headers.get("X-Forwarded-Host")  # "asia-northeast1-PROJECT.cloudfunctions.net"
scheme = request.headers.get("X-Forwarded-Proto")  # "https"
path = request.path  # "/" ← Not "/google-chat-bot"!

The Cloud Functions runtime strips the function name path prefix before passing the request to Flask. Even if the external URL has /google-chat-bot, request.path as seen from Flask is /.

The solution is to use the K_SERVICE environment variable. In Cloud Functions Gen 2 (= Cloud Run), this environment variable is automatically set to the service name (= function name).

import os

host = request.headers.get("X-Forwarded-Host") or request.headers.get("Host", "")
scheme = request.headers.get("X-Forwarded-Proto", "https")
service = os.environ.get("K_SERVICE", "")
endpoint_url = f"{scheme}://{host}/{service}" if host else ""
# → "https://asia-northeast1-PROJECT.cloudfunctions.net/google-chat-bot"

Trap 3: The Response Format Is actionResponse

Even after getting the URL right and events arriving at the endpoint, things won't work if the response format is wrong.

// ❌ renderActions is for dialogs
{"renderActions": {"action": {"navigations": [{"updateCard": {...}}]}}}

// ❌ updateMessageAction is for synchronous response message updates
{"hostAppDataAction": {"chatDataAction": {"updateMessageAction": {"message": {...}}}}}

The correct response format for CARD_CLICKED events is actionResponse:

// ✅ Correct response to CARD_CLICKED
{
  "actionResponse": {"type": "UPDATE_MESSAGE"},
  "cardsV2": [{
    "cardId": "progressive-card",
    "card": {
      "sections": [{
        "widgets": [{
          "textParagraph": {"text": "Thank you for your feedback!"}
        }]
      }]
    }
  }]
}

To summarize:

Operation | Response format
Synchronous message creation | hostAppDataAction.chatDataAction.createMessageAction
Message update on CARD_CLICKED | actionResponse: {type: "UPDATE_MESSAGE"} + cardsV2
Show dialog | renderActions.action.navigations[].pushCard
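
The CARD_CLICKED response shown earlier could be produced by a small helper. This is only a sketch: `build_feedback_response` and its `card_id` default are illustrative names, and the body mirrors the JSON example.

```python
def build_feedback_response(text, card_id="progressive-card"):
    """Build an UPDATE_MESSAGE response that replaces the card with `text`."""
    return {
        "actionResponse": {"type": "UPDATE_MESSAGE"},
        "cardsV2": [{
            "cardId": card_id,
            "card": {
                "sections": [{
                    "widgets": [{"textParagraph": {"text": text}}],
                }],
            },
        }],
    }
```

The handler for a feedback click would return this dict as its JSON response body.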

Final Architecture

Here is the final configuration after overcoming all the obstacles.

main.py              → HTTP handler (returns {}, starts thread, routes CARD_CLICKED)
worker.py            → Pipeline orchestration (with step tracking)
cards.py             → cardsV2 builder (progressive card)
models.py            → Pipeline data models (StepStatus, PipelineState)
throttle.py          → Rate-limit-aware patcher (1 write/sec/space)
feedback.py          → CARD_CLICKED event handler
chat_api.py          → Chat API wrapper (static Discovery Document)
chat_discovery.json  → Bundled Chat API v1 Discovery Document

Pipeline Flow

1. Receive HTTP request → Return {} immediately
2. Start pipeline in background thread
3. Create initial card (showing 4 steps, all PENDING)
4. Patch card as each step progresses
   - Analyzing inquiry → ✅
   - Creating search query → ✅
   - Searching knowledge base → ✅
   - Generating response → ✅
5. Patch while adding response paragraphs
6. Show feedback buttons on completion
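
The state that drives this flow could be modeled along the following lines. `StepStatus` and `PipelineState` are the names listed for models.py, but the fields and methods here are assumptions made for illustration, not the actual definitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"


STEP_LABELS = (
    "Analyzing inquiry",
    "Creating search query",
    "Searching knowledge base",
    "Generating response",
)


@dataclass
class PipelineState:
    """Progress of one request: status per step plus the answer so far."""

    steps: dict = field(
        default_factory=lambda: {label: StepStatus.PENDING for label in STEP_LABELS}
    )
    paragraphs: list = field(default_factory=list)

    def mark(self, label, status):
        """Set the status of a known step."""
        if label not in self.steps:
            raise KeyError(f"unknown step: {label}")
        self.steps[label] = status

    @property
    def is_complete(self):
        """True once every step is DONE (the point to show feedback buttons)."""
        return all(status is StepStatus.DONE for status in self.steps.values())
```

A card builder can then render the card purely from a `PipelineState`, which keeps the worker logic and the card layout separate.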

Deployment

# Step 1: Deploy the function
gcloud functions deploy google-chat-bot \
  --gen2 \
  --runtime=python314 \
  --region=asia-northeast1 \
  --source=. \
  --entry-point=handle_chat \
  --trigger-http \
  --no-allow-unauthenticated \
  --memory=512Mi \
  --cpu=1

# Step 2: Disable CPU throttling (required for background threads)
gcloud run services update google-chat-bot \
  --region=asia-northeast1 \
  --no-cpu-throttling

Summary

Implementing a progressive UX for a RAG pipeline in Google Chat Bot involved hitting 5 obstacles.

Obstacle | Cause | Solution
Message API is synchronous | Streaming not supported | cardsV2 + Chat API patch
Cold start is slow | build() downloads Discovery Document | Use build_from_document() with static file
Background threads don't run | CPU throttling in Cloud Functions Gen 2 | --cpu=1 + --no-cpu-throttling
Rate limit | 1 write/sec/space | ThrottledPatcher (latest-wins buffer)
Feedback button errors (triple trap) | Function name in action.function / URL construction error / Wrong response format | Full URL + K_SERVICE + actionResponse

Honestly, compared to Discord bots or Slack bots, the Google Chat Bot development experience is still rough around the edges. There are many places where documentation hasn't caught up with the Workspace Add-ons format, and many things can only be discovered through trial and error — such as how to use different response formats (createMessageAction / actionResponse / renderActions) and the fact that a full URL is required in action.function.

On the other hand, progressive updates with cardsV2 provide quite a good UX when they work. Users can see pipeline progress in real time, and feedback can be collected. For organizations using Google Workspace, I think this investment is well worth it.

Next Steps: Migration to Cloud Tasks

The current architecture works with background threads + --no-cpu-throttling, but when scaling to a full-fledged RAG pipeline, I'm considering migrating to Cloud Tasks.

Current:  HTTP request → return {} → execute background thread in same instance
Future:   HTTP request → enqueue to Cloud Tasks → worker processes as separate request

Migrating to Cloud Tasks offers the following benefits:

Aspect | Current (background thread) | Cloud Tasks
CPU settings | --cpu=1 + --no-cpu-throttling required | Works with default settings
Concurrency | Threads compete for CPU within one instance | Each task runs in a separate instance
On failure | Fails silently, no retry | Automatic retry + Dead Letter Queue
Timeout | Constrained by HTTP timeout | Up to 30 minutes per task

Resource contention under concurrent requests becomes a problem particularly with RAG pipelines that include heavy processing such as LLM calls and vector searches. With Cloud Tasks, each task arrives as its own request, so the workload naturally benefits from Cloud Run's autoscaling, which eases most scalability concerns.

Note that the Discovery Document (chat_discovery.json) is a static file bundled in the container image, so it's available as-is in worker instances launched from the same deployment, requiring no additional configuration.
