The story of hitting wall after wall when implementing progressive UX with cardsV2 in Google Chat Bot

This article walks through the six walls we hit while implementing a progressive UX for a RAG pipeline in a Google Chat Bot (synchronous API constraints, rate limits, Discovery Document cold start, CPU throttling, the triple trap of feedback buttons, and thread replies that start new threads), along with solutions built on the cardsV2 patch pattern.
2026.05.29

Introduction

In the previous article, we built a Google Chat Bot as a minimal echo bot using Cloud Functions + Python + uv.

This time, when we tried to turn that bot into a RAG (Retrieval-Augmented Generation) pipeline, we ran into Google Chat's limitations one after another. This article introduces the trial and error that ultimately led us to a progressive update UX with cardsV2.

Configuration

| Item | Choice |
| --- | --- |
| Runtime | Cloud Functions 2nd gen |
| Language | Python 3.14 |
| Package manager | uv |
| Region | asia-northeast1 (Tokyo) |
| CPU | 1 vCPU (--no-cpu-throttling) |
| Memory | 512Mi |

Obstacle 1: Google Chat's Message API Is Synchronous

This was the first obstacle we hit when trying to build a RAG bot.

When developing AI applications, the standard UX is a streaming response: a "generating..." indicator appears and the text streams in real time. With a Discord Bot, you can easily attach a reaction (👀) to a message to indicate "processing" and then edit the message when done.

However, with Google Chat's HTTP endpoint method:

  • Streaming is not supported: you can return only one response per HTTP request
  • Messages can only be created via the synchronous response: the HTTP response is the bot's reply
  • Bots cannot add reactions: the Chat API's reactions.create only supports user authentication (OAuth) and cannot be called with bot authentication (chat.bot scope), so the Discord-style pattern of attaching 👀 to indicate processing is not available

We actually tried adding the chat.messages.reactions.create scope, but got a 403 error with ACCESS_TOKEN_SCOPE_INSUFFICIENT. The documentation clearly states that reactions "Require user authentication." For a bot to add reactions, you need either the user's OAuth credentials or domain-wide delegation by a Workspace admin.

In other words, for processes that take several seconds to tens of seconds like a RAG pipeline, the user has to wait the entire time with no feedback whatsoever.

User: "Tell me about company policies"

(5-10 seconds of silence)

Bot: "Response text"

This makes for a poor UX. We started looking for alternatives.

Obstacle 2: The cardsV2 and Chat API "Patch" Pattern

Upon investigation, we found that Google Chat has a rich UI component called cardsV2, and that the Chat API allows patching (updating) messages after creation.

This means the following flow can be achieved:

1. Return {} via HTTP response (return response immediately to avoid timeout)
2. Call the Chat API in a background thread to create a "processing..." card
3. Patch the card to show progress as the pipeline advances
4. Patch the card with the final result upon completion
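
The four steps above can be sketched as follows. This is a minimal illustration rather than the bot's actual code: FakeChatClient, run_pipeline, and handle_event are hypothetical names, and the fake client only records calls instead of talking to the Chat API.

```python
import threading


class FakeChatClient:
    """Stand-in for a Chat API wrapper; records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def create_message(self, space, text):
        self.calls.append(("create", text))
        return {"name": f"{space}/messages/1"}

    def patch_message(self, name, text):
        self.calls.append(("patch", text))


def run_pipeline(client, space, question):
    # Step 2: create the "processing..." card from the background thread.
    message = client.create_message(space, "Processing...")
    # Step 3: patch the card as each pipeline step finishes.
    for step in ("Analyzing query", "Searching knowledge base"):
        client.patch_message(message["name"], f"Done: {step}")
    # Step 4: patch the card with the final result.
    client.patch_message(message["name"], f"Answer to: {question}")


def handle_event(client, event):
    # Step 1: start the work in the background and return {} right away.
    worker = threading.Thread(
        target=run_pipeline,
        args=(client, event["space"], event["text"]),
    )
    worker.start()
    return {}, worker


client = FakeChatClient()
response, worker = handle_event(
    client, {"space": "spaces/SPACE_ID", "text": "Tell me about company policies"}
)
worker.join()  # only for this demo; a real handler would not wait
print(response, [kind for kind, _ in client.calls])
```

The handler returns before any Chat API call is made, which is exactly why the later sections on CPU throttling matter: all the visible work happens after the HTTP response.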

Calling the Chat API from Code

To use the Chat API from Python, use build() from google-api-python-client.

from googleapiclient.discovery import build
import google.auth

SCOPES = ["https://www.googleapis.com/auth/chat.bot"]
credentials, _ = google.auth.default(scopes=SCOPES)
service = build("chat", "v1", credentials=credentials)

This enables creating and updating messages.

# Create message
response = service.spaces().messages().create(
    parent="spaces/SPACE_ID",
    body={"cardsV2": [{"cardId": "my-card", "card": {...}}]}
).execute()

# Update message (patch)
service.spaces().messages().patch(
    name=response["name"],
    updateMask="cardsV2",
    body={"cardsV2": [{"cardId": "my-card", "card": {...}}]}
).execute()

collapsible Sections in cardsV2

Sections in cardsV2 have a property called collapsible that lets the widgets within a section be collapsed. Using this, you can display the pipeline step history as an accordion while keeping only the current status always visible.

{
    "collapsible": True,
    "uncollapsibleWidgetsCount": 1,  # First one is always shown
    "widgets": [
        # ↓ Always visible (status)
        {"decoratedText": {"text": "Generating answer..."}},
        # ↓ Inside collapse (step history)
        {"decoratedText": {"text": '<font color="#00C853">✅ Analyzing query</font>'}},
        {"decoratedText": {"text": '<font color="#00C853">✅ Creating search query</font>'}},
        {"decoratedText": {"text": '<font color="#2979FF">⏳ Generating answer</font>'}},
    ]
}
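
A section like this can be generated from a list of steps instead of being written by hand. The sketch below assumes a hypothetical helper build_progress_section and a two-state styling table; the colors and icons are taken from the example above, everything else is illustrative.

```python
# Color and icon per step state, matching the example section above.
STEP_STYLES = {
    "done": ("#00C853", "✅"),
    "running": ("#2979FF", "⏳"),
}


def build_progress_section(status_text, steps):
    """Build a collapsible section: status on top, step history collapsed."""
    widgets = [{"decoratedText": {"text": status_text}}]
    for label, state in steps:
        color, icon = STEP_STYLES[state]
        text = f'<font color="{color}">{icon} {label}</font>'
        widgets.append({"decoratedText": {"text": text}})
    return {
        "collapsible": True,
        "uncollapsibleWidgetsCount": 1,  # only the status widget stays visible
        "widgets": widgets,
    }


section = build_progress_section(
    "Generating answer...",
    [("Analyzing query", "done"), ("Generating answer", "running")],
)
print(len(section["widgets"]))
```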

Rate Limit: 1 write/sec/space

There is an important caveat here. The Google Chat API has a rate limit of 1 write per second per space. Both create and patch share this quota.

This means you cannot send a patch for every token like LLM token streaming. You need to patch at an appropriate granularity, such as at step transitions or paragraph boundaries.

To handle this, we created a class called ThrottledPatcher.

import time


class ThrottledPatcher:
    def __init__(self, chat_client, message_name, min_interval=1.0):
        self._chat_client = chat_client
        self._message_name = message_name
        self._min_interval = min_interval
        self._last_patch_time = 0.0
        self._buffered_body = None

    def patch(self, body, force=False):
        now = time.monotonic()
        elapsed = now - self._last_patch_time
        if force or elapsed >= self._min_interval:
            if force and elapsed < self._min_interval:
                # A forced patch still waits out the rest of the interval
                time.sleep(self._min_interval - elapsed)
            self._chat_client.patch_message(
                self._message_name, body, "cardsV2"
            )
            self._last_patch_time = time.monotonic()
            self._buffered_body = None
        else:
            self._buffered_body = body  # latest-wins buffer

    def flush(self):
        # Send whatever is still buffered, e.g. at the end of the pipeline
        if self._buffered_body is not None:
            remaining = self._min_interval - (
                time.monotonic() - self._last_patch_time
            )
            if remaining > 0:
                time.sleep(remaining)
            self._chat_client.patch_message(
                self._message_name, self._buffered_body, "cardsV2"
            )
            # Record the send so a following patch() respects the interval
            self._last_patch_time = time.monotonic()
            self._buffered_body = None

The key is the "latest-wins" buffer strategy. When multiple patches occur within the rate limit, only the latest state is retained and sent at the next available patch opportunity. There's no need to send every intermediate state — the user just needs to always see the latest progress.
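
To see the latest-wins rule in isolation, without sleeping or calling the Chat API, the following sketch replays timestamped updates through the same decision logic. simulate_latest_wins is a hypothetical helper and the timestamps are invented for illustration.

```python
def simulate_latest_wins(updates, min_interval=1.0):
    """Replay (timestamp, body) updates and return the bodies that get sent.

    An update is sent if at least min_interval has passed since the last
    send; otherwise it overwrites the buffer. The buffer is flushed at the end.
    """
    sent = []
    last_sent = float("-inf")
    buffered = None
    for timestamp, body in updates:
        if timestamp - last_sent >= min_interval:
            sent.append(body)
            last_sent = timestamp
            buffered = None  # a successful send supersedes the buffer
        else:
            buffered = body  # latest wins: older buffered bodies are dropped
    if buffered is not None:
        sent.append(buffered)  # final flush
    return sent


updates = [
    (0.0, "step 1"),
    (0.2, "step 2"),
    (0.5, "step 3"),
    (1.4, "step 4"),
    (1.6, "done"),
]
print(simulate_latest_wins(updates))
```

Five updates arrive within 1.6 seconds, but only three writes go out, and the intermediate states "step 2" and "step 3" are never sent. The final flush is what guarantees the last state reaches the user.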

Obstacle 3: Cold Start Is Slow Due to Discovery Document Download

Once the cardsV2 + patch pattern was working, we happily deployed it, only to notice that on cold starts the first card took about 2 minutes to appear.

When returning plain text messages as in the previous article, responses were returned almost instantly even on cold starts. What was different?

The cause was build("chat", "v1").

google-api-python-client's build() downloads the API definition (Discovery Document) over the network from Google's servers. This file is about 410KB, and in Cloud Functions' low-spec environment (default 0.17 vCPU), the download takes a long time.

HTTP response returned (immediately) → Background thread starts
     → Download Discovery Document with build("chat", "v1") (~2 minutes)
     → Call card creation API
     → Card appears for user

First Attempt: Synchronous Message + Async Card

The first thing we tried was to return a plain text message synchronously on cold start, then create a cardsV2 message once the Chat API is ready.

@functions_framework.http
def handle_chat(request):
    body = request.get_json(silent=True)
    # ...
    if not is_warm():
        # Cold start: return text message synchronously
        thread = threading.Thread(target=process_message, args=(...,))
        thread.start()
        return create_message("Processing started. Please wait...")
    else:
        # Warm: create card in background
        thread = threading.Thread(target=process_message, args=(...,))
        thread.start()
        return {}

However, this didn't work either.

Obstacle 4: CPU Throttling in Cloud Functions Gen 2

After deploying and testing, we found that while the synchronous message returned instantly, the background thread processing never executed at all. No errors either. Just silence.

Checking the Cloud Run logs, the request was received, but there were no logs at all from the subsequent Chat API calls.

After investigating the cause, it turned out to be CPU throttling in Cloud Functions 2nd generation (Cloud Run-based).

gcloud run services describe google-chat-bot --region=asia-northeast1 \
  --format="yaml(spec.template.metadata.annotations)"

annotations:
  run.googleapis.com/cpu-throttling: "true"  # ← This was the cause

Cloud Functions Gen 2 = Cloud Run

An important prerequisite here is that Cloud Functions 2nd generation is Cloud Run itself. In AWS terms, it's not a proprietary runtime like Lambda but rather closer to Fargate. Deploying with gcloud functions deploy internally deploys as a Cloud Run service.

# It's a Cloud Function, but it shows up as a Cloud Run service
gcloud run services list --region=asia-northeast1
# NAME: google-chat-bot  ← Exists as a Cloud Run service with the same name

And Cloud Run has CPU throttling enabled by default. This is a mechanism that allocates CPU resources only while processing HTTP requests and reduces the CPU to nearly zero after returning a response.

Why Does This Affect Background Threads?

Recall the bot's architecture:

1. HTTP request received
2. Background thread started
3. HTTP response {} returned immediately  ← CPU allocation drops sharply here
4. Background thread calls Chat API  ← Almost no CPU available

CPU throttling doesn't "freeze" threads. Threads stay alive, but the allocated CPU drops dramatically. Because the allocation starts at the default 0.17 vCPU and then drops to nearly zero, operations like network I/O and response parsing become extremely slow.

In fact, the simple pattern implemented in the previous article (create thinking card → 1 patch) only made 2 API calls, so it managed to work under this constraint. However, with progressive cards, 10+ patches occur over several seconds, causing extreme delays due to CPU shortage and making the bot appear essentially unresponsive.

| Pattern | API call count | Behavior under CPU throttling |
| --- | --- | --- |
| thinking → patch (v0.1.0) | 2 | Slow but completes |
| Progressive card | 10+ | Extreme delay, essentially unresponsive |

Solution: Always-On CPU Allocation

Setting --no-cpu-throttling keeps CPU allocated even after the HTTP response is returned. This ensures background threads reliably run with full CPU.

However, --no-cpu-throttling requires at least 1 vCPU. --cpu=1 is not needed for performance — it's a prerequisite for enabling --no-cpu-throttling.

# Step 1: Deploy with increased CPU and memory
gcloud functions deploy google-chat-bot \
  --gen2 \
  --runtime=python314 \
  --region=asia-northeast1 \
  --source=. \
  --entry-point=handle_chat \
  --trigger-http \
  --no-allow-unauthenticated \
  --memory=512Mi \
  --cpu=1

# Step 2: Disable CPU throttling
gcloud run services update google-chat-bot \
  --region=asia-northeast1 \
  --no-cpu-throttling

Note: --no-cpu-throttling is not supported in the gcloud functions deploy command, so it must be set separately with gcloud run services update. Since Cloud Functions Gen 2 is Cloud Run itself, you can configure it directly with the gcloud run command.

Cost Considerations

Increasing CPU from 0.17 vCPU to 1 vCPU and enabling always-on allocation will increase costs. However, since min-instances=0 (default) means instances scale to zero when there are no requests, the actual cost remains very low.

| Setting | min-instances | Monthly cost (Tokyo region) |
| --- | --- | --- |
| 0.17 vCPU, 256Mi (default) | 0 | Nearly free |
| 1 vCPU, 512Mi, --no-cpu-throttling | 0 | ~$0.55 USD/month (charged only during requests) |
| 1 vCPU, 512Mi, --no-cpu-throttling | 1 | ~$50 USD/month (always running) |

We've chosen min-instances=0 here. With approximately 10 seconds of processing time per request and around 100 requests per day, costs would be around $0.55 USD per month. We've accepted the cold start latency (a few seconds) and chosen to minimize costs.

If you need consistently low latency in production, setting min-instances=1 solves that, but at a cost of approximately $50 USD/month.
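
The two figures can be roughly reproduced with a back-of-the-envelope calculation. The unit prices below are assumptions chosen for illustration, not values taken from this article, so check the current Cloud Run pricing page before relying on them. The sketch also ignores the free tier and any idle time an instance stays allocated after its last request.

```python
# Assumed unit prices in USD for always-allocated CPU and memory.
# These are illustrative; verify against the current pricing page.
CPU_PRICE_PER_VCPU_SECOND = 0.000018
MEMORY_PRICE_PER_GIB_SECOND = 0.000002

VCPUS = 1
MEMORY_GIB = 0.5  # 512Mi


def monthly_cost(allocated_seconds):
    """Cost of keeping one instance allocated for allocated_seconds."""
    cpu = allocated_seconds * VCPUS * CPU_PRICE_PER_VCPU_SECOND
    memory = allocated_seconds * MEMORY_GIB * MEMORY_PRICE_PER_GIB_SECOND
    return cpu + memory


# min-instances=0: 10 s per request, 100 requests per day, 30 days.
on_demand = monthly_cost(10 * 100 * 30)
# min-instances=1: one instance allocated around the clock for 30 days.
always_on = monthly_cost(30 * 24 * 60 * 60)

print(f"on demand: ${on_demand:.2f}, always on: ${always_on:.2f}")
```

Under these assumed prices the results land close to the table: well under a dollar for on-demand use and just under $50 for an always-on instance.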

Solving Obstacle 3: Bundling the Discovery Document

Even after solving CPU throttling and getting background threads to work, the problem of the Discovery Document download taking 2 minutes on cold starts remained.

The solution was simple: bundle the Discovery Document as a static file in the project.

# Download Discovery Document
curl -o chat_discovery.json \
  'https://chat.googleapis.com/$discovery/rest?version=v1'

import json
from pathlib import Path

import google.auth
from googleapiclient.discovery import build_from_document

SCOPES = ["https://www.googleapis.com/auth/chat.bot"]
_DISCOVERY_DOC_PATH = Path(__file__).parent / "chat_discovery.json"

def _get_default_service():
    credentials, _ = google.auth.default(scopes=SCOPES)
    # Parse the bundled file instead of fetching it over the network
    doc = json.loads(_DISCOVERY_DOC_PATH.read_text())
    return build_from_document(doc, credentials=credentials)

By using build_from_document() instead of build(), network access is completely eliminated. Even on cold starts, the first card now appears within a few seconds.

Note that upgrading to --cpu=1 also reduces the build() download time from 2 minutes to a few seconds, but that is still far slower than the under-100ms load time of build_from_document(). The optimization therefore remains worthwhile independently of the CPU increase.
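
One possible refinement, which is our assumption and not something measured above, is to cache the parsed document so that the file is read and parsed only once per instance. The sketch below uses a hypothetical load_discovery_doc helper and a tiny stand-in file; the returned dict is what would be handed to build_from_document().

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_discovery_doc(path):
    """Read and parse the bundled Discovery Document once per process."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Demo with a tiny stand-in file instead of the real chat_discovery.json.
with tempfile.TemporaryDirectory() as tmp:
    doc_path = Path(tmp) / "chat_discovery.json"
    doc_path.write_text(
        json.dumps({"name": "chat", "version": "v1"}), encoding="utf-8"
    )
    first = load_discovery_doc(str(doc_path))
    second = load_discovery_doc(str(doc_path))  # served from the cache

print(first["name"], first is second)
```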

What If the Discovery Document Becomes Outdated?

The Discovery Document is like a "map" describing the API's URL paths, parameter names, and request/response schemas. Since Google's REST APIs maintain strict backward compatibility, the signatures of existing methods (like spaces.messages.create and spaces.messages.patch) almost never change.

Even with an older Discovery Document, existing functionality continues to work. You simply won't have access to new API features. It works fine for several months to a year, but these are good occasions to re-download it:

  • When upgrading google-api-python-client
  • When you want to use new Chat API features
  • As regular maintenance every few months

# Re-download periodically
curl -o chat_discovery.json \
  'https://chat.googleapis.com/$discovery/rest?version=v1'

Obstacle 5: The Feedback Button Trap (A Triple Trap)

With the progressive card working, we added buttons for users to provide feedback on answer quality. However, getting these buttons to work meant falling into 3 traps.

Trap 1: action.function Must Be the Full Endpoint URL

The first button definition we wrote was:

# ❌ Doesn't work — specifying a function name in function
{
    "onClick": {
        "action": {
            "function": "feedback",
            "parameters": [
                {"key": "vote", "value": "up"},
            ]
        }
    }
}

Clicking the button displayed the error "(bot name) cannot process the request", and no logs arrived at the endpoint.

The cause was that with the HTTP endpoint method for Workspace Add-ons, action.function must specify the bot's endpoint URL itself. Google Chat POSTs a CARD_CLICKED event to this URL.

# ✅ Correct — specify the full HTTPS URL in function
{
    "onClick": {
        "action": {
            "function": "https://asia-northeast1-PROJECT.cloudfunctions.net/google-chat-bot",
            "parameters": [
                {"key": "action", "value": "feedback"},
                {"key": "vote", "value": "up"},
            ]
        }
    }
}

With the Apps Script or Dialogflow methods, you specify a function name in function, but with the HTTP endpoint method, you specify a URL. This distinction is hard to find in the documentation.

Note that invokedFunction receives this URL as-is, so routing is handled by adding an action key to parameters.
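
Routing on the action parameter could look like the sketch below. It assumes the click parameters are available as the same list of key/value pairs defined on the button; the exact location and shape of the parameters inside the incoming event should be checked against a real payload. parameters_to_dict, handle_feedback, HANDLERS, and route_card_click are hypothetical names.

```python
def parameters_to_dict(parameters):
    """Convert [{"key": ..., "value": ...}, ...] into a plain dict."""
    return {item["key"]: item["value"] for item in parameters}


def handle_feedback(params):
    return f"feedback recorded: {params.get('vote', 'unknown')}"


# Hypothetical routing table keyed by the "action" parameter.
HANDLERS = {"feedback": handle_feedback}


def route_card_click(parameters):
    params = parameters_to_dict(parameters)
    handler = HANDLERS.get(params.get("action"))
    if handler is None:
        return "unknown action"
    return handler(params)


clicked = [
    {"key": "action", "value": "feedback"},
    {"key": "vote", "value": "up"},
]
print(route_card_click(clicked))
```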

Trap 2: Building the Correct URL from Inside Cloud Functions

To avoid hardcoding the endpoint URL, we tried to assemble it dynamically from the request.

# ❌ request.base_url returns the internal URL
endpoint_url = request.base_url
# → "http://localhost:8080/" (URL of Cloud Run's internal proxy)

Cloud Functions Gen 2 runs on Cloud Run, and the request Flask receives is forwarded from an internal proxy. request.base_url returns the internal localhost:8080, not the external URL.

You can get the hostname using the X-Forwarded-Host and X-Forwarded-Proto headers, but there's another trap:

# ❌ request.path returns "/"
host = request.headers.get("X-Forwarded-Host")  # "asia-northeast1-PROJECT.cloudfunctions.net"
scheme = request.headers.get("X-Forwarded-Proto")  # "https"
path = request.path  # "/" ← Not "/google-chat-bot"!

The Cloud Functions runtime strips the function name path prefix before passing the request to Flask. Even if the external URL is /google-chat-bot, the request.path visible to Flask is /.

The solution is to use the K_SERVICE environment variable. In Cloud Functions Gen 2 (= Cloud Run), this environment variable is automatically set to the service name (= function name).

import os

host = request.headers.get("X-Forwarded-Host") or request.headers.get("Host", "")
scheme = request.headers.get("X-Forwarded-Proto", "https")
service = os.environ.get("K_SERVICE", "")
endpoint_url = f"{scheme}://{host}/{service}" if host else ""
# → "https://asia-northeast1-PROJECT.cloudfunctions.net/google-chat-bot"
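
Wrapping this logic in a function that takes the headers and environment as arguments makes it easy to test without a running function. build_endpoint_url is a hypothetical name, and returning an empty string when K_SERVICE is missing is our addition rather than part of the snippet above.

```python
def build_endpoint_url(headers, environ):
    """Rebuild the external function URL from proxy headers and K_SERVICE."""
    host = headers.get("X-Forwarded-Host") or headers.get("Host", "")
    scheme = headers.get("X-Forwarded-Proto", "https")
    service = environ.get("K_SERVICE", "")
    if not host or not service:
        return ""  # the caller should fall back to a configured URL
    return f"{scheme}://{host}/{service}"


url = build_endpoint_url(
    {
        "X-Forwarded-Host": "asia-northeast1-PROJECT.cloudfunctions.net",
        "X-Forwarded-Proto": "https",
    },
    {"K_SERVICE": "google-chat-bot"},
)
print(url)
```

In the real handler you would pass request.headers and os.environ. Plain dicts are case-sensitive, unlike Flask's header object, so the demo uses the exact header casing.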

Trap 3: Response Format Must Be actionResponse

Even once the URL is correct and events arrive at the endpoint, the button still fails if the response format is wrong.

// ❌ renderActions is for dialogs
{"renderActions": {"action": {"navigations": [{"updateCard": {...}}]}}}

// ❌ updateMessageAction is for synchronous response message updates
{"hostAppDataAction": {"chatDataAction": {"updateMessageAction": {"message": {...}}}}}

The correct response format for CARD_CLICKED events is actionResponse:

// ✅ Correct response for CARD_CLICKED
{
  "actionResponse": {"type": "UPDATE_MESSAGE"},
  "cardsV2": [{
    "cardId": "progressive-card",
    "card": {
      "sections": [{
        "widgets": [{
          "textParagraph": {"text": "Thank you for your feedback!"}
        }]
      }]
    }
  }]
}

To summarize:

| Operation | Response format |
| --- | --- |
| Synchronous message creation | hostAppDataAction.chatDataAction.createMessageAction |
| Update message on CARD_CLICKED | actionResponse: {type: "UPDATE_MESSAGE"} + cardsV2 |
| Show dialog | renderActions.action.navigations[].pushCard |
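
In Python, the CARD_CLICKED response can be built with a small helper. build_feedback_response is a hypothetical name; the structure mirrors the actionResponse example above.

```python
import json


def build_feedback_response(card_id, text):
    """Build the UPDATE_MESSAGE response returned for a CARD_CLICKED event."""
    return {
        "actionResponse": {"type": "UPDATE_MESSAGE"},
        "cardsV2": [
            {
                "cardId": card_id,
                "card": {
                    "sections": [
                        {"widgets": [{"textParagraph": {"text": text}}]}
                    ]
                },
            }
        ],
    }


response = build_feedback_response(
    "progressive-card", "Thank you for your feedback!"
)
print(json.dumps(response, indent=2))
```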

Obstacle 6: Thread Replies Become New Threads

After the progressive card was complete and the feedback buttons were working, the next problem surfaced. When a user replied within a thread to the bot's answer, the bot's response was created as a new top-level message instead of within the same thread.

User: "Tell me about company policies"
Bot:  [Answer with progressive card]  ← Thread ①

User: "Can you explain in more detail?"  ← Reply within Thread ①
Bot:  [Answer with new card]  ← Thread ② (new!)  ← This is the problem

This breaks the flow of conversation.

Cause: Default Value of messageReplyOption

Google Chat API's spaces.messages.create has a parameter called messageReplyOption. When this parameter is not specified, the default MESSAGE_REPLY_OPTION_UNSPECIFIED is applied, and messages are always created as new threads.

In other words, we needed to extract the thread information from the received message and pass it when replying.

Solution: Carrying Over thread.name

The received event's message object contains thread.name (the thread's resource name). It was as simple as extracting this and passing it to spaces.messages.create.

# Extract thread information from received event
thread_name = message.get("thread", {}).get("name")
# Attach thread information when creating message
kwargs = {"parent": space_name, "body": body}
if thread_name:
    kwargs["body"] = {**body, "thread": {"name": thread_name}}
    kwargs["messageReplyOption"] = "REPLY_MESSAGE_FALLBACK_TO_NEW_THREAD"

service.spaces().messages().create(**kwargs).execute()

REPLY_MESSAGE_FALLBACK_TO_NEW_THREAD replies within the specified thread if it exists, or creates a new thread if it doesn't — a safe fallback behavior. This single option works correctly for DMs, new messages, and thread replies alike.
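
The same logic can be isolated in a function that only builds the arguments, which makes both cases (with and without a thread) easy to check. build_create_kwargs is a hypothetical helper; the result would be passed to service.spaces().messages().create(**kwargs).

```python
def build_create_kwargs(space_name, body, message):
    """Build keyword arguments for messages.create, keeping the thread if any."""
    kwargs = {"parent": space_name, "body": body}
    thread_name = message.get("thread", {}).get("name")
    if thread_name:
        # Copy the body so the caller's dict is not mutated.
        kwargs["body"] = {**body, "thread": {"name": thread_name}}
        kwargs["messageReplyOption"] = "REPLY_MESSAGE_FALLBACK_TO_NEW_THREAD"
    return kwargs


body = {"text": "Here is more detail."}
threaded = build_create_kwargs(
    "spaces/SPACE_ID",
    body,
    {"thread": {"name": "spaces/SPACE_ID/threads/THREAD_ID"}},
)
unthreaded = build_create_kwargs("spaces/SPACE_ID", body, {})
print(sorted(threaded), sorted(unthreaded))
```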

Final Architecture

Here is the final configuration after overcoming all the obstacles.

main.py              → HTTP handler (returns {}, starts thread, CARD_CLICKED routing)
worker.py            → Pipeline orchestration (with step tracking)
cards.py             → cardsV2 builder (progressive card)
models.py            → Pipeline data models (StepStatus, PipelineState)
throttle.py          → Rate-limit-aware patcher (1 write/sec/space)
feedback.py          → CARD_CLICKED event handler
chat_api.py          → Chat API wrapper (static Discovery Document)
chat_discovery.json  → Bundled Chat API v1 Discovery Document

Pipeline Flow

1. HTTP request received → Return {} immediately
2. Start pipeline in background thread
3. Create initial card (show 4 steps, all PENDING)
4. Patch card as each step progresses
   - Analyzing query → ✅
   - Creating search query → ✅
   - Searching knowledge base → ✅
   - Generating answer → ✅
5. Patch while adding response paragraphs
6. Show feedback buttons upon completion
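
The state tracked through this flow might be modeled as below. StepStatus and PipelineState are the names listed for models.py, but the fields and methods here are assumptions made for illustration, not the actual implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"


STEP_NAMES = (
    "Analyzing query",
    "Creating search query",
    "Searching knowledge base",
    "Generating answer",
)


@dataclass
class PipelineState:
    # Every step starts as PENDING, matching the initial card.
    steps: dict = field(
        default_factory=lambda: {name: StepStatus.PENDING for name in STEP_NAMES}
    )
    paragraphs: list = field(default_factory=list)

    def start(self, name):
        self.steps[name] = StepStatus.RUNNING

    def finish(self, name):
        self.steps[name] = StepStatus.DONE

    @property
    def completed(self):
        return all(status is StepStatus.DONE for status in self.steps.values())


state = PipelineState()
for name in STEP_NAMES:
    state.start(name)   # a card patch would follow each transition
    state.finish(name)
state.paragraphs.append("First paragraph of the answer.")
print(state.completed)
```

Each state transition is a natural point to rebuild the card and hand it to the rate-limited patcher, and completed tells the worker when to add the feedback buttons.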

Deployment

# Step 1: Deploy function
gcloud functions deploy google-chat-bot \
  --gen2 \
  --runtime=python314 \
  --region=asia-northeast1 \
  --source=. \
  --entry-point=handle_chat \
  --trigger-http \
  --no-allow-unauthenticated \
  --memory=512Mi \
  --cpu=1

# Step 2: Disable CPU throttling (required for background threads)
gcloud run services update google-chat-bot \
  --region=asia-northeast1 \
  --no-cpu-throttling

Summary

We hit 6 obstacles while implementing a progressive UX for a RAG pipeline in Google Chat Bot.

| Obstacle | Cause | Solution |
| --- | --- | --- |
| Message API is synchronous | Streaming not supported | cardsV2 + Chat API patch |
| Slow cold start | build() downloads Discovery Document | Use build_from_document() with static file |
| Background thread doesn't run | CPU throttling in Cloud Functions Gen 2 | --cpu=1 + --no-cpu-throttling |
| Rate limit | 1 write/sec/space | ThrottledPatcher (latest-wins buffer) |
| Feedback button errors (triple trap) | Function name in action.function / URL assembly mistake / Wrong response format | Full URL + K_SERVICE + actionResponse |
| Thread replies become new threads | Always new thread when messageReplyOption not specified | Carry over thread.name + REPLY_MESSAGE_FALLBACK_TO_NEW_THREAD |

Honestly, compared to Discord Bot or Slack Bot, the Google Chat Bot development experience is still rough around the edges. There are many areas where the documentation hasn't caught up with the Workspace Add-ons format, and many things like the distinction between response formats (createMessageAction / actionResponse / renderActions) and the requirement for a full URL in action.function can only be discovered through trial and error.

On the other hand, cardsV2 progressive updates provide quite a good UX once they work. Users can see the pipeline progress in real-time and feedback can be collected. For organizations using Google Workspace, this investment is well worth it.

Next Steps: Migration to Cloud Tasks

The current architecture works with background threads + --no-cpu-throttling, but when scaling to a full-fledged RAG pipeline, we're considering migrating to Cloud Tasks.

Current: HTTP request → return {} → background thread runs in the same instance
Future:  HTTP request → enqueue in Cloud Tasks → worker processes as a separate request

Migrating to Cloud Tasks provides the following benefits:

| Aspect | Current (background thread) | Cloud Tasks |
| --- | --- | --- |
| CPU configuration | --cpu=1 + --no-cpu-throttling required | Works with default settings |
| Concurrency | Threads compete for CPU within one instance | Runs in separate instance per task |
| On failure | Silent failure, no retry | Automatic retry + Dead Letter Queue |
| Timeout | Constrained by HTTP timeout | Up to 30 minutes per task |

Especially for RAG pipelines with heavy operations like LLM calls and vector searches, resource contention under concurrent requests becomes a problem. Cloud Tasks naturally leverages Cloud Run's autoscaling, eliminating scalability concerns.

Note that the Discovery Document (chat_discovery.json) is a static file bundled in the container image, so it can be used as-is in worker instances started from the same deployment, requiring no additional configuration.
