I improved the N+1 problem in Aurora DSQL from 90 seconds to 8 seconds using AI (Kiro).

A search API timing out at 90 seconds? We resolved an N+1 problem that occurred in an Amazon Aurora DSQL environment with the help of an AI assistant (Kiro). This article walks through the process behind the 91% improvement, covering execution plan analysis, AI-suggested code changes, and the handling of side effects, backed by actual measurement logs.
2026.01.29

Last year, I had the opportunity to develop an API for blog assistance using Amazon Aurora DSQL, which became generally available (GA) in May 2025. During this process, I faced an issue where tag search response time deteriorated proportionally to the number of articles.

This article shares the process of using an AI assistant tool (Kiro) to investigate the cause, which helped reduce processing time from about 90 seconds to 8 seconds (91% improvement), along with evidence from logs and how we handled side effects that occurred during the fix.

Testing Environment

The development environment was as follows, with a record count equivalent to production data volume:

  • Database: Amazon Aurora DSQL
  • Compute: AWS Lambda (Python 3.12)
  • Data scale:
    • Article data: approximately 60,000 entries
    • Tag data: approximately 7,300 entries
    • Relationship table: approximately 431,500 entries

1. Confirming the Issue (90-second latency)

During the development of the blog API, we discovered timeouts or extreme delays when searching for articles with specific tags (e.g., AWS). First, we measured actual values using the curl command.

Response Time Measurement

# Execution command (measured with time)
time curl -s "API_URL?tags={tag_name}&limit=3" > /dev/null

Search Tag | Matching Articles | Response Time | Assessment
lambda     | 3,629 entries     | 6 seconds     | Delayed
security   | 7,884 entries     | 12 seconds    | Significant delay
aws        | 28,074 entries    | 90 seconds    | Timeout

We observed that response time degraded as the number of matching articles increased.

2. AI Hypothesis and Verification (EXPLAIN ANALYZE)

From the abnormal 90-second latency, we suspected inefficient loop processing on the application side (N+1 problem) and asked the AI assistant (Kiro) to investigate.

Kiro's Autonomous Investigation Preparation

After explaining the situation, Kiro immediately estimated the cause and not only proposed solutions but also autonomously completed everything from "generating verification scripts" to "execution preparation," then asked for permission to execute.

Kiro:
"It's likely an N+1 problem. We should use Aurora DSQL's (PostgreSQL compatible) EXPLAIN ANALYZE to obtain the actual query execution plan.
I've created an EXPLAIN ANALYZE script. Shall I execute it?"

After reviewing the SQL in the provided Python script and confirming there was no risk of data destruction (read-only operations), I gave permission to execute. Essentially, all the human side did was "code review and approval."
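
The actual script Kiro generated is not reproduced here; below is a minimal sketch of what such a read-only verification script could look like, assuming psycopg2 and connection settings supplied via environment variables (the table and column names follow the queries shown later in this article).

import os
import psycopg2

# Read-only verification: run EXPLAIN ANALYZE on the tag-search query and
# print the resulting plan. Connection settings are assumed to come from
# environment variables; no data is modified.
EXPLAIN_SQL = """
    EXPLAIN ANALYZE
    SELECT bm.id
    FROM blogpost_meta bm
    WHERE bm.id IN (
        SELECT blog_post_id
        FROM post_tag
        WHERE tag_name = ANY(%s)
        GROUP BY blog_post_id
        HAVING COUNT(DISTINCT tag_name) = %s
    )
"""

def main():
    conn = psycopg2.connect(
        host=os.environ["DSQL_ENDPOINT"],
        dbname=os.environ.get("DB_NAME", "postgres"),
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],  # e.g. a generated auth token
        sslmode="require",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(EXPLAIN_SQL, (["aws"], 1))
            for (plan_line,) in cur.fetchall():  # one plan line per row
                print(plan_line)
    finally:
        conn.close()

if __name__ == "__main__":
    main()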

Scale of Data Under Investigation

To verify, we checked the statistics of the tag relationship table that was causing the performance bottleneck.

Item                     | Value
Total records            | 431,509 entries
Unique tags              | 11,775 tags
Unique articles          | 59,721 articles
Average tags per article | 7.23

On average, more than 7 tags are assigned to each of roughly 60,000 articles, for a total of about 430,000 records. We suspected inefficient access to a table of this scale.
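
For reference, figures like these can be obtained with a single aggregate query over post_tag; a sketch is shown below (assuming the table and column names used elsewhere in this article).

# Aggregate statistics over the post_tag relationship table (sketch;
# assumes the table and column names used elsewhere in this article).
TAG_STATS_SQL = """
    SELECT
        COUNT(*) AS total_records,
        COUNT(DISTINCT tag_name) AS unique_tags,
        COUNT(DISTINCT blog_post_id) AS unique_articles,
        ROUND(COUNT(*)::numeric / NULLIF(COUNT(DISTINCT blog_post_id), 0), 2)
            AS avg_tags_per_article
    FROM post_tag
"""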

Execution Plan Analysis Results

We ran the script generated by Kiro and obtained the following EXPLAIN ANALYZE log:

Execution Time: 31913.008 ms

->  Nested Loop  (actual time=2583.240..13677.777 rows=28074 loops=1)
    ->  Index Only Scan on posts_pkey (actual rows=1 loops=28079)
        -- ↑ 28,079 loops (Index Scan) occurred
    ->  Index Only Scan on post_summary_pkey (actual rows=1 loops=28074)
        -- ↑ 28,074 loops (Index Scan) occurred

From the log, we identified that over 56,000 index scans were occurring.
We concluded that the cause of the 90-second timeout was that, for each of the roughly 28,000 article IDs with the "aws" tag extracted from the 430,000-record post_tag table, the database was performing a separate index lookup in a loop, one row at a time.
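
Put in application terms, the plan behaves like the classic N+1 anti-pattern sketched below; this is an illustration only (not the article's actual code, and a psycopg2 connection is assumed), showing one lookup per matching article ID instead of a single set-based join.

# Illustration of the N+1 pattern that the execution plan amounts to
# (not the actual application code; a psycopg2 connection is assumed).
def fetch_articles_n_plus_one(conn, tag_name):
    with conn.cursor() as cur:
        # Query 1: collect the matching article IDs (~28,000 for "aws")
        cur.execute(
            "SELECT blog_post_id FROM post_tag WHERE tag_name = %s",
            (tag_name,),
        )
        article_ids = [row[0] for row in cur.fetchall()]

        # Queries 2..N+1: one lookup per ID -- the same per-row pattern
        # that shows up as a Nested Loop with ~28,000 inner loops.
        articles = []
        for article_id in article_ids:
            cur.execute(
                "SELECT * FROM blogpost_meta WHERE id = %s",
                (article_id,),
            )
            articles.append(cur.fetchone())
        return articles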

3. Fix Implementation (From Subquery to JOIN)

The AI suggested "changing the IN (SELECT ...) subquery structure to a JOIN would allow PostgreSQL's optimizer to select an efficient join process."

Code Before Fix (Python)

# Using subquery (anti-pattern)
where_conditions.append("""
    bm.id IN (
        SELECT blog_post_id
        FROM post_tag
        WHERE tag_name = ANY(%s)
        GROUP BY blog_post_id
        HAVING COUNT(DISTINCT tag_name) = %s
    )
""")

Code After Fix

# Using JOIN
from_clause = """
    FROM blogpost_meta bm
    JOIN post_tag bat ON bm.id = bat.blog_post_id -- Changed to INNER JOIN
    LEFT JOIN blogpost_ribbon br ON bm.id = br.blog_post_id
    LEFT JOIN blogpost_ai_summary bas ON bm.id = bas.id
"""

query = f"""
    SELECT ...
    {from_clause}
    WHERE bat.tag_name = ANY(%s) -- Filter in WHERE clause
    GROUP BY bm.id, ...
    HAVING COUNT(DISTINCT bat.tag_name) = %s
    ORDER BY bm.first_published_at DESC
"""

We updated the Lambda function and measured again.

4. Confirming Improvement

After the fix, we measured using the curl command under the same conditions.

Search Tag | Before Fix | After Fix | Improvement Rate
lambda     | 6 seconds  | 5 seconds | 17%
security   | 12 seconds | 5 seconds | 58%
aws        | 90 seconds | 8 seconds | 91%

For the data-heavy "aws" tag, we confirmed a reduction from 90 seconds to 8 seconds, about 91% improvement. This led us to conclude that the N+1 problem was resolved.

5. Side Effect Occurrence and Resolution (Sort Order)

While performance improved, we identified a new issue: "the article sort order (by publication date) was broken."

Issue Log

curl -s "API_URL?author=suzuki-ryo&limit=10" | jq '.articles[] | .published_at'

"2025-11-25T05:25:49.676000"
"2025-01-10T06:27:29.620000"
"2025-11-25T14:27:51.467000"

The dates were not in descending order.
Investigation revealed that removing the DISTINCT ON clause during the N+1 fix broke its consistency with the ORDER BY clause. By PostgreSQL specification, the DISTINCT ON expressions must match the leftmost ORDER BY expressions, which is why the original query had to put bm.id first in ORDER BY. Once DISTINCT ON was removed, that constraint no longer applied, yet bm.id remained the leading sort key, producing an unintended sort order.
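
For reference, the rule applies to a shape like the following (a simplified illustration, not the article's full query): the DISTINCT ON expression must also be the leftmost ORDER BY expression, which is why bm.id originally had to lead the ORDER BY clause.

# Simplified shape of the original query (illustration only): with
# DISTINCT ON (bm.id), PostgreSQL requires bm.id to be the leftmost
# ORDER BY expression, so results come back ordered by id, with the
# publication date acting only as a secondary sort key.
ORIGINAL_QUERY_SHAPE = """
    SELECT DISTINCT ON (bm.id) bm.id, bm.first_published_at
    FROM blogpost_meta bm
    ORDER BY bm.id, bm.first_published_at DESC
"""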

Fix Implementation

We changed the ORDER BY clause to prioritize the publication date (bm.first_published_at) rather than the primary key (bm.id).

-- Before fix
ORDER BY bm.id, bm.first_published_at DESC 

-- After fix
ORDER BY bm.first_published_at DESC

After redeployment, we confirmed that the data came back in descending order of publication date, as intended.

"2026-01-25T13:21:17.055000"
"2026-01-17T16:34:16.857000"
"2026-01-16T05:50:00.259000"

6. Further Optimization (Multiple Tag Search)

While single tag searches were optimized, searching for multiple tags (e.g., tags=aws,ec2,rds) still caused a delay of about 19 seconds. This was because we were performing a GROUP BY on a huge result set after joining all tables.

To address this, we adopted an approach of "filtering IDs in a subquery first, then performing JOINs on the results."

-- Optimized approach
SELECT bm.id, ...
FROM blogpost_meta bm
INNER JOIN (
    -- Filter target IDs here first
    SELECT blog_post_id
    FROM post_tag
    WHERE tag_name = ANY(ARRAY['aws', 'ec2', 'rds'])
    GROUP BY blog_post_id
    HAVING COUNT(DISTINCT tag_name) = 3
) AS matched_posts ON bm.id = matched_posts.blog_post_id
...

With this change, we confirmed that response times remained stable at around 8 seconds even as the number of tags increased.
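
On the application side, a query of this shape can be assembled with the tag list and tag count bound as parameters; a minimal sketch follows (function and variable names are illustrative, not the article's actual code).

def build_multi_tag_query(tags, limit=10):
    # Sketch of building the ID-filtering variant of a multi-tag search.
    # The tag values, tag count, and limit are bound as parameters rather
    # than interpolated into the SQL string.
    query = """
        SELECT bm.id, bm.first_published_at
        FROM blogpost_meta bm
        INNER JOIN (
            SELECT blog_post_id
            FROM post_tag
            WHERE tag_name = ANY(%s)
            GROUP BY blog_post_id
            HAVING COUNT(DISTINCT tag_name) = %s
        ) AS matched_posts ON bm.id = matched_posts.blog_post_id
        ORDER BY bm.first_published_at DESC
        LIMIT %s
    """
    params = (list(tags), len(tags), limit)
    return query, params

# Usage: query, params = build_multi_tag_query(["aws", "ec2", "rds"])
#        cur.execute(query, params)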

Summary

I've shared a case of performance improvement using Kiro in an Amazon Aurora DSQL environment.

Aurora DSQL is a distributed database that provides a PostgreSQL-compatible endpoint and is very easy to start using. However, when writing complex queries in raw SQL, as in this case, the engineer's implementation skill directly affects performance.

In performance tuning, "adding missing indexes" is the first approach many engineers reach for. In this case, however, the bottleneck lay in an inefficient join method: the indexes were being used, just one row at a time inside a nested loop.

Reading from the EXPLAIN ANALYZE output why the optimizer chose a given execution plan, and then deciding to refactor the query structure itself to guide it toward a more efficient join method (such as a Hash Join), requires solid database engineering knowledge.

Kiro proved to be a strong partner: not just a code completion tool, but one that autonomously covered DBA territory such as execution plan analysis and query restructuring with join algorithms in mind.
When adopting a new database technology like Aurora DSQL, I highly recommend trying the AI agent Kiro, which can be expected to demonstrate these capabilities.
