
Creating, Discarding, Creating Again - A 49-Day Record of Completely Entrusting Internal Tool Development to Claude Code
Hello. I'm Ikeda from the SRE team in the Retail Distribution Solutions Department.
Previously, I wrote about my experience creating a security check tool with Claude Code.
This time, I'm sharing a record of tackling a larger-scale business tool based on that experience.
For 49 days from January 29 to March 18, 2026, I entrusted almost all development work to Claude Code to create an internal business tool. Along the way, I completely changed the architecture, hit plan limitations requiring a project restart, and ultimately had to shelve production adoption after encountering platform limitations. I'll share this process chronologically.
What I Tried to Create
In our department, we have a workflow for managing access permissions to AWS accounts. There's a system where users submit requests stating "On this date, I want to access this account, for this business purpose, with these permissions," and when approved, AWS IAM permissions are automatically granted.
Until now, we had operated by connecting three apps (a user management ledger, an AWS account management ledger, and access permission applications) built on a certain no-code tool, but their creator had already left the company, several parts had become a black box, and a separate permission granting flow centered on GitHub PRs ran in parallel. We wanted to integrate everything into a single system.
Preliminary Research — Deep Research and Gemini (1/29 - 2/2)
January 29 — Requesting Three Reports
The first step of the project was research, not coding.
I requested three research reports from Deep Research: AWS Fargate best practices, deployment management with Terraform + Atlantis, and how to build a temporary IAM privilege escalation system. For the third prompt, I asked Gemini to "understand our existing operational methods and suggest what to research."
Gemini responded with a questionnaire format. "What's the maximum usage time?" "What triggers automatic detachment?" "What authentication infrastructure?" Filling in these answers naturally helped solidify the requirements.
At this stage, I was planning an AWS Fargate + DynamoDB + Step Functions architecture. A closed network via VPC endpoints with Google Workspace SSO for authentication.
First Attempt with Fargate Configuration (2/3 - 2/13)
February 3 — Starting Development with Claude Code
I provided the preliminary research deliverables to Claude Code and began development using the tsumiki plugin workflow (an OSS plugin for instructing Claude Code on TDD development cycles). The flow was: requirements definition → design → task breakdown → implementation via TDD.
Basic local development was completed that day. However, reading through the generated documentation, something felt off. Feeling that more preliminary information was needed, I revised the PRD, added screenshots and HTML from the existing system as reference materials, and started over.
When I showed the requirements definition output to Gemini, it pointed out two overlooked issues, which I also incorporated.
By February 6, the second task breakdown was completed.
The Plan Wall — MAX100 Limitations
A problem arose with Claude Code's token consumption under the MAX100 plan.
With research, PRD creation, design document generation, and task implementation all delegated to Claude Code, I exhausted the usage cap of almost every 5-hour session window after about 2 hours of work. During the roughly 3-hour wait for each reset, I handled tasks only I could do, but by February 9 I had also hit the weekly usage limit (across all models). It reset on February 10 and I resumed work, but by February 13 I was approaching the weekly limit again...
Not being able to use Claude for most working hours was preventing development progress.
February 13 — Decision to Switch Architecture
On February 13, with the weekly limit approaching, I had a thought:
"Wouldn't a GAS web app significantly reduce operational costs?"
With the Fargate configuration, using VPC, ALB, NAT Gateway, Fargate tasks, and DynamoDB would result in fixed monthly costs in the tens of thousands of yen. In contrast, GAS is included in existing Google Workspace licenses with zero additional cost.
Therefore, I thought it would make sense to start with the option that had the highest potential for fixed cost reduction.
I consulted with my manager: "To quickly try, quickly fail, and quickly improve, I'd like to use the MAX200 plan for the month it will take to complete this project." With approval granted, I decided to simultaneously switch architectures and upgrade the plan, restarting the project as v3.
Fresh Start as a GAS SPA (2/14 - 2/16)
Carrying Over Assets from the Abandoned Project
Though the architecture was completely different from the initial Fargate configuration, the research documents (existing app data structure definitions, migration plans, security rules) were still usable. I brought these into the new project and asked Claude Code to design and break down tasks for a GAS SPA.
The difference with the MAX200 plan was substantial. Where the Fargate-era work regularly hit session limits after 2 hours and then waited 3 hours for resets, v3's design and task breakdown were completed in just 2 days (we still occasionally hit the 5-hour session usage limit, but wait times were at most about an hour).
Split into 111 Tasks
Claude Code's plan divided implementation into 111 tasks across 4 phases:
- Phase 1 (23 tasks): Common Foundation — Type definitions, repository layer, service layer
- Phase 2 (34 tasks): Backend — API routes, authentication, permission management
- Phase 3 (43 tasks): Frontend — React components, screens, state management
- Phase 4 (11 tasks): Finishing — E2E test environment, CSS, resolving type errors
Phases 1-2 Backend Implementation (2/17 - 2/20)
February 17 — TDD Cycle Begins
Development started with TASK-0001 (project initial setup).
When instructed to use the tsumiki plugin's TDD workflow (Red → Green → Refactor), Claude Code self-drives through the cycle of first writing failing tests, then implementing to pass all tests, and finally refactoring.
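The Red → Green → Refactor loop can be illustrated with a minimal sketch. The `maskAccountId` helper below is invented for illustration and is not from the actual codebase; the point is the order of operations: the failing test exists before the implementation does.

```typescript
// Green phase: the smallest implementation that makes the
// pre-written test pass (hypothetical helper, for illustration).
function maskAccountId(accountId: string): string {
  if (!/^\d{12}$/.test(accountId)) {
    throw new Error(`invalid AWS account ID: ${accountId}`);
  }
  // Keep the last 4 digits, mask the rest.
  return "********" + accountId.slice(-4);
}

// Red phase: this assertion was written first and failed until
// the function above existed.
const masked = maskAccountId("123456789012");
if (masked !== "********9012") {
  throw new Error(`expected ********9012, got ${masked}`);
}
// Refactor phase: names and structure are cleaned up while this keeps passing.
```

Because a test like this exists for every task, later refactors surface regressions immediately.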
That day, we progressed from TASK-0001 to TASK-0020 in one go. These covered type definitions, common configurations, spreadsheet schemas, and repository base classes—the skeleton of the application.
February 18-20 — 34 Tasks Completed in 3 Days
Phase 2's backend implementation included 34 tasks covering the repository layer, service layer, and API routing (Hono).
Claude Code's work style was consistent, repeating the flow of requirements definition → test case design → Red → Green → Refactor → completion confirmation for each task. I mostly just reviewed test results and said "next," or provided guidance when design decisions were needed.
The permission granting mechanism had GAS call existing AWS Lambda over HTTP, with Lambda handling IAM policy attachment/detachment. Claude Code handled the GAS-side invocation logic, advance status updates with rollback on failure, and audit logging.
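The shape of that flow (advance status update, Lambda call, rollback on failure, audit log) can be sketched as follows. All names are hypothetical, and the HTTP call is injected as a dependency so the logic is testable outside GAS; in the real app it would wrap `UrlFetchApp.fetch`.

```typescript
type Status = "pending" | "granting" | "granted";

interface AccessRequest { id: string; status: Status; }

interface Deps {
  callLambda: (req: AccessRequest) => { ok: boolean }; // UrlFetchApp in GAS
  saveStatus: (id: string, s: Status) => void;         // spreadsheet repository
  auditLog: (entry: string) => void;
}

function grantPermission(req: AccessRequest, deps: Deps): boolean {
  const previous = req.status;
  deps.saveStatus(req.id, "granting"); // advance status update
  try {
    const res = deps.callLambda(req);
    if (!res.ok) throw new Error("Lambda returned failure");
    deps.saveStatus(req.id, "granted");
    deps.auditLog(`granted: ${req.id}`);
    return true;
  } catch {
    deps.saveStatus(req.id, previous); // roll back status on any failure
    deps.auditLog(`rolled back: ${req.id}`);
    return false;
  }
}
```

Injecting the dependencies is also what made this layer easy to cover with the TDD cycle: a failing Lambda is simulated with a stub rather than a real HTTP call.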
By February 20, all 34 tasks in Phase 2 were completed. The test count had reached about 1,200.
Phase 3 Frontend Implementation (2/20 - 2/25)
Phase 3 was the largest with 43 tasks. Using React 18 + TypeScript, we created application list screens, application detail screens, new application forms, and admin screens for user and AWS account management.
Claude Code maintained the TDD style even when creating components. State management used useReducer + Context, with business app necessities like pagination, optimistic locking, and role-based access control all included.
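A reduced sketch of that state-management style, with hypothetical names: the reducer is a pure function, so it can be unit tested without React, and in the app it would be passed to `useReducer` and exposed through a Context provider.

```typescript
interface ListState {
  page: number;
  items: string[];
  loading: boolean;
}

type ListAction =
  | { type: "FETCH_START" }
  | { type: "FETCH_SUCCESS"; items: string[] }
  | { type: "NEXT_PAGE" };

// Pure reducer: every transition is an explicit, testable case.
function listReducer(state: ListState, action: ListAction): ListState {
  switch (action.type) {
    case "FETCH_START":
      return { ...state, loading: true };
    case "FETCH_SUCCESS":
      return { ...state, items: action.items, loading: false };
    case "NEXT_PAGE":
      return { ...state, page: state.page + 1 };
  }
}
```

Keeping transitions in pure reducers is what let the TDD rhythm continue unchanged into the frontend phase.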
CSS used BEM naming conventions, with Claude Code writing all styles in a single file. For screen design, I had Claude Code reference the existing system's UI and output mockups as HTML files, then requested specific buttons or had it revise parts that were hard for a human to read.
Later, once screens were somewhat complete, I deployed to the GAS environment to interact with actual screens, provide improvement instructions, and report bugs in features that should have been working. Throughout, I maintained the style of not writing code myself but reviewing results and giving correction instructions.
By February 25, Phase 3 was complete. The test count had reached about 2,300.
Phase 4 Finishing Touches (2/25 - 2/26)
Resolving 491 Type Errors in One Session
Phase 4 involved building the E2E test environment, CSS support for all components, and resolving TypeScript type errors.
TASK-0111 was particularly impressive, with Claude Code resolving all 491 TypeScript compiler type errors in a single session. These included type inconsistencies, missing imports, and generic inference errors of various kinds, all methodically addressed.
The errors had likely accumulated through oversights on my part, owing to my lack of TypeScript development experience.
By February 26, Phase 4 was complete. All 111 tasks were done, with 2,392 tests passing.
The third time was the charm—from start to implementation completion took just 10 days.
Lambda Deployment and PR Review (3/9 - 3/17)
Not Complete with Just GAS
When all 111 tasks were completed, the GAS-side application was operational. Screen display, login, and data CRUD worked fine. However, the critical permission granting function wasn't usable yet because the Lambda to be called from GAS hadn't been deployed to production.
The permission granting mechanism follows the path: GAS → API Gateway → Lambda → IAM. The plan was to duplicate existing Lambda functions and adjust them for the GAS SPA, managed with Terraform. This work was also delegated to Claude Code.
March 9-10 — Creating Terraform Code and PR
On March 9, Claude Code added cross-account role definitions to the Terraform module. The next day, it added Terraform code for Lambda, API Gateway, and IAM resources to the environments repository and created a PR.
About One Week of PR Review Response
The PR received review comments from two team members:
- Directory structure should be unified with existing Lambda functions
- PolicyARN condition keys should be added to AttachRolePolicy/DetachRolePolicy IAM policies
- Python runtime version notation should be standardized
- Unused variables and permissions should be removed
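The condition-key comment is worth illustrating: without a condition, a role allowed to call AttachRolePolicy could attach any managed policy to the target role, including AdministratorAccess. AWS IAM supports the `iam:PolicyARN` condition key to restrict which policies may be attached or detached. A sketch of such a policy document as a TypeScript constant (the ARNs are placeholders, not the actual values from the PR):

```typescript
// Illustration of the reviewed IAM policy shape. Without the
// Condition block, the role could attach ANY managed policy.
const attachDetachPolicy = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Action: ["iam:AttachRolePolicy", "iam:DetachRolePolicy"],
      // Placeholder role pattern, for illustration only.
      Resource: "arn:aws:iam::*:role/temporary-access-*",
      Condition: {
        ArnEquals: {
          // iam:PolicyARN restricts WHICH policies may be attached.
          "iam:PolicyARN": "arn:aws:iam::aws:policy/ReadOnlyAccess",
        },
      },
    },
  ],
};
```
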
Claude Code handled the review feedback responses as well. When informed of the issues, it would create and push correction commits autonomously.
Additionally, IAM access keys were needed for GAS to call Lambda, which had to be managed by Terraform. To address the security issue of SecretAccessKey being recorded in plaintext in Terraform state files, we added PGP encryption protection. Claude Code implemented this as well and submitted it as a separate PR for review.
During the PR review, a review-focused Claude Code Skill created by a colleague proved very helpful.
This process took about a week due to scheduling constraints of team members including myself, though Claude Code responded to requests in a matter of minutes.
On March 17, all PRs were merged and Lambda was deployed to production.
The Performance Wall (3/17 - 3/18)
March 17 — Everything Connected, But...
With Lambda deployed, the entire flow from application to permission granting finally worked.
When actually operating it, the functionality worked as expected. Submit an application, press a button, Lambda is called, IAM policy is attached. Press the usage completion button and it's detached. As designed.
But it was slow.
Every page load took 3-5 seconds of waiting, and from pressing the application button to completion took over 10 seconds. This was surprising since everything ran smoothly on the local development server.
The Cause Was Structural
When I asked Gemini, I learned that when running an SPA in GAS, communication from the frontend to the server goes through an RPC bridge called google.script.run. There's a 2-3 second overhead per request, and when a page makes multiple API calls, this adds up.
Additionally, the permission granting process follows the GAS → Lambda → AWS IAM path, adding an HTTP call via UrlFetchApp. In total, one application operation required 8-12 seconds.
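The standard google.script.run API is callback-based (`withSuccessHandler` / `withFailureHandler`). A common mitigation, sketched below with the runner injected so it runs outside GAS, is to promisify it so that a screen's several server calls can at least run concurrently via `Promise.all` instead of sequentially; in the browser the runner would be the global `google.script.run`.

```typescript
interface ScriptRunner {
  withSuccessHandler(cb: (result: unknown) => void): ScriptRunner;
  withFailureHandler(cb: (err: unknown) => void): ScriptRunner;
  [fn: string]: any; // server-side function names, resolved at runtime
}

// Wrap one callback-style server call in a Promise.
function callServer<T>(
  runner: ScriptRunner,
  fn: string,
  ...args: unknown[]
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    runner
      .withSuccessHandler((r) => resolve(r as T))
      .withFailureHandler(reject)[fn](...args);
  });
}
```

Concurrency hides some of the latency when a page issues several calls, but it cannot remove the 2-3 second cost of each individual round trip.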
March 18 — Attempting Improvements
I asked Claude Code for a performance analysis, and it listed 8 improvement suggestions. We implemented the 3 highest priority ones:
- Cache Optimization: Exclude PGP public keys from the cache to stay within the CacheService 100KB per-value limit
- Optimistic UI Updates: Immediately change frontend status displays without waiting for API responses
- Ancillary Process Protection: Wrap audit log recording after successful Lambda execution in try/catch to prevent success responses from being lost due to ancillary process failures
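The cache guard from the first item can be sketched as a small helper (names are hypothetical, and the cache is injected so the guard runs outside GAS): GAS's CacheService rejects values over 100KB per entry, so oversized values such as PGP public keys are simply not cached rather than causing a runtime error.

```typescript
interface Cache {
  put(key: string, value: string, ttlSeconds: number): void;
}

const CACHE_LIMIT_BYTES = 100 * 1024; // CacheService per-value limit

function cacheIfSmall(cache: Cache, key: string, value: string): boolean {
  // Approximate the stored size with a UTF-8 byte count for this sketch.
  const bytes = new TextEncoder().encode(value).length;
  if (bytes >= CACHE_LIMIT_BYTES) {
    return false; // skip caching oversized values (e.g. PGP public keys)
  }
  cache.put(key, value, 600); // cache for 10 minutes
  return true;
}
```
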
The other 5 suggestions were deferred as they would impact existing Lambda functions.
The changes improved the feel somewhat, but the fundamental issue remained. The 2-3 second overhead of google.script.run is inherent to the GAS architecture and cannot be optimized away.
As a business tool, it was "usable," but its perceived speed was inferior to the existing no-code tool in some scenarios. Our department's engineering members submit many applications daily and handle numerous tasks, so this much time per application would directly impact their work performance. There was no compelling reason to migrate from the existing tools.
Deferring Production Adoption, and Looking Ahead
On March 18, I decided to defer production adoption.
The reason was simple: the performance characteristics didn't match the requirements of this business tool. Functionally, we had successfully integrated the 3 no-code tool apps and GitHub PR application flow into a single system, but the response time fell short of an acceptable user experience.
In the future, we plan to rebuild it with an AWS serverless configuration. We gained three assets from v3:
First, the design documents that organized the business flow. Status transitions for application-approval-permission granting and rollback considerations remain the same regardless of platform.
Second, the 2,755 test cases that defined the specifications. The tests clearly describe "what must be satisfied to be correct," providing clear goals even in a new environment. The logs of bug fixes and oversight responses discovered during operational testing will help avoid repeating the same mistakes.
Third, tips for working with Claude Code — the sense of what to delegate and where human judgment is needed.
In 49 days, I experienced "build, discard, rebuild," and I feel that this cycle speed is the greatest weapon when developing with AI.
So I plan to continue building, discarding, and rebuilding.
Looking Back in Numbers
| Item | Value |
|---|---|
| Total Period | 49 days (1/29 - 3/18) |
| Time to v3 Implementation Completion | 10 days (2/17 - 2/26) |
| Total Commits | 537 (v3) |
| Completed Tasks | 111 (4 phases) |
| Source Files | 191 files |
| Test Files | 140 files |
| Test Cases | 2,755 (final) |
Beyond the planned 111 tasks, there was additional work: bug fixes found during testing and PR review responses. Claude Code wrote almost all of these files; I never wrote code directly myself.
My role was to determine direction, make design decisions, approve or reject Claude's proposals and execution confirmations, test functionality, and ultimately decide "we can't release this to production."
What I Learned from Fully Delegating to Claude Code
What Worked Well
Excellent compatibility with TDD. Once shown the rule of "write tests first, implement to make them pass, then clean up," Claude Code faithfully repeated this cycle. This rhythm never broke across all 111 tasks. With tests always present, any regressions were immediately detected during later modifications.
Strong research capabilities. Investigating the existing apps and documenting them would have taken a human a full day, but Claude Code completed it in minutes using Playwright MCP (an MCP server that lets Claude Code operate a browser to read screen structure and data).
Combining AIs increases accuracy. Using Deep Research for technical investigation, Gemini for finding requirement gaps, and Claude Code for implementation. By leveraging each one's strengths, we prevented oversights that might have been missed by a single person.
What Didn't Work Well
Production environment performance is unpredictable. Everything runs fast in local test environments. The delays only became apparent after deploying to GAS, which Claude Code couldn't see. I should have more deeply investigated GAS performance characteristics during technology selection. This was my judgment error, not Claude Code's problem.
Delayed "seeing something working." By completing all 111 tasks before deployment, we discovered the performance issues late. If we had deployed Lambda after completing Phase 1 to get a feel for it, we could have pivoted earlier. On the other hand, I don't think the decision to deploy Lambda as late as possible to avoid affecting the existing production environment was wrong either.
Reflections
Discuss plan selection early. When delegating large development projects to Claude Code, token consumption can become a bottleneck.
In v2, though progress seemed steady, work repeatedly stalled mid-week. After switching to MAX200, this problem was largely resolved, but if I had consulted with my manager earlier, we might have made more progress with v2.
Discarding isn't bad. Throughout this and other projects, I've repeatedly built, discarded, and rebuilt, but it's not about "discarding everything we built"—it's about "building but deciding not to use as a tool." Even if we don't use it as a tool, the assets and experience gained in the process will definitely be useful next time.
Summary
- I fully delegated internal business tool development to Claude Code for 49 days
- I practiced AI specialization by using Deep Research and Gemini for research and Claude Code for implementation
- After an architecture shift from Fargate to GAS, v3 completed 111 tasks and 2,755 tests in 10 days
- Production adoption was deferred due to performance reasons
- Starting a new project to rebuild with AWS serverless using the code assets
While this experience report describes a failure, I hope that some part of what I practiced and faced can help solve challenges for those looking to improve business tools or reduce tool costs.
If I sense that the MAX100 plan might be insufficient for serverless development too, I plan to consult about upgrading to MAX200 again at an early stage.