
Discarded code was a treasure trove - An account of learning from failure and completing a PoC in 4 business days
This page has been translated by machine translation.
Hello. I'm Ikeda from the Retail Distribution Solutions Department's SRE team.
In my previous article, I wrote about the 49 days I spent entrusting Claude Code with the development of an internal tool. The result was "no production adoption": we had to halt the project, 2,755 tests and 191 files of code in, because we couldn't solve the structural latency issues of the platform we had adopted.
A few days after that, I pulled out the code and rebuilt a PoC with an AWS native configuration. In 4 business days, we reached the point where the entire flow worked - from application to approval, automatic IAM permission assignment, and automatic revocation after use.
This article is about those 4 days of experience. It's also my current answer to the question, "Were those 49 days wasted?"
What We're Building
It's a system that manages access permissions to internal AWS accounts. When a team member applies for "ReadOnly access to this account" and it's approved, IAM policies are automatically assigned. When the due date arrives, or when they press "usage completed," the permissions are automatically revoked.
If you're interested in the details, please check my previous article where I discussed it.
The existing system is operating stably as a kintone app. Last time, we tried to implement it as a GAS SPA, but we abandoned production adoption due to performance barriers. This time, we're attempting to implement it using only AWS services.
Day 1 (About 4 hours)
Choosing an orphan branch
When starting the work, I decided to begin with an "orphan branch" to clear the git history.
The code and tests from the previous effort remained intact. The DynamoDB table design was also usable. However, the git history contained GAS-specific design decisions and remnants of the abandoned Fargate configuration. I wanted to avoid starting with too much noise for the new configuration, but still wanted to utilize the code. When I consulted AI about the best approach, it suggested using an orphan branch.
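The mechanics of an orphan branch are simple. A sketch in a throwaway repository (the paths, branch name, and commit messages are illustrative, not the project's):

```shell
# Demonstrate an orphan branch in a throwaway repo (names are illustrative).
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "old: GAS-era design decision"
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "old: abandoned Fargate config"

# --orphan starts a branch whose first commit has no parents,
# so the old history no longer appears in `git log`.
git checkout -q --orphan poc-rebuild
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "PoC: reuse code, start history clean"

git rev-list --count HEAD   # → 1
```

The working tree (and therefore the reusable code) carries over to the orphan branch; only the commit history is left behind.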
Starting Anew
I created a dedicated working directory and copied over the parts of the previous effort's code (191 files, 320 tests) that weren't tied to the GAS SPA: the frontend, 5 backend packages, shared type definitions, and so on.
I was planning to manage everything with Terraform, as in the production environment, but preparing an equivalent setup just for this verification seemed excessive. Deciding to treat this purely as a PoC, I had new scripts created that deploy all resources via shell.
Day 2 (Almost 6 hours)
Preparing the Verification Environment
Since this effort would again be just me and Claude, I decided to use a dedicated verification AWS account isolated from existing accounts.
The manual tasks, such as creating IAM users, IAM roles, and AWS access keys, and registering those credentials in 1Password, took some time. Since these tasks only come up when a new account is created and I can't rely on my memory, I had the AI prepare a work procedure document.
Later, while reviewing the deployment script the AI had prepared on Day 1, I noticed it assumed credentials would be written to a .env file. That's a common approach, but since I planned to use the 1Password CLI, I had it modified to retrieve credentials at runtime with the op read command.
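A minimal sketch of that change, assuming the 1Password CLI (`op`) is installed and signed in; the vault and item paths here are illustrative, not the real ones:

```shell
# Illustrative secret paths; the real vault and item names differ.
deploy_with_creds() {
  # `op read` resolves each secret at run time, so no credentials
  # are ever written to a .env file on disk.
  AWS_ACCESS_KEY_ID="$(op read 'op://Infra/poc-deployer/access-key-id')" \
  AWS_SECRET_ACCESS_KEY="$(op read 'op://Infra/poc-deployer/secret-access-key')" \
  aws sts get-caller-identity
}
```

The env-var prefix form scopes the credentials to the single command, so they also never leak into the shell's exported environment.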
Day 3 (About 7 hours)
According to the plan, we were supposed to complete deployment and begin verification on this day, but reality turned out differently.
Abandoning Google OIDC
Here, another oversight of mine came to light: the authentication method for web access to CloudFront. The design intended to use Google's OIDC authentication, but I didn't have permission to access the Google Cloud Console.
Since the purpose of the PoC was to verify "whether cross-account IAM automatic assignment and revocation works," we simply changed to a straightforward implementation using email and password.
Lambda Won't Start
The next problem was Lambda execution. Even after deploying the code, the function wouldn't start, failing with the error Dynamic require of "buffer" is not supported. When I relayed the error message to Claude, it responded: "You're bundling with ESM (ECMAScript Modules), but that doesn't work in the Lambda environment. Switching to CJS (CommonJS) will solve it, so I'll make the correction." I honestly didn't fully understand what was happening, but I let it make the change; I should properly understand this later.
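For reference, the bundler-side fix looks roughly like this, assuming esbuild is the bundler (the article doesn't name it) and with an illustrative entry point and output path. Emitting CommonJS avoids the ESM shim that produces the "Dynamic require" error on the Lambda Node.js runtime:

```shell
# Assumed esbuild invocation; entry point and output path are illustrative.
bundle_for_lambda() {
  # --format=cjs emits CommonJS output that the Lambda runtime loads
  # directly, instead of ESM output whose require() shim fails at startup.
  npx esbuild src/handler.ts \
    --bundle \
    --platform=node \
    --target=node20 \
    --format=cjs \
    --outfile=dist/handler.js
}
```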
Lots of Bugs
I decided to manually execute the deployment scripts, but encountered errors every time I ran them.
Each time an error occurred, Claude diagnosed the cause and fixed it, but this was the first time in the entire project that a script produced so many errors.
All of these bugs had been baked in from the start, in the scripts Claude Code and I designed on Day 1. On reflection, the deployment script hadn't been created with Claude Code's Tsumiki plugin, so it had gone through no TDD-style testing or verification. That oversight was entirely my mistake. Lesson learned...
Somehow, all phases of deployment were completed, and we committed with 373 tests passing.
Day 4 (Almost 5 hours)
The Main Verification
Day 4. I started with excitement to see if the cross-account IAM assignment and revocation processes would work as expected.
This time we used two AWS accounts. The system ran in the PoC account; we opened the application page served via CloudFront in a browser and submitted a request. Once the application was approved, a Lambda function was invoked to attach (and later detach) policies on an IAM role in a second verification account, the target of the role switch.
Final Bug Fixes with Automated Tests
First, I had automated browser tests run with playwright (or maybe it was chrome-devtools-mcp), but several bugs remained.
There was a DynamoDB query without ExpressionAttributeValues causing a ValidationException. This appeared to be remnants of code from the previous effort.
The Lambda role was missing dynamodb:Scan permissions, which I think was due to insufficient preliminary investigation.
The frontend was sending PUT while the backend was expecting POST. "How did this happen?"
Environment variables were unset during frontend building, preventing Cognito authentication. It was rebuilt after configuration, but I wonder why this occurred...
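The first of these bugs is the easiest to illustrate. A minimal sketch of what the corrected query looks like, using a hypothetical table name and key (not the project's real schema): any KeyConditionExpression that uses a placeholder like :pk must be accompanied by ExpressionAttributeValues.

```shell
# Hypothetical table and key names, for illustration only.
query_applications() {
  # The ":pk" placeholder in the key condition must be supplied via
  # --expression-attribute-values; omitting that map is what DynamoDB
  # rejects with a ValidationException.
  aws dynamodb query \
    --table-name AccessApplications \
    --key-condition-expression 'pk = :pk' \
    --expression-attribute-values '{":pk": {"S": "APPLICATION"}}'
}
```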
This time, every bug had an easily identifiable cause. The code built through the TDD process had a structure that made problems easy to isolate, and having tests made it immediately clear what was broken.
But I wish these had been discovered and fixed during the testing phase. It was my fault for not reading the test result reports...
After Automated Testing, Trying Manual Operation
From the frontend, I created an application for "SwitchRole (ReadOnly)." I entered the target account, period, reason, and submitted. Switching to the administrator view, I clicked "approve."
Checking the Lambda logs:
[INFO] Attached policy arn:aws:iam::aws:policy/ReadOnlyAccess to ikeda.test
I switched roles via CLI and executed aws s3 ls. A list of S3 buckets was displayed. I could access resources in an account I couldn't access before. This proved the permissions were granted.
Back to the frontend, I clicked "Usage Complete." Lambda revoked the IAM policy.
Executing aws s3 ls again:
An error occurred (AccessDenied): User is not authorized to perform s3:ListAllMyBuckets
As expected, confirmation that the permissions were revoked.
Finally, the entire flow of application, approval, permission assignment, usage completion, and permission revocation was working.
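Under the hood, the grant and revoke steps amount to attaching and detaching a managed policy on the role in the target account. A sketch in AWS CLI terms, with a placeholder role name and profile (the article's real names are not shown):

```shell
# Placeholder role name; the real verification-account role differs.
grant_readonly() {
  aws iam attach-role-policy \
    --role-name poc-switch-target \
    --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess
}

revoke_readonly() {
  aws iam detach-role-policy \
    --role-name poc-switch-target \
    --policy-arn arn:aws:iam::aws:policy/ReadOnlyAccess
}

# After granting, a session that switched into the role can list buckets:
#   aws s3 ls --profile switched-role
# After revoking, the same call fails with AccessDenied.
```

In the PoC these calls are made by Lambda via the SDK rather than the CLI, but the policy-attachment semantics are the same.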
Were Those 49 Days Wasted?
After completing the minimum requirements as a PoC in 4 business days, totaling about 22 hours, I can clearly state: the previous efforts were not wasted at all.
The frontend code could be reused almost as is. The DynamoDB table design and technology element research were already completed. The reason I hardly needed to think about "how to build it" this time was because of those previous 49 days. Having fewer moments of indecision allowed us to proceed without delays.
The title of my previous article was "Build, Scrap, Rebuild." In the end, this effort was also an extension of that cycle.
Not Yet Finished
To be honest, it's not "complete" after 4 business days.
What's working: DynamoDB, the API, authentication, the frontend, and cross-account IAM. The only access type verified was "SwitchRole ReadOnly."
The EventBridge batch and SES email notification functions are also planned for implementation, and code has been written but not verified. Also, of the 8 types of access originally planned, 7 still need implementation and verification. For migration to the production environment, integrating into the Terraform framework remains entirely untouched.
A PoC demonstrates "can be done," not "has been done," so from here on is the real work.
Conclusion
I discarded what we built in the previous effort and rebuilt it in 4 days. What I discarded was the architecture, but everything we gained from that experience was carried forward. Tests, design, memories of failures - all of these could be brought into these 4 days.
Even with a project whose adoption was postponed, the experience, code, and commit logs remain. Build, scrap, and rebuild. Having proven that we're definitely moving forward in this cycle, I'm looking forward to proceeding with the migration to the production environment.